# Sentiment Analysis and Collocation of Reviews

In this notebook we apply two techniques to the reviews for the Boston-area AirBnBs in our dataset: sentiment analysis and collocation.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", None)

reviews = pd.read_csv("../input/reviews.csv")
listings = pd.read_csv("../input/listings.csv")

In [None]:
listings.head()

In [None]:
reviews.head()

## Sentiment Analysis

First we note the highly skewed distribution of reviews on the Internet: positive reviews vastly outnumber negative ones. This holds just as true on AirBnB as everywhere else.

In [None]:
listings['review_scores_rating'].sort_values().reset_index(drop=True).dropna().plot()

There's an XKCD for this.

In [None]:
from IPython.display import Image

Image("https://imgs.xkcd.com/comics/star_ratings.png")

Ok, let's try out sentiment analysis.

Sentiment analysis is a technique in natural language processing which aims to retrieve the "sentiment" of a piece of text: positive, negative, or neutral. This is an easy way of summarizing the contents of a piece of text, and one that is easily understood.

Note, however, that sentiment analysis is a difficult problem. Humans are often said to agree on the sentiment of a sentence only about 80% of the time, and the best classifiers reach roughly that level of accuracy. Here we're just going to use VADER, the analyzer built into the `nltk` (Natural Language Toolkit) Python library.

So I don't expect our results to be astonishingly good, but let's see what we get...

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
for sentence in reviews['comments'].values[:5]:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

Our reviews contain both null reviews and reviews in other languages, and we need to filter both out before scoring. [langdetect](https://github.com/Mimino666/langdetect) makes language detection trivially easy, but it doesn't install on Kaggle for whatever reason. `nltk` can be used for this too, but it doesn't have a built-in for it. We'll use the following bit of code, borrowed from elsewhere, to filter out non-English reviews: it counts how many of each language's stopwords appear in the text and picks the language with the most matches.

In [None]:
# Snippet from:
# http://h6o6.com/2012/12/detecting-language-with-python-and-the-natural-language-toolkit-nltk/

from nltk.corpus import stopwords   # stopwords to detect language
from nltk import wordpunct_tokenize # function to split up our words

def get_language_likelihood(input_text):
    """Return a dictionary mapping each language to the number of its
    stopwords found in the input text (a rough proxy for likelihood).
    """
    # Tokenize once and reuse the set of unique lowercase words.
    input_words = set(wordpunct_tokenize(input_text.lower()))

    language_likelihood = {}
    for language in stopwords.fileids():
        language_likelihood[language] = len(input_words &
                set(stopwords.words(language)))

    return language_likelihood

def get_language(input_text):
    """Return the most likely language of the given text
    """
    likelihoods = get_language_likelihood(input_text)
    # On ties the first language listed wins, as before.
    return max(likelihoods, key=likelihoods.get)
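To make the idea concrete, here is a minimal, self-contained sketch of the same stopword-overlap heuristic. The tiny stopword sets are hand-picked for illustration (an assumption, not the `nltk` lists), and `guess_language` is a hypothetical helper name.

```python
import re

# Hand-picked, deliberately tiny stopword sets (illustrative only; the
# notebook itself uses the much larger lists shipped with nltk).
TOY_STOPWORDS = {
    "english": {"the", "and", "was", "is", "a", "to", "in", "we"},
    "french": {"le", "la", "et", "est", "un", "une", "nous", "très"},
    "spanish": {"el", "la", "y", "es", "un", "una", "muy", "en"},
}

def guess_language(text):
    """Return the language whose stopword set overlaps the text the most."""
    words = set(re.findall(r"\w+", text.lower()))
    overlap = {lang: len(words & sw) for lang, sw in TOY_STOPWORDS.items()}
    return max(overlap, key=overlap.get)

print(guess_language("The room was clean and we loved the location"))
print(guess_language("L'appartement est très propre et nous avons adoré"))
```

Very short reviews can tie at zero matches, in which case the first language in the dictionary wins. That is one reason to treat this heuristic as rough.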

In [None]:
reviews_f = [r for r in reviews['comments'] if pd.notnull(r) and get_language(r) == 'english']

Generate our scores.

In [None]:
pscores = [sid.polarity_scores(comment) for comment in reviews_f]

How do we score on...

**Neutrality**

In [None]:
pd.Series([score['neu'] for score in pscores]).plot(kind='hist')

**Positivity**

In [None]:
pd.Series([score['pos'] for score in pscores]).plot(kind='hist')

**Negativity**

Almost none of the texts are classified as having significant amounts of negativity! In fact, a large share of them are given exactly 0.0 negativity.

In [None]:
pd.Series([score['neg'] for score in pscores]).plot(kind='hist', bins=100)

These charts tell us about the characteristics of the off-the-shelf sentiment classifier that we are using and its performance on our dataset. The compound score (not shown in the charts above) is supposed to be the best estimate of overall sentiment, but the fact that negativity scores are so low hints that the classifier isn't doing a great job here.
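One common way to read the compound score is to bucket it into coarse labels. The sketch below uses the ±0.05 cut-offs suggested in the VADER project's documentation. Treat the threshold and the helper name `label_from_compound` as assumptions to tune, not as part of `nltk`.

```python
def label_from_compound(compound, threshold=0.05):
    """Bucket a VADER-style compound score in [-1, 1] into a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-made scores, not output from the notebook's data.
for score in (0.87, 0.0, -0.42):
    print(score, label_from_compound(score))
```

Applied to the `pscores` list above, this would look like `[label_from_compound(s['compound']) for s in pscores]`.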

In [None]:
scored_reviews = pd.DataFrame()
scored_reviews['review'] = reviews_f  # already filtered to non-null English reviews above
scored_reviews['compound'] = [score['compound'] for score in pscores]
scored_reviews['negativity'] = [score['neg'] for score in pscores]
scored_reviews['neutrality'] = [score['neu'] for score in pscores]
scored_reviews['positivity'] = [score['pos'] for score in pscores]

In [None]:
scored_reviews.head()

Let's look at the reviews with nonzero negativity. A lot of these aren't negative at all.

In [None]:
scored_reviews.query('negativity > 0')

Here are two that are:

In [None]:
scored_reviews.iloc[23]['review']

In [None]:
scored_reviews.iloc[28]['review']

Some more fiddling with queries...

In [None]:
scored_reviews.query('negativity > positivity').query('negativity > 0.1')

In [None]:
scored_reviews.query('negativity > positivity').query('compound < -0.2')

Here's an example of the kind of (funny, sarcastic) review that seriously trips our classifier up:

In [None]:
scored_reviews.iloc[1181]['review']

Here's another one. In this text's case, even though we would say the sentiment with regard to the *lister* is positive, the sentiment of the overall paragraph is *negative* because of the renter's unfortunate experience with food poisoning, being "horrendously sick", etc.

This is a limitation inherent in scoring a whole text at once. One way to get around it is to use a technique called chunking to extract which sentiment is attached to which thing in the text, but that gets complicated very quickly.

In [None]:
scored_reviews.iloc[63836]['review']
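Full chunking is beyond this notebook, but a cheaper first step is to score each sentence separately, so that it is at least visible which sentences drive the negativity. The sketch below shows that simpler, swapped-in technique, not chunking. `score_sentences` is a hypothetical helper, the regex sentence splitter is deliberately naive, and `toy_scorer` is a made-up stand-in for a real scorer.

```python
import re

def score_sentences(text, scorer):
    """Split text into rough sentences and pair each one with its score."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, scorer(s)) for s in sentences]

def toy_scorer(sentence):
    """Stand-in scorer (an assumption): counts a few hand-picked words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positive = {"great", "kind", "helpful", "lovely"}
    negative = {"sick", "poisoning", "awful"}
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# Made-up review text, not taken from the dataset.
review = "Our host was kind and helpful. Sadly I got food poisoning and was sick all weekend."
for sentence, score in score_sentences(review, toy_scorer):
    print(score, sentence)
```

With the notebook's analyzer you would pass `lambda s: sid.polarity_scores(s)['compound']` as the scorer instead of `toy_scorer`.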

Here are two more bad reviews because why not:

In [None]:
scored_reviews.iloc[62984]['review']

In [None]:
scored_reviews.iloc[198]['review']

We'll actually stop here. It's pretty clear that our sentiment analyzer is not doing a good enough job separating the wheat from the chaff to use our results for anything! That's unfortunate, but understandable.

There are a number of pre-processing techniques that we could apply to our dataset to make our sentiment analyzer work better (Google it!). We could also try a different sentiment analyzer (like the IBM or HP ones, available via API), particularly one perhaps better suited for the "Internet reviews" domain, and see if that would get better results.

## Collocation

According to Wikipedia "a collocation is a sequence of words or terms that co-occur more often than would be expected by chance." What we want to attempt now is to use `nltk` to find collocations that are especially important in the text, and display them as "summaries" of our texts.

How do we tell when a particular combination of words is important? One way of doing it is to look at those words' [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information). This is a metric which attaches significance to words which appear next to one another in the text far more often than chance would predict, and which are otherwise rarely used in the language. According to this metric, for example, the words "puerto" and "rico" have a very high PMI, while the words "to" and "in" have a very low one.
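As a rough illustration, the PMI of a bigram (x, y) is `log2(p(x, y) / (p(x) * p(y)))`. The sketch below estimates those probabilities from raw counts on a made-up token list. `bigram_pmi` is a hypothetical helper, and `nltk`'s own implementation may differ in details such as normalization.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: PMI} with probabilities estimated from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi

# Made-up text: "puerto" and "rico" only ever appear together,
# while "to" and "in" appear in several different contexts.
tokens = ("we flew to puerto rico to stay in town and to eat in "
          "puerto rico in style").split()
scores = bigram_pmi(tokens)
print(scores[("puerto", "rico")] > scores[("in", "puerto")])
```

Note that a bigram seen only once between two otherwise unseen words, such as `("we", "flew")` here, scores even higher than `("puerto", "rico")`. PMI favors rare pairs, which is why a minimum-frequency filter is usually applied before ranking.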

If you use Yelp! a lot you are probably familiar with Yelp's so-called [review highlights](https://www.yelp-support.com/article/What-are-Review-Highlights?l=en_US). These kick in after a location has had a reasonably large number of reviews written, and show, by default, snippets of three reviews mentioning a combination of words which appears especially often in reviews for the location. [Here's an example](https://www.yelp.com/biz/chelsea-market-new-york) of these highlights in action.

An answer on StackOverflow says that these highlights are probably implemented using [precisely the techniques spoken about above](http://stackoverflow.com/questions/2452982/how-to-extract-common-significant-phrases-from-a-series-of-text-entries). What we're going to now try and do is replicate Yelp! review highlights with AirBnB review highlights!

In [None]:
reviews_df = reviews[reviews.apply(lambda srs: pd.notnull(srs['comments']) and (get_language(srs['comments']) == 'english'), axis='columns')]

Let's try and find interesting word combinations for an example listing, just to see if it's possible. In this case we're picking an ID with 200 reviews to it, a substantial number which should hopefully let us mine good subject commonalities between them.

Note that in this case our "combinations of words" means bigrams: pairs of words which appear right next to each other in the text. This can be extended to n-grams of arbitrary size, if you're so inclined, and Yelp! uses n-gram sizes between 1 and 3, but for simplicity's sake we're going to stick to bigrams (2-grams) here.
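For readers who haven't met n-grams before, here is a small sketch of how they can be generated from a token list. `ngrams` below is a hand-rolled helper written for illustration, and the tokens are made up.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    # Zip the token list against copies of itself shifted by 1, 2, ..., n-1.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["great", "location", "near", "the", "subway"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

`nltk` ships a similar utility as `nltk.util.ngrams`, so in practice there is no need to write this by hand.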

In [None]:
example_listing_reviews = reviews_df.query('listing_id == 1178162')

In [None]:
len(example_listing_reviews)

In [None]:
from nltk import word_tokenize

In [None]:
# Flatten the per-review token lists into one list; wrapping ragged lists in np.array fails on recent NumPy.
words = [w for r in example_listing_reviews['comments'].values for w in word_tokenize(r)]

In [None]:
words

In [None]:
from nltk.collocations import BigramAssocMeasures, TrigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

finder.apply_freq_filter(3) 
finder.nbest(bigram_measures.pmi, 10)  

Ok great. How many reviews do we have to work with per location?

In [None]:
reviews_df.groupby('listing_id')['comments'].count().plot(kind='hist', bins=20)

To process the words we're going to use a `BigramCollocationFinder`, which expects all of the text from our reviews, tokenized into individual words, as input. To do that we're going to use the `nltk` `word_tokenize` function on the reviews, then run a couple of maps on the result to tweak a couple of things: remove punctuation marks and recombine contractions that `word_tokenize` splits up (`word_tokenize` will render `didn't` as `["did", "n't"]`, for example, which we don't want).

In [None]:
# One flat token list per listing; a plain comprehension avoids building a ragged NumPy array.
review_words = reviews_df.groupby('listing_id').apply(
    lambda df: [w for r in df['comments'].values for w in word_tokenize(r)]
)

In [None]:
import string

ex = ['Hi', 'there', '.', '?', '!', ',']
[w for w in ex if w not in string.punctuation]

In [None]:
review_words_f = review_words.map(lambda arr: np.array([w for w in arr if w not in string.punctuation]))

In [None]:
review_words_f.head()

In [None]:
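def reattach_contractions(wordlist):
    """Glue contraction pieces such as "n't" or "'s" back onto the previous word."""
    words = []
    for word in wordlist:
        # Only reattach when there is a previous word to attach to.
        if words and (word[0] == "'" or word == "n't"):
            words[-1] = words[-1] + word
        else:
            words.append(word)
    return words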

In [None]:
review_words_f = review_words_f.map(reattach_contractions)

Ok great! Let's see how many words we're working with for each of our listings.

In [None]:
review_words_f.map(len).plot(kind='hist', bins=20)

If we were to use this result in production, we would need to find some cut-off for the minimum amount of text a listing needs. Yelp! seems to put that cutoff at 20 or so reviews; below that there's not enough information for highlighting to work.
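A cut-off like that could be sketched as follows. The helper name `listings_with_enough_reviews` is hypothetical, the default of 20 simply mirrors the rough Yelp! figure above, and the ids are made up.

```python
from collections import Counter

def listings_with_enough_reviews(listing_ids, min_reviews=20):
    """Return the set of listing ids that occur at least min_reviews times."""
    counts = Counter(listing_ids)
    return {listing_id for listing_id, n in counts.items() if n >= min_reviews}

# Made-up ids: listing 1 has 25 reviews, listing 2 has only 3.
toy_ids = [1] * 25 + [2] * 3
print(listings_with_enough_reviews(toy_ids))
```

In this notebook the input would be something like `reviews_df['listing_id']`, with one entry per review.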

Not knowing any more about how Yelp! does things, we're just going to apply our collocation finder to all of the review texts. First we're going to retrieve a list of bigrams that appear in the review text at least three times. Then we'll pick the three "best" bigrams, where "best" means the largest PMI.

That's what the function below does.

In [None]:
# from nltk.collocations import BigramAssocMeasures, TrigramAssocMeasures, BigramCollocationFinder

def bigramify(words):
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(3) 
    return finder.nbest(bigram_measures.pmi, 3)

review_bigrams = review_words_f.map(bigramify)

Let's see what our results look like!

In [None]:
review_bigrams.head(20)

Not bad! It could definitely use improvement, but we're already seeing some interesting topics recur here.

Let's generate "Yelp! style" top-level highlights and print them to see what we get.

In [None]:
def sample_reviews(listing_id):
    """Return one highlighted snippet per top bigram for the given listing."""
    bigrams = review_bigrams[listing_id]
    # Use the filtered frame so null and non-English comments are skipped.
    review_texts = reviews_df[reviews_df['listing_id'] == listing_id]['comments'].values
    snippets = []
    for bigram in bigrams:
        phrase = " ".join(bigram)
        matches = [txt for txt in review_texts if phrase in txt]
        if not matches:
            # Punctuation stripping can yield bigrams that never appear verbatim.
            continue
        sample_review = matches[0].replace(phrase, "****" + phrase + "****")
        start_index = sample_review.index("****")
        # Clamp at 0 so a match near the start doesn't produce a negative slice index.
        snippets.append("..." + sample_review[max(start_index - 47, 0): start_index + 47] + "...")
    return snippets

For reference I'll also provide listing URLs.

In [None]:
listings.query('id == 3353')['listing_url']

In [None]:
for review in sample_reviews(3353):
    print(review)

In [None]:
listings.query('id == 1497879')['listing_url']

In [None]:
for review in sample_reviews(1497879):
    print(review)

In [None]:
listings.query('id == 414419')['listing_url']

In [None]:
for review in sample_reviews(414419):
    print(review)

In [None]:
listings.query('id == 1136972')['listing_url']

In [None]:
for review in sample_reviews(1136972):
    print(review)

## Conclusion

The basic `nltk` sentiment analysis built-in did not do a good job analyzing the sentiments in our sample of AirBnB reviews. Without knowing more details about how the classifier was trained (there is a paper you can read FYI) I can't say for sure why that is, exactly, but it's nevertheless an interesting limitation to keep in mind, as most Internet review texts are going to be pretty similar to the AirBnB ones. Perhaps other analyzers would do a better job.

Collocation with `nltk`, on the other hand, worked brilliantly! It turns out to be something that's pretty easy to do but which generates reasonably good results with just a little bit of elbow grease. You can apply this technique to just about any reservoir of review texts out there, so keep it in mind because it's a useful tool to have under your belt!