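## We will use VADER to carry out sentiment analysis. VADER’s SentimentIntensityAnalyzer exposes polarity_scores(), which takes in a string and returns a dictionary of scores in each of four categories: negative, neutral, positive and compound. The first three are the proportions of the text that fall into each category; the compound score is the sum of the word valence scores, adjusted by VADER's rules and normalized to lie between -1 (most negative) and +1 (most positive).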

In [1]:
import pandas as pd
import nltk
from tqdm import tqdm
#nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [2]:
data = pd.read_csv("hotels_final.csv")

### Finding the polarity score

In [5]:
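allScores = []
for i in tqdm(range(len(data))):
    # [:-1] drops the last character of the review text
    review = data['individualReview'][i][:-1]
    # keep only the compound score, which lies between -1 and 1
    polarityScore = sid.polarity_scores(review)['compound']
    temp = {"polarityScore": polarityScore}
    allScores.append(temp)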

100%|████████████████████████████████████████████████████████████████████████████| 6000/6000 [00:02<00:00, 2250.47it/s]
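
The same scoring step can also be written without an explicit index loop. The following is only a sketch: `add_polarity_scores` and `fake_scorer` are hypothetical names introduced here, and the stand-in scorer exists purely so that the example runs without the VADER lexicon. In the notebook, `sid.polarity_scores` would be passed instead.

```python
import pandas as pd


def add_polarity_scores(df, scorer, text_col="individualReview"):
    """Return a copy of df with a 'polarityScore' column.

    scorer is any callable that maps a string to a dict containing a
    'compound' key, e.g. sid.polarity_scores from the cell above.
    """
    out = df.copy()
    # .str[:-1] mirrors the [:-1] slice used in the loop above
    reviews = out[text_col].str[:-1]
    out["polarityScore"] = reviews.map(lambda text: scorer(text)["compound"])
    return out


# Hypothetical stand-in scorer so the sketch runs without the VADER lexicon
def fake_scorer(text):
    return {"compound": 1.0 if "good" in text else 0.0}


demo = pd.DataFrame({"individualReview": ["good food\n", "room\n"]})
scored = add_polarity_scores(demo, fake_scorer)
```

Because the function works on a copy, the original frame is left untouched and the new column can be inspected before anything is written to disk.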


In [6]:
df = pd.DataFrame(allScores)
final_df = pd.concat([data, df], axis=1)
final_df.to_csv('hotels_with_polarity_score.csv', index=False, encoding='utf-8')

### Finding the aggregate (mean) of the polarity scores

In [13]:
data = pd.read_csv("hotels_with_polarity_score.csv")

In [14]:
# First we need to rescale the polarity score from the range [-1, 1] to the range [0, 10] (like the ratings on booking.com)
allScores = []
for i in range(len(data)):
    score = data['polarityScore'][i]
    score = ((score * 10) + 10) / 2
    temp = {"normalizedPolarityScore": score}
    allScores.append(temp)
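
To make the rescaling concrete, here is a small worked check of the formula used in the loop. The helper name `rescale_compound` is our own and is not part of the notebook.

```python
def rescale_compound(score):
    """Map a compound score in [-1, 1] linearly onto [0, 10]."""
    return ((score * 10) + 10) / 2  # equivalent to 5 * (score + 1)


# Endpoints and midpoint of the compound range, plus one value in between
rescaled = {s: rescale_compound(s) for s in (-1.0, 0.0, 0.5, 1.0)}
print(rescaled)
```

The most negative compound score maps to 0, a neutral score to 5, and the most positive score to 10. On a pandas column the same mapping could be applied in one step as `data['polarityScore'] * 5 + 5`.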

In [15]:
df = pd.DataFrame(allScores)
final_df = pd.concat([data, df], axis=1)

In [31]:
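meanScores = []
# Assumes 60 hotels with 100 consecutive reviews each (6000 rows in total)
for h in range(60):
    total = 0  # named total to avoid shadowing the built-in sum()
    for i in range(100):
        total += final_df['normalizedPolarityScore'][i + 100*h]
    mean = total/100
    temp = {"meanPolarityScore": mean}
    # repeat the hotel's mean once per review so the rows line up with final_df
    for j in range(100):
        meanScores.append(temp)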
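
The loop above depends on every hotel having exactly 100 consecutive rows. A sketch of an alternative that does not rely on row positions is shown below. It assumes that `hotelName` identifies a hotel uniquely (if the same name appears in several cities, grouping by `['city', 'hotelName']` would be safer), and the four demo rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "hotelName": ["A", "A", "B", "B"],
    "normalizedPolarityScore": [8.0, 6.0, 3.0, 5.0],
})

# transform("mean") returns one value per row, aligned with the original index
demo["meanPolarityScore"] = (
    demo.groupby("hotelName")["normalizedPolarityScore"].transform("mean")
)
```

Each row receives the mean of its own hotel, so no separate concatenation step is needed.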

In [32]:
df_mean = pd.DataFrame(meanScores)
final_df_2 = pd.concat([final_df, df_mean], axis=1)

In [51]:
# Re-arranging the columns to make data easier to understand
columns_titles = ['city', 'hotelName', 'overallRating', 'individualReview', 'individualRating', 'polarityScore', 'normalizedPolarityScore', 'meanPolarityScore']
final_data=final_df_2.reindex(columns=columns_titles)

In [59]:
final_data.to_csv('hotels_with_polarity_score.csv', index=False, encoding='utf-8')

The VADER model that we have used for sentiment analysis has two shortcomings:
1. Since VADER relies on a fixed, general-purpose lexicon and rules rather than being trained on our data, it does not fit our data perfectly. If we had a labelled training set drawn from a distribution similar to the reviews set, we could have trained a model that gives better results.
2. The model assigns a score of 0 (here 0 means neutral, as it lies exactly between -1 and 1) to out-of-vocabulary words. Since the data is not perfectly clean, several words are misspelt, and customers often use abbreviations (for example, gr8 for great) to express themselves. This leads to some problems with the scoring.
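
The second shortcoming can be illustrated with a toy lexicon-based scorer. This is not VADER itself: the lexicon values are made up, and negation, intensifiers and punctuation handling are left out. Only the final squashing step, x / sqrt(x² + alpha) with alpha = 15, has the same form as the normalization VADER applies to obtain the compound score.

```python
import math

# Made-up valence values for illustration; these are NOT VADER's lexicon entries
TOY_LEXICON = {"fantastic": 3.0, "good": 2.0, "dirty": -2.0}


def toy_compound(text, alpha=15):
    """Sum per-word valences, then squash the sum into (-1, 1)."""
    words = text.lower().split()
    # Words missing from the lexicon contribute 0, i.e. they count as neutral
    total = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


clean = toy_compound("fantastic service")
misspelt = toy_compound("fantastc service")  # typo is out of vocabulary
```

Here `clean` is clearly positive, while `misspelt` is exactly 0 because the only sentiment-bearing word is no longer found in the lexicon.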

There are two more drawbacks inherent in our scraped data itself that lead to wrong scores being assigned to the reviews.
1. On booking.com some reviewers list the positives and negatives separately. For example, a review may be: Positives: Food, view, swimming pool. Negatives: Staff. When we scrape this data we get one single string: "Food, view, swimming pool. Staff.". Without its labels this review no longer makes much sense, even though it did make sense on booking.com.
2. On several occasions the review text does not align with the actual rating provided by the user. For example, the review "good service" is given a rating of 10, while the review "fantastic service" is given a rating of 8. These small subjective differences from review to review lead to deviations between the polarity score assigned by our algorithm and the actual rating provided by the user.
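
One way to put a number on these deviations is the mean absolute gap between the user's rating and the rescaled polarity score. The sketch below assumes that `individualRating` is numeric and on the same 0 to 10 scale as `normalizedPolarityScore`. The three rows are made-up values, not results from our data.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "individualRating": [10.0, 8.0, 4.0],
    "normalizedPolarityScore": [7.2, 7.8, 5.0],
})

# Absolute gap between the user's rating and the rescaled polarity score
demo["absDeviation"] = (
    demo["individualRating"] - demo["normalizedPolarityScore"]
).abs()
mean_abs_deviation = demo["absDeviation"].mean()
```

Sorting by `absDeviation` in descending order would surface the reviews where the text and the rating disagree most.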