# Tutorial: Performing Sentiment Analysis

References:
- https://en.wikipedia.org/wiki/Sentiment_analysis
- https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17
- https://www.kaggle.com/rahulin05/sentiment-labelled-sentences-data-set
- http://textblob.readthedocs.io/en/dev/
    


## Introduction to Sentiment Analysis

Sentiment analysis, sometimes also referred to as opinion mining, is the process of computationally analyzing text to identify and classify the opinions (emotions) it expresses. The opinion expressed by the author of the text can then be assigned to one of several classes.

Depending on the task, sentiments can be grouped into classes such as:
- like, love, dislike, hate, desire, etc.
- positive, negative, or neutral.

NOTE: In this tutorial we categorize sentiments as either positive or negative.

Examples:
- Text: "I like sushi." => Sentiment: Positive
- Text: "Very bad service." => Sentiment: Negative


## Applications of Sentiment Analysis

- Movie reviews: determining whether a review is positive or negative
- Audience analysis: predicting voter sentiment during elections, or analyzing the sentiment of Twitter users to help predict stock prices of Fortune 500 companies
- Products: informing product positioning by analyzing product reviews on Amazon and similar sites
- Yelp (restaurant reviews): reading restaurant reviews to gauge people's sentiment toward a restaurant and its service

## Approach

In this tutorial we will learn how to perform sentiment analysis using two techniques. We start by loading a labelled Yelp dataset (with a sentiment given for each text) from the URL *mentioned in Section 2*.

- First, we use the TextBlob library (*explained below in Section 5.1*) and calculate its accuracy and other performance metrics.
- Next, we split the dataset into training and test sets and use a machine learning algorithm (Random Forest, *explained below in Section 5.2*) to learn from the training set and predict the sentiment class for the test set. We calculate the performance metrics for this step as well.

We conclude by comparing the results and discussing how the analysis can be improved.


## 1. Imports

Import necessary libraries.

In [34]:
# Standard library
import re
import string
from collections import Counter

# Third-party libraries
import nltk
import pandas as pd
import sklearn
import sklearn.feature_extraction.text  # makes sklearn.feature_extraction.text.TfidfVectorizer available below
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
# scikit-learn's random forest classifier
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

## 2. Data Extraction

The data can be downloaded from this URL: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#
The zip file contains three text files: Amazon product reviews, IMDb movie reviews, and Yelp reviews. Each file contains 1000 rows, 500 with positive and 500 with negative sentiment. This tutorial uses only the Yelp dataset.

I used MS Excel to clean the data and save it as a CSV file, which can be downloaded from this URL: https://drive.google.com/open?id=1fsmORnvsHB9ydfVCed_bm75-9EZGJR7m

After downloading yelp.csv, save it in the same directory as this Jupyter notebook.
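
If you would rather skip the manual Excel step, the raw text file can probably be parsed directly. The sketch below assumes that each line of the raw Yelp file has the form sentence, tab, label (0 or 1). Both that layout and the file name `yelp_labelled.txt` mentioned in the comment are assumptions about the archive, so check them against your download. The helper name `parse_labelled_lines` is made up for this illustration.

```python
def parse_labelled_lines(lines):
    """Split 'sentence<TAB>label' lines into (sentence, label) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split on the last tab so a tab inside a sentence does not break parsing
        sentence, label = line.rsplit("\t", 1)
        rows.append((sentence.strip(), int(label)))
    return rows

# Two sample lines in the assumed layout; for the real file one could pass
# open("yelp_labelled.txt", encoding="utf-8") instead (file name assumed)
sample = ["Wow... Loved this place.\t1\n", "Crust is not good.\t0\n"]
rows = parse_labelled_lines(sample)
print(rows)
```

The resulting pairs could then be passed to `pd.DataFrame(rows, columns=["Review", "Sentiment"])` to obtain the same two columns used in the rest of the tutorial.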

## 3. Data Loading

In [2]:
#reading yelp.csv and adding the headers 'Review' and 'Sentiment'
yelpdf = pd.read_csv("yelp.csv", names=["Review", "Sentiment"])
yelpdf.head()

Unnamed: 0,Review,Sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


## 4. Data Exploration

In [3]:
print('Total size of the dataset:',len(yelpdf))
print('Total positive reviews in the dataset:',yelpdf['Sentiment'].sum())
print('Total negative reviews in the dataset:',len(yelpdf)-yelpdf['Sentiment'].sum())

Total size of the dataset: 1000
Total positive reviews in the dataset: 500
Total negative reviews in the dataset: 500


We can see that out of 1000 records 500 are positive and 500 are negative.

## 5. Performing Sentiment Analysis

### 5.1 Using TextBlob library

TextBlob is a Python library for processing textual data. It provides a user-friendly API for several NLP (natural language processing) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. In this tutorial we use it to perform sentiment analysis of textual data.

If you don't have this library installed, install it with the command: pip install textblob


### 5.1.1 Data Cleaning & Processing

Text input may contain all kinds of data (numbers, web links, special characters). These characters can affect the performance of sentiment analysis, so we clean the text before analyzing it.

The function 'clean_text' below is used to clean the text input.

In [4]:
def clean_text(text):
    """ Utility function to clean a text: normalizes case, drops digits
    and replaces punctuation with spaces.
    Inputs:
        text: str: raw text
    Outputs:
        (str): cleaned text
    """
    lowerText = text.lower()
    # drop possessive "'s" and remaining apostrophes, split hyphenated words
    newText = lowerText.replace("'s", ' ')
    newText = newText.replace("'", '')
    newText = newText.replace("-", ' ')
    # remove digits (the raw string avoids an invalid escape sequence warning)
    newText = re.sub(r'\d', '', newText)
    # map every punctuation character to a space
    transtable = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    clean_words = newText.translate(transtable)
    return clean_words
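
The cleaning step above does not remove web links as such; replacing punctuation only breaks a link into fragments. One possible way to drop links before cleaning is sketched below. The helper name `strip_links` and the URL pattern are illustrative assumptions, not part of the notebook, and the pattern is deliberately simple.

```python
import re

# Deliberately simple pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_links(text):
    """Replace web links with a space and collapse repeated whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())

print(strip_links("Menu at http://example.com/menu was great"))
```

A call such as `clean_text(strip_links(text))` would then apply both steps in sequence.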

The function 'analyze_sentiment' below calls 'clean_text' (defined above) to clean the text, uses the TextBlob library to calculate the polarity of the input text, and returns 1 (for positive sentiment) or 0 (for negative sentiment).

[As our dataset only has positive and negative sentiments, we do not use a neutral category. TextBlob's polarity score ranges from -1 to 1, so it could also be used to distinguish positive, neutral, and negative texts.]

In [5]:
def analyze_sentiment(text):
    """ Function to classify the polarity of a text using textblob.
    Inputs:
        text: str: raw text
    Outputs:
        (int): predicted sentiment 1 for positive and 0 for negative
    """
    analysis = TextBlob(clean_text(text))
    if analysis.sentiment.polarity > 0:
        # positive polarity: return 1
        return 1
    else:
        # zero or negative polarity: return 0 (neutral is not handled separately)
        return 0
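
If neutral reviews had to be handled, the polarity score could be mapped to three classes instead of two. The following is a minimal sketch under assumptions: the function name `polarity_to_label` and the threshold of 0.1 are illustrative choices, not values recommended by TextBlob, and the input is a polarity between -1 and 1 like the one used above.

```python
def polarity_to_label(polarity, neutral_band=0.1):
    """Map a polarity in [-1, 1] to 'positive', 'neutral' or 'negative'."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, only to exercise the function
for score in (0.8, 0.0, -0.5):
    print(score, polarity_to_label(score))
```

The width of the neutral band is a tuning choice; a labelled sample with neutral examples would be needed to pick it sensibly.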

In [6]:
#adding a column 'pred' to the dataframe which contains the predicted sentiment as analyzed with the help of textblob library 
yelpdf['pred']=yelpdf['Review'].apply(analyze_sentiment)
yelpdf.head()

Unnamed: 0,Review,Sentiment,pred
0,Wow... Loved this place.,1,1
1,Crust is not good.,0,0
2,Not tasty and the texture was just nasty.,0,0
3,Stopped by during the late May bank holiday of...,1,1
4,The selection on the menu was great and so wer...,1,1


We can see above that a new column 'pred' has been added with the predicted sentiments for the input text ('Review').

### 5.1.2 Analyzing performance of this analysis

As we already have the labeled sentiments in the dataset we can calculate the performance of our analysis by comparing the predicted sentiments with the sentiment labels provided in the dataset.

In [7]:
#calculating accuracy of our predicted sentiments
x= (yelpdf['pred']==yelpdf['Sentiment'])
accuracy = sum(x)/len(yelpdf['pred'])
print('Accuracy of predicting sentiments of the dataset is:',accuracy*100,'%')

Accuracy of predicting sentiments of the dataset is: 77.2 %


Along with accuracy, we can measure other performance metrics of our predictions, depending on the context of the analysis:

- Confusion matrix: a tabular layout that describes the performance of a classifier by comparing expected and predicted instances. In the layout below, each row denotes the instances in a predicted class and each column the instances in an actual class. Note that scikit-learn's confusion_matrix, used in the next code cell, has the opposite orientation (rows are actual classes, columns are predicted classes) and orders the classes as 0, 1, so its output reads [[TN FP] [FN TP]].

In terms of our tutorial, the confusion matrix looks like this:


                            |Actual Class     |
                            |Positive|Negative|
    Predicted Class|Positive|TP      |FP      |
    Predicted Class|Negative|FN      |TN      |

               TP: True Positive - both actual and predicted sentiment are positive
               FP: False Positive - actually negative sentiment but predicted as positive
               FN: False Negative - actually positive sentiment but predicted as negative
               TN: True Negative - both actual and predicted sentiment are negative



- Precision: also called positive predictive value (PPV), the proportion of predicted positive cases that were actually positive, i.e. TP / (TP + FP)


- Recall: also known as sensitivity, the fraction of actual positive cases that were correctly predicted as positive, i.e. TP / (TP + FN)


- F1 score: an F-measure calculated from both precision and recall. It is the harmonic mean of precision and recall:
F1 score = 2 x (Precision) x (Recall) / (Precision + Recall)
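
As a rough illustration of the four definitions above, the counts can be tallied directly from two lists of labels. This is a minimal sketch: the function name `confusion_counts` and the sample labels are made up for illustration and are not part of the notebook.

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # actually positive, predicted positive
        elif a == 0 and p == 1:
            fp += 1   # actually negative, predicted positive
        elif a == 1 and p == 0:
            fn += 1   # actually positive, predicted negative
        else:
            tn += 1   # actually negative, predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels, only to exercise the function
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```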



In [8]:
print('Confusion Matrix')
print(confusion_matrix(yelpdf['Sentiment'], yelpdf['pred']))
print('Precision:',precision_score(yelpdf['Sentiment'], yelpdf['pred']))
print('Recall:',recall_score(yelpdf['Sentiment'], yelpdf['pred']))
print('F1 score:',f1_score(yelpdf['Sentiment'], yelpdf['pred']))

Confusion Matrix
[[379 121]
 [107 393]]
Precision: 0.764591439689
Recall: 0.786
F1 score: 0.775147928994


The confusion matrix can be seen above. Precision is 0.765, recall is 0.786, and the F1 score is 0.775.
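
The reported values can be checked by hand from the printed confusion matrix. The short sketch below assumes scikit-learn's layout for confusion_matrix, where rows are actual classes and columns are predicted classes, both ordered 0 then 1.

```python
# Confusion matrix printed above, in scikit-learn's layout:
# rows are actual classes (0, 1), columns are predicted classes (0, 1)
matrix = [[379, 121],
          [107, 393]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# prints: 0.765 0.786 0.775 0.772
```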

### 5.2 Using Random Forest for sentiment analysis

As our dataset already contains sentiment labels, our approach is to use these textual reviews to train a random forest classifier and then test the model on unseen data to predict a sentiment class for any given text.

- Random Forest: a widely used machine learning algorithm with applications in numerous fields. It is an ensemble method (ensembling combines the predictions of a group of "weak learners" into a "strong learner") and can be used for both regression and classification. The algorithm can handle a very large number of features, and it is helpful for estimating which variables are important in the underlying data being modeled.

A random forest fits a number of decision tree classifiers on various sub-samples of the dataset and combines their predictions (averaging) to improve predictive accuracy while controlling over-fitting. With scikit-learn's default settings, each sub-sample has the same size as the original input sample, but the samples are drawn with replacement.
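
The two ideas in this description, sampling with replacement and combining many trees, can be illustrated with the standard library alone. This is only a sketch: `bootstrap_sample` and `majority_vote` are hypothetical helper names, the votes are made up, and hard majority voting is shown as the simpler textbook variant (scikit-learn, as far as its documentation describes, averages the trees' predicted class probabilities instead).

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most common label among the votes of the individual trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training examples
sample = bootstrap_sample(rows, rng)

print(len(sample) == len(rows))   # the sample is as large as the original
print(majority_vote([1, 0, 1]))   # two of three hypothetical trees vote 1
```

Because the draws are with replacement, a bootstrap sample typically contains some rows more than once and omits others, which is what makes the individual trees differ.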


References:
- http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics
- https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
- https://en.wikipedia.org/wiki/Random_forest


### 5.2.1 Install necessary libraries

In [9]:
# The download commands below are commented out. If you don't have these NLTK
# resources yet, uncomment the relevant lines and run this cell.

#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('punkt')
#nltk.download()

# these are used in the data cleaning steps
lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
stopwords=nltk.corpus.stopwords.words('english')

### 5.2.2 Splitting the dataset in train and test

We split the yelp.csv dataset into an 80% training set and a 20% test set. The dataset is reloaded first because the data frame from the previous section was modified while predicting sentiments with the TextBlob library.

In [10]:
#reading the yelp.csv adding headers 'Review' and 'Sentiment'
newdf= pd.read_csv("yelp.csv",names = ["Review", "Sentiment"])

#Splitting the dataset in train and testing set 20% test and rest training
train, test = train_test_split(newdf, test_size=0.2,random_state=100)

#resetting the index of the train set, because the split leaves the rows with their original, shuffled index values
train = train.reset_index(drop=True)
train.head()

Unnamed: 0,Review,Sentiment
0,A fantastic neighborhood gem !!!,1
1,Best fish I've ever had in my life!,1
2,"Unfortunately, we must have hit the bakery on ...",0
3,If you love authentic Mexican food and want a ...,1
4,It's a great place and I highly recommend it.,1
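
Conceptually, the split shuffles the rows with a fixed seed and sets aside 20% of them for testing. The sketch below illustrates that idea with the standard library only. `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that train_test_split selects for random_state=100.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_rows=1000, test_size=0.2, seed=100)
print(len(train_idx), len(test_idx))  # 800 200
```

Fixing the seed makes the split repeatable, which keeps the reported metrics comparable between runs.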


### 5.2.3 Data Exploration

In [11]:
#resetting the index of the train set again (it has no further effect if the index was already reset after the split)
train = train.reset_index(drop=True)
print('Total size of the training dataset:',len(train))
print('Total positive reviews in the training dataset:',train['Sentiment'].sum())
print('Total negative reviews in the training dataset:',len(train)-train['Sentiment'].sum())

Total size of the training dataset: 800
Total positive reviews in the training dataset: 394
Total negative reviews in the training dataset: 406


We can see that out of 800 records in the training dataset 394 are positive and 406 are negative.

In [12]:
#resetting the index of the test set, because the split leaves the rows with their original, shuffled index values
test = test.reset_index(drop=True)
print('Total size of the test dataset:',len(test))
print('Total positive reviews in the test dataset:',test['Sentiment'].sum())
print('Total negative reviews in the test dataset:',len(test)-test['Sentiment'].sum())

Total size of the test dataset: 200
Total positive reviews in the test dataset: 106
Total negative reviews in the test dataset: 94


We can see that out of 200 records in the test dataset 106 are positive and 94 are negative.

### 5.2.4 Data Cleaning & Processing

Text input may contain all kinds of data (numbers, web links, special characters). These characters can affect the performance of sentiment analysis, so we clean the text before analyzing it.


Here we perform tokenization, stop word removal, and lemmatization, which help us identify meaningful patterns without being distracted by redundant patterns or losing the original ones.

Text usually contains many words that are rarely used. We may want to remove them, along with stop words, from our text dataset. Removing stop words is important because multiple instances of words like 'a', 'an', 'the' etc. do not help in sentiment analysis.

Tokenization is the process of splitting a sequence of text into its individual constituent pieces. For example, 'James,' should be split into 'James' and ','. Lemmatizing reduces the inflected forms of a word to their common dictionary form (lemma), so that words carrying the same meaning are treated as one.
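
Tokenization and stop word removal can be illustrated with a few lines of standard-library Python. This is only a sketch: the regular expression is a simplification of what NLTK's tokenizer does, the three-word stop list is a tiny stand-in for NLTK's English list, and the helper names are made up. Lemmatization is left out because it needs a dictionary such as WordNet.

```python
import re

# Tiny illustrative stop word list; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "the"}

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = simple_tokenize("James, the chef, made a great meal")
print(tokens)
print(remove_stop_words(tokens))
```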



The function 'text_tokenizer' below cleans the text input, lemmatizes it, and returns it as a list of tokens.

In [13]:
def text_tokenizer(text):
    """ Cleans the text, tokenizes it and lemmatizes each token
    Inputs:
        text: str: raw text
    Outputs:
        list(str): tokenized text
    """
    lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
    # normalize case and strip digits and punctuation
    clean_words = clean_text(text)
    # split the cleaned text into tokens
    tokens = nltk.word_tokenize(clean_words)
    # reduce each token to its lemma
    return [lemmatizer.lemmatize(x) for x in tokens]

In [14]:
#tokenize and lemmatize the train set
train['Review'] = train['Review'].apply(text_tokenizer)
processed_reviews=train
#display the head after processing the reviews
processed_reviews.head()

Unnamed: 0,Review,Sentiment
0,"[a, fantastic, neighborhood, gem]",1
1,"[best, fish, ive, ever, had, in, my, life]",1
2,"[unfortunately, we, must, have, hit, the, bake...",0
3,"[if, you, love, authentic, mexican, food, and,...",1
4,"[it, a, great, place, and, i, highly, recommen...",1


The function 'get_rare_words' helps us find the rare words (those that occur only once) in our text data.

In [15]:
def get_rare_words(processed_reviews):
    """ use the word count information across all texts in training data to find the rare words
    Inputs:
        processed_reviews: pd.DataFrame: data frame whose 'Review' column holds lemmatized lists of words
    Outputs:
        list(str): list of rare words (words that occur exactly once), sorted alphabetically.
    """
    # count every token across all reviews
    cnt = Counter()
    for tokens in processed_reviews['Review']:
        cnt.update(tokens)
    # keep the words that occur exactly once
    return sorted(k for k, v in cnt.items() if v == 1)

Machine learning algorithms cannot work on raw text (unstructured data) directly, so we need to convert our text data into a form suitable for our classifier (random forest). We use the function 'create_features' to transform the text data into a sparse matrix (a bag-of-words feature matrix weighted by TF-IDF).

In [16]:
def create_features(processed_reviews):
    """ creates the feature matrix using the processed reviews text
    Inputs:
        processed_reviews: pd.DataFrame: text read from yelp.csv file, containing the tokenized column 'Review'
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the fitted TfidfVectorizer object
                                                we need this to transform test reviews in the same way as train reviews
        scipy.sparse.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    rare_words = get_rare_words(processed_reviews)
    # join the tokens back into strings, the input format TfidfVectorizer expects
    temp=processed_reviews['Review'].apply(lambda x: ' '.join(x))
    #transform the text data using TfidfVectorizer; the rare words from our dataset are added to the stop words
    sklearn_tfidf  = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=rare_words+stopwords)
    features=sklearn_tfidf.fit_transform(temp)
    return sklearn_tfidf,features
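
To make the TF-IDF weighting less of a black box, here is a small standard-library sketch of the idea: a word's weight in a document grows with its count there and shrinks the more documents it appears in. The smoothed IDF formula and the L2 normalization are meant to mirror what I understand to be scikit-learn's defaults; treat that as an assumption rather than a statement about the library. The function name `tfidf_weights` and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute L2-normalized TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # term count times smoothed inverse document frequency
        raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
        # guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({w: v / norm for w, v in raw.items()})
    return weighted

docs = [["great", "food"], ["bad", "food"], ["great", "service"]]
weights = tfidf_weights(docs)
for doc_weights in weights:
    print({w: round(v, 3) for w, v in doc_weights.items()})
```

In the second toy document, "bad" appears in only one document and therefore gets a larger weight than "food", which appears in two.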


In [17]:
#to generate features in a form suitable for our classification model
(tfidf, X) = create_features(processed_reviews)

In [18]:
#sentiment labels for training set
y_validation=train['Sentiment']

After transforming our Yelp reviews into a suitable format, we can model a random forest classifier. We use our training data set to train the classifier.

Before that, we want to choose the best possible configuration for our model. We use GridSearchCV to find the configuration with the best cross-validated accuracy on the training set.



The next chunk of code takes a long time to run because it performs 10-fold cross-validation over many configurations. Execute it only when you have time (approximate execution time: 15-20 minutes).

In [33]:
#takes a long execution time, crossvalidation
import warnings
warnings.filterwarnings('ignore')

# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module has been removed)
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(n_jobs=4, max_features='sqrt', oob_score = True)

# Use a grid over parameters of interest
param_grid = {
           "n_estimators" : [9, 18, 27, 36, 45, 54, 63],
           "max_depth" : [1, 5, 10, 15, 20, 25, 30],
           "min_samples_leaf" : [1, 2, 4, 6, 8, 10]}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
CV_rfc.fit(X, y_validation)
print(CV_rfc.best_params_)

{'max_depth': 30, 'min_samples_leaf': 1, 'n_estimators': 36}
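
The long run time of the grid search above follows from the size of the grid: every combination of the listed values is trained and evaluated once per fold. A quick count, using only the standard library and the same parameter values as in the cell above:

```python
from itertools import product

param_grid = {
    "n_estimators": [9, 18, 27, 36, 45, 54, 63],
    "max_depth": [1, 5, 10, 15, 20, 25, 30],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10],
}
n_folds = 10

# Every combination of the listed values is one candidate configuration
combinations = list(product(*param_grid.values()))
print(len(combinations))            # 294 candidate configurations
print(len(combinations) * n_folds)  # 2940 model fits during cross-validation
```

Shrinking any one of the lists, or lowering the number of folds, reduces the run time proportionally.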


Now, using the best configuration found by GridSearchCV above, we train our random forest classifier.

In [52]:
#training an optimized random forest classification model
clf = RandomForestClassifier(n_estimators=36, max_depth=30, max_features='sqrt', min_samples_leaf = 1)

#X is the transformed text input as calculated above
classifier= clf.fit(X, y_validation) 

To evaluate our classifier, we first calculate its training accuracy by predicting on the training set.

In [53]:
#using the classifier modeled in the code chunk above we predict outcome (sentiments) of the input transformed text (X)
predictList=classifier.predict(X)
result = predictList.tolist()

#to calculate accuracy
x= (y_validation==result)
accuracy = sum(x)/len(y_validation)
print('Training accuracy of the model is:',accuracy*100,'%')

Training accuracy of the model is: 98.0 %


The accuracy above was calculated by giving the training set back to the model to predict the sentiment class; on the training set the accuracy is 98%. This is expected: because the model was trained on this same data, it performs better on it than it would on new data.

We want to see how our model performs on new, unseen data: the test set.

In [54]:
#this is the test set, without the sentiment labels
unlabeled_reviews = test[['Review']]
unlabeled_reviews.head()

Unnamed: 0,Review
0,Based on the sub-par service I received and no...
1,It shouldn't take 30 min for pancakes and eggs.
2,"Great steak, great sides, great wine, amazing ..."
3,What a mistake that was!
4,Generous portions and great taste.


The function 'classify_sentiments' classifies unlabeled text as positive or negative using the classification model we developed above.

In [55]:
def classify_sentiments(tfidf, classifier, unlabeled_reviews):
    """ predicts class labels for raw review text
    Inputs:
        tfidf: sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used on training data
        classifier: sklearn.ensemble.RandomForestClassifier: classifier learnt
        unlabeled_reviews: pd.DataFrame: unlabeled text for sentiment analysis
    Outputs:
        numpy.ndarray(int): dense binary vector of class labels for unlabeled reviews
    """
    #work on a copy so that the caller's data frame is not modified
    unlabeled_reviews = unlabeled_reviews.copy()

    #tokenize and clean the unlabeled text
    unlabeled_reviews['Review'] = unlabeled_reviews['Review'].apply(text_tokenizer)

    temp=unlabeled_reviews['Review'].apply(lambda x: ' '.join(x))

    #the TfidfVectorizer fitted on the training data is reused to transform our text input to features
    #to be fed into the classifier for predicting the sentiments of these text inputs
    features=tfidf.transform(temp)
    predictList=classifier.predict(features)

    return predictList

In [56]:
import warnings
warnings.filterwarnings("ignore")
unlabeled_reviews = test[['Review']]
y_pred = classify_sentiments(tfidf, classifier, unlabeled_reviews)

### 5.2.5 Analyzing performance of this analysis

In [57]:
#calculate the testing accuracy
x= (y_pred==test['Sentiment'])
accuracy = sum(x)/len(y_pred)
print('Accuracy of predicting sentiments of test dataset is:',accuracy*100,'%')

Accuracy of predicting sentiments of test dataset is: 71.0 %


In [58]:
#calculating the other performance metrics computed in the earlier steps (with the TextBlob library analysis)
print('Confusion Matrix')
print(confusion_matrix(test['Sentiment'], y_pred))
print('Precision:',precision_score(test['Sentiment'], y_pred))
print('Recall:',recall_score(test['Sentiment'], y_pred))
print('F1 score:',f1_score(test['Sentiment'], y_pred))

Confusion Matrix
[[78 16]
 [42 64]]
Precision: 0.8
Recall: 0.603773584906
F1 score: 0.688172043011


The confusion matrix for the sentiments predicted by our model can be seen above. Precision is 0.8, recall is 0.604, and the F1 score is 0.688.

## 6. Conclusion

In Section 5 we saw how to perform sentiment analysis on text data. We used two techniques:
- In the first, we analyzed sentiments using the TextBlob library. This approach works for any text and needs no sentiment labels to predict sentiments (but if we have labels, we can compare our predictions to them to measure accuracy and performance).
- In the second, we need sentiment labels to train a random forest classification model (text data and the corresponding sentiments). We then use the trained classifier to predict the sentiments of a test set.

After performing sentiment analysis with both techniques, we compared their performance metrics and observed that the first method (TextBlob: 77.2% accuracy, F1 score 0.775) performed better than the second (random forest: 71.0% accuracy, F1 score 0.688). Note that the two were evaluated on different data: TextBlob on all 1000 reviews and the random forest on the 200-review test set.

Additionally, the second approach (random forest) requires sentiment labels, so if we don't have a pre-labeled data set (which is most often the case), we can use the TextBlob method to perform sentiment analysis and still obtain reasonably good accuracy.

## 7. Further Motivation

As a next step, we can take this tutorial and our analysis further by using word embeddings and neural networks, which may yield better results.

We can also break down our analysis of a restaurant's reviews by aspect, such as quality of food, ambience, service, and tangibles, for a more holistic picture of the restaurant.

