## Week 6 Sentiment Analysis

### Connect Colab to your Google Drive

In [None]:
#connect Colab to your Google Drive.
from google.colab import drive
import os
drive.mount('/content/gdrive')

### Import IMDB Data Set

In [None]:
# Original Data Source
# https://ai.stanford.edu/~amaas/data/sentiment/
# https://www.imdb.com/interfaces/

# The same data source in a CSV format from Kaggle.
# https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/version/1

### As always, two different ways to load the files.

(1) Save the file on your local computer and load it from there in your Jupyter Notebook.

(2) Save it in a cloud drive (Google Drive) and use your cloud Python environment (e.g. Colab) to load the file directly from your drive.

In [None]:
# In my case, I take the (2) approach.

import pandas as pd

#import csv file and put it into a Pandas dataframe.
movie=pd.read_csv('/content/gdrive/My Drive/CIS NLP Data Sets/IMDB Dataset.csv')

#assign column names. -> Not needed here since the file already has column names.
#movie.columns=["col name"]

#movie.head()
print (movie.iloc[:10,:])

#How many columns and rows?
print ("Shape:", movie.shape)

#Column names?
print ("Column Names",movie.columns.values)

### About Data:
The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.


This is a dataset for binary sentiment classification (positive and negative labels). <br> The original source provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. <br> The task is to predict whether each review is positive or negative, using either classification or deep learning algorithms.

In [None]:
#What values exist within category?
categories = movie['sentiment']

labels = list(set(categories))
print('possible categories',labels)


#Check the frequency of each class label.
count=movie['sentiment'].value_counts()
print (count)



We can infer that we will need to encode class labels as numbers.<br><br>
e.g. positive -> 1 & negative -> 0

## How to Build a Sentiment Analysis Algorithm.

### 1. Regular ML Approach.

Requirements

- Need a pre-assigned label for each document.
- Go through training and testing steps.

### 1-1: Data Preprocessing ###

Convert our labels to binary variables, 1 to represent 'positive' and 0 to represent 'negative' for ease of computation. 

In [None]:
movie['label_num'] = movie.sentiment.map({'negative':0, 'positive':1})

#How many columns and rows?
print ("Shape:", movie.shape)

#Column names?
print ("Column Names",movie.columns.values)

#movie.head()
print (movie.iloc[:10,:])

### 1-2: Training and testing sets (before we apply Count Vectorizer) ###

- Now we should split our data into two sets:
1. a training set (75%) used to discover potentially predictive relationships, and
2. a test set (25%) used to evaluate whether the discovered relationships hold and to assess the strength and utility of a predictive relationship.

>>**Instructions:**
Split the dataset into a training and testing set by using the train_test_split method in sklearn.
* `X_train` is our training data for the 'review' column.
* `y_train` is our training data for the 'label_num' column
* `X_test` is our testing data for the 'review' column.
* `y_test` is our testing data for the 'label_num' column


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(movie['review'], 
                                                    movie['label_num'],
                                                    random_state=0, 
                                                    test_size=0.25 #assign 25% to a test set.
                                                    )

print('Number of rows in the total set: {}'.format(movie.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

#sanity check: share of rows that ended up in the test set (should be 0.25).
print (X_test.shape[0]/movie.shape[0])
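To make the idea of a random 75/25 split concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements `train_test_split` internally; the function name `simple_train_test_split` and the toy data are made up for illustration.

```python
import random

def simple_train_test_split(texts, labels, test_size=0.25, seed=0):
    """Shuffle the row indices once, then use the first share as the test set."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)          # fixed seed -> reproducible split
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [texts[i] for i in train_idx]
    X_te = [texts[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]          # labels follow their own row
    y_te = [labels[i] for i in test_idx]
    return X_tr, X_te, y_tr, y_te

# Hypothetical toy data: 8 "reviews" with alternating labels.
toy_reviews = [f"review {i}" for i in range(8)]
toy_labels = [i % 2 for i in range(8)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_reviews, toy_labels)
print(len(X_tr), len(X_te))
```

The key point is that texts and labels are split with the same shuffled indices, so every review keeps its own label.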

### 1-3: Feature Extraction ###

Convert a collection of documents to a matrix, with each document being a row and each word (token) being a column. The (row, column) values start as the frequency of occurrence of each token in that document; tf-idf then reweights them so that tokens appearing in many documents count for less.

**Please Note:** 

* The TfidfVectorizer used below (like CountVectorizer, which it builds on) automatically converts all tokenized words to their lower case form so that it does not treat words like 'He' and 'he' differently. This is controlled by the `lowercase` parameter, which is `True` by default.

* It also ignores all punctuation so that words followed by a punctuation mark (e.g. 'hello!') are not treated differently than the same word (e.g. 'hello'). This is controlled by the `token_pattern` parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is the `stop_words` parameter. It is off by default; set `stop_words` to `'english'` to remove common English words.
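The count-then-reweight idea can be sketched by hand on a toy corpus. The sketch below assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which I believe matches the TfidfVectorizer defaults; the three documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and punctuation-free.
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_row(tokens):
    """Raw counts times smoothed idf, then scaled so the row has length 1."""
    counts = Counter(tokens)
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the second document, 'boring' (found in one document) ends up with a larger weight than 'movie' (found in two), even though both occur once.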


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#generate TfidfVectorizer object.
tfidf_vector = TfidfVectorizer(
lowercase=True,                    
stop_words='english',
ngram_range=(1, 2),             #The lower and upper boundary of the range of n-values for different n-grams to be extracted.
max_df=0.3,                     #used for removing terms that appear too frequently
min_df=0.05                      #used for removing terms that appear too infrequently.  
)


#For the entire list of all the parameters:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

#For the details about max_df & min_df: better explanations than the official document:
#https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer

# Be careful with using max_df & min_df:
#https://stackoverflow.com/questions/37815899/valueerror-after-pruning-no-terms-remain-try-a-lower-min-df-or-a-higher-max-d

# You can also adjust max_features argument along with max_df & min_df

# Fit the training data and then return the matrix
training_data = tfidf_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data during the vectorization step!!
testing_data = tfidf_vector.transform(X_test)


print ("Shape of training set",training_data.shape)

print ("Shape of testing set",testing_data.shape)
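To see what `max_df` and `min_df` do when given as proportions, here is a small sketch that prunes a vocabulary by document frequency. The helper `prune_vocabulary`, the thresholds, and the toy documents are illustrative assumptions, not scikit-learn code.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df, max_df):
    """Keep tokens whose share of documents lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    # Count each token at most once per document.
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    return {t for t, c in doc_freq.items() if min_df <= c / n_docs <= max_df}

# Hypothetical corpus of 10 documents:
# 'movie' is in all 10, 'plot' in 2, 'rare' in 1.
toy_docs = [["movie", "plot", "rare"], ["movie", "plot"]] + [["movie"]] * 8

kept = prune_vocabulary(toy_docs, min_df=0.15, max_df=0.5)
print(kept)
```

With aggressive thresholds the vocabulary can shrink to very few terms (or none), which is the situation the "no terms remain" error linked above refers to.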

In [None]:
vocab_dict=tfidf_vector.vocabulary_
print ("Unique Vocabulary: ",vocab_dict)
print (len(vocab_dict))

### 1-4: Apply ML Model ###

In [None]:
from sklearn.naive_bayes import MultinomialNB

#choose a model.
naive_bayes = MultinomialNB()

#fit your training set to the model.
naive_bayes.fit(training_data, y_train)

#predict the labels for testing set.
predicted = naive_bayes.predict(testing_data)
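For intuition about what the model learns, here is a hand-rolled multinomial Naive Bayes on raw word counts with add-one (Laplace) smoothing, `alpha=1.0`. It is a teaching sketch under those assumptions, not scikit-learn's implementation; `train_nb`, `predict_nb`, and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(token_lists, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {t for doc in token_lists for t in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, y in zip(token_lists, labels):
        word_counts[y].update(tokens)
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_docs[c] / len(labels))
        log_lik = {
            t: math.log((word_counts[c][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, tokens):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
        for c, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy training data: 1 = positive, 0 = negative.
toy_docs = [["great", "fun"], ["great", "cast"], ["boring", "plot"], ["dull", "boring"]]
toy_labels = [1, 1, 0, 0]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["great", "plot"]))
print(predict_nb(toy_model, ["boring", "movie"]))
```

Working in log space avoids multiplying many small probabilities, and the smoothing keeps a word that never appeared in a class from forcing that class's probability to zero.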


### 1-5: Evaluate the Model. ###

Accuracy, precision, recall, F1 score


- Accuracy  
measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions.

- Precision  
what proportion of the reviews we classified as positive actually were positive.
It is the ratio of true positives (reviews classified as positive that actually are positive) to all predicted positives (all reviews classified as positive, irrespective of whether that was the correct classification).

`[True Positives/(True Positives + False Positives)]`

- Recall (sensitivity)  
what proportion of the reviews that actually were positive we classified as positive.<br>
It is the ratio of true positives (reviews classified as positive that actually are positive) to all the reviews that actually are positive.

`[True Positives/(True Positives + False Negatives)]`

For classification problems with skewed class distributions (e.g. among 100 text messages only 2 are spam), accuracy by itself is not a very good metric. The IMDB labels are balanced (check the `value_counts()` output above), so accuracy is informative here, but the other metrics are still worth reporting. <br><br>In the skewed example, we could classify 90 messages as not spam (including the 2 that were spam, which would be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score, even though not a single spam message was caught. For such cases, precision and recall come in very handy. These two metrics can be combined to get the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.

For all 4 metrics, whose values can range from 0 to 1, having a score as close to 1 as possible is a good indicator of how well our model is doing.
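The skewed example above (100 messages, 2 of them spam, 10 flagged as spam and all of them wrong) can be worked through in a few lines. The helper `classification_metrics` is a hypothetical name; the formulas are the ones given above, with F1 computed as the harmonic mean of precision and recall and zero-division cases mapped to 0.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Counts from the spam example:
# 10 messages flagged as spam, none truly spam -> tp=0, fp=10
# 2 real spam messages missed                   -> fn=2
# remaining 88 correctly left alone             -> tn=88
acc, prec, rec, f1 = classification_metrics(tp=0, fp=10, fn=2, tn=88)
print(acc, prec, rec, f1)
```

Accuracy looks respectable while precision, recall, and F1 all collapse, which is why accuracy alone can mislead on skewed data.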

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', format(accuracy_score(y_test, predicted)))
print('Precision score: ', format(precision_score(y_test, predicted)))
print('Recall score: ', format(recall_score(y_test, predicted)))
print('F1 score: ', format(f1_score(y_test, predicted)))


In [None]:
# Precision/Recall/F1-score measures for each class in the test data.
from sklearn.metrics import classification_report

print(classification_report(y_test, predicted))

In [None]:
# Create a confusion matrix, which compares y_test and predicted.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, predicted)
cm_df = pd.DataFrame(cm,index = ['negative','positive'],
                     columns = ['negative','positive']  
                     )

#Plotting the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm_df, annot=True , fmt=".0f")
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
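To read the heatmap, it helps to know how the four cells are counted. The sketch below builds the same 2x2 layout by hand (rows are actual labels, columns are predicted labels, with 0 = negative first), which is the layout scikit-learn's `confusion_matrix` uses; the toy label lists are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs into a 2x2 nested list."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted_label in zip(y_true, y_pred):
        matrix[actual][predicted_label] += 1
    return matrix

# Hypothetical labels: 0 = negative, 1 = positive.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1, 0]

toy_cm = confusion_counts(y_true_toy, y_pred_toy)
tn, fp = toy_cm[0]
fn, tp = toy_cm[1]
print(toy_cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The diagonal holds the correct predictions; the off-diagonal cells are the false positives (top right) and false negatives (bottom left).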

### Can you try a different ML algorithm to see how the output differs from the current output?

### 2. Lexicon-based Approach.

- TextBlob
- Vader
- Text2emotion for emotion identification

**Pre-processing steps before applying one of the approaches.**

### 2-1: TextBlob ###

In [None]:
from textblob import TextBlob

In [None]:
#How many columns and rows?
print ("Shape:", movie.shape)

#Column names?
print ("Column Names",movie.columns.values)

#movie.head()
print (movie.iloc[100:120,:])

In [None]:
#clean the texts using regular expressions (RE).
import re

def cleaning(text):
    # Replace every run of non-letter characters (digits, punctuation, symbols) with a single space.
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

# Cleaning the text in the review column
movie['clean_review'] = movie['review'].apply(cleaning)

#movie.head()
print (movie.iloc[100:120,:])
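A quick check on a single string shows what the letters-only pattern does to markup. IMDB reviews often seem to contain HTML line breaks such as `<br />`; the sample sentence and the variant `cleaning_without_tags` below are hypothetical and only illustrate the effect.

```python
import re

def cleaning(text):
    # Same pattern as in the cell above: non-letters become a single space.
    return re.sub('[^A-Za-z]+', ' ', text)

def cleaning_without_tags(text):
    # Hypothetical variant: drop anything that looks like an HTML tag first.
    text = re.sub(r'<[^>]+>', ' ', text)
    return re.sub('[^A-Za-z]+', ' ', text)

sample = "Great movie!<br /><br />10/10, would watch again."

plain_tokens = cleaning(sample).split()
tagless_tokens = cleaning_without_tags(sample).split()
print(plain_tokens)
print(tagless_tokens)
```

With the letters-only pattern, the tag name 'br' survives as a token. In this notebook the later `len(word)>2` filter happens to remove it, but stripping tags explicitly is the safer choice if that filter changes.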

In [None]:
#apply lower-case function.
movie['clean_review']=movie['clean_review'].str.lower()

print (movie.iloc[100:120,:])

In [None]:
#word tokenizer using RegexpTokenizer

from nltk import RegexpTokenizer

#raw string so that \w is passed to the regex engine unchanged.
tokenizer_re=RegexpTokenizer(r"[\w]+")

movie['clean_review']=movie['clean_review'].map(tokenizer_re.tokenize)
print (movie.iloc[100:120,:])

In [None]:
#remove stop-words & drop words of 2 letters or fewer (the len(word)>2 check below).
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
#a set makes the "word not in ..." lookup much faster than a list.
nltk_stop_words=set(stopwords.words('english'))


movie['clean_review']=movie['clean_review'].apply(lambda words: [word for word in words if word not in nltk_stop_words and len(word)>2])
print (movie.iloc[100:120,:])

In [None]:
movie['clean_review_to_string']=movie['clean_review'].apply(lambda x: (' '.join(x)))
print (movie.iloc[100:120,:])

In [None]:
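# Lemmatization. -> This may take some time.

import nltk
nltk.download('omw-1.4')
#the PoS tagger is used to pick the right lemma for each word.
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

#create the lemmatizer once instead of once per word.
lemmatizer = WordNetLemmatizer()

def get_pos_tags(word):
    """Map the first letter of the word's PoS tag to the WordNet tag lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, #adjective
                "N": wordnet.NOUN,#noun
                "V": wordnet.VERB,#verb
                "R": wordnet.ADV} #adverb

    #fall back to noun when the tag is none of the above.
    return tag_dict.get(tag, wordnet.NOUN)


def lemmatize_text(text):
    return [lemmatizer.lemmatize(w, get_pos_tags(w)) for w in text]

movie['clean_review']=movie['clean_review'].apply(lemmatize_text)
#Note: 'clean_review_to_string' was built before this step, so it does not contain the lemmatized tokens.
#Uncomment the next line if you want the lexicon-based scores below to use the lemmatized text.
#movie['clean_review_to_string']=movie['clean_review'].apply(lambda x: (' '.join(x)))
print (movie.iloc[100:120,:])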

In [None]:
#polarity/subjectivity using TextBlob

def polarity(text):
  return TextBlob(text).sentiment.polarity

def subjectivity(text):
  return TextBlob(text).sentiment.subjectivity
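TextBlob reports polarity in the range [-1, 1] and subjectivity in the range [0, 1]. To show the general idea of a lexicon-based scorer, here is a deliberately simplified sketch that averages word scores from a tiny made-up lexicon. It is not TextBlob's actual algorithm (which, as far as I know, also handles negation and intensifiers); `TOY_LEXICON` and `toy_sentiment` are hypothetical.

```python
# Hypothetical lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "boring": (-1.0, 1.0),
    "fine": (0.4, 0.5),
}

def toy_sentiment(text):
    """Average the scores of the words found in the lexicon; (0, 0) if none match."""
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity_avg = sum(p for p, _ in hits) / len(hits)
    subjectivity_avg = sum(s for _, s in hits) / len(hits)
    return polarity_avg, subjectivity_avg

print(toy_sentiment("great cast but boring plot"))
print(toy_sentiment("the film runs two hours"))
```

No training data or labels are needed, which is the main practical difference from the ML approach in part 1: the quality of the result depends on how well the lexicon covers the vocabulary of the reviews.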



In [None]:
#pass the data through the above functions.

movie['polarity']=movie['clean_review_to_string'].apply(polarity)
movie['subjectivity']=movie['clean_review_to_string'].apply(subjectivity)
print (movie.iloc[100:120,:])

In [None]:
import matplotlib.pyplot as plt

#count the frequency of polarity.
num_bins=50
plt.figure(figsize=(10,6))
n, bins, patches=plt.hist(movie.polarity, num_bins, facecolor='blue')
plt.xlabel('polarity')
plt.ylabel('count')
plt.title('histogram of polarity')
plt.show()


In [None]:
import matplotlib.pyplot as plt

#count the frequency of subjectivity.
num_bins=50
plt.figure(figsize=(10,6))
n, bins, patches=plt.hist(movie.subjectivity, num_bins, facecolor='green')
plt.xlabel('subjectivity')
plt.ylabel('count')
plt.title('histogram of subjectivity')
plt.show()

In [None]:
#export some random rows for the manual check-up.
random_sample_movie = movie.sample(frac=0.1)

random_sample_movie.to_csv('/content/gdrive/My Drive/CIS NLP Data Sets/result_random_sampled.csv', index=False)

### 2-2: Vader ###

In [None]:
#you might need to pip install first.
!pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
#initiate the vader sentiment object.
analyzer = SentimentIntensityAnalyzer()

In [None]:
#build a small defined function to generate vader sentiment outputs.
def vader_score(text):
  return analyzer.polarity_scores(text)


In [None]:
#pass the data through the above function.
movie['vader_score']=movie['clean_review_to_string'].apply(vader_score)
print (movie.iloc[100:120,:])

In [None]:
#the vader sentiment outputs are stored in a dict format.
#you need to pull out each key-value pair and store it in its own column.
#The code below will do the work for you.
movie['vader_compound']=movie['vader_score'].apply(lambda score_dict: score_dict['compound'])
movie['vader_negative']=movie['vader_score'].apply(lambda score_dict: score_dict['neg'])
movie['vader_neutral']=movie['vader_score'].apply(lambda score_dict: score_dict['neu'])
movie['vader_positive']=movie['vader_score'].apply(lambda score_dict: score_dict['pos'])
print (movie.iloc[100:120,:])
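To compare VADER with the labelled data, the compound score has to be turned into a class. The sketch below uses cut-offs of +0.05 and -0.05, which follow the convention suggested in the VADER README; the function name and the sample rows are made up.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical (compound score, true sentiment) pairs.
toy_rows = [
    (0.93, "positive"),
    (-0.60, "negative"),
    (0.01, "positive"),
    (0.40, "negative"),
]

toy_predictions = [compound_to_label(score) for score, _ in toy_rows]
agreement = sum(
    pred == truth for pred, (_, truth) in zip(toy_predictions, toy_rows)
) / len(toy_rows)

print(toy_predictions)
print("Share matching the true label:", agreement)
```

Because the IMDB labels are binary, every 'neutral' prediction counts as a miss here; a single cut-off at 0 is an alternative if you want a strictly two-class comparison.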

In [None]:
import matplotlib.pyplot as plt

#count the frequency of vader outputs.
num_bins=50
plt.figure(figsize=(10,6))
n, bins, patches=plt.hist(movie.vader_compound, num_bins, facecolor='green')
plt.xlabel('vader compound score')
plt.ylabel('count')
plt.title('histogram of vader_compound')
plt.show()

### Wait: VADER is a module that was specifically created to work with text from social media contexts.

VADER can interpret the sentiment of text that contains capitalized words, punctuation (emphasizing certain words), and so on, and our pre-processing strips exactly these cues out.

Why don't we try running the VADER module without pre-processing the texts?
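To see what the pre-processing takes away, the sketch below counts two emphasis cues that VADER's documentation describes as intensity signals (all-caps words and exclamation marks) before and after cleaning. The helper names and the sample sentence are hypothetical.

```python
import re

def preprocess_like_above(text):
    # Mirrors the earlier cleaning steps: keep letters only, then lower-case.
    return re.sub('[^A-Za-z]+', ' ', text).lower().strip()

def emphasis_cues(text):
    """Count all-caps words and exclamation marks in a string."""
    words = text.split()
    return {
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "exclamation_marks": text.count("!"),
    }

raw_text = "The ending was GREAT!!!"
cleaned_text = preprocess_like_above(raw_text)

cues_raw = emphasis_cues(raw_text)
cues_cleaned = emphasis_cues(cleaned_text)
print(cleaned_text)
print(cues_raw)
print(cues_cleaned)
```

After cleaning, both counts drop to zero, so any boost VADER would have given the sentence for emphasis is no longer available to it.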

In [None]:
#How many columns and rows?
print ("Shape:", movie.shape)

#Column names?
print ("Column Names",movie.columns.values)

#movie.head()
print (movie.iloc[100:120,:])

In [None]:
#pass the raw (un-preprocessed) reviews through the above function.
movie['vader_score_no_pre_processing']=movie['review'].apply(vader_score)
print (movie.iloc[100:120,:])

movie['vader_compound_no_pp']=movie['vader_score_no_pre_processing'].apply(lambda score_dict: score_dict['compound'])
movie['vader_negative_no_pp']=movie['vader_score_no_pre_processing'].apply(lambda score_dict: score_dict['neg'])
movie['vader_neutral_no_pp']=movie['vader_score_no_pre_processing'].apply(lambda score_dict: score_dict['neu'])
movie['vader_positive_no_pp']=movie['vader_score_no_pre_processing'].apply(lambda score_dict: score_dict['pos'])
print (movie.iloc[100:120,:])

### 2-3: Text2emotion for emotion identification ###

- Rule-based algorithm
- Detects five types of emotion: happy, angry, sad, surprise, and fear.

In [None]:
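!pip install text2emotion

#text2emotion appears to need the older emoji API, so pin emoji to 1.7 before importing.
#-y skips the confirmation prompt.
!pip uninstall -y emoji
!pip install emoji==1.7

import text2emotion as emotion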

In [None]:
#build a small defined function to generate emotion outputs.
def emotion_score(text):
  return emotion.get_emotion(text)


In [None]:
# Randomly sample 0.1% of your dataframe (frac=0.001, about 50 rows) to keep the run time short.
movie_random = movie['review'].sample(frac=0.001)
movie_random=pd.DataFrame(movie_random)
print ("Shape:", movie_random.shape)

print ("Column Names",movie_random.columns.values)

#pass only the sampled reviews through the above function.
movie_random['emotion_score']=movie_random['review'].apply(emotion_score)
#the sample has far fewer than 100 rows, so print the first rows instead of rows 100-120.
print (movie_random.head(20))

In [None]:
#break down the dictionary format outputs and insert each component into its own column.
#store the results in movie_random, since only the sampled rows have an emotion score.
#text2emotion returns capitalized keys ('Angry', 'Fear', 'Happy', 'Sad', 'Surprise').
movie_random['angry']=movie_random['emotion_score'].apply(lambda score_dict: score_dict['Angry'])
movie_random['fear']=movie_random['emotion_score'].apply(lambda score_dict: score_dict['Fear'])
movie_random['happy']=movie_random['emotion_score'].apply(lambda score_dict: score_dict['Happy'])
movie_random['sad']=movie_random['emotion_score'].apply(lambda score_dict: score_dict['Sad'])
movie_random['surprise']=movie_random['emotion_score'].apply(lambda score_dict: score_dict['Surprise'])
print (movie_random.head(20))
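Once each review has an emotion dictionary, a common next step is to pick the strongest emotion per review. The sketch below assumes each score is a dictionary mapping five emotion names to numbers; the key spelling and the example values are assumptions, and `dominant_emotion` is a hypothetical helper.

```python
def dominant_emotion(score_dict):
    """Return the emotion with the highest score, or None if every score is 0."""
    if not score_dict or max(score_dict.values()) == 0:
        return None
    return max(score_dict, key=score_dict.get)

# Made-up example in the shape of an emotion score dictionary.
example_scores = {"Happy": 0.5, "Angry": 0.0, "Surprise": 0.25, "Sad": 0.25, "Fear": 0.0}
no_signal = {"Happy": 0.0, "Angry": 0.0, "Surprise": 0.0, "Sad": 0.0, "Fear": 0.0}

print(dominant_emotion(example_scores))
print(dominant_emotion(no_signal))
```

Returning `None` for an all-zero dictionary keeps reviews with no detected emotion from being assigned an arbitrary label.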