# Notebook Overview

In this series of notebooks, I dive into the core concepts of natural language processing (NLP).

Natural language processing is a widely recognized and important field of machine learning. ....

https://www.kaggle.com/mgmarques/analyzing-movie-reviews-sentiment-analysis-i/notebook

# Notebook Setting

In [30]:
import numpy as np
import pandas as pd
import json
import gc
import gzip
import os
import glob
from datetime import datetime

# Dataset Information

In this series of notebooks, I use a movie review dataset to analyze a large corpus of movie reviews.

In this notebook, we focus on deriving the sentiment of those reviews.

You can access this dataset via the link below:

http://ai.stanford.edu/~amaas/data/sentiment/

In [34]:
filepath = 'aclImdb/train/unsup/'
# os.path.join avoids the doubled slash that 'unsup/' + '/*.txt' produced
files = glob.glob(os.path.join(filepath, '*.txt'))

In [74]:
get_df = lambda f: pd.read_csv(f, header=None, sep='\t')
df = pd.concat([get_df(item) for item in files], axis=0)
# dodf = {f: get_df(f) for f in files}
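
Each review file holds one free-text review, so `pd.read_csv` with a tab separator may split a review that happens to contain a tab or a line break. A minimal sketch of reading each file as a single string instead, using only the standard library. The helper `read_reviews` and the demo file names are hypothetical and not part of the original notebook:

```python
import tempfile
from pathlib import Path


def read_reviews(folder):
    """Return one string per .txt file in `folder`, sorted by file name."""
    paths = sorted(Path(folder).glob('*.txt'))
    return [path.read_text(encoding='utf-8') for path in paths]


# Demo on a throwaway folder with two made-up review files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, '0_0.txt').write_text('A fine film.\tReally.', encoding='utf-8')
    Path(tmp, '1_0.txt').write_text('Not for me.', encoding='utf-8')
    reviews = read_reviews(tmp)

print(reviews)
```

The resulting list could then be passed to `pd.DataFrame({'Review': reviews})` to get the same single-column frame as above.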

In [75]:
df = df[[df.columns[0]]]
df.index = range(len(df))
df.columns = ['Review']
df.head()

| | Review |
|---|---|
| 0 | A newspaperman (Johnny Twennies) living in the... |
| 0 | As co-founder of Nicko & Joe's Bad Film Club S... |
| 0 | Very good film from director Wyler, although i... |
| 0 | This flick will pass the time, and Kurt Russle... |
| 0 | The Feeding is a terrible werewolf movie about... |


In [92]:
df.to_csv('clean_df.csv')

---

# Preprocessing

In [93]:
df = pd.read_csv('clean_df.csv', index_col=0)

In [94]:
df.head()

| | Review |
|---|---|
| 0 | A newspaperman (Johnny Twennies) living in the... |
| 1 | As co-founder of Nicko & Joe's Bad Film Club S... |
| 2 | Very good film from director Wyler, although i... |
| 3 | This flick will pass the time, and Kurt Russle... |
| 4 | The Feeding is a terrible werewolf movie about... |


In [95]:
df.Review[0]

"A newspaperman (Johnny Twennies) living in the 90's with a complete 20's personality and lifestyle - fedora, manual typewriter, the Charleston, the works. It's a great idea for a movie and it couldn't have been done better.<br /><br />Johnny doesn't miss a cliche, but never uses the same one twice. You'll find yourself anticipating his reactions to the harsher '90s world as the movie goes along, you'll often guess right - but that makes the movie just that much more fun.<br /><br />Lots of fun when Johnny is called on to save the same damsel in distress (named Virginia, natch) on three different occasions. She responds with appropriate fluttering eyelids each time.<br /><br />His reaction to independent women, openly gay men, and the general '90s milieu is delightful. He remains happily oblivious.<br /><br />Don't worry, the movie never takes itself seriously. Nobody preaches about the evil of the present, or the shallowness of the past. You end up with a warm feeling for all the char

In [102]:
from bs4 import BeautifulSoup

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
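
One detail worth checking: `get_text()` joins text nodes with an empty separator by default, so `better.<br /><br />Johnny` comes out as `better.Johnny`. Passing `separator=' '` to `get_text` is one way around that. The sketch below shows the same idea with only the standard library's `html.parser`; `strip_html_tags_stdlib` is a hypothetical helper, not the notebook's function, and the sample text is shortened from the review above:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html_tags_stdlib(text):
    """Remove tags and join the remaining text pieces with single spaces."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return ' '.join(chunk.strip() for chunk in parser.chunks if chunk.strip())


sample = "It couldn't have been done better.<br /><br />Johnny doesn't miss a cliche."
cleaned = strip_html_tags_stdlib(sample)
print(cleaned)
```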

## VADER

In [96]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [97]:
sid = SentimentIntensityAnalyzer()

In [128]:
vader_label = []
for rev in df.Review:
    score = sid.polarity_scores(rev)
    label = 'pos' if score['compound'] >= 0 else 'neg'
    vader_label.append(label)

In [129]:
vader_label[:10]

['pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos']
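
The loop above treats every review with `compound >= 0` as positive, so a text that scores exactly 0.0 is also labeled positive. The VADER documentation suggests a three-way rule with thresholds at ±0.05 and a neutral band in between. A small sketch of that rule; `compound_to_label` is a hypothetical helper and the sample scores are illustrative:

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neu', or 'neg'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


sample_scores = [-0.7661, 0.0, 0.03, 0.9971]
sample_labels = [compound_to_label(score) for score in sample_scores]
print(sample_labels)
```

Whether a neutral class is useful depends on the downstream task; the binary rule used in this notebook keeps the output comparable with the other lexicons.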

In [130]:
df['Vader'] = vader_label

In [131]:
df.sample(3)

| | Review | Vader |
|---|---|---|
| 37438 | If you can get hold of this film it is well wo... | pos |
| 13524 | There's not really much to say about this movi... | neg |
| 2367 | I will admit I didn't pay full attention to ev... | pos |


## AFINN

In [133]:
from afinn import Afinn
af = Afinn(emoticons=True)

In [140]:
afinn_label = []
for rev in df.Review:
    score = af.score(rev)
    label = 'pos' if score >= 0 else 'neg'
    afinn_label.append(label)

In [141]:
df['Afinn'] = afinn_label

In [142]:
df.head(10)

| | Review | Vader | Afinn |
|---|---|---|---|
| 0 | A newspaperman (Johnny Twennies) living in the... | pos | pos |
| 1 | As co-founder of Nicko & Joe's Bad Film Club S... | pos | pos |
| 2 | Very good film from director Wyler, although i... | pos | pos |
| 3 | This flick will pass the time, and Kurt Russle... | neg | neg |
| 4 | The Feeding is a terrible werewolf movie about... | neg | neg |
| 5 | Here is another of those films that got panned... | pos | pos |
| 6 | Not sure how a filmmaker as prolific as Joel "... | neg | neg |
| 7 | What? I watched this movie with my two young n... | pos | pos |
| 8 | I was initially hesitant about watching Amu be... | pos | pos |
| 9 | This is no budget at its best. Hope to see mor... | pos | pos |
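
Without gold labels, one way to sanity-check the two lexicons is to measure how often they agree. A minimal sketch using the labels of the ten rows shown above; `agreement_rate` is a hypothetical helper, and ten rows are far too few to draw conclusions about the full corpus:

```python
from collections import Counter

# Labels copied from the ten rows displayed above
vader = ['pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos']
afinn = ['pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos']


def agreement_rate(labels_a, labels_b):
    """Return the fraction of positions where both label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError('label lists must have the same length')
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Count each (Vader, Afinn) label pair, which is a small cross-tabulation
crosstab = Counter(zip(vader, afinn))
rate = agreement_rate(vader, afinn)
print(rate, dict(crosstab))
```

On the full frame, the same comparison could be written as `(df['Vader'] == df['Afinn']).mean()`.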


## TextBlob

The `sentiment` property of TextBlob returns two values: polarity and subjectivity.

**Polarity**: a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.

**Subjectivity**: a float in the range [0, 1]. Subjective sentences generally express personal opinion, emotion, or judgment, whereas objective sentences convey factual information; 0 is fully objective and 1 is fully subjective.

Resources:

https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/

https://medium.com/@rahulvaish/textblob-and-sentiment-analysis-python-a687e9fabe96

In [144]:
from textblob import TextBlob

In [143]:
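def textblob_parser(text_series):
    """Return parallel lists of TextBlob polarity and subjectivity scores."""
    score = []
    subjectivity = []
    for text in text_series:
        try:
            # Compute the sentiment once and reuse it for both values
            sentiment = TextBlob(text).sentiment
            polarity, subj = sentiment.polarity, sentiment.subjectivity
        except Exception:
            # Keep both lists the same length when a review cannot be parsed
            polarity, subj = np.nan, np.nan
        score.append(np.round(polarity, 4))
        subjectivity.append(np.round(subj, 4))
    return score, subjectivity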

In [145]:
score, subjectivity = textblob_parser(df.Review)

In [146]:
df['Textblob'] = score
df['Textblob_subjectivity'] = subjectivity

## IBM Watson

In [147]:
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, SentimentOptions, CategoriesOptions

In [148]:
# Read the API key from the environment rather than hard-coding it; WATSON_APIKEY is a hypothetical variable name
nlp = NaturalLanguageUnderstandingV1(
    version='2018-11-16',
    iam_apikey=os.environ['WATSON_APIKEY'],
    url='https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/b6a19795-fa79-4177-9aff-be92162438f4')



In [150]:
def ibm_parser(input_text): 
    # Input text can be sentence, paragraph or document
    response = nlp.analyze(
        text = input_text,
        features = Features(sentiment=SentimentOptions()))
    result = response.get_result()
    # From the response extract score which is between -1 to 1
    res = result.get('sentiment').get('document').get('score')
    return res

In [151]:
ibm_score = []
for s in df.Review:
    ibm_score.append(ibm_parser(s))

label = pd.Series(ibm_score).apply(lambda c: 'pos' if c >=0 else 'neg')
df['IBM'] = label

KeyboardInterrupt: 
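
Scoring every review through a remote API is slow and can fail partway, as the interrupted run above suggests. A hedged sketch of a retry wrapper that could sit around `ibm_parser`. The names `score_with_retry` and `flaky_score` and all parameters are hypothetical, and no Watson call is made here; the stand-in function only simulates one network failure:

```python
import math
import time


def score_with_retry(score_fn, text, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call score_fn(text), retrying with exponential backoff.

    Returns NaN if every attempt raises, so one bad review does not stop the loop.
    """
    for attempt in range(retries):
        try:
            return score_fn(text)
        except Exception:
            # Wait 1x, 2x, 4x ... base_delay between attempts, but not after the last one
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return math.nan


calls = {'n': 0}


def flaky_score(text):
    """Stand-in for a remote scorer: fails on the first call, then succeeds."""
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('simulated network failure')
    return 0.5


# The demo passes a no-op sleep so it finishes instantly
result = score_with_retry(flaky_score, 'A fine film.', sleep=lambda seconds: None)
failed = score_with_retry(lambda text: 1 / 0, 'x', retries=2, sleep=lambda seconds: None)
print(result, failed)
```

`except Exception` deliberately leaves `KeyboardInterrupt` alone, so the loop can still be stopped by hand.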

## Plotting Functions

In [None]:
import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label', fontsize=20)
    plt.xlabel('Predicted label', fontsize=20)
    plt.tick_params(axis='both', labelsize=15)

In [None]:

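from sklearn.metrics import confusion_matrix

# Compute confusion matrix (y_test, y_pred, and class_names must be defined before running this cell)
cnf_matrix = confusion_matrix(y_test, y_pred)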

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

plt.show()

# Three Categories of Sentiment Analysis

There are three types of sentiment analysis methods.

**Rule-based methods:**

1. TextBlob: Simple rule-based API for sentiment analysis

2. VADER: Parsimonious rule-based model for sentiment analysis of social media text.

**Feature-based methods:**

1. Logistic Regression: Generalized linear model in Scikit-learn.

2. Support Vector Machine (SVM): Linear model in Scikit-learn with a stochastic gradient descent (SGD) optimizer for gradient loss.

**Embedding-based methods:**

1. FastText: An NLP library that uses highly efficient CPU-based representations of word embeddings for classification tasks.

2. Flair: A PyTorch-based framework for NLP tasks such as sequence tagging and classification.

In this example, since we don't have the original labels, we cannot use supervised learning.

Thus, we can only test the two rule-based methods.

---

## Rule-based methods

### TextBlob


In [136]:
from textblob import TextBlob

In [155]:
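def textblob_parser(text_series):
    """Score each text with TextBlob, logging progress every 500,000 texts."""
    start = datetime.now()
    score = []
    subjectivity = []
    for i, text in enumerate(text_series, start=1):
        try:
            # Compute the sentiment once and reuse it for both values
            sentiment = TextBlob(text).sentiment
            polarity, subj = sentiment.polarity, sentiment.subjectivity
        except Exception:
            # Keep both lists the same length when a review cannot be parsed
            polarity, subj = np.nan, np.nan
        score.append(np.round(polarity, 4))
        subjectivity.append(np.round(subj, 4))
        if i % 500000 == 0:
            # Report the time taken by the last batch, then restart the timer
            print(f'Done {i:<7}. Time: {datetime.now() - start}')
            start = datetime.now()
    return score, subjectivity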

In [None]:
score, subjectivity = textblob_parser(review['text'])

Done 500000 . Time: 0:12:08.939863
Done 1000000. Time: 0:13:36.516645
Done 1500000. Time: 0:12:28.886209


In [154]:
TextBlob(review.iloc[0]['text']).sentiment.subjectivity

0.6166666666666667

In [141]:
TextBlob(review.iloc[0]['text']).sentiment.polarity

-0.3333333333333333

### VADER

In [78]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [79]:
sid = SentimentIntensityAnalyzer()

In [94]:
review.shape

(6685902, 7)

In [106]:
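def vader_parser(text_series):
    """Return VADER neg/neu/pos/compound scores and a pos/neg label per text."""
    start = datetime.now()
    neg, neu, pos, score = [], [], [], []
    for i, text in enumerate(text_series, start=1):
        try:
            s = sid.polarity_scores(text)
        except Exception:
            # Texts that VADER cannot score get NaN in every column
            s = {'neg': np.nan, 'neu': np.nan, 'pos': np.nan, 'compound': np.nan}
        neg.append(s['neg'])
        neu.append(s['neu'])
        pos.append(s['pos'])
        score.append(s['compound'])
        if i % 500000 == 0:
            # Report the time taken by the last batch, then restart the timer
            print(f'Done {i:<7}. Time: {datetime.now() - start}')
            start = datetime.now()

    # NaN >= 0 is False, so unscored texts are labeled 'neg' here;
    # their NaN scores still let a later dropna() remove them
    label = pd.Series(score).apply(lambda c: 'pos' if c >= 0 else 'neg')

    return neg, neu, pos, score, label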

In [107]:
neg,neu,pos,score,label = vader_parser(review['text'])

Done 500000 . Time: 0:08:03.425690
Done 1000000. Time: 0:07:58.424441
Done 1500000. Time: 0:07:58.601121
Done 2000000. Time: 0:08:04.209375
Done 2500000. Time: 0:07:56.745506
Done 3000000. Time: 0:08:03.660124
Done 3500000. Time: 0:08:02.938929
Done 4000000. Time: 0:08:01.890153
Done 4500000. Time: 0:07:49.902701
Done 5000000. Time: 0:07:58.414470
Done 5500000. Time: 0:08:02.821643
Done 6000000. Time: 0:08:07.703370
Done 6500000. Time: 0:07:54.496754
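
The progress log above can be turned into a rough throughput estimate, which helps when deciding whether a full run is affordable. A small sketch that parses the logged batch times; `parse_duration` is a hypothetical helper, and the durations are copied from the log:

```python
from datetime import timedelta

batch_size = 500_000

# Batch durations copied from the progress log above (h:mm:ss.ffffff)
log = [
    '0:08:03.425690', '0:07:58.424441', '0:07:58.601121', '0:08:04.209375',
    '0:07:56.745506', '0:08:03.660124', '0:08:02.938929', '0:08:01.890153',
    '0:07:49.902701', '0:07:58.414470', '0:08:02.821643', '0:08:07.703370',
    '0:07:54.496754',
]


def parse_duration(stamp):
    """Convert an 'h:mm:ss.ffffff' string into a timedelta."""
    hours, minutes, seconds = stamp.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


total = sum((parse_duration(stamp) for stamp in log), timedelta())
reviews_per_second = batch_size * len(log) / total.total_seconds()
print(f'{total} in total, about {reviews_per_second:.0f} reviews per second')
```

The figure only covers the logged batches; reviews processed after the last logged batch are not included.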


In [130]:
review['vader_neg'] = neg
review['vader_neu'] = neu
review['vader_pos'] = pos
review['vader_score'] = score
review['vader_label'] = label

In [131]:
review = review.dropna()

In [133]:
review.head(2)

| | business_id | user_id | review_id | text | cool | funny | useful | vader_neg | vader_neu | vader_pos | vader_score | vader_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ujmEBvifdJM6h6RLv4wQIg | hG7b0MtEbXx5QzbzE6C_VA | Q1sbwvVQXV2734tPgoKj4Q | Total bill for this horrible service? Over $8G... | 0 | 1.0 | 6.0 | 0.159 | 0.841 | 0.0 | -0.7661 | neg |
| 1 | NZnhc2sEQy3RmzKTZnqtwQ | yXQM5uF2jS6es16SJzNHfg | GJXCdrto3ASJOqKeVWPi6Q | I \*adore\* Travis at the Hard Rock's new Kelly ... | 0 | 0.0 | 0.0 | 0.026 | 0.729 | 0.244 | 0.9971 | pos |


In [134]:
review.to_csv('sentiment_vader.csv')

## Feature-based Methods

Feature-based methods transform text into numeric features.

Here we use TF-IDF features.
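
To make the idea concrete, here is a minimal sketch of TF-IDF computed by hand with the textbook formula `tf * log(N / df)`. It is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ. The function `tfidf` and the three toy documents are hypothetical:

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = ['a terrible werewolf movie', 'a very good film', 'a good movie']
weights = tfidf(docs)
for doc, row in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in row.items()})
```

A word that appears in every document, such as "a" here, gets weight zero, while a word unique to one review, such as "terrible", gets the highest weight in that review.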

### Create Pipeline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression(solver='liblinear', multi_class='auto')),
    ]
)

### Logistic Regression

In [None]:
from skle

---

### Resource

Fine-grained Sentiment Analysis in Python (Part 1)

https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4