# Sentiment Analysis of IMDB Movie Reviews

**Problem Statement:**

The goal is to classify each IMDB review as positive or negative based on its sentiment, and to compare different classification models on that task.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import WordNetLemmatizer


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.svm import SVC, LinearSVC

import re,string
from wordcloud import WordCloud,STOPWORDS

In [None]:
data = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
print(data.shape)
data.head()

In [None]:
#Summary of the dataset
data.describe()

In [None]:
#Class Distribution
data['sentiment'].value_counts()

### Change Target variable

In [None]:
## 0 as Negative and 1 as Positive
data.sentiment = data.sentiment.apply(lambda x: 0 if x=='negative' else 1)

<h1>Feature Engineering</h1>

### Indirect features:
- count of sentences
- count of title words
- count of stop words
- count of words
- count/percentage of unique words
- count/percentage of punctuations

In [None]:
## Indirect features
eng_stopwords = set(stopwords.words("english"))

data["count_words_title"] = data["review"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
data["count_stopwords"] = data["review"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

data['count_word'] = data["review"].apply(lambda x: len(str(x).split()))
data['count_unique_word'] = data["review"].apply(lambda x: len(set(str(x).split())))
data["count_punctuations"] = data["review"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
data['word_unique_percent'] = data['count_unique_word'] * 100 / data['count_word']
data['punct_percent'] = data['count_punctuations'] * 100 / data['count_word']
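
The feature list above also mentions a sentence count, which the cell above does not compute. A minimal sketch, assuming a sentence ends at a run of `.`, `!` or `?` (a rough heuristic; `count_sentences` is a hypothetical helper, not part of the original notebook):

```python
import re

def count_sentences(text):
    # Split on runs of sentence-ending punctuation and ignore empty pieces.
    # Heuristic only: abbreviations such as "Mr." are counted as sentence ends.
    parts = [p for p in re.split(r"[.!?]+", str(text)) if p.strip()]
    return len(parts)

sample = "Great film! Would I watch it again? Absolutely."
print(count_sentences(sample))
```

On the DataFrame, this could be applied in the same way as the other features, for example `data["review"].apply(count_sentences)`.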


In [None]:
## Reordering the columns 
data = data[['review', 'count_words_title', 'count_stopwords',
             'count_word', 'count_unique_word', 'count_punctuations',
             'word_unique_percent', 'punct_percent','sentiment']]
data.head()

In [None]:
# sentiment == 0 is negative and sentiment == 1 is positive (see the target encoding above)
plt.hist(data[data['sentiment']==0]['count_word'], range=(0,1500), color='red', edgecolor='black', 
         label='negative reviews', alpha=0.5)
plt.hist(data[data['sentiment']==1]['count_word'], range=(0,1500), color='green', edgecolor='black', 
         label='positive reviews', alpha=0.5)

plt.title('Word Count Distribution')
plt.xlabel('Word Count')
plt.legend()
plt.show()

## Text Preprocessing of Reviews

<b>In a machine learning task, cleaning or pre-processing the data is at least as important as model building. For unstructured data like text, this step matters most. IMDB reviews are written by users by hand, so contractions and chat words are common. Some reviews were also collected from other sites, so the dataset contains many HTML tags.</b>

**a. Clean Contractions or Chat Words:**
Because these reviews are entered by hand, people use many abbreviated chat words, so it is important to expand these chat words and contractions. I’ve used a list of slang terms and contractions from a repo.

**b. Lower Casing** Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that 'text', 'Text' and 'TEXT' are treated the same way. This is especially helpful for text featurization techniques like frequency counts and TF-IDF, as it merges variants of the same word, which reduces duplication and gives correct counts / TF-IDF values.

**c. Removal Of Stop Words**
Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. Stopword lists are already compiled for different languages and we can safely use them; here we use the English list that ships with NLTK.

**d. Lemmatization**
Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it makes sure that the root word (also called the lemma) belongs to the language. As a result, it is generally slower than stemming. I’m using the standard WordNetLemmatizer for this work.

**e. Removal Of Urls & HTML Tags:**
We found heavy use of HTML tags in the dataset. To make sense of the text, these tags need to be removed.

**f. Removal Of Punctuations** In this step, we remove the punctuation characters (!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) from the text data. This is a text standardization step that helps to treat 'hurray' and 'hurray!' in the same way. Note of caution: this step has to be performed after the removal of HTML tags. Otherwise the angle brackets and slashes of the tags are stripped first, and the later HTML removal step no longer finds anything to match.
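
The ordering caveat in (f) can be seen on a single string. This is a small sketch with a made-up sample review; it uses the same kind of tag regex and punctuation table as the notebook, but replaces tags with a space so that neighbouring words stay separated (an assumption of this example):

```python
import re
import string

html_pattern = re.compile(r"<.*?>")
punct_table = str.maketrans("", "", string.punctuation)

raw = "Loved it!<br /><br />Would watch again."

# Intended order: strip the tags first, then the punctuation.
tags_first = html_pattern.sub(" ", raw).translate(punct_table)

# Reversed order: the punctuation step deletes '<', '/' and '>',
# so the tag regex has nothing left to match and 'br' leaks into the text.
punct_first = html_pattern.sub(" ", raw.translate(punct_table))

print(repr(tags_first))
print(repr(punct_first))
```

With the reversed order, the leftover `br` fragments end up as tokens in the vocabulary, which is why the HTML step comes first in `preprocess`.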

In [None]:
# Contraction -> expansion mapping used by clean_contractions
mapping = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
    "could've": "could have", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
    "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
    "I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have",
    "i'll": "i will", "i'll've": "i will have", "i'm": "i am", "i've": "i have",
    "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mayn't": "may not", "might've": "might have", "mightn't": "might not",
    "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
    "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have",
    "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
    "she'd": "she would", "she'd've": "she would have", "she'll": "she will",
    "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
    "so's": "so as", "this's": "this is", "that'd": "that would",
    "that'd've": "that would have", "that's": "that is", "there'd": "there would",
    "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will",
    "they'll've": "they will have", "they're": "they are", "they've": "they have",
    "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have",
    "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have",
    "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
    "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is",
    "when've": "when have", "where'd": "where did", "where's": "where is",
    "where've": "where have", "who'll": "who will", "who'll've": "who will have",
    "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have",
    "will've": "will have", "won't": "will not", "won't've": "will not have",
    "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
    "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
    "y'all're": "you all are", "y'all've": "you all have", "you'd": "you would",
    "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
    "you're": "you are", "you've": "you have",
}
PUNCT_TO_REMOVE = string.punctuation # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
eng_stopwords = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in eng_stopwords])

def word_replace(text):
    # Replace line-break tags with a space so the words on either side are not glued together
    return text.replace('<br />', ' ')

def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def remove_html(text):
    html_pattern = re.compile(r'<.*?>')
    # Substitute a space rather than nothing so adjacent words stay separated
    return html_pattern.sub(' ', text)

def preprocess(text):
    text = clean_contractions(text, mapping)
    text = text.lower()
    text = word_replace(text)
    text = remove_urls(text)
    text = remove_html(text)
    text = remove_stopwords(text)
    text = remove_punctuation(text)
    text = lemmatize_words(text)
    
    return text
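
A quick check of how `clean_contractions` behaves on edge cases. This is a sketch: `small_mapping` is a hypothetical three-entry stand-in for the full `mapping`, and the function body is repeated (in equivalent form) so the snippet runs on its own.

```python
small_mapping = {"don't": "do not", "it's": "it is", "i've": "i have"}

def clean_contractions(text, mapping):
    # Normalise curly quotes and backticks to a plain apostrophe
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    # Expand only tokens that match a mapping key exactly
    return " ".join(mapping.get(t, t) for t in text.split(" "))

expanded = clean_contractions("it’s a film i don`t regret", small_mapping)
untouched = clean_contractions("Don't skip it, don't!", small_mapping)

print(expanded)
print(untouched)
```

Because matching is done on exact whitespace-separated tokens, a capitalized contraction (`Don't`) or one with attached punctuation (`don't!`) passes through unchanged. Lowercasing before expanding contractions would likely catch the first case, since the mapping already has lowercase keys.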

In [None]:
data["reviews_preprocessed"] = data["review"].apply(preprocess)
data.head()

# Word Cloud

In [None]:
# Positive Reviews.
plt.figure(figsize=(15, 15))
wc = WordCloud(max_words=200, width=1000, height=500, stopwords=STOPWORDS).generate(" ".join(data[data.sentiment==1].reviews_preprocessed))
plt.imshow(wc, interpolation='bilinear')

In [None]:
# Negative Reviews.
plt.figure(figsize=(15, 15))
wc = WordCloud(max_words=200, width=1000, height=500, stopwords=STOPWORDS).generate(" ".join(data[data.sentiment==0].reviews_preprocessed))
plt.imshow(wc, interpolation='bilinear')

- From these word clouds, we cannot see any striking differences between the two sentiments just by looking at the words. Negative reviews do not appear to use extremely negative or abusive language.
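
One way to put a number on this visual impression is to compare the most frequent words of each class and see how much they overlap. The sketch below uses toy review lists (`pos_reviews` and `neg_reviews` are made up for illustration); on the real data, the preprocessed reviews of each sentiment would take their place.

```python
from collections import Counter

def top_words(texts, k):
    # Count whitespace-separated tokens over all texts and keep the k most frequent
    counts = Counter(word for t in texts for word in t.split())
    return [word for word, _ in counts.most_common(k)]

# Toy data, for illustration only
pos_reviews = ["great film great cast", "film one loved"]
neg_reviews = ["bad film bad plot", "film one boring"]

top_pos = set(top_words(pos_reviews, 2))
top_neg = set(top_words(neg_reviews, 2))

shared = top_pos & top_neg
print("shared:", shared)
print("only positive:", top_pos - top_neg)
print("only negative:", top_neg - top_pos)
```

A large shared set would be consistent with the word clouds looking alike, while the class-specific sets show which frequent words actually differ.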

### Utility Function

In [None]:
def metrics(model, x , y):
    y_pred = model.predict(x)
    acc = accuracy_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    print("\nAccuracy: ", round(acc,3))
    print("\nF1 Score: ", round(f1,3))
    
    cm=confusion_matrix(y, y_pred)
    plt.figure(figsize=(4, 4))
    sns.heatmap(cm, annot=True, cmap='coolwarm', xticklabels=[0,1], fmt='d', annot_kws={"fontsize":19})
    plt.xlabel("Predicted", fontsize=16)
    plt.ylabel("Actual", fontsize=16)
    plt.show()

# Model based on Indirect Features

In [None]:
X = data[['count_words_title', 'count_stopwords',
        'count_word', 'count_unique_word', 'count_punctuations',
        'word_unique_percent', 'punct_percent']]

y = data['sentiment']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
[i.shape for i in [X_train, X_test, y_train, y_test]]

In [None]:
linear_svc = LinearSVC(penalty='l2', dual=False)
linear_svc.fit(X_train, y_train)
metrics(linear_svc, X_test, y_test)

- As expected from the EDA, this model gives a poor accuracy of about 58%. The indirect features show very similar trends and patterns across both classes, as we saw in the EDA section.

## N-gram Analysis
- The order in which words are used in text is not random. In English, for example, you can say "the red apple" but not "apple red the". The general idea is that you can look at each pair (or triple, and so on) of words that occur next to each other. In a sufficiently large corpus, you're likely to see "the red" and "red apple" several times, but less likely to see "apple red" and "red the". This is useful to know if, for example, you're trying to figure out what someone is more likely to say, to help decide between possible outputs of an automatic speech recognition system. These co-occurring words are known as "n-grams", where "n" is the number of consecutive words considered.
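
The idea can be sketched in a few lines of plain Python. The sentence is a made-up example, and `ngrams` here is a local helper written for illustration (the notebook itself uses `nltk.ngrams`):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the red apple fell near the red barn".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
print(len(trigram_counts))
```

Here the bigram `('the', 'red')` occurs twice while every other bigram occurs once, which is the kind of signal the frequency plots below rely on.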

In [None]:
# Join all preprocessed reviews into one string, then split it into a flat list of tokens
texts = ' '.join(data['reviews_preprocessed'])
texts_to_list = texts.split(" ")

In [None]:
def draw_n_gram(texts_to_list, i):
    n_gram = (pd.Series(nltk.ngrams(texts_to_list, i)).value_counts())[:11]
    n_gram_df = pd.DataFrame(n_gram)
    n_gram_df = n_gram_df.reset_index()
    n_gram_df = n_gram_df.rename(columns={"index": "word", 0: "count"})
    print(n_gram_df.head(10))
    plt.figure(figsize=(10,5))
    return sns.barplot(x='count', y='word', data=n_gram_df, palette="Blues_d")

## Unigram Analysis

In [None]:
draw_n_gram(texts_to_list, 1)

## Bigram Analysis

In [None]:
draw_n_gram(texts_to_list, 2)

## Trigram Analysis

In [None]:
draw_n_gram(texts_to_list, 3)

## Quadgram Analysis

In [None]:
draw_n_gram(texts_to_list, 4)

### **1) TF-IDF Vectorizer**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['reviews_preprocessed'], data['sentiment'], test_size=0.2, random_state=0)
[i.shape for i in [X_train, X_test, y_train, y_test]]

In [None]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 4),
    max_features=8000
)

# Fit on the training split only, so the vocabulary and IDF weights do not see test reviews
word_vectorizer.fit(X_train)

tfidf_train = word_vectorizer.transform(X_train)
tfidf_test = word_vectorizer.transform(X_test)

print('Shape of tfidf_train:', tfidf_train.shape)
print('Shape of tfidf_test:', tfidf_test.shape)

In [None]:
word_vectorizer2 = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 4),
    max_features=None
)

# Fit on the training split only, so the vocabulary and IDF weights do not see test reviews
word_vectorizer2.fit(X_train)

tfidf_train2 = word_vectorizer2.transform(X_train)
tfidf_test2 = word_vectorizer2.transform(X_test)

print('Shape of tfidf_train2:', tfidf_train2.shape)
print('Shape of tfidf_test2:', tfidf_test2.shape)

In [None]:
linear_svc = LinearSVC(penalty='l2', dual=False)
linear_svc.fit(tfidf_train, y_train)
metrics(linear_svc, tfidf_test, y_test)

In [None]:
linear_svc = LinearSVC(penalty='l2', dual=False)
linear_svc.fit(tfidf_train2, y_train)
metrics(linear_svc, tfidf_test2, y_test)

### **2) Count Vectorizer**

In [None]:
cv=CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                   ngram_range=(1,3),max_features=10000)
# Fit on the training split only, so the vocabulary is not built from test reviews
cv.fit(X_train)
cv_train = cv.transform(X_train)
cv_test = cv.transform(X_test)
print('Shape of cv_train:', cv_train.shape)
print('Shape of cv_test:', cv_test.shape)

In [None]:
cv4=CountVectorizer(analyzer='word', token_pattern = r'\w{1,}',
                    ngram_range=(1,4),max_features=10000)
# Fit on the training split only, so the vocabulary is not built from test reviews
cv4.fit(X_train)
cv4_train = cv4.transform(X_train)
cv4_test = cv4.transform(X_test)
print('Shape of cv4_train:', cv4_train.shape)
print('Shape of cv4_test:', cv4_test.shape)

In [None]:
linear_svc = LinearSVC(C=0.5, dual=False, random_state=42)
linear_svc.fit(cv_train, y_train)

metrics(linear_svc, cv_test, y_test)

In [None]:
linear_svc = LinearSVC(penalty='l2', dual=False, random_state=42)
linear_svc.fit(cv4_train, y_train)

metrics(linear_svc, cv4_test, y_test)