<a href="https://colab.research.google.com/github/smriti-nayak/Basic-Sentiment-Analysis/blob/master/MovieReview_SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Loading and Exploring the Dataset

In [None]:
import pandas as pd
data = pd.read_csv('/content/drive/My Drive/Datasets/IMDB Dataset.csv')
data['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [None]:
# Import label encoder
from sklearn import preprocessing

# Encoder that maps string labels to integer codes
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'sentiment'
data['sentiment'] = label_encoder.fit_transform(data['sentiment'])
data['sentiment'].unique()

array([1, 0])
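
`LabelEncoder` assigns integer codes in sorted order of the class names, so here `negative` becomes 0 and `positive` becomes 1. A minimal sketch on a toy list (the `labels` values are illustrative, not rows from the dataset) shows how the mapping can be inspected and inverted:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not taken from the IMDB file)
labels = ['positive', 'negative', 'positive']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; the position of each label is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the codes
restored = encoder.inverse_transform(encoded).tolist()
print(restored)
```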

In [None]:
data.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [None]:
data.review[:20]

0     One of the other reviewers has mentioned that ...
1     A wonderful little production. <br /><br />The...
2     I thought this was a wonderful way to spend ti...
3     Basically there's a family where a little boy ...
4     Petter Mattei's "Love in the Time of Money" is...
5     Probably my all-time favorite movie, a story o...
6     I sure would like to see a resurrection of a u...
7     This show was an amazing, fresh & innovative i...
8     Encouraged by the positive comments about this...
9     If you like original gut wrenching laughter yo...
10    Phil the Alien is one of those quirky films wh...
11    I saw this movie when I was about 12 when it c...
12    So im not a big fan of Boll's work but then ag...
13    The cast played Shakespeare.<br /><br />Shakes...
14    This a fantastic movie of three prisoners who ...
15    Kind of drawn in by the erotic scenes, only to...
16    Some films just simply should not be remade. T...
17    This movie made it into one of my top 10 m

In [None]:
print(data.review[10])

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.<br /><br />At first it was very odd and pretty funny but as the movie progressed I didn't find the jokes or oddness funny anymore.<br /><br />Its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually I just lost interest.<br /><br />I imagine this film would appeal to a stoner who is currently partaking.<br /><br />For something similar but better try "Brother from another planet"


### 2. Preprocessing the Data

*   Lowercasing
*   Removal of non-alphabetical characters
*   Tokenization
*   Stop-word removal
*   Stemming

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import re
import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = stopwords.words('english')
punc = string.punctuation
print(stop)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'bef

In [None]:
# Lowercasing
data['review'] = data['review'].str.lower()

print(data.review[10])

phil the alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.<br /><br />at first it was very odd and pretty funny but as the movie progressed i didn't find the jokes or oddness funny anymore.<br /><br />its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually i just lost interest.<br /><br />i imagine this film would appeal to a stoner who is currently partaking.<br /><br />for something similar but better try "brother from another planet"


In [None]:
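# Removing non-alphabetic characters
# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default.
# Note: <br /> tags are reduced to 'br', which can fuse with the preceding word.
data['review'] = data['review'].str.replace(r'[^a-z\s]', '', regex=True)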

print(data.review[10])

phil the alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlinesbr br at first it was very odd and pretty funny but as the movie progressed i didnt find the jokes or oddness funny anymorebr br its a low budget film thats never a problem in itself there were some pretty interesting characters but eventually i just lost interestbr br i imagine this film would appeal to a stoner who is currently partakingbr br for something similar but better try brother from another planet
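
The output above contains tokens such as `punchlinesbr` and standalone `br`, which are leftovers of the `<br />` tags. One possible remedy, sketched here with the standard library only, is to replace the tags with a space before filtering characters. The helper name `strip_markup_then_nonalpha` and the sample sentence are hypothetical; this is an alternative, not what the notebook ran, and the outputs in the following cells still reflect the original filter.

```python
import re

def strip_markup_then_nonalpha(text):
    """Hypothetical helper: remove <br> tags first, then non-alphabetic characters."""
    text = text.lower()
    text = re.sub(r'<br\s*/?>', ' ', text)    # replace line-break tags with a space
    text = re.sub(r'[^a-z\s]', '', text)       # same character filter as the cell above
    return re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace

sample = 'Actual punchlines.<br /><br />At first it was very odd.'
cleaned = strip_markup_then_nonalpha(sample)
print(cleaned)
```

On a pandas column, the same first step could be written as `data['review'].str.replace(r'<br\s*/?>', ' ', regex=True)` before the non-alphabetic filter.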


In [None]:
# Tokenization
data['review'] = data['review'].apply(word_tokenize)

print(data.review[10])

['phil', 'the', 'alien', 'is', 'one', 'of', 'those', 'quirky', 'films', 'where', 'the', 'humour', 'is', 'based', 'around', 'the', 'oddness', 'of', 'everything', 'rather', 'than', 'actual', 'punchlinesbr', 'br', 'at', 'first', 'it', 'was', 'very', 'odd', 'and', 'pretty', 'funny', 'but', 'as', 'the', 'movie', 'progressed', 'i', 'didnt', 'find', 'the', 'jokes', 'or', 'oddness', 'funny', 'anymorebr', 'br', 'its', 'a', 'low', 'budget', 'film', 'thats', 'never', 'a', 'problem', 'in', 'itself', 'there', 'were', 'some', 'pretty', 'interesting', 'characters', 'but', 'eventually', 'i', 'just', 'lost', 'interestbr', 'br', 'i', 'imagine', 'this', 'film', 'would', 'appeal', 'to', 'a', 'stoner', 'who', 'is', 'currently', 'partakingbr', 'br', 'for', 'something', 'similar', 'but', 'better', 'try', 'brother', 'from', 'another', 'planet']


In [None]:
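# Removing stopwords and stemming

stemmer = PorterStemmer()
stop_set = set(stop)  # set lookup is much faster than scanning the list

def clean(tokens):
  """Drop stop words and punctuation tokens, then stem the remaining words."""
  clean_tokens = []
  for word in tokens:
    if word not in stop_set and word not in punc:
      clean_tokens.append(stemmer.stem(word))
  return clean_tokens

data['review'] = data['review'].apply(clean)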

In [None]:
print(data.review[10])

['phil', 'alien', 'one', 'quirki', 'film', 'humour', 'base', 'around', 'odd', 'everyth', 'rather', 'actual', 'punchlinesbr', 'br', 'first', 'odd', 'pretti', 'funni', 'movi', 'progress', 'didnt', 'find', 'joke', 'odd', 'funni', 'anymorebr', 'br', 'low', 'budget', 'film', 'that', 'never', 'problem', 'pretti', 'interest', 'charact', 'eventu', 'lost', 'interestbr', 'br', 'imagin', 'film', 'would', 'appeal', 'stoner', 'current', 'partakingbr', 'br', 'someth', 'similar', 'better', 'tri', 'brother', 'anoth', 'planet']


In [None]:
data.head(10)

Unnamed: 0,review,sentiment
0,"[one, review, mention, watch, oz, episod, youl...",1
1,"[wonder, littl, product, br, br, film, techniq...",1
2,"[thought, wonder, way, spend, time, hot, summe...",1
3,"[basic, there, famili, littl, boy, jake, think...",0
4,"[petter, mattei, love, time, money, visual, st...",1
5,"[probabl, alltim, favorit, movi, stori, selfle...",1
6,"[sure, would, like, see, resurrect, date, seah...",1
7,"[show, amaz, fresh, innov, idea, first, air, f...",0
8,"[encourag, posit, comment, film, look, forward...",0
9,"[like, origin, gut, wrench, laughter, like, mo...",1


### 3. Analysis of Data (Word Frequency Distribution Analysis)

In [None]:
from nltk.probability import FreqDist

fdist = FreqDist()
for doc in data.review:
  for word in doc:
    fdist[word] += 1

In [None]:
vocab = fdist.most_common(50)
print(vocab)

[('br', 114890), ('movi', 98983), ('film', 92081), ('one', 53314), ('like', 43990), ('time', 29805), ('good', 28988), ('make', 28613), ('get', 27750), ('see', 27693), ('charact', 27602), ('watch', 27281), ('even', 25046), ('stori', 24274), ('would', 24024), ('realli', 22952), ('scene', 20706), ('show', 19407), ('well', 19303), ('look', 19284), ('much', 18947), ('end', 18155), ('great', 18067), ('peopl', 18052), ('also', 17818), ('bad', 17785), ('go', 17723), ('love', 17651), ('think', 17343), ('first', 17160), ('play', 16999), ('dont', 16954), ('act', 16813), ('way', 16529), ('thing', 16119), ('made', 15417), ('could', 15155), ('know', 14884), ('say', 14787), ('seem', 14075), ('mani', 13413), ('work', 13147), ('want', 13117), ('plot', 13099), ('seen', 13098), ('two', 13030), ('actor', 13019), ('come', 12986), ('take', 12939), ('never', 12874)]
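
The most frequent token above is `br`, which is markup residue rather than a word. `FreqDist` is a subclass of `collections.Counter`, so the same count can be sketched with the standard library while skipping such tokens. The toy `docs` list and the `residue` set below are illustrative assumptions, not part of the notebook:

```python
from collections import Counter
from itertools import chain

# Toy tokenized documents (illustrative stand-ins for data.review)
docs = [['movi', 'br', 'good'], ['movi', 'bad', 'br', 'br']]

# Hypothetical exclusion list for markup residue
residue = {'br'}

# Flatten all documents into one token stream and count what is not excluded
counts = Counter(tok for tok in chain.from_iterable(docs) if tok not in residue)
top = counts.most_common(2)
print(top)
```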


In [None]:
# Storing the most common 50 words in a list

lst = [tup[0] for tup in vocab]
print(lst)

['br', 'movi', 'film', 'one', 'like', 'time', 'good', 'make', 'get', 'see', 'charact', 'watch', 'even', 'stori', 'would', 'realli', 'scene', 'show', 'well', 'look', 'much', 'end', 'great', 'peopl', 'also', 'bad', 'go', 'love', 'think', 'first', 'play', 'dont', 'act', 'way', 'thing', 'made', 'could', 'know', 'say', 'seem', 'mani', 'work', 'want', 'plot', 'seen', 'two', 'actor', 'come', 'take', 'never']


In [None]:
with open('nlargest.txt', 'w') as f:
  for item in lst:
    f.write("%s\n" % item)

### 4. Feature Preparation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [None]:
#Preparing data to create TF-IDF features

d = data.review
merged = []
for doc in d:
  merged.append((' '.join(doc)))

data.review = merged
data.head(10)

Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook ...,1
1,wonder littl product br br film techniqu unass...,1
2,thought wonder way spend time hot summer weeke...,1
3,basic there famili littl boy jake think there ...,0
4,petter mattei love time money visual stun film...,1
5,probabl alltim favorit movi stori selfless sac...,1
6,sure would like see resurrect date seahunt ser...,1
7,show amaz fresh innov idea first air first yea...,0
8,encourag posit comment film look forward watch...,0
9,like origin gut wrench laughter like movi youn...,1


In [None]:
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer(ngram_range=(1,2))

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(data.review)
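
With `ngram_range=(1,2)` the vectorizer creates one column per unigram and one per bigram, which is why the feature space grows so quickly on a large corpus. A small sketch on two toy documents (illustrative only, not taken from the dataset) makes the resulting columns visible:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, already stemmed and joined (illustrative only)
toy_docs = ['good movi', 'bad movi']

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each unigram or bigram to its column index
features = sorted(toy_vectorizer.vocabulary_)
print(features)
print(toy_matrix.shape)
```

Three distinct words already yield five columns here, since each pair of adjacent words adds a bigram feature.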

In [None]:
#print(tfidf_matrix.shape)
#print(tfidf_matrix.toarray())

### 5. Training the Model
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
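
For reference, the standard smoothed formulation of multinomial Naive Bayes can be written as below. The symbols are introduced here for illustration: $N_{yi}$ is the total value of feature $i$ over the training documents of class $y$, $N_y = \sum_i N_{yi}$, $n$ is the number of features, $\alpha$ is the smoothing parameter, and $x_i$ is the value of feature $i$ in the document being classified.

$$
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n},
\qquad
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{yi} \right]
$$

With tf-idf input, $N_{yi}$ and $x_i$ are sums of fractional weights rather than integer counts; the formulas are applied unchanged, which is why such features may still work in practice.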

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, data.sentiment, test_size=0.01, random_state=42)
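
In the cell above, the TF-IDF matrix was fitted on all reviews before the split, so the vocabulary and IDF statistics also reflect the held-out reviews, and `test_size=0.01` keeps only 500 reviews for testing. One alternative, sketched here on a toy corpus, is to split the text first and let a pipeline fit the vectorizer on the training part only. The corpus, the variable names, and the 25% test share are assumptions for illustration, not the notebook's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus of already-cleaned reviews (illustrative only)
texts = ['good great love', 'great wonder film', 'love wonder stori', 'good fun movi',
         'bad terribl wast', 'wast bore film', 'bore bad stori', 'terribl dull movi']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(text_train, y_tr)

toy_preds = model.predict(text_test)
print(len(text_train), len(text_test), toy_preds)
```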

In [None]:
#print(x_train[0])
print(x_train.shape)

(49500, 2862811)


In [None]:
clf = MultinomialNB()
clf.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
preds = clf.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, preds)

array([[236,  17],
       [ 27, 220]])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

0.912
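
The accuracy can be checked by hand from the confusion matrix reported above, and the same four cells also give precision and recall for the positive class. The sketch below uses plain Python and assumes the scikit-learn convention that rows are true labels and columns are predictions, with classes ordered 0 (negative) then 1 (positive):

```python
# Confusion matrix as reported above: rows = true labels, columns = predictions
cm = [[236, 17],
      [27, 220]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(total, accuracy)
print(round(precision, 3), round(recall, 3))
```

The total of 500 matches the 1% hold-out of 50,000 reviews, and (236 + 220) / 500 reproduces the 0.912 accuracy.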