<a href="https://colab.research.google.com/github/saloniasrani/sentimentanalysis/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Mounting Google Drive in Colab to access the CSV file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
from nltk.corpus import stopwords

Importing the dataset into a pandas DataFrame

In [None]:
train_document = pd.read_csv("/content/drive/MyDrive/SentimentAnalysis/movie.csv")

Downloading the required NLTK resources

In [21]:
import nltk
nltk.download('maxent_treebank_pos_tagger')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Building a list of (review, sentiment) tuples, one per row of the dataset

In [None]:
documents = list(zip(train_document.iloc[:, 0], train_document.iloc[:, 1]))
documents[0]

('I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.',
 0)

In [38]:
from nltk import word_tokenize
# Split each review string into a list of word tokens
documents = [(word_tokenize(review), category) for review, category in documents]

In [None]:
documents[:5]

In [40]:
from nltk.corpus import wordnet
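def get_simple_pos(tag):
  """Map a Penn Treebank POS tag to the WordNet POS constant the lemmatizer expects."""
  if tag.startswith('J'):
    return wordnet.ADJ
  elif tag.startswith('V'):
    return wordnet.VERB
  elif tag.startswith('N'):
    return wordnet.NOUN
  elif tag.startswith('R'):
    return wordnet.ADV
  else:
    # Any other tag falls back to noun, the lemmatizer's default
    return wordnet.NOUN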
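
To make the mapping concrete: the WordNet POS constants are single-character strings (`'a'`, `'v'`, `'n'`, `'r'`), so the same logic can be sketched without NLTK. The name `simple_pos`, the lookup table, and the sample tags are illustrative and not part of the notebook.

```python
# Stand-alone sketch of the tag mapping; the letters mirror
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def simple_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", simple_pos(tag))
```

A preposition tag such as `IN` has no WordNet counterpart, so it falls through to the noun default, just as in `get_simple_pos`.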

  

Creating a set of stop words and punctuation marks

In [None]:
import string
# A set gives constant-time membership checks in clean_review
stops = set(stopwords.words('english'))
stops.update(string.punctuation)
stops

Cleaning the token lists: removing stop words and lemmatizing the remaining words

In [42]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [43]:
from nltk import pos_tag
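def clean_review(words):
  """Drop stop words and punctuation, then lemmatize and lowercase the rest."""
  output_words = []
  for w in words:
    if w.lower() not in stops:
      # Each token is tagged on its own, so the tagger sees no sentence context
      pos = pos_tag([w])
      clean_word = lemmatizer.lemmatize(w, pos=get_simple_pos(pos[0][1]))
      output_words.append(clean_word.lower())
  return output_words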

In [45]:
documents = [(clean_review(words), category) for words, category in documents]

Splitting the dataset into training and testing sets

In [46]:
import random
random.shuffle(documents)


In [47]:
n=len(documents)
n

40000

In [48]:
n = int(.75*n)

In [49]:
training_documents = documents[0:n]
testing_documents = documents[n:]
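
The shuffle above is unseeded, so every run yields a different split and a slightly different accuracy. A minimal sketch of a reproducible variant follows. The helper `split_documents`, the seed value, and the toy data are assumptions made for illustration and are not part of the notebook.

```python
import random

def split_documents(docs, train_fraction=0.75, seed=42):
    """Shuffle a copy of docs with a fixed seed, then cut it in two."""
    shuffled = list(docs)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, same order on every run
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy (tokens, label) pairs standing in for the cleaned reviews
toy_docs = [(["good"], 1), (["bad"], 0), (["fine"], 1), (["awful"], 0)]
train, test = split_documents(toy_docs)
print(len(train), len(test))  # 3 1
```

With the 40,000 reviews used here, the same 75% cut gives 30,000 training and 10,000 testing documents.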

Collecting every word from the training documents and keeping the 5,000 most frequent ones as features

In [50]:
all_words = []
for doc in training_documents:
  all_words+=doc[0]

In [51]:
freq = nltk.FreqDist(all_words)
common = freq.most_common(5000)
features = [i[0] for i in common]


In [None]:
features
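
`nltk.FreqDist` is a subclass of `collections.Counter`, so the selection step can be tried on toy data with the standard library alone. The reviews below are made up for illustration.

```python
from collections import Counter

# Two tiny tokenized "reviews" standing in for the training documents
toy_reviews = [["great", "movie", "great", "cast"], ["bad", "movie"]]

# Count every token across all reviews, then keep the most frequent ones
counts = Counter(word for review in toy_reviews for word in review)
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)  # ['great', 'movie']
```

Words with equal counts come back in the order they were first seen, so the contents of the cutoff position can depend on document order.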

**Making a feature dictionary for each review**

In [53]:
def get_feature_dict(words):
  current_features = {}
  word_set = set(words)
  for w in features:
    current_features[w] = w in word_set
  return current_features


In [None]:
get_feature_dict(training_documents[0][0])
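
For a sense of what the classifier receives, here is the same idea on a three-word vocabulary. The function `feature_dict`, the vocabulary, and the sample review are illustrative stand-ins for `get_feature_dict` and the 5,000-word `features` list.

```python
def feature_dict(words, vocabulary):
    """Mark each vocabulary word as present (True) or absent (False) in words."""
    word_set = set(words)  # set lookup keeps each membership check cheap
    return {w: (w in word_set) for w in vocabulary}

vocabulary = ["great", "boring", "plot"]
print(feature_dict(["great", "plot", "twist"], vocabulary))
# {'great': True, 'boring': False, 'plot': True}
```

Words outside the vocabulary, such as `twist` here, are ignored, and every review maps to a dictionary with the same keys.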

**Model Training and Testing**

In [55]:
training_data = [(get_feature_dict(doc), category) for doc, category in training_documents]

In [56]:
testing_data = [(get_feature_dict(doc), category) for doc, category in testing_documents]

In [58]:
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(training_data)

In [59]:
nltk.classify.accuracy(classifier,testing_data)

0.8534
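
The accuracy is the fraction of test reviews whose predicted label matches the true label. With a 10,000-review test set, 0.8534 corresponds to 8,534 correct predictions. A minimal sketch of that calculation, using a hypothetical one-rule predictor in place of the trained classifier:

```python
def accuracy(predict, labelled_examples):
    """Return the share of examples whose predicted label equals the true label."""
    correct = sum(1 for features, label in labelled_examples
                  if predict(features) == label)
    return correct / len(labelled_examples)

def toy_predict(features):
    # Hypothetical rule: call a review negative (0) if it contains "worst"
    return 0 if features.get("worst") else 1

examples = [
    ({"worst": True}, 0),
    ({"worst": False}, 1),
    ({"worst": False}, 0),  # misclassified: negative review without "worst"
    ({"worst": True}, 0),
]
print(accuracy(toy_predict, examples))  # 0.75
```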

In [60]:
classifier.show_most_informative_features(15)

Most Informative Features
                    2/10 = True                0 : 1      =     88.6 : 1.0
                    3/10 = True                0 : 1      =     53.1 : 1.0
                    1/10 = True                0 : 1      =     38.0 : 1.0
                     uwe = True                0 : 1      =     36.2 : 1.0
                    4/10 = True                0 : 1      =     35.9 : 1.0
                    boll = True                0 : 1      =     35.5 : 1.0
                    7/10 = True                1 : 0      =     29.7 : 1.0
                    8/10 = True                1 : 0      =     26.5 : 1.0
                   worst = True                0 : 1      =     19.7 : 1.0
                   mst3k = True                0 : 1      =     19.2 : 1.0
                 stinker = True                0 : 1      =     17.7 : 1.0
              incoherent = True                0 : 1      =     16.5 : 1.0
                    9/10 = True                1 : 0      =     14.9 : 1.0