<a href="https://colab.research.google.com/github/turtlenoise/simple_sentiment_analysis/blob/master/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
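import thinc.extra.datasets

# imdb() returns a pair of datasets; take the second one,
# a list of (review_text, label) tuples.
imdb_data = thinc.extra.datasets.imdb()

# Despite the name, each entry of `sentences` is a whole review.
sentences, labels = zip(*imdb_data[1])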

In [53]:
print(labels[0], sentences[0])

0 This movie can best be described as a very long episode of a very bad sitcom. How many vaguely humorous misunderstandings can you cram into just one movie? Notes are misplaced, bags are switched, conversations are misheard, people get mixed up, situations are misinterpreted, and somewhere along the line people are supposed to laugh about something. The writers are really struggling to keep everything going, which makes the dialogues feel really forced. If anyone in this movie acted like a real person all this would be resolved in around two minutes or so and everyone could go back to their lives, but they have to keep the misunderstandings going. At times this movie also tries to go for some juvenile laughs, but all those do is remember you about how funny "American Pie" was. The scene with the nerd telling the hooker (who he thinks is a foreign exchange student) to "eat his sausage" goes on forever, not one second of it is funny. I've got to give this movie some credit though: becau

In [54]:
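# Concatenate the negative reviews (label 0) found at indices 1..1499.
# range(1, 1500) leaves out the review at index 0.
negative_text = ""
for i in range(1, 1500):
  if labels[i] == 0:
    negative_text = negative_text + sentences[i]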

In [60]:
import spacy
nlp = spacy.load('en_core_web_sm')
corpus = nlp(negative_text)

In [64]:
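def has_negative_children(token):
  """Return True if any dependency child of `token` is a negation (dep 'neg')."""
  return any(child.dep_ == 'neg' for child in token.children)

# Count lemmas of non-stop-word tokens in the negative corpus.
# Note: the else branch also runs for a negated token whose lemma is already
# counted, which resets that lemma's count to 1.
negative_word_frequency = dict()
for sent in corpus.sents:
    for token in sent:
        if not token.is_stop:
          if (token.lemma_ in negative_word_frequency) and not has_negative_children(token):
              negative_word_frequency[token.lemma_] += 1
          else:
              negative_word_frequency[token.lemma_] = 1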
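
The tally above can be tried out without loading spaCy. The block below is a minimal sketch that uses `collections.Counter` on hand-made `(lemma, is_stop, is_negated)` tuples. The tuple layout and the name `count_lemmas` are assumptions made for illustration, and unlike the cell above this variant simply skips negated occurrences instead of resetting their count.

```python
from collections import Counter

def count_lemmas(tokens):
    """Count lemmas, ignoring stop words and negated occurrences.

    `tokens` is an iterable of (lemma, is_stop, is_negated) tuples, a
    hypothetical stand-in for the spaCy tokens used in the notebook.
    """
    counts = Counter()
    for lemma, is_stop, is_negated in tokens:
        if is_stop or is_negated:
            continue  # stop words and negated words do not contribute
        counts[lemma] += 1
    return counts

toy_tokens = [
    ("bad", False, False),
    ("the", True, False),    # stop word: skipped
    ("bad", False, True),    # e.g. "not bad": skipped
    ("bad", False, False),
    ("funny", False, False),
]
print(count_lemmas(toy_tokens))
```

Skipping negated tokens keeps the counts monotonic, so a single "not bad" cannot erase earlier occurrences of "bad".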

In [65]:
# Sort by count (ascending), then drop the 7000 rarest entries
# and the 100 most frequent ones.
cutRareWords = sorted(negative_word_frequency.items(), key=lambda item: item[1])[7000:-100]
negative_word_frequency = dict(cutRareWords)
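
The slice above trims a sorted frequency table at both ends. As a small sketch of that idea on toy data, the helper below takes the two cut sizes as parameters. The name `trim_by_rank` and the toy counts are hypothetical. It computes the end index explicitly because a slice such as `[n:-0]` would return an empty list.

```python
def trim_by_rank(freq, drop_rarest, drop_most_common):
    """Sort by count (ascending) and drop entries from both ends."""
    ranked = sorted(freq.items(), key=lambda item: item[1])
    end = len(ranked) - drop_most_common  # explicit end index, safe for 0
    return dict(ranked[drop_rarest:end])

toy_freq = {"the": 50, "bad": 9, "plot": 7, "yucks": 1, "zzz": 1}
trimmed = trim_by_rank(toy_freq, drop_rarest=2, drop_most_common=1)
print(trimmed)
```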

In [66]:
negative_word_frequency['bad']

41

In [67]:
negative_word_frequency['good']

44

In [70]:
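# Concatenate the positive reviews (label 1) found at indices 1..1479.
# The upper bound differs from the 1500 used for the negative reviews.
positive_text = ""
for i in range(1, 1480):
  if labels[i] == 1:
    positive_text = positive_text + sentences[i]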

In [71]:
corpus = nlp(positive_text)

In [79]:
# Count lemmas of non-stop-word tokens in the positive corpus,
# using the same counting scheme as for the negative corpus.
positive_word_frequency = dict()
for sent in corpus.sents:
    for token in sent:
        if not token.is_stop:
          if (token.lemma_ in positive_word_frequency) and not has_negative_children(token):
              positive_word_frequency[token.lemma_] += 1
          else:
              positive_word_frequency[token.lemma_] = 1


In [80]:
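# Sort by count (ascending) and drop the 2000 rarest entries;
# nothing is cut from the most frequent end here.
cutRareWords = sorted(positive_word_frequency.items(), key=lambda item: item[1])[2000:]
positive_word_frequency = dict(cutRareWords)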

In [81]:
positive_word_frequency['good']

357

In [82]:
positive_word_frequency['bad']

12

In [83]:
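# Start from the raw negative counts. These are placeholders: entries that
# are still above 1 after the next loop are reset to 0 in the last loop.
word_values = dict()
for word in negative_word_frequency:
  if (word not in word_values):
    word_values[word] = negative_word_frequency[word]

# Words seen in both corpora get the positive share of their occurrences;
# words seen only in the positive corpus get 1.
for word in positive_word_frequency:
  if (word in word_values):
    word_values[word] = positive_word_frequency[word] / (positive_word_frequency[word] + negative_word_frequency[word])
  else:
    word_values[word] = 1

# A value above 1 is a leftover raw count, i.e. a word seen only in the
# negative corpus: give it 0.
for word in negative_word_frequency:
    if (word_values[word] > 1):
      word_values[word] = 0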
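
The value assigned to a word seen in both corpora is its positive share, `pos / (pos + neg)`. The sketch below computes that share in a single pass over toy dictionaries. The function names and the toy entries other than the counts for "good" (357 and 44, taken from the cells above) are assumptions. Note that this variant gives every negative-only word 0, including one with a count of exactly 1, which the three-loop version above would leave at 1.

```python
def positive_ratio(pos_count, neg_count):
    """Share of a word's occurrences that come from positive reviews."""
    return pos_count / (pos_count + neg_count)

def build_word_values(pos_freq, neg_freq):
    """Map each word to its positive share; missing counts are treated as 0."""
    values = {}
    for word in set(pos_freq) | set(neg_freq):
        pos = pos_freq.get(word, 0)
        neg = neg_freq.get(word, 0)
        values[word] = positive_ratio(pos, neg)
    return values

# Toy data: only the "good" counts come from the notebook outputs.
toy_pos = {"good": 357, "great": 20}
toy_neg = {"good": 44, "lousy": 5}
toy_values = build_word_values(toy_pos, toy_neg)
print(toy_values["good"], toy_values["great"], toy_values["lousy"])
```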


In [84]:
word_values['good']

0.8902743142144638

In [85]:
word_values['lousy']

0

In [86]:
word_values['blue']

0.2

In [108]:
import re

def get_score(example_sentence, pos_val, neg_val):
  """Average sentiment of the adjectives and adverbs in `example_sentence`.

  Each known ADJ/ADV lemma votes `pos_val` if its word_values entry is at
  least 0.5 and `neg_val` otherwise; 0.5 is returned when nothing votes.
  """
  example_sentence = re.sub(r'[^\w\s]', '', example_sentence)  # strip punctuation
  corpus = nlp(example_sentence)
  total = 0  # named `total` to avoid shadowing the built-in sum()
  count = 0
  for sent in corpus.sents:
      for token in sent:

        if (token.lemma_ in word_values) and (token.pos_ in ('ADJ', 'ADV')):
          if word_values[token.lemma_] < 0.5:
            value = neg_val
          else:
            value = pos_val
          total = total + 10*value
          count = count + 10
        # Alternative: let every known lemma vote 0 or 1, whatever its part of speech.
        # if (token.lemma_ in word_values):
        #   if (word_values[token.lemma_]) < 0.5:
        #     value = 0
        #   else:
        #     value = 1
        #   total = total + value
        #   count = count + 1

  if count == 0:
    return 0.5
  return total/count


In [93]:
example_sentence = "Best movie I have ever seen, amazing job, the director is a genius."
# get_score requires pos_val and neg_val; 0.68 and 0.3 are assumed values.
get_score(example_sentence, 0.68, 0.3)

0.68

In [94]:
example_sentence = "Scary, boring, totally disgusting, yucks."
# get_score requires pos_val and neg_val; 0.68 and 0.3 are assumed values.
get_score(example_sentence, 0.68, 0.3)

0.3

In [95]:
example_sentence = "Best movie I have ever seen although it had some scary, boring and disgusting scenes."
# get_score requires pos_val and neg_val; 0.68 and 0.3 are assumed values.
get_score(example_sentence, 0.68, 0.3)

0.49000000000000005

In [116]:
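def cross_validate():
  """Grid-search pos_val and neg_val on reviews 1500..2999.

  Returns the best accuracy (in percent) and the pair that produced it.
  """
  neg_vals = [0.09,0.1,0.11]
  pos_vals = [0.67,0.68,0.69]

  highest_accuracy = 0
  highest_pos_val = 0
  highest_neg_val = 0

  for pos_val in pos_vals:
    for neg_val in neg_vals:
      correct = 0
      total = 0  # named `total` to avoid shadowing the built-in all()
      for i in range(1500,3000):
        value = get_score(sentences[i], pos_val, neg_val)
        # A score of exactly 0.5 counts as a miss for either label.
        if (labels[i] == 0) and (value < 0.5):
            correct = correct + 1
        if (labels[i] == 1) and (value > 0.5):
            correct = correct + 1
        total = total + 1

      # Compare inside the loops so that every (pos_val, neg_val) pair is
      # scored, not only the last one.
      accuracy = correct/total
      if (accuracy*100 > highest_accuracy):
        highest_accuracy = accuracy*100
        highest_pos_val = pos_val
        highest_neg_val = neg_val
  return highest_accuracy,highest_pos_val,highest_neg_val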
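
The search over candidate values can also be written generically with `itertools.product`. The sketch below evaluates every `(pos_val, neg_val)` pair with a caller-supplied scoring function and keeps the best one. The name `grid_search`, its parameters, and the toy scorer are assumptions for illustration. It reports accuracy as a fraction rather than a percentage, and like the cell above it counts a score equal to the threshold as a miss.

```python
from itertools import product

def grid_search(score_fn, examples, labels, pos_vals, neg_vals, threshold=0.5):
    """Return (best_accuracy, best_pos_val, best_neg_val) over all pairs."""
    best = (-1.0, None, None)
    for pos_val, neg_val in product(pos_vals, neg_vals):
        correct = 0
        for example, label in zip(examples, labels):
            value = score_fn(example, pos_val, neg_val)
            if (label == 0 and value < threshold) or (label == 1 and value > threshold):
                correct += 1
        accuracy = correct / len(examples)
        if accuracy > best[0]:  # evaluated once per pair
            best = (accuracy, pos_val, neg_val)
    return best

def toy_score(example, pos_val, neg_val):
    """Hypothetical scorer: 'good' gets pos_val, anything else neg_val."""
    return pos_val if example == "good" else neg_val

best = grid_search(toy_score, ["good", "bad"], [1, 0], pos_vals=[0.4, 0.6], neg_vals=[0.1])
print(best)
```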


In [117]:
accuracy,pos_val,neg_val = cross_validate()
accuracy

70.33333333333334

In [118]:
pos_val

0.69

In [119]:
neg_val

0.11