<a href="https://colab.research.google.com/github/srushti1hub/MLH-init-MLtrack/blob/main/Natural-Language-Processing-and-Rules-Based-Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# A corpus is a collection of documents; each document is an individual string that we can operate on.
corpus = [
  "I love pineapple on pizza. I think it’s good!",
  "Pineapple on pizza is so bad.",
  "I HATE this pineapple-on-pizza trend.",
  "I am loving this big pizza I got with pineapple on it.",
]
corpus

['I love pineapple on pizza. I think it’s good!',
 'Pineapple on pizza is so bad.',
 'I HATE this pineapple-on-pizza trend.',
 'I am loving this big pizza I got with pineapple on it.']

In [2]:
# Here's a very simple kind of rule based model--we can pick some "good" words
# and some "bad" words, and check to see if there are more good or bad words
# in a sentence.
good_tokens = ["good", "love"]
bad_tokens = ["bad", "hate"]

# We'll define a "predict" function that takes a document, and lists of good
# and bad tokens.
def predict(document, preprocess=None, good_tokens=good_tokens, bad_tokens=bad_tokens):
    # We'll start with "neutral"
    sentiment = 0
    # Our default preprocessing will be just splitting along spaces, but
    # we can pass in a custom preprocess script later if we want
    if preprocess == None:
      tokens = document.split()
    else:
      tokens = preprocess(document)

    # We loop through all the tokens in our document
    for token in tokens:
        # If the token is one of our "good" ones, we'll add 1 to the sentiment
        if token in good_tokens:
            sentiment += 1
        # If the token is one of our "bad" ones, we'll subtract 1 from the sentiment
        if token in bad_tokens:
            sentiment -= 1
    

    if sentiment > 0:
      # By the end if our sentiment is greater than 0, this means we have more
      # positive words, so we'll say the sentiment is positive.
      return 'positive'
    elif sentiment < 0:
      # If it's less than 0, this means we have more negative words, so we'll
      # say the sentiment is negative
      return 'negative'
    else:
      # Otherwise, we're at 0, so we'll say it's neutral.
      return 'neutral'

for document in corpus:
  prediction = predict(document)
  print(f'Document: {document}\nPrediction: {prediction}\n\n')

Document: I love pineapple on pizza. I think it’s good!
Prediction: positive


Document: Pineapple on pizza is so bad.
Prediction: neutral


Document: I HATE this pineapple-on-pizza trend.
Prediction: neutral


Document: I am loving this big pizza I got with pineapple on it.
Prediction: neutral




So we got 1 of our 4 documents right--the first one does have positive sentiment. But the other 3 had problems. Why?

In [3]:
# Tokenizing is kind of a hard problem.
# We can think of additional rules (like check for punctuation),
# but there is a really good tokenizer we can just borrow from nltk!
import nltk
nltk.download('punkt')

nltk.word_tokenize(corpus[0])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I',
 'love',
 'pineapple',
 'on',
 'pizza',
 '.',
 'I',
 'think',
 'it',
 '’',
 's',
 'good',
 '!']

In [4]:
# We can use word_tokenize as our preprocess function, and re-run our predictions.
for document in corpus:
  prediction = predict(document, preprocess=nltk.word_tokenize)
  print(f'Document: {document}\nPrediction: {prediction}\n\n')

Document: I love pineapple on pizza. I think it’s good!
Prediction: positive


Document: Pineapple on pizza is so bad.
Prediction: negative


Document: I HATE this pineapple-on-pizza trend.
Prediction: neutral


Document: I am loving this big pizza I got with pineapple on it.
Prediction: neutral




In [5]:
# We got another one right!
# We're getting document #3 wrong because it's all caps,
# even though it's a word we have in out list.
# This is easy, lets just make all the words lower case.

def normalize_caps_and_tokenize(document):
  return nltk.word_tokenize(document.lower())

for document in corpus:
  prediction = predict(document, preprocess=normalize_caps_and_tokenize)
  print(f'Document: {document}\nPrediction: {prediction}\n\n')

Document: I love pineapple on pizza. I think it’s good!
Prediction: positive


Document: Pineapple on pizza is so bad.
Prediction: negative


Document: I HATE this pineapple-on-pizza trend.
Prediction: negative


Document: I am loving this big pizza I got with pineapple on it.
Prediction: neutral




In [9]:
# We got another one right!
# Example #4 is tricky, because it doesn't actually have one of our words in it.
# BUT it does have the word `hated`--which can work for us if we use something
# called a "lemma"--the "standard" version of a word

from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from collections import defaultdict
lemmatizer = WordNetLemmatizer()

doc = "I was having a better time."

print(f'doc: {doc}')

tokens = nltk.word_tokenize(doc)

print(f'tokens: {tokens}')

tagged_tokens = nltk.pos_tag(tokens)

print(f'tagged_tokens: {tagged_tokens}')

lemmas = []

# the lemmatizer only works with a few different parts of speech,
# and unfortunately it uses different labels for it's POS, so we have to convert.
tag_map = defaultdict(lambda : wordnet.NOUN)
tag_map['J'] = wordnet.ADJ
tag_map['V'] = wordnet.VERB
tag_map['R'] = wordnet.ADV

for token, pos in tagged_tokens:
  lemmatizer_tag = tag_map[pos[0]]
  lemma = lemmatizer.lemmatize(token, pos=lemmatizer_tag)
  lemmas.append(lemma)

print(f'lemmas: {lemmas}')

doc: I was having a better time.
tokens: ['I', 'was', 'having', 'a', 'better', 'time', '.']
tagged_tokens: [('I', 'PRP'), ('was', 'VBD'), ('having', 'VBG'), ('a', 'DT'), ('better', 'JJR'), ('time', 'NN'), ('.', '.')]
lemmas: ['I', 'be', 'have', 'a', 'good', 'time', '.']


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [10]:
def preprocess_with_lemmatization(document):
  tokens = nltk.word_tokenize(document.lower())
  tagged_tokens = nltk.pos_tag(tokens)
  lemmas = []
  for token, pos in tagged_tokens:
    lemmatizer_tag = tag_map[pos[0]]
    lemma = lemmatizer.lemmatize(token, pos=lemmatizer_tag)
    lemmas.append(lemma)
  return lemmas

for document in corpus:
  prediction = predict(document, preprocess=preprocess_with_lemmatization)
  print(f'Document: {document}\nPrediction: {prediction}\n\n')

Document: I love pineapple on pizza. I think it’s good!
Prediction: positive


Document: Pineapple on pizza is so bad.
Prediction: negative


Document: I HATE this pineapple-on-pizza trend.
Prediction: negative


Document: I am loving this big pizza I got with pineapple on it.
Prediction: positive




In [None]:
# So far we've looked at a few toy examples, but we've learned a lot.
# To recap:
# - Corpus: a group of documents
# - Document: a single peice of text we want to evaluate
# - Token: a single "word" in a document
# - Tokenizing: the process of turning a document into tokens--we can do it
#     ourselves, but a lot of times premade tokenizers work well
# - Case Normalization: making everything upper or lower case. Sometimes this
#     gets rid of important information, but in a lot of cases it can help.
# - Lemma: the "standard" version of a word (e.g. for "loving" it's "love")

# But how would our model work on a "real" dataset? Lets download some Amazon
# food reviews and try it out!

In [11]:
!pip install gdown
import pandas as pd
import gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
url = 'https://drive.google.com/uc?id=1-WZKE5xHw-3m_SL_PtOgwkzdFROIWqih'
output = 'reviews.csv'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-WZKE5xHw-3m_SL_PtOgwkzdFROIWqih
To: /content/reviews.csv
100%|██████████| 301M/301M [00:04<00:00, 70.4MB/s]


'reviews.csv'

In [13]:
# We'll grab the first 1000 records to play with.
df = pd.read_csv(output)
df = df[df['Score'] > 0][:100]
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [14]:
# We can update our predict function to predict the number of stars we think the review would give
def predict_stars(document, preprocess=None, good_tokens=good_tokens, bad_tokens=bad_tokens):
    # We'll start with "neutral"
    sentiment = 0
    # Our default preprocessing will be just splitting along spaces, but
    # we can pass in a custom preprocess script later if we want
    if preprocess == None:
      tokens = document.split()
    else:
      tokens = preprocess(document)

    # We loop through all the tokens in our document
    for token in tokens:
        # If the token is one of our "good" ones, we'll add 1 to the sentiment
        if token in good_tokens:
            sentiment += 1
        # If the token is one of our "bad" ones, we'll subtract 1 from the sentiment
        if token in bad_tokens:
            sentiment -= 1
    
    if sentiment > 1:
      return 5
    elif sentiment == 1:
      return 4
    elif sentiment == 0:
      return 3
    elif sentiment == -1:
      return 2
    else:
      return 1

df['Prediction'] = df['Text'].apply(lambda text: predict_stars(text, preprocess=preprocess_with_lemmatization))

In [15]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Prediction
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,5
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,3
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,3
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,4
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,3


In [16]:
# Now we can compute the accuracy of our predictions
correct_cases = sum(df['Prediction'] == df['Score'])
total_cases = len(df['Prediction'])
accuracy = correct_cases / total_cases
print(accuracy)

0.22


In [17]:
# Is this a good accuracy or a bad accuracy?
# That really depends on the context. In this case, there are 3 different categories
# (positive, negative, neutral), so if we were randomly guessing, we would
# expect to see an accuracy of about 1 / 5, or ~22%
# So in comparison the accuracy we did better than guessing.
# But we're not even getting half of the predictions right.
# How could we make it better?


In [18]:
# One option is to get a bigger list of positive an negative words.
more_bad_tokens = ['bad', 'poor', 'terrible' 'hate', 'dissapointed']
more_good_tokens = ['good', 'great', 'wonderful', 'love', 'impressed']
final_predict_model = lambda document: predict_stars(document, preprocess=preprocess_with_lemmatization, good_tokens=more_good_tokens, bad_tokens=more_bad_tokens)
df['Prediction'] = df['Text'].apply(final_predict_model)
df.head()


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Prediction
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,5
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,3
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,3
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,4
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,5


In [None]:
# Now we can compute the accuracy of our predictions
correct_cases = sum(df['Prediction'] == df['Score'])
total_cases = len(df['Prediction'])
accuracy = correct_cases / total_cases
print(accuracy)

0.33


In [None]:
# We did even better! Almost % accuracy.
# We could continue adding good and bad tokens, but this is tedious,
# and we'd likely make a lot of mistakes.
# Next time we'll take a look at how machine learning can make this process easier.