<a href="https://colab.research.google.com/github/sohammistri/CMU-CS-11-711-anlp/blob/main/Lec01_RuleBasedSentimentAnalyser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rule-based sentiment analysis

This is my playground for the sentiment analysis code given by Prof. Graham Neubig. The original code is available [here](https://github.com/neubig/anlp-code/tree/main/01-rulebasedclassifier).

## Import data

The datasets are available and described [here](https://github.com/neubig/anlp-code/tree/main/data).

In [1]:
!git clone https://github.com/neubig/anlp-code.git

Cloning into 'anlp-code'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 52 (delta 15), reused 47 (delta 11), pack-reused 0
Unpacking objects: 100% (52/52), 2.45 MiB | 6.81 MiB/s, done.


In [2]:
DATA_ROOT = "/content/anlp-code/data/sst-sentiment-text-threeclass"

## Replicating the approach followed in class

Here I replicate the positive/negative word-frequency approach followed in class, with a few modifications.

In [3]:
def get_score(sent, pos_words, neg_words, weights, preprocess=None):
  """
  @param sent: The input sentence
  @param pos_words: List of positive words
  @param neg_words: List of negative words
  @param weights: Weights [pos, neg, bias] applied to the word counts
  @param preprocess: Optional function applied to the sentence first (default None)
  @return: +1 if positive, -1 if negative, 0 if neutral
  """

  if preprocess is not None:
    try:
      sent = preprocess(sent)
    except Exception:
      # If preprocessing fails, fall back to an empty sentence (score = bias)
      sent = ""

  # counts[0]: positive-word matches, counts[1]: negative-word matches
  counts = [0,0]

  split_sent = sent.split(" ")
  for word in split_sent:
    if word in pos_words:
      # print("+++{}+++".format(word))
      counts[0]+=1
    if word in neg_words:
      # print("---{}---".format(word))
      counts[1]+=1

  # Linear score; its sign decides the class
  score = counts[0]*weights[0]+counts[1]*weights[1]+weights[2]

  if score>0:
    return 1
  elif score<0:
    return -1
  else:
    return 0

In [4]:
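def get_accuracy(X, y, pos_words, neg_words, weights, preprocess=None):
  """
  Print, for each label, the gold count and the predicted count,
  then return the accuracy as a percentage.
  """
  pred = [get_score(sent, pos_words, neg_words, weights, preprocess) for sent in X]
  correct, wrong = 0,0
  for i in range(len(y)):
    if pred[i]==y[i]:
      correct+=1
    else:
      wrong+=1

  # Label distribution in the gold data and in the predictions
  ground_dict = {0:0, 1:0, -1:0}
  pred_dict = {0:0, 1:0, -1:0}

  for i in y:
    ground_dict[i]+=1
  for i in pred:
    pred_dict[i]+=1

  # Each printed row: label, gold count, predicted count
  for i in [-1,0,1]:
    print(i, ground_dict[i], pred_dict[i])

  return (correct/(correct+wrong))*100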
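
The function above prints only per-class totals for the gold and predicted labels. A minimal sketch of a confusion count, which also shows which gold class each prediction came from, could look like this. `confusion_counts` and the toy labels are illustrative and not part of the original code.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (gold, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))

# Toy labels for illustration only (not taken from the dataset)
y_true = [1, 1, -1, 0, -1]
y_pred = [1, 0, -1, 0, 1]

conf = confusion_counts(y_true, y_pred)
# One row per gold label; columns are predicted -1, 0, 1
for gold in (-1, 0, 1):
    row = [conf[(gold, pred)] for pred in (-1, 0, 1)]
    print(gold, row)
```

Each row sums to the gold count for that label, so the per-class totals printed by `get_accuracy` can still be recovered from it.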

In [5]:
def get_sent_list(path):
  """Read a file of 'label ||| sentence' lines into sentences X and labels y."""
  X,y = [],[]
  with open(path) as f:
    for line in f:
      label, sent = line.split("|||")
      label = int(label.strip())
      sent = sent.strip()
      X.append(sent)
      y.append(label)

  return X,y
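
To make the file layout concrete, here is a small sketch of parsing a single line. The helper name `parse_line` and the example line (including its label) are made up for illustration; the only assumption carried over from the loader above is the `label ||| sentence` layout.

```python
def parse_line(line):
    """Split one 'label ||| sentence' line into (int label, sentence)."""
    # maxsplit=1 keeps any later '|||' inside the sentence text
    label, sent = line.split("|||", 1)
    return int(label.strip()), sent.strip()

# Illustrative line in the assumed 'label ||| sentence' layout
example = "1 ||| It 's a lovely film with lovely performances by Buy and Accorsi .\n"
print(parse_line(example))
```

Passing `1` as the second argument to `split` is a small robustness choice: a sentence that itself contains `|||` would otherwise make the two-way unpacking fail.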

In [6]:
import os

X_train, y_train = get_sent_list(os.path.join(DATA_ROOT, "train.txt"))
X_dev, y_dev = get_sent_list(os.path.join(DATA_ROOT, "dev.txt"))
X_test, y_test = get_sent_list(os.path.join(DATA_ROOT, "test.txt"))

In [None]:
pos_words_1 = ["good", "love", "enjoy", "nice", "amazing"]
neg_words_1 = ["bad", "worst", "disappoint", "hate", "underwhelm"]
weights_1 = [1,1,0.5]

In [None]:
print(get_accuracy(X_train, y_train, pos_words_1, neg_words_1, weights_1))
print(get_accuracy(X_test, y_test, pos_words_1, neg_words_1, weights_1))

-1 3310 0
0 1624 0
1 3610 8544
42.25187265917603
-1 912 0
0 389 0
1 909 2210
41.13122171945701
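
The counts above show every sentence predicted as class 1, so the reported accuracy should simply be the share of positive gold labels. A quick check, using only the counts printed above (`always_positive_accuracy` is an illustrative helper):

```python
# Gold-label counts copied from the output above
train_counts = {-1: 3310, 0: 1624, 1: 3610}
test_counts = {-1: 912, 0: 389, 1: 909}

def always_positive_accuracy(counts):
    """Accuracy (in %) of a classifier that predicts +1 for every sentence."""
    return counts[1] / sum(counts.values()) * 100

print(always_positive_accuracy(train_counts))
print(always_positive_accuracy(test_counts))
```

Both values agree with the accuracies printed above, which confirms that this configuration behaves like an always-positive baseline.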


## Get more words from ChatGPT

Add more positive and negative words suggested by ChatGPT.

**Prompt used**: "I am building a rule based sentiment analyser in Python. Give me a list of positive and negative words to be detected."

In [None]:
pos_words_2 = [
    'happy',
    'great',
    'excellent',
    'wonderful',
    'amazing',
    'fantastic',
    'love',
    'awesome',
    'good',
    'fantastic',
    'incredible',
    'fabulous',
    'superb',
    'delightful',
    'charming'
]

neg_words_2 = [
    'sad',
    'bad',
    'terrible',
    'awful',
    'horrible',
    'dislike',
    'hate',
    'negative',
    'disappointing',
    'frustrating',
    'annoying',
    'unpleasant',
    'lousy',
    'pathetic',
    'dreadful'
]

weights_2 = [1,1,0.5]

In [None]:
print(get_accuracy(X_train, y_train, pos_words_2, neg_words_2, weights_2))
print(get_accuracy(X_test, y_test, pos_words_2, neg_words_2, weights_2))

-1 3310 0
0 1624 0
1 3610 8544
42.25187265917603
-1 912 0
0 389 0
1 909 2210
41.13122171945701


In [None]:
# Same word lists as pos_words_2 / neg_words_2; only the bias changes (0.5 -> 0)
pos_words_3 = list(pos_words_2)
neg_words_3 = list(neg_words_2)

weights_3 = [1,1,0]

In [None]:
print(get_accuracy(X_train, y_train, pos_words_3, neg_words_3, weights_3))
print(get_accuracy(X_test, y_test, pos_words_3, neg_words_3, weights_3))

-1 3310 0
0 1624 7584
1 3610 960
22.495318352059925
-1 912 0
0 389 1971
1 909 239
20.13574660633484
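
In the runs so far the weight on negative-word matches is +1, so a negative match pushes the score up and -1 is never predicted; the bias then decides what happens to sentences with no matches at all. A small sketch of the decision rule on raw counts makes this visible. `decide` is an illustrative helper, and the last weight triple simply flips the sign of the negative weight for comparison.

```python
def decide(pos_count, neg_count, weights):
    """Apply the sign rule to pos*w_pos + neg*w_neg + bias."""
    score = pos_count * weights[0] + neg_count * weights[1] + weights[2]
    return (score > 0) - (score < 0)  # +1, -1 or 0

# (pos_count, neg_count): no matches, one negative match, one positive match
cases = [(0, 0), (0, 1), (1, 0)]

for weights in ([1, 1, 0.5], [1, 1, 0], [1, -1, 0.5]):
    print(weights, [decide(p, n, weights) for p, n in cases])
# [1, 1, 0.5]  -> [1, 1, 1]
# [1, 1, 0]    -> [0, 1, 1]
# [1, -1, 0.5] -> [1, -1, 1]
```

Only the triple with a negative weight on negative matches can produce all three labels.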


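## Add preprocessing

So far almost every sentence is mapped to a single class (all positive with a bias of 0.5, mostly neutral with a bias of 0), so the word lists rarely match. Maybe preprocessing will help.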

In [7]:
!pip install pyspellchecker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspellchecker
  Downloading pyspellchecker-0.7.2-py3-none-any.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 34.0 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.7.2


## Big dicts (cells worth collapsing)

In [8]:
emotion_dict = {
    ":)": "happy",
    ":D": "happy",
    ":]": "happy",
    ":(": "sad",
    ":'(": "sad",
    ":'[": "sad",
    ":/": "confused",
    ":O": "surprised",
    ":*": "love",
    ":P": "playful",
    ";)": "winking",
    ":')": "tears of joy",
    "<3": "heart",
    ":@": "angry",
    ":$": "embarrassed",
    ":S": "confused",
    ":\\": "confused",
    ":#": "silence",
    ":'D": "laughing",
    "XD": "laughing",
    "X-D": "laughing",
    ":|": "disappointed",
    ":>": "smug",
    ":-)": "happy",
    ":-D": "happy",
    ":-]": "happy",
    ":-(": "sad",
    ":'-(": "sad",
    ":'-[": "sad",
    ":-/": "confused",
    ":-O": "surprised",
    ":-*": "love",
    ":-P": "playful",
    ";-)": "winking",
    ":'-)": "tears of joy",
    ":-@": "angry",
    ":-$": "embarrassed",
    ":-S": "confused",
    ":-\\": "confused",
    ":-#": "silence",
    ":'-D": "laughing",
    ":-|": "disappointed",
    ":->": "smug"
}


In [9]:
abbreviation_dict = {
    "lol": "laugh out loud",
    "omg": "oh my god",
    "brb": "be right back",
    "btw": "by the way",
    "idk": "I don't know",
    "jk": "just kidding",
    "tbh": "to be honest",
    "gtg": "got to go",
    "bff": "best friends forever",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "fyi": "for your information",
    "np": "no problem",
    "thx": "thanks",
    "yw": "you're welcome",
    "rofl": "rolling on the floor laughing",
    "afk": "away from keyboard",
    "irl": "in real life",
    "nvm": "never mind",
    "smh": "shaking my head",
    "omw": "on my way",
    "ikr": "I know, right?",
    "tmi": "too much information",
    "btwn": "between",
    "wtf": "what the f***",
    "ftw": "for the win",
    "im": "instant message",
    "dm": "direct message",
    "pls": "please",
    "sry": "sorry",
    "tho": "though",
    "wth": "what the heck",
    "oml": "oh my lord",
    "ic": "I see",
    "omd": "oh my days",
    "ama": "ask me anything",
    "hmu": "hit me up",
    "rn": "right now",
    "gg": "good game",
    "fyf": "for your future",
    "fomo": "fear of missing out",
    "lmk": "let me know",
    "nbd": "no big deal",
    "omgosh": "oh my gosh",
    "ttyl": "talk to you later",
    "yolo": "you only live once",
    "hth": "hope this helps",
    "lmao": "laughing my ass off",
    "lmfao": "laughing my f***ing ass off",
    "omfg": "oh my f***ing god",
    "idc": "I don't care",
    "idgaf": "I don't give a f***",
}

In [10]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from spellchecker import SpellChecker

nltk.download('punkt')
nltk.download('stopwords')

def lower_case(sent):
  return sent.lower()

def tokenize(sent):
  return nltk.word_tokenize(sent)

def rem_punc(sent):
  # Delete every ASCII punctuation character
  return sent.translate(str.maketrans("", "", string.punctuation))

def rem_stop_words(tokens):
  stop_words = set(stopwords.words('english'))
  return [word for word in tokens if word not in stop_words]

def rem_num_chars(tokens):
  # Drop tokens made up only of numeric characters
  return [word for word in tokens if not word.isnumeric()]

def stemm(tokens, lem_type="porter"):
  # lem_type="wordnet" lemmatizes (this needs the NLTK 'wordnet' corpus,
  # which is not downloaded above); any other value uses the Porter stemmer
  if lem_type=="wordnet":
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]
  else:
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]


def spell_check(tokens):
  spell = SpellChecker()

  corrected_tokens = []
  for word in tokens:
      # Note: spell.correction may return None when it finds no candidate,
      # which later steps such as stemm cannot handle
      corrected_word = spell.correction(word)
      corrected_tokens.append(corrected_word)

  return corrected_tokens

def replace_emoticons(sent):
  # Keys are case-sensitive plain substrings
  for emoticon, sentiment in emotion_dict.items():
    sent = sent.replace(emoticon, sentiment)
  return sent

def replace_abbr(sent):
  # Note: str.replace matches substrings, so an abbreviation inside a
  # longer word (e.g. "ic" in "music") is expanded too
  for abbreviation, expanded_form in abbreviation_dict.items():
    sent = sent.replace(abbreviation, expanded_form)
  return sent

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
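
Because `replace_abbr` uses plain substring replacement, short abbreviations such as "ic" or "rn" also fire inside ordinary words. One possible alternative, sketched here with the standard `re` module, is to match abbreviations only as whole words. The function name `replace_abbr_whole_words` and the three-entry table are illustrative; this is not what the notebook's results were computed with.

```python
import re

# Small illustrative subset of the abbreviation table
abbr = {"lol": "laugh out loud", "ic": "I see", "rn": "right now"}

# One pattern that matches any abbreviation only as a whole word;
# longer keys are listed first so they are tried before shorter ones
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(abbr, key=len, reverse=True)) + r")\b"
)

def replace_abbr_whole_words(sent):
    """Expand abbreviations without touching longer words that contain them."""
    return pattern.sub(lambda m: abbr[m.group(1)], sent)

print("music is fun lol".replace("ic", "I see"))     # substring replace
print(replace_abbr_whole_words("music is fun lol"))  # whole-word replace
```

The first line prints `musI see is fun lol`, while the whole-word version prints `music is fun laugh out loud`.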


## Restart

In [11]:
apply_order = [lower_case, replace_emoticons, replace_abbr, rem_punc,
               tokenize, rem_stop_words, rem_num_chars, spell_check, stemm]
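
The order of the steps matters. Emoticon keys such as ":D" or "XD" contain uppercase letters, so lowercasing first means they can no longer match. A minimal sketch with a two-entry table (the `emoticons` dict and the `run` helper are illustrative stand-ins for the notebook's own objects):

```python
# Two entries in the style of emotion_dict; the keys are case-sensitive
emoticons = {":D": "happy", "XD": "laughing"}

def lower_case(sent):
    return sent.lower()

def replace_emoticons(sent):
    for emoticon, word in emoticons.items():
        sent = sent.replace(emoticon, word)
    return sent

def run(sent, steps):
    """Apply each step in order, feeding the result to the next one."""
    for step in steps:
        sent = step(sent)
    return sent

text = "Great movie :D"
print(run(text, [lower_case, replace_emoticons]))  # great movie :d
print(run(text, [replace_emoticons, lower_case]))  # great movie happy
```

Replacing emoticons before lowercasing is one way to keep the uppercase keys usable; lowercasing the keys themselves would be another.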

In [12]:
sent = """It 's a lovely film with lovely performances by Buy and Accorsi ."""

for func in apply_order:
  sent = func(sent)

sent

['love', 'film', 'love', 'perform', 'buy', 'actor']
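
In the output above the spell checker has turned the proper noun "Accorsi" into "actor". A related edge case, assuming `SpellChecker.correction` can return `None` when it has no candidate (my understanding of recent pyspellchecker releases, worth checking against the installed version), is that a `None` token makes the stemmer fail. A hedged sketch that keeps the original token instead; `spell_check_safe` is a hypothetical helper and `toy_correct` is a stand-in so the example runs without the library.

```python
def spell_check_safe(tokens, correct):
    """Apply `correct` to each token, keeping the token when it returns None."""
    corrected = []
    for word in tokens:
        fixed = correct(word)
        corrected.append(word if fixed is None else fixed)
    return corrected

# Stand-in for spell.correction: a tiny lookup that returns None for unknown words
toy_corrections = {"lovly": "lovely", "film": "film"}

def toy_correct(word):
    return toy_corrections.get(word)

print(spell_check_safe(["lovly", "film", "accorsi"], toy_correct))
```

With the real library, the second argument would be the `correction` method of a `SpellChecker` instance.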

In [13]:
pos_words = [
    'happy',
    'great',
    'excellent',
    'wonderful',
    'amazing',
    'fantastic',
    'love',
    'awesome',
    'good',
    'fantastic',
    'incredible',
    'fabulous',
    'superb',
    'delightful',
    'charming'
]

neg_words = [
    'sad',
    'bad',
    'terrible',
    'awful',
    'horrible',
    'dislike',
    'hate',
    'negative',
    'disappointing',
    'frustrating',
    'annoying',
    'unpleasant',
    'lousy',
    'pathetic',
    'dreadful'
]

preprocess_words = [spell_check, stemm]

for func in preprocess_words:
  pos_words = func(pos_words)
  neg_words = func(neg_words)

In [14]:
pos_words

['happi',
 'great',
 'excel',
 'wonder',
 'amaz',
 'fantast',
 'love',
 'awesom',
 'good',
 'fantast',
 'incred',
 'fabul',
 'superb',
 'delight',
 'charm']

In [15]:
neg_words

['sad',
 'bad',
 'terribl',
 'aw',
 'horribl',
 'dislik',
 'hate',
 'neg',
 'disappoint',
 'frustrat',
 'annoy',
 'unpleas',
 'lousi',
 'pathet',
 'dread']

In [16]:
def preprocess(x):
  for func in apply_order:
    x = func(x)

  return " ".join(x)

In [23]:
weights = [1,-1,0.5]

In [22]:
print(get_accuracy(X_train, y_train, pos_words, neg_words, weights, preprocess))
print(get_accuracy(X_test, y_test, pos_words, neg_words, weights, preprocess))

-1 3310 219
0 1624 7808
1 3610 517
23.057116104868914
-1 912 52
0 389 2008
1 909 150
21.49321266968326


In [24]:
print(get_accuracy(X_train, y_train, pos_words, neg_words, weights, preprocess))
print(get_accuracy(X_test, y_test, pos_words, neg_words, weights, preprocess))

-1 3310 219
0 1624 0
1 3610 8325
43.445692883895134
-1 912 52
0 389 0
1 909 2158
42.262443438914026


In [None]:
test_sent = "Here 's yet another studio horror franchise mucking up its storyline with glitches casual fans could correct in their sleep ."

In [None]:
get_score(test_sent, pos_words, neg_words, weights, preprocess)

1

In [None]:
preprocess(test_sent)

'love film love perform buy actor'

In [None]:
test_sent = "It 's a lovely film with lovely performances by Buy and Accorsi ."