**CSE 5522 Lab #2: Sentiment Analysis**

The goals of this lab are to familiarize you with:

*   Naive Bayes
*   Binary Classification
*   Data exploration
*   Working with text-based data (Tweets)

**Initial notes**

* (If you are using Google Colab) Make a copy of this page in your Google Drive so that you can edit it.

* While not completely necessary for this assignment, you may want to familiarize yourself with the following packages: [numpy](https://numpy.org), [scikit-learn](https://scikit-learn.org), [pandas](https://pandas.pydata.org), [matplotlib](https://matplotlib.org).
 * Especially numpy: many of the calculations in this (and later) labs can be done in one line using numpy, whereas raw Python may require 5-10x as much code.

* Feel free to (please do!) change the structure of the document below. In particular, add code sections to break your code into logical pieces and add text sections to explain your code or results.

---
---

**Part 1: A Simple Bayes Net: Naive Bayes**

In class, we discussed how the conditional independences of a joint probability distribution get encoded by a Bayesian Network. One of the simplest forms of BN is the Naive Bayes model, which encodes a set of simple conditional independences:

- Given a single cause, all of the effects are independent of each other.
- Mathematically:
$P($*cause*$, $*effect*$_1, ..., $*effect*$_n) = P($*cause*$) \prod_i P($*effect*$_i|$*cause*$)$

NB can be used for classification by assuming that cause is the true (unknown) label and it (probabilistically) generates all of the features (effects) while features are independent given the cause.
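As a concrete illustration of the factorization above, the sketch below multiplies a prior by per-effect conditionals and then normalizes to get a posterior over the cause. All names and numbers (`p_cause`, `p_effect_given_cause`, the two words) are hypothetical and chosen only for illustration; they are not estimated from the lab data.

```python
# Hypothetical toy numbers, chosen only to illustrate the factorization.
p_cause = {"happy": 0.6, "sad": 0.4}

# P(word present | cause) for two made-up effect words.
p_effect_given_cause = {
    "happy": {"sunny": 0.7, "cold": 0.1},
    "sad": {"sunny": 0.2, "cold": 0.6},
}


def joint(cause, effects):
    """P(cause, effect_1, ..., effect_n) under the Naive Bayes factorization.

    `effects` maps each word to True (present) or False (absent).
    """
    prob = p_cause[cause]
    for word, present in effects.items():
        p = p_effect_given_cause[cause][word]
        prob *= p if present else (1.0 - p)
    return prob


observed = {"sunny": True, "cold": False}
joint_happy = joint("happy", observed)  # 0.6 * 0.7 * 0.9
joint_sad = joint("sad", observed)      # 0.4 * 0.2 * 0.4

# Normalizing the two joint values gives the posterior over the cause.
posterior_happy = joint_happy / (joint_happy + joint_sad)
print(joint_happy, joint_sad, posterior_happy)
```

Classification then amounts to picking the cause with the larger joint value, since the normalizer is shared by both.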

For example, in sentiment analysis the *cause* is the author's sentiment (say, an unknown label from the set {sad, happy, fearful, surprised, disgusted, angry}) and the *effects* are the words that the author writes. The simplifying assumption of NB says that, given the latent sentiment, the words of the sentence are independent. We know this assumption is not true because grammar and word use impose some dependency structure between the words in a sentence, but we choose to ignore that in this model.

Although simple, NB has shown good performance in many classification tasks and has become a standard baseline for classification.

Today we want to perform Twitter sentiment analysis using NB. The goal is to figure out if a tweet has a positive or negative sentiment about the weather.  

**1.0:** Set up the environment (you can click on the play button below to import the appropriate modules).

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

**1.1** Read the data from GitHub into a pandas dataframe.

In [None]:
TweetUrl='https://github.com/aasiaeet/cse5522data/raw/master/db3_final_clean.csv'
tweet_dataframe=pd.read_csv(TweetUrl)

**1.2** Print out the top of the dataframe to make sure that the data loaded correctly.  It should be a data table with three columns (weight, tweet, label), and 3697 rows.

In [None]:
display(tweet_dataframe.shape)
tweet_dataframe.head()

(3697, 3)

| | weight | tweet | label |
|---|---|---|---|
| 0 | 1.0 | it is very cold out want it to be warmer | -1 |
| 1 | 0.7698 | dammmmmmm its pretty cold this morning burr lol | -1 |
| 2 | 0.6146 | why does halsey have to be so far away think m... | -1 |
| 3 | 0.9356 | dammit stop being so cold so can work out | -1 |
| 4 | 1.0 | its too freakin cold | -1 |


Labels are -1 and +1 for negative and positive sentiments respectively. Multiple judges have been asked to choose a label for a tweet (this is an example of crowd-sourcing) from five possible labels:

- Tweet is not relevant to weather.
- I can't tell the sentiment.
- Neutral: author just sharing information.
- Positive
- Negative

The majority vote was picked as the label and its ratio was set as the weight of the tweet. So for the tweet in row 2 above, 61% of judges voted that the label is negative.
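The labeling scheme described above can be sketched as follows. This is only an illustration: the function name `majority_label_and_weight` and the vote list are hypothetical, and how the real dataset broke ties between labels is not stated in this lab.

```python
from collections import Counter


def majority_label_and_weight(votes):
    """Return the most common vote and the fraction of judges who chose it."""
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)


# Hypothetical votes from five judges for one tweet.
votes = ["Negative", "Negative", "Negative", "Neutral", "Positive"]
label, weight = majority_label_and_weight(votes)
print(label, weight)
```

With five possible labels, the winning label can have a weight below 0.5, which is why the weight is worth keeping as a measure of how much the judges agreed.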

Note that tweets have been pre-processed (or cleaned). For example, :( and :) were replaced with "sad" and "smiley", numbers with "num", etc. You can go further (as Option 1 of Part 3 asks) and remove the stop words, i.e., frequent non-informative words such as am, is, and are.

**1.3.** In the next step, we should build our feature matrix by converting the string of words to a vector of numeric values.

First we need to assign a unique id to each word and create the feature matrix with correct size:

In [None]:
# wordDict maps words to id
# X is the document-word matrix holding the presence/absence of words in each tweet
wordDict = {}
idCounter = 0
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    if word not in wordDict:
      wordDict[word] = idCounter
      idCounter += 1
X = np.zeros((tweet_dataframe.shape[0], idCounter),dtype='float')

Checking head of the dictionary:

In [None]:
dict(list(wordDict.items())[0:10])

{'': 9,
 'be': 7,
 'cold': 3,
 'is': 1,
 'it': 0,
 'out': 4,
 'to': 6,
 'very': 2,
 'want': 5,
 'warmer': 8}

**1.4:** The simplest way of coding a tweet to numbers is to mark the occurrence of a word, and forget about its frequency in the document (tweet). This works well with tweets as there are not many repetitive words in a single tweet. So let's fill the document-word matrix:

In [None]:
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    X[i, wordDict[word]]  = 1

Now we check that the number of words in each tweet is correct:

In [None]:
np.sum(X[0:5, ], axis = 1)

array([10.,  9., 17.,  9.,  4.])
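Each row sum counts the distinct tokens in a tweet, not the total number of words, because a repeated word sets the same cell to 1 again. The sketch below reproduces the count of 10 for the first tweet under one assumption: that the stored string ends with a trailing space, which would explain the empty-string key `''` in `wordDict`. The trailing space is an assumption and is not confirmed by the table above.

```python
# First tweet from the table; the trailing space is an assumption that would
# explain the '' entry seen in wordDict.
tweet = "it is very cold out want it to be warmer "

tokens = tweet.split(" ")           # split(" ") keeps empty strings
distinct_tokens = set(tokens)       # 'it' appears twice but is counted once
print(len(tokens), len(distinct_tokens))

# split() with no argument drops empty strings, removing the '' feature.
distinct_without_empty = set(tweet.split())
print(len(distinct_without_empty))
```

If the empty token is unwanted as a feature, calling `split()` without an argument when building `wordDict` and `X` is one way to avoid it.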

Finally, we extract the labels from the dataframe:

In [None]:
y = np.array(tweet_dataframe.iloc[:,2])
y[0:5]

array([-1, -1, -1, -1, -1])

Let's compute the total number of positive and negative tweets:

In [None]:
numNeg = np.sum(y<0)
numPos = np.sum(y>=0) #len(y) - numNeg
probNeg = numNeg / (numNeg + numPos)
probPos = 1 - probNeg
display(numNeg, numPos, probNeg, probPos)

1650

2047

0.4463078171490398

0.5536921828509602

So samples 0 through 1649 are negative and samples 1650 through 3696 (the last row) are positive.

**1.5: Train/Test Split** Now we do the 80/20 split: we learn the word probabilities from the 80% part and test the NB performance on the 20% part.

In [None]:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 0)
display(xTrain.shape, xTest.shape, yTrain.shape, yTest.shape)
#Note: random_state=0 fixes the random seed so we get the same split every run. Don't use this below

(2957, 5989)

(740, 5989)

(2957,)

(740,)

**1.6: Computing Probabilities by Counting** Now the real work begins. Write the code that, from the training feature matrix `xTrain`, computes the needed word probabilities, i.e., $P(word|label)$ where label is + or - and word is any of the words saved in the `wordDict`:

In [None]:
# compute three distributions (four variables):

#The slow way
def compute_distros(x,y):
  count_positive=0
  for tweetIndex in range(x.shape[0]):
    if y[tweetIndex]>0:
      count_positive+=1

  probWordGivenPositive=np.zeros((x.shape[1]))
  probWordGivenNegative=np.zeros((x.shape[1]))
  for wordIndex in range(x.shape[1]): #Go through each word and estimate its distributions
    count_present_positive=0
    count_present_negative=0
    for tweetIndex in range(x.shape[0]):
      #Go through each tweet and count depending on positive vs. negative
      if x[tweetIndex,wordIndex]>0:
        if y[tweetIndex]>=0:
          count_present_positive+=1
        else:
          count_present_negative+=1
    # After all tweets are counted for this word, convert the counts to probabilities
    probWordGivenPositive[wordIndex]=count_present_positive/count_positive # Present-positive vs. total positive
    probWordGivenNegative[wordIndex]=count_present_negative/(x.shape[0]-count_positive) # Present-negative vs. total negative

  priorPositive=count_positive/x.shape[0]
  priorNegative=1-priorPositive

  return probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative

#The fast way
def compute_distros(x,y):
  # probWordGivenPositive: P(word|Sentiment = +ive)
  probWordGivenPositive=np.mean(x[y>=0,:],axis=0) #Sum each word (column) to count how many times each word shows up (in positive examples)
                                                  #and Divide by total number of (positive) examples to give distribution

  # probWordGivenNegative: P(word|Sentiment = -ive)
  probWordGivenNegative=np.mean(x[y<0,:],axis=0)

  # priorPositive: P(Sentiment = +ive)
  priorPositive = np.mean(y>=0) #Number of positive examples vs. all examples
  # priorNegative: P(Sentiment = -ive)
  priorNegative = 1 - priorPositive
  #  (note these last two form one distribution)
  return probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative

# compute distributions here
probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

# checking the results
display(probWordGivenPositive[0:5])
display(probWordGivenNegative[0:5])
display(priorPositive, priorNegative)

array([0.1185006 , 0.20737606, 0.01088271, 0.01451028, 0.10217654])

array([0.14504988, 0.19493477, 0.00537222, 0.09669992, 0.13967767])

0.5593506932702063

0.44064930672979374

Note that you only needed to compute $P(word = 1| +)$ or $P(word = 1| -)$ and the probabilities of the word being absent from a tweet is just 1 minus those probabilities.
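To make the counting concrete, here is a minimal sketch on a hypothetical 4-tweet, 3-word matrix (`x_toy` and `y_toy` are made up for illustration). It estimates $P(word = 1 | label)$ as a per-class column mean and derives the absence probabilities as the complement.

```python
import numpy as np

# Hypothetical toy data: rows are tweets, columns are words (1 = present).
x_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])

# P(word = 1 | label): fraction of tweets with that label containing the word.
p_present_pos = x_toy[y_toy >= 0].mean(axis=0)
p_present_neg = x_toy[y_toy < 0].mean(axis=0)

# P(word = 0 | label) is the complement.
p_absent_pos = 1.0 - p_present_pos
p_absent_neg = 1.0 - p_present_neg

print(p_present_pos, p_present_neg)
print(p_absent_pos, p_absent_neg)
```

Word 0 appears in every positive toy tweet and in no negative one, so some of these probabilities are exactly 0 or 1. That is the situation in which taking a logarithm needs care.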

However, as we will see in 1.7, for convenience we will also want to compute $\log P(word = 1 | +)$, $\log P(word = 0 | +)$, $\log P(word = 1 | -)$ and $\log P(word = 0 | -)$. We should also compute the log priors. Let's do so now.


In [None]:
# compute the following:
# logProbWordPresentGivenPositive
# logProbWordAbsentGivenPositive
# logProbWordPresentGivenNegative
# logProbWordAbsentGivenNegative
# logPriorPositive
# logPriorNegative
def compute_logdistros(distros, min_prob):
  if True:
    #Assume missing words are simply very rare
    #So, assign minimum probability to very small elements (e.g. 0 elements)
    distros=np.where(distros>=min_prob,distros,min_prob)
    #Also need to consider minimum probability for "not" distribution
    distros=np.where(distros<=(1-min_prob),distros,1-min_prob)

    return np.log(distros), np.log(1-distros)
  else:
    #Ignore missing words (assume they have P==1, i.e. force log 0 to 0)
    return np.log(np.where(distros>0,distros,1)), np.log(np.where(distros<1,1-distros,1))

min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)

# Did this work, or did you get an error?  (Read below.)
display(logProbWordPresentGivenPositive[0:5])
display(logProbWordAbsentGivenPositive[0:5])
display(logProbWordPresentGivenNegative[0:5])
display(logProbWordAbsentGivenNegative[0:5])
display(logPriorPositive, logPriorNegative)

array([-2.13283722, -1.57322143, -4.52058012, -4.23289805, -2.28105316])

array([-0.12613096, -0.23240639, -0.01094236, -0.01461658, -0.10778182])

array([-1.93067756, -1.63509031, -5.22651443, -2.33614267, -1.96841789])

array([-0.15671216, -0.21683197, -0.0053867 , -0.10170047, -0.15044815])

-0.5809786442688406

-0.819505942727632

You likely received a warning or an error when you tried to take $\log(0)$ at some point (NumPy returns `-inf` with a `RuntimeWarning`). Can your group think of a way to avoid taking $\log(0)$? Check in with your instructor/TA to see if what you're thinking will work. Implement that change in your code above.
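The code above avoids $\log(0)$ by clipping probabilities to a minimum value. A common alternative, shown here only as a sketch and not as the method this lab requires, is add-alpha (Laplace) smoothing: pretend every word was seen `alpha` extra times as present and `alpha` extra times as absent in each class. The function name `smoothed_word_probs` and the toy data are hypothetical.

```python
import numpy as np


def smoothed_word_probs(x, y, alpha=1.0):
    """Estimate P(word = 1 | label) with add-alpha smoothing.

    Each word is binary (present/absent), so the denominator grows by
    2 * alpha: alpha pseudo-counts for each of the two outcomes.
    """
    estimates = []
    for mask in (y >= 0, y < 0):
        counts = x[mask].sum(axis=0)   # tweets in this class containing each word
        n_tweets = mask.sum()          # number of tweets in this class
        estimates.append((counts + alpha) / (n_tweets + 2.0 * alpha))
    return estimates[0], estimates[1]


# Hypothetical toy data: word 0 never appears in a negative tweet.
x_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])

p_pos, p_neg = smoothed_word_probs(x_toy, y_toy)
print(p_pos, p_neg)
print(np.log(p_neg), np.log(1.0 - p_pos))  # all finite
```

Because every smoothed estimate lies strictly between 0 and 1, both the "present" and "absent" logs are finite without any clipping step.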

**1.7: Math of NB** Here we provide the derivation of NB when we want to classify the $i$th tweet $\textbf{x}^{(i)}$ and the size of dictionary is $p$, i.e., each tweet is a binary vector of size $p$ as $\textbf{x}^{(i)} = (x_1^{(i)},\dots, x_p^{(i)})$.

Note that we computed $P(x_j^{(i)} = 1|+)$ and $P(x_j^{(i)} = 1|-)$ in the code above from `xTrain` and now want to classify the `xTest` samples.

**Classification Rule:** For each tweet $i$ NB classifier assigns label + if $P(+|\textbf{x}^{(i)}) > P(-|\textbf{x}^{(i)})$ and negative otherwise.

These posterior probabilities can be computed from the prior and word probabilities (that we estimated from the training data) using Bayes' rule as follows:

\begin{align}
P(+|\textbf{x}^{(i)}) &= \alpha P(x_1^{(i)},\dots,x_p^{(i)} | +)P(+)
\\
(\text{NB Assumption}) &= \alpha P(+) \prod_{j=1}^p P(x_j^{(i)}|+)
\end{align}

where $\alpha = 1/P(\textbf{x}^{(i)})$ is the normalizing constant from Bayes' rule.

For computational convenience (preventing underflow while dealing with small numbers) we work with the $\log$ of probabilities:

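\begin{align}
\log P(+|\textbf{x}^{(i)}) &= \log \alpha + \log P(+) + \sum_{j=1}^p \log P(x_j^{(i)}|+)
\\
\log P(-|\textbf{x}^{(i)}) &= \log \alpha + \log P(-) + \sum_{j=1}^p \log P(x_j^{(i)}|-)
\end{align}

The term $\log \alpha$ does not depend on the label, so it can be dropped when comparing the two.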
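The underflow problem is easy to reproduce. The sketch below uses 1000 hypothetical factors of 0.01 each: their direct product is too small for a 64-bit float and collapses to zero, while the sum of their logs is an ordinary negative number.

```python
import numpy as np

# Hypothetical example: 1000 conditional probabilities, each equal to 0.01.
probs = np.full(1000, 0.01)

direct_product = np.prod(probs)   # 1e-2000 is below the float64 range
log_sum = np.sum(np.log(probs))   # equals 1000 * log(0.01)

print(direct_product)
print(log_sum)
```

Once both class scores have underflowed to zero they can no longer be compared, whereas the two log scores remain distinguishable.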

Finally we can compute the confidence of our prediction as the log of the ratio of posteriors:
$\log(\frac{P(\text{predicted label}|\textbf{x}^{(i)})}{P(\text{the other label}|\textbf{x}^{(i)})})$


**1.8: Implementing NB** Now write a function that takes a row of `xTest` and outputs a label for it based on the NB classification rule.


In [None]:
def logSignGivenTweet(words, logWordsPresentGivenSign, logWordsAbsentGivenSign, logPriorSign):
  temp = words.copy()
  for wordIndex, word in enumerate(temp):
    if word == 0:
      temp[wordIndex] = logWordsAbsentGivenSign[wordIndex]
    else:
      temp[wordIndex] = logWordsPresentGivenSign[wordIndex]
  result = sum(temp) + logPriorSign
  return result
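The loop in `logSignGivenTweet` can also be written as two dot products, which is much faster on a vocabulary of several thousand words. This is a sketch of an equivalent formulation, not a required change; the function name `log_sign_given_tweet_vectorized` and the toy numbers are hypothetical.

```python
import numpy as np


def log_sign_given_tweet_vectorized(words, log_present, log_absent, log_prior):
    """Log prior plus the log-likelihood of a binary word vector.

    Present words (1) pick up log_present, absent words (0) pick up log_absent.
    """
    return words @ log_present + (1.0 - words) @ log_absent + log_prior


# Hypothetical toy model with three words.
p_present = np.array([0.5, 0.25, 0.8])
log_present = np.log(p_present)
log_absent = np.log(1.0 - p_present)
log_prior = np.log(0.5)
words = np.array([1.0, 0.0, 1.0])

vectorized = log_sign_given_tweet_vectorized(words, log_present, log_absent, log_prior)

# Reference: the same quantity computed word by word.
looped = log_prior
for j, w in enumerate(words):
    looped += log_present[j] if w != 0 else log_absent[j]

print(vectorized, looped)
```

The same idea extends to a whole test matrix at once, by replacing the vector `words` with the matrix `xTest`.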

In [None]:
# classifyNB:
#   words - vector of words of the tweet (binary vector)
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns (label of x according to the NB classification rule, confidence about the label)

# Note: you can also change the function definition if you wish to encapsulate all six log probs
# as one model; just make sure to follow through below

def classifyNB(words,logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
               logPriorPositive, logPriorNegative):
  # fill in function definition here
  logPositiveGivenTweet = logSignGivenTweet(words, logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive, logPriorPositive)
  logNegativeGivenTweet = logSignGivenTweet(words, logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative, logPriorNegative)
  # Compare the log posteriors directly: exponentiating them can underflow to 0
  if logPositiveGivenTweet > logNegativeGivenTweet:
    label = 1
  else:
    label = -1
  # log(P(predicted)/P(other)) is the absolute difference of the two log scores
  confidence = abs(logPositiveGivenTweet - logNegativeGivenTweet)

  return (label, confidence)
print(classifyNB(xTest[700, ], logProbWordPresentGivenPositive,logProbWordAbsentGivenPositive,
                               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
                               logPriorPositive, logPriorNegative))

(1, 4.37706070095421)


**1.9:** Compute the output of `classifyNB` for all test data and output the average error.  

In [None]:
# testNB: Classify all xTest
#   xTest - test data features
#   yTest - true label of test data
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns Average test error
def testNB(xTest, yTest,
           logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
           logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
           logPriorPositive, logPriorNegative):
  correctCases = 0
  for tweetIndex, words in enumerate(xTest):
    prediction = classifyNB(words, logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
           logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
           logPriorPositive, logPriorNegative)
    realResult = yTest[tweetIndex]
    if prediction[0] == realResult:
      correctCases += 1

  # Compute the error once, after every test tweet has been classified
  avgErr = 1 - (correctCases/len(xTest))
  print("Average error of NB is", avgErr)
  return avgErr

testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)

Average error of NB is 0.1702702702702703


0.1702702702702703

**1.10:** Now write an outer wrapper that performs 10 train/test splits and computes the mean and standard deviation of the average error across the 10 runs.

In [None]:
# 10 train/test splits
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  avgErr = testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.16891891891891897
Average error of NB is 0.18648648648648647
Average error of NB is 0.16081081081081083
Average error of NB is 0.18108108108108112
Average error of NB is 0.18108108108108112
Average error of NB is 0.18108108108108112
Average error of NB is 0.16891891891891897
Average error of NB is 0.16621621621621618
Average error of NB is 0.16756756756756752
Average error of NB is 0.1635135135135135
[0.16891892 0.18648649 0.16081081 0.18108108 0.18108108 0.18108108
 0.16891892 0.16621622 0.16756757 0.16351351]
The mean of the average error: 0.172568
The standard deviation of the average error: 0.008505
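One detail worth knowing when reporting the spread: `np.std` divides by $n$ by default (population standard deviation). With only 10 runs, the sample standard deviation, which divides by $n-1$ and is obtained with `ddof=1`, is noticeably larger. The sketch below recomputes both from the ten errors printed above, copied by hand at the precision shown.

```python
import numpy as np

# The ten per-split errors as printed above (rounded to 8 decimals).
ten_errors = np.array([0.16891892, 0.18648649, 0.16081081, 0.18108108, 0.18108108,
                       0.18108108, 0.16891892, 0.16621622, 0.16756757, 0.16351351])

mean_error = ten_errors.mean()
std_population = ten_errors.std()       # ddof=0, the np.std default
std_sample = ten_errors.std(ddof=1)     # divides by n - 1 instead of n

print(f"mean error:              {mean_error:.6f}")
print(f"std (ddof=0, as above):  {std_population:.6f}")
print(f"std (ddof=1):            {std_sample:.6f}")
```

Either convention is acceptable as long as the write-up says which one was used.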


**Conclusion**

Across the 10 random splits, the mean test error is about 0.17 and its standard deviation is about 0.0085.

---
---

**Part 2: An Alternate Model (50 pts)**

In Part 1, you calculated the probability of a tweet by incorporating both the probability of words present in the tweet $P\left(x^i_j=1 | +\right)$ and the probability of words absent from the tweet $P\left(x^i_j=0 | +\right)$.

Now, modify your code to *only* incorporate the probability of words present in the tweet $P\left(x^i_j=1 | +\right)$ (thus ignoring absent words).

Compare this to the original approach in Part 1. Follow reasonable experimental procedure and write up an explanation of the results you find.
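To see what dropping the absent-word terms can change, consider the following sketch with a hypothetical three-word vocabulary and equal priors. Word 1 is very common in positive tweets, so its absence is strong evidence against the positive label. The full model uses that evidence and the present-only model does not. All numbers are made up for illustration.

```python
import numpy as np

# Hypothetical P(word = 1 | label) for three words; priors are equal.
p_pos = np.array([0.5, 0.9, 0.5])
p_neg = np.array([0.4, 0.1, 0.5])
log_prior = np.log(0.5)

words = np.array([1.0, 0.0, 0.0])  # only word 0 is present


def score_full(words, p):
    """Log score using both present and absent words (Part 1 model)."""
    return log_prior + words @ np.log(p) + (1.0 - words) @ np.log(1.0 - p)


def score_present_only(words, p):
    """Log score using only the words present in the tweet (Part 2 model)."""
    return log_prior + words @ np.log(p)


label_full = 1 if score_full(words, p_pos) > score_full(words, p_neg) else -1
label_present_only = (
    1 if score_present_only(words, p_pos) > score_present_only(words, p_neg) else -1
)
print(label_full, label_present_only)
```

In this toy case the two models disagree. Whether such disagreements matter on the real data is an empirical question that the experiments below address.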


In [None]:
# Copy your classifyNB() function and modify as specified
def logSignGivenTweet_IgnoreAbsent(words, logWordsPresentGivenSign, logPriorSign):
  temp = words.copy()
  for wordIndex, word in enumerate(temp):
    if word != 0:
      temp[wordIndex] = logWordsPresentGivenSign[wordIndex]
  result = sum(temp) + logPriorSign
  return result

In [None]:
# classifyNB:
#   words - vector of words of the tweet (binary vector)
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns (label of x according to the NB classification rule, confidence about the label)

# Note: you can also change the function definition if you wish to encapsulate all six log probs
# as one model; just make sure to follow through below

def classifyNB_IgnoreAbsent(words,logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
               logPriorPositive, logPriorNegative):
  # fill in function definition here
  logPositiveGivenTweet = logSignGivenTweet_IgnoreAbsent(words, logProbWordPresentGivenPositive, logPriorPositive)
  logNegativeGivenTweet = logSignGivenTweet_IgnoreAbsent(words, logProbWordPresentGivenNegative, logPriorNegative)
  # Compare the log posteriors directly: exponentiating them can underflow to 0
  if logPositiveGivenTweet > logNegativeGivenTweet:
    label = 1
  else:
    label = -1
  # log(P(predicted)/P(other)) is the absolute difference of the two log scores
  confidence = abs(logPositiveGivenTweet - logNegativeGivenTweet)

  return (label, confidence)
print(classifyNB_IgnoreAbsent(xTest[700, ], logProbWordPresentGivenPositive,logProbWordAbsentGivenPositive,
                               logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
                               logPriorPositive, logPriorNegative))

(-1, 8.194909102228266)


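**Note:** The recorded prediction for row 700 is -1 here, whereas it was 1 in 1.8. The two are not directly comparable: the 10 random splits in 1.10 replaced `xTest` and the log probabilities, so row 700 is no longer the same tweet scored by the same model.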

In [None]:
# testNB: Classify all xTest
#   xTest - test data features
#   yTest - true label of test data
#   logProbWordPresentGivenPositive - log P(x_j = 1|+)
#   logProbWordAbsentGivenPositive  - log P(x_j = 0|+)
#   logProbWordPresentGivenNegative - log P(x_j = 1|-)
#   logProbWordAbsentGivenNegative  - log P(x_j = 0|-)
#   logPriorPositive - log P(+)
#   logPriorNegative - log P(-)
#   returns Average test error
def testNB_IgnoreAbsent(xTest, yTest,
           logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
           logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
           logPriorPositive, logPriorNegative):
  correctCases = 0
  for tweetIndex, words in enumerate(xTest):
    prediction = classifyNB_IgnoreAbsent(words, logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
           logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
           logPriorPositive, logPriorNegative)
    realResult = yTest[tweetIndex]
    if prediction[0] == realResult:
      correctCases += 1

  # Compute the error once, after every test tweet has been classified
  avgErr = 1 - (correctCases/len(xTest))
  print("Average error of NB is", avgErr)
  return avgErr

testNB_IgnoreAbsent(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)

Average error of NB is 0.16486486486486485


0.16486486486486485

In [None]:
# 10 train/test splits
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  avgErr = testNB_IgnoreAbsent(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.17567567567567566
Average error of NB is 0.18243243243243246
Average error of NB is 0.177027027027027
Average error of NB is 0.16756756756756752
Average error of NB is 0.17837837837837833
Average error of NB is 0.14729729729729735
Average error of NB is 0.17297297297297298
Average error of NB is 0.1837837837837838
Average error of NB is 0.177027027027027
Average error of NB is 0.16486486486486485
The mean of the average error: 0.172703
The standard deviation of the average error: 0.010145


**Conclusion**

The mean of the average error is again about 0.17 (0.1727 here versus 0.1726 in Part 1), and its standard deviation is about 0.010. The difference between the two models is far smaller than the spread across splits, so on this data ignoring the absent words does not measurably affect the accuracy of the prediction. Note that the two experiments used different random splits.

---
---
**Part 3: An Additional Experiment (50 pts)**

Implement an experiment to change the model, and report on the results of the experiment, comparing to your baseline model from Part 1 (and your alternate model from Part 2).

Choose one (and only one) of the following options.

*Please make clear which option you choose! (For example, by deleting the options you are not choosing.)*

**3.1: Option 1: Removing stop words**

Investigate the effect of removing the 25, 50, 100, and 200 most frequent words from the calculation.

**3.2: Option 2: Sample weights (3 bonus points for difficulty)**

Recall that the labels for each of our data points/samples came with a weight. This weight was based on the proportion of labelers that agreed on this label, so it serves as a kind of measure of the confidence we should have in each data point. That is, a weight near 1 indicates that everyone agreed on the same label, whereas a weight below 0.5 means that not even a majority agreed on the chosen label.

Devise a method for weighting samples, and use that method to recalculate the probability distributions.  Report the effect of weighting samples on the test set.

(Hint: Re-examine part 1.6 and think about how you would change this to make it pay more attention to data points with higher weights.)
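One possible weighting scheme, sketched here under assumptions, replaces each count in 1.6 with a sum of sample weights, so that a tweet with weight 0.6 contributes 0.6 of a count. This is only one reasonable choice among several. The function name `weighted_word_probs` and the toy data are hypothetical.

```python
import numpy as np


def weighted_word_probs(x, y, w):
    """Weighted estimates of P(word = 1 | label) and P(label = +).

    Each tweet contributes its weight w[i] instead of a full count of 1.
    """
    pos = y >= 0
    neg = ~pos
    p_word_pos = (w[pos, None] * x[pos]).sum(axis=0) / w[pos].sum()
    p_word_neg = (w[neg, None] * x[neg]).sum(axis=0) / w[neg].sum()
    prior_pos = w[pos].sum() / w.sum()
    return p_word_pos, p_word_neg, prior_pos


# Hypothetical toy data: 4 tweets, 3 words, with per-tweet agreement weights.
x_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])
w_toy = np.array([1.0, 0.5, 1.0, 0.6])

p_word_pos, p_word_neg, prior_pos = weighted_word_probs(x_toy, y_toy, w_toy)
print(p_word_pos, p_word_neg, prior_pos)

# Sanity check: with all weights equal, the estimates reduce to plain means.
u_pos, u_neg, u_prior = weighted_word_probs(x_toy, y_toy, np.ones(4))
print(u_pos, u_neg, u_prior)
```

The resulting probabilities can be passed to the same log and classification code as the unweighted ones, which keeps the comparison with the baseline straightforward.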

**3.3: Option 3: Sticky terms (6 bonus points for difficulty)**

A "sticky term" is two words which are more likely to occur together than independently.

You can use "Pointwise Mutual Information" (PMI) to determine the stickiness, using: $PMI(w_1,w_2)=\log\frac{P(w_1,w_2)}{P(w_1)P(w_2)}$ (the logarithm is monotonic, so ranking by the ratio alone gives the same order). For all pairs of <u>adjacent</u> words in the tweet corpus, find the top n pairs according to PMI and add them as additional features in your Naive Bayes Model.

Find the top 100, 200, 500 "sticky terms" and add these as features to the model.

(Note, you cannot use X to calculate the above joint distribution, since the above is about adjacent words. You will have to go back to the raw text.)
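A minimal sketch of the PMI computation on raw text follows. It estimates $P(w_1,w_2)$ as the relative frequency of the adjacent pair among all adjacent pairs and $P(w)$ as the relative frequency of the word among all tokens; these estimation choices, the function name `adjacent_pair_pmi`, the `min_count` parameter, and the toy corpus are all assumptions made for illustration. The sketch takes the logarithm of the ratio, which is the usual PMI convention and does not change the ranking.

```python
import math
from collections import Counter


def adjacent_pair_pmi(tweets, min_count=1):
    """Return {(w1, w2): PMI} for adjacent word pairs in a list of tweets."""
    unigrams = Counter()
    bigrams = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # optionally skip rare pairs, whose PMI is unreliable
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        pmi[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return pmi


# Hypothetical toy corpus.
toy_tweets = ["so cold today", "so cold outside", "nice today"]
pmi = adjacent_pair_pmi(toy_tweets)
top_pairs = sorted(pmi, key=pmi.get, reverse=True)
print(top_pairs)
print(pmi[("so", "cold")], pmi[("cold", "today")])
```

Pairs that occur only once tend to get very high PMI, so in practice a minimum count (the hypothetical `min_count` above) is often applied before taking the top n.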

---

Remember, you need to *compare* your chosen option with your previous work. Just writing the code is not sufficient. Follow reasonable experimental procedure and write up a discussion of your results and why you think they turned out that way.

**Option 1**

In [None]:
# Further modify your code based on the option you choose above
def removeStopWords(tweet, count, logProbWordPresentGivenPositive, logProbWordPresentGivenNegative, logProbWordAbsentGivenPositive, logProbWordAbsentGivenNegative):
  listWordCounts = np.sum(tweet, axis = 0)
  #Find the top n number of words without changing the indices
  topNWords = np.sort(np.argpartition(listWordCounts, len(listWordCounts) - count)[-count:])
  for wordIndex in topNWords:
    logProbWordPresentGivenPositive[wordIndex] = 0
    logProbWordPresentGivenNegative[wordIndex] = 0
    logProbWordAbsentGivenPositive[wordIndex] = 0
    logProbWordAbsentGivenNegative[wordIndex] = 0
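The index selection in `removeStopWords` relies on `np.argpartition`, which finds the positions of the `count` largest values without fully sorting the array. The sketch below shows the same expression on a small hypothetical count vector.

```python
import numpy as np

# Hypothetical per-word document counts for a six-word vocabulary.
word_counts = np.array([5, 40, 3, 22, 17, 8])
k = 3

# argpartition places the k largest values in the last k positions
# (in arbitrary order); sorting the indices restores vocabulary order.
top_k = np.sort(np.argpartition(word_counts, len(word_counts) - k)[-k:])
print(top_k)               # indices of the 3 most frequent words
print(word_counts[top_k])  # their counts
```

Setting all four log probabilities of a word to 0 makes it contribute nothing to either class score, whether it is present or absent, which is equivalent to removing the word from the model. When several words tie at the cutoff, which of them is selected is arbitrary.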

**Top 25 most frequent words**

In [None]:
count = 25
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  removeStopWords(xTrain,count,logProbWordPresentGivenPositive, logProbWordPresentGivenNegative,
                  logProbWordAbsentGivenPositive, logProbWordAbsentGivenNegative)
  avgErr = testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.17567567567567566
Average error of NB is 0.17972972972972978
Average error of NB is 0.16486486486486485
Average error of NB is 0.16756756756756752
Average error of NB is 0.17837837837837833
Average error of NB is 0.17297297297297298
Average error of NB is 0.1945945945945946
Average error of NB is 0.17972972972972978
Average error of NB is 0.18918918918918914
Average error of NB is 0.17432432432432432
The mean of the average error: 0.177703
The standard deviation of the average error: 0.008552


**Top 50 most frequent words**

In [None]:
count = 50
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  removeStopWords(xTrain,count,logProbWordPresentGivenPositive, logProbWordPresentGivenNegative,
                  logProbWordAbsentGivenPositive, logProbWordAbsentGivenNegative)
  avgErr = testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.22297297297297303
Average error of NB is 0.23108108108108105
Average error of NB is 0.22432432432432436
Average error of NB is 0.2256756756756757
Average error of NB is 0.21216216216216222
Average error of NB is 0.21621621621621623
Average error of NB is 0.22297297297297303
Average error of NB is 0.20270270270270274
Average error of NB is 0.21351351351351355
Average error of NB is 0.22162162162162158
The mean of the average error: 0.219324
The standard deviation of the average error: 0.007788


**Top 100 most frequent words**

In [None]:
count = 100
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  removeStopWords(xTrain,count,logProbWordPresentGivenPositive, logProbWordPresentGivenNegative,
                  logProbWordAbsentGivenPositive, logProbWordAbsentGivenNegative)
  avgErr = testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.254054054054054
Average error of NB is 0.25135135135135134
Average error of NB is 0.2635135135135135
Average error of NB is 0.26216216216216215
Average error of NB is 0.27297297297297296
Average error of NB is 0.22972972972972971
Average error of NB is 0.27972972972972976
Average error of NB is 0.25540540540540535
Average error of NB is 0.23378378378378384
Average error of NB is 0.27297297297297296
The mean of the average error: 0.257568
The standard deviation of the average error: 0.015552


**Top 200 most frequent words**

In [None]:
count = 200
tenErrors = np.array([])
for _ in range(10):
  xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)
  probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
  min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
  logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
  logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
  logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
  removeStopWords(xTrain,count,logProbWordPresentGivenPositive, logProbWordPresentGivenNegative,
                  logProbWordAbsentGivenPositive, logProbWordAbsentGivenNegative)
  avgErr = testNB(xTest, yTest,
       logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,
       logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,
       logPriorPositive, logPriorNegative)
  tenErrors = np.append(tenErrors, avgErr)

errorMean = np.average(tenErrors)
errorStd = np.std(tenErrors)
print("The mean of the average error: %f" %(errorMean))
print("The standard deviation of the average error: %f" %(errorStd))

Average error of NB is 0.29729729729729726
Average error of NB is 0.29054054054054057
Average error of NB is 0.28378378378378377
Average error of NB is 0.31351351351351353
Average error of NB is 0.2851351351351351
Average error of NB is 0.29324324324324325
Average error of NB is 0.29324324324324325
Average error of NB is 0.2783783783783784
Average error of NB is 0.31351351351351353
Average error of NB is 0.31756756756756754
The mean of the average error: 0.296622
The standard deviation of the average error: 0.013035


**Conclusion**:

Removing the 25 most frequent words barely changes the error (about 0.178, versus about 0.173 for the Part 1 baseline), probably because these words are mostly auxiliary verbs, pronouns, and articles that carry little sentiment. Removing more words steadily increases the error: about 0.219, 0.258, and 0.297 for the top 50, 100, and 200 words. Of the settings tested, removing the top 25 words is therefore the only one that does not clearly hurt accuracy.