# Naive Bayes

Last week we took a look at our first classification algorithm: k-NN. There we used a simple Euclidean distance to compare an unknown point with points we knew the class for, and selected the $k$ closest as being most informative to predict the class of the unknown point. This algorithm works surprisingly well, but it has a couple of drawbacks.

First, it requires that a distance between different examples of your class *can* be properly defined. If you have all numeric features in your data, then you can just use a Euclidean distance. However, if you have categorical features for example, or some other type of data, then distance can be harder to define. Additionally, every time you want to classify a new point, k-NN has to loop over *all* data points to find the $k$ nearest. Generally, the more data you use, the better your prediction results will be. However, if you have too much data, k-NN might become very slow.

So in some cases, other classification algorithms might work better, even though they are all designed to solve classification problems. In fact, one of the most important skills you can have in machine learning is understanding when to use which type of algorithm. For this assignment we'll be looking at a different classification algorithm, *Naive Bayes*, and see what types of problems *it* is well-suited to solve. As the name might suggest, it is based on *Bayes' rule* and so is all about computing the probability of a new sample belonging to one class or the other.

## Sentiment Analysis

The problem we'll be applying the *Naive Bayes* algorithm to is that of sentiment analysis. The classic example of a *Naive Bayes* application is spam filtering, where you get a whole bunch of email texts and, for each of those, a label stating whether or not that email is considered spam. You can then use that to build a model of the text to classify whether new emails are spam or not. The problem of sentiment analysis is very similar, but instead of classifying whether an email is spam, the problem is to classify whether a text is generally positive or generally negative in tone, i.e. what the sentiment of a text is.

This type of classification is widely used in social media analysis, to determine how many posts related to a topic are positive and how many are negative. We'll use a big data set from *Twitter*, where each tweet has already been labeled as a positive or negative sentiment. It consists of 1.6 million tweets, so this is definitely not the kind of data set you want to use k-NN for either. Unzip the file `sentiment_data.zip` to get started.

Actually processing the text of the tweets is not the main focus of this assignment, so we've provided some code below to get you started. It reads the data file into a *Pandas* DataFrame, which is essentially a big table, much like an *Excel* sheet, that we can work with in *Python*. The function `parse()` takes a text, splits the text into a list of words and removes the punctuation. It also removes any words containing `@`, `#` or `/`, so at-mentions, hashtags or URLs will not be included in the analysis. Each row in the DataFrame consists of one tweet, with its date, the user who posted it, the actual text of the tweet and the labeled sentiment. After processing the tweets, there will be an additional column in the DataFrame with the list of parsed words for that same tweet.
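
As a quick illustration of what `parse()` does, the sketch below applies a copy of the provided function to a made-up tweet. The example text is an assumption for illustration and is not taken from the data set.

```python
import string

def parse(text):
    # Same logic as the parse() function provided with this assignment
    punct = str.maketrans('', '', string.punctuation)
    word_list = []
    for word in text.split():
        # Skip at-mentions, hashtags and URLs entirely
        for char in ['@', '#', '/']:
            if char in word:
                break
        else:
            # Lowercase the word and strip all punctuation characters
            word = word.lower().translate(punct)
            if word != '':
                word_list.append(word)
    return word_list

example = "@bob I LOVE this!!! http://t.co/x #happy :)"
parsed_example = parse(example)
print(parsed_example)  # ['i', 'love', 'this']
```

Note that the emoticon `:)` disappears completely, because it consists only of punctuation.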

Below is the code to load and process the tweets. Note that running this cell might take *a while*, as there are 1.6 million tweets that need to be processed. When the processing is done, the first 5 rows of the DataFrame should be shown below the cell. These processed results are stored in the variable `data`, which you will use for the rest of the assignment. As long as you don't modify or overwrite the `data` variable, you only have to run the code below *once*.


In [None]:
import string
import pandas as pd

def parse(text):
    punct = str.maketrans('', '', string.punctuation)
    word_list = []
    for word in text.split():
        for char in ['@', '#', '/']:
            if char in word:
                break
        else:
            word = word.lower().translate(punct)
            if word != '':
                word_list.append(word)
    return word_list


data = pd.read_csv('training.1600000.processed.noemoticon.csv',  encoding='ISO-8859-1',
                   names=["sentiment", "ids", "date", "flag", "user", "text"])

data = data.drop(labels=['ids', 'flag'], axis=1)

data['parsed'] = data['text'].map(parse)

data.head()

## Training and Testing

Take a look at the table produced by the code above. The rows each correspond to a different tweet in the data set, where these are the first 5 tweets out of the 1.6 million in the full table. The columns are the attributes for each tweet, with the most important columns here being the last two, namely the *raw text* of the tweet and the *parsed list of words* for that tweet. In addition, there is a sentiment label for each tweet, with a label of $0$ indicating the sentiment of the tweet is negative and a label of $4$ indicating the sentiment is positive. The labels $0$ and $4$ are the only labels used in this data set, thus a tweet is always considered either negative or positive. There are no *neutral* or *slightly positive* labels, or other intermediate values.

First we will split this data set into a training and a testing set. The code below uses a function from *scikit-learn* to do this. All this code does is randomly split the data set into *70%* training data and *30%* testing data. The results are stored in 2 variables, `training_data` and `testing_data` (returning and assigning multiple variables this way in *Python* is called *tuple unpacking*). For now, we'll only be working with the training data, which means we'll just have a randomly selected subset of *70%* of our original data. Given that we've started with 1.6 million tweets, this should still be plenty to work with. Run the code to create your random split.


In [None]:
from sklearn.model_selection import train_test_split

training_data, testing_data = train_test_split(data, train_size=0.7)

training_data.head()
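
To make the idea of a random split concrete, here is a standard-library sketch of what a 70/30 split does conceptually. It is not scikit-learn's implementation; the helper `random_split` and the toy list are hypothetical and only serve as illustration.

```python
import random

def random_split(items, train_fraction=0.7, seed=0):
    # Hypothetical helper: shuffle a copy, then cut it at the requested fraction
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy_tweets = [f"tweet {i}" for i in range(10)]
toy_train, toy_test = random_split(toy_tweets)

# Every item ends up in exactly one of the two parts
print(len(toy_train), len(toy_test))  # 7 3
```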

## Bags of Words

With the data preparation done, we can get started on creating a model of the training data. This model should eventually estimate the probability of a tweet having a positive or negative sentiment, based on the words in the tweet. For this model we'll make quite a few simplifying assumptions, both because they make the model easier to build, and because it turns out we can still build a pretty good model with all these assumptions. The first simplification will be to use a *Bag of Words* approach, meaning that we'll group all of the words together in a giant "bag", and count them, which will simplify the model *a lot*.

*It is important to note that this will discard a lot of the information in each tweet, for example the place of the word in the tweet, what words were next to it and other important contextual elements. But as we will see, even with such a simple model of the words we can still make reasonably accurate predictions.*

Start by sorting the words in these tweets into 2 lists, `positive_words` and `negative_words`, containing all of the words in tweets with a positive sentiment label $4$ or a negative sentiment label $0$ respectively. Below is some code to get you started, looping through each row of the DataFrame, where `sentiment` is the sentiment label and `words` is the parsed list of words for that tweet. These two lists are returned together by the function and can again be assigned together using *tuple unpacking*.

In [None]:
def bag_words(training_data):
    positive_words = []
    negative_words = []

    for row in training_data.itertuples(False, 'Row'):
        sentiment = row.sentiment
        words = row.parsed

        # YOUR CODE HERE
        
            
    return positive_words, negative_words

pos_words, neg_words = bag_words(training_data)
  
print("\nThe 50 first words with a positive sentiment are:")
print(pos_words[:50])

print("\nThe 50 first words with a negative sentiment are:")
print(neg_words[:50])

## Counting words

The next step in the model will be to count how often each word occurs in the positive and negative bags. There are quite a few ways to count strings in *Python*, but we'll be using a nice built-in data structure called a *default dictionary*. Default dictionaries work very similarly to lists, but instead of *numbers* to index places in the list, you can use *strings*, so words, as the index. So retrieving and updating a count for a word is very easy with the standard dictionary syntax:

    count = word_count[word]
    word_count[word] = new_count

In addition, if a word is *not* in the dictionary yet, the *default* dictionary returns a default value instead. Normally, for this kind of counting task, the default count for each word would start at 0. Here we will have it start at 1 instead, assuming we have seen every possible word once to begin with. This will help in the next few steps to avoid having words with a probability of 0, which will cause problems in the model. The technical term for this is *Laplace smoothing*, but the important thing to remember is that we want to avoid having a probability of 0 for some words occurring, so we assume all words occur at least once.
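
The default value behaviour can be seen in a minimal sketch. The dictionary name `toy_count` and the word used are made up for illustration.

```python
from collections import defaultdict

# Every missing key starts at 1 instead of 0 (the smoothing described above)
toy_count = defaultdict(lambda: 1)

unseen = toy_count['happy']   # key is missing, so the default 1 is returned and stored
toy_count['happy'] += 1       # one real occurrence raises the count to 2

print(unseen, toy_count['happy'], len(toy_count))  # 1 2 1
```

Note that merely looking up a missing key already stores it in the dictionary with the default value.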

In [None]:
from collections import defaultdict

def count_words(word_list):
    word_count = defaultdict(lambda: 1)
    
    # YOUR CODE HERE
    
            
    return word_count

pos_count = count_words(pos_words)
neg_count = count_words(neg_words)

print('Word\t-\t', 'Pos Count\t-\t', 'Neg Count', '\n'+50*'=')
for word in pos_words[:10]:
    print(word, '\t-\t', pos_count[word], '\t-\t', neg_count[word])

# Modelling probabilities

With the words divided into positive and negative sentiment bags and these counts made, we can start estimating the model probabilities. We want to model the probability of the sentiment of a tweet being positive or negative, so for this we will use a random variable $S$. This random variable $S$ has a *Binomial* distribution with just a single trial, which is also called a *Bernoulli* distribution, and has probability $r$ of having a positive sentiment.

$$S \sim B(1, r)$$

In order to predict the sentiment, we need to know the probability of the sentiment of a tweet being positive *given the words in that tweet*, i.e. $P(S=p\ |\ t)$ for some tweet $t$. Modelling this conditional probability directly is hard, but we can rewrite the conditional probability using *Bayes' Rule*:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t)}$$

Next, we can rewrite the marginal probability of the tweet using the *Law of Total Probability*:

$$P(t)\ =\ P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=n)$$

Combining the two gives:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=n)}$$

## Prior probabilities

The probabilities $P(S=p)$ and $P(S=n)$ are called the **prior** probabilities of the sentiment. They represent the probability of a tweet being positive or negative *without* knowing anything about the content of the tweet yet. It is possible to estimate these probabilities from the data, or use another source of information about how likely a tweet is to have a positive sentiment to begin with. The simplest estimate would be to just count how often tweets are positive and how often they are negative in the data set. The code below uses your `count_words()` function to actually count how often the different values in the sentiment column of the data set occur.


In [None]:
print('\nDistribution of sentiments in the training data')
print(dict(count_words(training_data['sentiment']).items()))

print('\nDistribution of sentiments in the complete data set')
print(dict(count_words(data['sentiment']).items()))

## Posterior probability

As you can see, it seems like exactly *half* the tweets in the data set are positive and half are negative. Note that the counts will be 1 higher than they actually are, due to how `count_words()` was constructed, but that doesn't matter here. It seems unlikely that a tweet is exactly equally likely to have a positive or a negative sentiment, and more likely that this is just how *this specific data set* was put together by whoever labeled it.

If we don't have any information about how the *prior* is distributed, the most common assumption is still to use a *uniform* prior, so every outcome is equally likely. This means $r=0.5$ for the Binomial distribution and that $P(S=p)\ =\ P(S=n)$. If we substitute $P(S=p)$ for $P(S=n)$ in the equation above, then we get:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=p)}$$


Now we can just divide the whole right-hand side by $P(S=p)$, which gives us:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p)}{P(t\ |\ S=p)\ +\ P(t\ |\ S=n)}$$
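
As a numerical sanity check of this simplification, the sketch below evaluates both forms of the equation for made-up values of $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$. The function names and numbers are hypothetical and only illustrate that the equal priors cancel.

```python
def full_rule(p_t_given_pos, p_t_given_neg, prior_pos):
    # Bayes' rule with the law of total probability in the denominator
    prior_neg = 1 - prior_pos
    numerator = p_t_given_pos * prior_pos
    return numerator / (numerator + p_t_given_neg * prior_neg)

def uniform_prior_rule(p_t_given_pos, p_t_given_neg):
    # Simplified form once the equal priors have been divided out
    return p_t_given_pos / (p_t_given_pos + p_t_given_neg)

# Made-up values for P(t | S=p) and P(t | S=n)
full = full_rule(0.006, 0.002, prior_pos=0.5)
simple = uniform_prior_rule(0.006, 0.002)

print(full, simple)  # both are approximately 0.75
```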

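The only remaining terms that need to be modeled in order to compute $P(S=p\ |\ t)$ are $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$, which this assignment calls the **posterior** probabilities. (In standard Bayesian terminology these terms are the *likelihood*, and $P(S=p\ |\ t)$ is the posterior; the assignment's naming is kept here because the function names below use it.) They are the probability of a specific tweet occurring, given that you know the sentiment is positive or negative. These probabilities will form the basis of our model.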

# Naive Bayes

Modelling the probability of a whole tweet $t$ might be complicated, but using the *bag of words* assumption, we can instead model the joint probability of all *individual words* occurring together:

$$P(t\ |\ S=p)\ =\ P(w_1, \dots, w_m\ |\ S=p)$$

This is still a very large joint probability, but we will make one last simplifying assumption, which will turn this Bayesian model into *Naive Bayes*. We will assume the probability of each word occurring to be **independent**. So if one word occurs, we assume that this doesn't influence the probability of other words occurring. This assumption definitely won't hold for words, as certain words are much more likely when combined with other words (which is what makes this a *naive* assumption), but it greatly simplifies our model yet again, making the problem much more solvable.

If the probability of each word occurring is independent, then the probability of all these words occurring *together* simply becomes the product of all the individual word probabilities:

$$P(w_1, \dots, w_m\ |\ S=p)\ \ =\ \ P(w_1\ |\ S=p) \cdot P(w_2\ |\ S=p) \cdot \dots \cdot P(w_m\ |\ S=p)$$

Or more formally:

$$P(w_1, \dots, w_m\ |\ S=p)\ =\ \prod_{i=1}^m P(w_i\ |\ S=p)$$

where $w_i$ is the $i^{th}$ word in the tweet and $P(w_i\ |\ S=p)$ is the probability of that word occurring given that the tweet is positive.
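
As a worked example with made-up numbers (assumptions for illustration, not estimates from the data), take a tweet of $m=3$ words with $P(w_1\ |\ S=p) = 0.01$, $P(w_2\ |\ S=p) = 0.02$ and $P(w_3\ |\ S=p) = 0.005$. The product then gives:

$$P(w_1, w_2, w_3\ |\ S=p)\ =\ 0.01 \cdot 0.02 \cdot 0.005\ =\ 10^{-6}$$

Because every factor is a small number, the probability of a whole tweet quickly becomes very small as the tweet gets longer.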

## Word probabilities

After all this theory and all these simplifying assumptions, we have actually broken down the problem into something that is very easy to estimate from the data we have. The only probabilities we will need to estimate are $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$ for every word $w$, and with those we can build a complete model to predict the sentiment of a tweet.

Write a function that takes a dictionary of word counts and outputs a dictionary of word probabilities. This word count dictionary can be either positive sentiment word counts to compute $P(w\ |\ S=p)$ or negative sentiment word counts to compute $P(w\ |\ S=n)$.

Here we will again use a default dictionary to ensure that words that don't occur in the training data don't result in a 0 probability. Just 1 word having a 0 probability would multiply all other word probabilities by 0 and result in a 0 probability for the whole tweet. The probability for an unknown word is estimated as

$$\frac{1}{total\_word\_count + 1}$$

This is a continuation of the *Laplace smoothing* in the earlier word counting, where we assumed every word occurs at least once. This way the probability of an unknown word is lower than for any of the words that *do* occur in the training data, but not zero. The default dictionary to do this has already been provided.
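
The effect of this default can be sketched with a tiny, made-up vocabulary. The counts and word probabilities below are hypothetical; only the default value mirrors the formula above.

```python
from collections import defaultdict

# Hypothetical smoothed counts for a tiny vocabulary (every count is at least 2,
# because each word started at the default count of 1)
toy_counts = {'good': 4, 'day': 2}
total_count = sum(toy_counts.values())   # 6

# Default probability for unknown words, following the formula above
toy_probs = defaultdict(lambda: 1 / (total_count + 1))
unknown = toy_probs['zzz']               # 1 / 7

# A single zero factor wipes out the whole product; the smoothed default does not
with_zero = 0.01 * 0.02 * 0.0
with_default = 0.01 * 0.02 * unknown

print(unknown, with_zero, with_default)
```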

Fill the dictionary with the probabilities of all the words that *have* been counted and return this dictionary of probabilities.


In [None]:
def word_probabilities(word_counts):
    total_count = sum(word_counts.values())
    probabilities  = defaultdict(lambda: 1 / (total_count + 1))
    
    # YOUR CODE HERE
    

pos_probabilities = word_probabilities(pos_count)
neg_probabilities = word_probabilities(neg_count)

print('Word\t-\t', 'Pos Probability\t-\t', 'Neg Probability', '\n'+80*'=')
for word in pos_words[:10]:
    print(word, '\t-\t', pos_probabilities[word], '\t-\t', neg_probabilities[word])

## Question: Inspecting the probabilities

These two dictionaries, `pos_probabilities` and `neg_probabilities`, should now contain the estimates for $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$, for every word $w$, respectively. They are the only estimated probabilities we will need for the entire *Naive Bayes* model.

Take a look at the results printed above, showing these probabilities for the first 10 words encountered, and make sure that they make sense to you. The probabilities should all be pretty small numbers, as the probability of one specific word occurring in a tweet should be quite low. Note that the number `1e-05` represents $1 \cdot 10^{-5}$ or $0.00001$, which is called [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation).

Compare the positive and negative probabilities for these 10 words. Do the words for which the positive probability is larger, indeed correspond to words which you would consider more likely to be contained in a positive tweet? Explain your answer.

*Your answer goes here.*


## Computing the posterior probability

Next, write a function to combine these word probabilities into the complete posterior for a tweet $P(t\ |\ S)$, using the *naive* assumption of words being independent.

Your function should compute the posterior based on the list of parsed `words` from the tweet and the dictionary of word `probabilities`, given the sentiment is positive or negative. When using the dictionary $P(w\ |\ S=p)$, your function should compute $P(t\ |\ S=p)$ and when using $P(w\ |\ S=n)$, your function should compute $P(t\ |\ S=n)$.

In [None]:
def compute_posterior(words, probabilities):
    # YOUR CODE HERE
    


## Computing the sentiment probability

Now, write the function that computes $P(S=p\ |\ t)$, using the assumption of a *uniform prior* for $S$.

The function should compute $P(S=p\ |\ t)$ based on a list of parsed `words` for a tweet $t$ and the two dictionaries containing the estimates for $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$. Use the function you just wrote, `compute_posterior()` to construct $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$.

In [None]:
def compute_sentiment(words, positive_probs, negative_probs):
    # YOUR CODE HERE
    


## Testing the predictions

This function should be the last puzzle piece, as you can now use it to compute $P(S=p\ |\ t)$, which is the probability of the sentiment being positive, given the content of the tweet.

Write four short test sentences and try out the function. You can use the `parse()` function from the very first cell to split your sentences into a list of parsed words. Apply the `compute_sentiment()` function to this list of `words`, together with the `pos_probabilities` and `neg_probabilities` dictionaries. Print the words from the sentence and the computed $P(S=p\ |\ t)$. Make sure these results make sense to you before continuing. 
 

In [None]:
# YOUR CODE HERE

## Classifying tweets

Next, write a simple function `classify_tweet()` that takes a list of parsed `words`, together with the `pos_probabilities` and `neg_probabilities` dictionaries, and classifies whether the sentiment is positive or negative. If it is more probable that these words come from a tweet with a positive sentiment, the function should return the label $4$ and if it is more probable to have a negative sentiment, it should return the label $0$. Use your function `compute_sentiment()` to determine this probability.

Finally, write a function `model_accuracy()` which loops through the rows in a data set and counts how many of the tweets are classified correctly and how many are classified incorrectly. The function should return the accuracy of the constructed model on that data set, i.e. the percentage of tweets that were classified correctly within the set.

In [None]:
def classify_tweet(words, positive_probs, negative_probs):
    # YOUR CODE HERE
    

def model_accuracy(data_set, positive_probs, negative_probs):
    correct, incorrect = 0, 0

    for row in data_set.itertuples(False, 'Row'):
        words = row.parsed
        sentiment = row.sentiment
        
        # YOUR CODE HERE
        

print('\nThe accuracy of the model on the training data is')
print(model_accuracy(training_data, pos_probabilities, neg_probabilities))

print('\nThe accuracy of the model on the testing data is')
print(model_accuracy(testing_data, pos_probabilities, neg_probabilities))

## Question: Comparing Training and Testing results

The first result printed above is the accuracy on the training data, that is, how well the model is able to reproduce the labels from the part of the data we used to construct that model. This indicates how well the model fits the data we used to *train* it.

However, it is possible to get perfect accuracy on your training data, just by making your model store all the training examples and for every "prediction" you just look up what the sample was (which is *very* similar to what a Nearest Neighbour algorithm would do). So, in order to get a good sense of how well the model captures the overall pattern in the data, including examples it hasn't seen, we use the testing data.
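
The memorisation problem can be illustrated with a small sketch. The toy tweets, the lookup "model" and the helper names are all hypothetical; the point is only that a perfect training score says little about unseen examples.

```python
# Hypothetical toy data: (tuple of parsed words, sentiment label)
toy_train = [(('love', 'it'), 4), (('hate', 'it'), 0), (('so', 'good'), 4)]
toy_test = [(('love', 'this'), 4), (('so', 'bad'), 0)]

# "Model" that only memorises the training examples
lookup = dict(toy_train)

def memorised_label(words, default=4):
    # Return the stored label, or a fixed guess for anything not seen before
    return lookup.get(words, default)

def fraction_correct(samples):
    # Share of samples for which the memorised label matches the true label
    hits = sum(memorised_label(words) == label for words, label in samples)
    return hits / len(samples)

train_acc = fraction_correct(toy_train)
test_acc = fraction_correct(toy_test)
print(train_acc, test_acc)  # 1.0 0.5
```

The lookup table scores perfectly on the examples it has stored, but on new tweets it can do no better than its fixed guess.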

Testing data is a part of the data set which wasn't used to build the model, but can be used to estimate model performance on *new* data samples. If this system were deployed to classify tweets on Twitter, it would obviously be applied to tweets that were not in the data, so testing data also gives an estimate of *real world* performance.

When building and evaluating any machine learning model, it is always essential to split the data into train and test sets, to better estimate the performance of the system. Compare the accuracy on the training and the testing set. Do these results make sense to you? If so, why? If not, why not? What do these results tell you about how well the model solved this problem? 

*Your answer goes here.*