# Naive Bayes

Last week we took a look at our first classification algorithm: k-NN. There we used a simple Euclidean distance to compare an unknown point with points we knew the class for, and select the $k$ closest as being most informative to predict the class of the unknown point. This algorithm works surprisingly well, but it has a couple of drawbacks.

First, it requires that a distance between different examples of your class *can* be properly defined. If you have only quantitative features in your data, so numerical interval or ratio measurements, then you can just use a Euclidean distance. However, if you have qualitative features, like category labels, or some other type of data, then distance between samples can be harder to define. Additionally, every time you want to classify a new point, k-NN has to loop over *all* data points to find the $k$ nearest. Generally, the more data you use, the better your prediction results will be, but with k-NN, if you have too much data, the algorithm might become very slow.

So in some cases a different *classification algorithm* might work better, even though both algorithms are designed to solve the same type of problem. In fact, one of the most important skills you can have in machine learning is understanding when to use which type of algorithm. For this assignment we'll be looking at another classification algorithm, *Naive Bayes*, and see what types of problems *it* is well-suited to solve. As the name might suggest, it is based on *Bayes' rule* and so is all about computing the probability of a new sample belonging to one class or the other.

## Sentiment Analysis

The problem we'll be applying the *Naive Bayes* algorithm to is that of sentiment analysis. The classic example of a *Naive Bayes* application is spam filtering, where you get a whole bunch of texts of emails and for each of those texts a *label* whether or not that email was considered spam. You can then use that to build a model of the texts and classify if new emails are spam or not. The problem of sentiment analysis is very similar, but instead of classifying whether an email is spam, the problem is to classify whether a text is generally positive or generally negative in tone, i.e. what the sentiment of a text is.

This type of classification is widely used in social media analysis, to determine how many posts related to a topic are positive and how many are negative. We'll use a big data set from *Twitter*, where each tweet has already been labeled as a positive or negative sentiment. It consists of 1.6 million tweets, so this is definitely not the kind of data set you want to use k-NN for either. Unzip the file `sentiment_data.zip` to get started.

### Loading in the data set

Actually processing the text of the tweets is not the main focus of this assignment, so we've provided some code below to get you started. It reads the data file into a *Pandas* DataFrame, which is just a big *Excel* table we can work with in *Python*. The function `parse()` takes a text, splits the text into a list of words and removes the punctuation. It also removes any words containing `@`, `#` or `/`, so *at-mentions*, *hashtags* or *URLs* will not be included in the analysis. Each row in the DataFrame consists of one tweet, with its date, the user who posted it, the actual text of the tweet and the labeled sentiment. After processing the tweets, there will be an additional column in the DataFrame which contains the list of parsed words for that same tweet.
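To make the filtering rules concrete, here is a compact sketch that mirrors the behaviour described above. The name `parse_sketch` and the example tweet are made up for this illustration; the assignment itself uses the provided `parse()` function.

```python
import string

def parse_sketch(text):
    """Hypothetical, condensed stand-in for parse(), following the same rules."""
    punct = str.maketrans('', '', string.punctuation)
    words = []
    for word in text.split():
        # Skip at-mentions, hashtags and URLs entirely
        if any(char in word for char in '@#/'):
            continue
        # Lowercase and strip punctuation; drop words that end up empty
        word = word.lower().translate(punct)
        if word:
            words.append(word)
    return words

example = parse_sketch("@bob I LOVE this!! ... http://t.co/x #yay")
print(example)
```

Only the three ordinary words survive: the mention, the URL and the hashtag are skipped, and the `...` token becomes empty once its punctuation is removed.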

Below is the code to load and process the tweets. Note that running this cell might take *a while*, as there are 1.6 million tweets that need to be processed. When the processing is done, the first 5 rows of the DataFrame should be shown below the cell. These processed results are stored in the variable `data`, which you will need to use for the rest of the assignment. As long as you don't modify or overwrite the `data` variable, you only have to run the code below *once*.


In [None]:
import string
import pandas as pd

def parse(text):
    """ Takes a text and converts it into a list of words.
    
    This function will split the input string text into words, converting all words to
    lowercase and removing any punctuation. Additionally, it will remove any words
    containing the characters '@', '#' or '/', which commonly occur in tweets as
    at-mentions, hashtags or URLs. The function returns a list of parsed words.
    
    Note: You are not expected to read or understand the content of this function, just
    its input and output, so the function has not been documented further.
    """
    punct = str.maketrans('', '', string.punctuation)
    word_list = []
    
    for word in text.split():
        for char in ['@', '#', '/']:
            if char in word:
                break
        
        else:
            word = word.lower().translate(punct)
            if word != '':
                word_list.append(word)
    
    return word_list


# Load the data
data = pd.read_csv('training.1600000.processed.noemoticon.csv',  encoding='ISO-8859-1',
                   names=["sentiment", "ids", "date", "flag", "user", "text"])

# Remove some less useful features
data = data.drop(labels=['ids', 'flag'], axis=1)

# Parse the text of every tweet and create a new feature with the parsed results
data['parsed'] = data['text'].map(parse)

# Show the first 5 rows of the data set
data.head()

### Training and testing split of the data

Take a look at the table produced by the code above. The rows each correspond to a different tweet in the data set, where these are the first 5 tweets out of the 1.6 million in the full table. The columns are the attributes for each tweet, with the most important columns here being the last two; namely the *raw text* of the tweet and the *parsed list of words* for that tweet. In addition, there is a sentiment label for each tweet, with a label of $0$ indicating the sentiment of the tweet is negative and a label of $4$ indicating the sentiment is positive. The labels $0$ and $4$ are the only labels used in this data set, thus a tweet is always considered either negative or positive. These two labels could have just as easily been $-1$ and $1$ for negative and positive sentiment, but in this data set the choice was made to use $0$ and $4$. There are no *neutral* or *slightly positive* labels, or other intermediate values in this data set.

First we will split this data set into a training and a testing set. The code below uses a function from *scikit-learn* to solve this problem. All this code does is split the data set into *70%* training data and *30%* testing data randomly. The results are stored into 2 variables, `training_data` and `testing_data` (returning and assigning multiple variables this way in *Python* is called *tuple unpacking*). For now, we'll only be working with the training data, which means we'll just have a randomly selected subset of *70%* of our original data. Given that we've started with 1.6 million tweets, this should still be plenty to work with. Run the code to create your random split.
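As a rough picture of what a random 70/30 split amounts to, the sketch below does the same thing in plain Python on a toy list. It is not how *scikit-learn* implements `train_test_split`; the helper `split_rows` and its `seed` parameter are assumptions made purely for illustration.

```python
import random

def split_rows(rows, train_fraction=0.7, seed=0):
    """Hypothetical helper: shuffle a copy of the rows and cut it in two."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Tuple unpacking: the two returned lists are assigned to two variables at once
train_rows, test_rows = split_rows(list(range(10)))
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two parts, which is the property that matters when the testing data is later used to evaluate the model.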


In [None]:
from sklearn.model_selection import train_test_split

# Split the data into 70% training and 30% testing
training_data, testing_data = train_test_split(data, train_size=0.7)

# Show the first 5 rows of the training set
training_data.head()

### Assignment 1: Bags of Words

With the data preparation done, we can get started on creating a model of the training data. This model will need to estimate the probability of a tweet having a positive or negative sentiment, based on the words in the tweet. For this model we'll make quite a few simplifying assumptions, both because they make the model easier to build, but also because it turns out we can still build a pretty good model with all these assumptions.

The first simplification we're going to make is using a *Bag of Words* approach, which means we'll ignore *where* words occur in a tweet and even which specific tweet a word belongs to. Instead, we'll just group all of the words together in a giant "bag", and count how often certain words occur in the negative bag and how often they occur in the positive bag, which will simplify the model a lot.

*It is important to note that this will discard a lot of the information in each tweet, for example removing the place of the word in the tweet also removes the information of what words were next to it and other important contextual elements. But as we will see, even with such a simple model of the words we can still make reasonably accurate predictions.*

Start by sorting out the words in these tweets into 2 lists; `positive_words` and `negative_words`, containing all of the words in tweets with the positive sentiment label $4$ or the negative sentiment label $0$ respectively. Below is some code to get you started, looping through each row of the DataFrame, where `sentiment` is the sentiment label for a tweet and `words` is the parsed list of words of that same tweet.

Complete the code below to fill in both these lists. The lists should be completely *flat*, which means they should each be a single list containing all the words, and so no *list of lists* or other similar structures. The `positive_words` list should contain *all* of the words from tweets with a positive sentiment label, and similarly the `negative_words` list should contain *all* of the words from tweets with a negative sentiment label. Words can of course occur multiple times in each list if these words occur several times in the same tweet or in several different tweets, as we're going to use these lists to count the occurrences of words for the two different sentiments.
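The difference between a flat list and a *list of lists* is easy to see on a toy example. The two tiny "tweets" below are made up, and the variable names are only for illustration:

```python
tweets = [['good', 'morning'], ['so', 'good']]

nested = []
flat = []
for words in tweets:
    nested.append(words)   # adds the whole list as ONE element
    flat.extend(words)     # adds each word as its own element

print(nested)
print(flat)
```

The `nested` result has two elements (both lists), while `flat` has four elements (all strings), with `'good'` appearing twice because it occurs in both tweets.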

These two lists are returned together by the function and are then assigned together using *tuple unpacking* again. The last lines of the cell call your function and print the first 50 words from both sentiment categories. Check that the words you get from both categories look sensible and that the output seems correct, before moving on to the next assignment.

In [None]:
def bag_words(training_data):
    positive_words = []
    negative_words = []
    
    # Loop through every row in the training data
    for row in training_data.itertuples(False, 'Row'):
        sentiment = row.sentiment
        words = row.parsed
        
        # YOUR CODE HERE
    
    return positive_words, negative_words


# Create the positive and negative bags of words for the training data
pos_words, neg_words = bag_words(training_data)
  
print("\nA list of the 50 first words with a positive sentiment:")
print(pos_words[:50])

print("\nA list of the 50 first words with a negative sentiment:")
print(neg_words[:50])

In [None]:
assert type(pos_words[0]) == str, 'pos_words must be a single flat list with all words from every positive tweet.'
assert type(neg_words[0]) == str, 'neg_words must be a single flat list with all words from every negative tweet.'

assert len(pos_words) > 5*10**6, 'pos_words must be one big list with all words from every positive tweet.'
assert len(neg_words) > 5*10**6, 'neg_words must be one big list with all words from every negative tweet.'

print('Solution seems correct!')

### Assignment 2a: Starting count for each word

The next step in the model will be to count how often each word occurs in the positive and negative bags. There are quite a few ways to count strings in Python, but we'll be using a **dictionary**. If you haven't done so yet, follow the short introduction to dictionaries at the beginning of module 3 in *Python for Data Processing*.

The function `count_words()` should return a dictionary where each word is a *key* and the corresponding *values* are the counts of how often that specific word occurs in the `word_list`. So, the function should

* Loop through every word in the list,
* retrieve what the previous count was for that word,
* add 1 to that previous count, and insert this new count back into the dictionary.

However, while looping through all the words, you will of course encounter words you have not seen before, that do not have a previous count to retrieve, and trying to retrieve these counts would result in a *KeyError*. So, to avoid that problem, and to actually make the rest of the code easier to write, we will start with a different function first; `start_count()`. This function should take every word in the training data and give it a **starting count of 1.** We can then use this starting count dictionary when actually counting each word, and will never get a *KeyError*, because each word is already in the dictionary with a count of 1. 
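The *KeyError* problem can be reproduced in a few lines. This is only a toy demonstration with a made-up word, not part of the assignment code:

```python
counts = {}
try:
    # There is no previous count for 'hello' to retrieve yet
    counts['hello'] = counts['hello'] + 1
except KeyError as err:
    error_key = err.args[0]
    print('KeyError for', error_key)

# With a starting count already in place, the same update works
seeded = {'hello': 1}
seeded['hello'] = seeded['hello'] + 1
print(seeded)
```

The first update fails because the key does not exist, while the second succeeds because the dictionary already holds a starting value for that word.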

It might seem strange to start every count at $1$, and not at $0$, like you might expect. The underlying assumption here is that we've seen every possible word once already. Because we'll use these counts to estimate the probabilities of words occurring, this assumption will help in the next few steps to avoid having words with a probability of exactly $0$, which would cause problems in the model. The technical term for this correction is *Laplace smoothing*, but the important thing to remember is that we want to avoid having a probability of $0$ for some words occurring, so we assume every possible word occurs at least once.

Complete the `start_count()` function below. The function should return a dictionary which has every word in the training data as a key, and each word has a corresponding starting count of $1$ as its value, no matter how often the word actually occurs in the training data.

In [None]:
def start_count(training_data):
    count_dict = {}
    
    # Loop through every row in the training data
    for row in training_data.itertuples(False, 'Row'):
        # Loop through each parsed word in a tweet
        for word in row.parsed:
            # YOUR CODE HERE
    
    return count_dict


In [None]:
for test, mes in [('abcdef', 'All words should occur in the dict and have count of 1'),
                  ('aaabbb', 'Words that occur several times should only count as 1')]:
    assert start_count(pd.DataFrame(list(test), columns=['parsed'])) == dict(zip(test, [1]*len(test))), mes

print('Solution seems correct!')

### Assignment 2b: Counting words

With the `start_count()` function written, the `count_words()` function actually becomes a lot easier. The function `count_words()` now takes two arguments

* `word_list`: The list of words that should be counted. These will be either all the words from tweets with negative or positive sentiment, as constructed by you in assignment 1.
* `word_count`: The dictionary with the starting counts for each word. This will contain the dictionary you created in assignment 2a, where every word has a starting count of 1.

The function `count_words()` should modify the `word_count` dictionary and should return that same dictionary, where each word is a *key* and the corresponding *values* are the counts of how often that specific word occurs in the `word_list`, plus one additional count for the starting counts with Laplace smoothing applied. Complete the code for the `count_words()` function below, the steps for this function now are:

* Loop through every word in the list,
* retrieve what the previous count was for that word,
* add 1 to that previous count, and insert this new count back into the dictionary.

At the end of this code cell, your function is called for both the positive word list and the negative word list, resulting in two count dictionaries. The final step prints these counts for 10 words from both dictionaries. Make sure these results look sensible and output seems correct, before moving on to the next assignment.

In [None]:
def count_words(word_list, word_count):
    # YOUR CODE HERE
    
    return word_count


# Create the positive and negative word counts for the training data
pos_count = count_words(pos_words, start_count(training_data))
neg_count = count_words(neg_words, start_count(training_data))

print(f'{"Word":20s}| {"Pos count":20s}| {"Neg count":20s}\n' +60*'=')
for word in pos_words[:10]:
    print(f'{word:20s}| {str(pos_count[word]):20s}| {str(neg_count[word]):20s}')

In [None]:
assert count_words(list('ab'), {'a': 1, 'b': 1}) == {'a': 2, 'b': 2}, 'Words are not added to the counts correctly' 
assert count_words(list('aaaabb'), {'a': 1, 'b': 1}) == {'a': 5, 'b': 3}, 'Multiple occurrences counted incorrectly'

print('Solution seems correct!')

# Modelling probabilities

With the words divided into positive and negative sentiment bags and all the word counts made for both bags, we can start estimating the model probabilities. We want to model the probability of the sentiment of a tweet being positive or negative, so for this we will use a random variable $S$. This random variable $S$ has a *Binomial* distribution with just a single trial (which is also called a *Bernoulli* distribution) and has probability $r$ of having a positive sentiment.

$$S \sim B(1, r)$$

This random variable $S$ has two possible outcomes; $S=p$ for when the sentiment of a tweet is positive and $S=n$ for when the sentiment of a tweet is negative. In order to predict the sentiment of a tweet $t$, we need to estimate the probability of the sentiment of that tweet being positive *given the words in that tweet*, i.e. $P(S=p\ |\ t)$ for some tweet $t$.

If we estimate the probability $P(S=p\ |\ t)$ to be greater than $0.5$, then we can classify the tweet $t$ as having a positive sentiment and if we estimate the probability $P(S=p\ |\ t)$ to be less than $0.5$, then we can classify the tweet $t$ as having a negative sentiment. Modelling the conditional probability $P(S=p\ |\ t)$ directly is hard, but fortunately we can rewrite the conditional probability using *Bayes' Rule*:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t)}$$

Next, we can rewrite the marginal probability of the tweet using the *Law of Total Probability*:

$$P(t)\ =\ P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=n)$$

Combining the two gives:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=n)}$$
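To get a feel for the combined equation, the sketch below plugs in some made-up numbers. The function name `posterior_positive` and the likelihood and prior values are assumptions for illustration only; they are not estimated from the tweet data.

```python
def posterior_positive(lik_pos, lik_neg, prior_pos):
    """P(S=p | t) from the two likelihoods and the prior P(S=p)."""
    prior_neg = 1 - prior_pos
    numerator = lik_pos * prior_pos
    # Law of Total Probability: P(t) summed over both sentiments
    evidence = numerator + lik_neg * prior_neg
    return numerator / evidence

# Hypothetical values: P(t|S=p) = 0.003, P(t|S=n) = 0.001, P(S=p) = 0.4
result = posterior_positive(lik_pos=0.003, lik_neg=0.001, prior_pos=0.4)
print(result)
```

Even though the prior here slightly favours the negative class, the tweet is three times as likely under the positive class, so the posterior ends up at about two thirds in favour of a positive sentiment.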

## Prior probabilities

The probabilities $P(S=p)$ and $P(S=n)$ are called the **prior** probability of the sentiment. They represent the probability of a tweet being positive or negative, *without* knowing anything about the content of the tweet yet. It is possible to estimate these probabilities from the data, or use another source of information about how likely a tweet is to have a positive sentiment in the first place. 

The simplest estimate would be to just count how often tweets are positive and how often they are negative in the data set, without considering any information about the specific tweets. If there are a lot more positive sentiment tweets than negative sentiment tweets, then the prior probability of a tweet being positive will be higher, regardless of the content of any specific tweet. This is what we'll try next in the assignment.

### Assignment 3: Counting sentiments

Complete the function `count_sentiments()`, which should count the number of times the negative sentiment with label $0$ and the positive sentiment with label $4$ occur in the data set. The function takes one argument, `sentiment_list`, which is a list with all the sentiment labels in the data set, and should return the dictionary `sentiment_count`, which has the sentiment labels as *keys* and the counts for each label as their respective *values*.

The function here should be pretty similar to what you did in assignment 2b for counting the words, except you're counting sentiment labels instead. However, here there is no dictionary with starting counts, so you will actually get a *KeyError* when you try to retrieve the count for a sentiment that you have not counted before. Therefore, your code will need to check if a sentiment is already a key in the dictionary and there is an existing count you can use, or if this is a sentiment you've not seen before and you'll need to insert a new key into the dictionary. You can check if a key occurs in a dictionary using the `in` operator.
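As a small reminder of how the membership check behaves on a dictionary, consider this toy example (the dictionary contents are made up):

```python
label_count = {0: 3}

has_zero = 0 in label_count    # 0 is a key
has_four = 4 in label_count    # 4 has not been inserted yet
has_three = 3 in label_count   # 3 is a value, not a key

print(has_zero, has_four, has_three)
```

Note that `in` only looks at the *keys* of a dictionary, never at the values.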

The code below will run your counting function for both the training data and the full data set. Based on these counts, we can then make estimates for the prior probabilities for the sentiment $P(S=p)$ and $P(S=n)$ in the next step.

In [None]:
def count_sentiments(sentiment_list):
    sentiment_count = {}
    
    # YOUR CODE HERE
    
    return sentiment_count


print('\nDistribution of sentiments in the training data')
print(count_sentiments(list(training_data['sentiment'])))

print('\nDistribution of sentiments in the complete data set')
print(count_sentiments(list(data['sentiment'])))

### Discounting the Prior

In the training set it seems about half the tweets have a positive label and half the tweets have a negative label, and in the full data set it even looks like *exactly* half the tweets are positive and half are negative. It seems unlikely that a tweet is naturally perfectly equally likely to have a positive or a negative sentiment, and so probably this data set was manually constructed to have a balanced distribution of both classes. This means the counting information of the sentiments can't actually be used to compute a more informative prior and so we'll need to use a different solution for the prior.

If we don't have any information about how the *prior* is distributed, the most common assumption is still to use a *uniform* prior, where every outcome is equally likely. This means the Binomial distribution of $S$ has $r = 0.5$ and that $P(S=p)\ =\ P(S=n)$.

Using the uniform prior, we can substitute $P(S=p)$ for $P(S=n)$ in the Bayes equation above:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p) P(S=p)}{P(t\ |\ S=p) P(S=p)\ +\ P(t\ |\ S=n) P(S=p)}$$

Now we can simplify the equation by dividing both the numerator and the denominator of the right-hand side by $P(S=p)$, which gives us:

$$P(S=p\ |\ t)\ =\ \frac{P(t\ |\ S=p)}{P(t\ |\ S=p)\ +\ P(t\ |\ S=n)}$$

So using a uniform prior actually means we don't need to model the prior at all!
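As a quick check with made-up numbers, suppose $P(t\ |\ S=p) = 0.003$ and $P(t\ |\ S=n) = 0.001$ (values chosen purely for illustration). With the uniform prior written out explicitly we get

$$P(S=p\ |\ t)\ =\ \frac{0.003 \cdot 0.5}{0.003 \cdot 0.5\ +\ 0.001 \cdot 0.5}\ =\ \frac{0.0015}{0.0020}\ =\ 0.75$$

and with the simplified equation we get exactly the same result:

$$P(S=p\ |\ t)\ =\ \frac{0.003}{0.003\ +\ 0.001}\ =\ 0.75$$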

### Likelihoods $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$

With this simplification, the only remaining terms that need to be modeled in order to compute $P(S=p\ |\ t)$ are $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$. These are called the **likelihoods**, because they model how *likely* it is for a specific tweet to  occur, given that you know the sentiment is positive or that you know the sentiment is negative. These likelihoods will form the basis of our model.

Using the likelihoods to model a posterior probability is actually at the core of any Bayesian model. The posterior probability, i.e. the probability of a sample being a specific class *given the data it contains*, is often hard to directly model, so instead we model the likelihoods and apply Bayes' rule. These likelihoods can be a bit unintuitive, as in some sense they actually model the inverse of what we want to model, i.e. what is the likelihood of a specific sample occurring given you know it comes from that specific class. However, using the equation above we can take those likelihood models and convert them into a proper posterior probability of a specific class, given the data of the sample.

To actually model these likelihoods, we'll need one last simplification, which will give us the namesake of the algorithm. This will be the last theoretical step before we have all the pieces to start computing the probabilities and putting everything together.

## Naive Bayes

Modelling the likelihood of a whole tweet $t$ occurring is complicated, but using the *bag of words* assumption from earlier, we can actually just model the joint probability of *all individual words* occurring together, as we don't need to worry about their position in the tweet.

$$P(t\ |\ S=p)\ =\ P(w_1, \dots, w_m\ |\ S=p)$$

This is still a very large joint probability, but we will make one last simplifying assumption, which will turn this Bayesian model into *Naive Bayes*. We will assume the probability of each word occurring to be **independent**. So if one word occurs, we assume that this doesn't influence the probability of other words occurring.

This assumption definitely won't actually hold for these tweets, as some words are much more likely to occur together with other words, while others are very unlikely to occur together in a tweet. This is why this assumption is called *naive*, as word occurrences will never actually be independent of one another, but this assumption greatly simplifies our model yet again, making the problem much more solvable.

If the probability of each word occurring is independent, then the probability of all these words occurring *together* simply becomes the product of all the individual word probabilities:

$$P(w_1, \dots, w_m\ |\ S=p)\ \ =\ \ P(w_1\ |\ S=p) \cdot P(w_2\ |\ S=p) \cdot \dots \cdot P(w_m\ |\ S=p)$$

Or more formally:

$$P(w_1, \dots, w_m\ |\ S=p)\ =\ \prod_{i=1}^m P(w_i\ |\ S=p)$$

where $w_i$ is the $i^{th}$ word in the tweet and $P(w_i\ |\ S=p)$ is the probability of that word occurring given that the tweet is positive.
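The product can be illustrated on a toy example. The word probabilities below are made-up values rather than estimates from the tweet data, and the variable names are only for illustration:

```python
import math

# Hypothetical values for P(w | S=p)
word_probs_pos = {'good': 0.02, 'morning': 0.005}

tweet = ['good', 'morning', 'good']

# Naive assumption: multiply the individual word probabilities
likelihood = math.prod(word_probs_pos[word] for word in tweet)
print(likelihood)
```

A word that occurs twice in the tweet contributes its probability twice, since the product runs over every word position $i$ from $1$ to $m$.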

With this final simplification, all that is left to do for the model is estimate these individual word probabilities!


### Assignment 4: Word probabilities

After all this theory and all these simplifying assumptions, we have actually broken down the problem into something that is very easy to estimate from the data we have. The only probabilities we will need to estimate are $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$ for every word $w$ and we can build a complete model to predict the sentiment of a tweet.

At the start of this notebook you divided the words into two bags, one for words from tweets with a positive sentiment and one for words from tweets with a negative sentiment. We can use counts from the positive sentiment bag to estimate $P(w\ |\ S=p)$ and the counts from the negative sentiment bag to estimate $P(w\ |\ S=n)$. Given all the words you've seen with a positive sentiment, how would you estimate the probability of seeing that specific word $w$, using the counts of all these positive words?

#### Step 1: Building the dictionary

Write a function that takes a dictionary of word counts and outputs a dictionary of word probabilities. This word count dictionary can be either positive sentiment word counts to compute $P(w\ |\ S=p)$ or negative sentiment word counts to compute $P(w\ |\ S=n)$.

Start by looping over all the words that have been counted and estimate the probability of that specific word occurring, using the training data counts you created earlier. How would you estimate these probabilities using the data you have? At the end of this loop, an estimated probability of occurrence for *every* counted word should be added to the `probabilities` dictionary.

Complete this part of the function first, so build the dictionary containing the word probabilities for all the counted words, before moving to the next step.

#### Step 2: Adding unknown words

As stated earlier, it will be important to ensure that words which don't occur in the training data, don't result in a 0 probability. Just one word having a 0 probability would multiply all other word probabilities with 0, using the naive estimate of the joint word probabilities. This would result in a 0 likelihood for that whole tweet, no matter what the other words in the tweet were, which would lead to problems in the calculation and less informative predictions altogether. So instead, the probability for an unknown word should always be estimated as

$$\frac{1}{total\_word\_count + 1}$$

This is a continuation of the *Laplace smoothing* as used earlier in the word counting, where we assumed every word occurs at least once. By using this estimate, the probability of an unknown word will be lower than that for any of the words that do occur in the bag for that sentiment, as any word that *did occur* in that bag will have a count of *at least 2*. Using this smoothing, the probability of unknown words will be small, but not zero!

We'll want to add this unknown word probability to that same `probabilities` dictionary, however we don't actually have a word we can use as a *key* in this case, of course. So, you should use a separate string `"unknown_word"`, reserved especially for storing the probability of an unknown word. This reserved "word" should never occur naturally in any tweet, as all *underscores* (`_`) were removed with the rest of the punctuation, so we can safely use it as a special separate key. Add this `"unknown_word"` key to the `probabilities` dictionary, using the *Laplace smoothing* estimate for the probability of unknown words. You'll need to add this separate key only once, as there is only one probability to use for *all* unknown words, after you've completed adding all the regular words from step 1.

Finally, the function should return the `probabilities` dictionary, containing all these estimates of probabilities of occurrence, based on the training data counts. The last step of this cell will print the probabilities of some words occurring in tweets with positive or negative sentiment.


In [None]:
def word_probabilities(word_counts):
    probabilities = {}
    
    # Compute the sum of all word counts
    total_count = sum(word_counts.values())
    
    # YOUR CODE HERE
    
    return probabilities


# Estimate the positive and negative word probabilities for the training data
pos_probabilities = word_probabilities(pos_count)
neg_probabilities = word_probabilities(neg_count)


print(f'{"Word":20s}| {"Positive : P(w|S=p)":30s}| {"Negative : P(w|S=n)":30s}\n' + 80*'=')
for word in pos_words[:10]:
    print(f'{word:20s}| {str(pos_probabilities[word]):30s}| {str(neg_probabilities[word]):30s}')

In [None]:
assert abs(sum(pos_probabilities.values()) - 1) < 10**-6, "Sum of all probabilities should be 1."
assert abs(sum(neg_probabilities.values()) - 1) < 10**-6, "Sum of all probabilities should be 1."

assert 'unknown_word' in pos_probabilities, "The key 'unknown_word' should contain the unknown word probability."
assert 'unknown_word' in neg_probabilities, "The key 'unknown_word' should contain the unknown word probability."

print('Solution seems correct!')

### Inspecting the probabilities

These two dictionaries, `pos_probabilities` and `neg_probabilities`, should now contain the estimates for $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$, for every word $w$, respectively. These are the only estimated probabilities we will need for the entire *Naive Bayes* model!

Take a look at the results printed above, showing these probabilities for the first 10 words encountered, and make sure that they make sense to you. The probabilities should all be pretty small numbers, as the probability of one specific word occurring in a tweet should be quite low. Note that the number `1e-05` represents $1 \cdot 10^{-5}$ or $0.00001$, which is called [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation).
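If the scientific notation is hard to read, the short sketch below shows that both ways of writing the number are the same value in *Python*, and how to print a fixed number of decimals instead. The variable names are only for illustration:

```python
small = 1e-05

same_value = (small == 0.00001)   # both literals denote the same float
fixed = f'{small:.5f}'            # fixed-point formatting with 5 decimals
default = repr(0.00001)           # Python's default representation

print(same_value, fixed, default)
```

By default *Python* switches to scientific notation for very small numbers, which is why the probabilities in the table above are shown in that form.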

**Q1. Compare the positive and negative probabilities for these 10 words. Do the words for which the positive probability is larger, indeed correspond to words which you would consider more likely to be contained in a positive tweet? Explain your answer.**

*Your answer goes here.*


### Assignment 5a: Computing the likelihood

Next, write a function to combine these word probabilities into the complete likelihood for a tweet $P(t\ |\ S)$, using the *naive* assumption of words being independent. 

Your function should compute the likelihood based on the list of parsed `words` from the tweet and dictionary of word `probabilities`, given the sentiment is positive or negative. When using the dictionary $P(w\ |\ S=p)$, your function should compute $P(t\ |\ S=p)$ and when using $P(w\ |\ S=n)$, your function should compute $P(t\ |\ S=n)$.

Your function should combine all the individual word probabilities into the likelihood for the whole tweet. If the probability of some individual word is not stored in the dictionary, you should use the stored `"unknown_word"` probability instead.

***Note:*** If you are unsure if your code for this function is correct, continue with the next two cells anyway. This function will be hard to test just by itself and you'll need a couple more pieces to test it properly. Once you are working on the *5c: Testing the predictions* cell, you can better inspect the results of this function too.

In [None]:
def compute_likelihood(words, probabilities):
    # YOUR CODE HERE
    


### Assignment 5b: Computing the posterior probability

Now, write the function that computes the posterior probability $P(S=p\ |\ t)$, using the assumption of a *uniform prior* for $S$.

The function should compute the posterior $P(S=p\ |\ t)$ based on a list of parsed `words` for a tweet $t$ and the two dictionaries containing the estimates for $P(w\ |\ S=p)$ and $P(w\ |\ S=n)$. Use the function you just wrote, `compute_likelihood()` to construct $P(t\ |\ S=p)$ and $P(t\ |\ S=n)$.
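To see what the *uniform prior* does, it can help to write out Bayes' rule for the two sentiments. With $P(S=p) = P(S=n) = 0.5$, the prior appears in every term and cancels:

$$P(S=p\ |\ t) = \frac{P(t\ |\ S=p)\ P(S=p)}{P(t\ |\ S=p)\ P(S=p) + P(t\ |\ S=n)\ P(S=n)} = \frac{P(t\ |\ S=p)}{P(t\ |\ S=p) + P(t\ |\ S=n)}$$

As a made-up numerical example (these values are not from the data set), if $P(t\ |\ S=p) = 2 \cdot 10^{-10}$ and $P(t\ |\ S=n) = 5 \cdot 10^{-11}$, then:

$$P(S=p\ |\ t) = \frac{2 \cdot 10^{-10}}{2 \cdot 10^{-10} + 5 \cdot 10^{-11}} = 0.8$$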

In [None]:
def compute_posterior(words, positive_probs, negative_probs):
    # YOUR CODE HERE
    


### Assignment 5c: Testing the predictions

The `compute_posterior()` function should be the last piece of the puzzle, as you can now use it to compute $P(S=p\ |\ t)$, which is the probability of the sentiment being positive, given the content of the tweet. If this posterior probability $P(S=p\ |\ t)$ is greater than $0.5$, then the tweet is more likely to have a positive sentiment according to the Naive Bayes model, and similarly, if the posterior probability $P(S=p\ |\ t)$ is less than $0.5$, then the tweet is more likely to have a negative sentiment.

Write four short test sentences and try out the functions. You can use the `parse()` function from the very first cell to split your sentences into a list of parsed words. Apply the `compute_posterior()` function to this list of `words`, together with the `pos_probabilities` and `neg_probabilities` dictionaries, which were estimated from the training data. Print the words from the sentence and the computed posterior probability $P(S=p\ |\ t)$.

***Note:*** Make sure that these results make sense to you, before continuing to the next cells. You can add print statements to the `compute_likelihood()` and `compute_posterior()` functions to show any intermediate computations. If your test sentences are not too long, you can still check all the computations that are done before the final probability is returned. Once you are convinced your functions work well and computation of the probability is correct, make sure to remove any print statements you added, before handing in your code.
 

In [None]:
# YOUR CODE HERE

### Assignment 6a: Classifying tweets

Next, write a simple function `classify_tweet()` that takes a list of parsed `words`, together with the `pos_probabilities` and `neg_probabilities` dictionaries, and classifies whether the sentiment is positive or negative. If it is more probable that these words come from a tweet with a positive sentiment, the function should return the label $4$, and if the tweet is more likely to have a negative sentiment, the function should return the label $0$. Use your function `compute_posterior()` to determine the posterior probability.


In [None]:
def classify_tweet(words, positive_probs, negative_probs):
    # YOUR CODE HERE
    

### Assignment 6b: Determining model accuracy

Finally, write a function `model_accuracy()` which loops through all the rows in a data set and counts how many of the tweets are classified correctly and how many are classified incorrectly. This function should return the accuracy of the constructed model on that data set, i.e. the percentage of tweets in the set that were classified correctly.
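To make the definition of accuracy concrete, here is a minimal sketch on two made-up lists of labels. The helper `accuracy()` and both lists are hypothetical and separate from the `model_accuracy()` function you are asked to write; the sketch returns a fraction between $0$ and $1$, which you can multiply by $100$ to get a percentage.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up labels, using 4 for positive and 0 for negative.
predicted_labels = [4, 0, 4, 4, 0]
actual_labels = [4, 0, 0, 4, 4]

print(accuracy(predicted_labels, actual_labels))  # 3 of the 5 labels match: 0.6
```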

In [None]:
def model_accuracy(data_set, positive_probs, negative_probs):
    correct = 0
    incorrect = 0
    
    # Loop through every row in the data set
    for row in data_set.itertuples(False, 'Row'):
        words = row.parsed
        sentiment = row.sentiment
        
        # YOUR CODE HERE


print('\nThe accuracy of the model on the training data is')
print(model_accuracy(training_data, pos_probabilities, neg_probabilities))

print('\nThe accuracy of the model on the testing data is')
print(model_accuracy(testing_data, pos_probabilities, neg_probabilities))

### Comparing *training* and *testing* results

The first result printed above is the accuracy on the training data, i.e. how well the model is able to reproduce the labels from the part of the data we used to construct that model. This indicates how well the model fits the data we used to *train* it.

However, it is possible to get perfect accuracy on your training data, just by making your model store all the training examples and, for every "prediction", look up the stored label of that sample (which is very similar to what a *Nearest Neighbour* algorithm would do). So, it is standard machine learning practice to also use data samples the model *hasn't* seen yet as testing data, in order to get a good sense of how well the model captures the *overall* pattern in the data set.
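The sketch below illustrates this memorisation effect with a deliberately silly "model" that only stores its training examples. All names and the tiny data sets are made up for this illustration, and the fallback label for unseen texts is an arbitrary assumption.

```python
# Hypothetical toy data: (text, label) pairs, with 4 = positive and 0 = negative.
train = [("great day", 4), ("so sad", 0), ("love it", 4)]
test = [("great movie", 4), ("sad news", 0)]

# The "model" is just a lookup table of the training examples.
memory = dict(train)

def predict(text, default=4):
    # Return the stored label, or an arbitrary default for unseen texts.
    return memory.get(text, default)

def accuracy(samples):
    # Fraction of samples whose predicted label matches the true label.
    return sum(predict(text) == label for text, label in samples) / len(samples)

print(accuracy(train))  # 1.0: every training example is simply looked up
print(accuracy(test))   # 0.5: unseen texts all receive the default label
```

The perfect training score says nothing about how this "model" handles new tweets, which is exactly why a separate testing set is needed.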

Testing data is the part of the data set which wasn't used to build the model, but it can be used to estimate model performance on *new* data samples. If this system were deployed to classify tweets on Twitter, it would also be applied to tweets that were not in the data set. So, depending on the quality of your data set, testing data can also be used to give an estimate of *real world* performance.

When building and evaluating any machine learning model, it is always essential to split the data into train and test sets, to get a better estimate of the model performance. Compare the accuracy of the *Naive Bayes* model on the training and the testing set for the sentiment analysis data.

**Q2. Do these results make sense to you? If so, why? If not, why not?**

*Your answer goes here.*

**Q3. What do these results tell you about how well the model solved this problem? Explain your answer.**

*Your answer goes here.*
