**<span style="color:#448844">Note</span>** This notebook is meant to be interactive. Launch this notebook in Jupyter to see its full potential.

Name: Aaron Palpallatoc

Section: S11

# Naive Bayes Exercise

In this notebook, you will learn to implement a Naive Bayes classifier using sklearn. We will create two classifiers: one that assumes a Gaussian distribution, and another that assumes a multinomial distribution.

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with "A: " on them. The answer must strictly consume one line only.
* You are expected to search for how some functions work on the Internet or in the docs.
* There are commented markdown cells that have crumbs. Do not delete them or separate them from the cell originally directly below it.  
* You may add new cells for "scrap work" as long as the crumbs are not separated from the cell below it.
* The notebooks will undergo a "Restart and Run All" command, so make sure that your code is working properly.
* You are expected to understand the data set loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (12.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'

# Fix the seed of the random number 
# generator so that your results will match ours
np.random.seed(1)

%load_ext autoreload
%autoreload 2

# Gaussian Naive Bayes


For our first dataset (the iris dataset), we assume that the features follow a Gaussian distribution within each class.

**Dataset:**

Our first data set is the iris data set which contains 3 classes of 50 instances each. Each class refers to a type of iris plant.

Attribute Information:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Class (Species):
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica

## Preprocessing our data

Let's load the iris dataset.

In [2]:
import pandas as pd

iris = pd.read_csv('iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
iris.groupby('species').describe()

Unnamed: 0_level_0,sepal_length,sepal_length,sepal_length,sepal_length,sepal_length,sepal_length,sepal_length,sepal_length,sepal_width,sepal_width,...,petal_length,petal_length,petal_width,petal_width,petal_width,petal_width,petal_width,petal_width,petal_width,petal_width
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8,50.0,3.418,...,1.575,1.9,50.0,0.244,0.10721,0.1,0.2,0.2,0.3,0.6
versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0,50.0,2.77,...,4.6,5.1,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9,50.0,2.974,...,5.875,6.9,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5


Right now, we have to convert our nominal labels (word labels: setosa, versicolor, virginica) into numerical labels (number labels: 0, 1, 2; 0 for setosa, 1 for versicolor, 2 for virginica).

In [4]:
from sklearn import preprocessing

This will convert the unique nominal values there are in iris["species"] to a unique number

In [5]:
label_enc = preprocessing.LabelEncoder()
label_enc.fit(iris["species"])

We can check the original labels

This will transform the list to match the numerical code mapping

In [6]:
label_enc.transform(iris["species"])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let's see the mapping of the original nominal labels and the numerical codes

In [7]:
print("Original labels:", label_enc.classes_, "\n")

print("Mapping from nominal to numerical labels:")
print(dict(zip(label_enc.classes_,label_enc.transform(label_enc.classes_))))

Original labels: ['setosa' 'versicolor' 'virginica'] 

Mapping from nominal to numerical labels:
{'setosa': 0, 'versicolor': 1, 'virginica': 2}
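
To make the mapping less of a black box, here is a minimal pure-Python sketch of what the encoding amounts to. It assumes, as the output above suggests, that the codes follow the sorted order of the unique labels. The `species_sample` list is a made-up sample, not the real iris column.

```python
# Hypothetical sample of nominal labels (not the real iris column)
species_sample = ["virginica", "setosa", "versicolor", "setosa", "virginica"]

# Sort the unique labels, then number them in that order
classes = sorted(set(species_sample))
label_to_code = {label: code for code, label in enumerate(classes)}

# Replace every label with its numerical code
encoded = [label_to_code[label] for label in species_sample]

print(label_to_code)
print(encoded)
```

The dictionary plays the role of the fitted encoder, and the list comprehension plays the role of `transform`.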


Now that we have the numerical encoding and the mapping, we can now change the `species` column to its numerical mapping

In [8]:
iris["species"] = label_enc.transform(iris["species"])

Let's see the results now:

In [9]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Like in the previous notebooks, we will separate our `X` from our target `y` (species). 


__Note__: `iris.values[:,:-1]` will get all rows, and all columns except for the last column


__Note__: `iris.values[:,-1]` will get the last column only. We cast the labels to integers because the mixed float/integer columns come back from `iris.values` as a float array.

In [10]:
X = iris.values[:,:-1]
y = iris.values[:,-1].astype(int)

In [11]:
print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (150, 4)
y shape:  (150,)
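
If the slicing notation is unfamiliar, the small sketch below shows the same two slices on a made-up 3x3 array (the values are hypothetical, chosen only to resemble two features plus a label column).

```python
import numpy as np

# Hypothetical table: two feature columns followed by one label column
table = np.array([[5.1, 3.5, 0.0],
                  [6.0, 2.9, 1.0],
                  [6.5, 3.0, 2.0]])

features = table[:, :-1]             # all rows, every column but the last
labels = table[:, -1].astype(int)    # all rows, last column only, cast to int

print(features.shape)
print(labels, labels.dtype)
```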


Let's separate the training from the test set. 

Set the test size to `0.3`, make sure to stratify based on `species`/`y`. Also set the `random_state` to `42` so our results match.

In [12]:
from sklearn.model_selection import train_test_split
# train_test_split?

In [13]:
# write code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [14]:
X_train.shape

(105, 4)

**Sanity Check:** X_train should have a shape of `(105, 4)`
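
To see what stratifying buys us, the sketch below runs the same kind of split on stand-in data with the same layout as iris (150 rows, 3 balanced classes). The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows, 3 balanced classes of 50 each
X_demo = np.arange(150).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2], 50)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

With `stratify`, every class keeps the same 70/30 proportion, so no class ends up under-represented in either split.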

## Building our model
Because our features `X` are continuous values, we will use `sklearn`'s `GaussianNB` model.

In [15]:
from sklearn.naive_bayes import GaussianNB

Initialize a `GaussianNB` model

In [16]:
# write code here
iris_nb = GaussianNB()

Train it

In [17]:
# write code here
iris_nb.fit(X_train, y_train)

And, get its training predictions

In [18]:
# write code here
predictions = iris_nb.predict(X_train)

predictions

array([1, 1, 0, 2, 1, 2, 0, 0, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 2, 2, 0, 2,
       2, 2, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 2, 0, 1, 0,
       2, 1, 0, 2, 1, 2, 1, 0, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 2, 1, 1, 2,
       2, 0, 2, 1, 1, 2, 0, 2, 2, 1, 0, 2, 2, 0, 0, 2, 2, 2, 0, 2, 1, 2,
       2, 0, 1, 1, 1, 1, 1, 0, 2, 1, 2, 0, 0, 1, 0, 1, 0])

We will be computing for the accuracy multiple times in this notebook, so let's create a function for this.

`compute_accuracy()` will compute for the accuracy given two vectors of equal length

__Inputs:__
- `predictions`: A numpy array of shape `(N,)` consisting of `N` samples representing the predicted values
- `actual`: A numpy array of shape `(N,)` consisting of `N` samples representing the actual (target) values

__Outputs:__
- `accuracy`: A scalar representing the percentage of elements where `predictions` and `actual` match out of the total number of elements

In [19]:
def compute_accuracy(predictions, actual):
    accuracy = np.sum(predictions == actual) / len(actual)
    return accuracy * 100

Let's see how well our model performed on the training data

In [20]:
print("Training accuracy: ", compute_accuracy(predictions, y_train), "%")

Training accuracy:  98.09523809523809 %


That's a good result. Let's see if it will perform well on our test set.

In [21]:
# write code here
predictions = iris_nb.predict(X_test)

predictions

array([2, 1, 1, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,
       1, 2, 2, 1, 1, 1, 1, 0, 1, 2, 1, 0, 2, 0, 0, 0, 0, 2, 1, 0, 1, 2,
       1])

In [22]:
print("Test accuracy: ", compute_accuracy(predictions, y_test), "%")

Test accuracy:  91.11111111111111 %


**Sanity Check:** You should get a 91.111% accuracy

## Checking the learned parameters
We can also peer into the parameters the model learned.

This is how you get the number of instances (of each class) the model received as the training set

In [23]:
iris_nb.class_count_

array([35., 35., 35.])

You can also get the priors the model learned

In [24]:
# write code here
iris_nb.class_prior_

array([0.33333333, 0.33333333, 0.33333333])

**Question #1:** What are the priors computed for each class?

<!--crumb;qna;Question: What are the priors computed for each class?-->

A: Iris Setosa: 0.33333333, Iris Versicolour: 0.33333333, Iris Virginica: 0.33333333

**Question #2:** How are the priors calculated?

<!--crumb;qna;Question: How are the priors calculated?-->

A: They are calculated from the relative frequency of each class in the training data. Since the training data is split equally (35 instances per class), all three classes have the same prior of 0.33333333.

Gaussian Naive Bayes classifiers have **`k * d * 2`** number of parameters (not including the priors)

> where <br>
> **`k`** - number of classes <br>
> **`d`** - number of dimensions/features <br>
> **`2`** - because we calculate for the means and variances of each feature <br>

Get the computed means of the model

In [25]:
# write code here
means = iris_nb.theta_

means.shape

(3, 4)

**Question #3:** What is the shape of the computed means?

<!--crumb;qna;Question: What is the shape of the computed means?-->

A: (3, 4)

Get the computed variances of the model

In [26]:
# write code here
variances = iris_nb.var_

variances.shape

(3, 4)

**Question #4:** What is the shape of the computed variances?

<!--crumb;qna;Question: What is the shape of the computed variances?-->

A: (3, 4)
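
To connect the `(k, d)` shapes to the `k * d * 2` parameter count, here is a simplified from-scratch sketch on made-up toy data. It is not sklearn's implementation (for instance, it leaves out the small variance smoothing that `GaussianNB` applies); the names `X_toy`, `y_toy`, and `predict_one` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 samples, d = 2 features, k = 2 classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(y_toy)

# One mean and one variance per (class, feature) pair -> shape (k, d) each
toy_means = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])
toy_variances = np.array([X_toy[y_toy == c].var(axis=0) for c in classes])
toy_priors = np.array([(y_toy == c).mean() for c in classes])

n_params = toy_means.size + toy_variances.size  # k * d * 2

def predict_one(x):
    # Sum of per-feature Gaussian log-densities for each class...
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * toy_variances) + (x - toy_means) ** 2 / toy_variances,
        axis=1,
    )
    # ...plus the log prior; the class with the highest score wins
    return classes[np.argmax(log_likelihood + np.log(toy_priors))]

print(toy_means.shape, toy_variances.shape, n_params)
print(predict_one(np.array([1.1, 2.1])), predict_one(np.array([3.1, 0.4])))
```

The "naive" part is the sum over features: each feature contributes its own one-dimensional Gaussian, as if the features were independent given the class.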

____________

# Multinomial Naive Bayes

For our second dataset (spam/not spam), we assume that the data follows a multinomial distribution.

**Dataset:**

Our goal with this dataset is to classify a sentence as either **spam** or **not spam** (ham). 

You can check out `spam_ham.csv` for examples of spam and not spam messages. Check the file and see its body contents.

(This section is a slight modification of <a href="http://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/">Ritchie Ng's notebook</a>)

## Sample data

Before we go and train with the spam/ham dataset, we have to convert the `content` column into numbers we can crunch. In our case, our features will be the frequency of words in the data instance.

**Example:**


|                                                  | Never | gonna | give | you | up | let | down | make | cry | say | goodbye |
|--------------------------------------------------|-------|-------|------|-----|----|-----|------|------|-----|-----|---------|
|                          Never gonna give you up |   1   |   1   |   1  |  1  |  1 |  0  |   0  |   0  |  0  |  0  |    0    |
| Never gonna give you up Never gonna let you down |   2   |   2   |   1  |  2  |  1 |  1  |   1  |   0  |  0  |  0  |    0    |
|                         Never gonna make you cry |   1   |   1   |   0  |  1  |  0 |  0  |   0  |   1  |  1  |  0  |    0    |
|                          Never gonna say goodbye |   1   |   1   |   0  |  0  |  0 |  0  |   0  |   0  |  0  |  1  |    1    |

<div style="text-align: right"><sub>Reference: Never Gonna Give You Up by Rick Astley</sub></div>

In [27]:
data = ["Never gonna give you up",
        "Never gonna give you up Never gonna let you down",
        "Never gonna make you cry",
        "Never gonna say goodbye"]

First, let's convert our words all to lower case. This is a common practice.

In [28]:
for i in range(len(data)):
    data[i] = data[i].lower()
    
data

['never gonna give you up',
 'never gonna give you up never gonna let you down',
 'never gonna make you cry',
 'never gonna say goodbye']

Now, we'll count the frequency of each word in each sentence.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert the text into a matrix of word/token counts

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

The following code will get the words in our dataset

In [30]:
count_vect.fit(data)

Let's see our words/tokens/features:

In [31]:
word_features = count_vect.get_feature_names_out()
word_features

array(['cry', 'down', 'give', 'gonna', 'goodbye', 'let', 'make', 'never',
       'say', 'up', 'you'], dtype=object)

The following code computes for the counts of each word for each of our data sentences. 

It outputs a **sparse count matrix**. 

__Note:__ *Sparse* means that most entries of the matrix are 0 (see the table above). If we stored this as a normal (dense) matrix, it would take up a lot of space. To save space, the data is stored in this fashion:
> `(<sentence>, <word>)         <count>`

All combinations where the count is 0 will be ignored

In [32]:
count_sparse_matrix = count_vect.transform(data)
print(count_sparse_matrix)

  (0, 2)	1
  (0, 3)	1
  (0, 7)	1
  (0, 9)	1
  (0, 10)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	2
  (1, 5)	1
  (1, 7)	2
  (1, 9)	1
  (1, 10)	2
  (2, 0)	1
  (2, 3)	1
  (2, 6)	1
  (2, 7)	1
  (2, 10)	1
  (3, 3)	1
  (3, 4)	1
  (3, 7)	1
  (3, 8)	1


It may seem like a lot of work to save a little space, but as your data grows this will save a ton of memory.
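
The idea of storing only the non-zero cells can be sketched with a plain dictionary keyed by `(sentence, word)` coordinates. This is only an illustration of the principle; the SciPy sparse matrix that sklearn returns uses a more compact layout than a dictionary.

```python
# Dense count rows for two of the sentences over the 11 word columns
dense = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1],   # never gonna give you up
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],   # never gonna say goodbye
]

# Keep only the non-zero cells, keyed by (sentence, word) coordinates
sparse = {(i, j): count
          for i, row in enumerate(dense)
          for j, count in enumerate(row)
          if count != 0}

total_cells = len(dense) * len(dense[0])
print(len(sparse), "of", total_cells, "cells stored")

# A missing key simply means the count is 0
print(sparse[(0, 2)], sparse.get((0, 0), 0))
```

With a vocabulary of tens of thousands of words and short messages, almost every cell is zero, so the savings grow quickly.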

In [33]:
n_sentences = count_sparse_matrix.shape[0]
n_word_features = count_sparse_matrix.shape[1]

# header
for i in range(n_word_features):
    print(word_features[i], end ="\t")
print("sentence", end="\n")
    
for i in range(n_sentences):
    for j in range(n_word_features):
        print(count_sparse_matrix[i, j], end="\t")
    print(data[i], end="\n")

cry	down	give	gonna	goodbye	let	make	never	say	up	you	sentence
0	0	1	1	0	0	0	1	0	1	1	never gonna give you up
0	1	1	2	0	1	0	2	0	1	2	never gonna give you up never gonna let you down
1	0	0	1	0	0	1	1	0	0	1	never gonna make you cry
0	0	0	1	1	0	0	1	1	0	0	never gonna say goodbye


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document/sentence is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document/sentence** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
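
To make the Bag of Words idea concrete, here is a minimal pure-Python sketch on two of our sentences. It makes a simplifying assumption that tokens are just whitespace-separated words, which is not exactly how `CountVectorizer` tokenizes, but gives the same result for this data.

```python
from collections import Counter

sentences = ["never gonna give you up",
             "never gonna say goodbye"]

# Simplifying assumption: tokens are whitespace-separated words
tokenized = [sentence.split() for sentence in sentences]

# Vocabulary = sorted unique tokens across all sentences
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per sentence, one column per vocabulary word; word order is discarded
count_rows = [[Counter(tokens)[word] for word in vocabulary]
              for tokens in tokenized]

print(vocabulary)
for row in count_rows:
    print(row)
```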

Here is our data in a pandas DataFrame

In [34]:
count_dense_matrix = count_sparse_matrix.toarray()
pd.DataFrame(count_dense_matrix, columns=count_vect.get_feature_names_out(), index=data)

Unnamed: 0,cry,down,give,gonna,goodbye,let,make,never,say,up,you
never gonna give you up,0,0,1,1,0,0,0,1,0,1,1
never gonna give you up never gonna let you down,0,1,1,2,0,1,0,2,0,1,2
never gonna make you cry,1,0,0,1,0,0,1,1,0,0,1
never gonna say goodbye,0,0,0,1,1,0,0,1,1,0,0


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)

# On with the spam/ham dataset
## Preprocessing

Load the text data from the csv file

In [35]:
spamham = pd.read_csv("spam_ham.csv")
spamham.dropna(inplace=True)
spamham.head(10)

Unnamed: 0,type,location,body
0,spam,data/000/001,LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1,spam,data/000/002,Academic Qualifications available from prestig...
2,ham,data/000/003,Greetings all. This is to verify your subscrip...
3,spam,data/000/004,try chauncey may conferred the luscious not co...
4,ham,data/000/005,"It's quiet. Too quiet. Well, how about a straw..."
5,ham,data/000/006,It's working here. I have departed almost tota...
6,spam,data/000/008,The OIL sector is going crazy. This is our wee...
7,spam,data/000/009,Little magic. Perfect weekends.http://othxu.rz...
8,ham,data/000/010,Greetings all. This is a mass acknowledgement ...
9,spam,data/000/011,"Hi, L C P A X V V e I r m a A I v A o b n L A ..."


In [36]:
from sklearn import preprocessing

Before we proceed to vectorizing, let's change our label type from "spam" and "ham" to numerical values.

Use `sklearn`'s `LabelEncoder` to encode the spamham dataset's labels (`type` in the csv)

In [37]:
# write code here
label_enc = preprocessing.LabelEncoder()
label_enc.fit(spamham["type"])


Then, get the mapping so we know what the `0`s and `1`s mean later in the notebook

In [38]:
mapping = dict(zip(label_enc.classes_, label_enc.transform(label_enc.classes_)))

print("Mapping:", mapping)

Mapping: {'ham': 0, 'spam': 1}


Now, call `LabelEncoder` to label encode the `type` column

In [39]:
# write code here
spamham["type"] = label_enc.transform(spamham["type"])

spamham.head(10)

Unnamed: 0,type,location,body
0,1,data/000/001,LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1,1,data/000/002,Academic Qualifications available from prestig...
2,0,data/000/003,Greetings all. This is to verify your subscrip...
3,1,data/000/004,try chauncey may conferred the luscious not co...
4,0,data/000/005,"It's quiet. Too quiet. Well, how about a straw..."
5,0,data/000/006,It's working here. I have departed almost tota...
6,1,data/000/008,The OIL sector is going crazy. This is our wee...
7,1,data/000/009,Little magic. Perfect weekends.http://othxu.rz...
8,0,data/000/010,Greetings all. This is a mass acknowledgement ...
9,1,data/000/011,"Hi, L C P A X V V e I r m a A I v A o b n L A ..."


**Sanity Check:** The type column should now be in 1's and 0's. Make sure that they are still properly labelled.

Now, we will separate our features `X` from our labels `y`. Disregard the `location` column (it points to the text file where the text `body` came from)

In [40]:
# write code here
X = spamham["body"]
y = spamham["type"]

print("X shape : ", X.shape)
print("y shape : ", y.shape)

X shape :  (30974,)
y shape :  (30974,)


**Sanity Check:**
You should see the following:
```
X shape :  (30974,)
y shape :  (30974,)
```

In [41]:
print(X[0:5])
print(y[0:5])

0    LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1    Academic Qualifications available from prestig...
2    Greetings all. This is to verify your subscrip...
3    try chauncey may conferred the luscious not co...
4    It's quiet. Too quiet. Well, how about a straw...
Name: body, dtype: object
0    1
1    1
2    0
3    1
4    0
Name: type, dtype: int32



**Sanity Check:**
You should see the following:
```
0    LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1    Academic Qualifications available from prestig...
2    Greetings all. This is to verify your subscrip...
3    try chauncey may conferred the luscious not co...
4    It's quiet. Too quiet. Well, how about a straw...
Name: body, dtype: object
0    1
1    1
2    0
3    1
4    0
Name: type, dtype: int64
```

Split the dataset into train and test data sets. Set the test size to 30%, and `random_state` to 42. Make sure we also stratify based on the type (spam/ham).

In [42]:
# write code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [43]:
y_train.value_counts()

1    13496
0     8185
Name: type, dtype: int64

In [44]:
y_test.value_counts()

1    5784
0    3509
Name: type, dtype: int64

You should see that the class distribution is maintained in both the train and test sets (about 1.65:1, spam to ham)

### Vectorization

Let's process the data as we did in the section before. Note that we will get a new dictionary based on the training data (we won't use the *Never gonna give you up* dataset anymore).

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

# write code here
count_vect = CountVectorizer()

Get the words from the training set (we should train without knowing the words from the test set)

In [46]:
# write code here
count_vect.fit(X_train)

**Sanity Check:** This is a large dataset, it may take a few seconds.

And then get the frequency of each word in each sentence

In [47]:
# write code here
X_train_count_sparse_matrix = count_vect.transform(X_train)

__Note:__ A shorthand of these two lines is `count_vect.fit_transform()`

In [48]:
X_train_count_sparse_matrix.shape

(21681, 147622)

**Sanity Check:** The shape should be around (21681, 147622)

Let's check out the fitted vocabulary

In [49]:
count_vect.get_feature_names_out()

array(['00', '000', '0000', ..., 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような'],
      dtype=object)

**Sanity Check:** The vocabulary contains many tokens with unusual characters. Try looking from the 45,000th word onward to see more "normal" words

In [50]:
count_vect.get_feature_names_out()[45000:]

array(['demütige', 'den', 'denardo', ..., 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような'],
      dtype=object)

Now, we also have to transform our test data to our fitted vocab. 


**Note:** We should not fit the test data's vocabulary. We're going to use the word features we culled from the training dataset.

In [51]:
# write code here
X_test_count_sparse_matrix = count_vect.transform(X_test)

X_test_count_sparse_matrix.shape

(9293, 147622)

**Sanity Check:**
The number of features (dimensions, not instances) of the train and test should match

**Now we have two transformed sparse matrices:**
- X_train_count_sparse_matrix
- X_test_count_sparse_matrix



## Modelling

Now that we've got preprocessing done, we can focus on building the model. Here, we will use sklearn's `MultinomialNB` because our assumption is that our data follows a multinomial distribution.

In [52]:
from sklearn.naive_bayes import MultinomialNB
# MultinomialNB?

Initialize your `MultinomialNB` model

In [53]:
# write code here
spam_nb = MultinomialNB()

Fit it to our training data (`X_train_count_sparse_matrix`)

In [54]:
# write code here
spam_nb.fit(X_train_count_sparse_matrix, y_train)

And, get our training predictions

In [55]:
# write code here
predictions = spam_nb.predict(X_train_count_sparse_matrix)

predictions

array([1, 1, 0, ..., 1, 1, 1])

Let's see how well our model worked on our training data

In [56]:
print("Spam/ham training accuracy: ", compute_accuracy(predictions, y_train), "%")

Spam/ham training accuracy:  99.04985932383192 %


Then, let's see how it does on our test set

In [57]:
predictions = spam_nb.predict(X_test_count_sparse_matrix)

predictions

array([1, 0, 1, ..., 1, 0, 1])

In [58]:
print("Spam/ham test accuracy: ", compute_accuracy(predictions, y_test), "%")

Spam/ham test accuracy:  98.21370924351662 %


**Sanity Check:** You should get around a 98% accuracy

We should also be able to call `classification_report` to see how well our model performed with different metrics

In [59]:
from sklearn.metrics import classification_report
# classification_report?

Print the test classification report of our model. Set the `target_names` to `mapping.keys()` so we can see what `0` and `1` refers to.

In [60]:
# write code here
print(classification_report(y_test, predictions, target_names=mapping.keys()))

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      3509
        spam       1.00      0.97      0.99      5784

    accuracy                           0.98      9293
   macro avg       0.98      0.99      0.98      9293
weighted avg       0.98      0.98      0.98      9293



**Sanity check:** You should get the following results
```
              precision    recall  f1-score   support

         ham     0.9573    0.9972    0.9768      3509
        spam     0.9982    0.9730    0.9855      5784

    accuracy                         0.9821      9293
   macro avg     0.9778    0.9851    0.9811      9293
weighted avg     0.9828    0.9821    0.9822      9293
```
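
To read the report, it helps to remember how precision and recall are computed. The sketch below uses made-up confusion counts (they are not the notebook's actual results) to show the arithmetic for both classes.

```python
# Hypothetical confusion counts (NOT the notebook's actual results)
ham_as_ham = 90      # ham correctly labelled ham
ham_as_spam = 10     # ham wrongly labelled spam
spam_as_ham = 20     # spam wrongly labelled ham
spam_as_spam = 180   # spam correctly labelled spam

# Recall: of everything that truly belongs to the class, how much was caught?
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

# Precision: of everything labelled as the class, how much was right?
ham_precision = ham_as_ham / (ham_as_ham + spam_as_ham)
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

print("ham :", round(ham_precision, 4), round(ham_recall, 4))
print("spam:", round(spam_precision, 4), round(spam_recall, 4))
```

Recall answers "how likely is a message of this class to receive its own label", while precision answers "how trustworthy is this label when the model assigns it".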

**Question #5:** Among the classes (`ham` or `spam`), which is more likely to get labelled its class?

<!--crumb;qna;Question: Among the classes (ham or spam), which is more likely to get labelled its class?-->

A: ham (its recall of 0.9972 is higher than spam's 0.9730)

## Testing our model with our own input

In [None]:
input_test = input("Enter text to check if spam or ham : ")
while input_test.lower() != "q":

    input_test_matrix = count_vect.transform([input_test])

    results = spam_nb.predict(input_test_matrix)
    results_label = ["HAM", "SPAM"]
    print("Text : " + input_test + " is " + results_label[results[0]])

    input_test = input("Enter text to check if spam or ham : ")

## Checking the learned parameters
Let's see the parameters the `MultinomialNB` model learned.

Get the token counts the model computed

In [61]:
token_counts = spam_nb.feature_count_
token_counts.shape

(2, 147622)

**Sanity check:** You should get a `(2, 147622)` matrix

__Question #6:__ Why did we get a `(2, 147622)` matrix for the token counts?

<!--crumb;qna;Question: Why did we get a (2, 147622) matrix for the token counts?-->

A: It means that there are 2 classes (spam and ham) and 147622 features (words in the dictionary). The value at the i-th row and j-th column of `feature_count_` is the number of times feature j appears in samples of class i in the training data.

To get the token counts of `spam` or `ham`, we can use our `mapping`.

In [62]:
spam_token_counts = token_counts[mapping["spam"]]
ham_token_counts = token_counts[mapping["ham"]]

We can sort the token counts to see the words that occur the least/most for that class

In [63]:
np.sort(spam_token_counts)

array([    0.,     0.,     0., ..., 38981., 41424., 53076.])

While `np.sort` returns the actual counts, `np.argsort` returns the sorted indices

In [64]:
np.argsort(spam_token_counts)

array([ 73810, 119826, 119827, ..., 128265,  21596, 124004], dtype=int64)

**Sanity check:** You should see the following:

`array([ 73810, 119826, 119827, ..., 128265,  21596, 124004])`
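
The difference between the two functions is easier to see on a tiny array. The counts below are made up for illustration.

```python
import numpy as np

demo_counts = np.array([7., 0., 42., 3.])   # hypothetical token counts

sorted_counts = np.sort(demo_counts)        # the values, smallest to largest
order = np.argsort(demo_counts)             # the indices that would sort them

print(sorted_counts)
print(order)

# The last index in `order` points at the largest count
print(order[-1], demo_counts[order[-1]])
```

Indexing the original array with `order` reproduces the sorted values, which is why `argsort` is the one to use when we want to look the words up afterwards.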

The two sorts show that the `73,810th` word occurred `0` times, while the `124,004th` occurred `53,076` times in the spam sentences. Note that these are raw counts that are skewed because there are significantly more spam sentences. **The model normalizes the counts relative to the class.**

To get the `i`th word/token, we can use `count_vect.get_feature_names_out()`

In [66]:
# index 124004 is the last entry of the argsort above, i.e. the most frequent spam token
count_vect.get_feature_names_out()[124004]

'the'

__Question #7:__ What word occurred the most in the spam sentences?

<!--crumb;qna;Question: What word occurred the most in the spam sentences?-->

A: the

The following code lists the top occurring words per class:

In [68]:
top = 50

ham_idx = np.argsort(ham_token_counts)[::-1][:top]
spam_idx = np.argsort(spam_token_counts)[::-1][:top]

# Look up the vocabulary once instead of on every loop iteration
feature_names = count_vect.get_feature_names_out()

# Header order matches the print order below: ham first, then spam
print("ham \t spam")
print("------------")

for i in range(top):
    print(feature_names[ham_idx[i]], "\t", feature_names[spam_idx[i]])

ham 	 spam
------------
the 	 the
to 	 and
of 	 to
and 	 of
in 	 in
is 	 you
for 	 is
it 	 font
that 	 http
you 	 for
on 	 this
this 	 padding
with 	 our
be 	 it
from 	 your
have 	 with
are 	 we
at 	 com
as 	 0px
not 	 that
if 	 price
or 	 border
by 	 product_table
can 	 on
but 	 are
edu 	 from
will 	 color
an 	 top
my 	 95
we 	 size
your 	 as
all 	 weight
one 	 at
would 	 be
there 	 will
was 	 info
any 	 by
so 	 my
http 	 or
use 	 all
has 	 not
do 	 none
com 	 left
they 	 have
what 	 right
which 	 more
about 	 company
some 	 no
www 	 www
list 	 out


<s>With the code above, you can now reword your scam so that it can bypass a common spam filter.</s>

The model does not depend on raw counts but instead uses the log probability. Get the model's log probabilities.

In [70]:
# write code here
spam_nb.predict_log_proba(X_test_count_sparse_matrix)

array([[-145.55795392,    0.        ],
       [   0.        , -193.71611942],
       [-209.09346911,    0.        ],
       ...,
       [-145.30437722,    0.        ],
       [   0.        , -770.70093955],
       [-187.27847763,    0.        ]])

Let's sort the `feature_log_prob_` similar to the way we sorted the token counts

In [71]:
np.sort(spam_nb.feature_log_prob_[mapping["spam"]])

array([-14.58437171, -14.58437171, -14.58437171, ...,  -4.01351643,
        -3.95273186,  -3.70487274])

In [72]:
np.argsort(spam_nb.feature_log_prob_[mapping["spam"]])

array([ 73810, 119826, 119827, ..., 128265,  21596, 124004], dtype=int64)

We can see that the order is maintained, and the `124,004th` word is still the most frequent word in the `spam` sentences.
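
The order is maintained because, within one class, the smoothed log probability only ever increases as the count increases. Here is a small sketch of that on made-up counts, assuming the usual additive (Laplace) smoothing formula `(count + alpha) / (class total + alpha * vocabulary size)`.

```python
import numpy as np

# Hypothetical per-class token counts: 2 classes x 4 tokens
demo_token_counts = np.array([[3., 0., 1., 6.],
                              [0., 5., 5., 10.]])
alpha = 1.0                                   # additive smoothing
n_tokens = demo_token_counts.shape[1]

# Normalize each row by its own class total, so classes of different
# sizes become comparable
denominators = demo_token_counts.sum(axis=1, keepdims=True) + alpha * n_tokens
demo_log_prob = np.log((demo_token_counts + alpha) / denominators)

# Check, class by class, that sorting by count and by log prob agree
order_preserved = [
    np.array_equal(np.argsort(demo_token_counts[c], kind="stable"),
                   np.argsort(demo_log_prob[c], kind="stable"))
    for c in range(demo_token_counts.shape[0])
]

print(np.round(demo_log_prob, 3))
print(order_preserved)
```

Note that a token with a count of 0 still gets a finite log probability thanks to `alpha`, which is what keeps one unseen word from zeroing out a whole prediction.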

We can also see the class count and computed priors for each class

In [73]:
spam_nb.class_count_

array([ 8185., 13496.])

In [74]:
spam_nb.class_log_prior_

array([-0.97413309, -0.47404296])

Note that the priors are computed from the count of each class (spam or not spam) in the training set, and the model stores their natural logarithms: `log(8185 / 21681) ≈ -0.974` for ham and `log(13496 / 21681) ≈ -0.474` for spam.
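
As a quick check, the log priors can be reproduced by hand from the class counts printed above.

```python
import math

class_counts = {"ham": 8185, "spam": 13496}   # values from class_count_ above
total = sum(class_counts.values())

# Prior = class frequency; the model keeps its natural logarithm
log_priors = {label: math.log(n / total) for label, n in class_counts.items()}

print(log_priors)
```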

# Tuning our Naive Bayes model

In this section we will reuse our spam/ham dataset. We will resplit our dataset in the following manner:
1. Allot 20% of the original dataset as our hold-out test set.
1. Allot 25% of the remaining data as our validation data set. The other 75% of the remaining data (60% of the original dataset) will serve as our training data.
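
The arithmetic behind this two-step split is sketched below for the 30,974 rows we loaded earlier. It assumes that the size of each held-out part is rounded up to a whole number of rows.

```python
import math

n_total = 30974                               # rows after dropna()

n_test = math.ceil(0.20 * n_total)            # assumed rounding: up
n_train_val = n_total - n_test

n_validation = math.ceil(0.25 * n_train_val)  # 25% of what is left
n_train = n_train_val - n_validation

print(n_test, n_validation, n_train)
print(round(n_train / n_total, 3))            # share of the original data
```

Because the second split takes 25% of the remaining 80%, the test and validation sets each hold about 20% of the original data and the training set holds about 60%.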

We will use `sklearn`'s `ParameterGrid` to tune our hyperparameters

Let's separate our test set. Set the test set to `20%`, stratify based on the target class, and set the `random_state` to 42.

In [75]:
# write code here
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("X_test shape : ", X_test.shape)
print("y_test shape : ", y_test.shape)

X_test shape :  (6195,)
y_test shape :  (6195,)


We will do the same thing to separate our validation set. Set the validation set size to `25%`, stratify based on the target class, and set the `random_state` to 42.

__Don't forget that we are now splitting `X_train_val` and `y_train_val`.__ There should be no data leakage.

In [76]:
# write code here
X_train, X_validation, y_train, y_validation = train_test_split(X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42)

print("X_train shape : ", X_train.shape)
print("y_train shape : ", y_train.shape)
print("X_validation shape : ", X_validation.shape)
print("y_validation shape : ", y_validation.shape)

X_train shape :  (18584,)
y_train shape :  (18584,)
X_validation shape :  (6195,)
y_validation shape :  (6195,)


## Vectorization

Now that we have our data sets prepared, we can now start computing for the token counts. Remember that we will have to refit our vectorizer to our new train data.

Initialize a `CountVectorizer`. Set it to remove English stop words. This should remove common words like `the`, `of`, `and` that will likely not contribute anything meaningful to telling the two classes apart.

In [77]:
# write code here
count_vect = CountVectorizer(stop_words="english")

Get the count matrix for the training and validation set. 

In [78]:
# write code here
X_train_count_sparse_matrix = count_vect.fit_transform(X_train)
X_validation_count_sparse_matrix = count_vect.transform(X_validation)

In [79]:
print("X_train_count_sparse_matrix shape: ", X_train_count_sparse_matrix.shape)
print("X_validation_count_sparse_matrix shape: ", X_validation_count_sparse_matrix.shape)

X_train_count_sparse_matrix shape:  (18584, 146302)
X_validation_count_sparse_matrix shape:  (6195, 146302)


**Sanity check:** You should see the following values:
```
X_train_count_sparse_matrix shape:  (18584, 146302)
X_validation_count_sparse_matrix shape:  (6195, 146302)
```

**Question #8:** Why should we not get the count vectorizer to fit on the validation data set instead?

<!--crumb;qna;Question: Why should we not get the count vectorizer to fit on the validation data set instead?-->

A: Fitting the count vectorizer on the validation set would leak information about that set into the model's vocabulary, so the validation accuracy would no longer be an honest estimate of performance on new, real-world (unseen) data.
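
The difference between `fit_transform` and `transform` can be sketched on a toy example (the documents below are invented for illustration). The vocabulary is learned from the training documents only, so a token that appears only in the validation document gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration.
toy_train_docs = ["free prize now", "meeting at noon"]
toy_val_docs = ["free lunch at noon"]

toy_vect = CountVectorizer()
toy_train_counts = toy_vect.fit_transform(toy_train_docs)  # learns the vocabulary
toy_val_counts = toy_vect.transform(toy_val_docs)          # reuses it unchanged

print(sorted(toy_vect.vocabulary_))
print(toy_train_counts.shape, toy_val_counts.shape)
print("'lunch' in vocabulary:", "lunch" in toy_vect.vocabulary_)
```

Both matrices share the same columns, and the unseen token `lunch` is simply ignored, which mirrors what the model will face with genuinely new messages.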

## GridSearch with `ParameterGrid`
In this section, we will use `ParameterGrid` to get the combinations of hyperparameters we will try on our model.

In [91]:
from sklearn.model_selection import ParameterGrid

Set up our base classifier for the spam/ham task. Don't train it yet.

In [92]:
# write code here
spam_nb = MultinomialNB()

For this model, we can tweak `alpha` (our smoothing parameter) and whether or not the class priors are learned from the data (`fit_prior`). You can read more about this in the docs.

In [93]:
spam_nb.get_params()

{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': 'warn'}
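
To make the role of `alpha` concrete, the sketch below recomputes the smoothed token probabilities by hand on a tiny made-up count matrix and compares them with what `MultinomialNB` stores in `feature_log_prob_`. It assumes additive (Laplace/Lidstone) smoothing, where `alpha` is added to every token count per class; the arrays and names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 3 documents, 3 tokens (invented for illustration).
toy_X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0]])
toy_y = np.array([0, 1, 0])
alpha = 1.0

toy_nb = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Total count of each token within each class.
class_token_counts = np.vstack([toy_X[toy_y == c].sum(axis=0) for c in toy_nb.classes_])

# Additive smoothing: (count + alpha) / (class total + alpha * number of tokens).
n_tokens = toy_X.shape[1]
manual_log_prob = np.log(
    (class_token_counts + alpha)
    / (class_token_counts.sum(axis=1, keepdims=True) + alpha * n_tokens)
)

print(np.allclose(manual_log_prob, toy_nb.feature_log_prob_))
```

A larger `alpha` pulls every token probability toward a uniform value, which is why very large values tend to blur the difference between the classes.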

For the following section, we will define our hyperparameters. For now, set the following hyperparameter choices:

__Hyperparameters__:
- `alpha` could be 1, 3, 5, 10, 15, 20, 50
- `fit_prior` could be `True` or `False`
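
As a small illustration of what `fit_prior` changes, the sketch below fits two classifiers on an invented, imbalanced toy dataset and prints the class priors each one ends up with. The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data: three "ham" rows and one "spam" row (invented for illustration).
toy_X = np.array([[2, 0], [1, 1], [3, 0], [0, 2]])
toy_y = np.array(["ham", "ham", "ham", "spam"])

learned_prior_nb = MultinomialNB(fit_prior=True).fit(toy_X, toy_y)
uniform_prior_nb = MultinomialNB(fit_prior=False).fit(toy_X, toy_y)

# class_log_prior_ holds log probabilities, so exponentiate to read them.
print(learned_prior_nb.classes_, np.exp(learned_prior_nb.class_log_prior_))
print(uniform_prior_nb.classes_, np.exp(uniform_prior_nb.class_log_prior_))
```

With `fit_prior=True` the priors follow the class frequencies in the training data, while `fit_prior=False` treats both classes as equally likely before looking at any tokens.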

In [94]:
hyperparameters = [{
    'alpha': [1, 3, 5, 10, 15, 20, 50],
    'fit_prior': [False, True]
}]

If we call `ParameterGrid`, it should list the following:

In [95]:
list(ParameterGrid(hyperparameters))

[{'alpha': 1, 'fit_prior': False},
 {'alpha': 1, 'fit_prior': True},
 {'alpha': 3, 'fit_prior': False},
 {'alpha': 3, 'fit_prior': True},
 {'alpha': 5, 'fit_prior': False},
 {'alpha': 5, 'fit_prior': True},
 {'alpha': 10, 'fit_prior': False},
 {'alpha': 10, 'fit_prior': True},
 {'alpha': 15, 'fit_prior': False},
 {'alpha': 15, 'fit_prior': True},
 {'alpha': 20, 'fit_prior': False},
 {'alpha': 20, 'fit_prior': True},
 {'alpha': 50, 'fit_prior': False},
 {'alpha': 50, 'fit_prior': True}]
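
The number of combinations is the product of the number of choices per hyperparameter (7 × 2 = 14 here). A smaller grid, used purely for illustration, shows the same behaviour:

```python
from sklearn.model_selection import ParameterGrid

# A reduced grid for illustration: 2 choices x 2 choices = 4 combinations.
small_grid = ParameterGrid({"alpha": [1, 3], "fit_prior": [False, True]})

print(len(small_grid))
for params in small_grid:
    # Each item is a plain dict that can be passed to set_params(**params).
    print(params)
```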

For every iteration, we will:
1. Set the parameters of our base model to the current hyperparameter combination 
1. Fit our model to our training data
1. Compute our training accuracy
1. Run predictions on our validation data
1. Compute our validation accuracy
1. Keep track of the best-performing validation accuracy and its associated hyperparameter combination.

In [96]:
best_score = 0
for g in ParameterGrid(hyperparameters):
    print(g)
    
    spam_nb.set_params(**g)
    
    # write code here
    # fit on the training counts, then score the model on that same training data
    spam_nb.fit(X_train_count_sparse_matrix, y_train)
    predictions = spam_nb.predict(X_train_count_sparse_matrix)
    train_acc = compute_accuracy(predictions, y_train)
    
    # write code here
    predictions = spam_nb.predict(X_validation_count_sparse_matrix)
    val_acc = compute_accuracy(predictions, y_validation)
    
    print(f"Train acc: {train_acc}% \t Val acc: {val_acc}%", end="\n\n")
    
    if val_acc > best_score:
        best_score = val_acc
        best_grid = g

print("Best accuracy: ", best_score, "%")
print("Best grid: ", best_grid)

{'alpha': 1, 'fit_prior': False}
Train acc: 99.02066293585881% 	 Val acc: 98.36965294592413%

{'alpha': 1, 'fit_prior': True}
Train acc: 99.08523461041756% 	 Val acc: 98.4503631961259%

{'alpha': 3, 'fit_prior': False}
Train acc: 98.59018510546707% 	 Val acc: 98.07909604519774%

{'alpha': 3, 'fit_prior': True}
Train acc: 98.63861386138613% 	 Val acc: 98.15980629539952%

{'alpha': 5, 'fit_prior': False}
Train acc: 98.32651743435214% 	 Val acc: 97.88539144471348%

{'alpha': 5, 'fit_prior': True}
Train acc: 98.40723202755058% 	 Val acc: 98.01452784503631%

{'alpha': 10, 'fit_prior': False}
Train acc: 97.82070598364184% 	 Val acc: 97.38498789346247%

{'alpha': 10, 'fit_prior': True}
Train acc: 97.98213517003875% 	 Val acc: 97.67554479418887%

{'alpha': 15, 'fit_prior': False}
Train acc: 97.69156263452432% 	 Val acc: 97.30427764326069%

{'alpha': 15, 'fit_prior': True}
Train acc: 97.72384847180369% 	 Val acc: 97.49798224374496%

{'alpha': 20, 'fit_prior': False}
Train acc: 97.53013344812742% 	 Val acc: 97.12671509281678%

{'alpha': 20, 'fit_prior': True}
Train acc: 97.65389582436505% 	 Val acc: 97.41727199354318%

{'alpha': 50, 'fit_prior': False}
Train acc: 97.05122686181662% 	 Val acc: 96.86844229217111%

{'alpha': 50, 'fit_prior': True}
Train acc: 97.17498923805424% 	 Val acc: 97.14285714285714%

Best accuracy:  98.4503631961259 %
Best grid:  {'alpha': 1, 'fit_prior': True}
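
For reference, the same tuning loop can be sketched end to end on a tiny invented corpus. This is a self-contained illustration, not the notebook's solution: the documents and labels are made up, and `accuracy_score` from `sklearn.metrics` stands in for the notebook's own `compute_accuracy` helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration.
toy_train_docs = ["win free prize now", "free cash offer",
                  "lunch meeting at noon", "project report due monday"]
toy_train_labels = ["spam", "spam", "ham", "ham"]
toy_val_docs = ["free prize offer", "report for the meeting"]
toy_val_labels = ["spam", "ham"]

# Fit the vectorizer on the training documents only.
toy_vect = CountVectorizer()
toy_X_train = toy_vect.fit_transform(toy_train_docs)
toy_X_val = toy_vect.transform(toy_val_docs)

results = []
for params in ParameterGrid({"alpha": [1, 10], "fit_prior": [False, True]}):
    model = MultinomialNB(**params).fit(toy_X_train, toy_train_labels)
    val_acc = 100 * accuracy_score(toy_val_labels, model.predict(toy_X_val))
    results.append((val_acc, params))

# Pick the combination with the highest validation accuracy (first one wins ties).
best_val_acc, best_params = max(results, key=lambda r: r[0])
print(best_val_acc, best_params)
```

Collecting every result in a list, rather than only tracking the running best, also makes it easy to inspect how accuracy changes across the grid afterwards.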


__Question #9:__ What is the best found value for `alpha`?

<!--crumb;qna;Question: What is the best found value for alpha?-->

A: 1

__Question #10:__ What is the best found value for `fit_prior`?

<!--crumb;qna;Question: What is the best found value for fit_prior?-->

A: True

## Retraining our estimator with the best hyperparameters

Now that we know the best hyperparameters, we can now make a new classifier and retrain it.

In [97]:
# write code here
# build a fresh classifier with the best grid found above (alpha=1, fit_prior=True)
spam_nb = MultinomialNB(**best_grid)

Make sure you train it with both our training and validation set. You can keep the trained `count_vect`.

In [98]:
# write code here
# refit the vectorizer on train + validation so its vocabulary covers both sets
X_train_val_count_sparse_matrix = count_vect.fit_transform(X_train_val)

In [99]:
# write code here
spam_nb.fit(X_train_val_count_sparse_matrix, y_train_val)

## Testing phase

Run predictions on the test data set.

In [100]:
# write code here
X_test_val_count_sparse_matrix = count_vect.transform(X_test)
predictions = spam_nb.predict(X_test_val_count_sparse_matrix)

Compute the test accuracy.

In [102]:
# write code here
test_acc = compute_accuracy(predictions, y_test)
print("Test accuracy: ", test_acc, "%")

Test accuracy:  98.46650524616626 %


__Question #11__: What is the final test accuracy?

<!--crumb;qna;Question: What is the final test accuracy?-->

A: 98.47%

# Summary

In this notebook, we created two kinds of Naive Bayes models: Gaussian and Multinomial. 

We also saw the models' learned parameters. For Gaussian NB models, the model learns the mean and standard deviation of each feature per class, while multinomial NB models learn the log probability of each token per class.

We also experienced creating a natural language processing (NLP) machine learning model. Unlike its deep learning counterpart, the features are more hand-crafted because we dictate what the model should look at. In this case, we designed it to look at token (term) counts, but we could build more sophisticated versions that use inverse document frequency or term frequency-inverse document frequency (TF-IDF).
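
As a pointer for further exploration, here is a minimal sketch of how TF-IDF features could be produced with `sklearn`'s `TfidfVectorizer` in place of raw counts. The three documents are invented for illustration, and this was not part of the exercise above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for this illustration.
toy_docs = ["free prize free offer", "meeting about the offer", "free lunch meeting"]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Same vocabulary and shape; only the cell values differ.
print(toy_counts.shape, toy_tfidf.shape)
print(np.round(toy_tfidf.toarray(), 2))
```

With the default settings each TF-IDF row is scaled to unit length, and tokens that appear in many documents receive lower weights than rarer ones.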

## <center>fin</center>


<!-- DO NOT MODIFY OR DELETE THIS -->

<sup>made/compiled by daniel stanley tan & courtney anne ngo 🐰 & thomas james tiam-lee</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> danieltan07@gmail.com & courtneyngo@gmail.com & thomasjamestiamlee@gmail.com</sup><br>
<sup>please cc your instructor, too</sup>
<!-- DO NOT MODIFY OR DELETE THIS -->