# Natural Language Processing


**_Author: Jessica Cervi_**

**Expected time = 2 hours**

**Total points = 80 points**


## Project Overview


This assignment provides an overview of *Natural Language Processing* (NLP) as an approach to supervised classification problems involving textual data, in particular using *Naive Bayes classifiers*. Despite the "naive" independence assumptions involved, Naive Bayes works well in practice for text analysis tasks such as spam filtering and document classification. In this assignment, you will build a very simple spam-filtering model to get a sense of how Naive Bayes classification works. Then, you will explore the tools within Scikit-Learn for NLP.

The primary goals of the current assignment are:
+ to become familiar with the terminology and tools available for NLP;
+ to practice the application of Bayes' theorem for probabilistic reasoning; and
+ to develop a (highly simplified) model of text analysis for spam classification using the naive Bayes classification framework.



This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from the module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 

### Learning Objectives


- Learn the main concepts behind the theory of Natural Language Processing
- Represent a document-term matrix in Python
- Learn the theory behind the TF-IDF vectorizer and its Python implementation
- Implement Bayes's theorem in Python
- Prepare text for analysis and distinguish between spam and ham messages
- Compute priors and likelihoods for your predictions
- Use a `Scikit-Learn` `MultinomialNB` estimator



---

## Index: 

#### Natural Language Processing

- [Question 1](#q01)
- [Question 2](#q02)
- [Question 3](#q03)
- [Question 4](#q04)
- [Question 5](#q05)
- [Question 6](#q06)
- [Question 7](#q07)
- [Question 8](#q08)
- [Question 9](#q09)
- [Question 10](#q10)
- [Question 11](#q11)



## Natural Language Processing

**Natural Language Processing**, usually shortened to NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.

In this assignment, we will guide you through your own implementation of an algorithm to analyze text.

As usual, we begin by importing the necessary libraries.

In [23]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pathlib, re

To prepare for the eventual task of classifying text messages, here is a Python string containing the body of several text messages on separate lines:

In [24]:
messages = '''Have a safe trip to Nigeria. Wish you happiness and very soon company to share moments with
Well keep in mind I've only got enough gas for one more round trip barring a sudden influx of cash
Yes i have. So that's why u texted. Pshew...missing you so much
This school is really expensive. Have you started practicing your accent. Because its important. And have you decided if you are doing 4years of dental school or if you'll just do the nmde exam.
Sorry, I'll call later
Anything lor. Juz both of us lor.
Get me out of this dump heap. My mom decided to come to lowes. BORING.
Why don't you wait 'til at least wednesday to see if you get your .
REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
Pity, * was in mood for that. So...any other suggestions?'''
print(messages)

Have a safe trip to Nigeria. Wish you happiness and very soon company to share moments with
Well keep in mind I've only got enough gas for one more round trip barring a sudden influx of cash
Yes i have. So that's why u texted. Pshew...missing you so much
This school is really expensive. Have you started practicing your accent. Because its important. And have you decided if you are doing 4years of dental school or if you'll just do the nmde exam.
Sorry, I'll call later
Anything lor. Juz both of us lor.
Get me out of this dump heap. My mom decided to come to lowes. BORING.
Why don't you wait 'til at least wednesday to see if you get your .
REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
Pity, * was in mood for that. So...any other suggestions?

The content of the preceding string illustrates a lot of the challenges with **natural language processing** from text:

+ The text needs to be split into individual *tokens* (i.e., words & punctuation).
+ There can be difficulties with upper- & lower-case interspersed with numerals and punctuation characters.
+ Many words add little contextual information such as articles, conjunctions, etc. These are *stop words*.
+ Similar words can occur with common roots (e.g., 'go' and 'goes'; 'like', 'likes', and 'liked').
+ Words can be spelled incorrectly.

Let's convert all the text to lower case and construct a list with the individual messages as a *corpus* of text.

As a reminder, **a text corpus is a large body of text**. 

In [25]:
corpus = messages.lower().split('\n')
corpus

['have a safe trip to nigeria. wish you happiness and very soon company to share moments with',
 "well keep in mind i've only got enough gas for one more round trip barring a sudden influx of cash",
 "yes i have. so that's why u texted. pshew...missing you so much",
 "this school is really expensive. have you started practicing your accent. because its important. and have you decided if you are doing 4years of dental school or if you'll just do the nmde exam.",
 "sorry, i'll call later",
 'anything lor. juz both of us lor.',
 'get me out of this dump heap. my mom decided to come to lowes. boring.',
 "why don't you wait 'til at least wednesday to see if you get your .",
 'reminder from o2: to get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode',
 'this is the 2nd time we have tried 2 contact u. u have won the £750 pound prize. 2 claim is easy, call 087187272008 now1! only 10p per minute. bt-national-rate.',
 'pity, * was in mood for that. so...any other suggestions?']

## Terminology

We'll use the following terms at various places in the assignment. More details & examples will be provided where appropriate.

+ *Stop Words* -- Specific words that are not considered important for text analysis, e.g., 'the', 'is', 'a',  etc.
+ *Tokenization*  -- Segmentation of text into separate *tokens* (i.e., words or punctuation marks). This is a form of feature extraction.
+ *Stemming* -- Reducing words to their root form by truncating characters, e.g., car, cars, car’s, cars’ all have *stem* 'car'.
+ *Lemmatization* -- Grouping together the inflected forms of a word as a single item known as the *lemma* or dictionary form.
+ *Word Embedding* -- Explicit mapping to represent sequences of tokens (words extracted from text) to vectors of real numbers.
+ *$n$-grams* -- Sequences of $n$ consecutive words or tokens (i.e., phrases) rather than single words. They capture context that single words miss; e.g., the bigram 'not happy' carries a different meaning than the unigram 'happy'. For example, the following sentence decomposes as shown into unigrams (1-grams) or bigrams (2-grams):

    Sentence:	`The movie was not great.`<br>
    Uni-grams:	`[‘The’, ‘movie’, ‘was’, ‘not’, ‘great.’]`<br>
    Bi-grams:	`[‘The movie’, ‘movie was’, ‘was not’, ‘not great.’]`
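
The decomposition above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` is hypothetical, and the sentence is split on whitespace (an assumption made here so that the period stays attached to `great.`, as in the example).

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization keeps the trailing period attached to 'great.'
tokens = "The movie was not great.".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

A sentence with $m$ tokens yields $m - n + 1$ $n$-grams, which is why the five-token sentence has four bigrams.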

## Constructing a Word Embedding with the `CountVectorizer` class

Scikit-Learn provides many important tools for converting textual data to numerical data (as numerical data is required for most machine learning techniques). One such tool is the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from the module `sklearn.feature_extraction.text`. The outputs of the `transform` and `fit_transform` methods of the `CountVectorizer` class are [*document-term matrices*](https://en.wikipedia.org/wiki/Document-term_matrix) that contain word counts for each word in the corpus.

As an example, here is a (modified) excerpt from the Scikit-Learn [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(f"The object X obtained has type {type(X)}")
print(f"The columns of X correspond to the words:\n{vectorizer.get_feature_names()}")

The object X obtained has type <class 'scipy.sparse.csr.csr_matrix'>
The columns of X correspond to the words:
['087187272008', '10p', '2nd', '4years', '50', '750', 'accent', 'and', 'any', 'anything', 'are', 'at', 'barring', 'because', 'boring', 'both', 'bt', 'call', 'cash', 'claim', 'come', 'company', 'contact', 'credit', 'decided', 'dental', 'details', 'do', 'doing', 'don', 'dump', 'easy', 'enough', 'exam', 'expensive', 'for', 'free', 'from', 'gas', 'get', 'got', 'great', 'happiness', 'have', 'heap', 'house', 'if', 'important', 'in', 'influx', 'is', 'its', 'just', 'juz', 'keep', 'later', 'least', 'll', 'lor', 'lowes', 'me', 'mind', 'minute', 'missing', 'mom', 'moments', 'mood', 'more', 'much', 'my', 'name', 'national', 'nigeria', 'nmde', 'no', 'now1', 'o2', 'of', 'offers', 'one', 'only', 'or', 'other', 'out', 'per', 'pity', 'pls', 'postcode', 'pound', 'pounds', 'practicing', 'prize', 'pshew', 'rate', 'really', 'reminder', 'reply', 'round', 'safe', 'school', 'see', 'share', 'so', 'soo

In [27]:
X = X.toarray() # convert to dense representation
print(X)
print(f"After calling toarray(), the object X obtained has type {type(X)}")

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 1 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
After calling toarray(), the object X obtained has type <class 'numpy.ndarray'>


In [28]:
# Represent document-term matrix using a DataFrame instead:
X = pd.DataFrame(data=X, columns=vectorizer.get_feature_names())
# Extract select columns
X[['and', 'contact', 'wednesday', 'yes', 'you']]

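| | and | contact | wednesday | yes | you |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 4 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 2 |
| 8 | 2 | 0 | 0 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 |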


Notice a few particular properties of the `CountVectorizer` class from the preceding example.

+ Each column of the document-term matrix corresponds to a particular word (as displayed by the `get_feature_names` method).
+ By default, the text is converted to lower case (e.g., "Wednesday" $\mapsto$ "wednesday").
+ The specific vocabulary that determines the results of `get_feature_names` can be learned from some input text/corpus or predetermined when the object is instantiated (see documentation).
+ Each row of the document-term matrix corresponds to a document (here, one message) from the original corpus.
+ The numerical entries of the document-term matrix are nonnegative integers corresponding to counts of the occurrences of each word in the text corpus.
+ The default input for the `fit` method is a list of strings or file objects corresponding to documents from which the distributions of word counts can be learned.
+ The default output after applying the `fit_transform` method to text (or, equivalently, applying the `fit` method and subsequently applying the `transform` method to the same data) is a *sparse matrix* with only nonzero entries represented (i.e., to make storage more efficient). The purpose of calling the `toarray` method, then, is to transform the sparse matrix into a dense representation (i.e., by putting the zeros back in explicitly) for printing/display.
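
To make these properties concrete, here is a rough pure-Python sketch of how a document-term matrix can be assembled by hand. It is not Scikit-Learn's implementation: the helper `tiny_dtm` is hypothetical, and the regular expression is assumed to mirror the default `token_pattern` documented for `CountVectorizer`, which keeps only tokens of two or more word characters (this would explain why one-letter words such as 'a', 'i', and 'u' do not appear among the feature names above).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's documented default token_pattern,
# which matches tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tiny_dtm(docs):
    """Hypothetical helper: return (vocabulary, rows) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # One column per distinct token, in sorted order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["Sorry, I'll call later", "Anything lor. Juz both of us lor."]
vocabulary, rows = tiny_dtm(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each inner list plays the role of one row of the dense matrix; a sparse representation would store only the nonzero counts.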

In the next exercise, you will use the Scikit-Learn `CountVectorizer` to create a document-term matrix from slightly different text. The corpus here is modelled using a list called `text` whose entries are sentences (strings), each of which is related to the topics *TV* and *radio*.

In [29]:
text = [
       'TV programs are not interesting -- TV is annoying.',
       'Kids like TV'
       ,'We receive TV by radio waves'
       ,'It is interesting to listen to the radio'
       ,'On the waves, kids programs are rare.'
       ,'The kids listen to the radio; it is rare.'
       ]
print(text)
print(f'There are {len(text)} sentences.')

['TV programs are not interesting -- TV is annoying.', 'Kids like TV', 'We receive TV by radio waves', 'It is interesting to listen to the radio', 'On the waves, kids programs are rare.', 'The kids listen to the radio; it is rare.']
There are 6 sentences.


### Constructing a Document-Term Matrix

[Back to top](#Index:) 
<a id='q01'></a>


### Question 1:

*5 points*

Construct a document-term matrix from the preceding list of strings `text`.
+ Use an instance of the `CountVectorizer` class as in the preceding example.
+ You can apply the `fit_transform` method or apply the `fit` method, and then apply the `transform` method to the data `text`.
+ Convert the sparse array returned to a dense NumPy array.
+ Assign the final object obtained to the identifier `transformed_text`.

In [30]:
### GRADED

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
### YOUR SOLUTION HERE
# Fit the vectorizer to `text`, then convert the sparse result to a dense NumPy array
transformed_text = count_vect.fit_transform(text).toarray()
###
### YOUR CODE HERE
###


In [31]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Using a `DataFrame` to Represent a Document-Term Matrix

[Back to top](#Index:) 
<a id='q02'></a>


### Question 2:

*5 points*

Your task here is to represent the document-term matrix `transformed_text` as a Pandas `DataFrame`. In particular, adapt the construction preceding Question 1 into the body of a function `make_dtm_df`.

+ Define a function with signature `make_dtm_df(corpus)`, where `corpus` is a list of strings that can be used as an input to `CountVectorizer.fit`.
+ The value returned is a Pandas DataFrame.
  + The rows of the DataFrame correspond to the entries of the input corpus (i.e., there are `len(corpus)` rows).
  + The columns of the DataFrame correspond to the words extracted using `get_feature_names` once the `CountVectorizer` is `fit` to the input `corpus`.
  + The entries of the DataFrame are the counts of each word as they occur in the entries of `corpus` (as in the document-term matrix).

```python
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...    'Is this the first document?' ]
>>> df = make_dtm_df(corpus)
>>> df
```
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>and</th>      <th>document</th>      <th>first</th>      <th>is</th>      <th>one</th>      <th>second</th>      <th>the</th>      <th>third</th>      <th>this</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0</td>      <td>1</td>      <td>1</td>      <td>1</td>      <td>0</td>      <td>0</td>      <td>1</td>      <td>0</td>      <td>1</td>    </tr>    <tr>      <th>1</th>      <td>0</td>      <td>2</td>      <td>0</td>      <td>1</td>      <td>0</td>      <td>1</td>      <td>1</td>      <td>0</td>      <td>1</td>    </tr>    <tr>      <th>2</th>      <td>1</td>      <td>0</td>      <td>0</td>      <td>1</td>      <td>1</td>      <td>0</td>      <td>1</td>      <td>1</td>      <td>1</td>    </tr>    <tr>      <th>3</th>      <td>0</td>      <td>1</td>      <td>1</td>      <td>1</td>      <td>0</td>      <td>0</td>      <td>1</td>      <td>0</td>      <td>1</td>    </tr>  </tbody></table>

In [32]:
### GRADED

### YOUR SOLUTION HERE
def make_dtm_df(corpus):
    '''
    This function will take in a list of strings and return a document-term matrix as a DataFrame.
    INPUT:
      corpus: list of sentences (strings)
    OUTPUT:
      returns DataFrame whose columns are labelled by the feature names of the document-term matrix.
    '''
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # Convert the sparse matrix to a dense NumPy array before building the DataFrame.
    # Note: Scikit-Learn 1.0+ renames get_feature_names to get_feature_names_out.
    return pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
# corpus = ['This is the first document.',
#           'This document is the second document.',
#           'And this is the third one.',
#           'Is this the first document?' ]
# make_dtm_df(corpus)
###
### YOUR CODE HERE
###


In [33]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Applying `make_dtm_df` to the list `text` defined previously yields a DataFrame that can be visualized with a heatmap like this:

![](./assets/table.png)

As you can see, the list of features from this corpus has some words that do not contribute significant information to a topic analysis of this corpus, e.g., `'are'`, `'by'`, `'is'`, `'it'`, `'not'`, `'on'`, `'the'`, `'to'`, and `'we'`.  Such words are called *stop words* in the parlance of Natural Language Processing because they usually don't contribute much to common NLP tasks like document classification. 

For instance, suppose we consider the sentences in the previous list `text` as "documents" and each is associated with the topic `'TV'` or `'radio'`. We want to use `text` to model the topics `'TV'` and `'radio'` with the goal of classifying new documents. In that case, the words `'are'`, `'it'`, and so on from the preceding list would likely be present with similar (high) frequencies in all documents from both topics `'TV'` & `'radio'` (and hence they do not help meaningfully in distinguishing the topics).

It is typical to maintain [lists of standard stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) (which vary from language to language) and to filter them out prior to building a document-term matrix. In Scikit-Learn, the `CountVectorizer` class, when instantiated, has an option `stop_words` (with default value `None`) that allows the `CountVectorizer` object to encode a corpus while omitting all occurrences of stop words when counting words. The keyword argument `stop_words='english'` filters a standard built-in list of English stop words from the corpus analyzed by the `CountVectorizer`.
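
The filtering step itself is simple, as the following sketch suggests. The stop-word set here is a tiny hand-picked assumption taken from the words listed above, not Scikit-Learn's built-in `'english'` list (which is much longer), and `remove_stop_words` is a hypothetical helper.

```python
# Assumption: a small illustrative stop-word set, not the built-in 'english' list
STOP_WORDS = {"are", "by", "is", "it", "not", "on", "the", "to", "we"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return only the tokens that are not stop words, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "it is interesting to listen to the radio".split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Only the tokens that carry topical information survive, so the resulting document-term matrix has fewer, more informative columns.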

### Accepting Extra Keyword Arguments

[Back to top](#Index:) 
<a id='q03'></a>


### Question 3:

*10 points*

Your task now is to modify the function `make_dtm_df` from the preceding question to allow for [*variable length keyworded arguments*](https://docs.python.org/3/tutorial/controlflow.html#keyword-arguments).

+ The function `make_dtm_df()` you'll define below will have a single positional argument `corpus` (as before) and an additional argument [`**kwargs`](https://docs.python.org/3/tutorial/controlflow.html#keyword-arguments).
  + Whatever keyword arguments are passed into `make_dtm_df` should be passed into the call to `CountVectorizer` within the function body (and hence should be valid inputs for `CountVectorizer`). For instance, when calling `make_dtm_df(corpus, stop_words='english')`, the call to `CountVectorizer` inside the function body should be `CountVectorizer(stop_words='english')`.
  + Valid values for `**kwargs` would be any keyword arguments from Scikit-Learn's [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
+ As an example:

```python
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...    'Is this the first document?' ]
>>> make_dtm_df(corpus, stop_words='english') # Different from above...
```
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>document</th>      <th>second</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>1</td>      <td>0</td>    </tr>    <tr>      <th>1</th>      <td>2</td>      <td>1</td>    </tr>    <tr>      <th>2</th>      <td>0</td>      <td>0</td>    </tr>    <tr>      <th>3</th>      <td>1</td>      <td>0</td>    </tr>  </tbody></table>
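
As a reminder of how `**kwargs` forwarding works in general, here is a small sketch that is independent of Scikit-Learn. Both functions are hypothetical stand-ins: `make_counter` plays the role of a configurable constructor, and `build` plays the role of a wrapper that passes its extra keyword arguments along unchanged.

```python
def make_counter(lowercase=True, min_length=1):
    """Hypothetical stand-in for a configurable class such as a vectorizer."""
    return {"lowercase": lowercase, "min_length": min_length}

def build(corpus, **kwargs):
    """Forward every extra keyword argument, unchanged, to make_counter."""
    settings = make_counter(**kwargs)
    return len(corpus), settings

# min_length is not a parameter of build; it is collected into kwargs and forwarded
n_docs, settings = build(["a b", "c d"], min_length=2)
print(n_docs, settings)
```

Keyword arguments that are not supplied keep the defaults of the inner function, and an unsupported keyword raises a `TypeError` from the inner call rather than from the wrapper.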

In [34]:
### GRADED

def make_dtm_df(corpus, **kwargs):
    '''
    This function will take in a list of strings and return a document-term matrix as a DataFrame.
    INPUT:
      corpus: list of sentences (strings)
      kwargs: any keyword arguments (e.g., stop_words=None, encoding='utf-8', etc.) `CountVectorizer` accepts
    OUTPUT:
      returns DataFrame whose columns are labelled by the feature names of the document-term matrix.
    '''
    # Forward all keyword arguments to the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    X = vectorizer.fit_transform(corpus)
    return pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
###
### YOUR CODE HERE
###


In [35]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Introducing n-grams

[Back to top](#Index:) 
<a id='q04'></a>


### Question 4:

*10 points*

Use the function `make_dtm_df` to introduce different feature encodings with [*$n$-grams*](https://en.wikipedia.org/wiki/N-gram) (i.e., sequences of $n$ consecutive words found in the input corpus).
+ Consult the [documentation for `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from [Scikit-Learn](https://scikit-learn.org).
+ The required argument for `CountVectorizer` is now of the form $\mathtt{ngram\_range}=(p,q)$ where $p$ and $q$ are positive integers with $p\leq q$. This implies that the features will consist of all $k$-grams (possibly after stripping stop-words) with $p\leq k \leq q$. For instance, `CountVectorizer(ngram_range=(2,3))` uses all bigrams (i.e., $2$-grams) and trigrams (i.e., $3$-grams) as features.
+ You will construct two DataFrames as the document-term matrices extracted from `text` (as given below). In particular:
  + a DataFrame `df1` that uses individual words (i.e., $1$-grams) and bigrams as features *without* filtering any stop-words.
  + a DataFrame `df2` that consists of bigrams only *after* filtering the standard `'english'` stop-words.

In [36]:
### GRADED

###
from sklearn.feature_extraction.text import CountVectorizer
text = [
       'TV programs are not interesting -- TV is annoying.',
       'Kids like TV'
       ,'We receive TV by radio waves'
       ,'It is interesting to listen to the radio'
       ,'On the waves, kids programs are rare.'
       ,'The kids listen to the radio; it is rare.'
]

df1 = make_dtm_df(text, ngram_range=(1, 2))

df2 = make_dtm_df(text, stop_words='english', ngram_range=(2, 2))
# print(vectorizer.get_feature_names())
### YOUR SOLUTION HERE
# df1 = None
# df2 = None
    
###
### YOUR CODE HERE
###
# Verification:
print(f"The DataFrame df1 has shape {df1.shape}")
print(f"The DataFrame df2 has shape {df2.shape}")

The DataFrame df1 has shape (6, 50)
The DataFrame df2 has shape (6, 16)


In [37]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

In addition to the `CountVectorizer`, we have the option to use the [*term frequency-inverse document frequency*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) vectorization approach. Here, rather than using raw word counts alone, each count is weighted by how rare the word is across the documents in the corpus.

From the [user guide](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

>  TF means *term-frequency* while TF–IDF means *term-frequency times inverse document-frequency*: 
>
>  $$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t).$$
> 
>  Using the `TfidfTransformer`’s default settings, `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)` the term frequency, the number of times a term occurs in a given document, is multiplied with idf component, which is computed as
> 
>  $$\text{idf}(t) = \log \frac{1 + n}{1 + \text{df}(t)} + 1$$
>
>  where $n$ is the total number of documents in the document set, and $\text{df}(t)$  is the number of documents in the document set that contain term $t$.

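The formula quoted above can be checked by hand with a short sketch. The toy corpus and the helper names `smoothed_idf` and `tf_idf` are assumptions for illustration; the sketch uses the natural logarithm and omits the final `norm='l2'` row normalization, so its numbers are the unnormalized weights rather than the values a `TfidfTransformer` would return.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = log((1 + n) / (1 + df(t))) + 1, using the natural logarithm."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Toy corpus, already tokenized (an assumption to keep the sketch short)
docs = [["kids", "like", "tv"], ["tv", "by", "radio"], ["kids", "listen", "radio"]]
n_docs = len(docs)

def tf_idf(term, doc):
    """Unnormalized tf-idf: raw count in `doc` times the smoothed idf."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * smoothed_idf(n_docs, df)

idf_common = smoothed_idf(n_docs, n_docs)  # term present in every document
idf_rare = smoothed_idf(n_docs, 1)         # term present in a single document
print(idf_common, idf_rare)
print(tf_idf("like", docs[0]), tf_idf("tv", docs[0]))
```

A term that occurs in every document gets an idf of exactly 1, so its weight equals its raw count, while rarer terms such as `'like'` are weighted up relative to more widespread terms such as `'tv'`.
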
### Introducing the TF-IDF Vectorizer

[Back to top](#Index:) 
<a id='q05'></a>


### Question 5:

*10 points*

Using the function `make_dtm_df` as a model, construct a function `make_tfidf_df` that constructs a term-frequency-inverse-document-frequency matrix from a corpus of text. 

+ Consult the [documentation for `TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from [Scikit-Learn](https://scikit-learn.org).
+ The function `make_tfidf_df` will have a single positional argument `corpus` (as with `make_dtm_df`) and an additional argument [`**kwargs`](https://docs.python.org/3/tutorial/controlflow.html#keyword-arguments).
  + Whatever keyword arguments are passed into `make_tfidf_df` should be passed into the call to `TfidfVectorizer` within the function body (and hence should be valid inputs for `TfidfVectorizer`). For instance, when calling `make_tfidf_df(corpus, stop_words='english')`, the call to `TfidfVectorizer` inside the function body should be `TfidfVectorizer(stop_words='english')`.
  + Valid values for `**kwargs` would be any keyword arguments from Scikit-Learn's [`TfidfVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
+ You will construct two DataFrames as the TFIDF matrices extracted from `text` (as given below). In particular:
  + a DataFrame `df1` constructed *without* filtering any stop-words.
  + a DataFrame `df2` constructed *after* filtering the standard `'english'` stop-words.

In [38]:
### GRADED

### YOUR SOLUTION HERE
from sklearn.feature_extraction.text import TfidfVectorizer
def make_tfidf_df(corpus, **kwargs):
    '''
    This function will take in a list of strings and return a TFIDF matrix as a DataFrame.
    INPUT:
      corpus: list of sentences (strings)
      kwargs: any keyword arguments (e.g., stop_words=None, encoding='utf-8', etc.) `TfidfVectorizer` accepts
    OUTPUT:
      returns DataFrame whose columns are labelled by the feature names of the TFIDF matrix.
    '''
    # Forward all keyword arguments to the vectorizer
    vectorizer = TfidfVectorizer(**kwargs)
    X = vectorizer.fit_transform(corpus)
    return pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
###
### YOUR CODE HERE
###


In [39]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

## Bayes's theorem

Now that you've built some routines for encoding corpora of text as matrices (i.e., either as arrays or as DataFrames), you can build an NLP model for *spam classification*. An important part of reasoning in spam classification tasks (and related tasks using NLP) is *Bayes's theorem*:

$$\displaystyle{\boxed{p(A\,|\,B) = \frac{p(B\,|\,A) p(A)}{p(B)}}}.$$

In the preceding, consider $A$ and $B$ as some possible *events*. With that in mind,
+ $p(A\,|\,B)$ is the *posterior probability of $A$ given $B$*;
+ $p(B\,|\,A)$ is the *likelihood of $B$ given $A$*;
+ $p(A)$ is the *prior probability of $A$*; and
+ $p(B)$ is the *evidence* (sometimes called a normalizing factor).

For a more meaningful context, consider sampling email or text messages: treat $A$ as the event that an incoming message is spam (i.e., undesired) and $B$ as the event that the incoming message contains the word "bargain". Bayes's theorem then lets us compute the posterior probability that an incoming message is spam given that it contains the word "bargain" from three quantities: the likelihood of a message containing "bargain" given that it is known to be spam, the prior probability of a message being spam, and the probability of "bargain" appearing in any message. For convenience, let's write this as

$$\displaystyle{p(\text{spam}\,|\,\text{bargain}) = \frac{p(\text{bargain}\,|\,\text{spam}) p(\text{spam})}{p(\text{bargain})}}.$$

The next question requires you to apply Bayes's theorem to reason about probabilities (in the context of spam classification). Remember, the goal is to identify messages as *spam* (not wanted, undesirable) or *ham* (i.e., the opposite of spam messages).
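
In practice the evidence is rarely given directly; when every message is either spam or ham, it follows from the law of total probability, $p(\text{bargain}) = p(\text{bargain}\,|\,\text{spam})\,p(\text{spam}) + p(\text{bargain}\,|\,\text{ham})\,p(\text{ham})$. The sketch below works through this with made-up numbers; the probabilities and the helper name `posterior` are purely illustrative assumptions, not values from any dataset.

```python
def posterior(likelihood_a, prior_a, likelihood_not_a):
    """Compute p(A|B) from p(B|A), p(A), and p(B|not A)."""
    prior_not_a = 1.0 - prior_a
    # Evidence p(B), via the law of total probability
    evidence = likelihood_a * prior_a + likelihood_not_a * prior_not_a
    return likelihood_a * prior_a / evidence

# Hypothetical numbers, chosen only for illustration:
# 20% of messages are spam; "bargain" appears in 30% of spam and 2% of ham.
p_spam_given_bargain = posterior(likelihood_a=0.30, prior_a=0.20, likelihood_not_a=0.02)
print(round(p_spam_given_bargain, 3))
```

Even though only a fifth of the hypothetical messages are spam, seeing "bargain" raises the probability of spam well above one half, because the word is so much more likely in spam than in ham.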

### Reasoning using Bayes's theorem

[Back to top](#Index:) 
<a id='q06'></a>


### Question 6:

*5 points*

Assume in the following that you have a training set of 2,500 messages known to be spam and 1,250 messages known to be ham (i.e., not spam). Suppose further that the word "holiday" occurs in 275 of the spam messages and 13 of the ham messages.

+ Assume the empirical prior probability of an incoming message being ham or spam is provided by the respective fractions of ham or spam messages in the training set.
     + Assign the  prior probability of spam to `prior_spam` (i.e., $p(\text{spam})$).
     + Assign the  prior probability of ham to `prior_ham`  (i.e., $p(\text{ham})$).
     
     
+ Assume the empirical likelihood of the word "holiday" occurring in a message known to be spam (respectively, ham) is given by the counts above.
     + Assign the (estimated) likelihood of "holiday" occurring in an incoming spam message to `likelihood_holiday_spam` (i.e., $p(\text{holiday}\,|\,\text{spam})$).
     + Assign the (estimated) likelihood of "holiday" occurring in an incoming ham message to `likelihood_holiday_ham` (i.e., $p(\text{holiday}\,|\,\text{ham})$).
     

+ Finally, combine the preceding computations to estimate the *posterior* probability of an incoming message being spam given that it contains the word "holiday" (that is, $p(\text{spam}\,|\,\text{holiday})$).
     + Assign the posterior probability $p(\text{spam}\,|\,\text{holiday})$ to `posterior_spam_holiday`.
     + Express all the values computed here as Python floating-point values, accurate to three decimal places.

In [40]:
### GRADED

### YOUR SOLUTION HERE:
prior_spam = 2500/3750
prior_ham = 1250/3750
likelihood_holiday_spam = 275/2500
likelihood_holiday_ham = 13/1250
# Evidence p(holiday), from the law of total probability over spam and ham
evidence_holiday = likelihood_holiday_spam*prior_spam + likelihood_holiday_ham*prior_ham
posterior_spam_holiday = (likelihood_holiday_spam*prior_spam)/evidence_holiday

###
### YOUR CODE HERE
###

### For verifying answer:
print('prior_spam: {:5.3f}'.format(prior_spam))
print('prior_ham:  {:5.3f}'.format(prior_ham))
print('likelihood_holiday_spam: {:5.3f}'.format(likelihood_holiday_spam))
print('likelihood_holiday_ham : {:5.3f}'.format(likelihood_holiday_ham))
print('posterior_spam_holiday: {:5.3f}'.format(posterior_spam_holiday))

prior_spam: 0.667
prior_ham:  0.333
likelihood_holiday_spam: 0.110
likelihood_holiday_ham : 0.010
posterior_spam_holiday: 0.955


In [41]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

## Filtering Spam from SMS Messages


Now, let's introduce a particular dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), namely the [*SMS Spam Collection*](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). This is a famous public set of labeled SMS messages that have been collected for spam research. This is provided for you locally in the file `data/SMSSpamCollection.txt`. Here is an excerpt of various lines from `data/SMSSpamCollection.txt`:
```
spam    Your credits have been topped up for http://www.bubbletext.com Your renewal Pin is tgxxrz
ham     U dun say so early hor... U c already then say...
ham     Nah I don't think he goes to usf, he lives around here though
ham     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham     K..k:)how much does it cost?
ham     First answer my question.
spam    Are you unique enough? Find out from 30th August. www.areyouunique.co.uk
ham     I'm home.
ham     Dear, will call Tmorrow.pls accomodate.
```
Notice that the text is substantially messier than the toy corpora considered above (with spelling mistakes, extra punctuation, nonstandard capitalization, and other confounding properties). The messages (or documents) are classified as `spam` (i.e., undesired messages) or as `ham` (i.e., useful or desired messages). The labels `ham` or `spam` are separated from the rest of the text in each line by a *tab* character. The data set is also unbalanced in that there are many more `ham` messages than `spam` messages (as one would hope!). 

### Preparing the SMS Messaging Data (I)

[Back to top](#Index:) 
<a id='q07'></a>


### Question 7:

*5 points*

Your next task is to load the SMS messaging data into a Pandas DataFrame `df`.

+ The data is stored in a file whose location is provided for you as `FILE_PATH`.
+ Use the function `pd.read_csv` to create the DataFrame `df` with the options `sep='\t'` (because the file is tab-separated), `header=None` (because there is no header line), and `names=['label', 'msg']`. 
+ Extract the leading column of `df` into a *DataFrame* `labels` (and *not* a *Series*) whose entries are either `ham` or `spam`.
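
As a minimal sketch of the steps listed above, the following reads two hypothetical tab-separated lines from an in-memory string (instead of `FILE_PATH`) and shows the difference between extracting a column as a Series and as a one-column DataFrame. The names `raw`, `df_demo`, `as_series`, and `as_frame` are illustrative only.

```python
import io
import pandas as pd

# Two hypothetical tab-separated lines in the same layout as the data file.
raw = "ham\tI'm home.\nspam\tAre you unique enough? Find out now.\n"

df_demo = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=['label', 'msg'])

as_series = df_demo['label']     # single brackets: a Series
as_frame = df_demo[['label']]    # double brackets: a one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```

Calling `to_frame()` on the Series is another way to obtain the one-column DataFrame.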

In [55]:
### GRADED
import pathlib
import pandas as pd
FILE_PATH = pathlib.Path.cwd().joinpath('data', 'SMSSpamCollection.txt')
### YOUR SOLUTION HERE
df = pd.read_csv(FILE_PATH, sep='\t', header=None, names=['label','msg'])
labels = df['label']
labels = labels.to_frame()
###
### YOUR CODE HERE
###
# For verifying answer:
type(labels)
labels

| | label | msg |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |

In [43]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

The text messages are noticeably messier than the toy corpora from earlier exercises. If you apply the function `make_dtm_df` to a subset of `df['msg']`, you can see that a number of meaningless tokens ("words") containing numbers are extracted:

In [44]:
make_dtm_df(df.msg.head(5))

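| | 08452810075over18 | 2005 | 21st | 87121 | already | amore | apply | around | available | buffet | ... | tkts | to | txt | until | usf | wat | wif | win | wkly | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |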


To clean the text up, you will remove punctuation, numbers, and tokens that begin with numbers from the messages (this does remove possibly meaningful tokens like "2nd" but also removes phone numbers and other distracting tokens). This is most easily achieved using the [`re`](https://docs.python.org/3/library/re.html) module for working with *regular expressions*. The essential logic is provided in the following function `process_text` (which also transforms the text to lower-case):

In [57]:
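import re
from string import punctuation
def process_text(text):
    # Lower-case the text, then drop numbers and tokens that begin with digits
    new_text = re.sub(r'\d+\w*', '', text.lower())
    # Strip every punctuation character
    for char in punctuation:
        new_text = new_text.replace(char, '')
    return new_text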
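
To see what this cleaning step does, here is a small self-contained sketch that repeats the logic of `process_text` and applies it to a hypothetical message (the string `sample` is made up for illustration).

```python
import re
from string import punctuation

def process_text(text):
    # Same logic as the function above, repeated so this sketch runs on its own.
    new_text = re.sub(r'\d+\w*', '', text.lower())
    for char in punctuation:
        new_text = new_text.replace(char, '')
    return new_text

sample = "FREE entry!! Text WIN to 87121 by 21st May, 2005."
tokens = process_text(sample).split()
print(tokens)
```

The tokens `87121`, `21st`, and `2005` disappear entirely, which illustrates the trade-off mentioned above: distracting numbers are removed, but so are possibly meaningful tokens that start with a digit.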

### Preparing the SMS Messaging Data (II)

[Back to top](#Index:) 
<a id='q08'></a>


### Question 8:

*5 points*

Your task now is to construct a document-term matrix `sms_messages` (encoded as a DataFrame) from the `msg` column of the DataFrame `df` from the preceding exercise.

+ The function `process_text` is provided for you above to strip out the offending tokens.
+ Use the function `make_dtm_df` from before to construct the required document-term matrix. Apply it to the `msg` column of the DataFrame `df`, passing the additional keyword arguments `stop_words='english'` and `preprocessor=process_text`.
+ The result should be bound to the identifier `sms_messages`.
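
The effect of these two keyword arguments can be sketched directly with Scikit-Learn's `CountVectorizer`. This is an assumption-laden illustration: it does not use `make_dtm_df` (whose definition appears earlier in the notebook), and the two documents in `docs` are hypothetical.

```python
import re
from string import punctuation

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def process_text(text):
    # Same cleaning logic as above, repeated so this sketch runs on its own.
    new_text = re.sub(r'\d+\w*', '', text.lower())
    for char in punctuation:
        new_text = new_text.replace(char, '')
    return new_text

# Hypothetical toy documents (not the SMS data).
docs = ["Win a FREE prize!!! Reply to 87121",
        "Is the meeting at 2pm today?"]

# The preprocessor cleans each document; English stop words are then dropped.
vectorizer = CountVectorizer(stop_words='english', preprocessor=process_text)
counts = vectorizer.fit_transform(docs)

# Order the column labels by their position in the count matrix.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dtm_demo = pd.DataFrame(counts.toarray(), columns=columns)
print(dtm_demo)
```

Note that a custom `preprocessor` replaces the vectorizer's built-in lower-casing, which is why `process_text` lower-cases the text itself.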

In [66]:
### GRADED

### YOUR SOLUTION HERE

sms_messages = make_dtm_df(df['msg'], stop_words='english', preprocessor=process_text)
###
### YOUR CODE HERE
###
sms_messages
# sms_messages.tail(10)


In [47]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Preparing the SMS Messaging Data (III)

[Back to top](#Index:) 
<a id='q09'></a>


### Question 9:

*10 points*

Finally, as usual in supervised learning, you want to split the data into training and testing data sets so that the model can be assessed after fitting it to the data.  Notice the use of the keyword argument `stratify`.

+ You'll use the `train_test_split` function from the Scikit-Learn module `sklearn.model_selection`.
  + Split both the document-term matrix `sms_messages` and the DataFrame `labels` in a single call.
  + Use the keyword argument `random_state=13` to fix a reproducible splitting.
  + Use the keyword argument `stratify=labels` to ensure that the proportions of ham and spam messages match in the training and the testing datasets.
  + Assign the outputs to the identifiers `sms_messages_train`, `sms_messages_test`, `labels_train` & `labels_test`. The first two are DataFrames containing the features (i.e., the columns of the document-term matrix); the last two contain the corresponding targets/labels (i.e., ham or spam).
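
The effect of stratification can be checked on a small example. The sketch below uses hypothetical toy data (16 ham and 4 spam observations) and an explicit `test_size=0.25`, which is an assumption made here only to keep the counts easy to read.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced toy data: 16 ham and 4 spam observations.
X_demo = list(range(20))
y_demo = ['ham'] * 16 + ['spam'] * 4

# train_test_split returns the pieces in the order
# (X_train, X_test, y_train, y_test).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=13, stratify=y_demo)

# With stratification, both subsets keep the 4:1 ham-to-spam ratio.
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, a random split of such a small unbalanced set could easily leave the test set with no spam messages at all.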

In [48]:
### GRADED

from sklearn.model_selection import train_test_split
### YOUR SOLUTION HERE
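# train_test_split returns (X_train, X_test, y_train, y_test) in that order;
# stratifying on the labels keeps the ham/spam proportions equal in both subsets.
sms_messages_train, sms_messages_test, labels_train, labels_test = train_test_split(
    sms_messages, labels, random_state=13, stratify=labels)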

###
### YOUR CODE HERE
###
# Verification:
print('There are {} training observations & {} testing observations.\n'
        .format(len(sms_messages_train), len(sms_messages_test)))


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


With the data loaded and the training & test sets prepared, we can use the training data to determine empirical estimates for the prior probabilities of messages being classified as spam or ham. Notice we want to avoid using data from the test set to compute anything to ensure that it can be used to assess the accuracy of the model later.

### Computing Priors

[Back to top](#Index:) 
<a id='q10'></a>


### Question 10:

*5 points*

Estimate prior probabilities of messages being `ham` and `spam` empirically using the training set.

+ Construct a Pandas Series `priors` with Index values `'ham'` and `'spam'` and corresponding values given by the fraction of `ham` and `spam` messages (respectively) in the training set `sms_messages_train`.


+ HINT: the Pandas Series method `value_counts` is useful here. Alternatively, you can do this using a suitable call to the `groupby` method.
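
Both approaches mentioned in the hint can be sketched on a hypothetical one-column DataFrame of labels (the name `labels_demo` and its contents are made up for illustration).

```python
import pandas as pd

# Hypothetical training labels (not the SMS data).
labels_demo = pd.DataFrame({'label': ['ham', 'ham', 'spam', 'ham']})

# normalize=True returns fractions instead of raw counts.
priors_demo = labels_demo['label'].value_counts(normalize=True)
print(priors_demo)

# The same fractions via groupby: class sizes divided by the total count.
priors_groupby = labels_demo.groupby('label').size() / len(labels_demo)
print(priors_groupby)
```

In both cases the resulting Series is indexed by the class names, so `priors_demo['ham']` and `priors_demo['spam']` look up the two priors directly.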

In [None]:
### GRADED

### YOUR SOLUTION HERE:
# Fractions (not raw counts) of ham and spam messages among the training labels;
# the resulting Series is indexed by 'ham' and 'spam'.
priors = labels_train['label'].value_counts(normalize=True)
###
### YOUR CODE HERE
###

### For verifying answer:
print('Training priors (%):\n===================\n{}\n'.format(100 * priors))

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


With the prior probabilities estimated from the training set, you need to be able to compute *likelihoods* next. Remember, given any word $w$ in the vocabulary (i.e., the column labels of the document-term matrix),
the likelihood $p(w\,|\,\text{ham})$ (and, respectively $p(w\,|\,\text{spam})$) is a conditional probability of a message containing the word $w$ given that it is ham (respectively, spam). This can be worked out by iterating over the messages, counting the number of ham (respectively, spam) messages in which the word $w$ occurs and dividing by the total number of ham (respectively, spam) messages. We can write this as

$$ p(w\,|\,C) \simeq \frac{|C \cap w|}{|C|} $$

where $C$ is the set of all ham (respectively, spam) messages, $C \cap w$ is the set of messages in set $C$ that include the word $w$, and the cardinality of sets is denoted using a delimiting pair of vertical bars (e.g., $|C|$ is the number of elements in the set $C$).

There is a problem, however, when a particular word in the vocabulary fails to occur in a particular subset of the training data. For instance, suppose the word *meeting* occurs in the set of ham training messages but does not in any of the *spam* training messages. In that case, the set $\text{spam} \cap \text{meeting}$ is empty and the empirical likelihood of finding the word "meeting" in a spam message will be computed as zero. This is an artifact of the particular corpus of training messages; that is, that word does not happen to occur in any of the finite number of spam messages in the training set. Unfortunately, it is not reasonable to assume that, more generally, the likelihood of the word "meeting" occurring in *any* incoming spam message is zero.

To compensate for this kind of problem, you can use [*Laplace smoothing*](https://en.wikipedia.org/wiki/Additive_smoothing). That is, modify the computation of an empirical likelihood as follows:

$$ p(w\,|\,C) \simeq \frac{|C \cap w| + \color{red}{\beta}}{|C| + \color{red}{n_w} } $$

In the above, $C$ refers to any one of the available categories in the classification problem (i.e., ham or spam here), $w$ is the sought event (in this case, a word $w$ occurring in a message in the set $C$), $\beta$ is a *smoothing parameter* (typically 1), and $n_w$ is another parameter (in this case, the number of words in the vocabulary to scan for). This has the effect of moving likelihoods away from zero (zero likelihoods are problematic when multiplying likelihoods together, as is required for naive Bayes classification).
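
As a small worked example of the smoothed formula above, the sketch below uses hypothetical counts; the helper name `smoothed_likelihood` is introduced here for illustration only.

```python
def smoothed_likelihood(n_containing, n_class, n_w, beta=1.0):
    """Smoothed estimate (|C ∩ w| + beta) / (|C| + n_w), as in the formula above."""
    return (n_containing + beta) / (n_class + n_w)

# Hypothetical counts: "meeting" appears in 0 of 100 spam messages,
# and 3 words are being scanned for.
unsmoothed = 0 / 100
smoothed = smoothed_likelihood(0, 100, n_w=3)
print(unsmoothed, smoothed)
```

The unsmoothed estimate is exactly zero, whereas the smoothed estimate is 1/103: small, but nonzero, so it no longer wipes out a product of likelihoods.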

Your next task is to construct a function to compute likelihoods of given words in spam or ham.

### Computing Likelihoods

Next, we encapsulate the ideas above into another function that operates on DataFrames. The function `get_likelihoods` below computes the *empirical likelihoods* of a word being in a message of each possible category (`ham` and `spam` in this case).


+ The inputs are `words` (a list of strings), `messages` (a DataFrame), `labels` (a DataFrame with a single column `'label'`), and `beta` (default 1).
+ The input DataFrame `messages` is assumed to have the structure output by `make_dtm_df` from above. This is precisely the form of `sms_messages_train`; `labels` must share its row index (as `labels_train` does).
+ The strings in `words` can be upper or lower case. If a string in `words` is not in `messages.columns`, it will be ignored.
+ The function returns a Pandas DataFrame:
  + The Index contains the categories of `labels['label']` (`ham` and `spam` in this case).
  + The columns are given by the words of the input list `words` that occur in `messages.columns`.
  + The values in the column labeled by a given word are computed by Laplace smoothing as described above.

In [None]:


def get_likelihoods(words, messages, labels, beta=1):
    """Computes Laplace-smoothed likelihoods of each word in *words* for ham and spam."""
    n = labels.label.value_counts()       # Number of messages in each category
    n_w = len(words)                      # Term added to each denominator
    is_spam = (labels.label == 'spam')    # Boolean mask aligned with the rows of messages
    d = {}
    # Lower-case the words and keep only those in the vocabulary of *messages*
    new_words = pd.Index([word.lower() for word in words]).intersection(messages.columns)
    for word in new_words:
        count_spam = messages.loc[ is_spam, word].sum()
        count_ham  = messages.loc[~is_spam, word].sum()
        d[word] = pd.Series({'ham': (count_ham + beta)/(n['ham'] + n_w), 'spam': (count_spam + beta)/(n['spam'] + n_w)})
    return pd.DataFrame(data=d)



# Verification:
words = ['customer', 'congrats', 'meeting']
likelihoods = get_likelihoods(words, sms_messages_train, labels_train)
print(likelihoods)

## The Naive Bayes Assumption

Finally, you can now try to see how to put these pieces together to classify the testing data as spam or ham. The key assumption in [*naive Bayes classification*](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is that the *likelihoods are independent* (i.e., that the likelihood of each word occurring in a message known to be spam or ham is independent of the other words occurring). This is the "naive" assumption, but it makes the computations much simpler (particularly when the vocabulary is large, say, thousands of words).

Under this assumption, the joint likelihood of the words $w_1$ and $w_2$ occurring in category $C$ can be determined as a product:

 $$ p(w_1 \cap w_2 \,|\,C) = p(w_1\,|\,C)\times p(w_2\,|\,C). $$
 
 This generalizes easily to arbitrarily many words/features (and even arbitrarily many classes in a multi-class classification problem): 
 
  $$ p(w_1, w_2, \dotsc, w_d\,|\, C) = \prod_{k=1}^{d}p(w_{k}\,|\,C).$$
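
The product formula above can be sketched in plain Python. The per-word likelihoods below are hypothetical values chosen for illustration; the log-space variant at the end is an optional extra, included because products of many small probabilities can become extremely small for large vocabularies.

```python
import math

# Hypothetical per-word likelihoods p(w_k | C) for the words of one message.
likelihoods_spam = [0.20, 0.05, 0.10]
likelihoods_ham = [0.02, 0.10, 0.30]

# Naive Bayes assumption: the joint likelihood is the product of the factors.
joint_spam = math.prod(likelihoods_spam)
joint_ham = math.prod(likelihoods_ham)
print(joint_spam, joint_ham)

# Equivalent computation in log space: a sum of logs instead of a product.
log_joint_spam = sum(math.log(p) for p in likelihoods_spam)
print(math.exp(log_joint_spam))
```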

Let's see how to do this with an example message from the test data. We'll extract a single row (a count-vectorized message) and extract the columns that correspond to words that actually occurred in that message.

In [None]:
k = 10                                   # Integer index of message to extract
message = sms_messages_test.iloc[k]      # Row of document-term matrix
y_true = labels_test.iloc[k, 0]          # Corresponding label
words = list(message[message > 0].index) # Corresponding words in message
print(f'After applying the count vectorizer, message {k} includes the following words: {words}.\nThis message is known a priori to be {y_true}.')

In the implementation developed so far, we can compute the likelihoods of the words in `words` occurring in a ham or spam message using the function `get_likelihoods`. From there, the *joint likelihoods* of all the relevant words occurring in a message are easily computed using the Pandas DataFrame method `prod`. Observe that some words are more likely to occur in spam than in ham and some are less likely to occur in spam than in ham.

In [50]:
likelihoods = get_likelihoods(words, sms_messages_train, labels_train)
likelihoods


In [51]:
joint_likelihoods = likelihoods.prod(axis=1)    # Use Naive Bayes assumption of independence of individual likelihoods
print(joint_likelihoods)


Now that the joint likelihoods of the words occurring in the message (given that it is ham or spam) are known, they can be combined with the priors to compute the evidence and, finally, the posterior probabilities, as Bayes' theorem tells us. Once the posterior probabilities are known for both the ham and spam classes, the larger of the two determines how to classify a new message instance.

In [52]:
evidence = (joint_likelihoods * priors).sum()
posteriors = joint_likelihoods * priors / evidence  # Bayes' theorem
print('Posteriors:\n{}\n'.format(posteriors))
y_pred = posteriors.idxmax()     # Take largest posterior probability to predict class
print('y_true = {}\ny_pred = {}'.format(y_true, y_pred))


### Computing Posteriors


Next, we combine the ideas from the preceding lines of code into a function `get_posteriors` that returns a DataFrame with posterior probabilities of messages being ham or spam.

+ The inputs are `messages` (a DataFrame), `training` (another DataFrame), `labels` (a DataFrame with a single column `'label'`), and `beta` (default 1).
  + The input DataFrame `messages` holds the rows of a document-term matrix to classify (for instance, rows of `sms_messages_test`); it does not include the targets for supervised learning.
  + The input DataFrame `training` is assumed to have the structure output by `make_dtm_df` from above. This is precisely the form of `sms_messages_train`.
  + The input DataFrame `labels` contains the targets corresponding to the rows of `training` (for instance, `labels_train`).
  + The input `beta` is the parameter required for Laplace smoothing (default value 1).
+ The function returns a Pandas DataFrame:
  + The (row) `Index` is the same as that of the input DataFrame `messages` (i.e., the labels of the messages to classify).
  + The `columns` attribute contains the categories of `labels['label']` (`'ham'` and `'spam'` in this case).
  + The values are the corresponding posterior probabilities of each message being classified as ham or spam respectively.

In [None]:
def get_posteriors(messages, training, labels, beta=1):
    """Computes empirical posterior probabilities for classification of rows of *messages*
    INPUT:
      messages:    DataFrame with document-term frequency matrix (ints)
      training:    DataFrame with document-term frequency matrix (ints)
      labels:      DataFrame with column 'label' (categorical)
      beta:        (default value 1) Smoothing constant as required by get_likelihoods
    OUTPUT:
      posteriors:  DataFrame with the same (row) Index as messages and the categories of labels['label']
                   as column index. The entries are the empirical posterior probabilities of each given row
                   being classified as the corresponding category (ham or spam in this case).
                   Empirical likelihoods computed using Laplace smoothing
    EXAMPLE:
    >>> messages = sms_messages_test.iloc[:3]
    >>> posteriors = get_posteriors(messages, sms_messages_train, labels_train)
    >>> print(posteriors)
                   ham      spam
    1303  1.600012e-01  0.839999
    4198  1.137179e-22  1.000000
    1710  9.998436e-01  0.000156
    >>> posteriors = get_posteriors(messages, sms_messages_train, labels_train, beta=0.5)
    >>> print(posteriors)
                   ham      spam
    1303  8.702614e-01  0.129739
    4198  1.943228e-24  1.000000
    1710  9.999977e-01  0.000002
    """
    priors = labels.label.value_counts()/len(labels)
    posteriors = {}
    for idx, row in messages.iterrows():
        words = row[row>0].index
        likelihoods = get_likelihoods(words, training, labels, beta)
        joint_likelihoods = likelihoods.prod(axis=1)
        evidence = (joint_likelihoods * priors).sum()
        posteriors[idx] = joint_likelihoods * priors / evidence  # Bayes' theorem
    return pd.DataFrame(data=posteriors).T

Below, we modify the function `get_posteriors` above to yield a function `classify_messages`.


+ The inputs are the same as for `get_posteriors`, i.e., `messages` (a DataFrame), `training` (another DataFrame), `labels` (a DataFrame with a single column `'label'`), and `beta` (default 1).
+ This function returns a single-column Pandas DataFrame:
  + The (row) `Index` is the same as that of the input DataFrame `messages` (i.e., the labels of the messages to classify).
  + The values are the labels `'ham'` and `'spam'` (according to how each message is classified).
  + Each label is the category with the largest corresponding posterior probability for that message.

In [None]:
def classify_messages(messages, training, labels, beta=1):
    priors = labels.label.value_counts()/len(labels)
    posteriors = {}
    for idx, row in messages.iterrows():
        words = row[row>0].index
        likelihoods = get_likelihoods(words, training, labels, beta)
        joint_likelihoods = likelihoods.prod(axis=1)
        evidence = (joint_likelihoods * priors).sum()
        posteriors[idx] = joint_likelihoods * priors / evidence  # Bayes' theorem
    posteriors = pd.DataFrame(data=posteriors).T
    preds = posteriors.apply(lambda t: t.idxmax(), axis=1)
    return pd.DataFrame(preds)

#Verification
messages = sms_messages_test.iloc[:3]
y_pred = classify_messages(messages, sms_messages_train, labels_train)
y_pred

In practice, you don't have to implement all these probabilistic computations from scratch. The *Multinomial Naive Bayes estimator* in [`sklearn.naive_bayes.MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) implements a lot of this logic for you. As with other Scikit-Learn Estimator classes, there are `fit` and `predict` methods that provide the required logic.

From the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB):

> The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
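
As a minimal sketch of the `fit`, `predict`, and `score` workflow, the example below trains a `MultinomialNB` estimator on a tiny hypothetical count matrix. The vocabulary, counts, and labels are made up for illustration and are unrelated to the SMS data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term counts for the vocabulary ['free', 'meeting', 'win'].
X_toy = np.array([[2, 0, 1],
                  [1, 0, 2],
                  [0, 2, 0],
                  [0, 1, 0]])
y_toy = np.array(['spam', 'spam', 'ham', 'ham'])

model = MultinomialNB()   # default smoothing alpha=1.0
model.fit(X_toy, y_toy)

new_message = np.array([[1, 0, 1]])   # contains 'free' and 'win' once each
print(model.classes_)                 # class labels in sorted order
print(model.predict(new_message))     # predicted class for the new message
print(model.score(X_toy, y_toy))      # mean accuracy on the toy training data
```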

As preparation for the next exercise, you will import the `MultinomialNB` class from the module `sklearn.naive_bayes`. It's also useful to apply the DataFrame `squeeze` method to transform the single-column DataFrames `labels`, `labels_train`, and `labels_test` into Pandas Series; this is required to ensure that the targets passed into the `MultinomialNB` estimators have the appropriate shape (some Scikit-Learn estimator classes are more forgiving than others in accepting distinct Pandas/NumPy compatible inputs).
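
The effect of `squeeze` can be seen on a hypothetical one-column DataFrame (the name `labels_demo` is used for illustration only).

```python
import pandas as pd

# Hypothetical one-column DataFrame of labels.
labels_demo = pd.DataFrame({'label': ['ham', 'spam', 'ham']})

squeezed = labels_demo.squeeze()   # one-column DataFrame -> Series
print(type(labels_demo).__name__, labels_demo.shape)
print(type(squeezed).__name__, squeezed.shape)
```

The DataFrame has a two-dimensional shape, whereas the squeezed Series is one-dimensional, which is the target shape Scikit-Learn estimators generally expect.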

In [49]:
from sklearn.naive_bayes import MultinomialNB
print(type(labels), type(labels_train), type(labels_test))
labels = labels.squeeze()
labels_train = labels_train.squeeze()
labels_test = labels_test.squeeze()
print(type(labels), type(labels_train), type(labels_test))


### Using a Scikit-Learn `MultinomialNB` Estimator
[Back to top](#Index:) 
<a id='q11'></a>


### Question 11:

*10 points*


For this exercise, you will instantiate and apply a `MultinomialNB` estimator to make spam predictions.

+ The `MultinomialNB` class is imported from `sklearn.naive_bayes` for you.
+ You will instantiate an object of the `MultinomialNB` class without keyword arguments.
+ You will fit the model to the training data using `sms_messages_train` and `labels_train`.
+ In addition, once the model has been fit to the training data, use the `score` method and the testing data `sms_messages_test` & `labels_test` to determine how well the model works. Assign the corresponding result to `accuracy`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
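nbayes = MultinomialNB()                                   # No keyword arguments (default alpha=1.0)
nbayes.fit(sms_messages_train, labels_train)               # Fit the model to the training data
accuracy = nbayes.score(sms_messages_test, labels_test)    # Mean accuracy on the testing data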

###
### YOUR CODE HERE
###
# Verification:
print(nbayes.classes_)
print(nbayes.class_count_)
print(f'The accuracy of the classifier with default (alpha=1.0) smoothing is {accuracy:5.3f}.')

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

Not bad! You can experiment with a TF-IDF matrix or a DTM with $n$-grams, and with various levels of Laplace smoothing, to see how much these hyperparameters affect the efficacy of the resulting classifier.
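
One possible way to set up such an experiment is sketched below. It assumes a Scikit-Learn pipeline built from `TfidfVectorizer` (with unigrams and bigrams) and `MultinomialNB`; the toy corpus, the query string, and the two values of `alpha` are hypothetical choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the SMS data).
docs = ["win a free prize now", "free entry win cash",
        "see you at the meeting", "lunch after the meeting today"]
targets = ['spam', 'spam', 'ham', 'ham']

predictions = {}
for alpha in (0.1, 1.0):
    # TF-IDF weights over unigrams and bigrams, then naive Bayes with smoothing alpha.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         MultinomialNB(alpha=alpha))
    pipe.fit(docs, targets)
    predictions[alpha] = pipe.predict(["free cash prize"])[0]

print(predictions)
```

On real data, you would compare the test-set accuracy (via `score`) across settings rather than a single prediction.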


Beyond Scikit-Learn, many more convenient tools for natural language processing are available through [NLTK](https://www.nltk.org/), the Python Natural Language Toolkit. NLTK contains many useful modules for manipulating text, easy-to-use interfaces to [large corpora of text](http://www.nltk.org/nltk_data/), and an [online book for learning NLP](http://nltk.org/book).