# Introduction to Machine Learning 3

See https://learning.anaconda.cloud/getting-started-with-ai-ml

Covers the major supervised machine learning algorithms, which use labelled data to make predictions:

- Linear regression
- Logistic regression
- **Naive Bayes**
- Decision trees / random forests
- Neural networks

Using `scikit-learn` for implementation

## Naive Bayes

**Naive Bayes** is a machine learning algorithm often used for text classification.

In this module, you'll learn:

- A brief overview of Naive Bayes
- How to vectorize text with `scikit-learn`
- How to build an email spam filter



**Naive Bayes** is an ML application of Bayes' theorem that combines the probabilities of multiple features to predict a category.
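
For reference, the standard textbook statement of Bayes' theorem and of the "naive" independence assumption is sketched below. The symbols are chosen here for illustration and do not come from the course: $C$ is a category and $x_1, \dots, x_n$ are the features.

$$P(C \mid x_1, \dots, x_n) = \frac{P(C)\,P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}$$

$$P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

The second line treats the features as independent given the category, which is what makes the method "naive".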

Often used to classify text; it learns quickly from little data.

Maps the probability of each individual feature occurring or not occurring for a given category.

Commonly used for discrete variables such as words, but can also handle continuous variables by modeling them with statistical distributions. #FurtherLearning
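
As a minimal sketch of the continuous case, `scikit-learn` provides `GaussianNB`, which models each feature with a normal distribution per category. The data below is hypothetical (a single made-up feature, loosely "message length in characters") and is only meant to show the shape of the API.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: one continuous feature per sample, two categories.
X = np.array([[20.0], [25.0], [30.0], [200.0], [220.0], [250.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB estimates a mean and variance per feature and per category.
model = GaussianNB().fit(X, y)

# Predict the category of two new values, one near each cluster.
predictions = model.predict(np.array([[22.0], [230.0]]))
print(predictions)
```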



### Exploring Naive Bayes with `scikit-learn`

`scikit-learn` provides several Naive Bayes models; these exercises use `MultinomialNB`.

This model can predict among two or more categories.

To transform text into numeric inputs, these exercises use `CountVectorizer`, which turns text inputs into columns, one per word, counting the number of times each word occurs in each input.

### Algorithm

To predict a category for a set of features:

1. For a given category, combine the probabilities of each feature occurring and not occurring by multiplying:

$$\text{Occur Product} = P_{f_1} \times P_{f_2} \times P_{f_3} \times \dots \times P_{f_n}$$

$$\text{Not Occur Product} = (1 - P_{f_1}) \times (1 - P_{f_2}) \times (1 - P_{f_3}) \times \dots \times (1 - P_{f_n})$$

2. Combine the products from above:
$$
\text{Combined Probability} = \frac{\text{Occur Product}}{\text{Occur Product} + \text{Not Occur Product}}
$$

3. Calculate the combined probability for every category, then pick the category with the highest combined probability.

Standard practices such as train/test splits still apply.
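
The three steps above can be sketched in plain Python. The per-feature probabilities below are invented for illustration, and the function name `combined_probability` is hypothetical; this follows the simplified formula in these notes and is not how `MultinomialNB` is implemented internally.

```python
from math import prod

def combined_probability(feature_probs):
    """Combine per-feature probabilities for one category (steps 1 and 2)."""
    occur = prod(feature_probs)                     # product of P_f
    not_occur = prod(1 - p for p in feature_probs)  # product of (1 - P_f)
    return occur / (occur + not_occur)

# Hypothetical probabilities of three features for each category.
category_feature_probs = {
    "spam": [0.9, 0.8, 0.6],
    "not spam": [0.1, 0.2, 0.4],
}

# Step 3: score every category and keep the highest.
scores = {c: combined_probability(p) for c, p in category_feature_probs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The sketch divides by zero if both products are zero, for example when one feature has probability 0 and another has probability 1, so real inputs would need guarding or smoothing.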

### Example - a simple spam filter

In [2]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# turn off scientific notation
np.set_printoptions(suppress=True)

In [3]:
df = pd.read_csv('https://bit.ly/3zQBV5y')
df

| | msg | spam_ind |
|---|---|---|
| 0 | Hey there! I thought you might find this inter... | 1 |
| 1 | Get viagra for a discount as much as 90% | 1 |
| 2 | Viagra prescription for less | 1 |
| 3 | Even better than Viagra, try this new prescrip... | 1 |
| 4 | My name is Natasha, I want to meet you | 1 |
| 5 | Meet the hottest singles on the #1 dating site | 1 |
| 6 | Hey, I left my phone at home. Email me if you ... | 0 |
| 7 | Please see attachment for notes on today's mee... | 0 |
| 8 | An item on your Amazon wish list received a di... | 0 |
| 9 | Your prescription drug order is ready | 0 |


In [4]:
# Vectorize each message by counting word occurrences,
# then separate the inputs X from the output labels Y.
cv = CountVectorizer()
X = cv.fit_transform(df['msg'])
Y = df['spam_ind']

# Print the count vectorizer output as a table
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

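| | 90 | account | afternoon | amazon | an | anything | as | at | attachment | be | ... | this | thought | to | today | try | viagra | want | wish | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |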


In [5]:
# Break up the emails into train/test datasets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=7)

model = MultinomialNB().fit(x_train, y_train)

In [6]:
# Score the accuracy of the model.
result = model.score(x_test,y_test)

print(result)

confusion_matrix(y_true=y_test, y_pred=model.predict(x_test))

0.75


array([[2, 0],
       [1, 1]], dtype=int64)
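
To connect the two outputs above, the accuracy can be recomputed from the confusion matrix by hand. In `scikit-learn`'s convention, rows are true labels and columns are predicted labels, so for labels 0 (not spam) and 1 (spam) the cells are true negatives, false positives, false negatives and true positives. The sketch below copies the matrix from the output and uses plain Python.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
matrix = [[2, 0],
          [1, 1]]

(tn, fp), (fn, tp) = matrix

# Accuracy is the share of test emails on the diagonal (correct predictions).
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)
```

Here one spam email was missed (a false negative) and no legitimate email was flagged as spam, which gives the 0.75 reported by `model.score`.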

### Extended example with user input

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

np.set_printoptions(suppress=True)

Use the `message` variable to create a test message of your choosing. Note that if the message consists entirely of words the classifier has never seen before, it will be on the fence about whether or not it is spam.

In [14]:
message = "Meet hot singles now, no need for onions"

In [15]:
# read training data, add input to DataFrame
df = pd.read_csv('https://bit.ly/3zQBV5y')
df.loc[len(df.index)] = [message, 1]  # append the message; the label is a placeholder, since this row is left out of training
df

| | msg | spam_ind |
|---|---|---|
| 0 | Hey there! I thought you might find this inter... | 1 |
| 1 | Get viagra for a discount as much as 90% | 1 |
| 2 | Viagra prescription for less | 1 |
| 3 | Even better than Viagra, try this new prescrip... | 1 |
| 4 | My name is Natasha, I want to meet you | 1 |
| 5 | Meet the hottest singles on the #1 dating site | 1 |
| 6 | Hey, I left my phone at home. Email me if you ... | 0 |
| 7 | Please see attachment for notes on today's mee... | 0 |
| 8 | An item on your Amazon wish list received a di... | 0 |
| 9 | Your prescription drug order is ready | 0 |


In [16]:
# vectorize training data along with user input
cv = CountVectorizer()
X_all = cv.fit_transform(df['msg'])

# Print count vectorizer as a table 
pd.DataFrame(X_all.toarray(), columns=cv.get_feature_names_out())

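| | 90 | account | afternoon | amazon | an | anything | as | at | attachment | be | ... | this | thought | to | today | try | viagra | want | wish | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |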


In [17]:
# extract the vectorized training data
# omitting the message we added earlier
X_train = X_all[:-1,:]
Y_train = df["spam_ind"].iloc[:-1]

# extract out the test input from the last row
X_test = X_all[-1:, :]

In [18]:
# Fit the MultinomialNB model to the training data,
# and predict the probability of being spam for the test email.
# Note that predict_proba() returns two values per input:
# the probability of not being spam, then the probability
# of being spam.
# We want the second value, so we extract it.

# Create multinomial Naive Bayes and train model
model = MultinomialNB().fit(X_train, Y_train)

# Test the user input for spam
probability_of_spam = model.predict_proba(X_test).flatten()[1]
print("Spam probability: {0}".format(probability_of_spam))

Spam probability: 0.8548727242515902
