# Naive Bayes

# Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

1. __Regression__: The target variable $y \in \mathcal{Y}$ is continuous:  $\mathcal{Y} \subseteq \mathbb{R}$.
2. __Classification__: The target variable $y$ is discrete and takes on one of $K$ possible values:  $\mathcal{Y} = \{y_1, y_2, \ldots, y_K\}$. Each discrete value corresponds to a *class* that we want to predict.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/recipes.csv -P data

File ‘data/recipes.csv’ already there; not retrieving.



In [None]:
df = pd.read_csv("data/recipes.csv")
df.head(20)

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger..."
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic..."
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou..."
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork..."
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v..."


## What are we doing and why are we using Naive Bayes?

We have a bunch of recipes, each labeled with a cuisine. When someone sends us new recipes, which cuisine do they belong to?

We're going to train a classifier to recognize Italian food, so that when someone sends us a new recipe we can tell whether it's Italian. (We love Italian food, and it's the only thing we want to eat.)

For classification algorithms, you must have labels for your dataset.

**For clustering**

1. You'll get a lot of documents
2. You feed them to an algorithm and tell it to create `x` number of categories
3. The machine gives you back categories whether they make sense or not

**For classification (which we are doing now)**

1. You'll get a lot of documents
2. You'll classify some of them into categories that you know and love
3. You'll ask the algorithm what categories a new bunch of unlabeled documents end up in

All mean the same thing: CATEGORY = CLASS = LABEL

The reason why you use machine learning is to not do things manually. So if you can do things manually, do it. Otherwise just try different algorithms until one works well (but you might need to know some upsides or downsides of each to interpret that).


## How does Naive Bayes work? With Bayes Theorem

Bayes Theorem or Bayes Rule is the mathematical rule that describes how to update a belief, given some evidence. In other words – it describes the act of learning.

<img width=70% src="https://www.freecodecamp.org/news/content/images/2020/07/Screenshot-2020-07-19-at-22.58.48.png">


There are four parts:

*   **Posterior probability** (updated probability after the evidence is considered)
*   **Prior probability** (the probability before the evidence is considered)
*   **Likelihood** (probability of the evidence, given the belief is true)
*   **Marginal probability** (probability of the evidence, under any circumstance)



<img width=70% src="https://www.freecodecamp.org/news/content/images/2020/07/Screenshot-2020-07-22-at-23.44.18.png">

**Example**

* If you see a word that is normally in a spam email, there's a higher chance it's spam
* If you see a word that is normally in a non-spam email, there's a higher chance it's not spam
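
The spam example can be worked through numerically. This is only a sketch: every probability below is a made-up number chosen to show the arithmetic, not something measured from real email.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
prior_spam = 0.2          # P(spam): the prior
p_word_given_spam = 0.6   # P(word | spam): the likelihood
p_word_given_ham = 0.05   # P(word | not spam)

# Marginal: P(word) = P(word | spam) P(spam) + P(word | not spam) P(not spam)
marginal = p_word_given_spam * prior_spam + p_word_given_ham * (1 - prior_spam)

# Posterior: P(spam | word)
posterior_spam = p_word_given_spam * prior_spam / marginal
print(round(posterior_spam, 3))
```

With these invented numbers, seeing the word moves our belief that the email is spam from the prior of 0.2 up to a much higher posterior.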

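**Naive:** we assume every word/ingredient/etc. is **independent** of every other word or feature once you know the class. That's rarely true in practice (spaghetti and tomato sauce tend to show up together), but the shortcut keeps the math simple.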

Easier Interpretation: If you see ingredients that are normally in italian food, it's probably italian.
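
Written out, the naive assumption says the probability of a class given all the features is proportional to the prior times one likelihood term per feature. The notation here is an assumption of this sketch rather than something defined earlier: $x_1, \ldots, x_d$ stand for the $d$ features (for example, whether each ingredient is present) and $y$ for the class.

$$
P(y \mid x_1, \ldots, x_d) \;\propto\; P(y) \prod_{j=1}^{d} P(x_j \mid y)
$$

The classifier picks the class $y$ that makes the right-hand side largest. The marginal probability is the same for every class, so it can be dropped when all we want is the winner.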

Secret trick: you can't just use text, you have to convert into numbers.

## Types of Naive Bayes

Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.

**Multinomial Naive Bayes - (multiple numbers)**: You count the words. You care about whether a word appears once or twice or three times or ten times. *This is better for long passages*

**Bernoulli Naive Bayes - True/False Bayes:** You only care if the word shows up (`True`) or it doesn't show up (`False`) - *this is better for short passages*
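
To see the difference between the two kinds of features, here is a small sketch using scikit-learn's `CountVectorizer` on two toy documents (the documents are invented for illustration). The `binary=True` option switches from counts to present/absent.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["salt salt salt pepper", "salt pepper"]  # toy documents

# Multinomial-style features: how many times each word appears
counts = CountVectorizer().fit_transform(docs).toarray()

# Bernoulli-style features: only whether each word appears
presence = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)    # the rows differ: the first document repeats "salt"
print(presence)  # the rows are identical: both documents contain both words
```

For a list of ingredients, where each ingredient usually shows up once, the true/false view loses very little.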

### Let's convert our text data into numerical data

In [None]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


**Our problem:** Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn't matter

**Two things to convert into numbers:**

* Our labels (a.k.a. the categories everything belongs in)
* Our features

### Converting our labels into numbers

We have two labels

* italian = `1`
* not italian = `0`

In [None]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


In [None]:
def make_label(cuisine):
    if cuisine == "italian":
        return 1
    else:
        return 0

In [None]:
df['label'] = df['cuisine'].apply(make_label)
df.head(20)

Unnamed: 0,cuisine,id,ingredient_list,label
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0
3,indian,22213,"water, vegetable oil, wheat, salt",0
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork...",0
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1
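
As an aside, the same labels can be built without writing a function, by comparing the whole column at once. This is a sketch on a toy DataFrame (the rows are hypothetical stand-ins for the recipes data).

```python
import pandas as pd

# Toy stand-in for the recipes DataFrame (hypothetical rows)
toy = pd.DataFrame({"cuisine": ["greek", "italian", "indian", "italian"]})

# Compare the whole column at once, then turn True/False into 1/0
toy["label"] = (toy["cuisine"] == "italian").astype(int)
print(toy)
```

Both approaches give the same `0`/`1` column; the comparison version is just shorter.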


### Converting our features into numbers

**Feature selection:** The process of selecting the features that matter, in this case - what ingredients do we want to look at?

Our features are going to be: whether it has spaghetti or not, and whether it has curry powder or not

In [None]:
df['has_spaghetti'] = df['ingredient_list'].str.contains("spaghetti")
df['has_curry_powder'] = df['ingredient_list'].str.contains("curry powder")
df.head(10)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0,False,False
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0,False,False
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1,False,False
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork...",0,False,False
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1,False,False


## Let's run our tests

Let's feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!

### Looking at our labels

We stored them in `label`: if it's `0` it's not Italian, if it's `1` it is Italian

In [None]:
df['label'].head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

### Look at our features

We have two features `has_spaghetti` and `has_curry_powder`.

In [None]:
df[['has_spaghetti', 'has_curry_powder']].head()

Unnamed: 0,has_spaghetti,has_curry_powder
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False


In [None]:
# We need to split into training and testing data
from sklearn.model_selection import train_test_split

In [None]:
# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1, not italian/italian)
    test_size=0.2) # 80% training, 20% testing
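
One thing to know about `train_test_split`: it shuffles randomly, so the scores in this notebook will change a little every time you re-run it. The sketch below, on invented toy data, shows two optional parameters that help: `random_state` makes the split repeatable, and `stratify` keeps the share of each class the same in both pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 of them positive (hypothetical values)
X = pd.DataFrame({"has_spaghetti": [True, False] * 5})
y = pd.Series([1, 0, 0, 0, 0, 1, 0, 0, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.5,
    random_state=42,  # same split every run
    stratify=y,       # keep the class balance the same in both halves
)
print(y_train.sum(), y_test.sum())
```

Without `stratify`, a rare class can end up almost entirely in one half by bad luck.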

## Using Bernoulli Naive Bayes

In [None]:
# Import naive_bayes to get access to ALL kinds of naive bayes classifiers
# But REMEMBER we're using Bernoulli because it's for true/false which is fine
# for small passages
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Feed the classifier two things:
#   * our training features (X_train)
#   * our training labels (y_train)
# To help it study for the exam later when we test it
clf.fit(X_train, y_train)

BernoulliNB()

In [None]:
# This looks ugly, but it's one prediction per test recipe
# All those zeroes = predicted not italian
# The output is cut off, so we only see the first three and the last three predictions
clf.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
clf.score(X_test, y_test)

0.8120678818353237

In [None]:
df['cuisine'].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64
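
These counts give us a useful sanity check for the score above. A "classifier" that ignores the ingredients and always answers "not italian" is right for every non-Italian recipe, so its accuracy is simply the share of non-Italian recipes. A quick sketch using the counts printed above:

```python
# Counts taken from the value_counts() output above
n_total = 39774    # sum of all the cuisine counts
n_italian = 7838

# Accuracy of a "model" that always answers "not italian"
baseline = (n_total - n_italian) / n_total
print(round(baseline, 3))
```

Any score should be read against this baseline, not against zero. scikit-learn also ships `sklearn.dummy.DummyClassifier(strategy="most_frequent")` if you would rather compute the same baseline on your actual train/test split.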

In [None]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False


### Wow, we did a really great job! Let's try another cuisine

## Preparing our data

### Creating labels that scikit-learn can use

Our cuisine is Brazilian, so we'll use `0` and `1` to mark whether a recipe is that cuisine or not

In [None]:
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0

df['is_brazilian'] = df['cuisine'].apply(make_label)

In [None]:
df.head(2)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0


### Creating features that scikit-learn can use

It's Bernoulli Naive Bayes, so it's `True` and `False`

In [None]:
df['has_water'] = df['ingredient_list'].str.contains('water')
df['has_salt'] = df['ingredient_list'].str.contains('salt')

In [None]:
df.head(2)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian,has_water,has_salt
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0,False,True
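
A caution about `str.contains`: it looks for the text anywhere in the string, so `'water'` also matches "watermelon" and `'salt'` also matches "unsalted butter". Here is a sketch on two invented ingredient lists, comparing the plain match with a word-boundary regular expression (`\b`).

```python
import pandas as pd

toy = pd.DataFrame({"ingredient_list": [
    "water, salt",
    "watermelon, unsalted butter",
]})

# Plain substring match: also fires on "watermelon" and "unsalted"
loose_water = toy["ingredient_list"].str.contains("water")
loose_salt = toy["ingredient_list"].str.contains("salt")

# Word-boundary regex: only matches the standalone word
strict_water = toy["ingredient_list"].str.contains(r"\bwater\b", regex=True)
strict_salt = toy["ingredient_list"].str.contains(r"\bsalt\b", regex=True)

print(loose_water.tolist(), strict_water.tolist())
print(loose_salt.tolist(), strict_salt.tolist())
```

Whether the looser match is a problem depends on the ingredient; it's worth spot-checking a few rows either way.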


### Create the test/train split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['is_brazilian'], # the second parameter is the LABEL (this is 0/1, not brazilian/brazilian)
    test_size=0.2) # 80% training, 20% testing

### Create classifier, train and test

In [None]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Fit with our training data
clf.fit(X_train, y_train)

BernoulliNB()

In [None]:
clf.score(X_test, y_test)

0.9896920175989944
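
A score this high deserves some suspicion: Brazilian recipes are only 467 of the 39,774 rows, so a model that never says "brazilian" would also score close to 99%. Accuracy can't tell those two situations apart, but recall and a confusion matrix can. The sketch below uses invented toy labels and a pretend model that always predicts `0`; it is not the output of the classifier above.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 1 positive out of 10 (hypothetical), and a model that always says 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = recall_score(y_true, y_pred)  # share of the real positives that were found
matrix = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

print(accuracy)  # high, because most rows are negative
print(recall)    # zero, because the one positive was missed
print(matrix)
```

Running `confusion_matrix(y_test, clf.predict(X_test))` on the real split would show whether our classifier ever predicts `1` at all.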

## Let's fix up our labels

Before we had this:

    def make_label(cuisine):
        if cuisine == "brazilian":
            return 1
        else:
            return 0

which does not scale well. If we wanted to add in more different cuisines, we'd need to keep adding in else ifs again and again and again until our fingers fell off. And we'd probably misspell something. And if we're anything, it's LAZY.

## LabelEncoder to the rescue: Converts categories into numeric labels

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

In [None]:
# LabelEncoder has two parts: FIT and TRANSFORM
# FIT learns all of the possible labels
# TRANSFORM takes a list of categories and converts them into numbers

In [None]:
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates 
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

LabelEncoder()

In [None]:
# Get the labels out as numbers
le.transform(['orange', 'blue', 'yellow'])

array([1, 0, 3])
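
Where do those numbers come from? The encoder keeps the unique labels in sorted order in its `classes_` attribute, and each label's number is its position in that list. A small sketch, using a separate encoder named `le_demo` (a name made up here) so the `le` above is left alone:

```python
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()  # separate name so `le` is untouched
le_demo.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

# classes_ holds the unique labels in sorted order; the position is the number
print(list(le_demo.classes_))

# inverse_transform goes from numbers back to names
print(list(le_demo.inverse_transform([1, 0, 3])))
```

So the numbers are just alphabetical positions. They don't mean that "yellow" is somehow bigger than "blue".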

In [None]:
# Send the label encoder each and every cuisine
le.fit(df['cuisine'])

LabelEncoder()

In [None]:
le.transform(df['cuisine'])

array([ 6, 16,  4, ...,  8,  3, 13])

In [None]:
df['cuisine_label'] = le.transform(df['cuisine'])
df.head(3)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian,has_water,has_salt,cuisine_label
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0,False,False,6
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0,False,True,16
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False,0,False,True,4


## Let's train and test with our new labels

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19, one number per cuisine)
    test_size=0.2) # 80% training, 20% testing

In [None]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to water and salt
clf.fit(X_train, y_train)

BernoulliNB()

In [None]:
clf.score(X_test, y_test)

0.20188560653676932

## Let's add some more features to see if we can do a better job

Right now we're only looking at water and salt, which don't tell you much about a cuisine. Ingredients like tortillas or cumin or soy sauce tell you a lot more.

In [None]:
df['has_miso'] = df['ingredient_list'].str.contains("miso")
df['has_soy_sauce'] = df['ingredient_list'].str.contains("soy sauce")
df['has_cilantro'] = df['ingredient_list'].str.contains("cilantro")
df['has_black_olives'] = df['ingredient_list'].str.contains("black olives")
df['has_tortillas'] = df['ingredient_list'].str.contains("tortillas")
df['has_turmeric'] = df['ingredient_list'].str.contains("turmeric")
df['has_pistachios'] = df['ingredient_list'].str.contains("pistachios")
df['has_lemongrass'] = df['ingredient_list'].str.contains("lemongrass")

Our new feature set is!!! `df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']]`
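
Typing one line per ingredient gets old fast. A loop can build the same kind of columns and collect their names as it goes. This is a sketch on a toy DataFrame with invented rows; the `has_` naming just follows the pattern used above.

```python
import pandas as pd

toy = pd.DataFrame({"ingredient_list": [
    "miso, soy sauce, water",
    "tortillas, cilantro, salt",
]})

ingredients = ["miso", "soy sauce", "cilantro", "tortillas"]

feature_columns = []
for ingredient in ingredients:
    column = "has_" + ingredient.replace(" ", "_")
    # regex=False treats the ingredient as plain text, not a pattern
    toy[column] = toy["ingredient_list"].str.contains(ingredient, regex=False)
    feature_columns.append(column)

print(feature_columns)
print(toy[feature_columns])
```

The collected `feature_columns` list can then be passed straight to `train_test_split` instead of retyping every column name.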

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19, one number per cuisine)
    test_size=0.2) # 80% training, 20% testing

In [None]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to each of our new ingredient features
clf.fit(X_train, y_train)

BernoulliNB()

In [None]:
clf.score(X_test, y_test)

0.3714644877435575

## This is taking forever, please let there be an automatic way to pick out all of the words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# STEP ONE: .fit to learn all of the words
# STEP TWO: .transform to turn a sentence into numbers

#vectorizer = CountVectorizer()
# So now 'olive' and 'oil' and 'olive oil' instead of just 'olive' and 'oil'
# Only pick the top 3000 most frequent ngrams
vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)

In [None]:
# We have some sentences
# We're going to feed it to the vectorizer
# and it's going to learn all of the words
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)

CountVectorizer(max_features=3000, ngram_range=(1, 2))

In [None]:
# We're going to take some sentences and feed them to the vectorizer
# and it's going to convert them into numbers
vectorizer.transform(sentences)

<2x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [None]:
# But it looks bad to look at so I'll use .toarray()
vectorizer.transform(sentences).toarray()

array([[1, 1, 1, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 1]])

In [None]:
# In our case, our text is the list of ingredients. We can get it through
df['ingredient_list'].head()

0    romaine lettuce, black olives, grape tomatoes,...
1    plain flour, ground pepper, salt, tomatoes, gr...
2    eggs, pepper, salt, mayonaise, cooking oil, gr...
3                    water, vegetable oil, wheat, salt
4    black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object

In [None]:
# Dear vectorizer, please learn all of these words
vectorizer.fit(df['ingredient_list'])

CountVectorizer(max_features=3000, ngram_range=(1, 2))

In [None]:
# Dear vectorizer, please convert ingredient_list into features
# That we can do machine learning on

every_single_word_features = vectorizer.transform(df['ingredient_list'])
every_single_word_features

<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 1243216 stored elements in Compressed Sparse Row format>

### Now let's try with our new complete labels and our new complete features that includes every single word

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    every_single_word_features,
    df['cuisine_label'], # the second parameter is the LABEL (0-19, one number per cuisine)
    test_size=0.2) # 80% training, 20% testing

### This is Naive Bayes with every word as a feature pushed through the CountVectorizer

In [None]:
print("This is Naive Bayes")

from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
%time clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", clf.score(X_test, y_test))

This is Naive Bayes
CPU times: user 47 ms, sys: 1.82 ms, total: 48.9 ms
Wall time: 48.3 ms
Training score: (stuff it already knows) 0.7155787422609133
Testing score: (stuff it hasn't seen before): 0.6803268384663733
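
One detail worth knowing: above, the vectorizer learned its vocabulary from every recipe before we split, so it had already seen the words in the test recipes. A common way to keep the test data fully unseen is to split the raw text first and chain the vectorizer and classifier with `make_pipeline`. The sketch below uses a handful of invented recipes and labels, so the score it prints means nothing by itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy recipes and labels (hypothetical), standing in for the real columns
texts = ["spaghetti tomato basil", "soy sauce ginger", "pasta tomato garlic",
         "soy sauce rice", "spaghetti garlic basil", "ginger rice soy sauce"]
labels = ["italian", "chinese", "italian", "chinese", "italian", "chinese"]

# Split the raw text first, so the vectorizer never sees the test recipes
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=0, stratify=labels)

# The pipeline fits the vectorizer and the classifier on training text only
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), max_features=3000),
    BernoulliNB(),
)
model.fit(text_train, y_train)

test_score = model.score(text_test, y_test)
prediction = model.predict(["tomato basil pasta"])
print(test_score, prediction)
```

A nice side effect is that the fitted pipeline accepts plain strings, so new recipes don't need a separate `transform` step.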


### How do you do this in the real world with new data?

In [None]:
every_single_word_features = vectorizer.transform(df['ingredient_list'])


In [None]:
# Import the Naive bayes thing
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()

# Give the classifier EVERYTHING we know, not holding back anything
clf.fit(every_single_word_features, df['cuisine_label'])

# We have some new stuff we have not categorized
incoming_recipes = [
    "spaghetti tomato sauce garlic onion water",
    "soy sauce ginger sugar butter",
    "green papaya thai chilies palm sugar",
    "butter oil salt black pepper water milk bubblegumpie"
]

features_for_new_recipes = vectorizer.transform(incoming_recipes)
features_for_new_recipes

<4x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [None]:
predictions = clf.predict(features_for_new_recipes)
predictions

array([ 4, 11,  4, 16])

In [None]:
# The predictions are all categories that the labelencoder decided on
# Let's convert those numeric ones back into real fun cuisine words
le.inverse_transform(predictions)

array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
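
`predict` only gives the single best guess. If you also want to know how sure the classifier is, `predict_proba` returns one probability per class. This is a sketch with a tiny invented training set and separately named objects (`vec`, `clf_demo`), not the model trained above. Words the vectorizer never saw during `fit`, like "bubblegumpie", are simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy training data (hypothetical)
texts = ["spaghetti tomato basil", "pasta tomato garlic",
         "soy sauce ginger", "soy sauce rice"]
labels = ["italian", "italian", "chinese", "chinese"]

vec = CountVectorizer()
clf_demo = BernoulliNB().fit(vec.fit_transform(texts), labels)

new = vec.transform(["tomato garlic spaghetti", "bubblegumpie"])
probs = clf_demo.predict_proba(new)

# One row per recipe, one column per class, in the order of clf_demo.classes_
for row in probs:
    print(dict(zip(clf_demo.classes_, row.round(3))))
```

When the probabilities for the top classes are close together, the prediction is closer to a coin flip than a confident answer.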