# Naive Bayes, meet Naive Humans

> This is a story of how you can be led astray by machine learning, courtesy me being pissy about [FOIA Predictor](https://datadotworld.shinyapps.io/foia_shiny_app/), which was inspired by [some work by CJS students](https://www.cjr.org/analysis/foia-request-how-to-study.php)

> Also let's talk about the [fake news challenge code](https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/feature_engineering.py)

I love cooking, but I hate actually reading a recipe to see what cuisine it is. I only cook... Italian food, let's say. If only there were a machine that could process the recipe for me!

"Soma, Soma!" you exclaim. "We just learned about **Naive Bayes**, I bet you can use it to automatically classify recipes!"

Okay, cool, I just trust machines to accomplish anything: **let's do it!**

## Step 1: Preparing our data

### Step 1.1: Read in our data

This time it's just a CSV.

In [1]:
import pandas as pd

df = pd.read_csv("recipes.csv")
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


### Step 1.2: Creating a label column

It needs to be a number, right? Let's say everything with a cuisine of "italian" is going to be `1` and everything with another cuisine is going to be `0`.

In [3]:
df['is_italian'] = (df['cuisine'] == 'italian').astype(int)

In [5]:
df[df.is_italian == 1].head(4)

Unnamed: 0,cuisine,id,ingredient_list,is_italian
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1
10,italian,5875,"pimentos, sweet pepper, dried oregano, olive o...",1
12,italian,2698,"Italian parsley leaves, walnuts, hot red peppe...",1


### Step 1.3: Create our features dataframe

I'm going to predict whether a recipe is Italian based on **only a few ingredients**, because I know something about cooking.

What are a few ingredients that are very much about Italian food (or very much not)?

In [6]:
df.ingredient_list.str.contains("tomato").astype(int)

0        1
1        1
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        1
10       0
11       0
12       0
13       1
14       0
15       1
16       0
17       0
18       0
19       0
20       0
21       1
22       0
23       0
24       0
25       0
26       1
27       0
28       0
29       0
        ..
39744    0
39745    0
39746    1
39747    0
39748    0
39749    1
39750    0
39751    0
39752    0
39753    0
39754    0
39755    1
39756    0
39757    0
39758    0
39759    0
39760    0
39761    0
39762    0
39763    0
39764    1
39765    0
39766    0
39767    0
39768    0
39769    0
39770    0
39771    0
39772    0
39773    1
Name: ingredient_list, Length: 39774, dtype: int64
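
One thing worth knowing: `str.contains` looks for the text anywhere in the string, so "tomato" also catches "grape tomatoes" and "tomato paste". Here's a tiny sketch on a made-up series (the three recipes are invented for illustration, not rows from `recipes.csv`):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for df.ingredient_list
toy = pd.Series([
    "romaine lettuce, grape tomatoes, feta",
    "tomato paste, garlic, olive oil",
    "soy sauce, rice, eggs",
])

# Substring match: "tomato" is found inside "tomatoes" and "tomato paste".
# regex=False treats the pattern as plain text instead of a regular expression.
has_tomato = toy.str.contains("tomato", regex=False).astype(int)
print(has_tomato.tolist())  # [1, 1, 0]
```

That's usually what you want here, but it also means a short pattern can match ingredients you didn't intend.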

In [8]:
features_df = pd.DataFrame({
#    'has_something': df.ingredient_list.blah blah blah
    'has_tomatoes': df.ingredient_list.str.contains("tomato").astype(int),
    'has_olive_oil': df.ingredient_list.str.contains("olive oil").astype(int),
    'has_soy_sauce': df.ingredient_list.str.contains("soy sauce").astype(int)
})
features_df.head(3)

Unnamed: 0,has_olive_oil,has_soy_sauce,has_tomatoes
0,0,0,1
1,0,0,1
2,0,1,0


## Step 2: Using the classifier

### Step 2.1: Import the classifier

What kind of Naive Bayes classifier are we going to use?

In [10]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
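
Our features are all yes/no columns (0 or 1), which is the case `BernoulliNB` is built for. A minimal sketch of what it does, using six made-up recipes rather than the real data (the column order and the rows are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical columns: has_tomatoes, has_olive_oil, has_soy_sauce
X_toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
])
# 1 = Italian, 0 = not Italian (made-up labels)
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_clf = BernoulliNB()
toy_clf.fit(X_toy, y_toy)

# Tomatoes + olive oil should look Italian; soy sauce alone should not
preds = toy_clf.predict(np.array([[1, 1, 0], [0, 0, 1]]))
print(preds)  # [1 0]
```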

### Step 2.2: Split our data into test and train data

In [12]:
# train_test_split will split our data into two parts
from sklearn.model_selection import train_test_split

# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values, 
    df.is_italian, 
    test_size=0.2) 

# the first parameter is our FEATURES, passed as features_df.values (a NumPy array)
# the second parameter is the LABEL as a number (so 0/1, not neg/pos)
# 80% training, 20% testing
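
Every run of `train_test_split` shuffles differently, so the scores wobble a little from run to run. Here is a hedged sketch, on made-up labels, of pinning the shuffle with `random_state` and keeping the class balance the same in both halves with `stratify`. Neither option is used in the cell above, and the toy arrays are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20 of them positive (roughly the Italian share)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,     # 80% training, 20% testing
    random_state=42,   # same shuffle every time you run it
    stratify=y_toy,    # keep the 20/80 label balance in both splits
)

print(len(X_tr), len(X_te))    # 80 20
print(y_tr.sum(), y_te.sum())  # 16 4
```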

### Step 2.3: Train the classifier

In [13]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Step 3: Testing our classifier

Let's test it against the test data!

In [15]:
clf.score(X_test, y_test)

0.79057196731615331

And how about the training data?

In [16]:
clf.score(X_train, y_train)

0.7897168358527924

# Now it's your turn!

At each table, you'll need to

* Pick a cuisine (or two, if you'd like!)
* Create another label column based on that cuisine
* Pick two or three or four or five or more ingredients you think are representative of your cuisine selection
* Create another features dataframe using those ingredients
* Train and test a classifier

Remember that your ingredients can signal the **presence of the cuisine** or the **absence of a cuisine** - if I was doing Japanese food, "miso" and "cheese" would be good options because they'd point firmly in one direction or the other - "YES this is Japanese" and "NO this is not Japanese."
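
If you want to see which direction an ingredient points, a fitted `BernoulliNB` keeps the per-class probability of each feature being present in `feature_log_prob_`. A small sketch with invented "miso" and "cheese" columns (the six rows and labels are made up, not taken from the recipes data):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical columns: has_miso, has_cheese
X_toy = np.array([
    [1, 0],
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 1],
    [0, 0],
])
# 1 = Japanese, 0 = not Japanese (made-up labels)
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_clf = BernoulliNB()
toy_clf.fit(X_toy, y_toy)

# Rows follow toy_clf.classes_ (0 then 1), columns follow the feature order.
# Each entry is the smoothed P(ingredient present | class).
probs = np.exp(toy_clf.feature_log_prob_)
print(toy_clf.classes_)
print(np.round(probs, 2))
```

In this toy data miso is more likely in the Japanese row and cheese is more likely in the not-Japanese row, which is exactly the "YES"/"NO" split you're hoping for.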

In [17]:
df.cuisine.unique()

array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican', 'spanish',
       'italian', 'mexican', 'chinese', 'british', 'thai', 'vietnamese',
       'cajun_creole', 'brazilian', 'french', 'japanese', 'irish',
       'korean', 'moroccan', 'russian'], dtype=object)

In [21]:
df['is_thai'] = (df['cuisine'] == 'thai').astype(int)
df[df.is_thai == 1].head()

Unnamed: 0,cuisine,id,ingredient_list,is_italian,is_spanish,is_thai
18,thai,2941,"sugar, hot chili, asian fish sauce, lime juice",0,0,1
20,thai,13121,"pork loin, roasted peanuts, chopped cilantro f...",0,0,1
33,thai,33465,"eggs, shallots, firm tofu, beansprouts, turnip...",0,0,1
60,thai,38233,"sugar, chicken thighs, cooking oil, fish sauce...",0,0,1
61,thai,39267,"lemongrass, large garlic cloves, rice, unsweet...",0,0,1


In [24]:
df[df.is_thai == 1].ingredient_list.value_counts()

sweet chili sauce, egg whites, salt, corn starch, lime juice, baking powder, all-purpose flour, water, boneless skinless chicken breasts, cilantro leaves, sesame, cooking oil, garlic, oil                                                                                                                                                   2
sugar, reduced sodium soy sauce, freshly ground pepper, fresh lime juice, kosher salt, lime wedges, garlic cloves, fresh basil leaves, fish sauce, steamed rice, vegetable oil, carrots, red chili peppers, low sodium chicken broth, scallions, ground beef                                                                                  2
sweet chili sauce, garlic, onions, fish sauce, lime juice, oil, roasted cashews, ground chicken, cilantro leaves, butter lettuce, sweet soy sauce, white sesame seeds                                                                                                                                                                   

In [37]:
features_df2 = pd.DataFrame({
#    'has_something': df.ingredient_list.blah blah blah
    'has_hot_chilli': df.ingredient_list.str.contains("hot chilli").astype(int),
    'has_lime': df.ingredient_list.str.contains("lime").astype(int),
    'has_cilantro': df.ingredient_list.str.contains("cilantro").astype(int),
    'has_garlic': df.ingredient_list.str.contains("garlic").astype(int)
})
features_df2.head(3)

Unnamed: 0,has_cilantro,has_garlic,has_hot_chilli,has_lime
0,0,1,0,0
1,0,0,0,0
2,0,1,0,0


In [38]:
# train_test_split will split our data into two parts
from sklearn.model_selection import train_test_split

# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    features_df2.values, 
    df.is_thai, 
    test_size=0.2) 

# the first parameter is our FEATURES, passed as features_df2.values (a NumPy array)
# the second parameter is the LABEL as a number (so 0/1, not neg/pos)
# 80% training, 20% testing

In [39]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [40]:
clf.score(X_test, y_test)

0.93438089252042744

In [41]:
clf.score(X_train, y_train)

0.94016153870329044

In [42]:
df.is_thai.value_counts(normalize=True)

0    0.961306
1    0.038694
Name: is_thai, dtype: float64

In [43]:
df.is_italian.value_counts(normalize=True)

0    0.802937
1    0.197063
Name: is_italian, dtype: float64

In [46]:
df.is_thai.value_counts()

0    38235
1     1539
Name: is_thai, dtype: int64

In [48]:
# Dummy classifier: ignores the features and always predicts the same class.
# With strategy='constant' and constant=1 it always guesses "Thai" (the rare class).

from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='constant', constant=1)

In [49]:
clf.fit(X_train, y_train)

DummyClassifier(constant=1, random_state=None, strategy='constant')

In [50]:
clf.score(X_test, y_test)

0.039094908862350723
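
The flip side is the scary part: a dummy that always guesses the *most common* class gets a great-looking score without learning anything. A hedged sketch on made-up labels with roughly the Thai balance (4 positives out of 100). It uses `strategy="most_frequent"`, which is a different strategy from the `constant` one in the cell above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 4 "Thai" recipes out of 100
y_toy = np.array([1] * 4 + [0] * 96)
X_toy = np.zeros((100, 1))  # DummyClassifier never looks at the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)  # share of all guesses that are right
rec = recall_score(y_toy, pred)    # share of Thai recipes it actually found
print(acc)  # 0.96
print(rec)  # 0.0
```

So before getting excited about a score, compare it to what you'd get by never predicting the cuisine at all.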

## Label Encoders: What if we want more than `is_italian`?

A **LabelEncoder** will convert labels to numbers for you.

It has two parts: **fit** and **transform**.

* **fit** learns all of the possible labels
* **transform** takes a list of categories and converts them into numbers

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

In [None]:
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates 
# le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

In [None]:
# Get the labels out as numbers
# le.transform([])
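
For reference, here is a minimal runnable version of the color example from the commented-out cells above. The list passed to `transform` is my own pick for illustration:

```python
from sklearn import preprocessing

color_le = preprocessing.LabelEncoder()

# fit learns the unique labels; duplicates are fine, and the classes are stored sorted
color_le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])
print(list(color_le.classes_))  # ['blue', 'orange', 'red', 'yellow']

# transform maps each label to its position in classes_
codes = color_le.transform(['red', 'blue', 'yellow'])
print(list(codes))  # [2, 0, 3]

# inverse_transform goes back from numbers to labels
names = color_le.inverse_transform([0, 2])
print(list(names))  # ['blue', 'red']
```

Note the numbers are just alphabetical positions. They don't mean "yellow" is bigger than "blue" in any useful sense.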

In [None]:
df.cuisine.head(10)

In [None]:
# Send the label encoder each and every cuisine


In [None]:
# What does it give back when .transform-d?

In [None]:
# Add it back into the dataframe as cuisine_label


In [None]:
# Check value_counts of each to see if they match