In this challenge we are given a dataset that contains a cuisine type, an ID number and a list of ingredients. This is only my second time attempting any machine learning, and my first time trying to create a pipeline that can understand text data. I learned a ton during this challenge, and welcome any feedback on how I can improve in the future!

First we just have to import all of the necessary Python modules to complete this challenge:

In [None]:
#Math and DataFrame stuff
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#basic computer stuff
import os
print(os.listdir("../input"))

#plotting stuff
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

# machine learning
from sklearn.feature_extraction.text import CountVectorizer



I read in both the test and train datasets using pandas:

In [None]:
train_df = pd.read_json('../input/train.json')
test_df = pd.read_json('../input/test.json')

To get a better understanding of the datasets I print the column names

In [None]:
print(train_df.columns.values)
print(test_df.columns.values)


So the training dataset contains the cuisine, but the test dataset only contains the ID number and ingredients.

I also check out the first 10 rows of the dataframe:

In [None]:
train_df.head(10)

The ingredients for each recipe are recorded in a list. This will be hard for us to understand and program around, so I am going to simply get rid of the list by joining its items into one comma-separated string. The pandas call is DataFrameName['ingredients'].apply(', '.join), where DataFrameName is whatever your dataframe is called.

In [None]:
train_df['ingredients'] = train_df['ingredients'].apply(', '.join)
test_df['ingredients'] = test_df['ingredients'].apply(', '.join)
combine = [train_df,test_df]
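
To see what that `apply(', '.join)` step does, here is a minimal sketch on a tiny made-up dataframe. The name `toy_df` and the recipes in it are assumptions for illustration only, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for the real data; these recipes are made up for illustration.
toy_df = pd.DataFrame({
    "id": [1, 2],
    "ingredients": [["garlic", "olive oil", "salt"], ["butter", "flour"]],
})

# Series.apply calls ', '.join once per row, turning each list into one string.
toy_df["ingredients"] = toy_df["ingredients"].apply(", ".join)

print(toy_df["ingredients"].tolist())
```

One thing to keep in mind with this approach: an ingredient name that itself contains a comma would later be split into two pieces.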

In [None]:
train_df.head(10)

In [None]:
test_df.head(10)

What is the distribution of cuisines? Let's plot it and find out:


In [None]:
plt.figure(figsize=(8, 5))  # set the size before drawing so it applies to this plot
sns.countplot(y='cuisine', data=train_df)


So clearly Italian and Mexican dominate the distribution. That most likely means we'll be seeing a lot of garlic and oil!

I really want the machine learning program to understand the individual ingredients, not the list for the entire recipe. So I can write a simple for loop that separates the ingredients. I will change this to a lambda function when I use it in the CountVectorizer from sklearn, but I always check that my for loop does what I want it to before I commit it to a lambda function. This helps me debug my code more easily.

In [None]:
common_ing = []
for x in np.arange(len(train_df['cuisine'])):
    for i in train_df['ingredients'][x].split(','):
        common_ing.append(i.strip())
common_ing = pd.DataFrame(common_ing, columns=['common_ing'])
          

What are the 10 most common ingredients?

In [None]:
common_ing['common_ing'].value_counts().head(10)


Yay! I was totally right about the garlic and olive oil! :)

So that worked! I now have a dataframe called 'common_ing' with all of the ingredients in the train_df dataframe. I feel confident turning this into a lambda function.
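
As a small sketch of that step, the body of the inner loop can also be written as a named function instead of a lambda. The name `split_ingredients` is my own choice here, not something from sklearn or pandas.

```python
def split_ingredients(text):
    """Split a comma-joined ingredient string back into clean ingredient names."""
    return [part.strip() for part in text.split(",")]


# Made-up sample string in the same format as the 'ingredients' column.
sample = "garlic, olive oil, salt"
print(split_ingredients(sample))
```

A named function does the same job as the lambda and is easier to test on its own. It also avoids a known limitation: a pipeline that holds a lambda cannot be saved with the standard `pickle` module.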

Here I start following the "[Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)" tutorial on sklearn, and follow it pretty consistently.

From what I can understand, CountVectorizer builds a vocabulary that maps each distinct token (here, each ingredient) to a column index. Each recipe then becomes a row in a sparse matrix, and the value in a column is the number of times that ingredient appears in the recipe. If someone can explain this better in layman's terms, it would be much appreciated!

In [None]:
count_vec = CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)
X_train_counts = count_vec.fit_transform(train_df['ingredients']) 
X_train_counts.shape
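
Here is a minimal sketch of what CountVectorizer produces, using two made-up recipes. The names `toy_recipes`, `toy_vec`, `toy_counts` and `split_ingredients` are assumptions for illustration. Passing `token_pattern=None` is only there to signal that the custom tokenizer replaces the default pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_ingredients(text):
    # Same logic as the lambda tokenizer: split on commas, trim whitespace.
    return [part.strip() for part in text.split(",")]


# Two made-up recipes; "salt" appears twice in the second one on purpose.
toy_recipes = ["garlic, olive oil, salt", "salt, butter, salt"]

toy_vec = CountVectorizer(tokenizer=split_ingredients, lowercase=False, token_pattern=None)
toy_counts = toy_vec.fit_transform(toy_recipes)

# vocabulary_ maps each ingredient to its column index.
print(toy_vec.vocabulary_)
# Each row is one recipe, each column one ingredient, each value a count.
print(toy_counts.toarray())
```

The vocabulary is indexed in sorted order, so the columns here are butter, garlic, olive oil and salt, and the second row has a 2 in the salt column.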

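TfidfTransformer rescales the raw counts. The idea behind term frequency is to divide the number of occurrences of each word in a document by the total number of words in the document. This is supposed to be better for longer documents where the count of a particular word is not as insightful as the frequency. Note that with use_idf=False, as below, scikit-learn's default setting (norm='l2') scales each row to unit Euclidean length rather than dividing by the word total.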

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
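
To make the two kinds of scaling concrete, here is a small hand-worked sketch in plain Python. The counts are made up. As far as I understand the scikit-learn options, `tf_l1` corresponds to `norm='l1'` (divide by the total) and `tf_l2` to the default `norm='l2'`.

```python
import math

# Made-up counts for one recipe: one row of a count matrix.
counts = [1, 0, 0, 2]

# "Divide by the total number of words": the values sum to 1.
total = sum(counts)
tf_l1 = [c / total for c in counts]

# Unit Euclidean length: the squared values sum to 1.
l2_norm = math.sqrt(sum(c * c for c in counts))
tf_l2 = [c / l2_norm for c in counts]

print(tf_l1)
print(tf_l2)
```

Either way, the relative weights of the ingredients within one recipe stay the same. Only the overall scale of the row changes.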

We can create two different pipelines, one with a TfidfTransformer and one without, and compare them to see what works best. We will be using a linear support vector machine (SVM), trained here with SGDClassifier and hinge loss, which is supposed to be one of the best models for text.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Pipeline 1: raw ingredient counts fed straight into a linear SVM
text_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

# Pipeline 2: same as above, with a TF-IDF step between the counts and the SVM
text_tdif_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])


In [None]:
text_tdif_clf.fit(train_df['ingredients'], train_df['cuisine']) 

In [None]:
predicted = text_tdif_clf.predict(train_df['ingredients'])
np.mean(predicted == train_df['cuisine'])

Including TF-IDF we have an accuracy of 72% on the training data. Not the best! Let's see where it got confused.

In [None]:
from sklearn import metrics

In [None]:
print(metrics.classification_report(train_df['cuisine'],predicted))

In [None]:
cm = metrics.confusion_matrix(train_df['cuisine'],predicted)

In [None]:
cm.shape

In [None]:
legend = ['brazilian','british','cajun_creole','chinese','filipino','french','greek','indian','irish','italian','jamaican','japanese','korean','mexican','moroccan','russian','southern_us','spanish','thai','vietnamese']

In [None]:
df_cm = pd.DataFrame(cm,index = legend,columns=legend)

In [None]:
sns.set(font_scale=1.4, rc={'figure.figsize': (15, 15)})  # apply the settings before creating the figure
plt.figure()
sns.heatmap(df_cm, annot=True, linewidths=.5, fmt='d', cmap='viridis', cbar=False).set_title('Confusion Matrix With TF-IDF')
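
For reading the heatmap: in sklearn's confusion_matrix the rows are the true labels and the columns are the predicted labels. When no label list is passed, the labels are in sorted order. Here is a minimal hand-rolled sketch of the same bookkeeping on made-up labels. The names `y_true`, `y_pred`, `matrix` and `mistakes` are assumptions for illustration.

```python
from collections import Counter

# Made-up true and predicted cuisines for six recipes.
y_true = ["italian", "italian", "french", "french", "mexican", "italian"]
y_pred = ["italian", "french", "italian", "french", "mexican", "french"]

# Sorted label order, matching what sklearn does when no labels are passed.
labels = sorted(set(y_true) | set(y_pred))

# Count how often each (true, predicted) pair occurs.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Off-diagonal pairs are the mistakes; sort so the most frequent comes first.
mistakes = sorted(
    ((n, t, p) for (t, p), n in pair_counts.items() if t != p),
    reverse=True,
)

print(labels)
print(matrix)
print(mistakes[0])
```

Sorting the off-diagonal entries like this is a quick way to list the most confused pairs without squinting at a 20 by 20 heatmap.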


The pipeline got most confused between Italian, French, and Southern US cooking. Now let's try without TF-IDF to see if the accuracy gets better or worse:

In [None]:
text_clf.fit(train_df['ingredients'], train_df['cuisine']) 

In [None]:
predicted = text_clf.predict(train_df['ingredients'])
np.mean(predicted == train_df['cuisine'])

Without TF-IDF we went from 72% to 79% (again measured on the training data)! In this case it seems the raw counts work better than the TF-IDF weighted values.

In [None]:
print(metrics.classification_report(train_df['cuisine'],predicted))
cm = metrics.confusion_matrix(train_df['cuisine'],predicted)

In [None]:
df_cm = pd.DataFrame(cm,index = legend,columns=legend)
sns.set(font_scale=1.4, rc={'figure.figsize': (15, 15)})  # apply the settings before creating the figure
plt.figure()
sns.heatmap(df_cm, annot=True, linewidths=.5, fmt='d', cmap='viridis', cbar=False).set_title('Confusion Matrix Without TF-IDF')


Just using CountVectorizer actually worked better than adding TF-IDF! We obtained an accuracy of 79%, with most of the confusion *still* happening between French, Italian and Southern US, but there was less confusion with CountVectorizer alone.

Let's submit this guy!
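
One caveat on those numbers: the accuracies were computed on the same rows the pipelines were trained on, so they probably overstate how well the model does on unseen recipes. A hedged sketch of a hold-out check follows, using made-up toy data. The names `recipes`, `cuisines`, `split_ingredients` and `toy_clf` are assumptions for illustration, and the toy scores mean nothing by themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def split_ingredients(text):
    # Same logic as the lambda tokenizer: split on commas, trim whitespace.
    return [part.strip() for part in text.split(",")]


# Made-up toy data standing in for train_df['ingredients'] and train_df['cuisine'].
recipes = [
    "garlic, olive oil, tomato", "olive oil, basil, tomato",
    "garlic, basil, parmesan", "olive oil, parmesan, pasta",
    "tortilla, salsa, cilantro", "avocado, lime, cilantro",
    "tortilla, avocado, salsa", "lime, salsa, black beans",
]
cuisines = ["italian"] * 4 + ["mexican"] * 4

# Hold back a quarter of the rows; stratify keeps both cuisines in each part.
X_fit, X_val, y_fit, y_val = train_test_split(
    recipes, cuisines, test_size=0.25, random_state=42, stratify=cuisines
)

toy_clf = Pipeline([
    ("vect", CountVectorizer(tokenizer=split_ingredients, lowercase=False, token_pattern=None)),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])
toy_clf.fit(X_fit, y_fit)

# Pipeline.score returns accuracy for classifiers.
train_acc = toy_clf.score(X_fit, y_fit)
val_acc = toy_clf.score(X_val, y_val)
print(train_acc, val_acc)
```

In the notebook itself, the same split could be applied to `train_df['ingredients']` and `train_df['cuisine']` before fitting, so the two pipelines are compared on rows neither of them has seen.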

In [None]:
sub = pd.read_csv('../input/sample_submission.csv')

In [None]:
sub.head()

In [None]:
final_predicted = text_clf.predict(test_df['ingredients'])


In [None]:
predictions = pd.DataFrame({'cuisine' : final_predicted , 'id' : test_df.id })
predictions = predictions[[ 'id' , 'cuisine']]

In [None]:
predictions.to_csv('submit.csv', index = False)