# What's Cooking?

#### Before we start on the problem itself, there are some questions we need to answer:
1. What is the business question?
2. What does each row represent?
3. What is the evaluation method?

#### For this problem (and all Kaggle problems), the answers to these questions are always on the problem's overview page.
1. What is the category of a dish's cuisine given a list of its ingredients? (Supervised ML Problem)
2. Each row represents a recipe.
3. Submissions are evaluated on the categorization accuracy (the percent of dishes that you correctly classify).
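
To make the metric concrete, here is a minimal sketch of categorization accuracy on made-up labels. The helper name `categorization_accuracy` and the label values are assumptions for illustration only; scikit-learn's `accuracy_score` computes the same fraction.

```python
def categorization_accuracy(y_true, y_pred):
    """Fraction of dishes whose predicted cuisine equals the true cuisine."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels, for illustration only
y_true = ["italian", "mexican", "indian", "greek"]
y_pred = ["italian", "mexican", "thai", "greek"]

score = categorization_accuracy(y_true, y_pred)
print(f"accuracy = {score:.2f}")  # 3 of the 4 dishes are classified correctly
```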

# 1. Important imports
### Let's start by importing the libraries we need.

In [None]:
# load data libraries
import numpy as np # linear algebra library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import zipfile # to read zip files
from sklearn.model_selection import train_test_split


# data understanding libraries
import matplotlib.pyplot as plt # plotting library
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import Counter


# data preparation
import re
from nltk.stem import PorterStemmer


# ADS Creation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Modeling
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB

# Evaluation and Model Selection
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn import metrics
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV

In [None]:
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', None)  # no truncation; recent pandas versions no longer accept -1
pd.set_option('display.precision',150)
pd.options.display.float_format = '{:,.3f}'.format

# 2. Load Data
### Let's load the data and have a look at it.
1. The data is provided in a zip file, so we need to unzip it first using the zipfile library.
2. The training/testing files are available in JSON format; to read them we use the pd.read_json function.
   We read the data into a pandas DataFrame, which is a 2-dimensional labeled data structure with columns of
   potentially different types. You can think of it like a spreadsheet or SQL table.
3. To view some rows of the DataFrame we use the df_name.head() method, which outputs the first 5 rows of the DataFrame.
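
As a rough illustration of the steps above, the sketch below builds a tiny in-memory archive and reads one JSON member back using only the standard library. The two records are invented, and the member name `train.json` simply mirrors the competition file; the same file object could instead be handed to `pd.read_json`.

```python
import io
import json
import zipfile

# Build a tiny in-memory archive that mimics the layout described above
# (both records are invented for illustration).
records = [
    {"id": 1, "cuisine": "greek", "ingredients": ["feta cheese", "olive oil"]},
    {"id": 2, "cuisine": "indian", "ingredients": ["garam masala", "onions"]},
]
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.json", json.dumps(records))

# Open the archive, then parse the JSON member inside it.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    with archive.open("train.json") as member:
        rows = json.load(member)

# Each row is one recipe with an id, a cuisine label and a list of ingredients.
print(len(rows), sorted(rows[0].keys()))
```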

In [None]:
# open the zip archive
archive_train = zipfile.ZipFile('/kaggle/input/whats-cooking/train.json.zip')

# read the training JSON file from inside the archive (a file-like object works across pandas versions)
train = pd.read_json(archive_train.open('train.json'))

# output the first 5 rows
train.head()

> There are only 3 columns: id, cuisine and ingredients

In [None]:
train_data, test_data = train_test_split(train, test_size=0.4, random_state=1)

train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

In [None]:
print("Train set size is ",len(train_data))
print("Test set size is ",len(test_data))

# 3. Data Preparation

## 3.1 Data Cleansing
### First let's have another look at the ingredients text.

In [None]:
# join each recipe's ingredient list into one string
train_data['ingredients_txt'] = train_data['ingredients'].apply(' , '.join)

In [None]:
ingredients = pd.Series((','.join([','.join(row["ingredients"]) for ind,row in train_data.iterrows()])).split(','))
words = pd.Series(' '.join(ingredients).split())

In [None]:
len(set(words))

In [None]:
train_data['ingredients_txt'].sample(150)

In [None]:
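# join recipes with ',' (not ' ') so the last ingredient of one recipe is not glued to the first of the next
ingredients = pd.Series((','.join([','.join(row["ingredients"]) for ind,row in train_data.iterrows()])).split(','))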

In [None]:
pd.Series([s for s in ingredients if "-" in s]).unique()

In [None]:
pd.Series([s for s in ingredients if any(char.isdigit() for char in s)]).unique()

In [None]:
pd.Series([s for s in ingredients if "®" in s]).unique()

In [None]:
pd.Series([s for s in ingredients if "'" in s]).unique()

In [None]:
pd.Series([s for s in words if re.findall('[^a-zA-Z]',re.sub(r'[^\w\s]','',s))]).unique()

In [None]:
pd.Series([s for s in ingredients if " oz" in s]).unique()

What needs to be cleaned?
- lower and upper case data
- punctuation
- dashed data
- numbers
- non-English characters
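
The checklist above can be sketched as a small helper that reports which of these problems a single ingredient string has. The function name `cleaning_issues` and the sample strings are assumptions made for illustration; this is not the cleaning function used in this notebook.

```python
import re

def cleaning_issues(ingredient):
    """Return the names of the cleaning problems found in one ingredient string."""
    issues = []
    if ingredient != ingredient.lower():
        issues.append("upper case")
    if "-" in ingredient:
        issues.append("dash")
    if any(ch.isdigit() for ch in ingredient):
        issues.append("number")
    # Punctuation: anything that is not a word character, whitespace or a dash
    if re.search(r"[^\w\s-]", ingredient):
        issues.append("punctuation")
    # Letters outside a-z / A-Z, such as accented characters
    if re.search(r"[^\W\d_a-zA-Z]", ingredient):
        issues.append("non-English character")
    return issues

# Invented sample strings
for sample in ["olive oil", "Sea Salt", "low-fat milk", "2% reduced-fat milk", "crème fraîche"]:
    print(repr(sample), cleaning_issues(sample))
```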

In [None]:
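# Brand names and units to drop. ret_words compares lower-cased words with
# apostrophes removed, so the entries are normalised the same way here.
# Multi-word entries cannot match a single word; they are kept for reference.
stopwords = {w.lower().replace("'", "") for w in
             ["Campbell's", "hellmann", "oz", "M&M", "Paso™", "I Can't Believe It's Not Butter!®"]}
porter = PorterStemmer()
# lancaster=LancasterStemmer()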

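def ret_words(ingredients):
    """Turn one recipe's list of ingredients into a string of cleaned, stemmed words."""
    ingredients_text = ' '.join(ingredients)
    ingredients_text = ingredients_text.lower()
    ingredients_text = ingredients_text.replace('-', '')    # "low-fat" becomes "lowfat"
    ingredients_text = ingredients_text.replace(',', ' ')
    ingredients_text = ingredients_text.replace('\'', '')   # drop apostrophes
    words = []
    for word in ingredients_text.split():
        if re.findall('[0-9]', word): continue   # skip words containing digits
        if len(word) <= 2: continue              # skip very short words (e.g. "oz")
        if '®' in word: continue                 # skip registered brand names
        if word in stopwords: continue
        # strip punctuation, then skip words that still contain characters outside a-z
        stripped = re.sub(r'[^\w\s]','',word)
        if re.findall('[^a-zA-Z]',stripped): continue
        # skip words that were punctuation only, so no empty tokens are added
        if len(stripped) > 0: words.append(porter.stem(stripped))
    return ' '.join(words)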

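def preprocess(df,flag):
    """Add the ingredient count and the cleaned text; flag == 0 marks training data."""
    # add column
    df["ingredients_num"]=df["ingredients"].apply(len)
    
    # Remove recipes with one ingredient or fewer (training data only)
    if flag == 0 :
        # reset the index so row positions stay aligned with the feature matrices built later
        df = df.drop(df[df["ingredients_num"]<=1].index).reset_index(drop=True)
    
    # Convert list of ingredients to a cleaned string
    df['ingredients_txt'] = df["ingredients"].apply(ret_words)
    
    return df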

In [None]:
train_preprocessed = preprocess(train_data,0)
test_preprocessed = preprocess(test_data,1)

In [None]:
train_preprocessed.head(10)

In [None]:
len(set(pd.Series(' '.join([row["ingredients_txt"] for ind,row in train_preprocessed.iterrows()]).split(' '))))

### Separate the data

In [None]:
id_train, X_train, y_train = train_preprocessed['id'], train_preprocessed['ingredients_txt'], train_preprocessed['cuisine']
id_test, X_test, y_test = test_preprocessed['id'], test_preprocessed['ingredients_txt'], test_preprocessed['cuisine']

## ADS Creation

In [None]:
# BoW
BoW = CountVectorizer()

BoW.fit(X_train)
Count_data = BoW.transform(X_train)

# get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
BoW_X_train = pd.DataFrame(Count_data.toarray(),columns=BoW.get_feature_names_out())

BoW_X_train

In [None]:
X_train.head()

In [None]:
BoW.fit(X_train.head())
Count_data = BoW.transform(X_train.head())
BoW_X_train = pd.DataFrame(Count_data.toarray(),columns=BoW.get_feature_names_out())
BoW_X_train
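
To see what the bag-of-words table holds, here is a hand-rolled sketch of the counting idea on two invented, already-cleaned recipes. It is not scikit-learn's implementation: `CountVectorizer` also lower-cases the text and, by default, ignores single-character tokens. The helper name `bag_of_words` is an assumption.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in documents]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Invented, already-cleaned recipes
docs = ["oliv oil garlic tomato", "garlic ginger garlic soy"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one recipe and each column is one word, so a cell is simply how many times that word occurs in that recipe.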

In [None]:
# TFIDF
TFIDF = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.25, norm='l2', encoding='latin-1',
                ngram_range=(1, 2), stop_words='english')

TFIDF.fit(X_train)
Count_data = TFIDF.transform(X_train)
# get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
TFIDF_X_train = pd.DataFrame(Count_data.toarray(),columns=TFIDF.get_feature_names_out())


TFIDF_X_train

In [None]:
X_train.head(5)

In [None]:
TFIDF = TfidfVectorizer()
TFIDF.fit(X_train.head(5))
Count_data = TFIDF.transform(X_train.head(5))
TFIDF_X_train = pd.DataFrame(Count_data.toarray(),columns=TFIDF.get_feature_names_out())


TFIDF_X_train
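
To make the numbers in the TF-IDF table less mysterious, the sketch below recomputes the weighting by hand on three invented recipes. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and plain whitespace tokens; the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed idf and L2-normalised rows, for whitespace tokens."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenised for t in tokens})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(t for tokens in tokenised for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, idf, rows

# Invented, already-cleaned recipes
docs = ["garlic oliv oil", "garlic ginger soy", "garlic tomato"]
vocabulary, idf, rows = tfidf(docs)

for term in vocabulary:
    print(term, round(idf[term], 3))
```

A word that appears in every recipe (here `garlic`) gets the lowest idf, so it contributes less to each row than rarer words do.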