# CLEAN & BALANCE THE DATA TO GET BETTER RESULTS
This dataset includes 385 columns that indicate which ingredients appear in recipes from a set of five cuisines (Thai, Indian, Korean, Japanese, Chinese). We will clean and balance this dataset to get better results.

Install imblearn (imbalanced-learn), a Scikit-learn-compatible package that provides SMOTE, a technique that helps handle imbalanced data when performing classification

In [None]:
pip install imblearn

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE

In [2]:
df = pd.read_csv('./cuisines.csv')

Previews the first five rows of the data

In [None]:
df.head()

Gets info about this data (column types and non-null counts)

In [None]:
df.info()

Discovers the distribution of the data, per cuisine

In [None]:
df.cuisine.value_counts()

Shows the cuisines in a bar graph

In [None]:
df.cuisine.value_counts().plot.barh()

Finds out how much data is available per cuisine and prints it 

In [None]:
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]

print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')

## LEARN ABOUT THE TYPICAL INGREDIENTS PER CUISINE & CLEAN RECURRENT DATA THAT CREATES CONFUSION BETWEEN CUISINES

create_ingredient_df() drops the non-ingredient columns ('cuisine' and 'Unnamed: 0') and sorts the ingredients by their count

In [None]:
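def create_ingredient_df(df):
    # Transpose so ingredients become rows, drop the non-ingredient rows, and sum usage per ingredient
    ingredient_df = df.T.drop(['cuisine', 'Unnamed: 0']).sum(axis=1).to_frame('value')
    # Keep only the ingredients that appear at least once
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    # Sort from most used to least used
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False)
    return ingredient_df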
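
To make the counting step concrete, here is a minimal sketch on a made-up three-recipe table. The data and the helper name `count_ingredients` are hypothetical, and the helper is an alternative formulation (it sums columns directly instead of transposing) that should give the same counts on 0/1 ingredient columns.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset): three thai recipes, three ingredients.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["thai", "thai", "thai"],
    "basil": [1, 0, 1],
    "lime": [1, 1, 1],
    "butter": [0, 0, 0],
})

def count_ingredients(df):
    # Keep only the ingredient columns, then count how many recipes use each one.
    counts = df.drop(columns=["cuisine", "Unnamed: 0"]).sum()
    # Discard ingredients that are never used and sort from most to least frequent.
    counts = counts[counts != 0].sort_values(ascending=False)
    return counts.to_frame("value")

toy_counts = count_ingredients(toy_df)
print(toy_counts)
```

In this toy table, `butter` is never used, so it is filtered out before sorting.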

Gets the top 10 most popular ingredients for each cuisine

In [None]:
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()

In [None]:
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()

In [None]:
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()

In [None]:
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()

In [None]:
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()

Drops the most common ingredients that create confusion between distinct cuisines

In [None]:
feature_df = df.drop(['cuisine', 'Unnamed: 0', 'rice', 'garlic', 'ginger'], axis=1)
labels_df = df.cuisine
feature_df.head()
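
One way to spot ingredients that blur the line between cuisines is to look for names that show up in every cuisine's top list. The sketch below assumes made-up top-ingredient lists purely for illustration; in the notebook, the real lists would come from the per-cuisine ingredient dataframes.

```python
# Hypothetical top-ingredient lists, invented for illustration only.
top_ingredients = {
    "thai": ["garlic", "rice", "lime", "ginger"],
    "japanese": ["soy_sauce", "ginger", "rice", "garlic"],
    "chinese": ["soy_sauce", "ginger", "garlic", "rice"],
    "indian": ["cumin", "garlic", "ginger", "rice"],
    "korean": ["garlic", "sesame_oil", "rice", "ginger"],
}

# Ingredients present in every list carry little information about the cuisine.
shared_ingredients = sorted(
    set.intersection(*(set(names) for names in top_ingredients.values()))
)
print(shared_ingredients)
```

Ingredients found this way are candidates for dropping; whether removing them actually helps is something to check against the classifier's results.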

## BALANCE THE DATASET USING SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE generates new samples by interpolating between existing samples of the under-represented classes. Balancing the data improves classification results: if most of our data belongs to one class, an ML model is going to predict that class more frequently simply because there is more data for it. Balancing the data removes this skew.
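
The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the imblearn implementation: the function name `interpolate_sample` and the two sample points are hypothetical, and real SMOTE also chooses the neighbor from among the nearest minority-class samples.

```python
import random

def interpolate_sample(sample, neighbor, rng):
    """Return a synthetic point on the line segment between sample and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority_sample = [1.0, 0.0, 2.0]    # hypothetical minority-class point
minority_neighbor = [3.0, 4.0, 2.0]  # hypothetical neighbor from the same class
synthetic = interpolate_sample(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Each coordinate of the synthetic point lies between the corresponding coordinates of the two originals, so the new sample stays inside the region already occupied by the minority class.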

In [None]:
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)

In [None]:
print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')

Combines the balanced labels and features into a new dataframe that can be exported into a file

In [None]:
transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')

Previews the balanced data, then saves it into a file that can be used in the future

In [None]:
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../data/cleaned_cuisines.csv")
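
A note on the `Unnamed: 0` column dropped earlier: pandas writes the row index as an unnamed first column by default, and that column gets this name when the file is read back. The sketch below shows this on a hypothetical two-row frame, using an in-memory buffer instead of a file; passing `index=False` is an optional alternative, not what the notebook does.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the balanced data.
toy_df = pd.DataFrame({"cuisine": ["thai", "indian"], "basil": [1, 0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy_df.to_csv()))

# Optional alternative: index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(toy_df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Either approach works as long as later code drops or ignores the extra column consistently.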