# RecipeClassification


## Identifying Which Feature Is Best for Classifying the Recipe Dataset

### Importing necessary libraries

###### The following code is written in Python 3.x. Libraries provide pre-written functionality to perform the necessary tasks.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')

In [None]:
#reading the data
recipe = pd.read_csv("../input/recipe-dataset/recipe_classification.csv")

In [None]:
#print number of rows and number of columns of data
recipe.shape

In [None]:
#printing top 6 rows 
recipe.head(6)

In [None]:
#Check Missing Values
recipe.isnull().sum()

Here, I am removing an unnecessary column with the drop function. The resulting dataframe will have 8 of the original 9 columns. The column whose name contains 'unnamed' will be removed.
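The drop step described above is not shown in a cell, so here is a minimal sketch of what it might look like. It runs on a small made-up frame, and the column name 'Unnamed: 0' is an assumption (it is the name pandas typically gives a leftover index column), not something confirmed by the original CSV.

```python
import pandas as pd

# Toy frame standing in for the recipe data; 'Unnamed: 0' is an assumed name
# for the leftover index column, not taken from the original CSV.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Recipe_Name": ["Dal", "Pasta", "Burger"],
    "Cuisine": ["Indian", "European", "American"],
})

# Collect every column whose name starts with 'Unnamed' and drop them.
unnamed_cols = [c for c in demo.columns if str(c).startswith("Unnamed")]
demo_clean = demo.drop(columns=unnamed_cols)

print(demo_clean.columns.tolist())
```

Matching on the name prefix rather than hard-coding one column keeps the sketch usable even if the real column is named slightly differently.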

### Check Numeric and Categorical Features

A dataset consists of numerical and categorical columns.

Looking at the dataset, I can identify the categorical and continuous columns in it. However, numerical values may be stored as strings in some features, or categorical values may be stored with a datatype other than string. I will therefore check the datatypes of all the features.

In [None]:
#Identifying Numeric Features
numeric_data = recipe.select_dtypes(include=np.number) # select_dtypes select data with numeric features
numeric_col = numeric_data.columns  # we will store the numeric features in a variable
print("Numeric Features:")
print(numeric_data.head())
print("==="*20)

In [None]:
#Identifying Categorical Features
categorical_data = recipe.select_dtypes(exclude=np.number) # we will exclude data with numeric features
categorical_col = categorical_data.columns  # we will store the categorical features in a variable
print("Categorical Features:")
print(categorical_data.head())
print("==="*20)

In [None]:
#CHECK THE DATATYPES OF ALL COLUMNS:

print(recipe.dtypes)
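If the datatype check reveals a numeric feature stored as strings, one possible fix is `pd.to_numeric`. The sketch below uses a hypothetical string column of preparation times; the values, including the unparseable "n/a", are made up for illustration.

```python
import pandas as pd

# Hypothetical column: preparation times that were read in as strings.
raw = pd.Series(["10", "25", "n/a", "40"], name="Preptime")

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Coercing to NaN makes the bad entries visible in a later missing-value check rather than silently stopping the conversion.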

### Check for Class Imbalance

Class imbalance occurs when the observations belonging to one class of the target significantly outnumber those of the other class or classes. This dataset poses a multi-class classification problem.
Many machine learning algorithms implicitly assume roughly balanced classes, so applying them to imbalanced data often biases predictions toward the majority classes and yields poor classification of the minority classes. Hence, class imbalance needs to be identified and dealt with.
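As one illustration of what "dealing with" imbalance can mean, the sketch below computes inverse-frequency class weights on a made-up target column. This notebook does not apply this step; the formula is the common "balanced" heuristic (the same one scikit-learn uses for `class_weight='balanced'`), and the labels and counts are invented.

```python
import pandas as pd

# Toy target column; the labels and counts are made up for illustration.
cuisine = pd.Series(["Indian"] * 6 + ["European"] * 3 + ["Chinese"] * 1)

counts = cuisine.value_counts()

# "Balanced" weighting: n_samples / (n_classes * class_count),
# so rarer classes receive larger weights.
weights = len(cuisine) / (len(counts) * counts)

print(weights)
```

Such weights could then be passed to a classifier that supports per-class weighting, so that errors on minority classes count for more during training.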

In [None]:
# we are finding the percentage of each class in the feature 'Cuisine'
class_values = (recipe['Cuisine'].value_counts()/recipe['Cuisine'].value_counts().sum())*100
print(class_values)

In [None]:
# we are finding the percentage of each class in the feature 'Category'
class_values = (recipe['Category'].value_counts()/recipe['Category'].value_counts().sum())*100
print(class_values)

In [None]:
# we are finding the percentage of each class in the feature 'Yield'
class_values = (recipe['Yield'].value_counts()/recipe['Yield'].value_counts().sum())*100
print(class_values)

#### Observations:

The class distribution in the 'Cuisine' feature is ~67:22:6:4 for the Indian, European, American and Chinese classes. This is a clear indication of imbalance.
The class distribution in the 'Category' feature is ~29:24:22:11:10:1. It is quite imbalanced.
The 'Yield' feature is also imbalanced.
Next, I will identify which feature is best suited to classifying the recipe dataset.

### Univariate Analysis of Categorical Variable

Univariate analysis means the analysis of a single variable. It mainly describes the characteristics of that variable.
'Recipe_Name' and 'Nutrition' have too many classes. These features are not suitable for classification.

In [None]:
recipe['Cuisine'].value_counts().plot.bar()
plt.title('Recipe_Cuisine')

In [None]:
recipe['Category'].value_counts().plot.bar()
plt.title('Recipe_Category')

In [None]:
#Plotting a bar chart of class frequencies for the categorical variable 'Yield'
recipe['Yield'].value_counts().plot.bar()
plt.title('Recipe_Yield')

#### Observations:

From the above visuals, we can make the following observations.

Most recipes belong to Indian cuisine. Chinese recipes are the least frequent in the dataset.

Lunch, snack and dessert recipes are more numerous than dinner, breakfast and salad recipes.

The '4 servings' class of 'Yield' has a higher percentage than the other classes, so most recipes serve four people.

### Univariate Analysis of Continuous Variable

By performing univariate analysis of the continuous variables, we can get a sense of the distribution of values in each column and of the outliers in the data.

In [None]:
#Plotting 'histogram' for the 'Preptime' Variable
plt.figure(figsize=(20,5))
plt.subplot(121)
sns.histplot(recipe['Preptime'], kde=True)  # histplot replaces the deprecated distplot
plt.title('Recipe_Preptime')

In [None]:
#Plotting 'histogram' for the 'Tottime' Variable
plt.figure(figsize=(20,5))
plt.subplot(121)
sns.histplot(recipe['Tottime'], kde=True)  # histplot replaces the deprecated distplot
plt.title('Recipe_Tottime')

### Categorical - Continuous Bivariate Analysis

In [None]:
recipe.groupby('Cuisine')['Preptime'].mean().plot.bar()
plt.title('Cuisine Vs Preptime')

In [None]:
recipe.groupby('Category')['Preptime'].mean().plot.bar()
plt.title('Category Vs Preptime')

#### Observations:

American recipes have the highest mean preparation time, while Chinese recipes have the lowest.
Across categories, salad recipes take the least time to prepare, whereas desserts take the most.

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(121)
sns.countplot(x='Category',hue='Cuisine',data=recipe)
plt.title('Cuisine Vs Category')
plt.xticks(rotation=90)

#### Observation:

From the plot above, many of the lunch recipes belong to Indian cuisine, and most of the salad recipes belong to European cuisine.
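The counts behind a grouped count plot can also be read off as a table. The sketch below uses `pd.crosstab` on a small made-up frame; the column names match the notebook, but the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows using the same column names as the recipe data.
demo = pd.DataFrame({
    "Category": ["Lunch", "Lunch", "Salad", "Salad", "Dessert"],
    "Cuisine": ["Indian", "Indian", "European", "European", "American"],
})

# Rows are categories, columns are cuisines, cells are recipe counts.
table = pd.crosstab(demo["Category"], demo["Cuisine"])

print(table)
```

A table like this makes it easier to quote exact counts for statements such as "most salad recipes are European".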

#### Univariate Outlier Detection

Box plots are a common and convenient choice for visualizing outliers.

In [None]:
#Creating 'Preptime' box plot
recipe['Preptime'].plot.box()
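A box plot draws as outliers the points that fall beyond its whiskers, which by default extend 1.5 times the interquartile range (IQR) past the quartiles. The same rule can be applied numerically, as in this minimal sketch. The preparation times below are made up, and the 1.5 multiplier is the plotting default rather than a threshold chosen in this notebook.

```python
import pandas as pd

# Made-up preparation times in minutes; 240 is a deliberately extreme value.
preptime = pd.Series([10, 15, 20, 20, 25, 30, 35, 240])

q1 = preptime.quantile(0.25)
q3 = preptime.quantile(0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = preptime[(preptime < lower) | (preptime > upper)]

print(outliers.tolist())
```

Computing the bounds explicitly gives a reproducible cut-off instead of one read off the plot by eye.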

#### Bivariate Outlier Detection

In [None]:
recipe.plot.scatter('Preptime','Tottime')

### Removing outliers from the Dataset

In [None]:
recipe = recipe[recipe['Preptime']<150].copy()  # .copy() avoids SettingWithCopyWarning on later assignments

In [None]:
recipe.shape

### Replacing Outliers in 'Preptime' with the mean 'Preptime'

In [None]:
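#Replace 'Preptime' values above 100 with the column mean (computed before replacement, so it still includes those values)
recipe.loc[recipe['Preptime']>100,'Preptime']=recipe['Preptime'].mean()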
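A possible variation, shown here only as a sketch on made-up values, is to compute the replacement mean from the non-outlier rows so that the extreme values do not inflate it. The threshold of 100 mirrors the cell above; everything else is illustrative.

```python
import pandas as pd

# Made-up preparation times; 200 exceeds the threshold used above.
preptime = pd.Series([10.0, 20.0, 30.0, 200.0])
threshold = 100

is_outlier = preptime > threshold

# Mean of the values that are NOT flagged as outliers.
clean_mean = preptime[~is_outlier].mean()

# mask() replaces the entries where the condition is True.
replaced = preptime.mask(is_outlier, clean_mean)

print(replaced.tolist())
```

Whether this is preferable to the overall mean depends on how extreme the outliers are; a median would be another robust alternative.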

In [None]:
recipe.plot.scatter('Preptime','Tottime')