<a href="https://colab.research.google.com/github/uditaagarwal31/cuisine-ml-model/blob/main/classification_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BUILD CLASSIFICATION MODELS
Using the dataset that we balanced and cleaned, we will use a variety of classifiers to predict a national cuisine from a group of ingredients.

In [2]:
import pandas as pd


In [None]:
from google.colab import files
uploaded = files.upload()

In [4]:
import io
cuisines_df = pd.read_csv(io.BytesIO(uploaded['cleaned_cuisines.csv']))

In [5]:
cuisines_df.head()

,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

Splits the data into two dataframes for training: the labels (y) and the features (X). The cuisine column is saved first as the labels

In [7]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

Drops the Unnamed: 0 column and the cuisine column using drop(), and saves the remaining columns as the trainable features

In [8]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
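
An `Unnamed: 0` column is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. As a minimal sketch, assuming a small made-up CSV string rather than the real `cleaned_cuisines.csv`, the first column can be read as the index so that only `cuisine` needs to be dropped. The names `csv_text`, `sample_df`, `sample_labels` and `sample_features` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real dataset.
csv_text = ",cuisine,almond,yogurt\n0,indian,0,0\n1,indian,1,0\n2,thai,0,1\n"

# index_col=0 treats the first (unnamed) column as the row index,
# so no 'Unnamed: 0' column appears among the features.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

sample_labels = sample_df['cuisine']
sample_features = sample_df.drop(columns=['cuisine'])

print(list(sample_features.columns))
```

Whether this applies to the real file depends on how it was saved, so the explicit drop() used in this notebook remains the safe choice.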


## CHOOSING OUR CLASSIFIER
Since we have multiclass data, we can use Multiclass Logistic Regression, Linear SVC, K-Neighbors Classifier, Support Vector Classifier, or an Ensemble Classifier.

## LOGISTIC REGRESSION 

Splits our data into training and testing groups

In [9]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
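
Because the split above is random, the scores reported in this notebook will differ from run to run. A minimal sketch, assuming toy data in place of the cuisine dataframes, of how `random_state` makes the split repeatable and `stratify` keeps the class proportions equal in both portions. All names ending in `_demo`, and the `Xa_`/`Xb_` variables, are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 3 binary features, 2 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 3))
y_demo = np.array(['thai'] * 10 + ['indian'] * 10)

# random_state fixes the shuffle; stratify keeps class proportions
# the same in the training and test portions.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Calling it again with the same random_state gives the same split.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(len(Xa_train), len(Xa_test))
print(np.array_equal(Xa_test, Xb_test))
```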

Creates a logistic regression with multi_class set to ovr and the solver set to liblinear

In [10]:
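# One-vs-rest logistic regression: one binary classifier per cuisine.
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

# Mean accuracy on the held-out test set.
accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))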

Accuracy is 0.8256880733944955


Tests the model on one row of the test data (row 50)

In [11]:
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

ingredients: Index(['chicken_broth', 'sesame_oil', 'soy_sauce', 'starch'], dtype='object')
cuisine: chinese


Shows the predicted probability of each cuisine for this row

In [12]:
# Keep row 50 as a one-row DataFrame so the feature names are preserved.
test = X_test.iloc[[50]]
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending=False)
topPrediction.head()


,0
chinese,0.894563
thai,0.052129
japanese,0.034173
korean,0.018609
indian,0.000526
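
The probabilities in this table sum to 1. With `multi_class='ovr'`, scikit-learn computes a logistic (sigmoid) probability for each class separately and then normalizes the values across classes. A rough sketch of that normalization, using made-up decision scores rather than the fitted model's, with `scores`, `raw`, `ovr_proba` and `ranked` as illustrative names:

```python
import math

# Hypothetical one-vs-rest decision scores for a single recipe.
scores = {'chinese': 2.0, 'thai': -1.0, 'japanese': -1.5,
          'korean': -2.0, 'indian': -5.0}

# Each binary classifier turns its score into a probability with the sigmoid.
raw = {c: 1.0 / (1.0 + math.exp(-s)) for c, s in scores.items()}

# Normalize across classes so the probabilities sum to 1.
total = sum(raw.values())
ovr_proba = {c: p / total for c, p in raw.items()}

# Sort from most to least likely, like the table above.
ranked = sorted(ovr_proba.items(), key=lambda item: item[1], reverse=True)
for cuisine, p in ranked:
    print(f'{cuisine}: {p:.3f}')
```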


Prints a classification report

In [13]:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     chinese       0.76      0.68      0.72       238
      indian       0.91      0.91      0.91       230
    japanese       0.82      0.82      0.82       258
      korean       0.86      0.86      0.86       218
        thai       0.78      0.86      0.82       255

    accuracy                           0.83      1199
   macro avg       0.83      0.83      0.83      1199
weighted avg       0.83      0.83      0.82      1199
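
To make the columns of the report concrete: the f1-score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the five cuisines. The sketch below recomputes both from the rounded precision and recall values printed above, so small rounding differences from the report are possible. The helper `f1` and the dictionaries are illustrative names.

```python
# Precision and recall per cuisine, copied from the report above (rounded).
precision = {'chinese': 0.76, 'indian': 0.91, 'japanese': 0.82,
             'korean': 0.86, 'thai': 0.78}
recall = {'chinese': 0.68, 'indian': 0.91, 'japanese': 0.82,
          'korean': 0.86, 'thai': 0.86}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

f1_scores = {c: f1(precision[c], recall[c]) for c in precision}

# The macro average is the unweighted mean over classes.
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

for cuisine, score in f1_scores.items():
    print(f'{cuisine}: {score:.2f}')
print(f'macro precision: {macro_precision:.2f}, macro recall: {macro_recall:.2f}')
```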



## LINEAR SVC CLASSIFIER, K-NEIGHBORS CLASSIFIER, AND SUPPORT VECTOR CLASSIFIER

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
import numpy as np

Splits training and testing data 

In [14]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

Creates a dictionary of classifiers that we'll progressively add to as we test

In the Support Vector Classification (SVC) method, we can choose a 'kernel' to decide how the classes are separated. The 'C' parameter controls 'regularization': the strength of the regularization is inversely proportional to C, so a larger C fits the training data more closely. We set the kernel to 'linear' to get a linear SVC. We set probability to True to enable probability estimates. We set random_state to 0 so that the data shuffling used to compute those probability estimates is reproducible
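
For reference, here is one standard textbook way to write the two-class soft-margin objective in which C appears. The notation is an assumption and is not taken from this notebook: $w$ and $b$ define the separating hyperplane, $\xi_i$ are slack variables that allow some points to violate the margin, and $y_i \in \{-1, +1\}$ are the labels.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
```

A larger C penalizes margin violations more heavily, which is why it corresponds to weaker regularization.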


In the K-Neighbors method, a sample is labeled by a majority vote among its k nearest training samples. Here k is 10, because the code below passes C as the number of neighbors
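
A minimal pure-Python sketch of the nearest-neighbor vote, assuming five made-up recipes encoded as 0/1 ingredient vectors. It uses the Hamming distance, which on 0/1 vectors ranks neighbors the same way as the Euclidean distance that scikit-learn uses by default. The names `train`, `hamming` and `knn_predict` are illustrative and are not part of scikit-learn.

```python
from collections import Counter

# Hypothetical training recipes as binary ingredient vectors:
# (soy_sauce, sesame_oil, yogurt, cumin)
train = [
    ((1, 1, 0, 0), 'chinese'),
    ((1, 0, 0, 0), 'chinese'),
    ((1, 1, 0, 1), 'chinese'),
    ((0, 0, 1, 1), 'indian'),
    ((0, 0, 1, 0), 'indian'),
]

def hamming(a, b):
    """Number of ingredients on which two recipes differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, k=3):
    """Label the query by majority vote among its k nearest neighbors."""
    nearest = sorted(train, key=lambda item: hamming(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1, 1, 0, 0)))
print(knn_predict((0, 0, 1, 1)))
```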

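In the Support Vector Classifier method, training examples are mapped to points in space so that the gap between categories is as wide as possible. New data is mapped into the same space and its category is predicted from the side of the gap it falls on. SVC() with no arguments uses the default 'rbf' kernel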

In [26]:
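C = 10
# Create different classifiers.
classifiers = {
    # SVC with a linear kernel; probability=True enables predict_proba.
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0),
    # C is reused here as the number of neighbors (n_neighbors=10).
    'KNN classifier': KNeighborsClassifier(n_neighbors=C),
    # Default SVC (RBF kernel).
    'SVC': SVC(),
}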

Trains each classifier and prints out a report

In [27]:
for name, classifier in classifiers.items():
    classifier.fit(X_train, np.ravel(y_train))

    y_pred = classifier.predict(X_test)
    # Despite the "(train)" label, this accuracy is measured on the test set.
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))

Accuracy (train) for Linear SVC: 81.2% 
              precision    recall  f1-score   support

     chinese       0.64      0.77      0.70       211
      indian       0.90      0.92      0.91       249
    japanese       0.85      0.74      0.79       250
      korean       0.88      0.73      0.80       232
        thai       0.81      0.88      0.84       257

    accuracy                           0.81      1199
   macro avg       0.82      0.81      0.81      1199
weighted avg       0.82      0.81      0.81      1199

Accuracy (train) for KNN classifier: 74.8% 
              precision    recall  f1-score   support

     chinese       0.64      0.73      0.68       211
      indian       0.84      0.81      0.83       249
    japanese       0.69      0.78      0.74       250
      korean       0.92      0.56      0.70       232
        thai       0.74      0.83      0.78       257

    accuracy                           0.75      1199
   macro avg       0.76      0.74      0.74    

The SVC method (default RBF kernel) had the best accuracy, 83.8%. The Linear SVC method also performed well, at 81.2%. The K-Neighbors method performed worst, at 74.8%.
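
These figures come from a single random train/test split, so the ranking could shift on another run. The notebook imports `cross_val_score` but never uses it. The sketch below shows how it could give a steadier comparison, assuming synthetic data from `make_classification` in place of the cuisine features (so the printed scores say nothing about the real dataset). `X_syn`, `y_syn`, `candidates` and `cv_results` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the cuisine features (an assumption, not the real data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=5, random_state=0)

candidates = {
    'Linear SVC': SVC(kernel='linear', C=10),
    'KNN classifier': KNeighborsClassifier(n_neighbors=10),
    'SVC': SVC(),
}

# 5-fold cross-validation: every sample is used for testing exactly once.
cv_results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_syn, y_syn, cv=5)
    cv_results[name] = scores
    print(f'{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})')
```

On the real data, `cuisines_feature_df` and `cuisines_label_df` would take the place of `X_syn` and `y_syn`.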