## BUILD CLASSIFICATION MODELS
Using the dataset that we balanced and cleaned, we will use a variety of classifiers to predict a national cuisine from a group of ingredients

In [None]:
import pandas as pd
cuisines_df = pd.read_csv("./cleaned_cuisines.csv")
cuisines_df.head()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

Divides the data into features (X) and labels (y) for training. The cuisine column holds the labels

In [None]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

Drops the Unnamed: 0 column and the cuisine column, using drop(). Saves the rest of the data as trainable features

In [None]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

## CHOOSING OUR CLASSIFIER
Since we have multiclass data, we can use Multiclass Logistic Regression, Linear SVC, KNeighbors Classifier, Support Vector Classifier or Ensemble Classifier


## LOGISTIC REGRESSION 

Splits our data into training and testing groups

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
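The split above is random, so the reported accuracy changes from run to run. As a minimal sketch on made-up data (the toy arrays and the cuisine names below are placeholders, not the real dataset), passing random_state makes the split repeatable and stratify keeps the class proportions the same in both groups:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the cuisine data: 100 rows, 4 binary "ingredient" columns,
# and 5 balanced classes with 20 rows each (placeholder names).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 4))
y_toy = np.repeat(["chinese", "indian", "japanese", "korean", "thai"], 20)

# stratify keeps the class proportions equal in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(dict(zip(*np.unique(y_te, return_counts=True))))
```

Whether to stratify the real split is a judgment call; with an already balanced dataset the effect is small, but it removes one source of run-to-run variation.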

Creates a logistic regression with multi_class set to ovr (one-vs-rest: one binary classifier per cuisine) and the solver set to liblinear

In [None]:
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))
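To make the one-vs-rest idea concrete, here is a rough sketch that builds it by hand on synthetic data: one binary logistic regression per class, then the class whose model gives the highest score wins. The blob centers and variable names are assumptions chosen for illustration; this is not how scikit-learn implements it internally, only the same idea spelled out.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Three well-separated synthetic classes (centers chosen for illustration).
X_toy, y_toy = make_blobs(
    n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0, random_state=0,
)

classes = np.unique(y_toy)
binary_models = []
for c in classes:
    # One binary problem per class: "is it class c, or anything else?"
    clf = LogisticRegression(solver="liblinear")
    clf.fit(X_toy, (y_toy == c).astype(int))
    binary_models.append(clf)

# Column j holds the probability of "is class j" from the j-th binary model.
scores = np.column_stack([m.predict_proba(X_toy)[:, 1] for m in binary_models])

# The predicted class is the one whose binary model is most confident.
ovr_pred = classes[np.argmax(scores, axis=1)]
ovr_accuracy = (ovr_pred == y_toy).mean()
print("training accuracy:", ovr_accuracy)
```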

Checks the model on one test row (row 50): lists its non-zero ingredients and its true cuisine

In [None]:
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

Checks this prediction by listing the model's predicted probability for each cuisine on the same row, highest first

In [None]:
# One-row DataFrame; keeps the feature names the model was fitted with
test = X_test.iloc[[50]]
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
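The ranking step can also be written with a labelled Series, which avoids the transpose. A small sketch with invented probabilities (the numbers and cuisine names are placeholders, not model output):

```python
import numpy as np
import pandas as pd

# Invented values standing in for model.classes_ and model.predict_proba(test).
toy_classes = np.array(["chinese", "indian", "japanese", "korean", "thai"])
toy_proba = np.array([[0.05, 0.70, 0.10, 0.05, 0.10]])

# Label each probability with its class, then sort from most to least likely.
ranked = pd.Series(toy_proba[0], index=toy_classes).sort_values(ascending=False)
top_label = ranked.index[0]

print(ranked.head(3))
print("top prediction:", top_label)
```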

Prints a classification report

In [None]:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
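The precision and recall figures in the report come from the confusion matrix, which counts how often each true class was predicted as each class. A small sketch with hand-made labels (invented for illustration) shows the connection:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made true and predicted labels, for illustration only.
toy_true = ["indian", "indian", "thai", "thai", "korean", "korean"]
toy_pred = ["indian", "thai", "thai", "thai", "korean", "indian"]
labels = ["indian", "korean", "thai"]

# Rows are true classes, columns are predicted classes, in the order of labels.
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
print(cm)

# Recall: correct predictions divided by the number of true samples (row sums).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums).
precision = cm.diagonal() / cm.sum(axis=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```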

## LINEAR SVC CLASSIFIER, K-NEIGHBORS CLASSIFIER, AND SUPPORT VECTOR CLASSIFIER

Splits training and testing data 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

Creates a dictionary of classifiers that we'll progressively add to as we test

In the Support Vector Classification (SVC) method, we can choose a 'kernel' that determines the shape of the decision boundary between classes. The 'C' parameter controls regularization: smaller values regularize more strongly, while larger values fit the training data more closely. We set the kernel to 'linear' to get a linear SVC. We set probability to True to enable probability estimates. We set random_state to 0 so that the internal data shuffling used to compute those probability estimates is reproducible
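As a hedged illustration of the probability setting, the sketch below fits two linear SVCs on synthetic blobs (the data and variable names are assumptions for this example). Only the one created with probability=True exposes predict_proba:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Three synthetic classes, for illustration only.
X_toy, y_toy = make_blobs(
    n_samples=150, centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0, random_state=0,
)

with_proba = SVC(kernel="linear", C=10, probability=True, random_state=0)
with_proba.fit(X_toy, y_toy)

without_proba = SVC(kernel="linear", C=10)
without_proba.fit(X_toy, y_toy)

# One row per sample, one column per class; each row sums to 1.
proba = with_proba.predict_proba(X_toy[:3])
print(proba.round(3))

# predict_proba is only available when probability=True was set.
print(hasattr(without_proba, "predict_proba"))
```

Probability estimates add an internal cross-validation step, so fitting with probability=True is slower than fitting without it.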


In the K-Neighbors method, the label of a new sample is predicted by a majority vote among its k nearest training samples, where k is a predefined number of neighbors. Here k is set to 10 through KNeighborsClassifier(C)
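The neighbor vote can be sketched from scratch in a few lines. This is a simplified illustration on invented points, using Euclidean distance and a plain majority vote; knn_predict is a hypothetical helper, not part of scikit-learn:

```python
from collections import Counter

import numpy as np


def knn_predict(X_known, y_known, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every known point.
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k closest known points.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins.
    votes = Counter(y_known[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two small, clearly separated groups of invented points.
X_known = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_known = np.array(["thai", "thai", "thai", "korean", "korean", "korean"])

pred_a = knn_predict(X_known, y_known, np.array([0.5, 0.5]))
pred_b = knn_predict(X_known, y_known, np.array([5.5, 5.5]))
print(pred_a, pred_b)
```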

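In the Support Vector Classifier method, training examples are mapped to points in space so that the gap (margin) between categories is as wide as possible. New data is mapped into the same space, and its category is predicted from the side of the boundary it falls on. SVC() with no arguments uses the default RBF kernel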

In [None]:
C = 10
# Create different classifiers.
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0),
    # C is reused here as n_neighbors, the number of neighbors to vote
    'KNN classifier': KNeighborsClassifier(C),
    'SVC': SVC(),
}

Trains each classifier and prints a report

In [None]:
n_classifiers = len(classifiers)

for index, (name, classifier) in enumerate(classifiers.items()):
    classifier.fit(X_train, np.ravel(y_train))

    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (test) for %s: %0.1f%% " % (name, accuracy * 100))
    print(classification_report(y_test,y_pred))

In this run, the SVC method had the best accuracy at 83.2%. The Linear SVC method also performed well, at 78.6%. The K-Neighbors method didn't perform as well, at 73.8%. The accuracy for the AdaBoost method, which is not among the classifiers defined above, was also low at 72.4%. Because the train/test split is random, these figures will vary from run to run
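A single random split gives only one accuracy figure per classifier. One way to get a steadier comparison is cross-validation, which averages accuracy over several splits. The sketch below uses synthetic data from make_classification as a stand-in for the cuisine features, so its numbers say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 3-class data, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

toy_classifiers = {
    "Linear SVC": SVC(kernel="linear", C=10, random_state=0),
    "KNN classifier": KNeighborsClassifier(10),
    "SVC": SVC(),
}

results = {}
for name, classifier in toy_classifiers.items():
    # Five folds: each fold is held out once while the model trains on the rest.
    scores = cross_val_score(classifier, X_toy, y_toy, cv=5)
    results[name] = scores
    print("%s: mean accuracy %0.3f (std %0.3f)" % (name, scores.mean(), scores.std()))
```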