In this notebook we use the `cleaned_cuisines.csv` dataset, prepared earlier in `1.Introduccion`, to predict the nationality of a cuisine from a set of ingredients.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

cuisine_df = pd.read_csv('./data/cleaned_cuisines.csv')
cuisines_label_df = cuisine_df['cuisine']
cuisines_feature_df = cuisine_df.drop(['Unnamed: 0', 'cuisine'], axis=1)

# Split the data into labels and features.

### Choosing your classifier

Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. The variety is quite bewildering at first sight. The following methods all include classification techniques:

- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (voting classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
- You can also use neural networks to classify data, but that is outside the scope of this lesson.

#### What classifier to go with?

So, which classifier should you choose? Often, the practical test is to run several and compare the results. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, and visualizing the results.
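
The "run several and compare" idea can be sketched in a few lines. This is a minimal sketch, not the scikit-learn comparison itself: the synthetic five-class dataset from `make_classification`, the choice of candidates, and the 5-fold setting are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class dataset (an assumption; it stands in for real data).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)

# A handful of the classifiers named in the comparison above.
candidates = {
    "KNeighbors": KNeighborsClassifier(),
    "SVC (linear)": SVC(kernel="linear"),
    "SVC (RBF)": SVC(kernel="rbf"),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in candidates.items()
}

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: {score:.3f}")
```

Cross-validation is used here rather than a single split so that the ranking depends less on one lucky or unlucky partition of the data.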



A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://learn.microsoft.com/es-es/azure/machine-learning/algorithm-cheat-sheet?view=azureml-api-1&WT.mc_id=academic-77952-leestott). Here, we discover that, for our multiclass problem, we have some choices.

___

Let's see if we can reason our way through different approaches given the constraints we have:

- Neural networks are too heavy. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
- No two-class classifier. We do not use a two-class classifier, so that rules out one-vs-all.
- Decision tree or logistic regression could work. A decision tree might work, or logistic regression for multiclass data.
- Multiclass Boosted Decision Trees solve a different problem. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

# Create a logistic regression with multi_class set to ovr and the solver set to liblinear:
lr = LogisticRegression(multi_class="ovr", solver="liblinear")
model = lr.fit(X_train, y_train) # np.ravel(y_train)

accuracy = model.score(X_test, y_test)
print(f"Accuracy is {accuracy}")

Accuracy is 0.7981651376146789
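
The `ovr` (one-vs-rest) strategy named in the comment above fits one binary "this class against all the others" classifier per class. A minimal sketch of that strategy, using scikit-learn's `OneVsRestClassifier` wrapper on synthetic data, is shown here. The dataset is an assumption, and the wrapper illustrates the strategy rather than the internals of `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 5-class data (an assumption; it stands in for the cuisine features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, random_state=0,
)

# Wrap a binary logistic regression so that one copy is fitted per class.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr.fit(X, y)

# One fitted binary estimator per class.
print(len(ovr.estimators_))

# Per-class probabilities for the first sample, normalized to sum to 1.
proba = ovr.predict_proba(X[:1])
print(proba.round(3))
```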


In [40]:
# Printing real data (test values)
row = 211
print(f'ingredients: {X_test.iloc[row][X_test.iloc[row]!=0].keys()}')
print(f'cuisine: {y_test.iloc[row]}')

ingredients: Index(['cayenne', 'cilantro', 'egg', 'onion', 'tomato', 'turmeric',
       'vegetable_oil'],
      dtype='object')
cuisine: indian


In [37]:
# Predict the class probabilities for the same row
test = X_test.iloc[[row]]  # double brackets keep a one-row DataFrame with its feature names
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()



                 0
indian    0.920791
thai      0.050735
chinese   0.014611
korean    0.011549
japanese  0.002313
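
Ranking the classes by predicted probability, as in the cell above, can also be wrapped in a small helper. This is a sketch only: `top_k_classes` is a hypothetical name, and the probabilities below are made-up values for illustration, not model output.

```python
import numpy as np
import pandas as pd


def top_k_classes(proba_row, classes, k=3):
    """Return the k most probable classes as a Series, highest first."""
    return pd.Series(proba_row, index=classes).nlargest(k)


# Hypothetical values for illustration only.
classes = np.array(["chinese", "indian", "japanese", "korean", "thai"])
proba_row = np.array([0.05, 0.60, 0.10, 0.05, 0.20])

top = top_k_classes(proba_row, classes, k=3)
print(top)
```

With the model from this notebook, the call would look like `top_k_classes(model.predict_proba(test)[0], model.classes_)`.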


In [38]:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     chinese       0.76      0.74      0.75       250
      indian       0.90      0.92      0.91       232
    japanese       0.68      0.80      0.74       256
      korean       0.84      0.77      0.81       224
        thai       0.85      0.76      0.80       237

    accuracy                           0.80      1199
   macro avg       0.81      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199
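
The per-class columns in this report come from the confusion matrix: precision is the number of correct predictions for a class divided by all predictions of that class, recall is the same count divided by all true members of the class, and F1 is their harmonic mean. For the `japanese` row, for instance, 2 × 0.68 × 0.80 / (0.68 + 0.80) ≈ 0.74. A minimal sketch on toy labels (made up for illustration, not taken from the cuisine data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only.
y_true = ["indian", "thai", "thai", "indian", "korean", "korean", "thai", "indian"]
y_pred = ["indian", "thai", "indian", "indian", "korean", "thai", "thai", "korean"]
labels = ["indian", "korean", "thai"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = true_positives / cm.sum(axis=1)     # divide by row totals (true counts)
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn's own per-class scores.
sk_precision = precision_score(y_true, y_pred, labels=labels, average=None)
sk_recall = recall_score(y_true, y_pred, labels=labels, average=None)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```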

