# Classification models

## Loading the data

In [1]:
import pandas as pd

train_path = './train.json'
data = pd.read_json(train_path)

pd.set_option("max_colwidth", 200)

data.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese crumbles]"
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil]"
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, soy sauce, butter, chicken livers]"
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, chili powder, passata, oil, ground cumin, boneless chicken skinless thigh, garam m..."


We split the available dataset into a training set and a test set in an 80/20 ratio while preserving the class proportions.

In [2]:
from sklearn.model_selection import StratifiedShuffleSplit
X = data['ingredients']
y = data['cuisine']


sss = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=42)

train_index, test_index = list(sss.split(X, y))[0]
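To make "preserving the class proportions" concrete, here is a minimal, dependency-free sketch of a stratified split. It is not how `StratifiedShuffleSplit` is implemented; the helper `stratified_split` and the toy labels are assumptions made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: split indices so each class keeps its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 50% italian, 30% indian, 20% greek
labels = ["italian"] * 50 + ["indian"] * 30 + ["greek"] * 20
train_idx, test_idx = stratified_split(labels)

def proportions(indices):
    counts = Counter(labels[i] for i in indices)
    return {label: counts[label] / len(indices) for label in sorted(counts)}

print(proportions(train_idx))
print(proportions(test_idx))
```

Both printed dictionaries should show the same class shares, which is the property the split in the cell above relies on.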

### Building the recipe-ingredient matrix

Each row of the matrix represents a recipe and each column an ingredient. A 1 at position (i, j) means that the j-th ingredient is present in the i-th recipe.

We tokenize each ingredient with the NLTK package, convert all letters to lowercase, normalize special characters to ASCII with unidecode, and finally lemmatize with WordNetLemmatizer from NLTK. In total, this procedure should take about one minute on the training set.

In [3]:
%%time

import nltk
from unidecode import unidecode
from nltk.stem import WordNetLemmatizer
from collections import Counter
from scipy.sparse import lil_matrix


def lemmantize(set_of_ing):
    lemmatizer = WordNetLemmatizer()
    # unidecode(word).lower() reduced the vocabulary from 3341 to 2912 entries
    return set([lemmatizer.lemmatize(unidecode(word).lower()) 
                for word_list in [nltk.word_tokenize(ing) for ing in set_of_ing] 
                for word in word_list])

ingredients_counter = Counter(ingredient for ingredients_list in X.apply(set).apply(lemmantize)
                              for ingredient in ingredients_list)

print("Number of unique ingredients", len(ingredients_counter))

# Two lookup tables: ingredient -> column index and column index -> ingredient
ingredientToInd = {ingredient: ind for ind, ingredient in enumerate(ingredients_counter)}
indToIngredient = dict(enumerate(ingredients_counter))

# the matrix needs len(data) x number_of_unique_ingredients cells -> 39774 x 2912 = 115821888 ~ 1.2 * 10^8
def create_cnt_matrix(ingredients_data):
    """Build a sparse boolean matrix with one row per recipe and one column per vocabulary entry."""
    processed_data = ingredients_data.apply(set).apply(lemmantize)

    cnt_matrix = lil_matrix((len(ingredients_data), len(ingredients_counter)), dtype=bool, copy=False)

    for i, row in enumerate(processed_data):
        for ingredient in row:
            # Entries missing from the vocabulary are skipped
            if ingredient in ingredientToInd:
                cnt_matrix[i, ingredientToInd[ingredient]] = 1

    return cnt_matrix

Number of unique ingredients 2912
CPU times: user 39.7 s, sys: 591 ms, total: 40.3 s
Wall time: 40.3 s
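The following dependency-free sketch shows the structure that the cell above builds: a vocabulary mapping each token to a column, and a binary matrix stored as the positions of its ones. The `normalize` function is a simplified stand-in (lowercase and whitespace split only, no unidecode or NLTK), and the two recipes are made up.

```python
def normalize(ingredient):
    # Simplified stand-in for the notebook's pipeline: lowercase + whitespace split.
    # The real cell also applies unidecode and WordNetLemmatizer.
    return set(ingredient.lower().split())

recipes = [  # toy data, made up for illustration
    ["Feta cheese", "black olives"],
    ["Black pepper", "olive oil"],
]

# One token set per recipe
token_sets = [set().union(*(normalize(ing) for ing in recipe)) for recipe in recipes]

# Vocabulary: token -> column index, assigned in first-seen order
token_to_col = {}
for tokens in token_sets:
    for token in sorted(tokens):
        token_to_col.setdefault(token, len(token_to_col))

# Sparse binary matrix as the set of (row, column) positions that hold a 1
ones = {(i, token_to_col[t]) for i, tokens in enumerate(token_sets) for t in tokens}

print(token_to_col)
print(sorted(ones))
```

Because each ingredient is split into words, the columns correspond to word tokens rather than whole ingredient phrases: in this toy example "black olives" and "black pepper" share the column for "black".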


In [4]:
%%time

cnt_matrix = create_cnt_matrix(X)
print("Matrix density: ", cnt_matrix.count_nonzero() / (cnt_matrix.shape[0] * cnt_matrix.shape[1]))

Matrix density:  0.00645310668739919
CPU times: user 43.2 s, sys: 71.8 ms, total: 43.3 s
Wall time: 43.3 s
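As a quick sanity check on the reported density, the arithmetic below converts it back into counts. It assumes the matrix shape is 39774 x 2912, as stated in the code comment of the previous cell.

```python
n_rows, n_cols = 39774, 2912       # assumed matrix shape (recipes x vocabulary entries)
density = 0.00645310668739919       # value printed by the cell above

n_cells = n_rows * n_cols
approx_nonzero = density * n_cells           # number of cells holding a 1
avg_tokens_per_recipe = density * n_cols     # average number of ones per row

print(n_cells)
print(round(approx_nonzero))
print(round(avg_tokens_per_recipe, 1))
```

Under that assumption, an average recipe sets roughly 19 of the 2912 columns, which is why a sparse format such as `lil_matrix` is a sensible choice here.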


## Naive predictor that always predicts Italian cuisine

In [7]:
from sklearn.base import BaseEstimator, ClassifierMixin

class BaselineClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass
    
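    def fit(self, X, y):
        # Nothing to learn; return self to follow the scikit-learn estimator convention.
        return self

    def predict(self, X):
        # Predict "italian" for every sample.
        return ["italian"] * X.shape[0]

    def score(self, X, y):
        # Accuracy: share of samples whose label equals the constant prediction.
        return sum(self.predict(X) == y) / X.shape[0]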

baseline = BaselineClassifier()

print("training accuracy: ", baseline.score(X[train_index], y[train_index]))
print("test accuracy: ", baseline.score(X[test_index], y[test_index]))

training accuracy:  0.19705207580376505
test accuracy:  0.19710873664362036


The naive predictor serves as a reference point for how well the machine learning models perform: any useful model should clearly beat its accuracy of about 19.7%.
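A constant predictor scores exactly the share of the class it predicts, so a natural variant is to take the most frequent training label instead of hard-coding one. The sketch below illustrates this with made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels, made up for illustration
train_labels = ["italian", "italian", "indian", "greek", "italian", "indian"]
test_labels = ["italian", "indian", "italian", "greek"]

label, acc = majority_baseline_accuracy(train_labels, test_labels)
print(label, acc)
```

scikit-learn also offers `DummyClassifier(strategy="most_frequent")` for the same purpose.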

## Training various models

### Random forest

In [5]:
%%time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [None, 50],
              'min_samples_split': [2, 5],
              'n_estimators': [200, 500],
              'max_features': [2, 5, 10, 20, 50]}

grid_random_forest = GridSearchCV(RandomForestClassifier(), param_grid, refit = True, verbose = 3, n_jobs=4, 
                                  scoring='accuracy', return_train_score=True, cv = [(train_index, test_index)])

grid_random_forest.fit(cnt_matrix, y)

pd.DataFrame(grid_random_forest.cv_results_)[
    ['param_max_depth', 'param_n_estimators', 'param_max_features', 'param_min_samples_split',
     'split0_test_score', 'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 40 candidates, totalling 40 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed: 15.5min
[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed: 22.9min finished


CPU times: user 3min 54s, sys: 1.48 s, total: 3min 56s
Wall time: 26min 46s


Unnamed: 0,param_max_depth,param_n_estimators,param_max_features,param_min_samples_split,split0_test_score,split0_train_score,rank_test_score
17,,500,50,2,0.762791,0.999749,1
13,,500,20,2,0.762791,0.999749,1
16,,200,50,2,0.759271,0.999749,3
19,,500,50,5,0.759145,0.988435,4
9,,500,10,2,0.758642,0.999749,5
5,,500,5,2,0.758391,0.999749,6
1,,500,2,2,0.758265,0.999749,7
12,,200,20,2,0.758265,0.999749,7
4,,200,5,2,0.757008,0.999749,9
18,,200,50,5,0.756631,0.987429,10


Random forests fit the training set almost perfectly (99.97% accuracy) but reach only about 76% accuracy on the test set. Hyperparameter tuning brings only a minimal improvement of about 2%.
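The "40 candidates, totalling 40 fits" in the log follows from the grid size and the single predefined split. A small sketch with `itertools.product`, using the same parameter lists as the random forest grid, reproduces that count; it only enumerates combinations and trains nothing.

```python
from itertools import product

param_grid = {
    "max_depth": [None, 50],
    "min_samples_split": [2, 5],
    "n_estimators": [200, 500],
    "max_features": [2, 5, 10, 20, 50],
}

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_splits = 1  # cv=[(train_index, test_index)] is a single predefined split
print(len(candidates), "candidates,", len(candidates) * n_splits, "fits")
```

Note that with this setup the held-out 20% is also the data used to rank the candidates, so the reported test scores double as model-selection scores.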

### Linear SVM

We use an SVM with a linear kernel (Yang, Liu: A re-examination of text categorization methods).

In [6]:
%%time

from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

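# Note: 'squared hinge' (with a space) is kept as in the recorded run, but LinearSVC
# only accepts 'hinge' and 'squared_hinge', so those candidates fail to fit.
param_grid = {'C': [0.001, 0.1, 1, 10, 50, 100, 500, 1000, 5000],
              'penalty': ['l1', 'l2'],
              'loss': ['hinge', 'squared hinge'],
              'class_weight': [None, 'balanced'],
              'max_iter': [1000, 2000]}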

grid_linear_svm = GridSearchCV(LinearSVC(), param_grid, refit = True, verbose = 3, n_jobs=-2, scoring='accuracy',
                       return_train_score=True, cv = [(train_index, test_index)])

grid_linear_svm.fit(cnt_matrix, y)

pd.DataFrame(grid_linear_svm.cv_results_)[
    ['param_C', 'param_loss', 'param_penalty', 'param_max_iter', 'param_class_weight', 'split0_test_score', 
     'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 144 candidates, totalling 144 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  18 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-2)]: Done 114 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-2)]: Done 144 out of 144 | elapsed:  7.4min finished


CPU times: user 8.54 s, sys: 1.07 s, total: 9.61 s
Wall time: 7min 28s




Unnamed: 0,param_C,param_loss,param_penalty,param_max_iter,param_class_weight,split0_test_score,split0_train_score,rank_test_score
19,0.1,hinge,l2,2000,,0.771716,0.825356,1
17,0.1,hinge,l2,1000,,0.771716,0.825324,1
35,1,hinge,l2,2000,,0.770207,0.865269,3
33,1,hinge,l2,1000,,0.770082,0.865426,4
51,10,hinge,l2,2000,,0.759271,0.882209,5
...,...,...,...,...,...,...,...,...
46,1,squared hinge,l1,2000,balanced,,,140
45,1,squared hinge,l2,1000,balanced,,,141
44,1,squared hinge,l1,1000,balanced,,,142
40,1,hinge,l1,1000,balanced,,,143


In [8]:
pd.DataFrame(grid_linear_svm.cv_results_)[
    ['param_C', 'param_loss', 'param_penalty', 'param_max_iter', 'param_class_weight', 'split0_test_score', 
     'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score']).head(30)

Unnamed: 0,param_C,param_loss,param_penalty,param_max_iter,param_class_weight,split0_test_score,split0_train_score,rank_test_score
19,0.1,hinge,l2,2000,,0.771716,0.825356,1
17,0.1,hinge,l2,1000,,0.771716,0.825324,1
35,1.0,hinge,l2,2000,,0.770207,0.865269,3
33,1.0,hinge,l2,1000,,0.770082,0.865426,4
51,10.0,hinge,l2,2000,,0.759271,0.882209,5
49,10.0,hinge,l2,1000,,0.759271,0.881957,5
27,0.1,hinge,l2,2000,balanced,0.754243,0.816808,7
25,0.1,hinge,l2,1000,balanced,0.754243,0.816808,7
41,1.0,hinge,l2,1000,balanced,0.75374,0.854961,9
43,1.0,hinge,l2,2000,balanced,0.753488,0.854929,10


The linear SVM is marginally better than the random forest, by about 1 percentage point (77.2% vs. 76.3% test accuracy).
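The empty scores at the bottom of the results table correspond to candidates that could not be fitted. The sketch below counts them without training anything. It encodes two assumptions about `LinearSVC` as a plain Python rule: the loss name must be `'hinge'` or `'squared_hinge'`, and `penalty='l1'` cannot be combined with `loss='hinge'`. The helper `can_fit` is hypothetical.

```python
from itertools import product

C_values = [0.001, 0.1, 1, 10, 50, 100, 500, 1000, 5000]
penalties = ["l1", "l2"]
losses = ["hinge", "squared hinge"]   # as written in the notebook's grid
class_weights = [None, "balanced"]
max_iters = [1000, 2000]

VALID_LOSSES = {"hinge", "squared_hinge"}  # assumed set of accepted loss names

def can_fit(penalty, loss):
    # Assumption: an unknown loss name fails, and so does l1 together with hinge.
    if loss not in VALID_LOSSES:
        return False
    return not (penalty == "l1" and loss == "hinge")

grid = list(product(C_values, penalties, losses, class_weights, max_iters))
fittable = [combo for combo in grid if can_fit(combo[1], combo[2])]
print(len(grid), len(fittable))
```

Under these assumptions only the `hinge` + `l2` candidates produce a score, which matches the visible rows of the table: every scored row uses that combination.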

### Gaussian naive Bayes

In [12]:
%%time

from sklearn.naive_bayes import GaussianNB

gaussian_nb = GaussianNB()

gaussian_nb.fit(cnt_matrix[train_index].toarray(), y[train_index])

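print("train score: ", gaussian_nb.score(cnt_matrix[train_index].toarray(), y[train_index]))
print("test score: ", gaussian_nb.score(cnt_matrix[test_index].toarray(), y[test_index]))

The recorded run of this cell stopped with "UsageError: Line magic function `%%time` not found." because `%%time` also appeared in the middle of the cell; a cell magic is valid only on the first line of a cell. No Gaussian naive Bayes scores were therefore recorded.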


### Multinomial naive Bayes

https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes

In [13]:
%%time
from sklearn.naive_bayes import MultinomialNB

param_grid = {'alpha': [i * 0.01 for i in range(101)],
              'fit_prior': [False, True]
             } 

grid_multinomialNB = GridSearchCV(MultinomialNB(), param_grid, refit = True, verbose = 3, 
                                   n_jobs=-2, scoring='accuracy', cv = [(train_index, test_index)],
                                   return_train_score=True)

grid_multinomialNB.fit(cnt_matrix, y) 

pd.DataFrame(grid_multinomialNB.cv_results_)[
    ['param_alpha', 'param_fit_prior', 
     'split0_test_score', 'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 202 candidates, totalling 202 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  18 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-2)]: Done 114 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-2)]: Done 202 out of 202 | elapsed:   23.3s finished


CPU times: user 5.76 s, sys: 528 ms, total: 6.29 s
Wall time: 23.7 s


Unnamed: 0,param_alpha,param_fit_prior,split0_test_score,split0_train_score,rank_test_score
167,0.83,True,0.730987,0.752946,1
169,0.84,True,0.730861,0.752789,2
171,0.85,True,0.730861,0.752601,2
165,0.82,True,0.730735,0.753198,4
181,0.9,True,0.730735,0.752538,4
...,...,...,...,...,...
14,0.07,False,0.693652,0.734687,198
20,0.1,False,0.693652,0.733901,198
18,0.09,False,0.693400,0.734184,200
16,0.08,False,0.693023,0.734467,201


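Multinomial naive Bayes is somewhat worse than the linear SVM (73.1% vs. 77.2% test accuracy), but it is a much faster algorithm: the whole grid of 202 candidates runs in about 24 seconds.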
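The `alpha` parameter tuned above controls additive smoothing of the per-class feature likelihoods, estimated as (N_i + alpha) / (N + alpha * n) according to the scikit-learn documentation linked in this section. The sketch below applies that formula to made-up counts; `smoothed_likelihoods` is a hypothetical helper, not a scikit-learn function.

```python
def smoothed_likelihoods(feature_counts, alpha):
    """Additive smoothing: (N_i + alpha) / (N + alpha * n) for each feature i."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    return [(count + alpha) / (total + alpha * n_features) for count in feature_counts]

# Toy counts for one class (made up): the third feature never occurs in this class
counts = [3, 1, 0]

for alpha in (0.0, 0.83, 1.0):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

With `alpha = 0` an ingredient never seen in a cuisine gets probability zero and vetoes that cuisine outright; any positive `alpha` avoids this, and the grid search above settles on values around 0.83.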

### ComplementNB

https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes

In [14]:
%%time
from sklearn.naive_bayes import ComplementNB

param_grid = {'alpha': [i * 0.01 for i in range(101)],
              'norm': [False, True]
             } 

grid_complementNB = GridSearchCV(ComplementNB(), param_grid, refit = True, verbose = 3, 
                                   n_jobs=-2, scoring='accuracy', cv = [(train_index, test_index)],
                                   return_train_score=True)

grid_complementNB.fit(cnt_matrix, y) 

pd.DataFrame(grid_complementNB.cv_results_)[
    ['param_alpha', 'param_norm', 
     'split0_test_score', 'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 202 candidates, totalling 202 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  18 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-2)]: Done 114 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-2)]: Done 202 out of 202 | elapsed:   22.0s finished


CPU times: user 5.82 s, sys: 284 ms, total: 6.11 s
Wall time: 22.6 s


Unnamed: 0,param_alpha,param_norm,split0_test_score,split0_train_score,rank_test_score
29,0.14,True,0.693526,0.716867,1
35,0.17,True,0.693526,0.715547,1
27,0.13,True,0.693400,0.716962,3
55,0.27,True,0.693275,0.713190,4
61,0.3,True,0.693275,0.711839,4
...,...,...,...,...,...
0,0.0,False,0.683218,0.709199,197
197,0.98,True,0.683218,0.698482,197
201,1.0,True,0.682841,0.698419,200
3,0.01,True,0.676430,0.698199,201


### BernoulliNB

In [15]:
from sklearn.naive_bayes import BernoulliNB

param_grid = {'alpha': [i * 0.01 for i in range(101)],
              'fit_prior': [False, True]
             } 

grid_bernoulliNB = GridSearchCV(BernoulliNB(), param_grid, refit = True, verbose = 3, 
                                   n_jobs=-2, scoring='accuracy', cv = [(train_index, test_index)],
                                   return_train_score=True)
grid_bernoulliNB.fit(cnt_matrix, y) 

pd.DataFrame(grid_bernoulliNB.cv_results_)[
    ['param_alpha', 'param_fit_prior', 
     'split0_test_score', 'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 202 candidates, totalling 202 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  18 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-2)]: Done 114 tasks      | elapsed:   12.8s
[Parallel(n_jobs=-2)]: Done 202 out of 202 | elapsed:   23.8s finished


Unnamed: 0,param_alpha,param_fit_prior,split0_test_score,split0_train_score,rank_test_score
137,0.68,True,0.720302,0.739590,1
133,0.66,True,0.720302,0.739810,1
121,0.6,True,0.720176,0.741255,3
127,0.63,True,0.720050,0.740407,4
131,0.65,True,0.720050,0.739904,4
...,...,...,...,...,...
10,0.05,False,0.680704,0.717810,198
12,0.06,False,0.680327,0.716742,199
6,0.03,False,0.680201,0.719539,200
4,0.02,False,0.679195,0.721456,201


### GradientBoostingClassifier

In [104]:
%%time

from sklearn.ensemble import GradientBoostingClassifier

param_grid = {'n_estimators': [i*20 for i in range(1, 5)],
              'learning_rate': [0.1, 0.25, 0.5, 1],
              'max_depth': [1, 2, 5]
             } 

grid_grad_boost = GridSearchCV(GradientBoostingClassifier(), param_grid, refit = True, verbose = 3, 
                                   n_jobs=6, scoring='accuracy', cv = [(train_index, test_index)],
                                   return_train_score=True)
grid_grad_boost.fit(cnt_matrix, y) 

pd.DataFrame(grid_grad_boost.cv_results_)[
    ['param_n_estimators', 'param_learning_rate', 'param_max_depth',
     'split0_test_score', 'split0_train_score', 'rank_test_score']].sort_values(['rank_test_score'])

Fitting 1 folds for each of 48 candidates, totalling 48 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed: 17.7min
[Parallel(n_jobs=6)]: Done  48 out of  48 | elapsed: 32.7min finished


CPU times: user 2min 40s, sys: 488 ms, total: 2min 40s
Wall time: 35min 21s


Unnamed: 0,param_n_estimators,param_learning_rate,param_max_depth,split0_test_score,split0_train_score,rank_test_score
11,80,0.1,5,0.757134,0.885069,1
23,80,0.25,5,0.748209,0.939124,2
10,60,0.1,5,0.745066,0.865866,3
22,60,0.25,5,0.742426,0.924039,4
19,80,0.25,2,0.739158,0.798139,5
9,40,0.1,5,0.738529,0.834627,6
21,40,0.25,5,0.738278,0.897985,7
18,60,0.25,2,0.732747,0.783368,8
33,40,0.5,5,0.727593,0.840661,9
20,20,0.25,5,0.722313,0.842893,10
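To compare the models side by side, the sketch below collects the rank-1 test and train scores from the result tables in this notebook (taking the first row where two candidates tie) and sorts them by test accuracy. The numbers are copied from the tables; Gaussian naive Bayes is omitted because no score was recorded for it.

```python
# (test accuracy, train accuracy) of the rank-1 row for each model
best = {
    "LinearSVC": (0.771716, 0.825356),
    "RandomForestClassifier": (0.762791, 0.999749),
    "GradientBoostingClassifier": (0.757134, 0.885069),
    "MultinomialNB": (0.730987, 0.752946),
    "BernoulliNB": (0.720302, 0.739590),
    "ComplementNB": (0.693526, 0.716867),
}
baseline_test = 0.19710873664362036  # test accuracy of the always-"italian" predictor

ranking = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
for name, (test_acc, train_acc) in ranking:
    gap = train_acc - test_acc  # a large gap hints at overfitting
    print(f"{name:28s} test={test_acc:.3f} train={train_acc:.3f} gap={gap:.3f} "
          f"vs_baseline=+{test_acc - baseline_test:.3f}")
```

The train-test gap column makes the overfitting of the random forest visible next to the much smaller gaps of the naive Bayes variants.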
