In [1]:
# Widen width of notebook
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

## Agenda

In the notebook ML Practice II:

we will have a quick recap on the ML Practice I:
1. Reading in the Kaggle data and adding features
2. Using a **`Pipeline`** for proper cross-validation
3. Combining **`GridSearchCV`** with **`Pipeline`**

The following are new items: 

4. Efficiently searching for tuning parameters using **`RandomizedSearchCV`**
5. Advanced Hyperparameter tuning using **`Hyperopt`**
6. Adding features to a document-term matrix (using SciPy)
7. Adding features to a document-term matrix (using **`FeatureUnion`**)
8. Ensembling models

# ML Practice Part 2

In [2]:
# for Python 2: use print only as a function
from __future__ import print_function

## Part 1: Reading in the Kaggle data and adding features

- Our goal is to predict the **cuisine** of a recipe, given its **ingredients**.
- **Feature engineering** is the process through which you create features that don't natively exist in the dataset.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# define a function that accepts a DataFrame and adds new features
def make_features(df):
    
    # number of ingredients
    df['num_ingredients'] = df.ingredients.apply(len)
    
    # mean length of ingredient names
    df['ingredient_length'] = df.ingredients.apply(lambda x: np.mean([len(item) for item in x]))
    
    # string representation of the ingredient list
    df['ingredients_str'] = df.ingredients.astype(str)
    
    return df

In [8]:
# create the same features in the training data and the new data
train = make_features(pd.read_json('../data/cuisine_data/train.json'))
new = make_features(pd.read_json('../data/cuisine_data/test.json'))

In [9]:
train.head()

Unnamed: 0,id,cuisine,ingredients,num_ingredients,ingredient_length,ingredients_str
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes...",9,12.0,"['romaine lettuce', 'black olives', 'grape tom..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g...",11,10.090909,"['plain flour', 'ground pepper', 'salt', 'toma..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g...",12,10.333333,"['eggs', 'pepper', 'salt', 'mayonaise', 'cooki..."
3,22213,indian,"[water, vegetable oil, wheat, salt]",4,6.75,"['water', 'vegetable oil', 'wheat', 'salt']"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe...",20,10.1,"['black pepper', 'shallots', 'cornflour', 'cay..."


In [10]:
train.shape

(39774, 6)

In [11]:
new.head()

Unnamed: 0,id,ingredients,num_ingredients,ingredient_length,ingredients_str
0,18009,"[baking powder, eggs, all-purpose flour, raisi...",6,9.333333,"['baking powder', 'eggs', 'all-purpose flour',..."
1,28583,"[sugar, egg yolks, corn starch, cream of tarta...",11,10.272727,"['sugar', 'egg yolks', 'corn starch', 'cream o..."
2,41580,"[sausage links, fennel bulb, fronds, olive oil...",6,9.666667,"['sausage links', 'fennel bulb', 'fronds', 'ol..."
3,29752,"[meat cuts, file powder, smoked sausage, okra,...",21,12.0,"['meat cuts', 'file powder', 'smoked sausage',..."
4,35687,"[ground black pepper, salt, sausage casings, l...",8,13.0,"['ground black pepper', 'salt', 'sausage casin..."


In [12]:
new.shape

(9944, 5)

## Part 2: Using a `Pipeline` for proper cross-validation

In [13]:
# define X and y
X = train.ingredients_str
y = train.cuisine

In [14]:
# X is just a Series of strings
X.head()

0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object

In [15]:
# replace the regex pattern that is used for tokenization
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"'([a-z ]+)'")

In [16]:
# import and instantiate Multinomial Naive Bayes (with the default parameters)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

[make_pipeline documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)

In [17]:
# create a pipeline of vectorization and Naive Bayes
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vect, nb)

In [18]:
# examine the pipeline steps
pipe.steps

[('countvectorizer', CountVectorizer(token_pattern="'([a-z ]+)'")),
 ('multinomialnb', MultinomialNB())]

**Proper cross-validation:**

- By passing our pipeline to **`cross_val_score`**, features will be created from **`X`** (via **`CountVectorizer`**) within each fold of cross-validation.
- This process simulates the real world, in which your out-of-sample data will contain **features that were not seen** during model training.

In [19]:
# cross-validate the entire pipeline
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7323126392849393

## Part 3: Combining `GridSearchCV` with `Pipeline`

- We use **`GridSearchCV`** to locate optimal tuning parameters by performing an "exhaustive grid search" of different parameter combinations, searching for the combination that has the best cross-validated accuracy.
- By passing a **`Pipeline`** to **`GridSearchCV`** (instead of just a model), we can search tuning parameters for both the vectorizer and the model.

In [20]:
# pipeline steps are automatically assigned names by make_pipeline
pipe.named_steps.keys()

dict_keys(['countvectorizer', 'multinomialnb'])

In [21]:
# create a grid of parameters to search (and specify the pipeline step along with the parameter)
param_grid = {}
param_grid['countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['multinomialnb__alpha'] = [0.5, 1]
param_grid

{'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"],
 'multinomialnb__alpha': [0.5, 1]}

[GridSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html)

In [22]:
# pass the pipeline (instead of the model) to GridSearchCV
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')

In [23]:
# time the grid search
%time grid.fit(X, y)

CPU times: user 7.36 s, sys: 103 ms, total: 7.46 s
Wall time: 7.46 s


In [24]:
# print the single best score and parameters that produced that score
print(grid.best_score_)
print(grid.best_params_)

0.7476492724428822
{'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.5}


## Part 4: Efficiently searching for tuning parameters using `RandomizedSearchCV`

- When there are many parameters to tune, searching all possible combinations of parameter values may be **computationally infeasible**.
- **`RandomizedSearchCV`** searches a sample of the parameter values, and you control the computational "budget".

[RandomizedSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html)

In [25]:
from sklearn.model_selection import RandomizedSearchCV

[scipy.stats documentation](http://docs.scipy.org/doc/scipy/reference/stats.html)

In [26]:
# for any continuous parameters, specify a distribution instead of a list of options
import scipy as sp
param_grid = {}
param_grid['countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['countvectorizer__min_df'] = [1, 2, 3]
param_grid['multinomialnb__alpha'] = sp.stats.uniform(scale=1)
param_grid

{'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"],
 'countvectorizer__min_df': [1, 2, 3],
 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_continuous_frozen at 0x28c981850>}

In [27]:
# set a random seed for sp.stats.uniform
np.random.seed(1)

In [28]:
# additional parameters are n_iter (number of searches) and random_state
rand = RandomizedSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_iter=5, random_state=1)

In [29]:
# time the randomized search
%time rand.fit(X, y)

CPU times: user 9.4 s, sys: 137 ms, total: 9.54 s
Wall time: 9.54 s


In [30]:
print(rand.best_score_)
print(rand.best_params_)

0.7452356138936534
{'countvectorizer__min_df': 2, 'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.7203244934421581}


### Making predictions for new data

In [31]:
# define X_new as the ingredient text
X_new = new.ingredients_str

In [32]:
# print the best model found by RandomizedSearchCV
rand.best_estimator_

In [33]:
# RandomizedSearchCV/GridSearchCV automatically refit the best model with the entire dataset, and can be used to make predictions
new_pred_class_rand = rand.predict(X_new)
new_pred_class_rand

array(['southern_us', 'southern_us', 'italian', ..., 'italian',
       'southern_us', 'mexican'], dtype='<U12')

In [34]:
# create a submission file (score: 0.75342)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class_rand}).set_index('id').to_csv('sub3.csv')

## Part 5: Tuning Hyperparameters via Bayesian Optimization (using Hyperopt)

- Different from grid-search and random-search, bayesian optimization aims to limit evals of the objective function by spending more time chossing the next value to try. 
- Define a probability model of P(loss|input parameters), which can be a surrogate function. 
- Select the next parameters values by applying a criteria (Expected Improvement) to the surrogate function.
- Why we call it bayesian? Updating a model based on new evidence and the probability model is updated to incorporate the latest information.

In [35]:
#use pip from jupyter notebook to install package
#!pip install seaborn
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from time import time
"""
all different parameter spaces should be joined in one tuple or dictionary. hyperpot define several spaces as:

1. hp.choice(name_arguments, options):    label is a string input which refers to the hyperparameter
                                 options will contain a list, one element will be returned from the list for that particular label.
2. hp.uniform(label, low, high):  Again the label will contain the string referring to hyperparameter 
                                  and returns a value uniformly between low and high. 
                                  And when optimizing, this variable is constrained to a two-sided interval

"""
param_hyperopt= {
    'countvectorizer__token_pattern': hp.choice('countvectorizer__token_pattern', [r"\b\w\w+\b", r"'([a-z ]+)'"]),
    'countvectorizer__min_df':        hp.choice('countvectorizer__min_df', np.arange(1, 5,1, dtype=int)), 
    'multinomialnb__alpha':           hp.uniform('multinomialnb__alpha', 0.0, 1.0)                                   
}

In [39]:
def hyperopt(param_space, X_train, y_train, num_eval):
    
    start = time()
    
    # defin the object function
    def objective_function(params):
        """since pipe object is not callable, we can not use pipe(**params) """
        clf = pipe.set_params(**params) ### since pipelien object is not callable
        score = cross_val_score(clf, X_train, y_train, cv=5).mean()
        return {'loss': -score, 'status': STATUS_OK}

    trials = Trials()
    """fmin function will iterate on differnt sets of algoritms and their hyperparmeters
        and return the set on which loss is minimum.
    """
    best_param = fmin(objective_function, 
                      param_space, 
                      algo=tpe.suggest, # you can change other algorithms such as GP,
                      max_evals=num_eval, 
                      trials=trials,
                      rstate= np.random.default_rng(42))
    loss = [x['result']['loss'] for x in trials.trials]
    
    best_param_values = [x for x in best_param.values()]
    
    if best_param_values[0] == 0:
        token_type = r"\b\w\w+\b"
    else:
        token_type = r"'([a-z ]+)'"
    
    clf_best = pipe.set_params(countvectorizer__token_pattern=token_type,
                    countvectorizer__min_df=int(best_param_values[1]),
                    multinomialnb__alpha=float(best_param_values[2]))
                                  
    clf_best.fit(X_train, y_train)
    
    print("")
    print("##### Results")
    print("Score best parameters: ", min(loss)*-1)
    print("Best parameters: ", best_param)
    print("Time elapsed: ", time() - start)
    print("Parameter combinations evaluated: ", num_eval)
    
    return trials

In [40]:
num_eval = 5

In [41]:
results_hyperopt = hyperopt(param_hyperopt, X, y,num_eval)

100%|██████████| 5/5 [00:09<00:00,  1.90s/trial, best loss: -0.7476995552838627]

##### Results
Score best parameters:  0.7476995552838627
Best parameters:  {'countvectorizer__min_df': 0, 'countvectorizer__token_pattern': 1, 'multinomialnb__alpha': 0.5029037546614818}
Time elapsed:  9.967747926712036
Parameter combinations evaluated:  5


## Part 6: Adding features to a document-term matrix (using SciPy)

- We can call it data fusion or feature-level ensemble
- So far, we've trained models on either the **document-term matrix** or the **manually created features**, but not both.
- To train a model on both types of features, we need to **combine them into a single feature matrix**.
- Because one of the matrices is **sparse** and the other is **dense**, the easiest way to combine them is by using SciPy.

In [42]:
# create a document-term matrix from all of the training data
X_dtm = vect.fit_transform(X)
X_dtm.shape

(39774, 3010)

In [43]:
type(X_dtm)

scipy.sparse._csr.csr_matrix

[scipy.sparse documentation](http://docs.scipy.org/doc/scipy/reference/sparse.html)

In [44]:
# create a DataFrame of the manually created features
X_manual = train.loc[:, ['num_ingredients', 'ingredient_length']]
X_manual.shape

(39774, 2)

In [45]:
# create a sparse matrix from the DataFrame
X_manual_sparse = sp.sparse.csr_matrix(X_manual)
type(X_manual_sparse)

scipy.sparse._csr.csr_matrix

In [46]:
# combine the two sparse matrices
X_dtm_manual = sp.sparse.hstack([X_dtm, X_manual_sparse])
X_dtm_manual.shape

(39774, 3012)

- This was a relatively easy process.
- However, it does not allow us to do **proper cross-validation**, and it doesn't integrate well with the rest of the **scikit-learn workflow**.

## Part 7: Adding features to a document-term matrix (using `FeatureUnion`)

- Below is an alternative process that does allow for proper cross-validation, and does integrate well with the scikit-learn workflow.
- To use this process, we have to learn about transformers, **`FunctionTransformer`**, and **`FeatureUnion`**.

### What are "transformers"?

Transformer objects provide a `transform` method in order to perform **data transformations**. Here are a few examples:

- **`CountVectorizer`**
    - `fit` learns the vocabulary
    - `transform` creates a document-term matrix using the vocabulary
- **`Imputer`**
    - `fit` learns the value to impute
    - `transform` fills in missing entries using the imputation value
- **`StandardScaler`**
    - `fit` learns the mean and scale of each feature
    - `transform` standardizes the features using the mean and scale
- **`HashingVectorizer`**
    - `fit` is not used, and thus it is known as a "stateless" transformer
    - `transform` creates the document-term matrix using a hash of the token

### Converting a function into a transformer

In [47]:
# define a function that accepts a DataFrame returns the manually created features
def get_manual(df):
    return df.loc[:, ['num_ingredients', 'ingredient_length']]

In [48]:
get_manual(train).head()

Unnamed: 0,num_ingredients,ingredient_length
0,9,12.0
1,11,10.090909
2,12,10.333333
3,4,6.75
4,20,10.1


[FunctionTransformer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) (new in 0.17)

In [49]:
from sklearn.preprocessing import FunctionTransformer

In [50]:
# create a stateless transformer from the get_manual function
get_manual_ft = FunctionTransformer(get_manual, validate=False)
type(get_manual_ft)

sklearn.preprocessing._function_transformer.FunctionTransformer

In [51]:
# execute the function using the transform method
get_manual_ft.transform(train).head()

Unnamed: 0,num_ingredients,ingredient_length
0,9,12.0
1,11,10.090909
2,12,10.333333
3,4,6.75
4,20,10.1


In [52]:
# define a function that accepts a DataFrame returns the ingredients string
def get_text(df):
    return df.ingredients_str

In [53]:
# create and test another transformer
get_text_ft = FunctionTransformer(get_text, validate=False)
get_text_ft.transform(train).head()

0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object

### Combining feature extraction steps

- **`FeatureUnion`** applies a list of transformers in parallel to the input data (not sequentially), then **concatenates the results**.
- This is useful for combining several feature extraction mechanisms into a single transformer.

![Pipeline versus FeatureUnion](../notebook_imgs/pipeline_versus_featureunion.jpg)

[make_union documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_union.html)

In [54]:
from sklearn.pipeline import make_union

In [55]:
# create a document-term matrix from all of the training data
X_dtm = vect.fit_transform(X)
X_dtm.shape

(39774, 3010)

In [56]:
# this is identical to a FeatureUnion with just one transformer
union = make_union(vect)
X_dtm = union.fit_transform(X)
X_dtm.shape

(39774, 3010)

In [57]:
# try to add a second transformer to the Feature Union (what's wrong with this?)
# union = make_union(vect, get_manual_ft)
# X_dtm_manual = union.fit_transform(X)

In [58]:
# properly combine the transformers into a FeatureUnion
union = make_union(make_pipeline(get_text_ft, vect), get_manual_ft)
X_dtm_manual = union.fit_transform(train)
X_dtm_manual.shape

(39774, 3012)

![Pipeline in a FeatureUnion](../notebook_imgs/pipeline_in_a_featureunion.jpg)

### Cross-validation

In [59]:
# slightly improper cross-validation
cross_val_score(nb, X_dtm_manual, y, cv=5, scoring='accuracy').mean()

0.7257003002967882

In [60]:
# create a pipeline of the FeatureUnion and Naive Bayes
pipe = make_pipeline(union, nb)

In [61]:
# properly cross-validate the entire pipeline (and pass it the entire DataFrame)
cross_val_score(pipe, train, y, cv=5, scoring='accuracy').mean()

0.7261779872861032

### Alternative way to specify `Pipeline` and `FeatureUnion`

In [62]:
# reminder of how we created the pipeline
union = make_union(make_pipeline(get_text_ft, vect), get_manual_ft)
pipe = make_pipeline(union, nb)

[Pipeline documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [FeatureUnion documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)

In [63]:
# duplicate the pipeline structure without using make_pipeline or make_union
from sklearn.pipeline import Pipeline, FeatureUnion
pipe = Pipeline([
    ('featureunion', FeatureUnion([
            ('pipeline', Pipeline([
                    ('functiontransformer', get_text_ft),
                    ('countvectorizer', vect)
                    ])),
            ('functiontransformer', get_manual_ft)
        ])),
    ('multinomialnb', nb)
])

### Grid search of a nested `Pipeline`

In [64]:
# examine the pipeline steps
pipe.steps

[('featureunion',
  FeatureUnion(transformer_list=[('pipeline',
                                  Pipeline(steps=[('functiontransformer',
                                                   FunctionTransformer(func=<function get_text at 0x179c75550>)),
                                                  ('countvectorizer',
                                                   CountVectorizer(token_pattern='\\b\\w\\w+\\b'))])),
                                 ('functiontransformer',
                                  FunctionTransformer(func=<function get_manual at 0x179c753a0>))])),
 ('multinomialnb', MultinomialNB(alpha=0.5029037546614818))]

In [65]:
# create a grid of parameters to search (and specify the pipeline step along with the parameter)
param_grid = {}
param_grid['featureunion__pipeline__countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['multinomialnb__alpha'] = [0.5, 1]
param_grid

{'featureunion__pipeline__countvectorizer__token_pattern': ['\\b\\w\\w+\\b',
  "'([a-z ]+)'"],
 'multinomialnb__alpha': [0.5, 1]}

In [66]:
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')

In [67]:
%time grid.fit(train, y)

CPU times: user 7.74 s, sys: 173 ms, total: 7.91 s
Wall time: 7.92 s


In [68]:
print(grid.best_score_)
print(grid.best_params_)

0.7426710530869912
{'featureunion__pipeline__countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.5}


## Part 8: Ensembling models

Rather than combining features into a single feature matrix and training a single model, we can instead create separate models and "ensemble" them.

### What is ensembling?

Ensemble learning (or "ensembling") is the process of combining several predictive models in order to produce a combined model that is **better than any individual model**.

- **Regression:** average the predictions made by the individual models
- **Classification:** let the models "vote" and use the most common prediction, or average the predicted probabilities

For ensembling to work well, the models must have the following characteristics:

- **Accurate:** they outperform the null model
- **Independent:** their predictions are generated using different "processes", such as:
    - different types of models
    - different features
    - different tuning parameters

**The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.

**Note:** There are also models that have built-in ensembling, such as Random Forests.

### Model 1: KNN model using only manually created features

In [69]:
# define X and y
feature_cols = ['num_ingredients', 'ingredient_length']
X = train[feature_cols]
y = train.cuisine

In [70]:
# use KNN with K=800
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=800)

In [71]:
# train KNN on all of the training data
knn.fit(X, y)

In [72]:
# define X_new as the manually created features
X_new = new[feature_cols]

In [73]:
# calculate predicted probabilities of class membership for the new data
new_pred_prob_knn = knn.predict_proba(X_new)
new_pred_prob_knn.shape

(9944, 20)

In [74]:
# print predicted probabilities for the first row only
new_pred_prob_knn[0, :]

array([0.02625, 0.02625, 0.015  , 0.04375, 0.035  , 0.08   , 0.0175 ,
       0.075  , 0.0275 , 0.13125, 0.0125 , 0.0775 , 0.01875, 0.165  ,
       0.00875, 0.0125 , 0.14875, 0.025  , 0.03   , 0.02375])

In [75]:
# display classes with probabilities
zip(knn.classes_, new_pred_prob_knn[0, :])

<zip at 0x17a0b8480>

In [76]:
# predicted probabilities will sum to 1 for each row
new_pred_prob_knn[0, :].sum()

1.0

### Model 2: Naive Bayes model using only text features

In [77]:
# print the best model found by RandomizedSearchCV
rand.best_estimator_

In [78]:
# define X_new as the ingredient text
X_new = new.ingredients_str

In [79]:
# calculate predicted probabilities of class membership for the new data
new_pred_prob_rand = rand.predict_proba(X_new)
new_pred_prob_rand.shape

(9944, 20)

In [80]:
# print predicted probabilities for the first row only
new_pred_prob_rand[0, :]

array([3.59476986e-04, 4.04209227e-01, 7.38375500e-05, 1.29657196e-04,
       3.00331358e-03, 2.09215451e-03, 4.82924358e-04, 5.35343905e-04,
       1.22359513e-01, 7.08319855e-03, 2.06222706e-04, 7.18742744e-04,
       5.49904762e-06, 1.64352345e-03, 1.01157435e-05, 1.71202022e-02,
       4.39643190e-01, 3.21357119e-04, 1.84691815e-06, 6.53184216e-07])

### Ensembling models 1 and 2

In [81]:
# calculate the mean of the predicted probabilities for the first row
(new_pred_prob_knn[0, :] + new_pred_prob_rand[0, :]) / 2

array([0.01330474, 0.21522961, 0.00753692, 0.02193983, 0.01900166,
       0.04104608, 0.00899146, 0.03776767, 0.07492976, 0.0691666 ,
       0.00635311, 0.03910937, 0.00937775, 0.08332176, 0.00438006,
       0.0148101 , 0.2941966 , 0.01266068, 0.01500092, 0.01187533])

In [82]:
# calculate the mean of the predicted probabilities for all rows
new_pred_prob = pd.DataFrame((new_pred_prob_knn + new_pred_prob_rand) / 2, columns=knn.classes_)
new_pred_prob.head()

Unnamed: 0,brazilian,british,cajun_creole,chinese,filipino,french,greek,indian,irish,italian,jamaican,japanese,korean,mexican,moroccan,russian,southern_us,spanish,thai,vietnamese
0,0.013305,0.21523,0.007537,0.02194,0.019002,0.041046,0.008991,0.037768,0.07493,0.069167,0.006353,0.039109,0.009378,0.083322,0.00438,0.01481,0.294197,0.012661,0.015001,0.011875
1,0.008127,0.011794,0.0175,0.045625,0.018768,0.024328,0.0175,0.0475,0.010005,0.069375,0.005626,0.026253,0.020625,0.0675,0.0075,0.008751,0.546973,0.0075,0.025625,0.013125
2,0.013627,0.008822,0.007844,0.02,0.015059,0.04508,0.011783,0.030008,0.013628,0.449318,0.005651,0.03876,0.007505,0.080644,0.024697,0.008998,0.078138,0.112301,0.015626,0.01251
3,0.003125,0.004375,0.533125,0.039375,0.001875,0.0225,0.00625,0.075625,0.00125,0.051875,0.01125,0.008125,0.003125,0.10875,0.029375,0.001875,0.025,0.0075,0.0375,0.028125
4,0.001877,0.009598,0.020083,0.020625,0.003751,0.044735,0.017502,0.01375,0.012539,0.641456,0.003754,0.008125,0.003125,0.08438,0.004377,0.003133,0.071484,0.017581,0.015,0.003125


In [83]:
# for each row, find the column with the highest predicted probability
new_pred_class = new_pred_prob.apply(np.argmax, axis=1)
new_pred_class.head()

0    16
1    16
2     9
3     2
4     9
dtype: int64

In [84]:
# create a submission file (score: 0.75241)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class}).set_index('id').to_csv('sub4.csv')

In [85]:
hh = pd.DataFrame({'id':new.id, 'cuisine':new_pred_class}).set_index('id')

In [86]:
hh.head()

Unnamed: 0_level_0,cuisine
id,Unnamed: 1_level_1
18009,16
28583,16
41580,9
29752,2
35687,9


**Note:** [VotingClassifier](http://scikit-learn.org/stable/modules/ensemble.html#votingclassifier) (new in 0.17) makes it easier to ensemble classifiers, though it is limited to the case in which all of the classifiers are fit to the same data.
**Note:** A deatiled jupyter notebook about Ensemble Learning and Random Forests could be found [here](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb).