In the following notebook, I will try to predict the **major type** (`Type 1` column in this dataset) 
of Pokemon given various features (more about this in what follows).  

Before you start reading this notebook, I highly recommend checking a previous [EDA notebook](https://www.kaggle.com/yassinealouini/pokemon-eda) where I explore the dataset in more detail. 

Enjoy!

In [None]:
# There are a lot of warnings about CV not having enough data for each fold.
# TODO: Find a better way to deal with the warnings
import warnings
warnings.filterwarnings("ignore")

# Same old imports
import numpy as np
import pandas as pd
import os
import pandas_profiling as pdp
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import matplotlib.pylab as plt
from hyperopt import hp, tpe, Trials
from hyperopt.fmin import fmin
from tqdm import tqdm
import itertools

# Models
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from tpot import TPOTClassifier



# Load the data and quick exploration

In [None]:
# Some constants
DATA_PATH = "../input/Pokemon.csv"
TARGET_COL = "Type 1"
ENCODED_TARGET_COL = "encoded_type_1"
TO_DROP_COLS = ["#", "Name"]
# The dataset is small
TEST_RATIO = 0.1
# For reproducibility
SEED = 31415
RUN_HP_OPTIMIZATION = False
# Reduce this if needed! (resources are scarce here!)
MAX_EVALS = 200
HP_SPACE = {
    # Trying to reduce class imbalance
    'max_delta_step': 2, 
    # To avoid overfitting
    'reg_alpha': hp.loguniform('reg_alpha', np.log(0.01), np.log(1)), 
    'reg_lambda': hp.loguniform('reg_lambda', np.log(0.01), np.log(1)), 
    'n_estimators': hp.quniform('n_estimators', 100, 1000, 1),
    'max_depth': hp.quniform('max_depth', 2, 8, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0),
    'gamma': hp.loguniform('gamma', np.log(0.01), np.log(1)),
}
# Optimal hp from previous run
OPTIMAL_HP = {'colsample_bytree': 0.7316836664311229, 'gamma': 0.04744535212276833, 
              'learning_rate': 0.02478735341127185, 'max_depth': 5.0, 'n_estimators': 349.0, 
              'reg_alpha': 0.03216806358838591, 'reg_lambda': 0.019055394071559602}
# Tpot conf values: increase these for more runs (and hopefully better results)
TPOT_GENERATION = 20
TPOT_POPULATION_SIZE = 100

In [None]:
# Some useful functions 


# Inspired from here: http://scikit-learn.org/stable/auto_examples/model_selection/
# plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.set_title(title)
    fig.colorbar(im)
    tick_marks = np.arange(len(classes))
    ax.set_xticks(tick_marks)
    ax.set_xticklabels(classes, rotation=45)
    ax.set_yticks(tick_marks)
    ax.set_yticklabels(classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    fig.tight_layout()
    ax.set_ylabel('True Type 1')
    ax.set_xlabel('Predicted Type 1')
    ax.grid(False)
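
To make the `normalize=True` option more concrete, here is a minimal sketch of the row normalization on a made-up 3-class confusion matrix (the numbers are hypothetical, not results from this notebook). Each row is divided by its sum, so the diagonal then reads as the per-class recall.

```python
# Hypothetical 3-class confusion matrix: rows are true classes, columns are predictions
toy_cm = [
    [8, 2, 0],
    [1, 3, 1],
    [0, 0, 5],
]

# Divide each row by its sum so that every row adds up to 1
normalized_cm = [[count / sum(row) for count in row] for row in toy_cm]

for row in normalized_cm:
    print(["{:.2f}".format(value) for value in row])
```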

In [None]:
pokemon_df = pd.read_csv(DATA_PATH)
pokemon_df.sample(5)

Notice that the `#` and `Name` columns aren't useful for predicting the major type, so they will be dropped (these are the `TO_DROP_COLS`).  

In [None]:
pokemon_df.dtypes

In [None]:
pdp.ProfileReport(pokemon_df)

As mentioned in the beginning, I will predict the major type (this is the `Type 1` column). 
Let's explore the target to start. 

In [None]:
target_s = pokemon_df['Type 1']
# A bare string in the middle of a cell is not displayed, so print it
print("There are {} unique major types".format(target_s.nunique()))
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
target_s.value_counts().plot(kind='bar', ax=ax)
ax.set_ylabel('Number')
ax.set_xlabel("Pokemons' Type 1")

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
target_s.value_counts(normalize=True).mul(100).plot(kind='bar', ax=ax)
ax.set_ylabel('%')
ax.set_xlabel("Pokemons' Type 1")

Based on the target's histograms: 

1. This is a **multi-class** (**18** major types) **classification** (categorical target) problem.
2. This is an **unbalanced** problem. Indeed, some types (fairy and flying) are much less common than the other ones.

Notice that some major types (check the EDA notebook) aren't present in all the generations: flying, dark, and steel types don't appear in every one of the six generations. 

Thus some **feature engineering** based on the `Generation` column might be useful. 

Let's **dummify** (i.e. transform categorical columns into boolean ones) the target, the `Type 2` and `Generation` columns. 

For that I use pandas [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function. Also, since not every Pokemon has a `Type 2`, I have filled the missing values with the "missing" type before dummifying. Notice also that I have used a `LabelEncoder` for the target col (since the target contains strings). 

Finally, I drop the `TARGET_COL` and `TO_DROP_COLS` (i.e. `Name` and `#`) columns from the features. 
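
As a rough illustration of the difference between the two encodings, here is a minimal pure-Python sketch on a made-up list of types. It only mimics what `LabelEncoder` and `get_dummies` do conceptually (integer codes over the sorted classes versus one 0/1 column per class); the toy values and variable names are assumptions made for the example.

```python
# Toy target values (hypothetical, not taken from the dataset)
toy_types = ["Water", "Fire", "Grass", "Fire"]

# Label encoding: map each class to an integer, with classes sorted alphabetically
classes = sorted(set(toy_types))
class_to_code = {name: code for code, name in enumerate(classes)}
label_encoded = [class_to_code[name] for name in toy_types]

# One-hot ("dummy") encoding: one 0/1 column per class
one_hot = [[int(name == cls) for cls in classes] for name in toy_types]

print(classes)        # ['Fire', 'Grass', 'Water']
print(label_encoded)  # [2, 0, 1, 0]
print(one_hot)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

The integer codes are compact and suit a target column, whereas the one-hot columns avoid implying an order between categories, which is usually preferable for input features.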

In [None]:
le = LabelEncoder()
encoded_target_s = pd.Series(le.fit_transform(target_s), name=ENCODED_TARGET_COL)
dummified_target_s = pd.get_dummies(target_s)
dummified_features_df = (pokemon_df.drop(TO_DROP_COLS + [TARGET_COL], axis=1)
                                   .assign(Generation=lambda df: df.Generation.astype(str))
                                   .assign(**{"Legendary": lambda df: df["Legendary"].astype(int), 
                                              "Type 2": lambda df: df["Type 2"].fillna("missing")})
                                   .pipe(pd.get_dummies))
features_and_targets_df = pd.concat([encoded_target_s, dummified_features_df], axis=1)

In [None]:
encoded_target_s.sample(5)

In [None]:
le.inverse_transform(encoded_target_s.sample(5))

In [None]:
dummified_target_s.sample(5)

In [None]:
dummified_features_df.sample(5)

To end this preparation phase, let's see if there are any **correlations**

In [None]:
# Inspired from this: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

corr_df = pd.concat([dummified_features_df, dummified_target_s], axis=1).corr()


# `np.bool` was removed from recent NumPy versions; the builtin `bool` works everywhere
mask = np.zeros_like(corr_df, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_df, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})



Some correlations: 

* **Steel** major types tend to be positively correlated with **Defense** and **Psychic** with **Special Attack**. 
* **Ghost** major types tend to be positively correlated with a Type 2 of **Grass** and **Grass** with **Poison**. 
* **Fairy** major types tend to be positively correlated with the **Generation** 6. 
* **Dragon** major types tend to be positively correlated with **Attack** and **Total**. 

These observations aren't surprising to any true Pokemon connoisseur but are, nonetheless, reassuring to find using the data. 

## Train and test split

I will split the features and targets into train and test datasets. 

The test dataset will only be used at the end to evaluate the various trained models (you should do this as well whenever you train an ML model). Next, I will use cross validation to train and evaluate the model using the train dataset. 
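
Because the classes are unbalanced, the split is stratified on the target. The sketch below shows the idea on made-up labels using only the standard library: each class is split separately so that the test set keeps roughly the same class proportions. It is an illustration under these assumptions, not scikit-learn's actual implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_ratio, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Send the same fraction of every class to the test set
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical unbalanced labels: 70 "Water" and 30 "Fire"
toy_labels = ["Water"] * 70 + ["Fire"] * 30
train_idx, test_idx = stratified_split(toy_labels, test_ratio=0.1, seed=31415)

print(Counter(toy_labels[i] for i in test_idx))  # 7 "Water" and 3 "Fire"
print(len(train_idx), len(test_idx))             # 90 10
```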

In [None]:
train_df, test_df = train_test_split(features_and_targets_df, 
                                     stratify=encoded_target_s, 
                                     test_size=TEST_RATIO, random_state=SEED)

In [None]:
train_df.head(1).T

# Evaluation metric

Alright, now that the features have been prepared and split, it is time to pick an evaluation metric. 

Since this is an **unbalanced multi-class classification** problem, I will be using the [**F1 score**](https://en.wikipedia.org/wiki/F1_score) with **weighted** average: the F1 score is computed for each class, then averaged with weights equal to the number of true instances of each class.

Check the sklearn documentation for more details [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). In what follows, I have provided two usage examples of the `f1_score` function (unbalanced and balanced classes). 

In [None]:
# With unbalanced classes, the three variations of the F1 score differ

true_classes = ["a", "b", "c", "a", "c", "c"]
predicted_classes = ["a", "b", "c", "c", "c", "c"]

print("Unbalanced: ")
print("Weighted F1 score:", f1_score(true_classes, predicted_classes, average="weighted"))
print("Micro F1 score:", f1_score(true_classes, predicted_classes, average="micro"))
print("Macro F1 score:", f1_score(true_classes, predicted_classes, average="macro"))

# With balanced classes and perfect predictions, the three variations of the F1 score are the same

true_classes = ["a", "b", "c", "a", "b", "c"]
predicted_classes = ["a", "b", "c", "a", "b", "c"]
print(32 * "-")

print("Balanced: ")
print("Weighted F1 score:", f1_score(true_classes, predicted_classes, average="weighted"))
print("Micro F1 score:", f1_score(true_classes, predicted_classes, average="micro"))
print("Macro F1 score:", f1_score(true_classes, predicted_classes, average="macro"))
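
To see where the three averages come from, here is a minimal pure-Python sketch that recomputes them by hand for the unbalanced example, assuming the usual definitions (per-class F1 from true positives, false positives and false negatives; weighted by class support, plain mean for macro, global counts for micro). The helper `f1_scores` is hypothetical and only meant as an illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (weighted, micro, macro) F1 computed from per-class counts."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class_f1 = {}
    total_tp = 0
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class_f1[cls] = 2 * tp / denom if denom else 0.0
        total_tp += tp
    n = len(y_true)
    # Weighted: per-class F1 averaged with the true class counts as weights
    weighted = sum(per_class_f1[c] * support[c] for c in classes) / n
    # Macro: plain mean of the per-class F1 scores
    macro = sum(per_class_f1.values()) / len(classes)
    # Micro: for single-label problems this equals the accuracy
    micro = total_tp / n
    return weighted, micro, macro


true_classes = ["a", "b", "c", "a", "c", "c"]
predicted_classes = ["a", "b", "c", "c", "c", "c"]
weighted, micro, macro = f1_scores(true_classes, predicted_classes)
print(round(weighted, 3), round(micro, 3), round(macro, 3))  # 0.817 0.833 0.841
```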

# Baseline model

As with any ML problem, one usually starts by establishing a baseline, i.e. a score/error that one aims at improving. 
Why is that important? Well, without a baseline, it is hard to tell if one is making progress or not. Moreover, some problems are much easier than others: a very high accuracy might look impressive
but is much less impressive once you see that a very simple model reaches a similar accuracy. 

As a baseline, let's use a logistic regression model.
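
An even simpler reference point than a fitted model is to always predict the most frequent class. This is not the baseline used in this notebook, just a small sketch on made-up labels showing how high a score one can get "for free" on unbalanced data.

```python
from collections import Counter

# Hypothetical unbalanced labels, not taken from the dataset
toy_targets = ["Water"] * 6 + ["Normal"] * 3 + ["Flying"] * 1

# The "model" ignores the features and always predicts the most frequent class
majority_class, majority_count = Counter(toy_targets).most_common(1)[0]
predictions = [majority_class] * len(toy_targets)

accuracy = sum(t == p for t, p in zip(toy_targets, predictions)) / len(toy_targets)
print(majority_class, accuracy)  # Water 0.6
```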


In [None]:
train_features_df = train_df.drop(ENCODED_TARGET_COL, axis=1)
train_target_s = train_df[ENCODED_TARGET_COL]

In [None]:
def improvement_in_percent(model_score, baseline_score):
    # The builtin round works for both Python floats and NumPy floats
    return round(100 * (model_score - baseline_score) / baseline_score, 3)

In [None]:
lr = LogisticRegression(random_state=SEED)
lr_scores = cross_val_score(lr, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
print("Logistic regression mean and std scores are: ({}, {})".format(lr_scores.mean(), lr_scores.std()))
lr.fit(train_features_df, train_target_s)

# Simple XGBoost

Now that a baseline score has been found, let's try to improve it. 

In [None]:
xgb_clf = XGBClassifier(random_state=SEED)
xgb_clf_scores = cross_val_score(xgb_clf, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
"Simple XGBoost classification mean and std scores are: ({}, {})".format(xgb_clf_scores.mean(), xgb_clf_scores.std())

A "simple" (no hyperparameter tuning) XGBoost classifier does better than the baseline. 

In [None]:
"This is a {} % improvement".format(improvement_in_percent(xgb_clf_scores.mean(), lr_scores.mean()))

Could we do better?

## Tuning the XGBoost classifier

Let's try to vary the hyperparameters for the XGBoost classifier and see what we get.

In [None]:
# More trees
clf = XGBClassifier(random_state=SEED, n_estimators=1000)
clf_scores = cross_val_score(clf, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
print("Alternative XGBoost classification mean and std scores are: ({}, {})".format(clf_scores.mean(), clf_scores.std()))
print("This is a {} % improvement".format(improvement_in_percent(clf_scores.mean(), lr_scores.mean())))

In [None]:
# Smaller learning rate
clf = XGBClassifier(random_state=SEED, learning_rate=0.01)
clf_scores = cross_val_score(clf, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
print("Alternative XGBoost classification mean and std scores are: ({}, {})".format(clf_scores.mean(), clf_scores.std()))
print("This is a {} % improvement".format(improvement_in_percent(clf_scores.mean(), lr_scores.mean())))

As you can see, trying different hyperparameter values manually would be tedious. Is there a better way?
Fortunately, there is (at least) one method: using an automatic hyperparameter optimization tool. 

One of these is [**hyperopt**](https://github.com/hyperopt/hyperopt).
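
To give a feel for what an automated search does, here is a minimal standard-library sketch. It uses plain random search rather than hyperopt's TPE algorithm, and `toy_score` is a hypothetical stand-in for a cross-validation score. The learning rate is drawn log-uniformly between 0.01 and 1 (the same idea as `hp.loguniform` with `np.log` bounds), and the score is negated because the search minimizes a loss.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_score(learning_rate):
    """Hypothetical stand-in for a CV score; it peaks at learning_rate = 0.1."""
    return 1.0 - (math.log10(learning_rate) + 1.0) ** 2


def random_search(n_evals, seed):
    rng = random.Random(seed)
    best_loss, best_lr = float("inf"), None
    for _ in range(n_evals):
        learning_rate = sample_loguniform(rng, 0.01, 1.0)
        # Minimizing the loss is the same as maximizing the score
        loss = -toy_score(learning_rate)
        if loss < best_loss:
            best_loss, best_lr = loss, learning_rate
    return best_lr, best_loss


best_lr, best_loss = random_search(n_evals=200, seed=31415)
print("Best learning rate: {:.3f}, loss: {:.3f}".format(best_lr, best_loss))
```

A tool like hyperopt follows the same loop (sample, evaluate, keep the best) but uses the past trials to decide where to sample next.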

# Hyperopt + XGBoost

In [None]:
class HPOptimizer(object):

    def __init__(self):
        # A progress bar to monitor the hyperopt optimization process
        self.pbar = tqdm(total=MAX_EVALS, desc="Hyperopt")
        self.trials = Trials()

    def objective(self, hyperparameters):
        # Cast the sampled values to the numeric types XGBoost expects
        # (hyperopt's quniform returns floats, even for integer-valued parameters)
        hyperparameters = {
            "max_delta_step": hyperparameters["max_delta_step"],
            "reg_alpha": round(hyperparameters["reg_alpha"], 3), 
            "reg_lambda": round(hyperparameters["reg_lambda"], 3), 
            "n_estimators": int(hyperparameters["n_estimators"]), 
            "max_depth": int(hyperparameters["max_depth"]),
            "learning_rate": round(hyperparameters["learning_rate"], 3), 
            "colsample_bytree": round(hyperparameters["colsample_bytree"], 3),
            "gamma": round(hyperparameters["gamma"], 3),
        }
        print("The current hyperparameters are: {}".format(hyperparameters))

        clf = XGBClassifier(
            n_jobs=4,
            **hyperparameters
        )

        scores = cross_val_score(clf, X=train_features_df, y=train_target_s, cv=5, 
                                 scoring="f1_weighted")
        print("Mean and std CV scores are: ({}, {})".format(scores.mean(), scores.std()))
        # Update the progress bar after each iteration
        self.pbar.update()
        # Since we are minimizing the objective => return -1 * mean(scores) (this is a loss)
        return -scores.mean()

    def run(self):
        if RUN_HP_OPTIMIZATION:
            # The objective and the trials live on the instance
            optimal_hp = fmin(fn=self.objective,
                              space=HP_SPACE,
                              algo=tpe.suggest,
                              trials=self.trials,
                              max_evals=MAX_EVALS)
        else:
            optimal_hp = OPTIMAL_HP
        self.optimal_hp = optimal_hp

In [None]:
hp_optimizer = HPOptimizer()
hp_optimizer.run()
optimal_hp = hp_optimizer.optimal_hp
print("The optimal hyperparameters are: {}".format(optimal_hp))

# Exploring the hyperopt trials

Let's explore the saved trials (these are handy to store hyperopt runs).

In [None]:
if RUN_HP_OPTIMIZATION:
    # The trials are stored on the optimizer instance
    trials = hp_optimizer.trials
    hyperparameters_df = pd.DataFrame(trials.idxs_vals[1])
    losses_df = pd.DataFrame(trials.results)
    hyperopt_trials_df = pd.concat([losses_df, hyperparameters_df], axis=1)

In [None]:
if RUN_HP_OPTIMIZATION:
    # Check that the hyperparameters of the lowest-loss trial match optimal_hp
    min_loss_index = losses_df['loss'].idxmin()
    assert (hyperparameters_df.loc[min_loss_index, :].to_dict() == optimal_hp)

In [None]:
def hp_vs_loss_scatterplot(hyperparameter):

    fig, ax = plt.subplots(1, 1, figsize=(12, 8))
    hyperopt_trials_df.plot(x=hyperparameter, y='loss', kind='scatter', ax=ax)
    best_coordinates = hyperopt_trials_df.loc[min_loss_index, [hyperparameter, "loss"]].values
    ax.annotate("Best {}: {}".format(hyperparameter, round(best_coordinates[0], 3)), 
                xy=best_coordinates, 
                color="red")

In [None]:
if RUN_HP_OPTIMIZATION:
    # Skip "max_delta_step" since it is fixed for now
    for hyperparameter in HP_SPACE:
        if hyperparameter != "max_delta_step":
            hp_vs_loss_scatterplot(hyperparameter)

# Train tuned XGBoost classifier model on train data and evaluate on test

Let's train our best XGBoost classifier (using the optimal hyperparameters) on the train dataset, then evaluate it on the test dataset.

In [None]:
parsed_optimal_hp = {
    # hyperopt returns floats, so cast the integer-valued parameters
    "n_estimators": int(optimal_hp["n_estimators"]), 
    "max_depth": int(optimal_hp["max_depth"]),
    "learning_rate": optimal_hp["learning_rate"], 
    # Keep these numeric: XGBoost expects floats, not formatted strings
    "colsample_bytree": round(optimal_hp['colsample_bytree'], 3),
    "gamma": round(optimal_hp['gamma'], 3),
}

best_xgb_clf =  XGBClassifier(random_state=SEED, **parsed_optimal_hp)
best_xgb_clf.fit(train_features_df, train_target_s)

# Random Forests

Let's try other models starting with a random forests classifier

In [None]:
rf_clf = RandomForestClassifier(random_state=SEED)
rf_clf_scores = cross_val_score(rf_clf, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
print("Simple random forests classification mean and std scores are: ({}, {})".format(rf_clf_scores.mean(), rf_clf_scores.std()))
print("This is a {} % improvement".format(improvement_in_percent(rf_clf_scores.mean(), lr_scores.mean())))

This isn't very promising for a start. It will probably need some hyperparameter tuning...

# Neural network

In [None]:
nn_clf = MLPClassifier(random_state=SEED)
nn_clf_scores = cross_val_score(nn_clf, X=train_features_df, y=train_target_s, cv=5, scoring="f1_weighted")
print("Simple classification neural network mean and std scores are: ({}, {})".format(nn_clf_scores.mean(), nn_clf_scores.std()))
print("This is a {} % improvement".format(improvement_in_percent(nn_clf_scores.mean(), lr_scores.mean())))

That's a better start. Let's see if one can improve things. 

# TPOT

In [None]:
TPOTClassifier?

In [None]:
# Previous values: TPOT_GENERATION=15 and TPOT_POPULATION_SIZE=80
TPOT_GENERATION = 20
TPOT_POPULATION_SIZE = 100
# TPOT will evaluate TPOT_POPULATION_SIZE + offspring_size * TPOT_GENERATION pipelines in total. 
# By default, offspring_size is equal to the population size (100 here).


tpot_clf = TPOTClassifier(generations=TPOT_GENERATION, 
                          population_size=TPOT_POPULATION_SIZE,
                          random_state=SEED, cv=5, 
                          n_jobs=-1, memory='auto', 
                          early_stop = 10,
                          verbosity=2, scoring="f1_weighted")
tpot_clf.fit(train_features_df, train_target_s)
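
To get a sense of the compute budget implied by these settings, here is a small arithmetic sketch of the evaluation count mentioned in the comment above. It assumes, following the TPOT documentation, that `offspring_size` defaults to the population size; `tpot_total_evaluations` is a hypothetical helper, and `early_stop` may end the run sooner.

```python
def tpot_total_evaluations(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated: population_size + generations * offspring_size."""
    if offspring_size is None:
        # Assumption from the TPOT docs: offspring_size defaults to population_size
        offspring_size = population_size
    return population_size + generations * offspring_size


print(tpot_total_evaluations(population_size=100, generations=20))  # 2100
print(tpot_total_evaluations(population_size=80, generations=15))   # 1280
```

With `cv=5`, each of these pipelines is fitted once per fold, so the number of model fits is roughly five times larger.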

# Test evaluation

In [None]:
test_features_df = test_df.drop(ENCODED_TARGET_COL, axis=1)
encoded_test_targets_s = test_df[ENCODED_TARGET_COL]

In [None]:
def test_evaluation(clf):
    """
    Evaluate a classifier on the test dataset. Returns a confusion matrix and F1 score. 
    """
    encoded_test_predictions_s = clf.predict(test_features_df)
    test_predictions_s = pd.Series(le.inverse_transform(encoded_test_predictions_s), 
                                   name="predicted_type_1")
    test_targets_s = pd.Series(le.inverse_transform(encoded_test_targets_s), 
                               name="true_type_1")
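    # Pass the labels explicitly so that the matrix always covers every class, in sorted order
    test_cm = confusion_matrix(test_targets_s, test_predictions_s, labels=le.classes_)
    test_f1_score = round(f1_score(test_targets_s, test_predictions_s, average='weighted'), 3)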
    return test_cm, test_f1_score

In [None]:
test_cm_tpot, test_f1_score_tpot = test_evaluation(tpot_clf)
test_cm_best_xgb, test_f1_score_best_xgb = test_evaluation(best_xgb_clf)
test_cm_lr, test_f1_score_lr = test_evaluation(lr)

In [None]:
print("Tpot test F1 weighted score is {}".format(test_f1_score_tpot))
print("Best XGBoost test F1 weighted score is {}".format(test_f1_score_best_xgb))
print("Logistic regression test F1 weighted score is {}".format(test_f1_score_lr))

In [None]:
# Confusion matrix for Tpot
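# le.classes_ is sorted, which matches the row/column order used by confusion_matrix
# (target_s.unique() follows the order of appearance instead)
plot_confusion_matrix(test_cm_tpot, classes=le.classes_)
plot_confusion_matrix(test_cm_tpot, classes=le.classes_, normalize=True)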

That's impressive. Tpot is by far the winner!

# Stacking 

Alright. Let's stack our best models and use an XGBoost as a second-level model. 
To be continued...

# To wrap up

Some ideas to test: 

* More hyperopt iterations and other hyperparameters to optimize. This had the effect of improving the test F1 weighted score. Add more regularization?
* Use 3 folds CV instead of 5 folds CV. 
* Change the objective to optimize (try something that accounts for the classes' imbalance).
* Try a neural network.
* Try random forests.
* Try stacking.
* Try TPOT => done
* Try TPOT with more generations and bigger population size (for now: generations=10, population_size=40)

I hope you have enjoyed this notebook. Stay tuned for updates!