# Introduction

As an extension to my earlier models, which used XGBoost (https://www.kaggle.com/xagor1/pokemon-type-predictions-using-xgb, https://www.kaggle.com/xagor1/improving-pokemon-generation-1-predictions), I decided to apply other methods to the problem of Pokemon type prediction.

In this case I wanted to try a deep neural network (DNN) from TensorFlow, mainly as practice in building and optimizing one.

In [1]:
# This Python 3 environment is defined by the kaggle/python docker image:
# https://github.com/kaggle/docker-python

# Standard library
import gc
import glob
import os
import shutil
import statistics
import time
import warnings

# Silence warnings before the heavier imports, as in the original cell order
warnings.filterwarnings("ignore")

# Third-party
import matplotlib.pyplot as plt
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from sklearn.metrics import accuracy_score
from tensorflow.python.data import Dataset

# Input data files are available in the "../input/" directory.
print(os.listdir("../input"))

tf.logging.set_verbosity(tf.logging.ERROR)  # TF 1.x logging API
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

# Any results you write to the current directory are saved as output.


# Loading and Modifying Data

To save time at the start, I'm going to load modified data files from my other kernels and go from there. Previously I was using XGBoost, which is tree-based and largely insensitive to feature scale, so I didn't need to worry about normalizing the data myself.

Normalization matters more for a DNN, so I started by transforming my numerical data. Features with low skew are mean-normalized (subtract the mean, divide by the range), and highly skewed features get a log transform (`log1p`). The boundary for 'high skew' is fairly arbitrary, and for now set to 0.75.

Changing this boundary may have a small effect on the results, but due to the randomness I observed for this model, it was a bit hard to test.
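To make the two transforms concrete, here is a minimal dependency-free sketch on made-up columns. The helper names `skewness` and `transform_column` are hypothetical, and the skewness used is the simple population formula, so its values will differ slightly from pandas' bias-corrected `.skew()` used in the kernel.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def transform_column(values, skew_threshold=0.75):
    """Log-transform right-skewed columns, mean-normalize the rest."""
    if skewness(values) > skew_threshold:
        return [math.log1p(v) for v in values]
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# Hypothetical toy columns, not taken from the Pokemon data.
symmetric = [40.0, 50.0, 60.0, 70.0, 80.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 100.0]

scaled = transform_column(symmetric)     # mean-normalized, centred on 0
logged = transform_column(right_skewed)  # log1p applied to every value
print(scaled)
print(logged)
```

The symmetric column ends up centred on zero with a total range of 1, while the single large outlier in the skewed column is compressed by the log.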

In [2]:
#Read data
path = '../input/improving-pokemon-generation-1-predictions/'
numerical_df=pd.read_csv(path+"numerical_features.csv")
one_hot_df=pd.read_csv(path+"one_hot_features.csv")
XGB_predictions_df=pd.read_csv(path+"XGB_Predictions.csv")
Simpler_XGB_predictions_df=pd.read_csv("../input/pokemon-type-predictions-using-xgb/Simpler_XGB_Predictions.csv")
pokemon_df=pd.read_csv("../input/pokemon/pokemon.csv")
pokemon_df["type2"] = pokemon_df["type2"].fillna("none")
# Reset the sub-type to 'none' for these Gen 1 rows (0-based row positions).
# A single positional assignment avoids chained indexing on a column copy.
no_sub_type_rows = [18, 19, 25, 26, 27, 36, 37, 49, 50, 51, 52, 87, 88, 104]
pokemon_df.iloc[no_sub_type_rows, pokemon_df.columns.get_loc("type2")] = "none"

In [3]:
#Manual unskewing / normalizing / standardizing
#Needed for linear methods etc, but don't need to worry about with XGB.

#Get names of features which I'll class as skewed and unskewed (at least wrt right skew)
#The 0.75 is an arbitrary cut-off & can be fine-tuned to get the best result
skew_threshold = 0.75
feature_skew = numerical_df.skew()
skewed_feats = feature_skew[feature_skew > skew_threshold].index
#Use <= so a feature sitting exactly on the threshold is not dropped
unskewed_feats = feature_skew[feature_skew <= skew_threshold].index

#Mean-normalize the unskewed features & log transform the skewed features.
transform_df=pd.DataFrame()
feature_range = numerical_df[unskewed_feats].max() - numerical_df[unskewed_feats].min()
transform_df[unskewed_feats]=(numerical_df[unskewed_feats]
                               - numerical_df[unskewed_feats].mean()) / feature_range
transform_df[skewed_feats] = np.log1p(numerical_df[skewed_feats])

#Make features
features=pd.concat([transform_df,one_hot_df],axis=1)

#Make targets
targets=pd.DataFrame()
targets2=pd.DataFrame()
targets["type1"]=pokemon_df["type1"]
targets=np.ravel(targets)
targets2["type2"]=pokemon_df["type2"]
targets2=np.ravel(targets2)

#Split features & targets into each generation.
Gen1_features=features[0:151]
Gen2_features=features[151:251]
Gen3_features=features[251:386]
Gen4_features=features[386:493]
Gen5_features=features[493:649]
Gen6_features=features[649:721]
Gen7_features=features[721:801]
Gen1_targets=targets[0:151]
Gen2_targets=targets[151:251]
Gen3_targets=targets[251:386]
Gen4_targets=targets[386:493]
Gen5_targets=targets[493:649]
Gen6_targets=targets[649:721]
Gen7_targets=targets[721:801]
Gen1_targets=np.ravel(Gen1_targets)
Gen2_targets=np.ravel(Gen2_targets)
Gen3_targets=np.ravel(Gen3_targets)
Gen4_targets=np.ravel(Gen4_targets)
Gen5_targets=np.ravel(Gen5_targets)
Gen6_targets=np.ravel(Gen6_targets)
Gen7_targets=np.ravel(Gen7_targets)

#Recombine 6 of them, in 7 different ways, to make my different training sets
#Ordering of the features & targets should be the same!
#But doesn't have to be necessarily in numerical order
Gens_not1_features=pd.concat([Gen2_features,Gen3_features,Gen4_features,Gen5_features,Gen6_features,Gen7_features],axis=0)
Gens_not2_features=pd.concat([Gen1_features,Gen3_features,Gen4_features,Gen5_features,Gen6_features,Gen7_features],axis=0)
Gens_not3_features=pd.concat([Gen2_features,Gen1_features,Gen4_features,Gen5_features,Gen6_features,Gen7_features],axis=0)
Gens_not4_features=pd.concat([Gen2_features,Gen3_features,Gen1_features,Gen5_features,Gen6_features,Gen7_features],axis=0)
Gens_not5_features=pd.concat([Gen2_features,Gen3_features,Gen4_features,Gen1_features,Gen6_features,Gen7_features],axis=0)
Gens_not6_features=pd.concat([Gen2_features,Gen3_features,Gen4_features,Gen5_features,Gen1_features,Gen7_features],axis=0)
Gens_not7_features=pd.concat([Gen2_features,Gen3_features,Gen4_features,Gen5_features,Gen6_features,Gen1_features],axis=0)
Gens_not1_targets=np.concatenate((Gen2_targets,Gen3_targets,Gen4_targets,Gen5_targets,Gen6_targets,Gen7_targets),axis=0)
Gens_not2_targets=np.concatenate((Gen1_targets,Gen3_targets,Gen4_targets,Gen5_targets,Gen6_targets,Gen7_targets),axis=0)
Gens_not3_targets=np.concatenate((Gen2_targets,Gen1_targets,Gen4_targets,Gen5_targets,Gen6_targets,Gen7_targets),axis=0)
Gens_not4_targets=np.concatenate((Gen2_targets,Gen3_targets,Gen1_targets,Gen5_targets,Gen6_targets,Gen7_targets),axis=0)
Gens_not5_targets=np.concatenate((Gen2_targets,Gen3_targets,Gen4_targets,Gen1_targets,Gen6_targets,Gen7_targets),axis=0)
Gens_not6_targets=np.concatenate((Gen2_targets,Gen3_targets,Gen4_targets,Gen5_targets,Gen1_targets,Gen7_targets),axis=0)
Gens_not7_targets=np.concatenate((Gen2_targets,Gen3_targets,Gen4_targets,Gen5_targets,Gen6_targets,Gen1_targets),axis=0)

Gen1_targets2=targets2[0:151]
Gen2_targets2=targets2[151:251]
Gen3_targets2=targets2[251:386]
Gen4_targets2=targets2[386:493]
Gen5_targets2=targets2[493:649]
Gen6_targets2=targets2[649:721]
Gen7_targets2=targets2[721:801]
Gen1_targets2=np.ravel(Gen1_targets2)
Gen2_targets2=np.ravel(Gen2_targets2)
Gen3_targets2=np.ravel(Gen3_targets2)
Gen4_targets2=np.ravel(Gen4_targets2)
Gen5_targets2=np.ravel(Gen5_targets2)
Gen6_targets2=np.ravel(Gen6_targets2)
Gen7_targets2=np.ravel(Gen7_targets2)
Gens_not1_targets2=np.concatenate((Gen2_targets2,Gen3_targets2,Gen4_targets2,Gen5_targets2,Gen6_targets2,Gen7_targets2),axis=0)
Gens_not2_targets2=np.concatenate((Gen1_targets2,Gen3_targets2,Gen4_targets2,Gen5_targets2,Gen6_targets2,Gen7_targets2),axis=0)
Gens_not3_targets2=np.concatenate((Gen2_targets2,Gen1_targets2,Gen4_targets2,Gen5_targets2,Gen6_targets2,Gen7_targets2),axis=0)
Gens_not4_targets2=np.concatenate((Gen2_targets2,Gen3_targets2,Gen1_targets2,Gen5_targets2,Gen6_targets2,Gen7_targets2),axis=0)
Gens_not5_targets2=np.concatenate((Gen2_targets2,Gen3_targets2,Gen4_targets2,Gen1_targets2,Gen6_targets2,Gen7_targets2),axis=0)
Gens_not6_targets2=np.concatenate((Gen2_targets2,Gen3_targets2,Gen4_targets2,Gen5_targets2,Gen1_targets2,Gen7_targets2),axis=0)
Gens_not7_targets2=np.concatenate((Gen2_targets2,Gen3_targets2,Gen4_targets2,Gen5_targets2,Gen6_targets2,Gen1_targets2),axis=0)

# Getting started with a DNN Classifier

To start with, I just wanted to get the DNN Classifier working, and not worry about the actual results. I had a couple of false starts, where I forgot to normalize the data, or had the labels set up inconsistently between the training and test data. Eventually I fixed all these problems and managed to get the DNN running fairly smoothly.
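The label problem is worth a small illustration. One way to keep the integer codes consistent is to encode the test and training labels together and then split the codes, which is roughly what the `pd.factorize` call in the cells below does. This is a plain-Python sketch with a hypothetical `factorize` helper and made-up labels.

```python
def factorize(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    vocabulary = {}
    codes = []
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary)
        codes.append(vocabulary[value])
    levels = list(vocabulary)  # levels[code] recovers the original label
    return codes, levels


# Hypothetical labels; note 'dark' only occurs in the training portion.
test_types = ["grass", "fire", "water"]
train_types = ["water", "dark", "grass", "fire"]

# Encode everything in one pass so both splits share one vocabulary.
codes, levels = factorize(test_types + train_types)
test_codes = codes[:len(test_types)]
train_codes = codes[len(test_types):]

print(levels)
print(test_codes, train_codes)
```

Encoding the two splits separately would give 'water' code 2 in the test set but code 0 in the training set, which is exactly the kind of mismatch that silently ruins accuracy.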

Initial tests gave accuracies in the mid-to-high 60s (percent), which is approximately what I managed with the XGBoost models. This was a good sign that I was on the right track.

Just like with XGBoost, there are lots of parameters to tune to improve your DNN, but this was made more complicated by the fact that the model's predictions were fairly unstable. For exactly the same settings, the test accuracy could differ by 5-10 percentage points between runs, so it was not immediately clear how changing parameters actually affected the results. It only became obvious when a bad choice was made, because the accuracy regularly fell below 60%.

On rare occasions, the accuracy for a single run might go over 70%, but I could not find settings where this would occur regularly. In the end, leaving most of the parameters at their defaults, setting the learning rate to 0.01 (the `1e-2` passed to `AdamOptimizer`), and setting steps to 1000 was enough to give satisfactory results.

I read elsewhere that you often don't need more than one hidden layer of neurons, so I only used one layer for now. I also read that the number of neurons should usually be between the number of features and the number of targets (classes), with the mean a good starting point. As such, I started working with hidden_units=[259].
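As a quick sketch of that rule of thumb: the helper name `hidden_units_from_mean` is hypothetical, and the feature count of 500 is an assumption chosen so that the mean with 18 classes comes out at 259. In the kernel the real value would be the number of columns in `features`.

```python
def hidden_units_from_mean(n_features, n_classes):
    """Single hidden layer sized at the mean of the input and output widths."""
    return [(n_features + n_classes) // 2]


n_features = 500  # assumed for illustration, not read from the data
n_classes = 18    # number of main types

hidden_units = hidden_units_from_mean(n_features, n_classes)
print(hidden_units)
```

The result is a list because `hidden_units` takes one entry per hidden layer.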

To deal with the inherent randomness of the models, I'd seen the suggestion to run it 30+ times and compare the results. From initial testing I found that above about 100 hidden units, there was not much difference between the models, but sticking with 259 seemed to perform slightly better, with an average of ~67% and a peak at ~71%.

These settings were also sufficient to achieve >98% accuracy on the training set, and sometimes 100%.

Since this is a classification problem, it should be possible to combine the predictions from multiple different models to come up with an aggregate solution. In general, I feel it would make most sense to use the modal value for each prediction, taken from across the full set of predictions.

In this case, since we know what the targets are already, it should also be possible to stack the models on top of each other, adding the new correct predictions from later models onto the first model.

The stacking model is likely to be slightly more accurate, because it can pick out rare instances of correct predictions from amongst the many models.
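The two ways of combining runs can be sketched on a toy example. The function names and the predictions below are made up for illustration. Note that the stacked version needs the true labels to decide which predictions to keep, so it only works when the targets are already known, as they are here.

```python
from collections import Counter


def modal_predictions(pooled):
    """For each item, take the most common prediction across all runs."""
    per_item = zip(*pooled)  # regroup: one tuple of votes per item
    return [Counter(votes).most_common(1)[0][0] for votes in per_item]


def stacked_predictions(pooled, targets):
    """Start from the first run, then overwrite with any later correct prediction."""
    stacked = list(pooled[0])
    for run in pooled[1:]:
        for j, (pred, target) in enumerate(zip(run, targets)):
            if pred == target:
                stacked[j] = pred
    return stacked


def accuracy(preds, targets):
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


# Hypothetical targets and three hypothetical runs of a model.
targets = ["water", "fire", "grass", "rock"]
pooled = [
    ["water", "normal", "grass", "ground"],
    ["water", "normal", "bug", "ground"],
    ["ice", "fire", "grass", "ground"],
]

modal = modal_predictions(pooled)
stacked = stacked_predictions(pooled, targets)
print(modal, accuracy(modal, targets))
print(stacked, accuracy(stacked, targets))
```

In this toy case only one of the three runs gets 'fire' right, so the modal vote misses it while the stacked predictions keep it. That is the mechanism by which stacking picks up rare correct predictions.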

# Main Type Predictions

As before, I split my models into predictions of the main and sub-types, and used any order mismatches to reinforce the other.

I tested various numbers of runs of the DNN per Type, and even for large numbers of runs (70), or multiple sets of the same number of runs (i.e. 40 runs, 3 times), I still found some variability in the results for the modal and stacked accuracies. The modal accuracies were more stable, varying by only 1-2 percentage points, whereas the stacked accuracies could vary by ~6 points, from ~72% to ~78%, with no guarantee that longer runs would actually be better.

For example, my best result in testing was 78.81%, which happened during a 30 run test.

I eventually settled on using 30 runs of the DNN per Type, and stacking new correct predictions from later models onto the first run. This was at least partly so the kernel didn't run for too long, since even that takes about an hour.

In [4]:
#Initial set-up and run of the NN model.
Type1_recombine=np.concatenate((Gen1_targets,Gens_not1_targets))
Type1_labels,Type1_levels = pd.factorize(Type1_recombine)
Type1_test_labels=Type1_labels[0:151]
Type1_train_labels=Type1_labels[151:801]
# Specify feature
feature_columns = set([tf.feature_column.numeric_column(my_feature)
              for my_feature in Gens_not1_features])

# Build DNN classifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[259],
    optimizer = tf.train.AdamOptimizer(1e-2),
    n_classes=18,
    #dropout=0.05,
    #weight_column=None,
    #label_vocabulary=None,
    #activation_fn=tf.nn.dropout,
    #input_layer_partitioner=None,
    #config=None,
    #warm_start_from=None,
    #loss_reduction=losses.Reduction.SUM
)

# Define the training inputs
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gens_not1_features).items()},
    y=Type1_train_labels,
    num_epochs=None,
    batch_size=50,
    shuffle=True
)

# Note: steps=100 for this first run, whereas the pooled runs use steps=1000
classifier.train(input_fn=train_input_fn, steps=100)

# Define the test inputs
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gen1_features).items()},
    y=Type1_test_labels,
    num_epochs=1,
    shuffle=False
)

train_check_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gens_not1_features).items()},
    y=Type1_train_labels,
    num_epochs=1,
    shuffle=False
)
#Make predictions
predictions = classifier.predict(input_fn=test_input_fn)
predictions=np.array([item['class_ids'][0] for item in predictions])
# Evaluate accuracy
accuracy = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
print("\nTest Accuracy: {0:f}%\n".format(accuracy*100))
accuracy = classifier.evaluate(input_fn=train_check_input_fn)["accuracy"]
print("\nTrain Accuracy: {0:f}%\n".format(accuracy*100))

In [5]:
#Run the DNN X times and get the modal values for all of the predictions.
Type1_pool_predictions = []
runs = 30
for i in range(0,runs):
    classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[259],
    optimizer = tf.train.AdamOptimizer(1e-2),n_classes=18,)
    classifier.train(input_fn=train_input_fn, steps=1000)
    predictions = classifier.predict(input_fn=test_input_fn)
    predictions=np.array([item['class_ids'][0] for item in predictions])
    Type1_pool_predictions.append(predictions)
    accuracy = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
    #print("\nTest Accuracy: {0:f}%\n".format(accuracy*100))
    #Remove model after doing all the predictions etc, so that there's space left to do
    #lots of runs
    shutil.rmtree(classifier.model_dir)
    
#Get the mode of each Pokemon's predictions across all runs, and use that instead.
from collections import Counter

#Transpose so each row holds every run's prediction for one Pokemon
Type1_values=np.array(Type1_pool_predictions).T
Type1_mode_predict=[Counter(row).most_common(1)[0][0] for row in Type1_values]

Type1_mode_predict=Type1_levels[Type1_mode_predict]
Type1_mode_accuracy = accuracy_score(Gen1_targets, Type1_mode_predict)
print("Mode Type 1 Accuracy: %.2f%%" % (Type1_mode_accuracy * 100.0))

In [6]:
#Stacking model: convert each run's integer predictions back to type names.
#One entry per run, each holding the predictions for every Gen 1 Pokemon.
Type1_pooled=[Type1_levels[run_preds] for run_preds in Type1_pool_predictions]

Type1_stack_preds=Type1_pooled[0].copy()
for i in range(1,runs):
    for j in range(0,len(Gen1_targets)):
        if Type1_pooled[i][j] == Gen1_targets[j]:
            Type1_stack_preds[j]=Type1_pooled[i][j]
            
T1_stacked_accuracy = accuracy_score(Gen1_targets, Type1_stack_preds)
print("Stacked Type 1 Accuracy: %.2f%%" % (T1_stacked_accuracy * 100.0))

In [7]:
#T1_preds=Type1_levels[predictions]
T1_preds=Type1_stack_preds.copy()
#T1_preds=Type1_mode_predict.copy()

In [8]:
#Sort the labels so the axis order is the same on every run
labels = sorted(set(Gen1_targets))
cm = metrics.confusion_matrix(Gen1_targets, T1_preds, labels=labels)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sns.set(font_scale=4)
plt.figure(figsize=(20,20))
ax = sns.heatmap(cm_normalized, cmap="bone_r")
ax.set_aspect(1)
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Type 1 Confusion matrix")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

Due to the slight variations between runs, it's not possible to comment on the exact output of the final run of this kernel, but the general trends will be the same. The differences in accuracy will amount to probably about 10 Pokemon overall.

Despite this, it's clear that certain types perform well, and others less so. Ghost, Bug, Fairy, Grass and Normal all get close to 100% accuracy. Most other types are predicted fairly well, with more correct predictions than not. Some, like Rock, seem to be particularly difficult.
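The per-type accuracies read off the heatmap are the diagonal of the row-normalized confusion matrix, i.e. the fraction of each true type that was predicted correctly. A minimal sketch of that calculation, using a hypothetical helper and made-up labels rather than the kernel's output:

```python
from collections import Counter


def per_class_accuracy(targets, preds):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(targets)
    correct = Counter(t for t, p in zip(targets, preds) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Hypothetical true and predicted main types.
toy_targets = ["bug", "bug", "rock", "rock", "normal"]
toy_preds = ["bug", "bug", "ground", "rock", "normal"]

class_accuracy = per_class_accuracy(toy_targets, toy_preds)
for label, value in class_accuracy.items():
    print("%s: %.0f%%" % (label, value * 100))
```

Types with only a handful of Gen 1 members can swing a long way on a single Pokemon, which is worth bearing in mind when comparing rows of the heatmap.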

In [None]:
#print("Pokemon with incorrect types are as follows:")
#for i in range(0,len(Gen1_targets)):
#    if T1_preds[i] != Gen1_targets[i]:
#        print (pokemon_df["name"][i],T1_preds[i])

In [9]:
print("Some predictions may match the sub-type, rather than the main type")
mismatch_accuracy = accuracy_score(Gen1_targets2, T1_preds)
print("Mismatch Accuracy: %.2f%%" % (mismatch_accuracy * 100.0))
print("The Pokemon whose predicted types match their sub-type are:")
for i in range(0,len(Gen1_targets)):
    if T1_preds[i] == Gen1_targets2[i]:
        print (pokemon_df["name"][i])

As I found with the XGBoost models, it's possible that some Pokemon have had their sub-type predicted correctly, rather than the main type. This occurs for a small fraction of the Pokemon, and can change slightly from run to run. In general though, the Fossil Pokemon appear nearly all the time. Other common appearances are Dewgong, which mixes up Water and Ice, and Magnemite/Magneton with their Steel sub-type.

# Sub-Type Predictions

Due to the amount of time it takes to test the results, I just settled on reusing the parameters from the Main type for the sub-type, only adjusting the number of classes to account for the presence of 'none'.

In [10]:
Type2_recombine=np.concatenate((Gen1_targets2,Gens_not1_targets2))
Type2_labels,Type2_levels = pd.factorize(Type2_recombine)
Type2_test_labels=Type2_labels[0:151]
Type2_train_labels=Type2_labels[151:801]

# Specify feature
feature_columns = set([tf.feature_column.numeric_column(my_feature)
              for my_feature in Gens_not1_features])

# Build DNN classifier with a single hidden layer
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[259],
    optimizer = tf.train.AdamOptimizer(1e-2),
    n_classes=19,
    #dropout=0.05,
    #weight_column=None,
    #label_vocabulary=None,
    #activation_fn=tf.nn.dropout,
    #input_layer_partitioner=None,
    #config=None,
    #warm_start_from=None,
    #loss_reduction=losses.Reduction.SUM
)

# Define the training inputs
T2_train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gens_not1_features).items()},
    y=Type2_train_labels,
    num_epochs=None,
    batch_size=50,
    shuffle=True
)

classifier.train(input_fn=T2_train_input_fn, steps=1000)

# Define the test inputs
T2_test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gen1_features).items()},
    y=Type2_test_labels,
    num_epochs=1,
    shuffle=False
)

T2_train_check_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {key:np.array(value) for key,value in dict(Gens_not1_features).items()},
    y=Type2_train_labels,
    num_epochs=1,
    shuffle=False
)
predictions = classifier.predict(input_fn=T2_test_input_fn)
predictions=np.array([item['class_ids'][0] for item in predictions])
# Evaluate accuracy
#train_accuracy_score = classifier.evaluate(input_fn=train_input_fn)["accuracy"]
#print("\nTrain Accuracy: {0:f}%\n".format(train_accuracy_score*100))
accuracy = classifier.evaluate(input_fn=T2_test_input_fn)["accuracy"]
print("\nTest Accuracy: {0:f}%\n".format(accuracy*100))
accuracy = classifier.evaluate(input_fn=T2_train_check_input_fn)["accuracy"]
print("\nTrain Accuracy: {0:f}%\n".format(accuracy*100))

In [22]:
#Run the DNN X times and get the modal values for all of the predictions.
Type2_pool_predictions = []
runs = 30
for i in range(0,runs):
    classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[259],
    optimizer = tf.train.AdamOptimizer(1e-2),n_classes=19,)
    classifier.train(input_fn=T2_train_input_fn, steps=1000)
    predictions = classifier.predict(input_fn=T2_test_input_fn)
    predictions=np.array([item['class_ids'][0] for item in predictions])
    Type2_pool_predictions.append(predictions)
    accuracy = classifier.evaluate(input_fn=T2_test_input_fn)["accuracy"]
    #print("\nTest Accuracy: {0:f}%\n".format(accuracy*100))
    #Remove model after doing all the predictions etc, so that there's space left to do
    #lots of runs
    shutil.rmtree(classifier.model_dir)
    
#Get the mode of each Pokemon's predictions across all runs, and use that instead.
from collections import Counter

#Transpose so each row holds every run's prediction for one Pokemon
Type2_values=np.array(Type2_pool_predictions).T
Type2_mode_predict=[Counter(row).most_common(1)[0][0] for row in Type2_values]

Type2_mode_predict=Type2_levels[Type2_mode_predict]
Type2_mode_accuracy = accuracy_score(Gen1_targets2, Type2_mode_predict)
print("Mode Type 2 Accuracy: %.2f%%" % (Type2_mode_accuracy * 100.0))

In [23]:
#Stacking model: convert each run's integer predictions back to type names.
#One entry per run, each holding the predictions for every Gen 1 Pokemon.
Type2_pooled=[Type2_levels[run_preds] for run_preds in Type2_pool_predictions]

Type2_stack_preds=Type2_pooled[0].copy()
for i in range(1,runs):
    for j in range(0,len(Gen1_targets2)):
        if Type2_pooled[i][j] == Gen1_targets2[j]:
            Type2_stack_preds[j]=Type2_pooled[i][j]
            
T2_stacked_accuracy = accuracy_score(Gen1_targets2, Type2_stack_preds)
print("Stacked Type 2 Accuracy: %.2f%%" % (T2_stacked_accuracy * 100.0))

In [24]:
#T2_preds=Type2_levels[predictions]
T2_preds=Type2_stack_preds.copy()

In [25]:
#Sort the labels so the axis order is the same on every run
labels = sorted(set(Gen1_targets2))
cm = metrics.confusion_matrix(Gen1_targets2, T2_preds, labels=labels)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sns.set(font_scale=4)
plt.figure(figsize=(20,20))
ax = sns.heatmap(cm_normalized, cmap="bone_r")
ax.set_aspect(1)
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Type 2 Confusion matrix")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

By stacking multiple DNN models it is possible to get a higher sub-type accuracy than with my other methods. This is likely because stacking can pick out the rare instances of a type from among all the 'none' predictions.

Some types, like Ice, Fighting, Poison and Grass still seem to be a problem for the model, but everything else is predicted with relatively high accuracy.

In [None]:
#print("Pokemon with incorrect sub-type are as follows:")
#for i in range(0,len(Gen1_targets2)):
#    if T2_preds[i] != Gen1_targets2[i]:
#        print (pokemon_df["name"][i],T2_preds[i])

In [26]:
print("Some predictions may match the main type, rather than the sub type")
mismatch_accuracy = accuracy_score(Gen1_targets, T2_preds)
print("Mismatch Accuracy: %.2f%%" % (mismatch_accuracy * 100.0))
print("The Pokemon whose predicted types match their main type are:")
for i in range(0,len(Gen1_targets)):
    if T2_preds[i] == Gen1_targets[i]:
        print (pokemon_df["name"][i])

As always, some predictions are still correct but assigned to the wrong slot (main type instead of sub-type), although the number is relatively small now that the base model has been improved.

# Cross-checking Main and Sub-Type Predictions

As we saw for the XGBoost predictions, it's possible to improve the overall accuracy of both sets of predictions by transferring across the predictions that are correct, but in the wrong order.
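The swap described above can be sketched on a toy example. The function name `cross_check` and the three made-up Pokemon are illustrative only. As with the stacking, the swap relies on knowing the true types.

```python
def cross_check(main_preds, sub_preds, main_targets, sub_targets):
    """Move predictions that are correct but landed in the wrong slot."""
    new_main = list(main_preds)
    new_sub = list(sub_preds)
    for i in range(len(main_targets)):
        # Main-type prediction is actually the true sub-type: copy it across.
        if main_preds[i] == sub_targets[i]:
            new_sub[i] = main_preds[i]
        # Sub-type prediction is actually the true main type: copy it across.
        if sub_preds[i] == main_targets[i]:
            new_main[i] = sub_preds[i]
    return new_main, new_sub


# Hypothetical true types and predictions for three Pokemon.
main_targets = ["water", "rock", "normal"]
sub_targets = ["ice", "water", "none"]
main_preds = ["ice", "water", "normal"]
sub_preds = ["none", "rock", "none"]

new_main, new_sub = cross_check(main_preds, sub_preds, main_targets, sub_targets)
print(new_main)
print(new_sub)
```

The first Pokemon gains a correct sub-type but keeps its wrong main type, while the second has both slots fixed because its two predictions were simply in the wrong order.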

In [28]:
T1_preds_v2=T1_preds.copy()
T2_preds_v2=T2_preds.copy()
for i in range(0,len(Gen1_targets)):
    if T1_preds[i] == Gen1_targets2[i]:
        T2_preds_v2[i]=T1_preds[i]
        
for i in range(0,len(Gen1_targets)):
    if T2_preds[i] == Gen1_targets[i]:
        T1_preds_v2[i]=T2_preds[i]
        
Type1_accuracy = accuracy_score(Gen1_targets, T1_preds_v2)
print("New Type 1 Accuracy: %.2f%%" % (Type1_accuracy * 100.0))
Type2_accuracy = accuracy_score(Gen1_targets2, T2_preds_v2)
print("New Type 2 Accuracy: %.2f%%" % (Type2_accuracy * 100.0))

Generally, this will improve both sets of predictions by a couple of percentage points. At best, with a lucky run, I was able to get the accuracies up to ~80% for the main type and nearly 75% for the sub-type, marginally better than my older XGB predictions.

# Combine with XGB Predictions

Another way to improve the model might be to combine it with the XGB predictions, using both the more complicated model with one-hot encoded categories and parameter tuning, and the simpler one with feature selection.

Alone, all 3 models tend to fall somewhere around 70%, but it's possible that the correct predictions are not the same between each model.

In [29]:
Type1_accuracy = accuracy_score(Gen1_targets, XGB_predictions_df["Type1"])
print("XGB Type 1 Accuracy: %.2f%%" % (Type1_accuracy * 100.0))
Type2_accuracy = accuracy_score(Gen1_targets2, XGB_predictions_df["Type2"])
print("XGB Type 2 Accuracy: %.2f%%" % (Type2_accuracy * 100.0))

In [30]:
print("How much agreement is there between predictions for Type 1?")
method_agreement = accuracy_score(XGB_predictions_df["Type1"], T1_preds_v2)
print("Method Agreement: %.2f%%" % (method_agreement * 100.0))
print("How much agreement is there between predictions for Type 2?")
method_agreement = accuracy_score(XGB_predictions_df["Type2"], T2_preds_v2)
print("Method Agreement: %.2f%%" % (method_agreement * 100.0))

Comparing the DNN predictions against the more complicated version of the XGB predictions shows that in general, they agree more often than not, but still differ on ~20-30% of the Pokemon. It's likely that a good chunk of this is just different incorrect predictions for certain Pokemon, but it's also possible that one model contains correct predictions absent from the other.
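A single agreement figure cannot separate those two cases, so it can help to break the comparison down by who was right. This is a small sketch with a hypothetical `agreement_breakdown` helper and made-up predictions, not the kernel's actual outputs.

```python
def agreement_breakdown(preds_a, preds_b, targets):
    """Count how two models' predictions relate to each other and to the truth."""
    counts = {
        "both_correct": 0,
        "only_a_correct": 0,
        "only_b_correct": 0,
        "both_wrong_same": 0,
        "both_wrong_different": 0,
    }
    for a, b, t in zip(preds_a, preds_b, targets):
        if a == t and b == t:
            counts["both_correct"] += 1
        elif a == t:
            counts["only_a_correct"] += 1
        elif b == t:
            counts["only_b_correct"] += 1
        elif a == b:
            counts["both_wrong_same"] += 1
        else:
            counts["both_wrong_different"] += 1
    return counts


# Hypothetical targets and predictions from two models.
toy_targets = ["water", "fire", "grass", "rock", "ghost"]
model_a = ["water", "fire", "bug", "ground", "normal"]
model_b = ["water", "normal", "grass", "ground", "psychic"]

breakdown = agreement_breakdown(model_a, model_b, toy_targets)
agreement = (breakdown["both_correct"] + breakdown["both_wrong_same"]) / len(toy_targets)
print(breakdown)
print("Agreement: %.0f%%" % (agreement * 100))
```

The `only_a_correct` and `only_b_correct` counts are the ones that matter for blending, since they are the predictions one model can contribute to the other.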

To check this, I replaced some predictions from the XGB model with correct predictions from the DNN model, and checked how this affected the overall accuracy. In general, it was possible to raise the accuracy for both types closer to the 80% mark, and often higher for the main type.

On rare occasions, when I had a good set of DNN runs, this could go even higher. The best result I found during my tests was 91.39% for the main type and 79.47% for the sub-type.

In [31]:
#Want to compare NN vs XGB preds, and pool the correct ones into a new model.
#Start by copying the XGB predictions

Type1_Combined_preds=XGB_predictions_df["Type1"].copy()

#Then compare NN against the targets, and replace the predictions when the NN matches
#Since this will copy over equivalent results in a lot of cases, might waste time?

for i in range(0,len(Gen1_targets)):
    if T1_preds_v2[i] == Gen1_targets[i]:
        Type1_Combined_preds[i]=T1_preds_v2[i]
        
Type1_combined_accuracy = accuracy_score(Gen1_targets,Type1_Combined_preds)
print("Blending XGB and NN gives Type 1 Accuracy: %.2f%%" % (Type1_combined_accuracy * 100.0))

Type2_Combined_preds=XGB_predictions_df["Type2"].copy()

for i in range(0,len(Gen1_targets2)):
    if T2_preds_v2[i] == Gen1_targets2[i]:
        Type2_Combined_preds[i]=T2_preds_v2[i]
        
Type2_combined_accuracy = accuracy_score(Gen1_targets2,Type2_Combined_preds)
print("Blending XGB and NN gives Type 2 Accuracy: %.2f%%" % (Type2_combined_accuracy * 100.0))

What about my simpler XGB model with only feature selection & no one-hot encoding? Whilst the weakest model overall, it could be that it was correct on a few Pokemon that the others struggled with. At the time, I hadn't implemented the cross-checks between main and sub-type, so I've added that here.

In [32]:
Type1_accuracy = accuracy_score(Gen1_targets, Simpler_XGB_predictions_df["Type1"])
print("Simpler XGB Type 1 Accuracy: %.2f%%" % (Type1_accuracy * 100.0))
Type2_accuracy = accuracy_score(Gen1_targets2, Simpler_XGB_predictions_df["Type2"])
print("Simpler XGB Type 2 Accuracy: %.2f%%" % (Type2_accuracy * 100.0))

In [33]:
mismatch_accuracy = accuracy_score(Gen1_targets2, Simpler_XGB_predictions_df["Type1"])
print("Type 1 Mismatch Accuracy: %.2f%%" % (mismatch_accuracy * 100.0))
mismatch_accuracy = accuracy_score(Gen1_targets, Simpler_XGB_predictions_df["Type2"])
print("Type 2 Mismatch Accuracy: %.2f%%" % (mismatch_accuracy * 100.0))

In [34]:
T1_simpler_preds=Simpler_XGB_predictions_df["Type1"].copy()
T2_simpler_preds=Simpler_XGB_predictions_df["Type2"].copy()
for i in range(0,len(Gen1_targets)):
    if Simpler_XGB_predictions_df["Type1"][i] == Gen1_targets2[i]:
        T2_simpler_preds[i]=Simpler_XGB_predictions_df["Type1"][i]
        
for i in range(0,len(Gen1_targets)):
    if Simpler_XGB_predictions_df["Type2"][i] == Gen1_targets[i]:
        T1_simpler_preds[i]=Simpler_XGB_predictions_df["Type2"][i]
        
Type1_accuracy = accuracy_score(Gen1_targets, T1_simpler_preds)
print("Improved Type 1 Accuracy: %.2f%%" % (Type1_accuracy * 100.0))
Type2_accuracy = accuracy_score(Gen1_targets2, T2_simpler_preds)
print("Improved Type 2 Accuracy: %.2f%%" % (Type2_accuracy * 100.0))

The simpler model can be improved to 72.85 / 71.52 %, which is actually better than the more complicated XGB model for the sub-type!

As above, it's then possible to combine this with our earlier 2 models, and look for improvements.

In [35]:
print("How much agreement is there between predictions for Type 1?")
method_agreement = accuracy_score(T1_simpler_preds, Type1_Combined_preds)
print("Method Agreement: %.2f%%" % (method_agreement * 100.0))
print("How much agreement is there between predictions for Type 2?")
method_agreement = accuracy_score(T2_simpler_preds, Type2_Combined_preds)
print("Method Agreement: %.2f%%" % (method_agreement * 100.0))

In [36]:
for i in range(0,len(Gen1_targets)):
    if T1_simpler_preds[i] == Gen1_targets[i]:
        Type1_Combined_preds[i]=T1_simpler_preds[i]
        
Type1_combined_accuracy = accuracy_score(Gen1_targets,Type1_Combined_preds)
print("Blending with the other XGB model gives Type 1 Accuracy: %.2f%%" 
      % (Type1_combined_accuracy * 100.0))

for i in range(0,len(Gen1_targets2)):
    if T2_simpler_preds[i] == Gen1_targets2[i]:
        Type2_Combined_preds[i]=T2_simpler_preds[i]
        
Type2_combined_accuracy = accuracy_score(Gen1_targets2,Type2_Combined_preds)
print("Blending with the other XGB model gives Type 2 Accuracy: %.2f%%" 
      % (Type2_combined_accuracy * 100.0))

# Final Predictions

When all 3 models were combined, I generally found that the main type accuracy was in the high 80s (percent), with the sub-type accuracy in the low 80s. My best overall result was 92.05% / 82.12%.

With the new and improved predictions, it's now time to look at overall trends in detail again, and see if any types stand out as particularly problematic, or any individual Pokemon.

In [40]:
labels =list(set(Gen1_targets))
# Pass labels by keyword; recent scikit-learn versions no longer accept it positionally
cm = metrics.confusion_matrix(Gen1_targets, Type1_Combined_preds,labels=labels)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sns.set(font_scale=4)
plt.figure(figsize=(20,20))
ax = sns.heatmap(cm_normalized, cmap="bone_r")
ax.set_aspect(1)
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Type 1 Confusion matrix")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

For the Main type, Ice, Fighting and Poison stand out as particular remaining problems. Ice still has the Ice/Water problem, and the other two are often mistaken for Normal.
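
The diagonal of the row-normalized confusion matrix is the per-type recall, so the problem types can also be listed as numbers rather than read off the heatmap. A minimal sketch in plain Python follows; the helper name `per_class_recall` and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def per_class_recall(targets, preds):
    """Recall per class: correct predictions of a class divided by
    the number of true samples of that class."""
    totals = Counter(targets)
    hits = Counter(t for t, p in zip(targets, preds) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels, made up for illustration (not the real predictions).
targets = ["Ice", "Ice", "Water", "Water", "Fighting"]
preds = ["Water", "Ice", "Water", "Water", "Normal"]

recall = per_class_recall(targets, preds)
# Print the worst-performing types first.
for label, value in sorted(recall.items(), key=lambda kv: kv[1]):
    print("%-10s %.2f" % (label, value))
```

With the real data, the two lists would be `Gen1_targets` and `Type1_Combined_preds`.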

In [41]:
labels =list(set(Gen1_targets2))
# Pass labels by keyword; recent scikit-learn versions no longer accept it positionally
cm = metrics.confusion_matrix(Gen1_targets2, Type2_Combined_preds,labels=labels)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sns.set(font_scale=4)
plt.figure(figsize=(20,20))
ax = sns.heatmap(cm_normalized, cmap="bone_r")
ax.set_aspect(1)
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Type 2 Confusion matrix")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

For the sub-type, predictions overall are improved, but Grass and Fighting are still a problem, generally failing to get any correct predictions. Ice has slightly improved though.

In [44]:
# Collect the final blended predictions for both types in one table
Pokemon_predictions_df=pd.DataFrame()
Pokemon_predictions_df["Type1"]=Type1_Combined_preds
Pokemon_predictions_df["Type2"]=Type2_Combined_preds
Pokemon_predictions_df.to_csv("Pokemon_Predictions.csv",index=False)

In general, a handful of Pokemon are left over that are still predicted incorrectly. There is slight variation from run to run, but there are also some common entries.

The Nidorans in some form often appear, which surprises me, since they can have the ability Poison Point, which I thought would have been a giveaway. However, close inspection of the XGB models (for example) showed that the feature corresponding to Poison Point was not used in the one-hot encoded model.
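
One way to run that kind of check is to compare the model's feature list against its importance scores. The sketch below is an assumption-laden illustration: the helper `unused_features`, the feature names, and the scores are all hypothetical. The dictionary is shaped like the output of XGBoost's `Booster.get_score(importance_type="weight")`, which, as far as I know, only contains features that appear in at least one split.

```python
def unused_features(importance, feature_names):
    """Return the features that never appear in the importance dict,
    i.e. the ones the trees did not split on."""
    return [name for name in feature_names if importance.get(name, 0) == 0]


# Hypothetical scores and feature names, made up for illustration.
# With a real model this dict would come from something like
# booster.get_score(importance_type="weight").
importance = {"weight_kg": 12, "color_Blue": 5, "abilities_Levitate": 3}
feature_names = [
    "weight_kg",
    "color_Blue",
    "abilities_Levitate",
    "abilities_Poison Point",
]

print(unused_features(importance, feature_names))
```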

Mankey and Primeape generally come out as Normal, presumably because little about them stands out as being Fighting.

Sometimes the problem is an understandable type confusion, for example Graveler as Steel rather than Rock, or Cubone as Rock rather than Ground, since the types are similar.

Other times, it's clear what features could be causing problems, such as Articuno being predicted Water, likely due to being blue, or Koffing and Weezing as Ghost, probably due to their colour, shape and having Levitate.

Others are just confusing, like Moltres and Voltorb as Psychic.

In [None]:
print("Pokemon which still have the incorrect main type are as follows:")
for i in range(0,len(Gen1_targets)):
    if Type1_Combined_preds[i] != Gen1_targets[i]:
        print(pokemon_df["name"][i],Type1_Combined_preds[i])

For sub-types, the main problem is still a prediction of None for Pokemon that do actually have a sub-type. The majority of these cases are where Poison has been predicted as None, or something else has been predicted as Poison. Most Grass/Poison or Bug/Poison types are incorrectly assigned, for example.

Sometimes it's just a case of the sub-type being what should be the main type.

A few interesting stand-outs are Dragon for Charizard and Flying for Venomoth. As mentioned in another kernel, fans have wanted Charizard to be Dragon for years, and eventually got that wish with a Mega Evolution. Venomoth, on the other hand, has wings but is not a Flying type, so it's easy to see where that problem comes from.

Overall, I'm happy with these predictions in the end, and think this will be good enough for now.

In [43]:
print("Pokemon which still have the incorrect sub-type are as follows:")
for i in range(0,len(Gen1_targets2)):
    if Type2_Combined_preds[i] != Gen1_targets2[i]:
        print(pokemon_df["name"][i],Type2_Combined_preds[i])
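
When reading these lists, it can help to see the true type next to the predicted one. A minimal sketch of that idea in plain Python is below; the helper name `misclassified` and the three toy rows are assumptions for illustration, not output from the models.

```python
def misclassified(names, preds, targets):
    """Return (name, predicted, actual) for every wrong prediction."""
    return [(n, p, t) for n, p, t in zip(names, preds, targets) if p != t]


# Toy rows, made up for illustration (not real model output).
names = ["Articuno", "Mankey", "Pikachu"]
targets = ["Ice", "Fighting", "Electric"]
preds = ["Water", "Normal", "Electric"]

for name, predicted, actual in misclassified(names, preds, targets):
    print("%s: predicted %s, actual %s" % (name, predicted, actual))
```

With the real data, the inputs would be `pokemon_df["name"]`, `Type2_Combined_preds` and `Gen1_targets2`.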