My goal in this notebook is to find a classifier that can guess a Pokemon's type given its stats. This is my first time using Python for Data Science, and my first time using sklearn!

I'll start off with the imports and loading the data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from time import time

# Read the pokemon table (don't use the pokemon # as the index; it is non-unique)
pokemon = pd.read_csv("../input/Pokemon.csv", na_filter=False, index_col=False)


print(pokemon.loc[0:10])

Next I need to decide how to handle types. Some Pokemon have one type and some have two, so I will reduce every entry to a single type. Single-typed Pokemon are unchanged; each dual-typed Pokemon becomes three entries in the new pokemon table: one with the first type, one with the second type, and one with the two types concatenated. For example, Bulbasaur (grass, poison) becomes [Bulbasaur (grass), Bulbasaur (poison), Bulbasaur (grasspoison)]. I will assume that type order (i.e. Type 1 before Type 2) DOES matter, so grasspoison (Type 1 grass, Type 2 poison) is treated as a different type from poisongrass (Type 1 poison, Type 2 grass). I also want to assign a number to each type (including the new merged types) to make it easier to use as a response variable.

In [None]:
# Don't show warnings tripped by adding columns to a dataframe
pd.options.mode.chained_assignment = None


# save type strings for convenience
t1 = 'Type 1'
t2 = 'Type 2'
t = 'Type'


def set_type(typefcn, df):
    """Set the Type column using typefcn and drop the Type 1 and Type 2 columns.

    typefcn should take a dataframe as an argument and return a series that will fit into that dataframe.
    """
    # work on a copy so the dataframe passed in is not modified
    newdf = df.copy()
    newdf[t] = typefcn(df)
    return newdf.drop(columns=[t1, t2])


# get pokemon with only one type
singletypepokemon = set_type(lambda df: df[t1], pokemon.loc[lambda p: p[t2] == '', :])


# get pokemon with 2 types
dualtypepokemon = pokemon.loc[lambda p: p[t2] != '', :]


# copy dual typed pokemon and set type to first type
firsttypepokemon = set_type(lambda df: df[t1], dualtypepokemon)
# copy dual typed pokemon and set type to second type
secondtypepokemon = set_type(lambda df: df[t2], dualtypepokemon)
# copy dual typed pokemon and set type to first+second type
mergedtypepokemon = set_type(lambda df: df[t1] + df[t2], dualtypepokemon)


# combine pokemon into single table and create a type number
newpokemon = pd.concat([singletypepokemon, firsttypepokemon, secondtypepokemon, mergedtypepokemon])
typelookup = sorted(set(newpokemon[t]))
typenumlookup = {poketype: idx for idx, poketype in enumerate(typelookup)}
# look up each row's type number; map works row by row, so the repeated index labels are not a problem
newpokemon['typenum'] = newpokemon[t].map(typenumlookup)


# Print bulbasaur and charmander examples
print(newpokemon.loc[[0, 4]])
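
To make the expansion concrete, here is a minimal sketch on a made-up two-row table (the HP values are placeholders, not taken from Pokemon.csv). It builds the same three-way expansion for a dual-typed Pokemon and uses `pd.factorize` as one possible way to number the types; the cell above uses a dictionary lookup instead.

```python
import pandas as pd

# Toy table; the stats are placeholders, not real data
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 1": ["Grass", "Fire"],
    "Type 2": ["Poison", ""],
    "HP": [1, 2],
})

single = toy[toy["Type 2"] == ""]
dual = toy[toy["Type 2"] != ""]

# One frame per labelling: Type 1 only, Type 2 only, and both concatenated
parts = [
    single.assign(Type=single["Type 1"]),
    dual.assign(Type=dual["Type 1"]),
    dual.assign(Type=dual["Type 2"]),
    dual.assign(Type=dual["Type 1"] + dual["Type 2"]),
]
expanded = pd.concat(parts).drop(columns=["Type 1", "Type 2"])

# factorize returns integer codes plus the array of unique labels
codes, labels = pd.factorize(expanded["Type"])
expanded["typenum"] = codes  # a NumPy array, so it is assigned by position

print(expanded)
```

Note that the original row labels are kept, so the three Bulbasaur rows all share index 0.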

Now the data needs to be separated into test and training sets. 

In [None]:
predictorcolumns = ['Speed', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'HP']
responsecolumn = 'typenum'

testsize = .2

trainingset, testset = train_test_split(newpokemon, test_size=testsize)

xtrain = trainingset.loc[:, predictorcolumns]
ytrain = trainingset[responsecolumn]

xtest = testset.loc[:, predictorcolumns]
ytest = testset[responsecolumn]

Now the data is finally in a format that should map well to sklearn. My response variable is typenum, and my predictors are Speed, Attack, Special Attack, Defense, Special Defense, and HP. I will try a number of different classifiers, so I will want a function to train and score any given model. The classifiers I selected are the ones I am most familiar with, or that I could at least explain and discuss at a reasonably technical level.

In [None]:
def try_classifier(modelname, model, scorefcn):
    """Fit model on the training set, then print its train/test scores and fit time."""
    # use names that don't shadow the global t1 ('Type 1')
    start = time()
    model.fit(xtrain, ytrain)
    elapsed = time() - start
    print(modelname)
    print("Accuracy on training set: {:.3f}".format(scorefcn(model, xtrain, ytrain)))
    print("Accuracy on test set: {:.3f}".format(scorefcn(model, xtest, ytest)))
    print("Fit time (s): {:.3f}".format(elapsed))
    print("------------------------------------\n")
    
models = {"Decision Tree Classifier": DecisionTreeClassifier(),
          "Random Forest Classifier": RandomForestClassifier(),
          "Gradient Boosting Classifier": GradientBoostingClassifier(),
          "K Nearest Neighbors Classifier": KNeighborsClassifier(),
          "K Means Cluster Classifier": KMeans(),
          "Naive Bayes Classifier": GaussianNB(),
          "Support Vector Classifier": SVC()}

for modelname, model in models.items():
    try_classifier(modelname, model, lambda m, x, y: m.score(x, y))

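First of all, the K Means result makes it clear that its scoring is done differently from the others: KMeans is an unsupervised clustering method, so it ignores the labels passed to fit, and its score method returns the negative of the K-means objective rather than an accuracy. I need to confirm that the remaining methods all use the same scoring function, or at least give comparable results.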
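
As a sanity check on the scoring concern, here is a hedged sketch on synthetic data (generated with `make_blobs`, not the Pokemon table). For sklearn classifiers `score` is the mean accuracy, whereas `KMeans.score` is the negative of the K-means objective, so the two are not on the same scale. Scoring every classifier through `accuracy_score` on its predictions is one way to make sure the same metric is used throughout.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data used purely for illustration
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

knn = KNeighborsClassifier().fit(X, y)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Classifier: score() is the fraction of correct predictions
knn_score = knn.score(X, y)
knn_accuracy = accuracy_score(y, knn.predict(X))

# KMeans: score() is minus the sum of squared distances to the cluster centres
kmeans_score = kmeans.score(X)

print("KNN score:", knn_score)
print("KNN accuracy_score:", knn_accuracy)
print("KMeans score:", kmeans_score)
```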

Overall, the accuracy of the classifiers on the test set is around 70% (with the exception of the abysmal score of the Naive Bayes Classifier). This isn't great, but my training/test procedure is flawed and my scoring function isn't completely fair. Because I split each dual-typed Pokemon into several rows with identical stats, the training set can contain Bulbasaur as a poison Pokemon while the test set contains Bulbasaur as a grass Pokemon. When I score the classifier, it will "correctly" classify Bulbasaur as poison and still be marked wrong because the test label is grass.

Really, the test and training sets should be split on actual Pokemon, but that presents a problem for training: I need to make sure all types are represented in both my training set and my test set. Ideally, my scoring function should also count the classifier as correct whether it classifies Bulbasaur as grass, poison, or both.
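
A rough sketch of both ideas, on a made-up expanded table (names, stats and types are placeholders). It assumes the Pokemon's name is available as a grouping key, uses sklearn's `GroupShuffleSplit` so that every row of a given Pokemon lands on the same side of the split, and defines a hypothetical helper, `lenient_accuracy`, that counts a prediction as correct when it matches any label that Pokemon carries. It does not solve the remaining problem of guaranteeing that every type appears in both sets.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy expanded table: one row per (Pokemon, label) pair; values are invented
rows = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "C", "C", "C", "D", "E"],
    "Type": ["grass", "poison", "grasspoison", "fire",
             "water", "ice", "waterice", "fire", "grass"],
    "HP": [45, 45, 45, 39, 50, 50, 50, 60, 70],
})

# Split on Pokemon rather than on rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(rows, groups=rows["Name"]))
train, test = rows.iloc[train_idx], rows.iloc[test_idx]


def lenient_accuracy(names, predictions, labelled_rows):
    """Fraction of predictions that match any label of the named Pokemon."""
    accepted = labelled_rows.groupby("Name")["Type"].agg(set)
    hits = [pred in accepted[name] for name, pred in zip(names, predictions)]
    return sum(hits) / len(hits)


# Predicting "poison" for A is accepted; predicting "fire" for C is not
print(lenient_accuracy(["A", "C"], ["poison", "fire"], rows))
```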

Finally, these classifiers can be improved by tweaking their input parameters, so there is still plenty of work to do to improve my results.
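
One common way to do that kind of tuning is a cross-validated grid search. The sketch below uses sklearn's `GridSearchCV` on synthetic data from `make_classification`; the parameter grid is an arbitrary illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the six stat columns
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Arbitrary example grid
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: {:.3f}".format(search.best_score_))
print("Held-out accuracy: {:.3f}".format(search.score(X_test, y_test)))
```

Keeping a held-out test set outside the search means the final score is not influenced by the parameter selection.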