In [None]:
# Creating a classifier for legendary & mythic pokemon

The objective of this notebook is to create a classifier that can adequately identify legendary and mythic pokemon from their statistics.

The key problem we'll have to deal with is that the classes are very unbalanced. As one would expect, legendary and mythic pokemon represent only a tiny fraction of the total number of pokemon created.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
data = pd.read_csv('/kaggle/input/complete-pokemon-data-set/pokemon.csv', sep=',')

We'll create an objective class that maps to 1 for either legendary or mythic pokemon and -1 for everyone else.

In [None]:
# 1 if the pokemon is legendary or mythical (or both), -1 otherwise
is_special = data['legendary'].astype(bool) | data['mythical'].astype(bool)
data['obj_class'] = np.where(is_special, 1, -1)
data['obj_class'].value_counts()

The counts show that the classes are imbalanced by roughly 10:1.
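
The ratio can also be computed directly instead of being read off the counts. The snippet below is a minimal sketch on made-up labels; the helper name `imbalance_ratio` and the toy series are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Return the majority-class count divided by the minority-class count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Toy labels standing in for data['obj_class'] (not the real dataset)
toy_labels = pd.Series([-1] * 20 + [1] * 2)
ratio = imbalance_ratio(toy_labels)
print(f"majority:minority = {ratio:.1f}:1")
```

On the real data, the same helper could be called as `imbalance_ratio(data['obj_class'])`.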

In [None]:
data[['legendary', 'mythical', 'can_evolve','baby_pokemon']].value_counts()

In an effort to balance the classes, we can use the fact that legendary and mythic pokemon are never baby pokemon and that only a handful of legendary pokemon can evolve. We can use these binary flags to reduce the analysis space.

In [None]:
filtered_data = data.loc[(data['can_evolve'] == False) & (data['baby_pokemon']==False)].copy()
filtered_data['obj_class'].value_counts()

Doing this has reduced the imbalance in the classes from 10:1 to about 6:1. Not great, but certainly better.

## Understanding pokemon stats

Now we explore the data to see if any pairs of battle stats can provide us with information to separate non-legendaries from their legendary counterparts.

In [None]:
colors = ['red' if i==1 else 'blue' for i in filtered_data['obj_class']]
filtered_data.plot.scatter('hp', 'attack', c=colors)

In [None]:
filtered_data.plot.scatter('attack', 'defense', c=colors)

In [None]:
filtered_data.plot.scatter('defense', 'speed', c=colors)

A quick overview shows that there is little hope of using individual stats to separate the legendaries, which makes sense from a game design standpoint. Having a spread of stats in those pokemon means that they can fill many niches in gameplay.

However, the aggregates of their stats might be better separated. Legendaries should - overall - have greater stats than normal pokemon to reward players that go through the effort of catching them.

In [None]:
filtered_data['total_attack_stats'] = filtered_data['attack'] + filtered_data['special_attack'] + filtered_data['speed']
filtered_data['total_defense_stats'] = filtered_data['hp'] + filtered_data['defense'] + filtered_data['special_defense']
filtered_data.plot.scatter('total_attack_stats', 'total_defense_stats', c=colors)

Indeed, this approach has separated the legendary pokemon better from the rest of the domain. 
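
To back the visual impression with numbers, one option is to summarize both aggregates per class. This is only a sketch: the toy values below are invented for illustration, and on the real data the same `groupby` would be applied to `filtered_data`.

```python
import pandas as pd

# Toy rows standing in for filtered_data (values are made up for illustration)
toy = pd.DataFrame({
    'total_attack_stats':  [200, 240, 260, 330, 350],
    'total_defense_stats': [210, 230, 250, 320, 340],
    'obj_class':           [-1, -1, -1, 1, 1],
})

# Mean, min and max of each aggregate, split by class
stat_cols = ['total_attack_stats', 'total_defense_stats']
summary = toy.groupby('obj_class')[stat_cols].agg(['mean', 'min', 'max'])
print(summary)
```

If the ranges of the two classes overlap heavily, the aggregates separate the classes less well than the scatter plot suggests.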

However, the classes are not linearly separable. For classification, we'll use a Support Vector Classifier with an RBF kernel.
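
Because an RBF kernel depends on distances between points, the scale of the features influences the fit. The sketch below shows one possible variant that standardizes the features, weights the classes to counter the imbalance, and stratifies the split. None of these choices (`StandardScaler`, `class_weight='balanced'`, `stratify`, `random_state`) are used in this notebook's own cells, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the two aggregate features (not the real data)
X_neg = rng.normal(loc=[230, 230], scale=30, size=(120, 2))
X_pos = rng.normal(loc=[340, 330], scale=25, size=(20, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([-1] * 120 + [1] * 20)

# stratify=y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit an RBF SVC with class weights
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
print(score)
```

Whether this variant actually improves on the plain classifier would have to be checked on the real data.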

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

In [None]:
x = filtered_data[['total_attack_stats', 'total_defense_stats']]
y = filtered_data['obj_class']
X_train, X_test, Y_train, Y_test = train_test_split(x,y, test_size=0.25)

In [None]:
clf = SVC(gamma='auto')
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
confusion_matrix(Y_test, pred)
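
With labels of -1 and 1, it helps to fix the label order explicitly when reading the matrix. The following sketch uses made-up labels and predictions to show the layout and how precision and recall for the legendary class follow from it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, only to illustrate the layout
y_true = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, 1])
y_pred = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, -1])

# labels=[-1, 1] fixes the order: row/column 0 is class -1, row/column 1 is class 1
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision and recall for the positive (legendary or mythic) class
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
print(precision, recall)
```

Rows are the true classes and columns are the predicted classes, so the bottom-left entry counts legendaries that the classifier missed.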

In [None]:
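# Predict on every row (training rows included) for plotting
filtered_data['y_pred'] = clf.predict(filtered_data[['total_attack_stats', 'total_defense_stats']])
# quality: 2 = true positive, -2 = true negative, 0 = misclassified
filtered_data['quality'] = filtered_data['y_pred'] + filtered_data['obj_class']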

In [None]:
colors = ['red' if i==2 else 'green' if i== 0 else 'blue' for i in filtered_data['quality']]
filtered_data.plot.scatter('total_attack_stats', 'total_defense_stats', c=colors, title='Legendary classification results')

We color true negatives blue, true positives red, and failed predictions (be they false positives or false negatives) green.
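
Summing the prediction and the true label maps both kinds of mistake to 0, so they cannot be told apart in the plot. If that distinction matters, one possible alternative is to label each row explicitly. The sketch below uses a toy frame; the `outcome` column and the palette are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one row per possible outcome (not the real data)
toy = pd.DataFrame({
    'obj_class': [1, 1, -1, -1],
    'y_pred':    [1, -1, 1, -1],
})

conditions = [
    (toy['obj_class'] == 1) & (toy['y_pred'] == 1),
    (toy['obj_class'] == -1) & (toy['y_pred'] == -1),
    (toy['obj_class'] == -1) & (toy['y_pred'] == 1),
    (toy['obj_class'] == 1) & (toy['y_pred'] == -1),
]
names = ['true positive', 'true negative', 'false positive', 'false negative']
toy['outcome'] = np.select(conditions, names, default='unknown')

# Hypothetical palette: one color per outcome
palette = {
    'true positive': 'red',
    'true negative': 'blue',
    'false positive': 'orange',
    'false negative': 'green',
}
outcome_colors = toy['outcome'].map(palette).tolist()
print(toy)
```

The resulting list can be passed as `c=` to `plot.scatter`, in the same way as the `colors` list.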

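As we can see, despite the class imbalance, the results are broadly in line with expectations, with only a few misclassifications. Note that this plot covers the whole filtered dataset, including the rows used for training, so it likely looks better than the performance on unseen pokemon would; the confusion matrix computed on the test split above is the fairer measure.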