Pokemon battles revolve around each Pokemon's stats, which govern its performance in battle. No two Pokemon species are alike, and some naturally have higher, or "better", stats than others.

This is particularly true of Legendary Pokemon, which, befitting their name, tend to have some of the highest stats of all Pokemon.

What if we try to classify Pokemon as legendary or not legendary based on those stats?

In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model

pokemon = pd.read_csv('../input/Pokemon.csv')

Consistent with our hypothesis, legendary Pokemon have noticeably higher stats on average.

In [None]:
# Average only the numeric columns; text columns such as the name cannot be averaged.
pokemon.groupby('Legendary').mean(numeric_only=True)

We can see this effect's distribution more clearly using a box plot.

In [None]:
import seaborn as sns
%matplotlib inline

sns.boxplot(x='Legendary', y='Total', data=pokemon)

This plot shows us that the legendary/not legendary boundary is actually rather fuzzy, with a tail of very good "normal" Pokemon whose stats equal or even exceed legendary ones.

Let's take a closer look at the boundary between the two.

In [None]:
sns.swarmplot(x='Legendary', y='Total', data=pokemon.query('550 < Total < 800'))

Here we focus on the area of overlap. Looking at this closely, can you see a "natural" boundary point between the two classes?

Actually, yes. 650 seems like a natural break point, and this is indeed the number that our model will pick, too.

The trouble is, it leaves a lot of legendary Pokemon on the wrong side of the fence!

Let's fit a logistic regressor to our data and see what happens.

In [None]:
pokemon_sorted = pokemon.sort_values(by='Total')

# One feature column for X; scikit-learn expects a 1-D array of labels for y.
X = pokemon_sorted['Total'].values.reshape(-1, 1)
y = pokemon_sorted['Legendary'].values
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)

In [None]:
totals = X.flatten()
y_predicted = clf.predict(X).flatten()
y_actual = y.flatten()

prediction_correct = (y_actual == y_predicted)

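In the plot below, blue points are correct classifications and red ones are incorrect (note, however, that the points overlap, so this isn't an accurate count by any means). The black curve is the model's predicted probability that a Pokemon with a given stat total is legendary.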

In [None]:
import matplotlib.pyplot as plt

plt.figure(1, figsize=(8, 4))
# Blue marks a correct prediction, red an incorrect one.
plt.scatter(totals, y_predicted, color=['steelblue' if p else 'darkred' for p in prediction_correct])

# Predicted probability of being legendary across the range of stat totals.
X_test = np.linspace(150, 900, 300)
predicted_probability = (1/(1 + np.exp(-(clf.intercept_ + X_test*clf.coef_)))).ravel()
plt.plot(X_test, predicted_probability, color='black')
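
The black curve is a sigmoid, and the classifier switches from "not legendary" to "legendary" where that curve crosses 0.5, which happens at a stat total of -intercept / coefficient. The sketch below illustrates this with hypothetical coefficients (`w` and `b` are made up so that the crossing lands on 650; they are not the fitted values from this notebook).

```python
import numpy as np

def sigmoid(t):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients, chosen only so the numbers are easy to read.
# With a fitted model you would use clf.coef_[0, 0] and clf.intercept_[0].
w, b = 0.02, -13.0

# The predicted probability equals 0.5 exactly where b + w * total == 0.
boundary = -b / w

for total in (550, boundary, 750):
    print(total, round(float(sigmoid(b + w * total)), 3))
```

Totals below the crossing point get a probability under 0.5 and are predicted as not legendary; totals above it are predicted as legendary.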

We can look at the confusion matrix for our classification to see what we get wrong.

In [None]:
from sklearn import metrics

metrics.confusion_matrix(y_actual, y_predicted)

This tells us that we classified 720 non-legendary Pokemon correctly and 15 incorrectly (we got just 2% of non-legendary Pokemon wrong, not too shabby). On the other hand, we only classified 28 legendary Pokemon correctly; we got 37 of them wrong (we're wrong more than half the time!).
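
The rates quoted above can be reproduced from the four counts. This is a small sketch that types the counts in by hand, assuming scikit-learn's layout for the matrix (rows are actual classes, columns are predicted classes, in the order non-legendary then legendary).

```python
import numpy as np

# Counts as described in the text, laid out as [[TN, FP], [FN, TP]].
cm = np.array([[720, 15],
               [37, 28]])

tn, fp, fn, tp = cm.ravel()

# Share of non-legendary Pokemon wrongly flagged as legendary.
false_positive_rate = fp / (tn + fp)
# Share of legendary Pokemon the model finds (recall) and misses.
recall = tp / (tp + fn)
miss_rate = fn / (tp + fn)

print(f"non-legendary misclassified: {false_positive_rate:.1%}")
print(f"legendary found: {recall:.1%}, legendary missed: {miss_rate:.1%}")
```

Overall accuracy looks high only because non-legendary Pokemon dominate the data; the miss rate on the legendary class is what exposes the problem.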

Perhaps it would be more helpful to think of our two classes not as "Legendary" and "Not Legendary", but as "Legendary or On Par with Legendary" (e.g. Pseudo-Legendary) and "Not Legendary and Not On Par with Legendary". After all, it's pretty clear that some of our ordinary Pokemon have extraordinary stats; if we're interested in finding "good" Pokemon, that's clearly more meaningful than the in-game label.

In such a classification scheme we're interested in "rare case" high-stat Pokemon, with our boundary being the legendary Pokemon boundary. We can achieve this by decreasing our emphasis on the outcomes of non-legendary Pokemon, using the class_weight argument for our classifier.

In [None]:
clf = linear_model.LogisticRegression(C=1e5, class_weight={0: 0.2, 1: 1})
clf.fit(X, y)
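
To get a feel for why down-weighting the majority class moves the boundary, here is a self-contained sketch on synthetic data. Everything in it is an assumption made for illustration: the two normal distributions stand in for the real stat totals, `fitted_boundary` is a hypothetical helper, and the feature is standardized before fitting only to keep the default solver well behaved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Pokemon data (NOT the real dataset): a large
# "ordinary" class and a small high-total class whose totals overlap.
rng = np.random.default_rng(0)
totals = np.concatenate([rng.normal(430, 80, 700), rng.normal(620, 80, 65)])
labels = np.concatenate([np.zeros(700, dtype=int), np.ones(65, dtype=int)])

def fitted_boundary(class_weight=None):
    """Fit on the standardized total; return the 50% cut-off in raw units."""
    mu, sigma = totals.mean(), totals.std()
    z = ((totals - mu) / sigma).reshape(-1, 1)
    clf = LogisticRegression(C=1e5, class_weight=class_weight, max_iter=1000)
    clf.fit(z, labels)
    # The probability is 0.5 where intercept + coef * z == 0.
    z_cut = -clf.intercept_[0] / clf.coef_[0, 0]
    return mu + sigma * z_cut

plain = fitted_boundary()
weighted = fitted_boundary({0: 0.2, 1: 1})
print(f"unweighted cut-off: {plain:.0f}, weighted cut-off: {weighted:.0f}")
```

Mistakes on the majority class now cost only a fifth as much, so the fit tolerates more false positives in exchange for fewer missed rare cases, and the cut-off slides toward lower totals.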

In [None]:
totals = X.flatten()
y_predicted = clf.predict(X).flatten()
y_actual = y.flatten()

prediction_correct = (y_actual == y_predicted)

plt.figure(1, figsize=(8, 4))
# Blue marks a correct prediction, red an incorrect one.
plt.scatter(totals, y_predicted, color=['steelblue' if p else 'darkred' for p in prediction_correct])

# Predicted probability of being legendary under the re-weighted model.
X_test = np.linspace(150, 900, 300)
predicted_probability = (1/(1 + np.exp(-(clf.intercept_ + X_test*clf.coef_)))).ravel()
plt.plot(X_test, predicted_probability, color='black')

In [None]:
metrics.confusion_matrix(y_actual, y_predicted)

This model is much more satisfying. We classify all legendary Pokemon correctly, and classify an additional 50 "normal" Pokemon as pseudo-legendary. The remaining 685 Pokemon are classified as nothing special.

In other words, 8.125% of Pokemon are legendary, 6.25% are pseudo-legendary, and the remaining 85.625% are neither.

And the boundary value is a stat total of 575!
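
The class shares quoted above follow directly from the counts in the two confusion matrices. The sketch below just redoes that arithmetic with the counts typed in by hand; the variable names are illustrative.

```python
# Counts taken from the confusion matrices described in the text.
n_legendary = 28 + 37    # all legendary Pokemon
n_ordinary = 720 + 15    # all non-legendary Pokemon
n_pseudo = 50            # non-legendary Pokemon the weighted model flags
n_total = n_legendary + n_ordinary

shares = {
    "legendary": n_legendary / n_total,
    "pseudo-legendary": n_pseudo / n_total,
    "neither": (n_ordinary - n_pseudo) / n_total,
}

for name, share in shares.items():
    print(f"{name}: {share:.3%}")
```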