## Investigating the effect of types on victories

I remember that in my pokemon days on the Game Boy, it was such a rush when you had your Charizard out, a Grass pokemon showed up, and you just blasted it away.
That was the feature that stuck with me most, so I wanted to do some visualization to see how big the effect really is, and then try to predict winners with this feature
if the effect turns out to be significant.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [2]:
pokemon_df = pd.read_csv('../input/pokemon.csv')
combats_df = pd.read_csv('../input/combats.csv')

In [3]:
pokemon_df.info()

In [4]:
pokemon_df[['Type 1', 'Type 2']].isnull().sum()

It seems that Type 1 has no nulls, but Type 2 has plenty, so I guess not all pokemon are dual-typed. Let's start by looking at just Type 1.

In [5]:
# Use a sorted list rather than a set so the row/column order is deterministic.
possible_types = sorted(pokemon_df['Type 1'].unique())
matches_by_type = pd.DataFrame(0, index=possible_types, columns=possible_types, dtype=int)
wins_by_type = pd.DataFrame(0, index=possible_types, columns=possible_types, dtype=int)

# join the types to the combats table to make it a little easier. 
combats_df = pd.merge(combats_df, pokemon_df, left_on='First_pokemon', right_on='#', how='left')
combats_df = combats_df.rename(mapper={"#":"First_pokemon #", "Name":"First_pokemon Name", "Type 2": "First_pokemon Type 2", "Type 1":"First_pokemon Type 1","HP": "First_pokemon HP", "Attack": "First_pokemon Attack", 'Defense': "First_pokemon Defense", 'Sp. Atk': "First_pokemon Sp.Atk", 'Sp. Def': "First_pokemon Sp.Def", 'Speed': "First_pokemon Speed", 'Generation': "First_pokemon Generation", 'Legendary': "First_pokemon Legendary"}, axis='columns')

combats_df = pd.merge(combats_df, pokemon_df, left_on='Second_pokemon', right_on='#', how='left')
combats_df = combats_df.rename(mapper={"#":"Second_pokemon #", "Name":"Second_pokemon Name", "Type 2": "Second_pokemon Type 2", "Type 1":"Second_pokemon Type 1","HP": "Second_pokemon HP", "Attack": "Second_pokemon Attack", 'Defense': "Second_pokemon Defense", 'Sp. Atk': "Second_pokemon Sp.Atk", 'Sp. Def': "Second_pokemon Sp.Def", 'Speed': "Second_pokemon Speed", 'Generation': "Second_pokemon Generation", 'Legendary': "Second_pokemon Legendary"}, axis='columns')

combats_df
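
As an aside, the two long `rename` mappers could probably be avoided. A minimal sketch, assuming a toy `pokemon`/`combats` pair with made-up rows (not the real CSVs), applies `DataFrame.add_prefix` before each merge. Note that this keeps the original stat names, so a column would come out as `First_pokemon Sp. Atk` rather than the `First_pokemon Sp.Atk` used in this notebook.

```python
import pandas as pd

# Toy stand-ins for pokemon.csv and combats.csv (rows are made up for illustration).
pokemon = pd.DataFrame({
    "#": [1, 2, 3],
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Water"],
})
combats = pd.DataFrame({
    "First_pokemon": [1, 2],
    "Second_pokemon": [2, 3],
    "Winner": [2, 3],
})

merged = combats
for side in ("First_pokemon", "Second_pokemon"):
    # Prefix every pokemon column, e.g. "Type 1" -> "First_pokemon Type 1".
    prefixed = pokemon.add_prefix(side + " ")
    merged = merged.merge(prefixed, left_on=side, right_on=side + " #", how="left")

print(merged.columns.tolist())
```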

Let's count up the matches between types, and the wins.
We will make this analysis from the perspective of **First_pokemon**, so read the value of wins_by_type['Grass']['Fire'] as:

### The number of times First_pokemon has beaten Second_pokemon when First_pokemon was Grass, and Second_pokemon was Fire.


In [6]:
for index, row in combats_df.iterrows():
    p1_type = row['First_pokemon Type 1']
    p2_type = row['Second_pokemon Type 1']
    # Column = First_pokemon type, row = Second_pokemon type.
    # .loc avoids chained assignment, which pandas may apply to a temporary copy.
    matches_by_type.loc[p2_type, p1_type] += 1
    if row['First_pokemon'] == row['Winner']:
        wins_by_type.loc[p2_type, p1_type] += 1

In [7]:
wins_by_type_probabilities = wins_by_type / matches_by_type 
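
The counting loop and the division above could also be sketched without iterating over rows. The example below is only an illustration on a toy table with made-up fights (the `toy` frame and its `first_won` column are assumptions, not part of the dataset), using `pd.crosstab` with the same orientation: columns are First_pokemon's type, rows are Second_pokemon's type.

```python
import pandas as pd

# Toy stand-in for the merged combats table (values are made up for illustration).
toy = pd.DataFrame({
    "First_pokemon Type 1":  ["Fire", "Fire", "Grass", "Water", "Fire"],
    "Second_pokemon Type 1": ["Grass", "Grass", "Fire", "Fire", "Water"],
    "First_pokemon":  [1, 1, 2, 3, 1],
    "Second_pokemon": [2, 2, 1, 1, 3],
    "Winner":         [1, 1, 1, 3, 3],
})
toy["first_won"] = (toy["First_pokemon"] == toy["Winner"]).astype(int)

rows = toy["Second_pokemon Type 1"]
cols = toy["First_pokemon Type 1"]

# Number of matches for every (second type, first type) pair.
matches = pd.crosstab(rows, cols)

# Mean of the 0/1 flag = share of matches won by First_pokemon.
# Pairs that never met come out as NaN.
win_rate = pd.crosstab(rows, cols, values=toy["first_won"], aggfunc="mean")

print(matches)
print(win_rate)
```

Indexing then follows the same convention as before: `win_rate["Fire"]["Grass"]` is the share of wins for a Fire First_pokemon against a Grass Second_pokemon.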

Let's take a gander at the win ratios.

In [8]:
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(wins_by_type_probabilities, vmax=1.0, square=True, ax=ax)
ax.set_xlabel("First_pokemon Type 1")
ax.set_ylabel("Second_pokemon Type 1");

Just to verify that I did the math and counting correctly...

In [9]:
# .copy() so that adding a column below does not trigger a SettingWithCopyWarning.
d = combats_df.loc[(combats_df['First_pokemon Type 1'] == "Electric") & (combats_df['Second_pokemon Type 1'] == "Fairy")].copy()
d['first_won'] = d['First_pokemon'] == d['Winner']
print(d['first_won'].sum() / d['first_won'].count())
print(wins_by_type_probabilities['Electric']['Fairy'])

Based on win percentages averaged over the opposing types, it looks like

**FLYING** is the most likely to win, by quite a margin.

In [10]:
wins_by_type_probabilities.mean().sort_values(ascending=False)
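
One caveat worth keeping in mind: `.mean()` over the probability table gives every opposing type the same weight, no matter how many matches were actually played against it. The sketch below, on a made-up toy table (all rows and the `toy` name are assumptions for illustration), shows how that unweighted average can differ from the plain share of matches won.

```python
import pandas as pd

# Toy fights, made up for illustration: Flying meets Bug three times, Rock once.
toy = pd.DataFrame({
    "First_pokemon Type 1":  ["Flying", "Flying", "Flying", "Flying", "Bug", "Bug"],
    "Second_pokemon Type 1": ["Bug", "Bug", "Bug", "Rock", "Rock", "Flying"],
    "First_pokemon won":     [1, 1, 1, 0, 1, 0],
})

# Win rate per (first type, second type) pair, then averaged per first type:
# this mirrors taking .mean() over the probability table.
per_pair = toy.groupby(
    ["First_pokemon Type 1", "Second_pokemon Type 1"]
)["First_pokemon won"].mean()
unweighted = per_pair.groupby(level="First_pokemon Type 1").mean()

# Share of all matches won per first type (each match counts once).
weighted = toy.groupby("First_pokemon Type 1")["First_pokemon won"].mean()

print(unweighted)
print(weighted)
```

In this toy data Flying averages 0.5 across opposing types but wins 0.75 of its matches, so the two rankings need not agree.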

To make the modelling easier, I'm going to reframe the prediction target as a new binary column: **First_pokemon won**

That way we can use a basic logistic regression as a binary classifier.

In [11]:
combats_df['First_pokemon won'] = (combats_df['First_pokemon'] == combats_df['Winner']).astype(int)
combats_df

Now we can look at the win percentage based on the differences in pokemon stats.

Let's start with **HP**.

In [57]:
def win_percentage_by_stat(first_stat, second_stat, stat_name):
    # Stat difference from First_pokemon's point of view (positive = first has more).
    combats_df['diff'] = combats_df[first_stat] - combats_df[second_stat]
    # Cut the differences into 10 quantile bins (deciles) of roughly equal size.
    stat_bins = pd.qcut(combats_df['diff'], 10)
    bin_col = stat_name + " Bins"
    d = pd.DataFrame({bin_col: stat_bins, "First Won": combats_df['First_pokemon won']})
    bins = sorted(set(d[bin_col]))

    # Fraction of matches won by First_pokemon within each bin.
    percentages = []
    for b in bins:
        bin_rows = d.loc[d[bin_col] == b]
        win_percentage = (bin_rows['First Won'] == 1).sum() / bin_rows['First Won'].count()
        percentages.append(win_percentage)

    results = pd.DataFrame({bin_col: bins, "Win Percentage": percentages})
    return results
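
The same binning idea can be sketched more compactly with `groupby`. This is only an illustration on synthetic data: the helper name `win_rate_by_diff` and the rule that the first pokemon wins whenever its stat is strictly higher are assumptions made up for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

def win_rate_by_diff(diff, won, n_bins=10):
    """Share of wins per quantile bin of a stat difference (hypothetical helper)."""
    bins = pd.qcut(diff, n_bins)
    # Mean of a 0/1 column is the fraction of ones in each bin.
    return won.groupby(bins, observed=True).mean()

# Synthetic data: stat differences from -50 to 49, and the first pokemon
# "wins" exactly when its stat is strictly higher (a made-up rule).
diff = pd.Series(np.arange(-50, 50))
won = (diff > 0).astype(int)

rates = win_rate_by_diff(diff, won)
spread = rates.max() - rates.min()
print(rates)
print("Spread:", spread)
```

The spread here is the same max-minus-min figure printed in the cells that follow, just computed on the grouped result.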

In [69]:
results = win_percentage_by_stat("First_pokemon HP", "Second_pokemon HP", "HP")
print("Spread:" + str(max(results['Win Percentage']) - min(results['Win Percentage'])))
results

In [70]:
results = win_percentage_by_stat("First_pokemon Attack", "Second_pokemon Attack", "Attack")
print("Spread:" + str(max(results['Win Percentage']) - min(results['Win Percentage'])))
results

In [71]:
results = win_percentage_by_stat("First_pokemon Defense", "Second_pokemon Defense", "Defense")
print("Spread:" + str(max(results['Win Percentage']) - min(results['Win Percentage'])))
results

In [72]:
results = win_percentage_by_stat("First_pokemon Sp.Atk", "Second_pokemon Sp.Atk", "Sp. Atk")
print("Spread:" + str(max(results['Win Percentage']) - min(results['Win Percentage'])))
results

In terms of effect size, **Attack** seems to be the best stat to have an advantage over your opponent in, whereas **Defense** doesn't seem to matter as much.

We could also do cross-stat comparisons like Attack - Defense.

In [75]:
results = win_percentage_by_stat("First_pokemon Attack", "Second_pokemon Defense", "Atk - Def")
print("Spread:" + str(max(results['Win Percentage']) - min(results['Win Percentage'])))
results

I'll just stick with the basic stats for my first model.