# 1. Overview
In this mini project, I aim to use the Pokemon Index and combats datasets to explore data cleaning, merging, and the following data analysis questions:
    1. What is the winning percentage of each pokemon?
    2. What are the top 10 pokemons with the highest win percentage?
    3. What are the top 10 pokemons with the lowest win percentage?
    4. Which pokemon stat has the strongest correlation to win percentage?
    5. Are Type 1 = Rock pokemons more likely to win a combat? 

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 2. Data Profile

What entities/terms/features need to be extracted?

*    Pokemon dataset: Name, Type, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed
*    Combats dataset: win percentage 

Are there restrictions or limitations to using it?
* Null entries in the datasets can be an issue. For example, there may not be combat data for all pokemons.
    

If it is a dataset, what would you need or not need from it to explore your question?
* I need all parts of the combat dataset.
* While the pokemon dataset provides detailed stats on each pokemon, to explore my questions I do not need the data regarding Type 2 (because not all pokemon have a Type 2), Legendary status, and Development Stage.
* For the pokemon dataset I will be focusing on the data regarding pokemon Name, Type, HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed.
* The combat data does not have a win percentage column. This means that in order to tie the pokemon data meaningfully to the combat data, I have to engineer each pokemon's win percentage from the combat data.
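
To make the last point concrete, here is a minimal sketch of engineering a win rate from a combat log. The toy table is made up and only assumes the same three columns as the combats file (`First_pokemon`, `Second_pokemon`, `Winner`); it is an illustration of the idea, not the exact code used later in this notebook.

```python
import pandas as pd

# Toy combat log with the same three columns as combats.csv (values are made up)
toy_combat = pd.DataFrame({
    "First_pokemon":  [1, 2, 3, 1],
    "Second_pokemon": [2, 3, 1, 3],
    "Winner":         [1, 3, 3, 1],
})

# Number of wins per pokemon
wins = toy_combat["Winner"].value_counts()

# Number of fights per pokemon: appearances in either slot
fights = pd.concat(
    [toy_combat["First_pokemon"], toy_combat["Second_pokemon"]]
).value_counts()

# Reindex wins onto every pokemon that fought, so a pokemon with
# no wins gets 0 instead of silently dropping out of the result
win_rate = (wins.reindex(fights.index, fill_value=0) / fights).sort_index()
print(win_rate)
```

In this toy example pokemon 2 fought twice and never won, so it shows up with a win rate of 0 rather than disappearing.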


# 3. Data Analysis: Initial Understanding
Are there missing values? What are the dimensions of the datasets?

In [None]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#import files
#pokemon2 = pd.read_csv("../input/pokemon-index-edited/pokemon2.csv")
pokemon = pd.read_csv("../input/pokemon-challenge/pokemon.csv")
combat = pd.read_csv("../input/pokemon-challenge/combats.csv")
#tests = pd.read_csv("../input/pokemon-challenge/tests.csv")


# rename the "#" column in the pokemon data to "Number" so it is easier to reference
pokemon = pokemon.rename(columns={"#": "Number"})
pokemon.head()

In [None]:
combat.head()

In [None]:
print("Dimensions of Pokemon: " + str(pokemon.shape))
print("Dimensions of Combat: " + str(combat.shape))

To better understand the dataset, let's check for missing values.

In [None]:
pokemon.isnull().sum()


In [None]:
combat.isnull().sum()

Initial review thoughts:
* There is 1 missing name.
* There are 386 missing values in the Type 2 column. This could mean that 386 pokemons do not have a second type.
* There are 800 different pokemon and 50k battles in the datasets.
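
A small sketch of how the missing entries could be inspected more closely. The toy table and its names are made up; on the real data the same calls would be applied to `pokemon`.

```python
import pandas as pd

# Toy stand-in for the pokemon table (names and types are made up)
toy_pokemon = pd.DataFrame({
    "Number": [1, 2, 3, 4],
    "Name":   ["Alpha", None, "Gamma", "Delta"],
    "Type 1": ["Grass", "Fire", "Rock", "Water"],
    "Type 2": ["Poison", None, None, "Flying"],
})

# Missing values per column, as in the cells above
missing_per_column = toy_pokemon.isnull().sum()

# Look at the actual rows that have no name
rows_without_name = toy_pokemon[toy_pokemon["Name"].isnull()]

# Share of pokemon with no second type
share_single_type = toy_pokemon["Type 2"].isnull().mean()

print(missing_per_column)
print(rows_without_name)
print(share_single_type)
```

Looking at the row itself (rather than only the count) shows which `Number` has the missing name, which helps decide whether to fill it in or leave it.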

# 3. Data Analysis: What is the winning percentage of each pokemon?
The combat data set provides win information between two pokemons that battled each other. This is not very useful in the current form. It is more useful for me to turn this information into winning percentage for each pokemon.

In [None]:
# Count how many distinct pokemon won at least one battle
total_Wins = len(combat.Winner.value_counts())
print(total_Wins)

In [None]:
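# get the number of wins for each pokemon (indexed by the winner's number;
# pokemon that never won do not appear in this table)
numberOfWins = combat.groupby('Winner').count()
# count how often each pokemon fought in the first and in the second slot
countByFirst = combat.groupby('First_pokemon').count()
countBySecond = combat.groupby('Second_pokemon').count()
# Finding the total fights of each pokemon
# (fill_value=0 covers a pokemon that only ever fought in one of the two slots)
numberOfWins['Total Fights'] = countByFirst.Winner.add(countBySecond.Winner, fill_value=0)
# Finding the win percentage of each pokemon, stored as a fraction between 0 and 1
numberOfWins['Win Percentage'] = numberOfWins.First_pokemon / numberOfWins['Total Fights']
print(numberOfWins)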

In [None]:
# Merge the original pokemon dataset with the winning dataset.
# how='left' keeps every pokemon, including those that never appear as a winner (their Win Percentage is NaN).
results3 = pd.merge(pokemon, numberOfWins, left_on='Number', right_index = True, how='left')
results3
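
A hedged sketch of how to check which pokemon end up without a win percentage after a left merge. The toy tables are made up; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Toy pokemon table and toy win table (values are made up)
toy_pokemon = pd.DataFrame({
    "Number": [1, 2, 3],
    "Name": ["Alpha", "Beta", "Gamma"],
})
toy_wins = pd.DataFrame(
    {"Win Percentage": [0.75, 0.25]},
    index=pd.Index([1, 3], name="Winner"),
)

# Left merge keeps all three pokemon; the _merge column marks unmatched rows
merged = pd.merge(
    toy_pokemon, toy_wins,
    left_on="Number", right_index=True,
    how="left", indicator=True,
)

# Names of pokemon that have no row in the win table
no_win_data = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(no_win_data)
```

Here "Beta" is the only pokemon without a matching row, so its `Win Percentage` is NaN after the merge.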


# 3. Data Analysis: What are the top 10 pokemons with the highest win percentage?

In [None]:
results3[np.isfinite(results3['Win Percentage'])].sort_values(by = ['Win Percentage'], ascending = False ).head(10)

Observations: Mega Aerodactyl has the highest winning percentage. Within the top 10 winning pokemons, there are many with 'Mega' in their name. This could mean that Mega versions of pokemons are much more powerful and therefore more likely to win a combat. 

# 3. Data Analysis: Find the top 10 pokemons with the lowest win percentage

In [None]:
results3[np.isfinite(results3['Win Percentage'])].sort_values(by = ['Win Percentage'], ascending = True ).head(10)

Observations: Silcoon has the lowest win percentage. Pokemons with a low win percentage generally have low stats across the board.

# 3. Data Analysis: Which pokemon stat has the strongest correlation to Win Percentage?
From a scan of the stats of the pokemons with the highest and the lowest win percentages, it looks like Attack and Speed likely have the strongest correlation with win percentage.

In [None]:
#plot graph of Speed vs Win Percentage
# logistic=True makes seaborn fit a logistic curve through statsmodels, which is slower than a linear fit
sns.regplot(x="Speed", y="Win Percentage", data=results3, logistic=True).set_title("Speed vs Win Percentage")


In [None]:
#plot graph of Attack vs Win Percentage
sns.regplot(x="Attack", y="Win Percentage", data=results3, logistic=True).set_title("Attack vs Win Percentage")

Observation: Speed has a stronger correlation with Win Percentage than Attack. The correlation looks strong enough that Speed could be used as a way to predict Win Percentage.
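
The plots give a visual impression only. A minimal sketch of how the comparison could be put into numbers is to compute the Pearson correlation of each stat with the win percentage. The toy values below are made up purely to show the call; they are not results from the real data.

```python
import pandas as pd

# Toy data (made up): Speed tracks the win rate closely, Attack only loosely
toy = pd.DataFrame({
    "Speed":  [30, 50, 70, 90, 110],
    "Attack": [80, 40, 100, 60, 90],
    "Win Percentage": [0.10, 0.35, 0.55, 0.80, 0.95],
})

stats = ["Speed", "Attack"]

# Pearson correlation of each stat column with the win percentage,
# sorted so the most strongly correlated stat comes first
correlations = toy[stats].corrwith(toy["Win Percentage"]).sort_values(ascending=False)
print(correlations)
```

On `results3` the same call would use all six stat columns, and it has to run while `Win Percentage` is still numeric, before it is converted into range groups for the parallel coordinates plot.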

Let's use a parallel coordinates plot to get an overview of the pokemon stats and their relationship with win percentage (color coded).

In [None]:
#get the basic statistics of the data to help find ranges in Win Percentage to divide the continuous variable to categorical variable.
results3.describe()

0, 0.25, 0.50, 0.75, and 1.0 seem to be good break points for dividing the Win Percentage data into ranges.
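
A short sketch of how `pd.cut` turns a continuous column into range groups with these break points. The values are made up. One detail worth knowing: by default the intervals are open on the left, so a value of exactly 0 falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up win rates, including the two edge values 0.0 and 1.0
win_rate = pd.Series([0.0, 0.10, 0.25, 0.40, 0.60, 0.90, 1.0])
bins = [0, 0.25, 0.50, 0.75, 1.0]

# Default behaviour: intervals are (0, 0.25], (0.25, 0.5], ...
# so the value 0.0 is not assigned to any bin and becomes NaN
default_bins = pd.cut(win_rate, bins)

# include_lowest=True makes the first interval include its left edge
inclusive_bins = pd.cut(win_rate, bins, include_lowest=True)

print(default_bins.isnull().sum())
print(inclusive_bins.value_counts().sort_index())
```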

In [None]:
#import the extra library needed for the parallel coordinates plot
#(numpy, pandas and matplotlib were already imported above)
from matplotlib import ticker


In [None]:

# 'Win Percentage' is a continuous variable. To use it as a categorical variable for colouring the
# parallel coordinates, divide it into range groups. Note that this overwrites the numeric column.
results3['Win Percentage'] = pd.cut(results3['Win Percentage'], [0, 0.25, 0.50, 0.75, 1.0])

#defining variables
cols = ['HP', 'Attack', 'Defense','Sp. Atk', 'Sp. Def','Speed' ]
x = [i for i, _ in enumerate(cols)]
colours = ['DodgerBlue', 'MediumAquamarine', 'Gold', 'OrangeRed']

# create dict of categories: colours
colours = {results3['Win Percentage'].cat.categories[i]: colours[i] for i, _ in enumerate(results3['Win Percentage'].cat.categories)}

# Create (X-1) subplots along x axis
fig, axes = plt.subplots(1, len(x)-1, sharey=False, figsize=(15,5))

# Get min, max and range for each column
# Normalize the data for each column
min_max_range = {}
for col in cols:
    min_max_range[col] = [results3[col].min(), results3[col].max(), np.ptp(results3[col])]
    results3[col] = np.true_divide(results3[col]- results3[col].min(), np.ptp(results3[col]))
    
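# Drop rows with missing values before plotting.
# Note: dropna() with no arguments also drops pokemon whose Name or Type 2 is missing,
# not only those without a Win Percentage.
results3 = results3.dropna()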
# Plot each row
for i, ax in enumerate(axes):
    for idx in results3.index:
        Win_category = results3.loc[idx,'Win Percentage']
        ax.plot(x, results3.loc[idx, cols], colours[Win_category])
    ax.set_xlim([x[i], x[i+1]])


# Set the tick positions and labels on y axis for each plot
# Tick positions based on normalised data
# Tick labels are based on original data
def set_ticks_for_axis(dim, ax, ticks):
    min_val, max_val, val_range = min_max_range[cols[dim]]
    step = val_range / float(ticks-1)
    tick_labels = [round(min_val + step * i, 2) for i in range(ticks)]
    norm_min = results3[cols[dim]].min()
    norm_range = np.ptp(results3[cols[dim]])
    norm_step = norm_range / float(ticks-1)
    ticks = [round(norm_min + norm_step * i, 2) for i in range(ticks)]
    ax.yaxis.set_ticks(ticks)
    ax.set_yticklabels(tick_labels)
    
for dim, ax in enumerate(axes):
    ax.xaxis.set_major_locator(ticker.FixedLocator([dim]))
    set_ticks_for_axis(dim, ax, ticks=6)
    ax.set_xticklabels([cols[dim]])
    
# Move the final axis' ticks to the right-hand side
ax = plt.twinx(axes[-1])
dim = len(axes)
ax.xaxis.set_major_locator(ticker.FixedLocator([x[-2], x[-1]]))
set_ticks_for_axis(dim, ax, ticks=6)
ax.set_xticklabels([cols[-2], cols[-1]])


# Remove space between subplots
plt.subplots_adjust(wspace=0)

# Add legend to plot
plt.legend(
    [plt.Line2D((0,1),(0,0), color=colours[cat]) for cat in results3['Win Percentage'].cat.categories],
    results3['Win Percentage'].cat.categories,
    bbox_to_anchor=(1.2, 1), loc=2, borderaxespad=0.)

plt.title("Pokemon Index")

plt.show()


Observation: The parallel coordinates plot indicates that Defense and Sp. Def do not correlate strongly with Win Percentage, as the colors are more mixed along those axes. Speed's strong correlation with Win Percentage is reflected again by the clear color bands and little mixing along its axis. Attack and Sp. Atk correlate moderately with Win Percentage.
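
One way to back up the color-band reading with numbers is to average each stat within each win percentage range. This is a minimal sketch on made-up values; the column name `Win Bin` is my own choice here, used so that the numeric `Win Percentage` column stays intact.

```python
import pandas as pd

# Made-up values: Speed rises with the win rate, Defense does not
toy = pd.DataFrame({
    "Speed":   [20, 35, 60, 65, 95, 110],
    "Defense": [70, 40, 90, 50, 60, 80],
    "Win Percentage": [0.10, 0.20, 0.45, 0.55, 0.85, 0.95],
})

# Store the range group in a separate column instead of overwriting the numeric one
toy["Win Bin"] = pd.cut(toy["Win Percentage"], [0, 0.25, 0.50, 0.75, 1.0])

# Mean of each stat per range group; observed=True only lists groups that occur
bin_means = toy.groupby("Win Bin", observed=True)[["Speed", "Defense"]].mean()
print(bin_means)
```

A stat whose mean climbs steadily from the lowest to the highest range behaves like the clean color bands in the plot, while a stat whose means jump around matches the mixed colors.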

# 3. Data Analysis: Are Type 1 = Rock pokemons more likely to win a combat?

Let's filter the parallel coordinates graph by pokemon type. The pokemon with the highest Win Percentage has Rock as its Type 1, so let's find out how Type 1 = Rock pokemons perform in general.

In [None]:
#filter data by "Type 1" (copy so the later column assignments do not act on a slice of the original)
results3 = results3[results3['Type 1']=='Rock'].copy()



In [None]:
#replot parallel coordinates graph with filtered data

# Create (X-1) subplots along x axis
fig, axes = plt.subplots(1, len(x)-1, sharey=False, figsize=(15,5))

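# Get min, max and range for each column
# Normalize the data for each column
# Note: results3 was already scaled to 0-1 for the previous plot, so the values recorded here
# (and the tick labels built from them) are in those scaled units, not raw stat values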
min_max_range = {}
for col in cols:
    min_max_range[col] = [results3[col].min(), results3[col].max(), np.ptp(results3[col])]
    results3[col] = np.true_divide(results3[col]- results3[col].min(), np.ptp(results3[col]))
    
results3 = results3.dropna()
# Plot each row
for i, ax in enumerate(axes):
    for idx in results3.index:
        Win_category = results3.loc[idx,'Win Percentage']
        ax.plot(x, results3.loc[idx, cols], colours[Win_category])
    ax.set_xlim([x[i], x[i+1]])


# Set the tick positions and labels on y axis for each plot
# Tick positions based on normalised data
# Tick labels are based on original data
def set_ticks_for_axis(dim, ax, ticks):
    min_val, max_val, val_range = min_max_range[cols[dim]]
    step = val_range / float(ticks-1)
    tick_labels = [round(min_val + step * i, 2) for i in range(ticks)]
    norm_min = results3[cols[dim]].min()
    norm_range = np.ptp(results3[cols[dim]])
    norm_step = norm_range / float(ticks-1)
    ticks = [round(norm_min + norm_step * i, 2) for i in range(ticks)]
    ax.yaxis.set_ticks(ticks)
    ax.set_yticklabels(tick_labels)
    
for dim, ax in enumerate(axes):
    ax.xaxis.set_major_locator(ticker.FixedLocator([dim]))
    set_ticks_for_axis(dim, ax, ticks=6)
    ax.set_xticklabels([cols[dim]])
    
# Move the final axis' ticks to the right-hand side
ax = plt.twinx(axes[-1])
dim = len(axes)
ax.xaxis.set_major_locator(ticker.FixedLocator([x[-2], x[-1]]))
set_ticks_for_axis(dim, ax, ticks=6)
ax.set_xticklabels([cols[-2], cols[-1]])


# Remove space between subplots
plt.subplots_adjust(wspace=0)

# Add legend to plot
plt.legend(
    [plt.Line2D((0,1),(0,0), color=colours[cat]) for cat in results3['Win Percentage'].cat.categories],
    results3['Win Percentage'].cat.categories,
    bbox_to_anchor=(1.2, 1), loc=2, borderaxespad=0.)

plt.title("Type1=Rock")

plt.show()

Observation: Only 5 Type 1 = Rock pokemons fall in the highest win percentage range (0.75 to 1.0), while the majority of Rock-type pokemons sit in the lower ranges. So in general the win percentages of Rock-type pokemons are not high.
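
A hedged sketch of how the same question could be checked without a plot: summarise the win percentage per Type 1 and count how many Rock pokemons land in each range. The values below are made up and only show the calls.

```python
import pandas as pd

# Made-up win rates for a handful of pokemon
toy = pd.DataFrame({
    "Type 1": ["Rock", "Rock", "Rock", "Water", "Water", "Fire"],
    "Win Percentage": [0.90, 0.30, 0.20, 0.60, 0.50, 0.70],
})

# Count, mean and median win rate for each primary type
by_type = toy.groupby("Type 1")["Win Percentage"].agg(["count", "mean", "median"])

# How many Rock pokemon fall into each win percentage range
rock_rates = toy.loc[toy["Type 1"] == "Rock", "Win Percentage"]
rock_bins = pd.cut(rock_rates, [0, 0.25, 0.50, 0.75, 1.0]).value_counts().sort_index()

print(by_type)
print(rock_bins)
```

In this toy example one Rock pokemon has a very high win rate, but the mean and median for the type stay low, which is the same pattern as described in the observation above.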

# Conclusions/Directions for future work 

From my explorations, it is clear there are many more patterns and trends that could be explored with more time. To further this work, I would continue the exploratory analysis with the goal of building a win percentage prediction model that predicts a pokemon's win percentage by finding its stats doppelganger(s). A stretch goal would be to leverage machine learning to predict the win percentage.
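
To give an idea of what the "stats doppelganger" approach could look like, here is a rough sketch, not something I have built or tested on the real data. It assumes a nearest-neighbour lookup: scale each stat to 0-1, find the `k` pokemon with the closest stats, and average their win rates. The function name `predict_from_doppelgangers` and all numbers are made up for illustration.

```python
import numpy as np

# Made-up stats (three stat columns per pokemon) and made-up win rates
stats = np.array([
    [45, 49, 45],
    [60, 62, 80],
    [80, 82, 100],
    [39, 52, 65],
], dtype=float)
win_rate = np.array([0.30, 0.55, 0.80, 0.40])


def predict_from_doppelgangers(query, stats, win_rate, k=2):
    """Average the win rates of the k pokemon whose scaled stats are closest to query."""
    # Min-max scale every stat column so no single stat dominates the distance
    lowest = stats.min(axis=0)
    span = np.ptp(stats, axis=0)
    scaled = (stats - lowest) / span
    scaled_query = (np.asarray(query, dtype=float) - lowest) / span

    # Euclidean distance from the query to every known pokemon
    distance = np.linalg.norm(scaled - scaled_query, axis=1)

    # Indices of the k closest pokemon
    nearest = np.argsort(distance)[:k]
    return win_rate[nearest].mean()


estimate = predict_from_doppelgangers([62, 60, 78], stats, win_rate, k=2)
print(estimate)
```

The choice of `k` and of the distance measure are open questions that would need to be checked against held-out pokemon.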

Overall this project has been fun, though frustrating at times. As a beginner, I wasted a lot of time figuring out errors that I later realized were caused by not re-running all of the code after making changes. I have learned that the nice clean UI of Kaggle, which allows breaking code into blocks, makes it easy to forget that Python is still being run sequentially in the backend. Another big challenge was cleaning the dataset and setting it up for Kaggle. It took me a significant amount of time to realize that some of the errors I was getting had to do with the structure of the dataset. I had to make sure that even though the dataset looked right in Excel, it was properly labelled and structured in Kaggle as well. I eventually found a different dataset to work around some issues with my previous pokemon dataset.


References:
http://benalexkeen.com/parallel-coordinates-in-matplotlib/