In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pylab import scatter

In [2]:
# Colour palette for plotting
clr_pal = ['#000000','#e6194b','#3cb44b','#ffe119','#0082c8','#f58231','#911eb4','#46f0f0','#f032e6','#d2f53c','#fabebe','#008080','#e6beff','#aa6e28','#fffac8','#800000','#808000','#ffd8b1','#000080','#808080','#aaffc3']

Let's read the contents of the dataset.

In [3]:
data = pd.read_csv('../input/Pokemon.csv')
features = data.columns
data.head(10)

There are some NaN values in the column `Type 2`, and there may be NaN values in other columns further down. Let's check whether any other columns have them and fix those undefined values.

In [4]:
data.isnull().any(axis=0)
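`any` only tells us whether a column contains missing values; a per-column count also shows how many there are. Here is a minimal sketch on a small made-up frame (the rows are illustrative, not taken from the dataset, and the name `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Bug'],
    'Type 2': ['Poison', np.nan, np.nan, 'Flying'],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Keep only the names of columns that actually contain missing values
cols_with_nan = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_nan)
```

The same two calls should work on `data` directly.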

Cool. It's just `Type 2` then. Since this feature gives a pokemon's secondary classification, let's tag the missing ones as `Basic` type.

In [5]:
data['Type 2'] = data['Type 2'].fillna("Basic")
data.head(10)

Perfect. Now we are free to play with the data. How about a graph of the distribution of values?

Since it doesn't make sense to include the `Total` and `#` (id) of a pokemon, let's plot without those features.

In [6]:
df1 = data.drop(['#','Total'], axis=1)
plt.subplots(figsize = (15,10))
sns.boxplot(data=df1)
plt.show()

Okay. That's a good overall picture of the distribution. The features `Generation` and `Legendary` aren't on the same scale as the other values, so let's take a closer look at them.

In [7]:
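fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# histplot replaces the deprecated distplot (needs seaborn >= 0.11)
sns.histplot(data['Generation'], bins=100, ax=axes[0])
# Legendary is boolean, so cast it to 0/1 before binning
sns.histplot(data['Legendary'].astype(int), bins=100, ax=axes[1])
plt.show()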
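Besides looking at the histograms, we can check directly whether a feature is discrete by counting its distinct values. A small sketch with made-up numbers (the frame `toy` is hypothetical, not the real dataset):

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'Generation': [1, 1, 2, 3, 3, 3],
    'Legendary': [False, False, True, False, False, True],
})

# A discrete feature has only a handful of distinct values
n_unique = toy[['Generation', 'Legendary']].nunique()
print(n_unique)

# Frequency of each distinct value, ordered by the value itself
gen_counts = toy['Generation'].value_counts().sort_index()
print(gen_counts)
```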

So both `Generation` and `Legendary` are discrete values, not continuous ones, as we can see from the graph. Let's see how each type of pokemon varies across the remaining features.

In [8]:
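# numeric_only=True skips the text columns (Name, Type 2), which newer pandas refuses to average
df2 = data.groupby(['Type 1'], as_index=True).mean(numeric_only=True)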
df2 = df2.drop(['#','Total','Legendary','Generation'],axis=1).transpose()
df2

`Total` values would be too high and redundant with the other features, while `Generation` and `Legendary` values would be too low, so those columns are dropped for a better comparison in the graph.

In [9]:
df2[df2.columns].plot(color=clr_pal[:len(df2.columns)],marker='D')
plt.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.gcf().set_size_inches(15,10)
plt.show()

This is a pretty useful one, as it gives some information about the common properties of each type. From the mean values, the following is clear:
* Highest Attack - `Dragon`
* Highest Defense - `Steel`
* Highest Speed - `Flying`
* Lowest Attack - `Fairy`
* Lowest HP - `Bug`
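Extremes like these can also be read off programmatically instead of from the plot. A minimal sketch on a made-up frame (the numbers and the name `toy` are illustrative only, not the real dataset), using `idxmax` and `idxmin` on the per-type means:

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'Type 1': ['Dragon', 'Dragon', 'Steel', 'Steel', 'Bug', 'Bug'],
    'Attack': [120, 100, 90, 80, 50, 60],
    'Defense': [80, 90, 140, 120, 60, 70],
    'HP': [85, 80, 65, 60, 50, 55],
})

# Mean of every numeric feature per primary type (types as rows)
type_means = toy.groupby('Type 1').mean(numeric_only=True)

# For each feature, the type with the highest / lowest mean
highest = type_means.idxmax()
lowest = type_means.idxmin()
print(highest)
print(lowest)
```

Note that `df2` above is transposed (features as rows, types as columns), so on `df2` the equivalent calls would be `df2.idxmax(axis=1)` and `df2.idxmin(axis=1)`.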

Some types of pokemon have similarly shaped graphs, which is better understood through the correlation among all these types.

In [10]:
plt.subplots(figsize=(15,10))
sns.heatmap(df2.corr(), annot=True)
plt.show()

Some useful inferences from the correlation among pokemon types:
* Rock and Steel pokemon are negatively correlated with Flying ones
* Water and Grass types are positively correlated
* Ground pokemon are strongly correlated with Poison types
* Electric and Ice types have near-zero correlation, which indicates no linear relationship between their mean stats (it does not by itself indicate a nonlinear one)
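When reading the heatmap, it helps to remember that `DataFrame.corr()` computes the Pearson coefficient by default, which measures only linear association. A value near zero rules out a linear trend but says nothing either way about other kinds of dependence. A small sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np

# Symmetric made-up values
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# y_quadratic is fully determined by x, but the dependence is not linear
y_quadratic = x ** 2
# y_linear is a straight-line function of x
y_linear = 3 * x + 1

r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
r_linear = np.corrcoef(x, y_linear)[0, 1]

print(r_quadratic)  # essentially zero despite perfect dependence
print(r_linear)     # essentially one
```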

It is hard to plot the pokemon types against all the available features at once, but it is possible for a few interesting ones. So it helps to first know how closely the features are correlated.

In [11]:
# numeric_only=True skips the text columns, which newer pandas refuses to average
df3 = data.groupby(['Type 1'], as_index=True).mean(numeric_only=True)
df3 = df3.drop('#',axis=1)
plt.subplots(figsize=(15,10))
sns.heatmap(df3.corr(), annot=True)
plt.show()

There is an inverse correlation between `Defense` and `Speed`, and that's because the `Steel` type records the highest Defense while the `Flying` type records the highest Speed. From the previous correlation we saw that `Steel` and `Flying` are negatively correlated. Thus `Defense` and `Speed` have a negative correlation. Well, that makes sense.

Let's plot the types of pokemons against some features to have a better understanding.

In [12]:
def feature_relation(feature1, feature2):
    plot_features = ['Type 1','population',feature1, feature2]
    df4 = pd.DataFrame(columns=plot_features)
    # size() gives the group labels without averaging the text columns
    df4[plot_features[0]] = data.groupby(['Type 1']).size().index.tolist()
    df4[plot_features[1]] = data.groupby(['Type 1'])['#'].count().tolist()
    df4[plot_features[2]] = data.groupby(['Type 1'])[plot_features[2]].mean().tolist()
    df4[plot_features[3]] = data.groupby(['Type 1'])[plot_features[3]].mean().tolist()
    
    fig, ax = plt.subplots()
    fig.set_size_inches(15,10)
    for idx,val in df4.iterrows():
        ax.scatter(x= val[feature1], y=val[feature2], c = clr_pal[idx], label=val['Type 1'], s=val['population']*20)
    lgnd = plt.legend(df4['Type 1'],  bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
    ax.set_xlabel(feature1)
    ax.set_ylabel(feature2)
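    # Give every legend marker the same size.
    # legend_handles is the matplotlib >= 3.7 name; older releases expose legendHandles.
    handles = getattr(lgnd, 'legend_handles', None) or lgnd.legendHandles
    for handle in handles:
        handle.set_sizes([100])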
    plt.show()

`Sp. Atk` and `Sp. Def` have a strong correlation. Let's try those features first.

In [13]:
feature_relation('Sp. Atk','Sp. Def')

Each bubble represents a pokemon type, and its size represents the number of pokemon of that type in the dataset. The graph plots `Sp. Atk` against `Sp. Def`, and the two have a positive relation. Let's plot some other features to see their relation too.

In [14]:
feature_relation('Speed','Defense')

This doesn't really look like a negative correlation, but there are a lot of low values.
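To put a number on what the bubble plot shows, the coefficient can be computed both across individual pokemon and across the per-type means. The two need not agree, since averaging within a type discards the spread inside it. A sketch on made-up numbers (the frame `toy` is hypothetical, not the real dataset):

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'Type 1': ['Steel', 'Steel', 'Flying', 'Flying', 'Bug', 'Bug'],
    'Defense': [130, 110, 60, 80, 60, 70],
    'Speed': [40, 60, 110, 130, 60, 70],
})

# Pearson correlation across individual rows
r_individual = toy['Defense'].corr(toy['Speed'])

# Pearson correlation across the per-type means (what the bubble plot shows)
type_means = toy.groupby('Type 1')[['Defense', 'Speed']].mean()
r_types = type_means['Defense'].corr(type_means['Speed'])

print(r_individual)
print(r_types)
```

On the real data, replacing `toy` with `data` would give the two coefficients for comparison.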

That's it for now. I'll try some more plots in the future.

 ###### Gotta Explore 'Em all right!