![](https://cdn.pastemagazine.com/www/articles/Pokemon%20Header%20Best%20Of.jpg)

Hello Kaggle! Since I discovered the Fire Red version some years ago and spent many hours enjoying it, I thought that analysing a dataset of 800 pokemons would be interesting and fun (the cartoons are also awesome). Here we go!

# 1. Importing the libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 2. Importing the data

In [None]:
pokemons=pd.read_csv('../input/pokemon/Pokemon.csv')

# 3. EDA on pokemons

Let's see some random observations from our dataset.

In [None]:
pokemons.sample(7)

In [None]:
pokemons.columns

The columns are:
1. # = The ID of the pokemon.
2. Name
3. Type 1 = Types refer to different elemental properties associated with both Pokémon and their moves.
4. Type 2 = Pokémon themselves can have up to two types, making them Dual-Type Pokémon; the moves can be only one type.
5. Total = Represents the sum of all stats of the pokemon, giving a sense of how strong the pokemon is.
6. HP = (hit points) is related to how much damage a Pokemon can sustain before fainting.
7. Attack = The strength of a Pokémon's physical attacks.
8. Defense = The Pokémon's resistance against physical attacks.
9. Sp. Atk = Special Attack, the power of a Pokémon's special attacks.
10. Sp. Def = Special Defense, the Pokémon's resilience to special attacks.
11. Speed = The pokemon with a higher speed attacks first.
12. Generation = Refers to the Pokémon game series.
13. Legendary = Legendary pokemons are extremely rare and powerful. In our dataset Legendary is boolean => says if the pokemon is legendary or not. 
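
Since `Total` is described as the sum of the six stats, that is easy to verify. Here is a minimal sketch on three rows typed in by hand (the frame `toy` and the name `stat_cols` are my own, not part of the notebook):

```python
import pandas as pd

stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Three first-generation starters, typed in by hand for illustration
toy = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'HP': [45, 39, 44],
    'Attack': [49, 52, 48],
    'Defense': [49, 43, 65],
    'Sp. Atk': [65, 60, 50],
    'Sp. Def': [65, 50, 64],
    'Speed': [45, 65, 43],
    'Total': [318, 309, 314],
})

# Row-wise sum of the six stats, compared against the Total column
total_matches = bool((toy[stat_cols].sum(axis=1) == toy['Total']).all())
print(total_matches)
```

The same comparison should work on the full dataset once it is loaded, by replacing `toy` with `pokemons`.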

In [None]:
pokemons.info()

* We have 800 observations.
* Almost half of the values in column 'Type 2' are NaN, so I will treat Type 1 as the principal type of each pokemon.
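
To put a number on "almost half", the share of missing values per column can be computed with `isna().mean()`. A minimal sketch on a made-up four-row frame (the rows are illustrative, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Bug'],
    'Type 2': ['Poison', np.nan, np.nan, 'Flying'],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values
missing_share = toy.isna().mean()
print(missing_share)
```

On the notebook's data the equivalent call would be `pokemons.isna().mean()`, run before the column is dropped.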

In [None]:
del pokemons['Type 2']
pokemons.rename(columns={'Type 1':'Type'},inplace=True)

In [None]:
pokemons.head()

Let's see some statistics.

In [None]:
pokemons.describe()

The summary statistics of the six stats (mean, standard deviation, quartiles) are close to each other, which suggests that they have similar distributions.


In [None]:
len(pokemons.Name.unique())
# We got 800 DIFFERENT pokemons.

In [None]:
pokemons[pokemons.duplicated()]
#We don't have duplicated values in our data frame.

In [None]:
len(pokemons['Type'].unique())
#There are 18 different types  of pokemons

How many pokemons are in each generation?

In [None]:
pokemons['Generation'].value_counts()

In [None]:
sns.countplot(x='Generation',data=pokemons,palette='nipy_spectral')
plt.title('Number of pokemons grouped by generation')

How many pokemons are there of each type?

In [None]:
pokemons.Type.value_counts()

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x='Type',data=pokemons,palette='twilight') # pass the column by keyword, like in the generation plot

*NOTE: Because there are many more Water and Normal pokemons than pokemons of other types, any stat summed per type will be higher for these two types, regardless of how strong the individual pokemons are!*

Checking the distribution of stats with boxplot & violinplot:

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(10,6))
sns.boxplot(data=pokemons.drop(['#','Total','Generation','Legendary'],axis=1),fliersize=3,palette='seismic')
plt.title('Boxplots for stats')

We have some outliers for each stat.
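
By default seaborn draws as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. Counting them with the same rule can be sketched like this (the helper `count_outliers` and the values in `toy_hp` are hypothetical):

```python
import pandas as pd

def count_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())

# Made-up HP values with one extreme point
toy_hp = pd.Series([40, 45, 50, 55, 60, 65, 70, 255])
n_out = count_outliers(toy_hp)
print(n_out)
```

Applied to the notebook's data, something like `pokemons[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].apply(count_outliers)` would give one count per stat.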

In [None]:
plt.figure(figsize=(10,6))
sns.violinplot(data=pokemons.drop(['#','Total','Generation','Legendary'],axis=1),palette='rocket')
plt.title('Violinplots for stats')

The stats have a similar distribution.

Let's visualize the pokemons grouped by type.

In [None]:
pokemons.groupby('Type').sum(numeric_only=True) # skip the text columns, which cannot be summed meaningfully

In [None]:
pokemons.groupby('Type')['HP'].sum()

In this series the types are ordered alphabetically.

In [None]:
pokemons['Type'].unique()

I need to extract the types of pokemons and order them alphabetically.

In [None]:
list_types=pokemons['Type'].unique().tolist() # Convert the array of types into a list
list_types.sort() # Sorting the list of strings alphabetically
list_types
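
The reason the sorted list lines up with the grouped results is that `groupby` sorts its keys by default. A small sketch on a made-up frame (the rows are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'Type': ['Water', 'Bug', 'Fire', 'Bug', 'Water'],
    'HP': [44, 45, 39, 50, 59],
})

# Same steps as the cell above: unique values, then sort
list_types = sorted(toy['Type'].unique().tolist())

# groupby sorts its keys by default (sort=True), so the index is already ordered
group_index = toy.groupby('Type')['HP'].sum().index.tolist()

print(list_types)
print(group_index)
```

Taking the labels straight from the grouped index would be an alternative that avoids keeping two orderings in sync.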

Plotting the total of stats for each type of pokemon:

In [None]:
plt.style.use('ggplot')
plt.style.use('seaborn-darkgrid')

stats=pokemons[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
palette=['magma','ocean','vlag','copper','mako','winter']
# Sum of every stat per type, computed once (the index is sorted like list_types)
type_sums=pokemons.groupby('Type')[stats.columns.tolist()].sum()
plt.figure(figsize=(17,17))
for k,(i,pal) in enumerate(zip(stats,palette),start=1):
    plt.subplot(3,2,k)
    sns.barplot(x=type_sums[i],y=list_types,palette=pal)
    plt.title('Total of '+i)

It's no surprise that the barplots look alike: the stats have similar distributions, and every sum is scaled by the same number of pokemons per type.


Time for some swarm-style plots! (The code uses `sns.stripplot`, a faster, jittered relative of the swarmplot.)

In [None]:
plt.figure(figsize=(15,30))
for k,i in enumerate(stats,start=1):
    plt.subplot(6,1,k)
    # One point per pokemon, so the title names the stat rather than a total
    sns.stripplot(x=pokemons.Type,y=pokemons[i],palette='Dark2')
    plt.title(i+' for each type')

Now let's draw plots for each type separately.

In [None]:
k=1;
plt.figure(figsize=(17,22))
for i in list_types:
    plt.subplot(6,3,k);
    k=k+1;
    type_sums=pokemons.loc[pokemons.Type==i,stats.columns].sum() # sum of each stat for this type
    sns.barplot(x=type_sums.values,y=type_sums.index,palette='inferno')
    plt.title(i)
    plt.xlim(0,8500)
        

Swarmplots again, this time one panel per type:

In [None]:
pok_melt=pd.melt(pokemons,id_vars=['Name','Type','Legendary'],value_vars=['HP','Defense','Attack','Sp. Atk','Sp. Def','Speed'])
pok_melt.head()
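
For anyone who hasn't used `pd.melt` before: it turns the wide table (one column per stat) into a long one (one row per pokemon and stat), which is the shape seaborn needs to put the stat name on an axis. A minimal sketch with two made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['A', 'B'],
    'Type': ['Fire', 'Water'],
    'HP': [39, 44],
    'Attack': [52, 48],
})

# Wide -> long: one row per (pokemon, stat) pair
long = pd.melt(toy, id_vars=['Name', 'Type'], value_vars=['HP', 'Attack'])
print(long)
```

The stat names end up in a column called `variable` and the numbers in `value`, which are the two column names used in the plots below.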

In [None]:
plt.figure(figsize=(17,22))
k=1
for i in list_types:
    plt.subplot(6,3,k)
    k=k+1
    sns.swarmplot(x='variable',y='value',data=pok_melt[pok_melt.Type==i],palette='gist_stern') # x and y come from the same filtered rows
    plt.title(i)
    plt.xlabel('')

We can observe that Normal, Bug and Water pokemons have more observations than the rest. 

**BUT** what if we calculate the mean of each stat and plot that instead? In this case the small number of pokemons of some types will not affect the analysis.

Firstly, let's see the mean of the stats grouped by pokemon's type.

In [None]:
# Mean of each stat per type, computed in a single groupby
df=pokemons.groupby('Type')[stats.columns.tolist()].mean()

I made a dataframe with all the means grouped by pokemon's type.

In [None]:
df

In [None]:
plt.figure(figsize=(16,20))
k=1
m=0
for i in stats:
    plt.subplot(3,2,k)
    k=k+1
    sns.barplot(x=df[i],y=df.index,palette=palette[m])
    m=m+1
    plt.title('Mean '+i+' for each type')
    plt.xlabel(i)

In [None]:
k=1;
plt.figure(figsize=(16,25))
for i in list_types:
    plt.subplot(6,3,k);
    k=k+1;
    sns.barplot(x=df.loc[i,:].values,y=df.loc[i,:].index, palette='Paired');
    plt.title(i)
    plt.xlim(0,130)
    plt.xlabel('Mean') # the means are on the x axis; the y axis lists the stats

Let's compare the summed Total per type with the mean Total per type.

In [None]:
plt.figure(figsize=(15,5))
sum_total=pokemons.groupby('Type')['Total'].sum().sort_values(ascending=False)
sns.barplot(x=sum_total.index,y=sum_total,palette='cool')
plt.title('Total of all stats for each type of pokemon')

Top 3 pokemon types based on the summed Total:
1. Water
2. Normal
3. Grass

In [None]:
plt.figure(figsize=(15,5))
# Select 'Total' before averaging so the text columns are never passed to mean()
mean_total=pokemons.groupby('Type')['Total'].mean().sort_values(ascending=False)
sns.barplot(x=mean_total.index,y=mean_total.values,palette='twilight_shifted')
plt.title('Mean of the total of all stats for each type of pokemon')

Top 3 pokemon types based on the mean of the Total:
1. Dragon
2. Steel
3. Flying

*Conclusion*: plotting the mean per type instead of the sum per type makes a big difference, because the sum mostly reflects how many pokemons a type has.
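
The effect is easy to reproduce on a tiny made-up example: a type with many average pokemons wins on the sum, while a type with few strong ones wins on the mean. The numbers below are hypothetical, chosen only to show the mechanism:

```python
import pandas as pd

# Made-up totals: many average Water pokemons, few strong Dragons
toy = pd.DataFrame({
    'Type': ['Water'] * 4 + ['Dragon'] * 2,
    'Total': [300, 400, 500, 400, 600, 700],
})

# Group size, sum and mean side by side
summary = toy.groupby('Type')['Total'].agg(['count', 'sum', 'mean'])
print(summary)

top_by_sum = summary['sum'].idxmax()
top_by_mean = summary['mean'].idxmax()
print(top_by_sum, top_by_mean)
```

Showing `count` next to `sum` and `mean` makes it visible at a glance which of the two rankings is driven by group size.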

Let's answer some questions:

1. What's the best stat for each type? (What is the advantage of each of the 18 types? *Based on the dataframe with the mean of each stat per type.*)

In [None]:
best_stats=[]
for i in list_types:
    best_stats.append(df.loc[i,:].sort_values(ascending=False).index[0])

In [None]:
for type_name,best_stat in zip(list_types,best_stats):
    print('Best stat of type ',type_name,' is ',best_stat)
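
As an alternative to sorting each row, `idxmax(axis=1)` returns the column label of the largest value in every row. A sketch with hypothetical means (not the real values from `df`):

```python
import pandas as pd

# Hypothetical per-type means, for illustration only
toy_means = pd.DataFrame(
    {'HP': [60.0, 70.0], 'Attack': [90.0, 65.0], 'Speed': [75.0, 80.0]},
    index=['Dragon', 'Normal'],
)

# For every row (type), the name of the column (stat) holding the maximum
best = toy_means.idxmax(axis=1)
print(best)
```

On the notebook's dataframe this would be `df.idxmax(axis=1)`; ties may be resolved differently from the sort-based loop.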

At one point I noticed that the names containing 'Mega' have the base name glued in front of them (for example 'VenusaurMega Venusaur').

In [None]:
pokemons[pokemons.Name.str.contains('Mega')]

Let's fix this:

In [None]:
mega_pokemons = ['Mega'+poke.split('Mega')[1] for poke in pokemons[pokemons.Name.str.contains('Mega')].Name]
mega_pokemons

In [None]:
pokemons=pokemons.replace(to_replace=pokemons[pokemons.Name.str.contains('Mega')].Name.values,value=mega_pokemons)
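
The same clean-up could also be sketched as a single regular-expression replacement on the `Name` column. This is an alternative to the cells above, not what the notebook runs, and the four names are just examples of the pattern:

```python
import pandas as pd

names = pd.Series([
    'VenusaurMega Venusaur',
    'CharizardMega Charizard X',
    'Meganium',   # contains 'Mega' but is not a Mega form
    'Bulbasaur',
])

# Drop everything before 'Mega ' (note the trailing space in the lookahead),
# so ordinary names such as 'Meganium' are left untouched
cleaned = names.str.replace(r'^.*(?=Mega )', '', regex=True)
print(cleaned.tolist())
```

Requiring the space after 'Mega' is the design choice that keeps names merely starting with those letters out of the match.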

Which is the best pokemon for each type? (according to each stat)

In [None]:
for n in list_types:
    print(str('TYPE ')+n.upper())
    for i in stats:
        name=pokemons[(pokemons.Type==n)].sort_values(by=i,ascending=False).Name.values[0]
        print(str('Best ')+i+(' pokemon is ')+name)
    print('*****************************************')
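
A more compact way to get the same kind of answer for one stat is `groupby(...).idxmax()`, which returns the row label of the maximum within each group. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Type': ['Fire', 'Fire', 'Water', 'Water'],
    'HP': [39, 78, 44, 79],
})

# Row label of the highest HP within each type, then look those rows up
best_rows = toy.loc[toy.groupby('Type')['HP'].idxmax(), ['Type', 'Name', 'HP']]
print(best_rows)
```

On the real data, `pokemons.loc[pokemons.groupby('Type')['HP'].idxmax()]` would list the top-HP pokemon of every type in one table.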


What about the features Legendary and Generation? I haven't paid attention to them until now.

I am going to plot the number of pokemons in each generation again.

In [None]:
sns.countplot(x='Generation',data=pokemons,palette='seismic')
plt.title('Number of pokemons grouped by generation')
plt.ylabel('Number of pokemons')

In [None]:
pokemons.groupby('Generation').sum(numeric_only=True) # skip the text columns

Plotting the values of each stat by generation:

In [None]:
plt.figure(figsize=(15,15))
k=1
for i in stats:
    plt.subplot(3,2,k)
    sns.swarmplot(x='Generation',y=i,data=pokemons,palette='plasma')
    k=k+1
    plt.title(i+str(' for each generation'))
    

Boxplots & Bokeh plotting:

In [None]:
k=1
plt.figure(figsize=(17,15))
for i in stats:
    plt.subplot(3,2,k)
    sns.boxplot(y=pokemons[i],x=pokemons.Generation)
    k=k+1
    plt.title(i+str(' for each generation'))

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot
output_notebook()

# Sum of each stat per generation, computed once
gen_sums=pokemons.groupby('Generation')[list(stats)].sum()
generations=gen_sums.index.tolist()

plots=[]
for i in stats:
    # width/height are the Bokeh 3 names for plot_width/plot_height
    p=figure(width=400,height=200,title=i+' for each generation')
    p.scatter(x=generations,y=gen_sums[i],size=3,color='red')
    p.line(x=generations,y=gen_sums[i],line_width=1,color='red')
    plots.append(p)

grid=gridplot(plots,ncols=2)
show(grid)

Now the legendary pokemons:

How many legendary pokemons are there in total?

In [None]:
len(pokemons[pokemons.Legendary==True])
# There are 65 Legendary pokemons
# 8.125% pokemons are Legendary
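
The percentage in the comment can be computed rather than typed: the mean of a boolean column is the share of `True` values. A sketch that rebuilds a boolean series with the same counts as reported above (65 legendary out of 800), without loading the file:

```python
import pandas as pd

# 65 True values among 800 rows, matching the counts reported above
legendary = pd.Series([True] * 65 + [False] * 735)

n_legendary = int(legendary.sum())   # number of True values
pct = legendary.mean() * 100         # share of True values, as a percentage
print(n_legendary, round(pct, 3))
```

In the notebook this corresponds to `pokemons.Legendary.sum()` and `pokemons.Legendary.mean()*100`.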

In [None]:
pokemons.groupby('Generation')['Legendary'].sum()
# Generations 3, 5 and 4 have the most legendary pokemons

In [None]:
sns.barplot(x=pokemons.groupby('Generation').sum().Legendary.index,
            y=pokemons.groupby('Generation').sum().Legendary.values,palette='CMRmap')

How many legendary pokemons are there of each type?

In [None]:
pokemons.groupby('Type').sum().Legendary.sort_values(ascending=False)

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=pokemons.groupby('Type').sum().Legendary.sort_values(ascending=False).index,
              y=pokemons.groupby('Type').sum().Legendary.sort_values(ascending=False).values,palette='Paired')

So there are no legendary Fighting, Poison or Bug pokemons. Most legendary pokemons are Psychic or Dragon.

How are the stats of legendary pokemons compared to the others?

In [None]:
plt.figure(figsize=(15,30))
for k,i in enumerate(stats,start=1):
    plt.subplot(6,1,k)
    sns.swarmplot(x='Type',y=i,palette='Dark2',hue='Legendary',data=pokemons)
    plt.title(i+' for each type')

In [None]:
plt.figure(figsize=(17,22))
k=1
for i in list_types:
    plt.subplot(6,3,k)
    k=k+1
    sns.swarmplot(x='variable',y='value',hue='Legendary',data=pok_melt[pok_melt.Type==i],palette='Dark2')
    plt.title(i)
    plt.xlabel('')
    plt.legend(bbox_to_anchor=(0, 1.02, 1, 0.102), loc='lower left',ncol=2, mode="expand", borderaxespad=0.)
   

Most of the legendary pokemons are very powerful compared to the other non-legendary pokemons.

Let's see how many legendary pokemons have higher stats than the average of the non-legendary pokemons (*and what percentage of the legendary pokemons that is*).

In [None]:
legend=pokemons[pokemons.Legendary==True]
non_legend=pokemons[pokemons.Legendary==False]
for i in stats:
    # Compare against the non-legendary average, as described above
    n_above=len(legend[legend[i]>non_legend[i].mean()])
    print('Number of legendary pokemons with ',i, ' higher than the non-legendary average:',
          n_above,'\nPercentage:', round(n_above/len(legend)*100,2),
           '\n**************')

These results support what the swarmplots showed. Indeed, most of the legendary pokemons are stronger.

Let's see the correlations between values.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(pokemons.drop(['#','Name','Type'],axis=1).corr(),annot=True,cmap="YlGnBu") # drop the text columns before corr()

* The large and medium correlations are mostly between Total and the individual stats, which is completely logical since Total is their sum.
* There is no strong correlation between the individual stats themselves.

So apart from Total there are only medium correlations, which is not very helpful.
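
Reading the strongest pairs off a heatmap by eye gets tedious, so here is a minimal sketch of ranking the pairs of a correlation matrix by absolute value. The frame `toy` holds made-up numbers, and `ranked` is a name I introduce only for this example:

```python
from itertools import combinations

import pandas as pd

# Made-up stats; Total is built as their sum, like in the dataset
toy = pd.DataFrame({
    'HP': [10, 20, 30, 40, 50],
    'Attack': [12, 18, 33, 41, 48],
    'Speed': [50, 10, 40, 20, 30],
})
toy['Total'] = toy.sum(axis=1)

corr = toy.corr()

# Every unordered pair of columns once, strongest correlation first
ranked = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda pair: abs(pair[2]),
    reverse=True,
)
for a, b, r in ranked:
    print(a, b, round(r, 2))
```

The same loop over the matrix computed in the heatmap cell would list the pairs involving Total near the top.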

In [None]:
sns.pairplot(pokemons.drop(['#','Legendary','Generation'],axis=1))

**So that's it. I hope you enjoyed it, and if you have any questions or suggestions for improvement, don't be shy!**