# Exploratory Data Analysis on Pokemon Dataset

### Pokemon
Almost everyone has heard of this amazing franchise. It was a staple of many late 90s and early 2000s childhoods.
It began in 1996 and has been loved by millions since. The franchise is often cited as the **highest-grossing media franchise** in the world, with an estimated revenue of around 100 billion dollars, ahead of the likes of Hello Kitty, Harry Potter and Marvel.

<img src="https://wallpapers.com/images/high/pokemon-go-title-logo-zd9p69e069waqssp.jpg">

### Objective
The objective of this notebook is to perform **exploratory data analysis on a Pokemon dataset** and to derive some insights from it. Most important of all is to have fun while doing it!

Note: This notebook is extremely beginner friendly and a great way to start with EDA.

#### Data set used: [Pokemon Dataset](https://www.kaggle.com/abcsds/pokemon)

### Exploratory Data Analysis (EDA)

#### Loading required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Reading the dataset

In [None]:
pokemon_df = pd.read_csv("../input/pokemon/Pokemon.csv")

In [None]:
print(pokemon_df.head())

Now that we have had a first glimpse of the dataset, let's explore it in a more meaningful way.

#### Exploring the dataset

Let's take a look at the number of unique values for each field in the dataset.

In [None]:
print(pokemon_df.nunique())

Based on the number of unique values, the column names and my prior knowledge about Pokemon, the columns 'Type 1', 'Type 2', 'Generation' and 'Legendary' may be considered **categorical variables.** However, we need to look a little closer to confirm it.
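
As a rough illustration of this shortlisting step, the sketch below flags columns with few distinct values. The toy data, the helper name `low_cardinality_columns` and the `max_unique` threshold are all assumptions made for the example; on the real dataset the threshold would need to be higher (Type 1 alone has 18 values), and a low count is only a hint, not proof, that a column is categorical.

```python
import pandas as pd

# Toy frame standing in for the Pokemon data (values are illustrative only).
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Type 1": ["Grass", "Fire", "Water", "Grass", "Fire", "Water"],
    "HP": [45, 39, 44, 60, 58, 59],
    "Legendary": [False, False, False, False, False, True],
})


def low_cardinality_columns(df, max_unique=3):
    """Hypothetical helper: columns with at most `max_unique` distinct values."""
    counts = df.nunique()
    return list(counts[counts <= max_unique].index)


candidates = low_cardinality_columns(toy_df, max_unique=3)
print(candidates)
```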

Let's have a look at the unique values of the above mentioned columns.

In [None]:
print(pokemon_df['Type 1'].unique())

In [None]:
print(pokemon_df['Type 2'].unique())

In [None]:
print(pokemon_df['Generation'].unique())

In [None]:
print(pokemon_df['Legendary'].unique())

##### Categorical Variables

Based on our initial exploration, the following variables qualify as categorical:

1. **Type 1**: There are 18 types in Pokemon, where each type has its own unique set of characteristics. This field tells us the type to which a Pokemon belongs.

2. **Type 2**: Some Pokemons can be of dual types, that is, they can have more than one type. This field tells us about the second type of the Pokemon in question.

3. **Generation**: Pokemon is a long running series, that is why it was divided across multiple generations. Our dataset has data about 6 of them. This field tells us what generation a Pokemon belongs to.

4. **Legendary**: There are a few legendary Pokemons in the franchise which have special powers as compared to others. This field tells whether a Pokemon is legendary or not.

Other variables define the quantitative attributes of a Pokemon. They are the **numerical variables**  of our dataset.

To get an overview of the values of our numerical variables, we can look at some useful aggregate stats.

In [None]:
df_numerical = pokemon_df.iloc[:, 5:11]
df_num_desc = df_numerical.describe()
print(df_num_desc)

Doing so, we get an idea of the **maximum, minimum and mean values** of all these numerical variables in the dataset. We also have the **count and 25th, 50th and 75th percentile values** of these fields. In addition, the **standard deviation** of each field is now known to us.

**Note**: We have dropped 'Total' from numerical variables as it does not provide much information about the pokemon. Summing different abilities is not a useful way to gain insights. Moreover, its distribution deviates a lot from the Normal Distribution curve.
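
A minimal sketch of the same selection done by column name instead of position, assuming the six stat columns are named as they are later in this notebook. It also checks, on two illustrative rows, the assumption that 'Total' is simply the row-wise sum of those six stats; this is worth verifying on the real file before relying on it.

```python
import pandas as pd

# Column names assumed to match the dataset used in this notebook.
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Two illustrative rows standing in for the real data.
toy_df = pd.DataFrame({
    "Total":   [318, 405],
    "HP":      [45, 60],
    "Attack":  [49, 62],
    "Defense": [49, 63],
    "Sp. Atk": [65, 80],
    "Sp. Def": [65, 80],
    "Speed":   [45, 60],
})

# Label-based selection does not depend on the column order in the file.
df_stats = toy_df[stat_cols]

# True only if every row's Total equals the sum of its six stats.
total_is_sum = (df_stats.sum(axis=1) == toy_df["Total"]).all()
print(total_is_sum)
```

Selecting by name is a little more verbose than `iloc`, but it keeps working if columns are added or reordered.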

###### Let us now see a few meaningful representations for the numerical variables.

In [None]:
df_numerical.plot.kde(figsize=(14,10))

Plotting a **Kernel Density Plot** for our numerical variables, we can see that all these fields roughly resemble a **bell curve** (normal distribution) centered near their means. This is an extremely common occurrence in datasets. The curves taper off on both sides, which implies that values occur on both sides of the mean. Moreover, the majority of the data is close to the mean values, which suggests that **the standard deviation is small** relative to the overall range (the curves are steep).

We can also infer that the mean values of all these numerical variables are pretty close to one another and that their distributions are very similar. Since the values lie in the same neighbourhood, scaling them is unlikely to affect accuracy much (unless the model used requires it). However, scaling them to a smaller range would probably speed up the model.
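
The visual impression from a density plot can be backed up with numbers. The sketch below, on made-up values, computes the skewness of each column (values near 0 suggest a roughly symmetric shape, positive values a longer right tail) and applies z-score standardisation, which is one common way to scale features. It is only an illustration of the idea, not a step this notebook depends on.

```python
import pandas as pd

# Made-up values standing in for two of the stat columns.
toy_stats = pd.DataFrame({
    "HP":     [35, 45, 50, 60, 65, 80, 150],
    "Attack": [30, 49, 55, 62, 70, 84, 130],
})

# Sample skewness per column.
skewness = toy_stats.skew()

# z-score standardisation: subtract the column mean, divide by the column std.
standardised = (toy_stats - toy_stats.mean()) / toy_stats.std()

print(skewness)
print(standardised.describe().loc[["mean", "std"]])
```

After standardisation each column has mean 0 and standard deviation 1, so the columns become directly comparable regardless of their original ranges.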

###### Let us plot some meaningful visualisations for aggregate stats.

Comparing the aggregate stats for all fields

In [None]:
labels = df_num_desc.columns
titles = df_num_desc.index
plt.style.use('ggplot')
fig, a = plt.subplots(4,2, figsize=(12,18))
c=0
for i in range(len(a)):
    for j in range(len(a[0])):
        y = df_num_desc.iloc[c]
        a[i, j].set_title(titles[c])
        a[i, j].bar(labels, y)
        c += 1

fig.suptitle("Aggregate stats vs Numerical Variables", fontdict={'weight':'bold'}, fontsize=20)
fig.tight_layout(pad=3)
plt.show()

From the above plot, we can infer that the aggregate stats are similar across fields **except for the minimum value**, which varies a lot. This suggests that the overall distribution of values is similar across fields, apart from their lower ends.

#### Cleaning the data

In order to build robust and accurate machine learning models, cleaning the data is an essential step. It helps in **making the data more consistent** and hence the model more productive.

Let us check for null values.

In [None]:
pokemon_df.isnull().sum()

As seen from the output, the only column that contains null values (386) is **Type 2**. This makes sense, because not all Pokemons in the franchise are dual type; only those that are dual type have a Type 2 value.

Instead of keeping it null, let's put 'NA' in 'Type 2' for Pokemons that are not dual type.

In [None]:
pokemon_df['Type 2'] = pokemon_df['Type 2'].fillna('NA')

In [None]:
pokemon_df.head(10)

In [None]:
print(pokemon_df.info())
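
One caveat with the string 'NA' is worth illustrating. By default, `pd.read_csv` treats the text `NA` as a missing value, so if the cleaned frame is saved to CSV and loaded again, the filled values come back as nulls unless `keep_default_na=False` is passed. The sketch below shows this round trip on a two-row toy CSV (the rows are illustrative, not read from the real file).

```python
import io

import pandas as pd

# Tiny CSV with one single-type row (empty 'Type 2' field).
csv_text = "Name,Type 1,Type 2\nBulbasaur,Grass,Poison\nCharmander,Fire,\n"

toy_df = pd.read_csv(io.StringIO(csv_text))
nulls_before = toy_df["Type 2"].isnull().sum()   # the empty field is read as NaN

toy_df["Type 2"] = toy_df["Type 2"].fillna("NA")
nulls_after = toy_df["Type 2"].isnull().sum()

# Round trip through CSV text.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Default parsing turns the string "NA" back into a missing value ...
reloaded_default = pd.read_csv(io.StringIO(buffer.getvalue()))
# ... while keep_default_na=False keeps it as an ordinary string.
reloaded_kept = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)

print(nulls_before, nulls_after)
print(reloaded_default["Type 2"].isnull().sum())
print(reloaded_kept["Type 2"].tolist())
```

A label such as 'None' would sidestep this, but 'NA' is fine as long as the frame stays in memory, as it does in this notebook.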

#### Analysing the Relationship between features

###### Now that we have somewhat cleaned our data let's analyse the relationship between different variables.

In order to find how the different numerical variables are related to each other, we are going to use the correlation matrix and plot a heatmap for it.

In [None]:
cor_numerical = df_numerical.corr()
print(cor_numerical)

Next, we import seaborn and use it to plot a heatmap of the above correlation matrix. The matrix was computed using the **Pearson Correlation Coefficient**, which is the default for the corr method.

In [None]:
import seaborn as sns

In [None]:
sns.heatmap(cor_numerical, annot=True)
plt.show()

Taking a glance at the above heatmap, we can infer that the numerical variables in our dataset are not very highly correlated. This makes sense, as each of these attributes is a score for a different property of a Pokemon, and these properties can be considered largely unrelated to each other.

We do see correlation coefficients above 0.5 between the Special Defense and Defense variables, as well as between the Special Defense and Special Attack variables. However, this may be too low to draw any conclusions.
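
Rather than reading pairs off the heatmap by eye, the stronger correlations can be listed programmatically. The following is a minimal sketch on made-up values; the 0.5 `threshold` is an arbitrary cut-off chosen for the example, and the variable names are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for three of the stat columns.
toy_stats = pd.DataFrame({
    "Defense": [40, 55, 60, 80, 100, 120],
    "Sp. Def": [45, 50, 65, 75, 95, 110],
    "Speed":   [90, 30, 70, 40, 85, 35],
})

corr = toy_stats.corr()  # Pearson by default

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

threshold = 0.5  # arbitrary cut-off for this example
strong_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(strong_pairs)
```

Using the absolute value means strongly negative relationships are reported as well.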

#### Visualising the data

Now that we have a fair idea about the dataset, we can create some meaningful visualisations to get some insights about it.

Let's start by plotting the counts associated with different categorical variables.

*Plotting the count of legendary and non-legendary pokemons*

In [None]:
# sort_index() guarantees the order False, True so the counts match the labels below
legendary_count_dist = pokemon_df['Legendary'].value_counts().sort_index()

In [None]:
labels = ['Non-Legendary', 'Legendary']
fig, a = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Legendary Distribution', fontdict={'fontweight':'bold'}, fontsize=20)

a[0].bar(labels, legendary_count_dist)
a[0].set_ylabel('No. of Pokemons')

a[1].pie(legendary_count_dist, labels=labels, explode=[0.05]*2, autopct='%.2f')

plt.show()

*Plotting Generation wise distribution of pokemons*

In [None]:
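# value_counts() sorts by frequency; sort_index() restores generation order (1..6)
# so that the counts line up with the labels used in the plots
generation_count_dist = pokemon_df['Generation'].value_counts().sort_index()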

In [None]:
labels = [i for i in range(1,7)]
fig, a = plt.subplots(1, 2, figsize=(15, 5))

a[0].bar(labels, generation_count_dist)
a[0].set_xlabel('Generation')
a[0].set_ylabel('No. of Pokemons')

a[1].pie(generation_count_dist, explode=[0.05]*len(generation_count_dist), autopct='%.2f', labels=labels)

fig.suptitle('Generation-wise Distribution', fontdict={'fontweight':'bold'}, fontsize=20)
plt.show()
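
A detail to keep in mind for plots like this one: `value_counts()` returns counts ordered by frequency, not by category. Positional labels such as 1 to 6 only line up with the counts if the result is sorted by index first. The sketch below shows the difference on a small made-up series.

```python
import pandas as pd

# Made-up generation values: 1 occurs three times, 3 twice, 2 once.
generation = pd.Series([1, 1, 1, 2, 3, 3], name="Generation")

by_frequency = generation.value_counts()                 # most frequent value first
by_generation = generation.value_counts().sort_index()   # ordered by generation

print(list(by_frequency.index))
print(list(by_generation.index), list(by_generation.values))
```

Taking the labels from the index of the counted series is another way to keep labels and counts aligned.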

*Plotting count of dual type and non-dual type pokemons*

Note: You might recall that we previously changed all na/null values in the 'Type 2' column of the dataset. This implies all rows having 'NA' in the 'Type 2' field refer to non-dual type pokemons.

In [None]:
dualpokemon_count_dist = [
    pokemon_df[pokemon_df['Type 2'] == 'NA']['#'].count(),
    pokemon_df[pokemon_df['Type 2'] != 'NA']['#'].count()
]

In [None]:
labels = ['Non-Dual', 'Dual']

fig, a = plt.subplots(1, 2, figsize=(15, 5))

a[0].bar(labels, dualpokemon_count_dist)
a[0].set_ylabel('No. of Pokemons')

a[1].pie(dualpokemon_count_dist, explode=[0.02]*2, autopct='%.2f', labels=labels)

fig.suptitle('Dual Type Distribution', fontdict={'fontweight':'bold'}, fontsize=20)

plt.show()

*Plotting Type-wise (Type 1) distribution of pokemons*

In [None]:
type_count_dist = pokemon_df['Type 1'].value_counts()

In [None]:
labels = list(type_count_dist.index)

fig, a = plt.subplots(2, 1, figsize=(15, 12))

a[0].bar(labels, type_count_dist)
a[0].set_xlabel('Type')
a[0].set_ylabel('No. of Pokemons')

a[1].pie(type_count_dist, labels=labels, explode=[0.03]*len(type_count_dist))

fig.suptitle('Type-wise Distribution', fontdict={'fontweight':'bold'}, fontsize=20)
plt.show()

Now that we have some idea about the distribution of Pokemons based on various categorical variables, let us see how the Pokemons compare with each other across different categories.

Before doing so, let us add a column to the existing dataset which can give us an idea about the overall strength of a Pokemon. Let us name it **Overall** and let it be the mean of our numerical variables, i.e., mean of HP, Attack, Defense, Sp. Atk, Sp. Def and Speed.

In [None]:
pokemon_df['Overall'] = pokemon_df[['HP', 'Attack', 'Defense', 'Speed', 'Sp. Atk', 'Sp. Def']].mean(axis=1)

In [None]:
pokemon_df.head(10)

Using **Overall** as a measure of strength of a Pokemon, let's compare them based on various categories.

*We will start by comparing legendary Pokemons with non-legendary Pokemons.*

In [None]:
overalls_legendary = list(pokemon_df.groupby('Legendary')['Overall'])

In [None]:
labels = ['Non-Legendary', 'Legendary']
overalls = [list(o[1]) for o in overalls_legendary]
colors = ['red', 'blue']

plt.figure(figsize = (10, 7))

plt.boxplot(overalls, labels = labels)

for i in range(len(labels)):
    x = []
    for j in overalls[i]:
        x.append(i+1)
    plt.scatter(x, overalls[i], c = colors[i])
    
plt.title('Legendary v/s Non-Legendary Pokemons', fontdict={'weight':'bold', 'fontsize':16})
plt.ylabel('Overall')
plt.show()

As seen from the [box plot](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51), the median and quartile values of legendary Pokemons are all higher than those of non-legendary ones. This implies that **legendary Pokemons can be considered stronger than non-legendary ones**, which matches what we know about the franchise.
Also consider that there are far more data points for non-legendary Pokemons, as the count plots above confirm, which may make them more reliable for studying the general characteristics of Pokemons.
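
The quantities a box plot draws can also be computed directly, which makes the comparison less dependent on reading the figure. Below is a small sketch on made-up 'Overall' values: the box spans the first to third quartile, the line inside is the median, and the difference between the two quartiles is the inter-quartile range (IQR).

```python
import pandas as pd

# Made-up values standing in for the real 'Legendary' and 'Overall' columns.
toy_df = pd.DataFrame({
    "Legendary": [False, False, False, False, True, True, True, True],
    "Overall":   [50.0, 60.0, 70.0, 80.0, 95.0, 100.0, 105.0, 110.0],
})

# One row per group, one column per quartile (0.25, 0.5, 0.75).
quartiles = (
    toy_df.groupby("Legendary")["Overall"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

If every quartile of one group is above the corresponding quartile of the other, the boxes do not just overlap less; the whole middle half of that group sits higher.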

*Let us compare the Pokemons generation wise*

In [None]:
overalls_generation = list(pokemon_df.groupby('Generation')['Overall'])

In [None]:
labels = [o[0] for o in overalls_generation]
overalls = [list(o[1]) for o in overalls_generation]
colors = ['red', 'blue', 'green', 'orange', 'purple', 'pink']

plt.figure(figsize = (10, 7))

plt.boxplot(overalls, labels = labels)

for i in range(len(labels)):
    x = []
    for j in overalls[i]:
        x.append(i+1)
    plt.scatter(x, overalls[i], c = colors[i])
    
plt.title('Generation-wise Pokemon Strength', fontdict={'weight':'bold', 'fontsize':16})
plt.ylabel('Overall')
plt.xlabel('Generation')
plt.show()

From the above boxplot, we can infer that Pokemons of different generations are much more comparable to each other than legendary and non-legendary Pokemons were.
Some other inferences include:
- Generation 3 has the highest maximum overall (discounting the outlier in generation 1).
- Generation 4 has the highest median overall.
- Generation 2 has the lowest median overall.

Similarly, inferences regarding the different [quartiles](https://en.wikipedia.org/wiki/Quartile#:~:text=In%20statistics%2C%20a%20quartile%20is,%2Dor%2Dless%20equal%20size.&text=The%20third%20quartile%20(Q3,maximum)%20of%20the%20data%20set) can be drawn from a box plot like this.
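
Statements such as "highest median" can be checked numerically instead of by eye. The sketch below uses made-up values for three generations; `idxmax` and `idxmin` return the group label at which the median is largest and smallest.

```python
import pandas as pd

# Made-up values standing in for the real 'Generation' and 'Overall' columns.
toy_df = pd.DataFrame({
    "Generation": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Overall":    [55.0, 65.0, 90.0, 50.0, 60.0, 70.0, 62.0, 72.0, 85.0],
})

# Median and maximum 'Overall' per generation.
summary = toy_df.groupby("Generation")["Overall"].agg(["median", "max"])

highest_median_gen = summary["median"].idxmax()
lowest_median_gen = summary["median"].idxmin()

print(summary)
print(highest_median_gen, lowest_median_gen)
```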

*For our final visualization, let us compare the strengths of different types(Type 1) of Pokemons*

In [None]:
overalls_type = list(pokemon_df.groupby('Type 1')['Overall'])

In [None]:
labels = [o[0] for o in overalls_type]
overalls = [list(o[1]) for o in overalls_type]
colors = ['red', 'blue', 'green', 'orange', 'purple', 'pink', 'brown', 'cyan', 'olive']

plt.figure(figsize = (15, 7))

plt.boxplot(overalls, labels = labels)

for i in range(len(labels)):
    # One x position per point so each group's points sit on its box
    x = [i + 1] * len(overalls[i])
    # Cycle through the palette, since there are more types than colors
    plt.scatter(x, overalls[i], c = colors[i % len(colors)])
    
plt.title('Type-wise Pokemon Strength', fontdict={'weight':'bold', 'fontsize':16})
plt.ylabel('Overall')
plt.show()

Inferences can be drawn from the given boxplot as explained earlier.
From an overview, one can tell that Dragon Pokemons seem to be stronger, as their median and the quartiles bounding their inter-quartile range are high.
This does make sense, as a lot of legendary Pokemons in the series are of Dragon type.
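
The link between type and legendary status can be checked with a quick group-wise share. Because 'Legendary' is boolean, its mean within each type is the fraction of legendary Pokemons of that type. The rows below are made up purely to show the mechanics and say nothing about the real proportions.

```python
import pandas as pd

# Made-up rows standing in for the real 'Type 1' and 'Legendary' columns.
toy_df = pd.DataFrame({
    "Type 1":    ["Dragon", "Dragon", "Dragon", "Dragon", "Bug", "Bug", "Bug", "Bug"],
    "Legendary": [True, True, False, False, False, False, False, False],
})

# Mean of a boolean column = share of True values in each group.
legendary_share = (
    toy_df.groupby("Type 1")["Legendary"].mean().sort_values(ascending=False)
)
print(legendary_share)
```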

Many other inferences can be drawn from these plots.
Exploring the data on your own is a good way to build a better understanding of it.