# Comprehensive Exploratory Data Analysis (EDA) Tutorial with pandas and Pokémon Database (UPDATE 30/11)

## Welcome to this Kernel and thanks for visiting it!

Hey Kagglers, this kernel walks through the steps of an EDA using the `pandas` module for `python`. The dataset selected is the _Pokémon_ dataset available [here](https://www.kaggle.com/abcsds/pokemon)! It is an easy and fun dataset for trying out our most basic tools and playing around without feeling pressure from competitions.

The structure of this notebook is the following:
1. Data exploration: first peek at the data.
2. More data reviewing!
3. Data cleaning.
4. Classification models: predicting character's legendary status and main type.
5. Regression models: predicting character's HP.

The idea is that every action is justified right after it is performed, so everyone can learn the reason behind that decision. In data science, it is easy to find someone who can simply type code mindlessly. The real challenge is to base all decisions and processes on a strong foundation. Therefore, if something is not clear enough, I encourage you to ask in the comments section so we can discuss it! :)

---
The parts already covered are:
1. `[x] Data exploration: first peek at the data.`
2. `[ ] More data reviewing!` <- coming soon
3. `[ ] Data cleaning.` <- coming soon
4. `[ ] Classification models: predicting character's legendary status and main type.` <- coming soon
5. `[ ] Regression models: predicting character's HP.` <- coming soon
---


In [None]:
# Import relevant packages
import matplotlib.pyplot as plt # For plot configuration
import numpy as np              # For numerical operations
import pandas as pd             # For database management
import seaborn as sns           # For plotting data easily

import warnings
warnings.filterwarnings("ignore")

sns.set()                       # Set seaborn style so plots are nice!

In [None]:
# Read the file with the `.read_csv()` method
df = pd.read_csv('../input/Pokemon.csv')

## 1. Data exploration: first peek at the data

Reminder: the variable types are the following.
* __Nominal__: names, gender... Variables with no intrinsic numerical value. Categories are nominal variables.
* __Ordinal__: scales, education degrees... Variables whose values have an order but do not express proportions (e.g. _very good_ does not necessarily mean double of _good_, simply that it ranks higher).
* __Discrete__: number of chairs in a room, days in the calendar... Integers (i.e. with no decimals, e.g. 0, 4, 129, -12).
* __Real__: height, weight... Numbers from the Real numbers set (i.e. any non-complex number, e.g. 0.9182, $\sqrt{34}$).
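As a minimal sketch of how these four types can map onto `pandas` dtypes, consider the toy frame below. The column names and the `rating` values are made up for illustration and are not part of the Pokémon dataset.

```python
import pandas as pd

# Toy data: one column per variable type (hypothetical columns)
toy = pd.DataFrame({
    'name': ['Bulbasaur', 'Charmander', 'Squirtle'],   # nominal
    'rating': ['good', 'very good', 'bad'],            # ordinal
    'hp': [45, 39, 44],                                # discrete
    'height_m': [0.7, 0.6, 0.5],                       # real
})

# Nominal: an unordered category
toy['name'] = toy['name'].astype('category')

# Ordinal: an ordered category, so comparisons and max() make sense
toy['rating'] = pd.Categorical(toy['rating'],
                               categories=['bad', 'good', 'very good'],
                               ordered=True)

print(toy.dtypes)
print(toy['rating'].max())
print((toy['rating'] > 'bad').tolist())
```

Declaring the order explicitly is what separates an ordinal variable from a nominal one here: comparisons only work on the ordered categorical.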

### 1.1 Simple facts about the dataset

Check the variables of the dataset and obtain a few descriptive statistics.

In [None]:
n_rows, n_cols = df.shape
print('The dataset has {} rows and {} columns.'.format(n_rows, n_cols))

In [None]:
df.head()

In [None]:
columns = df.columns
print('Columns names: {}.'.format(columns.tolist()))

### 1.2 Insights from `.columns`
* We have columns with spaces in their names and punctuation symbols (e.g. ".").
* The column with the Pokémon number is named `#`, which is the comment character in `python`.

After the insights we have gained it is obvious we have to solve two problems:
* In general, naming things with spaces is not recommended. One of the reasons is that if we want to quickly access the feature (e.g. pressing `Tab`), we won't find it and it will have to be typed out completely. It is better to replace all spaces with underscores.
* We need to fix the column named with the comment character as well.
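To make the two problems concrete, here is a small sketch on a toy frame (the three columns and their values are only an illustrative subset). It checks which names are valid Python identifiers and shows a vectorized way of cleaning them, using `Index.str.replace`.

```python
import pandas as pd

# Toy frame that mimics the problematic column names
toy = pd.DataFrame({'#': [1, 2],
                    'Type 1': ['Grass', 'Fire'],
                    'Sp. Atk': [65, 60]})

# Only valid identifiers can be reached with attribute access (toy.col)
print({col: col.isidentifier() for col in toy.columns})

# Rename '#' and replace '. ' and ' ' with underscores
clean = toy.rename(columns={'#': 'Num'})
clean.columns = (clean.columns
                 .str.replace('. ', '_', regex=False)
                 .str.replace(' ', '_', regex=False))

print(clean.columns.tolist())
print(clean.Type_1.tolist())   # attribute access now works
```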

Let's fix the columns by renaming them :)!

In [None]:
# Method 1: step by step
#    1. Create a copy of the df
#    2. Assign the new column name to the old one.
#    3. Delete the old column.
# Notice how this method reorders the columns!
df_slow = df.copy()
df_slow['Num'] = df['#']
to_rename = columns[[2,3,8,9]]
for col in to_rename:
    if '.' in col:
        cola = col.replace('. ', '_')
    else:
        cola = col.replace(' ', '_')
    df_slow[cola] = df[col]
df_slow = df_slow.drop(columns=columns[[0,2,3,8,9]])
print(df_slow.columns)


# Method 2: built-in function
#    1. Create a copy of the df
#    2. Create the mapping
#    3. Apply the function
df_good = df.copy()
mapper = {'#': 'Num'}
mapper.update({col: col.replace(' ','_') if '.' not in col else col.replace('. ', '_') for col in to_rename.tolist()})
df_good.rename(columns=mapper, inplace=True)
print(df_good.columns)

With that we can see that we have 13 columns with the following data:
1. Number of Pokémon (discrete variable).
2. Name of the Pokémon (nominal variable).
3. Its main type (nominal variable).
4. Secondary type (nominal variable).
5. Total: sum of HP, Attack, Defense, Special Attack, Special Defense and Speed points (discrete variable).
6. Stats for fighting that add up to the total (discrete variable).
7. Generation that it was first seen (nominal variable).
8. Whether or not the Pokémon is legendary (nominal variable).

In reality, the stats could be real variables (e.g. Pikachu could have a HP of 34.8 instead of 35) that have been discretized in this dataset.

### 1.3 `unique()` and `describe()`

Let's say that we want to know which unique Pokémon we have:

In [None]:
# 1. We can use the built-in method of pandas "unique()"
names = df.Name.unique()

# 2. Print a few names:
print('Some Pokémon names are: '+('"{}", '*3).format(*np.random.choice(names, 3))+
      'and "{}".'.format(*np.random.choice(names,1)))
print('The amount of unique Pokémons is {}.'.format(len(names)))

In [None]:
df_good.describe()

### 1.4 Insights from `.describe()`
* It is clear that the stats for the number (i.e. column `Num`, originally '#') do not provide any information.
* 'Total' allows us to see the distribution of available points for any character.
* The following columns show the distribution of each stat. Given that the mean and the median (i.e. 50%) are close to each other, the distributions look roughly symmetric, although that alone is not enough to conclude that they are normal.
* The generation is a categorical variable. Therefore, the description doesn't provide anything useful.
* Since all variables have the same count as the dataset size, we can conclude that these variables have no missing values.
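A quick numerical complement to `.describe()` is to look at the skewness and the number of missing values per column. The sketch below uses a made-up toy frame (not the Pokémon data) to show how mean, median and skew behave for a symmetric and a right-skewed column.

```python
import pandas as pd

# Toy columns (hypothetical values) with different shapes
toy = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5, 6, 7],
    'right_skewed': [1, 1, 2, 2, 3, 4, 30],
    'with_gap': [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0],
})

summary = pd.DataFrame({
    'mean': toy.mean(),
    'median': toy.median(),
    'skew': toy.skew(),              # ~0 for symmetric data, >0 for a long right tail
    'missing': toy.isnull().sum(),   # missing values per column
})
print(summary)
```

Running the same four lines on `df_good` would give a rough check of the symmetry and missing-value observations above.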

In [None]:
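# Let's store these column names since we will use them later on.
# Note: this slice selects Total, HP, Attack, Defense, Sp_Atk and Sp_Def (it stops before Speed).
stats = df_good.columns[4:-3]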

### 1.5 Distribution plots


Normally, instead of reading the numbers, it is better to see the distribution of the variables so let's do it! :)

In [None]:
sns.pairplot(data=df_good.iloc[:,5:-2], 
             kind='reg')

### 1.6 Insights
* We can see that most of the data is right skewed, since all histograms have most of the values on the left.
* We can also see that some variables seem to be correlated: for instance HP and Attack.
* It can also be seen that there are outliers in the data. For instance, the regression line in the HP vs. Attack plot looks steep, but it does not in the Attack vs. HP plot. The correlation coefficient itself does not depend on the order of the variables; what changes is the fitted line, which is sensitive to a few extreme points.
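As a minimal sketch on synthetic data (the variables `x` and `y` are illustrative, not the Pokémon stats), the Pearson correlation is identical in both orders, while the two least-squares slopes differ and their product equals the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)
y = 0.5 * x + rng.normal(0, 20, size=200)   # noisy linear relation

# Correlation does not depend on the order of the variables
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression slopes do: y regressed on x vs. x regressed on y
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]

print(r_xy, r_yx)
print(slope_y_on_x, slope_x_on_y)
print(slope_y_on_x * slope_x_on_y, r_xy ** 2)
```

So two panels of a `pairplot` can look quite different even though they describe the same correlation.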

Let's vamp it up by plotting it depending on the generation they come from so it looks cooler!

In [None]:
sns.pairplot(data=df_good.iloc[:,5:-1],       # Take all stats and the Generation column
             diag_kind='kde',                 # Instead of a histogram, plot a KDE
             kind='reg',                      # Plot regression lines as well
             plot_kws=dict(truncate=True),    # Limit each regression line to the range of its data
             vars=df_good.columns[5:-2],      # Use only the stats columns for the plots...
             hue='Generation')                # ...and Generation as the separation variable.

### 1.6 Insights (II)
* Interestingly enough, even though most of the variables seem to follow the same KDE across generations, `Speed` is different for generation 4 with respect to the rest. Likewise with `Defense`. One could think that the designers made somewhat more innovative characters in that generation.
* The outliers are from different generations. It would also be interesting to see if they are of the same `Type_1`, `Type_2`, `Legendary` or not. It will be done further down in the more advanced analysis.

### 1.7 Categorical variables overview

Let's continue with the analysis of the non-numerical variables. Usually, we would like to know how many instances of each category we have. For that we have the two approaches defined below.

In [None]:
for category in ['Type_1', 'Type_2', 'Generation', 'Legendary']:
    print('"{}" has {} missing values. The rest are:'.format(category, df_good[category].isnull().sum()))
    
    # 1.a Simple, built-in method "value_counts()"
    types_simple = df_good[category].value_counts()
    # 1.b MapReduce strategy: group by type and count the instances, keeping only the number (column 'Num')
    types_group = df_good.groupby(category).count()['Num']
    
    # Both yield the same counting *BUT* the ordering is different:
    #    - groupby: alphabetical order.
    #    - value_counts: frequency order.
    # Either way, we can print any of the results (personally I like the frequency order):
    print(types_simple, end='\n\n')

### 1.8 Insights from categorical variables:
* Both `Type_` variables have a large number of categories. This can be counterproductive when training a model.
* `Type_2` has a large number of missing values: 386/800 = 0.48 of the data.
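One possible way to handle the missing secondary types, sketched on a toy Series, is to measure the missing share and make the absence an explicit category. The `'None'` label is an assumption chosen for illustration, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Toy secondary-type column with missing values
type_2 = pd.Series(['Poison', np.nan, 'Flying', np.nan], name='Type_2')

# Share of missing values
missing_ratio = type_2.isnull().mean()
print(missing_ratio)

# value_counts() drops NaN by default; dropna=False keeps it visible
print(type_2.value_counts(dropna=False))

# Treat "no secondary type" as its own category (hypothetical label)
filled = type_2.fillna('None')
print(filled.value_counts())
```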

## Summary part 1
* Data is pretty clean and straightforward.
* We had to rename some columns to avoid further problems.
* Seems like there are some outliers seen in the numerical variables that we will have to take into account in next steps.
* Similarly, the categorical variables have probably too many categories and some have a high amount of missing values.

---

## 2. More on the data

### 2.1 Cleaning outliers

First, let's define a function that is going to make the analysis process simpler.

In [None]:
# CLEANING OUTLIERS ---- ongoing
# CATEGORY FIXING ------ todo
# ONE-HOT LABELING ----- todo
# PLOTS! --------------- put them everywhere

In [None]:
def outlier_check(data):
    """
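    This function takes a pandas Series and plots the distribution
    of the variable, along with dashed lines and shading that mark the
    values further than two standard deviations from the mean
    (roughly the most extreme 5% of the data if it were normal).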
    """
    # 1. First computations of maximum, mean and standard deviation
    M = max(data)
    m, s = np.mean(data), np.std(data)
    
    # 2. L(ow) and H(igh) filters.
    L, H = m-2*s, m+2*s
    
    # 3. Plot configuration
    f, ax = plt.subplots()
    f.set_figheight(5)
    f.set_figwidth(5)
    ax.set_ylim([0,0.025])
    ax.set_xlim([0,M])
    ax.set_title('"{}" outlier detection'.format(data.name))
    
    # 3.1 Draw a vertical line in the upper limit and shade it
    ax.vlines(H, 0, 0.025, color='red', linestyle='dashed')
    ax.fill_between(x=[H,M], y1=0.025, color='red', alpha=.05)
    
    # 3.2 Similarly for the lower limit
    ax.vlines(L, 0, 0.025, color='red', linestyle='dashed')
    ax.fill_between(x=[0,L], y1=0.025, color='red', alpha=.05)
    
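    # 4. Plot the distribution
    #    (`distplot` is deprecated since seaborn 0.11; `histplot(..., kde=True, stat='density')` is its replacement)
    sns.distplot(data, ax=ax)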
    
    # 5. Return the indices of the outliers
    return data[(data<L) | (data>H)].index

With the helper function `outlier_check()` we can now see fancy plots of our numerical features.
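Leaving the plotting aside, the rule that the helper applies is simply mean ± 2 standard deviations. A plot-free sketch on a toy Series (values made up for illustration) shows which indices such a rule returns:

```python
import pandas as pd

# Toy stat with one clearly extreme value
toy = pd.Series([50, 52, 48, 51, 49, 50, 47, 53, 50, 120], name='HP')

# ddof=0 matches the default of np.std used in the helper
m, s = toy.mean(), toy.std(ddof=0)
low, high = m - 2 * s, m + 2 * s

# Indices of the values outside the [low, high] band
outlier_idx = toy[(toy < low) | (toy > high)].index
print(low, high)
print(list(outlier_idx))
```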

In [None]:
df_good['Outlier'] = np.zeros(len(df_good))   # 1-D column of zeros: nothing flagged yet
for var in stats:
    df_good.loc[outlier_check(df_good[var]),'Outlier'] = 1

In [None]:
len(df_good.loc[df_good.Outlier==1, :])

As one can see, 125 out of 800 Pokémon (16%) are outliers. It is a lot. We need to further investigate what is going on!
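Part of the explanation may be the rule itself: a row is flagged if *any* of the checked variables falls outside its ± 2 standard deviations band. The simulation below is a rough sketch under strong assumptions (six independent, normally distributed variables, which the real stats are not) of how quickly the flagged share grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 800, 6          # same size as the dataset, six checked variables
X = rng.standard_normal((n_samples, n_vars))

# Standardize each column and flag values beyond 2 standard deviations
z = (X - X.mean(axis=0)) / X.std(axis=0)
flagged_any = (np.abs(z) > 2).any(axis=1)

# Theoretical share for independent normal variables
expected = 1 - 0.9545 ** n_vars

print(flagged_any.mean(), expected)
```

Under these assumptions roughly a quarter of the rows would be flagged, so a share around 16% does not necessarily mean the data is unusually extreme.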

#### Research 1:
We might suspect that most of them should be Legendary, given that they have special abilities. Let's check it out.

In [None]:
legendary = len(df_good[df_good.Legendary==True])
legendary_outliers = len(df_good.loc[(df_good.Outlier==1) & (df_good.Legendary==True), :])
print('There are {} legendary Pokémon, out of those {} have outlier stats.'.format(legendary, legendary_outliers))

Even though it's a lot of them, not all Legendary characters are actually special!

#### Research 2:
Why are there so many non legendary outliers? Maybe they are the _Mega_ versions of some characters, let's check it out.

In [None]:
normal_outliers = len(df_good.loc[(df_good.Outlier==1) & (df_good.Legendary==False), :])
mega_outliers = len([pokemon for pokemon in df_good.loc[(df_good.Outlier==1) & (df_good.Legendary==False), 'Name'] 
                 if 'Mega' in pokemon])

print('There are {} normal outlier Pokémon, out of those {} are "Mega" versions.'.format(normal_outliers, mega_outliers))

### 2.2 Insights from outlier analysis
1. We can break down the 125 outliers into 41 Legendary and 84 Normal (non-Legendary) outliers, 26 of which are Mega versions.
2. As we can see, there are 65 Legendary Pokémons, of which only 41 have at least one stat outside the mean ± 2 standard deviations band (roughly the central 95% of the Pokémon).
3. There are, therefore, 24 Legendary Pokémons with not so special stats.
4. We need to treat the Legendary characters differently.
5. Among the rest, the Mega versions are really different from the normal versions.
6. We need to treat Mega characters differently as well!
7. We can drop the normal Pokémons.

----

One option to work with the outliers is simply to drop them, which is what I'm going to be doing here. However, other methods of outlier treatment could be easily implemented. They include methods like:
* Treating the feature in which they are outliers as a random variable coming from a certain probability distribution and assigning a random number from that distribution.
* Same as before but assigning them an extreme value of the distribution.
* Replacing the value with the mean, median or mode.
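The three alternatives can be sketched on a toy Series as follows. The values are made up, the normal distribution in the first option is an assumption, and the names `resampled`, `clipped` and `median_filled` are only illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.Series([50.0, 52.0, 48.0, 51.0, 49.0, 50.0, 47.0, 53.0, 50.0, 120.0], name='HP')
m, s = toy.mean(), toy.std(ddof=0)
low, high = m - 2 * s, m + 2 * s
is_outlier = (toy < low) | (toy > high)
clean = toy[~is_outlier]

# Option 1: draw a replacement from a distribution fitted to the clean values
#           (a normal distribution is assumed here)
rng = np.random.default_rng(0)
resampled = toy.copy()
resampled[is_outlier] = rng.normal(clean.mean(), clean.std(), size=is_outlier.sum())

# Option 2: assign an extreme value, here by clipping to the [low, high] band
clipped = toy.clip(lower=low, upper=high)

# Option 3: replace with the median of the clean values
median_filled = toy.mask(is_outlier, clean.median())

print(pd.DataFrame({'original': toy, 'resampled': resampled,
                    'clipped': clipped, 'median': median_filled}))
```

All three keep the row, which matters when the other features of that Pokémon are still informative.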

In [None]:
# Pokémon number vs. HP (recent seaborn versions require x and y as keyword arguments)
sns.jointplot(x=df_good.iloc[:,0], y=df_good.iloc[:,5], kind='reg')