# Data Mining & Machine Learning - Classification Part 1

### Case 1: Classification of Legendary Pokémon with Supervised Learning

83109 Samuel Didovic<br>
86368 Isabel Lober<br>
85915 Pascal Seitz<br>

Lecturer: Prof. Dr. Adrian Moriariu

## Table of Contents
1. [Step 1: Investigation of the dataset's basics](#intro)
2. [Step 2: Investigation of missing values](#second)
    1. [2.1 `type2`](#sub21)
    2. [2.2 `percentage_male`](#sub22)
    3. [2.3 `height_m` and `weight_kg`](#sub23)
    4. [2.4 First conclusions](#sub24)
3. [Step 3: Feature Engineering](#third)
    1. [3.1 NaN replacement](#sub31)
    2. [3.2 Imputation](#sub32)
    3. [3.3 Introducing a new feature](#sub33)
    4. [3.4 Check on changes](#sub34)
    5. [3.5 `capture_rate`](#sub35)
    6. [3.6 Final steps](#sub36)
4. [Step 4: Save the cleaned dataset](#fourth)

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np

<br>

### Step 1: Investigation of the dataset's basics <a name = "intro"></a>

In [3]:
df = pd.read_csv("pokemon.csv")

In [4]:
df.shape

(801, 41)

In [5]:
df.head()

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0


<br>

Transpose the dataset to provide an appropriate overview, since not every feature is displayed.<br>
Additionally, display some random rows to get a proper understanding of the data.

In [6]:
df.sample(5).T

Unnamed: 0,717,55,701,751,275
abilities,"['Aura Break', 'Power Construct']","['Vital Spirit', 'Anger Point', 'Defiant']","['Cheek Pouch', 'Pickup', 'Plus']","['Water Bubble', 'Water Absorb']","['Guts', 'Scrappy']"
against_bug,1.0,0.5,0.5,1.0,0.5
against_dark,1.0,0.5,0.5,1.0,1.0
against_dragon,2.0,1.0,0.0,1.0,1.0
against_electric,0.0,1.0,0.5,2.0,2.0
against_fairy,2.0,2.0,1.0,1.0,1.0
against_fight,1.0,1.0,0.5,0.5,1.0
against_fire,0.5,1.0,1.0,1.0,1.0
against_flying,1.0,2.0,0.5,2.0,1.0
against_ghost,1.0,1.0,1.0,1.0,0.0


These randomly chosen Pokémon already provide some insights.<br>
- The values are of type string, integer and float.<br>
- Further, there are some NaN values in `percentage_male` as well as in `type2`.<br>

Therefore, these features deserve a closer look.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

This shows the data type of each column.<br>
In total, there are
- **21** columns containing **float types**
- **13** columns containing **integer types**
- **7** columns containing **object types**.
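
The counts above can also be obtained directly instead of reading them off `df.info()`. A minimal sketch on a toy frame (the frame `toy` and its values are illustrative assumptions, not the Pokémon data):

```python
import pandas as pd

# Toy frame standing in for the Pokémon data (illustrative values only).
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander"],
    "attack": [49, 52],
    "height_m": [0.7, 0.6],
})

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

Applied to `df`, the same call should list one row per data type together with its number of columns.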

Examine some basic statistics.

In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
against_bug,801.0,0.9962547,0.597248,0.25,0.5,1.0,1.0,4.0
against_dark,801.0,1.057116,0.438142,0.25,1.0,1.0,1.0,4.0
against_dragon,801.0,0.968789,0.353058,0.0,1.0,1.0,1.0,2.0
against_electric,801.0,1.07397,0.654962,0.0,0.5,1.0,1.0,4.0
against_fairy,801.0,1.068976,0.522167,0.25,1.0,1.0,1.0,4.0
against_fight,801.0,1.065543,0.717251,0.0,0.5,1.0,1.0,4.0
against_fire,801.0,1.135456,0.691853,0.25,0.5,1.0,2.0,4.0
against_flying,801.0,1.192884,0.604488,0.25,1.0,1.0,1.0,4.0
against_ghost,801.0,0.9850187,0.558256,0.0,1.0,1.0,1.0,4.0
against_grass,801.0,1.03402,0.788896,0.25,0.5,1.0,1.0,4.0


<br>

### Step 2: Investigation of missing values <a name = "second"></a>

Examine whether the dataset contains null values.<br>
Based on the initial insights, there must be at least a few.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

It seems that some columns contain null values, just as assumed.<br>
Now, use an alternative function to get a more readable overview.

In [10]:
df.isnull().sum()

abilities              0
against_bug            0
against_dark           0
against_dragon         0
against_electric       0
against_fairy          0
against_fight          0
against_fire           0
against_flying         0
against_ghost          0
against_grass          0
against_ground         0
against_ice            0
against_normal         0
against_poison         0
against_psychic        0
against_rock           0
against_steel          0
against_water          0
attack                 0
base_egg_steps         0
base_happiness         0
base_total             0
capture_rate           0
classfication          0
defense                0
experience_growth      0
height_m              20
hp                     0
japanese_name          0
name                   0
percentage_male       98
pokedex_number         0
sp_attack              0
sp_defense             0
speed                  0
type1                  0
type2                384
weight_kg             20
generation             0


This is very insightful.<br>
- `type2` has the most missing values, with a total of **384**.
- `percentage_male` has the second-highest count of missing values, with a total of **98**.
- `height_m` and `weight_kg` contain **20** missing values each.
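
To avoid scanning the full list by eye, the output can be reduced to the affected columns. A minimal sketch on a toy frame (`toy`, `missing` and `missing_only` are names chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "type2": ["poison", np.nan, np.nan],
    "height_m": [0.7, np.nan, 1.1],
})

# Count missing values per column, keep only columns with at least one,
# and sort them from most to fewest missing values.
missing = toy.isnull().sum()
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```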

Next, find out why a total of 384 values are missing in `type2`.

#### 2.1 `type2` <a name = "sub21"></a>

In [11]:
df[df["type2"].isnull()].head().T

Unnamed: 0,3,4,6,7,8
abilities,"['Blaze', 'Solar Power']","['Blaze', 'Solar Power']","['Torrent', 'Rain Dish']","['Torrent', 'Rain Dish']","['Torrent', 'Rain Dish']"
against_bug,0.5,0.5,1.0,1.0,1.0
against_dark,1.0,1.0,1.0,1.0,1.0
against_dragon,1.0,1.0,1.0,1.0,1.0
against_electric,1.0,1.0,2.0,2.0,2.0
against_fairy,0.5,0.5,1.0,1.0,1.0
against_fight,1.0,1.0,1.0,1.0,1.0
against_fire,0.5,0.5,0.5,0.5,0.5
against_flying,1.0,1.0,1.0,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0


In [12]:
df["type2"].isnull().head().T

0    False
1    False
2    False
3     True
4     True
Name: type2, dtype: bool

The difference between these two is that the latter returns a Series of boolean values, which is True where `type2` is null and False otherwise.<br>
The former returns the DataFrame itself, restricted to the rows where `type2` is null, which is useful to see the direct cause.<br>
<br>
This examination shows that some Pokémon don't have a second type. This should be kept in mind for later.
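
The relation between the boolean Series and the filtered DataFrame can be sketched on a toy frame (names and values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Squirtle"],
    "type2": ["poison", np.nan, np.nan],
})

mask = toy["type2"].isnull()   # boolean Series, one entry per row
single_type = toy[mask]        # only the rows where the mask is True

print(mask.tolist())                 # [False, True, True]
print(single_type["name"].tolist())  # ['Charmander', 'Squirtle']
```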

#### 2.2 `percentage_male`<a name = "sub22"></a>

In [13]:
df[df["percentage_male"].isnull()].head(12).T

Unnamed: 0,80,81,99,100,119,120,131,136,143,144,145,149
abilities,"['Magnet Pull', 'Sturdy', 'Analytic']","['Magnet Pull', 'Sturdy', 'Analytic']","['Soundproof', 'Static', 'Aftermath']","['Soundproof', 'Static', 'Aftermath']","['Illuminate', 'Natural Cure', 'Analytic']","['Illuminate', 'Natural Cure', 'Analytic']","['Limber', 'Imposter']","['Trace', 'Download', 'Analytic']","['Pressure', 'Snow Cloak']","['Pressure', 'Static']","['Pressure', 'Flame Body']","['Pressure', 'Unnerve']"
against_bug,0.5,0.5,1.0,1.0,1.0,2.0,1.0,1.0,0.5,0.5,0.25,2.0
against_dark,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0
against_dragon,0.5,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
against_electric,0.5,0.5,0.5,0.5,2.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0
against_fairy,0.5,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0
against_fight,2.0,2.0,1.0,1.0,1.0,0.5,2.0,2.0,1.0,0.5,0.5,0.5
against_fire,2.0,2.0,1.0,1.0,0.5,0.5,1.0,1.0,2.0,1.0,0.5,1.0
against_flying,0.25,0.25,0.5,0.5,1.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,2.0


Another insightful detail.<br>
Some Pokémon don't seem to have a designated gender.<br>
Further, at first glance, legendary Pokémon don't seem to have a gender either.
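
The first-glance impression about legendary Pokémon could be checked with a cross-tabulation. A minimal sketch on toy values (the numbers are made up for illustration; `no_gender` and `table` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "percentage_male": [88.1, np.nan, np.nan, 50.0, np.nan],
    "is_legendary": [0, 0, 1, 0, 1],
})

# Rows: is percentage_male missing? Columns: is the Pokémon legendary?
no_gender = toy["percentage_male"].isnull()
table = pd.crosstab(no_gender, toy["is_legendary"])
print(table)
```

On the real data, a non-zero count in the cell for legendary Pokémon with a known `percentage_male` would indicate that the impression does not hold for all of them.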

#### 2.3 `height_m` and `weight_kg`<a name = "sub23"></a>

In [14]:
df[df["height_m"].isnull()]["name"]

18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object

In [15]:
df[df["weight_kg"].isnull()]["name"]

18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object

Based on the index, most of these Pokémon belong to Generation 1, except for Hoopa and Lycanroc, which were introduced in Generations 6 and 7.<br>
With some background knowledge, one might say that these are Generation 1 Pokémon which got alternative forms in later generations.
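
Both lists look identical. Rather than comparing them by eye, the two null masks can be compared directly. A minimal sketch on toy values (illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Rattata", "Pikachu", "Hoopa"],
    "height_m": [np.nan, 0.4, np.nan],
    "weight_kg": [np.nan, 6.0, np.nan],
})

# True if height_m and weight_kg are missing in exactly the same rows.
same_rows = bool((toy["height_m"].isnull() == toy["weight_kg"].isnull()).all())
print(same_rows)  # True
```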

#### 2.4 First conclusions<a name = "sub24"></a>

Since this dataset contains only 801 rows, it is recommended not to drop any Pokémon. Dropping rows could make it more difficult to draw conclusions from the dataset.<br>
<br>
Possible solutions:
- `type2`: replace NaN with the string `"None"`
- `percentage_male`: replace NaN with the string `"None"`
- `height_m` and `weight_kg`: impute by replacing NaN with the column mean
- introduce a new feature `genderless`

<br>

### Step 3: Feature Engineering <a name = "third"></a>

#### 3.1 NaN replacement<a name = "sub31"></a>

In [16]:
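# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about (chained assignment).
df["type2"] = df["type2"].fillna("None")
# Filling a float column with a string turns it into an object column.
df["percentage_male"] = df["percentage_male"].fillna("None")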

#### 3.2 Imputation<a name = "sub32"></a>

In [17]:
# Replace missing heights and weights with the respective column mean.
df["height_m"] = df["height_m"].fillna(df["height_m"].mean())
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())
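
The cell above fills the gaps with the mean. For skewed columns, the mean and the median can differ noticeably, so the choice may matter. A minimal sketch on made-up values (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# One large value pulls the mean upwards; the median is less affected.
toy = pd.Series([0.3, 0.4, 0.5, np.nan, 14.5], name="height_m")

filled_mean = toy.fillna(toy.mean())      # NaN becomes 3.925
filled_median = toy.fillna(toy.median())  # NaN becomes 0.45

print(filled_mean[3], filled_median[3])
```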

#### 3.3 Introducing a new feature<a name = "sub33"></a>

Introduce a new feature `genderless`, which identifies whether a Pokémon is genderless (1) or has a gender (0).

In [18]:
df["genderless"] = np.where(df["percentage_male"] == "None", 1, 0)
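
The cell above relies on the `"None"` placeholder set in Step 3.1. An alternative that does not depend on the placeholder is to derive the flag from the null mask itself, which would have to run before the NaN replacement. A minimal sketch on toy values, offered as an option rather than the approach used here:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bulbasaur", "Magnemite"],
    "percentage_male": [88.1, np.nan],
})

# 1 where percentage_male is missing (genderless), 0 otherwise.
toy["genderless"] = toy["percentage_male"].isnull().astype(int)
print(toy["genderless"].tolist())  # [0, 1]
```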

#### 3.4 Check on changes<a name = "sub34"></a>

In [19]:
df.head().T

Unnamed: 0,0,1,2,3,4
abilities,"['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Blaze', 'Solar Power']","['Blaze', 'Solar Power']"
against_bug,1.0,1.0,1.0,0.5,0.5
against_dark,1.0,1.0,1.0,1.0,1.0
against_dragon,1.0,1.0,1.0,1.0,1.0
against_electric,0.5,0.5,0.5,1.0,1.0
against_fairy,0.5,0.5,0.5,0.5,0.5
against_fight,0.5,0.5,0.5,1.0,1.0
against_fire,2.0,2.0,2.0,0.5,0.5
against_flying,2.0,2.0,2.0,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0


In [20]:
df.tail().T

Unnamed: 0,796,797,798,799,800
abilities,['Beast Boost'],['Beast Boost'],['Beast Boost'],['Prism Armor'],['Soul-Heart']
against_bug,0.25,1.0,2.0,2.0,0.25
against_dark,1.0,1.0,0.5,2.0,0.5
against_dragon,0.5,0.5,2.0,1.0,0.0
against_electric,2.0,0.5,0.5,1.0,1.0
against_fairy,0.5,0.5,4.0,1.0,0.5
against_fight,1.0,2.0,2.0,0.5,1.0
against_fire,2.0,4.0,0.5,1.0,2.0
against_flying,0.5,1.0,1.0,1.0,0.5
against_ghost,1.0,1.0,0.5,2.0,1.0


Overall, it looks good.<br>
Confirm this by looking at some Pokémon in detail (especially a legendary and one with an alternative form).<br>

In [21]:
confirm_list = ["Mew", "Golem"]
df[df["name"].isin(confirm_list)].T

Unnamed: 0,75,150
abilities,"['Rock Head', 'Sturdy', 'Sand Veil', 'Magnet P...",['Synchronize']
against_bug,1.0,2.0
against_dark,1.0,2.0
against_dragon,1.0,1.0
against_electric,0.0,1.0
against_fairy,1.0,1.0
against_fight,2.0,0.5
against_fire,0.5,1.0
against_flying,0.5,1.0
against_ghost,1.0,2.0


The changes worked as expected.<br>
Display the overall information once again.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

In [23]:
df.isnull().sum()

abilities            0
against_bug          0
against_dark         0
against_dragon       0
against_electric     0
against_fairy        0
against_fight        0
against_fire         0
against_flying       0
against_ghost        0
against_grass        0
against_ground       0
against_ice          0
against_normal       0
against_poison       0
against_psychic      0
against_rock         0
against_steel        0
against_water        0
attack               0
base_egg_steps       0
base_happiness       0
base_total           0
capture_rate         0
classfication        0
defense              0
experience_growth    0
height_m             0
hp                   0
japanese_name        0
name                 0
percentage_male      0
pokedex_number       0
sp_attack            0
sp_defense           0
speed                0
type1                0
type2                0
weight_kg            0
generation           0
is_legendary         0
genderless           0
dtype: int64

Now, all columns are free of missing values.<br>
However, `capture_rate` stands out: `df.info()` reports its values as type object.<br>
At first glance, though, the column seems to contain only integer values.

#### 3.5 `capture_rate`<a name = "sub35"></a>

In [24]:
for i in df.capture_rate:
    print(i, end = ", ")

45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120, 45, 255, 120, 45, 255, 120, 45, 255, 127, 255, 90, 255, 90, 190, 75, 255, 90, 235, 120, 45, 235, 120, 45, 150, 25, 190, 75, 170, 50, 255, 90, 255, 120, 45, 190, 75, 190, 75, 255, 50, 255, 90, 190, 75, 190, 75, 190, 75, 255, 120, 45, 200, 100, 50, 180, 90, 45, 255, 120, 45, 190, 60, 255, 120, 45, 190, 60, 190, 75, 190, 60, 45, 190, 45, 190, 75, 190, 75, 190, 60, 190, 90, 45, 45, 190, 75, 225, 60, 190, 60, 90, 45, 190, 75, 45, 45, 45, 190, 60, 120, 60, 30, 45, 45, 225, 75, 225, 60, 225, 60, 45, 45, 45, 45, 45, 45, 45, 255, 45, 45, 35, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 25, 3, 3, 3, 45, 45, 45, 3, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 90, 255, 90, 255, 90, 255, 90, 90, 190, 75, 190, 150, 170, 190, 75, 190, 75, 235, 120, 45, 45, 190, 75, 65, 45, 255, 120, 45, 45, 235, 120, 75, 255, 90, 45, 45, 30, 70, 45, 225, 45, 60, 190, 75, 190, 60, 25, 190, 75, 45, 25, 190, 45, 60, 120, 60, 190, 75, 225, 75, 60, 190, 75, 45, 25, 25, 120, 45, 45,

It seems that one Pokémon in particular has two capture rates stored in a single string, which explains why the column is of type object. This is an important finding.
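
Instead of printing every value, the non-numeric entries can be isolated directly. A minimal sketch using `pd.to_numeric` with `errors="coerce"` on a toy Series (the Series `rates` is an illustrative stand-in for the column):

```python
import pandas as pd

rates = pd.Series(["45", "255", "30 (Meteorite)255 (Core)", "3"], name="capture_rate")

# Values that cannot be parsed as numbers become NaN ...
as_number = pd.to_numeric(rates, errors="coerce")
# ... so the null mask points at the offending original entries.
bad = rates[as_number.isnull()]
print(bad.tolist())  # ['30 (Meteorite)255 (Core)']
```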

In [25]:
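# Keep the first of the two capture rates; assign back instead of using inplace on a column selection.
df["capture_rate"] = df["capture_rate"].replace({"30 (Meteorite)255 (Core)" : "30"})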

Finally, convert `capture_rate` to an integer type.

In [26]:
df["capture_rate"] = df["capture_rate"].astype("int")
df["capture_rate"].dtype

dtype('int32')

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

#### 3.6 Final steps<a name = "sub36"></a>

Remove `japanese_name`, as it provides no added value for our model, and `pokedex_number`, as this could negatively influence
the model which is about to be developed. The same applies to the `against_*` columns.

In [28]:
# Define a list of features that are about to be removed from the dataset.
against_types = ["against_bug", "against_dark", "against_dragon", "against_electric", "against_fairy", "against_fight", 
                 "against_fire", "against_flying", "against_ghost", "against_grass", "against_ground", "against_ice", 
                 "against_normal", "against_poison", "against_psychic", "against_rock", "against_steel", "against_water"]

# Add two more features, which are about to be removed from the dataset.
against_types.extend(["japanese_name", "pokedex_number"])

# Remove the features.
df.drop(columns = against_types, inplace = True)
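
The explicit list above could also be built from the column prefix, which avoids typing each name. A minimal sketch on a toy frame (column values are illustrative; `against_cols` and `reduced` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bulbasaur"],
    "against_bug": [1.0],
    "against_fire": [2.0],
    "attack": [49],
})

# Collect every column whose name starts with the shared prefix.
against_cols = [c for c in toy.columns if c.startswith("against_")]
reduced = toy.drop(columns=against_cols)
print(reduced.columns.tolist())  # ['name', 'attack']
```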

<br>

Fix a typo in the column name `classfication`.

In [29]:
df.rename(columns = {"classfication" : "classification"}, inplace = True)

<br>

Move the `name` column to the very beginning of the dataset.

In [30]:
df.insert(0, "name", df.pop("name"))

In [31]:
df.head().T

Unnamed: 0,0,1,2,3,4
name,Bulbasaur,Ivysaur,Venusaur,Charmander,Charmeleon
abilities,"['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Blaze', 'Solar Power']","['Blaze', 'Solar Power']"
attack,49,62,100,52,64
base_egg_steps,5120,5120,5120,5120,5120
base_happiness,70,70,70,70,70
base_total,318,405,625,309,405
capture_rate,45,45,45,45,45
classification,Seed Pokémon,Seed Pokémon,Seed Pokémon,Lizard Pokémon,Flame Pokémon
defense,49,63,123,43,58
experience_growth,1059860,1059860,1059860,1059860,1059860


<br>

### Step 4: Save the cleaned dataset <a name = "fourth"></a>

In [32]:
df.shape

(801, 22)

In [33]:
df.to_csv("pokemon_cleaned.csv", index = False)
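
One point to keep in mind when reloading the file: `pd.read_csv` interprets a set of default strings as missing values, and depending on the pandas version this set may include `"None"`, which would turn the placeholders back into NaN. A minimal sketch, assuming the placeholders should survive the round trip, that passes `keep_default_na=False` on a toy frame written to an in-memory buffer:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander"],
    "type2": ["poison", "None"],
})

# Write to an in-memory buffer instead of a file for this illustration.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False disables the default list of NaN markers,
# so the string "None" is read back unchanged.
reloaded = pd.read_csv(buffer, keep_default_na=False)
print(reloaded["type2"].tolist())  # ['poison', 'None']
```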