# Data Mining & Machine Learning - Regression Part 2

### Case 2: Prediction of Footballer Values with Supervised Learning

83109 Samuel Didovic<br>
86368 Isabel Lober<br>
85915 Pascal Seitz<br>

Lecturer: Prof. Dr. Adrian Moriariu

## Table of Contents
1. [Step 1: Investigation of the dataset's basics](#intro)
2. [Step 2: Investigation of missing values](#second)
    1. [2.1 `type2`](#sub21)
    2. [2.2 `percentage_male`](#sub22)
    3. [2.3 `height_m` and `weight_kg`](#sub23)
    4. [2.4 First conclusions](#sub24)
3. [Step 3: Feature Engineering](#third)
    1. [3.1 NaN replacement](#sub31)
    2. [3.2 Imputation](#sub32)
    3. [3.3 Introducing a new feature](#sub33)
    4. [3.4 Check on changes](#sub34)
    5. [3.5 `capture_rate`](#sub35)
    6. [3.6 Final steps](#sub36)
4. [Step 4: Save the cleaned dataset](#fourth)

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statistics
import os
import seaborn as sns
import matplotlib as mpl

<br>

### Step 1: Investigation of the dataset's basics <a name = "intro"></a>

In [5]:
# Loading the data
filenames = ["transfermarkt_fbref_201718", "transfermarkt_fbref_201819", "transfermarkt_fbref_201920"]

# Data folder
data_dir = "football-data"

dfs = [] # List to store the dataframes

# Load the data from each CSV file
for i, file in enumerate(filenames):
    # Construct the file path
    file_path = f"{data_dir}/{file}.csv"
    
    # Read the CSV file directly, specifying the delimiter and thousands separator
    data = pd.read_csv(file_path, delimiter=';', thousands=',')
    
    # Add a column for the year of the data
    data['year'] = 2017 + i
    
    # Append the dataframe to the list
    dfs.append(data)
    
df = pd.concat(dfs) # Turning the list of dataframes into one dataframe
data = df.copy() # Creating a copy of the dataframe for later use

print("--------------------------------------")
print(f"Amount of samples: {len(df)}. Amount of features: {len(df.columns)}")
print("--------------------------------------")


--------------------------------------
Amount of samples: 7108. Amount of features: 402
--------------------------------------


In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,player,nationality,position,squad,age,birth_year,value,height,position2,...,xGA,xGDiff,xGDiff/90,Attendance,CL,WinCL,CLBestScorer,Season,year,Column1
0,379.0,Burgui,es ESP,"FW,MF",Alavés,23.0,1993.0,1800000.0,186.0,attack - Left Winger,...,53.2,-14.2,-0.37,16819.0,0.0,0.0,0.0,201718#,2017,
1,2530.0,Raphaël Varane,fr FRA,DF,Real Madrid,24.0,1993.0,70000000.0,191.0,Defender - Centre-Back,...,45.4,37.9,1.0,66161.0,1.0,1.0,0.0,201718#,2017,
2,721.0,Rubén Duarte,es ESP,DF,Alavés,21.0,1995.0,2000000.0,179.0,Defender - Left-Back,...,53.2,-14.2,-0.37,16819.0,0.0,0.0,0.0,201718#,2017,
3,2512.0,Samuel Umtiti,fr FRA,DF,Barcelona,23.0,1993.0,60000000.0,182.0,Defender - Centre-Back,...,41.1,37.2,0.98,66603.0,1.0,0.0,0.0,201718#,2017,
4,882.0,Manu García,es ESP,MF,Alavés,31.0,1986.0,1800000.0,183.0,midfield - Defensive Midfield,...,53.2,-14.2,-0.37,16819.0,0.0,0.0,0.0,201718#,2017,


<br>

Transpose the dataset to provide an appropriate overview, since not every feature is displayed.<br>
Additionally, display some random rows to get a proper understanding of the data.

In [13]:
from tabulate import tabulate

# Set display options to show all rows
pd.set_option('display.max_rows', None)

# Transpose one randomly sampled row so that every feature is listed with an example value
columns_overview = df.sample(1).T
# Convert DataFrame to a pretty-printed table string
table_string = tabulate(columns_overview, headers='keys', tablefmt='psql')

print(table_string) # Print the table string

# Write the string to a text file
with open('columns_overview.txt', 'w') as file:
    file.write(table_string)

print("File has been written and saved as 'columns_overview.txt'.")
print("--------------------------------------")

# Reset the display options after printing
pd.reset_option('display.max_rows')

+----------------------------------------+--------------------------+
|                                        | 908                      |
|----------------------------------------+--------------------------|
| Unnamed: 0                             | nan                      |
| player                                 | Collin Quaner            |
| nationality                            | de GER                   |
| position                               | FW,MF                    |
| squad                                  | Huddersfield             |
| age                                    | 27.0                     |
| birth_year                             | 1991.0                   |
| value                                  | 1500000.0                |
| height                                 | 191.0                    |
| position2                              | Forward - Centre-Forward |
| foot                                   | right                    |
| league            

From our initial exploratory data analysis, we have gained several insights into the dataset's structure and content:

- **Data Types Observed**:
  - **String Data**: Attributes like `player`, `nationality`, `position`, `squad`, `league`, and `foot` are categorical data stored as strings. These features describe qualitative aspects of the players and their playing environment.
  - **Numerical Data**: Attributes such as `age`, `birth_year`, `value`, `height`, `games`, `minutes`, `goals`, and `assists`, along with various performance metrics like `xg` (expected goals), `xa` (expected assists), and `passes` are stored as integers or floats. These provide quantitative measurements of player performances and physical characteristics.
  
- **Handling Missing Values**:
  - The dataset may contain missing values (`NaN` values), which are common in comprehensive sports data. This can occur in several scenarios, such as incomplete data capture, players not participating in certain games, or unrecorded metrics.
  - Specific attributes such as performance metrics (`goals_per90`, `assists_per90`, etc.) often have zero values, which could indicate either no activity in these areas or insufficient playing time to gather data. It's crucial to distinguish between genuine zeros (no activity) and cases where data may not be recorded due to players not playing in certain matches.

- **Features Warranting Further In-depth Review**:
  - **Performance Metrics**: Exploring how different metrics like `xg`, `xa`, `passes`, and `shots` are distributed among players can reveal insights into player efficiency and team strategies.
  - **Physical Attributes**: Analyzing features like `height` and `age` can help in understanding physical diversity in the dataset and potential correlations with player roles and performance.
  - **Participation Metrics**: Metrics like `games`, `minutes`, and `starts` are critical for understanding player utilization and impact. Analyzing the distribution of these features can help identify key players, squad rotation, and injury impacts.
  - **Missing Data Patterns**: Identifying patterns in missing data can help in understanding biases in the data collection process or in the dataset itself. For instance, backup players might have more missing data points in performance metrics.

Therefore, it is beneficial to conduct a detailed review of these aspects to ensure a comprehensive understanding of the dataset before proceeding with further analysis or model building.
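
The distinction between genuine zeros and unrecorded values can be checked by counting both separately. The following is a minimal sketch on hypothetical toy data (the values and the `toy` frame are assumptions, only the column names `minutes`, `goals_per90` and `assists_per90` come from the dataset); it is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few per-90 metrics.
toy = pd.DataFrame({
    "minutes": [900, 0, 450, np.nan],
    "goals_per90": [0.3, 0.0, 0.0, np.nan],
    "assists_per90": [0.0, 0.0, 0.2, np.nan],
})

metric_cols = ["goals_per90", "assists_per90"]

# Count explicit zeros and missing entries separately for each metric.
zero_vs_missing = pd.DataFrame({
    "zeros": (toy[metric_cols] == 0).sum(),
    "missing": toy[metric_cols].isna().sum(),
})

# Rows without playing time whose metrics are all zero: these zeros may
# mean "no data" rather than "no activity".
no_minutes = toy["minutes"] == 0
zeros_without_minutes = int(toy.loc[no_minutes, metric_cols].eq(0).all(axis=1).sum())

print(zero_vs_missing)
print(zeros_without_minutes)
```

Applied to the real dataframe, the same two counts would show which metrics rely on zeros and which on `NaN` to encode absent information.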


In [14]:
print("Data Information")
print(df.info())
print("--------------------------------------")

Data Information
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7108 entries, 0 to 2643
Columns: 402 entries, Unnamed: 0 to Column1
dtypes: float64(393), int64(1), object(8)
memory usage: 21.9+ MB
None
--------------------------------------


This shows the data type of each column.<br>
In total, there are
- **393** columns of **float type**
- **1** column of **integer type**
- **8** columns of **object type**.
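
The same summary can be obtained programmatically instead of reading it off `df.info()`. A small sketch, assuming a hypothetical three-column `toy` frame in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe.
toy = pd.DataFrame({
    "player": ["A", "B"],
    "age": [23.0, 31.0],
    "year": [2017, 2018],
})

# Number of columns per data type.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Names of the non-numeric columns, which usually need encoding later on.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(non_numeric_columns)
```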

Examine some basic statistics.

In [15]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,2232.0,1.339719e+03,7.770203e+02,0.0,656.75,1347.5,2013.25,2686.0
age,6976.0,2.542317e+01,4.444526e+00,0.0,22.00,25.0,28.00,41.0
birth_year,6976.0,1.991136e+03,4.790624e+01,0.0,1989.00,1992.0,1996.00,2004.0
value,6976.0,1.023844e+07,1.654409e+07,50.0,1500000.00,4000000.0,12000000.00,200000000.0
height,6975.0,1.819604e+02,9.470125e+00,0.0,178.00,183.0,187.00,203.0
...,...,...,...,...,...,...,...,...
CL,6976.0,1.836296e-01,3.872096e-01,0.0,0.00,0.0,0.00,1.0
WinCL,6976.0,9.317661e-03,9.608416e-02,0.0,0.00,0.0,0.00,1.0
CLBestScorer,6696.0,4.480287e-04,2.116353e-02,0.0,0.00,0.0,0.00,1.0
year,7108.0,2.018058e+03,8.262711e-01,2017.0,2017.00,2018.0,2019.00,2019.0


<br>

### Step 2: Investigation of missing values <a name = "second"></a>

Examine whether the dataset contains null values.<br>
Based on the initial insights, there must be at least a few.

In [29]:
missing_values_summary = df.isnull().sum()
filtered_missing_values = missing_values_summary[missing_values_summary > 0]  # Filter columns with missing values

# Convert the Series to a string
missing_values_str = filtered_missing_values.to_string()

# Print the content
print(missing_values_str)

# Write the string to a text file
with open('missing_values_summary.txt', 'w') as file:
    file.write(missing_values_str)

print("File has been written and saved as 'missing_values_summary.txt'.")
print("--------------------------------------")

Unnamed: 0                                4876
player                                     132
nationality                                132
position                                   132
squad                                      132
age                                        132
birth_year                                 132
value                                      132
height                                     133
position2                                  132
foot                                       155
league                                     132
games                                      132
games_starts                               132
minutes                                    132
goals                                      132
assists                                    132
pens_made                                  132
pens_att                                   132
cards_yellow                               132
cards_red                                  132
goals_per90  

## Analysis of Missing Data

The exploratory data analysis has revealed a significant pattern of missing values across various features in our dataset. Here's a structured approach to further handle and investigate these missing values:

### Understanding Patterns of Missingness

- **Consistency in Missing Data**: Several features share exactly the same number of missing values (132, 264), which suggests that the same rows are affected in each case, either through a systematic issue or because the feature does not apply.
- **Features with Extensive Missing Data**: Particularly high levels of missing data in specific columns (`Unnamed: 0`, `Column1`, and `CLBestScorer`) necessitate a deeper investigation to determine their relevance and potential impact on the analysis.
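
One way to make such patterns visible is to group the columns by their number of missing values. The sketch below uses a hypothetical `toy` frame (its values are assumptions); with the real data, large groups sharing one count would point to the same rows being empty.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe.
toy = pd.DataFrame({
    "player": ["A", np.nan, "C", "D"],
    "squad": ["X", np.nan, "Y", "Z"],
    "foot": ["left", np.nan, np.nan, "right"],
    "year": [2017, 2017, 2018, 2019],
})

missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Map each distinct missing count to the columns that share it.
columns_by_count = {}
for column, count in missing_counts.items():
    columns_by_count.setdefault(int(count), []).append(column)

print(columns_by_count)
```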




In [35]:
# Calculate the percentage of missing values in each row
missing_percentage_samples = df.isnull().sum(axis=1) / len(df.columns) * 100

# Add this as a new column to the DataFrame for easy inspection
df['missing_percentage'] = missing_percentage_samples

# Select a few key features to display along with the missing percentage
# Adjust the column names as per your DataFrame
selected_features = ['player', 'nationality', 'squad', 'missing_percentage']

# Filter to show only rows with more than 50% missing values and display selected features
filtered_df = df[df['missing_percentage'] > 50][selected_features]
print(filtered_df.sort_values(by='missing_percentage', ascending=False))


     player nationality squad  missing_percentage
2100    NaN         NaN   NaN           99.255583
2183    NaN         NaN   NaN           99.255583
2197    NaN         NaN   NaN           99.255583
2196    NaN         NaN   NaN           99.255583
2195    NaN         NaN   NaN           99.255583
...     ...         ...   ...                 ...
2138    NaN         NaN   NaN           99.255583
2137    NaN         NaN   NaN           99.255583
2136    NaN         NaN   NaN           99.255583
2135    NaN         NaN   NaN           99.255583
2231    NaN         NaN   NaN           99.255583

[132 rows x 4 columns]


### Python Code to Remove Highly Incomplete Rows
Rows in which more than 50% of the values are missing are removed. The 132 rows identified above are almost entirely empty (about 99% missing), so dropping them loses no usable information and improves the quality of the dataset for subsequent analyses.

In [40]:
# Filtering out rows where missing_percentage is greater than 50%
df = df[df['missing_percentage'] <= 50]

# Confirm the removal by printing the shape of the new DataFrame and the original DataFrame
print(f"Original DataFrame shape: {data.shape}")
print(f"New DataFrame shape after removing high missing percentage rows: {df.shape}")
print(f"Removed {data.shape[0] - df.shape[0]} rows with more than 50% missing values.")


Original DataFrame shape: (7108, 402)
New DataFrame shape after removing high missing percentage rows: (6976, 403)
Removed 132 rows with more than 50% missing values.
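
The same filter can also be expressed without a helper column via `DataFrame.dropna(thresh=...)`, which keeps rows with at least a given number of non-null values. This is only an alternative sketch on a hypothetical `toy` frame, not the approach used above:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-in: one complete row, one empty row, one half-empty row.
toy = pd.DataFrame({
    "player": ["A", np.nan, "C"],
    "squad": ["X", np.nan, np.nan],
    "age": [23.0, np.nan, np.nan],
    "value": [1.8e6, np.nan, 2.0e6],
})

# "At most 50% missing" is the same as "at least 50% non-null".
min_non_null = math.ceil(0.5 * toy.shape[1])
toy_clean = toy.dropna(thresh=min_non_null)

print(toy_clean)
```

The half-empty row is kept because exactly 50% missing still satisfies the `<= 50` condition.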


In [41]:
# Calculate the percentage of missing values for each feature
missing_percentage = df.isnull().sum() / len(df) * 100

# Filter and print features with more than 0% missing values
print(missing_percentage[missing_percentage > 0].sort_values(ascending=False))


Unnamed: 0            68.004587
Column1               31.995413
CLBestScorer           5.905963
tackles_mid_3rdm       1.892202
pass_targetsm          1.892202
                        ...    
passes_left_footm      1.892202
passes_right_footm     1.892202
pens_madem             1.892202
foot                   0.329702
height                 0.014335
Length: 188, dtype: float64


In [42]:
# Drop the 'missing_percentage' column as it is no longer needed
df = df.drop(columns='missing_percentage')

0        379.0
1       2530.0
2        721.0
3       2512.0
4        882.0
         ...  
2639       NaN
2640       NaN
2641       NaN
2642       NaN
2643       NaN
Name: Unnamed: 0, Length: 6976, dtype: float64

#### 2.1 `type2`<a name = "sub21"></a>

In [11]:
df[df["type2"].isnull()].head().T

Unnamed: 0,3,4,6,7,8
abilities,"['Blaze', 'Solar Power']","['Blaze', 'Solar Power']","['Torrent', 'Rain Dish']","['Torrent', 'Rain Dish']","['Torrent', 'Rain Dish']"
against_bug,0.5,0.5,1.0,1.0,1.0
against_dark,1.0,1.0,1.0,1.0,1.0
against_dragon,1.0,1.0,1.0,1.0,1.0
against_electric,1.0,1.0,2.0,2.0,2.0
against_fairy,0.5,0.5,1.0,1.0,1.0
against_fight,1.0,1.0,1.0,1.0,1.0
against_fire,0.5,0.5,0.5,0.5,0.5
against_flying,1.0,1.0,1.0,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0


In [12]:
df["type2"].isnull().head().T

0    False
1    False
2    False
3     True
4     True
Name: type2, dtype: bool

The difference between the two is that the latter returns a boolean Series that is `True` where `type2` is null and `False` otherwise.<br>
The former returns the DataFrame itself, restricted to the rows where `type2` is null, which is useful for inspecting
the affected rows directly.<br>
<br>
This examination shows that some Pokémon do not have a second type. This should be kept in mind for later.

#### 2.2 `percentage_male`<a name = "sub22"></a>

In [13]:
df[df["percentage_male"].isnull()].head(12).T

Unnamed: 0,80,81,99,100,119,120,131,136,143,144,145,149
abilities,"['Magnet Pull', 'Sturdy', 'Analytic']","['Magnet Pull', 'Sturdy', 'Analytic']","['Soundproof', 'Static', 'Aftermath']","['Soundproof', 'Static', 'Aftermath']","['Illuminate', 'Natural Cure', 'Analytic']","['Illuminate', 'Natural Cure', 'Analytic']","['Limber', 'Imposter']","['Trace', 'Download', 'Analytic']","['Pressure', 'Snow Cloak']","['Pressure', 'Static']","['Pressure', 'Flame Body']","['Pressure', 'Unnerve']"
against_bug,0.5,0.5,1.0,1.0,1.0,2.0,1.0,1.0,0.5,0.5,0.25,2.0
against_dark,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0
against_dragon,0.5,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
against_electric,0.5,0.5,0.5,0.5,2.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0
against_fairy,0.5,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0
against_fight,2.0,2.0,1.0,1.0,1.0,0.5,2.0,2.0,1.0,0.5,0.5,0.5
against_fire,2.0,2.0,1.0,1.0,0.5,0.5,1.0,1.0,2.0,1.0,0.5,1.0
against_flying,0.25,0.25,0.5,0.5,1.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,2.0


Another insightful detail.<br>
Some Pokémon do not seem to have a designated gender.<br>
Furthermore, at first glance, legendary Pokémon do not seem to have a gender either.
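
The impression about legendary Pokémon can be checked with a cross-tabulation of `is_legendary` against missing `percentage_male`. The sketch below runs on a small hypothetical `toy` frame (the rows are illustrative assumptions, not a sample of the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names match the dataset.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Magnemite", "Articuno", "Mew", "Latias"],
    "percentage_male": [88.1, np.nan, np.nan, np.nan, 0.0],
    "is_legendary": [0, 0, 1, 1, 1],
})

# Label each row by whether the gender ratio is missing.
gender = (
    toy["percentage_male"].isnull()
    .map({True: "missing", False: "present"})
    .rename("gender")
)

# Rows: legendary flag, columns: gender information present or missing.
table = pd.crosstab(toy["is_legendary"], gender)
print(table)
```

If the "present" cell of the legendary row is not zero on the real data, not every legendary Pokémon is genderless.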

#### 2.3 `height_m` and `weight_kg`<a name = "sub23"></a>

In [14]:
df[df["height_m"].isnull()]["name"]

18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object

In [15]:
df[df["weight_kg"].isnull()]["name"]

18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object

Based on the index, one could conclude that these are Generation 1 Pokémon, except for Hoopa and Lycanroc, which were introduced in Generations 6 and 7.<br>
With some background knowledge, one might say that the Generation 1 entries are Pokémon that received alternative forms in later generations.
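
Both lists look identical, which can be verified rather than compared by eye. A minimal sketch on hypothetical toy rows (values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names match the dataset.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Rattata", "Raichu", "Pikachu"],
    "height_m": [0.7, np.nan, np.nan, 0.4],
    "weight_kg": [6.9, np.nan, np.nan, 6.0],
})

# True if height and weight are missing in exactly the same rows.
same_rows_missing = bool(
    (toy["height_m"].isnull() == toy["weight_kg"].isnull()).all()
)

# Names of the affected Pokémon.
affected = toy.loc[toy["height_m"].isnull(), "name"].tolist()

print(same_rows_missing, affected)
```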

#### 2.4 First conclusions<a name = "sub24"></a>

Since this dataset contains only about 800 rows (801 entries), it is recommended not to drop any Pokémon. Removing rows could make it more difficult to draw conclusions from the dataset.<br>
<br>
Possible solutions:
- `type2`: replace NaN with the string `"None"`
- `percentage_male`: replace NaN with the string `"None"`
- `height_m` and `weight_kg`: impute NaN with a central value such as the median (the code in 3.2 uses the mean)
- introduce a new feature `genderless`

<br>

### Step 3: Feature Engineering <a name = "third"></a>

#### 3.1 NaN replacement<a name = "sub31"></a>

In [16]:
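# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions no longer guarantee to modify df
df["type2"] = df["type2"].fillna("None")
df["percentage_male"] = df["percentage_male"].fillna("None")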

#### 3.2 Imputation<a name = "sub32"></a>

In [17]:
# Replace missing heights and weights with the column mean
df["height_m"] = df["height_m"].fillna(df["height_m"].mean())
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())
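
Section 2.4 mentions the median, while the cell above imputes with the mean. The two can differ noticeably when a column is skewed. The following sketch illustrates this on a hypothetical `weights` series with one very heavy entry (all values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly skewed weights with one missing value.
weights = pd.Series([6.0, 8.5, 30.0, 999.9, np.nan], name="weight_kg")

# The mean is pulled upwards by the single heavy entry.
filled_mean = weights.fillna(weights.mean())

# The median is robust against that outlier.
filled_median = weights.fillna(weights.median())

print(filled_mean.iloc[-1], filled_median.iloc[-1])
```

Which statistic is more appropriate depends on the actual distribution of `height_m` and `weight_kg`, which would need to be inspected first.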

#### 3.3 Introducing a new feature<a name = "sub33"></a>

Introduce a new feature `genderless`, which flags whether a Pokémon is genderless (1) or has a gender (0).

In [18]:
df["genderless"] = np.where(df["percentage_male"] == "None", 1, 0)

#### 3.4 Check on changes<a name = "sub34"></a>

In [19]:
df.head().T

Unnamed: 0,0,1,2,3,4
abilities,"['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Blaze', 'Solar Power']","['Blaze', 'Solar Power']"
against_bug,1.0,1.0,1.0,0.5,0.5
against_dark,1.0,1.0,1.0,1.0,1.0
against_dragon,1.0,1.0,1.0,1.0,1.0
against_electric,0.5,0.5,0.5,1.0,1.0
against_fairy,0.5,0.5,0.5,0.5,0.5
against_fight,0.5,0.5,0.5,1.0,1.0
against_fire,2.0,2.0,2.0,0.5,0.5
against_flying,2.0,2.0,2.0,1.0,1.0
against_ghost,1.0,1.0,1.0,1.0,1.0


In [20]:
df.tail().T

Unnamed: 0,796,797,798,799,800
abilities,['Beast Boost'],['Beast Boost'],['Beast Boost'],['Prism Armor'],['Soul-Heart']
against_bug,0.25,1.0,2.0,2.0,0.25
against_dark,1.0,1.0,0.5,2.0,0.5
against_dragon,0.5,0.5,2.0,1.0,0.0
against_electric,2.0,0.5,0.5,1.0,1.0
against_fairy,0.5,0.5,4.0,1.0,0.5
against_fight,1.0,2.0,2.0,0.5,1.0
against_fire,2.0,4.0,0.5,1.0,2.0
against_flying,0.5,1.0,1.0,1.0,0.5
against_ghost,1.0,1.0,0.5,2.0,1.0


Overall, it looks good.<br>
Confirm this by looking at some Pokémon in detail (especially a legendary and one with an alternative form).<br>

In [21]:
confirm_list = ["Mew", "Golem"]
df[df["name"].isin(confirm_list)].T

Unnamed: 0,75,150
abilities,"['Rock Head', 'Sturdy', 'Sand Veil', 'Magnet P...",['Synchronize']
against_bug,1.0,2.0
against_dark,1.0,2.0
against_dragon,1.0,1.0
against_electric,0.0,1.0
against_fairy,1.0,1.0
against_fight,2.0,0.5
against_fire,0.5,1.0
against_flying,0.5,1.0
against_ghost,1.0,2.0


It worked as expected.<br>
Display the overall information once again.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

In [23]:
df.isnull().sum()

abilities            0
against_bug          0
against_dark         0
against_dragon       0
against_electric     0
against_fairy        0
against_fight        0
against_fire         0
against_flying       0
against_ghost        0
against_grass        0
against_ground       0
against_ice          0
against_normal       0
against_poison       0
against_psychic      0
against_rock         0
against_steel        0
against_water        0
attack               0
base_egg_steps       0
base_happiness       0
base_total           0
capture_rate         0
classfication        0
defense              0
experience_growth    0
height_m             0
hp                   0
japanese_name        0
name                 0
percentage_male      0
pokedex_number       0
sp_attack            0
sp_defense           0
speed                0
type1                0
type2                0
weight_kg            0
generation           0
is_legendary         0
genderless           0
dtype: int64

Now the columns are free of missing values.<br>
However, `capture_rate` stands out: `df.info()` reports its values as type object,<br>
although at first glance the column seems to contain only integer values.

#### 3.5 `capture_rate`<a name = "sub35"></a>

In [24]:
for i in df.capture_rate:
    print(i, end = ", ")

45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120, 45, 255, 120, 45, 255, 120, 45, 255, 127, 255, 90, 255, 90, 190, 75, 255, 90, 235, 120, 45, 235, 120, 45, 150, 25, 190, 75, 170, 50, 255, 90, 255, 120, 45, 190, 75, 190, 75, 255, 50, 255, 90, 190, 75, 190, 75, 190, 75, 255, 120, 45, 200, 100, 50, 180, 90, 45, 255, 120, 45, 190, 60, 255, 120, 45, 190, 60, 190, 75, 190, 60, 45, 190, 45, 190, 75, 190, 75, 190, 60, 190, 90, 45, 45, 190, 75, 225, 60, 190, 60, 90, 45, 190, 75, 45, 45, 45, 190, 60, 120, 60, 30, 45, 45, 225, 75, 225, 60, 225, 60, 45, 45, 45, 45, 45, 45, 45, 255, 45, 45, 35, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 25, 3, 3, 3, 45, 45, 45, 3, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 90, 255, 90, 255, 90, 255, 90, 90, 190, 75, 190, 150, 170, 190, 75, 190, 75, 235, 120, 45, 45, 190, 75, 65, 45, 255, 120, 45, 45, 235, 120, 75, 255, 90, 45, 45, 30, 70, 45, 225, 45, 60, 190, 75, 190, 60, 25, 190, 75, 45, 25, 190, 45, 60, 120, 60, 190, 75, 225, 75, 60, 190, 75, 45, 25, 25, 120, 45, 45,

It seems that one Pokémon in particular has two capture rates stored in a single string, which is why the column is of type object. This is an important finding.
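
Instead of scanning the printed values by eye, the offending entries can be located with `pd.to_numeric(..., errors="coerce")`, which turns everything that is not a number into `NaN`. A minimal sketch on a hypothetical `capture_rate` series (the short series is an assumption; the problematic string is the one replaced in the next cell):

```python
import pandas as pd

# Hypothetical excerpt of the column, stored as strings.
capture_rate = pd.Series(
    ["45", "255", "30 (Meteorite)255 (Core)", "3"], name="capture_rate"
)

# Non-numeric strings become NaN instead of raising an error.
as_number = pd.to_numeric(capture_rate, errors="coerce")

# Original values that could not be parsed.
offending = capture_rate[as_number.isnull()].unique().tolist()

print(offending)
```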

In [25]:
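# Keep the first of the two capture rates; assign back instead of using inplace on a column selection
df["capture_rate"] = df["capture_rate"].replace({"30 (Meteorite)255 (Core)": "30"})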

Finally, convert `capture_rate`.

In [26]:
df["capture_rate"] = df["capture_rate"].astype("int")
df["capture_rate"].dtype

dtype('int32')

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

#### 3.6 Final steps<a name = "sub36"></a>

Remove `japanese_name`, as it adds no value for our model, and `pokedex_number`, as it could negatively influence
the model that is about to be developed. The same applies to the `against_*` features.

In [28]:
# Define a list of features that are about to be removed from the dataset.
against_types = ["against_bug", "against_dark", "against_dragon", "against_electric", "against_fairy", "against_fight", 
                 "against_fire", "against_flying", "against_ghost", "against_grass", "against_ground", "against_ice", 
                 "against_normal", "against_poison", "against_psychic", "against_rock", "against_steel", "against_water"]

# Add two more features, which are about to be removed from the dataset.
against_types.extend(["japanese_name", "pokedex_number"])

# Remove the features.
df.drop(columns = against_types, inplace = True)
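
As an alternative to typing out all eighteen names, the `against_*` columns could be collected by their prefix. This is only a sketch on a hypothetical one-row `toy` frame (values are placeholders), not the cell that was run above:

```python
import pandas as pd

# Hypothetical stand-in with two of the against_* columns.
toy = pd.DataFrame({
    "name": ["Bulbasaur"],
    "against_bug": [1.0],
    "against_fire": [2.0],
    "attack": [49],
    "japanese_name": ["placeholder"],
    "pokedex_number": [1],
})

# Collect every column whose name starts with the shared prefix.
against_cols = [col for col in toy.columns if col.startswith("against_")]

# Drop them together with the two additional features.
toy_reduced = toy.drop(columns=against_cols + ["japanese_name", "pokedex_number"])

print(toy_reduced.columns.tolist())
```

The prefix-based selection stays correct if further `against_*` columns are added, at the cost of being less explicit than the hand-written list.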

<br>

Fix a typo.

In [29]:
df.rename(columns = {"classfication" : "classification"}, inplace = True)

<br>

Move the Pokémon's name to the first column of the dataset.

In [30]:
df.insert(0, "name", df.pop("name"))

In [31]:
df.head().T

Unnamed: 0,0,1,2,3,4
name,Bulbasaur,Ivysaur,Venusaur,Charmander,Charmeleon
abilities,"['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Overgrow', 'Chlorophyll']","['Blaze', 'Solar Power']","['Blaze', 'Solar Power']"
attack,49,62,100,52,64
base_egg_steps,5120,5120,5120,5120,5120
base_happiness,70,70,70,70,70
base_total,318,405,625,309,405
capture_rate,45,45,45,45,45
classification,Seed Pokémon,Seed Pokémon,Seed Pokémon,Lizard Pokémon,Flame Pokémon
defense,49,63,123,43,58
experience_growth,1059860,1059860,1059860,1059860,1059860


<br>

### Step 4: Save the cleaned dataset <a name = "fourth"></a>

In [32]:
df.shape

(801, 22)

In [33]:
df.to_csv("pokemon_cleaned.csv", index = False)