# Pokemon Classifier
Supervised learning workflow with scikit-learn

In [None]:
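import pandas as pd
import numpy as np
import sklearn as sk
# Import the submodules used below explicitly so that `sk.model_selection`
# and friends resolve on any scikit-learn version.
import sklearn.preprocessing, sklearn.model_selection, sklearn.feature_selection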

## Project Statement

### Context

Pokémon are creatures with varied characteristics, some of which are
classified as "legendary" because of their rarity and power. The goal is
to train a model that predicts whether a Pokémon is legendary from
its statistics.


### Data description

The dataset contains information on 800
Pokémon, including characteristics such as hit points (HP), attack,
defense, and speed, as well as categorical attributes (type, generation, etc.).

Avenues to explore:
* Selecting the best features for classification.
* Comparing model performance (decision trees, kNN, neural
networks).
* Impact of data normalization on the results.

Dataset link: https://www.kaggle.com/abcsds/pokemon


## Preprocessing

### Data importation

In [None]:
pokemon_df = pd.read_csv("Pokemon.csv")
print("Pokemon data set size:", pokemon_df.size)
pokemon_df.head()

Pokemon data set size: 10400


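| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |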
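
The `10400` printed above is the number of cells, not the number of rows: `DataFrame.size` is rows × columns (here 800 × 13), while `DataFrame.shape` reports both dimensions separately. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative only):

```python
import pandas as pd

# Toy frame for illustration only: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(toy.shape)  # (3, 2) -> (rows, columns)
print(toy.size)   # 6      -> rows * columns
```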


In [None]:
pokemon_df["Type 1"].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [None]:
pokemon_df["Type 2"].unique()

array(['Poison', nan, 'Flying', 'Dragon', 'Ground', 'Fairy', 'Grass',
       'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark', 'Water',
       'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'], dtype=object)

### Drop or Clean Irrelevant Columns

Drop the **#** column (an index/ID) and the **Name** column from the explanatory variables: names are free-text strings, nearly one per row, and are unlikely to help unless NLP is used. **Legendary** is also removed from `X` because it is the target, not a feature.

In [None]:
X = pokemon_df.drop(["#", "Name", "Legendary"], axis=1)  # axis=1 drops columns rather than rows

### Encode Categorical Features

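Use one-hot encoding for Type 1 and Type 2.
We could use an integer encoding such as LabelEncoder, but that would impose an arbitrary order on the types; one-hot is the safer choice, particularly for distance-based models (kNN) and neural networks.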

In [None]:
X = pd.get_dummies(X)
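
As a minimal sketch of what `pd.get_dummies` does here, the toy frame below (made-up rows, not the real dataset) shows that a missing `Type 2` simply yields zeros in every `Type 2_*` column, since `dummy_na` defaults to `False`. The `toy`, `encoded`, and `type2_cols` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "HP": [45, 39, 44],
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan],
})

# Numeric columns pass through; string columns become one column per value.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Rows whose Type 2 is missing get 0/False in every Type 2_* column.
type2_cols = [c for c in encoded.columns if c.startswith("Type 2_")]
print(encoded[type2_cols].sum(axis=1).tolist())
```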

These are the resulting columns:

In [None]:
X.columns

Index(['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Generation', 'Type 1_Bug', 'Type 1_Dark', 'Type 1_Dragon',
       'Type 1_Electric', 'Type 1_Fairy', 'Type 1_Fighting', 'Type 1_Fire',
       'Type 1_Flying', 'Type 1_Ghost', 'Type 1_Grass', 'Type 1_Ground',
       'Type 1_Ice', 'Type 1_Normal', 'Type 1_Poison', 'Type 1_Psychic',
       'Type 1_Rock', 'Type 1_Steel', 'Type 1_Water', 'Type 2_Bug',
       'Type 2_Dark', 'Type 2_Dragon', 'Type 2_Electric', 'Type 2_Fairy',
       'Type 2_Fighting', 'Type 2_Fire', 'Type 2_Flying', 'Type 2_Ghost',
       'Type 2_Grass', 'Type 2_Ground', 'Type 2_Ice', 'Type 2_Normal',
       'Type 2_Poison', 'Type 2_Psychic', 'Type 2_Rock', 'Type 2_Steel',
       'Type 2_Water'],
      dtype='object')

In [None]:
y = pokemon_df["Legendary"].astype(int)  # convert True/False to 1/0

### Normalization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Some algorithms (like kNN or neural networks) are sensitive to scale.
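
To make the scale sensitivity concrete, here is a small sketch with made-up `[Total, Generation]` rows (not taken from the dataset). Before scaling, Euclidean distance is dominated by the large-range column; after standardization both columns contribute comparably. The `dist` helper is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [Total, Generation].
rows = np.array([[300.0, 1.0], [310.0, 6.0], [600.0, 1.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: row 1 looks much closer to row 0 than row 2 does,
# because differences in Total dwarf differences in Generation.
raw_01, raw_02 = dist(rows[0], rows[1]), dist(rows[0], rows[2])

# Standardized: each column has zero mean and unit variance.
scaled = StandardScaler().fit_transform(rows)
std_01, std_02 = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(raw_01 / raw_02, std_01 / std_02)
```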

### Train/Test

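We split the data into training and test sets before feature selection to avoid data leakage. Note that the scaler above was already fitted on the full dataset, so the test rows have influenced the scaling statistics; fitting the scaler on the training split only would close that gap. For the split to actually prevent leakage, the selection steps below must also be fitted on `X_train`, `y_train` rather than on the full `X`, `y`.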

In [None]:
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X_scaled, y, test_size=0.2, random_state=42)
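
One way to keep test rows out of the scaling statistics is to split first and let a pipeline fit the scaler on the training rows only. The following is a sketch under assumptions: the `make_classification` data and the kNN classifier are stand-ins, not the notebook's dataset or a chosen model, and `stratify` is an optional extra that is useful when the positive class is rare.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y) with a rare positive class.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# stratify keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(X_tr.shape, X_te.shape, round(test_accuracy, 3))
```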

## Feature selection

* Basic: basic selection, little justification, no innovation.
* Intermediate: relevant, well-suited choice, with optimization and justification.
* Advanced: original and optimal feature selection, well argued, innovative, and perfectly suited to the problem.


#### Main categories of feature selection:


> **Filter Methods**

Score each feature independently against the target variable (for example with a statistical test). They run during preprocessing and are fast, but they ignore interactions between features.

> **Wrapper Methods**

Try different subsets of features, train a model on each subset, and add or remove features depending on how the model performs.
Slower, but usually better, because the selection is tuned to the model that will be used.

> **Embedded Methods**

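Selection happens as part of model training itself (for example L1 regularization or tree-based feature importances), combining aspects of the two approaches above. The result is tied to the chosen model, so they suit only certain cases.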
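
The three categories can be sketched side by side with scikit-learn. This is an illustration on synthetic data, not a recommendation: the specific choices (`f_classif` as the filter score, `RFE` with logistic regression as the wrapper, random-forest importances as the embedded method) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter: score each feature against the target on its own (ANOVA F-test).
filter_sel = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# Wrapper: refit a model repeatedly, dropping the weakest feature each round.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
wrapper_sel.fit(X_demo, y_demo)

# Embedded: importances are a by-product of training the forest; features
# at or above the mean importance are kept (the default threshold here).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
embedded_sel = SelectFromModel(forest).fit(X_demo, y_demo)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support().nonzero()[0])
```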

### Feature Selection Method



Given that the dataset is not too big and **wrapper** methods are more reliable, let's use those.

Also, since we have the labels, we can use **supervised** selection methods.

We have the following options:
* Forward selection
* Backward selection
* Recursive feature elimination
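
Forward and backward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic data, the kNN estimator, and the choice of two features are assumptions for the example, not settings tuned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 6 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=1
)

knn = KNeighborsClassifier(n_neighbors=5)

# Forward: start empty, add the feature that most improves the CV score.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=3
)
forward.fit(X_demo, y_demo)

# Backward: start with all features, remove the least useful one each step.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=3
)
backward.fit(X_demo, y_demo)

print("forward :", forward.get_support())
print("backward:", backward.get_support())
```

The two directions need not agree, since each makes greedy choices from a different starting point.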

*Should I translate the categorical data into numbers? Keeping in mind that the names should be used as-is, since they are not categories.*

Even if the model could output numbers, we will still use the results to classify between discrete values.

The features are mixed: some are categorical and others are numerical.

The output variable is also categorical, since we want to classify into two classes.

https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/ recommends the following methods (note that both are filter methods, not wrappers):


*   Chi-Squared test
*   Mutual Information



Are they right? Why? How could I know which one is ideal?

For now let's just pick one of them and begin with the Chi-Squared test.
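
One pragmatic way to approach the "which one is ideal?" question is to compare the candidates empirically with cross-validation. The sketch below does that on synthetic data; the kNN classifier, `k=4`, and the `MinMaxScaler` step are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 12 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

scores = {}
for name, score_func in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    # MinMaxScaler makes the training features non-negative, which chi2 requires.
    pipe = make_pipeline(
        MinMaxScaler(), SelectKBest(score_func, k=4), KNeighborsClassifier()
    )
    # The selector is refit inside each fold, so validation rows never
    # influence which features are kept.
    scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

print(scores)
```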

### Chi-Squared Test

In [None]:
chi2 = sk.feature_selection.chi2

chi2_stats, p_values = chi2(X, y)

In [None]:
print(np.matrix([p_values, X.columns]).transpose())
print()
print(np.matrix([chi2_stats, X.columns]).transpose())

[[0.0 'Total']
 [1.7322356821731213e-124 'HP']
 [2.080439825756548e-278 'Attack']
 [6.441190768273053e-141 'Defense']
 [0.0 'Sp. Atk']
 [7.205246491320064e-250 'Sp. Def']
 [2.0899917647147156e-231 'Speed']
 [0.03985114934635663 'Generation']
 [0.013502585724311459 'Type 1_Bug']
 [0.7330961793501931 'Type 1_Dark']
 [1.1873314411477657e-09 'Type 1_Dragon']
 [0.814592900000255 'Type 1_Electric']
 [0.7350355429506508 'Type 1_Fairy']
 [0.12228912645290166 'Type 1_Fighting']
 [0.6940539602117852 'Type 1_Fire']
 [0.0021744421624509806 'Type 1_Flying']
 [0.6978611744786973 'Type 1_Ghost']
 [0.2397228150260804 'Type 1_Grass']
 [0.36503027183830394 'Type 1_Ground']
 [0.9702015901694437 'Type 1_Ice']
 [0.0274910221085786 'Type 1_Normal']
 [0.1155818999513828 'Type 1_Poison']
 [5.575974025954529e-06 'Type 1_Psychic']
 [0.814592900000255 'Type 1_Rock']
 [0.20327061696778514 'Type 1_Steel']
 [0.07776465501983229 'Type 1_Water']
 [0.6064979888020764 'Type 2_Bug']
 [0.6089935509092999 'Type 2_Dark']
 

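The problem here is that, after one-hot encoding, each type value is tested as a separate binary feature, so the importance of **Type 1** or **Type 2** as a whole is never evaluated and the per-value results are not reliable. The p-values of exactly 0.0 for **Total** and **Sp. Atk** are also suspect: chi-squared is designed for counts or frequencies, and on raw statistics the score grows with the magnitude of the values.

How can this be fixed?
Can't we use all the features?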
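
One possible fix, sketched here as an assumption rather than the method this notebook adopts, is to test the original categorical column as a whole with a contingency table instead of testing each dummy column separately. The rows below are made up for illustration, and with counts this small the p-value itself should not be trusted.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Type 1": ["Dragon", "Dragon", "Psychic", "Bug", "Bug", "Water", "Water", "Psychic"],
    "Legendary": [True, True, True, False, False, False, False, False],
})

# One row per type, one column per class: a single test for the whole feature.
table = pd.crosstab(toy["Type 1"], toy["Legendary"])
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print("degrees of freedom:", dof, "p-value:", round(p_value, 3))
```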

In [None]:
SelectKBest = sk.feature_selection.SelectKBest
chi2_selector = SelectKBest(chi2, k=10)
X_kbest = chi2_selector.fit_transform(X, y)
print(X_kbest)

print('Original number of features:', X.shape)
print('Reduced number of features:', X_kbest.shape)

[[318  45  49 ...   0   0   0]
 [405  60  62 ...   0   0   0]
 [525  80  82 ...   0   0   0]
 ...
 [600  80 110 ...   0   0   1]
 [680  80 160 ...   0   0   1]
 [600  80 110 ...   0   0   0]]
Original number of features: (800, 44)
Reduced number of features: (800, 10)


How do we know which ones were chosen?
Can we call that function on the data with the column names?
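
A fitted `SelectKBest` exposes `get_support()`, a boolean mask aligned with the input columns, which can be used to index the column names. A minimal sketch on made-up data (`X_toy`, `y_toy`, and the `noise` column are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up, non-negative toy features; "noise" is unrelated to the label.
X_toy = pd.DataFrame({
    "Attack": [10, 12, 11, 90, 95, 92],
    "Speed": [20, 22, 21, 80, 85, 83],
    "noise": [5, 6, 5, 6, 5, 6],
})
y_toy = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)

# get_support() returns a boolean mask aligned with the input columns.
chosen = X_toy.columns[selector.get_support()].tolist()
print(chosen)  # ['Attack', 'Speed']
```

Applied to the notebook's objects, the same pattern would be `X.columns[chi2_selector.get_support()]`.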

### Recursive Feature Elimination (RFE)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()  # no random_state, so the selection can vary between runs
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

selected_features = X.columns[rfe.support_]
print(selected_features)


Index(['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Generation', 'Type 1_Psychic', 'Type 1_Water'],
      dtype='object')
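
Beyond the boolean `support_` mask, a fitted `RFE` also exposes `ranking_`, which shows the order in which features were eliminated. The sketch below uses synthetic data with hypothetical column names `f0`..`f5` and fixes `random_state` so the result is repeatable; it is an illustration, not a rerun of the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in with hypothetical column names f0..f5.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Fixing random_state makes the selection repeatable from run to run.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
rfe_demo = RFE(forest, n_features_to_select=2).fit(X_demo, y_demo)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```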
