# Pokemon Classificator
Workflow d'apprentissage supervisé avec scikit-learn

In [22]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt

## Project Statement

### Contexte

Les Pokémon sont des créatures aux caractéristiques variées, certaines étant
classées comme "légendaires" en raison de leur rareté et de leur puissance. L’objectif est
d’entraîner un modèle permettant de prédire si un Pokémon est légendaire ou non à
partir de ses statistiques.


### Description des données

Le jeu de données comprend des informations sur 800
Pokémon, incluant des caractéristiques comme les points de vie (HP), l’attaque, la
défense, la vitesse, ainsi que des attributs catégoriels (type, génération, etc.).
Pistes à explorer :
* Sélection des meilleures caractéristiques pour la classification.
* Comparaison des performances des modèles (arbres de décision, kNN, réseaux de
neurones).
* Impact de la normalisation des données sur les résultats.

Lien du jeu de données : https://www.kaggle.com/abcsds/pokemon


## Preprocessing

### Data importation

In [23]:
pokemon_df = pd.read_csv("Pokemon.csv")
print("Pokemon data set size:", pokemon_df.size)
pokemon_df.head()

Pokemon data set size: 10400


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [24]:
pokemon_df["Type 1"].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [25]:
pokemon_df["Type 2"].unique()

array(['Poison', nan, 'Flying', 'Dragon', 'Ground', 'Fairy', 'Grass',
       'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark', 'Water',
       'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'], dtype=object)

### Drop or Clean Irrelevant Columns

Drop unnecesary **#** and **Legendary** columns from the explicit variable, as well as the "Name" since there are too many different ones

\# (index or ID)
Name (string, probably not useful unless you use NLP)

In [26]:
X = pokemon_df.drop(["#", "Name", "Legendary"], axis=1) # Axis=1 allows to drop columns

### Encode Categorical Features

Use One-Hot Encoding for Type 1 and Type 2.
We could use LabelEncoder, but one-hot is safer for tree-based or NN models.

In [27]:
X = pd.get_dummies(X)

These are the resulting columns:

In [28]:
X.columns

Index(['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Generation', 'Type 1_Bug', 'Type 1_Dark', 'Type 1_Dragon',
       'Type 1_Electric', 'Type 1_Fairy', 'Type 1_Fighting', 'Type 1_Fire',
       'Type 1_Flying', 'Type 1_Ghost', 'Type 1_Grass', 'Type 1_Ground',
       'Type 1_Ice', 'Type 1_Normal', 'Type 1_Poison', 'Type 1_Psychic',
       'Type 1_Rock', 'Type 1_Steel', 'Type 1_Water', 'Type 2_Bug',
       'Type 2_Dark', 'Type 2_Dragon', 'Type 2_Electric', 'Type 2_Fairy',
       'Type 2_Fighting', 'Type 2_Fire', 'Type 2_Flying', 'Type 2_Ghost',
       'Type 2_Grass', 'Type 2_Ground', 'Type 2_Ice', 'Type 2_Normal',
       'Type 2_Poison', 'Type 2_Psychic', 'Type 2_Rock', 'Type 2_Steel',
       'Type 2_Water'],
      dtype='object')

In [29]:
y = pokemon_df["Legendary"].astype(int)  # convert True/False to 1/0

### Train/Test

We do the training/testing dataset division before the feature selection, to avoid data leakage.

In [30]:
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

### Normalization

In [31]:
scaler = sk.preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)


We scale since some algorithms (like kNN or neural networks) are sensitive to scale.
Note that we scale AFTER dividing, so that there is no info leakage.

## Feature selection method

### Main categories of feature selection


> **Filter Methods**

Evaluate each feature independently with target variable. Used in preprocessing phase. Fast and limited.

> **Wrapper Methods**

They use different combination of features and compute relation between these subset features and target variable and based on conclusion addition and removal of features are done.
Slow but better.

> **Embedded Methods**

Combination. Good for only certain cases.

### Feature Selection Method to use: Forward selection


Given that the dataset is not too big , we can have the luxury of using the more reliable **wrapper** methods.

Also, since we have the answer data, we can use **supervised learning**

We have the following options:
* Fordward selection
* Backward selection
* Recursive elimination


Even though Recursive Elimination is normally better than its greedy counterparts, since it only works well if the model can rank feature importance, it could work with decision trees but not with kNN or with neural networks.

So we can work with backward elimination or fordward selection. Since we have a lot of categorical features that are not very useful by themselves, let's use Fordward Selection.



## Feature selection and model training

### k-Nearest Neighbors

We make sure to use the scaled data

In [42]:
knn = sk.neighbors.KNeighborsClassifier(n_neighbors=5)

selector = sk.feature_selection.SequentialFeatureSelector(knn, n_features_to_select='auto', direction='forward', cv=5)
selector.fit(X_train_scaled, y_train)

selected_feature_names = X.columns[selector.get_support()]
print("Selected features:\n", selected_feature_names)

Selected features:
 Index(['Total', 'Attack', 'Type 1_Bug', 'Type 1_Fairy', 'Type 1_Fighting',
       'Type 1_Flying', 'Type 1_Ghost', 'Type 1_Ice', 'Type 1_Normal',
       'Type 1_Poison', 'Type 1_Psychic', 'Type 1_Rock', 'Type 1_Water',
       'Type 2_Bug', 'Type 2_Dark', 'Type 2_Dragon', 'Type 2_Fairy',
       'Type 2_Grass', 'Type 2_Ground', 'Type 2_Normal', 'Type 2_Poison',
       'Type 2_Rock'],
      dtype='object')


In [43]:
# Evaluate model on selected features
X_train_sel = selector.transform(X_train_scaled)
X_test_sel = selector.transform(X_test_scaled)

knn.fit(X_train_sel, y_train)
y_pred_knn = knn.predict(X_test_sel)

### k-Nearest Neighbors without scaling

In [44]:
knn = sk.neighbors.KNeighborsClassifier(n_neighbors=5)

selector = sk.feature_selection.SequentialFeatureSelector(knn, n_features_to_select='auto', direction='forward', cv=5)
selector.fit(X_train, y_train)

selected_feature_names = X.columns[selector.get_support()]
print("Selected features:\n", selected_feature_names)

Selected features:
 Index(['Total', 'Generation', 'Type 1_Bug', 'Type 1_Dark', 'Type 1_Dragon',
       'Type 1_Electric', 'Type 1_Fairy', 'Type 1_Fighting', 'Type 1_Flying',
       'Type 1_Ghost', 'Type 1_Grass', 'Type 1_Ground', 'Type 1_Normal',
       'Type 1_Poison', 'Type 1_Steel', 'Type 2_Bug', 'Type 2_Dark',
       'Type 2_Dragon', 'Type 2_Electric', 'Type 2_Fairy', 'Type 2_Fighting',
       'Type 2_Fire'],
      dtype='object')


In [45]:
# Evaluate model on selected features
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

knn.fit(X_train_sel, y_train)
y_pred_knn_not_scaled = knn.predict(X_test_sel)

### Arbres de decision

Here we can use either scaled or unscaled data

In [35]:
tree = sk.tree.DecisionTreeClassifier(random_state=42, max_depth=10)

selector = sk.feature_selection.SequentialFeatureSelector(tree, n_features_to_select='auto', direction='forward', cv=5)
selector.fit(X_train, y_train)

selected_features = selector.get_support()
selected_feature_names = X.columns[selected_features]

print("Selected features:\n", selected_feature_names)

Selected features:
 Index(['Total', 'Sp. Atk', 'Type 1_Bug', 'Type 1_Dragon', 'Type 1_Electric',
       'Type 1_Fairy', 'Type 1_Fighting', 'Type 1_Flying', 'Type 1_Ghost',
       'Type 1_Grass', 'Type 1_Ground', 'Type 1_Normal', 'Type 1_Poison',
       'Type 1_Psychic', 'Type 1_Rock', 'Type 2_Bug', 'Type 2_Dragon',
       'Type 2_Electric', 'Type 2_Fairy', 'Type 2_Fire', 'Type 2_Ground',
       'Type 2_Normal'],
      dtype='object')


In [36]:
# Train and evaluate model on selected features
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

tree.fit(X_train_sel, y_train)
y_pred_tree = tree.predict(X_test_sel)

### Neural networks

In [37]:
mlp = sk.neural_network.MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, random_state=42)

selector = sk.feature_selection.SequentialFeatureSelector(mlp, n_features_to_select='auto', direction='forward', cv=5)
selector.fit(X_train_scaled, y_train)

selected_features = selector.get_support()
selected_feature_names = X.columns[selected_features]

print("Selected features (Neural Network):")
print(selected_feature_names)

Selected features (Neural Network):
Index(['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Generation', 'Type 1_Bug', 'Type 1_Dark', 'Type 1_Dragon',
       'Type 1_Electric', 'Type 1_Fairy', 'Type 1_Fighting', 'Type 1_Fire',
       'Type 1_Flying', 'Type 1_Ghost', 'Type 1_Grass', 'Type 1_Ice',
       'Type 1_Normal', 'Type 2_Grass', 'Type 2_Poison'],
      dtype='object')


In [38]:
# Train and evaluate model on selected features
X_train_sel = selector.transform(X_train_scaled)
X_test_sel = selector.transform(X_test)

mlp.fit(X_train_sel, y_train)
y_pred_neural = mlp.predict(X_test_sel)

**Notes**

It is ok to take into account some of the columns result of the one-hot encoding, and not take into account others.

Also, we do feature selection over the train dataset, as to prevent data leakage.

## Comparaison des performances des modèles

In [39]:
print("\n Classification Report - kNN:")
print(sk.metrics.classification_report(y_test, y_pred_knn))


 Classification Report - kNN:
              precision    recall  f1-score   support

           0       0.97      0.98      0.97       150
           1       0.62      0.50      0.56        10

    accuracy                           0.95       160
   macro avg       0.80      0.74      0.76       160
weighted avg       0.95      0.95      0.95       160



In [46]:
print("\n Classification Report - kNN not scaled:")
print(sk.metrics.classification_report(y_test, y_pred_knn_not_scaled))


 Classification Report - kNN not scaled:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       150
           1       0.83      1.00      0.91        10

    accuracy                           0.99       160
   macro avg       0.92      0.99      0.95       160
weighted avg       0.99      0.99      0.99       160



In [40]:
print("\n Classification Report - Trees:")
print(sk.metrics.classification_report(y_test, y_pred_tree))


 Classification Report - Trees:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       150
           1       0.64      0.70      0.67        10

    accuracy                           0.96       160
   macro avg       0.81      0.84      0.82       160
weighted avg       0.96      0.96      0.96       160



In [41]:
print("\n Classification Report - Neural networks:")
print(sk.metrics.classification_report(y_test, y_pred_neural))


 Classification Report - Neural networks:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       150
           1       0.00      0.00      0.00        10

    accuracy                           0.94       160
   macro avg       0.47      0.50      0.48       160
weighted avg       0.88      0.94      0.91       160



We can see that the unscaled kNN had the most accuracy of them all, which was not to be expected.

Also the report of the neural network suggests that it learnt to always say the Pokemon is not Legendary.

## To Do:

* Impact de la normalisation des données sur les résultats.
* Comparaison des performances des modèles: HERE
* Utilisation IA
* Report