# Pokemon

## Data content

This database includes 21 variables for each of the 721 Pokémon of the first six generations, plus the Pokémon ID and its name. These variables are briefly described below:

    Number. Pokémon ID in the Pokédex.
    Name. Name of the Pokémon.
    Type_1. Primary type.
    Type_2. Second type, in case the Pokémon has it.
    Total. Sum of all the base stats (Health Points, Attack, Defense, Special Attack, Special Defense, and Speed).
    HP. Base Health Points.
    Attack. Base Attack.
    Defense. Base Defense.
    Sp_Atk. Base Special Attack.
    Sp_Def. Base Special Defense.
    Speed. Base Speed.
    Generation. Number of the generation when the Pokémon was introduced.
    isLegendary. Boolean that indicates whether the Pokémon is Legendary or not.
    Color. Colour of the Pokémon according to the Pokédex.
    hasGender. Boolean that indicates if the Pokémon can be classified as female or male.
    Pr_Male. If the Pokémon has a gender, the probability of its being male. The probability of being female is 1 minus this value.
    Egg_Group_1. Egg Group of the Pokémon.
    Egg_Group_2. Second Egg Group of the Pokémon, in case it has two.
    hasMegaEvolution. Boolean that indicates whether the Pokémon is able to Mega-evolve or not.
    Height_m. Height of the Pokémon, in meters.
    Weight_kg. Weight of the Pokémon, in kilograms.
    Catch_Rate. Catch Rate.
    Body_Style. Body Style of the Pokémon according to the Pokédex.

## Prepare data

In [None]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../input/pokemon-datasets-for-ml/train_pokemon.csv')
test = pd.read_csv('../input/pokemon-datasets-for-ml/test_pokemon.csv')

In [None]:
train.head(3)

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
# describe(include = ['O']) will show the descriptive statistics of object data types.
train.describe(include=['O'])

In [None]:
# check for missing values
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
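The heatmap shows where values are missing, but not how many. As a rough complement, the sketch below counts missing values per column. The `toy` frame and its values are made up for illustration and only stand in for `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration)
toy = pd.DataFrame({
    'Type_1': ['Grass', 'Fire', 'Water', 'Psychic'],
    'Type_2': ['Poison', np.nan, np.nan, np.nan],
    'Pr_Male': [0.875, 0.875, 0.875, np.nan],
})

# Number and share of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing)
print(missing_share)
```

On the real data, the same two lines with `train` in place of `toy` should give the per-column totals behind the heatmap.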

Let's fill the null entries in the `Type_2` column with the string `"None"`, since these are Pokémon that have no second type.

In [None]:
def fill_type_2(cols):
    type_2 = cols.iloc[0]  # positional access; cols[0] on a label-indexed row is deprecated in recent pandas
    if pd.isnull(type_2):
        return "None"
    else:
        return type_2

In [None]:
train['Type_2'] = train[['Type_2']].apply(fill_type_2,axis=1)
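The same fill can probably be written more directly with `Series.fillna`, which avoids the row-wise `apply`. A minimal sketch on a made-up column (the `toy` frame is an assumption, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy column with two missing second types
toy = pd.DataFrame({'Type_2': ['Poison', np.nan, 'Flying', np.nan]})

# Replace every missing value with the string "None" in one vectorised call
toy['Type_2'] = toy['Type_2'].fillna('None')
print(toy['Type_2'].tolist())
```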

Because most values in `Egg_Group_2` are `NaN`, we drop that column from the dataset, as it is unlikely to help. Then, we will be in a position to start investigating our data.

In [None]:
train.drop(columns=['Egg_Group_2'], inplace=True)

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

## Relationship between Features and Legendary

In this section, we will analyse how different features relate to `isLegendary`.

In [None]:
legendary = train[train['isLegendary'] == 1]
not_legendary = train[train['isLegendary'] == 0]

print("Legendary: %i (%.1f%%)"%(len(legendary), float(len(legendary))/len(train)*100.0))
print("Not Legendary: %i (%.1f%%)"%(len(not_legendary), float(len(not_legendary))/len(train)*100.0))
print("Total: %i"%len(train))
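The counts and percentages printed above can also be obtained with `value_counts`. The sketch below uses a made-up boolean column rather than the real `train['isLegendary']`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy target column: 2 legendary out of 10 (invented values)
toy = pd.DataFrame({'isLegendary': [False, False, False, True, False,
                                    False, False, False, True, False]})

# Readable labels make the output easier to scan than True/False
labels = toy['isLegendary'].map({True: 'Legendary', False: 'Not Legendary'})
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)
```

A strongly skewed split like this one is worth keeping in mind later, because it affects how accuracy scores should be read.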

In [None]:
train.columns

## Correlating Features

Heatmap of correlation between different features:
   - Positive numbers = positive correlation, i.e. as one feature increases, the other tends to increase as well.
   - Negative numbers = negative correlation, i.e. as one feature increases, the other tends to decrease.

In our case, we focus on which features have a strong positive or negative correlation with the `isLegendary` feature.

In [None]:
plt.figure(figsize=(25,10))
train2 = train.drop(['Number','Name','hasGender','shuffle'], axis=1)
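# numeric_only=True skips the remaining text columns (pandas >= 2.0 raises an error on them otherwise)
sns.heatmap(train2.corr(numeric_only=True), vmin= -1, vmax=1, square=True, annot=True)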

Apparently, some features show almost no correlation with `isLegendary`: `hasMegaEvolution` and `Generation`.
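Reading one column off a large heatmap can be awkward. As a rough sketch, the correlations with the target can be pulled out and ranked by absolute value. The `toy` frame below is invented and only borrows a few column names from the dataset.

```python
import pandas as pd

# Toy numeric data (invented); isLegendary is encoded as 0/1
toy = pd.DataFrame({
    'Total':       [318, 405, 525, 680, 600, 309, 580, 680],
    'Catch_Rate':  [45, 45, 45, 3, 3, 255, 3, 3],
    'Generation':  [1, 1, 1, 1, 2, 2, 3, 4],
    'isLegendary': [0, 0, 0, 1, 1, 0, 1, 1],
})

# Pearson correlation of every column with the target, strongest first
corr_with_target = (
    toy.corr()['isLegendary']
       .drop('isLegendary')
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations (such as a low catch rate) next to strong positive ones.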

In [None]:
#boxplot of Attack vs. Legendary
plt.figure(figsize=(8, 4))
sns.boxplot(x='isLegendary',y='Attack',data=train, palette='rainbow')

#stripplot of Total by Type_1, coloured by Legendary
plt.figure(figsize=(15, 4))
sns.stripplot(x='Type_1',y='Total',data=train, jitter=True,hue='isLegendary',palette=['r','b'],dodge=False).set_title('Type_1 Distribution on Legendary')

#stripplot of Total by Type_2, coloured by Legendary
plt.figure(figsize=(15, 4))
sns.stripplot(x='Type_2',y='Total',data=train, jitter=True,hue='isLegendary',palette=['r','b'],dodge=False).set_title('Type_2 Distribution on Legendary')

### Type_1 vs Legendary

In [None]:
type_1 = train[['Type_1','isLegendary']].groupby(['Type_1'], as_index=False).mean().set_index('Type_1')
type_1.sort_values(by='isLegendary',ascending=False).plot(kind='bar')

It seems that Flying has the highest proportion of legendary Pokémon, followed by Dragon. There are no legendary Poison, Fighting or Bug types. The `Type_1` feature can therefore be useful for predicting legendary Pokémon.

In [None]:
type_2 = train[['Type_2','isLegendary']].groupby(['Type_2'], as_index=False).mean().set_index('Type_2')
type_2.sort_values(by='isLegendary',ascending=False).plot(kind='bar')

Like `Type_1`, `Type_2` can be useful for predicting legendary Pokémon.
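The bar charts plot the mean of `isLegendary` per type, that is, the proportion of each type that is legendary. This is not the same as the share of all legendaries that belong to a type. The sketch below shows the difference on invented data.

```python
import pandas as pd

# Toy data (invented): 2 Flying, 5 Psychic, 3 Water
toy = pd.DataFrame({
    'Type_1': ['Flying'] * 2 + ['Psychic'] * 5 + ['Water'] * 3,
    'isLegendary': [1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Proportion of each type that is legendary (what the bar charts show)
rate_within_type = toy.groupby('Type_1')['isLegendary'].mean()

# Share of all legendaries that belong to each type
share_of_legendaries = (
    toy.loc[toy['isLegendary'] == 1, 'Type_1'].value_counts(normalize=True)
)
print(rate_within_type)
print(share_of_legendaries)
```

In this toy example Flying has the highest legendary rate, while most legendaries are Psychic, so the two views can rank types differently.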

## Feature Extraction

In this section, we select the appropriate features to train our classifier. Here, we create new features based on existing features. We also convert categorical features into numeric form.

In [None]:
train_test_data = [train, test]

In [None]:
for dataset in train_test_data:
    dataset['isLegendary'] = dataset['isLegendary'].map({True: 1, False: 0}).astype(int)

In [None]:
type_1.sort_values(by='isLegendary',ascending=False)

After that, we convert the categorical `Type_1` values into numeric form.

In [None]:
type_1_mapping = {"Fire": 1, "Dragon": 2, "Electric": 3, "Fighting": 4, "Ice": 5, "Flying": 6, "Water": 7, "Ghost": 8, "Steel": 9, "None": 10, "Fairy": 11, "Psychic": 12, "Ground": 13, "Rock": 14, "Bug": 15, "Poison": 16, "Normal": 17, "Dark": 18, "Grass": 19}
for dataset in train_test_data:
    dataset['Type_1'] = dataset['Type_1'].map(type_1_mapping)
    dataset['Type_1'] = dataset['Type_1'].fillna(0)

Now, let's do the same thing for `Type_2`. Luckily, both `Type_1` and `Type_2` draw on the same set of Pokémon types, so we can copy the mapping and replace `Type_1` with `Type_2`.

In [None]:
type_2_mapping = {"Fire": 1, "Dragon": 2, "Electric": 3, "Fighting": 4, "Ice": 5, "Flying": 6, "Water": 7, "Ghost": 8, "Steel": 9, "None": 10, "Fairy": 11, "Psychic": 12, "Ground": 13, "Rock": 14, "Bug": 15, "Poison": 16, "Normal": 17, "Dark": 18, "Grass": 19}
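for dataset in train_test_data:
    # Encode a missing second type as "None" first, so train and test get the same code (10)
    dataset['Type_2'] = dataset['Type_2'].fillna('None').map(type_2_mapping)
    dataset['Type_2'] = dataset['Type_2'].fillna(0)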
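Since both columns share one vocabulary, the two cells above could be folded into a single helper. The sketch below is one possible shape for it. `encode_type`, the shortened `type_mapping` and the `toy` frame are all hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened mapping; the notebook's full mapping has 19 entries
type_mapping = {'Fire': 1, 'Dragon': 2, 'None': 10}

def encode_type(series, mapping):
    """Map type names to integer codes: missing -> 'None', unknown names -> 0."""
    return series.fillna('None').map(mapping).fillna(0).astype(int)

toy = pd.DataFrame({'Type_1': ['Fire', 'Dragon', 'Shadow'],
                    'Type_2': ['Dragon', np.nan, 'Fire']})

for col in ['Type_1', 'Type_2']:
    toy[col] = encode_type(toy[col], type_mapping)

print(toy)
```

Note that integer codes impose an arbitrary order on the types. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when that order is a concern.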

### Pr_Male

We first fill the null values of `Pr_Male` with random numbers drawn uniformly between (mean_Pr_Male - std_Pr_Male) and (mean_Pr_Male + std_Pr_Male). Then, we create a new column named `Pr_Male_Band`, which categorises `Pr_Male` into ranges.

In [None]:
for dataset in train_test_data:
    pr_male_avg = dataset['Pr_Male'].mean()
    pr_male_std = dataset['Pr_Male'].std()
    pr_male_null_count = dataset['Pr_Male'].isnull().sum()
    
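    # Draw one value per missing entry, uniformly from [mean - std, mean + std]
    pr_male_null_random_list = np.random.uniform(pr_male_avg - pr_male_std, pr_male_avg + pr_male_std, pr_male_null_count)
    dataset.loc[dataset['Pr_Male'].isnull(), 'Pr_Male'] = pr_male_null_random_list
    # Pr_Male is a probability in [0, 1], so it stays a float; casting to int would truncate almost every value to 0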
    
train['Pr_Male_Band'] = pd.cut(train['Pr_Male'], 5)

print(train[['Pr_Male_Band', 'isLegendary']].groupby(['Pr_Male_Band'], as_index=False).mean())

Now, we map `Pr_Male` according to `Pr_Male_Band`.

In [None]:
for dataset in train_test_data:
    # Compare against a copy of the original values so that band codes already written (1, 2, 3) are not matched again
    pr_male = dataset['Pr_Male'].copy()
    dataset.loc[pr_male <= 0.2, 'Pr_Male'] = 0
    dataset.loc[(pr_male > 0.2) & (pr_male <= 0.4), 'Pr_Male'] = 1
    dataset.loc[(pr_male > 0.4) & (pr_male <= 0.6), 'Pr_Male'] = 2
    dataset.loc[(pr_male > 0.6) & (pr_male <= 0.8), 'Pr_Male'] = 3
    dataset.loc[pr_male > 0.8, 'Pr_Male'] = 4
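The banding step can also be expressed with `pd.cut` and explicit edges, which returns the band codes directly and leaves no gaps between ranges. This is a minimal sketch, assuming the edges 0.2, 0.4, 0.6 and 0.8 used above and a made-up series of `Pr_Male` values.

```python
import pandas as pd

# Invented probabilities covering the whole [0, 1] range
pr_male = pd.Series([0.0, 0.125, 0.25, 0.5, 0.75, 0.875, 1.0])

# Explicit edges give the bands [0, 0.2], (0.2, 0.4], ..., (0.8, 1.0]
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# labels=False returns the integer band index instead of an interval
band = pd.cut(pr_male, bins=edges, labels=False, include_lowest=True)
print(band.tolist())
```

Because the codes are written to a new series, there is no risk of a later rule matching a code produced by an earlier one.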

### Attack & Defense

In [None]:
for dataset in train_test_data:
    attack_avg = dataset['Attack'].mean()
    attack_std = dataset['Attack'].std()
    attack_null_count = dataset['Attack'].isnull().sum()
    
    attack_null_random_list = np.random.randint(attack_avg - attack_std, attack_avg + attack_std, attack_null_count)
    dataset.loc[dataset['Attack'].isnull(), 'Attack'] = attack_null_random_list  # .loc avoids chained assignment
    dataset['Attack'] = dataset['Attack'].astype(int)
    
train['Attack_Band'] = pd.cut(train['Attack'], 5)

print(train[['Attack_Band', 'isLegendary']].groupby(['Attack_Band'], as_index=False).mean())

In [None]:
for dataset in train_test_data:
    dataset.loc[ dataset['Attack'] <= 36, 'Attack'] = 0
    dataset.loc[(dataset['Attack'] > 36) & (dataset['Attack'] <= 67), 'Attack'] = 1
    dataset.loc[(dataset['Attack'] > 67) & (dataset['Attack'] <= 98), 'Attack'] = 2
    dataset.loc[(dataset['Attack'] > 98) & (dataset['Attack'] <= 129), 'Attack'] = 3
    dataset.loc[ dataset['Attack'] >= 129, 'Attack'] = 4

In [None]:
for dataset in train_test_data:
    defense_avg = dataset['Defense'].mean()
    defense_std = dataset['Defense'].std()
    defense_null_count = dataset['Defense'].isnull().sum()
    
    defense_null_random_list = np.random.randint(defense_avg - defense_std, defense_avg + defense_std, defense_null_count)
    dataset.loc[dataset['Defense'].isnull(), 'Defense'] = defense_null_random_list  # .loc avoids chained assignment
    dataset['Defense'] = dataset['Defense'].astype(int)
    
train['Defense_Band'] = pd.cut(train['Defense'], 5)

print(train[['Defense_Band', 'isLegendary']].groupby(['Defense_Band'], as_index=False).mean())

In [None]:
for dataset in train_test_data:
    dataset.loc[ dataset['Defense'] <= 50, 'Defense'] = 0
    dataset.loc[(dataset['Defense'] > 50) & (dataset['Defense'] <= 95), 'Defense'] = 1
    dataset.loc[(dataset['Defense'] > 95) & (dataset['Defense'] <= 140), 'Defense'] = 2
    # 185 continues the equal-width (45-point) bands; adjust it to match the Defense_Band output above
    dataset.loc[(dataset['Defense'] > 140) & (dataset['Defense'] <= 185), 'Defense'] = 3
    dataset.loc[ dataset['Defense'] > 185, 'Defense'] = 4

### Catch Rate %

In [None]:
for dataset in train_test_data:
    cr_avg = dataset['Catch_Rate'].mean()
    cr_std = dataset['Catch_Rate'].std()
    cr_null_count = dataset['Catch_Rate'].isnull().sum()
    
    cr_null_random_list = np.random.randint(cr_avg - cr_std, cr_avg + cr_std, cr_null_count)
    dataset.loc[dataset['Catch_Rate'].isnull(), 'Catch_Rate'] = cr_null_random_list  # .loc avoids chained assignment
    dataset['Catch_Rate'] = dataset['Catch_Rate'].astype(int)
    
train['Catch_Rate_Band'] = pd.cut(train['Catch_Rate'], 5)

print(train[['Catch_Rate_Band', 'isLegendary']].groupby(['Catch_Rate_Band'], as_index=False).mean())

In [None]:
for dataset in train_test_data:
    dataset.loc[ dataset['Catch_Rate'] <= 53, 'Catch_Rate'] = 0
    dataset.loc[(dataset['Catch_Rate'] > 53) & (dataset['Catch_Rate'] <= 104), 'Catch_Rate'] = 1
    dataset.loc[(dataset['Catch_Rate'] > 104) & (dataset['Catch_Rate'] <= 154), 'Catch_Rate'] = 2
    dataset.loc[(dataset['Catch_Rate'] > 154) & (dataset['Catch_Rate'] <= 204), 'Catch_Rate'] = 3
    dataset.loc[ dataset['Catch_Rate'] > 204, 'Catch_Rate'] = 4

### Feature Selection

We drop unnecessary columns/features and keep only the useful ones for our experiment.

In [None]:
train.columns

In [None]:
train_drop = ['Number', 'Name', 'Total', 'HP', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation','Color', 'hasGender', 'Egg_Group_1', 'hasMegaEvolution','Height_m', 'Weight_kg', 'Body_Style', 'shuffle','Pr_Male_Band', 'Attack_Band', 'Defense_Band', 'Catch_Rate_Band']
train = train.drop(train_drop, axis=1)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
test_drop = ['Number', 'Name', 'Total', 'HP', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation',
       'Color', 'hasGender', 'Egg_Group_1', 'Egg_Group_2', 'isLegendary',
       'hasMegaEvolution', 'Height_m', 'Weight_kg', 'Body_Style',
       'shuffle']
test = test.drop(test_drop, axis=1)

In [None]:
test.head()

We are done with Feature Selection/Engineering. Now, we are ready to train a classifier with our feature set.

## Classification & Accuracy
Define training and testing set.

In [None]:
X_train = train.drop('isLegendary', axis=1)
y_train = train['isLegendary']
X_test = test.copy()

X_train.shape, y_train.shape, X_test.shape

There are many classification algorithms. Among them, we will apply the following to predict whether a Pokémon is legendary:

   - Logistic Regression
   - Support Vector Machines (SVC)
   - *k*-Nearest Neighbor (KNN)
   - Decision Tree
   - Random Forest
   - Naive Bayes (GaussianNB)
   - Perceptron
   - Stochastic Gradient Descent (SGD)

Here is the training and testing procedure:

   - First, we train these classifiers with our training data.
   - After that, using the trained classifier, we predict the legendary status of the test data.
   - Finally, we calculate the accuracy score (in percentage) of the trained classifier.

**Please note**: the accuracy score is computed on our training dataset, so it measures how well each model fits the data it was trained on, not how well it generalises to unseen Pokémon.
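One common way to estimate performance on unseen data is cross-validation. The sketch below is not part of the original workflow. It uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and `max_iter=1000` is an assumed setting chosen only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for (X_train, y_train); not the Pokémon data
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(max_iter=1000)

# 5-fold stratified cross-validation: each fold is scored on rows the model did not see
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

# Accuracy on the same rows used for fitting, for comparison
train_score = clf.fit(X, y).score(X, y)

print('Training accuracy: %.3f' % train_score)
print('Cross-validated accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))
```

With the real data, passing `X_train` and `y_train` in place of `X` and `y` should give a comparable held-out estimate for each classifier.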

In [None]:
# Importing Classifier Modules
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier

### Logistic Regression

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred_log_reg = clf.predict(X_test)
acc_log_reg = round( clf.score(X_train, y_train) * 100, 2)
print(str(acc_log_reg) + ' percent')

### Support Vector Machine (SVM)

In [None]:
clf = SVC()
clf.fit(X_train, y_train)
y_pred_svc = clf.predict(X_test)
acc_svc = round(clf.score(X_train, y_train) * 100, 2)
print (acc_svc)

### *k*-Nearest Neighbors

In [None]:
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train, y_train)
y_pred_knn = clf.predict(X_test)
acc_knn = round(clf.score(X_train, y_train) * 100, 2)
print (acc_knn)

### Decision Tree

In [None]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred_decision_tree = clf.predict(X_test)
acc_decision_tree = round(clf.score(X_train, y_train) * 100, 2)
print (acc_decision_tree)

### Random Forest

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_random_forest = clf.predict(X_test)
acc_random_forest = round(clf.score(X_train, y_train) * 100, 2)
print (acc_random_forest)

### Gaussian Naive Bayes

In [None]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred_gnb = clf.predict(X_test)
acc_gnb = round(clf.score(X_train, y_train) * 100, 2)
print (acc_gnb)

### Perceptron

In [None]:
clf = Perceptron(max_iter=5, tol=None)
clf.fit(X_train, y_train)
y_pred_perceptron = clf.predict(X_test)
acc_perceptron = round(clf.score(X_train, y_train) * 100, 2)
print (acc_perceptron)

### Stochastic Gradient Descent (SGD)

In [None]:
clf = SGDClassifier(max_iter=5, tol=None)
clf.fit(X_train, y_train)
y_pred_sgd = clf.predict(X_test)
acc_sgd = round(clf.score(X_train, y_train) * 100, 2)
print (acc_sgd)

## Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class (this is the convention followed by scikit-learn's `confusion_matrix`; some texts use the transpose). The name stems from the fact that it makes it easy to see if the system is confusing two classes (*i.e.* commonly mislabelling one as another).

In predictive analytics, a table of confusion (sometimes also called a confusion matrix), is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. This allows more detailed analysis than mere proportion of correct classifications (accuracy). Accuracy is not a reliable metric for the real performance of a classifier, because it will yield misleading results if the data set is unbalanced (that is, when the numbers of observations in different classes vary greatly). For example, if there were 95 cats and only 5 dogs in the data set, a particular classifier might classify all the observations as cats. The overall accuracy would be 95%, but in more detail the classifier would have a 100% recognition rate for the cat class but a 0% recognition rate for the dog class.
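The cats-and-dogs example can be reproduced in a few lines. In this sketch the labels 0 for cat and 1 for dog are an arbitrary choice, and the "classifier" is simply a constant prediction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 cats (label 0) and 5 dogs (label 1), as in the example above
y_true = np.array([0] * 95 + [1] * 5)

# A classifier that always answers "cat"
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
cat_recall = recall_score(y_true, y_pred, pos_label=0)  # recognition rate for cats
dog_recall = recall_score(y_true, y_pred, pos_label=1)  # recognition rate for dogs

print('Accuracy: %.2f' % accuracy)
print('Cat recall: %.2f, dog recall: %.2f' % (cat_recall, dog_recall))
```

The same effect applies to the Pokémon data if legendary Pokémon are rare: a model that never predicts legendary would still score high on accuracy.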

In our Pokémon case:

   - **True Positive**: The classifier predicted legendary and the Pokémon was actually a legendary.

   - **True Negative**: The classifier predicted not legendary and the Pokémon was not a legendary.

   - **False Positive**: The classifier predicted legendary but the Pokémon was not a legendary.

   - **False Negative**: The classifier predicted not legendary but the Pokémon was actually a legendary.
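The four counts listed above can be read straight from scikit-learn's `confusion_matrix`. The labels below are made up (1 = legendary, 0 = not legendary) and only illustrate the unpacking order.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = legendary, 0 = not legendary
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# With labels ordered [0, 1], rows are actual classes and columns are predictions,
# so ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the predicted legendaries, how many were right
recall = tp / (tp + fn)     # of the actual legendaries, how many were found

print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))
print('Precision: %.2f, recall: %.2f' % (precision, recall))
```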
    
In the example code below, we plot a confusion matrix for the predictions of the Random Forest classifier on our training dataset. This shows how many entries are correctly and incorrectly predicted by our classifier.

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_random_forest_training_set = clf.predict(X_train)
acc_random_forest = round(clf.score(X_train, y_train) * 100, 2)
print ("Accuracy: %i %% \n"%acc_random_forest)

class_names = ['Legendary', 'Not Legendary']

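# Compute confusion matrix; labels=[1, 0] puts Legendary first so the rows and columns match the class names used below
cnf_matrix = confusion_matrix(y_train, y_pred_random_forest_training_set, labels=[1, 0])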
np.set_printoptions(precision=2)

print ('Confusion Matrix in Numbers')
print (cnf_matrix)
print ('')

cnf_matrix_percent = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]

print ('Confusion Matrix in Percentage')
print (cnf_matrix_percent)
print ('')

true_class_names = ['True Legendary', 'True Not Legendary']
predicted_class_names = ['Predicted Legendary', 'Predicted Not Legendary']

df_cnf_matrix = pd.DataFrame(cnf_matrix, 
                             index = true_class_names,
                             columns = predicted_class_names)

df_cnf_matrix_percent = pd.DataFrame(cnf_matrix_percent, 
                                     index = true_class_names,
                                     columns = predicted_class_names)

plt.figure(figsize = (15,5))

plt.subplot(121)
sns.heatmap(df_cnf_matrix, annot=True, fmt='d')

plt.subplot(122)
sns.heatmap(df_cnf_matrix_percent, annot=True)

## Comparing Models

Let's compare the accuracy score of all the classifier models used above.

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 
              'KNN', 'Decision Tree', 'Random Forest', 'Naive Bayes', 
              'Perceptron', 'Stochastic Gradient Descent'],
    
    'Score': [acc_log_reg, acc_svc, 
              acc_knn,  acc_decision_tree, acc_random_forest, acc_gnb, 
              acc_perceptron, acc_sgd]
    })

models.sort_values(by='Score', ascending=False)

From the above table, we can see that the Decision Tree and Random Forest classifiers have the highest accuracy score on the training data.

Between the two, we choose the **Random Forest Classifier**, as averaging many trees makes it less prone to overfitting than a single Decision Tree.
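A rough way to check this choice is to compare training accuracy with cross-validated accuracy for both models. The sketch below uses synthetic data with some label noise, not the Pokémon data, so the printed numbers say nothing about the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% flipped labels; a stand-in, not the Pokémon data
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)

candidates = [
    ('Decision Tree', DecisionTreeClassifier(random_state=0)),
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=0)),
]

results = {}
for name, model in candidates:
    train_acc = model.fit(X, y).score(X, y)             # accuracy on the fitting data
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # accuracy on held-out folds
    results[name] = (train_acc, cv_acc)
    print('%s: train %.3f, cross-validated %.3f' % (name, train_acc, cv_acc))
```

A large gap between the two numbers for a model is a sign of overfitting. Running the same loop on `X_train` and `y_train` would show whether that gap is smaller for the forest on this dataset.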