<h2 style="font-family: 'Constantia'; font-size: 26px; color:#006400;"> 1 🔎 Setting The Scene 🔎
<p style="color:  #93E9BE; font-family: 'Constantia', cursive;"> 🌳Preliminary exploration using Isolation Forest and Hypothesis Testing🌳</p></h2>

In [1]:
# Installing and importing dependencies

# !pip install ydata-profiling
!pip install missingno

from ydata_profiling import ProfileReport



In [2]:
# Basic libraries
import pymysql
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno # great library for visualising the distribution of nulls!
import statsmodels.api as sm

# Hypothesis Testing
import math
from scipy import stats
from scipy.stats import ttest_ind

# Machine Learning
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor, IsolationForest
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, BayesianRidge, ElasticNet, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score, classification_report
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Other
import re
import warnings
import plotly.express as px
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


In [3]:
## 0 Goal

# <span style="color:blue">Why should we be interested in</span> <span style="color:red; font-family:'Segoe Script', cursive;">legendary</span> <span style="color:green; font-family:'Comic Sans MS', cursive;">pokemon</span>? 🌟🔥🌈

<img src="https://media.tenor.com/u-qWcV0GwbkAAAAC/mew-pokemon.gif" alt="Legendary Pokemon Mew">

As most Pokemon enthusiasts would know, legendary Pokemon are rare creatures with exceptional power, abilities and base stats. They often play a key role in the Pokemon storyline and mythology. Acquiring one has long been an attractive challenge for players, as legendary Pokemon can have a major impact on battle strategies and outcomes. Legendary Pokemon also look pretty cool :P

In [None]:
#Getting the Data

In [None]:
data = pd.read_csv('pokemon.csv')
data.head()

In [None]:
data.info() # fortunately, the column names have already been standardized

In [None]:
data.shape

In [None]:
data.describe().round(1)

In [None]:
print(data.loc[1]) #sample row

For the sake of logical flow, this part comes first, ahead of all other parts of the project, because it provides the motivation for undertaking the project in the first place. For that reason, it uses a partially cleaned version of the same dataset.

**Are legendary pokemon truly rarer and more anomalous than non-legendary pokemon?** To find out, we can combine **anomaly detection (Isolation Forest)** with a **two-sample hypothesis test** on the resulting anomaly scores. To determine the true **direction** of the anomaly (do legendary pokemon have exceptionally **better** or exceptionally **crappier** stats than non-legendary pokemon?), we would have to look at our **Tableau visualisations!**

In [None]:
rarity_data = pd.read_csv('pokemon_partially_cleaned.csv')
rarity_data.head()


<h2 style="color:green; font-family:Comic Sans MS;">🌳 Isolation Forest Model 🌳</h2>


### The isolation forest model is an unsupervised learning model that isolates anomalies or rare occurrences within the data by building a set of random decision trees, each of which repeatedly splits the data in two. Anomalies can be isolated with fewer splits, so the average path length to their leaf is shorter than for ordinary points. In scikit-learn, `score_samples` returns the negated anomaly score from the original paper, so scores are always negative and lower scores mean more anomalous points. More information [here](https://www.analyticsvidhya.com/blog/2021/07/anomaly-detection-using-isolation-forest-a-complete-guide/) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html).
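To make the idea concrete, here is a minimal sketch on synthetic data (not the Pokémon dataset): 200 ordinary points plus one obvious outlier. The point coordinates, the seed, and the variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 "ordinary" points clustered around the origin, plus one obvious outlier
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([normal_points, outlier])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = model.score_samples(X)  # always negative; lower = more anomalous

outlier_score = scores[-1]
typical_score = np.median(scores[:-1])
print(f"outlier score: {outlier_score:.3f}, median inlier score: {typical_score:.3f}")
```

The outlier should come out with a clearly lower score than the typical inlier, which is the same reading applied to `AnomalyScore` in the rest of this notebook.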

In [None]:
# Selecting and scaling the variables before training the IsolationForest model

# Step 1: Drop identifier and text columns ('is_legendary' stays in the frame but is not used as a feature)
predictor_variables = rarity_data.drop(columns=['name', 'japanese_name', 'abilities', 'base_egg_steps'])

# Step 2: Drop the 'type1_' and 'type2_' variables
type1_columns = [col for col in predictor_variables.columns if col.startswith('type1_')]
type2_columns = [col for col in predictor_variables.columns if col.startswith('type2_')]
predictor_variables = predictor_variables.drop(columns=type1_columns + type2_columns)

# Step 3: Keep only the six base stats as features
selected_rarity_features = ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']
rarity_features_df = predictor_variables[selected_rarity_features].copy()

# Step 4: Standardize the features (optional for a tree-based model such as Isolation Forest)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(rarity_features_df)

# Keep the original index so the scores line up with rarity_data row by row
features_df = pd.DataFrame(scaled_features, columns=selected_rarity_features, index=rarity_features_df.index)

In [None]:
#Baseline model

In [None]:
# Create an instance of the IsolationForest class
# contamination is the expected proportion of anomalies; it only sets the threshold used by
# predict()/decision_function() and does not change the values returned by score_samples()
isolation_forest = IsolationForest(contamination=0.001)

# Fit the model on the features
isolation_forest.fit(features_df)

# Compute the anomaly score for each Pokémon (scores are always negative; lower means more anomalous)
anomaly_scores = isolation_forest.score_samples(features_df)

# Add the anomaly scores to the original DataFrame
rarity_data['AnomalyScore'] = anomaly_scores
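A quick sketch of what `contamination` does and does not change, using synthetic data as a stand-in for the six scaled base stats. The array shape, seeds, and the two contamination values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))  # synthetic stand-in for six scaled base stats

strict = IsolationForest(contamination=0.01, random_state=0).fit(X)
loose = IsolationForest(contamination=0.10, random_state=0).fit(X)

# Same seed, same trees, same raw scores: contamination plays no part in score_samples
same_scores = np.allclose(strict.score_samples(X), loose.score_samples(X))

# predict() returns -1 for flagged anomalies and 1 for inliers
n_flagged_strict = int((strict.predict(X) == -1).sum())
n_flagged_loose = int((loose.predict(X) == -1).sum())

# decision_function is score_samples shifted by the fitted offset_
shift_ok = np.allclose(strict.decision_function(X), strict.score_samples(X) - strict.offset_)
print(same_scores, n_flagged_strict, n_flagged_loose, shift_ok)
```

Since this notebook ranks Pokémon by `score_samples`, the ranking itself should not depend on the contamination value; only the number of rows that `predict()` would flag does.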

In [None]:
# Visualisation of anomaly scores - the lower the anomaly score, the rarer the pokemon

In [None]:
sns.histplot(rarity_data['AnomalyScore'], bins=20, kde=True)
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.show()

rare_pokemon = rarity_data.nsmallest(10, 'AnomalyScore') #lists the top ten most anomalous/rarest/most unique pokemon in the dataset
print(rare_pokemon[['name', 'AnomalyScore']])


In [None]:
# 4 of the ten most anomalous pokemon are legendary pokemon! But we'll have to tune the model to know for sure

In [None]:

legendary_pokemon = rare_pokemon[rare_pokemon['is_legendary'] == 1]

# Print the names and AnomalyScores of legendary Pokémon
print(legendary_pokemon[['name', 'AnomalyScore']])

In [None]:
#Hyperparameter tuning via GridSearchCV

In [None]:
isolation_forest = IsolationForest()

# Defining the model's hyperparameters and their values so that they can be tuned
param_grid = {
    'n_estimators': [50, 100, 150],    # Number of trees in the forest
    'max_samples': [0.1, 0.2, 0.3],    # Proportion of samples to draw for each tree
    'contamination': [0.01, 0.05, 0.1] # Expected proportion of anomalies (rare Pokémon)
}

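# Create a GridSearchCV object with the Isolation Forest model and parameter grid
# Caveat: 'neg_mean_squared_error' compares predictions against true labels, but no labels are
# passed to fit() below, so this scorer cannot rank the candidates meaningfully for an
# unsupervised model; treat the selected hyperparameters as provisional
grid_search = GridSearchCV(isolation_forest, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object on the features (no target is supplied)
grid_search.fit(features_df)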

# Get the best hyperparameters from the search
best_hyperparameters = grid_search.best_params_

# Create a new Isolation Forest model with the best hyperparameters
best_isolation_forest = IsolationForest(**best_hyperparameters)

# Fit the best model on the features
best_isolation_forest.fit(features_df)

# Predict the anomaly scores for each Pokémon using the best model
anomaly_scores = best_isolation_forest.score_samples(features_df)

# Add the anomaly scores to the original DataFrame
rarity_data['AnomalyScore'] = anomaly_scores

# Sort the DataFrame by 'AnomalyScore' in ascending order
rarity_data_sorted = rarity_data.sort_values(by='AnomalyScore', ascending=True)

# Get the top ten rarest Pokémon
top_rarest_pokemon = rarity_data_sorted.head(10)

# Print the best hyperparameters
print("Best Hyperparameters:")
print(best_hyperparameters)

# Print the top ten rarest Pokémon
print("\nTop Ten Rarest Pokémon:")
print(top_rarest_pokemon[['name', 'AnomalyScore']])
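One possible way to give the grid search a meaningful score is to use a known label as a proxy and rank candidates by ROC AUC. The sketch below is an assumption, not the method used above: it runs on synthetic data, and `anomaly_auc` is a hypothetical helper name. Note that tuning against `is_legendary` and then testing on the same labels would make the later hypothesis test partly circular.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(2)

# Synthetic stand-in: 270 ordinary rows and 30 shifted "rare" rows (label 1)
X = np.vstack([rng.normal(0, 1, size=(270, 6)), rng.normal(3, 1, size=(30, 6))])
y = np.array([0] * 270 + [1] * 30)

def anomaly_auc(estimator, X_val, y_val):
    # Hypothetical scorer: negate score_samples so that higher means "more anomalous"
    return roc_auc_score(y_val, -estimator.score_samples(X_val))

param_grid = {'n_estimators': [50, 100], 'max_samples': [0.2, 0.5]}

# Stratified folds so every validation fold contains both classes
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(IsolationForest(random_state=0), param_grid, scoring=anomaly_auc, cv=cv)
search.fit(X, y)  # y is ignored when fitting the forest and used only by the scorer
print(search.best_params_, round(search.best_score_, 3))
```

`contamination` is left out of this grid on purpose, because it does not affect `score_samples` and therefore cannot change the AUC.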


In [None]:
# Plot the distribution of anomaly scores
sns.histplot(rarity_data['AnomalyScore'], bins=20, kde=True)
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.show()


In [None]:
#After tuning the hyperparameters of the Isolation Forest model, only two legendary pokemon remain in the top ten. How about the 70 legendary pokemon as a whole?

In [None]:
legendary_pokemon1 = top_rarest_pokemon[top_rarest_pokemon['is_legendary'] == 1]
print(legendary_pokemon1[['name', 'AnomalyScore']])

In [None]:
# From the filtered data below, we can see that all 70 legendary pokemon are in the top 500 in terms of anomaly scores. But we'll need more concrete proof to conclude that legendary pokemon are indeed rarer and more anomalous than non-legendary pokemon.

In [None]:
# Filtering the legendary Pokémon
legendary_data = rarity_data[rarity_data['is_legendary'] == 1]

# Filtering the non-legendary Pokémon
non_legendary_data = rarity_data[rarity_data['is_legendary'] == 0]

# Concatenating both legendary and non-legendary Pokémon data
all_data = pd.concat([legendary_data, non_legendary_data])

# Adding a new 'Rank' column (rank 1 = lowest score = most anomalous)
all_data = all_data.sort_values(by='AnomalyScore', ascending=True)
all_data['Rank'] = range(1, len(all_data) + 1)

# Let's print the legendary Pokémon with their ranks!
legendary_data_filtered = all_data[all_data['is_legendary'] == 1]
print(legendary_data_filtered[['name', 'AnomalyScore', 'Rank']])
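Rather than reading the printed ranks by eye, the "all legendaries are within the top N" check can be computed directly. This is a small sketch on a hypothetical toy frame with the same three columns; the names, scores, and the cut-off of 3 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same three columns used above
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'AnomalyScore': [-0.70, -0.45, -0.62, -0.40, -0.55],
    'is_legendary': [1, 0, 1, 0, 0],
})

# rank(method='first') gives 1 to the lowest (most anomalous) score and breaks ties by order
toy['Rank'] = toy['AnomalyScore'].rank(method='first').astype(int)

legendary_ranks = toy.loc[toy['is_legendary'] == 1, 'Rank']
worst_legendary_rank = int(legendary_ranks.max())       # rank of the least anomalous legendary
share_in_top_3 = float((legendary_ranks <= 3).mean())   # fraction of legendaries within the top 3
print(worst_legendary_rank, share_in_top_3)
```

On the real data, the same two lines applied to `all_data` with a cut-off of 500 would confirm or refute the claim in the comment above.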


**Now, we can proceed with hypothesis testing using the anomaly scores that have been just generated.** 

<h3 style="color: blue; font-family: 'Courier New';">❤️ Hypothesis testing with a one-tailed two-sample t-test ❤️</h3>


#### **(H<sub>0</sub>): Legendary pokemon are not rarer and more anomalous than non-legendary pokemon (their mean anomaly score is not lower).**

#### **(H<sub>1</sub>): Legendary pokemon are rarer and more anomalous than non-legendary pokemon (their mean anomaly score is lower).**




In [None]:
# Extract the AnomalyScores for legendary and non-legendary Pokémon
legendary_scores = legendary_data['AnomalyScore']
non_legendary_scores = non_legendary_data['AnomalyScore']

# Perform a one-tailed Welch's two-sample t-test (unequal variances)
# alternative='less' tests whether the legendary mean score is lower, matching H1 (requires SciPy >= 1.6)
t_stat, p_value = ttest_ind(legendary_scores, non_legendary_scores, equal_var=False, alternative='less')

alpha = 0.01  # Choosing a conservative alpha
if p_value < alpha:
    print("\033[1;31mThe mean AnomalyScore of legendary Pokémon is significantly lower than that of non-legendary Pokémon.\033[0m Thus, we reject the null hypothesis: Legendary Pokémon are rarer and more anomalous than non-legendary Pokémon.")
else:
    print("\033[1;32mThe mean AnomalyScore of legendary Pokémon is not significantly lower than that of non-legendary Pokémon.\033[0m Thus, we cannot \033[1;32mreject the null hypothesis\033[0m; there is \033[1;32minsufficient evidence that Legendary Pokémon are rarer and more anomalous than non-legendary Pokémon.\033[0m")
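Anomaly scores are not guaranteed to be normally distributed, so a rank-based test and an effect size can serve as a sanity check alongside the t-test. The sketch below uses synthetic scores, not the real results; the group means, spreads, and sizes are assumptions picked only to make the example run.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

# Synthetic stand-ins for the two groups of anomaly scores (not the real data)
legendary_demo = rng.normal(loc=-0.55, scale=0.05, size=70)
non_legendary_demo = rng.normal(loc=-0.45, scale=0.05, size=730)

# One-sided Mann-Whitney U test: are the legendary scores stochastically lower?
u_stat, p_mw = mannwhitneyu(legendary_demo, non_legendary_demo, alternative='less')

# Cohen's d with a pooled standard deviation, as a rough effect size
n1, n2 = len(legendary_demo), len(non_legendary_demo)
pooled_var = ((n1 - 1) * legendary_demo.var(ddof=1) + (n2 - 1) * non_legendary_demo.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (legendary_demo.mean() - non_legendary_demo.mean()) / np.sqrt(pooled_var)
print(f"p = {p_mw:.2e}, d = {cohens_d:.2f}")
```

A negative `cohens_d` means the first group has the lower (more anomalous) mean score; swapping in `legendary_scores` and `non_legendary_scores` would apply the same check to the real data.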


<span style="font-family: Arial, sans-serif; font-size: 24px;">To confirm the direction of anomalousness</span> <span style="font-family: 'Chewy', cursive; color: red; font-size: 24px;">(are legendary Pokémon anomalously better or crappier)</span><span style="font-family: 'Chewy', cursive; color: linear-gradient(to right, violet, indigo, blue, green, yellow, orange, red); font-size: 24px;">, please proceed to my Tableau visualizations for the rest of the EDA! 🌈</span>
