# Generate Features

This notebook reduces the data to the features we want to use and eliminates all NA values by imputing them column by column.

## Setup

In [1]:
import pandas as pd

In [2]:
IN_PATH = "data/1-scrape_missing_values.csv"
pokedex = pd.read_csv(IN_PATH)
pokedex.head()

Unnamed: 0.1,Unnamed: 0,pokedex_number,name,german_name,japanese_name,generation,status,species,type_number,type_1,...,sprite_red_mean,sprite_green_mean,sprite_blue_mean,sprite_brightness_mean,sprite_red_sd,sprite_green_sd,sprite_blue_sd,sprite_brightness_sd,sprite_overflow_vertical,sprite_overflow_horizontal
0,0,1,Bulbasaur,Bisasam,フシギダネ (Fushigidane),1,Normal,Seed Pokémon,2,Grass,...,0.301195,0.461089,0.281119,0.347801,0.203707,0.296569,0.196318,0.227527,0.0,0.0
1,1,2,Ivysaur,Bisaknosp,フシギソウ (Fushigisou),1,Normal,Seed Pokémon,2,Grass,...,0.315388,0.399557,0.341093,0.352013,0.25271,0.266331,0.230437,0.224049,0.0,0.0
2,2,3,Venusaur,Bisaflor,フシギバナ (Fushigibana),1,Normal,Seed Pokémon,2,Grass,...,0.460529,0.504329,0.458905,0.474588,0.311674,0.27918,0.251883,0.248132,0.0,0.0
3,3,3,Mega Venusaur,Bisaflor,フシギバナ (Fushigibana),1,Normal,Seed Pokémon,2,Grass,...,0.438993,0.478647,0.459373,0.459004,0.258814,0.216374,0.20465,0.190111,0.0,0.0
4,4,4,Charmander,Glumanda,ヒトカゲ (Hitokage),1,Normal,Lizard Pokémon,1,Fire,...,0.547675,0.330607,0.185229,0.354504,0.422389,0.278011,0.178439,0.276396,0.0,0.0


## Features

We keep every feature except those with too many distinct values (e.g. names) and those that give away the type too easily (e.g. egg type).

In [3]:
features = pokedex.copy().drop(columns=[
    "Unnamed: 0",
    "pokedex_number",
    "name", "german_name", "japanese_name",
    "species",
    "ability_1", "ability_2", "ability_hidden",
    "egg_type_1", "egg_type_2",
])

In [4]:
def fill_explicit_none_type(df: pd.DataFrame) -> pd.DataFrame:
    df["type_2"] = df["type_2"].fillna("None")
    return df

features = fill_explicit_none_type(features)

[Eternatus Eternamax](https://bulbapedia.bulbagarden.net/wiki/Eternatus_(Pok%C3%A9mon)) is the only Pokémon with unknown weight. It is 5 times as tall as its normal form, which weighs 950 kg. If we assume that Eternatus Eternamax is about 5 times as large in every dimension and has the same density, its volume, and thus its mass, is 5³ = 125 times as large, which places it at 118,750 kg.

In [5]:
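def impute_eternamax_weight(df: pd.DataFrame, weight_kg: float = (5**3) * 950) -> pd.DataFrame:
    # `name` has already been dropped from `df`, so the row is located through
    # the original `pokedex` frame; this relies on both frames sharing an index.
    df.loc[
        pokedex["name"] == "Eternatus Eternamax",
        "weight_kg"
    ] = weight_kg
    return df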

features = impute_eternamax_weight(features)
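
The default weight above follows from cube-law scaling. As a quick check of the arithmetic, here is a minimal sketch; `scaled_mass` is a hypothetical helper that is not part of the notebook, and it assumes uniform scaling at constant density.

```python
def scaled_mass(base_mass_kg: float, length_factor: float) -> float:
    """Mass after scaling every linear dimension by `length_factor`.

    Assumes constant density, so mass grows with volume (factor cubed).
    """
    return base_mass_kg * length_factor ** 3


# Eternatus weighs 950 kg; Eternamax is 5 times as tall.
eternamax_kg = scaled_mass(950, 5)
print(eternamax_kg)
```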

Based on values from [Bulbapedia](https://bulbapedia.bulbagarden.net/wiki/Experience#Relation_to_level), we can associate each growth rate with the experience it takes the Pokémon to reach maximum level (100).

In [6]:
MAX_EXP = {
    "Erratic": 600_000,
    "Fast": 800_000,
    "Medium Fast": 1_000_000,
    "Medium Slow": 1_059_860,
    "Slow": 1_250_000,
    "Fluctuating": 1_640_000,
}

def quantify_growth_rate(df: pd.DataFrame) -> pd.DataFrame:
    columns = df.columns.to_list()
    if "growth_rate" not in columns:
        return df
    i = columns.index("growth_rate")
    df["maximum_experience"] = df["growth_rate"].apply(lambda r: MAX_EXP[r])
    return df[columns[:i] + ["maximum_experience"] + columns[i+1:]]

features = quantify_growth_rate(features)
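
To see what the column replacement does, here is a small sketch on a toy frame (the values are made up for illustration). It uses `Series.map` with the dictionary instead of `apply` with a lambda; the two agree for known growth rates, but `map` yields NaN for an unknown rate where the lambda would raise a `KeyError`.

```python
import pandas as pd

# Subset of the MAX_EXP table, enough for the toy data.
MAX_EXP = {"Fast": 800_000, "Slow": 1_250_000}

toy = pd.DataFrame({
    "hp": [45, 100],
    "growth_rate": ["Fast", "Slow"],
    "speed": [45, 30],
})

# Same column surgery as quantify_growth_rate: the new column takes the
# position of `growth_rate`, which is dropped.
columns = toy.columns.to_list()
i = columns.index("growth_rate")
toy["maximum_experience"] = toy["growth_rate"].map(MAX_EXP)
result = toy[columns[:i] + ["maximum_experience"] + columns[i + 1:]]

print(result.columns.to_list())
```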

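`percentage_male` is missing for Pokémon without a gender. We record that in a separate `has_gender` flag and fill the missing proportions with the neutral value 0.5.

In [7]: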
def transform_gender(df: pd.DataFrame) -> pd.DataFrame:
    columns = df.columns.to_list()
    if "percentage_male" not in columns:
        return df
    i = columns.index("percentage_male")
    df["has_gender"] = ~pd.isna(df["percentage_male"])
    df["proportion_male"] = (df["percentage_male"] / 100.).fillna(1./2.)
    return df[columns[:i] + ["has_gender", "proportion_male"] + columns[i+1:]]

features = transform_gender(features)
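
The following sketch applies the same two expressions to a toy column (values chosen for illustration). It shows why the flag is needed: after filling, a missing value and a true 50% split would otherwise be indistinguishable, while an all-female species (0%) still counts as having a gender.

```python
import pandas as pd

# 87.5% male, all-female, and no gender information at all.
toy = pd.DataFrame({"percentage_male": [87.5, 0.0, float("nan")]})

has_gender = toy["percentage_male"].notna()
proportion_male = (toy["percentage_male"] / 100.0).fillna(0.5)

print(has_gender.tolist())
print(proportion_male.tolist())
```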

The `against_*` column names are ambiguous: they hold the damage multiplier a Pokémon takes from each attacking type. We rename them to `damage_from_*`, which says so directly.

In [8]:
def clarify_against_naming(df: pd.DataFrame) -> pd.DataFrame:
    types: set[str] = set(df["type_1"])
    return df.rename(columns={
        f"against_{'fight' if t == 'Fighting' else t.lower()}":
        f"damage_from_{t.lower()}"
        for t in types
    })

features = clarify_against_naming(features)
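
For illustration, this is the rename mapping the comprehension builds for a small, hand-picked subset of types. Note the special case: the source column for Fighting is abbreviated to `against_fight`, while the new name spells the type out.

```python
# Illustrative subset; the notebook derives the full set from `type_1`.
types = ["Grass", "Fighting", "Fire"]

rename_map = {
    f"against_{'fight' if t == 'Fighting' else t.lower()}": f"damage_from_{t.lower()}"
    for t in types
}
print(rename_map)
```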

In the original data, [Dewgong](https://bulbapedia.bulbagarden.net/wiki/Dewgong_(Pok%C3%A9mon)) has 125 as its `damage_from_ice`, which is incorrect (should be 0.125).

In [9]:
def fix_dewgong_damage_from_ice(df: pd.DataFrame) -> pd.DataFrame:
    # 125 is not a valid damage multiplier; the intended value is 0.125.
    df.loc[df["damage_from_ice"] == 125, "damage_from_ice"] = 0.125
    return df

features = fix_dewgong_damage_from_ice(features)
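
An error like this one can be found systematically by checking every `damage_from_*` column against a set of plausible multipliers. The sketch below is one way to do that; `find_implausible_multipliers` and the `PLAUSIBLE` set are assumptions made for illustration, not part of the original notebook, and the set may need adjusting for the real data.

```python
import pandas as pd

# Assumed set of plausible multipliers. 0.125 is included because the
# corrected value above shows that the data uses it.
PLAUSIBLE = {0.0, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0}


def find_implausible_multipliers(df: pd.DataFrame) -> pd.DataFrame:
    """List each damage_from_* cell whose value is not in PLAUSIBLE."""
    damage = df.filter(regex=r"^damage_from_")
    # Long format: one line per (row, column) pair, keeping the row label.
    long = damage.melt(ignore_index=False, var_name="column", value_name="value")
    bad = long[~long["value"].isin(PLAUSIBLE)]
    return bad.rename_axis("row").reset_index()


toy = pd.DataFrame({
    "damage_from_ice": [0.5, 125.0],
    "damage_from_fire": [2.0, 0.5],
})
suspects = find_implausible_multipliers(toy)
print(suspects)
```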

Convert all columns to either `float` (continuous), `bool` (binary), or `str` (categorical). In practice, this just means converting all `int` columns to `float`.

In [10]:
def cleanup_types(df: pd.DataFrame) -> pd.DataFrame:
    int_columns = df.columns[df.dtypes == "int64"]
    df[int_columns] = df[int_columns].astype(float)
    df = df.convert_dtypes(
        convert_string=True,
        convert_integer=False,
        convert_boolean=True,
        convert_floating=True,
    )
    return df

features = cleanup_types(features)
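
As a small illustration, here are the same two steps on a toy frame with one column of each kind (values made up). Because `convert_integer=False` leaves integer columns untouched, the explicit cast to `float` has to come first.

```python
import pandas as pd

toy = pd.DataFrame({
    "generation": [1, 2],         # integer column
    "height_m": [0.7, 0.6],       # float column
    "type_1": ["Grass", "Fire"],  # text column
    "has_gender": [True, False],  # boolean column
})

# Step 1: cast integer columns to float.
int_columns = toy.columns[toy.dtypes == "int64"]
toy[int_columns] = toy[int_columns].astype(float)

# Step 2: switch to pandas' nullable extension dtypes.
converted = toy.convert_dtypes(
    convert_string=True,
    convert_integer=False,
    convert_boolean=True,
    convert_floating=True,
)
print(converted.dtypes)
```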

Sanity check:

In [11]:
assert not features.isna().any().any()
print(features.dtypes)
features.head()

generation                        Float64
status                             string
type_number                       Float64
type_1                             string
type_2                             string
height_m                          Float64
weight_kg                         Float64
abilities_number                  Float64
total_points                      Float64
hp                                Float64
attack                            Float64
defense                           Float64
sp_attack                         Float64
sp_defense                        Float64
speed                             Float64
catch_rate                        Float64
base_friendship                   Float64
base_experience                   Float64
maximum_experience                Float64
egg_type_number                   Float64
has_gender                        boolean
proportion_male                   Float64
egg_cycles                        Float64
damage_from_normal                

Unnamed: 0,generation,status,type_number,type_1,type_2,height_m,weight_kg,abilities_number,total_points,hp,...,sprite_red_mean,sprite_green_mean,sprite_blue_mean,sprite_brightness_mean,sprite_red_sd,sprite_green_sd,sprite_blue_sd,sprite_brightness_sd,sprite_overflow_vertical,sprite_overflow_horizontal
0,1.0,Normal,2.0,Grass,Poison,0.7,6.9,2.0,318.0,45.0,...,0.301195,0.461089,0.281119,0.347801,0.203707,0.296569,0.196318,0.227527,0.0,0.0
1,1.0,Normal,2.0,Grass,Poison,1.0,13.0,2.0,405.0,60.0,...,0.315388,0.399557,0.341093,0.352013,0.25271,0.266331,0.230437,0.224049,0.0,0.0
2,1.0,Normal,2.0,Grass,Poison,2.0,100.0,2.0,525.0,80.0,...,0.460529,0.504329,0.458905,0.474588,0.311674,0.27918,0.251883,0.248132,0.0,0.0
3,1.0,Normal,2.0,Grass,Poison,2.4,155.5,1.0,625.0,80.0,...,0.438993,0.478647,0.459373,0.459004,0.258814,0.216374,0.20465,0.190111,0.0,0.0
4,1.0,Normal,1.0,Fire,,0.6,8.5,2.0,309.0,39.0,...,0.547675,0.330607,0.185229,0.354504,0.422389,0.278011,0.178439,0.276396,0.0,0.0


## Save Results

In [12]:
OUT_PATH = "data/2-generate_features.csv"
features.to_csv(OUT_PATH, index=False)
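
One caveat when reading this file back: depending on the pandas version, `read_csv` may treat the literal string `"None"` as a missing value by default, which would undo the explicit `"None"` written into `type_2`. The sketch below shows one way to guard against that, using an in-memory buffer and toy data instead of the real file. It assumes that only empty fields should count as missing.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "type_1": ["Fire", "Grass"],
    "type_2": ["None", "Poison"],
})

# Write to an in-memory buffer in place of OUT_PATH.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Treat only empty fields as missing so the literal "None" survives.
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[""])
print(reloaded["type_2"].tolist())
```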