## Scrape Data

Most columns in the original dataset turn out to be inconsistent with [pokemondb.net](https://pokemondb.net/pokedex). This notebook rescrapes the data from that website with our own, more accurate scraper, which also retrieves values that are missing from the original dataset.

We will also scrape every Pokémon's sprite and compute 13 numerical features.

### Imports

In [1]:
from collections import defaultdict

import pandas as pd

from scrape import Variant, all_variants
from scrape.util.dict import FuzzyDict

### Original Data

Our scraper handles every column except `generation` and `status`. Those two are already correct in the original dataset, so we take them from there.

In [2]:
IN_PATH = "../data/raw.csv"
raw = pd.read_csv(IN_PATH)

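# keep only the columns we need from the original dataset
raw = raw[["pokedex_number", "name", "generation", "status"]]

# pokedex_number -> {name -> (generation, status)}
info_by_pokedex = defaultdict(dict)
for pokedex_number, name, generation, status in raw.itertuples(index=False):
    info_by_pokedex[pokedex_number][name] = (generation, status)

# wrap each inner dict in a FuzzyDict for name lookups
info_by_pokedex = {n: FuzzyDict(d) for n, d in info_by_pokedex.items()}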
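
`FuzzyDict` comes from our own `scrape` package and its implementation is not shown here. As a rough illustration of the idea, the sketch below builds a hypothetical stand-in, `FuzzyDictSketch`, on top of the standard library's `difflib.get_close_matches`. The class name, the matching strategy, and the example names are all assumptions, not the actual implementation.

```python
from difflib import get_close_matches


class FuzzyDictSketch(dict):
    """Hypothetical stand-in for scrape.util.dict.FuzzyDict (not the real class)."""

    def get(self, key, default=None):
        # An exact hit needs no fuzzy matching.
        if key in self:
            return self[key]
        # Otherwise fall back to the most similar stored key, if there is one.
        matches = get_close_matches(key, list(self.keys()), n=1, cutoff=0.0)
        return self[matches[0]] if matches else default


# Illustrative entries only; the real names come from the original dataset.
variants = FuzzyDictSketch({
    "Venusaur": (1, "Normal"),
    "Mega Venusaur": (1, "Normal"),
})
generation, status = variants.get("Venusaur: Mega Venusaur")
print(generation, status)
```

With a lookup like this, a scraped name that has no exact counterpart in the original dataset still resolves to the most similar stored name.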

### Scraping

In [3]:
variant_dicts = []
for variant in all_variants():
    # skip new Hisuian variants because they're Generation 9
    if variant.variant_name is not None and "Hisuian" in variant.variant_name:
        continue
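    # only keep variants whose Pokédex number appears in the original dataset
    if variant.pokedex_number in info_by_pokedex:
        # look up generation and status by the variant's full name
        generation, status = info_by_pokedex[variant.pokedex_number].get(variant.full_name)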
        variant_dict = variant.as_dict()
        variant_dict["generation"] = generation
        variant_dict["status"] = status
        variant_dicts.append(variant_dict)

0001	Bulbasaur
0002	Ivysaur
0003	Venusaur
0004	Venusaur: Mega Venusaur
0005	Charmander
0006	Charmeleon
0007	Charizard
0008	Charizard: Mega Charizard X
0009	Charizard: Mega Charizard Y
0010	Squirtle
0011	Wartortle
0012	Blastoise
0013	Blastoise: Mega Blastoise
0014	Caterpie
0015	Metapod
0016	Butterfree
0017	Weedle
0018	Kakuna
0019	Beedrill
0020	Beedrill: Mega Beedrill
0021	Pidgey
0022	Pidgeotto
0023	Pidgeot
0024	Pidgeot: Mega Pidgeot
0025	Rattata
0026	Rattata: Alolan Rattata
0027	Raticate
0028	Raticate: Alolan Raticate
0029	Spearow
0030	Fearow
0031	Ekans
0032	Arbok
0033	Pikachu
0034	Pikachu: Partner Pikachu
0035	Raichu
0036	Raichu: Alolan Raichu
0037	Sandshrew
0038	Sandshrew: Alolan Sandshrew
0039	Sandslash
0040	Sandslash: Alolan Sandslash
0041	Nidoran♀
0042	Nidorina
0043	Nidoqueen
0044	Nidoran♂
0045	Nidorino
0046	Nidoking
0047	Clefairy
0048	Clefable
0049	Vulpix
0050	Vulpix: Alolan Vulpix
0051	Ninetales
0052	Ninetales: Alolan Ninetales
0053	Jigglypuff
0054	Wigglytuff
0055	Zubat
0056	Golb

In [4]:
# variant_dicts is a list of dicts (one per variant), which the constructor accepts directly
data = pd.DataFrame(variant_dicts)
data = data[["generation", "status", *Variant.PROPERTIES]]
data.head()

| | generation | status | type_number | type_1 | type_2 | height_m | weight_kg | abilities_number | total_points | hp | ... | sprite_red_mean | sprite_green_mean | sprite_blue_mean | sprite_brightness_mean | sprite_red_sd | sprite_green_sd | sprite_blue_sd | sprite_brightness_sd | sprite_overflow_vertical | sprite_overflow_horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Normal | 2 | Grass | Poison | 0.7 | 6.9 | 2 | 318 | 45 | ... | 0.301195 | 0.461089 | 0.281119 | 0.347801 | 0.203707 | 0.296569 | 0.196318 | 0.227527 | 0.0 | 0.0 |
| 1 | 1 | Normal | 2 | Grass | Poison | 1.0 | 13.0 | 2 | 405 | 60 | ... | 0.315388 | 0.399557 | 0.341093 | 0.352013 | 0.25271 | 0.266331 | 0.230437 | 0.224049 | 0.0 | 0.0 |
| 2 | 1 | Normal | 2 | Grass | Poison | 2.0 | 100.0 | 2 | 525 | 80 | ... | 0.460529 | 0.504329 | 0.458905 | 0.474588 | 0.311674 | 0.27918 | 0.251883 | 0.248132 | 0.0 | 0.0 |
| 3 | 1 | Normal | 2 | Grass | Poison | 2.4 | 155.5 | 1 | 625 | 80 | ... | 0.438993 | 0.478647 | 0.459373 | 0.459004 | 0.258814 | 0.216374 | 0.20465 | 0.190111 | 0.0 | 0.0 |
| 4 | 1 | Normal | 1 | Fire | | 0.6 | 8.5 | 2 | 309 | 39 | ... | 0.547675 | 0.330607 | 0.185229 | 0.354504 | 0.422389 | 0.278011 | 0.178439 | 0.276396 | 0.0 | 0.0 |


### Missing Values

There are only two `None` values in the entire dataset, and both belong to [Eternatus Eternamax](https://bulbapedia.bulbagarden.net/wiki/Eternatus_(Pok%C3%A9mon)): its weight and its catch rate.
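
One way to locate missing values like these is to count them per column and then list the affected rows. The sketch below does this on a small stand-in frame; the two rows and the catch rate of the normal form are illustrative placeholders, not the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped frame (values are illustrative assumptions).
toy = pd.DataFrame({
    "name": ["Eternatus", "Eternatus Eternamax"],
    "weight_kg": [950.0, None],
    "catch_rate": [255.0, None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing["name"].tolist())
```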

Eternatus Eternamax is 5 times as tall as its normal form. If we assume it is about 5 times as large in every other dimension as well, its volume is 5³ = 125 times as large, and so is its mass at the same density. The normal form weighs 950 kg, which puts Eternatus Eternamax at 125 × 950 kg = 118,750 kg.

Moreover, Eternatus Eternamax cannot be caught, so we can assume its catch rate is zero.
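
The weight estimate above is a cube-law scaling, which can be checked in a few lines. The constant names are ours; the 950 kg figure is the normal-form weight implied by the 118,750 kg estimate.

```python
# Eternamax is 5x the height of the normal form (from the text above).
HEIGHT_RATIO = 5
# Normal-form weight in kg, implied by the estimate: 118750 / 125.
NORMAL_WEIGHT_KG = 950.0

# Scaling every dimension by the same factor scales volume by its cube.
volume_ratio = HEIGHT_RATIO ** 3

# At equal density, mass scales with volume.
eternamax_weight_kg = NORMAL_WEIGHT_KG * volume_ratio
print(eternamax_weight_kg)
```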

In [5]:
data["weight_kg"] = data["weight_kg"].fillna(118750.0)
data["catch_rate"] = data["catch_rate"].fillna(0.)

### Data Types

Convert all columns to either `float` (continuous), `bool` (binary), or `str` (categorical). In practice, this just means converting all `int` columns to `float`.

In [6]:
int_columns = data.columns[data.dtypes == "int64"]
data[int_columns] = data[int_columns].astype(float)
data = data.convert_dtypes(
    convert_string=True,
    convert_integer=False,
    convert_boolean=True,
    convert_floating=True,
)
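
To see what these two steps do to each kind of column, the sketch below applies them to a toy frame. The frame and its values are made up for illustration; only the column names mirror the real data.

```python
import pandas as pd

# Toy frame with one integer, one float and one string column.
toy = pd.DataFrame({
    "generation": [1, 1, 2],
    "height_m": [0.7, 1.0, 2.0],
    "type_1": ["Grass", "Fire", "Water"],
})

# Step 1: cast every int64 column to float.
int_columns = toy.columns[toy.dtypes == "int64"]
toy[int_columns] = toy[int_columns].astype(float)

# Step 2: let pandas pick nullable dtypes, but never convert back to integers.
toy = toy.convert_dtypes(
    convert_string=True,
    convert_integer=False,
    convert_boolean=True,
    convert_floating=True,
)
print(toy.dtypes)
```

Passing `convert_integer=False` matters here: without it, `convert_dtypes` would prefer an integer dtype for whole-number floats such as `generation`.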

### Save Results

In [7]:
OUT_PATH = "../data/scraped.csv"
data.to_csv(OUT_PATH, index=False)
data.head()

| | generation | status | type_number | type_1 | type_2 | height_m | weight_kg | abilities_number | total_points | hp | ... | sprite_red_mean | sprite_green_mean | sprite_blue_mean | sprite_brightness_mean | sprite_red_sd | sprite_green_sd | sprite_blue_sd | sprite_brightness_sd | sprite_overflow_vertical | sprite_overflow_horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Normal | 2.0 | Grass | Poison | 0.7 | 6.9 | 2.0 | 318.0 | 45.0 | ... | 0.301195 | 0.461089 | 0.281119 | 0.347801 | 0.203707 | 0.296569 | 0.196318 | 0.227527 | 0.0 | 0.0 |
| 1 | 1.0 | Normal | 2.0 | Grass | Poison | 1.0 | 13.0 | 2.0 | 405.0 | 60.0 | ... | 0.315388 | 0.399557 | 0.341093 | 0.352013 | 0.25271 | 0.266331 | 0.230437 | 0.224049 | 0.0 | 0.0 |
| 2 | 1.0 | Normal | 2.0 | Grass | Poison | 2.0 | 100.0 | 2.0 | 525.0 | 80.0 | ... | 0.460529 | 0.504329 | 0.458905 | 0.474588 | 0.311674 | 0.27918 | 0.251883 | 0.248132 | 0.0 | 0.0 |
| 3 | 1.0 | Normal | 2.0 | Grass | Poison | 2.4 | 155.5 | 1.0 | 625.0 | 80.0 | ... | 0.438993 | 0.478647 | 0.459373 | 0.459004 | 0.258814 | 0.216374 | 0.20465 | 0.190111 | 0.0 | 0.0 |
| 4 | 1.0 | Normal | 1.0 | Fire | | 0.6 | 8.5 | 2.0 | 309.0 | 39.0 | ... | 0.547675 | 0.330607 | 0.185229 | 0.354504 | 0.422389 | 0.278011 | 0.178439 | 0.276396 | 0.0 | 0.0 |
