# Preprocessing Pokemon Data

#### Jonna Chen, Amy Huang, Kate Liang, Stephanie Zhang

In [1]:
import pandas as pd

In [2]:
df2 = pd.read_csv("All_Pokemon.csv")
df2.head()
df2.columns

Index(['Number', 'Name', 'Type 1', 'Type 2', 'Abilities', 'HP', 'Att', 'Def',
       'Spa', 'Spd', 'Spe', 'BST', 'Mean', 'Standard Deviation', 'Generation',
       'Experience type', 'Experience to level 100', 'Final Evolution',
       'Catch Rate', 'Legendary', 'Mega Evolution', 'Alolan Form',
       'Galarian Form', 'Against Normal', 'Against Fire', 'Against Water',
       'Against Electric', 'Against Grass', 'Against Ice', 'Against Fighting',
       'Against Poison', 'Against Ground', 'Against Flying', 'Against Psychic',
       'Against Bug', 'Against Rock', 'Against Ghost', 'Against Dragon',
       'Against Dark', 'Against Steel', 'Against Fairy', 'Height', 'Weight',
       'BMI'],
      dtype='object')

In [3]:
df2.shape

(1032, 44)

Of the three candidate datasets, df1 appears to contain the least information, while df2 and df3 contain roughly the same, so we stick with df2 (All_Pokemon.csv).

In [4]:
df2.head()

Unnamed: 0,Number,Name,Type 1,Type 2,Abilities,HP,Att,Def,Spa,Spd,...,Against Bug,Against Rock,Against Ghost,Against Dragon,Against Dark,Against Steel,Against Fairy,Height,Weight,BMI
0,1,Bulbasaur,Grass,Poison,"['Chlorophyll', 'Overgrow']",45,49,49,65,65,...,1.0,1.0,1.0,1.0,1.0,1.0,0.5,0.7,6.9,14.1
1,2,Ivysaur,Grass,Poison,"['Chlorophyll', 'Overgrow']",60,62,63,80,80,...,1.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0,13.0,13.0
2,3,Venusaur,Grass,Poison,"['Chlorophyll', 'Overgrow']",80,82,83,100,100,...,1.0,1.0,1.0,1.0,1.0,1.0,0.5,2.0,100.0,25.0
3,3,Mega Venusaur,Grass,Poison,['Thick Fat'],80,100,123,122,120,...,1.0,1.0,1.0,1.0,1.0,1.0,0.5,2.4,155.5,27.0
4,4,Charmander,Fire,,"['Blaze', 'Solar Power']",39,52,43,60,50,...,0.5,2.0,1.0,1.0,1.0,0.5,0.5,0.6,8.5,23.6


#### Content Description
- Number: Number of the Pokemon in the National Pokedex
- Name: Name of the Pokemon
- Type 1: Primary Type of the Pokemon
- Type 2: Secondary Type of the Pokemon
- Abilities: A list that contains the abilities of the Pokemon
- HP: Base Hit Points stat of the Pokemon
- Att: Base Attack stat of the Pokemon
- Def: Base Defense stat of the Pokemon
- Spa: Base Special Attack stat of the Pokemon
- Spd: Base Special Defense stat of the Pokemon
- Spe: Base Speed stat of the Pokemon
- BST: Sum of all the base stats
- Mean: Mean of the base stats
- Standard Deviation: Standard deviation of the base stats
- Generation: The Generation in which the Pokemon was introduced
- Experience type: The Experience Group to which the Pokemon belongs
- Experience to level 100: Amount of experience the Pokemon needs to level up to 100
- Final Evolution: Denotes if the Pokemon is a Final Evolution
- Catch Rate: Catch Rate of the Pokemon
- Legendary: Denotes if the Pokemon is Legendary
- Mega Evolution: Denotes if the Pokemon is a Mega Evolution
- Alolan Form: Denotes if the Pokemon is an Alolan Form
- Galarian Form: Denotes if the Pokemon is a Galarian Form
- Against { }: Damage multiplier applied when an attack of the given type hits the Pokemon (e.g. 0.5 means half damage, 2.0 means double damage)
- Height: The height of the Pokemon in metres
- Weight: The weight of the Pokemon in kilograms
- BMI: The Body mass index of the Pokemon (Weight / Height^2)
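
The derived columns (BST, Mean, BMI) can be recomputed from the raw columns as a sanity check. The sketch below does this for three rows copied from the notebook's previews; the names `sample`, `stat_cols`, and `bmi_matches` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Three rows copied by hand from the notebook's dataframe previews.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "HP": [45, 60, 80],
    "Att": [49, 62, 82],
    "Def": [49, 63, 83],
    "Spa": [65, 80, 100],
    "Spd": [65, 80, 100],
    "Spe": [45, 60, 80],
    "Height": [0.7, 1.0, 2.0],
    "Weight": [6.9, 13.0, 100.0],
    "BMI": [14.1, 13.0, 25.0],
})

stat_cols = ["HP", "Att", "Def", "Spa", "Spd", "Spe"]

# BST is the sum of the six base stats; Mean is their average.
bst = sample[stat_cols].sum(axis=1)
mean_stat = sample[stat_cols].mean(axis=1)

# BMI = Weight / Height^2, compared against the stored value after rounding.
bmi = (sample["Weight"] / sample["Height"] ** 2).round(1)
bmi_matches = (bmi - sample["BMI"]).abs() < 0.05

print(bst.tolist(), mean_stat.tolist(), bmi_matches.all())
```

Running the same comparison on the full dataframe would flag any rows where the stored derived values disagree with the raw stats.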

In [5]:
#what casting has to be done?
df2.dtypes

Number                       int64
Name                        object
Type 1                      object
Type 2                      object
Abilities                   object
HP                           int64
Att                          int64
Def                          int64
Spa                          int64
Spd                          int64
Spe                          int64
BST                          int64
Mean                       float64
Standard Deviation         float64
Generation                 float64
Experience type             object
Experience to level 100      int64
Final Evolution            float64
Catch Rate                   int64
Legendary                  float64
Mega Evolution             float64
Alolan Form                float64
Galarian Form              float64
Against Normal             float64
Against Fire               float64
Against Water              float64
Against Electric           float64
Against Grass              float64
Against Ice         
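
Several columns that read as flags or counts (Generation, Legendary, Mega Evolution, and so on) are stored as float64. One possible cast is sketched below on a toy frame; it assumes the flag columns only hold 0.0 and 1.0, which should be checked on the real data before casting.

```python
import pandas as pd

# Toy frame mimicking the float64 columns reported by df2.dtypes.
toy = pd.DataFrame({
    "Generation": [1.0, 1.0, 8.0],
    "Legendary": [0.0, 0.0, 1.0],
    "Mega Evolution": [0.0, 1.0, 0.0],
})

# Assumption: flag columns contain only 0.0 / 1.0 and have no missing values.
toy = toy.astype({
    "Generation": "int64",
    "Legendary": bool,
    "Mega Evolution": bool,
})

print(toy.dtypes)
```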

In [6]:
# check for nulls: not every Pokemon has a second type
nan_count = df2.isna().sum()
print(nan_count)

Number                       0
Name                         0
Type 1                       0
Type 2                     484
Abilities                    0
HP                           0
Att                          0
Def                          0
Spa                          0
Spd                          0
Spe                          0
BST                          0
Mean                         0
Standard Deviation           0
Generation                   0
Experience type              0
Experience to level 100      0
Final Evolution              0
Catch Rate                   0
Legendary                    0
Mega Evolution               0
Alolan Form                  0
Galarian Form                0
Against Normal               0
Against Fire                 0
Against Water                0
Against Electric             0
Against Grass                0
Against Ice                  0
Against Fighting             0
Against Poison               0
Against Ground               0
Against 
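
Type 2 is the only column with missing values shown here, and they mean "no second type" rather than unknown data. If an explicit label is preferred over NaN, one option is sketched below; the placeholder string "None" and the helper column "Has Type 2" are hypothetical choices, not something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 1": ["Grass", "Fire"],
    "Type 2": ["Poison", None],
})

# Record whether a second type exists before overwriting the missing values.
toy["Has Type 2"] = toy["Type 2"].notna()

# Replace NaN with an explicit placeholder label (hypothetical choice).
toy["Type 2"] = toy["Type 2"].fillna("None")

print(toy)
```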

#### Converting Abilities to a Dummy Variable

In [7]:
type(df2.Abilities.iloc[0]) #right now it's a string

str

In [8]:
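# split the string into a list on commas; the quotes around each ability name are kept
df2['Abilities'] = df2['Abilities'].str.strip('[]').str.split(r'\s*,\s*', regex=True)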

In [9]:
df2['Abilities']

0       ['Chlorophyll', 'Overgrow']
1       ['Chlorophyll', 'Overgrow']
2       ['Chlorophyll', 'Overgrow']
3                     ['Thick Fat']
4          ['Blaze', 'Solar Power']
                   ...             
1027             ['Chilling Neigh']
1028                 ['Grim Neigh']
1029                    ['Unnerve']
1030                     ['As One']
1031                     ['As One']
Name: Abilities, Length: 1032, dtype: object
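
Because each Abilities value is a string that looks like a Python list literal, an alternative to stripping and splitting is to parse it with `ast.literal_eval`, which also removes the quotes around each name. This is a sketch of that alternative on two values copied from the output above, not what the notebook does.

```python
import ast

import pandas as pd

raw = pd.Series(
    ["['Chlorophyll', 'Overgrow']", "['Thick Fat']"],
    name="Abilities",
)

# literal_eval safely evaluates the list literal into a real Python list of str.
parsed = raw.apply(ast.literal_eval)

print(parsed.iloc[0])
```

With this approach the dummy columns would be named like Ability_Overgrow rather than Ability_'Overgrow'.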

In [10]:
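# one dummy column per ability; explode() repeats the row index, so sum back per original row
df = (
    df2['Abilities'].explode().str.get_dummies().groupby(level=0).sum().add_prefix('Ability_')
)

# replace the Abilities column with its dummy columns
df = df2.drop(columns='Abilities').join(df)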

In [11]:
df.shape

(1032, 307)
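
The shape is consistent with the encoding: the original 44 columns minus Abilities leaves 43, so 307 - 43 = 264 ability columns were added. The toy example below, with made-up frame names `toy`, `dummies`, and `encoded`, shows the same explode-then-dummies pattern on three rows so the mechanics are easier to follow.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Mega Venusaur", "Charmander"],
    "Abilities": [
        ["Chlorophyll", "Overgrow"],
        ["Thick Fat"],
        ["Blaze", "Solar Power"],
    ],
})

dummies = (
    toy["Abilities"]
    .explode()              # one row per (Pokemon, ability), index repeated
    .str.get_dummies()      # one 0/1 column per distinct ability
    .groupby(level=0).sum() # collapse back to one row per Pokemon
    .add_prefix("Ability_")
)

encoded = toy.drop(columns="Abilities").join(dummies)
print(encoded.shape)
```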

In [12]:
df

Unnamed: 0,Number,Name,Type 1,Type 2,HP,Att,Def,Spa,Spd,Spe,...,Ability_'Water Absorb',Ability_'Water Bubble',Ability_'Water Compaction',Ability_'Water Veil',Ability_'Weak Armor',Ability_'White Smoke',Ability_'Wimp Out',Ability_'Wonder Guard',Ability_'Wonder Skin',Ability_'Zen Mode'
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,...,0,0,0,0,0,0,0,0,0,0
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,...,0,0,0,0,0,0,0,0,0,0
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,...,0,0,0,0,0,0,0,0,0,0
3,3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,...,0,0,0,0,0,0,0,0,0,0
4,4,Charmander,Fire,,39,52,43,60,50,65,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1027,896,Glastrier,Ice,,100,145,130,65,110,30,...,0,0,0,0,0,0,0,0,0,0
1028,897,Spectrier,Ghost,,100,65,60,145,80,130,...,0,0,0,0,0,0,0,0,0,0
1029,898,Calyrex,Psychic,Grass,100,80,80,80,80,80,...,0,0,0,0,0,0,0,0,0,0
1030,898,Calyrex Ice Rider,Psychic,Ice,100,165,150,85,130,50,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# experience type is an object type
df['Experience type'].value_counts()

Medium Fast    426
Slow           254
Medium Slow    245
Fast            67
Erratic         26
Fluctuating     14
Name: Experience type, dtype: int64
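
Since Experience type has only six distinct values, it could be stored as a pandas categorical or one-hot encoded like the abilities. Both options are sketched below on a short made-up series; the "Exp" prefix is an arbitrary choice for illustration.

```python
import pandas as pd

exp = pd.Series(
    ["Medium Fast", "Slow", "Medium Slow", "Fast", "Medium Fast"],
    name="Experience type",
)

# Option 1: keep a single column but store it as a categorical dtype.
exp_cat = exp.astype("category")

# Option 2: one 0/1 column per experience group.
exp_dummies = pd.get_dummies(exp, prefix="Exp")

print(list(exp_cat.cat.categories))
print(list(exp_dummies.columns))
```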

#### Look at Generation (we might drop the older generations)

In [14]:
df['Generation'].value_counts()

5.0    163
1.0    151
3.0    138
6.0    133
7.0    116
4.0    116
8.0    115
2.0    100
Name: Generation, dtype: int64
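
If older generations were dropped, a boolean filter on Generation would be enough. The sketch below uses a hypothetical cutoff, `MIN_GENERATION = 5`; the notebook does not settle on a threshold or apply this filter.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Glastrier", "Calyrex"],
    "Generation": [1.0, 8.0, 8.0],
})

MIN_GENERATION = 5  # hypothetical cutoff, chosen only for illustration

# Keep rows introduced in MIN_GENERATION or later and renumber the index.
recent = toy[toy["Generation"] >= MIN_GENERATION].reset_index(drop=True)

print(recent["Name"].tolist())
```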

In [17]:
#export processed dataframe

df.to_json('data.json', orient = 'records')
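
With orient='records', each row becomes one JSON object and missing values such as an absent Type 2 are written as null. A small sketch on a toy frame shows what a consumer of the file would see; it serializes to a string instead of writing data.json.

```python
import json

import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 2": ["Poison", None],
    "HP": [45, 39],
})

# Without a path argument, to_json returns the JSON text instead of writing a file.
json_text = toy.to_json(orient="records")

# Parse it back the way a downstream consumer would.
records = json.loads(json_text)
print(records[1])
```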

In [21]:
df.columns

Index(['Number', 'Name', 'Type 1', 'Type 2', 'HP', 'Att', 'Def', 'Spa', 'Spd',
       'Spe',
       ...
       'Ability_'Water Absorb'', 'Ability_'Water Bubble'',
       'Ability_'Water Compaction'', 'Ability_'Water Veil'',
       'Ability_'Weak Armor'', 'Ability_'White Smoke'', 'Ability_'Wimp Out'',
       'Ability_'Wonder Guard'', 'Ability_'Wonder Skin'',
       'Ability_'Zen Mode''],
      dtype='object', length=307)

In [25]:
df["Type 1"].value_counts().to_frame().index

Index(['Water', 'Normal', 'Grass', 'Bug', 'Psychic', 'Fire', 'Rock',
       'Electric', 'Dark', 'Dragon', 'Fighting', 'Ghost', 'Ground', 'Poison',
       'Ice', 'Steel', 'Fairy', 'Flying'],
      dtype='object')