# Pokemon Classification

In this project, we will use `requests` and `BeautifulSoup` to pull information off of the Pokemon Database.
Once we have the information, the next step will be to build a classification model to see how accurately
we can classify different pokemon by their types.

## Web Scraping

The site that will be used to pull information is [pokemondb.net](https://pokemondb.net/).

<img src='https://img.pokemondb.net/news/2018/design-v4.jpg'>


We will be pulling information from two parts of the website and will group it into two categories.

The Pokemon observed will be from Generations 1 - 8.

1. _General_
    * This is the most basic information on each pokemon:
        - Name
        - Types
        - Stats
2. _Specific_
    * The information in this section extends the general information by including the following:
        * Base stats
        * Min stats
        * Max stats
        * Type Defense
        * Species
        * Height
        * Weight
        * Abilities
        * Training data
        * Breeding Data

**Import files**

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from bs4 import BeautifulSoup

In [2]:
page = requests.get("https://pokemondb.net/pokedex/all")

In [3]:
page.status_code

200
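Checking `status_code` by hand works fine for a one-off request. As a minimal sketch, the same check can be folded into a small helper. The name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook. `raise_for_status()` raises an exception for 4xx/5xx responses instead of letting a failed download pass silently.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (requires network access):
# html = fetch_html("https://pokemondb.net/pokedex/all")
```
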

Now that the page has been downloaded (a status code of 200 means the request succeeded), we will use Beautiful Soup to parse the HTML and pull out the table information.

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup.prettify())

In [5]:
children = list(soup.children)

The children of the soup item are printed out to get an idea of what some of the major classes are.

In [6]:
print([type(item) for item in children])

[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>, <class 'bs4.element.NavigableString'>]


The helper in the next cell pulls the text out of every matching tag on the webpage. Once we can pull out specific information this way, we can store it in a dictionary, a dataframe, or another data structure. In this example, we will be using a dataframe to store the information.

In [7]:
def pull_info(soup=None, tag='', class_str='', **kwargs):
    """
        Return the text of every element in the soup that matches the tag.

        parameters:
            soup:       beautiful soup object
            tag:        html-tag that we want to pull information from
            class_str:  string name of the class tied to the html tag being observed (optional)
            **kwargs:   any supplemental arguments passed on to the beautifulsoup find_all() function
    """

    if soup is None:
        # fail loudly instead of returning an error string that looks like data
        raise ValueError("You did not enter a beautiful soup object")
    if class_str:
        kwargs['class_'] = class_str  # only filter by class when one is given
    return [item.text for item in soup.find_all(name=tag, **kwargs)]
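To see what this kind of `find_all` call returns without hitting the website, here is a small self-contained sketch on a hand-written HTML fragment. The fragment and the class names `cell-num` and `cell-name` are made up for the example and are not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, used only for illustration
html = """
<table>
  <tr><th>#</th><th>Name</th></tr>
  <tr><td class="cell-num">001</td><td class="cell-name">Bulbasaur</td></tr>
  <tr><td class="cell-num">002</td><td class="cell-name">Ivysaur</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Every <th> cell, in document order
header_cells = [cell.text for cell in demo_soup.find_all(name="th")]

# Every <td> cell, regardless of class
all_cells = [cell.text for cell in demo_soup.find_all(name="td")]

# Only the <td> cells that carry the (hypothetical) "cell-name" class
name_cells = [cell.text for cell in demo_soup.find_all(name="td", class_="cell-name")]

print(header_cells)
print(all_cells)
print(name_cells)
```

Note that the cells come back as one flat list, so they still have to be grouped into rows afterwards.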

Starting with generation 6, mega evolutions and alternate forms were made available for certain pokemon. Although these forms can differ greatly in stats and appearance, they share the same national pokedex entry number as the base pokemon. This can be seen in the printout below.

In [8]:
headers = pull_info(soup=soup, tag='th')  # pull the header information from the webpage

table = pull_info(soup=soup, tag='td')  # pull the entries from each respective table
n_cols = len(headers)  # one cell per header in every row (10 here)
rows = [table[i:i + n_cols] for i in range(0, len(table), n_cols)]

print(headers)
for row in rows[:10]:  # print the first 10 rows
    print(row)

['#', 'Name', 'Type', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
['001', 'Bulbasaur', 'Grass Poison', '318', '45', '49', '49', '65', '65', '45']
['002', 'Ivysaur', 'Grass Poison', '405', '60', '62', '63', '80', '80', '60']
['003', 'Venusaur', 'Grass Poison', '525', '80', '82', '83', '100', '100', '80']
['003', 'Venusaur Mega Venusaur', 'Grass Poison', '625', '80', '100', '123', '122', '120', '80']
['004', 'Charmander', 'Fire ', '309', '39', '52', '43', '60', '50', '65']
['005', 'Charmeleon', 'Fire ', '405', '58', '64', '58', '80', '65', '80']
['006', 'Charizard', 'Fire Flying', '534', '78', '84', '78', '109', '85', '100']
['006', 'Charizard Mega Charizard X', 'Fire Dragon', '634', '78', '130', '111', '130', '85', '100']
['006', 'Charizard Mega Charizard Y', 'Fire Flying', '634', '78', '104', '78', '159', '115', '100']
['007', 'Squirtle', 'Water ', '314', '44', '48', '65', '50', '64', '43']
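Slicing the flat list of cells into rows only works if the number of cells is an exact multiple of the number of columns. The sketch below wraps the slicing in a helper that checks this first. The name `chunk_rows` and the sample data are hypothetical and only meant to illustrate the idea.

```python
def chunk_rows(cells, width):
    """Group a flat list of cells into rows of `width` cells each."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    if len(cells) % width != 0:
        # A leftover cell means the table layout is not what we assumed
        raise ValueError(
            f"{len(cells)} cells cannot be split evenly into rows of {width}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


headers_demo = ['#', 'Name', 'Type']
cells_demo = ['001', 'Bulbasaur', 'Grass Poison',
              '004', 'Charmander', 'Fire ']

rows_demo = chunk_rows(cells_demo, len(headers_demo))
for row in rows_demo:
    print(row)
```
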


### Data Cleaning & Feature Engineering

In [9]:
pokedex = pd.DataFrame(data=rows, columns=list(map(lambda x:x.replace(' ','_').lower(),headers)))

In [10]:
pokedex.rename(columns={'#': 'nat_idx'}, inplace=True)

In [11]:
pokedex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1034 entries, 0 to 1033
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   nat_idx  1034 non-null   object
 1   name     1034 non-null   object
 2   type     1034 non-null   object
 3   total    1034 non-null   object
 4   hp       1034 non-null   object
 5   attack   1034 non-null   object
 6   defense  1034 non-null   object
 7   sp._atk  1034 non-null   object
 8   sp._def  1034 non-null   object
 9   speed    1034 non-null   object
dtypes: object(10)
memory usage: 80.9+ KB


At the moment, everything is stored as a string. We will need to convert the numerical columns to integer types.

In [12]:
pokedex.head()

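| | nat_idx | name | type | total | hp | attack | defense | sp._atk | sp._def | speed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
| 1 | 2 | Ivysaur | Grass Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
| 2 | 3 | Venusaur | Grass Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
| 3 | 3 | Venusaur Mega Venusaur | Grass Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 |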
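Because alternate forms share a national index with their base pokemon (Venusaur and Mega Venusaur in the output above), the index cannot be used as a unique key. A quick way to list the affected rows is `DataFrame.duplicated`. This is a sketch on a small hand-made frame rather than on the scraped `pokedex`.

```python
import pandas as pd

# Small hand-made sample; the real pokedex frame has the same two columns
demo = pd.DataFrame({
    "nat_idx": ["003", "003", "004", "006", "006", "006"],
    "name": ["Venusaur", "Venusaur Mega Venusaur", "Charmander",
             "Charizard", "Charizard Mega Charizard X",
             "Charizard Mega Charizard Y"],
})

# keep=False marks every row whose index appears more than once
shared = demo[demo.duplicated(subset="nat_idx", keep=False)]

# Number of forms listed under each national index
forms_per_index = demo.groupby("nat_idx")["name"].count()

print(shared)
print(forms_per_index)
```
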


Here, we are going to separate the type column into two separate columns: primary and secondary. This will make it easier to identify pokemon via primary and secondary typing later on.

**_Note:_** Not every pokemon has a secondary type. The placeholder `---` is used to represent the lack of one.

In [13]:
pokedex['primary'] = pokedex.type.apply(lambda x: x.split()[0])
pokedex['secondary'] = pokedex.type.apply(lambda x: x.split()[-1] 
                                          if len(x.split())==2 else '---')
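As an alternative to the two `apply` calls, the split can also be sketched with the vectorized string methods in pandas. This is only an illustration on a made-up Series, not a replacement for the cell above. Single-typed entries such as `'Fire '` come out with a missing second value, which is then filled with the same `---` placeholder.

```python
import pandas as pd

types = pd.Series(["Grass Poison", "Fire ", "Fire Dragon", "Water "], name="type")

# Split on whitespace into separate columns; single-typed rows get a
# missing value in the second column. reindex guarantees that column 1
# exists even if no row has a secondary type.
parts = types.str.split(expand=True).reindex(columns=[0, 1])

primary = parts[0]
secondary = parts[1].fillna("---")  # same placeholder as in the note above

print(pd.DataFrame({"primary": primary, "secondary": secondary}))
```
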

In [14]:
pokedex[(pokedex['primary']=='Dragon')|(pokedex['secondary']=='Dragon')]

| | nat_idx | name | type | total | hp | attack | defense | sp._atk | sp._def | speed | primary | secondary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 006 | Charizard Mega Charizard X | Fire Dragon | 634 | 78 | 130 | 111 | 130 | 85 | 100 | Fire | Dragon |
| 135 | 103 | Exeggutor Alolan Exeggutor | Grass Dragon | 530 | 95 | 105 | 85 | 125 | 75 | 45 | Grass | Dragon |
| 187 | 147 | Dratini | Dragon | 300 | 41 | 64 | 45 | 50 | 50 | 50 | Dragon | --- |
| 188 | 148 | Dragonair | Dragon | 420 | 61 | 84 | 65 | 70 | 70 | 70 | Dragon | --- |
| 189 | 149 | Dragonite | Dragon Flying | 600 | 91 | 134 | 95 | 100 | 100 | 80 | Dragon | Flying |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1021 | 885 | Dreepy | Dragon Ghost | 270 | 28 | 60 | 30 | 40 | 30 | 82 | Dragon | Ghost |
| 1022 | 886 | Drakloak | Dragon Ghost | 410 | 68 | 80 | 50 | 60 | 50 | 102 | Dragon | Ghost |
| 1023 | 887 | Dragapult | Dragon Ghost | 600 | 88 | 120 | 75 | 100 | 75 | 142 | Dragon | Ghost |
| 1028 | 890 | Eternatus | Poison Dragon | 690 | 140 | 85 | 95 | 145 | 95 | 130 | Poison | Dragon |


Now we will convert numeric columns to integer types.  This will allow us to utilize the fields in the classification models later on.

In [15]:
for col in pokedex.columns:
    try:
        pokedex[col] = pokedex[col].astype('int')
    except ValueError:  # raised when a column holds non-numeric text
        print(f'Unable to convert {col} to type int')

pokedex.info()

Unable to convert name to type int
Unable to convert type to type int
Unable to convert primary to type int
Unable to convert secondary to type int
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1034 entries, 0 to 1033
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   nat_idx    1034 non-null   int64 
 1   name       1034 non-null   object
 2   type       1034 non-null   object
 3   total      1034 non-null   int64 
 4   hp         1034 non-null   int64 
 5   attack     1034 non-null   int64 
 6   defense    1034 non-null   int64 
 7   sp._atk    1034 non-null   int64 
 8   sp._def    1034 non-null   int64 
 9   speed      1034 non-null   int64 
 10  primary    1034 non-null   object
 11  secondary  1034 non-null   object
dtypes: int64(8), object(4)
memory usage: 97.1+ KB
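The loop above tries every column and reports the ones that fail. If the numeric columns are known ahead of time, a more explicit sketch is to name them and convert only those with `pd.to_numeric`. The frame below is a made-up sample with a subset of the real columns.

```python
import pandas as pd

# Made-up sample where every value starts out as a string
demo = pd.DataFrame({
    "nat_idx": ["001", "004"],
    "name": ["Bulbasaur", "Charmander"],
    "total": ["318", "309"],
    "hp": ["45", "39"],
})

# Columns we expect to be numeric (an assumption for this sample)
numeric_cols = ["nat_idx", "total", "hp"]

for col in numeric_cols:
    # to_numeric raises a ValueError if a value cannot be parsed
    demo[col] = pd.to_numeric(demo[col])

print(demo.dtypes)
```

With this approach, an unexpected non-numeric value in a stats column raises an error instead of silently leaving the column as text.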


In [16]:
pokedex.head(10)

| | nat_idx | name | type | total | hp | attack | defense | sp._atk | sp._def | speed | primary | secondary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | Grass | Poison |
| 1 | 2 | Ivysaur | Grass Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | Grass | Poison |
| 2 | 3 | Venusaur | Grass Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | Grass | Poison |
| 3 | 3 | Venusaur Mega Venusaur | Grass Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | Grass | Poison |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | Fire | --- |
| 5 | 5 | Charmeleon | Fire | 405 | 58 | 64 | 58 | 80 | 65 | 80 | Fire | --- |
| 6 | 6 | Charizard | Fire Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | Fire | Flying |
| 7 | 6 | Charizard Mega Charizard X | Fire Dragon | 634 | 78 | 130 | 111 | 130 | 85 | 100 | Fire | Dragon |
| 8 | 6 | Charizard Mega Charizard Y | Fire Flying | 634 | 78 | 104 | 78 | 159 | 115 | 100 | Fire | Flying |
| 9 | 7 | Squirtle | Water | 314 | 44 | 48 | 65 | 50 | 64 | 43 | Water | --- |


# Modeling

### __Label Encoding__

In order to prep the data for classification modeling, we first have to encode the type names as integer labels.

In [17]:
from sklearn import preprocessing

In [18]:
le = preprocessing.LabelEncoder()

In [19]:
# concatenate the primary and secondary types to get all of the available types
unique_types = np.concatenate((pokedex.primary.unique(), pokedex.secondary.unique()))

# transform the types into categorical
le.fit(unique_types)

type_labels = le.transform(unique_types)
types = dict(zip(le.classes_, le.transform(le.classes_)))

Mapping of the different types and their associated labels with _sklearn.preprocessing.LabelEncoder_

In [20]:
types

{'---': 0,
 'Bug': 1,
 'Dark': 2,
 'Dragon': 3,
 'Electric': 4,
 'Fairy': 5,
 'Fighting': 6,
 'Fire': 7,
 'Flying': 8,
 'Ghost': 9,
 'Grass': 10,
 'Ground': 11,
 'Ice': 12,
 'Normal': 13,
 'Poison': 14,
 'Psychic': 15,
 'Rock': 16,
 'Steel': 17,
 'Water': 18}
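The labels in the mapping above follow the sorted order of the type names, which is why the `---` placeholder ends up with label 0. The pure-Python sketch below illustrates that idea on a made-up list of types. It is only an illustration of the ordering, not scikit-learn's actual implementation.

```python
# Made-up sample of type names, including the '---' placeholder
type_names = ["Grass", "Fire", "Water", "---", "Poison", "Fire", "Dragon"]

# Unique names in sorted order; the position in this list is the label
classes = sorted(set(type_names))
label_of = {name: label for label, name in enumerate(classes)}

# Encode the names as integers, then decode them back again
encoded = [label_of[name] for name in type_names]
decoded = [classes[label] for label in encoded]

print(label_of)
print(encoded)
```

Keep in mind that these integers are arbitrary identifiers: the fact that one type has a larger label than another carries no meaning.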

In [21]:
df = pokedex.drop(columns='type').copy()

df.primary = le.transform(df.primary)
df.secondary = le.transform(df.secondary)

In [22]:
df.head(10)

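| | nat_idx | name | total | hp | attack | defense | sp._atk | sp._def | speed | primary | secondary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 10 | 14 |
| 1 | 2 | Ivysaur | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 10 | 14 |
| 2 | 3 | Venusaur | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 10 | 14 |
| 3 | 3 | Venusaur Mega Venusaur | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 10 | 14 |
| 4 | 4 | Charmander | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 7 | 0 |
| 5 | 5 | Charmeleon | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 7 | 0 |
| 6 | 6 | Charizard | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 7 | 8 |
| 7 | 6 | Charizard Mega Charizard X | 634 | 78 | 130 | 111 | 130 | 85 | 100 | 7 | 3 |
| 8 | 6 | Charizard Mega Charizard Y | 634 | 78 | 104 | 78 | 159 | 115 | 100 | 7 | 8 |
| 9 | 7 | Squirtle | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 18 | 0 |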


# Visualizations

### Types


The charts below show what percentage of all pokemon in the national pokedex belong to each type. Every pokemon has a primary type; a secondary type is optional.

Click on the legend below to add/remove certain types from the Plotly generated pie chart.
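The percentages shown in the pie charts can also be computed directly as a table. Here is a minimal sketch using `value_counts(normalize=True)` on a made-up Series; on the real data the same call would be applied to `pokedex.primary` or `pokedex.secondary`.

```python
import pandas as pd

# Made-up sample of primary types
primary = pd.Series(["Water", "Water", "Fire", "Grass"], name="primary")

# normalize=True returns fractions instead of counts; scale to percent
share = primary.value_counts(normalize=True).mul(100).round(1)

print(share)
```
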

#### Primary & Secondary typing next to each other

In [62]:
## Plotly pie subplots ##

prim_labels = pokedex.primary.value_counts().keys()
prim_values = pokedex.primary.value_counts().values

sec_labels = pokedex.secondary.value_counts().keys()
sec_values = pokedex.secondary.value_counts().values

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(go.Pie(labels=prim_labels, values=prim_values, name="Primary Types"), 1, 1)
fig.add_trace(go.Pie(labels=sec_labels, values=sec_values, name="Secondary Types"), 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Pokemon Primary and Secondary Typing",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Primary', x=0.17, y=0.5, font_size=16, showarrow=False),
                 dict(text='Secondary', x=0.85, y=0.5, font_size=16, showarrow=False)])

# fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()

In [81]:
## Plotly grouped bar plots ##

prim_labels = pokedex.primary.value_counts().keys()
prim_values = pokedex.primary.value_counts().values

sec_labels = pokedex.secondary.value_counts().keys()
sec_values = pokedex.secondary.value_counts().values

fig = go.Figure(data=[
    go.Bar(name='Primary', x=prim_labels, y=prim_values),
    go.Bar(name='Secondary', x=sec_labels, y=sec_values)
])
# Change the bar mode
fig.update_layout(title_text='Primary vs Secondary Typing', 
                  barmode='group')
fig.show()

#### _My Tree Map_

In [293]:
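## My attempt at Tree Maps with the pokedex ##
import plotly.express as px  # needed for px.treemap

tree_df = pokedex.copy()  # copy so the helper columns do not leak into pokedex or df
tree_df["all"] = "all"
tree_df["count"] = 1

fig = px.treemap(tree_df, path=['all', 'primary', 'secondary', 'name'],
                 values='count',
                 color_continuous_scale='RdBu')
fig.show()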

#### _Spider Chart_

I decided to add a spider (radar) chart to the list to get a better idea of how balanced the stats of an individual pokemon are. I also saw that Plotly supports this chart type and wanted to try it out.

In [158]:
idx = 930
min_idx, max_idx = 4,10
categories = pokedex.columns[min_idx:max_idx]

pokemon = pokedex.loc[pokedex.index==idx]
pokemon_stats = pokemon[pokemon.columns[min_idx:max_idx]]

# pokemon_stats, 
pokemon.name.to_string()

'930    Necrozma Dusk Mane Necrozma'

In [161]:
fig = go.Figure(data=go.Scatterpolar(
  r=pokemon_stats.to_numpy()[0],
  theta=categories,
  fill='toself',
  name=pokemon.name.to_string()
))

fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0,200]
    ),
  ),
  showlegend=False
)

fig.show()

### Train-test split

In [23]:
from sklearn.model_selection import train_test_split

In [29]:
## Set the target from our data frame for modeling
target = df.primary.values
model_df = df.drop(columns=['name', 'primary'])  # drop the columns, not row labels


## Train-test split
x_train, x_test, y_train, y_test = train_test_split(model_df, target, test_size=0.40, random_state=139)
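To make the split easier to reason about, here is a self-contained sketch of the same steps on a toy frame. The ten rows and their values are made up for illustration. With `test_size=0.40`, ten rows are divided into six training rows and four test rows, and fixing `random_state` makes the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; 'primary' plays the role of the encoded target
toy = pd.DataFrame({
    "name": [f"mon_{i}" for i in range(10)],
    "hp": [45, 60, 80, 39, 58, 78, 44, 59, 79, 45],
    "attack": [49, 62, 82, 52, 64, 84, 48, 63, 83, 30],
    "primary": [10, 10, 10, 7, 7, 7, 18, 18, 18, 1],
})

target = toy["primary"].values
# Drop the identifier and the target so they are not used as features
features = toy.drop(columns=["name", "primary"])

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.40, random_state=139
)

print(x_train.shape, x_test.shape)
```
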

## Machine Learning Models