<a href="https://colab.research.google.com/github/yuki-shi/pokedex-flask/blob/main/serebii_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports !!! 🐊

In [74]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
from collections import OrderedDict

## Webscraping !!! 🐈

In [58]:
url = 'https://serebii.net/pokemon/gen1pokemon.shtml'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

To extract the table, we start by selecting only the headers.

The page's HTML structure is a bit messy but, luckily, every header element has the class *fooevo*.

After that, we build a list with the innerText of the scraped tags and strip the separator characters from each element.

In [59]:
header = []
h = soup.find_all('td', class_='fooevo')

for i in h:
  header.append(i.text)

header = [x.strip('\r\n\t') for x in header] # some entries had escape characters around them
header.remove('Base Stats') # the table has a spanning header; drop it and keep only the column names
print(header)

['No.', 'Pic', 'Name', 'Type', 'Abilities', 'HP', 'Att', 'Def', 'S.Att', 'S.Def', 'Spd']
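
The cleanup step above can be tried on plain strings. The sketch below uses made-up cell texts (the `raw_cells` list is an assumption, not the actual page content) to show that `str.strip('\r\n\t')` trims any mix of those characters from both ends, and that the spanning header can be filtered out in a second pass.

```python
# Hypothetical raw cell texts, standing in for the .text of the scraped <td> tags.
raw_cells = ['\r\n\tNo.', 'Pic', '\r\n\tName\r\n', 'Base Stats', 'HP\t']

# strip() treats its argument as a set of characters to trim from both ends,
# not as a literal sequence, so any mix of \r, \n and \t is removed.
cleaned = [text.strip('\r\n\t') for text in raw_cells]

# Drop the header that spans the stat columns.
header = [text for text in cleaned if text != 'Base Stats']
print(header)
```

Unlike `list.remove`, the filter does not raise `ValueError` if `'Base Stats'` is missing, which may or may not be what you want when the page layout changes.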


From the header we obtained, we build an *ordered dictionary* using its entries as keys.

An order-preserving structure matters here: it keeps the column layout of the original table and makes the next steps simpler.


For each key, we create a list as its value.

In [60]:
poke_dict = OrderedDict.fromkeys(header)

for i in poke_dict.keys(): # needed: fromkeys sets every value to None, and .setdefault will not replace an existing key
  poke_dict[i] = []

print(poke_dict)

OrderedDict([('No.', []), ('Pic', []), ('Name', []), ('Type', []), ('Abilities', []), ('HP', []), ('Att', []), ('Def', []), ('S.Att', []), ('S.Def', []), ('Spd', [])])
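
The initialization above is worth a closer look, because `fromkeys` has two easy pitfalls. The sketch below uses a shortened, illustrative header to show them next to an alternative that builds one fresh list per key; it is a small demonstration, not a replacement for the cell above.

```python
from collections import OrderedDict

header = ['No.', 'Name', 'HP']  # shortened header, for illustration only

# fromkeys() without a value fills every key with None.
by_none = OrderedDict.fromkeys(header)

# setdefault() only inserts when the key is missing, so None stays in place.
still_none = by_none.setdefault('No.', [])

# Passing a list to fromkeys() makes every key share the same list object.
shared = OrderedDict.fromkeys(header, [])
shared['No.'].append('#001')

# One fresh list per key, in header order.
poke_dict = OrderedDict((key, []) for key in header)
poke_dict['No.'].append('#001')

print(still_none)         # None
print(shared['Name'])     # ['#001']
print(poke_dict['Name'])  # []
```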


We select all *tr* tags and store them in the variable *tbody*, which is then sliced to drop the header rows we already extracted.

Next, we iterate over the rows, find the *td* cells of each one, and append them to our dictionary of lists.

In [61]:
tbody = soup.find_all('tr')
tbody = tbody[2:]

In [62]:
for index, tr in enumerate(tbody):
  if index % 2 == 0:  # the extraction is clunky, so we take only the even indexes >_>
    for i, key in enumerate(poke_dict.keys()):
      poke_dict.setdefault(key, []).append(tr.find_all('td', class_='fooinfo')[i].text.strip('\r\n\t'))
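
The same row-filling logic can be sketched on plain lists, without any HTML. The rows below are hypothetical stand-ins for the cell texts of each *tr*; the point is that `rows[::2]` selects the same rows as the `index % 2 == 0` test, and that `zip` pairs each column name with the cell in the same position, so the cells of a row only need to be collected once.

```python
from collections import OrderedDict

keys = ['No.', 'Name', 'HP']  # shortened header, for illustration only

# Hypothetical rows: each Pokémon row is followed by a row we want to skip.
rows = [
    ['#001', 'Bulbasaur', '45'],
    ['(skipped)'],
    ['#002', 'Ivysaur', '60'],
    ['(skipped)'],
]

table = OrderedDict((key, []) for key in keys)

# rows[::2] keeps indexes 0, 2, 4, ... like the `index % 2 == 0` test.
for cells in rows[::2]:
    # zip pairs each column name with the cell at the same position.
    for key, cell in zip(keys, cells):
        table[key].append(cell)

print(table['Name'])
```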

Here is the long-awaited initial dataframe:

In [63]:
df = pd.DataFrame(poke_dict)
df.head()

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
0,#001,,Bulbasaur,,Overgrow Chlorophyll,45,49,49,65,65,45
1,#002,,Ivysaur,,Overgrow Chlorophyll,60,62,63,80,80,60
2,#003,,Venusaur,,Overgrow Chlorophyll,80,82,83,100,100,80
3,#004,,Charmander,,Blaze Solar Power,39,52,43,60,50,65
4,#005,,Charmeleon,,Blaze Solar Power,58,64,58,80,65,80


## API !!! 🐿

To complement the scraped table, we can use an API called PokéAPI, which gives us more specific data about each Pokémon and also lets us fill in the "*Type*" column.

### Initial cleanup

For the API calls, we will use the Pokémon names in the endpoints.

Some names therefore need to be cleaned up to remove spaces and special characters.

In [64]:
df[df['Name'].str.contains('Nidoran')]

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
28,#029,,Nidoran♀,,Poison Point Rivalry Hustle,55,47,52,40,40,41
31,#032,,Nidoran♂,,Poison Point Rivalry Hustle,46,57,40,40,40,50


In [65]:
df.loc[df['Name'].str.contains('Nidoran♂'), 'Name'] = 'Nidoran-m'
df.loc[df['Name'].str.contains('Nidoran♀'), 'Name'] = 'Nidoran-f'
df[df['Name'].str.contains('Nidoran')]

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
28,#029,,Nidoran-f,,Poison Point Rivalry Hustle,55,47,52,40,40,41
31,#032,,Nidoran-m,,Poison Point Rivalry Hustle,46,57,40,40,40,50


In [66]:
df['Name'] = df['Name'].str.replace("'", '', regex=False) # farfetch'd
df['Name'] = df['Name'].str.replace('.', '-', regex=False) # mr.mime
df['Name'] = df['Name'].str.replace(' ', '', regex=False)

In [67]:
df[df['Name'].str.contains('Mime')]

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
121,#122,,Mr-Mime,,Soundproof Filter Technician,40,45,65,100,120,90
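
The cleanup cells above can also be folded into a single helper. This is a minimal sketch that applies the same replacements in one place; the name `to_slug` is hypothetical, and the rules cover only the special cases handled in this notebook.

```python
def to_slug(name):
    """Turn a display name into the lowercase form used in the API URL."""
    # Gender symbols become suffixes, as in the Nidoran cells above.
    name = name.replace('♀', '-f').replace('♂', '-m')
    # Same replacements as the cells above: apostrophes and spaces are
    # removed, periods become hyphens.
    name = name.replace("'", '').replace('.', '-').replace(' ', '')
    return name.lower()


for name in ['Nidoran♀', "Farfetch'd", 'Mr. Mime']:
    print(name, '->', to_slug(name))
```

Keeping the rules in one function makes it easier to test them before spending 151 requests on the API.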


### Call

We collect all the names, convert them to lowercase, and call the API once per name.

There is a Python *wrapper* library called [PokéBase](https://github.com/PokeAPI/pokebase) that would make this task easier, but less fun.

In [68]:
nomes = df.loc[:, 'Name'].str.lower()
len(nomes)

151

In [77]:
def get_data(nomes, df):

  for index, nome in enumerate(nomes):

    r = requests.get(f'https://pokeapi.co/api/v2/pokemon/{nome}')
    data = r.json() if r.status_code == 200 else None  # named `data` so the json module is not shadowed

    try:
      # Types: 'Type 1' is always filled, 'Type 2' only for dual-type Pokémon
      for i, j in enumerate(data['types']):
        df.loc[index, f'Type {i+1}'] = j['type']['name']

      # Height
      df.loc[index, 'Height'] = data['height']

      # Weight
      df.loc[index, 'Weight'] = data['weight']

      # Abilities
      for i, j in enumerate(data['abilities']):
        df.loc[index, f'Ability {i+1}'] = j['ability']['name']
        df.loc[index, 'Hidden Ability'] = j['ability']['name'] if j['is_hidden'] else np.nan

    except Exception as exc:
      raise Exception(f'Pokémon {nome} failed!') from exc

  # enumerate also counts the hidden ability, which creates an extra ability column;
  # Pokémon with a single regular ability keep the hidden one in 'Ability 2'
  df.drop('Ability 3', axis=1, inplace=True)

  return df

In [78]:
df_pkm = get_data(nomes, df)
df_pkm.head(10)

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd,Type 1,Type 2,Height,Weight,Ability 1,Hidden Ability,Ability 2
0,#001,,Bulbasaur,,Overgrow Chlorophyll,45,49,49,65,65,45,grass,poison,7.0,69.0,overgrow,chlorophyll,chlorophyll
1,#002,,Ivysaur,,Overgrow Chlorophyll,60,62,63,80,80,60,grass,poison,10.0,130.0,overgrow,chlorophyll,chlorophyll
2,#003,,Venusaur,,Overgrow Chlorophyll,80,82,83,100,100,80,grass,poison,20.0,1000.0,overgrow,chlorophyll,chlorophyll
3,#004,,Charmander,,Blaze Solar Power,39,52,43,60,50,65,fire,,6.0,85.0,blaze,solar-power,solar-power
4,#005,,Charmeleon,,Blaze Solar Power,58,64,58,80,65,80,fire,,11.0,190.0,blaze,solar-power,solar-power
5,#006,,Charizard,,Blaze Solar Power,78,84,78,109,85,100,fire,flying,17.0,905.0,blaze,solar-power,solar-power
6,#007,,Squirtle,,Torrent Rain Dish,44,48,65,50,64,43,water,,5.0,90.0,torrent,rain-dish,rain-dish
7,#008,,Wartortle,,Torrent Rain Dish,59,63,80,65,80,58,water,,10.0,225.0,torrent,rain-dish,rain-dish
8,#009,,Blastoise,,Torrent Rain Dish,79,83,100,85,105,78,water,,16.0,855.0,torrent,rain-dish,rain-dish
9,#010,,Caterpie,,Shield Dust Run Away,45,30,35,20,20,45,bug,,3.0,29.0,shield-dust,run-away,run-away
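
In the table above, `Ability 2` repeats the hidden ability whenever a Pokémon has a single regular ability, because the loop numbers every ability in order. Below is a minimal sketch of one way to keep the two apart, assuming the response contains the keys already read in `get_data`. Both `flatten_pokemon` and the `sample` payload are hypothetical: the payload is hand-written with Bulbasaur's values from the table, not a real API response.

```python
def flatten_pokemon(payload):
    """Flatten the fields used above into one dict per Pokémon."""
    row = {
        'Height': payload['height'],
        'Weight': payload['weight'],
        'Hidden Ability': None,
    }
    for slot, entry in enumerate(payload['types'], start=1):
        row[f'Type {slot}'] = entry['type']['name']

    regular = 0  # counts only the non-hidden abilities
    for entry in payload['abilities']:
        name = entry['ability']['name']
        if entry['is_hidden']:
            row['Hidden Ability'] = name
        else:
            regular += 1
            row[f'Ability {regular}'] = name
    return row


# Trimmed, hand-written payload with only the keys read above (illustrative).
sample = {
    'height': 7,
    'weight': 69,
    'types': [{'type': {'name': 'grass'}}, {'type': {'name': 'poison'}}],
    'abilities': [
        {'ability': {'name': 'overgrow'}, 'is_hidden': False},
        {'ability': {'name': 'chlorophyll'}, 'is_hidden': True},
    ],
}
row = flatten_pokemon(sample)
print(row)
```

A list of such dicts can be passed to `pd.DataFrame`, which fills the missing `Ability 2` entries with NaN instead of a duplicated value.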


Now we drop the irrelevant columns...

In [79]:
df_pkm.drop(['Type', 'Pic', 'Abilities'], axis=1, inplace=True)
df_pkm.head()

Unnamed: 0,No.,Name,HP,Att,Def,S.Att,S.Def,Spd,Type 1,Type 2,Height,Weight,Ability 1,Hidden Ability,Ability 2
0,#001,Bulbasaur,45,49,49,65,65,45,grass,poison,7.0,69.0,overgrow,chlorophyll,chlorophyll
1,#002,Ivysaur,60,62,63,80,80,60,grass,poison,10.0,130.0,overgrow,chlorophyll,chlorophyll
2,#003,Venusaur,80,82,83,100,100,80,grass,poison,20.0,1000.0,overgrow,chlorophyll,chlorophyll
3,#004,Charmander,39,52,43,60,50,65,fire,,6.0,85.0,blaze,solar-power,solar-power
4,#005,Charmeleon,58,64,58,80,65,80,fire,,11.0,190.0,blaze,solar-power,solar-power


In [82]:
df_pkm = df_pkm[['No.', 'Name', 'Type 1', 'Type 2', 'HP', 'Def', 'S.Att', 'S.Def', 'Spd', 'Ability 1', 'Ability 2', 'Hidden Ability', 'Height', 'Weight']]

In [83]:
df_pkm

Unnamed: 0,No.,Name,Type 1,Type 2,HP,Def,S.Att,S.Def,Spd,Ability 1,Ability 2,Hidden Ability,Height,Weight
0,#001,Bulbasaur,grass,poison,45,49,65,65,45,overgrow,chlorophyll,chlorophyll,7.0,69.0
1,#002,Ivysaur,grass,poison,60,63,80,80,60,overgrow,chlorophyll,chlorophyll,10.0,130.0
2,#003,Venusaur,grass,poison,80,83,100,100,80,overgrow,chlorophyll,chlorophyll,20.0,1000.0
3,#004,Charmander,fire,,39,43,60,50,65,blaze,solar-power,solar-power,6.0,85.0
4,#005,Charmeleon,fire,,58,58,80,65,80,blaze,solar-power,solar-power,11.0,190.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,#147,Dratini,dragon,,41,45,50,50,50,shed-skin,marvel-scale,marvel-scale,18.0,33.0
147,#148,Dragonair,dragon,,61,65,70,70,70,shed-skin,marvel-scale,marvel-scale,40.0,165.0
148,#149,Dragonite,dragon,flying,91,95,100,100,80,inner-focus,multiscale,multiscale,22.0,2100.0
149,#150,Mewtwo,psychic,,106,90,154,90,130,pressure,unnerve,unnerve,20.0,1220.0
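
The `Height` and `Weight` columns hold the raw API values. Assuming PokéAPI reports height in decimetres and weight in hectograms (worth confirming in the API documentation before relying on it), a small conversion makes the numbers easier to read. The helper name `to_metric` is hypothetical.

```python
def to_metric(height_dm, weight_hg):
    """Convert decimetres to metres and hectograms to kilograms."""
    return height_dm / 10, weight_hg / 10


# Bulbasaur's values from the table above.
height_m, weight_kg = to_metric(7.0, 69.0)
print(height_m, weight_kg)
```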
