<a href="https://colab.research.google.com/github/yuki-shi/pokedex-flask/blob/main/serebii_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Imports !!!

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
from collections import OrderedDict

### Webscraping !!!

In [24]:
url = 'https://serebii.net/pokemon/gen1pokemon.shtml'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

To extract the table, we start by selecting only the headers.

The page's HTML structure is a bit messy but, luckily, every header element carries the class *fooevo*.

Next, we build a list from the innerText of the scraped tags and strip the escape characters from each element.

In [25]:
header = []
h = soup.find_all('td', class_='fooevo')

for i in h:
  header.append(i.text)

header = [x.strip('\r\n\t') for x in header] # some entries carried escape characters
header.remove('Base Stats') # the table is pivoted, so we drop the overarching header label
print(header)

['No.', 'Pic', 'Name', 'Type', 'Abilities', 'HP', 'Att', 'Def', 'S.Att', 'S.Def', 'Spd']
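
As a small illustration of the clean-up step, the sketch below runs the same strip-and-remove logic on a hand-written list. The `raw` values are made up for the example and are not the page's actual contents. Note that `str.strip('\r\n\t')` treats its argument as a set of characters to trim from both ends, not as a literal sequence.

```python
# Hypothetical raw cell texts, standing in for the scraped innerText values
raw = ['\r\n\tNo.', 'Name\r\n', 'Base Stats', '\tHP\t']

# strip() removes any leading/trailing \r, \n or \t characters
header = [text.strip('\r\n\t') for text in raw]

# drop the spanning label, which does not name a column of its own
header.remove('Base Stats')

print(header)  # ['No.', 'Name', 'HP']
```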


From the header we obtained, we create an *ordered dictionary* that uses its entries as keys.

Using this structure matters here: it preserves the layout of the original table, which makes the next steps simpler.

For each key, we create a list as its value.

In [26]:
poke_dict = OrderedDict.fromkeys(header)

for i in poke_dict.keys(): # needed: fromkeys() sets every value to None, and .setdefault() does not replace an existing key
  poke_dict[i] = []

print(poke_dict)

OrderedDict([('No.', []), ('Pic', []), ('Name', []), ('Type', []), ('Abilities', []), ('HP', []), ('Att', []), ('Def', []), ('S.Att', []), ('S.Def', []), ('Spd', [])])
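
One detail worth checking here: `OrderedDict.fromkeys(header)` maps every key to `None`, and `setdefault` leaves an existing key untouched, even when its value is `None`. The sketch below uses a shortened, made-up header to show why each key needs its own list before anything is appended, and that a generator expression can build the same structure in one step.

```python
from collections import OrderedDict

header = ['No.', 'Name', 'HP']  # shortened header, for illustration only

d = OrderedDict.fromkeys(header)
# the key already exists, so setdefault returns the stored None, not a new list
assert d.setdefault('Name', []) is None

# one separate list per key, built in a single step
poke_dict = OrderedDict((key, []) for key in header)
poke_dict['Name'].append('Bulbasaur')

print(poke_dict['Name'])  # ['Bulbasaur']
print(poke_dict['HP'])    # []
```

Passing a list as the second argument to `fromkeys` would not help either, since every key would then share the same list object.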


We select all *tr* tags and store them in the variable *tbody*, which is then sliced to drop the header rows we have already extracted.

In [27]:
tbody = soup.find_all('tr')
tbody = tbody[2:]

In [28]:
for index, tr in enumerate(tbody):
  if index % 2 == 0:  # the extraction is clunky, so we take only the even indexes >_>
    cells = tr.find_all('td', class_='fooinfo')  # look up this row's cells once
    for i, key in enumerate(poke_dict.keys()):
      poke_dict.setdefault(key, []).append(cells[i].text.strip('\r\n\t'))
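
The row loop can also be pictured on plain Python lists. In the sketch below, `rows` is a made-up stand-in for the text of the *fooinfo* cells in each *tr* (empty lists play the part of the rows being skipped); it is not the real page data. Pairing keys and cells with `zip` is shown as an alternative to indexing by position.

```python
from collections import OrderedDict

header = ['No.', 'Name', 'HP']  # shortened header, for illustration only
poke_dict = OrderedDict((key, []) for key in header)

# Hypothetical cell texts per <tr>; odd-indexed rows carry nothing useful here
rows = [
    ['#001', 'Bulbasaur', '45'],
    [],
    ['#002', 'Ivysaur', '60'],
    [],
]

for index, cells in enumerate(rows):
    if index % 2 == 0:  # keep even-indexed rows only
        # iterating the dict yields its keys in insertion (column) order
        for key, cell in zip(poke_dict, cells):
            poke_dict[key].append(cell)

print(poke_dict['Name'])  # ['Bulbasaur', 'Ivysaur']
```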

And here is the famed initial DataFrame:

In [29]:
df = pd.DataFrame(poke_dict)
df.head()

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
0,#001,,Bulbasaur,,Overgrow Chlorophyll,45,49,49,65,65,45
1,#002,,Ivysaur,,Overgrow Chlorophyll,60,62,63,80,80,60
2,#003,,Venusaur,,Overgrow Chlorophyll,80,82,83,100,100,80
3,#004,,Charmander,,Blaze Solar Power,39,52,43,60,50,65
4,#005,,Charmeleon,,Blaze Solar Power,58,64,58,80,65,80


### API !!!


In [31]:
df[df['Name'].str.contains('Nidoran')]

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
28,#029,,Nidoran♀,,Poison Point Rivalry Hustle,55,47,52,40,40,41
31,#032,,Nidoran♂,,Poison Point Rivalry Hustle,46,57,40,40,40,50


In [32]:
df.loc[df['Name'].str.contains('Nidoran♂'), 'Name'] = 'Nidoran-m'
df.loc[df['Name'].str.contains('Nidoran♀'), 'Name'] = 'Nidoran-f'
df[df['Name'].str.contains('Nidoran')]

Unnamed: 0,No.,Pic,Name,Type,Abilities,HP,Att,Def,S.Att,S.Def,Spd
28,#029,,Nidoran-f,,Poison Point Rivalry Hustle,55,47,52,40,40,41
31,#032,,Nidoran-m,,Poison Point Rivalry Hustle,46,57,40,40,40,50


In [49]:
df['Name'] = df['Name'].str.replace("'", '', regex=False) # farfetch'd
df['Name'] = df['Name'].str.replace('.', '-', regex=False) # mr.mime

In [50]:
nomes = df.loc[:, 'Name'].str.lower()
len(nomes)

151
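
The clean-up steps from the cells above can be gathered into one function, which makes them easy to check on a few names without touching the DataFrame. This is only a sketch: `to_slug` is a hypothetical helper, not part of the notebook, and the input spellings (for example `Mr.Mime`, written as in the comment above) are assumptions about how the names appear on the page. A name containing a space would need extra handling.

```python
def to_slug(name: str) -> str:
    """Hypothetical helper that applies the notebook's clean-up steps to one name."""
    name = name.replace('♂', '-m').replace('♀', '-f')  # Nidoran♂ / Nidoran♀
    name = name.replace("'", '')   # Farfetch'd
    name = name.replace('.', '-')  # Mr.Mime
    return name.lower()

for raw in ['Nidoran♀', 'Nidoran♂', "Farfetch'd", 'Mr.Mime', 'Bulbasaur']:
    print(raw, '->', to_slug(raw))
```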

In [None]:
for index, nome in enumerate(nomes):

  r = requests.get(f'https://pokeapi.co/api/v2/pokemon/{nome}')

  # fail loudly, naming the pokemon whose lookup did not succeed
  if r.status_code != 200:
    raise Exception(f'The lookup for pokemon {nome} went wrong!')

  data = r.json()  # named `data` so the imported json module is not shadowed

  # the i-th type goes to column 'tipo {i+1}'; single-type pokemon fill only 'tipo 1'
  for i, j in enumerate(data['types']):
    df.loc[index, f'tipo {i+1}'] = j['type']['name']
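
To see the type-mapping logic in isolation, without any network call, the sketch below applies it to hand-written dictionaries. `extract_types` is a hypothetical helper, and the payloads contain only the fields the loop above reads (`types`, `type`, `name`); they are not real API responses.

```python
def extract_types(payload: dict) -> dict:
    """Hypothetical helper: map a PokeAPI-style payload to 'tipo N' columns."""
    return {
        f'tipo {i + 1}': entry['type']['name']
        for i, entry in enumerate(payload['types'])
    }

# Hand-written payloads shaped like the fields the loop reads
bulbasaur = {'types': [{'type': {'name': 'grass'}}, {'type': {'name': 'poison'}}]}
charmander = {'types': [{'type': {'name': 'fire'}}]}

print(extract_types(bulbasaur))   # {'tipo 1': 'grass', 'tipo 2': 'poison'}
print(extract_types(charmander))  # {'tipo 1': 'fire'}
```

A single-type entry yields only `tipo 1`, which is why `tipo 2` stays empty for those rows in the DataFrame.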

In [56]:
df.drop('Type', axis=1, inplace=True)
df

Unnamed: 0,No.,Pic,Name,Abilities,HP,Att,Def,S.Att,S.Def,Spd,tipo 1,tipo 2
0,#001,,Bulbasaur,Overgrow Chlorophyll,45,49,49,65,65,45,grass,poison
1,#002,,Ivysaur,Overgrow Chlorophyll,60,62,63,80,80,60,grass,poison
2,#003,,Venusaur,Overgrow Chlorophyll,80,82,83,100,100,80,grass,poison
3,#004,,Charmander,Blaze Solar Power,39,52,43,60,50,65,fire,
4,#005,,Charmeleon,Blaze Solar Power,58,64,58,80,65,80,fire,
...,...,...,...,...,...,...,...,...,...,...,...,...
146,#147,,Dratini,Shed Skin Marvel Scale,41,64,45,50,50,50,dragon,
147,#148,,Dragonair,Shed Skin Marvel Scale,61,84,65,70,70,70,dragon,
148,#149,,Dragonite,Inner Focus Multiscale,91,134,95,100,100,80,dragon,flying
149,#150,,Mewtwo,Pressure Unnerve,106,110,90,154,90,130,psychic,
