# Getting data directly from a website
This notebook walks through the steps for collecting data from [Bulbapedia's National Pokédex](https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number) using `requests` and `BeautifulSoup`.

### Import `requests` library
The `requests` package downloads a page's HTML so that you can extract data from it. Start by saving the page's URL in the `URL` variable.

In [1]:
import requests

URL="https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"

### Load the page

In [2]:
page=requests.get(URL)

In [3]:
print(page.content)

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n<script defer src="https://static.cloudflareinsights.com/beacon.min.js" data-cf-beacon=\'{"si":10,"rayId":"630d89b40dc31a42","version":"2021.2.0"}\'></script>\n</body>\r\n</html>\r\n'
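The `403 Forbidden` body printed above means the server refused the request, so `page.content` holds an error page rather than the page you asked for. The sketch below shows one way to catch this early: `raise_for_status()` turns any 4xx/5xx response into an exception instead of letting the error page flow into the parser. The helper name `fetch_html`, the `HEADERS` dictionary, and the User-Agent string are all hypothetical, and whether sending an identifying header changes this particular server's answer is an assumption that has not been tested here.

```python
import requests

# Hypothetical identifying header; it is an assumption that the server
# treats requests carrying it differently from the library default.
HEADERS = {"User-Agent": "pokedex-tutorial/0.1 (educational notebook)"}


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or raise if the server reports an error."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes such as 403.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(URL)` then either returns HTML that is worth parsing or fails loudly at the point where the problem occurred.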


### Parse HTML data

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
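To see what the parsed object offers without touching the network, here is a small sketch that parses a hand-written HTML snippet. The snippet and the variable names (`sample_html`, `sample_soup`) are made up for illustration; only `BeautifulSoup(..., 'html.parser')`, attribute-style tag access, and `find(id=...)` come from the library.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not real Bulbapedia markup).
sample_html = b"<html><head><title>Demo</title></head><body><p id='intro'>Hello</p></body></html>"

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name, and elements can be looked up by their id.
title_text = sample_soup.title.text
intro_text = sample_soup.find(id="intro").text

print(title_text, intro_text)
```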

### Find all tables that contain Pokemon details

In [5]:
# Get main content <div>
poke_content=soup.find(id='mw-content-text')

# Get all <table> elements
poke_tables=poke_content.find_all('table')
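One thing to watch for: `find` returns `None` when nothing matches, which is what happens if the download produced an error page, and the subsequent `find_all` call then fails with an `AttributeError`. The sketch below shows the same two lookups on a made-up snippet with an explicit check. The sample HTML and variable names are hypothetical; the `mw-content-text` id is the one used in the cell above.

```python
from bs4 import BeautifulSoup

# Made-up page fragment with the same container id as the cell above.
sample_html = """
<div id="mw-content-text">
  <table><tr><td>intro table</td></tr></table>
  <table><tr><td>#001</td></tr></table>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

content = sample_soup.find(id="mw-content-text")
if content is None:
    # find() returns None when no element matches, e.g. on an error page.
    raise ValueError("main content <div> not found; check the downloaded HTML")

tables = content.find_all("table")
print(len(tables))
```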

### Get the list of first-generation Pokémon

In [6]:
gen1_list=poke_tables[1]

In [7]:
# Check its contents and find where the first Pokemon entry is
gen1_list.contents

['\n',
 <tr>
 <th style="border-top-left-radius: 5px; -moz-border-radius-topleft: 5px; -webkit-border-top-left-radius: 5px; -khtml-border-top-left-radius: 5px; -icab-border-top-left-radius: 5px; -o-border-top-left-radius: 5px; background: #64D364"> <a href="/wiki/List_of_Pok%C3%A9mon_by_Kanto_Pok%C3%A9dex_number" title="List of Pokémon by Kanto Pokédex number"><span style="color:#000;">Kdex</span></a>
 </th>
 <th style="background: #64D364"> Ndex
 </th>
 <th style="background: #64D364"> MS
 </th>
 <th style="background: #64D364"> Pokémon
 </th>
 <th colspan="2" style="border-top-right-radius: 5px; -moz-border-radius-topright: 5px; -webkit-border-top-right-radius: 5px; -khtml-border-top-right-radius: 5px; -icab-border-top-right-radius: 5px; -o-border-top-right-radius: 5px; background: #64D364"> Type
 </th></tr>,
 '\n',
 <tr style="background:#FFF">
 <td style="font-family:monospace"> #001
 </td>
 <td style="font-family:monospace"> #001
 </td>
 <th> <a href="/wiki/Bulbasaur_(Pok%C3%A9mon

In [10]:
# The first Pokemon entry
gen1_list.contents[3]

<tr style="background:#FFF">
<td style="font-family:monospace"> #001
</td>
<td style="font-family:monospace"> #001
</td>
<th> <a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur"><img alt="Bulbasaur" height="68" src="//cdn.bulbagarden.net/upload/2/21/001MS8.png" width="68"/></a>
</th>
<td> <a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur (Pokémon)">Bulbasaur</a>
</td>
<td colspan="1" style="text-align:center; background:#78C850"> <a href="/wiki/Grass_(type)" title="Grass (type)"><span style="color:#FFF">Grass</span></a> </td>
<td align="center" colspan="1" rowspan="1" style="background:#A040A0"> <a href="/wiki/Poison_(type)" title="Poison (type)"><span style="color:#FFFFFF">Poison</span></a>
</td></tr>

In [13]:
info_start=3

# Let's figure out how to get each item for Bulbasaur
info_row=gen1_list.contents[info_start]

# Print every child of the row together with its index
for i, child in enumerate(info_row.contents):
    print(f'Index {i} - {child}')

Index 0 - 

Index 1 - <td style="font-family:monospace"> #001
</td>
Index 2 - 

Index 3 - <td style="font-family:monospace"> #001
</td>
Index 4 - 

Index 5 - <th> <a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur"><img alt="Bulbasaur" height="68" src="//cdn.bulbagarden.net/upload/2/21/001MS8.png" width="68"/></a>
</th>
Index 6 - 

Index 7 - <td> <a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur (Pokémon)">Bulbasaur</a>
</td>
Index 8 - 

Index 9 - <td colspan="1" style="text-align:center; background:#78C850"> <a href="/wiki/Grass_(type)" title="Grass (type)"><span style="color:#FFF">Grass</span></a> </td>
Index 10 - 

Index 11 - <td align="center" colspan="1" rowspan="1" style="background:#A040A0"> <a href="/wiki/Poison_(type)" title="Poison (type)"><span style="color:#FFFFFF">Poison</span></a>
</td>
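The even-numbered entries in this listing are the newline strings that sit between the tags, which is why the cells of interest land on odd indices. As an alternative sketch, `find_all` with `recursive=False` returns only the direct tag children, so the cells can be addressed by plain position. The row below is a simplified copy of the Bulbasaur row with the styling and links trimmed; `row_html`, `cells`, and `texts` are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Simplified copy of the Bulbasaur row (attributes and real links removed).
row_html = """<tr>
<td> #001
</td>
<td> #001
</td>
<th> <a href="#"><img alt="Bulbasaur"/></a>
</th>
<td> <a href="#">Bulbasaur</a>
</td>
<td> <a href="#">Grass</a> </td>
<td> <a href="#">Poison</a>
</td></tr>"""

row = BeautifulSoup(row_html, "html.parser").tr

# .contents mixes tags with the whitespace strings between them.
print(len(row.contents))

# Only direct <td>/<th> children, with the whitespace strings left out.
cells = row.find_all(["td", "th"], recursive=False)
texts = [cell.get_text(strip=True) for cell in cells]
print(texts)
```

With this approach the sprite cell comes back as an empty string because it contains only an image, and the name and types follow it in order.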


In [14]:
# Extract items of interest
kdex=info_row.contents[1].text.strip()
ndex=info_row.contents[3].text.strip()
name=info_row.contents[7].text.strip()
type1=info_row.contents[9].text.strip()

print(f'Pokemon {ndex} {name} is a {type1} Pokemon')

Pokemon #001 Bulbasaur is a Grass Pokemon


### Get all Gen 1 Pokémon

In [15]:
for i in range(info_start, len(gen1_list.contents), 2):
    poke_info=gen1_list.contents[i]
    kdex=poke_info.contents[1].text.strip()
    ndex=poke_info.contents[3].text.strip()
    name=poke_info.contents[7].text.strip()
    type1=poke_info.contents[9].text.strip()
    if len(poke_info.contents) > 10:
        type2=poke_info.contents[11].text.strip()
        print(f'Pokemon {ndex} {name} is a {type1} & {type2} Pokemon')
    else:
        print(f'Pokemon {ndex} {name} is a {type1} Pokemon')

Pokemon #001 Bulbasaur is a Grass & Poison Pokemon
Pokemon #002 Ivysaur is a Grass & Poison Pokemon
Pokemon #003 Venusaur is a Grass & Poison Pokemon
Pokemon #004 Charmander is a Fire Pokemon
Pokemon #005 Charmeleon is a Fire Pokemon
Pokemon #006 Charizard is a Fire & Flying Pokemon
Pokemon #007 Squirtle is a Water Pokemon
Pokemon #008 Wartortle is a Water Pokemon
Pokemon #009 Blastoise is a Water Pokemon
Pokemon #010 Caterpie is a Bug Pokemon
Pokemon #011 Metapod is a Bug Pokemon
Pokemon #012 Butterfree is a Bug & Flying Pokemon
Pokemon #013 Weedle is a Bug & Poison Pokemon
Pokemon #014 Kakuna is a Bug & Poison Pokemon
Pokemon #015 Beedrill is a Bug & Poison Pokemon
Pokemon #016 Pidgey is a Normal & Flying Pokemon
Pokemon #017 Pidgeotto is a Normal & Flying Pokemon
Pokemon #018 Pidgeot is a Normal & Flying Pokemon
Pokemon #019 Rattata is a Normal Pokemon
Pokemon #019 Rattata is a Dark & Normal Pokemon
Pokemon #020 Raticate is a Normal Pokemon
Pokemon #020 Raticate is a Dark & Normal P
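The loop above relies on every second entry of `.contents` being a row. A variation that does not depend on the whitespace layout is to iterate over the `<tr>` tags directly and read the `<td>` cells by position. The sketch below does this on a small hand-written table. The table HTML and the function name `describe_rows` are hypothetical, and it is an assumption that single-type rows carry one type cell, as the `len(...) > 10` check above suggests.

```python
from bs4 import BeautifulSoup

# Hand-written table imitating the layout: Kdex, Ndex, sprite (<th>), name, type(s).
table_html = """<table>
<tr><th>Kdex</th><th>Ndex</th><th>MS</th><th>Pokémon</th><th colspan="2">Type</th></tr>
<tr><td>#001</td><td>#001</td><th></th><td>Bulbasaur</td><td>Grass</td><td>Poison</td></tr>
<tr><td>#004</td><td>#004</td><th></th><td>Charmander</td><td colspan="2">Fire</td></tr>
</table>"""


def describe_rows(table):
    """Return one descriptive sentence per data row of the table."""
    lines = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # The header row contains only <th> cells, so it has no <td> and is skipped.
        if len(cells) < 4:
            continue
        ndex = cells[1].get_text(strip=True)
        name = cells[2].get_text(strip=True)
        # Everything after the name is a type cell (one or two of them).
        types = [cell.get_text(strip=True) for cell in cells[3:]]
        lines.append(f"Pokemon {ndex} {name} is a {' & '.join(types)} Pokemon")
    return lines


sample_table = BeautifulSoup(table_html, "html.parser").table
for line in describe_rows(sample_table):
    print(line)
```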

### Save them as JSON

In [92]:
gen1_json = []

for i in range(info_start, len(gen1_list.contents), 2):
    poke_info=gen1_list.contents[i]
    # Fields that every row has
    entry = {
        "kdex": poke_info.contents[1].text.strip(),
        "ndex": poke_info.contents[3].text.strip(),
        "name": poke_info.contents[7].text.strip(),
        "type1": poke_info.contents[9].text.strip()
    }
    # Dual-type rows have an extra cell at index 11
    if len(poke_info.contents) > 10:
        entry["type2"] = poke_info.contents[11].text.strip()
    gen1_json.append(entry)

gen1_json

[{'kdex': '#001', 'ndex': '#001', 'name': 'Bulbasaur', 'type1': 'Grass', 'type2': 'Poison'}, {'kdex': '#002', 'ndex': '#002', 'name': 'Ivysaur', 'type1': 'Grass', 'type2': 'Poison'}, {'kdex': '#003', 'ndex': '#003', 'name': 'Venusaur', 'type1': 'Grass', 'type2': 'Poison'}, {'kdex': '#004', 'ndex': '#004', 'name': 'Charmander', 'type1': 'Fire'}, {'kdex': '#005', 'ndex': '#005', 'name': 'Charmeleon', 'type1': 'Fire'}, {'kdex': '#006', 'ndex': '#006', 'name': 'Charizard', 'type1': 'Fire', 'type2': 'Flying'}, {'kdex': '#007', 'ndex': '#007', 'name': 'Squirtle', 'type1': 'Water'}, {'kdex': '#008', 'ndex': '#008', 'name': 'Wartortle', 'type1': 'Water'}, {'kdex': '#009', 'ndex': '#009', 'name': 'Blastoise', 'type1': 'Water'}, {'kdex': '#010', 'ndex': '#010', 'name': 'Caterpie', 'type1': 'Bug'}, {'kdex': '#011', 'ndex': '#011', 'name': 'Metapod', 'type1': 'Bug'}, {'kdex': '#012', 'ndex': '#012', 'name': 'Butterfree', 'type1': 'Bug', 'type2': 'Flying'}, {'kdex': '#013', 'ndex': '#013', 'name': 
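The cell above builds a list of dictionaries in memory; writing it to disk takes one more step with the standard `json` module. The sketch below uses two records copied from the output above as stand-in data. The helper name `save_json` and the file name `gen1_pokemon.json` are hypothetical, and the temporary directory is used only so that the example does not clutter the working folder.

```python
import json
import tempfile
from pathlib import Path

# Two records copied from the output above, used as stand-in data.
sample_records = [
    {"kdex": "#001", "ndex": "#001", "name": "Bulbasaur", "type1": "Grass", "type2": "Poison"},
    {"kdex": "#004", "ndex": "#004", "name": "Charmander", "type1": "Fire"},
]


def save_json(records, path):
    """Write the records to path as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "é" readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)


# Hypothetical output location inside the system's temporary directory.
out_path = Path(tempfile.gettempdir()) / "gen1_pokemon.json"
save_json(sample_records, out_path)

# Read the file back to confirm that the round trip preserves the data.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

In the notebook itself, `save_json(gen1_json, "gen1_pokemon.json")` would write the full list next to the notebook file.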