## LAB | Web scraping multiple pages (pt 2)

**Practice web scraping. This lab is not related to the GNOD project of the week.** <br>
Choose at least 3 sites to scrape:
- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display info on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles, in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers in a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
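
Every task in the list follows the same pattern: download the HTML, parse it, then pick out elements with a CSS selector. The sketch below wraps the parse-and-select step in a small helper. The name `select_text` and the inline sample page are made up for illustration and are not part of the lab; the sample HTML only exists so the snippet runs without network access.

```python
from bs4 import BeautifulSoup


def select_text(html, css_selector):
    """Return the stripped text of every element matching css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(css_selector)]


# Tiny made-up page so the sketch can run offline.
sample_html = """
<div id="main">
  <ul>
    <li><h3><a href="/a">First item</a></h3></li>
    <li><h3><a href="/b">Second item</a></h3></li>
  </ul>
</div>
"""

items = select_text(sample_html, "#main > ul > li > h3 > a")
print(items)  # ['First item', 'Second item']
```

With a live page, the same helper could presumably be fed `requests.get(url).text` together with a selector copied from the browser's developer tools.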

##### 1. Top 10 FBI's most wanted

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
# save url in var 
url = "https://www.fbi.gov/wanted/topten"

# download html with get req
response = requests.get(url)
response.status_code

200

In [3]:
# make good soup
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
# extract the names using the CSS selector copied from the browser's dev tools
most_wanted = []

for person in soup.select("#query-results-0f737222c5054a81a120bce207b0446a > ul > li > h3 > a"):
    most_wanted.append(person.get_text())
    
most_wanted

['ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'DONALD EUGENE FIELDS II',
 'ALEXIS FLORES',
 'ARNOLDO JIMENEZ',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'WILVER VILLEGAS-PALOMINO',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']
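
The selector above depends on a long generated `id` (`query-results-0f73...`), which may change whenever the page is rebuilt. One possible way to make it less brittle is to match only the stable prefix of the `id` with a CSS attribute selector. This is a sketch on made-up inline HTML that mimics the structure implied by the copied selector; whether the live page keeps the `query-results-` prefix is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up HTML mirroring the structure implied by the copied selector.
sample_html = """
<div id="query-results-0f737222c5054a81a120bce207b0446a">
  <ul>
    <li><h3><a href="#">PERSON ONE</a></h3></li>
    <li><h3><a href="#">PERSON TWO</a></h3></li>
  </ul>
</div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

# [id^="..."] matches any element whose id starts with the given prefix.
names = [
    a.get_text(strip=True)
    for a in soup_sample.select('[id^="query-results-"] > ul > li > h3 > a')
]
print(names)  # ['PERSON ONE', 'PERSON TWO']
```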

##### 2. All datasets available in `data.gov.uk`

In [5]:
# save url in var 
url = "https://data.gov.uk/"

# download html with get req
response = requests.get(url)
response.status_code

200

In [6]:
# make good soup
soup = BeautifulSoup(response.content, "html.parser")

In [7]:
# extract list from copied path
dataset_list = []

for dset in soup.select("#main-content > div:nth-child(3) > div > ul > li > h3 > a"):
    dataset_list.append(dset.get_text())

dataset_list

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

##### 3. Links on the Wikipedia page for "Python"

In [8]:
# save url in var 
url = "https://en.wikipedia.org/wiki/Python"

# download html with get req
response = requests.get(url)
response.status_code

200

In [9]:
# make good soup
soup = BeautifulSoup(response.content, "html.parser")

In [10]:
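# extract the links using the CSS selector copied from the browser's dev tools
links = []

for link in soup.select("#mw-content-text > div.mw-parser-output > ul > li > a"):
    # use a separate name so the page url defined above is not overwritten
    full_url = "https://en.wikipedia.org" + link.get("href")
    links.append(full_url)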
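
The loop above builds absolute URLs by prepending the site root to each `href`. That works for root-relative links such as `/wiki/...`, but would produce broken URLs for `href` values that are already absolute or that are only a fragment. A hedged alternative is `urllib.parse.urljoin` from the standard library; the `hrefs` list below is illustrative sample data, not scraped output.

```python
from urllib.parse import urljoin

# URL of the page the links were found on.
base_url = "https://en.wikipedia.org/wiki/Python"

# Illustrative href values: root-relative, already absolute, fragment-only.
hrefs = ["/wiki/Python_(genus)", "https://example.org/page", "#See_also"]

# urljoin resolves each href against the page URL, as a browser would.
absolute = [urljoin(base_url, href) for href in hrefs]

for link_url in absolute:
    print(link_url)
# https://en.wikipedia.org/wiki/Python_(genus)
# https://example.org/page
# https://en.wikipedia.org/wiki/Python#See_also
```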

In [11]:
# get the missing links
soup.select("#mw-content-text > div.mw-parser-output > ul > li > ul > li > a")

[<a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>,
 <a href="/wiki/Python_(Monty)_Pictures" title="Python (Monty) Pictures">Python (Monty) Pictures</a>]

In [12]:
for link in soup.select("#mw-content-text > div.mw-parser-output > ul > li > ul > li > a"):
    # use a separate name so the page url defined above is not overwritten
    full_url = "https://en.wikipedia.org" + link.get("href")
    links.append(full_url)

In [13]:
a = soup.select("#mw-content-text > div.mw-parser-output > ul > li > i > a")
a[0].get("href")

'/wiki/Pyton'
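
The three selectors used in this section (`li > a`, `li > ul > li > a` and `li > i > a`) can also be combined into one comma-separated selector group, so that a single `select` call collects every link. This is a sketch on made-up HTML that imitates the nesting of the page. Note two differences from the cells above: the results come back in document order rather than grouped by selector, and every `li > i > a` match is included, not only the first one.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the three nesting patterns found on the page.
sample_html = """
<div id="mw-content-text"><div class="mw-parser-output">
<ul>
  <li><a href="/wiki/A">A</a>
    <ul><li><a href="/wiki/B">B</a></li></ul>
  </li>
  <li><i><a href="/wiki/C">C</a></i></li>
</ul>
</div></div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

prefix = "#mw-content-text > div.mw-parser-output > ul > li"
# A comma separates alternative selectors; an element matching any one is returned.
combined = f"{prefix} > a, {prefix} > ul > li > a, {prefix} > i > a"

all_hrefs = [a.get("href") for a in soup_sample.select(combined)]
print(all_hrefs)  # ['/wiki/A', '/wiki/B', '/wiki/C']
```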

In [14]:
links.append("https://en.wikipedia.org" + a[0].get("href"))
links

['https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia.org/wiki/Python_(automobile_maker)',
 'https://en.wikipedia.org/wiki/Python_(Ford_prototype)',
 'https://en.wikipedia.org/wiki/Python_(missile)',
 'https://en.wikipedia.org/wiki/Python_(nuclear_primary)',
 'https://en.wikipedia.org/wiki/Colt_Python',
 'https://en.wikiped