# Lab | Web Scraping Multiple Pages

<br>

<details><summary>▶ Taking into account the following:</summary>
<p>


#### Business goal:

Done in previous lab

#### Instructions 

#### Prioritize the MVP

MVP finished in previous lab

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
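
Once several lists have been scraped, they need to be merged into a single pool without repeated songs. The snippet below is a minimal sketch of one way to do that with pandas. The two sample frames and the helper name `combine_song_lists` are illustrative assumptions, not part of the lab.

```python
import pandas as pd

# Hypothetical results of two separate scraping runs (sample rows only).
hot_songs = pd.DataFrame({
    "song": ["Get Back", "Sugar Sugar"],
    "artist": ["The Beatles", "Archies"],
})
decade_songs = pd.DataFrame({
    "song": ["Get Back ", "Suspicious Minds"],
    "artist": ["the beatles", "Elvis Presley"],
})


def combine_song_lists(frames):
    """Stack scraped frames and drop repeats, ignoring case and stray spaces."""
    pool = pd.concat(frames, ignore_index=True)
    # Normalised key used only to detect duplicates; the original text is kept.
    key = (
        pool["song"].str.strip().str.lower()
        + "|"
        + pool["artist"].str.strip().str.lower()
    )
    return pool[~key.duplicated()].reset_index(drop=True)


song_pool = combine_song_lists([hot_songs, decade_songs])
print(song_pool)
```

Comparing on a lowercased, stripped key keeps the first spelling of each song while still catching near-identical repeats from different sources.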

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers, stored in a pandas DataFrame: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
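
As a starting point for the first challenge, here is a minimal sketch of collecting the links from a page with BeautifulSoup. It runs on a small made-up HTML string rather than the live Wikipedia page, and the helper name `extract_links` is an assumption chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return an absolute URL for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # urljoin turns relative paths such as "/wiki/..." into full URLs.
        links.append(urljoin(base_url, a["href"]))
    return links


# Made-up HTML standing in for a downloaded page.
sample_html = """
<p>
  <a href="/wiki/Python_(programming_language)">Python (language)</a>
  <a href="https://www.python.org/">Official site</a>
  <a name="no-href">anchor without a link</a>
</p>
"""

links = extract_links(sample_html, "https://en.wikipedia.org/wiki/Python")
print(links)
```

For the real challenge, the HTML would come from something like `requests.get(url).text`, and the same function could then be applied unchanged.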

</p>
</details>


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from random import randint  # for random wait time respectful scraping
from time import sleep      # sleep function

In [2]:
# Send browser-like headers to avoid a 403 error: the server then treats the request as a regular user browsing
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

artist = []
song = []
years = range(1969, 2017)

for y in years:
    start_at = str(y)
    # Wait a random 1-4 seconds between requests to scrape respectfully
    wait_time = randint(1, 4)
    sleep(wait_time)
    url = "https://playback.fm/charts/top-100-songs/" + start_at
    response = requests.get(url, headers=header)
    soup = BeautifulSoup(response.content, "html.parser")
    # Artist names: links in the second column of the chart table
    for i in soup.select("td:nth-child(2) > a"):
        i = i.get_text()
        i = i.replace('\n', '')
        artist.append(i)
    # Song titles: spans with class "song" inside the "mobile-hide" cells
    for i in soup.select("td.mobile-hide > a > span.song"):
        i = i.get_text()
        song.append(i)
    print("processed:", url, "with wait time:", wait_time, "- songs:", len(song), "artists:", len(artist))
# Both lists must have the same length for the DataFrame to be built
top_100 = pd.DataFrame({"song": song, "artist": artist})
top_100.shape

processed: https://playback.fm/charts/top-100-songs/1969 with wait time: 1 - songs: 100 artists: 100
processed: https://playback.fm/charts/top-100-songs/1970 with wait time: 3 - songs: 200 artists: 200
processed: https://playback.fm/charts/top-100-songs/1971 with wait time: 1 - songs: 300 artists: 300
processed: https://playback.fm/charts/top-100-songs/1972 with wait time: 1 - songs: 400 artists: 400
processed: https://playback.fm/charts/top-100-songs/1973 with wait time: 1 - songs: 500 artists: 500
processed: https://playback.fm/charts/top-100-songs/1974 with wait time: 4 - songs: 600 artists: 600
processed: https://playback.fm/charts/top-100-songs/1975 with wait time: 3 - songs: 700 artists: 700
processed: https://playback.fm/charts/top-100-songs/1976 with wait time: 4 - songs: 800 artists: 800
processed: https://playback.fm/charts/top-100-songs/1977 with wait time: 4 - songs: 900 artists: 900
processed: https://playback.fm/charts/top-100-songs/1978 with wait time: 1 - songs: 1000 ar

(4797, 2)
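
The final shape is (4797, 2), three rows short of the 48 × 100 = 4800 one might expect from 48 yearly charts. Because the loop above fills `song` and `artist` from two independent selectors, a row missing one of the two elements could shift the lists against each other. The sketch below shows one possible guard: walk the table row by row and keep only rows where both selectors match. The sample HTML is an assumption about the page layout, built only to satisfy the selectors used above, and `parse_chart_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def parse_chart_rows(html):
    """Pair song and artist per table row, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr"):
        artist_tag = tr.select_one("td:nth-child(2) > a")
        song_tag = tr.select_one("td.mobile-hide > a > span.song")
        if artist_tag is None or song_tag is None:
            continue  # skip rows that lack either element
        rows.append({
            "song": song_tag.get_text(strip=True),
            "artist": artist_tag.get_text(strip=True),
        })
    return rows


# Assumed layout: rank, artist link, song link (third row has no artist link).
sample_html = """
<table>
  <tr><td>1</td><td><a href="#">The Beatles</a></td>
      <td class="mobile-hide"><a href="#"><span class="song">Get Back</span></a></td></tr>
  <tr><td>2</td><td><a href="#">Archies
      </a></td>
      <td class="mobile-hide"><a href="#"><span class="song">Sugar Sugar</span></a></td></tr>
  <tr><td>3</td><td></td>
      <td class="mobile-hide"><a href="#"><span class="song">Untitled</span></a></td></tr>
</table>
"""

rows = parse_chart_rows(sample_html)
print(rows)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(rows)`, so each song stays attached to the artist found in the same row.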

In [5]:
top_100.head(60)

| Unnamed: 0 | song | artist |
|---|---|---|
| 0 | Get Back | The Beatles |
| 1 | Sugar Sugar | Archies |
| 2 | In the Year 2525 (Exordium & Terminus) | Zager & Evans |
| 3 | Honky Tonk Woman | The Rolling Stones |
| 4 | Suspicious Minds | Elvis Presley |
| 5 | In the Ghetto | Elvis Presley |
| 6 | Come Together | The Beatles |
| 7 | The Ballad of John & Yoko | The Beatles |
| 8 | Oh Happy Day | Edwin Hawkins Singers |
| 9 | Crimson & Clover | Tommy James & the Shondells |


In [4]:
top_100[top_100.isna().any(axis=1)]   

Unnamed: 0,song,artist
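
An empty result here means no cell is NaN, but `isna()` does not flag empty strings, which is what a scraper tends to produce when a tag has no text. The following is a small sketch on made-up rows (not the scraped data) that checks for both cases.

```python
import pandas as pd

# Made-up sample: one missing value and one empty string.
sample = pd.DataFrame({
    "song": ["Get Back", None, ""],
    "artist": ["The Beatles", "Archies", "Elvis Presley"],
})

# Rows with at least one missing (NaN/None) value.
missing = sample[sample.isna().any(axis=1)]

# Rows with at least one empty-string value, which isna() does not catch.
blank = sample[sample.eq("").any(axis=1)]

print(missing)
print(blank)
```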
