# Lab | Web Scraping Single Page

## Instructions - Scraping popular songs

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation of another song that is also popular at the moment.

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [310]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [311]:
link = "https://www.billboard.com/charts/hot-100/"

In [312]:
# code adapted from Cormac and Isi
def request_link(link):
    try:
        request = requests.get(link)
        request.raise_for_status()  # raises an HTTPError if the response is not OK
        print("Success! Response code", request.status_code)
    except requests.exceptions.HTTPError as err:
        if request.status_code == 404:
            print("404: Oops, sorry we can't find that page!")
        else:
            print("The error is", err.args[0])  # first argument of the HTTPError, i.e. its message
    return request

In [313]:
billboard100 = request_link(link)

Success! Response code 200
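
`request_link` only handles HTTP status errors; a timeout, a DNS failure or a malformed URL would still raise. A minimal sketch of a stricter variant follows, assuming it is acceptable to return `None` on any failure. The name `fetch` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch(url, timeout=10):
    """Return the Response on success, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as err:
        # RequestException is the base class of HTTPError, ConnectionError,
        # Timeout and the invalid-URL errors raised by requests
        print("Request failed:", err)
        return None
    return response
```

Callers can then check `if response is not None:` before parsing, instead of receiving a failed response object.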


In [314]:
content = billboard100.content

In [315]:
# parse html for readability
soup = BeautifulSoup(content, "html.parser")

In [316]:
# creating empty lists where info will be stored
artist_name = []
song_title = []
current_rank = []
last_week_rank = []

In [317]:
# getting song titles and current ranking
for a in soup.find_all('div', attrs={"class" : "o-chart-results-list-row-container"}):
    song_title.append(a.h3.get_text(strip=True))
    current_rank.append(a.span.get_text(strip=True))


In [318]:
# checking results
print(song_title[:10], len(song_title))
print(current_rank[:10], len(current_rank))

['Last Night', 'Fast Car', 'Calm Down', 'Flowers', 'All My Life', 'Favorite Song', 'Karma', 'Kill Bill', "Creepin'", 'Ella Baila Sola'] 100
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'] 100


In [319]:
# another empty list to store the artist names and last week's rankings together.
# I tried several ways to select only the artist name; of everything I tried, only this selector yielded the complete list
artists = []

In [320]:
# code to scrape artist names and last week's ranking
for a in soup.find_all('li', attrs={'class' : "lrv-u-width-100p"}):
    artists.append(a.select('span')[0].get_text(strip=True))

In [321]:
len(artists)

200

In [322]:
artists[:20]

['Morgan Wallen',
 '1',
 'Luke Combs',
 '3',
 'Rema & Selena Gomez',
 '4',
 'Miley Cyrus',
 '2',
 'Lil Durk Featuring J. Cole',
 '5',
 'Toosii',
 '6',
 'Taylor Swift Featuring Ice Spice',
 '9',
 'SZA',
 '7',
 'Metro Boomin, The Weeknd & 21 Savage',
 '8',
 'Eslabon Armado X Peso Pluma',
 '10']
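
One way to avoid the interleaved list is to read the title, rank and artist from the same row container. The sketch below runs on a hypothetical, heavily simplified snippet: the real Billboard markup is more deeply nested and may differ, so the assumption that the artist sits in the first `<span>` after the `<h3>` title needs checking against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only for illustration
html = """
<div class="o-chart-results-list-row-container">
  <span>1</span>
  <h3>Last Night</h3>
  <span>Morgan Wallen</span>
</div>
<div class="o-chart-results-list-row-container">
  <span>2</span>
  <h3>Fast Car</h3>
  <span>Luke Combs</span>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows = []
for row in soup_demo.find_all("div", class_="o-chart-results-list-row-container"):
    title_tag = row.h3
    # find_next returns the first matching tag after title_tag in document order
    artist_tag = title_tag.find_next("span")
    rows.append({
        "current_rank": row.span.get_text(strip=True),
        "song_title": title_tag.get_text(strip=True),
        "artist_name": artist_tag.get_text(strip=True),
    })
```

Because every field comes from the same row, the lists cannot drift out of alignment if one row is missing a value.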

In [323]:
# function to separate artist names and last week's rankings
def extract_artist_name(items):
    artist_list = []
    last_week_rank = []

    # items alternate: artist name at even positions, last week's rank at odd positions
    for i in range(len(items)):
        if i % 2 == 0:
            artist_list.append(items[i])
        else:
            last_week_rank.append(items[i])

    return artist_list, last_week_rank

In [324]:
artist_name, last_week_rank = extract_artist_name(artists)

In [325]:
len(artist_name), len(last_week_rank)

(100, 100)
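
The same even/odd split can also be written with list slicing. This is a short sketch of that alternative; `split_alternating` is a hypothetical helper name, and the sample values are taken from the output above.

```python
def split_alternating(items):
    """Split a list that alternates between two kinds of values."""
    # items[0::2] takes positions 0, 2, 4, ...; items[1::2] takes 1, 3, 5, ...
    return items[0::2], items[1::2]


sample = ["Morgan Wallen", "1", "Luke Combs", "3", "Rema & Selena Gomez", "4"]
names, ranks = split_alternating(sample)
```

Slicing returns new lists and leaves the input untouched, so it can replace the explicit loop without changing the result.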

In [326]:
df = pd.DataFrame([current_rank, song_title, artist_name, last_week_rank], index=["current_rank", "song_title", "artist_name", "last_week_rank"]).T
df

Unnamed: 0,current_rank,song_title,artist_name,last_week_rank
0,1,Last Night,Morgan Wallen,1
1,2,Fast Car,Luke Combs,3
2,3,Calm Down,Rema & Selena Gomez,4
3,4,Flowers,Miley Cyrus,2
4,5,All My Life,Lil Durk Featuring J. Cole,5
...,...,...,...,...
95,96,"Angel, Pt. 1","Kodak Black, NLE Choppa, Jimin, JVKE & Muni Long",-
96,97,Girl In Mine,Parmalee,-
97,98,Moonlight,Kali Uchis,90
98,99,Classy 101,Feid x Young Miko,-


In [327]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   current_rank    100 non-null    object
 1   song_title      100 non-null    object
 2   artist_name     100 non-null    object
 3   last_week_rank  100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB


> Both rank columns have dtype object, so they need to be converted to a numeric type.

In [328]:
df.current_rank = df.current_rank.astype(int)

In [329]:
# df.last_week_rank.astype(int) -- received an error so there must be a value that is not a number

In [330]:
df.last_week_rank.unique() # as expected, we have "-" for songs that were not in the top 100 the previous week

array(['1', '3', '4', '2', '5', '6', '9', '7', '8', '10', '11', '12',
       '15', '14', '13', '-', '19', '39', '16', '21', '18', '22', '25',
       '24', '20', '23', '26', '28', '17', '31', '30', '32', '34', '42',
       '27', '36', '29', '33', '41', '38', '37', '49', '35', '40', '45',
       '44', '50', '81', '52', '53', '46', '56', '60', '47', '64', '54',
       '55', '43', '57', '58', '59', '51', '65', '61', '62', '67', '74',
       '63', '66', '69', '71', '72', '76', '73', '68', '96', '83', '85',
       '70', '89', '77', '78', '82', '80', '79', '90'], dtype=object)

In [331]:
# replace "-" with "0" (0 marks songs that were not in the top 100 the previous week)
df.last_week_rank = df.last_week_rank.replace('-', "0")

In [332]:
# transform column to int
df.last_week_rank = df.last_week_rank.astype(int)

In [333]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   current_rank    100 non-null    int64 
 1   song_title      100 non-null    object
 2   artist_name     100 non-null    object
 3   last_week_rank  100 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 3.3+ KB
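
Replacing "-" with 0 makes the column convertible, but 0 then reads like a rank better than 1. An alternative, sketched here on a small stand-in Series rather than the scraped dataframe, is to keep those entries as missing values with `pd.to_numeric` and pandas' nullable integer dtype.

```python
import pandas as pd

# Stand-in values; "-" marks a song that was not on last week's chart
ranks = pd.Series(["1", "3", "-", "90"], name="last_week_rank")

# errors="coerce" turns anything unparseable into NaN,
# and "Int64" (capital I) is the nullable integer dtype that can hold <NA>
numeric = pd.to_numeric(ranks, errors="coerce").astype("Int64")
```

Which representation is preferable depends on how the column is used later: 0 is easier to filter on, while a missing value cannot be mistaken for a real rank.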


In [334]:
df

Unnamed: 0,current_rank,song_title,artist_name,last_week_rank
0,1,Last Night,Morgan Wallen,1
1,2,Fast Car,Luke Combs,3
2,3,Calm Down,Rema & Selena Gomez,4
3,4,Flowers,Miley Cyrus,2
4,5,All My Life,Lil Durk Featuring J. Cole,5
...,...,...,...,...
95,96,"Angel, Pt. 1","Kodak Black, NLE Choppa, Jimin, JVKE & Muni Long",0
96,97,Girl In Mine,Parmalee,0
97,98,Moonlight,Kali Uchis,90
98,99,Classy 101,Feid x Young Miko,0


# Lab | Web Scraping Multiple Pages
## Expand the project

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

## Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

In [335]:
spotify_link = "https://kworb.net/spotify/country/global_daily_totals.html"

In [336]:
spotify = request_link(spotify_link)

Success! Response code 200


In [337]:
spotify_content = spotify.content

In [338]:
# parse html for readability
spotify = BeautifulSoup(spotify_content, "html.parser")

In [339]:
def spotify_list(content):    
    song_title = []
    artist_name = []
    
    for i in content.find_all('td', attrs = {"class" : "text mp"}): 
        song_title.append(i.select("a")[1].get_text(strip=True))
        artist_name.append(i.select("a")[0].get_text(strip=True))
    
    dct = {'song_title': song_title, 'artist_name': artist_name}
    
    return dct

In [340]:
dict = spotify_list(spotify)
dict

{'song_title': ['Blinding Lights',
  'Shape of You',
  'Someone You Loved',
  'Sunflower - Spider-Man: Into the Spider-Verse',
  'Stay',
  'Believer',
  'Dance Monkey',
  'Perfect',
  'Heat Waves',
  'As It Was',
  'Watermelon Sugar',
  'lovely',
  "Don't Start Now",
  'rockstar',
  'Señorita',
  "Say You Won't Let Go",
  'bad guy',
  'One Dance',
  'Circles',
  'Lucid Dreams',
  'Closer',
  'Starboy',
  'Shallow',
  'INDUSTRY BABY',
  'Sweater Weather',
  'good 4 u',
  'drivers license',
  'Levitating',
  'Cold Heart - PNAU Remix',
  '7 rings',
  'SAD!',
  'Bohemian Rhapsody - Remastered 2011',
  'Jocelyn Flores',
  'MONTERO (Call Me By Your Name)',
  'Save Your Tears',
  'goosebumps',
  'SICKO MODE',
  "God's Plan",
  'Something Just Like This',
  'Me Porto Bonito',
  'Dynamite',
  'Roses - Imanbek Remix',
  'Mood',
  'Memories',
  'Bad Habits',
  'Easy On Me',
  'Another Love',
  'DÁKITI',
  'Kiss Me More',
  'Happier',
  'Quevedo: Bzrp Music Sessions, Vol. 52',
  'Havana',
  'Think

In [341]:
df_spotify = pd.DataFrame(dict)
df_spotify

Unnamed: 0,song_title,artist_name
0,Blinding Lights,The Weeknd
1,Shape of You,Ed Sheeran
2,Someone You Loved,Lewis Capaldi
3,Sunflower - Spider-Man: Into the Spider-Verse,Post Malone
4,Stay,The Kid LAROI
...,...,...
9319,Adrenalina,Wisin
9320,Work,Iggy Azalea
9321,Främling,Orup
9322,Nina,Ed Sheeran


In [342]:
rank = range(1, len(df_spotify)+1)

In [343]:
df_spotify.insert(0, "rank", rank)

In [344]:
df_spotify

Unnamed: 0,rank,song_title,artist_name
0,1,Blinding Lights,The Weeknd
1,2,Shape of You,Ed Sheeran
2,3,Someone You Loved,Lewis Capaldi
3,4,Sunflower - Spider-Man: Into the Spider-Verse,Post Malone
4,5,Stay,The Kid LAROI
...,...,...,...
9319,9320,Adrenalina,Wisin
9320,9321,Work,Iggy Azalea
9321,9322,Främling,Orup
9322,9323,Nina,Ed Sheeran
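
`spotify_list` indexes `select("a")[0]` and `select("a")[1]` directly, so a cell with fewer than two links would raise an `IndexError`. The sketch below shows a more defensive variant on hypothetical markup; the structure of the real kworb.net cells is assumed from the selectors used above, and `parse_cells` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: the second one has only one link and is skipped
html = """
<table>
  <tr><td class="text mp"><a href="#">The Weeknd</a> - <a href="#">Blinding Lights</a></td></tr>
  <tr><td class="text mp"><a href="#">Unknown</a></td></tr>
  <tr><td class="text mp"><a href="#">Ed Sheeran</a> - <a href="#">Shape of You</a></td></tr>
</table>
"""


def parse_cells(soup):
    """Collect artist/title pairs, skipping cells without two links."""
    records = []
    for cell in soup.find_all("td", attrs={"class": "text mp"}):
        links = cell.select("a")
        if len(links) < 2:
            continue  # not enough links to tell artist and title apart
        records.append({
            "song_title": links[1].get_text(strip=True),
            "artist_name": links[0].get_text(strip=True),
        })
    return records


records = parse_cells(BeautifulSoup(html, "html.parser"))
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`.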


### Multiple Pages

In [345]:
# page numbers to scrape (1 to 5)
pages = range(1,6)

In [346]:
# function to separate artist names and song titles and create a dictionary
def extract(items):
    artist_list = []
    song_title = []

    # items alternate: artist name at even positions, song title at odd positions
    for i in range(len(items)):
        if i % 2 == 0:
            artist_list.append(items[i])
        else:
            song_title.append(items[i])

    dct = {'artist': artist_list, 'song_name': song_title}

    return dct

In [347]:
# dataframe where all results will be stored
df_npr = pd.DataFrame()

# iterate over the pages to scrape
for page in pages:
    # the f-string inserts the page number into the link
    r = requests.get(f"https://www.npr.org/2022/12/15/1135802083/100-best-songs-2022-page-{page}")
    npr_soup = BeautifulSoup(r.content, 'html.parser')
    headings = []  # empty list to store the results of the query below

    # query the page to get artist names and song titles (the last <h3> is skipped)
    for i in npr_soup.find_all("h3")[0:-1]:
        headings.append(i.get_text())

    # new dictionary using the previous function to separate artist names and song titles
    dict_from_list = extract(headings)
    # transform dictionary to dataframe
    new_df = pd.DataFrame.from_dict(dict_from_list)
    # append this page's results to the combined dataframe
    df_npr = pd.concat([df_npr, new_df])

In [348]:
df_npr

Unnamed: 0,artist,song_name
0,Little Simz,"""Gorilla"""
1,Ian William Craig,"""Attention For It Radiates"""
2,Viking Ding Dong x Ravi B,"""Leave It Alone (Remix)"""
3,Adeem the Artist,"""Middle of a Heart"""
4,"Zahsosaa, D STURDY & DJ Crazy","""Shakedhat"""
...,...,...
15,NewJeans,"""Hype Boy"""
16,Joyce,"""Feminina"""
17,Ayra Starr,"""Rush"""
18,Disclosure feat. RAYE,"""Waterfall"""
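
In the output above the index restarts on every page, because each page's dataframe keeps its own index. A common pattern, sketched here with stand-in frames instead of live requests, is to build the page URLs up front, collect one dataframe per page in a list, and concatenate once with `ignore_index=True`. The URL pattern is the one used in the loop; the two small frames are placeholders for scraped pages.

```python
import pandas as pd

BASE_URL = "https://www.npr.org/2022/12/15/1135802083/100-best-songs-2022-page-{page}"

# One URL per page number, 1 to 5
urls = [BASE_URL.format(page=page) for page in range(1, 6)]

# Placeholder frames standing in for the result of scraping two pages
frames = [
    pd.DataFrame({"artist": ["Little Simz"], "song_name": ['"Gorilla"']}),
    pd.DataFrame({"artist": ["NewJeans"], "song_name": ['"Hype Boy"']}),
]

# A single concat at the end; ignore_index=True renumbers rows from 0
combined = pd.concat(frames, ignore_index=True)
```

Concatenating once also avoids copying the growing dataframe on every iteration.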
