## Expand the project

If you have finished the main project, try expanding it on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: a bigger pool of songs gives you more data to work with.
- Apply the same logic to other "groups" of songs: the best songs from a decade, or from a country, culture, language, or genre.
- Use Wikipedia as a source: it maintains a large collection of lists of songs at https://en.wikipedia.org/wiki/Lists_of_songs
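
Once songs come from several sources, the same song is likely to appear more than once, often with small differences in case or spacing. The sketch below shows one possible way to merge pools while dropping such duplicates. The helper names `normalize` and `merge_song_pools` are hypothetical, and the second pool is made-up sample data rather than a real scrape.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return ' '.join(text.casefold().split())


def merge_song_pools(*pools):
    """Merge lists of song dicts, keeping the first copy of each (artist, title)."""
    seen = set()
    merged = []
    for pool in pools:
        for song in pool:
            key = (normalize(song['artist']), normalize(song['title']))
            if key not in seen:
                seen.add(key)
                merged.append(song)
    return merged


# Two small pools; the second one is hypothetical sample data
pool_a = [
    {'artist': 'Dolly Parton', 'title': 'Jolene'},
    {'artist': 'Neil Young', 'title': 'After the Gold Rush'},
]
pool_b = [
    {'artist': 'dolly parton', 'title': 'JOLENE '},
    {'artist': 'Nina Simone', 'title': 'Sinnerman'},
]

all_songs = merge_song_pools(pool_a, pool_b)
print(len(all_songs))
```

Matching on a normalized key is deliberately simple: it will not catch alternative spellings or "feat." variations, so treat it as a starting point.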

In [1]:
# https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

In [3]:
bittermelodies = requests.get('https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/')
print('Bitter Melodies: ', bittermelodies.status_code)

Bitter Melodies:  200


In [4]:
bittermelodies.headers

{'Server': 'nginx', 'Date': 'Sun, 05 Mar 2023 15:12:42 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000', 'Vary': 'Accept-Encoding, Cookie', 'X-hacker': "If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.", 'Host-Header': 'WordPress.com', 'Link': '<https://wp.me/p89gm7-3zC>; rel=shortlink', 'Content-Encoding': 'br', 'X-ac': '3.cdg _dca MISS'}

In [5]:
# Find number of pages
soup = BeautifulSoup(bittermelodies.content, 'html.parser')
soup.find_all('a', attrs={'class':'post-page-numbers'})

[<a class="post-page-numbers" href="https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/2/"><span>2</span></a>,
 <a class="post-page-numbers" href="https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/3/"><span>3</span></a>,
 <a class="post-page-numbers" href="https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/4/"><span>4</span></a>,
 <a class="post-page-numbers" href="https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/5/"><span>5</span></a>]
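
Instead of hard-coding the number of pages, the page numbers can be derived from the pagination links. The following is a minimal sketch that assumes the links end in `/<number>/`, as in the output above, and that the page currently being viewed is page 1 and has no link to itself. The function name `page_numbers` is hypothetical. In the notebook, the `hrefs` list could be built with `[a['href'] for a in soup.find_all('a', attrs={'class': 'post-page-numbers'})]`; here it is reconstructed by hand so the example runs without a network request.

```python
import re


def page_numbers(hrefs, first_page=1):
    """Return the sorted page numbers implied by a list of pagination URLs."""
    pages = {first_page}  # assumption: the current page is not linked
    for href in hrefs:
        # Pagination URLs end in '/<number>/'
        match = re.search(r'/(\d+)/?$', href)
        if match:
            pages.add(int(match.group(1)))
    return sorted(pages)


base = 'https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/'
hrefs = [f'{base}{n}/' for n in range(2, 6)]  # same links as in the output above

print(page_numbers(hrefs))  # [1, 2, 3, 4, 5]
```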

In [6]:
# Links to pages 2-5 plus the current page: 5 pages, 50 songs per page
starts = range(1,6)
list(starts)

[1, 2, 3, 4, 5]

In [7]:
# get the list of songs from the first page
songs = [song.get_text() for song in soup.find_all('strong')]
songs

['250. Angel Olsen – “Shut Up Kiss Me” (2016)',
 '249. Dolly Parton – “Jolene” (1973)',
 '248. Neil Young – “After the Gold Rush” (1970)',
 '247. The Orb – “Little Fluffy Clouds” (1990)',
 '246. Daryl Hall & John Oates – “I Can’t Go for That (No Can Do)” (1981)',
 '245. Toots & the Maytals – “Pressure Drop” (1969)',
 '244. John Lee Hooker – “Boogie Chillen” (1948)',
 '243. Louis Armstrong – “What a Wonderful World” (1967)',
 '242. Arcade Fire – “Neighborhood #1 (Tunnels)” (2004)',
 '241. Eric B. & Rakim – “Paid in Full” (1987)',
 '240. Built to Spill – “Car” (1994)',
 '239. Frank Ocean – “Thinkin Bout You” (2012)',
 '238. Del Shannon – “Runaway” (1961)',
 '237. Laurie Anderson – “O Superman” (1981)',
 '236. Lil Wayne – “A Milli” (2008)',
 '235. The Smiths – “There Is a Light That Never Goes Out” (1986)',
 '234. Public Enemy – “Bring the Noise” (1988)',
 '233. Destiny’s Child – “Say My Name” (1999)',
 '232. The Righteous Brothers – “You’ve Lost That Lovin’ Feeling” (1964)',
 '231. The C

In [8]:
# extracting one song
text = soup.find('strong').get_text()
text

'250. Angel Olsen – “Shut Up Kiss Me” (2016)'

In [9]:
# extracting information from text
# groups: (1) rank, (2) artist, (3) title between curly quotes, (4) year
match = re.search(r'^(\d+)\. ([^–]+) – “([^”]+)” \((\d+)\)$', text)
result = [match.group(1), match.group(2).strip(), match.group(3).strip(), match.group(4)]
result

['250', 'Angel Olsen', 'Shut Up Kiss Me', '2016']
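
A variation on the same idea uses named groups, which makes the result self-describing and gives one place to handle lines that do not look like a song entry. This is only a sketch: `SONG_PATTERN` and `parse_song` are hypothetical names, and the pattern assumes every entry follows the `rank. artist – “title” (year)` layout seen in the list above, with an en dash, curly quotes, and a four-digit year.

```python
import re

# Assumed layout: 250. Artist – “Title” (2016)
SONG_PATTERN = re.compile(
    r'^(?P<order>\d+)\.\s+(?P<artist>.+?)\s+–\s+“(?P<title>.+)”\s+\((?P<year>\d{4})\)$'
)


def parse_song(text):
    """Return a dict with order, artist, title and year, or None if no match."""
    match = SONG_PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        'order': int(match['order']),
        'artist': match['artist'],
        'title': match['title'],
        'year': int(match['year']),
    }


print(parse_song('250. Angel Olsen – “Shut Up Kiss Me” (2016)'))
print(parse_song('Share this:'))  # not a song entry, so None
```

Returning `None` for non-matching text lets the caller decide whether to skip the line or log it, rather than mixing placeholder strings into numeric columns.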

In [18]:
def get_song_info(songs):
    """Parse 'rank. artist – “title” (year)' strings into a dict of columns."""
    pattern = re.compile(r'^(\d+)\. ([^–]+) – “([^”]+)” \((\d+)\)$')
    orders, artists, titles, years = [], [], [], []

    for song in songs:
        match = pattern.search(song)
        if match is None:
            # Skip <strong> text that is not a song entry: a string placeholder
            # in 'order' would make sort_values('order') fail on mixed types.
            continue

        orders.append(int(match.group(1)))
        artists.append(match.group(2).strip())
        titles.append(match.group(3).strip())
        years.append(int(match.group(4)))  # store the year as a number

    return {'order': orders, 'artist': artists, 'title': titles, 'year': years}

In [19]:
frames = []

for start in starts:
    # Page n of the post lives at <post URL>/<n>
    r = requests.get(f'https://bittermelodies.com/2021/02/14/1000-greatest-songs-of-all-time-part-4-250-1/{start}')
    soup = BeautifulSoup(r.content, 'html.parser')
    songs = [song.get_text() for song in soup.find_all('strong')]
    frames.append(pd.DataFrame.from_dict(get_song_info(songs)))

# Concatenate once and sort once instead of re-sorting on every iteration
df = pd.concat(frames).sort_values('order')

df

|     | order | artist | title | year |
|----:|------:|--------|-------|-----:|
| 49 | 1 | The Beatles | A Day in the Life | 1967 |
| 48 | 2 | Nina Simone | Sinnerman | 1965 |
| 47 | 3 | The Ronettes | Be My Baby | 1963 |
| 46 | 4 | The Beach Boys | God Only Knows | 1966 |
| 45 | 5 | Billie Holiday | Strange Fruit | 1939 |
| ... | ... | ... | ... | ... |
| 4 | 246 | Daryl Hall & John Oates | I Can’t Go for That (No Can Do) | 1981 |
| 3 | 247 | The Orb | Little Fluffy Clouds | 1990 |
| 2 | 248 | Neil Young | After the Gold Rush | 1970 |
| 1 | 249 | Dolly Parton | Jolene | 1973 |
