# GNOD week 6

## LAB | Web Scraping Single Page (GNOD part 1)

- Check the case_study_gnod.md file.
- Make sure you've understood the big picture of your project:
    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
- Re-read the business case and the e-mail from the CTO.

**Instructions - Scraping popular songs** <br>
Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the song the user entered. However, the CTO thinks that if the input song is currently on the top charts, the user will also enjoy a recommendation of another song that is popular at the moment.
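
The rule described above can be sketched in a few lines. This is only an illustration: `recommend`, `hot_songs`, and `find_similar` are hypothetical names, and the similarity step is left as a placeholder callable because this lab does not define it.

```python
import random


def recommend(song, hot_songs, find_similar, rng=random):
    """Return a recommendation for `song` following the CTO's rule.

    hot_songs    -- list of titles currently on the chart (assumed input)
    find_similar -- callable used when the song is not hot (placeholder)
    """
    wanted = song.strip().casefold()
    hot_lookup = {title.strip().casefold(): title for title in hot_songs}

    if wanted in hot_lookup:
        # The song is on the chart: suggest a different hot song, if one exists.
        others = [title for key, title in hot_lookup.items() if key != wanted]
        if others:
            return rng.choice(others)
    # Not on the chart (or nothing else to offer): fall back to similarity.
    return find_similar(song)
```

Passing `rng` as a parameter is a design choice for testability; in the notebook the default `random` module would be enough.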

You have to find data on the internet about currently popular songs. Popvortex maintains a weekly Top 100 of "hot" songs here: http://www.popvortex.com/music/charts/top-100-songs.php.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [112]:
# import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

import random
from time import sleep

In [5]:
# find url and store it in variable
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [6]:
# download the HTML with a GET request and check the status code
response = requests.get(url)
response.status_code


200
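
Checking the status code by eye works in a notebook, but a script may prefer to fail loudly. The following is a minimal sketch, assuming the `requests` library already imported in this lab; `fetch_html` and its `get` parameter are hypothetical names added here so the function can be tried without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download `url` and return the raw HTML bytes, raising on HTTP errors."""
    # A timeout stops the call from hanging forever on a slow server.
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of continuing.
    response.raise_for_status()
    return response.content
```

In the notebook, `BeautifulSoup(fetch_html(url), "html.parser")` would then replace the separate request and status check.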

In [7]:
# create the soup
soup = BeautifulSoup(response.content, "html.parser")
# soup

In [2]:
# check that everything is okay
# print(soup.prettify())

In [1]:
# retrieve desired info
# for song in soup.select("body > div.container > div:nth-child(4) > div.col-xs-12.col-md-8 > div.chart-wrapper > div.feed-item.music-chart.flex-row"):
#     print(song.cite.get_text(), song.em.get_text())

In [24]:
# init empty lists
songs = []
artists = []

# save copied selector into a var
path = "body > div.container > div:nth-child(4) > div.col-xs-12.col-md-8 > div.chart-wrapper > div.feed-item.music-chart.flex-row"

# grab the title and artist of each entry and append them to their respective lists
for i in soup.select(path):
    songs.append(i.cite.get_text())
    artists.append(i.em.get_text())

print(songs)
print(artists)

['Margaritaville', 'Come Monday', 'Rich Men North of Richmond', 'Cheeseburger In Paradise', 'Changes In Latitudes, Changes In Attitudes', 'A Pirate Looks at Forty', "It's Five O'Clock Somewhere (Live)", 'Paint The Town Red', 'Last Time I Saw You', 'I Remember Everything (feat. Kacey Musgraves)', 'Son of a Son of a Sailor', 'Fast Car', 'Lil Boo Thang', 'Used To Be Young', 'I Want to go Home', 'Fins', 'Cruel Summer', 'Last Night', 'Need A Favor', 'Southern Cross (Live)', 'Try That In A Small Town', 'Keep Going Up', 'White Horse', 'Brown Eyed Girl (Live)', "Why Don't We Get Drunk", 'Dance The Night', "It's Five O'Clock Somewhere (with Jimmy Buffett)", '90 some Chevy', 'Save Me (with Lainey Wilson)', 'Aint Gotta Dollar', 'Watermelon Moonshine', 'Indiana Jones', 'Volcano', 'Rockstar', 'Flowers', 'Thinkin’ Bout Me', 'Religiously', 'Lose Control', 'Dreams', 'Used To Be Young', 'Southern Cross', 'Calm Down', 'vampire', 'Trip Around the Sun', 'How You Remind Me', 'Son Of A Sinner', 'Beat You Th
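
The copied selector encodes the whole page layout (`div:nth-child(4)`, grid classes), so it may break if the site moves a wrapper. A shorter selector anchored on the chart classes is one possible alternative. The HTML below is a made-up fragment that only mimics the class names from that selector, `parse_chart` is a hypothetical helper, and the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment: class names come from the copied selector, the rest is assumed.
sample_html = """
<div class="chart-wrapper">
  <div class="feed-item music-chart flex-row">
    <cite>Margaritaville</cite> <em>Jimmy Buffett</em>
  </div>
  <div class="feed-item music-chart flex-row">
    <p>entry without title or artist</p>
  </div>
  <div class="feed-item music-chart flex-row">
    <cite>Rich Men North of Richmond</cite> <em>Oliver Anthony Music</em>
  </div>
</div>
"""


def parse_chart(html):
    """Return a list of (title, artist) tuples found in the chart markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.chart-wrapper div.feed-item.music-chart"):
        title, artist = item.find("cite"), item.find("em")
        if title is None or artist is None:
            continue  # skip malformed entries instead of raising AttributeError
        rows.append((title.get_text(strip=True), artist.get_text(strip=True)))
    return rows


rows = parse_chart(sample_html)
```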

In [25]:
# create the df with organised info
top_100 = pd.DataFrame({"song_title": songs,
                        "artist": artists})

top_100

|     | song_title | artist |
|-----|------------|--------|
| 0   | Margaritaville | Jimmy Buffett |
| 1   | Come Monday | Jimmy Buffett |
| 2   | Rich Men North of Richmond | Oliver Anthony Music |
| 3   | Cheeseburger In Paradise | Jimmy Buffett |
| 4   | Changes In Latitudes, Changes In Attitudes | Jimmy Buffett |
| ... | ... | ... |
| 95  | Bring Me to Life | Evanescence |
| 96  | Demons | Doja Cat |
| 97  | Edge of Seventeen | Stevie Nicks |
| 98  | Bad Moon Rising | Creedence Clearwater Revival |
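
Once the chart is in a dataframe, the product needs to decide whether a user's song is on it. One way to do that lookup is sketched below, assuming the `song_title` column from this lab; `is_hot` is a hypothetical helper, and the three-row frame is a stand-in built from rows shown in the output.

```python
import pandas as pd

# Small stand-in for the scraped frame; the real one has about 100 rows.
top_100 = pd.DataFrame({
    "song_title": ["Margaritaville", "Come Monday", "Bring Me to Life"],
    "artist": ["Jimmy Buffett", "Jimmy Buffett", "Evanescence"],
})


def is_hot(title, chart):
    """True if `title` matches a chart entry, ignoring case and outer spaces."""
    normalised = chart["song_title"].str.strip().str.casefold()
    return bool((normalised == title.strip().casefold()).any())
```

Exact matching after normalisation is the simplest option; it will not catch typos or suffixes such as "(Live)".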


## LAB | Web Scraping Multiple Pages

**Expand the project** <br>
If you're done, you can try to expand the project on your own. 
- Chosen option: expand by using Eurovision songs in the last 20 years.

In [99]:
# find url and store it in var
url = "https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_winners"

In [100]:
# download the HTML with a GET request
response = requests.get(url)
response.status_code

200

In [101]:
# make good soup
soup = BeautifulSoup(response.content, "html.parser")

In [260]:
euwinners = soup.select("table")[0]

In [85]:
eudata = []
for song in euwinners.select("tr td a"):
    link = song.get("href")
    print(link)
    if link is not None:
        # keep only /wiki links whose URL mentions neither Eurovision
        # nor a role such as composer, singer, musician or lyricist
        if ("/wiki" in link
                and "Eurovision" not in link
                and "composer" not in link
                and "singer" not in link
                and "musician" not in link
                and "lyricist" not in link):
            eudata.append(link)

/wiki/Switzerland_in_the_Eurovision_Song_Contest
/wiki/Refrain_(Lys_Assia_song)
/wiki/Lys_Assia
/wiki/G%C3%A9o_Voumard
/wiki/%C3%89mile_Gardaz
#cite_note-10
/wiki/Netherlands_in_the_Eurovision_Song_Contest
/wiki/Net_als_toen
/wiki/Corry_Brokken
/wiki/Willy_van_Hemert
#cite_note-11
/wiki/France_in_the_Eurovision_Song_Contest
/wiki/Dors,_mon_amour
/wiki/Andr%C3%A9_Claveau
/wiki/Hubert_Giraud_(composer)
/wiki/Pierre_Delano%C3%AB
#cite_note-12
/wiki/Netherlands_in_the_Eurovision_Song_Contest
/wiki/%27n_Beetje
/wiki/Teddy_Scholten
#cite_note-13
/wiki/France_in_the_Eurovision_Song_Contest
/wiki/Tom_Pillibi
/wiki/Jacqueline_Boyer
/wiki/Andr%C3%A9_Popp
/wiki/Pierre_Cour
#cite_note-14
/wiki/Luxembourg_in_the_Eurovision_Song_Contest
/wiki/Nous_les_amoureux
/wiki/Jean-Claude_Pascal
/wiki/Jacques_Datin
#cite_note-15
/wiki/France_in_the_Eurovision_Song_Contest
/wiki/Un_premier_amour
/wiki/Isabelle_Aubret
#cite_note-16
/wiki/Denmark_in_the_Eurovision_Song_Contest
/wiki/Dansevise
/wiki/Grethe_and_J%C

In [105]:
# This finds all `a` tags inside the second `td` of each row (`:nth-of-type()` counts from 1)
eusongs = []

for tag in euwinners.select("td:nth-of-type(2) a"):
    eusongs.append(tag["href"])
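
Because `:nth-of-type()` counts from 1, `td:nth-of-type(2)` selects the second cell of each row. The sketch below shows this on a made-up two-row table; the column order (country, song, artist) is an assumption for illustration and is not copied from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up table: three linked cells per row, in an assumed column order.
html = """
<table>
  <tr><td><a href="/wiki/Country_A">Country A</a></td>
      <td><a href="/wiki/Song_A">Song A</a></td>
      <td><a href="/wiki/Artist_A">Artist A</a></td></tr>
  <tr><td><a href="/wiki/Country_B">Country B</a></td>
      <td><a href="/wiki/Song_B">Song B</a></td>
      <td><a href="/wiki/Artist_B">Artist B</a></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table

# Links inside the first and the second <td> of every row.
first_cells = [a["href"] for a in table.select("td:nth-of-type(1) a")]
second_cells = [a["href"] for a in table.select("td:nth-of-type(2) a")]
```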

In [109]:
eusongs

['/wiki/Refrain_(Lys_Assia_song)',
 '/wiki/Net_als_toen',
 '/wiki/Dors,_mon_amour',
 '/wiki/%27n_Beetje',
 '/wiki/Tom_Pillibi',
 '/wiki/Nous_les_amoureux',
 '/wiki/Un_premier_amour',
 '/wiki/Dansevise',
 '/wiki/Non_ho_l%27et%C3%A0',
 '/wiki/Poup%C3%A9e_de_cire,_poup%C3%A9e_de_son',
 '/wiki/Merci,_Ch%C3%A9rie',
 '/wiki/Puppet_on_a_String_(Sandie_Shaw_song)',
 '/wiki/La_La_La_(Massiel_song)',
 '/wiki/Vivo_cantando',
 '/wiki/Boom_Bang-a-Bang',
 '/wiki/De_troubadour',
 '/wiki/Un_jour,_un_enfant',
 '/wiki/All_Kinds_of_Everything',
 '/wiki/Un_banc,_un_arbre,_une_rue',
 '/wiki/Apr%C3%A8s_toi',
 '/wiki/Tu_te_reconna%C3%AEtras',
 '/wiki/Waterloo_(ABBA_song)',
 '/wiki/Ding-a-dong',
 '/wiki/Save_Your_Kisses_for_Me',
 '/wiki/L%27Oiseau_et_l%27Enfant',
 '/wiki/A-Ba-Ni-Bi',
 '/wiki/Hallelujah_(Milk_and_Honey_song)',
 '/wiki/What%27s_Another_Year',
 '/wiki/Making_Your_Mind_Up',
 '/wiki/Ein_bi%C3%9Fchen_Frieden',
 '/wiki/Si_la_vie_est_cadeau',
 '/wiki/Diggi-Loo_Diggi-Ley',
 '/wiki/La_det_swinge',
 '/w
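
The links above are relative and percent-encoded (`%27` is an apostrophe, `%C3%A0` is `à`). A small sketch using only the standard library shows how they could be turned into full URLs and readable titles; `describe_link` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import unquote, urljoin

BASE_URL = "https://en.wikipedia.org"


def describe_link(href):
    """Return (absolute_url, readable_title) for a relative /wiki/ link."""
    absolute = urljoin(BASE_URL, href)
    # Keep the last path segment, decode %XX escapes, turn "_" into spaces.
    title = unquote(href.rsplit("/", 1)[-1]).replace("_", " ")
    return absolute, title
```

The readable title still carries Wikipedia disambiguation suffixes such as "(ABBA song)", so it is only a rough label.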

In [111]:
# send request with full song link
url = "https://en.wikipedia.org" + eusongs[0]
response = requests.get(url)
print(response.status_code)

# make soup
soup = BeautifulSoup(response.content, "html.parser")
soup.select("table.infobox")


200


[<table class="infobox"><tbody><tr><th class="infobox-above" colspan="2" style="background: #BFDFFF;"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Switzerland" title="Switzerland"><img alt="Switzerland" class="mw-file-element" data-file-height="512" data-file-width="512" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/16px-Flag_of_Switzerland_%28Pantone%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/24px-Flag_of_Switzerland_%28Pantone%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/32px-Flag_of_Switzerland_%28Pantone%29.svg.png 2x" width="16"/></a></span></span> "Refrain"</th></tr><tr><td class="infobox-image" colspan="2"><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Lys_Assia_-_Refrain.jpg"><img class="mw-file-element" data-fi

In [114]:
# request every song page and store its infobox table(s)

eusongs_soups = []

# if the loop stops because of a timeout error, resume from the index of the remaining links
for song in eusongs:
    # send request
    url = "https://en.wikipedia.org" + song
    response = requests.get(url)
    print(song, response.status_code)

    # parse & store html
    soup = BeautifulSoup(response.content, "html.parser")
    eusongs_soups.append(soup.select("table.infobox"))

    # respectful nap:
    wait_time = random.randint(1,4)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

/wiki/Refrain_(Lys_Assia_song) 200
I will sleep for 1 second/s.
/wiki/Net_als_toen 200
I will sleep for 2 second/s.
/wiki/Dors,_mon_amour 200
I will sleep for 3 second/s.
/wiki/%27n_Beetje 200
I will sleep for 2 second/s.
/wiki/Tom_Pillibi 200
I will sleep for 1 second/s.
/wiki/Nous_les_amoureux 200
I will sleep for 2 second/s.
/wiki/Un_premier_amour 200
I will sleep for 2 second/s.
/wiki/Dansevise 200
I will sleep for 3 second/s.
/wiki/Non_ho_l%27et%C3%A0 200
I will sleep for 3 second/s.
/wiki/Poup%C3%A9e_de_cire,_poup%C3%A9e_de_son 200
I will sleep for 4 second/s.
/wiki/Merci,_Ch%C3%A9rie 200
I will sleep for 1 second/s.
/wiki/Puppet_on_a_String_(Sandie_Shaw_song) 200
I will sleep for 3 second/s.
/wiki/La_La_La_(Massiel_song) 200
I will sleep for 3 second/s.
/wiki/Vivo_cantando 200
I will sleep for 3 second/s.
/wiki/Boom_Bang-a-Bang 200
I will sleep for 3 second/s.
/wiki/De_troubadour 200
I will sleep for 2 second/s.
/wiki/Un_jour,_un_enfant 200
I will sleep for 2 second/s.
/wiki/All
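
If the loop stops halfway, for example on a timeout, it helps to be able to rerun it without fetching the same pages again. The sketch below is one way to do that under stated assumptions: `scrape_all`, `fetch`, and `sleeper` are hypothetical names, and `fetch` stands for whatever function downloads and parses one page.

```python
import random
from time import sleep


def scrape_all(links, fetch, results=None, pause=(1, 4), sleeper=sleep):
    """Fetch every link not already in `results`; return (results, failed)."""
    results = {} if results is None else results
    failed = []
    for link in links:
        if link in results:
            continue  # already fetched in an earlier run
        try:
            results[link] = fetch(link)
        except Exception as error:  # broad on purpose: record it and move on
            print(f"{link} failed: {error}")
            failed.append(link)
        sleeper(random.randint(*pause))  # respectful nap between requests
    return results, failed
```

Calling the function a second time with the returned `results` retries only the links that are still missing.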

In [248]:
### get song, country and artist from infocards ready for extraction
## standard approach:   
# eusongs_soups[5][0].find("th", string = "Country").parent.select("a")[0].get_text()
# eusongs_soups[5][0].find("th", string = "Artist(s)").parent.select("a")[0].get_text()
# eusongs_soups[5][0].find("th", string = "Artist(s)").parent.select("div")[0].get_text()

# ## if main card is extended and some details are not in standard order - usually in hits:
# eusongs_soups[57][0].select("th.infobox-above")[0].get_text()                           # single title
# eusongs_soups[57][0].select("th.infobox-header")[0].select("a")[1].get_text()           # artist name

# ## if details are in 2nd infocard:
# eusongs_soups[57][1].find("th", string = "Country").parent.select("a")[0].get_text()    # country in 2nd card

# ## if they have pseudonyms or groups and individuals are both shown:
# eusongs_soups[0][0].find("th", string = "As").parent.select("a")[0].get_text()          # group name after members list
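
The lookups listed above are alternatives to try in order, which can also be expressed with a small helper instead of nested `try` blocks. This is a sketch, not the notebook's method: `first_text` and `infobox_link_text` are hypothetical helpers, and the infobox HTML is a made-up fragment that mimics the "Country" and "Artist(s)" rows.

```python
from bs4 import BeautifulSoup


def first_text(candidates, default="NA"):
    """Return the result of the first extractor that works, else `default`."""
    for extract in candidates:
        try:
            return extract()
        except (AttributeError, IndexError):
            continue  # tag missing in this layout; try the next extractor
    return default


def infobox_link_text(box, label):
    """Text of the first link in the infobox row whose header equals `label`."""
    return box.find("th", string=label).parent.select("a")[0].get_text()


# Made-up infobox with the two rows the notes above rely on.
html = """
<table class="infobox">
  <tr><th class="infobox-above">"Refrain"</th></tr>
  <tr><th>Country</th><td><a href="/wiki/Switzerland">Switzerland</a></td></tr>
  <tr><th>Artist(s)</th><td><a href="/wiki/Lys_Assia">Lys Assia</a></td></tr>
</table>
"""
boxes = BeautifulSoup(html, "html.parser").select("table.infobox")

country = first_text([
    lambda: infobox_link_text(boxes[0], "Country"),
    lambda: infobox_link_text(boxes[1], "Country"),  # second infobox, if any
])
artist = first_text([lambda: infobox_link_text(boxes[0], "Artist(s)")])
composer = first_text([lambda: infobox_link_text(boxes[0], "Composer(s)")])
```

Catching only `AttributeError` and `IndexError` covers a missing header (`find` returns `None`) and a missing link or infobox, while letting unrelated errors surface.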


In [266]:
# extract song name, country and artists
# note: eusongs is reused here and now holds song titles instead of links
eusongs = []
euartists = []
eucountries = []

# a missing tag shows up as AttributeError (find returned None) or IndexError (empty selection)
missing = (AttributeError, IndexError)

for song in eusongs_soups:
    try:
        eusongs.append(song[0].select("th.infobox-above")[0].get_text())
    except missing:
        eusongs.append("NA")
    try:
        euartists.append(song[0].find("th", string="Artist(s)").parent.select("a")[0].get_text())
    except missing:
        try:
            euartists.append(song[0].select("th.infobox-header")[0].select("a")[1].get_text())
        except missing:
            try:
                euartists.append(song[0].find("th", string="As").parent.select("a")[0].get_text())
            except missing:
                try:
                    euartists.append(song[0].find("th", string="Artist(s)").parent.select("div")[0].get_text())
                except missing:
                    euartists.append("NA")
    try:
        eucountries.append(song[0].find("th", string="Country").parent.select("a")[0].get_text())
    except missing:
        try:
            eucountries.append(song[1].find("th", string="Country").parent.select("a")[0].get_text())
        except missing:
            eucountries.append("NA")

euwinners_df = pd.DataFrame({"Song": eusongs,
                             "Artist(s)": euartists,
                             "Country": eucountries})


In [269]:
# drop the leftover row for the 2020 contest, which was cancelled because of COVID-19
euwinners_df = euwinners_df.drop([euwinners_df.index[67]]).reset_index(drop=True)
euwinners_df



|     | Song | Artist(s) | Country |
|-----|------|-----------|---------|
| 0   | "Refrain" | Lys Assia | Switzerland |
| 1   | "Net als toen" | Corry Brokken | Netherlands |
| 2   | "Dors, mon amour" | André Claveau | France |
| 3   | "'n Beetje" | Teddy Scholten | Netherlands |
| 4   | "Tom Pillibi" | Jacqueline Boyer | France |
| ... | ... | ... | ... |
| 65  | "Toy" | Netta | Israel |
| 66  | "Arcade" | Duncan Laurence | Netherlands |
| 67  | "Zitti e buoni" | Måneskin | Italy |
| 68  | "Stefania" | Kalush Orchestra | Ukraine |
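
The song titles keep the literal double quotes from the Wikipedia infobox, which would get in the way when matching user input. A possible cleanup step is sketched below on three rows copied from the output; the column names are the ones used in this lab, and applying it to the full frame is left as an assumption.

```python
import pandas as pd

# Three rows copied from the output above, quotes included.
euwinners_df = pd.DataFrame({
    "Song": ['"Refrain"', '"Net als toen"', '"Toy"'],
    "Artist(s)": ["Lys Assia", "Corry Brokken", "Netta"],
    "Country": ["Switzerland", "Netherlands", "Israel"],
})

# Remove surrounding whitespace first, then the literal double quotes.
euwinners_df["Song"] = euwinners_df["Song"].str.strip().str.strip('"')
```

Only double quotes are stripped, so a title such as `'n Beetje` keeps its leading apostrophe.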
