# GNOD week 6

## LAB | Web Scraping Single Page (GNOD part 1)

- Check the case_study_gnod.md file.
- Make sure you've understood the big picture of your project:
    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
- Re-read the business case and the e-mail from the CTO.

**Instructions - Scraping popular songs** <br>
Your product will take a song as input from the user and output another song (the recommendation). In most cases the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will also enjoy a recommendation of another currently popular song.

You have to find data on the internet about currently popular songs. Popvortex maintains a weekly Top 100 of "hot" songs here: http://www.popvortex.com/music/charts/top-100-songs.php.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [1]:
# import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

import random
from time import sleep

In [2]:
# find url and store it in variable
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [3]:
# download html with GET req and check status code
response = requests.get(url)
response.status_code


200

In [4]:
# create the soup
soup = BeautifulSoup(response.content, "html.parser")
# soup

In [5]:
# check that everything is okay
# print(soup.prettify())

In [6]:
# retrieve desired info
# for song in soup.select("body > div.container > div:nth-child(4) > div.col-xs-12.col-md-8 > div.chart-wrapper > div.feed-item.music-chart.flex-row"):
#     print(song.cite.get_text(), song.em.get_text())

In [7]:
# init empty lists
songs = []
artists = []

# save copied selector into a var
path = "body > div.container > div:nth-child(4) > div.col-xs-12.col-md-8 > div.chart-wrapper > div.feed-item.music-chart.flex-row"

# grab the song title and artist from each chart entry and append them to their respective lists
for i in soup.select(path):
    songs.append(i.cite.get_text())
    artists.append(i.em.get_text())
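
The copied selector above depends on the exact nesting of the page (`div:nth-child(4)` and so on), so it may break when the layout changes. A minimal sketch of a shorter selector that skips incomplete entries is shown below. The sample HTML is hypothetical: it only mimics the class names from the selector above, and `extract_chart` is an illustrative helper, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the copied selector;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="chart-wrapper">
  <div class="feed-item music-chart flex-row">
    <cite class="title">Margaritaville</cite>
    <em class="artist">Jimmy Buffett</em>
  </div>
  <div class="feed-item music-chart flex-row">
    <cite class="title">Come Monday</cite>
    <em class="artist">Jimmy Buffett</em>
  </div>
  <div class="feed-item music-chart flex-row">
    <p>entry without title or artist</p>
  </div>
</div>
"""


def extract_chart(html):
    """Return a list of (song, artist) tuples from chart-like HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Match on the two stable class names instead of the full copied path
    for item in soup.select("div.chart-wrapper div.feed-item"):
        title = item.select_one("cite")
        artist = item.select_one("em")
        # Skip entries that lack a title or an artist instead of raising
        if title is None or artist is None:
            continue
        rows.append((title.get_text(strip=True), artist.get_text(strip=True)))
    return rows


chart = extract_chart(SAMPLE_HTML)
print(chart)
```

With the real page, `extract_chart(response.content)` would take the place of the loop, assuming the class names still match.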

In [40]:
# create the df with organised info
top_100 = pd.DataFrame({"song": songs,
                        "artist": artists})

top_100

| | song | artist |
|---|---|---|
| 0 | Margaritaville | Jimmy Buffett |
| 1 | Come Monday | Jimmy Buffett |
| 2 | Rich Men North of Richmond | Oliver Anthony Music |
| 3 | Cheeseburger In Paradise | Jimmy Buffett |
| 4 | Changes In Latitudes, Changes In Attitudes | Jimmy Buffett |
| ... | ... | ... |
| 95 | Can't Get Enough of You Baby | Smash Mouth |
| 96 | Spirit In the Sky | Norman Greenbaum |
| 97 | bad idea right? | Olivia Rodrigo |
| 98 | Whiteboyz | Tom MacDonald & Adam Calhoun |


## LAB | Web Scraping Multiple Pages

**Expand the project** <br>
If you're done, you can try to expand the project on your own. 
- Chosen option: expand by using Eurovision songs in the last 20 years.

In [9]:
# find url and store it in var
url = "https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_winners"

In [10]:
# download html with GET req and check status code
response = requests.get(url)
response.status_code

200

In [11]:
## make good soup
soup = BeautifulSoup(response.content, "html.parser")

In [12]:
euwinners = soup.select("table")[0]

In [13]:
# This will find all `a` tags under the second (index 1) `td` of its type and collect their hrefs
eusongs = []

for tag in euwinners.select("td:nth-of-type(2) a"):
    eusongs.append(tag["href"])
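
The collected `href` values are relative paths, and a table cell may also contain footnote anchors or image links. A minimal sketch of turning them into absolute, de-duplicated article URLs is shown below. `build_article_urls` and the sample paths are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


def build_article_urls(hrefs, base_url=BASE_URL):
    """Keep only article links, drop duplicates and return absolute URLs."""
    seen = set()
    urls = []
    for href in hrefs:
        # Footnotes (#cite_note-...) and external links are not article paths
        if not href.startswith("/wiki/"):
            continue
        # Image description pages live under /wiki/File:
        if href.startswith("/wiki/File:"):
            continue
        if href in seen:
            continue
        seen.add(href)
        urls.append(urljoin(base_url, href))
    return urls


# Made-up paths for illustration only
sample_hrefs = [
    "/wiki/Example_song",
    "#cite_note-1",
    "/wiki/File:Example.jpg",
    "/wiki/Example_song",
]
article_urls = build_article_urls(sample_hrefs)
print(article_urls)
```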

In [14]:
# # send request with full song link
# url = "https://en.wikipedia.org" + eusongs[0]
# response = requests.get(url)
# print(response.status_code)

# # make soup
# soup = BeautifulSoup(response.content, "html.parser")
# soup.select("table.infobox")


200


[<table class="infobox"><tbody><tr><th class="infobox-above" colspan="2" style="background: #BFDFFF;">... "Refrain"</th></tr>...

In [65]:
# # 2. find url and store it in a variable

# eusongs_soups = []

# # if it stops due to a long timeout error, use the index of the leftovers to fill it
# for song in eusongs:
#     # send request
#     url = "https://en.wikipedia.org" + song
#     response = requests.get(url)
#     print(song, response.status_code)

#     # parse & store html
#     soup = BeautifulSoup(response.content, "html.parser")
#     eusongs_soups.append(soup.select("table.infobox"))

#     # respectful nap:
#     wait_time = random.randint(1,4)
#     print("I will sleep for " + str(wait_time) + " second/s.")
#     sleep(wait_time)
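
The loop above has to be restarted by hand if a request times out. One possible way to make it resumable is sketched below, assuming the download step is passed in as a function. `scrape_all`, `fake_fetch` and the sample paths are hypothetical names used for illustration. The demo uses a fake download and a no-op pause so that it runs instantly and offline.

```python
import random
from time import sleep


def scrape_all(paths, fetch, done=None, min_wait=1, max_wait=4, pause=sleep):
    """Fetch every path that is not in `done` yet; return (done, failed)."""
    done = {} if done is None else done
    failed = []
    for path in paths:
        if path in done:
            continue  # already scraped in an earlier run
        try:
            done[path] = fetch(path)
        except Exception as err:  # e.g. a timeout: note it and carry on
            print(f"{path} failed: {err}")
            failed.append(path)
        # respectful nap between requests
        pause(random.randint(min_wait, max_wait))
    return done, failed


def fake_fetch(path):
    """Stand-in for the real download; fails for one made-up path."""
    if path == "/wiki/B":
        raise TimeoutError("timed out")
    return "<html>" + path + "</html>"


def no_pause(seconds):
    """Skip the nap in the demo."""
    return None


done, failed = scrape_all(["/wiki/A", "/wiki/B", "/wiki/C"], fake_fetch, pause=no_pause)
print(sorted(done), failed)

# Second pass: only the leftovers are requested again
done, failed_again = scrape_all(failed, lambda path: "ok", done=done, pause=no_pause)
print(sorted(done), failed_again)
```

In the notebook, `fetch` could wrap `requests.get` and the `BeautifulSoup` parsing, and `pause` could stay at its default `sleep`.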

In [16]:
### get song, country and artist from infocards ready for extraction
## standard approach:   
# eusongs_soups[5][0].find("th", string = "Country").parent.select("a")[0].get_text()
# eusongs_soups[5][0].find("th", string = "Artist(s)").parent.select("a")[0].get_text()
# eusongs_soups[5][0].find("th", string = "Artist(s)").parent.select("div")[0].get_text()

# ## if main card is extended and some details are not in standard order - usually in hits:
# eusongs_soups[57][0].select("th.infobox-above")[0].get_text()                           # single title
# eusongs_soups[57][0].select("th.infobox-header")[0].select("a")[1].get_text()           # artist name

# ## if details are in 2nd infocard:
# eusongs_soups[57][1].find("th", string = "Country").parent.select("a")[0].get_text()    # country in 2nd card

# ## if they have pseudonyms or groups and individuals are both shown:
# eusongs_soups[0][0].find("th", string = "As").parent.select("a")[0].get_text()          # group name after members list


In [41]:
# extract song name, country and artists
eusongs = []
euartists = []
eucountries = []

for song in eusongs_soups:
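    # song title: header cell of the first infobox
    try:
        eusongs.append(song[0].select("th.infobox-above")[0].get_text())
    except (AttributeError, IndexError):
        eusongs.append("NA")
    # artist: standard "Artist(s)" row first, then the fallbacks for non-standard infoboxes
    try:
        euartists.append(song[0].find("th", string = "Artist(s)").parent.select("a")[0].get_text())
    except (AttributeError, IndexError):
        try:
            euartists.append(song[0].select("th.infobox-header")[0].select("a")[1].get_text())
        except (AttributeError, IndexError):
            try:
                euartists.append(song[0].find("th", string = "As").parent.select("a")[0].get_text())
            except (AttributeError, IndexError):
                try:
                    euartists.append(song[0].find("th", string = "Artist(s)").parent.select("div")[0].get_text())
                except (AttributeError, IndexError):
                    euartists.append("NA")
    # country: first infobox, otherwise the second one
    try:
        eucountries.append(song[0].find("th", string = "Country").parent.select("a")[0].get_text())
    except (AttributeError, IndexError):
        try:
            eucountries.append(song[1].find("th", string = "Country").parent.select("a")[0].get_text())
        except (AttributeError, IndexError):
            eucountries.append("NA")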

euwinners_df = pd.DataFrame({"song":eusongs,
                             "artist":euartists,
                             "country":eucountries})

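The nested `try`/`except` fallbacks in the extraction loop could be flattened with a small helper that tries each lookup in order. This is only a sketch: `first_match` is a hypothetical name, and the lambdas below stand in for the real infobox lookups.

```python
def first_match(extractors, default="NA"):
    """Return the result of the first extractor that does not fail."""
    for extract in extractors:
        try:
            return extract()
        except (AttributeError, IndexError):
            # a missing row gives None (AttributeError),
            # an empty selection gives IndexError
            continue
    return default


missing = None   # what find() returns when nothing matches
empty = []       # what select() returns when nothing matches

artist = first_match([
    lambda: missing.parent,   # raises AttributeError
    lambda: empty[0],         # raises IndexError
    lambda: "Lys Assia",      # first lookup that succeeds
])
country = first_match([lambda: empty[0]])
print(artist, country)
```

In the loop, each lambda would wrap one of the existing lookups, for example `lambda: song[0].find("th", string="Country").parent.select("a")[0].get_text()`.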

In [None]:
# drop the leftover covid row (index 67) and the country column so the columns match top_100 before concatenating
euwinners_df = euwinners_df.drop(index = 67, columns = 'country').reset_index(drop=True)

In [None]:
# remove quotation marks and leading/trailing spaces from the Eurovision song titles for consistency
# (language-specific special characters that affect search are not handled here)
euwinners_df['song'] = euwinners_df.song.str.replace('"', '', regex=False)
euwinners_df['song'] = euwinners_df.song.replace(r'^ | $', '', regex=True)
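
The comment above mentions language-specific special characters that affect search. One possible way to handle them is to compare normalized titles instead of the raw strings. The sketch below uses only the standard library, and `normalize_title` is a hypothetical helper, not part of the original notebook.

```python
import unicodedata


def normalize_title(text):
    """Strip quotes, outer whitespace, diacritics and case differences."""
    text = text.replace('"', "").strip()
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() meant for comparisons
    return without_marks.casefold()


print(normalize_title('"Zitti e buoni" '))
print(normalize_title("Måneskin"))
```

Letters that have no decomposition (for example `ø`) are left unchanged, so this is a partial fix only. On a dataframe, the helper could be applied with `euwinners_df.song.map(normalize_title)` and stored in a separate search column, so the display titles keep their original spelling.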

In [47]:
# concat the two dfs
hot_songs = pd.concat([top_100, euwinners_df]).reset_index(drop=True)
hot_songs

| | song | artist |
|---|---|---|
| 0 | Margaritaville | Jimmy Buffett |
| 1 | Come Monday | Jimmy Buffett |
| 2 | Rich Men North of Richmond | Oliver Anthony Music |
| 3 | Cheeseburger In Paradise | Jimmy Buffett |
| 4 | Changes In Latitudes, Changes In Attitudes | Jimmy Buffett |
| ... | ... | ... |
| 165 | Toy | Netta |
| 166 | Arcade | Duncan Laurence |
| 167 | Zitti e buoni | Måneskin |
| 168 | Stefania | Kalush Orchestra |


## Song recommendation

In [64]:
## user input to search song
search = input("Want a recommendation? Enter a song title: ")
# exact, case-sensitive match against the song column
print("Your song is in our database:", search in hot_songs.song.values)

Your song is in our database: True
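
The check above only reports whether the input is in the list. A minimal sketch of the next step described in the instructions (recommending another currently popular song when the input is a hot song) is shown below. `recommend_hot_song` is a hypothetical helper, and `HOT_SONGS` is a small hand-copied sample of the rows shown above, not the full dataframe.

```python
import random

# Small sample taken from the hot_songs output above
HOT_SONGS = [
    ("Margaritaville", "Jimmy Buffett"),
    ("Come Monday", "Jimmy Buffett"),
    ("Toy", "Netta"),
    ("Arcade", "Duncan Laurence"),
]


def recommend_hot_song(query, hot_songs, rng=random):
    """Return another (song, artist) pair if `query` is a hot song, else None."""
    wanted = query.strip().casefold()
    # Case-insensitive match on the song title
    if not any(song.casefold() == wanted for song, _ in hot_songs):
        return None
    others = [pair for pair in hot_songs if pair[0].casefold() != wanted]
    if not others:
        return None
    return rng.choice(others)


rec = recommend_hot_song("  toy ", HOT_SONGS)
print("Recommendation:", rec)
print("Unknown song:", recommend_hot_song("Not a hot song", HOT_SONGS))
```

With the dataframe from this notebook, the pairs could be built with `list(zip(hot_songs.song, hot_songs.artist))`.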
