# **Scenario**

You have been hired as a Data Analyst for "Gnod".

"Gnod" is a site that provides recommendations for music, art, literature, and products based on collaborative filtering algorithms. Their flagship product is the music recommender, which you can try at www.gnoosic.com. The site asks users to input 3 bands they like and computes similarity scores against the rest of the users. It then recommends bands that users with similar tastes have picked.

"Gnod" is a small company, and its only revenue stream so far is ads on the site. In the future, they would like to explore partnership options with music apps (such as Deezer, Soundcloud or even Apple Music and Spotify). However, for that to be possible, they need to expand and improve their recommendations.

That's precisely where you come in. They have hired you as a Data Analyst, and they expect you to bring a mix of technical expertise and business mindset to the table.


**The goal of the company (Gnod)**: Explore partnership options with music apps (Deezer, Soundcloud, Apple Music, Spotify, etc.)

**Their current product (Gnoosic)**: Music Recommender (asks users to input 3 bands they like, and computes similarity scores with the rest of the users. Then, they recommend to the user bands that users with similar tastes have picked).

**How your project fits into this context**: Expand and improve music recommendations. Enhance song recommendations (not only bands).

### **Instructions - Scraping popular songs**

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song will have to be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular at the moment.
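
The CTO's rule can be sketched as a small lookup. This is a minimal illustration, not Gnod's implementation: `recommend_hot` and the three-title sample list are hypothetical, and matching is assumed to be a case-insensitive comparison on the title only.

```python
import random

def recommend_hot(user_song, hot_songs, rng=random):
    """Return another hot song if user_song is on the chart, else None.

    Hypothetical helper: titles are compared case-insensitively.
    """
    wanted = user_song.strip().lower()
    titles = [title.strip().lower() for title in hot_songs]
    if wanted not in titles:
        return None  # not on the chart: fall back to a similarity-based recommendation
    candidates = [s for s in hot_songs if s.strip().lower() != wanted]
    return rng.choice(candidates) if candidates else None

hot = ["Last Night", "Flowers", "Fast Car"]  # sample titles from the chart scraped further down
print(recommend_hot("flowers", hot))    # one of the other two titles
print(recommend_hot("Yesterday", hot))  # None
```

Returning `None` for songs that are not on the chart leaves room for the similarity-based recommender to handle those cases.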

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

## Billboard Hot 100

### 1. Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [21]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [22]:
## URL of the Billboard Hot 100 chart
url = "https://www.billboard.com/charts/hot-100/"

In [23]:
## Getting the html code of the web page
response = requests.get(url)
response # 200 status code means OK!

<Response [200]>

In [24]:
## Parsing the html code
soup = BeautifulSoup(response.content, "html.parser")
#soup
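
The request above returned a `200`, but that is not guaranteed on every run: a request can time out, and some sites answer non-browser clients with an error status. A slightly more defensive fetch might look like the sketch below. `fetch_html`, the header value, and the 10-second timeout are assumptions for illustration, and the `get` parameter exists only so the function can be exercised without network access.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Fetch a page and return its raw bytes, raising on HTTP errors."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; gnod-lab-scraper)"}  # illustrative value
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content

# Usage (needs network access):
# html = fetch_html("https://www.billboard.com/charts/hot-100/")
# soup = BeautifulSoup(html, "html.parser")
```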

In [25]:
#creating 2 empty lists
songs=[]
artists=[]

In [26]:
# First song (the rest of the songs use different html code!)
first_song=soup.find('h3').get_text(strip=True)
songs.append(first_song)

In [27]:
# rest of the songs
song_titles=soup.find_all("h3", attrs={"class": "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only"})
song_titles

[<h3 class="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only" id="title-of-a-story">
 
 	
 	
 		
 					Flowers		
 	
 </h3>,
 <h3 class="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only" id="title-of-a-story">
 
 	
 	
 		
 					Fast Car		
 	
 </h3>,
 <h3 class="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only" id="title-of-a-story">
 
 	
 	
 		
 					Calm Down		
 	
 </h3>,
 <h3 class="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tabl

In [29]:
for title in song_titles:
    
    title_ = title.get_text(strip=True)
    #print(title_)
    songs.append(title_)

songs

['Last Night',
 'Flowers',
 'Fast Car',
 'Calm Down',
 'All My Life',
 'Favorite Song',
 'Kill Bill',
 "Creepin'",
 'Karma',
 'Ella Baila Sola',
 'Sure Thing',
 'Anti-Hero',
 'Die For You',
 'Something In The Orange',
 'Snooze',
 'La Bebe',
 'Where She Goes',
 'Un x100to',
 'Need A Favor',
 'Search & Rescue',
 'You Proof',
 "Thinkin' Bout Me",
 'Chemical',
 'Cupid',
 'Rock And A Hard Place',
 'Eyes Closed',
 "Boy's A Liar, Pt. 2",
 'Next Thing You Know',
 'Put It On Da Floor Again',
 'Thought You Should Know',
 "I'm Good (Blue)",
 'Dance The Night',
 'Area Codes',
 "Dancin' In The Country",
 'One Thing At A Time',
 'Memory Lane',
 'Bzrp Music Sessions, Vol. 55',
 'Tennessee Orange',
 'Cruel Summer',
 'TQM',
 'Stand By Me',
 'Religiously',
 'Dial Drunk',
 'Under The Influence',
 'Players',
 'Calling',
 'Annihilate',
 'Take Two',
 'Love You Anyway',
 'Thank God',
 'Am I Dreaming',
 'Princess Diana',
 'Bye',
 'Self Love',
 'It Matters To Her',
 'Daylight',
 'PRC',
 'Por Las Noches',
 'Mou

In [30]:
len(songs)

100
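
Matching on the full class string is fragile: when `attrs={"class": ...}` is given a string with several classes, the match breaks as soon as the site adds, removes, or reorders a single class. The output above shows that each title `<h3>` also carries `id="title-of-a-story"`, so a shorter filter may be enough. The sketch below runs on a small hand-written HTML sample rather than the live page. Whether the first chart entry, and only chart entries, carry this `id` on the real page is an assumption to verify.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the class names here are placeholders
sample_html = """
<ul>
  <li><h3 class="c-title a-font-primary-bold-l" id="title-of-a-story"> Last Night </h3></li>
  <li><h3 class="c-title a-font-primary-bold-s" id="title-of-a-story"> Flowers </h3></li>
  <li><h3 class="c-title a-font-primary-bold-s" id="title-of-a-story"> Fast Car </h3></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match on tag name plus id instead of the complete class string
titles = [h3.get_text(strip=True) for h3 in sample_soup.find_all("h3", id="title-of-a-story")]
print(titles)  # ['Last Night', 'Flowers', 'Fast Car']
```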

In [31]:
# Artists

#First artist  (different html code from rest of artists)

first_artist_class = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet"
first_= soup.find_all("span", attrs={"class": first_artist_class})

In [32]:
for i in first_:
    
    first_artist = i.get_text(strip=True)
    artists.append(first_artist)

In [33]:
#Rest of the artists

artists_="c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only"
artists2 = soup.find_all("span", attrs={"class": artists_})
#artists2

In [34]:
for i in artists2:
    
    artist_list = i.get_text(strip=True)
    artists.append(artist_list)

In [35]:
artists

['Morgan Wallen',
 'Miley Cyrus',
 'Luke Combs',
 'Rema & Selena Gomez',
 'Lil Durk Featuring J. Cole',
 'Toosii',
 'SZA',
 'Metro Boomin, The Weeknd & 21 Savage',
 'Taylor Swift Featuring Ice Spice',
 'Eslabon Armado X Peso Pluma',
 'Miguel',
 'Taylor Swift',
 'The Weeknd & Ariana Grande',
 'Zach Bryan',
 'SZA',
 'Yng Lvcas x Peso Pluma',
 'Bad Bunny',
 'Grupo Frontera X Bad Bunny',
 'Jelly Roll',
 'Drake',
 'Morgan Wallen',
 'Morgan Wallen',
 'Post Malone',
 'Fifty Fifty',
 'Bailey Zimmerman',
 'Ed Sheeran',
 'PinkPantheress & Ice Spice',
 'Jordan Davis',
 'Latto Featuring Cardi B',
 'Morgan Wallen',
 'David Guetta & Bebe Rexha',
 'Dua Lipa',
 'Kali',
 'Tyler Hubbard',
 'Morgan Wallen',
 'Old Dominion',
 'Bizarrap & Peso Pluma',
 'Megan Moroney',
 'Taylor Swift',
 'Fuerza Regida',
 'Lil Durk Featuring Morgan Wallen',
 'Bailey Zimmerman',
 'Noah Kahan',
 'Chris Brown',
 'Coi Leray',
 'Metro Boomin, Swae Lee & NAV Featuring A Boogie Wit da Hoodie',
 'Metro Boomin, Swae Lee, Lil Wayne &

In [36]:
len(artists)

100
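
Collecting songs and artists in two separate passes works only while both lists stay perfectly aligned: one missing artist would shift every later row. An alternative is to walk the chart row by row and read both fields from the same container. The sketch below uses hand-written HTML, and the `chart-row` class is a placeholder. The real container class on Billboard would have to be looked up in the page source.

```python
from bs4 import BeautifulSoup

# Hand-written sample; "chart-row" is a placeholder class name
sample_html = """
<div class="chart-row"><h3> Last Night </h3><span> Morgan Wallen </span></div>
<div class="chart-row"><h3> Flowers </h3><span> Miley Cyrus </span></div>
<div class="chart-row"><h3> Fast Car </h3></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in sample_soup.find_all("div", class_="chart-row"):
    title_tag = row.find("h3")
    artist_tag = row.find("span")
    # A missing field becomes None instead of shifting the following rows
    rows.append({
        "artists": artist_tag.get_text(strip=True) if artist_tag else None,
        "songs": title_tag.get_text(strip=True) if title_tag else None,
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.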

In [37]:
## Constructing the dataframe

# each list becomes a column
df = pd.DataFrame({"artists":artists,
                       "songs":songs
                      })

df.head()

|   | artists | songs |
|---|---|---|
| 0 | Morgan Wallen | Last Night |
| 1 | Miley Cyrus | Flowers |
| 2 | Luke Combs | Fast Car |
| 3 | Rema & Selena Gomez | Calm Down |
| 4 | Lil Durk Featuring J. Cole | All My Life |


In [38]:
df.nunique()

artists     86
songs      100
dtype: int64

In [39]:
df.artists.value_counts()

Morgan Wallen                 7
Taylor Swift                  3
Luke Combs                    2
Peso Pluma                    2
Miley Cyrus                   2
                             ..
Old Dominion                  1
Tyler Hubbard                 1
Kali                          1
Dua Lipa                      1
Metro Boomin & James Blake    1
Name: artists, Length: 86, dtype: int64

In [None]:
# some artists have multiple songs on the top 100.
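
Counting the raw `artists` strings treats every collaboration as a separate artist, so "Lil Durk Featuring Morgan Wallen" does not add to Morgan Wallen's total. One way to count individual names is to split each credit on its separators first. The sketch below uses a few rows copied from the chart above. The separator pattern is an assumption that covers only the forms visible in this output ("Featuring", "&", ",", and "X"/"x"), and it would mis-split a name that itself contains a standalone "X".

```python
import pandas as pd

sample = pd.DataFrame({"artists": [
    "Morgan Wallen",
    "Lil Durk Featuring Morgan Wallen",
    "Eslabon Armado X Peso Pluma",
    "Yng Lvcas x Peso Pluma",
    "Metro Boomin, The Weeknd & 21 Savage",
]})

# Patterns longer than one character are treated as regular expressions by str.split
separators = r"\s+Featuring\s+|\s*&\s*|\s*,\s*|\s+[Xx]\s+"

# Split each credit into individual names, then count one row per name
names = sample["artists"].str.split(separators).explode().str.strip()
counts = names.value_counts()
print(counts)
```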

# Lab | Web Scraping Multiple Pages

Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.

## Rolling Stone: The 100 Best Songs of 2022

In [40]:
## Getting the html code for the first 50 songs
url2 = "https://www.rollingstone.com/music/music-lists/best-songs-2022-list-1234632381/"

In [41]:
response2 = requests.get(url2)
response2

<Response [200]>

In [4]:
## Parsing the html code
soup2 = BeautifulSoup(response2.content, "html.parser")
soup2

In [42]:
#creating an empty list (artist and title share one heading and are split later)
songs2=[]

In [43]:
for item in soup2.find_all("article", attrs={"class": "pmc-fallback-list-item"}):
    songs2.append(item.find('h2').get_text())

In [None]:
songs2

In [44]:
len(songs2)

50

In [45]:
## Getting the html code for the second 50 songs

url3="https://www.rollingstone.com/music/music-lists/best-songs-2022-list-1234632381/bad-bunny-ft-bomba-estereo-ojitos-lindos-1234632596/"

In [46]:
response3= requests.get(url3)
response3

<Response [200]>

In [10]:
## Parsing the html code
soup3 = BeautifulSoup(response3.content, "html.parser")
soup3

In [47]:
for item in soup3.find_all("article", attrs={"class": "pmc-fallback-list-item"}):
    songs2.append(item.find('h2').get_text())

In [None]:
songs2

In [48]:
len(songs2)

100
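
The two Rolling Stone cells differ only in the page they read, so the extraction can be folded into one function that is applied to each page in turn. The sketch below takes already-downloaded HTML so that it runs without network access. `extract_titles` is a hypothetical helper, and the sample markup is hand-written to mimic the `pmc-fallback-list-item` structure used above.

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    """Return the text of the first <h2> inside each list-item article."""
    page = BeautifulSoup(html, "html.parser")
    titles = []
    for item in page.find_all("article", attrs={"class": "pmc-fallback-list-item"}):
        heading = item.find("h2")
        if heading is not None:  # skip items without a heading instead of raising
            titles.append(heading.get_text(strip=True))
    return titles

# Hand-written stand-ins for the two downloaded pages
page_1 = '<article class="pmc-fallback-list-item"><h2>Lainey Wilson, ‘Heart Like a Truck’</h2></article>'
page_2 = ('<article class="pmc-fallback-list-item"><h2>Beyonce, ‘Cuff It’</h2></article>'
          '<article class="pmc-fallback-list-item"></article>')

all_titles = []
for html in (page_1, page_2):
    all_titles.extend(extract_titles(html))
print(all_titles)
```

With the live pages, the loop would iterate over `response2.content` and `response3.content` instead of the sample strings.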

In [13]:
## Constructing the dataframe

# each list becomes a column
df2 = pd.DataFrame({"artists_songs":songs2})
df2

|   | artists_songs |
|---|---|
| 0 | Lainey Wilson, ‘Heart Like a Truck’ |
| 1 | Chronixx, ‘Never Give Up’ |
| 2 | Plains, ‘Problem With It’ |
| 3 | Hurray for the Riff Raff, ‘Saga’ |
| 4 | Camilo ft. Grupo Firme, ‘Alaska’ |
| ... | ... |
| 95 | Rosalia, ‘Despecha’ |
| 96 | Taylor Swift, ‘Karma’ |
| 97 | Steve Lacy, ‘Bad Habit’ |
| 98 | Beyonce, ‘Cuff It’ |


In [14]:
# splitting artist and song

df2 = df2['artists_songs'].str.split(",", n=1, expand=True)
# n=1 limits the split to the first comma, so each row yields at most two parts.
# expand=True returns the parts as separate DataFrame columns.
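
Splitting on the first comma assumes that no artist credit contains a comma of its own. A credit such as "Metro Boomin, The Weeknd & 21 Savage" would be cut after "Metro Boomin". One alternative is to anchor the split on the opening quotation mark that precedes every title. In the sketch below, the second row is a made-up entry in the list's format, used only to show the comma case.

```python
import pandas as pd

raw = pd.DataFrame({"artists_songs": [
    "Lainey Wilson, ‘Heart Like a Truck’",
    "Metro Boomin, The Weeknd & 21 Savage, ‘Example Title’",  # made-up row
]})

# Everything before ", ‘" is the artist credit; the text inside ‘…’ is the title
pattern = r"^(?P<artists>.+?),\s*‘(?P<songs>.+)’\s*$"
parsed = raw["artists_songs"].str.extract(pattern)
print(parsed)
```

Because the named groups become the column names and the quotation marks sit outside the captured title, this variant would not need the later renaming and quote-stripping steps.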

In [15]:
df2.head()

|   | 0 | 1 |
|---|---|---|
| 0 | Lainey Wilson | ‘Heart Like a Truck’ |
| 1 | Chronixx | ‘Never Give Up’ |
| 2 | Plains | ‘Problem With It’ |
| 3 | Hurray for the Riff Raff | ‘Saga’ |
| 4 | Camilo ft. Grupo Firme | ‘Alaska’ |


In [16]:
#adding column names

cols=["artists", "songs"]
df2.columns=cols
df2

|   | artists | songs |
|---|---|---|
| 0 | Lainey Wilson | ‘Heart Like a Truck’ |
| 1 | Chronixx | ‘Never Give Up’ |
| 2 | Plains | ‘Problem With It’ |
| 3 | Hurray for the Riff Raff | ‘Saga’ |
| 4 | Camilo ft. Grupo Firme | ‘Alaska’ |
| ... | ... | ... |
| 95 | Rosalia | ‘Despecha’ |
| 96 | Taylor Swift | ‘Karma’ |
| 97 | Steve Lacy | ‘Bad Habit’ |
| 98 | Beyonce | ‘Cuff It’ |


In [17]:
# strip only leading/trailing whitespace; str.replace(" ", "") would also delete the spaces inside titles
df2["songs"] = df2["songs"].str.strip()

In [19]:
df2["songs"] = df2["songs"].str.strip("‘’")

In [49]:
# concatenating both Billboard and Rolling Stone dataframes along the index.

final_df = pd.concat([df, df2], axis=0)
final_df

|   | artists | songs |
|---|---|---|
| 0 | Morgan Wallen | Last Night |
| 1 | Miley Cyrus | Flowers |
| 2 | Luke Combs | Fast Car |
| 3 | Rema & Selena Gomez | Calm Down |
| 4 | Lil Durk Featuring J. Cole | All My Life |
| ... | ... | ... |
| 95 | Rosalia | Despecha |
| 96 | Taylor Swift | Karma |
| 97 | Steve Lacy | Bad Habit |
| 98 | Beyonce | Cuff It |


In [50]:
final_df.shape

(200, 2)
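
Because `pd.concat` keeps each frame's own index by default, the combined frame holds two rows for every label from 0 to 99, and a song present in both sources is stored twice. A possible follow-up is to renumber the rows and drop duplicates on a normalized key. The sketch below uses two tiny stand-in frames with a made-up overlap. The `source` column and the lowercase matching rule are assumptions, not part of the lab.

```python
import pandas as pd

# Stand-in frames; the overlapping "Anti-Hero" row is made up for the example
billboard = pd.DataFrame({"artists": ["Miley Cyrus", "Taylor Swift"],
                          "songs": ["Flowers", "Anti-Hero"]})
rolling_stone = pd.DataFrame({"artists": ["Steve Lacy", "Taylor Swift"],
                              "songs": ["Bad Habit", "anti-hero"]})

combined = pd.concat(
    [billboard.assign(source="billboard"), rolling_stone.assign(source="rolling_stone")],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating each frame's index
)

# Compare artist and title case-insensitively, keeping the first occurrence
key = combined["artists"].str.lower() + "|" + combined["songs"].str.lower()
deduplicated = combined.loc[~key.duplicated()].reset_index(drop=True)
print(deduplicated)
```

Keeping a `source` column makes it possible to tell later whether a song came from the current chart or from the year-end list.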