# Lab | Web Scraping Single Page
### Business goal:
- Check the case_study_gnod.md file.

- Make sure you've understood the big picture of your project:

    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
    
    Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

### Instructions - Scraping popular songs
- Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular right now.

- You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

- It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

<div class="alert alert-block alert-info">
    
    
<h1>Objective of this lab:</h1> 
    
    
<h5> - Get the top 100 songs on Billboard, listed in a pandas dataframe</h5>
</div>



# Import libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
from urllib.request import urlopen
import re

# Scraping the content from the web

In [2]:
# open url and read the whole url page
html = urlopen("https://www.billboard.com/charts/hot-100").read()

# Parsing the data into a readable form & assign into a "soup" object
soup = BeautifulSoup(html, "html.parser")
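
Some sites reject requests that arrive with the default `urllib` user agent, and a request without a timeout can hang indefinitely. The sketch below is one hedged way to guard against both using only the standard library. The helper names `build_request` and `fetch_html`, the `"Mozilla/5.0"` user-agent string, and the 10-second timeout are illustrative assumptions, not part of the original lab, and whether Billboard needs the header at all is not confirmed here.

```python
from urllib.request import Request, urlopen

URL = "https://www.billboard.com/charts/hot-100"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying an explicit User-Agent header (assumed value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html() does.
req = build_request(URL)
print(req.full_url)
```

Calling `fetch_html(URL)` would return bytes that can be passed to `BeautifulSoup(..., "html.parser")` in the same way as the `html` variable above.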

In [3]:
# Retrieve all the h3 tags

tag = soup("h3")
tag

[<h3 class="c-title a-font-primary-bold-l a-font-primary-bold-m@mobile-max lrv-u-color-black u-color-white@mobile-max lrv-u-margin-r-150" id="">
 <a class="c-title__link lrv-a-unstyle-link" href="#">
 	
 		
 					Last Night		
 					</a>
 </h3>,
 <h3 class="c-title a-font-primary-medium-s u-letter-spacing-0021" id="title-of-a-story">
 
 	
 	
 		
 					Songwriter(s):		
 	
 </h3>,
 <h3 class="c-title a-font-primary-medium-s u-letter-spacing-0021" id="title-of-a-story">
 
 	
 	
 		
 					Producer(s):		
 	
 </h3>,
 <h3 class="c-title a-font-primary-medium-s u-letter-spacing-0021" id="title-of-a-story">
 
 	
 	
 		
 					Imprint/Promotion Label:		
 	
 </h3>,
 <h3 class="c-title a-font-primary-m lrv-u-color-brand-primary:hover" id="title-of-a-story">
 <a class="c-title__link lrv-a-unstyle-link" href="https://www.billboard.com/music/chart-beat/morgan-wallen-last-night-number-one-hot-100-11-weeks-1235357558/">
 	
 		
 					Morgan Wallen’s ‘Last Night’ Tops Billboard Hot 100 for 11th Week		
 			

In [4]:
# Class shared by every song in the top 100 except number 1
class_100 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only"

# Class for the top 1 song
class_1 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 u-font-size-23@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-245 u-max-width-230@tablet-only u-letter-spacing-0028@tablet"

In [5]:
# Create a list to collect the song titles
title = []

In [6]:
# Retrieve the top 1 song
song_1 = soup.find_all("h3", attrs={"class": class_1})

# Get the name of the song
title_1 = song_1[0].get_text()

# Append the top 1 song into the list
title.append(title_1)
title

['\n\n\t\n\t\n\t\t\n\t\t\t\t\tLast Night\t\t\n\t\n']

In [7]:
# Retrieve the remaining songs in the top 100 (everything except number 1)
song_100 = soup.find_all("h3", attrs={"class": class_100})

# Append each song title found with class_100 to the same list as above
for song in song_100:
    title.append(song.get_text())
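
Matching on the full class string is brittle, because any change to one utility class breaks the lookup, and it forces a separate query for the number 1 song. As a hedged alternative, BeautifulSoup's `class_` argument matches a single class token, so one query can cover both variants. The markup below is a hypothetical, heavily shortened stand-in for the real page, and the assumption that `a-no-trucate` appears only on chart titles would need to be checked against the live HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup; the real page carries many more classes.
sample_html = """
<ul>
  <li><h3 class="c-title a-no-trucate u-font-size-23@tablet">
        Last Night
  </h3></li>
  <li><h3 class="c-title a-no-trucate lrv-u-font-size-18@tablet">
        Flowers
  </h3></li>
  <li><h3 class="c-title a-font-primary-medium-s">
        Songwriter(s):
  </h3></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ with a single token matches any element whose class list contains it,
# so both title variants come back in document order and the label is skipped.
sample_titles = [
    h3.get_text(strip=True)
    for h3 in sample_soup.find_all("h3", class_="a-no-trucate")
]
print(sample_titles)
```

Because the results arrive in document order, the number 1 song stays first without a separate append step.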

In [8]:
# Check the list
title

['\n\n\t\n\t\n\t\t\n\t\t\t\t\tLast Night\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tFlowers\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tFast Car\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tCalm Down\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tAll My Life\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tFavorite Song\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tKill Bill\t\t\n\t\n',
 "\n\n\t\n\t\n\t\t\n\t\t\t\t\tCreepin'\t\t\n\t\n",
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tKarma\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tElla Baila Sola\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tSure Thing\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tAnti-Hero\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tDie For You\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tSomething In The Orange\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tSnooze\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tLa Bebe\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tWhere She Goes\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tUn x100to\t\t\n\t\n',
 '\n\n\t\n\t\n\t\t\n\t\t\t\t\tNeed A Favor

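- Every title (string) carries a run of whitespace characters (`\n`, `\t`) before and after the text.
- Clean the list by filtering out whitespace-only entries with a regex and trimming the rest with `strip()`.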
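
As a side note, `get_text()` accepts a `strip=True` argument that trims the whitespace at extraction time, which can make a later cleanup pass unnecessary. The fragment below is a hypothetical stand-in that mimics the padding seen in the scraped titles; it is a minimal sketch, not the lab's own method.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the padded markup seen in the scraped titles.
fragment = BeautifulSoup("<h3>\n\n\t\n\t\t\tLast Night\t\t\n\t\n</h3>", "html.parser")
heading = fragment.find("h3")

raw_text = heading.get_text()              # keeps the surrounding \n and \t
clean_text = heading.get_text(strip=True)  # trims them during extraction

print(repr(raw_text))
print(repr(clean_text))
```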

In [9]:
# Regex that matches strings consisting only of whitespace
pattern = re.compile(r"^\s+$")

# Iterate through the list "title", skip whitespace-only entries,
# and strip() the leading/trailing whitespace from every remaining string
title = [item.strip() for item in title if not pattern.match(item)]
title

['Last Night',
 'Flowers',
 'Fast Car',
 'Calm Down',
 'All My Life',
 'Favorite Song',
 'Kill Bill',
 "Creepin'",
 'Karma',
 'Ella Baila Sola',
 'Sure Thing',
 'Anti-Hero',
 'Die For You',
 'Something In The Orange',
 'Snooze',
 'La Bebe',
 'Where She Goes',
 'Un x100to',
 'Need A Favor',
 'Search & Rescue',
 'You Proof',
 "Thinkin' Bout Me",
 'Chemical',
 'Cupid',
 'Rock And A Hard Place',
 'Eyes Closed',
 "Boy's A Liar, Pt. 2",
 'Next Thing You Know',
 'Put It On Da Floor Again',
 'Thought You Should Know',
 "I'm Good (Blue)",
 'Dance The Night',
 'Area Codes',
 "Dancin' In The Country",
 'One Thing At A Time',
 'Memory Lane',
 'Bzrp Music Sessions, Vol. 55',
 'Tennessee Orange',
 'Cruel Summer',
 'TQM',
 'Stand By Me',
 'Religiously',
 'Dial Drunk',
 'Under The Influence',
 'Players',
 'Calling',
 'Annihilate',
 'Take Two',
 'Love You Anyway',
 'Thank God',
 'Am I Dreaming',
 'Princess Diana',
 'Bye',
 'Self Love',
 'It Matters To Her',
 'Daylight',
 'PRC',
 'Por Las Noches',
 'Mou

#### What does the script above do?

- Note: a short explanation for my own reference

**Regular Expression**:

   ^ --> start of string
   
   \s --> matches a whitespace character (space, tab \t, newline \n, carriage return \r, ...)
   
   "+" --> one or more 
   
   $ --> end of string
   

---
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

- [Ref](https://docs.python.org/3/library/re.html): https://docs.python.org/3/library/re.html
---
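
To make the roles of the two steps concrete, here is a small standard-library sketch. It shows that `^\s+$` only flags strings made up entirely of whitespace, while `strip()` is what removes the padding around a real title. The `samples` list is made up for illustration.

```python
import re

whitespace_only = re.compile(r"^\s+$")

# Made-up inputs: a padded title, a whitespace-only string, a clean title, an empty string.
samples = ["\n\t\tLast Night\t\n", "\n\t \n", "Flowers", ""]

flags = [bool(whitespace_only.match(text)) for text in samples]
print(flags)

# Same filter-then-strip step as in the notebook, applied to the samples.
kept = [text.strip() for text in samples if not whitespace_only.match(text)]
print(kept)
```

Note that the empty string is not filtered out, because `+` requires at least one whitespace character; a pattern such as `^\s*$` would catch it as well.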





#### List Comprehension:
**[item.strip() for item in title if not pattern.match(item)]**

- It iterates through the list "title", skips whitespace-only entries, and uses strip() to remove the leading and trailing whitespace from each remaining string before putting it into a new list []


- The comprehension can be written as an explicit loop (which is easier to read) as follows:

```
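cleaned = []
for item in title:
    if not pattern.match(item):
        # Append to a new list: appending to "title" while looping over it never ends
        cleaned.append(item.strip())
title = cleaned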

```


##### Some functions:

- match() - checks whether the pattern matches at the beginning of the string and returns a match object, or None if there is no match
- [re.compile()](https://pynative.com/python-regex-compile/) - method is used to compile a regular expression pattern provided as a string into a regex pattern object 
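
The difference between `match()` and its siblings is easy to mix up, so here is a brief standard-library sketch on made-up strings. It compares `match()` (anchored at the start), `search()` (first occurrence anywhere), and `fullmatch()` (the entire string must match) on a compiled pattern.

```python
import re

digits = re.compile(r"\d+")

at_start = digits.match("100 songs")     # digits sit at position 0
not_at_start = digits.match("Top 100")   # string starts with "Top", so no match
anywhere = digits.search("Top 100")      # search() scans the whole string
whole = digits.fullmatch("Top 100")      # fullmatch() needs the entire string to match

print(at_start.group())   # 100
print(not_at_start)       # None
print(anywhere.group())   # 100
print(whole)              # None
```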

## Get Artist Names

In [10]:
# Retrieve all the span tags

span_tags = soup("span")
# span_tags

In [11]:
# The artist names are found in these classes
class_artist_1 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet"

class_artist_100 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only"

In [12]:
# Create a list to collect the artist names
artist = []

# Retrieve the top 1 artist
artist_1 = soup.find_all("span", attrs={"class": class_artist_1})

# Get the name of the artist
artist_name_1 = artist_1[0].get_text()

# Append the top 1 artist into the list
artist.append(artist_name_1)
artist

['\n\t\n\tMorgan Wallen\n']

In [13]:
# Retrieve the remaining artists in the top 100 (everything except number 1)
artist_100 = soup.find_all("span", attrs={"class": class_artist_100})

# Append each artist name found with class_artist_100 to the same list as above
for entry in artist_100:
    artist.append(entry.get_text())

# Check the list
artist

['\n\t\n\tMorgan Wallen\n',
 '\n\t\n\tMiley Cyrus\n',
 '\n\t\n\tLuke Combs\n',
 '\n\t\n\tRema & Selena Gomez\n',
 '\n\t\n\tLil Durk Featuring J. Cole\n',
 '\n\t\n\tToosii\n',
 '\n\t\n\tSZA\n',
 '\n\t\n\tMetro Boomin, The Weeknd & 21 Savage\n',
 '\n\t\n\tTaylor Swift Featuring Ice Spice\n',
 '\n\t\n\tEslabon Armado X Peso Pluma\n',
 '\n\t\n\tMiguel\n',
 '\n\t\n\tTaylor Swift\n',
 '\n\t\n\tThe Weeknd & Ariana Grande\n',
 '\n\t\n\tZach Bryan\n',
 '\n\t\n\tSZA\n',
 '\n\t\n\tYng Lvcas x Peso Pluma\n',
 '\n\t\n\tBad Bunny\n',
 '\n\t\n\tGrupo Frontera X Bad Bunny\n',
 '\n\t\n\tJelly Roll\n',
 '\n\t\n\tDrake\n',
 '\n\t\n\tMorgan Wallen\n',
 '\n\t\n\tMorgan Wallen\n',
 '\n\t\n\tPost Malone\n',
 '\n\t\n\tFifty Fifty\n',
 '\n\t\n\tBailey Zimmerman\n',
 '\n\t\n\tEd Sheeran\n',
 '\n\t\n\tPinkPantheress & Ice Spice\n',
 '\n\t\n\tJordan Davis\n',
 '\n\t\n\tLatto Featuring Cardi B\n',
 '\n\t\n\tMorgan Wallen\n',
 '\n\t\n\tDavid Guetta & Bebe Rexha\n',
 '\n\t\n\tDua Lipa\n',
 '\n\t\n\tKali\n',
 '\n\t\n

In [14]:
# Strip the leading/trailing \n and \t characters
artist = [name.strip() for name in artist if not pattern.match(name)]
artist

['Morgan Wallen',
 'Miley Cyrus',
 'Luke Combs',
 'Rema & Selena Gomez',
 'Lil Durk Featuring J. Cole',
 'Toosii',
 'SZA',
 'Metro Boomin, The Weeknd & 21 Savage',
 'Taylor Swift Featuring Ice Spice',
 'Eslabon Armado X Peso Pluma',
 'Miguel',
 'Taylor Swift',
 'The Weeknd & Ariana Grande',
 'Zach Bryan',
 'SZA',
 'Yng Lvcas x Peso Pluma',
 'Bad Bunny',
 'Grupo Frontera X Bad Bunny',
 'Jelly Roll',
 'Drake',
 'Morgan Wallen',
 'Morgan Wallen',
 'Post Malone',
 'Fifty Fifty',
 'Bailey Zimmerman',
 'Ed Sheeran',
 'PinkPantheress & Ice Spice',
 'Jordan Davis',
 'Latto Featuring Cardi B',
 'Morgan Wallen',
 'David Guetta & Bebe Rexha',
 'Dua Lipa',
 'Kali',
 'Tyler Hubbard',
 'Morgan Wallen',
 'Old Dominion',
 'Bizarrap & Peso Pluma',
 'Megan Moroney',
 'Taylor Swift',
 'Fuerza Regida',
 'Lil Durk Featuring Morgan Wallen',
 'Bailey Zimmerman',
 'Noah Kahan',
 'Chris Brown',
 'Coi Leray',
 'Metro Boomin, Swae Lee & NAV Featuring A Boogie Wit da Hoodie',
 'Metro Boomin, Swae Lee, Lil Wayne &

# Billboard Hot 100

In [15]:
# Put data into a pandas dataframe
df = pd.DataFrame({"Song": title, "Artist": artist})
df

Unnamed: 0,Song,Artist
0,Last Night,Morgan Wallen
1,Flowers,Miley Cyrus
2,Fast Car,Luke Combs
3,Calm Down,Rema & Selena Gomez
4,All My Life,Lil Durk Featuring J. Cole
...,...,...
95,Save Me,Jelly Roll With Lainey Wilson
96,Yandel 150,Yandel & Feid
97,Beso,Rosalia & Rauw Alejandro
98,I Wrote The Book,Morgan Wallen
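
Since the titles and artists are scraped into two separate lists, a mismatch in length would make `pd.DataFrame` raise a `ValueError`, and a silent misalignment would pair songs with the wrong artists. The sketch below adds an explicit length check and a 1-based `Rank` index. It uses shortened stand-in lists rather than the scraped data, and the `Rank` index name is an assumption added for illustration.

```python
import pandas as pd

# Shortened stand-ins for the scraped "title" and "artist" lists.
sample_titles = ["Last Night", "Flowers", "Fast Car"]
sample_artists = ["Morgan Wallen", "Miley Cyrus", "Luke Combs"]

# Fail early with a readable message if the two lists drifted apart.
if len(sample_titles) != len(sample_artists):
    raise ValueError(
        f"{len(sample_titles)} titles but {len(sample_artists)} artists"
    )

chart = pd.DataFrame({"Song": sample_titles, "Artist": sample_artists})

# Use the chart position (1-based) as the index instead of the default 0-based one.
chart.index = pd.RangeIndex(start=1, stop=len(chart) + 1, name="Rank")
print(chart)
```

Checking `len(df)` against the expected 100 rows after scraping is a cheap way to notice when the page layout has changed.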


### Sources:
#### About requests.get()
   - [GET and POST Requests Using Python](https://www.geeksforgeeks.org/get-post-requests-using-python/)
   - [Python requests: GET Request Explained](https://datagy.io/python-requests-get-request/)
   - [Python Requests get() Method](https://www.w3schools.com/python/ref_requests_get.asp)
   

#### Web Scraping
   - [BeautifulSoup Guide: Scraping HTML Pages With Python](https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-web-scraping/)
   

#### Regular Expression Cheat Sheet
   - [RegEX cheatsheet](https://quickref.me/regex.html)
   - [re.compile()](https://pynative.com/python-regex-compile/) - method is used to compile a regular expression pattern provided as a string into a regex pattern object 
   - [Removing \n\t from string](https://stackoverflow.com/questions/47953523/remove-an-n-t-t-t-element-from-list)