<div class="alert alert-block alert-warning">
    
    
<h4>Please scroll down to Lab | Web Scraping Multiple Pages</h4> 
    
    
</div>



# Lab | Web Scraping Single Page
### Business goal:
- Check the case_study_gnod.md file.

- Make sure you've understood the big picture of your project:

    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
    
    Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

### Instructions - Scraping popular songs
- Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular at the moment.

- You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

- It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

<div class="alert alert-block alert-info">
    
    
<h1>Objective of this lab:</h1> 
    
    
<h5> - Get the top 100 songs on Billboard and list them in a pandas dataframe</h5>
</div>



# Import libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
from urllib.request import urlopen
import re

# Scraping the content from the web

In [2]:
# Open the URL and read the whole page
html = urlopen("https://www.billboard.com/charts/hot-100").read()

# Parse the HTML and assign the result to a "soup" object
soup = BeautifulSoup(html, "html.parser")
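
Some sites respond differently to clients that send Python's default user agent. The notebook does not show whether Billboard does, so the following is only a sketch of how a request with an explicit `User-Agent` header could be built with `urllib.request.Request`. The header value and the `build_request` helper are illustrative assumptions.

```python
from urllib.request import Request

CHART_URL = "https://www.billboard.com/charts/hot-100"


def build_request(url, user_agent="Mozilla/5.0 (compatible; lab-scraper)"):
    """Return a Request with an explicit User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(CHART_URL)

# Request stores header names capitalized, hence "User-agent" in the lookup
print(request.full_url, request.get_header("User-agent"))

# To actually fetch the page (needs network access):
# from urllib.request import urlopen
# html = urlopen(request).read()
```

The `Request` object can be passed to `urlopen()` in place of the plain URL string.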

In [3]:
# Retrieve all h3 tags (soup("h3") is shorthand for soup.find_all("h3"))

tag = soup("h3")
# tag

In [4]:
# Class string shared by every song title in the top 100 except number 1
class_100 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only"

# Class string of the number 1 song title
class_1 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 u-font-size-23@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-245 u-max-width-230@tablet-only u-letter-spacing-0028@tablet"
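
Passing a full class string like the two above only matches when the tag's `class` attribute has exactly the same classes in the same order, so a small styling change on the site can break the lookup. As a hedged alternative, BeautifulSoup can match on a single class with `class_`. The sketch below runs on made-up HTML (the `sample_html` markup and the song names are assumptions, not Billboard's real page). On the live page, `c-title` may also match headings outside the chart, so the result would need checking.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics two title tags with different class strings
sample_html = """
<div>
  <h3 class="c-title a-font-primary-bold-s u-font-size-23@tablet">
      Song A
  </h3>
  <h3 class="c-title a-font-primary-bold-s lrv-u-font-size-18@tablet">
      Song B
  </h3>
</div>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# class_="c-title" matches every h3 whose class list contains "c-title"
demo_titles = [
    h3.get_text(strip=True)
    for h3 in demo_soup.find_all("h3", class_="c-title")
]
print(demo_titles)  # ['Song A', 'Song B']
```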

In [5]:
# List for collecting the song titles
title = []

In [6]:
# Retrieve the top 1 song
song_1 = soup.find_all("h3", attrs={"class": class_1})

# Get the name of the song
title_1 = song_1[0].get_text()

# Append the top 1 song into the list
title.append(title_1)
title

['\n\n\t\n\t\n\t\t\n\t\t\t\t\tLast Night\t\t\n\t\n']

In [7]:
# Retrieve the remaining songs (every chart entry except number 1)
song_100 = soup.find_all("h3", attrs={"class": class_100})

# Get the name of each song in class_100 and append it to the same list as above
for song in song_100:
    title.append(song.get_text())

In [8]:
# Check the list
# title

- Each title string is surrounded by whitespace characters (`\n`, `\t`).
- Filter out whitespace-only entries with a regex and trim the rest with `strip()`.

In [9]:
# Regex that matches strings consisting only of whitespace
# (raw string, so that \s is not treated as a string escape sequence)
pattern = re.compile(r"^\s+$")

# Iterate through the list "title", skip whitespace-only entries,
# and use strip() to trim leading and trailing whitespace from the rest
title = [item.strip() for item in title if not pattern.match(item)]
# title

#### What does the script above do?

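- Note: a short explanation for my own reference

**Regular Expression**:

   ^ --> start of the string
   
   \s --> matches one whitespace character (space, tab \t, carriage return \r, or newline \n)
   
   "+" --> one or more of the preceding element
   
   $ --> end of the string

   Together, `^\s+$` matches a string that consists only of whitespace.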
   

---
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

- [Ref](https://docs.python.org/3/library/re.html): https://docs.python.org/3/library/re.html
---
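
As a quick check of the explanation above, the sketch below applies the same pattern to a few made-up strings (`raw_titles` is sample data, not scraped output). It suggests that the pattern only filters out the whitespace-only entries, while `strip()` does the actual trimming.

```python
import re

pattern = re.compile(r"^\s+$")

# Sample data: two padded titles and one whitespace-only entry
raw_titles = ["\n\n\t\n\t\t\tLast Night\t\t\n", "\n\t\n", "  Flowers "]

# Keep entries that are not whitespace-only, then trim them
cleaned = [item.strip() for item in raw_titles if not pattern.match(item)]
print(cleaned)  # ['Last Night', 'Flowers']
```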





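#### For Loop (list comprehension):
**[item.strip() for item in title if not pattern.match(item)]**

- It iterates through the list "title", skips the entries that contain only whitespace, and collects the stripped version of every other entry in a new list [].


- The same logic can be written as an explicit loop, which is easier to read. The results go into a new list, because appending to "title" while iterating over it would never finish:

```
cleaned = []
for item in title:
    if not pattern.match(item):
        cleaned.append(item.strip())
title = cleaned
```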


##### Some functions:

- match() - checks for a match only at the beginning of the string and returns a match object, or None if there is no match
- [re.compile()](https://pynative.com/python-regex-compile/) - compiles a regular expression pattern provided as a string into a regex pattern object
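
A minimal sketch of how `match()` on a compiled pattern behaves, compared with `search()`. The sample string is made up for illustration.

```python
import re

night = re.compile(r"Night")
last = re.compile(r"Last")
text = "Last Night"

# match() only looks at the beginning of the string
print(night.match(text))            # None
print(last.match(text).group())     # Last

# search() scans the whole string for the first occurrence
print(night.search(text).group())   # Night
```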

## Get Artist Names

In [10]:
# Retrieve all span tags

span_tags = soup("span")
# span_tags

In [11]:
# The artist names are in span tags with these classes
class_artist_1 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet"

class_artist_100 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only"

In [12]:
# List for collecting the artist names
artist = []

# Retrieve the top 1 artist
artist_1 = soup.find_all("span", attrs={"class": class_artist_1})

# Get the name of the artist
artist_name_1 = artist_1[0].get_text()

# Append the top 1 artist into the list
artist.append(artist_name_1)
artist

['\n\t\n\tMorgan Wallen\n']

In [13]:
# Retrieve the remaining artists (every chart entry except number 1)
artist_100 = soup.find_all("span", attrs={"class": class_artist_100})

# Get the name of each artist in class_artist_100 and append it to the same list as above
for artist_tag in artist_100:
    artist.append(artist_tag.get_text())

# Check the list
# artist

In [14]:
# Drop whitespace-only entries and trim \n and \t from the rest
artist = [name.strip() for name in artist if not pattern.match(name)]
# artist

# Billboard Hot 100

In [15]:
# Put data into a pandas dataframe
df = pd.DataFrame({"Song": title, "Artist": artist})
df

|     | Song | Artist |
|-----|------|--------|
| 0 | Last Night | Morgan Wallen |
| 1 | Flowers | Miley Cyrus |
| 2 | Fast Car | Luke Combs |
| 3 | Calm Down | Rema & Selena Gomez |
| 4 | All My Life | Lil Durk Featuring J. Cole |
| ... | ... | ... |
| 95 | Save Me | Jelly Roll With Lainey Wilson |
| 96 | Yandel 150 | Yandel & Feid |
| 97 | Beso | Rosalia & Rauw Alejandro |
| 98 | I Wrote The Book | Morgan Wallen |
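
`pd.DataFrame` raises a `ValueError` when the lists passed in a dict have different lengths, which can happen if one of the two class lookups misses an entry. Below is a minimal sketch of a length check to run before building the dataframe. `pair_songs` is a hypothetical helper, and the two short lists are sample data taken from the output above.

```python
def pair_songs(titles, artists):
    """Pair each title with its artist, failing loudly if the lists do not line up."""
    if len(titles) != len(artists):
        raise ValueError(
            f"Got {len(titles)} titles but {len(artists)} artists"
        )
    return [{"Song": t, "Artist": a} for t, a in zip(titles, artists)]


records = pair_songs(
    ["Last Night", "Flowers"],
    ["Morgan Wallen", "Miley Cyrus"],
)
print(records)
```

Passing the result to `pd.DataFrame(records)` builds the same two-column dataframe from the list of dicts.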


### Sources:
#### About requests.get()
   - [GET and POST Requests Using Python](https://www.geeksforgeeks.org/get-post-requests-using-python/)
   - [Python requests: GET Request Explained](https://datagy.io/python-requests-get-request/)
   - [Python Requests get() Method](https://www.w3schools.com/python/ref_requests_get.asp)
   

#### Web Scraping
   - [BeautifulSoup Guide: Scraping HTML Pages With Python](https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-web-scraping/)
   

#### Regular Expression Cheat Sheet
   - [RegEX cheatsheet](https://quickref.me/regex.html)
   - [re.compile()](https://pynative.com/python-regex-compile/) - method is used to compile a regular expression pattern provided as a string into a regex pattern object 
   - [Removing \n\t from string](https://stackoverflow.com/questions/47953523/remove-an-n-t-t-t-element-from-list)

<div class="alert alert-block alert-success">
    
<h1>Lab | Web Scraping Multiple Pages</h1> 
    
<h3>Objective of this lab:</h3> 
    
    
<h5> - Get more songs from different sources and add them to the same pandas dataframe as in the previous lab</h5>
</div>


### Instructions
### Prioritize the MVP
In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

### Expand the project
If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

### Practice web scraping
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:


- Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [21]:
import requests as r

# Send a request to a web page

youtube = r.get("https://charts.youtube.com/charts/TopSongs/global")
spotify = r.get("https://kworb.net/spotify/country/global_weekly_totals.html")
itunes = r.get("https://kworb.net/ww/")
apple = r.get("https://kworb.net/apple_songs/")
rollingstone = r.get("https://www.rollingstone.com/music/music-lists/best-songs-of-2023-so-far-1234766821/")

In [29]:
# Print the status code of each response

web = {"youtube": youtube, "spotify": spotify, "itunes": itunes, "apple": apple, "rollingstone": rollingstone}
for k, v in web.items():
    print(k, "status:", v.status_code)

youtube status: 200
spotify status: 200
itunes status: 200
apple status: 200
rollingstone status: 200


- Status code 200: the request succeeded and the response body (if any) was returned.
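
Other status codes are worth recognizing too. Here is a small sketch that uses the standard library's `http.HTTPStatus` to turn a code into its standard reason phrase. `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not a registered HTTP status
        return f"{code}: unknown status code"
    return f"{code}: {status.phrase}"


for code in (200, 403, 404):
    print(describe_status(code))
```

With a `requests` response, calling `response.raise_for_status()` raises an exception for 4xx and 5xx codes instead of only printing them.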

In [61]:
# Parsing the data into a readable form
yt = BeautifulSoup(youtube.content, "html.parser")
sptf = BeautifulSoup(spotify.content, "html.parser") 
itun = BeautifulSoup(itunes.content, "html.parser")
appl = BeautifulSoup(apple.content, "html.parser")
rlst = BeautifulSoup(rollingstone.content, "html.parser")

In [96]:
rlst

<!DOCTYPE html>

<!--[if IE 6]>
<html id="ie6" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8) ]><!-->
<html lang="en-US">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="#ffffff" name="theme-color"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<!--
		 _     _ _           ____          _          _____ _    ___
		| |   (_) | _____   / ___|___   __| | ___    | ____| |__|__ \
		| |   | | |/ / _ \ | |   / _ \ / _` |/ _ \   |  _| | '_ \ / /
		| |___| |   <  __/ | |__| (_) | (_| |  __/_  | |___| | | |_|
		|_____|_|_|\_\___|  \____\___/ \__,_|\___( ) |_____|_| |_(_)
												  |/

		 Work on Rolling Stone and other iconic brands!

		 Visit our careers page at https://pmc.com/careers/

-->
<meta content="The Best Songs of 2023 So Far, from Lana Del Rey to Lil

In [97]:
yt_html = urlopen("https://charts.youtube.com/charts/TopSongs/global").read()

In [98]:
yt_soup = BeautifulSoup(yt_html, "html.parser")

In [99]:
yt_soup

<!DOCTYPE html>
<html dir="ltr" lang="de-DE"><head><script nonce="Sz8PrsBuxjFBbjq_ODsI3A">var ytcsi={gt:function(n){n=(n||"")+"data_";return ytcsi[n]||(ytcsi[n]={tick:{},info:{}})},now:window.performance&&window.performance.timing&&window.performance.now&&window.performance.timing.navigationStart?function(){return window.performance.timing.navigationStart+window.performance.now()}:function(){return(new Date).getTime()},tick:function(l,t,n){var ticks=ytcsi.gt(n).tick;var v=t||ytcsi.now();if(ticks[l]){ticks["_"+l]=ticks["_"+l]||[ticks[l]];ticks["_"+l].push(v)}ticks[l]=v},info:function(k,
v,n){ytcsi.gt(n).info[k]=v},setStart:function(t,n){ytcsi.tick("_start",t,n)}};
(function(w,d){function isGecko(){if(!w.navigator)return false;try{if(w.navigator.userAgentData&&w.navigator.userAgentData.brands&&w.navigator.userAgentData.brands.length){var brands=w.navigator.userAgentData.brands;var i=0;for(;i<brands.length;i++)if(brands[i]&&brands[i].brand==="Firefox")return true;return false}}catch(e){se

# YouTube List

In [71]:
yt

<!DOCTYPE html>
<html dir="ltr" lang="de-DE"><head><script nonce="wRS7lqYOX83uVWU3HNCJTQ">var ytcsi={gt:function(n){n=(n||"")+"data_";return ytcsi[n]||(ytcsi[n]={tick:{},info:{}})},now:window.performance&&window.performance.timing&&window.performance.now&&window.performance.timing.navigationStart?function(){return window.performance.timing.navigationStart+window.performance.now()}:function(){return(new Date).getTime()},tick:function(l,t,n){var ticks=ytcsi.gt(n).tick;var v=t||ytcsi.now();if(ticks[l]){ticks["_"+l]=ticks["_"+l]||[ticks[l]];ticks["_"+l].push(v)}ticks[l]=v},info:function(k,
v,n){ytcsi.gt(n).info[k]=v},setStart:function(t,n){ytcsi.tick("_start",t,n)}};
(function(w,d){function isGecko(){if(!w.navigator)return false;try{if(w.navigator.userAgentData&&w.navigator.userAgentData.brands&&w.navigator.userAgentData.brands.length){var brands=w.navigator.userAgentData.brands;var i=0;for(;i<brands.length;i++)if(brands[i]&&brands[i].brand==="Firefox")return true;return false}}catch(e){se

In [73]:
yt.find_all('')

[]

In [78]:
# Retrieve song title

cls_yt_title = "style-scope paper-tooltip hidden"

title_yt = yt.find_all("div", attrs={"class": cls_yt_title})
title_yt

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [95]:
title_yt = yt.find("div", id="tooltip", attrs={"class": "style-scope paper-tooltip hidden"})
title_yt.get_text()

AttributeError: 'NoneType' object has no attribute 'get_text'
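
`find()` returns `None` when nothing matches, which is what causes the `AttributeError` above. One possible reason, not confirmed here, is that the chart is filled in by JavaScript after the page loads, so the tooltip `div` is absent from the HTML that `requests` receives. Either way, checking for `None` before calling `get_text()` avoids the crash. The sketch below runs on made-up HTML, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if nothing matches."""
    found = soup.find(name, **kwargs)
    return found.get_text(strip=True) if found is not None else None


# A page that contains the tooltip div, and one that only contains a script
rendered = BeautifulSoup(
    '<div id="tooltip" class="style-scope paper-tooltip hidden"> Kya Loge Tum </div>',
    "html.parser",
)
unrendered = BeautifulSoup(
    "<html><head><script>var x = 1;</script></head></html>",
    "html.parser",
)

print(text_or_none(rendered, "div", id="tooltip"))    # Kya Loge Tum
print(text_or_none(unrendered, "div", id="tooltip"))  # None
```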

In [65]:
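# Snippet 1: print the title attribute of each span directly inside
# the first span of the "badges" div ("page" is a placeholder for an HTML string)
soup = BeautifulSoup(page, "html.parser")
badges = soup.body.find("div", attrs={"class": "badges"})
for span in badges.span.find_all("span", recursive=False):
    print(span.attrs["title"])

# Snippet 2: get the text of the span with id "count_text"
# ("your_html_input" is a placeholder for an HTML string)
soup = BeautifulSoup(your_html_input, "html.parser")
span = soup.find("span", id="count_text")
span.text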

[]

In [None]:
<div class="entity-title style-scope ytmc-chart-table">
              
                <ytmc-ellipsis-text class="ellipsis-title clickable style-scope ytmc-chart-table" clickable="true" tabindex="0" aria-label="Rank 2 Kya Loge Tum" endpoint="{&quot;urlEndpoint&quot;:{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=cAMHx-m9oh8&quot;,&quot;target&quot;:&quot;TARGET_NEW_WINDOW&quot;}}"><!--css_build_mark:video.youtube.src.web.polymer.music_charts.components.ui.ytmc_ellipsis_text.ytmc.ellipsis.text.css.js--><!--css_build_scope:ytmc-ellipsis-text--><!--css_build_styles:video.youtube.src.web.polymer.shared.ui.styles.yt_base_styles.yt.base.styles.css.js,video.youtube.src.web.polymer.music_charts.components.ui.ytmc_ellipsis_text.ytmc.ellipsis.text.css.js--><div class="ytmc-ellipsis-text-container style-scope ytmc-ellipsis-text">
  <span class="ytmc-ellipsis-text style-scope">Kya Loge Tum</span>
  <paper-tooltip id="tooltip" class="ytmc-ellipsis-tooltip style-scope ytmc-ellipsis-text" fit-to-visible-bounds="" role="tooltip" tabindex="-1" style="left: 672.523px; top: 669.5px;">
    

    <div id="tooltip" class="style-scope paper-tooltip hidden">
      Kya Loge Tum
    </div>
</paper-tooltip>
</div>
</ytmc-ellipsis-text>
              <dom-if class="style-scope ytmc-chart-table" style="display: none;"><template is="dom-if"></template></dom-if>
              <dom-if class="style-scope ytmc-chart-table" style="display: none;"><template is="dom-if"></template></dom-if>
            </div>
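
The fragment above looks like it was copied from the browser's rendered page (an assumption). If such rendered HTML is available as a string, the rank and title can be read from the `aria-label` attribute. The sketch below uses a shortened copy of the fragment, and `parse_aria_label` is a hypothetical helper.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the fragment shown above
fragment = """
<div class="entity-title style-scope ytmc-chart-table">
  <ytmc-ellipsis-text class="ellipsis-title clickable style-scope ytmc-chart-table" aria-label="Rank 2 Kya Loge Tum">
    <span class="ytmc-ellipsis-text style-scope">Kya Loge Tum</span>
  </ytmc-ellipsis-text>
</div>
"""


def parse_aria_label(label):
    """Split a label such as 'Rank 2 Kya Loge Tum' into (rank, title)."""
    found = re.match(r"Rank (\d+) (.+)", label)
    return (int(found.group(1)), found.group(2)) if found else None


frag_soup = BeautifulSoup(fragment, "html.parser")
entries = [
    parse_aria_label(tag["aria-label"])
    for tag in frag_soup.find_all("ytmc-ellipsis-text")
]
print(entries)  # [(2, 'Kya Loge Tum')]
```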