<div class="alert alert-block alert-warning">
    
    
<h4>Please scroll down to Lab | Web Scraping Multiple Pages</h4> 
    
    
</div>



# Lab | Web Scraping Single Page
### Business goal:
- Check the case_study_gnod.md file.

- Make sure you've understood the big picture of your project:

    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
    
    Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

### Instructions - Scraping popular songs
- Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular at the moment.

- You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

- It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

<div class="alert alert-block alert-info">
    
    
<h1>Objective of this lab:</h1> 
    
    
<h5> - Get the top 100 songs on Billboard and list them in a pandas dataframe</h5>
</div>



# Import libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
from urllib.request import urlopen
import re

# Scraping the content from the web

In [2]:
# open url and read the whole url page
html = urlopen("https://www.billboard.com/charts/hot-100").read()

# Parsing the data into a readable form & assign into a "soup" object
soup = BeautifulSoup(html, "html.parser")

In [3]:
# Retrieve all the h3 tags

tag = soup("h3")
# tag

In [4]:
# Class string shared by songs 2-100 on the chart (everything except the number-one song)
class_100 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only"

# Class string for the number-one song
class_1 = "c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 u-font-size-23@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-245 u-max-width-230@tablet-only u-letter-spacing-0028@tablet"
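The two class strings above are long and tied to the page's current styling, so they may stop matching when the layout changes. As a hedged alternative, the sketch below uses a made-up HTML snippet (not Billboard's real markup) to show that `class_="c-title"` matches any tag carrying that single class, while a full class string only matches that exact attribute value. Whether `c-title` alone is selective enough on the live page is an assumption that would need checking.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page has many more classes
sample_html = """
<h3 class="c-title big-font">Song A</h3>
<h3 class="c-title small-font">Song B</h3>
<h3 class="other">Not a song</h3>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A full class string matches only tags whose class attribute is exactly that string
exact = sample_soup.find_all("h3", attrs={"class": "c-title big-font"})

# A single class matches every tag that has it among its classes
by_one_class = sample_soup.find_all("h3", class_="c-title")

print(len(exact))         # 1
print(len(by_one_class))  # 2
```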

In [5]:
# Set a list for collecting song title
title = []

In [6]:
# Retrieve the top 1 song
song_1 = soup.find_all("h3", attrs={"class": class_1})

# Get the name of the song
title_1 = song_1[0].get_text()

# Append the top 1 song into the list
title.append(title_1)
title

['\n\n\t\n\t\n\t\t\n\t\t\t\t\tLast Night\t\t\n\t\n']

In [7]:
# Retrieve the remaining songs in the top 100 (everything except number one)
song_100 = soup.find_all("h3", attrs={"class": class_100})

# Get each song's name and append it to the same list as above
for song in song_100:
    title.append(song.get_text())

In [8]:
# Check the list
# title

- The titles have whitespace (\n and \t) before and after the text, as in the output above.
- Trim it with strip(), and use a regex to drop any entries that consist only of whitespace.

In [9]:
# Regex that matches strings consisting only of whitespace
pattern = re.compile(r"^\s+$")

# Iterate through the list "title": skip whitespace-only items,
# and strip() the leading/trailing whitespace from the rest
title = [item.strip() for item in title if not pattern.match(item)]
# title

#### What does the script above do?

- Note: a short explanation for my own reference

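**Regular Expression**:

   ^ --> start of string
   
   \s --> matches a whitespace character (space, tab \t, newline \n, carriage return \r, ...)
   
   "+" --> one or more of the preceding item
   
   $ --> end of string

   Together, ^\s+$ matches a string that is made up only of whitespace.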
   

---
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

- [Ref](https://docs.python.org/3/library/re.html): https://docs.python.org/3/library/re.html
---
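To make the pattern concrete, here is a small sketch on a hand-written list (the sample strings are invented, shaped like the scraped output above). It shows that `^\s+$` only flags whitespace-only entries, and that `strip()` is what actually trims the titles.

```python
import re

whitespace_only = re.compile(r"^\s+$")

# Invented sample, shaped like the raw scraped titles
raw_titles = ["\n\n\t\tLast Night\t\t\n", "\n\t\n", "  Flowers  "]

# Keep entries that are not whitespace-only, trimmed with strip()
cleaned_titles = [t.strip() for t in raw_titles if not whitespace_only.match(t)]

print(cleaned_titles)  # ['Last Night', 'Flowers']
```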





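#### List comprehension:
**[item.strip() for item in title if not pattern.match(item)]**

- It iterates through the list "title", skips the items that are whitespace only, strips the surrounding whitespace from the remaining strings, and puts the results into a new list []


- The same logic can be written as an explicit loop, which is easier to read. The results go into a new list, because appending to "title" while looping over it would never finish:

```
cleaned = []
for item in title:
    if not pattern.match(item):
        cleaned.append(item.strip())
title = cleaned
```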


##### Some functions:

- match() - checks for a match only at the beginning of the string and returns a match object (or None if there is no match)
- [re.compile()](https://pynative.com/python-regex-compile/) - compiles a regular expression pattern provided as a string into a regex pattern object
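As a hedged illustration of these two functions, the sketch below compiles one pattern and compares `match()` (anchored at the start of the string) with `search()` (which scans the whole string). The sample strings are invented.

```python
import re

digits = re.compile(r"\d+")

# match() only succeeds if the pattern matches at the very start of the string
print(digits.match("100 songs"))      # a match object for '100'
print(digits.match("top 100 songs"))  # None

# search() looks for the first match anywhere in the string
found = digits.search("top 100 songs")
print(found.group())  # 100
```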

## Get Artist Names

In [10]:
# Retrieve all the span tags

span_tage = soup("span")
# span_tage

In [11]:
# The artist names are in span tags with these classes
class_artist_1 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet"

class_artist_100 = "c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only"

In [12]:
# Set a list for collecting artist 
artist = []

# Retrieve the top 1 artist
artist_1 = soup.find_all("span", attrs={"class": class_artist_1})

# Get the name of the artist
artist_name_1 = artist_1[0].get_text()

# Append the top 1 artist into the list
artist.append(artist_name_1)
artist

['\n\t\n\tMorgan Wallen\n']

In [13]:
# Retrieve the remaining artists in the top 100 (everything except number one)
artist_100 = soup.find_all("span", attrs={"class": class_artist_100})

# Get each artist's name and append it to the same list as above
for artist_tag in artist_100:
    artist.append(artist_tag.get_text())

# Check the list
# artist

In [14]:
# Strip the surrounding \n and \t characters and drop whitespace-only entries
artist = [name.strip() for name in artist if not pattern.match(name)]
# artist

# Billboard Hot 100

In [15]:
# Put data into a pandas dataframe
df = pd.DataFrame({"Song": title, "Artist": artist})
df

| | Song | Artist |
|---|---|---|
| 0 | Last Night | Morgan Wallen |
| 1 | Flowers | Miley Cyrus |
| 2 | Fast Car | Luke Combs |
| 3 | Calm Down | Rema & Selena Gomez |
| 4 | All My Life | Lil Durk Featuring J. Cole |
| ... | ... | ... |
| 95 | Save Me | Jelly Roll With Lainey Wilson |
| 96 | Yandel 150 | Yandel & Feid |
| 97 | Beso | Rosalia & Rauw Alejandro |
| 98 | I Wrote The Book | Morgan Wallen |


### Sources:
#### About requests.get()
   - [GET and POST Requests Using Python](https://www.geeksforgeeks.org/get-post-requests-using-python/)
   - [Python requests: GET Request Explained](https://datagy.io/python-requests-get-request/)
   - [Python Requests get() Method](https://www.w3schools.com/python/ref_requests_get.asp)
   

#### Web Scraping
   - [BeautifulSoup Guide: Scraping HTML Pages With Python](https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-web-scraping/)
   

#### Regular Expression Cheat Sheet
   - [RegEX cheatsheet](https://quickref.me/regex.html)
   - [re.compile()](https://pynative.com/python-regex-compile/) - method is used to compile a regular expression pattern provided as a string into a regex pattern object 
   - [Removing \n\t from string](https://stackoverflow.com/questions/47953523/remove-an-n-t-t-t-element-from-list)

<div class="alert alert-block alert-success">
    
<h1>Lab | Web Scraping Multiple Pages</h1> 
    
<h3>Objective of this lab:</h3> 
    
    
<h5> - Get more songs from different sources and add them to the same pandas dataframe from the previous lab</h5>
</div>


### Instructions
### Prioritize the MVP
In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

### Expand the project
If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

### Practice web scraping
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:


- Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten'
- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
- A list with the different kind of datasets available in data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [16]:
import requests as r

# Send a request to a web page

youtube = r.get("https://charts.youtube.com/charts/TopSongs/global")
spotify = r.get("https://kworb.net/spotify/country/global_weekly_totals.html")
itunes = r.get("https://kworb.net/ww/")
apple = r.get("https://kworb.net/apple_songs/")
rollingstone = r.get("https://www.rollingstone.com/music/music-lists/best-songs-of-2023-so-far-1234766821/")

In [17]:
# Return the status code after sending request

web = {"youtube": youtube, "spotify": spotify, "itunes": itunes, "apple": apple, "rollingstone": rollingstone}
for k, v in web.items():
    print(k, "status:", v.status_code)

youtube status: 200
spotify status: 200
itunes status: 200
apple status: 200
rollingstone status: 200


- Status code 200: the request succeeded and the response body (if any) was returned
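The meaning of other status codes can be looked up with the standard library's `http.HTTPStatus`. The helper `describe_status` below is a hypothetical name used only for this sketch; with real `requests` responses, `response.raise_for_status()` is another way to stop early on 4xx/5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for a numeric HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

for code in (200, 403, 404):
    print(describe_status(code))
# 200 OK
# 403 Forbidden
# 404 Not Found
```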

In [18]:
# Parsing the data into a readable form
yt = BeautifulSoup(youtube.content, "html.parser")
sptf = BeautifulSoup(spotify.content, "html.parser") 
itun = BeautifulSoup(itunes.content, "html.parser")
appl = BeautifulSoup(apple.content, "html.parser")
rlst = BeautifulSoup(rollingstone.content, "html.parser")

 - I went through all the websites above. I couldn't retrieve data from youtube and rollingstone.
 - Given the time I have, I decided to scrape one of the sites above: itunes (its song list is similar to apple's).
 - I also scrape multiple pages from npr.org (see the script below).

# iTunes

In [19]:
# Find all song title & artist in the list
itun_200 = itun.find_all('td', attrs={"class": "mp text"})

In [20]:
# Put them in a list
itun_list = [cell.get_text() for cell in itun_200]

# itun_list

In [21]:
# Separate song title from artist name (each entry looks like "Artist - Title")
itun_title = []
itun_artist = []

for entry in itun_list:
    # Split on the first " - " only, so titles that contain " - " stay intact
    artist_name, song_title = entry.split(" - ", 1)
    itun_artist.append(artist_name)
    itun_title.append(song_title)

In [22]:
# Put song list into a dataframe
itun_df = pd.DataFrame({"Song": itun_title, "Artist": itun_artist})
itun_df.head()
print("Total", len(itun_df), "songs")

Total 200 songs


In [23]:
# Pool all songs in one dataframe (billboard + itunes)
# ignore_index=True renumbers the rows so index labels are not repeated
df = pd.concat([df, itun_df], ignore_index=True)
print("Total", len(df), "songs")

Total 300 songs


# NPR (multiple pages)

In [24]:
import time

html_lst = []
npr_list = []
pages = [1, 2, 3, 4, 5]
# Last four digits of the article ID in each page's URL
digit = [2083, 2978, 3422, 3900, 4266]

for di, page in zip(digit, pages):
    npr_req_get = r.get(f'https://www.npr.org/2022/12/15/113580{di}/100-best-songs-2022-page-{page}')
    npr_soup = BeautifulSoup(npr_req_get.content, 'html.parser')
    songs = npr_soup.find_all('h3', attrs={"class": "edTag"})
    html_lst.append(songs)

    for song in songs:
        npr_list.append(song.get_text())

    # Pause between requests so the server is not hit too often
    time.sleep(15)

html_lst

[[<h3 class="edTag">Little Simz</h3>,
  <h3 class="edTag">"Gorilla"<em> </em></h3>,
  <h3 class="edTag">Ian William Craig</h3>,
  <h3 class="edTag">"Attention For It Radiates"</h3>,
  <h3 class="edTag">Viking Ding Dong x Ravi B</h3>,
  <h3 class="edTag">"Leave It Alone (Remix)"</h3>,
  <h3 class="edTag">Adeem the Artist</h3>,
  <h3 class="edTag">"Middle of a Heart"<em> </em></h3>,
  <h3 class="edTag">Zahsosaa, D STURDY &amp; DJ Crazy</h3>,
  <h3 class="edTag">"Shakedhat"</h3>,
  <h3 class="edTag">Gabriels</h3>,
  <h3 class="edTag">"If You Only Knew"</h3>,
  <h3 class="edTag">DOMi &amp; JD BECK</h3>,
  <h3 class="edTag">"SMiLE"</h3>,
  <h3 class="edTag">Rema</h3>,
  <h3 class="edTag">"Calm Down"</h3>,
  <h3 class="edTag">Pigeon Pit</h3>,
  <h3 class="edTag">"milk crates"</h3>,
  <h3 class="edTag">Tyler Childers</h3>,
  <h3 class="edTag">"Angel Band (Jubilee Version)"</h3>,
  <h3 class="edTag">Straw Man Army</h3>,
  <h3 class="edTag">"Human Kind"</h3>,
  <h3 class="edTag">Guitarricadelaf

In [25]:
npr_list

['Little Simz',
 '"Gorilla" ',
 'Ian William Craig',
 '"Attention For It Radiates"',
 'Viking Ding Dong x Ravi B',
 '"Leave It Alone (Remix)"',
 'Adeem the Artist',
 '"Middle of a Heart" ',
 'Zahsosaa, D STURDY & DJ Crazy',
 '"Shakedhat"',
 'Gabriels',
 '"If You Only Knew"',
 'DOMi & JD BECK',
 '"SMiLE"',
 'Rema',
 '"Calm Down"',
 'Pigeon Pit',
 '"milk crates"',
 'Tyler Childers',
 '"Angel Band (Jubilee Version)"',
 'Straw Man Army',
 '"Human Kind"',
 'Guitarricadelafuente',
 '"Quien encendió la luz"',
 'Mary Halvorson',
 '"Night Shift"',
 'Leyla McCalla',
 '"Dodinin"',
 'The Mountain Goats',
 '"Bleed Out"',
 'NewJeans',
 '"Hype Boy"',
 'Joyce',
 '"Feminina"',
 'Ayra Starr',
 '"Rush"',
 'Disclosure feat. RAYE',
 '"Waterfall"',
 'Ari Lennox',
 '"POF"',
 'Next >',
 'The 1975',
 '"Part of the Band"',
 'Anna Tivel',
 '"Black Umbrella"',
 'Caroline Shaw & Attacca Quartet',
 '"First Essay (Nimrod)"',
 'Beth Orton',
 '"Friday Night"',
 'DJ Python',
 '"Angel"',
 'Patricia Brennan',
 '"Unquiet 

In [26]:
# Separate song title and artist into different lists
npr_artist = []
npr_title = []

for q in range(0, len(npr_list)):
    if q % 2:
        npr_title.append(npr_list[q])
    else:
        npr_artist.append(npr_list[q])


In [27]:
# Remove some characters from song title list
npr_correct_title = []
for ti in range(len(npr_title)):
    npr_correct_title.append(npr_title[ti].strip('"').replace('" ', ''))

# Remove some characters from artist list
npr_correct_artist = []
for ar in range(len(npr_artist)):
    npr_correct_artist.append(npr_artist[ar].strip('"').replace('" ', ''))

In [28]:
npr_df = pd.DataFrame({"Song": npr_correct_title, "Artist": npr_correct_artist})
npr_df

| | Song | Artist |
|---|---|---|
| 0 | Gorilla | Little Simz |
| 1 | Attention For It Radiates | Ian William Craig |
| 2 | Leave It Alone (Remix) | Viking Ding Dong x Ravi B |
| 3 | Middle of a Heart | Adeem the Artist |
| 4 | Shakedhat | Zahsosaa, D STURDY & DJ Crazy |
| ... | ... | ... |
| 97 | SAOKO | ROSALÍA |
| 98 | Runner | Alex G |
| 99 | El Apagón | Bad Bunny |
| 100 | ALIEN SUPERSTAR | Beyoncé |


- There are 2 extra rows.
- How can I find out which of them are not songs?
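One way to answer this, sketched on a short sample hand-copied from the `npr_list` output above: song titles on these pages are wrapped in double quotes, so an entry in a title position that does not start with a quote is suspicious. The 'Next >' entry visible in the printed list is such a case. It looks like a navigation link rather than a song, and dropping it before pairing also keeps artists and titles aligned. The set `navigation_labels` is a hypothetical name, and the assumption that every title starts with a quote should be checked against the full list.

```python
# Short sample copied from the npr_list output above
sample = ['Ari Lennox', '"POF"', 'Next >', 'The 1975', '"Part of the Band"']

# Assumption: navigation entries look like 'Next >' (extend the set as needed)
navigation_labels = {"Next >"}
entries = [e.strip() for e in sample if e.strip() not in navigation_labels]

# Pair up artist (even positions) with title (odd positions)
pairs = list(zip(entries[0::2], entries[1::2]))

# Flag pairs whose "title" is not wrapped in quotes
suspicious = [p for p in pairs if not p[1].startswith('"')]

print(pairs)       # [('Ari Lennox', '"POF"'), ('The 1975', '"Part of the Band"')]
print(suspicious)  # []
```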

In [29]:
pd.set_option('display.max_rows', None)

In [30]:
npr_df

| | Song | Artist |
|---|---|---|
| 0 | Gorilla | Little Simz |
| 1 | Attention For It Radiates | Ian William Craig |
| 2 | Leave It Alone (Remix) | Viking Ding Dong x Ravi B |
| 3 | Middle of a Heart | Adeem the Artist |
| 4 | Shakedhat | Zahsosaa, D STURDY & DJ Crazy |
| 5 | If You Only Knew | Gabriels |
| 6 | SMiLE | DOMi & JD BECK |
| 7 | Calm Down | Rema |
| 8 | milk crates | Pigeon Pit |
| 9 | Angel Band (Jubilee Version) | Tyler Childers |


In [31]:
# Concatenate into one dataframe (ignore_index=True renumbers the rows)
df = pd.concat([df, npr_df], ignore_index=True)
print("Total", len(df), "songs")

Total 402 songs


### Next step
- Check for duplicate songs
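A possible starting point, assuming duplicates should be detected on song and artist regardless of letter case: the sketch below uses a tiny made-up dataframe and `drop_duplicates` on lowercased helper columns. The column names `song_key` and `artist_key` are hypothetical. Note that this would not catch the same song listed with different artist credits (for example "Calm Down" appears with "Rema & Selena Gomez" on Billboard and with "Rema" on NPR).

```python
import pandas as pd

# Made-up rows for illustration
songs = pd.DataFrame({
    "Song": ["Flowers", "flowers", "Calm Down"],
    "Artist": ["Miley Cyrus", "Miley Cyrus", "Rema"],
})

# Normalised helper columns so that case and stray spaces do not hide duplicates
songs["song_key"] = songs["Song"].str.strip().str.lower()
songs["artist_key"] = songs["Artist"].str.strip().str.lower()

deduped = (
    songs.drop_duplicates(subset=["song_key", "artist_key"])
         .drop(columns=["song_key", "artist_key"])
         .reset_index(drop=True)
)

print(len(songs), "->", len(deduped))  # 3 -> 2
```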

### Note
- I learned later that .get_text(strip=True) returns the text with the leading and trailing whitespace already stripped, so the extra cleaning step with strip() and the regex isn't needed.
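As a sketch of that note, using a made-up snippet shaped like the Billboard title markup (not the real page):

```python
from bs4 import BeautifulSoup

snippet = '<h3 class="c-title">\n\n\t\n\t\tLast Night\t\t\n\t\n</h3>'
h3_tag = BeautifulSoup(snippet, "html.parser").find("h3")

print(repr(h3_tag.get_text()))            # text with the surrounding \n and \t
print(repr(h3_tag.get_text(strip=True)))  # 'Last Night'
```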