# DS 3000 HW 4

Due: Fri Mar 5 @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file to gradescope.

The `ipynb` format stores outputs from the last time you ran the notebook.  (When you open a notebook it has the figures and outputs of the last time you ran it too).  To ensure that your submitted `ipynb` file represents your latest code, make sure to give a fresh run "Kernel > Restart & Run All" just before uploading the `ipynb` file to gradescope.

### Academic Integrity

**Writing your homework is an individual effort.**  You may discuss general python problems with other students but under no circumstances should you observe another student's code which was written for this assignment, from this year or past years.  Pop into office hours or post a piazza note if you have a specific question about your work you'd like another pair of eyes to talk through.  (Remember, mark your piazza note private if it contains anything which may be considered a solution to the exercise).

Don't forget to cite websites which helped you solve a problem in a unique way.  You can do this in markdown near the code or with a simple one-line comment.  For example, a python trick I find particularly clever (and useful, sometimes):

```python
from collections import defaultdict

def tree(): 
    # https://gist.github.com/hrldcpr/2012250
    return defaultdict(tree)
```

You need not cite the official python documentation or the documentation of any python library which is imported in the template (e.g. matplotlib, numpy, scipy).

**Documentation / style counts for credit**  Please see our course's python style guide, available on canvas, for further information.

## Overview: "Good" songs

You will start **(but not complete)** a comparison of "good" songs as determined by two websites.
 - The [best music](https://pitchfork.com/reviews/best/tracks/) according to [Pitchfork](https://pitchfork.com/)
     - new (mostly independent) music
 - The [best music](https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/) according to [Billboard](https://www.billboard.com/)
     - "good" defined based on record sales    
    
The analysis pipeline will
 - scrape top songs from pitchfork
 - scrape top songs from billboard
 - query the Spotify API to get popularity rankings on each song
 - produce the histogram shown below

<img src="https://i.ibb.co/0Z8VPQV/Screenshot-from-2021-02-25-15-02-18.png" alt="Drawing" style="width: 400px;"/>


## Part 1: Program design (28 points)
The task above may be completed by running the following script.  Note that `clean_pitchfork()` and `clean_billboard()` both return dataframes with columns `track` and `artist`.

```python
url_pitchfork = 'https://pitchfork.com/reviews/best/tracks/'
url_billboard = 'https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'
spot_api_key = 'aisduhfaoidshufaoidshufapodsihfapiu'

# get html of each set of songs
html_str_pitchfork = get_url(url_pitchfork)
html_str_billboard = get_url(url_billboard)

# web scrape tracks from html of pages
df_pitchfork = clean_pitchfork(html_str_pitchfork)
df_billboard = clean_billboard(html_str_billboard)

# record source of each track
df_pitchfork['source'] = 'pitchfork'
df_billboard['source'] = 'billboard'

# concatenate all tracks
df_track = pd.concat((df_pitchfork, df_billboard), axis=0)

# query spotify API for popularity of each track
df_track = get_popularity(df_track, api_key=spot_api_key)

# plot histogram of popularity per source
hist_feat(df_track, feat='popularity')
```

For each of the functions listed in sub-parts below, write a function statement and docstring.  

The "work" of this problem is being able to clearly define the inputs and outputs as needed so the pipeline produces the desired result.  Be sure to describe the inputs / outputs of each function by writing the function statement / docstring as shown in the example below:

```python

def some_fnc(input0, input1):
    """ this function does a thing!
    
    Args:
        input0 (type of input0): input0 is a ...
        input1 (type of input1): input1 is ...
        
    Returns:
        output0 (type of output0): output0 is ...
    """
    # "pass" lets us end an indented body without causing
    # any errors from the python interpreter
    pass
```

### Part 1.1: `get_url()`

In [1]:
# Get the full html of a web page

def get_url(link):
    """ requests the page at the given link and returns its html as a string
    
    Args:
        link (str): the url of the target web page
    
    Returns:
        html_str (str): the text of the requested html
    """
    return

### Part 1.2: `clean_pitchfork()`
(No need to write a separate docstring for `clean_billboard()`, as it has the same inputs / outputs as `clean_pitchfork()`.)

In [2]:
# Clean the html_str and convert it into a dataframe,
# finding the target song list via find_all() on a class
def clean_pitchfork(html_str):
    """ parses the html_str returned by get_url() and returns a dataframe of tracks
    
    Args:
        html_str (str): the html of the page, as a string

    Returns:
        df (pd.DataFrame): one row per track, with columns track and artist
    """
    return

### Part 1.3 `get_popularity()`

In [3]:
# query spotify API for popularity of each track
def get_popularity(df, api_key):
    """ gets the popularity of each track from the spotify API

    Args:
        df (pd.DataFrame): the cleaned dataframe of tracks from the clean_*() functions
        api_key (str): key for the spotify API

    Returns:
        df (pd.DataFrame): the input dataframe with the popularity of each track added
    """
    return

### Part 1.4: `hist_feat()`

In [4]:
# plot histogram of popularity per source by taking in the dataframe and the feat
# the feat is the column which is drawn along the x axis
def hist_feat(df, feat):
    """ draws a histogram of the given feature

    Args:
        df (pd.DataFrame): the dataframe from the previous function, containing the data to plot
        feat (str): the name of the column in df to plot, used as the x axis

    Returns:
        plot: the histogram of the given feature
    """
    return

### Part 2: Build `get_url()` (6 points)
When you're done, check that it works by outputting to the jupyter notebook the `html_str` associated with input:
```python
url='https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'
```

Tip: double click / click the margin just below `Out[x]` to hide / show this output ... the full html string can be quite long

**Hint:** Stuck on what exactly `get_url()` does?  Check the class notes for examples

In [5]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np


In [6]:
# Get the full html of a web page
def get_url(link):
    """ requests the page at the given link and returns its html as a string
    
    Args:
        link (str): the url of the target web page
    
    Returns:
        html_str (str): the text of the requested html
    """
    # request the page and convert the response to text
    html_str = requests.get(link).text
    return html_str

In [7]:
# urls of the two song lists
url_pitchfork = 'https://pitchfork.com/reviews/best/tracks/'
url_billboard = 'https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'

In [8]:
# get the html of each page via get_url()
html_strpitchfork = get_url(url_pitchfork)
html_strbillboard = get_url(url_billboard)

In [9]:
html_strbillboard

'\n<!doctype html>\n<html lang="en">\n<head>\n<title data-rh="true">Best Songs of 2020: The 50 Best | Billboard</title>\n<meta data-rh="true" charset="utf-8" /><meta data-rh="true" http-equiv="x-ua-compatible" content="ie=edge" /><meta data-rh="true" name="viewport" content="width=device-width, initial-scale=1" /><meta data-rh="true" name="theme-color" content="#344072" /><meta data-rh="true" name="twitter:site" content="@billboard" /><meta data-rh="true" property="og:site_name" content="Billboard" /><meta data-rh="true" property="og:url" content="https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/" /><meta data-rh="true" name="description" content="The year 2020 was unforgettable for many reasons, including its pop music. Here are the 100 songs we most hope to remember it by. " /><meta data-rh="true" name="og:description" property="og:description" content="The year 2020 was unforgettable for many reasons, including its pop music. Here are the 100 songs we mos


### Part 3:  Build `clean_pitchfork()`  (28 points)
Be sure that you clean up the track names by discarding those pesky double quotes.  (Note, the quotes used are directional and not the typical straight `"` character typed with <shift + apostrophe>.  We include them below so you can copy paste for convenience).  Hint: One way to approach this is to [replace](https://docs.python.org/3/library/stdtypes.html#str.replace) them with an empty string.

the offending characters:

“ ”
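
A minimal sketch of the replace approach hinted at above. The helper name `strip_curly_quotes` is hypothetical (it is not part of the assignment template), and the characters are written as unicode escapes so they cannot be confused with straight quotes:

```python
def strip_curly_quotes(title):
    """Remove directional double quotes from a track title."""
    # U+201C and U+201D are the left and right directional double quotes
    for quote in ('\u201c', '\u201d'):
        title = title.replace(quote, '')
    return title.strip()


cleaned = strip_curly_quotes('\u201cHard Drive\u201d')
print(cleaned)
```

Only the two double-quote characters are removed, so a directional apostrophe (U+2019) inside a title is left in place.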


When you're done, check that it works by outputting to the jupyter notebook the first few rows of a DataFrame of Pitchfork songs:
```python
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch1 = clean_pitchfork(html_str)
df_pitch1.head()
```

which should show:

|   |              artist |    source |                   track |
|--:|--------------------:|----------:|------------------------:|
| 0 |   Cassandra Jenkins | pitchfork |              Hard Drive |
| 1 | The Weather Station | pitchfork |                  Robber |
| 2 |     Adrianne Lenker | pitchfork |                anything |
| 3 |    Jazmine Sullivan | pitchfork |                Lost One |
| 4 |               Nazar | pitchfork | Bunker [ft. Shannen SP] |

In [10]:
# clean up the html_str to get artist, source and track info, storing them in a dataframe
def clean_pitchfork(html_str):
    """ parses the html_str returned by get_url() and returns a dataframe of tracks
    
    Args:
        html_str (str): the html of a pitchfork "best tracks" page

    Returns:
        df (pd.DataFrame): one row per track, with columns artist, source and track
    """
    # parse the html (naming the parser explicitly avoids a bs4 warning)
    soup = BeautifulSoup(html_str, 'html.parser')
    # lists which will become the columns of the dataframe
    dic = {"artist": [], "source": [], "track": []}
    
    # each element of class "row" may hold one track
    for row in soup.find_all(class_="row"):
        # skip rows without a track title
        if row.h2 is None:
            continue
        
        # note: multiple artists in one row are joined without a separator
        artist = row.ul.text.strip()
        # clean up the track name by discarding the directional double quotes
        track = row.h2.text.strip().replace('“', '').replace('”', '')
        
        # store the values of this track
        dic["artist"].append(artist)
        dic["source"].append("pitchfork")
        dic["track"].append(track)
        
    # create the dataframe from the completed dict
    df = pd.DataFrame(dic)
    
    return df

    

In [11]:
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch1 = clean_pitchfork(html_str)
df_pitch1.head()

|   |                          artist |    source |          track |
|--:|--------------------------------:|----------:|---------------:|
| 0 |              Japanese Breakfast | pitchfork |       Be Sweet |
| 1 | FKA twigsHeadie OneFred again.. | pitchfork | Don’t Judge Me |
| 2 |               Cassandra Jenkins | pitchfork |     Hard Drive |
| 3 |             The Weather Station | pitchfork |         Robber |
| 4 |                 Adrianne Lenker | pitchfork |       anything |
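
The artist in row 1 of this output reads as several names run together, which is what `.text` produces when an element has multiple children. As a hedged sketch only: assuming each credited artist sits in its own `<li>` (a guess about the page markup, not verified here), `get_text()` with a separator keeps the names apart. The html snippet below is hypothetical:

```python
from bs4 import BeautifulSoup

# hypothetical markup: one <li> per credited artist
html = '<ul><li>FKA twigs</li><li>Headie One</li><li>Fred again..</li></ul>'
ul = BeautifulSoup(html, 'html.parser').ul

# .text concatenates the text of every child with nothing in between
joined = ul.text.strip()
# get_text() can insert a separator between the pieces instead
separated = ul.get_text(separator=', ', strip=True)

print(joined)
print(separated)
```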


### Part 4 Managing the scrolling on Pitchfork's website (10 points)
Notice that as one scrolls to the bottom of the pitchfork page the `?page=x` counter increments.  [Try it yourself](https://pitchfork.com/reviews/best/tracks/).  Just as we did with the API work, we can modify the URL to get different sets of songs from Pitchfork.

Write a script which scrolls through 10 pages of Pitchfork's music recommendations and aggregates all the songs you find into a single `df_pitch` DataFrame.  Use the functions you've created above!

Validation: We found 56 songs running this on Feb 25
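
The URL modification described above can be sketched on its own, without any network access. The helper name `page_urls` and the constant `BASE_URL` are hypothetical, chosen for illustration:

```python
BASE_URL = 'https://pitchfork.com/reviews/best/tracks/'


def page_urls(n_pages, base_url=BASE_URL):
    """Build the urls for pages 1..n_pages of a paginated listing."""
    # pages are counted from 1, so the range runs to n_pages inclusive
    return [f'{base_url}?page={page}' for page in range(1, n_pages + 1)]


urls = page_urls(10)
print(urls[0])
print(urls[-1])
```

Each of these urls could then be passed to `get_url()` and the resulting html to `clean_pitchfork()`.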

In [12]:
# script to scroll through 10 pages of Pitchfork's music recommendations
listsong = []
# go through pages 1 to 10
for i in range(1, 11):
    # url of page i
    url = f'https://pitchfork.com/reviews/best/tracks/?page={i}'
    # get html of the page
    html_str = get_url(url)
    # scrape the tracks on this page
    df_pitch = clean_pitchfork(html_str)
    # store in the list for later concatenation
    listsong.append(df_pitch)
        


In [13]:
# concatenate all of the dataframes and rebuild the index
df_tenpages = pd.concat(listsong, ignore_index=True)
# show the first 20 songs
df_tenpages.head(20)

|   |                          artist |    source |                   track |
|--:|--------------------------------:|----------:|------------------------:|
| 0 |              Japanese Breakfast | pitchfork |                Be Sweet |
| 1 | FKA twigsHeadie OneFred again.. | pitchfork |          Don’t Judge Me |
| 2 |               Cassandra Jenkins | pitchfork |              Hard Drive |
| 3 |             The Weather Station | pitchfork |                  Robber |
| 4 |                 Adrianne Lenker | pitchfork |                anything |
| 5 |                Jazmine Sullivan | pitchfork |                Lost One |
| 6 |                           Nazar | pitchfork | Bunker [ft. Shannen SP] |
| 7 |          Moor Motherbilly woods | pitchfork |                  Furies |
| 8 |                  Sufjan Stevens | pitchfork |                 America |
| 9 |             Megan Thee Stallion | pitchfork |       Girls in the Hood |


In [14]:
# validation: count the songs found (56 were found on Feb 25)
len(df_tenpages)

56

## "Scraping quotes sounds like a fun idea for HW"
-Prof Higger

### Part 5 (28 points)
Write a function, `clean_quote()` which scrapes all the quotes from https://www.brainyquote.com/topics/websites-quotes_1.  Your resulting dataframe should contain a column for the `text` and `author` of each quote.  You can get an `html_str` from your `get_url()` function as defined above.

<img src="https://i.ibb.co/vXb8xvz/Screenshot-from-2021-02-25-15-19-17.png" alt="Drawing" style="width: 600px;"/>

**Extra Credit (up to +3 points)**: Store the tags associated with each quote too.  For example, Bill Gates's quote above has three tags: `'truth'`, `'government'` and `'internet'`.  Think carefully about how you store the tags so that one may easily understand exactly how many of each tag appear in your dataframe with simple pandas manipulations.

In [15]:

# clean_quote gets the quotes from the html of the target page and returns a dataframe
def clean_quote(html_str):
    """ scrapes all the quotes from the html of a brainyquote page
    
    Args:
        html_str (str): the html returned by the get_url() function
    
    Returns:
        df (pd.DataFrame): one row per quote, with columns author, text and tag
            (tag is a list of up to three tag strings)
    """
    # parse the html (naming the parser explicitly avoids a bs4 warning)
    soup = BeautifulSoup(html_str, 'html.parser')
    # lists which will become the columns of the dataframe
    dic = {"author": [], "text": [], "tag": []}
    
    # each element with this class attribute holds one quote
    for info in soup.find_all(class_="m-brick grid-item boxy bqQt r-width"):
        # get the author and text of the quote
        author = info.find_all(title='view author')[0].text
        text = info.find_all(title='view quote')[0].text
        # get the tag buttons within the tag box
        tagclass = info.find_all(class_='qll-dsk-kw-box')[0]
        tag = tagclass.find_all(class_="qkw-btn btn btn-xs oncl_klc")
        # store the first three tags of each quote in one list
        # (slicing avoids an IndexError when a quote has fewer than three)
        taglist = [t.text for t in tag[:3]]
        
        # store the values of this quote
        dic["text"].append(text)
        dic["author"].append(author)
        dic["tag"].append(taglist)
        
    # create the dataframe from the completed dict
    df = pd.DataFrame(dic)
        
    return df
        
    

In [16]:
# scrape the quotes page and show the first few rows
url = "https://www.brainyquote.com/topics/websites-quotes_1"
html_str = get_url(url)
clean_quote(html_str).head()

|   |           author |                                               text |                               tag |
|--:|-----------------:|---------------------------------------------------:|----------------------------------:|
| 0 |  Anthony Carmona | Social media websites are no longer performing... | [Positive, Communication, Family] |
| 1 |   Shreya Ghoshal | I'm not a gadget freak, so to say. I own an iP... |        [Love, Technology, Always] |
| 2 |   Michael Bennet | As we all become increasingly reliant on socia... |     [Information, New, Important] |
| 3 | David McCandless | In an endless jungle of websites with text-bas... |        [Space, Beautiful, Jungle] |
| 4 |     David Talbot | I think there is a difference between Slate an... |   [Thankful, Internet, Important] |
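
With tags stored as one list per quote, as in the `tag` column of this output, one possible way to count how often each tag appears is `explode()` followed by `value_counts()`. This is a sketch on toy data (the authors and tag lists below are illustrative, not scraped), and other storage layouts would work too:

```python
import pandas as pd

# toy data shaped like the scraped dataframe (values are illustrative)
df_quote = pd.DataFrame({
    'author': ['author A', 'author B'],
    'tag': [['Internet', 'Important'], ['Important', 'New']],
})

# explode() gives each tag its own row, repeating the other columns
df_tag = df_quote.explode('tag')

# count how many quotes carry each tag
tag_count = df_tag['tag'].value_counts()
print(tag_count)
```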

