## Overview: "Good" songs

You will start **(but not complete)** a comparison of "good" songs as determined by two websites.
 - The [best music](https://pitchfork.com/reviews/best/tracks/) according to [Pitchfork](https://pitchfork.com/)
     - new (mostly independent) music
 - The [best music](https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/) according to [Billboard](billboard.com)
     - "good" defined based on record sales    
    
The analysis pipeline will
 - scrape top songs from pitchfork
 - scrape top songs from billboard
 - query the Spotify API to get popularity rankings on each song
 - produce the histogram shown below

<img src="https://i.ibb.co/0Z8VPQV/Screenshot-from-2021-02-25-15-02-18.png" alt="Drawing" style="width: 400px;"/>


## Part 1: Program design (28 points)
The task above may be completed by running the following script.  Note that `clean_pitchfork()` and `clean_billboard()` both return dataframes with columns `track` and `artist`.

```python
url_pitchfork = 'https://pitchfork.com/reviews/best/tracks/'
url_billboard = 'https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'
spot_api_key = '<spotify-key-here>'

# get html of each set of songs
html_str_pitchfork = get_url(url_pitchfork)
html_str_billboard = get_url(url_billboard)

# web scrape tracks from html of pages
df_pitchfork = clean_pitchfork(html_str_pitchfork)
df_billboard = clean_billboard(html_str_billboard)

# record source of each track
df_pitchfork['source'] = 'pitchfork'
df_billboard['source'] = 'billboard'

# concatenate all tracks
df_track = pd.concat((df_pitchfork, df_billboard), axis=0)

# query spotify API for popularity of each track
df_track = get_popularity(df_track, api_key=spot_api_key)

# plot histogram of popularity per source
hist_feat(df_track, feat='popularity')
```

For each of the functions listed in sub-parts below, write a function statement and docstring.  

The "work" of this problem is being able to clearly define the inputs and outputs as needed so the pipeline produces the desired result.  Be sure to describe the inputs / outputs of each function by writing the function statement / docstring as shown in the example below:

```python

def some_fnc(input0, input1):
    """ this function does a thing!
    
    Args:
        input0 (type of input0): input0 is a ...
        input1 (type of input1): input1 is ...
        
    Returns:
        output0 (type of output0): output0 is ...
    """
    # "pass" allow us to end an indentation body without causing
    # any errors when from the python interpreter
    pass
```

### Part 1.1: `get_url()`

In [144]:
def get_url(url):
    '''
    renders text of url page as str
    args: https link to webpage (str)
    returns: text format of html content (str)
    '''
    pass

### Part 1.2: `clean_pitchfork()`
(No need to write a seperate docstring for `clean_billboard()`, as it has the same inputs / outputs as `clean_pitchfork()`. 

In [145]:
def clean_pitchfork(html_text):
    '''
    changes raw html text content into pandas dataframe
    argrs: html text of song list page (str)
    returns: pandas dataframe of track and artist per song
    '''
    pass

### Part 1.3 `get_popularity()`

In [146]:
def get_popularity(songs_df, api_key):
    '''
    finds value of popularity for given songs
    args: dataframe of songs, including track title and artist
    returns: df_track where each row is a song and popularity is col
    '''
    pass

### Part 1.4: `hist_feat()`

In [147]:
def hist_feat(songs_df, feat_lst):
    '''
    plots histogram of popularity for each song
    args: dataframe of songs, list of feature values to plot (ints)
    returns: plt histogram of song popularity by website
    '''
    pass

### Part 2: Build `get_url()` (6 points)
When you're done, check that it works by outputting to the jupyter notebook the `html_str` associated with input:
```python
url='https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
```

Tip: you can click or double click the margin just below `Out[x]` to hide / limit this output ... the full html string can be quite long.

In [5]:
import requests

In [6]:
# even simpler:    return requests.get(url).text

In [7]:
def get_url(url):
    '''
    renders text of url page as str
    args: https link to webpage (str)
    returns: text format of html content (str)
    '''
    
    html_resp = requests.get(url)
    html_text = html_resp.text

    return html_text

In [8]:
url='https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
get_url(url);

<!-- extracts song info from a list of top songs using webscraping -->

### Part 3:  Build `clean_pitchfork()`  (28 points)

Build `clean_pitchfork()`

- You may skip the initial track "Porridge Radio"
- Be sure to remove the double quotes: `“` `”` from the track names.  Note these are not the typical <shift + comma> character, copy and paste them from above to ensure you get the proper string match.

When you're done, check that it works by outputting to the jupyter notebook the first few rows of a DataFrame of Pitchfork songs:
```python
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch.head()
```

which should show (as of Feb 22 @ 1PM):

| artist |           track |                                     source |           |
|-------:|----------------:|-------------------------------------------:|-----------|
|      0 |           yeule |                           Bites on My Neck | pitchfork |
|      1 |       Two Shell |                                       home | pitchfork |
|      2 |   Nilüfer Yanya |                               Midnight Sun | pitchfork |
|      3 |        Soul Glo | Jump!! (Or Get Jumped!!!)((By the Future)) | pitchfork |
|      4 | Earl Sweatshirt |                                       2010 | pitchfork |

In [9]:
from bs4 import BeautifulSoup
import pandas as pd

In [24]:
def clean_pitchfork(html_text):
    '''
    changes raw html text content into pandas dataframe
    argrs: html text of song list page (str)
    returns: pandas dataframe of track and artist per song
    '''
    
    # build soup object from text
    soup = BeautifulSoup(html_text)
    
    song_df = pd.DataFrame()
    
    
    for song in soup.find_all(class_='track-collection-item'):
        
        # extract artist
        artist = song.find_all('ul', class_='artist-list')[0].text
        #extract track
        track = song.find_all('h2', class_='track-collection-item__title')[0].text
        # discard all directional double quotes
        track = track.replace('“', '')
        track = track.replace('”', '')
        
        # collect song data in dataframe
        song_dict = {'artist': artist, 
                     'track': track,
                     'source': 'pitchfork'}
        
        song_df = song_df.append(song_dict, ignore_index=True)
        
        # extract song name by class
        #song_name = song.find_all('h2', class_='track-collection-item__title')[0].text
        #song_name = song.text.split('“”')
         
        #song_names = song_name
        
        #song_df['name'] = song_names
        #song_df['artist'] = song.find_all(class_='linked display-name display-name--linked')
        
        
        
    return song_df
    

In [25]:
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch

Unnamed: 0,artist,source,track
0,Floating Points,pitchfork,Vocoder
1,Charlotte AdigéryBolis Pupul,pitchfork,It Hit Me
2,Porridge Radio,pitchfork,Back to the Radio
3,Caroline Polachek,pitchfork,Billions
4,yeule,pitchfork,Bites on My Neck
5,Two Shell,pitchfork,home
6,Nilüfer Yanya,pitchfork,Midnight Sun
7,Soul Glo,pitchfork,Jump!! (Or Get Jumped!!!)((By the Future))
8,Earl Sweatshirt,pitchfork,2010
9,Adele,pitchfork,To Be Loved


### Part 4 Managing the scrolling on Pitchfork's website (10 points)

Notice that as one scrolls to the bottom of the pitchfork page the `?page=x` counter increments.  [Try it yourself](https://pitchfork.com/reviews/best/tracks/).  Just as we did with the API work, we can modify the URL to get different sets of songs from Pitchfork.

Write a script which scrolls through 10 pages of Pitchfork's music reccomendations and collects all the songs you find into a single `df_pitch` DataFrame.  Be sure to use the functions you've created above.

Validation: We found 56 songs running this on Feb 22.

In [154]:
def crawl_scrolling_page(pages):
    '''
    for pitchfork scrolling from html of song lists to pandas 
    args: num of scrolings pages to scrape
    returns: a single df for song lists
    '''
    
    
    page_nums = [x for x in range(pages)]
    
    df_pitch = pd.DataFrame()
    
    # iterate through get url process specified times
    for i in range(len(page_nums)):
        
        page_num = page_nums[i]
        
        # change page location for each iteration
        url = f'https://pitchfork.com/reviews/best/tracks/?page={page_num}'
        
        page_html = get_url(url)
        df_pitch = df_pitch.append(clean_pitchfork(page_html))
        
        
    return df_pitch
    # output into one dataframe

In [155]:
# call function for ten pages
##this is consistent with the 56 items on the validation
ten_song_pages = crawl_scrolling_page(10)
print(ten_song_pages.shape)
ten_song_pages.head()

(56, 2)


Unnamed: 0,name,artist
0,[“Back to the Radio”],"[[by: ], Jayson Greene]"
1,[“Billions”],"[[by: ], Gio Santiago]"
2,[“Bites on My Neck”],"[[by: ], Marc Hogan]"
3,[“home”],"[[by: ], Philip Sherburne]"
4,[“Midnight Sun”],"[[by: ], Jayson Greene]"


### Part 5 (28 points)

<img src="https://i.ibb.co/wht5NB0/Screenshot-from-2022-02-23-05-14-26.png" alt="Drawing" style="width: 600px;"/>

Write a function, `clean_quote()` which scrapes all the quotes from https://www.brainyquote.com/topics/websites-quotes:

```python
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html)
df_quote.head()
```

gives:

|   |          author |                                              text |
|--:|----------------:|--------------------------------------------------:|
| 0 |  Shreya Ghoshal | I'm not a gadget freak, so to say. I own an iP... |
| 1 | Anthony Carmona | Social media websites are no longer performing... |
| 2 |    M. J. Hyland | As is the case for many people with multiple s... |
| 3 |     Brie Larson | There are so many opportunities to learn thing... |
| 4 |      Ben Barnes |        There are loads of websites devoted to me. |

**Extra Credit (up to +3 points)**: Navigate to each quote's own webpage and you'll find more information:

<img src="https://i.ibb.co/ZKQS1ks/Screenshot-from-2022-02-23-05-14-37.png" alt="Drawing" style="width: 600px;"/>

Store the tags associated with each quote too.  For example, Bill Gate's quote above has tags: `'truth'`, `'government'`, `'internet'`, `'never'` and '`hard'`.  Think carefully about how you store the tags so that one may easily understand how many times each tag (e.g. `'internet'`) appears in your dataframe with simple pandas manipulations (hint: look tags are stored for boardgames in `Out [3]` of the `ipynb` for the [board game example project](https://course.ccs.neu.edu/ds3000/proj_example.html)).


In [156]:
def clean_quote(html_text):
    '''
    scrapes quotes and authors from url
    args: str formatted html text
    returns: df of author and quote text
    '''
    # build soup object from text
    soup = BeautifulSoup(html_text)
    
    
    quote_df = pd.DataFrame()
    
    #for quote in soup.find_all(class_='bq-aut qa_623749 oncl_a'):
    authors = soup.find_all(class_='bq-aut qa_623749 oncl_a')
        
        #quote_df['text'] = quote.text
    quote_df['text'] = soup.text
    quote_df['author'] = authors

    return quote_df

In [157]:
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html)
df_quote
#df_quote.head()

Unnamed: 0,text,author
0,,[Shreya Ghoshal]
