# __Country Music Lyric Analysis__
#### Analyzing song lyrics with natural language processing and Non-negative Matrix Factorization (NMF) / Correlation Explanation (CorEx) topic models  
<hr>  
This project explores country music as a genre by extracting and analyzing lyrics from songs that appeared on [Billboard's weekly hot 50 charts](https://www.billboard.com/archive/charts/2018/country-songs). Topic modeling, though not an exhaustive approach, is useful for exploring music genres because it surfaces themes and motifs in song lyrics that are otherwise hard to spot. Adding a time dimension to the dataset also makes it possible to visualize trends in topic distributions over several decades.  

In this case, the models and visualizations show that country music can, in fact, be quite relatable: its lyrics represent a lot more than just beer, trucks, and women (themes commonly present in modern-day bro-country). Applying CorEx with anchor words in a semi-supervised manner also reveals some more interesting, esoteric topics in country music.  

The project can be broken down into the following steps:  

#### ***1. Data Acquisition***  
\- ***1a. Scrape Billboard chart archives and populate corpus of country songs***  
\- 1b. Scrape lyrics for each song from WikiLyrics and Genius APIs
#### 2. Preprocessing - Lyrics / Data  
\- 2a. Use natural language processing and other methods to process text lyrics and data. Introduce some EDA and basic feature engineering    
#### 3. Models / Analysis  
\- 3a. Apply non-negative matrix factorization and CorEx to model topics and then analyze the results  



<hr>

# __1a. Data Acquisition - Billboard Country Top 50__

### __Sections__  

[1a1. Functions to scrape Billboard](#1a1)  
[1a2. Scrape from Billboard](#1a2)  
[1a3. Process / Save Data](#1a3)  

In [77]:
import pandas as pd
import numpy as np
import requests
import os

# Imports for scraping
from bs4 import BeautifulSoup
from multiprocessing import Pool
from tqdm import tqdm
from datetime import datetime, timedelta

from rq_config import project_4_path

<a id='1a1'></a>

### __1a1. Functions to scrape Billboard__

In [73]:
def generate_dates(first_week,weeks):
    """
    Generates a list of dates to scrape from Billboard's archives
    
    Parameters
    -----
    first_week: The most recent chart week to start scraping from (e.g. 2019-05-29)
    weeks: Total number of weeks to scrape (e.g. 104 = 2019-05-29 -> 2017-05-29)
    
    Returns
    -----
    List: List of all dates to be scraped from Billboard
    """
    # Convert week to datetime format
    week = datetime.strptime(first_week,'%Y-%m-%d')
    interval = timedelta(days = 7)
    alldates = []
    i = 0
    
    # Create week list by going back 7 days at a time and appending to list
    while i < weeks:
        alldates.append(week.strftime('%Y-%m-%d'))
        week = (week - interval)
        i += 1
    return alldates
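
As a quick sanity check on the date arithmetic, the sketch below rebuilds the same weekly list with a list comprehension. `weekly_dates` is a hypothetical stand-in for `generate_dates`, not the notebook's own function; it is assumed to behave the same way for well-formed `YYYY-MM-DD` input.

```python
from datetime import datetime, timedelta


def weekly_dates(first_week, weeks):
    """Hypothetical equivalent of generate_dates: step back 7 days at a time."""
    start = datetime.strptime(first_week, '%Y-%m-%d')
    return [(start - timedelta(weeks=i)).strftime('%Y-%m-%d') for i in range(weeks)]


# Three weeks counting back from the notebook's starting week
print(weekly_dates('2019-05-12', 3))
# ['2019-05-12', '2019-05-05', '2019-04-28']
```

The first entry is always `first_week` itself, so `weeks` counts the starting week as well.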

In [74]:
def scrape_week(url):
    """
    Pulls lists of songs, artists, and ranks from a single chart week in Billboard's archives
    
    Parameters
    -----
    url: unique week URL to be scraped (e.g. https://www.billboard.com/archive/charts/1960/r-b-hip-hop-songs/2019-05-02)
    
    Returns
    -----
    List: Concatenated list with songs, artists, ranks, and respective weeks
    """     
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page,'lxml')
    weekslist = []
    
    songlist = [song.text.strip() for song in soup.find_all(class_="chart-list-item__title-text")]
    artistlist = [artist.text.strip() for artist in soup.find_all(class_="chart-list-item__artist")]
    rankslist = [rank.text.strip() for rank in soup.find_all(class_="chart-list-item__rank")]
    weekslist.extend([url[-10:]]*len(songlist))
    
    return [songlist,artistlist,rankslist,weekslist]
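
To see what the class-based lookups in `scrape_week` return without hitting the network, here is an offline sketch. It is not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra dependencies, and the HTML fragment, the `ClassTextCollector` class, and the example URL are all made up for illustration. Only the three class names come from the function above.

```python
from html.parser import HTMLParser

# Hypothetical fragment that reuses the class names scrape_week looks for
SAMPLE_HTML = """
<div class="chart-list-item">
  <div class="chart-list-item__rank">1</div>
  <span class="chart-list-item__title-text"> Whiskey Glasses </span>
  <div class="chart-list-item__artist"> Morgan Wallen </div>
</div>
<div class="chart-list-item">
  <div class="chart-list-item__rank">2</div>
  <span class="chart-list-item__title-text"> Beautiful Crazy </span>
  <div class="chart-list-item__artist"> Luke Combs </div>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collects the text of elements carrying one of the target classes.

    Simplification: assumes target elements contain plain text only (no nested tags).
    """

    def __init__(self, targets):
        super().__init__()
        self.found = {target: [] for target in targets}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        for target in self.found:
            if target in classes:
                self._current = target
                self.found[target].append('')

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current][-1] += data

    def handle_endtag(self, tag):
        self._current = None


url = 'https://www.billboard.com/charts/country-songs/2019-05-12'  # example only
collector = ClassTextCollector(['chart-list-item__title-text',
                                'chart-list-item__artist',
                                'chart-list-item__rank'])
collector.feed(SAMPLE_HTML)
collector.close()

songlist = [s.strip() for s in collector.found['chart-list-item__title-text']]
artistlist = [a.strip() for a in collector.found['chart-list-item__artist']]
rankslist = [r.strip() for r in collector.found['chart-list-item__rank']]
weekslist = [url[-10:]] * len(songlist)  # last 10 characters are the chart date

print([songlist, artistlist, rankslist, weekslist])
```

The four lists are index-aligned, which is what allows them to be flattened into one row per chart entry later on.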

In [75]:
def scrape_billboard(first_week,last_week,url):
    """
    Generate unique URLs and scrape them concurrently using a multiprocessing pool
    
    Parameters
    -----
    first_week: The most recent chart week that is being scraped (e.g. 2019)
    last_week: The earliest chart week (e.g. 1959, etc.)
    url: Specific URL from Billboard website. Can be alternated with other genres
    
    Returns
    -----
    List: Nested lists which contain data from all Billboard weekly archives that were scraped
    """
    
    # Total number of weeks to scrape
    n_weeks = int(round((datetime.strptime(first_week,'%Y-%m-%d') - datetime.strptime(last_week,'%Y-%m-%d')).days / 7))
    
    # Generate list of unique URLs to scrape from Billboard
    date_list = generate_dates(first_week,n_weeks)
    url_list = [url + week for week in date_list]
    
    # Scrape html responses from Billboard
    print(f'Scraping {len(url_list)} weeks from Billboard.')
          
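    # Fan the requests out across 25 worker processes; imap preserves input order.
    # The context manager shuts the pool down once all results are collected.
    with Pool(25) as pool:
        chart_data = list(tqdm(pool.imap(scrape_week,url_list),total = len(url_list)))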
    
    return chart_data
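
The URL generation and the fan-out step can be tried offline with the sketch below. It is an illustration under stated assumptions rather than the notebook's implementation: it uses a thread pool (`concurrent.futures.ThreadPoolExecutor`) instead of `multiprocessing.Pool`, on the assumption that the work is mostly waiting on network responses, and `week_urls` and `fake_scrape_week` are hypothetical helpers that make no HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

BASE_URL = 'https://www.billboard.com/charts/country-songs/'


def week_urls(first_week, last_week, base_url=BASE_URL):
    """Build one URL per chart week, newest first (same arithmetic as scrape_billboard)."""
    newest = datetime.strptime(first_week, '%Y-%m-%d')
    oldest = datetime.strptime(last_week, '%Y-%m-%d')
    n_weeks = int(round((newest - oldest).days / 7))
    return [base_url + (newest - timedelta(weeks=i)).strftime('%Y-%m-%d')
            for i in range(n_weeks)]


def fake_scrape_week(url):
    """Hypothetical stand-in for scrape_week: returns the week label, no network call."""
    return url[-10:]


urls = week_urls('2019-05-12', '1959-07-23')
print(len(urls))  # 3120

# executor.map returns results in the same order as the input URLs
with ThreadPoolExecutor(max_workers=25) as executor:
    weeks = list(executor.map(fake_scrape_week, urls))

print(weeks[:2])  # ['2019-05-12', '2019-05-05']
```

Because results come back in input order, the scraped weeks stay sorted from newest to oldest regardless of which worker finishes first.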

<a id='1a2'></a>

### __1a2. Scrape from Billboard weekly__

In [79]:
# Week to start scraping from

first_week = '2019-05-12'

# Last week to scrape
last_week = '1959-07-23'

# Scrape base url

base_url = 'https://www.billboard.com/charts/country-songs/'

In [34]:
# Scrape responses - this should take a few minutes

billboard_responses = scrape_billboard(first_week,last_week,base_url)

Scraping 3120 weeks from Billboard.


100%|██████████| 3120/3120 [06:02<00:00,  8.61it/s]


In [38]:
billboard_df = pd.DataFrame(billboard_responses,columns = ['songs','artists','ranks','weeks'])

In [54]:
billboard_df.head()

| | songs | artists | ranks | weeks |
|---|---|---|---|---|
| 0 | [Whiskey Glasses, Beautiful Crazy, God's Count... | [Morgan Wallen, Luke Combs, Blake Shelton, Cha... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [2019-05-12, 2019-05-12, 2019-05-12, 2019-05-1... |
| 1 | [Beautiful Crazy, Whiskey Glasses, God's Count... | [Luke Combs, Morgan Wallen, Blake Shelton, Cha... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [2019-05-05, 2019-05-05, 2019-05-05, 2019-05-0... |
| 2 | [Beautiful Crazy, God's Country, Eyes On You, ... | [Luke Combs, Blake Shelton, Chase Rice, Morgan... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [2019-04-28, 2019-04-28, 2019-04-28, 2019-04-2... |
| 3 | [Beautiful Crazy, Here Tonight, Eyes On You, G... | [Luke Combs, Brett Young, Chase Rice, Blake Sh... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [2019-04-21, 2019-04-21, 2019-04-21, 2019-04-2... |
| 4 | [Beautiful Crazy, Tequila, Here Tonight, Look ... | [Luke Combs, Dan + Shay, Brett Young, Thomas R... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [2019-04-14, 2019-04-14, 2019-04-14, 2019-04-1... |


<a id='1a3'></a>

### __1a3. Process / Save Data__

In [56]:
tracks_df = pd.DataFrame()

In [57]:
tracks_df['title'] = [item for lst in billboard_df['songs'] for item in lst]
tracks_df['artist'] = [item for lst in billboard_df['artists'] for item in lst]
tracks_df['rank'] = [item for lst in billboard_df['ranks'] for item in lst]
tracks_df['week'] = [item for lst in billboard_df['weeks'] for item in lst]
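
The nested comprehensions above turn one list per week into one row per chart entry. The same flattening can be sketched with `itertools.chain.from_iterable`; the two-week sample below is hypothetical data in the same shape as the `billboard_df` columns.

```python
from itertools import chain

# Hypothetical two-week sample: one inner list per scraped chart week
songs_by_week = [['Whiskey Glasses', 'Beautiful Crazy'],
                 ['Beautiful Crazy', 'Whiskey Glasses']]
weeks_by_week = [['2019-05-12', '2019-05-12'],
                 ['2019-05-05', '2019-05-05']]

# Flatten each column; order is preserved, so the columns stay aligned
titles = list(chain.from_iterable(songs_by_week))
weeks = list(chain.from_iterable(weeks_by_week))

rows = list(zip(titles, weeks))
print(rows)
# [('Whiskey Glasses', '2019-05-12'), ('Beautiful Crazy', '2019-05-12'),
#  ('Beautiful Crazy', '2019-05-05'), ('Whiskey Glasses', '2019-05-05')]
```

This only works because every per-week list has the same length across columns, which `scrape_week` does not itself enforce.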

In [70]:
tracks_df.head(20)

| | title | artist | weeks | week_count |
|---|---|---|---|---|
| 0 | "Never More" Quote The Raven | Stonewall Jackson | 1969-08-03,1969-07-27,1969-07-20,1969-07-13,19... | 7 |
| 1 | "You've Got" The Touch | Alabama | 1987-04-26,1987-04-19,1987-04-12,1987-04-05,19... | 15 |
| 2 | '57 Chevrolet | Billie Jo Spears | 1978-10-15,1978-10-08,1978-10-01,1978-09-24,19... | 9 |
| 3 | 'Cause I Have You | Wynn Stewart | 1967-10-22,1967-10-15,1967-10-08,1967-10-01,19... | 14 |
| 4 | 'Fore She Was Mama | Clay Walker | 2007-03-18,2007-03-11,2007-03-04,2007-02-25,20... | 25 |
| 5 | 'Gator Hollow | Lefty Frizzell | 1965-01-17,1965-01-10 | 2 |
| 6 | 'Round Here | Sawyer Brown | 1996-02-25,1996-02-18,1996-02-11,1996-02-04,19... | 12 |
| 7 | 'Tater Raisin' Man | Dick Curless | 1965-11-14,1965-11-07,1965-10-31 | 3 |
| 8 | 'Til A Tear Becomes A Rose | Keith Whitley & Lorrie Morgan | 1990-11-11,1990-11-04,1990-10-28,1990-10-21,19... | 16 |
| 9 | 'Til A Tear Becomes A Rose | Leon Everette | 1985-11-10,1985-11-03,1985-10-27 | 3 |


#### There are ~300,000 song entries. However, many are duplicates, because songs often appear on the charts for several weeks. Therefore, group the scraped entries by unique track and add a `week_count` column so that repeat appearances are not lost:

In [59]:
# Group by unique track, then flatten with new column of all aggregated weeks

tracks_df = (tracks_df.groupby(['title','artist'])
             ['week'].apply(','.join)
             .reset_index(name='weeks'))

In [60]:
tracks_df['week_count'] = tracks_df['weeks'].apply(lambda x: len(x.split(',')))
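
To spell out what the group-and-count step does, here is a standard-library sketch of the same logic on three hypothetical chart rows. It is not a replacement for the pandas code: `weeks_by_track` and `summary` are illustrative names, and a plain `defaultdict` stands in for `groupby`.

```python
from collections import defaultdict

# Hypothetical flattened rows: (title, artist, week), newest week first
rows = [
    ('Beautiful Crazy', 'Luke Combs', '2019-05-12'),
    ('Whiskey Glasses', 'Morgan Wallen', '2019-05-12'),
    ('Beautiful Crazy', 'Luke Combs', '2019-05-05'),
]

# Collect every chart week under its (title, artist) key
weeks_by_track = defaultdict(list)
for title, artist, week in rows:
    weeks_by_track[(title, artist)].append(week)

# One row per unique track: joined weeks plus the number of appearances
summary = [(title, artist, ','.join(weeks), len(weeks))
           for (title, artist), weeks in sorted(weeks_by_track.items())]

for row in summary:
    print(row)
# ('Beautiful Crazy', 'Luke Combs', '2019-05-12,2019-05-05', 2)
# ('Whiskey Glasses', 'Morgan Wallen', '2019-05-12', 1)
```

Keying on both title and artist matters: as the table above shows, two different artists can chart with the same song title.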

In [62]:
tracks_df.shape

(12375, 4)

In [68]:
data_dir = os.path.join(project_4_path,'data/')

In [69]:
tracks_df.to_pickle(os.path.join(data_dir,'tracks_df.pkl'))