# __Country Music Lyric Analysis__
#### Analyzing song lyrics with natural language processing and Non-negative Matrix Factorization (NMF) / Correlation Explanation (CorEx) topic models  
<hr>  
This project explores Country music as a genre by extracting and analyzing lyrics from songs posted on [Billboard's weekly hot 50 charts](https://www.billboard.com/archive/charts/2018/country-songs). Topic modeling, though not an exhaustive approach, proves useful in exploring music genres by allowing users to discover previously hidden themes and motifs in song lyrics. Introducing a time series element to the dataset also allows users to visualize trends in topic distributions over several decades.     

In this case, the models and visualizations demonstrate that country music can, in fact, be quite relatable: its lyrics represent a lot more than just beer, trucks, and women (themes commonly present in modern-day bro-country). Applying CorEx with anchor words in a semi-supervised manner also reveals some more interesting, esoteric topics in Country music.  

The project can be broken down into the following steps:  

#### ***1. Data Acquisition***  
\- 1a. Scrape Billboard chart archives and populate corpus of country songs  
\- ***1b. Scrape lyrics for each song from WikiLyrics and Genius APIs***
#### 2. Preprocessing - Lyrics / Data  
\- 2a. Use natural language processing and other methods to process text lyrics and data. Introduce some EDA and basic feature engineering    
#### 3. Topic Models / Lyric Analysis   
\- 3a. Apply non-negative matrix factorization and CorEx to model topics and then analyze the results  



<hr>

# __1b. Data Acquisition - Lyrics__

### __Sections__  

[1b1. Load corpus of country songs](#1b1)  
[1b2. Scrape lyrics from WikiLyrics](#1b2)  
[1b3. Scrape remaining lyrics from Genius](#1b3)  
[1b4. Merge / Prepare data for preprocessing](#1b4)

In [1]:
import numpy as np
import pandas as pd
import re
import os
import warnings
from time import sleep

# Imports for lyric scraping
import pprint
import requests
from PyLyrics import *
from bs4 import BeautifulSoup
from tqdm import tqdm
from multiprocessing import Pool

from keys import genius_id, genius_secret, genius_token
from rq_config import project_4_path

warnings.filterwarnings('ignore')

<a id='1b1'></a>

### __1b1. Load corpus of country songs__

In [2]:
# Directory where data is stored

data_dir = os.path.join(project_4_path,'data/')

In [3]:
# Load scraped billboard songs

tracks_df = pd.read_pickle(data_dir + 'tracks_df.pkl')

In [4]:
tracks_df.head()

| index | title | artist | weeks | week_count |
|---|---|---|---|---|
| 0 | "Never More" Quote The Raven | Stonewall Jackson | 1969-08-03,1969-07-27,1969-07-20,1969-07-13,19... | 7 |
| 1 | "You've Got" The Touch | Alabama | 1987-04-26,1987-04-19,1987-04-12,1987-04-05,19... | 15 |
| 2 | '57 Chevrolet | Billie Jo Spears | 1978-10-15,1978-10-08,1978-10-01,1978-09-24,19... | 9 |
| 3 | 'Cause I Have You | Wynn Stewart | 1967-10-22,1967-10-15,1967-10-08,1967-10-01,19... | 14 |
| 4 | 'Fore She Was Mama | Clay Walker | 2007-03-18,2007-03-11,2007-03-04,2007-02-25,20... | 25 |


In [5]:
tracks_df.shape

(12375, 4)

<a id='1b2'></a>

### __1b2. Scrape lyrics from WikiLyrics__  
Query WikiLyrics first, since it tends to return more reliable lyrics. Songs it cannot find are passed to Genius in the next step.

In [6]:
def query_wikilyrics(query):
    """
    Queries WikiLyrics through the PyLyrics library and returns the song lyrics if a match is found.
    
    Parameters
    -----
        query: tuple of (artist, title)
    
    Returns
    -----
        str: lyrics, or None if no match is found
    """
    # Separate artist and title from query object
    artist = query[0]
    title = query[1]
    try:
        try:
            return PyLyrics.getLyrics(artist,title)
        except requests.exceptions.ConnectionError:
            # Wait briefly and retry once if the connection drops
            sleep(1)
            return PyLyrics.getLyrics(artist,title)
    except ValueError:
        # Treat a ValueError from PyLyrics as "no lyrics found"
        return None
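
The single retry in `query_wikilyrics` can be generalized. The following is a minimal sketch and not part of the original notebook: `call_with_retries` and its parameters are hypothetical names, and the exponential backoff is an assumption rather than something the notebook does. With `requests`, one would pass `retry_on=(requests.exceptions.ConnectionError,)`, since that exception does not derive from the built-in `ConnectionError`.

```python
import time


def call_with_retries(fn, *args, retries=3, base_delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(*args); on a retryable error wait and try again, up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except retry_on:
            # Re-raise once the final attempt has failed
            if attempt == retries - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the helper can be exercised without actually waiting.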
    

In [7]:
def pull_lyrics_wikilyrics(df):
    """
    Scrape lyrics from WikiLyrics for all tracks in the dataframe.
    
    Parameters
    -----
    df: dataframe of songs with artist and title columns
    
    Returns
    -----
    dataframe: the input dataframe with a new lyrics column (None where no lyrics were found)
    """
    search_queries = list(zip(list(df['artist']),list(df['title'])))
    
    print(f'Scraping lyrics for {len(search_queries)} tracks.')
    
    pool = Pool(50)
    if __name__ == '__main__':   
        lyrics = list(tqdm(pool.imap(query_wikilyrics, search_queries), total = len(search_queries)))
    pool.terminate()
    pool.join()
    
    df['lyrics'] = lyrics
    
    print(f'Finished. Unable to scrape lyrics for {df.lyrics.isnull().sum()} songs')

    return df  
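
Because each lookup spends most of its time waiting on the network, a thread pool is one possible alternative to `multiprocessing.Pool`. The sketch below is an assumption-laden illustration, not the notebook's method: it swaps in `concurrent.futures.ThreadPoolExecutor`, and `map_in_threads` and `fake_lookup` are hypothetical names, with `fake_lookup` standing in for a real network call such as `query_wikilyrics`.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_threads(fn, items, max_workers=50):
    """Apply fn to every item using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, items))


def fake_lookup(query):
    # Hypothetical stand-in for a network call; takes an (artist, title) tuple
    artist, title = query
    return f'{title} by {artist}'


results = map_in_threads(fake_lookup, [('Artist A', 'Song 1'), ('Artist B', 'Song 2')])
```

Since `executor.map` yields results in input order, the returned list can be assigned directly to a dataframe column in the same way as the `lyrics` list above.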

#### Scrape lyrics - this will take a few minutes:

In [8]:
tracks_df = pull_lyrics_wikilyrics(tracks_df)

Scraping lyrics for 12375 tracks.


100%|██████████| 12375/12375 [04:35<00:00, 44.97it/s]


Finished. Unable to scrape lyrics for 4593 songs


#### Clean WikiLyrics results - some lyrics were "returned" but actually represent instrumentals or unlicensed content. Convert these to None

In [9]:
# Replace instrumentals with None

mask = tracks_df['lyrics'].str.contains('span style',na = False)

tracks_df.loc[mask, 'lyrics'] = None

In [10]:
# Replace unlicensed lyrics with None

mask = tracks_df['lyrics'].str.contains('unfortunately, we are not licensed',na = False)

tracks_df.loc[mask, 'lyrics'] = None
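
The two masks above apply the same rule to different marker strings. As a rough sketch of that rule for a single lyric string, written in plain Python rather than pandas: `PLACEHOLDER_MARKERS` and `clean_lyrics` are hypothetical names, and the marker strings are copied from the cells above.

```python
# Marker substrings taken from the two cleaning cells above
PLACEHOLDER_MARKERS = ('span style', 'unfortunately, we are not licensed')


def clean_lyrics(text):
    """Return None for missing or placeholder lyrics, otherwise the text unchanged."""
    if text is None:
        return None
    if any(marker in text for marker in PLACEHOLDER_MARKERS):
        return None
    return text
```

Like the masks above, the check is case-sensitive, so a marker with different capitalization would not be caught.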

#### Separate songs with no WikiLyrics results:

In [11]:
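# Copy the subset so adding a column later does not act on a view of tracks_df
wikilyrics_df = tracks_df[tracks_df['lyrics'].notnull()].copy()
# drop() returns a new dataframe, so no in-place edit of a slice is needed
remaining_df = tracks_df[tracks_df['lyrics'].isnull()].drop(columns = 'lyrics')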

In [12]:
wikilyrics_df['source'] = 'wikilyrics'

In [13]:
wikilyrics_df.head()

| index | title | artist | weeks | week_count | lyrics | source |
|---|---|---|---|---|---|---|
| 1 | "You've Got" The Touch | Alabama | 1987-04-26,1987-04-19,1987-04-12,1987-04-05,19... | 15 | Lyin' beside you watching you sleepin'\nAfter ... | wikilyrics |
| 2 | '57 Chevrolet | Billie Jo Spears | 1978-10-15,1978-10-08,1978-10-01,1978-09-24,19... | 9 | Come and look at this old faded photograph\nHo... | wikilyrics |
| 4 | 'Fore She Was Mama | Clay Walker | 2007-03-18,2007-03-11,2007-03-04,2007-02-25,20... | 25 | 'Bout ten years old, hide and seek\nI found me... | wikilyrics |
| 6 | 'Round Here | Sawyer Brown | 1996-02-25,1996-02-18,1996-02-11,1996-02-04,19... | 12 | Sue and Jack fell in love 'round here \nThey b... | wikilyrics |
| 10 | 'Til I Get It Right | Tammy Wynette | 1973-04-01,1973-03-25,1973-03-18,1973-03-11,19... | 13 | I'll just keep on falling in love till I get i... | wikilyrics |


<a id='1b3'></a>

### __1b3. Scrape remaining lyrics from Genius API (second run)__

In [14]:
def generate_search_query(df):
    '''
    Builds a search query for each track that can be sent to the Genius API.
    
    Parameters
    -----
    df: dataframe of tracks with title and artist columns    
    
    Returns
    -----
    list: one [query, artist] pair per track, where query is the cleaned title followed by the artist
    '''
    # Remove parenthetical text (and the space before it) from each title
    track_names = [re.sub(r' \(.*\)','',item) for item in list(df['title'])]
    track_artists = list(df['artist'])
    
    queries = [[f'{track} {artist}',artist] for track,artist in zip(track_names,track_artists)]
    return queries
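
To make the behavior of the title pattern concrete, here is a small sketch that applies the same regular expression to a few titles. `strip_parenthetical` is a hypothetical helper, and two of the three titles are made up for illustration. The pattern requires a space before the opening parenthesis, so a title that starts with a parenthetical is left unchanged, and because `.*` is greedy, everything from the first ` (` to the last `)` is removed.

```python
import re


def strip_parenthetical(title):
    """Remove ' (...)' from a title, using the same pattern as generate_search_query."""
    return re.sub(r' \(.*\)', '', title)


examples = [
    "Some Song (Live)",                      # hypothetical title
    "(Don't Let The Sun Set On You) Tulsa",  # title from the chart data
    "One (A) Two (B)",                       # hypothetical title
]
cleaned = [strip_parenthetical(title) for title in examples]
```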

In [15]:
def pull_url_lyrics(genius_url):
    '''
    Pulls lyrics from a single Genius URL using beautifulsoup
    
    Parameters
    ------
    genius_url: url that will be scraped for lyrics
    
    Returns
    -----
    str: lyrics in string format, or None if the page cannot be fetched or parsed
    '''
    try:
        response = requests.get(genius_url,timeout = 5)
    except requests.exceptions.SSLError:
        # No response to parse, so give up on this URL
        return None
    except requests.exceptions.ConnectionError:
        # Wait and retry once; sleep is imported directly from time
        sleep(5)
        response = requests.get(genius_url,timeout = 5)

    html = BeautifulSoup(response.text,features = 'html.parser')
    try:
        lyrics = html.find("div",class_='lyrics').get_text()
        return lyrics
    except AttributeError:
        # find() returned None: the page has no div with class 'lyrics'
        return None

In [16]:
def search_genius_url(track_query):
    '''
    Searches the Genius API for the song and, if the top hit's primary artist matches, scrapes the lyrics from its Genius URL.

    Parameters
    -----
    track_query: [query, artist] pair produced by generate_search_query
    
    Returns
    -----
    str: lyrics for the song, or None if there is no hit or the artist does not match
    '''
    artist = track_query[1]
    track_query = track_query[0]
    headers = {'Authorization': f'Bearer {genius_token}'}
    params = {'q':track_query}
    response = requests.get('https://api.genius.com/search',params = params,headers = headers,timeout = 5)
    # Parse the JSON body once and reuse it
    hits = response.json()['response']['hits']
    if len(hits) > 0:
        url = hits[0]['result']['url']
        genius_artist = hits[0]['result']['primary_artist']['name']
        # Only accept the top hit when the artist name matches exactly
        if artist == genius_artist:
            return pull_url_lyrics(url)
    return None
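
The fields read from the search response can be pulled out in one place. The following is a minimal sketch under stated assumptions: `first_hit` is a hypothetical helper, and `sample` is a made-up payload that only mirrors the keys accessed in `search_genius_url`; it is not a real Genius response.

```python
def first_hit(payload):
    """Return (url, primary artist name) for the top search hit, or None if there are no hits."""
    hits = payload.get('response', {}).get('hits', [])
    if not hits:
        return None
    result = hits[0]['result']
    return result['url'], result['primary_artist']['name']


# Hypothetical payload with made-up values, shaped like the keys read in search_genius_url
sample = {'response': {'hits': [{'result': {
    'url': 'https://genius.com/example-lyrics',
    'primary_artist': {'name': 'Example Artist'},
}}]}}

top = first_hit(sample)
```

Separating the lookup from the HTTP call makes the artist check easy to exercise without network access.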
    

In [17]:
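def pull_lyrics_genius(df):
    '''
    Scrape lyrics from Genius for all tracks in the dataframe.
    
    Parameters
    -----
    df: dataframe of tracks with title and artist columns
    
    Returns
    -----
    dataframe: the input dataframe with a new lyrics column (None where no lyrics were found)
    '''
    search_queries = generate_search_query(df)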
    
    print(f'Scraping lyrics for {len(search_queries)} tracks.')
    
    pool = Pool(50)
    if __name__ == '__main__':   
        lyrics = list(tqdm(pool.imap(search_genius_url, search_queries), total = len(search_queries)))  
    pool.terminate()  
    pool.join()  

    print(f'Finished')
    
    df['lyrics'] = lyrics
    
    return df  

#### Scrape lyrics from Genius - this will take a few minutes

In [18]:
remaining_df = pull_lyrics_genius(remaining_df)

Scraping lyrics for 4598 tracks.


100%|██████████| 4598/4598 [00:56<00:00, 81.63it/s]

Finished





#### Separate Genius results:

In [19]:
# Copy the subset so adding the source column does not act on a view of remaining_df
genius_df = remaining_df[remaining_df['lyrics'].notnull()].copy()

In [20]:
genius_df['source'] = 'genius'

In [21]:
# Tracks with no results from either source

remaining_df = remaining_df[remaining_df['lyrics'].isnull()]

<a id='1b4'></a>

### __1b4. Merge / Prepare data for Preprocessing__

In [22]:
# Results from wikilyrics scrape

print(wikilyrics_df.shape)
wikilyrics_df.head()

(7777, 6)


| index | title | artist | weeks | week_count | lyrics | source |
|---|---|---|---|---|---|---|
| 1 | "You've Got" The Touch | Alabama | 1987-04-26,1987-04-19,1987-04-12,1987-04-05,19... | 15 | Lyin' beside you watching you sleepin'\nAfter ... | wikilyrics |
| 2 | '57 Chevrolet | Billie Jo Spears | 1978-10-15,1978-10-08,1978-10-01,1978-09-24,19... | 9 | Come and look at this old faded photograph\nHo... | wikilyrics |
| 4 | 'Fore She Was Mama | Clay Walker | 2007-03-18,2007-03-11,2007-03-04,2007-02-25,20... | 25 | 'Bout ten years old, hide and seek\nI found me... | wikilyrics |
| 6 | 'Round Here | Sawyer Brown | 1996-02-25,1996-02-18,1996-02-11,1996-02-04,19... | 12 | Sue and Jack fell in love 'round here \nThey b... | wikilyrics |
| 10 | 'Til I Get It Right | Tammy Wynette | 1973-04-01,1973-03-25,1973-03-18,1973-03-11,19... | 13 | I'll just keep on falling in love till I get i... | wikilyrics |


In [23]:
# Results from Genius scrape

print(genius_df.shape)
genius_df.head()

(938, 6)


| index | title | artist | weeks | week_count | lyrics | source |
|---|---|---|---|---|---|---|
| 3 | 'Cause I Have You | Wynn Stewart | 1967-10-22,1967-10-15,1967-10-08,1967-10-01,19... | 14 | \n\nA flower needs the earth to make it grow\n... | genius |
| 16 | 'round The World With Rubber Duck | C.W. McCall | 1977-01-23,1977-01-16,1977-01-09 | 3 | \n\n[On the CB.]\nBreaker, one-nine, this here... | genius |
| 17 | 'til I Can Make It On My Own | Tammy Wynette | 1976-05-09,1976-05-02,1976-04-25,1976-04-18,19... | 12 | \n\nI'll need time to get you off my mind\nAnd... | genius |
| 18 | 'til I Gain Control Again | Crystal Gayle | 1983-03-06,1983-02-27,1983-02-20,1983-02-13,19... | 14 | \n\nJust like the sun over the mountain top\nY... | genius |
| 26 | (Don't Let The Sun Set On You) Tulsa | Waylon Jennings | 1971-02-14,1971-02-07,1971-01-31,1971-01-24,19... | 11 | \n\n[Verse 1]\nWhen I left Tulsa, Jamie was an... | genius |


In [24]:
# Merge WikiLyrics and Genius results

lyrics_merged_df = pd.concat([wikilyrics_df,genius_df,remaining_df]).sort_index()

In [25]:
lyrics_merged_df.head(10)

| index | artist | lyrics | source | title | week_count | weeks |
|---|---|---|---|---|---|---|
| 0 | Stonewall Jackson |  |  | "Never More" Quote The Raven | 7 | 1969-08-03,1969-07-27,1969-07-20,1969-07-13,19... |
| 1 | Alabama | Lyin' beside you watching you sleepin'\nAfter ... | wikilyrics | "You've Got" The Touch | 15 | 1987-04-26,1987-04-19,1987-04-12,1987-04-05,19... |
| 2 | Billie Jo Spears | Come and look at this old faded photograph\nHo... | wikilyrics | '57 Chevrolet | 9 | 1978-10-15,1978-10-08,1978-10-01,1978-09-24,19... |
| 3 | Wynn Stewart | \n\nA flower needs the earth to make it grow\n... | genius | 'Cause I Have You | 14 | 1967-10-22,1967-10-15,1967-10-08,1967-10-01,19... |
| 4 | Clay Walker | 'Bout ten years old, hide and seek\nI found me... | wikilyrics | 'Fore She Was Mama | 25 | 2007-03-18,2007-03-11,2007-03-04,2007-02-25,20... |
| 5 | Lefty Frizzell |  |  | 'Gator Hollow | 2 | 1965-01-17,1965-01-10 |
| 6 | Sawyer Brown | Sue and Jack fell in love 'round here \nThey b... | wikilyrics | 'Round Here | 12 | 1996-02-25,1996-02-18,1996-02-11,1996-02-04,19... |
| 7 | Dick Curless |  |  | 'Tater Raisin' Man | 3 | 1965-11-14,1965-11-07,1965-10-31 |
| 8 | Keith Whitley & Lorrie Morgan |  |  | 'Til A Tear Becomes A Rose | 16 | 1990-11-11,1990-11-04,1990-10-28,1990-10-21,19... |
| 9 | Leon Everette |  |  | 'Til A Tear Becomes A Rose | 3 | 1985-11-10,1985-11-03,1985-10-27 |
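
As a quick sanity check on the merge, the row counts printed earlier in this notebook can be combined to estimate lyric coverage. This is a small arithmetic sketch only: the three counts come from the `shape` outputs above, and the variable names are hypothetical.

```python
total_tracks = 12375      # tracks_df.shape[0]
wikilyrics_tracks = 7777  # wikilyrics_df.shape[0]
genius_tracks = 938       # genius_df.shape[0]

with_lyrics = wikilyrics_tracks + genius_tracks
without_lyrics = total_tracks - with_lyrics
coverage = with_lyrics / total_tracks

print(f'{with_lyrics} tracks with lyrics, {without_lyrics} without ({coverage:.1%} coverage)')
```

On these counts, 8715 of the 12375 tracks (roughly 70%) have lyrics from one of the two sources, and 3660 have none.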


In [26]:
# Save results

lyrics_merged_df.to_pickle(data_dir + 'lyrics_merged_df.pkl')