#### In this project, I scraped a popular guitar tab website, Ultimate-Guitar.com, to analyze lyrics and chords from popular songs in various genres. 

#### I have a few dozen guitar students, and when songwriting comes up, I always stipulate that I am not great at writing lyrics. I usually try to help the students find a variety of words or phrases that they can later begin to stitch together into something cohesive.  

#### The goal of this project was to analyze popular lyrics in order to find a set of lyric writing "parameters". 
#### For example: "A verse usually has x words; a fraction k of them are nouns, a fraction j are verbs, etc. You want so many of them to be objective and so many to be positive. Here are some examples of objective or positive words, here are some common verbs, nouns, etc. used in songs, and here are some common 'filler' phrases." 

#### I analyzed all the collected lyrics together, then analyzed them by genre to see if there were any genre-specific patterns. The biggest difference I found was in the number of unique words per song part (e.g. Hip-hop lyrics use many more unique words in a verse than Pop lyrics). However, the differences in sentiment and part-of-speech (p.o.s.) content were minute. This may be an artifact of the TextBlob dictionary, or might be a reflection of the English language. The most common words and phrases, and the example words for subjectivity and polarity, did differ noticeably between genres. 

#### After finding these 'parameters', I will create a lyric-writing checklist, which a student can fill out to get more words on the page and then start stitching them together into more complete lyrics. 

#### Improvements can be made. The genres for many chord charts on Ultimate-Guitar.com aren't labeled correctly ('Somewhere Over the Rainbow' is NOT a Country song). In the future I plan to find another source of genre labels for specific artists and replace the current ones. I also plan to add the ability to scrape all the songs by a specific artist so I can analyze their writing style.
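
One way the relabeling could work, once a trusted artist-to-genre source exists, is a simple lookup that overrides the scraped label and falls back to it otherwise. The sketch below is only an illustration: `relabel_genre`, `artist_genres`, and the sample rows are hypothetical placeholders, not real labels from any dataset.

```python
def relabel_genre(rows, artist_genres):
    """Return copies of rows with 'Genre' replaced when the artist has a
    trusted label; rows for unknown artists keep their scraped genre."""
    relabeled = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        fixed['Genre'] = artist_genres.get(row['Song Artist'], row['Genre'])
        relabeled.append(fixed)
    return relabeled


# Hypothetical lookup table and rows, for illustration only
artist_genres = {'Example Artist': 'pop'}
rows = [
    {'Song Artist': 'Example Artist', 'Genre': 'country'},
    {'Song Artist': 'Unlisted Artist', 'Genre': 'rock'},
]
fixed_rows = relabel_genre(rows, artist_genres)
print([row['Genre'] for row in fixed_rows])
```

Keeping the scraped label as the fallback means songs by artists missing from the lookup are still included in the analysis.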

## Load and install necessary packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


!pip install requests-html
!pip install -U textblob
!python -m textblob.download_corpora
!pip install wordcloud

from requests_html import AsyncHTMLSession, HTMLSession
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS


nltk.download('wordnet')
nltk.download('brown')

## Functions to be used

In [None]:
#Create function that scrapes links to songs


async def get_song_links(song_html):
    '''Takes in the URL of one page of 
       Ultimate Guitar's top searched songs, 
       returns a list of links to specific songs.'''
    
    
    #grabs html elements
    asession = AsyncHTMLSession()

    r = await asession.get(song_html)
    await r.html.arender()
    links = r.html.find('a[href*="tabs."]')


    #parses out links and adds them to list
    list_o_links = []

    for elem in links:
        link = str(elem).split("'")[-2]
        list_o_links.append(link)
    
    #closes
    await asession.close()

    return list_o_links
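
`get_song_links` recovers each link by splitting the element's string representation on quote characters, which depends on how that representation happens to be formatted. Reading the `href` attribute directly is less fragile. The sketch below shows the idea with the standard library's `html.parser` on made-up markup; `TabLinkParser` and `sample_html` are hypothetical, and it is not a drop-in replacement for the requests-html code above.

```python
from html.parser import HTMLParser


class TabLinkParser(HTMLParser):
    """Collects href values of <a> tags whose href contains 'tabs.'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and 'tabs.' in href:
            self.links.append(href)


# Hypothetical markup standing in for a rendered results page
sample_html = (
    '<a href="https://tabs.example.com/tab/artist/song-chords-1">Song</a>'
    '<a href="https://www.example.com/about">About</a>'
)
parser = TabLinkParser()
parser.feed(sample_html)
print(parser.links)
```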

In [None]:
def split_lyrics_chords(html_main_body, all_chords):
    
    """Takes html of main body of the chord chart, 
    separates the html into the parts for each song, 
    then separates the chords from the lyrics"""
    
    #Grab name of each part, then lyrics and chords in each part
    song_part = []
    part_range = []
    part_body = []
    part_chords = []
    part_lyrics = []

    for i, item in enumerate(html_main_body):
        #Grabs part name, and the index of that part name
        #(a part header is a line that starts with "[")
        if item.startswith("["):
            song_part.append(item.replace('[', '').replace(']', ''))
            part_range.append(i)

    #Use the index of each part name to grab lines between them       
    for i in range(len(part_range)):


        low_range = part_range[i]+1

        if i+1 < len(part_range):
            high_range = part_range[i+1]
            part_body.append(html_main_body[low_range:high_range])
        else:
            part_body.append(html_main_body[low_range:])

    #Separates chords from lyrics        
    for part in part_body:
        chords = []
        lyrics = []
        for i, line in enumerate(part):
            line_list = line.split(' ')

            crd = [x for x in all_chords if x in line_list] 

            if len(crd) > 0:
                fixed_list = []
                for chord in line.split(' '):


                    if "/" in chord:
                        fixed_list.append(chord.split('/')[0])
                    else:
                        fixed_list.append(chord)

                fixed_line = ' '.join(fixed_list)

                chords.append(fixed_line)

            else:     

                lyrics.append(line)

        #removes unnecessary symbols
        chords = " ".join(chords).replace(',', '').replace('|', '').replace('*', '').replace('/', '')
        lyrics = " ".join(lyrics).replace(',', '').replace('|', '').replace("'", '')

        part_chords.append(chords)

        if '---' in lyrics:

            part_lyrics.append('')
        else:
            part_lyrics.append(lyrics)
                
                
    return {'Part': song_part, 'Chords': part_chords, 'Lyrics': part_lyrics }
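
The first half of `split_lyrics_chords` is easier to follow on a toy input. The sketch below reproduces only the section-grouping step (not the chord/lyric separation) using plain lists; `group_by_section` and the `chart` lines are made up for illustration.

```python
def group_by_section(lines):
    """Group lines under the most recent '[Section]' header line."""
    sections = []
    for line in lines:
        if line.startswith('['):
            # a header such as '[Verse 1]' starts a new section
            name = line.replace('[', '').replace(']', '')
            sections.append((name, []))
        elif sections:
            # lines before the first header are skipped, as in the function above
            sections[-1][1].append(line)
    return sections


# Made-up chart lines, for illustration only
chart = ['Capo 2', '[Verse 1]', 'G C', 'first line of words',
         '[Chorus]', 'D G', 'second line of words']
for name, body in group_by_section(chart):
    print(name, body)
```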

In [None]:
async def get_song_dataframe(html, genre):
    
    """Takes in the URL of a song's chord chart and the genre of the song,
    parses out the song's title, artist, lyrics, chords, and the 
    part-of-song labels, combines with song genre and returns a dataframe"""
    
    #Grab html from javascript page
    asession = AsyncHTMLSession()

    r = await asession.get(html)
    await r.html.arender(timeout=20)
    

    #Extract the main body of text

    elements = r.html.find("span._1zlI0")
    
    await asession.close()
    main_body = []
    for elem in elements:
        main_body.append(elem.text)
        
    #Extract list of all the chords in the song
    elements = r.html.find('header._2jxI1')
    
    song_chordlist = []
    for elem in elements:
        song_chordlist.append(elem.text)
    song_chordlist.append('N.C.')

    #Extract Title and Artist of the song
    header = r.html.find('title')[0].text
    song_title = header.split(' CHORDS')[0]
    song_artist = header.split('by ')[1].split(' @')[0]


    parts_chords_lyrics = split_lyrics_chords(main_body, song_chordlist)
    song_part = parts_chords_lyrics['Part']
    part_chords = parts_chords_lyrics['Chords']
    part_lyrics = parts_chords_lyrics['Lyrics']

    #Create dataframe
    song_list = []

    for i, part in enumerate(song_part):
        #empty chord strings stay empty and become NaN in the return below
        partchords = part_chords[i] if part_chords[i] else ''

        #strip curly apostrophes, including their mis-decoded form
        lyrics = part_lyrics[i].replace("â€™", '').replace("’", '')

        song_dict = {'Song Artist': song_artist,
                     'Song Title': song_title,
                     'Part': part.lower(),
                     'Chords': partchords,
                     'Lyrics': lyrics,
                     'Genre': genre}
        song_list.append(song_dict)

        
    dflist = pd.DataFrame(song_list)

    return dflist.replace('  ', np.nan).replace(' ', np.nan).replace('', np.nan)
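
The title and artist are pulled out of the page `<title>` with chained `split()` calls, which raise an `IndexError` when the text does not contain `'by '`. A regular expression can express the same layout and fail softly. This is a sketch under an assumed layout inferred from those `split()` calls; `TITLE_PATTERN`, `parse_title`, and the sample headers are hypothetical.

```python
import re

# Assumed title layout, inferred from the split() calls above:
# '<Song Title> CHORDS by <Artist> @ <site name>'
TITLE_PATTERN = re.compile(r'^(?P<title>.+?) CHORDS.*? by (?P<artist>.+?) @')


def parse_title(header):
    """Return (song_title, song_artist), or None when the layout differs."""
    match = TITLE_PATTERN.search(header)
    if match is None:
        return None
    return match.group('title'), match.group('artist')


# Hypothetical header text
print(parse_title('Example Song CHORDS by Example Artist @ Example-Site.com'))
print(parse_title('Page not found'))
```

Returning `None` lets the caller skip a page that failed to render instead of stopping the whole scraping loop.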
        

In [None]:

def get_pol_sub_scores(lyric_list):
    
    """Takes in a list of lyrics, then uses TextBlob
    to analyze the sentiment of the lyrics, returns dataframe
    with: 
    
    Polarity - ranges from -1 (negative lyrics) to 1 (positive lyrics)
    
    Subjectivity - ranges from 0 (facts) to 1 (opinions)
    
    Words - the words analyzed for the scoring, and their respective scores"""
    
    
    sentiments_list_of_dicts = []
    
    
    for lyrics in lyric_list:
        blob = TextBlob(str(lyrics))
        sentiment = blob.sentiment_assessments
        pol = sentiment[0]
        sub = sentiment[1]
        words = sentiment[2]

        sentiment_dict = {'Polarity':pol, "Subjectivity":sub, "Words":words}
        
        sentiments_list_of_dicts.append(sentiment_dict)
        
    return  pd.DataFrame(sentiments_list_of_dicts)
    

In [None]:


def get_pos_tags(lyric_list):
    
    """Takes a list of lyrics, removes common words, 
    then tags each word with its part of speech (p.o.s.) label,
    returns a dataframe with the word and its label."""
    
    #list of common words in english
    stop_words = stopwords.words('english')
    
    tags_list_of_dicts = []
    
    for lyrics in lyric_list:
        blob = TextBlob(str(lyrics))
        part_tags = blob.tags
        
        for tags in part_tags:
        
            word = tags[0].lower()
            tag = tags[1]


            if word in stop_words:
                continue

            elif len(tags_list_of_dicts) > 0:
                if word.lower() == tags_list_of_dicts[-1]['Word'].lower():
                    continue
                else:
                    tag_dict = {'Word': word, 'Tag': tag}
                    tags_list_of_dicts.append(tag_dict)
            else:
                tag_dict = {'Word': word, 'Tag': tag}
                tags_list_of_dicts.append(tag_dict)

    return pd.DataFrame(tags_list_of_dicts)
    

In [None]:
def get_pos_count_df(lyric_list):
    
    """Takes a list of lyrics, counts instances of 
    nouns, verbs, adjectives, and adverbs. 
    Returns a dataframe with the percentage of 
    each part of speech in the total words counted."""
    
    #list of words to skip over
    stop_words = stopwords.words('english')
    
    
    #selection of p.o.s. tags to count
    noun_list = ['NN', 'NNS']
    verb_list = [ 'VB', 'VBD', 'VBG', 'VBN', ]
    adj_list = ['JJ', 'JJR', 'JJS',]
    adverb_list = ['RB', 'RBR', 'RBS']


    tags_list_of_dicts = []

    for lyrics in lyric_list:

        nouncount = 0
        verbcount = 0
        adjcount = 0
        adverbcount = 0
        
        #Create textblob object to access tags
        blob = TextBlob(str(lyrics))
        part_tags = blob.tags

        for tags in part_tags:

            word = tags[0].lower()
            tag = tags[1]

            #Don't count if word is in stopwords
            if (word in stop_words):
                continue

            else :
                if tag in noun_list:
                    nouncount = nouncount + 1

                elif tag in verb_list:
                    verbcount = verbcount + 1

                elif tag in adj_list:
                    adjcount = adjcount + 1

                elif tag in adverb_list:
                    adverbcount = adverbcount + 1
        
        total = nouncount + verbcount + adjcount + adverbcount 
        
        #avoid division by zero
        if total > 0:
            
            total_words_counted = total
        else:
            total_words_counted = 1
            
              

        #create dictionary and append it to list
        tag_dict = { 'Noun':nouncount/total_words_counted , 'Verb': verbcount/total_words_counted, 'Adjective': adjcount/total_words_counted, 'Adverb': adverbcount/total_words_counted}
        tags_list_of_dicts.append(tag_dict)
        
    #Create data frame from list of dictionaries
    df = pd.DataFrame(tags_list_of_dicts)
    return df
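
To make the fractions concrete, here is a small worked sketch of the same calculation on words that are already tagged. It uses the same tag groups as `get_pos_count_df` but skips the stop-word filter and does not call TextBlob; `pos_fractions` and the hand-assigned tags are assumptions for illustration.

```python
from collections import Counter

# Same tag groups as get_pos_count_df
TAG_GROUPS = {
    'Noun': {'NN', 'NNS'},
    'Verb': {'VB', 'VBD', 'VBG', 'VBN'},
    'Adjective': {'JJ', 'JJR', 'JJS'},
    'Adverb': {'RB', 'RBR', 'RBS'},
}


def pos_fractions(tagged_words):
    """Return each group's share of the words that fall in any group."""
    counts = Counter()
    for _word, tag in tagged_words:
        for group, tags in TAG_GROUPS.items():
            if tag in tags:
                counts[group] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return {group: counts[group] / total for group in TAG_GROUPS}


# Hand-tagged toy input (the tags are assumed, not produced by TextBlob here)
tagged = [('river', 'NN'), ('running', 'VBG'), ('cold', 'JJ'), ('stones', 'NNS')]
print(pos_fractions(tagged))
```

Two of the four counted words are nouns, so the noun fraction is 0.5; words with tags outside the four groups do not enter the denominator.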

In [None]:
def get_common_pos_df(tags_df, sample_size=5, genre=np.nan):
    
    """Takes in a dataframe of words and their part of speech tags, and returns a dataframe 
    where each row contains the word, the frequency of the word, the p.o.s. tag, and the genre of the song."""
    
    #filters specific pos tags
    noun = tags_df[(tags_df.Tag == 'NN') | (tags_df.Tag == 'NNS')]
    verb = tags_df[(tags_df.Tag == 'VB') | (tags_df.Tag == 'VBD') | (tags_df.Tag == 'VBG') | (tags_df.Tag == 'VBN')]
    adjective = tags_df[(tags_df.Tag == 'JJ') | (tags_df.Tag == 'JJR') | (tags_df.Tag == 'JJS')]
    adverb = tags_df[(tags_df.Tag == 'RB') | (tags_df.Tag == 'RBR') | (tags_df.Tag == 'RBS')]
    
    pos_list = [noun, verb, adjective, adverb]
    pos_names = ['noun', 'verb', 'adjective', 'adverb']
    
    list_of_dicts = []
    
   
    for pos, name in zip(pos_list, pos_names):
        
        
        top_list = pos.Word.value_counts().head(sample_size)
        for i in range(len(top_list)):
            word = top_list.index.values[i]
            count = top_list.iloc[i]  #positional lookup; the index holds the words
            
            pos_dict = {'Word': word, 'Count':count, 'POS': name, 'Genre': genre}
            list_of_dicts.append(pos_dict)
            
    return pd.DataFrame(list_of_dicts)
 

        

In [None]:
def get_sentiment_examples(df, sentiment, statistic):
    """Takes a dataframe, a sentiment column ('Subjectivity' or 'Polarity'),
    and a statistic ('min', 'mean', or 'max'). Returns a space-separated
    string of the scored words from every lyric whose sentiment value
    lies within 0.3 of that statistic."""

    if sentiment not in ('Subjectivity', 'Polarity'):
        return "check sentiment argument"

    series_sent = df[sentiment]

    if statistic == 'min':
        value = series_sent.min()
    elif statistic == 'max':
        value = series_sent.max()
    elif statistic == 'mean':
        value = series_sent.mean()
    else:
        return "check statistic argument"

    #keep lyrics whose score falls within +/- 0.3 of the chosen statistic
    new_df = df[(series_sent >= (value - 0.3)) & (series_sent <= (value + 0.3))].reset_index(drop=True)

    #each assessment starts with the list of words it scored; collect those words
    words = ''
    for assessments in new_df.Words:
        for assessment in assessments:
            for word in assessment[0]:
                words = words + word + ' '

    return words
    

## Scraping lyrics and chords from Ultimate-Guitar

In [None]:
pop1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=14&page=1'
pop2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=14&page=2'
pop2010_1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=14&decade[]=2010&page=1'
pop2010_2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=14&decade[]=2010&page=2'
pop2010_3 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=14&decade[]=2010&page=3'

rock1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=4&page=1'
rock2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=4&page=2'
rock2010_1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=4&decade[]=2010&page=1'
rock2010_2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=4&decade[]=2010&page=2'
rock2010_3 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=4&decade[]=2010&page=3'

country1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=49&page=1'
country2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=49&page=2'
country2010_1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=49&decade[]=2010&page=1'
country2010_2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=49&decade[]=2010&page=2'
country2010_3 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=49&decade[]=2010&page=3'

rnb1  = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=1787&page=1'
rnb2  = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=1787&page=2'
rnb2010_1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=1787&decade[]=2010&page=1'
rnb2010_2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=1787&decade[]=2010&page=2'
rnb2010_3 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=1787&decade[]=2010&page=3'


hiphop1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=45&page=1'
hiphop2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&genres[]=45&page=2'
hiphop2010_1 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&decade[]=2010&genres[]=45&page=1'
hiphop2010_2 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&decade[]=2010&genres[]=45&page=2'
hiphop2010_3 = 'https://www.ultimate-guitar.com/explore?type[]=Chords&tuning[]=1&decade[]=2010&genres[]=45&page=3'
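
Hand-typing twenty-five nearly identical URLs makes it easy to drop a separator. A small builder can generate them from the genre id, decade, and page number instead. This is a sketch: `explore_url` and `BASE_URL` are hypothetical names, and the query parameters are simply copied from the strings above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ultimate-guitar.com/explore'


def explore_url(genre_id, page, decade=None):
    """Build an explore-page URL with the same query parameters as above."""
    params = [('type[]', 'Chords'), ('tuning[]', 1), ('genres[]', genre_id)]
    if decade is not None:
        params.append(('decade[]', decade))
    params.append(('page', page))
    # safe='[]' keeps the square brackets unescaped, matching the URLs above
    return BASE_URL + '?' + urlencode(params, safe='[]')


print(explore_url(14, 1, decade=2010))
```

With this helper, the three country pages for the 2010s would be `[explore_url(49, page, decade=2010) for page in (1, 2, 3)]`.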

In [None]:
#List of links to most viewed songs, but filtered by genre, and a list of the genre names
#Chose to sample songs from the last decade, otherwise "The Beatles" would be in every genre

genre_list = [pop2010_1, pop2010_2, pop2010_3, rock2010_1, rock2010_2, rock2010_3, 
              country2010_1, country2010_2, country2010_3, rnb2010_1, rnb2010_2, rnb2010_3, 
              hiphop2010_1, hiphop2010_2, hiphop2010_3]

genre_names = ['pop', 'pop', 'pop', 'rock', 'rock', 'rock', 'country', 'country', 'country', 
               'rnb', 'rnb', 'rnb', 'hiphop', 'hiphop', 'hiphop']

genre_list_test = [pop2010_1, pop2010_2, rock2010_1]

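#### The scraped data was exported to .csv files, so the scraping cells below only need to be run once. After that they can be commented out and the saved files loaded in the 'Load data and combine' section instead.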

In [None]:
#Create dataframe of all song links, and their respective genre

links_list_dicts = []
links_list = []


for address, name in zip(genre_list, genre_names):
    
    #Downloads HTML
    links = await get_song_links(address)
    
    #Only add link to list if it isn't already in there
    for link in links:
        
        if link in links_list:
            continue
        else:
            link_dict = {'Link': link, 'Genre': name}
            links_list_dicts.append(link_dict)
            links_list.append(link)
    
links_df = pd.DataFrame(links_list_dicts)    
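
The loop above checks `link in links_list` against a growing list, so each check scans every link seen so far. Tracking seen links in a set gives the same first-genre-wins result with constant-time lookups. The sketch below is illustrative only; `dedupe_links` and the placeholder links are not part of the notebook.

```python
def dedupe_links(pages):
    """pages: iterable of (genre, links) pairs. Returns a list of
    {'Link', 'Genre'} dicts, keeping the first genre seen for each link."""
    seen = set()
    rows = []
    for genre, links in pages:
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            rows.append({'Link': link, 'Genre': genre})
    return rows


# Placeholder links, for illustration only
pages = [('pop', ['link-a', 'link-b']), ('rock', ['link-b', 'link-c'])]
print(dedupe_links(pages))
```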

In [None]:
pop_links = links_df[links_df.Genre == 'pop'].reset_index(drop = True)
rock_links = links_df[links_df.Genre == 'rock'].reset_index(drop = True)
country_links = links_df[links_df.Genre == 'country'].reset_index(drop = True)
rnb_links = links_df[links_df.Genre == 'rnb'].reset_index(drop = True)
hiphop_links = links_df[links_df.Genre == 'hiphop'].reset_index(drop = True)

In [None]:
#get dataframe for pop songs

list_o_dfs = []

for i in range(len(pop_links)):
    link = pop_links.Link[i]
    genre = pop_links.Genre[i]
    song_df = await get_song_dataframe(link, genre)
    list_o_dfs.append(song_df)
    
pop_df = pd.concat(list_o_dfs, axis = 0).reset_index(drop = True)

pop_df.to_csv('poplyricsraw2.csv',index=False)

In [None]:
#get dataframe for rock songs
list_o_dfs = []

for i in range(len(rock_links)):
    link = rock_links.Link[i]
    genre = rock_links.Genre[i]
    song_df = await get_song_dataframe(link, genre)
    list_o_dfs.append(song_df)
    
rock_df = pd.concat(list_o_dfs, axis = 0).reset_index(drop = True)

rock_df.to_csv('rocklyricsraw2.csv',index=False)

In [None]:
#get dataframe for country songs

list_o_dfs = []

for i in range(len(country_links)):
    link = country_links.Link[i]
    genre = country_links.Genre[i]
    song_df = await get_song_dataframe(link, genre)
    list_o_dfs.append(song_df)
    
country_df = pd.concat(list_o_dfs, axis = 0).reset_index(drop = True)

country_df.to_csv('countrylyricsraw2.csv',index=False)

In [None]:
#get dataframe for rnb songs

list_o_dfs = []

for i in range(len(rnb_links)):
    link = rnb_links.Link[i]
    genre = rnb_links.Genre[i]
    song_df = await get_song_dataframe(link, genre)
    list_o_dfs.append(song_df)
    
rnb_df = pd.concat(list_o_dfs, axis = 0).reset_index(drop = True)

rnb_df.to_csv('rnblyricsraw2.csv',index=False)

In [None]:
#get dataframe for hip hop songs

list_o_dfs = []

for i in range(len(hiphop_links)):
    link = hiphop_links.Link[i]
    genre = hiphop_links.Genre[i]
    song_df = await get_song_dataframe(link, genre)
    list_o_dfs.append(song_df)
    
hiphop_df = pd.concat(list_o_dfs, axis = 0).reset_index(drop = True)

hiphop_df.to_csv('hiphoplyricsraw2.csv',index=False)

## Load data and combine

In [None]:
# #load in csv files of lyrics from different genres
# country_df = pd.read_csv('../input/lyrics-and-chords-from-ultimateguitar/country_lyrics_df.csv', index_col = 0)
# pop_df = pd.read_csv('../input/lyrics-and-chords-from-ultimateguitar/pop_lyrics_df.csv', index_col = 0)
# rock_df = pd.read_csv('../input/lyrics-and-chords-from-ultimateguitar/rock_lyrics_df.csv', index_col = 0)
# rnb_df = pd.read_csv('../input/lyrics-and-chords-from-ultimateguitar/rnb_lyrics_df.csv', index_col = 0)
# hiphop_df = pd.read_csv('../input/lyrics-and-chords-from-ultimateguitar/hiphop_lyrics_df.csv', index_col = 0)

genre_df_list = [country_df, pop_df, rock_df, rnb_df, hiphop_df]

main_df = pd.concat(genre_df_list).reset_index(drop=True)

In [None]:
rock_df

## Cleaning data

In [None]:
#Attempt to clean the part-of-song labels

for i, name, label in zip(main_df.index, main_df.Part, main_df.Genre):
    
    
    if ('verse' in name) or ('rap' in name) or ('part' in name):
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Verse')
        
        
    elif ('pre' in name) and ('chorus' in name):        
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Pre-Chorus')
        
        
    elif ('pre' in name) and ('hook' in name):        
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Pre-Chorus')
           
            
    elif ('chorus' in name) or ('refrain' in name) or ('hook' in name):        
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Chorus')
        
        
    elif ('bridge' in name):        
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Bridge')
        
        
    elif ('intro' in name):       
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Intro')
        
        
    elif ('instrumental' in name) or ('break' in name) or ('interlude' in name) or ('riff' in name) or ('solo' in name) or ('lead' in name):        
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Interlude')
        
        
    elif ('outro' in name) or ('end' in name) or ('coda' in name):       
        main_df.loc[i, 'Part'] = main_df.Part[i].replace(name, 'Outro')
    
    
    elif ('chords' in name) or ('tuning' in name) or ('picking pattern' in name) or ('note' in name) or ('2x' in name):        
        #not a song part: drop the row and skip the genre cleanup,
        #which would otherwise re-create the row through .loc
        main_df = main_df.drop([i])
        continue
        
    else:
        main_df.loc[i, 'Part'] = np.nan 
        
    #Clean genre labels
    if label == 'rnb':
        main_df.loc[i, 'Genre'] = 'RnB'
    else:
        new_label = label.capitalize()
        main_df.loc[i, 'Genre'] = new_label
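
The order of the checks in the cleaning loop matters: 'pre-chorus' has to be tested before 'chorus', for example. The sketch below expresses the same keyword rules as an ordered table and a pure function, which makes that order easy to see and to test on single labels. `PART_RULES` and `normalize_part` are hypothetical names, and the rule that drops non-part rows is left out.

```python
# Ordered rules: the first rule whose test passes wins, mirroring the if/elif chain
PART_RULES = [
    (lambda n: any(k in n for k in ('verse', 'rap', 'part')), 'Verse'),
    (lambda n: 'pre' in n and ('chorus' in n or 'hook' in n), 'Pre-Chorus'),
    (lambda n: any(k in n for k in ('chorus', 'refrain', 'hook')), 'Chorus'),
    (lambda n: 'bridge' in n, 'Bridge'),
    (lambda n: 'intro' in n, 'Intro'),
    (lambda n: any(k in n for k in ('instrumental', 'break', 'interlude',
                                    'riff', 'solo', 'lead')), 'Interlude'),
    (lambda n: any(k in n for k in ('outro', 'end', 'coda')), 'Outro'),
]


def normalize_part(name):
    """Map a raw, lower-cased part label to a canonical one, or None."""
    for test, label in PART_RULES:
        if test(name):
            return label
    return None


for raw in ('verse 2', 'pre-chorus', 'hook', 'guitar solo', 'capo'):
    print(raw, '->', normalize_part(raw))
```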

In [None]:
# drop rows without lyrics and duplicate parts
main_df_dropped = main_df.dropna(subset=['Lyrics'])
main_df_dropped  = main_df_dropped.drop_duplicates(subset=['Lyrics']).reset_index(drop = True)
main_df_dropped

### Adding Features

In [None]:
#add sentiment scores to dataframe
sentiments_df = get_pol_sub_scores(main_df_dropped.Lyrics)
main_df_clean = main_df_dropped.join(sentiments_df)

In [None]:
#Count number of unique words in each lyric and add column to main dataframe

unique_words_count = []

for lyric in main_df_clean.Lyrics:
    words = pd.Series(lyric.lower().split(' '))
    count = len(words.unique())
    
    unique_words_count.append(count)

main_df_clean['Unique Words'] = pd.Series(unique_words_count)
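
One detail of the count above: `split(' ')` yields an empty string wherever two spaces sit side by side, and that empty string is counted as a word. Calling `split()` with no argument collapses runs of whitespace. The sketch below compares the two on a made-up line; `count_unique_words` is a hypothetical helper, not part of the notebook.

```python
def count_unique_words(lyric):
    """Count distinct lower-cased words; split() with no argument
    collapses runs of whitespace, so empty strings are never counted."""
    return len(set(lyric.lower().split()))


sample = 'Hold on  hold on to me'  # note the double space
print(len(set(sample.lower().split(' '))))  # split(' ') also counts ''
print(count_unique_words(sample))
```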

In [None]:
main_df_clean

In [None]:
#Add columns for part of speech counts
count_df = get_pos_count_df(main_df_clean.Lyrics)
main_df_clean = pd.concat([main_df_clean, count_df], axis = 1)
main_df_clean

## Analysis

In [None]:
#Split clean df into genre-specific dfs
clean_popdf = main_df_clean[main_df_clean.Genre == 'Pop']
clean_rockdf = main_df_clean[main_df_clean.Genre == 'Rock']
clean_countrydf = main_df_clean[main_df_clean.Genre == 'Country']
clean_rnbdf = main_df_clean[main_df_clean.Genre == 'RnB']
clean_hiphopdf = main_df_clean[main_df_clean.Genre == 'Hiphop']


In [None]:
#Adding stopwords for the Word Cloud
stopwords2 = set(STOPWORDS) 
stopwords2.add('ill')
stopwords2.add( 'na')
stopwords2.add( 'shit')
stopwords2.add( 'fuck')
stopwords2.add( 'fucking')

In [None]:
def remove_repeating_sections(df):
    
    """Removes repeated instances of chorus and pre-chorus in each song,
    returns a dataframe with one chorus and pre-chorus per song"""
    
    #filter out repetitive sections
    chorus = df[df.Part == 'Chorus']
    prechorus = df[df.Part == 'Pre-Chorus']
    other = df[(df.Part != 'Chorus') & (df.Part != 'Pre-Chorus')]

    #Create dataframe of single chorus from each song
    list_of_chorus = []

    for songtitle in chorus['Song Title'].unique():

        songdf = chorus[chorus['Song Title'] == songtitle]
        songdf_reset = songdf.reset_index(drop = True)
        instance = songdf_reset.iloc[0, :]
        list_of_chorus.append(instance)

    single_chorusdf = pd.DataFrame(list_of_chorus)

    #Create dataframe of single pre-chorus from each song
    list_of_prechorus = []

    for songtitle in prechorus['Song Title'].unique():

        songdf = prechorus[prechorus['Song Title'] == songtitle]
        songdf_reset = songdf.reset_index(drop = True)
        instance = songdf_reset.iloc[0, :]
        list_of_prechorus.append(instance)

    single_prechorusdf = pd.DataFrame(list_of_prechorus)

    return pd.concat([other, single_chorusdf, single_prechorusdf]).reset_index(drop=True)

In [None]:
def get_common_phrases(lyric_list):
    
    """Creates a dataframe with columns 
    for the phrase, and the number of counts"""
    
    #Filters out repetition of lyrics
    sampled_lyrics = []
    for lyric in lyric_list: 
        ngram_check = []
        blob = TextBlob(lyric)
        for wordlist in blob.ngrams(5):

            if wordlist not in ngram_check:

                ngram_check.append(wordlist)
            else: break

        lyric_split = lyric.split(' ')[:len(ngram_check)]
        lyric_sample = " ".join(lyric_split)

        sampled_lyrics.append(lyric_sample)

    sampled_lyrics_series = pd.Series(sampled_lyrics)

    #Get 3-gram chunks, ignore 3-grams where the same word appears twice in a row
    all_phrases = []
    for lyric in sampled_lyrics_series.drop_duplicates(): 
        sentences = []
        blob = TextBlob(lyric)
        for wordlist in blob.ngrams(3):
            if wordlist[0].lower() == wordlist[1].lower():
                continue
            elif wordlist[-1].lower() == wordlist[-2].lower():
                continue
            else:

                sentence = ""
                for word in wordlist:


                    sentence = sentence + " " + str(word)

                all_phrases.append(sentence)


    #Get the top 12 instances, store in dataframe 
    common_phrase2 = pd.Series(all_phrases).value_counts().head(12)
    common_phrases = pd.DataFrame()
    common_phrases['Phrase'] = common_phrase2.index
    common_phrases['Count'] = common_phrase2.values

    return common_phrases
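
The phrase counting can be illustrated without TextBlob. The sketch below counts word 3-grams with `collections.Counter` and applies the same "no identical neighbouring words" filter; it omits the earlier step that truncates each lyric at its first repeated 5-gram, and it lower-cases and splits on whitespace rather than using TextBlob's tokenizer. `top_ngrams` and the sample lines are made up.

```python
from collections import Counter


def top_ngrams(lyrics, n=3, top=12):
    """Count word n-grams across lyrics, skipping any n-gram in which
    two neighbouring words are identical (e.g. 'oh oh I')."""
    counts = Counter()
    for lyric in lyrics:
        words = lyric.lower().split()
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if any(a == b for a, b in zip(gram, gram[1:])):
                continue
            counts[' '.join(gram)] += 1
    return counts.most_common(top)


# Made-up lines, for illustration only
lyrics = ['I want you back', 'say I want you now', 'oh oh I want']
print(top_ngrams(lyrics))
```

Here 'i want you' appears in two of the lines and so ranks first, while 'oh oh i' is skipped by the filter.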

In [None]:
#Function for generating plots

def get_genre_analysis(df):
    """Takes the main dataframe, returns a figure with various analyses of the lyrics provided."""
    
    #Remove repeated sections of each song
    df = remove_repeating_sections(df)


    #filter out default sentiment scores
    analyzed_words = df[[(len(x) != 0) for x in df.Words]]

    polarity_list = analyzed_words.Polarity
    subjectivity_list = analyzed_words.Subjectivity


    #find most common words for each part of speech tag
    df_tags = get_pos_tags(df.Lyrics)
    df_top = get_common_pos_df(df_tags, 5, df.Genre)

    with sns.color_palette('tab10'), sns.axes_style("darkgrid"):

        #Create figure and grid
        fig = plt.figure(figsize=(16, 16)) 
        gs = gridspec.GridSpec(4, 3, hspace=.4, wspace = .25)


        #Violin plot of Subjectivity and Polarity

        sub_pol_melt = pd.melt(df[['Polarity', 'Subjectivity']], var_name='Sentiment', value_name='Sentiment Value')
        ax = plt.subplot(gs[0, 0])
        ax = sns.violinplot(x = 'Sentiment', y = 'Sentiment Value', data =sub_pol_melt)
        ax.set_title('Distribution of sentiment values') 

        #Violinplots of unique words in each part of the song
        ax2 = plt.subplot(gs[0, 1])
        ax2 = sns.violinplot(x = 'Part', y = 'Unique Words', data = df, order = ['Verse', 'Chorus', 'Pre-Chorus', 'Intro', 'Interlude', 'Outro'])
        plt.xticks(rotation = 10, size = 7)
        ax2.set_title('Distribution of unique words in each part')


        #Violin plot for part of speech instance in each song
        pos_count_melt = pd.melt(df.iloc[:, -4:], var_name='Parts of Speech', value_name='Fraction of total words')
        ax3 = plt.subplot(gs[0, 2])
        ax3 = sns.violinplot(x = 'Parts of Speech', y = 'Fraction of total words', data = pos_count_melt)
        ax3.set_title('Distribution of P.O.S. for each song')

        #Create Wordcloud

        cloud = WordCloud(width = 2000, height = 1200, stopwords = stopwords2, background_color ='white',  min_font_size = 12, colormap = 'tab10', collocations = False,  repeat = True)
        pad = .01

        ##Subjectivity Examples
        # min

        min_sub = get_sentiment_examples(analyzed_words, sentiment = 'Subjectivity', statistic = 'min')

        sub_min_cloud = cloud
        sub_min_cloud.generate(min_sub)

        ax4 = plt.subplot(gs[1, 0])
        ax4 = plt.axis("off") 
        ax4 = plt.title('Examples of low subjectivity')
        ax4 = plt.imshow(sub_min_cloud, aspect='auto').axes

        
        #mean

        mean_sub = get_sentiment_examples(analyzed_words, sentiment = 'Subjectivity', statistic = 'mean')

        sub_mean_cloud = cloud
        sub_mean_cloud.generate(mean_sub)

        ax5 = plt.subplot(gs[1, 1])
        ax5 = plt.axis("off")  
        ax5 = plt.title('Examples of average subjectivity')
        ax5 = plt.imshow(sub_mean_cloud, aspect='auto').axes

        
        #max

        max_sub = get_sentiment_examples(analyzed_words, sentiment = 'Subjectivity', statistic = 'max')

        sub_max_cloud = cloud
        sub_max_cloud.generate(max_sub)

        ax6 = plt.subplot(gs[1, 2])
        ax6 = plt.axis("off")         
        ax6 = plt.title('Examples of high subjectivity')
        ax6 = plt.imshow(sub_max_cloud, aspect='auto').axes


        ##Polarity Examples

        #min

        min_pol = get_sentiment_examples(analyzed_words, sentiment = 'Polarity', statistic = 'min')

        pol_min_cloud = cloud
        pol_min_cloud.generate(min_pol)

        ax7 = plt.subplot(gs[2, 0])
        ax7 = plt.axis("off") 
        ax7 = plt.title('Examples of low polarity')
        ax7 = plt.imshow(pol_min_cloud, aspect='auto').axes

        ##mean

        mean_pol = get_sentiment_examples(analyzed_words, sentiment = 'Polarity', statistic = 'mean')

        pol_mean_cloud = cloud
        pol_mean_cloud.generate(mean_pol)

        ax8 = plt.subplot(gs[2, 1])
        ax8 = plt.axis("off") 
        ax8 = plt.title('Examples of average polarity')
        ax8 = plt.imshow(pol_mean_cloud, aspect='auto').axes

        ##max

        max_pol = get_sentiment_examples(analyzed_words, sentiment = 'Polarity', statistic = 'max')

        pol_max_cloud = cloud
        pol_max_cloud.generate(max_pol)

        ax9 = plt.subplot(gs[2, 2])
        ax9 = plt.axis("off") 
        ax9 = plt.title('Examples of high polarity' )
        ax9 = plt.imshow(pol_max_cloud, aspect='auto').axes

        #Barplot of parts of speech
        ax10 = plt.subplot(gs[3, 0])
        ax10 = sns.barplot(y="Word", x="Count", hue="POS", data=df_top, dodge=False)
        ax10.set_title('Common words') 

        ##Common 3 word Phrases

        common_phrasesdf = get_common_phrases(df.Lyrics)

        #Plot and add to figure
        ax12 = plt.subplot(gs[3, 2])
        ax12 = sns.barplot(y='Phrase', x='Count', data=common_phrasesdf, dodge=False, palette="tab10")
        ax12.set_title('Common phrases')
        
    #Add title to the Grid, shows genre and number of lyrics that were analyzed    
    if len(df.Genre.unique()) > 1: 
        selection = 'all'
    else: 
        selection = str(df.Genre[0])
    lyric_count = str(len(df))    
    plt.suptitle('Analysis of '+ selection +  ' lyrics  (Count: ' + lyric_count + ') ',fontsize=20)
    


In [None]:
#Just to break "Run All"
#3 = 2

## Results

### Analysis of all collected data

In [None]:
get_genre_analysis(main_df_clean)

### Analysis generated for each genre

In [None]:
get_genre_analysis(clean_popdf)

In [None]:
get_genre_analysis(clean_rockdf)

In [None]:
get_genre_analysis(clean_countrydf)

In [None]:
get_genre_analysis(clean_rnbdf)

In [None]:
get_genre_analysis(clean_hiphopdf)