# Cleansing Song Lyrics Dataset
---

## Purpose

Everyone knows that data scientists want to spend 90% of their time building and fine-tuning their models, but unfortunately they tend to spend 90% of their time collecting and preparing the datasets for those models instead. In light of this, we figured it would be a good idea to show off our own data cleansing skills and demonstrate the process in action using Jupyter notebooks. More specifically, we are going to use this notebook to document the steps taken to prepare the lyrical dataset we will use to train our generative Steely Dan songwriting model.

---

## The Data

The dataset we will be using for this project is a collection of 57,650 songs from 643 different artists, including 88 Steely Dan songs. The songs were scraped from the website [lyricsfreak](https://www.lyricsfreak.com/) and are available at the following [Kaggle site](https://www.kaggle.com/mousehead/songlyrics), courtesy of Sergey Kuznetsov, who took the liberty of scraping the data and providing it publicly.

The dataset is conveniently stored in a CSV file, so the first thing we want to do is load the file using pandas and take a look at its structure as well as a couple of example records.

In [3]:
import pandas as pd

# Read in csv file
filename = '../songdata.csv'
dataset = pd.read_csv(filename, encoding = 'UTF-8')

# Display sample of dataset
dataset

Unnamed: 0,artist,song,link,text
0,ABBA,She's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...
...,...,...,...,...
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \nLet the angels fly l...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \nMore power \nPower to...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \nis something i'll believe \nf...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \nam i frightened \nwhere can ...


From the above we see that our dataset includes 57,650 songs as previously mentioned and for each song record we have 4 fields:

    1) artist: name of the artist the song belongs to
    2) song: title of the song
    3) link: subpage of https://www.lyricsfreak.com/ from which the song was pulled
    4) text: single string value of all the lyrics for the song

We won't be using the link field for our purposes, so let's go ahead and delete this column. After doing so, let's print some more information and statistics on our remaining 3 columns. The statistics might not be too useful since these are all string fields, but it is generally good practice to do this to get a feel for the scope of the data in each column.

In [2]:
# delete "link" field
dataset = dataset.drop(columns = ['link'])

# display some useful statistics
dataset.info()
dataset.describe(include = 'all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57650 entries, 0 to 57649
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  57650 non-null  object
 1   song    57650 non-null  object
 2   text    57650 non-null  object
dtypes: object(3)
memory usage: 1.3+ MB


Unnamed: 0,artist,song,text
count,57650,57650,57650
unique,643,44824,57494
top,Donna Summer,Have Yourself A Merry Little Christmas,I just came back from a lovely trip along the ...
freq,191,35,6


Above we can see that our dataset no longer has the "link" column. It is worth noting from the displayed info that the remaining columns contain no Null values. This means every record has a value in each of the three fields, which is nice since we would otherwise need to handle Null values by filling in the information, removing the records, or at least being aware of their existence.

In the lower table we have some information on the number of unique values for each field and the most frequent values. For the artists we confirm the dataset contains songs from 643 unique artists, with the most prolific being Donna Summer. Additionally, we see that not all of our songs are unique. There is a significant amount of overlap in the song titles, with 12,826 duplicates, and a small overlap in the exact lyrics to these songs, with 156 duplicates. This will be the first issue we need to address when cleansing and preparing our data.
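
The duplicate counts quoted above are simply the "count" row minus the "unique" row (for titles, 57,650 - 44,824 = 12,826). As a minimal sketch of that arithmetic in plain Python, the helper `duplicate_summary` and the toy title list below are hypothetical stand-ins, not part of the notebook or the real data:

```python
from collections import Counter

def duplicate_summary(values):
    """Return (total, unique, duplicates) for a list of hashable values."""
    total = len(values)
    unique = len(set(values))
    return total, unique, total - unique

# Toy stand-in for the "song" column (hypothetical values, not the real data)
titles = ["Hold On", "Movin' Out", "Hold On", "Bang", "Hold On"]
total, unique, duplicates = duplicate_summary(titles)
print(total, unique, duplicates)  # 5 3 2

# The most repeated value and how often it occurs, like the "top" and "freq" rows
top, freq = Counter(titles).most_common(1)[0]
print(top, freq)  # Hold On 3
```

Note that a "duplicate" here only means a repeated value; as the next section shows, a repeated title does not necessarily mean a repeated song.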

## Data Cleansing

Before we even think about working with the data and reformatting it for our models, we first need to validate the integrity of the dataset and the values within it. This typically involves removing duplicate records, handling Null values, changing datatypes, and correcting incorrect information. Thankfully, for this dataset all three fields are Null free and already in the format we want, strings. So in the sections that follow we will focus on removing duplicate songs and handling extraneous issues with our song lyrics.

### 1) Removing Duplicates

As mentioned already, our dataset has a number of duplicate song records. We don't want any songs to be over represented in our dataset, so we are going to need to identify all the duplicates and decide which version to keep.

Our initial instinct might be to simply remove all the songs with the same title, but it turns out that completely different songs sometimes just have the same name. As an example, below are two records each with the song title "Movin' Out", one by Aerosmith and the other by Billy Joel. As we can see, the lyrics for each are completely different and the two are distinct songs despite having the same name.

In [3]:
# Displaying songs titled "Movin' Out"
dataset[dataset['song'] == "Movin' Out"]

Unnamed: 0,artist,song,text
169,Aerosmith,Movin' Out,We all live on the edge of town \nWhere we al...
1410,Billy Joel,Movin' Out,Anthony works in the grocery store \nSavin hi...


In [4]:
# Printing some more of the songs
movin_out_songs = dataset[dataset['song'] == "Movin' Out"]

print('Artist: ' + movin_out_songs.iloc[0]['artist'])
print('Title: ' + movin_out_songs.iloc[0]['song'])
print(movin_out_songs.iloc[0]['text'][:236])
print('\n')
print('Artist: ' + movin_out_songs.iloc[1]['artist'])
print('Title: ' + movin_out_songs.iloc[1]['song'])
print(movin_out_songs.iloc[1]['text'][:330])

Artist: Aerosmith
Title: Movin' Out
We all live on the edge of town  
Where we all live ain't a soul around  
People start a-comin'  
All we do is just a-grin  
Said we gotta move out  
'Cause the city's movin' in  
I said we gotta move out  
'Cause the city's movin' in  


Artist: Billy Joel
Title: Movin' Out
Anthony works in the grocery store  
Savin his pennies for some day  
Mama Leone left a note on the door  
She said "Sonny move out to the country"  
Ah but working too hard can give you  
A heart attack, ack, ack, ack, ack, ack  
You ought-a know by now  
Who needs a house out in Hackensack?  
Is that all you get for your money


This shows that not all songs with the same title are necessarily duplicates, and that we should probably focus on the text of each song to determine if they are the same. Obviously songs that have the exact same text should be considered duplicates, like the 6 copies of the most frequent text in the statistics table above (the one that starts with "I just came back from a lovely trip along the ..."). However, what about songs where only a few words or characters have been changed? For instance, below are two records of the song "Have Yourself A Merry Little Christmas", one by America and the other by Barbra Streisand. These are clearly the same song, but with slightly different formatting, and Barbra Streisand's version includes an intro.

In [5]:
# Grab all songs with title "Have Yourself A Merry Little Christmas"
merry_christmas_songs = dataset[dataset['song'] == "Have Yourself A Merry Little Christmas"]

# Show Two Examples With Slightly Different Lyrics
print('Artist: ' + merry_christmas_songs.iloc[0]['artist'])
print('Title: ' + merry_christmas_songs.iloc[0]['song'])
print(merry_christmas_songs.iloc[0]['text'][:226])
print('\n')
print('Artist: ' + merry_christmas_songs.iloc[1]['artist'])
print('Title: ' + merry_christmas_songs.iloc[1]['song'])
print(merry_christmas_songs.iloc[1]['text'][:350])

Artist: America
Title: Have Yourself A Merry Little Christmas
Have yourself a Merry Little Christmas  
Let your heart be light  
From now on our troubles will be out of sight  
  
Have yourself a Merry Little Christmas  
Make the Yuletide gay  
From now on our troubles will be miles away


Artist: Barbra Streisand
Title: Have Yourself A Merry Little Christmas
Christmas' future is far away  
Christmas' past is passed  
Christmas' present is here today  
Bringing joy that will last  
Have yourself a merry little Christmas  
Let your hearts be light  
From now on our troubles will be out of sight  
Have yourself a merry little Christmas  
Make the Yule time gay  
From now on our troubles will be miles away


To capture instances like this, where the songs are essentially the same but with a few changes, we are going to use the Python library fuzzywuzzy. This library has a useful fuzzy match function, "fuzz.ratio", that tells us how similar two strings are on a scale from 0 to 100. More specifically, it scores the similarity using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the two strings, which is a measure of how many changes need to be made to make the strings exactly the same. The fuzzywuzzy library also has additional functions that preprocess the strings for us to improve the comparison. Below we test several of these functions on the two "Movin' Out" songs and the two "Have Yourself A Merry Little Christmas" songs from earlier. Note we convert all text to lowercase for the functions that don't implicitly do so.

In [6]:
from fuzzywuzzy import fuzz

# Get text for each song
movin_out_1 = movin_out_songs.iloc[0]['text'].lower()
movin_out_2 = movin_out_songs.iloc[1]['text'].lower()
merry_christmas_1 = merry_christmas_songs.iloc[0]['text'].lower()
merry_christmas_2 = merry_christmas_songs.iloc[1]['text'].lower()

# Calculate fuzzy match values
print("Movin' Out fuzzy match ratio: ", fuzz.ratio(movin_out_1, movin_out_2))
print("Movin' Out fuzzy match partial ratio: ", fuzz.partial_ratio(movin_out_1, movin_out_2))
print("Movin' Out fuzzy match token sort ratio: ", fuzz.token_sort_ratio(movin_out_1, movin_out_2))
print("Movin' Out fuzzy match token set ratio: ", fuzz.token_set_ratio(movin_out_1, movin_out_2))
print('\n')
print("Merry Little Christmas fuzzy match ratio: ", fuzz.ratio(merry_christmas_1, merry_christmas_2))
print("Merry Little Christmas fuzzy match partial ratio: ", fuzz.partial_ratio(merry_christmas_1, merry_christmas_2))
print("Merry Little Christmas fuzzy match token sort ratio: ", fuzz.token_sort_ratio(merry_christmas_1, merry_christmas_2))
print("Merry Little Christmas fuzzy match token set ratio: ", fuzz.token_set_ratio(merry_christmas_1, merry_christmas_2))

Movin' Out fuzzy match ratio:  42
Movin' Out fuzzy match partial ratio:  45
Movin' Out fuzzy match token sort ratio:  52
Movin' Out fuzzy match token set ratio:  58


Merry Little Christmas fuzzy match ratio:  73
Merry Little Christmas fuzzy match partial ratio:  76
Merry Little Christmas fuzzy match token sort ratio:  82
Merry Little Christmas fuzzy match token set ratio:  95
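
For intuition about the Levenshtein distance behind these scores, here is a minimal pure-Python sketch of the classic dynamic-programming calculation. The `similarity` helper is a hypothetical normalization added for illustration; fuzzywuzzy's own scoring formula differs, so the numbers will not match the ratios printed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score: 100 means identical, lower means more edits."""
    longest = max(len(a), len(b))
    return 100 if longest == 0 else round(100 * (1 - levenshtein(a, b) / longest))

print(levenshtein("kitten", "sitting"))  # 3
print(similarity("make the yuletide gay", "make the yule time gay"))
```

The cost grows with the product of the two string lengths, which is one reason comparing full lyrics across the whole dataset is slow.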


In general, the fuzzy match ratio increases as the two strings become more similar. The different functions process the strings differently (e.g. reordering words alphabetically), which is why the ratios differ for each function. Ideally, we would calculate these ratios between every pair of songs in our dataset, but that would be extremely computationally intensive and take years for my pathetic PC to process. Instead, we will assume that songs with different titles are never the same song, so we only need to calculate these values between songs with the same title.
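
To get a feel for how much work the same-title assumption saves, the sketch below counts pairwise comparisons with and without it. It is plain Python on a toy title list; the helper names `pairs` and `comparison_counts` are hypothetical, and only the all-pairs figure is computed for the real dataset size.

```python
from collections import Counter

def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def comparison_counts(titles):
    """Return (all-pairs count, same-title-pairs count) for a list of titles."""
    all_pairs = pairs(len(titles))
    same_title_pairs = sum(pairs(count) for count in Counter(titles).values())
    return all_pairs, same_title_pairs

# Toy stand-in for the "song" column (hypothetical values)
titles = ["Hold On", "Movin' Out", "Hold On", "Bang", "Movin' Out", "Hold On"]
print(comparison_counts(titles))  # (15, 4)

# Comparing every one of the 57,650 songs with every other song
print(pairs(57650))  # 1661732425
```

Each of those comparisons runs on full song lyrics, so cutting the pair count is where most of the time is saved.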

To help us compare the songs with the same title we will use the built-in "extractOne" function from the fuzzywuzzy library's "process" module. This function finds the best match between one string and a list of strings. We can use it to more efficiently compare the lyrics from songs with the same name. Below we write a function that does just that and test it on the set of all "Have Yourself A Merry Little Christmas" songs.

In [7]:
from fuzzywuzzy import process

def fuzzyCompare(data_subset):
    '''
    Takes a subset of our dataset and creates a new column with the highest fuzzy match ratio
    between this song and all other songs in the subset.
    '''
    # Initialize list of values for fuzzy match
    fuzzy_match = [[],[],[],[]]
    
    # Iterate over songs to find highest fuzzy match between the song and the rest of the subset
    for i in range(len(data_subset)):
        row_number = data_subset.index[i]
        # Lyrics of the current song and of every other song in the subset
        lyrics = data_subset.iloc[i]['text']
        other_lyrics = list(data_subset.drop(row_number)['text'])
        # extractOne returns (best_match, score); we only keep the score
        ratio_1 = process.extractOne(lyrics, other_lyrics, processor=str.lower, scorer=fuzz.ratio)[1]
        ratio_2 = process.extractOne(lyrics, other_lyrics, processor=str.lower, scorer=fuzz.partial_ratio)[1]
        ratio_3 = process.extractOne(lyrics, other_lyrics, processor=str.lower, scorer=fuzz.token_sort_ratio)[1]
        ratio_4 = process.extractOne(lyrics, other_lyrics, processor=str.lower, scorer=fuzz.token_set_ratio)[1]
        fuzzy_match[0].append(ratio_1)
        fuzzy_match[1].append(ratio_2)
        fuzzy_match[2].append(ratio_3)
        fuzzy_match[3].append(ratio_4)
    
    # Add ratios to dataframe
    data_subset.insert(3, 'max_fm_ratio', fuzzy_match[0])
    data_subset.insert(4, 'max_fm_partial_ratio', fuzzy_match[1])
    data_subset.insert(5, 'max_fm_token_sort_ratio', fuzzy_match[2])
    data_subset.insert(6, 'max_fm_token_set_ratio', fuzzy_match[3])
        
    return data_subset

# Test function
fuzzy_test = fuzzyCompare(merry_christmas_songs.copy())

# Display results
fuzzy_test.sort_values(['max_fm_ratio','max_fm_partial_ratio','max_fm_token_sort_ratio','max_fm_token_set_ratio'])

Unnamed: 0,artist,song,text,max_fm_ratio,max_fm_partial_ratio,max_fm_token_sort_ratio,max_fm_token_set_ratio
9072,James Taylor,Have Yourself A Merry Little Christmas,(Hugh Martin and Ralph Blane) \n \nChristmas...,81,88,84,96
25838,Bob Dylan,Have Yourself A Merry Little Christmas,Have yourself \nA merry little Christmas \nL...,83,92,88,98
1077,Barbra Streisand,Have Yourself A Merry Little Christmas,Christmas' future is far away \nChristmas' pa...,84,95,87,97
3222,Cliff Richard,Have Yourself A Merry Little Christmas,"Have yourself a merry little Christmas, \nLet...",85,96,86,99
31083,Ellie Goulding,Have Yourself A Merry Little Christmas,Have yourself a merry little Christmas \nLet ...,91,91,90,97
40250,Kim Wilde,Have Yourself A Merry Little Christmas,Have yourself a merry little Christmas \nLet ...,91,100,92,100
32613,Fifth Harmony,Have Yourself A Merry Little Christmas,[Camila:] \nHave yourself a Merry Little Chri...,92,90,95,100
47895,Perry Como,Have Yourself A Merry Little Christmas,Have yourself a merry little Christmas \nLet ...,93,92,91,99
43581,Michael Bolton,Have Yourself A Merry Little Christmas,Have yourself a merry little Christmas \nLet ...,93,96,95,100
43662,Michael Buble,Have Yourself A Merry Little Christmas,Have yourself a merry little Christmas \nLet ...,93,98,94,100


Judging from the above test on the set of "Have Yourself A Merry Little Christmas" songs, it appears that the "token set ratio" fuzzy match function is the most consistent. Since all 35 versions of this song are essentially the same and should be treated as duplicates, we want to use the function that returns high values for these songs. The "token set ratio" function does best in this regard, with all values > 95.
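
To see why a token-set style comparison is forgiving of an added intro, here is a simplified sketch of the idea. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, and `token_set_ratio_sketch` is a hypothetical illustration, not fuzzywuzzy's actual implementation, so its scores will not exactly match the table above.

```python
import re
from difflib import SequenceMatcher

def ratio(a, b):
    """0-100 similarity from difflib (a stand-in for a plain fuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_ratio_sketch(text_1, text_2):
    """Compare the shared words against each text's shared-plus-extra words
    and return the best of the three pairwise scores."""
    tokens_1 = set(re.findall(r"\w+", text_1.lower()))
    tokens_2 = set(re.findall(r"\w+", text_2.lower()))
    shared = " ".join(sorted(tokens_1 & tokens_2))
    combined_1 = (shared + " " + " ".join(sorted(tokens_1 - tokens_2))).strip()
    combined_2 = (shared + " " + " ".join(sorted(tokens_2 - tokens_1))).strip()
    return max(ratio(shared, combined_1),
               ratio(shared, combined_2),
               ratio(combined_1, combined_2))

short = "have yourself a merry little christmas"
with_intro = "christmas future is far away have yourself a merry little christmas"
print(ratio(short, with_intro))                   # pulled down by the extra intro
print(token_set_ratio_sketch(short, with_intro))  # 100
```

Because every word of the shorter text also appears in the longer one, the shared-words string equals the shorter text's word set and the score is 100. The same property explains why this function can also score unrelated songs higher than the other functions do.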

Now let's check the performance of the functions on a set of randomly chosen and completely unrelated songs.

In [8]:
import numpy as np

# choose random song indices
n = len(dataset)
random_indxs = np.random.choice(n, 50, replace = False)

# grab songs
random_songs = dataset.iloc[random_indxs]

# fuzzy match songs
fuzzy_test = fuzzyCompare(random_songs.copy())

# Display results
fuzzy_test.sort_values(['max_fm_ratio','max_fm_partial_ratio','max_fm_token_sort_ratio','max_fm_token_set_ratio'])

Unnamed: 0,artist,song,text,max_fm_ratio,max_fm_partial_ratio,max_fm_token_sort_ratio,max_fm_token_set_ratio
37301,Iwan Fals,Mimpi Yang Terbeli,Mimpi yang terbeli \n \nBerjalan di situ... ...,41,42,45,45
49226,Puff Daddy,Through The Pain (She Told Me),[Diddy:] \nCan You Feel Me? \nCan You Touch ...,41,48,54,75
4003,Deep Purple,Pictured Within,here be friends... \nhere be heroes... \nher...,43,45,51,56
46866,One Direction,Best Song Ever,Maybe it's the way she walked (Wow) \nStraigh...,44,48,56,62
46563,Ofra Haza,Shki'a,YOM SHEL 'AMAL SHOKE'A EL SOF HAYAM \nRACHOK ...,45,45,45,51
14343,Noa,Hi,"Shalachta elai et hao'ach, \nLehaireni mishen...",45,45,45,51
15049,Opeth,Remember Tomorrow,"Unchain the colours before my eyes, \nYesterd...",45,46,51,54
1077,Barbra Streisand,Have Yourself A Merry Little Christmas,Christmas' future is far away \nChristmas' pa...,45,46,53,58
1229,The Beatles,Misery,The world is treating me bad...Misery. \nI'm ...,45,46,55,68
8754,Irving Berlin,Slumming On Park Avenue,Put on your slumming clothes and get your car ...,45,46,56,59


Judging from above, it appears that all 4 of the different fuzzy match functions do a good job of returning low values for different songs. Though the "token set ratio" function returns pretty high values in this case, they are still distinguishably lower than when the songs are similar. As such, I am still leaning towards using this function over the others, possibly with a cutoff value of 90 for matches.

Just to be sure, let's test each function one more time on a set of songs that have similar lyrics and word choices, but are not the same songs. I feel a good test for this is to use just Beatles songs, since roughly half of their songs are about love in some shape or form.

In [9]:
# grab beatles songs
beatles_songs = dataset[dataset['artist'] == "The Beatles"]

# choose random songs
n = len(beatles_songs)
random_indxs = np.random.choice(n, 50, replace = False)
random_beatles_songs = beatles_songs.iloc[random_indxs]

# fuzzy match songs
fuzzy_test = fuzzyCompare(random_beatles_songs.copy())

# Display results
fuzzy_test.sort_values(['max_fm_ratio','max_fm_partial_ratio','max_fm_token_sort_ratio','max_fm_token_set_ratio'])

Unnamed: 0,artist,song,text,max_fm_ratio,max_fm_partial_ratio,max_fm_token_sort_ratio,max_fm_token_set_ratio
24794,The Beatles,Junk,"Motorcars, handlebars, bicycles for two \nBro...",43,44,46,47
1222,The Beatles,"Hello, Goodbye","You say ""Yes"", I say ""No"". \nYou say ""Stop"" a...",43,44,49,60
1214,The Beatles,Dear Wack!,[Speech] \n \nBrian Matthew: But despite the...,43,45,56,59
24713,The Beatles,Crinsk Dee Night,[Speech] \n \nBrian Matthew: The next few mi...,43,46,53,63
24798,The Beatles,"Komm, Gib Mir Deine Hand","Oh, komm doch, komm zu mir \nDo nimmst mir de...",44,45,44,47
24792,The Beatles,Johnny B. Goode,Deep down in Louisianna \nClose to New Orlean...,44,45,54,61
24807,The Beatles,Love These Goon Shows!,"[Speech] \n \nLee Peters: But now, John has ...",44,46,56,61
24700,The Beatles,Bad Boy,A bad little kid moved into my neighborhood \...,45,46,56,61
24791,The Beatles,Jingle Bells,"Dashing through the snow, in a one-horse open ...",45,46,58,58
24739,The Beatles,Golden Slumbers,"Once there was a way, \nTo get back homeward....",45,47,51,55


Looks like my Beatles theory either doesn't hold up or these functions distinguish between the songs pretty well even though their concepts and vocabulary may be similar. As such, moving forward let's use the token set ratio fuzzy match function to determine if different records are the same song, and let's consider any song with a value >= 90 to be a duplicate.

With our new rule in place, let's re-define our function for calculating the fuzzy match values using only the token set ratio function.

In [10]:
# Re-define our fuzzyCompare function to only perform token_set_ratio fuzzy compare
def fuzzyCompare(data_subset):
    '''
    Takes a subset of our dataset and creates a new column with the highest fuzzy match ratio
    between this song and all other songs in the subset.
    '''
    # Initialize list of values for fuzzy match
    fuzzy_match = []
    
    # Iterate over songs to find highest fuzzy match between the song and the rest of the subset
    for i in range(len(data_subset)):
        row_number = data_subset.index[i]
        # Lyrics of the current song and of every other song in the subset
        lyrics = data_subset.iloc[i]['text']
        other_lyrics = list(data_subset.drop(row_number)['text'])
        ratio = process.extractOne(lyrics, other_lyrics, processor=str.lower, scorer=fuzz.token_set_ratio)[1]
        fuzzy_match.append(ratio)
    
    # Add ratios to dataframe
    data_subset.insert(3, 'max_fuzzy_match', fuzzy_match)
        
    return data_subset

Now let's use the above function to actually remove duplicates from our dataset. To do so, let's first build a wrapper function that identifies all sets of duplicated song titles, calculates the fuzzy match values for each using the above function, and uses these values to pick the unique songs. For clarity, we remind the reader that a duplicate song is one that has a fuzzy match value >= 90 with another song of the same name already in our dataset.

In [11]:
import string

# function that removes near-duplicate records among songs that share a title
def DedupeDataset(data, fuzzy_cutoff):
    '''
    Groups records that share a (normalized) song title and, within each group,
    keeps only the records whose best fuzzy match is below fuzzy_cutoff (or the
    first record if none are). Returns the transformed dataset.
    '''
    # Grab titles with duplicates
    # Note we lowercase, replace "-" with spaces, and remove all other ASCII punctuation from the titles
    characters = string.printable[62:-6]
    pattern = "[" + characters + "]"
    # regex=True is passed explicitly because newer pandas versions treat the pattern literally by default
    title_freq = data['song'].str.lower().str.replace('-', ' ', regex=False).str.replace(pattern, '', regex=True).value_counts()
    duplicate_titles = title_freq[title_freq > 1]
    
    # Remove songs with duplicated titles from dataset and add to list
    # Note we index the "data" argument rather than the global "dataset"
    duplicate_songs = []
    for i in range(len(duplicate_titles)):
        title = duplicate_titles.index[i]
        same_title = data['song'].str.lower().str.replace('-', ' ', regex=False).str.replace(pattern, '', regex=True) == title
        duplicate_songs.append(data[same_title])
        data = data.drop(data.index[same_title])
        
    # Iterate over each set of duplicated titles, calculate fuzzy match rates, and keep songs below fuzzy_cutoff
    # If no songs below fuzzy cutoff, grab first song
    unique_sets = [data.copy()]
    for subset in duplicate_songs:
        fuzzy_test = fuzzyCompare(subset)
        unique_records = fuzzy_test[fuzzy_test['max_fuzzy_match'] < fuzzy_cutoff]
        if len(unique_records) == 0:
            unique_records = fuzzy_test.iloc[0:1]
        unique_sets.append(unique_records.drop(columns = ['max_fuzzy_match'], axis = 1))
    
    # Append unique songs back onto our dataset
    data = pd.concat(unique_sets)
    
    return data.sort_index()
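
DedupeDataset keeps the records whose best match is below the cutoff and falls back to the first record when none qualify. A variant that follows the rule as worded earlier (a song is a duplicate if it matches a song "already in our dataset") would keep the first record of each group of near-identical lyrics. The sketch below illustrates that greedy idea only; it is not what the notebook runs, it uses `difflib` from the standard library as a stand-in for the fuzzy match score, and the function names and sample lyrics are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0-100 similarity from difflib, a stand-in for the fuzzy match score."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def keep_first_of_each_cluster(texts, cutoff=90):
    """Return the indices of texts to keep. A text is dropped only if it
    scores >= cutoff against a text that has already been kept."""
    kept = []
    for i, text in enumerate(texts):
        if all(similarity(text, texts[j]) < cutoff for j in kept):
            kept.append(i)
    return kept

# Hypothetical lyrics sharing one title: two near-identical versions and one different song
same_title_lyrics = [
    "hold on, hold on, there's got to be a future",
    "Hold on, hold on, there's got to be a future!",
    "deep down in the dead of night I call out your name",
]
print(keep_first_of_each_cluster(same_title_lyrics))  # [0, 2]
```

The trade-off is that the result depends on record order, since whichever version appears first is the one that survives.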
        

Now let's run the function on our dataset and print some statistics to make sure it's working. NOTE - the dedupe function may take a while to run since it is comparing so many long string values. For me it took ~10 minutes.

In [12]:
# set cutoff values
fuzzy_cutoff = 90

# Remove duplicates
dataset = DedupeDataset(dataset.copy(), fuzzy_cutoff)

# Display deduped dataset
dataset

Unnamed: 0,artist,song,text
0,ABBA,She's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...
...,...,...,...
57645,Ziggy Marley,Good Old Days,Irie days come on play \nLet the angels fly l...
57646,Ziggy Marley,Hand To Mouth,Power to the workers \nMore power \nPower to...
57647,Zwan,Come With Me,all you need \nis something i'll believe \nf...
57648,Zwan,Desire,northern star \nam i frightened \nwhere can ...


In [13]:
# display some useful statistics
dataset.info()
dataset.describe(include = 'all')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52808 entries, 0 to 57649
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  52808 non-null  object
 1   song    52808 non-null  object
 2   text    52808 non-null  object
dtypes: object(3)
memory usage: 1.6+ MB


Unnamed: 0,artist,song,text
count,52808,52808,52808
unique,643,44627,52776
top,Gordon Lightfoot,Hold On,You've made this a Christmas to remember \nSp...
freq,184,25,2


Above we see that we removed a significant chunk of duplicate songs, 4,842 to be exact. Good thing we used fuzzy matching between song text too, since there are plenty of instances where the same title is shared across multiple unrelated songs. One such example shown in the statistics above is the 25 songs titled 'Hold On'. Below you can see that they are clearly unique.

In [14]:
dataset[dataset['song'] == 'Hold On']

Unnamed: 0,artist,song,text
3225,Cliff Richard,Hold On,Love show me the way when I am lost and hope b...
5375,Eric Clapton,Hold On,Deep down in the dead of night I call out your...
7253,Green Day,Hold On,As I stepped to the edge \nBeyond the shadow ...
10895,Korn,Hold On,With aversion this distortion \nCame so swift...
15679,Pearl Jam,Hold On,I was doom riding on top a black horse \nWhat...
15830,Pet Shop Boys,Hold On,"Hold on, hold on, \nThere's got to be a futur..."
21612,Wishbone Ash,Hold On,"Checked in tonight, when I noticed the red lig..."
21820,Xscape,Hold On,I can make \nAny man do whatever it takes \n...
22005,Yes,Hold On,Justice to the left of you \nJustice to the r...
22057,Yngwie Malmsteen,Hold On,"Look at me, see the love that you're missing ..."


With that said, there is still clearly room for improvement. Above we see that there are still cases where multiple records share the exact same text. For example, the song "The Very Thought Of You" still shows up twice in our dataset with the exact same text for each. Turns out the two records have slightly different titles, with the other being "Very Thought Of You", which is why we did not try comparing or removing them. There are numerous examples of this behavior. To try and address it, below we wrote another function that removes any duplicates with the exact same text. Note this takes about 5 minutes to run.

In [15]:
# Function that removes duplicate records with the exact same text
def DedupeText(data):
    '''
    Function that removes duplicate records that share the same text value. We
    lowercase the text before comparing, as well as remove all non-alphanumerical
    characters to account for minor formatting differences.
    '''
    # Grab texts with duplicates
    # Note we lowercase and remove all non-alphanumeric characters from text
    # (raw string for the pattern, and regex=True so pandas treats it as a regular expression)
    text_freq = data['text'].str.lower().str.replace(r'[\W_]+', '', regex=True).value_counts()
    duplicate_text = text_freq[text_freq > 1]
    
    # Remove songs with duplicated text from dataset and add to list
    # Note we index the "data" argument rather than the global "dataset"
    duplicate_songs = []
    for i in range(len(duplicate_text)):
        text = duplicate_text.index[i]
        same_text = data['text'].str.lower().str.replace(r'[\W_]+', '', regex=True) == text
        duplicate_songs.append(data[same_text])
        data = data.drop(data.index[same_text])
        
    # Iterate over each set of duplicated titles grab first song and place back in dataset
    unique_sets = [data.copy()]
    for subset in duplicate_songs:
        unique_records = subset.iloc[0:1]
        unique_sets.append(unique_records)
    
    # Append unique songs back onto our dataset
    data = pd.concat(unique_sets)
    
    return data.sort_index()
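
The core idea of DedupeText can be shown without pandas: normalize each text, then keep only the first record seen for each normalized value. This is a minimal plain-Python sketch using the same `[\W_]+` pattern as the function above; the helper names and sample strings are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and strip every non-alphanumeric character."""
    return re.sub(r"[\W_]+", "", text.lower())

def first_occurrences(texts):
    """Return the indices of the first record for each distinct normalized text."""
    seen = set()
    kept = []
    for i, text in enumerate(texts):
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept

# Hypothetical sample strings: the first two differ only in case, punctuation, and line breaks
lyrics = [
    "The very thought of you \nAnd I forget to do",
    "the very thought of you, and I forget to do",
    "Five names that I can hardly \nStand to hear",
]
print(normalize(lyrics[0]))       # theverythoughtofyouandiforgettodo
print(first_occurrences(lyrics))  # [0, 2]
```

A set lookup per record makes this a single pass over the data, rather than one scan of the whole column per duplicated text.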

In [16]:
# Remove duplicate Text
dataset = DedupeText(dataset.copy())

# Display new dataset
dataset

Unnamed: 0,artist,song,text
0,ABBA,She's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...
...,...,...,...
57645,Ziggy Marley,Good Old Days,Irie days come on play \nLet the angels fly l...
57646,Ziggy Marley,Hand To Mouth,Power to the workers \nMore power \nPower to...
57647,Zwan,Come With Me,all you need \nis something i'll believe \nf...
57648,Zwan,Desire,northern star \nam i frightened \nwhere can ...


In [17]:
# Show statistics
dataset.info()
dataset.describe(include = 'all')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52731 entries, 0 to 57649
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  52731 non-null  object
 1   song    52731 non-null  object
 2   text    52731 non-null  object
dtypes: object(3)
memory usage: 1.6+ MB


Unnamed: 0,artist,song,text
count,52731,52731,52731
unique,643,44562,52731
top,Gordon Lightfoot,Hold On,"You walk into the room, everybody stares \nTh..."
freq,184,25,1


As shown above, we removed another 77 songs from our dataset and no longer have any records with duplicate text. This does not guarantee that all our remaining records are distinct, but it brings us most of the way there. There is always the possibility that two records are technically the same song but have slightly differently worded titles and slightly altered lyrics. However, these cases are unlikely and would take too much effort to remove on a case by case basis. We are happy with the current state of our dataset and will now move on to preparing the final records for analysis.

### 2) Normalizing Song Text

The neural network models we train will pick up and imitate the same patterns that occur in the text of our songs. As such, we want to make sure that the patterns the model should learn are very clear and the patterns it should not learn are removed or at least minimized. 

The types of patterns we don't want our model to learn are abnormalities and formatting shorthands that are present in a lot of the song text. For example, comments on the artist, song version, etc. at the start of the song, or text that indicates repeated verses in the lyrics, such as \[Chorus\] or \[Chorus:\]. 

The types of patterns we do want our model to learn are the breaks between verses, ends of lines, and the beginning/end of songs. These aren't clear to our models, so we will want to add clear markers to the text to indicate these occurrences. By doing so, we hope to aid our model in identifying, learning, and generating these patterns and structures.
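
As a rough sketch of what such markers could look like, the function below tags the start and end of a song, the end of each line, and verse breaks (blank lines). The token names `<start>`, `<end>`, `<eol>`, and `<verse>` are hypothetical placeholders chosen for illustration, not the markers this project settles on.

```python
def add_markers(text, start="<start>", end="<end>", eol="<eol>", verse="<verse>"):
    """Insert explicit structure tokens into raw lyrics (token names are hypothetical)."""
    tokens = [start]
    previous_blank = True  # suppresses a verse marker before the first lyric line
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # One or more consecutive blank lines count as a single verse break
            if not previous_blank:
                tokens.append(verse)
            previous_blank = True
            continue
        tokens.append(line + " " + eol)
        previous_blank = False
    if tokens[-1] == verse:
        tokens.pop()  # no verse break directly before the end of the song
    tokens.append(end)
    return " ".join(tokens)

sample = "Five names that I can hardly  \nStand to hear  \n  \nIncluding yours and mine"
print(add_markers(sample))
# <start> Five names that I can hardly <eol> Stand to hear <eol> <verse> Including yours and mine <eol> <end>
```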

Firstly, let's try to remove certain abnormalities that are present in our songs on a case by case basis. The types of issues we have noticed and need to address include the following:

    1) Verse Placeholders: such as \[Verse\] or \[Chorus\]
    2) Comments: additional information about the song within parentheses in the lyrics, e.g. backup vocals, who is singing, etc.
    3) Extraneous Characters: Double spaces, empty lines, unwanted characters
    4) Capitalization: so capitalized words aren't considered different, reduce everything to lowercase
    
#### Removing Verse Placeholders

The most important of these four issues is probably the verse placeholders. These appear in the song lyrics either to label the different sections of the song or to repeat a section without explicitly repeating the whole text. However, I find it unlikely that our model would pick up on these conventions, since they are very high level and used sporadically within our dataset. As such, it is probably best to remove them entirely and replace the placeholders with the text they are meant to represent. Then we can just hope and pray that our model understands that verses tend to repeat when it trains on thousands of songs with repeating verses. To help us understand what we need to accomplish, below are a few examples of these placeholders within song text.

In [18]:
my_song_1 = dataset[dataset['song']=='Bad Sneakers']
print('Artist: ' + my_song_1.iloc[0]['artist'])
print('Title: ' + my_song_1.iloc[0]['song'])
print(my_song_1.iloc[0]['text'])

my_song_2 = dataset[dataset['song']=='Just A Little Bit Of Your Heart']
print('Artist: ' + my_song_2.iloc[0]['artist'])
print('Title: ' + my_song_2.iloc[0]['song'])
print(my_song_2.iloc[0]['text'])

Artist: Steely Dan
Title: Bad Sneakers
Five names that I can hardly  
Stand to hear  
Including yours and mine  
And one more chimp who isn't here  
I can see the ladies talking  
How the times are getting hard  
And that fearsome excavation  
On Magnolia Boulevard  
  
[CHORUS:]  
And I'm going insane  
And I'm laughing at the frozen rain  
And I'm so alone  
Honey when they gonna send me home  
Bad sneakers and a Pina Colada  
My friend  
Stompin' on the avenue  
By Radio City with a  
Transistor and a large  
Sum of money to spend  
  
You fellah, you tearin' up the street  
You wear that white tuxedo  
How you gonna beat the heat  
Do you take me for a fool  
Do you think that I don't see  
That ditch out in the valley  
That they're digging just for me  
  
[CHORUS]


Artist: Ariana Grande
Title: Just A Little Bit Of Your Heart
[Verse 1]  
Oh, oh, oh, oh  
I don't ever ask you where you've been  
And I don't feel the need to  
Know who you're with  
I can't even think straight but

Notice that the first song is an example where the placeholders stand in for chunks of the song. The section below "\[Chorus:\]" defines what the chorus is, and the next use of "\[Chorus\]" is meant to repeat this chunk of lyrics. In the second song we see an example of these placeholders simply labeling the different structural components of the song, such as verses, choruses, and bridges. These will be a bit easier to handle since they don't replace any text, so we just need to remove the placeholders. Before we go pulling sections from our songs, let's try to identify what types of placeholders we have. Below we scan over our song texts and pull out these various placeholders.

In [19]:
import re

# define function to find the words/phrases in brackets within a string
def BracketFinder(text):
    return re.findall(r'\[(.*?)\]', text)

# initiate list of text in brackets
bracket_text = []

# iterate over the text for all songs
for index, record in dataset.iterrows():
    text = record['text']
    bracket_words = BracketFinder(text)
    if len(bracket_words) > 0:
        for word in bracket_words:
            bracket_text.append(word)
bracket_text = sorted(set(bracket_text))

# Display text found
print(len(bracket_text))
bracket_text

3278


['',
 ' ',
 ' -CHORUS-',
 ' 12str guitar ',
 ' 1989 version - Find your Brothers and sisters ',
 ' 1990 version - Find your sisters and brothers ',
 ' 5x ',
 ' CHORUS: Benzalino ',
 ' Chorus ',
 ' Chorus - "I\'ll see you coming..." ',
 ' Chorus - x2 ',
 ' Chorus x3 ',
 ' Chorus: ',
 ' Dolly Parton ',
 ' Fade ',
 ' Lil Wayne ',
 ' PREJUDICE ',
 ' Pa pa pa pa pa pa ',
 ' Paul McCartney Lyrics are found on www.songlyrics.com ',
 ' Repeat ',
 ' The Tragically Hip Lyrics are found on www.songlyrics.com ',
 ' VERSE 1: Yukmouth ',
 ' VERSE 2: Yukmouth ',
 ' VERSE 3: Yukmouth ',
 ' Vertical Horizon Lyrics are found on www.songlyrics.com ',
 ' ac. Guitar ',
 ' ac.guitar ',
 ' ac.guitar - steel ',
 ' ac.guitar - strings ',
 ' ad lib ',
 ' banjo ',
 ' banjo - guitar ',
 ' choir ',
 ' dobro ',
 ' dobro - sax ',
 ' dobro - steel ',
 ' duet with Mike Post ',
 ' end of extra lyrics from 12" remix ',
 ' extra lyrics from 1989 12" remix ',
 " feat. Lani misalucha recorded by regine velasquez for kris a

Above we see that there are over 3,000 different words/phrases found between brackets within our songs. Most of these are just additional song info that we will delete later, but some of them indicate the repeating patterns we need to replace. The key is going to be identifying which of these phrases to use. Below we wrote a function to find and print the text of songs with a few of these identifiers to help us sift through our possible choices.

In [20]:
# function that finds all songs with given phrase between brackets
def FindBracketSongs(data, bracket_phrase):
    records = []
    for i in range(len(data)):
        text = data.iloc[i]['text']
        if re.search(bracket_phrase, text):
            records.append(data.iloc[i:i+1])
    
    return pd.concat(records)
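
The row-by-row search above can also be expressed as a boolean mask. The following is a minimal sketch on a hypothetical three-row frame (`toy` and `find_bracket_songs` are illustrative names, not part of the notebook). One practical difference: it returns an empty frame when nothing matches, whereas `pd.concat` raises an error when given an empty list.

```python
import pandas as pd

def find_bracket_songs(data, bracket_phrase):
    """Return the rows of `data` whose 'text' column matches the regex `bracket_phrase`."""
    mask = data['text'].str.contains(bracket_phrase, regex=True)
    return data[mask]

# hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'artist': ['a', 'b', 'c'],
    'song': ['one', 'two', 'three'],
    'text': ['la la\n[Chorus X2]\n', 'no tags here', '[Verse 1]\nhello'],
})

matches = find_bracket_songs(toy, r'\[Chorus X2\]')
no_matches = find_bracket_songs(toy, r'\[Bridge\]')
print(matches['song'].tolist())
print(len(no_matches))
```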

In [21]:
# Example: find all occurrences of [Chorus X2]
bracket_phrase = r'\[Chorus X2\]'
text_samples = FindBracketSongs(dataset, bracket_phrase)
text_samples

Unnamed: 0,artist,song,text
16723,R. Kelly,Ghetto Religion,Had the landlord at my door \nI heard him say...
43082,Mark Ronson,In Case Of Fire,"[Verse 1] \nBurnin' up, the second Adderall ..."


In [22]:
# full text for specific song
print(text_samples.iloc[0]['text'])
text_samples.iloc[0]['text']

Had the landlord at my door  
I heard him saying,  
Tomorrow no more  
Pay me now or leave  
But we didn't have anything to give(ah can you feel me)  
  
Searching for restoration(and we need restoration)  
Make the church my family  
This is my story  
This is my song  
And I can sing it all night long I tell you why because,  
  
[Chorus]  
  
The ghetto is a part of my religion(the only thing my eyes can see)  
The only thing my eyes can see(and I tell you there ain't no man)  
There ain't no man gonna stop the vision(I'm a part of the ghetto)  
The ghetto is a part of me  
  
Children cry no more(children cry no more)  
Because heaven is upon you  
Please put down your guns  
And we shall overcome  
  
Thought your load may be heavy  
Know that the weight makes you strong  
Take my life for example  
While I sing my song  
Mr. Kelly help me sing this song  
  
[Chorus X2]  
  
La la la la la lala la..X3 (fade out)  
The ghetto is a part of me  
The ghetto is a part of my religion



"Had the landlord at my door  \nI heard him saying,  \nTomorrow no more  \nPay me now or leave  \nBut we didn't have anything to give(ah can you feel me)  \n  \nSearching for restoration(and we need restoration)  \nMake the church my family  \nThis is my story  \nThis is my song  \nAnd I can sing it all night long I tell you why because,  \n  \n[Chorus]  \n  \nThe ghetto is a part of my religion(the only thing my eyes can see)  \nThe only thing my eyes can see(and I tell you there ain't no man)  \nThere ain't no man gonna stop the vision(I'm a part of the ghetto)  \nThe ghetto is a part of me  \n  \nChildren cry no more(children cry no more)  \nBecause heaven is upon you  \nPlease put down your guns  \nAnd we shall overcome  \n  \nThought your load may be heavy  \nKnow that the weight makes you strong  \nTake my life for example  \nWhile I sing my song  \nMr. Kelly help me sing this song  \n  \n[Chorus X2]  \n  \nLa la la la la lala la..X3 (fade out)  \nThe ghetto is a part of me  \nTh

After looking at countless examples of songs with these phrases within the text, we've decided that it would be too much trouble to try to account for every specific type of occurrence. There is just no clear pattern between songs or even for specific phrases. As such, we've decided to just remove all records that have any of these phrases in their text. Thankfully, this is only about 20% of our dataset, which is significant, but our dataset is large enough that this shouldn't affect our model too much. The only downside is that the songs of certain artists use more of these formatting conventions than others and will be affected disproportionately.
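
How large that share is, and which artists carry most of it, can be checked before deleting anything. Below is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, so the printed numbers say nothing about the real dataset).

```python
import pandas as pd

# hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'artist': ['x', 'x', 'y', 'y', 'y'],
    'text': [
        'first verse\n[Chorus]\nsecond verse',
        'plain lyrics only',
        'more plain lyrics',
        '[Verse 1]\nhello',
        'nothing to see',
    ],
})

# True for every record that still contains a bracketed phrase
has_brackets = toy['text'].str.contains(r'\[.*?\]', regex=True)

share = has_brackets.mean()                 # fraction of affected records
kept = toy[~has_brackets]                   # records that would survive the filter
by_artist = has_brackets.astype(float).groupby(toy['artist']).mean()

print('{:.0%} of records affected, {} kept'.format(share, len(kept)))
print(by_artist)
```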

Before we delete all the songs with these phrases, we are going to do our best to fix the Steely Dan songs. This is the artist we want our model to mimic, so we need to maintain as many of the songs as possible. Thankfully, there are only a few instances of these bracketed phrases in the Steely Dan songs, primarily for indicating repeated choruses. Below we wrote some functions to identify these phrases in each song text, replace them with the actual chorus lyrics, and then remove the remaining phrases.

In [23]:
# function that replaces repeated choruses with the actual lyrics in our song text
def ReplaceChorus(data):
    # substrings we will look for
    start_phrases = ['[CHORUS:]  \n', '[Chorus:]  \n', 'Chorus:  \n', 'CHORUS:  \n',
                     '[CHORUS]  \n', '[Chorus]  \n', 'Chorus  \n', 'CHORUS  \n']
    end_phrases = ['  \n  \n', '  \n  \n', '  \n  \n', '  \n  \n',
                   '  \n  \n', '  \n  \n', '  \n  \n', '  \n  \n']
    repeat_phrases = ['[CHORUS]  \n  \n', '[Chorus]  \n  \n', 'Chorus  \n  \n', 'CHORUS  \n  \n',
                      '[CHORUS]  \n  \n', '[Chorus]  \n  \n', 'Chorus  \n  \n', 'CHORUS  \n  \n']
    end_chorus_phrases = ['[CHORUS]\n\n', '[Chorus]\n\n', 'Chorus\n\n', 'CHORUS\n\n',
                          '[CHORUS]\n\n', '[Chorus]\n\n', 'Chorus\n\n', 'CHORUS\n\n']
    
    text_list = []
    # Grab text for each song and store in list
    for index, record in data.iterrows():
        text_list.append(record['text'])
    
    # Iterate over each substring
    for i in range(len(start_phrases)):
        # Iterate over each song and replace chorus placeholders
        for j in range(len(text_list)):
            text = text_list[j]
            start_phrase = start_phrases[i]
            end_phrase = end_phrases[i]
            repeat_phrase = repeat_phrases[i]
            end_chorus_phrase = end_chorus_phrases[i]
            n = len(start_phrase)
            m = len(end_phrase)
            l = len(end_chorus_phrase)
            start = text.find(start_phrase) + n
            end = text[start:].find(end_phrase) + start + m
            if (start - n) != -1 and end - start - m != -1:
                chorus = text[start:end]
                text = text.replace(repeat_phrase, chorus)
                if text[-l:] == end_chorus_phrase:
                    text = text[:-l] + chorus[:-m] + '\n\n'
                text = text.replace(start_phrase, '')
            text_list[j] = text
    
    # Replace text column in dataset with new text
    data['text'] = text_list
    
    return data
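
To make the idea behind `ReplaceChorus` easier to follow, here is a simplified sketch that works on a single string. It is not equivalent to the function above: it assumes verses are separated by blank (or whitespace-only) lines, it only recognizes the bracketed tags `[Chorus]` and `[Chorus:]` (case-insensitive), and `expand_chorus` is a hypothetical name used only for illustration.

```python
import re

# a line consisting only of a chorus tag, e.g. "[Chorus]" or "[CHORUS:]"
TAG = re.compile(r'^\s*\[chorus:?\]\s*$', re.IGNORECASE)

def expand_chorus(text):
    """Replace bare chorus tags with the chorus lyrics defined earlier in the song."""
    blocks = re.split(r'\n[ \t]*\n', text.strip())   # split into verses
    chorus = None
    out = []
    for block in blocks:
        lines = [line.rstrip() for line in block.split('\n')]
        if TAG.match(lines[0]):
            if len(lines) > 1:
                # tag followed by lyrics: this block defines the chorus
                chorus = lines[1:]
                out.append('\n'.join(chorus))
            elif chorus is not None:
                # bare tag: repeat the stored chorus
                out.append('\n'.join(chorus))
            else:
                # bare tag before any definition: leave it untouched
                out.append(lines[0])
        else:
            out.append('\n'.join(lines))
    return '\n\n'.join(out)

song = ("Verse one  \n  \n[Chorus]  \nDon't lose that number  \n"
        "It's the only one you own  \n  \nVerse two  \n  \n[Chorus]")
expanded = expand_chorus(song)
print(expanded)
```

Songs whose tags this sketch cannot resolve keep their brackets, so they would still be caught by a later filter on bracketed text.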

In [24]:
# Run function on Steely Dan songs
steely_dan_songs = dataset[dataset['artist']=='Steely Dan']
steely_dan_songs = ReplaceChorus(steely_dan_songs.copy())

In [25]:
# Look at example before and after
before = dataset[dataset['song'] == "Rikki Don't Lose That Number"]
print('Before\n------')
print('Artist: ' + before.iloc[0]['artist'])
print('Title: ' + before.iloc[0]['song'])
print(before.iloc[0]['text'])
after = steely_dan_songs[steely_dan_songs['song'] == "Rikki Don't Lose That Number"]
print('After\n-----')
print('Artist: ' + after.iloc[0]['artist'])
print('Title: ' + after.iloc[0]['song'])
print(after.iloc[0]['text'])

Before
------
Artist: Steely Dan
Title: Rikki Don't Lose That Number
We hear you're leaving, that's OK  
I thought our little wild time had just begun  
I guess you kind of scared yourself, you turn and run  
But if you have a change of heart  
  
[Chorus]  
Rikki don't lose that number  
You don't want to call nobody else  
Send it off in a letter to yourself  
Rikki don't lose that number  
It's the only one you own  
You might use it if you feel better  
When you get home  
  
I have a friend in town, he's heard your name  
We can go out driving on Slow Hand Row  
We could stay inside and play games, I don't know  
And you could have a change of heart  
  
[Chorus]  
  
You tell yourself you're not my kind  
But you don't even know your mind  
And you could have a change of heart  
  
[Chorus]


After
-----
Artist: Steely Dan
Title: Rikki Don't Lose That Number
We hear you're leaving, that's OK  
I thought our little wild time had just begun  
I guess you kind of scared yourself, yo

The above example shows that our function properly removes the chorus indicators from our example song and replaces them with the actual lyrics. It turns out that our function is able to successfully accomplish this task for 33 of our 87 Steely Dan songs. Unfortunately, there are 4 other songs with much stranger formatting that I could not work around. As a result, this leaves us with only 83 of our 87 Steely Dan songs for training. Though we made the function specifically for the Steely Dan songs, we might as well run it on the whole dataset to save as many songs as we can. Many of the others follow the same chorus format, which is the most common of these issues, so we could save a significant number of records. If it doesn't work on some songs, then no matter; they will just be deleted in the following step.
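
Counts like these can be obtained by comparing the text column before and after the replacement. The sketch below uses made-up records (`before` and `after` are hypothetical frames, and the edit to `after` only imitates what a chorus replacement would do), so its numbers are not the notebook's.

```python
import pandas as pd

# hypothetical records: one fixable song, one clean song, one with an unhandled tag
before = pd.DataFrame({
    'song': ['s1', 's2', 's3'],
    'text': ['[Chorus]\nla\n\nverse\n\n[Chorus]', 'plain', '[Solo]\nla'],
})
after = before.copy()
after.loc[0, 'text'] = 'la\n\nverse\n\nla'   # pretend the chorus was expanded

# songs whose text was modified by the replacement step
changed = int((before['text'] != after['text']).sum())
# songs that still contain a bracketed phrase and would be dropped later
still_tagged = int(after['text'].str.contains(r'\[.*?\]', regex=True).sum())

print(changed, still_tagged)
```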

In [26]:
# running function for chorus on entire dataset
dataset = ReplaceChorus(dataset.copy())

Now that we have fixed what we reasonably can, let's delete all the remaining songs that have these structural placeholders within brackets, since these songs are potentially missing lyrics. After doing so we will print some statistics to measure the change to the dataset.

In [27]:
# keep only the songs without any bracketed phrase (no capture group needed for a yes/no test)
dataset = dataset[~dataset['text'].str.contains(r'\[.*?\]', regex=True)]



In [28]:
# Show statistics
dataset.info()
dataset.describe(include = 'all')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47973 entries, 0 to 57649
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  47973 non-null  object
 1   song    47973 non-null  object
 2   text    47973 non-null  object
dtypes: object(3)
memory usage: 1.5+ MB


Unnamed: 0,artist,song,text
count,47973,47973,47973
unique,639,40998,47973
top,Gordon Lightfoot,Angel,"Motorcars, handlebars, bicycles for two \nBro..."
freq,183,21,1


#### Removing Additional Comments

Now that we have addressed the verse placeholders as best we could, let's move on to the next issue with our song lyrics: the extraneous information about the song included in the lyrics. Typically this information is listed in the song between parentheses, and from what I can tell it is most often used to indicate backup vocals, the artist performing the song, and extraneous sounds in the music. None of these are very useful for the model we want to train, so let's try to remove as much as we can. To get an idea of what we are trying to tackle, below are two examples of these types of comments.

In [29]:
example_1 = dataset.loc[1]
example_2 = dataset.loc[15962]

print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])
print('Artist: ' + example_2['artist'])
print('Title: ' + example_2['song'])
print(example_2['text'])

Artist: ABBA
Title: Andante, Andante
Take it easy with me, please  
Touch me gently like a summer evening breeze  
Take your time, make it slow  
Andante, Andante  
Just let the feeling grow  
  
Make your fingers soft and light  
Let your body be the velvet of the night  
Touch my soul, you know how  
Andante, Andante  
Go slowly with me now  
  
I'm your music  
(I am your music and I am your song)  
I'm your song  
(I am your music and I am your song)  
Play me time and time again and make me strong  
(Play me again 'cause you're making me strong)  
Make me sing, make me sound  
(You make me sing and you make me)  
Andante, Andante  
Tread lightly on my ground  
Andante, Andante  
Oh please don't let me down  
  
There's a shimmer in your eyes  
Like the feeling of a thousand butterflies  
Please don't talk, go on, play  
Andante, Andante  
And let me float away  
  
I'm your music  
(I am your music and I am your song)  
I'm your song  
(I am your music and I am your song)  
Play m

Unlike the bracketed verse placeholders, these comments within parentheses are not detrimental to the song, so we are going to just remove them without affecting the rest of the lyrics. Below we do just that and display one of the songs from above to show the difference.

In [30]:
# remove comments/lyrics in parentheses and brackets
# (re.DOTALL lets a match span multiple lines, since some of these do)
paren_pattern = re.compile(r'\(.*?\)', re.DOTALL)
bracket_pattern = re.compile(r'\[.*?\]', re.DOTALL)

new_text = []
for indx, record in dataset.iterrows():
    text = record['text']
    text = paren_pattern.sub('', text)
    text = bracket_pattern.sub('', text)
    new_text.append(text)

dataset['text'] = new_text

# show example
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: ABBA
Title: Andante, Andante
Take it easy with me, please  
Touch me gently like a summer evening breeze  
Take your time, make it slow  
Andante, Andante  
Just let the feeling grow  
  
Make your fingers soft and light  
Let your body be the velvet of the night  
Touch my soul, you know how  
Andante, Andante  
Go slowly with me now  
  
I'm your music  
  
I'm your song  
  
Play me time and time again and make me strong  
  
Make me sing, make me sound  
  
Andante, Andante  
Tread lightly on my ground  
Andante, Andante  
Oh please don't let me down  
  
There's a shimmer in your eyes  
Like the feeling of a thousand butterflies  
Please don't talk, go on, play  
Andante, Andante  
And let me float away  
  
I'm your music  
  
I'm your song  
  
Play me time and time again and make me strong  
  
Make me sing, make me sound  
  
Andante, Andante  
Tread lightly on my ground  
Andante, Andante  
Oh please don't let me down  
  
Make me sing, make me sound  
  
Andante, And
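
The removal step above can be sketched on a single string. This is a minimal illustration with a made-up sample (`strip_comments` and `sample` are hypothetical names); it shows both the multi-line case that `re.DOTALL` handles and the leftover whitespace that the removal creates.

```python
import re

PAREN = re.compile(r'\(.*?\)', re.DOTALL)      # DOTALL: '.' also matches '\n'
BRACKET = re.compile(r'\[.*?\]', re.DOTALL)

def strip_comments(text):
    """Delete everything between parentheses or brackets, including the delimiters."""
    text = PAREN.sub('', text)
    return BRACKET.sub('', text)

sample = ("I'm your music  \n(I am your music\nand I am your song)  \n"
          "I'm your song (oh yeah)  \n")
cleaned = strip_comments(sample)
print(repr(cleaned))
```

Note that the non-greedy pattern stops at the first closing delimiter, so nested parentheses would leave a fragment behind. The empty line and extra spaces left in `cleaned` are the kind of residue a later whitespace cleanup has to deal with.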

#### Removing Extra White Space, Empty Lines, Unwanted Characters, Etc.

Looking at the above ABBA song, you can see that there are a lot of cases where we have meaningless white space and empty lines. Some of the whitespace served to break up verses, some is meaningless, and some we accidentally created when removing other unwanted text. Ideally, we would leave in the breaks between verses in the hope that our model could learn this structural pattern. However, with this being our first language model, we are going for simplicity and instead elect to remove the verse breaks, so the only meaningful structure in each song is a line. Below we remove instances of extended whitespace and remove these extra lines. Afterwards we display the same song as above to show the difference.

In [31]:
# remove extra whitespace and empty lines
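# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
dataset['text'] = dataset['text'].str.replace(' +', ' ', regex=True)\
                                 .str.replace(' \n', '\n', regex=True)\
                                 .str.replace('\n+', '\n', regex=True)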

# remove empty line at front of songs
new_text = []
for index, record in dataset.iterrows():
    text = record['text']
    if text.startswith('\n'):  # also safe if the text ended up empty
        text = text[1:]
    new_text.append(text)
    
# replace text
dataset['text'] = new_text
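
The same sequence of replacements can be sketched on a single string, which makes the effect of each step easier to see. This is an illustration only; `normalize_whitespace` and `sample` are hypothetical names.

```python
import re

def normalize_whitespace(text):
    """Collapse repeated spaces and empty lines, mirroring the steps above."""
    text = re.sub(r' +', ' ', text)      # collapse runs of spaces into one
    text = re.sub(r' \n', '\n', text)    # drop the space left before a line break
    text = re.sub(r'\n+', '\n', text)    # collapse empty lines
    return text.lstrip('\n')             # drop an empty line at the front

sample = "\nI'm your music  \n  \nI'm your song   \n"
normalized = normalize_whitespace(sample)
print(repr(normalized))
```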

In [32]:
# show example
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: ABBA
Title: Andante, Andante
Take it easy with me, please
Touch me gently like a summer evening breeze
Take your time, make it slow
Andante, Andante
Just let the feeling grow
Make your fingers soft and light
Let your body be the velvet of the night
Touch my soul, you know how
Andante, Andante
Go slowly with me now
I'm your music
I'm your song
Play me time and time again and make me strong
Make me sing, make me sound
Andante, Andante
Tread lightly on my ground
Andante, Andante
Oh please don't let me down
There's a shimmer in your eyes
Like the feeling of a thousand butterflies
Please don't talk, go on, play
Andante, Andante
And let me float away
I'm your music
I'm your song
Play me time and time again and make me strong
Make me sing, make me sound
Andante, Andante
Tread lightly on my ground
Andante, Andante
Oh please don't let me down
Make me sing, make me sound
Andante, Andante
Tread lightly on my ground
Andante, Andante
Oh please don't let me down
Andante, Andante
Oh please do

Above we see that we properly removed the extra whitespace from our lyrics and now they have a much simpler and uniform structure. While we are at it, let's go ahead and remove some more unwanted characters from our lyrics. Examples of this include $, \%, _, etc. None of these should hold any meaning, so we might as well get rid of them. Below we print all the unwanted characters and then remove them from our lyrics.

In [33]:
import string

# keep quotes, sentence punctuation, and hyphens; everything else is unwanted
unwanted_char = string.punctuation.replace('"','').replace("'",'').replace('.','').replace('?','').replace('!','')\
                .replace(',','').replace(':','').replace(';','').replace('-','')

print(unwanted_char)

#$%&()*+/<=>@[\]^_`{|}~


In [34]:
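# escape the characters so that '[', '\', ']' and '^' are matched literally inside the character class
pattern = '[' + re.escape(unwanted_char) + ']'

# remove unwanted characters
dataset['text'] = dataset['text'].str.replace(pattern, '', regex=True)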

You may notice above that we did not remove certain punctuation from our text. This will become apparent later, but in a nutshell we are going to treat these special characters as special words within each line of each song, since commas, periods, question marks, etc. all carry special meaning to them.
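
As a small check of which characters are dropped and which are kept, the sketch below builds the same kind of character class on a made-up string. `KEEP`, `sample`, and the use of `re.escape` are choices made for this illustration rather than the notebook's exact code.

```python
import re
import string

# punctuation that is kept because it carries meaning in the lyrics
KEEP = '"\'.?!,:;-'
unwanted = ''.join(ch for ch in string.punctuation if ch not in KEEP)

# re.escape makes every character literal inside the [...] class
pattern = re.compile('[' + re.escape(unwanted) + ']')

sample = 'wait_ 100% sure? [yes] -- "ok" \\o/'
cleaned = pattern.sub('', sample)
print(unwanted)
print(cleaned)
```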

#### Reduce Lyrics to Lowercase

The last abnormality we are going to remove from our song lyrics is capitalization. Ideally, we would like our model to understand the meaning of capitalized words and their use, but once again we are going for simplicity. For the most part, our capitalized words are just the first word of each line. This is really just a formatting convention, and we could understand the lyrics just as well if the words were lowercase. So instead of hoping that our model catches onto this pattern, we will just make everything lowercase. This also has the added benefit of reducing the vocabulary space for our model, since it no longer has to differentiate between a word and its capitalized version. The only downside is that it might affect proper nouns and our model's use of them. We think these are too few and far between to make a significant impact, though we may have to address this if it becomes an issue.
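
The vocabulary effect described above can be illustrated with a few made-up lines (the `lines` list is hypothetical and far too small to say anything about the real dataset).

```python
# hypothetical lyric lines where capitalization only marks the start of a line
lines = ['Take it easy with me', 'take your time', 'Make it slow', 'make me sing']

# vocabulary with and without case folding
vocab_cased = {word for line in lines for word in line.split()}
vocab_lower = {word.lower() for line in lines for word in line.split()}

print(len(vocab_cased), len(vocab_lower))
```

Here "Take"/"take" and "Make"/"make" collapse into single entries once the text is lowercased.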

Below we make the artist, song title, and song lyrics all lowercase.

In [35]:
# reduce text in each column of dataset to lowercase
dataset['artist'] = dataset['artist'].str.lower()
dataset['song'] = dataset['song'].str.lower()
dataset['text'] = dataset['text'].str.lower()

In [36]:
dataset

Unnamed: 0,artist,song,text
0,abba,she's my kind of girl,"look at her face, it's a wonderful face\nand i..."
1,abba,"andante, andante","take it easy with me, please\ntouch me gently ..."
2,abba,as good as new,i'll never know why i had to go\nwhy i had to ...
3,abba,bang,making somebody happy is a question of give an...
4,abba,bang-a-boomerang,making somebody happy is a question of give an...
...,...,...,...
57644,ziggy marley,generation,many generation have passed away\nfighting for...
57645,ziggy marley,good old days,irie days come on play\nlet the angels fly let...
57647,zwan,come with me,all you need\nis something i'll believe\nflash...
57648,zwan,desire,northern star\nam i frightened\nwhere can i go...


Now that we have removed all the unwanted qualities, characters, and structures in our song lyrics, we will try adding a few to make training our model more effective. Below are some of the changes and additions we are going to make:

    1) Punctuation: Make punctuation characters their own words
    2) Line Indicators: Add special characters to indicate start and end of each lyrical line
    3) Artist and Song Title: Add artist name and song title to front of each song text, wrapped with indicators
    4) Song End and Start: Add indicators for the start and end of each song
    
By adding these patterns and indicators into our song lyrics we hope to make it easier for our model to learn what a line is, how to start and end a song, specific patterns for certain artists, and relationships between the song title and the lyrics of the song.

#### Punctuation

First, we are going to separate punctuation within our song lyrics so that any of the following characters {",:;.!?} are treated as their own word. Our reasoning behind this is twofold. Firstly, each of these characters carries its own special meaning separate from the word it is attached to. In fact, the fact that they are attached to a word is just a formatting convention. Because of this, it makes sense to treat them as separate elements that our model needs to learn how to use properly. Secondly, separating these characters from the words they are attached to significantly reduces our vocabulary space. If every possible combination of word and punctuation existed, then our vocabulary would be roughly 8 times as large (each word on its own plus one variant per punctuation character).

Below we wrote a function that pads any occurrence of these characters with spaces on each side. It then removes all occurrences of multiple spaces in case we created any.

In [37]:
# define function that pads characters {",:;.!?} with spaces so they are treated like separate words
def CharPadder(data, characters):
    # Iterate over characters and pad
    for c in characters:
        # regex=False: characters such as '.' and '?' must be matched literally
        data['text'] = data['text'].str.replace(c, ' '+c+' ', regex=False)
    # remove occurrences of multiple spaces
    data['text'] = data['text'].str.replace(' +', ' ', regex=True)
    
    return data
    
# define characters to pad
characters = '",:;.!?'

# pad characters in dataset
dataset = CharPadder(dataset.copy(), characters)
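
For a single string, the same padding can be sketched with one regular expression. This is an illustration under the assumption that a capture group and a backreference are acceptable here; `pad_punctuation` and `sample` are hypothetical names.

```python
import re

PUNCT = '",:;.!?'
# one character class, captured so it can be re-inserted with spaces around it
PAD = re.compile('([' + re.escape(PUNCT) + '])')

def pad_punctuation(text):
    """Surround each punctuation character with spaces, then collapse repeated spaces."""
    text = PAD.sub(r' \1 ', text)
    return re.sub(r' +', ' ', text)

sample = 'take it easy with me, please\nwhy? why not!'
padded = pad_punctuation(sample)
print(padded.split())
```

After padding, splitting on whitespace yields the punctuation marks as separate tokens, which is what lets the model treat them as words.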

#### Adding Line Indicators

Next, let's add the line indicators to the beginning and end of each line in our text. It doesn't really matter what we use for the indicators so long as they don't already exist in our dataset. They should probably be more descriptive, but I'm not that creative and want to make sure they stand out, so we are going to use "xxx110" and "xxx111" for the start and end of each line respectively. These look pretty dumb, but they follow the same pattern as the indicators that will be used for the song, title, and artist. Below we check that these indicators, as well as the ones we will use later, don't already exist in the dataset.

In [39]:
# grabbing unique words from artist, title, and text for each record
unique_words = set()
for index, record in dataset.iterrows():
    # split() with no argument splits on any whitespace, including '\n'
    artist_words = record['artist'].split()
    title_words = record['song'].split()
    text_words = record['text'].split()
    unique_words.update(artist_words, title_words, text_words)

# Indicators we will be using for start/end of song, artist, title, line
indicators_dict = {'start_song': 'xxx000', 'start_artist': 'xxx010', 'start_title': 'xxx100', 'start_line': 'xxx110', 
                    'end_song': 'xxx001',   'end_artist': 'xxx011',   'end_title': 'xxx101',   'end_line': 'xxx111'}

passes = 0
for key in indicators_dict:
    indicator = indicators_dict[key]
    if indicator in unique_words:
        print('Fail: {} indicator, "{}", found in dataset vocabulary.'.format(key, indicator))
    else:
        print('Pass: {} indicator, "{}", distinct from dataset vocabulary.'.format(key, indicator))
        passes += 1
        
if passes == len(indicators_dict):
    print("Success!!! All indicators are unique and valid.")
else:
    print("NOOOOO!!! Some of the indicators are not valid and must be replaced.")        

Pass: start_song indicator, "xxx000", distinct from dataset vocabulary.
Pass: start_artist indicator, "xxx010", distinct from dataset vocabulary.
Pass: start_title indicator, "xxx100", distinct from dataset vocabulary.
Pass: start_line indicator, "xxx110", distinct from dataset vocabulary.
Pass: end_song indicator, "xxx001", distinct from dataset vocabulary.
Pass: end_artist indicator, "xxx011", distinct from dataset vocabulary.
Pass: end_title indicator, "xxx101", distinct from dataset vocabulary.
Pass: end_line indicator, "xxx111", distinct from dataset vocabulary.
Success!!! All indicators are unique and valid.
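
One way to read the naming pattern of these indicators is as a three-bit code: the first two bits select the element (song, artist, title, line) and the last bit marks start (0) or end (1). The sketch below rebuilds the dictionary from that reading and checks it against a vocabulary. The interpretation, `find_collisions`, and the tiny example vocabulary are assumptions made for illustration.

```python
# order matters: position 0..3 becomes the two leading bits 00, 01, 10, 11
entities = ['song', 'artist', 'title', 'line']

indicators = {}
for i, entity in enumerate(entities):
    prefix = 'xxx' + format(i, '02b')
    indicators['start_' + entity] = prefix + '0'
    indicators['end_' + entity] = prefix + '1'

def find_collisions(indicators, vocabulary):
    """Return the indicator strings that already occur in the vocabulary."""
    return sorted(set(indicators.values()) & set(vocabulary))

print(indicators)
print(find_collisions(indicators, {'hello', 'world'}))
```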


Now that we know our indicators are unique, let's make a function that adds the line indicators to our song text. This will be easy since the character '\n' indicates the end of each line and the start of the next. We probably should remove the new line character, '\n', during the replacement, but for now we will leave them in to make it easier to look at the results.

In [40]:
# function that adds line indicators to song text
def IndicateLines(data, start, end):
    # initiate list for new text
    new_text = []
    for index, record in data.iterrows():
        text = record['text']
        # strip leading and trailing whitespace
        text = text.lstrip().rstrip()
        # add indicators at each new line
        pattern = ' '+end+'\n'+start+' '
        text = text.replace('\n', pattern)
        # add indicators to beginning and end of song
        text = start+' '+text
        text = text+' '+end
        # append to list
        new_text.append(text)
    
    # replace txt
    data['text'] = new_text
    
    # remove occurrences of multiple spaces
    data['text'] = data['text'].str.replace(' +', ' ', regex=True)
    
    return data

# run function on dataset
start = indicators_dict['start_line']
end = indicators_dict['end_line']
dataset = IndicateLines(dataset.copy(), start, end)

# Display example
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: abba
Title: andante, andante
xxx110 take it easy with me , please xxx111
xxx110 touch me gently like a summer evening breeze xxx111
xxx110 take your time , make it slow xxx111
xxx110 andante , andante xxx111
xxx110 just let the feeling grow xxx111
xxx110 make your fingers soft and light xxx111
xxx110 let your body be the velvet of the night xxx111
xxx110 touch my soul , you know how xxx111
xxx110 andante , andante xxx111
xxx110 go slowly with me now xxx111
xxx110 i'm your music xxx111
xxx110 i'm your song xxx111
xxx110 play me time and time again and make me strong xxx111
xxx110 make me sing , make me sound xxx111
xxx110 andante , andante xxx111
xxx110 tread lightly on my ground xxx111
xxx110 andante , andante xxx111
xxx110 oh please don't let me down xxx111
xxx110 there's a shimmer in your eyes xxx111
xxx110 like the feeling of a thousand butterflies xxx111
xxx110 please don't talk , go on , play xxx111
xxx110 andante , andante xxx111
xxx110 and let me float away xxx111
xxx110
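
Because each line is wrapped in a fixed pair of tokens, the markup can also be stripped again, which may be useful when turning generated text back into readable lyrics. The round trip below is a minimal sketch on a single string; `wrap_lines` and `unwrap_lines` are hypothetical helpers, not functions from this notebook.

```python
START, END = 'xxx110', 'xxx111'

def wrap_lines(text, start=START, end=END):
    """Wrap every line of `text` in start/end indicators."""
    lines = [line.strip() for line in text.strip().split('\n')]
    return '\n'.join('{} {} {}'.format(start, line, end) for line in lines)

def unwrap_lines(text, start=START, end=END):
    """Remove the indicators again, leaving plain lines."""
    lines = []
    for line in text.split('\n'):
        tokens = line.split()
        if tokens and tokens[0] == start and tokens[-1] == end:
            tokens = tokens[1:-1]
        lines.append(' '.join(tokens))
    return '\n'.join(lines)

sample = 'take it easy with me , please\ntouch me gently'
wrapped = wrap_lines(sample)
print(wrapped)
print(unwrap_lines(wrapped) == sample)
```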

#### Adding Artist and Song Title

Next we are going to add the name of the artist and the title of the song to the front of each song's lyrics. Each will be wrapped in indicators that specify when the artist/title begins and ends. We hope that by adding these two features into the lyrics our models can learn to identify certain lyrical themes and patterns for each artist and learn relationships between the name of the song and the lyrics of the song. For instance, typically the title of the song appears within its lyrics, usually repeated multiple times in the chorus. This is the type of pattern we would like our model to pick up on.
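
As a minimal sketch of the idea for a single song, the header can be built with string formatting and read back with a regular expression. The indicator strings are the ones defined earlier in this notebook; `add_header` and `split_header` are hypothetical helper names used only for this illustration.

```python
import re

# artist line followed by title line, each wrapped in its own indicators
HEADER = re.compile(r'^xxx010 (.*?) xxx011\nxxx100 (.*?) xxx101\n')

def add_header(artist, title, text):
    """Prepend the wrapped artist and title lines to the lyrics."""
    return 'xxx010 {} xxx011\nxxx100 {} xxx101\n{}'.format(
        artist.strip(), title.strip(), text.strip())

def split_header(text):
    """Return (artist, title, lyrics); artist and title are None if no header is found."""
    match = HEADER.match(text)
    if match is None:
        return None, None, text
    return match.group(1), match.group(2), text[match.end():]

tagged = add_header('abba', 'andante, andante', 'xxx110 take it easy xxx111')
print(tagged)
print(split_header(tagged))
```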

Below we wrote a function that adds both the artist and the title of the song to the front of the lyrics, in that order. We then apply it to the dataset and display an example of the results.

In [41]:
# function that adds the artist and song title to the front of the song text
def AddArtistTitle(data, artist_start, artist_end, title_start, title_end):
    # initiate list for new text
    new_text = []
    for index, record in data.iterrows():
        text = record['text'].strip()
        artist = record['artist'].strip()
        title = record['song'].strip()
        # add title
        title_line = title_start+' '+title+' '+title_end+'\n'
        text = title_line + text
        # add artist (prepended last, so it ends up on the first line)
        artist_line = artist_start+' '+artist+' '+artist_end+'\n'
        text = artist_line + text
        # append to list
        new_text.append(text)
    
    # replace text
    data['text'] = new_text
    
    # remove occurrences of multiple spaces (regex=True is required in recent pandas versions)
    data['text'] = data['text'].str.replace(' +', ' ', regex=True)
    
    return data

# run function on dataset
artist_start = indicators_dict['start_artist']
artist_end = indicators_dict['end_artist']
title_start = indicators_dict['start_title']
title_end = indicators_dict['end_title']
dataset = AddArtistTitle(dataset.copy(), artist_start, artist_end, title_start, title_end)

# Display example
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: abba
Title: andante, andante
xxx010 abba xxx011
xxx100 andante, andante xxx101
xxx110 take it easy with me , please xxx111
xxx110 touch me gently like a summer evening breeze xxx111
xxx110 take your time , make it slow xxx111
xxx110 andante , andante xxx111
xxx110 just let the feeling grow xxx111
xxx110 make your fingers soft and light xxx111
xxx110 let your body be the velvet of the night xxx111
xxx110 touch my soul , you know how xxx111
xxx110 andante , andante xxx111
xxx110 go slowly with me now xxx111
xxx110 i'm your music xxx111
xxx110 i'm your song xxx111
xxx110 play me time and time again and make me strong xxx111
xxx110 make me sing , make me sound xxx111
xxx110 andante , andante xxx111
xxx110 tread lightly on my ground xxx111
xxx110 andante , andante xxx111
xxx110 oh please don't let me down xxx111
xxx110 there's a shimmer in your eyes xxx111
xxx110 like the feeling of a thousand butterflies xxx111
xxx110 please don't talk , go on , play xxx111
xxx110 andante , andante
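
The row-by-row loop above can also be expressed with vectorized pandas string operations. The following is a minimal sketch rather than the notebook's implementation: the function name `add_artist_title_vectorized` and the one-row toy dataframe are illustrative assumptions, and the indicator tokens are the ones visible in the example output.

```python
import pandas as pd

def add_artist_title_vectorized(data, artist_start, artist_end, title_start, title_end):
    # Hypothetical helper: builds the artist and title lines for all rows at once
    data = data.copy()
    artist_line = artist_start + ' ' + data['artist'].str.strip() + ' ' + artist_end
    title_line = title_start + ' ' + data['song'].str.strip() + ' ' + title_end
    # artist first, then title, then the existing lyrics
    data['text'] = artist_line + '\n' + title_line + '\n' + data['text'].str.strip()
    # collapse runs of spaces into a single space
    data['text'] = data['text'].str.replace(' +', ' ', regex=True)
    return data

# Toy record for illustration only
toy = pd.DataFrame({'artist': ['abba'],
                    'song': ['andante, andante'],
                    'text': ['xxx110 take it easy with me , please xxx111']})
result = add_artist_title_vectorized(toy, 'xxx010', 'xxx011', 'xxx100', 'xxx101')
print(result.loc[0, 'text'])
```

On the toy record this prints the artist line, the title line, and the lyric line in that order, matching the layout of the example above.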

#### Adding Start/End Song Indicators

Lastly, we are going to add indicators which specify the beginning and end of each song's lyrics. One might think this isn't necessary since all our songs are separated. However, depending on the type of model we are training, we may combine the songs into a single string from which we pull segments for training examples. Furthermore, making the start and end of our songs well defined should help our model learn how to indicate the start/end of a song. This will be especially useful during song generation.

Below we wrote a function to add the indicators to the start and end of each song. Then we run the function on our dataset and display an example of the results.

In [42]:
# function that adds start/end song indicators to song text
def IndicateSong(data, start, end):
    # initiate list for new text
    new_text = []
    for index, record in data.iterrows():
        text = record['text'].lstrip().rstrip()
        # add start
        text = start + '\n' + text
        # add end
        text = text + '\n' + end
        # append to list
        new_text.append(text)
    
    # replace txt
    data['text'] = new_text
    
    # remove occurrences of multiple spaces (regex=True is required in recent pandas versions)
    data['text'] = data['text'].str.replace(' +', ' ', regex=True)
    
    return data

# run function on dataset
start = indicators_dict['start_song']
end = indicators_dict['end_song']
dataset = IndicateSong(dataset.copy(), start, end)

# Display example
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: abba
Title: andante, andante
xxx000
xxx010 abba xxx011
xxx100 andante, andante xxx101
xxx110 take it easy with me , please xxx111
xxx110 touch me gently like a summer evening breeze xxx111
xxx110 take your time , make it slow xxx111
xxx110 andante , andante xxx111
xxx110 just let the feeling grow xxx111
xxx110 make your fingers soft and light xxx111
xxx110 let your body be the velvet of the night xxx111
xxx110 touch my soul , you know how xxx111
xxx110 andante , andante xxx111
xxx110 go slowly with me now xxx111
xxx110 i'm your music xxx111
xxx110 i'm your song xxx111
xxx110 play me time and time again and make me strong xxx111
xxx110 make me sing , make me sound xxx111
xxx110 andante , andante xxx111
xxx110 tread lightly on my ground xxx111
xxx110 andante , andante xxx111
xxx110 oh please don't let me down xxx111
xxx110 there's a shimmer in your eyes xxx111
xxx110 like the feeling of a thousand butterflies xxx111
xxx110 please don't talk , go on , play xxx111
xxx110 andante , 
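
To illustrate why these markers matter once songs are concatenated, here is a small sketch that recovers individual songs from a combined token stream. It is an illustration under assumptions: `split_songs` is a hypothetical helper, and the end-of-song token value is not visible in the truncated output above, so `xxx001` is assumed here.

```python
def split_songs(tokens, start_token, end_token):
    # Walk the token stream and collect the tokens between each start/end pair
    songs, current = [], None
    for tok in tokens:
        if tok == start_token:
            current = []            # a new song begins
        elif tok == end_token and current is not None:
            songs.append(current)   # the current song is complete
            current = None
        elif current is not None:
            current.append(tok)     # token belongs to the current song
    return songs

START, END = 'xxx000', 'xxx001'  # END is an assumed value, for illustration only
stream = 'xxx000 a b xxx001 xxx000 c xxx001'.split(' ')
songs = split_songs(stream, START, END)
print(songs)  # [['a', 'b'], ['c']]
```

Without the markers, the boundary between `b` and `c` would be lost as soon as the songs were joined.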

Now that we have the text formatted the way we want, we can remove the newline characters. We only kept them up to this point to make it visually easier to inspect the results of our transformations. Replacing them with spaces turns each song's lyrics into a single sequence of words.

In [43]:
# replacing newline characters with spaces
dataset['text'] = dataset['text'].str.replace('\n', ' ')

# display example result
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: abba
Title: andante, andante
xxx000 xxx010 abba xxx011 xxx100 andante, andante xxx101 xxx110 take it easy with me , please xxx111 xxx110 touch me gently like a summer evening breeze xxx111 xxx110 take your time , make it slow xxx111 xxx110 andante , andante xxx111 xxx110 just let the feeling grow xxx111 xxx110 make your fingers soft and light xxx111 xxx110 let your body be the velvet of the night xxx111 xxx110 touch my soul , you know how xxx111 xxx110 andante , andante xxx111 xxx110 go slowly with me now xxx111 xxx110 i'm your music xxx111 xxx110 i'm your song xxx111 xxx110 play me time and time again and make me strong xxx111 xxx110 make me sing , make me sound xxx111 xxx110 andante , andante xxx111 xxx110 tread lightly on my ground xxx111 xxx110 andante , andante xxx111 xxx110 oh please don't let me down xxx111 xxx110 there's a shimmer in your eyes xxx111 xxx110 like the feeling of a thousand butterflies xxx111 xxx110 please don't talk , go on , play xxx111 xxx110 andante , 

Looking at the above example, we can see that special characters in the artist and title were neither removed nor padded (note the unpadded comma in the title). Since we prepended the artist name and song title to the lyrics, these characters are now part of the text. Let's remedy this by removing the unwanted characters and padding our special characters one more time. After doing so we display an example of the results.

In [44]:
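import re
import string

# define characters to remove
unwanted_char = string.punctuation.replace('"','').replace("'",'').replace('.','').replace('?','').replace('!','')\
                .replace(',','').replace(':','').replace(';','').replace('-','')
# escape the characters so that ], \ and ^ are treated literally inside the character class
pattern = '[' + re.escape(unwanted_char) + ']'

# remove unwanted characters (regex=True is required in recent pandas versions)
dataset['text'] = dataset['text'].str.replace(pattern, '', regex=True)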

# define characters to pad
characters = '",:;.!?'

# pad characters in dataset
dataset = CharPadder(dataset.copy(), characters)

# display example result
example_1 = dataset.loc[1]
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print(example_1['text'])

Artist: abba
Title: andante, andante
xxx000 xxx010 abba xxx011 xxx100 andante , andante xxx101 xxx110 take it easy with me , please xxx111 xxx110 touch me gently like a summer evening breeze xxx111 xxx110 take your time , make it slow xxx111 xxx110 andante , andante xxx111 xxx110 just let the feeling grow xxx111 xxx110 make your fingers soft and light xxx111 xxx110 let your body be the velvet of the night xxx111 xxx110 touch my soul , you know how xxx111 xxx110 andante , andante xxx111 xxx110 go slowly with me now xxx111 xxx110 i'm your music xxx111 xxx110 i'm your song xxx111 xxx110 play me time and time again and make me strong xxx111 xxx110 make me sing , make me sound xxx111 xxx110 andante , andante xxx111 xxx110 tread lightly on my ground xxx111 xxx110 andante , andante xxx111 xxx110 oh please don't let me down xxx111 xxx110 there's a shimmer in your eyes xxx111 xxx110 like the feeling of a thousand butterflies xxx111 xxx110 please don't talk , go on , play xxx111 xxx110 andante ,
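
The remove-then-pad step can be tried on a single string. This is a minimal sketch, assuming the padding helper simply surrounds each listed character with spaces; `clean_and_pad` is a hypothetical stand-in and not the `CharPadder` function used above.

```python
import re
import string

# Characters we keep; everything else in string.punctuation is removed
keep = '"\'.?!,:;-'
unwanted = ''.join(c for c in string.punctuation if c not in keep)
remove_pattern = '[' + re.escape(unwanted) + ']'
# Characters to pad with spaces so they become separate tokens
pad_pattern = '([' + re.escape('",:;.!?') + '])'

def clean_and_pad(text):
    text = re.sub(remove_pattern, '', text)      # drop unwanted characters
    text = re.sub(pad_pattern, r' \1 ', text)    # surround kept punctuation with spaces
    return re.sub(' +', ' ', text).strip()       # collapse repeated spaces

cleaned = clean_and_pad('xxx100 andante, andante (live) xxx101')
print(cleaned)  # xxx100 andante , andante live xxx101
```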

## Formatting Data For Models

Now that we have cleansed our data and normalized the text properly, it is time to structure the data in a format that can be used by the models we expect to train. To get our data in this format we will perform the following tasks:

    1) Vectorize Data: transform our song lyrics from a sequence of words to a sequence of integer representations
    2) Reformat Dataset: final reformatting of the data, combining all songs into a single array

### 1) Vectorize Data

Neural network models in TensorFlow train on numerical inputs, so the raw text has to be encoded first. To do so, we will convert our lyrics from a sequence of words to a sequence of integers. Below we wrote a function that performs this vectorization. The function returns a new dataframe with the vectorized text in a new column, as well as a dictionary that maps text -> integers and an array that provides the inverse mapping from integers -> text. We then use the function to vectorize the dataset and display an example of the integer representation of a song.

In [45]:
import numpy as np

# defining function for vectorizing textual data
def Vectorizer(data):
    # Find unique words across all songs in the dataframe passed in
    unique_words = set()
    for index, record in data.iterrows():
        words = set(record['text'].split(' '))
        unique_words = unique_words.union(words)
    unique_words = sorted(unique_words)
    
    # Create transformation mappings (word -> integer and integer -> word)
    text2int = {word:i for i, word in enumerate(unique_words)}
    int2text = np.array(unique_words)
    
    # Initiate list for vectorized text
    int_text = []
    for index, record in data.iterrows():
        text = record['text'].split(' ')
        vectorized_text = []
        for word in text:
            vectorized_text.append(text2int[word])
        int_text.append(vectorized_text)
        
    data['int_text'] = int_text
    
    return data, text2int, int2text

In [46]:
# vectorize dataset
dataset, text2int, int2text = Vectorizer(dataset.copy())

In [47]:
# Display mappings
print('Dimensionality of Word Space: {}\n'.format(len(text2int)))

# Display example vectorized song and devectorized song
example_1 = dataset.loc[1]
print('Example Song\n------------')
print('Artist: ' + example_1['artist'])
print('Title: ' + example_1['song'])
print('Vectorized Song')
print(example_1['int_text'])
print('De-vectorized Text')
print(' '.join(int2text[example_1['int_text']]))

Dimensionality of Word Space: 92447

Example Song
------------
Artist: abba
Title: andante, andante
Vectorized Song
[91159, 91161, 2848, 91162, 91163, 5090, 1054, 5090, 91164, 91165, 80025, 42094, 25970, 90248, 51010, 1054, 62371, 91166, 91165, 83107, 51010, 33174, 47316, 2032, 78669, 27750, 12126, 91166, 91165, 80025, 91976, 82266, 1054, 49661, 42094, 74629, 91166, 91165, 5090, 1054, 5090, 91166, 91165, 43575, 46864, 81371, 29351, 35198, 91166, 91165, 49661, 91976, 29965, 75288, 5070, 47239, 91166, 91165, 46864, 91976, 11070, 8666, 81371, 86939, 57453, 81371, 56003, 91166, 91165, 83107, 54494, 75762, 1054, 91855, 44921, 39260, 91166, 91165, 5090, 1054, 5090, 91166, 91165, 34005, 74647, 90248, 51010, 56804, 91166, 91165, 39850, 91976, 54334, 91166, 91165, 39850, 91976, 75563, 91166, 91165, 62299, 51010, 82266, 5070, 82266, 3741, 5070, 49661, 51010, 78062, 91166, 91165, 49661, 51010, 73760, 1054, 49661, 51010, 75798, 91166, 91165, 5090, 1054, 5090, 91166, 91165, 83601, 47276, 57929, 544
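
The mapping is easier to follow on a tiny vocabulary. The sketch below applies the same idea (sorted unique words, a word -> integer dictionary, and an array for the inverse lookup) to two made-up songs; the toy strings and the `xxx001` end token are assumptions for illustration.

```python
import numpy as np

# Two toy "songs"; the strings are made up for illustration
songs = ['xxx000 la la xxx001', 'xxx000 do re xxx001']

# Sorted unique words define the integer ids
vocab = sorted(set(word for song in songs for word in song.split(' ')))
text2int = {word: i for i, word in enumerate(vocab)}
int2text = np.array(vocab)

# Encode each song as a list of integers, then decode it back to text
encoded = [[text2int[w] for w in song.split(' ')] for song in songs]
decoded = [' '.join(int2text[ids]) for ids in encoded]

print(vocab)    # ['do', 'la', 're', 'xxx000', 'xxx001']
print(encoded)  # [[3, 1, 1, 4], [3, 0, 2, 4]]
```

Because ids are assigned in sorted order, the `xxx` indicator tokens cluster together in the id space, which is why the real example above starts with ids just above 91,000.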

### 2) Reformat Dataset

The last transformation we need to apply to our dataset is to reformat it as a single array. This is a convenient format that we can save in a JSON file and then reload later when training our models. In doing so, we are also going to create several versions of our dataset, and for each we will create a training set as well as a validation set. We need multiple types of datasets because the various models we will train take different inputs. Primarily, some of the models will take variable length sequences, for which we will combine all the songs into a single sequence, and the other models will train on each individual song. Additionally, we are going to separate out just the Steely Dan songs into their own set since we will fine-tune our models using just these songs. Below is a description of the 4 datasets we will create:

    1) All Songs, Single Sequence: All the songs combined into a single sequence (includes Steely Dan songs)
    2) Steely Dan Songs, Single Sequence: Steely Dan songs combined into a single sequence
    3) All Songs Separated: All songs combined into a single list as separate entries (includes Steely Dan songs)
    4) Steely Dan Songs Separated: Steely Dan songs combined into a single list as separate entries
    
For each of these datasets we will have both a training set and a validation set. The validation and training songs will be randomly chosen, but will be the same between the sets, so the single sequence set has the same songs as the separated sequences set. We will partition 10% of the set for validation. 

#### Separate Steely Dan Songs

First let's find our Steely Dan songs. Below we filter out these songs into their own dataset.

In [48]:
# Filter to Steely Dan songs
SD_dataset = dataset[dataset['artist']=='steely dan']

# display dataset
SD_dataset

Unnamed: 0,artist,song,text,int_text
18783,steely dan,barrytown,xxx000 xxx010 steely dan xxx011 xxx100 barryto...,"[91159, 91161, 77260, 20937, 91162, 91163, 836..."
18784,steely dan,black cow,xxx000 xxx010 steely dan xxx011 xxx100 black c...,"[91159, 91161, 77260, 20937, 91162, 91163, 103..."
18785,steely dan,book of liars,xxx000 xxx010 steely dan xxx011 xxx100 book of...,"[91159, 91161, 77260, 20937, 91162, 91163, 113..."
18786,steely dan,brain tap shuffle,xxx000 xxx010 steely dan xxx011 xxx100 brain t...,"[91159, 91161, 77260, 20937, 91162, 91163, 118..."
18787,steely dan,brooklyn,xxx000 xxx010 steely dan xxx011 xxx100 brookly...,"[91159, 91161, 77260, 20937, 91162, 91163, 124..."
...,...,...,...,...
52188,steely dan,two against nature,xxx000 xxx010 steely dan xxx011 xxx100 two aga...,"[91159, 91161, 77260, 20937, 91162, 91163, 847..."
52190,steely dan,what a shame about me,xxx000 xxx010 steely dan xxx011 xxx100 what a ...,"[91159, 91161, 77260, 20937, 91162, 91163, 892..."
52191,steely dan,with a gun,xxx000 xxx010 steely dan xxx011 xxx100 with a ...,"[91159, 91161, 77260, 20937, 91162, 91163, 902..."
52192,steely dan,your gold teeth,xxx000 xxx010 steely dan xxx011 xxx100 your go...,"[91159, 91161, 77260, 20937, 91162, 91163, 919..."


#### Training and Validation

Now that we have separated out our Steely Dan songs, let's break each dataset into a training and a validation dataset. As mentioned before, we will randomly choose which songs are in each dataset, and will allocate 10% of the total set for validation. Below we wrote a function to perform this operation.

In [49]:
import numpy as np

# function that partitions dataset into training and validation
def TVPartition(data, val_ratio):
    N = len(data)
    n = int(np.floor(N*val_ratio))
    random_indx = np.random.choice(N, n, replace = False)
    val_indx = data.iloc[random_indx].index
    validation_dataset = data.loc[val_indx]
    training_dataset = data.drop(val_indx)
    
    return training_dataset, validation_dataset

In [50]:
val_ratio = .1

# split up full dataset
training_dataset , validation_dataset = TVPartition(dataset.copy(), val_ratio)

# split up Steely Dan dataset
SD_training_dataset , SD_validation_dataset = TVPartition(SD_dataset.copy(), val_ratio)
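
Unless a seed is set elsewhere, a split based on `np.random.choice` changes from run to run. If a repeatable split is wanted, one option is to pass a seed explicitly. The sketch below is a hedged variant, not the function used in this notebook: `seeded_partition` and its `seed` parameter are hypothetical additions.

```python
import numpy as np
import pandas as pd

def seeded_partition(data, val_ratio, seed):
    # Hypothetical variant of the partition step that uses a seeded generator
    rng = np.random.default_rng(seed)
    n_val = int(np.floor(len(data) * val_ratio))
    val_pos = rng.choice(len(data), n_val, replace=False)  # positions, not labels
    val_idx = data.index[val_pos]                           # translate to index labels
    return data.drop(val_idx), data.loc[val_idx]

# Toy dataframe with 20 one-token "songs"
toy = pd.DataFrame({'int_text': [[i] for i in range(20)]})
train_a, val_a = seeded_partition(toy, 0.1, seed=0)
train_b, val_b = seeded_partition(toy, 0.1, seed=0)

# Same seed -> same validation songs; training and validation never overlap
print(list(val_a.index) == list(val_b.index))
print(set(train_a.index).isdisjoint(val_a.index))
```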

#### Reformat As List

With our datasets separated out into validation and training subsets, we can now move on to our final step, reformatting the dataframes into lists. We want to reformat our data into lists because this lets us easily save the transformed data into a JSON file and later reload it into a numpy array. After transforming the records into a list of records, we will also create our secondary datasets, where all the songs are combined into a single sequence, by concatenating all the lists together. Below we wrote a function to extract the records from a dataframe into a list, and a function to combine the entries of these lists into a single depth list.

In [57]:
# function to extract records from dataframe into list
def DF2List(data, column_name):
    return list(data[column_name])

# function to convert list of list to single depth list (one sequence)
def SingleSequence(list_data):
    # initiate new list
    new_list = []
    # add each list in list_data to new list
    for data in list_data:
        new_list += data
        
    return new_list

In [64]:
# transform our dataframes to lists
column_name = 'int_text'

# First set of data (all songs separate list per song)
training_1 = DF2List(training_dataset, column_name)
validation_1 = DF2List(validation_dataset, column_name)

# Second set of data (Steely Dan songs separate list per song)
training_2 = DF2List(SD_training_dataset, column_name)
validation_2 = DF2List(SD_validation_dataset, column_name)

# Third set of data (all songs single list/sequence)
training_3 = SingleSequence(training_1)
validation_3 = SingleSequence(validation_1)

# Fourth set of data (Steely Dan songs single list/sequence)
training_4 = SingleSequence(training_2)
validation_4 = SingleSequence(validation_2)
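
As a side note, the flattening step can also be written with the standard library. This is only an alternative sketch on toy data; `single_sequence` is a hypothetical name, and the result is the same single depth list.

```python
from itertools import chain

def single_sequence(list_data):
    # Flatten a list of lists into one list, preserving order
    return list(chain.from_iterable(list_data))

# Toy example: three "songs" of integer tokens
flat = single_sequence([[1, 2], [3], [4, 5]])
print(flat)  # [1, 2, 3, 4, 5]
```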

## Saving Data

After cleansing, transforming, and reformatting our data, it is finally time to save the resulting datasets so that we don't have to go through this process again. We have decided to save each dataset into a separate file that holds its training and validation sets along with the vectorizing map int2text, from which text2int can be rebuilt. We will save these as JSON files. Below we wrote a function to perform these actions.

In [70]:
import json

# function for saving dataset to json file
def SaveData(filename, training_dataset, validation_dataset, int2text):
    print('Saving datasets and mapping to filename: '+filename)
    # Build data dictionary that will be saved
    data = {'training_dataset': training_dataset,
            'validation_dataset': validation_dataset,
            'unique_values': int2text.tolist()}
    # Write to savefile
    with open(filename, 'w') as f:
        json.dump(data, f)
    print('Saving datasets complete')

In [71]:
# Save each dataset to separate file
filename_1 = 'all_song_lyrics_1.json'
filename_2 = 'SD_song_lyrics_1.json'
filename_3 = 'all_song_lyrics_2.json'
filename_4 = 'SD_song_lyrics_2.json'

SaveData(filename_1, training_1, validation_1, int2text)
SaveData(filename_2, training_2, validation_2, int2text)
SaveData(filename_3, training_3, validation_3, int2text)
SaveData(filename_4, training_4, validation_4, int2text)

Saving datasets and mapping to filename: all_song_lyrics_1.json
Saving datasets complete
Saving datasets and mapping to filename: SD_song_lyrics_1.json
Saving datasets complete
Saving datasets and mapping to filename: all_song_lyrics_2.json
Saving datasets complete
Saving datasets and mapping to filename: SD_song_lyrics_2.json
Saving datasets complete
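
For completeness, here is a sketch of how one of these files could be read back later. The keys match those written by `SaveData`; the function name `load_data` and the temporary toy file are assumptions for illustration, and `text2int` is rebuilt from the saved word list.

```python
import json
import os
import tempfile

import numpy as np

def load_data(filename):
    # Hypothetical counterpart to SaveData: read the JSON file and rebuild both mappings
    with open(filename) as f:
        data = json.load(f)
    int2text = np.array(data['unique_values'])
    text2int = {word: i for i, word in enumerate(int2text.tolist())}
    return data['training_dataset'], data['validation_dataset'], text2int, int2text

# Write a tiny file with the same layout, then load it back
path = os.path.join(tempfile.mkdtemp(), 'toy_lyrics.json')
with open(path, 'w') as f:
    json.dump({'training_dataset': [[0, 1]],
               'validation_dataset': [[1]],
               'unique_values': ['la', 'xxx000']}, f)

train, val, text2int, int2text = load_data(path)
print(train, val, text2int)
```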
