## Most Synced Songs in Film & TV
###### Part 2
In this notebook, we'll combine and clean what was gathered in the last notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the CSVs and combine them into one DataFrame.

In [2]:
to_concat = []
for i in range(0, 22):
    df = pd.read_csv(f'./data/raw_songs_{i}.csv')
    to_concat.append(df)

In [3]:
songs = pd.concat(to_concat, axis = 0)
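
As an alternative to hard-coding the file count, the same load could be sketched with a glob over the data folder. This is only an illustration: `load_raw_songs` is a hypothetical helper, and it assumes the files keep the `raw_songs_<n>.csv` naming used above.

```python
from pathlib import Path

import pandas as pd


def load_raw_songs(data_dir="./data"):
    """Read every raw_songs_<n>.csv in data_dir and stack them into one DataFrame."""
    # Sort by the numeric suffix so raw_songs_2.csv comes before raw_songs_10.csv.
    paths = sorted(
        Path(data_dir).glob("raw_songs_*.csv"),
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),
    )
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, ignore_index=True)
```

With `ignore_index=True` the combined frame already has a clean 0..n-1 index, so a version of the notebook built on this helper would not end up with a leftover `index` column.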

Reset the index and check that it came through correctly.

In [4]:
songs.reset_index(inplace = True)

In [5]:
songs.head(2)

Unnamed: 0,index,artist,song_title,use,show
0,0,Dusty-Springfield,Girls It Ain't Easy,1:22A jackrabbit appears as Crystal gets up fr...,The Hunt12 Mar 20200
1,1,Dusty-Springfield,Wishin' and Hopin',,Sex Education • S2E816 Jan 20200


In [6]:
songs.shape

(137926, 5)

In [7]:
songs[songs['show'].str.contains('Pitch Perfect 2')]

Unnamed: 0,index,artist,song_title,use,show
628,628,Flo-Rida,Low (feat. T-Pain),0:03Third song about butts.,Pitch Perfect 214 May 20150
629,629,Flo-Rida,Low (feat. T-Pain),0:44Third song about butts.,Pitch Perfect 214 May 20154
10485,10485,KC-and-The-Sunshine-Band,"(Shake, Shake, Shake) Shake Your Booty",0:44Second song at the riff off about butts.,Pitch Perfect 214 May 20152
11900,11900,Mika,Lollipop,0:12The Treblemakers perform at freshman orien...,Pitch Perfect 214 May 20152
19802,19802,Los-Lobos,Marine's Hymn,0:01Universal logo humming.,Pitch Perfect 214 May 20151
...,...,...,...,...,...
83199,4157,Das-Sound-Machine-Tone-Hangers-The-Barden-Bell...,Riff Off,,Pitch Perfect 214 May 20150
83200,4158,Das-Sound-Machine-Tone-Hangers-The-Treblemaker...,Jump,,Pitch Perfect 214 May 20150
83203,4161,Rebel-Wilson-and-Robbie-Fairchild,We Belong,,Pitch Perfect 214 May 20150
83212,4170,Adam-DeVine,All of Me (Bumper's Audition),,Pitch Perfect 214 May 20150


In [8]:
songs.loc[2357, :]

index                                                      2357
artist                                                 The-Cure
song_title                                   Six Different Ways
use           1:02Song that plays when the Losers Club is cl...
show                                              It7 Sep 20174
Name: 2357, dtype: object

Check for duplicates and nulls.

In [9]:
songs.duplicated().sum()

0

In [10]:
songs.isna().sum()

index             0
artist            1
song_title        0
use           72209
show              0
dtype: int64

About half of the observations are missing a use description. This wasn't going to be a point of focus for me, so I will drop that column, along with the leftover 'index' column.

In [11]:
songs.drop(columns = ['use', 'index'], inplace = True)

Next up is cleaning the individual columns. First I'll find the length of each value in the show column. This is useful because the column packs up to four pieces of information into one string: show, episode, date, and number of favorites on the site.

As you'll see in the code below, cleaning and parsing this column was very tedious because there were so many possibilities for the "show" value. Some had no favorites, some had hundreds, some had no date, and some titles were strictly numeric. Because of this I had to write a lot of if/else statements, and the tests below cover just some of the cases I wanted to be cautious of when splitting up the text.
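
For comparison, here is a minimal sketch of how a single regular expression might pull the same four pieces apart. It is not the approach used in this notebook; `parse_show`, `SHOW_PATTERN`, and `MONTHS` are illustrative names, and the pattern assumes dates always look like `7 Sep 2017` with a three-letter English month.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Title/episode text, then "<day> <Mon> <year>", then an optional favorites count.
SHOW_PATTERN = re.compile(
    rf"^(?P<head>.*?)"
    rf"(?P<date>(?:3[01]|[12]\d|0?[1-9]) (?:{MONTHS}) \d{{4}})"
    rf"(?P<favorites>\d*)$"
)


def parse_show(raw):
    """Split a scraped 'show' string into (title, episode, date, favorites)."""
    match = SHOW_PATTERN.match(raw)
    if match is None:
        # No recognisable date: leave date and favorites unknown rather than guess.
        head, date, favorites = raw, None, None
    else:
        head = match.group("head")
        date = match.group("date")
        count = match.group("favorites")
        favorites = int(count) if count else None
    # Everything after the bullet, if present, is the episode code.
    title, _, episode = head.partition("•")
    return title.strip(), episode.strip() or None, date, favorites
```

Because the title is matched lazily and the day must be a valid day of the month, a string such as `'Blade Runner 20495 Oct 20171'` splits into the title `Blade Runner 2049` and the date `5 Oct 2017`. Like the if/else version, this prefers a two-digit day whenever one is valid, so a title ending in `1` or `2` followed by a single-digit day would still be ambiguous.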

In [12]:
def find_len(string):
    return len(string)

In [13]:
songs['len'] = songs['show'].apply(find_len)


songs[songs['len'] < 13]['show'].value_counts()

0               446
1                11
2                 3
31 Dec 20150      2
8                 2
4                 2
28 Sep 20170      1
3                 1
31 Dec 20151      1
Name: show, dtype: int64

In [14]:
songs = songs[songs['len'] >= 13]

In [15]:
songs.isna().sum()

artist        1
song_title    0
show          0
len           0
dtype: int64

The digits after the date are the number of favorites the song has on the site. This function splits them into a new column.

In [16]:
def take_last(string):
    if string[-5] == ' ':
        return np.nan
    elif string[-6] not in '1234567890 ':
        return string[-1]
    else:
        i = -6
        while string[i] in '1234567890 ':
            i -= 1 
        
        i += 6
        return string [i:]

In [17]:
test_1 = 'It7 Sep 20174'
test_2 = 'Big Little Lies • S1E34 Mar 2017115'
test_3 = 'Pitch Perfect 214 May 20150'
test_4 = 'Suits • S2E5'
test_5 = 'Blade Runner 20495 Oct 20171'
test_6 = '191724 Dec 20190'

In [18]:
def test(function):
    print(test_1 + ": " + str(function(test_1)))
    print(test_2 + ": " + str(function(test_2)))
    print(test_3 + ": " + str(function(test_3)))
    print(test_4 + ": " + str(function(test_4)))
    print(test_5 + ": " + str(function(test_5)))
    print(test_6 + ": " + str(function(test_6)))
    return function

In [19]:
test(take_last)

It7 Sep 20174: 4
Big Little Lies • S1E34 Mar 2017115: 115
Pitch Perfect 214 May 20150: 0
Suits • S2E5: nan
Blade Runner 20495 Oct 20171: 1
191724 Dec 20190: 0


<function __main__.take_last(string)>

In [20]:
songs['favorites'] = songs['show'].apply(take_last)

In [21]:
songs.isna().sum()

artist        1
song_title    0
show          0
len           0
favorites     0
dtype: int64

Now that the favorites have their own column, we can remove it from the show.

In [22]:
def remove_last(string):
     if string[-5] == ' ':
        return string
     elif string[-6] not in '1234567890 ':
        return string[:-1]
     else:
        i = -6
        while string[i] in '1234567890 ':
            i -= 1 
        
        i += 6
        return string [:i]

In [23]:
test(remove_last)

It7 Sep 20174: It7 Sep 2017
Big Little Lies • S1E34 Mar 2017115: Big Little Lies • S1E34 Mar 2017
Pitch Perfect 214 May 20150: Pitch Perfect 214 May 2015
Suits • S2E5: Suits • S2E5
Blade Runner 20495 Oct 20171: Blade Runner 20495 Oct 2017
191724 Dec 20190: 191724 Dec 2019


<function __main__.remove_last(string)>

In [24]:
songs['show'] = songs['show'].apply(remove_last)

In [25]:
songs.isna().sum()

artist        1
song_title    0
show          0
len           0
favorites     0
dtype: int64

In [26]:
songs[songs['len'] == 13]

Unnamed: 0,artist,song_title,show,len,favorites
936,Mutemath,Blood Pressure,Suits • S2E5,13,2
1595,Vampire-Weekend,Oxford Comma,Suits • S1E1,13,4
1596,Vampire-Weekend,I'm Going Down,Girls • S2E1,13,1
1632,LCD-Soundsystem,I Can Change,Girls • S1E3,13,5
2357,The-Cure,Six Different Ways,It7 Sep 2017,13,4
...,...,...,...,...,...
118631,Nicolas-Folmer,Tchou Cha Cha,It7 Sep 2017,13,0
132516,Ghostface-Killah,The Champ,Girls • S1E7,13,1
132986,Passion-Pit,Take a Walk,Girls • S2E9,13,1
133427,Juvenile,Who's Ya Daddy,Girls • S1E4,13,0


Working from the end of the string, the next thing to parse out is the date. Again, this function helps check if it's there before separating it.

In [27]:
test_1 = 'It7 Sep 2017'
test_2 = 'Big Little Lies • S1E34 Mar 2017'
test_3 = 'Pitch Perfect 214 May 2015'
test_4 = 'Suits • S2E5'
test_5 = 'Blade Runner 20495 Oct 2017'
test_6 = '191724 Dec 2019'

In [28]:
def get_date(string):
    digits = '0123456789'
    #This means there is no date
    if (string[-2] not in digits + ' ') or (string[-3] not in digits + ' '):
        return np.nan

    #This means there is a date with a three-letter month
    if string[-9] == ' ':
        #A non-digit right before a one-digit day, e.g. 'It7 Sep 2017'
        if string[-11] not in digits + ' ':
            return string[-10:]
        #Keep both characters before the month only if they make a valid day (<= 31)
        if int(string[-11:-9]) > 31:
            return string[-10:]
        return string[-11:]

    return string[-11:]

In [29]:
test(get_date)

It7 Sep 2017: 7 Sep 2017
Big Little Lies • S1E34 Mar 2017: 4 Mar 2017
Pitch Perfect 214 May 2015: 14 May 2015
Suits • S2E5: nan
Blade Runner 20495 Oct 2017: 5 Oct 2017
191724 Dec 2019: 24 Dec 2019


<function __main__.get_date(string)>

In [30]:
get_date('2127 Mar 2008')

'27 Mar 2008'

In [31]:
songs['date'] = songs['show'].apply(get_date)

In [32]:
songs.isna().sum()

artist          1
song_title      0
show            0
len             0
favorites       0
date          318
dtype: int64

In [33]:
songs[songs['date'].isna()]

Unnamed: 0,artist,song_title,show,len,favorites,date
936,Mutemath,Blood Pressure,Suits • S2E5,13,2,
1595,Vampire-Weekend,Oxford Comma,Suits • S1E1,13,4,
1596,Vampire-Weekend,I'm Going Down,Girls • S2E1,13,1,
1632,LCD-Soundsystem,I Can Change,Girls • S1E3,13,5,
1633,LCD-Soundsystem,Dance Yrself Clean,Suits • S2E10,14,0,
...,...,...,...,...,...,...
132516,Ghostface-Killah,The Champ,Girls • S1E7,13,1,
132986,Passion-Pit,Take a Walk,Girls • S2E9,13,1,
133403,Raphael-Lake,Get Dirty,Hello Ladies • S1E5,20,0,
133427,Juvenile,Who's Ya Daddy,Girls • S1E4,13,0,


In [34]:
songs.head()

Unnamed: 0,artist,song_title,show,len,favorites,date
0,Dusty-Springfield,Girls It Ain't Easy,The Hunt12 Mar 2020,20,0,12 Mar 2020
1,Dusty-Springfield,Wishin' and Hopin',Sex Education • S2E816 Jan 2020,32,0,16 Jan 2020
2,Dusty-Springfield,Spooky,9-1-1 • S3E627 Oct 2019,24,0,27 Oct 2019
3,Dusty-Springfield,I Can't Make It Alone,The Deuce • S3E429 Sep 2019,28,0,29 Sep 2019
4,Dusty-Springfield,No Easy Way Down,The Deuce • S3E429 Sep 2019,28,0,29 Sep 2019


We'll also remove the date from the show column so all that's left there is the show title and episode number if available.

In [35]:
def remove_date(string):
    digits = '0123456789'
    #This means there is no date
    if (string[-2] not in digits + ' ') or (string[-3] not in digits + ' '):
        return string

    #This means there is a date with a three-letter month
    if string[-9] == ' ':
        #A non-digit right before a one-digit day, e.g. 'It7 Sep 2017'
        if string[-11] not in digits + ' ':
            return string[:-10]
        #Treat both characters before the month as the day only if they make a valid day (<= 31)
        if int(string[-11:-9]) > 31:
            return string[:-10]
        return string[:-11]

    return string[:-11]

In [36]:
test(remove_date)

It7 Sep 2017: It
Big Little Lies • S1E34 Mar 2017: Big Little Lies • S1E3
Pitch Perfect 214 May 2015: Pitch Perfect 2
Suits • S2E5: Suits • S2E5
Blade Runner 20495 Oct 2017: Blade Runner 2049
191724 Dec 2019: 1917


<function __main__.remove_date(string)>

In [37]:
songs['show'] = songs['show'].apply(remove_date)

In [38]:
songs.tail()

Unnamed: 0,artist,song_title,show,len,favorites,date
137921,Theo-Kottis,Turning Around,White Lines • S1E4,30,0,14 May 2020
137922,Futurebirds,Olive Garden Daydream #47,Supergirl • S5E19,29,0,16 May 2020
137923,The-Veil,Welcome To My World,Batwoman • S1E20,28,0,16 May 2020
137924,Don-Tosti-Y-Su-Trio,Los Blues,Penny Dreadful: City of Angels • S1E4,49,0,16 May 2020
137925,Billy-May-and-His-Orchestra-as-Billy-May-and-H...,Rudolph the Red-Nosed Reindeer (1949),Christmas with the Kranks,37,0,14 Nov 2004


In [39]:
songs.isna().sum()

artist          1
song_title      0
show            0
len             0
favorites       0
date          318
dtype: int64

Another thing to clean up is the artist name. Most names have dashes where there should be spaces. The one row with a missing artist is dropped first.

In [40]:
songs = songs[songs['artist'].notna()]

In [41]:
def remove_dash(string):
    return string.strip().replace('-', ' ')

In [42]:
songs['artist'] = songs['artist'].apply(remove_dash) 
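
The same cleanup could also be written with pandas' vectorized string methods instead of `apply`. A small sketch on a made-up Series (the `artists` and `cleaned` names are illustrative only):

```python
import pandas as pd

artists = pd.Series(["Dusty-Springfield", " KC-and-The-Sunshine-Band ", "Mika"])

# regex=False treats '-' as a literal character rather than a pattern.
cleaned = artists.str.strip().str.replace("-", " ", regex=False)
```

Either way, the replacement is blunt: an artist whose real name contains a hyphen would lose it too, since the scraped names give no way to tell the two kinds of dash apart.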

In [43]:
songs.head(1)

Unnamed: 0,artist,song_title,show,len,favorites,date
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,20,0,12 Mar 2020


There are 71 distinct show/episode values with missing dates. Since this information is easy to find online, I'll impute it manually so I can include these rows in my analysis. Luckily they mostly come from the same few shows, so I was able to find the airdates laid out on Wikipedia.

In [44]:
songs[songs['date'].isna()]['show'].nunique()

71

In [45]:
songs[songs['date'].isna()]['show'].value_counts()[:10]

Girls • S3E3     15
Girls • S1E7     15
Girls • S1E10    10
Girls • S1E6     10
Girls • S1E3      9
Girls • S3E7      9
Girls • S1E1      9
Girls • S2E3      8
Girls • S3E8      8
Girls • S2E9      8
Name: show, dtype: int64

In [46]:
songs[songs['show'] == 'It']

Unnamed: 0,artist,song_title,show,len,favorites,date
2357,The Cure,Six Different Ways,It,13,4,7 Sep 2017
9906,The Cult,Love Removal Machine,It,13,4,7 Sep 2017
9928,Young MC,Bust a Move,It,13,6,7 Sep 2017
24852,XTC,Dear God,It,13,4,7 Sep 2017
49615,Anvil,666,It,13,4,7 Sep 2017
93834,Marc Durst,Holiday Kiss,It,13,0,7 Sep 2017
105087,Benjamin Wallfisch,Deadlights,It,13,2,7 Sep 2017
105088,Benjamin Wallfisch,Searching For Stanley,It,13,2,7 Sep 2017
105089,Benjamin Wallfisch,Return to Neibolt,It,13,2,7 Sep 2017
105090,Benjamin Wallfisch,Into the Well,It,13,2,7 Sep 2017


In [47]:
songs.loc[songs['show'] == 'Girls • S1E7', 'date'] = '27 May 2012'
songs.loc[songs['show'] == 'Girls • S3E3', 'date'] = '19 Jan 2014'
songs.loc[songs['show'] == 'Girls • S1E10', 'date'] = '17 Jun 2012'
songs.loc[songs['show'] == 'Girls • S1E6', 'date'] = '20 May 2012'
songs.loc[songs['show'] == 'Girls • S3E7', 'date'] = '16 Feb 2014'

In [48]:
songs[songs['date'].isna()]['show'].value_counts()[:10]

Girls • S1E3     9
Girls • S1E1     9
Girls • S3E8     8
Girls • S2E4     8
Girls • S2E3     8
Girls • S2E9     8
Suits • S1E10    7
Girls • S3E10    7
Girls • S1E2     7
Girls • S2E1     7
Name: show, dtype: int64

In [49]:
songs.loc[songs['show'] == 'Girls • S1E1', 'date'] = '15 Apr 2012'
songs.loc[songs['show'] == 'Girls • S1E3', 'date'] = '29 Apr 2012'
songs.loc[songs['show'] == 'Girls • S2E9', 'date'] = '10 Mar 2013'
songs.loc[songs['show'] == 'Girls • S2E4', 'date'] = '2 Feb 2014'
songs.loc[songs['show'] == 'Girls • S2E3', 'date'] = '27 Jan 2013'
songs.loc[songs['show'] == 'Girls • S3E8', 'date'] = '23 Feb 2014'
songs.loc[songs['show'] == 'Suits • S1E10', 'date'] = '25 Aug 2011'
songs.loc[songs['show'] == 'Girls • S2E1', 'date'] = '13 Jan 2013'
songs.loc[songs['show'] == 'Girls • S1E2', 'date'] = '22 Apr 2012'
songs.loc[songs['show'] == 'Girls • S3E10', 'date'] = '9 Mar 2014'

In [50]:
songs[songs['date'].isna()]['show'].value_counts()[:10]

Suits • S2E8           6
Hello Ladies • S1E1    6
Suits • S2E16          6
Girls • S1E8           6
Girls • S2E8           6
Girls • S3E11          6
Girls • S1E9           6
Suits • S1E11          5
Girls • S1E5           5
Suits • S2E10          5
Name: show, dtype: int64

In [51]:
songs.loc[songs['show'] == 'Hello Ladies • S1E1', 'date'] = '29 Sep 2013'
songs.loc[songs['show'] == 'Suits • S2E16', 'date'] = '21 Feb 2013'
songs.loc[songs['show'] == 'Girls • S1E9', 'date'] = '10 Jun 2012'
songs.loc[songs['show'] == 'Suits • S2E8', 'date'] = '9 Aug 2012'
songs.loc[songs['show'] == 'Girls • S3E11', 'date'] = '16 Mar 2014'
songs.loc[songs['show'] == 'Girls • S2E8', 'date'] = '3 Mar 2013'
songs.loc[songs['show'] == 'Suits • S1E8', 'date'] = '11 Aug 2011'
songs.loc[songs['show'] == 'Girls • S3E4', 'date'] = '26 Jan 2014'
songs.loc[songs['show'] == 'Girls • S1E5', 'date'] = '13 May 2012'
songs.loc[songs['show'] == 'Girls • S1E12', 'date'] = '24 Jan 2013'

songs[songs['date'].isna()]['show'].value_counts()[:10]

Girls • S1E8           6
Suits • S1E7           5
Suits • S2E10          5
Girls • S3E1           5
Suits • S1E11          5
Girls • S2E7           5
Suits • S1E12          5
Hello Ladies • S1E6    4
Hello Ladies • S1E5    4
Girls • S2E10          4
Name: show, dtype: int64

In [52]:
songs.loc[songs['show'] == 'Girls • S1E8', 'date'] = '3 Jun 2012'
songs.loc[songs['show'] == 'Suits • S1E12', 'date'] = '8 Sep 2011'
songs.loc[songs['show'] == 'Suits • S1E11', 'date'] = '1 Sep 2011'
songs.loc[songs['show'] == 'Girls • S3E1', 'date'] = '12 Jan 2014'
songs.loc[songs['show'] == 'Girls • S2E7', 'date'] = '24 Feb 2013'
songs.loc[songs['show'] == 'Suits • S1E7', 'date'] = '27 May 2012'
songs.loc[songs['show'] == 'Suits • S2E10', 'date'] = '23 Aug 2012'
songs.loc[songs['show'] == 'Girls • S2E6', 'date'] = '17 Feb 2013'
songs.loc[songs['show'] == 'Girls • S2E10', 'date'] = '17 Mar 2013'
songs.loc[songs['show'] == 'Suits • S1E1', 'date'] = '23 Jun 2011'

songs[songs['date'].isna()]['show'].value_counts()[:10]

Suits • S1E5           4
Hello Ladies • S1E5    4
Hello Ladies • S1E2    4
Girls • S3E6           4
Hello Ladies • S1E6    4
Hello Ladies • S1E8    3
Girls • S3E9           3
Suits • S2E2           3
Suits • S2E1           3
Girls • S2E5           3
Name: show, dtype: int64

In [53]:
songs.loc[songs['show'] == 'Hello Ladies • S1E2', 'date'] = '6 Oct 2013'
songs.loc[songs['show'] == 'Hello Ladies • S1E6', 'date'] = '3 Nov 2013'
songs.loc[songs['show'] == 'Suits • S1E5', 'date'] = '21 Jul 2011'
songs.loc[songs['show'] == 'Girls • S3E6', 'date'] = '9 Feb 2014'
songs.loc[songs['show'] == 'Hello Ladies • S1E5', 'date'] = '27 Oct 2013'
songs.loc[songs['show'] == 'Suits • S2E5', 'date'] = '21 Jul 2011'
songs.loc[songs['show'] == 'Hello Ladies • S1E8', 'date'] = '17 Nov 2013'
songs.loc[songs['show'] == 'Suits • S2E1', 'date'] = '14 Jun 2012'
songs.loc[songs['show'] == 'Suits • S2E6', 'date'] = '26 Jul 2012'
songs.loc[songs['show'] == 'Hello Ladies • S1E7', 'date'] = '10 Nov 2013'

songs[songs['date'].isna()]['show'].value_counts()[:10]

Suits • S1E6     3
Girls • S3E12    3
Suits • S2E9     3
Girls • S3E5     3
Suits • S1E2     3
Girls • S3E9     3
Girls • S2E5     3
Suits • S2E2     3
Suits • S2E12    2
Suits • S2E13    2
Name: show, dtype: int64

In [54]:
songs.loc[songs['show'] == 'Girls • S3E12', 'date'] = '23 Mar 2014'
songs.loc[songs['show'] == 'Girls • S3E9', 'date'] = '2 Mar 2014'
songs.loc[songs['show'] == 'Girls • S2E5', 'date'] = '1 Feb 2014'
songs.loc[songs['show'] == 'Suits • S2E9', 'date'] = '16 Aug 2012'
songs.loc[songs['show'] == 'Suits • S1E6', 'date'] = '28 Jul 2011'
songs.loc[songs['show'] == 'Girls • S3E5', 'date'] = '1 Feb 2014'
songs.loc[songs['show'] == 'Suits • S1E2', 'date'] = '30 Jun 2011'
songs.loc[songs['show'] == 'Suits • S2E2', 'date'] = '21 Jun 2012'
songs.loc[songs['show'] == 'Suits • S2E15', 'date'] = '14 Feb 2013'
songs.loc[songs['show'] == 'Suits • S2E13', 'date'] = '31 Jan 2013'

songs[songs['date'].isna()]['show'].value_counts()[:10]

Suits • S2E7     2
Suits • S2E12    2
Girls • S1E4     2
Suits • S2E4     2
Girls • S3E2     2
Suits • S1E13    1
Suits • S1E3     1
Suits • S1E4     1
Suits • S2E14    1
Suits • S1E71    1
Name: show, dtype: int64

In [55]:
songs.loc[songs['show'] == 'Suits • S2E7', 'date'] = '2 Aug 2012'
songs.loc[songs['show'] == 'Suits • S2E12', 'date'] = '24 Jan 2013'
songs.loc[songs['show'] == 'Girls • S3E2', 'date'] = '12 Jan 2014'
songs.loc[songs['show'] == 'Suits • S2E4', 'date'] = '12 Jul 2012'
songs.loc[songs['show'] == 'Girls • S1E4', 'date'] = '6 May 2012'
songs.loc[songs['show'] == 'Suits • S1E4', 'date'] = '14 Jul 2011'
songs.loc[songs['show'] == 'Suits • S2E14', 'date'] = '7 Feb 2013'
songs.loc[songs['show'] == 'Girls • S2E2', 'date'] = '22 Apr 2012'
songs.loc[songs['show'] == 'Suits • S2E11', 'date'] = '17 Jan 2013'
songs.loc[songs['show'] == 'Hello Ladies • S1E4', 'date'] = '20 Oct 2013'

songs[songs['date'].isna()]['show'].value_counts()[:10]

Hello Ladies • S1E3    1
Suits • S1E13          1
Suits • S1E41          1
Suits • S1E21          1
Suits • S1E71          1
Suits • S1E3           1
Suits • S2E3           1
Name: show, dtype: int64

In [56]:
songs.loc[songs['show'] == 'Suits • S1E71', 'show'] = 'Suits • S1E7'

In [57]:
songs.loc[songs['show'] == 'Suits • S1E41', 'show'] = 'Suits • S1E4'
songs.loc[songs['show'] == 'Suits • S1E21', 'show'] = 'Suits • S1E2'

In [58]:
songs.loc[songs['show'] == 'Suits • S1E13', 'show'] = 'Suits • S1E1'

In [59]:
songs.loc[songs['show'] == 'Suits • S1E3', 'date'] = '7 Jul 2011'
songs.loc[songs['show'] == 'Suits • S1E7', 'date'] = '4 Aug 2011'
songs.loc[songs['show'] == 'Hello Ladies • S1E3', 'date'] = '13 Oct 2013'
songs.loc[songs['show'] == 'Suits • S1E4', 'date'] = '14 Jul 2011'
songs.loc[songs['show'] == 'Suits • S1E2', 'date'] = '30 Jun 2011'
songs.loc[songs['show'] == 'Suits • S1E1', 'date'] = '23 Jun 2011'
songs.loc[songs['show'] == 'Suits • S2E3', 'date'] = '28 Jun 2012'

songs[songs['date'].isna()]['show']

Series([], Name: show, dtype: object)
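
The repeated `.loc` assignments could also be collapsed into a single lookup table. The sketch below is one possible shape, not what was run here: `airdates` holds just three of the dates from the cells above, and `fill_missing_dates` is a hypothetical helper.

```python
import pandas as pd

# Airdates copied from a few of the assignments above; extend the dict as needed.
airdates = {
    "Girls • S1E7": "27 May 2012",
    "Girls • S3E3": "19 Jan 2014",
    "Hello Ladies • S1E1": "29 Sep 2013",
}


def fill_missing_dates(frame, lookup):
    """Return a copy of frame where missing dates are filled from lookup by show."""
    filled = frame.copy()
    # map() looks each show up in the dict; fillna() only touches rows with no date.
    filled["date"] = filled["date"].fillna(filled["show"].map(lookup))
    return filled
```

One difference from the `.loc` version: `fillna` leaves existing dates alone, whereas assigning through `.loc` overwrites every row for that show, including rows that already had a date.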

In [60]:
songs.isna().sum()

artist        0
song_title    0
show          0
len           0
favorites     0
date          0
dtype: int64

To make sure I've cleaned properly, the new 'check' column should contain only the day of the month.

In [61]:
def check_date(string):
    return string[:2]

In [62]:
songs['check'] = songs['date'].apply(check_date)

In [63]:
songs['check'].value_counts().sort_values()[:10]

s       1
 2      3
 6     13
 4     28
 8     37
 1     43
 7     68
02     85
09     93
06    106
Name: check, dtype: int64

This one slipped through the cracks so I'll manually change it.

In [64]:
songs[songs['check'] == 's ']

Unnamed: 0,artist,song_title,show,len,favorites,date,check
54867,Azealia Banks,212 (feat. Lazy Jay),Girl,15,0,s • S1E101,s


In [65]:
songs.loc[54867, 'show'] = 'Girls • S1E10'
songs.loc[54867, 'favorites'] = 10
songs.loc[54867, 'date'] = '17 Jun 2012'

In [66]:
songs.loc[54867]

artist               Azealia Banks
song_title    212 (feat. Lazy Jay)
show                 Girls • S1E10
len                             15
favorites                       10
date                   17 Jun 2012
check                           s 
Name: 54867, dtype: object

In [67]:
sorted(songs['check'].unique())

[' 1',
 ' 2',
 ' 4',
 ' 6',
 ' 7',
 ' 8',
 '01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '09',
 '1 ',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '2 ',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '3 ',
 '30',
 '31',
 '4 ',
 '5 ',
 '6 ',
 '7 ',
 '8 ',
 '9 ',
 's ']
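
Looking at the first two characters only catches problems at the start of the string. A stricter check, sketched here with the standard library, is to try parsing the whole value as a date. `is_valid_date` is a hypothetical helper, and the `%b` directive assumes month names are matched in an English locale.

```python
from datetime import datetime


def is_valid_date(text):
    """Return True if text looks like '7 Sep 2017' (day, abbreviated month, year)."""
    try:
        # %b expects an abbreviated month name; this assumes an English locale.
        datetime.strptime(text.strip(), "%d %b %Y")
    except ValueError:
        return False
    return True
```

Something like `songs[~songs['date'].apply(is_valid_date)]` would then list every row whose date fails to parse, including values with a full month name or stray trailing characters.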

In [68]:
songs.head()

Unnamed: 0,artist,song_title,show,len,favorites,date,check
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,20,0,12 Mar 2020,12
1,Dusty Springfield,Wishin' and Hopin',Sex Education • S2E8,32,0,16 Jan 2020,16
2,Dusty Springfield,Spooky,9-1-1 • S3E6,24,0,27 Oct 2019,27
3,Dusty Springfield,I Can't Make It Alone,The Deuce • S3E4,28,0,29 Sep 2019,29
4,Dusty Springfield,No Easy Way Down,The Deuce • S3E4,28,0,29 Sep 2019,29


Next step is splitting the episode from the title. Per the research I've done on the site, it seems if there is no episode listed it's actually a film.

In [69]:
def split_ep(string):
    if '•' in string:
        return string.split('•')[1]
    else:
        return np.nan

In [70]:
songs['episode'] = songs['show'].apply(split_ep)

In [71]:
songs.head()

Unnamed: 0,artist,song_title,show,len,favorites,date,check,episode
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,20,0,12 Mar 2020,12,
1,Dusty Springfield,Wishin' and Hopin',Sex Education • S2E8,32,0,16 Jan 2020,16,S2E8
2,Dusty Springfield,Spooky,9-1-1 • S3E6,24,0,27 Oct 2019,27,S3E6
3,Dusty Springfield,I Can't Make It Alone,The Deuce • S3E4,28,0,29 Sep 2019,29,S3E4
4,Dusty Springfield,No Easy Way Down,The Deuce • S3E4,28,0,29 Sep 2019,29,S3E4


In [72]:
def remove_ep(string):
    if '•' in string:
        return string.split('•')[0]
    else:
        return string

In [73]:
songs['show'] = songs['show'].apply(remove_ep)

In [74]:
songs.head()

Unnamed: 0,artist,song_title,show,len,favorites,date,check,episode
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,20,0,12 Mar 2020,12,
1,Dusty Springfield,Wishin' and Hopin',Sex Education,32,0,16 Jan 2020,16,S2E8
2,Dusty Springfield,Spooky,9-1-1,24,0,27 Oct 2019,27,S3E6
3,Dusty Springfield,I Can't Make It Alone,The Deuce,28,0,29 Sep 2019,29,S3E4
4,Dusty Springfield,No Easy Way Down,The Deuce,28,0,29 Sep 2019,29,S3E4


In [75]:
songs.drop(columns = ['len', 'check'], inplace = True)

Adding another function to strip stray whitespace from the text columns.

In [76]:
def remove_space(string):
    return string.strip()

In [77]:
songs['artist'] = songs['artist'].apply(remove_space)
songs['song_title'] = songs['song_title'].apply(remove_space)
songs['show'] = songs['show'].apply(remove_space)
songs['date'] = songs['date'].apply(remove_space)

Creating a combo song/artist field will make it easy to see not only top synced artists but also their top song (and not group any songs with the same title but different artist).

In [78]:
songs['song_artist'] = songs['song_title'] + ' - ' + songs['artist']

In [79]:
songs.head()

Unnamed: 0,artist,song_title,show,favorites,date,episode,song_artist
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,0,12 Mar 2020,,Girls It Ain't Easy - Dusty Springfield
1,Dusty Springfield,Wishin' and Hopin',Sex Education,0,16 Jan 2020,S2E8,Wishin' and Hopin' - Dusty Springfield
2,Dusty Springfield,Spooky,9-1-1,0,27 Oct 2019,S3E6,Spooky - Dusty Springfield
3,Dusty Springfield,I Can't Make It Alone,The Deuce,0,29 Sep 2019,S3E4,I Can't Make It Alone - Dusty Springfield
4,Dusty Springfield,No Easy Way Down,The Deuce,0,29 Sep 2019,S3E4,No Easy Way Down - Dusty Springfield


In [80]:
songs['artist'].value_counts().sort_values(ascending = False)[60:80]

Queen                            133
Justin Hurwitz                   132
Alan Menken                      129
Dario Marianelli                 125
Nicholas Britell                 124
Los Lobos                        123
James Horner                     122
Jed Kurzel                       121
Johnny Cash with Bob Dylan       118
Trent Reznor and Atticus Ross    117
Max Richter and Sara Leonard     114
Trevor Rabin                     112
Beck                             112
Harry Gregson Williams           110
Junkie XL                        110
Ludwig Goeransson                109
Johann Johannsson                109
Star Cast                        109
Jon Brion                        107
South Park Cast                  105
Name: artist, dtype: int64

In [81]:
songs['song_artist'].value_counts().sort_values(ascending = False)[:20]

Push It - Salt N Pepa                                  27
Spirit In the Sky - Norman Greenbaum                   22
I Will Survive - Gloria Gaynor                         22
Let's Get It On - Marvin Gaye                          22
September - Earth Wind and Fire                        21
Escape (The Pina Colada Song) - Rupert Holmes          20
Eye of the Tiger - Survivor                            18
True - Spandau Ballet                                  18
At Last - Etta James                                   17
Dance Hall Days - Wang Chung                           17
I Want to Know What Love Is - Foreigner                17
U Can't Touch This - MC Hammer                         17
Don't You (Forget About Me) - Simple Minds             16
Feel It Still - Portugal The Man                       16
Barracuda - Heart                                      16
This Is How We Do It - Montell Jordan                  16
Everybody Wants to Rule the World - Tears for Fears    16
Jump Around - 

In [82]:
songs[songs['date'].str.contains('2015')]['song_artist'].value_counts().sort_values(ascending = False)[:10]

My Type - Saint Motel                                     5
Poison - Bell Biv DeVoe                                   4
Scenario - A Tribe Called Quest featuring Busta Rhymes    4
Turn Down For What - DJ Snake                             4
I Can't Wait - Nu Shooz                                   4
I Need My Girl - The National                             4
These Arms of Mine - Otis Redding                         4
Hypnotic - Zella Day                                      4
Pushing On - Oliver USD and Jimi Jules                    4
Classic (feat. Powers) - The Knocks                       4
Name: song_artist, dtype: int64

Happy Birthday is in the public domain, so I don't want it to count toward licensed songs:

In [83]:
songs = songs[songs['song_artist'] != 'Happy Birthday - CAST']

In [84]:
songs = songs[songs['song_artist'] != 'Happy Birthday - Cast']

In [85]:
songs[songs['date'].str.contains('2020')].groupby('show').count().sort_values('song_title', ascending = False).head(20)

Unnamed: 0_level_0,artist,song_title,favorites,date,episode,song_artist
show,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
The Masked Singer,142,142,142,142,142,142
High Fidelity,120,120,120,120,120,120
Zoey's Extraordinary Playlist,90,90,90,90,90,90
Good Trouble,90,90,90,90,90,90
The Bold Type,88,88,88,88,88,88
Stumptown,81,81,81,81,81,81
Dynasty,74,74,74,74,74,74
Dare Me,71,71,71,71,71,71
Gentefied,71,71,71,71,71,71
Station 19,70,70,70,70,70,70


Since the date was so messy, I think it would be best to also include columns specifically for month/year and year. In the end, I don't think the specific airdate will matter as much as knowing generally how many songs sync in every year and month.

In [86]:
def month_year(string):
    return string[-8:]

In [87]:
def get_year(string):
    return string[-4:]

In [88]:
songs['month_year'] = songs['date'].apply(month_year)
songs['year'] = songs['date'].apply(get_year)
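
Slicing the last eight or four characters works as long as every date ends in a three-letter month and a four-digit year. A sketch of an alternative that parses the strings into real datetimes first, shown on a made-up Series (`dates`, `parsed`, `month_year`, and `year` here are illustrative and separate from the notebook's columns):

```python
import pandas as pd

dates = pd.Series(["12 Mar 2020", " 7 Sep 2017", "29 Sep 2019"])

# Parse once, then derive the pieces from datetimes instead of slicing text.
parsed = pd.to_datetime(dates.str.strip(), format="%d %b %Y")
month_year = parsed.dt.strftime("%b %Y")
year = parsed.dt.year
```

A malformed value raises an error at the parsing step instead of silently producing a wrong month. Note that `year` comes out as an integer here rather than a string.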

In [89]:
songs.head()

Unnamed: 0,artist,song_title,show,favorites,date,episode,song_artist,month_year,year
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,0,12 Mar 2020,,Girls It Ain't Easy - Dusty Springfield,Mar 2020,2020
1,Dusty Springfield,Wishin' and Hopin',Sex Education,0,16 Jan 2020,S2E8,Wishin' and Hopin' - Dusty Springfield,Jan 2020,2020
2,Dusty Springfield,Spooky,9-1-1,0,27 Oct 2019,S3E6,Spooky - Dusty Springfield,Oct 2019,2019
3,Dusty Springfield,I Can't Make It Alone,The Deuce,0,29 Sep 2019,S3E4,I Can't Make It Alone - Dusty Springfield,Sep 2019,2019
4,Dusty Springfield,No Easy Way Down,The Deuce,0,29 Sep 2019,S3E4,No Easy Way Down - Dusty Springfield,Sep 2019,2019


In [90]:
new_col_order = ['artist', 'song_title', 'show', 'episode', 
                'date', 'month_year', 'year', 'favorites', 'song_artist']

In [91]:
songs = songs[new_col_order]

In [92]:
songs.head()

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,,12 Mar 2020,Mar 2020,2020,0,Girls It Ain't Easy - Dusty Springfield
1,Dusty Springfield,Wishin' and Hopin',Sex Education,S2E8,16 Jan 2020,Jan 2020,2020,0,Wishin' and Hopin' - Dusty Springfield
2,Dusty Springfield,Spooky,9-1-1,S3E6,27 Oct 2019,Oct 2019,2019,0,Spooky - Dusty Springfield
3,Dusty Springfield,I Can't Make It Alone,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,I Can't Make It Alone - Dusty Springfield
4,Dusty Springfield,No Easy Way Down,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,No Easy Way Down - Dusty Springfield


In [93]:
songs['episode'] = songs['episode'].fillna('none')

Here we'll add a column to specify if it's a TV show or movie.

In [94]:
songs['type'] = ['Movie' if ep == 'none' else 'TV' for ep in songs['episode']]
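
The same flag can also be built without a Python-level loop. Here is a minimal sketch on a hypothetical `demo` frame, assuming (as this notebook does) that a missing episode means the row is a movie:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first one has no episode, so it counts as a movie.
demo = pd.DataFrame({'show': ['The Hunt', 'Sex Education', '9-1-1'],
                     'episode': [None, 'S2E8', 'S3E6']})

# Flag rows directly from the missing values, with no placeholder string needed.
demo['type'] = np.where(demo['episode'].isna(), 'Movie', 'TV')
```

Working from `isna()` also avoids depending on the exact placeholder text used to fill the gaps.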

In [95]:
songs.head()

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist,type
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,none,12 Mar 2020,Mar 2020,2020,0,Girls It Ain't Easy - Dusty Springfield,Movie
1,Dusty Springfield,Wishin' and Hopin',Sex Education,S2E8,16 Jan 2020,Jan 2020,2020,0,Wishin' and Hopin' - Dusty Springfield,TV
2,Dusty Springfield,Spooky,9-1-1,S3E6,27 Oct 2019,Oct 2019,2019,0,Spooky - Dusty Springfield,TV
3,Dusty Springfield,I Can't Make It Alone,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,I Can't Make It Alone - Dusty Springfield,TV
4,Dusty Springfield,No Easy Way Down,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,No Easy Way Down - Dusty Springfield,TV


In [96]:
songs['type'].value_counts()

Movie    73097
TV       64341
Name: type, dtype: int64

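Finally, I'd love to know the average number of songs per episode for each TV show. To do this, I'll take the number of songs recorded for a show and divide by the number of unique episodes recorded for it. For movies, the same column simply holds the total number of songs recorded for the film.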

In [97]:
tv = set(songs[songs['type'] == 'TV']['show'].unique())

In [98]:
movies = set(songs[songs['type'] == 'Movie']['show'].unique())

In [99]:
tv_dict = {}
for show in tv:
    uses = len(songs[(songs['show'] == show) & (songs['type'] == 'TV')])
    episodes = songs[(songs['show'] == show) & (songs['type'] == 'TV')]['episode'].nunique()
    average = uses / episodes
    tv_dict[show] = average

movie_dict = {}
for movie in movies:
    uses = len(songs[(songs['show'] == movie) & (songs['type'] == 'Movie')])
    movie_dict[movie] = uses
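
The loops above filter the full DataFrame once per title, which can be slow on a dataset this size. A `groupby` should produce the same dictionaries in a single pass. This is a sketch on a hypothetical `demo` frame (the rows are made up), including a TV show and a movie that share the title 'It' to show why the split by `type` matters:

```python
import pandas as pd

# Hypothetical rows: a TV miniseries and a movie with the same title.
demo = pd.DataFrame({
    'show':    ['It', 'It', 'It', 'It', 'It', 'The Deuce', 'The Deuce'],
    'episode': ['S1E1', 'S1E1', 'S1E2', 'none', 'none', 'S3E4', 'S3E4'],
    'type':    ['TV', 'TV', 'TV', 'Movie', 'Movie', 'TV', 'TV'],
})

tv_rows = demo[demo['type'] == 'TV']
uses = tv_rows.groupby('show').size()                     # songs per show
episodes = tv_rows.groupby('show')['episode'].nunique()   # unique episodes per show
tv_avg = (uses / episodes).to_dict()                      # average songs per episode

# For movies, just count the songs per title.
movie_counts = demo[demo['type'] == 'Movie'].groupby('show').size().to_dict()
```

In this toy example the TV version of 'It' averages 1.5 songs per episode (3 songs over 2 episodes), while the movie 'It' gets a count of 2.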

In [100]:
songs.loc[songs['type'] == 'TV', 'avg_per_ep'] = songs['show'].map(tv_dict)
songs.loc[songs['type'] == 'Movie', 'avg_per_ep'] = songs['show'].map(movie_dict)

In [101]:
songs['avg_per_ep'] = songs['avg_per_ep'].round(2)

In [102]:
songs[songs['show'] == 'It']

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist,type,avg_per_ep
2357,The Cure,Six Different Ways,It,none,7 Sep 2017,Sep 2017,2017,4,Six Different Ways - The Cure,Movie,49.0
2891,The Beach Boys,I Get Around,It,S1E2,20 Nov 1990,Nov 1990,1990,2,I Get Around - The Beach Boys,TV,3.0
9906,The Cult,Love Removal Machine,It,none,7 Sep 2017,Sep 2017,2017,4,Love Removal Machine - The Cult,Movie,49.0
9928,Young MC,Bust a Move,It,none,7 Sep 2017,Sep 2017,2017,6,Bust a Move - Young MC,Movie,49.0
15834,The Temptations,The Way You Do the Things You Do,It,S1E2,20 Nov 1990,Nov 1990,1990,2,The Way You Do the Things You Do - The Temptat...,TV,3.0
21909,The Impressions,It's All Right,It,S1E1,18 Nov 1990,Nov 1990,1990,2,It's All Right - The Impressions,TV,3.0
24680,Cast,Fur Elise,It,S1E1,18 Nov 1990,Nov 1990,1990,1,Fur Elise - Cast,TV,3.0
24681,Cast,Itsy Bitsy Spider,It,S1E1,18 Nov 1990,Nov 1990,1990,0,Itsy Bitsy Spider - Cast,TV,3.0
24852,XTC,Dear God,It,none,7 Sep 2017,Sep 2017,2017,4,Dear God - XTC,Movie,49.0
49615,Anvil,666,It,none,7 Sep 2017,Sep 2017,2017,4,666 - Anvil,Movie,49.0


In [103]:
songs.head()

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist,type,avg_per_ep
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,none,12 Mar 2020,Mar 2020,2020,0,Girls It Ain't Easy - Dusty Springfield,Movie,35.0
1,Dusty Springfield,Wishin' and Hopin',Sex Education,S2E8,16 Jan 2020,Jan 2020,2020,0,Wishin' and Hopin' - Dusty Springfield,TV,7.94
2,Dusty Springfield,Spooky,9-1-1,S3E6,27 Oct 2019,Oct 2019,2019,0,Spooky - Dusty Springfield,TV,4.11
3,Dusty Springfield,I Can't Make It Alone,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,I Can't Make It Alone - Dusty Springfield,TV,7.8
4,Dusty Springfield,No Easy Way Down,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,No Easy Way Down - Dusty Springfield,TV,7.8


In [104]:
songs.loc[:, 'song_artist'] = '"' + songs['song_title'] + '" - ' + songs['artist']

In [105]:
songs.head()

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist,type,avg_per_ep
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,none,12 Mar 2020,Mar 2020,2020,0,"""Girls It Ain't Easy"" - Dusty Springfield",Movie,35.0
1,Dusty Springfield,Wishin' and Hopin',Sex Education,S2E8,16 Jan 2020,Jan 2020,2020,0,"""Wishin' and Hopin'"" - Dusty Springfield",TV,7.94
2,Dusty Springfield,Spooky,9-1-1,S3E6,27 Oct 2019,Oct 2019,2019,0,"""Spooky"" - Dusty Springfield",TV,4.11
3,Dusty Springfield,I Can't Make It Alone,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,"""I Can't Make It Alone"" - Dusty Springfield",TV,7.8
4,Dusty Springfield,No Easy Way Down,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,"""No Easy Way Down"" - Dusty Springfield",TV,7.8


In [106]:
songs.to_csv('./data/all_songs_cleaned.csv', index = False)