<a href="https://colab.research.google.com/github/raghavmittal101/music_mind_tech_project/blob/master/MMT_Project_with_watson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMT Project
My aim is to find associations between a person's score in Big-5 personality assessment and the emotions depicted by lyrics of their favourite songs.

For this purpose, I have collected data from more than 50 participants. This data consists of
* demographic data
* big-5 personality assessment data
* name and artist of 3 to 10 favourite songs

## Process
1. Gather lyrics of songs
2. Analyze the emotions depicted in the lyrics
3. Analyze associations between the Big-5 personality scores and the emotions depicted in the lyrics


1. I will automate the lyrics gathering process by using the `lyrics-extractor` package, which searches for the song on Google and scrapes the lyrics shown on the results page. I also tried other packages for searching lyrics, but they are not tolerant of spelling mistakes and incomplete words. Because the data was collected by circulating a Google form, there is a high chance of misspelled and incomplete song names.  
After extracting the lyrics, we need to remove any labelling brackets like *\[chorus\]* or *\[bridge\]*, as the labels may contain meaningful words which could affect the analysis of emotions. For this purpose I will use regular expressions.

2. To analyze the emotions depicted in the lyrics, I will use the Tone Analyzer service by IBM Watson. This service takes in the lyrics and returns tone scores. With API version `2016-05-19` it scores 13 components: the Big-5 components O, C, E, A, N under the social category; anger, disgust, fear, joy, and sadness under the emotional category; and analytical, confident, and tentative under the language category. With version `2017-09-21`, which the code in this notebook sets, it returns only seven tones: anger, fear, joy, sadness, analytical, confident, and tentative.

3. Once we have the scores for each song, we need to calculate cumulative scores for each participant. To do this, we average each emotion type across all of the participant's favourite songs, which gives a score vector for each participant.  
Before averaging, wherever a song's score is less than 0.5, we set it to zero. This is done because, according to the IBM Watson documentation, emotions which score less than 0.5 are difficult to perceive.
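
Step 3 can be sketched on a toy table. This is a minimal illustration, not the project data: the ids, the scores, and the names `toy_scores`, `floored`, and `participant_vectors` are made up for the example. It follows the order used by the code cells later in this notebook, flooring each song's scores first and averaging afterwards.

```python
import pandas as pd

# Toy per-song scores for two hypothetical participants (ids 1 and 2).
toy_scores = pd.DataFrame({
    "id":      [1, 1, 2, 2],
    "joy":     [0.75, 0.25, 0.5, 1.0],
    "sadness": [0.25, 0.25, 1.0, 0.5],
})
tone_columns = ["joy", "sadness"]

# Keep scores that are at least 0.5 and replace every other score with 0.
floored = toy_scores.copy()
floored[tone_columns] = floored[tone_columns].where(floored[tone_columns] >= 0.5, 0)

# Average the floored scores over each participant's songs.
participant_vectors = floored.groupby("id")[tone_columns].mean()
print(participant_vectors)
```

Each row of `participant_vectors` is the score vector of one participant. Flooring before averaging means that a single strong song can still contribute to a participant's mean even when the other songs score low.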


In [0]:
import pandas as pd

In [0]:
"""
Load dataset which contains `id`, `songnames`, `songartists`. Always load the latest
version of the dataset, which may contain lyrics and scores fetched previously. This
saves time and avoids using up any quota imposed by the third-party services we will use.
"""
songs = pd.read_csv("/content/drive/My Drive/acads/MMT/MMT project/collected data/songs_dataset.csv")

## Fetch and process the lyrics

In [0]:
'''
Load external package to fetch lyrics for each song in dataset.

Extracting lyrics of songs by using the `lyrics_extractor` API
(https://pypi.org/project/lyrics-extractor/#description)

It is tolerant to spelling mistakes and incomplete words because it does a
Google search for the given data, and Google autocorrects and autocompletes
misspelled or incomplete words. This is very useful for us because it takes away
the tiring task of manually checking and correcting the spelling of each entry
in the dataset.
'''
!pip install lyrics-extractor

# Example
# song_title, song_lyrics = extract_lyrics.get_lyrics("Its you	Ali Gatie")

In [0]:
'''
Extract and store lyrics of songs.

Go through each row in the songs dataset and add the lyrics to that row.
To save time and Google search request quota, skip rows which already have lyrics.
'''
from lyrics_extractor import Song_Lyrics
extract_lyrics = Song_Lyrics("","") # insert the Google API keys here

for i in songs.index:
  if len(str(songs['lyrics'][i])) < 15: # true if the lyrics column is empty for this row (NaN becomes 'nan')
    query = str(songs['songnames'][i]) + " " + str(songs['songartists'][i])
    try:  # fetch lyrics of song from the Internet
      title, lyrics = extract_lyrics.get_lyrics(query)
      if lyrics == '':  # if `extract_lyrics` is unable to fetch/find lyrics of song, print song details
        print(i, query, " lyrics not found!!!")
      songs.at[i, 'lyrics'] = lyrics  # write through `.at` to avoid chained assignment
    except Exception as error: # print song details with keyword `ERROR!!` if try fails
      print(i, songs['id'][i], query, "ERROR!!", error)


In [0]:
"""
Just to play safe, save the current dataframe to CSV. This way we are free to
play with current dataframe without any worry of destroying the df by mistake. 
"""

songs.to_csv('songs_dataset.csv')

In [0]:
'''
Now that we have the lyrics in place, let's do some necessary preprocessing.

We noticed some labels here and there in the fetched lyrics.
Let's remove the labelling brackets, together with their labels, from the lyrics
because labels may contain words which will unnecessarily influence the
sentiment of the lyrics.

examples of labels: [chorus], [bridge]
'''
import re
regex1 = r'\[[a-zA-Z 0-9]*\]'  # raw string; selects labels like [chorus], [bridge]

for i in songs.index:
  try:
    lyrics = str(songs['lyrics'][i])
    lyrics = re.sub(regex1, ' ', lyrics)
    songs.at[i, 'lyrics'] = lyrics  # write through `.at` to avoid chained assignment
  except Exception as error:
    print(i, "something went wrong!!!", error)
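
To see what the label pattern does and does not catch, here is a small self-contained sketch. The sample string and its section labels are made up for illustration, and `broad_pattern` is a possible alternative rather than what the notebook uses.

```python
import re

# Same character class as `regex1` in the cell above, written as a raw string.
label_pattern = r'\[[a-zA-Z 0-9]*\]'
# Hypothetical broader alternative: anything between one pair of square brackets.
broad_pattern = r'\[[^\]]*\]'

# Made-up sample with three section labels.
sample = ("[Verse 1]\nI could let you have it\n"
          "[Chorus]\nYou could be my habit\n"
          "[Pre-Chorus: x2]\nla la")

cleaned = re.sub(label_pattern, ' ', sample)
cleaned_broad = re.sub(broad_pattern, ' ', sample)

print(repr(cleaned))        # the label with a hyphen and a colon is left in place
print(repr(cleaned_broad))  # every bracketed label is replaced by a space
```

Labels that contain punctuation, such as a hyphen or colon, fall outside the letters, digits, and spaces that the original pattern allows, so it may be worth checking a few fetched lyrics for leftover brackets.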

In [21]:
songs.head()

Unnamed: 0.1,Unnamed: 0,id,songnames,songartists,lyrics,scoringDone,anger,fear,joy,sadness,analytical,confident,tentative
0,0,2,Habit,Still Woozy,I could let you have it\nYou could be my habit...,1.0,0.4845,0.249156,0.013715,0.414249,0.0,0.0,0.972431
1,1,2,Someday,Flypsyde,"Come on\nShalalala, shalalala\nSomeday we gonn...",1.0,0.633487,0.190181,0.032559,0.368816,0.0,0.0,0.168827
2,2,2,Venus,Sleeping at Last,The night sky once ruled my imagination.\nNow ...,1.0,0.109705,0.647529,0.506882,0.228974,0.555642,0.0,0.196453
3,3,2,Will Do,TV on the Radio,It might be impractical to seek out a new roma...,1.0,0.226913,0.04525,0.516772,0.21769,0.569231,0.0,0.927953
4,4,2,Every other Freckle,Alt-J,Aah\nI want to share your mouthful\nI want to ...,1.0,0.128174,0.268708,0.409157,0.121439,0.0,0.812988,0.0


## Analyze tone of lyrics
We will use **IBM Watson Tone Analyzer** tool for this purpose.

In [0]:
"""
The IBM Watson Tone Analyzer API is available as a package on pip.
In order to use it, you need to sign up for Watson, create a Tone Analyzer
service, and get the credentials.
"""

!pip install --upgrade "ibm-watson>=4.3.0"

In [0]:
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
import json
authenticator = IAMAuthenticator('') # insert Watson API key here
'''
The service with version '2017-09-21' can return results for the following tone IDs:
anger, fear, joy, and sadness (emotional tones); analytical,
confident, and tentative (language tones). The service returns
results only for tones whose scores meet a minimum threshold of 0.5.

The service with version '2016-05-19' can return results for the following tone IDs of the different categories: 
for the emotion category: anger, disgust, fear, joy, and sadness; 
for the language category: analytical, confident, and tentative; 
for the social category: openness_big5, conscientiousness_big5,
extraversion_big5, agreeableness_big5, and emotional_range_big5. 
The service returns scores for all tones of a category, regardless of their values.
'''
version='2017-09-21'
tone_analyzer = ToneAnalyzerV3(
    version=version,
    authenticator=authenticator
)

tone_analyzer.set_service_url('https://api.eu-gb.tone-analyzer.watson.cloud.ibm.com')

In [0]:
'''
Function to get tone scores of lyrics. The fields returned depend on the
version of the Tone Analyzer.
parameter: `lyrics`
return: JSON string of the result returned by the Watson API, or None if the request fails.
'''
def analyzeLyrics(lyrics):
  try:
    tone_analysis = tone_analyzer.tone(
        {'text': lyrics},
        content_type='application/json',
        sentences=False # sentencewise analysis not required
    ).get_result()
  except Exception as error:
    # Report the failure and stop here; `tone_analysis` is undefined in this case.
    print('unable to fetch scores!!!', error)
    return None
  return json.dumps(tone_analysis, indent=2)

In [14]:
# example input
# Let's check if our function is working fine or not.
i = 436
results = analyzeLyrics(songs['lyrics'][i])
print(songs['id'][i], songs["songnames"][i])
r_dict = json.loads(results)

# use this snippet with version '2017-09-21' of Tone Analyzer
for tones in r_dict['document_tone']['tones']:
  print(tones['tone_id'], ':', tones['score'])

# """ use this snippet with version '2016-05-19' of Tone Analyzer """
# for category in r_dict['document_tone']['tone_categories']:
#   for scores in category['tones']:
#     print(scores['tone_id'], ':', scores['score'])

74 Rossetta Stoned
fear : 0.611615
joy : 0.62383
sadness : 0.695139
anger : 0.714518
tentative : 0.661148
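
The two response layouts handled by the snippets above can be flattened with one helper. This is a sketch under assumptions: `flatten_tones` is a hypothetical name, and the two sample dictionaries are hand-written and trimmed to the keys this notebook reads (`document_tone`, `tones`, `tone_categories`, `tone_id`, `score`); they are not real service output.

```python
def flatten_tones(response):
    """Return {tone_id: score} for either response layout (hypothetical helper)."""
    document_tone = response["document_tone"]
    if "tones" in document_tone:
        # Layout read by the '2017-09-21' snippet: one flat list of tones.
        tones = document_tone["tones"]
    else:
        # Layout read by the '2016-05-19' snippet: tones nested in categories.
        tones = [tone
                 for category in document_tone["tone_categories"]
                 for tone in category["tones"]]
    return {tone["tone_id"]: tone["score"] for tone in tones}


# Hand-written samples, not real API responses.
flat_layout = {"document_tone": {"tones": [
    {"tone_id": "joy", "score": 0.62},
    {"tone_id": "tentative", "score": 0.66},
]}}
nested_layout = {"document_tone": {"tone_categories": [
    {"tones": [{"tone_id": "joy", "score": 0.62}]},
    {"tones": [{"tone_id": "tentative", "score": 0.66}]},
]}}

print(flatten_tones(flat_layout))
print(flatten_tones(nested_layout))
```

A flat dictionary per song is convenient because its keys line up with the score column names in the songs DataFrame.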


In [22]:
# add new columns to DF 'songs'
score_columns = ['scoringDone', 'anger', 'fear', 'joy', 'sadness',
                 'analytical', 'confident', 'tentative']
songs_with_scores = pd.concat([songs, pd.DataFrame(columns=score_columns)])
# songs_with_scores = songs
songs_with_scores.head()

Unnamed: 0.1,Unnamed: 0,id,songnames,songartists,lyrics,scoringDone,anger,fear,joy,sadness,analytical,confident,tentative
0,0.0,2.0,Habit,Still Woozy,I could let you have it\nYou could be my habit...,1.0,0.4845,0.249156,0.013715,0.414249,0.0,0.0,0.972431
1,1.0,2.0,Someday,Flypsyde,"Come on\nShalalala, shalalala\nSomeday we gonn...",1.0,0.633487,0.190181,0.032559,0.368816,0.0,0.0,0.168827
2,2.0,2.0,Venus,Sleeping at Last,The night sky once ruled my imagination.\nNow ...,1.0,0.109705,0.647529,0.506882,0.228974,0.555642,0.0,0.196453
3,3.0,2.0,Will Do,TV on the Radio,It might be impractical to seek out a new roma...,1.0,0.226913,0.04525,0.516772,0.21769,0.569231,0.0,0.927953
4,4.0,2.0,Every other Freckle,Alt-J,Aah\nI want to share your mouthful\nI want to ...,1.0,0.128174,0.268708,0.409157,0.121439,0.0,0.812988,0.0


In [0]:
# iterate through each row in DF songs and calculate the scores for lyrics.
# add the scores back to DF in corresponding song row

for i in songs_with_scores.index:
  try:
    lyrics = str(songs_with_scores['lyrics'][i])  # NaN becomes 'nan' and is skipped below
    if len(lyrics) > 15 and pd.isna(songs_with_scores['scoringDone'][i]):
      results_dict = json.loads(analyzeLyrics(lyrics))
      document_tone = results_dict['document_tone']
      if 'tones' in document_tone:  # layout of version '2017-09-21'
        tones = document_tone['tones']
      else:  # layout of version '2016-05-19'
        tones = [t for category in document_tone['tone_categories'] for t in category['tones']]
      for tone in tones:
        songs_with_scores.at[i, tone['tone_id']] = tone['score']
      songs_with_scores.at[i, 'scoringDone'] = 1
  except Exception as error:
    print(i, songs_with_scores['id'][i], songs_with_scores['songnames'][i], "unable to put scores in df!!!", error)

In [23]:
songs_with_scores.head()

Unnamed: 0.1,Unnamed: 0,id,songnames,songartists,lyrics,scoringDone,anger,fear,joy,sadness,analytical,confident,tentative
0,0.0,2.0,Habit,Still Woozy,I could let you have it\nYou could be my habit...,1.0,0.4845,0.249156,0.013715,0.414249,0.0,0.0,0.972431
1,1.0,2.0,Someday,Flypsyde,"Come on\nShalalala, shalalala\nSomeday we gonn...",1.0,0.633487,0.190181,0.032559,0.368816,0.0,0.0,0.168827
2,2.0,2.0,Venus,Sleeping at Last,The night sky once ruled my imagination.\nNow ...,1.0,0.109705,0.647529,0.506882,0.228974,0.555642,0.0,0.196453
3,3.0,2.0,Will Do,TV on the Radio,It might be impractical to seek out a new roma...,1.0,0.226913,0.04525,0.516772,0.21769,0.569231,0.0,0.927953
4,4.0,2.0,Every other Freckle,Alt-J,Aah\nI want to share your mouthful\nI want to ...,1.0,0.128174,0.268708,0.409157,0.121439,0.0,0.812988,0.0


## Prepare the data for correlation analysis

In [0]:
# Drop columns which are not required in Data analysis
id_lyrics_scores = songs_with_scores.drop(columns=["Unnamed: 0","scoringDone","songnames", \
                                                    "songartists", "lyrics"],\
                                           errors='ignore')

In [25]:
# # group lyrics scores into positive emotions and negative emotions by adding columns

# using apply function to create a new column 
# id_lyrics_scores['negativity'] = id_lyrics_scores.apply(lambda row: (row.anger+row.disgust+row.fear+row.sadness)/4, axis = 1)
id_lyrics_scores.head()

Unnamed: 0,id,anger,fear,joy,sadness,analytical,confident,tentative
0,2.0,0.4845,0.249156,0.013715,0.414249,0.0,0.0,0.972431
1,2.0,0.633487,0.190181,0.032559,0.368816,0.0,0.0,0.168827
2,2.0,0.109705,0.647529,0.506882,0.228974,0.555642,0.0,0.196453
3,2.0,0.226913,0.04525,0.516772,0.21769,0.569231,0.0,0.927953
4,2.0,0.128174,0.268708,0.409157,0.121439,0.0,0.812988,0.0


In [0]:
# Watson documentation states that any score below 0.5 is imperceptible to humans.
# Such scores are set to 0.
for i in id_lyrics_scores.columns:
  if(i not in ["id"]):
    id_lyrics_scores.loc[id_lyrics_scores[i] < 0.5, i] = 0

In [27]:
id_lyrics_scores.head()

Unnamed: 0,id,anger,fear,joy,sadness,analytical,confident,tentative
0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.972431
1,2.0,0.633487,0.0,0.0,0.0,0.0,0.0,0.0
2,2.0,0.0,0.647529,0.506882,0.0,0.555642,0.0,0.0
3,2.0,0.0,0.0,0.516772,0.0,0.569231,0.0,0.927953
4,2.0,0.0,0.0,0.0,0.0,0.0,0.812988,0.0


In [29]:
"""
Here we find mean lyrics scores for each participant.
"""

# Find the mean of the lyrics scores for each participant
lyrics_scores_mean = id_lyrics_scores.groupby(['id']).mean()
lyrics_scores_mean.head()

Unnamed: 0_level_0,anger,fear,joy,sadness,analytical,confident,tentative
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2.0,0.079186,0.15219,0.211465,0.221433,0.216763,0.101624,0.521714
3.0,0.0,0.072347,0.226194,0.286345,0.188826,0.059999,0.28687
4.0,0.079258,0.0,0.422219,0.0,0.0,0.265145,0.210948
6.0,0.0,0.057586,0.058399,0.269365,0.145375,0.0,0.276035
7.0,0.054502,0.0,0.197023,0.061493,0.226869,0.30612,0.0


In [32]:
# let's get big-5 scores and demographic data here. Later we use these scores and lyrics scores to analyze associations between them
big5_scores = pd.read_csv("/content/drive/My Drive/acads/MMT/MMT project/collected data/big5_continuous_scores.csv")
demographic_data = pd.read_csv("/content/drive/My Drive/acads/MMT/MMT project/collected data/demographics.csv")

# and merge `lyrics_scores_mean` and above DFs by 'id'
df1 = pd.merge(left=demographic_data, right=big5_scores, left_on='id', right_on='id')
big5_continuous_with_lyrics_mean = pd.merge(left=df1, right=lyrics_scores_mean, left_on='id', right_on='id')

big5_continuous_with_lyrics_mean.head()

Unnamed: 0,id,age group,sex,instumental or lyrical music,listen to english lyrics,extraversion,agreeableness,conscientiousness,neuroticism,openness,anger,fear,joy,sadness,analytical,confident,tentative
0,2,24-26 years,Female,Music with lyrics,Always,21,34,26,26,47,0.079186,0.15219,0.211465,0.221433,0.216763,0.101624,0.521714
1,3,24-26 years,Male,Music with lyrics,Always,24,41,35,23,43,0.0,0.072347,0.226194,0.286345,0.188826,0.059999,0.28687
2,4,24-26 years,Female,Music with lyrics,Sometimes,27,34,24,28,33,0.079258,0.0,0.422219,0.0,0.0,0.265145,0.210948
3,6,24-26 years,Male,Music with lyrics,Often,30,41,38,18,36,0.0,0.057586,0.058399,0.269365,0.145375,0.0,0.276035
4,7,27-30 years,Male,Music with lyrics,Often,27,33,33,16,42,0.054502,0.0,0.197023,0.061493,0.226869,0.30612,0.0
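
`pd.merge` defaults to an inner join, so a participant missing from either table silently drops out of the merged DataFrame. The sketch below shows one way to list such ids with the `indicator` option; the two toy tables and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the Big-5 table and the per-participant lyrics means.
big5_toy = pd.DataFrame({"id": [2, 3, 5], "openness": [47, 43, 38]})
lyrics_toy = pd.DataFrame({"id": [2, 3, 4], "joy": [0.21, 0.23, 0.42]})

# Default inner join: only ids present in both tables survive.
inner = pd.merge(big5_toy, lyrics_toy, on="id")

# Outer join with an indicator column records where each row came from.
check = pd.merge(big5_toy, lyrics_toy, on="id", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["id", "_merge"]]

print(inner)
print(unmatched)
```

Running a check like this before the correlation analysis makes it clear how many participants the final sample actually contains.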


In [0]:
"""
Let's save the current DF to save time in future.
"""
big5_continuous_with_lyrics_mean.to_csv("big5_continuous_with_lyrics_mean.csv")

In [0]:
"""
now accessing the dataset which we saved in previous block :P :D
"""
big5_continuous_with_lyrics_mean = pd.read_csv("/content/drive/My Drive/acads/MMT/MMT project/collected data/big5_continuous_with_lyrics_mean.csv")

In [33]:
"""
Now let's prepare the dataset for Spearman's rho analysis.
We need to encode the categorical columns as numbers in order to use the algorithm.
"""
object_data = big5_continuous_with_lyrics_mean.select_dtypes(include="object")
numeric_data = big5_continuous_with_lyrics_mean.select_dtypes(exclude="object")

# Import label encoder
from sklearn import preprocessing

# LabelEncoder maps each distinct label to an integer code, in sorted label order.
label_encoder = preprocessing.LabelEncoder()
object_data = object_data.apply(label_encoder.fit_transform)
# Join the two halves on the row index; pandas names the join key column `key_0`.
data  = pd.merge(left=object_data, right=numeric_data, left_on=numeric_data.index, right_on=object_data.index)
# Display only: the result is not assigned, so `data` itself keeps its `id` column.
data.drop(columns=["id"])

Unnamed: 0,key_0,age group,sex,instumental or lyrical music,listen to english lyrics,extraversion,agreeableness,conscientiousness,neuroticism,openness,anger,fear,joy,sadness,analytical,confident,tentative
0,0,1,0,1,0,21,34,26,26,47,0.079186,0.152190,0.211465,0.221433,0.216763,0.101624,0.521714
1,1,1,1,1,0,24,41,35,23,43,0.000000,0.072347,0.226194,0.286345,0.188826,0.059999,0.286870
2,2,1,0,1,2,27,34,24,28,33,0.079258,0.000000,0.422219,0.000000,0.000000,0.265145,0.210948
3,3,1,1,1,1,30,41,38,18,36,0.000000,0.057586,0.058399,0.269365,0.145375,0.000000,0.276035
4,4,2,1,1,1,27,33,33,16,42,0.054502,0.000000,0.197023,0.061493,0.226869,0.306120,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,57,0,1,1,1,25,29,28,26,34,0.000000,0.000000,0.201566,0.348010,0.000000,0.150057,0.256812
58,58,0,1,1,0,19,28,24,29,25,0.000000,0.000000,0.319708,0.451885,0.000000,0.000000,0.147234
59,59,0,1,1,1,34,42,38,14,40,0.142029,0.112704,0.279541,0.315449,0.301320,0.176632,0.341260
60,60,0,1,1,1,23,33,29,20,32,0.000000,0.000000,0.000000,0.616278,0.000000,0.592273,0.000000
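
Because `LabelEncoder` numbers the labels in sorted order, the integer codes follow the alphabet rather than the meaning of the answers, which matters for a rank correlation. The sketch below reproduces the sorted coding with a default `pd.Categorical` and compares it with an explicitly ordered coding. The answer list is a toy example: `"Never"` and the ordering in `frequency_order` are assumptions added for illustration.

```python
import pandas as pd

# Toy answers; "Never" is a hypothetical option added for the example.
answers = pd.Series(["Always", "Sometimes", "Often", "Never"])

# Default categories are sorted, which mirrors the sorted label order of LabelEncoder:
# Always=0, Never=1, Often=2, Sometimes=3.
alphabetical_codes = pd.Categorical(answers).codes

# Assumed ordering from least to most frequent.
frequency_order = ["Never", "Sometimes", "Often", "Always"]
ordered_codes = pd.Categorical(answers, categories=frequency_order, ordered=True).codes

print(list(alphabetical_codes))
print(list(ordered_codes))
```

In the output above, the three visible answers are coded Always=0, Often=1, Sometimes=2, so a higher code there means listening less often; that should be kept in mind when reading the sign of a correlation for that column.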


## Correlation Analysis

In [36]:
"""
Now we have our desired df with us, let's do some analysis over it.
The data mixes label-encoded categorical columns with continuous scores that are
not normally distributed. `current_method` below lists the candidate tests; the
loop uses `current_method[1]`, Spearman's rank correlation, which is non-parametric.
Point-biserial correlation (`current_method[0]`) is the special case of Pearson's r
for a dichotomous variable paired with a continuous one.
"""

"""
H0: Big-5 personality type of a person is not associated with the emotion depicted
    in lyrics of their favourite songs.
H1: Big-5 personality type of a person is associated with the emotion depicted
    in lyrics of their favourite songs.
"""

from scipy.stats import pointbiserialr, spearmanr, chisquare, mannwhitneyu, kendalltau, pearsonr

current_method = [pointbiserialr, spearmanr, chisquare, mannwhitneyu, pearsonr]

big5_traits = ['agreeableness', 'extraversion', 'conscientiousness',
               'neuroticism', 'openness']
demographics = ['age group', 'sex', 'instumental or lyrical music', 'listen to english lyrics']
emotions = ['anger', 'fear', 'joy', 'sadness', 'analytical',
            'confident', 'tentative']

# Find correlations and p-values between two groups of columns. For example:
# between demographics and big5_traits
for i in demographics:
  for j in big5_traits:
    if i != j:  # compare the column names by value, not by identity
      i_score = data[i]
      j_score = data[j]
      result = current_method[1](i_score, j_score)  # index 1 is spearmanr
      # report pairs with p-value <= 0.06 and |correlation| >= 0.3
      if result[1] <= 0.06 and abs(result[0]) >= 0.3:
        print(i, j, result)



age group agreeableness SpearmanrResult(correlation=0.39231765313308486, pvalue=0.0016120427154733164)
age group openness SpearmanrResult(correlation=0.3738420572594634, pvalue=0.0027618559864112492)
sex agreeableness SpearmanrResult(correlation=-0.3807540496254783, pvalue=0.002266142455208097)
listen to english lyrics openness SpearmanrResult(correlation=-0.4612220943040446, pvalue=0.00016142593158098995)
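
As a small check on how the listed methods relate, the sketch below runs three of them on the same toy data. The numbers are made up for illustration; `binary_group` stands for a two-level variable such as `sex` and `trait_score` for a continuous trait score.

```python
from scipy.stats import pearsonr, pointbiserialr, spearmanr

# Made-up data: a dichotomous variable and a continuous score.
binary_group = [0, 0, 0, 1, 1, 1]
trait_score = [21, 24, 27, 30, 27, 34]

# Each function returns a (statistic, p-value) pair.
r_pb, p_pb = pointbiserialr(binary_group, trait_score)
r_pearson, p_pearson = pearsonr(binary_group, trait_score)
rho, p_rho = spearmanr(binary_group, trait_score)

# Point-biserial and Pearson agree; Spearman works on ranks and can differ.
print("point-biserial:", r_pb)
print("pearson:       ", r_pearson)
print("spearman:      ", rho)
```

The point-biserial and Pearson coefficients coincide because the former is Pearson's r applied to a 0/1 variable, while Spearman's rho replaces the values by their ranks and so does not assume normally distributed scores.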


In [0]:
# Compute the Spearman correlation matrix
matrix = data.corr(method="spearman")
matrix = matrix.drop(index=['id', 'key_0'],columns=['id', 'key_0'])

In [42]:
matrix

Unnamed: 0,age group,sex,instumental or lyrical music,listen to english lyrics,extraversion,agreeableness,conscientiousness,neuroticism,openness,anger,fear,joy,sadness,analytical,confident,tentative
age group,1.0,-0.369054,-0.008177,-0.008551,0.125131,0.392318,0.130578,-0.088144,0.373842,-0.084278,0.261549,0.280185,-0.226027,0.184411,-0.062666,-0.035726
sex,-0.369054,1.0,-0.175454,-0.009522,-0.247968,-0.380754,0.145378,-0.267583,-0.198165,0.06706,-0.091735,-0.252283,-0.08142,0.082782,0.041211,-0.103257
instumental or lyrical music,-0.008177,-0.175454,1.0,0.127498,-0.08649,0.08332,-0.272411,0.150106,-0.259456,-0.193379,-0.026719,0.107828,-0.081924,0.034565,-0.051287,0.12072
listen to english lyrics,-0.008551,-0.009522,0.127498,1.0,-0.156225,-0.052094,-0.103277,0.132858,-0.461222,-0.237323,-0.118935,-0.059845,-0.139058,0.029176,0.081838,0.288723
extraversion,0.125131,-0.247968,-0.08649,-0.156225,1.0,0.271803,0.149192,-0.256995,0.309172,-0.016635,-0.060768,0.168314,0.070242,-0.027151,0.077961,0.08389
agreeableness,0.392318,-0.380754,0.08332,-0.052094,0.271803,1.0,0.218802,-0.036046,0.260074,-0.29076,0.136943,0.336636,-0.140496,0.109051,-0.130029,0.050162
conscientiousness,0.130578,0.145378,-0.272411,-0.103277,0.149192,0.218802,1.0,-0.260458,0.254493,0.05627,-0.020349,0.1546,0.003839,0.150412,-0.024471,-0.011063
neuroticism,-0.088144,-0.267583,0.150106,0.132858,-0.256995,-0.036046,-0.260458,1.0,-0.257184,-0.0564,-0.245074,0.009895,0.018358,-0.121589,-0.128516,0.156187
openness,0.373842,-0.198165,-0.259456,-0.461222,0.309172,0.260074,0.254493,-0.257184,1.0,0.133554,0.261617,0.123554,0.082832,0.232181,-0.075285,-0.030993
anger,-0.084278,0.06706,-0.193379,-0.237323,-0.016635,-0.29076,0.05627,-0.0564,0.133554,1.0,0.124853,0.077586,-0.101484,-0.098103,-0.113255,0.081835
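
Since the stated aim is the association between Big-5 traits and lyric tones, it can help to cut that block out of the full matrix. The sketch below does so on a toy table whose values are rounded from the first five participant rows shown earlier; with only five rows the coefficients themselves mean nothing, and only two traits and two tones are included to keep it short.

```python
import pandas as pd

# Values rounded from the first five rows of the merged table shown earlier.
toy = pd.DataFrame({
    "openness":    [47, 43, 33, 36, 42],
    "neuroticism": [26, 23, 28, 18, 16],
    "joy":         [0.21, 0.23, 0.42, 0.06, 0.20],
    "sadness":     [0.22, 0.29, 0.00, 0.27, 0.06],
})
traits = ["openness", "neuroticism"]
tones = ["joy", "sadness"]

# Rows are traits, columns are tones; everything else is left out.
cross_block = toy.corr(method="spearman").loc[traits, tones]
print(cross_block)
```

With the notebook's own variables, the same selection would be `matrix.loc[big5_traits, emotions]`.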
