## Data Analysis

Aaron Wollman, Albin Joseph, Kelsey Richardson Blackwell, Will Huang

In this notebook, we analyze whether the measure of “musical positiveness” (valence) in the Top 100 hits correlates strongly with U.S. unemployment data. Is the correlation strong enough to support a prediction for the next month? Are there other attributes besides happiness that have a stronger correlation, such as danceability, energy, tempo, or speechiness?

In [None]:
%matplotlib inline

In [None]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import numpy as np
from scipy.stats import linregress
import scipy.stats as st

In [None]:
# Import the csv
music_unemployment = pd.read_csv('../data/music_and_unemployment.csv')
music_unemployment.drop('Unnamed: 0',axis=1,inplace=True)
music_unemployment.head()

## Unemployment Rate

Before we jumped into running regressions and statistical tests, we wanted to understand the changes in the unemployment rate during the timeframe. To see those changes visually, we created a heat map.

In [None]:
#Unemployment rate monthly data from 1960 to 2019 
unemployment_time=music_unemployment[['Year','Month','Unemployment Rate']].drop_duplicates().reset_index(drop=True)
unemployment_time.head()

In [None]:
#unemployment rate data by Year and Month
#keyword arguments are required by DataFrame.pivot in pandas 2.0 and later
unemployment_time_pivot=unemployment_time.pivot(index='Year',columns='Month',values='Unemployment Rate')
unemployment_time_pivot.head()

In [None]:
#the maximum of unemployment rate
vmax=unemployment_time_pivot.max().max()

#the minimum of unemployment rate
vmin=unemployment_time_pivot.min().min()

In [None]:
#unemployment rate heat map by decades
fig,axes=plt.subplots(6,1,figsize=(15,20))
sns.heatmap(unemployment_time_pivot[:10],cmap=("Blues"),ax=axes[0],vmax=vmax,vmin=vmin)
sns.heatmap(unemployment_time_pivot[10:20],cmap=("Blues"),ax=axes[1],vmax=vmax,vmin=vmin)
sns.heatmap(unemployment_time_pivot[20:30],cmap=("Blues"),ax=axes[2],vmax=vmax,vmin=vmin)
sns.heatmap(unemployment_time_pivot[30:40],cmap=("Blues"),ax=axes[3],vmax=vmax,vmin=vmin)
sns.heatmap(unemployment_time_pivot[40:50],cmap=("Blues"),ax=axes[4],vmax=vmax,vmin=vmin)
sns.heatmap(unemployment_time_pivot[50:],cmap=("Blues"),ax=axes[5],vmax=vmax,vmin=vmin)

## Valence "Happiness"

We also wanted to build a similar heat map for the valence score, to help us visually understand the changes over such a large timeframe.
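
No cell for the valence heat map appears here. A minimal sketch of how the underlying table could be built is shown below, assuming the `valence`, `Year`, and `Month` columns used elsewhere in this notebook. The helper name `monthly_valence_pivot` and the toy rows are illustrative only and are not part of the original analysis.

```python
import pandas as pd


def monthly_valence_pivot(df):
    """Average valence per (Year, Month), with years as rows and months as columns."""
    # pivot_table aggregates the many songs in each month; plain pivot would
    # fail on duplicate (Year, Month) pairs.
    return df.pivot_table(index="Year", columns="Month", values="valence", aggfunc="mean")


# Toy rows standing in for music_unemployment (hypothetical values)
toy = pd.DataFrame({
    "Year": [1960, 1960, 1960, 1961],
    "Month": [1, 1, 2, 1],
    "valence": [0.8, 0.6, 0.5, 0.4],
})

valence_pivot = monthly_valence_pivot(toy)
print(valence_pivot)
```

The resulting table has the same Year-by-Month shape as `unemployment_time_pivot`, so it could be passed to `sns.heatmap` in the same way.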

Now it is time to do the work!

## Unemployment Rate vs. Valence "Happiness"

We ran a regression for valence versus unemployment rate. 

In [None]:
# Define regression function
def linearplt(dataframe, x_values, y_values, ylabel, coordinates):
    plt.scatter(x_values, y_values)

    (slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
    regress_values = x_values * slope + intercept
    line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

    plt.plot(x_values,regress_values,"r-")
    plt.annotate(line_eq,coordinates,fontsize=15,color="red")

    plt.xlabel("Unemployment Rate")
    plt.ylabel(ylabel)

Weighted Valence


In [None]:
# Create a new data point "Weighted Valence"
music_unemployment["weighed valence"] = music_unemployment["valence"] * (101 - music_unemployment["Placement"])

# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed valence for each date
# (select the columns first so non-numeric columns are never averaged)
rate_v_valence = music_unemployment_gb[["Unemployment Rate", "weighed valence"]].mean()

# Create a Scatter Graph
x_values = rate_v_valence["Unemployment Rate"]
y_values = rate_v_valence["weighed valence"]
# The y axis holds weighted valence; linearplt already labels x as "Unemployment Rate"
linearplt(rate_v_valence, x_values, y_values, "Weighted Valence", (5,22))
plt.title("Unemployment Rate vs. Valence (Happiness)")
plt.show()
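
To make the weighting concrete, here is a small worked example on made-up rows. It assumes only the `101 - Placement` weight used above: a song at placement 1 gets weight 100 and a song at placement 100 gets weight 1. The `normalized_average` line is an illustrative alternative, not something the notebook computes. It divides by the total weight so the result stays on valence's original 0 to 1 scale.

```python
import pandas as pd

# Hypothetical chart rows: same valence at the top and bottom of the chart
toy = pd.DataFrame({
    "Placement": [1, 50, 100],
    "valence": [0.9, 0.5, 0.9],
})

weights = 101 - toy["Placement"]                 # 100, 51, 1
toy["weighed valence"] = toy["valence"] * weights

# What the notebook averages per chart date: the mean of the products
mean_of_products = toy["weighed valence"].mean()

# Illustrative alternative: a weighted average normalized by the total weight
normalized_average = toy["weighed valence"].sum() / weights.sum()

print(mean_of_products, normalized_average)
```

The mean of the products scales with the weights, which is why the regression plots show values well above 1 for a feature that ranges from 0 to 1.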

We decided to look at valence since 2010 to see whether the past decade has been different, because music has changed significantly since 2010.

In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])
music_unemployment_years

# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed valence for each date
# (select the columns first so non-numeric columns are never averaged)
two_rate_v_valence = music_unemployment_years_gb[["Unemployment Rate", "weighed valence"]].mean()

# Create a Scatter Graph
x_values = two_rate_v_valence["Unemployment Rate"]
y_values = two_rate_v_valence["weighed valence"]
# Pass the 2010+ valence frame; rate_v_tempo is not defined until a later cell
linearplt(two_rate_v_valence, x_values, y_values, "Weighted Valence", (7,22))
plt.title("Unemployment Rate vs. Valence in Songs 2010 and After")
plt.show()
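
The regression plots show the fitted line but not how strong the relationship is. One way to put a number on it is the correlation coefficient that `linregress` already returns. The sketch below uses synthetic data, not the notebook's dataset, so the printed values say nothing about the real result.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-ins: a weak linear relation buried in noise
rng = np.random.default_rng(0)
x = rng.uniform(3, 11, size=200)             # pretend unemployment rates
y = 0.1 * x + rng.normal(0, 2, size=200)     # pretend weighted feature

result = linregress(x, y)
r_squared = result.rvalue ** 2               # share of variance explained by the line

print(f"r = {result.rvalue:.3f}, r^2 = {r_squared:.3f}, p = {result.pvalue:.3f}")
```

An r close to 0 means the line explains little of the variation, even if the slope is not exactly zero.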

Conclusion: We did not find a strong correlation between the unemployment rate and happiness in a Top 100 hit song, as the regression graphs above show.

So we decided to run a statistical test next. 

In [None]:
# Will's Code

In [None]:
unemployment_rate_list=[]
for i in range(len(unemployment_time_pivot)):
    # Year is the pivot's index, so every column is a month; keep them all
    for j in unemployment_time_pivot.iloc[i,:]:
        unemployment_rate_list.append(j)

In [None]:
#check if there are outliers in the unemployment rate
plt.boxplot(unemployment_rate_list)

In [None]:
#categorize each song by the unemployment rate at the time
# if the unemployment rate is 7.0 or higher, assign it to the High_Unemployment group
# 7.0 is the 3rd quartile of all the unemployment rate data

high_unemployment_rate=np.quantile(unemployment_rate_list, .75) ###7.0
music_unemployment["weighed valence"] = music_unemployment["valence"] * (101 - music_unemployment["Placement"])
music_unemployment['High_Unemployment'] = music_unemployment['Unemployment Rate'].apply(lambda x: 1 if x>=high_unemployment_rate else 0)

music_unemployment.head()
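
As a small illustration of the threshold step, the sketch below computes a third quartile and flags values at or above it. The rates are made up, and the vectorized comparison is simply an alternative way to write the `apply(lambda ...)` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly unemployment rates
rates = pd.Series([3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 5.5])

# 75% of the values fall at or below the third quartile
threshold = np.quantile(rates, 0.75)

# 1 when the rate is at or above the threshold, 0 otherwise
high_flag = (rates >= threshold).astype(int)

print(threshold, high_flag.tolist())
```

By construction, roughly a quarter of the months land in the high group, so the two groups compared later are unequal in size.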

In [None]:
#the calculation of weighted features could be done together
#this section should be moved to the top of the Jupyter notebook

music_unemployment["weighed valence"] = music_unemployment["valence"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed danceability']=music_unemployment["danceability"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed energy']=music_unemployment["energy"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed key']=music_unemployment["key"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed loudness']=music_unemployment["loudness"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed speechiness']=music_unemployment["speechiness"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed acousticness']=music_unemployment["acousticness"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed liveness']=music_unemployment["liveness"] * (101 - music_unemployment["Placement"])
music_unemployment['weighed tempo']=music_unemployment["tempo"] * (101 - music_unemployment["Placement"])
music_unemployment['High_Unemployment']=music_unemployment['Unemployment Rate'].apply(lambda x: 1 if x>=high_unemployment_rate else 0)
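
The weighting above repeats the same expression for each feature. A loop can do the same work in one place. The sketch below assumes the same `101 - Placement` weight; the helper name `add_weighted_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def add_weighted_columns(df, features, placement_col="Placement"):
    """Return a copy of df with a 'weighed <feature>' column for each feature."""
    out = df.copy()
    weight = 101 - out[placement_col]        # placement 1 -> 100, placement 100 -> 1
    for feature in features:
        out[f"weighed {feature}"] = out[feature] * weight
    return out


# Toy rows standing in for music_unemployment (hypothetical values)
toy = pd.DataFrame({
    "Placement": [1, 100],
    "energy": [0.5, 0.5],
    "tempo": [120.0, 90.0],
})

weighted = add_weighted_columns(toy, ["energy", "tempo"])
print(weighted)
```

Computing the weight once also removes the risk of mistyping the placement column in one of the repeated lines.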

In [None]:
#we could drop the original score and replace it with the weighted score
music_unemployment.head()

In [None]:
#average the weighted feature scores for each chart date
music_unemployment_group=music_unemployment.groupby(['Year','Month','Day'])[
    ['High_Unemployment','Unemployment Rate',
       'weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']].mean()

In [None]:
feature_list=['weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']

In [None]:
#scatter plots for weighted feature scores by unemployment rate
fig,axes=plt.subplots(3,3,figsize=(15,15))

# axes.flat walks the 3x3 grid row by row, one panel per feature
for ax,feature in zip(axes.flat,feature_list):
    sns.scatterplot(x='Unemployment Rate',y=feature,hue='High_Unemployment',data=music_unemployment_group,ax=ax)

In [None]:
#boxplots for weighted feature scores by unemployment rate
fig,axes=plt.subplots(3,3,figsize=(15,15))

# axes.flat walks the 3x3 grid row by row, one panel per feature
for ax,feature in zip(axes.flat,feature_list):
    sns.boxplot(x='High_Unemployment',y=feature,data=music_unemployment_group[[feature,'High_Unemployment']],ax=ax)

In [None]:
#ANOVA test for weighted features
statistic_list=[]
pvalue_list=[]
for i in feature_list:
    group1=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==1]
    group2=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==0]
    # run the test once and reuse the result
    result=st.f_oneway(group1,group2)
    statistic_list.append(result[0])
    pvalue_list.append(result[1])
    print(f' ANOVA Result for {i} vs. High_Unemployment\n {result}\n==================')

In [None]:
#anova test results df
significant_list=[1 if i <=0.05 else 0 for i in pvalue_list]
anova=pd.DataFrame({'Feature':feature_list,'Statistic':statistic_list,'Pvalue':pvalue_list,'Significant':significant_list})
anova.sort_values('Pvalue')
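
Because each test above compares only two groups, the one-way ANOVA is equivalent to an equal-variance two-sample t-test: the F statistic equals the squared t statistic, and the p-values match. The sketch below checks this on synthetic data. The group sizes and means are made up and do not reflect the notebook's results.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for a weighted feature in the two unemployment groups
rng = np.random.default_rng(1)
high = rng.normal(32.0, 3.0, size=40)     # pretend high-unemployment dates
low = rng.normal(30.0, 3.0, size=120)     # pretend low-unemployment dates

f_stat, f_p = st.f_oneway(high, low)
t_stat, t_p = st.ttest_ind(high, low)     # equal_var=True by default

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"ANOVA p = {f_p:.6f}, t-test p = {t_p:.6f}")
```

Note that running nine such tests at the 0.05 level makes at least one chance "significant" result more likely than a single test would.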

In [None]:
#boxplots for the weighted features with a significant ANOVA result
fig,axes=plt.subplots(2,3,figsize=(15,10))
significant_features=anova['Feature'][anova['Significant']==1]

# one panel per significant feature, filling the 2x3 grid row by row
for ax,feature in zip(axes.flat,significant_features):
    sns.boxplot(x='High_Unemployment',y=feature,data=music_unemployment_group[[feature,'High_Unemployment']],ax=ax)

In [None]:
# ANOVA Test on a Yearly Basis

In [None]:
music_unemployment_group_y=music_unemployment.groupby(['Year'])[
    ['Unemployment Rate',
       'weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']].mean()
music_unemployment_group_y['High_Unemployment']=music_unemployment_group_y['Unemployment Rate'].apply(lambda x: 1 if x>=high_unemployment_rate else 0)

In [None]:
statistic_list=[]
pvalue_list=[]
for i in feature_list:
    group1=music_unemployment_group_y[i][music_unemployment_group_y['High_Unemployment']==1]
    group2=music_unemployment_group_y[i][music_unemployment_group_y['High_Unemployment']==0]
    # run the test once and reuse the result
    result=st.f_oneway(group1,group2)
    statistic_list.append(result[0])
    pvalue_list.append(result[1])
    print(f' ANOVA Result for {i} vs. High_Unemployment\n {result}\n==================')

In [None]:
significant_list=[1 if i <=0.05 else 0 for i in pvalue_list]
anova=pd.DataFrame({'Feature':feature_list,'Statistic':statistic_list,'Pvalue':pvalue_list,'Significant':significant_list})
anova.sort_values('Pvalue')

In [None]:
#scatter plots for weighted feature scores by yearly unemployment rate
fig,axes=plt.subplots(3,3,figsize=(15,15))

# axes.flat walks the 3x3 grid row by row, one panel per feature
for ax,feature in zip(axes.flat,feature_list):
    sns.scatterplot(x='Unemployment Rate',y=feature,hue='High_Unemployment',data=music_unemployment_group_y,ax=ax)

In [None]:
n=0
fig,axes=plt.subplots(2,1,figsize=(10,10))
for i in anova['Feature'][anova['Significant']==1]:
    sns.boxplot(x='High_Unemployment',y=i,data=music_unemployment_group_y[[i,'High_Unemployment']],ax=axes[n])
    n+=1
        

## Unemployment Rate vs. Tempo

From the ANOVA test, we knew that energy and tempo may correlate with the unemployment rate.

We ran a regression for the unemployment rate versus tempo and discovered there is a slight negative relationship between tempo in a song and the unemployment rate.

In [None]:
music_unemployment["weighed energy"] = music_unemployment["energy"] * (101 - music_unemployment["Placement"])

In [None]:
#Create a weighed tempo, weighting by chart placement like the other features
music_unemployment["weighed tempo"] = music_unemployment["tempo"] * (101 - music_unemployment["Placement"])

# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed tempo for each date
rate_v_tempo = music_unemployment_gb[["Unemployment Rate", "weighed tempo"]].mean()

# Create a Scatter Graph
x_values = rate_v_tempo["Unemployment Rate"]
y_values = rate_v_tempo["weighed tempo"]
# Anchor the equation label at the lowest y value so it stays inside the plot
linearplt(rate_v_tempo, x_values, y_values, "Tempo", (7, y_values.min()))
plt.title("Unemployment Rate vs. Tempo")
plt.show()


Knowing that we might want to use this to predict what the next big hit might be, we decided to look at songs since 2010, because music has changed a lot since 1960.

In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])
music_unemployment_years

# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed tempo for each date
two_rate_v_tempo = music_unemployment_years_gb[["Unemployment Rate", "weighed tempo"]].mean()

# Create a Scatter Graph from the 2010+ frame (not the full-history rate_v_tempo)
x_values = two_rate_v_tempo["Unemployment Rate"]
y_values = two_rate_v_tempo["weighed tempo"]
# Anchor the equation label at the lowest y value so it stays inside the plot
linearplt(two_rate_v_tempo, x_values, y_values, "Tempo", (7, y_values.min()))
plt.title("Unemployment Rate vs. Tempo in Songs 2010 and After")
plt.show()


## Unemployment Rate vs. Energy

We ran a regression for the unemployment rate versus energy and discovered there is a positive relationship between the energy in a song and the unemployment rate.

In [None]:
#Create a weighed energy
music_unemployment["weighed energy"] = music_unemployment["energy"] * (101 - music_unemployment["Placement"])

# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed energy for each date
rate_v_energy = music_unemployment_gb[["Unemployment Rate", "weighed energy"]].mean()

# Create a Scatter Graph
x_values = rate_v_energy["Unemployment Rate"]
y_values = rate_v_energy["weighed energy"]
linearplt(rate_v_energy, x_values, y_values, "Energy", (5,22))
plt.title("Unemployment Rate vs. Energy")
plt.show()


In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])
music_unemployment_years
                                  
# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed energy for each date
two_rate_v_energy = music_unemployment_years_gb[["Unemployment Rate", "weighed energy"]].mean()

# Create a Scatter Graph
x_values = two_rate_v_energy["Unemployment Rate"]
y_values = two_rate_v_energy["weighed energy"]
linearplt(two_rate_v_energy, x_values, y_values, "Energy", (7,30))
plt.title("Unemployment Rate vs. Energy in Songs since 2010")
plt.show()


## Conclusion

Happiness in a song did not have a strong correlation with the U.S. unemployment rate. However, we did discover that energy does have a correlation. When there is a high unemployment rate in the U.S., the top Billboard songs are more likely to have higher energy than when there is a low unemployment rate.

This is not great news for Taylor Swift's new album "folklore" that came out last week.