## Data Analysis

Aaron Wollman, Albin Joseph, Kelsey Richardson Blackwell, Will Huang

In this notebook, we analyzed whether the measure of "musical positiveness" (valence) in the Top 100 hits has a strong correlation with U.S. unemployment data. Is the correlation strong enough to predict next month? Do other attributes besides happiness, such as danceability, energy, tempo, or speechiness, have a stronger correlation?

In [None]:
%matplotlib inline

In [None]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import numpy as np
from scipy.stats import linregress
import scipy.stats as st

In [None]:
# Constants


In [None]:
data = pd.read_csv('../data/music_and_unemployment.csv')
data.drop('Unnamed: 0',axis=1,inplace=True)
data.head()

In [None]:
# Aaron's Code

In [None]:
# End of Aaron's Code

In [None]:
# Albin's Code

In [None]:
# End of Albin's Code

In [None]:
# Kelsey's code

## Unemployment Rate vs. Happiness

We ran a regression of weighted happiness ("valence") against the unemployment rate and found almost no relationship between the two. In the plot below, the data points are visibly scattered. The r value is 0.1, so the unemployment rate accounts for only about 1% of the variation (r² ≈ 0.01) in the happiness of Top 100 hit songs.

So we decided to dig a little deeper and look at the other variables in music.
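To make the r value concrete: squaring r gives the share of variance explained, which is why r = 0.1 corresponds to roughly 1%. The sketch below computes Pearson's r by hand in plain Python. The helper `pearson_r` and the numbers are hypothetical and only for illustration; they are not taken from the notebook's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each variable around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (not from the dataset)
rates = [4.0, 5.0, 6.0, 7.0, 8.0]
scores = [30.0, 26.0, 31.0, 27.0, 32.0]

r = pearson_r(rates, scores)
r_squared = r ** 2  # share of variance in scores explained by rates
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

In the notebook itself, the same quantity is the `rvalue` returned by `linregress`, so printing `rvalue ** 2` after each regression would give the explained-variance figure directly.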

In [None]:
# Create a new column "weighed valence": valence scaled by chart position (rank 1 -> weight 100, rank 100 -> weight 1)
data["weighed valence"] = data["valence"] * (101 - data["Placement"])
data.head()

In [None]:
# Group by the song's date
data_gb = data.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed valence for each date
# (select the two columns before averaging so any non-numeric columns are not aggregated)
rate_v_valence = data_gb[["Unemployment Rate", "weighed valence"]].mean()

# Create a Scatter Graph
rate_v_valence.plot(kind="scatter", x = "Unemployment Rate", y = "weighed valence")

# Calculate the correlation coefficient and linear regression model 
x_values = rate_v_valence["Unemployment Rate"]
y_values = rate_v_valence["weighed valence"]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(5,22),fontsize=15,color="red")

plt.title("Unemployment Rate vs. Valence (Happiness)")
plt.xlabel("Unemployment Rate")
plt.ylabel("Weighed Valence (Happiness)")
plt.show()

## Unemployment Rate vs. Energy

We ran a regression for the unemployment rate versus energy and discovered there is a positive relationship between the energy in a song and the unemployment rate.
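The "weighed" columns behind these regressions scale each song's feature by `101 - Placement`, so the No. 1 song gets a weight of 100 and the No. 100 song a weight of 1; the notebook then takes the plain mean of the scaled values for each chart date. A minimal sketch of that arithmetic in plain Python follows. The helper names and the three-song chart are hypothetical, chosen only to illustrate the calculation.

```python
def placement_weight(placement):
    """Weight used in the notebook: rank 1 -> 100, rank 100 -> 1."""
    return 101 - placement

def mean_weighed_feature(songs):
    """Plain mean of feature * weight, mirroring groupby(...).mean() on a weighed column."""
    weighed = [feature * placement_weight(placement) for placement, feature in songs]
    return sum(weighed) / len(weighed)

def normalized_weighted_average(songs):
    """Alternative (NOT what the notebook does): divide by the sum of weights
    so the result stays on the feature's original 0-1 scale."""
    total_weight = sum(placement_weight(placement) for placement, _ in songs)
    weighted_sum = sum(feature * placement_weight(placement) for placement, feature in songs)
    return weighted_sum / total_weight

# Hypothetical chart for one date: (placement, energy)
chart = [(1, 0.8), (50, 0.5), (100, 0.2)]

print(mean_weighed_feature(chart))
print(normalized_weighted_average(chart))
```

Because the notebook's version is not divided by the total weight, its y-axis values are not on the feature's original scale, which is worth keeping in mind when reading the plots.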

In [None]:
data["weighed energy"] = data["energy"] * (101 - data["Placement"])
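# Weight tempo by chart position, like the other features (the original subtracted tempo instead of Placement)
data["weighed tempo"] = data["tempo"] * (101 - data["Placement"])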

In [None]:
# Group by the song's date
data_gb = data.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed energy for each date
# (select the two columns before averaging so any non-numeric columns are not aggregated)
rate_v_energy = data_gb[["Unemployment Rate", "weighed energy"]].mean()

# Create a Scatter Graph
rate_v_energy.plot(kind="scatter", x = "Unemployment Rate", y = "weighed energy")

# Calculate the correlation coefficient and linear regression model 
x_values = rate_v_energy["Unemployment Rate"]
y_values = rate_v_energy["weighed energy"]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(5,22),fontsize=15,color="red")

plt.title("Unemployment Rate vs. Energy")
plt.xlabel("Unemployment Rate")
plt.ylabel("Weighed Energy")
plt.show()

## Unemployment Rate vs. Tempo

We ran a regression of tempo against the unemployment rate and found a slight negative relationship between the tempo of a song and the unemployment rate.
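A "negative relationship" here means the fitted least-squares line has a negative slope. As a rough illustration of what `linregress` computes for the slope and intercept, the sketch below fits a line by hand in plain Python. `fit_line` is a hypothetical helper and the numbers are made up, not taken from the notebook's data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values for illustration only (not from the dataset)
rates = [4.0, 5.0, 6.0, 7.0, 8.0]
tempos = [120.0, 118.0, 119.0, 116.0, 115.0]

slope, intercept = fit_line(rates, tempos)
direction = "negative" if slope < 0 else "positive"
print(f"y = {slope:.2f}x + {intercept:.2f} ({direction} relationship)")
```

The slope only gives the direction and steepness of the line; how tightly the points follow it is a separate question answered by the r value.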

In [None]:
# Group by the song's date
data_gb = data.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed tempo for each date
# (select the two columns before averaging so any non-numeric columns are not aggregated)
rate_v_tempo = data_gb[["Unemployment Rate", "weighed tempo"]].mean()

# Create a Scatter Graph
rate_v_tempo.plot(kind="scatter", x = "Unemployment Rate", y = "weighed tempo")

# Calculate the correlation coefficient and linear regression model 
x_values = rate_v_tempo["Unemployment Rate"]
y_values = rate_v_tempo["weighed tempo"]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

plt.plot(x_values,regress_values,"r-")
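# Place the equation relative to the axes so it stays visible whatever the data range is
plt.annotate(line_eq,(0.05,0.9),xycoords="axes fraction",fontsize=15,color="red")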

plt.title("Unemployment Rate vs. Tempo")
plt.xlabel("Unemployment Rate")
plt.ylabel("Weighed Tempo")
plt.show()

In [None]:
# End of Kelsey's Code

In [None]:
# Will's Code

In [None]:
unemployement_time=data[['Year','Month','Unemployment Rate']].drop_duplicates().reset_index(drop=True)
unemployement_time.head()

In [None]:
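# Rows are years, columns are months (keyword arguments are required by recent pandas versions)
unemployement_time_pivot=unemployement_time.pivot(index='Year',columns='Month',values='Unemployment Rate')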
unemployement_time_pivot.head()

In [None]:
temp=unemployement_time_pivot.copy()
temp['STD']=[statistics.stdev(temp.loc[index,:])for index,row in temp.iterrows()]
temp.head()

In [None]:
plt.figure(figsize=(15,20))
sns.heatmap(unemployement_time_pivot,cmap=("Blues"))

In [None]:
vmax=unemployement_time_pivot.max().max()
vmin=unemployement_time_pivot.min().min()

In [None]:
fig,axes=plt.subplots(6,1,figsize=(10,20),sharex=True)
sns.heatmap(unemployement_time_pivot[:10],cmap=("Blues"),ax=axes[0],vmax=vmax,vmin=vmin)
sns.heatmap(unemployement_time_pivot[10:20],cmap=("Blues"),ax=axes[1],vmax=vmax,vmin=vmin)
sns.heatmap(unemployement_time_pivot[20:30],cmap=("Blues"),ax=axes[2],vmax=vmax,vmin=vmin)
sns.heatmap(unemployement_time_pivot[30:40],cmap=("Blues"),ax=axes[3],vmax=vmax,vmin=vmin)
sns.heatmap(unemployement_time_pivot[40:50],cmap=("Blues"),ax=axes[4],vmax=vmax,vmin=vmin)
sns.heatmap(unemployement_time_pivot[50:],cmap=("Blues"),ax=axes[5],vmax=vmax,vmin=vmin)

In [None]:
unemployment_rate_list=[]
for i in range(len(unemployement_time_pivot)):
    # Year is the index, so every column is a month; slicing from 1 would skip the first month
    for j in unemployement_time_pivot.iloc[i,:]:
        unemployment_rate_list.append(j)
unemployment_rate_list[:5]

In [None]:
plt.boxplot(unemployment_rate_list)

In [None]:
# High unemployment = at or above the 75th percentile of monthly rates (7.0 in the original run)
high_unemployment_rate=np.quantile(unemployment_rate_list, .75)
data["weighed valence"] = data["valence"] * (101 - data["Placement"])
data['High_Unemployment']=data['Unemployment Rate'].apply(lambda x: 1 if x>=high_unemployment_rate else 0)

data.head()

In [None]:
# `music_unemployment` is never defined above; this assumes it is meant to be a copy of `data`
music_unemployment = data.copy()

# Weight each audio feature by chart position (rank 1 -> 100, rank 100 -> 1)
placement_weight = 101 - music_unemployment["Placement"]
for feature in ["valence", "danceability", "energy", "key", "loudness",
                "speechiness", "acousticness", "liveness", "tempo"]:
    music_unemployment[f"weighed {feature}"] = music_unemployment[feature] * placement_weight

# Flag rows whose unemployment rate is at or above the 75th-percentile threshold
music_unemployment['High_Unemployment'] = (music_unemployment['Unemployment Rate'] >= high_unemployment_rate).astype(int)

In [None]:
music_unemployment.head()

In [None]:
music_unemployment_group=music_unemployment.groupby(['Year','Month','Day'])[['High_Unemployment','Unemployment Rate',
       'weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']].mean()

In [None]:
compare_list=['weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']
for i in compare_list:
    plt.figure(figsize=(10,5))
    sns.scatterplot(x='Unemployment Rate',y=i,hue='High_Unemployment',data=music_unemployment_group)

In [None]:
music_unemployment_group[i][music_unemployment_group['High_Unemployment']==1]

In [None]:
compare_list

In [None]:
# Draw one boxplot per weighted feature, filling the 3x3 grid row by row
fig,axes=plt.subplots(3,3,figsize=(15,15))
for ax,i in zip(axes.flat,compare_list):
    sns.boxplot(x='High_Unemployment',y=i,data=music_unemployment_group[[i,'High_Unemployment']],ax=ax)

In [None]:
music_unemployment_group.columns

In [None]:
statistic_list=[]
pvalue_list=[]
for i in compare_list:
    group1=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==1]
    group2=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==0]
    # Run the test once and reuse the result
    result=st.f_oneway(group1,group2)
    statistic_list.append(result[0])
    pvalue_list.append(result[1])
    print(f' ANOVA Result for {i} vs. High_Unemployment\n {result}\n==================')

In [None]:
# Mark a feature as significant when its p-value is at or below 0.05
significant_list=[1 if i <=0.05 else 0 for i in pvalue_list]
anova=pd.DataFrame({'Feature':compare_list,'Statistic':statistic_list,'Pvalue':pvalue_list,'Significant':significant_list})
anova.sort_values('Pvalue')

In [None]:
anova.sort_values('Significant',ascending=False)

In [None]:
# Boxplots for a subset of the weighted features, filling the 3x3 grid row by row
plot_features=['weighed danceability',
'weighed energy',
'weighed speechiness',
'weighed acousticness',
'weighed tempo']
fig,axes=plt.subplots(3,3,figsize=(15,15))
for ax,i in zip(axes.flat,plot_features):
    sns.boxplot(x='High_Unemployment',y=i,data=music_unemployment_group[[i,'High_Unemployment']],ax=ax)

In [None]:
# End of Will's Code

## Conclusion

Happiness in a song did not have a strong correlation with the U.S. unemployment rate. However, we did find that energy is correlated with it: when the U.S. unemployment rate is high, the top Billboard songs tend to have higher energy than when it is low.

This is not great news for Taylor Swift's new album "folklore" that came out last week.
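For reference, the high-versus-low comparison behind this conclusion rests on two steps: flagging dates whose unemployment rate is at or above the 75th percentile, and comparing the two groups with a one-way ANOVA. The sketch below reproduces both steps by hand in plain Python. `f_statistic` is a hypothetical helper, the numbers are made up rather than taken from the dataset, and the p-value is left out because it needs the F distribution (which `scipy.stats.f_oneway` supplies in the notebook).

```python
import statistics

def f_statistic(group_a, group_b):
    """One-way ANOVA F statistic for two groups: between-group over within-group mean square."""
    groups = [group_a, group_b]
    all_values = group_a + group_b
    grand_mean = statistics.fmean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical per-date values for illustration only (not from the dataset)
rates = [4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0, 9.0]
energy = [60.0, 62.0, 61.0, 63.0, 62.0, 64.0, 68.0, 70.0]

# 75th percentile with linear interpolation between data points
# (assumed to match the default behaviour of np.quantile)
threshold = statistics.quantiles(rates, n=4, method="inclusive")[2]

high = [e for r, e in zip(rates, energy) if r >= threshold]
low = [e for r, e in zip(rates, energy) if r < threshold]

f_value = f_statistic(high, low)
print(f"threshold = {threshold}, F = {f_value:.2f}")
```

A large F value means the gap between the group means is large relative to the spread within each group, which is the pattern the energy boxplots show.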