## Data Analysis

<em>Aaron Wollman, Albin Joseph, Kelsey Richardson Blackwell, Will Huang</em>

In this notebook, the code will look at data from Spotify, Billboard, and the US Bureau of Labor Statistics to try to answer the following questions:
<ul>
    <li>Is there a correlation between unemployment and the Billboard Top 100 Songs Chart?  If so, can the data predict what the next top song might sound like?
    </li>
    <li>
        Are there other musical attributes besides happiness that have a stronger correlation, such as danceability, energy, tempo, or speechiness?
    </li>
</ul>

In [None]:
%matplotlib inline

In [None]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import numpy as np
from scipy.stats import linregress
import scipy.stats as st

In [None]:
# Get constants
from columns import Music_Unemploy_Cols, Unemploy_Cols
from datafiles import music_unemployment, unemployment

In [None]:
# Import Music and Unemployment CSV
music_unemployment = pd.read_csv(music_unemployment, index_col = 0)
music_unemployment.head()

## Unemployment Rate

Before running regressions and statistical tests, we wanted to understand how the unemployment rate changed during the timeframe. To see those changes visually, we created a box plot and a heat map.

In [None]:
# Import the Unemployment Data file
unemployment_time = pd.read_csv(unemployment, index_col = 0)
unemployment_time = unemployment_time.dropna()
unemployment_time = unemployment_time.loc[unemployment_time[Unemploy_Cols.year] < 2020]
unemployment_time.head()

In [None]:
unemployment_rate_list=unemployment_time[Unemploy_Cols.rate]
plt.boxplot(unemployment_rate_list)
plt.title("Box Plot of Unemployment Rate")
plt.show()

In [None]:
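# Pivot the table to give a yearly view (one row per year, one column per month).
# Keyword arguments are used because recent pandas versions require them for pivot.
# Missing values were already dropped when the file was loaded above.
unemployment_time_pivot = unemployment_time.pivot(
    index=Unemploy_Cols.year, columns=Unemploy_Cols.month, values=Unemploy_Cols.rate)
unemployment_time_pivot.head()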

To show how the unemployment rate has changed over time, the code will use a heatmap. The darker the shade of blue, the higher the unemployment rate.
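As a minimal sketch of the reshaping step behind the heatmap, the snippet below pivots a small long-format table into a year-by-month grid and finds the shared color-scale limits. The column names and rates are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, month) observation.
toy = pd.DataFrame({
    "Year":  [1960, 1960, 1961, 1961],
    "Month": [1, 2, 1, 2],
    "Rate":  [5.2, 4.8, 6.6, 6.9],
})

# Keyword arguments make the roles explicit: rows are years, columns are months.
toy_pivot = toy.pivot(index="Year", columns="Month", values="Rate")

# A shared color scale needs the global min and max across the whole grid.
toy_vmin = toy_pivot.min().min()
toy_vmax = toy_pivot.max().max()
print(toy_pivot)
```

Using one `vmin`/`vmax` pair for every decade keeps the shades comparable from one heatmap to the next.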

In [None]:
# Show Unemployment Rate heatmap per decade.
vmax_un = unemployment_time_pivot.max().max()
vmin_un = unemployment_time_pivot.min().min()

fig, axes = plt.subplots(6, 1, figsize=(10, 20), sharex=True)
for i, axis in enumerate(axes):
    data = unemployment_time_pivot[i*10 : (i+1) * 10]
    axis.set_title(f"Unemployment in the {1960 + (i*10)}s")
    sns.heatmap(data, cmap="Blues", ax=axis, vmax=vmax_un, vmin=vmin_un)

plt.show()

From the heatmap there are a few observable trends:
* Higher unemployment rates usually lasted for several years
* Unemployment rate usually went high (above 7) once a decade
* The highest unemployment timeframes occurred in 1982-1983 and 2009-2011

## Song Valence
[Spotify's API](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) defines a song's valence as:
<blockquote>"A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."</blockquote>

For this project, this can be considered our "happiness" metric.

The code below builds a similar heat map for the valence score, to help us visualize how it changed over such a long timeframe.

Now it is time to do the work!

In [None]:
# Weighted Valence
music_unemployment["weighed valence"] = music_unemployment["valence"] * (101 - music_unemployment["Placement"])
valence_year_month=music_unemployment.groupby(['Year','Month'])['weighed valence'].mean()
valence_year_month_df=pd.DataFrame(valence_year_month)

In [None]:
month_list=list(range(1,13))
year_list=list(music_unemployment['Year'].unique())

In [None]:
valence_df=pd.DataFrame(columns = month_list,index=year_list)
for i in year_list:
    valence_df.loc[i]=valence_year_month[(i)]

valence_df.columns.name = 'Month'
valence_df.index.name = 'Year'
valence_df=valence_df.astype(float)
valence_df.sort_index(inplace=True)
valence_df.head()

In [None]:
# Valence heat map by decades
vmax=valence_df.max().max()
vmin=valence_df.min().min()

fig,axes=plt.subplots(6,1,figsize=(15,20))

for n,j in zip(range(0,6),range(0,60,10)):
    sns.heatmap(valence_df[j:j+10],cmap=("Blues"),ax=axes[n],vmax=vmax,vmin=vmin)


From the heatmap, it looks like people have tended to listen to sadder songs over the past three decades.

## Song Valence vs Unemployment
In the next cell, the code will display the heatmaps for the song valence and unemployment side-by-side for comparison.

In [None]:
# Display unemployment and valence heatmaps side-by-side.
fig,axes=plt.subplots(6,2,figsize=(15,20))
for n,j in zip(range(0,6),range(0,60,10)):
    sns.heatmap(unemployment_time_pivot[j:j+10],cmap=("Blues"),ax=axes[n][0],vmax=vmax_un,vmin=vmin_un)
for n,j in zip(range(0,6),range(0,60,10)):
    sns.heatmap(valence_df[j:j+10],cmap=("Blues"),ax=axes[n][1],vmax=vmax,vmin=vmin)

Looking at the heatmaps, there doesn't appear to be much correlation.  However, it would still be good to check mathematically whether this is the case. Let's run a regression on them to see how the points line up.

In [None]:
# Define a function for plotting a regression
def regression_plot(dataframe, x_col, y_col):
    # Plot the scatter plot
    dataframe.plot(kind="scatter", x = x_col, y = y_col)
    
    # Calculate the correlation coefficient and linear regression model 
    x_values = dataframe[x_col]
    y_values = dataframe[y_col]
    (slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
    regress_values = x_values * slope + intercept
    equation = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
    eq_label = f"{equation} \nr-squared = {round(rvalue * rvalue, 3)}"
    regress_plot, = plt.plot(x_values, regress_values, "r-", label=eq_label)
    plt.legend(handles=[regress_plot], loc="best")

In [None]:
# Group by the song's date
date_cols = [Music_Unemploy_Cols.year, 
             Music_Unemploy_Cols.month, 
             Music_Unemploy_Cols.day]
music_unemployment_gb = music_unemployment.groupby(date_cols)

# Find the average of unemployment rate and valence for each date
# (numeric_only=True skips text columns, which newer pandas versions cannot average)
avg_music_unemploy = music_unemployment_gb.mean(numeric_only=True)
rate_v_valence = avg_music_unemploy[[Music_Unemploy_Cols.unemploy_rate, 
                                     Music_Unemploy_Cols.valence]]

# Create a Scatter Graph
regression_plot(rate_v_valence, 
                Music_Unemploy_Cols.unemploy_rate, 
                Music_Unemploy_Cols.valence)
plt.title("Unemployment Rate vs. Valence (Happiness)")
plt.xlabel("Unemployment Rate")
plt.ylabel("Valence (Happiness)")
plt.show()

From the above graph, there is <b>not</b> a good correlation between valence and the unemployment rate. With the above, the data doesn't take the song's placement in the Top 100 into account. Let's try again using a weighted average of the Top 100.

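This weighted average will give the number 1 song 100 points, number 2 99 points, and will keep decreasing by 1 point until it assigns the number 100 song 1 point. By doing this weighted average, the placement of a song on the Top 100 will be more meaningful. Note that the weighted scores are averaged directly rather than divided by the sum of the weights, so the result is on a larger scale than the original 0.0 to 1.0 valence.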
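To make the weighting concrete, here is a small sketch on a made-up three-song chart (the placements and valence scores are hypothetical). It shows the plain mean of weighted scores that this notebook uses, alongside a normalized weighted mean for comparison.

```python
import numpy as np

# Hypothetical chart with three songs: placements and valence scores.
placements = np.array([1, 50, 100])
valences = np.array([0.9, 0.5, 0.1])

# Linear rank weights: placement 1 -> 100 points, placement 100 -> 1 point.
weights = 101 - placements

# The notebook's approach: scale each valence by its weight, then take a plain mean.
weighted_scores = valences * weights
mean_weighted_score = weighted_scores.mean()

# A normalized weighted mean divides by the sum of the weights instead,
# which keeps the result on the original 0.0 to 1.0 valence scale.
normalized_mean = np.average(valences, weights=weights)
```

Both versions favor higher-charting songs. They differ only in scale as long as every date has the same set of placements.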

In [None]:
# Create a new data point "Weighted Valence"
Music_Unemploy_Cols.weighed_valence = "weighed valence"
weights = (101 - music_unemployment[Music_Unemploy_Cols.placement])
weighed_valence = music_unemployment[Music_Unemploy_Cols.valence] * weights
music_unemployment[Music_Unemploy_Cols.weighed_valence] = weighed_valence
music_unemployment.head()

In [None]:
# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(date_cols)

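# Find the average of unemployment rate and weighed valence for each date
# (numeric_only=True skips text columns, which newer pandas versions cannot average)
avg_music_unemploy = music_unemployment_gb.mean(numeric_only=True)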
rate_v_valence = avg_music_unemploy[[Music_Unemploy_Cols.unemploy_rate, 
                                     Music_Unemploy_Cols.weighed_valence]]

# Create a Scatter Graph
regression_plot(rate_v_valence, 
                Music_Unemploy_Cols.unemploy_rate, 
                Music_Unemploy_Cols.weighed_valence)
plt.title("Unemployment Rate vs. Weighted Valence (Happiness)")
plt.xlabel("Unemployment Rate")
plt.ylabel("Weighted Valence (Happiness)")
plt.show()

Even with a weighted average, there still isn't a good correlation between the average valence and the unemployment rate for all decades.

### Unemployment Rate vs. Valence in Songs 2010-2019

Even though the regression didn't show a good correlation between unemployment and song valence from 1960 to now, musical tastes do change. This code will look at the correlation between 2010 and 2019.

In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])

# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed valence for each date
two_rate_v_valence = music_unemployment_years_gb.mean(numeric_only=True)[["Unemployment Rate", "weighed valence"]]

# Create a Scatter Graph
regression_plot(two_rate_v_valence, "Unemployment Rate", "weighed valence")
plt.title("Unemployment Rate vs. Valence in Songs 2010-2019")
plt.show()

From the graph above, looking only at songs between 2010 and 2019 gives a better correlation than looking at all decades.  It's still not great though, with an r-squared value less than 0.5.
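For readers less familiar with r-squared, the sketch below fits a line to a handful of made-up points (not the notebook's data) and shows how the value relates to the Pearson correlation coefficient.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical points, not the notebook's data.
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([52.0, 49.5, 51.0, 47.0, 48.5, 45.0])

result = linregress(x, y)

# r-squared is the share of variance in y explained by the fitted line.
r_squared = result.rvalue ** 2

# For simple linear regression, rvalue is the Pearson correlation coefficient.
pearson_r = np.corrcoef(x, y)[0, 1]
```

An r-squared below 0.5 means the fitted line explains less than half of the variation in the plotted values.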

### Valence vs. Unemployment Conclusion: 

We did not find evidence that the unemployment rate is related to the happiness of Top 100 hit songs. As the regression graphs above show, the r-squared values indicate no strong correlation.

## Finding an Alternative Music Attribute
Although there wasn't a great correlation between song valence and unemployment, that doesn't rule out a correlation between unemployment and another data attribute. This code will conduct an ANOVA test to see which other attributes might be worth looking into for a regression.

First, let's categorize songs into high and low unemployment.  From the boxplot in the Unemployment section, an unemployment rate of about 7 marks the third quartile (the 75th percentile), so rates at or above it can be considered high.
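The thresholding idea can be sketched on a short, made-up list of rates (these values are hypothetical and chosen only so the third quartile lands on 7.0):

```python
import numpy as np

# Hypothetical monthly unemployment rates, not the notebook's data.
rates = np.array([3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0, 9.5])

# The third quartile is the value below which 75% of observations fall.
threshold = np.quantile(rates, 0.75)

# Flag each observation as high (1) or low (0) relative to the threshold.
is_high = (rates >= threshold).astype(int)
```

The vectorized comparison here is an alternative to a row-by-row `apply` and gives the same 0/1 flags.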

In [None]:
# Categorize each song by the unemployment rate at the time.
# If the unemployment rate is at or above the threshold, assign the song to the High_Unemployment group.
# The threshold is the 3rd quartile of all the unemployment rate data (about 7.0).

high_unemployment_rate=np.quantile(unemployment_rate_list, .75)  # about 7.0
music_unemployment["weighed valence"] = music_unemployment["valence"] * (101 - music_unemployment["Placement"])
music_unemployment['High_Unemployment'] = music_unemployment['Unemployment Rate'].apply(lambda x: 1 if x>=high_unemployment_rate else 0)

music_unemployment.head()

Now, let's weigh the attributes according to their position on the Top 100 Charts.

In [None]:
#the calculation of weighted features could be done together 
music_weights = (101 - music_unemployment["Placement"])
music_unemployment["weighed valence"] = music_unemployment["valence"] * music_weights
music_unemployment['weighed danceability']=music_unemployment["danceability"] * music_weights
music_unemployment['weighed energy']=music_unemployment["energy"] * music_weights
music_unemployment['weighed key']=music_unemployment["key"] * music_weights
music_unemployment['weighed loudness']=music_unemployment["loudness"] * music_weights
music_unemployment['weighed speechiness']=music_unemployment["speechiness"] * music_weights
music_unemployment['weighed acousticness']=music_unemployment["acousticness"] * music_weights
music_unemployment['weighed liveness']=music_unemployment["liveness"] * music_weights
music_unemployment['weighed tempo']=music_unemployment["tempo"] * music_weights
music_unemployment.head()

In [None]:
# Average the weighted feature scores for each chart date
music_unemployment_group=music_unemployment.groupby(['Year','Month','Day'])[
    ['High_Unemployment','Unemployment Rate',
       'weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']].mean()
music_unemployment_group.head()

In [None]:
feature_list=['weighed valence', 'weighed danceability', 'weighed energy',
       'weighed key', 'weighed loudness', 'weighed speechiness',
       'weighed acousticness', 'weighed liveness', 'weighed tempo']

The graphs below show the distribution of attributes for high and low unemployment rates.

In [None]:
# Scatter plots of weighted feature scores by unemployment rate
fig, axes = plt.subplots(3, 3, figsize=(15, 15))

# Pair each feature with one cell of the 3x3 grid, filling row by row
for feature, axis in zip(feature_list, axes.flatten()):
    sns.scatterplot(x='Unemployment Rate', y=feature, hue='High_Unemployment',
                    data=music_unemployment_group, ax=axis)

And now, the code will conduct the ANOVA tests on the weighted features vs unemployment.

In [None]:
# Conduct ANOVA test for weighted features
statistic_list=[]
pvalue_list=[]
for i in feature_list:
    group1=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==1]
    group2=music_unemployment_group[i][music_unemployment_group['High_Unemployment']==0]
    # Run the test once and unpack both results
    statistic, pvalue = st.f_oneway(group1, group2)
    statistic_list.append(statistic)
    pvalue_list.append(pvalue)


In [None]:
# Anova test results dataframe
significant_list=[1 if i <=0.05 else 0 for i in pvalue_list]
anova=pd.DataFrame({'Feature':feature_list,'Statistic':statistic_list,'Pvalue':pvalue_list,'Significant':significant_list})
anova.sort_values('Pvalue')

From the above table of ANOVA test results, energy and tempo give the smallest p-values.  This means the difference in their averages between high- and low-unemployment periods is the least likely to be due to chance, which makes them the best candidates for a regression. Below are box plots of the attributes with significant results.
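As a rough illustration of what the test measures, the sketch below runs a one-way ANOVA on two small made-up groups (not the notebook's data). With exactly two groups, the ANOVA is equivalent to an equal-variance t-test, so the F statistic equals the squared t statistic and the p-values match.

```python
import numpy as np
import scipy.stats as st

# Hypothetical feature scores for two groups, not the notebook's data.
low_group = np.array([30.1, 31.4, 29.8, 30.6, 31.0])
high_group = np.array([33.2, 32.7, 34.1, 33.5, 32.9])

# One-way ANOVA: tests whether the group means differ.
f_stat, f_pvalue = st.f_oneway(low_group, high_group)

# With two groups, ANOVA matches an equal-variance t-test: F equals t squared.
t_stat, t_pvalue = st.ttest_ind(low_group, high_group, equal_var=True)
```

A small p-value says the group means differ. It does not by itself say how strong a linear relationship with the unemployment rate is, which is why a regression follows.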

In [None]:
# Boxplots for the weighted features with significant results
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
significant_features = anova['Feature'][anova['Significant'] == 1]

# Pair each significant feature with one cell of the 2x3 grid, filling row by row
for feature, axis in zip(significant_features, axes.flatten()):
    sns.boxplot(x='High_Unemployment', y=feature,
                data=music_unemployment_group[[feature, 'High_Unemployment']], ax=axis)
        

## Unemployment Rate vs. Tempo

From the ANOVA test, we knew that energy and tempo may correlate with the unemployment rate.

We ran a regression for the unemployment rate versus tempo for the whole timeframe and for 2010 onward. In both graphs, the r-squared values were too low to show a correlation between unemployment rate and tempo.

In [None]:
# Create a new data point "Weighed Energy"
music_unemployment["weighed energy"] = music_unemployment["energy"] * (101 - music_unemployment["Placement"])

# Create a weighed tempo, weighting by chart placement like the other features
music_unemployment["weighed tempo"] = music_unemployment["tempo"] * (101 - music_unemployment["Placement"])

# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed tempo for each date
rate_v_tempo = music_unemployment_gb.mean(numeric_only=True)[["Unemployment Rate", "weighed tempo"]]

# Create a Scatter Graph
regression_plot(rate_v_tempo, "Unemployment Rate", "weighed tempo")
plt.title("Unemployment Rate vs. Tempo")
plt.show()

Comparing the tempo vs unemployment for all decades didn't give a good correlation either, with the r-squared being far below 0.5.

### Unemployment Rate vs. Tempo in Songs 2010-2019
While all decades didn't give a good result, maybe songs between 2010 and 2019 will give a better result?

In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])

# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed tempo for each date
two_rate_v_tempo = music_unemployment_years_gb.mean(numeric_only=True)[["Unemployment Rate", "weighed tempo"]]

# Create a Scatter Graph
regression_plot(two_rate_v_tempo, "Unemployment Rate", "weighed tempo")
plt.title("Unemployment Rate vs. Tempo in Songs 2010-2019")
plt.show()

Looking at the above graph, narrowing the years didn't help much either.  The r-squared value is still below 0.5.

## Unemployment Rate vs. Energy

We ran a regression for the unemployment rate versus energy for the whole timeframe and for 2010 onward.

In [None]:
#Create a weighed energy
music_unemployment["weighed energy"] = music_unemployment["energy"] * (101 - music_unemployment["Placement"])

# Group by the song's date
music_unemployment_gb = music_unemployment.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed energy for each date
rate_v_energy = music_unemployment_gb.mean(numeric_only=True)[["Unemployment Rate", "weighed energy"]]

# Create a Scatter Graph
regression_plot(rate_v_energy, "Unemployment Rate", "weighed energy")
plt.title("Unemployment Rate vs. Energy")
plt.show()

Across all decades, energy and unemployment don't have much of a correlation either.  The r-squared value is below 0.5.

### Unemployment Rate vs. Energy in Songs 2010-2019

As with the other attributes observed, maybe looking at data between 2010 and 2019 will give a better result?

In [None]:
# Find the data for songs 2010 and after
music_unemployment_years = (music_unemployment.loc[(music_unemployment["Year"]) >= 2010])
                                  
# Group by the song's date
music_unemployment_years_gb = music_unemployment_years.groupby(["Year", "Month", "Day"])

# Find the average of unemployment rate and weighed energy for each date
two_rate_v_energy = music_unemployment_years_gb.mean(numeric_only=True)[["Unemployment Rate", "weighed energy"]]

# Create a Scatter Graph
regression_plot(two_rate_v_energy, "Unemployment Rate", "weighed energy")
plt.title("Unemployment Rate vs. Energy in Songs 2010-2019")
plt.show()

The graph for energy vs unemployment between 2010 and 2019 finally gives a strong correlation.  The r-squared value is above 0.85, well past the 0.5 mark used above.  This means that predictions based on this model might be possible.

## Conclusion

Happiness in a song did not have a strong correlation with the U.S. unemployment rate. However, we did discover that energy does have a correlation if the data is limited to 2010 through 2019.

When there is a high unemployment rate in the U.S., the top billboard songs are more likely to have higher energy than when there is a low unemployment rate.

This is not great news for Taylor Swift's new album "folklore" that came out last week, which is more mellow.