## Exploratory Data Analysis (EDA)

Conducting EDA on US top charting songs from 1921- 2020 to examine relationship between variables and other patterns in the data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# shows plots inline
%matplotlib inline

In [None]:
# To suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)



In [None]:
df=pd.read_csv('/Users/josephlim/Desktop/Data Science/Capstone Projects/Capstone project- Spotify/Data/Cleaned Data/US_1921-2020_final.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

### Spotify Audio Features

Spotify provides breaks down its track data using its audio features. The Spotify Web API developer guide defines them as follows:
- Duration: The duration of the track in milliseconds.
- Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
- Energy: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
- Key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
- Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
- Mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- Speechiness: This detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
- Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
- Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.
- Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
- Valence: Describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece, and derives directly from the average beat duration.
- Time signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

Because the term "mode" can be confusing given the context of statistical analysis, it will be referred to as "musical mode"(m_mode)

In [None]:
df_m= df.rename(columns={'mode':'m_mode'})

### Distributions of Data
#### Categorical features
There are two categorical features in this datasets: key and mode. 

In [None]:
df_cat=df_m[['key','m_mode']].copy()

In [None]:
df_cat.head()

Keys are denoted as numbers.0 represents C, and each index represents a semitone increase in key. Musical mode is also denoted as integers, but it is a binary data that indicates whether the song is a major/ minor key. 0 represents minor keys, while 1 represents major.

Values will be replaced to reflect corresponding categories.

In [None]:
df_cat['key'].replace({0:'C',1:'C#', 2:'D', 3:'D#', 4:'E', 5:'F', 6:'F#',7:'G',8:'Ab', 9:'A', 10:'Bb',11:'B'}, inplace=True)
df_cat['m_mode'].replace({0:'minor',1:'major'}, inplace=True)

In [None]:
df_cat['key'].value_counts()

In [None]:
key_order= ['C','C#','D','D#', 'E', 'F','F#','G', 'Ab','A','Bb','B' ]

In [None]:
sns.countplot(data=df_cat, order= key_order,x='key')

The most common keys are C and G.

In [None]:
df_cat['m_mode'].value_counts()

In [None]:
sns.countplot(data= df_cat, x='m_mode')

There are more songs that are major keys.

#### Quantitative data

We will now conduct analysis on numerical features of the dataset.

In [None]:
# Filtering numerical data.
cat_columns= df_cat.columns
df_num= df_m.drop(cat_columns, axis=1)

In [None]:
df_num.columns

In [None]:
df_num.describe().T

In [None]:
df_num.shape

In [None]:
df_num.sort_values('popularity', ascending=False)

Popularity score should be normalized to better understand its trends and for later use

In [None]:
def score_normalization(score, max_score):
    popularity_normalized= 1- score/ max_score
    
    return popularity_normalized

The normalized popularity score ranges from 0 to 1, with 1 being the most popular. 

In [None]:
df_num['popularity']=df_num['popularity'].apply(lambda x:score_normalization(x,df_num['popularity'].max()))

Also, no one talks about the durations of songs in ms. Let's convert them into to seconds.

In [None]:
def convert_mstosec(ms_input):
    sec_output= ms_input/1000
    return sec_output

In [None]:
df_num['duration']= df_num['duration_ms'].apply(lambda x: convert_mstosec(x))

In [None]:
df_num['duration']

In [None]:
df_num.drop('duration_ms', axis=1, inplace=True)

In [None]:
hist= df_num.hist(bins=10, figsize=(16,17))
for i, ax in enumerate(hist.ravel()):
    ax.set_xlabel(f'{i}')
    ax.set_ylabel('Count')

- Energy and valence have noticeably similar distribution. Danceability, loudness, and energy have similar distribution. 

- Most top charting songs tend to be short in duration. They tend to have high energy and loudness, around 0.6 and -10dB, respectively. 
- Most top charting tracks also tend to contain musical vocals(such as singing or rapping),rather than spoken words (as heard in audio books).  
- They also scored low "liveness", which means they are often polished studio recordings than live performances.
- There is lower likelihood that top charting songs are acoustic.
- In terms of valence, the highest distribution was seen between 0.25 and 0.75. 
- Similarly, the highest distribution in danceability was in between 0.5-0.75. 
- The largest tempo distribution in top charting songswas in between 90 to 150 BPM.

#### More in-depth look at individual features:

In [None]:
boxplot_dur= df_num.boxplot(column='duration', grid=False, vert=False, fontsize=15)
boxplot_dur.set_xlabel('Seconds')

In [None]:
mean= df_num['duration'].mean()
q25, q75= np.percentile(df_num['duration'],[25,75])
iqr= np.subtract(q75,q25)
maximum= q75+ 1.5*iqr

print('mean:',mean)
print('maximum', maximum)

In [None]:
outliers= df_num[df_num['duration']>maximum]
len(outliers)

Top charting songs have mean length of 230.05 sec. Of 586672 songs, 4% of songs(25254 songs) were longer than 397.028 sec. 

In [None]:
boxplot_tempo= df_num.boxplot(column='tempo', grid=False, vert=False, fontsize=15)

In [None]:
mean= df_num['tempo'].mean()


q25,q75= np.percentile(df_num['tempo'],[25,75])
iqr= np.subtract(q75,q25)
maximum= q75+ 1.5*iqr

print('mean:',mean)
print('maximum', maximum)

In [None]:
outliers= df_num[df_num['tempo']>maximum]
len(outliers)

Top charting songs tend to be medium or faster tempo, with mean tempo of around 120 BPM.  Of 586672 songs, 0.9% of songs(5336 songs) were faster than 197 BPM. 

In [None]:
boxpolot_E= df_num.boxplot(column='energy', grid=False, vert=False, fontsize=15)

In [None]:
mean= df_num['energy'].mean()
mode= df_num['energy'].mode()
q25,q75= np.percentile(df_num['energy'],[25,75])
iqr= np.subtract(q75,q25)
minimum= q25- 1.5*iqr
maximum= q75+ 1.5*iqr

print('mean:',mean)
print('minimum:', minimum)
print('mode:', mode)

In [None]:
outliers_min= df_num[df_num['energy']<minimum]
outliers_max=df_num[df_num['energy']>maximum]
print('lower_outlier',len(outliers_min))
print('upper_outlier:', len(outliers_max))

Top charting songs had balanced energy levels across the chart, with mean energy level of 0.542. There was no outlier.

In [None]:
boxpolot_live= df_num.boxplot(column='liveness', grid=False, vert=False, fontsize=15)

In [None]:
mean= df_num['liveness'].mean()

q25,q75= np.percentile(df_num['liveness'],[25,75])
iqr= np.subtract(q75,q25)
maximum= q75+ 1.5*iqr

print('mean:',mean)
print('maximum', maximum)

In [None]:
outliers= df_num[df_num['liveness']>maximum]
len(outliers)

Top charting songs tend to be polished studio recording rather than live recording, with mean liveness of 0.2139.
Of 586672 songs, around 7% of songs (40987 songs) had liveness more than 0.54755

In [None]:
boxpolot_acoust= df_num.boxplot(column='acousticness', grid=False, vert=False, fontsize=15)

In [None]:
mean= df_num['acousticness'].mean()

q25,q75= np.percentile(df_num['acousticness'],[25,75])
iqr= np.subtract(q75,q25)
minimum= q25- 1.5*iqr
maximum= q75+ 1.5*iqr

print('mean:',mean)
print('mode:', mode)
print('maximum', maximum)
print('minimum:', minimum)

In [None]:
outliers_min= df_num[df_num['acousticness']<minimum]
outliers_max=df_num[df_num['acousticness']>maximum]
print('lower_outlier',len(outliers_min))
print('upper_outlier:', len(outliers_max))

Top charting songs had mean acousticness of 0.44986. As seen by the plot and the mean closer to the center, there is only slight preference towards non-acoustic song.

In [None]:
boxpolot_loud= df_num.boxplot(column='loudness', grid=False, vert=False, fontsize=15)

In [None]:
mean= df_num['loudness'].mean()
mode= df_num['loudness'].mode()
q25,q75= np.percentile(df_num['loudness'],[25,75])
iqr= np.subtract(q75,q25)
maximum= q75+ 1.5*iqr
minimum= q25- 1.5*iqr

print('mean:',mean)
print('mode:', mode)
print('maximum', maximum)
print('minimum:', minimum)


In [None]:
outliers_min= df_num[df_num['loudness']<minimum]
outliers_max=df_num[df_num['loudness']>maximum]
print('lower_outlier',len(outliers_min))
print('upper_outlier:', len(outliers_max))

Top charting songs tend to be loud, with mean loudness of -10.206 dB. Of 586672 songs, only around 2.57% of songs (15096 songs) had loudness quieter thaniveness more than -22.50 dB. This makes sense, given that the Spotify's loudness guideline is around -14 dB. There were also some minor outliers in the other end of extreme, with approximately 0.00006% of songs (35 songs) being louder than 3.13 dB.

### Visualizing Relationships Between Features

We have seen that there are observable patterns in individual features. We will explore if there are relationships between features. I'll categorize correlation between 0.4- 0.7 moderatcorrelation and those over 0.7 strong correlation. I will be rounding up correlations with aboslute values between 0.35- 0.4.

In [None]:
# Heatmap to visualize data relationships
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(), linewidths=.1, cmap='YlGnBu', annot=True)
plt.yticks(rotation=0)

#### Strong positive correlations:
energy: loudness (0.76)

#### Moderate correlations:
valence: danceability (0.53)
<br> valence: energy(0.37)

#### Strong negative correaltions:
energy: acousticness (-0.72)

#### Moderate negative correlations:
loudness: acousticness (-0.52)
<br> acousticness: popularity (-0.37)



In [None]:
g= sns.pairplot(df_num, palette= 'Set1')
plt.show()

Not much correlation was seen in the data. This is to be expected, given how much music changes in a century. Let's try breaking down the data by year.

In [None]:
df_num['year']= pd.DatetimeIndex(df_num['release_date']).year

In [None]:
year_list= df_num['year'].unique().tolist()
year_list.sort()

data_year=[]
for i in range(len(year_list)):
    data_year.append(df_num[df_num['year']== year_list[i]])

In [None]:
len(data_year)

The earliest data we have is from 1900. While one may expect music from 1900 to be significantly different from 1922, let's examine it.

In [None]:
# Heatmap to visualize data relationships
# Heatmap to visualize data relationships
df_noyear=data_year[0].drop('year', axis=1)

plt.figure(figsize=(14,12))
sns.heatmap(df_noyear.corr(), linewidths=.1, cmap='YlGnBu', annot=True)
plt.title(label=f"Correlation Matrix for Year: {year_list[0]}")
plt.yticks(rotation=0)


It seems like there isn't much to dissect from 1900. Onto the next year!

In [None]:
# Heatmap to visualize data relationships
df_noyear=data_year[1].drop('year', axis=1)

plt.figure(figsize=(14,12))
sns.heatmap(df_noyear.corr(), linewidths=.1, cmap='YlGnBu', annot=True)
plt.title(label=f"Correlation Matrix for Year: {year_list[1]}")
plt.yticks(rotation=0)

#### Strong positive correlations:
energy: loudness (0.75)

#### Moderate positive correlations:
loudness: duration (0.55)
<br> acousticness: duration (0.51)
<br> valence: danceability (0.55)
<br> speechiness: danceability (0.53)
<br> valence: energy(0.4)

#### Strong negative correaltions:
energy: acousticness (-0.72)

#### Moderate negative correlations:
loudness: acousticness (-0.52)
<br> acousticness: popularity (-0.37)




While it is nice to be able to visualize correlations, it will be a bit much to do so for 101 years of data. I'll simply extract features, correlation values, and year. 

# I want to iterate through different years, create a dataframe with four columns: 1st feature, 2nd feature, correlation, and year of the data. Drop 'Negligible'.

In [2]:
df_corr= pd.DataFrame(columns=['feature_1', 'feature_2', 'corr_value', 'corr_strength','year']).set_index('year')

In [7]:
type(df_corr['corr_strength'])

pandas.core.series.Series

In [3]:
df_corr['corr_strength']= None

In [6]:
var.dtypes

NameError: name 'var' is not defined

In [None]:
corr_mat=data_year[1].corr()
corr_mat

In [None]:
corr_mat=data_year[1].corr()
pairs = corr_mat.stack()
high_pairs= pairs[(pairs>0.7)& (pairs!=1)].drop_duplicates()

high_pairs

corr_feat= df_corr[df_corr==high_pairs]
corr_feat.columns

In [None]:
variables= df_num.columns

In [None]:
columns=['feature_1', 'feature_2', 'corr_value', 'corr_strength','year']
feature_list=[]

for year in range(len(data_year)):
    
    corr_mat= data_year[year].corr()
    pairs = corr_mat.stack()
    
    for i in pairs:
        # Correlation value
        df_corr['corr_value']= i
        
             # Feature 1
        corr_feat= df_corr[df_corr==i]
        feature_1, feature_2 = corr_feat.columns
        df_corr['feature_1', 'feature_2'] = df_corr.update(feature_1, feature_2)

        
        # Correlation strength
        if (i>0.7)& (i !=1):
            df_corr['corr_strength']= 'Strong Positive'
        elif (i <0.7) & (i>0.4):
            df_corr['corr_strength']= 'Moderate Postive'
        elif (i> -0.7)& (i != -1):
            df_corr['corr_strength']= 'Strong Negative'
        elif (i< -0.7) & (i> -0.4):
            df_corr['corr_strength']= 'Moderate Negative'
        else:
            df_corr['corr_strength']= 'Negligible'

    

        # Feature 2
    #     df_corr['feature_2']= corr_mat.apply(lambda column: column[column== pairs[i]], axis=0)
    
        


    # Year
    df_corr['year'] =year_list[year]

Let's see if there is a trend amongst correlations as well. 

## Correlations:
The findings from EDA revealed that the strongest correlation lies between energy and loudness (r=0.76). This makes intuitive sense, because loud music (i.e. hip hop and EDM) are associated with more energy.

<br> There is also moderate positive correlations between valence and danceability (r= 0.53). This also makes intuitive sense, as people are more likely to dance to "happy" or "fun" songs. Another positive correlation was seeen in valence and energy (r= 0.37). This also makes sense as those songs that convey "fun" emotions also tend to be more energetic. 

<br>There was a strong negative correlation between energy and acousticness (r= -0.72). This also makes intuitive sense, as acoustic versions of songs tend to be more relaxed in arrangement and overall texture/quality. This is further supported by moderate negative correlation between acousticness and loudness. More acoustic a song is, quieter it is. 

<br>There isn't any particular feature that has strong positive correlation with popularity. Energy and loudness have weak positive correlation with popularity, with r= 0.3 andr= 0.33, respectively. This makes sense as loud music is perceived to be better.There is also a moderate negative correlation between popularity and acousticness (-0.37). 
