# Introduction to the project
As an avid music listener, I became interested in analyzing a dataset of Spotify's most popular songs from 2010 to 2019.

What is the most popular music genre? Which artist had the most hits in the last decade? How do the song variables correlate with one another?



# First look at the dataset

The cell below loads the data and shows the first rows to confirm that it was read correctly.

In [None]:
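import numpy as np
import pandas as pd

# Read the CSV with the ISO-8859-1 encoding.
df = pd.read_csv('/kaggle/input/top-spotify-songs-from-20102019-by-year/top10s.csv', encoding='ISO-8859-1')
# Drop the first column and keep the remaining ones.
df = df.iloc[:, 1:]
df.head()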

Let's check for any missing values.

In [None]:
df.isnull().any()

There are no null values in the dataset.
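`df.isnull().any()` only reports whether a column contains a missing value. If a column did contain some, a count per column would be more informative. A minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `df`:

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, skipping complete columns."""
    counts = frame.isnull().sum()
    return counts[counts > 0]

# Toy data, made up for illustration (not taken from the Spotify file).
toy = pd.DataFrame({
    "title": ["a", "b", None],
    "bpm": [120, None, 95],
    "year": [2010, 2011, 2012],
})
summary = missing_summary(toy)
print(summary)
```

An empty result from `missing_summary(df)` would say the same thing as a column of `False` values from `df.isnull().any()`.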

# How many songs, artists and genres are in this dataset?

In [None]:

number_songs=df.title.nunique()
number_artists=df.artist.nunique()
number_genres=df['top genre'].nunique()

print('There are', number_songs,'songs,', number_artists,'artists and',number_genres,'genres in the dataset.')


# Popular artists

Let's see which artists had the most hits over the past decade.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb
from matplotlib import rcParams
rcParams['figure.figsize'] =12,12

sb.countplot(y=df['artist'],order=df.artist.value_counts().iloc[:10].index);


The artists with the most hits are Katy Perry (17 songs), Justin Bieber (16 songs), and Rihanna and Maroon 5 (15 songs each).
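The exact counts quoted above are easier to read from a table than from the bar lengths. A short sketch of how they could be listed, assuming a hypothetical helper `top_counts` and made-up artist names in place of the real data:

```python
import pandas as pd

def top_counts(frame: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values of a column with their counts."""
    return frame[column].value_counts().head(n)

# Toy data, made up for illustration (not taken from the Spotify file).
toy = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist A", "Artist B"],
})
top = top_counts(toy, "artist", n=2)
print(top)
```

On the real data, `top_counts(df, 'artist')` would give the same ten artists that the count plot shows, with the number of songs next to each name.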

# Popular genres

Let's see the most popular genres of the past decade.

In [None]:

# Count the genres once and reuse the result for both the wedge sizes and the labels.
top_genres = df['top genre'].value_counts().iloc[:5]
genres_piechart = plt.pie(top_genres, explode=[0.1, 0, 0, 0, 0], labels=top_genres.index,
                          autopct='%1.1f%%', shadow=True, startangle=50)

We see that 'pop' genres dominated the last decade on Spotify.

'Dance pop' was the most prevalent genre of the decade. 'Pop' and 'Canadian pop' were second and third, respectively.
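One point to keep in mind when reading the pie chart: `autopct` computes each percentage relative to the values passed to `plt.pie`, so the labels are shares among the five plotted genres, not shares of the whole dataset. The sketch below puts the two side by side, using a hypothetical helper `genre_shares` and made-up genre counts:

```python
import pandas as pd

def genre_shares(frame: pd.DataFrame, column: str = "top genre", n: int = 5) -> pd.DataFrame:
    """Counts of the n most common genres, with their share of all rows and of the top n."""
    counts = frame[column].value_counts()
    top = counts.head(n)
    return pd.DataFrame({
        "count": top,
        "share_of_all": top / counts.sum() * 100,  # percent of every song
        "share_of_top": top / top.sum() * 100,     # percent the pie chart would show
    })

# Toy data, made up for illustration (not taken from the Spotify file).
toy = pd.DataFrame({"top genre": ["genre A"] * 5 + ["genre B"] * 3 + ["genre C"] * 2})
shares = genre_shares(toy, n=2)
print(shares)
```

In this toy frame the leading genre covers 50% of all rows but 62.5% of the top two, which is the kind of gap that can appear between the pie chart and the full dataset.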

# Duration and BPM of songs

Let's look at the distribution of song duration and BPM (beats per minute) by plotting histograms.

In [None]:
bpm_=df['bpm']
dur=df['dur']

rcParams['figure.figsize'] =17,7

#BPM
fig, (ax1,ax2) =plt.subplots(1,2)
ax1.hist(bpm_,bins=np.arange(40, 210, step=20));
ax1.set_title('Frequency plot of Beats per Minute (BPM)')
plt.sca(ax1)
plt.xticks(np.arange(40, 210, step=20))
plt.yticks(np.arange(0, 250, step=20))
ax1.set_xlabel('Beats per Minute (BPM)');
ax1.set_ylabel('Counts');

#Duration
ax2.hist(dur,bins=np.arange(120, 450, step=30));
ax2.set_title('Frequency plot of Song Duration')
plt.sca(ax2)
plt.xticks(np.arange(120, 450, step=30),('2:00','2:30','3:00','3:30','4:00','4:30','5:00','5:30','6:00','6:30','7:00'))
plt.yticks(np.arange(0, 250, step=20))
ax2.set_xlabel('Song Duration');
ax2.set_ylabel('Counts');



Most songs had between 120 and 140 beats per minute and a length between 3:30 and 4:00.

Now, let's see how the BPM and duration of songs changed over the decade.

I will plot the mean and median BPM and duration of songs for each year.
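The tallest bar can also be found numerically rather than by eye. A minimal sketch with `np.histogram`, using the same bin edges as the BPM histogram; the helper `modal_bin` and the sample values are assumptions made for illustration:

```python
import numpy as np

def modal_bin(values, edges):
    """Return (left edge, right edge, count) of the fullest histogram bin."""
    counts, edges = np.histogram(values, bins=edges)
    i = int(np.argmax(counts))
    return float(edges[i]), float(edges[i + 1]), int(counts[i])

# Toy BPM values, made up for illustration (not taken from the Spotify file).
toy_bpm = [95, 121, 128, 130, 139, 150, 172]
low, high, count = modal_bin(toy_bpm, np.arange(40, 210, step=20))
print(f"Fullest bin: {low:.0f}-{high:.0f} BPM ({count} songs)")
```

Values that fall outside the outermost edges are not counted by `np.histogram`, so the edges should cover the full range of the column.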

In [None]:
bpm_mean=df['bpm'].groupby(df['year']).mean()
length_mean=df['dur'].groupby(df['year']).mean()

bpm_med=df['bpm'].groupby(df['year']).median()
length_med=df['dur'].groupby(df['year']).median()


rcParams['figure.figsize'] =19,9


fig, axs =plt.subplots(2,2);
#Mean
axs[0,0].plot(bpm_mean);
axs[0,0].set_title('Mean BPM of songs for each year',fontsize=15);
axs[0,0].set_ylabel('Beats per Minute',fontsize=12)
plt.sca(axs[0,0])
plt.xticks(np.arange(2010, 2020, step=1));

axs[0,1].plot(length_mean);
axs[0,1].set_title('Mean Length of songs for each year',fontsize=15);
axs[0,1].set_ylabel('Minutes:Seconds',fontsize=12)
plt.sca(axs[0,1])
plt.xticks(np.arange(2010, 2020, step=1));
# Stop value 270 yields nine ticks (180..260), matching the nine labels.
plt.yticks(np.arange(180, 270, step=10),('3:00','3:10','3:20','3:30','3:40','3:50','4:00','4:10','4:20'));

#Median
axs[1,0].plot(bpm_med);
axs[1,0].set_title('Median BPM of songs for each year',fontsize=15);
axs[1,0].set_ylabel('Beats per Minute',fontsize=12)
plt.sca(axs[1,0])
plt.xticks(np.arange(2010, 2020, step=1));


axs[1,1].plot(length_med);
axs[1,1].set_title('Median Length of songs for each year',fontsize=15);
axs[1,1].set_ylabel('Minutes:Seconds',fontsize=12)
plt.sca(axs[1,1])
plt.xticks(np.arange(2010, 2020, step=1));
# Stop value 270 yields nine ticks (180..260), matching the nine labels.
plt.yticks(np.arange(180, 270, step=10),('3:00','3:10','3:20','3:30','3:40','3:50','4:00','4:10','4:20'));


Key takeaways from the graphs above:
1. For BPM, the mean graph shows that BPM rose to 123 in 2014 and then fell to 112 in 2019. The median graph shows a sharp drop from 125 BPM in 2014 to 105 BPM in 2017.
2. For song length, the mean graph shows that songs became shorter, from around 3:55 at the beginning of the decade to approximately 3:20 at the end, a decrease of roughly half a minute. The median graph shows a slow, gradual decrease in song duration, with upticks in 2011 and 2017.
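The yearly figures behind these takeaways can be printed as one table instead of being read off four separate plots. The following is a sketch under the assumption that the frame has `year`, `bpm` and `dur` columns (with `dur` in seconds) as in the cells above; the helpers `yearly_summary` and `to_min_sec` and the toy values are hypothetical:

```python
import pandas as pd

def yearly_summary(frame: pd.DataFrame, columns=("bpm", "dur")) -> pd.DataFrame:
    """Mean and median of the given columns for each year."""
    return frame.groupby("year")[list(columns)].agg(["mean", "median"])

def to_min_sec(seconds: float) -> str:
    """Format a duration in seconds as M:SS."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"

# Toy data, made up for illustration (not taken from the Spotify file).
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011, 2011],
    "bpm": [100, 120, 140, 90, 110],
    "dur": [200, 240, 250, 180, 220],
})
summary = yearly_summary(toy)
print(summary)
print(to_min_sec(summary.loc[2010, ("dur", "mean")]))
```

The result has one row per year and a two-level column index, so a single value is selected with a tuple such as `("bpm", "median")`.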

# Correlation

In [None]:
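# Correlate the numeric columns only; text columns such as title, artist and top genre are skipped.
corr_matrix = df.select_dtypes(include='number').corr()
sb.heatmap(corr_matrix, annot=True);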


The highest positive correlations are between 'dB' and 'nrgy' (0.54), 'val' and 'dnce' (0.5), and 'val' and 'nrgy' (0.41). These are considered moderate correlations. The strongest negative correlation is between 'dB' and 'acous' (-0.56).
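Picking the strongest pairs out of a heatmap by eye gets harder as the number of columns grows. One way to rank them programmatically is sketched below; the helper `strongest_pairs` and the toy columns are assumptions made for illustration:

```python
from itertools import combinations

import pandas as pd

def strongest_pairs(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Visit each unordered pair once, which skips the diagonal and mirrored entries.
    pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy data, made up for illustration (not taken from the Spotify file).
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric, ignored
})
ranked = strongest_pairs(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones, while the sign is preserved in the output.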

# Linear Regressions

Below is the linear regression for **'Valence'** (*the higher the value, the more positive the mood of the song*) and **'Danceability'** (*the higher the value, the easier it is to dance to the song*).

In [None]:
#val and dnce
sb.regplot(x=df.val,y=df.dnce).set_title('Valence (Positive mood) vs Danceability',fontsize=15)
plt.xlabel('Valence (Positive mood)',fontsize=12);
plt.ylabel('Danceability',fontsize=12);


Takeaway:
It seems that the more positive a song's mood, the more danceable it is.

Below is the linear regression for 'Valence' (the higher the value, the more positive the mood of the song) and 'Energy' (the higher the value, the more energetic the song).

In [None]:
#val and nrgy
sb.regplot(x=df.val,y=df.nrgy).set_title('Valence (Positive mood) vs Energy',fontsize=15);
plt.xlabel('Valence (Positive mood)',fontsize=12);
plt.ylabel('Energy',fontsize=12);

Takeaway:
It seems that the more positive a song's mood, the more energetic the song is.