# Spotify Time Series Analysis
This project uses the Spotify dataset, which contains audio features of 160k+ songs released between 1921 and 2020. The dataset was collected from the Spotify Web API and can be found on [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks).

The goal of this project is to analyze the trends of songs as well as the top artists over the course of a century.

## Setup
Before we start, let's first import the libraries that we are going to use for our analysis.

In [None]:
# Import the libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

## Loading the Data
Next, we are going to load `data.csv` into a Pandas DataFrame.

In [None]:
# Import the data
df = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/data.csv")

# View the shape and column names
print(df.shape)
df.columns

The data contains nearly 170,000 songs with 19 feature columns. The features include the song name, artist and release date, as well as characteristics of the song such as acousticness, danceability, loudness and tempo.

In [None]:
# Check for missing values
df.isnull().sum()

The `df.isnull().sum()` call returns the number of missing values in each column (`isnull` is an alias of `isna`). As we can see, there are no missing values. We can also drop some columns that are unnecessary for our analysis.

In [None]:
# Drop unnecessary columns
df.drop(["id", "key", "mode", "explicit", "release_date"], axis=1, inplace=True)
df.head()

## Audio Features Correlation Analysis
The dataset includes many different audio features of the songs. Some of these features may be correlated. At first glance, energy and loudness seem correlated. We can use Pandas' `corr` method to calculate the pairwise correlations and a heatmap to visualize them.

In [None]:
corr = df[["acousticness","danceability","energy", "instrumentalness", 
           "liveness","tempo", "valence", "loudness", "speechiness"]].corr()

plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=True)

There is a strong positive correlation between energy and loudness as we suspected. On the other hand, there seems to be a strong negative correlation between energy and acousticness.
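
With nine features, the heatmap has 36 distinct pairs, so it can help to list them in order of strength instead of reading them off the colors. Below is a minimal sketch of one way to do that. The `toy` frame and its values are made up for illustration and are not taken from the Spotify data; with the real data, `corr` from the cell above would take the place of `toy_corr`.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the Spotify dataset)
toy = pd.DataFrame({
    "energy":       [0.1, 0.4, 0.5, 0.8, 0.9],
    "loudness":     [-20.0, -12.0, -10.0, -6.0, -4.0],
    "acousticness": [0.9, 0.7, 0.4, 0.2, 0.1],
})

toy_corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations, such as energy versus acousticness, near the top alongside the strong positive ones.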

## Song Trends
The dataset contains songs from as far back as 1921. We can get an overview of how the characteristics of songs change over a hundred-year period.

In [None]:
year_avg = df[["acousticness","danceability","energy", "instrumentalness", 
               "liveness","tempo", "valence", "loudness", "speechiness", "year"]].\
groupby("year").mean().sort_values(by="year").reset_index()

year_avg.head()

We obtained the average yearly values for nine audio features. Let's create a line graph that shows the trends in seven of them over time (`tempo` and `loudness` are left out of the plot).

In [None]:
# Create a line plot
plt.figure(figsize=(14,8))
plt.title("Song Trends Over Time", fontdict={"fontsize": 15})

lines = ["acousticness","danceability","energy", 
         "instrumentalness", "liveness", "valence", "speechiness"]

for line in lines:
    # Label each line so the legend entries are guaranteed to match the features
    sns.lineplot(x="year", y=line, data=year_avg, label=line)

plt.ylabel("value")
plt.legend()
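
The plot above leaves out `tempo` and `loudness`. In the Spotify Web API these two are not on a 0 to 1 scale (tempo is in beats per minute, loudness in decibels), so they would dwarf the other lines on a shared axis. If we wanted to show them alongside the rest, one option is min-max scaling. This is a sketch of that idea, not something the analysis above does; `toy_avg` and its values are made up and stand in for `year_avg`.

```python
import pandas as pd

# Toy yearly averages (made-up values standing in for `year_avg`)
toy_avg = pd.DataFrame({
    "year":     [1921, 1970, 2020],
    "tempo":    [100.0, 115.0, 120.0],
    "loudness": [-17.0, -11.0, -7.0],
})

cols = ["tempo", "loudness"]

# Min-max scaling: map each column's minimum to 0 and its maximum to 1
col_min = toy_avg[cols].min()
col_max = toy_avg[cols].max()

scaled = toy_avg.copy()
scaled[cols] = (toy_avg[cols] - col_min) / (col_max - col_min)
print(scaled)
```

Note that the scaled values only show the shape of each trend. The original units are lost, so the y-axis should be labeled accordingly.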

## Artists with Most Songs
Now, let's analyze which artists have the most songs over this hundred-year period. But first, let's check how many unique artists the dataset contains.

In [None]:
# Check for the number of unique artists
df["artists"].nunique()

There are 33,375 unique artists in the entire dataset.

In [None]:
# Top 10 artists with most songs
df["artists"].value_counts()[:10]

Эрнест Хемингуэй has 1215 songs and the runner-up, Francisco Canaro, has 938.

In [None]:
artist_list = df.artists.value_counts().index[:10]

df_artists = df[df.artists.isin(artist_list)][["artists","year"]].\
groupby(["artists","year"]).size().reset_index(name="song_count")

df_artists.head()

In [None]:
plt.figure(figsize=(14,8))
sns.lineplot(x="year", y="song_count", hue="artists", data=df_artists)

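We cannot really separate the lines. Since it is such a long period (100 years), each artist appears in only part of the timeline. For instance, “Эрих Мария Ремарк” seems to dominate the 1930s. To give every artist a value for every year, we first build a zero-filled dataframe with one row per year and one column per artist.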

In [None]:
top_artists = pd.DataFrame(np.zeros((100,10)), columns=artist_list)
top_artists['year'] = np.arange(1921,2021)
print(top_artists.shape)
top_artists.head()

The dataframe includes 100 rows for 100 years and 11 columns (10 artists and a year column). Next, we convert it to a long dataframe using the `melt` function.
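
To make the wide-to-long reshape concrete, here is a toy version with two years and two artists. The artist names and the `wide` frame are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy wide frame: one row per year, one column per (hypothetical) artist
wide = pd.DataFrame({
    "year": [2000, 2001],
    "Artist A": [0.0, 0.0],
    "Artist B": [0.0, 0.0],
})

# Every artist column becomes a set of rows; `year` is repeated for each artist
long_df = wide.melt(id_vars="year", var_name="artists", value_name="song_count")
print(long_df.shape)
print(long_df)
```

Two years times two artists gives four rows and three columns. By the same logic, the real frame should end up with 100 × 10 = 1000 rows.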

In [None]:
top_artists = top_artists.melt(id_vars='year',var_name='artists', value_name='song_count')
print(top_artists.shape)
top_artists.head()

The song count is zero in all years. Let's merge in the song counts from the `df_artists` dataframe using Pandas' `merge` function.

In [None]:
df_merged = pd.merge(top_artists, df_artists, on=['year','artists'], how='outer').\
sort_values(by='year').reset_index(drop=True)
df_merged.head()

If an artist does not have any songs in a particular year, that value is filled with NaN. Let's replace the NaN values with 0, drop the `song_count_x` column and rename `song_count_y` to `song_count`.

In [None]:
df_merged.fillna(0, inplace=True)
df_merged.drop('song_count_x', axis=1, inplace=True)
df_merged.rename(columns={'song_count_y':'song_count'}, inplace=True)
df_merged.head()
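
As a side note, the zero-filled frame, `melt`, `merge` and `fillna` steps can likely be condensed into a single `reindex` over every (year, artist) combination. The sketch below shows the idea on made-up data; `counts` stands in for `df_artists`, and the artist names are hypothetical.

```python
import pandas as pd

# Toy counts (made-up): artists only have rows for years in which they released songs
counts = pd.DataFrame({
    "artists": ["Artist A", "Artist A", "Artist B"],
    "year": [2000, 2002, 2001],
    "song_count": [3, 1, 5],
})

# Every (year, artist) combination we want in the result
full_index = pd.MultiIndex.from_product(
    [range(2000, 2003), ["Artist A", "Artist B"]],
    names=["year", "artists"],
)

# Reindex onto the full grid; combinations without songs are filled with 0
filled = (
    counts.set_index(["year", "artists"])["song_count"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(filled)
```

The result is already ordered by year, and `song_count` stays an integer column because no NaN values are introduced along the way.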

Let's also add a column that shows the cumulative number of songs each artist has released over the years. One way to do that is to use the `groupby` and `cumsum` functions.

In [None]:
df_merged['cumsum'] = df_merged.groupby('artists')['song_count'].cumsum()
df_merged.head(10)
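
The running total is computed in row order within each artist, which is why it matters that `df_merged` was sorted by year after the merge. A small toy example (made-up values, artists shortened to "A" and "B") shows the behavior:

```python
import pandas as pd

# Toy data (made-up), already sorted by year as `df_merged` is
toy = pd.DataFrame({
    "year":       [2000, 2000, 2001, 2001, 2002, 2002],
    "artists":    ["A", "B", "A", "B", "A", "B"],
    "song_count": [3, 0, 0, 5, 1, 2],
})

# Running total within each artist, following the row order of the frame
toy["cumsum"] = toy.groupby("artists")["song_count"].cumsum()
print(toy)
```

Artist "A" accumulates 3, 3, 4 and artist "B" accumulates 0, 5, 7, interleaved in the original row order.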

Finally, we can create an animated bar plot that spans the entire timeline to see how each artist dominates in different years. There will be one bar per artist, and each bar will rise as that artist's cumulative number of songs increases.

In [None]:
fig = px.bar(df_merged,
             x='artists', y='cumsum',
             color='artists',
             animation_frame='year', animation_group='artists',
             range_y=[0,1300],
             title='Artists with Most Songs')
fig.show()
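
The upper limit of 1300 in `range_y` is hard-coded to sit just above the largest song count (1215). If the data or the artist selection changes, the limit could instead be derived from the data. Here is a minimal sketch, assuming 5% headroom and rounding up to the next hundred; both choices are arbitrary, and `toy_cumsum` is a made-up stand-in for `df_merged['cumsum']`.

```python
import math
import pandas as pd

# Toy cumulative counts (made-up) standing in for df_merged['cumsum']
toy_cumsum = pd.Series([0, 120, 938, 1215])

# Add 5% headroom, then round up to the next multiple of 100
y_max = math.ceil(toy_cumsum.max() * 1.05 / 100) * 100
print(y_max)
```

The computed value could then be passed to the plot as `range_y=[0, y_max]`.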