# **Exploratory Data Analysis**

We live in the era of big data. We can collect large amounts of data, which allows us to infer meaningful results and make informed business decisions. However, as the amount of data increases, it becomes trickier to analyze and explore. This is where visualizations come in: used efficiently and appropriately, they are great tools for exploratory data analysis. Visualizations also help deliver a message to your audience or inform them about your findings. There is no one-size-fits-all visualization method, so different tasks require different kinds of visualizations.

Let's approach the Spotify dataset from a couple of perspectives. One is how general trends in songs change over time, and the other is which artists are popular in different parts of the timeline.

The dataset contains more than 160,000 songs collected from the Spotify Web API. The features include song, artist, and release date, as well as characteristics of each song such as acousticness, danceability, loudness, and tempo. The date range is from 1921 to 2020.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/data.csv")
print(df.shape)
df.columns

In [None]:
df.dtypes

In [None]:
df.isna().sum().sum()

There are no missing values. df.isna().sum() returns the number of missing values in each column. By adding another sum(), we get the total number of missing values in the dataset.
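To make the difference between the two calls concrete, here is a minimal sketch on a toy dataframe. The column values are made up for illustration and are not taken from the Spotify data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with three missing entries in total.
toy = pd.DataFrame({
    "tempo": [120.0, np.nan, 98.5],
    "energy": [0.8, 0.4, np.nan],
    "name": ["a", None, "c"],
})

per_column = toy.isna().sum()  # Series: one count per column
total = per_column.sum()       # scalar: count across the whole frame

print(per_column)
print(total)
```

The per-column version is usually the more useful one when the total is not zero, because it shows where the gaps are.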

I will not use some of the features in my analysis so I will drop them.

In [None]:
df.drop(['Unnamed: 0', 'id','explicit','key','release_date','mode'], axis=1, inplace=True)

In [None]:
df.head()

## **Song Trends**

The dataset includes many different measures of songs. Some of the names, such as tempo, loudness, and energy, give an idea of what they mean. There are also very specific measures that are hard to understand if you are not that into music. For instance, acousticness, liveness, and speechiness are technical terms that we do not hear often.

Some of these measures may be correlated. At first glance, danceability and valence seem correlated. We can use the corr method of pandas to calculate the correlations and a heatmap to visualize them.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

%matplotlib inline

In [None]:
corr = df[['acousticness','danceability','energy','instrumentalness','liveness','tempo','valence']].corr()

plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True)

There is a positive correlation between valence and danceability as we suspected. There seems to be a strong negative correlation between energy and acousticness.
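When the matrix has many columns, it can help to list the pairs in order of strength instead of reading them off the heatmap. The sketch below shows one possible way to do that with pandas and NumPy. The values are made up, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for three measures, only to illustrate the mechanics.
toy = pd.DataFrame({
    "energy":       [0.9, 0.7, 0.5, 0.3, 0.1],
    "acousticness": [0.1, 0.2, 0.6, 0.7, 0.9],
    "valence":      [0.5, 0.9, 0.4, 0.6, 0.5],
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute strength, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations at the top alongside strong positive ones.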

Let’s also check the top 10 artists in terms of average energy per song and compare the results with their average acousticness values.

In [None]:
df[['artists','energy','acousticness']].groupby('artists').mean().sort_values(by='energy', ascending=False)[:10]

In [None]:
df.acousticness.mean()

With a few exceptions, artists with high-energy songs have low average acousticness. For reference, the average acousticness in the entire dataset is 0.50.

## How trends change over time

The dataset contains songs from as far back as 1921. We can get an overview of how the characteristics of songs change over a hundred-year period.

In [None]:
year_avg = df[['danceability','energy','liveness',
               'acousticness','valence','year']].groupby('year').mean().sort_values(by='year').reset_index()
year_avg.head()

We obtained the average yearly values for five different measures. Given the variety of software packages and useful functions, there is almost always more than one way to do a task in data science. I will show you two different ways to create a line graph that shows the trends in these variables over time.

In [None]:
plt.figure(figsize=(14,8))

plt.title("Song Trends Over Time", fontsize=15)
lines = ['danceability','energy','liveness','acousticness','valence']

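for line in lines:
    # label each line explicitly so the legend does not depend on draw order
    ax = sns.lineplot(x='year', y=line, data=year_avg, label=line)

plt.ylabel('value')  # the y-axis is shared by all five measures
plt.legend()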

Another way is to convert the year_avg dataframe to a long format using the pandas melt function.

In [None]:
melted = year_avg.melt(id_vars='year')
melted.head()

The different measures are combined under a column named “variable”. Five features are combined into one, so the melted dataframe must be 5 times as long as the year_avg dataframe:

In [None]:
print(len(melted))
print(len(year_avg))
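The same length relationship can be seen on a much smaller example. This is a minimal sketch with made-up yearly averages for two measures, so the long frame should be twice as long as the wide one.

```python
import pandas as pd

# Made-up yearly averages for two measures over three years.
wide = pd.DataFrame({
    "year": [1921, 1922, 1923],
    "energy": [0.2, 0.3, 0.25],
    "valence": [0.4, 0.5, 0.45],
})

# Every column other than `year` is stacked into variable/value pairs.
long_df = wide.melt(id_vars="year")

print(long_df)
print(len(long_df) == len(wide) * (wide.shape[1] - 1))
```

Each row of the wide frame contributes one row per measure, which is where the multiplication comes from.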

We confirmed the shapes. Let’s now see how to create the same plot using the melted dataframe.

In [None]:
plt.figure(figsize=(14,6))
plt.title("Song Trends Over Time", fontsize=15)
sns.lineplot(x='year', y='value', hue='variable', data=melted)

## Artists with Most Songs

I wonder how many unique artists we have in the dataset.

In [None]:
df.artists.nunique()

There are 33268 artists in the entire dataset. Some of them have a lot of songs, whereas others have very few. Let’s see the top 7 artists with the most songs in the dataset.

In [None]:
df.artists.value_counts()[:7]

Francisco Canaro has 956 songs and the runner-up, Ignacio Corsini, has 635. We can create a new dataframe that shows the yearly song production for these 7 artists.

In [None]:
artist_list = df.artists.value_counts().index[:7]
# count songs per artist and year; 'energy' has no missing values, so its count equals the row count
df_artists = (
    df[df.artists.isin(artist_list)][['artists', 'year', 'energy']]
    .groupby(['artists', 'year']).count().reset_index()
)
df_artists.rename(columns={'energy':'song_count'}, inplace=True)

In [None]:
df_artists.head()

In [None]:
plt.figure(figsize=(16,8))

sns.lineplot(x='year', y='song_count', hue='artists', data=df_artists)

We cannot really separate the lines. Because the period is so long (100 years), each artist appears in only part of the timeline. For instance, “Francisco Canaro” seems to dominate the 1930s.

I will now try a different way to see which artists dominate which era. First, I will create a zero-filled dataframe that contains the entire timeline (1921–2020) and the names of the top 7 artists.

In [None]:
df1 = pd.DataFrame(np.zeros((100,7)), columns=artist_list)
df1['year'] = np.arange(1921,2021)
print(df1.shape)
df1.head()

The dataframe includes 100 rows for 100 years and 8 columns (7 artists and a year column). Then I will convert it to a long dataframe using the melt function.

In [None]:
df1 = df1.melt(id_vars='year',var_name='artists', value_name='song_count')
print(df1.shape)
df1.head()

Song count is zero in all years. I will merge the song counts from the df_artists dataframe using the pandas merge function.

In [None]:
df_merge = pd.merge(df1, df_artists, on=['year','artists'], how='outer').sort_values(by='year').reset_index(drop=True)
df_merge.head()

If an artist does not have any songs in a particular year, that value is filled with NaN. Please note that it is important to set the how parameter of the merge function to “outer”. With the default “inner” join, the merged dataframe would only include the year-artist combinations in which the artist has at least one song.
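The effect of the how parameter is easier to see on a toy example. The sketch below uses made-up frames named grid and counts that stand in for df1 and df_artists. It also includes a “left” join, which keeps every row of the first frame and gives the same rows as “outer” here because the grid already holds every combination.

```python
import pandas as pd

# Full grid: every (year, artist) combination, as in df1. Values are made up.
grid = pd.DataFrame({
    "year": [1930, 1930, 1931, 1931],
    "artists": ["A", "B", "A", "B"],
})

# Observed counts: artist B has no songs in 1931.
counts = pd.DataFrame({
    "year": [1930, 1930, 1931],
    "artists": ["A", "B", "A"],
    "song_count": [3, 1, 2],
})

inner = pd.merge(grid, counts, on=["year", "artists"])  # default how="inner"
outer = pd.merge(grid, counts, on=["year", "artists"], how="outer")
left = pd.merge(grid, counts, on=["year", "artists"], how="left")

print(len(inner), len(outer), len(left))
print(outer["song_count"].isna().sum())
```

The inner join silently drops the (1931, B) row, while the other two keep it with a NaN song count.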

I will replace NaN values with 0 and drop song_count_x column.

In [None]:
df_merge.fillna(0, inplace=True)
df_merge.head()

In [None]:
df_merge.drop('song_count_x', axis=1, inplace=True)
df_merge.rename(columns={'song_count_y':'song_count'}, inplace=True)
df_merge.head()
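As a side note, the zero-filled grid, the merge, and the fillna step can probably be collapsed into a single reindex. This is only a sketch of an alternative on made-up data, not what the notebook does. The names counts, full_index, and filled are illustrative.

```python
import pandas as pd

# Made-up counts standing in for df_artists; artist B has no songs in 1931.
counts = pd.DataFrame({
    "artists": ["A", "A", "B"],
    "year": [1930, 1931, 1930],
    "song_count": [3, 2, 1],
})

# Every (artist, year) pair that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [["A", "B"], range(1930, 1933)], names=["artists", "year"]
)

filled = (
    counts.set_index(["artists", "year"])
          .reindex(full_index, fill_value=0)  # missing pairs become 0
          .reset_index()
)
print(filled)
```

With this approach there are no song_count_x and song_count_y columns to clean up, because only one frame carries the counts.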

I also want to add a column that shows the cumulative sum of the songs that each artist produced over the years. One way to do that is to use the groupby and cumsum functions.

In [None]:
# running total of song_count within each artist; rows are already sorted by year
df_merge['cumsum'] = df_merge.groupby('artists')['song_count'].cumsum()

df_merge.head()

If we use only cumsum without grouping by artists, the cumsum column accumulates across all rows in year order and does not take the artist column into account.
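Here is a small sketch of that difference on made-up counts for two artists. The column names running_all and running_by_artist are only for illustration.

```python
import pandas as pd

# Made-up counts, already sorted by year as in df_merge.
toy = pd.DataFrame({
    "year": [1930, 1930, 1931, 1931],
    "artists": ["A", "B", "A", "B"],
    "song_count": [3, 1, 2, 4],
})

# Plain cumsum runs over all rows and ignores the artist.
toy["running_all"] = toy["song_count"].cumsum()

# Grouped cumsum keeps a separate running total for each artist.
toy["running_by_artist"] = toy.groupby("artists")["song_count"].cumsum()

print(toy)
```

The grouped version only gives a per-year running total if the rows are in year order within each artist, which is why the sort by year before this step matters.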

I’ve managed to reformat the dataframe to fit what I want to plot. I will create an animated bar plot that spans the entire timeline. There will be a bar for each artist. The bars will go up as the cumulative number of songs for each artist increases. We will be able to see how each artist dominates different years.

I will use plotly python (plotly.py), which is a great library for creating interactive visualizations. Plotly express is the high-level API of plotly, with a syntax that is simple and easy to understand.

In [None]:
import plotly.express as px

In [None]:
fig = px.bar(df_merge,
             x='artists', y='cumsum',
             color='artists',
             # one frame per year; the same artist's bar is tracked across frames
             animation_frame='year', animation_group='artists',
             range_y=[0,1000],
             title='Artists with Most Number of Songs')
fig.show()

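Dynamic plots change based on what is passed to the animation_frame and animation_group parameters: animation_frame defines the frames of the animation, and animation_group tells plotly which rows describe the same object from one frame to the next. It is important to define a range (here with range_y) to prevent data points from falling outside the figure.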
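Instead of hard-coding the upper limit, the range could also be derived from the data. This is a minimal sketch on made-up cumulative counts. The 5% headroom is an arbitrary choice of mine, and toy stands in for df_merge.

```python
import pandas as pd

# Made-up cumulative counts standing in for df_merge['cumsum'].
toy = pd.DataFrame({
    "artists": ["A", "B", "A", "B"],
    "year": [1930, 1930, 1931, 1931],
    "cumsum": [3, 1, 5, 5],
})

# Upper limit: the largest value across all frames plus 5% headroom.
upper = toy["cumsum"].max() * 1.05
range_y = [0, upper]
print(range_y)
```

The resulting list can be passed as range_y, so the tallest bar in the last frame still fits inside the figure if the data changes.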

We have covered some techniques to manipulate or change the format of a dataframe. We have also created some basic plots as well as an animated plot. There is much more we can do on this dataset. For instance, we can analyze the popularity of songs or artists. How popularity changes over time based on the music style can also be investigated. Thus, there is no limit to the exploratory data analysis process. We can approach the dataframe from a specific point of view depending on our needs. However, the techniques and operations are usually the same.

Thank you for reading. Please let me know if you have any feedback.