# EDA on Spotify global 2019 most-streamed tracks

## On Spotify and this dataset

#### Spotify is a freemium audio streaming service launched in 2008. This dataset compiles the most-streamed tracks of 2019 along with the audio characteristics of each track. Spotify's definitions of these audio characteristics are available on its website: https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/

## The goal of this exploratory data analysis

#### Figure out the traits of the most-streamed songs on Spotify

## 1. Importing required libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import seaborn as sns                       
import matplotlib.pyplot as plt             
%matplotlib inline     
sns.set(color_codes=True)

## 2. Displaying a preview of the dataset

In [None]:
df = pd.read_csv("/kaggle/input/spotify-global-2019-moststreamed-tracks/spotify_global_2019_most_streamed_tracks_audio_features.csv")
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.dtypes

## 3. Cleaning the dataset of irrelevant columns for this analysis

In [None]:
new_df = df.drop(['Country', 'Track_id', 'URL', 'Artist_id', 'Artist_img'], axis=1)
new_df.head(10)

## 4. Correlation between numeric variables

In [None]:
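f, ax = plt.subplots(figsize=(15, 15))
# Keep only numeric columns: pandas 2.0 and later raise an error if corr() meets text columns
sns.heatmap(new_df.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)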

The heatmap above shows which audio characteristics are most strongly correlated across the most-streamed tracks.

The most strongly correlated pairs are loudness and energy (0.8 correlation), energy and valence* (0.4 correlation), and valence and danceability (0.3 correlation).

*Valence is defined by Spotify as a measure of musical positiveness. For more information visit:
https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/
https://web.archive.org/web/20170422195736/http://blog.echonest.com/post/66097438564/plotting-musics-emotional-valence-1950-2013
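
Reading the strongest pairs off a large heatmap by eye is error-prone, so the ranking can also be computed directly. The following is a minimal sketch, not part of the original analysis: `top_correlations` is a hypothetical helper, and the `toy` frame holds made-up values for illustration only (they are not taken from the Spotify dataset). On the real data it would be called as `top_correlations(new_df)`.

```python
from itertools import combinations

import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    # Ignore text columns so corr() works on any pandas version.
    corr = frame.select_dtypes(include="number").corr()
    # Visit each unordered pair once; this skips the diagonal and mirrored duplicates.
    pairs = pd.Series(
        {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    )
    # Sort by strength regardless of sign, strongest first.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy values for illustration only; NOT taken from the Spotify dataset.
toy = pd.DataFrame({
    "energy":   [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-12.0, -9.0, -7.0, -5.0, -4.0],
    "valence":  [0.5, 0.1, 0.9, 0.3, 0.6],
})

top = top_correlations(toy, n=2)
print(top)
```

Sorting by absolute value keeps strong negative correlations in the ranking, which a quick visual scan for high positive values can miss.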

## 5. Mean values of the audio characteristics and conclusion

In [None]:
# Print the mean and plot the distribution of each audio characteristic.
# sns.histplot(..., kde=True) replaces sns.distplot, which is deprecated since seaborn 0.11.
features = ['danceability', 'energy', 'mode', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence']
for feature in features:
    print(f"Mean value for {feature}:", new_df[feature].mean())
    sns.histplot(new_df[feature], kde=True)
    plt.show()

In [None]:
numeric = new_df.drop(['Rank','Streams','Artist_popularity', 'Artist_follower'], axis=1)
small = numeric.drop(['tempo','duration_ms','key', 'loudness', 'time_signature'], axis=1)
sns.set_palette('pastel')
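# numeric_only=True skips any remaining text columns, which would otherwise raise in pandas 2.0+
small.mean(numeric_only=True).plot.bar()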
plt.title('Mean Values of Audio Features')
plt.show()

In conclusion, the bar chart shows that danceability, energy and mode have the highest mean values among the audio characteristics of the 2019 most-streamed tracks on Spotify. Because mode is binary (1 for major, 0 for minor), its mean is the share of tracks in a major key. Taken together, these mean values suggest that people tend to listen more to upbeat music.
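
Since mode takes only the values 0 (minor) and 1 (major), its mean is a proportion rather than an average intensity like the other characteristics. The sketch below illustrates that equivalence on a made-up series; `toy_mode` is hypothetical and its values are not taken from the dataset. On the real data the same check would use `new_df['mode']`.

```python
import pandas as pd

# Toy values for illustration only; NOT taken from the Spotify dataset.
toy_mode = pd.Series([1, 1, 0, 1, 0, 1, 1, 0], name="mode")

# The mean of a 0/1 column equals the fraction of rows equal to 1.
share_major = toy_mode.mean()

# value_counts(normalize=True) gives the same share, broken down per value.
shares = toy_mode.value_counts(normalize=True)

print("Share of major-key tracks:", share_major)
print(shares)
```

Here five of the eight toy tracks are in a major key, so both computations give the same share.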