# Exploratory data analysis process

In this notebook, I analyze the structure of my top Spotify tracks from 2016 to 2022. The dataset contains my 100 top songs for each year.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
from tqdm import tqdm

# Data Visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import re
from wordcloud import WordCloud

# Interactive plots
import plotly.express as px

## Load data

In [2]:
songs = pd.read_csv('../data/my_top_songs.csv')
songs.head()

Unnamed: 0.1,Unnamed: 0,artist_name,artist_pop,artist_genres,track_name,track_id,track_uri,popularity,danceability,energy,...,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,year
0,0,Alan Walker,82,['electro house'],Faded,1brwdYwjltrJo7WHpIvbYt,1brwdYwjltrJo7WHpIvbYt,0,0.589,0.651,...,0.166,90.011,audio_features,1brwdYwjltrJo7WHpIvbYt,spotify:track:1brwdYwjltrJo7WHpIvbYt,https://api.spotify.com/v1/tracks/1brwdYwjltrJ...,https://api.spotify.com/v1/audio-analysis/1brw...,212627,4,2016
1,1,Twenty One Pilots,81,"['modern rock', 'rock']",Ride,2Z8WuEywRWYTKe1NybPQEW,2Z8WuEywRWYTKe1NybPQEW,83,0.645,0.713,...,0.566,74.989,audio_features,2Z8WuEywRWYTKe1NybPQEW,spotify:track:2Z8WuEywRWYTKe1NybPQEW,https://api.spotify.com/v1/tracks/2Z8WuEywRWYT...,https://api.spotify.com/v1/audio-analysis/2Z8W...,214507,4,2016
2,2,New Beat Fund,41,[],No Type,4fxtYgcIOqjCq9Ix1pvrzn,4fxtYgcIOqjCq9Ix1pvrzn,47,0.619,0.622,...,0.21,136.978,audio_features,4fxtYgcIOqjCq9Ix1pvrzn,spotify:track:4fxtYgcIOqjCq9Ix1pvrzn,https://api.spotify.com/v1/tracks/4fxtYgcIOqjC...,https://api.spotify.com/v1/audio-analysis/4fxt...,213797,4,2016
3,3,The Strumbellas,58,"['canadian indie', 'folk-pop', 'pop rock', 'st...",Spirits,1mqbTByfUxLPeqN1YEw08a,1mqbTByfUxLPeqN1YEw08a,0,0.553,0.724,...,0.775,80.517,audio_features,1mqbTByfUxLPeqN1YEw08a,spotify:track:1mqbTByfUxLPeqN1YEw08a,https://api.spotify.com/v1/tracks/1mqbTByfUxLP...,https://api.spotify.com/v1/audio-analysis/1mqb...,203653,4,2016
4,4,Kygo,81,"['edm', 'pop', 'pop dance', 'tropical house']",Raging (feat. Kodaline),6DsFZITJMPnh8z5XewfVmL,6DsFZITJMPnh8z5XewfVmL,61,0.55,0.689,...,0.408,99.904,audio_features,6DsFZITJMPnh8z5XewfVmL,spotify:track:6DsFZITJMPnh8z5XewfVmL,https://api.spotify.com/v1/tracks/6DsFZITJMPnh...,https://api.spotify.com/v1/audio-analysis/6DsF...,224487,4,2016
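
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like index columns left over from saving the dataframe to CSV with its index. A minimal sketch of one way to drop them, assuming that is indeed where they come from. The inline CSV is only a stand-in for `my_top_songs.csv`, built from two rows of the output above, and `sample` / `sample_clean` are illustrative names:

```python
import io

import pandas as pd

# Stand-in for ../data/my_top_songs.csv with one leftover index column
csv_text = """Unnamed: 0,artist_name,track_name,duration_ms
0,Alan Walker,Faded,212627
1,Twenty One Pilots,Ride,214507
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed"
sample_clean = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]
print(sample_clean.columns.tolist())
```

The same boolean mask could be applied to `songs` right after loading, if those columns are not needed.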


## Data Preparation
First we need to prepare our dataframe. We drop the columns `track_id`, `track_uri`, `type`, `id`, `uri`, `track_href` and `analysis_url`, as we don't need them for this analysis. Then we convert the song duration from milliseconds to seconds.

In [4]:
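# Drop columns that are not needed for this analysis
columns_to_drop = ['track_id', 'track_uri', 'type', 'id', 'uri', 'track_href', 'analysis_url']
# errors='ignore' keeps this cell safe to re-run. Without it, running the cell again
# after the columns are already gone raises:
# KeyError: "['track_id', 'track_uri', 'type', 'id', 'uri', 'track_href', 'analysis_url'] not found in axis"
songs = songs.drop(columns=columns_to_drop, errors='ignore')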
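
The duration conversion mentioned in this section is not shown in the cell. A minimal sketch of what it could look like, assuming the `duration_ms` column from the output above. The column name `duration_s` is hypothetical, and the toy frame uses two rows from that output:

```python
import pandas as pd

# Toy frame: duration_ms values taken from the first rows shown earlier
sample = pd.DataFrame({
    "track_name": ["Faded", "Ride"],
    "duration_ms": [212627, 214507],
})

# Convert milliseconds to seconds (duration_s is a hypothetical column name)
sample["duration_s"] = sample["duration_ms"] / 1000

# Drop the original column once the converted one exists
sample = sample.drop(columns=["duration_ms"])
print(sample)
```

On the real dataframe, the same two lines would be applied to `songs` instead of `sample`.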

## Data Visualisation
First, we visualise the genres most commonly associated with the artists I listened to most over the years, by creating a word cloud from the `artist_genres` column.

In [None]:
# Join all genre entries into one string; each entry is a list-like string such as "['modern rock', 'rock']"
top_genres = ' '.join(songs['artist_genres'].astype(str))
wordcloud_genres = WordCloud(width=600,
                             height=400,
                             random_state=2,
                             max_font_size=100,
                             colormap='magma',
                             background_color="white").generate(top_genres)


# Create a figure object and set its size
fig = plt.figure(figsize=(12, 6))
# Show the word cloud image without axes
plt.imshow(wordcloud_genres)
plt.axis('off')
plt.title("Favorite artist genres")
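
Joining the raw column into one string lets the word cloud split multi-word genres such as `modern rock` into separate words. A possible alternative, sketched here under the assumption that `artist_genres` holds string representations of Python lists (as the loaded output suggests), is to parse each entry and count whole genres. The sample values are copied from that output, and `raw_genres` / `genre_counts` are illustrative names:

```python
import ast
from collections import Counter

# Sample artist_genres values copied from the head() output
raw_genres = [
    "['electro house']",
    "['modern rock', 'rock']",
    "[]",
    "['edm', 'pop', 'pop dance', 'tropical house']",
]

# Parse each list-like string and count every genre as a whole phrase
genre_counts = Counter()
for entry in raw_genres:
    genre_counts.update(ast.literal_eval(entry))

print(genre_counts.most_common(3))
```

The resulting counts could then be passed to `WordCloud(...).generate_from_frequencies(dict(genre_counts))` instead of calling `.generate()` on a joined string.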

Now that we have the data, let's take a look at the audio features of the songs. For this part, we look at the mean value of each feature per year, taken over that year's top songs. To visualize this we can use a parallel coordinates plot, as below. The code below does two things:

Selecting the audio features (`danceability`, `liveness`, `energy`, `valence`, `speechiness`, `acousticness`) together with `year`, then grouping by year and taking the mean of each feature. The result is stored in `summary_stats`, with one row per year.
Plotting `summary_stats` as a parallel coordinates plot, with one line per year and the lines colored by year.

In [None]:
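# Audio features to compare, plus the year used for grouping
features = ['danceability', 'liveness', 'energy', 'valence', 'speechiness', 'acousticness', 'year']
song_feat = songs.loc[:, features]

# Mean of each audio feature per year, one row per year
summary_stats = song_feat.groupby("year").mean().reset_index()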
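
To make the shape of `summary_stats` concrete, here is a small sketch of the same group-and-average step on a toy frame. The numbers are made up for illustration and are not taken from the dataset; `toy` and `toy_summary` are hypothetical names:

```python
import pandas as pd

# Made-up values: two songs per year, two audio features
toy = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "danceability": [0.5, 0.7, 0.6, 0.8],
    "energy": [0.6, 0.8, 0.3, 0.5],
})

# Same pattern as above: one row per year holding the mean of each feature
toy_summary = toy.groupby("year").mean().reset_index()
print(toy_summary)
```

Each row of the result becomes one line in the parallel coordinates plot.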

In [None]:
# Create the chart: one line per year, one axis per audio feature
fig = px.parallel_coordinates(
    summary_stats,
    dimensions=['danceability','liveness','energy','valence','speechiness','acousticness'],
    color="year",
    color_continuous_scale=px.colors.diverging.Tealrose,
    # Centre the diverging scale on the middle of the 2016-2022 range
    color_continuous_midpoint=2019,
    title='Mean audio features of my top songs by year')

# Hide the color scale that is useless in this case
fig.update_layout(coloraxis_showscale=False)

# Show the plot
fig.show()