# Exploratory Data Analysis with Spotify Data

Link for dataset: https://www.kaggle.com/datasets/leonardopena/top50spotify2019

Some questions that we will answer:

- Which songs and artists are the most popular?

- Which artist has the most songs on Spotify?

- What can we learn about the genres?

- What is the mean length, in minutes, of a top song?


In [1]:
# Install a pip package in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip install ydata-profiling
#!{sys.executable} -m pip install autoviz --upgrade

In [3]:
# Importing libraries

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
from ydata_profiling import ProfileReport
from autoviz.AutoViz_Class import AutoViz_Class
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

## Get and Inspect the Data

In [4]:
# We need to change the encoding to read the data
df = pd.read_csv("top50.csv", encoding="latin1")
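
The file is read with `encoding="latin1"`, presumably because it contains accented characters (such as the ñ in "Señorita") stored as single Latin-1 bytes, which are not valid UTF-8. The sketch below reproduces that situation with a made-up two-row sample instead of the real file; the `raw` bytes and the `sample` frame are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical two-row sample, encoded as Latin-1 to mimic the real file
raw = "Track.Name,Popularity\nSeñorita,79\nLA CANCIÓN,90\n".encode("latin1")

# The same bytes are not valid UTF-8, which is pandas' default encoding
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding decodes the accented characters correctly
sample = pd.read_csv(io.BytesIO(raw), encoding="latin1")

print("Valid UTF-8:", utf8_ok)
print(sample["Track.Name"].tolist())
```

If the real file were saved as UTF-8 instead, the default `pd.read_csv("top50.csv")` would be enough.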

In [5]:
# List all the column names:
for column_headers in df.columns:
    print(column_headers)

Unnamed: 0
Track.Name
Artist.Name
Genre
Beats.Per.Minute
Energy
Danceability
Loudness..dB..
Liveness
Valence.
Length.
Acousticness..
Speechiness.
Popularity


In [6]:
# Sample the data
df.sample(3)

|    | Unnamed: 0 | Track.Name | Artist.Name | Genre | Beats.Per.Minute | Energy | Danceability | Loudness..dB.. | Liveness | Valence. | Length. | Acousticness.. | Speechiness. | Popularity |
|---:|---:|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 16 | 17 | LA CANCIÓN | J Balvin | latin | 176 | 65 | 75 | -6 | 11 | 43 | 243 | 15 | 32 | 90 |
| 32 | 33 | 0.958333333333333 | Maluma | reggaeton | 96 | 71 | 78 | -5 | 9 | 68 | 176 | 22 | 28 | 89 |
| 13 | 14 | Otro Trago - Remix | Sech | panamanian pop | 176 | 79 | 73 | -2 | 6 | 76 | 288 | 7 | 20 | 87 |


As we can see, the first column ("Unnamed: 0") is just a row index saved with the data, so we can drop it.

In [7]:
# Dropping the first column
df.drop(df.columns[0], axis="columns", inplace=True)
df.head(3)

|    | Track.Name | Artist.Name | Genre | Beats.Per.Minute | Energy | Danceability | Loudness..dB.. | Liveness | Valence. | Length. | Acousticness.. | Speechiness. | Popularity |
|---:|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | Señorita | Shawn Mendes | canadian pop | 117 | 55 | 76 | -6 | 8 | 75 | 191 | 4 | 3 | 79 |
| 1 | China | Anuel AA | reggaeton flow | 105 | 81 | 79 | -4 | 8 | 61 | 302 | 8 | 9 | 92 |
| 2 | boyfriend (with Social House) | Ariana Grande | dance pop | 190 | 80 | 40 | -4 | 16 | 70 | 186 | 12 | 46 | 85 |
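
As an alternative to dropping the column after loading, the saved index can be consumed at read time with `index_col=0`. The sketch below uses a made-up three-row CSV string (an assumption standing in for `top50.csv`) whose first header is empty, which is what produces the "Unnamed: 0" name.

```python
import io

import pandas as pd

# Hypothetical excerpt; the empty first header mimics a saved row index
csv_text = ",Track.Name,Popularity\n1,Señorita,79\n2,China,92\n3,boyfriend,85\n"

# Default read: the saved index comes back as an "Unnamed: 0" column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the DataFrame index instead
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())
print(as_index.columns.tolist())
```

With this variant the frame keeps the original 1-based numbering as its index, whereas dropping the column leaves the default 0-based index.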


Now we can use ydata-profiling to create a profiling report of the data.
This gives us an initial exploratory data analysis.

In [8]:
# Creating profile
profile = ProfileReport(df, title="Profiling Report")

In [9]:
# Render the report as interactive widgets in the notebook
profile.to_widgets()


The profiling report gives us some insight into the data, but we will also investigate it ourselves.

In [None]:
# Overview of the data
df.info()

As we can see, there are no null values in our dataframe.

Also, we have 3 columns with the "object" datatype: "Track.Name", "Artist.Name" and "Genre".

The other features are numerical, with the "int64" datatype.
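
The checks above (missing values and column types) can also be done programmatically rather than read off `df.info()`. This is a minimal sketch on a three-row toy frame built from the rows shown earlier; the name `toy` is an assumption, and on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy frame with a subset of the columns, values taken from the rows shown earlier
toy = pd.DataFrame({
    "Track.Name": ["Señorita", "China", "boyfriend (with Social House)"],
    "Genre": ["canadian pop", "reggaeton flow", "dance pop"],
    "Popularity": [79, 92, 85],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Names of the non-numeric (text) columns
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(text_columns)
```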

In [None]:
# Generate descriptive statistics
df.describe().T

- <b>Beats per minute</b>: the mean is 120.06, which indicates that the songs are, overall, fairly fast
- <b>Danceability</b>: the mean is 71.38, so the songs are generally very danceable
- <b>Length</b>: the mean length is 200.96 seconds (about 3.35 minutes, or roughly 3 min 21 s), so the songs in this dataset are not particularly long
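
The `Length.` values (for example 191 or 302 for a single track) only make sense as seconds, so answering the question about the mean length in minutes needs a conversion. The sketch below shows that arithmetic on three lengths taken from the rows shown earlier; `length_s` is a hypothetical stand-in for `df["Length."]`.

```python
import pandas as pd

# Three track lengths in seconds, taken from the rows shown earlier
length_s = pd.Series([191, 302, 186], name="Length.")

mean_seconds = float(length_s.mean())
mean_minutes = mean_seconds / 60

# Split the rounded mean into whole minutes and remaining seconds
minutes, seconds = divmod(int(round(mean_seconds)), 60)

print(f"Mean length: {mean_minutes:.2f} min, i.e. about {minutes}:{seconds:02d}")
```

Applied to the full dataset, the same division turns the reported mean of 200.96 seconds into about 3.35 minutes.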

Checking the data distribution by plotting histograms

In [None]:
histo = df.hist(bins=10, figsize=(16,10))

The histograms do not show any obvious extreme outliers in our dataset, although a 10-bin histogram is only a rough visual check.

We also see that some distributions are clearly skewed, such as "Beats.Per.Minute", "Danceability", "Popularity" and "Liveness".
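
Skewness and outliers can also be quantified instead of judged by eye. The sketch below uses `DataFrame.skew` and the common 1.5 × IQR rule, which is one convention among several and not necessarily the criterion used in this notebook. The `toy` frame and the helper `iqr_outliers` are hypothetical, and the value 58 is a made-up extreme added for illustration.

```python
import pandas as pd

# Toy data: six values per column from the rows shown earlier, plus one made-up row
toy = pd.DataFrame({
    "Liveness": [8, 8, 16, 11, 9, 6, 58],
    "Energy": [55, 81, 80, 65, 71, 79, 60],
})

# Sample skewness per column: positive means a longer right tail
skewness = toy.skew(numeric_only=True)


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


liveness_outliers = iqr_outliers(toy["Liveness"])

print(skewness)
print(liveness_outliers.tolist())
```

On the real data, calling `iqr_outliers(df[column])` for each numeric column would show whether the "no outliers" reading of the histograms holds under this rule.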

In [None]:
# Check Pearson's correlation between the numeric features by plotting a heatmap
fig, ax = plt.subplots(figsize=(10, 10))
heatmap = sns.heatmap(df.corr(numeric_only=True),  # text columns must be excluded on pandas >= 2.0
                      cmap="Wistia",
                      annot=True,
                      ax=ax,
                      )

We see some relatively high positive correlations: between "Beats.Per.Minute" and "Speechiness.", and between "Energy" and "Valence.".
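
Reading the strongest pairs off a heatmap is error-prone, so it can help to list them sorted. The sketch below does this for a six-row toy frame built from the rows shown earlier; `toy` and `pairs` are assumed names, and the coefficients from six rows will not match those of the full dataset.

```python
from itertools import combinations

import pandas as pd

# Toy data: three numeric columns from the six rows shown earlier
toy = pd.DataFrame({
    "Beats.Per.Minute": [117, 105, 190, 176, 96, 176],
    "Energy": [55, 81, 80, 65, 71, 79],
    "Speechiness.": [3, 9, 46, 32, 28, 20],
})

corr = toy.corr(numeric_only=True)

# One entry per unordered column pair, sorted from strongest positive downwards
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)

print(pairs)
```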

We can also try another type of correlation: Spearman's rank correlation, which measures monotonic rather than strictly linear relationships.

In [None]:
# Check Spearman's correlation between the numeric features by plotting a heatmap
fig, ax = plt.subplots(figsize=(10, 10))
heatmap = sns.heatmap(df.corr(method="spearman", numeric_only=True),
                      cmap="Wistia",
                      annot=True,
                      ax=ax,
                      )
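
To illustrate why the two coefficients can differ, here is a small sketch on made-up data (not taken from the dataset): `y` grows monotonically but non-linearly with `x`, so Spearman's coefficient is exactly 1 while Pearson's is lower.

```python
import pandas as pd

# Made-up data: y increases with x, but not along a straight line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman is Pearson computed on the ranks, so any monotonic relation scores 1
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Large gaps between the two heatmaps for the same pair of features therefore hint at a non-linear relationship or at the influence of a few extreme values.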

In [None]:
# Plot pairwise relationships
sns.pairplot(df)

## Now we can start to answer our questions:

- Which songs and artists are the most popular?

In [None]:
# Number of songs per artist in the dataset
df["Artist.Name"].value_counts()

As we can see, "Ed Sheeran" is the artist with the most songs in the dataset.

In [None]:
# Top 5 most popular songs (nlargest already returns only 5 rows)
df.nlargest(5, "Popularity")

Even though Ed Sheeran has the most songs in the dataset, none of his songs is among the 5 most popular.

In [None]:
df[(df["Artist.Name"] == "Ed Sheeran")].sort_values(by="Popularity", ascending=False)

So, the popularity of Ed Sheeran's songs is between 82 and 87.
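
The same kind of summary can be computed for every artist at once instead of filtering one artist at a time. This is a minimal sketch using `groupby` with several aggregations; the `toy` frame is made up for illustration (the fourth row does not come from the dataset), and on the real data it would be replaced by `df`.

```python
import pandas as pd

# Made-up example rows; only the column names match the dataset
toy = pd.DataFrame({
    "Artist.Name": ["J Balvin", "Maluma", "Sech", "Maluma"],
    "Popularity": [90, 89, 87, 80],
})

# Number of songs and popularity range per artist, most songs first
summary = (
    toy.groupby("Artist.Name")["Popularity"]
    .agg(["count", "min", "max"])
    .sort_values("count", ascending=False)
)

print(summary)
```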

In [None]:
# Keep only the artist column, to count the songs per artist
df_artists = df.filter(["Artist.Name"])
df_artists.head(3)

In [None]:
# Count the songs per artist and sort from most to fewest
df_artists["Count"] = 1
df_artist = (df_artists.groupby("Artist.Name")["Count"].sum()
             .reset_index().sort_values(by="Count", ascending=False))

In [None]:
df_artist.head()
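
The helper `Count` column is one way to get the number of songs per artist. A shorter route, sketched here on a made-up list of artist names, is to reshape the result of `value_counts`, which is already sorted in descending order. The names `toy` and `top_artists` are assumptions for this example.

```python
import pandas as pd

# Made-up artist column standing in for df["Artist.Name"]
toy = pd.DataFrame({
    "Artist.Name": ["Ed Sheeran", "J Balvin", "Ed Sheeran",
                    "Sech", "J Balvin", "Ed Sheeran"],
})

# value_counts counts and sorts; the rest only names the two columns
top_artists = (
    toy["Artist.Name"]
    .value_counts()
    .rename_axis("Artist.Name")
    .reset_index(name="Count")
)

print(top_artists.head(5))
```

The resulting frame has the same two columns, "Artist.Name" and "Count", so it can be passed to the bar chart in the same way.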

In [None]:
# Bar chart of the 5 artists with the most songs
fig = px.bar(
    df_artist.head(5),
    x="Artist.Name",
    y="Count",
    # text_auto=True,
)
fig.update_layout(
    legend_orientation="h",
    legend=dict(x=0, y=1, traceorder="normal"),
    title="Top 5 Artists by Number of Songs",
    margin=dict(l=0, r=0, t=30, b=0),
)

fig.show()