**Introduction**

In this notebook we analyze the Top 50 Spotify Songs data set, which is available on Kaggle. The data set lists the 50 most listened-to songs on Spotify in 2019. Spotify has finally launched in India, so let's begin with some analysis.

![](https://miro.medium.com/max/2085/1*whTb1rhPwQcJkWFNLf1LgA.jpeg)

**Import packages**

In [None]:
# Data analysis packages
import pandas as pd
import numpy as np

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
from plotnine import *
%matplotlib inline
sns.set(style='dark')


**Import the Dataset**

In [None]:
df = pd.read_csv("../input/top50spotify2019/top50.csv", encoding='ISO-8859-1')

**Exploratory Data Analysis (EDA)**

In [None]:
df.head()

In [None]:
df.shape

The dataset consists of 50 songs, each described by 14 columns.

In [None]:
df.info()

There are **no null** values in the dataset. Apart from the `Unnamed: 0` column, it contains 3 categorical features and 10 numerical features.
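Both claims can be checked in code instead of being read off the `df.info()` output. The following is a minimal sketch: `summarize_columns` and the `toy` frame are hypothetical names with made-up values, used only so the snippet runs on its own. Calling the helper on `df` applies the same check to the real data.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> dict:
    """Count missing values and split columns into numerical vs. non-numerical."""
    numerical = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return {
        "missing": int(frame.isnull().sum().sum()),  # total number of null cells
        "categorical": categorical,
        "numerical": numerical,
    }

# Hypothetical toy data, only to illustrate the check
toy = pd.DataFrame({
    "Track.Name": ["Song A", "Song B"],
    "Genre": ["pop", "latin"],
    "Energy": [55, 81],
})

summary = summarize_columns(toy)
print(summary)
```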

The data set contains the following fields:
1. Track.Name: name of the track
2. Artist.Name: name of the artist
3. Genre: genre of the track
4. Beats.Per.Minute: tempo of the song
5. Energy: the higher the value, the more energetic the song
6. Danceability: the higher the value, the easier it is to dance to the song
7. Loudness..dB..: the higher the value, the louder the song
8. Liveness: the higher the value, the more likely the song is a live recording
9. Valence.: the higher the value, the more positive the mood of the song
10. Length.: the duration of the song
11. Acousticness..: the higher the value, the more acoustic the song
12. Speechiness.: the higher the value, the more spoken words the song contains
13. Popularity: the higher the value, the more popular the song

Let's begin with some fixes.

In [None]:
# Drop 'Unnamed: 0' since it carries no relevant information
df.drop(['Unnamed: 0'], axis=1, inplace=True)

Fixing the column names

In [None]:
df.rename(columns={'Track.Name':'Track_Name', 
                   'Artist.Name':'Artist_Name',
                   'Beats.Per.Minute':'Beats_Per_Minute', 
                   'Loudness..dB..':'Loudness',
                   'Valence.':'Valence', 
                   'Length.':'Length', 
                   'Acousticness..':'Acousticness',
                   'Speechiness.':'Speechiness'}, inplace=True)
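As an alternative to listing every column by hand, the dots could be cleaned up with a small rule. This is only a sketch: `clean_column_name` is a hypothetical helper, and the empty `toy` frame just reuses a few of the original column names. Note that the rule produces `Loudness_dB` rather than the `Loudness` chosen in the manual mapping above, so the rest of the notebook would need that name instead.

```python
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Turn names such as 'Loudness..dB..' into 'Loudness_dB'."""
    # Replace every run of dots with one underscore, then trim underscores at the ends
    return re.sub(r"\.+", "_", name).strip("_")

# Hypothetical empty frame that only carries some of the original column names
toy = pd.DataFrame(columns=["Track.Name", "Beats.Per.Minute", "Loudness..dB..", "Valence."])
toy = toy.rename(columns=clean_column_name)
print(list(toy.columns))
```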

Check the descriptive statistics of the dataset

In [None]:
df.describe().T

**Distribution by genre**

In [None]:
df.Genre.nunique()

In [None]:
df.Genre.value_counts()

In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize = (16,10));
sns.countplot(x="Genre", data=df, linewidth=2, edgecolor='black');
plt.ylabel('Number of occurrences');
plt.xticks(rotation=45, ha='right');

The leading genres in the top 50 are dance pop, pop and latin, in that order.
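The ranking can also be read as numbers rather than from the chart. Below is a minimal sketch that returns the most frequent categories together with their share of all rows; `top_categories` and the `toy_genres` values are hypothetical and not taken from the data set. On the real data the call would be `top_categories(df.Genre)`.

```python
import pandas as pd

def top_categories(series: pd.Series, n: int = 3) -> pd.DataFrame:
    """Return the n most frequent values with their count and percentage share."""
    counts = series.value_counts()  # sorted from most to least frequent
    share = (counts / len(series) * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share}).head(n)

# Made-up genre labels, only for illustration
toy_genres = pd.Series(["dance pop", "pop", "dance pop", "latin", "dance pop", "pop"])
result = top_categories(toy_genres, n=2)
print(result)
```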

**Distribution by artist**

In [None]:
df.Artist_Name.nunique()

In [None]:
df.Artist_Name.value_counts()

In [None]:
plt.figure(figsize=(20,8))
plt.style.use('fivethirtyeight')
sns.countplot(x='Artist_Name', data=df, linewidth=2, edgecolor='black')
plt.title('Number of times an artist appears in the top 50 songs list')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Keep only the rows of artists that appear more than once in the list
top_artists = df.groupby('Artist_Name')
filtered_data = top_artists.filter(lambda group: len(group) > 1)
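The same selection can be written without `groupby`, by counting the artists first and then keeping the rows whose artist occurs more than once. A sketch on made-up rows (`repeated_only` and the `toy` frame are hypothetical), with a check that both approaches select the same rows:

```python
import pandas as pd

def repeated_only(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep rows whose value in `column` occurs more than once."""
    counts = frame[column].value_counts()
    repeated = counts[counts > 1].index
    return frame[frame[column].isin(repeated)]

# Made-up rows, only for illustration
toy = pd.DataFrame({
    "Artist_Name": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist B", "Artist A"],
    "Track_Name": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

by_counts = repeated_only(toy, "Artist_Name")
by_groupby = toy.groupby("Artist_Name").filter(lambda group: len(group) > 1)
print(by_counts)
```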

In [None]:
plt.figure(figsize=(20,8))
plt.style.use('fivethirtyeight')
sns.countplot(y='Artist_Name', data=filtered_data, linewidth=2, edgecolor='black',
              order=filtered_data['Artist_Name'].value_counts().index)
plt.title('Top Artists of 2019')
plt.xticks(rotation=45, ha='right')
plt.show()

**Distribution by Liveness**

**Liveness:** *This value describes the probability that the song was recorded with a live audience. Higher liveness values represent an increased probability that the track was performed live.*

In [None]:
values = df.Liveness.value_counts()
indexes = values.index

fig = plt.figure(figsize=(15, 8))
sns.barplot(x=indexes, y=values, linewidth=2, edgecolor='black')

plt.ylabel('Number of occurrences')
plt.xlabel('Liveness')

In [None]:
minimum_Liveness = df[df.Liveness == df.Liveness.min()]
minimum_Liveness[['Track_Name', 'Artist_Name', 'Genre', 'Liveness']]

In [None]:
maximum_Liveness = df[df.Liveness == df.Liveness.max()]
maximum_Liveness[['Track_Name', 'Artist_Name', 'Genre', 'Liveness']]
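Filtering with a boolean mask, as in the two cells above, returns every track tied at the extreme value, whereas `idxmin` / `idxmax` return only the first one. The difference is sketched below on made-up values; `rows_at_extreme` and the `toy` frame are hypothetical.

```python
import pandas as pd

def rows_at_extreme(frame: pd.DataFrame, column: str, kind: str = "min") -> pd.DataFrame:
    """Return all rows whose value in `column` equals the column minimum or maximum."""
    target = frame[column].min() if kind == "min" else frame[column].max()
    return frame[frame[column] == target]

# Made-up values with a tie at the minimum
toy = pd.DataFrame({
    "Track_Name": ["t1", "t2", "t3", "t4"],
    "Liveness": [5, 8, 5, 40],
})

lowest = rows_at_extreme(toy, "Liveness", "min")    # both tied rows
highest = rows_at_extreme(toy, "Liveness", "max")
first_lowest = toy.loc[[toy["Liveness"].idxmin()]]  # only the first of the tied rows
print(lowest)
```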

**Distribution by Valence**

**Valence**: *Describes the musical positiveness conveyed by a track.  Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df.Valence, bins=15, color='blue', edgecolor='black', linewidth=1)
plt.show()

In [None]:
minimum_Valence = df[df.Valence == df.Valence.min()]
minimum_Valence[['Track_Name', 'Artist_Name', 'Genre', 'Valence']]

In [None]:
maximum_Valence = df[df.Valence == df.Valence.max()]
maximum_Valence[['Track_Name', 'Artist_Name', 'Genre', 'Valence']]

**Distribution by Length**

**Length**: *Describes the duration of the song in seconds.*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df['Length'], bins=15, color='green', edgecolor='black', linewidth=1)
plt.show()
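Since `Length` is stored in seconds, a minutes:seconds label is often easier to read. A small sketch follows; `format_duration` is a hypothetical helper, not something the data set or pandas provides.

```python
def format_duration(seconds: int) -> str:
    """Format a duration given in seconds as 'm:ss'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"

print(format_duration(191))  # 3:11
```

On the real data this could be applied as `df['Length'].apply(format_duration)`.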

**Distribution by Loudness**

*Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df['Loudness'], bins=15, color='red', edgecolor='black', linewidth=1)
plt.show()

**Distribution by Danceability**

*Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df['Danceability'], bins=15, color='violet', edgecolor='black', linewidth=1)
plt.show()

**Distribution by Energy**

*Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df['Energy'], bins=15, color='#F06292', edgecolor='k', linewidth=1)
plt.show()

**Distribution by Beats per Minute**

*Beats per minute (BPM) is the unit used to measure tempo. A "beat" is the basic unit of time in a piece of music.*

In [None]:
plt.figure(figsize=(8,4))
# histplot replaces the deprecated distplot(kde=False)
sns.histplot(df['Beats_Per_Minute'], bins=18, color='#E67E22', edgecolor='black', linewidth=1)
plt.show()
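Tempo in BPM converts directly to time: one beat lasts 60 / BPM seconds, and a track of a given length contains BPM × length / 60 beats. The arithmetic can be sketched as follows (the function names are illustrative, and the figures assume a constant tempo over the whole track):

```python
def seconds_per_beat(bpm: float) -> float:
    """Duration of a single beat in seconds."""
    return 60.0 / bpm

def beats_in_track(bpm: float, length_seconds: float) -> float:
    """Number of beats in a track, assuming the tempo never changes."""
    return bpm * length_seconds / 60.0

print(seconds_per_beat(120))     # 0.5
print(beats_in_track(120, 180))  # 360.0
```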

**Correlation heatmap**

In [None]:
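# Correlate numeric columns only; recent pandas versions raise an error on text columns
correlations = df.select_dtypes(include='number').corr()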

fig = plt.figure(figsize=(12, 8))
sns.heatmap(correlations, annot=True, linewidths=1, cmap='YlGnBu', center=1)
plt.show()

*Only two pairs of features have a correlation coefficient above 0.5:*
1. Beats per Minute & Speechiness
2. Energy & Loudness (dB)
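A list like this can also be extracted from the correlation matrix programmatically instead of being read off the heatmap. A rough sketch, assuming a hypothetical helper `strong_pairs` and a made-up `toy` frame (neither is part of the notebook); it reports each pair once by looking only at the upper triangle of the matrix.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep the upper triangle without the diagonal so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) compare as False and are dropped here as well
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=abs, ascending=False)

# Made-up numbers: feat_a and feat_b move together, feat_c does not
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [2, 4, 6, 8, 11],
    "feat_c": [5, 1, 4, 2, 3],
})

result = strong_pairs(toy, threshold=0.5)
print(result)
```

On the real data the call would be `strong_pairs(df, 0.5)`.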

**Pairplot of all the features**

In [None]:
sns.set_style('whitegrid')
sns.pairplot(df)
plt.show()

Relationship between energy and loudness

In [None]:
fig = plt.figure(figsize=(8, 6))
sns.regplot(x='Energy', y='Loudness', data=df)
plt.show()
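`sns.regplot` draws the least-squares line but does not return its coefficients. If the numbers are needed, they can be computed separately. Below is a sketch using NumPy on made-up values; the `toy_energy` and `toy_loudness` arrays are hypothetical and not taken from the data set.

```python
import numpy as np

# Made-up values, only for illustration
toy_energy = np.array([40.0, 55.0, 70.0, 85.0])
toy_loudness = np.array([-9.0, -7.0, -5.5, -4.0])

# Degree-1 polynomial fit: loudness ≈ slope * energy + intercept
slope, intercept = np.polyfit(toy_energy, toy_loudness, deg=1)

# Pearson correlation coefficient between the two arrays
r = np.corrcoef(toy_energy, toy_loudness)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

With the notebook's data, `df['Energy']` and `df['Loudness']` would take the place of the two toy arrays.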

In [None]:
sns.catplot(x = "Loudness", y = "Energy", kind = "box", data = df)
plt.show()

Relationship between Beats Per Minute and speechiness

In [None]:
sns.jointplot(x="Beats_Per_Minute", y="Speechiness", data=df, kind="kde");

These are just a few visualizations of the Top 50 Spotify Songs data; there is much more that can be explored in greater depth.

This was my first analysis and I'm still trying to improve it, so please feel free to give any feedback!