## Spotify Dataset Analysis

### This dataset was taken from Kaggle. It is a comprehensive collection of Spotify tracks across various genres. We will perform an exploratory data analysis (EDA) on it.

In [1]:
# Numerical and tabular data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the notebook output clean
import warnings
warnings.filterwarnings("ignore")

# Display all columns of the dataframe
pd.options.display.max_columns = None

# Display all rows of the dataframe
pd.options.display.max_rows = None

In [36]:
df_spotify = pd.read_csv('spotify_tracks.csv')
df_spotify.head()

|   | id | name | genre | artists | album | popularity | duration_ms | explicit |
|---|---|---|---|---|---|---|---|---|
| 0 | 7kr3xZk4yb3YSZ4VFtg2Qt | Acoustic | acoustic | Billy Raffoul | 1975 | 58 | 172199 | False |
| 1 | 1kJygfS4eoVziBBI93MSYp | Acoustic | acoustic | Billy Raffoul | A Few More Hours at YYZ | 57 | 172202 | False |
| 2 | 6lynns69p4zTCRxmmiSY1x | Here Comes the Sun - Acoustic | acoustic | Molly Hocking, Bailey Rushlow | Here Comes the Sun (Acoustic) | 42 | 144786 | False |
| 3 | 1RC9slv335IfLce5vt9KTW | Acoustic #3 | acoustic | The Goo Goo Dolls | Dizzy up the Girl | 46 | 116573 | False |
| 4 | 5o9L8xBuILoVjLECSBi7Vo | My Love Mine All Mine - Acoustic Instrumental | acoustic | Guus Dielissen, Casper Esmann | My Love Mine All Mine (Acoustic Instrumental) | 33 | 133922 | False |


### Data Overview:

In [37]:
df_spotify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6300 entries, 0 to 6299
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           6300 non-null   object
 1   name         6300 non-null   object
 2   genre        6300 non-null   object
 3   artists      6300 non-null   object
 4   album        6300 non-null   object
 5   popularity   6300 non-null   int64 
 6   duration_ms  6300 non-null   int64 
 7   explicit     6300 non-null   bool  
dtypes: bool(1), int64(2), object(5)
memory usage: 350.8+ KB


Here we can see that 'duration_ms' stores each track's length as an int64 count of milliseconds, which is hard to read at a glance. So let us add a column that expresses the duration in hh:mm:ss format.

In [38]:
def ms_to_hhmmss(ms):
    seconds = ms // 1000
    minutes = seconds // 60
    hours = minutes // 60
    minutes = minutes % 60
    seconds = seconds % 60
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"

# Apply the function to the duration_ms column
df_spotify['duration_hhmmss'] = df_spotify['duration_ms'].apply(ms_to_hhmmss)

In [39]:
df_spotify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6300 entries, 0 to 6299
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               6300 non-null   object
 1   name             6300 non-null   object
 2   genre            6300 non-null   object
 3   artists          6300 non-null   object
 4   album            6300 non-null   object
 5   popularity       6300 non-null   int64 
 6   duration_ms      6300 non-null   int64 
 7   explicit         6300 non-null   bool  
 8   duration_hhmmss  6300 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 400.0+ KB


Now we see that a new duration_hhmmss column has been added alongside duration_ms. At this point it holds strings (object dtype), so we convert it to a timedelta next.

In [40]:
df_spotify['duration_hhmmss'] = pd.to_timedelta(df_spotify['duration_hhmmss'])

In [42]:
df_spotify['duration_hhmmss'].head()

0   0 days 00:02:52
1   0 days 00:02:52
2   0 days 00:02:24
3   0 days 00:01:56
4   0 days 00:02:13
Name: duration_hhmmss, dtype: timedelta64[ns]

In [43]:
df_spotify.columns

Index(['id', 'name', 'genre', 'artists', 'album', 'popularity', 'duration_ms',
       'explicit', 'duration_hhmmss'],
      dtype='object')

Now the duration_hhmmss column holds each duration as a timedelta64[ns] value, displayed in hh:mm:ss form.
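
As a possible alternative to formatting a string and parsing it back, the millisecond values can be passed to `pd.to_timedelta` directly with `unit="ms"`. The sketch below is illustrative only: `sample`, `duration_td`, and `duration_td_s` are hypothetical names, and the five values are copied from the first rows shown earlier.

```python
import pandas as pd

# Hypothetical sample: the first five duration_ms values from the output above
sample = pd.DataFrame({"duration_ms": [172199, 172202, 144786, 116573, 133922]})

# Convert milliseconds straight to timedelta (sub-second precision is kept)
sample["duration_td"] = pd.to_timedelta(sample["duration_ms"], unit="ms")

# Floor to whole seconds to match the truncation done by ms_to_hhmmss
sample["duration_td_s"] = sample["duration_td"].dt.floor("s")

print(sample)
```

Flooring is optional. It is included here only so the result lines up with the whole-second values produced by the string-based approach.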

1. **Data Overview**
   - Summary Statistics: Calculate summary statistics such as mean, median, mode, standard deviation, and range for numerical columns like popularity and duration_ms.
   - Data Types: Verify the data types of each column and ensure they are appropriate (e.g., popularity and duration_ms should be integers).
2. **Data Cleaning**
   - Missing Values: Check for any missing values in the dataset and decide how to handle them (e.g., remove rows, fill with a placeholder).
   - Duplicates: Identify and handle duplicate records if any.
3. **Data Visualization**
   - Distribution Plots: Plot histograms or box plots to visualize the distribution of popularity and duration_ms.
   - Bar Plots: Create bar plots to show the count of songs per genre or per artist.
   - Scatter Plots: Visualize the relationship between popularity and duration_ms.
4. **Categorical Analysis**
   - Unique Values: Count the number of unique values in categorical columns such as genre, artists, and album.
   - Frequency Distribution: Create frequency distributions for categorical variables to understand their occurrences.
5. **Correlation Analysis**
   - Correlation Matrix: Compute the correlation matrix for numerical columns to identify any potential relationships.
   - Heatmaps: Use heatmaps to visualize the correlation matrix.
6. **Insights and Patterns**
   - Popular Genres and Artists: Identify the most popular genres and artists based on popularity.
   - Duration Analysis: Analyze the average duration of songs in different genres or by different artists.
   - Explicit Content: Examine the proportion of explicit content in the dataset.
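
The non-plotting steps in the plan above (summary statistics, missing values, duplicates, unique counts, correlation) could be sketched roughly as follows. This is a minimal illustration on a made-up four-row frame, not the notebook's actual results: `demo` and its values are hypothetical and only reuse the column names of `df_spotify`.

```python
import pandas as pd

# Hypothetical miniature frame reusing df_spotify's column names (values are made up)
demo = pd.DataFrame({
    "id": ["a1", "b2", "c3", "c3"],
    "genre": ["acoustic", "acoustic", "punk", "punk"],
    "popularity": [58, 57, 42, 42],
    "duration_ms": [172199, 172202, 144786, 144786],
    "explicit": [False, False, True, True],
})

numeric_cols = ["popularity", "duration_ms"]

# Summary statistics, with the range added as max minus min
summary = demo[numeric_cols].agg(["mean", "median", "std", "min", "max"])
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]

# Missing values per column, and rows that are exact duplicates of an earlier row
missing_per_column = demo.isna().sum()
n_duplicates = int(demo.duplicated().sum())
deduplicated = demo.drop_duplicates()

# Number of distinct values in a categorical column
n_unique_genres = demo["genre"].nunique()

# Pearson correlation matrix of the numerical columns
corr = demo[numeric_cols].corr()

print(summary)
print(missing_per_column)
print("duplicates:", n_duplicates, "| unique genres:", n_unique_genres)
print(corr)
```

Swapping `demo` for `df_spotify` should apply the same checks to the real data, assuming the column names match.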

**Univariate Analysis**

- **Distribution of Popularity**
  - Question: What is the distribution of the popularity of the songs?
  - Answer: Create a histogram or a box plot to visualize the distribution of the popularity column.
- **Average Duration of Songs**
  - Question: What is the average duration of the songs in the dataset?
  - Answer: Calculate the mean of the duration_ms column.
- **Explicit Content Count**
  - Question: How many songs are explicit versus non-explicit?
  - Answer: Use a bar chart or count plot to show the counts of explicit values (True/False).

**Bivariate Analysis**

- **Popularity vs Duration**
  - Question: Is there a relationship between the popularity of a song and its duration?
  - Answer: Create a scatter plot with popularity on the y-axis and duration_ms on the x-axis. Calculate the correlation coefficient.
- **Genre vs Popularity**
  - Question: How does the popularity vary across different genres?
  - Answer: Use a box plot to show the distribution of popularity for each genre.
- **Artists vs Popularity**
  - Question: Which artists have the highest average popularity?
  - Answer: Group the data by artists and calculate the mean popularity. Then, create a bar chart to display the average popularity of the top artists.
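
The numerical side of the questions above could be computed along these lines. Plotting is left out, and the frame is again a hypothetical stand-in: `demo`, the artist names, and all values are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_spotify (artist names and values are made up)
demo = pd.DataFrame({
    "genre": ["acoustic", "acoustic", "punk", "punk"],
    "artists": ["Artist A", "Artist B", "Artist C", "Artist C"],
    "popularity": [58, 57, 42, 46],
    "duration_ms": [172199, 172202, 144786, 116573],
    "explicit": [False, False, True, False],
})

# Average duration of songs, in milliseconds
mean_duration_ms = demo["duration_ms"].mean()

# Explicit versus non-explicit counts
explicit_counts = demo["explicit"].value_counts()

# Pearson correlation between popularity and duration
pop_duration_corr = demo["popularity"].corr(demo["duration_ms"])

# Per-genre popularity summary (the numbers a box plot would draw)
popularity_by_genre = demo.groupby("genre")["popularity"].describe()

# Artists ranked by mean popularity, highest first
top_artists = (
    demo.groupby("artists")["popularity"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

print(mean_duration_ms)
print(explicit_counts)
print(pop_duration_corr)
print(popularity_by_genre)
print(top_artists)
```

Note that the `artists` column in the real data can hold several comma-separated names per track, so grouping on it as-is treats each combination as a distinct artist.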

## Univariate Analysis

### Now, we will analyse each variable one-by-one.

In [45]:
df_spotify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6300 entries, 0 to 6299
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype          
---  ------           --------------  -----          
 0   id               6300 non-null   object         
 1   name             6300 non-null   object         
 2   genre            6300 non-null   object         
 3   artists          6300 non-null   object         
 4   album            6300 non-null   object         
 5   popularity       6300 non-null   int64          
 6   duration_ms      6300 non-null   int64          
 7   explicit         6300 non-null   bool           
 8   duration_hhmmss  6300 non-null   timedelta64[ns]
dtypes: bool(1), int64(2), object(5), timedelta64[ns](1)
memory usage: 400.0+ KB


Here, the id and name columns only identify individual tracks and are not useful for aggregate analysis, so we will ignore them.

#### 1. Genre

In [46]:
df_spotify['genre'].value_counts()

genre
acoustic             50
new-age              50
punk                 50
psych-rock           50
progressive-house    50
power-pop            50
post-dubstep         50
pop-film             50
pop                  50
piano                50
philippines-opm      50
party                50
pagode               50
opera                50
new-release          50
mpb                  50
r-n-b                50
movies               50
minimal-techno       50
metalcore            50
metal-misc           50
metal                50
mandopop             50
malay                50
latino               50
latin                50
kids                 50
k-pop                50
jazz                 50
j-rock               50
punk-rock            50
rainy-day            50
afrobeat             50
songwriter           50
work-out             50
turkish              50
trip-hop             50
trance               50
techno               50
tango                50
synth-pop            50
swedish   