![](2024-spotify-brand-assets-media-kit.jpg)

In [138]:
import pandas as pd

Reading the file with the default UTF-8 encoding raises `UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte`. To avoid it, pass `"latin1"` as the `encoding` argument.

In [139]:
df = pd.read_csv(
    filepath_or_buffer="spotify-2023.csv",
    encoding="latin1"
)
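
The encoding choice above can also be made by trying UTF-8 first and falling back only when decoding fails. The following is a minimal sketch under stated assumptions: `read_csv_with_fallback` is a hypothetical helper, and the temporary CSV is a toy file, not the Spotify dataset.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order; return the DataFrame and the encoding that worked."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


# Toy file (an assumption for illustration): "é" encoded as Latin-1 is a
# single byte 0xE9, which is not valid UTF-8 when followed by a comma.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("track_name,streams\nCafé,100\n".encode("latin1"))
    toy_path = handle.name

toy_df, used_encoding = read_csv_with_fallback(toy_path)
os.remove(toy_path)
print(used_encoding)
```

Note that `latin1` maps every possible byte to a character, so it never raises. If the file is actually in another encoding, the text may load without error but contain garbled characters, so it is worth spot-checking a few non-ASCII track names.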

#### Information on Features

- `track_name`: Name of the song
- `artist(s)_name`: Name of the artist(s) of the song
- `artist_count`: Number of artists contributing to the song
- `released_year`: Year when the song was released
- `released_month`: Month when the song was released
- `released_day`: Day of the month when the song was released
- `in_spotify_playlists`: Number of Spotify playlists the song is included in
- `in_spotify_charts`: Presence and rank of the song on Spotify charts
- `streams`: Total number of streams on Spotify
- `in_apple_playlists`: Number of Apple Music playlists the song is included in
- `in_apple_charts`: Presence and rank of the song on Apple Music charts
- `in_deezer_playlists`: Number of Deezer playlists the song is included in
- `in_deezer_charts`: Presence and rank of the song on Deezer charts
- `in_shazam_charts`: Presence and rank of the song on Shazam charts
- `bpm`: Beats per minute, a measure of song tempo
- `key`: Key of the song
- `mode`: Mode of the song (major or minor)
- `danceability_%`: Percentage indicating how suitable the song is for dancing
- `valence_%`: Positivity of the song's musical content
- `energy_%`: Perceived energy level of the song
- `acousticness_%`: Amount of acoustic sound in the song
- `instrumentalness_%`: Amount of instrumental content in the song
- `liveness_%`: Presence of live performance elements
- `speechiness_%`: Amount of spoken words in the song



In [140]:
df.head()

| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | ... | 125 | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | ... | 92 | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | ... | 138 | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | ... | 170 | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | ... | 144 | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 |


### 🧹 SECTION 1: Data Cleaning & Type Fixes

#### Duplicate & Consistency Check

In [141]:
df = df.drop_duplicates(
    subset=["track_name"]
)
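
Dropping duplicates on `track_name` alone also removes songs that share a title but come from different artists. Before dropping, it can help to see which kind of duplicate is present. The sketch below uses a toy frame with hypothetical song and artist names, not rows from the dataset.

```python
import pandas as pd

# Hypothetical rows: "Song A" is a true duplicate, while "Song B" is two
# different songs that happen to share a title.
toy = pd.DataFrame(
    {
        "track_name": ["Song A", "Song A", "Song B", "Song B"],
        "artist(s)_name": ["Artist X", "Artist X", "Artist Y", "Artist Z"],
    }
)

# keep=False marks every member of a duplicate group, not just the repeats.
by_name = toy.duplicated(subset=["track_name"], keep=False)
by_name_and_artist = toy.duplicated(
    subset=["track_name", "artist(s)_name"], keep=False
)

# Rows that a name-only drop would treat as duplicates even though the artists differ.
same_title_different_artist = toy.loc[by_name & ~by_name_and_artist]
print(same_title_different_artist)

# A stricter alternative: only drop rows where both title and artist match.
deduped = toy.drop_duplicates(subset=["track_name", "artist(s)_name"])
print(len(deduped))
```

Whether the stricter subset is preferable depends on the analysis. It keeps more rows, but it also keeps re-releases by the same artist only once.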

In [142]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 943 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            943 non-null    object
 1   artist(s)_name        943 non-null    object
 2   artist_count          943 non-null    int64 
 3   released_year         943 non-null    int64 
 4   released_month        943 non-null    int64 
 5   released_day          943 non-null    int64 
 6   in_spotify_playlists  943 non-null    int64 
 7   in_spotify_charts     943 non-null    int64 
 8   streams               943 non-null    object
 9   in_apple_playlists    943 non-null    int64 
 10  in_apple_charts       943 non-null    int64 
 11  in_deezer_playlists   943 non-null    object
 12  in_deezer_charts      943 non-null    int64 
 13  in_shazam_charts      893 non-null    object
 14  bpm                   943 non-null    int64 
 15  key                   851 non-null    object


#### Missing Values Strategy

In [143]:
for column in df.columns:
    print(
        f"Number of missing values for {column} is {df[column].isna().sum()}"
    )

Number of missing values for track_name is 0
Number of missing values for artist(s)_name is 0
Number of missing values for artist_count is 0
Number of missing values for released_year is 0
Number of missing values for released_month is 0
Number of missing values for released_day is 0
Number of missing values for in_spotify_playlists is 0
Number of missing values for in_spotify_charts is 0
Number of missing values for streams is 0
Number of missing values for in_apple_playlists is 0
Number of missing values for in_apple_charts is 0
Number of missing values for in_deezer_playlists is 0
Number of missing values for in_deezer_charts is 0
Number of missing values for in_shazam_charts is 50
Number of missing values for bpm is 0
Number of missing values for key is 92
Number of missing values for mode is 0
Number of missing values for danceability_% is 0
Number of missing values for valence_% is 0
Number of missing values for energy_% is 0
Number of missing values for acousticness_% is 0
Numbe
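
The per-column loop above prints every column, including those with no missing values. As a more compact alternative, `isna().sum()` can be computed on the whole frame and filtered. This is a sketch on a toy frame with made-up values, and the `share_%` column name is an assumption for illustration.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame(
    {
        "bpm": [125, 92, 138, 170],
        "key": ["B", None, "F", None],
        "in_shazam_charts": ["826", "382", None, "946"],
    }
)

# Count missing values per column in one call.
missing = toy.isna().sum()

# Keep only the columns that have gaps and show their share of all rows.
summary = pd.DataFrame(
    {"missing": missing, "share_%": (missing / len(toy) * 100).round(1)}
)
summary = summary.loc[summary["missing"] > 0].sort_values(
    "missing", ascending=False
)
print(summary)
```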

In [144]:
# NOTE: A filled-in 0 does not mean the song was absent from the Shazam charts.
#       It means that the actual chart value is unknown.
df["in_shazam_charts"] = df["in_shazam_charts"].fillna(0)

# NOTE: The `key` column is not used in this analysis, so drop it.
df = df.drop(labels="key", axis="columns")

#### Audit Data Types & Fixing Numeric Columns

In [145]:
for column in df.columns:
    print(f"{column}: {df.loc[0, column]} - {df[column].dtype}")

track_name: Seven (feat. Latto) (Explicit Ver.) - object
artist(s)_name: Latto, Jung Kook - object
artist_count: 2 - int64
released_year: 2023 - int64
released_month: 7 - int64
released_day: 14 - int64
in_spotify_playlists: 553 - int64
in_spotify_charts: 147 - int64
streams: 141381703 - object
in_apple_playlists: 43 - int64
in_apple_charts: 263 - int64
in_deezer_playlists: 45 - object
in_deezer_charts: 10 - int64
in_shazam_charts: 826 - object
bpm: 125 - int64
mode: Major - object
danceability_%: 80 - int64
valence_%: 89 - int64
energy_%: 83 - int64
acousticness_%: 31 - int64
instrumentalness_%: 0 - int64
liveness_%: 8 - int64
speechiness_%: 4 - int64


`streams`, `in_deezer_playlists` and `in_shazam_charts` clearly hold numbers, yet pandas stores them as `object`. I need to fix their types.

In [146]:
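# One row holds a malformed `streams` value (other fields fused into a single string); drop it.
mask = (df["streams"] == "BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3")
# .copy() makes the filtered frame independent, so later column assignments do not warn.
df = df.loc[~mask].copy()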
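
The malformed `streams` value above is filtered by matching its exact text. A more general way to find such rows, without knowing the bad string in advance, is to coerce the column to numbers and look at what fails to parse. This is a minimal sketch on a toy frame. The song names are hypothetical and the malformed entry is a shortened stand-in.

```python
import pandas as pd

# Toy frame: two parseable values and one shortened stand-in for the malformed row.
toy = pd.DataFrame(
    {
        "track_name": ["Song A", "Song B", "Song C"],
        "streams": ["141381703", "133,716,286", "BPM110KeyAModeMajor"],
    }
)

# Strip thousands separators, then turn anything non-numeric into NaN.
parsed = pd.to_numeric(
    toy["streams"].str.replace(",", "", regex=False), errors="coerce"
)

# Rows that had a value but could not be parsed as a number.
bad_rows = toy.loc[parsed.isna() & toy["streams"].notna()]
print(bad_rows)

# Keep only the parseable rows and store the numeric values.
clean = toy.loc[parsed.notna()].copy()
clean["streams"] = parsed.loc[parsed.notna()].astype("int64")
print(clean.dtypes)
```

Inspecting `bad_rows` before dropping anything makes it visible how many rows are affected and why.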

In [147]:
columns_to_fix = ["streams", "in_deezer_playlists", "in_shazam_charts"]

for column in columns_to_fix:
    df[column] = (
        df[column]
        .str.replace(",", "", regex=False)
        .astype(dtype="int64", errors="ignore")
    )

In [148]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 942 entries, 0 to 952
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            942 non-null    object
 1   artist(s)_name        942 non-null    object
 2   artist_count          942 non-null    int64 
 3   released_year         942 non-null    int64 
 4   released_month        942 non-null    int64 
 5   released_day          942 non-null    int64 
 6   in_spotify_playlists  942 non-null    int64 
 7   in_spotify_charts     942 non-null    int64 
 8   streams               942 non-null    int64 
 9   in_apple_playlists    942 non-null    int64 
 10  in_apple_charts       942 non-null    int64 
 11  in_deezer_playlists   942 non-null    int64 
 12  in_deezer_charts      942 non-null    int64 
 13  in_shazam_charts      892 non-null    object
 14  bpm                   942 non-null    int64 
 15  mode                  942 non-null    object
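
In the `df.info()` output above, `in_shazam_charts` is still `object` and again has only 892 non-null values. A likely cause is that the earlier `fillna(0)` put the integer `0` into a column of strings. The `.str` accessor returns `NaN` for non-string elements, and `astype(..., errors="ignore")` then silently returns the column unconverted. The sketch below reproduces this on a toy Series with made-up values and shows one possible fix, which is filling with the string `"0"` instead.

```python
import pandas as pd

# Toy column of strings with one missing value (values are made up).
raw = pd.Series(["826", "1,021", None], dtype=object)

# Reproduce the issue: the filled-in integer 0 is not a string,
# so .str.replace returns NaN for that element.
filled_with_int = raw.fillna(0)
stripped = filled_with_int.str.replace(",", "", regex=False)
print(stripped.isna().sum())

# One possible fix: fill with the string "0" so every element is a string,
# then strip separators and convert without suppressing errors.
fixed = (
    raw.fillna("0")
    .str.replace(",", "", regex=False)
    .astype("int64")
)
print(fixed.dtype)
```

Leaving out `errors="ignore"` means a failed conversion raises immediately instead of passing unnoticed.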


### 🎵 SECTION 2: Popularity & Streaming Analysis