### We will clean the main dataset "data.csv" in this notebook

The dataset we selected is, in general, very tidy. No column has missing data. Numerical columns are typed as either int64 or float64 and text columns as object, so we expect few, if any, obstacles during the analysis.

The only thing worth mentioning in the cleaning process is that the artists column contains list-like values. A few spot checks show that when a list holds more than one artist, those artists are collaborators on the same song. This makes our analysis tricky, because there is no principled way to split the credit or to pick one of them as the sole creator of a particular song. Since they all worked on the song, they all deserve credit. For the sake of our analysis, we will therefore leave the lists as they are throughout the project. As a consequence, the same artist, for example "Jay Z", can appear multiple times: once for each of his own songs, and again in every list where he is one of several contributors.
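To make that consequence concrete, here is a minimal sketch of what "every collaborator gets credit" means for counting. It assumes the artists values are stringified Python lists, as the preview further down suggests. The toy rows and the name "Artist X" are hypothetical and are not taken from data.csv. The notebook itself leaves the column untouched.

```python
import ast
import pandas as pd

# Toy rows standing in for data.csv (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "artists": ["['Jay Z']", "['Jay Z', 'Artist X']", "['Artist X']"],
})

# Assumption: each value is a string that looks like a Python list
parsed = toy["artists"].apply(ast.literal_eval)

# Work on a copy: give every listed artist a row of their own
credits = toy.assign(artist=parsed).explode("artist")

# Each collaborator is counted once per song they appear on
counts = credits["artist"].value_counts()
print(counts)
```

Song B is counted for both of its artists, so per-artist totals can add up to more than the number of songs.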

In [1]:
# Dependencies
import pandas as pd
import numpy as np

In [2]:
# Load data
file = '../../uncleaned_data/data.csv'
df = pd.read_csv(file)
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [3]:
# Take a look at all the columns and check if there's any NaN in the df
df.isna().any()

acousticness        False
artists             False
danceability        False
duration_ms         False
energy              False
explicit            False
id                  False
instrumentalness    False
key                 False
liveness            False
loudness            False
mode                False
name                False
popularity          False
release_date        False
speechiness         False
tempo               False
valence             False
year                False
dtype: bool

In [4]:
# Doesn't look like there's any NaN
# Check data types of each column
df.dtypes

acousticness        float64
artists              object
danceability        float64
duration_ms           int64
energy              float64
explicit              int64
id                   object
instrumentalness    float64
key                   int64
liveness            float64
loudness            float64
mode                  int64
name                 object
popularity            int64
release_date         object
speechiness         float64
tempo               float64
valence             float64
year                  int64
dtype: object
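
The two checks above can also be condensed into a few summary values. This is a minimal sketch on a toy frame built from two rows of the preview. The frame `toy` and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Two rows copied from the preview above, reduced to three columns
toy = pd.DataFrame({
    "acousticness": [0.991, 0.643],
    "duration_ms": [168333, 150200],
    "name": ["Keep A Song In Your Soul", "I Put A Spell On You"],
})

# Total number of missing cells across the whole frame
total_missing = int(toy.isna().sum().sum())

# Split the columns into numeric and non-numeric (text) groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(total_missing, numeric_cols, text_cols)
```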

##### The "artists" column holds list-like values, and an empty list would not be flagged as NaN, so we also check whether any row is missing its artist names.

In [5]:
df.loc[df['artists'] == '[]', 'artists'].count()

0

Good: no row has an empty artist list.
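
The comparison above only matches the exact string '[]'. As a hedged sketch, a slightly stricter check could parse each value and also catch variants such as '[ ]' or a list holding only a blank name. It assumes the values are stringified Python lists. The sample series and the helper `has_no_artist` are hypothetical and not part of the notebook.

```python
import ast
import pandas as pd

# Hypothetical sample; the last two values slip past an exact '[]' comparison
sample = pd.Series(["['Mamie Smith']", "[]", "[ ]", "['']"], name="artists")

def has_no_artist(raw: str) -> bool:
    """Return True when the parsed list is empty or holds only blank names."""
    names = ast.literal_eval(raw)
    return not any(str(n).strip() for n in names)

exact_matches = int((sample == "[]").sum())
missing = sample.apply(has_no_artist)

print(exact_matches, int(missing.sum()))
```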

In [6]:
# Our analysis does not focus on dates, so drop the "release_date" column
df.drop('release_date', axis=1, inplace=True)
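
# Side note: re-running the cell above raises a KeyError, because the column is already gone.
# A minimal sketch of a re-runnable variant, using the `errors="ignore"` option of `DataFrame.drop`.
# The frame `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical one-row stand-in for df
toy = pd.DataFrame({"name": ["Golfing Papa"], "release_date": ["1920"], "year": [1920]})

# The first call removes the column; the second is a no-op instead of an error
toy = toy.drop(columns="release_date", errors="ignore")
toy = toy.drop(columns="release_date", errors="ignore")

print(toy.columns.tolist())
```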

In [7]:
# Check cleaned dataset again
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,0.0768,122.076,0.299,1920


In [8]:
# Export file as a CSV, without index, but with header
df.to_csv("../../cleaned_data/cleaned_data.csv", index=False, header=True)
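
As a quick sanity check on the export, the sketch below writes a toy frame with the same `index=False, header=True` options and reads it back. It uses a temporary directory rather than the project's cleaned_data folder. The frame `toy` and the variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned df
toy = pd.DataFrame({"artists": ["['Mamie Smith']", "['Mixe']"], "year": [1920, 1920]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_data.csv"
    # Same export options as the cell above
    toy.to_csv(out_path, index=False, header=True)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column comes back on reload
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```

Had the index been written, reloading the file would typically produce an extra "Unnamed: 0" column.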