# Final Project: End-to-End Data Cleaning Workflow

### Billboard Chart Dataset

> Author: *Haneul Kim* (ttcielott)

- Repository: https://github.com/ttcielott/haneul_kim/tree/master/Data_Cleaning_Project_01_Billboard_Chart
- Overview & Assessment: https://github.com/ttcielott/haneul_kim/blob/master/Data_Cleaning_Project_01_Billboard_Chart/InitialAssessment.pdf

In [1]:
import pandas as pd

In [2]:
billboard=pd.read_csv('billboard.csv')

### Raw Data View

In [3]:
billboard.head()

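|  | year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | ... | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | 87 | 82.0 | 72.0 | 77.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | 91 | 87.0 | 92.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | 81 | 70.0 | 68.0 | 67.0 | 66.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | 76 | 76.0 | 72.0 | 69.0 | 67.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | 57 | 34.0 | 25.0 | 17.0 | 17.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |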


In [4]:
billboard.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 81 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   year          317 non-null    int64  
 1   artist        317 non-null    object 
 2   track         317 non-null    object 
 3   time          317 non-null    object 
 4   date.entered  317 non-null    object 
 5   wk1           317 non-null    int64  
 6   wk2           312 non-null    float64
 7   wk3           307 non-null    float64
 8   wk4           300 non-null    float64
 9   wk5           292 non-null    float64
 10  wk6           280 non-null    float64
 11  wk7           269 non-null    float64
 12  wk8           260 non-null    float64
 13  wk9           253 non-null    float64
 14  wk10          244 non-null    float64
 15  wk11          236 non-null    float64
 16  wk12          222 non-null    float64
 17  wk13          210 non-null    float64
 18  wk14          204 non-null    

## Data Cleaning


All of my data cleaning can be done with pandas, since the dataset is small and consists of a single table. To clean the data and put it in a usable format for the end user, I performed the following operations:

- Convert the 'wk1' column from int64 to a float type, so that it matches the other weekly columns.
- Reshape the weekly columns (wk1 to wk76) into rows, one per track and week.
- Remove rows without a rating to make the dataset more storage-friendly.
- Assign an ID to each track (song) and sort the table by ID.

In [5]:
#convert the column 'wk1' from int64 to float (downcast='float' yields float32, as shown below).
billboard['wk1']=pd.to_numeric(billboard['wk1'], downcast='float')

In [6]:
billboard['wk1'].dtype

dtype('float32')
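
The `float32` result comes from `downcast='float'`, which asks pandas for the smallest float dtype that can hold the values. The sketch below contrasts this with an explicit `astype('float64')`. It uses a small made-up Series (`wk1` here is a stand-in, not the real column) and is meant as an illustration rather than a change to the workflow.

```python
import pandas as pd

# Toy stand-in for the 'wk1' column (values are illustrative only).
wk1 = pd.Series([87, 91, 81], name="wk1")

# downcast='float' picks the smallest float dtype that fits: float32.
downcast = pd.to_numeric(wk1, downcast="float")

# astype requests float64 explicitly.
explicit = wk1.astype("float64")

print(downcast.dtype, explicit.dtype)
```

Either dtype works for the reshaping step that follows, because the melted `rating` column is reported as float64 once `wk1` is combined with the other weekly columns.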

In [7]:
#turn the weekly columns (wk1 to wk76) into rows.
billboard_melt=billboard.melt(
    id_vars=['year','artist','track','time','date.entered'],
    var_name='week',
    value_name='rating'
)
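
To make the reshaping concrete, here is a minimal sketch of `melt` on a hypothetical two-song table. The frame `wide` and its values are invented for illustration. Every column not listed in `id_vars` is stacked into a pair of `week` and `rating` columns, which is why 317 songs with 76 weekly columns produce 317 × 76 = 24092 rows.

```python
import pandas as pd

# Hypothetical two-song table in the same wide layout as billboard.csv.
wide = pd.DataFrame({
    "track": ["Song A", "Song B"],
    "wk1": [87.0, 91.0],
    "wk2": [82.0, None],
})

# Columns not listed in id_vars are stacked into (week, rating) pairs.
long = wide.melt(id_vars=["track"], var_name="week", value_name="rating")

# One row per track and week: 2 tracks x 2 weeks = 4 rows.
print(long)
```

The missing `wk2` value for "Song B" survives the reshape as a row with a null rating, which is what the later `dropna` step removes.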

In [8]:
billboard_melt.head()

|  | year | artist | track | time | date.entered | week | rating |
|---|---|---|---|---|---|---|---|
| 0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk1 | 87.0 |
| 1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | wk1 | 91.0 |
| 2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | wk1 | 81.0 |
| 3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | wk1 | 76.0 |
| 4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | wk1 | 57.0 |


In [9]:
billboard_melt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24092 entries, 0 to 24091
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   year          24092 non-null  int64  
 1   artist        24092 non-null  object 
 2   track         24092 non-null  object 
 3   time          24092 non-null  object 
 4   date.entered  24092 non-null  object 
 5   week          24092 non-null  object 
 6   rating        5307 non-null   float64
dtypes: float64(1), int64(1), object(5)
memory usage: 1.3+ MB


In [10]:
#drop rows with a missing value; 'rating' is the only column that contains nulls.
billboard_melt.dropna(inplace=True)

In [11]:
#create new dataframe containing only song information.
billboard_songs=billboard_melt[['year','artist','track','time','date.entered']]

In [12]:
#remove duplicates.
billboard_songs=billboard_songs.drop_duplicates()

In [13]:
billboard_songs.shape

(317, 5)

In [14]:
#generate the column 'id'
billboard_songs['id']=range(len(billboard_songs))

In [15]:
billboard_songs.head()

|  | year | artist | track | time | date.entered | id |
|---|---|---|---|---|---|---|
| 0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | 0 |
| 1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | 1 |
| 2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | 2 |
| 3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | 3 |
| 4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | 4 |


The column 'id' has been added. It serves as the unique identifier of each track (song).
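
The ID assignment can be sketched on toy data as follows. The frame `long` and its column values are hypothetical. The idea is the same as in the cells above: keep only the song-level columns, drop repeated rows, and number what remains. The `.copy()` call is an optional precaution so that the new column is added to an independent frame.

```python
import pandas as pd

# Hypothetical long-format table: one row per track and week.
long = pd.DataFrame({
    "artist": ["X", "X", "Y"],
    "track": ["Song A", "Song A", "Song B"],
    "week": ["wk1", "wk2", "wk1"],
})

# Keep the song-level columns and collapse repeated rows to one per song.
songs = long[["artist", "track"]].drop_duplicates().copy()

# Number the songs 0, 1, 2, ... in order of first appearance.
songs["id"] = range(len(songs))

print(songs)
```

Checking `songs["id"].is_unique` afterwards is a cheap way to confirm that each song received exactly one identifier.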

In [16]:
#export 'billboard_songs.csv'.
billboard_songs.to_csv('billboard_songs.csv',index=False)

In [17]:
#merge 'billboard_songs' with 'billboard_melt' to create a new dataframe called 'billboard_ratings'.
billboard_ratings=billboard_melt.merge(
    billboard_songs, on=['year','artist','track','time','date.entered']
)
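
A merge like this relies on every key combination appearing only once in the songs table. As a hedged sketch, and not something the original notebook does, pandas can check that assumption through the `validate` argument of `merge`. The frames `songs` and `ratings` below are toy stand-ins keyed on a single `track` column.

```python
import pandas as pd

# Toy stand-ins for billboard_songs and billboard_melt.
songs = pd.DataFrame({"track": ["Song A", "Song B"], "id": [0, 1]})
ratings = pd.DataFrame({
    "track": ["Song A", "Song A", "Song B"],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87.0, 82.0, 91.0],
})

# validate='many_to_one' raises pandas.errors.MergeError if a track
# appears more than once in `songs`, instead of silently adding rows.
merged = ratings.merge(songs, on="track", how="inner", validate="many_to_one")

print(merged)
```

With unique keys the merged frame has the same number of rows as `ratings`, mirroring the 5307 rows reported for `billboard_ratings`.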

In [18]:
billboard_ratings.head()

|  | year | artist | track | time | date.entered | week | rating | id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk1 | 87.0 | 0 |
| 1 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk2 | 82.0 | 0 |
| 2 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk3 | 72.0 | 0 |
| 3 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk4 | 77.0 | 0 |
| 4 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk5 | 87.0 | 0 |


In [19]:
#select 4 columns that are needed for analysis. 
billboard_ratings=billboard_ratings[['id','date.entered','week','rating']]

In [20]:
billboard_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5307 entries, 0 to 5306
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            5307 non-null   int64  
 1   date.entered  5307 non-null   object 
 2   week          5307 non-null   object 
 3   rating        5307 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 207.3+ KB
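
The `week` column is still stored as text (`object`), for example `wk1`. If numeric week values are needed downstream, one possible follow-up, which is not part of the original workflow, is to extract the digits into an integer column. The column name `week_number` and the toy frame below are assumptions made for this sketch.

```python
import pandas as pd

# Toy stand-in for billboard_ratings.
ratings = pd.DataFrame({
    "id": [0, 0, 1],
    "week": ["wk1", "wk2", "wk10"],
    "rating": [87.0, 82.0, 91.0],
})

# Pull the digits out of labels such as 'wk10' and store them as integers.
ratings["week_number"] = (
    ratings["week"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(ratings)
```

An integer column sorts numerically, whereas the text labels sort alphabetically and would place `wk10` before `wk2`.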


### Clean Data View

In [21]:
billboard_ratings.head()

|  | id | date.entered | week | rating |
|---|---|---|---|---|
| 0 | 0 | 2000-02-26 | wk1 | 87.0 |
| 1 | 0 | 2000-02-26 | wk2 | 82.0 |
| 2 | 0 | 2000-02-26 | wk3 | 72.0 |
| 3 | 0 | 2000-02-26 | wk4 | 77.0 |
| 4 | 0 | 2000-02-26 | wk5 | 87.0 |


In [22]:
billboard_ratings.to_csv('billboard_ratings.csv', index=False)
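
Both exports pass `index=False`. The sketch below shows why that matters, using an in-memory buffer and a toy frame instead of the real files: without it, the row index is written as an extra unnamed column, which pandas labels `Unnamed: 0` when the file is read back.

```python
import io

import pandas as pd

# Toy stand-in for billboard_ratings.
ratings = pd.DataFrame({
    "id": [0, 0],
    "week": ["wk1", "wk2"],
    "rating": [87.0, 82.0],
})

# Export without the index, then read the CSV text back.
without_index = io.StringIO()
ratings.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Export with the default index=True for comparison.
with_index = io.StringIO()
ratings.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```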