## Introduction
All data were scraped from several Korean playlists on Spotify. Please see the process [here]().

The playlists are:
* https://open.spotify.com/playlist/3y9YbrxtaOMTURvMUX6qCn
* https://open.spotify.com/playlist/6VPXCgMKWIlQR2hxPTMfiE
* https://open.spotify.com/playlist/3kwb1LyzCSsLLacppOJQc8
* https://open.spotify.com/playlist/37dtOyRL9xpjbVR9gE0dY3

The scrape produces 5 CSV files that need to be merged into one dataset:
* **names.csv:** The names of the songs
* **artists.csv:** The artist(s) of the songs
* **popularity.csv:** The popularity of the songs
* **release_date.csv:** The release dates of the songs
* **features.csv:** The features of the songs
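
Because the five files share no key column and are later combined row by row, it can help to confirm that they all have the same number of rows before merging. The sketch below is only an illustration: `check_aligned` is a hypothetical helper, and the toy frames stand in for the real CSV files.

```python
import pandas as pd


def check_aligned(frames):
    """Return the row count of each named DataFrame, raising if they differ.

    `frames` is a dict mapping a label to a DataFrame (hypothetical helper).
    """
    lengths = {label: len(frame) for label, frame in frames.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"row counts differ: {lengths}")
    return lengths


# Toy stand-ins for names.csv and artists.csv
toy_names = pd.DataFrame({"name": ["All about you", "At the end"]})
toy_artists = pd.DataFrame({"artists": ["TAEYEON", "CHUNG HA"]})

lengths = check_aligned({"names": toy_names, "artists": toy_artists})
print(lengths)
```

With the real data, the same call could take the five dataframes read in below; a mismatch would suggest that a scraping step dropped or duplicated rows.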

In [1]:
# import essential libraries
import pandas as pd
import re
import numpy as np
import json
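# json_normalize is available at the top level since pandas 1.0;
# the pandas.io.json import path is deprecated
from pandas import json_normalize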

### Names of the songs

In [2]:
# read in names.csv
song_names = pd.read_csv('../data/kpop/names.csv')
print(song_names.shape)
song_names.head()

(2660, 1)


Unnamed: 0,name
0,"How can I love the heartbreak, you`re the one ..."
1,All about you
2,Can You See My Heart
3,At the end
4,All with You


### Artists of the songs

In [3]:
# read in artists.csv
song_artists = pd.read_csv('../data/kpop/artists.csv') 
print(song_artists.shape)
song_artists.head()

(2660, 1)


Unnamed: 0,artists
0,AKMU
1,TAEYEON
2,HEIZE
3,CHUNG HA
4,TAEYEON


### Popularity of the songs

In [4]:
# read in popularity.csv
song_popularity = pd.read_csv('../data/kpop/popularity.csv')
print(song_popularity.shape)
song_popularity.head()

(2660, 1)


Unnamed: 0,popularity
0,12
1,67
2,65
3,55
4,56


### Release dates of the songs

In [5]:
# read in release_date.csv
song_dates = pd.read_csv('../data/kpop/release_date.csv') 
print(song_dates.shape)
song_dates.head()

(2660, 1)


Unnamed: 0,release_date
0,2019-09-25
1,2019-07-21
2,2019-07-28
3,2019-08-03
4,2016-09-13


### Features of the songs

In [6]:
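# read in features.csv
song_features = pd.read_csv('../data/kpop/features.csv')
# reset to a clean 0..n-1 index so the later index-based join lines up
song_features = song_features.reset_index(drop=True)
# drop identifier/URL columns and time_signature, keeping the 12 features described in the codebook
song_features = song_features.drop(columns=['type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature'])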
print(song_features.shape)
song_features.head()

(2660, 12)


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,0.52,0.248,5.0,-8.675,1.0,0.0355,0.91,1e-06,0.118,0.228,129.205,290096.0
1,0.531,0.287,9.0,-7.091,1.0,0.0364,0.915,0.0,0.118,0.491,135.55,209482.0
2,0.398,0.165,9.0,-10.715,1.0,0.0354,0.885,0.0,0.102,0.125,134.808,225786.0
3,0.618,0.405,11.0,-5.808,1.0,0.0299,0.896,0.0,0.103,0.222,133.909,224284.0
4,0.247,0.41,6.0,-5.725,1.0,0.0331,0.756,3e-06,0.127,0.125,62.91,233940.0


## Merge Dataframes

In [7]:
# merge 5 dataframes
df = pd.merge(song_names, song_artists, how='inner', left_index=True, right_index=True)
df = df.join(song_popularity)
df = df.join(song_dates)
df = df.join(song_features)

# drop songs whose name repeats (keeping the first occurrence), then rows with missing values
df = df.drop_duplicates(subset='name')
df = df.dropna()
df = df.reset_index(drop=True)

# convert dates to datetime (unparseable values become NaT)
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# extract year
df['year'] = df['release_date'].dt.to_period('Y')

# drop the release_date column
df = df.drop(columns='release_date')
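
The date handling in the cell above relies on `errors='coerce'`, which turns any value that cannot be parsed into `NaT` instead of raising an error. Since `dropna()` runs before the conversion, such rows would stay in `df` with a missing year. The following is a minimal sketch on toy data (the unparseable value is made up for illustration) showing that behaviour and one way to clean up afterwards.

```python
import pandas as pd

# Toy release dates; the second value is deliberately unparseable.
raw_dates = pd.Series(["2019-09-25", "not a date"])

# errors='coerce' replaces anything that cannot be parsed with NaT.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Converting to a yearly period keeps NaT as a missing value.
years = parsed.dt.to_period("Y")

# Rows with a missing year can be dropped after the conversion.
clean_years = years.dropna()
print(clean_years)
```

Whether the real data contains such values is not shown here; checking `df['year'].isna().sum()` after the conversion would tell.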

### Data Codebook
* **name:** The name of the song.
* **artists:** The artist(s) of the song.
* **popularity:** The popularity of the song, from 0 to 100, with 100 being the most popular.
* **danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
* **key:** The key the track is in, encoded as an integer in standard pitch class notation (e.g. 0 = C, 1 = C♯/D♭, 2 = D). A value of -1 means no key was detected.
* **loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track. Values typically range between -60 and 0 dB.
* **mode:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
* **speechiness:** Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
* **acousticness:** A measure from 0.0 to 1.0 of whether the track is acoustic.
* **instrumentalness:** Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
* **liveness:** Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
* **valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **tempo:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **duration_ms:** Duration of the song in milliseconds.
* **year:** The release year of the song.
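
The `key` and `mode` columns are stored as numbers (pitch class 0 to 11 for `key`, and 1 or 0 for major or minor in `mode`), which is hard to read at a glance. As a rough sketch, they could be turned into readable labels like this; `PITCH_CLASSES` and `key_label` are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Pitch class notation: index 0 is C, 1 is C#/Db, and so on up to 11 (B).
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def key_label(key, mode):
    """Return a label such as 'F major' for a numeric key and mode."""
    if key < 0:  # -1 means no key was detected
        return "unknown"
    quality = "major" if int(mode) == 1 else "minor"
    return f"{PITCH_CLASSES[int(key)]} {quality}"


# Toy frame using the float encoding seen in features.csv
toy = pd.DataFrame({"key": [5.0, 9.0, 11.0], "mode": [1.0, 1.0, 0.0]})
toy["key_label"] = [key_label(k, m) for k, m in zip(toy["key"], toy["mode"])]
print(toy)
```

For example, the first song in the dataset has `key` 5.0 and `mode` 1.0, which this mapping reads as F major.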

In [8]:
# print df
print(df.shape)
df.head()

(2274, 16)


Unnamed: 0,name,artists,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,year
0,"How can I love the heartbreak, you`re the one ...",AKMU,12,0.52,0.248,5.0,-8.675,1.0,0.0355,0.91,1e-06,0.118,0.228,129.205,290096.0,2019
1,All about you,TAEYEON,67,0.531,0.287,9.0,-7.091,1.0,0.0364,0.915,0.0,0.118,0.491,135.55,209482.0,2019
2,Can You See My Heart,HEIZE,65,0.398,0.165,9.0,-10.715,1.0,0.0354,0.885,0.0,0.102,0.125,134.808,225786.0,2019
3,At the end,CHUNG HA,55,0.618,0.405,11.0,-5.808,1.0,0.0299,0.896,0.0,0.103,0.222,133.909,224284.0,2019
4,All with You,TAEYEON,56,0.247,0.41,6.0,-5.725,1.0,0.0331,0.756,3e-06,0.127,0.125,62.91,233940.0,2016


## Export to CSV

In [9]:
# write to csv file
df.to_csv('../data/kpop/kpop_songs.csv', index=False)
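
One detail worth checking after the export: `year` is a pandas period in memory but is written to the CSV as plain text. The sketch below round-trips a toy frame through an in-memory buffer (used here instead of the real file path, purely for illustration) to show what a later `pd.read_csv` gets back.

```python
import io

import pandas as pd

# Toy frame with a yearly period column, built the same way as df['year']
toy = pd.DataFrame({
    "name": ["All about you", "All with You"],
    "year": pd.to_datetime(["2019-07-21", "2016-09-13"]).to_period("Y"),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back the way a downstream notebook would
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.dtypes)
```

In this sketch the year comes back as an ordinary integer column, so downstream code should not expect period methods on it unless it converts the column again.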