## Introduction
All data were scraped from Mandarin-language playlists on Spotify. See this [notebook](https://github.com/tvo10/spotify-recommendation-system/blob/main/01_spotify_recommendation_system_scrape_data.ipynb) for how the metadata was scraped.

The playlists are:
* https://open.spotify.com/playlist/1cx0Gbqhb7rT3aHUQbQiTQ
* https://open.spotify.com/playlist/0iAm4XiG8zb6q3lWi4qtiF

Five CSV files need to be merged into one dataset:
* **names.csv:** The names of the songs
* **artists.csv:** The artist(s) of the songs
* **popularity.csv:** The popularity of the songs
* **release_date.csv:** The release dates of the songs
* **features.csv:** The features of the songs
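
As a rough sketch of how these files fit together, the snippet below builds tiny stand-in frames (the values are made up for illustration, not taken from the scraped data), checks that they have the same number of rows, and places them side by side. It assumes that row i of every file describes the same song; the helper name `combine_by_position` is hypothetical.

```python
import pandas as pd


def combine_by_position(frames):
    """Place frames side by side, assuming row i of each frame is the same song.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    lengths = {len(frame) for frame in frames}
    if len(lengths) != 1:
        raise ValueError(f"row counts differ: {sorted(lengths)}")
    # reset_index(drop=True) makes the alignment positional rather than label-based
    return pd.concat([frame.reset_index(drop=True) for frame in frames], axis=1)


# toy stand-ins for names.csv, artists.csv and popularity.csv
names = pd.DataFrame({"name": ["song a", "song b"]})
artists = pd.DataFrame({"artists": ["artist a", "artist b"]})
popularity = pd.DataFrame({"popularity": [10, 20]})

combined = combine_by_position([names, artists, popularity])
print(combined.shape)
```

Checking the row counts first turns a silent misalignment (for example, one file missing a row) into an explicit error.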

In [1]:
# import libraries
import pandas as pd
import re
import numpy as np
import json
from pandas import json_normalize  # public import path since pandas 1.0

### Names of the songs

In [2]:
# read in names.csv
song_names = pd.read_csv('../data/mandopop/names.csv')
print(song_names.shape)
song_names.head()

(2410, 1)


| | name |
|---|---|
| 0 | 晴天 |
| 1 | 零 |
| 2 | 寶貝 (In a Day) |
| 3 | 雨愛 |
| 4 | 掉了 |


### Artists of the songs

In [3]:
# read in artists.csv
song_artists = pd.read_csv('../data/mandopop/artists.csv') 
print(song_artists.shape)
song_artists.head()

(2410, 1)


| | artists |
|---|---|
| 0 | Jay Chou |
| 1 | Alan Kuo |
| 2 | Deserts Chang |
| 3 | Rainie Yang |
| 4 | A-Mei Chang |


### Popularity of the songs

In [4]:
# read in popularity.csv
song_popularity = pd.read_csv('../data/mandopop/popularity.csv')
print(song_popularity.shape)
song_popularity.head()

(2410, 1)


| | popularity |
|---|---|
| 0 | 61 |
| 1 | 42 |
| 2 | 44 |
| 3 | 55 |
| 4 | 0 |


### Release dates of the songs

In [5]:
# read in release_date.csv
song_dates = pd.read_csv('../data/mandopop/release_date.csv') 
print(song_dates.shape)
song_dates.head()

(2410, 1)


| | release_date |
|---|---|
| 0 | 2003-07-31 |
| 1 | 2005-08-12 |
| 2 | 2006-06-06 |
| 3 | 2009-12-29 |
| 4 | 2009 |


### Features of the songs

In [6]:
# read in features.csv, then drop the identifier columns and time_signature
song_features = pd.read_csv('../data/mandopop/features.csv')
song_features = song_features.drop(columns=['type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature'])
print(song_features.shape)
song_features.head()

(2410, 12)


| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.547 | 0.567 | 7 | -7.295 | 1 | 0.0242 | 0.276 | 0.000548 | 0.104 | 0.399 | 137.13 | 269747 |
| 1 | 0.494 | 0.565 | 3 | -4.958 | 0 | 0.0291 | 0.061 | 0.0 | 0.121 | 0.0989 | 120.026 | 279893 |
| 2 | 0.827 | 0.16 | 0 | -12.729 | 1 | 0.0483 | 0.887 | 0.0 | 0.105 | 0.388 | 119.891 | 145440 |
| 3 | 0.422 | 0.657 | 4 | -5.274 | 1 | 0.0292 | 0.214 | 0.0 | 0.129 | 0.218 | 159.957 | 261560 |
| 4 | 0.547 | 0.475 | 1 | -6.613 | 1 | 0.0278 | 0.811 | 0.0 | 0.0722 | 0.142 | 161.965 | 239560 |


## Merge Dataframes

In [7]:
# merge the 5 dataframes on their row index (assumes row i is the same song in every file)
df = pd.merge(song_names, song_artists, how='inner', left_index=True, right_index=True)
df = df.join(song_popularity)
df = df.join(song_dates)
df = df.join(song_features)

# drop rows whose name repeats an earlier row (first occurrence kept), then rows with missing values
# note: matching on name alone also drops same-titled songs by different artists
df = df.drop_duplicates(subset='name')
df = df.dropna()
df = df.reset_index(drop=True)

# convert dates to datetime; unparseable values become NaT
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# extract year
df['year'] = df['release_date'].dt.to_period('Y')

# drop the release_date column
df = df.drop(columns='release_date')
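
The `release_date` column mixes full dates (`2003-07-31`) with year-only values (`2009`), and how `pd.to_datetime` treats such mixed precision depends on the pandas version. One version-independent alternative, sketched here on made-up rows, is to pull the leading four-digit year out of the text directly. The function name `extract_year` is an assumption for illustration, and unlike the cell above it yields integer years rather than periods.

```python
import pandas as pd


def extract_year(dates: pd.Series) -> pd.Series:
    """Return the leading four-digit year as a nullable integer.

    Hypothetical helper; values without a leading year become <NA>.
    """
    years = dates.astype(str).str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(years, errors="coerce").astype("Int64")


# illustrative values covering full, year-only, year-month and missing dates
sample = pd.Series(["2003-07-31", "2009", "2005-08", None])
years = extract_year(sample)
print(years.tolist())
```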

### Data Codebook
* **name:** The name of the song.
* **artists:** The artist(s) of the song.
* **popularity:** The popularity of the song on Spotify, from 0 to 100, with 100 being the most popular.
* **danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
* **key:** The key the track is in, encoded as an integer in standard pitch-class notation (0 = C, 1 = C♯/D♭, 2 = D, and so on). A value of -1 means no key was detected.
* **loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track. Values typically range between -60 and 0 dB.
* **mode:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
* **speechiness:** Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
* **acousticness:** A measure from 0.0 to 1.0 of whether the track is acoustic.
* **instrumentalness:** Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
* **liveness:** Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
* **valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **tempo:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **duration_ms:** Duration of the song in milliseconds.
* **year:** The release year of the song.
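
Because `key` and `mode` are stored as integer codes, a small lookup can make individual rows easier to read. The sketch below assumes Spotify's pitch-class encoding (0 = C through 11 = B, -1 when no key was detected) and mode encoding (1 = major, 0 = minor); the helper `describe_key` is hypothetical.

```python
# pitch classes in order, index 0 = C ... index 11 = B
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Turn the integer key/mode codes into a readable label (hypothetical helper)."""
    if key == -1:
        return "unknown"  # assumed sentinel for "no key detected"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be between -1 and 11, got {key}")
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


# first row of the features table: key=7, mode=1
print(describe_key(7, 1))  # prints "G major"
```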

In [8]:
# print df
print(df.shape)
df.head()

(2039, 16)


| | name | artists | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 晴天 | Jay Chou | 61 | 0.547 | 0.567 | 7 | -7.295 | 1 | 0.0242 | 0.276 | 0.000548 | 0.104 | 0.399 | 137.13 | 269747 | 2003 |
| 1 | 零 | Alan Kuo | 42 | 0.494 | 0.565 | 3 | -4.958 | 0 | 0.0291 | 0.061 | 0.0 | 0.121 | 0.0989 | 120.026 | 279893 | 2005 |
| 2 | 寶貝 (In a Day) | Deserts Chang | 44 | 0.827 | 0.16 | 0 | -12.729 | 1 | 0.0483 | 0.887 | 0.0 | 0.105 | 0.388 | 119.891 | 145440 | 2006 |
| 3 | 雨愛 | Rainie Yang | 55 | 0.422 | 0.657 | 4 | -5.274 | 1 | 0.0292 | 0.214 | 0.0 | 0.129 | 0.218 | 159.957 | 261560 | 2009 |
| 4 | 掉了 | A-Mei Chang | 0 | 0.547 | 0.475 | 1 | -6.613 | 1 | 0.0278 | 0.811 | 0.0 | 0.0722 | 0.142 | 161.965 | 239560 | 2009 |
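
The row count falls from 2410 to 2039 once repeated names and missing values are removed. Since different artists can release songs with the same title, it may be worth checking how many rows a name-only rule removes that a name-plus-artist rule would keep. A sketch on made-up rows (none of these values come from the dataset):

```python
import pandas as pd

# toy rows: "song a" appears twice for artist x and once for artist y
toy = pd.DataFrame({
    "name": ["song a", "song a", "song a", "song b"],
    "artists": ["artist x", "artist x", "artist y", "artist z"],
})

# rule used in the merge step: one row per name
by_name = toy.drop_duplicates(subset="name")

# stricter alternative: one row per (name, artists) pair
by_name_and_artist = toy.drop_duplicates(subset=["name", "artists"])

# same-titled songs by other artists that the name-only rule discards
extra_dropped = len(by_name_and_artist) - len(by_name)
print(len(by_name), len(by_name_and_artist), extra_dropped)
```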


### Export to CSV

In [9]:
# write to csv file
df.to_csv('../data/mandopop/mandopop_songs.csv', index=False)
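
Reading the file back is a cheap way to confirm the export. Note that a period-typed `year` column is written as plain text such as `2003`, so it comes back as an integer rather than a period. The sketch below uses a temporary file and two rows modeled on the table above; the file name `songs.csv` is illustrative only.

```python
import os
import tempfile

import pandas as pd

# two rows modeled on the merged table, with a period-typed year column
toy = pd.DataFrame({"name": ["晴天", "零"], "popularity": [61, 42]})
toy["year"] = pd.to_datetime(pd.Series(["2003-07-31", "2005-08-12"])).dt.to_period("Y")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "songs.csv")
    # explicit UTF-8 keeps the Chinese song names intact on every platform
    toy.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

print(restored.dtypes)
```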