## Introduction
All data were scraped from the Vietnamese-language Spotify playlists listed below. See this [notebook](https://github.com/tvo10/spotify-recommendation-system/blob/main/01_spotify_recommendation_system_scrape_data.ipynb) for how the metadata was scraped.

The playlists are:
* https://open.spotify.com/playlist/1aTfe08sYQNkfCNQHX91yn
* https://open.spotify.com/playlist/7ctIQ3akGLoBN82xYXejZT
* https://open.spotify.com/playlist/36aF4KcKVuo3gkp7t3K4xB
* https://open.spotify.com/playlist/2nQSQ7n5IagPDuWvU9CnKX
* https://open.spotify.com/playlist/6fwQQ7wXnJnjCAFmfk7MI1
* https://open.spotify.com/playlist/6luqC1Np20Z0Ps576JFNj8
* https://open.spotify.com/playlist/28LU0Rr3eQSAekGD8XgUYS
* https://open.spotify.com/playlist/0mrqaVOe6KYEEcaSYhVK4a
* https://open.spotify.com/playlist/0lxrQEsFzKaCWNlw2cjbNf
* https://open.spotify.com/playlist/3Rso58hNRq63GTIFTiHLW7
* https://open.spotify.com/playlist/2gkbbPRL9A3mxQU9dbPKqx
* https://open.spotify.com/playlist/37i9dQZF1DX34s4fg4Zx3Z
* https://open.spotify.com/playlist/2InfSOdVSdsEZYy9Kz9eHu
* https://open.spotify.com/playlist/6Cq01g1QSDzNyq97Ryh4l9
* https://open.spotify.com/playlist/4j9GfFCZaGh8xbzcFRhY0o
* https://open.spotify.com/playlist/4pkof2WT0Spca1pX3j2dh4
* https://open.spotify.com/playlist/3rIUD7NO8nAe91GE4AtqZN
* https://open.spotify.com/playlist/2dm4foxI22LpTJBUed05PO
* https://open.spotify.com/playlist/4to6jjsvMscVt8Ct3TkBHT

The scraped data are split across 5 CSV files that need to be merged into one dataset:
* **names.csv:** The names of the songs
* **artists.csv:** The artist(s) of the songs
* **popularity.csv:** The popularity of the songs
* **release_date.csv:** The release dates of the songs
* **features.csv:** The features of the songs
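
Because the files are later joined purely by row position, it may be worth confirming that they all have the same number of rows before merging. The following is a minimal sketch under that assumption; `load_parts` is a hypothetical helper that is not part of the original notebook, and the directory passed to it is whatever folder holds the five files.

```python
from pathlib import Path

import pandas as pd

# File stems taken from the list above.
PARTS = ["names", "artists", "popularity", "release_date", "features"]


def load_parts(data_dir):
    """Read each CSV into a DataFrame and check that row counts agree.

    Hypothetical helper for illustration only.
    """
    frames = {part: pd.read_csv(Path(data_dir) / f"{part}.csv") for part in PARTS}
    lengths = {part: len(frame) for part, frame in frames.items()}
    # A positional join only makes sense if every file has the same length.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Row counts differ across files: {lengths}")
    return frames
```

A call such as `frames = load_parts('../data/vpop')` would then return a dictionary keyed by file stem, or raise if the files are out of step.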

In [1]:
# import essential libraries
import pandas as pd
import re
import numpy as np
import json
# json_normalize is exposed at the top level of pandas since version 1.0
from pandas import json_normalize

### Names of the songs

In [2]:
# read in names.csv
song_names = pd.read_csv('../data/vpop/names.csv')
print(song_names.shape)
song_names.head()

(3622, 1)


| | name |
|---|---|
| 0 | Em Dạo Này |
| 1 | Đâu Cần Một Bài Ca Tình Yêu |
| 2 | Chờ Anh Nhé (feat. Hoang Rob) |
| 3 | trời giấu trời mang đi |
| 4 | SAY |


### Artists of the songs

In [3]:
# read in artists.csv
song_artists = pd.read_csv('../data/vpop/artists.csv') 
print(song_artists.shape)
song_artists.head()

(3622, 1)


| | artists |
|---|---|
| 0 | Ngọt |
| 1 | Tien Tien |
| 2 | Hoang Dung |
| 3 | AMEE |
| 4 | Lena |


### Popularity of the songs

In [4]:
# read in popularity.csv
song_popularity = pd.read_csv('../data/vpop/popularity.csv')
print(song_popularity.shape)
song_popularity.head()

(3622, 1)


| | popularity |
|---|---|
| 0 | 45 |
| 1 | 49 |
| 2 | 48 |
| 3 | 50 |
| 4 | 41 |


### Release dates of the songs

In [5]:
# read in release_date.csv
song_dates = pd.read_csv('../data/vpop/release_date.csv') 
print(song_dates.shape)
song_dates.head()

(3622, 1)


| | release_date |
|---|---|
| 0 | 2017-09-23 |
| 1 | 2019-09-19 |
| 2 | 2018-11-09 |
| 3 | 2020-06-28 |
| 4 | 2020-05-08 |
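
The sample rows all carry full `YYYY-MM-DD` dates, but Spotify can also report release dates with only year or year-month precision (for example `2017` or `2017-09`). If such values are present, parsing the column as full dates may turn some of them into missing values. The sketch below sidesteps that by reading the year from the first four characters; `extract_year` is a hypothetical helper, and it assumes every valid value starts with a four-digit year.

```python
import pandas as pd


def extract_year(dates):
    """Return the release year as a nullable integer Series.

    Assumes each valid value starts with a four-digit year, such as
    '2017', '2017-09', or '2017-09-23'. Anything else becomes <NA>.
    """
    # Pull out the leading four digits; non-matching values become NaN.
    years = dates.astype(str).str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(years, errors="coerce").astype("Int64")


sample = pd.Series(["2017-09-23", "2019-09", "2018", None])
print(extract_year(sample).tolist())
```

Keeping the year as a nullable integer also makes it straightforward to count how many rows lack a usable date before deciding whether to drop them.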


### Features of the songs

In [6]:
# read in features.csv and drop the identifier columns that are not used as features
song_features = pd.read_csv('../data/vpop/features.csv')
song_features = song_features.drop(columns=['type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature'])
print(song_features.shape)
song_features.head()

(3622, 12)


| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.417 | 0.356 | 9 | -8.747 | 1 | 0.0351 | 0.752 | 0.0 | 0.194 | 0.341 | 158.112 | 201828 |
| 1 | 0.524 | 0.663 | 11 | -5.473 | 1 | 0.0729 | 0.692 | 8.5e-05 | 0.0824 | 0.568 | 93.381 | 195514 |
| 2 | 0.409 | 0.353 | 0 | -7.286 | 1 | 0.0263 | 0.763 | 0.0 | 0.113 | 0.157 | 82.032 | 322499 |
| 3 | 0.438 | 0.408 | 3 | -9.857 | 1 | 0.0705 | 0.688 | 0.0 | 0.121 | 0.373 | 92.053 | 253500 |
| 4 | 0.709 | 0.388 | 8 | -6.058 | 1 | 0.343 | 0.662 | 0.0 | 0.0974 | 0.712 | 76.787 | 177039 |


## Merge Dataframes

In [7]:
# merge 5 dataframes
df = pd.merge(song_names, song_artists, how='inner', left_index=True, right_index=True)
df = df.join(song_popularity)
df = df.join(song_dates)
df = df.join(song_features)

# keep the first occurrence of each song name, then drop rows with missing values
df = df.drop_duplicates(subset='name')
df = df.dropna()
df = df.reset_index(drop=True)

# convert dates to datetime (unparseable dates become NaT)
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# extract the release year
df['year'] = df['release_date'].dt.to_period('Y')

# drop the release_date column
df = df.drop(columns='release_date')
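
The cell above deduplicates on `name` alone, so two different songs that happen to share a title would collapse into a single row. Whether that is acceptable depends on the use case. As a sketch of one alternative, assuming the frames have the same length and are aligned by row position, the frames can be concatenated column-wise and deduplicated on the `(name, artists)` pair instead. `combine_songs` is a hypothetical helper, not the notebook's method, and the rows below are made-up toy data.

```python
import pandas as pd


def combine_songs(frames, dedupe_on=("name", "artists")):
    """Concatenate row-aligned frames side by side, then drop duplicate
    songs and rows with missing values.

    Hypothetical helper for illustration only.
    """
    # Reset each index so that rows line up strictly by position.
    combined = pd.concat([f.reset_index(drop=True) for f in frames], axis=1)
    combined = combined.drop_duplicates(subset=list(dedupe_on)).dropna()
    return combined.reset_index(drop=True)


# Toy rows: the same title three times, twice by the same artist.
names = pd.DataFrame({"name": ["SAY", "SAY", "SAY"]})
artists = pd.DataFrame({"artists": ["Lena", "Lena", "Another Artist"]})
popularity = pd.DataFrame({"popularity": [41, 41, 30]})

songs = combine_songs([names, artists, popularity])
print(songs)
```

With these toy rows, the exact repeat is removed while the same title by a different artist is kept.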

### Data Codebook
* **name:** The name of the song.
* **artists:** The artist(s) of the song.
* **popularity:** The popularity of the song as reported by Spotify, on a scale from 0 to 100, where 100 is the most popular.
* **danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
* **key:** The key the track is in, encoded as an integer in standard pitch-class notation (0 = C, 1 = C♯/D♭, 2 = D, and so on). A value of -1 means no key was detected.
* **loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track. Values typically range between -60 and 0 dB.
* **mode:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
* **speechiness:** Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
* **acousticness:** A measure from 0.0 to 1.0 of whether the track is acoustic.
* **instrumentalness:** Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
* **liveness:** Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
* **valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **tempo:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **duration_ms:** The duration of the song in milliseconds.
* **year:** The release year of the song.
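
Since `key` and `mode` are integer codes, a small lookup can make them easier to read during exploration. The sketch below assumes Spotify's documented encoding (pitch-class integers 0 to 11 for `key`, -1 when no key was detected, and 1 = major, 0 = minor for `mode`); `describe_key` is a hypothetical helper that is not part of the notebook.

```python
# Pitch classes in order, so that the list index matches the `key` value.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_key(key, mode):
    """Turn integer `key` and `mode` codes into a label such as 'A major'."""
    if not 0 <= key <= 11:
        return "unknown"  # covers -1, which marks an undetected key
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


# First sample row of the dataset: key=9, mode=1
print(describe_key(9, 1))
```

The same function could be applied row-wise to build a readable column, for example with a list comprehension over `zip(df['key'], df['mode'])`.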

In [8]:
# print df
print(df.shape)
df.head()

(2037, 16)


| | name | artists | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Em Dạo Này | Ngọt | 45 | 0.417 | 0.356 | 9 | -8.747 | 1 | 0.0351 | 0.752 | 0.0 | 0.194 | 0.341 | 158.112 | 201828 | 2017 |
| 1 | Đâu Cần Một Bài Ca Tình Yêu | Tien Tien | 49 | 0.524 | 0.663 | 11 | -5.473 | 1 | 0.0729 | 0.692 | 8.5e-05 | 0.0824 | 0.568 | 93.381 | 195514 | 2019 |
| 2 | Chờ Anh Nhé (feat. Hoang Rob) | Hoang Dung | 48 | 0.409 | 0.353 | 0 | -7.286 | 1 | 0.0263 | 0.763 | 0.0 | 0.113 | 0.157 | 82.032 | 322499 | 2018 |
| 3 | trời giấu trời mang đi | AMEE | 50 | 0.438 | 0.408 | 3 | -9.857 | 1 | 0.0705 | 0.688 | 0.0 | 0.121 | 0.373 | 92.053 | 253500 | 2020 |
| 4 | SAY | Lena | 41 | 0.709 | 0.388 | 8 | -6.058 | 1 | 0.343 | 0.662 | 0.0 | 0.0974 | 0.712 | 76.787 | 177039 | 2020 |


### Export to CSV

In [9]:
# write to csv file
df.to_csv('../data/vpop/vpop_songs.csv', index=False)
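
One thing to keep in mind when loading the exported file later: the `year` column is a pandas period in memory but is written to CSV as plain text such as `2017`, so it will most likely come back as an ordinary integer column. The sketch below illustrates the round trip on two toy rows written to a temporary directory, so it does not touch the real `vpop_songs.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two toy rows that mimic the exported columns `name` and `year`.
toy = pd.DataFrame({
    "name": ["Em Dạo Này", "SAY"],
    "year": pd.to_datetime(pd.Series(["2017-09-23", "2020-05-08"])).dt.to_period("Y"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vpop_songs.csv"
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Compare the dtypes before and after the round trip.
print(toy.dtypes)
print(restored.dtypes)
```

If a period or datetime type is needed downstream, the column has to be converted again after loading.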