In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import dask.dataframe as dd
import seaborn as sns
from seaborn import distplot
import sys

# Stage 2: Data Cleaning and Processing

In this stage, I took the raw data retrieved in the previous stage and cleaned and manipulated it to prepare for the analysis and data modelling stages that follow.

In [6]:
bts = pd.read_csv('data/bts_songs.csv')
exo = pd.read_csv("data/exo_songs.csv")
twice = pd.read_csv("data/twice_songs.csv")
bp = pd.read_csv("data/blackpink_songs.csv")
shinee = pd.read_csv("data/shinee_songs.csv")
rv = pd.read_csv("data/rv_songs.csv")

I skimmed through the data and realized that, for each artist, there were some songs that should not be used for this project.

In [7]:
bts = bts.drop([78, 156, 162, 167, 218, 232, 241, 245]).reset_index(drop=True)

The BTS data included several skits that are not really songs, so I removed them from the dataset. I looked up the index of each skit individually and dropped those rows manually.
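
Hard-coded row indices stop being valid if the raw CSV is ever retrieved again in a different order. As a hedged alternative, the skits could be located by title instead. The sketch below uses a made-up toy table and assumes that skit titles contain the word "skit", which may not hold for every track in the real data.

```python
import pandas as pd

# Toy stand-in for the BTS table (invented rows, not the real dataset).
toy_bts = pd.DataFrame({
    "name": ["Song A", "Skit: Backstage", "Song B", "skit"],
    "album": ["Album 1", "Album 1", "Album 2", "Album 2"],
})

# Assumption: every skit has "skit" somewhere in its title (any capitalization).
is_skit = toy_bts["name"].str.contains("skit", case=False, na=False)

# Keep everything that is not a skit and renumber the rows from 0.
toy_bts_clean = toy_bts[~is_skit].reset_index(drop=True)
print(toy_bts_clean["name"].tolist())
```

A title-based filter would still need a manual check, since a real song could have "skit" in its name.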

In [8]:
# cleaning data for exo songs
# removes last non-exo albums
exo = exo.iloc[:316]
to_remove = []
for label, row in exo.iterrows():
    if 'Live Album' in exo.loc[label, 'album']:
        to_remove.append(label)

In [9]:
exo = exo.drop(to_remove).reset_index(drop=True)

While retrieving the EXO data, I realized that many songs that did not belong to EXO had been retrieved unintentionally. These songs should not be included in the dataset, so I kept only the first 316 rows. EXO also had "Live Albums", which essentially contain duplicates of previously released songs with added sounds from live concerts. Those songs were streamed much less often, which meant the dataset contained similar songs with drastically different popularity scores. This would affect my analysis, so I removed those albums as well.
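
The two EXO steps (truncate, then drop live albums) can also be written without an explicit loop. This is a minimal sketch on invented rows; `n_exo_rows` is a hypothetical stand-in for the cutoff of 316 used in the notebook, and the album names are placeholders.

```python
import pandas as pd

# Toy stand-in for the EXO table (invented rows, placeholder album names).
toy_exo = pd.DataFrame({
    "name": ["Song A", "Song A - Live", "Song B", "Unrelated Song"],
    "album": ["Studio Album", "Concert Recording (Live Album)",
              "Studio Album", "Another Artist's Album"],
})

# Hypothetical cutoff: rows from this position onward are not EXO songs.
n_exo_rows = 3
toy_exo = toy_exo.iloc[:n_exo_rows]

# Boolean mask: True where the album title contains the literal text "Live Album".
is_live = toy_exo["album"].str.contains("Live Album", regex=False, na=False)

toy_exo_clean = toy_exo[~is_live].reset_index(drop=True)
print(toy_exo_clean["name"].tolist())
```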

In [10]:
to_remove = []
for label, row in shinee.iterrows():
    if ('concert' in shinee.loc[label, 'album']) or ('CONCERT' in shinee.loc[label, 'album']):
        to_remove.append(label)

In [11]:
shinee = shinee.drop(to_remove).reset_index(drop=True)

While looking at the music data of Shinee, I realized Shinee also had a few albums from their live concerts. Thus, I also removed them from the dataset.
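
The check above matches only the spellings `concert` and `CONCERT`, so an album titled with "Concert" would slip through. The sketch below, on invented album titles, compares that two-spelling check with a case-insensitive one. Whether any such album exists in the real Shinee data is not something I have verified.

```python
import pandas as pd

# Toy album titles (invented) covering three capitalizations.
toy_shinee = pd.DataFrame({
    "album": ["Studio Album", "World concert I", "WORLD CONCERT II", "The Concert Album"],
})

# Same logic as the loop in the notebook: two exact spellings.
two_spellings = (
    toy_shinee["album"].str.contains("concert", regex=False)
    | toy_shinee["album"].str.contains("CONCERT", regex=False)
)

# Case-insensitive variant: also catches "Concert".
any_case = toy_shinee["album"].str.contains("concert", case=False)

print(two_spellings.tolist())
print(any_case.tolist())
```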

In [12]:
to_remove = []
for label, row in twice.iterrows():
    if (twice.loc[label, 'album'] == 'Twicetagram') or ('Repackage' in twice.loc[label, 'album']):
        to_remove.append(label)

In [13]:
twice = twice.drop(to_remove).reset_index(drop=True)

While looking at Twice's dataset, I realized that there were albums that contained duplicate songs. I therefore found those albums and removed them.

In [14]:
# cleaning data for rv songs
rv = rv.iloc[:111]

to_remove = []
for label, row in rv.iterrows():
    if ('Hotel Del Luna' in rv.loc[label, 'album']):
        to_remove.append(label)

In [15]:
rv = rv.drop(to_remove).reset_index(drop=True)

For the Red Velvet dataset, I removed the album "Hotel Del Luna" because it is the soundtrack of a drama series for which Red Velvet sang the official songs. The popularity score of these songs may therefore be affected by factors other than their musical features, which would make the results inaccurate. Thus, I decided to remove this album.

In [16]:
# cleaning data for blackpink songs
to_remove = []
for label, row in bp.iterrows():
    if ('TOUR' in bp.loc[label, 'album']) or ('BLACKPINK IN YOUR AREA' in bp.loc[label, 'album']) or ('Viral 2020' in bp.loc[label, 'album']):
        to_remove.append(label)

In [17]:
bp = bp.drop(to_remove).reset_index(drop=True)

Blackpink also had a few albums from their live concerts (titles containing "TOUR"), which I removed along with the albums matching "BLACKPINK IN YOUR AREA" and "Viral 2020".
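
The same filter-by-album pattern is repeated for every artist, so it could be collected into one helper. The function below is a hypothetical sketch, not part of the original notebook: `drop_albums_containing` is a name I made up, and the toy rows are invented.

```python
import re
import pandas as pd

def drop_albums_containing(df, substrings, column="album"):
    """Return df without rows whose `column` contains any of `substrings`."""
    # Escape each substring so it is matched literally, then join with "|" (regex OR).
    pattern = "|".join(re.escape(s) for s in substrings)
    mask = df[column].str.contains(pattern, regex=True, na=False)
    return df[~mask].reset_index(drop=True)

# Toy stand-in for the Blackpink table (invented rows).
toy_bp = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "album": ["Studio Album", "Some TOUR Recording",
              "BLACKPINK IN YOUR AREA", "Viral 2020 Playlist"],
})

toy_bp_clean = drop_albums_containing(
    toy_bp, ["TOUR", "BLACKPINK IN YOUR AREA", "Viral 2020"]
)
print(toy_bp_clean["name"].tolist())
```

Keeping the substrings in a list per artist would also make it easier to see, in one place, which albums were excluded and why.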

In [18]:
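names = [bts, exo, twice, bp, shinee, rv]
for i in range(len(names)):
    # names[i] = names[i].drop(['Unnamed: 0', 'id', 'uri'], axis=1) 
    names[i] = names[i].drop_duplicates(subset='name', keep="first") # removes duplicates

# rebind the variables; assigning to names[i] alone leaves bts, exo, ... unchanged
bts, exo, twice, bp, shinee, rv = names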

After removing unwanted rows, I dropped songs with duplicate names within each artist's dataset, keeping the first occurrence. I also removed a few unwanted columns: "Unnamed: 0" (a column that is automatically included after data retrieval), "id" (not necessary), and "uri" (the Spotify URI of the song, which is also not needed).
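
In Python, assigning to `names[i]` replaces the list element but does not change what the variable `bts` (or any other) points to. One way to avoid that pitfall is to keep the frames in a dictionary. The sketch below uses invented toy tables; the `popularity` column name is an assumption, and `errors="ignore"` is used so that a column missing from one table does not raise.

```python
import pandas as pd

# Toy stand-ins for two of the artist tables (invented rows).
toy_frames = {
    "bts": pd.DataFrame({
        "Unnamed: 0": [0, 1, 2],
        "id": ["x", "y", "z"],
        "uri": ["u1", "u2", "u3"],
        "name": ["Song A", "Song A", "Song B"],
        "popularity": [70, 40, 65],
    }),
    "exo": pd.DataFrame({
        "id": ["p", "q"],
        "name": ["Song C", "Song D"],
        "popularity": [60, 55],
    }),
}

unwanted = ["Unnamed: 0", "id", "uri"]

# Build a new dictionary so the cleaned frames are the ones used afterwards.
cleaned = {
    artist: (
        df.drop(columns=unwanted, errors="ignore")      # skip columns that are absent
          .drop_duplicates(subset="name", keep="first")  # keep first copy of each title
          .reset_index(drop=True)
    )
    for artist, df in toy_frames.items()
}

print(cleaned["bts"])
```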

In [19]:
bts.to_csv('bts_data_cleaned.csv')
exo.to_csv('exo_data_cleaned.csv')
shinee.to_csv('shinee_data_cleaned.csv')
twice.to_csv('twice_data_cleaned.csv')
rv.to_csv('rv_data_cleaned.csv')
bp.to_csv('bp_data_cleaned.csv')

After cleaning the data, I saved each dataset as a new CSV file.
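
By default, `to_csv` writes the row index as an unnamed first column, and `read_csv` then loads it back as "Unnamed: 0". Passing `index=False` avoids this. The round trip below illustrates the difference with an in-memory buffer and an invented toy table; it is a sketch, and later stages that expect the extra column would need to be adjusted.

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["Song A", "Song B"], "album": ["X", "Y"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```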

## Summary of Stage 2:

- In this stage, I removed songs that I did not wish to include in my dataset.
- These included skits, songs from repackaged albums (duplicates), songs that were unintentionally scraped (errors), theme songs from a drama series (potentially confounded by other variables), and songs from live concerts.
- I also removed a few columns that were not necessary.
- Overall, this stage was important for preparing the dataset appropriately for further analysis.