# Data Cleanup

Now that we have all of the albums and their features, we can use these features to clean up our dataset even further. Some albums do not need to be considered because they don't offer any new information.

In [1]:
import pandas as pd
import sqlite3
import album_lists

db_con = sqlite3.connect('../albums.db')
df = pd.read_sql_query('SELECT * FROM albums', db_con)
df.drop(columns='index', inplace=True)

## Live Albums

Live albums are simply performance recordings of the artist's songs and don't offer any unique music. To clean them up, we can use the `liveness` feature. After going through the dataset, I noticed that some albums with a `liveness` value between 0.4 and 0.5 are live and need to be removed, while the rest of the albums in that range are not live and should be kept. Live and studio albums overlap in this range because of how the instrumentals in the albums were mixed, so the cell below removes only the albums listed in `album_lists.albums_to_remove`.

In [3]:
df = df.loc[~(df['spotify_id'].isin(album_lists.albums_to_remove))]
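
One way to review an ambiguous band like this by hand is to pull out only the rows in that range and sort them by `liveness`. The following is a minimal sketch on toy data, not the actual review that was done: the `name` column and all sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the albums table. Only `spotify_id` and `liveness`
# come from the notebook; `name` and every value here are made up.
toy = pd.DataFrame({
    'spotify_id': ['a1', 'a2', 'a3', 'a4'],
    'name': ['Studio LP', 'Live at the Arena', 'Loud Studio LP', 'Quiet LP'],
    'liveness': [0.12, 0.47, 0.43, 0.39],
})

# Select the rows whose liveness falls in the ambiguous band [0.4, 0.5).
in_band = (toy['liveness'] >= 0.4) & (toy['liveness'] < 0.5)
ambiguous = toy.loc[in_band].sort_values('liveness', ascending=False)

# Print the candidates so each one can be checked manually.
print(ambiguous[['spotify_id', 'name', 'liveness']])
```

Albums confirmed as live during such a review would then go into a list like `album_lists.albums_to_remove`.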

Similarly, most albums with a `liveness` value of 0.5 or greater are live albums, but a few are not, and these must be kept.

In [5]:
df = df.loc[(df['liveness'] < 0.5) | ((df['liveness'] >= 0.5) & (df['spotify_id'].isin(album_lists.albums_to_keep)))]
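
To make the logic of this filter easier to follow, here is a small sketch on toy data that names each condition separately. The IDs, values, and the contents of `albums_to_keep` are hypothetical. For rows with a non-null `liveness`, `below_threshold | allowlisted` selects the same rows as the expression in the cell above.

```python
import pandas as pd

# Toy data: s2 and s3 both sit above the 0.5 threshold.
toy = pd.DataFrame({
    'spotify_id': ['s1', 's2', 's3'],
    'liveness': [0.20, 0.80, 0.65],
})

# Hypothetical allowlist: s3 was checked by hand and is not a live album.
albums_to_keep = ['s3']

below_threshold = toy['liveness'] < 0.5               # assumed not live
allowlisted = toy['spotify_id'].isin(albums_to_keep)  # manual exceptions

# Keep a row if it is under the threshold or explicitly allowlisted.
filtered = toy.loc[below_threshold | allowlisted]
print(filtered)
```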

## Deluxe Albums

Some artists release deluxe versions of their albums, which add a few extra tracks or remixes on top of the tracks in the original album. These albums don't give us many new songs, so we can disregard them.

In [2]:
df = df.loc[~(df['spotify_id'].isin(album_lists.deluxe_albums))]

## Instrumental Albums

Some albums in the dataset are collections of the instrumentals used in the artist's tracks. These can also be removed, as they are not helpful for our recommendations.

In [4]:
df = df.loc[~(df['spotify_id'].isin(album_lists.instrumental_albums))]
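
The deluxe and instrumental removals follow the same exclusion pattern, so they could also be applied in a single pass. The sketch below assumes two small hypothetical ID lists and toy data. It also reports how many rows were dropped, which is a cheap sanity check against typos in the lists.

```python
import pandas as pd

toy = pd.DataFrame({'spotify_id': ['x1', 'x2', 'x3', 'x4']})

# Hypothetical stand-ins for the lists in the album_lists module.
deluxe_albums = ['x2']
instrumental_albums = ['x4']

# Merge the exclusion lists into one set and filter once.
excluded = set(deluxe_albums) | set(instrumental_albums)
kept = toy.loc[~toy['spotify_id'].isin(excluded)]

print(f'removed {len(toy) - len(kept)} of {len(toy)} albums')
```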

Finally, overwrite the `albums` table in the database and close the connection.

In [6]:
df.to_sql('albums', db_con, if_exists='replace')
db_con.close()
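
With its default settings, `to_sql` also writes the DataFrame index as a column, and for an unnamed index that column is called `index`. That is consistent with the `index` column dropped right after loading the table at the top of this notebook. The round trip below illustrates this behavior with an in-memory database and toy data. It is a sketch and does not touch `albums.db`.

```python
import sqlite3

import pandas as pd

# In-memory database so the example has no side effects.
con = sqlite3.connect(':memory:')
toy = pd.DataFrame({'spotify_id': ['a', 'b'], 'liveness': [0.1, 0.9]})

# Default index=True stores the DataFrame index as an extra column.
toy.to_sql('albums', con, if_exists='replace')
round_trip = pd.read_sql_query('SELECT * FROM albums', con)
print(list(round_trip.columns))

# Passing index=False writes only the data columns.
toy.to_sql('albums_no_index', con, if_exists='replace', index=False)
no_index = pd.read_sql_query('SELECT * FROM albums_no_index', con)
print(list(no_index.columns))

con.close()
```

Note that switching the notebook itself to `index=False` would also mean removing the `drop(columns='index')` step, because that column would then no longer exist when the table is read back.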