# Spotify Song Data - Data Cleaning

This is the data cleaning notebook for a classification project that uses a dataset of Spotify song data to determine what features make a song popular on the platform, aka a hit song.

## Introduction

### Imports

In [1]:
# Regulars
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline




In [None]:
# tqdm progress bars (tqdm_notebook is deprecated in favor of tqdm.notebook)
from tqdm.notebook import tqdm
tqdm.pandas()

### Functions

In [None]:
def get_info(df):
    '''Simple function that takes in a full dataframe and prints
    some basic information about the contents of the dataframe.'''
    
    print('Shape of DataFrame:\n', df.shape)
    print('\nDataFrame Info:')
    df.info()  # info() prints its report directly and returns None
    print('\nNull Values Present:\n', df.isna().sum())

## Obtain

### Import Dataset

In [None]:
df = pd.read_csv('spotify_song_data.csv')
print(df.shape)
df.head()

## Data Cleaning

In [None]:
# Checking out some info
get_info(df)

### Name

Let's set the song name as the index for the dataframe.

In [None]:
df.set_index('name', inplace = True)
df.head()

### ID

I suppose we can drop the ID column as it's not really necessary for this project at the moment.

In [None]:
df.drop('id', axis = 1, inplace = True)
df.head()

### Null Values

In [None]:
# Checking nulls 
df.isna().sum()

No nulls here, so we're in the clear!

### Release Date

Let's change the release date to a datetime object.

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'])
df.head()
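
As a sanity check on the conversion, here is a minimal sketch of how unparseable dates could be surfaced instead of raising an error. The sample values are made up for illustration and are not taken from the dataset; `errors='coerce'` is a standard `pd.to_datetime` option that turns bad entries into `NaT`.

```python
import pandas as pd

# Toy stand-in for the release_date column; these values are invented.
raw_dates = pd.Series(['2020-01-15', '1999-07-04', 'not a date'])

# errors='coerce' converts anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors='coerce')

# Count how many entries failed to parse.
n_failed = int(parsed.isna().sum())
print(parsed)
print('Unparseable dates:', n_failed)
```

If the count is above zero on the real column, those rows are worth inspecting before moving on.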

### Year

Let's try a different approach here. At first, we changed the values of the year column to strings so that they wouldn't be affected by scaling. The drawback is that we wound up with a massive number of columns after one-hot encoding year as a categorical column. Let's try binning year into decades to cut down on these columns.

In [None]:
df['year'].unique()

In [None]:
# Bin each year into its decade label, e.g. 1987 -> '1980s'.
# Integer division finds the decade for the whole column at once,
# which avoids re-scanning the column for every row.
df['year'] = ((df['year'] // 10) * 10).astype(str) + 's'

In [None]:
df['year'].value_counts()
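
To see why binning helps, here is a small sketch comparing how many one-hot columns raw years produce versus decade labels. The years below are made up for illustration; the real column has many more distinct values, so the gap would be larger.

```python
import pandas as pd

# Toy years, invented for illustration only.
years = pd.Series([1921, 1925, 1987, 1989, 2001, 2015, 2019], name='year')

# Integer-divide by 10 to find the decade, then format it as a label such as '1980s'.
decades = ((years // 10) * 10).astype(str) + 's'

# One-hot encode both versions and compare the number of resulting columns.
raw_dummies = pd.get_dummies(years.astype(str))
decade_dummies = pd.get_dummies(decades)

print('Columns from raw years:', raw_dummies.shape[1])
print('Columns from decades:', decade_dummies.shape[1])
```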

### Artists

In [None]:
df['artists'].value_counts()

Keeping every artist in the artists column creates a problem for one-hot encoding: it expands to over 31k columns, which makes our models impractical to run. Let's do some exploration here and see if there's any way to bin artists in a sensible way.

In [None]:
# Number of distinct values in the artists column
df['artists'].nunique()

Wow, 31k+ unique artists are represented in this dataset. This presents an interesting dilemma. It doesn't seem likely that we can bin this column in a sensible way. We could separate the tracks into those with multiple artists and those with a single artist, but that doesn't take individual artist popularity into account.

For now, we'll drop the artists column and revisit it down the line once we have a few successful models running.

In [None]:
df.drop('artists', axis = 1, inplace = True)
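
For when we revisit this, here is a rough sketch of the single-versus-multiple artist flag mentioned above. It assumes each entry in the artists column is a string that looks like a Python list (for example `"['Artist A', 'Artist B']"`); that format and the sample rows are assumptions for illustration, not something verified against the dataset.

```python
import ast
import pandas as pd

# Hypothetical rows; assumes each entry is a string shaped like a Python list.
artists = pd.Series([
    "['Artist A']",
    "['Artist A', 'Artist B']",
    "['Artist C', 'Artist D', 'Artist E']",
])

# Parse each string into a real list and count its members.
n_artists = artists.apply(lambda s: len(ast.literal_eval(s)))

# 1 if the track credits more than one artist, else 0.
is_collab = (n_artists > 1).astype(int)
print(is_collab.tolist())
```

This collapses the column into a single binary feature, at the cost of ignoring who the artists actually are.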

### Popularity Column

This one is going to take some thought and research. We know that the popularity metric is a number from 0 to 100 (with 100 being the most popular) that is assigned to a song to denote its popularity. Spotify calculates this metric based on total streams, trends, and several other factors. First, we need to see what we're working with in terms of the value counts. Next, we need to decide what level of popularity constitutes a hit song and what level constitutes a dud.

<b> Note:</b> In the other notebook, we work on a multiclass version that creates 3 targets: hit, solid single, and dud.

In [None]:
# Placeholder Plot

fig, ax = plt.subplots(figsize = (24, 6))
sns.countplot(x = df['popularity'], ax = ax)
ax.set_title('Song Popularity Countplot')
ax.set_xlabel('Popularity')
ax.set_ylabel('Count')
plt.show()

> We can see from the figure above that an overwhelmingly large percentage of the songs have a popularity of 0, while the top of the scale holds an extremely small percentage. This is going to wreak havoc on class weights.
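
To make the class-weight concern concrete, here is a minimal sketch of the common "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`. The labels are made up to mimic a heavy imbalance; they are not the dataset's actual counts.

```python
import numpy as np

# Invented binary labels: 9 non-hits and 1 hit.
labels = np.array([0] * 9 + [1])

# "Balanced" weighting: n_samples / (n_classes * count_of_class).
classes, counts = np.unique(labels, return_counts=True)
weights = len(labels) / (len(classes) * counts)

class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The rarer the class, the larger its weight, so a very small share of hits translates into a very lopsided weighting.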

<b>From Spotify:</b><br>
The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

#### Binning Popularity

We'll make the following labels by binning the popularity metric:<br>
0 - Not a hit<br>
1 - Hit

In [None]:
# Using the following code with various values to check what our thresholds should be

df[df["popularity"] == 65]
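
Another way to compare candidate thresholds is to look at what share of songs each one would label a hit. The sketch below uses made-up popularity scores, and the candidate cutoffs other than 65 are arbitrary choices for illustration.

```python
import pandas as pd

# Invented popularity scores for illustration only.
popularity = pd.Series([0, 0, 0, 12, 30, 45, 64, 65, 70, 88])

# For each candidate cutoff, compute the share of tracks that would be labeled a hit.
# The boolean mask's mean is the fraction of True values.
hit_shares = {
    threshold: float((popularity >= threshold).mean())
    for threshold in (50, 65, 80)
}

for threshold, share in hit_shares.items():
    print(f'threshold {threshold}: {share:.0%} hits')
```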

> First we'll create strings and then label encode from there.  I'm sure there's an easier way, but this is what's worked for me to this point.

In [None]:
# Label songs at or above the threshold as 'Hit' and everything else as 'Dud'
df['popularity'] = np.where(df['popularity'] >= 65, 'Hit', 'Dud')

In [None]:
df['popularity'].value_counts()

In [None]:
# Now for encoding it for 1 and 0

df['popularity'] = df['popularity'].map({'Hit': 1, 'Dud': 0})

In [None]:
df['popularity'].value_counts()

## Save Clean DataFrame

In [None]:
df.head()

Now that we've successfully cleaned the data, we can save our clean dataframe to use in other notebooks.  

In [None]:
df.to_csv('clean_spotify_data.csv')
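
When loading the file in another notebook, the index and the datetime dtype are not restored automatically. Here is a small sketch of a round trip that keeps both; it uses a tiny made-up frame and an in-memory buffer rather than the real file, and the column set is simplified for illustration.

```python
import io
import pandas as pd

# Tiny invented frame shaped like the cleaned data: name index plus a datetime column.
clean = pd.DataFrame(
    {
        'release_date': pd.to_datetime(['2020-01-15', '1999-07-04']),
        'popularity': [1, 0],
    },
    index=pd.Index(['Song A', 'Song B'], name='name'),
)

# Write to an in-memory buffer instead of disk so the sketch leaves no files behind.
buffer = io.StringIO()
clean.to_csv(buffer)
buffer.seek(0)

# index_col restores the index; parse_dates restores the datetime dtype.
restored = pd.read_csv(buffer, index_col='name', parse_dates=['release_date'])
print(restored.dtypes)
```

With the real file, the same `index_col` and `parse_dates` arguments would be passed to `pd.read_csv('clean_spotify_data.csv', ...)`.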