# Data Exploration, Cleaning, Analysis

This notebook is used for the initial data exploration and analysis. 

The data consists of two CSVs:
- Track IDs and track information
- Session IDs and user behavior information

After the analysis, a cleaned, merged DataFrame containing all of the above information was converted to a CSV to use for future modeling. 

#### Import necessary libraries

In [1]:
import pandas as pd
import seaborn
import numpy as np
import matplotlib.pyplot as plt

#### Read in the data

In [3]:
data = pd.read_csv('data/training_set/log_mini.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'data/training_set/log_mini.csv'

In [None]:
data.head()

In [None]:
data.info()

In [None]:
tracks = pd.read_csv('data/track_features/tf_mini.csv')

In [None]:
tracks.head()

In [None]:
tracks.info()

#### Join the data sets on `track_id`. The resulting dataframe will have the session information as well as the track information.

In [None]:
df = pd.merge(data, tracks, how='left', left_on='track_id_clean', right_on='track_id')

In [None]:
# No longer need both track columns
df.drop('track_id', axis=1, inplace=True)
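
A left join silently keeps log rows that have no matching track, so it can be worth checking the match rate before dropping the key column. The sketch below uses toy stand-ins (`sessions`, `track_features`, and their values are hypothetical, not the real CSVs) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the session log and the track features (hypothetical values).
sessions = pd.DataFrame({
    'session_id': ['s1', 's1', 's2'],
    'track_id_clean': ['t1', 't2', 't9'],  # 't9' has no row in track_features
})
track_features = pd.DataFrame({
    'track_id': ['t1', 't2', 't3'],
    'duration': [180.0, 240.0, 200.0],
})

merged = pd.merge(
    sessions, track_features,
    how='left', left_on='track_id_clean', right_on='track_id',
    validate='many_to_one',  # raises MergeError if track_id repeats on the right
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows flagged 'left_only' are log entries without track features.
unmatched = (merged['_merge'] == 'left_only').sum()
print(f"{unmatched} of {len(merged)} rows have no track features")
```

If the real data shows unmatched rows, their track columns will be `NaN` after the merge and may need handling before modeling.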

#### Set the target to `skipped`. `skip_2` is the ground truth, so drop the other skip columns and rename columns for brevity and readability.

In [None]:
df.drop(labels=['skip_1', 'skip_3', 'not_skipped'], axis=1, inplace=True)

In [None]:
df.rename(columns={'skip_2': 'skipped', 
                    'track_id_clean' : 'track_id',
                    'hist_user_behavior_n_seekfwd' : 'hist_seekfwd',
                    'hist_user_behavior_n_seekback' : 'hist_seekback',
                    'hist_user_behavior_is_shuffle' : 'hist_shuffle',
                    'us_popularity_estimate' : 'popularity'
                    }, inplace=True)

#### Convert the pause columns to a single boolean `pause_before_play` and drop the other pause columns.

In [None]:
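# Invert the 0/1 flag so that 1 means there was a pause before play.
# replace() returns a new Series, so the result has to be assigned.
df['pause_before_play'] = df['no_pause_before_play'].replace({0: 1, 1: 0})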
df.drop(labels=['no_pause_before_play', 'short_pause_before_play', 'long_pause_before_play'], axis=1, inplace=True)

#### Convert boolean columns to numeric values.

In [None]:
bool_col = ['skipped', 'hist_shuffle', 'premium']
df[bool_col] = df[bool_col].astype(int)

#### The hours of the day were originally 0-23; I grouped them into sections of the day.

In [None]:
"""
Encoding for time of day

1-   5-9 Early AM
2-   10-15 Late AM/Early PM
3-   16-19 Evening
4-   20-23 LatePM
5-   0-4 Night
"""

hour_of_day_dict = {
    
    range(5, 10): 'EarlyAM',
    range(10, 16): 'LateAMEarlyPM',
    range(16, 20): 'Evening',
    range(20, 24): 'LatePM',
    range(0, 5): 'Night'
    
}

df['hour_of_day'] = df['hour_of_day'].replace(hour_of_day_dict)
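
Using `range` objects as dictionary keys depends on how `replace` interprets them, which may differ between pandas versions. As a hedged alternative, the same grouping can be sketched with `pd.cut`. The `hours` Series below is toy input, and the bin edges are my reading of the ranges above.

```python
import pandas as pd

# Toy input covering every hour 0-23.
hours = pd.Series(range(24), name='hour_of_day')

# Bins are right-inclusive: (-1, 4], (4, 9], (9, 15], (15, 19], (19, 23]
bins = [-1, 4, 9, 15, 19, 23]
labels = ['Night', 'EarlyAM', 'LateAMEarlyPM', 'Evening', 'LatePM']

part_of_day = pd.cut(hours, bins=bins, labels=labels)
print(part_of_day.value_counts().sort_index())
```

Note that `pd.cut` returns a categorical Series rather than plain strings, and any hour outside 0-23 becomes `NaN` instead of passing through unchanged.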

## Data Analysis

In [None]:
df['skipped'].value_counts(normalize=True)

Target is pretty balanced!

#### Viewing the values of the `date` column, it appears that most of these datapoints are from Sunday, July 15th, 2018 (70%). This is a limitation because there isn't much data for other days of the week, and skipping habits plausibly differ by day.

In [None]:
# Date column to datetime in case needed

df['date'] = pd.to_datetime(df['date'])

In [None]:
df['date'].value_counts(normalize=True)
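
Since the concern is coverage across days of the week, a weekday breakdown may be more direct than raw dates. This is a minimal sketch on toy dates (the four values are made up for illustration) using `Series.dt.day_name()`.

```python
import pandas as pd

# Toy dates: three rows on 2018-07-15 and one on 2018-07-14 (hypothetical sample).
dates = pd.Series(pd.to_datetime(
    ['2018-07-15', '2018-07-15', '2018-07-15', '2018-07-14']
))

# Share of rows per weekday name.
by_weekday = dates.dt.day_name().value_counts(normalize=True)
print(by_weekday)
```

On the real data, the same call on `df['date']` would show how much of the sample each weekday contributes.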

In [None]:
# Get info per session instead of per row
df.pivot_table(index=['session_id'], values=['premium']).value_counts()

There are about 80% premium users, and 20% free users. 
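
The per-session share can differ from the per-row share because sessions vary in length. The sketch below shows both on toy data. The `rows` frame is hypothetical, and it assumes `premium` is constant within a session, which the source does not state.

```python
import pandas as pd

# Toy log: five sessions of different lengths (hypothetical values).
rows = pd.DataFrame({
    'session_id': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'e'],
    'premium':    [1, 1, 1, 0, 0, 1, 1, 1],
})

# One value per session (assumes premium does not change within a session).
per_session = rows.groupby('session_id')['premium'].first()

session_share = per_session.value_counts(normalize=True)  # share of sessions
row_share = rows['premium'].value_counts(normalize=True)  # share of log rows
print(session_share, row_share, sep='\n')
```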

#### Heat map to view correlation of track features:

In [None]:
seaborn.heatmap(tracks.drop(['track_id', 'mode'], axis=1).corr())

There is some correlation between beat strength, danceability, and energy. Acoustic vector 0 is negatively correlated with all three.
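
A heat map is easy to misread by eye, so it can help to list the strongest pairs numerically. The sketch below does this on synthetic data. The `features` frame and its relationships are invented for illustration and do not reproduce the real track features.

```python
import numpy as np
import pandas as pd

# Synthetic features with built-in relationships (illustration only).
rng = np.random.default_rng(0)
n = 200
energy = rng.normal(size=n)
features = pd.DataFrame({
    'energy': energy,
    'beat_strength': energy + rng.normal(scale=0.3, size=n),
    'acoustic_vector_0': -energy + rng.normal(scale=0.3, size=n),
    'tempo': rng.normal(size=n),
})

corr = features.corr()

# Keep the upper triangle only (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to the notebook's data, `features` would be replaced by `tracks.drop(['track_id', 'mode'], axis=1)`.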

In [None]:
df.describe()

Some things that stick out after looking at the stats:
- The maximums for seek forward and seek back are oddly high
- One of the songs is close to 20 min; the shortest song is 30 sec
- The earliest release date is in the 1950s

To address the outliers for seek forward and seek back, I encoded both columns as booleans below. A value of 1 means the user fast-forwarded or rewound at least once.

In [None]:
# 1 if the user seeked forward at least once, else 0 (no upper bound assumed)
df['hist_seekfwd'] = (df['hist_seekfwd'] > 0).astype(int)

In [None]:
# 1 if the user seeked backward at least once, else 0 (no upper bound assumed)
df['hist_seekback'] = (df['hist_seekback'] > 0).astype(int)

#### The following section views the current value counts and their percentages

In [None]:
for column in df.columns:
    print(f"\n{column.title()}:")
    print(df[column].value_counts())
    print(df[column].value_counts(normalize=True))

#### Check the energy of a song against the time of day, then encode the categorical variables `hour_of_day` and `context_type` with pandas `get_dummies()`.

In [None]:
df.pivot_table(index=['hour_of_day'], values=['energy', 'valence', 'danceability'])

In [None]:
df.pivot_table(index=['hour_of_day'], values=['skipped'])

There is a drop in skipping in the morning.

In [None]:
df.pivot_table(index=['session_position'], values=['skipped']).T

People are skipping slightly more later into their session.

In [None]:
df.pivot_table(index=['premium'], values=['skipped'])

#### Encoding the `hour_of_day` and `context_type` categorical variables.

In [None]:
hour_of_day_dummies = pd.get_dummies(df['hour_of_day'])

In [None]:
df = pd.concat([df, hour_of_day_dummies], axis=1)

In [None]:
df.drop('hour_of_day', axis=1, inplace=True)

In [None]:
context_dummies = pd.get_dummies(df['context_type'])

In [None]:
df = pd.concat([df, context_dummies], axis=1)
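
For comparison, `pd.get_dummies` can encode several columns in one call and drop the originals itself. The sketch below runs on a toy frame (`toy` and its values are hypothetical). It produces prefixed column names such as `hour_of_day_Night`, unlike the unprefixed names used in this notebook, so later steps would need to match. Passing `dtype=int` requests 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with the two categorical columns (hypothetical values).
toy = pd.DataFrame({
    'session_position': [1, 2, 3],
    'hour_of_day': ['Night', 'EarlyAM', 'Evening'],
    'context_type': ['radio', 'catalog', 'radio'],
})

# Encode both columns at once; the original columns are replaced by dummies.
encoded = pd.get_dummies(toy, columns=['hour_of_day', 'context_type'], dtype=int)
print(encoded.columns.tolist())
```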

#### Dropping `context_switch` and `hist_user_behavior_reason_end` as they don't add much value to this analysis.

In [None]:
df.drop(labels=['hist_user_behavior_reason_end', 'context_switch'], axis=1, inplace=True)

#### Convert release year to `years_old`.

In [None]:
# Age of the track relative to 2018, the year the data was collected
df['years_old'] = 2018 - df['release_year']

In [None]:
to_drop = ['context_type', 'bounciness', 'liveness', 'mechanism', 'mode', 'release_year']

In [None]:
df.drop(labels=to_drop, axis=1, inplace=True)

#### Reorder column names so all track information is on the tail end.

In [None]:
df.columns

In [None]:
df = df[['session_id', 'session_position', 'session_length', 'date', 'EarlyAM', 'Evening',
       'LateAMEarlyPM', 'LatePM', 'Night', 'track_id',
       'skipped', 'hist_seekfwd', 'hist_seekback', 'hist_shuffle', 
       'premium', 'hist_user_behavior_reason_start', 'pause_before_play', 'catalog', 'charts',
       'editorial_playlist', 'personalized_playlist', 'radio',
       'user_collection', 'duration',
       'years_old', 'popularity', 'acousticness', 'beat_strength',
       'danceability', 'dyn_range_mean', 'energy', 'flatness',
       'instrumentalness', 'key', 'loudness', 'organism', 'speechiness',
       'tempo', 'time_signature', 'valence', 'acoustic_vector_0',
       'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
       'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6',
       'acoustic_vector_7']]

### Convert final dataframe to CSV to use for modeling.

In [None]:
#df.to_csv('tracks_session_clean.csv')