# Exploratory Data Analysis: Spotify Listening

As both an avid Spotify user and a lover of all-things-data, I have always wanted to deep-dive into my own listening habits via data science. Were there any patterns that I had never noticed? Does my streaming data match how I personally feel about my music taste? What kind of listener am I?

To explore who I am as a music listener, I requested the available data for my personal Spotify account. This notebook walks through my exploratory data analysis, with a particular focus on my streaming history.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import json
import time
import matplotlib.pyplot as plt

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

## Loading the dataset

In [None]:
def load_streaming_data(src):
    """
    Returns a Pandas DataFrame object with the current listening history 
    from Spotify data pull. The src argument is the folder in which the data is stored.
    """
    with open(f'../data/personal/{src}/StreamingHistory0.json') as file:
        data = json.load(file)
    # Second history file; loading StreamingHistory0.json twice would double-count every stream
    with open(f'../data/personal/{src}/StreamingHistory1.json') as file:
        data1 = json.load(file)
        
    df0 = pd.DataFrame(data)
    df1 = pd.DataFrame(data1)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df = pd.concat([df0, df1], ignore_index=True)
    df['secPlayed'] = round(df['msPlayed'] / 1000, 1)
    df = df.drop(columns=['msPlayed'])

    STRTIME_FORMAT = '%Y-%m-%d %H:%M'
    df['endTime'] = pd.to_datetime(df['endTime'], format=STRTIME_FORMAT)
    
    return df
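
The loader above reads a fixed set of files. If the export contains a different number of history files, a more general version could discover them by pattern instead. The following is a minimal sketch under that assumption; `load_all_streaming_files` is a hypothetical helper, not part of the original notebook, and it only handles the concatenation step.

```python
import json
from pathlib import Path

import pandas as pd


def load_all_streaming_files(folder):
    """Concatenate every StreamingHistory*.json file found in `folder`.

    Hypothetical helper: assumes each file holds a JSON list of stream records.
    """
    # Lexicographic sort; row order does not matter for the aggregations used later
    paths = sorted(Path(folder).glob('StreamingHistory*.json'))
    frames = [pd.DataFrame(json.loads(p.read_text(encoding='utf-8'))) for p in paths]
    # pd.concat raises ValueError if no files matched, which surfaces a wrong folder early
    return pd.concat(frames, ignore_index=True)
```

The `secPlayed` and `endTime` conversions from `load_streaming_data` would still need to be applied to the result.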

In [None]:
df = load_streaming_data("summer20")

In [None]:
print('INFORMATION ABOUT DATA: \n') 
df.info()  # info() prints directly; wrapping it in print() also prints "None"
df.head()

## General Exploratory Data Analysis

### Daily Listening Time

In [None]:
# numeric_only=True sums only secPlayed and skips the text columns
daily = df.groupby(pd.Grouper(key='endTime', freq='D')).sum(numeric_only=True)
daily['minPlayed'] = daily['secPlayed'] / 60
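
To make the daily grouping concrete, here is a small sketch on made-up data (the `toy` frame and its values are purely illustrative). It shows that `pd.Grouper` with `freq='D'` creates one row per calendar day, including days with no streams, which appear as zero.

```python
import pandas as pd

# Three made-up streams: two on Jan 1, none on Jan 2, one on Jan 3
toy = pd.DataFrame({
    'endTime': pd.to_datetime(['2020-01-01 09:00', '2020-01-01 21:30', '2020-01-03 12:00']),
    'secPlayed': [120.0, 180.0, 60.0],
})

# One bin per calendar day between the first and last stream
toy_daily = toy.groupby(pd.Grouper(key='endTime', freq='D'))['secPlayed'].sum()
print(toy_daily)
```

Because silent days count as zeros, they pull down the mean reported by `describe()`.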

In [None]:
# Showing elementary statistics
daily.describe()

In [None]:
plt.figure(figsize=(20,10))
sns.lineplot(x=daily.index, y='minPlayed', data=daily)
plt.xticks(rotation=45);
plt.xlabel('Date', fontsize='x-large')
plt.ylabel('Time Played (Minutes)', fontsize='x-large')
plt.title('Daily Listening Times (Spotify)', pad=20);
#plt.savefig('daily_listening', bbox_inches='tight', dpi=300);

### Weekly Listening Time

In [None]:
# numeric_only=True sums only secPlayed and skips the text columns
weekly = df.groupby(pd.Grouper(key='endTime', freq='W-MON')).sum(numeric_only=True)
weekly['hrPlayed'] = weekly['secPlayed'] / 3600

In [None]:
plt.figure(figsize=(12,10))
sns.lineplot(x=weekly.index, y='hrPlayed', data=weekly)
plt.xticks(rotation=45);
plt.xlabel('Date', fontsize='x-large')
plt.ylabel('Time Played (Hours)', fontsize='x-large')
plt.title('Weekly Listening Times (Spotify)', pad=20);
#plt.savefig('weekly_listening', bbox_inches='tight', dpi=300)

When was the highest peak in the chart above? Somewhere between March 2020 and May 2020. Let's find out the exact week!

In [None]:
weekly['hrPlayed'].sort_values(ascending=False).head(1)
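
When reading the result, it helps to know how `W-MON` bins are labeled. The sketch below uses made-up streams (the `toy` frame is illustrative only) to show that each weekly bin is labeled by the Monday that closes it, and that `idxmax` returns the label of the peak week directly.

```python
import pandas as pd

# Made-up streams: a Tuesday, the following Monday, and the Tuesday after that
toy = pd.DataFrame({
    'endTime': pd.to_datetime(['2020-03-31 10:00', '2020-04-06 23:00', '2020-04-07 08:00']),
    'secPlayed': [3600.0, 7200.0, 1800.0],
})

# Weekly totals in hours; each bin runs Tuesday through Monday and carries the Monday's date
toy_weekly = toy.groupby(pd.Grouper(key='endTime', freq='W-MON'))['secPlayed'].sum() / 3600

peak_week = toy_weekly.idxmax()   # label of the week with the most listening
peak_hours = toy_weekly.max()     # hours played in that week
print(f'Peak week ending {peak_week.date()}: {peak_hours:.1f} hours')
```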

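It looks like during the week ending Monday, April 6th (`2020-04-06`), I played a total of about 35.4 hours (`35.436806`) of music and podcasts! Note that `W-MON` bins are labeled by the Monday that closes them, so this covers Tuesday, March 31st through Monday, April 6th.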

# Podcast Analysis

Since (at the time of downloading this data) the music and podcast listening history is not separate, it is necessary to make this distinction in order to perform EDA on each separately.

In [None]:
# Artists in the streaming history that are actually podcasts

PODCAST_ARTISTS = ['VIEWS with David Dobrik and Jason Nash', 'The California Golden Bearcast',
                   'Whiskey Ginger w/ Andrew Santino', 'The Tiny Meat Gang Podcast',
                   'Stuff You Should Know', 'Patriots Unfiltered', 'Cal Rivals Excellent Podcast Experience',
                   'Curious with Josh Peck', 'Locked On Patriots - Daily Podcast On The New England Patriots',
                   'Skotcast with Jeff Wittek & Scotty Sire', 'Anything Goes with Emma Chamberlain',
                   'Call Her Daddy', 'Office Ladies', 'That Made All the Difference', 'Pardon My Take',
                   'My Favorite Theorem', 'The James Altucher Show', 'Zane and Heath: Unfiltered',
                   'With Authority', 'The Numberphile Podcast', 'Billionaires Getting Interviewed',
                   'Elon Musk Interviews', 'Cover 3 College Football Podcast']

In [None]:
podcasts = df[df['artistName'].isin(PODCAST_ARTISTS)].reset_index(drop=True)
print('INFORMATION ABOUT PODCASTS: \n') 
podcasts.info()  # info() prints directly; wrapping it in print() also prints "None"
podcasts.head()

In [None]:
# numeric_only=True avoids trying to sum the endTime and trackName columns
podcasts_top_10 = podcasts.groupby('artistName').sum(numeric_only=True).sort_values('secPlayed', ascending=False)[:10]

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x='secPlayed', y=podcasts_top_10.index, data=podcasts_top_10, orient='h', palette='plasma')
plt.xlabel('Time Played (seconds)', fontsize='x-large')
plt.ylabel('Podcast Title', fontsize='x-large')
plt.title('Top 10 Podcasts (Listening Time)', pad=20, fontsize='xx-large')
#plt.savefig('top_10_podcasts', bbox_inches='tight', dpi=300)

### Views with David Dobrik and Jason Nash

This is the podcast I listen to the most. To analyze the music on its own later, I will have to separate out all podcasts, including this one.

In [None]:
views_podcast = df[df['artistName'] == 'VIEWS with David Dobrik and Jason Nash']
views_podcast.head()

In [None]:
# numeric_only=True avoids trying to sum the endTime and artistName columns
views_top_15 = views_podcast.groupby('trackName').sum(numeric_only=True).sort_values('secPlayed', ascending=False)[:15]

In [None]:
### Horizontal Bar Chart Showing distributed listening time of the Views Podcast 
### broken down into the top fifteen episodes

plt.figure(figsize=(15,10))
plt.xlim(4000,8000)
sns.barplot(x='secPlayed', y=views_top_15.index, data=views_top_15, orient='h', palette='bone')
plt.xlabel('Time Played (seconds)')
plt.ylabel('Podcast Episode')
plt.title('Views Podcast Listening Times', pad=20);
#plt.savefig('views_listening', bbox_inches='tight', dpi=500)

### The Numberphile Podcast

Another podcast that I listen to quite a bit...

In [None]:
numberphile_podcast = df[df['artistName'] == 'The Numberphile Podcast']
numberphile_podcast.head()

In [None]:
# numeric_only=True avoids trying to sum the endTime and artistName columns
numberphile_top_15 = numberphile_podcast.groupby('trackName').sum(numeric_only=True).sort_values('secPlayed', ascending=False)[:15]
numberphile_top_15

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x='secPlayed', y=numberphile_top_15.index, data=numberphile_top_15, orient='h', palette='bone')
plt.xlabel('Time Played (seconds)')
plt.ylabel('Podcast Episode')
plt.title('The Numberphile Podcast Listening Times', pad=20);
#plt.savefig('numberphile_listening', bbox_inches='tight', dpi=500)

To isolate the podcasts, I group the data by artist and sort by the average time played per stream, which surfaces the longer forms of media (i.e. podcasts, talk shows). From there, I manually check which entries in the `artistName` column correspond to podcasts.

In [None]:
# numeric_only=True is needed because mean() is undefined for the text columns
df.groupby('artistName').mean(numeric_only=True).sort_values('secPlayed', ascending=False).head(25)
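
The manual review can be narrowed with a simple cutoff on average play time. The sketch below is one possible way to do that on made-up data; the `toy` frame, the artist names, and the `MIN_AVG_SECONDS` threshold are all assumptions for illustration, and the flagged names would still need to be checked by hand.

```python
import pandas as pd

# Made-up streams: one long-form show and one band
toy = pd.DataFrame({
    'artistName': ['Some Podcast', 'Some Podcast', 'Some Band', 'Some Band'],
    'secPlayed': [2400.0, 3000.0, 200.0, 190.0],
})

MIN_AVG_SECONDS = 600  # hypothetical cutoff: average play longer than 10 minutes

# Average seconds per stream for each artist
avg_play = toy.groupby('artistName')['secPlayed'].mean()

# Artists whose average play is long enough to look like a podcast
candidates = avg_play[avg_play > MIN_AVG_SECONDS].index.tolist()
print(candidates)
```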

# Music Analysis

### Loading the DataFrame

In [None]:
WHITE_NOISE = ['Nature Sounds', 'Sounds Of Nature : Thunderstorm, Rain','Calmsound']

In [None]:
music = df[~df['artistName'].isin(PODCAST_ARTISTS + WHITE_NOISE)]
print('INFORMATION ABOUT MUSIC DATAFRAME: \n')
music.info()  # info() prints directly; wrapping it in print() also prints "None"
music.head()

In [None]:
music[music['trackName'].str.contains('rain', case=False)]

### Most Listened To Artists

In [None]:
# numeric_only=True avoids trying to sum the endTime and trackName columns
music_top_10 = music.groupby('artistName').sum(numeric_only=True).sort_values('secPlayed', ascending=False)[:10]

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=music_top_10.index, y='secPlayed', data=music_top_10, palette='plasma')
plt.xticks(rotation=45);
plt.title('Top 10 Musical Artists',pad=20);

### Drake vs. Billie Eilish: A/B Testing

Is the difference in average listening time between the two artists statistically significant? By performing a hypothesis test, we can find out!

In [None]:
drake_billie = music[music['artistName'].isin(['Drake','Billie Eilish'])]

#Individual tables for plotting purposes
drake, billie = music[music['artistName'] == 'Drake'], music[music['artistName'] == 'Billie Eilish']
drake_billie.head()

In [None]:
### Plotting distributions of play time of each play of Drake, Billie Eilish

plt.figure(figsize=(15,10))
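# histplot with kde=True replaces the deprecated distplot; stat='density' keeps the same y-scale
sns.histplot(drake['secPlayed'], bins=10, kde=True, stat='density', label='Drake')
sns.histplot(billie['secPlayed'], bins=10, kde=True, stat='density', label='Billie Eilish')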
plt.xlim(0,500)
plt.title('Drake vs. Billie Eilish Listen Time', pad=20)
plt.legend();

### Hypothesis Definitions

**Null Hypothesis:** The listening times per play for Drake and Billie Eilish come from the same underlying distribution; any difference in their averages is due to random chance.

**Alternative Hypothesis:** The listening times per play for Drake and Billie Eilish come from *different* underlying distributions.

**Test Statistic:** Difference between the mean listening time per play of the two artists.

In [None]:
def calculate_ts(df, group_label, col_label):
    """Calculates the desired test statistic given a grouping label and a column label"""
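    # Mean of col_label per group; groupby sorts group labels alphabetically
    group_means = df.groupby(group_label)[col_label].mean()
    # Positional access via iloc: first group minus second group
    test_stat = round(group_means.iloc[0] - group_means.iloc[1], 2)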
    return test_stat

def shuffle_table(df, col_label):
    """Returns a copy of the DataFrame with the values in col_label randomly permuted"""
    new_df = df.copy()
    shuffled_df = new_df.sample(frac=1, replace=False).reset_index(drop=True)
    shuffled_labels = shuffled_df[col_label]
    new_df[col_label] = list(shuffled_labels)
    return new_df

def do_ab_test(df, group_label, col_label):
    """Runs a single shuffle of the permutation test and returns its test statistic"""
    shuffled = shuffle_table(df, col_label)
    test_stat = calculate_ts(shuffled, group_label, col_label)
    return test_stat

def do_all_analysis():
    """Performs all above functions in order to conclude statistical significance"""
    list_ts = []
    for _ in range(1000):
        list_ts.append(do_ab_test(drake_billie, 'artistName', 'secPlayed'))

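    # One-sided p-value: share of shuffled statistics at least as large as the observed one.
    # Comparing a plain list to a number raises TypeError, so convert to an array first.
    p_val = np.mean(np.array(list_ts) >= ORIGINAL_TEST_STAT)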
    print('P-Value: \n', p_val, '\n')

    if p_val <= 0.05:
        print('Conclusion: \n Statistically significant')
        print(' Reject the null hypothesis in favor of the alternative')
    else:
        print('Conclusion: \n Not statistically significant')
        print(' Fail to reject the null hypothesis')
    
    plt.figure(figsize=(15,10))
    sns.histplot(list_ts, kde=True, stat='density')  # replaces the deprecated distplot
    plt.vlines(ORIGINAL_TEST_STAT,0, 0.07,color='blue',linestyles='dashed',label='Observed Test Statistic')
    plt.title('Distribution of Test Statistic',pad=20)
    plt.legend(loc='upper left');

In [None]:
ORIGINAL_TEST_STAT = calculate_ts(drake_billie, 'artistName', 'secPlayed')
print('The Original Test Statistic is: {}'.format(ORIGINAL_TEST_STAT))

In [None]:
do_all_analysis()
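
As a cross-check on the approach above, the same kind of permutation test can be sketched with NumPy alone. This is a minimal sketch rather than the notebook's method: `permutation_p_value` and its parameters are hypothetical names, and it uses a two-sided comparison (absolute values), which differs from the one-sided comparison in `do_all_analysis` but matches an alternative hypothesis that does not specify a direction.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference of means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()

    # Under the null hypothesis the group labels are exchangeable, so pool the values
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # Re-split into groups of the original sizes and recompute the statistic
        stat = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(stat) >= abs(observed):
            count += 1
    return observed, count / n_permutations
```

With the notebook's data, it could be called as `permutation_p_value(billie['secPlayed'], drake['secPlayed'])`.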

### Top Songs of 2020 (so far)

In [None]:
### Reduced the table to only include streams from 2020

twenty_twenty = music.iloc[10085:].reset_index(drop=True)
twenty_twenty['minPlayed'] = twenty_twenty['secPlayed'] / 60
print('INFORMATION ABOUT 2020 DATAFRAME: \n') 
twenty_twenty.info()  # info() prints directly; wrapping it in print() also prints "None"
twenty_twenty.head()
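
The positional slice `music.iloc[10085:]` depends on the row count of this particular export, so it would need to be recomputed for any other data pull. A date-based filter is one possible alternative. The sketch below shows the idea on a made-up frame (`toy_music` and its rows are illustrative only), assuming `endTime` has already been parsed to datetimes as in `load_streaming_data`.

```python
import pandas as pd

# Made-up streams straddling the new year
toy_music = pd.DataFrame({
    'endTime': pd.to_datetime(['2019-12-31 23:50', '2020-01-01 00:10', '2020-06-15 14:00']),
    'trackName': ['Song A', 'Song B', 'Song C'],
    'secPlayed': [180.0, 200.0, 150.0],
})

# Keep only rows whose end time falls in calendar year 2020
only_2020 = toy_music[toy_music['endTime'].dt.year == 2020].reset_index(drop=True)
print(only_2020)
```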

In [None]:
# numeric_only=True avoids trying to sum the endTime and artistName columns
top_20_of_2020 = twenty_twenty.groupby('trackName').sum(numeric_only=True).sort_values('minPlayed', ascending=False).head(20)

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x='minPlayed', y=top_20_of_2020.index, data=top_20_of_2020, palette='autumn');

In [None]:
longest_songs = music.groupby(['trackName','artistName']).max().sort_values('secPlayed', ascending=False)
longest_songs.head()

In [None]:
# numeric_only=True avoids trying to sum the endTime and trackName columns
top_artists = music.groupby('artistName').sum(numeric_only=True).sort_values('secPlayed', ascending=False)[:10]
top_artists.head()