## Background

**TODO:** Write the background

## Questions to Answer

General analysis:
- Listening patterns over time
    - Long term analysis
    - Time of day, day of week, season, and other short term analysis
- What types of artists and genres did I listen to?
- Can I obtain the lyrics of each song and analyze the emotion of a song?
    - What emotions of songs did I listen to over time?
- How often did I listen to extremely popular artists versus smaller/indie artists?
- How long do I listen to music in any one sitting?
- How does the length of my listening sessions vary over different time frames?
    - How long do I listen to specific artists and genres?
- In which countries did I listen to music, and what type of music did I listen to in each?
- How is my listening behavior affected by the device I use to listen?


Some specific questions:
- What types of songs did I skip or play for five seconds or less?
- What types of songs did I listen to in incognito mode, and why?

In [150]:
# import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
import calendar
import pytz

In [191]:
# read in datasets
df0 = pd.read_json("data/Streaming_History_Audio_2018-2020_0.json")
df1 = pd.read_json("data/Streaming_History_Audio_2020-2021_1.json")
df2 = pd.read_json("data/Streaming_History_Audio_2021-2022_2.json")
df3 = pd.read_json("data/Streaming_History_Audio_2022_3.json")
df4 = pd.read_json("data/Streaming_History_Audio_2022-2023_4.json")
df5 = pd.read_json("data/Streaming_History_Audio_2023-2024_5.json")
df6 = pd.read_json("data/Streaming_History_Audio_2024_6.json")

In [192]:
df = pd.concat([df0, df1, df2, df3, df4, df5, df6], axis=0)
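
The seven reads and the concatenation above could also be driven by a file pattern. The following is a minimal sketch, assuming every export file sits under `data/` and matches `Streaming_History_Audio_*.json`. The helper name `load_streaming_history` is hypothetical, and `ignore_index=True` is an added choice (not in the cell above) that gives the combined frame unique row labels instead of repeating each file's own index.

```python
from pathlib import Path

import pandas as pd


def load_streaming_history(data_dir="data", pattern="Streaming_History_Audio_*.json"):
    """Read every matching export file and stack the rows into one DataFrame."""
    # Sort the paths so the files are always read in the same order.
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir!r}")
    frames = [pd.read_json(path) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, axis=0, ignore_index=True)
```

With unique row labels, later label-based selections refer to exactly one row each.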

In [193]:
df.head()

Unnamed: 0,ts,username,platform,ms_played,conn_country,ip_addr_decrypted,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,...,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2018-03-16T20:41:49Z,mmphieag,Windows 10 (10.0.16299; x64; AppX),45010,US,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,,,,clickrow,,False,,False,1521232854659,False
1,2018-03-16T20:42:29Z,mmphieag,Windows 10 (10.0.16299; x64; AppX),36270,US,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,,,,appload,,False,,False,1521232907487,False
2,2018-03-16T20:44:28Z,mmphieag,Windows 10 (10.0.16299; x64; AppX),27800,US,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,,,,clickrow,,False,,False,1521232947154,False
3,2018-03-16T20:45:00Z,mmphieag,Windows 10 (10.0.16299; x64; AppX),31060,US,,,Chandler Road (Instrumental Remix),Sbvce,Sbvce Pvck: Bedroomtrap Vol. 1,...,,,,clickrow,,False,,False,1521233066829,False
4,2018-03-16T20:47:55Z,mmphieag,Windows 10 (10.0.16299; x64; AppX),3870,US,,,I. The Worst Guys,Childish Gambino,Because The Internet,...,,,,clickrow,,False,,False,1521233098379,False


In [194]:
df.columns

Index(['ts', 'username', 'platform', 'ms_played', 'conn_country',
       'ip_addr_decrypted', 'user_agent_decrypted',
       'master_metadata_track_name', 'master_metadata_album_artist_name',
       'master_metadata_album_album_name', 'spotify_track_uri', 'episode_name',
       'episode_show_name', 'spotify_episode_uri', 'reason_start',
       'reason_end', 'shuffle', 'skipped', 'offline', 'offline_timestamp',
       'incognito_mode'],
      dtype='object')

## Data Cleaning

### Connected Country
conn_country: This field is the country code of the country where the stream was played (e.g. SE - Sweden).

In [195]:
df.rename(columns={'conn_country': 'country'}, inplace=True)

In [196]:
df['country'].head()

0    US
1    US
2    US
3    US
4    US
Name: country, dtype: object

In [197]:
df['country'].unique()

array(['US', 'IS', 'ZZ', 'CA', 'DE', 'IN', 'TZ', 'KE', 'MX', 'DO'],
      dtype=object)

In [198]:
def convert_to_country(code):
    code_dict = {
        'US': 'United States',
        'IS': 'Iceland',
        'ZZ': 'Unknown',
        'CA': 'Canada',
        'DE': 'Germany',
        'IN': 'India',
        'TZ': 'Tanzania',
        'KE': 'Kenya',
        'MX': 'Mexico',
        'DO': 'Dominican Republic',
    }
    
    return code_dict[code]

df['country'] = df['country'].apply(convert_to_country)

In [199]:
df['country'].unique()

array(['United States', 'Iceland', 'Unknown', 'Canada', 'Germany',
       'India', 'Tanzania', 'Kenya', 'Mexico', 'Dominican Republic'],
      dtype=object)

### Timestamp
ts: This field is a timestamp indicating when the track stopped playing, in UTC (Coordinated Universal Time). The order is year, month, and day, followed by the time in 24-hour format.

In [200]:
df.rename(columns={'ts': 'timestamp'}, inplace=True)

In [201]:
df['timestamp'].head()

0    2018-03-16T20:41:49Z
1    2018-03-16T20:42:29Z
2    2018-03-16T20:44:28Z
3    2018-03-16T20:45:00Z
4    2018-03-16T20:47:55Z
Name: timestamp, dtype: object

First, change the format to datetime.

In [202]:
df['timestamp'] = df['timestamp'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ'))

Next, I need to convert it from UTC to a local timezone for accurate time calculation. The cell below converts every row to US/Eastern.

In [203]:
def convert_time_zone(stamp, country):
    # Per-country zones. NOTE: this mapping (and the country argument) is
    # currently unused; every timestamp is converted to US/Eastern.
    country_to_time = {
        'United States': 'US/Eastern',
        'Iceland': 'Atlantic/Reykjavik',
        'Canada': 'US/Eastern',
        'Germany': 'Europe/Berlin',
        'India': 'Asia/Kolkata',
        'Tanzania': 'Africa/Dar_es_Salaam',
        'Kenya': 'Africa/Nairobi',
        'Mexico': 'US/Eastern',
        'Dominican Republic': 'US/Eastern'
    }
    
    return stamp.astimezone(pytz.timezone('US/Eastern'))

df['timestamp'] = df.apply(lambda x: convert_time_zone(x['timestamp'].replace(tzinfo=pytz.utc), x['country']), axis=1)
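
If per-country local times are wanted, one possible approach is sketched below. It is not the notebook's method: it uses the standard-library `zoneinfo` module rather than `pytz`, the helper name `to_local_wall_time` is hypothetical, and falling back to US/Eastern for countries without an entry (such as 'Unknown') is an assumption. Dropping `tzinfo` at the end is a design choice so that values from different zones can share one naive datetime column.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Zone names copied from the country_to_time mapping in the notebook.
COUNTRY_TO_ZONE = {
    "United States": "US/Eastern",
    "Iceland": "Atlantic/Reykjavik",
    "Canada": "US/Eastern",
    "Germany": "Europe/Berlin",
    "India": "Asia/Kolkata",
    "Tanzania": "Africa/Dar_es_Salaam",
    "Kenya": "Africa/Nairobi",
    "Mexico": "US/Eastern",
    "Dominican Republic": "US/Eastern",
}


def to_local_wall_time(stamp_utc, country, default_zone="US/Eastern"):
    """Return the naive local wall-clock time for a UTC-aware datetime."""
    # Assumption: countries missing from the mapping use default_zone.
    zone = ZoneInfo(COUNTRY_TO_ZONE.get(country, default_zone))
    # Convert to the country's zone, then drop tzinfo.
    return stamp_utc.astimezone(zone).replace(tzinfo=None)


stopped = datetime(2018, 3, 16, 20, 41, 49, tzinfo=timezone.utc)
print(to_local_wall_time(stopped, "United States"))  # 2018-03-16 16:41:49
print(to_local_wall_time(stopped, "India"))          # 2018-03-17 02:11:49
```

A single zone per country is itself a simplification, since several of these countries span more than one timezone.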

Now, I need to extract certain information from the timestamp:

- Date
- Year
- Month
- Day of week
- Season (determined based on me living in the Northern Hemisphere)
- Time
- Time of day (Early Morning, Late Morning, Early Afternoon, Late Afternoon, Evening, Night, Late Night)

In [205]:
df['date'] = df['timestamp'].dt.date

In [206]:
df['year'] = df['timestamp'].dt.year

In [207]:
df['month'] = df['timestamp'].dt.month_name()

In [208]:
df['day_of_week'] = df['timestamp'].dt.day_name()

In [209]:
def determine_season(day, month, year):
    # (month, day) start and end of each non-winter season
    seasons = {
        'Spring': ((3,20), (6,20)),
        'Summer': ((6,21), (9,21)),
        'Fall': ((9,22), (12, 20)),
    }
    
    # winter wraps around the new year, so check it first
    if (month == 12 and day >= 21) or (month in (1,2)) or (month == 3 and day <= 19):
        return "Winter"
    
    for season, (s, e) in seasons.items():
        start = datetime(year, s[0], s[1])
        end = datetime(year, e[0], e[1])
        date = datetime(year, month, day)
        
        if start <= date <= end:
            return season

df['season'] = df['timestamp'].dt.date.apply(lambda x: determine_season(x.day, x.month, x.year))
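
Because the season boundaries are the same every year, the lookup can also be written without building `datetime` objects. The following is a minimal sketch using the same boundaries as `determine_season`; the function name `season_for` is hypothetical.

```python
def season_for(month, day):
    """Map a (month, day) pair to a Northern Hemisphere season name."""
    key = (month, day)
    # Tuples compare element by element, so (3, 20) <= key <= (6, 20)
    # selects March 20 through June 20.
    if (3, 20) <= key <= (6, 20):
        return "Spring"
    if (6, 21) <= key <= (9, 21):
        return "Summer"
    if (9, 22) <= key <= (12, 20):
        return "Fall"
    # Everything else falls between December 21 and March 19.
    return "Winter"


print(season_for(3, 16))  # Winter
print(season_for(3, 20))  # Spring
```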

In [210]:
df['time'] = df['timestamp'].dt.time

In [211]:
def determine_time_of_day(h, m, s):
    times_of_day = {
        'Early Morning': (5, 8),
        'Late Morning': (9, 12),
        'Early Afternoon': (13, 15),
        'Late Afternoon': (16, 17),
        'Evening': (18, 20),
        'Night': (21, 23),
        'Late Night': (0, 4)
    }
    
    for tod, (start, end) in times_of_day.items():
        if start <= h <= end:
            return tod

df['time_of_day'] = df['timestamp'].dt.time.apply(lambda x: determine_time_of_day(x.hour, x.minute, x.second))
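
The hour buckets above are contiguous and cover all 24 hours, so the same lookup can be done with a sorted list of bucket start hours and a binary search. The following is a sketch under that assumption; `time_of_day_for`, `BUCKET_STARTS`, and `BUCKET_LABELS` are hypothetical names.

```python
from bisect import bisect_right

# First hour of each bucket in ascending order, with the matching labels.
BUCKET_STARTS = [0, 5, 9, 13, 16, 18, 21]
BUCKET_LABELS = [
    "Late Night", "Early Morning", "Late Morning", "Early Afternoon",
    "Late Afternoon", "Evening", "Night",
]


def time_of_day_for(hour):
    """Return the time-of-day label for an hour in the range 0..23."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    # bisect_right counts how many bucket starts are <= hour; the last of
    # those starts belongs to the bucket that contains the hour.
    return BUCKET_LABELS[bisect_right(BUCKET_STARTS, hour) - 1]


print(time_of_day_for(16))  # Late Afternoon
```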

In [212]:
df['time_of_day'].value_counts()

Early Afternoon    24182
Late Morning       23920
Late Afternoon     18400
Evening            18207
Night              13805
Late Night          5938
Early Morning       5519
Name: time_of_day, dtype: int64

In [213]:
df.head()

Unnamed: 0,timestamp,username,platform,ms_played,country,ip_addr_decrypted,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,...,offline,offline_timestamp,incognito_mode,date,year,month,day_of_week,season,time,time_of_day
0,2018-03-16 16:41:49-04:00,mmphieag,Windows 10 (10.0.16299; x64; AppX),45010,United States,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,False,1521232854659,False,2018-03-16,2018,March,Friday,Winter,16:41:49,Late Afternoon
1,2018-03-16 16:42:29-04:00,mmphieag,Windows 10 (10.0.16299; x64; AppX),36270,United States,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,False,1521232907487,False,2018-03-16,2018,March,Friday,Winter,16:42:29,Late Afternoon
2,2018-03-16 16:44:28-04:00,mmphieag,Windows 10 (10.0.16299; x64; AppX),27800,United States,,,No Role Modelz,J. Cole,2014 Forest Hills Drive,...,False,1521232947154,False,2018-03-16,2018,March,Friday,Winter,16:44:28,Late Afternoon
3,2018-03-16 16:45:00-04:00,mmphieag,Windows 10 (10.0.16299; x64; AppX),31060,United States,,,Chandler Road (Instrumental Remix),Sbvce,Sbvce Pvck: Bedroomtrap Vol. 1,...,False,1521233066829,False,2018-03-16,2018,March,Friday,Winter,16:45:00,Late Afternoon
4,2018-03-16 16:47:55-04:00,mmphieag,Windows 10 (10.0.16299; x64; AppX),3870,United States,,,I. The Worst Guys,Childish Gambino,Because The Internet,...,False,1521233098379,False,2018-03-16,2018,March,Friday,Winter,16:47:55,Late Afternoon


### Username
username: This field is your Spotify username.

### Platform
platform: This field is the platform used when streaming the track (e.g. Android OS, Google Chromecast).
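
Raw platform strings such as `Windows 10 (10.0.16299; x64; AppX)` carry version details that make grouping by device awkward. The following is a minimal sketch of collapsing them into coarse families. The function name `platform_family` is hypothetical, and the keyword list covers only the platforms named in this notebook; every other value is labeled "Other".

```python
def platform_family(platform):
    """Collapse a raw platform string into a coarse device family."""
    text = platform.lower()
    # Keywords are checked in order and the first match wins.
    keywords = (
        ("windows", "Windows"),
        ("android", "Android"),
        ("chromecast", "Chromecast"),
    )
    for keyword, family in keywords:
        if keyword in text:
            return family
    return "Other"


print(platform_family("Windows 10 (10.0.16299; x64; AppX)"))  # Windows
```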

### Milliseconds Played
ms_played: This field is the number of milliseconds the stream was played.
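
Two quantities follow directly from this field: whether a stream counts as a short play (the "five seconds or less" question), and an estimate of when the stream started, since `ts` records when the track stopped. The following is a sketch with hypothetical helper names. The start time is only an estimate because it assumes uninterrupted playback.

```python
from datetime import datetime, timedelta, timezone

SHORT_PLAY_MS = 5_000  # five seconds, expressed in milliseconds


def is_short_play(ms_played, threshold_ms=SHORT_PLAY_MS):
    """True when the stream lasted threshold_ms milliseconds or less."""
    return ms_played <= threshold_ms


def estimated_start(stopped_at, ms_played):
    """Estimate the start time by subtracting the played duration."""
    return stopped_at - timedelta(milliseconds=ms_played)


stopped = datetime(2018, 3, 16, 20, 41, 49, tzinfo=timezone.utc)
print(is_short_play(3870))              # True
print(estimated_start(stopped, 45010))  # 2018-03-16 20:41:03.990000+00:00
```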

### IP Address
ip_addr_decrypted: This field contains the IP address logged when streaming the track.

### User Agent
user_agent_decrypted: This field contains the user agent used when streaming the track (e.g. a browser, like Mozilla Firefox or Safari).

### Track Name
master_metadata_track_name: This field is the name of the track.

### Artist Name
master_metadata_album_artist_name: This field is the name of the artist, band or podcast.

### Album Name
master_metadata_album_album_name: This field is the name of the album of the track.

### Spotify URI
spotify_track_uri: A Spotify URI, uniquely identifying the track in the form of “spotify:track:<base-62 string>”. A Spotify URI is a resource identifier that you can enter, for example, in the Spotify Desktop client’s search box to locate an artist, album, or track.
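
A URI of this form can be split into its resource type and identifier. The following is a minimal sketch that accepts the track form and the episode form described in this notebook. The function name `parse_spotify_uri` is hypothetical, the sample identifier is a made-up placeholder, and the pattern only checks that the identifier uses base-62 characters.

```python
import re

# Resource type, then an identifier made of digits and ASCII letters.
_URI_PATTERN = re.compile(r"^spotify:(track|episode):([0-9A-Za-z]+)$")


def parse_spotify_uri(uri):
    """Split a Spotify track or episode URI into (kind, identifier)."""
    match = _URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a Spotify track or episode URI: {uri!r}")
    return match.group(1), match.group(2)


# "abc123XYZ" is a placeholder, not a real track identifier.
print(parse_spotify_uri("spotify:track:abc123XYZ"))  # ('track', 'abc123XYZ')
```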

### Episode Name
episode_name: This field contains the name of the episode of the podcast.

### Episode Show Name
episode_show_name: This field contains the name of the show of the podcast.

### Spotify Episode URI
spotify_episode_uri: A Spotify Episode URI, uniquely identifying the podcast episode in the form of “spotify:episode:<base-62 string>”.

### Reason Start
reason_start: This field is a value telling why the track started (e.g. “trackdone”).

### Reason End
reason_end: This field is a value telling why the track ended (e.g. “endplay”).

### Shuffle?
shuffle: This field has the value True or False depending on if shuffle mode was used when playing the track.

### Skipped?
skipped: This field indicates if the user skipped to the next song.

### Offline?
offline: This field indicates whether the track was played in offline mode (“True”) or not (“False”).

### Offline Timestamp
offline_timestamp: This field is a timestamp of when offline mode was used, if used.
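
The sample values in this column (for example `1521232854659`) look like Unix epoch timestamps in milliseconds, although the field description does not say so. Under that assumption, a value can be converted to a UTC datetime as sketched below; the function name `epoch_ms_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms):
    """Convert integer milliseconds since the Unix epoch to a UTC datetime."""
    # Assumption: ms counts milliseconds since 1970-01-01T00:00:00Z.
    # Adding a timedelta keeps the arithmetic exact (no float rounding).
    return _EPOCH + timedelta(milliseconds=ms)


print(epoch_ms_to_utc(1521232854659))  # 2018-03-16 20:40:54.659000+00:00
```

For the first row of the data, this lands within a minute of the `ts` value `2018-03-16T20:41:49Z`, which is consistent with the millisecond reading but does not confirm it.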

### Incognito Mode?
incognito_mode: This field indicates whether the track was played during a private session (“True”) or not (“False”).