# Researching User Spotify Data
The purpose of this notebook is to explore the data and create the algorithms that will be used in the django views.

### To-Do
- ~Liked tracks should be considered in the algo.~ When a top track matches with a saved track, it should carry more weight--create a metric for this -- idk how, just think about it in the future.
    - 1+ for each time an a track/artist appears, divided by 100
        - Also consider how many times a track shows up in their playlists; it should influence the weight metric. Consider this for artists as well to add more weight to the genres.
        - Also consider their followed artists and playlists.

- Consider getting the user's profile picture and artist picture for front-end purposes

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import plotly.express as px
from time import sleep
import json

In [2]:
with open('config.json', 'r') as f:
    config = json.load(f)

In [3]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=config['OAuth']['client_id'],
                                               client_secret=config['OAuth']['client_secret'],
                                               redirect_uri='http://localhost/',
                                               scope='user-library-read user-read-playback-state user-read-currently-playing user-top-read'))

## Spotify Functions

In [4]:
# find the user's top artists and return a dictionary that contains a custom artist dictionary and the raw response from the api.
def user_top_artists(sp_token: str, time_range: str):
    artists = []
    response = sp_token.current_user_top_artists(time_range=time_range, limit=50)
    
    for item in response['items']:
        artist = {
            'artist_name': item['name'],
            'artist_id': item['id'],
            'artist_uri': item['uri'],
            'genres': item['genres'],
            'popularity': item['popularity']
            
        }
        artists.append({
            'artist': artist,
            'response': item
        })
    return artists

In [24]:
# get the top tracks from a user
def user_top_tracks(sp_token: str, time_range: str):
    tracks = []
    response = sp_token.current_user_top_tracks(time_range=time_range, limit=50)
    
    for item in response['items']:
        track = {
            'track_name': item['name'],
            'track_id': item['id'],
            'artist_names': [x['name'] for x in item['artists']],
            'artist_ids': [x['id'] for x in item['artists']],
            'artist_uris': [x['uri'] for x in item['artists']],
            'duration_ms': item['duration_ms'],
            'explicit': item['explicit'],
            'popularity': item['popularity']
        }
        tracks.append({
            'track': track,
            'response': item
        })
        
    return tracks

In [6]:
# get a user's saved tracks -- limited to 550 tracks
# the limit is due to the amount of time it takes to get the saved tracks without reaching the API rate limit
# the rate limit could be extended if the app was in production and approved by spotify

def get_saved_tracks(sp_token: str):
    offset = 0
    limit = 50
    response = sp_token.current_user_saved_tracks(limit=limit, offset=offset)
    tracks = response['items']

    for i in range(0, 10):
        sleep(1)
        offset += limit
        
        if response['next'] != None or i < 10:
            response = sp_token.current_user_saved_tracks(limit=limit, offset=offset)
            tracks.extend(response['items'])
            i += 1
        else:
            return tracks
        
    return tracks

In [8]:
# def get_artists_from_tracks(sp_token: str, tracks: list):
#     artist_ids = []
#     artists = []
    
#     for i in range(0, len(tracks), )
    

In [9]:
# count the number of times an artist or track occurs across the time-ranges. it returns a dictionary of the track or artist name and the counts.
def count_items(items: list, item_type: str):
    ids = [x[item_type] for x in items]
    for item in items:
#         print(item[item_type])
#         print(ids.count(item[item_type]))
        item['counts'] = ids.count(item[item_type])
    return items

### Saved tracks
This is distinct from the top tracks and makes the tracks and artists lists more robust.

In [10]:
# define the saved track genre list

saved_tracks = get_saved_tracks(sp)

In [11]:
saved_track_dicts = []
for track in saved_tracks:
    saved_track_dicts.append({
        'track_name': track['track']['name'],
        'track_id': track['track']['id'],
        'artist_names': [x['name'] for x in track['track']['artists']],
        'artist_ids': [x['id'] for x in track['track']['artists']],
        'artist_uris': [x['uri'] for x in track['track']['artists']],
        'duration_ms': track['track']['duration_ms'],
        'explicit': track['track']['explicit'],
        'popularity': track['track']['popularity'],
        'added_at': pd.to_datetime(track['added_at']),
    })
    
pd.DataFrame(saved_track_dicts).head()

Unnamed: 0,track_name,track_id,artist_names,artist_ids,artist_uris,duration_ms,explicit,popularity,added_at
0,Reinventing Your Exit,591vJuuep0gfPhx9p8WPD5,[Underoath],[3GzWhE2xadJiW8MqRKIVSK],[spotify:artist:3GzWhE2xadJiW8MqRKIVSK],262573,False,59,2024-02-14 21:44:29+00:00
1,Autobiography Of A Nation,0ALZo8QRVF93qCVyseIjNF,[Thursday],[61awhbNK16ku1uQyXRsQj5],[spotify:artist:61awhbNK16ku1uQyXRsQj5],235626,False,39,2024-02-14 21:33:18+00:00
2,A Hole In The World,5fNCiYldS7oqIuOuTuKRGM,[Thursday],[61awhbNK16ku1uQyXRsQj5],[spotify:artist:61awhbNK16ku1uQyXRsQj5],207853,False,38,2024-02-14 21:30:22+00:00
3,Whacko Jacko Steals The Elephant Man's Bones,69xFp8tGO0yEYbhzaXs3Nh,[The Fall of Troy],[5fuQrhMRYMtoO9uOlFad4P],[spotify:artist:5fuQrhMRYMtoO9uOlFad4P],291586,False,22,2024-02-13 00:58:24+00:00
4,The Circus That Has brought Us Back To These N...,0fbHJ5ed7WDfc7buiDpy33,[The Fall of Troy],[5fuQrhMRYMtoO9uOlFad4P],[spotify:artist:5fuQrhMRYMtoO9uOlFad4P],188986,False,26,2024-02-13 00:47:30+00:00


### Define artists lists

In [12]:
# define top artists lists by time-range

long_term_artists = user_top_artists(sp, 'long_term')
medium_term_artists = user_top_artists(sp, 'medium_term')
short_term_artists = user_top_artists(sp, 'short_term')

# combine all artists
all_top_artists = [x['artist'] for x in long_term_artists] + [x['artist'] for x in medium_term_artists] + [x['artist'] for x in short_term_artists]

# long_artists_df = pd.DataFrame([x['artist'] for x in long_term_artists])
# medium_artists_df = pd.DataFrame([x['artist'] for x in medium_term_artists])
# short_artists_df = pd.DataFrame([x['artist'] for x in short_term_artists])

In [13]:
# count of all artists across time-ranges
top_artists_counts = count_items(all_top_artists, 'artist_id')

In [14]:
# create a distinct list of the artists
existing_artists = []
all_artists_distinct = []
for artist in top_artists_counts:
    if artist['artist_id'] not in existing_artists:
        existing_artists.append(artist['artist_id'])
        all_artists_distinct.append(artist)
    else:
        continue

In [None]:
all_artists_distinct[0]

In [15]:
### get all the artists from the saved tracks and count them
saved_artists = []
for track in saved_track_dicts:
    artists = [x for x in track['artist_ids']]
    for artist in artists:
        if artist not in existing_artists:
            saved_artists.append({
                'artist_id': artist,
                'counts': 1
            })
        else:
            for d in all_artists_distinct:
                if d['artist_id'] == artist:
                    d['counts'] += 1
                    break

In [16]:
# iterate through the saved artists that aren't in the top artists and get their complete data

saved_artists_ids = [x['artist_id'] for x in saved_artists]
compete_saved_artists = []
for i in range(0, len(saved_artists_ids), 50): # making the api calls in batches
    print(i)
    batch_artists = saved_artists_ids[i:i+50]
    sleep(1)
    response = sp.artists(batch_artists)
    for artist in response['artists']:        # iterating through the artists in the response
        # matching the IDs in the saved artists dictionaries and completing the dictionary to add to the distinct artist list
        for d in saved_artists:
            if d['artist_id'] == artist['id']:
                d['artist_name'] = artist['name']
                d['artist_uri'] = artist['uri']
                d['genres'] = artist['genres']
                d['popularity'] = artist['popularity']
                d['counts'] += 1
                complete_saved_artists.append(d)
                all_artists_distinct.append(d)
                break
            else:
                continue

0
Artist ID: 3GzWhE2xadJiW8MqRKIVSK
Saved ID: 3GzWhE2xadJiW8MqRKIVSK
ID MATCH
Artist ID: 27w31c5ZkBHHXMlqRGYkJ1
Saved ID: 27w31c5ZkBHHXMlqRGYkJ1
ID MATCH
Artist ID: 34KMxwDAHIvM7Kwt1PcClb
Saved ID: 34KMxwDAHIvM7Kwt1PcClb
ID MATCH
Artist ID: 2CgysNw5B7rFNRtRjQbPZ9
Saved ID: 2CgysNw5B7rFNRtRjQbPZ9
ID MATCH
Artist ID: 4kubsO16bEfCADaVUyoYb6
Saved ID: 4kubsO16bEfCADaVUyoYb6
ID MATCH
Artist ID: 56WzaWPSOfPePuPdfHBYr5
Saved ID: 56WzaWPSOfPePuPdfHBYr5
ID MATCH
Artist ID: 0Qur5LqqAVVunjY2PnjhUp
Saved ID: 0Qur5LqqAVVunjY2PnjhUp
ID MATCH
Artist ID: 26Z0ZxMY2uzimneFbrNuSY
Saved ID: 26Z0ZxMY2uzimneFbrNuSY
ID MATCH
Artist ID: 56huNdCA3s7tthaMNhIXLU
Saved ID: 56huNdCA3s7tthaMNhIXLU
ID MATCH
Artist ID: 6JwRFnmMxmWcYribpIJbcS
Saved ID: 6JwRFnmMxmWcYribpIJbcS
ID MATCH
Artist ID: 2PxnKk0fTNgMzm5pY6tINL
Saved ID: 2PxnKk0fTNgMzm5pY6tINL
ID MATCH
Artist ID: 6gwlllidzu0wkRNkXKGDfG
Saved ID: 6gwlllidzu0wkRNkXKGDfG
ID MATCH
Artist ID: 6dFTaFlK5rluDgdw1AtXVb
Saved ID: 6dFTaFlK5rluDgdw1AtXVb
ID MATCH
Artist ID:

Artist ID: 5eBCPtU2iPbzuMRre9BePt
Saved ID: 5eBCPtU2iPbzuMRre9BePt
ID MATCH
Artist ID: 1Lqdsv7Ff4GNq9PM3Yd0vi
Saved ID: 1Lqdsv7Ff4GNq9PM3Yd0vi
ID MATCH
Artist ID: 3SR0KKxXYspg7mJwSUsGBb
Saved ID: 3SR0KKxXYspg7mJwSUsGBb
ID MATCH
Artist ID: 467M2s2YxXdlL2ZpDUNL3A
Saved ID: 467M2s2YxXdlL2ZpDUNL3A
ID MATCH
Artist ID: 7lmvHeAJ7CIhXCdLKjm7VL
Saved ID: 7lmvHeAJ7CIhXCdLKjm7VL
ID MATCH
Artist ID: 09RKiiT2NAJ7qYoAGOYu9w
Saved ID: 09RKiiT2NAJ7qYoAGOYu9w
ID MATCH
Artist ID: 4Ase9pfG4FCMoiuyRduc8k
Saved ID: 4Ase9pfG4FCMoiuyRduc8k
ID MATCH
Artist ID: 6bAM7jeIX4pI5lZ0QoSZjt
Saved ID: 6bAM7jeIX4pI5lZ0QoSZjt
ID MATCH
Artist ID: 2E701AAAlg7LthbISEZv0N
Saved ID: 2E701AAAlg7LthbISEZv0N
ID MATCH
Artist ID: 4nygSGQAUjpauYKxzqm8MS
Saved ID: 4nygSGQAUjpauYKxzqm8MS
ID MATCH
Artist ID: 4q3ewBCX7sLwd24euuV69X
Saved ID: 4q3ewBCX7sLwd24euuV69X
ID MATCH
Artist ID: 26dSoYclwsYLMAKD3tpOr4
Saved ID: 26dSoYclwsYLMAKD3tpOr4
ID MATCH
Artist ID: 67OokTsDsLUvJI6oIxCigq
Saved ID: 67OokTsDsLUvJI6oIxCigq
ID MATCH
Artist ID: 5

Artist ID: 7dGU76bSIWTcybFmovtjcz
Saved ID: 7dGU76bSIWTcybFmovtjcz
ID MATCH
Artist ID: 7dGU76bSIWTcybFmovtjcz
Saved ID: 7dGU76bSIWTcybFmovtjcz
ID MATCH
Artist ID: 3YrAMtgGeFZCc4JCQI9mfr
Saved ID: 3YrAMtgGeFZCc4JCQI9mfr
ID MATCH
Artist ID: 67kJ7VKxNECnr33Y1BJEoO
Saved ID: 67kJ7VKxNECnr33Y1BJEoO
ID MATCH
Artist ID: 56huNdCA3s7tthaMNhIXLU
Saved ID: 56huNdCA3s7tthaMNhIXLU
ID MATCH
Artist ID: 3BQpGPX8wEhlKrICJuCmBd
Saved ID: 3BQpGPX8wEhlKrICJuCmBd
ID MATCH
Artist ID: 6KgydpcCuiFaWZQvCA50YZ
Saved ID: 6KgydpcCuiFaWZQvCA50YZ
ID MATCH
Artist ID: 59e7YxjDTqYuyxi0kTt4fL
Saved ID: 59e7YxjDTqYuyxi0kTt4fL
ID MATCH
Artist ID: 2iyLphVMYwRzaMn9hwHqDf
Saved ID: 2iyLphVMYwRzaMn9hwHqDf
ID MATCH
Artist ID: 1TtJ8j22Roc24e2Jx3OcU4
Saved ID: 1TtJ8j22Roc24e2Jx3OcU4
ID MATCH
Artist ID: 7FFwJQ58hAy7PMo4lUBW96
Saved ID: 7FFwJQ58hAy7PMo4lUBW96
ID MATCH
Artist ID: 5dMa4Q6mkH9767IgmuGuAu
Saved ID: 5dMa4Q6mkH9767IgmuGuAu
ID MATCH
Artist ID: 5kNGExA70Z5twMjo7mFYrW
Saved ID: 5kNGExA70Z5twMjo7mFYrW
ID MATCH
Artist ID: 2

Artist ID: 5MJREYwNVcTl1ohELWlciR
Saved ID: 5MJREYwNVcTl1ohELWlciR
ID MATCH
Artist ID: 5OqtUcLnQsI4uPnyWACQBg
Saved ID: 5OqtUcLnQsI4uPnyWACQBg
ID MATCH
Artist ID: 2zdyZ3Dk59W2RKwAvGioLp
Saved ID: 2zdyZ3Dk59W2RKwAvGioLp
ID MATCH
Artist ID: 5ZPYeVqoWNuukwfarvkyJX
Saved ID: 5ZPYeVqoWNuukwfarvkyJX
ID MATCH
Artist ID: 7lU8Gtn7moZmPqqu4oPkEh
Saved ID: 7lU8Gtn7moZmPqqu4oPkEh
ID MATCH
Artist ID: 2Mpw0nGvpyFqhdqdhNPFsg
Saved ID: 2Mpw0nGvpyFqhdqdhNPFsg
ID MATCH
Artist ID: 0YZFY3TMKjMskbxnYzwT1g
Saved ID: 0YZFY3TMKjMskbxnYzwT1g
ID MATCH
Artist ID: 1o8nCox6ggFAOMN0vZBj6b
Saved ID: 1o8nCox6ggFAOMN0vZBj6b
ID MATCH
Artist ID: 4ClVQWxbbuA2I7Yuyeu5of
Saved ID: 4ClVQWxbbuA2I7Yuyeu5of
ID MATCH
Artist ID: 4ioovQ90tEAnJiQryAtkXl
Saved ID: 4ioovQ90tEAnJiQryAtkXl
ID MATCH
Artist ID: 6SGVSqPxYUYa885J3dtXQu
Saved ID: 6SGVSqPxYUYa885J3dtXQu
ID MATCH
Artist ID: 4J4bsLS3aMvdGzv0tJTB8t
Saved ID: 4J4bsLS3aMvdGzv0tJTB8t
ID MATCH
Artist ID: 1s7yEBXZWlCUwSQQ4Jljnc
Saved ID: 1s7yEBXZWlCUwSQQ4Jljnc
ID MATCH
Artist ID: 2

In [19]:
artist_df = pd.DataFrame(all_artists_distinct)

In [20]:
artist_df.describe()

Unnamed: 0,popularity,counts
count,622.0,622.0
mean,38.061093,3.043408
std,20.320527,1.804373
min,0.0,1.0
25%,25.0,2.0
50%,36.0,2.0
75%,49.0,4.0
max,95.0,15.0


In [21]:
artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622 entries, 0 to 621
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   artist_name  622 non-null    object 
 1   artist_id    622 non-null    object 
 2   artist_uri   622 non-null    object 
 3   genres       622 non-null    object 
 4   popularity   622 non-null    float64
 5   counts       622 non-null    int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 29.3+ KB


### Define tracks lists

In [None]:
## don't define a list of distinct tracks yet; its better to add more weight to the genres count

In [35]:
# define top tracks lists

long_term_tracks = user_top_tracks(sp, 'long_term')
medium_term_tracks = user_top_tracks(sp, 'medium_term')
short_term_tracks = user_top_tracks(sp, 'short_term')

all_top_tracks = saved_track_dicts + [item['track'] for item in long_term_tracks] + [item['track'] for item in medium_term_tracks] + [item['track'] for item in short_term_tracks]

In [36]:
# counts of all tracks across time-ranges
all_tracks_counts = count_items(all_top_tracks, 'track_id')
all_tracks_counts

[{'track_name': 'Reinventing Your Exit',
  'track_id': '591vJuuep0gfPhx9p8WPD5',
  'artist_names': ['Underoath'],
  'artist_ids': ['3GzWhE2xadJiW8MqRKIVSK'],
  'artist_uris': ['spotify:artist:3GzWhE2xadJiW8MqRKIVSK'],
  'duration_ms': 262573,
  'explicit': False,
  'popularity': 59,
  'added_at': Timestamp('2024-02-14 21:44:29+0000', tz='UTC'),
  'counts': 1},
 {'track_name': 'Autobiography Of A Nation',
  'track_id': '0ALZo8QRVF93qCVyseIjNF',
  'artist_names': ['Thursday'],
  'artist_ids': ['61awhbNK16ku1uQyXRsQj5'],
  'artist_uris': ['spotify:artist:61awhbNK16ku1uQyXRsQj5'],
  'duration_ms': 235626,
  'explicit': False,
  'popularity': 39,
  'added_at': Timestamp('2024-02-14 21:33:18+0000', tz='UTC'),
  'counts': 1},
 {'track_name': 'A Hole In The World',
  'track_id': '5fNCiYldS7oqIuOuTuKRGM',
  'artist_names': ['Thursday'],
  'artist_ids': ['61awhbNK16ku1uQyXRsQj5'],
  'artist_uris': ['spotify:artist:61awhbNK16ku1uQyXRsQj5'],
  'duration_ms': 207853,
  'explicit': False,
  'popular

In [38]:
pd.DataFrame(all_top_tracks).describe()

Unnamed: 0,duration_ms,popularity,counts
count,700.0,700.0,700.0
mean,212685.6,27.49,1.2
std,114678.5,19.162095,0.55331
min,23610.0,0.0,1.0
25%,139534.0,12.0,1.0
50%,191185.5,25.5,1.0
75%,254512.8,40.0,1.0
max,1262400.0,93.0,4.0


### Genre Counts

In [70]:
# get the genres from a list of tracks
# its ok for duplicate genres because we want those duplicates to hold more weight when analyzing them

# there needs to be a way to handle artists that don't have genres listed

def get_genres_from_tracks(sp_token: str, tracks: list, artists: list):
    
#     response_list = [item['response'] for item in tracks]
    genres = []
    
    for track in tracks:
        common_ids = set(track['artist_ids']) & {artist['artist_id'] for artist in artists}
#         print(common_ids)
        track['genres'] = set()
        for artist in artists:
            if artist['artist_id'] in common_ids:
                track['genres'].update(artist['genres'])
                genres.extend(artist['genres'])
                
    all_track_artist_ids = {id for track in tracks for id in track['artist_ids']}
    unmatched_artist_ids = all_track_artist_ids - {artist['artist_id'] for artist in artists}
    
    print(len(unmatched_artist_ids))
    print(unmatched_artist_ids)
    
    if unmatched_artist_ids:
        for i in range(0, len(set(artist_ids)), 50):
            batch_artists = un[i:i+50]
            sleep(1)
            artists = sp_token.artists(batch_artists)
            for artist in artists['artists']:
                genres.extend((artist['id'], artist['genres']))

    return (tracks, genres)

In [71]:
t, g = get_genres_from_tracks(sp, [x['track'] for x in long_term_tracks], all_artists_distinct)

4
{'3Ayl7mCk0nScecqOzvNp6s', '3kSqc2brwAF1kWRFWYe2fW', '1ZvF4Sgnre3Rk2CpiNy077', '1NUOfvAhA9AvsF1ISMkgHX'}


NameError: name 'artist_ids' is not defined

In [65]:
t[0]

{'track_name': 'ANGEL TEARS',
 'track_id': '7kU3XPZ8u4XPGgnKHZJIOT',
 'artist_names': ['Ftlframe', 'dissectedRen'],
 'artist_ids': ['6ueZc2xAm12Ib0e90Bx7P0', '4J4bsLS3aMvdGzv0tJTB8t'],
 'artist_uris': ['spotify:artist:6ueZc2xAm12Ib0e90Bx7P0',
  'spotify:artist:4J4bsLS3aMvdGzv0tJTB8t'],
 'duration_ms': 157354,
 'explicit': False,
 'popularity': 42,
 'counts': 4,
 'genres': {'chill breakcore'}}

In [39]:
# define the genre list from time ranges

long_genres = get_genres_from_tracks(sp, [x['track'] for x in long_term_tracks], all_artist_distinct)
medium_genres = get_genres_from_tracks(sp, [x['track'] for x in medium_term_tracks], all_artist_distinct)
short_genres = get_genres_from_tracks(sp, [x['track'] for x in short_term_tracks], all_artist_distinct)

KeyError: 'artists'

In [42]:
long_term_tracks[0]['track']

{'track_name': 'ANGEL TEARS',
 'track_id': '7kU3XPZ8u4XPGgnKHZJIOT',
 'artist_names': ['Ftlframe', 'dissectedRen'],
 'artist_ids': ['6ueZc2xAm12Ib0e90Bx7P0', '4J4bsLS3aMvdGzv0tJTB8t'],
 'artist_uris': ['spotify:artist:6ueZc2xAm12Ib0e90Bx7P0',
  'spotify:artist:4J4bsLS3aMvdGzv0tJTB8t'],
 'duration_ms': 157354,
 'explicit': False,
 'popularity': 42,
 'counts': 4}

In [None]:
# define the top all-time genres
long_series = pd.Series(data=long_genres)
long_counts = long_series.value_counts()[:10]

# define the top ten genres from the past 6 months
medium_series = pd.Series(data=medium_genres)
medium_counts = medium_series.value_counts()[:10]

# define the top ten genres from the past 4 weeks
short_series = pd.Series(data=short_genres)
short_counts = short_series.value_counts()[:10]

In [None]:
# define the time-range dataframe for genres
genre_count_df = pd.DataFrame({'all_time_genres': long_counts, '6month_genres': medium_counts, '4week_genres': short_counts})
genre_count_df.fillna(0, inplace=True)
genre_count_df = genre_count_df.astype(int, copy=True)

In [None]:
# create a series and count
all_genres_series = pd.Series(data=all_genres)
all_genres_counts = all_genres_series.value_counts()[:25]
all_genres_counts

In [None]:
# define the genre list from the saved tracks

# saved_track_genres = get_genres_from_tracks(sp, saved_tracks)



In [None]:
# combine all the genres
all_genres = saved_track_genres + long_genres + medium_genres + short_genres
len(all_genres)

## Track Features
Mapping out the track features of all the songs to create visualization of the average/median features for each time-frame.
Including saved tracks.

Break down the features based on the time frames and a combined features.

In [None]:
# long_track_ids = [x['id'] for x in top_tracks['long']]
# medium_track_ids = [x['id'] for x in top_tracks['medium']]
# short_track_ids = [x['id'] for x in top_tracks['short']]

def get_features(token, tracks):
    key_removal = ['type', 'uri', 'track_href', 'analysis_url']
    track_dir = list()
    
    for track in tracks:
        track_dir.append({
            'track_id': track['id'],
            'name': track['name'],
            'artist': track['name']
        })
        
    tracks_df = pd.DataFrame(track_dir)
    features = token.audio_features([x['track_id'] for x in track_dir])
    features = [{key: value for key, value in d.items() if key not in key_removal} for d in features]
    
    audio_feature_mapping = {item['id']: item for item in features}
    feature_cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'instrumentalness','liveness',
                   'valence', 'tempo', 'duration_ms', 'time_signature']
    
    for feature in feature_cols:
        tracks_df[feature] = tracks_df['track_id'].map(lambda x: audio_feature_mapping[x][feature])
        
    return tracks_df

In [None]:
tracks_df = get_features(sp, top_tracks['long'])
tracks_df.head()

In [None]:
tracks_df['danceability'].describe()

In [None]:
dance_mean = tracks_df['danceability'].mean()
dance_std = tracks_df['danceability'].std()
dance_zscore = (tracks_df['danceability'] - dance_mean) / dance_std

print(dance_mean)
print(dance_std)
print(tracks_df['danceability'].min())
print(tracks_df['danceability'].max())

In [None]:
tracks_df.describe()

In [None]:
dist_df = tracks_df.drop(columns = ['track_id','name', 'artist']).describe()

# User EDA
This is where all of the user data will be combined into one DataFrame and be analyzed. I'll look at:
- Distribution of genres and track features
    - Grouped by time ranges
- Counts of artists and tracks across all time ranges and saved tracks
    - Top tracks and artists from All-time, 6 months, and 4 weeks
- Top tracks from specific albums
- Distribution of artist and track popularity
    - Grouped by time range
- Distribution of explicit tracks
- Distribution of track duration
- Distribution of the custom weights (counts)
    - This could be used to cluster the features of tracks within the median/mean weight

### Visualizations

In [None]:
fig = px.bar(
    df,
    labels={
        'index':'Genres', 
        'value':'Number of Tracks',
        'variable':'Time Range',
    },
    title='Most Listened Genres',
)
# fig.update_layout(xaxis={'categoryorder': 'total descending'})
fig.update_traces(name='All-time', selector={'name':'long_tracks'})
fig.update_traces(name='Past 6 Months', selector={'name':'medium_tracks'})
fig.update_traces(name='Past 4 Weeks', selector={'name':'short_tracks'})

fig.show()

## Genre Features
- Group song features based on genre and apply a heat map. Perform various analysis to find any correlations.
- Compare the features of the individual track compared to the genre features.
- Create the following functions:
    - Collect features and return a list of dictionaries for those features
    - Input the list of features and analyze them (cluster? median?) and return list of the new values

# Create the model

- Try clustering the track features and genre features.
- Maybe make a simple neural net using weights from the custom TBD metric.
    - Give weights to tracks, artists, and genres