# Researching User Spotify Data
The purpose of this notebook is to explore the data and create the algorithms that will be used in the Django views.

### To-Do
- ~Along with the top tracks, a user's saved/liked tracks should be analyzed.~

- ~Liked tracks should be considered in the algorithm.~ When a top track also appears in the saved tracks, it should carry more weight. A metric for this still needs to be designed.
    - Starting idea: add 1 each time a track/artist appears, then divide the total by 100.
        - Also count how many times a track shows up in the user's playlists; this should influence the weight metric. Do the same for artists to add more weight to their genres.
        - Also consider the user's followed artists and playlists.

- ~Modify the number of top tracks fetched. A while loop should increase the offset by the limit of 50 (offset += limit). At the end of each iteration, check the 'next' key of the response; if it is None, there are no more artists/tracks and the loop should break.~
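
A minimal sketch of the weight idea from the to-do list, assuming every appearance of a track or artist id (in a top-items time range, the saved tracks, or a playlist) adds 1 and the total is divided by 100. The function name `appearance_weights` and the sample ids are hypothetical; the real metric is still to be designed.

```python
from collections import Counter

def appearance_weights(*id_lists):
    """Return {item_id: weight}: +1 per appearance across all lists, divided by 100."""
    counts = Counter()
    for ids in id_lists:
        counts.update(ids)
    return {item_id: n / 100 for item_id, n in counts.items()}

# Hypothetical track ids, for illustration only
top_long = ['t1', 't2']
top_short = ['t1', 't3']
saved = ['t1', 't2', 't4']

weights = appearance_weights(top_long, top_short, saved)
```

A track that shows up in a top list and in the saved tracks ends up with a higher weight than one that shows up only once, which is the behaviour the to-do item asks for.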

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import plotly.express as px
from time import sleep
import json

In [2]:
with open('config.json', 'r') as f:
    config = json.load(f)

In [3]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=config['OAuth']['client_id'],
                                               client_secret=config['OAuth']['client_secret'],
                                               redirect_uri='http://localhost/',
                                               scope='user-library-read user-read-playback-state user-read-currently-playing user-top-read'))

# Spotify Functions

In [38]:
# find the user's top artists and return a list of dictionaries, each holding a custom artist dictionary and the raw item from the API response.
def user_top_artists(sp_token: spotipy.Spotify, time_range: str):
    artists = []
    response = sp_token.current_user_top_artists(time_range=time_range, limit=50)
    
    for item in response['items']:
        artist = {
            'artist_name': item['name'],
            'artist_id': item['id'],
            'artist_uri': item['uri'],
            'generes': item['genres'],
            'popularity': item['popularity']
            
        }
        artists.append({
            'artist': artist,
            'response': item
        })
    return artists

In [5]:
# get the top tracks from a user
def user_top_tracks(sp_token: spotipy.Spotify, time_range: str):
    tracks = []
    response = sp_token.current_user_top_tracks(time_range=time_range, limit=50)
    
    for item in response['items']:
        track = {
            'track_name': item['name'],
            'track_id': item['id'],
            'artist_names': [x['name'] for x in item['artists']],
            'artist_ids': [x['id'] for x in item['artists']],
            'artist_uris': [x['uri'] for x in item['artists']],
            'duration_ms': item['duration_ms'],
            'explicit': item['explicit'],
            'popularity': item['popularity']
        }
        tracks.append({
            'track': track,
            'response': item
        })
        
    return tracks
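
The top-artist and top-track helpers above request a single page of 50 items. The paging loop described in the to-do list (increase the offset by the limit, stop when `next` is None) could look roughly like the sketch below. `fetch_all_pages` and `fake_top_tracks` are hypothetical names; the fake page function stands in for a Spotify client method so the sketch runs without network access.

```python
def fetch_all_pages(fetch_page, limit=50, max_pages=None):
    """Collect 'items' from every page until the response's 'next' is None."""
    items = []
    offset = 0
    pages = 0
    while True:
        response = fetch_page(limit=limit, offset=offset)
        items.extend(response['items'])
        pages += 1
        # stop when there is no next page, or when the optional page cap is hit
        if response['next'] is None or (max_pages is not None and pages >= max_pages):
            break
        offset += limit
    return items

# Stand-in for a client method: serves 120 fake items in pages
def fake_top_tracks(limit, offset, total=120):
    page = list(range(offset, min(offset + limit, total)))
    next_url = 'more' if offset + limit < total else None
    return {'items': page, 'next': next_url}

all_items = fetch_all_pages(fake_top_tracks)
```

With spotipy, something like `functools.partial(sp.current_user_top_tracks, time_range='long_term')` could presumably be passed as `fetch_page`, with a `sleep` between calls if rate limits are a concern.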

In [6]:
# get a user's saved tracks -- limited to 550 tracks
# the limit is due to the amount of time it takes to get the saved tracks without reaching the API rate limit
# the rate limit could be extended if the app was in production and approved by spotify

def get_saved_tracks(sp_token: spotipy.Spotify):
    offset = 0
    limit = 50
    response = sp_token.current_user_saved_tracks(limit=limit, offset=offset)
    tracks = response['items']

    # fetch up to 10 more pages (550 tracks total), stopping early when there is no next page
    for _ in range(10):
        if response['next'] is None:
            break
        sleep(1)
        offset += limit
        response = sp_token.current_user_saved_tracks(limit=limit, offset=offset)
        tracks.extend(response['items'])

    return tracks

In [7]:
# get the genres from a list of tracks
# it's OK to have duplicate genres because we want those duplicates to hold more weight when analyzing them

# there needs to be a way to handle artists that don't have genres listed

# returns a tuple: (list of (artist_id, genres) pairs, flat list of all genres)
def get_genres_from_tracks(sp_token: spotipy.Spotify, tracks: list):
    genres = list()
    genres_complete = list()
    artist_ids = list()

    for track in [x['track'] for x in tracks]:
        # saved tracks carry the raw 'artists' list; top tracks carry the custom 'artist_ids' list
        if 'artists' in track:
            artist_ids.extend(artist['id'] for artist in track['artists'])
        else:
            artist_ids.extend(track['artist_ids'])

    try:
        # the artists endpoint accepts at most 50 ids per request
        for i in range(0, len(artist_ids), 50):
            batch_artists = artist_ids[i:i+50]
            sleep(1)
            artists = sp_token.artists(batch_artists)
            for artist in artists['artists']:
                genres.append((artist['id'], artist['genres']))
                genres_complete.extend(artist['genres'])

    except Exception as e:
        print(f'ERROR FETCHING GENRES {e}') 

    return (genres, genres_complete)
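
The comment above notes that artists without genres still need handling. One possible approach, sketched here as an assumption rather than a decision, is to substitute a placeholder genre so those artists are not silently dropped from the counts. `fill_missing_genres` and the `'unknown'` label are hypothetical; the input mirrors the `(artist_id, genres)` pairs returned as the first element of the tuple.

```python
def fill_missing_genres(artist_genres, placeholder='unknown'):
    """Replace empty genre lists with a single placeholder genre."""
    return [(artist_id, genres if genres else [placeholder])
            for artist_id, genres in artist_genres]

# Example pairs, the second artist has no genres listed
pairs = [('a1', ['breakcore', 'cybergrind']), ('a2', [])]
filled = fill_missing_genres(pairs)
```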

In [26]:
# count the number of times an artist or track occurs across the time-ranges. it returns the same list of dictionaries, each with a 'counts' key added.
def count_items(items: list, item_type: str):
    ids = [x[item_type] for x in items]
    for item in items:
        item['counts'] = ids.count(item[item_type])
    return items

## Saved tracks

In [19]:
# fetch the user's saved tracks

saved_tracks = get_saved_tracks(sp)

In [None]:
saved_dicts = []
for track in saved_tracks:
    saved_dicts.append({
        'track_name': track['track']['name'],
        'track_id': track['track']['id'],
        'artists_names': [x['name'] for x in track['track']['artists']],
        'artists_ids': [x['id'] for x in track['track']['artists']],
        'artists_uris': [x['uri'] for x in track['track']['artists']],
        'duration_ms': track['track']['duration_ms'],
        'explicit': track['track']['explicit'],
        'popularity': track['track']['popularity'],
        'added_at': pd.to_datetime(track['added_at']),
    })
    
pd.DataFrame(saved_dicts)

## Define artists lists

In [40]:
# define top artists lists

long_term_artists = user_top_artists(sp, 'long_term')
medium_term_artists = user_top_artists(sp, 'medium_term')
short_term_artists = user_top_artists(sp, 'short_term')
# saved_artists =   # TODO: left unfinished in the notebook

all_top_artists = [x['artist'] for x in long_term_artists] + [x['artist'] for x in medium_term_artists] + [x['artist'] for x in short_term_artists]

# long_artists_df = pd.DataFrame([x['artist'] for x in long_term_artists])
# medium_artists_df = pd.DataFrame([x['artist'] for x in medium_term_artists])
# short_artists_df = pd.DataFrame([x['artist'] for x in short_term_artists])

In [41]:
# count of all artists across time-ranges
all_artists_counts = count_items(all_top_artists, 'artist_uri')

all_artists_counts

[{'artist_name': 'Slikback',
  'artist_id': '7ab5IU6f9rBvhgS4kuQjSh',
  'artist_uri': 'spotify:artist:0NwRAG9DawUqqgur9925fA',
  'generes': ['african electronic',
   'deconstructed club',
   'experimental techno',
   'grimewave',
   'mandible',
   'singeli'],
  'popularity': 25,
  'counts': 2},
 {'artist_name': 'Drumcorps',
  'artist_id': '7ab5IU6f9rBvhgS4kuQjSh',
  'artist_uri': 'spotify:artist:6EjKG35nRHw9t8ypwDunOB',
  'generes': ['breakcore', 'cybergrind'],
  'popularity': 27,
  'counts': 2},
 {'artist_name': 'Deftones',
  'artist_id': '7ab5IU6f9rBvhgS4kuQjSh',
  'artist_uri': 'spotify:artist:6Ghvu1VvMGScGpOUJBAHNH',
  'generes': ['alternative metal',
   'nu metal',
   'rap metal',
   'rock',
   'sacramento indie'],
  'popularity': 78,
  'counts': 3},
 {'artist_name': 'Filth is Eternal',
  'artist_id': '7ab5IU6f9rBvhgS4kuQjSh',
  'artist_uri': 'spotify:artist:0BqcFG3uqS8l59OsCtIiH0',
  'generes': ['american grindcore', 'modern hardcore'],
  'popularity': 18,
  'counts': 2},
 {'arti

In [43]:
all_artists_counts[0]

{'artist_name': 'Slikback',
 'artist_id': '7ab5IU6f9rBvhgS4kuQjSh',
 'artist_uri': 'spotify:artist:0NwRAG9DawUqqgur9925fA',
 'generes': ['african electronic',
  'deconstructed club',
  'experimental techno',
  'grimewave',
  'mandible',
  'singeli'],
 'popularity': 25,
 'counts': 2}

## Define tracks lists

In [16]:
# define top tracks lists

long_term_tracks = user_top_tracks(sp, 'long_term')
medium_term_tracks = user_top_tracks(sp, 'medium_term')
short_term_tracks = user_top_tracks(sp, 'short_term')

all_top_tracks = [item['track'] for item in long_term_tracks] + [item['track'] for item in medium_term_tracks] + [item['track'] for item in short_term_tracks]

In [17]:
all_top_tracks[0]

{'track_name': 'ANGEL TEARS',
 'track_id': '7kU3XPZ8u4XPGgnKHZJIOT',
 'artist_names': ['Ftlframe', 'dissectedRen'],
 'artist_ids': ['6ueZc2xAm12Ib0e90Bx7P0', '4J4bsLS3aMvdGzv0tJTB8t'],
 'duration_ms': 157354,
 'explicit': False,
 'popularity': 42}

In [17]:
# counts of all tracks across time-ranges
all_tracks_counts = count_items(all_top_tracks, 'track_name')
all_tracks_counts

{'track_name': 'ANGEL TEARS', 'track_id': '7kU3XPZ8u4XPGgnKHZJIOT', 'artist_names': ['Ftlframe', 'dissectedRen'], 'artist_ids': ['6ueZc2xAm12Ib0e90Bx7P0', '4J4bsLS3aMvdGzv0tJTB8t'], 'duration_ms': 157354, 'explicit': False, 'popularity': 42}
{'track_name': 'One Day', 'track_id': '6OaRKtksghRcNU5bN1qvZR', 'artist_names': ['Drumcorps'], 'artist_ids': ['6EjKG35nRHw9t8ypwDunOB'], 'duration_ms': 42906, 'explicit': False, 'popularity': 15}
{'track_name': 'Compromised', 'track_id': '3Tb2kcEZ6lEPSMWhNANXOx', 'artist_names': ['Drumcorps'], 'artist_ids': ['6EjKG35nRHw9t8ypwDunOB'], 'duration_ms': 220636, 'explicit': False, 'popularity': 15}
{'track_name': 'On a Mission', 'track_id': '3PdkMOyoya1ppA3n2Vqtn8', 'artist_names': ['Drumcorps'], 'artist_ids': ['6EjKG35nRHw9t8ypwDunOB'], 'duration_ms': 68333, 'explicit': False, 'popularity': 36}
{'track_name': 'The Importance of Stealth', 'track_id': '2SxsAmAQkxIKPfHEF8IxKZ', 'artist_names': ['Drumcorps'], 'artist_ids': ['6EjKG35nRHw9t8ypwDunOB'], 'dura

{'track_name': 100}

In [18]:
pd.DataFrame(all_top_tracks)['artist_names'][0][0]

'Ftlframe'

# Genre Exploration

In [None]:
# define the genre list from time ranges

long_genres = get_genres_from_tracks(sp, long_term_tracks)
medium_genres = get_genres_from_tracks(sp, medium_term_tracks)
short_genres = get_genres_from_tracks(sp, short_term_tracks)

In [None]:
# get_genres_from_tracks returns (artist/genre pairs, flat genre list); index 1 is the flat list

# define the top ten all-time genres
long_series = pd.Series(data=long_genres[1])
long_counts = long_series.value_counts()[:10]

# define the top ten genres from the past 6 months
medium_series = pd.Series(data=medium_genres[1])
medium_counts = medium_series.value_counts()[:10]

# define the top ten genres from the past 4 weeks
short_series = pd.Series(data=short_genres[1])
short_counts = short_series.value_counts()[:10]

In [None]:
# define the time-range dataframe for genres
genre_count_df = pd.DataFrame({'all_time_genres': long_counts, '6month_genres': medium_counts, '4week_genres': short_counts})
genre_count_df.fillna(0, inplace=True)
genre_count_df = genre_count_df.astype(int, copy=True)

In [None]:
# define the genre list from the saved tracks

saved_track_genres = get_genres_from_tracks(sp, saved_tracks)

In [None]:
# combine all the genres (index 1 of each result is the flat genre list)
all_genres = saved_track_genres[1] + long_genres[1] + medium_genres[1] + short_genres[1]
len(all_genres)

In [None]:
# create a series and count
all_genres_series = pd.Series(data=all_genres)
all_genres_counts = all_genres_series.value_counts()[:25]
all_genres_counts

## Track Features
Map out the audio features of all the songs, including saved tracks, to visualize the average/median features for each time frame.

Break the features down by time frame and also as a combined set.
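
As a rough sketch of the per-time-frame summary described above, the helper below computes the mean and median of each feature for each time range. It uses only the standard library so it runs on its own; `summarize_features` and the sample values are hypothetical. With a DataFrame that has a time-range column, `groupby(...).agg(['mean', 'median'])` would give the same kind of table.

```python
from statistics import mean, median

def summarize_features(tracks_by_range, feature_cols):
    """Return {time_range: {feature: {'mean': ..., 'median': ...}}}."""
    summary = {}
    for time_range, tracks in tracks_by_range.items():
        summary[time_range] = {
            col: {'mean': mean(t[col] for t in tracks),
                  'median': median(t[col] for t in tracks)}
            for col in feature_cols
        }
    return summary

# Made-up feature values, for illustration only
tracks_by_range = {
    'long_term': [{'energy': 0.5, 'tempo': 100.0},
                  {'energy': 1.0, 'tempo': 140.0},
                  {'energy': 0.75, 'tempo': 120.0}],
    'short_term': [{'energy': 0.25, 'tempo': 90.0},
                   {'energy': 0.75, 'tempo': 110.0}],
}
summary = summarize_features(tracks_by_range, ['energy', 'tempo'])
```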

In [None]:
# long_track_ids = [x['id'] for x in top_tracks['long']]
# medium_track_ids = [x['id'] for x in top_tracks['medium']]
# short_track_ids = [x['id'] for x in top_tracks['short']]

def get_features(token, tracks):
    key_removal = ['type', 'uri', 'track_href', 'analysis_url']
    track_dir = list()
    
    for track in tracks:
        track_dir.append({
            'track_id': track['id'],
            'name': track['name'],
            'artist': track['artists'][0]['name']
        })
        
    tracks_df = pd.DataFrame(track_dir)
    features = token.audio_features([x['track_id'] for x in track_dir])
    features = [{key: value for key, value in d.items() if key not in key_removal} for d in features]
    
    audio_feature_mapping = {item['id']: item for item in features}
    feature_cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'instrumentalness','liveness',
                   'valence', 'tempo', 'duration_ms', 'time_signature']
    
    for feature in feature_cols:
        tracks_df[feature] = tracks_df['track_id'].map(lambda x: audio_feature_mapping[x][feature])
        
    return tracks_df

In [None]:
tracks_df = get_features(sp, [x['response'] for x in long_term_tracks])
tracks_df.head()

In [None]:
tracks_df['danceability'].describe()

In [None]:
dance_mean = tracks_df['danceability'].mean()
dance_std = tracks_df['danceability'].std()
dance_zscore = (tracks_df['danceability'] - dance_mean) / dance_std

print(dance_mean)
print(dance_std)
print(tracks_df['danceability'].min())
print(tracks_df['danceability'].max())

In [None]:
tracks_df.describe()

In [None]:
dist_df = tracks_df.drop(columns = ['track_id','name', 'artist']).describe()

## Genre Features
- Group song features by genre and apply a heat map. Perform various analyses to find any correlations.
- Compare the features of each individual track with the features of its genre.
- Create the following functions:
    - Collect features and return a list of dictionaries for those features
    - Input the list of features and analyze them (cluster? median?) and return list of the new values
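
The two planned functions could be sketched as follows, assuming each track dictionary already carries its genres and numeric features. The median is used here only as one of the options mentioned above, and `genre_feature_medians`, `difference_from_genre`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def genre_feature_medians(tracks, feature_cols):
    """Group tracks by genre and return {genre: {feature: median}}."""
    grouped = defaultdict(list)
    for track in tracks:
        for genre in track['genres']:
            grouped[genre].append(track)
    return {
        genre: {col: median(t[col] for t in members) for col in feature_cols}
        for genre, members in grouped.items()
    }

def difference_from_genre(track, genre, medians, feature_cols):
    """Return how far a track's features sit from its genre's medians."""
    return {col: track[col] - medians[genre][col] for col in feature_cols}

# Made-up tracks, for illustration only
tracks = [
    {'name': 'a', 'genres': ['breakcore'], 'energy': 0.5, 'valence': 0.25},
    {'name': 'b', 'genres': ['breakcore', 'nu metal'], 'energy': 1.0, 'valence': 0.5},
    {'name': 'c', 'genres': ['nu metal'], 'energy': 0.75, 'valence': 0.75},
]
medians = genre_feature_medians(tracks, ['energy', 'valence'])
diff = difference_from_genre(tracks[0], 'breakcore', medians, ['energy', 'valence'])
```

A track with several genres contributes to each of them, which matches the earlier choice to let duplicates carry more weight.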

# Plotly

In [None]:
fig = px.bar(
    genre_count_df,
    labels={
        'index':'Genres', 
        'value':'Number of Tracks',
        'variable':'Time Range',
    },
    title='Most Listened Genres',
)
# fig.update_layout(xaxis={'categoryorder': 'total descending'})
# trace names match the column names of genre_count_df
fig.update_traces(name='All-time', selector={'name':'all_time_genres'})
fig.update_traces(name='Past 6 Months', selector={'name':'6month_genres'})
fig.update_traces(name='Past 4 Weeks', selector={'name':'4week_genres'})

fig.show()

# Create the model

- Try clustering the track features and genre features.
- Maybe make a simple neural net using weights from the custom TBD metric.
    - Give weights to tracks, artists, and genres
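
To make the clustering idea concrete, here is a small hand-rolled k-means sketch over made-up (energy, valence) pairs. It is an illustration under assumptions, not the planned model: the initial centroids are picked by hand, the features are assumed to be on comparable scales, and in practice a library implementation would likely be used instead.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # keep the old centroid when a cluster ends up empty
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Made-up (energy, valence) pairs, for illustration only
points = [(0.9, 0.2), (0.8, 0.3), (0.2, 0.8), (0.1, 0.9)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
```

The resulting labels could then be compared against genres to see whether the clusters line up with them.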