# Spotify Dataset Exploration:

**Comment:** 

1. Objective: 
    1. Focus only on artists that were not listened to, or were not successful, in 2014. 
    2. Limit the analysis to artists who released songs between 2015 and 2017.
    3. Alternatively, shift the focus to songs that were released between 2015 and 2017 but were not successful in 2014. 
    4. Add more playlists to the "success" criteria - https://www.complex.com/music/best-spotify-playlists/new-music-friday
    5. Perhaps make the top 20 playlists the success criteria.
    6. Take genre specifics into consideration as well (RapCaviar, Massive Pop Remixes, etc.). Major genres should be explored.
    
2. Limitations:
    1. We can't observe the direct effect of certain playlists on music along a timeline, as the day of the month is always 10.
    2. The weeks are always at fixed intervals. 
    3. The success criteria have been determined by WMG - we might need to broaden our definition of 'success'. 
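
The first two limitations can be checked directly on the data. The sketch below uses a made-up toy log rather than `cleaned_data.csv`; the `date` and `day` column names follow the notebook's dataframe, while `toy_log`, `distinct_days` and `gaps_in_days` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the streaming log; values are invented for illustration.
toy_log = pd.DataFrame({
    "date": ["2016-05-10", "2016-06-10", "2016-07-10", "2016-07-10"],
    "track_name": ["A", "A", "B", "A"],
})
toy_log["date"] = pd.to_datetime(toy_log["date"])
toy_log["day"] = toy_log["date"].dt.day

# Distinct days of the month present in the log; a single value means the
# data is a periodic snapshot rather than a continuous timeline.
distinct_days = sorted(toy_log["day"].unique().tolist())

# Gaps between consecutive snapshot dates, in days.
snapshot_dates = toy_log["date"].drop_duplicates().sort_values()
gaps_in_days = snapshot_dates.diff().dropna().dt.days.tolist()

print(distinct_days)
print(gaps_in_days)
```

If the real data shows a single distinct day and near-constant gaps, any playlist effect can only be measured between snapshots, not day by day.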

In [None]:
# Set up environment:

# !pip install xgboost
# !pip install tensorflow
# !pip install "tensorflow-text==2.8.*"
# !pip install bokeh
# !pip install simpleneighbors[annoy]
# !pip install tqdm
# !pip install lyricsgenius
# !pip install spotipy

In [7]:
import numpy as np

In [3]:
 # Relevant Packages:
    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, RepeatedStratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score, precision_recall_curve
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier, StackingClassifier
#from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LinearRegression

import pickle 
import time
from IPython.display import HTML, display

In [None]:
# Packages for Lyrics Embeddings:

import bokeh
import bokeh.models
import bokeh.plotting
import os
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
import sklearn.metrics.pairwise
from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange

## Some Sources:

https://output.com/blog/playlists-good-or-bad-for-musicians#:~:text=Playlists%20match%20the%20right%20song%20to%20the%20right%20listener&text=Often%20without%20promotion%2C%20these%20customized,likely%20want%20to%20hear%20it.

https://github.com/maxgmarin/AC209a_FinalProject_EEM

https://github.com/maxgmarin/AC209a_FinalProject_EEM/blob/master/notebook_Markdown/AC209a_Final_ER_Spotify_EDA.md

https://www.theinformationlab.co.uk/2019/08/08/getting-audio-features-from-the-spotify-api/

https://towardsdatascience.com/using-sentence-embeddings-to-explore-the-similarities-and-differences-in-song-lyrics-1820ac713f00

https://medium.com/swlh/how-to-leverage-spotify-api-genius-lyrics-for-data-science-tasks-in-python-c36cdfb55cf3

### Ideas to Walk Through:

1. Some artists may have only just released their first song, or may have been in the industry for far fewer years. 
2. We may need more characteristics regarding particular playlists. 
3. A correlation matrix on the audio (soundtrack) features can give us an idea of whether to do a PCA on all of them together, or on each of them separately.
4. Playlist features can be computed as the average over the songs (currently in the dataset) on each playlist. 
5. We can investigate genre on a PCA of the audio features later - https://maxgmarin.github.io/AC209a_FinalProject_EEM/
6. Maybe get Twitter or Reddit posts regarding songs from 2017 and before.

In [8]:
df = pd.read_csv('cleaned_data.csv', low_memory=False)

In [9]:
df.head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,day,log_time,mobile,track_id,isrc,upc,artist_name,...,hour,minute,week,month,year,date,weekday,weekday_name,playlist_id,playlist_name
0,0,9,"('small_artists_2016.csv', 9)",10,20160510T12:15:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,12,15,19,5,2016,2016-05-10,1,Tuesday,,
1,1,19,"('small_artists_2016.csv', 19)",10,20160510T12:15:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,12,15,19,5,2016,2016-05-10,1,Tuesday,,
2,2,29,"('small_artists_2016.csv', 29)",10,20160510T14:00:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,14,0,19,5,2016,2016-05-10,1,Tuesday,,
3,3,39,"('small_artists_2016.csv', 39)",10,20160510T10:45:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,10,45,19,5,2016,2016-05-10,1,Tuesday,,
4,4,49,"('small_artists_2016.csv', 49)",10,20160510T10:15:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,10,15,19,5,2016,2016-05-10,1,Tuesday,,
5,5,59,"('small_artists_2016.csv', 59)",10,20160510T02:30:00,False,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,2,30,19,5,2016,2016-05-10,1,Tuesday,,
6,6,69,"('small_artists_2016.csv', 69)",10,20160510T09:45:00,False,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,9,45,19,5,2016,2016-05-10,1,Tuesday,,
7,7,79,"('small_artists_2016.csv', 79)",10,20160510T14:00:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,14,0,19,5,2016,2016-05-10,1,Tuesday,,
8,8,89,"('small_artists_2016.csv', 89)",10,20160510T19:15:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,19,15,19,5,2016,2016-05-10,1,Tuesday,,
9,9,99,"('small_artists_2016.csv', 99)",10,20160510T15:00:00,True,8f1924eab3804f308427c31d925c1b3f,USAT21600547,75679910000.0,Sturgill Simpson,...,15,0,19,5,2016,2016-05-10,1,Tuesday,,


In [None]:
df2 = pd.read_csv('newartists2015onwards.csv', low_memory=False)

In [None]:
# Remove further useless columns:

df.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'], axis=1, inplace=True)

In [None]:
# Remove ones with high % nulls

df.drop(['offline_timestamp', 'stream_cached', 'source', 'referral_code'], axis=1, inplace=True)

In [None]:
df.columns

In [None]:
HTML(df.head().to_html())

**Comment:** 

1. We will need all type-based features
2. For each artist, we will get the features of the 10 most played songs that they have. 
3. There will be a table for each feature, and the rows will be organised by artist
4. A PCA will be done on all tables
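
Point 2 above could be prototyped along these lines. This is a minimal sketch on a made-up stream log, assuming one row per stream and the `artist_name` / `track_name` columns; `top_n_tracks_per_artist` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy stream log: one row per stream, using the notebook's column names.
streams = pd.DataFrame({
    "artist_name": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "track_name":  ["a", "a", "b", "c", "d", "d", "e"],
})

def top_n_tracks_per_artist(log, n):
    """Return the n most-streamed tracks for every artist (hypothetical helper)."""
    counts = (log.groupby(["artist_name", "track_name"])
                 .size()
                 .reset_index(name="stream_count"))
    # Sort by artist, then by descending stream count; ties broken by track name.
    counts = counts.sort_values(["artist_name", "stream_count", "track_name"],
                                ascending=[True, False, True])
    return counts.groupby("artist_name").head(n).reset_index(drop=True)

top_tracks = top_n_tracks_per_artist(streams, n=2)
print(top_tracks)
```

With `n=10` on the real log, the result could be merged with the audio features on the track identifier before building the per-feature tables.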

In [None]:
# Number of songs
df.track_name.nunique()

In [None]:
# Most played song
df.track_name.value_counts().loc[lambda x: x == x.max()]

In [None]:
# Number of songs played less than 10 times
len(df.track_name.value_counts().loc[lambda x: x < 10])

In [None]:
# Number of artists
df.track_artists.nunique()

In [None]:
# Artists with at least 5 songs
no_of_tracks = df[['track_artists', 'track_name']].drop_duplicates().groupby(by='track_artists').count().sort_values(by='track_name', ascending=False)
no_of_tracks.reset_index(inplace=True)
artists_with_5plus_tracks = no_of_tracks[no_of_tracks['track_name'] >= 5]
artists_with_10plus_tracks = no_of_tracks[no_of_tracks['track_name'] >= 10]
print('Number of artists with 5 or more songs: ', len(artists_with_5plus_tracks))
print('Number of artists with 10 or more songs: ', len(artists_with_10plus_tracks))

In [None]:
# Playlists available
df.playlist_name.nunique()

In [None]:
# Top 10 playlists by number of distinct songs
constraint = df['playlist_name'].notna()
df[constraint][['playlist_name', 
                'track_name']].drop_duplicates().groupby(by='playlist_name').count().sort_values(by='track_name', ascending=False).head(10)

In [None]:
# Artists organised by how many playlists they are on
df[constraint][['artist_name', 
                'playlist_name']].drop_duplicates().groupby(by='artist_name').count().sort_values(by='playlist_name', ascending=False)

In [None]:
# Objective playlists (used by later cells, so the list must be defined; names match the playlist lookup further down):
success_playlists = ['Hot Hits UK', 'Massive Dance Hits', 'The Indie List', 'New Music Friday UK']

In [None]:
# Checking the distribution of artists by when their songs were played:

df.year.value_counts()

**Comment:**

1. We will have to first gather which artists have made it onto the successful playlists
2. After that, we drop columns with such playlists, in order to avoid feature leakage

In [None]:
# all artists in ascending order
all_artists_ordered = df['artist_name'].drop_duplicates().sort_values()
len(all_artists_ordered)

In [None]:
# all artists that have been successful
temp_df = df[df.playlist_name.isin(success_playlists)][['artist_name', 'playlist_name']]

# artists by degree of success
artists_dos = temp_df.copy().drop_duplicates().groupby('artist_name').count().reset_index()
artists_dos.columns = ['artist_name', 'degree_of_success']
artists_dos.sort_values(by='artist_name', inplace=True)

# artists that reached success
artists_rs = temp_df[['artist_name']].drop_duplicates().copy()
artists_rs['success'] = 1
artists_rs.sort_values(by='artist_name', inplace=True)

# display the dataframes
display(artists_dos.head(5))
display(artists_rs.head(5))

In [None]:
# All artists, whether they have reached success or not (1 = success, 0 = not)
artists_status = all_artists_ordered.isin(artists_rs.artist_name).astype(int).tolist()

In [None]:
# artist linked to dependent variable - reached success or not

# artist linked to dependent variable - degree of success

In [None]:
df.partner_name.unique()

In [None]:
# Checking how many songs are remixes
list_of_tracks = df.track_name.drop_duplicates().tolist()
        
len([song for song in list_of_tracks if ('Remix' in song) or ('remix' in song)])

In [None]:
len(list_of_tracks)

In [None]:
len([song for song in list_of_tracks if ('Remix' in song) and ('-' not in song)])

In [None]:
len(df[df.track_name.str.contains('Remix')]['track_name'].drop_duplicates())

In [None]:
df[df.track_name.str.contains('Remix')]['track_name'].drop_duplicates()

# Checking Which Songs were Listened To By Year

In [None]:
df.info()

In [None]:
# Making Song/Year Pivot Table

df_song_year = pd.pivot_table(df, values='track_id', index='track_name', columns = 'year', aggfunc=len)

In [None]:
# Filling and showing

df_song_year.fillna(0, inplace=True)

df_song_year.head()

# Defining Success Criteria

In [None]:
temp = df[(df.track_name == '7 Years') & (df.year == 2017)]
sns.countplot(x='day', data=temp)
plt.show()

In [None]:
# Stream counts per date for the selected track
pd.pivot_table(temp, values='track_id', index='date', columns='track_name', aggfunc=len)

### Checking Which Playlist IDs are Counted as SUCCESS

In [None]:
playlist_name_id_pairs = df[['playlist_id', 'playlist_name']].copy().drop_duplicates()
top_100_playlists = df.playlist_id.value_counts().head(100).to_frame().reset_index()
top_100_playlists.columns = ['playlist_id', 'stream_count']
top_100_playlists = top_100_playlists.merge(playlist_name_id_pairs, on='playlist_id', how='left', copy=True)
top_100_playlists.playlist_name = top_100_playlists.playlist_name.astype(str)

In [None]:
top_100_playlists[top_100_playlists['playlist_name'].isin(['Hot Hits UK', 'Massive Dance Hits', 'The Indie List', 'New Music Friday UK'])]

In [None]:
# Order is Hot Hits UK, Massive Dance Hits, New Music Friday UK, The Indie List

success_playlist_ids = ['6FfOZSAN3N6u7v81uS7mxZ', '37i9dQZF1DX5uokaTN4FTR', '37i9dQZF1DX4W3aJJYCDfV', '37i9dQZF1DWVTKDs2aOkxu']

In [None]:
# DataFrame for Success Criteria

success_criteria_df = top_100_playlists[top_100_playlists['playlist_id'].isin(success_playlist_ids)]
success_criteria_df

### Checking Which Playlist IDs should be Ignored

**Rule:** We will remove playlists with a stream count above 10k, as well as the playlists that are considered a hallmark of success. This mitigates feature leakage and avoids using variables that are correlated with the outcome variable as predictor variables. 
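
As an illustration of the rule, the sketch below builds an ignore list from a toy playlist table. The IDs and counts are invented, and `playlists_to_ignore` is a hypothetical helper; only the 10k threshold comes from the rule itself.

```python
import pandas as pd

# Toy playlist table; IDs and counts are made up for illustration.
playlists = pd.DataFrame({
    "playlist_id":  ["p1", "p2", "p3", "p4"],
    "stream_count": [25000, 12000, 9000, 400],
})
success_ids = ["p3"]          # hypothetical success-criteria playlist
STREAM_THRESHOLD = 10_000     # the 10k cut-off from the rule

def playlists_to_ignore(table, success_ids, threshold):
    """IDs that are either very large or part of the success criteria."""
    too_big = table["stream_count"] > threshold
    is_success = table["playlist_id"].isin(success_ids)
    return table.loc[too_big | is_success, "playlist_id"].tolist()

ignored = playlists_to_ignore(playlists, success_ids, STREAM_THRESHOLD)
print(ignored)
```

Combining both conditions in one boolean mask avoids a separate de-duplication step when a success playlist also exceeds the threshold.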

In [None]:
playlists_to_be_ignored = top_100_playlists[top_100_playlists.stream_count > 10000].copy()

In [None]:
# Appending success playlist on top, and removing duplicates

playlists_to_be_ignored = pd.concat([playlists_to_be_ignored.copy(), success_criteria_df.copy()]).drop_duplicates()

In [None]:
# To CSV

# success_criteria_df.to_csv('playlists_success_criteria.csv')
# playlists_to_be_ignored.to_csv('playlists_to_ignore_PCA.csv')

# Checking Which Artists were Successful Across the Years (And to Degree)

In [None]:
# Get the necessary values of main df

logged_success = df[df.playlist_name.isin(success_playlists)].copy()

In [None]:
# Get pivot ready with filled values

df_successes_year = pd.pivot_table(logged_success, values='playlist_id', index='artist_name', columns = 'year', aggfunc=pd.Series.nunique)
df_successes_year.fillna(0, inplace=True)

In [None]:
# See dataframe

df_successes_year.head()

In [None]:
logged_success.playlist_id.unique()

In [None]:
logged_success.playlist_name.unique()

# SpotiPy and Genius (With geniuslyrics Python Wrapper) API Data Extraction

In [None]:
# Additional Spotify

import requests
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
# Spotipy Credentials (read from the environment rather than hard-coded in the notebook)

import os

CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']

In [None]:
### Spotify non-user initialisation

client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [None]:
# Number of Songs

df.track_name.nunique()

In [None]:
# Making list of unique track URIs to reference

track_uri_list = df.track_uri.copy().drop_duplicates().tolist()
len(track_uri_list)

### Audio Features

In [None]:
def audio_features_extract(track_uri_list):
    
    '''
    
    Creates a dataframe of track URIs, matched with their respective audio features. 
    This will need to be merged with song names and artist for a more complete dataframe.
    Tracks for which Spotify returns no features are kept as rows of NaN.
    
    '''
    
    feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 
                     'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 
                     'duration_ms', 'time_signature']
    rows = []
    
    for uri in track_uri_list:
        # The first entry is None when Spotify has no audio features for the track
        temp_uri_info = sp.audio_features(uri)[0] or {}
        row = {'track_uri': uri}
        for name in feature_names:
            row[name] = temp_uri_info.get(name, np.nan)
        rows.append(row)
        time.sleep(0.1)  # pause between requests to stay under the rate limit
    
    return pd.DataFrame(rows, columns=['track_uri'] + feature_names)

In [None]:
# Get dataframe of audio features

audio_features = audio_features_extract(track_uri_list)

In [None]:
# Checking for bugs and nulls

display(audio_features.head())
display(audio_features.tail())
print('Number of nulls: ', audio_features.isnull().sum().sum())

In [None]:
# Save audio_features to audio_features.csv

# audio_features.to_csv('audio_features.csv')

### Lyrics from Songs

In [None]:
# Importing essential files 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import lyricsgenius

In [None]:
# Get song names and artist names set up as pairs and zips to draw on

track_artist_pairs = df[['track_name', 'artist_name']].copy().drop_duplicates()

#Getting all remixes: 
pairs_remix = track_artist_pairs[(track_artist_pairs['track_name'].str.contains('Remix')) | 
                                 (track_artist_pairs['track_name'].str.contains('remix'))]
track_names_list_remix = pairs_remix.track_name.tolist()
artist_names_list_remix = pairs_remix.artist_name.tolist()
zipped_tracks_artists_remix = zip(track_names_list_remix, artist_names_list_remix)

#Getting all nonremixes: 
pairs_nonremix = track_artist_pairs[~((track_artist_pairs['track_name'].str.contains('Remix')) | 
                                (track_artist_pairs['track_name'].str.contains('remix')))]
track_names_list_nonremix = pairs_nonremix.track_name.tolist()
artist_names_list_nonremix = pairs_nonremix.artist_name.tolist()
zipped_tracks_artists_nonremix = zip(track_names_list_nonremix, artist_names_list_nonremix)

# Get results
print('Number of pairs: ', len(track_artist_pairs))
print('Number of pairs that are remixes: ', len(pairs_remix))
print('Number of pairs that are original: ', len(pairs_nonremix))

**Comment:** The non-remixes, or 'originals', should be available in the Genius API database. The remixes might not be, and the API might return irrelevant songs for them. Lyric extraction for the original songs can be automated fairly easily; for the remixes it might not be. We will look into the remixes. 

In [None]:
pairs_remix.head(10)

In [None]:
pairs_nonremix.head(10)

**Comment:** There were some errors when using the Genius API on remixes and even on non-remixes. After experimentation, it seems that 'feat', parentheses and hyphens interrupt the API search. We will remove these from both remixes and non-remixes. 
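
One way to strip these decorations before searching is sketched below. `clean_track_title` is a hypothetical helper and the example titles are made up; unlike a plain `split('-')`, it only splits on hyphens surrounded by spaces, which is an assumption about how remix suffixes are usually written.

```python
import re

def clean_track_title(title):
    """Strip remix/feature decorations from a title (hypothetical helper)."""
    # Drop anything in parentheses or square brackets, e.g. "(feat. X)" or "[Live]".
    cleaned = re.sub(r"[\(\[].*?[\)\]]", " ", title)
    # Drop everything after a spaced hyphen, e.g. " - Radio Edit".
    cleaned = re.split(r"\s-\s", cleaned, maxsplit=1)[0]
    # Drop a trailing "feat. ..." / "ft. ..." credit that was not in brackets.
    cleaned = re.split(r"\s(?:feat|ft)\.?\s", cleaned, maxsplit=1,
                       flags=re.IGNORECASE)[0]
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", cleaned).strip()

examples = [
    "7 Years - Remix",
    "Song A (feat. Someone)",
    "Song B feat. Someone",
    "Plain Title",
]
for raw in examples:
    print(repr(raw), "->", repr(clean_track_title(raw)))
```

The cleaned title would be used only for the search query; the original title should still be stored alongside the lyrics so the results can be joined back to the main dataframe.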

In [None]:
# Initialising Genius API with token (read from the environment rather than hard-coded)

import os

genius = lyricsgenius.Genius(os.environ['GENIUS_ACCESS_TOKEN'])

In [None]:
# Testing

# song = genius.search_song("Save Me ", 'The Parakit')
# print(song.lyrics)

In [None]:
# Defining function for extracting songs

def extracting_lyrics(zipped_pairs):
    
    '''
    
    Extracts lyrics with errors in mind, should the song not be in Genius's database. 
    
    '''
    
    artist_names = []
    track_names = []
    lyrics = []
    
    for track, artist in zipped_pairs:
        time.sleep(0.01)
        
        # Setting up important variables
        artist_temp = artist 
        if "(" in track:
            track_temp = track.split('(')[0]
        elif "-" in track:
            track_temp = track.split('-')[0]
        else:
            track_temp = track
        
        # Getting the lyrics
        try: 
            song = genius.search_song(track_temp, artist_temp)
            lyrics.append(song.lyrics)
            track_names.append(track)
            artist_names.append(artist)
        except Exception:
            lyrics.append('Fail')
            track_names.append(track)
            artist_names.append(artist)
            continue
        
    print('Number of lyrics we have', len(lyrics))
    print('Number of artists we have', len(artist_names))
    print('Number of tracks we have', len(track_names))
    
    # We return the lists, which we will manually turn into a dataframe
    # This is because the function can be prone to errors, and the lists might have different lengths
    return artist_names, track_names, lyrics

### Original / Non-remix Songs

In [None]:
# Running function for original songs

# artist_names_func1, track_names_func1, lyrics_func1 = extracting_lyrics(zipped_tracks_artists_nonremix) # Takes about one hour to run

In [None]:
# print(artist_names_func1[-1])
# print(track_names_func1[-1])
# print(lyrics_func1[-1])

In [None]:
# Checking the number of original songs from the first run whose lyrics were SUCCESSFULLY retrieved:

print('Number of original songs whose lyrics were successfully extracted: ', sum([1 for x in lyrics_func1 if x != 'Fail']))

In [None]:
# Checking the number of original songs from the first run whose lyrics FAILED to be retrieved:

print('Number of original songs whose lyrics failed to be extracted: ', sum([1 for x in lyrics_func1 if x == 'Fail']))

**Comment:** We used the last row to check that the details of the last song align. They do! We can safely turn the lists into a dataframe. We will save it as a csv file, 'song_lyrics_func1.csv', to retain our findings, as extraction was a time-intensive process. 

In [None]:
# Saving the lists as a dataframe, and then saving that to a csv

# song_lyrics_func1 = pd.DataFrame({'track_name': track_names_func1, 'artist_name': artist_names_func1, 'lyrics': lyrics_func1})
# song_lyrics_func1.to_csv('song_lyrics_func1.csv')

### Remix Songs

In [None]:
# Running function for remix songs (Best not to run - load the saved dataset instead)

# Run function
artist_names_func2, track_names_func2, lyrics_func2 = extracting_lyrics(zipped_tracks_artists_remix) # Takes about 30 minutes to run

# Save to Dataframe
song_lyrics_func2 = pd.DataFrame({'track_name': track_names_func2, 'artist_name': artist_names_func2, 'lyrics': lyrics_func2})

# Save to csv
# song_lyrics_func2.to_csv('song_lyrics_func2.csv')

In [None]:
# Checking the number of remix songs from the first run whose lyrics were SUCCESSFULLY retrieved:

print('Number of remix songs whose lyrics were successfully extracted: ', sum([1 for x in lyrics_func2 if x != 'Fail']))

In [None]:
# Checking the number of remix songs from the first run whose lyrics FAILED to be retrieved:

print('Number of remix songs whose lyrics failed to be extracted: ', sum([1 for x in lyrics_func2 if x == 'Fail']))

### Analysing the Failed Lyric Extractions (func1 & func2)

In [None]:
# Load the necessary data:
song_lyrics_func1 = pd.read_csv('song_lyrics_func1.csv')
song_lyrics_func2 = pd.read_csv('song_lyrics_func2.csv')
song_lyrics_func1.drop('Unnamed: 0', axis=1, inplace=True)
song_lyrics_func2.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
# Check Func1 Table Fails

song_lyrics_func1[song_lyrics_func1['lyrics'] == 'Fail']

In [None]:
song_lyrics_func1[song_lyrics_func1['track_name'].str.contains('Beethoven') |
                   song_lyrics_func1['track_name'].str.contains('Instrumental') |
                   song_lyrics_func1['track_name'].str.contains('Violin') |
                   song_lyrics_func1['track_name'].str.contains('Piano')]

**Comment:** These tracks will need to be removed, as they appear to be instrumental pieces without lyrics.

In [None]:
song_lyrics_func1[song_lyrics_func1['lyrics'] == 'Fail'].artist_name.value_counts().tail(100)

In [None]:
# song_lyrics_func1[song_lyrics_func1['lyrics'] == 'Fail'].to_csv('help_with_lyrics.csv', index=False)

In [None]:
# Check Func2 Table Fails

song_lyrics_func2[song_lyrics_func2['lyrics'] == 'Fail']

**Comment:** For the remixes, there are only two artists that there are issues with. Perhaps another website might have the lyrics for these two artists, or another API. 

# Processing the Audio Features of All Songs and Playlists

In [None]:
# Loading and processing data

audio_features = pd.read_csv('audio_features.csv')
audio_features.drop('Unnamed: 0', axis=1, inplace=True)

# Perhaps we don't need some columns (probably unveiled in EDA section)
audio_features.drop(['duration_ms', 'key'], axis=1, inplace=True)

In [None]:
audio_features.head()

In [None]:
audio_features.shape

In [None]:
# Getting joined table ready

audio_joined = df[['track_uri', 'playlist_id', 'artist_name', 'playlist_name']].copy().merge(audio_features, how='left', on='track_uri')

In [None]:
# Get pivot tables ready (artist x playlist, one table per audio feature)

mean_features = ['danceability', 'energy', 'loudness', 'mode', 'speechiness', 
                 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

audio_pivots = [pd.pivot_table(audio_joined, index='artist_name', columns='playlist_id', 
                               values=feature, aggfunc='mean').fillna(0) 
                for feature in mean_features]

# time_signature is discrete, so it is aggregated with the median instead of the mean
time_signature_pivot = pd.pivot_table(audio_joined, index='artist_name', 
                                      columns='playlist_id', values='time_signature', aggfunc='median').fillna(0)

In [None]:
# Full join on all pivot tables

final_audio_pivot = pd.concat(audio_pivots + [time_signature_pivot], axis=1)

In [None]:
# Drop columns that lead to feature leakage

final_audio_pivot.drop(playlists_to_be_ignored.playlist_id.tolist(), axis=1, inplace=True)

In [None]:
# Doing PCA:

pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=50))])
components = pca.fit(final_audio_pivot)

In [None]:
# Assigning variables

no_of_components = list(range(1,51))
exp_variance_ratio = pca[1].explained_variance_ratio_.tolist()
cum_exp_variance = np.cumsum(exp_variance_ratio)

In [None]:
# Plotting explained variance and cumulative variance over components

sns.set_style("darkgrid")
plt.figure(figsize=(10,7))
sns.lineplot(x=no_of_components, y=exp_variance_ratio, color='green')
sns.lineplot(x=no_of_components, y=cum_exp_variance, color = 'orange')
plt.legend(labels=["Explained Variance","Cumulative Explained Variance"])
plt.show();

In [None]:
# Zooming into elbow curve

sns.set_style("darkgrid")
plt.figure(figsize=(10,7))
sns.barplot(x=no_of_components, y=exp_variance_ratio, color='green')
plt.xlim(2.5, 18.5)
plt.ylim(0,0.05)
plt.show();

In [None]:
# Calculating the drop in explained variance from each component to the next (components 4 to 49):

differences = [exp_variance_ratio[k] - exp_variance_ratio[k + 1] for k in range(3, 49)]

In [None]:
# Better visualising 'elbows'. Need to pick poison.

x = list(range(4,50))
y = differences
plt.figure(figsize=(18,5))
sns.barplot(x=x, y=y, color='green')
plt.title('Difference in Explained Variance between a Given Component and the Next One')
plt.show()

**Comment:** The larger the value in the above diagram, the more *useless* the next component, since it explains that much less variance than the current one. We can choose 9, 11, or 15 components. We can "choose our poison". 
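
To make the choice less subjective, the elbow reading can be compared with a cumulative-variance rule. The sketch below uses made-up ratios in place of `pca[1].explained_variance_ratio_`, and the 0.75 threshold is an assumed target, not a value taken from this analysis.

```python
import numpy as np

# Made-up explained-variance ratios standing in for the fitted PCA's output.
ratios = np.array([0.30, 0.22, 0.18, 0.06, 0.04, 0.03, 0.02])

# Drop from each component to the next: drops[k] = ratios[k] - ratios[k + 1].
drops = -np.diff(ratios)

# Candidate elbow: keep components up to and including the one that is
# followed by the largest drop.
elbow_components = int(np.argmax(drops)) + 1

# Alternative rule: smallest number of components whose cumulative
# explained variance reaches the threshold.
threshold = 0.75  # assumed target
cumulative = np.cumsum(ratios)
threshold_components = int(np.searchsorted(cumulative, threshold)) + 1

print(elbow_components, threshold_components)
```

When the two rules disagree, as they do on these toy numbers, the downstream model's cross-validated score is a reasonable tie-breaker.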

# Processing the Lyrics Features of All Songs and Playlists

Link to try: https://towardsdatascience.com/using-sentence-embeddings-to-explore-the-similarities-and-differences-in-song-lyrics-1820ac713f00

**Comment:** File formatted properly. Text file for 'fails' will need to be redone.

**Update:** Redone

In [None]:
# Loading Song Files

# Loading csv
originals_lyrics = pd.read_csv('song_lyrics_func1.csv')
remixes_lyrics = pd.read_csv('song_lyrics_func2.csv')
lyrics_fixed = pd.read_csv('lyrics_sharaf_fix2.txt')

# Making necessary changes
originals_lyrics.drop('Unnamed: 0', axis=1, inplace=True) # Forgot to remove index before saving as csv
remixes_lyrics.drop('Unnamed: 0', axis=1, inplace=True) # Forgot to remove index before saving as csv

**Note:** lyrics_fixed must come first in the concatenation, with originals and remixes appended after it, as removing duplicates keeps the first occurrence of a row. 
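
A small toy example of this ordering effect is given below. The frames and lyrics are invented; 'Fail' marks a lookup that did not succeed, as elsewhere in this notebook.

```python
import pandas as pd

# Manually corrected lyrics (toy data).
fixed = pd.DataFrame({"track_name": ["s1"], "artist_name": ["X"],
                      "lyrics": ["corrected words"]})
# Automatically scraped lyrics, including one failed lookup (toy data).
scraped = pd.DataFrame({"track_name": ["s1", "s2"], "artist_name": ["X", "Y"],
                        "lyrics": ["Fail", "other words"]})

# The fixed rows go first, so drop_duplicates (keep='first' by default)
# keeps them and discards the failed scrape of the same track.
combined = (pd.concat([fixed, scraped], ignore_index=True)
              .drop_duplicates(subset=["track_name", "artist_name"]))

print(combined)
```

Reversing the order of the two frames in `pd.concat` would keep the 'Fail' row for `s1` instead.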

In [None]:
# Getting the final lyrics

# Concatenate
final_lyrics = pd.concat([lyrics_fixed.copy(), 
                          originals_lyrics.copy(), 
                          remixes_lyrics.copy()]).drop_duplicates(subset=['track_name', 'artist_name'])

# Drop the ones with fail
final_lyrics = final_lyrics[final_lyrics.lyrics != 'Fail']

# Making necessary changes to manipulate data
final_lyrics.lyrics = final_lyrics.lyrics.astype(str)

In [None]:
# Defining function to clean lyrics

# Specifying what to remove
remove_from_lyrics = ['\n', 'Lyrics', '[Verse 1]', '[Verse 2]', '[Intro]', 
                      '[Chorus]', '[Post-Chorus]', '[Bridge]', '[Outro]']

# Create function 
def clean_lyrics(lyrics_df):
    # Work on a copy so the input dataframe is not modified in place
    lyrics_df = lyrics_df.copy()
    for element in remove_from_lyrics:
        lyrics_df['lyrics'] = lyrics_df['lyrics'].str.replace(element, ' ', regex=False)
    return lyrics_df

In [None]:
# Calling function to clean df

final_lyrics_cleaned = clean_lyrics(final_lyrics)
final_lyrics_cleaned.head()

In [None]:
# Adding an index for each pair for reference

final_lyrics_cleaned['primary_key'] = range(1, final_lyrics_cleaned.shape[0]+1)

### Attempting to do Embeddings

In [None]:
# Setting up model to do embeddings:

module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
embedder = hub.load(module_url)

In [None]:
# Defining function to embed a single text

def embed_text(text):
    # Take the first (and only) embedding vector returned for the input
    return np.array(embedder(text)[0])

In [None]:
# Creating list of lists

def embedded_lyrics_df(df):
    embedded_lyrics = []
    for lyrics in df.lyrics.tolist():
        temp = embed_text(lyrics)
        embedded_lyrics.append(temp)
    temp_df = pd.DataFrame(embedded_lyrics)
    return temp_df

In [None]:
# Call function

# embedded_lyrics_df1 = embedded_lyrics_df(final_lyrics_cleaned)

# save
# embedded_lyrics_df1.to_csv('3639_lyrics_embeddings.csv')

# load data
embedded_lyrics_df1 = pd.read_csv('3639_lyrics_embeddings.csv')

In [None]:
# Checking Shape

embedded_lyrics_df1.shape

In [None]:
# Merge with embeddings

# Getting df ready
merge_with_embeddings_df = df[['playlist_id', 'playlist_name', 'track_id', 'track_name', 'artist_name']].copy()

# Concatenating with columns to join upon
embedded_lyrics_df2 = pd.concat([final_lyrics_cleaned[['track_name', 'artist_name']].copy().reset_index(drop=True), 
                                 embedded_lyrics_df1.copy().reset_index(drop=True)], axis=1)

# Merging
merged_embeddings = merge_with_embeddings_df.merge(embedded_lyrics_df2, how='left', on=['track_name', 'artist_name'])

In [None]:
merged_embeddings

In [None]:
# Aggregate each embedding dimension by artist and playlist

pivots_embeddings = []

for column in merged_embeddings.columns[6:]:
    embeddings_playlist_pivot = pd.pivot_table(merged_embeddings, 
                                           index='artist_name', 
                                           columns='playlist_id', 
                                           aggfunc='mean',
                                           values=column)
    pivots_embeddings.append(embeddings_playlist_pivot)

# Artists are kept as the shared index, so the pivots align row by row
pca_ready_embeddings = pd.concat(pivots_embeddings, axis=1)

In [None]:
# Saving a transposed copy for better storage (the working dataframe keeps artists as rows)

pca_ready_embeddings.T.to_csv('pca_ready_embeddings.csv')

In [None]:
# Store artist name record in given order

# Store names
store_artists_pca_embeddings = pca_ready_embeddings.index.to_series().reset_index(drop=True)

# Drop the name index to get ready for PCA:
pca_ready_embeddings.reset_index(drop=True, inplace=True)

# fill nulls
pca_ready_embeddings.fillna(0, inplace=True)

In [None]:
# Run pca

lyrics_pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=50))])
lyrics_pca.fit(pca_ready_embeddings)

In [None]:
# Running analysis

no_of_components = list(range(1,51))
exp_variance_ratio = lyrics_pca[1].explained_variance_ratio_.tolist()
cum_exp_variance = np.cumsum(exp_variance_ratio)

In [None]:
# Plotting explained variance and cumulative variance over components

sns.set_style("darkgrid")
plt.figure(figsize=(10,7))
sns.lineplot(x=no_of_components, y=exp_variance_ratio, color='green')
sns.lineplot(x=no_of_components, y=cum_exp_variance, color = 'orange')
plt.legend(labels=["Explained Variance","Cumulative Explained Variance"])
plt.show();

In [None]:
# Zooming into elbow curve

sns.set_style("darkgrid")
plt.figure(figsize=(10,7))
sns.barplot(x=no_of_components, y=exp_variance_ratio, color='green')
plt.xlim(2.5, 18.5)
plt.ylim(0,0.05)
plt.show();

In [None]:
# Better visualising 'elbows'. Need to pick poison.

# Drop in explained variance from each component to the next
differences = [exp_variance_ratio[k] - exp_variance_ratio[k + 1] for k in range(49)]

x = list(range(2,50))
y = differences[1:]
plt.figure(figsize=(18,5))
sns.barplot(x=x, y=y, color='green')
plt.title('Difference in Explained Variance between a Given Component and the Next One')
plt.show()