# Spotify Dataset 1922 - 2021

In this notebook, we first analyse the data and produce some basic visualisations, then perform a time series analysis of how the tracks' audio features change over the years.
After that, we build a recommendation model for artists using content-based filtering.

> Before we start, I would like to acknowledge that I am still learning myself, and that I took inspiration from Stack Overflow, Kaggle and articles on Towards Data Science. I also referred to code from other contributors to this Kaggle dataset, such as Darkstar Dream and Florian Heiny.

The imports I am using for the data analysis and visualisations:

In [None]:
import pandas as pd
import numpy as np
import json
from datetime import date
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

artist_df = pd.read_csv('/kaggle/input/spotify-dataset-19212020-160k-tracks/artists.csv')
tracks_df = pd.read_csv('/kaggle/input/spotify-dataset-19212020-160k-tracks/tracks.csv')
with open('/kaggle/input/spotify-dataset-19212020-160k-tracks/dict_artists.json', encoding='utf-8', errors='ignore') as json_data:
    data = json.load(json_data, strict=False)

In [None]:
tracks_df

Let's convert the release dates to datetime objects so that they are easier to work with later.

In [None]:
df = tracks_df.copy()

df['release_date'] = pd.to_datetime(df['release_date'])
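
The `release_date` column may mix precisions (year only, year and month, or a full date), and newer pandas versions can reject such a column when they infer a single format. Below is a minimal sketch of one way to handle that, assuming the values look like `'1922'`, `'1922-06'` or `'1922-02-22'`. The helper name `parse_release_date` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def parse_release_date(dates: pd.Series) -> pd.Series:
    """Parse dates given as 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' strings."""
    text = dates.astype(str)
    # Pad year-only values to the first day of the year
    text = text.where(text.str.len() != 4, text + "-01-01")
    # Pad year-month values to the first day of the month
    text = text.where(text.str.len() != 7, text + "-01")
    # Anything that still does not match becomes NaT instead of raising
    return pd.to_datetime(text, format="%Y-%m-%d", errors="coerce")

# Hypothetical sample covering the three assumed precisions
sample = pd.Series(["1922", "1922-06", "1922-02-22"])
parsed = parse_release_date(sample)
print(parsed)
```

Padding to an explicit format avoids relying on format inference, at the cost of treating year-only dates as 1 January of that year.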

Correlation matrix of all the variables in the dataset tracks_df

In [None]:
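# Restrict to numeric columns; recent pandas versions raise on text columns in corr()
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(20, 8))
sns.heatmap(corr, vmax=1, vmin=-1, center=0, linewidths=.5, square=True,
            annot=True, annot_kws={'size': 8}, fmt='.1f', cmap='BrBG_r')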
plt.title('Correlation')
plt.show()

Correlation matrix of the important variables in the dataset tracks_df

In [None]:
# Select the audio features first, then compute their correlation matrix
features = df[["acousticness", "danceability", "energy", "instrumentalness",
               "liveness", "tempo", "valence", "loudness", "speechiness"]]

plt.figure(figsize=(10, 10))
sns.heatmap(features.corr(), annot=True)
plt.show()
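
A heatmap can be hard to read when you want an exact ranking. As a rough sketch, the strongest pairwise correlations can also be listed directly. The toy values and the helper name `top_correlations` below are assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Order by absolute strength, strongest first
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Hypothetical toy data standing in for a few audio features
toy = pd.DataFrame({
    "energy": [0.1, 0.4, 0.6, 0.9],
    "loudness": [-20.0, -12.0, -8.0, -4.0],
    "acousticness": [0.9, 0.7, 0.3, 0.1],
})
strongest = top_correlations(toy)
print(strongest)
```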

Here, we create a new `year` column to make the visualisations easier.

In [None]:
# The vectorised .dt accessor is much faster than a row-wise apply
df['year'] = df['release_date'].dt.year

Time series analysis: the yearly average of each audio feature

In [None]:
audio_features = ["acousticness", "danceability", "energy", "instrumentalness",
                  "liveness", "tempo", "valence", "loudness", "speechiness"]

# Average every audio feature per release year (groupby sorts by year)
year_avg = df[audio_features + ["year"]].groupby("year").mean().reset_index()

plt.figure(figsize=(14, 8))
plt.title("Song Trends Over Time", fontdict={"fontsize": 15})

# tempo (BPM) and loudness (dB) are not plotted; they use different scales
lines = ["acousticness", "danceability", "energy",
         "instrumentalness", "liveness", "valence", "speechiness"]

for line in lines:
    # Label each line so the legend matches it by name rather than by draw order
    sns.lineplot(x="year", y=line, data=year_avg, label=line)

plt.ylabel("value")
plt.legend()
plt.show()

# Recommendation Model

We now use content-based filtering to build the recommendation model. We will use `TfidfVectorizer` and `linear_kernel` from scikit-learn.
We also remove the `id` column, since we don't need it for the recommendations.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

artist_df.drop(['id'], axis=1, inplace=True)

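First, we remove all the rows with empty genres, since they will not help with the recommendation.
Secondly, we keep only the most popular fifteenth of the artists because of limited computational power. If you do not want to subsample, keep only the genre filter and the final `reset_index` call from the next cell; the index still has to be reset, because later cells use it to look up rows in the similarity matrix.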

In [None]:
# Drop artists without any genre information
artist_df = artist_df[artist_df['genres'] != '[]']
# Keep only the most popular 1/15 of the artists to limit memory and compute
artist_df = artist_df.sort_values(by=['popularity'], ascending=False)
n_keep = round(len(artist_df) / 15)
artist_df = artist_df[:n_keep]
artist_df = artist_df.reset_index(drop=True)

In [None]:
artist_df

In [None]:
# min_df=1 keeps every term; an integer min_df of 0 is rejected by recent scikit-learn versions
model = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = model.fit_transform(artist_df['genres'])


In [None]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) 
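
`linear_kernel` computes plain dot products, which is enough here because `TfidfVectorizer` L2-normalises each row by default, so the dot product of two rows equals their cosine similarity. The sketch below checks this on a few hypothetical genre strings (formatted like the `genres` column, but not taken from the dataset).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings, formatted like the `genres` column
toy_genres = [
    "['canadian hip hop', 'rap']",
    "['hip hop', 'rap', 'trap']",
    "['classic rock', 'glam rock']",
]

toy_matrix = TfidfVectorizer(analyzer="word", ngram_range=(1, 3)).fit_transform(toy_genres)

# Rows are unit length (norm='l2' is the default),
# so the dot product is already the cosine similarity
dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix)

print(np.allclose(dot, cos))
print(dot.round(2))
```

The two hip hop entries share terms and score above zero against each other, while the rock entry shares no terms with either and scores zero.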

In [None]:
indices = pd.Series(artist_df.index, index=artist_df['name'])

In [None]:
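results = {}
for idx, row in artist_df.iterrows():
    # Indices of the 99 highest-scoring artists, most similar first
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    # Drop the artist itself explicitly; with tied scores it is not always ranked first
    similar_items = [(cosine_similarities[idx][i], artist_df['name'][i])
                     for i in similar_indices if i != idx]
    results[row['name']] = similar_items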
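
When two artists have identical genre strings, their similarity is 1.0, the same as an artist's similarity to itself, so the artist is not guaranteed to come first in the sorted order. Below is a small sketch of selecting neighbours while excluding the query row explicitly. The function `top_k_neighbours` and the toy matrix are hypothetical.

```python
import numpy as np

def top_k_neighbours(similarities: np.ndarray, idx: int, k: int) -> list:
    """Return the indices of the k rows most similar to row `idx`, excluding `idx`."""
    scores = similarities[idx]
    # Stable sort on negated scores: highest first, ties kept in index order
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if i != idx][:k]

# Toy similarity matrix in which rows 0 and 1 are identical (a perfect tie)
toy_sim = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

print(top_k_neighbours(toy_sim, 0, 2))
print(top_k_neighbours(toy_sim, 1, 2))
```

For row 1, the tied row 0 sorts ahead of row 1 itself, so simply dropping the first result would discard the wrong entry.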

In [None]:
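def _recommend(item_id, num):
    """Return a dict mapping the `num` artists most similar to `item_id` to their scores."""
    recs = results[item_id][:num]
    # Each entry in recs is a (score, name) pair
    return {name: score for score, name in recs}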

In [None]:
_recommend('Drake', 5)

In [None]:
def _recommend_multiple(artists, num=10):
    """Recommend up to `num` artists for several seed artists.

    `artists` maps each seed artist's name to a weight.
    """
    scores = {}
    for artist, weight in artists.items():
        for similar_artist, score in _recommend(artist, num).items():
            # Accumulate each candidate's similarity, scaled by the seed's weight
            scores[similar_artist] = scores.get(similar_artist, 0) + weight * score
    # Highest combined score first; ties are broken alphabetically
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Exclude the seed artists themselves
    return [name for name, _ in ranked if name not in artists][:num]

In [None]:
_recommend_multiple({"Drake": 10, "Queen": 8}, 5)
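
To make the weighting concrete, here is a small worked sketch of how the scores are combined: each candidate's similarity is multiplied by the weight of the seed artist it came from, and the products are summed. The names and scores below are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical per-seed similarity scores (not taken from the dataset)
toy_similar = {
    "A": {"X": 0.9, "Y": 0.5},
    "B": {"Y": 0.8, "Z": 0.6},
}
toy_weights = {"A": 10, "B": 8}

combined = defaultdict(float)
for seed, weight in toy_weights.items():
    for candidate, score in toy_similar[seed].items():
        # Weighted sum: X = 10*0.9, Y = 10*0.5 + 8*0.8, Z = 8*0.6
        combined[candidate] += weight * score

ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

`Y` ranks first even though it is not the closest match for either seed, because it is similar to both of them.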