# Project: Spotify Music Recommendation System

## Overview

In this project we use a <a href="https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db/">Kaggle dataset</a> to build a recommendation system that takes a track as input and suggests a number of similar tracks.

The system relies on a K Nearest Neighbors algorithm, and the project is written in Scala.

This notebook is available on <a href="https://github.com/vincrichard/Spotify_Recommation_System">GitHub</a>.

Steps:

* [Data import](#import)

* [Data preprocessing](#preprocess)<br>
In this part we remove anomalous rows: some track titles (apparently those containing quotation marks and commas) are split during parsing, which shifts the remaining values of the row into the wrong columns of our DataFrame. We also notice duplicated tracks, which are handled later. The duplicates exist because a track can belong to several genres. Such a track has one row per genre, and although its ID stays the same, its measurements can differ from row to row.

* [Exploring the metrics](#metric)<br>
This part explains the meaning of each metric available in the dataset. For each metric, a chart shows how the values are distributed across the dataset.
The columns are: genre, artist_name, track_name, track_id, popularity, acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence.

* [Data typing](#typage)<br>
Casting each column of the dataset to a type suited to its values.

* [Feature correlation](#corr)<br>
We study the correlation matrix of our features to make sure they are not too strongly correlated.

* [Feature selection](#selection)<br>
We decide not to use *duration_ms* and *popularity* in the rest of the model, because these values do not seem relevant for a recommendation system.

* [Data preparation](#prep)<br>
In this part we transform the columns of our data so that the model can use them.
  + [Encoding our text variables with StringIndexer](#stringindexer)<br>
  We convert four text features, genre, key, mode and time_signature, into numeric input.
  
  + [Using a One Hot Encoder](#onehot)<br>
  To keep the model from inferring an ordering between the categories, we pass the same features as for the StringIndexer through a `OneHotEncoder`, which turns them into vectors of 0s and 1s.
  
  + [Aggregating our data](#agg)<br>
  We need to solve the problem of duplicated tracks in the dataset. Since the genre of each track matters to us, we must remove the duplicates while keeping that information. To do so, we aggregate the rows of each track by taking the mean of each numeric measurement. For the vectors created by the OneHotEncoder, we use the max function so that no genre is lost.
  
  + [Vector Assembler](#vectorAssembler)<br>
  To gather all our features into a single column used as the model input.
  
  + [Normalizer](#normalizer)<br>
  We normalize each feature vector to unit norm so that the distances computed by the model are meaningful.
 
* [Nearest neighbors algorithm](#knn)<br>
We use Spark's MLlib library, more specifically the `BucketedRandomProjectionLSH` class, which approximates nearest-neighbor search for large datasets.

* [Results](#result)<br>

* [Using the trained model](#udf)<br>
Definition of a function that reuses the trained model, taking a *track_id* as input.


Limitations of the study:

There is no way to measure the accuracy of the model: one has to try it and listen to the suggested tracks to judge whether it works. Personally, I find the results relevant.

Possible improvement:

Write a function for the aggregation of tracks that span multiple rows, so that the whole data preparation fits into a single pipeline.


### Import

<div class="alert alert-warning" role="alert">
  Note: <a href="https://github.com/vegas-viz/Vegas">Vegas</a> (used for the charts) is set up here for a notebook running the Apache Toree kernel.
</div>



In [1]:
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._

In [2]:
%AddDeps org.vegas-viz vegas_2.11 0.3.10 --transitive

Marking org.vegas-viz:vegas_2.11:0.3.10 for download
Obtained 42 files


In [3]:
%AddDeps org.vegas-viz vegas-spark_2.11 0.3.10 --transitive

Marking org.vegas-viz:vegas-spark_2.11:0.3.10 for download
Obtained 44 files


In [4]:
implicit val render = vegas.render.ShowHTML(kernel.display.content("text/html", _))

render = <function1>


<function1>

In [5]:
import vegas._
import vegas.render.WindowRenderer._
import vegas.sparkExt._

<a id='import'></a>
## Data Import

In [11]:
val df: DataFrame = spark
      .read
      .option("header", true) // use the first line of the file(s) as the header
      .option("inferSchema", "true") // infer the type of each column (Int, String, etc.)
      .csv("./data/SpotifyFeatures.csv")
// df.printSchema()
println(s"Number of rows: ${df.count}")
println(s"Number of columns: ${df.columns.length}")

Number of rows: 232725
Number of columns: 18


In [12]:
%%dataFrame
df

genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39
Movie,Henri Salvador,Le petit souper aux chandelles,0Mf1jKa8eNAf1a4PwTbizj,0,0.749,0.578,160627,0.0948,0.0,C#,0.107,-14.97,Major,0.143,87.479,4/4,0.358
Movie,Martin & les fées,"Premières recherches (par Paul Ventimila, Lorie Pester, Véronique Jannot, Michèle Laroque & Gérard Lenorman)",0NUiKYRd6jt1LKMYGkUdnZ,2,0.344,0.703,212293,0.27,0.0,C#,0.105,-12.675,Major,0.953,82.873,4/4,0.533
Movie,Laura Mayne,Let Me Let Go,0PbIF9YVD505GutwotpB5C,15,0.939,0.416,240067,0.269,0.0,F#,0.113,-8.949,Major,0.0286,96.827,4/4,0.274
Movie,Chorus,Helka,0ST6uPfvaPpJLtQwhE6KfC,0,0.00104,0.734,226200,0.481,0.00086,C,0.0765,-7.725,Major,0.046,125.08,4/4,0.765
Movie,Le Club des Juniors,Les bisous des bisounours,0VSqZ3KStsjcfERGdcWpFO,10,0.319,0.598,152694,0.705,0.00125,G,0.349,-7.79,Major,0.0281,137.496,4/4,0.718


<a id='preprocess'></a>
## Data preprocessing

### Removing anomalies

The dataset is not perfect: in some rows the values are shifted by one column. In the rows shown below, the affected titles contain quotation marks and a comma, and the title ends up split across *track_name* and *track_id*.

In [13]:
%%dataframe --limit 5
df.filter(! col("key").isInCollection(Array("C", "G", "D", "C#", "A", "F", "B", "E", "A#", "F#", "G#", "D#")))

genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
Anime,Nanahira,"""""""Oshietekudasai","goshujinsama""""""",4xbsWTKyvlrOztwLqMreEc,13,0.115,0.561,229295,0.901,0.0,C#,0.0952,-2.016,Minor,0.0613,183.982,4/4
Blues,Queens of the Stone Age,"""""""You Got A Killer Scene There","Man...""""""",6ZZiYOTFuZC1XLJjMiEnvS,32,0.00688,0.55,296627,0.617,0.112,B,0.238,-5.662,Minor,0.0304,101.214,4/4
Movie,Sacha Tran,"""Quinze ans à peine - Extrait de """"Robin des Bois","Le Spectacle"""" [Live 2014]""",1wsJmPdJTAOFoFnVphXT19,10,0.19,0.46,228160,0.252,0.0,F#,0.712,-9.375,Major,0.0256,79.011,4/4
Movie,Bruce Broughton,"""3 Incongruities, """"Triptych"""": No. 3. Rhythmically","with a bounce""",6TlSPscwN6ZoNnsjMvceQm,0,0.972,0.349,689600,0.248,0.0965,D,0.0668,-12.303,Major,0.0565,168.103,4/4
Movie,Bruce Broughton,"""3 Incongruities, """"Triptych"""": No. 2. Slow","in a singing style""",4Lg8aGdZCjB8T7P66IHmPe,0,0.948,0.341,602493,0.197,0.0334,F#,0.136,-13.894,Minor,0.0403,119.071,4/4


In [14]:
val data : DataFrame = df
    .filter($"key".isin("C", "G", "D", "C#", "A", "F", "B", "E", "A#", "F#", "G#", "D#"))

data = [genre: string, artist_name: string ... 16 more fields]


[genre: string, artist_name: string ... 16 more fields]
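
To see why such rows can end up shifted, the sketch below contrasts a naive comma split with a splitter that honors RFC 4180 style quoting, where a quote inside a quoted field is written as two quotes. It is plain Scala, independent of Spark; `splitCsvLine` and the sample line are hypothetical and made up for illustration. Whether this is exactly what happens inside the Spark reader here is an assumption.

```scala
// Split one CSV line, treating "" inside a quoted field as a literal quote.
def splitCsvLine(line: String): Vector[String] = {
  val fields = Vector.newBuilder[String]
  val current = new StringBuilder
  var inQuotes = false
  var i = 0
  while (i < line.length) {
    val c = line.charAt(i)
    if (inQuotes) {
      if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
        current += '"'; i += 1           // doubled quote: one literal quote
      } else if (c == '"') {
        inQuotes = false                  // closing quote
      } else current += c
    } else c match {
      case '"' => inQuotes = true         // opening quote
      case ',' => fields += current.toString; current.clear()
      case _   => current += c
    }
    i += 1
  }
  fields += current.toString
  fields.result()
}

// Hypothetical line: the title contains both quotes and a comma.
val sample = "Blues,Some Artist,\"\"\"Hello, World\"\"\",abc123,32"

val naive = sample.split(",").toVector   // ignores quoting: 6 fields
val quoted = splitCsvLine(sample)        // honors quoting: 5 fields
```

With Spark's CSV reader, the `quote` and `escape` options control this behavior. Setting `escape` to `"` may avoid the shift at load time, although that has not been tried in this notebook.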

### Detecting duplicates

In [15]:
%%dataframe --limit 5
data.groupBy("track_id", "artist_name", "track_name").count()

track_id,artist_name,track_name,count
3AnPOKKZV1NRhED24p9YeX,Chorus,Sai Parameshwar Sai Karuneshwar,1
64Jyg9AzWl3AHdnkKPmY4T,Adrian Marcel,2AM.,4
4cIPBRZVBcsk7yiNYgAnqR,Madison Beer,Fools,3
1u2ht8xGYJb5Buizx4SanY,Chorus,Chal Chal Chal,1
6Xz4Pk66OuieSrpau2OdVX,Chorus,Jalwath Karala,1


In [16]:
%%dataframe
data.filter($"track_name" === "EX - Remix")

genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
Alternative,Kiana Ledé,EX - Remix,4LfoYkTuIPgJ2RlNkN5P5C,63,0.326,0.775,196133,0.537,0,C,0.0792,-6.978,Major,0.141,73.484,4/4,0.403
Dance,Kiana Ledé,EX - Remix,7uToQLjICwg3iPxFiCLBHB,0,0.271,0.768,196133,0.525,0,C,0.0791,-7.318,Major,0.173,73.462,4/4,0.399
R&B,Kiana Ledé,EX - Remix,4LfoYkTuIPgJ2RlNkN5P5C,59,0.324,0.77,196133,0.533,0,C,0.0871,-7.285,Major,0.159,73.477,4/4,0.388
Indie,Kiana Ledé,EX - Remix,4LfoYkTuIPgJ2RlNkN5P5C,59,0.324,0.77,196133,0.533,0,C,0.0871,-7.285,Major,0.159,73.477,4/4,0.388
Children’s Music,Kiana Ledé,EX - Remix,4LfoYkTuIPgJ2RlNkN5P5C,0,0.324,0.77,196133,0.533,0,C,0.0871,-7.285,Major,0.159,73.477,4/4,0.388
Pop,Kiana Ledé,EX - Remix,4LfoYkTuIPgJ2RlNkN5P5C,59,0.324,0.77,196133,0.533,0,C,0.0871,-7.285,Major,0.159,73.477,4/4,0.388


We can see that some songs appear several times in our data. Their metrics differ slightly from row to row, but the most important information is the genre, which we will want to group per song. We take care of this later.

<a id='metric'></a>
## Exploring the metrics

The variable descriptions are based on the <a href="https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/">Spotify documentation</a>.

### Key

>The estimated overall key of the track. Integers map to pitches using standard <a href="https://en.wikipedia.org/wiki/Pitch_class">Pitch Class notation </a>. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

In [17]:
%%dataframe
data.groupBy("key").count().sort($"count".desc)

key,count
C,27472
G,26293
D,23970
C#,23096
A,22593
F,20140
B,17627
E,17324
A#,15442
F#,15181


In [18]:
Vegas("Key")
    .withDataFrame(data.groupBy("key").count())
    .encodeX("key", Nom,  scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Keys are stored as note names rather than as the integers described in the Spotify documentation. Rows with any other value are the anomalies that were filtered out above.
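
The Spotify documentation quoted above describes keys as integers, while this dataset stores note names. A minimal sketch of the correspondence, assuming the standard pitch class order starting at C (the names `pitchClasses` and `keyToPitchClass` are hypothetical):

```scala
// Note names in pitch class order: index 0 = C, 1 = C#, 2 = D, ...
val pitchClasses = Vector("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

// Returns the pitch class integer, or -1 when the value is not a known key
// (mirroring the "no key detected" convention of the API).
def keyToPitchClass(key: String): Int = pitchClasses.indexOf(key)

val validKey = keyToPitchClass("F#")          // 6
val shiftedValue = keyToPitchClass("0.115")   // -1: not a note name
```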

### Mode Value

> Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

In [19]:
%%dataframe
data.groupBy("mode").count().sort($"count".desc)

mode,count
Major,150956
Minor,80748


The two values are "Major" and "Minor".

In [20]:
Vegas("Mode")
    .withDataFrame(data.groupBy("mode").count())
    .encodeX("mode", Nominal, scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

### Time Signature

> An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

In [21]:
%%dataframe
data.groupBy("time_signature").count().sort($"count".desc)

time_signature,count
4/4,200143
3/4,23806
5/4,5177
1/4,2570
0/4,8


In [22]:
Vegas("Time Signature")
    .withDataFrame(data.groupBy("time_signature").count())
    .encodeX("time_signature", Nominal,  scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

The vast majority of tracks have a *time_signature* of 4/4, so we can expect this variable to carry little explanatory power.

### Acousticness

>A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

In [23]:
Vegas("Acousticness")
    .withDataFrame(data.groupBy("acousticness").count())
    .encodeX("acousticness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

### Danceability

>Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

In [24]:
Vegas("Danceability")
    .withDataFrame(data.groupBy("danceability").count())
    .encodeX("danceability", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

### Energy

>Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. 

In [25]:
Vegas("Energy")
    .withDataFrame(data.groupBy("energy").count())
    .encodeX("energy", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

### Instrumentalness

>Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. 

In [26]:
Vegas("Instrumentalness")
    .withDataFrame(data.groupBy("instrumentalness").count())
    .encodeX("instrumentalness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

The vast majority of tracks therefore contain vocals.

### Liveness

>Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

In [27]:
Vegas("Liveness")
    .withDataFrame(data.groupBy("liveness").count())
    .encodeX("liveness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

### Loudness

>The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.

In [28]:
Vegas("Loudness")
    .withDataFrame(data.groupBy("loudness").count())
    .encodeX("loudness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Some loudness values may be greater than 0, but this does not look like an error in the data.

### Speechiness

>Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

In [29]:
Vegas("Speechiness")
    .withDataFrame(data.groupBy("speechiness").count())
    .encodeX("speechiness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

### Valence

>A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [30]:
Vegas("Valence")
    .withDataFrame(data.groupBy("valence").count())
    .encodeX("valence", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

### Tempo

>The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

In [31]:
Vegas("Tempo")
    .withDataFrame(data.groupBy("tempo").count())
    .encodeX("tempo", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

<a id='typage'></a>
## Data typing

To choose the type of each variable we rely on the <a href="https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/">Spotify documentation</a>.

In [32]:
val dfCasted: DataFrame = data
    .withColumn("duration_ms", $"duration_ms".cast("Int"))
    .withColumn("acousticness", $"acousticness".cast("Float"))
    .withColumn("danceability", $"danceability".cast("Float"))
    .withColumn("energy", $"energy".cast("Float"))
    .withColumn("instrumentalness", $"instrumentalness".cast("Float"))
    .withColumn("liveness", $"liveness".cast("Float"))
    .withColumn("loudness", $"loudness".cast("Float"))
    .withColumn("speechiness", $"speechiness".cast("Float"))
    .withColumn("valence", $"valence".cast("Float"))
    .withColumn("tempo", $"tempo".cast("Float"))

dfCasted = [genre: string, artist_name: string ... 16 more fields]


[genre: string, artist_name: string ... 16 more fields]

<a id='corr'></a>
## Feature correlation

In [33]:
import org.apache.spark.ml.feature.VectorAssembler
val colCorr= Array("acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
val assembler = new VectorAssembler()
    .setInputCols(colCorr)
    .setOutputCol("features")

colCorr = Array(acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence)
assembler = vecAssembler_7299a10e8ef4


vecAssembler_7299a10e8ef4

In [34]:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val Row(coeff1: Matrix) = Correlation.corr(assembler.transform(dfCasted), "features").head
val colNamePairs = colCorr.flatMap(name_from => colCorr.map(name_to => (name_from, name_to)))
val triplesList = colNamePairs.zip(coeff1.toArray)
  .filterNot{case((name_from, name_to), corr) => name_from >= name_to}
  .map{case((name_from, name_to), corr) => (name_from, name_to, corr)}
val corrDf = sc.parallelize(triplesList).toDF("name_from", "name_to", "corr")

corrDf.sort($"corr".desc).show(5)
corrDf.sort($"corr").show(5)

+------------+-----------+-------------------+
|   name_from|    name_to|               corr|
+------------+-----------+-------------------+
|      energy|   loudness| 0.8146067202887777|
|danceability|    valence|   0.54415832574648|
|    liveness|speechiness| 0.5112653589253985|
|danceability|   loudness|0.43403079178908577|
|      energy|    valence|0.43300746664482853|
+------------+-----------+-------------------+
only showing top 5 rows

+----------------+----------------+--------------------+
|       name_from|         name_to|                corr|
+----------------+----------------+--------------------+
|    acousticness|          energy| -0.7229341642500959|
|    acousticness|        loudness| -0.6879526714061998|
|instrumentalness|        loudness|  -0.510346072773586|
|          energy|instrumentalness| -0.3813198112973624|
|    danceability|instrumentalness|-0.36677073817119393|
+----------------+----------------+--------------------+
only showing top 5 rows



coeff1 = 
colNamePairs = Array((acousticness,acousticness), (acousticness,danceability), (acousticness,duration_ms), (acousticness...


<console>:6: error: Symbol 'type scala.AnyRef' is missing from the classpath.
This symbol is required by 'class org.apache.spark.ml.linalg.SparseVector'.
Make sure that type AnyRef is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'SparseVector.class' was compiled against an incompatible version of scala.
  lazy val $print: String =  {
           ^
1.0                   -0.3587350256039598   ... (10 total)
-0.3587350256039598   1.0                   ...
0.00891152428395838   -0.12412821897443163  ...
-0.7229341642500959   0.3197446219673942    ...
0.31821445468419374   -0.36677073817119393  ...
0.06901616607205185   -0.04189692407309058  ...
-0.6879526714061998   0.43403079178908577   ...
0.15413690123693932   0.13313415465148323   ...
-0.23692404267228767  0.018890904611838646  ...
-0.32119127814732445  0.54415832574648      ...


Array((acousticness,acousticness), (acousticness,danceability), (acousticness,duration_ms), (acousticness...

<a id='selection'></a>
## Feature selection

In [35]:
dfCasted.printSchema

root
 |-- genre: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- acousticness: float (nullable = true)
 |-- danceability: float (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- energy: float (nullable = true)
 |-- instrumentalness: float (nullable = true)
 |-- key: string (nullable = true)
 |-- liveness: float (nullable = true)
 |-- loudness: float (nullable = true)
 |-- mode: string (nullable = true)
 |-- speechiness: float (nullable = true)
 |-- tempo: float (nullable = true)
 |-- time_signature: string (nullable = true)
 |-- valence: float (nullable = true)



Since no pair of variables is too strongly correlated (every coefficient stays below 0.95 in absolute value, the largest being about 0.81 between *energy* and *loudness*), we can build a model with all the variables above.

Nevertheless, *duration_ms* and *popularity* do not seem useful for a recommendation system based on the similarity between tracks.
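
The coefficients discussed here are Pearson correlations, the default method of Spark's `Correlation.corr`. As a reminder of what is being measured, here is a minimal plain Scala sketch on made-up values. `pearson`, the toy series and the way the 0.95 threshold is applied are illustrative assumptions.

```scala
// Pearson correlation between two equally sized series.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "series must have the same non-zero length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// Hypothetical toy values, not taken from the dataset.
val energy   = Seq(0.2, 0.4, 0.6, 0.8)
val loudness = Seq(-20.0, -15.0, -10.0, -5.0)   // rises linearly with energy
val acoustic = Seq(0.9, 0.7, 0.5, 0.3)          // falls linearly with energy

// A pair would be flagged as redundant when |corr| reaches the threshold.
val threshold = 0.95
val tooCorrelated = math.abs(pearson(energy, loudness)) >= threshold
```

In this toy data the series are perfectly linear, so the pair is flagged. In the notebook, the strongest pair stays below the threshold and every feature is kept.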

<a id='prep'></a>
## Data preparation

To use all the information at our disposal, we need to transform our data so that it can be consumed by a KNN algorithm.

This concerns in particular the variables *genre*, *mode*, *key* and *time_signature*, which are stored as strings.

<a id='stringindexer'></a>
### Encoding our text variables with StringIndexer

The `StringIndexer` encodes our labels as indices.
We have four features to convert: *genre*, *key*, *mode* and *time_signature*.

In [36]:
import org.apache.spark.ml.feature.StringIndexer
val genreIndexer = new StringIndexer()
    .setInputCol("genre")
    .setOutputCol("genreIndex")

val modeIndexer = new StringIndexer()
    .setInputCol("mode")
    .setOutputCol("modeIndex")

val keyIndexer = new StringIndexer()
    .setInputCol("key")
    .setOutputCol("keyIndex")

val tsIndexer = new StringIndexer()
    .setInputCol("time_signature")
    .setOutputCol("tsIndex")

genreIndexer = strIdx_570b80f3117e
modeIndexer = strIdx_835aecdd9ca1
keyIndexer = strIdx_fead7eea0095
tsIndexer = strIdx_a8c908af05d1


strIdx_a8c908af05d1
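
By default, Spark's `StringIndexer` assigns index 0 to the most frequent label, 1 to the next most frequent, and so on. The plain Scala sketch below mimics that ordering on a hypothetical list of genres. `indexByFrequency` is an illustrative helper, not part of Spark, and breaking ties alphabetically is an assumption.

```scala
// Map each label to an index, most frequent label first.
def indexByFrequency(labels: Seq[String]): Map[String, Double] =
  labels
    .groupBy(identity)
    .toSeq
    .map { case (label, occurrences) => (label, occurrences.size) }
    .sortBy { case (label, count) => (-count, label) }   // count descending, then name
    .zipWithIndex
    .map { case ((label, _), index) => label -> index.toDouble }
    .toMap

// Hypothetical labels: Pop appears 3 times, Rap twice, Jazz once.
val genres = Seq("Pop", "Rap", "Pop", "Jazz", "Pop", "Rap")
val genreIndex = indexByFrequency(genres)
```

The indices are only identifiers: the fact that one genre gets 0.0 and another 2.0 says nothing about how similar they are, which is why a one-hot encoding follows.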

<a id='onehot'></a>
### Using a One Hot Encoder

To keep our model from inferring an ordering between the categories, we pass them through a `OneHotEncoder`, which turns them into vectors of 0s and 1s.

In [37]:
import org.apache.spark.ml.feature.OneHotEncoder
val genreEncoder = new OneHotEncoder()
  .setInputCol("genreIndex")
  .setOutputCol("genreVec")

val modeEncoder = new OneHotEncoder()
  .setInputCol("modeIndex")
  .setOutputCol("modeVec")

val keyEncoder = new OneHotEncoder()
  .setInputCol("keyIndex")
  .setOutputCol("keyVec")

val tsEncoder = new OneHotEncoder()
  .setInputCol("tsIndex")
  .setOutputCol("tsVec")

genreEncoder = oneHot_572b19e0a259
modeEncoder = oneHot_c85f479e5a71
keyEncoder = oneHot_6a21b4c75cb2
tsEncoder = oneHot_f3daec6645ec




oneHot_f3daec6645ec
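
The encoded vectors can be pictured as follows. This is a dense, plain Scala sketch: Spark stores the result as a sparse vector and, with its default `dropLast` setting, leaves out the last category. `oneHot` is a hypothetical helper written for illustration.

```scala
// One-hot encode a category index into a dense vector.
// With dropLast = true the last category is encoded as all zeros.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
  require(index >= 0 && index < numCategories, "index out of range")
  val size = if (dropLast) numCategories - 1 else numCategories
  Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}

val first = oneHot(0, 4)                     // Vector(1.0, 0.0, 0.0)
val last  = oneHot(3, 4)                     // Vector(0.0, 0.0, 0.0): the dropped category
val full  = oneHot(3, 4, dropLast = false)   // Vector(0.0, 0.0, 0.0, 1.0)
```

One consequence of dropping the last category is that it becomes indistinguishable from "no category" once several vectors are combined.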

### Building our Pipeline

In [38]:
import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipeline = new Pipeline()
    .setStages(Array(genreIndexer, modeIndexer, keyIndexer, tsIndexer,
                     genreEncoder, modeEncoder, keyEncoder, tsEncoder))

val dfEncode = pipeline.fit(dfCasted).transform(dfCasted)

dfEncode.select("genre","genreIndex", "genreVec").show(1)

+-----+----------+---------------+
|genre|genreIndex|       genreVec|
+-----+----------+---------------+
|Movie|      23.0|(26,[23],[1.0])|
+-----+----------+---------------+
only showing top 1 row



pipeline = pipeline_7317baebf3c8
dfEncode = [genre: string, artist_name: string ... 24 more fields]


[genre: string, artist_name: string ... 24 more fields]

<a id='agg'></a>
### Aggregating our data

The problem of duplicates still has to be solved.

For the values produced by the One Hot Encoder we use the `Summarizer` class and its `max` method. Since the vectors contain only 1s and 0s, the element-wise maximum keeps every *genre* of a track, while *mode*, *key* and *time_signature* are left unchanged (they do not vary with the genre).

For the other metrics, we choose to take the mean of the values over the duplicated rows, since they can vary slightly.

In [39]:
import org.apache.spark.ml.stat.Summarizer

val dfDistinct = dfEncode
    .groupBy("track_id", "artist_name", "track_name")
    .agg(mean("acousticness").alias("acousticness"), 
         mean("danceability").alias("danceability"), 
         mean("energy").alias("energy"), 
         mean("instrumentalness").alias("instrumentalness"), 
         mean("liveness").alias("liveness"), 
         mean("loudness").alias("loudness"), 
         mean("speechiness").alias("speechiness"), 
         mean("tempo").alias("tempo"), 
         mean("valence").alias("valence"),
         Summarizer.max($"modeVec").alias("modeVec"),
         Summarizer.max($"keyVec").alias("keyVec"), 
         Summarizer.max($"tsVec").alias("tsVec"),
         Summarizer.max($"genreVec").alias("genreVec"))

dfDistinct = [track_id: string, artist_name: string ... 14 more fields]


[track_id: string, artist_name: string ... 14 more fields]
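
The aggregation rule can be illustrated without Spark. In the sketch below, `TrackRow`, `Track`, `aggregate` and the sample rows are hypothetical; only one numeric metric is shown for brevity.

```scala
// One row per (track, genre) pair, as in the dataset after encoding.
case class TrackRow(trackId: String, energy: Double, genreVec: Vector[Double])
case class Track(trackId: String, energy: Double, genreVec: Vector[Double])

// Merge the rows of each track: mean of the numeric value,
// element-wise max of the one-hot genre vectors.
def aggregate(rows: Seq[TrackRow]): Map[String, Track] =
  rows.groupBy(_.trackId).map { case (id, group) =>
    val meanEnergy = group.map(_.energy).sum / group.size
    val genres = group.map(_.genreVec).reduce { (a, b) =>
      a.zip(b).map { case (x, y) => math.max(x, y) }
    }
    id -> Track(id, meanEnergy, genres)
  }

// Hypothetical rows: the same track listed under two genres.
val rows = Seq(
  TrackRow("t1", 0.50, Vector(1.0, 0.0, 0.0)),
  TrackRow("t1", 0.60, Vector(0.0, 0.0, 1.0)),
  TrackRow("t2", 0.30, Vector(0.0, 1.0, 0.0))
)
val merged = aggregate(rows)
// merged("t1").genreVec is Vector(1.0, 0.0, 1.0): both genres are kept.
```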

<a id='vectorAssembler'></a>
### VectorAssembler

To gather all our features into a single column, we use a `VectorAssembler`.

In [40]:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
    .setInputCols(Array("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "modeVec", "keyVec", "tsVec", "genreVec"))
    .setOutputCol("features")

assembler = vecAssembler_1d98a2e834fc


vecAssembler_1d98a2e834fc

<a id='normalizer'></a>
### Normalizer

Since we are running a K Nearest Neighbors algorithm, we need to normalize our data, otherwise the distances we compute would not be meaningful. Spark's `Normalizer` rescales each feature vector to unit norm.

In [41]:
import org.apache.spark.ml.feature.Normalizer
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")

normalizer = normalizer_04ba2c3df6c8


normalizer_04ba2c3df6c8
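
Spark's `Normalizer` works row by row: with its default setting it divides each feature vector by its L2 norm. The sketch below reproduces that operation in plain Scala on a made-up vector (`l2Normalize` is a hypothetical helper).

```scala
// Rescale a vector to unit L2 norm; a zero vector is returned unchanged.
def l2Normalize(v: Vector[Double]): Vector[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  if (norm == 0.0) v else v.map(_ / norm)
}

// Hypothetical feature vector: a small ratio, a tempo in BPM, a one-hot flag.
val features = Vector(0.5, 120.0, 1.0)
val normalized = l2Normalize(features)
val norm = math.sqrt(normalized.map(x => x * x).sum)   // 1.0 up to rounding
```

Note that a large-valued feature such as the tempo still dominates the vector after this step. Putting every feature on the same scale would be the job of a column-wise scaler such as Spark's `StandardScaler` or `MinMaxScaler`.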

In [42]:
val assemblerPipeline = new Pipeline()
    .setStages(Array(assembler, normalizer))

assemblerPipeline = pipeline_1d81a2114aaa


pipeline_1d81a2114aaa

In [43]:
val dfClean = assemblerPipeline.fit(dfDistinct).transform(dfDistinct)

dfClean = [track_id: string, artist_name: string ... 16 more fields]


[track_id: string, artist_name: string ... 16 more fields]

<a id='knn'></a>
## Nearest Neighbors algorithm 

I use an LSH algorithm, more precisely `BucketedRandomProjectionLSH`, to compute my buckets of nearest neighbors.

In [44]:
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

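val brp = new BucketedRandomProjectionLSH()
    .setBucketLength(5.0)    // width of each hash bucket
    .setNumHashTables(3)     // number of independent hash tables
    .setInputCol("features") // note: the raw "features" column is hashed here, not "normFeatures"
    .setOutputCol("hashes")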

brp = brp-lsh_61e7b1b4d610


brp-lsh_61e7b1b4d610
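
The idea behind bucketed random projection can be sketched in a few lines: each hash table projects a vector onto a random unit direction and keeps the index of the bucket the projection falls into, so that vectors close in Euclidean distance tend to share buckets. The code below is a simplified plain Scala illustration, not Spark's implementation. `randomUnitVector`, `bucketHash`, the seed and the sample vectors are hypothetical.

```scala
import scala.util.Random

// Draw a random direction and scale it to unit length.
def randomUnitVector(dim: Int, rng: Random): Vector[Double] = {
  val raw = Vector.fill(dim)(rng.nextGaussian())
  val norm = math.sqrt(raw.map(x => x * x).sum)
  raw.map(_ / norm)
}

// Bucket index of x along one direction: floor((x . direction) / bucketLength).
def bucketHash(x: Vector[Double], direction: Vector[Double], bucketLength: Double): Double =
  math.floor(x.zip(direction).map { case (a, b) => a * b }.sum / bucketLength)

val rng = new Random(42)   // fixed seed so the sketch is repeatable
val directions = Vector.fill(3)(randomUnitVector(2, rng))   // one direction per hash table

val a = Vector(1.0, 1.0)
val b = Vector(1.1, 0.9)   // a close neighbor of a
val hashesA = directions.map(d => bucketHash(a, d, 5.0))
val hashesB = directions.map(d => bucketHash(b, d, 5.0))
```

A larger bucket length puts more vectors in the same bucket, and more hash tables give two neighbors more chances to collide in at least one of them, at the cost of extra computation.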

## Training the model

Our pipeline is complete; we now split our data to train and then challenge our model. <br>
We simply pick the artists and tracks we want to test. <br>
I personally chose the artist Tyler, The Creator. <br>
We also decide that our recommendations should not return tracks by the same artist.


In [45]:
val tylerData = dfClean.filter($"artist_name" === "Tyler, The Creator")
val training = dfClean.filter($"artist_name" =!= "Tyler, The Creator")

tylerData = [track_id: string, artist_name: string ... 16 more fields]
training = [track_id: string, artist_name: string ... 16 more fields]


[track_id: string, artist_name: string ... 16 more fields]

In [46]:
val model = brp.fit(training)

model = brp-lsh_61e7b1b4d610


brp-lsh_61e7b1b4d610

<a id=result></a>
## Results

In [47]:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
val key = tylerData.filter($"track_name" === "See You Again").select("features").rdd.map { case Row(v: Vector) => v}.first

key = (51,[0,1,2,3,4,5,6,7,8,9,19,21,29,33,36],[0.3709999918937683,0.5580000281333923,0.5590000152587891,7.48999991628807E-6,0.10899999737739563,-9.222000122070312,0.09589999914169312,78.55799865722656,0.6200000047683716,1.0,1.0,1.0,1.0,1.0,1.0])


(51,[0,1,2,3,4,5,6,7,8,9,19,21,29,33,36],[0.3709999918937683,0.5580000281333923,0.5590000152587891,7.48999991628807E-6,0.10899999737739563,-9.222000122070312,0.09589999914169312,78.55799865722656,0.6200000047683716,1.0,1.0,1.0,1.0,1.0,1.0])

In [48]:
val resultTyler = model.approxNearestNeighbors(training, key, 4)

resultTyler = [track_id: string, artist_name: string ... 18 more fields]


[track_id: string, artist_name: string ... 18 more fields]

In [49]:
%%dataframe
resultTyler.select("artist_name", "track_name")

artist_name,track_name
Chance the Rapper,How Great (feat. Jay Electronica & My cousin Nicole)
Drake,6 Man
Flatbush Zombies,Facts (feat. Jadakiss)
J. Cole,everybody dies


## Analysis of the results

In [50]:
%%dataframe
tylerData.filter($"track_name" === "See You Again").select("track_id","artist_name", "track_name", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "valence")

track_id,artist_name,track_name,acousticness,danceability,energy,instrumentalness,liveness,loudness,valence
7KA4W4McWYRpgf0fWsJZWB,"Tyler, The Creator",See You Again,0.3709999918937683,0.5580000281333923,0.5590000152587891,7.48999991628807e-06,0.1089999973773956,-9.222000122070312,0.6200000047683716


In [51]:
resultTyler
    .select("track_id", "artist_name", "track_name", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "valence").show(false)

+----------------------+-----------------+----------------------------------------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+-------------------+
|track_id              |artist_name      |track_name                                          |acousticness        |danceability      |energy            |instrumentalness    |liveness           |loudness          |valence            |
+----------------------+-----------------+----------------------------------------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+-------------------+
|0OT0cCKbSmSMRvyWeqEFBq|Chance the Rapper|How Great (feat. Jay Electronica & My cousin Nicole)|0.47699999809265137 |0.4390000104904175|0.4830000102519989|0.0                 |0.5529999732971191 |-9.449000358581543|0.2919999957084656 |
|4kdfjhj9xNkYU0R8xlDy8k|Drake            |6 Man             

In [52]:
df.filter($"track_name" === "See You Again" && $"artist_name" === "Tyler, The Creator").select("genre").show()

+-------+
|  genre|
+-------+
|Hip-Hop|
|    Rap|
|    Pop|
+-------+



In [53]:
df.filter($"track_id".isin("0OT0cCKbSmSMRvyWeqEFBq", "4kdfjhj9xNkYU0R8xlDy8k", "1PNfhBdmFikFn4vkrwiq05", "1wIQtB3UQ1TfjNMZZqO6eh"))
          .select("genre", "track_name", "artist_name").show()

+-------+--------------------+-----------------+
|  genre|          track_name|      artist_name|
+-------+--------------------+-----------------+
|Hip-Hop|      everybody dies|          J. Cole|
|Hip-Hop|               6 Man|            Drake|
|Hip-Hop|How Great (feat. ...|Chance the Rapper|
|Hip-Hop|Facts (feat. Jada...| Flatbush Zombies|
|    Pop|               6 Man|            Drake|
|    Pop|      everybody dies|          J. Cole|
|    Rap|      everybody dies|          J. Cole|
|    Rap|               6 Man|            Drake|
|    Rap|How Great (feat. ...|Chance the Rapper|
|    Rap|Facts (feat. Jada...| Flatbush Zombies|
|    Pop|How Great (feat. ...|Chance the Rapper|
+-------+--------------------+-----------------+



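Given the tracks it proposes, the algorithm appears to have worked: all four recommendations are listed under Hip-Hop and Rap, like the query track, and three of them also appear under Pop.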
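
One rough way to put a number on this genre comparison is a Jaccard overlap between the genres of the query track and those of each recommendation. The sketch below is plain Scala and copies the genres from the two outputs above. The `jaccard` helper and the two scores are hypothetical illustrations, not part of the original notebook.

```scala
// Genres of the query track "See You Again", as listed in the output above.
val queryGenres = Set("Hip-Hop", "Rap", "Pop")

// Genres of each recommended track, copied from the genre table above.
val recommendedGenres = Map(
  "How Great (feat. Jay Electronica & My cousin Nicole)" -> Set("Hip-Hop", "Rap", "Pop"),
  "6 Man"                  -> Set("Hip-Hop", "Rap", "Pop"),
  "Facts (feat. Jadakiss)" -> Set("Hip-Hop", "Rap"),
  "everybody dies"         -> Set("Hip-Hop", "Rap", "Pop")
)

// Jaccard similarity: size of the intersection divided by size of the union.
def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty && b.isEmpty) 1.0
  else a.intersect(b).size.toDouble / a.union(b).size

// Hypothetical score: genre overlap between the query and each recommendation.
val genreOverlap: Map[String, Double] =
  recommendedGenres.map { case (track, genres) => track -> jaccard(queryGenres, genres) }

// Share of recommendations that have at least one genre in common with the query.
val shareWithCommonGenre: Double =
  recommendedGenres.values.count(g => g.intersect(queryGenres).nonEmpty).toDouble /
    recommendedGenres.size
```

If genre is itself encoded in the feature vector, this overlap is not an independent check of the model and should be read only as a consistency check.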

<a id=udf></a>
## Using the trained model

I have no metric for judging the quality of my analysis. I therefore provide a function that lets any user of the notebook easily look up recommendations for a track, identified by its `track_id`.

In [54]:
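import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Returns the 4 approximate nearest neighbours of the track identified by track_id.
def recommend(track_id: String, brp: BucketedRandomProjectionLSH, dfClean: DataFrame): Dataset[_] = {
    // Candidate set: every track except the query itself, so it cannot be recommended back
    val training = dfClean.filter($"track_id" =!= track_id)
    // Feature vector of the query track (throws if no row matches track_id)
    val key = dfClean.filter($"track_id" === track_id).select("features").rdd.map { case Row(v: Vector) => v }.first
    val model = brp.fit(training)
    model.approxNearestNeighbors(training, key, 4)
}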


recommend: (track_id: String, brp: org.apache.spark.ml.feature.BucketedRandomProjectionLSH, dfClean: org.apache.spark.sql.DataFrame)org.apache.spark.sql.Dataset[_]


In [55]:
val result =  recommend("7KA4W4McWYRpgf0fWsJZWB", brp, dfClean)

result = [track_id: string, artist_name: string ... 18 more fields]



In [56]:
%%dataframe
result.select("artist_name", "track_name")

artist_name,track_name
Chance the Rapper,How Great (feat. Jay Electronica & My cousin Nicole)
Drake,6 Man
Flatbush Zombies,Facts (feat. Jadakiss)
J. Cole,everybody dies
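
In the absence of a quality metric, one possible sanity check is to compare the approximate neighbours with an exact search on a small sample. The sketch below does a brute-force search in plain Scala, assuming Euclidean distance (the distance Bucketed Random Projection LSH is designed for). The `ToyTrack` case class, the helpers `euclidean`, `exactNearest` and `recallAtK`, and the toy vectors are made up for illustration and are not the notebook's data.

```scala
// Hypothetical minimal track representation: an id and a dense feature vector.
case class ToyTrack(id: String, features: Seq[Double])

// Euclidean distance between two vectors of equal length.
def euclidean(a: Seq[Double], b: Seq[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}

// Exact k nearest neighbours: compute every distance, sort, keep the k smallest.
def exactNearest(candidates: Seq[ToyTrack], query: Seq[Double], k: Int): Seq[(ToyTrack, Double)] =
  candidates.map(t => (t, euclidean(t.features, query))).sortBy(_._2).take(k)

// Fraction of the exact neighbours that an approximate result recovered.
def recallAtK(approxIds: Seq[String], exactIds: Seq[String]): Double =
  if (exactIds.isEmpty) 1.0
  else approxIds.toSet.intersect(exactIds.toSet).size.toDouble / exactIds.size

// Toy data (made-up values, three features per track).
val toyCandidates = Seq(
  ToyTrack("a", Seq(0.40, 0.50, 0.60)),
  ToyTrack("b", Seq(0.90, 0.10, 0.20)),
  ToyTrack("c", Seq(0.35, 0.55, 0.55)),
  ToyTrack("d", Seq(0.00, 1.00, 0.00))
)
val toyQuery = Seq(0.37, 0.56, 0.56)

// The two exact nearest neighbours of the toy query, closest first.
val exactTop2 = exactNearest(toyCandidates, toyQuery, 2)
```

On the real data, the ids returned by `recommend` could be passed to `recallAtK` together with the ids from an exact search over the same candidates. A recall close to 1.0 would suggest the LSH approximation is not discarding true neighbours.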
