# Introduction

Have you ever wondered how a service can recommend music based on your taste? **Similarity** is the answer.
Similarity measures how alike two objects are, for example in shape, in value, or in the distance between them.
We can therefore use similarity to find songs that resemble the ones a user has already listened to, and recommend them.

Dataset: [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification) - An attempt to build a classifier that can predict whether or not I like a song.

**Disclaimer**: This is a simple case study of similarity. There are many state-of-the-art algorithms for song recommendation. Still, this notebook can serve as a first step in studying the topic, and as a baseline algorithm for your own experiments.

## Pre Definitions

Import packages and define helper functions.

In [None]:
# Installing youtube tool
!pip install youtube-search-python=='1.3.1'

In [None]:
# Import the required packages
import numpy as np # linear algebra
import pandas as pd # data processing

import json
from youtubesearchpython import SearchVideos # YouTube search tool

In [None]:
# Build the "artist - title" search string for a song
def getMusicName(elem):
    return '{} - {}'.format(elem['artist'], elem['song_title'])


# Search YouTube and return the first result for a song
def youtubeSearchVideo(music, results=1):
    searchJson = SearchVideos(music, offset=1, mode="json", max_results=results).result()
    searchParsed = json.loads(searchJson)
    searchParsed = searchParsed['search_result'][0]  # keep only the first hit
    return {'title': searchParsed['title'],
            'duration': searchParsed['duration'],
            'views': searchParsed['views'],
            'url': searchParsed['link']}
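
The search helper indexes `['search_result'][0]` directly, which would raise an error if a search came back with no results. Below is a minimal sketch of a more defensive parsing step. The name `parseFirstResult` is hypothetical, the JSON strings are hand-written stand-ins rather than real search responses, and the assumption that an empty search yields an empty `search_result` list is not confirmed by the library.

```python
import json

def parseFirstResult(searchJson):
    """Hypothetical helper: return the first search hit as a dict, or None."""
    results = json.loads(searchJson).get('search_result', [])
    if not results:
        return None  # nothing found, let the caller decide what to do
    first = results[0]
    return {'title': first.get('title'),
            'duration': first.get('duration'),
            'views': first.get('views'),
            'url': first.get('link')}

# Hand-written JSON strings standing in for real search responses (assumption)
sample = json.dumps({'search_result': [
    {'title': 'Some Song', 'duration': '3:45', 'views': 1000,
     'link': 'https://example.com/watch'}]})
empty = json.dumps({'search_result': []})

hit = parseFirstResult(sample)
miss = parseFirstResult(empty)
print(hit)
print(miss)
```

With a guard like this, the printing loops later in the notebook could skip songs that have no video instead of stopping.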

## Loading data

How many songs do we have?

In [None]:
# Load dataset
dfSongs = pd.read_csv('/kaggle/input/spotifyclassification/data.csv', index_col=0)

# Number of rows and columns
rows, cols = dfSongs.shape
print('Number of songs: {}'.format(rows))
print('Number of attributes per song: {}'.format(cols))

What are the song attributes?

In [None]:
# Print the columns
display(dfSongs.columns)

In [None]:
# Print the attributes type
dfSongs.info()

Printing the first rows.

In [None]:
dfSongs[['song_title', 'artist']].head(5)

As an example, let's search YouTube for a video of the song at index 0.

In [None]:
# Select a song
anySong = dfSongs.loc[0]
# Get the song name
anySongName = getMusicName(anySong)
print('name:', anySongName)

# Search in YouTube
youtubeSearchVideo(anySongName)

# Similarity Queries

We created queries that retrieve the most similar elements based on [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).
"In mathematics, the Euclidean distance between two points is a number, the length of a line segment between the two points."
In this sense, the closer the distance is to 0, the more similar the songs are.
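
For two attribute vectors $p$ and $q$, the Euclidean distance can be written as $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$. The sketch below computes it for two made-up songs, first from the formula and then with `np.linalg.norm`. The attribute values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np

# Two hypothetical songs described by (danceability, energy, valence)
song_a = np.array([0.80, 0.60, 0.30])
song_b = np.array([0.50, 0.20, 0.30])

# Euclidean distance written out: square root of the summed squared differences
dist_manual = np.sqrt(np.sum((song_a - song_b) ** 2))

# The same value via NumPy's vector norm
dist_norm = np.linalg.norm(song_a - song_b)

print(dist_manual, dist_norm)
```

Both values come out at approximately 0.5, since the attribute differences are 0.3, 0.4 and 0.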

## k-nearest neighbors algorithm (k-NN)

The [k-NN algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) searches for the $k$ elements most similar to a query point. A related search returns every element that lies within a predefined radius (a threshold distance) of the query point. Thus, we have two kinds of similarity query:

* $k$ query: return $k$ closest songs.
* Range query: return all songs with 'distance' $\leq$ 'threshold'.

In [None]:
# K-query
def knnQuery(queryPoint, arrCharactPoints, k):
    tmp = arrCharactPoints.copy(deep=True)
    tmp['dist'] = tmp.apply(lambda x: np.linalg.norm(x-queryPoint), axis=1)
    tmp = tmp.sort_values('dist')
    return tmp.head(k).index

# Range query
def rangeQuery(queryPoint, arrCharactPoints, radius):
    tmp = arrCharactPoints.copy(deep=True)
    tmp['dist'] = tmp.apply(lambda x: np.linalg.norm(x-queryPoint), axis=1)
    tmp['radius'] = tmp.apply(lambda x: 1 if x['dist'] <= radius else 0, axis=1)
    return tmp.query('radius == 1').index
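
The two functions above compute one distance per row with `apply`, which can be slow on large tables. As a sketch of an alternative, the same distances can be computed for all rows at once with NumPy broadcasting. The names `knnQueryVectorized` and `rangeQueryVectorized` and the toy data are assumptions for illustration, and the order of exactly tied distances may differ from the original functions.

```python
import numpy as np
import pandas as pd

def _distances(queryPoint, arrCharactPoints):
    # Subtract the query point from every row at once, then take row-wise norms
    diffs = arrCharactPoints.to_numpy(dtype=float) - np.asarray(queryPoint, dtype=float)
    return pd.Series(np.linalg.norm(diffs, axis=1), index=arrCharactPoints.index)

def knnQueryVectorized(queryPoint, arrCharactPoints, k):
    # Index labels of the k rows with the smallest distance
    return _distances(queryPoint, arrCharactPoints).sort_values().head(k).index

def rangeQueryVectorized(queryPoint, arrCharactPoints, radius):
    # Index labels of all rows within the given radius
    dist = _distances(queryPoint, arrCharactPoints)
    return dist[dist <= radius].index

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    {'danceability': [0.9, 0.1, 0.8, 0.5], 'energy': [0.9, 0.2, 0.7, 0.5]},
    index=[10, 11, 12, 13],
)
query = [0.7, 0.6]

nearest = knnQueryVectorized(query, toy, k=2)
within = rangeQueryVectorized(query, toy, radius=0.2)
print(list(nearest), list(within))
```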

In [None]:
# Run a query function, excluding the query point itself from the candidates
def querySimilars(df, columns, idx, func, param):
    arr = df[columns].copy(deep=True)
    queryPoint = arr.loc[idx]
    arr = arr.drop([idx])
    response = func(queryPoint, arr, param)
    return response
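
`querySimilars` drops the query point before searching because a song is always at distance 0 from itself and would otherwise rank first in every result. The sketch below shows this on a toy table. The values and index labels are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    {'danceability': [0.9, 0.1, 0.8, 0.5], 'energy': [0.9, 0.2, 0.7, 0.5]},
    index=[10, 11, 12, 13],
)

idx = 12
queryPoint = toy.loc[idx]

# Euclidean distance from every row to the query row (the Series aligns on columns)
dist_all = (toy - queryPoint).pow(2).sum(axis=1).pow(0.5)

# Without dropping it, the query point is its own nearest neighbour
closest_with_self = dist_all.idxmin()

# After dropping it, the nearest neighbour is a different song
closest_without_self = dist_all.drop(idx).idxmin()

print(closest_with_self, closest_without_self)
```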

### $k$ query

Trying a query using `knnQuery`.

For example, let's search for the $k=3$ songs most similar to the query point `songIndex=1936`; the cell after the query prints the song's name.

In [None]:
# Selecting song and attributes
songIndex = 1936 # query point, selected song
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']

# Selecting query parameters
func, param = knnQuery, 3 # k=3

# Querying
response = querySimilars(dfSongs, columns, songIndex, func, param)

In [None]:
# Select a song
anySong = dfSongs.loc[songIndex]
# Get the song name
anySongName = getMusicName(anySong)
# Retrieve a YouTube link
youtube = youtubeSearchVideo(anySongName)

# Print
print('# Query Point')
print(songIndex, anySongName)
print(youtube['url'])

In [None]:
print('# Similar songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

### Range query

Trying a query using `rangeQuery`.

For example, let's search for similar songs using $dist \leq 0.15$ and the query point `songIndex=5` (music: `"Drake - Sneakin"`).

In [None]:
# Selecting song and attributes
songIndex = 5 # query point, selected song
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']

# Selecting query parameters
func, param = rangeQuery, 0.15 # threshold distance

# Querying
response = querySimilars(dfSongs, columns, songIndex, func, param)

In [None]:
# Select a song
anySong = dfSongs.loc[songIndex]
# Get the song name
anySongName = getMusicName(anySong)
# Retrieve a YouTube link
youtube = youtubeSearchVideo(anySongName)

# Print
print('# Query Point')
print(songIndex, anySongName)
print(youtube['url'])

In [None]:
print('# Similar songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

# Asking questions

So far, we have searched for similar songs based on their distance to a query point, using `knnQuery` and `rangeQuery`. In this way, it is possible to find songs that match a user's tastes.

## What are the most active, cheerful songs?

We can also build our own query points and change the columns to explore other questions. For example, to find the most cheerful songs, we select the attributes `columns = ['danceability','energy','valence']` and search for the $k$ songs closest to the highest values, `'danceability'=1,'energy'=1,'valence'=1`. Thus, **question**: _What are the top 3 active, cheerful songs on our list?_

In [None]:
# Defining the query point and the attributes
k = 3
queryPoint = [1, 1, 1] # query point
columns = ['danceability','energy','valence']

# Searching for the songs
arr = dfSongs[columns].copy(deep=True)
response = knnQuery(queryPoint, arr, k)

# Printing
print('# Active, cheerful songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

## What are the least active, least energetic songs?

We can also flip the perspective. **Question**: _What are the top 3 least active, least animated songs on our list?_ We just need to change our query point to `'danceability'=0,'energy'=0,'valence'=0`.

In [None]:
# Defining the query point and the attributes
k = 3
queryPoint = [0, 0, 0] # query point
columns = ['danceability','energy','valence']

# Searching for the songs
arr = dfSongs[columns].copy(deep=True)
response = knnQuery(queryPoint, arr, k)

# Printing
print('# Least active songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])
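
Both questions above use a corner of the attribute space as the query point, which only makes sense if every selected attribute lies between 0 and 1. The sketch below checks that assumption and then queries both corners on a toy table. The helper `closestTo`, the song labels and the values are hypothetical.

```python
import numpy as np
import pandas as pd

columns = ['danceability', 'energy', 'valence']

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    [[0.9, 0.8, 0.9],
     [0.2, 0.1, 0.3],
     [0.6, 0.5, 0.4],
     [0.1, 0.2, 0.1]],
    columns=columns,
    index=['song_a', 'song_b', 'song_c', 'song_d'],
)

# The corner trick presumes every attribute is bounded by 0 and 1
in_unit_range = bool(toy.min().min() >= 0 and toy.max().max() <= 1)

def closestTo(corner, k):
    # Distance from every row to the chosen corner, smallest first
    dist = np.linalg.norm(toy.to_numpy() - np.asarray(corner, dtype=float), axis=1)
    return pd.Series(dist, index=toy.index).sort_values().head(k).index

most_cheerful = closestTo([1, 1, 1], k=1)
least_cheerful = closestTo([0, 0, 0], k=1)
print(in_unit_range, list(most_cheerful), list(least_cheerful))
```

If an attribute with a different scale (for example `tempo` or `loudness`) were added to `columns`, the corner values and the distances would need rescaling first.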