## Introduction

I'm working with the Echonest dataset, which contains audio features for 13k+ tracks. The goal is to work towards my own MIR (music information retrieval) system that uses deep learning to extract audio features from audio signals (probably using GTZAN).
The data comes from the [FMA dataset](https://github.com/mdeff/fma). Let's start by training a model to do feature extraction given an MFCC (derived from an audio signal). We can use the Echonest dataset, which comes with audio features already extracted, and use the FMA utils to generate the MFCCs.

## What Model?

Is a CNN really the best choice for a regression model? The [community](https://stats.stackexchange.com/questions/335836/cnn-architectures-for-regression) seems to say no, so what's the alternative? We could change the scope of the problem to classification: instead of extracting audio features like acousticness (on a scale of 0-1), we would try to classify tracks by genre. A CNN for genre classification is an interesting idea and maybe a good starting point, but for the time being I'll read more.

#### Brainstorming
- A two-network implementation: the first network is a genre classifier, the second a regression on an individual feature or a set of features

## Setup

We first need to import the necessary packages and load the dataset. We'll use pandas ('cause when have we not) and sklearn for the similarity metrics.

In [6]:
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

filepath = "data/echonest.csv"
```

In this scenario we'll use both Euclidean distance and cosine similarity. I'm hesitant to use only cosine similarity here because we aren't dealing with documents. I need to understand the data a little better before I decide.

In the Pitchfork review recommender, we used keywords from the album reviews to compare similarity. Longer reviews map to vectors with large word-frequency values, simply because each word appears more often. A shorter review with roughly the same relative frequencies has smaller values, so the Euclidean distance between the two is large even though the reviews are similar. Measuring the angle between the vectors, which ignores their magnitude, is therefore a better measure of similarity there.
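To make the review-length argument concrete, here is a minimal sketch with made-up keyword counts. The three vectors are hypothetical and are not taken from the Pitchfork data; they only illustrate how the two metrics can disagree.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical keyword counts (one column per keyword), not real review data.
long_review = np.array([[30, 20, 10]])   # long review
short_review = np.array([[3, 2, 1]])     # same proportions, a tenth of the length
other_review = np.array([[1, 2, 3]])     # short review with a different emphasis

# Cosine similarity only looks at the angle between the vectors.
cos_long_short = cosine_similarity(long_review, short_review)[0, 0]
cos_short_other = cosine_similarity(short_review, other_review)[0, 0]

# Euclidean distance is sensitive to the magnitude of the vectors.
dist_long_short = euclidean_distances(long_review, short_review)[0, 0]
dist_short_other = euclidean_distances(short_review, other_review)[0, 0]

print(f"cosine    long vs short:  {cos_long_short:.3f}")
print(f"cosine    short vs other: {cos_short_other:.3f}")
print(f"euclidean long vs short:  {dist_long_short:.3f}")
print(f"euclidean short vs other: {dist_short_other:.3f}")
```

Cosine similarity treats the long and short reviews as pointing in the same direction, while Euclidean distance puts the short review closer to the unrelated one purely because both are short.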

But in this case the features are already provided, so I need to understand how the data is scaled before deciding on a metric for comparisons.

In [16]:
```python
df = pd.read_csv(filepath, index_col=0, header=[0, 1, 2])
print(df['echonest', 'audio_features'].head())
```

```text
          acousticness  danceability    energy  instrumentalness  liveness  \
track_id                                                                     
2             0.416675      0.675894  0.634476          0.010628  0.177647   
3             0.374408      0.528643  0.817461          0.001851  0.105880   
5             0.043567      0.745566  0.701470          0.000697  0.373143   
10            0.951670      0.658179  0.924525          0.965427  0.115474   
134           0.452217      0.513238  0.560410          0.019443  0.096567   

          speechiness    tempo   valence  
track_id                                  
2            0.159310  165.922  0.576661  
3            0.461818  126.957  0.269240  
5            0.124595  100.260  0.621661  
10           0.032985  111.562  0.963590  
134          0.525519  114.290  0.894072  
```
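One thing the printed sample already hints at: tempo is on a very different scale from the other columns, which all appear to sit between 0 and 1. The sketch below copies three rows and three columns from the output above and shows how tempo dominates a raw Euclidean distance. The min-max scaling at the end is just one possible choice (an assumption on my part, not a decision), and scaling over three rows is only for illustration.

```python
import pandas as pd

# Three rows and three columns copied from the printed sample above.
sample = pd.DataFrame(
    {
        "acousticness": [0.416675, 0.374408, 0.043567],
        "energy": [0.634476, 0.817461, 0.701470],
        "tempo": [165.922, 126.957, 100.260],
    },
    index=pd.Index([2, 3, 5], name="track_id"),
)


def euclidean(a, b):
    """Euclidean distance between two rows (pandas Series)."""
    return float(((a - b) ** 2).sum() ** 0.5)


# Distance between tracks 2 and 3 with and without tempo.
raw_dist = euclidean(sample.loc[2], sample.loc[3])
no_tempo_dist = euclidean(sample.loc[2].drop("tempo"), sample.loc[3].drop("tempo"))

# Min-max scale every column to [0, 1] so no single feature dominates.
scaled = (sample - sample.min()) / (sample.max() - sample.min())
scaled_dist = euclidean(scaled.loc[2], scaled.loc[3])

print(f"raw distance:           {raw_dist:.3f}")
print(f"distance without tempo: {no_tempo_dist:.3f}")
print(f"distance after scaling: {scaled_dist:.3f}")
```

The raw distance is almost entirely the tempo difference, so whatever metric I end up using, the features probably need to be put on a comparable scale first.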


## Understanding the Dataset

My initial impression is that this dataset is [multi-tiered](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), i.e. a dataset made up of clearly defined, nested subsets. There is the umbrella/master set (echonest). Its sub-categories are the different types of data offered (e.g. audio_features, metadata, social_features). These sub-categories in turn contain the columns of the dataframe, each of which is a single feature within that type of data (e.g. acousticness and danceability under audio_features).

So there already exists a nice hierarchical structure to the data. We can leverage this to quickly test and compare the results of measuring similarity between tracks based on different features and types of features. One objective is to have a good understanding of the differences in measuring similarity using different sets of features.

Let's explore the data a little. Gotta figure out what all the types of data are and which features sit under each type. I'm mainly interested in audio_features as a jumping-off point for training the model.
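As a rough sketch of that exploration, the snippet below builds a tiny stand-in frame with the same three-level column header and lists the types of data and the features under each. The `toy` frame is hypothetical: the audio values come from the output above, but the column names and values under metadata and social_features are placeholders, not the real Echonest columns. On the real data the same calls should work on `df` directly.

```python
import pandas as pd

# Toy stand-in for the real frame. Placeholder names and values are made up.
columns = pd.MultiIndex.from_tuples(
    [
        ("echonest", "audio_features", "acousticness"),
        ("echonest", "audio_features", "tempo"),
        ("echonest", "metadata", "placeholder_a"),
        ("echonest", "social_features", "placeholder_b"),
    ]
)
toy = pd.DataFrame(
    [[0.416675, 165.922, "x", 0.1], [0.374408, 126.957, "y", 0.2]],
    index=pd.Index([2, 3], name="track_id"),
    columns=columns,
)

# Level 1 of the column index holds the types of data.
feature_types = toy.columns.get_level_values(1).unique().tolist()

# Selecting ("echonest", <type>) leaves only the innermost level: the features.
features_by_type = {
    kind: toy["echonest", kind].columns.tolist() for kind in feature_types
}

for kind, features in features_by_type.items():
    print(f"{kind}: {features}")

# The subset I care about most for now.
audio = toy["echonest", "audio_features"]
```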