# Music genre classification

## Dataset description

This dataset was downloaded from kaggle.com. According to its description, it was created as part of a MachineHack Hackathon. [Link to data](https://www.kaggle.com/datasets/purumalgi/music-genre-classification?select=test.csv)

### Features

The dataset has 17 columns altogether: 16 features and the target `Class`.

The following table lists the column names, their types and a short description of each:

| Feature name | Type | Description |
| ---- | ---- | ---- |
| Artist Name | string | Artist name |
| Track Name | string | Name of the track |
| Popularity | integer | Measures how popular the song is |
| danceability | float | Measures how danceable the song is |
| energy | float | Measures how energetic the song is |
| key | categorical | Determines the key of the song |
| loudness | float | Measures how loud the song is |
| mode | categorical | Determines the song's mode |
| speechiness | float | Measures the presence of spoken words in the song |
| acousticness | float | Measures how acoustic the song is |
| instrumentalness | float | Measures how instrumental the song is |
| liveness | float | Measures how likely it is that the song was recorded live |
| valence | float | Measures how positive the song is |
| tempo | float | Measures song tempo |
| duration_in min/ms | float | Song duration |
| time_signature | categorical | Determines time signature |
| Class | categorical | Determines which genre the song belongs to |


### Target

The aim of this dataset (and of this notebook) is to train a model that predicts the genre a song belongs to. The following table gives an overview of the genres present in the dataset:

| Class in dataset | Genre |
| ---- | ---- |
| 0 | Acoustic |
| 1 | Alt |
| 2 | Blues |
| 3 | Bollywood |
| 4 | Country |
| 5 | HipHop |
| 6 | Indie |
| 7 | Instrumental |
| 8 | Metal |
| 9 | Pop |
| 10 | Rock |


In [31]:
# import libraries and functions for the whole notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# set seed for reproducible results - sklearn uses NumPy's global random state when random_state is not passed
np.random.seed(565)

## Creating training and testing datasets

The testing dataset provided on Kaggle does not include the target variable, so the training dataset will be used for both training and testing. The dataset will be read and split into a training part and a testing part. The split is stratified to preserve the genre distribution in both parts.

The exploratory analysis and preprocessing will be done only with the training part in order to avoid data leakage from the testing part.

In [63]:
data = pd.read_csv("data/train.csv", dtype={"key": "category", "mode": "category", "time_signature": "category", "Class": "category"})

train, test = train_test_split(data, test_size=0.3, stratify=data["Class"])

# show automatically detected dtypes (will be fixed later)
data.dtypes

Artist Name             object
Track Name              object
Popularity             float64
danceability           float64
energy                 float64
key                   category
loudness               float64
mode                  category
speechiness            float64
acousticness           float64
instrumentalness       float64
liveness               float64
valence                float64
tempo                  float64
duration_in min/ms     float64
time_signature        category
Class                 category
dtype: object
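
One way to confirm that a stratified split preserved the class distribution is to compare relative class frequencies across the parts. The sketch below does this on a small synthetic frame (the `toy` data and its values are made up for illustration); the same comparison could be run on `data`, `train` and `test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: three imbalanced classes (made-up values).
toy = pd.DataFrame({
    "tempo": np.arange(100, dtype=float),
    "Class": [0] * 60 + [1] * 30 + [2] * 10,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["Class"], random_state=565
)

# Relative class frequencies of the full data and of each part, side by side.
proportions = pd.DataFrame({
    "full": toy["Class"].value_counts(normalize=True),
    "train": toy_train["Class"].value_counts(normalize=True),
    "test": toy_test["Class"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

If the three columns are close to each other for every class, the split kept the genre distribution intact.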

## Exploratory data analysis

In this section, we will look at how values are distributed in the dataset and how they're individually correlated to the output.

### Missing values

First, however, a quick look at the missing values in the dataset. This gives an overview of whether certain features can or should be discarded.

#### Missing values in columns

The output below shows the percentage of missing values in each column. Only three features have any missing values: `instrumentalness` is missing roughly a quarter of its values (24.5%) and `key` about 11%. These two features will be discarded because of their high share of missing values. `Popularity` is missing only about 2.3% of its values, which is acceptable.

In [64]:
train.isna().sum(axis=0)/train.shape[0] * 100

Artist Name            0.000000
Track Name             0.000000
Popularity             2.310074
danceability           0.000000
energy                 0.000000
key                   11.209018
loudness               0.000000
mode                   0.000000
speechiness            0.000000
acousticness           0.000000
instrumentalness      24.458204
liveness               0.000000
valence                0.000000
tempo                  0.000000
duration_in min/ms     0.000000
time_signature         0.000000
Class                  0.000000
dtype: float64

#### Missing values in rows

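About 35% of songs in the training part have at least one feature missing. Once the two features are dropped, only `Popularity` still contains missing values, so the share falls to about 2.3%, which is that column's own missing rate.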

In [44]:
(train.isna().sum(axis=1)>0).sum()/train.shape[0]*100

34.57172342621259
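
The effect of dropping the two columns on the row-level figure can be checked directly. The following is a minimal sketch on a made-up frame; `pct_incomplete_rows` is a hypothetical helper, not part of the notebook, and on the real data it would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three incomplete columns (values are made up).
toy = pd.DataFrame({
    "Popularity": [50.0, np.nan, 30.0, 70.0],
    "key": [1.0, 2.0, np.nan, 5.0],
    "instrumentalness": [np.nan, 0.1, 0.2, np.nan],
    "energy": [0.5, 0.6, 0.7, 0.8],
})

def pct_incomplete_rows(df: pd.DataFrame) -> float:
    """Percentage of rows with at least one missing value."""
    return df.isna().any(axis=1).mean() * 100

before = pct_incomplete_rows(toy)
after = pct_incomplete_rows(toy.drop(columns=["key", "instrumentalness"]))
print(f"before: {before:.1f}%, after: {after:.1f}%")
```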

### Artist name

Artist names are provided as strings. Processing these could be difficult, as there are 7286 unique artist strings in the training data. Some artists also appear under several distinct strings, because a collaboration between multiple artists is stored as its own string.

The top 10 artists cover 2.56% of all songs. It could be worth trying to include the one-hot encoded top N artists as an experiment. Including a large number of artists would lead to high cardinality, and the model would probably not be able to capture the relationship for artists with only a few songs in this dataset.

In [90]:
print(f"Amount of unique artist strings: {len(train['Artist Name'].unique())}")

top_artists = 10

print(f"Top {top_artists} artists cover {train['Artist Name'].value_counts()[:top_artists].sum()/train.shape[0] * 100:.2f}% of all training data.")

Amount of unique artist strings: 7286
Top 10 artists cover 2.56% of all training data.
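
The top-N experiment mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the artist names are placeholders, `top_n` and `encode_top_artists` are hypothetical, and grouping all remaining artists into a single `other` column is one possible design choice among several.

```python
import pandas as pd

# Toy artist columns (placeholder names, not taken from the dataset).
toy_train = pd.DataFrame({"Artist Name": ["A", "A", "A", "B", "B", "C", "D"]})
toy_test = pd.DataFrame({"Artist Name": ["A", "C", "E"]})

top_n = 2  # hypothetical cut-off
# Pick the top artists from the training part only, to avoid leakage.
top = toy_train["Artist Name"].value_counts().head(top_n).index

def encode_top_artists(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the top artists; all other artists share one column."""
    reduced = df["Artist Name"].where(df["Artist Name"].isin(top), "other")
    # Fixed categories keep the columns identical for train and test.
    cat = pd.Categorical(reduced, categories=[*top, "other"])
    return pd.get_dummies(pd.Series(cat, index=df.index), prefix="artist")

encoded_test = encode_top_artists(toy_test)
print(encoded_test)
```

Fixing the category list up front means an artist that is absent from one part still gets its (all-zero) column, so both parts end up with the same feature layout.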


### Track name

As the output below shows, some track names appear multiple times. However, the counts are too low for the track name to be useful as a variable.

In [91]:
train['Track Name'].value_counts()

Track Name
Dreams                                    6
Fire                                      6
Hurricane                                 6
Ghost                                     6
Ride                                      6
                                         ..
Never Going Back Again - 2004 Remaster    1
Eternal Wheel Of Time And Space           1
Bet My Blood                              1
So What!                                  1
Tu Bolero Es Mi Zamba                     1
Name: count, Length: 11091, dtype: int64

### Popularity

## Preprocessing

TODO

> Requirements on preprocessing
> 
> Any two of the following operations are mandatory:
> 
> - remove rows based on subsetting
> - derive new columns
> - use aggregation operators
> - treat missing values


TODO remove instrumentalness
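
As a starting point for the missing-value treatment, the sketch below drops the two sparse columns and fills the remaining gaps in `Popularity`. Median imputation is an assumption made here for illustration, not a decision taken in this notebook, and the toy frames contain made-up values. The imputer is fitted on the training part only, in line with the leakage concern raised earlier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test parts with made-up values.
toy_train = pd.DataFrame({
    "Popularity": [40.0, np.nan, 60.0, 50.0],
    "key": ["1", None, "5", "7"],
    "instrumentalness": [0.1, np.nan, np.nan, 0.3],
})
toy_test = pd.DataFrame({
    "Popularity": [np.nan, 80.0],
    "key": ["2", None],
    "instrumentalness": [np.nan, 0.2],
})

# Drop the two features with a high share of missing values.
drop_cols = ["key", "instrumentalness"]
train_clean = toy_train.drop(columns=drop_cols)
test_clean = toy_test.drop(columns=drop_cols)

# Assumption: fill Popularity with the median learned from the training part.
imputer = SimpleImputer(strategy="median")
train_clean[["Popularity"]] = imputer.fit_transform(train_clean[["Popularity"]])
test_clean[["Popularity"]] = imputer.transform(test_clean[["Popularity"]])

print(train_clean)
print(test_clean)
```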


## Modeling

TODO

> Requirements
> 
> Use any classifier. Choose one of the following two options:
> 
> - perform train/test split
> - use crossvalidation
> 
> Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).
> Any classifier from `sklearn`
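
A comparison of two classifiers of different types could be set up along the following lines. This is a sketch on synthetic data from `make_classification`, not on the music dataset; the model settings (`max_iter`, `n_estimators`, 5 folds) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=565
)

models = {
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=565),
}

# 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```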

### Hyperparameter tuning

TODO

> Requirements on metaparameter tuning
> 
> If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:
> 
> - try several configurations and describe the best result in the final report
> - perform grid search or other similar automatic method
> - once you have tuned metaparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data, as e.g. described here for Python: https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html and https://stackoverflow.com/questions/26962050/confused-with-repect-to-working-of-gridsearchcv
> 
> Python recommendation: sklearn.model_selection.GridSearchCV
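
A grid search with `GridSearchCV` could be sketched as follows. The data is synthetic and the parameter grid is a small, arbitrary example rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=565
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=565
)

# Small illustrative grid; real experiments would likely search more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=565), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)  # tuning only ever sees the training part

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

With the default `refit=True`, `GridSearchCV` retrains the best configuration on all data passed to `fit`, which corresponds to the retraining step mentioned in the requirements above.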

## Results and evaluation

TODO

> Requirements on model evaluation
> 
> - report the accuracy on the test set/crossvalidation
> - if you are performing a binary classification task, also include the ROC curve
> - make sure to use a dedicated dataset for evaluation
> 
> Python: use `model_selection.cross_val_score`; plot the ROC curve using `sklearn.metrics.roc_curve`.
> 
> R: print the model learned using the caret package; the ROC curve can be plotted using the plotROC package.