# Group information

Names:


RAs:

# **Machine Learning MC886/MO444 - Task \#1**: Regression and Classification


### Objective:

To explore **Linear Regression** and **K-Nearest Neighbors** alternatives and come up with the best possible model for each problem. In this work, we will train three models: one for regression, another for binary classification, and the last one for multiclass classification.

## **Linear Regression**

In this section you must load and explore the dataset, and build a linear regressor by hand. No machine learning libraries are allowed. After building your own regressor, you must compare it with the sklearn `SGDRegressor`.


#### **Dataset: Spotify Song Attributes**

The Spotify Song Attributes Dataset is a comprehensive collection of music tracks, encompassing various genres and artist names. This dataset provides valuable insights into the world of music, allowing enthusiasts, researchers, and data scientists to delve into the characteristics and nuances of each track.

The dataset can be found here: https://www.kaggle.com/datasets/byomokeshsenapati/spotify-song-attributes


Some features and the corresponding descriptions:

| Variable Name | Data Type | Description | Example Value | Range/Possible Values |
|---|---|---|---|---|
| `track_name` | String | Name of the song | `'Blinding Lights'` | - |
| `track_artist` | String | Name of the artist(s) | `'The Weeknd'` | - |
| `msPlayed` | Integer | Milliseconds played | `191772` | - |
| `danceability` | Float | How suitable a track is for dancing (0.0 - 1.0) | `0.70` | 0.0 - 1.0 |
| `energy` | Float | Perceptual measure of intensity and activity (0.0 - 1.0) | `0.73` | 0.0 - 1.0 |
| `key` | Integer | Estimated overall key of the track (0 = C, 1 = C♯/D♭, ... , 11 = B) | `1` | 0 - 11 |
| `loudness` | Float | Overall loudness of the track in decibels (dB) | `-5.934` | Typically -60 to 0 dB |
| `mode` | Integer | Modality of the track (0 = Minor, 1 = Major) | `1` | 0 or 1 |
| `speechiness` | Float | Presence of spoken words in the track (0.0 - 1.0) | `0.0572` | 0.0 - 1.0 |
| `acousticness` | Float | Confidence measure of whether the track is acoustic (0.0 - 1.0) | `0.00146` | 0.0 - 1.0 |
| `instrumentalness` | Float | Predicts whether a track contains no vocals (0.0 - 1.0) | `0.000095` | 0.0 - 1.0 |
| `liveness` | Float | Presence of an audience in the recording (0.0 - 1.0) | `0.0897` | 0.0 - 1.0 |
| `valence` | Float | Musical positiveness conveyed by a track (0.0 - 1.0) | `0.644` | 0.0 - 1.0 |
| `tempo` | Float | Overall estimated tempo of a track in beats per minute (BPM) | `171.005` | Typically 50 - 200 BPM |
| `duration_ms` | Integer | Duration of the track in milliseconds | `191948` | - |
| `time_signature` | Integer | Estimated overall time signature of a track | `4` | Typically 3, 4, or 5 |
| `genre` | String | Genre of the track (if available) | `'Pop'` | Varies depending on dataset |


In [None]:
# link with google drive
from google.colab import drive
import pandas as pd
drive.mount('/content/gdrive', force_remount=True)

# load dataset from google drive
path = ""  # TODO: set this to the location of the dataset (left empty here on purpose)
df = pd.read_csv(path)

### **Data analysis and preprocessing** (1.5 points)

In this section, you should explore the dataset. Remember to avoid using data that you should not have in training.

You can plot graphs of the features that you think are important to visualize their relation with the target (`msPlayed`). You can also use boxplots to understand the feature distributions. There is no minimum or maximum number of graphs required; explore whatever you think helps in understanding the dataset.

Check the dependency between each feature and the target to understand which features have the greatest impact on it (more details in the `mutual_information` section below!).

The dataset has categorical features that cannot be used in the models. Fix this (Pandas has a built-in function for that!).

Remember that machine learning models are highly affected by the scale of the input features.
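
The two preprocessing hints above (encoding categorical features and scaling) can be sketched as follows. This is only an illustration on a made-up toy frame: the values are invented, `pd.get_dummies` is assumed to be the built-in function the text alludes to, and standardization is just one of several possible scaling choices. Note that the scaling statistics are computed on the training rows only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Spotify data (values are made up).
toy = pd.DataFrame({
    "genre": ["pop", "rock", "pop", "jazz"],
    "tempo": [120.0, 90.0, 150.0, 100.0],
    "msPlayed": [1000, 2000, 1500, 500],
})

# Split first, so that every statistic is computed on training rows only.
train, test = toy.iloc[:3], toy.iloc[3:]

# One-hot encode the categorical column, then align the test columns
# to the columns seen in training (unseen categories are dropped).
train_enc = pd.get_dummies(train, columns=["genre"], dtype=float)
test_enc = pd.get_dummies(test, columns=["genre"], dtype=float)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

# Standardize a numeric feature with the training mean and standard deviation.
mu, sigma = train["tempo"].mean(), train["tempo"].std(ddof=0)
train_enc["tempo"] = (train_enc["tempo"] - mu) / sigma
test_enc["tempo"] = (test_enc["tempo"] - mu) / sigma
```

Reusing the training mean and standard deviation on the test rows is what keeps test information out of the training pipeline.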


In [None]:
## Visualize the data

**Mutual information**

The mutual information measure is a way to estimate the mutual dependency between two variables. Therefore, it can be used as an alternative to the t- or F-statistic to assess the association between a predictor variable $X_i$ and the response variable $Y$.

In this way, we can try to select features at an early stage of the machine learning pipeline by removing features with low mutual information with the target.

To do this task, use the `mutual_info_regression` function from the Sklearn library. You should pay attention to the *discrete_features* parameter, which must be constructed correctly (all continuous features should be `False` in the array, while the others should be `True`). <br/>
The features that are not numbers also need to be converted in order for `mutual_info_regression` to work. To do this you can use Sklearn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder). This way of encoding is not always appropriate for learning experiments, as ordinal encoding imposes a specific order on the categorical values. For the mutual information this is not a problem, but you should not use this encoding for the rest of the experiments.

Note: *It is important to note that this preprocessing step has to be done carefully and is not mandatory for all problems/datasets. The mutual information measure does not take into account the interaction between different features. The basic linear regressor does not model such interactions either, so removing low-MI features will most likely improve (or at least not worsen) its performance. When using more complex models such as Neural Networks (that we will study in the near future), removing the features that have low direct mutual dependency with the target may worsen the model, as the complex model can find those hidden interactions.*
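
The steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the expected solution: the column names only mirror the dataset, the values are randomly generated, and the names `X_demo`, `y_demo`, and `discrete_mask` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 300

# Hypothetical toy data: names mirror the dataset, values are synthetic.
X_demo = pd.DataFrame({
    "energy": rng.uniform(0, 1, n),                   # continuous
    "key": rng.integers(0, 12, n),                    # discrete integer codes
    "genre": rng.choice(["pop", "rock", "jazz"], n),  # categorical strings
})
y_demo = 5.0 * X_demo["energy"] + rng.normal(0, 0.1, n)

# Ordinal-encode the string column, for the MI computation only.
X_mi = X_demo.copy()
X_mi["genre"] = OrdinalEncoder().fit_transform(X_mi[["genre"]]).ravel()

# True marks discrete features, False marks continuous ones,
# in the same order as the columns of X_mi.
discrete_mask = [False, True, True]
mi_demo = mutual_info_regression(
    X_mi, y_demo, discrete_features=discrete_mask, random_state=0
)
mi_series = pd.Series(mi_demo, index=X_mi.columns).sort_values(ascending=False)
```

Because the synthetic target depends only on `energy`, that column should come out on top, while the other two should score close to zero.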


In [None]:
## Check mutual information
from sklearn.feature_selection import mutual_info_regression


In [None]:
## Visualize the mutual information of each variable (Just run!)
## mt_info is the output of the Sklearn function; X is the feature DataFrame used to compute it
import matplotlib.pyplot as plt

mt_info_df = pd.Series(mt_info)
mt_info_df.index = X.columns
mt_info_df.plot.bar(figsize = (16,6));
plt.ylabel('Mutual Information (MI)')
plt.title('Features - Vertical')
plt.grid(linewidth=0.25)
plt.show()

In [None]:
## Adapt the categorical features

#### Discussion of key points

- How did the visualization help in understanding the data?
- Looking at the mutual information plot, can you find features that seem to be uninfluential? (If so, remember to remove them before next steps!)


*YOUR ANSWER HERE*

### **Implement and train a Linear Regressor** (2.5 points)

You should complete the implementation of the `MyLinearRegressor` class and of the `MSE` metric started below. No machine learning libraries are allowed for this.
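
As a rough guide to what the training loop has to do, here is a sketch of batch gradient descent on the MSE for a linear model. It is written as a standalone function (`fit_linear_gd` is a hypothetical name) rather than as the class, the default hyperparameters are arbitrary, and batch gradient descent is only one possible choice of optimizer.

```python
import numpy as np

def fit_linear_gd(X, y, learning_rate=0.1, max_iter=2000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE (illustrative sketch)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(max_iter):
        residual = X @ w + b - y                        # prediction error per sample
        grad_w = (2.0 / n_samples) * (X.T @ residual)   # dMSE/dw
        grad_b = (2.0 / n_samples) * residual.sum()     # dMSE/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic check: noise-free data generated from known weights (made-up values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy @ np.array([3.0, -2.0]) + 1.0
w_hat, b_hat = fit_linear_gd(X_toy, y_toy)
```

On standardized inputs like these the loop recovers the generating weights; on unscaled features the same learning rate may diverge, which is one reason feature scaling matters here.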

The metrics commonly used to assess a regression model's performance are the Mean Squared Error (MSE), the Mean Absolute Error (MAE), and the coefficient of determination (R²). You can implement your own version of the last two metrics (MAE and R²) or use the sklearn implementations. Compare the three metrics.
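
For reference, the three metrics are usually defined as shown below. The notation is introduced here and is not part of the original statement: $y_i$ are the true values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the true values, and $n$ the number of samples.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

MSE is expressed in squared units of the target and MAE in the target's own units, while R² is unitless, which is worth keeping in mind when comparing them.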

In [None]:
# TODO: MSE. You cannot use machine learning libraries for this!!
def MSE():
  return None

In [None]:
# TODO: Finish the implementation of MyLinearRegressor. You cannot use machine learning libraries for this!!
class MyLinearRegressor():
  def __init__(self, learning_rate=-1, max_iter=-1):
    self.max_iter         = max_iter
    self.learning_rate    = learning_rate
    self.weights          = None
    self.bias             = None

  def predict(self, X):
    return None

  def fit(self, X, y):
    return None

#### Discussion of key points

- Looking at the different metrics proposed, what is the best one for this problem?
- Was your linear regressor able to closely estimate the number of milliseconds played? Justify using a machine learning metric.
- What do you think is the dominant source of error in your model: variance or bias?

*YOUR ANSWER HERE*

### **Compare with SGDRegressor** (0.5 point)

After training your regressor, train an `SGDRegressor` from sklearn and compare the two models.
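
A minimal sketch of fitting the sklearn model, assuming synthetic data in place of the real features. The pipeline with `StandardScaler` and the hyperparameter values are illustrative choices, not requirements of the task.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made-up weights and noise level).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 0.1, 500)

X_train, X_test = X_toy[:400], X_toy[400:]
y_train, y_test = y_toy[:400], y_toy[400:]

# Scaling lives inside the pipeline, so it is fit on the training data only.
sgd_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
sgd_model.fit(X_train, y_train)

# score() returns R² on the held-out data.
r2_test = sgd_model.score(X_test, y_test)
```

For a fair comparison, both regressors should be evaluated on the same held-out split and with the same metrics.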

In [None]:
# TODO: Train the SGDRegressor. You should use sklearn libraries.
from sklearn.linear_model import SGDRegressor

### **EXTRA: Find interaction terms** (0.5 point)

In chapter 3 of the book "An Introduction to Statistical Learning", the authors explain how different terms can interact with each other, and how such an interaction can have a stronger correlation with the target than the individual terms.

In this extra task, search for relations between columns that can improve the results of the model.<br/>
An interaction could be the sum, difference, product, or quotient of two columns. Choose some relations to test between some of the columns. <br/>
To do this, you can use the `mutual_information` technique to test whether the new columns have higher mutual information with the target.

You should train your own model, **not** the Sklearn one.
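
One way to screen candidate interaction columns is sketched below on synthetic data. The columns `a` and `b` are hypothetical stand-ins for real features, and the target is deliberately built from their product so that the effect is visible.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features; the synthetic target depends on their product.
demo = pd.DataFrame({
    "a": rng.uniform(-1, 1, n),
    "b": rng.uniform(-1, 1, n),
})
target = demo["a"] * demo["b"] + rng.normal(0, 0.01, n)

# Candidate interaction columns built from pairs of existing columns.
demo["a_times_b"] = demo["a"] * demo["b"]
demo["a_plus_b"] = demo["a"] + demo["b"]

# Compare the mutual information of original and derived columns.
mi = pd.Series(
    mutual_info_regression(demo, target, random_state=0),
    index=demo.columns,
)
```

Here the product column should score far above either original column. On real data the gain is usually smaller, so each candidate still needs to be confirmed by retraining the model.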



In [None]:
# Reload the dataset

In [None]:
# Transform features

In [None]:
# Check mutual information

In [None]:
# Re-train and test the model

## **K-Nearest Neighbors (KNN) Classifier**

In this section you must load and explore the dataset, and train a K-Nearest Neighbors (KNN) classifier. You can (and should) use the Sklearn library to do it.

Remember to avoid using data that you should not have in training when performing the data analysis.

#### **Dataset: Kickstarter Projects**

I'm a crowdfunding enthusiast and I've been watching Kickstarter since its early days. Right now I just collect data, and the only app I've made is this Twitter bot, which tweets any project reaching some milestone: @bloomwatcher. I have a lot of other ideas, but sadly not enough time to develop them… But I hope you can!


https://www.kaggle.com/datasets/kemical/kickstarter-projects


| Variable Name | Data Type | Description | Example Value |
|---|---|---|---|
| `ID` | Integer | Unique identifier for the project | `1000002330` |
| `name` | String | Name of the project | `"The Songs of Adelaide & Abullah"` |
| `category` | String | Category of the project (e.g., "Publishing", "Film & Video", "Music") | `"Publishing"` |
| `main_category` | String | Main category of the project (e.g., "Art", "Technology", "Games") | `"Art"` |
| `currency` | String | Currency used for the funding goal | `"GBP"` |
| `deadline` | Date | Date the project funding period ended | `"2015-10-09"` |
| `goal` | Float | Funding goal in the project's currency | `1000.0` |
| `launched` | Date | Date the project was launched | `"2015-08-11 12:12:28"` |
| `pledged` | Float | Total amount pledged in the project's currency | `0.0` |
| `state` | String | State of the project (e.g., "failed", "successful", "canceled") | `"failed"` |
| `backers` | Integer | Number of backers who pledged to the project | `0` |
| `country` | String | Country where the project is based | `"GB"` |
| `usd pledged` | Float | Total amount pledged in US dollars (converted) | `0.0` |
| `usd_pledged_real` | Float | Total amount pledged in US dollars (converted using a different API) | `0.0` |
| `usd_goal_real` | Float | Funding goal in US dollars (converted using a different API) | `1533.95` |

**How to load the dataset**

As you already copied the folder in the first part of this task, you can just directly load the dataset.

*If you want to run the notebook locally, change the path below to the location of the folder in your local environment.*

In [None]:
# link with google drive; uncomment the two lines below if you need to restart from this step.
import pandas as pd
# from google.colab import drive
# drive.mount('/content/gdrive', force_remount=True)

# load dataset from google drive
path = ""  # TODO: set this to the location of the dataset (left empty here on purpose)

df = pd.read_csv(path)

### **Data analysis and preprocessing** (1.5 points)

In this section, you should explore the dataset.
This should be done as in the Linear Regressor section, paying attention to mutual information (now using [`mutual_info_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html)) and categorical features.

Check for missing values before changing the dataset and explore how to deal with them (e.g., removing the affected rows, or filling them with the mean, the median, or a random value).
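
The options mentioned above can be sketched with pandas as follows. The toy frame is made up (only the column names come from the dataset), and filling with the median or the most frequent value is just one possible policy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "goal": [1000.0, np.nan, 5000.0, 250.0],
    "country": ["GB", "US", None, "US"],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median and
# categorical gaps with the most frequent value.
filled = toy.copy()
filled["goal"] = filled["goal"].fillna(filled["goal"].median())
filled["country"] = filled["country"].fillna(filled["country"].mode().iloc[0])
```

In the actual task, the fill statistics should be computed on the training split only and then reused for the test split.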

Remember that machine learning models are highly affected by the scale of the input features.


#### Discussion of key points

- Were there missing values in the dataset? How did you deal with each one?
- ~~Did changing the missing values affect the mutual information between the features and the target?~~ => DO NOT ANSWER: there is no way to calculate mutual information with missing values.

*YOUR ANSWER HERE*

### **Train a K-Nearest Neighbors Classifier** (2.5 points)


You should use the Sklearn `KNeighborsClassifier` class to fit the data.

You can use different metrics such as accuracy and f1-score from Sklearn (or create your own implementation) to understand the model's performance.
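
A minimal sketch of fitting and scoring the classifier, assuming two synthetic Gaussian blobs in place of the Kickstarter features. The scaling pipeline and `n_neighbors=5` are illustrative choices, not values prescribed by the task.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=4.0, scale=1.0, size=(200, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# KNN is distance-based, so the features are scaled inside the pipeline.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

Repeating the fit for several values of `n_neighbors` on a validation split is one simple way to choose K.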

Also, plot a confusion matrix to analyze the results.
A confusion matrix is a matrix where the columns represent the true label and the rows represent the predicted label. As this is a binary classification task, the matrix should be 2x2. You can study more about it [here](https://en.wikipedia.org/wiki/Confusion_matrix). You can use sklearn functions that help in building and displaying it.
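
The sketch below builds a confusion matrix with sklearn on made-up labels. Note that `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, which is the transpose of the layout described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# sklearn convention: rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For the binary case, ravel() yields TN, FP, FN, TP in this order.
tn, fp, fn, tp = cm.ravel()

# Transpose to get columns = true label, rows = predicted label.
cm_columns_true = cm.T
```

For plotting, sklearn also provides `ConfusionMatrixDisplay`, which draws the matrix with matplotlib.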

#### Discussion of key points

- Is accuracy a good metric for this problem? Justify.
- What conclusions can you draw from the results in the confusion matrix?
- What was the best K for this problem? How does the choice of K affect the bias-variance tradeoff?

*YOUR ANSWER HERE*

## **Multiclass classification** (1 point)

In this last section you should adapt the "**Spotify Song Attributes**" dataset target, creating an arbitrary number N of classes, where 2 < N < 11.

Classes should represent equally spaced intervals of the continuous target, which is the rate of played time relative to the duration of the song.<br/>
For example, if $N = 3$, we should have 3 classes. Given that $\hat{x}$ is the maximum rate in the training set, samples with $y \le \hat{x}/3$ should be of class 0, samples with $\hat{x}/3 < y \le 2\hat{x}/3$ should be of class 1, and samples with $y > 2\hat{x}/3$ should be of class 2.
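
The binning rule above can be sketched with NumPy as follows. The function name `rate_to_classes` and the sample rates are hypothetical, and `max_rate` is assumed to come from the training set only.

```python
import numpy as np

def rate_to_classes(rate, max_rate, n_classes):
    """Map continuous rates to classes 0..n_classes-1 with equal-width bins up to max_rate."""
    # Interior edges: max_rate/N, 2*max_rate/N, ..., (N-1)*max_rate/N.
    edges = max_rate * np.arange(1, n_classes) / n_classes
    # right=True sends a value equal to an edge to the lower class (y <= edge).
    return np.digitize(rate, edges, right=True)

# Made-up rates; the maximum is taken from the training rates only.
train_rate = np.array([0.1, 0.5, 0.9, 1.0])
max_rate = train_rate.max()
train_labels = rate_to_classes(train_rate, max_rate, n_classes=3)

# A test rate above the training maximum falls into the last class.
test_labels = rate_to_classes(np.array([0.2, 1.5]), max_rate, n_classes=3)
```

Equal-width bins can leave some classes with very few samples, so it is worth checking the class counts before training.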

You can use the Sklearn KNN classifier for this task, as well as any sklearn helper functions. Remember to carefully perform the necessary preprocessing steps discussed in the other sections.

Plot a confusion matrix with the results.

#### Discussion of key points

- Is accuracy a good metric for this problem? Justify.
- What conclusions can you draw from the results in the confusion matrix?
- Is there value in solving a regression problem as a multiclass classification problem?

*YOUR ANSWER HERE*