# Assignment 2

<p style="text-align:center;"><img src="https://upload.wikimedia.org/wikipedia/commons/0/0b/AbaloneInside.jpg" width="400"></p>

In the second assignment, you can revise **regression techniques** and your **Pandas skills**. 
For this, we consider a dataset of [ear shells](https://en.wikipedia.org/wiki/Abalone) as depicted above. You have encountered the first 23 entries of this dataset in the first assignment. With this notebook you can work on

* making proficient use of **Pandas** functions,
* **ridge regression**,
* regression with **decision trees**,
* and **random forest** regression.

For this exercise, you can switch back to the environment ```APML``` used during class.

Please provide solutions to all exercises below and send me all notebooks by **31st of May 2024**.

***

## Part I: Dataset overview

Let us first inspect the dataset. Load the data with the following cells.

In [None]:
import pandas as pd
import numpy as np

shell_data = pd.read_csv('data/shells.csv', sep=',')

In [None]:
shell_data.head(5)

In [None]:
shell_data.info()

### Exercise I.1

As we want to employ regression techniques, encode the entries in the
```sex``` column with numerical values, i.e. use
* ```0.0```, ```0.5```, and ```1.0``` instead of
* ```'M'```, ```'I'```, and ```'F'```, respectively.
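
One possible approach is `Series.map` with a dictionary. The sketch below uses a small made-up frame (`toy` and `sex_codes` are hypothetical names, not part of the assignment data) and assumes the mapping follows the order of the two lists above.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's shell_data
toy = pd.DataFrame({"sex": ["M", "I", "F", "M"]})

# Assumed mapping, following the order of the two lists above
sex_codes = {"M": 0.0, "I": 0.5, "F": 1.0}
toy["sex"] = toy["sex"].map(sex_codes)

print(toy["sex"].tolist())
```

Note that `map` turns any value missing from the dictionary into `NaN`, so it is worth checking the unique values of the column first.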

Fill in the cell below:

If you were successful, the following cell should execute without an error:

In [None]:
shell_data['sex'] = shell_data['sex'].astype(float)
print(shell_data['sex'])

### Exercise I.2

Now, obtain the following statistics for the columns containing numerical values:

* count of non-NaN entries
* mean
* standard deviation
* minimum value
* 25% quantile
* 50% quantile (i.e. the median)
* 75% quantile
* maximum value
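
As a hint, Pandas can produce all of these statistics in a single call. A minimal sketch on made-up data (the frame `toy` and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "length": [0.4, 0.5, np.nan, 0.7],
    "weight": [1.0, 2.0, 3.0, 4.0],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()
print(summary)
```

The `count` row only counts non-NaN entries, so comparing it with the number of rows hints at missing values.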

### Exercise I.3

From the information above, you can already deduce that there are some **missing 
values**. 

1. Obtain the sub-table containing missing values.
2. From this, store the row indices with missing values in ```row_indices_nans```. We will make use of this later on.

Hints: Your solution could include the following parts

*  ```np.any()```
*  ```.isna()```
*  ```.index```

Note that your own solution might not require all of these.
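
The hints can be combined roughly as follows. This is only a sketch on made-up data; `toy`, `toy_with_nans`, and `toy_nan_rows` are hypothetical names, and it uses the DataFrame method `.any(axis=1)` rather than `np.any()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different rows
toy = pd.DataFrame({
    "length": [0.4, np.nan, 0.6, 0.7],
    "price": [10.0, 12.0, np.nan, 15.0],
})

# Boolean mask: True for every row with at least one NaN
has_nan = toy.isna().any(axis=1)

# Sub-table of affected rows and their index labels
toy_with_nans = toy[has_nan]
toy_nan_rows = toy_with_nans.index

print(toy_with_nans)
print(list(toy_nan_rows))
```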

In [None]:
print(row_indices_nans)

### Exercise I.4

Missing entries can pose a problem in your analysis when fitting, e.g., a regression model 
that expects a value for each property / column you provide. Thus, the first step in any analysis 
should be to treat these missing values. There are different ways to perform **imputation** of missing values.

In this exercise, set a column's **mean value** as the value for the missing entries in that column.

Hints: Your solution could include the following parts

* ```.isna()```
*  ```df.loc[row_index, column_name]```
  
Note that your own solution might not require all of these.
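
A minimal sketch of mean imputation with `.isna()` and `.loc`, again on made-up data. It assumes that all columns are numeric; for a frame that also contains a string column such as `subset`, the loop would have to be restricted to the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, all columns numeric
toy = pd.DataFrame({
    "length": [0.4, np.nan, 0.6, 0.8],
    "price": [10.0, 12.0, np.nan, 14.0],
})

for column in toy.columns:
    missing = toy[column].isna()
    # mean() skips NaN entries by default
    toy.loc[missing, column] = toy[column].mean()

print(toy)
```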

If you were successful, the following should not include missing values anymore:

In [None]:
shell_data.loc[row_indices_nans]

### Exercise I.5

From here on, we will try to predict the price of ear shells based on their other properties 
(which we will refer to as **features**).

A first, simple metric is the (linear) **correlation** between the different features / columns of our dataset.

Obtain the **correlation matrix** for the numeric columns of ```shell_data```. Which features of ear shells 
correlate most strongly with the price?
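
The idea can be sketched on a made-up frame as follows (`toy` and its columns are hypothetical). Selecting the numeric columns first keeps string columns out of the computation, and ranking by absolute value treats strong negative correlations as relevant too.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.0, 1.0, 4.0, 3.0, 5.0],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Pearson correlation between all pairs of numeric columns
corr_matrix = toy.select_dtypes("number").corr()

# Correlation of each feature with the target, strongest (by absolute value) first
price_corr = corr_matrix["price"].drop("price")
ranking = price_corr.abs().sort_values(ascending=False)
print(ranking)
```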

Note the top-3 features according to the correlation matrix:
1. .
2. .
3. .

***

## Part II: Ridge Regression

Before we start fitting different regression methods, let us first separate the 
```shell_data``` into four parts:

* ```shell_features_train```: The features we want to fit our models with. This table should not include the target ```'price'```.
* ```shell_targets_train```: The target ```'price'``` values we want to use to fit our models.
* ```shell_features_test```: The features we want to **test** our models on. This table should not include the target ```'price'```.
* ```shell_targets_test```: The true target ```'price'``` values we want to compare our **test predictions** against.

In [None]:
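# Split rows by the 'subset' column; copy to avoid modifying shell_data
shell_features_train = shell_data[shell_data['subset']=='train'].copy()
shell_features_test = shell_data[shell_data['subset']=='test'].copy()

# Extract the target column before it is dropped from the features
shell_targets_train = shell_features_train['price'].copy()
shell_targets_test = shell_features_test['price'].copy()

# Keep only the feature columns
shell_features_train = shell_features_train.drop(['subset','price'], axis=1)
shell_features_test = shell_features_test.drop(['subset','price'], axis=1)

# Turn the test targets into a DataFrame so that prediction columns can be appended
shell_targets_test = shell_targets_test.to_frame()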

print(f"Shapes of\n"
      f"shell_features_train:\t {shell_features_train.shape}\n"
      f"shell_features_test:\t {shell_features_test.shape}\n"
      f"shell_targets_train:\t {shell_targets_train.shape}\n"
      f"shell_targets_test:\t {shell_targets_test.shape}\n"
)

### Exercise II.1 

Implement a **ridge regression** approach to predict the ```'price'``` for the test set ```shell_features_test```.
Append the predicted prices to the ```shell_targets_test``` of true prices as an additional column.
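
For orientation, the following sketch fits a ridge regression in closed form with NumPy on synthetic data and appends the predictions as a new column. All names (`toy_targets_test`, `price_ridge`) and the value `alpha = 1.0` are assumptions for illustration. In practice, scikit-learn's `Ridge` estimator is the usual library route and is likely what was used in class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: two features and a noisy linear target
X_train = rng.normal(size=(50, 2))
y_train = 3.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)
X_test = rng.normal(size=(10, 2))
y_test = 3.0 * X_test[:, 0] - 1.0 * X_test[:, 1] + 5.0

alpha = 1.0  # assumed regularisation strength

# Centre the data so that the intercept is not penalised
x_mean = X_train.mean(axis=0)
y_mean = y_train.mean()
Xc = X_train - x_mean
yc = y_train - y_mean

# Closed-form ridge solution: w = (Xc^T Xc + alpha * I)^(-1) Xc^T yc
weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
intercept = y_mean - x_mean @ weights

# Collect true and predicted values side by side
toy_targets_test = pd.DataFrame({"price": y_test})
toy_targets_test["price_ridge"] = X_test @ weights + intercept
print(toy_targets_test)
```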

In [None]:
shell_targets_test

### Exercise II.2

Can you think of a way to identify which features were more relevant in the ridge regression?
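
One common idea, sketched here under assumptions, is to compare the magnitudes of the fitted coefficients after standardising the features, since raw coefficients depend on each feature's scale. The data, the feature names, and `alpha = 1.0` below are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
feature_names = ["length", "weight", "rings"]  # hypothetical names

# Features on very different scales; 'rings' has no influence on the target here
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 0.1])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Standardise features so that coefficient magnitudes are comparable
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

alpha = 1.0  # assumed regularisation strength
coefs = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(3), Xs.T @ yc)

# Rank features by absolute standardised coefficient
ranking = pd.Series(np.abs(coefs), index=feature_names).sort_values(ascending=False)
print(ranking)
```

With a scikit-learn model, the fitted coefficients would be read from its `coef_` attribute instead.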

Fill in the cell below with your idea:

Note the top-3 features according to the ridge regression:
1. .
2. .
3. .

***

## Part III: Decision Tree

### Exercise III.1

Implement a **decision tree** approach to predict the ```'price'``` for the test set ```shell_features_test```.
Append the predicted prices to the ```shell_targets_test``` of true prices as an additional column.

Use

In [None]:
tree_height = 5
random_state = 123

as the ```max_depth``` and ```random_state``` in the definition of your decision tree.
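
A hedged sketch of how these two values could enter a scikit-learn decision tree, using synthetic data in place of the shell features (the feature names passed to `export_text` are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Hypothetical toy data: the target jumps when the first feature exceeds 0.5
X_train = rng.uniform(0.0, 1.0, size=(100, 2))
y_train = np.where(X_train[:, 0] > 0.5, 20.0, 10.0) + rng.normal(scale=0.5, size=100)
X_test = rng.uniform(0.0, 1.0, size=(20, 2))

tree_height = 5
random_state = 123

tree = DecisionTreeRegressor(max_depth=tree_height, random_state=random_state)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)

# Text rendering of the fitted splits
print(export_text(tree, feature_names=["length", "weight"]))
```

For a graphical rendering, `sklearn.tree.plot_tree` is an alternative that draws the tree with Matplotlib.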

Fill in the cell below:

Visualise the fitted tree below:

### Exercise III.2

Can you think of a way to identify which features were more relevant for the decision tree's predictions?

Fill in the cell below with your idea:

Note the top-3 features according to the decision tree:
1. .
2. .
3. .

***

## Part IV: Random Forest

### Exercise IV.1

Implement a **random forest** approach to predict the ```'price'``` for the test set ```shell_features_test```.
Append the predicted prices to the ```shell_targets_test``` of true prices as an additional column.

Similar to before, use 

In [None]:
tree_height = 5
random_state = 123
num_trees = 100

as the ```max_depth```, ```random_state```, and, additionally, ```n_estimators``` in the definition of your random forest.
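
The sketch below shows one way these three values might be passed to scikit-learn's `RandomForestRegressor`, and how a single fitted tree can be picked from the forest for inspection. The data is synthetic and only stands in for the shell features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data with three features
X_train = rng.uniform(0.0, 1.0, size=(200, 3))
y_train = 10.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + rng.normal(scale=0.5, size=200)
X_test = rng.uniform(0.0, 1.0, size=(20, 3))

tree_height = 5
random_state = 123
num_trees = 100

forest = RandomForestRegressor(
    n_estimators=num_trees, max_depth=tree_height, random_state=random_state
)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# The fitted trees live in forest.estimators_; pick one at random to inspect
tree_index = int(rng.integers(num_trees))
single_tree = forest.estimators_[tree_index]
print(tree_index, single_tree.get_depth())
```

Each element of `estimators_` is a fitted `DecisionTreeRegressor`, so it can be visualised in the same way as a single decision tree.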

Fill in the cell below:

Visualise a randomly selected fitted tree below:

### Exercise IV.2

Can you think of a way to identify which features were more relevant for the random forest's predictions?
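
One candidate is the impurity-based `feature_importances_` attribute that scikit-learn exposes on fitted forests (and on single decision trees). A minimal sketch on synthetic data with made-up feature names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["length", "weight", "noise"]  # hypothetical names

# The target depends strongly on 'length', weakly on 'weight', not at all on 'noise'
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 10.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=123)
forest.fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

These importances measure how much each feature reduced the training error across all splits, so they can differ from the ranking suggested by linear correlation.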

Fill in the cell below with your idea:

Note the top-3 features according to the random forest:
1. .
2. .
3. .

***

## Part V: Metric to compare methods

Which method performed best? Choose a metric that yields a single number for the performance of each method on 
the test set, i.e. compare the true prices in ```shell_targets_test``` to the predictions made from ```shell_features_test```.
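
As one example of such a metric, the root-mean-square error (RMSE) can be computed per prediction column. The table and its column names below are invented for illustration; other choices such as the mean absolute error or the R² score would work just as well.

```python
import numpy as np
import pandas as pd

# Hypothetical table of true prices and predictions from three methods
toy_targets_test = pd.DataFrame({
    "price": [10.0, 12.0, 14.0, 16.0],
    "price_ridge": [11.0, 12.0, 13.0, 16.0],
    "price_tree": [10.0, 10.0, 14.0, 18.0],
    "price_forest": [10.0, 12.0, 15.0, 16.0],
})

prediction_columns = [c for c in toy_targets_test.columns if c != "price"]

# Root-mean-square error per method (lower is better)
errors = toy_targets_test[prediction_columns].sub(toy_targets_test["price"], axis=0)
rmse = np.sqrt((errors ** 2).mean())
print(rmse)
```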

By now, you should have collected all prediction results in the following table:

In [None]:
shell_targets_test

Fill in the cell below with your metric:

***

## Part VI: Missing values revisited

In the previous parts, we have devised different methods to predict the price given the features of ear shells.
With this, we have another option for imputation of missing values: We can predict the missing prices based on the features.

We reuse ```row_indices_nans``` from Exercise I.3 to define the rows for which we want to predict a better
value than just the mean price.

In [None]:
imputation_subset =  shell_data.loc[row_indices_nans].copy()
imputation_features = imputation_subset.drop(['subset','price'], axis=1)

In [None]:
imputation_features 

Now, drop these row indices from the training and test sets, i.e. recreate ```shell_features_train```, 
```shell_targets_train```, ```shell_features_test```, ```shell_targets_test``` similar to the way
we have done above, but this time without the rows in ```imputation_features```.
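
The pattern might look as follows on a made-up frame. Here `toy_nan_rows` plays the role of `row_indices_nans` and is computed on the spot, whereas in the notebook the indices were stored before the imputation; all `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing prices in rows 1 and 4
toy = pd.DataFrame({
    "length": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "price": [10.0, np.nan, 14.0, 16.0, np.nan, 20.0],
    "subset": ["train", "train", "train", "test", "test", "test"],
})
toy_nan_rows = toy.index[toy.isna().any(axis=1)]

# Remove the rows reserved for imputation before splitting
toy_clean = toy.drop(index=toy_nan_rows)

train = toy_clean[toy_clean["subset"] == "train"]
test = toy_clean[toy_clean["subset"] == "test"]

toy_features_train = train.drop(["subset", "price"], axis=1)
toy_targets_train = train["price"].copy()
toy_features_test = test.drop(["subset", "price"], axis=1)
toy_targets_test = test["price"].to_frame()

print(toy_features_train.shape, toy_features_test.shape)
```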

Fill in the cell below:

Rerun all cells from Exercise II.1 onwards to retrain all methods, this time leaving out all 
rows of ```imputation_features```. 

Finally, predict the results for ```price``` based on ```imputation_features``` with the different methods
and see how they differ from the mean value we imputed earlier.

Fill in the cell below:

***

## Bonus Part VII: Larger trees

#### This last part is not required to complete assignment 2!

Above we used ```tree_height = 5```, i.e. at most five successive splits along any path from the root to a leaf, to construct the trees. 
In Part V, you defined a comparison metric, which should yield three numbers, one for each of the
three regression techniques. Save your previous results by copying the outputs into the cell below:

Now, rerun the cells from Part III onwards, but this time with ```tree_height = 20```.

Does increasing the number of splits, in other words making more decisions, always help predict unseen test examples accurately? And if not, why is that not the case?
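
One way to explore this question is to compare the training and test errors of a shallow and a deep tree. The sketch below does so on synthetic noisy data, which is an assumption for illustration and not the shell dataset, so the numbers it prints say nothing about the assignment's results.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical noisy data: a sine curve plus Gaussian noise
X = rng.uniform(0.0, 1.0, size=(400, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=400)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

def rmse(model, features, targets):
    """Root-mean-square error of a fitted model on the given data."""
    return float(np.sqrt(np.mean((model.predict(features) - targets) ** 2)))

# Store (training error, test error) for each maximum depth
results = {}
for depth in (5, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=123).fit(X_train, y_train)
    results[depth] = (rmse(tree, X_train, y_train), rmse(tree, X_test, y_test))
    print(depth, results[depth])
```

Comparing how the two errors change with depth shows whether the additional splits capture structure in the data or merely the noise in the training set.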