# Follow these instructions:

Once you are finished, make sure to complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.

# Assignment 6: Model Selection and Cross-validation [ __ /100 marks]


In this assignment we will examine the ["Forest Fires"](https://archive.ics.uci.edu/ml/datasets/Forest+Fires) dataset to predict the burned area of forest fires given some features. We will apply the model selection and cross-validation methods we have learned.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import make_scorer
from sklearn.base import BaseEstimator, TransformerMixin
np.set_printoptions(precision=3)
seed=0

## Question 1.0 [ _ /6 marks]

Read the file `forestfires.csv` into a dataframe. Display the first 5 rows of this dataframe. 

In [2]:
# Read forestfires.csv into a dataframe [ /1 marks] 
df = pd.read_csv('forestfires.csv')

# Display the first 5 rows of the dataframe [ /1 marks]
print(df.head())

   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0


In [6]:
# Inspect the data types of the attributes in the dataframe and answer the question in the next cell

# Number of rows in the dataframe [ /1 marks]
print("Number of rows: " + str(df.shape[0]))

# Number of null entries in the dataframe [ /1 marks]
print("Number of null entries: " + str(df.isnull().sum().sum()))

# Types of all the columns (variables) in a dataframe [ /2 marks ]
print(df.dtypes)

Number of rows: 517
Number of null entries: 0
X          int64
Y          int64
month     object
day       object
FFMC     float64
DMC      float64
DC       float64
ISI      float64
temp     float64
RH         int64
wind     float64
rain     float64
area     float64
dtype: object


 **Questions**:
 1. How many rows are there?  [ /1 marks]
 2. Does the data consist of any null entries? [ /1 marks]
 3. What categorical attributes do you see? [ /2 marks]
 
**Your answer**:
1. 517
2. No, there are zero null entries.
3. month and day are both categorical attributes.

## Question 1.1 [ _ /15 marks]

Using a threshold of statistical significance of 5%, check statistical significance for the labels of each categorical attribute. Group insignificant labels into two new statistically significant classes.

In [9]:
# Check statistical significance of the labels in month. [ /3 marks]

In [None]:
# Check statistical significance of labels in categorical attribute 2. [ /3 marks]
# ****** your code here ******


In [None]:
# Group insignificant labels into two new statistically significant labels. [ /8 marks]
# ****** your code here ******


In [None]:
# Recheck statistical significance of the attribute with adjusted labels [ /1 marks]
# ****** your code here ******


## Question 1.2 [ _ /4 marks]

Let's convert all categorical data into numerical data using `get_dummies`. Display the first 5 rows of your new dataframe.
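As a minimal sketch of what one-hot encoding does, the snippet below applies `pd.get_dummies` to a small toy frame. The rows and the variable names `toy` and `encoded` are made up for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names echo the dataset.
toy = pd.DataFrame({
    "month": ["mar", "oct", "oct", "aug"],
    "day": ["fri", "tue", "sat", "fri"],
    "temp": [8.2, 18.0, 14.6, 22.1],
})

# One-hot encode the two object columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["month", "day"])

print(encoded.columns.tolist())
print(encoded.head())
```

Each distinct label becomes its own indicator column, named `<column>_<label>`. `get_dummies` also accepts `drop_first=True`, which drops one indicator per attribute; whether that is appropriate here is left as a modelling choice.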

In [None]:
# Use "get_dummies" to perform one-hot encoding on the categorical attributes [ /3 marks]
# ****** your code here ****** 


# Display first 5 rows of the dataframe [ /1 mark]
# ****** your code here ****** 


In [None]:
# The .head() in the previous cell might give a truncated view. You can see all column names using: 
df.columns

## Question 1.3 [ _ /8 marks]

Let's examine the distribution of the target variable "area".

In [None]:
# Plot the distribution of target variable "area" with bins = 20 using an appropriate seaborn function [ /1 mark]
# ****** your code here ****** 


 **Question**:
 
 Describe the distribution of the target variable. We will apply a log transform to it; explain why that would help. [ /3 marks]

**Your answer**:

Here

In [None]:
# Use np.log to transform "area". [ /3 marks]
# Note that log is not defined everywhere, so you might need to do something about it.
# ****** your code here ****** 


# Plot the distribution of target variable "area" with bins = 20 using an appropriate seaborn function [ /1 mark]
# ****** your code here ****** 
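The note about the log not being defined everywhere matters because `area` contains zeros, as the first rows of the dataframe show. One common workaround, sketched here on hypothetical values and not necessarily the one intended by the assignment, is to shift the values by 1 before taking the log.

```python
import numpy as np

# Hypothetical burned-area values, including zeros like the rows shown earlier.
area = np.array([0.0, 0.0, 0.5, 10.0, 200.0])

# np.log(0) is -inf, so shift by 1 first: log(area + 1) maps 0 to 0.
log_area = np.log(area + 1)

# np.log1p computes the same quantity and is more accurate for tiny values.
assert np.allclose(log_area, np.log1p(area))
print(log_area)
```

The shift keeps every value finite and preserves the ordering of the observations while compressing the long right tail.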


## Question 1.4 [ _ /6 marks]

Let's use **mean squared error** as our score metric. We could use `sklearn.metrics.mean_squared_error`, but here let's write our own function called `mse`, with arguments `y` and `ypr` (predicted y), which returns the mean squared error. Recall the formula for MSE below:

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_{i} - y_{i} \right)^{2} $$
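To make the formula concrete, here is a small worked example on four made-up values, computed inline with NumPy. The arrays and the name `mse_value` are illustrative assumptions; wrapping the computation in the requested `mse` function is left to the cell below.

```python
import numpy as np

# Made-up targets and predictions (n = 4).
y = np.array([1.0, 2.0, 4.0, 3.0])
ypr = np.array([1.5, 2.0, 3.0, 3.0])

errors = ypr - y              # [0.5, 0.0, -1.0, 0.0]
squared = errors ** 2         # [0.25, 0.0, 1.0, 0.0]
mse_value = squared.mean()    # (0.25 + 0 + 1 + 0) / 4

print(mse_value)  # 0.3125
```

Because the errors are squared, the single miss of size 1 contributes four times as much as the miss of size 0.5.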

In [None]:
# Define a function that takes in y's and returns MSE [ /6 marks]
# ****** your code here ****** 
def mse(y, ypr):
    

## Question 1.5 [ _ /4 marks]

We will use all available features as predictors, and use the log-transformed "area" as the target variable. Then let's split our data into training and test sets. As usual, let's use test_size=0.2 and random_state=seed.
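For reference, this is roughly what such a split looks like on synthetic data. The arrays `X_toy` and `y_toy` are placeholders rather than the assignment's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 0  # same value as the seed defined at the top of the notebook

# Synthetic stand-in: 50 rows, 3 features.
rng = np.random.default_rng(seed)
X_toy = rng.normal(size=(50, 3))
y_toy = rng.normal(size=50)

# Hold out 20% of the rows; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=seed
)
print(X_train.shape, X_test.shape)
```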

In [None]:
# Create X and y [ /2 marks]
# ****** your code here ****** 
X = 
y = 

# Use train_test_split on X, y [ /2 marks]
# ****** your code here ****** 


## Question 1.6 [ _ / 6 marks]

For our first model, create a pipeline called "M1" that performs only a linear regression. 

In [None]:
# Create a pipeline for model 1 (M1) [ /6 marks]
# ****** your code here ****** 


## Question 1.7 [ _ / 8 marks]

For our second model let's add quadratic terms for all features (use `PolynomialFeatures`). Create a model pipeline for our second model (M2).
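The sketch below shows the general shape of a pipeline that expands the features to degree 2 before a linear regression. It runs on synthetic data, and the step names `"poly"` and `"reg"` as well as the variable `quad_model` are arbitrary choices, not names required by the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data where the target depends on a squared term.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] ** 2 - X_toy[:, 1]

quad_model = Pipeline([
    # include_bias=False because LinearRegression fits its own intercept.
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("reg", LinearRegression()),
])
quad_model.fit(X_toy, y_toy)

n_expanded = quad_model.named_steps["poly"].n_output_features_
print(n_expanded)  # 5 columns: x0, x1, x0^2, x0*x1, x1^2
```

Note that `PolynomialFeatures` adds interaction terms as well as pure squares, so the number of columns grows quickly with the number of input features.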

In [None]:
# Create a pipeline for model 2 (M2) [ / 8 marks]
# ****** your code here ****** 


## Question 1.8 [ _ / 18 marks]

`Temperature (temp)` and `Rain (rain)` may be important features, so let's extend model 1 by adding a *cubed* term for temp and a *squared* term for rain. Before creating a pipeline for this model, we need a custom transformer that adds one column for squared rain and one for cubed temp. The transformer has been initialized below, but you'll need to complete it by adding 1 or 2 lines of code. After this, create your corresponding pipeline (M3).
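As a rough illustration of the custom-transformer pattern, here is a hypothetical class, `AddPowerColumn`, that appends one powered column to a dataframe. It is not the `KeyFeatures` class from this question, and the new column's naming scheme is an arbitrary choice.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddPowerColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: append `column` raised to `power`."""

    def __init__(self, column, power):
        self.column = column
        self.power = power

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets fit and transform be chained.
        return self

    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's dataframe untouched
        X[f"{self.column}^{self.power}"] = X[self.column] ** self.power
        return X


toy = pd.DataFrame({"temp": [1.0, 2.0, 3.0], "rain": [0.0, 0.2, 0.4]})
out = AddPowerColumn("temp", 3).fit_transform(toy)
print(out)
```

The two points that carry over to any transformer are that `fit` returns `self` and that `transform` returns the modified data, since a `Pipeline` passes each step's output to the next step.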

In [None]:
# Modify the transform method of the KeyFeatures class [ /10 marks]
class KeyFeatures(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # fit must return the estimator itself so that Pipeline can chain fit and transform
        return self

    def transform(self, X, y=None):
        # ****** your code here ****** 
        
        return

# Create a pipeline for model 3 (M3) [ /8 marks]
# ****** your code here ****** 


## Question 1.9 [ _ /8 marks]

For models 1-3, use 4-fold cross-validation and report the mean and standard deviation of the loss (i.e., the `mse` function you created for Q1.4). For the cross-validation part, use `from sklearn.metrics import make_scorer` to make a scorer out of your `mse` function.
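A minimal sketch of how a custom loss can be plugged into `cross_val_score` via `make_scorer` is shown below. It uses synthetic data and a stand-in loss, `mean_abs_error`, which is a hypothetical helper and not the `mse` function from Q1.4.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def mean_abs_error(y, ypr):
    # Hypothetical stand-in loss so the sketch is self-contained.
    return np.mean(np.abs(ypr - y))


# Synthetic regression problem with a small amount of noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)

# make_scorer defaults to greater_is_better=True, so the values keep their sign.
scorer = make_scorer(mean_abs_error)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=4, scoring=scorer)
print("loss: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```

`cross_val_score` returns one value per fold, so `scores` has four entries here. Passing `greater_is_better=False` to `make_scorer` would negate them, which is worth keeping in mind when reading the reported numbers.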

In [None]:
# Use 4-fold CV on all models to get mean and std of score [ /8 marks]
# ****** your code here ****** 


print("M1 loss: %.4f +/- %.4f" % (cvsc1.mean(), cvsc1.std()))
print("M2 loss: %.4f +/- %.4f" % (cvsc2.mean(), cvsc2.std()))
print("M3 loss: %.4f +/- %.4f" % (cvsc3.mean(), cvsc3.std()))

## Question 2.0 [ _ / 3 marks]

**Question**: 

Which model would you choose and why? [ /3 marks]

**Your answer**:

Here...

## Question 2.1 [ _ /  6 marks]

Estimate the performance of your chosen model on the test data (which has been held out) using `mse`. 

In [None]:
# Compute the test loss on the unseen (test) dataset [ /6 marks]
# ****** your code here ******


print('MSE Loss on test data:',)

## Question 2.2 [ _ /8 marks]

Recap: The central limit theorem (CLT) states that if you have a population with mean $\mu$ and standard deviation $\sigma$ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually greater than 30).

Compute (and print) a 95% confidence interval for the average test error using the Central Limit Theorem. You can use the following formula to compute it: 

$$ \bar{L}_n \pm 1.96 \cdot \frac{\sigma_{l}}{\sqrt{n}} $$

Here $\bar{L}_n$ is the average test loss (i.e. over our test set), $\sigma_l$ is the standard deviation of the individual test losses, and $n$ is the total number of test losses we compute.
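The interval can be sketched numerically as follows, treating each test point's squared error as one loss. The targets and predictions are synthetic, the name `ci_sketch` is made up, and the use of the sample standard deviation (`ddof=1`) is an assumption; the population version (`ddof=0`) gives nearly the same result for this many points.

```python
import numpy as np

# Hypothetical targets and predictions for a test set of 40 points.
rng = np.random.default_rng(0)
y_true = rng.normal(size=40)
y_pred = y_true + rng.normal(scale=0.5, size=40)

losses = (y_pred - y_true) ** 2    # one squared error per test point
n = losses.size
mean_loss = losses.mean()          # average test loss
sigma_l = losses.std(ddof=1)       # assumption: sample standard deviation
half_width = 1.96 * sigma_l / np.sqrt(n)

ci_sketch = (mean_loss - half_width, mean_loss + half_width)
print(ci_sketch)
```

The half-width shrinks in proportion to $1/\sqrt{n}$, so quadrupling the number of test points roughly halves the width of the interval.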

In [None]:
# Test loss here is a point estimate (statistic) for the generalization error
# Having >30 samples, we can use the formula above safely
# Here we compute a confidence interval for the generalization error (i.e. the expected [average] test loss for this particular dataset)

# Calculate the 95% Confidence Interval for average test loss [ /8 marks]
# ****** your code here ****** 


print('Confidence Interval is:', ci)
