# Midterm

## Student ID: XXXXXXXXX (XX / 100)

## General comments

This midterm integrates knowledge and skills acquired in the first half of the semester. You are allowed to use any document and source on your computer and to look up documents on the internet. **You are NOT allowed to share documents or to communicate in any other way with people inside or outside the class during the midterm.** To finish the midterm in the allotted 2 hours, you will have to work efficiently. **Read the entirety of each question carefully.**

You need to submit the midterm by the due date (18:30) on OWL in the Test and Quizzes section where you downloaded the data set and notebook. Late submissions will be scored with 0 pts unless you have received special accommodations. To avoid technical difficulties, start your submission no later than five to ten minutes before the deadline.

Most questions demand a **written answer**. Answer these in full English sentences.

For your figures, ensure that all axes are labeled in an informative way.

Ensure that your code runs correctly by choosing "Kernel -> Restart and Run All" before submitting. 

### Additional Guidance

If at any point you are asking yourself "are we supposed to...", *write your assumptions clearly in your exam and proceed according to those assumptions.*

Good luck!

In [None]:
## Preliminaries
# Sets up the environment by importing 
# pandas, numpy, matplotlib, seaborn, sklearn, scipy.

### YOU MAY ADD ADDITIONAL IMPORTS IF YOU WISH

import matplotlib
import matplotlib.pyplot as plt 
%matplotlib inline
import pandas as pd 
import numpy as np
import seaborn as sns
import sklearn as sk 
import scipy 

# Get individual functions from sklearn and scipy
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression, LogisticRegressionCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler, FunctionTransformer, PolynomialFeatures
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score, mean_squared_error
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it.
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve, auc, RocCurveDisplay, mean_absolute_percentage_error, roc_auc_score
from scipy.stats import norm

# Plot here directly
%matplotlib inline

In [None]:
# Download the data (if not uploading directly)

#!gdown https://drive.google.com/uc?id=1PUcM6kytvlnOLTSEEpxDOicQzlsecKD_

## Data - Wisconsin Breast Cancer

In this midterm, we will be studying how regression models can support medical practice, in particular the diagnosis of breast cancer.

Each record represents follow-up data for one breast cancer case.  These are consecutive patients seen by Dr. William H. Wolberg, member of the General Surgery Dept. at the University of Wisconsin since 1984, and include only those cases exhibiting invasive breast cancer. 

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image.

The variables in the dataset are the following:

1. ID number (Non predictive).
2. Outcome (R = recur, N = nonrecur, TARGET)
3. Time (recurrence time if field 2 = R, disease-free time if field 2 = N, TARGET)

All other variables are **predictive**:

4. Tumor size - diameter of the excised tumor in centimeters

Variables 5-34 represent ten real-valued features that are computed for each cell nucleus, representing:  
  a. radius (mean of distances from center to points on the perimeter)  
  b. texture (standard deviation of gray-scale values)  
  c. perimeter  
  d. area  
  e. smoothness (local variation in radius lengths)  
  f. compactness (perimeter^2 / area - 1.0)  
  g. concavity (severity of concave portions of the contour)  
  h. concave points (number of concave portions of the contour)  
  i. symmetry  
  j. fractal dimension ("coastline approximation" - 1)  

For each feature, the mean, the SD, and the average of the worst three values are calculated. Variables 5-14 hold the mean values, 15-24 the SD, and 25-34 the worst-three average.

We will study how the image-related features (4-34) relate to both the recurrence time and the chances of recurrence for this dataset.
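
The variable numbering above is 1-based, while pandas positions are 0-based, so variable k sits at position k - 1. The following is a minimal sketch of how the three feature groups could be selected by position. The column names and the random values are hypothetical and only stand in for the real file, whose names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real file may use different ones.
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave_points", "symmetry",
            "fractal_dimension"]
columns = (["ID", "Outcome", "Time", "Tumor_size"]
           + [f"{f}_mean" for f in features]
           + [f"{f}_sd" for f in features]
           + [f"{f}_worst" for f in features])

# Random placeholder values, only used to build a frame with 34 columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(5, len(columns))), columns=columns)

# Variable k in the 1-based list sits at 0-based position k - 1.
mean_cols = list(demo.columns[4:14])    # variables 5-14
sd_cols = list(demo.columns[14:24])     # variables 15-24
worst_cols = list(demo.columns[24:34])  # variables 25-34

print(mean_cols)
```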

## Task 1 - Regression Model and Bootstrapping (45 pts)

### Question 1.1 Data Loading (5 pts)

  a. Read the data into the dataset ```cancer_data```.  
  b. Show the first few rows of the dataset.  
  c. Print the shape (rows and columns) of the dataset.  
  d. Print the descriptive statistics of the dataset.  
  e. Print how many cases are in class R and in class N, for the ```Outcome``` target variable.
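
As a reminder of the pandas calls involved, here is a small sketch on a made-up four-row table. The values and column names are hypothetical; with a real file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical miniature file; the real data set has many more rows and columns.
csv_text = """ID,Outcome,Time,Tumor_size
1,N,31,5.0
2,N,61,3.0
3,R,27,2.5
4,N,77,2.0
"""

demo = pd.read_csv(io.StringIO(csv_text))  # pass a file path for a real file

print(demo.head())       # first rows
print(demo.shape)        # (rows, columns)
print(demo.describe())   # descriptive statistics of the numeric columns

outcome_counts = demo["Outcome"].value_counts()
print(outcome_counts)    # number of cases per class
```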

In [None]:
# Read the data. (1 point)


In [None]:
# Print the first few rows (1 point)


In [None]:
# Shape of the dataset (1 point)


In [None]:
# Descriptive statistics of the dataset (1 point)


In [None]:
# Outcome descriptives (1 point)


### Question 1.2 Plotting (10 points)

a. Plot the histograms and kernel densities of the ```Time``` variable (distribution), so that both the histogram and kernel density appear in the same plot. Make two plots, one for recurring cases and one for non-recurring cases. Title your plots accordingly. (5 pts)

b. For the mean variables (5-14), create a [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) showing the joint scatterplots of those variables, setting the ```hue``` parameter to the ```Outcome``` variable.  (2 pts)

c. Discuss what you see in the plots for parts a and b. What shape are the distributions of the ```Time``` variable? Among the predictive variables, which ones are strongly correlated? Why do you think this is? (3 pts)

In [None]:
# Plot histogram and KDE - Recurring (2 points)


In [None]:
# Plot histogram and KDE - Non-Recurring (2 points)


In [None]:
# Pairplot (2 points)


**Written answer (4 pts):**

### Question 1.3 Elastic Net Regression (15 points)

Now we will study the Time distribution for non-recurring cases.

1. Create a train / test split (70/30) for the dataset, selecting **only the Non-Recurring cases** (```Outcome == 'N'```), and use ```random_state=0```. (2 pts)
2. Create a pipeline that standardizes the predictive variables (tumor-related features) and runs a regression with ElasticNet regularization, using the ```Time``` variable as the target. Tune the parameters using three-fold cross-validation, 100 alpha values, and an adequate selection of values for the ```l1_ratio``` parameter. Train until convergence and report the best ```l1_ratio``` and alpha parameters. (10 pts)
3. Which coefficients are significant? Which ones are not? Print a table with the coefficient names and their values for the optimal regression from step 2. (3 pts)
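
The general pattern of a standardize-then-ElasticNet pipeline can be sketched as follows. This is an illustration on synthetic data from `make_regression`, not on the exam data; the `l1_grid` values and feature names are assumptions chosen for the example. `ElasticNetCV` evaluates 100 alpha values per `l1_ratio` by default, so that setting is left at its default here.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 150 rows, 10 features.
X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

l1_grid = [0.1, 0.5, 0.7, 0.9, 0.95, 1.0]  # illustrative choice of mixing values
enet_pipe = Pipeline([
    ("scaler", StandardScaler()),
    # 100 alphas per l1_ratio is the default; cv=3 gives three folds.
    ("enet", ElasticNetCV(l1_ratio=l1_grid, cv=3, max_iter=100_000)),
])
enet_pipe.fit(X_train, y_train)

best = enet_pipe.named_steps["enet"]
print("best l1_ratio:", best.l1_ratio_)
print("best alpha:", best.alpha_)

# Coefficients shrunk exactly to zero were dropped by the L1 part of the penalty.
coef_table = pd.DataFrame({"feature": feature_names, "coefficient": best.coef_})
print(coef_table)
```

Because the features are standardized inside the pipeline, the coefficient magnitudes in the table are comparable across features.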

In [None]:
# Train test split (2 pts)


In [None]:
# Pipeline (8 pts)


In [None]:
# Train and show best parameters (2 pts)


In [None]:
# Coefficient table (2 pts)


**Written answer (1 pt):** 

### Question 1.4 Bootstrap of the test set (15 points)

Now we will study the fit of this model.

a. Apply the model to the test set and calculate the Mean Absolute Percentage Error (MAPE) and the R-squared (R2) measures over the full test set.  
b. Obtain a 95% confidence interval for the R2 and MAPE measures using bootstrap, **without refitting the model**. Run 1000 bootstrap samples. Plot the distribution of values you obtained.  
c. Written answer: What do you think of the fit? Interpret these measures. Could you have used, e.g., a t-test instead of a bootstrap to obtain these confidence intervals? Why or why not?  
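
A bootstrap that does not refit the model only resamples pairs of (true value, prediction) from the test set and recomputes the measure on each resample. The sketch below shows that idea with a percentile interval. The `y_true` and `y_pred` arrays are synthetic placeholders, and `bootstrap_metrics` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

# Synthetic stand-ins (assumption) for test targets and fixed model predictions.
rng = np.random.default_rng(0)
y_true = rng.uniform(10, 100, size=60)
y_pred = y_true + rng.normal(scale=8.0, size=60)


def bootstrap_metrics(y_true, y_pred, n_boot=1000, seed=0):
    """Resample (y_true, y_pred) pairs with replacement; the model is never refit."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    r2 = np.empty(n_boot)
    mape = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        r2[b] = r2_score(y_true[idx], y_pred[idx])
        mape[b] = mean_absolute_percentage_error(y_true[idx], y_pred[idx])
    return r2, mape


r2_samples, mape_samples = bootstrap_metrics(y_true, y_pred)

# Percentile method: the 2.5th and 97.5th percentiles bound a 95% interval.
r2_ci = np.percentile(r2_samples, [2.5, 97.5])
mape_ci = np.percentile(mape_samples, [2.5, 97.5])
print("R2 95% CI:", r2_ci)
print("MAPE 95% CI:", mape_ci)
```

The arrays `r2_samples` and `mape_samples` are what a histogram of the bootstrap distribution would be drawn from.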

In [None]:
# Calculate measures. (2 points)


In [None]:
# Bootstrap (5 points)

# Write a Bootstrap function that records the R2 and MAPE measures for each resample of the test set


# Run the bootstrap


In [None]:
# Calculate CIs (3 pts)


In [None]:
# Distribution plots (3 points)


**Written answer (2 pts):**

## Task 2 - Classification Model and Crossvalidation (55 points)

Now we will work on the classification problem. For this, we will train a Logistic Regression with Ridge penalization, fitting the regularization parameter (`C`) manually using a cross-validation scorer. We will also explore whether quadratic polynomials can help with this process.

### Question 2.1 - Base Model (10 points)

a. Create a **stratified** train / test split so that the target variable is now the binary variable Outcome. Use all predictive variables available (without Time or ID).  
b. Train an unregularized logistic regression over the unstandardized data.  
c. Calculate the AUC score and plot the ROC curve for the test set. Use the Recurrence ('R') label as the positive value.  
d. Written question: Why does the ROC curve have that shape?  
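
When the labels are strings, the positive class has to be stated explicitly when building the ROC curve. The sketch below illustrates this with synthetic labels and scores; the `scores` array is an assumption standing in for the predicted probability of class 'R'. With a fitted scikit-learn classifier, that probability is the `predict_proba` column at position `list(model.classes_).index("R")`.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Synthetic labels and scores (assumption): 30 'N' cases and 10 'R' cases.
rng = np.random.default_rng(0)
y_true = np.array(["N"] * 30 + ["R"] * 10)
scores = np.where(y_true == "R",
                  rng.normal(0.6, 0.2, 40),   # scores for recurring cases
                  rng.normal(0.4, 0.2, 40))   # scores for non-recurring cases

# pos_label tells roc_curve which string label counts as the positive class.
fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label="R")
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)
```

Plotting `tpr` against `fpr` gives the ROC curve, and the diagonal from (0, 0) to (1, 1) is the reference for a classifier with no discriminative power.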


In [None]:
# Train test split (1 pt)


In [None]:
# Unregularized logistic regression (3pts)


In [None]:
# Calculate AUC and ROC curve (3pts)


**Written answer (3pts):**

### Question 2.2 Regularizing the Model (15 points)

Now explore how regularization impacts the model. 

a. Find the best C parameter by fitting Ridge Logistic Regressions using the training data (do you need to use a pipeline?), creating a **stratified** Crossvalidation with three folds that uses the AUC score as the deciding measure. Explore 50 values of C between $10^{-6}$ and $10^{-2}$ (change the exponent, not the base) and plot the average AUC for each value of $\log_{10}(C)$.  

b. Select a C value and train a model over the full training dataset. Calculate the AUC score and plot the ROC curve of your regularized model over the test set.  

c. Written question: How does your model perform compared to the model in the previous question? Compare performance vs model complexity.  
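
One way to search over `C` by hand is to loop over a log-spaced grid and score each candidate with stratified cross-validation. The following sketch assumes synthetic data from `make_classification` and wraps the model in a standardizing pipeline, which is one possible design choice rather than a required one. `LogisticRegression` uses the L2 (ridge) penalty by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 50 values between 10^-6 and 10^-2, evenly spaced in the exponent.
Cs = np.logspace(-6, -2, 50)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

mean_auc = []
for C in Cs:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("logit", LogisticRegression(C=C, max_iter=1000)),  # default penalty is L2
    ])
    fold_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    mean_auc.append(fold_scores.mean())

mean_auc = np.array(mean_auc)
best_C = Cs[int(np.argmax(mean_auc))]
print("best C:", best_C, "log10(C):", np.log10(best_C))
```

Plotting `mean_auc` against `np.log10(Cs)` shows how the cross-validated AUC changes with the regularization strength. Smaller values of `C` mean stronger regularization.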

In [None]:
# Train the regularized logistic regression. (8 pts)


In [None]:
# Plot Cs (3 pts)


In [None]:
# Refit best estimator (1 pt)

In [None]:
# Calculate AUC and ROC curve (1 pt)


**Written answer (2 pts):**

### Question 2.3 - Testing Polynomial Features (20 points)

The final model we will test will focus on studying the impact of polynomial features. As there are so many variables, we will only create these for the 'mean' variables. For this:

a. Create a pipeline that first creates quadratic polynomial variables **only for the mean variables**, then standardizes all variables and finally applies a regularized logistic regression with **Elastic Net** penalty.  *Hint: Remember the example to [standardize only some variables](https://stackoverflow.com/questions/37685412/avoid-scaling-binary-columns-in-sci-kit-learn-standsardscaler) and adapt it to apply a polynomial transformation to only some variables.*

b. Explore the `l1_ratio` parameter in the set $\{0.01, 0.1, 0.2, 0.7, 0.75, 0.8, 0.9, 0.95\}$  and the C values from the previous question. Obtain the best `l1_ratio` and C parameters for your pipeline using AUC as your scorer. Print the best `l1_ratio` and the best C parameter and remember to use the model fitted over the full training set.  

c. Calculate the AUC over the test set and plot the ROC curve.

d. Print the coefficients for the polynomial features. **Written answer: Are the polynomial features used?** *Hint: The FeatureUnion operator matches variables in the same order as they were given.*
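
Applying a transformation to only some columns can also be done with scikit-learn's `ColumnTransformer`, which is swapped in here as an alternative to the `FeatureUnion` approach the hints refer to. The sketch below runs on synthetic data; the column positions in `poly_cols` and the small parameter grid are assumptions chosen to keep the example fast. `ColumnTransformer` places the transformed columns first and the passthrough columns after them.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 8 features, the first 3 get expanded.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

poly_cols = [0, 1, 2]  # hypothetical positions of the columns to expand

expand = ColumnTransformer(
    [("poly", PolynomialFeatures(degree=2, include_bias=False), poly_cols)],
    remainder="passthrough",  # all other columns are kept unchanged
)
pipe = Pipeline([
    ("expand", expand),
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=0.5, max_iter=5000)),
])

# Small illustrative grid, not the grid requested in the question.
param_grid = {"logit__C": [0.01, 0.1, 1.0], "logit__l1_ratio": [0.1, 0.5, 0.9]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)  # refit=True refits the best model on all training data
print(search.best_params_)

best_pipe = search.best_estimator_
poly_step = best_pipe.named_steps["expand"].named_transformers_["poly"]
n_poly = poly_step.n_output_features_
coefs = best_pipe.named_steps["logit"].coef_.ravel()
poly_coefs = coefs[:n_poly]  # polynomial block comes first in the output
print(poly_coefs)
```

With three input columns and degree 2, the polynomial block has nine columns: the three original terms, three squares, and three pairwise interactions.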

In [None]:
# Create pipeline (10 pts)


In [None]:
# Create the CV and the scorer. (2 pts)


In [None]:
# Print best outputs (1 pt)


In [None]:
# Plot ROC curve (2 pts)


In [None]:
# Get coefficients for the polynomial features. (2 pts)


**Written Answer (3 pts):**

### Question 2.4 - A better comparison (10 points)

So far all three models have given similar AUC scores. Now we will study whether these differences are significant. 

a. **Without refitting the models,** obtain a confidence interval for the test AUC measure of the three models at a 95% confidence level using 1000 bootstrap runs.    
b. Written question: Which model would you suggest using?
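
For AUC, a bootstrap resample is only usable if it contains both classes, so a guard is needed before scoring. The sketch below assumes synthetic test labels and two synthetic score vectors standing in for the fixed predictions of two already-fitted models; `bootstrap_auc` is a hypothetical helper name. Using the same seed for every model means all models are scored on the same resamples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic test labels and model scores (assumption): 45 negatives, 15 positives.
rng = np.random.default_rng(0)
y_test = np.array([0] * 45 + [1] * 15)
scores_by_model = {
    "model_a": 0.30 * y_test + rng.normal(0.4, 0.2, 60),
    "model_b": 0.25 * y_test + rng.normal(0.4, 0.2, 60),
}


def bootstrap_auc(y_true, scores, n_boot=1000, seed=0):
    """Bootstrap the test AUC from fixed predictions; no model is refit."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when only one class was drawn
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return np.array(aucs)


auc_cis = {}
for name, scores in scores_by_model.items():
    samples = bootstrap_auc(y_test, scores)  # same seed gives the same resamples
    auc_cis[name] = np.percentile(samples, [2.5, 97.5])
    print(name, "95% CI:", auc_cis[name])
```

Overlapping intervals suggest that the difference in test AUC between the models may not be distinguishable from sampling noise.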

In [None]:
# Create and train bootstrap (7 points)


In [None]:
# Calculate CIs (2 pts)


**Written answer (1 pt):**