The purpose of this lab is to practice training and evaluating different types of regression and classification models.

In part 1, you will practice regression on the Auto MPG dataset from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/9/auto+mpg). The goal is to learn a regression model that predicts a car's miles per gallon from its displacement, cylinders, horsepower, weight, and acceleration.

In part 2, you will practice binary classification on a breast cancer diagnostic dataset. The measurements in the dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. Each image is diagnosed as either benign or malignant, and your goal is to learn a binary classifier that predicts which of the two it is.

**Important Note**: Normally I would strongly discourage doing machine learning without first doing exploratory analysis to thoughtfully select the appropriate features to include in the model, as well as to do any necessary feature engineering. However, the goal of this lab is to practice with the different regression and classification models that we have learned about in class and to practice with the different evaluation metrics, so you will not be doing any EDA or feature selection or feature engineering in this lab. You will just use the data and features that are provided. But, be warned that on your final and in your future careers as expert data scientists, my expectation is that you are thoughtful about what features you include in your machine learning models. Don't let me down!

In [1]:
#you need the ucimlrepo package to get the miles per gallon dataset, which you can install in google colab using the following command
!pip install ucimlrepo



In [2]:
#imports
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    r2_score, root_mean_squared_error, mean_squared_error, mean_absolute_error,
    classification_report, confusion_matrix, accuracy_score, log_loss,
    roc_curve, roc_auc_score, RocCurveDisplay,
    precision_recall_curve, PrecisionRecallDisplay,
)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibrationDisplay, calibration_curve

## **Part 1: Regression on the MPG dataset**

In [3]:
#loading the dataset
auto_mpg = fetch_ucirepo(id=9)
X = auto_mpg.data.features
Y = auto_mpg.data.targets

#.copy() makes df an independent dataframe, so adding a column does not trigger a SettingWithCopyWarning
df = X[['displacement', 'cylinders', 'horsepower', 'weight', 'acceleration']].copy()
df['mpg'] = Y.values

#drop the rows with missing values (horsepower has a few)
df = df.dropna()


The mpg data with all the columns you need is now loaded into the pandas dataframe `df`. Remember that it is very important in machine learning to test your models on a holdout test dataset, so your first task is to perform an 80/20 train/test split on the dataset.
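As a reminder of the mechanics, here is a minimal sketch of an 80/20 split with `train_test_split`. The dataframe `toy_df`, its column names, and `random_state=42` are made up for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
toy_df["target"] = (
    2 * toy_df["feature_a"] - toy_df["feature_b"] + rng.normal(scale=0.1, size=100)
)

# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 80 20
```

Fixing `random_state` is optional, but it means you get the same train and test rows every time you rerun the notebook.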

In [4]:
#your train/test split code here


### **Part 1a: Linear Regression**

In this part of the lab, use your train dataset to learn a linear regression model to predict the mpg column using the displacement, cylinders, horsepower, weight, and acceleration columns. Then evaluate your trained linear regression model by doing the following:

1. Create a scatter plot of your predicted miles per gallon on the x axis and the actual miles per gallon on the y axis. Do this for both the train set and the test set, so create two scatter plots. Add a line to your plots with slope = 1 and intercept = 0. That line marks where the predicted value equals the actual value (a perfect prediction), so it will help you see how close your predictions are to the actual values. Label your x axis "Predicted MPG", and label your y axis "Actual MPG". Add a title to your plot to signify if it is for the test set or the train set.

2. Calculate the mean absolute error (MAE) of your predictions on both the train set and test set

3. Calculate the root mean squared error (RMSE) of your predictions on both the train set and test set

4. Calculate the $r^2$ score of your predictions on both the train set and test set

After calculating the $r^2$ scores, interpret them. For example, if you find that the $r^2$ in your test set is 0.55, what does that mean? Type it out.
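To make the metric calls concrete, here is a small sketch on four made-up numbers (not from the mpg data). The arrays `y_actual` and `y_predicted` are hypothetical. RMSE is computed as the square root of `mean_squared_error`, which should also work on scikit-learn versions that predate `root_mean_squared_error`.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_actual, y_predicted)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))   # sqrt of mean squared error
r2 = r2_score(y_actual, y_predicted)                        # 1 - SS_residual / SS_total
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, r^2: {r2:.3f}")
# MAE: 1.000, RMSE: 1.225, r^2: 0.700

# Predicted vs. actual scatter plot with the slope = 1, intercept = 0 reference line
fig, ax = plt.subplots()
ax.scatter(y_predicted, y_actual)
lims = [min(y_actual.min(), y_predicted.min()), max(y_actual.max(), y_predicted.max())]
ax.plot(lims, lims, color="red")  # points on this line are perfect predictions
ax.set_xlabel("Predicted MPG")
ax.set_ylabel("Actual MPG")
ax.set_title("Toy example")
```

Note the argument order: the sklearn metric functions take the actual values first and the predictions second.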

In [5]:
#train your linear regression model here


In [6]:
#create your train and test set scatter plots


In [7]:
#calculate the train set MAE and test set MAE. Print both


In [8]:
#calculate the train set RMSE and test set RMSE. Print both


In [9]:
#calculate the train set r squared and test set r squared. Print both


**Interpret your $r^2$ scores here:**



### **Part 1b: Random Forest**

Repeat everything you did in part 1a, only this time fit a random forest regression model.
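For reference, fitting a random forest regressor follows the same fit/predict pattern as linear regression. The sketch below uses synthetic data from `make_regression` rather than the mpg data, and `n_estimators=200` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration, not the mpg data)
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# n_estimators is the number of trees in the forest
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Compare performance on the data the model saw vs. the held-out data
train_r2 = r2_score(y_train, forest.predict(X_train))
test_r2 = r2_score(y_test, forest.predict(X_test))
print(f"train r^2: {train_r2:.3f}, test r^2: {test_r2:.3f}")
```

Keep an eye on the gap between the train and test scores; you will need it for part 1d.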

In [10]:
#train your random forest regression model here


In [11]:
#create your train and test set scatter plots


In [12]:
#calculate the train set MAE and test set MAE. Print both


In [13]:
#calculate the train set RMSE and test set RMSE. Print both


In [14]:
#calculate the train set r squared and test set r squared. Print both


**Interpret your $r^2$ scores here:**


### **Part 1c: Boosting**

Redo everything you did in part 1a, but this time use a boosting model of your choice (gradient boosting, XGBoost, LightGBM, etc.).
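If you go with sklearn's gradient boosting, a minimal sketch looks like the following. The data is synthetic, and the hyperparameter values (`n_estimators`, `learning_rate`, `max_depth`) are illustrative assumptions, not recommended settings for the mpg data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration, not the mpg data)
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Each new tree is fit to the errors of the trees before it;
# learning_rate shrinks the contribution of each tree
booster = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
booster.fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, booster.predict(X_train))
test_mae = mean_absolute_error(y_test, booster.predict(X_test))
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```

The scikit-learn style wrappers in XGBoost (`XGBRegressor`) and LightGBM (`LGBMRegressor`) follow the same fit/predict pattern, so the evaluation code should carry over if you pick one of those instead.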

In [15]:
#train your boosting regression model here


In [16]:
#create your train and test set scatter plots


In [18]:
#calculate the train set MAE and test set MAE. Print both


In [19]:
#calculate the train set RMSE and test set RMSE. Print both


In [20]:
#calculate the train set r squared and test set r squared. Print both


**Interpret your $r^2$ scores here:**


### **Part 1d**

Now that you have fit and evaluated three different models, answer these questions.

1. Which regression model would you choose to predict automobile miles per gallon? Justify your answer.

2. Is there evidence of overfitting in any of these models? If so, which ones and what is the evidence?

**Answer question 1 here:**

**Answer question 2 here:**


## **Part 2: Classification on the breast cancer dataset**

Pay closer attention to the instructions in each subpart of part 2. Classifiers have many different evaluation metrics, and I don't ask you to calculate all of them in every subpart, so you can't blindly copy and paste your code from prior subparts this time.

In [21]:
#loading the dataset
bc = datasets.load_breast_cancer(as_frame = True)
X = bc['data']
Y = bc['target']
#.copy() makes df an independent dataframe, so adding a column does not trigger a SettingWithCopyWarning
df = X[['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension']].copy()
df.columns = ['radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'conc_points', 'symmetry', 'fractal_dim']
#sklearn encodes the target as 0 = malignant and 1 = benign, so flip it to get 1 = malignant
df['is_malignant'] = 1 - Y


The breast cancer data with all the columns you need is now loaded into the pandas dataframe `df`. Remember that it is very important in machine learning to test your models on a holdout test dataset, so your first task is to perform an 80/20 train/test split on the dataset.
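For classification you may also want to consider a stratified split. The sketch below is one way to do it on the full sklearn breast cancer data; `stratify` is an optional choice that this lab does not require, and the variable names and `random_state=0` are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

bc_toy = datasets.load_breast_cancer(as_frame=True)
features, labels = bc_toy["data"], bc_toy["target"]

# stratify=labels keeps the class proportions nearly equal in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
print(f"positive rate, train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```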

In [22]:
#perform the 80/20 train test split here


### **Part 2a: Logistic Regression**

In this part of the lab, use your train dataset to learn a logistic regression model to predict is_malignant using the other ten columns as predictor variables. Then evaluate your trained logistic regression model by doing the following:

1. Calculate the precision, recall, and f1 score on both the train and test set (**Hint**: print the results of sklearn.metrics.classification_report)

2. Plot a calibration curve for both the train and test set. Use sklearn's functions calibration_curve and CalibrationDisplay to do this, and use strategy = 'quantile' in the calibration_curve function.

3. Calculate the log loss (cross entropy loss) on both the train and test set

4. Calculate the accuracy on both the train and test set

5. Create a confusion matrix on both the train and test set. Use the confusion matrix to calculate specificity (true negatives / (true negatives + false positives)) on both the train and test set.

In [23]:
# train your logistic regression model here


In [24]:
# calculate the precision, recall, and f1 on both the train and test set


In [25]:
# plot a calibration curve for both the train and test set


In [26]:
#calculate the log loss on the train and test set


In [27]:
#calculate the accuracy on the train and test set


In [28]:
#create a confusion matrix for both the train and test set.


**What is the specificity on the train and test set? Answer here using your confusion matrices:**


### **Part 2b: Random Forest**

Train a random forest classifier on the breast cancer data to predict is_malignant. Then, repeat steps 1-4 from part 2a. Finally, plot a ROC curve for both the train and test set and report the area under each curve. To make the plots, you may want to use the roc_curve and RocCurveDisplay functions from sklearn.metrics.
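Here is a minimal sketch of the ROC step on synthetic data (not the breast cancer data), shown for a test set only. The variable names and `n_estimators=200` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# ROC curves are built from predicted probabilities, not predicted classes
proba_test = forest.predict_proba(X_test)[:, 1]

# false positive rate and true positive rate at each probability threshold
fpr, tpr, thresholds = roc_curve(y_test, proba_test)
test_auc = roc_auc_score(y_test, proba_test)

RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=test_auc).plot()
print(f"test AUC: {test_auc:.3f}")
```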

In [29]:
# train your random forest model here


In [30]:
# calculate the precision, recall, and f1 on both the train and test set


In [31]:
# plot a calibration curve for both the train and test set


In [32]:
#calculate the log loss on the train and test set


In [33]:
#calculate the accuracy on the train and test set


In [34]:
#plot ROC curves for both the train and test set


In [35]:
#report the area under the train and test roc curves


### **Part 2c: Boosting**

Train your choice of boosting model (gradient boosting, XGBoost, LightGBM, etc.) on the breast cancer dataset to predict is_malignant. Repeat steps 1-4 from part 2a. Finally, plot a precision-recall curve for both the train and test set. You may want to use the precision_recall_curve and PrecisionRecallDisplay functions from sklearn.metrics to do this.
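A minimal sketch of the precision-recall step, again on synthetic data and for a test set only. It assumes sklearn's `GradientBoostingClassifier` with default hyperparameters; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import PrecisionRecallDisplay, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

booster = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba_test = booster.predict_proba(X_test)[:, 1]

# precision and recall at each probability threshold;
# both arrays have one more entry than thresholds
precision, recall, thresholds = precision_recall_curve(y_test, proba_test)

PrecisionRecallDisplay(precision=precision, recall=recall).plot()
```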

In [36]:
# train your boosting model here


In [37]:
# calculate the precision, recall, and f1 on both the train and test set


In [38]:
# plot a calibration curve for both the train and test set


In [39]:
#calculate the log loss on the train and test set


In [40]:
#calculate the accuracy on the train and test set


In [41]:
#plot precision-recall curves for the train and test sets


### **Part 2d**

Now that you have fit and evaluated three different models, answer these questions.

1. Which binary classification model would you choose to predict breast cancer? Justify your answer.

2. Is there evidence of overfitting in any of these models? If so, which ones and what is the evidence?

**Answer question 1 here:**

**Answer question 2 here:**

**Great Work!** I just wanted to leave you all this final note on classifiers. In job interviews you may get asked about confusion matrices and ROC curves and precision and recall and all that stuff, so it's worth knowing about. However, in practice (once you pass the job interview and get the job) you rarely need to use all of them. I (Will Melville) usually just look at log loss and calibration curves and that is it. Of course, in baseball we primarily care about the predicted probabilities rather than the predicted classes. For example, batting averages and on base percentages and lots of other baseball statistics are just probabilities. Baseball people like to think in terms of probabilities. Thus, I gravitate towards the metrics (log loss and calibration curves) that test if my model is outputting good probabilities. There may be other industries where they care more about the predicted classes than the predicted probabilities, in which case you may gravitate more towards things like precision, recall, and F1 and away from log loss and calibration curves. Have data science feel and choose the right evaluation metrics for the job!
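To see the difference between class-based and probability-based metrics, here is a tiny made-up example. The two sets of "model" probabilities are invented for illustration. Both give identical predicted classes at a 0.5 threshold, so accuracy cannot tell them apart, but log loss can.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Made-up labels and predicted probabilities for two hypothetical models
y_true = np.array([1, 0, 1, 0])
proba_confident = np.array([0.90, 0.10, 0.80, 0.20])
proba_hesitant = np.array([0.60, 0.40, 0.55, 0.45])

# Thresholding at 0.5 gives the same predicted classes for both models
acc_confident = accuracy_score(y_true, proba_confident >= 0.5)
acc_hesitant = accuracy_score(y_true, proba_hesitant >= 0.5)

# Log loss rewards probabilities that sit close to the true labels
loss_confident = log_loss(y_true, proba_confident)
loss_hesitant = log_loss(y_true, proba_hesitant)

print(f"accuracy: {acc_confident:.2f} vs {acc_hesitant:.2f}")
print(f"log loss: {loss_confident:.3f} vs {loss_hesitant:.3f}")
```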