# SI 618: Introduction to Machine Learning

Version 2023.02.21.1.CT

We suggest you use extra markdown blocks or code comments to record your notes.

In [127]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import statsmodels.formula.api as smf

Seaborn (like several other packages) provides example datasets. Let's load Fisher's famous Iris dataset:

In [128]:
iris = sns.load_dataset('iris')

In [129]:
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### Exercise 1:
Create a 2-d scatterplot of petal_width (on the y-axis) vs. petal_length (on the x-axis) that includes a regression line.

In [130]:
# insert your code here

### Exercise 2:
Create a regression model of petal_width as the outcome variable and petal_length as the explanatory variable.  You might find the notebook on correlation and regression to be helpful here.

In [64]:
# insert your code here

## Introduction to scikit-learn

Recall the general process for using a scikit-learn estimator:
1. choose appropriate class that implements what you want to do and import it
1. choose model hyperparameters (or accept default ones, but be careful) and instantiate class
1. arrange data into features and labels
1. .fit() your model to the data
1. apply model to new data with .predict() for supervised learning
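
The five steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the solution to the exercises below: the names `X_demo`, `y_demo`, and `model`, and the line y = 3x + 2, are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # step 1: import the class

# Step 2: instantiate the class, accepting the default hyperparameters
model = LinearRegression()

# Step 3: arrange data into features and labels.
# Synthetic data (an assumption for illustration): y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))  # features must be 2-D: (n_samples, n_features)
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 0.1, size=50)  # labels can be 1-D

# Step 4: fit the model to the data
model.fit(X_demo, y_demo)

# Step 5: predict for new data (again a 2-D array, here a single row)
pred = model.predict(np.array([[4.0]]))
print(pred)  # should be close to 3 * 4 + 2 = 14
```

Note that `.predict()` expects the same 2-D layout as `.fit()`, even for a single observation.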

Let's do that with the regression model we implemented using statsmodels above:



1. choose appropriate class that implements what you want to do and import it

This takes a bit of experience to figure out, but we'll cover the common ones over the next few classes.  For now, I'll tell you that we want to use sklearn.linear_model.LinearRegression.  Import only that class into your default namespace:

### Exercise 3: write the correct line to import LinearRegression from the sklearn.linear_model module:

In [3]:
# insert your code here

### Exercise 4: choose model hyperparameters (or accept default ones, but be careful) and instantiate class
It's ok to accept the defaults this time. Let's assign the model to a variable called `lm`.

In [4]:
# insert your code here

### Exercise 5: arrange data into features and labels
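Create one DataFrame for the outcome variable, petal_width (call it `y`), and another DataFrame for the explanatory variable, petal_length (call it `X`). Selecting with double brackets, e.g. `iris[['petal_length']]`, returns a 2-D DataFrame rather than a 1-D Series.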

In [5]:
# insert your code here

### Exercise 6: .fit() your model to the data

In [6]:
# insert your code here

### Exercise 7: apply model to new data with .predict()
What's the estimated value for petal_width if the petal_length is 10?

In [8]:
# insert your code here

Great!  But what does our model actually look like?

We can always get a default measure of how good our model is by calling .score(X, y). For regressors such as LinearRegression, this returns the coefficient of determination, R^2:

In [87]:
lm.score(X.values, y)

0.9271098389904927
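
As a quick check of what `.score()` reports for a regressor, the sketch below compares it with `sklearn.metrics.r2_score` on synthetic data. The data and the names `X_demo`, `y_demo`, and `demo_lm` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (illustrative): a noisy straight line
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 7, size=(100, 1))
y_demo = 0.4 * X_demo[:, 0] - 0.35 + rng.normal(0, 0.2, size=100)

demo_lm = LinearRegression().fit(X_demo, y_demo)

score = demo_lm.score(X_demo, y_demo)            # the estimator's default score
r2 = r2_score(y_demo, demo_lm.predict(X_demo))   # R^2 computed explicitly
print(score, r2)  # the two numbers agree
```

Because this score is computed on the same data the model was fit to, it tends to be optimistic; cross-validation gives a fairer estimate.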

In the case of LinearRegression, we can access the coefficients for the equation:

In [88]:
lm.coef_

array([[0.41575542]])

and the value of the intercept:

In [89]:
lm.intercept_

array([-0.36307552])

If we've done everything right, these values should match the results we got from statsmodels!
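
The nested brackets in the `coef_` output above reflect the shape of the labels passed to `.fit()`. A small sketch on a made-up, noise-free line (y = 2x + 1, with hypothetical names `df_demo`, `fit_2d`, `fit_1d`) shows the difference and how the two attributes combine into a prediction:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1
df_demo = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
df_demo["y"] = 2 * df_demo["x"] + 1

fit_2d = LinearRegression().fit(df_demo[["x"]], df_demo[["y"]])  # y as a DataFrame (2-D)
fit_1d = LinearRegression().fit(df_demo[["x"]], df_demo["y"])    # y as a Series (1-D)

print(fit_2d.coef_.shape, fit_2d.intercept_.shape)  # (1, 1) (1,)
print(fit_1d.coef_.shape)                           # (1,)

# A prediction is just intercept + coefficient * x
manual = fit_1d.intercept_ + fit_1d.coef_[0] * 10.0
predicted = fit_1d.predict(pd.DataFrame({"x": [10.0]}))[0]
print(manual, predicted)  # both equal 21 up to floating-point error
```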

## Cross-validation

In [90]:
from sklearn.model_selection import cross_validate

In [91]:
result = cross_validate(lm, X, y, scoring='neg_mean_squared_error') # see docstring for more details

In [92]:
result['test_score']

array([-0.0109505 , -0.01435888, -0.02917584, -0.06226445, -0.10967123])
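
The test scores are negative because scikit-learn scorers follow a higher-is-better convention, so error metrics are negated. A minimal sketch of converting them to per-fold RMSE, using synthetic data and illustrative names (`X_demo`, `y_demo`, `cv_result`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data (illustrative): a noisy straight line
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 7, size=(60, 1))
y_demo = 0.4 * X_demo[:, 0] - 0.35 + rng.normal(0, 0.2, size=60)

cv_result = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

mse_per_fold = -cv_result["test_score"]   # flip the sign to recover the MSE
rmse_per_fold = np.sqrt(mse_per_fold)     # RMSE is in the units of y
print(rmse_per_fold.round(3))
```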

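The scores above are negated mean squared errors, so values closer to zero are better. If `scoring` is omitted, the estimator's own .score() method is used, which for LinearRegression is R^2:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

Note: unlike most other scores, the R^2 score may be negative (it need not actually be the square of a quantity R).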

See also https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
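
To see how a negative value arises, recall that R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of always predicting the mean. The toy numbers below are made up for illustration:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]

good = [1.1, 1.9, 3.2]       # close to the truth
mean_only = [2.0, 2.0, 2.0]  # always predicts the mean of y_true
bad = [3.0, 2.0, 1.0]        # worse than predicting the mean

r2_good = r2_score(y_true, good)
r2_mean = r2_score(y_true, mean_only)
r2_bad = r2_score(y_true, bad)

print(r2_good)  # about 0.97
print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```

Any model that does worse than the constant mean prediction has SS_res > SS_tot and therefore a negative R^2.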


What other scorers are available?

In [93]:
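import sklearn.metrics
# Note: SCORERS is deprecated in newer scikit-learn releases;
# sklearn.metrics.get_scorer_names() is the replacement there.
sklearn.metrics.SCORERS.keys()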

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

# BREAK

# Part II - Machine Learning Pipelines for Regression


## Goal: to predict the flipper length of penguins given a number of features about them.

In [9]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [10]:
# to make this notebook's output identical at every run
np.random.seed(42)

In [11]:
penguins = sns.load_dataset('penguins')

In [12]:
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### Task 1
Are there any missing values? If so, decide how to deal with them (for example, by dropping or imputing the affected rows).

In [14]:
# insert your code here
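
As a hint, here is one way to count and handle missing values, shown on a tiny made-up DataFrame (`toy`) rather than on `penguins`. Dropping rows and median imputation are two common options; which one suits the penguin data is for you to decide.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps, loosely shaped like the penguins table
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, 50.0],
    "sex": ["Male", None, "Female", None],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill a numeric column with its median instead of dropping rows
filled = toy.assign(
    bill_length_mm=toy["bill_length_mm"].fillna(toy["bill_length_mm"].median())
)
print(len(toy), len(dropped))
```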

### Task 2
Use .value_counts() to get a sense of the distribution of categorical variables.

In [15]:
# insert your code here

### Task 3
Create scatterplots for all combinations of numeric variables. (Hint: sns.pairplot() might be useful, and if you color by species you'll see some pretty interesting things.)

In [16]:
# insert your code here

### Task 4
Split the data into training and testing sets, ensuring that the same distribution of species exists in the split data sets as the distribution of species in the original dataframe.

In [17]:
# insert your code here
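
As a hint, a stratified split can be sketched with the `stratify` argument of `train_test_split`. The DataFrame `toy`, its class proportions, and the 80/20 split below are assumptions chosen so the result is easy to verify by eye.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 60% Adelie, 30% Gentoo, 10% Chinstrap
toy = pd.DataFrame({
    "species": ["Adelie"] * 60 + ["Gentoo"] * 30 + ["Chinstrap"] * 10,
    "value": range(100),
})

# stratify= keeps the class proportions the same in both parts
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["species"], random_state=42
)

print(toy["species"].value_counts(normalize=True))
print(test_set["species"].value_counts(normalize=True))
```

Comparing the two `value_counts(normalize=True)` outputs is a simple way to confirm that the split preserved the species distribution.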

### Task 5
Create a design matrix (`penguins_X`, the features) and a label vector (`penguins_y`, the `flipper_length_mm` column) from the stratified training set.

In [18]:
# insert your code here

### Task 6
Create a pipeline to apply a `StandardScaler()` to all numeric columns and a `OneHotEncoder()` to the categorical columns in `penguins_X`. Assign the resulting matrix to a variable called `penguins_prepared`.

In [19]:
# insert your code here
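
As a hint, one common way to apply different transformers to different columns is `sklearn.compose.ColumnTransformer`. The sketch below uses a small made-up DataFrame (`toy_X`) and hypothetical column lists, and is only one of several possible designs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up feature table with two numeric and two categorical columns
toy_X = pd.DataFrame({
    "bill_length_mm": [39.1, 46.1, 50.0, 36.7],
    "body_mass_g": [3750.0, 5000.0, 5700.0, 3450.0],
    "species": ["Adelie", "Gentoo", "Gentoo", "Adelie"],
    "sex": ["Male", "Female", "Male", "Female"],
})

num_cols = ["bill_length_mm", "body_mass_g"]
cat_cols = ["species", "sex"]

# Each entry is (name, transformer, columns it applies to)
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

toy_prepared = preprocess.fit_transform(toy_X)
print(toy_prepared.shape)  # (4, 6): 2 scaled columns + 2 species + 2 sex indicators
```

Keeping the fitted `preprocess` object around matters later: the test set should go through `preprocess.transform()`, not `fit_transform()`, so that it is scaled with the training statistics.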

### Task 7
Fit a linear regression to penguins_prepared and penguins_y.

In [20]:
# insert your code here

### Task 8
Use the fitted model to show the predicted values for the first 5 rows of data.

In [21]:
# insert your code here

### Task 9
Use cross-validation on the training data to show the mean and standard deviation of the root mean squared error (RMSE) across folds for your model.

In [22]:
# insert your code here
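
As a hint, `cross_val_score` with the `neg_root_mean_squared_error` scorer returns one negated RMSE per fold, which can then be summarized. The synthetic data and names below (`X_demo`, `y_demo`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative): three features, noise with standard deviation 3
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo @ np.array([5.0, -2.0, 1.0]) + 200 + rng.normal(0, 3.0, size=80)

scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)

rmse = -scores  # flip the sign: scorers are negated so that higher is better
print(f"RMSE mean: {rmse.mean():.2f}, std: {rmse.std():.2f}")
```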

### Task 10
Apply your model to the test data (from your train-test split) and report the final root mean squared error (RMSE).

In [23]:
# insert your code here
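
As a hint, the final evaluation fits on the training portion only and measures error on the held-out portion. The sketch below uses synthetic data and illustrative names; with the penguin data, the test features would first go through the already fitted preprocessing step using `.transform()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative): three features, noise with standard deviation 3
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([5.0, -2.0, 1.0]) + 200 + rng.normal(0, 3.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

final_model = LinearRegression().fit(X_train, y_train)  # fit on training data only

# RMSE on data the model has never seen
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(f"Test RMSE: {test_rmse:.2f}")
```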