# Scikit-learn Supervised Learning API and Algorithm Roadmap

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint

from sklearn import set_config
set_config(transform_output='pandas')

## 1. Introducing Scikit-learn

[Scikit-learn](https://scikit-learn.org/) (aka sklearn) is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

In this notebook we will take a look at the supervised learning models which sklearn has available.

In [2]:
# Listing all regression models
from sklearn.utils import all_estimators

# Get all estimators from sklearn and filter only regressors
regressors = [name for name, _ in all_estimators(type_filter='regressor')]
regressors

['ARDRegression',
 'AdaBoostRegressor',
 'BaggingRegressor',
 'BayesianRidge',
 'CCA',
 'DecisionTreeRegressor',
 'DummyRegressor',
 'ElasticNet',
 'ElasticNetCV',
 'ExtraTreeRegressor',
 'ExtraTreesRegressor',
 'GammaRegressor',
 'GaussianProcessRegressor',
 'GradientBoostingRegressor',
 'HistGradientBoostingRegressor',
 'HuberRegressor',
 'IsotonicRegression',
 'KNeighborsRegressor',
 'KernelRidge',
 'Lars',
 'LarsCV',
 'Lasso',
 'LassoCV',
 'LassoLars',
 'LassoLarsCV',
 'LassoLarsIC',
 'LinearRegression',
 'LinearSVR',
 'MLPRegressor',
 'MultiOutputRegressor',
 'MultiTaskElasticNet',
 'MultiTaskElasticNetCV',
 'MultiTaskLasso',
 'MultiTaskLassoCV',
 'NuSVR',
 'OrthogonalMatchingPursuit',
 'OrthogonalMatchingPursuitCV',
 'PLSCanonical',
 'PLSRegression',
 'PassiveAggressiveRegressor',
 'PoissonRegressor',
 'QuantileRegressor',
 'RANSACRegressor',
 'RadiusNeighborsRegressor',
 'RandomForestRegressor',
 'RegressorChain',
 'Ridge',
 'RidgeCV',
 'SGDRegressor',
 'SVR',
 'StackingRegresso

In [3]:
# Get all estimators from sklearn and filter only classifiers
classifiers = [name for name, _ in all_estimators(type_filter='classifier')]
classifiers

['AdaBoostClassifier',
 'BaggingClassifier',
 'BernoulliNB',
 'CalibratedClassifierCV',
 'CategoricalNB',
 'ClassifierChain',
 'ComplementNB',
 'DecisionTreeClassifier',
 'DummyClassifier',
 'ExtraTreeClassifier',
 'ExtraTreesClassifier',
 'FixedThresholdClassifier',
 'GaussianNB',
 'GaussianProcessClassifier',
 'GradientBoostingClassifier',
 'HistGradientBoostingClassifier',
 'KNeighborsClassifier',
 'LabelPropagation',
 'LabelSpreading',
 'LinearDiscriminantAnalysis',
 'LinearSVC',
 'LogisticRegression',
 'LogisticRegressionCV',
 'MLPClassifier',
 'MultiOutputClassifier',
 'MultinomialNB',
 'NearestCentroid',
 'NuSVC',
 'OneVsOneClassifier',
 'OneVsRestClassifier',
 'OutputCodeClassifier',
 'PassiveAggressiveClassifier',
 'Perceptron',
 'QuadraticDiscriminantAnalysis',
 'RadiusNeighborsClassifier',
 'RandomForestClassifier',
 'RidgeClassifier',
 'RidgeClassifierCV',
 'SGDClassifier',
 'SVC',
 'SelfTrainingClassifier',
 'StackingClassifier',
 'TunedThresholdClassifierCV',
 'VotingClas

## 2. The estimator interface

Every model in scikit-learn follows a simple protocol.

- `model.fit(X_train, y_train)` learns parameters from data. `X_train` is 2D; `y_train` is 1D (a single output) or 2D (multiple outputs).
    - `X_train` is commonly called a "data matrix", "feature matrix" or "design matrix".
    - `X_train` has $n$ rows and $p$ columns, where $n$ is the number of observations and $p$ is the number of features.
    - `y_train` is called the "target vector" (1D) or "target matrix" (2D). I like to call it the "vector/matrix of observed targets".
- `model.predict(X_new)` returns point predictions.
    - `X_new` needs to have the same number of columns as `X_train`.
    - The output has the same number of rows as `X_new`. It is 1D if `y_train` was 1D; otherwise it has the same number of columns as `y_train`.
- `model.predict_proba(X_new)` returns class probabilities for classifiers that support probabilistic output.
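
The shape rules above can be checked on a small synthetic dataset. This is only a sketch: the data is random, and the names `y_1d`, `y_2d`, `labels` and `clf_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)  # synthetic data, purely illustrative

n, p = 100, 3
X_train = rng.normal(size=(n, p))  # feature matrix: n rows, p columns
X_new = rng.normal(size=(5, p))    # new data must also have p columns

# 1D target -> 1D predictions, one per row of X_new
y_1d = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
pred_1d = LinearRegression().fit(X_train, y_1d).predict(X_new)
print(pred_1d.shape)

# 2D target with two columns -> 2D predictions with two columns
y_2d = np.column_stack([y_1d, -y_1d])
pred_2d = LinearRegression().fit(X_train, y_2d).predict(X_new)
print(pred_2d.shape)

# Binary classifier: predict_proba returns one column per class
labels = (y_1d > 0).astype(int)
clf_demo = LogisticRegression().fit(X_train, labels)
proba = clf_demo.predict_proba(X_new)
print(proba.shape)
```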

Linear regression and logistic regression are the most basic regression and classification algorithms respectively.  Note that despite the name "logistic *regression*" it is actually a classification algorithm!

We will use these models to demonstrate the sklearn supervised learning API.

##### 2a. Linear Regression Example

We will use [the UC Irvine abalone dataset](https://archive.ics.uci.edu/dataset/1/abalone).  Abalone are sea snails in the genus Haliotis.

>Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.
>
>From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).

For the purposes of this notebook we will use the seven continuous physical measurements as predictors of the age.

<p align = center>
<img src = 'lecture_assets/abalone.jpg' width = 200></img>
</p>

In [4]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

# metadata
print(abalone.metadata)

# variable information
print(abalone.variables)


{'uci_id': 1, 'name': 'Abalone', 'repository_url': 'https://archive.ics.uci.edu/dataset/1/abalone', 'data_url': 'https://archive.ics.uci.edu/static/public/1/data.csv', 'abstract': 'Predict the age of abalone from physical measurements', 'area': 'Biology', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Tabular'], 'num_instances': 4177, 'num_features': 8, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['Rings'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C55C7W', 'creators': ['Warwick Nash', 'Tracy Sellers', 'Simon Talbot', 'Andrew Cawthorn', 'Wes Ford'], 'intro_paper': None, 'additional_info': {'summary': 'Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- 

From the dataset description, the continuous features have been scaled down by a factor of $200$.  Let's undo that to make our results more interpretable.  Let's also add $1.5$ to our target so that it represents age instead of the number of rings counted.

In [5]:
# The continuous features, which were scaled by dividing by 200
continuous_cols = [
    "Length",
    "Diameter",
    "Height",
    "Whole_weight",
    "Shucked_weight",
    "Viscera_weight",
    "Shell_weight",
]

# I don't want to have to deal with the categorical feature `Sex` in this notebook, so I am eliminating it.
# We will learn how to deal with categorical features at a later time.
# Selecting the columns and multiplying returns a new DataFrame, so the fetched data is not modified in place.
X = X[continuous_cols] * 200

# y is a dataframe, I want a series; rings plus 1.5 is age
y = (y["Rings"] + 1.5).rename("age")
y

0       16.5
1        8.5
2       10.5
3       11.5
4        8.5
        ... 
4172    12.5
4173    11.5
4174    10.5
4175    11.5
4176    13.5
Name: age, Length: 4177, dtype: float64

In [6]:
# Importing the linear regression model
from sklearn.linear_model import LinearRegression

# Instantiating the model
reg = LinearRegression()

# Fitting the model
reg.fit(X, y)

LinearRegression()

Parameters:
    fit_intercept: True
    copy_X: True
    tol: 1e-06
    n_jobs: None
    positive: False


In [7]:
from sklearn.metrics import root_mean_squared_error as rmse

# It is generally a bad idea to train and evaluate a model on the same dataset.  
# We will usually use cross-validation to evaluate the generalization capabilities of a model.  
# The purpose of this notebook is only to teach you the .fit / .predict API.

rmse(y, reg.predict(X))

2.215679763823951

If the usual assumption of iid normal errors holds, this RMSE (about 2.2 years) approximates the standard deviation of the error term in the linear model.
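
The link between RMSE and the error standard deviation can be illustrated with simulated data where the true value is known. This is a sketch under stated assumptions: the data is synthetic, the names `sigma`, `X_sim`, `y_sim` and `rmse_sim` are made up here, and the RMSE is computed directly with NumPy rather than with `root_mean_squared_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

n, sigma = 5000, 2.0  # sigma is the true standard deviation of the errors
X_sim = rng.uniform(0, 10, size=(n, 1))
y_sim = 3.0 + 0.5 * X_sim[:, 0] + rng.normal(scale=sigma, size=n)  # iid normal errors

model = LinearRegression().fit(X_sim, y_sim)
residuals = y_sim - model.predict(X_sim)

# Root mean squared error of the in-sample residuals
rmse_sim = np.sqrt(np.mean(residuals ** 2))
print(rmse_sim)  # should land close to sigma
```

With many observations and few features the in-sample RMSE is close to `sigma`. With few observations it tends to understate it, which is one more reason to evaluate on held-out data.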

##### 2b. Classification Example

We will demonstrate the logistic regression classification model on the iris dataset.  This is a very popular data set for testing classification algorithms.

Each observation represents an iris (a type of flower) and gives its measurements. The columns returned by `load_iris` are:
- `sepal length (cm)`: the length of the iris's sepal in cm.
- `sepal width (cm)`: the width of the iris's sepal in cm.
- `petal length (cm)`: the length of the iris's petal in cm.
- `petal width (cm)`: the width of the iris's petal in cm.
- `target`: the class of the iris:
    - `0` = iris setosa 
    - `1` = iris versicolor 
    - `2` = iris virginica 

In [8]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

In [9]:
X

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
...                ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3


In [10]:
y

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int64

In [11]:
from sklearn.linear_model import LogisticRegression

# Instantiating the classifier.  Default for multiclass is a multinomial model.
clf = LogisticRegression(max_iter=1000)

# Fitting the classifier
clf.fit(X, y)

LogisticRegression(max_iter=1000)

Parameters:
    penalty: 'l2'
    dual: False
    tol: 0.0001
    C: 1.0
    fit_intercept: True
    intercept_scaling: 1
    class_weight: None
    random_state: None
    solver: 'lbfgs'
    max_iter: 1000


In [12]:
# Demonstrating .predict
clf.predict(X.iloc[[20, 30, 70, 100], :])

array([0, 0, 2, 2])

In [13]:
y.iloc[[20, 30, 70, 100]]

20     0
30     0
70     1
100    2
Name: target, dtype: int64

In [14]:
# Demonstrating .predict_proba
np.round(clf.predict_proba(X.iloc[[20, 30, 70, 100], :]), 2)

array([[0.95, 0.05, 0.  ],
       [0.96, 0.04, 0.  ],
       [0.  , 0.44, 0.56],
       [0.  , 0.  , 1.  ]])
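
The two outputs are related: each row of `predict_proba` sums to one, and for this model the label returned by `predict` is the class with the largest probability. The sketch below checks both on the iris data. It refits the model so that it runs on its own, and the names `X_iris`, `y_iris`, `clf_iris` and `labels_from_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)
clf_iris = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)

proba = clf_iris.predict_proba(X_iris)  # one row per flower, one column per class

# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# Columns are ordered as in clf_iris.classes_, so argmax gives the predicted label
labels_from_proba = clf_iris.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(labels_from_proba, clf_iris.predict(X_iris))

print(rows_sum_to_one, matches_predict)
```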

WARNING: These are powerful tools which are easy to abuse. Unless we are extremely careful in how we build and validate our models, we can easily fool ourselves into thinking that a model performs well when it is actually terrible. In particular, it is a serious mistake to train and evaluate a model on the same dataset, as we have done here. We should generally use both a train/test split and cross-validation. We will discuss model evaluation more thoroughly next week.
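
As a preview, a train/test split combined with cross-validation might look like the following sketch. The split size, the number of folds, the `random_state` and the variable names are arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)

# Hold out a test set that is not touched until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.25, stratify=y_iris, random_state=0
)

clf_eval = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training data only (accuracy by default)
cv_scores = cross_val_score(clf_eval, X_train, y_train, cv=5)
print(cv_scores.mean())

# Refit on all of the training data, then score once on the held-out test set
clf_eval.fit(X_train, y_train)
test_accuracy = clf_eval.score(X_test, y_test)
print(test_accuracy)
```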