# Scikit-learn

This notebook only presents the general capabilities of the scikit-learn library. The goal is to understand when and for what the library can be used. Specific algorithms, metrics, and topics are discussed in separate notebooks, where we explain the theory behind them and practice their usage.


Scikit-learn is a machine learning library for Python built on NumPy, SciPy, and matplotlib. It provides ready-to-use implementations of ML algorithms for regression, classification, clustering, and simple neural networks. It also includes validation tools and metrics for evaluating models, which help with model selection and optimization, along with functionality for preprocessing (feature extraction and normalization), dimensionality reduction, and data visualization.

Install and import scikit-learn

In [None]:
# Install the library first, then import what the notebook needs
!pip install scikit-learn
from matplotlib import pyplot as plt

As scikit-learn offers many features, we usually import specific algorithms/features or groups of them. Scikit-learn is divided into modules by functionality, such as ``sklearn.linear_model``, ``sklearn.svm``, ``sklearn.semi_supervised``, ``sklearn.neural_network``, etc. This helps us find what we need.
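
As a small illustration of importing from specific modules, the sketch below pulls three estimators from three different submodules. The choice of estimators is arbitrary and only meant to show the import pattern.

```python
# Import only what is needed from the relevant submodules
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Every estimator is a class; creating an instance does not train anything yet
models = [LinearRegression(), SVC(), KNeighborsClassifier()]
model_names = [type(m).__name__ for m in models]
print(model_names)
```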

# Example datasets

Scikit-learn comes with ready-to-use example data sets. We can get them from ``sklearn.datasets``, which contains functions to load specific data sets. There are a few small toy data sets and some larger real-world data sets. Here is a list of data sets useful for testing different models:
- load_iris - good for classification, collection of iris classes with attributes
- load_digits - good for classification, collection of handwritten digits
- load_wine - good for classification, wine identification set
- load_diabetes - good for regression, information about disease progression
- load_breast_cancer - good for classification

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data)
print(iris.feature_names)
print(iris.target)
print(iris.target_names)
x, y = load_iris(return_X_y=True)
print(x)
print(y)

# Data Preprocessing

Scikit-learn provides several features for processing data before using them for model training. There are a few main transformations that we will use often:
- Standardization - for example, scaling to a specific range or removing the mean and scaling to unit variance
- Normalization - scaling individual samples to have unit norm
- Binarization - transforming numeric values to boolean using a defined threshold
- Encoding categorical features - changing text/categorical features to a numeric representation
- Imputing missing values - filling in missing values using a selected strategy (for example, the column mean)

More preprocessing options exist. Most transformers are available in ``sklearn.preprocessing``; imputers are in ``sklearn.impute``.

In [None]:
from sklearn import datasets, preprocessing

iris = datasets.load_iris()
X  = iris.data[:-10, :2]
X_test = iris.data[-10:, :2]
scaler = preprocessing.StandardScaler().fit(X)
standardized_X = scaler.transform(X)
standardized_X_test = scaler.transform(X_test)

print(X_test)
print(standardized_X_test)

binarizer = preprocessing.Binarizer(threshold=6.2).fit(X)
binary_X = binarizer.transform(X_test)
print(binary_X)
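
The cell above covers standardization and binarization. The remaining transformations from the list (range scaling, normalization, encoding, and imputation) can be sketched as follows. The small arrays are made up for illustration, and ``SimpleImputer`` is imported from ``sklearn.impute`` rather than ``sklearn.preprocessing``.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, Normalizer, OneHotEncoder

# Toy numeric data with one missing value (made up for illustration)
X_num = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])

# Imputation: replace NaN with the column mean (the mean of 1 and 3 is 2)
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# Scaling to a range: each column is mapped to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_imputed)

# Normalization: each row (sample) is rescaled to unit L2 norm
X_normalized = Normalizer(norm="l2").fit_transform(X_imputed)

# Encoding: each category becomes its own 0/1 column (columns: green, red)
colors = np.array([["red"], ["green"], ["red"]])
X_encoded = OneHotEncoder().fit_transform(colors).toarray()

print(X_imputed)
print(X_scaled)
print(X_normalized)
print(X_encoded)
```

Note that normalization works per sample (row), while standardization and range scaling work per feature (column).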

Often, we also want to split our data into training and test sets. This can be done with ``sklearn.model_selection.train_test_split``. The function accepts features and labels, which must contain the same number of samples. Additionally, it accepts ``random_state``, which seeds the shuffling so that the split is reproducible, and ``test_size``, which defines how large the test set should be (as a fraction of the whole data set, or as an absolute number of samples when an integer is passed).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

wines = load_wine()
# Keep the first two features (the slice [:, :0] would select no columns at all)
X, y = wines.data[:, :2], wines.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25, random_state=0)
print(X.shape)
print(X_train.shape)
print(X_test.shape)
print(y.shape)
print(y_train.shape)
print(y_test.shape)


# Building and training a model
Once the data are prepared, we can train our models.

To build a model, we create an instance of the corresponding class. For example, we can use ``sklearn.linear_model.LinearRegression`` for linear regression, ``sklearn.svm.SVC`` for Support Vector Machines, ``sklearn.neighbors.KNeighborsClassifier`` for K nearest neighbors, ``sklearn.tree.DecisionTreeClassifier`` for a decision tree, ``sklearn.linear_model.Lasso`` for a lasso regression model, ``sklearn.cluster.KMeans`` for KMeans clustering, and many more. Please check the documentation to review the parameters accepted by specific models. For example, KMeans accepts ``n_clusters`` and ``random_state``.

After we build the model, we need to train it with the ``.fit()`` method. For supervised models (classification and regression), we pass features and labels; for clustering, only a set of features.

After we train the model, we can use it to predict values with the ``.predict()`` method.
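
As a minimal sketch of the clustering case, the example below fits KMeans on the iris features only and then assigns samples to clusters. The value ``n_clusters=3`` is an assumption based on the three iris species, and ``n_init=10`` is set explicitly only to keep behavior the same across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Labels are loaded but deliberately ignored
X, _ = load_iris(return_X_y=True)

# Clustering is unsupervised: fit() receives only the features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# predict() assigns each sample to the nearest learned cluster center
clusters = kmeans.predict(X[:5])
print(clusters)
print(kmeans.cluster_centers_.shape)
```

The cluster numbers are arbitrary identifiers, so they do not have to match the original class labels.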

Try to train a K nearest neighbors model on the wine dataset using 80% of the original dataset, and test it using the remaining 20%.

In [None]:
# Import train_test_split, KNeighborsClassifier and load_wine function for dataset
from sklearn.___ import ___
from sklearn.___ import ___
from sklearn.___ import ___

# Import wine data with load_wine
wines = ___
# set labels to use classes names in results and features to train model
labels = wines.target_names[___]
features = wines.data

# Split data to training and tests sets, where test size is 20%, use random_state=0
X_train, X_test, y_train, y_test = ___(___, ___, ___, random_state=0)

# init KNeighborsClassifier with n_neighbors=6
knn = ___(___)
# train model
knn.___(___, ___)
# predict results for test set
prediction = knn.predict(___)
# print prediction and y_test
print(___)
print(___)


Try to train a Linear Regression model on the California housing dataset using 85% of the original dataset, and test it using the remaining 15%. Print the first six test targets and predictions to compare them.

In [None]:
# Import function to split train/test data, LinearRegression from linear_model and fetch_california_housing
from ___ import ___
from ___ import ___
from ___ import ___

# Import california housing data
cf_housing = ___()
# Split data to training and tests sets, where test size is 15%
___ = ___(___, ___, ___, random_state=10)

# create linear regression model and train it
lr = ___
___
# predict values for test dataset
prediction = ___
# print results
print(prediction[0:6])
print(y_test[0:6])


# Evaluation

Scikit-learn provides different evaluation metrics for models. With them, we can evaluate the quality of our model.

This is the main reason we split our datasets into training data and test data: after we train the model, we use the test data, which the model has not seen, to evaluate it. If we are not satisfied with the results, we can adjust model parameters or build a better dataset. Metrics are available in ``sklearn.metrics``.

For classification models, we usually use accuracy_score, classification_report, and confusion_matrix. For regression, we usually use mean_absolute_error, mean_squared_error, and r2_score. For clustering, we can use adjusted_rand_score, homogeneity_score, and v_measure_score. Metric functions accept the true target values first and the predicted values second.

In [None]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.1, random_state=10)

regr = LinearRegression()
regr.fit(X_train, y_train)

prediction = regr.predict(X_test)
print("R2-score: %.2f" % r2_score(y_test , prediction) )
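
The cell above evaluates a regression model. For classification, a comparable sketch could look like the one below, using the iris data and a K nearest neighbors classifier. The split size, ``random_state``, and ``n_neighbors`` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

# Metric functions take the true labels first, then the predictions
accuracy = accuracy_score(y_test, prediction)
# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_test, prediction, labels=[0, 1, 2])
print("Accuracy: %.2f" % accuracy)
print(matrix)
```

The diagonal of the confusion matrix counts correct predictions, so accuracy equals the diagonal sum divided by the number of test samples.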

Using Mean Absolute Error, Mean Squared Error, and R2 Score, try to select the best regression model for the diabetes data set (load_diabetes).

Split the original data into 85% training data and 15% test data.

Check LinearRegression, Ridge, and Lasso. For Ridge and Lasso, use alpha=0.01.

In [None]:
# Import everything that is needed
from sklearn.metrics import ___
from sklearn.linear_model import ___
from sklearn.datasets import ___
from sklearn.model_selection import train_test_split

# load dataset and split for test/train
___
___

# Train linear regression and predict values
___

# Train ridge and predict values
___

# Train lasso and predict values
___

# print Mean Absolute Error, Mean Squared Error and R2 Score for models
___

Scikit-learn also provides cross-validation. The idea is to run the modeling process on different subsets of the data to get multiple measures of model quality. It is available as ``sklearn.model_selection.cross_val_score``. We need to pass the model to evaluate, the features, and the target values (in that order). We often also pass a cross-validation generator (``cv``) and a scoring function (``scoring``).
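
A minimal sketch of a ``cross_val_score`` call is shown below. It uses logistic regression on the iris data; the model, the number of folds, and the ``max_iter`` value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# shuffle=True is required when random_state is set on KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Arguments: model, features, targets; cv and scoring are optional
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(scores)
print("Mean accuracy: %.3f" % scores.mean())
```

The result is an array with one score per fold, which makes it easy to compare both the average quality and the spread between folds.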

Try to compare KNeighborsClassifier and DecisionTreeClassifier using cross-validation. For cv, pass a KFold object with n_splits=8 and random_state=1. Test them on the wine dataset (load_wine function). Present the results in a box plot.

In [None]:
# Import everything that is needed
from sklearn.___ import ___
from sklearn.___ import ___
from sklearn.___ import ___
from sklearn.model_selection import train_test_split, KFold, cross_val_score
import matplotlib.pyplot as plt

models = {"KNN": ___, "Decision Tree Classifier": ___}
results = []

wines = ___

# Loop through the models' values
for model in models.___:  
  kf = KFold(n_splits=8, random_state=1, shuffle=True)
  
  # Perform cross-validation
  cv_results = cross_val_score(___, ___, ___, cv=___)
  results.append(cv_results)
  
plt.boxplot(results, labels=models.keys())
plt.show()

# Fine tuning models and pipelines

Each model has hyperparameters that allow us to adjust (fine-tune) it. We set hyperparameters when we create a model, before training it. Scikit-learn provides functionality that helps us identify the best hyperparameters by evaluating candidate values with cross-validation on the provided data. For this, we can use ``sklearn.model_selection.GridSearchCV``.

In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.datasets import load_diabetes

x, y = load_diabetes(return_X_y=True)
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.0001, 1, 50)}
lasso = Lasso()
kf = KFold(n_splits=8, random_state=12, shuffle=True)
# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)
# Fit on the full data set; cross-validation creates the train/validation splits
lasso_cv.fit(x, y)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Another frequently used feature is the pipeline, available as ``sklearn.pipeline.Pipeline``. A pipeline accepts a list of steps: a sequence of data transformers with an optional final predictor.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)
# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
# Calculate and print R-squared
print(pipeline.score(X_test, y_test))
print(pipeline.predict(X_train)[:10])
print(y_train[:10])

You can combine pipelines and grid search. In that case, pass the pipeline to GridSearchCV instead of the model. Additionally, each parameter name in the grid must be prefixed with the name of the pipeline step that holds the model, followed by a double underscore, for example:
```
steps = [("lassostep", Lasso())]
pipeline = Pipeline(steps)
param_grid = {"lassostep__alpha": np.linspace(0.0001, 1, 50)}
```
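
Put together, a complete run might look like the sketch below. It reuses the ``lassostep`` name from the fragment above and adds a scaler step; the grid range, the number of folds, and ``random_state`` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("lassostep", Lasso())])

# Grid keys follow the pattern <step name>__<parameter name>
param_grid = {"lassostep__alpha": np.linspace(0.01, 1, 10)}
kf = KFold(n_splits=5, shuffle=True, random_state=12)

# The pipeline is passed where a single model would normally go
search = GridSearchCV(pipeline, param_grid, cv=kf)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Because the scaler is part of the pipeline, it is refit on the training part of every fold, so no information from the validation part leaks into the scaling.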

Build a pipeline with SimpleImputer, StandardScaler, and LogisticRegression. Use the iris dataset (load_iris) and split it into test and train data (20% test data); for the data split, use random_state=21. Using GridSearchCV, find the best solver and C hyperparameters. Test the following solvers: "newton-cg", "saga", "lbfgs". To search the C parameter, use np.linspace(0.001, 1.0, 10). Display the best params and score for the fine-tuned model.

In [None]:
# import everything that is needed (numpy, StandardScaler, SimpleImputer, LogisticRegression, Pipeline, etc.)
___

X, y = ___
X_train, X_test, y_train, y_test = train_test_split(___)
# Create steps
steps = [___]

# Set up pipeline
pipeline = ___
# define grid params
params = {___}

# Create the GridSearchCV object, train it and predict result for test set
tuning = ___
tuning.___
y_pred = ___

# Compute and print performance
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))

# Summary
Scikit-learn is a good library for starting to experiment with different types of models and evaluation metrics. It offers much more than what was presented in this notebook. In other notebooks, we will explore specific models, metrics, and data transformations, along with the theory behind them, to better understand what to use and when.


The general flow for all models is to:
1. Import the data set
1. Preprocess it to fit model requirements, for example, standardize it or impute missing data
1. Split the data into training and test sets with ``train_test_split()``
1. Train the model with ``.fit()``
1. Evaluate the model
1. Use the model

For multistep processes, use pipelines. When you are not sure which hyperparameters to set, experiment with the model and fine-tune it (for example, with grid search).

If you want to persist a model so that you do not have to rebuild it each time, or to use it in other projects or systems, you need to dump it to a file and load it again in the target location. For example, you can export your model with skl2onnx to the ONNX format, then use onnxruntime to run predictions.
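
As a minimal sketch of the dump-and-load idea, the example below uses Python's built-in ``pickle`` module rather than the ONNX route mentioned above. The file name ``knn_iris.pkl`` and the temporary directory are hypothetical and only serve the illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Hypothetical file location; a temporary directory keeps the example tidy
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "knn_iris.pkl"

    # Dump the trained model to a file
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as the target system would
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model gives the same predictions as the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.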