<a href="https://colab.research.google.com/github/victorviro/Machine-Learning-Python/blob/master/Introduction_to_Scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Scikit-learn

[Scikit-learn](https://github.com/scikit-learn/scikit-learn) is a Python library that provides a standard interface for implementing state-of-the-art machine learning algorithms, as well as comprehensive [documentation](https://scikit-learn.org/stable/index.html) about each algorithm. It also includes auxiliary functions that are integral to the machine learning pipeline, such as data preprocessing steps, data resampling techniques, evaluation metrics, and search interfaces for tuning/optimizing an algorithm’s performance.

This notebook goes through the functions for implementing a typical machine learning pipeline with Scikit-learn. Since Scikit-learn has a variety of packages and modules that are called depending on the use case, we’ll import a module directly from a package if and when needed using the `from` keyword. The goal of this notebook is to provide the foundation needed to navigate the extensive Scikit-learn library and pick the right tool or function to get the job done.

Scikit-learn comes with a set of small standard datasets that are ideal for learning purposes. However, because these datasets are small and well-curated, they do not represent real-world scenarios.

The [Iris plants dataset](https://scikit-learn.org/stable/datasets/index.html#iris-dataset) consists of 3 different types of plants (Setosa, Versicolour, and Virginica) and four features.

In [31]:
from sklearn import datasets
import numpy as np 
import pandas as pd

In [22]:
# Load iris dataset
iris = datasets.load_iris()

print(f'Keys of iris: {iris.keys()}')
print(f'Iris dataset shape: {iris.data.shape}')
print(f'Predictive features: {iris.feature_names}') 

Keys of iris: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
Iris dataset shape: (150, 4)
Predictive features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data itself is contained in the `target` and `data` fields. `data` contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array. The `target` array contains the species of each of the flowers that were measured, also as a NumPy array.

## Splitting the dataset into training and test sets

We want to build a machine learning model from this data that can predict the species of iris for a new set of measurements. But before we can apply our model to new measurements, we need to know whether it actually works (whether we should trust its predictions). We cannot use the data we used to build the model to evaluate it because our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set. This *remembering* does not indicate to us whether our model will generalize well (whether it will also perform well on new data).

To assess the model’s performance, we show it new data (data that it hasn’t seen before) for which we have labels. This is usually done by splitting the labeled data we have collected (here, our 150 flower measurements) into two parts. One part of the data, the training data, is used to build our machine learning model, and the rest of the data, the test data or hold-out set, will be used to assess how well the model works.

Scikit-learn has a convenient function for this called [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)`(X, y, test_size=0.25)`, where `X` is the design matrix (the dataset of predictors) and `y` is the target variable. The split size is controlled with the `test_size` parameter; by default, the test set is 25% of the dataset. It is standard practice to shuffle the dataset before splitting, which is controlled by the `shuffle` parameter (`True` by default).

In [None]:
from sklearn.model_selection import train_test_split

# Split in train and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, shuffle=True)
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')



X_train shape: (112, 4)
X_test shape: (38, 4)
y_train shape: (112,)
y_test shape: (38,)


Before making the split, the `train_test_split` function shuffles the dataset using a pseudorandom number generator. If we just took the last 25% of the data as a test set, all the data points would have the label 2, as the data points are sorted by the label (see the output for `iris.target`). Using a test set containing only one of the three classes would not tell us much about how well our model generalizes, so we shuffle our data to make sure the test data contains data from all classes.

To make sure that we will get the same output if we run the same function several times, we can provide the pseudorandom number generator with a fixed seed using the `random_state` parameter. This will make the outcome deterministic, so this line will always have the same outcome. 
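The effect of `random_state` can be checked with a minimal sketch like the one below. The seed value `0`, the `test_size=0.2` setting, and the use of the `stratify` parameter (which keeps the class proportions equal in both subsets) are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Fixing random_state makes the shuffle reproducible across calls
split_a = train_test_split(iris.data, iris.target, random_state=0)
split_b = train_test_split(iris.data, iris.target, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f'Identical splits: {same_split}')

# stratify keeps the class proportions of the target in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target)
print(f'Test set class counts: {np.bincount(y_test)}')
```

Stratifying is an alternative safeguard against a test set that misses or under-represents a class.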

## Preprocessing the data for model fitting

Before a machine learning model is fitted to a dataset, the data usually undergoes some important transformations. These transformations can have a large effect on the performance of the learning model. Transformers in Scikit-learn have a `fit()` and a `transform()` method, as well as a `fit_transform()` method.

The `fit()` method learns the parameters of the transformation from a dataset (normally the training set), while the `transform()` method applies the transformation, based on the learned parameters, to that same dataset and also to the test or validation datasets before modeling. The `fit_transform()` method learns and applies the transformation to the same dataset in one step. Data transformation classes are found in the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module.
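The `fit()` / `transform()` / `fit_transform()` pattern can be sketched as follows. `StandardScaler` and the seed `random_state=0` are chosen here only for illustration; any transformer follows the same pattern.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean and std from the training set only
X_train_scaled = scaler.transform(X_train)  # apply the learned parameters
X_test_scaled = scaler.transform(X_test)    # reuse the training statistics on the test set

# fit_transform is a shortcut for fit followed by transform on the same data
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
print(np.allclose(X_train_scaled, X_train_scaled_2))
```

Fitting on the training set only means that no information from the test set leaks into the preprocessing step.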

This section will cover some critical transformations for numeric and categorical variables. They include:

- Data rescaling

- Encoding categorical variables

- Imputing missing data

- Generating higher-order polynomial features

- Binning

### Data rescaling

It is often the case that the features of a dataset have different scales. For example, the data in column A can be in the range 1-5, while the data in column B is in the range 1000-9000. Such differences in scale can hurt certain machine learning models, especially those trained by minimizing a cost function: when features have very different scales, the cost function takes the shape of an elongated bowl, and an optimization algorithm like gradient descent needs many more steps to reach the minimum (see notebook [Introduction to gradient descent algorithm](https://github.com/victorviro/Machine-Learning-Python/blob/master/Introduction_gradient_descent_algorithm.ipynb)). Note that scaling the target values is generally not required.

There are three common ways to get all attributes to have the same scale: *min-max scaling*, *standardization*, and *normalization*.

#### Min-max scaling

Min-max scaling is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min, that is, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ for each feature. It is implemented in Scikit-learn by the [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class. It has a `feature_range` hyperparameter that lets us change the range if we don’t want 0-1 for some reason. Let’s see an example.

In [None]:
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

# Load dataset
data = datasets.load_iris()

# Separate features and target
X = data.data
y = data.target

# Print first 5 rows of X before rescaling
print(X[0:5,:])

# Rescale X
scaler = MinMaxScaler(feature_range=(0, 1))
rescaled_X = scaler.fit_transform(X)

# Print first 5 rows of X after rescaling
print(rescaled_X[0:5,:])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]]
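As a sanity check, the min-max formula can be applied by hand with NumPy and compared with `MinMaxScaler`. This is a small sketch; the `(-1, 1)` range in the second part is an arbitrary choice to show `feature_range`.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

X = datasets.load_iris().data

# Column-wise min-max scaling written out by hand
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(manual, MinMaxScaler().fit_transform(X)))

# feature_range changes the target interval
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(scaled.min(axis=0), scaled.max(axis=0))
```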


#### Standardization

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is less affected by outliers. For example, suppose a plant had a sepal length equal to 100 (by mistake). Min-max scaling would then squeeze all the other sepal lengths (which range from 4.3 to 7.9) into roughly 0-0.04, whereas standardization would be less affected. Scikit-Learn provides a transformer called [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for standardization. Let’s look at an example.

In [None]:
from sklearn.preprocessing import StandardScaler
# Load dataset
data = datasets.load_iris()
# Separate features and target
X = data.data
y = data.target
# Print first 5 rows of X before standardization
print(X[0:5,:])

# Standardize X
scaler = StandardScaler().fit(X)
standardize_X = scaler.transform(X)

# Print first 5 rows of X after standardization
print(standardize_X[0:5,:])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
 [-1.38535265  0.32841405 -1.39706395 -1.3154443 ]
 [-1.50652052  0.09821729 -1.2833891  -1.3154443 ]
 [-1.02184904  1.24920112 -1.34022653 -1.3154443 ]]


#### Normalization

Data normalization rescales each observation (each row of the dataset) so that it has unit norm, that is, a magnitude or length of 1. The length of a vector is the square root of the sum of squares of the vector elements. A unit vector is obtained by dividing the vector by its length. Normalizing the dataset is particularly useful in scenarios where the dataset is sparse (i.e., a large number of values are zeros) and also has differing scales. Normalization in Scikit-learn is implemented by the [`Normalizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) class.

In [None]:
from sklearn.preprocessing import Normalizer
# Load dataset
data = datasets.load_iris()
# Separate features and target
X = data.data
y = data.target
# Print first 5 rows of X before normalization
print(X[0:5,:])

# Normalize X
scaler = Normalizer().fit(X)
normalize_X = scaler.transform(X)

# Print first 5 rows of X after normalization
print(normalize_X[0:5,:])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[0.80377277 0.55160877 0.22064351 0.0315205 ]
 [0.82813287 0.50702013 0.23660939 0.03380134]
 [0.80533308 0.54831188 0.2227517  0.03426949]
 [0.80003025 0.53915082 0.26087943 0.03478392]
 [0.790965   0.5694948  0.2214702  0.0316386 ]]
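The row-wise nature of normalization can be verified with a short sketch: after the transformation every row has Euclidean length 1, and the result matches dividing each row by its own norm. The explicit `norm='l2'` argument is the default and is written out only for clarity.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import Normalizer

X = datasets.load_iris().data

normalized = Normalizer(norm='l2').fit_transform(X)

# Each row (observation) now has unit Euclidean length
row_norms = np.linalg.norm(normalized, axis=1)
print(np.allclose(row_norms, 1.0))

# Same result by hand: divide each row by its own length
manual = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(manual, normalized))
```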


### Encoding categorical variables

Most machine learning algorithms cannot work directly with non-numerical or categorical variables. Encoding categorical variables is the technique of converting non-numerical features with labels into a numerical representation for use in machine learning modeling (for more information about why we need to convert non-numerical features into numerical features, check this [notebook](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Text_Vectorization_NLP.ipynb)). Scikit-learn provides classes for encoding categorical variables, including the [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) for encoding labels as integers, [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for converting categorical features into a binary matrix, and [`LabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) for creating a one-hot encoding of target labels.

#### Label encoding

`LabelEncoder` is typically used on the target variable to transform a vector of categories (or labels) into an integer representation by encoding each label with a value between 0 and the number of categories minus 1.

Let’s see an example of `LabelEncoder`. (For categorical input features, rather than target labels, `OrdinalEncoder` plays the analogous role.)

In [None]:
from sklearn.preprocessing import LabelEncoder
# Create dataset
data = np.array([[5,8,"calabar"],[9,3,"uyo"],[8,6,"owerri"],
                 [0,5,"uyo"],[2,3,"calabar"],[0,8,"calabar"],
                 [1,8,"owerri"]])
print(data)

# Separate features and target
X = data[:,:2]
y = data[:,-1]

# Encode y
encoder = LabelEncoder()
encode_y = encoder.fit_transform(y)

# adjust dataset with encoded targets
data[:,-1] = encode_y
print(data)

[['5' '8' 'calabar']
 ['9' '3' 'uyo']
 ['8' '6' 'owerri']
 ['0' '5' 'uyo']
 ['2' '3' 'calabar']
 ['0' '8' 'calabar']
 ['1' '8' 'owerri']]
[['5' '8' '0']
 ['9' '3' '2']
 ['8' '6' '1']
 ['0' '5' '2']
 ['2' '3' '0']
 ['0' '8' '0']
 ['1' '8' '1']]
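The mapping learned by `LabelEncoder` can be inspected and reversed, as in the following minimal sketch. It uses a shortened version of the target column above; the `classes_` attribute and `inverse_transform()` method are part of the `LabelEncoder` API.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["calabar", "uyo", "owerri", "uyo", "calabar"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(y)

# classes_ holds the categories in sorted order; the integer is the index into it
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```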


#### One-hot encoding

`OneHotEncoder` is used to transform a categorical feature variable into a binary matrix. This matrix is a sparse matrix with each column corresponding to one possible value of a category. The new attributes are sometimes called *dummy* attributes.

Let’s see an example of `OneHotEncoder`.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Create dataset
data = np.array([[5,"efik", 8,"calabar"],
                 [9,"ibibio",3,"uyo"],
                 [8,"igbo",6,"owerri"],
                 [0,"ibibio",5,"uyo"],
                 [2,"efik",3,"calabar"],
                 [0,"efik",8,"calabar"],
                 [1,"igbo",8,"owerri"]])
# Separate features and target
X = data[:,:3]
y = data[:,-1]
# Print the feature or design matrix X
print(X)

# One_hot_encode X
one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
encode_categorical = X[:,1].reshape(len(X[:,1]), 1)
one_hot_encode_X = one_hot_encoder.fit_transform(encode_categorical)
# Print one_hot encoded matrix (the output is a SciPy sparse matrix)
# Call the toarray() method
print(one_hot_encode_X.toarray())

# Remove categorical label
X = np.delete(X, 1, axis=1)
# Append encoded matrix
X = np.append(X, one_hot_encode_X.toarray(), axis=1)
print(X)

[['5' 'efik' '8']
 ['9' 'ibibio' '3']
 ['8' 'igbo' '6']
 ['0' 'ibibio' '5']
 ['2' 'efik' '3']
 ['0' 'efik' '8']
 ['1' 'igbo' '8']]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
[['5' '8' '1.0' '0.0' '0.0']
 ['9' '3' '0.0' '1.0' '0.0']
 ['8' '6' '0.0' '0.0' '1.0']
 ['0' '5' '0.0' '1.0' '0.0']
 ['2' '3' '1.0' '0.0' '0.0']
 ['0' '8' '1.0' '0.0' '0.0']
 ['1' '8' '0.0' '0.0' '1.0']]


Alternatively, we can use the `get_dummies` function implemented by the Pandas package.

**Note**: When using `get_dummies`, we should compute the dummy variables on a dataset containing both the training and the test data. This is important to ensure categorical values are represented in the same way in the training set and the test set.

**Note**: Often, whether for ease of storage or because of the way the data is collected, categorical variables are encoded as integers. That they are numbers doesn’t mean that they should necessarily be treated as continuous features. It is not always clear whether an integer feature should be treated as continuous or discrete (and one-hot encoded). If there is no ordering between the semantics that are encoded, the feature must be treated as discrete. The `get_dummies` function in pandas treats all numbers as continuous by default and will not create dummy variables for them. To get around this, we can either use Scikit-learn’s `OneHotEncoder`, applying it only to the columns we want to treat as discrete (for example through a `ColumnTransformer`), or convert numeric columns in the DataFrame to strings (using `astype(str)`) and use `get_dummies`.
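The behaviour of `get_dummies` with integer-coded categories can be sketched as below. The DataFrame and its `zone` column are hypothetical data made up for this illustration.

```python
import pandas as pd

# Hypothetical data: "zone" is a category stored as an integer code
df = pd.DataFrame({
    "tribe": ["efik", "ibibio", "igbo", "ibibio"],
    "zone": [0, 1, 2, 1],
})

# By default only the string column is expanded; "zone" is left as a number
dummies_default = pd.get_dummies(df)
print(list(dummies_default.columns))

# Converting the integer column to strings makes get_dummies encode it too
df["zone"] = df["zone"].astype(str)
dummies_all = pd.get_dummies(df)
print(list(dummies_all.columns))
```

`get_dummies` also accepts a `columns` argument listing the columns to encode, which avoids the type conversion.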

### Imputing missing data

It is often the case that a dataset contains missing values. We can handle them easily using DataFrame’s `dropna()` (remove the rows that contain missing values), `drop()` (remove the whole attribute), or `fillna()` (fill the missing values with some value such as zero, the mean, or the median). If we choose the last option with the mean, we should compute the mean on the training set and use it to fill the missing values in the training set. We should also save that mean: we will need it later to replace missing values in the test set when we evaluate the system, and to replace missing values in new data once the system goes live.

Scikit-learn provides the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) class for completing missing values.

In [None]:
from sklearn.impute import SimpleImputer
# Create dataset
data = np.array([[5,np.nan,8],[9,3,5],[8,6,4],
                 [np.nan,5,2],[2,3,9],[np.nan,8,7],
                 [1,np.nan,5]])
print(data)

# Impute missing values with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer.fit_transform(data))

[[ 5. nan  8.]
 [ 9.  3.  5.]
 [ 8.  6.  4.]
 [nan  5.  2.]
 [ 2.  3.  9.]
 [nan  8.  7.]
 [ 1. nan  5.]]
[[5. 5. 8.]
 [9. 3. 5.]
 [8. 6. 4.]
 [5. 5. 2.]
 [2. 3. 9.]
 [5. 8. 7.]
 [1. 5. 5.]]


Only two of the three attributes had missing values, but we cannot be sure that there won’t be any missing values in new data after the system goes live, so it is safer to apply the `imputer` to all the numerical attributes. The `imputer` has simply computed the mean of each attribute and stored the result in its `statistics_` instance variable:

In [None]:
print(imputer.statistics_)

[5.         5.         5.71428571]
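The advice to reuse training-set statistics on later data can be sketched as follows. The two small arrays and the `"median"` strategy are illustrative assumptions; the point is that `fit()` is called once on the training data and `transform()` reuses the stored `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data arriving later
train = np.array([[5, np.nan, 8], [9, 3, 5], [8, 6, 4], [np.nan, 5, 2]])
new_data = np.array([[np.nan, 4, np.nan], [7, np.nan, 1]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                      # medians are learned from the training data only
print(imputer.statistics_)              # [8.  5.  4.5]

# Missing values in the new data are filled with the training medians
new_filled = imputer.transform(new_data)
print(new_filled)
```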


**Note**: The `SimpleImputer` approach is a univariate imputation algorithm, that is, it imputes values in the $i^{\text{th}}$ feature dimension using only non-missing values in that feature dimension. We can opt for multivariate imputation algorithms that use the entire set of available feature dimensions to estimate the missing values (e.g. [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)).


We can use another strategy for filling in missing values. For example, the [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer) class provides imputation for filling in missing values using the k-Nearest Neighbors approach.

See Scikit-learn guide [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html) for more information.
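As a rough sketch of the k-Nearest Neighbors strategy, `KNNImputer` can be applied to the same array used in the `SimpleImputer` example. The choice of `n_neighbors=2` is an arbitrary assumption for this small dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer

data = np.array([[5, np.nan, 8], [9, 3, 5], [8, 6, 4],
                 [np.nan, 5, 2], [2, 3, 9], [np.nan, 8, 7],
                 [1, np.nan, 5]])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows that have the feature observed
knn_imputer = KNNImputer(n_neighbors=2)
filled = knn_imputer.fit_transform(data)
print(filled)
```

Unlike the mean strategy, the filled-in value can differ from row to row because it depends on which rows are closest.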

### Generating higher-order polynomial features


Scikit-learn has a class called [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) for generating a new dataset containing higher-order polynomial and interaction features based on the features in the original dataset. For example, if the original dataset has two dimensions $[a, b]$, the second-degree polynomial transformation of the features will result in $[1, a, b, a^2, ab, b^2]$.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
# Create dataset
data = np.array([[5,8],[9,3],[8,6],
                 [5,2],[3,9],[8,7],
                 [1,5]])
print(data)

# Create polynomial features
polynomial_features = PolynomialFeatures(2)
data = polynomial_features.fit_transform(data)
print(data)

[[5 8]
 [9 3]
 [8 6]
 [5 2]
 [3 9]
 [8 7]
 [1 5]]
[[ 1.  5.  8. 25. 40. 64.]
 [ 1.  9.  3. 81. 27.  9.]
 [ 1.  8.  6. 64. 48. 36.]
 [ 1.  5.  2. 25. 10.  4.]
 [ 1.  3.  9.  9. 27. 81.]
 [ 1.  8.  7. 64. 56. 49.]
 [ 1.  1.  5.  1.  5. 25.]]


Adding polynomial features of the original data is a way to enrich a feature representation, particularly for linear models. 

This kind of feature engineering, as well as adding interaction features, is often used in statistical modeling, but it’s also common in many practical machine learning applications. 

Using polynomial features together with a linear regression model yields the classical model of polynomial regression (check this [notebook](https://github.com/victorviro/Machine-Learning-Python/blob/master/Introduction_linear_regression_and_regularized_linear_models.ipynb) to see how to apply polynomial regression). For other kinds of machine learning algorithms, such as tree-based models, adding interactions and polynomials may bring no benefit and can even decrease performance slightly.
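The benefit for linear models can be illustrated with a small sketch on synthetic data. The quadratic relationship, the noise level, and the seed below are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y depends quadratically on a single feature, plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.1, size=100)

# Plain linear regression on the raw feature
linear = LinearRegression().fit(X, y)

# Linear regression on [x, x^2]; the bias column is left to the model's intercept
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
polynomial = LinearRegression().fit(X_poly, y)

print(f'R^2 linear:     {linear.score(X, y):.3f}')
print(f'R^2 polynomial: {polynomial.score(X_poly, y):.3f}')
```

The model is still linear in its parameters; only the feature representation has been enriched.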

### Binning

One way to make linear models more powerful on continuous data is to use [*binning*](https://en.wikipedia.org/wiki/Data_binning) (also known as *discretization* or *bucketing*) of the feature to split it up into multiple features. We imagine a partition of the input range for the feature (for example, the numbers of a variable from -3 to 3) into a fixed number of bins (say, 10). A data point will then be represented by which bin it falls into. To determine this, we first have to define the bins. In this case, we’ll define 10 bins equally spaced between -3 and 3. We use the `np.linspace` function for this, creating 11 entries, which will create 10 bins (the spaces in between two consecutive boundaries):

In [17]:
bins = np.linspace(-3, 3, 11)
print("bins: {}".format(bins))

bins: [-3.  -2.4 -1.8 -1.2 -0.6  0.   0.6  1.2  1.8  2.4  3. ]


Here, the first bin contains all data points with feature values from -3 to -2.4, the second bin contains all points with feature values from -2.4 to -1.8, and so on.

Next, we record for each data point which bin it falls into. This can be easily computed using the `np.digitize` function:

In [18]:
X = np.array([-2.9,2.1,0.2,1,5,-2.1,1.88,2.55]).reshape((8, 1))
print('X shape:', X.shape)
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X)
print("\nBin membership for data points:\n", which_bin)

X shape: (8, 1)

Data points:
 [[-2.9 ]
 [ 2.1 ]
 [ 0.2 ]
 [ 1.  ]
 [ 5.  ]
 [-2.1 ]
 [ 1.88]
 [ 2.55]]

Bin membership for data points:
 [[ 1]
 [ 9]
 [ 6]
 [ 7]
 [11]
 [ 2]
 [ 9]
 [10]]


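What we did here is transform the single continuous input feature into a categorical feature that encodes which bin a data point is in. Note that the value 5 lies beyond the last boundary, so `np.digitize` assigns it the index 11. We can then transform this discrete feature to a one-hot encoding using the `OneHotEncoder` from the preprocessing module. Or we can directly use the [`KBinsDiscretizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) class and set the `encode` parameter to `onehot`. With `strategy='uniform'`, it computes equal-width bins between the minimum and maximum of the data it is fitted on, so its bins differ from the ones we defined by hand: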

In [20]:
from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=10, encode='onehot', strategy='uniform')
est.fit(X)
X_binned = est.transform(X)
X_binned.toarray()

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])

Because we specified 10 bins, the transformed dataset `X_binned` now is made up of 10 features.
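To see which bins `KBinsDiscretizer` actually learned, a sketch like the following inspects its `bin_edges_` attribute. It uses `encode='ordinal'` instead of `'onehot'` so that the bin index of each point is returned directly.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([-2.9, 2.1, 0.2, 1, 5, -2.1, 1.88, 2.55]).reshape(-1, 1)

# 'ordinal' returns the bin index of each point instead of a one-hot matrix
est = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
bin_index = est.fit_transform(X)

# 11 equally spaced edges between the data minimum (-2.9) and maximum (5)
edges = est.bin_edges_[0]
print(edges)
print(bin_index.ravel())
```

The bin indices correspond to the positions of the ones in the one-hot matrix shown above.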

### Other preprocessing techniques

- *Binarization* is a transformation technique for converting a dataset into binary values by setting a cutoff or threshold. All values above the threshold are set to 1, while values less than or equal to it are set to 0. This technique is useful for converting a dataset of probabilities into integer values or for transforming a feature to reflect some categorization. Scikit-learn implements binarization with the [`Binarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class.
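A minimal sketch of binarization, using a made-up array of probabilities and a threshold of 0.5:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities
probabilities = np.array([[0.1, 0.5, 0.9],
                          [0.7, 0.3, 0.5]])

# Values strictly greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=0.5)
binary = binarizer.fit_transform(probabilities)
print(binary)
```

Note that a value exactly equal to the threshold (0.5 here) is mapped to 0.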

### Custom transformers

Although Scikit-Learn provides many useful transformers, we may need to write our own for tasks such as custom cleanup operations or combining specific attributes. We will want our transformer to work seamlessly with Scikit-learn functionalities (such as pipelines), and since Scikit-learn relies on duck typing (not inheritance), all we need is to create a class and implement three methods: `fit()` (returning `self`), `transform()`, and `fit_transform()`. We can get the last one for free by simply adding `TransformerMixin` as a base class. Also, if we add `BaseEstimator` as a base class (and avoid `*args` and `**kwargs` in our constructor) we will get two extra methods (`get_params()` and `set_params()`) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds a combined attribute in the iris dataset:

In [35]:
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [36]:
from sklearn.base import BaseEstimator, TransformerMixin

# Add a new variable: sepal length divided by sepal width
class VariableRatioSepalAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    # Define the transformation
    def transform(self, X):
        X_new = X.copy()
        # Add a new variable
        X_new["ratio_sepal_lengh_width"] = X["sepal length (cm)"]/X["sepal width (cm)"]
        return X_new
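To show what the two base classes contribute, here is a sketch of a variant with constructor hyperparameters. The class name `RatioAdder`, its `numerator` and `denominator` parameters, and the `"ratio"` column name are hypothetical choices for this illustration, not part of the transformer above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds numerator / denominator as a new column."""

    def __init__(self, numerator="sepal length (cm)", denominator="sepal width (cm)"):
        # Hyperparameters are stored unchanged under the same names as the arguments
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X_new = X.copy()
        X_new["ratio"] = X[self.numerator] / X[self.denominator]
        return X_new

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

adder = RatioAdder()
X_extended = adder.fit_transform(X)   # fit_transform comes from TransformerMixin
print(X_extended.shape)

print(adder.get_params())             # get_params comes from BaseEstimator
adder.set_params(numerator="petal length (cm)", denominator="petal width (cm)")
```

Exposing the column names as constructor arguments lets tools that rely on `get_params()` and `set_params()` treat them like any other hyperparameter.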

## Machine learning algorithms



A series of notebooks that show how machine learning algorithms work using the Scikit-learn library:

- Supervised learning
  - [Introduction to linear regression and regularized linear models](https://github.com/victorviro/ML_algorithms_python/blob/master/Introduction_linear_regression_and_regularized_linear_models.ipynb)
  - [Logistic regression](https://github.com/victorviro/ML_algorithms_python/blob/master/Logistic_regression.ipynb)
  - [Discriminant analysis and Naive bayes](https://github.com/victorviro/ML_algorithms_python/blob/master/Gaussian_discriminant_analysis_and_Naive_bayes.ipynb)
  - [Decision Trees](https://github.com/victorviro/ML_algorithms_python/blob/master/Decision_Trees.ipynb)
  - [Ensemble learning](https://github.com/victorviro/ML_algorithms_python/blob/master/Ensemble_learning.ipynb)
  - [Support Vector Machines (SVM)](https://github.com/victorviro/ML_algorithms_python/blob/master/Support_Vector_Machines_explained.ipynb)

- Unsupervised learning
  - [Dimensionality reduction (PCA)](https://github.com/victorviro/ML_algorithms_python/blob/master/Dimensionality_reduction_algorithms.ipynb)
  - [Clustering](https://github.com/victorviro/ML_algorithms_python/blob/master/Unsupervised_Learning_Techniques_Clustering.ipynb)


## Scikit-learn design

Scikit-Learn’s API is remarkably well designed. The [main design principles](https://arxiv.org/abs/1309.0238) are:

- **Consistency**. All objects share a consistent and simple interface:
  - *Estimators*. Any object that can estimate some parameters based on a dataset is called an *estimator* (e.g., an `imputer` is an estimator). The estimation itself is performed by the `fit()` method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer’s `strategy`), and it must be set as an instance variable (generally via a constructor parameter).

  - *Transformers*. Some estimators (such as an `imputer`) can also transform a dataset; these are called *transformers*. Once again, the API is quite simple: the transformation is performed by the `transform()` method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an `imputer`. All transformers also have a convenience method called `fit_transform()` that is equivalent to calling `fit()` and then `transform()` (but sometimes `fit_transform()` is optimized and runs much faster).

  - *Predictors*. Finally, some estimators are capable of making predictions given a dataset; they are called *predictors*. For example, the `LinearRegression` model is a predictor. A predictor has a `predict()` method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a `score()` method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).

- **Inspection**. All the estimator’s hyperparameters are accessible directly via public instance variables (e.g., `imputer.strategy`), and all the estimator’s learned parameters are also accessible via public instance variables with an underscore suffix (e.g., `imputer.statistics_`).

- **Nonproliferation of classes**. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.

- **Composition**. Existing building blocks are reused as much as possible. For example, it is easy to create a `Pipeline` estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.

- **Sensible defaults**. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
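The estimator, transformer, and predictor roles, together with the inspection principle, can be seen in a short sketch. The tiny arrays are made-up data used only to exercise the API.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Made-up data with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

imputer = SimpleImputer(strategy="median")   # hyperparameter set via the constructor
print(imputer.strategy)                      # inspection: hyperparameter as a public attribute

X_filled = imputer.fit_transform(X)          # estimator (fit) and transformer (transform)
print(imputer.statistics_)                   # inspection: learned parameter, underscore suffix

model = LinearRegression().fit(X_filled, y)  # estimator (fit) and predictor
prediction = model.predict(np.array([[5.0]]))
print(prediction)
print(model.score(X_filled, y))              # R^2 for regressors
```

Both objects are created with sensible defaults, take plain NumPy arrays, and expose the same `fit()` entry point.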


## Pipelines

The machine learning process often combines a series of transformers on raw data, transforming the dataset each step of the way until it is passed to the fit method of a final estimator. But if we don’t transform our data in the same exact manner, we will end up with wrong or, at the very least, unintelligible results. The Scikit-Learn [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) object is the solution to this dilemma.


Let’s look at how we can use the `Pipeline` class to express the workflow for training an SVC after scaling the data with `MinMaxScaler`.
First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of our choosing) and an instance of an estimator:

In [37]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Load iris dataset
iris = datasets.load_iris()

# Split in train and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, shuffle=True)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

Here, we created two steps: the first, called `"scaler"`, is an instance of `MinMaxScaler`, and the second, called `"svm"`, is an instance of `SVC`. Now, we can fit the pipeline, like any other scikit-learn estimator:

In [38]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('svm',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

Here, `pipe.fit` first calls `fit` on the first step (the scaler), then transforms the training data using the scaler, and finally fits the SVM with the scaled data. To evaluate on the test data, we simply call `pipe.score`:

In [39]:
print(f'Test score: {pipe.score(X_test, y_test)}')

Test score: 0.9473684210526315


Calling the `score` method on the pipeline first transforms the test data using the scaler, and then calls the `score` method on the SVM using the scaled test data. Using the pipeline, we reduced the code needed for our "preprocessing + classification" process. The main benefit of using the pipeline, however, is that we can now use this single estimator in `cross_val_score` or `GridSearchCV`.
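
As a minimal sketch of that last point, the pipeline can be passed to `GridSearchCV` like any other estimator. The parameter values in `param_grid`, the `random_state`, and the number of folds are illustrative assumptions, not tuned choices. Parameters of a step are addressed with the `<step name>__<parameter name>` syntax:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# Illustrative grid: "svm__C" refers to parameter C of the step named "svm"
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.01, 0.1, 1]}

# In each cross-validation split, the scaler is fit on the training folds only
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
```

Because the scaler is part of the estimator being searched, it is refit inside every split, so no information from the validation fold leaks into the scaling.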

The `Pipeline` class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification.
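
A short sketch of a longer pipeline, assuming a univariate feature selection step (`SelectKBest` with `f_classif`, keeping `k=2` features) placed before the scaler and the classifier. The step names and the value of `k` are arbitrary choices for illustration:

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

iris = datasets.load_iris()

# Three steps: feature selection -> scaling -> classification
pipe_select = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("scaler", MinMaxScaler()),
    ("svm", SVC()),
])

# The whole pipeline is cross-validated as a single estimator
scores = cross_val_score(pipe_select, iris.data, iris.target, cv=5)
print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
```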

The only requirement for estimators in a pipeline is that all but the last step need to have a `transform` method, so they can produce a new representation of the data that can be used in the next step.

Internally, during the call to `Pipeline.fit`, the pipeline calls `fit` and then `transform` on each step in turn (or just `fit_transform`), with the input given by the output of the `transform` method of the previous step. For the last step in the pipeline, just `fit` is called.

When predicting using `Pipeline`, we similarly transform the data using all but the last step and then call `predict` on the last step.
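
The fitting and prediction logic described above can be sketched by hand for the two-step scaler and SVM pipeline. This is a simplified illustration of the behaviour, not the library's actual implementation. The `random_state` is an assumption added for reproducibility:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Using the pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual equivalent of pipe.fit: fit_transform on the first step, fit on the last
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
svm = SVC().fit(X_train_scaled, y_train)

# Manual equivalent of pipe.predict: transform with the first step, predict with the last
manual_pred = svm.predict(scaler.transform(X_test))

print(f"Same predictions: {np.array_equal(pipe_pred, manual_pred)}")
```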

The process is illustrated in the figure below for two transformers, `T1` and `T2`, and a classifier (called `Classifier`).

![](https://i.ibb.co/wR13LzF/pipeline-training-and-prediction.png)

The pipeline is actually even more general than this. There is no requirement for the last step in a pipeline to have a `predict` method, and we could create a pipeline containing just, for example, a scaler and `PCA`. Then, because the last step (`PCA`) has a `transform` method, we could call `transform` on the pipeline to get the output of `PCA.transform` applied to the data that was processed by the previous step. The last step of a pipeline is only required to have a `fit` method.
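
A minimal sketch of such a transformer-only pipeline, assuming a `StandardScaler` followed by a `PCA` with two components (both choices are illustrative):

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# No classifier at the end: the last step is a transformer
pca_pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
pca_pipe.fit(iris.data)

# transform scales the data and then projects it onto the two principal components
X_reduced = pca_pipe.transform(iris.data)
print(f"Original shape: {iris.data.shape}")
print(f"Reduced shape: {X_reduced.shape}")
```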

**Accessing step attributes**: Often we will want to inspect attributes of one of the steps of the pipeline (for example, the coefficients of a linear model or the components extracted by PCA). The easiest way to access the steps in a pipeline is via the `named_steps` attribute, which is a dictionary from the step names to the estimators:

In [42]:
print(pipe.named_steps)

# Extract the Regularization parameter from the "svm" step
c = pipe.named_steps["svm"].C
print(f'\nRegularization parameter: {c}')

{'scaler': MinMaxScaler(copy=True, feature_range=(0, 1)), 'svm': SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)}

Regularization parameter: 1.0
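
The same approach also reaches attributes learned during `fit`. The sketch below assumes the scaler and SVM pipeline from above, refit here on the full Iris data so the block is self-contained. It also shows that steps can be retrieved by indexing the pipeline by name or by position:

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(iris.data, iris.target)

# Fitted attributes of the scaler: per-feature minimum and maximum seen during fit
scaler = pipe.named_steps["scaler"]
print(f"Per-feature minimum: {scaler.data_min_}")
print(f"Per-feature maximum: {scaler.data_max_}")

# Steps can also be accessed by name or by position
svm = pipe["svm"]
print(f"Support vectors shape: {svm.support_vectors_.shape}")
print(f"First step: {pipe[0]}")
```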


### Convenient pipeline creation with `make_pipeline`

Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, `make_pipeline`, that will create a pipeline for us and automatically name each step based on its class. The syntax for `make_pipeline` is as follows:

In [None]:
from sklearn.pipeline import make_pipeline
# Standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# Abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

The pipeline objects `pipe_long` and `pipe_short` do exactly the same thing, but `pipe_short` has steps that were automatically named. We can see the names of the steps by looking at the `steps` attribute:

In [None]:
print(f'Pipeline steps:\n{pipe_short.steps}')

Pipeline steps:
[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svc', SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]


The steps are named `minmaxscaler` and `svc`. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print(f'Pipeline steps:\n{pipe.steps}')

Pipeline steps:
[('standardscaler-1', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)), ('standardscaler-2', StandardScaler(copy=True, with_mean=True, with_std=True))]


However, in such settings it might be better to use the `Pipeline` construction with explicit names, to give more semantic names to each step.
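
For instance, the pipeline with two scalers could be written with explicit names. The names used here (`scale_features`, `scale_components`) are hypothetical, chosen only to describe what each step acts on:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Explicit names make it clear which scaler acts on which representation
pipe_named = Pipeline([
    ("scale_features", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("scale_components", StandardScaler()),
])

step_names = [name for name, _ in pipe_named.steps]
print(f"Step names: {step_names}")
```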

# References

- [Scikit-learn documentation](https://scikit-learn.org/stable/index.html)

- [Preprocessing data with Scikit-learn](https://scikit-learn.org/stable/modules/preprocessing.html)

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

- [Introduction to Machine Learning with Python](https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/)