<img src="../images/logo.png" align='right' width=250px>

# Custom Estimators

A Scikit-Learn [Estimator](https://scikit-learn.org/stable/developers/develop.html) implements a `fit` method to learn from data.

<img src="../images/oprah.jpeg" align="right">

We shall develop two types of Scikit-Learn Estimators:
- A (Predictor) Model (implements a `predict` method to make predictions and a `score` method to evaluate the "goodness of fit"),
- A Transformer (implements a `transform` method to transform a dataset).

This is the live-coded example used in the training. This notebook has been created to allow students to read back through the example.




---
# Build our own Model

We are going to build a Model that, for each row we'd like to predict, finds the row of the initial training dataset with the smallest Euclidean distance to it and returns that row's label.

That means when we fit our Model, we want to 'lock in' the X_train data as a fitted variable, so that when we pass in new data, X_test say, we can compare it with the original X_train data.

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import euclidean_distances

from sklearn.utils.validation import check_X_y, check_is_fitted, check_array

import numpy as np
import seaborn as sns

## Break down the predict method with a small example

Firstly, let's look at what we're doing - the euclidean distances - with a small sample of data:

In [None]:
train_data = np.array([[1, 2, 3], [4, 5, 6], [0, 1, 2], [1, 0, 1]])


train_labels = np.array(["a", "b", "c", "d"])

new_data = np.array([[0, 1, 2], [4, 4, 6]])

**What is Euclidean distance?**

The Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
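
For reference, the usual formula can be written as follows. The symbols $p$, $q$ and $n$ are notation introduced here for illustration: $p = (p_1, ..., p_n)$ and $q = (q_1, ..., q_n)$ are two points with $n$ coordinates each.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$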

Let's demonstrate this with the first rows of our new_data and our train_data: `[0, 1, 2]` and `[1, 2, 3]`

In [None]:
# Euclidean distances of [0, 1, 2] and [1, 2, 3]

((0 - 1) ** 2 + (1 - 2) ** 2 + (2 - 3) ** 2) ** 0.5

**Now let's work out all the Euclidean distances**

We can use the function `euclidean_distances` to calculate the distances between all rows in the train_data and new_data. Note that for each row in the new_data we will get 4 Euclidean distances: one to each of the 4 rows in the original train_data. The result is therefore a matrix with one row per new_data row and one column per train_data row.

In [None]:
distances = euclidean_distances(new_data, train_data)
print(distances)
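
As a cross-check, the same matrix can be reproduced with plain NumPy broadcasting. This is only a sketch to show what the distance matrix contains; the names `diffs`, `manual_distances` and `sk_distances` are made up for illustration, and it is not a claim about how `euclidean_distances` is implemented internally.

```python
import numpy as np
from sklearn.metrics import euclidean_distances

train_data = np.array([[1, 2, 3], [4, 5, 6], [0, 1, 2], [1, 0, 1]])
new_data = np.array([[0, 1, 2], [4, 4, 6]])

# Pairwise differences via broadcasting: (2, 1, 3) - (1, 4, 3) -> (2, 4, 3)
diffs = new_data[:, np.newaxis, :] - train_data[np.newaxis, :, :]

# Square, sum over the feature axis, take the square root -> shape (2, 4)
manual_distances = np.sqrt((diffs ** 2).sum(axis=-1))

sk_distances = euclidean_distances(new_data, train_data)

print(manual_distances.shape)
print(np.allclose(manual_distances, sk_distances))
```

Each row of the result belongs to one row of new_data, and each column to one row of train_data.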

Now we want to select the smallest distance in each row. However, we don't want the value itself; we want its index location.

**Question: Why do we want the index location and not the value?**

In [None]:
closest_index = np.argmin(distances, axis=1)
closest_index

**Answer**: The index location tells us which training row is closest, so we can use it to select the matching entries from our train_labels:

In [None]:
train_labels[closest_index]

Now we have our predictions!

In [None]:
predictions = train_labels[closest_index]

In [None]:
predictions

## Build the Model

Now that we understand the mathematics we are aiming to perform, let's put this into a Model.

Our code in full is as follows:

In [None]:
# find all distances
distances = euclidean_distances(new_data, train_data)

# get location of smallest distances
closest_index = np.argmin(distances, axis=1)

# filter our train_labels to get predictions
predictions = train_labels[closest_index]

To build our Model, we'll inherit from the [`BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#examples-using-sklearn-base-baseestimator) class. Amongst other things, this provides a default implementation for the `get_params()` and `set_params()` methods. This makes the model searchable with `GridSearchCV` for automated parameter tuning, and lets it behave well when combined with other estimators in a `Pipeline`.

We'll also inherit from [`ClassifierMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html), a Mixin class for all classifiers in scikit-learn. This provides us a default implementation of the `score()` method, which will return the mean accuracy on the given test data and labels.
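
To make the inherited methods concrete, here is a minimal sketch. `MajorityClassifier` is a hypothetical toy class (not the Model we build below) that always predicts the most frequent training label; `X_demo` and `y_demo` are made-up data. It only exists to show `get_params()`, `set_params()` and `score()` arriving for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score


class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical toy classifier: always predicts the most frequent label."""

    def __init__(self, k=1):
        # k is unused here; it only demonstrates get_params / set_params
        self.k = k

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 0, 1])

toy = MajorityClassifier(k=3).fit(X_demo, y_demo)

# get_params / set_params come from BaseEstimator
params = toy.get_params()
toy.set_params(k=7)

# score comes from ClassifierMixin and matches accuracy_score
mixin_score = toy.score(X_demo, y_demo)
manual_score = accuracy_score(y_demo, toy.predict(X_demo))
print(params, toy.k, mixin_score, manual_score)
```

Neither `get_params`, `set_params` nor `score` is defined in the class body; all three are inherited.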

Our custom Model needs:

- `__init__` - here we state any hyperparameters*
- `fit` - we need to fit our training data and training labels to use in the next method
- `predict` - find the distances and locate the predictions from the training labels


*we don't need to include any right now since we are finding the smallest distance. Let's include a redundant parameter `k` for example purposes. This won't be used now but could be useful if we wanted to build a k-nearest neighbour algorithm.
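
For the curious, one way `k` could eventually be used is sketched below, assuming a simple majority vote among the `k` closest training rows. `knn_predict` is a hypothetical helper written for illustration only, and the demo arrays are made up; it is not part of the Model we build in this notebook.

```python
import numpy as np
from sklearn.metrics import euclidean_distances


def knn_predict(X_train, y_train, X_new, k=3):
    """Hypothetical helper: majority vote among the k closest training rows."""
    distances = euclidean_distances(X_new, X_train)

    # indices of the k smallest distances in each row
    nearest = np.argsort(distances, axis=1)[:, :k]

    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        # ties go to the label that sorts first
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_demo = np.array(["a", "a", "a", "b", "b"])
X_query = np.array([[0.2, 0.2], [5.0, 5.4]])

votes_k3 = knn_predict(X_demo, y_demo, X_query, k=3)
votes_k1 = knn_predict(X_demo, y_demo, X_query, k=1)
print(votes_k3, votes_k1)
```

With `k=1` this reduces to the smallest-distance rule used in the rest of this section.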

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array


class CustomClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):

        # validate X and y (shape, consistent lengths) and keep the validated arrays
        X, y = check_X_y(X, y)

        self.X_ = X
        self.y_ = y

        # should always return the object itself
        return self

    def predict(self, X):

        # validate X and keep the validated array
        X = check_array(X)

        # check has the model been fitted
        check_is_fitted(self)

        distances = euclidean_distances(X, self.X_)
        closest_index = np.argmin(distances, axis=1)
        predictions = self.y_[closest_index]
        return predictions

## Run Model on Large Data

Let's use the make_blobs function to create some new, bigger data:

In [None]:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=2, random_state=0)

In [None]:
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)

Now let's follow our classic model build steps:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=111)

In [None]:
# here we can alter the value of k, but it won't change the model at this moment
model = CustomClassifier(k=5)

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

In [None]:
model.score(X_test, y_test)

---
<img src='https://tfwiki.net/mediawiki/images2/thumb/3/37/Optimusg1.jpg/350px-Optimusg1.jpg' align='right' width=200px>

# Build our own Transformer

We are now going to build a mean imputer Transformer. The steps we will need to take:

- calculate the mean for each column
- identify the location of the missing values
- overwrite the missing values with *their corresponding column mean*

## Break down the transformation with smaller data

Let's work through this with a small sample of data to understand the process. 
<!-- Since this is a transformer, we don't need some new_data or any labels: -->

In [None]:
# data with missing values in:

train_data = np.array([[1, 2, 3], [np.nan, np.nan, 6], [0, 1, 2], [np.nan, 0, 1]])

Now we get the mean of each column, disregarding the missing values:

In [None]:
means = np.nanmean(train_data, axis=0)
print(f"Mean value for each column: {means}\n")
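
To see why `np.nanmean` is needed here rather than `np.mean`, the short sketch below compares the two on the same data. The variable names `plain_means` and `nan_means` are introduced only for this illustration.

```python
import numpy as np

train_data = np.array([[1, 2, 3], [np.nan, np.nan, 6], [0, 1, 2], [np.nan, 0, 1]])

# np.mean propagates NaN: any column containing a missing value becomes NaN
plain_means = np.mean(train_data, axis=0)

# np.nanmean skips the NaN entries and averages the remaining values
nan_means = np.nanmean(train_data, axis=0)

print(plain_means)
print(nan_means)
```

Column 0 averages only the values 1 and 0, column 1 averages 2, 1 and 0, and column 2 has no missing values, so both functions agree on it.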

We can use the function `np.isnan()`, together with `np.where()`, to find the row and column locations of our missing values. Note that for an `m x n` matrix with $k$ missing values, the output will be a tuple of two arrays:

$$\big(\, array(x_1, x_2, ..., x_k),\ array(y_1, y_2, ..., y_k) \,\big)$$

where each $x_i$ is the row and the corresponding $y_i$ is the column of one missing value in the original matrix.

For example:

In [None]:
missing_locations = np.where(np.isnan(train_data))
print(f"Array of Rows and Columns where we see missings: {missing_locations}\n")

The first array states there are missing values in rows 1 and 3, the second states in columns 0 and 1. 

The exact locations of the three missing values are:
- `(1, 0)`
- `(1, 1)`
- `(3, 0)`
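
These coordinate pairs can be recovered in code by zipping the two arrays together. This is a small sketch; `pairs` and `coords` are names chosen for illustration, and `np.argwhere` is shown as an alternative that returns one `[row, column]` pair per missing value directly.

```python
import numpy as np

train_data = np.array([[1, 2, 3], [np.nan, np.nan, 6], [0, 1, 2], [np.nan, 0, 1]])

# np.where returns one array of row indices and one array of column indices
rows, cols = np.where(np.isnan(train_data))

# pair them up element by element to get (row, column) locations
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(pairs)

# np.argwhere gives the same information as a 2D array of coordinates
coords = np.argwhere(np.isnan(train_data))
print(coords)
```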

We **only need to know the columns in which missing values occurred**, since we will replace each missing value with its column mean, regardless of which row it sits in.

So, let's select only the second array using `missing_locations[1]`

In [None]:
cols_with_miss = missing_locations[1]

We can also use the missing locations to select the missing values themselves. This is vital, because these are the entries we will overwrite:

In [None]:
print(
    f"These missings: {train_data[missing_locations]} correspond to columns {cols_with_miss}\n"
)

Now we need to use the `cols_with_miss` to select the corresponding mean values for those columns. We can use the `np.take()` function for this:

In [None]:
values = np.take(means, cols_with_miss)
print(
    f"These missings: {train_data[missing_locations]} will be replaced with {values}\n"
)
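
For a 1D array of means, `np.take` does the same job as indexing with an integer array. The sketch below uses hard-coded copies of the `means` and `cols_with_miss` values from above so that it runs on its own.

```python
import numpy as np

# column means and the column index of each missing value (copied from above)
means = np.array([0.5, 1.0, 3.0])
cols_with_miss = np.array([0, 1, 0])

# np.take picks means[0], means[1], means[0]
taken = np.take(means, cols_with_miss)

# integer-array indexing gives the same result
indexed = means[cols_with_miss]

print(taken)
print(np.array_equal(taken, indexed))
```

Either form yields one replacement value per missing entry, in the same order as `missing_locations`.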

Now we have all we need! Let's create a copy of our data so we can overwrite those missing values:

In [None]:
copy = train_data.copy()
copy[missing_locations] = np.take(means, cols_with_miss)
print("Original data:", end="\n\n")
print(train_data, end="\n\n")
print("Final output:", end="\n\n")
print(copy, end="\n\n")

## Now build with the Transformer

Let's put all of that in a Transformer. Here was our code in full:

In [None]:
copy = train_data.copy()

means = np.nanmean(train_data, axis=0)
missing_locations = np.where(np.isnan(train_data))
cols_with_miss = missing_locations[1]
copy[missing_locations] = np.take(means, cols_with_miss)
copy

We shall inherit from [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), a Mixin class for all Transformers in scikit-learn. This provides us with a default implementation of the `fit_transform()` method, which fits to the data and then transforms it.

In a Transformer we need:

- `__init__` - this time we don't need any parameters
- `fit` - here we learn the column means, so that these always come from the train data
- `transform` - this is where we locate the missing values and their columns, and output a **copy** of the data with the missing values overwritten

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
class CustomMeanImputer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

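    def fit(self, X, y=None):
        # y is ignored; accepting it keeps fit() usable inside a Pipeline and via fit_transform(X, y)
        self.means_ = np.nanmean(X, axis=0)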
        # ALWAYS return the object itself (self)
        return self

    def transform(self, X):

        # check has the model been fitted
        check_is_fitted(self)

        X = X.copy()
        inds = np.where(np.isnan(X))
        X[inds] = np.take(self.means_, inds[1])
        return X

In [None]:
custom_mean_imputer = CustomMeanImputer()


training_data = np.array([[1, 2, 3], [np.nan, np.nan, 6], [0, 1, 2], [np.nan, 0, 1]])

custom_mean_imputer.fit(training_data)

In [None]:
custom_mean_imputer.transform(training_data)

In [None]:
custom_mean_imputer.fit_transform(training_data)
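
As a sanity check, the same logic can be compared against scikit-learn's built-in `SimpleImputer(strategy="mean")`. This is a sketch under a few assumptions: `unseen_data` is a made-up array standing in for test data, and the manual steps mirror the `fit` and `transform` methods above rather than calling the class, so that the block runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

training_data = np.array([[1, 2, 3], [np.nan, np.nan, 6], [0, 1, 2], [np.nan, 0, 1]])

# made-up data that was not seen during fitting
unseen_data = np.array([[np.nan, 4.0, np.nan], [2.0, np.nan, 5.0]])

# "fit": learn the column means from the training data only
means = np.nanmean(training_data, axis=0)

# "transform": fill the gaps in a copy of the unseen data with the training means
manual = unseen_data.copy()
rows, cols = np.where(np.isnan(manual))
manual[rows, cols] = means[cols]

# reference result from scikit-learn's own mean imputer
reference = SimpleImputer(strategy="mean").fit(training_data).transform(unseen_data)

print(manual)
print(np.allclose(manual, reference))
```

Note that the missing value in the third column of `unseen_data` is filled with the training mean of that column, even though the training data had nothing missing there.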

Now let's run through this with larger data. This wasn't covered in the live-coding session, but it is useful to see the Transformer work on a bigger dataset.

Note that the Transformer will only work on data that has more than 1 column.

In [None]:
def create_eg_X_y(rows=100, cols=2, percent_miss=0.1, random_state=100):
    if cols < 2:
        raise ValueError("Must create X with more than 1 column")
    # seed the generator once so that random_state controls the example data
    np.random.seed(random_state)
    X = np.random.random_sample(size=(rows, cols))
    X[X < percent_miss] = np.nan

    y = (X.prod(axis=1) > 0.5) * 1
    return X, y


X, y = create_eg_X_y()
X[:10]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=111)

In [None]:
imputer = CustomMeanImputer()

In [None]:
X_train_transformed = imputer.fit_transform(X_train)

In [None]:
X_test_transformed = imputer.transform(X_test)

In [None]:
X_train[:5], X_train_transformed[:5]

In [None]:
X_test[:10], X_test_transformed[:10]