<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn: Mimicking the Estimator API

As we have seen, Scikit-Learn has some infrastructure which helps to make ML projects run smoothly in practice, e.g.:

* Appropriate `score` methods for various models;
* Cross-validation, e.g. `cross_val_score`;
* Pipelines.


If we are writing a new ML algorithm, it is natural to try to conform to the Scikit-Learn API so that our code will work well with this infrastructure. For example, if we conform to the Estimator API, then we can put our new algorithm in a list with other models and compare them.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
class FabNewClassifier:
    pass  # placeholder: a real class must provide fit, predict and score
clfs = [LogisticRegression, SVC, FabNewClassifier]
for clf in clfs:
    clf = clf()
    clf.fit(X, y)
    print(clf.score(X, y))
```

To make this concrete, we're going to: 

1. Invent a new classifier and implement it in Python with Numpy. 
2. Refactor it as a class with `fit`, `predict`, and `score` methods, which allows the above polymorphism to work.
3. *Inherit* from Scikit-Learn base classes to get uniformity and to get functionality "for free". 

In [67]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.spatial.distance  # a bare "import scipy" may not load this submodule
%matplotlib inline

In [68]:
iris = sns.load_dataset("iris")
X = iris.drop("species", axis=1).values
y = iris["species"].values

The classifier we'll implement is $1$-nearest neighbours. 

In [69]:
def one_nn(X, y, Q):
    D = scipy.spatial.distance.cdist(X, [Q])
    nearest = np.argmin(D)
    print("Query", Q)
    print("nearest", nearest)
    print("X[nearest]", X[nearest])
    print("D[nearest]", D[nearest])
    print("y[nearest]", y[nearest])
    return y[nearest]

We can write it as a single function because $k$-NN has no separate training phase. We use `scipy.spatial.distance.cdist` to calculate all distances from the query `Q` to each point in the training data `X`. Then we return the `y` label of whichever point in `X` is nearest to `Q`. We print out everything that's happening just to help explain, but of course in real code it should just return a value, not print anything.
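
As a minimal sketch of the same lookup without the diagnostic prints, here is a version run on a made-up four-point dataset. The name `one_nn_quiet` and the toy arrays are illustrative assumptions, not part of the original code, and the distances are computed with plain NumPy instead of `cdist` so that the block is self-contained.

```python
import numpy as np

def one_nn_quiet(X, y, Q):
    """Return the label of the training point closest to the query Q."""
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Euclidean distance from Q to every row of X
    D = np.sqrt(((X - Q) ** 2).sum(axis=1))
    return y[np.argmin(D)]

# Hypothetical toy data: two points per class
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array(["a", "a", "b", "b"])

label = one_nn_quiet(X_toy, y_toy, [4.0, 4.0])
print(label)
```

The query `[4.0, 4.0]` is closest to the training point `[5.0, 5.0]`, so the function returns that point's label.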

In [70]:
Q = [5.75, 2.0, 4.0, 1.5]
one_nn(X, y, Q)

Query [5.75, 2.0, 4.0, 1.5]
nearest 53
X[nearest] [5.5 2.3 4.  1.3]
D[nearest] [0.43874822]
y[nearest] versicolor


'versicolor'

### Refactoring to mimic the Estimator API

We already know enough to refactor our $1$-NN as an Estimator class. We have to provide:

* `fit(X, y)`
* `predict(X)`
* `score(X, y)`

In [40]:
class OneNN:
    def fit(self, X, y):
        self.X = X
        self.y = y
    def predict(self, X):
        D = scipy.spatial.distance.cdist(self.X, X)
        nearest = np.argmin(D, axis=0)
        return self.y[nearest]
    def score(self, X, y):
        fX = self.predict(X)
        return np.mean(fX == y) # accuracy

`OneNN` is now a class. Its `fit` method only stores `X` and `y`, because $1$-NN has no actual training to do.

The biggest change here is that in Scikit-Learn, `predict` should accept a 2D array of query points, not a single point. The distance matrix `D` now has one column per query point, so we've added `axis=0` to make `argmin` return one nearest-neighbour index per query.
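
To see the effect of `axis=0`, it helps to look at the shape of the distance matrix. The sketch below uses made-up data and builds the matrix with NumPy broadcasting rather than `cdist`. The layout is the same as that of `cdist(self.X, X)`: one row per training point and one column per query.

```python
import numpy as np

# Hypothetical toy data: 3 training points, 2 query points
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]])
Q = np.array([[0.9, 0.1], [3.0, 3.0]])

# D[i, j] = distance from training point i to query j, shape (3, 2)
D = np.sqrt(((X_train[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2))

nearest_per_query = np.argmin(D, axis=0)  # one row index per column (query)
nearest_overall = np.argmin(D)            # one position in the flattened matrix

print(D.shape)
print(nearest_per_query)
print(nearest_overall)
```

Without `axis=0`, `argmin` returns a single position in the flattened matrix, which is no longer a usable row index once there is more than one query.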

In [43]:
onenn = OneNN()
onenn.fit(X, y)
Q3 = np.array([[5.75,  2,   4, 1.5],
               [5.0, 2.5, 3.4, 1.6],
               [4.6, 2.8, 2.2, 1.7]])
onenn.predict(Q3)

array(['versicolor', 'versicolor', 'versicolor'], dtype=object)

Our `score` method calculates accuracy. For example, $1$-NN always achieves an accuracy of 1 on its own training data, provided no two identical points carry different labels. (Why?)

In [None]:
print(onenn.score(X[:3], y[:3]))

### The Estimator API and inheritance

However, our `OneNN` is still not quite compatible with Scikit-Learn. There are some extra details which help to keep things uniform and make our lives easier. For a start, estimators should inherit from `sklearn.base.BaseEstimator`.


In [59]:
from sklearn.base import BaseEstimator, ClassifierMixin
class OneNN(ClassifierMixin, BaseEstimator):  # mixins go to the left of BaseEstimator
    def fit(self, X, y):
        self.X = X
        self.y = y
        return self
    def predict(self, X):
        D = scipy.spatial.distance.cdist(self.X, X)
        nearest = np.argmin(D, axis=0)
        return self.y[nearest]    

A *Mixin* is a class M which is designed for the *multiple inheritance* scenario where a class C is designed to inherit from M and from some other class D as well. In other words, C is composed of M "mixed-in" with D.

In Scikit-Learn, there is (e.g.) a `ClassifierMixin` which has the `score` behaviour. This makes sense, because all classifiers should share the same `score` behaviour: our `OneNN` class should not implement a custom version.
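
The mixin idea can be illustrated without Scikit-Learn. In this sketch all class names are hypothetical: `AccuracyMixin` plays the role of `ClassifierMixin`, supplying `score` to any class that defines `predict`, and `Base` stands in for the other parent class.

```python
class AccuracyMixin:
    """Hypothetical mixin: adds score() to any class that has predict()."""
    def score(self, X, y):
        predictions = self.predict(X)
        correct = sum(p == t for p, t in zip(predictions, y))
        return correct / len(y)

class Base:
    """Hypothetical stand-in for a common base class."""
    def describe(self):
        return type(self).__name__

class AlwaysZero(AccuracyMixin, Base):
    """Toy classifier that predicts 0 for every input."""
    def fit(self, X, y):
        return self
    def predict(self, X):
        return [0 for _ in X]

X_demo = [[1], [2], [3], [4]]
y_demo = [0, 0, 1, 0]
clf = AlwaysZero().fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))  # score() comes from the mixin
print(clf.describe())             # describe() comes from Base
```

`AlwaysZero` defines neither `score` nor `describe` itself. It gets one from each parent, which is the sense in which the mixin is "mixed in".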

In [74]:
Q3 = np.array([[5.75,  2,   4, 1.5],
               [5.0, 1.5, 2.4, 1.6],
               [4.6, 2.8, 2.2, 1.7]])

onenn = OneNN().fit(X, y)
onenn.predict(Q3)

array(['versicolor', 'virginica', 'virginica'], dtype=object)

In [65]:
onenn.score(X[:3], y[:3])

1.0
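
Inheriting from `BaseEstimator` also provides `get_params` and `set_params`, which is what allows helpers such as `cross_val_score` to clone and re-fit an estimator. The following is a sketch under some assumptions: `OneNNSketch` is a hypothetical NumPy-only variant of the class above, it loads the data with `sklearn.datasets.load_iris` rather than Seaborn, and it stores the training data in attributes with a trailing underscore, following the Scikit-Learn naming convention for attributes set in `fit`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

class OneNNSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical 1-NN written with NumPy only."""
    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # squared distances, shape (n_train, n_queries)
        D = ((self.X_[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return self.y_[np.argmin(D, axis=0)]

X_iris, y_iris = load_iris(return_X_y=True)
# score() is inherited from ClassifierMixin and used for each fold
scores = cross_val_score(OneNNSketch(), X_iris, y_iris, cv=5)
print(scores.mean())
```

Unlike the training-set score above, each fold here is scored on data the model has not stored, so the result is a fairer estimate of accuracy.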

### More details of the API

Here is a flavour of the "rules" of the API which are designed to keep things uniform and make both users' and developers' lives easier.

* The arguments of `__init__` should be keyword arguments with defaults. So, calling `C()` (no arguments) will work.
* In order to fit in pipelines, even unsupervised estimators need to accept `y=None` in `fit`.

* The `fit` method should return `self`. This allows a nice "chained" usage `OneNN().fit(X, y).predict(Q3)`
* "In iterative algorithms, the number of iterations should be specified by an integer called `n_iter`."

* "You can check whether your estimator adheres to the scikit-learn interface and standards by running `utils.estimator_checks.check_estimator` on the class"
* There is a template for new "contrib" projects: https://github.com/scikit-learn-contrib/project-template/

* https://scikit-learn.org/dev/developers/develop.html
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.base
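
The `__init__` and `fit` rules above can be sketched with a small example. The class `MinkowskiOneNN` and its hyperparameter `p` are hypothetical, introduced only for illustration. The constructor stores its keyword argument unchanged under the same name, which is what lets the inherited `get_params` report it.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MinkowskiOneNN(ClassifierMixin, BaseEstimator):
    """Hypothetical 1-NN with one hyperparameter, the Minkowski order p."""
    def __init__(self, p=2):
        # keyword argument with a default, stored as-is
        self.p = p
    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self  # returning self enables chaining
    def predict(self, X):
        diff = self.X_[:, None, :] - np.asarray(X, dtype=float)[None, :, :]
        # distances of order p, shape (n_train, n_queries)
        D = np.linalg.norm(diff, ord=self.p, axis=2)
        return self.y_[np.argmin(D, axis=0)]

X_toy = np.array([[0.0, 0.0], [3.0, 3.0]])
y_toy = np.array([0, 1])

params = MinkowskiOneNN().get_params()  # works with no arguments
pred = MinkowskiOneNN(p=1).fit(X_toy, y_toy).predict([[2.5, 2.0]])
print(params)
print(pred)
```

Because `fit` returns `self`, construction, fitting and prediction can be written as one chained expression, as in the last assignment.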