# MNIST using scikit-learn and SuperDuperDB

In a [previous example](mnist_torch.html) we discussed how to implement MNIST classification with CNNs in `torch`
using SuperDuperDB. Here we classify the same dataset with a support vector machine from `scikit-learn`.

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn import svm

As before, we'll import the Python MongoDB client `pymongo`
and "wrap" our database to convert it to a SuperDuperDB `Datalayer`:

In [2]:
import pymongo
from superduperdb import superduper

db = pymongo.MongoClient().documents

db = superduper(db)


As last time, we can add data to SuperDuperDB in a way that is very similar to using `pymongo`.
This time, we'll add the data to SuperDuperDB as `numpy.array` objects, using the `Document`-`Encoder` formalism:

In [4]:
from superduperdb.encoders.numpy.array import array
from superduperdb.core.document import Document as D
from superduperdb.datalayer.mongodb.query import Collection

mnist = fetch_openml('mnist_784')
ix = np.random.permutation(10000)
X = np.array(mnist.data)[ix, :]
y = np.array(mnist.target)[ix].astype(int)

a = array('float64', shape=(784,))

collection = Collection(name='mnist')

data = [D({'img': a(X[i]), 'class': int(y[i])}) for i in range(len(X))]

db.execute(
    collection.insert_many(data, encoders=[a])
)

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x1cf339810>,
 TaskWorkflow(database=<superduperdb.datalayer.base.datalayer.Datalayer object at 0x199ffff10>, G=<networkx.classes.digraph.DiGraph object at 0x19a118e50>))
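
To make the encoder idea concrete, here is a minimal sketch of the round trip an array encoder has to perform: turn a `numpy` array into bytes that a database can store, then rebuild the array from those bytes using a known dtype and shape. The helper names `encode_array` and `decode_array` are hypothetical, and this is not SuperDuperDB's actual implementation. It only illustrates why the encoder above is declared with a dtype (`'float64'`) and a shape (`(784,)`).

```python
import numpy as np

DTYPE = 'float64'
SHAPE = (784,)


def encode_array(x):
    """Hypothetical helper: serialize an array to raw bytes."""
    return np.asarray(x, dtype=DTYPE).tobytes()


def decode_array(raw):
    """Hypothetical helper: rebuild the array from raw bytes.

    Raw bytes carry no dtype or shape, so both must be known in advance.
    """
    return np.frombuffer(raw, dtype=DTYPE).reshape(SHAPE)


img = np.random.rand(*SHAPE)   # stand-in for one flattened MNIST image
raw = encode_array(img)        # bytes, suitable for storing in a document
restored = decode_array(raw)   # identical to the original array
```

Because `float64` uses 8 bytes per value, each encoded image in this sketch occupies 784 × 8 = 6272 bytes.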

In [5]:
db.execute(collection.find_one())

Document({'_id': ObjectId('64bf09b11c6729099727c433'), 'img': Encodable(x=array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         ...

Models are built in the same way as the `Datalayer`: by wrapping a model from the standard Python AI ecosystem, here a `scikit-learn` support vector classifier:

In [6]:
model = superduper(
    svm.SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x)
)

Now let's fit the model. The optimization uses scikit-learn's built-in training procedure.
Unlike in a standard `sklearn` use case, we don't need to fetch the data client-side. Instead,
we simply name the fields in the MongoDB collection that we'd like to use.

In [7]:
model.fit(X='img', y='class', db=db, select=collection.find())

100%|██████████| 9511/9511 [00:00<00:00, 253008.01it/s]


[LibSVM]*
optimization finished, #iter = 235
obj = -23.988352, rho = -0.470720
nSV = 107, nBSV = 0
*
optimization finished, #iter = 641
obj = -84.873820, rho = 0.216024
nSV = 245, nBSV = 0
*
optimization finished, #iter = 622
obj = -71.442951, rho = 0.109371
nSV = 234, nBSV = 0
*
optimization finished, #iter = 425
obj = -47.762863, rho = -0.148560
nSV = 166, nBSV = 0
*
optimization finished, #iter = 733
obj = -89.869625, rho = 0.143424
nSV = 269, nBSV = 0
*
optimization finished, #iter = 592
obj = -88.511741, rho = -0.029409
nSV = 222, nBSV = 0
*
optimization finished, #iter = 465
obj = -49.685648, rho = -0.110425
nSV = 183, nBSV = 0
*
optimization finished, #iter = 600
obj = -74.241001, rho = -0.037598
nSV = 237, nBSV = 0
*
optimization finished, #iter = 505
obj = -71.490694, rho = -0.253592
nSV = 197, nBSV = 0
*
optimization finished, #iter = 637
obj = -129.249310, rho = 0.687340
nSV = 181, nBSV = 0
*
optimization finished, #iter = 569
obj = -102.681731, rho = 0.752286
nSV = 172, nBS
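
For comparison, the conventional client-side workflow that `model.fit(X='img', y='class', ...)` replaces looks roughly like the sketch below: pull the two fields out of each record, stack them into arrays, and call the estimator's own `fit`. The records here are synthetic random stand-ins (not MNIST), the field names mirror the collection above, and the final `int(...)` mirrors the `postprocess` step. How SuperDuperDB performs these steps internally is not shown in this example, so treat this purely as an illustration.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)

# Synthetic stand-in records with the same fields as the collection
# (random data for illustration only, not MNIST).
records = []
for i in range(60):
    label = i % 3
    img = rng.normal(loc=label * 5.0, scale=0.5, size=784)
    records.append({'img': img, 'class': label})

# Client-side equivalent of naming X='img' and y='class'.
X = np.stack([r['img'] for r in records])
y = np.array([r['class'] for r in records])

clf = svm.SVC(gamma='scale', class_weight='balanced', C=100)
clf.fit(X, y)

# predict returns a numpy integer; int(...) converts it to a plain
# Python int, as the postprocess lambda does.
prediction = int(clf.predict(X[:1])[0])
```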

Installed models and functionality can be viewed using `db.show`:

In [8]:
db.show('model')

['svc']

The model can be reloaded from the database in another session.
As with `.fit`, the model can be applied to data in the database with `.predict`:

In [9]:
m = db.load('model', 'svc')
m.predict(X='img', db=db, select=collection.find(), max_chunk_size=3000)



Computing chunk 0/3
Computing chunk 3000/3
Computing chunk 6000/3
Computing chunk 9000/3
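
The log above is consistent with the selected documents being processed in slices of at most `max_chunk_size`: with 10,000 documents and a chunk size of 3,000, slices start at offsets 0, 3000, 6000, and 9000. A minimal sketch of that pattern, assuming a plain list of IDs and a hypothetical `iter_chunks` helper (this is not the library's internal code):

```python
def iter_chunks(items, max_chunk_size):
    """Hypothetical helper: yield (start_offset, chunk) pairs,
    each chunk holding at most max_chunk_size items."""
    for start in range(0, len(items), max_chunk_size):
        yield start, items[start:start + max_chunk_size]


ids = list(range(10000))  # stand-in for the selected document IDs
chunks = list(iter_chunks(ids, max_chunk_size=3000))

starts = [start for start, _ in chunks]
sizes = [len(chunk) for _, chunk in chunks]
```

Processing in chunks bounds how many inputs are held in memory at once; the last chunk simply holds whatever remains.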


We can verify that the predictions make sense by fetching a random data point from the validation fold and comparing its label with the stored prediction:

In [30]:
r = next(db.execute(collection.aggregate([
    {'$match': {'_fold': 'valid'}},
    {'$sample': {'size': 1}},
])))
print(r['class'])
print(r['_outputs'])

9
{'img': {'svc': 9}}
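
A single sample is only a spot check. Since each document stores its label under `class` and its prediction under `_outputs.img.svc`, a score over the whole validation fold can be computed with `accuracy_score`, which was imported at the top. The sketch below uses a few hand-written records in place of a query result; their values are made up for illustration and are not results from this model.

```python
from sklearn.metrics import accuracy_score

# Made-up records shaped like the documents above (not real results).
docs = [
    {'class': 9, '_fold': 'valid', '_outputs': {'img': {'svc': 9}}},
    {'class': 4, '_fold': 'valid', '_outputs': {'img': {'svc': 4}}},
    {'class': 7, '_fold': 'valid', '_outputs': {'img': {'svc': 1}}},
    {'class': 3, '_fold': 'train', '_outputs': {'img': {'svc': 3}}},
]

# Keep only the validation fold, then line up labels and predictions.
valid = [d for d in docs if d['_fold'] == 'valid']
y_true = [d['class'] for d in valid]
y_pred = [d['_outputs']['img']['svc'] for d in valid]

accuracy = accuracy_score(y_true, y_pred)
```

With a live database, `docs` would instead come from a query over the collection, for example by iterating over the documents whose `_fold` is `'valid'`.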
