# Data Science 101
## [Classifier Models](https://scikit-learn.org/stable/modules/tree.html#classification)

We'll use [scikit-learn](https://scikit-learn.org/stable/) to create a model that classifies samples into 3 classes. The small, public Iris dataset used below is a good mockup of our cGMP datasets for the biosciences.

A classifier model separates 2 or more classes. For example, the blue and red lines below separate the filled circles from the empty circles. Each line would be a unique classifier (some are better than others as you can see).

![classifier image](https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Svm_separating_hyperplanes.png/1024px-Svm_separating_hyperplanes.png)
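
To make the idea of "a line as a classifier" concrete, here is a minimal sketch in plain Python. The points, labels, and line coefficients are made up for illustration and are not taken from the figure. Each candidate line is written as `w[0]*x1 + w[1]*x2 + b = 0`, and a point is assigned to class 1 when it falls on the positive side.

```python
# Hypothetical 2-D points: label 1 = filled circle, label 0 = empty circle.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (4.0, 4.5), (4.5, 4.0), (5.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]


def classify(point, w, b):
    """Return 1 if the point lies on the positive side of the line w.x + b = 0."""
    x1, x2 = point
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def accuracy(w, b):
    """Fraction of points that the line (w, b) labels correctly."""
    hits = sum(classify(p, w, b) == y for p, y in zip(points, labels))
    return hits / len(points)


# Two candidate lines (made-up coefficients).
good_line = ((1.0, 1.0), -6.0)  # x1 + x2 = 6 separates the two groups
poor_line = ((1.0, 0.0), -1.8)  # x1 = 1.8 cuts through the empty circles

print(f"good line accuracy: {accuracy(*good_line):.2f}")
print(f"poor line accuracy: {accuracy(*poor_line):.2f}")
```

Training a linear classifier amounts to searching for the coefficients `w` and `b` that make as few mistakes as possible.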

## Training a machine learning model to predict the class of a sample

We'll use [scikit-learn](https://scikit-learn.org) (aka sklearn) and [Dask](https://www.dask.org/) to build and train our classifier model.

### Scikit-learn - The Data Scientist's pocket knife

There are essentially 4 steps to building a model with scikit-learn:
1. Build the dataloader & get the train/test/validation splits
2. Construct the model
3. Train the model on the train split
4. Evaluate the model predictions on the test split
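
The four steps can be sketched for a dataset that fits in memory roughly as follows. This is only an illustration: the choice of `DecisionTreeClassifier`, the 70/30 split, and the `random_state` values are assumptions made for the sketch, and no data loader is needed because the data is small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load the data and split it (in memory here, so no data loader is needed).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2. Construct the model.
model = DecisionTreeClassifier(random_state=0)

# 3. Train the model on the train split.
model.fit(X_train, y_train)

# 4. Evaluate the predictions on the test split.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```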

### Dask - Scale our data loader to a planetary level!

Here we'll use Dask as the dataloader. Recall from the first notebook (*python_final_1_of_2_scikit_learn.ipynb*) that a data loader is a generator that loads data only when it is needed. This allows us to use datasets that are far too large to fit into the computer's memory. It can also allow us to use streaming data that comes in continuously from a source (*e.g.* instruments in the lab that are constantly measuring things).

In the first notebook we created a dataloader from scratch. However, you would typically use something like Dask for this because we don't want to re-invent the wheel. (Note that other Python libraries also support out-of-memory and streaming/incremental modeling.)
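
As a reminder of what "loads data only when it is needed" means, here is a minimal generator-based loader using only the standard library. It is a sketch, not the loader from the first notebook: the function name `iter_csv_batches`, the batch size, and the tiny demo CSV are all hypothetical.

```python
import csv
import os
import tempfile


def iter_csv_batches(path, batch_size):
    """Yield lists of at most `batch_size` rows; only one batch is in memory at a time."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # Final, possibly smaller, batch
            yield batch


# Write a tiny made-up CSV so the sketch is runnable on its own.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["measurement", "label"])
    writer.writerows([[i * 0.5, i % 3] for i in range(10)])
    demo_path = tmp.name

batch_sizes = [len(batch) for batch in iter_csv_batches(demo_path, batch_size=4)]
print(batch_sizes)
os.remove(demo_path)
```

Because the function is a generator, nothing is read from disk until the loop asks for the next batch. Dask applies the same idea to whole blocks of a file and adds scheduling and parallelism on top.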


### Validation versus Test Data Splits

There are 2 datasets that are typically "held back" from training: the test dataset and the validation dataset.

"Held back" means that this data is never seen by the model during training. Before we begin training we split the data (randomly) into train, test, and validation shards. Only the training shard is used for training. Holding data back is what makes an unbiased estimate of the model's performance possible. Remember, we are fitting the model to the training dataset, so it stands to reason that the model will perform very well on the training dataset.

The model is never trained on the validation or test shards. The validation dataset is used to compare candidate models (for example, different model types or hyperparameter settings) and to select the best one. Because that selection is itself tuned to the validation data, the validation score of the winning model tends to be slightly optimistic. The test dataset is used only at the very end (once you have selected the best model) to give an unbiased estimate of how your best model will perform with new data.
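
The roles of the three shards can be sketched with in-memory scikit-learn as follows. The candidate models (decision trees with different `max_depth` values), the split sizes, and the `random_state` values are assumptions chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 30% of the rows, then divide that portion evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

# Fit each candidate on the training split and compare them on the validation split.
candidates = {
    depth: DecisionTreeClassifier(max_depth=depth, random_state=0) for depth in (1, 2, 3)
}
val_scores = {}
for depth, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# Touch the test split only once, after the choice has been made.
test_score = candidates[best_depth].score(X_test, y_test)
print(f"validation scores: {val_scores}")
print(f"chosen max_depth={best_depth}, test accuracy={test_score:.3f}")
```

The test split plays no part in choosing `best_depth`, which is why its score can be reported as the estimate for new data.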

## Let's create a new classifier dataset and save it to a CSV file

We'll use the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset, a classic classifier dataset for predicting the type of flower from its sepal and petal measurements. The dataset fits into memory, but we're going to pretend that it doesn't and load it with Dask. This mimics the real-world case where we have GB- to TB-sized CSV files.

In [1]:
import pandas as pd

In [2]:
csv_filename = "iris_classification_dataset.csv"
target_label = "Flower class"

In [3]:
from sklearn import datasets

data = datasets.load_iris()
df = pd.DataFrame(data=data["data"], columns=data["feature_names"])
df[target_label] = data["target"]
df.to_csv(csv_filename, sep=",", index=False)

## Now we have our CSV. Next step is using Dask to load it.

Note we're using Dask to replace the standard Pandas library. Think of Dask as Pandas on steroids.

In [4]:
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

## Load the CSV with Dask and split the data


In [5]:
ddf = dd.read_csv(csv_filename, blocksize="16MB")

In [6]:
ddf

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Flower class
npartitions=1,,,,,
,float64,float64,float64,float64,int64
,...,...,...,...,...


In [7]:
ddf.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Flower class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [8]:
y = ddf[target_label]
X = ddf[ddf.columns.difference([target_label])]

X_train, X_remain, y_train, y_remain = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=816
)

X_validate, X_test, y_validate, y_test = train_test_split(
    X_remain, y_remain, test_size=0.5, shuffle=True, random_state=816
)

In [9]:
y_train.head()

5      0
89     1
140    2
127    2
22     0
Name: Flower class, dtype: int64

In [10]:
X_train.head()

Unnamed: 0,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
5,1.7,0.4,5.4,3.9
89,4.0,1.3,5.5,2.5
140,5.6,2.4,6.7,3.1
127,4.9,1.8,6.1,3.0
22,1.0,0.2,4.6,3.6


In [11]:
from tqdm.notebook import tqdm  # This gives us the nice progress bar!

## Time for the classifier model!

We're using the stochastic gradient descent classifier from scikit-learn. Note that not all scikit-learn models will work with Dask in this way. You'll need to find scikit-learn models that support the `partial_fit()` method. This method updates an already-initialized model with one batch of data at a time, instead of fitting on the whole dataset at once.

In [12]:
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

In [13]:
classifier_model = SGDClassifier(loss="log_loss")  # Log loss gives a logistic-regression classifier (for regression, use SGDRegressor with a squared-error loss)

## Incremental/batch/stream learning

So far everything has looked like our usual scikit-learn training. Here's where we tweak it a bit to allow for incremental learning. You may also see this called mini-batch learning, stream learning, or online learning. The model is updated from data that is read in one batch at a time. This is wonderful when you are using really big datasets or datasets that are streaming from a source (like an IoT device).

In [14]:
classifier_increment = Incremental(classifier_model, scoring="accuracy") 

In [15]:
NUMBER_OF_EPOCHS = 1000  # An epoch is just one pass through the dataset. We train for multiple epochs when we do SGD

In [16]:
classes = y_train.unique().compute()   # Note we need the .compute() at the end when using Dask (this is different than Pandas)
print(f"We have the following classes in our dataset:\n{classes}")

We have the following classes in our dataset:
0    0
1    1
2    2
Name: Flower class, dtype: int64


In [17]:
for _ in (pbar := tqdm(range(NUMBER_OF_EPOCHS))):

    # Each call makes one pass (epoch) over the training data, one Dask block at a time
    classifier_increment.partial_fit(X_train, y_train, classes=classes)

    pbar.set_description(f"Accuracy = {classifier_increment.score(X_train, y_train):.3f}")

  0%|          | 0/1000 [00:00<?, ?it/s]

## Check how well we did

During training we printed out the accuracy on the training dataset. That number tends to be optimistic because the model was fitted to exactly that data.

We use the test dataset to test how well the model performs on data that it hasn't seen before. This gives us an unbiased (hopefully) estimate of the true model performance for future data. This is what you would report as the final model accuracy.
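
For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch in plain Python follows; the labels are made up for illustration and are not taken from the run in this notebook.

```python
def accuracy(truth, predictions):
    """Fraction of predictions that match the true labels."""
    if len(truth) != len(predictions):
        raise ValueError("truth and predictions must have the same length")
    matches = sum(t == p for t, p in zip(truth, predictions))
    return matches / len(truth)


# Made-up labels for illustration only.
truth = [0, 2, 0, 1, 1, 2, 0, 1]
predictions = [0, 2, 0, 1, 2, 2, 0, 1]

print(f"Accuracy = {100 * accuracy(truth, predictions):.1f}%")
```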

In [18]:
print(f"The final model has an accuracy of {100*classifier_increment.score(X_test, y_test):.1f}%")

The final model has an accuracy of 95.7%


## Let's make some predictions on our test dataset

In [19]:
classifier_increment.predict(X_test)

Unnamed: 0,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan,)","(nan,)"
Count,7 Tasks,1 Chunks
Type,int64,numpy.ndarray


In [20]:
df = pd.DataFrame()

df["Truth"] = y_test.compute()
df["Prediction"] =  classifier_increment.predict(X_test).compute()  # Note the compute()

df

Unnamed: 0,Truth,Prediction
11,0,0
130,2,2
28,0,0
21,0,0
125,2,2
37,0,0
84,1,1
9,0,0
19,0,0
33,0,0


In [21]:
def map_values(x: int) -> str:
    """Map the class value to the class name

    Args:
        x(int): Class value

    Returns:
        str: Class name (same order as scikit-learn's `target_names`)
    """
    if x == 0:
        return "Iris Setosa"
    elif x == 1:
        return "Iris Versicolour"
    elif x == 2:
        return "Iris Virginica"
    else:
        return "Error - Class not known"

In [22]:
df["Truth"] = y_test.compute()
df["Prediction"] = classifier_increment.predict(X_test).compute()  # Note the compute()


df["Truth"] = df["Truth"].map(map_values)
df["Prediction"] = df["Prediction"].map(map_values)

df["Match?"] = (df["Truth"] == df["Prediction"])

df

Unnamed: 0,Truth,Prediction,Match?
11,Iris Setosa,Iris Setosa,True
130,Iris Virginica,Iris Virginica,True
28,Iris Setosa,Iris Setosa,True
21,Iris Setosa,Iris Setosa,True
125,Iris Virginica,Iris Virginica,True
37,Iris Setosa,Iris Setosa,True
84,Iris Versicolour,Iris Versicolour,True
9,Iris Setosa,Iris Setosa,True
19,Iris Setosa,Iris Setosa,True
33,Iris Setosa,Iris Setosa,True
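
Hard-coding class names makes it easy to swap two of them by accident. One way to guard against that is to build the lookup from the dataset itself. The sketch below relies on `load_iris()` exposing `target_names`, where the position of each name equals its class value. The helper name `map_values_from_dataset` is hypothetical, and note that scikit-learn spells the second species "versicolor".

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Build the lookup from the dataset itself: index in target_names == class value.
class_names = {
    value: f"Iris {name.capitalize()}" for value, name in enumerate(iris.target_names)
}
print(class_names)


def map_values_from_dataset(x: int) -> str:
    """Hypothetical helper: map a class value to its name, with a fallback."""
    return class_names.get(x, "Error - Class not known")
```

A Pandas column can then be converted with `df["Truth"].map(map_values_from_dataset)`, just like with a hand-written mapping function.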


## We won't have time to go over the validation dataset but it's a good topic for another time

In [23]:
classifier_increment.score(X_validate, y_validate)

1.0