# ML-Git

This notebook walks through a basic ml-git workflow using its Python API. It shows how to check out a dataset that is already versioned with ml-git, and how to version a model and newly generated data. The examples use the MNIST dataset.

### 1 - The dataset

MNIST is a set of small images of handwritten digits. The version available in our Docker environment contains 70,000 images of the digits 0 to 9. The image below shows a few example instances:

![dataset](MNIST.png)

### 2 - Getting the data

To start working with the dataset, we run ml-git's checkout command, which brings the data from our storage into the user's workspace.

In [None]:
from ml_git import api

# def checkout(entity, tag, sampling=None, retries=2, force=False, dataset=False, labels=False, version=-1)
api.checkout('labels', 'labelsmnist', dataset=True)
mnist_dataset_path = 'dataset/handwritten/digits/mnist/data/'
mnist_labels_path = 'labels/handwritten/digits/labelsmnist/data/'

Two points are worth highlighting here. First, the tag parameter can be just the name of the entity, in which case ml-git fetches the latest version available for that entity. Second, `dataset=True` tells ml-git to also check out the dataset associated with these labels.

Once the data is in the workspace, we can load it into variables.

#### Training data

In [None]:
from mlxtend.data import loadlocal_mnist
import numpy as np

X_train, y_train = loadlocal_mnist(
    images_path= mnist_dataset_path + 'train-images.idx3-ubyte', 
    labels_path= mnist_labels_path + 'train-labels.idx1-ubyte')

print('Training data: ')
print('Dimensions: %s x %s' % (X_train.shape[0], X_train.shape[1]))
print('Digits: %s' % np.unique(y_train))
print('Class distribution: %s' % np.bincount(y_train))

The training data consists of 60,000 entries of 784 pixels, distributed among the possible values according to the output above.
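
Each 784-value row is a flattened 28x28 image (28 x 28 = 784). As a minimal sketch, assuming the usual row-major layout, the hypothetical helper `to_grid` below (not part of mlxtend or ml-git) turns one flat row back into a 2D grid using plain Python lists and synthetic values:

```python
IMAGE_SIDE = 28  # MNIST images are 28x28 pixels

def to_grid(flat_pixels, side=IMAGE_SIDE):
    """Split a flat, row-major pixel list into `side` rows of `side` values."""
    if len(flat_pixels) != side * side:
        raise ValueError('expected %d pixels, got %d' % (side * side, len(flat_pixels)))
    return [list(flat_pixels[row * side:(row + 1) * side]) for row in range(side)]

# Synthetic stand-in for one row of X_train: pixel i simply holds the value i
flat = list(range(IMAGE_SIDE * IMAGE_SIDE))
grid = to_grid(flat)
print(len(grid), len(grid[0]))  # number of rows, number of columns
print(grid[1][0])               # first pixel of the second row
```

With NumPy arrays the same operation is a single `reshape((28, 28))` call.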

#### Test data

In [None]:
X_test, y_test = loadlocal_mnist(
    images_path= mnist_dataset_path + 't10k-images.idx3-ubyte', 
    labels_path= mnist_labels_path + 't10k-labels.idx1-ubyte')

print('Test data: ')
print('Dimensions: %s x %s' % (X_test.shape[0], X_test.shape[1]))
print('Digits: %s' % np.unique(y_test))
print('Class distribution: %s' % np.bincount(y_test))

The test data consists of 10,000 entries of 784 pixels, distributed among the possible values according to the output above.

### 3 - Training and evaluating

As an example, let’s train a RandomForest classifier on the training data and evaluate it on the test data.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Training on the existing dataset
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

# Evaluating the model
y_pred = rf_clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy score after training on existing dataset", score)
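
`accuracy_score` reports the fraction of test samples whose predicted label equals the true label. A minimal pure-Python sketch of that definition, using a hypothetical helper name `fraction_correct` and made-up labels for illustration (these are not MNIST results):

```python
def fraction_correct(y_true, y_pred):
    """Return the share of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if len(y_true) == 0:
        raise ValueError('cannot score an empty label list')
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)

# Made-up labels: three of the four predictions are right
example_true = [7, 2, 1, 0]
example_pred = [7, 2, 1, 4]
print(fraction_correct(example_true, example_pred))
```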

### 4 - Versioning our model

Since we do not have any previously versioned models, we need to create a new entity. For this we use the following command:

In [None]:
! ml-git model create modelmnist --category=handwritten --category=digits --bucket-name=mlgit
! echo "  mutability: mutable" >> model/modelmnist/modelmnist.spec

Once we have our model trained and evaluated, we will version it with ml-git. For that we need to save it in a file.

In [None]:
import pickle

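def save_model(model):
    # Serialize the trained model into the model entity's data directory
    filename = 'model/modelmnist/data/rf_mnist.sav'
    with open(filename, 'wb') as model_file:
        pickle.dump(model, model_file)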

save_model(rf_clf)
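
Before versioning, it can be worth checking that the saved file loads back correctly. The sketch below is a generic pickle round trip, not ml-git functionality. It uses a temporary directory and a plain dictionary as a stand-in for the trained classifier, and `round_trip` is a hypothetical helper name:

```python
import os
import pickle
import tempfile

def round_trip(obj, path):
    """Pickle `obj` to `path`, then load and return the stored copy."""
    with open(path, 'wb') as out_file:
        pickle.dump(obj, out_file)
    with open(path, 'rb') as in_file:
        return pickle.load(in_file)

# Stand-in for a trained model; any picklable object works the same way
stand_in_model = {'name': 'stand-in', 'weights': [0.1, 0.2]}

with tempfile.TemporaryDirectory() as workdir:
    restored = round_trip(stand_in_model, os.path.join(workdir, 'rf_mnist.sav'))

print(restored == stand_in_model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.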

With the file in the workspace we use the following commands to create a version:

In [None]:
entity_type = 'model'
entity_name = 'modelmnist'

# def add(entity_type, entity_name, bumpversion=False, fsck=False, file_path=[])
api.add(entity_type, entity_name)

# def commit(entity, ml_entity_name, commit_message=None, related_dataset=None, related_labels=None)
api.commit(entity_type, entity_name, related_dataset='mnist', related_labels='labelsmnist')

# def push(entity, entity_name, retries=2, clear_on_fail=False)
api.push(entity_type, entity_name)
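
The add, commit and push calls always appear in this order, so they can be wrapped in one function. This is a sketch under assumptions: `version_entity` is a hypothetical helper that relies only on the `add`, `commit` and `push` signatures quoted in the comments above, and `RecordingApi` is a stand-in so the example runs without ml-git installed.

```python
def version_entity(api, entity_type, entity_name, bumpversion=False, **commit_options):
    """Run add -> commit -> push for one entity using the given api object.

    `commit_options` is forwarded to commit, e.g. related_dataset='mnist'.
    """
    api.add(entity_type, entity_name, bumpversion=bumpversion)
    api.commit(entity_type, entity_name, **commit_options)
    api.push(entity_type, entity_name)


class RecordingApi:
    """Stand-in that records calls instead of talking to ml-git."""

    def __init__(self):
        self.calls = []

    def add(self, entity_type, entity_name, bumpversion=False):
        self.calls.append(('add', entity_type, entity_name, bumpversion))

    def commit(self, entity, ml_entity_name, **options):
        self.calls.append(('commit', entity, ml_entity_name, options))

    def push(self, entity, entity_name):
        self.calls.append(('push', entity, entity_name))


fake_api = RecordingApi()
version_entity(fake_api, 'model', 'modelmnist',
               related_dataset='mnist', related_labels='labelsmnist')
for call in fake_api.calls:
    print(call)
```

In the notebook, passing the imported `api` module as the first argument should issue the same three calls.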

### 5 - Improving accuracy

We can improve accuracy by tuning the algorithm's hyperparameters or by trying a different algorithm altogether. Sometimes the algorithm simply needs more data to learn a better prediction function, and we can expand the dataset by deriving new samples from the existing ones.

As discussed earlier, each instance of the dataset is a vector of 784 pixel values, which is the flattened representation of a 28x28 image. What if we shift the images by one pixel in each direction? See the example below.

![dataset](MNIST_SHIFTED.png)


We will shift each image by one pixel in each of the four directions, generating four more images from every original. The resulting dataset will contain 300,000 images (60,000 x 5).
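
To make the shift concrete, here is a small pure-Python sketch on a 3x3 grid. `shift_grid` is a hypothetical helper written for illustration. It follows the same convention as the SciPy-based `shift_image` used in this notebook: positive `dx` moves content right, positive `dy` moves it down, and vacated pixels are filled with 0.

```python
def shift_grid(grid, dx, dy, fill=0):
    """Return a copy of `grid` moved dx columns right and dy rows down."""
    rows, cols = len(grid), len(grid[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r - dy, c - dx
            # Copy only pixels whose source lies inside the image
            if 0 <= src_r < rows and 0 <= src_c < cols:
                shifted[r][c] = grid[src_r][src_c]
    return shifted

tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

directions = ((1, 0), (-1, 0), (0, 1), (0, -1))
for dx, dy in directions:
    print((dx, dy), shift_grid(tiny, dx, dy))

# One original plus four shifted copies per training image
augmented_size = 60000 * (1 + len(directions))
print(augmented_size)
```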

In [None]:
from scipy.ndimage import shift

# Method to shift the image by given dimension
def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

In [None]:
# Creating Augmented Dataset
X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

# Add four shifted copies (right, left, down, up) of every training image
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

# Shuffle the dataset
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = np.array(X_train_augmented)[shuffle_idx]
y_train_augmented = np.array(y_train_augmented)[shuffle_idx]

print('Augmented training data: ')
print('Dimensions: %s x %s' % (X_train_augmented.shape[0], X_train_augmented.shape[1]))
print('Digits: %s' % np.unique(y_train_augmented))
print('Class distribution: %s' % np.bincount(y_train_augmented))
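
Indexing both arrays with the same permutation keeps every image paired with its label. The sketch below shows the idea with the standard library and made-up values. The fixed seed is an assumption added for reproducibility (the cell above shuffles without one), and `shuffle_together` is a hypothetical helper name.

```python
import random

def shuffle_together(samples, labels, seed=None):
    """Reorder samples and labels with one shared random permutation."""
    if len(samples) != len(labels):
        raise ValueError('samples and labels must have the same length')
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    return [samples[i] for i in order], [labels[i] for i in order]

# Made-up data where the label can be read off the sample name
samples = ['img-0', 'img-1', 'img-2', 'img-3', 'img-4']
labels = [0, 1, 2, 3, 4]

shuffled_samples, shuffled_labels = shuffle_together(samples, labels, seed=42)
for sample, label in zip(shuffled_samples, shuffled_labels):
    print(sample, label)
```

With NumPy, drawing the permutation from a seeded generator, for example `np.random.default_rng(42).permutation(n)`, serves the same purpose.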

The training data now consists of 300,000 entries of 784 pixels, distributed among the possible values according to the output above.

### 6 - Versioning the dataset and labels with the new entries

In [None]:
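# Note: pickle.dump overwrites the original IDX files with pickled NumPy arrays,
# so these two files can no longer be read back with loadlocal_mnist.
dataset_augmented_file = 'dataset/handwritten/digits/mnist/data/train-images.idx3-ubyte'
with open(dataset_augmented_file, 'wb') as dataset_file:
    pickle.dump(X_train_augmented, dataset_file)

labels_augmented_file = 'labels/handwritten/digits/labelsmnist/data/train-labels.idx1-ubyte'
with open(labels_augmented_file, 'wb') as labels_file:
    pickle.dump(y_train_augmented, labels_file)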
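
Because `pickle.dump` is used here, the files keep their `.idx3-ubyte` and `.idx1-ubyte` names but no longer hold IDX data. One way to tell the two formats apart is the file header: an IDX file of unsigned bytes starts with two zero bytes, the type code `0x08`, and the number of dimensions, whereas a pickle written with protocol 2 or later starts with the byte `0x80`. The helper `looks_like_idx` below is a hypothetical illustration that works on in-memory bytes rather than the real files:

```python
import pickle
import struct

def looks_like_idx(header):
    """Check the 4-byte IDX magic: 0x00 0x00, ubyte type code, dimension count."""
    return (len(header) >= 4
            and header[0] == 0 and header[1] == 0
            and header[2] == 0x08     # 0x08 = unsigned byte data
            and header[3] >= 1)       # number of dimensions

# Header of an IDX image file: magic 2051 (0x00000803), then count, rows, cols
idx_header = struct.pack('>IIII', 2051, 60000, 28, 28)
# Bytes produced by pickle for some arbitrary stand-in data
pickled_bytes = pickle.dumps([1, 2, 3])

print(looks_like_idx(idx_header))
print(looks_like_idx(pickled_bytes))
```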

#### Versioning the dataset

In [None]:
entity_type = 'dataset'
entity_name = 'mnist'

api.add(entity_type, entity_name, bumpversion=True)
api.commit(entity_type, entity_name)
api.push(entity_type, entity_name)

#### Versioning the labels

In [None]:
entity_type = 'labels'
entity_name = 'labelsmnist'

api.add(entity_type, entity_name, bumpversion=True)
api.commit(entity_type, entity_name, related_dataset='mnist')
api.push(entity_type, entity_name)

### 7 - Training and evaluating

In [None]:
# Training on augmented dataset
rf_clf_for_augmented = RandomForestClassifier(random_state=42)
rf_clf_for_augmented.fit(X_train_augmented, y_train_augmented)

# Evaluating the model
y_pred_after_augmented = rf_clf_for_augmented.predict(X_test)
score = accuracy_score(y_test, y_pred_after_augmented)
print("Accuracy score after training on augmented dataset", score)

We’ve improved the accuracy by about 0.9 percentage points, a worthwhile gain for such a simple augmentation.

### 8 - Versioning our model

In [None]:
save_model(rf_clf_for_augmented)

entity_type = 'model'
entity_name = 'modelmnist'

api.add(entity_type, entity_name, bumpversion=True)
api.commit(entity_type, entity_name, related_dataset='mnist', related_labels='labelsmnist')
api.push(entity_type, entity_name)

### 9 - Reproducing our experiment with ml-git


Once the experiment data is versioned, we often need to re-evaluate a result, or someone else may want to see the result of an already trained model.

To do this, we check out version 1 of the model (trained without data augmentation) together with its related dataset and labels, which gives us both the test data and the trained model.

In [None]:
mnist_dataset_path = 'dataset/handwritten/digits/mnist/data/'
mnist_labels_path = 'labels/handwritten/digits/labelsmnist/data/'
mnist_model_path = 'model/handwritten/digits/modelmnist/data/'

api.checkout('model', 'handwritten__digits__modelmnist__1', dataset=True, labels=True)

# Getting test data
X_test, y_test = loadlocal_mnist(images_path= mnist_dataset_path + 't10k-images.idx3-ubyte', 
                                 labels_path= mnist_labels_path + 't10k-labels.idx1-ubyte')

With the test data in hand, let's load the model and evaluate it on our dataset.

In [None]:
with open(mnist_model_path + 'rf_mnist.sav', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
y_pred = loaded_model.predict(X_test)
score = accuracy_score(y_test, y_pred)
print('Accuracy score for version 1: ', score)

Now let's take version 2 of the model (trained on the augmented data) and evaluate it on the test set.

In [None]:
api.checkout('model', 'handwritten__digits__modelmnist__2')
with open(mnist_model_path + 'rf_mnist.sav', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
y_pred = loaded_model.predict(X_test)
score = accuracy_score(y_test, y_pred)
print('Accuracy score for version 2: ', score)

This gave us a quick and practical way to retrieve the models generated in the experiments and evaluate them again.

### Conclusions

At the end of this execution we have two versions of each entity. If someone else wants to replicate this experiment, they can check out the model with the related dataset and labels.

|            DATASET            |                LABELS               |                MODEL               | ACCURACY |
|:-----------------------------:|:-----------------------------------:|:----------------------------------:|:--------:|
| handwritten__digits__mnist__1 | handwritten__digits__labelsmnist__1 | handwritten__digits__modelmnist__1 |  0.9705  |
| handwritten__digits__mnist__2 | handwritten__digits__labelsmnist__2 | handwritten__digits__modelmnist__2 |+~0.009   |
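
The tags in the table follow a visible pattern: the categories, the entity name and the version number joined by double underscores. The sketch below builds and splits such tags. `build_tag` and `parse_tag` are hypothetical helpers inferred from the tags shown in this notebook, not functions of the ml-git API.

```python
SEPARATOR = '__'

def build_tag(categories, entity_name, version):
    """Join categories, entity name and version into one tag string."""
    return SEPARATOR.join(list(categories) + [entity_name, str(version)])

def parse_tag(tag):
    """Split a tag into (categories, entity_name, version)."""
    parts = tag.split(SEPARATOR)
    if len(parts) < 2:
        raise ValueError('not a full tag: %r' % tag)
    return parts[:-2], parts[-2], int(parts[-1])

tag = build_tag(['handwritten', 'digits'], 'modelmnist', 1)
print(tag)
print(parse_tag('handwritten__digits__labelsmnist__2'))
```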