# Seamless Performance Scaling in SherlockML

Data science workloads that run on large data sets can take days to complete. In this demonstration we will show how Sherlock makes it easy to ask for more computational resources when (and only when!) you need them.

## The dataset

We will be using a subset of the `Adult` dataset, which can be obtained from the UCI website at https://archive.ics.uci.edu/ml/datasets/Adult. This dataset consists of over 30K records extracted from the 1994 American census. The prediction task is to determine whether an individual's income exceeds $50K per year.

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['Age', 'Work Class', 'FNLWGT', 'Education', 'Education Number', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital Gain', 'Capital Loss', 'Hours Per Week', 'Native Country', 'Income Class']
column_types = {'Age': np.int32 ,'Work Class' : 'category', 'FNLWGT' : np.int32 , 'Education': 'category', 'Education Number': np.int32, 'Marital Status': 'category', 'Occupation':'category', 'Relationship':'category', 'Race':'category', 'Sex':'category', 'Capital Gain' : np.int32, 'Capital Loss' : np.int32, 'Hours Per Week' : np.int32, 'Native Country' : 'category', 'Income Class' : 'category'   }
adult = pd.read_csv('data/adult_adapted.data', header = None, names =  columns, dtype = column_types)


In [3]:
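# One-hot encode the categorical features; `.values` replaces `.as_matrix()`, which was removed in pandas 1.0
X = pd.get_dummies(adult.drop(['Income Class', 'FNLWGT'], axis = 1)).values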

In [4]:
# Encode the target as integer category codes
Y = adult['Income Class'].cat.codes.values

## Random Forests

For this classification problem we will use a Random Forest classifier from the `sklearn` library. Random forests are good general-purpose classifiers that work well in a wide variety of prediction tasks. In addition, because each tree in the forest is trained independently, training time scales well with the number of processing cores available. Setting `n_jobs = -1` below asks `sklearn` to use every available core.
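To get a feel for how the number of workers affects training time, here is a minimal sketch that times the same forest with one worker and with all available cores. It uses a synthetic dataset from `make_classification` rather than the Adult data, and the sample and tree counts are arbitrary assumptions chosen so that it runs quickly; the timings you see will depend on your server.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Adult features (sizes are arbitrary assumptions).
X_demo, y_demo = make_classification(n_samples=5000, n_features=20, random_state=0)


def time_fit(n_jobs):
    """Return the wall-clock seconds needed to fit a forest with n_jobs workers."""
    forest = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    forest.fit(X_demo, y_demo)
    return time.perf_counter() - start


# n_jobs=1 trains the trees one after another; n_jobs=-1 uses every core.
timings = {n_jobs: time_fit(n_jobs) for n_jobs in (1, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs:>2}: {seconds:.2f}s")
```

On a single-core machine the two timings should be about the same; the gap is expected to widen as cores are added.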

In [5]:
import sklearn.ensemble as es

In [6]:
classifier = es.RandomForestClassifier(n_estimators = 300, n_jobs = -1, verbose = True)

In [7]:
classifier.fit(X,Y);

[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   14.4s finished


## Testing the classifier

Let's quickly evaluate our classifier on held-out test data.

In [8]:
from sklearn.metrics import classification_report

In [9]:
adult_test = pd.read_csv('data/adult_adapted.test', header = None, names =  columns, dtype = column_types)
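# Reuse the training categories, in the same order, so the one-hot columns of the test set line up with those of X
test_country = pd.Categorical(adult_test['Native Country'], categories = adult['Native Country'].cat.categories)
adult_test['Native Country'] = test_country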

In [10]:
# `.values` replaces `.as_matrix()`, which was removed in pandas 1.0
X_test = pd.get_dummies(adult_test.drop(['Income Class', 'FNLWGT'], axis = 1)).values
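The test features must have exactly the same one-hot columns, in the same order, as the training features. The notebook handles this by fixing the categories of `Native Country`; a more general alternative, sketched here on two made-up data frames (not the Adult data), is to reindex the test dummies against the training columns.

```python
import pandas as pd

# Toy frames for illustration only; "Peru" never appears in the training data.
train = pd.DataFrame({"Age": [39, 50, 38],
                      "Native Country": ["United-States", "Cuba", "India"]})
test = pd.DataFrame({"Age": [25, 44],
                     "Native Country": ["United-States", "Peru"]})

train_dummies = pd.get_dummies(train)

# Keep only the columns seen in training, in training order.
# Columns missing from the test set are filled with 0; unseen ones are dropped.
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```

This approach silently discards categories that only occur in the test set, which may or may not be what you want.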

In [11]:
Y_true = adult_test['Income Class'].cat.codes.values

In [12]:
Y_pred = classifier.predict(X_test)

[Parallel(n_jobs=40)]: Done 120 tasks      | elapsed:    0.6s
[Parallel(n_jobs=40)]: Done 300 out of 300 | elapsed:    1.1s finished


In [13]:
print(classification_report(Y_true, Y_pred, target_names=['<=50k', '>50K']))

             precision    recall  f1-score   support

      <=50k       0.88      0.92      0.90     11360
       >50K       0.73      0.61      0.66      3700

avg / total       0.84      0.85      0.84     15060
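As a reading aid for the report, precision is the fraction of predicted positives that are correct, recall is the fraction of actual positives that are found, and the F1 score is their harmonic mean. The sketch below computes these from a confusion matrix on a handful of made-up labels (not the Adult data) and then checks that the F1 scores in the report above follow from the reported precision and recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (1 plays the role of '>50K').
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # correct positives / predicted positives
recall = tp / (tp + fn)     # correct positives / actual positives


def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


print(precision, recall, f1(precision, recall))

# The hand-computed values agree with sklearn's own helpers.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12

# F1 recomputed from the precision and recall printed in the report.
print(round(f1(0.88, 0.92), 2), round(f1(0.73, 0.61), 2))
```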



Our off-the-shelf classifier achieves decent results, but takes upwards of 10 seconds to train (14.4 seconds in the run logged above). Can we do better? With Sherlock, running your notebook on a faster computer is only a few clicks away. Let's give it a go.

## Create a larger server
<img src="images/create_instance.png" alt="Upload directions" style="width: 500px; float: right; margin-right:10px;"/>

#### 1. Close this notebook.


#### 2. In the workspace view, click on the down arrow next to the `default` label, then click on `Create server`.

#### 3. Select `Extra large (8 cores, 32 GB memory)` from the `SIZE` drop-down menu, give your server a name and click on `Create instance`. Wait a few seconds for your new server to be ready, then select it.

#### 4. That's it! When you no longer need your more powerful server, you can terminate it from the Servers tab in the left-hand side panel: click on the three dots to the right of your server's name, then select `Terminate` from the drop-down menu that appears.



## Rerun the notebook

Let's run this notebook again. Training the classifier now only takes about 2.5 seconds, which is a 4X speed-up! Real-world data science tasks can take days to complete and require a lot of computational resources. Without Sherlock, transferring all your data and libraries to a bigger AWS instance would be no mean feat, and would require a lot of time, and hence money. But Sherlock does it all for you under the hood, in a seamless and secure fashion.
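If you want to confirm that the notebook is really running on the larger server, one option is to check how many cores Python can see. This is a minimal sketch using `os.cpu_count()` and joblib's `effective_n_jobs` (joblib is the library `sklearn` uses for `n_jobs`); the speed-up arithmetic uses the approximate timings quoted in this notebook.

```python
import os

from joblib import effective_n_jobs

# Number of logical cores reported by the operating system.
n_cores = os.cpu_count()

# Number of workers joblib would actually start for n_jobs=-1.
n_workers = effective_n_jobs(-1)

print(f"cores visible: {n_cores}, workers for n_jobs=-1: {n_workers}")

# Speed-up from the approximate timings quoted in the text.
before_seconds, after_seconds = 10.0, 2.5
speedup = before_seconds / after_seconds
print(f"speed-up: {speedup:.1f}X")
```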

Using the command line and the `sml` package, it is possible to create even larger servers, with more than 100 cores. This is especially useful when working on the most demanding projects.