# Unit Testing ML Code: Hands-on Exercise (Input Values)

## In this notebook we will explore unit tests to validate input data using a basic schema

#### We will use a classic toy dataset: the Iris plants dataset, which comes included with scikit-learn
Dataset details: https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset

As we progress through the course, the complexity of examples will increase, but we will start with something basic. This notebook is designed so that it can be run in isolation, once the setup steps described below are complete.

### Setup

Let's begin by importing the dataset and the libraries we are going to use. Make sure you have run `pip install -r requirements.txt` against the requirements file located in the same directory as this notebook. We recommend doing this in a separate virtual environment (see the dedicated setup lecture).

If you need a refresher on Jupyter, pandas, or NumPy, the section notes link to some resources.

In [1]:
from sklearn import datasets
import pandas as pd
import numpy as np

# Access the iris dataset from sklearn
iris = datasets.load_iris()

# Load the iris data into a pandas dataframe. The `data` and `feature_names`
# attributes of the dataset are added by default by sklearn. We use them to
# specify the columns of our dataframes.
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create a "target" column in our dataframe, and set the values to the correct
# classifications from the dataset.
iris_frame['target'] = iris.target

In [2]:
# View the first 5 rows of our dataframe.
iris_frame.head(5)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [3]:
# View summary statistics for our dataframe.
iris_frame.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


### Now that we have our data loaded, we will create a simplified pipeline.

This pipeline is a class for encapsulating all the related functionality for our model. As the course unfolds, we will work with more complex pipelines, including those provided by third party libraries.

We train a logistic regression model to classify the flowers from the Iris dataset.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class SimplePipeline:
    def __init__(self):
        self.frame = None
        # Shorthand to specify that each value should start out as
        # None when the class is instantiated.
        self.X_train, self.X_test, self.y_train, self.y_test = None, None, None, None
        self.model = None
        self.load_dataset()
    
    def load_dataset(self):
        """Load the dataset and perform train test split."""
        # fetch from sklearn
        dataset = datasets.load_iris()
        
        # remove units ' (cm)' from variable names
        self.feature_names = [fn[:-5] for fn in dataset.feature_names]
        self.frame = pd.DataFrame(dataset.data, columns=self.feature_names)
        self.frame['target'] = dataset.target
        
        # we split the dataset with sklearn's train_test_split function. Its
        # arguments are the dataframe of predictor variables, the target,
        # the proportion of the data to assign to the test set, and the
        # random_state, which makes the split reproducible.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.frame[self.feature_names], self.frame.target, test_size=0.65, random_state=42)
        
    def train(self, algorithm=LogisticRegression):
        
        # we instantiate the classifier (LogisticRegression unless another
        # class is passed in) with the lbfgs solver
        self.model = algorithm(solver='lbfgs', multi_class='auto')
        self.model.fit(self.X_train, self.y_train)
        
    def predict(self, input_data):
        return self.model.predict(input_data)
        
    def get_accuracy(self):
        
        # use our X_test and y_test values generated when we used
        # `train_test_split` to test accuracy.
        # `score` is a method on LogisticRegression that returns the mean
        # accuracy on the given test data and labels, see:
        # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score
        return self.model.score(X=self.X_test, y=self.y_test)
    
    def run_pipeline(self):
        """Helper method to run multiple pipeline methods with one call."""
        self.load_dataset()
        self.train()
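
The slice `fn[:-5]` in `load_dataset` assumes that every feature name ends in the five-character suffix ` (cm)`. As a minimal sketch of a more defensive alternative, the helper below only strips the suffix when it is actually present. `strip_unit` and `UNIT_SUFFIX` are hypothetical names introduced for illustration and are not part of the notebook; the feature names are copied from the dataframe header above so the snippet runs without scikit-learn.

```python
# Feature names as shown in the dataframe header above, copied here so the
# snippet does not need scikit-learn.
raw_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

UNIT_SUFFIX = " (cm)"  # hypothetical constant, not part of the notebook


def strip_unit(name, suffix=UNIT_SUFFIX):
    """Return `name` without `suffix`; leave it unchanged if the suffix is absent."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


clean_names = [strip_unit(name) for name in raw_names]
print(clean_names)
```

For these four names the result matches what `fn[:-5]` produces; the difference only shows up for a name without the suffix, which the helper returns unchanged instead of truncating.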

In [6]:
pipeline = SimplePipeline()
pipeline.run_pipeline()
accuracy_score = pipeline.get_accuracy()

# note that f-string interpolation syntax requires Python 3.6 or later
# https://www.python.org/dev/peps/pep-0498/
print(f'current model accuracy is: {accuracy_score}')

current model accuracy is: 0.9693877551020408
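
As a quick sanity check on the number printed above, the arithmetic below works out how many rows end up in each split. It assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set. The count of 95 correct predictions is inferred from the printed accuracy, not read from the model.

```python
import math

n_samples = 150   # rows in the Iris dataset (see the describe() output above)
test_size = 0.65  # value passed to train_test_split in SimplePipeline

# Assumption: a fractional test size is rounded up to a whole number of rows,
# and the remaining rows form the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# With this many test rows, 95 correct predictions would be consistent with
# the accuracy printed above.
print(95 / n_test)
```

Note that with `test_size=0.65` the model is trained on the smaller share of the data and evaluated on the larger one.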


### Test Inputs

Now that we have our basic pipeline, we are in a position to test the input data.

Best practice is to use a schema. A schema is a collection of rules which specify the expected values for a set of fields. Below we show a simple schema (just using a nested dictionary) for the Iris dataset. Later in the course we will look at more complex schemas, using some of the common Python libraries for data validation.

The schema specifies the maximum and minimum values that can be taken by each variable. We can learn these values from the data, as we have done for this demo, or these values may come from specific domain knowledge of the subject.

In [7]:
iris_schema = {
    'sepal length': {
        'range': {
            'min': 4.0,  # chosen by looking at the output of the dataframe's .describe() method
            'max': 8.0
        },
        'dtype': float,
    },
    'sepal width': {
        'range': {
            'min': 1.0,
            'max': 5.0
        },
        'dtype': float,
    },
    'petal length': {
        'range': {
            'min': 1.0,
            'max': 7.0
        },
        'dtype': float,
    },
    'petal width': {
        'range': {
            'min': 0.1,
            'max': 3.0
        },
        'dtype': float,
    }
}
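
The schema above was written by hand after inspecting the data. One possible way to derive the ranges programmatically is sketched below in plain Python. `infer_range` and its `margin` parameter are hypothetical, introduced only for illustration, and the sample values are the first five sepal lengths from the `head()` output earlier in this notebook.

```python
def infer_range(values, margin=0.0):
    """Return a {'min': ..., 'max': ...} dict covering `values`, widened by `margin`."""
    values = list(values)
    if not values:
        raise ValueError("cannot infer a range from an empty column")
    return {"min": min(values) - margin, "max": max(values) + margin}


# First five 'sepal length' values from the head() output above.
sepal_length_sample = [5.1, 4.9, 4.7, 4.6, 5.0]

# Build one schema entry in the same nested-dictionary shape as iris_schema.
sample_schema = {
    "sepal length": {
        "range": infer_range(sepal_length_sample, margin=0.5),
        "dtype": float,
    }
}
print(sample_schema["sepal length"]["range"])
```

The margin is a judgment call: a range learned from a sample with no margin will reject any new value that falls slightly outside what has been seen so far, while a very wide margin makes the check meaningless. Domain knowledge, where available, is usually a better guide.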

In [8]:
import unittest
import sys

class TestIrisInputData(unittest.TestCase):
    def setUp(self):
        
        # `setUp` will be run before each test, ensuring that you
        # have a new pipeline to access in your tests. See the 
        # unittest docs if you are unfamiliar with unittest.
        # https://docs.python.org/3/library/unittest.html#unittest.TestCase.setUp
        self.pipeline = SimplePipeline()
        self.pipeline.run_pipeline()
    
    def test_input_data_ranges(self):
        # get df max and min values for each column
        max_values = self.pipeline.frame.max()
        min_values = self.pipeline.frame.min()
        
        # loop over each feature (i.e. all 4 column names)
        for feature in self.pipeline.feature_names:
            
            # use unittest assertions to ensure the max/min values found in the dataset
            # do not exceed/fall below the max/min allowed by the schema.
            self.assertLessEqual(max_values[feature], iris_schema[feature]['range']['max'])
            self.assertGreaterEqual(min_values[feature], iris_schema[feature]['range']['min'])
            
    def test_input_data_types(self):
        data_types = self.pipeline.frame.dtypes  # pandas `dtypes` attribute
        
        for feature in self.pipeline.feature_names:
            self.assertEqual(data_types[feature], iris_schema[feature]['dtype'])
        

In [9]:
# setup code to allow unittest to run the above tests inside the jupyter notebook.
suite = unittest.TestLoader().loadTestsFromTestCase(TestIrisInputData)
unittest.TextTestRunner(verbosity=1, stream=sys.stderr).run(suite)

..
----------------------------------------------------------------------
Ran 2 tests in 0.076s

OK


<unittest.runner.TextTestResult run=2 errors=0 failures=0>
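
When a test loops over several features, a failure report is more useful if it names the feature that broke. The sketch below shows one way to do this with `unittest`'s `subTest` context manager. The toy schema and columns are hypothetical stand-ins for `iris_schema` and the pipeline dataframe, so the snippet runs on its own.

```python
import unittest

# Toy stand-ins for `iris_schema` and the pipeline dataframe (hypothetical values).
toy_schema = {
    "sepal length": {"min": 4.0, "max": 8.0},
    "petal width": {"min": 0.1, "max": 3.0},
}
toy_columns = {
    "sepal length": [5.1, 4.9, 7.9],
    "petal width": [0.2, 1.3, 2.5],
}


class TestToyRanges(unittest.TestCase):
    def test_ranges(self):
        for feature, bounds in toy_schema.items():
            # subTest labels each iteration, so a failure names the feature
            # and the remaining features are still checked.
            with self.subTest(feature=feature):
                self.assertLessEqual(max(toy_columns[feature]), bounds["max"])
                self.assertGreaterEqual(min(toy_columns[feature]), bounds["min"])


suite = unittest.TestLoader().loadTestsFromTestCase(TestToyRanges)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The returned `result` object exposes `wasSuccessful()`, `failures`, and `errors`, which can be inspected programmatically instead of reading the text report.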

### Data Input Test: Hands-on Exercise
Change either the schema or the input data (not the model config) so that the test fails. Do you understand why the test is failing?