
# Unit Testing ML Code: Hands-on Exercise (Configuration)

## In this notebook we will explore unit tests for *model configuration*

#### We will use a classic toy dataset: the Iris plants dataset, which comes included with scikit-learn
Dataset details: https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset

As we progress through the course, the complexity of examples will increase, but we will start with something basic. This notebook is designed so that it can be run in isolation, once the setup steps described below are complete. Cells should be run one after the other without skipping any.

### Setup

Let's begin by importing the dataset and the libraries we are going to use. Make sure you have run `pip install -r requirements.txt` on the requirements file located in the same directory as this notebook. We recommend doing this in a separate virtual environment (see the dedicated setup lecture).

If you need a refresher on Jupyter, pandas, or NumPy, there are some links to resources in the section notes.

In [0]:
from sklearn import datasets
import pandas as pd
import numpy as np

# Access the iris dataset from sklearn
iris = datasets.load_iris()

# Load the iris data into a pandas dataframe. The `data` and `feature_names`
# attributes of the dataset are added by default by sklearn. We use them to
# specify the columns of our dataframes.
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create a "target" column in our dataframe, and set the values to the correct
# classifications from the dataset.
iris_frame['target'] = iris.target

### Add the `SimplePipeline` from the Test Input Values notebook (same as first exercise, no changes here)

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class SimplePipeline:
    def __init__(self):
        self.frame = None
        # Shorthand to specify that each value should start out as
        # None when the class is instantiated.
        self.X_train, self.X_test, self.y_train, self.y_test = None, None, None, None
        self.model = None
        self.load_dataset()
    
    def load_dataset(self):
        """Load the dataset and perform train test split."""
        # fetch from sklearn
        dataset = datasets.load_iris()
        
        # remove units ' (cm)' from variable names
        self.feature_names = [fn[:-5] for fn in dataset.feature_names]
        self.frame = pd.DataFrame(dataset.data, columns=self.feature_names)
        self.frame['target'] = dataset.target
        
        # we divide the data set using the train_test_split function from sklearn, 
        # which takes as parameters, the dataframe with the predictor variables, 
        # then the target, then the percentage of data to assign to the test set, 
        # and finally the random_state to ensure reproducibility.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.frame[self.feature_names], self.frame.target, test_size=0.65, random_state=42)
        
    def train(self, algorithm=LogisticRegression):
        
        # we instantiate the classifier (LogisticRegression unless another
        # algorithm is passed in) with a fixed solver and multi_class setting
        self.model = algorithm(solver='lbfgs', multi_class='auto')
        self.model.fit(self.X_train, self.y_train)
        
    def predict(self, input_data):
        return self.model.predict(input_data)
        
    def get_accuracy(self):
        
        # use our X_test and y_test values generated when we used
        # `train_test_split` to test accuracy.
        # `score` is a method on LogisticRegression that returns the
        # mean accuracy on the given test data and labels, see:
        # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score
        return self.model.score(X=self.X_test, y=self.y_test)
    
    def run_pipeline(self):
        """Helper method to run multiple pipeline methods with one call."""
        self.load_dataset()
        self.train()

### Update the Pipeline

We now create a new pipeline class that inherits from `SimplePipeline` with one important modification: the configuration for the model is passed in as an argument when the pipeline object is instantiated. This means that the configuration can be set via an external object or file.

In [0]:
class PipelineWithConfig(SimplePipeline):
    def __init__(self, config):
        # Call the inherited SimplePipeline __init__ method first.
        super().__init__()
        # Pass in a config object which we use during the train method.
        self.config = config
            
    def train(self, algorithm=LogisticRegression):
        # note that we instantiate the LogisticRegression classifier 
        # with params from the pipeline config
        self.model = algorithm(solver=self.config.get('solver'),
                               multi_class=self.config.get('multi_class'))
        self.model.fit(self.X_train, self.y_train)
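
Because the `train` method only calls `self.config.get(...)`, any dict-like object can serve as the config, including one parsed from a file. The following is a minimal sketch, assuming a hypothetical JSON document with `solver` and `multi_class` keys. The `load_config` helper and the `REQUIRED_KEYS` constant are illustrative names and are not part of the course code.

```python
import json

# Hypothetical: the keys that the pipeline's train method reads from the config.
REQUIRED_KEYS = {"solver", "multi_class"}


def load_config(raw_json):
    """Parse a JSON string into a config dict and check the required keys."""
    config = json.loads(raw_json)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"Missing config keys: {sorted(missing)}")
    return config


# In practice this text would be read from a file stored alongside the code.
raw = '{"solver": "lbfgs", "multi_class": "auto"}'
config = load_config(raw)
print(config["solver"])
```

The resulting dict could then be passed in as `PipelineWithConfig(config=config)`.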

### Now we Unit Test

We will employ a simple unit test to check the configuration values.

Let's say that after extensive testing in the research environment, we conclude that certain configurations (parameters passed to the model, preprocessing settings, GPU configurations, etc.) are optimal, or that certain configurations tend to be a bad idea. We should then test that our configuration is consistent with this understanding.

In [0]:
import unittest


# Arbitrarily selected for demonstration purposes. In a real
# system you would define this in config and import it into your
# tests, so that you don't have to update both the config and the
# tests when the values change.
ENABLED_MODEL_SOLVERS = {'lbfgs', 'newton-cg'}


class TestIrisConfig(unittest.TestCase):
    def setUp(self):
        # We prepare the pipeline for use in the tests
        config = {'solver': 'lbfgs', 'multi_class': 'auto'}
        self.pipeline = PipelineWithConfig(config=config)
        self.pipeline.run_pipeline()
    
    def test_pipeline_config(self):
        # Given
        # fetch model config using sklearn get_params()
        # https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.get_params
        model_params = self.pipeline.model.get_params()
        
        # Then
        self.assertTrue(model_params['solver'] in ENABLED_MODEL_SOLVERS)

In [10]:
import sys


suite = unittest.TestLoader().loadTestsFromTestCase(TestIrisConfig)
unittest.TextTestRunner(verbosity=1, stream=sys.stderr).run(suite)

.
----------------------------------------------------------------------
Ran 1 test in 0.073s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>
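
The same kind of check can also be applied to several candidate configurations at once, without training a model. Below is a minimal sketch using `unittest`'s `subTest`. The `CANDIDATE_CONFIGS` dict and its environment names are hypothetical and only stand in for configs that a real system would import from a shared location.

```python
import unittest

ENABLED_MODEL_SOLVERS = {"lbfgs", "newton-cg"}

# Hypothetical configs, e.g. one per deployment environment.
CANDIDATE_CONFIGS = {
    "staging": {"solver": "lbfgs", "multi_class": "auto"},
    "production": {"solver": "newton-cg", "multi_class": "auto"},
}


class TestCandidateConfigs(unittest.TestCase):
    def test_solvers_are_enabled(self):
        for name, config in CANDIDATE_CONFIGS.items():
            # subTest reports each failing config separately instead of
            # stopping at the first failure.
            with self.subTest(config=name):
                self.assertIn(config["solver"], ENABLED_MODEL_SOLVERS)


suite = unittest.TestLoader().loadTestsFromTestCase(TestCandidateConfigs)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Checking the raw config like this is cheaper than running the pipeline, but it does not confirm that the trained model actually received those values, which is what the `get_params()` test above verifies.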

## Model Configuration Testing: Hands-on Exercise
Change the model config so that the test fails. Do you understand why the test is failing?

In [0]:
import unittest


# Arbitrarily selected for demonstration purposes. In a real
# system you would define this in config and import it into your
# tests, so that you don't have to update both the config and the
# tests when the values change.
ENABLED_MODEL_SOLVERS = {'lbfgs', 'newton-cg'}


class TestIrisConfig(unittest.TestCase):
    def setUp(self):
        # We prepare the pipeline for use in the tests
        config = {'solver': 'sag', 'multi_class': 'auto'}
        self.pipeline = PipelineWithConfig(config=config)
        self.pipeline.run_pipeline()
    
    def test_pipeline_config(self):
        # Given
        # fetch model config using sklearn get_params()
        # https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.get_params
        model_params = self.pipeline.model.get_params()
        
        # Then
        self.assertTrue(model_params['solver'] in ENABLED_MODEL_SOLVERS)

In [20]:
suite = unittest.TestLoader().loadTestsFromTestCase(TestIrisConfig)
unittest.TextTestRunner(verbosity=1, stream=sys.stderr).run(suite)

F
FAIL: test_pipeline_config (__main__.TestIrisConfig)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-19-daa4dec40c67>", line 25, in test_pipeline_config
    self.assertTrue(model_params['solver'] in ENABLED_MODEL_SOLVERS)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.022s

FAILED (failures=1)


<unittest.runner.TextTestResult run=1 errors=0 failures=1>