## Model Performance Testing


### Example Online Experiment

- There are many solutions for online experimentation, each with its own nuances. Rather than survey them, we will build a minimum viable testing setup as follows.
- We will develop a bare-bones setup to test two recommendation models based on the flask deployment from earlier. This will involve using the open source package `planout` (see [here](https://facebook.github.io/planout/index.html)) to do the randomization.
- Assuming flask, surpriselib, pytorch and others are already installed, we can install `planout` using the following:
```bash
pip install planout
```
- Note that it's always good to do the above in a project-specific conda environment (such as the `datasci-dev` environment we have been using).
- The key idea with using `planout` here is as follows. When deciding which recommendations to serve to the user after they log in and reach the homepage or an appropriate screen, we will randomly pick either the pytorch model or the surpriselib model.
- In particular, `planout` assigns the user to a random cohort in the experiment when they log in.

- This is accomplished by creating an instance of an appropriately defined `ModelExperiment` class.

```python
model_perf_exp = ModelExperiment(userid=session['userid'])
model_type = model_perf_exp.get('model_type')
```
- Here the class `ModelExperiment` is defined in a straightforward manner:

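```python
from planout.experiment import SimpleExperiment
from planout.ops.random import BernoulliTrial


class ModelExperiment(SimpleExperiment):
    def setup(self):
        # Exposure and rating events are appended to this file.
        self.set_log_file('model_abtest.log')

    def assign(self, params, userid):
        # A fair coin flip keyed on the user ID decides which model is served.
        params.use_pytorch = BernoulliTrial(p=0.5, unit=userid)
        if params.use_pytorch:
            params.model_type = 'pytorch'
        else:
            params.model_type = 'surprise'
```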

- The `setup` function defines where the log for the experiment is stored (this can be changed to a different storage location).
- The `assign` function uses the user's ID to bucket them into either the control or the treatment cohort. In our example, let's assume that the surpriselib-based recommendation model is the control.
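- To build intuition for bucketing by user ID, here is a minimal sketch of hash-based assignment using only the standard library. It is a simplified illustration, not `planout`'s exact implementation: the helper names and the salt format are assumptions, so it will not reproduce the cohorts that `planout` assigns.

```python
import hashlib


def uniform_from_hash(salt: str, unit: str) -> float:
    """Map (salt, unit) to a reproducible number in [0, 1]."""
    digest = hashlib.sha1(f"{salt}.{unit}".encode("utf-8")).hexdigest()
    # Take the first 15 hex digits and scale by the largest 15-digit hex value.
    return int(digest[:15], 16) / float(0xFFFFFFFFFFFFFFF)


def assign_model(userid: str, p: float = 0.5,
                 salt: str = "ModelExperiment.use_pytorch") -> str:
    """Return 'pytorch' for roughly a fraction p of users, else 'surprise'."""
    return "pytorch" if uniform_from_hash(salt, userid) < p else "surprise"


# The same user always lands in the same cohort.
print(assign_model("431") == assign_model("431"))  # True

# Across many users the split is close to p.
share = sum(assign_model(str(u)) == "pytorch" for u in range(10_000)) / 10_000
print(f"share assigned to pytorch: {share:.3f}")
```

- Because the assignment depends only on the salt and the user ID, no per-user state needs to be stored to keep a returning user in the same cohort.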
- Take a look at the complete flask script in the code folder. What follows is a brief explanation:

- We read the models and some associated metadata from disk. Although the script also loads the training data, this is only for convenience and ideally should not be done.
- We start with the flask app setup. Notice how the app can be configured:

```python
from flask import Flask

app = Flask(__name__)

app.config.update(dict(
    DEBUG=True,  # development only; turn off in production
    SECRET_KEY='MODEL_TESTING_BY_THEJA_TULABANDHULA',  # signs the session cookie
))
```

- The next few lines in the script define the recommendation function, which responds with recommendations from either the surprise model or the pytorch model.

- Notice the reset function, which clears the session and thereby essentially switches the user ID.

```python
@app.route('/reset')
def reset():
    session.clear()
    return redirect(url_for('main'))
```

- Below is the rating function, which logs the received rating.

```python
@app.route('/rate')
def rate():
    rate_string = request.args.get('rate')
    try:
        rate_val = int(rate_string)
        # Validate explicitly; assert statements are skipped when Python runs with -O.
        if not 1 <= rate_val <= 10:
            raise ValueError('rating out of range')

        # Re-create the experiment for this user and log the rating event.
        model_perf_exp = ModelExperiment(userid=session['userid'])
        model_perf_exp.log_event('rate', {'rate_val': rate_val})

        return render_template_string("""
            <html>
                <head>
                    <title>Thank you for the feedback!</title>
                </head>
                <body>
                    <p>Your rating is {{ rate_val }}. Hit the back button or click below to go back to recommendations!</p>
                    <p><a href="/">Back</a></p>
                </body>
            </html>
            """, rate_val=rate_val)
    except (TypeError, ValueError, KeyError):
        # Missing or non-numeric rating, out-of-range value, or no user ID in the session.
        return render_template_string("""
            <html>
                <head>
                    <title>Bad rating!</title>
                </head>
                <body>
                    <p>Your rating could not be parsed. That's probably not a number between 1 and 10, so we won't be accepting your rating.</p>
                    <p><a href="/">Back</a></p>
                </body>
            </html>
            """)
```

- Notice that we currently log events, such as exposures (the recommendations were shown) and explicit user ratings, to a simple log file in the same directory.

```bash
(datasci-dev) ttmac:code_lecture08 theja$ ls
flask_pytorch_model.py	model_abtest.log	movies.dat		pytorch_model		surprise_model
flask_surprise_model.py	model_testing.py	pytorch_inference.ipynb	recommend.py		two_sample_test.ipynb
(datasci-dev) ttmac:code_lecture08 theja$ head -n3 model_abtest.log
{"name": "ModelExperiment", "time": 1602739669, "salt": "ModelExperiment", "inputs": {"userid": "431"}, "params": {"use_pytorch": 1, "model_type": "pytorch"}, "event": "exposure", "checksum": "796b9a12"}
{"name": "ModelExperiment", "time": 1602739720, "salt": "ModelExperiment", "inputs": {"userid": "431"}, "params": {"use_pytorch": 1, "model_type": "pytorch"}, "event": "exposure", "checksum": "796b9a12"}
{"name": "ModelExperiment", "time": 1602739722, "salt": "ModelExperiment", "inputs": {"userid": "431"}, "params": {"use_pytorch": 1, "model_type": "pytorch"}, "event": "exposure", "checksum": "796b9a12"}
```
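- Each line of the log is a JSON object, so pulling out the ratings per model takes only a few lines. The sketch below is a minimal illustration and not the notebook's actual code: `ratings_by_model` is a hypothetical helper, the two rating lines are made up, and we assume that `planout` stores the extras passed to `log_event` under an `extra_data` key (check this against your own log).

```python
import json
from collections import defaultdict


def ratings_by_model(lines):
    """Group logged rating values by the model each user was assigned to."""
    ratings = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("event") != "rate":
            continue  # skip exposure events
        # Assumption: extras passed to log_event are stored under "extra_data".
        rate_val = record.get("extra_data", {}).get("rate_val")
        if rate_val is not None:
            ratings[record["params"]["model_type"]].append(rate_val)
    return dict(ratings)


# One exposure line copied from the log above, plus two made-up rating lines.
sample_log = [
    '{"name": "ModelExperiment", "time": 1602739669, "salt": "ModelExperiment", "inputs": {"userid": "431"}, "params": {"use_pytorch": 1, "model_type": "pytorch"}, "event": "exposure", "checksum": "796b9a12"}',
    '{"inputs": {"userid": "431"}, "params": {"use_pytorch": 1, "model_type": "pytorch"}, "event": "rate", "extra_data": {"rate_val": 8}}',
    '{"inputs": {"userid": "17"}, "params": {"use_pytorch": 0, "model_type": "surprise"}, "event": "rate", "extra_data": {"rate_val": 6}}',
]
print(ratings_by_model(sample_log))  # {'pytorch': [8], 'surprise': [6]}
```

- With the real file, the same function can be called on the open file handle, for example `ratings_by_model(open('model_abtest.log'))`.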

- We can load this log file into a Jupyter notebook to conduct our test.
  - [Download locally](two_sample_test.ipynb)

![test1](images/stat_test1.png)
- Let's zoom into the data we care about for testing.
![test2](images/stat_test2.png)
- We can do a simple two-sample t-test using the `scipy.stats` package.
![test3](images/stat_test3.png)

- For more deliberation on the choice of the test, have a look at [this discussion](https://stats.stackexchange.com/questions/305/when-conducting-a-t-test-why-would-one-prefer-to-assume-or-test-for-equal-vari) on the assumption about equal population variances.
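- To see what such a test computes, here is a minimal sketch of Welch's t statistic (the variant that does not assume equal population variances) using only the standard library. The ratings are made up for illustration and `welch_t` is a hypothetical helper; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` returns this statistic together with a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, df


# Made-up ratings on the 1-10 scale, for illustration only.
pytorch_ratings = [8, 7, 9, 6, 8, 7, 9, 8]
surprise_ratings = [6, 7, 5, 8, 6, 5, 7, 6]

t_stat, df = welch_t(pytorch_ratings, surprise_ratings)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

- A large absolute value of the statistic relative to a t distribution with `df` degrees of freedom is evidence that the mean ratings of the two cohorts differ.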



#### Summary

- We have seen a simple A/B testing setup where we are testing two recommendation models. Most commercial and open source implementations follow similar patterns for setting up an experiment.
- While the basic A/B testing ideas seen above are a good way to validate improvements in models and products, more advanced techniques such as multi-armed bandits, Bayesian tests and contextual bandits may help with issues such as wrong assumptions or sample inefficiency.
- For instance, there are several ideas worth exploring, such as:
  - Combining ML within an A/B testing setup (note that this is not related to the variants being ML models)
  - Running tests with small and large samples
  - Repeating A/B tests to address drift
  - Avoiding peeking
  - Dealing with multiple hypotheses
  - Working with quantities such as posteriors rather than p-values
  - Being careful with post-selection inference
  - Using causal modeling techniques such as causal forests and various observational techniques for causal inference
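
- As one illustration of working with posteriors rather than p-values, here is a minimal sketch of a Bayesian comparison of two cohorts. It rests on assumptions that are not part of the setup above: each rating is reduced to a binary "liked" outcome (say, a rating of 8 or higher), each cohort's like rate gets a uniform Beta(1, 1) prior, and the counts are made up. The helper `prob_b_beats_a` is a hypothetical name.

```python
import random


def prob_b_beats_a(likes_a, n_a, likes_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, the posterior of a Bernoulli rate is
        # Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + likes_a, 1 + n_a - likes_a)
        rate_b = rng.betavariate(1 + likes_b, 1 + n_b - likes_b)
        wins += rate_b > rate_a
    return wins / draws


# Made-up counts: A is the surprise cohort (control), B is the pytorch cohort.
p = prob_b_beats_a(likes_a=40, n_a=100, likes_b=55, n_b=100)
print(f"P(pytorch like rate > surprise like rate) ~ {p:.3f}")
```

- The output is a direct probability statement about which variant is better, which is often easier to act on than a p-value.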