# Chapter 6: Model Development and Offline Evaluation

## Evaluating ML Models

- Evaluate models not only on metrics (accuracy, F1, etc.) but also on data requirements, compute, training time, inference latency, and interpretability

### Tips to select models

- Don't worry about SOTA. Worry about what works
- Start with simple models
- Avoid human biases in model selection, e.g. if you spend more time tuning the "newer" model, it will tend to look better than the alternatives
    - When comparing different architectures, make sure they are compared under comparable setups
- Good performance now vs good performance later
    - Plot a learning curve (i.e. score against training sample count) to see the marginal contribution of more training data
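A learning curve can be sketched with nothing but the standard library. The snippet below is a rough illustration, not a recipe: the synthetic data (`y = 3x + 2` plus noise), the closed-form line fit, and the helper names `fit_line`, `mse`, and `learning_curve` are all assumptions made for the example. The "score" here is validation MSE, so lower is better.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x if var_x > 0 else 0.0
    return w, mean_y - w * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, val, sizes):
    """Return (sample_count, validation_mse) for each training-set size."""
    val_x, val_y = zip(*val)
    curve = []
    for n in sizes:
        xs, ys = zip(*train[:n])  # train on the first n samples only
        curve.append((n, mse(fit_line(xs, ys), val_x, val_y)))
    return curve

# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)

def make_point():
    x = rng.uniform(0, 10)
    return x, 3 * x + 2 + rng.gauss(0, 2)

train = [make_point() for _ in range(400)]
val = [make_point() for _ in range(200)]

curve = learning_curve(train, val, sizes=[5, 10, 50, 100, 400])
for n, score in curve:
    print(f"n={n:4d}  validation MSE={score:.3f}")
```

If the curve has flattened, more data is unlikely to help much; if it is still dropping, collecting more data may be worth more than switching models.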

- What are your tradeoffs?
    - False positives vs false negatives
    - Compute requirement vs accuracy

- What are your model's assumptions?
    - IID?
    - Smoothness?
    - Boundary shapes?
    - Conditional independence?
    - Normal distribution?

### Ensembles

- Ensembles usually perform well in local experiments, but are less common in production because they are quite hard to maintain
- But generally, whenever you want some marginal gain, ensembling should be your go-to first step

### Bagging

- Bootstrap aggregation (bagging) of raw observations helps you reduce variance in model predictions and avoid overfitting
- Rather than training on only one dataset, create multiple datasets by resampling (with replacement) the data you have, train a model on each, then average over all the models' predictions
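A minimal sketch of the resample-then-average idea, assuming a toy regression problem. The base model (`fit_stump`, a one-split regressor that splits at the median) and the data are hypothetical choices for illustration; any base learner could take its place.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    """Fit a one-split regression stump on (x, y) pairs.

    Splits at the median x and predicts the mean y on each side.
    """
    xs = sorted(x for x, _ in data)
    threshold = xs[len(xs) // 2]
    left = [y for x, y in data if x < threshold]
    right = [y for x, y in data if x >= threshold]
    overall = mean(y for _, y in data)
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return lambda x: left_value if x < threshold else right_value

def bagged_predict(models, x):
    """Average the predictions of all models trained on bootstrap samples."""
    return mean(m(x) for m in models)

# Toy data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in range(20)]

# One model per bootstrap sample, then average their predictions.
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(f"bagged prediction at x=5: {bagged_predict(models, 5):.3f}")
```

Each stump sees a slightly different dataset, so the individual predictions disagree; averaging them smooths out that disagreement, which is where the variance reduction comes from.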

### Boosting

- XGBoost/LightGBM handle this out of the box
- The idea is to train many weak (simple) learners in sequence, each one focusing on the errors of the previous ones, and combine them into a strong learner
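To make the "many weak learners" idea concrete, here is a rough sketch of gradient boosting with one-split regression stumps under squared loss. It uses regression rather than classification to keep the arithmetic short, and it is not how XGBoost or LightGBM are implemented internally; `fit_stump`, `boost`, and the toy data are assumptions for illustration.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:  # only one distinct x value: fall back to a constant
        constant = mean(residuals)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x < threshold else right_value

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(predict, xs, ys):
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Toy data (an assumption for illustration): y = x^2 on [0, 5).
xs = [i / 10 for i in range(50)]
ys = [x ** 2 for x in xs]

baseline_value = mean(ys)
baseline_mse = mse(lambda x: baseline_value, xs, ys)
boosted_mse = mse(boost(xs, ys), xs, ys)
print(f"constant model MSE={baseline_mse:.3f}  boosted MSE={boosted_mse:.3f}")
```

No single stump can follow the curve, but each round corrects part of what the previous rounds got wrong, so the training error of the combined model falls well below that of the constant baseline.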

### Stacking

- Train base learners on the training data, then train a meta-learner to combine the outputs of the base learners
- Kind of like simple ensembling, except that instead of equal weights the meta-learner learns how to combine the base learners' outputs
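A small sketch of stacking with two base learners and a very simple meta-learner: a single learned weight `w` in `w * a + (1 - w) * b`. Everything here (the base learners, the holdout split, the synthetic data) is an assumption for illustration. Fuller implementations typically feed the meta-learner out-of-fold predictions rather than using one holdout split.

```python
import random

def fit_mean(train):
    """Base learner A: always predict the mean target."""
    value = sum(y for _, y in train) / len(train)
    return lambda x: value

def fit_nearest_neighbour(train):
    """Base learner B: predict the target of the closest training x."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def fit_meta(preds_a, preds_b, ys):
    """Meta-learner: least-squares weight w for w * a + (1 - w) * b."""
    num = sum((y - b) * (a - b) for a, b, y in zip(preds_a, preds_b, ys))
    den = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return num / den if den > 0 else 0.5

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data (an assumption for illustration): y = 0.5x plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 0.5 * x + rng.gauss(0, 1)))

# Base learners see one half; the meta-learner is fit on their
# predictions for the other half, which the base learners never saw.
base_train, meta_train = data[:100], data[100:]
model_a = fit_mean(base_train)
model_b = fit_nearest_neighbour(base_train)

preds_a = [model_a(x) for x, _ in meta_train]
preds_b = [model_b(x) for x, _ in meta_train]
ys = [y for _, y in meta_train]

w = fit_meta(preds_a, preds_b, ys)
stacked = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
print(f"w={w:.3f}  A={mse(preds_a, ys):.3f}  B={mse(preds_b, ys):.3f}  stacked={mse(stacked, ys):.3f}")
```

The stacked error printed here is measured on the same split the weight was fit on, so it is optimistic; a fair comparison needs a third, untouched split.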

## Experiment Tracking/Versioning

- Best to use something like MLflow to track and version your model training, so you can keep track of the features used and parameters tried. Log things like:
    - Loss curve
    - Model performance
    - Samples used
    - Prediction vs ground truth
    - Speed (training and inference)
    - System metrics (CPU utilisation)
    - Hyperparameters
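To make concrete what one tracked run might contain, here is a standard-library stand-in that writes a run record to JSON. It is not MLflow's API; `log_run`, the file layout, and all the values are hypothetical, and in practice a dedicated tracker would also handle storage, comparison across runs, and a UI.

```python
import json
import tempfile
import time
from pathlib import Path

def log_run(run_dir, params, metrics, loss_curve):
    """Write one training run's record to run_dir/run.json and return the path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,          # hyperparameters, seed, number of samples used
        "metrics": metrics,        # model performance, speed, system metrics
        "loss_curve": loss_curve,  # one loss value per epoch
    }
    path = run_dir / "run.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Hypothetical values, for illustration only.
path = log_run(
    tempfile.mkdtemp(),
    params={"learning_rate": 0.1, "seed": 42, "n_train_samples": 1000},
    metrics={"val_accuracy": 0.91, "train_seconds": 12.5, "cpu_utilisation": 0.63},
    loss_curve=[0.9, 0.5, 0.3],
)
saved = json.loads(path.read_text())
print(saved["params"], saved["metrics"])
```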

- Versioning
    - Versioning ML models and the data used is important, because data will shift and you will end up with multiple copies of the same model trained on different days (a tool like DVC helps here)

- ML fails in many ways
    - Poor data
    - Poor implementation
    - Pipeline failures
    - Hyperparameter failures
    - Feature choice failures 
    - etc

- To debug
    - Start simple and add more components
    - Purposely overfit a single batch to make sure that your implementation is at least correct
    - Set random seeds so runs are reproducible, and test performance over a variety of seeds
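The "overfit a single batch" and "try several seeds" checks can be sketched together. The model below is a toy (a line fit by gradient descent), and `train_on_batch` and the batch itself are assumptions for illustration; the point is only that a correct training loop should drive the loss on a tiny, perfectly learnable batch to roughly zero, whatever the seed.

```python
import random

def train_on_batch(batch, seed, steps=1000, lr=0.1):
    """Fit y = w * x + b to one small batch with gradient descent; return final MSE."""
    rng = random.Random(seed)  # seeded so each run is reproducible
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(batch)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return sum((w * x + b - y) ** 2 for x, y in batch) / n

# A single tiny batch that a linear model can fit exactly: y = 2x + 1.
batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Repeat with different seeds; every run should reach (near) zero loss.
losses = [train_on_batch(batch, seed) for seed in range(5)]
for seed, loss in enumerate(losses):
    print(f"seed={seed}  final loss={loss:.2e}")
```

If the loss refuses to go to zero on a batch this small, the bug is more likely in the implementation (gradients, data handling, loss) than in the model choice.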

### Distributed training

- You can train models across multiple machines, for faster training speed

- Split the data across multiple machines, train a local copy of the model on each, then accumulate the gradients (data parallelism)

![async train](./artifacts/6_image.png)
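The data-splitting approach can be simulated on one machine. The sketch below shows the synchronous variant only (every worker finishes before the shared weights are updated), with a toy linear model; `gradient`, `synchronous_step`, and the two "workers" are assumptions for illustration, and there is no real networking involved.

```python
def gradient(w, b, shard):
    """Mean-squared-error gradient of y = w * x + b on one shard of (x, y) pairs."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b

def synchronous_step(w, b, shards, lr):
    """Each simulated worker computes a local gradient; the average updates the shared weights."""
    local = [gradient(w, b, shard) for shard in shards]
    grad_w = sum(g[0] for g in local) / len(local)
    grad_b = sum(g[1] for g in local) / len(local)
    return w - lr * grad_w, b - lr * grad_b

# Toy data following y = 2x + 1, split evenly across two simulated workers.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
shards = [data[:2], data[2:]]

w, b = 0.0, 0.0
for _ in range(1000):
    w, b = synchronous_step(w, b, shards, lr=0.1)
print(f"w={w:.3f}, b={b:.3f}")
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the synchronous version trains exactly as a single machine would. Asynchronous updates give up that equivalence in exchange for not waiting on slow workers.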


- Split the model across different machines (model parallelism) and train the parts in parallel where the work allows it (e.g. each layer of an FFN can be trained on a different machine)

- Split the pipeline so that each part is trained on a different machine

## AutoML

- AutoML is a catch-all buzzword

- But what's interesting is how AutoML can be used to automate parts of model building
    - Automate feature selection?
    - Automate architecture search? 
    - etc

## Model Offline Evaluation

- How do you know that your models are actually working? 

- Often, data in production does not give you ground truth
    - e.g. if you already chose NOT to lend money to someone, it is not possible to observe a ground truth label for their creditworthiness

- Solutions:
    - Baselines: 
        - Let's say your model is 90% accurate. What if you had random labelling? What is the uplift?
        - What about some simple heuristic?
        - How about some human baseline?
        - Simple model?
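A quick sketch of why the raw "90% accurate" number needs a baseline next to it. The label distribution below (90 negatives, 10 positives) is a hypothetical example, as are the helper names.

```python
import random
from collections import Counter

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def majority_baseline(train_labels, n):
    """Always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

def random_baseline(train_labels, n, rng):
    """Predict labels at random, following the training label frequencies."""
    return [rng.choice(train_labels) for _ in range(n)]

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
rng = random.Random(0)

majority_acc = accuracy(majority_baseline(labels, len(labels)), labels)
random_acc = accuracy(random_baseline(labels, len(labels), rng), labels)
print(f"majority-class baseline: {majority_acc:.2f}")
print(f"frequency-matched random baseline: {random_acc:.2f}")
```

On these labels the majority-class heuristic already scores 0.90, so a model at 90% accuracy has no uplift over "always say no". The frequency-matched random baseline averages 0.9 × 0.9 + 0.1 × 0.1 = 0.82 in expectation.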

- Evaluation: beyond baselines, check
    - Perturbation: do the predicted labels flip dramatically if I randomly perturb my data?
    - Feature removal: what happens to predictions when I remove some features from the model?
    - Calibration: does a "70%" probability in your model output really mean 70%? Calibrate it properly
    - Confidence level: you may only want to take action on model outputs that you are confident about
    - Subset evaluation: Are there specific groups that your model underperforms on?
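Two of the checks above, calibration and subset evaluation, are easy to sketch by hand. The predictions, labels, and group names below are made up for illustration, and `calibration_table` and `accuracy_by_slice` are hypothetical helpers rather than functions from any library.

```python
from collections import defaultdict

def calibration_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        rate = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_p, rate, len(pairs)))
    return table

def accuracy_by_slice(preds, labels, groups):
    """Accuracy computed separately for each group value."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        hits[g] += int(p == y)
        totals[g] += 1
    return {g: hits[g] / totals[g] for g in totals}

# Made-up model outputs, true labels, and group membership.
probs = [0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.9, 0.9]
labels = [0, 0, 1, 1, 1, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

table = calibration_table(probs, labels)
for mean_p, rate, count in table:
    print(f"predicted={mean_p:.2f}  observed={rate:.2f}  n={count}")

preds = [1 if p >= 0.5 else 0 for p in probs]
slices = accuracy_by_slice(preds, labels, groups)
print(slices)
```

A well-calibrated model has predicted and observed rates close to each other in every bin. The per-slice accuracies show how an acceptable overall number can hide a group the model does worse on.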