## Model Evaluation and Model Productionizing

In this section of the course, we'll focus on the production phase of the machine learning pipeline. We'll review the cycle of machine learning projects and examine how AWS services can help with the storage, monitoring, and maintenance aspects of model production.

### Using ML Models in Production

* Integrating an ML solution with existing software
* Keeping it running successfully over time

#### Aspects to consider

* Model hosting
* Model deployment
* Pipelines to provide feature vectors
* Code to provide low-latency and/or high-volume predictions
* Model and data updating and versioning
* Quality monitoring and alarming
* Data and model security and encryption
* Customer privacy, fairness, and trust
* Data provider contractual constraints (e.g., attribution, cross-fertilization)

#### Types of production environments

##### Batch predictions
* Useful if all possible inputs known a priori (e.g., all product categories for which demand is to be forecast, all keywords to bid)
* Predictions can still be served real-time, simply read from pre-computed values
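
As a minimal sketch of the batch pattern, assume a hypothetical `forecast_demand` function standing in for a trained model and an illustrative list of product categories. Every known input is scored once offline, and serving becomes a dictionary lookup.

```python
# Hypothetical stand-in for a trained model; any expensive predictor fits here.
def forecast_demand(category: str) -> float:
    return float(len(category) * 10)


# All possible inputs are known ahead of time (illustrative values).
CATEGORIES = ["books", "toys", "garden"]


def run_batch_job(categories):
    """Offline step: score every known input once and store the results."""
    return {category: forecast_demand(category) for category in categories}


def serve(precomputed, category):
    """Online step: a constant-time lookup, with no model call at request time."""
    return precomputed[category]


precomputed = run_batch_job(CATEGORIES)
print(serve(precomputed, "toys"))
```

In practice the precomputed table would usually live in a key-value store rather than in memory, but the split between the offline job and the online lookup is the same.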

##### Online predictions
* Useful if input space is large (e.g., customer's utterances or photos, detail pages to be translated)
* Low latency requirement (e.g., at most 100ms)

##### Online training
* Sometimes training data patterns change often, so need to train online (e.g., fraud detection)
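
One way to sketch online training is with scikit-learn's `partial_fit`, which updates a model one mini-batch at a time. The synthetic data stream and the `make_batch` helper below are assumptions for illustration only; a real fraud-detection stream would supply the batches.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared on the first partial_fit call


def make_batch(n=200):
    """Synthetic mini-batch: the label is 1 when the feature sum is positive."""
    X = rng.normal(size=(n, 2))
    y = (X.sum(axis=1) > 0).astype(int)
    return X, y


# Simulate a stream: update the model as each new batch arrives.
for _ in range(20):
    X_batch, y_batch = make_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

# Score on a batch the model has not seen yet.
X_new, y_new = make_batch()
online_accuracy = model.score(X_new, y_new)
print(f"accuracy on the newest batch: {online_accuracy:.2f}")
```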

### Model Evaluation Metrics

* Business metrics may not be the same as the performance metrics that are optimized during training.  Why?
    * Example - click-through rate
* Ideally, performance metrics are highly correlated with business metrics
* Confusion matrix, TP, FP, TN, FN, precision, recall, etc.
* Issue: In many applications, TN dwarfs the other categories, making accuracy useless for comparing models
* F1-score - combines precision and recall - it's the harmonic mean of precision and recall

$$ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
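
To make the accuracy issue concrete, here is a small sketch with made-up confusion-matrix counts (10,000 examples, only 100 of them positive). The helper `precision_recall_f1` is a name introduced here, and it assumes the denominators are nonzero.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: true negatives dwarf every other category.
tp, fp, fn, tn = 60, 40, 40, 9860
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# A trivial model that always predicts "negative" is right on every actual negative.
always_negative_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"always-negative accuracy={always_negative_accuracy:.3f}")
```

With these counts the model's accuracy is 0.992 and the always-negative baseline reaches 0.990, so accuracy barely separates them, while the F1 score of 0.6 shows that the model misses 40% of the positives.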

### Cross-Validation

* Issue - metrics on training data can't measure generalization
    * Model could cheat by memorizing the data and getting a perfect score
    * Overfitting
* Solution - Cross-validation: Train and evaluate on distinct data sets
* `from sklearn.model_selection import train_test_split`
* Overfitting?  Reduce the number of predictor variables
* `from sklearn.metrics import precision_score, recall_score, f1_score`
* For cancer, recall is most important to make sure we don't miss those who have cancer (but I'd say it's also super important to not treat someone for cancer who doesn't have it!)
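
The two imports above can be combined into a short hold-out evaluation. This is a sketch under assumptions: the breast cancer dataset bundled with scikit-learn and a scaled logistic regression are choices made here for illustration, not something the notes prescribe. In that dataset label 0 means malignant, so the metrics use `pos_label=0` to measure recall on the cancer class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

# Hold out 25% of the data; stratify so both splits keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Treat "malignant" as the positive class.
precision = precision_score(y_test, y_pred, pos_label=0)
recall = recall_score(y_test, y_pred, pos_label=0)
f1 = f1_score(y_test, y_pred, pos_label=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```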

### K-Fold Cross-Validation

* Issue: small sets
    * Smaller training set -> not enough data for good training
    * Unrepresentative test set -> invalid metrics
* K-fold cross-validation
    * Randomly partition data into K "folds" (larger K means more time and more variance)
    * For each fold, train the model on the other K-1 folds and evaluate it on the held-out fold
    * Train the final model on all the data
    * The metric averaged across the K folds estimates the test metric for that final model
* Choosing K
    * Large -> more time, more variance
    * Small -> more bias
    * 5-10 is typical
* Leave-one-out cross-validation
    * K = n
    * Use for very small datasets
* Stratified K-fold cross-validation
    * Preserve class proportions in the folds
    * Use for imbalanced data
    * Also use when there is seasonality or there are subgroups in the data
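
The K-fold procedure can be sketched with scikit-learn's `StratifiedKFold` and `cross_val_score`. The dataset, the model, and K = 5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds: each fold keeps roughly the original class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 score per fold; each is computed on data the model did not train on.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
mean_f1 = scores.mean()
print(f"per-fold F1: {scores.round(3)}  mean: {mean_f1:.3f}")

# Final step: refit on all the data; mean_f1 is the estimate for this model.
model.fit(X, y)
```

Swapping `StratifiedKFold` for `KFold` gives plain K-fold, and `LeaveOneOut` from the same module gives the K = n case.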

### Metrics for Linear Regression

* MSE - average squared error over entire dataset - very commonly used - `sklearn.metrics.mean_squared_error`
* $ R^2 $
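    * $ R^2 = 1 - \frac{MSE}{Var(y)} $, which is at most 1; it lies between 0 and 1 on the training data for a least-squares fit with an intercept, and can be negative on held-out data when the model predicts worse than the mean of $ y $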
    * Interpretation: Fraction of variance accounted for by the model
    * Basically, standardized version of MSE
    * What counts as a good $ R^2 $ depends on the actual problem
    * $ R^2 $ on the training data never decreases when more variables are added to the model (can lead to overfitting - highest $ R^2 $ may not be the best model)
    * Adjusted $ R^2 $: Accounts for the number of variables, so it increases only when an added variable improves the fit more than would be expected by chance
        * Better metric for regression with multiple predictor variables (multiple regression)
    * `sklearn.metrics.r2_score`
* Gaussian - probability density function

$$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$$
![Gaussian](images/gaussian001.png)
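
Returning to the regression metrics listed above, the following sketch computes MSE and $ R^2 $ with `sklearn.metrics` on made-up numbers and checks the relation $ R^2 = 1 - \frac{MSE}{Var(y)} $. The `adjusted_r2` helper is a name introduced here; it uses the standard formula $ 1 - (1 - R^2)\frac{n - 1}{n - p - 1} $ and assumes a single predictor ($ p = 1 $) for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Same value from the definition, using the population variance of y.
r2_from_mse = 1 - mse / np.var(y_true)


def adjusted_r2(r2_value: float, n_samples: int, n_predictors: int) -> float:
    """Penalize R^2 for the number of predictors in the model."""
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_predictors - 1)


adj_r2 = adjusted_r2(r2, n_samples=len(y_true), n_predictors=1)

# Predictions that are worse than always guessing the mean give a negative R^2.
r2_bad = r2_score(y_true, y_true[::-1])

print(f"MSE={mse:.3f} R2={r2:.3f} adjusted R2={adj_r2:.4f} bad R2={r2_bad:.1f}")
```

Here the MSE is 0.375 and $ R^2 $ is 0.925; the reversed predictions give $ R^2 = -3 $, which shows that the score is not bounded below by 0 when a model fits worse than the mean.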

* Why do we study normal distribution so often?
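* Central Limit Theorem - whatever the original distribution of $ X $ (provided it has a finite mean $ \mu $ and variance $ \sigma^2 $), the mean $ \bar{X} $ of $ n $ independent samples approximately follows a normal distribution when $ n $ is large:

$$ \bar{X} \sim N \left(\mu, \frac{\sigma^2}{n} \right)$$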
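
A quick simulation can illustrate the theorem. The sketch below assumes an exponential distribution with rate 1 (strongly skewed, with $ \mu = 1 $ and $ \sigma^2 = 1 $) and a sample size of $ n = 50 $; both are arbitrary choices for the demonstration.

```python
import random
import statistics

random.seed(0)
n = 50         # size of each sample
trials = 5000  # number of sample means to collect

# Each entry is the mean of n independent draws from Exponential(1).
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (theory: 1.000)")
print(f"variance of sample means: {var_of_means:.4f} (theory: {1 / n:.4f})")
```

The sample means cluster around $ \mu $ with variance close to $ \sigma^2 / n $, even though the individual draws are far from normal.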

#### Confidence interval
* An average computed on a sample is merely an **estimate** of the true population mean
* Confidence interval: Quantifies margin-of-error between sample metric and true metric due to **sampling randomness**
* Informal interpretation: With x% confidence, true metric lies within the interval
* Precisely: If the sampling were repeated many times and an x% confidence interval computed from each sample, about x% of those intervals would contain the true metric
    * Differently stated for a 90% confidence interval:  If you randomly draw data from the distribution 100 times and compute 100 intervals, about 90 of those 100 intervals will contain the true value
* For population proportion (i.e. the truth), the confidence interval is:

$$ CI = p \pm z \sqrt{\frac{p(1-p)}{n}} $$

* Where p is the sample proportion, n is sample size, and z is z-score defined for that confidence level (e.g., 90, 95, 99)
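
The formula above translates directly into code. In this sketch the function name `proportion_confidence_interval` and the example numbers (a sample proportion of 0.8 from 400 observations) are assumptions; the z-score is looked up with the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p: float, n: int, confidence: float = 0.95):
    """Normal-approximation interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    # Two-sided interval: put half of the excluded probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


low, high = proportion_confidence_interval(p=0.8, n=400, confidence=0.95)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

For these inputs the margin is about 1.96 × 0.02 ≈ 0.039, so the interval is roughly 0.761 to 0.839. The normal approximation is usually considered reasonable only when both $ np $ and $ n(1-p) $ are not too small.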

### Using ML Models in Production: Storage

#### Data Formats
* Row-oriented formats
    * Comma/tab-separated values (CSV/TSV)
    * Read-only DB (RODB): Internal read-only file-based store with fast key-based access
    * Avro - allows schema evolution for Hadoop
* Column-oriented formats
    * Parquet: Type-aware and indexed for Hadoop
    * Optimized row columnar (ORC): Type-aware, indexed, and with statistics for Hadoop
* User-defined formats
    * JSON: For key-value objects
    * Hierarchical data format 5 (HDF5): Flexible data model with chunks
* Compression can be applied to all formats
* Usual trade-offs: Read/write speeds, size, platform-dependency, ability for schema to evolve, schema/data separability, type richness
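
A few of these trade-offs can be seen with the standard library alone. The sketch below writes the same made-up records as CSV and as JSON, compresses both with gzip, and reads one record back; the record layout is invented for illustration.

```python
import csv
import gzip
import io
import json

# Made-up records for illustration.
rows = [{"id": i, "category": "books", "price": 9.99} for i in range(1000)]

# Row-oriented CSV: compact, but every value is read back as a string.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "category", "price"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = csv_buffer.getvalue().encode("utf-8")

# JSON: keeps numeric types, but repeats the keys in every record.
json_bytes = json.dumps(rows).encode("utf-8")

# Compression can be applied to either format.
csv_gz = gzip.compress(csv_bytes)
json_gz = gzip.compress(json_bytes)

first_csv_row = next(csv.DictReader(io.StringIO(csv_bytes.decode("utf-8"))))
first_json_row = json.loads(json_bytes)[0]

print(len(csv_bytes), len(csv_gz), len(json_bytes), len(json_gz))
print(type(first_csv_row["id"]).__name__, type(first_json_row["id"]).__name__)
```

The CSV file is smaller but loses type information (`id` comes back as a string), while JSON preserves the integer at the cost of size; gzip shrinks both.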

#### Model and Pipeline Persistence
* Predictive Model Markup Language (PMML)
    * Vendor-independent XML-based language for storing ML models
    * Support varies in different libraries
        * KNIME (analytics/ML library): Full support
        * Scikit-learn: Extensive support
        * Spark MLlib: Limited support
* Custom methods
    * Scikit-learn: Uses the Python pickle method to serialize/deserialize Python objects
    * Spark MLlib: Transformers and Estimators implement MLWritable
    * TensorFlow: Allows saving of MetaGraph
    * MXNet: Saves into JSON
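
As a minimal sketch of the scikit-learn pickle route, the example below trains a tiny model on made-up data, serializes it to bytes, and restores it. Writing the bytes to a file or object store would follow the same pattern.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set for illustration.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Serialize the fitted model to bytes, then restore it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should behave exactly like the original.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Two cautions apply: unpickling can execute arbitrary code, so only load files from a trusted source, and a pickled model is generally expected to be loaded with the same library version that saved it.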
    
#### Model Deployment
* Technology transfer: Experimental framework may not suffice for production
    * A/B testing or shadow testing: helps catch production issues early
    * [Rules of Machine Learning: Best Practices for ML Engineering (2017)](http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf)
    
#### Information Security
* Make sure that you handle training and evaluation data in accordance with data classification
* Models may need to be treated with same classification level as source data.  Why?  Model parameters come from training data

### Using ML Models in Production: Monitoring and Maintenance

* It's important to monitor quality metrics and business impacts with dashboards, alarms, user feedback, etc.
    * The real-world domain may change over time ("model drift")
    * Software environment may change
    * High profile special cases may fail
    * There may be a change in business goals
* Performance deterioration may require new tuning
    * Changing goals may require new metrics
    * A changing domain may require changes to validation set
    * Your validation set may be replaced over time to avoid overfitting
    * Features may no longer be available
* Customer obsession
    * Think carefully about the impact on customer perception and trust
    * Give ML solutions the "creepiness sniff test" and "The front page of a newspaper test"
    * Provide explanations to customers

### Using ML Models in Production: Using AWS

* SageMaker - build, train, and deploy machine learning models at scale
    * built-in, high performance algorithms
    * one-click training
    * hyperparameter optimization
    * one-click deployment
    * fully managed hosting w/ auto scaling
* Rekognition - images and video, billions of images daily
* Lex - chatbot ASR and NLU
* Polly - text-to-voice - more than two dozen languages
* Comprehend - entity recognition, topic modeling, sentiment analysis, etc.
* Translate - neural machine translation service
* Transcribe - speech to text
* DeepLens
* AWS Glue - data integration service for managing ETL jobs
* [Deep Scalable Sparse Tensor Network Engine (pronounced "Destiny")](https://github.com/amzn/amazon-dsstne) - Neural network engine

### Common Mistakes

* You solved the wrong problem 
    * Interactions between data science and business teams must be early and often
* The data was flawed
    * Big data $ \neq $ good data
* The solution didn't scale
* The final result doesn't match the prototype's results
* It takes too long to fail
    * Pull the plug if there's a strong indication the project isn't going to work - the sooner you stop a failing project the sooner you can start a successful project
* The solution was too complicated
* There weren't enough allocated engineering resources to try out long-term science ideas
* There was a lack of collaboration