# Deployment Considerations

A strong case is made for considering deployment from the early stages of scoping a project. Pressing on with development in earnest before understanding the architecture the app will be hosted on can cause problems that ultimately cost more development time.

Once a data scientist presents a trained model for hosting, here are some points to consider:

1. **Compatibility with infrastructure.** Will it run on the target infrastructure?
2. **Reproducibility.**
    - metadata store
    - data versioning
3. **Validation.**
    - data profile to store expected value references for comparison
4. **Logging.**
    - Monitoring of performance over time.
    - Logging at three key stages: input data, system errors or warnings, and predicted values.
5. **Maintenance.**
    - Bug fixes, feature requests, etc. will need to be handled.
    - How confident you are in adjusting the codebase may depend on how
    comprehensive the test suite included with the code is.
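
The three logging stages listed above could be sketched roughly as follows. This is a minimal illustration using only Python's standard `logging` module; the logger name `model_service` and the function `predict_with_logging` are hypothetical, and `model` is assumed to be any callable that maps features to a prediction.

```python
import logging

logger = logging.getLogger("model_service")  # hypothetical logger name


def predict_with_logging(model, features):
    """Call a model and log the input, any failure, and the prediction."""
    logger.info("input=%r", features)  # stage 1: input data
    try:
        prediction = model(features)
    except Exception:
        # stage 2: system errors or warnings (traceback is included)
        logger.exception("prediction failed")
        raise
    logger.info("prediction=%r", prediction)  # stage 3: predicted values
    return prediction
```

Keeping all three stages in one wrapper means every prediction leaves a consistent trail that can later be used for monitoring performance over time.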



## Profiling

Automated data analytics used to produce high-level summary statistics, referred to as
**data profiles** or **expectations**. These can be used to provide useful error messages
to users who provide erroneous inputs to the model, or to indicate when enough data drift
has accumulated to suggest a need to retrain the model.

Data profiles should be stored within the **metadata store**.
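
To make the idea concrete, here is a minimal sketch of a data profile for a single numeric column, written in plain Python. It is not the Great Expectations API. The function names, the range check and the drift rule (mean shifted by more than two training standard deviations) are all assumptions chosen for illustration.

```python
import statistics


def build_profile(values):
    """Summarise a numeric training column as a simple data profile."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
    }


def validate_input(value, profile):
    """Return an error message if value is outside the profiled range, else None."""
    if not profile["min"] <= value <= profile["max"]:
        return (
            f"value {value} is outside the expected range "
            f"[{profile['min']}, {profile['max']}]"
        )
    return None


def has_drifted(new_values, profile, threshold=2.0):
    """Flag drift when the new mean moves more than `threshold` training stdevs."""
    shift = abs(statistics.fmean(new_values) - profile["mean"])
    return shift > threshold * profile["stdev"]
```

The profile is a plain dictionary, so it can be serialised and stored alongside the model's other metadata.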

The widely used [Great Expectations](https://github.com/great-expectations/great_expectations) package in Python can be used for these purposes.

![Great Expectations Logo](https://github.com/great-expectations/great_expectations/blob/develop/docs/docusaurus/static/img/gx-mark-160.png?raw=true)


## Data Version Control

It's important to be able to reproduce the model by recording which version of the data the model was trained on. The data itself should be stored within the data store. In the **metadata store**, we can record a pointer to the data version along with a verification fingerprint. If even a single record in the dataset has changed, the fingerprint will not match.

The model metadata should also record the information needed to reproduce the train-test split used during model training.
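
As a rough sketch of both ideas, the snippet below fingerprints a dataset with SHA-256 and derives a train-test split from a recorded seed. This illustrates the principle only and is not how any particular versioning tool works internally. The metadata field names and the path `data/train.jsonl` are hypothetical.

```python
import hashlib
import json
import random


def fingerprint(records):
    """Hash every record in order, so that any change alters the digest."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def split_indices(n_records, test_fraction, seed):
    """Return (train, test) indices, fully determined by the recorded seed."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = int(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]

# Hypothetical metadata entry; the field names and path are illustrative only.
metadata = {
    "data_path": "data/train.jsonl",
    "data_sha256": fingerprint(records),
    "split": {"test_fraction": 0.5, "seed": 42},
}
```

Recomputing the fingerprint before retraining and comparing it with the stored value confirms that the data is unchanged, and re-running `split_indices` with the stored seed recovers the same partition.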

Data versioning is often done with [Data Version Control](https://dvc.org/).

![DVC Logo](https://camo.githubusercontent.com/39f29e4d02d44888b656a15b1b51c14db4a139fb12da62b3c4e150b9bffa3373/68747470733a2f2f6476632e6f72672f696d672f6c6f676f2d6769746875622d726561646d652e706e67)

## Feature Stores

Record prepared data in **dual databases**: one optimised for bulk record retrieval
during the training phase and the other optimised for single-record retrieval during
prediction.
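
The dual-database idea can be sketched with a toy in-memory class. Real feature stores back the two paths with different storage engines. Here a list and a dictionary stand in for them, and the class and method names are hypothetical.

```python
class ToyFeatureStore:
    """In-memory stand-in for a feature store with two access paths."""

    def __init__(self):
        self._offline = []  # append-only rows, scanned in bulk for training
        self._online = {}   # latest row per entity, keyed for single lookups

    def write(self, entity_id, features):
        """Record one prepared row in both stores."""
        row = {"entity_id": entity_id, **features}
        self._offline.append(row)
        self._online[entity_id] = row

    def get_training_rows(self):
        """Bulk retrieval: every recorded row."""
        return list(self._offline)

    def get_online_features(self, entity_id):
        """Single-record retrieval: latest features for one entity."""
        return self._online[entity_id]
```

Because both paths are written from the same prepared row, training and prediction read features that were computed by the same code.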

Feature stores allow you to use features multiple times across several models. They also
reduce **train-serve skew**, where your model performs worse in production than at
training.

Such skew can arise when you train your model on clean data but, in production, the
model mostly receives unclean or differently formatted data. For example, training a
spam filter on extracted email text when users will mostly be passing HTML.
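
One way to reduce this kind of skew is to route both training and production inputs through the same normalisation function. The sketch below uses the standard library's `html.parser` to reduce HTML and plain-text email bodies to the same form. The name `normalise_email` is hypothetical, and a production version would need more care (for example, skipping the contents of `<script>` and `<style>` elements).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalise_email(raw):
    """Reduce plain-text or HTML email bodies to the same lowercase text."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join(" ".join(extractor.parts).split()).lower()
```

Applying `normalise_email` both when building the training set and when serving predictions means the model sees the same representation in both settings.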

***

## Model Build Pipelines

A distinction is made between a model pipeline and a model build pipeline.

**Model Pipeline** covers the standard feature-engineering-to-prediction workflow, as
captured by, say, a scikit-learn pipeline. Compare with the model build pipeline below.


### Model Pipeline
```{mermaid}
flowchart LR
    input(["Raw Data"])
    subgraph preprocessing
    eng["Engineering"]
    feat["Feature Extraction"]
    end
    subgraph prediction
    pred["Classification"]
    end
    lab(["Label"])
    input --> eng --> feat --> pred --> lab
```

### Model **Build** Pipeline

```{mermaid}
flowchart LR
    mod["Model Pipeline"]
    dat["Training Data"]
    train["Model Training"]
    write["Save Model"]
    mod --> train
    dat --> train --> write


```
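
The distinction in the two diagrams can be sketched with a toy spam classifier in plain Python. Here `extract_features` and `predict` play the role of the model pipeline, while `build_model` is the model build pipeline: it combines the pipeline with training data, trains, and saves the result. All names are hypothetical, the keyword rule is deliberately simplistic, and JSON stands in for whatever serialisation format a real project would use.

```python
import json
from collections import Counter


def extract_features(text):
    """Model pipeline: engineering (lowercase) and feature extraction (word set)."""
    return set(text.lower().split())


def train(texts, labels):
    """Learn the words that appear in spam but never in ham."""
    spam, ham = Counter(), Counter()
    for text, label in zip(texts, labels):
        (spam if label == "spam" else ham).update(extract_features(text))
    return sorted(word for word in spam if word not in ham)


def predict(spam_words, text):
    """Model pipeline: classification of raw text into a label."""
    return "spam" if extract_features(text) & set(spam_words) else "ham"


def build_model(texts, labels):
    """Model build pipeline: train on the data, then serialise the result."""
    return json.dumps({"spam_words": train(texts, labels)})
```

A usage example under the same assumptions:

```python
saved = build_model(["win free prize", "meeting at noon"], ["spam", "ham"])
model = json.loads(saved)
label = predict(model["spam_words"], "FREE prize inside")
```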