# Defining best practices
## ~~Shared best practices~~

Comments:

-   Open questions
    - Is organizing by must have versus should have the best idea?
    - Where do we draw the line between must versus should have? 
    - How do we deal with differences in degree rather than differences in kind? For example, a basic amount of clean code is clearly a must-have for ML engineering, but some aspects of it may only be should- or even would-like-to haves.

-  Alternatives:
    - ~~by maturity level~~ 
      - Corresponds to must/should/would like to haves: Low maturity corresponds to not having any; medium maturity corresponds to having must-haves, high corresponds to additionally satisfying should-haves, and highest corresponds to also satisfying would-like-to-haves.
    - Should we specify which are the minimum requirements once we enter production? Or we could just use color-coding!

-  Note that these may also differ by:
    - use case (e.g., inherent complexity of data transformations and ML modeling)
    - POC versus production.


### *Data science* best practices

#### Must-haves
- Background infra / team investment ~~/ Environment / Ecosystem~~
  - Quick & easy dev setup
    - Align incentives: The recommended way should be the easy way. 
      - Don't just write up the required setup steps in a readme – instead, put them in a script whenever possible! [Makefiles are ideal for this](./_details/why_makefiles.md).
  - Availability of tooling to:
    - manage the complexity that arises out of the *iterative* workflow 
      - In particular, a solution for experiment management (e.g., MLFlow). A model registry also falls into this category, as it serves to mark which of the many trained models have passed a quality threshold that makes them fit for deployment.
    - deal with *data versioning*
      - For basic use cases, Data Version Control may be a good enough starting point.
      - If the data has an important time-series component, the ability to time-travel (e.g., retrieve a snapshot of what the data looked like at a given point in time) becomes essential in order to avoid information leakage, which is a common source of training-serving skew.
        - This functionality could come from either:
          - the underlying data infrastructure (e.g. it is available in Snowflake, DeltaLake, or Apache Iceberg tables); or
          - a dedicated feature store.
      - If the underlying data infrastructure does not allow implementing time-travel functionality, we may need to work around this in the short term by:
        - Making sure the data science team has the right time-series expertise in order to deal with these complications manually; and
        - Communicating to the business side what limitations we incur until our underlying data infrastructure supports time-travel:
          - *Which kinds of questions we won't be able to answer,* except at great cost (e.g., "Can you show me what the model would have predicted last January when we had this and that problem going on...")
          - Greater risk of training-serving skew;
          - Slower pace of feature delivery;
          - Greater cost: Need for more advanced skills.
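
To make the time-travel idea above concrete, here is a minimal sketch of a point-in-time ("as of") lookup in plain Python. The feature history, its layout as sorted `(valid_from, value)` pairs, and the function name `value_as_of` are all hypothetical; a real setup would rely on the data platform or a feature store instead.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical feature history: (valid_from, value) pairs, sorted by date.
FEATURE_HISTORY = [
    (date(2023, 1, 1), 10.0),
    (date(2023, 3, 1), 12.5),
    (date(2023, 6, 1), 9.0),
]


def value_as_of(history, as_of):
    """Return the latest value whose valid_from is <= as_of, or None."""
    dates = [valid_from for valid_from, _ in history]
    idx = bisect_right(dates, as_of)
    if idx == 0:
        return None  # nothing was known yet at that point in time
    return history[idx - 1][1]
```

Building training rows only from `value_as_of(history, prediction_date)` keeps values recorded after the prediction date out of the features, which is the leakage the notes above warn about.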

- Team processes:
  - Coding standards:
    - Python
    - Notebook/Data-science-specific:
      - Clean up notebook before merging.
        - Remove duplication, e.g. resulting from copy-pasting cells and running similar training with different parameters. If we want to store the results from different experiments, this should be handled by an experiment management solution such as MLFlow.
        - Decide what we want to keep and what can be deleted. If in doubt, delete it. While it is always tempting to keep code around "just in case", this increases the cognitive load on the future reader and decreases the information density.
      - Notebook output cells should reflect a run from top to bottom. 
        - This is not easy if the notebook includes expensive calculations, which makes it impractical to simply restart the kernel and then re-run the whole notebook before committing.
        - There are some tools that are supposed to help with this, but I have not looked into them.
        - [The best solution may be to simply limit the use of notebooks beyond initial experimentation phase](https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/68282.html).
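
One possible way to check the "run from top to bottom" rule automatically is sketched below. It assumes the standard notebook JSON layout (a `cells` list whose code cells carry an `execution_count`); the function name `executed_top_to_bottom` is made up for illustration, and empty code cells are skipped on the assumption that they carry no execution count.

```python
import json


def executed_top_to_bottom(notebook_json: str) -> bool:
    """Check that non-empty code cells are numbered 1, 2, 3, ... in document order."""
    notebook = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("source")
    ]
    return counts == list(range(1, len(counts) + 1))
```

A check like this could run as a pre-commit hook or in a pipeline, so that out-of-order notebooks are caught before review rather than during it.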

  - DevOps:
    - Code versioning:
      - Use GitHub, etc. for sharing code – don't share notebooks through email/chat, shared drive, etc.
      - However, Continuous Integration (i.e., merging code to mainline at least once a day) is NOT generally a good fit for the iterative data science workflow. This is because notebooks require a good amount of cleanup before it makes sense to merge (see "team processes/coding standards" above). 
    - Environment and dependency management: Environment should be *easily reproducible* for others. See more detailed notes [here](../coding/python/readme.md).
      - Package manager: Use pip over conda, if at all possible
      - For short-lived notebooks, it is usually sufficient to track *either* abstract or concrete dependencies. 
      - However, if dependency conflicts become common, it is time to track both (e.g., using pipenv).
      - It should be clear which (minor) version of Python to use to re-create an environment. Unfortunately, there is no way to encode the required Python version in a requirements.txt file. So the main choices are:
        - If we want to stick with pip-installing a requirements.txt, the best we can do is probably to define all environment-related commands in a Makefile that explicitly hard-codes the minor version of Python.
        - Alternatively, this challenge alone can be a good reason to already learn how to use more sophisticated tools such as pipenv instead (which also bring additional benefits).
      - If run in a managed notebook environment, it should be clear which instance size and kernel is required to run the notebook.
      - Make sure to use reasonably up-to-date package versions. E.g., don't reuse a pre-existing environment for a new project out of laziness; periodically update dependencies, etc.
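
As a complement to hard-coding the Python minor version in a Makefile, a setup script could also verify the interpreter at runtime. The following is only a sketch: the pinned version `(3, 11)` and the function name `ensure_python_version` are placeholders, not a recommendation.

```python
import sys

# Hypothetical pin; in practice this would be whatever version the team agreed on.
REQUIRED_PYTHON = (3, 11)


def ensure_python_version(required=REQUIRED_PYTHON, actual=None):
    """Raise RuntimeError unless the interpreter's major.minor matches the pin."""
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual)
    required = tuple(required)
    if actual != required:
        wanted = ".".join(map(str, required))
        found = ".".join(map(str, actual))
        raise RuntimeError(f"Python {wanted} required, found {found}")
    return actual
```

Calling `ensure_python_version()` at the top of a setup or training script makes a mismatched interpreter fail immediately instead of surfacing later as an obscure dependency error.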

  - *Common strategy* for how to *version* ML-specific artifacts:
    - models/experiments
      - Purpose: Avoid being overwhelmed by the great number of experiments
      - Goals:
        - Reproducibility -> Needs to log all relevant parameters (including data versions)
        - Low overhead -> Ideally, all experiments are automatically logged, without requiring custom code, etc.
    - data
  - Align with ML engineers on choices that have a downstream impact (deployability, maintainability, etc), to avoid the risk of going down an inferior path.
    - Double-check that environment creation process is sufficiently reproducible for production deployment.
    - For important packages, align on which versions should be used.
      - e.g., don't use any version of Python that is at or near its end of life. Usually, a "middle-aged" minor version of Python is the best choice, because it often takes surprisingly long until the newest version is supported everywhere.
      - e.g., for Pandas, we may want to always use 2.x in order to get additional functionality and performance benefits (Arrow, copy-on-write, missing-value handling, etc).
    - Before creating new features or training a new model, align on how the different options of doing so affect productionalizing it.
      - This is especially important if re-training would be prohibitively expensive.
      - At the very latest, this should be done before producing anything (code, data, models) that may be used in production. For example, it's ok to run experiments on a small subset of the data, where the main purpose is to get the code to work, but where any parameter estimates will be discarded.
      - However, the smartest point to have this discussion is usually earlier than that, namely *before investing a considerable amount of time* into trying out something new.
      - E.g., if Sagemaker Pipelines is used for deployment, it is easiest and safest to already use a Sagemaker Processor for preprocessing and feature engineering, and to use [an estimator from Sagemaker](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) for model training. 
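
The reproducibility goal from the versioning strategy above (log all relevant parameters, including data versions) can be illustrated with a small standard-library sketch. In practice an experiment management tool such as MLFlow would take this role; the record layout and the names `data_fingerprint` and `experiment_record` are assumptions made for this example.

```python
import hashlib
import json


def data_fingerprint(raw_bytes: bytes) -> str:
    """Short content hash used here as a stand-in for a data version."""
    return hashlib.sha256(raw_bytes).hexdigest()[:12]


def experiment_record(params: dict, data: bytes, metrics: dict) -> str:
    """Serialize what is needed to re-run an experiment as one JSON line."""
    record = {
        "params": params,
        "data_version": data_fingerprint(data),
        "metrics": metrics,
    }
    return json.dumps(record, sort_keys=True)
```

Because the fingerprint is derived from the data content, two runs with identical parameters but different training data produce distinguishable records.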

- individual
  - Follow an iterative and interactive workflow
  - ...but *clean up* code before handing off to others (whether to fellow data scientists or ML engineers) - see above.
  - Learn to *use* the essential tooling discussed above

#### Should-haves

- Clean code
- Invest time to find good data-science tools for the job
- Use engineering tools determined to be worth the investment
  - The engineering team should help with tool evaluation, recommendation, and setup
  - Examples: Use type hints, depending on maturity
  - leverage the power of a proper IDE (rather than notebook in browser)

#### Would-like-to-haves

- Leverage design patterns to achieve loose coupling between components
- Trusted test suite (automated unit and integration/acceptance tests); automated data validation, static analysis

#### Does-not-need-to-haves


## *ML Engineering* best practices

### Must have

- Clean code
  - Rationale:
    - Code is read much more often than it is written -> it should be optimized for readability.
    - Reduces bugs.
    - Increases agility, because it makes code easier to change.
    - Overall, reduces maintenance cost (which is the majority of the cost of a typical software project)
- Leverage design patterns to achieve loose coupling between components
  - Rationale:
    - Greatly reduces complexity, thereby ensuring code stays maintainable (reduces cost and risk, while increasing speed of feature implementation)
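
As one illustration of loose coupling, the sketch below has the scoring code depend on a small interface rather than on a concrete model class. The names `Model`, `MeanBaseline`, and `score_batch` are invented for this example; it shows one pattern among many, not a prescribed design.

```python
from typing import Protocol, Sequence


class Model(Protocol):
    """Anything with a matching predict method satisfies this interface."""

    def predict(self, rows: Sequence[Sequence[float]]) -> list[float]: ...


class MeanBaseline:
    """Hypothetical model: predicts the mean of each row's features."""

    def predict(self, rows: Sequence[Sequence[float]]) -> list[float]:
        return [sum(row) / len(row) for row in rows]


def score_batch(model: Model, rows: Sequence[Sequence[float]]) -> list[float]:
    """Depends only on the Model interface, not on a concrete model class."""
    return [round(prediction, 3) for prediction in model.predict(rows)]
```

Swapping in a different model, or a stub in a unit test, then requires no change to `score_batch`.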

- Trusted, automated test suite
  - Components:
    - unit tests
    - integration/acceptance tests
    - data validation
    - static analysis
  - Rationale:
    - Increases quality (reduces errors, outages, etc.)
    - Decreases costs, since the cost of bugs rises the later in the SDLC they are discovered ("shifting left")
    - Indirect benefit: A trusted test suite makes sure that engineers are not dominated by their own creation
- Type safety:
  - code: use type hints, and enforce them in the CI pipeline
  - data: use explicit data schemas
- Observability:
  - code
    - structured & centralized logging
    - monitoring and alerting
    - distributed tracing if using a microservice architecture
  - model performance
    - comparison between different models
    - comparison of same model over time (model drift?)
    - performance for different subsets of the population/bias (if substantial, does this vary by model?).
  - data
    - data quality
- DevOps
  - Use infrastructure-as-code
    - CI/CD
    - Enforce quality gates in pipeline
      - in particular: 
        - tests: run unit and acceptance tests, check test coverage threshold, static code analysis (especially type checking)
        - readability: linting or auto-formatting, code complexity checks
        - security scans: scan dependencies for vulnerabilities and license risks, static analysis
      - Note: Enforcement requires that the pipeline blocks deployment if any of these checks fail.
- Data lineage
- Invest time to find good tools for the job
  - Avoid reinventing the wheel -> Leverage managed services wherever possible (unless “unfair” pricing)
  - leverage power of IDE (rather than notebook in browser)
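
For the "explicit data schemas" item under type safety above, a minimal hand-rolled sketch follows. The field names in `SCHEMA` and the function `validate_row` are hypothetical; a dedicated schema or validation library would usually be preferable to writing this by hand.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"customer_id": int, "age": int, "monthly_spend": float}


def validate_row(row: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the row conforms."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in row:
            errors.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            actual = type(row[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Flag fields that the schema does not know about.
    for name in sorted(row.keys() - SCHEMA.keys()):
        errors.append(f"unexpected field: {name}")
    return errors
```

Returning all violations at once, rather than raising on the first one, tends to make data-quality reports easier to act on.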

### Should have
- DevOps:
  - Infra-as-Code:
    - Also manage most of the *data science* infrastructure using IaC. 
      - The reason I say "most" is because we get the biggest bang for our buck by focusing on the constant/long-lived infrastructure components; if a data scientist wants to try out some new resources, it's fine to create it using the console - thereby, we avoid introducing dependencies/blockers.
  - CI/CD:
      - Manual modifications to prod should only be possible in emergencies 
        - Engineers should have read-only access to prod
        - "break-glass account" (with sign-off process) for emergencies
- Profile code and optimize performance bottlenecks
  - (Why we do not consider this a must-have: Engineers' time is very expensive, and so is delaying features or accumulating technical debt, so unfortunately performance optimization sometimes has to be sacrificed for these even more important goals.)
- Periodically reevaluate if there are better tools for the job
  - e.g., Pandas alternatives (such as Polars)
  - e.g., Spark vs Presto
  - e.g., End-to-end (Sagemaker) versus best-of-breed MLOps tools
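
For the profiling item above, a first pass with the standard library's `cProfile` could look like the sketch below. The workload `slow_sum_of_squares` and the helper `profile_call` are invented for illustration.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive, hypothetical workload to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return a report of the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()
```

Printing `profile_call(slow_sum_of_squares, 100_000)` shows where the time goes, which helps confirm that a bottleneck is real before any engineering time is spent on optimizing it.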

### Would-like to have

### (Does not need to have:)

- Best practices for data science:
