# *ML Engineering* best practices
## Must have
- Clean code: Prioritizing readability and maintainability
  - Refactor for readability, e.g.:
    - proper naming
    - separate levels of abstraction (e.g., a function should only do a single thing)
    - Don't do too much in a single line
    - Leverage design patterns to decouple application components
    - Reserve comments for the "why", not the "what" (which should be apparent from the code itself)
  - Follow coding standards. Rationale: See for example [the excellent discussion in "Software Engineering at Google"](https://abseil.io/resources/swe-book/html/ch08.html)
    - Pick an existing standard (e.g., PEP-8 or Google Python Style Guide)
    - Customize it, if desired. (Make sure to document the rationale for decisions.) 
    - Define as a configuration file (e.g., .pylintrc + mypy.ini). An easy way to get started is [the .pylintrc from Google's Style Guide](https://github.com/google/styleguide/blob/gh-pages/pylintrc)
    - Enforce by running linting in CI pipeline. 
    - Optionally, configure IDE to run linting on file save. Generally, it makes sense to at least run basic linting, such as fixing whitespace issues, in IDE.
  - Rationale for why clean code is a must-have:
    - Code is read much more often than it is written, so it should be optimized for readability.
    - Reduces bugs.
    - Increases agility, because it makes code easier to change.
    - Overall, reduces maintenance cost (which is the majority of the cost of a typical software project).
  - Note: This should be part of our definition of done. If we made it a separate story, we would be using a Scrum-fall/mini-waterfall (and we all know that it's too easy to indefinitely postpone these important but not urgent problems).
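
A minimal sketch of two of the readability points above: each function stays at one level of abstraction, and the only comment records a "why". The function names and the threshold are hypothetical and chosen purely for illustration.

```python
from statistics import mean

# Hypothetical threshold. The comment records the "why", not the "what":
# readings above this value are assumed to come from a faulty sensor and
# would skew the average.
MAX_PLAUSIBLE_READING = 1_000.0


def drop_implausible(readings: list[float]) -> list[float]:
    """Low-level step: keep only readings within the plausible range."""
    return [r for r in readings if r <= MAX_PLAUSIBLE_READING]


def average_reading(readings: list[float]) -> float:
    """High-level step: clean first, then aggregate."""
    plausible = drop_implausible(readings)
    return mean(plausible)
```

Splitting the filter from the aggregation keeps each line simple and lets each step be tested on its own.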

- Leverage design patterns to achieve loose coupling between components
  - Rationale:
    - Greatly reduces complexity, thereby ensuring code stays maintainable (reduces cost and risk, while increasing speed of feature implementation)
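
One common way to get loose coupling in Python is to depend on a small interface rather than on a concrete backend. The sketch below assumes a hypothetical `ModelStore` interface and `publish_model` function; it is one possible pattern, not a prescribed design.

```python
from typing import Protocol


class ModelStore(Protocol):
    """Anything that can persist a serialized model under a name."""

    def save(self, name: str, payload: bytes) -> None: ...


class InMemoryStore:
    """Test double; a production store would expose the same method."""

    def __init__(self) -> None:
        self.saved: dict[str, bytes] = {}

    def save(self, name: str, payload: bytes) -> None:
        self.saved[name] = payload


def publish_model(store: ModelStore, name: str, payload: bytes) -> None:
    """Depends only on the ModelStore interface, not on a concrete backend."""
    if not payload:
        raise ValueError("refusing to publish an empty model")
    store.save(name, payload)
```

Because `publish_model` only knows the interface, the storage backend can be swapped (or faked in tests) without touching the calling code.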

- Trusted, automated test suite
  - Components:
    - unit tests
    - integration/acceptance tests
    - data validation
    - static analysis
  - Rationale:
    - Increases quality (reduces errors, outages, etc.)
    - Decreases costs, since the cost of bugs rises the later in the SDLC they are discovered ("shifting left")
    - Indirect benefit: A trusted test suite makes sure that engineers are not dominated by their own creation
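
As a small illustration of the unit-test component, here is a sketch using the standard library's `unittest`. The function under test, `clip_probability`, is a hypothetical example; `run_tests` is only a convenience helper for running the suite programmatically.

```python
import unittest


def clip_probability(p: float) -> float:
    """Clamp a model score into the closed interval [0, 1]."""
    return min(1.0, max(0.0, p))


class ClipProbabilityTest(unittest.TestCase):
    def test_values_inside_range_are_unchanged(self) -> None:
        self.assertEqual(clip_probability(0.25), 0.25)

    def test_values_outside_range_are_clamped(self) -> None:
        self.assertEqual(clip_probability(-0.5), 0.0)
        self.assertEqual(clip_probability(1.7), 1.0)


def run_tests() -> unittest.TestResult:
    """Load and run the test case above, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClipProbabilityTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would typically live in their own module and be discovered by a test runner in the CI pipeline rather than invoked through a helper.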

- Type safety:
  - code: use type hints, and run type checking (e.g. using mypy or pyre) 
    - Leverage the gradual nature of Python's type system:
      - Makes it possible to apply type checking to only part of the code base, and increase coverage and strictness over time.
      - Does not require data scientists to add type annotations; but allows engineers to add these improvements later on.
    - Don't use dictionaries for heterogeneous collections. Often dataclasses/attrs or a custom class is the best choice. Otherwise, the type checker has no way of knowing if a given attribute access or method call is valid or not.
    - *enforcement*  in CI pipeline:
      - fail pipeline if type checks fail
      - guard against regression:
        - If the codebase is fully typed, disallow untyped expressions and the use of "Any". Line-by-line overrides must remain possible, but PR reviewers should enforce that each override is justified and that the reason is documented in a comment.
        - As long as we are still on the journey to full coverage, we need to at least make sure we are making progress by enforcing that each successive commit increases the type coverage.
  - data: use explicit data schemas
    - At a minimum, check schema at input and output.
    - Tools for semi-structured data:
      - For parsing JSON, such as an API response or configuration read from a file, pydantic is usually the best choice. It is very performant (the core of v2 has been reimplemented in Rust) and offers a lot of useful configuration, such as whether to allow type coercion. Thus, it has become a de facto industry standard.
      - Where no parsing is necessary (i.e., when the data is already represented as Python objects), dataclasses and attrs have a performance advantage over pydantic.
    - Tools for tabular data:
      - Store data in a format that includes the schema (e.g., parquet). In particular, there is virtually never a reason to use CSV as the file format for production use cases!
      - A good choice for defining the schema for Pandas (and other) DataFrames is the pandera library. It allows defining the datatypes of columns (and optionally the column order), as well as other essential column constraints such as uniqueness and nullability.
        - Note that it is also possible to enforce the data type when reading data into Pandas by passing the `dtype` argument (e.g., to `read_csv`), but this does not allow us to specify any other column constraints.
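
The sketch below illustrates two of the points above with the standard library only: a dataclass instead of a heterogeneous dictionary, and a schema check at the input boundary. `TrainingConfig`, its fields, and `parse_config` are hypothetical names; in practice a library such as pydantic would likely replace the hand-written checks.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Typed replacement for a heterogeneous dict; field names are illustrative."""

    learning_rate: float
    epochs: int
    model_name: str


def parse_config(raw: str) -> TrainingConfig:
    """Validate the schema once, at the input boundary, then return a typed object."""
    data = json.loads(raw)
    expected = {"learning_rate": float, "epochs": int, "model_name": str}
    if not isinstance(data, dict) or set(data) != set(expected):
        raise ValueError("config must be an object with exactly the expected keys")
    for key, expected_type in expected.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            raise TypeError(f"{key!r} must be {expected_type.__name__}")
    return TrainingConfig(**data)
```

Downstream code can then access `config.epochs` and have a type checker verify the attribute, which is not possible with `config["epochs"]` on a plain dictionary. Note that this sketch is strict and performs no casting (for example, it rejects `"5"` for `epochs`).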


- Observability:
  - code
    - structured & centralized logging
    - monitoring and alerting
    - distributed tracing if using micro-service architecture
  - model performance
    - comparison between different models
    - comparison of same model over time (model drift?)
    - performance for different subsets of the population/bias (if substantial, does this vary by model?).
    - Note that model performance is not always observable right away. If it is not, model validation becomes much harder, and we should pick a variety of validation metrics to track. In this case, comparison across models and over time is even more challenging, especially if there is no obvious way to boil down the different validation metrics to just a single metric.
      - A prominent example of this problem is the failure of Zillow's attempt to leverage their price prediction algorithm in order to buy underpriced properties and sell them at a premium: While it was straightforward to measure the performance of the model for predicting historical transaction prices, it was virtually impossible to know how good live predictions were for *the relevant subset of properties where Zillow's bid actually won.* Only a few months later, when those properties had been turned around and sold to new buyers, did it become possible to finally measure how accurate model predictions were. At that point, it became apparent that accuracy was alarmingly low (because bids won were not a random sample of all properties, but tended to be those where the model most over-estimated the value, so positive and negative errors on the whole sample did not cancel out).
  - data
    - data quality

- DevOps
  - Use infrastructure-as-code
  - CI/CD
    - Enforce quality gates in pipeline
      - in particular:
        - tests: run unit and acceptance tests, check test coverage threshold, static code analysis (especially type checking)
        - readability: linting or auto-formatting, code complexity checks
        - security scans: scan dependencies for vulnerabilities and license risks, static security analysis
      - Note: Enforcement requires that the pipeline blocks deployment if any of these checks fails.
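
A minimal sketch of how such a gate step could be wired up: run every check and return a non-zero exit code if any of them fails, so that the pipeline stops before deployment. The `run_gates` and `main` helpers are hypothetical; most CI systems provide this behavior natively when each check is its own step.

```python
import subprocess
import sys


def run_gates(gates: dict[str, list[str]]) -> list[str]:
    """Run every gate command and return the names of the gates that failed."""
    failed = []
    for name, command in gates.items():
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


def main(gates: dict[str, list[str]]) -> int:
    """Exit code for the pipeline step: non-zero blocks deployment."""
    failed = run_gates(gates)
    for name in failed:
        print(f"quality gate failed: {name}", file=sys.stderr)
    return 1 if failed else 0
```

Assuming the tools are installed, the gates could be commands such as `["mypy", "src"]`, `["pylint", "src"]`, or `["pytest", "--cov=src", "--cov-fail-under=80"]` (the last one requires the pytest-cov plugin; the `src` path and the threshold are placeholders).
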
- Be careful to create a data architecture that stays agile: Defer important decisions until the "latest responsible moment" – but not any later!
  - Note that this entails a balance: on the one hand, we should not prematurely worry about details we can figure out later; on the other hand, we should think ahead about the big picture, so that we avoid getting trapped on an inferior path.
    - Example: If there is a business use case for stream processing, the actual implementation of stream processing is probably something that can wait. However, any architectural and technology choices should take into account that we should be able to easily add this functionality later on, without having to undergo a complete redesign.

- Data lineage / provenance
  - Rationale for why this is a must-have: 
    - This is required for reproducibility of our results: Even if the code is versioned in GitHub etc., this tells only half the story if you don't know on what data it was run. 
      - Reproducibility is even a regulatory requirement for some use cases;
      - Even where it is not, it makes debugging inferior model performance *much* easier, preventing the team from eventually spending most of its time searching for the proverbial needle in the haystack.
    - Thus, the only cases in which it may be justifiable to do without data lineage are:
      - data is very trustworthy (which is rare);
      - the model is not used in production;
      - the model is not used for any high-stakes decisions; or
      - we get quick feedback on model performance (e.g., can observe the true value soon after, allowing us to detect problems quickly).
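
As a very lightweight starting point (not a substitute for a dedicated lineage tool), each run can record a fingerprint of the exact data it used next to the code version. The helper names below are hypothetical, and the code version is assumed to be supplied by the caller, for example as a Git commit hash.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Content hash of a data file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lineage_record(data_path: Path, code_version: str) -> dict[str, str]:
    """Pair the code version with a fingerprint of the exact data it ran on."""
    return {
        "data_path": str(data_path),
        "data_sha256": file_sha256(data_path),
        "code_version": code_version,  # e.g., a Git commit hash from the caller
    }
```

Storing this record with every trained model makes it possible to tell later whether two runs differed in code, in data, or in both.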

- Invest some time to find good (but not perfect) tools for the job
  - Avoid reinventing the wheel -> Leverage managed services wherever possible (unless the pricing is "unfair")
  - Leverage the power of an IDE (rather than a notebook in the browser)

## Should have
- DevOps:
  - Infra-as-Code:
    - Also manage most of the *data science* infrastructure using IaC. 
      - The reason I say "most" is that we get the biggest bang for our buck by focusing on the constant/long-lived infrastructure components; if a data scientist wants to try out some new resources, it's fine to create them using the console. This way, we avoid introducing dependencies/blockers.
  - CI/CD:
    - Manual modifications to prod should only be possible in emergencies
      - Engineers should have read-only access to prod
      - "break-glass account" (with sign-off process) for emergencies
- Enable real-time intelligence (if there is a business use case):
  - Support stream processing
    - Depending on use case, micro-batch (e.g. Spark Structured Streaming) may be enough, or "real" (single-record) stream processing (e.g., Flink) may be necessary for lowest latency.
    - Note that this sometimes requires substantial changes to the data architecture, if it has not been designed with streaming in mind. Generally, we want to avoid doing all the work twice ("Lambda Architecture"), but instead leverage an architecture that allows both batch and streaming ("Kappa Architecture").
  - *Online* feature store
  - Low-latency inference and serving
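
The core idea of avoiding duplicated batch and streaming logic can be sketched without any framework: define the transformation once, per record, and treat a batch as the replay of a bounded stream. The event shape and function names are hypothetical; in a real system, an engine such as Flink or Spark Structured Streaming would take the place of the plain generator.

```python
from collections.abc import Iterable, Iterator


def enrich(event: dict[str, float]) -> dict[str, float]:
    """Single definition of the business logic (hypothetical feature)."""
    return {**event, "amount_usd": event["amount_cents"] / 100}


def process_stream(events: Iterable[dict[str, float]]) -> Iterator[dict[str, float]]:
    """Record-at-a-time processing; works for a live feed or a replayed log."""
    for event in events:
        yield enrich(event)


def process_batch(events: list[dict[str, float]]) -> list[dict[str, float]]:
    """A batch is handled by replaying it through the same streaming code path."""
    return list(process_stream(events))
```

Because both entry points share `enrich`, a change to the logic only has to be made and tested once.
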
- Identify and optimize performance bottlenecks
  - Profile code
  - Why we do not consider this a must-have: Engineers' time is very expensive, and so is delaying features or accumulating technical debt. Thus, performance optimization sometimes has to be sacrificed for these even more important goals.
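
For profiling, the standard library's `cProfile` and `pstats` are often enough to find the first bottleneck. The sketch below wraps them in a hypothetical `profile_call` helper; `slow_sum_of_squares` is a deliberately naive workload used only as a profiling target.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive workload used only as a profiling target."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args) -> tuple[object, str]:
    """Run func under cProfile and return its result plus a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
    return result, buffer.getvalue()
```

Calling `profile_call(slow_sum_of_squares, 1_000_000)` returns the computed value together with a report that lists the most expensive calls, which helps focus optimization effort on the code that actually dominates runtime.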

- Periodically reevaluate if there are better tools for the job
  - Data processing
    - e.g., Pandas alternatives (such as Polars for greater speed and memory efficiency, or Modin + Ray for even greater speedup by parallelizing across multiple machines)
    - e.g., Spark vs Presto
  - MLOps tools
    - If we started out with an end-to-end platform (e.g., SageMaker): Is it worth switching to any of the best-of-breed MLOps tools for specific parts of the ML lifecycle (e.g., Tecton as feature platform, MLflow for experiment tracking, Seldon for model serving, etc.)?



## Would like to have

## (Does not need to have:)

- Best practices for data science:
