# ~~Shared best practices~~

Comments:

-   Open questions
    - Is organizing by must have versus should have the best idea?
    - Where do we draw the line between must versus should have? 
    - How do we deal with differences in degree rather than differences in kind? For example, a basic amount of clean code is clearly a must-have for ML engineering, but some aspects of it may only be should- or even would-like-to haves.

-  Alternatives:
    - ~~by maturity level~~ 
      - Corresponds to must/should/would like to haves: Low maturity corresponds to not having any; medium maturity corresponds to having must-haves, high corresponds to additionally satisfying should-haves, and highest corresponds to also satisfying would-like-to-haves.
    - Should we specify which are the minimum requirements once we enter production? Or we could just use color-coding!

-  Note that these may also differ by:
    - use case (e.g., inherent complexity of data transformations and ML modeling)
    - POC versus production.

# End Goals (data science)
- Reproducible analysis
  - Code
  - Environment
  - Data
  - Results
- Efficient collaboration with and handover to ML engineers (since the end goal is deployment to production)
- Easy setup: Minimize the *time* scientists spend on DevOps/software dev work (e.g., environment setup)
- Minimize *knowledge* requirement of software dev tools and concepts
  - Note that this goal often conflicts with reproducible analysis and efficient handover to engineering, requiring difficult trade-offs.

# *Data science* best practices
## Must-haves
### 1) Background infra / team investment ~~/ Environment / Ecosystem~~
- Quick & easy dev setup
  - Align incentives: The recommended way should be the easy way. 
    - Don't just write up the required setup steps in a readme – instead, put them in a script whenever possible! [Makefiles are ideal for this](./_details/why_makefiles.md).
  - This is most efficiently solved by a central (company-wide) ML infrastructure team, so that each team does not have to reinvent the wheel. However, where such a central team does not yet exist, this task can easily be handled by ML engineers. (In the latter case, this is a good candidate for scaling the solution developed to other teams, once the benefits can be demonstrated.)
  - Try to leverage *managed* services, rather than re-inventing the wheel! Because undifferentiated heavy lifting is more cheaply carried out by specialized providers, it pays to focus on a company's *core competencies*. Most companies tend to under-rely on managed services because:
    - They systematically under-estimate the relative cost of building versus buying, because they:
      - are overly focused on the additional out-of-pocket *expenses* of managed services, while not sufficiently taking into account the value of engineers' time that this frees up;
      - don't sufficiently account for the risk of "unknown unknowns" of implementing a solution themselves (["planning fallacy"](https://en.wikipedia.org/wiki/Planning_fallacy));
    - If there is already an internal team currently providing these services in-house, these internal stakeholders - who now risk losing their project - often have disproportionate influence on decision-making. This is
      - because of [loss aversion](https://en.wikipedia.org/wiki/Loss_aversion), i.e. people tend to be more sensitive to losses compared to gains;
      - because the losses are *concentrated* while the gains are *spread out*, making it easier for the would-be losers to organize and lobby for their interests.
- Availability of tooling to:
  - Use the explorative notebook workflow without forgoing the benefits of an IDE (e.g., syntax highlighting, code completion, easy display of documentation such as parameter names of a function, ability to use powerful plugins such as GitHub Copilot)
    - Reliable solution to run notebooks in the IDE (VS Code support for notebooks is borderline as of late 2023 - net positive in my opinion, with still a lot of bugs but no dealbreakers)
    - Promising alternative: JetBrains Datalore. Managed notebook environment with a lot of additional features (in particular, built-in visualization of data frames, similar to what Databricks does for Spark; features for collaboration and sharing of interactive reports). Need to evaluate in more detail if it is worth the cost, but if it has fewer bugs than VS Code, the time it saves data scientists should easily outweigh the subscription cost.
  - manage the complexity that arises out of the *explorative* workflow 
    - In particular, a solution for experiment management (e.g., MLflow). A model registry also falls into this category, as it serves to mark which of the many trained models have passed a quality threshold that makes them fit for deployment.
  - deal with *data versioning*
    - For basic use cases, Data Version Control may be a good enough starting point.
    - If the data has an important time-series component, the ability to time-travel (e.g., retrieve a snapshot of what the data looked like at a given point in time) becomes essential in order to avoid information leakage, which is a common source of training-serving skew.
      - This functionality could come from either:
        - the underlying data infrastructure (e.g., it is available in Snowflake, Delta Lake, Apache Iceberg tables, or any system based on event-sourcing); or
        - a dedicated feature store.
    - If the underlying data infrastructure does not allow implementing time-travel functionality, we may need to work around this in the short term by:
      - Making sure the data science team has the right time-series expertise to deal with these complications manually; and
      - Communicating to the business side which limitations we incur until our underlying data infrastructure supports time-travel:
        - *Which kinds of questions we won't be able to answer,* except at great cost (e.g., "Can you show me what the model would have predicted last January when we had this and that problem going on...")
        - Greater risk of training-serving skew;
        - Slower pace of feature delivery;
        - Greater cost: Need for more advanced skills.
  - Share productionalized features between teams to:
    - Avoid duplicating work. (Reducing this waste is even more important since data cleaning and feature engineering tend to be among the least favorite parts of a data scientist's job. Thus, by allowing them to spend a greater proportion of their time on actually building models, we increase job satisfaction of data scientists, and thus likely improve retention.)
    - Ensure different teams use the same definition of how exactly to *operationalize* important business concepts. This:
      - keeps different teams and organizational units from drifting apart and getting increasingly siloed;
      - makes it easier to compare the performance of different models. (If we calculate features in different ways, we introduce yet another variable that could explain different results.)
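
The time-travel requirement discussed above can be illustrated with a minimal, standard-library-only sketch. It assumes a hypothetical append-only history of values for a single feature; `FeatureHistory`, `record`, and `as_of` are illustrative names, not part of any library. The point is that a training example labeled at a given time may only see values that were already known at that time.

```python
from bisect import bisect_right
from datetime import datetime


class FeatureHistory:
    """Hypothetical append-only store of (timestamp, value) pairs for one feature."""

    def __init__(self):
        self._timestamps = []  # kept sorted, oldest first
        self._values = []

    def record(self, timestamp, value):
        # Insert while keeping timestamps sorted, so lookups can use binary search.
        position = bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(position, timestamp)
        self._values.insert(position, value)

    def as_of(self, timestamp):
        """Return the latest value recorded at or before `timestamp`, or None."""
        position = bisect_right(self._timestamps, timestamp)
        if position == 0:
            return None  # nothing was known yet at that point in time
        return self._values[position - 1]


# Illustrative data: a customer's account balance changes over time.
balance = FeatureHistory()
balance.record(datetime(2023, 1, 1), 100.0)
balance.record(datetime(2023, 3, 1), 250.0)

# A training example labeled on 2023-02-15 must not see the March value.
feature_at_label_time = balance.as_of(datetime(2023, 2, 15))
```

For whole tables, pandas' `merge_asof` offers comparable "latest value at or before" join semantics. Data platforms with time-travel support solve the same problem at the storage layer.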

- Safety to experiment: 
  - The easy/default way should be the safe way. 


### 2) Team processes:
- Coding standards:
  - Python:
    - auto-format code  (but don't spend much time worrying about details)
      - The reason I consider this a must-have - even though it may not seem highest priority - is that it is a *quick win,* delivering outsized benefits relative to the small amount of work required. Often, it will be possible to simply borrow the setup that ML engineers already have. If not, remember that any autoformatter is better than none, so just pick one and get started! 
      - At the latest, auto-formatting should be run by the pipeline on merge.
      - It may also make sense to have the IDE auto-format files on save, because this instant feedback is a great way to coach data scientists over time how to write better code in the first place. However, it is crucial that IDE and pipeline apply the *same* rules (e.g., both read their settings from the same configuration file in the repository, such as `pyproject.toml` for Black).
      - Rationale: Ensure consistency; *enforce* best practices, such as optimizing for readability and avoiding surprising or error-prone constructs. A lot has already been written about this topic; see for example [the excellent discussion in "Software Engineering at Google"](https://abseil.io/resources/swe-book/html/ch08.html).
      - Data-science specific modifications:
        - Since data scientists tend to be less interested in learning the details of a programming language, we need to find a process that does not require them to spend much time mastering the rules. Therefore, the *easiest solution is usually to simply set up automatic code formatting* (using autopep8, Black, yapf, etc). This reduces the effort to a one-time setup, which can be handled by the engineering team.
        - Generally, we want to align with the engineering team on using the same formatting standards (i.e., use the same formatting and linting config files). If handoff from data science to engineering is based on branching (i.e., commits flow from a "data-science" branch to an "ml-engineering" branch), we could technically apply a different set of formatting rules at the point of that merge. However, unless there is a clear reason for different formatting needs, it is better to avoid this, both to simplify the process and to increase the consistency of the code base.
  - Notebook/Data-science-specific:
    - Clean up notebook before merging: It is completely natural that notebooks get messy from the explorative data science workflow. This is not a problem in itself - but the key is to eventually distill the essential insights into a more easily digestible form. The right time to do so is when merging to mainline, as this is when the work will be shared with other people, who need to be able to understand it with the least possible cognitive overhead. 
      - Extract functions into a separate python module (and import them into the notebook) if:
        - any of the analysis is promising for reuse; or
        - analysis will be handed off to others to refine or productionize. (In this case, the rationale is primarily readability).
        - Rationale:
          - Follow the DRY (Don't Repeat Yourself) principle.
          - Certain operations work better on Python files than on notebook cells (e.g., unit tests, real-time static analysis).
          - Makes the analysis easier to understand for others, because it separates levels of abstraction (which is one of the "clean code" principles): The notebook contains the high-level analysis, and if you want to delve deeper into the details of what a given function is doing, you can go to the function definition in the module.
          - In addition, separating the interactive notebook from the core functions offers a great way to use our screen real estate efficiently by opening notebook and Python module side-by-side. This is in my experience the best way to solve the problem that notebooks tend to get very long due to the output, thus requiring a lot of scrolling and making it easy to get lost. This way, the Python module containing the important details of the code does *not* suffer from the same problem.
      - Remove duplication, e.g. resulting from copy-pasting cells and running similar transformations/training with different parameters. If we want to store the results from different experiments, this should be handled by an experiment management solution such as MLflow.
      - Decide what we want to keep around, and what can be deleted. (If in doubt, it is probably fine to delete. While it is always tempting to keep code around "just in case", this increases the cognitive load on the future reader and detracts from the main insights.)
    - Ensure notebook output cells reflect a run from top to bottom. 
      - This is not easy if the notebook includes expensive calculations, which make it impractical to simply restart the kernel and re-run the whole notebook before committing.
      - There are some tools that are supposed to help with this, but I have not looked into them.
      - [The best solution may be to simply limit the use of notebooks beyond the initial experimentation phase](https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/68282.html).
    - Make the *intent* clear, so code makes sense to others
      - Basic refactoring:
        - Improve variable naming
        - Remove any hacks
      - Add comments - in particular about information that can not be gathered from the code itself, such as the "why?".
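
As a rough illustration of the "output reflects a run from top to bottom" rule above, the following sketch inspects the execution counters stored in a notebook file. It relies on the fact that `.ipynb` files are JSON documents whose code cells carry an `execution_count`. The function name `executed_top_to_bottom` is hypothetical, and the check is only a heuristic: it cannot detect a cell that was edited after it was last run.

```python
import json


def executed_top_to_bottom(notebook_json):
    """Return True if non-empty code cells carry execution counts 1, 2, 3, ... in order."""
    notebook = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("source")
    ]
    return counts == list(range(1, len(counts) + 1))


# Illustrative notebooks, reduced to the fields the check needs.
clean_notebook = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Analysis"]},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 1, "outputs": []},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 2, "outputs": []},
]})
stale_notebook = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 5, "outputs": []},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 3, "outputs": []},
]})

print(executed_top_to_bottom(clean_notebook))
print(executed_top_to_bottom(stale_notebook))
```

A check like this could run in a pre-commit hook or in the merge pipeline, so that out-of-order notebooks are flagged before they reach mainline.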


- DevOps:
  - Code versioning:
    - Use GitHub, etc. for sharing code – don't share notebooks through email/chat, shared drive, etc.
    - However, Continuous Integration (i.e., merging code to mainline at least once a day) is NOT generally a good fit for the explorative data science workflow. This is because notebooks require a good amount of cleanup before it makes sense to merge (see "team processes/coding standards" above).
  - Environment and dependency management: Environment should be *easily reproducible* for others. See more detailed notes [here](../coding/python/readme.md).
    - Package manager: Use pip over conda, if at all possible
    - For short-lived notebooks, it is usually sufficient to track *either* abstract or concrete dependencies.
    - However, if dependency conflicts become common, it is time to track both (e.g., using pipenv).
    - It should be clear which (minor) version of Python to use to re-create an environment. Unfortunately, there is no way to encode the required Python version in a requirements.txt file. So the main choices are:
      - If we want to stick with pip-installing a requirements.txt, the best we can do is probably to define all environment-related commands in a Makefile that explicitly hard-codes the minor version of Python.
      - Alternatively, this challenge alone can be a good reason to already learn how to use more sophisticated tools such as pipenv instead (which also bring additional benefits).
    - If run in a managed notebook environment, it should be clear which instance size and kernel is required to run the notebook.
    - Make sure to use reasonably up-to-date package versions. E.g., don't reuse a pre-existing environment for a new project out of laziness; periodically update dependencies, etc.
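
Because a requirements.txt file cannot state the required Python version, one lightweight complement to the options above is a guard that fails early when the interpreter does not match. The sketch below is only an illustration: the pinned version `(3, 11)` and the names `REQUIRED_PYTHON`, `python_version_matches`, and `require_python` are assumptions, not an established convention.

```python
import sys

# Assumption for illustration: the project targets Python 3.11.
REQUIRED_PYTHON = (3, 11)


def python_version_matches(required=REQUIRED_PYTHON, actual=None):
    """Return True if the (major, minor) version equals `required`."""
    if actual is None:
        actual = sys.version_info[:2]  # version of the running interpreter
    return tuple(actual) == tuple(required)


def require_python(required=REQUIRED_PYTHON, actual=None):
    """Raise a descriptive error if the interpreter version does not match."""
    if actual is None:
        actual = sys.version_info[:2]
    if not python_version_matches(required, actual):
        raise RuntimeError(
            f"Expected Python {required[0]}.{required[1]}, "
            f"but running {actual[0]}.{actual[1]}."
        )


# Report (rather than fail) here, so the sketch runs on any interpreter.
print("Python version matches:", python_version_matches())
```

Calling `require_python()` in the first cell of a notebook, or from the environment setup script, turns a silent version mismatch into an immediate and readable error.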

- *Common strategy* for how to *version* ML-specific artifacts:
  - models/experiments
    - Purpose: Avoid being overwhelmed by the great number of experiments
    - Goals:
      - Reproducibility -> Needs to log all relevant parameters (including data versions)
      - Low overhead -> Ideally, all experiments are automatically logged, without requiring custom code.
  - data
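
The two goals above (log everything relevant, with low overhead) can be sketched with a small decorator. This is a toy stand-in for an experiment tracker, not a replacement for one: `EXPERIMENT_LOG`, `data_fingerprint`, `tracked`, and `train_mean_model` are hypothetical names, and the in-memory list would be a persistent tracking backend in practice.

```python
import functools
import hashlib
import json
import time

EXPERIMENT_LOG = []  # stand-in for a tracking backend; a real setup would persist this


def data_fingerprint(rows):
    """Hash the training data so each run records which data version it used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def tracked(train_fn):
    """Log parameters, data version, and metrics of every call to `train_fn`."""

    @functools.wraps(train_fn)
    def wrapper(rows, **params):
        metrics = train_fn(rows, **params)
        EXPERIMENT_LOG.append({
            "experiment": train_fn.__name__,
            "timestamp": time.time(),
            "params": params,
            "data_version": data_fingerprint(rows),
            "metrics": metrics,
        })
        return metrics

    return wrapper


@tracked
def train_mean_model(rows, shrinkage=0.0):
    """Toy 'model': predict a shrunken mean and report its mean absolute error."""
    targets = [row["y"] for row in rows]
    prediction = (1.0 - shrinkage) * sum(targets) / len(targets)
    mae = sum(abs(target - prediction) for target in targets) / len(targets)
    return {"mae": mae}


rows = [{"y": 1.0}, {"y": 2.0}, {"y": 3.0}]
train_mean_model(rows, shrinkage=0.0)
train_mean_model(rows, shrinkage=0.5)
```

After these two calls, `EXPERIMENT_LOG` holds one entry per run, each with its parameters, a fingerprint of the data, and the resulting metric. The data scientist only adds the `@tracked` line, which is the kind of low overhead the goal asks for.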


- Align with ML engineers on choices that have a downstream impact (deployability, maintainability, etc), to avoid the risk of going down an inferior path.
    - Double-check that environment creation process is sufficiently reproducible for production deployment.
    - For important packages, align on which versions should be used.
      - e.g., don't use any version of Python that is at or near its end of life. Usually, a "middle-aged" minor version of Python is the best choice, because it often takes surprisingly long until the newest version is supported everywhere.
      - e.g., for Pandas, we may want to always use 2.x in order to get additional functionality and performance benefits (Arrow, copy-on-write, missing-value handling, etc).
    - Before creating new features or training a new model, align on how the different options of doing so affect productionalizing it.
      - This is especially important if re-training would be prohibitively expensive.
      - At the very latest, this should be done before producing anything (code, data, models) that may be used in production. For example, it's ok to run experiments on a small subset of the data, where the main purpose is to get the code to work, but where any parameter estimates will be discarded.
      - However, the smartest point to have this discussion is usually earlier than that, namely *before investing a considerable amount of time* into trying out something new.
      - E.g., if SageMaker Pipelines is used for deployment, it is easiest and safest to already use a SageMaker Processor for preprocessing and feature engineering, and to use [an estimator from SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) for model training.


### 3) Individual
  - Follow an explorative and interactive workflow
  - ...but *clean up* code before handing off to others (whether to fellow data scientists or ML engineers) - see above.
  - Learn to *use* the essential tooling discussed above

## Should-haves

- Clean code
  - Coding standards: Potentially customize the configuration of formatting and linting to the team's specific needs. Remember that the rationale for considering automatic code formatting a must-have is that it is a quick win to set it up with default parameters. That is why it is important not to spend too much time tweaking the parameters until we have reached this should-have stage. At this point, we may also consider switching to a more configurable formatter (e.g., from Black to yapf) or adding a configurable linter such as Pylint.
- Invest time to find good data-science tools for the job
- Use engineering tools determined to be worth the investment
  - engineering team should help with tool evaluation, recommendation, and setup
  - Examples: Use type hints, depending on maturity
  - leverage the power of a proper IDE (rather than notebook in browser)

## Would-like-to-haves

- Leverage design patterns to achieve loose coupling between components
- Trusted test suite (automated unit and integration/acceptance tests); automated data validation, static analysis
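
To give a flavor of what automated data validation can look like at its simplest, here is a minimal sketch using only the standard library. The function `validate_rows`, the column names, and the allowed range are illustrative assumptions; dedicated validation libraries cover far more cases.

```python
def validate_rows(rows, required_columns, value_ranges):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    for index, row in enumerate(rows):
        # Check that every required column has a value.
        for column in required_columns:
            if row.get(column) is None:
                problems.append(f"row {index}: missing value for '{column}'")
        # Check that present values fall inside their allowed range.
        for column, (low, high) in value_ranges.items():
            value = row.get(column)
            if value is not None and not (low <= value <= high):
                problems.append(
                    f"row {index}: '{column}'={value} outside [{low}, {high}]"
                )
    return problems


# Illustrative data: the second row has an impossible age and no income.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": -1, "income": None},
]
problems = validate_rows(
    rows,
    required_columns=["age", "income"],
    value_ranges={"age": (0, 120)},
)
for problem in problems:
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets a pipeline report all issues in a batch at once and decide separately whether to block training.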

## Does-not-need-to-haves
