# *ML Engineering* best practices
## Must have

- Clean code
  - Rationale:
    - Code is read much more often than it is written -> it should be optimized for readability.
    - Reduces bugs.
    - Increases agility, because it makes code easier to change.
    - Overall, reduces maintenance cost (which is the majority of the cost of a typical software project)
- Leverage design patterns to achieve loose coupling between components
  - Rationale:
    - Greatly reduces complexity, thereby ensuring code stays maintainable (reduces cost and risk, while increasing speed of feature implementation)

- Trusted, automated test suite
  - Components:
    - unit tests
    - integration/acceptance tests
    - data validation
    - static analysis
  - Rationale:
    - Increases quality (reduces errors, outages, etc.)
    - Decreases costs, since the cost of a bug rises the later in the SDLC it is discovered (hence "shifting left": finding bugs as early as possible)
    - Indirect benefit: A trusted test suite lets engineers change code with confidence, so that they are not dominated by their own creation
- Type safety:
  - code: use type hints, and enforce them in the CI pipeline
  - data: use explicit data schemas
- Observability:
  - code
    - structured & centralized logging
    - monitoring and alerting
    - distributed tracing if using a microservice architecture
  - model performance
    - comparison between different models
    - comparison of the same model over time (model drift)
    - performance for different subsets of the population (bias); if the differences are substantial, do they vary by model?
  - data
    - data quality
- DevOps
  - Use infrastructure-as-code
  - CI/CD
    - Enforce quality gates in pipeline
      - in particular:
        - tests: run unit and acceptance tests, check test coverage threshold, static code analysis (especially type checking)
        - readability: linting or auto-formatting, code complexity checks
        - security scans: scan dependencies for vulnerabilities and license risks, static analysis
      - Note: Enforcement requires that the pipeline blocks deployment if any of these checks fail.
- Data lineage
- Invest time to find good tools for the job
  - Avoid reinventing the wheel -> Leverage managed services wherever possible (unless the pricing is “unfair”)
  - Leverage the power of an IDE (rather than a notebook in the browser)

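The "Type safety" item above can be illustrated with a minimal sketch that combines type hints with an explicit data schema. The `CustomerRecord` schema, its field names, and the helper functions are hypothetical and only serve as an example; a real project might use a dedicated schema library instead. The type hints only pay off if a type checker is actually run as a CI quality gate.

```python
from dataclasses import dataclass
from typing import Mapping, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class CustomerRecord:
    """Hypothetical schema for one input row (field names are illustrative)."""

    customer_id: int
    age: int
    monthly_spend: float


def _require(raw: Mapping[str, object], name: str, expected: type[T]) -> T:
    """Return raw[name] if it exists and has the expected type, else raise."""
    if name not in raw:
        raise ValueError(f"missing field: {name}")
    value = raw[name]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, expected):
        raise TypeError(
            f"{name}: expected {expected.__name__}, got {type(value).__name__}"
        )
    return value


def parse_record(raw: Mapping[str, object]) -> CustomerRecord:
    """Validate an untyped row against the schema and fail loudly on mismatch."""
    record = CustomerRecord(
        customer_id=_require(raw, "customer_id", int),
        age=_require(raw, "age", int),
        # Deliberately strict: an int such as 10 is not accepted as a float.
        monthly_spend=_require(raw, "monthly_spend", float),
    )
    if record.age < 0:
        raise ValueError(f"age must be non-negative, got {record.age}")
    return record


if __name__ == "__main__":
    print(parse_record({"customer_id": 1, "age": 30, "monthly_spend": 12.5}))
```

Validating at the boundary means that everything downstream can rely on the declared types, and malformed data is rejected where it enters the system rather than deep inside a training or inference job.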

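For the "model performance" part of the observability item above, the following is a rough sketch of how performance could be compared across subsets of the population. Accuracy is used only as a stand-in metric, and the function names and the group labels are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Sequence


def accuracy_by_group(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str]
) -> dict[str, float]:
    """Compute accuracy separately for each subgroup label."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have equal length")
    if not groups:
        raise ValueError("inputs must not be empty")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group: dict[str, float]) -> float:
    """Largest difference in accuracy between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    per_group = accuracy_by_group(
        y_true=[1, 0, 1, 1, 0, 1],
        y_pred=[1, 0, 1, 1, 1, 0],
        groups=["a", "a", "a", "b", "b", "b"],
    )
    print(per_group, max_accuracy_gap(per_group))
```

Running the same computation for each candidate model, or for the same model on successive time windows, gives the other two comparisons listed above; an alert could be raised when the gap exceeds a threshold the team agrees on.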
## Should have
- DevOps:
  - Infra-as-Code:
    - Also manage most of the *data science* infrastructure using IaC. 
      - The reason I say "most" is that we get the biggest bang for our buck by focusing on the constant/long-lived infrastructure components; if a data scientist wants to try out some new resources, it's fine to create them using the console. That way, we avoid introducing dependencies/blockers.
  - CI/CD:
    - Manual modifications to prod should only be possible in emergencies
      - Engineers should have read-only access to prod
      - "break-glass account" (with sign-off process) for emergencies
- Profile code and optimize performance bottlenecks
  - (Why we do not consider this a must-have: Engineers' time is very expensive, and so is delaying features or accumulating technical debt, so unfortunately performance optimization sometimes has to be sacrificed for these even more important goals.)
- Periodically reevaluate if there are better tools for the job
  - e.g., Pandas alternatives (such as Polars)
  - e.g., Spark vs Presto
  - e.g., end-to-end platforms (such as SageMaker) versus best-of-breed MLOps tools

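As a small illustration of the "Profile code and optimize performance bottlenecks" item above, here is a sketch that uses the standard library's `cProfile` and `pstats` modules to find where time is spent before optimizing anything. The `slow_feature_sum` workload and the `profile_call` helper are hypothetical stand-ins for real pipeline code.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def slow_feature_sum(n: int) -> int:
    """Hypothetical workload standing in for real pipeline code."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func: Callable[..., Any], *args: Any, top: int = 5) -> str:
    """Run func(*args) under cProfile and return a report of the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(slow_feature_sum, 100_000))
```

Measuring first keeps the limited optimization budget focused on the few functions that dominate the runtime, which fits the cost argument made above.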


## Would like to have

## (Does not need to have:)

- Best practices for data science:
