# Integrating data science and ML engineering: Designing a collaboration and handoff process between data scientists and ML engineers

## Challenge: Different needs of data science vs ML engineering

- In both we want agility – but this is achieved in different ways:
  - In data science, we achieve agility through the explorative and iterative notebook workflow.
  - By contrast, a production software system stays agile through clean code, type safety, a move away from notebooks, and loose coupling between components (e.g., stable interfaces).
- -> **Corollary: the former does not automatically scale into the latter**
  - What leads to agility in model training leads to a lack of agility in model deployment and maintenance.
  - Different best practices / quality standards for each
  - Need to find a good process for collaboration and handoff
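
To make "loose coupling via stable interfaces" a bit more concrete, here is a minimal Python sketch. All names (`Model`, `MeanBaseline`, `serve`) are hypothetical and only illustrate the idea; a real handoff interface would likely carry more than a single `predict` method.

```python
from typing import List, Protocol, Sequence


class Model(Protocol):
    """Stable interface that the serving code depends on (hypothetical)."""

    def predict(self, features: Sequence[float]) -> float: ...


class MeanBaseline:
    """Stand-in for a model handed over by a data scientist."""

    def predict(self, features: Sequence[float]) -> float:
        return sum(features) / len(features)


def serve(model: Model, batch: Sequence[Sequence[float]]) -> List[float]:
    # The serving code only knows the interface, not the concrete model,
    # so a new model can be swapped in without touching this function.
    return [model.predict(row) for row in batch]
```

As long as a new model keeps the `predict` signature, the data scientist can iterate on it freely while the serving side stays unchanged.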

## Side note on terminology: What about ML scientists?

I don't like arguing about terminology, but unfortunately we have to get rid of a source of confusion first...

- “ML scientist” vs “ML engineer”
  - Confusingly, ML scientists are often called "ML engineers".
  - But “engineering” refers to an approach to building reliable, production-grade systems.
- “ML scientist” vs “data scientist”
  - Basically, an ML scientist is a data scientist working with models that require more advanced ML expertise.
  - Like data scientists, their emphasis is on getting an ML model to work, rather than on ongoing maintainability.
  - E.g., they may spend time on performance optimization of compute bottlenecks, but are less concerned with the readability and maintainability of their code, or how easy it is to run it on another machine.
  - -> For our purposes, we can subsume both under the same category. I will use the term “data scientist” to refer to both.

## Conflicting best practices in data science and ML

- Starting point: Need to acknowledge this dilemma:
  - unique *needs* of both sides (as we just discussed)
  - unique *talents* of both sides (division of labor)
- Next step: Codify best practices / quality standards separately for each side.
- ~~Keep in mind: How to structure incentives~~

### Data science vs engineering

- How data science *differs* from engineering:
  - explorative and iterative -> notebook workflow
  - Higher importance of domain expertise
    - Do these data make sense?
    - What way of computing this feature makes the most sense from a domain perspective?
- How data science *supplements* engineering:
  - Exploration of the data by someone with domain expertise can:
    - surface problems
    - create new ideas
  - But any changes resulting from these insights should be addressed:
    - using *production-grade* fixes rather than hacks (implemented by engineers, based on insights from data scientists)
    - at the *source* (rather than each data scientist reinventing the wheel by performing the same data cleaning downstream)
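
As a sketch of what "fixing at the source" could look like: a single shared cleaning function, owned by engineering, that encodes what data scientists found during exploration. The function name, the decimal-comma issue, and the `-999` sentinel are all made-up examples, not findings from a real dataset.

```python
from typing import Optional


def clean_temperature(raw: Optional[str]) -> Optional[float]:
    """Hypothetical shared cleaning step applied once, upstream.

    Encodes two (assumed) insights from data exploration:
    decimal commas in the raw feed, and -999 used as a "missing" marker.
    """
    if raw is None:
        return None
    text = raw.strip().replace(",", ".")
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        # Unparseable entries are treated as missing.
        return None
    # Assumed sentinel value meaning "no measurement".
    return None if value == -999.0 else value
```

With this in place, every downstream notebook receives the same cleaned values instead of each one carrying its own copy of the workaround.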

## Why not *all* engineering best practices apply to data science

- Applying a specific engineering best practice in data science can be:
  - Counterproductive: it conflicts with the specific needs of the data science process
  - Productive, but there may be hurdles to adoption:
    - adoption cost (can we lower it sufficiently?)
    - not well-known enough (educate!)
    - incentives encourage short-term focus (same problem as in software engineering)
  - Neutral to data scientist productivity (we may as well adopt it early on to make handoff easier)

### Why *some* engineering best practices may be *counterproductive* in data science

Some engineering practices are too constraining: their cost is not offset by a large enough benefit.

- Code is short-lived -> maintainability is less important
- Lesser need to foresee possible problems; instead, take a close look at actual data, and react to problems as we observe them.
- Interactive data science workflow provides different ways of addressing certain problems. E.g.:
  - Knowing the data schema beforehand is less important, because we can just take a look at the data and fix any problems we observe.
  - In a notebook, type hints are less important for readability because we can interactively inspect what a variable looks like if we're not sure.
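
For contrast, here is roughly what the production side has to spell out because nobody is looking at the data interactively. This is a minimal sketch; the `Order` schema and `parse_order` are hypothetical. In a notebook, displaying the first few rows usually serves the same purpose at a fraction of the cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical record schema, made explicit for production code."""

    order_id: int
    amount: float


def parse_order(row: dict) -> Order:
    # Fail loudly on unexpected input (missing key, non-numeric value)
    # instead of relying on someone noticing it while inspecting the data.
    return Order(order_id=int(row["order_id"]), amount=float(row["amount"]))
```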

### Why *some* engineering best practices *are relevant* to data science

- Even though data science code is more short-lived, and long-term maintainability is thus less important, changeability is still important due to the iterative nature of the explorative workflow.


### How engineering *complements* data science

- There are plenty of cases where engineering principles can benefit data scientists, but adoption faces hurdles:
  - There is a substantial adoption cost, so the adoption threshold seems too high.
    - If you can bring the adoption cost down, it may make sense to include some engineering best practices as data science best practices.
  - There is a knowledge gap: the practices are not well known enough among data scientists.