# **Chapter 6: The universal workflow of machine learning**
---
- Framing an ML problem.
- Developing a working model.
- Deploying the model in production and maintaining it.

Previously:
- We already had a labeled dataset, and we could immediately start training the model.
- IRL, this is not the case. We start with a problem, not with a dataset.

Universal workflow of ML:
1. *Define the task*:
  - Understand the problem.
  - What the customer asked for.
  - Collect a dataset.
  - Figure out what the data represents.
  - Choose how you will measure the success.
2. *Develop a model*
  - Select a model evaluation protocol.
  - Simple baseline to beat.
  - Train a first model that has some generalization power and can overfit.
  - Then regularize and tune, until you achieve the maximum generalization.
3. *Deploy the model*
  - Present work to stakeholders.
  - Ship the model to a web app, etc.
  - Monitor the model's performance, collect data you'll need to create the next-generation model.

# 6.1 Define the task

- Deep understanding of the context of the problem.
- Why is your customer trying to solve this problem?
- Value they will derive from this solution.
- How will the model be used? How will it fit into the customer's business process?
- Kind of data available, or that could be collected.
- Kind of ML task that can be mapped onto the problem.


## 6.1.1 Frame the problem

Framing involves many detailed meetings/discussions with the stakeholders. Important questions to be answered:
1. What will the input data be? What are we trying to predict? You can only learn to predict something if training data for it is available.
  - Data availability can be a limiting factor.
  - May need to resort to collecting your own data.
2. Type of ML task? Binary classification? Multiclass classification? Scalar regression? etc.
  - In some cases, ML may not even be the best way to make sense of the data.
3. What do existing solutions look like?
  - Understand what systems are already in place and how they work!
4. Particular constraints you will need to deal with:
  - The ins and outs of deploying the model.

Once done with research, you should know what your inputs will be, what your targets will be, and what the ML task is. You are hypothesizing:
1. That your targets can be predicted given your inputs.
2. That the data that is available (or that you're going to collect) is sufficiently informative to learn the relationship b/w inputs and targets.

Until we have a working model, these are merely hypotheses.

Just because we've assembled examples of X inputs and Y targets doesn't mean X contains enough information to predict Y.

## 6.1.2 Collect a dataset

- Once the task is understood and the inputs and targets are known, it's time to *collect data*: the most arduous, expensive, and time-consuming part.
- Model's ability to generalize comes almost entirely from the data it is trained on: # of data points, reliability of labels, quality of features.
- If you have extra time, the most effective way to spend it is usually collecting more data rather than chasing model improvements.
- A 2009 paper by Google researchers, "*The Unreasonable Effectiveness of Data*", famously made the case that data matters more than algorithms.
- If doing supervised learning, after collecting the data, gotta *annotate* them (tags for images, etc.).
- Sometimes annotations can be automatic, sometimes might need to do them ourselves. Labour-heavy process.

### Investing in data annotation infrastructure:

- Data annotation determines quality of targets → quality of model. Important questions:
1. Should you annotate the data *yourself*?
2. Use a *crowdsourcing* platform like Mechanical Turk to collect labels?
3. Use services of a specialized data-labeling company?

Outsourcing can save money and time, but you lose control. Might be inexpensive but your annotations may end up being noisy.

To choose the best option, consider:
1. Do data labelers need to be experts, or could anyone do it?
2. If they need to be experts, could you train people?
3. Do YOU understand the way experts come up with the annotations? If not, won't be able to perform manual feature engineering. Might be limiting.

If you do decide to label yourself, ask what software you would use. Might need to MAKE YOUR OWN software. This might save a lot of time, so it's worth investing sometimes.


### Beware of non-representative data

- Models can only make sense of inputs similar to what they were trained on.
- Important: training data should be *representative* of the production data.

Example:
- An app in which users upload pictures of their food and the model classifies the dish.
- You start getting complaints from users.
- Reason: the training data was well-lit, professional-quality photos of dishes, while real-life pictures taken by people are not like that!
- *Training data was not representative of the production data*! ML hell!

- If possible, collect data from the environment where your model will be used/deployed.
- If not possible, make sure you fully understand the differences b/w training and production data, and are actively correcting for those differences.

*Concept drift*: occurs when the properties of the production data change over time, causing model accuracy to gradually decay.
  - Particularly acute in adversarial contexts like credit card fraud detection (fraud patterns change every day lol).
  - It requires constant data collection, and model retraining.
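
One rough way to notice drift is to compare a feature's distribution in recent production data against the training-time sample. The sketch below is only an illustration under assumptions: the `mean_shift_score` helper, the sample values, and the `DRIFT_THRESHOLD` cutoff are all hypothetical, and real monitoring would likely use a proper statistical test on many features.

```python
from statistics import mean, stdev

def mean_shift_score(reference, recent):
    """Distance between the recent mean and the reference mean,
    measured in units of the reference standard deviation."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(recent) - mean(reference)) / ref_std

# Hypothetical transaction amounts (made-up numbers).
reference = [20.0, 25.0, 30.0, 35.0, 40.0]   # seen at training time
stable = [22.0, 27.0, 31.0, 33.0, 38.0]      # recent data, similar distribution
drifted = [60.0, 65.0, 70.0, 75.0, 80.0]     # recent data, shifted upward

DRIFT_THRESHOLD = 1.0  # assumed cutoff; would need tuning per feature

stable_score = mean_shift_score(reference, stable)
drifted_score = mean_shift_score(reference, drifted)

for name, score in [("stable", stable_score), ("drifted", drifted_score)]:
    flag = "retrain?" if score > DRIFT_THRESHOLD else "ok"
    print(f"{name}: shift = {score:.2f} reference std -> {flag}")
```

A score above the threshold is just a signal to look closer and possibly retrain; it says nothing about whether the targets themselves have drifted.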

Using ML trained on past data to predict the future assumes that the future will behave like the past.

*Sampling bias*: when your data collection process interacts with what you are trying to predict, resulting in biased measurements. The 1948 "Dewey Defeats Truman" newspaper headline is a famous real-life example.
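
The Dewey vs. Truman mistake came from a phone survey at a time when phone owners were not a representative sample of voters. The toy calculation below mimics that effect; every count in it is invented for illustration and is not historical data.

```python
# Hypothetical electorate as (owns_phone, votes_dewey) pairs.
# All counts are made up purely to illustrate the effect.
population = (
    [(True, True)] * 210 + [(True, False)] * 90        # phone owners: 70% Dewey
    + [(False, True)] * 280 + [(False, False)] * 420   # non-owners: 40% Dewey
)

def dewey_share(voters):
    """Fraction of the given voters who vote Dewey."""
    return sum(1 for _, votes_dewey in voters if votes_dewey) / len(voters)

# The poll only reaches people who own a phone.
phone_poll = [voter for voter in population if voter[0]]

true_share = dewey_share(population)
polled_share = dewey_share(phone_poll)

print(f"true: {true_share:.2f}, phone poll: {polled_share:.2f}")
# prints "true: 0.49, phone poll: 0.70"
```

The poll predicts a comfortable win even though the full population leans the other way, because who gets sampled is correlated with the thing being measured.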

## 6.1.3 Understand your data

- Bad practice to treat the data you're working on as a black box.
- Visualize and explore your data to get insight for feature engineering, screen for potential issues, etc.

- If data has images/natural language text, take a look at some of your samples directly.
- If data has numerical features, plot the histogram of feature values: get idea about range of values, frequency of different values, etc.
- If location information, plot on a map. Do clear patterns emerge?
- Missing values from some features in some samples? Deal with this too (explained in later section).
- Task is classification? Print # of instances of each class in data. Are they roughly equally represented? If not, account for the imbalance.
- *Target leakage* check: presence of features in your data that provide info about the targets and that may not be available in production. Ask: is every feature in your data something that will be available in the same form in production?
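
A few of the checks above (class balance, missing values, a first screen for target leakage) can be sketched in plain Python. The samples, the feature names, and the `production_features` set are all hypothetical; a real project would more likely do this with a dataframe library and plots.

```python
from collections import Counter

# Hypothetical samples; None marks a missing value.
samples = [
    {"age": 34,   "income": 52000, "refund_issued": 0, "label": "stay"},
    {"age": None, "income": 61000, "refund_issued": 0, "label": "stay"},
    {"age": 45,   "income": None,  "refund_issued": 0, "label": "stay"},
    {"age": 29,   "income": 48000, "refund_issued": 1, "label": "churn"},
    {"age": 51,   "income": None,  "refund_issued": 0, "label": "stay"},
]
feature_names = ["age", "income", "refund_issued"]

# Class balance: how many instances of each class?
class_counts = Counter(s["label"] for s in samples)
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Missing values per feature.
missing_counts = {
    name: sum(1 for s in samples if s[name] is None) for name in feature_names
}

# Leakage screen: features that production cannot supply at prediction time.
# The set below is an assumption made for this example.
production_features = {"age", "income"}
leak_suspects = [n for n in feature_names if n not in production_features]

print("class counts:", dict(class_counts))
print("imbalance ratio:", imbalance_ratio)
print("missing values:", missing_counts)
print("possible leakage:", leak_suspects)
```

Here `refund_issued` gets flagged because it would only be known after the outcome, so a model trained on it would look better offline than it could ever be in production.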

## 6.1.4 Choose a measure of success

- Again, to control something, gotta be able to observe it!
- First, define what you mean by success. Accuracy? Precision and recall? Customer retention rate?
- Metric guides technical choices you'd make in the project. Should directly align with your higher-level goals.
- For balanced classification (every class equally likely), accuracy and area under a *receiver operating characteristic* (ROC) curve (ROC AUC) are common metrics. For imbalanced, precision and recall might be better.
- Might need to define your own custom metric as well.
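
To see why accuracy can mislead on imbalanced data, here is a minimal sketch with made-up labels (5 positives out of 100). The helper names are hypothetical, and returning 0.0 when a ratio is undefined is just a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 when nothing is predicted positive

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 when there are no positives

# Imbalanced toy labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority class.
always_negative = [0] * 100

# Classifier B finds 4 of the 5 positives and raises 2 false alarms.
some_model = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

for name, y_pred in [("always negative", always_negative), ("some model", some_model)]:
    print(
        f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, "
        f"precision={precision(y_true, y_pred):.2f}, "
        f"recall={recall(y_true, y_pred):.2f}"
    )
```

The always-negative classifier scores 0.95 accuracy while catching no positives at all, which is exactly the case where precision and recall tell you more.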