# ML Crash Course

## 05 October 2023

## What's this whole ML thing?

![image.png](./assets/ml-tweet.png)

## The spectrum of interpretability

![spectrum of interpretability](./assets/interpretability.jpg)

Prior figure taken from https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wics.1617

## Some terminology...

### Labels

> A **label** is the thing we're predicting. This could be the type of animal in a picture, the price of a stock tomorrow, the transcribed content of an audio recording, etc.

### Features

> A **feature** is an input variable: one or more pieces of data that are fed into a model in order to generate a predicted label.

### Model Training & Inference

> Model training refers to the model "learning" to predict labels from features. Model inference refers to the trained model **creating new predicted labels** based on a set of features.
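
To make the split between training and inference concrete, here is a minimal sketch using a one-feature linear model fitted by ordinary least squares. The function names `train` and `predict` and the rooms/price data are made up for illustration; they are not taken from any particular library or dataset.

```python
def train(features, labels):
    """Training: learn w and b for y = w * x + b from labelled examples."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    var = sum((x - mean_x) ** 2 for x in features)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def predict(model, features):
    """Inference: create new predicted labels from features alone."""
    w, b = model
    return [w * x + b for x in features]


# Hypothetical labelled data: feature = number of rooms, label = price in $1000s
rooms = [1.0, 2.0, 3.0, 4.0]
prices = [100.0, 150.0, 200.0, 250.0]

model = train(rooms, prices)            # uses features AND labels
new_predictions = predict(model, [5.0]) # uses features only
print(new_predictions)
```

Note that `train` needs both features and labels, while `predict` only ever sees features.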

## Labelled vs unlabelled data

![labelled vs unlabelled data](./assets/labelled-data.png)

Data points excerpted from https://developers.google.com/machine-learning/crash-course/california-housing-data-description

## Feature Engineering

![engineered features](./assets/feature-engineering.png)

## Problem framing

**Problem framing** is the process of analyzing a problem to isolate the individual elements that need to be addressed to solve it. Problem framing helps determine your project's technical feasibility and provides a clear set of goals and success criteria.

At a high level, ML problem framing consists of two distinct steps:

  1. Determining whether ML is the right approach for solving a problem.
  2. Framing the problem in ML terms.


## Understanding the problem

  - State the goal for the product you are developing or refactoring.
  - Determine whether the goal is best solved using predictive ML, generative AI, or a non-ML solution.
  - Verify you have the data required to train a model if you're using a predictive ML approach.

## Clearly stating goals

Begin by stating your goal in non-ML terms. The goal is the answer to the question, "What am I trying to accomplish?"

![ml goals](./assets/ml-goals.png)

## Is it an ML use case?

![ml categories](./assets/ml-categories.png)

## Start simple, add complexity later

If you don't have a non-ML solution implemented, try solving the problem manually using a heuristic. Then consider:

  - **Quality.** How much better do you think an ML solution can be? If you think an ML solution might be only a small improvement, that might indicate the current solution is the best one.
  - **Cost and maintenance.** How expensive is the ML solution in both the short- and long-term? In some cases, it costs significantly more in terms of compute resources and time to implement ML.

## Data and predictive modelling

To make good predictions, you need data that contains features with predictive power. Your data should have the following characteristics:

  - **Abundant.** The more relevant and useful examples in your dataset, the better your model will be.
  - **Consistent and reliable.** Having data that's consistently and reliably collected will produce a better model.
  - **Trusted.** Understand where your data will come from. Will the data be from trusted sources you control?
  - **Available.** Make sure all inputs are available at prediction time in the correct format.
  - **Correct.** If more than a small percentage of labels are incorrect, the model will produce poor predictions.
  - **Representative.** The datasets should be as representative of the real world as possible.

## Predictive power

For a model to make good predictions, the features in your dataset should have predictive power. The more correlated a feature is with a label, the more likely it is to predict it.

Some features will have more predictive power than others. For example:

  - In a weather dataset, features such as `cloud_coverage`, `temperature`, and `dew_point` would be better predictors of rain than `moon_phase` or `day_of_week`.
  - For a video recommender system, you could hypothesize that features such as `video_description`, `length`, and `views` might be good predictors of which videos a user would want to watch.
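
One rough way to compare predictive power is to compute the Pearson correlation between each feature and the label. The sketch below does this by hand on made-up weather observations (the numbers are invented for illustration). Keep in mind that correlation only captures linear relationships, so a low value does not prove a feature is useless.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up daily observations; rain is 1 if it rained that day, else 0
cloud_coverage = [0.9, 0.2, 0.1, 0.8, 0.7, 0.3]
day_of_week = [0, 1, 2, 3, 4, 5]
rain = [1, 0, 0, 1, 1, 0]

r_cloud = pearson(cloud_coverage, rain)
r_day = pearson(day_of_week, rain)
print(f"cloud_coverage vs rain: {r_cloud:.2f}")
print(f"day_of_week vs rain:    {r_day:.2f}")
```

On this toy data `cloud_coverage` is strongly correlated with `rain`, while `day_of_week` is close to uncorrelated.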

## Predictions vs. actions

There's no value in predicting something if you can't turn the prediction into an action that helps users. That is, your product should take action from the model's output.

For example, a model that predicts whether a user will find a video useful should feed into an app that recommends useful videos. A model that predicts whether it will rain should feed into a weather app.

## Defining goals in ML terms

![ml-oriented goals](./assets/ml-goals2.png)

## Identify the output

![predictive outputs](./assets/predictive-outputs.png)

![generative outputs](./assets/generative-outputs.png)

## Proxy labels

**Proxy labels** substitute for labels that aren't in the dataset. Proxy labels are necessary when you can't directly measure what you want to predict.

In the video app, we can't directly measure whether or not a user will find a video useful. It would be great if the dataset had a `useful` feature, with users marking every video they found useful, but because it doesn't, we'll need a proxy label that substitutes for usefulness.

![proxy labels](./assets/proxy-labels.png)

## Evaluation vs success metrics

**Evaluation metrics** define what a model is optimizing for, like accuracy, precision, recall, or AUC. These are framed in terms of predictive correctness of the model.

**Success metrics** define what you care about, like engagement or helping users take appropriate action, such as watching videos that they'll find useful. Success metrics differ from the model's evaluation metrics. These are framed in terms of outcomes resulting from the model's predictions.
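
As a minimal sketch of the evaluation metrics named above, the following computes accuracy, precision, and recall for a binary classifier from scratch. The true and predicted labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)  # of everything predicted positive, how much was right


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)  # of everything actually positive, how much was found


# Made-up labels: 1 = "user found the video useful"
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

None of these numbers says anything about engagement or user outcomes, which is why success metrics have to be tracked separately.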

## When to stop optimizing

When analyzing the model's performance, consider the following question: Would improving the model get you closer to your defined success criteria?

  - A model might have great evaluation metrics, but not move you closer to your success criteria, indicating that even with a perfect model, you wouldn't meet the success criteria you defined.
  - A model might have poor evaluation metrics, but get you closer to your success criteria, indicating that improving the model would get you closer to success.

## Monitoring

Work doesn't stop after a model is trained. Once it's built, it needs to be tested and monitored:

  - **A/B testing.** If you are replacing an existing model, how can you run them both side-by-side to see if the new model outperforms the existing one?
  - **Training/serving skew.** If inference features fall outside of the distribution of data your model was trained on, it will produce poor predictions.
  - **Drift.** If the distribution of inference features shifts over time, your model may need to be retrained on an updated dataset.
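
A very simple monitoring check for skew or drift on a single numeric feature might look like the sketch below. It compares serving-time values against the training data in two ways: the fraction of values outside the training range, and how far the mean has moved in units of the training standard deviation. Both checks and the sample data are illustrative assumptions; production monitoring usually relies on more robust statistics.

```python
import statistics


def out_of_range_fraction(train_values, serving_values):
    """Fraction of serving values outside the [min, max] seen in training."""
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in serving_values if v < lo or v > hi)
    return outside / len(serving_values)


def mean_shift_in_std(train_values, serving_values):
    """Shift of the serving mean, measured in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.pstdev(train_values)
    return (statistics.mean(serving_values) - mu) / sigma


# Made-up values of one feature at training time and at serving time
train_feature = [10, 12, 11, 13, 14, 12, 11, 13]
serving_feature = [15, 16, 14, 17, 13, 18]

frac_outside = out_of_range_fraction(train_feature, serving_feature)
shift = mean_shift_in_std(train_feature, serving_feature)
print(f"outside training range: {frac_outside:.0%}, mean shift: {shift:.1f} std devs")
```

Running a check like this on a schedule, per feature, gives an early signal that the model may need retraining on an updated dataset.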

## Data prep & feature engineering

![ml flowchart](./assets/ml-flowchart.png)

## How much data is enough?

  - Your model should generally train on at least an order of magnitude more examples than trainable parameters.
  - Simple models on large data sets generally beat fancy models on small data sets.
  - This could mean a few hundred records for very simple models, or a few trillion records for something like Google Translate.
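
To get a feel for the first rule of thumb, the sketch below counts the trainable parameters of a small fully connected network and multiplies by ten. The layer sizes and the helper names `dense_param_count` and `min_examples` are hypothetical, and the factor of 10 is just one reading of "an order of magnitude".

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with these layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def min_examples(param_count, factor=10):
    """Rule-of-thumb lower bound on training examples."""
    return factor * param_count


# Hypothetical network: 8 input features, one hidden layer of 16 units, 1 output
params = dense_param_count([8, 16, 1])
print(params, min_examples(params))
```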

## Dataset quality

Certain aspects of quality tend to correspond to better-performing models:

  - reliability
  - feature representation
  - minimizing skew

**Reliability** refers to the degree to which you can trust your data. In measuring reliability, you must determine:

  - How common are label errors? For example, if your data is labeled by humans, sometimes humans make mistakes.
  - Are your features noisy? For example, GPS measurements fluctuate. Some noise is okay, and collecting more records can help smooth out the noise.
  - Is the data properly filtered for your problem? For example, should your data set include search queries from bots? If you're building a spam-detection system, then likely the answer is yes, but if you're trying to improve search results for humans, then no.

**Feature representation** refers to how raw data is mapped to inputs to a model. Some features may be mapped directly, others might be transformed or normalized.

You'll want to consider the following questions:

  - How is data shown to the model?
  - Should you normalize numeric values?
  - How should you handle outliers?


## Data transformation

We transform features primarily for the following reasons:

  1. Mandatory transformations for data compatibility. Examples include:
    - Converting non-numeric features into numeric.
    - Resizing inputs to a fixed size. Linear models and feed-forward neural networks have a fixed number of input nodes, so your input data must always have the same size.
  2. Optional quality transformations that may help the model perform better. Examples include:
    - Tokenization or lower-casing of text features.
    - Normalizing numeric features (most models perform better afterwards).
    - Allowing linear models to introduce non-linearities into the feature space.

## Numeric transformations

You may need to apply two kinds of transformations to numeric data:

  - **Normalizing** - transforming numeric data to the same scale as other numeric data.
  - **Bucketing** - transforming numeric (usually continuous) data to categorical data.

## Normalization

Four common normalization techniques may be useful:

  - scaling to a range
  - clipping
  - log scaling
  - z-score
  
![normalization](./assets/normalization.svg)

![normalization defined](./assets/normalization-maff.png)
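
The four techniques can be sketched in a few lines of plain Python. This assumes the usual textbook definitions (min-max scaling to [0, 1], clipping to fixed bounds, natural log, and z-score using the population standard deviation); the sample values and the clipping bounds are made up.

```python
import math
import statistics


def scale_to_range(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def log_scale(values):
    """Natural log; only valid for strictly positive values."""
    return [math.log(v) for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up feature with one extreme outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]

scaled = scale_to_range(values)
clipped = clip(values, 0.0, 10.0)
logged = log_scale(values)
standardized = z_score(values)
print(scaled, clipped, logged, standardized, sep="\n")
```

With an outlier like `100.0`, min-max scaling squeezes the other values close to 0, which is one reason clipping or log scaling is often applied first.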

## Bucketing

![bucketing](./assets/bucketing.svg)

<img src="./assets/bucketizing-needed.svg" width="80%" height="80%" />
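
A minimal sketch of bucketing with fixed boundaries, using the standard library's `bisect` module. The boundaries and sample values are hypothetical; each value is replaced by the index of the bucket it falls into.

```python
import bisect


def bucketize(value, boundaries):
    """Return the bucket index for value, given sorted bucket boundaries.

    Bucket 0 holds values below boundaries[0]; a value equal to a boundary
    goes into the bucket above it.
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical boundaries producing 4 buckets: <10, [10, 20), [20, 30), >=30
boundaries = [10.0, 20.0, 30.0]
buckets = [bucketize(v, boundaries) for v in [5.0, 10.0, 25.0, 99.0]]
print(buckets)
```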

## Quantile bucketing

We can change the bucket cutoffs to ensure that an equal number of observations fall into each bucket. This is called **quantile bucketing**.

<img src="./assets/quantile-bucketing.png" width="50%" height="50%" />
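
Quantile bucketing can be sketched by deriving the cutoffs from the data itself. The example below uses `statistics.quantiles` (Python 3.8+) to compute the boundaries and then counts how many observations land in each bucket. The skewed sample values are made up.

```python
import bisect
import statistics
from collections import Counter


def quantile_boundaries(values, n_buckets):
    """Cut points that split values into n_buckets roughly equal-sized groups."""
    return statistics.quantiles(values, n=n_buckets)


# Made-up, heavily skewed feature values
values = [1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 400]

boundaries = quantile_boundaries(values, 4)
bucket_ids = [bisect.bisect_right(boundaries, v) for v in values]
counts = Counter(bucket_ids)
print(boundaries)
print(sorted(counts.items()))
```

Equal-width buckets on the same data would put most observations in the first bucket; here every bucket ends up with the same number of observations.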

## Categorical transformations

![indexed features](./assets/indexed-features.png)

## Vocabulary

In a vocabulary, each value represents a unique feature.

<img src="./assets/feature-vectors.png" width="80%" height="80%" />
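
As a rough sketch, a vocabulary can be built by assigning each distinct categorical value an index, and a value can then be represented as a one-hot vector. The street names are hypothetical, and out-of-vocabulary values are not handled here (looking one up raises `KeyError`).

```python
def build_vocabulary(values):
    """Map each distinct value to an index, in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot(value, vocab):
    """Vector of zeros with a single 1 at the value's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab[value]] = 1
    return vector


# Hypothetical categorical feature
street_names = ["Main St", "Elm St", "Main St", "Oak Ave"]

vocab = build_vocabulary(street_names)
print(vocab)
print(one_hot("Elm St", vocab))
```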

## Embeddings

An embedding is a categorical feature represented as a continuous-valued feature.

<img src="./assets/embeddings.png" width="80%" height="80%" />
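
Mechanically, an embedding is a lookup table with one dense vector per vocabulary entry. The sketch below fills the table with small random numbers purely for illustration; in a real model these values would be learned during training. The vocabulary, the dimension of 4, and the helper names are assumptions.

```python
import random


def init_embeddings(vocab, dim, seed=0):
    """One dim-sized vector per vocabulary entry, randomly initialised."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for _ in range(len(vocab))
    ]


def embed(value, vocab, table):
    """Look up the continuous-valued vector for a categorical value."""
    return table[vocab[value]]


# Hypothetical vocabulary mapping categorical values to indices
vocab = {"Main St": 0, "Elm St": 1, "Oak Ave": 2}
table = init_embeddings(vocab, dim=4)

vector = embed("Elm St", vocab, table)
print(vector)
```

Unlike a one-hot vector, whose length grows with the vocabulary, the embedding length is a fixed choice that stays the same no matter how many categories there are.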