# Overview (Example: Google Search Engine)

Latency and scale
- Return results in 100 milliseconds or 500 milliseconds?
- How many requests to handle per second?

Define metrics
- Offline metrics 
    - Binary classification: AUC, log loss, precision, recall, F1
    - Search engine ranking: NDCG
- Online metrics
    - component-wise: NDCG
    - end-to-end: user engagement and retention rate
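
As a rough illustration of the ranking metric named above, NDCG can be sketched as follows. This assumes graded relevance labels listed in ranked order and the common `rel / log2(position + 1)` discount; the function names are illustrative, not from any particular library.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in ranked order."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])   # already in ideal order
reversed_ = ndcg([1, 2, 3])  # most relevant document ranked last
```

A ranking that is already in ideal order scores 1.0; any other ordering of the same labels scores lower.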
    
Architecture for scale
- Funnel approach where each stage has less data to process.

Training data
- If a user clicks on a search engine result, count it as a positive example.

Feature engineering
- Investigate the problem.

Model training
- In funnel approach, simpler models at the top and complex models at the bottom.
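
A minimal sketch of the funnel idea, assuming each stage is a `(scorer, keep)` pair and documents are hypothetical tuples carrying a cheap and an expensive score. In a real system the scorers would be models of increasing cost.

```python
def funnel_rank(docs, stages):
    """Apply (scorer, keep) stages in order; each stage keeps its top `keep` docs."""
    candidates = list(docs)
    for scorer, keep in stages:
        candidates = sorted(candidates, key=scorer, reverse=True)[:keep]
    return candidates

# Hypothetical documents: (doc_id, cheap_score, expensive_score)
docs = [("a", 0.9, 0.2), ("b", 0.8, 0.9), ("c", 0.1, 0.99), ("d", 0.7, 0.5)]
stages = [
    (lambda d: d[1], 3),  # stage 1: cheap model scores every document
    (lambda d: d[2], 2),  # stage 2: costlier model scores only the survivors
]
top = funnel_rank(docs, stages)
```

Document `c` has the best expensive score but is dropped by the cheap first stage, which shows the trade-off the funnel makes between cost and recall.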

Offline evaluation
- Evaluate models on the validation set.

Online evaluation
- Test result will determine whether to deploy the model or not.

Iterations
- If the model performs well offline but not online, debug to find the cause.

# Performance and capacity

Performance
- Ensure results are returned within the given time budget.

Capacity
- Load that the system can handle (e.g., number of queries per second).

Training time:
- How much data and capacity do we need?
  
Evaluation time:
- What SLA must we meet while serving the model?
    
Parameters
- $n$: number of training examples.
- $f$: number of features.
- $n_{l_{i}}$: number of neurons in the $i$th layer.
- $e$: number of epochs.    
- $n_{trees}$: number of trees.
- $d$: max depth of tree.
    
Complexities    
- Linear and logistic regression (batch)
    - Train: $O(nfe)$
    - Evaluation: $O(f)$
- Neural network
    - Train: varies between models; with backpropagation, roughly $O(ne)$ times the per-example evaluation cost
    - Evaluation: $O(fn_{l_{1}} + n_{l_{1}}n_{l_{2}} + \dots)$
- Multiple additive regression trees (MART)
    - Train: $O(ndfn_{trees})$
    - Evaluation: $O(dfn_{trees})$
    
Where
- Training complexity:
    - Time taken to train.
- Evaluation complexity:
    - Time taken to evaluate inputs at testing time.
- Sample complexity: 
    - Total number of samples to learn target function.
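
To make the complexity expressions concrete, the sketch below turns them into rough operation counts. These are order-of-magnitude estimates, not timings. The neural network term is read here as the sum of products of consecutive layer widths, which is an interpretation of the formula above, and the function names are illustrative.

```python
def linear_train_ops(n, f, e):
    """O(nfe): every epoch touches every feature of every example."""
    return n * f * e

def nn_eval_ops(f, layer_sizes):
    """Sum of products of consecutive layer widths, starting from f inputs."""
    sizes = [f] + list(layer_sizes)
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

def mart_train_ops(n, d, f, n_trees):
    """O(n d f n_trees) for training the tree ensemble."""
    return n * d * f * n_trees

def mart_eval_ops(d, f, n_trees):
    """O(d f n_trees) for scoring one example."""
    return d * f * n_trees

# Hypothetical sizes: 10 features, hidden layers of 4 and 2 neurons.
small_nn = nn_eval_ops(10, [4, 2])  # 10*4 + 4*2
```

Comparing such counts against a latency budget gives a first guess at which model can serve which stage.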
    
Funnel approach

# Training data 

Make sure to capture all kinds of patterns in each split.
- Training data: fit model parameters.
- Validation data: hyperparameter tuning.
- Test data: predict on data the model has not seen before.
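
A minimal sketch of a three-way split, assuming a random shuffle is acceptable and an 80/10/10 ratio (both are assumptions; time-ordered data such as search logs may call for a split by date instead).

```python
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

train, val, test = split_dataset(range(100))
```

The fixed seed keeps the split reproducible, so the test set stays unseen across experiments.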

Data filtering
- Cleaning up data
    - Handle missing data, outliers, duplicates.
    - Drop irrelevant features.
- Removing bias
- Bootstrapping new items
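
The cleaning steps above (duplicates, missing data, outliers) might look like this on a toy table. The row layout `(key, value)`, the median fill, and the clipping bounds are all assumptions made for illustration.

```python
import statistics

def clean_rows(rows, lower, upper):
    """Deduplicate rows, fill missing values with the median, clip to [lower, upper]."""
    unique = list(dict.fromkeys(rows))  # drops exact duplicates, keeps first-seen order
    present = [value for _, value in unique if value is not None]
    median = statistics.median(present)
    cleaned = []
    for key, value in unique:
        if value is None:
            value = median  # simple imputation for missing data
        cleaned.append((key, min(max(value, lower), upper)))  # clip outliers
    return cleaned

# Hypothetical rows: (query_id, feature_value)
rows = [("q1", 2.0), ("q1", 2.0), ("q2", None), ("q3", 4.0), ("q4", 500.0)]
cleaned = clean_rows(rows, lower=0.0, upper=10.0)
```

Whether to clip, drop, or keep outliers depends on the feature; clipping is only one option.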

# Online experimentation

A/B testing
- Original version is control and new version is variation.
- Determine if variation is significantly better than control.
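
One common way to check whether the variation beats the control on a rate metric is a two-proportion z-test. The sketch below assumes click counts per arm and a one-sided alternative; the function name and the example counts are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, one-sided p-value) for the alternative rate_b > rate_a."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

# Control: 100 clicks in 1000 impressions; variation: 150 in 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

If `p_value` falls below the chosen significance level (0.05 is a frequent choice), the variation is treated as significantly better.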

# Embeddings

- Encode entities (words, images, etc.) into vector space.

Text embeddings
- Word2vec
    - Trains a shallow NN (a single hidden layer) on a large corpus of text data.
    - Uses neighboring words to predict the current word, and generates embeddings during this process.
        - Ex. Continuous bag of words (CBOW)
    - Uses current word to predict surrounding words.
        - Ex. Skipgram
    - Has a fixed vector for every term (does not consider the context).
- Embedding from Language Models (ELMo)
    - Uses bi-directional LSTM to capture words before and after current word.
- Bidirectional Encoder Representations from Transformers (BERT)
    - Uses attention to see all words in the context, and utilizes only the ones that help the prediction.
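
The difference between the two Word2vec objectives is easiest to see in the training examples they generate. The sketch below only builds the (input, target) pairs, assuming a symmetric window; it does not train the network, and the helper names are illustrative.

```python
def context_windows(tokens, window=1):
    """Yield (center, context_words) for each position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=1):
    """CBOW: neighboring words are the input, the current word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=1):
    """Skip-gram: the current word is the input, each neighbor is a target."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

tokens = ["the", "quick", "fox"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

CBOW produces one example per position, while skip-gram produces one per (word, neighbor) pair.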
    
Image embeddings
- Auto-encoders
    - Consists of encoder and decoder.
    - Compresses raw image pixel data into a lower-dimensional representation, then decompresses it to regenerate the same input image. The last layer of the encoder determines the dimension of the embedding.
    - Tries to minimize the difference between original and generated pixels.
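
A toy version of the encode, decode, and minimize-the-difference loop is sketched below. It is a deliberately tiny linear auto-encoder (2 inputs, 1-dimensional embedding) trained by plain gradient descent on made-up data; real image auto-encoders use nonlinear, usually convolutional, layers. All names and hyperparameters here are assumptions.

```python
def train_linear_autoencoder(data, steps=500, lr=0.01):
    """Fit a 2 -> 1 -> 2 linear auto-encoder by full-batch gradient descent."""
    w = [0.1, 0.1]  # encoder weights: 2-D input -> 1-D embedding
    v = [0.1, 0.1]  # decoder weights: 1-D embedding -> 2-D reconstruction
    losses = []
    for _ in range(steps):
        grad_w, grad_v, loss = [0.0, 0.0], [0.0, 0.0], 0.0
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]                 # encode
            err = [v[j] * z - x[j] for j in range(2)]     # reconstruction minus input
            loss += sum(e * e for e in err) / len(data)   # mean squared difference
            back = sum(err[j] * v[j] for j in range(2))   # error passed back to the encoder
            for j in range(2):
                grad_v[j] += 2 * err[j] * z / len(data)
                grad_w[j] += 2 * back * x[j] / len(data)
        losses.append(loss)
        w = [w[j] - lr * grad_w[j] for j in range(2)]
        v = [v[j] - lr * grad_v[j] for j in range(2)]
    return w, v, losses

# Hypothetical "images" with two pixels that always lie on one line.
data = [(t, 2 * t) for t in (-1.0, -0.5, 0.5, 1.0)]
w, v, losses = train_linear_autoencoder(data)
```

Because the data lies on a line, a single embedding dimension is enough, and the reconstruction loss in `losses` shrinks over the steps.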

# Transfer learning

Fine tuning
- Change/tune the existing parameters in a pre-trained network. 
- How many layers do we freeze (keep the weights fixed), and how many do we fine-tune?
- E.g., for image classification, once we understand the convolution, pooling, and fully connected layers, we can decide how many final layers to fine-tune.
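
The freeze-versus-fine-tune decision can be sketched without any framework. The `Layer` class and its `trainable` flag below are hypothetical stand-ins for whatever mechanism a given library uses to disable weight updates.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = True

def freeze_all_but_last(layers, n_trainable):
    """Freeze every layer except the final `n_trainable` ones."""
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return layers

# Hypothetical pre-trained image classifier, listed from input to output.
network = [Layer("conv1"), Layer("pool1"), Layer("conv2"), Layer("fc1"), Layer("fc2")]
freeze_all_but_last(network, n_trainable=2)
trainable_names = [layer.name for layer in network if layer.trainable]
```

Here only the two fully connected layers would be updated, while the earlier convolution and pooling layers keep their pre-trained weights.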

# Model debugging and testing

- Launch the first version quickly and iterate to improve it using real traffic.

Debugging
- Feature distribution change
    - Real traffic data can change due to seasonality.
- Feature logging issue
    - Feature was computed differently during training and evaluation time.
- Overfitting or underfitting
- Missing important feature
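
A simple first check for a feature distribution change is to compare the live mean of a feature with its training mean, measured in training standard deviations. This is only one of several possible drift checks; the function name, the sample values, and the threshold of 3 are assumptions.

```python
import statistics

def mean_shift(train_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10, 12, 11, 13, 9, 11]   # feature values seen during training
live_same = [11, 10, 12, 11]      # live traffic, same distribution
live_shifted = [20, 22, 19, 21]   # live traffic after a seasonal change

DRIFT_THRESHOLD = 3.0  # hypothetical alert level
drifted = mean_shift(train, live_shifted) > DRIFT_THRESHOLD
```

The same comparison between logged training features and logged serving features can also surface a feature logging issue.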

# Search ranking

- Assume
    - Billions of documents to search from.
    - 10K queries per second.
    
## Metrics

- Online metrics
    - Click-through rate = # of clicks / # of impressions or views
    - Successful session rate = # of successful sessions / # of total sessions
        - A successful session is one where the user spends 10 seconds or longer viewing the page.
    - Time to success (low number of queries per session)
- Offline metrics

# 1. Search ranking

## Scope
- A general search engine like Google.

## Scale
- How many websites to search from? Billions of documents.
- How many requests per second? 10k queries per second.

## Personalization
- Assume user is logged in and historical search data of user is available.

## Online metrics
- Click through rate: number of clicks / number of impressions (or views) 
    - Unsuccessful clicks would also be part of this metric.
- Dwell time: time user spent viewing a page.
- Session success rate: number of successful sessions (dwell time > 10s) / number of total sessions
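
The online metrics above can be computed from logged sessions roughly as follows. The sketch assumes each session is a list of dwell times in seconds, one per clicked result, and that a session counts as successful if any single dwell time exceeds the threshold; both are assumptions about the logging format.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0

def session_success_rate(sessions, min_dwell_seconds=10):
    """Fraction of sessions with at least one dwell time above the threshold."""
    if not sessions:
        return 0.0
    successful = sum(
        1 for dwell_times in sessions
        if any(t > min_dwell_seconds for t in dwell_times)
    )
    return successful / len(sessions)

# Hypothetical log: four sessions, dwell time in seconds for each clicked result.
sessions = [[3, 12], [2], [], [15]]
ctr = click_through_rate(30, 1000)
success = session_success_rate(sessions)
```

The session with a single 2-second dwell still counts toward click-through rate, which is why dwell time is tracked alongside it.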