# Overview (Example: Google Search Engine)

Latency and scale
- Return results in 100 milliseconds or 500 milliseconds?
- How many requests to handle per second?

Define metrics
- Offline metrics 
    - Binary classification: AUC, log loss, precision, recall, F1
    - search engine ranking: NDCG
- Online metrics
    - component-wise: NDCG
    - end-to-end: user engagement and retention rate
    
Architecture for scale
- Funnel approach where each stage has less data to process.

Training data
- If a user clicks on a search engine result, count it as a positive example.

Feature engineering
- Investigate the problem.

Model training
- In funnel approach, simpler models at the top and complex models at the bottom.

Offline evaluation
- Evaluate models on the validation set.

Online evaluation
- Test result will determine whether to deploy the model or not.

Iterations
- If the model performs well offline but not online, we need to debug.

# Performance and capacity

Performance
- Ensure we return results back within given time.

Capacity
- Load that the system can handle (e.g., number of queries per second).

Training time:
- How much data and capacity do we need?
  
Evaluation time:
- What is SLA to meet while serving the model?
    
Parameters
- $n$: number of training examples.
- $f$: number of features.
- $n_{l_{i}}$: number of neurons in $i$th layer
- $e$: number of epochs.    
- $n_{trees}$: number of trees.
- $d$: max depth of tree.
    
Complexities    
- Linear and logistic regression (batch)
    - Train: $O(nfe)$
    - Evaluation: $O(f)$
- Neural network
    - Train: exponential (varies between models)
    - Evaluation: $O(fn_{l_{1}} + n_{l_{1}}n_{l_{2}} + \dots)$
- Multiple additive regression trees (MART)
    - Train: $O(ndfn_{trees})$
    - Evaluation: $O(dfn_{trees})$
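
To make the evaluation-cost expressions above concrete, here is a minimal sketch that counts multiply-add operations per prediction. The function names and the example sizes are hypothetical. The neural-network cost is taken as the sum of products of consecutive layer widths, and the MART figure is only the upper bound from the list, not a measured cost.

```python
def linear_eval_cost(f):
    """Linear/logistic regression: one multiply-add per feature, O(f)."""
    return f


def nn_eval_cost(f, layer_sizes):
    """Fully connected network: f*n_l1 + n_l1*n_l2 + ... multiply-adds."""
    widths = [f] + list(layer_sizes)
    return sum(a * b for a, b in zip(widths, widths[1:]))


def mart_eval_cost(f, d, n_trees):
    """Upper bound from the notes: O(d * f * n_trees)."""
    return d * f * n_trees


# Hypothetical sizes: 100 features, hidden layers of 64 and 32 units, 1 output.
print(linear_eval_cost(100))           # 100
print(nn_eval_cost(100, [64, 32, 1]))  # 8480
print(mart_eval_cost(100, 6, 50))      # 30000
```

Comparing these counts against the latency budget is one rough way to decide which model fits at which stage of the funnel.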
    
Where
- Training complexity:
    - Time taken to train.
- Evaluation complexity:
    - Time taken to evaluate inputs at testing time.
- Sample complexity: 
    - Total number of samples to learn target function.
    
Funnel approach

# Training data 

Make sure to capture all kinds of patterns in each split.
- Training data: fit model parameters.
- Validation data: hyperparameter tuning.
- Test data: predict on data the model has not seen before.

Data filtering
- Cleaning up data
    - Handle missing data, outliers, duplicates.
    - Drop irrelevant features.
- Removing bias
- Bootstrapping new items

# Online experimentation

A/B testing
- Original version is control and new version is variation.
- Determine if variation is significantly better than control.
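
One common way to check "significantly better" is a two-proportion z-test on a rate metric such as click-through rate. The sketch below assumes that setup; the function name, the one-sided test, and the sample counts are illustrative assumptions, not something the notes prescribe.

```python
import math


def two_proportion_z_test(success_control, n_control, success_variation, n_variation):
    """One-sided z-test: is the variation's rate higher than the control's?"""
    p_control = success_control / n_control
    p_variation = success_variation / n_variation
    # Pooled rate under the null hypothesis that both versions are equal.
    p_pool = (success_control + success_variation) / (n_control + n_variation)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variation))
    z = (p_variation - p_control) / std_err
    # P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10% vs 11% click-through on 10,000 users each.
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

The significance level (5% here) should be fixed before the experiment starts.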

# Embeddings

- Encode entities (words, images, etc.) into vector space.

Text embeddings
- Word2vec
    - Trains a shallow NN (a single hidden layer) on a large corpus of text data.
    - Uses neighboring words to predict the current word, and generates embeddings during this process.
        - Ex. Continuous bag of words (CBOW)
    - Uses current word to predict surrounding words.
        - Ex. Skipgram
    - Has a fixed vector for every term. (does not consider the context)
- Embedding from Language Models (ELMo)
    - Uses bi-directional LSTM to capture words before and after current word.
- Bidirectional Encoder Representations from Transformers (BERT)
    - Uses attention to see all words in the context, and utilizes only the ones that help the prediction.
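
The difference between the two Word2vec setups is easiest to see in the training examples they produce. The sketch below only generates those examples (it does not train a network); the function names and the window size are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    """CBOW: neighboring words are the input, the current word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each neighbor is a label."""
    return [(target, neighbor)
            for target, context in context_windows(tokens, window)
            for neighbor in context]


tokens = "the quick brown fox".split()
print(cbow_examples(tokens, window=1)[1])   # (['the', 'brown'], 'quick')
print(skipgram_examples(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```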
    
Image embeddings
- Auto-encoders
    - Consists of encoder and decoder.
    - Compress raw image pixel data into a small dimension, then decompress it to regenerate the same input image. The last layer of the encoder determines the dimension of the embedding.
    - Tries to minimize the difference between original and generated pixels.

# Transfer learning

Fine tuning
- Change/tune the existing parameters in a pre-trained network. 
- How many layers can we freeze (keep the weights fixed), and how many layers do we want to fine-tune?
- E.g., for image classification, once we understand the convolution, pooling, and fully connected layers, we can decide how many final layers we want to fine-tune. 

# Model debugging and testing

- Launch the first version quickly and iterate to improve it using real traffic.

Debugging
- Feature distribution change
    - Real traffic data can change due to seasonality.
- Feature logging issue
    - Feature was computed differently during training and evaluation time.
- Overfitting or underfitting
- Missing important feature

# Search ranking

- Assume
    - Billions of documents to search from.
    - 10K queries per second.
    
## Metrics

- Online metrics
    - Click-through rate = # of clicks / # of impressions or views
    - Successful session rate = # of successful sessions / # of total sessions
        - A successful session is one where the user spends 10 seconds or longer viewing the page.
    - Time to success (low number of queries per session)
- Offline metrics

# 1. Search ranking

### Scope
- A general search engine like Google.

### Scale
- How many websites to search from? Billions of documents.
- How many requests per second? 10k queries per second.

### Personalization
- Assume user is logged in and historical search data of user is available.

## Metrics

### Online metrics
- Click through rate
    - (number of clicks / number of impressions or views) 
    - Unsuccessful clicks would also be part of this metric.
- Session success rate
    - Dwell time: time user spent viewing a page.
    - (number of successful sessions (dwell time > 10s) / number of total sessions)
- Time to success
    - A low number of queries means the system was good at guessing what the user wanted.

### Offline metrics
- Ground truth: actual outputs desired by the system. In this case, it is the rating provided by humans.
- Assume the search engine returns documents $D_{1}, D_{2}, D_{3}, D_{4}$ in the order of relevance.
- Assume a human rates the documents on a scale of $0$ to $3$ ($3$ is highly relevant, $0$ is irrelevant) such that
    - $D_{1} = 3$, $D_{2} = 2$, $D_{3} = 3$, $D_{4} = 0$ 
- Cumulative gain simply adds.
    - $3+2+3+0 = 8$
- Discounted cumulative gain (DCG) penalizes if highly relevant document appears lower in the result.
    - $\dfrac{3}{\log_{2}(1+1)}+\dfrac{2}{\log_{2}(2+1)}+\dfrac{3}{\log_{2}(3+1)}+\dfrac{0}{\log_{2}(4+1)} = 3+1.262+1.5+0 = 5.762$
- Normalized discounted cumulative gain (NDCG) is computed by (DCG / IDCG) where IDCG is DCG of ideal ordering.
    - NDCG does not penalize irrelevant search result.
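
The worked example above can be reproduced with a few lines of code. This is a minimal sketch using the same ratings for $D_{1}$ to $D_{4}$; the function names are my own, and IDCG is computed by sorting the same ratings in descending order.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rating at position p is divided by log2(p + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


ratings = [3, 2, 3, 0]            # human ratings for D1..D4 in returned order
print(sum(ratings))               # 8 (cumulative gain)
print(round(dcg(ratings), 3))     # 5.762
print(round(ndcg(ratings), 3))    # 0.978
```

The ideal ordering here is $3, 3, 2, 0$, so the only loss comes from $D_{2}$ being ranked above $D_{3}$.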
    
## Architecture

<img src="img/search_engine1.png" style="width:800px;height:400px;">

### Layered model approach 

<img src="img/search_engine2.png" style="width:1000px;height:200px;">

### Query rewriting
- Queries are often poorly worded.
- Increases recall. (return larger set of relevant results)

Spell checker
- Corrects spelling mistakes.

Query expansion
- Ex. expand "restaurant" to "food" or "recipe" to look for all candidates.

### Query understanding
- Intent behind query
    - Ex. "gas station" has local intent.
    - Ex. "earthquake" has newsy intent.
    
### Document selection
- Select set of documents that are relevant to query.
- Focused on recall. 

<img src="img/search_engine3.png" style="width:400px;height:200px;">

Inverted index
- Map words to documents.
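
A minimal sketch of an inverted index, assuming whitespace tokenization and a small in-memory corpus. The document ids, texts, and function names are hypothetical; a real index would also handle stemming, stop words, and sharding.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "italian restaurant in toronto",
    2: "best pizza recipe",
    3: "toronto pizza restaurant",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "toronto restaurant"))  # {1, 3}
```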

<img src="img/search_engine4.png" style="width:600px;height:400px;">

Selection criteria
- Go to the index and retrieve all documents based on these criteria.

<img src="img/search_engine5.png" style="width:600px;height:300px;">

Scoring scheme

<img src="img/search_engine6.png" style="width:800px;height:400px;">

Personalization uses the searcher's profile, such as age, gender, interests, and location.

### Ranker
- Find best order of documents.
- Stage 1 
    - Find subset of document that should be passed to stage 2.
    - Use a simpler algorithm like logistic regression to do binary classification.
- Stage 2
    - Use a more complex algorithm like LambdaMART or LambdaRank to order the documents.

### Blender
- Provides various results like posts, images, news, videos.
- Avoid displaying results from a single or few sources.
- Outputs final result page to users.

### Filter 

- Filter inappropriate result despite good user engagement.

## Training data generation
- Takes online user data and generates positive and negative examples.

Binary classification (pointwise approach)
- Document is either relevant or irrelevant.
    - If the user spent some time on the document, mark it relevant.
    - If the user immediately went back after clicking the document, mark it irrelevant.

<img src="img/search_engine8.png" style="width:500px;height:300px;">

Document ordering (pairwise approach)
- The goal is to minimize inversion. (number of wrong orders compared to ground truth) 
- Rank the document based on user activity on each document and use that as training data.
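
To illustrate what an inversion is, the sketch below counts document pairs that the model orders the wrong way round relative to ground-truth relevance. The function name and the example relevance scores are assumptions; pairwise rankers optimize a smooth surrogate of this count rather than the count itself.

```python
from itertools import combinations


def count_inversions(predicted_order, true_relevance):
    """Count pairs where a less relevant document is ranked above a more relevant one."""
    inversions = 0
    # combinations() keeps list order, so `higher` is always ranked above `lower`.
    for higher, lower in combinations(predicted_order, 2):
        if true_relevance[higher] < true_relevance[lower]:
            inversions += 1
    return inversions


true_relevance = {"D1": 3, "D2": 2, "D3": 3, "D4": 0}
print(count_inversions(["D1", "D2", "D3", "D4"], true_relevance))  # 1
print(count_inversions(["D4", "D2", "D1", "D3"], true_relevance))  # 5
```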

## Feature engineering

<img src="img/search_engine7.png" style="width:800px;height:400px;">

Searcher (Assume the user is logged in)
- Age
- Gender
- Interest

Query
- History
    - For example, query "earthquake" historically was related to recent news.
- Intent
    - For example, query "Pizza places" has "local" intent, thus should give higher rank to pizza places located nearby the searcher.
    
Document
- Page rank
    - For example, the number of quality documents that link to it.
- Radius
    - For example, a coffee shop in Toronto is relevant to people within a 10km radius, but the Eiffel Tower has global scope.
    
Context
- Time of day
    - For example, query "restaurant" should consider restaurant open at the time of query.
- Recent query
    - Take a look at previous queries. For example, "python" -> "python list"  
    
Searcher-document
- Distance
    - For queries regarding locations, consider distance between searcher and matching location.
- History
    - For example, if the searcher looked for video documents in the past, then video documents would be more relevant to the searcher.
    
Query-document
- Text match
    - Matches in the title, metadata, content of document

# 2. Twitter feed

## Scope

- Reverse chronological order fails to surface the most engaging tweets due to the sheer number of tweets.

<img src="img/twitter_feed1.png" style="width:500px;height:300px;">

## Scale

Assume
- 500M daily active users.
- 1 user is connected to 100 users.
- User fetches the feed 10 times a day.

<img src="img/twitter_feed2.png" style="width:500px;height:300px;">

## Metrics

Positive user actions
- Time spent viewing Tweets.
- Liking Tweets.
- Re-Tweeting.
- Commenting on Tweets.

Negative user actions
- Hiding Tweets.
- Reporting Tweets as inappropriate.

<img src="img/twitter_feed3.png" style="width:500px;height:300px;">

Weighted user actions
- Not all actions are equal value.
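
A minimal sketch of a weighted engagement score. The weight values below are made up for illustration (they are not taken from the figure), and the function name is hypothetical; negative actions simply get negative weights.

```python
# Hypothetical weights: stronger signals get larger magnitudes.
ACTION_WEIGHTS = {
    "like": 1.0,
    "comment": 2.0,
    "retweet": 3.0,
    "hide": -2.0,
    "report": -5.0,
}


def engagement_score(action_counts, weights=ACTION_WEIGHTS):
    """Weighted sum of user actions; negative actions reduce the score."""
    return sum(weights[action] * count for action, count in action_counts.items())


counts = {"like": 10, "comment": 2, "report": 1}
print(engagement_score(counts))  # 9.0
```

Dividing the score by the number of active users would make it comparable across days with different traffic.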

<img src="img/twitter_feed4.png" style="width:500px;height:300px;">

## Architecture

<img src="img/twitter_feed5.png" style="width:500px;height:300px;">

### Tweet selection

<img src="img/twitter_feed6.png" style="width:500px;height:500px;">

Consider
- Tweets generated between the user's last logout and current login.
- Tweets the user has already viewed that were not popular then but are popular now. 

<img src="img/twitter_feed7.png" style="width:500px;height:500px;">

User comes back after a while
- Need to fetch certain numbers of Tweets from a pool.

<img src="img/twitter_feed8.png" style="width:500px;height:500px;">

Tweets outside the user network
- Aligns with user interests.
- Locally/globally trending.
- Tweet is relevant to user's network.

## Feature engineering

<img src="img/twitter_feed9.png" style="width:500px;height:300px;">

User-author historical relations
- author_liked_posts_3months: percentage of author Tweets user liked in the last 3 months.
- author_liked_posts_count_1year: number of author Tweets the user liked in the past year.

User-author similarity
- common_followees: number of users and hashtags followed by both.
- topic_similarity: similarity between hashtags in the posts that both interacted with.
- tweet_content_embedding_similarity: generate embedding (bag-of-words) for every user and take dot product between them.
- social_embedding_similarity: every user is represented by bag-of-ids (rather than bag-of-words)

Author influence
- is_verified: if author is verified.
- author_social_rank: similar to Google page rank.
- author_num_followers: number of followers that the author has.
- follower_to_following_ratio

Author Tweets historical trend
- author_engagement_rate_3months: (Tweets-interactions) / (Tweets-views)
- author_topic_engagement_rate_3months: compute similar feature above but per topic.

User-tweet
- topic_similarity: similarity between hashtags and contents that user tweeted in the past and the tweet itself.

Tweet content
- Tweet_length: concise Tweet has higher chance of getting likes.
- Tweet_recency:
- is_image_video: Tweets with image or video are more catchy.
- is_URL:

Tweet interaction
- num_total_interactions: need to use time decay model to give proper attention to trending Tweets.
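
One simple time decay model is exponential decay with a half-life, sketched below. The half-life value and the function name are assumptions; the notes do not say which decay function is used.

```python
def decayed_interactions(interaction_ages_hours, half_life_hours=6.0):
    """Sum of interactions where each one loses half its weight every half-life."""
    return sum(0.5 ** (age / half_life_hours) for age in interaction_ages_hours)


# Three interactions: just now, 6 hours ago, and 12 hours ago.
print(decayed_interactions([0, 6, 12]))  # 1.75
```

With this weighting, a Tweet with many recent interactions scores higher than an older Tweet with the same raw interaction count.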

<img src="img/twitter_feed10.png" style="width:500px;height:500px;">

## Training data generation

<img src="img/twitter_feed11.png" style="width:500px;height:500px;">

- Randomly downsample to match the number of positive and negative examples.
- Train on one time interval and validate on the next time interval.
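
The two steps above can be sketched as follows, assuming each example is a dict with hypothetical `label` (1 or 0) and `timestamp` fields. The larger class is randomly downsampled to the size of the smaller one, and the split is by a time cutoff rather than at random.

```python
import random


def downsample_to_balance(examples, seed=0):
    """Randomly drop examples from the larger class so both classes match in size."""
    positives = [e for e in examples if e["label"] == 1]
    negatives = [e for e in examples if e["label"] == 0]
    rng = random.Random(seed)
    target = min(len(positives), len(negatives))
    return rng.sample(positives, target) + rng.sample(negatives, target)


def split_by_time(examples, cutoff):
    """Train on examples before the cutoff, validate on examples at or after it."""
    train = [e for e in examples if e["timestamp"] < cutoff]
    valid = [e for e in examples if e["timestamp"] >= cutoff]
    return train, valid


examples = ([{"label": 1, "timestamp": t} for t in range(3)]
            + [{"label": 0, "timestamp": t} for t in range(10)])
balanced = downsample_to_balance(examples)
print(len(balanced))  # 6
```

Splitting by time avoids validating on interactions that happened before the training data, which would leak future information into training.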

<img src="img/twitter_feed12.png" style="width:500px;height:300px;">

## Ranking

- Given Tweets, predict probabilities of likes, comments, and re-Tweets.

Logistic regression
- Must create feature in training data manually. (Tree and NN are able to learn features)

<img src="img/twitter_feed13.png" style="width:700px;height:500px;">

Deep learning
- Hyperparameters
    - Learning rate.
    - Number of hidden layers.
    - Batch size.
    - Number of epochs.
    - Dropout rate.
- Multi task NN where total_loss = like_loss + comment_loss + retweet_loss
- Better than training a separate network for each task because shared layers make training faster.
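
A minimal sketch of the combined loss for a single Tweet, assuming each task head outputs a probability and each task uses binary cross-entropy. The equal task weighting follows the formula above; the function names and the dict layout are my own.

```python
import math

TASKS = ("like", "comment", "retweet")


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for one label; the prediction is clipped to avoid log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def multi_task_loss(labels, predictions):
    """total_loss = like_loss + comment_loss + retweet_loss."""
    return sum(binary_cross_entropy(labels[t], predictions[t]) for t in TASKS)


labels = {"like": 1, "comment": 0, "retweet": 0}
predictions = {"like": 0.8, "comment": 0.1, "retweet": 0.3}
print(round(multi_task_loss(labels, predictions), 4))
```

In a real network the gradient of this sum flows back through the shared layers, so every task contributes to the shared representation.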

<img src="img/twitter_feed14.png" style="width:500px;height:300px;">

<img src="img/twitter_feed15.png" style="width:900px;height:900px;">

Stacking models
- Ex. use Tree and NN to generate features to use in linear regression. 

<img src="img/twitter_feed16.png" style="width:1000px;height:900px;">

## Diversity

- Introduce penalty for same authors and similar content.
    - For example, add negative score for repeated author and contents.
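
One possible way to apply such a penalty is a greedy re-rank: repeatedly pick the Tweet with the best score after subtracting a penalty for each Tweet already shown from the same author. The penalty value, field names, and function name below are assumptions, and content similarity is left out for brevity.

```python
def rerank_with_author_penalty(tweets, penalty=0.2):
    """Greedy re-ranking that subtracts `penalty` per earlier Tweet by the same author."""
    remaining = list(tweets)
    shown_per_author = {}
    ranked = []
    while remaining:
        best = max(
            remaining,
            key=lambda t: t["score"] - penalty * shown_per_author.get(t["author"], 0),
        )
        ranked.append(best)
        remaining.remove(best)
        shown_per_author[best["author"]] = shown_per_author.get(best["author"], 0) + 1
    return ranked


tweets = [
    {"id": "a1", "author": "alice", "score": 0.9},
    {"id": "a2", "author": "alice", "score": 0.8},
    {"id": "b1", "author": "bob", "score": 0.7},
]
print([t["id"] for t in rerank_with_author_penalty(tweets)])  # ['a1', 'b1', 'a2']
```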

# 3. Recommendation system

## Scope

- Given a user and context (time, location, etc.), predict the probability of engagement for each movie, and order the movies.
- Will use implicit feedback (user watched the movie or not) rather than explicit feedback (user rated the movie) to gather large training data.

## Metrics

Online
- Engagement rate: (number of sessions in which the user clicked a movie / total number of sessions)
- Videos watched: count of videos the user watched for at least some minimum time.
- Session watch time: overall time the user spent watching recommended movies in a session.

Offline
- Mean Average Precision (mAP @ N)
    - $AP@N = \dfrac{1}{m}\displaystyle\sum_{k=1}^{N}P(k)rel(k)$
    - $P(k)$ = precision up to $k$
    - Precision = number of relevant recommendations / total number of recommendations
    - rel(k) = whether $k^{th}$ item is relevant or not
    - N = length of recommendation list
    - m = number of movies relevant to user based on historical data
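
A minimal sketch of AP@N and mAP@N under the definitions above, normalizing by $m$ (the number of movies relevant to the user). The function names and the example lists are hypothetical; some libraries normalize by $\min(m, N)$ instead.

```python
def average_precision_at_n(recommended, relevant, n):
    """AP@N = (1/m) * sum over k of P(k) * rel(k), with m = number of relevant movies."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, movie in enumerate(recommended[:n], start=1):
        if movie in relevant:       # rel(k) = 1
            hits += 1
            total += hits / k       # P(k): precision over the first k items
    return total / len(relevant)


def mean_average_precision_at_n(recommendations, relevant_by_user, n):
    """mAP@N: average of AP@N over all users."""
    scores = [average_precision_at_n(recommendations[u], relevant_by_user[u], n)
              for u in recommendations]
    return sum(scores) / len(scores) if scores else 0.0


recommended = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "F"}
print(round(average_precision_at_n(recommended, relevant, n=5), 4))  # 0.5556
```

Here the hits are at positions 1 and 3, so the sum is $1/1 + 2/3$ and dividing by $m = 3$ gives $5/9$.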

- Mean Average Recall (mAR @ N)
    - Recall = number of relevant recommendations / number of all relevant movies
    
- F1 score = 2 * (mAP*mAR) / (mAP+mAR)

## Architecture

<img src="img/recommendation_system1.png" style="width:1000px;height:700px;">

## Feature engineering

<img src="img/recommendation_system2.png" style="width:1000px;height:200px;">

User
- age
- gender
- language
- country
- average_session_time
- last_genre_watched
- user_actor_histogram: histogram showing historical interaction between users and actors in movies.
- user_genre_histogram
- user_language_histogram

Context
- season_of_the_year
- upcoming_holiday
- days_to_upcoming_holiday
- time_of_day
- day_of_week
- device

Media
- public-platform-rating
- revenue
- time_passed_since_release_date
- time_on_platform
- media_watch_history
- genre
- movie_duration
- content_set_time_period
- content_tags
- show_season_number
- country_of_origin
- release_country
- release_year
- release_type
- maturity_rating

## Candidate generation

- Select top $k$ movies to recommend to user.

Collaborative filtering
- Find users similar to the active user based on historical watches.

<img src="img/recommendation_system3.png" style="width:300px;height:300px;">

1. Nearest neighborhood

<img src="img/recommendation_system4.png" style="width:500px;height:300px;">

- Consider $n$ by $m$ matrix of user $u_{i}$ and movie $m_{j}$
- 1: user watched the movie.
- 0: user ignored the movie.
- empty: no impression yet.

<img src="img/recommendation_system5.png" style="width:500px;height:300px;">

- The task is to predict the feedback for movies that users haven't watched.
- Compute (for example) cosine similarity between user $i$ and other users, and select top $k$ similar users. (nearest neighbors)
- Then, take weighted average of feedback from top $k$ similar users for movie $j$.
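
The nearest-neighbor steps above can be sketched in plain Python. Each user is a dict from movie to feedback (1 watched, 0 ignored), and a missing key means no impression yet. The user and movie names, the function names, and $k = 2$ are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feedback vectors (dicts)."""
    dot = sum(a[m] * b[m] for m in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_feedback(feedback, user, movie, k=2):
    """Similarity-weighted average of the top-k neighbors' feedback for `movie`."""
    neighbors = [
        (cosine_similarity(feedback[user], ratings), ratings[movie])
        for other, ratings in feedback.items()
        if other != user and movie in ratings
    ]
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in top_k)
    if total_similarity == 0:
        return None  # no similar user has seen this movie
    return sum(sim * value for sim, value in top_k) / total_similarity


feedback = {
    "u1": {"m1": 1, "m2": 1},             # no impression of m3 yet
    "u2": {"m1": 1, "m2": 1, "m3": 1},
    "u3": {"m1": 0, "m2": 1, "m3": 0},
}
print(round(predict_feedback(feedback, "u1", "m3"), 3))  # 0.536
```

In this toy data `u2` is slightly more similar to `u1` than `u3` is, so the prediction leans toward `u2`'s feedback of 1.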

2. Matrix factorization

- Use a latent dimension $M$ such that
    - User profile matrix is $n$ by $M$.
    - Media profile matrix is $M$ by $m$. 
- The $M$ latent dimensions can be considered as features of users or movies.
- Initialize user and movie vectors randomly. 
- For each known feedback value $f_{ij}$, predict feedback by taking the dot product between user profile vector $u_{i}$ and movie profile vector $m_{j}$. 
- The difference between the actual and predicted feedback is the error.
    - $e_{ij} = f_{ij} - u_{i} \cdot m_{j}$
- Use stochastic gradient descent to update user and movie latent vectors.
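
A minimal pure-Python sketch of the procedure above, assuming a small dense set of known feedback values and no regularization. The hyperparameters (latent dimension, learning rate, epoch count) and the function names are illustrative choices, not values from the notes.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def factorize(feedback, n_users, n_movies, latent_dim=2, lr=0.05, epochs=2000, seed=0):
    """Learn user and movie latent vectors from known feedback values with SGD."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(n_users)]
    movies = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(n_movies)]
    for _ in range(epochs):
        for (i, j), f_ij in feedback.items():
            error = f_ij - dot(users[i], movies[j])   # e_ij = f_ij - u_i . m_j
            for d in range(latent_dim):
                u_d, m_d = users[i][d], movies[j][d]
                # Gradient step on the squared error for this single entry.
                users[i][d] += lr * error * m_d
                movies[j][d] += lr * error * u_d
    return users, movies


# Known feedback: (user index, movie index) -> 1 (watched) or 0 (ignored).
feedback = {(0, 0): 1, (0, 1): 0, (1, 0): 1, (1, 1): 0, (2, 0): 0, (2, 1): 1}
users, movies = factorize(feedback, n_users=3, n_movies=2)
for (i, j), f_ij in feedback.items():
    print((i, j), f_ij, round(dot(users[i], movies[j]), 2))
```

Once trained, the dot product for an unobserved (user, movie) pair serves as the predicted feedback. Production systems usually add regularization to avoid overfitting sparse data.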

<img src="img/recommendation_system6.png" style="width:500px;height:300px;">

Content-based filtering
- Make recommendations based on content of media that user had already interacted with.

## Training data generation

- User watched 80% or more of the movie? positive example
- User watched 10% or less of the movie? negative example
- Between 10% and 80%? uncertain
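
The labeling rule above maps directly to a small function. This sketch assumes the watch fraction is already computed as watched time divided by movie duration; the function name is hypothetical, and uncertain examples return `None` so they can be dropped from training.

```python
def label_from_watch_fraction(watch_fraction):
    """Return 1 (positive), 0 (negative), or None (uncertain, exclude from training)."""
    if watch_fraction >= 0.8:
        return 1
    if watch_fraction <= 0.1:
        return 0
    return None


fractions = [0.95, 0.05, 0.5]
print([label_from_watch_fraction(f) for f in fractions])  # [1, 0, None]
```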

## Ranking

- Probability of user watching a media.

# 4. Self-driving car

## Metric

Intersection over union (IoU)
- overlapping area / area of union
- meanIoU is computed by taking the average of IoU over all classes. (building, road, sky, etc.)
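
A minimal sketch of per-class IoU and mean IoU for semantic segmentation, assuming the prediction and the ground truth are flattened into equal-length lists of per-pixel class labels. The function names and the six-pixel example are made up for illustration.

```python
def iou_for_class(predicted, truth, cls):
    """IoU = pixels labeled `cls` in both / pixels labeled `cls` in either."""
    intersection = sum(1 for p, t in zip(predicted, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(predicted, truth) if p == cls or t == cls)
    return intersection / union if union else None  # class absent from both


def mean_iou(predicted, truth, classes):
    """Average IoU over classes that appear in the prediction or the ground truth."""
    scores = [iou_for_class(predicted, truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


truth = ["road", "road", "sky", "sky", "building", "building"]
predicted = ["road", "sky", "sky", "sky", "building", "road"]
print(round(mean_iou(predicted, truth, ["road", "sky", "building"]), 3))  # 0.5
```

The per-class values here are $1/3$ for road, $2/3$ for sky, and $1/2$ for building, which average to $0.5$.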

## Architecture

### Model

## Training data generation

## 