# Search ranking

### Scope
- A general search engine like Google.

### Scale
- How many websites to search from? Billions of documents.
- How many requests per second? 10k queries per second.

### Personalization
- Assume the user is logged in and their historical search data is available.

## Metrics

### Online metrics
- Click through rate
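    - (number of clicks / number of impressions or views)
    - Unsuccessful clicks (where the user immediately returns to the results) also count toward this metric, so a high click-through rate alone does not show that users were satisfied.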
- Session success rate
    - Dwell time: the time a user spends viewing a page.
    - (number of successful sessions (dwell time > 10s) / number of total sessions)
- Time to success
    - A low number of queries means the system was good at guessing what the user wanted.
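
The first two online metrics can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Session` log record (its field names are not from any real system) and the 10-second dwell threshold mentioned above.

```python
from dataclasses import dataclass, field
from typing import List

SUCCESS_DWELL_SECONDS = 10.0  # threshold from the notes; tune for a real system


@dataclass
class Session:
    """One search session (hypothetical log record used only for illustration)."""
    impressions: int  # number of results shown to the user
    dwell_times: List[float] = field(default_factory=list)  # seconds spent on each clicked result


def click_through_rate(sessions):
    """Clicks divided by impressions; unsuccessful clicks are counted too."""
    clicks = sum(len(s.dwell_times) for s in sessions)
    impressions = sum(s.impressions for s in sessions)
    return clicks / impressions if impressions else 0.0


def session_success_rate(sessions):
    """Fraction of sessions with at least one click whose dwell time exceeds the threshold."""
    if not sessions:
        return 0.0
    successful = sum(
        1 for s in sessions
        if any(t > SUCCESS_DWELL_SECONDS for t in s.dwell_times)
    )
    return successful / len(sessions)


sessions = [
    Session(impressions=10, dwell_times=[2.0, 45.0]),  # one quick bounce, one long read
    Session(impressions=10, dwell_times=[3.0]),        # only a quick bounce
    Session(impressions=10),                           # no clicks at all
    Session(impressions=10, dwell_times=[12.0]),       # one successful click
]
ctr = click_through_rate(sessions)
success = session_success_rate(sessions)
```

Here the second session raises the click-through rate without being a success, which is why the two metrics are tracked separately.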

### Offline metrics
- Ground truth: the output we want the system to produce. In this case, it is the relevance rating provided by human raters.
- Assume the search engine returns documents $D_{1}, D_{2}, D_{3}, D_{4}$ in the order of relevance.
- Assume a human rates the documents on a scale of $0$ to $3$ ($3$ is highly relevant, $0$ is irrelevant) such that
    - $D_{1} = 3$, $D_{2} = 2$, $D_{3} = 3$, $D_{4} = 0$ 
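- Cumulative gain (CG) simply sums the ratings.
    - $3+2+3+0 = 8$
- Discounted cumulative gain (DCG) penalizes a highly relevant document that appears lower in the results by dividing each rating by the log of its position.
    - $\dfrac{3}{\log_{2}(1+1)}+\dfrac{2}{\log_{2}(2+1)}+\dfrac{3}{\log_{2}(3+1)}+\dfrac{0}{\log_{2}(4+1)} = 3+1.262+1.5+0 = 5.762$
- Normalized discounted cumulative gain (NDCG) is computed as (DCG / IDCG), where IDCG is the DCG of the ideal ordering.
    - NDCG does not penalize irrelevant search results: a document rated $0$ contributes nothing to DCG, so an irrelevant result at the end of the list (like $D_{4}$) leaves the score unchanged.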
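
The calculation above can be reproduced with a short sketch. The function names `dcg` and `ndcg` are chosen here for illustration; the ideal ordering is obtained by sorting the same ratings in descending order.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain: rating at position i is divided by log2(i + 1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of the ideal (descending) ordering of the same ratings."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


ratings = [3, 2, 3, 0]  # D1..D4 from the example above
print(round(dcg(ratings), 3))                        # 5.762
print(round(dcg(sorted(ratings, reverse=True)), 3))  # 5.893 (IDCG for ordering 3, 3, 2, 0)
print(round(ndcg(ratings), 3))                       # 0.978
```

Dropping the trailing zero-rated document, `ndcg([3, 2, 3])`, gives the same value, which illustrates the point that NDCG does not penalize an irrelevant result at the end of the list.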
    
## Architecture

<img src="img/search_engine1.png" style="width:800px;height:400px;">

### Layered model approach 

<img src="img/search_engine2.png" style="width:1000px;height:200px;">

### Query rewriting
- Queries are often poorly worded, so they are rewritten before retrieval.
- Increases recall (returns a larger set of relevant results).

Spell checker
- Corrects spelling mistakes.

Query expansion
- Ex. expand "restaurant" to "food" or "recipe" to look for all candidates.
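
A minimal sketch of query expansion, assuming a hand-written synonym table. The `SYNONYMS` dictionary is hypothetical and only encodes the example above; a production system would more likely learn expansions from query logs or embeddings.

```python
# Hypothetical expansion table; the single entry mirrors the example in the notes.
SYNONYMS = {
    "restaurant": ["food", "recipe"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus any expansions, keeping order and dropping duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


terms = expand_query("Italian restaurant")
# terms == ["italian", "restaurant", "food", "recipe"]
```

Retrieving documents that match any of the expanded terms widens the candidate set, which is the recall gain this stage is after.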

### Query understanding
- Intent behind query
    - Ex. "gas station" has local intent.
    - Ex. "earthquake" has newsy intent.
    
### Document selection
- Selects the set of documents that are relevant to the query.
- Focused on recall. 

<img src="img/search_engine3.png" style="width:400px;height:200px;">

#### Inverted index
- Maps each word to the documents that contain it.
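
As a rough sketch, an inverted index can be built as a dictionary from each term to the set of document IDs containing it. The whitespace tokenizer and the function names below are simplifying assumptions; real engines add stemming, stop-word handling, and compressed posting lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "italian restaurant downtown",
    2: "best italian recipes",
    3: "downtown parking",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "italian downtown")  # {1}
```

Looking up a term costs one dictionary access regardless of corpus size, which is what makes this structure suitable for the document selection stage.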

<img src="img/search_engine4.png" style="width:600px;height:400px;">

#### Selection criteria
- Go to the index and retrieve all documents that match these criteria.

<img src="img/search_engine5.png" style="width:600px;height:300px;">

#### Scoring scheme
- Personalization draws on the searcher's profile, such as age, gender, interests, and location.

<img src="img/search_engine6.png" style="width:800px;height:400px;">

### Ranker
- Find best order of documents.
- Stage 1 
    - Find the subset of documents that should be passed to stage 2.
    - Use a simpler algorithm like logistic regression to do binary classification (relevant or not).
    - The objective function takes a pointwise approach.
- Stage 2
    - Use a more complex algorithm to order the documents, such as LambdaMART (if optimizing offline NDCG, which is based on human-rated data) or LambdaRank (if using online training data).
    - The objective function takes a pairwise approach.
        - Get as many pairs of documents in the right order as possible.
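
To make the pairwise objective concrete, here is a small sketch of the RankNet-style pairwise logistic loss, the loss that LambdaRank and LambdaMART build on. It is not an implementation of either algorithm; it only shows how a loss can reward putting pairs in the right order. The function name and inputs are assumptions for illustration.

```python
import math


def pairwise_logistic_loss(scores, labels):
    """Average logistic loss over all pairs (i, j) where document i should outrank j.

    scores: model scores, one per document (higher means ranked earlier).
    labels: ground-truth relevance, one per document.
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                x = -(scores[i] - scores[j])
                # Numerically stable softplus: log(1 + exp(x))
                total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
                pairs += 1
    return total / pairs if pairs else 0.0


labels = [3, 2, 0]
good = pairwise_logistic_loss([2.0, 1.0, 0.0], labels)  # scores agree with labels
bad = pairwise_logistic_loss([0.0, 1.0, 2.0], labels)   # scores reverse the labels
```

The loss is smaller when the scores agree with the ground-truth order (`good < bad`), so minimizing it pushes correctly ordered pairs apart and misordered pairs back into place.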
    
### Blender
- Mixes various result types such as posts, images, news, and videos.
- Avoids displaying results from only a single source or a few sources.
- Outputs the final result page to users.

### Filter 
- Filters out inappropriate results even when they have good user engagement.
- Training data can come from human raters and/or online feedback.
- Extra features could be considered such as
    - Website historical report rate
    - Sexually explicit terms used
    - Domain name
    - Website description
    - Images used on the website
- Use a classifier to determine whether a result is inappropriate.

## Training data generation
- Takes online user data and generates positive and negative examples.

### Binary classification (pointwise approach)
- A document is either relevant or irrelevant.
    - If the user spent some time on the document, mark it relevant.
    - If the user immediately went back after clicking the document, mark it irrelevant.
- We may never get enough negative examples.
    - One option is to treat all documents displayed on, say, the 50th results page in Google as negative.
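
The labeling rules above could be sketched as follows. The log tuple layout, the `deep_results` argument, and the 10-second threshold are all assumptions made for illustration; the notes only say "some time" versus an immediate bounce.

```python
def build_examples(click_log, deep_results=(), min_dwell_seconds=10.0):
    """Turn click logs into (query, doc_id, label) rows for pointwise training.

    click_log: iterable of (query, doc_id, dwell_seconds) for clicked results.
    deep_results: iterable of (query, doc_id) shown very deep in the results
        (for example, the 50th page), treated as extra negatives.
    min_dwell_seconds: hypothetical cutoff separating a real visit from a bounce.
    """
    examples = []
    for query, doc_id, dwell_seconds in click_log:
        label = 1 if dwell_seconds >= min_dwell_seconds else 0
        examples.append((query, doc_id, label))
    for query, doc_id in deep_results:
        examples.append((query, doc_id, 0))
    return examples


examples = build_examples(
    click_log=[("pizza places", "doc_a", 30.0), ("pizza places", "doc_b", 1.5)],
    deep_results=[("pizza places", "doc_z")],
)
```

In this sample, `doc_a` becomes a positive example while `doc_b` (a quick bounce) and `doc_z` (a deep result) become negatives.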

### Train / test split

<img src="img/search_engine8.png" style="width:500px;height:300px;">

### Document ordering (pairwise approach)
- The goal is to minimize inversions (pairs of documents ranked in the wrong order compared to the ground truth).
- Rank the documents by user activity on each one and use that ranking as training data.

## Feature engineering

<img src="img/search_engine7.png" style="width:800px;height:300px;">

### Searcher (Assume the user is logged in)
- Age
- Gender
- Interest

### Query
- History
    - For example, query "earthquake" historically was related to recent news.
- Intent
    - For example, the query "Pizza places" has "local" intent, so pizza places located near the searcher should be ranked higher.
    
### Document
- Page rank
    - Based on, for example, the number of quality documents that link to it.
- Radius
    - For example, a coffee shop in Toronto is relevant to people within a 10km radius, but the Eiffel Tower has global scope.
    
### Context
- Time of day
    - For example, the query "restaurant" should favor restaurants that are open at the time of the query.
- Recent query
    - Take a look at previous queries. For example, "python" -> "python list"
    
### Searcher-document
- Distance
    - For queries regarding locations, consider distance between searcher and matching location.
- History
    - For example, if the searcher looked for video documents in the past, then video documents would be more relevant to the searcher.
    
### Query-document
- Text match
    - Matches in the title, metadata, content of document
- N-gram match
    - For example, for the query "Seattle tourism guide", find text matches for the combinations of the three words.
    - TF-IDF
        - TF: (Term Frequency) how often a term appears in the document, a proxy for its importance there.
        - IDF: (Inverse Document Frequency) how much information a particular term provides; terms that are rare across documents score higher.
- Click rate
    - User's historical engagement with document.
- Embeddings
    - Find relationship between query and document.
    - Similarity score is computed between query vector and each document vector.
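
The TF-IDF and similarity ideas above can be sketched together. This example uses sparse TF-IDF vectors as a stand-in for learned embeddings (an assumption; a real system would typically use a trained encoder), and scores each document by cosine similarity to the query. The smoothed IDF formula `log(N / df) + 1` is one common variant, chosen here for simplicity.

```python
import math
from collections import Counter


def compute_idf(docs):
    """IDF per term over the corpus: log(N / document frequency) + 1."""
    tokenized = [set(d.lower().split()) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in tokens)
    return {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; terms not in the corpus vocabulary are ignored."""
    tokens = [t for t in text.lower().split() if t in idf]
    tf = Counter(tokens)
    return {term: (count / len(tokens)) * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = ["seattle tourism guide", "seattle coffee shops", "python list tutorial"]
idf = compute_idf(docs)
query_vec = vectorize("Seattle tourism guide", idf)
scores = [cosine(query_vec, vectorize(d, idf)) for d in docs]
```

The first document matches the query exactly and scores highest, the second shares only "seattle" and scores lower, and the third shares no terms and scores zero.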