# Search engine use cases

- Structured search
- Full text search
- Record linkage 
- Analytics  => Log analytics
- Faceted search

## Structured search

  * Find documents whose fields exactly match a set of filters, much like database queries
  * Filters do not compute a relevance score: a document either matches or it does not
  * Filter results may be cached
  
###  Term filter - apply a filter to a field
  * A filter has to be wrapped in a query (for example a `filtered` query) before it can be sent to the search API
  * Take care with text fields: if they should behave as exact values, map them as `not_analyzed`
  * Internally uses a bitset to cache filter results
  * The semantics are "contains", not "equals": the filter matches when the field contains the term, even if it holds other values too (for example a multi-valued tags field)
  * Checking for strict equality is difficult in an inverted index
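
The "contains, not equals" behaviour follows from how an inverted index is laid out: each term points to the documents that contain it, and nothing records what else a document holds. The following is a minimal Python sketch of that idea, not Elasticsearch code; the `docs` data and all helper names are made up for illustration, and a Python integer stands in for the bitset.

```python
from collections import defaultdict

# Hypothetical documents with a multi-valued "tags" field.
docs = {
    0: {"tags": ["search"]},
    1: {"tags": ["search", "open_source"]},
    2: {"tags": ["database"]},
}

def build_inverted_index(docs, field):
    """Map each term to a bitset (a Python int) of the doc ids that contain it."""
    index = defaultdict(int)
    for doc_id, doc in docs.items():
        for term in doc.get(field, []):
            index[term] |= 1 << doc_id
    return index

def term_filter(index, term):
    """Return the bitset of documents whose field contains `term`."""
    return index.get(term, 0)

def doc_ids(bitset):
    """Expand a bitset back into a sorted list of doc ids."""
    return [i for i in range(bitset.bit_length()) if bitset >> i & 1]

tags_index = build_inverted_index(docs, "tags")
matches = doc_ids(term_filter(tags_index, "search"))
print(matches)  # [0, 1]: doc 1 matches even though it carries another tag too
```

Restricting the result to documents tagged only with `search` would need extra information, such as a separate field holding the number of tags.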
  
  
### Boolean Filter 
  * Combines several filters into one
  * Boolean logic is implemented with these clauses:
    * `must` (AND)
    * `should` (OR)
    * `must_not` (NOT)
    
   
### Terms filter 
  * Matches any of several values on the same field (OR)
  
### Range filter 
  * Applies comparison operators (`gt`, `gte`, `lt`, `lte`) to a field
  * Works on the following types:
     * Numbers
     * Dates: the bounds can use date operations (date math)
     * Strings: be careful, because a range on strings may be slow if the cardinality of the field is large

### Exists/Missing filters 
  * "NULL does not exist in ES"
  * `exists` checks whether a document has a value for a field
  * `missing` is the opposite
  * A mapping can define a replacement for null values (`null_value`). The replacement must be of the same type as the field.
  
  
### Comments on caching and how filter order affects performance

<pre>
GET /my_store/products/_search
{
    "query" : {
        "filtered" : { 
            "query" : {
                "match_all" : {} 
            },
            "filter" : {
                "term" : { 
                    "price" : 20
                }
            }
        }
    }
}
</pre>

## Full text search

Search full-text fields in order to find the most relevant documents.

The two most important aspects are:

  * Relevance: how to score documents
  * Analysis: how to represent documents


## Search process
  1. Check the field type 
  2. Analyze the query string 
  3. Find matching docs 
  4. Score each doc
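
The four steps above can be sketched in a few lines of Python. Everything here is a toy stand-in chosen for illustration: the `MAPPING` and `DOCS` data, the `analyze` function, and the score, which simply counts matching query terms rather than using a real relevance formula.

```python
import re

# Hypothetical mapping: which fields are analyzed full text and which are exact values.
MAPPING = {"title": "analyzed", "status": "not_analyzed"}

DOCS = {
    1: {"title": "Quick brown fox", "status": "Published"},
    2: {"title": "The quick dog", "status": "draft"},
    3: {"title": "Lazy brown dog", "status": "Published"},
}

def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def terms_for(field, text):
    # Steps 1-2: check the field type, then analyze only if the field is full text.
    return analyze(text) if MAPPING[field] == "analyzed" else [text]

def search(field, query_string):
    query_terms = terms_for(field, query_string)
    scored = []
    for doc_id, doc in DOCS.items():
        doc_terms = terms_for(field, doc[field])
        # Step 3: a document matches if it shares at least one term with the query.
        hits = sum(1 for term in query_terms if term in doc_terms)
        if hits:
            # Step 4: toy score = number of query terms found in the document.
            scored.append((hits, doc_id))
    return [doc_id for hits, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

print(search("title", "QUICK fox"))   # [1, 2]
print(search("status", "published"))  # []: exact values are case-sensitive
```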

## Queries

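  - Term queries (`term`): the query string is not analyzed before it is matched against the field. Consider whether they could be filters instead
  - Full-text queries (`match`): the analysis applied depends on the type of the field
       - Numbers and dates: the query string is treated as a single exact value
       - Strings (`not_analyzed`): the query string is treated as a single exact term
       - Full-text fields: the query string goes through the analysis defined in the mapping
       
  - Fuzzy queries: match the term and similar terms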
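
"Similar terms" in a fuzzy query are usually measured by edit distance. The sketch below uses plain Levenshtein distance as a simplification (a production engine may also count transpositions as a single edit); the vocabulary and the `max_edits` parameter name are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_edits=1):
    """Return the indexed terms within `max_edits` of the query term."""
    return sorted(t for t in vocabulary if edit_distance(term, t) <= max_edits)

vocabulary = {"surprise", "surprize", "surmise", "sunrise"}  # made-up index terms
print(fuzzy_match("surprize", vocabulary))  # ['surprise', 'surprize']
```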

## Queries II 

  - Boolean queries (`bool`): we can combine queries with boolean logic
  - Boosting: we can change the importance of certain terms or clauses
  

## Phrase queries

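  Also known as proximity queries.
  
  * `match_phrase`: requires term positions to be indexed
  * `slop`: lets the phrase match when the terms are close together rather than strictly adjacent; its value is the number of position moves allowed
  * Proximity is included in the relevance score: the closer the terms, the higher the score
  * Can be combined with other queries
  * Phrase and proximity queries are more expensive than simple term matching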
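
To make the role of positions and `slop` concrete, here is a small Python sketch. It is a simplification that assumes each phrase term is distinct and uses a whitespace analyzer; the function names are hypothetical and the real scorer is more involved.

```python
from itertools import product

def index_positions(text):
    """Record the positions at which each term occurs (toy whitespace analyzer)."""
    positions = {}
    for pos, term in enumerate(text.lower().split()):
        positions.setdefault(term, []).append(pos)
    return positions

def phrase_match(positions, phrase, slop=0):
    """True if the phrase terms occur in the document within `slop` position moves."""
    terms = phrase.lower().split()
    if any(term not in positions for term in terms):
        return False
    for combo in product(*(positions[term] for term in terms)):
        # Subtract each term's offset in the phrase; an exact phrase gives equal values.
        adjusted = [pos - offset for offset, pos in enumerate(combo)]
        if max(adjusted) - min(adjusted) <= slop:
            return True
    return False

doc = index_positions("the quick brown fox jumps")
print(phrase_match(doc, "quick fox"))          # False: "brown" sits in between
print(phrase_match(doc, "quick fox", slop=1))  # True: one move closes the gap
print(phrase_match(doc, "fox quick", slop=3))  # True: reordering costs more moves
```

The nested loop over position combinations hints at why phrase and proximity queries cost more than a plain term lookup.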
  
### Multivalued fields 
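  * A position gap can be added between the values of a field so that a phrase does not match across two different values (a false positive)
  * The gap is defined in the mapping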

### Index bigrams (shingles) 
  * Consider as an alternative to phrase queries: pairs of adjacent words are indexed as single terms
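
A rough Python illustration of what indexing bigrams means: adjacent word pairs become terms of their own, so word order is captured at index time. The `shingles` helper and its whitespace tokenization are assumptions for the example.

```python
def shingles(text, size=2):
    """Return the word n-grams (shingles) of `text`; size=2 gives bigrams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

doc_shingles = shingles("Sue ate the alligator")
print(doc_shingles)  # ['sue ate', 'ate the', 'the alligator']

# A query is shingled the same way; shared bigrams indicate matching word order.
query_shingles = shingles("the alligator ate Sue")
shared = [s for s in query_shingles if s in doc_shingles]
print(shared)  # ['the alligator']
```

Matching a bigram is then an ordinary term lookup, which is cheaper at query time than checking positions, at the price of a larger index.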


## Partial matching
  
  
At query time: 
  - Prefix queries
  - Wildcards
  - Regexp
  - `match_phrase_prefix`
  
At index time:  
  - n-gram indexing
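
Index-time partial matching trades index size for query speed: every prefix is stored as a term, so a prefix search becomes an exact term lookup. The sketch below shows edge n-grams, one common variant, in plain Python; the word list, the `min_gram`/`max_gram` defaults and the dictionary-based index are illustrative assumptions.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Return the prefixes of `term` whose length is between min_gram and max_gram."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]

# Build a toy inverted index over the edge n-grams of each word.
words = ["quick", "quiet", "brown"]  # made-up indexed terms, one per document
ngram_index = {}
for doc_id, word in enumerate(words):
    for gram in edge_ngrams(word):
        ngram_index.setdefault(gram, set()).add(doc_id)

print(edge_ngrams("quick", max_gram=3))        # ['q', 'qu', 'qui']
print(sorted(ngram_index.get("qui", set())))   # [0, 1]
```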




## Relevance
  - How to sort documents by their relevance to a query: an end user will only look at the top k documents
  - Measuring full-text relevance:
     - Default: TF-IDF
     - Other text similarity measures between strings and documents: fuzzy similarity
  
  - We can take other relevance measures into account
     - Time: recency
     - Location: proximity
     - Other numerical fields
  - Difference from databases: the algorithms are designed to rank documents and return only the top k.
  
  

## Relevance: TFIDF
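
TF-IDF scores a document higher when a query term occurs often in it (term frequency) and when that term is rare across the collection (inverse document frequency). The Python sketch below uses one simplified form, `sqrt(freq) * (1 + ln(N / (df + 1)))`, summed over query terms; actual scoring adds further factors such as field-length normalization. The documents and function names are made up for the example.

```python
import math
from collections import Counter

DOCS = {
    1: "quick brown fox",
    2: "quick quick dog",
    3: "lazy brown dog",
    4: "lazy cat",
}

def tokenize(text):
    return text.lower().split()

def idf(term, tokenized):
    """Inverse document frequency: rarer terms get a larger weight."""
    doc_freq = sum(1 for counts in tokenized.values() if term in counts)
    return 1 + math.log(len(tokenized) / (doc_freq + 1))

def tf_idf_ranking(query, docs):
    """Return doc ids ordered by a simplified TF-IDF score, best first."""
    tokenized = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for term in tokenize(query):
        weight = idf(term, tokenized)
        for doc_id, counts in tokenized.items():
            if term in counts:
                tf = math.sqrt(counts[term])  # dampened term frequency
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * weight
    return sorted(scores, key=lambda doc_id: -scores[doc_id])

print(tf_idf_ranking("quick fox", DOCS))  # [1, 2]
```

Document 1 wins because it contains both terms, including the rarer "fox"; document 2 only repeats the more common "quick".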



## Re-Scoring

## Multifield search
  
Motivation: 

  * Different uses: 
    * Match different full-text queries against different fields, for example title and author
    * The structure of the `bool` query affects how much each clause contributes to the score; boosting may also be used
    
  * Tuning: 
    * `dis_max`: takes the score of the best-matching field
    * `tie_breaker`: lets the other matching fields contribute a fraction of their scores
    * `multi_match`: helper that runs the same query against several fields
    * Fields can be selected with wildcards in the field name
    * Cross-fields entity search
       
  * `multi_match` types: 
    * best fields 
    * most fields 
    * cross fields 
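
The difference between summing field scores and taking the best field can be shown with a little arithmetic. This Python sketch assumes made-up per-field scores and ignores the other factors a real bool query applies; the function names are hypothetical.

```python
def bool_sum(field_scores):
    """Simplified bool-style score: add up the scores of all matching fields."""
    return sum(field_scores)

def dis_max(field_scores, tie_breaker=0.0):
    """Best field's score plus tie_breaker times the scores of the other fields."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Made-up scores: document A matches one query word in each of two fields,
# document B matches both query words in a single field.
doc_a = [0.2, 0.2]
doc_b = [0.35]

print(bool_sum(doc_a) > bool_sum(doc_b))    # True: summing favours A
print(dis_max(doc_a) > dis_max(doc_b))      # False: best field favours B
print(dis_max(doc_a, tie_breaker=0.3) > dis_max(doc_a))  # True: A gets partial credit
```

With a `tie_breaker` between 0 and 1 the best field still dominates, but documents that match in several fields are rewarded a little.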
   
       