# <center> Information Retrieval </center>

References:
- https://pdfs.semanticscholar.org/e2b6/4bf95ee3d2f9bb56042e93eb7688158b7d35.pdf
- https://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
- https://webcourse.cs.technion.ac.il/236621/Winter2011-2012/ho/WCFiles/tutorial%201%20%28Information%20Retrieval%20Basics%29.pdf
- https://people.cs.umass.edu/~jpjiang/cs646/16_learning_to_rank.pdf
- https://logz.io/learn/complete-guide-elk-stack/

## 1. Basic Concepts

- Document ($d$) – any piece of information (usually textual data)
- Query ($q$) – some text representing the user’s information need
- Relevance – a predicate between documents and queries $R(d,q)$
- Process of Information Retrieval:
 

<img src="information_retrieval.png" width="50%">
source: https://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx


## 2. Information Retrieval Process
- Analyze documents: tokenize, strip whitespace, remove stop words, extract phrases, apply stemming, compute tf-idf weights, ...
- Index document: after analysis, each document is represented as a set of terms with tf-idf weights 
  - Inverted index
  <img src='inverted_index.jpg' width='50%'>
- Use index for retrieval
  - Given a user's query $q=(t_1, t_2, ..., t_k)$, find the relevant documents
  - Popular methods:
      - Boolean model
      - Vector Space Model
      - Doc2Vector
      - Learning to Rank (Supervised)
      - ...
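
The analysis and indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not how a production engine such as Lucene implements them: the stop-word list, the sample documents, and the function names `analyze` and `build_inverted_index` are all assumptions made for this example, and stemming and tf-idf weighting are left out to keep it short.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to"}

def analyze(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document corpus.
docs = {
    1: "The blue butterfly is on the flower",
    2: "A bright butterfly in the garden",
    3: "Blue sky and bright sun",
}
index = build_inverted_index(docs)
print(index["butterfly"])  # [1, 2]
print(index["blue"])       # [1, 3]
```

Looking up a term is then a single dictionary access, which is why retrieval does not need to scan every document.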

### 2.1 Boolean Model
- A query is specified as a Boolean expression, e.g. ('blue' AND 'butterfly')
  - Boolean operators: AND, OR, NOT
- Relevance: A document is relevant to the query if it satisfies the query's boolean expression
- Pros:
  - Fast
  - Relevance is binary (Yes or No)
- Cons:
  - No ranking
  - It is hard to convert a natural-language search query into a Boolean expression (e.g. 'bright butterfly not in color blue')
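
As a rough sketch, Boolean retrieval over an inverted index reduces to set operations on posting lists. The postings below are hypothetical, and the helper `docs_with` is introduced only for this example.

```python
# Hypothetical postings: term -> set of ids of documents containing the term.
postings = {
    "blue": {1, 3},
    "bright": {2, 3},
    "butterfly": {1, 2},
}
all_docs = {1, 2, 3}

def docs_with(term):
    """Return the posting set for a term (empty if the term is unknown)."""
    return postings.get(term, set())

# 'blue' AND 'butterfly' -> set intersection
blue_and_butterfly = docs_with("blue") & docs_with("butterfly")

# 'blue' OR 'bright' -> set union
blue_or_bright = docs_with("blue") | docs_with("bright")

# 'bright' AND 'butterfly' AND NOT 'blue' -> intersect with the complement
bright_butterfly_not_blue = (
    docs_with("bright") & docs_with("butterfly") & (all_docs - docs_with("blue"))
)

print(sorted(blue_and_butterfly))         # [1]
print(sorted(blue_or_bright))             # [1, 2, 3]
print(sorted(bright_butterfly_not_blue))  # [2]
```

Each result is an unordered set of matching documents, which illustrates the "no ranking" limitation listed above.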

### 2.2 Vector Space Model
- Documents are represented as vectors in a (huge) $N$-dimensional space
  - $N$ is the number of terms in the corpus, i.e. the size of the lexicon/dictionary
- Query is a document like any other document
- Relevance – measured by similarity:
  - Cosine similarity between query and document vectors
- Pros
  - tf-idf improves retrieval effectiveness
  - Cosine similarity is a good ranking measure
  - Simple and elegant
- Cons
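  - A highly ranked document is not guaranteed to contain all of the query terms
  - Term weighting schemes are sometimes difficult to maintain in incremental settings, because adding documents changes the corpus-wide statistics (e.g. idf)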
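
The following is a minimal sketch of vector space ranking, assuming one common tf-idf variant (raw term count times $\log(N/\mathit{df})$). Other weighting schemes exist, and the corpus, query, and function names here are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus of already-analyzed documents (lists of terms).
docs = {
    "d1": ["blue", "butterfly", "flower"],
    "d2": ["bright", "butterfly", "garden"],
    "d3": ["blue", "sky", "bright", "sun"],
}

def idf(term):
    """Inverse document frequency: log(N / df), or 0 for unseen terms."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(tokens):
    """Sparse tf-idf vector stored as a term -> weight dictionary."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# The query is treated like any other document.
query_vector = tfidf_vector(["blue", "butterfly"])
scores = {d: cosine(query_vector, tfidf_vector(tokens)) for d, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Here `d2` and `d3` each contain only one of the two query terms but still receive a nonzero score, so ranking by cosine similarity alone does not require a document to match every query term.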

### 2.3 Doc2Vector
### 2.4 Learning to Rank (Supervised)
- Learn a function to automatically rank results 
- Basic idea:
  - Given a training dataset of triples $(q, d, r)$, where $d$ is a document returned for query $q$ and $r$ is its relevance label
  - Train a machine learning model on features extracted from each concatenated pair $(d :: q)$, with $r$ as the target
  - SVM, regression, deep learning, and other machine learning techniques can be used to learn the ranking model
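
To make the idea concrete, here is a small pointwise sketch: a logistic regression trained with plain stochastic gradient descent on hand-written query-document features. Everything in it is an assumption chosen for illustration (the training triples, the two features, the learning rate and epoch count); real systems use far richer features and models, and often pairwise or listwise objectives instead.

```python
import math

# Hypothetical training triples: (query terms, document terms, relevance label).
training = [
    (["blue", "butterfly"], ["blue", "butterfly", "flower"], 1),
    (["blue", "butterfly"], ["bright", "sun", "sky"], 0),
    (["bright", "sun"], ["bright", "sun", "sky"], 1),
    (["bright", "sun"], ["blue", "butterfly", "flower"], 0),
    (["garden"], ["bright", "butterfly", "garden"], 1),
    (["garden"], ["blue", "sky"], 0),
]

def features(q, d):
    """Illustrative features: bias, query-term coverage, inverse document length."""
    overlap = len(set(q) & set(d))
    return [1.0, overlap / len(set(q)), 1.0 / len(d)]

def score(w, x):
    """Predicted probability of relevance under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=500, lr=0.5):
    """Fit the weights with stochastic gradient ascent on the log-likelihood."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for q, d, r in data:
            x = features(q, d)
            err = r - score(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

w = train(training)

# Rank unseen candidate documents for a query by predicted relevance.
query = ["blue", "butterfly"]
candidates = {
    "d1": ["blue", "butterfly", "flower"],
    "d2": ["bright", "butterfly", "garden"],
    "d3": ["bright", "sun", "sky"],
}
ranked = sorted(
    candidates,
    key=lambda name: score(w, features(query, candidates[name])),
    reverse=True,
)
print(ranked)  # ['d1', 'd2', 'd3']
```

The learned weight on the coverage feature is positive, so documents that match more of the query are ranked higher.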

## 3. Popular text search engines
- Apache Lucene (https://lucene.apache.org/)
- Apache Solr (based on Lucene) (http://lucene.apache.org/solr/)
- ElasticSearch (based on Lucene) (https://www.elastic.co)
- ...


## 4. ElasticSearch, Logstash, and Kibana (ELK)
- Elasticsearch, Logstash, and Kibana (together called the ELK Stack) form one of the most popular open-source log management platforms, reportedly downloaded about 500,000 times per month
- Elasticsearch is a NoSQL database that is based on the Lucene search engine. 
  - It's not only for log management
  - It supports scalable full text search
- Logstash is a log pipeline tool that accepts inputs from various sources (e.g. tweets) and exports the data to various targets. 
- Kibana is a visualization layer that works on top of Elasticsearch


### 4.1. Demo of ElasticSearch and Kibana
- Demo Site: http://155.246.104.27:5601/
- Select "logstash-\*" index (i.e. documents in indexes named with "logstash...")
- Change the time range from 2015-05-17 to 2015-05-20
- Inspect the stored JSON data
- Search for documents, e.g. traffic related to "twitter.com" or "music AND movie"
- Play with the dashboard
- Create your own visualization (it's a shared environment, so be sure to give it a unique name, e.g. one that includes your student id)
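
For reference, the search and time-range steps above roughly correspond to an Elasticsearch Query DSL request like the one sketched below. This only builds and prints the JSON body. It assumes the documents carry a `@timestamp` field, which is the Logstash default but is not confirmed for the demo index, and that the body would be sent to the `logstash-*/_search` endpoint of the Elasticsearch instance behind the demo site.

```python
import json

# Assumption: "@timestamp" is the time field of the logstash-* indexes.
search_body = {
    "query": {
        "bool": {
            # Full-text condition, written in Lucene query-string syntax.
            "must": [
                {"query_string": {"query": "music AND movie"}}
            ],
            # Restrict results to the time range used in the demo.
            "filter": [
                {"range": {"@timestamp": {"gte": "2015-05-17", "lte": "2015-05-20"}}}
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

Typing `music AND movie` into the Kibana search bar and setting the time picker has a similar effect without writing the request by hand.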