# ElasticSearch

In this notebook we have created a search engine using ElasticSearch. 

For our own reference:
* Literature: <https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html>
* Telegraaf XML documents: http://data.politicalmashup.nl/arjan/telegraaf/



Create a search engine for the Telegraaf newspaper collection using e.g. ElasticSearch. Make facets for years and document types. Pay attention to telephone numbers (in mini advertisements). Below is an example of one document (= one small article).
You can see that it even contains a link to the source text (as an image). The URL redirects to the new URL http://www.delpher.nl/nl/kranten/view?identifier=ddd%3A010563762%3Ampeg21%3Aa0005&coll=ddd ElasticSearch uses a JSON format as input, so the documents are trivial to convert to JSON.

Each of the following points must be addressed. Create a separate page on the wiki for each point. Make sure these pages can be found from the menu of your wiki. Explain what you did, and exemplify with links to screenshots/a working system.

* Search as we know it from Google. Give a result page (SERP), with links to the documents and some description of each hit.
* Advanced search. Let a user search in several fields, also in several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)
* Do one of the following:
    1. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion).
    2. Represent each document (a kamervraag) with a word-cloud. Also make word-clouds for the question and for the answer. EXAMPLE: The html files in http://data.politicalmashup.nl/arjan/odeii/data_as_html/ contain such wordcloud summaries, which work rather well.   

You can use several techniques to get rid of high frequency, but meaningless words: of course IDF, but also mutual information (see 13.5.1), or of course the technique from the paper by Kaptein et al on wordclouds.

* Give next to a traditional list of results, a timeline in which you indicate how many hits there are over time.
* Give next to the traditional list of results, a table with the number of hits for each political party. Link the party names, which should result in only selecting the hits "ingediend" (submitted) by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
* Evaluate your results. Let 2 persons assess the relevancy of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences in hard and easy queries. Describe clearly how you solved differences in judgements.
Create your queries in the following format:

    <topic number="6">
      <query>kcs</query>
      <description>Find information on the Kansas City Southern railroad.
      </description>
    </topic>

    <topic number="16">
      <query>arizona game and fish</query>
      <description>I'm looking for information about fishing and hunting
      in Arizona.
      </description>
    </topic>

So, both provide the actual query, and a description of the information need that was behind the query.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.
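
A minimal sketch of how topics in this format could be read back for the evaluation, using only the standard library. The `<topics>` root element, the `TOPICS_XML` string, and the function name `parse_topics` are assumptions added for illustration; the assignment only specifies the individual `<topic>` elements.

```python
import xml.etree.ElementTree as ET

# Two topics in the format shown above, wrapped in a hypothetical <topics>
# root element so that the string is well-formed XML.
TOPICS_XML = """
<topics>
  <topic number="6">
    <query>kcs</query>
    <description>Find information on the Kansas City Southern railroad.
    </description>
  </topic>
  <topic number="16">
    <query>arizona game and fish</query>
    <description>I'm looking for information about fishing and hunting
    in Arizona.
    </description>
  </topic>
</topics>
"""

def parse_topics(xml_string):
    """Return a list of dicts with the number, query and description of each topic."""
    root = ET.fromstring(xml_string.strip())
    topics = []
    for topic in root.findall('topic'):
        topics.append({
            'number': int(topic.get('number')),
            'query': topic.findtext('query').strip(),
            # Collapse the line breaks and indentation inside the description.
            'description': ' '.join(topic.findtext('description').split()),
        })
    return topics

topics = parse_topics(TOPICS_XML)
for topic in topics:
    print('%d: %s' % (topic['number'], topic['query']))
```

Each `query` string can then be passed to the search function, while the `description` is shown to the judges as the information need.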

* Change the ranking of your system, compute the average precision at 10 using your 10 queries, compare the results to your old system, and EXPLAIN what is going on.


# The Search Engine

Before starting ES, run:

    export ES_HEAP_SIZE=Half_RAM

where Half_RAM is half of your RAM.

Also add the following line to /config/elasticsearch.yml:

    indices.memory.index_buffer_size: 50%

(We still need to check whether this makes a difference.)

To start the Elasticsearch service, run the following command on the command line:

    ./elasticsearch-2.4.1/bin/elasticsearch --node.name telegraaf

## Initiate connection to the Elastic Search engine

In [1]:
import sys
import json
from elasticsearch import Elasticsearch, helpers

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

# If code runs, the connection is made

## Generator to read Telegraaf XML and add them to the ES database

A generator makes it possible to feed the XML documents into the ES database one by one, without first loading them all into memory.

* Remove high frequency, but meaningless words

You can use several techniques to get rid of high frequency, but meaningless words: of course IDF, but also mutual information (see 13.5.1), or of course the technique from the paper by Kaptein et al on wordclouds.

    Possibly also create an inverted index at this point? 
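
As a rough illustration of the IDF option mentioned above, the sketch below computes inverse document frequencies for a toy collection. The function name `idf_weights` and the three example documents are made up; tokenisation is a plain whitespace split, which is an assumption and not how ES analyses text.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Return the inverse document frequency log(N / df) of every term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / float(df)) for term, df in doc_freq.items()}

# Made-up toy documents, not taken from the collection.
docs = ['de boot vertrekt', 'de trein vertrekt', 'de boot zinkt']
weights = idf_weights(docs)
print(sorted(weights.items(), key=lambda item: item[1]))
```

A term that occurs in every document, such as "de" here, gets weight 0, so multiplying term frequencies by these weights pushes such words out of a wordcloud.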

In [2]:
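from bs4 import BeautifulSoup
from os import listdir
from os.path import isfile, join

def read(year):
    '''
    Return a generator for the date, subject (type), title, text,
    and identifier of each item in the XML file of the given year.
    '''
    with open(year, 'r') as xml_file:
        soup = BeautifulSoup(xml_file, 'xml')
    # The element lists are zipped by position, which assumes that every
    # item contains exactly one of each element.
    for date, subject, title, text, identifier in zip(soup.find_all('date'), soup.find_all('subject'),
                                                      soup.find_all('title'), soup.find_all('text'),
                                                      soup.find_all('identifier')):
        yield (date.text, subject.text, title.text, text.text, identifier.text)

# Collect the paths of all files in the Telegraaf directory.
# isfile() needs the full path, not just the file name.
documents = [join('./Telegraaf', i) for i in listdir('./Telegraaf')
             if isfile(join('./Telegraaf', i))]

print(documents)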

# Create the generator for the bulk importer
# I'm not sure if it's a good idea to use the subject (article, advertisement, or more...) as _type here.
# The score calculation for the Elastic Search database uses whole-index statistics. 
# If you're searching a subsection this will alter the scores! WE WILL NEED TO KEEP THIS IN MIND.
#k = ({'_type':subject, '_index':'telegraaf','_source':{'year':date[:4], 'date':date[5:], 'title':title, 'text':text}} 
#    for year in documents for (date,subject,title,text) in read(year))


['./Telegraaf/telegraaf-1918.xml', './Telegraaf/telegraaf-1922.xml']
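
The comments in the cell above question whether the subject should be used as `_type`. A minimal sketch of one alternative, assuming the subject is stored as an ordinary `subject` field under a single fixed type. The type name `item`, the function `make_action`, and the example values are hypothetical, and the date is assumed to have the form `YYYY-MM-DD`, as the slicing in the original generator suggests.

```python
def make_action(date, subject, title, text, identifier, index='telegraaf'):
    """Build one bulk action that stores the subject as a regular field."""
    return {
        '_index': index,
        # A single, fixed type name (hypothetical) instead of one type per subject.
        '_type': 'item',
        '_id': identifier,
        '_source': {
            'year': date[:4],
            'date': date[5:],
            'subject': subject,
            'title': title,
            'text': text,
        },
    }

# Made-up example item; the values are not taken from the collection.
action = make_action('1918-11-11', 'artikel', 'Example title', 'Example text', 'example-id-1')
print('%s %s' % (action['_source']['year'], action['_source']['subject']))
```

With this layout the document type facet would be a filter or aggregation on `subject` rather than on `_type`.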


## Populate ES database

In [2]:
# List of all indices
! curl 'localhost:9200/_cat/indices?v'

health status index     pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   telegraaf   5   1     157546            0      333mb          333mb 
yellow open   megacorp    5   1          0            0       800b           800b 


In [3]:
# Delete any pre-existing index
es.indices.delete(index='telegraaf', ignore=[404,400])

{u'acknowledged': True}

In [4]:
# Create the telegraaf index in our telegraaf node
es.indices.create(index='telegraaf', ignore=400)

{u'acknowledged': True}

In [5]:
# turn refresh off to speed up bulk import
es.indices.put_settings(index='telegraaf',body={"index" : 
                                            {"refresh_interval" : "-1"
                                            }
                                       })

{u'acknowledged': True}

In [6]:
import time

# Import the documents into the index.
# A generator can only be consumed once, so each function below builds its own.
print "Test with chunk size = 500 and max_chunk_bytes = 15728640 "
# helpers.parallel_bulk sends the chunks to ES from several threads.

def bulk_all(documents):
    start = time.time()
    print "Starting time:", start

    k = ({'_type':subject, '_id':identifier, '_index':'telegraaf','_source':{'year':date[:4], 
         'date':date[5:], 'title':title, 'text':text}}
          for doc in documents[:5] for (date,subject,title,text,identifier) in read(doc))
    for ok in helpers.parallel_bulk(es,k, chunk_size=500,max_chunk_bytes=15728640):
        continue
    end_doc =time.time()
    print "Finished", (end_doc-start)

def bulk_per_doc(documents):
    start = time.time()
    print "Starting time:", start

    for i in documents[:1]:
        start_doc = time.time()
        k = ({'_type':subject, '_id':identifier, '_index':'telegraaf','_source':{'year':date[:4], 
             'date':date[5:], 'title':title, 'text':text}}
             for (date,subject,title,text,identifier) in read(i))
        for ok in helpers.parallel_bulk(es,k,chunk_size=500,max_chunk_bytes=15728640):
            continue
        end_doc =time.time()
        print "Finished", (end_doc-start_doc)
        end = time.time()
    print "Done:", end - start
        
bulk_all(documents)

Test with chunk size = 500 and max_chunk_bytes = 15728640 
Starting time: 1477241574.61
Finished 17.2869999409


In [7]:
# Set the refresh rate back to default
es.indices.put_settings(index='telegraaf',body={"index" : 
                                            {"refresh_interval" : "1s"
                                            }
                                       })

{u'acknowledged': True}

In [41]:
# Ideas for speed improvements/performance:
# - set bootstrap.mlockall: true in the config file
#   (make sure ES_HEAP_SIZE is large enough)
# - parse the whole document at once: xml.etree.cElementTree.parse()
# - stream the XML document: xml.sax

# import xml.etree.ElementTree as etree
# for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
#  print event, elem
# http://boscoh.com/programming/reading-xml-serially.html
# Event handlers 
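
The streaming idea from the notes above could look roughly like the following sketch, which uses `xml.etree.ElementTree.iterparse` on an in-memory sample. The element names (`item`, `date`, `title`, `text`) and the sample data are assumptions for illustration; the real Telegraaf files may use namespaces and a different nesting.

```python
import io
import xml.etree.ElementTree as ET

# A tiny made-up sample standing in for one Telegraaf XML file.
SAMPLE = b"""<root>
  <item><date>1918-01-02</date><title>First</title><text>Some text</text></item>
  <item><date>1918-01-03</date><title>Second</title><text>More text</text></item>
</root>"""

def stream_items(source):
    """Yield one dict per <item> element without building the whole tree first."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'item':
            yield {child.tag: (child.text or '') for child in elem}
            # Drop the children of the item that was just processed to free memory.
            elem.clear()

items = list(stream_items(io.BytesIO(SAMPLE)))
print(len(items))
```

Because `stream_items` is a generator, its output could be mapped to bulk actions and passed to the bulk helper in the same way as the output of `read`.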

## Query system 

* Normalise query
* Get right tokens from the query. Use patterns to split up the query in parts? 
* Put them in the right representation for ES search

In [8]:
def search(query, advanced=False):
    '''
    Given a query, return the ES search result with hits ranked by score
    '''
    if advanced:
        must = []
        if query[0]:
            must.append({"term": {"_type": query[0]}})
        if query[2]:
            must.append({"term": {"year": query[2]}})
        
        q = {"query": 
                {"filtered": 
                    {"query": {
                        "multi_match": 
                            {"query" : query[1],
                             "type" : "cross_fields",  # with 'and' operator 
                             "fields" : ['title', 'text'],
                             "operator" : 'and'
                            }
                        },
                     "filter": 
                        {"bool" : 
                            {"must" : must
                            }
                        }
                    }
                }
            }
        res = es.search(index='telegraaf', size=10, body=q)
        return res
    else:
        # filter_path can help reduce the amount of data that is returned by the es.search
        # The query context is for how well the document fits the query
        # The filter context is a boolean context. Does it match or not.
        # example: Does this timestamp fall into the range 2015 to 2016?
        #
        
        # The outer 'query': is necessary to show that this is the query.  
        q = {'query':
                {'multi_match':
                    {'query' : query,
                     'type' : 'cross_fields',  # with 'and' operator 
                     'fields' : ['title', 'text'],
                     'operator' : 'and'
                     }
                 }
             }
        # In other words, all terms must be present in at least one field for a document to match.
        res = es.search(index='telegraaf', size=10, body=q)
        return res
    

In [64]:
#search('stoomschip engelsche')
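
The `filtered` query used in the advanced branch is deprecated as of Elasticsearch 2.0 in favour of a `bool` query with a `filter` clause. The sketch below builds an equivalent request body in that form; it only constructs the dictionary and does not contact ES. The function name `build_advanced_query` and its keyword arguments are hypothetical.

```python
def build_advanced_query(terms, doc_type=None, year=None):
    """Build a search body with a scored text query and unscored filters."""
    filters = []
    if doc_type:
        filters.append({'term': {'_type': doc_type}})
    if year:
        filters.append({'term': {'year': year}})
    return {
        'query': {
            'bool': {
                # The text query determines the score of each hit.
                'must': {
                    'multi_match': {
                        'query': terms,
                        'type': 'cross_fields',
                        'fields': ['title', 'text'],
                        'operator': 'and',
                    }
                },
                # Filters only include or exclude documents.
                'filter': filters,
            }
        }
    }

body = build_advanced_query('stoomschip engelsche', year='1918')
print(body['query']['bool']['filter'])
```

The resulting `body` could be passed to `es.search(index='telegraaf', size=10, body=body)` in place of the `filtered` version.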

## Result page function

* Take query output and use score to order result on a Search Engine Result Page (SERP).
* Return title, link, and description of each hit

-> The description can be a word cloud of 20-25 most informative words. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion). 


Additions
* A timeline with the amount of hits over time
* A table with the number of hits for each political party. Link the party names, which should result in only selecting the hits "ingediend" (submitted) by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
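
Both additions can in principle be served by terms aggregations on the search request. A sketch under assumptions: the aggregation names `per_year` and `per_type`, the helper names, and the response fragment are made up, and the code only builds and reads dictionaries without contacting ES.

```python
def add_facets(body):
    """Return a copy of a search body with year and document type aggregations."""
    faceted = dict(body)
    faceted['aggs'] = {
        # Number of hits per year, for the timeline.
        'per_year': {'terms': {'field': 'year', 'size': 100}},
        # Number of hits per document type, for the facet table.
        'per_type': {'terms': {'field': '_type'}},
    }
    return faceted

def facet_counts(response, name):
    """Extract (value, count) pairs, sorted by value, from the aggregation `name`."""
    buckets = response['aggregations'][name]['buckets']
    return sorted((bucket['key'], bucket['doc_count']) for bucket in buckets)

# Made-up response fragment in the shape ES returns for a terms aggregation.
fake_response = {'aggregations': {'per_year': {'buckets': [
    {'key': '1922', 'doc_count': 3}, {'key': '1918', 'doc_count': 7}]}}}
print(facet_counts(fake_response, 'per_year'))
```

The sorted year counts can be plotted as the timeline, and each facet value can link to a new search with that value added as a filter.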

## Advanced Search

The query system will have to be changed to implement this

* Make multiple fields searchable:
    * Title 
    * Text
    * Year?
    
Let a user search in several fields, also in several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)

In [9]:
# Determine values for the document type facets
unique_doc_types = list(set(es.indices.get_mapping(index='telegraaf')['telegraaf']['mappings']))

In [10]:
import formlayout 

def query_database():
    query = formlayout.fedit([('Document type',[0]+unique_doc_types),
                   ('Zoektermen',''),
                   ('Jaar publicatie','')],
                   title="Telegraaf zoekmachine", 
                   comment="Wat voor krantenartikel zoek je?")

    query[0] = unique_doc_types[query[0]]

    res = search(query, advanced=True)
    serp = result_page(query[1], res['hits']['hits'], "http://kranten.kb.nl/view/article/id/", 50, 15)
    results = formlayout.fedit(serp, title="Telegraaf results")
    


## Result page function

* Take query output and use score to order result on a Search Engine Result Page (SERP).
* Return title, link, and description of each hit

### SERP & Wordcloud
Since stopwords have high frequencies, they are likely to occupy most places in the word cloud. We therefore remove an extensive stopword list consisting of 571 common English words. Only single words (unigrams) are included in the cloud and stemming is applied. To create a word cloud all terms in the document are sorted by their probabilities and a fixed number of the 25 top ranked terms are kept. The top 10 documents retrieved by a language model run are concatenated and treated as one long document.
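
The ranking step described above can be sketched as follows: concatenate the top documents, drop stop words, and keep the terms with the highest relative frequency. The stop word set here is a tiny made-up list, stemming is left out, and `top_terms` is a hypothetical name, so this is an illustration rather than the exact procedure.

```python
from collections import Counter

# Tiny illustrative stop word list; a real run would use a full list.
STOP_WORDS = {'de', 'het', 'een', 'en', 'van', 'in'}

def top_terms(documents, n=25):
    """Rank the terms of the concatenated documents by relative frequency."""
    tokens = [word.lower() for doc in documents for word in doc.split()]
    tokens = [word for word in tokens if word not in STOP_WORDS]
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # P(term) = term count / number of remaining tokens.
    return [(term, count / total) for term, count in ranked[:n]]

# Made-up toy documents standing in for the top hits of a query.
docs = ['de stoomboot vertrekt', 'het stoomschip en de stoomboot']
for term, probability in top_terms(docs, n=2):
    print('%s %.2f' % (term, probability))
```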

In [15]:
from stop_words import get_stop_words
import snowballstemmer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import math

def result_page(query, hits, url, n, m):
    """
    Return the SERP information from a given list of
    search results and display a wordcloud of the top hits.
    n is the maximum number of words in the wordcloud,
    m the number of words in each description.
    """
    if hits:
        serp = []
        total_text = []
        for hit in hits[:10]:
            link = url + str(hit['_id'])
            serp.append((None,None))
            serp.append((None,"Jaar publicatie: %s, Onderwerp: %s" % (hit['_source']['year'],hit['_type'])))
            serp.append((None, "URL: <a href='%s'>%s</a>" % (link, link)))
            serp.append((None,hit['_source']['title']))
            text = hit['_source']['text'].split()
            total_text = total_text + text
            serp.append((None,"Beschrijving: " + extract_description(query,text,m)))
        # Build the wordcloud from the text of all top hits, not only the last one
        create_wordcloud(total_text, n)
    else:
        serp = [(None,'\n\n\n\nEr zijn helaas geen zoekresultaten.\n\n\n\n')]
    return serp

def create_wordcloud(text, n):
    """
    Display a wordcloud with at most n words, generated
    from the given text.
    """
    # Filter words to use for the wordcloud, by stemming and stop words removal
    stop_words = get_stop_words("dutch")
    stemmer = snowballstemmer.stemmer("dutch")
    text = [word for word in text if word.lower() not in stop_words]
    text = stemmer.stemWords(text)

    # Plot wordcloud
    wordcloud = WordCloud(max_font_size=40, background_color="white",
                          max_words = n).generate(" ".join(text))
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

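def extract_description(query, text, m):
    """
    Return a description of about m words for a hit. text is the
    list of words of the document. If a query word occurs in the
    text, the description is taken around it; otherwise the first
    m words are used.
    """
    query = query.split()
    # Position of the first occurrence of each query word in the text
    positions = sorted([text.index(word) for word in query 
                        if word in text ])
    
    # Ignore matches within the first 8 words; with m = 15 this keeps
    # the window start (mini - diff) below from becoming negative
    positions = [i for i in positions if i > 7]

    # If word(s) appeared in text, return the words around them
    if positions:
        description = position_sentences(positions, text, m)
    # If the word only occurred in the title, return the first m words
    else:
        description = ' '.join(text[0:m]) + '...'
    
    return description

def position_sentences(positions, text, m):
    """
    Return the words around the first match. The window runs up to
    the last match within m words of the first one and is padded on
    both sides to a total of about m words.
    """
    mini = positions[0]
    maxi = mini
    for i in positions[1:]:
        if i > mini and i <= mini + m:
            maxi = i
    diff = int(math.floor((m - (maxi - mini)) / 2))
    return '...'+' '.join(text[mini-diff:maxi+diff])+'...'
    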

# res = search(query, advanced=True)
# serp = result_page(query[1], res['hits']['hits'], "http://kranten.kb.nl/view/article/id/", 50, 15)
# results = formlayout.fedit(serp, title="Telegraaf results")

## Evaluation

* Manual relevance check
* P@10
* Change the ranking of the system + explain what is going on and why it is improving/decreasing

Evaluate your results. Let 2 persons assess the relevancy of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences in hard and easy queries. Describe clearly how you solved differences in judgements.
So, both provide the actual query, and a description of the information need that was behind the query.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.
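
For the per-query P@10 plot, the flat list of judgement pairs first has to be split per query. A minimal sketch, assuming every query contributes exactly ten `(judge1, judge2)` tuples in order; the function name and the example judgements are made up.

```python
def precision_at_10_per_query(evaluations):
    """
    Split a flat list of (judge1, judge2) tuples into queries of ten
    documents and return one (P@10 judge 1, P@10 judge 2) pair per query.
    """
    results = []
    for start in range(0, len(evaluations), 10):
        chunk = evaluations[start:start + 10]
        p_judge1 = sum(a for a, _ in chunk) / float(len(chunk))
        p_judge2 = sum(b for _, b in chunk) / float(len(chunk))
        results.append((p_judge1, p_judge2))
    return results

# Made-up judgements for two queries of ten documents each.
example = [(1, 1)] * 3 + [(1, 0)] * 2 + [(0, 0)] * 5 + [(0, 1)] * 1 + [(0, 0)] * 9
print(precision_at_10_per_query(example))
```

The resulting pairs can be drawn as a grouped bar chart with one group per query, which makes hard and easy queries visible at a glance.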


In [20]:
query_database()

In [37]:
from __future__ import division

def cohens_kappa(bins):
    '''
    Given bins made by bin_evaluations, returns cohen\'s kappa
    '''
    
    
    total = bins[0]+bins[1]+bins[2]+bins[3]
    po = (bins[0]+bins[3]) / total
    
    marginala = ((bins[3] + bins[2]) * (bins[3] + bins[1])) / total
    marginalb = ((bins[0] + bins[1]) * (bins[0] + bins[2])) / total
    pe = (marginala + marginalb) / total
    
    cohens_kappa = (po - pe) / (1 - pe)
    return cohens_kappa
    
def precision(bins, agreement_necessary):
    '''
    Given bins made by bin_evaluations, returns precision
    If second argument is True, document is only seen as
    relevant if both judges think so; otherwise one of the 
    judges is enough
    '''
    if agreement_necessary:
        correct = bins[3]
    else:
        correct = bins[3] + bins[2] + bins[1]
    total = bins[0]+bins[1]+bins[2]+bins[3]
    return correct/total
    
    
def bin_evaluations(evaluations):
    '''
    Given a list of tuples of binary values which contain
    relevance judgements by two judges in which 1 means 
    relevant, and 0 means non-relevant, returns a list
    in which the 4 elements represent the amount of times
    any combination has occurred.
    '''
    
    
    # bins are 00, 01, 10, 11 in that order
    bins = [0,0,0,0]
    for evaluation in evaluations:
        b = 0
        if evaluation[0]:
            b += 2
        if evaluation[1]:
            b += 1
        bins[b] += 1
    return bins
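
As a sanity check on the kappa computation, the sketch below recomputes Cohen's kappa directly from judgement pairs on a small made-up example. It is a standalone helper with the hypothetical name `kappa_from_pairs`, not part of the notebook's pipeline, and it returns NaN when the expected agreement is 1, a case in which kappa is undefined.

```python
def kappa_from_pairs(pairs):
    """Compute Cohen's kappa from (judge1, judge2) binary judgements."""
    total = float(len(pairs))
    # Observed agreement: fraction of documents with identical judgements.
    agree = sum(1 for a, b in pairs if a == b) / total
    p1 = sum(a for a, _ in pairs) / total   # judge 1 says relevant
    p2 = sum(b for _, b in pairs) / total   # judge 2 says relevant
    # Agreement expected by chance, from the marginals of both judges.
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    if expected == 1:
        # Both judges always give the same single label; kappa is undefined.
        return float('nan')
    return (agree - expected) / (1 - expected)

# Made-up judgements: 4 x both relevant, 1 x only judge 1,
# 1 x only judge 2, 4 x neither.
pairs = [(1, 1)] * 4 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 4
print(round(kappa_from_pairs(pairs), 2))
```

Here the observed agreement is 0.8 and the chance agreement is 0.5, which gives a kappa of 0.6.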

In [44]:
def evaluate_search_results(number_of_queries):
    '''
    Asks the user to submit a query using query_database() some amount 
    of times after which two judges can give their relevancy 
    assessments. Returns evaluations in a format that can be used for 
    bin_evaluations.
    '''

    print('After the query is resolved, you will be asked to rate the documents on relevancy.')
    print('To do this, enter the numbers of the relevant results separated by whitespace.')
    print('For example, if only the first and third documents were relevant, enter \'1 3\' without the quotes.')
    print('If more than ten documents are returned, ignore all but the first 10.\n')
    evaluations = []
    for _ in range(number_of_queries):
        query_database()
        print('Indicate which documents were relevant:')
        judge1_input = raw_input('Judge 1 -->').split()
        judge2_input = raw_input('Judge 2 -->').split()
        for i in range(1, 11):
            if str(i) in judge1_input:
                assessment1 = 1
            else:
                assessment1 = 0
            if str(i) in judge2_input:
                assessment2 = 1
            else:
                assessment2 = 0
            evaluations.append((assessment1, assessment2))
    
    bins = bin_evaluations(evaluations)
    print('\n The average P@10 if we require agreement for correctness: ' + str(precision(bins, 1)))
    print("The average P@10 if we require a single \'relevant\' assessment for correctness: " + str(precision(bins, 0)))
    print("Cohen\'s Kappa was: " +str(cohens_kappa(bins)))
    return evaluations

In [47]:
evaluations = evaluate_search_results(5)

After the query is resolved, you will be asked to rate the documents on relevancy.
To do this, enter the numbers of the relevant results separated by whitespace.
For example, if only the first and third documents were relevant, enter '1 3' without the quotes.
If more than ten documents are returned, ignore all but the first 10.

Indicate which documents were relevant:
Judge 1 -->1
Judge 2 -->1
Indicate which documents were relevant:
Judge 1 -->2
Judge 2 -->2

 The average P@10 if we require agreement for correctness: 0.1
The average P@10 if we require a single 'relevant' assessment for correctness: 0.1
Cohen's Kappa was: 1.0
