# ElasticSearch

In this notebook we have created a search engine using ElasticSearch. 

For our own reference:
* Literature: <https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html>
* Telegraaf XML documents: http://data.politicalmashup.nl/arjan/telegraaf/



Create a search engine for the Telegraaf newspaper collection using e.g. ElasticSearch. Make facets for years and document types. Pay attention to telephone numbers (in mini advertisements). Below is an example of one document (= one short article).
You can see that it even contains a link to the source text (as an image). The URL redirects to the new URL http://www.delpher.nl/nl/kranten/view?identifier=ddd%3A010563762%3Ampeg21%3Aa0005&coll=ddd ElasticSearch takes JSON as input, so converting the documents to JSON is trivial.

Each of the following points must be addressed. Create a separate page on the wiki for each point. Make sure these pages can be found from the menu of your wiki. Explain what you did, and exemplify with links to screenshots/a working system.

* Search as we know it from Google. Give a result page (SERP), with links to the documents and some description of each hit.
* Advanced search. Let a user search in several fields, including several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)
* Do one of the following:
    1. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion).
    2. Represent each document (a kamervraag) with a word-cloud. Also make word-clouds for the question and for the answer. EXAMPLE: The html files in http://data.politicalmashup.nl/arjan/odeii/data_as_html/ contain such wordcloud summaries, which work rather well.   

You can use several techniques to get rid of high frequency, but meaningless words: of course IDF, but also mutual information (see 13.5.1), or of course the technique from the paper by Kaptein et al on wordclouds.

* Give next to a traditional list of results, a timeline in which you indicate how many hits there are over time.
* Give next to the traditional list of results, a table with the number of hits for each political party. Link the party names, which should result in only selecting the hits submitted ("ingediend") by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
* Evaluate your results. Let 2 persons assess the relevance of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences between hard and easy queries. Describe clearly how you resolved differences in judgements.
Create your queries in the following format:

    <topic number="6">
      <query>kcs</query>
      <description>Find information on the Kansas City Southern railroad.
      </description>
    </topic>

    <topic number="16">
      <query>arizona game and fish</query>
      <description>I'm looking for information about fishing and hunting
      in Arizona.
      </description>
    </topic>

So, provide both the actual query and a description of the information need behind it.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.

* Change the ranking of your system, compute the average precision at 10 using your 10 queries, compare the results to your old system, and EXPLAIN what is going on.


# The Search Engine

Before starting ES, run:

    export ES_HEAP_SIZE=Half_RAM

where Half_RAM is half of your machine's RAM (for example 4g on an 8 GB machine).

In addition, add the following line to config/elasticsearch.yml:

    indices.memory.index_buffer_size: 50%

(We still need to check whether this makes a difference.)
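
As a rough illustration of the "half your RAM" rule above, the sketch below derives a value for ES_HEAP_SIZE from the amount of physical memory. The helper names `half_ram_heap` and `physical_ram_bytes` are made up for this example, and the cap of about 31 GB is an assumption based on the common advice to keep the JVM heap below roughly 32 GB; it is not a value from our setup. The `os.sysconf` lookup is only expected to work on Linux-like systems.

```python
import os

def half_ram_heap(total_bytes, cap_mb=31 * 1024):
    """Return half of total_bytes as an ES_HEAP_SIZE string in megabytes.

    cap_mb is an assumed upper bound (about 31 GB), not taken from the notebook.
    """
    half_mb = total_bytes // (2 * 1024 * 1024)
    return '%dm' % min(half_mb, cap_mb)

def physical_ram_bytes():
    """Total physical memory; relies on sysconf names available on Linux."""
    return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')

# Example for a machine with 8 GB of RAM
heap_8gb = half_ram_heap(8 * 1024 ** 3)
```

The resulting string can then be used as `export ES_HEAP_SIZE=4096m` before starting the node.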

To start the Elasticsearch service, run the following on the command line:

    ./elasticsearch-2.4.1/bin/elasticsearch --node.name telegraaf

## Initiate connection to the Elastic Search engine

In [1]:
import sys
import json
from elasticsearch import Elasticsearch, helpers

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

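# Note: the client connects lazily, so this cell running does not prove the node is up.
# es.ping() returns True when the node is reachable.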

## Generator to read Telegraaf XML and add them to the ES database

A generator lets us feed the parsed XML documents to the ES database one at a time, without first building a complete list of actions in memory.

* Remove high frequency, but meaningless words

    Possibly also create an inverted index at this point? 

In [2]:
from bs4 import BeautifulSoup
from os import listdir
from os.path import isfile, join

def read(doc):
    '''
    Return a generator of (date, subject (type), title, text, identifier)
    tuples, one per item in the given parsed XML document (one year).
    Note: zip() pairs the tags by position, so this assumes that every
    item contains each of the five tags exactly once.
    '''
    for date, subject, title, text, identifier in zip(
            doc.find_all('date'), doc.find_all('subject'),
            doc.find_all('title'), doc.find_all('text'),
            doc.find_all('identifier')):
        yield (date.text, subject.text, title.text, text.text, identifier.text)

# Paths of the XML files in ./Telegraaf.
# isfile() needs the full path, not just the file name.
documents = [join('./Telegraaf', name) for name in listdir('./Telegraaf')
             if isfile(join('./Telegraaf', name))]
soup_documents = [BeautifulSoup(open(path, "r"), "xml") for path in documents]

# Create the generator for the bulk importer
# I'm not sure if it's a good idea to use the subject (article, advertisement, or more...) as _type here
# The score calculation for the Elastic Search database uses whole-index statistics. 
# If you're searching a subsection this will alter the scores! WE WILL NEED TO KEEP THIS IN MIND.
#k = ({'_type':subject, '_index':'telegraaf','_source':{'year':date[:4], 'date':date[5:], 'title':title, 'text':text}} 
#    for year in documents for (date,subject,title,text) in read(year))


## Populate ES database

In [14]:
# List of all indices
! curl 'localhost:9200/_cat/indices?v'

health status index     pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   telegraaf   5   1     157546            0    345.8mb        345.8mb 
yellow open   megacorp    5   1          0            0       800b           800b 


In [3]:
# delete any pre-existing index
es.indices.delete(index='telegraaf', ignore=[404,400])

{u'acknowledged': True}

In [4]:
# Create the telegraaf index in our telegraaf node
es.indices.create(index='telegraaf', ignore=400)

{u'acknowledged': True}

In [11]:
# turn refresh off to speed up bulk import
es.indices.put_settings(index='telegraaf',
                        body={"index": {"refresh_interval": "-1"}})

{u'acknowledged': True}

In [5]:
import time

# Import the information into the database.
# A generator can only be consumed once, so the action generator is built inside each function.
print "Test with chunk size = 500 and max_chunk_bytes = 15728640 "
# helpers.parallel_bulk sends the chunks to ES from several threads.

def bulk_all(documents):
    '''Index the first five parsed documents with one generator (ES assigns the ids).'''
    start = time.time()
    print "Starting time:", start

    k = ({'_type': subject, '_index': 'telegraaf',
          '_source': {'year': date[:4], 'date': date[5:], 'title': title, 'text': text}}
         for doc in documents[:5]
         for (date, subject, title, text, identifier) in read(doc))
    # parallel_bulk is lazy: iterating over it is what actually sends the chunks.
    for ok, info in helpers.parallel_bulk(es, k, chunk_size=500, max_chunk_bytes=15728640):
        pass
    print "Finished", (time.time() - start)

def bulk_per_doc(documents):
    '''Index the first five parsed documents one by one and time each of them.'''
    start = time.time()
    print "Starting time:", start

    for doc in documents[:5]:
        start_doc = time.time()
        k = ({'_type': subject, '_id': identifier, '_index': 'telegraaf',
              '_source': {'year': date[:4], 'date': date[5:], 'title': title, 'text': text}}
             for (date, subject, title, text, identifier) in read(doc))
        for ok, info in helpers.parallel_bulk(es, k, chunk_size=500, max_chunk_bytes=15728640):
            pass
        print "Finished", (time.time() - start_doc)
    # Total time is reported once, after all documents are done.
    print "Done:", time.time() - start

bulk_per_doc(soup_documents)

Test with chunk size = 500 and max_chunk_bytes = 15728640 
Starting time: 1477062060.45
Finished 0.307443857193
Finished 52.7827241421
Finished 3.96766901016
Finished 15.3635549545
Finished 2.05416107178
Done: 74.4765508175


In [6]:
# Set the refresh rate back to default
es.indices.put_settings(index='telegraaf',
                        body={"index": {"refresh_interval": "1s"}})

{u'acknowledged': True}

In [9]:
# Ideas for speed/performance improvements:
# - Set bootstrap.mlockall: true in the config file
#   (make sure ES_HEAP_SIZE is large enough)
# - Parse the whole document at once with xml.etree.cElementTree.parse()
# - Stream the XML document instead, e.g. with xml.sax or ElementTree.iterparse:

# import xml.etree.ElementTree as etree
# for event, elem in etree.iterparse(xml_file, events=('start', 'end', 'start-ns', 'end-ns')):
#     print event, elem
# http://boscoh.com/programming/reading-xml-serially.html
# Event handlers 
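
The streaming idea from the notes above could be sketched as follows. This is only an illustration, not what the notebook currently does: the function name `stream_records`, the record element name passed as `record_tag`, and the sample XML are all assumptions, since the real structure of the Telegraaf files is not shown here. Unlike pairing tags by position, it collects the fields per record, so a missing title does not shift the other fields.

```python
import io
import xml.etree.ElementTree as etree

FIELDS = ('date', 'subject', 'title', 'text', 'identifier')

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]

def stream_records(source, record_tag):
    """Yield one dict per record element while parsing the XML incrementally.

    record_tag is the (assumed) name of the element that wraps one article.
    """
    for event, elem in etree.iterparse(source, events=('end',)):
        if local_name(elem.tag) != record_tag:
            continue
        record = dict.fromkeys(FIELDS, u'')
        for child in elem.iter():
            name = local_name(child.tag)
            if name in record and not record[name]:
                record[name] = u''.join(child.itertext()).strip()
        yield record
        elem.clear()  # free the memory used by this record

# Made-up sample with two records; the second one has no title.
SAMPLE = b"""<?xml version="1.0"?>
<collection xmlns:dc="http://purl.org/dc/elements/1.1/">
  <article>
    <dc:date>1923-03-01</dc:date>
    <dc:subject>artikel</dc:subject>
    <dc:title>Storm</dc:title>
    <dc:identifier>id-1</dc:identifier>
    <text>Een stoomschip</text>
  </article>
  <article>
    <dc:date>1923-03-02</dc:date>
    <dc:subject>advertentie</dc:subject>
    <dc:identifier>id-2</dc:identifier>
    <text>Tel. 12345</text>
  </article>
</collection>"""

records = list(stream_records(io.BytesIO(SAMPLE), record_tag='article'))
```

Each yielded dict could be turned into a bulk action in the same way as the tuples from `read`.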

## Query system 

* Normalise query
* Get right tokens from the query. Use patterns to split up the query in parts? 
* Put them in the right representation for ES search
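
The steps above could look roughly like this. It is a minimal sketch under assumptions of our own: the function name `normalise_query` is hypothetical, a bare four-digit token is treated as a year, and any token with five or more digits (dashes allowed) is treated as a telephone number. These rules are guesses and would need checking against the mini advertisements in the collection.

```python
def normalise_query(raw):
    """Split a raw query into lower-cased words, years and phone numbers.

    The classification rules are assumptions:
    - exactly four digits        -> year
    - five or more digits/dashes -> phone number (dashes removed)
    - everything else            -> ordinary word
    """
    words, years, phones = [], [], []
    for token in raw.lower().split():
        token = token.strip('.,;:!?()"\'')
        if not token:
            continue
        digits = token.replace('-', '')
        if digits.isdigit() and len(digits) == 4 and '-' not in token:
            years.append(token)
        elif digits.isdigit() and len(digits) >= 5:
            phones.append(digits)
        else:
            words.append(token)
    return {'words': words, 'years': years, 'phones': phones}

parsed = normalise_query('Piano te koop, tel. 12-345 1922')
```

The `words` part can go into the `multi_match` query, while `years` would fit a filter on the `year` field.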

In [7]:
def search(query, advanced=False):
    '''
    Given a query, print and return the ES response (top 10 hits, ranked by score).
    For an advanced search, query is a list: [document type, search terms, year].
    '''
    # The query context is for how well the document fits the query.
    # The filter context is a boolean context. Does it match or not.
    # example: Does this timestamp fall into the range 2015 to 2016?
    #
    # With cross_fields and the 'and' operator, all terms must be present
    # in at least one field for a document to match.
    match = {"multi_match":
                {"query": query[1] if advanced else query,
                 "type": "cross_fields",
                 "fields": ['title', 'text'],
                 "operator": 'and'
                }
            }
    if advanced:
        q = {"query":
                {"filtered":
                    {"query": match,
                     "filter":
                        {"bool":
                            {"must": [{"term": {"year": query[2]}},
                                      {"term": {"_type": query[0]}}]
                            }
                        }
                    }
                }
            }
    else:
        # The outer 'query': is necessary to show that this is the query.
        q = {"query": match}
    # filter_path can help reduce the amount of data that is returned by es.search
    res = es.search(index='telegraaf', size=10, body=q)
    print res
    # Return the response as well, so callers can do res = search(...)
    return res
    

In [439]:
search('stoomschip engelsche')

{u'hits': {u'hits': [{u'_score': 1.2368431, u'_type': u'artikel', u'_id': u'AVfio6_4YswG1S9go2MY', u'_source': {u'date': u'03-01', u'text': u't', u'year': u'1923', u'title': u'HEVIGE STORM AAN DE ENGELSCHE KUST. Een stoomschip door de hemanning verlaten.'}, u'_index': u'telegraaf'}], u'total': 1, u'max_score': 1.2368431}, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}, u'took': 2, u'timed_out': False}


## Result page function

* Take query output and use score to order result on a Search Engine Result Page (SERP).
* Return title, link, and description of each hit

-> The description can be a word cloud of 20-25 most informative words. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion). 
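
One possible way to pick the informative words is plain TF-IDF over the texts of the hits. The sketch below is only that: it is not the method from the Kaptein et al. paper and not mutual information. The function names are hypothetical, and `doc_freq` (word to number of documents containing it) and `n_docs` (collection size) are assumed to have been computed beforehand; the numbers in the example are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep runs of letters only."""
    return re.findall(r'[^\W\d_]+', text.lower(), re.UNICODE)

def informative_words(hit_texts, doc_freq, n_docs, top_n=25, min_len=3):
    """Rank the words in the hits by term frequency times inverse document frequency."""
    tf = Counter(w for text in hit_texts
                 for w in tokenize(text) if len(w) >= min_len)
    scores = {}
    for word, count in tf.items():
        idf = math.log(float(n_docs) / (1 + doc_freq.get(word, 0)))
        scores[word] = count * idf
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

# Made-up example data
hits = ['De storm aan de kust', 'Storm en stoomschip', 'het stoomschip in de storm']
doc_freq = {'de': 900, 'het': 800, 'en': 900, 'aan': 500,
            'storm': 20, 'stoomschip': 5, 'kust': 30}
cloud_words = informative_words(hits, doc_freq, n_docs=1000, top_n=3)
```

Frequent function words end up with a score near zero, so they drop out of the top of the list without needing a separate stop word list.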


Additions
* A timeline with the number of hits over time
* A table with the number of hits for each political party. Link the party names, which should result in only selecting the hits submitted ("ingediend") by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
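
Both additions can probably be served by terms aggregations that are sent along with the query. The sketch below only builds the request body and reads the buckets from a response; the helper names and aggregation names are hypothetical. It assumes the index layout used in this notebook (the `year` field, and dc:subject stored as `_type`) and the `_term` ordering key of the ES 2.x series.

```python
def with_facets(query_body, max_buckets=50):
    """Return a copy of query_body extended with year and document-type aggregations."""
    body = dict(query_body)
    body['aggs'] = {
        # Timeline: number of hits per year, in chronological order
        'hits_per_year': {'terms': {'field': 'year', 'size': max_buckets,
                                    'order': {'_term': 'asc'}}},
        # Facet table: number of hits per document type
        'hits_per_type': {'terms': {'field': '_type', 'size': max_buckets}},
    }
    return body

def buckets_to_table(response, agg_name):
    """Turn the buckets of one terms aggregation into (key, count) pairs."""
    buckets = response.get('aggregations', {}).get(agg_name, {}).get('buckets', [])
    return [(b['key'], b['doc_count']) for b in buckets]

base = {'query': {'match_all': {}}}
faceted = with_facets(base)

# Made-up response fragment in the shape ES returns for terms aggregations
fake_response = {'aggregations': {'hits_per_year': {'buckets': [
    {'key': '1922', 'doc_count': 12}, {'key': '1923', 'doc_count': 7}]}}}
timeline = buckets_to_table(fake_response, 'hits_per_year')
```

Clicking a facet value would then re-run the query with a `term` filter on that value.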

## Advanced Search

The query system will have to be changed to implement this.

* Make multiple fields searchable:
    * Title 
    * Text
    * Year?
    
Let a user search in several fields, including several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)
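
The `search` function above uses a `filtered` query, which as far as we know is deprecated in the ES 2.x series in favour of a `bool` query with a `filter` clause. A minimal sketch of that alternative is given below. The function name `build_advanced_query` is hypothetical, and making the document type and year optional is our own choice, not something the form currently supports.

```python
def build_advanced_query(terms, doc_type=None, year=None):
    """Build a search body: scored multi_match plus optional unscored filters."""
    match = {'multi_match': {'query': terms,
                             'type': 'cross_fields',
                             'fields': ['title', 'text'],
                             'operator': 'and'}}
    filters = []
    if doc_type:
        filters.append({'term': {'_type': doc_type}})
    if year:
        filters.append({'term': {'year': year}})
    bool_query = {'must': match}
    if filters:
        # Filter clauses restrict the hits but do not influence the score.
        bool_query['filter'] = filters
    return {'query': {'bool': bool_query}}

advanced_body = build_advanced_query('mandaten', doc_type='artikel', year='1922')
simple_body = build_advanced_query('stoomschip engelsche')
```

The body can be passed to `es.search(index='telegraaf', size=10, body=...)` in the same way as before.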

In [8]:
unique_years = list(set( document.date.get_text()[:4]
                    for document in soup_documents ))

unique_doc_types = list(set( subject.get_text()
                       for document in soup_documents
                       for subject in document.find_all('subject')))

In [9]:
from formlayout import fedit, FormDialog

query = fedit([('Document type',[0]+unique_doc_types),
               ('Zoektermen',''),
               ('Jaar publicatie',[0]+unique_years)], 
               title="Telegraaf zoekmachine", 
               comment="Wat voor krantenartikel zoek je?")

print query

query[0] = unique_doc_types[query[0]]
query[2] = unique_years[query[2]]

print query

[0, u'mandaten', 1]
[u'artikel', u'mandaten', u'1922']


In [10]:
res = search(query, advanced=True)

{u'hits': {u'hits': [{u'_score': 2.9305606, u'_type': u'artikel', u'_id': u'ddd:010563557:mpeg21:p001:a0002', u'_source': {u'date': u'11-29', u'text': u'i', u'year': u'1922', u'title': u'Sovjet-Rusland en de mandaten over Syri\xeb, Palestina en Mesopotami\xeb.'}, u'_index': u'telegraaf'}, {u'_score': 1.3677711, u'_type': u'artikel', u'_id': u'ddd:010563560:mpeg21:p006:a0159', u'_source': {u'date': u'11-30', u'text': u"DANZIG. ?.<) Nov. \u2014 Volgens een bericM aft ' Kowno hebben de Poolschen en Joodsche frac- ' \u25a0ties in den Landdag van Kaamt) geprotearteerd de onrechtvaardige verdeeling \xbb ! mandaten der waarna zij te \u2022 ! zutiea de zittlngza i", u'year': u'1922', u'title': u'De Landdag van Kowno.'}, u'_index': u'telegraaf'}, {u'_score': 0.9130335, u'_type': u'artikel', u'_id': u'ddd:010563575:mpeg21:p002:a0037', u'_source': {u'date': u'12-09', u'text': u"Het plan van do\xab ij ka-hen bondskanselier S tn den N Raad spoed:. we verkie.- te doen all afgr-vaardiedta, rsa \u25a0

## Result page function

* Take query output and use score to order result on a Search Engine Result Page (SERP).
* Return title, link, and description of each hit
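
A first version of such a result page could be built directly from the response dictionary, which ES already returns sorted by score. The sketch below is an assumption-heavy illustration: `format_serp` is a hypothetical name, the description is simply the first characters of the text, and the link reuses the Delpher URL pattern from the assignment text. Note that the example URL there has no page segment (such as `p001`) in its identifier, so the links may need adjusting.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# URL pattern copied from the assignment text; whether our ids fit it is unverified.
DELPHER_URL = 'http://www.delpher.nl/nl/kranten/view?identifier=%s&coll=ddd'

def format_serp(response, snippet_len=160):
    """Turn an ES response into a list of rows with rank, title, link and description."""
    rows = []
    for rank, hit in enumerate(response['hits']['hits'], 1):
        source = hit['_source']
        text = source.get('text', '')
        snippet = text[:snippet_len] + ('...' if len(text) > snippet_len else '')
        rows.append({'rank': rank,
                     'score': hit['_score'],
                     'title': source.get('title', ''),
                     'link': DELPHER_URL % quote(hit['_id'], safe=''),
                     'description': snippet})
    return rows

# Made-up response in the shape returned by es.search
fake_response = {'hits': {'hits': [
    {'_score': 2.93, '_id': 'ddd:010563557:mpeg21:p001:a0002',
     '_source': {'title': 'Sovjet-Rusland en de mandaten', 'text': 'x' * 200}}]}}
serp = format_serp(fake_response)
```

The rows can then be rendered as HTML or printed as plain text in the notebook.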

## Evaluation

* Manual relevance check
* P@10
* Change the ranking of the system + explain what is going on and why it is improving/decreasing
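
One simple way to change the ranking would be to weight matches in the title more heavily than matches in the text, using the `field^boost` notation of `multi_match`. The sketch below only builds the query body; the function name `build_ranked_query` and the boost value of 3 are arbitrary assumptions, and whether this helps P@10 is exactly what the evaluation should show.

```python
def build_ranked_query(terms, title_boost=1.0, match_type='cross_fields'):
    """Build a multi_match body in which the title field gets a configurable boost."""
    fields = ['title^%g' % title_boost, 'text']
    return {'query': {'multi_match': {'query': terms,
                                      'type': match_type,
                                      'fields': fields,
                                      'operator': 'and'}}}

baseline_body = build_ranked_query('stoomschip engelsche')
boosted_body = build_ranked_query('stoomschip engelsche', title_boost=3)
```

Running the same topics through both bodies gives two result lists per query that can be judged and compared.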

Evaluate your results. Let 2 persons assess the relevance of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences between hard and easy queries. Describe clearly how you resolved differences in judgements.
For each topic, provide both the actual query and a description of the information need behind it.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.
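
The measures mentioned above can be computed with a few lines of code. The sketch below makes some assumptions: judgements are binary (1 = relevant, 0 = not relevant) and listed in rank order, P@10 divides by 10 even when fewer results were returned, and the expected agreement for kappa uses each judge's own label proportions. The function names and the example judgements are made up.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top k results judged relevant (missing results count as 0)."""
    return sum(judgements[:k]) / float(k)

def mean(values):
    """Arithmetic mean, e.g. of the P@10 values over all queries."""
    return sum(values) / float(len(values))

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError('both judges must label the same, non-empty set of documents')
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / float(n)
    expected = sum((labels_a.count(c) / float(n)) * (labels_b.count(c) / float(n))
                   for c in set(labels_a) | set(labels_b))
    if expected == 1.0:
        return 1.0  # both judges used one and the same label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up judgements of two judges for the top 10 of one query
judge_a = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
judge_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
kappa = cohen_kappa(judge_a, judge_b)
p10_a = precision_at_k(judge_a)
```

For the overall kappa, the judgements of all queries would be concatenated into one list per judge before calling `cohen_kappa`.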
