# ElasticSearch

In this notebook we have created a search engine using ElasticSearch. 

For our own reference:
* Literature: <https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html>
* Telegraaf XML documents: http://data.politicalmashup.nl/arjan/telegraaf/



Create a search engine for the Telegraaf newspaper collection using e.g. ElasticSearch. Make facets for years and document types. Pay attention to telephone numbers (in mini advertisements). Below is an example of one document (= one short article).
You can see that it even contains a link to the source text (as an image). The URL redirects to the new URL http://www.delpher.nl/nl/kranten/view?identifier=ddd%3A010563762%3Ampeg21%3Aa0005&coll=ddd ElasticSearch uses a JSON format as input, so converting the documents to JSON is trivial.

Each of the following points must be addressed. Create a separate page on the wiki for each point. Make sure these pages can be found from the menu of your wiki. Explain what you did, and exemplify with links to screenshots/a working system.

* Search as we know it from Google. Give a result page (SERP), with links to the documents and some description of each hit.
* Advanced search. Let a user search in several fields, also in several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)
* Do one of the following:
    1. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion).
    2. Represent each document (a kamervraag) with a word-cloud. Also make word-clouds for the question and for the answer. EXAMPLE: The html files in http://data.politicalmashup.nl/arjan/odeii/data_as_html/ contain such wordcloud summaries, which work rather well.   

You can use several techniques to get rid of high frequency, but meaningless words: of course IDF, but also mutual information (see 13.5.1), or of course the technique from the paper by Kaptein et al on wordclouds.

* Give next to a traditional list of results, a timeline in which you indicate how many hits there are over time.
* Give next to the traditional list of results, a table with the number of hits for each political party. Link the party names, which should result in only selecting the hits "ingediend" (submitted) by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
* Evaluate your results. Let 2 persons assess the relevance of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences in hard and easy queries. Describe clearly how you resolved differences in judgements. 
Create your queries in the following format:

        <topic number="6">
          <query>kcs</query>
          <description>Find information on the Kansas City Southern railroad.
          </description>
        </topic>

        <topic number="16">
          <query>arizona game and fish</query>
          <description>I'm looking for information about fishing and hunting
          in Arizona.
          </description>
        </topic>
                

So, both provide the actual query, and a description of the information need that was behind the query.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.

* Change the ranking of your system, compute the average precision at 10 using your 10 queries, compare the results to your old system, and EXPLAIN what is going on.


# The Search Engine

To start the Elasticsearch service, run the following command on the command line:

    ./elasticsearch-2.4.1/bin/elasticsearch --node.name telegraaf

## Initiate connection to the Elastic Search engine

In [101]:
import sys
import json
from elasticsearch import Elasticsearch, helpers

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

# Note: creating the client does not contact the server yet.
# es.ping() returns True once the node is actually reachable.

## Generator to read Telegraaf XML and add them to the ES database

A generator makes it possible to feed the XML documents into the ES database directly, without first building a full list in memory.

* Remove high frequency, but meaningless words

You can use several techniques to get rid of high frequency, but meaningless words: of course IDF, but also mutual information (see 13.5.1), or of course the technique from the paper by Kaptein et al on wordclouds.
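
As a rough illustration of the IDF option mentioned above, the sketch below scores the words in a set of hits by term frequency times a smoothed inverse document frequency over the whole collection. This is plain TF-IDF, not the method from Kaptein et al.; the function names, the `+1` smoothing and the toy texts are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower(), flags=re.UNICODE)

def informative_words(hit_texts, collection_texts, top_n=25):
    """Rank the words in the hits by tf * idf, highest score first."""
    n_docs = len(collection_texts)
    # Document frequency: in how many collection documents does a word occur?
    doc_freq = Counter()
    for text in collection_texts:
        doc_freq.update(set(tokenize(text)))
    # Term frequency: how often does a word occur in the hits?
    term_freq = Counter()
    for text in hit_texts:
        term_freq.update(tokenize(text))
    scores = {}
    for term, tf in term_freq.items():
        # Smoothed IDF (an assumption): words that occur in every document
        # get a negative score and therefore end up at the bottom.
        idf = math.log(float(n_docs) / (1 + doc_freq[term]))
        scores[term] = tf * idf
    # Sort by descending score, alphabetically on ties
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_n]

# Toy collection (hypothetical texts)
collection = ['de storm aan de kust',
              'de haven van amsterdam',
              'de storm en het stoomschip',
              'de prijs van de boter']
hits = [collection[0], collection[2]]
cloud_words = informative_words(hits, collection, top_n=6)
```

In this toy example the word `de` occurs in every document, so it drops out of the top six, while `storm` and `stoomschip` stay in.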

    Possibly also create an inverted index at this point? 

In [359]:
from bs4 import BeautifulSoup
import sys

def read(year):
    '''
    Return a generator of (date, subject, title, text) tuples, one for each
    item in the given year file. The subject is the document type.
    '''
    # Possibly we want to open the files online? That would take up less space on our computers.
    with open(year, 'r') as xml_file:
        s = BeautifulSoup(xml_file, "xml")
    # This is possibly too crude: zip() pairs the n-th date with the n-th title,
    # so a single document without a title shifts all later titles.
    # Iterating per document element would be safer, but slightly less efficient.
    for date, subject, title, text in zip(s.find_all('date'), s.find_all('subject'), s.find_all('title'), s.find_all('text')):
        yield (date.text, subject.text, title.text, text.text)

documents = ['Telegraaf/telegraaf-1923.xml']
# Create the generator for the bulk importer
# I'm not sure if it's a good idea to use the subject as _type here (it is article or advertisement, or more...)
# The score calculation in Elasticsearch uses whole-index statistics.
# If you search only a subsection, this will alter the scores! WE WILL NEED TO KEEP THIS IN MIND.
k = ({'_type':subject, '_index':'telegraaf','year':date[:4], 'date':date[5:], 'title':title, 'text':text} 
    for year in documents for (date,subject,title,text) in read(year))
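
The concern about `zip()` being too crude could be addressed by reading all fields from the same document element, so that a missing title cannot shift later titles. The sketch below does this with the standard library on a small inline sample. The sample XML (element names `pm:collection` and `pm:root`, and the position of `title` and `text`) is hypothetical and only mimics the layout seen in the parsed output further down; `read_documents` is an illustrative name.

```python
import json
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the parsed Telegraaf output
NS = {'pm': 'http://www.politicalmashup.nl',
      'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical sample: the second document has no title
SAMPLE = '''<pm:collection xmlns:pm="http://www.politicalmashup.nl"
                          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <pm:root>
    <pm:meta>
      <dc:subject>artikel</dc:subject>
      <dc:date>1923-03-01</dc:date>
    </pm:meta>
    <pm:content>
      <title>HEVIGE STORM AAN DE ENGELSCHE KUST.</title>
      <text><p>Een stoomschip verlaten.</p></text>
    </pm:content>
  </pm:root>
  <pm:root>
    <pm:meta>
      <dc:subject>advertentie</dc:subject>
      <dc:date>1923-03-02</dc:date>
    </pm:meta>
    <pm:content>
      <text><p>Tel. 12345</p></text>
    </pm:content>
  </pm:root>
</pm:collection>'''

def read_documents(xml_string):
    """Yield one JSON-ready dict per document element."""
    collection = ET.fromstring(xml_string)
    for doc in collection.iter('{%s}root' % NS['pm']):
        # Every field is looked up inside this one document
        date = doc.findtext('pm:meta/dc:date', default='', namespaces=NS)
        subject = doc.findtext('pm:meta/dc:subject', default='', namespaces=NS)
        title = doc.findtext('pm:content/title', default='', namespaces=NS)
        text_element = doc.find('pm:content/text', NS)
        text = ''
        if text_element is not None:
            # Join the text of all nested paragraphs
            text = ''.join(text_element.itertext()).strip()
        yield {'year': date[:4], 'date': date[5:], 'subject': subject,
               'title': title, 'text': text}

docs = list(read_documents(SAMPLE))
as_json = json.dumps(docs[0])
```

The advertisement without a title simply gets an empty `title` field, and each dict can be serialised to JSON as-is.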


You can skip the section below. It was an experiment with importing the data into our local database directly from the web server, but it is very slow because the server responds slowly.

In [None]:
# Experiment: read the gzipped files from the server and feed them into the database without downloading them first.
# However, the server responds very slowly, so the requests take a long time.
# Furthermore, every file has to be decompressed on the fly, which takes up a lot of memory.
import requests
from StringIO import StringIO
import gzip
from urllib import urlopen
from bs4 import BeautifulSoup

def get_zips(main_url):
    # Collect the links to all telegraaf archives on the index page
    html = BeautifulSoup(requests.get(main_url).text,'html')
    return [link.get('href') for link in html.find_all('a',href=True) if 'telegraaf' in link.get('href')]
    
def open_zip(main_url, url):
    # Download one archive into memory and return the decompressed content
    zipfile = gzip.GzipFile(fileobj=StringIO(urlopen(main_url+url).read()))
    return zipfile.read()
    
main_url = 'http://data.politicalmashup.nl/arjan/telegraaf/'
zips = get_zips(main_url)

# File name without the .gz extension
print zips[0][:-3]
# First 100 bytes of the first file
print open_zip(main_url,zips[0])[:100]


## Populate ES database

In [386]:
# List of all indices
! curl 'localhost:9200/_cat/indices?v'

health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   .marvel-es-data-1         1   1          2            0      4.1kb          4.1kb 
yellow open   telegraaf                 5   1        228            0    369.4kb        369.4kb 
yellow open   .marvel-es-1-2016.10.20   1   1       8025           98      2.6mb          2.6mb 


In [358]:
# Create the telegraaf index in our telegraaf node
es.indices.create(index='telegraaf', ignore=400)

{u'acknowledged': True}

In [360]:
# Import the information into the database
# The generator can only be used once. So this code will only work once. 
helpers.bulk(es,k)

(228, [])
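
The remark that the generator can only be used once can be illustrated without a running Elasticsearch node. The sketch below wraps the generator expression in a small factory function (`make_actions`, a hypothetical name) so that a fresh generator is available for every import; the rows are made-up sample data.

```python
def make_actions(rows, index='telegraaf'):
    """Return a fresh generator of bulk actions each time it is called."""
    return ({'_index': index, '_type': subject, 'title': title, 'text': text}
            for (subject, title, text) in rows)

# Made-up sample rows: (subject, title, text)
rows = [('artikel', 'Storm', 'tekst een'),
        ('advertentie', 'Te koop', 'tekst twee')]

actions = make_actions(rows)
first_pass = list(actions)             # consumes the generator
second_pass = list(actions)            # the same generator is now empty
fresh_pass = list(make_actions(rows))  # a new generator yields everything again
```

With such a factory, re-running the import cell would amount to calling `helpers.bulk(es, make_actions(rows))` again instead of reusing the exhausted `k`.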

## Query system 

* Normalise query
* Get right tokens from the query. Use patterns to split up the query in parts? 
* Put them in the right representation for ES search
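
One possible shape for these three steps is sketched below: the raw query is lowercased, a four-digit year is pulled out with a pattern, punctuation is dropped, and the remaining terms are placed in an ES query body of the same form as the one used in `search` below. The helper names `parse_query` and `to_es_query` and the year pattern are assumptions for illustration, not part of the notebook.

```python
import re

# Assumption: a standalone four-digit number between 1800 and 2099 is a year
YEAR_PATTERN = re.compile(r'\b(1[89]\d\d|20\d\d)\b')

def parse_query(raw):
    """Split a raw query into free-text terms and an optional year."""
    text = raw.strip().lower()
    years = YEAR_PATTERN.findall(text)
    # Remove the year from the free text
    text = YEAR_PATTERN.sub(' ', text)
    # Replace punctuation by spaces, keep letters and digits
    text = re.sub(r'[^\w\s]', ' ', text, flags=re.UNICODE)
    return {'terms': text.split(), 'year': years[0] if years else None}

def to_es_query(parsed):
    """Put the parsed terms in a multi_match query over title and text."""
    return {'query': {'multi_match': {'query': ' '.join(parsed['terms']),
                                      'type': 'cross_fields',
                                      'fields': ['title', 'text'],
                                      'operator': 'and'}}}

parsed = parse_query('Stoomschip, Engelsche 1923')
body = to_es_query(parsed)
```

The extracted year is not used in the query body here; it could later be applied as a filter on the `year` field.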

In [438]:
def search(query, advanced=False):
    '''
    Given a query, print the raw Elasticsearch response, with hits ranked by score.
    '''
    if advanced:
        print "Advanced search not yet implemented."
        print "please wait till further notice! ;)"
    else:
        # filter_path can help reduce the amount of data that is returned by the es.search
        # The query context is for how well the document fits the query
        # The filter context is a boolean context. Does it match or not.
        # example: Does this timestamp fall into the range 2015 to 2016?
        #
        
        # The outer 'query': is necessary to show that this is the query.  
        q = {'query':
                {'multi_match':
                    {'query' : query,
                     'type' : 'cross_fields',  # with 'and' operator 
                     'fields' : ['title', 'text'],
                     'operator' : 'and'
                     }
                 }
             }
        # In other words: every query term must occur in at least one of the fields (title or text) for a document to match.
        
        # Very simple query search of a single word:
#         q = {'query':
#                  {'match':
#                   {'title': 'stoomschip'}
#                  }
#             }
        # size limits the response to the top-N hits
        res = es.search(index='telegraaf', size=10, body=q)
        print res
    

In [439]:
search('stoomschip engelsche')

{u'hits': {u'hits': [{u'_score': 1.2368431, u'_type': u'artikel', u'_id': u'AVfio6_4YswG1S9go2MY', u'_source': {u'date': u'03-01', u'text': u't', u'year': u'1923', u'title': u'HEVIGE STORM AAN DE ENGELSCHE KUST. Een stoomschip door de hemanning verlaten.'}, u'_index': u'telegraaf'}], u'total': 1, u'max_score': 1.2368431}, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}, u'took': 2, u'timed_out': False}


## Result page function

* Take query output and use score to order result on a Search Engine Result Page (SERP).
* Return title, link, and description of each hit
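
The two steps listed above could be sketched as a small formatting function over the response dict returned by `es.search`. The sample response is modelled on the output of `search('stoomschip engelsche')`; `format_serp` and `snippet_length` are hypothetical names. A link to the source is left out, because the documents indexed so far do not store an identifier to build the URL from.

```python
def format_serp(response, snippet_length=80):
    """Turn an Elasticsearch response dict into a plain-text result list."""
    lines = []
    # Elasticsearch returns the hits sorted by descending score by default
    for rank, hit in enumerate(response['hits']['hits'], 1):
        source = hit['_source']
        lines.append('%d. %s (%s-%s, %s, score %.2f)' % (
            rank, source.get('title', ''), source.get('year', ''),
            source.get('date', ''), hit['_type'], hit['_score']))
        # The first characters of the text serve as a crude description
        lines.append('   ' + source.get('text', '')[:snippet_length])
    return '\n'.join(lines)

# Sample response with the same structure as the search output above
sample_response = {
    'hits': {'total': 1, 'max_score': 1.2368431, 'hits': [
        {'_score': 1.2368431, '_type': 'artikel', '_index': 'telegraaf',
         '_id': 'AVfio6_4YswG1S9go2MY',
         '_source': {'year': '1923', 'date': '03-01',
                     'title': 'HEVIGE STORM AAN DE ENGELSCHE KUST.',
                     'text': 'Een stoomschip door de bemanning verlaten.'}}]}}

serp = format_serp(sample_response)
```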

-> The description can be a word cloud of 20-25 most informative words. Represent the hits of a query with a wordcloud of 25-50 informative words. The wordcloud should somehow summarise what the collection has to say about the query. You may think of these words as words that you could add to the query in order to improve recall (blind relevance feedback/query expansion). 


Additions
* A timeline with the number of hits over time
* A table with the number of hits for each political party. Link the party names, which should result in only selecting the hits "ingediend" (submitted) by members of that party. (Faceted Search) (For the "Telegraaf" collection, use the dc:subject element as facet values.)
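
Both additions could be served by a single request that carries terms aggregations next to the query. The sketch below only builds the request body and reads the buckets from a response; it assumes that `year` and `_type` (where the dc:subject value is stored in this notebook) can be aggregated on in the current index, and the aggregation names and sample response are made up.

```python
def facet_query(query_text, size=10):
    """Build a search body that also counts the hits per year and per type."""
    return {
        'size': size,
        'query': {'multi_match': {'query': query_text,
                                  'type': 'cross_fields',
                                  'fields': ['title', 'text'],
                                  'operator': 'and'}},
        'aggs': {
            # Timeline: number of hits per year
            'hits_per_year': {'terms': {'field': 'year', 'size': 100}},
            # Facet: number of hits per document type (dc:subject)
            'hits_per_type': {'terms': {'field': '_type'}},
        },
    }

def facet_counts(response, name):
    """Return (value, count) pairs for one aggregation in a response."""
    buckets = response['aggregations'][name]['buckets']
    return [(bucket['key'], bucket['doc_count']) for bucket in buckets]

body = facet_query('stoomschip')

# Made-up response fragment in the shape Elasticsearch uses for buckets
sample_response = {'aggregations': {'hits_per_year': {'buckets': [
    {'key': '1923', 'doc_count': 3}, {'key': '1924', 'doc_count': 1}]}}}
timeline = facet_counts(sample_response, 'hits_per_year')
```

The body would be sent with `es.search(index='telegraaf', body=body)`. Terms buckets come back ordered by count, so a timeline would still need the pairs sorted by year before plotting.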

## Advanced Search

The query system will have to be changed to implement this.

* Make multiple fields searchable:
    * Title 
    * Text
    * Year?
    
Let a user search in several fields, also in several fields simultaneously. Queries like "return kamervragen by Wilders about XXX with an answer about YYY in the period ZZZ" should be possible. (For the "Telegraaf" collection, let the user search in both the title and text fields.)
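
A fielded search of this kind maps naturally onto a `bool` query: one `match` clause per filled-in text field under `must`, and the period as a `range` clause under `filter`. The sketch below only builds the request body. The function name and parameters are hypothetical, and because `year` is indexed as a string in this notebook, the range bounds are passed as strings (mapping `year` as an integer might be cleaner).

```python
def build_advanced_query(title=None, text=None, year_from=None, year_to=None):
    """Build a bool query from the fields the user filled in."""
    must = []
    if title:
        must.append({'match': {'title': {'query': title, 'operator': 'and'}}})
    if text:
        must.append({'match': {'text': {'query': text, 'operator': 'and'}}})

    filters = []
    if year_from or year_to:
        bounds = {}
        if year_from:
            bounds['gte'] = str(year_from)
        if year_to:
            bounds['lte'] = str(year_to)
        # The filter context only decides match or no match, it does not score
        filters.append({'range': {'year': bounds}})

    bool_query = {}
    if must:
        bool_query['must'] = must
    if filters:
        bool_query['filter'] = filters
    if not bool_query:
        # Nothing filled in: return every document
        return {'query': {'match_all': {}}}
    return {'query': {'bool': bool_query}}

q = build_advanced_query(title='storm', text='stoomschip',
                         year_from=1920, year_to=1929)
```

The result could be passed to `es.search(index='telegraaf', body=q)` in the same way as the simple query.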

## Evaluation

* Manual relevance check
* P@10
* Change the ranking of the system + explain what is going on and why it is improving/decreasing

Evaluate your results. Let 2 persons assess the relevance of the top 10 documents for 5 different queries. Compute Cohen's kappa. Determine the average precision at 10 for your system based on these 10 queries, and the two relevance assessments. Also plot the P@10 (for both judges) for each query, showing differences in hard and easy queries. Describe clearly how you resolved differences in judgements. 
So, both provide the actual query, and a description of the information need that was behind the query.
Give a small set of clear guidelines for judging the results, and let your judges follow these guidelines.
It is far more interesting to have difficult queries (both for the search engine and for the judges) than to have queries on which all ten retrieved documents are relevant. So, try to create a good list of information needs.
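
The two measures named above can be computed from binary relevance judgements (1 = relevant, 0 = not relevant) with a few lines of Python. The sketch below uses the standard definition of Cohen's kappa with separate marginals per judge; some textbooks pool the marginals of both judges, which can give a slightly different value. The judgements are made-up sample data.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k results that were judged relevant."""
    return sum(judgements[:k]) / float(k)

def cohens_kappa(judge_a, judge_b):
    """Agreement between two judges, corrected for chance agreement."""
    n = len(judge_a)
    # Observed agreement: fraction of documents with identical judgements
    observed = sum(1 for a, b in zip(judge_a, judge_b) if a == b) / float(n)
    # Chance agreement from each judge's own rate of 'relevant'
    p_a = sum(judge_a) / float(n)
    p_b = sum(judge_b) / float(n)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both judges gave the same label to every document
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up judgements for the top 10 results of one query
judge_a = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
judge_b = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]

p10_a = precision_at_k(judge_a)
p10_b = precision_at_k(judge_b)
kappa = cohens_kappa(judge_a, judge_b)
```

In this sample the judges agree on 8 of the 10 documents and each marks half of them relevant, so chance agreement is 0.5 and kappa comes out at about 0.6. Averaging `precision_at_k` over all queries gives the mean P@10 for one judge.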


# REFERENCE CODE:

* Now just follow the guide and learn
* Instead of using the Sense plugin or curl, you can talk to Elasticsearch using the Python API

In [None]:
# Here are some curl commands to check stuff about the elasticsearch service
# These can be rewritten to python code. -> Check the documentation at ElasticSearch for them. 

In [182]:
# Cluster Health
! curl 'localhost:9200/_cat/health?v'

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 
1476967177 14:39:37  elasticsearch yellow          1         1      7   7    0    0        7             0                  -                 50.0% 


In [183]:
# List of nodes and their name
! curl 'localhost:9200/_cat/nodes?v'

host      ip        heap.percent ram.percent load node.role master name      
127.0.0.1 127.0.0.1            8         100 1.46 d         *      telegraaf 


In [189]:
# List of all indices
! curl 'localhost:9200/_cat/indices?v'

health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   .marvel-es-data-1         1   1          2            0        4kb            4kb 
yellow open   .marvel-es-1-2016.10.20   1   1        201           22    300.1kb        300.1kb 


In [168]:
# Creating a new index:
# ignore the 400 caused by IndexAlreadyExistsException when the index already exists
es.indices.create(index='test-index', ignore=400)

{u'acknowledged': True}

In [188]:
# Delete index:
# ignore 404 and 400
es.indices.delete(index='telegraaf', ignore=[400, 404])

{u'acknowledged': True}

In [98]:
# Import bulk data:
k = ({'_type':'foo', '_index':'test2','letters':''.join(letters)} for letters in itertools.permutations(string.letters,2))
helpers.bulk(es,k)

<generator object <genexpr> at 0x10685aaf0>


In [None]:
# Index a single document (this also creates the index if it does not exist yet)
es.index(index='megacorp', doc_type='employee', id=1, body=employee1)

In [None]:
# Count the number of items in an index
es.count(index='test2')

In [106]:
# Verna's read doc
import sys
import time
import json
from elasticsearch import Elasticsearch, helpers
from bs4 import BeautifulSoup
import xmltodict

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

xml = BeautifulSoup(open('Telegraaf/telegraaf-1994.xml', 'r'),"xml")

start = time.time()
documents = xml.find_all('root')

json_docs = []

for i, doc in enumerate(documents):
    doc_type = doc.subject.get_text()
    year = doc.date.get_text()[:4]
    document_dict = dict()
    document_dict['body'] = xmltodict.parse(str(doc), xml_attribs=True)
    document_dict['doc_type'] = doc_type
    document_dict['year'] = year
    json_docs.append(json.loads(json.dumps(document_dict, indent=4)))
    #es.index(index='telegraaf', doc_type='document', id=i, body=json_doc)
    
k = ({'_type':'document', '_index':'telegraaf','_source':doc}
    for doc in json_docs)

ImportError: No module named xmltodict

# Using the Python elastic search api

* Documentation: <https://elasticsearch-py.readthedocs.org/en/master/>

In [22]:
import sys
import json
from elasticsearch import Elasticsearch

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

query={
  "query": {
    "match_all": {}
  }
}

es.search(body=query)

{u'_shards': {u'failed': 0, u'successful': 2, u'total': 2},
 u'hits': {u'hits': [{u'_id': u'AVfhc1hy09KZ5fwZ_NQX',
    u'_index': u'.marvel-es-1-2016.10.20',
    u'_score': 1.0,
    u'_source': {u'cluster_uuid': u'qkSNHqQ6QMqEo9fljsFQzQ',
     u'indices_stats': {u'_all': {u'primaries': {u'docs': {u'count': 79},
        u'indexing': {u'index_time_in_millis': 366,
         u'index_total': 119,
         u'is_throttled': False,
         u'throttle_time_in_millis': 0},
        u'search': {u'query_time_in_millis': 10, u'query_total': 2},
        u'store': {u'size_in_bytes': 390749}},
       u'total': {u'docs': {u'count': 79},
        u'indexing': {u'index_time_in_millis': 366,
         u'index_total': 119,
         u'is_throttled': False,
         u'throttle_time_in_millis': 0},
        u'search': {u'query_time_in_millis': 10, u'query_total': 2},
        u'store': {u'size_in_bytes': 390749}}}},
     u'source_node': {u'attributes': {},
      u'host': u'127.0.0.1',
      u'ip': u'127.0.0.1',
 

In [23]:
# The example from https://www.elastic.co/guide/en/elasticsearch/guide/current/_talking_to_elasticsearch.html
es.count(body=query)

{u'_shards': {u'failed': 0, u'successful': 2, u'total': 2}, u'count': 135}

# Putting information in the DB

* We follow <https://www.elastic.co/guide/en/elasticsearch/guide/current/_indexing_employee_documents.html>

* Notice that the path /megacorp/employee/1 contains three pieces of information:
    * megacorp: The index name
    * employee: The type name
    * 1 : The ID of this particular employee
    
* We use the `es.index` method 

In [24]:
employee1= {
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

es.index(index='megacorp', doc_type='employee', id=1, body=employee1)


{u'_id': u'1',
 u'_index': u'megacorp',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'employee',
 u'_version': 1,
 u'created': True}

In [28]:
res = es.get(index='megacorp', doc_type='employee', id=1)
print res
print(res['_source'])

{u'_type': u'employee', u'_source': {u'interests': [u'sports', u'music'], u'age': 25, u'about': u'I love to go rock climbing', u'last_name': u'Smith', u'first_name': u'John'}, u'_index': u'megacorp', u'_version': 1, u'found': True, u'_id': u'1'}
{u'interests': [u'sports', u'music'], u'age': 25, u'about': u'I love to go rock climbing', u'last_name': u'Smith', u'first_name': u'John'}


In [31]:
es.indices.refresh(index="megacorp")

res = es.search(index="megacorp", body={"query": {"match_all": {}}})
print res
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print("%(first_name)s %(last_name)s is  %(age)d years old" % hit["_source"])

{u'hits': {u'hits': [{u'_score': 1.0, u'_type': u'employee', u'_id': u'1', u'_source': {u'interests': [u'sports', u'music'], u'age': 25, u'about': u'I love to go rock climbing', u'last_name': u'Smith', u'first_name': u'John'}, u'_index': u'megacorp'}], u'total': 1, u'max_score': 1.0}, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}, u'took': 1, u'timed_out': False}
Got 1 Hits:
John Smith is  25 years old


In [32]:
# Example from https://www.elastic.co/guide/en/elasticsearch/guide/current/_search_lite.html
# GET /megacorp/employee/_search?q=last_name:Smith
# View the query in sense to see the specific JSON way of writing it

q= {
  "query": {
    "match": {
      "last_name": "smith"
    }
  }
}
res = es.search(index="megacorp", body=q)
res

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'1',
    u'_index': u'megacorp',
    u'_score': 0.30685282,
    u'_source': {u'about': u'I love to go rock climbing',
     u'age': 25,
     u'first_name': u'John',
     u'interests': [u'sports', u'music'],
     u'last_name': u'Smith'},
    u'_type': u'employee'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 114}

In [33]:
# res is a dict
res['hits']['hits']

[{u'_id': u'1',
  u'_index': u'megacorp',
  u'_score': 0.30685282,
  u'_source': {u'about': u'I love to go rock climbing',
   u'age': 25,
   u'first_name': u'John',
   u'interests': [u'sports', u'music'],
   u'last_name': u'Smith'},
  u'_type': u'employee'}]

In [34]:
# score of first hit 
res['hits']['hits'][0]['_score']

0.30685282

# Bulk indexing

If you index a lot of documents, use the bulk helpers instead of indexing them one at a time; this avoids one HTTP request per document.

See 
* <https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html> for the explanation in the guide
* <http://unroutable.blogspot.nl/2015/03/quick-example-elasticsearch-bulk-index.html> for the Python way

In [97]:
>>> import itertools
>>> import string
>>> from elasticsearch import  helpers
 
>>> # k is a generator expression that produces
... # a series of dictionaries containing test data.
... # The test data are just letter permutations
... # created with itertools.permutations.
... #
... # We then reference k as the iterator that's
... # consumed by the elasticsearch.helpers.bulk method.
>>> k = ({'_type':'foo', '_index':'test2','letters':''.join(letters)}
...      for letters in itertools.permutations(string.letters,2))

>>> # calling k.next() shows examples
... # (while consuming the generator, of course)
>>> # each dict contains a doc type, index, and data (at minimum)
>>> k.next()

{'_index': 'test2', '_type': 'foo', 'letters': 'AB'}

In [38]:
# What is this k generator?

letters=  [letters for letters in itertools.permutations(string.letters,4)]

len(letters),letters[:5]

(6497400,
 [('A', 'B', 'C', 'D'),
  ('A', 'B', 'C', 'E'),
  ('A', 'B', 'C', 'F'),
  ('A', 'B', 'C', 'G'),
  ('A', 'B', 'C', 'H')])

In [39]:
k.next()

{'_index': 'test2', '_type': 'foo', 'letters': 'AC'}

In [40]:
>>> # create our test index
>>> es.indices.create('test2')

{u'acknowledged': True}

In [41]:

>>> helpers.bulk(es,k)

(2650, [])

In [66]:
!curl 'localhost:9200/_cat/indices?v'
# res = es.search(body={"query":{ 'match_all':{}}})
# for i in res:
#     print i 
#     for j in res[i]:
#         print '-',j
#         for k in res[i][j]:
#             for l in k:
#                 print '--', l
#                 print k[l]

health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   megacorp                  5   1          1            0      4.7kb          4.7kb 
yellow open   test2                     5   1       2650            0    107.7kb        107.7kb 
yellow open   .marvel-es-data-1         1   1          2            0        4kb            4kb 
yellow open   .marvel-es-1-2016.10.20   1   1       1447           24    452.1kb        452.1kb 


In [100]:
>>> # check to make sure we got what we expected...
>>> es.count(index='test2')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, u'count': 2650}

# Your turn
* Make quite a bit more documents by changing the 2 in the definition of k to 3, or 4...
* index them again and query, and notice performance
* find out how you can delete an index ;-)

In [47]:
import time

import json, xmljson
from lxml.etree import fromstring, tostring



start = time.time()
xml = fromstring(open('Telegraaf/mini.xml','r').read())
xml_done = time.time()
print xml_done - start

json_data = json.loads(json.dumps(xmljson.parker.data(xml)))
json_done = time.time()
print json_done - xml_done

0.000593900680542
0.0020489692688


In [54]:
# i = 0

# for root in json_data:
#     for document in json_data[root]:
#         if json_data[root][document]:
#             for j, ding in enumerate(json_data[root][document]):
#                 i += 1
#                 print(json_data[root][document][j])
#                 print i

for root in json_data:
    for i, document in enumerate(json_data[root]):
        print json_data[root][i]

{u'{http://www.politicalmashup.nl}meta': {u'{http://purl.org/dc/elements/1.1/}subject': u'advertentie', u'{http://purl.org/dc/elements/1.1/}date': u'1918-04-02', u'{http://purl.org/dc/elements/1.1/}identifier': u'ddd:011211202:mpeg21:p001:a0001', u'{http://purl.org/dc/elements/1.1/}source': {u'{http://purl.org/dc/elements/1.1/}source': {u'{http://www.politicalmashup.nl}link': None}}}, u'{http://www.politicalmashup.nl}docinfo': None, u'{http://www.politicalmashup.nl}content': {u'text': {u'p': u'I;L * .-\u25a0\u2022\u25a0 -."\u25a0 4,1 ffii \' \'\u2022\'\u2022^*V*t \'S ~fr\u2022-\'K\'j\',-^; r \u25a0&*!* ff\' r- AMSTERDAM. v: >\u2022\u2022 . C-. _.:\u2022\u2022\'\u2022 \u2022 - .:-\'.v- - -\u2022\u2022 ;-V->- \'t-\'-\'S- .\u25a0\u2022\u25a0 V \u25a0\u25a0:\u25a0 \u2022 -\'\u25a0 \\v^,> > *_. . \u2022\u2022 | - \' \'\u25a0- --\u2022-f \\- .\xbb\'-\'\u25a0 v*-\': \u2022 \u2022+**\u2022\u25a0\u25a0*\'\u25a0;\u25a0- ;---" *\u2022 \u2022 \u2022\'.\u2022 \u2022\xbb,.. ** .\u25a0 i\'i \'/ \u202