# Elasticsearch

* Literature: <https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html>
* In this notebook we follow the steps in this guide.
* Java JDK 8 is strongly recommended, so you may need to upgrade your Java
    (on my Mac with Java 1.6, Elasticsearch did not work).

# We start Elasticsearch again
* Now just follow the guide and learn.
* Instead of using the Sense plugin or curl, you can talk to Elasticsearch using the Python API.

In [11]:
! ./elasticsearch-2.4.1/bin/elasticsearch -d

# Using the Python Elasticsearch API

* Documentation: <https://elasticsearch-py.readthedocs.org/en/master/>

In [21]:
import sys
import json
from elasticsearch import Elasticsearch

HOST = 'http://localhost:9200/'
es = Elasticsearch(hosts=[HOST])

query={
  "query": {
    "match_all": {}
  }
}

es.search(body=query)

In [13]:
# The example from https://www.elastic.co/guide/en/elasticsearch/guide/current/_talking_to_elasticsearch.html
es.count(body=query)

{u'_shards': {u'failed': 0, u'successful': 2, u'total': 2}, u'count': 7}
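The Python client serializes the `query` dict to JSON before sending it, so the request body is the same JSON you would type into Sense or pass to curl. A minimal sketch, using only the standard library and no running server, to inspect that body:

```python
import json

# Same match_all query as in the cell above
query = {
    "query": {
        "match_all": {}
    }
}

# Serialize the dict to the JSON text that goes into the request body
body = json.dumps(query, sort_keys=True)
print(body)
```

Printing the body this way is handy when you want to copy a query built in Python back into Sense.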

# Putting information in the DB

* We follow <https://www.elastic.co/guide/en/elasticsearch/guide/current/_indexing_employee_documents.html>

* Notice that the path /megacorp/employee/1 contains three pieces of information:
    * megacorp: The index name
    * employee: The type name
    * 1 : The ID of this particular employee
    
* We use the `es.index` method 

In [14]:
employee1= {
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

es.index(index='megacorp', doc_type='employee', id=1, body=employee1)


{u'_id': u'1',
 u'_index': u'megacorp',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'employee',
 u'_version': 1,
 u'created': True}
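To make the three parts of the path concrete, here is a small sketch that splits a path such as `/megacorp/employee/1` into the keyword arguments used with `es.index` above. The helper name `split_document_path` is made up for this illustration and is not part of the Elasticsearch client.

```python
def split_document_path(path):
    """Split '/index/type/id' into its three parts.

    Hypothetical helper for illustration; not part of elasticsearch-py.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError("expected /index/type/id, got %r" % path)
    index, doc_type, doc_id = parts
    return {"index": index, "doc_type": doc_type, "id": doc_id}


parts = split_document_path("/megacorp/employee/1")
print(parts)
```

The ID comes out as a string here, just like the `_id` field in the response above.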

In [15]:
res = es.get(index='megacorp', doc_type='employee', id=1)
print(res['_source'])

{u'interests': [u'sports', u'music'], u'age': 25, u'about': u'I love to go rock climbing', u'last_name': u'Smith', u'first_name': u'John'}


In [16]:
es.indices.refresh(index="megacorp")

res = es.search(index="megacorp", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print("%(first_name)s %(last_name)s is  %(age)d years old" % hit["_source"])

Got 1 Hits:
John Smith is  25 years old


In [17]:
# Example from https://www.elastic.co/guide/en/elasticsearch/guide/current/_search_lite.html
# GET /megacorp/employee/_search?q=last_name:Smith
# View the query in Sense to see how it is written as JSON

q= {
  "query": {
    "match": {
      "last_name": "smith"
    }
  }
}
res = es.search(index="megacorp", body=q)
res

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'1',
    u'_index': u'megacorp',
    u'_score': 0.30685282,
    u'_source': {u'about': u'I love to go rock climbing',
     u'age': 25,
     u'first_name': u'John',
     u'interests': [u'sports', u'music'],
     u'last_name': u'Smith'},
    u'_type': u'employee'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 12}
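The guide presents the `match` query as the Query DSL counterpart of the search-lite form `q=last_name:Smith`. As a rough sketch of that correspondence, the hypothetical helper `lite_to_match` below turns one `field:value` term into a dict of the same shape as `q` above. It is an illustration only: the real query-string syntax supports far more than a single term.

```python
def lite_to_match(term):
    """Turn a single 'field:value' term into a match query dict.

    Hypothetical helper; handles exactly one field:value pair.
    """
    field, sep, value = term.partition(":")
    if not sep or not field or not value:
        raise ValueError("expected 'field:value', got %r" % term)
    return {"query": {"match": {field: value}}}


dsl = lite_to_match("last_name:Smith")
print(dsl)
```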

In [18]:
# res is a dict
res['hits']['hits']

[{u'_id': u'1',
  u'_index': u'megacorp',
  u'_score': 0.30685282,
  u'_source': {u'about': u'I love to go rock climbing',
   u'age': 25,
   u'first_name': u'John',
   u'interests': [u'sports', u'music'],
   u'last_name': u'Smith'},
  u'_type': u'employee'}]

In [19]:
# score of first hit 
res['hits']['hits'][0]['_score']

0.30685282
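Since the response is a plain nested dict, pulling out the interesting fields is ordinary dict and list indexing. The sketch below works on a copy of the response printed above, so it runs without a server. The function name `summarize_hits` is invented for this example.

```python
# Sample response, copied from the search output above
res = {
    "_shards": {"failed": 0, "successful": 5, "total": 5},
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_index": "megacorp",
                "_score": 0.30685282,
                "_source": {
                    "about": "I love to go rock climbing",
                    "age": 25,
                    "first_name": "John",
                    "interests": ["sports", "music"],
                    "last_name": "Smith",
                },
                "_type": "employee",
            }
        ],
        "max_score": 0.30685282,
        "total": 1,
    },
    "timed_out": False,
    "took": 12,
}


def summarize_hits(response):
    """Return (id, score, full name) for every hit in a search response."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        name = "%s %s" % (source["first_name"], source["last_name"])
        rows.append((hit["_id"], hit["_score"], name))
    return rows


rows = summarize_hits(res)
print(rows)
```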

# Bulk indexing

If you index a lot of documents, use the bulk indexing methods instead of sending one request per document.

See 
* <https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html> for the explanation in the guide
* <http://unroutable.blogspot.nl/2015/03/quick-example-elasticsearch-bulk-index.html> for the Python way
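The guide describes the bulk request body as pairs of lines: one JSON line with the action and its metadata, followed by one JSON line with the document itself, each ending in a newline. The sketch below builds such a body by hand for `index` actions. The function `to_bulk_body` is a hypothetical helper written for this notebook; in practice `elasticsearch.helpers.bulk` takes care of this kind of formatting for you.

```python
import json


def to_bulk_body(actions):
    """Build a newline-delimited bulk body for 'index' actions.

    Hypothetical helper; each action is a dict with '_index', '_type'
    and the document fields.
    """
    lines = []
    for action in actions:
        doc = dict(action)  # copy, so the caller's dict is left untouched
        meta = {"_index": doc.pop("_index"), "_type": doc.pop("_type")}
        lines.append(json.dumps({"index": meta}, sort_keys=True))  # action line
        lines.append(json.dumps(doc, sort_keys=True))              # document line
    return "\n".join(lines) + "\n"


body = to_bulk_body([{"_index": "test2", "_type": "foo", "letters": "AB"}])
print(body)
```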

In [20]:
>>> import itertools
>>> import string
>>> from elasticsearch import  helpers
 
>>> # k is a generator expression that produces
... # a series of dictionaries containing test data.
... # The test data are just letter permutations
... # created with itertools.permutations.
... #
... # We then reference k as the iterator that's
... # consumed by the elasticsearch.helpers.bulk method.
>>> k = ({'_type':'foo', '_index':'test2','letters':''.join(letters)}
...      for letters in itertools.permutations(string.letters,2))

>>> # calling k.next() shows examples
... # (while consuming the generator, of course)
>>> # each dict contains a doc type, index, and data (at minimum)
>>> k.next()

{'_index': 'test2', '_type': 'foo', 'letters': 'AB'}

In [22]:
# What is this k generator?

letters=  [letters for letters in itertools.permutations(string.letters,4)]

len(letters),letters[:5]

(6497400,
 [('A', 'B', 'C', 'D'),
  ('A', 'B', 'C', 'E'),
  ('A', 'B', 'C', 'F'),
  ('A', 'B', 'C', 'G'),
  ('A', 'B', 'C', 'H')])

In [23]:
k.next()

{'_index': 'test2', '_type': 'foo', 'letters': 'AC'}

In [24]:
>>> # create our test index
>>> es.indices.create('test2')

{u'acknowledged': True}

In [25]:

>>> helpers.bulk(es,k)

(2650, [])
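The count of 2650 can be checked by hand: the 6497400 four-letter permutations counted earlier equal 52 * 51 * 50 * 49, so there are 52 letters, which gives 52 * 51 = 2652 two-letter permutations, and two of those were already consumed by the `k.next()` calls. A quick sketch that recounts this without Elasticsearch, assuming `string.ascii_letters` as a stand-in for the Python 2 `string.letters` (same 52 letters, possibly in a different order):

```python
import itertools
import string

# Stand-in for the generator k; string.ascii_letters is an assumption that
# replaces the Python 2-only string.letters used in the notebook
actions = ({'_type': 'foo', '_index': 'test2', 'letters': ''.join(pair)}
           for pair in itertools.permutations(string.ascii_letters, 2))

# Two documents were consumed by the k.next() calls before the bulk call
next(actions)
next(actions)

# Count what is left for the bulk helper
remaining = sum(1 for _ in actions)
print(remaining)
```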

In [26]:
>>> # check to make sure we got what we expected...
>>> # (the index created above is 'test2'; asking for 'test' raises the NotFoundError shown below)
>>> es.count(index='test')

NotFoundError: TransportError(404, u'index_not_found_exception', u'no such index')

# Your turn
* Create many more documents by changing the 2 in the definition of k to 3 or 4.
* Index them again, query them, and note the performance.
* Find out how you can delete an index ;-)

In [47]:
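import time

import json, xmljson
from lxml.etree import fromstring, tostring

# Time the XML parse; read bytes so lxml handles the encoding declaration itself
start = time.time()
with open('Telegraaf/mini.xml', 'rb') as f:
    xml = fromstring(f.read())
xml_done = time.time()
print(xml_done - start)

# Convert to JSON-style data using the Parker convention; the dumps/loads
# round trip leaves only plain dicts, lists, and scalar values
json_data = json.loads(json.dumps(xmljson.parker.data(xml)))
json_done = time.time()
print(json_done - xml_done)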

0.000593900680542
0.0020489692688
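For the performance exercise, the start/stop pattern used above can be wrapped in a small reusable helper. This is only a sketch: `stopwatch` and `timings` are names invented here, and the summed range is a placeholder for the indexing or query call you want to measure.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start


timings = {}
with stopwatch("sum", timings):
    total = sum(range(100000))  # placeholder workload

print(timings)
```

Collecting the timings in a dict makes it easy to compare runs, for example bulk indexing with 2, 3, and 4 letters.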


In [54]:
# i = 0

# for root in json_data:
#     for document in json_data[root]:
#         if json_data[root][document]:
#             for j, ding in enumerate(json_data[root][document]):
#                 i += 1
#                 print(json_data[root][document][j])
#                 print i

# Print every document under each root element
for root in json_data:
    for document in json_data[root]:
        print(document)

{u'{http://www.politicalmashup.nl}meta': {u'{http://purl.org/dc/elements/1.1/}subject': u'advertentie', u'{http://purl.org/dc/elements/1.1/}date': u'1918-04-02', u'{http://purl.org/dc/elements/1.1/}identifier': u'ddd:011211202:mpeg21:p001:a0001', u'{http://purl.org/dc/elements/1.1/}source': {u'{http://purl.org/dc/elements/1.1/}source': {u'{http://www.politicalmashup.nl}link': None}}}, u'{http://www.politicalmashup.nl}docinfo': None, u'{http://www.politicalmashup.nl}content': {u'text': {u'p': u'I;L * .-\u25a0\u2022\u25a0 -."\u25a0 4,1 ffii \' \'\u2022\'\u2022^*V*t \'S ~fr\u2022-\'K\'j\',-^; r \u25a0&*!* ff\' r- AMSTERDAM. v: >\u2022\u2022 . C-. _.:\u2022\u2022\'\u2022 \u2022 - .:-\'.v- - -\u2022\u2022 ;-V->- \'t-\'-\'S- .\u25a0\u2022\u25a0 V \u25a0\u25a0:\u25a0 \u2022 -\'\u25a0 \\v^,> > *_. . \u2022\u2022 | - \' \'\u25a0- --\u2022-f \\- .\xbb\'-\'\u25a0 v*-\': \u2022 \u2022+**\u2022\u25a0\u25a0*\'\u25a0;\u25a0- ;---" *\u2022 \u2022 \u2022\'.\u2022 \u2022\xbb,.. ** .\u25a0 i\'i \'/ \u202