## Inserting data into Elasticsearch
Before you can query Elasticsearch, you will need to load some data into an index. In the previous section, you used a library, psycopg2, to access PostgreSQL. To access Elasticsearch, you will use the elasticsearch library. To load data, you create the connection and then issue commands to Elasticsearch. Follow these steps to add a record to Elasticsearch:

1. Import the libraries. You can also create the Faker object to generate random data:

In [91]:
from elasticsearch import Elasticsearch
from faker import Faker

fake=Faker()

2. Create a connection to Elasticsearch:

In [92]:
es = Elasticsearch()

3. The preceding code assumes that your Elasticsearch instance is running on localhost. If it is not, you can specify the IP address, as shown:

In [93]:
es=Elasticsearch(['127.0.0.1'])  # hosts are passed as a list

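4. Build a document from fake data and add it to the users index with the index method. The response reports whether the document was created:

In [14]:
doc={"name": fake.name(),"street": fake.street_address(),"city": fake.city(),"zip":fake.zipcode()}
res=es.index(index="users",body=doc)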

In [15]:
print(res['result']) #created

created


In [16]:
res

{'_index': 'users',
 '_type': '_doc',
 '_id': 'E5wNYHgBgZXbBKAYBJ_U',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 0,
 '_primary_term': 1}
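
The response dictionary above can be reduced to the fields you are most likely to check. The following is a minimal sketch, not part of the elasticsearch library: `summarize_index_response` is a hypothetical helper name, and the sample response is copied from the output above so the code runs without a cluster.

```python
def summarize_index_response(res):
    """Return (document id, result, version) from an index response dict."""
    return res["_id"], res["result"], res["_version"]


# Sample response copied from the output above (no cluster needed).
sample = {
    "_index": "users",
    "_type": "_doc",
    "_id": "E5wNYHgBgZXbBKAYBJ_U",
    "_version": 1,
    "result": "created",
    "_shards": {"total": 2, "successful": 1, "failed": 0},
    "_seq_no": 0,
    "_primary_term": 1,
}

doc_id, result, version = summarize_index_response(sample)
print(doc_id, result, version)
```

Because no id was passed to the index method, Elasticsearch generated the `_id` itself, and `result` is `created` for a new document.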

## Inserting data using helpers
Using the bulk method, you can insert many documents at a time. The process is similar to inserting a single record, except that you will generate all the data, then insert it. The steps are as follows:

1. You need to import the helpers library to access the bulk method:

In [17]:
from elasticsearch import helpers

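2. Build a list of actions, one per document. Each action names the target index and holds the document body in _source:

In [26]: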
actions = [
    {
        "_index": "users",
        "_source": {
        "name": fake.name(),
        "street": fake.street_address(),
        "city": fake.city(),
        "zip":fake.zipcode()}
    }
    for x in range(998) # or for i,r in df.iterrows()
]

In [27]:
actions

[{'_index': 'users',
  '_source': {'name': 'Tammy Reyes',
   'street': '3981 Ronnie Wells',
   'city': 'Hannahland',
   'zip': '83342'}},
 {'_index': 'users',
  '_source': {'name': 'Micheal Wilson',
   'street': '6214 Jason Orchard Apt. 150',
   'city': 'Harrishaven',
   'zip': '36760'}},
 {'_index': 'users',
  '_source': {'name': 'Jeffery Chen',
   'street': '228 Allison Manors',
   'city': 'Port Jessica',
   'zip': '37875'}},
 {'_index': 'users',
  '_source': {'name': 'Bryce Drake',
   'street': '181 Haley Stream Suite 205',
   'city': 'Ethanton',
   'zip': '47114'}},
 {'_index': 'users',
  '_source': {'name': 'James Hodges',
   'street': '8741 Aaron Stravenue',
   'city': 'West Morgan',
   'zip': '56433'}},
 {'_index': 'users',
  '_source': {'name': 'Isabel Sandoval',
   'street': '5214 Norris Lodge',
   'city': 'South Heather',
   'zip': '07033'}},
 {'_index': 'users',
  '_source': {'name': 'Erica Mayo',
   'street': '6577 Rogers Road Suite 340',
   'city': 'Maryton',
   'zip': '19

3. Pass the connection and the actions to the bulk method and print the result:

In [30]:
res = helpers.bulk(es, actions)
print(res)

(998, [])
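
The list comprehension above builds every action in memory before inserting. Since helpers.bulk accepts any iterable of actions, a generator can produce them one at a time instead. The following is a minimal sketch under that assumption: `generate_actions` is a hypothetical helper, and the two rows are plain dictionaries standing in for Faker output or DataFrame records.

```python
def generate_actions(rows, index="users"):
    """Yield one bulk action per row instead of building a full list."""
    for row in rows:
        yield {"_index": index, "_source": dict(row)}


# Hypothetical rows standing in for Faker output or DataFrame records.
rows = [
    {"name": "Tammy Reyes", "street": "3981 Ronnie Wells",
     "city": "Hannahland", "zip": "83342"},
    {"name": "Micheal Wilson", "street": "6214 Jason Orchard Apt. 150",
     "city": "Harrishaven", "zip": "36760"},
]

actions = list(generate_actions(rows))
print(len(actions), actions[0]["_index"])

# With a live connection, the generator could be passed directly:
# success_count, errors = helpers.bulk(es, generate_actions(rows))
```

The tuple printed by the bulk call above, `(998, [])`, is the number of successful actions followed by the list of errors.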


## Querying Elasticsearch
Querying Elasticsearch follows the exact same steps as inserting data. The only difference is you use a different method – search – to send a different body object. Let's walk through a simple query on all the data:

1. Import the library and create your elasticsearch instance:

In [2]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

2. Create the JSON object to send to Elasticsearch. The object is a query, using the match_all search:

In [30]:
doc={"query":{"match_all":{}}}

3. Pass the object to Elasticsearch using the search method. Pass the index and the return size. In this case, you will only return 10 records. By default, the maximum return size is 10,000 documents:

In [31]:
res=es.search(index="users",body=doc,size=10)

4. Lastly, you can print the documents:

In [32]:
print(res['hits']['hits'])

[{'_index': 'users', '_type': '_doc', '_id': 'upwaYHgBgZXbBKAYpKwB', '_score': 1.0, '_source': {'name': 'Tammy Reyes', 'street': '3981 Ronnie Wells', 'city': 'Hannahland', 'zip': '83342'}}, {'_index': 'users', '_type': '_doc', '_id': 'u5waYHgBgZXbBKAYpKwB', '_score': 1.0, '_source': {'name': 'Micheal Wilson', 'street': '6214 Jason Orchard Apt. 150', 'city': 'Harrishaven', 'zip': '36760'}}, {'_index': 'users', '_type': '_doc', '_id': 'vJwaYHgBgZXbBKAYpKwB', '_score': 1.0, '_source': {'name': 'Jeffery Chen', 'street': '228 Allison Manors', 'city': 'Port Jessica', 'zip': '37875'}}, {'_index': 'users', '_type': '_doc', '_id': 'vZwaYHgBgZXbBKAYpKwB', '_score': 1.0, '_source': {'name': 'Bryce Drake', 'street': '181 Haley Stream Suite 205', 'city': 'Ethanton', 'zip': '47114'}}, {'_index': 'users', '_type': '_doc', '_id': 'vpwaYHgBgZXbBKAYpKwB', '_score': 1.0, '_source': {'name': 'James Hodges', 'street': '8741 Aaron Stravenue', 'city': 'West Morgan', 'zip': '56433'}}, {'_index': 'users', '_ty

In [15]:
doc={"query": { "match_phrase": { "name": "affadoul" } }}
res=es.search(index="users",body=doc)
print(res['hits']['hits'])

[{'_index': 'users', '_type': '_doc', '_id': 'l56vY3gBlPWjxInHVYPe', '_score': 8.2178135, '_source': {'name': 'affadoul', 'street': 'hay salam', 'city': 'khouribga', 'zip': '2500'}}]


Alternatively, you can iterate through the hits and print only the _source of each document:

In [42]:
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Tammy Reyes', 'street': '3981 Ronnie Wells', 'city': 'Hannahland', 'zip': '83342'}
{'name': 'Micheal Wilson', 'street': '6214 Jason Orchard Apt. 150', 'city': 'Harrishaven', 'zip': '36760'}
{'name': 'Jeffery Chen', 'street': '228 Allison Manors', 'city': 'Port Jessica', 'zip': '37875'}
{'name': 'Bryce Drake', 'street': '181 Haley Stream Suite 205', 'city': 'Ethanton', 'zip': '47114'}
{'name': 'James Hodges', 'street': '8741 Aaron Stravenue', 'city': 'West Morgan', 'zip': '56433'}
{'name': 'Isabel Sandoval', 'street': '5214 Norris Lodge', 'city': 'South Heather', 'zip': '07033'}
{'name': 'Erica Mayo', 'street': '6577 Rogers Road Suite 340', 'city': 'Maryton', 'zip': '19551'}
{'name': 'Molly Jones', 'street': '680 Rice Inlet', 'city': 'Lake Joelland', 'zip': '16025'}
{'name': 'Megan Kelly', 'street': '323 Christopher Flat', 'city': 'Chadville', 'zip': '85058'}
{'name': 'Crystal Graham', 'street': '70221 Bishop Village Apt. 961', 'city': 'West Megan', 'zip': '60334'}
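
The loop above can be wrapped in a small function that returns the documents instead of printing them. This is a sketch only: `extract_sources` is a hypothetical name, and the sample response is a trimmed copy of the search output so the code runs without a cluster.

```python
def extract_sources(res):
    """Return the _source dict of every hit in a search response."""
    return [hit["_source"] for hit in res.get("hits", {}).get("hits", [])]


# Trimmed response shaped like the search output above.
sample = {
    "hits": {
        "hits": [
            {"_index": "users", "_id": "upwaYHgBgZXbBKAYpKwB", "_score": 1.0,
             "_source": {"name": "Tammy Reyes", "city": "Hannahland"}},
            {"_index": "users", "_id": "u5waYHgBgZXbBKAYpKwB", "_score": 1.0,
             "_source": {"name": "Micheal Wilson", "city": "Harrishaven"}},
        ]
    }
}

for source in extract_sources(sample):
    print(source["name"], "-", source["city"])
```

Using `.get` with defaults means an empty or malformed response yields an empty list rather than a KeyError.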


You can load the results of the query into a pandas DataFrame, because it is JSON, and you learned how to read JSON in Chapter 3, Reading and Writing Files. To load the results into a DataFrame, import json_normalize from pandas and call it on the list of hits in the JSON results, as shown in the following code:

In [47]:
from pandas import json_normalize
df=json_normalize(res['hits']['hits'])
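
json_normalize flattens nested dictionaries into dot-separated column names, so the fields inside _source end up in columns such as `_source.name`. The following plain-Python sketch illustrates that flattening without requiring pandas. It is not how pandas implements it, and `flatten` is a hypothetical helper.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so nested fields become parent.child keys.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# One hit shaped like the search output (fields trimmed).
hit = {
    "_index": "users",
    "_id": "upwaYHgBgZXbBKAYpKwB",
    "_score": 1.0,
    "_source": {"name": "Tammy Reyes", "zip": "83342"},
}

row = flatten(hit)
print(sorted(row))
```

If you only want the document fields as columns, one option is to pass the list of `_source` dictionaries to json_normalize instead of the full hits.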

In [94]:
#adding records for test purposes
doc={"name": fake.name(),"street": fake.street_address(),"city": "Lake Jamesberg","zip":fake.zipcode()}
res=es.index(index="users",body=doc)

Now you will have the results of the search in a DataFrame. In this example, you just grabbed all the records, but there are other queries available besides match_all.
Using the match_all query, I know I have a document with the name Ronald Williams. You can query on a field using the match query:

In [85]:
doc={"query":{"match":{"name":"Ronald Williams"}}}
res=es.search(index="users",body=doc,size=10)
print(res['hits']['hits'][0]['_source'])

{'name': 'Ronald Williams', 'street': '969 Hughes Place Apt. 350', 'city': 'Smithmouth', 'zip': '50537'}


In [86]:
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Ronald Williams', 'street': '969 Hughes Place Apt. 350', 'city': 'Smithmouth', 'zip': '50537'}
{'name': 'Robert Williams', 'street': '156 Breanna Port Apt. 276', 'city': 'West Jessicaland', 'zip': '65477'}
{'name': 'Anna Williams', 'street': '566 Katelyn Court', 'city': 'Port Reginald', 'zip': '78438'}
{'name': 'Wendy Williams', 'street': '9609 Welch Throughway', 'city': 'Christianstad', 'zip': '93385'}
{'name': 'Christopher Williams', 'street': '31173 Bennett Union', 'city': 'West Johnfort', 'zip': '95465'}
{'name': 'Shane Williams', 'street': '4921 Henry Green Suite 881', 'city': 'Derekside', 'zip': '82489'}
{'name': 'Ashley Williams', 'street': '0795 Clayton Branch Apt. 184', 'city': 'Martinezton', 'zip': '74005'}
{'name': 'Tracey Williams', 'street': '74041 Vance Centers Apt. 346', 'city': 'Davisview', 'zip': '34484'}
{'name': 'Hannah Williams', 'street': '756 Nelson Plains Apt. 569', 'city': 'Williestad', 'zip': '88334'}
{'name': 'Gerald Williams', 'street': '1306 Andrew
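
The output above lists every Williams because a match query analyzes the text into terms and, by default, returns documents containing any of them. The following is a sketch of a query body that requires all terms, using the match query's operator option. `match_query` is a hypothetical helper, and the search call is left as a comment because it needs a live connection.

```python
def match_query(field, text, operator="or"):
    """Build a match query body; operator='and' requires every term."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


doc = match_query("name", "Ronald Williams", operator="and")
print(doc)

# With a live connection:
# res = es.search(index="users", body=doc, size=10)
```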

You can also use Lucene syntax for queries. In Lucene, you can specify field:value. When performing this kind of search, you do not need a document to send. You can pass the q parameter to the search method:

In [87]:
res=es.search(index="users",q="name:Ronald Goodman",size=10)
print(res['hits']['hits'][0]['_source'])

{'name': 'Ronald Williams', 'street': '969 Hughes Place Apt. 350', 'city': 'Smithmouth', 'zip': '50537'}


In [88]:
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Ronald Williams', 'street': '969 Hughes Place Apt. 350', 'city': 'Smithmouth', 'zip': '50537'}
{'name': 'Christopher Goodman', 'street': '348 Tran Square Apt. 042', 'city': 'Lake Codyburgh', 'zip': '08992'}
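
Note that the result above includes Christopher Goodman. In Lucene query-string syntax, the field prefix applies only to the term directly after it, so Goodman is searched without the name: prefix. Quoting the value keeps both words together as a phrase on the same field. The following is a minimal sketch: `lucene_phrase` is a hypothetical helper that escapes only backslashes and double quotes.

```python
def lucene_phrase(field, value):
    """Build field:"value" so the whole phrase is tied to one field."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


q = lucene_phrase("name", "Ronald Goodman")
print(q)

# With a live connection:
# res = es.search(index="users", q=q, size=10)
```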


Using the city field, you can search for Jamesberg. It will return two records: one for Jamesberg and one for Lake Jamesberg. Elasticsearch tokenizes strings with spaces in them, splitting them into multiple terms to search, so Lake Jamesberg also matches on the term Jamesberg:

In [95]:
# Get City Jamesberg - Returns Jamesberg and Lake Jamesberg
doc={"query":{"match":{"city":"Jamesberg"}}}
res=es.search(index="users",body=doc,size=10)
print(res['hits']['hits'])

[{'_index': 'users', '_type': '_doc', '_id': 'zpwaYHgBgZXbBKAYpK0B', '_score': 6.9401026, '_source': {'name': 'Kenneth Buchanan', 'street': '7916 Derek Mews', 'city': 'Jamesberg', 'zip': '41495'}}, {'_index': 'users', '_type': '_doc', '_id': 'mJ7iY3gBlPWjxInHb4Om', '_score': 5.2750554, '_source': {'name': 'Karen Johnson', 'street': '0474 Herrera Inlet Apt. 473', 'city': 'Lake Jamesberg', 'zip': '89575'}}]


In [96]:
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Kenneth Buchanan', 'street': '7916 Derek Mews', 'city': 'Jamesberg', 'zip': '41495'}
{'name': 'Karen Johnson', 'street': '0474 Herrera Inlet Apt. 473', 'city': 'Lake Jamesberg', 'zip': '89575'}


In [97]:
# with lucene syntax
res=es.search(index="users",q="city:Jamesberg",size=10)
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Kenneth Buchanan', 'street': '7916 Derek Mews', 'city': 'Jamesberg', 'zip': '41495'}
{'name': 'Karen Johnson', 'street': '0474 Herrera Inlet Apt. 473', 'city': 'Lake Jamesberg', 'zip': '89575'}


You can use Boolean queries to specify multiple search criteria. For example, you can use must, must_not, and should clauses around your queries. Using a Boolean query, you can filter out Lake Jamesberg. Using a must match on Jamesberg as the city (which will return two records), and adding a filter on the ZIP, you can make sure only Jamesberg with the ZIP 41495 is returned. You could also use a must_not query on the Lake Jamesberg ZIP:

In [105]:
# Get Jamesberg and filter on zip so Lake Jamesberg is removed
doc={"query":{"bool":{"must":{"match":{"city":"Jamesberg"}},
"filter":{"term":{"zip":"41495"}}}}}
res=es.search(index="users",body=doc,size=10)
print(res['hits']['hits'])

[{'_index': 'users', '_type': '_doc', '_id': 'zpwaYHgBgZXbBKAYpK0B', '_score': 6.9401026, '_source': {'name': 'Kenneth Buchanan', 'street': '7916 Derek Mews', 'city': 'Jamesberg', 'zip': '41495'}}]
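
The must_not alternative mentioned in the text can be sketched as follows. `bool_query` is a hypothetical helper that assembles the body from optional clause lists, and the ZIP 89575 is the Lake Jamesberg record from the earlier output. The search call itself is omitted because it needs a live connection.

```python
def bool_query(must=None, filter_=None, must_not=None):
    """Assemble a bool query body from optional clause lists."""
    clauses = {}
    if must:
        clauses["must"] = must
    if filter_:
        clauses["filter"] = filter_
    if must_not:
        clauses["must_not"] = must_not
    return {"query": {"bool": clauses}}


# Match Jamesberg but exclude the Lake Jamesberg ZIP.
doc = bool_query(
    must=[{"match": {"city": "Jamesberg"}}],
    must_not=[{"term": {"zip": "89575"}}],
)
print(doc)
```

Clauses that are not supplied are left out of the body, so the same helper can also build the must plus filter query shown above.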


## Using scroll to handle larger results

In the first example, you used a size of 10 for your search. You could have grabbed all 1,000 records, but what do you do when you have more than 10,000 and you need all of them? Elasticsearch has a scroll method that will allow you to iterate over the results until you get them all. To scroll through the data, follow the given steps:

1. Import the library and create your Elasticsearch instance:

In [139]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

2. Search your data. Since you do not have over 10,000 records, you will set the size to 500. This means you will be missing 500 records from your initial search. You will pass a new parameter, scroll, to the search method. This parameter specifies how long you want to make the results available for. I am using 20 minutes (20m). Adjust this number to make sure you have enough time to get the data; it will depend on the document size and network speed:

In [152]:
res = es.search(
    index='users',
    scroll='20m',
    size=500,
    body={"query":{"match_all":{}}}
)

3. The results will include _scroll_id, which you will need to pass to the scroll method later. Save the scroll ID and the size of the result set:

In [153]:
sid = res['_scroll_id']
size = res['hits']['total']['value']

4. To start scrolling, use a while loop to get records until the size is 0, meaning there is no more data. Inside the loop, you will call the scroll method and pass _scroll_id and how long to scroll. This will grab more of the results from the original query:

In [154]:
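i = 0  # total number of documents printed
j = 1  # page (batch) counter
while size > 0:
    sid = res['_scroll_id']
    size = len(res['hits']['hits'])  # 0 once the results are exhausted
    for doc in res['hits']['hits']:
        print(j, doc['_source'])
        i = i + 1
    # fetch the next page, keeping the scroll context alive for 20 minutes
    res = es.scroll(scroll_id=sid, scroll='20m')
    j += 1
print(i)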

1 {'name': 'Tammy Reyes', 'street': '3981 Ronnie Wells', 'city': 'Hannahland', 'zip': '83342'}
1 {'name': 'Micheal Wilson', 'street': '6214 Jason Orchard Apt. 150', 'city': 'Harrishaven', 'zip': '36760'}
1 {'name': 'Jeffery Chen', 'street': '228 Allison Manors', 'city': 'Port Jessica', 'zip': '37875'}
1 {'name': 'Bryce Drake', 'street': '181 Haley Stream Suite 205', 'city': 'Ethanton', 'zip': '47114'}
1 {'name': 'James Hodges', 'street': '8741 Aaron Stravenue', 'city': 'West Morgan', 'zip': '56433'}
1 {'name': 'Isabel Sandoval', 'street': '5214 Norris Lodge', 'city': 'South Heather', 'zip': '07033'}
1 {'name': 'Erica Mayo', 'street': '6577 Rogers Road Suite 340', 'city': 'Maryton', 'zip': '19551'}
1 {'name': 'Molly Jones', 'street': '680 Rice Inlet', 'city': 'Lake Joelland', 'zip': '16025'}
1 {'name': 'Megan Kelly', 'street': '323 Christopher Flat', 'city': 'Chadville', 'zip': '85058'}
1 {'name': 'Crystal Graham', 'street': '70221 Bishop Village Apt. 961', 'city': 'West Megan', 'zip': 
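
The scroll loop can also be packaged as a generator that releases the scroll context when it finishes. The following is a sketch under stated assumptions: `scroll_all` is a hypothetical helper, and `FakeClient` is a stand-in that mimics only the search, scroll, and clear_scroll calls used here, so the code runs without a cluster. With a real connection you would pass `es` instead.

```python
def scroll_all(client, index, body, page_size=500, keep_alive="20m"):
    """Yield every hit of a query by following the scroll cursor."""
    res = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    sid = res["_scroll_id"]
    try:
        while res["hits"]["hits"]:
            for hit in res["hits"]["hits"]:
                yield hit
            res = client.scroll(scroll_id=sid, scroll=keep_alive)
            sid = res["_scroll_id"]
    finally:
        # Release the server-side scroll context when done or interrupted.
        client.clear_scroll(scroll_id=sid)


class FakeClient:
    """Stand-in for Elasticsearch so the sketch runs without a cluster."""

    def __init__(self, docs):
        self.docs, self.pos, self.size, self.cleared = docs, 0, 0, False

    def _page(self):
        hits = [{"_source": d} for d in self.docs[self.pos:self.pos + self.size]]
        self.pos += self.size
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        self.pos, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


client = FakeClient([{"n": i} for i in range(1000)])
hits = list(scroll_all(client, "users", {"query": {"match_all": {}}}))
print(len(hits))
```

The elasticsearch library also ships a helpers.scan function that wraps this pattern, which may be preferable to a hand-written loop.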