## Inserting and extracting NoSQL database data in Python

### Inserting data into Elasticsearch

##### Before you can query Elasticsearch, you will need to load some data into an index. In the previous section, you used a library, psycopg2, to access PostgreSQL.
##### To access Elasticsearch, you will use the elasticsearch library. To load data, you first create the connection, then you can issue commands to Elasticsearch. Follow the given steps to add a record to Elasticsearch:

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Import the libraries. You can also create the Faker object to generate random data:
from elasticsearch import Elasticsearch
from faker import Faker
fake=Faker()

In [3]:
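#Create a connection to Elasticsearch using your own password (verify_certs=False skips TLS certificate checks, which is acceptable only for a local test instance)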
es=Elasticsearch('https://localhost:9200', verify_certs=False, basic_auth=('elastic', 'IsGl-m-KckB5HC4h1aak'))

##### Now, you can issue commands to your Elasticsearch instance. The index method will allow you to add data. The method takes an index name, an optional document ID, and the document itself.

In [4]:
#The following code creates a JSON object to add to the database, then uses index to send it to the users index (which will be created automatically during the index operation):
doc={"name": fake.name(),"street": fake.street_address(),"city": fake.city(),"zip":fake.zipcode()}
res=es.index(index="users",id=1,document=doc)
print(res['result'])

created
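
One detail about the cell above: because the document ID is fixed at 1, running the cell a second time overwrites the same document, and the response reports `updated` instead of `created`. The sketch below illustrates this with a hypothetical helper, `index_user`, and a small stand-in client, `FakeClient`. Both names and the sample document are assumptions made so that the example runs without a server. The stand-in imitates only the `result` field of the real response.

```python
class FakeClient:
    """Stand-in for an Elasticsearch client; mimics only the 'result' field of index()."""

    def __init__(self):
        self.store = {}

    def index(self, index, id, document):
        key = (index, str(id))
        # A new ID is "created"; an existing ID is overwritten and reported as "updated".
        result = "updated" if key in self.store else "created"
        self.store[key] = document
        return {"_index": index, "_id": str(id), "result": result}


def index_user(client, doc, doc_id, index="users"):
    """Index one document and return the result string from the response."""
    res = client.index(index=index, id=doc_id, document=doc)
    if res["result"] not in ("created", "updated"):
        raise RuntimeError(f"unexpected result: {res['result']}")
    return res["result"]


client = FakeClient()
# Placeholder document, not taken from the notebook's data.
doc = {"name": "Jane Doe", "street": "1 Main St", "city": "Springfield", "zip": "00001"}
first = index_user(client, doc, 1)
second = index_user(client, doc, 1)
print(first, second)
```

With a live connection you would pass the `es` object in place of `FakeClient()`.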


### Inserting data using helpers

##### Using the bulk method, you can insert many documents at a time. The process is similar to inserting a single record, except that you will generate all the data, then insert it. The steps are as follows

In [5]:
#You need to import the helpers library to access the bulk method
from elasticsearch import helpers

In [6]:
#Create a connection to Elasticsearch using your own password
es=Elasticsearch('https://localhost:9200', verify_certs=False, basic_auth=('elastic', 'IsGl-m-KckB5HC4h1aak'))

In [7]:
actions = [
    {
        "_index": "users",
        "_source": {
            "name": fake.name(),
            "street": fake.street_address(),
            "city": fake.city(),
            "zip": fake.zipcode(),
        },
    }
    for x in range(998)  # or for i, r in df.iterrows()
]

In [8]:
#Now, you can call the bulk method and pass it the elasticsearch instance and the array of data.
#You can print the results to check that it worked
res = helpers.bulk(es, actions)
print(res[0])

998
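
The comment in the actions cell mentions building the actions from DataFrame rows. A minimal sketch of that idea follows, assuming the rows are already plain dictionaries (for a DataFrame, `df.to_dict("records")` produces such a list). The helper name `make_actions` and the sample rows are hypothetical. Since `helpers.bulk` accepts any iterable of actions, a generator avoids building the whole list in memory.

```python
def make_actions(rows, index="users"):
    """Yield one bulk action per row; each row is a plain dict of field values."""
    for row in rows:
        yield {"_index": index, "_source": dict(row)}


# Placeholder rows, not taken from the notebook's data.
rows = [
    {"name": "Jane Doe", "city": "Springfield", "zip": "00001"},
    {"name": "John Roe", "city": "Shelbyville", "zip": "00002"},
]

actions = list(make_actions(rows))
print(len(actions), actions[0]["_index"])

# With a live connection the generator can be passed directly:
# helpers.bulk(es, make_actions(rows))
```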


### Querying Elasticsearch

##### Querying Elasticsearch follows the exact same steps as inserting data. The only difference is you use a different method – search – to send a different body object.

In [8]:
#Import the library and create your elasticsearch instance:
from elasticsearch import Elasticsearch

In [9]:
es=Elasticsearch('https://localhost:9200', verify_certs=False, basic_auth=('elastic', 'IsGl-m-KckB5HC4h1aak'))

In [10]:
#Create the JSON object to send to Elasticsearch. The object is a query, using the match_all search
doc={"query":{"match_all":{}}}

In [11]:
#Pass the object to Elasticsearch using the search method. Pass the index and the return size
res=es.search(index="users",body=doc,size=10)

In [12]:
#Lastly, you can print the documents:
print(res['hits']['hits'])

[{'_index': 'users', '_id': '1', '_score': 1.0, '_source': {'name': 'Teresa Bowen', 'street': '8302 Scott Stravenue Apt. 764', 'city': 'Frazierstad', 'zip': '75764'}}, {'_index': 'users', '_id': 'ewIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Shelby Allen', 'street': '557 Jackson Plaza Suite 077', 'city': 'West Manuel', 'zip': '69405'}}, {'_index': 'users', '_id': 'fAIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Ashley Cantu', 'street': '1330 Tara Coves', 'city': 'South Adrienne', 'zip': '75215'}}, {'_index': 'users', '_id': 'fQIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Paul Khan', 'street': '245 Jonathan Park', 'city': 'Karafurt', 'zip': '98041'}}, {'_index': 'users', '_id': 'fgIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Roy Smith', 'street': '28299 Simmons Tunnel Suite 628', 'city': 'West Miguelview', 'zip': '35409'}}, {'_index': 'users', '_id': 'fwIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Shannon Ochoa', 'street': '414

In [13]:
#Or you can iterate through grabbing _source only
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Teresa Bowen', 'street': '8302 Scott Stravenue Apt. 764', 'city': 'Frazierstad', 'zip': '75764'}
{'name': 'Shelby Allen', 'street': '557 Jackson Plaza Suite 077', 'city': 'West Manuel', 'zip': '69405'}
{'name': 'Ashley Cantu', 'street': '1330 Tara Coves', 'city': 'South Adrienne', 'zip': '75215'}
{'name': 'Paul Khan', 'street': '245 Jonathan Park', 'city': 'Karafurt', 'zip': '98041'}
{'name': 'Roy Smith', 'street': '28299 Simmons Tunnel Suite 628', 'city': 'West Miguelview', 'zip': '35409'}
{'name': 'Shannon Ochoa', 'street': '41402 Wagner Meadows', 'city': 'New Patriciaville', 'zip': '07411'}
{'name': 'Dr. Paul Edwards', 'street': '6040 Gary Plaza Suite 641', 'city': 'New Josephborough', 'zip': '03444'}
{'name': 'Brittany Miller', 'street': '9783 Navarro Mountain Suite 884', 'city': 'Smithville', 'zip': '84477'}
{'name': 'Marie Swanson', 'street': '93168 Leah Plains', 'city': 'Stewartfurt', 'zip': '49459'}
{'name': 'David Clark', 'street': '2656 Bauer Tunnel', 'city': 'Jason

In [14]:
#Load the query results into a pandas DataFrame
#To load the results into a DataFrame, use the json_normalize function from pandas on the JSON results, as shown in the following code
#(the older pandas.io.json import path has been deprecated since pandas 1.0)
import pandas as pd
df=pd.json_normalize(res['hits']['hits'])
df

| | _index | _id | _score | _source.name | _source.street | _source.city | _source.zip |
|---|---|---|---|---|---|---|---|
| 0 | users | 1 | 1.0 | Teresa Bowen | 8302 Scott Stravenue Apt. 764 | Frazierstad | 75764 |
| 1 | users | ewIerYMBJc2NpmmKse6T | 1.0 | Shelby Allen | 557 Jackson Plaza Suite 077 | West Manuel | 69405 |
| 2 | users | fAIerYMBJc2NpmmKse6T | 1.0 | Ashley Cantu | 1330 Tara Coves | South Adrienne | 75215 |
| 3 | users | fQIerYMBJc2NpmmKse6T | 1.0 | Paul Khan | 245 Jonathan Park | Karafurt | 98041 |
| 4 | users | fgIerYMBJc2NpmmKse6T | 1.0 | Roy Smith | 28299 Simmons Tunnel Suite 628 | West Miguelview | 35409 |
| 5 | users | fwIerYMBJc2NpmmKse6T | 1.0 | Shannon Ochoa | 41402 Wagner Meadows | New Patriciaville | 07411 |
| 6 | users | gAIerYMBJc2NpmmKse6T | 1.0 | Dr. Paul Edwards | 6040 Gary Plaza Suite 641 | New Josephborough | 03444 |
| 7 | users | gQIerYMBJc2NpmmKse6T | 1.0 | Brittany Miller | 9783 Navarro Mountain Suite 884 | Smithville | 84477 |
| 8 | users | ggIerYMBJc2NpmmKse6T | 1.0 | Marie Swanson | 93168 Leah Plains | Stewartfurt | 49459 |
| 9 | users | gwIerYMBJc2NpmmKse6T | 1.0 | David Clark | 2656 Bauer Tunnel | Jasonmouth | 51047 |
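
The column names such as `_source.name` come from flattening the nested `_source` dictionary of each hit and joining the keys with a dot. The sketch below reproduces that naming in plain Python so the effect is visible without pandas. It is an illustration of the idea, not pandas' actual implementation, and the function name `flatten` is an assumption.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# First hit from the search output shown earlier.
hit = {
    "_index": "users",
    "_id": "1",
    "_score": 1.0,
    "_source": {
        "name": "Teresa Bowen",
        "street": "8302 Scott Stravenue Apt. 764",
        "city": "Frazierstad",
        "zip": "75764",
    },
}

row = flatten(hit)
print(list(row))
```

Note that the ZIP stays a string after flattening, which preserves leading zeros such as `07411`.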


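##### Browsing the match_all results shows which values exist in the index; I know, for example, that I have a document with the name Ronald Goodman. You can query on a single field using the match query. The following example matches Daniel in the name field and prints the second hit: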

In [15]:
doc={"query":{"match":{"name":"Daniel"}}}
res=es.search(index="users",body=doc, size=10)
print(res['hits']['hits'][1]['_source'])

{'name': 'Daniel Phillips', 'street': '5453 Arnold View', 'city': 'Isabelport', 'zip': '18599'}


##### You can also use a Lucene syntax for queries. In Lucene, you can specify field:value.
##### When performing this kind of search, you do not need a document to send. You can pass the q parameter to the search method:

In [16]:
res=es.search(index="users",q="name:Ronald Goodman",size=10)
print(res['hits']['hits'][0]['_source'])

{'name': 'Brandon Smith', 'street': '58823 Goodman Lodge', 'city': 'South Vicki', 'zip': '92546'}
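
The hit printed above is not Ronald Goodman. The most likely reason is that, in Lucene query-string syntax, the `name:` prefix applies only to the term that immediately follows it, so `Goodman` is searched in the default fields and matches the street `58823 Goodman Lodge`. Wrapping the value in double quotes binds the whole phrase to the field. The sketch below builds such a query string; the helper name `field_phrase` is hypothetical.

```python
def field_phrase(field, value):
    """Build a Lucene query string that matches `value` as a phrase in `field`."""
    # Escape backslashes first, then double quotes, so the phrase stays intact.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


q = field_phrase("name", "Ronald Goodman")
print(q)

# With a live connection:
# res = es.search(index="users", q=q, size=10)
```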


##### Using the city field, you can search for East. It will return many records, because Elasticsearch tokenizes strings with spaces in them, splitting them into multiple terms to search, so East matches East Robert, East James, and so on:

In [17]:
doc={"query":{"match":{"city":"East"}}}
res=es.search(index="users",body=doc,size=10)
print(res['hits']['hits'])

[{'_index': 'users', '_id': 'kAIerYMBJc2NpmmKse6T', '_score': 2.2545793, '_source': {'name': 'Michael Decker', 'street': '3906 Stephanie Canyon Suite 493', 'city': 'East Robert', 'zip': '49662'}}, {'_index': 'users', '_id': 'qwIerYMBJc2NpmmKse6T', '_score': 2.2545793, '_source': {'name': 'Jennifer Gould', 'street': '340 Danielle Spur', 'city': 'East James', 'zip': '98182'}}, {'_index': 'users', '_id': 'rQIerYMBJc2NpmmKse6T', '_score': 2.2545793, '_source': {'name': 'Brittney Johnson', 'street': '131 Daniel Harbor', 'city': 'East Adam', 'zip': '76521'}}, {'_index': 'users', '_id': 'xwIerYMBJc2NpmmKse6U', '_score': 2.2545793, '_source': {'name': 'Jeffrey Russo Jr.', 'street': '73006 Lopez Streets', 'city': 'East Victoria', 'zip': '56986'}}, {'_index': 'users', '_id': 'ywIerYMBJc2NpmmKse6U', '_score': 2.2545793, '_source': {'name': 'Kevin White', 'street': '98026 Carly Ramp Suite 143', 'city': 'East Brandon', 'zip': '99929'}}, {'_index': 'users', '_id': '4QIerYMBJc2NpmmKse6U', '_score': 2
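
As a rough illustration of why a search for East finds East Robert, the sketch below imitates tokenization with a deliberately simplified analyzer: lowercase the text and split on anything that is not a letter or digit. This is only an approximation of what Elasticsearch does, and the function names are assumptions. The city `Eastport` is a made-up value added to show that a token must match as a whole word.

```python
import re


def simple_analyze(text):
    """Very rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def matches(query, field_value):
    """Report a hit when any query token also appears among the field's tokens."""
    return bool(set(simple_analyze(query)) & set(simple_analyze(field_value)))


# "Eastport" is hypothetical; the other cities appear in the outputs above.
cities = ["East Robert", "West Manuel", "Eastport", "East James"]
hits = [city for city in cities if matches("East", city)]
print(hits)
```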

##### You can use Boolean queries to specify multiple search criteria. For example, you can use must, must_not, and should clauses around your queries, and a filter clause to narrow the results without affecting the score. You could exclude a ZIP by placing a term query in a must_not clause. The following query instead uses a must match on East as the city (which on its own returns many records) and adds a filter on the ZIP, so that only documents in an East city with the ZIP 89229 are returned. No document in this data set satisfies both conditions, so the result is an empty list:

In [18]:
doc={"query":{"bool":{"must":{"match":{"city":"East"}},"filter":{"term":{"zip":"89229"}}}}}
res=es.search(index="users",body=doc,size=10)
print(res['hits']['hits'])

[]
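
Nested query dictionaries are easy to mistype, so one option is a small builder. The sketch below, with the hypothetical helper name `bool_query`, produces the same body as the cell above and a `must_not` variant that excludes the ZIP instead of requiring it.

```python
def bool_query(must=None, filter_=None, must_not=None, should=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {"must": must, "filter": filter_, "must_not": must_not, "should": should}
    return {"query": {"bool": {k: v for k, v in clauses.items() if v is not None}}}


# Same body as the cell above: city must match East, ZIP must equal 89229.
doc = bool_query(
    must={"match": {"city": "East"}},
    filter_={"term": {"zip": "89229"}},
)

# Variant: city must match East, but the ZIP must not be 89229.
exclude = bool_query(
    must={"match": {"city": "East"}},
    must_not={"term": {"zip": "89229"}},
)
print(doc)
print(exclude)
```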


### Using scroll to handle larger results

##### In the first example, you used a size of 10 for your search. You could have grabbed all 999 records in one request, but what do you do when you have more than 10,000 (the default limit for a single search) and you need all of them? Elasticsearch has a scroll method that will allow you to iterate over the results until you get them all.

In [19]:
import warnings
warnings.filterwarnings('ignore')

In [20]:
#Import the library and create your Elasticsearch instance:
from elasticsearch import Elasticsearch
es=Elasticsearch('https://localhost:9200', verify_certs=False, basic_auth=('elastic', 'IsGl-m-KckB5HC4h1aak'))

##### You will pass a new parameter to the search method: scroll. This parameter specifies how long you want to make the results available for. I am using 20 minutes ('20m'; 20 milliseconds would be written '20ms'). Adjust this number to make sure you have enough time to get the data; it will depend on the document size and network speed.
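
Elasticsearch time values combine a number with a unit suffix, so `'20m'` and `'20ms'` differ by a factor of 60,000. The sketch below converts a few of these values to milliseconds to make the difference concrete. The helper name `to_millis` is an assumption, and only the suffixes `d`, `h`, `m`, `s`, and `ms` are handled here; the smaller units `micros` and `nanos` are left out.

```python
import re

# Milliseconds per unit for the suffixes handled in this sketch.
_UNIT_MILLIS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def to_millis(value):
    """Convert a time value such as '20m' or '20ms' to whole milliseconds."""
    match = re.fullmatch(r"(\d+)(d|h|m|s|ms)", value)
    if match is None:
        raise ValueError(f"not a supported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MILLIS[unit]


print(to_millis("20m"), to_millis("20ms"))
```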

In [21]:
res = es.search(index = 'users', scroll = '20m', size = 500, body = {"query":{"match_all":{}}})
res

ObjectApiResponse({'_scroll_id': 'FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmYxcGpuY21hUUlDRmVDbE9LV1VXSGcAAAAAAAACFxZYdFZ6ZExVYlNDMmdfdWhkUG1Fc0Vn', 'took': 10, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 999, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'users', '_id': '1', '_score': 1.0, '_source': {'name': 'Teresa Bowen', 'street': '8302 Scott Stravenue Apt. 764', 'city': 'Frazierstad', 'zip': '75764'}}, {'_index': 'users', '_id': 'ewIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Shelby Allen', 'street': '557 Jackson Plaza Suite 077', 'city': 'West Manuel', 'zip': '69405'}}, {'_index': 'users', '_id': 'fAIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Ashley Cantu', 'street': '1330 Tara Coves', 'city': 'South Adrienne', 'zip': '75215'}}, {'_index': 'users', '_id': 'fQIerYMBJc2NpmmKse6T', '_score': 1.0, '_source': {'name': 'Paul Khan', 'street': '245 Jonathan Park', 'city':

In [22]:
#The results will include _scroll_id, which you will need to pass to the scroll method later
#Save the scroll ID and the size of the result set
sid = res['_scroll_id']
size = res['hits']['total']['value']
sid

'FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmYxcGpuY21hUUlDRmVDbE9LV1VXSGcAAAAAAAACFxZYdFZ6ZExVYlNDMmdfdWhkUG1Fc0Vn'

In [23]:
size

999

##### To start scrolling, use a while loop to get records until the size is 0, meaning there is no more data. Inside the loop, you will call the scroll method and pass _scroll_id and how long to scroll. This will grab more of the results from the original query

In [None]:
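while size > 0:
    res = es.scroll(scroll_id=sid, scroll='20m')  # fetch the next page
    sid = res['_scroll_id']                       # the scroll ID can change between pages
    size = len(res['hits']['hits'])               # an empty page ends the loop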

In [24]:
#get the new scroll ID and the size so that you can loop through again if the data still exists
sid = res['_scroll_id']
size = len(res['hits']['hits'])
sid

'FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmYxcGpuY21hUUlDRmVDbE9LV1VXSGcAAAAAAAACFxZYdFZ6ZExVYlNDMmdfdWhkUG1Fc0Vn'

In [25]:
size

500

In [26]:
#Lastly, you can do something with the results of the scrolls. In the following code,
#you will print the source for every record
for doc in res['hits']['hits']:
    print(doc['_source'])

{'name': 'Teresa Bowen', 'street': '8302 Scott Stravenue Apt. 764', 'city': 'Frazierstad', 'zip': '75764'}
{'name': 'Shelby Allen', 'street': '557 Jackson Plaza Suite 077', 'city': 'West Manuel', 'zip': '69405'}
{'name': 'Ashley Cantu', 'street': '1330 Tara Coves', 'city': 'South Adrienne', 'zip': '75215'}
{'name': 'Paul Khan', 'street': '245 Jonathan Park', 'city': 'Karafurt', 'zip': '98041'}
{'name': 'Roy Smith', 'street': '28299 Simmons Tunnel Suite 628', 'city': 'West Miguelview', 'zip': '35409'}
{'name': 'Shannon Ochoa', 'street': '41402 Wagner Meadows', 'city': 'New Patriciaville', 'zip': '07411'}
{'name': 'Dr. Paul Edwards', 'street': '6040 Gary Plaza Suite 641', 'city': 'New Josephborough', 'zip': '03444'}
{'name': 'Brittany Miller', 'street': '9783 Navarro Mountain Suite 884', 'city': 'Smithville', 'zip': '84477'}
{'name': 'Marie Swanson', 'street': '93168 Leah Plains', 'city': 'Stewartfurt', 'zip': '49459'}
{'name': 'David Clark', 'street': '2656 Bauer Tunnel', 'city': 'Jason
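
Putting the scroll steps together, a minimal sketch of a complete loop might look like the following. It processes the first page returned by search as well as the pages returned by scroll, and it clears the scroll context when finished. The generator name `scroll_all` and the stand-in `FakeScrollClient` are assumptions made so that the example runs without a server; the stand-in imitates only the paging behaviour.

```python
class FakeScrollClient:
    """Stand-in that pages through a list the way search/scroll page through an index."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.page_size = 0
        self.cleared = False

    def _next_page(self):
        hits = [{"_source": d} for d in self.docs[self.pos:self.pos + self.page_size]]
        self.pos += len(hits)
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def search(self, index, scroll, size, body):
        self.pos, self.page_size = 0, size
        return self._next_page()

    def scroll(self, scroll_id, scroll):
        return self._next_page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


def scroll_all(client, index, query, page_size=500, keep_alive="20m"):
    """Yield the _source of every matching document, one page at a time."""
    res = client.search(index=index, scroll=keep_alive, size=page_size, body=query)
    sid = res["_scroll_id"]
    try:
        while res["hits"]["hits"]:  # an empty page means there is no more data
            for hit in res["hits"]["hits"]:
                yield hit["_source"]
            res = client.scroll(scroll_id=sid, scroll=keep_alive)
            sid = res["_scroll_id"]
    finally:
        # Release the server-side scroll context instead of waiting for it to expire.
        client.clear_scroll(scroll_id=sid)


# 999 placeholder documents, the same count as the users index above.
client = FakeScrollClient([{"n": i} for i in range(999)])
docs = list(scroll_all(client, "users", {"query": {"match_all": {}}}))
print(len(docs))
```

With a live connection you would pass `es` instead of the stand-in. The elasticsearch library also provides `helpers.scan`, which wraps this pattern and may be the simpler choice in practice.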