# Obtaining Documents from Elasticsearch

This notebook demonstrates how to query for documents from Elasticsearch.


### Configuration
First, ensure that the appropriate credentials are stored in your AWS credentials file at `~/.aws/credentials`.

These should be stored under the `wmuser` profile with something like:

```
[wmuser]
aws_access_key_id = WMUSER_ACCESS_KEY
aws_secret_access_key = WMUSER_SECRET_KEY
```

> Note that this profile must be specified by name when creating the `boto3` session.
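
Before connecting, it can help to confirm that the credentials file really defines the `wmuser` profile with both keys. The following is a minimal sketch using only the standard library; the helper name `has_profile` is hypothetical and is not part of `boto3`.

```python
import configparser


def has_profile(credentials_text, profile="wmuser"):
    """Return True if the credentials text defines `profile` with both keys.

    `has_profile` is a hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Placeholder content mirroring the profile shown above (not real keys).
sample = """
[wmuser]
aws_access_key_id = WMUSER_ACCESS_KEY
aws_secret_access_key = WMUSER_SECRET_KEY
"""

print(has_profile(sample))
```

To check the real file, read `~/.aws/credentials` into a string (for example with `pathlib.Path.home() / ".aws" / "credentials"`) and pass its contents to the helper.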

### Requirements

```
pip install requests-aws4auth==0.9
pip install elasticsearch==7.0.2
pip install boto3==1.9.172
```

## Connecting to Elasticsearch
First, connect to Elasticsearch using AWS authentication. Each request to the cluster is signed with the `wmuser` credentials, which the queries later in this notebook rely on.

In [1]:
import boto3
import json
from elasticsearch import Elasticsearch, RequestsHttpConnection
from elasticsearch.helpers import scan
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
service = 'es'
eshost = 'search-world-modelers-dev-gjvcliqvo44h4dgby7tn3psw74.us-east-1.es.amazonaws.com'

session = boto3.Session(region_name=region, profile_name='wmuser')
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key
token = credentials.token

aws_auth = AWS4Auth(
    access_key,
    secret_key,
    region,
    service,
    session_token=token
)



In [2]:
es = Elasticsearch(
    hosts = [{'host': eshost, 'port': 443}],
    http_auth=aws_auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

print(json.dumps(es.info(), indent=2))

{
  "name": "ZhaR9MU",
  "cluster_name": "342635568055:world-modelers-dev",
  "cluster_uuid": "nGeAO1lMTKaG6_LOpSg17w",
  "version": {
    "number": "6.7.0",
    "build_flavor": "oss",
    "build_type": "zip",
    "build_hash": "8453f77",
    "build_date": "2019-04-17T05:34:35.022392Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}


## Querying from Elasticsearch
We can pull documents from Elasticsearch with a variety of queries using [Elasticsearch's Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl.html).

First, we can try querying for documents based on the Tika-extracted text:

In [3]:
index = "wm-dev"

In [4]:
query = {
    "query": {
        "query_string" : {
            "default_field" : "extracted_text.tika", # Ensure we use the correct field (could search on `title` as well)
            "query" : "refugee AND aid AND (addis OR NGO)" # Lucene query syntax
        }
    }
}
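
Since the same `query_string` structure works for other fields such as `title`, the query body can be built by a small function. This is a sketch only; `build_text_query` is a hypothetical helper name, and the `title` query string is an illustrative example rather than one taken from the notebook.

```python
def build_text_query(query, field="extracted_text.tika"):
    """Build a `query_string` request body for a single default field.

    Hypothetical helper: it only assembles the dictionary shown above.
    """
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": query,  # Lucene query syntax
            }
        }
    }


# Same query as above, against the Tika-extracted text.
tika_query = build_text_query("refugee AND aid AND (addis OR NGO)")

# Illustrative variant that searches the `title` field instead.
title_query = build_text_query("refugee AND aid", field="title")
```

Either dictionary can be passed as `body` to `es.search` in the same way as `query`.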

In [5]:
results = es.search(index=index, body=query)['hits']['hits']
print(f"The first result file name is: {results[0]['_source']['file_name']}")

The first result file name is: Aid_workers_killed,_kidnapped_and_arrested_Dec-17.pdf
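
Indexing `results[0]` raises an `IndexError` when a query matches nothing, and `es.search` returns only the first 10 hits unless `size` is set. A minimal sketch of a safer way to list what came back follows; `summarize_hits` is a hypothetical helper, and the sample response is made-up data in the shape of a search response, not real output from the index.

```python
def summarize_hits(response):
    """Return (score, file_name) pairs for each hit; empty list if no hits.

    Hypothetical helper operating on the dictionary returned by `es.search`.
    """
    hits = response.get("hits", {}).get("hits", [])
    return [
        (hit.get("_score"), hit.get("_source", {}).get("file_name"))
        for hit in hits
    ]


# Made-up response for illustration only.
fake_response = {
    "hits": {
        "hits": [
            {"_score": 2.5, "_source": {"file_name": "example_a.pdf"}},
            {"_score": 1.0, "_source": {"file_name": "example_b.pdf"}},
        ]
    }
}

for score, name in summarize_hits(fake_response):
    print(score, name)
```

With a live connection, the same function could be called as `summarize_hits(es.search(index=index, body=query))`.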


In [6]:
query = {
    "query": {
        "match_all": {}
    }
}

count = es.count(index=index, body=query)['count']
print("There are {0} total documents in the {1} index.".format(count,index))

There are 356 total documents in the wm-dev index.
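
A search response also reports a total under `hits.total`, but its shape depends on the cluster version: Elasticsearch 6.x returns a plain integer, while 7.x returns an object with `value` and `relation` keys. The sketch below normalizes both forms; `total_hits` is a hypothetical helper and the sample responses are made up.

```python
def total_hits(response):
    """Return the hit total from a search response as an integer.

    Handles the 6.x form (an int) and the 7.x form
    ({"value": ..., "relation": ...}). Hypothetical helper.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return total["value"]
    return total


# Made-up responses illustrating the two shapes.
response_6x = {"hits": {"total": 356, "hits": []}}
response_7x = {"hits": {"total": {"value": 356, "relation": "eq"}, "hits": []}}

print(total_hits(response_6x), total_hits(response_7x))
```

In 7.x the reported value can be a lower bound when `relation` is `"gte"`, so `es.count` remains the more reliable way to get an exact figure.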


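For larger queries or bulk downloads you can use the `scan` helper, which wraps the scroll API and yields every matching document rather than a single page of hits: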

In [7]:
scanner = scan(es,
    query = query,
    index = index
)
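
For a bulk download, the hits yielded by `scan` can be streamed to disk one at a time instead of being held in memory. The sketch below writes each hit as one JSON line; `write_jsonl` is a hypothetical helper, and the fake hits stand in for a live `scan(...)` generator so the example runs without a cluster.

```python
import io
import json


def write_jsonl(hits, handle):
    """Write each hit's `_id` and `_source` fields as one JSON line.

    `hits` can be any iterable of hit dictionaries, such as the generator
    returned by `scan`. Returns the number of records written.
    """
    count = 0
    for hit in hits:
        record = {"_id": hit.get("_id"), **hit.get("_source", {})}
        handle.write(json.dumps(record) + "\n")
        count += 1
    return count


# Made-up hits for illustration; a real run would pass a `scan(...)` generator.
fake_hits = iter([
    {"_id": "1", "_source": {"file_name": "example_a.pdf"}},
    {"_id": "2", "_source": {"file_name": "example_b.pdf"}},
])

buffer = io.StringIO()
written = write_jsonl(fake_hits, buffer)
print(written)
```

Because `scan` returns a generator, it can be consumed only once; create a fresh one for each pass, and open a real file with `open(path, "w")` in place of the in-memory buffer.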

In [8]:
for doc in scanner:
    # do something with `doc`
    print(doc['_source']['file_name'])

South_Sudan_Humanitarian_Situation_Report_30-Nov-17.pdf
Integrated_Disease_Surveillance_and_Response_(IDSR)_Epidemiological_Update_8-Jan-18.pdf
Integrated_Disease_Surveillance_and_Response_(IDSR)_Epidemiological_Update_11-Dec-17.pdf
South_Sudan_Crop_Watch_Updates_to_3rd_Dekad_of_July_2017_1-Aug-17.pdf
South_Sudan_Humanitarian_Situation_Report_30-Apr-18.pdf
Integrated_Disease_Surveillance_and_Response_(IDSR)_Annexes_25-Sep-17.pdf
Integrated_Disease_Surveillance_and_Response_(IDSR)_Epidemiological_Update_5-Mar-18.pdf
South_Sudan_Humanitarian_Situation_Report_9-Apr-18.pdf
South_Sudan_Regional_Refugee_Response_Plan_-_At_a_Glance_Dec-17.pdf
Integrated_Disease_Surveillance_and_Response_(IDSR)_Epidemiological_Update_2-Oct-17.pdf
WFP_South_Sudan_Situation_Report_17-Jul-17.pdf
SOUTH_SUDAN_Food_Security_Outlook_Update_Dec-17.pdf
East_Africa_Juba_Conflict_Hits_Uganda_s_Economy_11-May-17.html
WFP_South_Sudan_Situation_Report_2-Dec-17.pdf
South_Sudan_mVAM_Bulletin_Food_Security_Monitoring,_phone_