## What does this Notebook achieve?

Purpose-built databases enable data access patterns that are difficult to implement otherwise. Many customers are looking to solve their business problems by storing and integrating data across a combination of purpose-built databases. For example, we can model highly connected geospatial data as a graph and store it in Amazon Neptune. We can query such datasets quickly and at massive scale using a graph data model. Another purpose-built database, Amazon Elasticsearch (OpenSearch) Service, can store geospatial data and provide powerful geo queries in addition to its full-text search capabilities.

This artifact provides the code samples for combining Amazon Neptune with Amazon Elasticsearch (OpenSearch) to perform geospatial queries on a dataset that's synchronized between both purpose-built datastores.

Some common use cases where geospatial querying is required are:
* Given an entity in the dataset, find another entity in that dataset that is located the closest to it on the surface of the earth
* Given an entity and a geographical radius parameter, find all entities located within this radius of the given entity

Answering the first question in a graph context, where the entities are commonly connected by edges, is usually not computationally challenging, because the candidate set is typically limited to the nodes directly connected to the starting node.
Answering the second question without a persistence layer that has built-in geospatial radius queries can become challenging, because the query typically has to consider every eligible entity in the graph.
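
To make the difference concrete, here is a minimal Python sketch of both questions answered by brute force. It is only an illustration: the helper names (`haversine_miles`, `nearest`, `within_radius`) and the Earth radius constant are assumptions, and the coordinates are the NYC sample values used in this notebook.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, an approximation


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def nearest(origin, candidates):
    """Question 1: id of the candidate closest to origin (small, edge-connected set)."""
    return min(candidates, key=lambda cid: haversine_miles(origin, candidates[cid]))


def within_radius(origin, candidates, radius_miles):
    """Question 2: ids of all candidates inside the radius (scans every entity)."""
    return sorted(cid for cid, point in candidates.items()
                  if haversine_miles(origin, point) <= radius_miles)


dc = (40.7128, 74.0060)
stores = {
    "nyc_store_1": (40.7111, 74.0080),
    "nyc_store_2": (40.8111, 74.0180),
    "nyc_store_3": (40.9111, 74.0280),
    "nyc_store_4": (40.7128, 74.1061),
}

print(nearest(dc, stores))
print(within_radius(dc, stores, 1.0))
```

The scan in `within_radius` touches every candidate, which is fine for four stores but grows linearly with the dataset. That cost is what a datastore with a geospatial index avoids.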

### Prerequisites
You need an Amazon Neptune cluster to store the geospatial data in a graph data model. You also need to provision a managed Amazon SageMaker notebook and attach it to the Neptune database cluster.
Our documentation provides step-by-step instructions for configuring this workload, including CloudFormation templates:
* Create Neptune cluster: https://docs.aws.amazon.com/neptune/latest/userguide/get-started-create-cluster.html
* Create a SageMaker hosted Notebook: https://docs.aws.amazon.com/neptune/latest/userguide/graph-notebooks.html
* Configure the Neptune-to-Elasticsearch integration (guide and CloudFormation template): https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search-cfn-create.html

Once the prerequisites are in place, you can follow along with the code samples in the Jupyter notebook to run the geospatial queries.

## Generate a fictitious dataset in the graph

In [None]:
%%gremlin
g.addV('distribution_center').property('dc_id', 'dc_1').property('coordinates', '40.7128,74.0060').as('nyc_dc')
.addV('distribution_center').property('dc_id', 'dc_2').property('coordinates', '37.7749,122.4194').as('sf_dc')
.addV('store').property('store_id', 'nyc_store_1').property('coordinates', '40.7111,74.0080')
.property(single, 'address', '100 Main St').as('nyc_store_1')   
.addV('store').property('store_id', 'nyc_store_2').property('coordinates', '40.8111,74.0180')
.property(single, 'address', '100 Other St').as('nyc_store_2')   
.addV('store').property('store_id', 'nyc_store_3').property('coordinates', '40.9111,74.0280')
.property(single, 'address', '100 Another St').as('nyc_store_3')
.addV('store').property('store_id', 'nyc_store_4').property('coordinates', '40.7128,74.1061')
.property(single, 'address', '100 Here St').as('nyc_store_4')
.addV('store').property('store_id', 'sf_store_1').property('coordinates', '37.6749,122.4194').as('sf_store_1')
.addV('store').property('store_id', 'sf_store_2').property('coordinates', '38.6749,122.5194').as('sf_store_2')
.addV('store').property('store_id', 'sf_store_3').property('coordinates', '37.7749,123.4194').as('sf_store_3')
.addV('store').property('store_id', 'sf_store_4').property('coordinates', '37.8749,123.5194').as('sf_store_4')
.addE('ships_to').from('nyc_dc').to('nyc_store_1')
.addE('ships_to').from('nyc_dc').to('nyc_store_2')
.addE('ships_to').from('nyc_dc').to('nyc_store_3')
.addE('ships_to').from('nyc_dc').to('nyc_store_4')
.addE('ships_to').from('sf_dc').to('sf_store_1')
.addE('ships_to').from('sf_dc').to('sf_store_2')
.addE('ships_to').from('sf_dc').to('sf_store_3')
.addE('ships_to').from('sf_dc').to('sf_store_4')

## Visualize the dataset

In [62]:
%%gremlin
g.V().hasLabel('distribution_center').bothE().otherV().path().by('dc_id').by().by(valueMap())


## Get stores that a distribution center is connected to

In [63]:
%%gremlin
g.V().hasLabel('distribution_center')
.has('dc_id', 'dc_1')
.out('ships_to')
.valueMap('store_id', 'coordinates')


## Calculate distance using haversine

In [64]:
import haversine as hs
from haversine import Unit

store1=(40.7111,74.0080) # store 1 coordinates
store2=(40.8111,74.0180) # store 2 coordinates
store3=(40.9111,74.0280) # store 3 coordinates
store4=(40.7128,74.1061) # store 4 coordinates
dc=(40.7128,74.0060) # distribution center coordinates
print(f'store 1: {hs.haversine(dc,store1,unit=Unit.MILES)}')
print(f'store 2: {hs.haversine(dc,store2,unit=Unit.MILES)}')
print(f'store 3: {hs.haversine(dc,store3,unit=Unit.MILES)}')
print(f'store 4: {hs.haversine(dc,store4,unit=Unit.MILES)}')

store 1: 0.15737906385088177
store 2: 6.820854822600475
store 3: 13.749441422880885
store 4: 5.242439610692904
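
For reference, the haversine formula computes the great-circle distance $d$ between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $r$. The value $r \approx 3958.8$ miles used below is an assumed mean Earth radius; the exact constant inside the `haversine` package may differ slightly.

$$
d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
$$

Store 4 makes a convenient sanity check because it shares the distribution center's latitude ($\varphi = 40.7128^\circ$), so the first term vanishes and only the longitude difference $\Delta\lambda = 0.1001^\circ$ remains:

$$
d = 2r \arcsin\left(\cos\varphi \, \sin\frac{\Delta\lambda}{2}\right) \approx 2 \times 3958.8 \times 0.000662 \approx 5.24 \text{ miles}
$$

This agrees with the value printed for store 4 above.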


# Use Elasticsearch integration

## Find all stores within a 1-mile radius of the distribution center

In [65]:
%%bash
curl -X GET "https://vpc-neptune-es-bulk-2-u36uxlwsvcupuhmckq5qzqfmaa.us-east-1.es.amazonaws.com/amazon_neptune/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "entity_type": "store"
          }
        },
        {
          "geo_distance": {
            "distance": "1mi",
            "predicates.coordinates.value": {
              "lat": 40.7128,
              "lon": 74.006
            }
          }
        }
      ]
    }
  }
}'

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "amazon_neptune",
        "_type" : "_doc",
        "_id" : "a711d186296a7be46f448c39ece79453",
        "_score" : 0.0,
        "_source" : {
          "entity_id" : "20bde10f-fcfd-d51c-bf44-a35b37bdf63d",
          "document_type" : "vertex",
          "entity_type" : [
            "store"
          ],
          "predicates" : {
            "store_id" : [
              {
                "value" : "nyc_store_1"
              }
            ],
            "address" : [
              {
                "value" : "100 Main St"
              }
            ],
            "coordinates" : [
              {
                "value" : "40.7111,74.0080"
              }
            ]
          }
        }
      }
    ]
  }
}
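
The request body above can also be built programmatically, which helps when the center point or radius comes from a graph query. The following is a minimal sketch, not part of the original notebook: `build_radius_query` is a hypothetical helper, and it assumes the `predicates.coordinates.value` field is indexed as a `geo_point`, as the `geo_distance` filter requires.

```python
import json


def build_radius_query(entity_type, lat, lon, distance,
                       field="predicates.coordinates.value"):
    """Build a bool/filter query: restrict by entity type, then by geo distance."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Keep only documents of the requested entity type.
                    {"term": {"entity_type": entity_type}},
                    # Keep only documents within `distance` of the center point.
                    {"geo_distance": {"distance": distance,
                                      field: {"lat": lat, "lon": lon}}},
                ]
            }
        }
    }


body = build_radius_query("store", 40.7128, 74.006, "1mi")
print(json.dumps(body, indent=2))
```

Sending the body to the `_search` endpoint is left out on purpose, since the HTTP client and authentication depend on how the domain is set up.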



## Fuzzy search on Address

In [66]:
%%bash
curl -X GET "https://vpc-neptune-es-bulk-2-u36uxlwsvcupuhmckq5qzqfmaa.us-east-1.es.amazonaws.com/amazon_neptune/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "entity_type": "store"
          }
        },
        {
          "query_string": {
            "query": "mazn~",
            "fields": ["predicates.address.value"]
          }
        },
        {
          "geo_distance": {
            "distance": "10mi",
            "predicates.coordinates.value": {
              "lat": 40.7128,
              "lon": 74.006
            }
          }
        }
      ]
    }
  }
}'

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "amazon_neptune",
        "_type" : "_doc",
        "_id" : "a711d186296a7be46f448c39ece79453",
        "_score" : 0.0,
        "_source" : {
          "entity_id" : "20bde10f-fcfd-d51c-bf44-a35b37bdf63d",
          "document_type" : "vertex",
          "entity_type" : [
            "store"
          ],
          "predicates" : {
            "store_id" : [
              {
                "value" : "nyc_store_1"
              }
            ],
            "address" : [
              {
                "value" : "100 Main St"
              }
            ],
            "coordinates" : [
              {
                "value" : "40.7111,74.0080"
              }
            ]
          }
        }
      }
    ]
  }
}
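
The `~` suffix in `mazn~` asks for a fuzzy match, that is, terms within a small edit distance of the query term. The sketch below illustrates why `mazn` can match the address term `main`. It uses plain Levenshtein distance as a stand-in (Elasticsearch's fuzzy matching also counts transpositions), and it assumes the analyzer lowercases `Main` to `main`. The function name is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Address terms from the sample stores, lowercased (an assumption about the analyzer).
for term in ["main", "other", "another", "here"]:
    print(term, edit_distance("mazn", term))
```

Only `main` is a single substitution away from `mazn`; the other address terms are farther away, which is consistent with the single hit returned above.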



## Combine Neptune and Elasticsearch querying in Gremlin

In [67]:
%%gremlin
g
.withSideEffect("Neptune#fts.endpoint", "https://vpc-neptune-es-bulk-2-u36uxlwsvcupuhmckq5qzqfmaa.us-east-1.es.amazonaws.com")
.V().hasLabel('distribution_center')
.has('dc_id', 'dc_1')
.out('ships_to')
.has("address", "Neptune#fts mazn~")
.valueMap()
