# Neptune Lab

This notebook tests creating an Amazon Neptune DB cluster with boto3.

This notebook follows guidance from [Example: Loading Data into a Neptune DB Instance](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-data.html).

## 1. Create a Neptune DB cluster

In [2]:
import boto3
from botocore.exceptions import ClientError

In [7]:
neptune = boto3.client('neptune')
db_cluster_identifier = 'kg-neptune'

In [13]:
try:
    response = neptune.describe_db_clusters(DBClusterIdentifier=db_cluster_identifier)['DBClusters'][0]
    print(response)
except ClientError as e:
    print(e)
    print(f"Trying to create a Neptune DB Cluster with identifier {db_cluster_identifier}")
    response = neptune.create_db_cluster(
        DBClusterIdentifier=db_cluster_identifier, 
        Engine='neptune'
    )['DBCluster']

An error occurred (DBClusterNotFoundFault) when calling the DescribeDBClusters operation: DBCluster kg-neptune-1 not found.
Trying to create a Neptune DB Cluster with identifier kg-neptune-1


In [16]:
response

{'AllocatedStorage': 1,
 'AvailabilityZones': ['us-east-1f', 'us-east-1b', 'us-east-1d'],
 'BackupRetentionPeriod': 1,
 'DBClusterIdentifier': 'kg-neptune-1',
 'DBClusterParameterGroup': 'default.neptune1',
 'DBSubnetGroup': 'default',
 'Status': 'creating',
 'Endpoint': 'kg-neptune-1.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com',
 'ReaderEndpoint': 'kg-neptune-1.cluster-ro-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com',
 'MultiAZ': False,
 'Engine': 'neptune',
 'EngineVersion': '1.0.5.0',
 'Port': 8182,
 'MasterUsername': 'admin',
 'PreferredBackupWindow': '08:53-09:23',
 'PreferredMaintenanceWindow': 'sat:04:52-sat:05:22',
 'ReadReplicaIdentifiers': [],
 'DBClusterMembers': [],
 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-dc24d3db',
   'Status': 'active'}],
 'HostedZoneId': 'ZUFXD4SLT2LS7',
 'StorageEncrypted': False,
 'DbClusterResourceId': 'cluster-VNRV4H6V2Q3YSS3RY6PZJFZFSA',
 'DBClusterArn': 'arn:aws:rds:us-east-1:093729152554:cluster:kg-neptune-1',
 'AssociatedRoles'

In [21]:
# `response` already holds the cluster description from the cell above
db_cluster_endpoint = response['Endpoint']
db_cluster_port = response['Port']
db_cluster_arn = response['DBClusterArn']
db_cluster_region = db_cluster_arn.split(':')[3]
print(f"DB Cluster Endpoint: {db_cluster_endpoint}")
print(f"DB Cluster Port: {db_cluster_port}")
print(f"DB Cluster Region: {db_cluster_region}")

DB Cluster Endpoint: kg-neptune.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com
DB Cluster Port: 8182
DB Cluster Region: us-east-1
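
The region lookup above relies on the ARN layout `arn:partition:service:region:account-id:resource`. As a minimal sketch, the same split can be wrapped in a small helper that names each field and rejects malformed input. The `Arn` tuple and `parse_arn` function are hypothetical names introduced for illustration; they are not part of boto3.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    # maxsplit=5 keeps any colons inside the resource part intact
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


# ARN copied from the cluster description shown earlier in this notebook
cluster_arn = "arn:aws:rds:us-east-1:093729152554:cluster:kg-neptune-1"
print(parse_arn(cluster_arn).region)
```

With this helper, `db_cluster_region` could be read as `parse_arn(db_cluster_arn).region` instead of indexing the split result by position.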


### Test function

In [7]:
def get_or_create_db_cluster(db_cluster_identifier):
    neptune = boto3.client('neptune')
    try:
        response = neptune.describe_db_clusters(DBClusterIdentifier=db_cluster_identifier)
        db_cluster = response['DBClusters'][0]
    except ClientError as e:
        if e.response["Error"]["Code"] != 'DBClusterNotFoundFault':
            raise e
        print(f"Neptune Cluster {db_cluster_identifier} does not exist.")
        print(f"Trying to create a Neptune Cluster with identifier {db_cluster_identifier}")
        response = neptune.create_db_cluster(
            DBClusterIdentifier=db_cluster_identifier, 
            Engine='neptune'
        )
        db_cluster = response['DBCluster']
    return db_cluster

In [8]:
get_or_create_db_cluster('kg-neptune')

{'AllocatedStorage': 1,
 'AvailabilityZones': ['us-east-1b', 'us-east-1a', 'us-east-1c'],
 'BackupRetentionPeriod': 1,
 'DBClusterIdentifier': 'kg-neptune',
 'DBClusterParameterGroup': 'default.neptune1',
 'DBSubnetGroup': 'default',
 'Status': 'available',
 'EarliestRestorableTime': datetime.datetime(2021, 9, 15, 11, 5, 29, 136000, tzinfo=tzlocal()),
 'Endpoint': 'kg-neptune.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com',
 'ReaderEndpoint': 'kg-neptune.cluster-ro-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com',
 'MultiAZ': True,
 'Engine': 'neptune',
 'EngineVersion': '1.0.5.0',
 'LatestRestorableTime': datetime.datetime(2021, 9, 17, 2, 1, 22, 32000, tzinfo=tzlocal()),
 'Port': 8182,
 'MasterUsername': 'admin',
 'PreferredBackupWindow': '09:47-10:17',
 'PreferredMaintenanceWindow': 'fri:05:05-fri:05:35',
 'ReadReplicaIdentifiers': [],
 'DBClusterMembers': [{'DBInstanceIdentifier': 'kg-neptune-instance-4',
   'IsClusterWriter': False,
   'DBClusterParameterGroupStatus': 'in-sync',

In [30]:
import time
db_cluster = get_or_create_db_cluster(db_cluster_identifier)
while db_cluster['Status'] == 'creating':
    print(f"Cluster {db_cluster_identifier} is in status \'creating\'")
    time.sleep(30) # check status every 30 seconds
    db_cluster = get_or_create_db_cluster(db_cluster_identifier)
print(f"Cluster {db_cluster_identifier} is now in state \'{db_cluster['Status']}\'")

Neptune Cluster kg-neptune-3 does not exist.
Trying to create a Neptune Cluster with identifier kg-neptune-3
Cluster kg-neptune-3 is still in status creating
Cluster kg-neptune-3 is still in status creating
Cluster kg-neptune-3 is now in state 'available'
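
The polling loop above runs until the status changes, with no upper bound on the wait. A possible variant, sketched here under the assumption that a 30 minute limit is acceptable, adds a timeout and takes the status lookup as a callable. `wait_while` and its parameters are hypothetical names, and the demo uses a canned status sequence instead of a real AWS call.

```python
import time
from typing import Callable


def wait_while(fetch_status: Callable[[], str], pending: str = "creating",
               interval: float = 30.0, timeout: float = 1800.0,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status() until it stops returning `pending` or the timeout elapses."""
    waited = 0.0
    status = fetch_status()
    while status == pending:
        if waited >= timeout:
            raise TimeoutError(f"Still '{pending}' after {waited:.0f} seconds")
        sleep(interval)
        waited += interval
        status = fetch_status()
    return status


# Demo with a canned status sequence and a no-op sleep (no AWS calls involved)
statuses = iter(["creating", "creating", "available"])
print(wait_while(lambda: next(statuses), sleep=lambda seconds: None))
```

In this notebook the callable could be `lambda: get_or_create_db_cluster(db_cluster_identifier)['Status']`.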


## 2.Create a Neptune DB instance within the cluster

In [12]:
db_cluster_identifier = 'kg-neptune'

In [14]:
response = neptune.describe_db_instances(
    Filters=[
        {
            'Name': 'db-cluster-id',
            'Values': [db_cluster_identifier]
        },
        {
            'Name': 'engine',
            'Values': ['neptune']
        }
    ]
)
db_instances = response['DBInstances']
db_instances[0]

{'DBInstanceIdentifier': 'kg-neptune-instance-1',
 'DBInstanceClass': 'db.t3.medium',
 'Engine': 'neptune',
 'DBInstanceStatus': 'available',
 'MasterUsername': 'admin',
 'Endpoint': {'Address': 'kg-neptune-instance-1.c2ycbhkszo5s.us-east-1.neptune.amazonaws.com',
  'Port': 8182,
  'HostedZoneId': 'ZUFXD4SLT2LS7'},
 'AllocatedStorage': 1,
 'InstanceCreateTime': datetime.datetime(2021, 9, 16, 1, 53, 40, 578000, tzinfo=tzlocal()),
 'PreferredBackupWindow': '09:47-10:17',
 'BackupRetentionPeriod': 1,
 'DBSecurityGroups': [],
 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-dc24d3db',
   'Status': 'active'}],
 'DBParameterGroups': [{'DBParameterGroupName': 'default.neptune1',
   'ParameterApplyStatus': 'in-sync'}],
 'AvailabilityZone': 'us-east-1c',
 'DBSubnetGroup': {'DBSubnetGroupName': 'default',
  'DBSubnetGroupDescription': 'default',
  'VpcId': 'vpc-851683f8',
  'SubnetGroupStatus': 'Complete',
  'Subnets': [{'SubnetIdentifier': 'subnet-fccf40cd',
    'SubnetAvailabilityZone': {'Nam

In [43]:
db_instance_identifier = f"{db_cluster_identifier}-instance-1"
db_instance_class = 'db.t3.medium'

In [44]:
try:
    response = neptune.describe_db_instances(DBInstanceIdentifier=db_instance_identifier)
    db_instance = response['DBInstances'][0]
except ClientError as e:
    print(e)
    print(f"Trying to create a Neptune DB instance with identifier {db_instance_identifier}")
    response = neptune.create_db_instance(
        DBInstanceIdentifier=db_instance_identifier,
        DBInstanceClass=db_instance_class,
        Engine='neptune',
        DBClusterIdentifier=db_cluster_identifier,
    )
    db_instance = response['DBInstance']
print(db_instance)

An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance kg-neptune-instance-2 not found.
Trying to create a Neptune DB instance with identifier kg-neptune-instance-2
{'DBInstanceIdentifier': 'kg-neptune-instance-2', 'DBInstanceClass': 'db.t3.medium', 'Engine': 'neptune', 'DBInstanceStatus': 'creating', 'MasterUsername': 'admin', 'AllocatedStorage': 1, 'PreferredBackupWindow': '09:47-10:17', 'BackupRetentionPeriod': 1, 'DBSecurityGroups': [], 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-dc24d3db', 'Status': 'active'}], 'DBParameterGroups': [{'DBParameterGroupName': 'default.neptune1', 'ParameterApplyStatus': 'in-sync'}], 'DBSubnetGroup': {'DBSubnetGroupName': 'default', 'DBSubnetGroupDescription': 'default', 'VpcId': 'vpc-851683f8', 'SubnetGroupStatus': 'Complete', 'Subnets': [{'SubnetIdentifier': 'subnet-fccf40cd', 'SubnetAvailabilityZone': {'Name': 'us-east-1e'}, 'SubnetStatus': 'Active'}, {'SubnetIdentifier': 'subnet-f8676fb5', 'SubnetAva

Extract **VpcId** from DB instance response:

In [9]:
# `db_instance` is set on both the describe and the create path above
vpc_id = db_instance['DBSubnetGroup']['VpcId']

### Test function

In [32]:
def get_or_create_db_instance(db_cluster_identifier, db_instance_suffix, db_instance_class):
    neptune = boto3.client('neptune')
    db_instance_identifier = f"{db_cluster_identifier}-{db_instance_suffix}"
    try:
        response = neptune.describe_db_instances(DBInstanceIdentifier=db_instance_identifier)
        db_instance = response['DBInstances'][0]
    except ClientError as e:
        if e.response["Error"]["Code"] != 'DBInstanceNotFound':
            raise e
        print(f"Trying to create a Neptune DB instance with identifier {db_instance_identifier}")
        response = neptune.create_db_instance(
            DBInstanceIdentifier=db_instance_identifier,
            DBInstanceClass=db_instance_class,
            Engine='neptune',
            DBClusterIdentifier=db_cluster_identifier,
        )
        db_instance = response['DBInstance']
    return db_instance

In [33]:
db_instance_suffix = 'instance-1'
db_instance_class = 'db.t3.medium'

In [None]:
db_instance = get_or_create_db_instance(db_cluster_identifier, db_instance_suffix, db_instance_class)
while db_instance['DBInstanceStatus'] == 'creating':
    print(f"Instance {db_cluster_identifier}-{db_instance_suffix} is in status \'creating\'")
    time.sleep(30) # check status every 30 seconds
    db_instance = get_or_create_db_instance(db_cluster_identifier, db_instance_suffix, db_instance_class)
print(f"Instance {db_cluster_identifier}-{db_instance_suffix} is now in status \'{db_instance['DBInstanceStatus']}\'")
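
As an alternative to the hand-written loop, boto3 clients expose waiters for some resources. The sketch below assumes the Neptune client in your boto3 version provides a `db_instance_available` waiter; `wait_for_instance` is a hypothetical helper name, and the client is passed in so the function itself makes no assumption about credentials or region.

```python
def wait_for_instance(neptune_client, db_instance_identifier, delay=30, max_attempts=60):
    """Block until the instance is reported as available, using the client's waiter.

    Assumes the client offers a 'db_instance_available' waiter.
    """
    waiter = neptune_client.get_waiter("db_instance_available")
    waiter.wait(
        DBInstanceIdentifier=db_instance_identifier,
        # Check every `delay` seconds, give up after `max_attempts` checks
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

A call such as `wait_for_instance(neptune, f"{db_cluster_identifier}-{db_instance_suffix}")` would then replace the `while` loop, after the instance has been created.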

In [35]:
db_instance

{'DBInstanceIdentifier': 'kg-neptune-3-instance-1',
 'DBInstanceClass': 'db.t3.medium',
 'Engine': 'neptune',
 'DBInstanceStatus': 'creating',
 'MasterUsername': 'admin',
 'AllocatedStorage': 1,
 'PreferredBackupWindow': '09:18-09:48',
 'BackupRetentionPeriod': 1,
 'DBSecurityGroups': [],
 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-dc24d3db',
   'Status': 'active'}],
 'DBParameterGroups': [{'DBParameterGroupName': 'default.neptune1',
   'ParameterApplyStatus': 'in-sync'}],
 'DBSubnetGroup': {'DBSubnetGroupName': 'default',
  'DBSubnetGroupDescription': 'default',
  'VpcId': 'vpc-851683f8',
  'SubnetGroupStatus': 'Complete',
  'Subnets': [{'SubnetIdentifier': 'subnet-fccf40cd',
    'SubnetAvailabilityZone': {'Name': 'us-east-1e'},
    'SubnetStatus': 'Active'},
   {'SubnetIdentifier': 'subnet-f8676fb5',
    'SubnetAvailabilityZone': {'Name': 'us-east-1c'},
    'SubnetStatus': 'Active'},
   {'SubnetIdentifier': 'subnet-2f611949',
    'SubnetAvailabilityZone': {'Name': 'us-east-1a'}

## 3. Create an IAM role that allows Neptune to access the S3 bucket

For further details, see [Prerequisites: IAM Role and Amazon S3 Access](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-IAM.html). <br>
Docs for boto3 IAM API: [IAM](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html)

In [10]:
iam = boto3.client("iam")
iam_role_name = 'NeptuneLoadFromS3'

In [18]:
import json

s3_read_only_policy_arn = 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'

assume_role_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow", 
            "Principal": {
                "Service": [
                  "rds.amazonaws.com"
                ]
              },
            "Action": "sts:AssumeRole"
        }
    ],
}

try:
    iam_role_loadfroms3 = iam.create_role(
        RoleName=iam_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
        Description="Allow Amazon Neptune to Access Amazon S3 Resources",
    )
    # attach s3 read only policy
    response = iam.attach_role_policy(
        RoleName=iam_role_name,
        PolicyArn=s3_read_only_policy_arn
    )
    print('Role:\n', iam_role_loadfroms3)
    print('Attach Policy Response:\n', response)
except ClientError as e:
    if e.response["Error"]["Code"] == "EntityAlreadyExists":
        print("Role already exists")
        iam_role_loadfroms3 = iam.get_role(
            RoleName=iam_role_name
        )
        print(iam_role_loadfroms3)
    else:
        # Re-raise so later cells do not run with an undefined role
        raise

Role already exists
{'Role': {'Path': '/', 'RoleName': 'NeptuneLoadFromS3', 'RoleId': 'AROARLUVRLYVCD5OPEFC7', 'Arn': 'arn:aws:iam::093729152554:role/NeptuneLoadFromS3', 'CreateDate': datetime.datetime(2021, 6, 17, 7, 31, 51, tzinfo=tzlocal()), 'AssumeRolePolicyDocument': {'Version': '2012-10-17', 'Statement': [{'Sid': '', 'Effect': 'Allow', 'Principal': {'Service': 'rds.amazonaws.com'}, 'Action': 'sts:AssumeRole'}]}, 'Description': 'Allows S3 to call AWS services on your behalf.', 'MaxSessionDuration': 3600, 'RoleLastUsed': {'LastUsedDate': datetime.datetime(2021, 8, 3, 8, 53, 3, tzinfo=tzlocal()), 'Region': 'us-east-1'}}, 'ResponseMetadata': {'RequestId': 'c2351ba6-11ac-40bf-bb1d-16dcac159207', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c2351ba6-11ac-40bf-bb1d-16dcac159207', 'content-type': 'text/xml', 'content-length': '1035', 'date': 'Thu, 16 Sep 2021 06:02:01 GMT'}, 'RetryAttempts': 0}}
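
Creating the role does not by itself associate it with the cluster, and the linked prerequisites page appears to list that association as a separate step. A minimal sketch of it follows, assuming the role has not been added yet. `ensure_role_on_cluster` is a hypothetical helper name; it uses the Neptune client's `describe_db_clusters` and `add_role_to_db_cluster` calls and takes the client as a parameter.

```python
def ensure_role_on_cluster(neptune_client, db_cluster_identifier, role_arn):
    """Associate role_arn with the cluster unless it is already listed there.

    Returns True if an association request was sent, False if none was needed.
    """
    cluster = neptune_client.describe_db_clusters(
        DBClusterIdentifier=db_cluster_identifier
    )["DBClusters"][0]
    # 'AssociatedRoles' is a list of dicts, each carrying a 'RoleArn' key
    attached = {role["RoleArn"] for role in cluster.get("AssociatedRoles", [])}
    if role_arn in attached:
        return False
    neptune_client.add_role_to_db_cluster(
        DBClusterIdentifier=db_cluster_identifier,
        RoleArn=role_arn,
    )
    return True
```

In this notebook the call would look like `ensure_role_on_cluster(neptune, db_cluster_identifier, iam_role_loadfroms3['Role']['Arn'])`.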


## 4. Create an Amazon S3 VPC endpoint

The Neptune loader requires a VPC endpoint for Amazon S3. An S3 endpoint allows other AWS services to access S3 without leaving the Amazon network.

Here we check whether a VPC endpoint for the S3 service already exists in the given VPC, and create one only if none exists.

Docs for EC2 boto3 API: [EC2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html)

In [12]:
ec2 = boto3.client('ec2')

In [34]:
s3_service_name = f"com.amazonaws.{db_cluster_region}.s3"

`vpc_id` was extracted from the DB instance response in [section 2](#2.Create-a-Neptune-DB-instance-within-the-cluster).

In [14]:
# Check endpoints existence
response = ec2.describe_vpc_endpoints(
    Filters=[
        {
            'Name': 'service-name',
            'Values': [s3_service_name]
        },
        {
            'Name': 'vpc-id',
            'Values': [vpc_id]
        },
        {
            'Name': 'vpc-endpoint-type',
            'Values': ['Gateway']
        }
    ]
)
print(f"There exists {len(response['VpcEndpoints'])} S3 endpoints in VPC {vpc_id}:")
response['VpcEndpoints']

There exists 3 S3 endpoints in VPC vpc-851683f8:


[{'VpcEndpointId': 'vpce-097a269924ed1df04',
  'VpcEndpointType': 'Gateway',
  'VpcId': 'vpc-851683f8',
  'ServiceName': 'com.amazonaws.us-east-1.s3',
  'State': 'available',
  'PolicyDocument': '{"Version":"2008-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"*","Resource":"*"}]}',
  'RouteTableIds': [],
  'SubnetIds': [],
  'Groups': [],
  'PrivateDnsEnabled': False,
  'RequesterManaged': False,
  'NetworkInterfaceIds': [],
  'DnsEntries': [],
  'CreationTimestamp': datetime.datetime(2021, 6, 17, 7, 37, 44, tzinfo=tzlocal()),
  'Tags': [],
  'OwnerId': '093729152554'},
 {'VpcEndpointId': 'vpce-0b804e31709c7d3f2',
  'VpcEndpointType': 'Gateway',
  'VpcId': 'vpc-851683f8',
  'ServiceName': 'com.amazonaws.us-east-1.s3',
  'State': 'available',
  'PolicyDocument': '{"Version":"2008-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"*","Resource":"*"}]}',
  'RouteTableIds': ['rtb-1246556c'],
  'SubnetIds': [],
  'Groups': [],
  'PrivateDnsEnabled': False,
  'R

In [15]:
if len(response['VpcEndpoints']) > 0:
    # Reuse the first matching endpoint
    vpc_endpoint_id = response['VpcEndpoints'][0]['VpcEndpointId']
else:
    print('Trying to create a VPC endpoint:')
    response = ec2.create_vpc_endpoint(
        VpcEndpointType='Gateway',
        VpcId=vpc_id,
        ServiceName=s3_service_name,
    )
    vpc_endpoint_id = response['VpcEndpoint']['VpcEndpointId']
    print(response['VpcEndpoint'])

In [16]:
print(f"S3 VPC Endpoint ID: {vpc_endpoint_id}")

S3 VPC Endpoint ID: vpce-097a269924ed1df04
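
The cell above takes the first matching endpoint, which in the output shown has an empty `RouteTableIds` list. A gateway endpoint only affects subnets whose route tables are associated with it, so it may be worth preferring an endpoint that has at least one. The helper below is a sketch of that selection; `pick_routed_endpoint` is a hypothetical name, and the sample entries are trimmed from the output above.

```python
def pick_routed_endpoint(vpc_endpoints):
    """Return the ID of the first available endpoint with a route table, else None."""
    for endpoint in vpc_endpoints:
        if endpoint.get("State") == "available" and endpoint.get("RouteTableIds"):
            return endpoint["VpcEndpointId"]
    return None


# Two entries trimmed from the describe_vpc_endpoints output shown above
sample = [
    {"VpcEndpointId": "vpce-097a269924ed1df04", "State": "available",
     "RouteTableIds": []},
    {"VpcEndpointId": "vpce-0b804e31709c7d3f2", "State": "available",
     "RouteTableIds": ["rtb-1246556c"]},
]
print(pick_routed_endpoint(sample))
```

Whether the chosen route table actually serves the Neptune subnets still has to be checked separately; this sketch does not verify that.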


## 5. Bulk load

Amazon Neptune uses a cluster of DB instances rather than a single instance. Each Neptune connection is handled by a specific DB instance. When you connect to a Neptune cluster, the host name and port that you specify point to an intermediate handler called an *endpoint*. An endpoint is a URL that contains a host address and a port. [[src]](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-endpoints.html)

Learn more about connecting to a Neptune DB cluster: [Connecting to Amazon Neptune Endpoints](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-endpoints.html)

In [28]:
bucket = 'sm-nlp-data'
save_prefix = 'ie-baseline/outputs'

In [29]:
load_script = f"""curl -X POST \
    -H 'Content-Type: application/json' \
    https://{db_cluster_endpoint}:{db_cluster_port}/loader -d "
    {{
      'source' : 's3://{bucket}/{save_prefix}/',
      'format' : 'csv',
      'iamRoleArn' : '{iam_role_loadfroms3['Role']['Arn']}',
      'region' : '{db_cluster_region}',
      'failOnError' : 'FALSE',
      'parallelism' : 'MEDIUM',
      'updateSingleCardinalityProperties' : 'FALSE',
      'queueRequest' : 'TRUE',
      'dependencies' : []
    }}"
    """

In [33]:
import subprocess

subprocess.run(load_script, shell=True)

CompletedProcess(args='curl -X POST     -H \'Content-Type: application/json\'     https://kg-neptune.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/loader -d "\n    {\n      \'source\' : \'s3://sm-nlp-data/ie-baseline/outputs/\',\n      \'format\' : \'csv\',\n      \'iamRoleArn\' : \'arn:aws:iam::093729152554:role/NeptuneLoadFromS3\',\n      \'region\' : \'us-east-1\',\n      \'failOnError\' : \'FALSE\',\n      \'parallelism\' : \'MEDIUM\',\n      \'updateSingleCardinalityProperties\' : \'FALSE\',\n      \'queueRequest\' : \'TRUE\',\n      \'dependencies\' : []\n    }"\n    ', returncode=7)
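
The return code 7 above is curl's "failed to connect to host" status, which suggests the cluster endpoint was not reachable from where the notebook ran; Neptune endpoints are generally only reachable from inside the cluster's VPC. Separately, the request body above uses single-quoted keys and values, which is not strictly valid JSON. The sketch below builds the same body with `json.dumps` so the quoting is correct by construction. `build_loader_request` is a hypothetical helper name, and the argument values are the ones that appear in this notebook.

```python
import json


def build_loader_request(endpoint, port, bucket, prefix, role_arn, region):
    """Return (url, body) for a POST to the Neptune bulk loader endpoint."""
    url = f"https://{endpoint}:{port}/loader"
    payload = {
        "source": f"s3://{bucket}/{prefix}/",
        "format": "csv",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
        "updateSingleCardinalityProperties": "FALSE",
        "queueRequest": "TRUE",
        "dependencies": [],
    }
    return url, json.dumps(payload)


url, body = build_loader_request(
    endpoint="kg-neptune.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com",
    port=8182,
    bucket="sm-nlp-data",
    prefix="ie-baseline/outputs",
    role_arn="arn:aws:iam::093729152554:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(url)
print(body)
```

The pair can then be passed to curl as an argument list, for example `subprocess.run(["curl", "-X", "POST", "-H", "Content-Type: application/json", url, "-d", body])`, which avoids shell quoting altogether.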