# Knowledge Extraction and Graph Generation

This repository shows how to extract relations from unstructured text and how to bulk load the extracted relations into Amazon Neptune.

Run the Jupyter notebook version of this file: [README.ipynb](./README.ipynb)

## Knowledge Extraction

The knowledge extraction programs are in `programs/ie-baseline/`. If you are using a SageMaker notebook, use a PyTorch kernel such as `pytorch_latest_p36` or `pytorch_p36`.
Note: The model used in this repo requires torch >= 1.9.0.

### Install dependencies

In [None]:
%%bash
# just make sure you are in programs/ie-baseline
# cd programs/ie-baseline
pip install -r requirements.txt

### Download and process training data
Skip this step if you have already downloaded the data. The unzipped data is placed in the `data` folder; this path is currently hard-coded and may become an argument of the training script in a future version. The transformed data is placed in the `generated` folder.

In [None]:
%%bash
# download DuIE dataset
wget https://dataset-bj.cdn.bcebos.com/qianyan/DuIE_2_0.zip
unzip -j DuIE_2_0.zip -d data
# transform data and place it in generated
mkdir -p generated
python trans.py

### Train the model
Check `main.py` or [main.ipynb](main.ipynb) for more detail. An epoch takes around 8 minutes on a p3.2xlarge machine (evaluation is currently sequential and cannot be parallelized, so it takes even longer than training). Specify the batch size with `--batch_size` and the TensorBoard log subfolder name with `--logname`. To load previously trained weights, use the flag `--loadweight weight_name`, where `weight_name` is the part after `subject_` and `object_`; e.g. the `weight_name` for `subject_att1_195` and `object_att1_195` is `att1_195`.

Warning: training may stop once this notebook is terminated, since the training process runs as a subprocess of the notebook. To keep it running, start it from a terminal as a background process that survives the session, for example with `nohup` or inside `tmux`.

In [None]:
!python main.py --logname att1

Running statistics are logged with TensorBoard and saved in the `logs` folder. You can launch TensorBoard to track training status (you may need to run it in a separate terminal window). Visit `https://[notebook_addr].sagemaker.aws/proxy/6006/` to access TensorBoard. The trailing slash is **necessary**.

In [None]:
!tensorboard --logdir=./logs

### Load the model for evaluation / inference
Models are saved in the `save` folder. Subject models are saved as `subject_[logname]_[epoch]` and object prediction models as `object_[logname]_[epoch]`, where `[logname]` is the log name you specified in the parameters and `[epoch]` is the epoch number at which the model was saved.
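
The naming scheme above makes it possible to pick a checkpoint programmatically. The following is a minimal sketch, not part of the repository: the helper name `latest_weight_name` is hypothetical, and it assumes the files in `save` carry no extension, as the naming scheme suggests.

```python
import re
from pathlib import Path

# Matches subject_[logname]_[epoch] and object_[logname]_[epoch].
CHECKPOINT_RE = re.compile(r"^(subject|object)_(?P<logname>.+)_(?P<epoch>\d+)$")

def latest_weight_name(save_dir, logname):
    """Return the newest '[logname]_[epoch]' that has both a subject and an object file."""
    epochs = {"subject": set(), "object": set()}
    for path in Path(save_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and match.group("logname") == logname:
            epochs[match.group(1)].add(int(match.group("epoch")))
    # Only epochs saved for both models can be loaded as a pair.
    common = epochs["subject"] & epochs["object"]
    if not common:
        return None
    return f"{logname}_{max(common)}"
```

The returned string can be passed to `--loadweight` or assigned to `weight_name` when loading the models.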

I uploaded one set of my trained model weights to Google Drive: [weight_att1_195.zip](https://drive.google.com/file/d/1YTFvOXCSJaUlj745XZ-LNQsv0xuVq7wv/view?usp=sharing). Download it, extract the weights to the `save/` folder, and set `weight_name` below to match the extracted files (e.g. `att1_195`).

In [None]:
import torch
import config
from model_origin import SubjectModel, ObjectModel
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Specify the weights to load: weight_name is '[logname]_[epoch]'
model_dir = 'save'
weight_name = 'att3_295'
subject_model = SubjectModel(config.bert_dict_len, config.word_emb_size).to(device)
object_model = ObjectModel(config.word_emb_size, config.num_classes).to(device)
subject_model.load_state_dict(torch.load(f"./{model_dir}/subject_{weight_name}", map_location=device))
object_model.load_state_dict(torch.load(f"./{model_dir}/object_{weight_name}", map_location=device))

### Load data for evaluation

The data is loaded from JSON files; the related dictionaries are also loaded for later use.

In [3]:
import json
dev_path = 'generated/dev_data_me.json'
train_path = 'generated/train_data_me.json'
dev_data = json.load(open(dev_path))
train_data = json.load(open(train_path))
generated_char_path = 'generated/all_chars_me.json'
id2char, char2id = json.load(open(generated_char_path))
generated_schema_path =  'generated/schemas_me.json'
id2predicate, predicate2id = json.load(open(generated_schema_path))
id2predicate = {int(i): j for i, j in id2predicate.items()}
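
The last line of the cell above converts the keys of `id2predicate` back to integers. A small sketch of why this is needed: JSON object keys are always strings, so integer keys do not survive a round trip. The two-entry dictionary is illustrative only.

```python
import json

original = {0: "毕业院校", 1: "嘉宾"}

# Serializing turns the integer keys into strings ...
restored = json.loads(json.dumps(original))

# ... so they have to be converted back after loading.
fixed = {int(i): j for i, j in restored.items()}
```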

### Evaluation and Inference
Extract relations batch by batch with the `extract_spoes` function. Here we first collect the extracted relations in a `pandas` DataFrame, then write them to a CSV file.

The previously loaded `subject_model` and `object_model` are used here.

In [None]:
import pandas as pd
import csv
from tqdm import tqdm
from torch.utils.data import DataLoader
from utils import extract_spoes
from data_gen import MyDevDataset, dev_collate_fn

dev_dataset = MyDevDataset(dev_data, config.bert_model_name)
dev_loader = DataLoader(
    dataset=dev_dataset,  
    batch_size=256, 
    shuffle=False,
    num_workers=1,
    collate_fn=dev_collate_fn,
    multiprocessing_context='spawn',
)
train_dataset = MyDevDataset(train_data, config.bert_model_name)
train_loader = DataLoader(
    dataset=train_dataset,  
    batch_size=256, 
    shuffle=False,
    num_workers=1,
    collate_fn=dev_collate_fn,
    multiprocessing_context='spawn',
)
rel_df = pd.DataFrame({'subject':[], 'predicate':[], 'object':[]})
with torch.no_grad():
    for batch in tqdm(dev_loader, desc="Extracting relations from dev"):
        texts, tokens, spoes, att_masks, offset_mappings = batch
        items = extract_spoes(texts, tokens, offset_mappings, subject_model, object_model, id2predicate, attention_mask=att_masks)
        for item in items:
            rel_df.loc[len(rel_df)] = item
    num_rel_dev = len(rel_df)
    print("num of extracted relations from dev set is:", num_rel_dev)
    for batch in tqdm(train_loader, desc="Extracting relations from train"):
        texts, tokens, spoes, att_masks, offset_mappings = batch
        items = extract_spoes(texts, tokens, offset_mappings, subject_model, object_model, id2predicate, attention_mask=att_masks)
        for item in items:
            rel_df.loc[len(rel_df)] = item
    num_rel_train = len(rel_df) - num_rel_dev
    print("num of extracted relations from train set is:", num_rel_train)
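
Appending to a DataFrame one row at a time with `.loc` can become slow for hundreds of thousands of rows. As an alternative sketch, the triples could be streamed straight to a CSV file with the standard `csv` module. The function name `write_triples` is hypothetical, and it assumes each extracted item is a `(subject, predicate, object)` triple, as the three-column frame above suggests.

```python
import csv

def write_triples(batches, csv_file):
    """Write (subject, predicate, object) rows to an open text file; return the row count."""
    writer = csv.writer(csv_file)
    count = 0
    for items in batches:
        for subject, predicate, obj in items:
            writer.writerow([subject, predicate, obj])
            count += 1
    return count
```

A possible use is to open the target with `open(path, 'w', newline='', encoding='utf-8')` and pass a generator that yields the result of `extract_spoes` for each batch.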

Save extracted relations to a csv file

In [None]:
rel_df.to_csv('generated/triplets_att3.csv', index=False, header=False)

Count and compare with gold triplets

In [4]:
train_spo = []
dev_spo = []
for item in train_data:
    train_spo += item['spo_list']
for item in dev_data:
    dev_spo += item['spo_list']

In [5]:
gold_spo = train_spo + dev_spo
gold_spo = [tuple(spo) for spo in gold_spo]
len(gold_spo)

348534

In [6]:
import pandas as pd
rel_df = pd.read_csv('generated/triplets.csv', names=['subject', 'predicate', 'object'])

In [7]:
extracted_spo = []
for idx, row in rel_df.iterrows():
    extracted_spo.append((row['subject'], row['predicate'], row['object']))
len(extracted_spo)

228796

In [8]:
gold_spo_set = set(gold_spo)
extracted_spo_set = set(extracted_spo)
overlap = len(gold_spo_set & extracted_spo_set)
recall = overlap / len(gold_spo_set)
precision = overlap / len(extracted_spo_set)
f1 = overlap * 2 / (len(gold_spo_set) + len(extracted_spo_set))
print(f"#extracted_pos: {len(extracted_spo)}, #gold_spo: {len(gold_spo)}")
print(f"#extracted_pos_set: {len(extracted_spo_set)}, #gold_spo_set: {len(gold_spo_set)}")
print(f"f1: {f1}, recall: {recall}, precision: {precision}")

#extracted_pos: 228796, #gold_spo: 348534
#extracted_pos_set: 168526, #gold_spo_set: 225479
f1: 0.7273968604459334, recall: 0.6355314685624825, precision: 0.8503079643497146
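
The metric computation above can be wrapped in a small function. This sketch uses the same set-based definitions; the guards against empty sets are an addition. The F1 expression `2 * overlap / (|gold| + |extracted|)` is algebraically the harmonic mean of precision and recall.

```python
def spo_metrics(gold, extracted):
    """Return (precision, recall, f1) over the distinct triples in each collection."""
    gold_set, extracted_set = set(gold), set(extracted)
    overlap = len(gold_set & extracted_set)
    precision = overlap / len(extracted_set) if extracted_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    total = len(gold_set) + len(extracted_set)
    f1 = 2 * overlap / total if total else 0.0
    return precision, recall, f1
```

For example, with 4 distinct gold triples and 3 distinct extracted triples of which 2 are correct, precision is 2/3, recall is 1/2 and F1 is 4/7.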


### Transform relation triplets to nodes and edges
Create relation dictionary

In [12]:
from tqdm import tqdm

rel_dict = {}
schema_path = 'data/schema.json'
with open(schema_path) as f:
    for line in tqdm(f):
        rel = json.loads(line)
        predicate = rel['predicate']
        sub_type = rel['subject_type']
        obj_type = rel['object_type']['@value']
        rel_dict[predicate] = {'subject_type': sub_type, 'object_type': obj_type}

48it [00:00, 19837.09it/s]


In [17]:
rel_dict

{'毕业院校': {'subject_type': '人物', 'object_type': '学校'},
 '嘉宾': {'subject_type': '电视综艺', 'object_type': '人物'},
 '配音': {'subject_type': '娱乐人物', 'object_type': '人物'},
 '主题曲': {'subject_type': '影视作品', 'object_type': '歌曲'},
 '代言人': {'subject_type': '企业/品牌', 'object_type': '人物'},
 '所属专辑': {'subject_type': '歌曲', 'object_type': '音乐专辑'},
 '父亲': {'subject_type': '人物', 'object_type': '人物'},
 '作者': {'subject_type': '图书作品', 'object_type': '人物'},
 '上映时间': {'subject_type': '影视作品', 'object_type': 'Date'},
 '母亲': {'subject_type': '人物', 'object_type': '人物'},
 '专业代码': {'subject_type': '学科专业', 'object_type': 'Text'},
 '占地面积': {'subject_type': '机构', 'object_type': 'Number'},
 '邮政编码': {'subject_type': '行政区', 'object_type': 'Text'},
 '票房': {'subject_type': '影视作品', 'object_type': 'Number'},
 '注册资本': {'subject_type': '企业', 'object_type': 'Number'},
 '主角': {'subject_type': '文学作品', 'object_type': '人物'},
 '妻子': {'subject_type': '人物', 'object_type': '人物'},
 '编剧': {'subject_type': '影视作品', 'object_type': '人物'},
 '气候':

To transform entities and edges into a Gremlin-compatible format, we need to assign an ID to each of them. IDs are currently constructed in a very simple way:
```python
node_id = 'node_' + node_type + '_' + node_name
# from_id and to_id are the node IDs of the subject and the object
edge_id = 'edge_' + predicate + '_' + from_id + '_' + to_id
```
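
As a worked example of the ID scheme, consider one triple for the predicate `毕业院校`, whose schema types are `人物` and `学校`. The entity names here are made up for illustration.

```python
def node_id(node_type, node_name):
    return "node_" + node_type + "_" + node_name

def edge_id(predicate, from_id, to_id):
    return "edge_" + predicate + "_" + from_id + "_" + to_id

# Hypothetical triple: (张三, 毕业院校, 北京大学)
sub_id = node_id("人物", "张三")
obj_id = node_id("学校", "北京大学")
eid = edge_id("毕业院校", sub_id, obj_id)
print(eid)  # edge_毕业院校_node_人物_张三_node_学校_北京大学
```

Because the type is part of the node ID, the same name under two different types yields two distinct nodes, while the same triple always maps to the same edge ID.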

Again, we use a dataframe to store transformed edges and nodes.

In [None]:
node_df = pd.DataFrame({'~id':[], '~label':[], 'name': []})
edge_df = pd.DataFrame({'~id':[], '~from':[], '~to':[], '~label':[]})

node_dict = {}

# currently id is constructed naively.
def node_name2id(entity_type, entity_name):
    return 'node_' + entity_type + '_' + entity_name

for idx, row in tqdm(rel_df.iterrows(), total=rel_df.shape[0]):
    sub = row['subject']
    obj = row['object']
    rel = row['predicate']
    sub_type = rel_dict[rel]['subject_type']
    obj_type = rel_dict[rel]['object_type']
    sub_id = node_name2id(sub_type, sub)
    obj_id = node_name2id(obj_type, obj)
    # node_dict maps ~id -> [~label, name]
    node_dict[sub_id] = [sub_type, sub]
    node_dict[obj_id] = [obj_type, obj]
    edge_id = 'edge_' + rel + '_' + sub_id + '_' + obj_id
    edge_df.loc[len(edge_df)] = [edge_id, sub_id, obj_id, rel]
    
for key, val in node_dict.items():
    node_df.loc[len(node_df)] = [key, val[0], val[1]]  

print("We have scanned {} nodes and {} relations".format(len(node_df), len(edge_df)))
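
The loop above appends to `edge_df` row by row and keeps repeated triples. A possible variant, sketched here with plain dictionaries, collects one row per distinct ID and builds the frames once at the end. The function name `build_graph_rows` is hypothetical, and dropping repeated edges is an assumption about what is wanted, not the behavior of the cell above.

```python
def build_graph_rows(triples, rel_dict):
    """Return (node_rows, edge_rows) with one row per distinct ID."""
    nodes = {}  # ~id -> (~id, ~label, name)
    edges = {}  # ~id -> (~id, ~from, ~to, ~label)
    for sub, rel, obj in triples:
        sub_type = rel_dict[rel]["subject_type"]
        obj_type = rel_dict[rel]["object_type"]
        sub_id = f"node_{sub_type}_{sub}"
        obj_id = f"node_{obj_type}_{obj}"
        nodes[sub_id] = (sub_id, sub_type, sub)
        nodes[obj_id] = (obj_id, obj_type, obj)
        edge_id = f"edge_{rel}_{sub_id}_{obj_id}"
        edges[edge_id] = (edge_id, sub_id, obj_id, rel)
    return list(nodes.values()), list(edges.values())
```

The row lists can then be turned into frames with `pd.DataFrame(node_rows, columns=['~id', '~label', 'name'])` and `pd.DataFrame(edge_rows, columns=['~id', '~from', '~to', '~label'])`.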

Save nodes and relations to csv files.

In [None]:
node_df.to_csv('generated/nodes.csv', index=False)
edge_df.to_csv('generated/edges.csv', index=False)

In [40]:
!ls -lh generated/edges.csv

-rw-rw-r-- 1 ec2-user ec2-user 28M Aug  3 08:20 generated/edges.csv


Upload nodes and edges files to S3 for bulkloading into Neptune

In [38]:
%%bash

# You need to replace these with your own S3 bucket and path
export S3_SAVE_BUCKET="sm-nlp-data"
export SAVE_PATH="ie-baseline/outputs"
aws s3 cp ./generated/edges.csv s3://$S3_SAVE_BUCKET/$SAVE_PATH/edges.csv
aws s3 cp ./generated/nodes.csv s3://$S3_SAVE_BUCKET/$SAVE_PATH/nodes.csv

echo "The path for the Property Graph bulk loading step is 's3://$S3_SAVE_BUCKET/$SAVE_PATH/'"

upload: generated/edges.csv to s3://sm-nlp-data/ie-baseline/outputs/edges.csv
upload: generated/nodes.csv to s3://sm-nlp-data/ie-baseline/outputs/nodes.csv
The path for the Property Graph bulk loading step is 's3://sm-nlp-data/ie-baseline/outputs/'


## Load Graph Data into Neptune

You need to find your Neptune endpoint and port on the Neptune database instance detail page. Mine are listed here as an example.

- Neptune endpoint & port: database-1-instance-1.c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182 [info](https://console.aws.amazon.com/neptune/home?region=us-east-1#database:id=database-1-instance-1;is-cluster=false;tab=connectivity)
- Source:
    - s3://sm-nlp-data/ie-baseline/outputs/nodes.csv
    - s3://sm-nlp-data/ie-baseline/outputs/edges.csv
- IAM role ARN: arn:aws:iam::093729152554:role/service-role/AWSNeptuneNotebookRole-NepTestRole [link](https://console.aws.amazon.com/iam/home?region=us-east-1#/roles/AWSNeptuneNotebookRole-NepTestRole)

*Troubleshooting*:

- You have to create an S3 VPC endpoint by following the section 'Creating an Amazon S3 VPC Endpoint' in this [guide](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-data.html).
- Choose the endpoint type as 'Gateway'.
- Do select the check box next to the route tables that are associated 

Bulk load the nodes and edges into Neptune by sending a request to Neptune's `loader` endpoint with `curl`. You need to specify your Neptune endpoint and port (this part: `https://database-2-instance-1.c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/`), as well as `source`, `iamRoleArn` and `region`.

In [None]:
database-2.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com
database-2-instance-1.c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182

In [41]:
%%bash

curl -X POST \
    -H 'Content-Type: application/json' \
    https://database-2.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/loader -d '
    {
      "source" : "s3://sm-nlp-data/ie-baseline/outputs/",
      "format" : "csv",
      "iamRoleArn" : "arn:aws:iam::093729152554:role/NeptuneLoadFromS3",
      "region" : "us-east-1",
      "failOnError" : "FALSE",
      "parallelism" : "MEDIUM",
      "updateSingleCardinalityProperties" : "FALSE",
      "queueRequest" : "TRUE",
      "dependencies" : []
    }'

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "6ad96976-1b80-4e33-88d1-74faa308dba3"
    }
}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   480  100   110  100   370    873   2936 --:--:-- --:--:-- --:--:--  3840
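
The same loader request can be prepared from Python with the standard library. This is a sketch under assumptions: the function name `build_loader_request` is hypothetical, the field values mirror the `curl` call above, and the request only succeeds when sent from inside the Neptune VPC.

```python
import json
import urllib.request

def build_loader_request(endpoint, source, iam_role_arn, region):
    """Build (but do not send) a POST request for the Neptune loader endpoint."""
    payload = {
        "source": source,
        "format": "csv",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
        "updateSingleCardinalityProperties": "FALSE",
        "queueRequest": "TRUE",
        "dependencies": [],
    }
    return urllib.request.Request(
        url=f"https://{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(request)` should return the JSON body that contains the `loadId`.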


Check load status

In [47]:
%%bash

curl -G 'https://database-2.cluster-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/loader/6ad96976-1b80-4e33-88d1-74faa308dba3'

{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 2
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://sm-nlp-data/ie-baseline/outputs/",
            "runNumber" : 8,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 43,
            "startTime" : 1627980773,
            "totalRecords" : 572294,
            "totalDuplicates" : 403768,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   612  100   612    0     0  12750      0 --:--:-- --:--:-- --:--:-- 13021
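
When polling the status from a script, the interesting fields can be pulled out of the response. A minimal sketch, assuming the response has the shape shown above; `summarize_load` is a hypothetical helper name and the sample values are copied from the output above.

```python
def summarize_load(status_response):
    """Condense a loader status response into a few fields."""
    overall = status_response["payload"]["overallStatus"]
    error_fields = ("parsingErrors", "datatypeMismatchErrors", "insertErrors")
    return {
        "status": overall["status"],
        "completed": overall["status"] == "LOAD_COMPLETED",
        "total_records": overall["totalRecords"],
        "duplicates": overall["totalDuplicates"],
        "errors": sum(overall[name] for name in error_fields),
    }

sample = {
    "status": "200 OK",
    "payload": {
        "overallStatus": {
            "status": "LOAD_COMPLETED",
            "totalRecords": 572294,
            "totalDuplicates": 403768,
            "parsingErrors": 0,
            "datatypeMismatchErrors": 0,
            "insertErrors": 0,
        }
    },
}
summary = summarize_load(sample)
```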


Now you can query the database from within the same VPC using `curl`.

In [61]:
%%bash
# show the first 5 vertices on the current Neptune instance
curl -X POST -d '{"gremlin":"g.V().limit(5)"}' https://database-2.cluster-ro-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/gremlin

{"requestId":"63cbad10-0674-4ecb-abfb-64d459eae351","status":{"message":"","code":200,"attributes":{"@type":"g:Map","@value":[]}},"result":{"data":{"@type":"g:List","@value":[{"@type":"g:Vertex","@value":{"id":"node_人物_范琳琳","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":309923863},"value":"范琳琳","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_伍翠珍","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":-96177417},"value":"伍翠珍","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_捷克","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":-1571363459},"value":"捷克","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_许绍洋","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":381103735},"value":"许绍洋","label":"name"}}]}}},{"@type":"g:Vertex","

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1344  100  1316  100    28  29909    636 --:--:-- --:--:-- --:--:-- 30545
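
The HTTP endpoint returns vertices in GraphSON, which wraps every value in `@type`/`@value` pairs. The sketch below flattens such a response into plain dictionaries. It assumes the structure shown in the output above, and `vertices_from_response` is a hypothetical helper name.

```python
def vertices_from_response(response):
    """Extract id, label and property values from a GraphSON vertex list."""
    rows = []
    for vertex in response["result"]["data"]["@value"]:
        body = vertex["@value"]
        # Each property maps to a list of g:VertexProperty wrappers.
        props = {
            key: [prop["@value"]["value"] for prop in values]
            for key, values in body.get("properties", {}).items()
        }
        rows.append({"id": body["id"], "label": body["label"], "properties": props})
    return rows
```

Applied to the parsed JSON of the response above, the first row would carry the id `node_人物_范琳琳`, the label `人物` and the name `范琳琳`.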


In [60]:
%%bash

# count the vertices labeled Text
curl -X POST -d '{"gremlin":"g.V().hasLabel(\"Text\").count()"}' https://database-2.cluster-ro-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/gremlin


## Access Neptune from Outside the VPC

We set up a load balancer to forward traffic from outside the VPC to the Neptune endpoint.

Architectures and best practices for connecting to Neptune through load balancers are detailed in this post: [Connecting to Amazon Neptune from Clients Outside the Neptune VPC](https://github.com/aws-samples/aws-dbs-refarch-graph/tree/master/src/connecting-using-a-load-balancer).

This [answer](https://stackoverflow.com/a/52622164) on Stack Overflow may also help.

### Steps

1. Find the IP address of the Neptune cluster's master instance with `dig +short <your cluster endpoint>`.


In [None]:
!dig +short database-2-instance-1.c2ycbhkszo5s.us-east-1.neptune.amazonaws.com

2. Create an Application Load Balancer (ALB)
    
    - In EC2's left panel, click 'Load Balancer'. 
    - Under Availability Zones, make sure you select at least the zone where your Neptune DB instance is located.
    - In Configure Security Groups, create a security group that allows inbound traffic from everywhere, i.e. an inbound TCP rule for 0.0.0.0/0 on port 80.
    - In Configure Routing, choose IP as the target type, HTTP as the protocol, and 80 as the port.
    - In Register Targets, add the IP address obtained in step 1 with port 8182, then click "Add to list".

3. Access!

After configuring the ALB (Application Load Balancer), you can find its DNS name on the Load Balancers page. The access port is the one you set in "Configure Routing", which is 80.

In [6]:
%%bash

curl -X POST -d '{"gremlin":"g.V().limit(5)"}' alb-neptune-test-62758122.us-east-1.elb.amazonaws.com/gremlin

{"requestId":"a6133143-6e85-4c82-ac43-251335a74eeb","status":{"message":"","code":200,"attributes":{"@type":"g:Map","@value":[]}},"result":{"data":{"@type":"g:List","@value":[{"@type":"g:Vertex","@value":{"id":"node_人物_范琳琳","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":309923863},"value":"范琳琳","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_伍翠珍","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":-96177417},"value":"伍翠珍","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_捷克","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":-1571363459},"value":"捷克","label":"name"}}]}}},{"@type":"g:Vertex","@value":{"id":"node_人物_许绍洋","label":"人物","properties":{"name":[{"@type":"g:VertexProperty","@value":{"id":{"@type":"g:Int32","@value":381103735},"value":"许绍洋","label":"name"}}]}}},{"@type":"g:Vertex","

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1344  100  1316  100    28  47000   1000 --:--:-- --:--:-- --:--:-- 48000


Set up the Gremlin console to connect to a Neptune DB instance 

#### Use ipython-gremlin extension 

Install ipython-gremlin with `!pip install ipython-gremlin --user`.

Detailed documentation can be found [here](https://ipython-gremlin.readthedocs.io/en/latest/usage.html).

In [None]:
%load_ext gremlin

In [41]:
%gremlin.connection.set_current ws://alb-neptune-test-62758122.us-east-1.elb.amazonaws.com/gremlin

Alias-- alb-neptune-test-62758122.us-east-1.elb.amazonaws.com --created for database at ws://alb-neptune-test-62758122.us-east-1.elb.amazonaws.com/gremlin
Now using connection at ws://alb-neptune-test-62758122.us-east-1.elb.amazonaws.com/gremlin


In [None]:
verts = %gremlin g.V().limit(5)

#### Use python API

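IPython already runs an event loop, which can cause problems for graph traversals because the Gremlin client cannot start another loop inside it. `nest_asyncio` patches asyncio so that nested event loops are allowed.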

In [None]:
import nest_asyncio
nest_asyncio.apply()
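
The underlying problem can be reproduced with the standard library alone. This sketch assumes the Gremlin driver internally calls something like `run_until_complete`; inside a loop that is already running, plain asyncio rejects such a call.

```python
import asyncio

async def inner():
    return 42

async def outer():
    # We are inside a running loop here, much like code executed in a notebook cell.
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"

message = asyncio.run(outer())
print(message)
```

With unpatched asyncio the printed message reports that the event loop is already running, which is the situation `nest_asyncio.apply()` is meant to work around.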

In [12]:
from __future__  import print_function  # Python 2/3 compatibility

from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

graph = Graph()

remoteConn = DriverRemoteConnection('wss://database-2.cluster-ro-c2ycbhkszo5s.us-east-1.neptune.amazonaws.com:8182/gremlin','g')
g = graph.traversal().withRemote(remoteConn)

print(g.V().hasLabel('机构').limit(5).toList())
remoteConn.close()

[v[node_机构_嘉兴中润光学科技有限公司], v[node_机构_厦门博乐德平台拍卖有限公司], v[node_机构_北京泡泡玛特文化创意有限公司], v[node_机构_大卫博士有限公司], v[node_机构_山东金天牛矿山机械有限公司]]
