# Extracting triplets from data for further analysis

Graph data can be stored in two main forms:
+ LPG: a labeled property graph with no fixed schema and arbitrary properties on nodes and edges
+ RDF: triples with predefined types of relations; supports reasoning
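
To make the difference concrete, here is a minimal sketch in plain Python (no graph database involved) that stores one fact from the example text used later in both forms. All identifiers (`ex:China`, `ex:leadsIn`, the `Country` label, the `lpg_to_triples` helper, and so on) are made up for illustration and do not come from any particular vocabulary or library.

```python
# RDF style: everything is a (subject, predicate, object) triple.
# Attributes of a node are simply more triples about the same subject.
rdf_triples = [
    ("ex:China", "rdf:type", "ex:Country"),
    ("ex:China", "ex:leadsIn", "ex:LithiumProcessing"),
    ("ex:LithiumProcessing", "rdf:type", "ex:Activity"),
]

# LPG style: nodes and edges are records that carry labels and
# free-form key/value properties; no schema is enforced.
lpg_nodes = {
    "n1": {"labels": ["Country"], "props": {"name": "China"}},
    "n2": {"labels": ["Activity"], "props": {"name": "lithium processing"}},
}
lpg_edges = [
    {"src": "n1", "dst": "n2", "type": "LEADS_IN", "props": {"source": "news"}},
]


def lpg_to_triples(nodes, edges):
    """Flatten a property graph into (subject, predicate, object) triples."""
    triples = []
    for node_id, node in nodes.items():
        for label in node["labels"]:
            triples.append((node_id, "label", label))
        for key, value in node["props"].items():
            triples.append((node_id, key, value))
    for edge in edges:
        # Edge properties are dropped: a plain triple has no slot for them.
        triples.append((edge["src"], edge["type"], edge["dst"]))
    return triples


for triple in lpg_to_triples(lpg_nodes, lpg_edges):
    print(triple)
```

Note that the flattening loses the edge properties, which is one practical difference between the two models when the goal is to end up with triplets.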

### How to fill this data automatically
For example, we can first extract named entities and then, using the semantic properties of the language in natural text, extend our data with relations between them.

## Language semantics

As a baseline for extracting relations from text, we can use spaCy's part-of-speech tagging, named entity recognition, and dependency parsing.

In [12]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047106 sha256=c942e3487e851c002aa5fd481e4740035e51264d665a4a8f8bdc60fc81174ed3
  Stored in directory: /tmp/pip-ephem-wheel-cache-wqqzkxpm/wheels/ee/4d/f7/563214122be1540b5f9197b52cb3ddb9c4a8070808b22d5a84
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Demand for lithium is expected to outpace global supply as consumers switch to battery-powered vehicles. With China currently leading in processing of the vital raw material, the U.S. government is looking to boost domestic production."
doc = nlp(text)

In [3]:
for token in doc:
    print(f"{token.text}:  {token.pos_}")

Demand:  NOUN
for:  ADP
lithium:  NOUN
is:  AUX
expected:  VERB
to:  PART
outpace:  VERB
global:  ADJ
supply:  NOUN
as:  SCONJ
consumers:  NOUN
switch:  VERB
to:  ADP
battery:  NOUN
-:  PUNCT
powered:  VERB
vehicles:  NOUN
.:  PUNCT
With:  ADP
China:  PROPN
currently:  ADV
leading:  VERB
in:  ADP
processing:  NOUN
of:  ADP
the:  DET
vital:  ADJ
raw:  ADJ
material:  NOUN
,:  PUNCT
the:  DET
U.S.:  PROPN
government:  NOUN
is:  AUX
looking:  VERB
to:  PART
boost:  VERB
domestic:  ADJ
production:  NOUN
.:  PUNCT


In [6]:
for entity in doc.ents:
    print(entity.text, entity.start_char, entity.end_char, entity.label_)

China 110 115 GPE
U.S. 179 183 GPE


spaCy also provides a dependency parse tree for each sentence.

In [8]:
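from nltk import Tree

def to_nltk_tree(node):
    """Recursively convert a spaCy token and its subtree into an nltk Tree."""
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        # A token without children becomes a leaf (plain string).
        return node.orth_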


for sent in doc.sents:
    print(sent.text)
    to_nltk_tree(sent.root).pretty_print()

Demand for lithium is expected to outpace global supply as consumers switch to battery-powered vehicles.
                    expected                                       
  _____________________|________                                    
 |   |     |                 outpace                               
 |   |     |      ______________|________                           
 |   |     |     |     |               switch                      
 |   |     |     |     |         ________|________________          
 |   |     |     |     |        |        |                to       
 |   |     |     |     |        |        |                |         
 |   |   Demand  |     |        |        |             vehicles    
 |   |     |     |     |        |        |                |         
 |   |    for    |   supply     |        |             powered     
 |   |     |     |     |        |        |         _______|______   
 is  .  lithium  to  global     as   consumers battery           - 

With

With the help of matchers (rules over the dependency tree), we can extract some relations.
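
As a rough illustration of such a rule, the sketch below walks the dependency tree by hand instead of using spaCy's `Matcher` or `DependencyMatcher` classes. The function names (`extract_svo`, `find_subjects`) and the choice of dependency labels are assumptions made for this example. The code only relies on the `.text`, `.pos_`, `.dep_`, `.children` and `.head` attributes that spaCy tokens expose, so it is demonstrated here on a small hand-built stand-in for the clause "the U.S. government is looking to boost domestic production"; the labels in that stand-in are assumed, not taken from an actual parse.

```python
from types import SimpleNamespace

SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj", "obj"}


def find_subjects(verb):
    """Return the subject tokens of a verb, borrowing from the governing verb if needed."""
    subjects = [c for c in verb.children if c.dep_ in SUBJECT_DEPS]
    # Open clausal complements ("is looking to boost ...") have no subject
    # of their own, so look one level up in the tree.
    if not subjects and verb.dep_ == "xcomp":
        subjects = [c for c in verb.head.children if c.dep_ in SUBJECT_DEPS]
    return subjects


def extract_svo(tokens):
    """Collect (subject, verb, object) text triples from dependency-parsed tokens."""
    triples = []
    for token in tokens:
        if token.pos_ != "VERB":
            continue
        objects = [c for c in token.children if c.dep_ in OBJECT_DEPS]
        for subj in find_subjects(token):
            for obj in objects:
                triples.append((subj.text, token.text, obj.text))
    return triples


# Hand-built stand-in tokens (hypothetical helpers, not part of spaCy).
def tok(text, pos, dep):
    return SimpleNamespace(text=text, pos_=pos, dep_=dep, children=[], head=None)


def attach(head, *children):
    for child in children:
        child.head = head
        head.children.append(child)


government = tok("government", "NOUN", "nsubj")
looking = tok("looking", "VERB", "ROOT")
boost = tok("boost", "VERB", "xcomp")
production = tok("production", "NOUN", "dobj")
looking.head = looking  # spaCy convention: the root is its own head
attach(looking, government, boost)
attach(boost, production)

print(extract_svo([government, looking, boost, production]))
# [('government', 'boost', 'production')]
```

Since iterating over a spaCy `Doc` yields tokens with these attributes, the same function can be called as `extract_svo(doc)`. Which triples actually come out depends on the labels the loaded model assigns, so the rule set will likely need tuning per corpus.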

## Temporal information

Along with the temporal (datetime) data found in metadata, we can extract some of it from the text itself.

In [9]:
!pip install dateparser

Collecting dateparser
  Downloading dateparser-1.0.0-py2.py3-none-any.whl (279 kB)
Collecting tzlocal
  Downloading tzlocal-2.1-py2.py3-none-any.whl (16 kB)
Installing collected packages: tzlocal, dateparser
Successfully installed dateparser-1.0.0 tzlocal-2.1


In [4]:
import dateparser

text = "The analysts expects AMD fiscal earnings of $2.04 a share in 2021, $2.59 a share in 2022, and $2.90 a share in 2023, while analysts surveyed by FactSet expect per-share earnings of $1.95, $2.51, and $3.23, respectively. "

dateparser.parse(text)

In [18]:
dateparser.parse("In December")

datetime.datetime(2021, 12, 26, 0, 0)

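This works poorly for my purposes. :( `dateparser.parse` expects the whole string to be a date expression, so the full sentence above produces no result, and for a partial date such as "In December" the missing year and day are filled in from the current date.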
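
One possible fallback for sentences like the earnings example is a hand-written pattern rather than a general date parser. The sketch below swaps in a plain regular expression that pairs each dollar amount with the year that follows it; the pattern and the `amounts_by_year` name are assumptions tailored to this one phrasing, not a general solution. dateparser also ships a `search_dates` helper in `dateparser.search` for finding dates inside longer text, which may be worth trying as well.

```python
import re

text = "The analysts expects AMD fiscal earnings of $2.04 a share in 2021, $2.59 a share in 2022, and $2.90 a share in 2023, while analysts surveyed by FactSet expect per-share earnings of $1.95, $2.51, and $3.23, respectively. "

# A dollar amount followed by "a share in <four-digit year>".
EPS_BY_YEAR = re.compile(r"\$(\d+(?:\.\d+)?) a share in (\d{4})")


def amounts_by_year(sentence):
    """Return (year, amount) pairs for every '$X a share in YYYY' mention."""
    return [(int(year), float(amount)) for amount, year in EPS_BY_YEAR.findall(sentence)]


print(amounts_by_year(text))
# [(2021, 2.04), (2022, 2.59), (2023, 2.9)]
```

The second list of amounts ("$1.95, $2.51, and $3.23, respectively") carries no explicit years, so this pattern skips it; resolving "respectively" would need extra logic.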

## Semantic graphs

In [1]:
!pip install graphbrain
!python3 -m spacy download en_core_web_lg

distutils: /home/master/.local/share/miniconda3/include/python3.8/UNKNOWN
sysconfig: /home/master/.local/share/miniconda3/include/python3.8
user = False
home = None
root = None
prefix = None


In [11]:
from graphbrain.parsers import create_parser  # builds the text-to-hypergraph parser
from graphbrain.notebook import show  # renders an edge in the notebook
parser = create_parser(lang='en')

In [14]:
parses = parser.parse(text)
for parse in parses["parses"]:
    edge = parse['main_edge']
    show(edge)

In [None]:
!pip3 install amrlib
!python3 -m spacy download en_core_web_sm

In [17]:
!wget https://github.com/bjascob/amrlib-models/releases/download/model_parse_t5-v0_1_0/model_parse_t5-v0_1_0.tar.gz -O - | tar -xz -C ./.venv/lib/python3.8/site-packages/amrlib/data/

--2021-05-01 16:25:56--  https://github.com/bjascob/amrlib-models/releases/download/model_parse_t5-v0_1_0/model_parse_t5-v0_1_0.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [22]:
amrlib.__file__

'/home/master/code/python-playground/app_triplet_extractor/.venv/lib/python3.8/site-packages/amrlib/__init__.py'

In [25]:
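import os
import amrlib

# Load the sentence-to-graph (stog) parser from the directory extracted above.
model_dir = os.path.join(os.path.dirname(amrlib.__file__), 'data', 'model_parse_t5-v0_1_0')
stog = amrlib.load_stog_model(model_dir=model_dir)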

ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.

The T5-based parser model needs the `sentencepiece` package, which was not installed in this environment; `pip install sentencepiece` should resolve this import error.


In [16]:
import amrlib
import spacy
amrlib.setup_spacy_extension()
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

# The following are roughly equivalent but demonstrate the different objects.
graphs = doc._.to_amr()
for graph in graphs:
    print(graph)

for span in doc.sents:
    graphs = span._.to_amr()
    print(graphs[0])

FileNotFoundError: [Errno 2] No such file or directory: '/home/master/code/python-playground/app_triplet_extractor/.venv/lib/python3.8/site-packages/amrlib/data/model_stog'
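
Judging by the error message, the spaCy extension loads the default parser from `amrlib/data/model_stog`, while the archive downloaded earlier unpacks into `model_parse_t5-v0_1_0`. One way around this, sketched below under that assumption, is to symlink the extracted directory to the expected name. `link_default_model` is a hypothetical helper written for this note, not part of amrlib, and it is demonstrated on a temporary directory rather than the real `site-packages` path.

```python
import os
import tempfile
from pathlib import Path


def link_default_model(data_dir, model_name, link_name="model_stog"):
    """Point data_dir/link_name at data_dir/model_name via a relative symlink."""
    data_dir = Path(data_dir)
    target = data_dir / model_name
    if not target.is_dir():
        raise FileNotFoundError(f"model directory not found: {target}")
    link = data_dir / link_name
    if link.is_symlink():
        link.unlink()  # replace a stale link
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(model_name, target_is_directory=True)
    return link


# Demonstration on a throwaway directory that mimics amrlib/data/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model_parse_t5-v0_1_0").mkdir()
    link = link_default_model(tmp, "model_parse_t5-v0_1_0")
    print(link.name, "->", os.readlink(link))
    # model_stog -> model_parse_t5-v0_1_0
```

In this notebook the call would be `link_default_model(os.path.join(os.path.dirname(amrlib.__file__), 'data'), 'model_parse_t5-v0_1_0')`, after which `doc._.to_amr()` should be able to find a model at the default location.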