## Label Propagation

We first import the graph into Neo4j using the `neo4j-admin` tool and the CSV files that were built in the previous notebook. The command to load the data into a new Neo4j graph is as follows:

    cd $NEO4J_HOME
    bin/neo4j-admin import \
        --nodes=/path/to/nodes-compound.csv \
        --relationships=/path/to/edges-compound.csv
        
We then connect to the database, run Label Propagation against the graph, and collect the predictions made by the algorithm.

We will read the `sentences-compound-plabels.tsv` file to build a mapping of `doc_id` to `sentence_text` so we can show the prediction along with the sentence in the output file `sentences-compound-preds.tsv`.

In [1]:
import os
import pandas as pd
import py2neo

In [2]:
DATA_DIR = "../data"
SENTS_FILEPATH = os.path.join(DATA_DIR, "sentences-compound-plabels.tsv")
PREDS_FILEPATH = os.path.join(DATA_DIR, "sentences-compound-preds.tsv")

NEO4J_CONN_URL = "bolt://localhost:7687"

### Build sentence ID to text mapping

In [3]:
# map doc_id -> sentence text for unannotated sentences (label == -1)
test_docid2text = {}
with open(SENTS_FILEPATH, "r") as fsents:
    for line in fsents:
        # skip header / comment lines
        if line.startswith("#"):
            continue
        pii, sent_id, sent_text, label = line.strip().split('\t')
        doc_id = "-".join([pii, sent_id])
        label = int(label)
        if label == -1:
            test_docid2text[doc_id] = sent_text
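To make the `doc_id` construction concrete, here is a minimal sketch that parses a single line of the sentences file. The helper name `parse_sentence_line` and the sample line are hypothetical, made up for illustration; only the column order (`pii`, `sent_id`, `sent_text`, `label`) comes from the cell above.

```python
def parse_sentence_line(line):
    """Split one TSV line into (doc_id, sent_text, label)."""
    pii, sent_id, sent_text, label = line.rstrip("\n").split("\t")
    # doc_id is the article identifier and the sentence id joined by a hyphen
    return "-".join([pii, sent_id]), sent_text, int(label)


# hypothetical line, not taken from the real data file
sample = "S0000000000000000\t12\tThe compound was isolated.\t-1\n"
doc_id, sent_text, label = parse_sentence_line(sample)
print(doc_id, label)
```

A label of -1 marks the sentence as unannotated, which is the only case kept in `test_docid2text`.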

### Label Propagation

Note that we run the Label Propagation algorithm on an unweighted graph (i.e., we don't specify `weightProperty: "similarity"` in the parameters to `algo.labelPropagation.stream`), even though we set the `similarity` edge attribute. The reason is that the algorithm fails to converge when run on the weighted graph, whereas on the unweighted graph it does converge and gives reasonable results.

In [4]:
def run_label_propagation(graph):
    query = """
    CALL algo.labelPropagation.stream("Sentence", "SIM", {
        direction: "OUTGOING", 
        seedProperty: "seed_label", 
        iterations: 10
    })
    YIELD nodeId, label
    RETURN algo.asNode(nodeId).doc_id AS doc_id, label AS community
    """
    results = graph.run(query).data()
    return results
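For comparison, the weighted variant described in the note above would differ only in the configuration map. The following is a minimal sketch, assuming `weightProperty: "similarity"` is passed exactly as that note describes; `build_label_propagation_query` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def build_label_propagation_query(weighted=False):
    """Return the Cypher call, optionally adding the similarity weights."""
    params = [
        'direction: "OUTGOING"',
        'seedProperty: "seed_label"',
        'iterations: 10',
    ]
    if weighted:
        # assumption: same parameter name as mentioned in the note above
        params.append('weightProperty: "similarity"')
    config = ",\n        ".join(params)
    return """
    CALL algo.labelPropagation.stream("Sentence", "SIM", {
        %s
    })
    YIELD nodeId, label
    RETURN algo.asNode(nodeId).doc_id AS doc_id, label AS community
    """ % config


print(build_label_propagation_query(weighted=True))
```

Keeping both variants behind one flag makes it easy to re-test the weighted run later without editing the query by hand.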

In [5]:
graph = py2neo.Graph(NEO4J_CONN_URL, auth=("neo4j", "graph"))
results = run_label_propagation(graph)
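To build intuition for what the procedure returns, here is a toy, pure-Python sketch of unweighted, seeded label propagation. It is not the Neo4j implementation. It rests on several simplifying assumptions: nodes without a seed start from their own node id as label, seeded nodes never change label, labels flow from the source to the target of each edge, and ties are broken by taking the smallest label. The function name `propagate_labels` and the toy graph are hypothetical.

```python
from collections import Counter


def propagate_labels(edges, seeds, iterations=10):
    """Synchronous, unweighted label propagation over directed edges.

    edges: list of (source, target) node id pairs
    seeds: dict mapping node id -> seed label
    """
    nodes = sorted({n for edge in edges for n in edge})
    # assumption: unseeded nodes start with their own id as label
    labels = {n: seeds.get(n, n) for n in nodes}
    # incoming[n] lists the nodes whose labels flow into n
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        incoming[dst].append(src)
    for _ in range(iterations):
        new_labels = dict(labels)
        for n in nodes:
            if n in seeds or not incoming[n]:
                continue  # seeds are clamped; isolated targets keep their label
            counts = Counter(labels[m] for m in incoming[n])
            best = max(counts.values())
            # most frequent neighbour label, smallest label on ties
            new_labels[n] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break  # converged before the iteration limit
        labels = new_labels
    return labels


toy_edges = [(100, 101), (101, 102), (200, 201), (300, 301)]
toy_seeds = {100: 1, 200: 2}
print(propagate_labels(toy_edges, toy_seeds))
```

In this toy graph, nodes 101 and 102 inherit sense 1 and node 201 inherits sense 2, while nodes 300 and 301 have no seeded neighbour and end on a label that is neither 1 nor 2. Under the same assumptions, a sentence that no seed label reaches would end up in a community other than 1 or 2.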

### Predictions

We have captured the `test_docid2text` mapping for sentences that haven't been annotated. For each of these sentences, we write the label predicted by the algorithm (returned in the `community` field) to `sentences-compound-preds.tsv`.

As can be seen from the counts below, only some of the unannotated sentences receive a sense label. Of the 623 unannotated sentences for which a prediction was written, 319 have been annotated with sense 1 (chemical compound), 7 have been annotated with sense 2 (multiple or composite), and 297 remain unannotated, that is, their predicted community is neither 1 nor 2.

In [6]:
num_preds = 0
with open(PREDS_FILEPATH, "w") as fpreds:
    for result in results:
        doc_id = result["doc_id"]
        prediction = result["community"]
        # keep only sentences that were unannotated in the input file
        if doc_id in test_docid2text:
            sent_text = test_docid2text[doc_id]
            fpreds.write("{:s}\t{:s}\t{:d}\n".format(doc_id, sent_text, prediction))
            num_preds += 1

print("number of predictions: {:d}".format(num_preds))

number of predictions: 623


In [7]:
pred_df = pd.read_csv(PREDS_FILEPATH, delimiter='\t', 
    names=["doc_id", "sent_text", "prediction"])
pred_df.head()

Unnamed: 0,doc_id,sent_text,prediction
0,S0010938X15301268-4821,"In all, the IRAS results suggest that the near...",1
1,S0010938X15301268-6252,As a consequence of the radial distribution in...,1
2,S0013468616323520-5696,)(1)Jlim=2nFDI3−cI3−lwhere n is the electron n...,42
3,S0013468616323520-5843,The presence of polymer network can also be th...,163
4,S0014299914007481-4079,"Their compound, the hexapeptide MeFKPdChaFr (N...",170


In [8]:
pd.set_option("display.max_colwidth", 250)
chemical_df = pred_df[pred_df.prediction == 1]
chemical_df.head()

Unnamed: 0,doc_id,sent_text,prediction
0,S0010938X15301268-4821,"In all, the IRAS results suggest that the near-surface region of HZ3 consists of zincite and cuprite as surface constituents, and at least one more compound which contains hydroxide ions and carbonate ions, most likely hydrozincite [20] (zinc hyd...",1
1,S0010938X15301268-6252,"As a consequence of the radial distribution in potential and in local chemistry in the NaCl spreading area [24], and combined with the elemental analysis presented above, the formation of the commonly occurring compound simonkolleite, Zn5(OH)8Cl2...",1
6,S002016931100750X-787,ORTEP view of the compound [CuL8(ClO4)2] with the numbering scheme adopted.,1
7,S002016931100750X-2427,"The compound L5 featuring the 1,7-diaza-4-thiacyclononane ([9]aneN2S, 3) was also synthesized for comparison purposes.",1
8,S0020751913002750-981,"Representative two-electrode voltage clamp current traces from Xenopus oocytes expressing Rsanα1/β2 to (A) 1 mM acetylcholine (ACh), (B) 100 μM imidacloprid (IMI) and (C) 100 μM spinosad following a 5 s exposure (grey bar) to each compound.",1


In [9]:
chemical_df.count()["sent_text"]

319

In [10]:
composite_df = pred_df[pred_df.prediction == 2]
composite_df.head()

Unnamed: 0,doc_id,sent_text,prediction
76,S0040402010010859-3830,"As expected, iodination of 19 with 2.5 equiv of NIS in MeCN proceeded smoothly to give the requisite compound 21 (Scheme 4).",2
108,S0040402010010859-23636,"Numbering used for the spectra description is based on 1-(1H-imidazol-2-yl)pent-4-ene-1,2,3-triol backbone as shown for compound 16 (Scheme 2).",2
109,S0040402010010859-23674,"Numbering used for the spectra description is based on the 5-(hydroxymethyl)-5,6,7,8-tetrahydroimidazo[1,2-a]pyridine-6,7,8-triol backbone as shown for compound 18 (Scheme 3).",2
110,S0040402010010859-23726,"Numbering used for the spectra description is based on 6,7,8,9-tetrahydro-5H-imidazo[1,2-a]azepine-7,8,9-triol backbone as shown for compound 25 (Scheme 5).",2
486,S1047847710001474-2520,Sensitive to compound fluorescence,2


In [11]:
composite_df.count()["sent_text"]

7

In [12]:
neither_df = pred_df[(pred_df.prediction != 1) & 
                     (pred_df.prediction != 2)]
neither_df.head()

Unnamed: 0,doc_id,sent_text,prediction
2,S0013468616323520-5696,")(1)Jlim=2nFDI3−cI3−lwhere n is the electron number per molecule, F is the Faraday constant, DI3− is the diffusion coefficient of the limiting compound, andcI3−is the initial concentration of the limiting compound [24].",42
3,S0013468616323520-5843,The presence of polymer network can also be thought of as a barrier for the mobile ions to recombine with each other to form a tightly bound compound which effectively reduce the ion diffusion.,163
4,S0014299914007481-4079,"Their compound, the hexapeptide MeFKPdChaFr (N-methylphenylalanine-Lys-Pro-d-cyclohexylalanine-Phe-d-arginine), was shown to be an antagonist but also had partial agonist behavior (Drapeau et al., 1993).",170
5,S0014488618304321-7563,"Since previous work has shown that α-synuclein increases oxidative stress in models of PD (Esteves et al., 2009; Pan et al., 2011; Perfeito et al., 2017; Tapias et al., 2017), we examined sensitivity to oxidative in stress by exposing worms to th...",170
9,S0020751913002750-3610,"USA), normalised data were fitted to the following equation: Y = Imin + (Imax − Imin)/1 + 10(logEC50−X)nH where Y is the normalised response amplitude to a compound applied at concentration X, Imax and Imin are the maximum and minimum normalised ...",42


In [13]:
neither_df.count()["sent_text"]

297
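The three counts above can also be obtained in a single pass. The following is a minimal sketch on a toy DataFrame, assuming the same `prediction` column as `pred_df`; the helper name `summarize_predictions` and the toy rows are hypothetical.

```python
import pandas as pd


def summarize_predictions(pred_df):
    """Count predictions per sense; any other community counts as unannotated."""
    sense_names = {1: "chemical compound", 2: "multiple or composite"}
    # communities other than 1 or 2 map to NaN, then to "unannotated"
    senses = pred_df["prediction"].map(sense_names).fillna("unannotated")
    return senses.value_counts().to_dict()


# toy rows for illustration only, not real predictions
toy_df = pd.DataFrame({
    "doc_id": ["a-1", "a-2", "b-1", "b-2", "c-1"],
    "prediction": [1, 1, 2, 42, 170],
})
print(summarize_predictions(toy_df))
```

Calling `summarize_predictions(pred_df)` on the real predictions should reproduce the per-sense counts shown in the cells above.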