# INDRA Data Statistics
This notebook provides an overview of the following characteristics of the dataset:
1. Histogram of relation types (distribution of the number of triples per relation type)
2. Number of triples WITH evidence (all of them are supposed to have one?)
3. Number of triples WITH annotations per fine-tuning task in our “benchmark”
4. Class distribution per annotation type (with %)
5. Average token length per annotation and its length distribution (as a plot)
6. Number of triples with MULTIPLE evidences (%) among the annotated ones

## Options for reading large JSON files
The standard-library `json` module loads the entire file into memory. Ideally, we want to avoid that, given that we are
dealing with large JSON files. Candidate packages are:
* ijson: https://pypi.org/project/ijson/
* json-streamer: https://github.com/kashifrazzaqui/json-streamer
* bigjson: https://github.com/henu/bigjson
* pybel

--> For now, we just use pybel, since it is by far the easiest option.
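
To illustrate the memory concern, here is a minimal, stdlib-only sketch of streaming a JSON file element by element instead of loading it whole. It is not what this notebook does (pybel is used below), and it assumes the file is a single top-level JSON array, which is an assumption about the INDRA dump. The function name `iter_json_array` and the toy `"type"` records are made up for illustration. To my understanding, ijson provides the same behaviour via `ijson.items(f, "item")`.

```python
import io
import json


def iter_json_array(handle, chunk_size=65536):
    """Yield the elements of a top-level JSON array one by one.

    Only the current chunk plus any unfinished element is kept in memory.
    Hypothetical helper, lenient about stray commas between elements.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    started = False  # True once the opening "[" has been consumed
    eof = False

    while not eof:
        chunk = handle.read(chunk_size)
        eof = not chunk
        buffer += chunk

        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break  # nothing left to parse, read another chunk
            if not started:
                if buffer[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                buffer = buffer[1:]
                continue
            if buffer[0] == "]":
                return  # end of the array
            if buffer[0] == ",":
                buffer = buffer[1:]
                continue
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if eof:
                    raise
                break  # element is incomplete, read another chunk
            rest = buffer[end:].lstrip()
            # Accept the element only once its closing "," or "]" is visible,
            # so a number cut off at a chunk boundary is not yielded early.
            if not rest or rest[0] not in ",]":
                if eof:
                    raise ValueError("truncated or malformed JSON array")
                break
            yield item
            buffer = rest

    raise ValueError("unexpected end of input")


# Toy input standing in for a statements file; a tiny chunk size forces
# elements to span several reads.
sample = io.StringIO('[{"type": "Phosphorylation"}, {"type": "Complex"}]')
statement_types = [s["type"] for s in iter_json_array(sample, chunk_size=8)]
print(statement_types)
```

With a real file, the handle would come from `open(path, encoding="utf-8")`. The sketch re-parses an unfinished element on every new chunk, which is acceptable for illustration but slow for very large individual elements.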

## Imports and constants

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pybel
from pybel.constants import (
    ANNOTATIONS,
    EVIDENCE,
    RELATION,
    CITATION,
    INCREASES,
    DIRECTLY_INCREASES,
    DECREASES,
    DIRECTLY_DECREASES,
    REGULATES,
    BINDS,
    CORRELATION,
    NO_CORRELATION,
    NEGATIVE_CORRELATION,
    POSITIVE_CORRELATION,
    ASSOCIATION,
    PART_OF,
)

## Loading the Data

In [3]:
DUMMY_EXAMPLE_INDRA = os.path.join("../data/input/", 'statements_2021-01-30-17-21-54.json')

# Use pybel for processing the json
indra_kg = pybel.io.indra.from_indra_statements_json_file(DUMMY_EXAMPLE_INDRA)
indra_kg.summarize()

INFO: [2021-04-12 13:13:36] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent EP300(mods: (modification))
INFO: [2021-04-12 13:13:36] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent EP300(mods: (modification))
INFO: [2021-04-12 13:13:36] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent EP300(mods: (modification))
INFO: [2021-04-12 13:13:37] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent ERBB2(mods: (modification))
INFO: [2021-04-12 13:13:37] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent BAX(mods: (modification))
INFO: [2021-04-12 13:13:38] indra.assemblers.pybel.assembler - Skipping modification of type sumoylation on agent AKT1(mods: (sumoylation, K, 276))
INFO: [2021-04-12 13:13:38] indra.assemblers.pybel.assembler - Skipping modification of type sumoylation on agent SMAD4(mods: (sumoylation,

---------------------  ------------------------------------
Name                   indra
Version                f02568a3-91f2-4d2d-916d-6e539b8f572e
Author                 INDRA
Number of Nodes        2939
Number of Namespaces   11
Number of Edges        20034
Number of Annotations  31
Number of Citations    10003
Number of Authors      0
Network Density        2.32E-03
Number of Components   13
---------------------  ------------------------------------

Type (4)             Count  Example
-----------------  -------  ----------------------------------------------------------------------------------------------
Protein               1468  p(HGNC:391 ! AKT1, pmod(go:0006468 ! "protein phosphorylation", Ser, 473))
Complex               1071  complex(p(HGNC:3236 ! EGFR), p(HGNC:644 ! AR))
Abundance              318  a(CHEBI:94525 ! "N-(2-aminophenyl)-4-[[[4-(3-pyridinyl)-2-pyrimidinyl]amino]methyl]benzamide")
BiologicalProcess       82  bp(MESH:D053903 ! "DNA Breaks, Double-Stranded")


## Put everything in a dataframe

In [6]:
# Dump the edges into a dataframe; evidence, citation and annotations may be missing
ANNOTATION_KEYS = ['cell_line', 'cell_type', 'species', 'location', 'organ', 'disease']
MAX_EDGES = 200  # Temporary cap on the number of processed edges

triple_text_pairs = []

# Iterate through the edges of the graph; enumerate() supplies the edge counter
for counter, (u, v, data) in enumerate(indra_kg.edges(data=True)):

    if counter >= MAX_EDGES:
        break

    # TODO: get the annotation stuff (make 6 columns)
    # TODO: deal with memory constraints
    # Debug output: BEL strings of the source and target nodes
    print(u)
    print(v)
    print("\n")

    # Annotations are optional; an empty dict makes every annotation column default to None
    annotations = data.get(ANNOTATIONS) or {}

    row = {
        'source': u,
        'relation': data[RELATION],
        'target': v,
        'evidence': data.get(EVIDENCE),
        'pmid': data.get(CITATION),
    }
    row.update({key: annotations.get(key) for key in ANNOTATION_KEYS})
    triple_text_pairs.append(row)

all_triple_text_pairs = pd.DataFrame(triple_text_pairs)

all_triple_text_pairs.head(n=20)
# TODO

a(CHEBI:17051 ! fluoride)
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


a(CHEBI:17051 ! fluoride)
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))
p(HGNC:11998 ! TP53)


p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))
p(HGNC:11998 ! TP53)


p(HGNC:11998 ! TP53)
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


p(HGNC:11998 ! TP53)
p(HGNC:11998 ! TP53, pmod(go:0006473 ! "protein acetylation"))


p(HGNC:11998 ! TP53)
p(HGNC:11998 ! TP53, pmod(go:0006468 ! "protein phosphorylation"))


p(HGNC:11998 ! TP53)
p(HGNC:11998 ! TP53, pmod(go:0006468 ! "protein phosphorylation"))


p(HGNC:11998 ! TP53)
p(HGNC:11998 ! TP

| | source | relation | target | evidence | pmid | cell_line | cell_type | species | location | organ | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'function': 'Abundance', 'concept': {'namespa... | directlyIncreases | {'function': 'Protein', 'concept': {'namespace... | However, how p53 is acetylated by fluoride and... | {'namespace': 'pubmed', 'identifier': '31927229'} | | | | | | |
| 1 | {'function': 'Abundance', 'concept': {'namespa... | directlyIncreases | {'function': 'Protein', 'concept': {'namespace... | Here we demonstrate that fluoride activates hi... | {'namespace': 'pubmed', 'identifier': '31927229'} | | | | | | |
| 2 | {'function': 'Protein', 'concept': {'namespace... | increases | {'function': 'Protein', 'concept': {'namespace... | Notably, blocking beta-catenin and CBP interac... | {'namespace': 'pubmed', 'identifier': '27499244'} | | | | | | |
| 3 | {'function': 'Protein', 'concept': {'namespace... | increases | {'function': 'Protein', 'concept': {'namespace... | In response to DNA damage, acetylation of p53 ... | {'namespace': 'pubmed', 'identifier': '20160719'} | | | | | | |
| 4 | {'function': 'Protein', 'concept': {'namespace... | increases | {'function': 'Protein', 'concept': {'namespace... | Based on results showing that either ubiquitin... | {'namespace': 'pubmed', 'identifier': '20639885'} | | | | | | |
| 5 | {'function': 'Protein', 'concept': {'namespace... | increases | {'function': 'Protein', 'concept': {'namespace... | Subsequent assays indicated gamma-bisabolene e... | {'namespace': 'pubmed', 'identifier': '26194454'} | | | | | | |
| 6 | {'function': 'Protein', 'concept': {'namespace... | hasVariant | {'function': 'Protein', 'concept': {'namespace... | | | | | | | | |
| 7 | {'function': 'Protein', 'concept': {'namespace... | decreases | {'function': 'Protein', 'concept': {'namespace... | HAT enzymes act on non histone substrates such... | {'namespace': 'pubmed', 'identifier': '19652528'} | | | | | | |
| 8 | {'function': 'Protein', 'concept': {'namespace... | hasVariant | {'function': 'Protein', 'concept': {'namespace... | | | | | | | | |
| 9 | {'function': 'Protein', 'concept': {'namespace... | directlyIncreases | {'function': 'Protein', 'concept': {'namespace... | For ATM inhibitors, a number of potential meas... | {'namespace': 'pubmed', 'identifier': '25512053'} | | | | | | |


## 1. Histogram with types of relations
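
As a starting point for this section, the following sketch counts triples per relation type and prints a text histogram with percentages. It uses only the standard library so that it runs without the graph. The `relations` list is a toy stand-in that copies the `relation` values of the ten rows displayed in the dataframe preview, not the full dataset. In the notebook itself, `all_triple_text_pairs['relation'].value_counts()` should give the same counts.

```python
from collections import Counter

# Toy stand-in for the 'relation' column (values of the ten rows shown in the preview)
relations = [
    "directlyIncreases", "directlyIncreases", "increases", "increases",
    "increases", "increases", "hasVariant", "decreases", "hasVariant",
    "directlyIncreases",
]

# Number of triples per relation type
relation_counts = Counter(relations)
total = sum(relation_counts.values())

# One text bar per relation type, most frequent first
for relation, count in relation_counts.most_common():
    share = 100 * count / total
    print(f"{relation:<20}{count:>4}  {share:5.1f}%  {'#' * count}")
```

For a proper figure, the same counts could be passed to a matplotlib bar plot, with the relation types on the x-axis and the number of triples on the y-axis.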

