# Data Processing Explained

This notebook explains the data processing step of KG4All, that is, the creation of a Knowledge Graph from natural-language text.

The notebook covers the following topics:

- 1 Environment Setup (both a requirements.txt and a poetry.lock file are provided)
- 2 Data processing, using the abstracts from the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

## 1) Environment Setup

To set up the environment, install the dependencies from requirements.txt or poetry.lock, then run the following command to install the model. This only needs to be done once, and the installation takes a while.

``` !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz ```
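
Before loading the pipeline, it can help to confirm that the model package is importable. The following is a minimal sketch using only the standard library; the helper name `model_is_installed` is an assumption introduced here for illustration and is not part of KG4All.

```python
import importlib.util


def model_is_installed(package_name: str = "en_core_sci_sm") -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


if not model_is_installed():
    # The model is installed with the pip command shown in this section.
    print("en_core_sci_sm not found; run the pip install command once.")
```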

## 2) Data Processing

In layman's terms, the data processing step takes natural-language text as input and extracts entities, together with their related entities, using the [ScispaCy Framework](https://arxiv.org/abs/1902.07669).
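
To make "entities and related entities" concrete, here is a rough sketch of the pairing that `create_head_json` and `create_relation_df` (defined later in this notebook) perform: the top linked candidate of a mention becomes the head, and every candidate becomes a tail. The candidate list, its CUIs and scores, and the helper `to_pairs` are placeholders assumed for illustration; they are not real UMLS lookups.

```python
from typing import Dict, List, Tuple

# Hypothetical linker output for one mention: (CUI, score) pairs, best first.
candidates: List[Tuple[str, float]] = [
    ("C0000001", 0.95),
    ("C0000002", 0.81),
    ("C0000003", 0.74),
]


def to_pairs(candidates: List[Tuple[str, float]]) -> List[Dict[str, str]]:
    """Pair the top candidate (head) with every candidate (tails)."""
    if not candidates:
        return []
    head_cui = candidates[0][0]
    return [{"head_cui": head_cui, "tail_cui": cui} for cui, _ in candidates]


pairs = to_pairs(candidates)
```

Note that, under this scheme, the first pair links the head to itself, because the head is also the first entry in the candidate list.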

In [1]:
# Import the libraries and configure the log file
import os
import pandas as pd
import spacy
import scispacy
from scispacy.umls_linking import UmlsEntityLinker

from loguru import logger
logger.add("create_triplets_df.log", rotation="50 MB") 

In [None]:
# Set up the scispaCy pipeline

# Load the entity extractor (en_core_sci_sm model)
nlp = spacy.load("en_core_sci_sm")

# Load the UMLS entity linker
linker = UmlsEntityLinker(resolve_abbreviations=True, max_entities_per_mention=3)

# Add the entity linker to the pipeline, after the entity extractor
nlp.add_pipe(linker)

In [3]:
# Load the COVID-19 Open Research Dataset Challenge (CORD-19) abstracts dataset.
# Download the metadata.csv file and save it in the same folder as this notebook.

df_abstracts = pd.read_csv("metadata.csv").dropna(subset=['sha','abstract']).filter(['sha','abstract'])
print(df_abstracts.shape)
df_abstracts.head()

(81354, 2)


Unnamed: 0,sha,abstract
0,d1aafb70c066a2068b02786f8929fd9c900897fb,OBJECTIVE: This retrospective chart review des...
1,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,Inflammatory diseases of the respiratory tract...
2,06ced00a5fc04215949aa72528f2eeaae1d58927,Surfactant protein-D (SP-D) participates in th...
3,348055649b6b8cf2b9a376498df9bf41f7123605,Endothelin-1 (ET-1) is a 21 amino acid peptide...
4,5f48792a5fa08bed9f56016f4981ae2ca6031b32,Respiratory syncytial virus (RSV) and pneumoni...
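
The `dropna(subset=...)` and `filter(...)` calls in the cell above keep only the rows that have both a `sha` and an `abstract`, and only those two columns. A toy illustration follows; the `toy` frame and its values are made-up placeholders, not CORD-19 data.

```python
import pandas as pd

# Placeholder rows: only the first one has both a sha and an abstract.
toy = pd.DataFrame({
    "sha": ["a1", None, "c3"],
    "title": ["t1", "t2", "t3"],
    "abstract": ["Some text.", "More text.", None],
})

# Drop rows missing either field, then keep only the two columns of interest.
clean = toy.dropna(subset=["sha", "abstract"]).filter(["sha", "abstract"])
```

The original row index is preserved, so gaps in the index after this step are expected.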


In [4]:
# function definitions

def get_text_entities(sha, text):
    """Extracts the UMLS entities from a text.

    Args:
        sha (str): Document id.
        text (str): Natural-language text.

    Returns:
        Dict: Document id and extracted entities.
    """
    try:
        output = {
            'sha': sha,
            'ents': list(nlp(text).ents)
        }
        return output
    except Exception as e:
        logger.error(f'Error in get_text_entities(): {e}')

def get_ent_info_from_cui(cui):
    """Collects information about an entity given its CUI.

    Args:
        cui (str): Entity CUI.

    Returns:
        Dict: Information about the entity with the given CUI.
    """
    try:
        entity = linker.umls.cui_to_entity[cui]
        name = entity[1]
        tui = entity[3]
        semantic_descrition = entity[4]

        entity_json = {
            'cui': cui,
            'name': name,
            'semantic_descrition': semantic_descrition
        }
        # An entity can have more than one TUI: keep a list in that case,
        # otherwise store the single TUI as a string.
        if len(tui) > 1:
            entity_json['tui'] = list(tui)
        else:
            entity_json['tui'] = ','.join(tui)
        return entity_json
    except Exception as e:
        logger.error(f'Error in get_ent_info_from_cui(): {e}')

def create_head_json(sha, ent):
    """Creates a JSON-like dict for a head entity extracted from a text.

    Args:
        sha (str): Document id.
        ent (scispacy ent): An entity object from the scispacy package.

    Returns:
        Dict: Head entity information.
    """
    try:
        # The first linked candidate is used as the head entity
        head_cui = linker.umls.cui_to_entity[ent._.umls_ents[0][0]][0]
        # All linked candidates (CUIs) are kept as related entities
        related_cui = [candidate[0] for candidate in ent._.umls_ents]
        head_info_json = get_ent_info_from_cui(head_cui)
        head_info_json['related_cui'] = related_cui
        head_info_json['sha'] = sha
        return head_info_json
    except Exception as e:
        logger.error(f'Error in create_head_json(): {e}')

def create_relation_df(head_entity):
    """Given a head entity, creates the relations DataFrame.

    Args:
        head_entity (Dict): Result of the create_head_json function.

    Returns:
        pd.DataFrame: DataFrame with the relations.
    """
    try:
        relations = []
        for linked_cui in head_entity['related_cui']:
            try:
                linked_cui_info = get_ent_info_from_cui(linked_cui)
                tail_tui = linked_cui_info['tui']
                relation = {
                        'head_sha': head_entity['sha'],
                        'head_cui': head_entity['cui'],
                        'head_name': head_entity['name'],
                        'head_tui': head_entity['tui'],
                        'head_semantic_descrition': head_entity['semantic_descrition'],
                        'tail_cui': linked_cui_info['cui'],
                        'tail_name': linked_cui_info['name'],
                        'tail_semantic_descrition': linked_cui_info['semantic_descrition']
                    }
                # tail_tui is either a single string or a list of TUIs
                if not isinstance(tail_tui, str):
                    relation['tail_tui'] = list(tail_tui)
                else:
                    relation['tail_tui'] = tail_tui
                relations.append(relation)
            except Exception:
                # Skip related entities whose information could not be retrieved
                continue
        return pd.DataFrame.from_dict(relations)
    except Exception as e:
        logger.error(f'Error in create_relation_df(): {e}')
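
The lookup in `get_ent_info_from_cui` reads entity fields by position and stores the TUI either as a string (one TUI) or as a list (several). The sketch below reproduces that logic against a stub dictionary so it can be run without the UMLS linker. `StubEntity`, `stub_kb`, `ent_info`, and all values are hypothetical stand-ins; the field order is assumed only from the positional indices used above.

```python
from collections import namedtuple

# Hypothetical stand-in for a linker entry; the field order mirrors the
# positional indices 0..4 used by get_ent_info_from_cui.
StubEntity = namedtuple("StubEntity", "cui name alias tui semantic_descrition")

# Placeholder knowledge base with made-up entries.
stub_kb = {
    "C0000001": StubEntity("C0000001", "placeholder concept", ["alias"],
                           ["T001"], "placeholder definition"),
    "C0000002": StubEntity("C0000002", "other concept", [],
                           ["T001", "T002"], None),
}


def ent_info(cui, kb):
    """Build the entity dict: a single TUI becomes a string, several stay a list."""
    entity = kb[cui]
    tui = list(entity.tui)
    return {
        "cui": cui,
        "name": entity.name,
        "semantic_descrition": entity.semantic_descrition,
        "tui": tui if len(tui) > 1 else ",".join(tui),
    }


single = ent_info("C0000001", stub_kb)
multi = ent_info("C0000002", stub_kb)
```

Because the `tui` value can be either a string or a list, any downstream code that reads this field needs to handle both types.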

In [1]:
# Create the triplets dataframe and write it to a CSV file

n_abstracts = len(df_abstracts)
triplets_file = '2020-09-21-triplets.csv'
for index, row in df_abstracts.iterrows():
    try:
        logger.debug(f'index: {index}')
        # Keep the last 40 characters of the sha field (the length of a SHA-1 hex digest)
        sha = row['sha'][-40:]
        text = row['abstract']
        text_entities = get_text_entities(sha=sha, text=text)
        # Only the first entity found in the abstract is used as the head
        head_entities = create_head_json(sha=text_entities['sha'], ent=text_entities['ents'][0])
        df_relations = create_relation_df(head_entities)
        # If the file does not exist yet, write the header
        if not os.path.isfile(triplets_file):
            df_relations.to_csv(triplets_file, header=True, sep='|')
        else:  # otherwise append without writing the header
            df_relations.to_csv(triplets_file, mode='a', header=False, sep='|')
    except Exception as e:
        logger.error(f"Error: {e}")
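
The "write the header once, then append" pattern used in the loop can be isolated into a small helper. This is a minimal sketch, not the notebook's own code: the helper name `append_rows`, the temporary file, and the placeholder rows are assumptions for illustration, and unlike the loop above it passes `index=False` so the DataFrame index is not written.

```python
import os
import tempfile

import pandas as pd


def append_rows(df, path, sep="|"):
    """Append df to a CSV file, writing the header only when the file is new."""
    write_header = not os.path.isfile(path)
    df.to_csv(path, mode="a", header=write_header, sep=sep, index=False)


# Demo with placeholder rows in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "triplets.csv")
    append_rows(pd.DataFrame({"head_cui": ["C0000001"], "tail_cui": ["C0000002"]}), path)
    append_rows(pd.DataFrame({"head_cui": ["C0000003"], "tail_cui": ["C0000004"]}), path)
    result = pd.read_csv(path, sep="|")
```

Opening the file in append mode for both cases keeps a single code path; the only thing that changes between the first and later writes is the `header` flag.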
