# Manual Relations Analysis

Analysis of relations extracted manually from selected articles. 

The 100 most frequent NER disease entities were manually categorized into disease groups (broader classifications of disease such as breast cancer, lung cancer, or mental health disorder). For the exact categorizations, see the "most_frequent_disease.xlsx" file. The five most frequent disease categories were selected, and the 3 most enriched articles for each category were manually annotated. For the list of extracted relations by disease category, see the "relations_csv/" directory. Among the extracted relations, those already represented by synonymous terms in CRO were marked. This notebook contains the code that streamlined this process, as well as the code used to determine the percentage of manually extracted relations that are covered by CRO.
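The category selection step was done by hand, but the idea can be sketched in a few lines. The snippet below is only an illustration: the entity names, counts, and mapping are made-up toy values (not taken from "most_frequent_disease.xlsx"), and it assumes categories are ranked by the total mention count of their member entities.

```python
from collections import Counter

# Toy data for illustration only: entity -> mention count (hypothetical values)
entity_counts = {
    'breast cancer': 40,
    'breast carcinoma': 10,
    'heart failure': 25,
    'leukemia': 15,
    'depression': 5,
}

# Hypothetical manual mapping of each entity to a broader disease category
entity_to_category = {
    'breast cancer': 'breast_cancer',
    'breast carcinoma': 'breast_cancer',
    'heart failure': 'cardiovascular_disease',
    'leukemia': 'blood_cancer',
    'depression': 'mental_health_disorder',
}

# Sum mention counts per category (assumed ranking criterion)
category_counts = Counter()
for entity, count in entity_counts.items():
    category_counts[entity_to_category[entity]] += count

# Keep the most frequent categories (3 here; the analysis itself kept 5)
top_categories = [category for category, _ in category_counts.most_common(3)]
print(top_categories)
```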

In [1]:
from collections import Counter
import pandas as pd

csv_dir = 'relations_csv' # directory containing relations by disease category

In [60]:
# the five most frequent disease categories
diseases = [
    'breast_cancer',
    'cardiovascular_disease',
    'neurological_disease',
    'blood_cancer',
    'lung_cancer'
]
relations = [] # used to aggregate relations from all disease categories
for disease in diseases:
    # iterate through all 5 disease categories and aggregate relations
    with open(f'{csv_dir}/{disease}_relations.csv', 'r') as f:
        # splitlines() avoids the empty trailing entry that split('\n') yields
        # when a file ends with a newline; blank lines are skipped
        relations += [line for line in f.read().splitlines() if line.strip()]
with open(f'{csv_dir}/combined_disease_relations.csv', 'w') as f:
    f.write('\n'.join(relations))

In [79]:
# read in combined list of relations, skipping any blank lines
# (a blank line would otherwise be counted as a relation)
with open(f'{csv_dir}/combined_disease_relations.csv', 'r') as f:
    combined_relations = [line for line in f.read().splitlines() if line.strip()]

# find unique relations in aggregated relations and calculate frequency
# create {relation: count} dictionary for each unique relation
relation_counts = dict(Counter(combined_relations))
relation_counts_df = pd.DataFrame({
    'relation': list(relation_counts.keys()),
    'count': list(relation_counts.values())
})
# write to excel sheet
relation_counts_df.to_excel(f'{csv_dir}/relation_counts.xlsx')

In [123]:
# Each unique relation was manually checked for an equivalent relation already in CRO.
# Relations with no equivalent have NaN in the 'rough translation' column.

# Calculate the percentage of coverage (i.e. what percent of relations have an equivalent in CRO)
# unweighted, i.e. each unique relation counts once
relations_translations = pd.read_excel('relation_counts_rough_translations.xlsx')
has_translation = relations_translations['rough translation'].notna()
has_translation_abs = has_translation.mean() * 100

# weighted by occurrence, i.e. each relation counts as often as it was extracted
counts = relations_translations['count']
has_translation_weighted = (has_translation * counts).sum() / counts.sum() * 100

print(f'unweighted: {has_translation_abs:.2f}%')
print(f'weighted by frequency of occurrence: {has_translation_weighted:.2f}%')
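
The difference between the two coverage figures is easiest to see on a small table. The sketch below uses made-up relations, counts, and placeholder CRO terms (none of them come from the annotated data) and assumes the same column names as the spreadsheet read above.

```python
import pandas as pd

# Toy table with hypothetical values; None marks a relation with no CRO equivalent
toy = pd.DataFrame({
    'relation': ['relation_a', 'relation_b', 'relation_c', 'relation_d'],
    'count': [6, 2, 1, 1],
    'rough translation': ['cro_term_a', None, 'cro_term_c', None],
})

covered = toy['rough translation'].notna()

# Unweighted: share of unique relations that have an equivalent (2 of 4)
unweighted = covered.mean() * 100

# Weighted: share of all extracted occurrences that are covered (6 + 1 of 10)
weighted = (covered * toy['count']).sum() / toy['count'].sum() * 100

print(f'unweighted: {unweighted:.2f}%')
print(f'weighted by frequency of occurrence: {weighted:.2f}%')
```

Here the weighted figure is higher than the unweighted one because the most frequently extracted relation happens to be covered; if frequent relations were the uncovered ones, the ordering would flip.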