## My annotation 

I started by annotating the first 50 papers and the last 30 papers from the raw data file using data_prepare.py, a Python script that supports the annotation work. The results are stored in annotations/annotated_50.jsonl and annotations/annotated_last30.jsonl.

Next, merge annotated_50.jsonl and annotated_last30.jsonl into one file. The first 50 must come first and the last 30 after them, so that the order is consistent when measuring annotation agreement.

In [18]:
# merge_annotations.py

import json

ANNOTATED_50_FILE = "annotations/annotated_50.jsonl"
ANNOTATED_LAST30_FILE = "annotations/annotated_last30.jsonl"
MERGED_OUTPUT_FILE = "annotations/merged_annotations.jsonl"

def read_jsonl_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

def write_jsonl_file(file_path, data):
    with open(file_path, 'w', encoding='utf-8') as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read the annotated files
first_50_annotations = read_jsonl_file(ANNOTATED_50_FILE)
last_30_annotations = read_jsonl_file(ANNOTATED_LAST30_FILE)

# Merge the annotations
merged_annotations = first_50_annotations + last_30_annotations

# Write the merged annotations to the output file
write_jsonl_file(MERGED_OUTPUT_FILE, merged_annotations)
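
Because the agreement score later depends on the merged file holding exactly the 80 annotated papers in order, a quick sanity check after merging can help. The following is a minimal sketch, not part of the original script: the helper name `check_merged` and the assumption that every record carries a `paperId` key are mine, and the toy records stand in for the real annotation files.

```python
from collections import Counter

def check_merged(records, expected_total=80, id_key="paperId"):
    """Return a list of human-readable problems found in the merged records.

    Hypothetical helper: assumes each record is a dict with an `id_key` field.
    """
    problems = []
    if len(records) != expected_total:
        problems.append(f"expected {expected_total} records, found {len(records)}")
    # Count how often each id occurs to detect papers annotated twice
    counts = Counter(r.get(id_key) for r in records)
    duplicates = sorted(str(pid) for pid, n in counts.items() if n > 1)
    if duplicates:
        problems.append("duplicate ids: " + ", ".join(duplicates))
    return problems

# Toy data standing in for annotated_50.jsonl and annotated_last30.jsonl
first = [{"paperId": f"a{i}", "label": 1} for i in range(50)]
last = [{"paperId": f"b{i}", "label": 0} for i in range(30)]
print(check_merged(first + last))
```

An empty list means the merged data has the expected size and no repeated papers.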

## Get other annotations

Goal: measure annotation agreement between my annotations and those of the other annotators (HK / QC).

#### Steps:
1. Extract the first 50 and last 30 annotated results from the provided xlsx file.
2. Convert the other annotators' results from sheet format into a papers.jsonl file.
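
Step 1 can be sketched as below, assuming the sheet has already been loaded into a list of rows in its original order. The function name `first_and_last` is hypothetical. With a pandas DataFrame, `pd.concat([df.head(50), df.tail(30)])` would select the same rows.

```python
def first_and_last(rows, n_first=50, n_last=30):
    """Return the first `n_first` rows followed by the last `n_last` rows.

    Hypothetical helper: raises if the two slices would overlap, since an
    overlapping selection would count some papers twice.
    """
    if n_first + n_last > len(rows):
        raise ValueError("first and last slices would overlap")
    # Use an explicit start index so that n_last == 0 yields an empty tail
    return rows[:n_first] + rows[len(rows) - n_last:]

# Toy example: 100 row indices standing in for the sheet rows
rows = list(range(100))
selected = first_and_last(rows)
print(len(selected), selected[0], selected[49], selected[50], selected[-1])
```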

In [10]:
import pandas as pd
import json

# Load CSV file with specified encoding
df = pd.read_csv("papers.csv", encoding="latin1")

# Function to build nested 'externalIds' dict
def build_external_ids(row):
    external_keys = ["MAG", "DBLP", "PubMedCentral", "DOI", "CorpusId", "PubMed", "ACL", "ArXiv"]
    return {
        key: row[f"externalIds.{key}"] for key in external_keys if not pd.isna(row[f"externalIds.{key}"])
    }

# Function to map isBionlp values to labels (1 for 'Y', 0 for 'N', 2 for anything else such as 'N/A')
def map_is_bionlp(value):
    if value == "Y":
        return 1
    elif value == "N":
        return 0
    else:
        return 2 

# Convert each row to a JSON object
jsonl_lines = []
for _, row in df.iterrows():
    json_obj = {
        "paperId": row["paperId"],
        "semanticScholarUrl": row["semanticScholarUrl"],
        "title": row["title"],
        "abstract": row["abstract"],
        "externalIds": build_external_ids(row),
        "label": map_is_bionlp(row["isBionlp"]) if "isBionlp" in row else 0
    }
    jsonl_lines.append(json_obj)

# Save to .jsonl
with open("papers.jsonl", "w", encoding="utf-8") as f:
    for item in jsonl_lines:
        print(item)
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

{'paperId': 'd88f70c59b02bba47b3b25b36598917e130b950a', 'semanticScholarUrl': 'https://www.semanticscholar.org/paper/d88f70c59b02bba47b3b25b36598917e130b950a', 'title': 'Drug-Drug Interaction Extraction via Convolutional Neural Networks', 'abstract': 'Drug-drug interaction (DDI) extraction as a typical relation extraction task in natural language processing (NLP) has always attracted great attention. Most state-of-the-art DDI extraction systems are based on support vector machines (SVM) with a large number of manually defined features. Recently, convolutional neural networks (CNN), a robust machine learning method which almost does not need manually defined features, has exhibited great potential for many NLP tasks. It is worth employing CNN for DDI extraction, which has never been investigated. We proposed a CNN-based method for DDI extraction. Experiments conducted on the 2013 DDIExtraction challenge corpus demonstrate that CNN is a good choice for DDI extraction. The CNN-based DDI e

## Get test data (remove the first 50 / last 30 papers from the test set)

In [26]:
import pandas as pd
import json

# Load Excel file
df = pd.read_excel("test_data.xlsx", engine='openpyxl')

# Function to build nested 'externalIds' dict
def build_external_ids(row):
    external_keys = ["MAG", "DBLP", "PubMedCentral", "DOI", "CorpusId", "PubMed", "ACL", "ArXiv"]
    return {
        key: row[f"externalIds.{key}"] for key in external_keys if not pd.isna(row[f"externalIds.{key}"])
    }

# Function to map isBionlp values to labels (1 for 'Y', 0 for 'N' and 'N/A')
def map_is_bionlp(value):
    if value == "Y":
        return 1
    else:
        return 0

# Convert each row to a JSON object
jsonl_lines = []
for _, row in df.iterrows():
    json_obj = {
        "paperId": row["paperId"],
        "semanticScholarUrl": row["semanticScholarUrl"],
        "title": row["title"],
        "abstract": row["abstract"],
        "externalIds": build_external_ids(row),
        "label": map_is_bionlp(row["isBionlp"]) if "isBionlp" in row else 0
    }
    jsonl_lines.append(json_obj)

# Save to .jsonl
with open("test_data.jsonl", "w", encoding="utf-8") as f:
    for item in jsonl_lines:
        print(item)
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

{'paperId': '88ed818b44ac91d0e410c2444b785ae33db490f5', 'semanticScholarUrl': 'https://www.semanticscholar.org/paper/88ed818b44ac91d0e410c2444b785ae33db490f5', 'title': 'Diseases 2.0: a weekly updated database of disease–gene associations from text mining and data integration', 'abstract': 'The scientific knowledge about which genes are involved in which diseases grows rapidly, which makes it difficult to keep up with new publications and genetics datasets. The DISEASES database aims to provide a comprehensive overview by systematically integrating and assigning confidence scores to evidence for disease–gene associations from curated databases, genome-wide association studies (GWAS), and automatic text mining of the biomedical literature. Here, we present a major update to this resource, which greatly increases the number of associations from all these sources. This is especially true for the text-mined associations, which have increased by at least 9-fold at all confidence cutoffs. We
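
The cell above converts the test sheet to JSONL but does not show the removal of the annotated papers named in the heading. One possible way to do it is sketched here, under the assumption that the papers are matched by `paperId`. The original may instead have removed them by position in the sheet, and the helper name `exclude_annotated` is hypothetical.

```python
def exclude_annotated(test_records, annotated_records, id_key="paperId"):
    """Drop every test record whose id also appears in the annotated set.

    Hypothetical helper: matching by id is an assumption, not the
    author's confirmed method.
    """
    annotated_ids = {r[id_key] for r in annotated_records}
    return [r for r in test_records if r[id_key] not in annotated_ids]

# Toy data: p1 and p3 were already annotated, so only p2 remains
test_records = [{"paperId": "p1"}, {"paperId": "p2"}, {"paperId": "p3"}]
annotated_records = [{"paperId": "p1"}, {"paperId": "p3"}]
remaining = exclude_annotated(test_records, annotated_records)
print(remaining)
```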

## Measure Annotation Agreement 

We use Cohen's Kappa as the evaluation metric.

Cohen's Kappa coefficient is a statistic that measures inter-rater agreement for categorical items. It is generally considered a more robust measure than simple percent agreement calculation since it takes into account the agreement occurring by chance.
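
To make the chance correction concrete, kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it by hand on made-up labels. The function name `cohen_kappa` and the toy data are illustrative only; the analysis in this notebook uses scikit-learn's `cohen_kappa_score`.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal rates
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Toy labels: the raters agree on 9 of 10 items (p_o = 0.9, p_e = 0.5)
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohen_kappa(rater_a, rater_b), 3))  # approximately 0.8
```

Here 90% raw agreement becomes a kappa of about 0.8, because half of the agreement could have arisen by chance.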

In [19]:
import json
from sklearn.metrics import cohen_kappa_score

# Function to read a JSONL file and return a list of dictionaries
def read_jsonl_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

# Read the annotated files
merged_annotations = read_jsonl_file("annotations/merged_annotations.jsonl")
papers_annotations = read_jsonl_file("annotations/papers.jsonl")

# Ensure they have the same number of items
if len(merged_annotations) != len(papers_annotations):
    raise ValueError("The number of papers in the merged annotations and papers annotations files do not match.")

# Extract the labels from both sets of annotations
merged_labels = [item['label'] for item in merged_annotations]
papers_labels = [item['label'] for item in papers_annotations]

# Calculate Cohen's Kappa score for agreement
kappa_score = cohen_kappa_score(merged_labels, papers_labels)

print(f"Cohen's Kappa Score: {kappa_score}")

Cohen's Kappa Score: 0.8755832037325039


A Cohen's Kappa score of about 0.88 can be interpreted as follows:

High Reliability: On the commonly used Landis and Koch scale, values above 0.80 fall in the "almost perfect agreement" band, so the two sets of annotations are highly consistent.

Acceptance in Most Fields: Agreement at this level is generally considered acceptable, including in demanding fields such as medical research.

## Convert jsonl file to csv file 

To see the detailed differences in our final results, we convert the JSONL file to CSV for easier inspection.

In [21]:
import json
import csv

def read_jsonl_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

def flatten(d, parent_key='', sep='.'):
    """
    Flatten a nested dictionary
    """
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def jsonl_to_csv(jsonl_file, csv_file):
    data = read_jsonl_file(jsonl_file)

    # Flatten each JSON object and create a set of all columns
    flattened_data = [flatten(record) for record in data]
    columns = set()
    for record in flattened_data:
        columns.update(record.keys())
    
    # Write to CSV
    with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=sorted(columns))
        writer.writeheader()
        for record in flattened_data:
            writer.writerow(record)

# Usage
jsonl_file_path = 'annotations/merged_annotations.jsonl'
csv_file_path = 'merged_annotations.csv'
jsonl_to_csv(jsonl_file_path, csv_file_path)

## Some differences

Detailed information: https://docs.google.com/spreadsheets/d/1J5xpQYRGW6fJnT6FG0xC2iRBycRBsFe4idsYKVkht_M/edit?usp=sharing

1. There are some papers that we classified differently.
2. Three of them were marked as N/A by the other reviewer, but I marked them as not BioNLP so that the model can make a prediction.
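
The differing papers can also be listed programmatically rather than by reading the spreadsheet. This is a minimal sketch with toy records, assuming both files share `paperId`, `title`, and `label` fields. The helper name `find_disagreements` is hypothetical, and label 2 stands for the other reviewer's N/A as in the conversion step.

```python
def find_disagreements(mine, theirs, id_key="paperId"):
    """Return (id, title, my_label, their_label) for papers labeled differently."""
    their_labels = {r[id_key]: r["label"] for r in theirs}
    rows = []
    for record in mine:
        pid = record[id_key]
        # Only compare papers present in both annotation sets
        if pid in their_labels and their_labels[pid] != record["label"]:
            rows.append((pid, record.get("title", ""), record["label"], their_labels[pid]))
    return rows

# Toy records: p2 differs (1 vs 0) and p3 differs (0 vs N/A, encoded as 2)
mine = [
    {"paperId": "p1", "title": "Paper one", "label": 1},
    {"paperId": "p2", "title": "Paper two", "label": 1},
    {"paperId": "p3", "title": "Paper three", "label": 0},
]
theirs = [
    {"paperId": "p1", "label": 1},
    {"paperId": "p2", "label": 0},
    {"paperId": "p3", "label": 2},
]
for row in find_disagreements(mine, theirs):
    print(row)
```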

For some of those papers, here is my analysis:

1. Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network: I marked it as BioNLP, but the other reviewer disagreed.

Reason: 
Relevance to Biomedical Domain: The paper focuses on chemical-disease relation and chemical reaction datasets, which are important components of biomedical research.

Use of NLP Techniques: It uses NLP methods for relation extraction, leveraging techniques like GCNN and MIL that are specifically tailored to biomedical text.

2. Nested Named Entity Recognition with Partially-Observed TreeCRFs (yes from me, no from the other reviewer)

Reason: 
Relevance to Biomedical Domain: The paper includes the GENIA dataset, which is biomedical, but it primarily focuses on general techniques and also includes datasets that are not exclusively biomedical.

Use of NLP Techniques: The paper uses advanced NLP methods for NER and nested structures, even though these methods are primarily applied to biomedical texts.