## **Notebook Summary**

### **Purpose**
The notebook preprocesses and subsets the SIGIR2016/TrialGPT clinical trial dataset to produce a smaller, well-structured development/test set. Negative (score 0) samples are kept only when the trial has non-blank inclusion and exclusion criteria, which matters for downstream clinical trial eligibility modeling and evaluation; positive samples are kept as-is and audited separately.

---

### **Key Steps**

#### **1. Import Management**
- Defines a utility (`clean_and_sort_imports`) that deduplicates Python import statements and sorts them by line length, then alphabetically, for code hygiene.

#### **2. Data Loading**
- Loads three main files from the processed dataset:
  - `test.tsv`: Tabular relevance judgments (query-id, corpus-id, score).
  - `queries.jsonl`: Patient queries (likely with demographic/clinical info).
  - `corpus.jsonl`: Clinical trial records, each with metadata (inclusion/exclusion criteria, drugs, diseases, etc.).

#### **3. Data Inspection**
- Displays DataFrame info, basic statistics, and value counts for exploratory analysis.
- Inspects the structure/content of queries and corpus data.

#### **4. Subset Construction**
- **Small Dev Set Creation:**
  - Selects the first few unique query-ids (6 in the code shown).
  - Filters `test.tsv` to these queries.
  - Separates positive (score 1/2) and negative (score 0) samples; all positives are kept.
  - For negatives, retains only those with non-blank inclusion and exclusion criteria in the trial metadata, then keeps one per query-id (the first after sorting by corpus-id).
  - Combines positives and filtered negatives into a new subset.

- **Full Test Set Option:**
  - Alternatively, allows using the entire test set without subsetting.

- **Corresponding Query/Trial Subsets:**
  - Filters `queries.jsonl` and `corpus.jsonl` to only those referenced in the new subset.
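
The subset construction described above can be sketched as a single function. This is a minimal sketch on toy data, not the notebook's exact code: `build_subset`, `has_text`, and the toy IDs (`q1`, `T1`, ...) are hypothetical names introduced for illustration, and the sketch assumes the criteria fields are plain strings.

```python
import pandas as pd


def has_text(meta, key):
    """True if meta is a dict whose `key` holds a non-blank string (assumed type)."""
    if not isinstance(meta, dict):
        return False
    val = meta.get(key)
    return isinstance(val, str) and val.strip() != ""


def build_subset(df_tsv, df_corpus, n_queries=None):
    """Keep all positives plus at most one complete-criteria negative per query.

    n_queries=None keeps every query-id but still applies the negative filter.
    """
    query_ids = df_tsv["query-id"].unique()
    if n_queries is not None:
        query_ids = query_ids[:n_queries]
    filtered = df_tsv[df_tsv["query-id"].isin(query_ids)]

    positives = filtered[filtered["score"].isin([1, 2])]
    negatives = filtered[filtered["score"] == 0].merge(
        df_corpus[["_id", "metadata"]], left_on="corpus-id", right_on="_id", how="left"
    )
    complete = negatives["metadata"].apply(
        lambda m: has_text(m, "inclusion_criteria") and has_text(m, "exclusion_criteria")
    )
    negatives = (
        negatives[complete]
        .sort_values("corpus-id")
        .drop_duplicates(subset=["query-id"], keep="first")
        .drop(columns=["_id", "metadata"])
    )
    return pd.concat([positives, negatives], ignore_index=True)


# Toy data: two queries, four trials with varying criteria completeness.
df_tsv = pd.DataFrame({
    "query-id": ["q1", "q1", "q1", "q2", "q2"],
    "corpus-id": ["T1", "T2", "T3", "T1", "T4"],
    "score": [1, 0, 0, 2, 0],
})
df_corpus = pd.DataFrame({
    "_id": ["T1", "T2", "T3", "T4"],
    "metadata": [
        {"inclusion_criteria": "adults", "exclusion_criteria": ""},
        {"inclusion_criteria": "adults", "exclusion_criteria": " "},
        {"inclusion_criteria": "adults", "exclusion_criteria": "pregnancy"},
        {"inclusion_criteria": "", "exclusion_criteria": "pregnancy"},
    ],
})
subset = build_subset(df_tsv, df_corpus, n_queries=2)
print(subset)
```

In this toy run `q2` ends up with no negative at all, because its only score-0 trial has a blank inclusion field. The same can happen on real data, so the number of negatives per query is worth checking after subsetting.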

#### **5. Output**
- Writes the filtered `test.tsv`, `queries.jsonl`, and `corpus.jsonl` to a new directory (`processed_cut_debug`), preserving the original format.

#### **6. Data Quality Checks**
- Merges positive samples (score 1/2) with trial metadata.
- Implements functions to check for missing/blank inclusion or exclusion criteria.
- Reports counts of missing fields, supporting data quality assessment for downstream modeling.

---

### **Technical/Architectural Notes**
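- **Data Integrity:** Ensures that all selected negative samples have the inclusion and exclusion criteria needed for eligibility modeling; positive samples are not filtered, and the data quality checks report how many of them lack these fields.
- **Efficiency:** Uses pandas for in-memory filtering and merging; `queries.jsonl` is parsed line by line and `corpus.jsonl` is read with `pd.read_json(..., lines=True)`, with both held fully in memory afterwards.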
- **Reproducibility:** Outputs are written in standard formats, facilitating further analysis or model development.
- **Modularity:** The code is structured to allow easy adjustment of subset size and filtering criteria.

---

### **Intended Use**
- The resulting subset is suitable for rapid prototyping, debugging, or focused evaluation of clinical trial eligibility models. Only negatives are filtered for complete criteria, so some positive samples may still have blank inclusion or exclusion criteria (see the data quality checks).

---


In [1]:
import os
import glob
import json
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# Set pandas display options 
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 300)

In [2]:
def clean_and_sort_imports(import_list):
    # Split each import statement into lines
    imports = import_list.split('\n')
    
    # Remove empty lines, strip whitespace, and remove duplicates
    unique_imports = list(dict.fromkeys(line.strip() for line in imports if line.strip()))
    
    # Sort imports by length, then alphabetically for imports of the same length
    sorted_imports = sorted(unique_imports, key=lambda x: (len(x), x))
    
    return sorted_imports

# Example usage:
imports = """
import os
import glob
import json
import random
import numpy as np
import os
import glob
import json
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, classification_report, confusion_matrix
"""

cleaned_sorted_imports = clean_and_sort_imports(imports)

# Print the cleaned and sorted imports
for imp in cleaned_sorted_imports:
    print(imp)

import os
import glob
import json
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, classification_report, confusion_matrix


In [3]:
ls ../data/sigir2016/processed

corpus.jsonl  queries.jsonl  test.tsv


In [4]:
# Load the TSV file
tsv_file = '../data/sigir2016/processed/test.tsv'
# df_tsv = pd.read_csv(tsv_file, sep='\t', names=['query-id', 'corpus-id', 'score'])
df_tsv = pd.read_csv(tsv_file, sep='\t', header=0)

In [5]:
# Display info about the DataFrame
print("\nDataFrame Info:")
print(df_tsv.info())

# Display basic statistics of the DataFrame
print("\nDataFrame Description:")
print(df_tsv.describe())

# Display value counts for each column
for column in df_tsv.columns:
    print(f"\nValue counts for {column}:")
    print(df_tsv[column].value_counts().head())

# # Optional: Save to CSV for easy viewing in spreadsheet software
# df_tsv.to_csv('test_tsv_data.csv', index=False)
# print("\nDataFrame saved to 'test_tsv_data.csv'")

# Display the first few rows of the DataFrame
print("First few rows of the DataFrame:")
# df_tsv.head()
df_tsv


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3870 entries, 0 to 3869
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   query-id   3870 non-null   object
 1   corpus-id  3870 non-null   object
 2   score      3870 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 90.8+ KB
None

DataFrame Description:
             score
count  3870.000000
mean      0.394574
std       0.675703
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       2.000000

Value counts for query-id:
query-id
sigir-20147     153
sigir-20141     143
sigir-201518    118
sigir-20153     118
sigir-20146     108
Name: count, dtype: int64

Value counts for corpus-id:
corpus-id
NCT02134652    5
NCT00455468    5
NCT02269761    4
NCT01512277    4
NCT00264901    4
Name: count, dtype: int64

Value counts for score:
score
0    2764
1     685
2     421
Name: count, dtype: int64
First few rows of the DataFrame

Unnamed: 0,query-id,corpus-id,score
0,sigir-20141,NCT00000408,0
1,sigir-20141,NCT00000492,1
2,sigir-20141,NCT00000501,0
3,sigir-20141,NCT00001853,0
4,sigir-20141,NCT00004727,0
...,...,...,...
3865,sigir-20159,NCT02459171,0
3866,sigir-20159,NCT02459327,0
3867,sigir-20159,NCT02498964,0
3868,sigir-20159,NCT02612896,0
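
Because the subset is later built from the first few query-ids, it can help to look at the label counts per query and not only the global counts. The following is a minimal sketch on toy data (the frame `df` and its IDs are hypothetical stand-ins for `df_tsv`), using `pd.crosstab` and a duplicate-pair check.

```python
import pandas as pd

# Toy stand-in for df_tsv with the same three columns.
df = pd.DataFrame({
    "query-id": ["q1", "q1", "q1", "q2", "q2"],
    "corpus-id": ["T1", "T2", "T3", "T1", "T4"],
    "score": [1, 0, 0, 2, 0],
})

# Rows: query-id, columns: score, cells: number of judged trials.
per_query = pd.crosstab(df["query-id"], df["score"])
print(per_query)

# A trial may appear under several queries, but each (query, trial) pair
# is expected to be judged once; count any repeated pairs.
dup_pairs = df.duplicated(subset=["query-id", "corpus-id"]).sum()
print(f"Duplicate (query-id, corpus-id) pairs: {dup_pairs}")
```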


In [6]:
# Read the JSONL file from the dataset
# This file is part of the TrialGPT dataset and contains query information
# The 'queries.jsonl' file likely includes patient information
jsonl_file = '../data/sigir2016/processed/queries.jsonl'
queries = []

# Open the file and read it line by line
with open(jsonl_file, 'r') as file:
    for line in file:
        # Parse each line as a JSON object and append it to the queries list
        # Parsing line by line avoids loading the raw file as one string;
        # all parsed records are still kept in memory in the queries list
        queries.append(json.loads(line))

# Convert the list of dictionaries (parsed JSON objects) to a pandas DataFrame
# This transformation allows for easier data manipulation and analysis using pandas functions
df_queries = pd.DataFrame(queries)

# Display the entire DataFrame
# This allows us to inspect the structure and content of the queries data from the dataset
df_queries

Unnamed: 0,_id,text
0,sigir-20141,"A 58-year-old African-American woman presents to the ER with episodic pressing/burning anterior chest pain that began two days earlier for the first time in her life. The pain started while she was walking, radiates to the back, and is accompanied by nausea, diaphoresis and mild dyspnea, but is ..."
1,sigir-20142,"An 8-year-old male presents in March to the ER with fever up to 39 C, dyspnea and cough for 2 days. He has just returned from a 5 day vacation in Colorado. Parents report that prior to the onset of fever and cough, he had loose stools. He denies upper respiratory tract symptoms. On examination h..."
2,sigir-20143,"A 58-year-old nonsmoker white female with mild exertional dyspnea and occasional cough is found to have a left lung mass on chest x-ray. She is otherwise asymptomatic. A neurologic examination is unremarkable, but a CT scan of the head shows a solitary mass in the right frontal lobe."
3,sigir-20144,"A 2-year-old boy is brought to the emergency department by his parents for 5 days of high fever and irritability. The physical exam reveals conjunctivitis, strawberry tongue, inflammation of the hands and feet, desquamation of the skin of the fingers and toes, and cervical lymphadenopathy with t..."
4,sigir-20145,A 56-year-old female on 20th day post-left mastectomy presents to the emergency department complaining of shortness of breath and malaise. The patient says that she has remained in bed for the last two weeks. The physical examination reveals tenderness on the left upper thoracic wall and right c...
...,...,...
55,sigir-201526,"A 28 yo female G1P0A0 is admitted to the Ob/Gyn service for non-ruptured ectopic pregnancy. Past medical history is remarkable for obesity, a non-complicated appendectomy at age 8, infertility treatment for the past 3 years, and pelvic laparoscopy during which minor right Fallopian tube adhesion..."
56,sigir-201527,"A 15 yo girl accompanied by her mother is referred for evaluation by the school. The girl has more than expected absences in the last three month, appears to be constantly tired and sleepy in class. Her mother assures the girl is well fed, and getting the proper sleep at night but admits the gir..."
57,sigir-201528,"A previously healthy 8-year-old boy presents with a complaint of right lower extremity pain and fever. He reports limping for the past two days. The parents report no previous trauma, but do remember a tick bite during a summer visit to Maryland several months ago. They do not remember observing..."
58,sigir-201529,"A 4-year-old girl presents with persistent fever for the past week. The parents report a spike at 104°F. The parents brought the child to the emergency room when they noticed erythematous rash on the girl's trunk. Physical examination reveals strawberry red tongue, red and cracked lips, and swol..."


In [7]:
# Read the corpus.jsonl file
# corpus.jsonl contains detailed information about clinical trials
# Each line in this file is a JSON object representing a single trial
# The structure includes fields like '_id' (trial ID), 'title', 'text', and 'metadata'
corpus_file = '../data/sigir2016/processed/corpus.jsonl'

# Use pandas to read the JSONL file
# The 'lines=True' parameter tells pandas to read the file as JSON Lines format
# where each line is a separate JSON object
df_corpus = pd.read_json(corpus_file, lines=True)

# Display the entire DataFrame
df_corpus

# Note on corpus.jsonl structure:
# Each entry in corpus.jsonl looks like this:
# {
# "_id": nct_id,
# "title": title,
# "metadata": {
#     "phase": phase,
#     "drugs": str(drugs_list),
#     "drugs_list": drugs_list,
#     "diseases": str(diseases_list),
#     "diseases_list": diseases_list,
#     "enrollment": enrollment_value,
#     "inclusion_criteria": inclusion_text,
#     "exclusion_criteria": exclusion_text,
#     "brief_summary": brief_summary_text,
#     "detailed_description": detailed_description_text
# }
# }

Unnamed: 0,_id,title,metadata
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets,"{'phase': 'Phase 1/Phase 2', 'drugs': '['Nifedipine']', 'drugs_list': ['Nifedipine'], 'diseases_list': ['Congenital Adrenal Hyperplasia'], 'enrollment': '0', 'inclusion_criteria': '- diagnosed with Congenital Adrenal Hyperplasia (CAH) - normal ECG during baseline evaluation', 'exclusion_criteria..."
1,NCT00000104,Does Lead Burden Alter Neuropsychological Development?,"{'phase': 'N/A', 'drugs': '['ERP measures of attention and memory']', 'drugs_list': ['ERP measures of attention and memory'], 'diseases_list': ['Lead Poisoning'], 'enrollment': '0', 'inclusion_criteria': '- Pregnant mothers of the Phillips neighborhood in Minneapolis, Minnesota. Subject recruitm..."
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Immune Responses.,"{'phase': 'N/A', 'drugs': '['Intracel KLH Vaccine', 'Biosyn KLH', 'Montanide ISA51', 'Tetanus toxoid']', 'drugs_list': ['Intracel KLH Vaccine', 'Biosyn KLH', 'Montanide ISA51', 'Tetanus toxoid'], 'diseases_list': ['Cancer'], 'enrollment': '112', 'inclusion_criteria': '- Patients must have a diag..."
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia for the Treatment of Rheumatoid Diseases,"{'phase': 'N/A', 'drugs': '['Whole body hyperthermia unit']', 'drugs_list': ['Whole body hyperthermia unit'], 'diseases_list': ['Rheumatic Diseases'], 'enrollment': '0', 'inclusion_criteria': '- Patients are required to meet the criteria of the American College of Rheumatology (ACR)for rheumatoi..."
4,NCT00000107,Body Water Content in Cyanotic Congenital Heart Disease,"{'phase': 'N/A', 'drugs': '[]', 'drugs_list': [], 'diseases_list': ['Heart Defects, Congenital'], 'enrollment': '0', 'inclusion_criteria': '- Resting blood pressure below 140/90', 'exclusion_criteria': '', 'brief_summary': 'Adults with cyanotic congenital heart disease have elevated levels of pl..."
...,...,...,...
204850,NCT02634177,Genecept Assay™ vs. Treatment-as-Usual to Evaluate Efficacy of Assay-Guided Treatment in Adults With Major Depressive Disorder,"{'phase': 'N/A', 'drugs': '['Assay-guided treatment (AGT)', 'Treatment-as-usual (TAU)']', 'drugs_list': ['Assay-guided treatment (AGT)', 'Treatment-as-usual (TAU)'], 'diseases_list': ['Major Depressive Disorder'], 'enrollment': '300', 'inclusion_criteria': '1. Age 18-75 years 2. Ability to under..."
204851,NCT02634190,Clinical Evaluation of the APTIMA® HPV Assay and Comparison With the HR HC2® Test Using LBC ThinPrep® Specimens,"{'phase': 'N/A', 'drugs': '['Thinprep® LBC', 'APTIMA® HPV Assay', 'HR HC2® HPV DNA', 'Colposcopy']', 'drugs_list': ['Thinprep® LBC', 'APTIMA® HPV Assay', 'HR HC2® HPV DNA', 'Colposcopy'], 'diseases_list': ['Human Papilloma Virus Infection'], 'enrollment': '10000', 'inclusion_criteria': '', 'excl..."
204852,NCT02634203,Riociguat Versus Balloon Pulmonary Angioplasty in Non-operable Chronic thromboEmbolic Pulmonary Hypertension,"{'phase': 'N/A', 'drugs': '['Balloon Pulmonary Angioplasty (BPA)', 'Riociguat']', 'drugs_list': ['Balloon Pulmonary Angioplasty (BPA)', 'Riociguat'], 'diseases_list': ['Chronic Thromboembolic Pulmonary Hypertension'], 'enrollment': '124', 'inclusion_criteria': '- 18 to 80 years of age at Visit 1..."
204853,NCT02634216,Effects of Capros in Patients With Type-1 Diabetes,"{'phase': 'N/A', 'drugs': '['Capros']', 'drugs_list': ['Capros'], 'diseases_list': ['Type I Diabetes'], 'enrollment': '20', 'inclusion_criteria': '- Subjects must be 10 - 40 yrs. of age - Type 1 Diabetes using Continuous Glucose Monitoring (CGM) for at least the last 3 months - Less than 10% var..."
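
Since `metadata` is a nested dict per trial, one way to inspect the criteria fields is to flatten it into ordinary columns. This is a minimal sketch on a toy frame (`df_corpus_toy` and its trial IDs are hypothetical), using `pd.json_normalize`; it assumes the criteria values are strings, as in the structure noted above.

```python
import pandas as pd

# Toy stand-in for df_corpus with a reduced metadata dict.
df_corpus_toy = pd.DataFrame({
    "_id": ["NCT-A", "NCT-B"],
    "title": ["Trial A", "Trial B"],
    "metadata": [
        {"phase": "N/A", "inclusion_criteria": "- adults", "exclusion_criteria": ""},
        {"phase": "Phase 3", "inclusion_criteria": "", "exclusion_criteria": "- pregnancy"},
    ],
})

# Expand each metadata dict into its own columns, then attach them to the IDs.
meta = pd.json_normalize(df_corpus_toy["metadata"].tolist())
flat = pd.concat([df_corpus_toy.drop(columns="metadata"), meta], axis=1)

# Treat missing keys (NaN after flattening) and whitespace-only text as blank.
blank = flat[["inclusion_criteria", "exclusion_criteria"]].apply(
    lambda col: col.fillna("").str.strip() == ""
)
print(blank.sum())
```

Flattening the full corpus creates one column per metadata key, which costs extra memory; for a quick audit it may be enough to flatten only the rows referenced by `test.tsv`.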


# Make cut-down dev set

## or

In [8]:
# Suppose df_tsv and df_corpus are already loaded

# Step 1: Optionally, limit to the first 6 query-ids (or use all)
query_ids = df_tsv['query-id'].unique()[:6]  # .take([0, 1, 2, 3, 4, 5]) for first 6

# Step 2: Filter df_tsv to only those query-ids
filtered = df_tsv[df_tsv['query-id'].isin(query_ids)]

# Step 3: Get all score 1s and 2s
ones_twos = filtered[filtered['score'].isin([1, 2])]

# Step 4: Find zeros
zeros = filtered[filtered['score'] == 0]

# Step 5: Merge `metadata` from df_corpus (matching _id with corpus-id)
zeros_with_meta = zeros.merge(
    df_corpus[['_id', 'metadata']],
    left_on='corpus-id', right_on='_id', how='left'
)

# Step 6: Helper for non-blank fields
def not_blank_metadata(meta, key):
    try:
        val = meta.get(key, None) if isinstance(meta, dict) else None
        return pd.notnull(val) and str(val).strip() != ''
    except Exception:
        return False

# Step 7: Keep zeros where both inclusion_criteria AND exclusion_criteria are non-blank in metadata,
# then keep one negative per query-id (the first after sorting by corpus-id)
zeros_kept = (
    zeros_with_meta[
        zeros_with_meta['metadata'].apply(lambda m: not_blank_metadata(m, 'inclusion_criteria'))
        & zeros_with_meta['metadata'].apply(lambda m: not_blank_metadata(m, 'exclusion_criteria'))
    ]
    .sort_values('corpus-id')
    .drop_duplicates(subset=['query-id'], keep='first')
    .drop(columns=['_id', 'metadata'])  # Cleanup if desired
)

# Step 8: Combine all; ignore_index=True already yields a fresh 0..n-1 index
df_tsv_subset = pd.concat([ones_twos, zeros_kept], ignore_index=True)

## or whole test

In [9]:
df_tsv_subset

Unnamed: 0,query-id,corpus-id,score
0,sigir-20141,NCT00000492,1
1,sigir-20141,NCT00005127,2
2,sigir-20141,NCT00005485,2
3,sigir-20141,NCT00005757,2
4,sigir-20141,NCT00129233,1
...,...,...,...
104,sigir-201414,NCT00006196,0
105,sigir-201412,NCT00011102,0
106,sigir-201410,NCT00043277,0
107,sigir-201413,NCT00067691,0


In [10]:
# Get the unique query-ids from the subset
query_ids_subset = df_tsv_subset['query-id'].unique()

# Filter df_queries to only those _id's
df_queries_subset = df_queries[df_queries['_id'].isin(query_ids_subset)].reset_index(drop=True)

In [11]:
df_queries_subset

Unnamed: 0,_id,text
0,sigir-20141,"A 58-year-old African-American woman presents to the ER with episodic pressing/burning anterior chest pain that began two days earlier for the first time in her life. The pain started while she was walking, radiates to the back, and is accompanied by nausea, diaphoresis and mild dyspnea, but is ..."
1,sigir-201410,"A physician is called to see a 67-year-old woman who underwent cardiac catheterization via the right femoral artery earlier in the morning. She is now complaining of a cool right foot. Upon examination she has a pulsatile mass in her right groin with loss of distal pulses, and auscultation revea..."
2,sigir-201411,"A 40-year-old woman with no past medical history presents to the ER with excruciating pain in her right arm that had started 1 hour prior to her admission. She denies trauma. On examination she is pale and in moderate discomfort, as well as tachypneic and tachycardic. Her body temperature is nor..."
3,sigir-201412,"A 25-year-old woman presents to the clinic complaining of prolonged fatigue. She denies difficulty sleeping and sleeps an average of 8 hours a night. She also notes hair loss, a change in her voice and weight gain during the previous 6 months. She complains of cold intolerance. On examination sh..."
4,sigir-201413,"A 30-year-old generally healthy woman presents with shortness of breath that had started 2 hours before admission. She has had no health problems in the past besides 2 natural abortions. She had given birth to a healthy child 3 weeks before. On examination, she is apprehensive, tachypneic and ta..."
5,sigir-201414,"An 85-year-old man is brought to the ER because of gradual decrease in his level of consciousness. In the last 3 days he stopped walking and eating by himself. He has had no fever, cough, rash or diarrhea. His daughter recalls that he had been involved in a car accident 3 weeks prior to his admi..."


In [12]:
# Get the unique corpus-ids from the subset
corpus_ids_subset = df_tsv_subset['corpus-id'].unique()

# Filter df_corpus to only those _id's
df_corpus_subset = df_corpus[df_corpus['_id'].isin(corpus_ids_subset)].reset_index(drop=True)

In [13]:
df_corpus_subset

Unnamed: 0,_id,title,metadata
0,NCT00000408,Low Back Pain Patient Education Evaluation,"{'phase': 'N/A', 'drugs': '['Patient education evaluation']', 'drugs_list': ['Patient education evaluation'], 'diseases_list': ['Low Back Pain'], 'enrollment': '580', 'inclusion_criteria': '- Must live in the United States - Must understand and write English - Must have access to a computer with..."
1,NCT00000492,Beta-Blocker Heart Attack Trial (BHAT),"{'phase': 'Phase 3', 'drugs': '['propranolol']', 'drugs_list': ['propranolol'], 'diseases_list': ['Arrhythmia', 'Cardiovascular Diseases', 'Coronary Disease', 'Death, Sudden, Cardiac', 'Heart Diseases', 'Myocardial Infarction', 'Myocardial Ischemia', 'Ventricular Fibrillation'], 'enrollment': '0..."
2,NCT00005127,Muscatine Heart Study,"{'phase': 'N/A', 'drugs': '[]', 'drugs_list': [], 'diseases_list': ['Cardiovascular Diseases', 'Coronary Disease', 'Hypertension', 'Heart Diseases'], 'enrollment': '0', 'inclusion_criteria': '', 'exclusion_criteria': '', 'brief_summary': 'To conduct longitudinal and cross-sectional studies of ri..."
3,NCT00005485,Environmental and Genetic Factors That Influence Cardiovascular Disease in African Americans,"{'phase': 'N/A', 'drugs': '[]', 'drugs_list': [], 'diseases_list': ['Cardiovascular Diseases', 'Heart Diseases', 'Atherosclerosis', 'Coronary Disease', 'Hypertension', 'Cerebrovascular Disorders'], 'enrollment': '5307', 'inclusion_criteria': '- African American', 'exclusion_criteria': '', 'brief..."
4,NCT00005757,Racial Variation in ACE--Genetic and Physiologic Bases,"{'phase': 'N/A', 'drugs': '[]', 'drugs_list': [], 'diseases_list': ['Cardiovascular Diseases', 'Heart Diseases', 'Hypertension'], 'enrollment': '0', 'inclusion_criteria': '', 'exclusion_criteria': '', 'brief_summary': 'To determine whether differences in the activity of the renin-angiotensin and..."
...,...,...,...
104,NCT02531542,Evaluation of Rapid Emergency Echography for Acute Dyspnoea,"{'phase': 'N/A', 'drugs': '['Echocardiography according to the READ method.']', 'drugs_list': ['Echocardiography according to the READ method.'], 'diseases_list': ['Acute Dyspnea'], 'enrollment': '500', 'inclusion_criteria': 'Admission to the Emergency Department Age >= 75 years AND criteria of ..."
105,NCT02532699,Anti-hypertensive Effect of Mycelia of Antrodia Cinnamomea,"{'phase': 'N/A', 'drugs': '['AC mycelia', 'Placebo']', 'drugs_list': ['AC mycelia', 'Placebo'], 'diseases_list': ['Hypertension'], 'enrollment': '41', 'inclusion_criteria': '- Eligible subjects were untreated hypertensive men or women aged between 20 and 80 years old with SBP between 130 and 179..."
106,NCT02608255,A Prospective Pilot Study to Evaluate a New Marker of Ischemia in Chest Pain Triage,"{'phase': 'N/A', 'drugs': '['blood samples']', 'drugs_list': ['blood samples'], 'diseases_list': ['Acute Coronary Syndrome'], 'enrollment': '100', 'inclusion_criteria': '- Subject with compatible symptoms with an acute coronary syndrome, for at least 15 minutes and not older than 3 hours (eg dis..."
107,NCT02616991,Computed Tomography CT Venography During Postpartum Venous Thromboembolism,"{'phase': 'N/A', 'drugs': '['computed tomography venography']', 'drugs_list': ['computed tomography venography'], 'diseases_list': ['Pulmonary Thromboembolism'], 'enrollment': '130', 'inclusion_criteria': '- Women with clinically suspected pulmonary embolism (PE) during the first 6 weeks postpar..."


In [14]:
# Define output directory and file paths
out_dir = '../data/sigir2016/processed_cut_debug'
os.makedirs(out_dir, exist_ok=True)

# Save df_tsv_subset as TSV
df_tsv_subset.to_csv(os.path.join(out_dir, 'test.tsv'), sep='\t', index=False)

# Save df_queries_subset as JSONL
df_queries_subset.to_json(os.path.join(out_dir, 'queries.jsonl'), orient='records', lines=True, force_ascii=False)

# Save df_corpus_subset as JSONL
df_corpus_subset.to_json(os.path.join(out_dir, 'corpus.jsonl'), orient='records', lines=True, force_ascii=False)
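
After writing, a quick round-trip check can confirm that the three files load back and that every ID in `test.tsv` resolves to a query and a trial. This is a minimal sketch on toy frames written to a temporary directory; the frames and IDs are hypothetical, and only the file names and write options mirror the cell above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for df_tsv_subset, df_queries_subset, df_corpus_subset.
qrels = pd.DataFrame({"query-id": ["q1", "q1"], "corpus-id": ["T1", "T3"], "score": [1, 0]})
queries = pd.DataFrame({"_id": ["q1"], "text": ["toy patient note"]})
corpus = pd.DataFrame({
    "_id": ["T1", "T3"],
    "title": ["Trial one", "Trial three"],
    "metadata": [{"inclusion_criteria": "x"}, {"inclusion_criteria": "y"}],
})

with tempfile.TemporaryDirectory() as out_dir:
    # Write with the same options as the notebook cell.
    qrels.to_csv(os.path.join(out_dir, "test.tsv"), sep="\t", index=False)
    queries.to_json(os.path.join(out_dir, "queries.jsonl"),
                    orient="records", lines=True, force_ascii=False)
    corpus.to_json(os.path.join(out_dir, "corpus.jsonl"),
                   orient="records", lines=True, force_ascii=False)

    # Read everything back the way the notebook loads the originals.
    qrels_back = pd.read_csv(os.path.join(out_dir, "test.tsv"), sep="\t", header=0)
    queries_back = pd.read_json(os.path.join(out_dir, "queries.jsonl"), lines=True)
    corpus_back = pd.read_json(os.path.join(out_dir, "corpus.jsonl"), lines=True)

# Every judged pair should point at an existing query and trial.
missing_queries = set(qrels_back["query-id"]) - set(queries_back["_id"])
missing_trials = set(qrels_back["corpus-id"]) - set(corpus_back["_id"])
print(f"Unresolved query-ids: {len(missing_queries)}, unresolved corpus-ids: {len(missing_trials)}")
```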

In [15]:
# Assume: df_tsv_subset and df_corpus are loaded
# 1. Get all 1s and 2s
ones_twos = df_tsv_subset[df_tsv_subset['score'].isin([1, 2])]

# 2. Merge in metadata from df_corpus by corpus-id / _id
ones_twos_meta = ones_twos.merge(
    df_corpus[['_id', 'metadata']],
    left_on='corpus-id', right_on='_id', how='left'
)

# 3. Functions to check for blank or missing
def blank_or_missing(meta, field):
    try:
        val = meta.get(field, None) if isinstance(meta, dict) else None
        return not (pd.notnull(val) and str(val).strip() != '')
    except Exception:
        return True

# 4. Count how many have missing each field
ones_twos_meta['missing_inclusion'] = ones_twos_meta['metadata'].apply(lambda m: blank_or_missing(m, 'inclusion_criteria'))
ones_twos_meta['missing_exclusion'] = ones_twos_meta['metadata'].apply(lambda m: blank_or_missing(m, 'exclusion_criteria'))

num_missing_inclusion = ones_twos_meta['missing_inclusion'].sum()
num_missing_exclusion = ones_twos_meta['missing_exclusion'].sum()

print(f"Number of score 1 or 2 rows with MISSING inclusion_criteria: {num_missing_inclusion}")
print(f"Number of score 1 or 2 rows with MISSING exclusion_criteria: {num_missing_exclusion}")

# Optionally, count rows that are missing either field:
num_missing_either = ((ones_twos_meta['missing_inclusion']) | (ones_twos_meta['missing_exclusion'])).sum()
print(f"Number of score 1 or 2 rows with EITHER field missing: {num_missing_either}")

Number of score 1 or 2 rows with MISSING inclusion_criteria: 4
Number of score 1 or 2 rows with MISSING exclusion_criteria: 11
Number of score 1 or 2 rows with EITHER field missing: 11
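
The counts above do not show which positive rows are affected. A minimal sketch for listing them follows, on toy data: the `positives` frame, its IDs, and the helper `is_blank` are hypothetical stand-ins for `ones_twos_meta` and `blank_or_missing`, and the helper assumes criteria are strings.

```python
import pandas as pd


def is_blank(meta, field):
    """True if the field is absent, not a string, or only whitespace."""
    if not isinstance(meta, dict):
        return True
    val = meta.get(field)
    return not (isinstance(val, str) and val.strip() != "")


# Toy stand-in for ones_twos_meta (positives already merged with metadata).
positives = pd.DataFrame({
    "query-id": ["q1", "q1", "q2"],
    "corpus-id": ["T1", "T2", "T3"],
    "score": [1, 2, 1],
    "metadata": [
        {"inclusion_criteria": "- adults", "exclusion_criteria": "- pregnancy"},
        {"inclusion_criteria": "", "exclusion_criteria": ""},
        {"inclusion_criteria": "- adults"},
    ],
})
positives["missing_inclusion"] = positives["metadata"].apply(
    lambda m: is_blank(m, "inclusion_criteria")
)
positives["missing_exclusion"] = positives["metadata"].apply(
    lambda m: is_blank(m, "exclusion_criteria")
)

# Select the rows where at least one criteria field is blank or missing.
flagged = positives.loc[
    positives["missing_inclusion"] | positives["missing_exclusion"],
    ["query-id", "corpus-id", "score", "missing_inclusion", "missing_exclusion"],
]
print(flagged)
```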
