**What has been done**

I did not get as far as planned because the sensitivity labeling turned out to be more complicated than expected.

I downloaded the datasets and added the relevant links. Place the files in the data folder so that they match the path strings used in this notebook.

The test and train splits are loaded and concatenated into a single DataFrame.

Next, I extracted all terms in the C12 and C13 categories from the MeSH descriptor .xml file by exploiting its tree structure (see the get_keywords_from_xml() function). For each row in the df, I then check whether any C12 or C13 term appears in its mesh_terms. If so, the row is labeled sensitive (1), otherwise not sensitive (0).
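To make the tree-structure idea concrete, here is a minimal sketch of how terms could be collected by tree-number prefix. It is not the code in src/data_processing/xml.py: the function name `keywords_by_tree_prefix` is hypothetical, and the XML is a toy snippet that only mimics the `DescriptorRecord` / `DescriptorName/String` / `TreeNumber` layout of a MeSH descriptor file, with made-up records.

```python
import xml.etree.ElementTree as ET

# Toy XML mimicking the layout of a MeSH descriptor file.
# The records and tree numbers are invented for illustration.
TOY_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Toy Term A</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>C12.100.200</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Toy Term B</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>C13.351</TreeNumber>
      <TreeNumber>C12.050</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Toy Term C</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>D03.633</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def keywords_by_tree_prefix(root, prefixes=("C12", "C13")):
    """Hypothetical helper: map each prefix to the descriptor names under it."""
    found = {prefix: set() for prefix in prefixes}
    for record in root.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        if name is None:
            continue
        for node in record.iter("TreeNumber"):
            tree_number = (node.text or "").strip()
            for prefix in prefixes:
                # Match the category itself or anything below it ("C12.").
                if tree_number == prefix or tree_number.startswith(prefix + "."):
                    found[prefix].add(name)
    return found


root = ET.fromstring(TOY_XML)
terms = keywords_by_tree_prefix(root)
toy_c12_terms, toy_c13_terms = terms["C12"], terms["C13"]
print(sorted(toy_c12_terms))
print(sorted(toy_c13_terms))
```

A descriptor can carry several tree numbers, so the same term may land in both sets, as "Toy Term B" does here. For the full descriptor file, a streaming parser such as `ET.iterparse` might be preferable to loading the whole tree at once.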

**What needs to be done**

There are two things to verify across the whole project:
1) Intrinsic evaluation (next): train a Logistic Regression model and also use DistilBERT to predict sensitivity in the same way the authors did.
2) Extrinsic evaluation (later): combine the relevance label with the sensitivity label. I do not yet understand how this works. I am fairly certain we need some of the files from the original data source (see link in README), and I think the queries are needed as well.
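As a starting point for the intrinsic evaluation, a Logistic Regression baseline could look roughly like the sketch below. Everything in it is an assumption rather than the authors' setup: scikit-learn as the library, TF-IDF features, `class_weight="balanced"` to counter the label imbalance, balanced accuracy as the metric, and a tiny invented corpus standing in for the real text and sensitive_label columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus; replace with the notebook's text / sensitive_label columns.
texts = [
    "renal failure treated with dialysis",
    "pregnancy complication case report",
    "kidney transplant follow-up study",
    "urinary tract infection in adults",
    "hip fracture repair outcomes",
    "asthma inhaler dosing trial",
    "migraine prophylaxis comparison",
    "cataract surgery recovery times",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratify so both classes appear in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF features followed by a class-weighted Logistic Regression.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(balanced_accuracy_score(y_test, preds))
```

The score on a toy corpus this small is meaningless. The sketch only shows the shape of the pipeline, and the features and metric should be aligned with what the authors report.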

# Imports

In [13]:
# Python libraries
import pandas as pd

In [14]:
# Add src to the path so we can import our own functions from .py files
import sys
import os
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.insert(0, project_root)

# Custom functions
from src.data_processing.load import load_parquet_as_df
from src.data_processing.xml import get_keywords_from_xml
from src.data_processing.sensitivity_labeling import is_sensitive

# Load data

In [32]:
test_path = "../data/OHSUMED/test-00000-of-00001.parquet"
train_path = "../data/OHSUMED/train-00000-of-00001.parquet"

df_test = load_parquet_as_df(test_path)
df_train = load_parquet_as_df(train_path)

df = pd.concat([df_test, df_train], axis=0)

In [33]:
print(len(df))
df.columns

348564


Index(['seq_id', 'medline_ui', 'mesh_terms', 'title', 'publication_type',
       'abstract', 'author', 'source'],
      dtype='object')

In [39]:
# Set max column width to display long strings fully
pd.set_option('display.max_colwidth', None)

# Show the full string for index label 92. The concatenated frame keeps both
# original indices, so this label matches one row from each split.
df['mesh_terms'][92]

92    Abortion, Habitual/*CO/PC; Adult; Case Report; Chromosome Abnormalities/*CO/GE; Female; Fibrinogen/*BL/TU; Human; Karyotyping; Pedigree; Pregnancy; Support, Non-U.S. Gov't; Support, U.S. Gov't, P.H.S.; Translocation (Genetics).
92                                             Blood Pressure; Catheters, Indwelling/*ST; Hemodialysis/*ST; Human; Kidney Failure, Acute/*PP/TH; Kidney Failure, Chronic/*PP/TH; Polyurethanes; Quality Control; Support, Non-U.S. Gov't.
Name: mesh_terms, dtype: object
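The lookup returns two rows because pd.concat keeps each input frame's original index by default, so every label that exists in both splits occurs twice. A small sketch with toy frames (not the real data) shows the effect and one possible remedy, `ignore_index=True`:

```python
import pandas as pd

# Toy stand-ins for the test and train splits.
toy_test = pd.DataFrame({"mesh_terms": ["a0", "a1"]})
toy_train = pd.DataFrame({"mesh_terms": ["b0", "b1"]})

# Default concat keeps both original indices, so label 1 occurs twice.
dup = pd.concat([toy_test, toy_train], axis=0)
print(dup.index.is_unique)
print(dup["mesh_terms"].loc[1].tolist())

# ignore_index=True renumbers the rows 0..n-1, making every label unique.
uniq = pd.concat([toy_test, toy_train], axis=0, ignore_index=True)
print(uniq.index.is_unique)
print(uniq["mesh_terms"].loc[1])
```

Whether to renumber depends on whether the original per-split positions are needed later. If they are, a column recording the split may be a safer way to keep that information than a duplicated index.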

# Sensitivity labeling

In [34]:
MESH_XML_FILE = "../data/nlm/mesh/medit/ascii_xml/output/desc2022.xml"

c12_terms, c13_terms = get_keywords_from_xml(MESH_XML_FILE)

In [35]:
df['sensitive_label'] = df['mesh_terms'].apply(
    lambda x: is_sensitive(x, c12_terms, c13_terms)
)
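For readers without access to src, the check applied to each row could be sketched as follows. This is an assumption about the matching logic, not the actual is_sensitive implementation: `parse_mesh_terms` and `is_sensitive_sketch` are hypothetical names, the parsing assumes the "Heading/*QUALIFIER; Heading; ... ." format visible in the mesh_terms output above, and the two term sets are toy stand-ins for the real C12 and C13 sets.

```python
def parse_mesh_terms(mesh_terms):
    """Hypothetical helper: split a mesh_terms string into lowercase headings."""
    if not isinstance(mesh_terms, str):
        return set()  # treat missing values as having no headings
    text = mesh_terms.strip()
    if text.endswith("."):
        text = text[:-1]  # drop only the single terminating period
    headings = set()
    for entry in text.split(";"):
        # Keep the main heading and drop qualifiers such as "/*CO/PC".
        heading = entry.split("/")[0].strip().lower()
        if heading:
            headings.add(heading)
    return headings


def is_sensitive_sketch(mesh_terms, c12_terms, c13_terms):
    """Return 1 if any heading exactly matches a C12 or C13 term, else 0."""
    # For a large frame, this lowercase set should be built once, outside apply().
    sensitive = {t.lower() for t in c12_terms} | {t.lower() for t in c13_terms}
    return int(bool(parse_mesh_terms(mesh_terms) & sensitive))


# Toy term sets for illustration only.
toy_c12 = {"Kidney Failure, Chronic"}
toy_c13 = {"Abortion, Habitual"}

row_a = "Abortion, Habitual/*CO/PC; Adult; Case Report; Translocation (Genetics)."
row_b = "Adult; Human; Support, Non-U.S. Gov't."

print(is_sensitive_sketch(row_a, toy_c12, toy_c13))
print(is_sensitive_sketch(row_b, toy_c12, toy_c13))
```

Exact matching on heading names would miss any descriptor whose name differs between the vocabulary used to index the dataset and the 2022 descriptor file, so the share of sensitive rows may be worth sanity-checking against the figure the authors report.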

In [16]:
percentage_sensitive = 100 * df['sensitive_label'].mean()
print(f"{percentage_sensitive:.2f}% of the rows are sensitive")

7.72% of the rows are sensitive


In [43]:
save_csv = False

# Only save relevant columns to make file smaller
relevant_columns = ['title', 'abstract', 'sensitive_label']
df = df[relevant_columns]

if save_csv:
    save_path = "../data/OHSUMED/full_ohsumed_sensitivity_labeled.csv"
    df.to_csv(save_path, index=False)

# Misc

Helper code used only to produce screenshots.

In [None]:
# Index labels of all rows labeled as sensitive (1)
indices = df.index[df['sensitive_label'] == 1].tolist()

indices[0:5]

In [47]:
rows = df.iloc[23:25]  # positional slice: rows at positions 23 and 24

In [48]:
rows

Unnamed: 0,text,sensitive_label
23,"Onset and recovery of atracurium and suxamethonium-induced neuromuscular blockade with simultaneous train-of-four and single twitch stimulation. Single twitch and train-of-four stimulation were applied at 0.08 Hz to each ulnar nerve and the force of contraction of the adductor pollicis was recorded during onset of and recovery from neuromuscular blockade by suxamethonium 1 mg kg-1 or atracurium 0.4 mg kg-1. Times to 90% first twitch blockade of train-of-four were (mean +/- SEM) 0.82 +/- 0.08 and 1.98 +/- 0.18 min for suxamethonium and atracurium, respectively, compared with times to 90% single twitch blockade of 1.00 +/- 0.07 and 3.35 +/- 0.37 min, respectively (P less than 0.05 in both cases). Apparent onset time also depended on how long train-of-four stimulation had been applied before injection of atracurium. The mode of stimulation had little effect on time to 10% recovery. The results are consistent with stimulation-induced augmentation in muscle blood flow, which increased delivery of the drug to the neuromuscular junction.",0
24,"Atracurium, vecuronium and pancuronium in end-stage renal failure. Dose-response properties and interactions with azathioprine. Dose-response relations for atracurium, vecuronium and pancuronium were determined in patients in end-stage renal failure for the initial neuromuscular blockade (using three cumulative doses) and for the maintenance of stable 90% response (during continuous infusion). All measurements were during renal transplant surgery, and the interaction of azathioprine on neuromuscular blockade was estimated. Mean ED95 doses were (microgram kg-1): atracurium 375.6, vecuronium 67.2, pancuronium 86.6; the initial blockade required significantly larger doses than in normal patients (37%, 20% and 45%, respectively, using ED50 values). Mean infusion rates for 90% sustained blockade in renal failure were (microgram kg-1 h-1): atracurium 409.4, vecuronium 78.3, pancuronium 14.2. The atracurium dose was not influenced by renal function, whereas vecuronium and pancuronium requirements were significantly reduced by 23.2% and 61.5%, respectively, compared with normal patients (previous study). Azathioprine was injected at the rate of 1 mg kg-1 min-1 for 3 min at stable 90% neuromuscular blockade with constant-rate infusion of the neuromuscular blocking drug. This produced a relatively small and transient antagonism of blockade--probably of negligible clinical significance.",1


In [46]:
# Combine title and abstract into a single text column.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on the column-sliced df.
df = df.assign(text=df['title'] + " " + df['abstract'])

df = df[['text', 'sensitive_label']]
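One thing to watch when combining the two columns: in pandas, string concatenation with a missing value yields a missing value, so a row without an abstract would lose its title as well. Whether this dataset contains such rows is not checked here. The sketch below uses a toy frame to show the effect and one possible guard, `fillna("")`:

```python
import pandas as pd

# Toy frame; the second row has no abstract.
toy = pd.DataFrame(
    {"title": ["Title A", "Title B"], "abstract": ["Abstract A", None]}
)

# Plain concatenation propagates the missing value to the combined text.
naive = toy["title"] + " " + toy["abstract"]
print(naive.isna().tolist())

# Replacing missing values with empty strings keeps the title.
safe = (toy["title"].fillna("") + " " + toy["abstract"].fillna("")).str.strip()
print(safe.tolist())
```

A quick `df['abstract'].isna().sum()` on the real frame would show whether the guard is needed at all.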
