Mike Honey: This is a copy of [Daniel's Notebook](https://www.kaggle.com/danielwolffram/cord-19-match-clinical-trials). I split the data prep notebook into several steps to avoid running out of memory.

Here we combine the information about clinical trials gathered in [this great dataset](https://www.kaggle.com/panahi/covid-19-international-clinical-trials) with the CORD-19 dataset by matching trial ids found in the title, abstract or the full text of the papers.

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 200)

# Load Data

We use the dataset that I've created in [another notebook](https://www.kaggle.com/mikehoney/cord-19-create-dataframe-step-4-language). 

In [None]:
df = pd.read_csv('/kaggle/input/cord-19-create-dataframe-step-4-language/cord19_df.csv')

In [None]:
ct_df = pd.read_csv('/kaggle/input/covid-19-international-clinical-trials/data/ClinicalTrials.gov_COVID_19.csv')

ct_df.shape, ct_df.columns

In [None]:
ict_df = pd.read_csv('/kaggle/input/covid-19-international-clinical-trials/data/ICTRP_COVID_19.csv')

ict_df.shape, ict_df.columns

In [None]:
ict_df.head(3)

# Combine both dataframes

Let's pick several columns that seem interesting and appear in both files...

In [None]:
ict = ict_df[['TrialID', 'web address', 'Study type', 'Study design', 'Intervention', 'Primary outcome']]
ct = ct_df[['NCT Number', 'URL', 'Study Type', 'Study Designs', 'Interventions', 'Outcome Measures']]

... and rename them...

In [None]:
ict.columns = ['id', 'url', 'study_type', 'study_design', 'intervention', 'outcome']
ct.columns = ['id', 'url', 'study_type', 'study_design', 'intervention', 'outcome']

... so we can easily combine them:

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_trials = pd.concat([ict, ct], ignore_index=True)

In [None]:
all_trials.head()

# Drop Duplicates

In [None]:
# all_trials[all_trials.id.duplicated(keep=False)].sort_values('id').head()

Some trial ids appear in both data sets, so we drop the duplicates and keep the last occurrence of each id.

In [None]:
all_trials.drop_duplicates(subset='id', keep='last', inplace=True)
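
Which row survives depends on the order in which the two frames were combined. The sketch below uses made-up rows (the ids and urls are placeholders, not real registry entries) and `pd.concat` for the combine step to show that, with the ICTRP rows first and the ClinicalTrials.gov rows last, `keep='last'` retains the ClinicalTrials.gov version of a shared id.

```python
import pandas as pd

# Made-up rows for illustration only; the ids and urls are placeholders.
ict = pd.DataFrame({'id': ['NCT00000001', 'ChiCTR0000000001'],
                    'url': ['ictrp-url-1', 'ictrp-url-2']})
ct = pd.DataFrame({'id': ['NCT00000001'],
                   'url': ['ctgov-url-1']})

# ICTRP rows first, ClinicalTrials.gov rows last, as in the notebook.
combined = pd.concat([ict, ct], ignore_index=True)

# keep='last' keeps the row that was added last for each duplicated id.
deduped = combined.drop_duplicates(subset='id', keep='last')
print(deduped)
```

Swapping the order of the frames, or using `keep='first'`, would keep the ICTRP version instead.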

In [None]:
all_trials.shape

# Search Trial ID in Papers

We need to come up with some regular expression to search for trial ids from different sources.

In [None]:
all_trials.id.str[:6].value_counts()

In [None]:
all_trials.id

In [None]:
all_trials.id[all_trials.id.str.startswith('EUCTR')] # use this to see the patterns

These should work:

In [None]:
reg_nct = 'NCT[0-9]{8}'
reg_chi = 'ChiCTR[0-9]{10}'
reg_eu = 'EUCTR[0-9]{4}-[0-9]{6}-[0-9]{2}-[A-Z]{2}'
reg_ir = 'IRCT[0-9]+N[0-9]{1,2}'
reg_isrctn = 'ISRCTN[0-9]{8}'
reg_jprn = 'JPRN-[0-9a-zA-Z]+'
reg_tctr = 'TCTR[0-9]{11}'
reg_actrn = 'ACTRN[0-9]{14}'
reg_drks = 'DRKS[0-9]{8}'

registries = [reg_nct, reg_chi, reg_eu, reg_ir, reg_isrctn, reg_jprn, reg_tctr, reg_actrn, reg_drks]

reg = ('|').join(registries)
reg = r'({})'.format(reg)

reg

Sanity checks

In [None]:
pd.Series(['The trial has been registered in Chinese Clinical Trial Registry (ChiCTR2000029981).']).str.extract(reg)

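If we don't lose any of the trial ids, the number of trials and the number of ids matched by the pattern should be equal:

In [None]:
# str.extract returns one row per input row, so count the non-null matches instead of the rows
len(all_trials), all_trials.id.str.extract(reg, expand=False).notnull().sum()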
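
To see which ids, if any, the pattern misses, one option is to list the rows where the extraction returns nothing. This is a minimal sketch on a toy series: the ids are examples chosen to fit the patterns, and `XYZ123` is a made-up id standing in for a registry format the pattern does not cover.

```python
import pandas as pd

# Same registry patterns as above, repeated so the example is self-contained.
registries = ['NCT[0-9]{8}', 'ChiCTR[0-9]{10}',
              'EUCTR[0-9]{4}-[0-9]{6}-[0-9]{2}-[A-Z]{2}',
              'IRCT[0-9]+N[0-9]{1,2}', 'ISRCTN[0-9]{8}',
              'JPRN-[0-9a-zA-Z]+', 'TCTR[0-9]{11}',
              'ACTRN[0-9]{14}', 'DRKS[0-9]{8}']
reg = r'({})'.format('|'.join(registries))

# Toy ids; 'XYZ123' is hypothetical and matches no pattern.
ids = pd.Series(['NCT04280705', 'ChiCTR2000029981',
                 'EUCTR2020-000890-25-FR', 'XYZ123'])

# expand=False returns a Series; rows without a match are NaN.
matched = ids.str.extract(reg, expand=False)
unmatched = ids[matched.isnull()]
print(unmatched)
```

An empty result on the real `all_trials.id` column would confirm that every registered id is covered.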

Now we extract the first trial id that we can find in the title, abstract or text body of each paper (`str.extract` returns only the first match per row).

In [None]:
trials = (df.title.fillna('') + ' ' + df.abstract.fillna('') + ' ' + df.body_text.fillna('')).str.extract(reg)

In [None]:
df['trial_id'] = trials
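
A paper can mention more than one trial, and `str.extract` only returns the first match in each string. If every mention is needed, `str.extractall` is one way to collect them. The sketch below uses a shortened pattern and made-up sentences, so treat it as an illustration rather than a drop-in replacement for the cell above.

```python
import pandas as pd

# Shortened pattern for illustration; the notebook's full pattern works the same way.
reg = r'(NCT[0-9]{8}|ChiCTR[0-9]{10})'

# Made-up paper texts.
texts = pd.Series([
    'Registered as NCT04280705 and ChiCTR2000029981.',
    'No registration mentioned.',
    'See NCT04252664.',
])

# First match per row only (NaN where nothing matches).
first_only = texts.str.extract(reg, expand=False)

# Every match: one row per match, indexed by (original row, match number).
all_matches = texts.str.extractall(reg)[0]

# Collapse back to one list of ids per paper; papers without a match drop out.
ids_per_paper = all_matches.groupby(level=0).apply(list)
print(ids_per_paper)
```

Keeping one list per paper would change the later merge, which expects a single `trial_id` per row, so the lists would have to be exploded into one row per paper and id first.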

Total number of trial ids we found:

In [None]:
trials.notnull().sum()

In the papers that are marked as Covid-19 papers we found 46 trial ids. (We will see later that not all of them are found in the registered trials.) The cell below now counts the papers with a trial id regardless of that flag.

In [None]:
# df[df.trial_id.notnull() & df.is_covid19].shape
# Mike Honey edit - keep non-COVID-19 papers
df[df.trial_id.notnull()].shape

In [None]:
# final = pd.merge(left=df, right=all_trials, left_on='trial_id', right_on='id', how='left', suffixes=(None, '_trial'))

We now want to match the trial ids we found in the papers with the registered trials.

In [None]:
final = pd.merge(left=df[['paper_id', 'trial_id']], right=all_trials, left_on='trial_id', right_on='id', how='inner').drop(columns=['id'])
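
The inner merge silently drops trial ids that were found in a paper but are missing from the registered trials. To inspect those, one could run a left merge with `indicator=True` and look at the `left_only` rows. The frames below are toy stand-ins with made-up ids (`papers` and `registry` are hypothetical names, not variables from this notebook).

```python
import pandas as pd

# Toy stand-ins for df[['paper_id', 'trial_id']] and all_trials; ids are made up.
papers = pd.DataFrame({'paper_id': ['p1', 'p2', 'p3'],
                       'trial_id': ['NCT04280705', 'NCT00000000', None]})
registry = pd.DataFrame({'id': ['NCT04280705'],
                         'study_type': ['Interventional']})

# Only papers in which a trial id was found are of interest here.
found = papers.dropna(subset=['trial_id'])

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
check = found.merge(registry, left_on='trial_id', right_on='id',
                    how='left', indicator=True)

# Ids mentioned in a paper but absent from the registry data.
unmatched = check.loc[check['_merge'] == 'left_only', ['paper_id', 'trial_id']]
print(unmatched)
```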

In [None]:
final.tail()

In [None]:
final.shape

In [None]:
final.trial_id.nunique()

Well, we found 37 registered (Covid-19) trials that appear in CORD-19, and some of them appear multiple times. Keep in mind, though, that not all papers in CORD-19 deal specifically with Covid-19; only about 2000 of them do. (We don't consider other registered trials that might appear in non-Covid-19 papers.)

In [None]:
df.is_covid19.sum()

Let's export our dataframe with paper_id + trial info.

In [None]:
final.to_csv('trial_info.csv', index=False)