# PubMed ID comparison
We compare the union and intersection of PMIDs cited by the Molecular Oncology Almanac (MOAlmanac), CIViC, and OncoKB. Not every source in MOAlmanac is associated with a PMID, because some references cite FDA approvals and guidelines directly.

To avoid redistributing the databases themselves, the raw OncoKB and CIViC files are not hosted in the GitHub repository, but I have committed a dataframe, `pmids.txt`. `pmid-comparison.png` is then produced with `pmid-comparison.R`.

In [59]:
import glob
import pandas as pd
import pyupset
import requests

%matplotlib inline

## Get CIViC citations
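Load the cached file if it exists; otherwise fetch the sources through the CIViC API and keep those whose source type is PubMed.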
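
The load-or-fetch pattern used in the next cell can be written as a small reusable helper. This is only a sketch: `load_or_build` and its `build` callback are hypothetical names, not part of the notebook, and the helper writes the cache file on first use, which the notebook itself leaves commented out.

```python
from pathlib import Path

import pandas as pd


def load_or_build(path, build):
    """Return the table cached at `path`, or call `build()` and cache its result.

    `build` is a hypothetical zero-argument callable that returns a DataFrame,
    for example a function wrapping the API request.
    """
    path = Path(path)
    if path.exists():
        # Read identifiers as text so cached and freshly built tables match
        return pd.read_csv(path, sep='\t', dtype=str)
    frame = build()
    frame.to_csv(path, sep='\t', index=False)
    return frame
```

Reading with `dtype=str` is a design choice for this sketch: it keeps identifiers such as `citation_id` as text regardless of whether they came from the cache or from a fresh request.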

In [38]:
handle_name = 'pmid.civic.2020-08-04.txt'
handle = glob.glob(handle_name)

if handle:
    civic_records = pd.read_csv(handle[0], sep='\t')
else:
    request = 'https://civicdb.org/api/sources?count=5000'
    r = requests.get(request)
    civic_records = r.json()['records']
    records = []
    for record in civic_records:
        series = pd.Series(record).loc[['name', 'citation', 'citation_id', 'source_type', 'source_url']]
        records.append(series.to_frame())
    dataframe = pd.concat(records, axis=1, ignore_index=True).T
    civic_records = dataframe[dataframe['source_type'].eq('PubMed')]
    # civic_records.to_csv(handle_name, sep='\t', index=False)
civic_pmids = civic_records['citation_id'].drop_duplicates().sort_values().tolist()
print(len(civic_pmids))

2677


## MOAlmanac citations

In [44]:
handles = glob.glob('/Users/brendan/Github/moalmanac-admin/database/almanac/*')
almanac = []

columns = ['pmid', 'dtype']  # keep the per-file label assigned in the loop below

for handle in handles:
    dtype = handle.split('/')[-1].split('.tsv')[0]
    tmp = pd.read_csv(handle, sep='\t')
    tmp['dtype'] = dtype
    almanac.append(tmp.loc[:, columns])
almanac = pd.concat(almanac, ignore_index=True)
almanac_pmids = almanac['pmid'].dropna().astype(int).drop_duplicates().sort_values().tolist()
print(len(almanac_pmids))

146


## OncoKB citations
PMIDs for OncoKB relationships were obtained by listing the [300 biomarker-drug associations](https://www.oncokb.org/actionableGenes#levels=1,2,3,4,R1,R2) (65 genes, 52 tumor types, 6 levels of evidence: levels 1, 2, 3, 4, R1, and R2), downloading the associations, and annotating them with PMIDs through manual review.
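
Because each association can carry several comma-separated PMIDs, the column has to be flattened before deduplication. The sketch below shows one way to do that with `Series.explode`; the `toy` table and its PMIDs are placeholders for illustration, not values from the annotated file.

```python
import pandas as pd

# Toy stand-in for the annotated table; these PMIDs are placeholders
toy = pd.DataFrame({'PMIDs': ['111, 222', None, '222,333', '']})

pmids = (
    toy['PMIDs']
    .fillna('')        # rows without annotations become empty strings
    .str.split(',')    # one list of PMIDs per association
    .explode()         # one row per PMID
    .str.strip()       # drop whitespace around each PMID
)
# Discard empty entries, then deduplicate and sort (lexically, as strings)
pmids = pmids[pmids.ne('')].drop_duplicates().sort_values().tolist()
print(pmids)
```

The values stay as strings here, so the sort is lexical rather than numeric; convert with `astype(int)` first if numeric order matters.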

In [39]:
oncokb_records = pd.read_csv('oncokb_biomarker_drug_associations.annotated.tsv', sep='\t')
oncokb_pmids = []
for list_ in oncokb_records['PMIDs'].fillna('').str.replace(' ', '').str.split(',').tolist():
    oncokb_pmids.extend(list_)
oncokb_pmids = pd.Series(oncokb_pmids)
oncokb_pmids = oncokb_pmids[~oncokb_pmids.eq('')].sort_values().drop_duplicates().tolist()
print(len(oncokb_pmids))

440


## Combine

In [110]:
# Use .str.replace to strip spaces inside each value; Series.replace only matches whole values
idx_civic = pd.Index(pd.Series(civic_pmids).astype(str).str.replace(' ', '').astype(int).tolist())
idx_moalmanac = pd.Index(pd.Series(almanac_pmids).astype(str).str.replace(' ', '').astype(int).tolist())
idx_oncokb = pd.Index(pd.Series(oncokb_pmids).astype(str).str.replace(' ', '').astype(int).tolist())
idx_all = idx_civic.union(idx_moalmanac).union(idx_oncokb)

df = pd.DataFrame(0, columns=['CIViC', 'MOAlmanac', 'OncoKB'], index=idx_all)
df.loc[idx_civic, 'CIViC'] = 1
df.loc[idx_moalmanac, 'MOAlmanac'] = 1
df.loc[idx_oncokb, 'OncoKB'] = 1

for column in df.columns:
    print(column, df[column].eq(1).value_counts()[True])

df.to_csv('pmid-comparison.txt', sep='\t', index_label='pmid')

CIViC 2677
MOAlmanac 146
OncoKB 440
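
The per-source totals above do not yet show how the sources overlap. As a rough sketch, the overlaps can be tabulated from a 0/1 membership table shaped like `df`; the `toy` table and its PMIDs below are placeholders, not results from this comparison.

```python
import pandas as pd

# Toy membership table in the same 0/1 layout as `df`; the PMIDs are placeholders
toy = pd.DataFrame(
    {'CIViC': [1, 1, 0, 1], 'MOAlmanac': [1, 0, 1, 1], 'OncoKB': [0, 0, 1, 1]},
    index=pd.Index([101, 102, 103, 104], name='pmid'),
)

# Number of PMIDs in each exact combination of sources
combo_counts = toy.groupby(list(toy.columns)).size()

# PMIDs cited by all three sources
shared_by_all = toy.index[toy.eq(1).all(axis=1)].tolist()

# Pairwise overlap matrix: diagonal entries are per-source totals,
# off-diagonal entries are the sizes of the pairwise intersections
pairwise = toy.T.dot(toy)

print(combo_counts)
print(shared_by_all)
print(pairwise)
```

Running the same three expressions on `df` would give the corresponding counts for the real data.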
