# Downloads Publication Information for PANGO Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook text-mines [PANGO lineage](https://cov-lineages.org/) mentions in the titles and abstracts of publications and preprints from the CORD-19 data set. Note that the text-mined results may contain false positives.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

References:

Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology (2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Lucy Lu Wang, et al., CORD-19: The COVID-19 Open Research Dataset (2020) [arXiv:2004.10706v4](https://arxiv.org/abs/2004.10706).

Author: Peter Rose (pwrose@ucsd.edu)

In [105]:
import os
import pandas as pd
import io
import dateutil.parser
import re
from pathlib import Path
import nltk
import json, requests
from urllib.request import urlopen
from xml.etree.ElementTree import parse
import urllib
import time

In [94]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = "/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import"  # alternatively: Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import


## Get PANGO lineages

In [4]:
pango = pd.read_csv(NEO4J_IMPORT + "/00b-PANGOLineage.csv", dtype=str)

In [5]:
pango.sample(5)

Unnamed: 0,lineage,description,alias,predecessor,l0,l1,l2,l3,levels
631,BA.1.17.2,"Alias of B.1.1.529.1.17.2, lineage from pango-...",B.1.1.529.1.17.2,B.1.1.529.1.17,BA.1.17.2,BA.1.17,BA.1,BA,4
250,B.1.1.153,Northern European Lineage,,,B.1.1.153,B.1.1,B.1,B,4
807,AA.6,"Alias of B.1.177.15.6, Welsh Lineage",B.1.177.15.6,B.1.177.15,AA.6,AA,,,2
1258,B.1.564.1,Canada lineage,,,B.1.564.1,B.1.564,B.1,B,4
1202,B.1.505,Israel and england (was B.1.3.4),,,B.1.505,B.1,B,,3


In [6]:
lineages = pango['lineage'].unique()

In [7]:
len(lineages)

1668

In [8]:
# get max number of dots in lineage

In [9]:
import numpy as np
f = lambda x: x.count('.')
f = np.vectorize(f)

In [10]:
max(f(lineages))

3

## Get CORD-19 Metadata

In [11]:
CACHE = Path(NEO4J_IMPORT +'/cache/cord19/2022-03-31/metadata.csv')

In [12]:
metadata = pd.read_csv(CACHE, dtype='str')

In [13]:
metadata.fillna('', inplace=True)
# derive year and parsed date columns from publish_time
metadata['year'] = metadata['publish_time'].apply(lambda d: d[:4] if len(d) >= 4 else '')
metadata['date'] = metadata['publish_time'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')



In [14]:
print("Total number of papers", metadata.shape[0])

Total number of papers 992921


## Extract a list of PANGO lineages

Replace special characters with spaces to simplify parsing of lineages that appear in parentheses, comma-separated lists, etc.

In [15]:
metadata['title'] = metadata['title'].replace('[()/,]', ' ', regex=True)
metadata['abstract'] = metadata['abstract'].replace('[()/,]', ' ', regex=True)
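
To make the effect of this substitution concrete, here is a minimal sketch that applies the same character class to a single string with the standard `re` module instead of pandas. The sample sentence is made up for illustration and does not come from CORD-19.

```python
import re

# Hypothetical sample text, not taken from CORD-19.
sample = "Variants of concern (B.1.1.7, B.1.351/P.1) were detected."

# Same character class as above: parentheses, slashes, and commas become spaces.
cleaned = re.sub(r"[()/,]", " ", sample)
print(cleaned)

# After the substitution, every lineage name is a whitespace-delimited token.
print(cleaned.split())
```

Note that a lineage directly followed by a period at the end of a sentence would keep the period attached, since `.` is not in the character class.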

Match PANGO patterns and check against the list of known lineages.

In [45]:
# raw strings avoid invalid-escape warnings for \d; one pattern per lineage depth (1 to 3 dots)
pattern1 = re.compile(r' [A-Z]{1,2}[.]\d+ ')
pattern2 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+ ')
pattern3 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+[.]\d+ ')

In [46]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    
    
    
    for l in lin:
        l = l.strip()
        # check if lineage is valid (e.g., not a withdrawn lineage or false positive)
        if l in lineages:
            u_lin.add(l)
            
    return ";".join(u_lin)
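
One caveat with space-delimited patterns: each match consumes its trailing space, so when two lineages with the same number of levels are separated by a single space (for example after `P.1/P.2` becomes `P.1 P.2`), only the first is found. A possible alternative, sketched below, uses lookarounds so that no whitespace is consumed. The names `KNOWN`, `PATTERN`, and `find_lineages`, as well as the sample text, are hypothetical and not part of this notebook.

```python
import re

# Hypothetical stand-in for the `lineages` array loaded from the PANGO CSV.
KNOWN = {"B.1.1.7", "B.1.351", "P.1"}

# Space-delimited pattern in the style used above: the trailing space is consumed.
old = re.compile(r" [A-Z]{1,2}[.]\d+ ")
print(old.findall(" P.1 P.2 "))  # the second candidate is missed

# One pattern for one to three dotted numeric levels. The lookarounds require
# whitespace (or a string boundary) on both sides without consuming it.
PATTERN = re.compile(r"(?<!\S)[A-Z]{1,2}(?:\.\d+){1,3}(?!\S)")
print(PATTERN.findall("P.1 P.2"))  # both candidates are found

def find_lineages(text):
    """Return the sorted lineage names in text that are in KNOWN."""
    candidates = PATTERN.findall(text)
    return sorted({c for c in candidates if c in KNOWN})

text = "Lineages B.1.1.7 B.1.351 and P.1 co-circulated; X.9 is not a known lineage"
print(find_lineages(text))
```

Validation against the known-lineage set still does the real filtering; the pattern only proposes candidates.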

### Sample subset

In [23]:
data = metadata.sample(30000)

In [24]:
data['lineages'] = data.apply(get_lineages, axis=1)

In [25]:
# keep only rows with at least one lineage match in the title or abstract
ln = data[data['lineages'].str.len() > 0].copy()

In [30]:
ln.iloc[3].lineages

'B.1.1.7'

In [31]:
ln.iloc[3].abstract

'We report three cases of SARS-CoV-2 lineage B.1.1.7 infection in Malayan tigers at the Virginia Zoo. All three animals exhibited respiratory signs. These findings show the mutations in the B.1.1.7 lineage did not affect the susceptibility of tigers to SARS-CoV-2.'

### Run on whole dataset

In [32]:
metadata['lineages'] = metadata.apply(get_lineages, axis=1)

Keep only papers that map to PANGO lineages

In [33]:
hits = metadata[metadata['lineages'].str.len() > 0].copy()

### Assign CURIEs from [Identifiers.org](https://identifiers.org)

In [34]:
hits['doi'] = hits['doi'].apply(lambda x: 'doi:' + x if len(x) > 0 else '')
hits['pubmed_id'] = hits['pubmed_id'].apply(lambda x: 'pubmed:' + x if len(x) > 0 else '')
hits['pmcid'] = hits['pmcid'].apply(lambda x: 'pmc:' + x if len(x) > 0 else '')
hits['arxiv_id'] = hits['arxiv_id'].apply(lambda x: 'arxiv:' + x if len(x) > 0 else '')

In [35]:
#hits.sort_values(by=['publish_time'], ascending=False, inplace=True)

In [36]:
print("Number of matches", hits.shape[0])

Number of matches 4419


In [37]:
def create_id(row):
    """Creates a unique id using the most commonly available id in priority order"""
    if row.doi != '':
        return row.doi
    elif row.pubmed_id != '':
        return row.pubmed_id
    elif row.pmcid != '':
        return row.pmcid
    elif row.arxiv_id != '':
        return row.arxiv_id
    elif row.url != '':
        return row.url
    else:
        # TODO deal with WHO papers here?
        return ''

In [38]:
hits['id'] = hits.apply(create_id, axis=1)
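
The priority order can be illustrated on a single record. This is a minimal sketch using a plain dictionary rather than a DataFrame row; the record values and the helper name `first_available_id` are hypothetical.

```python
# Hypothetical record mimicking one row of `hits` after the CURIE prefixes are added.
record = {
    "doi": "",
    "pubmed_id": "pubmed:12345678",
    "pmcid": "pmc:PMC1234567",
    "arxiv_id": "",
    "url": "https://example.org/paper",
}

# Same priority order as create_id.
PRIORITY = ("doi", "pubmed_id", "pmcid", "arxiv_id", "url")

def first_available_id(rec):
    """Return the first non-empty identifier in priority order, else ''."""
    return next((rec[key] for key in PRIORITY if rec.get(key)), "")

print(first_available_id(record))
```

Because the DOI is empty here, the PubMed identifier is chosen; a record with no identifiers at all yields an empty string and would be dropped by the filter that follows.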

WHO documents seem to be copies of articles that are already present in the dataset and will be ignored for now.

In [40]:
hits.query('id != ""', inplace=True)

In [41]:
print("Total number of matches", hits.shape[0])

Total number of matches 3200


In [44]:
hits.to_csv(NEO4J_IMPORT + "/01h-CORDLineages.csv", index=False)

## Full-text Regex
1. Open question: how should the body paragraph text be saved? As a dataframe with id and content, imported into Neo4j?

In [62]:
# get the full-text article ids for a specific lineage from Europe PMC

def get_ids(lineage):
    response = requests.get(f'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=(%22{lineage}%22%20AND%20(%22SARS-CoV-2%22%20OR%20%22COVID-19%22)%20AND%20(%22lineage%22%20OR%20%22lineages%22%20OR%20%22strain%22%20OR%20%22strains%22%20OR%20%22variant%22%20OR%20%22variants%22))%20AND%20(FIRST_PDATE:%5b2020-01-01%20)%20AND%20HAS_FT:y%20AND%20%20sort_date:y&resultType=idlist&pageSize=1000&format=json&cursorMark=*')
    results = response.json()['resultList']['result']
    # keep the first full-text id of each search result
    ids = [x['fullTextIdList']['fullTextId'][0] for x in results]
    return ids
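
The pre-encoded query string is hard to read and edit. As a rough sketch, the search expression could be written in plain text and percent-encoded with `urllib.parse.quote`. The helper names `build_query` and `build_url` and the term list are illustrative assumptions, and the date and full-text filters of the original query are left out for brevity, so this is not a drop-in replacement.

```python
from urllib.parse import quote

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_query(lineage):
    """Assemble a readable search expression for one lineage (illustrative only)."""
    terms = ["lineage", "lineages", "strain", "strains", "variant", "variants"]
    context = " OR ".join(f'"{t}"' for t in terms)
    return f'"{lineage}" AND ("SARS-CoV-2" OR "COVID-19") AND ({context})'

def build_url(lineage, page_size=1000):
    """Percent-encode the expression and append the paging and format parameters."""
    query = quote(build_query(lineage), safe="()")
    return f"{BASE}?query={query}&resultType=idlist&pageSize={page_size}&format=json&cursorMark=*"

print(build_query("B.1.1.7"))
print(build_url("B.1.1.7"))
```

Keeping the readable expression separate from the encoding step makes it easier to spot duplicated or missing search terms.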

In [63]:
# download articles in XML and return body paragraph
def download_article(article_id):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{article_id}/fullTextXML'
    xmldoc = parse(urlopen(url))
    
    # get full text
    root = xmldoc.getroot()
    text = root.findall('.//p')

    # put body paragraphs together, one paragraph per block
    ptext = ""
    for p in text:
        ptext += ''.join(p.itertext()) + '.\n\n'
    return ptext
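
The paragraph extraction can be tried offline on a small XML string. The document below is a made-up, heavily simplified stand-in for a full-text response, used only to show how `findall('.//p')` and `itertext()` behave.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical, heavily simplified stand-in for a full-text XML response.
xml = """<article><body>
<sec><p>The <italic>B.1.1.7</italic> lineage spread rapidly</p></sec>
<sec><p>It was first detected in 2020</p></sec>
</body></article>"""

root = fromstring(xml)

# './/p' finds <p> elements at any depth below the root;
# itertext() also yields text nested inside child tags such as <italic>.
paragraphs = ["".join(p.itertext()) for p in root.findall(".//p")]
print(paragraphs)
```

Using `itertext()` rather than `p.text` matters here: `p.text` alone would stop at the first child element and lose the italicized lineage name.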

In [64]:
# get lineage for full texts
def get_full_lineage(ptext):
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    record = []
    for s in p_sentence:
        s1 = re.sub('[()/,]', ' ', s)  # replace special chars with spaces
        lin = pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1)

        if lin: # if find lineages record sentence
            record.append([lin, s])
    return record
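
Unlike `get_lineages`, this function records raw regex matches: they keep their surrounding spaces and are not checked against the known lineages. A minimal sketch of a possible post-processing step is shown below; `clean_matches` and the sample values are hypothetical.

```python
# Hypothetical stand-in for the known-lineage list.
known = {"B.1.1.7", "P.1"}

def clean_matches(raw_matches, known_lineages):
    """Strip whitespace, drop unknown names, and de-duplicate while preserving order."""
    seen = []
    for m in raw_matches:
        name = m.strip()
        if name in known_lineages and name not in seen:
            seen.append(name)
    return seen

# Raw matches as the space-delimited patterns would return them.
raw = [" B.1.1.7 ", " P.1 ", " B.1.1.7 ", " X.99 "]
print(clean_matches(raw, known))
```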

#### Test on B.1.1.7


In [115]:
lineage = 'B.1.1.7'
ids = get_ids(lineage)

full_regrex = []
for i in ids:
    try:
        body_text = download_article(i)
        record = get_full_lineage(body_text)
        # attach article id to each lineage record
        for x in record:
            x.append(i)
        full_regrex.append(pd.DataFrame(record))
    except urllib.error.HTTPError:
        print('Something went wrong.')
        time.sleep(10)  # pause 10 seconds, then skip to the next article
        continue

fulltext_lineage = pd.concat(full_regrex)

Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
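
The loop above moves on to the next article after a failed request. If a failed download should be retried instead, one option is a small helper like the following sketch. `fetch_with_retry` and its parameters are hypothetical, and retrying only helps with transient errors; an article that has no full-text XML would fail on every attempt.

```python
import time
import urllib.error

def fetch_with_retry(fetch, article_id, retries=3, delay=10, sleep=time.sleep):
    """Call fetch(article_id), retrying after HTTP errors; return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(article_id)
        except urllib.error.HTTPError as exc:
            print(f"{article_id}: HTTP {exc.code} (attempt {attempt}/{retries})")
            if attempt < retries:
                sleep(delay)
    return None

# Demonstration with a fake fetch function that fails twice, then succeeds.
calls = {"n": 0}

def flaky(article_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("http://example.org", 503, "Service Unavailable", None, None)
    return "body text"

# The sleep function is replaced with a no-op so the demonstration runs instantly.
result = fetch_with_retry(flaky, "PMC0000000", sleep=lambda seconds: None)
print(result)
```

In the notebook, `download_article` could be passed as `fetch`, and a `None` result would mark an article to skip.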


In [121]:
fulltext_lineage.to_csv('B_1_1_7.csv')

#### Test on P.1

In [122]:
lineage = 'P.1'
ids = get_ids(lineage)

full_regrex = []
for i in ids:
    try:
        body_text = download_article(i)
        record = get_full_lineage(body_text)
        # attach article id to each lineage record
        for x in record:
            x.append(i)
        full_regrex.append(pd.DataFrame(record))
    except urllib.error.HTTPError:
        print('Something went wrong.')
        time.sleep(10)  # pause 10 seconds, then skip to the next article
        continue

Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.
Something went wrong.


In [123]:
fulltext_lineage_p1 = pd.concat(full_regrex)

In [125]:
fulltext_lineage_p1.to_csv('P_1.csv')