# Downloads Publication Information for PANGO Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook text-mines [PANGO lineage](https://cov-lineages.org/) mentions in the titles and abstracts of publications and preprints from the CORD-19 data set. Note, the text-mined results may contain false positive!

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

References:

Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology(2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Lucy Lu Wang, et al., CORD-19: The COVID-19 Open Research Dataset (2020) [arXiv:2004.10706v4](https://arxiv.org/abs/2004.10706).

Author: Peter Rose (pwrose@ucsd.edu)

In [171]:
import os
import pandas as pd
import io
import dateutil
import re
from pathlib import Path
import nltk
import json, requests
from urllib.request import urlopen
from xml.etree.ElementTree import parse
import urllib
import time
import numpy as np

In [186]:
pattern1 = re.compile(' [A-Z]{1,2}[.]\d+ ', re.IGNORECASE)
pattern2 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+ ', re.IGNORECASE)
pattern3 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+[.]+\d+ ', re.IGNORECASE)

# add WHO lineage
who_lineage = [' Alpha ', ' Beta ', ' Gamma ', ' Epsilon ',' Zeta ', ' Eta ', ' Theta  ',\
               ' Iota ', ' Kappa ', ' Lambda ', ' Mu ']
pattern4 = re.compile("|".join(who_lineage), re.IGNORECASE)

In [284]:
gg = pd.read_csv('lineages')

In [285]:
lineages = gg.iloc[:,0].to_list()

In [188]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    
    
    for l in lin:
        l = l.strip()
        # check if lineage is valid (e.g., not a withdrawn lineage or false positive)
        if l in lineages:
            u_lin.add(l)
            
    return ";".join(u_lin)

In [239]:
# download articles in XML and return body paragraph
def download_article(article_id):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{article_id}/fullTextXML'
    xmldoc = parse(urlopen(url))
    
    # get full text
    root = xmldoc.getroot()
    text = root.findall('.//p')

    # put body paragraphs together
    ptext = ""
    for p in text:
        ptext += ''.join([x for x in p.itertext()]) + '.\n' + '\n'
    return ptext

In [89]:
# get lineage for full texts
def get_full_lineage(ptext):
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    pair = []
    for s in p_sentence:
        s1 = re.subn('[()/,]', ' ', s)[0] # remove special chars
        lin = set(pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1) + pattern4.findall(s1))

        if lin: 
            for l in lin:
                # valid lineage and not recorded
                l = l.strip()
                l = l.capitalize()
                if (l in lineages): 
                    pair.append([l, s])
                else: continue
    return pair

In [237]:
def pub_mentions_lin(article_id, real_id):
    body_text = download_article(article_id) # get body text
    record = get_full_lineage(body_text) # extract lineages in text
    [x.append(real_id) for x in record] # attach article id to lineage record
    df = pd.DataFrame(record)
    if record:
        df.columns = ['lineage', 'string', 'ID']
        df = df[['ID','lineage','string']]
    return df

In [253]:
def run_pipeline(N, pub):
    results = []
    for i in range(N):
        article = pub.iloc[i]
        article_id = article.pmcId.split(":")[1]
        real_id = article.id
        print(f'start article {i}')
        if i%100 == 0:
            print(f'{i}/{N}')
            
        try:
            results.append(pub_mentions_lin(article_id, real_id))
        except urllib.error.HTTPError as exc:
            time.sleep(5) # wait 5 seconds and then make http request again
            continue
    return pd.concat(results)

In [264]:
pub = pd.read_csv("Publication_1120.csv")
N = pub.size


In [267]:
N

17360

In [244]:
ans = run_pipeline(N, pub)
ans.columns = [['from','to','evidence']]
ans.to_csv('Publication-MENTIONS-Lineage.csv',index=False)

0/17360
100/17360
200/17360


URLError: <urlopen error TLS/SSL connection has been closed (EOF) (_ssl.c:1076)>

In [256]:
"""results = []
for i in range(144, N):
    article = pub.iloc[i]
    article_id = article.pmcId.split(":")[1]
    real_id = article.id
    print(f'start article {i}')
    if i%100 == 0:
        print(f'{i}/{N}')

    try:
        results.append(pub_mentions_lin(article_id, real_id))
    except urllib.error.HTTPError as exc:
        time.sleep(5) # wait 5 seconds and then make http request again
        continue"""

start article 144
start article 145
start article 146
start article 147
start article 148
start article 149
start article 150
start article 151
start article 152
start article 153
start article 154
start article 155
start article 156
start article 157
start article 158
start article 159
start article 160
start article 161
start article 162
start article 163
start article 164
start article 165
start article 166
start article 167
start article 168
start article 169
start article 170
start article 171
start article 172
start article 173
start article 174
start article 175
start article 176
start article 177
start article 178
start article 179
start article 180
start article 181
start article 182
start article 183
start article 184
start article 185
start article 186
start article 187
start article 188
start article 189
start article 190
start article 191
start article 192
start article 193
start article 194
start article 195
start article 196
start article 197
start article 198
start arti

URLError: <urlopen error [Errno 60] Operation timed out>

In [262]:
"""results_144_to_321[:3]"""

[                          ID    lineage  \
 0  doi:10.4269/ajtmh.21-0542    B.1.1.7   
 1  doi:10.4269/ajtmh.21-0542    B.1.617   
 2  doi:10.4269/ajtmh.21-0542    B.1.351   
 3  doi:10.4269/ajtmh.21-0542    B.1.243   
 4  doi:10.4269/ajtmh.21-0542    B.1.234   
 5  doi:10.4269/ajtmh.21-0542      B.1.2   
 6  doi:10.4269/ajtmh.21-0542  B.1.1.519   
 7  doi:10.4269/ajtmh.21-0542        B.1   
 8  doi:10.4269/ajtmh.21-0542    B.1.596   
 
                                               string  
 0  Among the sequences were three novel viral var...  
 1  Mutations of Q493 resulting in an amino acid c...  
 2  Mutations of Q493 resulting in an amino acid c...  
 3  Of these, 56 (60%) sequences were of B.1.1.7/2...  
 4  Of these, 56 (60%) sequences were of B.1.1.7/2...  
 5  Of these, 56 (60%) sequences were of B.1.1.7/2...  
 6  Of these, 56 (60%) sequences were of B.1.1.7/2...  
 7  Of these, 56 (60%) sequences were of B.1.1.7/2...  
 8  Of these, 56 (60%) sequences were of B.1.1.7/2... 

In [257]:
#results_144_to_321 = results

In [251]:
# partial results 

#results_0_to_144 = results

## Fulltext Regrex
This part is removed when generating knowledge graph data