<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at 'In [12]'.</span>

# Downloads Publication Information for PANGO Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook text-mines [PANGO lineage](https://cov-lineages.org/) mentions in the titles and abstracts of publications and preprints from the CORD-19 data set. Note, the text-mined results may contain false positive!

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

References:

Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology(2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Lucy Lu Wang, et al., CORD-19: The COVID-19 Open Research Dataset (2020) [arXiv:2004.10706v4](https://arxiv.org/abs/2004.10706).

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import io
import dateutil
import re
from pathlib import Path
import nltk
import json, requests
from urllib.request import urlopen
from xml.etree.ElementTree import parse
import urllib
import time
import numpy as np

In [2]:
pattern1 = re.compile(' [A-Z]{1,2}[.]\d+ ', re.IGNORECASE)
pattern2 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+ ', re.IGNORECASE)
pattern3 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+[.]+\d+ ', re.IGNORECASE)

# add WHO lineage
who_lineage = [' Alpha ', ' Beta ', ' Gamma ', ' Epsilon ',' Zeta ', ' Eta ', ' Theta  ',\
               ' Iota ', ' Kappa ', ' Lambda ', ' Mu ']
pattern4 = re.compile("|".join(who_lineage), re.IGNORECASE)

In [3]:
gg = pd.read_csv('lineages')

In [4]:
lineages = gg.iloc[:,0].to_list()

In [5]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    
    
    for l in lin:
        l = l.strip()
        # check if lineage is valid (e.g., not a withdrawn lineage or false positive)
        if l in lineages:
            u_lin.add(l)
            
    return ";".join(u_lin)

In [6]:
# download articles in XML and return body paragraph
def download_article(article_id):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{article_id}/fullTextXML'
    xmldoc = parse(urlopen(url))
    
    # get full text
    root = xmldoc.getroot()
    text = root.findall('.//p')

    # put body paragraphs together
    ptext = ""
    for p in text:
        ptext += ''.join([x for x in p.itertext()]) + '.\n' + '\n'
    return ptext

In [7]:
# get lineage for full texts
def get_full_lineage(ptext):
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    pair = []
    for s in p_sentence:
        s1 = re.subn('[()/,]', ' ', s)[0] # remove special chars
        lin = set(pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1) + pattern4.findall(s1))

        if lin: 
            for l in lin:
                # valid lineage and not recorded
                l = l.strip()
                l = l.capitalize()
                if (l in lineages): 
                    pair.append([l, s])
                else: continue
    return pair

In [8]:
def pub_mentions_lin(article_id, real_id):
    body_text = download_article(article_id) # get body text
    record = get_full_lineage(body_text) # extract lineages in text
    [x.append(real_id) for x in record] # attach article id to lineage record
    df = pd.DataFrame(record)
    if record:
        df.columns = ['lineage', 'string', 'ID']
        df = df[['ID','lineage','string']]
    return df

In [9]:
def run_pipeline(N, pub):
    results = []
    for i in range(N):
        article = pub.iloc[i]
        article_id = article.pmcId.split(":")[1]
        real_id = article.id
        print(f'start article {i}')
        if i%100 == 0:
            print(f'{i}/{N}')
            
        try:
            results.append(pub_mentions_lin(article_id, real_id))
        except urllib.error.HTTPError as exc:
            time.sleep(5) # wait 5 seconds and then make http request again
            continue
    return pd.concat(results)

In [10]:
pub = pd.read_csv("Publication_1120.csv")
N = pub.size


In [11]:
N

17360

In [12]:
ans = run_pipeline(N, pub)
ans.columns = [['from','to','evidence']]
ans.to_csv('Publication-MENTIONS-Lineage.csv',index=False)

start article 0
0/17360
start article 1
start article 2
start article 3
start article 4
start article 5
start article 6
start article 7
start article 8
start article 9
start article 10
start article 11
start article 12
start article 13
start article 14
start article 15
start article 16
start article 17
start article 18
start article 19
start article 20
start article 21
start article 22
start article 23
start article 24
start article 25
start article 26
start article 27
start article 28
start article 29
start article 30
start article 31
start article 32
start article 33
start article 34
start article 35
start article 36
start article 37
start article 38
start article 39
start article 40
start article 41
start article 42
start article 43
start article 44
start article 45
start article 46
start article 47
start article 48
start article 49
start article 50
start article 51
start article 52
start article 53
start article 54
start article 55
start article 56
start article 57
start article 58

start article 459
start article 460
start article 461
start article 462
start article 463
start article 464
start article 465
start article 466
start article 467
start article 468
start article 469
start article 470
start article 471
start article 472
start article 473
start article 474
start article 475
start article 476
start article 477
start article 478
start article 479
start article 480
start article 481
start article 482
start article 483
start article 484
start article 485
start article 486
start article 487
start article 488
start article 489
start article 490
start article 491
start article 492
start article 493
start article 494
start article 495
start article 496
start article 497
start article 498
start article 499
start article 500
500/17360
start article 501
start article 502
start article 503
start article 504
start article 505
start article 506
start article 507
start article 508
start article 509
start article 510
start article 511
start article 512
start article 513


start article 912
start article 913
start article 914
start article 915
start article 916
start article 917
start article 918
start article 919
start article 920
start article 921
start article 922
start article 923
start article 924
start article 925
start article 926
start article 927
start article 928
start article 929
start article 930
start article 931
start article 932
start article 933
start article 934
start article 935
start article 936
start article 937
start article 938
start article 939
start article 940
start article 941
start article 942
start article 943
start article 944
start article 945
start article 946
start article 947
start article 948
start article 949
start article 950
start article 951
start article 952
start article 953
start article 954
start article 955
start article 956
start article 957
start article 958
start article 959
start article 960
start article 961
start article 962
start article 963
start article 964
start article 965
start article 966
start arti

start article 1346
start article 1347
start article 1348
start article 1349
start article 1350
start article 1351
start article 1352
start article 1353
start article 1354
start article 1355
start article 1356
start article 1357
start article 1358
start article 1359
start article 1360
start article 1361
start article 1362
start article 1363
start article 1364
start article 1365
start article 1366
start article 1367
start article 1368
start article 1369
start article 1370
start article 1371
start article 1372
start article 1373
start article 1374
start article 1375
start article 1376
start article 1377
start article 1378
start article 1379
start article 1380
start article 1381
start article 1382
start article 1383
start article 1384
start article 1385
start article 1386
start article 1387
start article 1388
start article 1389
start article 1390
start article 1391
start article 1392
start article 1393
start article 1394
start article 1395
start article 1396
start article 1397
start articl

IndexError: single positional indexer is out-of-bounds

In [None]:
"""results = []
for i in range(144, N):
    article = pub.iloc[i]
    article_id = article.pmcId.split(":")[1]
    real_id = article.id
    print(f'start article {i}')
    if i%100 == 0:
        print(f'{i}/{N}')

    try:
        results.append(pub_mentions_lin(article_id, real_id))
    except urllib.error.HTTPError as exc:
        time.sleep(5) # wait 5 seconds and then make http request again
        continue"""

In [None]:
"""results_144_to_321[:3]"""

In [None]:
#results_144_to_321 = results

In [None]:
# partial results 

#results_0_to_144 = results

## Fulltext Regrex
This part is removed when generating knowledge graph data