# Downloads Publication Information for PANGO Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook text-mines [PANGO lineage](https://cov-lineages.org/) mentions in the titles and abstracts of publications and preprints from the CORD-19 data set. Note, the text-mined results may contain false positive!

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

References:

Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology(2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Lucy Lu Wang, et al., CORD-19: The COVID-19 Open Research Dataset (2020) [arXiv:2004.10706v4](https://arxiv.org/abs/2004.10706).

Author: Peter Rose (pwrose@ucsd.edu)

In [2]:
import os
import pandas as pd
import io
import dateutil
import re
from pathlib import Path
import nltk
import json, requests
from urllib.request import urlopen
from xml.etree.ElementTree import parse
import urllib
import time
import numpy as np

In [3]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columsns

In [4]:
NEO4J_IMPORT = "/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import"#Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import


## Get PANGO lineages

In [5]:
pango = pd.read_csv(NEO4J_IMPORT + "/00b-PANGOLineage.csv", dtype=str)

In [6]:
pango.sample(5)

Unnamed: 0,lineage,description,alias,predecessor,l0,l1,l2,l3,levels
928,B.1.177.62,"Germany, Switzerland, Netherlands",,,B.1.177.62,B.1.177,B.1,B,4
47,C.7,"Alias of B.1.1.1.7, Denmark",B.1.1.1.7,B.1.1.1,C.7,C,,,2
1877,B.1.107,Withdrawn: Reassigned in the current tree. Dan...,,,B.1.107,B.1,B,,3
557,B.1.1.458,Swedish lineage,,,B.1.1.458,B.1.1,B.1,B,4
252,B.1.1.155,English,,,B.1.1.155,B.1.1,B.1,B,4


In [7]:
lineages = pango['lineage'].unique()

In [8]:
pattern1 = re.compile(' [A-Z]{1,2}[.]\d+ ', re.IGNORECASE)
pattern2 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+ ', re.IGNORECASE)
pattern3 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+[.]+\d+ ', re.IGNORECASE)

# add WHO lineage
who_lineage = [' Alpha ', ' Beta ', ' Gamma ', ' Epsilon ',' Zeta ', ' Eta ', ' Theta  ',\
               ' Iota ', ' Kappa ', ' Lambda ', ' Mu ']
pattern4 = re.compile("|".join(who_lineage), re.IGNORECASE)

In [9]:
# add who to lineages
lineages = np.append(lineages, who_lineage)

In [10]:
# remove A B
lineages = np.delete(lineages, np.where(lineages == 'A'))
lineages = np.delete(lineages, np.where(lineages == 'B'))

## Get CORD-19 Metadata

In [12]:
CACHE = Path(NEO4J_IMPORT +'/cache/cord19/2022-03-31/metadata.csv')

In [13]:
metadata = pd.read_csv(CACHE, dtype='str')

In [14]:
metadata.fillna('', inplace=True)
#convert datetime column to just date
metadata['year'] = metadata['publish_time'].apply(lambda d: d[:4] if len(d) > 4 else '')
metadata['date'] = metadata['publish_time'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')

  after removing the cwd from sys.path.


In [15]:
print("Total number of papers", metadata.shape[0])

Total number of papers 992921


## Extract a list of PANGO lineages

Remove special characters to simply parsing for lineages in parenthesis, comma-separated lists, etc.

In [16]:
metadata['title'] = metadata['title'].replace('[()/,]', ' ', regex=True)
metadata['abstract'] = metadata['abstract'].replace('[()/,]', ' ', regex=True)

Match PANGO patterns and check agains list of known lineages.

In [17]:
pattern1 = re.compile(' [A-Z]{1,2}[.]\d+ ', re.IGNORECASE)
pattern2 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+ ', re.IGNORECASE)
pattern3 = re.compile(' [A-Z]{1,2}[.]\d+[.]\d+[.]+\d+ ', re.IGNORECASE)

# add WHO lineage
who_lineage = [' Alpha ', ' Beta ', ' Gamma ', ' Epsilon ',' Zeta ', ' Eta ', ' Theta  ',\
               ' Iota ', ' Kappa ', ' Lambda ', ' Mu ']
pattern4 = re.compile("|".join(who_lineage), re.IGNORECASE)

In [18]:
# add who to lineages
lineages = np.append(lineages, who_lineage)

In [19]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    
    
    for l in lin:
        l = l.strip()
        # check if lineage is valid (e.g., not a withdrawn lineage or false positive)
        if l in lineages:
            u_lin.add(l)
            
    return ";".join(u_lin)

### Run on whole dataset

In [32]:
metadata['lineages'] = metadata.apply(get_lineages, axis=1)

Keep only papers that map to PANGO lineages

In [33]:
hits = metadata[metadata['lineages'].str.len() > 0].copy()

### Assign CURIEs from [Identifiers.org](https://identifiers.org)

In [34]:
hits['doi'] = hits['doi'].apply(lambda x: 'doi:' + x if len(x) > 0 else '')
hits['pubmed_id'] = hits['pubmed_id'].apply(lambda x: 'pubmed:' + x if len(x) > 0 else '')
hits['pmcid'] = hits['pmcid'].apply(lambda x: 'pmc:' + x if len(x) > 0 else '')
hits['arxiv_id'] = hits['arxiv_id'].apply(lambda x: 'arxiv:' + x if len(x) > 0 else '')

In [35]:
#hits.sort_values(by=['publish_time'], ascending=False, inplace=True)

In [36]:
print("Number of matches", hits.shape[0])

Number of matches 4419


In [37]:
def create_id(row):
    """Creates a unique id using the most commonly available id in priority order"""
    if row.doi != '':
        return row.doi
    elif row.pubmed_id != '':
        return row.pubmed_id
    elif row.pmcid != '':
        return row.pmcid
    elif row.arxiv_id != '':
        return row.arxiv_id
    elif row.url != '':
        return row.url
    else:
        # TODO deal with WHO papers here?
        return ''

In [38]:
hits['id'] = hits.apply(create_id, axis=1)

WHO documents seem to be copies of articles that are already present in the dataset and will be ignored for now.

In [40]:
hits.query('id != ""', inplace=True)

In [41]:
print("Total number of matches", hits.shape[0])

Total number of matches 3200


In [44]:
hits.to_csv(NEO4J_IMPORT + "01h-CORDLineages.csv", index=False)

## Fulltext Regrex


In [278]:
# get articles ids for specific lineage
def get_ids(lineage):
    url = requests.get(f'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=(%22{lineage}%22%20AND%20(%22SARS-CoV-2%22%20OR%20%22COVID-19%22)%20AND%20(%22lineage%22%20OR%20%22lineages%22%20OR%20%22strain%22%20OR%20%22strains%22%20OR%20%22variants%22%20OR%20%22variants%22))%20AND%20(FIRST_PDATE:%5b2020-01-01%20)%20AND%20HAS_FT:y%20AND%20%20sort_date:y&resultType=idlist&pageSize=1000&format=json&cursorMark=*')
    text = url.text
    print(text[:5])
    results = json.loads(text)['resultList']['result']
    ids = list(map(lambda x: x['fullTextIdList']['fullTextId'][0], results))
    return ids

In [21]:
# download articles in XML and return body paragraph
def download_article(article_id):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{article_id}/fullTextXML'
    xmldoc = parse(urlopen(url))
    
    # get full text
    root = xmldoc.getroot()
    text = root.findall('.//p')

    # put body paragraphs together
    ptext = ""
    for p in text:
        ptext += ''.join([x for x in p.itertext()]) + '.\n' + '\n'
    return ptext

In [22]:
# get lineage for full texts
def get_full_lineage(ptext):
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    linset = set()
    pair = []
    for s in p_sentence:
        s1 = re.subn('[()/,]', ' ', s)[0] # remove special chars
        lin = set(pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1) + pattern4.findall(s1))

        if lin: 
            for l in lin:
                # valid lineage and not recorded
                l = l.strip()
                l = l.capitalize()
                if (l in lineages) and (l not in linset): 
                    linset.add(l)
                    pair.append([l, s])
                else: continue

    
    """
    ptext = re.subn('[()/,]', ' ', ptext)[0] # remove special chars
    lin = pattern1.findall(ptext) + pattern2.findall(ptext) + pattern3.findall(ptext)
    lin_set = set(lin)
    
    record = []
    if lin_set:
        for l in lin_set:
            
            sen = re.search(r"\.?([^\.]*{}[^\.]*)".format(l), ptext).group()
            record.append([l, sen])
    """
    return pair

In [55]:
# wrap up function take lineage ids as input and output dataframe
def extract_full(ids):
    full_regrex = []
    
    if not ids:  
        return None
    for i in ids:
        try:
            body_text = download_article(i) # get body text
            record = get_full_lineage(body_text) # extract lineages in text
            [x.append(i) for x in record] # attach article id to lineage record
            full_regrex.append(pd.DataFrame(record))
        except urllib.error.HTTPError as exc:
            time.sleep(10) # wait 10 seconds and then make http request again
            continue
    df_fulltext = pd.concat(full_regrex)
    df_fulltext.columns = ['lineage', 'string', 'ID']
    return df_fulltext


    

#### test on B.1.1.7


In [231]:
lineage = 'B.1.1.7'
ids = get_ids(lineage)

full_regrex = []
for i in ids:
    try: 
        path = Path(f'{lineage}/{i}.txt')
        
        # if file not exist, get body text and save to file
        if not path.is_file():
            body_text = download_article(i)
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("w", encoding ="utf-8") as f:
                f.write(body_text)
                f.close()
        else: # otherwise retrieve text
            body_text = path.read_text()
        
        
        record = get_full_lineage(body_text) # get lineages
        [x.append(i) for x in record] # attach article id to lineage record
        full_regrex.append(pd.DataFrame(record))
    except urllib.error.HTTPError as exc:
        time.sleep(10) # wait 10 seconds and then make http request again
        continue

fulltext_lineage = pd.concat(full_regrex)

In [233]:
fulltext_lineage.to_csv('B_1_1_7.csv',index=False, header = ['lineage', 'string contains lineage', 'ID'])

#### test on P.1

In [234]:
lineage = 'P.1'
ids = get_ids(lineage)

full_regrex = []
for i in ids:
    try: 
        path = Path(f'{lineage}/{i}.txt')
        
        # if file not exist, get body text and save to file
        if not path.is_file():
            body_text = download_article(i)
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("w", encoding ="utf-8") as f:
                f.write(body_text)
                f.close()
        else: # otherwise retrieve text
            body_text = path.read_text()
        
        
        record = get_full_lineage(body_text) # get lineages
        [x.append(i) for x in record] # attach article id to lineage record
        full_regrex.append(pd.DataFrame(record))
    except urllib.error.HTTPError as exc:
        time.sleep(10) # wait 10 seconds and then make http request again
        continue

fulltext_lineage = pd.concat(full_regrex)

In [253]:
fulltext_lineage.to_csv('P_1.csv',index=False, header = ['lineage', 'string contains lineage', 'ID'])

### manual check possible false postives
#### B.1.1.7

In [276]:
b117 = pd.read_csv('B_1_1_7.csv')

In [277]:
b117 = b117.drop(['Unnamed: 0'],axis = 1)
b117.columns = ['lineage', 'string contains lineage', 'ID']

In [346]:
# 1. same sentence with many lineages are counted as positive, we remove those 
b117_sub = b117[~ b117.duplicated('string contains lineage',keep=False)]

In [349]:
# 2. extract articles with only one lineage, which are possibly FP
IDs = b117_sub.groupby('ID').lineage.count().loc[lambda p : p == 1].index

In [350]:
b117_sub = b117_sub.set_index('ID').loc[IDs]

In [369]:
# manual check
#print(b117_sub['string contains lineage'].str.cat(sep = '\n '))

#### P.1

In [370]:
p1 = pd.read_csv("P_1.csv")
p1_sub = p1[~ p1.duplicated('string contains lineage',keep=False)]
IDs_1 = p1_sub.groupby('ID').lineage.count().loc[lambda p : p == 1].index

In [371]:
p1_sub = p1_sub.set_index('ID').loc[IDs_1]

In [379]:
p1_sub[p1_sub['string contains lineage'].str.contains('The nucleic acid')]

Unnamed: 0_level_0,lineage,string contains lineage
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
PPR282435,R.9.4.1,The nucleic acid was converted to cDNA and amp...


## Start Generalization

In [165]:
import time

def iter_lineage(l_):
    
    #for l_ in lineage: This is redundant in mapping
     
    start = time.time()
    print(f'start {l_}')
    #get ids
    ids = get_ids(l_)
    try:
        # if lineage processed before, run on new ids
        path = Path(f'{l_}_df.csv')
        if path.is_file():
            df = pd.read_csv(path)
            id_ran = set(df.ID)
            id_to_run = set(ids) - id_ran
            df_new = extract_full(id_to_run) # merge with original df
            ddf = pd.concat([df, df_new], axis = 0)

        else: # otherwise run on all ids
            id_to_run = ids
            ddf = extract_full(id_to_run)

        ddf.to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
    except:
        id_to_run = ids
        ddf = extract_full(id_to_run)
        ddf.to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
        


In [731]:
i1 = get_ids('A.1')
i2 = get_ids('A.2')

In [718]:
d1 = extract_full(i1[:10])
d2 = extract_full(i2[:10])

In [732]:
i1[:5],i2[:5]

(['PMC8725908', 'PMC8725896', 'PMC9181312', 'PMC9174147', 'PMC9162986'],
 ['PMC8725908', 'PMC8725896', 'PMC9174147', 'PMC9132978', 'PMC9132891'])

In [720]:
d1.shape

(28, 3)

In [721]:
d2.shape

(36, 3)

In [127]:
l_10 = lineages[:10]
l_10


array(['A.1', 'A.2', 'A.2.2', 'A.2.3', 'A.2.4', 'A.2.5', 'A.2.5.1',
       'A.2.5.2', 'A.2.5.3', 'A.3'], dtype=object)

### parallel running

In [387]:
from dask.distributed import Client, progress

In [390]:
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4 GiB")
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 62301 instead
  f"Port {expected} is already in use.\n"


0,1
Connection method: Cluster object,Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:62301/status,

0,1
Dashboard: http://127.0.0.1:62301/status,Workers: 4
Total threads: 4,Total memory: 16.00 GiB
Status: running,Using processes: True

0,1
Comm: tcp://127.0.0.1:62302,Workers: 4
Dashboard: http://127.0.0.1:62301/status,Total threads: 4
Started: Just now,Total memory: 16.00 GiB

0,1
Comm: tcp://127.0.0.1:62323,Total threads: 1
Dashboard: http://127.0.0.1:62324/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:62309,
Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-caitlq4j,Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-caitlq4j

0,1
Comm: tcp://127.0.0.1:62314,Total threads: 1
Dashboard: http://127.0.0.1:62315/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:62305,
Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-j7kue5a8,Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-j7kue5a8

0,1
Comm: tcp://127.0.0.1:62317,Total threads: 1
Dashboard: http://127.0.0.1:62319/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:62307,
Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-nt4fff35,Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-nt4fff35

0,1
Comm: tcp://127.0.0.1:62318,Total threads: 1
Dashboard: http://127.0.0.1:62320/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:62308,
Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-wgytwgcp,Local directory: /Users/lyt/Desktop/COVID-19/covid-19-community/notebooks/dataprep/dask-worker-space/worker-wgytwgcp


In [243]:
import dask.bag as db

b = db.from_sequence(l_10)

In [181]:
start = time.time()
result = b.map(lambda l: get_ids(l))
result.compute()
end = time.time()
print('Total time:', end-start)

Total time: 4.617193937301636


In [191]:
a = get_ids('A.1')

In [192]:
len(a)

358

In [195]:
len(get_ids('A.2.2'))

15

In [177]:
start = time.time()
for i in l_10[:1]:
    iter_lineage(i)
end = time.time()

start A.1
done with A.1, time duration --- seconds --- 1427.6210389137268 



In [178]:
print('Total time:', end-start)

Total time: 1427.6240351200104


In [313]:
def parellel_iter(l_): 
    
    #for l_ in lineage: This is redundant in mapping
     
    start = time.time()
    print(f'start {l_}')
    #get ids
    ids = get_ids(l_)

    try:
        # if lineage processed before, run on new ids
        path = Path(f'{l_}_df.csv')
        if path.is_file():
            df = pd.read_csv(path)
            id_ran = set(df.ID)
            id_to_run = set(ids) - id_ran
            if id_to_run: 
                bag_id = db.from_sequence(id_to_run)
                dds = bag_id.map(lambda i: extract_full_parallel2(i))
                df_new = dds.take(1)[0]
                ddf = pd.concat([df, df_new], axis = 0)

        else: # otherwise run on all ids
            id_to_run = ids
            bag_id = db.from_sequence(id_to_run)
            dds = bag_id.map(lambda i: extract_full_parallel2(i)) 
            ddf = dds.take(1)[0]

        ddf.to_dataframe().to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
    except:
        id_to_run = ids
        bag_id = db.from_sequence(id_to_run)
        dds = bag_id.map(lambda i: extract_full_parallel2(i))
        ddf = dds.take(1)[0]
        ddf.to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
        


In [1]:
# no looping
def extract_full_parallel(ids):
    #ids = dic['ids']
    record = []
    
    """if not ids:  
        return None
    for i in ids:"""
    
    try:
        body_text = download_article(ids) # get body text
        record = get_full_lineage(body_text) # extract lineages in text
        [x.append(i) for x in record] # attach article id to lineage record
        #full_regrex.append(pd.DataFrame(record))
    except urllib.error.HTTPError as exc:
        time.sleep(10) # wait 10 seconds and then make http request again
        #continue
        
    df_fulltext = pd.DataFrame(record)
    df_fulltext.columns = ['lineage', 'string', 'ID']
    return df_fulltext

In [2]:
# looping over ids
def extract_full_parallel2(ids):
    #ids = dic['ids']
    full_regrex = []
    """
    if not ids:  
        return None
    """
    for i in ids:
        try:
            body_text = download_article(i) # get body text
            record = get_full_lineage(body_text) # extract lineages in text
            [x.append(i) for x in record] # attach article id to lineage record
            full_regrex.append(pd.DataFrame(record))
        except urllib.error.HTTPError as exc:
            time.sleep(10) # wait 10 seconds and then make http request again
            continue
    df_fulltext = pd.concat(full_regrex)
    df_fulltext.columns = ['lineage', 'string', 'ID']
    return df_fulltext

In [None]:
### BAG

In [284]:
def query(lineage):
    # get articles ids for specific lineage:
    url = requests.get(f'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=(%22{lineage}%22%20AND%20(%22SARS-CoV-2%22%20OR%20%22COVID-19%22)%20AND%20(%22lineage%22%20OR%20%22lineages%22%20OR%20%22strain%22%20OR%20%22strains%22%20OR%20%22variants%22%20OR%20%22variants%22))%20AND%20(FIRST_PDATE:%5b2020-01-01%20)%20AND%20HAS_FT:y%20AND%20%20sort_date:y&resultType=idlist&pageSize=1000&format=json&cursorMark=*')
    text = url.text
    print(text[:10])
    results = json.loads(text)['resultList']['result']
    ids = list(map(lambda x: x['fullTextIdList']['fullTextId'][0], results))
    return {'lineage': lineage, 'ids': ids}

In [287]:
lineages[7:8]

array(['A.2.5.2'], dtype=object)

In [289]:
## try if work on lineage A.1 & A2.5.2

In [451]:
def parallel_iter(l_): 
    
    #for l_ in lineage: This is redundant in mapping
     
    start = time.time()
    print(f'start {l_}')
    #get ids
    b = db.from_sequence([l_])
    result = b.map(lambda record: get_ids(record))
    

    try:
        # if lineage processed before, run on new ids
        path = Path(f'{l_}_df.csv')
        if path.is_file():
            df = pd.read_csv(path)
            ids = result.take(1)[0]
            id_ran = set(df.ID)
            if id_ran != set(ids): 
                dds = result.filter(lambda x: x not in id_ran).map(lambda i: extract_full_parallel2(i))
                df_new = dds.take(1)[0]
                ddf = pd.concat([df, df_new], axis = 0)
            else: return # if no updates do nothing

        else: # otherwise run on all ids
            dds = result.map(lambda i: extract_full_parallel2(i)) 
            ddf = dds.take(1)[0]

        ddf.to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
    except:
        dds = result.map(lambda i: extract_full_parallel2(i))
        ddf = dds.take(1)[0]
        ddf.to_csv(f'{l_}_df.csv',index=False, \
                                    header = ['lineage', 'string', 'ID'])
        end = time.time()
        print(f'done with {l_}, time duration --- seconds --- {end - start} \n')
        


In [460]:
# break down iter code for debugging
l_ = 'A.2.5.2'
start = time.time()
b = db.from_sequence([l_])
path = Path(f'{l_}_df.csv')
if path.is_file():
    df = pd.read_csv(path)
    ids = result.take(1)[0]
    id_ran = set(df.ID)
    if id_ran != set(ids): 
        dds = result.filter(lambda x: x not in id_ran).map(lambda i: extract_full_parallel2(i))
        df_new = dds.take(1)[0]
        ddf = pd.concat([df, df_new], axis = 0)
end = time.time()
print('Total time:', end-start)

Total time: 2.6870317459106445


In [481]:
start = time.time()
#b = db.from_sequence(lineages[8:9])
parallel_iter('A.2.5.3')
end = time.time()

start A.2.5.3


In [462]:
end - start

3.3915951251983643

In [463]:
start = time.time()
for i in l_10[6:9]:
    iter_lineage(i)
end = time.time()

start A.2.5.1
{"ver
done with A.2.5.1, time duration --- seconds --- 1.402590036392212 

start A.2.5.2
{"ver
done with A.2.5.2, time duration --- seconds --- 1.1155860424041748 

start A.2.5.3
{"ver
done with A.2.5.3, time duration --- seconds --- 0.6666321754455566 



In [464]:
end-start

3.186002254486084

In [378]:
b = db.from_sequence(lineages[7:8])
print(lineages[7:8])
start = time.time()
result = b.map(lambda record: get_ids(record))
print('get id')
print(result.take(1))
dds = result.map(lambda x: extract_full_parallel2(x))
print(dds.compute())
end = time.time()
print('Total time:', end-start)

['A.2.5.2']
get id
(['PMC9145602', 'PMC9088647', 'PMC8525575'],)
[       lineage                                             string          ID
0          P.1  The most common genotype was 417variant/484K/5...  PMC9145602
1          P.2  No mutations (K417 only) were found in 64/198 ...  PMC9145602
2      B.1.1.7  Seven samples (3.5%) tested positive for 452R ...  PMC9145602
3    B.1.617.2  Patient consent was waived under approval of I...  PMC9145602
4         C.37  Patient consent was waived under approval of I...  PMC9145602
5    B.1.617.1  Patient consent was waived under approval of I...  PMC9145602
6      B.1.351  The most common genotype identified was 417var...  PMC9145602
7        P.1.2  Samples bearing genotype 417variant/484K/501Y ...  PMC9145602
8      B.1.499  The genotype of only K417 was found in samples...  PMC9145602
9    B.1.1.277  The genotype of only K417 was found in samples...  PMC9145602
10    B.1.1.33  The genotype of only K417 was found in samples...  PMC914560

In [336]:
b = db.from_sequence([lineages[7:8]])
print(lineages[7:8])
start = time.time()
result = b.map(lambda record: query(record))
print('get id')
ddf = result.map(lambda x: extract_full_parallel(x))
ddf.compute()
end = time.time()
print(end - start)

['A.2.5.2']
get id
10.318028926849365


In [298]:
# get dataframe
ddf.take(1)[0]

Unnamed: 0,lineage,string,ID
0,P.1,The most common genotype was 417variant/484K/5...,PMC9145602
1,P.2,No mutations (K417 only) were found in 64/198 ...,PMC9145602
2,B.1.1.7,Seven samples (3.5%) tested positive for 452R ...,PMC9145602
3,B.1.617.2,Patient consent was waived under approval of I...,PMC9145602
4,C.37,Patient consent was waived under approval of I...,PMC9145602
5,B.1.617.1,Patient consent was waived under approval of I...,PMC9145602
6,B.1.351,The most common genotype identified was 417var...,PMC9145602
7,P.1.2,Samples bearing genotype 417variant/484K/501Y ...,PMC9145602
8,B.1.499,The genotype of only K417 was found in samples...,PMC9145602
9,B.1.1.277,The genotype of only K417 was found in samples...,PMC9145602


In [21]:
# download articles in XML and return body paragraph
def process(article_id):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{article_id}/fullTextXML'
    xmldoc = parse(urlopen(url))
    
    # get full text
    root = xmldoc.getroot()
    text = root.findall('.//p')

    # put body paragraphs together
    ptext = ""
    for p in text:
        ptext += ''.join([x for x in p.itertext()]) + '.\n' + '\n'
    
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    linset = set()
    pair = []
    for s in p_sentence:
        s1 = re.subn('[()/,]', ' ', s)[0] # remove special chars
        lin = set(pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1) + pattern4.findall(s1))

        if lin: 
            for l in lin:
                # valid lineage and not recorded
                l = l.strip()
                l = l.capitalize()
                if (l in lineages) and (l not in linset): 
                    linset.add(l)
                    pair.append([l, s])
                else: continue
    
    

In [22]:
# get lineage for full texts
def get_full_lineage(ptext):
    # tokenize texts into sentences
    p_sentence = nltk.tokenize.sent_tokenize(ptext)
    
    # record lineages
    linset = set()
    pair = []
    for s in p_sentence:
        s1 = re.subn('[()/,]', ' ', s)[0] # remove special chars
        lin = set(pattern1.findall(s1) + pattern2.findall(s1) + pattern3.findall(s1) + pattern4.findall(s1))

        if lin: 
            for l in lin:
                # valid lineage and not recorded
                l = l.strip()
                l = l.capitalize()
                if (l in lineages) and (l not in linset): 
                    linset.add(l)
                    pair.append([l, s])
                else: continue
    return pair
    
    """
    ptext = re.subn('[()/,]', ' ', ptext)[0] # remove special chars
    lin = pattern1.findall(ptext) + pattern2.findall(ptext) + pattern3.findall(ptext)
    lin_set = set(lin)
    
    record = []
    if lin_set:
        for l in lin_set:
            
            sen = re.search(r"\.?([^\.]*{}[^\.]*)".format(l), ptext).group()
            record.append([l, sen])
    """
    

In [55]:
# wrap up function take lineage ids as input and output dataframe
def extract_full(ids):
    full_regrex = []
    
    if not ids:  
        return None
    for i in ids:
        try:
            body_text = download_article(i) # get body text
            record = get_full_lineage(body_text) # extract lineages in text
            [x.append(i) for x in record] # attach article id to lineage record
            full_regrex.append(pd.DataFrame(record))
        except urllib.error.HTTPError as exc:
            time.sleep(10) # wait 10 seconds and then make http request again
            continue
    df_fulltext = pd.concat(full_regrex)
    df_fulltext.columns = ['lineage', 'string', 'ID']
    return df_fulltext


    

In [220]:
result.map(lambda x: 

dask.bag<lambda, npartitions=4>