# Download PANGO Lineages
**[Work in progress]**

This notebook downloads the current PANGO lineages and builds a tree structure of the lineages.

Data source: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation)

Reference:
Rambaut A, et al. (2020) A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology. [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import numpy as np
import pandas as pd
import io
import dateutil
import re
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
# Hard-coded local Neo4j import directory. Alternative: Path(os.getenv('NEO4J_IMPORT'))
NEO4J_IMPORT = "/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import"
print(NEO4J_IMPORT)

/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import


## Get PANGO lineages

In [4]:
pango_url = 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

In [5]:
pango = pd.read_csv(pango_url, sep='\t', skiprows=1, dtype=str, names=['lineage', 'description'])
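
To make the `read_csv` arguments concrete, the following is a small sketch that parses an in-memory string instead of the remote file. The two sample rows are illustrative and only assume that the file is tab-separated with a single header line, as the call above implies.

```python
import io
import pandas as pd

# Illustrative stand-in for lineage_notes.txt: one header line, tab-separated rows.
sample_notes = (
    "Lineage\tDescription\n"
    "A.1\tUSA lineage\n"
    "B.1.83\tBelgian lineage\n"
)

# skiprows=1 drops the original header; names= supplies lowercase column names.
parsed = pd.read_csv(io.StringIO(sample_notes), sep='\t', skiprows=1,
                     dtype=str, names=['lineage', 'description'])
print(parsed)
```

Reading everything as `str` avoids pandas guessing a numeric type for any column.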

Strip leading and trailing whitespace from the lineage column

In [6]:
pango['lineage'] = pango['lineage'].str.strip()

In [7]:
pango['lineage'] = pango['lineage'].str.strip('*')  # drop the '*' marker so withdrawn lineages are kept

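Withdrawn lineages are marked with a leading "*". The filter below would remove them, but it is disabled, so withdrawn lineages stay in the table. To have any effect it would also need to run before the "*" is stripped.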

In [10]:
#pango = pango[~pango['lineage'].str.startswith('*')]
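
If the withdrawn status is needed later, one option is to record it in a separate column before the marker is removed. This is a sketch only: the `withdrawn` column is hypothetical and not part of the notebook, and the sample rows assume the leading "*" convention described above.

```python
import pandas as pd

# Illustrative rows; '*A.8' assumes withdrawn lineages carry a leading '*'.
sample = pd.DataFrame({
    'lineage': ['A.1', '*A.8', 'B.1.83'],
    'description': ['USA lineage',
                    'Withdrawn: Indian lineage merged with A.9',
                    'Belgian lineage'],
})

# Record the flag first, then remove the marker from the lineage name.
sample['withdrawn'] = sample['lineage'].str.startswith('*')  # hypothetical column
sample['lineage'] = sample['lineage'].str.lstrip('*')
print(sample[['lineage', 'withdrawn']])
```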

### Extract alias from description

In [8]:
pattern = re.compile(r'Alias of (\S*?),', re.IGNORECASE)  # raw string avoids an invalid-escape warning

In [9]:
def get_alias(row):
    match = pattern.findall(str(row.description))
    if len(match) > 0:
        return match[0]
    else:
        return ''

In [10]:
pango['alias'] = pango.apply(get_alias, axis=1)
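
As an alternative to the row-wise `apply`, the same regular expression can be applied to the whole column at once with `Series.str.extract`. This is a sketch on illustrative descriptions, not a replacement the notebook depends on.

```python
import re
import pandas as pd

# Illustrative descriptions in the style of the lineage notes.
descriptions = pd.Series([
    'Alias of B.1.1.529.2.25, Sweden lineage',
    'USA lineage',
    'Belgian lineage',
])

# One capture group with expand=False returns a Series; non-matches become NaN.
aliases = descriptions.str.extract(r'Alias of (\S*?),', flags=re.IGNORECASE, expand=False)
aliases = aliases.fillna('')
print(aliases.tolist())
```

Descriptions without an alias end up as empty strings, matching what `get_alias` returns.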

In [11]:
pango['predecessor'] = pango['lineage'].str.rsplit('.', n=1, expand=True)[0]  # n must be a keyword in pandas >= 2.0
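
The predecessor is the lineage name with its last dot-separated component removed. A plain-Python sketch of the same rule (the helper name `predecessor` is only for illustration) shows one consequence: a name without a dot is returned unchanged, so root lineages become their own predecessor.

```python
def predecessor(lineage: str) -> str:
    """Drop the last dot-separated component; names without a dot are unchanged."""
    return lineage.rsplit('.', 1)[0]

for name in ['B.1.1.7', 'BA.2.25', 'XBB', 'A']:
    print(name, '->', predecessor(name))
```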

In [12]:
pango.sample(5)

Unnamed: 0,lineage,description,alias,predecessor
1519,B.1.487,USA - TX,,B.1
593,B.1.1.528,"South Africa lineage, from pango-designation i...",,B.1.1
2047,A.8,Withdrawn: Indian lineage merged with A.9,,A
1039,B.1.83,Belgian lineage,,B.1
2185,B.1.107,Withdrawn: Reassigned in the current tree. Dan...,,B.1


### Modify pango df for knowledge graph

In [17]:
pango['id'] = pango.lineage # add id column

# add synonyms
pango['synonyms'] = ''
pango.loc[pango.id == 'B.1.1.7', 'synonyms'] = "Alpha"
pango.loc[pango.id == 'B.1.351', 'synonyms'] = "Beta"
pango.loc[pango.id == 'P.1', 'synonyms'] = "Gamma"
pango.loc[pango.id == 'B.1.617.2', 'synonyms'] = "Delta"
pango.loc[pango.id == 'B.1.1.529', 'synonyms'] = "Omicron"
pango.loc[pango.id.isin(['BA.1', 'BA.2', 'BA.3', 'BA.4', 'BA.5']), 'synonyms'] = "Omicron subvariant"

Unnamed: 0,lineage,description,alias,predecessor,id,synonyms
0,A,One of the two original haplotypes of the pand...,,A,A,
1,A.1,USA lineage,,A,A.1,
2,A.2,Mostly Spanish lineage now includes South and ...,,A,A.2,
3,A.2.2,Australian lineage,,A.2,A.2.2,
4,A.2.3,Scottish lineage,,A.2,A.2.3,


In [21]:
pango['name'] = pango.id

In [24]:
df = pango[['id','name','synonyms','description']]
df.to_csv('Lineage.csv',index=False)

In [14]:
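# rename returns a new frame, avoiding a SettingWithCopyWarning on the column assignment
lin_is_lin = pango[['id','predecessor']].rename(columns={'id': 'from', 'predecessor': 'to'})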

In [15]:
lin_is_lin.sample(4)

Unnamed: 0,from,to
722,BA.2.25,BA.2
1562,B.1.532,B.1
1630,B.1.595.4,B.1.595
2044,XBB,XBB


In [16]:
lin_is_lin.to_csv('Lineage-IS_A-Lineage.csv',index=False)
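
Root lineages such as XBB appear in this edge list with `from` equal to `to`. Whether such self-referencing IS_A edges are wanted in the knowledge graph is not stated here; if they are not, one possible filter is sketched below on illustrative rows.

```python
import pandas as pd

# Illustrative edge list with the same 'from'/'to' columns.
edges = pd.DataFrame({
    'from': ['BA.2.25', 'B.1.532', 'XBB'],
    'to':   ['BA.2',    'B.1',     'XBB'],
})

# Rows where a lineage points to itself mark the roots of the tree.
is_root = edges['from'] == edges['to']
roots = edges.loc[is_root, 'from'].tolist()
tree_edges = edges[~is_root]
print(roots)
print(tree_edges)
```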

### Split into sublineages

In [58]:
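def split_lineage(row):
    # Return the lineage and up to three ancestors: [l0, l1, l2, l3].
    lineage = row['name']
    lineages = np.empty(4, dtype=object)

    for i in range(lineages.size):
        lineages[i] = lineage
        # Drop the last component; yields '' once no '.' is left.
        lineage = lineage.rpartition('.')[0]

    return lineages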

In [59]:
pango[['l0', 'l1', 'l2', 'l3']] = pango.apply(split_lineage, axis=1, result_type='expand')
pango['levels'] = pango['name'].str.count(r'\.') + 1
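
The splitting logic can be checked on single names without a DataFrame. The sketch below mirrors `split_lineage` with a plain list; the helper names `ancestors` and `levels` are only for illustration.

```python
def ancestors(name: str, depth: int = 4) -> list:
    """Return the name followed by successively shorter prefixes, padded with ''."""
    result = []
    for _ in range(depth):
        result.append(name)
        # rpartition gives '' as the head when no '.' remains.
        name = name.rpartition('.')[0]
    return result

def levels(name: str) -> int:
    """Number of dot-separated components in a lineage name."""
    return name.count('.') + 1

print(ancestors('B.1.1.82'), levels('B.1.1.82'))
print(ancestors('BG.3'), levels('BG.3'))
```

Names with fewer than four components leave the remaining columns empty, as in the `l2` and `l3` columns of the sample output.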

In [60]:
pango.sample(5)

Unnamed: 0,id,name,synonyms,description,alias,predecessor,l0,l1,l2,l3,levels
695,BG.3,BG.3,,"Alias of B.1.1.529.2.12.1.3, mainly found in P...",B.1.1.529.2.12.1.3,B.1.1.529.2.12.1,BG.3,BG,,,2
713,BA.2.25,BA.2.25,,"Alias of B.1.1.529.2.25, Sweden lineage",B.1.1.529.2.25,B.1.1.529.2,BA.2.25,BA.2,BA,,3
1452,B.1.567,B.1.567,,"USA lineage, formally part of B.1.2",,,B.1.567,B.1,B,,3
1443,B.1.559,B.1.559,,Scotland,,,B.1.559,B.1,B,,3
196,B.1.1.82,B.1.1.82,,Wales lineage,,,B.1.1.82,B.1.1,B.1,B,4


In [33]:
pango.to_csv(os.path.join(NEO4J_IMPORT, "00b-PANGOLineage.csv"), index=False)