# Download PANGO Lineages
**[Work in progress]**

This notebook downloads the current PANGO lineages and builds a tree structure of the lineages.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation)

Reference:
Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology (2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Author: Peter Rose (pwrose@ucsd.edu)

In [4]:
import os
import numpy as np
import pandas as pd
import io
import dateutil
import re
from pathlib import Path

In [5]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [6]:
# Hard-coded local import directory; alternative: Path(os.getenv('NEO4J_IMPORT'))
NEO4J_IMPORT = "/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import"
print(NEO4J_IMPORT)

/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import


## Get PANGO lineages

In [7]:
pango_url = 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

In [8]:
pango = pd.read_csv(pango_url, sep='\t', skiprows=1, dtype=str, names=['lineage', 'description'])
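
The effect of `skiprows=1` together with `names=[...]` can be checked offline on a small in-memory sample. The sketch below assumes the file starts with a single header row followed by tab-separated rows; the sample text is a hypothetical stand-in for `lineage_notes.txt`, not a copy of it.

```python
import io
import pandas as pd

# Hypothetical stand-in for lineage_notes.txt (assumed layout: one header row,
# then "<lineage>\t<description>" per line).
sample = (
    "Lineage\tDescription\n"
    "B.1.588.1\tRomanian lineage\n"
    "AY.7\tAlias of B.1.617.2.7, UK lineage\n"
    "*B.1.419\tWithdrawn: Iceland lineage\n"
)

# skiprows=1 drops the header row; names= supplies our own column names.
demo = pd.read_csv(io.StringIO(sample), sep='\t', skiprows=1, dtype=str,
                   names=['lineage', 'description'])
print(demo)
```

Because the separator is a tab, the commas inside the descriptions do not split the rows into extra columns.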

Remove leading and trailing whitespace from the lineage column

In [9]:
pango['lineage'] = pango['lineage'].str.strip()

In [24]:
pango['lineage'] = pango['lineage'].str.strip('*')  # drop the "*" marker so withdrawn lineages are kept

Remove withdrawn lineages (names start with a "*"). This filter is currently disabled, so withdrawn lineages are kept.

In [10]:
#pango = pango[~pango['lineage'].str.startswith('*')]
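
If the withdrawn status should remain available after the "*" marker is dropped, one option is to record it in a boolean column first. This is only a sketch on made-up rows; the `withdrawn` column name is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Illustrative rows; a leading "*" marks a withdrawn lineage.
demo = pd.DataFrame({'lineage': ['B.1.588.1', '*B.1.419', ' *E.1 ']})

demo['lineage'] = demo['lineage'].str.strip()             # trim whitespace
demo['withdrawn'] = demo['lineage'].str.startswith('*')   # hypothetical flag column
demo['lineage'] = demo['lineage'].str.lstrip('*')         # drop the marker

# Filtering can then be done at any later point without losing rows up front.
active = demo[~demo['withdrawn']]
print(demo)
```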

Extract alias from description

In [25]:
pattern = re.compile(r'Alias of (\S*?),', re.IGNORECASE)  # raw string avoids the invalid "\S" escape warning

In [26]:
def get_alias(row):
    match = pattern.findall(str(row.description))
    if len(match) > 0:
        return match[0]
    else:
        return ''

In [27]:
pango['alias'] = pango.apply(get_alias, axis=1)

In [28]:
# n must be passed by keyword in recent pandas versions
pango['predecessor'] = pango['alias'].str.rsplit('.', n=1, expand=True)[0]
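
The alias and predecessor columns could also be derived without a row-wise `apply`. The following is a sketch of a vectorized variant using `Series.str.extract`, run on shortened, illustrative descriptions; it is an alternative formulation, not the notebook's own code.

```python
import re
import pandas as pd

# Shortened, illustrative descriptions.
demo = pd.DataFrame({
    'description': [
        'Alias of B.1.617.2.7, UK lineage',
        'Withdrawn: Alias of B.1.36.17.1, (text shortened)',
        'Romanian lineage',
    ]
})

# Capture the token between "Alias of " and the next comma; no match gives NaN.
demo['alias'] = (
    demo['description']
    .str.extract(r'Alias of (\S*?),', flags=re.IGNORECASE, expand=False)
    .fillna('')
)

# Drop the last dot-separated component to get the predecessor.
demo['predecessor'] = demo['alias'].str.rsplit('.', n=1).str[0]
print(demo[['alias', 'predecessor']])
```

Rows without an alias end up with empty strings in both columns, matching the behavior of `get_alias`.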

In [29]:
pango.tail(5)

| | lineage | description | alias | predecessor | l0 | l1 | l2 | l3 | levels |
|---|---|---|---|---|---|---|---|---|---|
| 2013 | E.1 | Withdrawn: Reassigned B.1.416.1. French lineag... | | | *E.1 | *E | | | 2 |
| 2014 | F.1 | Withdrawn: Alias of B.1.36.17.1, New Zealand l... | B.1.36.17.1 | B.1.36.17 | *F.1 | *F | | | 2 |
| 2015 | H.1 | Withdrawn: Alias of B.1.1.67.1, South African ... | B.1.1.67.1 | B.1.1.67 | *H.1 | *H | | | 2 |
| 2016 | I.1 | Withdrawn: Alias of B.1.1.217.1, Latvian linea... | B.1.1.217.1 | B.1.1.217 | *I.1 | *I | | | 2 |
| 2017 | J.1 | Withdrawn: Alias of B.1.1.250.1, Australian li... | B.1.1.250.1 | B.1.1.250 | *J.1 | *J | | | 2 |


### Split into sublineages

In [30]:
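def split_lineage(row):
    """Return the lineage and its parent lineages, up to 4 levels."""
    lineage = row['lineage']
    lineages = np.empty(4, dtype=object)

    for i in range(lineages.size):
        lineages[i] = lineage
        # drop the last dot-separated component; '' once the root is passed
        lineage = lineage.rpartition('.')[0]

    return lineages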

In [31]:
pango[['l0', 'l1', 'l2', 'l3']] = pango.apply(split_lineage, axis=1, result_type='expand')
pango['levels'] = pango['lineage'].str.count(r'\.') + 1
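
To see what ends up in `l0` through `l3`, the same truncation loop can be run on single names. This is a plain-Python sketch; the helper name `ancestors` is hypothetical and used only for illustration.

```python
def ancestors(lineage, depth=4):
    """Return the lineage followed by its successive parents, padded with ''."""
    chain = []
    for _ in range(depth):
        chain.append(lineage)
        # rpartition('.')[0] is everything before the last dot, or '' if none
        lineage = lineage.rpartition('.')[0]
    return chain

print(ancestors('B.1.36.6'))  # ['B.1.36.6', 'B.1.36', 'B.1', 'B']
print(ancestors('AY.7'))      # ['AY.7', 'AY', '', '']
```

Note that aliased names such as `AY.7` are split on their short form, so their `l1` is `AY` rather than the unaliased parent `B.1.617.2`.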

In [32]:
pango.sample(5)

| | lineage | description | alias | predecessor | l0 | l1 | l2 | l3 | levels |
|---|---|---|---|---|---|---|---|---|---|
| 1971 | B.1.419 | Withdrawn: Iceland lineage | | | B.1.419 | B.1 | B | | 3 |
| 1860 | B.1.36.6 | Withdrawn: Norway lineage | | | B.1.36.6 | B.1.36 | B.1 | B | 4 |
| 1345 | B.1.588.1 | Romanian lineage | | | B.1.588.1 | B.1.588 | B.1 | B | 4 |
| 1420 | AY.7 | Alias of B.1.617.2.7, UK lineage, from pango-d... | B.1.617.2.7 | B.1.617.2 | AY.7 | AY | | | 2 |
| 1838 | B.1.5.26 | Withdrawn: Reassigned B.1.232 | | | B.1.5.26 | B.1.5 | B.1 | B | 4 |


In [33]:
pango.to_csv(os.path.join(NEO4J_IMPORT, "00b-PANGOLineage.csv"), index=False)
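
One possible way to turn such a table into parent-child edges for a lineage tree is sketched below. It assumes that the parent of an aliased lineage should come from its unaliased name, which is an assumption of this sketch; the rows and the `parent` column are illustrative and not taken from the exported file.

```python
import pandas as pd

# Illustrative rows with the same 'lineage' and 'alias' columns as above.
demo = pd.DataFrame({
    'lineage': ['B', 'B.1', 'B.1.617.2', 'AY.7'],
    'alias':   ['', '', '', 'B.1.617.2.7'],
})

# Use the unaliased name when one exists, otherwise the lineage itself.
full_name = demo['alias'].where(demo['alias'] != '', demo['lineage'])

# The parent is everything before the last dot ('' for a root lineage).
demo['parent'] = full_name.str.rpartition('.')[0]

# Keep only rows that have a parent; each row is one tree edge.
edges = demo.loc[demo['parent'] != '', ['parent', 'lineage']]
print(edges)
```

With this choice `AY.7` attaches to `B.1.617.2` instead of to a bare `AY` node, which keeps aliased sublineages inside the main tree.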