# Downloads PANGO Lineages
**[Work in progress]**

This notebook downloads the current PANGO lineages and build a tree structure of the lineages.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation)

Reference:
Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology(2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import numpy as np
import pandas as pd
import io
import dateutil
import re
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columsns

In [18]:
NEO4J_IMPORT = "/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import"#Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/lyt/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-a1516f46-b63a-46dd-b67a-1fb59d6c5d05/import


## Get PANGO lineages

In [21]:
pango_url = 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

In [22]:
pango = pd.read_csv(pango_url, sep='\t', skiprows=1, dtype=str, names=['lineage', 'description'])

Remove spaces in lineage column

In [2]:
pango['lineage'] = pango['lineage'].str.strip()

NameError: name 'pango' is not defined

Remove withdrawn lineages (start with a "*")

In [3]:
#pango = pango[~pango['lineage'].str.startswith('*')]

Extract alias from description

In [37]:
pattern = re.compile('Alias of ([\S]*?),', re.IGNORECASE)

In [38]:
def get_alias(row):
    match = pattern.findall(str(row.description))
    if len(match) > 0:
        return match[0]
    else:
        return ''

In [39]:
pango['alias'] = pango.apply(get_alias, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [40]:
pango['predecessor'] = pango['alias'].str.rsplit('.', 1, expand=True)[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [41]:
pango.tail(5)

Unnamed: 0,lineage,description,alias,predecessor
1663,XN,"Recombinant lineage of BA.1 and BA.2, UK linea...",,
1664,XP,"Recombinant lineage of BA.1.1 and BA.2, UK lin...",,
1665,XQ,"Recombinant lineage of BA.1.1 and BA.2, UK lin...",,
1666,XR,"Recombinant lineage of BA.1.1 and BA.2, UK lin...",,
1667,XS,"Recombinant lineage of Delta and BA.1.1, USA l...",,


### Split into sublineages

In [14]:
def split_lineage(row):
    lineage = row['lineage']
    lineages =  np.empty(4, dtype=object)

    for i in range(lineages.size):
        lineages[i] = lineage
        lineage = lineage.rpartition('.')[0]

    return lineages

In [42]:
pango[['l0', 'l1', 'l2', 'l3']] = pango.apply(split_lineage, axis=1, result_type='expand')
pango['levels'] = pango['lineage'].str.count('\.') + 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [43]:
pango.sample(5)

Unnamed: 0,lineage,description,alias,predecessor,l0,l1,l2,l3,levels
271,B.1.1.177,Texas,,,B.1.1.177,B.1.1,B.1,B,4
204,B.1.1.91,Sweden,,,B.1.1.91,B.1.1,B.1,B,4
602,BA.1.1.7,"Alias of B.1.1.529.1.1.7, lineage in India and...",B.1.1.529.1.1.7,B.1.1.529.1.1,BA.1.1.7,BA.1.1,BA.1,BA,4
306,AM.4,"Alias of B.1.1.216.4, Canadian lineage",B.1.1.216.4,B.1.1.216,AM.4,AM,,,2
1355,AY.5.7,"Alias of B.1.617.2.5.7, mainly found in Israel...",B.1.617.2.5.7,B.1.617.2.5,AY.5.7,AY.5,AY,,3


In [44]:
pango.to_csv(NEO4J_IMPORT +  "/00b-PANGOLineage.csv", index=False)