# Download SARS-CoV-2 Variant Metadata from CNCB
**[Work in progress]**

This notebook downloads and standardizes SARS-CoV-2 variant data from CNCB.

Data source: [China National Center for Bioinformation, 2019 Novel Coronavirus Resource (2019nCoVR)](https://bigd.big.ac.cn/ncov/release_genome)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import dateutil.parser  # import the submodule explicitly; parse() is used below
import pandas as pd

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

### Download variant metadata

In [3]:
metadata_url = "https://bigd.big.ac.cn/ncov/genome/export/meta"

In [4]:
df = pd.read_csv(metadata_url, sep='\t', dtype='str')

In [5]:
df.fillna('', inplace=True)
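To illustrate what reading with `dtype='str'` and then filling missing values does, here is a minimal sketch on a tiny in-memory TSV. The two sample rows and the names `sample_tsv` and `sample` are made up for illustration and are not taken from the CNCB export.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the CNCB export (not real data layout)
sample_tsv = (
    "Accession ID\tLineage\tSequence Length\n"
    "EPI_ISL_403963\tB\t29859\n"
    "NMDC60013088-01\t\t29848\n"
)

# Read every column as text so identifiers and lengths are not converted to numbers
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype="str")

# Empty cells are read as missing values; replace them with empty strings
sample = sample.fillna("")
```

After this step the empty `Lineage` cell is an empty string and `Sequence Length` is still text, which keeps later string operations from failing on missing values.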

In [6]:
print("Total number of strains:", df.shape[0])

Total number of strains: 2764634


In [7]:
df.head(5)

Unnamed: 0,Virus Strain Name,Accession ID,Data Source,Related ID,Lineage,Nuc.Completeness,Sequence Length,Sequence Quality,Quality Assessment,Host,Sample Collection Date,Location,Originating Lab,Submission Date,Submitting Lab,Create Time,Last Update Time
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,B.1.36.10,Complete,29848,High,0/0/-/-/-,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 12:04:48,2021-03-16 10:46:59
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,B,Complete,29859,High,0/0/-/-/-,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 12:04:48,2021-03-16 10:46:59
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,B,Complete,29848,High,0/0/-/-/-,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 12:04:48,2021-03-16 10:46:59
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,B,Complete,29896,High,0/0/-/-/-,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 12:04:48,2021-03-16 10:46:59
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,B,Complete,29891,High,0/0/-/-/-,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 12:04:48,2021-03-16 10:46:59


#### Rename and clean fields

In [8]:
df['Related ID'] = df['Related ID'].str.replace('-', '')
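Note that `str.replace` substitutes the pattern anywhere inside a value, so a hyphen within an identifier would be removed as well. If the intent is only to blank out placeholder entries, matching whole values is one possible alternative. The sketch below assumes that `-` and `None` are the placeholders; the sample values and variable names are invented for illustration.

```python
import pandas as pd

# Invented sample values: two identifiers and two placeholders
ids = pd.Series(["EPI_ISL_402132", "-", "None", "NMDC60013088-01"])

# Substring replacement removes every hyphen, including those inside identifiers
substring = ids.str.replace("-", "", regex=False)

# Whole-value replacement only blanks entries that equal a placeholder exactly
whole = ids.replace({"-": "", "None": ""})
```

Which behavior is appropriate depends on whether hyphens can legitimately occur in the column being cleaned.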

In [9]:
# rename all columns in a single call
df.rename(columns={
    'Data Source': 'source',
    'Sequence Length': 'sequenceLength',
    'Sequence Quality': 'sequenceQuality',
    'Quality Assessment': 'qualityAssessment',
    'Originating Lab': 'originatingLab',
    'Virus Strain Name': 'name',
    'Sample Collection Date': 'collectionDate',
    'Location': 'location',
    'Lineage': 'lineage',
}, inplace=True)

In [10]:
# strip white space
df['location'] = df['location'].str.strip()
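`str.strip` only removes leading and trailing whitespace, so inconsistent spacing around the `/` separator (for example `Thailand/ Nonthaburi Province` next to `China / Hubei`) is left unchanged. If uniform separators are wanted, one possible approach is to split on `/`, strip each part, and rejoin. This is a sketch on sample values, not part of the original workflow; `locations` and `normalized` are illustrative names.

```python
import pandas as pd

# Sample location strings with inconsistent spacing around "/"
locations = pd.Series([
    " China / Hubei / Wuhan",
    "Thailand/ Nonthaburi Province",
    "China / Hubei ",
])

# Split on the separator, strip each part, and rejoin with a uniform " / "
normalized = locations.str.split("/").apply(
    lambda parts: " / ".join(part.strip() for part in parts)
)
```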

In [11]:
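# remove placeholder values ('-' and 'None') using literal, non-regex matching
df['lineage'] = df['lineage'].str.replace('-', '', regex=False)
df['lineage'] = df['lineage'].str.replace('None', '', regex=False)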

In [12]:
print(df['lineage'].unique())

['B.1.36.10' 'B' '' ... 'P.6' 'B.1.1.449' 'C.26']


In [13]:
df['collectionDate'] = df['collectionDate'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')

In [14]:
df.fillna('', inplace=True)
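The `apply` step above leaves `collectionDate` as a mix of datetime objects and empty strings. As a possible alternative, `pd.to_datetime` with `errors='coerce'` yields a uniform datetime column in which empty or unparseable values become `NaT`. The sketch below uses invented values. Depending on the pandas version, partial dates that `dateutil` would accept may also be coerced to `NaT`, so this is not a drop-in replacement.

```python
import pandas as pd

# Invented sample: two valid dates, one empty value, one unparseable value
dates = pd.Series(["2019-12-30", "2020-01-13", "", "not a date"])

# Values that cannot be parsed are converted to NaT instead of raising an error
parsed = pd.to_datetime(dates, errors="coerce")
```

A uniform datetime column is usually easier to filter and sort than a column that mixes dates and strings.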

In [15]:
df.head()

Unnamed: 0,name,Accession ID,source,Related ID,lineage,Nuc.Completeness,sequenceLength,sequenceQuality,qualityAssessment,Host,collectionDate,location,originatingLab,Submission Date,Submitting Lab,Create Time,Last Update Time
0,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01,NMDC,EPI_ISL_402132,B.1.36.10,Complete,29848,High,0/0/-/-/-,Homo sapiens,2019-12-30,China / Hubei,Hubei Provincial Center for Disease Control an...,2020-01-19,Hubei Provincial Center for Disease Control an...,2020-01-20 12:04:48,2021-03-16 10:46:59
1,hCoV-19/Thailand/74/2020,EPI_ISL_403963,GISAID,,B,Complete,29859,High,0/0/-/-/-,Homo sapiens,2020-01-13,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 12:04:48,2021-03-16 10:46:59
2,hCoV-19/Thailand/61/2020,EPI_ISL_403962,GISAID,,B,Complete,29848,High,0/0/-/-/-,Homo sapiens,2020-01-08,Thailand/ Nonthaburi Province,"Department of Medical Sciences, Ministry of Pu...",2020-01-17,"Department of Medical Sciences, Ministry of Pu...",2020-01-20 12:04:48,2021-03-16 10:46:59
3,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01,NMDC,EPI_ISL_402120,B,Complete,29896,High,0/0/-/-/-,Homo sapiens,2020-01-01,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-11,National Institute for Viral Disease Control a...,2020-01-20 12:04:48,2021-03-16 10:46:59
4,BetaCoV/Wuhan/IVDC-HB-01/2019,NMDC60013084-01,NMDC,EPI_ISL_402119,B,Complete,29891,High,0/0/-/-/-,Homo sapiens,2019-12-30,China / Hubei / Wuhan,National Institute for Viral Disease Control a...,2020-01-10,National Institute for Viral Disease Control a...,2020-01-20 12:04:48,2021-03-16 10:46:59


In [16]:
df.to_parquet('../data/variants.parquet', compression='brotli', index=False)