
# Import Necessary Packages

We need several bioinformatics packages, including pyBigWig, pybedtools, and deeptools.

In [2]:
!pip install pyBigWig pybedtools gunzip bedparse deeptools pyGenomeTracks
!apt install bedtools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyBigWig
  Downloading pyBigWig-0.3.18.tar.gz (64 kB)
Collecting pybedtools
  Downloading pybedtools-0.9.0.tar.gz (12.5 MB)
Collecting gunzip
  Downloading gunzip-0.1.10-py2.py3-none-any.whl (3.0 kB)
Collecting bedparse
  Downloading bedparse-0.2.3-py3-none-any.whl (21 kB)
Collecting deeptools
  Downloading deepTools-3.5.1-py3-none-any.whl (233 kB)
Collecting pyGenomeTracks
  Downloading pyGenomeTracks-3.7-py2.py3-none-any.whl (112 kB)
Collecting pysam
  Downloading pysam-0.20.0-cp37-cp37m-manylinux_2_24_x86_64.whl (15.4 MB)
Collecting microapp>=0.2.3
  Do

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  bedtools
0 upgraded, 1 newly installed, 0 to remove and 4 not upgraded.
Need to get 577 kB of archives.
After this operation, 2,040 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 bedtools amd64 2.26.0+dfsg-5 [577 kB]
Fetched 577 kB in 0s (1,401 kB/s)
Selecting previously unselected package bedtools.
(Reading database ... 123942 files and directories currently installed.)
Preparing to unpack .../bedtools_2.26.0+dfsg-5_amd64.deb ...
Unpacking bedtools (2.26.0+dfsg-5) ...
Setting up bedtools (2.26.0+dfsg-5) ...


In [3]:
import pandas as pd
import io
import itertools
import numpy as np
from tqdm import tqdm
import csv
import os
import urllib
import pickle
import json
import pyBigWig
import pybedtools
import sys
import re

# Data Processing

## TSV of Links to Data

The TSV contains the links to all of the datasets on the site, as well as their corresponding cell types, file types, and more.

We filter the table to drop samples from individuals with a disease, and keep only bisulfite sequencing datasets in the bigWig file format.

In [None]:
# Get blueprint dataset
! wget 'http://dcc.blueprint-epigenome.eu/data/blueprint_files.tsv' -N
data_tsv = pd.read_csv('blueprint_files.tsv', sep='\t')

# Only keep bisulfite sequencing data from non-diseased individuals, formatted as a bigWig file.
noDisease_bw_data = data_tsv[(data_tsv['Disease'] == 'None') & 
                             (data_tsv['Format'] == 'bigWig') & 
                             (data_tsv['Experiment'] == 'Bisulfite-Seq')]

# Gene Locations

Download the GENCODE human gene annotation (release 40) and convert it into a BED file.

In [None]:
# Get dataset of human genes
! wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz
! gunzip gencode.v40.annotation.gtf.gz

# Convert GTF to BED (the GTF is read from stdin and the BED is written to output.bed)
! bedparse gtf2bed --extraFields gene_id,gene_name < gencode.v40.annotation.gtf > output.bed
output_bed = pybedtools.BedTool("output.bed")

os.remove("gencode.v40.annotation.gtf")
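
To make the coordinate change behind a GTF-to-BED conversion concrete, here is a minimal sketch that turns one GTF line into a BED6-style tuple. It is not how `bedparse` works internally (that tool emits transcript-level BED12 records); the function name `gtf_line_to_bed6` and the sample line are illustrative assumptions. The one fact it relies on is that GTF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so only the start shifts by one.

```python
import re

def gtf_line_to_bed6(line):
    """Convert one GTF line to (chrom, start, end, name, score, strand).

    Hypothetical helper for illustration; uses gene_id as the BED name.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, _source, _feature, start, end, _score, strand, _frame, attributes = fields
    # Attributes look like: key "value"; key "value";
    attrs = dict(re.findall(r'(\S+) "([^"]*)"', attributes))
    # GTF start is 1-based inclusive; BED start is 0-based, so subtract 1.
    return (chrom, int(start) - 1, int(end), attrs.get("gene_id", "."), ".", strand)

# Illustrative GTF line (tab-separated, nine fields)
sample = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";',
])
bed_record = gtf_line_to_bed6(sample)
print(bed_record)
```

The end coordinate is unchanged because a 1-based inclusive end and a 0-based exclusive end are the same number.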

Convert the BED file into a pandas dataframe and remove any unnecessary columns. We keep only the chromosome, the start and end base pairs, the strand, and the gene identifier (the `name` column).

In [None]:
# Convert the BED file to a pandas dataframe and delete unnecessary columns
gene_loc = pd.read_csv("coding_exon.bed", sep = '\t', names = ["chrom", "start", "end", "name", "e1", "strand"], index_col = False)
gene_loc = gene_loc[["chrom", "start", "end", "strand", "name"]]
gene_loc = gene_loc.drop(gene_loc[gene_loc["chrom"] == "chrM"].index) # Don't need chromosome M data
gene_loc.reset_index(drop=True, inplace=True)

gene_locs = {}

# List of all unique gene names in the dataset
gene_names = list(set(gene_loc["name"]))
gene_names = list(set([x[:15] for x in gene_names]))

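# Map each gene name to its (chromosome, starting base pair, ending base pair)
for name in tqdm(gene_names):
    gene = gene_loc[gene_loc["name"].str.contains(name)]
    try:
        # Pull the first standard chromosome label (chr1-chr22, chrX, chrY) from the matching rows
        chrom = re.search('chr([0-9]{1,2}|X|Y)', str(gene["chrom"])).group(0)
        # Coordinates come from the first matching row only
        start_loc = gene.iloc[0]["start"]
        end_loc = gene.iloc[0]["end"]

        gene_locs[name] = (chrom, start_loc, end_loc)
    except (AttributeError, IndexError):
        # No standard chromosome label or no matching row: skip this gene
        continue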

In [None]:
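# Keep only the genes that have a known location.
# (Removing items from a list while iterating over it would skip elements.)
gene_names = [name for name in gene_names if name in gene_locs]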
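
The mapping loop scans the whole table once per gene and records the coordinates of the first matching row. As a hedged alternative, the sketch below builds the same kind of dictionary with a single pandas `groupby`, taking the smallest start and largest end across all rows that share a 15-character identifier. Using the full span instead of the first row is an assumption about what is wanted, and `gene_spans` and the toy rows are hypothetical, for illustration only.

```python
import pandas as pd

def gene_spans(gene_loc):
    """Map each truncated gene ID to (chrom, min start, max end)."""
    # Group rows by the first 15 characters of the name column
    table = gene_loc.assign(gene=gene_loc["name"].str[:15])
    grouped = table.groupby("gene").agg(
        chrom=("chrom", "first"),
        start=("start", "min"),
        end=("end", "max"),
    )
    return {
        row.Index: (row.chrom, int(row.start), int(row.end))
        for row in grouped.itertuples()
    }

# Toy table with the same column names as gene_loc
toy = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 300, 50],
    "end": [200, 400, 80],
    "name": ["ENSG00000000001.1_a", "ENSG00000000001.1_b", "ENSG00000000002.3"],
})
spans = gene_spans(toy)
print(spans)
```

This version makes one pass over the table, so its cost does not grow with the square of the number of genes. It does not filter out non-standard chromosome names, which the regular expression in the loop does.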

## Making the Dataset

For each sample in the Blueprint methylation database, we want the average methylation over each gene. The final dataset is a table with one row per sample (labeled by cell type) and one column per gene.

To build it, we iterate over each sample's bigWig file, which holds methylation values at millions of positions. The sample's cell type becomes the row label. We then iterate over all gene locations and average the methylation records between the start and end of each gene; genes with no records get -1.

Thus, each row holds one average methylation value for every gene in `gene_names`.

In [None]:
CHROMOSOMES = ["chr" + str(i) for i in range(1, 23)] + ["chrX"]
track = 0

columns = ["Cell Type"]
for name in gene_names:
    columns.append(name)

# Make dataset with gene names as the columns.
dataset = pd.DataFrame(columns = columns)

while track < len(noDisease_bw_data):
    print(f"{100 * track / len(noDisease_bw_data):.1f}% Progress ({track}/{len(noDisease_bw_data)} Complete)")
    cell_type = noDisease_bw_data.iloc[track]["Cell type"]
    
    # Retrieve the file containing the methylation data for the cell_type
    # Don't need coverage file for now - may integrate into more advanced algorithms
    call_url = noDisease_bw_data.iloc[track]["URL"]
    ! wget "$call_url" -N -q

    call_file = call_url.split("/")[-1]

    data = [cell_type]
    with pyBigWig.open(call_file) as call_object:
        for name in gene_names:
            try:
                chrom, start, end = gene_locs[name]
                # Get methylation data for particular gene
                gene_cpgs = call_object.intervals(chrom, start, end)
                # If some data exists, average over it and add it into the data for cell_type
                if gene_cpgs is not None:
                    gene_cpgs = [tup[2] for tup in gene_cpgs]
                    data.append(sum(gene_cpgs)/len(gene_cpgs))
                # If no data exists for the gene, put a -1 into the data for cell_type
                else:
                    data.append(-1)
            except KeyError:
                # No known location for this gene: record -1 so the row stays aligned with the columns
                data.append(-1)

    dataset.loc[len(dataset.index)] = data

    os.remove(call_file)

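    # Advance by two rows; this assumes rows alternate between a methylation-call file and its coverage file (skipped here)
    track += 2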
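
The per-gene averaging step can be isolated as a small helper, sketched here in plain Python. It assumes the input has the shape that pyBigWig's `intervals` returns, a sequence of `(start, end, value)` tuples or `None` when the region has no entries; the name `mean_methylation` and the toy intervals are hypothetical.

```python
def mean_methylation(intervals, missing=-1):
    """Average the value field of (start, end, value) tuples.

    Returns `missing` when `intervals` is None or empty.
    """
    if not intervals:
        return missing
    values = [value for _start, _end, value in intervals]
    return sum(values) / len(values)

# Toy intervals in the (start, end, value) shape
toy_intervals = [(10, 11, 0.5), (20, 21, 1.0), (30, 31, 0.0)]
print(mean_methylation(toy_intervals))  # 0.5
print(mean_methylation(None))           # -1
```

This is an unweighted mean: every interval counts once regardless of its length, which matches the averaging in the loop.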

## Dataset Modifications

A model needs at least two samples per cell type: one for the training set and one for validation or testing. If a cell type has only one sample in the dataset, we remove it.

In [None]:
for cell_type in dataset["Cell Type"].unique():
    if len(dataset[dataset["Cell Type"] == cell_type]) == 1:
        dataset.drop(dataset[dataset['Cell Type'] == cell_type].index, inplace = True)
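
The same singleton filter can be written without an explicit loop. The sketch below is one possible vectorized form, assuming the label column is named `"Cell Type"` as in the dataset; `drop_singleton_labels` and the toy frame are hypothetical.

```python
import pandas as pd

def drop_singleton_labels(df, label_col="Cell Type"):
    """Return df without rows whose label occurs only once."""
    # Size of each row's label group, aligned to the original rows
    counts = df.groupby(label_col)[label_col].transform("size")
    return df[counts > 1].reset_index(drop=True)

toy = pd.DataFrame({
    "Cell Type": ["macrophage", "osteoclast", "macrophage", "naive B cell"],
    "geneA": [0.1, 0.9, 0.2, 0.5],
})
kept = drop_singleton_labels(toy)
print(kept)
```

Unlike the in-place loop, this returns a new frame with a fresh index, so the original stays available for comparison.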

There is too little data for the number of unique cell types in our dataset, so for this experiment we group similar cell types under a shared label and delete some that we deemed too niche.

Additionally, we drop every gene that is missing data (-1) in any sample, then keep only the 500 columns (i.e., genes) with the highest variance in average methylation across all samples.

In [None]:
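# Drop every gene that has missing data (-1) in any sample
dataset = dataset.drop(columns=dataset.columns[(dataset == -1).any()])
gene_columns = dataset.columns.drop("Cell Type")
methylation = dataset[gene_columns].to_numpy(dtype=float)
# Get the variance of each gene across all samples
var = np.var(methylation, axis = 0)

# Only keep the genes with the 500 highest variances of methylation.
# Select by column name: positions in `var` are offset by one from positions
# in `dataset`, because `dataset` also holds the "Cell Type" column.
top_500 = np.argsort(var)[-500:]
top_500_values = var[top_500]

filtered_dataset = dataset[list(gene_columns[top_500]) + ["Cell Type"]].copy()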

# These cells will be deleted.
delete = ["hematopoietic multipotent progenitor cell", "CD14-positive, CD16-negative classical monocyte",
          "erythroblast", "CD34-negative, CD41-positive, CD42-positive megakaryocyte cell", "endothelial cell of umbilical vein (proliferating)",
          "endothelial cell of umbilical vein (resting)", "CD3-negative, CD4-positive, CD8-positive, double positive thymocyte",
          "CD3-positive, CD4-positive, CD8-positive, double positive thymocyte", "osteoclast", "regulatory T cell", 
          "mature eosinophil", "adult endothelial progenitor cell", "mesenchymal stem cell of the bone marrow",
          "cytotoxic CD56-dim natural killer cell"]

# Remove every row whose cell type is in the delete list
filtered_dataset = filtered_dataset[~filtered_dataset["Cell Type"].isin(delete)].copy()
        
# These cells will be labeled as a neutrophil.
neutrophil = ["band form neutrophil", "neutrophilic metamyelocyte", "neutrophilic myelocyte",
              "segmented neutrophil of bone marrow", "mature neutrophil"]

# These cells will be labeled as a b_cell.
b_cell = ["CD38-negative naive B cell", "germinal center B cell", "class switched memory B cell", "memory B cell",
          "naive B cell"]

# These cells will be labeled as a CD8 cell.
cd8 = ["CD8-positive, alpha-beta T cell", "CD8-positive, alpha-beta thymocyte", "effector memory CD8-positive, alpha-beta T cell",
       "central memory CD8-positive, alpha-beta T cell", "effector memory CD8-positive, alpha-beta T cell, terminally differentiated"]

# These cells will be labeled as a CD4 cell.
cd4 = ["CD4-positive, alpha-beta thymocyte", "CD4-positive, alpha-beta T cell", "effector memory CD4-positive, alpha-beta T cell",
       "central memory CD4-positive, alpha-beta T cell"]

# These cells will be labeled as a dendritic cell.
dendritic = ["immature conventional dendritic cell", "mature conventional dendritic cell", "conventional dendritic cell"]

# These cells will be labeled as a macrophage.
macrophage = ["inflammatory macrophage", "alternatively activated macrophage", "macrophage"]

# Map each fine-grained cell type to its merged label
merged_labels = {}
for label, members in [("neutrophil", neutrophil), ("B cell", b_cell), ("CD8 Cell", cd8),
                       ("CD4 Cell", cd4), ("Dendritic Cell", dendritic), ("Macrophage", macrophage)]:
    merged_labels.update({member: label for member in members})

# Cell types that are not in the mapping keep their original label
filtered_dataset = filtered_dataset.copy()
filtered_dataset["Cell Type"] = filtered_dataset["Cell Type"].replace(merged_labels)
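
Because the variance filter ranks genes in an array that excludes the label column, it is easy to mix up array positions and dataframe positions. The sketch below shows one way to sidestep that by selecting columns by name. It assumes a label column called `"Cell Type"` and uses population variance (`ddof=0`) to match `np.var`; `top_variance_columns` and the toy frame are hypothetical.

```python
import pandas as pd

def top_variance_columns(df, k, label_col="Cell Type"):
    """Keep the k feature columns with the highest variance, plus the label."""
    features = df.drop(columns=[label_col]).astype(float)
    variances = features.var(axis=0, ddof=0)  # population variance, like np.var
    keep = variances.nlargest(k).index.tolist()
    return df[keep + [label_col]]

toy = pd.DataFrame({
    "Cell Type": ["a", "b", "c"],
    "g1": [0.0, 0.0, 0.0],
    "g2": [0.0, 1.0, 0.0],
    "g3": [0.0, 10.0, 0.0],
})
selected = top_variance_columns(toy, k=2)
print(list(selected.columns))  # ['g3', 'g2', 'Cell Type']
```

Selecting by name keeps the result correct wherever the label column sits in the frame.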

In [None]:
dataset.to_csv("full_dataset.csv", index = False)

In [None]:
filtered_dataset.to_csv("filtered_dataset.csv", index = False)