<a href="https://colab.research.google.com/github/semenko/liquid-cell-atlas/blob/main/data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Necessary Libraries

In [4]:
!pip install pyBigWig pybedtools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pybedtools
  Downloading pybedtools-0.9.0.tar.gz (12.5 MB)
Collecting pysam
  Downloading pysam-0.19.1-cp37-cp37m-manylinux_2_24_x86_64.whl (15.1 MB)
Building wheels for collected packages: pybedtools
  Building wheel for pybedtools (setup.py) ... done
  Created wheel for pybedtools: filename=pybedtools-0.9.0-cp37-cp37m-linux_x86_64.whl size=13616831 sha256=3b43d1ba3281191d29043b298c9a153a2770015e551120ad9f8b6cdb0043a6f1
  Stored in directory: /root/.cache/pip/wheels/7a/44/0d/3a7449885adaf8ebb157da8c3c834a712f48b3b3b84ba51dda
Successfully built pybedtools
Installing collected packages: pysam, pybedtools
Successfully installed pybedtools-0.9.0 pysam-0.19.1


In [18]:
import pandas as pd
from google.colab import files
import io
import json
import itertools

import pyBigWig
import pybedtools

# Get TSV of all data from Blueprint and Filter it

In [8]:
# Download the TSV file from http://dcc.blueprint-epigenome.eu/#/files, and upload it here
file = files.upload()
data_tsv = pd.read_csv(io.BytesIO(file['blueprint_files.tsv']), sep='\t')

Saving blueprint_files.tsv to blueprint_files.tsv


Filter the TSV to drop individuals with a disease and keep only files in the bigWig format.

In [9]:
noDisease_bw_data = data_tsv[(data_tsv['Disease'] == 'None') & (data_tsv['Format'] == 'bigWig')]
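The filter above compares `Disease` to the literal string `'None'`. Depending on the pandas version, `read_csv` may instead parse `None` as a missing value, in which case that comparison matches nothing. Below is a minimal sketch of a filter that accepts both cases. The helper name `select_healthy_bigwig` and the toy rows are my own, not part of the Blueprint data.

```python
import pandas as pd


def select_healthy_bigwig(df):
    """Keep rows with no recorded disease that point at bigWig files."""
    # Accept both the literal string 'None' and a missing value (NaN).
    no_disease = df["Disease"].isna() | (df["Disease"] == "None")
    is_bigwig = df["Format"] == "bigWig"
    return df[no_disease & is_bigwig]


# Toy table for illustration only (hypothetical rows).
toy = pd.DataFrame(
    {
        "Disease": ["None", None, "Leukemia", "None"],
        "Format": ["bigWig", "bigWig", "bigWig", "bed"],
    }
)
healthy_bw = select_healthy_bigwig(toy)
print(healthy_bw)
```

On the real table this would be called as `select_healthy_bigwig(data_tsv)`.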

Cell types present in the dataset. For now, I'll choose a memory B cell sample and a plasma cell sample.

In [10]:
noDisease_bw_data['Cell type'].unique()

array(['band form neutrophil', 'neutrophilic metamyelocyte',
       'neutrophilic myelocyte', 'segmented neutrophil of bone marrow',
       'hematopoietic multipotent progenitor cell', 'precursor B cell',
       'precursor lymphocyte of B lineage', 'plasma cell',
       'megakaryocyte-erythroid progenitor cell', 'mature neutrophil',
       'CD14-positive, CD16-negative classical monocyte',
       'CD4-positive, alpha-beta T cell', 'common lymphoid progenitor',
       'granulocyte monocyte progenitor cell', 'hematopoietic stem cell',
       'CD8-positive, alpha-beta T cell', 'CD38-negative naive B cell',
       'cytotoxic CD56-dim natural killer cell', 'erythroblast',
       'CD34-negative, CD41-positive, CD42-positive megakaryocyte cell',
       'common myeloid progenitor', 'inflammatory macrophage',
       'macrophage', 'endothelial cell of umbilical vein (proliferating)',
       'endothelial cell of umbilical vein (resting)',
       'alternatively activated macrophage',
       'conve

Take the first memory B cell row and the first plasma cell row, and extract their URLs.

In [11]:
memBcell_data = noDisease_bw_data[noDisease_bw_data['Cell type'] == 'memory B cell'].iloc[0]
plasmacell_data = noDisease_bw_data[noDisease_bw_data['Cell type'] == 'plasma cell'].iloc[0]

membcell_url = memBcell_data['URL']
plasmacell_url = plasmacell_data['URL']
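Note that `.iloc[0]` takes whichever row comes first, whatever the assay. In the downloads in this notebook, the memory B cell file is an RNA-Seq track while the plasma cell file holds Bisulfite-Seq methylation calls. A possible way to pick matching assays is to also filter on the URL. This is only a sketch: the helper name, the default pattern (copied from the plasma cell filename), and the `example.org` URLs are assumptions for illustration.

```python
import pandas as pd


def first_methylation_call_row(df, cell_type, url_pattern="CPG_methylation_calls.bs_call"):
    """Return the first row for `cell_type` whose URL contains `url_pattern`."""
    mask = (df["Cell type"] == cell_type) & df["URL"].str.contains(
        url_pattern, regex=False, na=False
    )
    matches = df[mask]
    if matches.empty:
        raise ValueError(f"No file matching {url_pattern!r} for {cell_type!r}")
    return matches.iloc[0]


# Hypothetical rows that mimic the structure of the Blueprint table.
toy = pd.DataFrame(
    {
        "Cell type": ["memory B cell", "memory B cell", "plasma cell"],
        "URL": [
            "http://example.org/C001.minusStrand.star_grape2_crg.bw",
            "http://example.org/C002.CPG_methylation_calls.bs_call.bw",
            "http://example.org/G202.CPG_methylation_calls.bs_call.bw",
        ],
    }
)
row = first_methylation_call_row(toy, "memory B cell")
print(row["URL"])
```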

Downloading the data at the two URLs!

In [12]:
!wget '$membcell_url'

--2022-06-27 17:00:26--  http://ftp.ebi.ac.uk/pub/databases/blueprint/data/homo_sapiens/GRCh38/venous_blood/C001NB/memory_B_cell/RNA-Seq/MPIMG/C001NBB3.minusStrand.star_grape2_crg.GRCh38.20150815.bw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 204776598 (195M) [application/octet-stream]
Saving to: ‘C001NBB3.minusStrand.star_grape2_crg.GRCh38.20150815.bw’


2022-06-27 17:02:51 (1.35 MB/s) - ‘C001NBB3.minusStrand.star_grape2_crg.GRCh38.20150815.bw’ saved [204776598/204776598]



In [13]:
!wget '$plasmacell_url'

--2022-06-27 17:02:51--  http://ftp.ebi.ac.uk/pub/databases/blueprint/data/homo_sapiens/GRCh38/bone_marrow/MO7071/plasma_cell/Bisulfite-Seq/CNAG/G202.CPG_methylation_calls.bs_call.GRCh38.20160531.bw
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154166573 (147M) [application/octet-stream]
Saving to: ‘G202.CPG_methylation_calls.bs_call.GRCh38.20160531.bw’


2022-06-27 17:04:40 (1.35 MB/s) - ‘G202.CPG_methylation_calls.bs_call.GRCh38.20160531.bw’ saved [154166573/154166573]
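By default `wget` saved each file under the last path component of its URL, as the log lines above show. A small sketch for recovering that local filename from a URL, so later cells do not need a hard-coded name. The helper name `local_filename` is my own.

```python
import posixpath
from urllib.parse import urlparse


def local_filename(url):
    """Return the last path component of a URL (the name wget saved it under here)."""
    return posixpath.basename(urlparse(url).path)


# URL taken from the plasma cell download log above.
plasma_url = (
    "http://ftp.ebi.ac.uk/pub/databases/blueprint/data/homo_sapiens/GRCh38/"
    "bone_marrow/MO7071/plasma_cell/Bisulfite-Seq/CNAG/"
    "G202.CPG_methylation_calls.bs_call.GRCh38.20160531.bw"
)
plasma_file = local_filename(plasma_url)
print(plasma_file)
```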



In [23]:
# Annotations & bs_cov / bs_call names
CELL_TYPE_TO_FILE_ID = {
    "Plasma_cell": ["G202"],
}

# Reverse mapping of file id -> cell type
# e.g.  'S01BHIA1': 'Monocyte'
FILE_ID_TO_CELL_TYPE = {sample:cell_type for cell_type, sample_list in CELL_TYPE_TO_FILE_ID.items() for sample in sample_list}

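# If the blueprint dict changes, we need to replace our cache files.
# This is a tiny checksum of the dictionary state, which we incorporate into
# our cache filenames below.
# Caveat: Python salts str hashes per process unless PYTHONHASHSEED is set,
# so this value is only stable within a single interpreter session.
CELL_TYPE_DICT_SIG = hex(abs(hash(json.dumps(CELL_TYPE_TO_FILE_ID, sort_keys=True))))[2:10]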
print(f"Dictionary signature for cache files: {CELL_TYPE_DICT_SIG}\n")


BLUEPRINT_FILEKEYS = list(itertools.chain.from_iterable(CELL_TYPE_TO_FILE_ID.values()))

# Validity testing
# assert all(len(vals) > 1 for vals in CELL_TYPE_TO_FILE_ID.values()), "We need more than one example per cell type."
assert len(BLUEPRINT_FILEKEYS) == len(set(BLUEPRINT_FILEKEYS)), "One filename is duplicated in the cell types"

print(f"Number of Blueprint cell types: {len(CELL_TYPE_TO_FILE_ID.keys())}")
print(f"Number of Blueprint raw files: {len(BLUEPRINT_FILEKEYS)}")

Dictionary signature for cache files: 1b493f11

Number of Blueprint cell types: 1
Number of Blueprint raw files: 1
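Because the built-in `hash()` of a string is randomized per interpreter process by default, a signature built from it will generally differ between Colab sessions, which defeats its use in cache filenames. One possible alternative, sketched here with `hashlib`, gives the same value on every run. The function name `dict_signature` and the choice of SHA-256 are assumptions of mine, not something the notebook prescribes.

```python
import hashlib
import json


def dict_signature(mapping, length=8):
    """Short, run-to-run stable hex signature of a JSON-serializable dict."""
    # sort_keys makes the serialization independent of insertion order.
    payload = json.dumps(mapping, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


sig = dict_signature({"Plasma_cell": ["G202"]})
print(f"Dictionary signature for cache files: {sig}")
```

Any cache files named with the old `hash()`-based signature would not match this one, so they would need to be regenerated once.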


In [None]:
CHROMOSOMES = ["chr" + str(i) for i in range(1, 23)] + ["chrX"]

In [None]:
FILE_ID_TO_CPG_CALLS = {}

for file_key in BLUEPRINT_FILEKEYS:
    FILE_ID_TO_CPG_CALLS[file_key] = {}
    # Assumes every file shares the suffix of the G202 download above.
    with pyBigWig.open(f"{file_key}.CPG_methylation_calls.bs_call.GRCh38.20160531.bw") as bw_object:
        for chrom in CHROMOSOMES:
            # This is more nuanced than the bs_cov data, since we only want to look at the
            # CpGs that were covered across all samples (the intervals now in BS_COV_POSITIONS).
            # TODO: shared_pos_for_this_chromosome is not defined in this notebook yet; it
            # should be a set of start positions for `chrom` so the membership test is fast.

            # Grabbing the entire chr interval is super slow
            FILE_ID_TO_CPG_CALLS[file_key][chrom] = [value for start, _, value in (bw_object.intervals(chrom) or ()) if start in shared_pos_for_this_chromosome]
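The filtering step in the loop above can be tried without a bigWig file. pyBigWig's `intervals()` yields `(start, end, value)` tuples, so the sketch below uses plain tuples as a stand-in. The helper name and the toy numbers are made up for illustration. Converting the shared positions to a `set` keeps each membership test at constant time, which matters with millions of CpGs per chromosome.

```python
def values_at_shared_positions(intervals, shared_positions):
    """Keep the value of each (start, end, value) interval whose start is shared."""
    shared = set(shared_positions)
    # `intervals or ()` tolerates a None result for an empty chromosome.
    return [value for start, _end, value in (intervals or ()) if start in shared]


# Toy stand-ins for bw_object.intervals(chrom) and the shared CpG positions.
toy_intervals = ((100, 101, 0.9), (250, 251, 0.1), (400, 401, 0.5))
toy_shared = [100, 400, 999]
kept = values_at_shared_positions(toy_intervals, toy_shared)
print(kept)
```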
                