<a href="https://colab.research.google.com/github/semenko/liquid-cell-atlas/blob/main/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Packages

In [1]:
! pip install pyBigWig pybedtools bedparse
!apt install bedtools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyBigWig
  Downloading pyBigWig-0.3.18.tar.gz (64 kB)
[K     |████████████████████████████████| 64 kB 1.0 MB/s 
[?25hCollecting pybedtools
  Downloading pybedtools-0.9.0.tar.gz (12.5 MB)
[K     |████████████████████████████████| 12.5 MB 4.9 MB/s 
[?25hCollecting bedparse
  Downloading bedparse-0.2.3-py3-none-any.whl (21 kB)
Collecting pysam
  Downloading pysam-0.19.1-cp37-cp37m-manylinux_2_24_x86_64.whl (15.1 MB)
[K     |████████████████████████████████| 15.1 MB 200 kB/s 
[?25hCollecting argparse
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: pyBigWig, pybedtools
  Building wheel for pyBigWig (setup.py) ... [?25l[?25hdone
  Created wheel for pyBigWig: filename=pyBigWig-0.3.18-cp37-cp37m-linux_x86_64.whl size=197020 sha256=df1bf71c92fddfd9018eb9f9e72bfa8e57905071565bafbef24a1e3f86634e20
  Stored in directory: /root/.cach

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  bedtools
0 upgraded, 1 newly installed, 0 to remove and 19 not upgraded.
Need to get 577 kB of archives.
After this operation, 2,040 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 bedtools amd64 2.26.0+dfsg-5 [577 kB]
Fetched 577 kB in 1s (715 kB/s)
Selecting previously unselected package bedtools.
(Reading database ... 155680 files and directories currently installed.)
Preparing to unpack .../bedtools_2.26.0+dfsg-5_amd64.deb ...
Unpacking bedtools (2.26.0+dfsg-5) ...
Setting up bedtools (2.26.0+dfsg-5) ...


In [2]:
import pandas as pd
from google.colab import files
import io
import itertools
import numpy as np
from tqdm.notebook import tqdm
import csv
import os
import urllib
import pickle
import json
import pyBigWig
import pybedtools

# Data Processing

The TSV contains the links to all of the datasets on the site, as well as their corresponding cell types, file types, and more.

We filter the file to get rid of individuals with diseases, and only keep the datasets with the bigWig file format. We also only keep bisulfite sequencing data.

In [3]:
# Download the TSV file from http://dcc.blueprint-epigenome.eu/#/files, and upload it here
! wget 'http://dcc.blueprint-epigenome.eu/data/blueprint_files.tsv'
data_tsv = pd.read_csv('blueprint_files.tsv', sep='\t')

noDisease_bw_data = data_tsv[(data_tsv['Disease'] == 'None') & 
                             (data_tsv['Format'] == 'bigWig') & 
                             (data_tsv['Experiment'] == 'Bisulfite-Seq')]

--2022-08-05 11:01:41--  http://dcc.blueprint-epigenome.eu/data/blueprint_files.tsv
Resolving dcc.blueprint-epigenome.eu (dcc.blueprint-epigenome.eu)... 193.62.192.83, 193.62.193.83
Connecting to dcc.blueprint-epigenome.eu (dcc.blueprint-epigenome.eu)|193.62.192.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4341342 (4.1M) [text/tab-separated-values]
Saving to: ‘blueprint_files.tsv’


2022-08-05 11:01:45 (1.15 MB/s) - ‘blueprint_files.tsv’ saved [4341342/4341342]



## Gene Locations

Download the genomic database and convert it into a bed file.

In [4]:
! wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz
! gunzip gencode.v40.annotation.gtf.gz
! bedparse gtf2bed <gencode.v40.annotation.gtf> output.bed --extraFields gene_id,gene_name
output_bed = pybedtools.BedTool("output.bed")

--2022-08-05 11:01:46--  https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48043727 (46M) [application/octet-stream]
Saving to: ‘gencode.v40.annotation.gtf.gz’


2022-08-05 11:02:00 (3.39 MB/s) - ‘gencode.v40.annotation.gtf.gz’ saved [48043727/48043727]



Convert the bed file into a pandas dataframe, and remove any unneccessary colummns! The only ones we keep are those relating to the chromosome number, the start and end base pair, the strandedness, the gene_id, and gene_name.

In [5]:
gene_loc = pd.read_csv("output.bed", sep = '\t', names = ["chrom", "start", "end", "name", "e1", "strand", "e2", "e3", "e4", "e5", "e6", "e7", "gene_id", "gene_name"])
gene_loc = gene_loc[["chrom", "start", "end", "strand", "gene_id", "gene_name", "name"]]

In [6]:
gene_locs = {}

gene_names = list(set(gene_loc["gene_name"]))
for name in tqdm(gene_names):
    gene = gene_loc[gene_loc["gene_name"] == name]
    chr = gene["chrom"].values[0]
    start_loc = min(gene["start"].values) - 1000
    end_loc = max(gene["end"].values) + 1000

    gene_locs[name] = (chr, start_loc, end_loc)

  0%|          | 0/60308 [00:00<?, ?it/s]

In [None]:
CHROMOSOMES = ["chr" + str(i) for i in range(1, 23)] + ["chrX"]
track = 0

columns = ["Cell Type"]
for name in gene_names:
    columns.append(name + " (+/- 1kb)")

dataset = pd.DataFrame(columns = columns)

while track < len(noDisease_bw_data):
    print(str(track/len(noDisease_bw_data)) + "% Progress")
    cell_type = noDisease_bw_data.iloc[track]["Cell type"]

    call_url = noDisease_bw_data.iloc[track]["URL"]
    cov_url = noDisease_bw_data.iloc[track + 1]["URL"]
    ! wget "$call_url" -q
    ! wget "$cov_url" -q

    call_file = call_url.split("/")[-1]
    cov_file = cov_url.split("/")[-1]

    try:
        data = [cell_type]
        for name in gene_names:
            chr, start, end = gene_locs[name]
            gene_cpgs = []
            with pyBigWig.open(cov_file) as cov_object:
                with pyBigWig.open(call_file) as call_object:
                    for pos in range(start, end+1):
                        if call_object.intervals(chr, pos,pos+1) is not None:
                            if cov_object.intervals(chr, pos, pos+1)[0][2] > 10:
                                gene_cpgs.append(call_object.intervals(chr, pos, pos+1)[0][2])
                    data.append(sum(gene_cpgs)/len(gene_cpgs))
    
        dataset.loc[len(dataset.index)] = data
    except:
        print("Unable to open file")

    track += 2

0.0% Progress
Unable to open file
0.006578947368421052% Progress
Unable to open file
0.013157894736842105% Progress


In [1]:
dataset.to_csv("dataset.csv")
files.download("dataset.csv")

NameError: ignored

In [None]:
!git clone https://github.com/hussius/tabnet_fork.git

In [None]:
os.chdir('tabnet_fork')

In [None]:
!pip install -r requirements.txt

In [None]:
! python opt_tabnet.py \
       --csv-path PATH_TO_CSV \
       --target-name "cell_type" \
       --categorical-features methylation