# Example BIU Usage

BIU is a toolkit I made to gather the datasets and tools that I use regularly.
This way, I never need to manage data files on my computer by hand; I can simply use this package, which has wrapper functions for the common queries I perform on the datasets.

Currently, it allows me to:

 * Download a number of datasets (and sub-components of these datasets) on the fly, when they are needed
 * Dynamically load datasets when they are needed (they do not consume memory until a query is performed on them)
 * Query these datasets
 * Handle various data formats:
   * FASTA
   * GFF3
   * GAF (GO annotation file)
   * SQLite databases
   * VCF (using the pyvcf package)
 * Map IDs between various indexing methods
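
The "load only when needed" behaviour in the list above can be illustrated with a small stand-alone sketch. This is not BIU's actual implementation; `LazyResource` and its `loader` argument are hypothetical names, used only to show the idea of deferring the expensive work until the first query.

```python
class LazyResource:
    """Hypothetical wrapper: defer loading a dataset until it is first used."""

    def __init__(self, loader):
        self._loader = loader   # callable that performs the expensive load
        self._data = None
        self.loaded = False

    def get(self):
        # Load on first access only; later calls reuse the cached object.
        if not self.loaded:
            self._data = self._loader()
            self.loaded = True
        return self._data


calls = []

def load_fasta():
    # Stand-in for parsing a large FASTA file.
    calls.append("load")
    return {"seq1": "ACGT"}

resource = LazyResource(load_fasta)
before = len(calls)              # nothing has been loaded yet
first = resource.get()["seq1"]   # triggers the load
second = resource.get()["seq1"]  # served from the cached object
```

Creating the object is cheap; memory is only used once `get()` is called for the first time.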

In [1]:
import biu as biu
import matplotlib.pylab as plt
import seaborn as sns
import numpy as np
import pandas as pd

### Change some default settings
We can change the default location for all data storage, and turn debug messages off or on.

In [2]:
where = '/exports/molepi/tgehrmann/data/'
biu.config.settings.setWhere(where)
print("We set the default data directory to be: '%s'" % biu.config.settings.getWhere())

biu.config.settings.setDebugState(False)
print("Turned OFF debug messages")
biu.config.settings.setDebugState(True)
print("Turned ON debug messages")

We set the default data directory to be: '/exports/molepi/tgehrmann/data/'
Turned OFF debug messages
Turned ON debug messages


## List the available datasets

In [3]:
biu.db.list()

Available databases:
 * BBMRI
 * CADD
 * ClinVar
 * Cosmic
 * GO
 * GTeX
 * Genomes
 * Gnomad
 * HAGR
 * KEGG
 * LLS
 * MiRmine
 * Reactome
 * UniProt


## Open a genome object and load the GFF file

Load the genome object, then fetch and parse its GFF file.

In [4]:
biu.db.listGenomes()
hg = biu.db.Genome("Ensembl_GRCh37")
ce = biu.db.Genome("WBcel235")
print(ce)

Available versions:
 * GRCh37
 * Ensembl_GRCh37
 * Ensembl_GRCh38_91
 * RefSeq_GRCh37
 * RefSeq_GRCh38
 * WBcel235
Genome object
 Where: /exports/molepi/tgehrmann/data
 Genome : WBcel235
 Objects:
  * [ ] gff
  * [ ] cds
  * [ ] aa
  * [ ] genome[all]
 Files:
  * [X] gff : /exports/molepi/tgehrmann/data/genome_WBcel235/genome.gff3
  * [X] cds : /exports/molepi/tgehrmann/data/genome_WBcel235/cds.fa
  * [X] aa : /exports/molepi/tgehrmann/data/genome_WBcel235/aa.fa
  * [X] chr_all : /exports/molepi/tgehrmann/data/genome_WBcel235/chrall.fa.gz



In [9]:
print(ce.gff.getChildren('rna2').entries)
print(ce.cds["NM_058260.4"] == ce.gff.seq("rna1", ce.genome["all"]))

[GFF3Entry(seqid:NC_003279.8, source:RefSeq, feature:exon, start:11641, end:11689, score:., strand:+, phase:., attr:ID=id7;Dbxref=GeneID:171591,Genbank:NM_058259.4,WormBase:WBGene00022276;gbkey=mRNA;gene=nlp-40;partial=true;product=Peptide P4;start_range=.,11641;transcript_id=NM_058259.4), GFF3Entry(seqid:NC_003279.8, source:RefSeq, feature:exon, start:14951, end:15160, score:., strand:+, phase:., attr:ID=id8;Dbxref=GeneID:171591,Genbank:NM_058259.4,WormBase:WBGene00022276;gbkey=mRNA;gene=nlp-40;partial=true;product=Peptide P4;transcript_id=NM_058259.4), GFF3Entry(seqid:NC_003279.8, source:RefSeq, feature:exon, start:16473, end:16585, score:., strand:+, phase:., attr:ID=id9;Dbxref=GeneID:171591,Genbank:NM_058259.4,WormBase:WBGene00022276;end_range=16585,.;gbkey=mRNA;gene=nlp-40;partial=true;product=Peptide P4;transcript_id=NM_058259.4), GFF3Entry(seqid:NC_003279.8, source:RefSeq, feature:CDS, start:11641, end:11689, score:., strand:+, phase:0, attr:ID=cds1;Dbxref=EnsemblGenomes-Gn:WBGe

D: GFF input source is list of GFF3Entries.


In [6]:
print(ce.cds["NM_058260.4"] == ce.gff.seq("rna1", ce.genome["all"]))

print(type(ce.cds["NM_058260.4"]) == type(ce.gff.seq("rna1", ce.genome["all"])))
print(ce.gff.seq("rna1", ce.genome["all"]).seq.lower() == ce.cds["NM_058260.4"].seq.lower())

#for i, (o,t) in enumerate(zip(ce.cds["NM_058260.4"].seq, ce.gff.seq("rna1", ce.genome["all"]).seq)):
#    if o.lower() == t.lower():
#        print(i, o, t)

D: Initializing the GFF3ResourceManager object NOW
D: GFF input source is file.
D: GFF input source is list of GFF3Entries.
D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


Why doesn't this work?
True
True
True


D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.


In [7]:
nTranscriptsPerGene = []
nExonsPerTranscript = []
for gene in hg.gff.topLevel['gene']:
    transcripts = [ cid for (i, cid) in hg.gff.index[gene][1] ]
    nTranscriptsPerGene.append(len(transcripts))
    for trans in transcripts:
        nExonsPerTranscript.append(len(hg.gff.index[trans][1]))
    #efor
#efor

fig, axes = plt.subplots(figsize=(12,4), ncols=2, nrows=1)
axes = axes.flatten()
axes[0].hist(nTranscriptsPerGene, bins=20)
axes[0].set_xlabel("Number of transcripts")
axes[0].set_ylabel("Number of genes")
axes[1].hist(nExonsPerTranscript, bins=20, log=True)
axes[1].set_xlabel("Number of exons")
axes[1].set_ylabel("Number of transcripts")
plt.show()

D: Initializing the GFF3ResourceManager object NOW
D: GFF input source is file.


KeyboardInterrupt: 

## Access the Uniprot database

In [None]:
uniprot = biu.db.UniProt("human")
print(uniprot)

In [None]:
for result in uniprot.getProteinDomains('P42345'):
    print(result)

## Access the ClinVar database

In [None]:
cv = biu.db.ClinVar("GRCh37")
print(cv)

In [None]:
alts = { n : 0 for n in 'ACGT'}
for record in cv.queryVCF(1, 949422, 1049422):
    for alt in record.ALT:
        alt = alt.sequence
        if alt in alts:
            alts[alt] += 1

cImpact = {}
for record in cv.querySummary(1, 949422, 1049422):
    if record.clinicalsignificance not in cImpact:
        cImpact[record.clinicalsignificance] = 0
    cImpact[record.clinicalsignificance] += 1

fig, axes = plt.subplots(figsize=(12,4), ncols=2, nrows=1)
axes = axes.flatten()

nbars = axes[0].bar([1,2,3,4], alts.values(), tick_label=list(alts.keys()))

nbars = axes[1].bar([ x + 1 for x in range(len(cImpact.keys())) ], cImpact.values(), tick_label=list(cImpact.keys()))
plt.xticks(rotation=90)
plt.show()

## Access the CADD database
If you already have the files elsewhere, you can tell the system exactly where they are with the "localCopy" argument, and it will create a symbolic link to them in the local data directory instead of downloading them again.

In [None]:
cadd = biu.db.CADD(localCopy = {"tsv" : "/exports/molepi/tgehrmann/GAVIN-reimp/CADD/cadd.tsv.bgz", 
                                "tsv_tbi" : "/exports/molepi/tgehrmann/GAVIN-reimp/CADD/cadd.tsv.bgz.tbi"})
print(cadd)
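
The linking step described above can be sketched with the standard library. This is only an illustration of the idea, not BIU's code: `link_local_copy` is a hypothetical helper, and the temporary directories stand in for the real CADD locations.

```python
import os
import tempfile

def link_local_copy(local_path, managed_path):
    """Hypothetical helper: expose an existing file at the managed location."""
    os.makedirs(os.path.dirname(managed_path), exist_ok=True)
    # Only create the link if nothing is there yet.
    if not os.path.lexists(managed_path):
        os.symlink(local_path, managed_path)
    return managed_path

# Demonstration with temporary files standing in for the real CADD files.
src_dir = tempfile.mkdtemp()
data_dir = tempfile.mkdtemp()
existing = os.path.join(src_dir, "cadd.tsv.bgz")
with open(existing, "w") as fd:
    fd.write("placeholder")

linked = link_local_copy(existing, os.path.join(data_dir, "cadd", "cadd.tsv.bgz"))
```

Because only a link is created, the large file is not copied and is not downloaded a second time.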

In [None]:
plt.hist([ float(p) for p in cadd.query(1, 0, 1000000).values() ], bins=200)
plt.xlabel("CADD score")
plt.show()

## Access the GnomAD database

In [None]:
gnomad = biu.db.Gnomad(localCopy = { "vcf" : "/exports/molepi/tgehrmann/GAVIN-reimp/gnomAD/gnomad.vcf.bgz",
                                     "vcf_tbi" : "/exports/molepi/tgehrmann/GAVIN-reimp/gnomAD/gnomad.vcf.bgz.tbi"})
print(gnomad)

### How to query from VCF files.
There are different options to filter queries:
 * filters : Filter Variants based on a list of filters
 * gtFilters : Filter genotype calls based on a list of filters
 * types : Filter variants based on variant types
 * subTypes : Filter variants based on more specific variant types
 * sampleFilters : Select sample calls based on a list of sample names. Important: this KEEPS the samples in the list; it does not filter them out!
 
These options can be used in any VCF query structure (e.g. ClinVar, GnomAD, COSMIC, LLS, BBMRI).
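
To make the semantics of these options concrete, here is a small stand-alone sketch on plain dictionaries. It is not how BIU implements them: the record layout and `apply_vcf_filters` are assumptions for illustration, and I assume here that `filters` drops variants carrying any of the listed FILTER flags.

```python
def apply_vcf_filters(records, filters=None, types=None, sampleFilters=None):
    """Hypothetical illustration of the option semantics (not BIU code)."""
    kept = []
    for rec in records:
        # filters: drop variants carrying any of the listed FILTER flags (assumed).
        if filters and set(rec["filter"]) & set(filters):
            continue
        # types: keep only variants of the listed types.
        if types and rec["type"] not in types:
            continue
        # sampleFilters: KEEP only the listed samples.
        calls = rec["calls"]
        if sampleFilters is not None:
            calls = {s: gt for s, gt in calls.items() if s in sampleFilters}
        kept.append({**rec, "calls": calls})
    return kept

records = [
    {"pos": 100, "type": "snp", "filter": [], "calls": {"A": "0/1", "B": "1/1"}},
    {"pos": 101, "type": "indel", "filter": [], "calls": {"A": "0/0", "B": "0/1"}},
    {"pos": 102, "type": "snp", "filter": ["VQLOW"], "calls": {"A": "0/1", "B": "0/0"}},
]
result = apply_vcf_filters(records, filters=["VQLOW"], types=["snp"], sampleFilters=["A"])
```

In this toy input only the variant at position 100 survives, and only the call for sample "A" is retained.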

In [None]:
filters = [ "AMBIGUOUS","VQLOW","NVLOC","CALLRATE","MULTI","RECMULTI"]
types = [ "snp" ]

print("Without snp filter:", len(list(gnomad.query(1, 324719, 324720, filters=filters))))
print("With snp filter:", len(list(gnomad.query(1, 324719, 324720, filters=filters, types=types))))
print("With RF filter:", len(list(gnomad.query(1, 324719, 324720, filters=filters + ['RF'], types=types))))

In [None]:
lls = biu.db.LLS(localCopy={"phen" : "/home/tgehrmann/repos/VAR/phen218.txt"})

# Define the filters we want to use for the Variant calls:
varFilters = [ "AMBIGUOUS","VQLOW","NVLOC","CALLRATE","MULTI","RECMULTI"]

# We are only interested in SNPs
varTypes = [ "snp" ]

# Within the LLS, we are only interested in 218 people, 4 of them we want to exclude (REMINDER: ask Erik why...)
varSampleFilters = lls.phenotypes["cgID"].apply(lambda x: x + '_240_37-ASM').values

V = lls.query(1, 26883510, 26883511, filters=varFilters, types=varTypes, sampleFilters=varSampleFilters)
S = lls.query(1, 26883510, 26883511, filters=varFilters, types=varTypes, sampleFilters=varSampleFilters, extract="summary")

biu.formats.VCF.summary(V)
S


In [None]:
V = lls.query(1, 26883510, 26883511, extract="raw")

In [None]:
alts = { n : 0 for n in 'ACGT'}
for record in gnomad.queryVCF(1, 12590, 13000):
    for alt in record.ALT:
        alt = alt.sequence
        if alt in alts:
            alts[alt] += 1

x = []; y = []
for record in gnomad.queryCov(1, 12590, 13000, namedtuple=True):
    x.append(int(record.pos))
    y.append(float(record.mean))

gene = "gene:ENSG00000146648"
gEntry = hg.gff.getID(gene)
genex = []; geney=[]
for record in gnomad.queryCov(1, gEntry.start, gEntry.end, namedtuple=True):    
    genex.append(int(record.pos))
    geney.append(float(record.mean))

fig, axes = plt.subplots(figsize=(12,4), ncols=3, nrows=1)
axes = axes.flatten()

nbars = axes[0].bar([1,2,3,4], alts.values(), tick_label=list(alts.keys()))
axes[1].plot(x,y)
axes[2].plot(genex,geney)

plt.show()

## GTeX access
Because the GTeX downloads sit behind a Google login, you have to provide the files yourself.
This can be done by specifying the exact location of the data.
While the other datasets place their files in subdirectories of the 'where' that you specify, GTeX looks for EXACTLY the following files:
 * `'where'/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz`
 * `'where'/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz`
 * `'where'/GTEx_v7_Annotations_SampleAttributesDS.txt`
 * `'where'/GTEx_v7_Annotations_SubjectPhenotypesDS.txt`
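
Before constructing the object, it can help to check that the four files listed above are really present. The sketch below is a hypothetical pre-flight check, not part of BIU; `missing_gtex_files` is a name I made up, and the demonstration uses a temporary directory rather than the real data location.

```python
import os
import tempfile

# File names exactly as listed above.
GTEX_V7_FILES = [
    "GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz",
    "GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz",
    "GTEx_v7_Annotations_SampleAttributesDS.txt",
    "GTEx_v7_Annotations_SubjectPhenotypesDS.txt",
]

def missing_gtex_files(where):
    """Hypothetical check: which of the expected files are absent from `where`?"""
    return [f for f in GTEX_V7_FILES if not os.path.exists(os.path.join(where, f))]

# Demonstration in a temporary directory that contains only one of the files.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, GTEX_V7_FILES[2]), "w").close()
missing = missing_gtex_files(demo_dir)
```

An empty list means all four files are in place.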


In [None]:
where = '/exports/molepi/tgehrmann/data/'
import biu as biu
biu.config.settings.setWhere(where)
gtex = biu.db.GTeX(version="v7",
                   where="/exports/molepi/tgehrmann/data/gtex")
print(gtex)

In [None]:
print(gtex.getPersonIDSamples(gtex.getPersonIDs()[0]))
%time gtex.getGeneExpr(gtex.getPersonIDSamples(gtex.getPersonIDs()[0]))

In [None]:
pTissues = {}
for i, row in gtex.sAttr.iterrows():
    if row["SMAFRZE"] != "RNASEQ":
        continue
    #fi
    personID = row["SAMPID"].split('-')[1]
    sampleType = row["SMTSD"]
    if personID not in pTissues:
        pTissues[personID] = []
    pTissues[personID].append(sampleType)

indivTissues = sorted(list(set(gtex.sAttr["SMTSD"])))
pairwiseTissueCounts = {}
for personID in pTissues:
    tissues = list(set(pTissues[personID]))
    for i, samplei in enumerate(tissues[:-1]):
        for j, samplej in enumerate(tissues[i+1:]):
            key = (samplei, samplej)
            if key not in pairwiseTissueCounts:
                pairwiseTissueCounts[key] = 0
            pairwiseTissueCounts[(samplei, samplej)] += 1

indivTissuesMap = { t: i for (i,t) in enumerate(list(indivTissues)) }

C = np.zeros([len(indivTissues), len(indivTissues)])
for (t1,t2) in pairwiseTissueCounts:
    C[indivTissuesMap[t1], indivTissuesMap[t2]] = int(pairwiseTissueCounts.get((t1,t2),0) +
                                                  pairwiseTissueCounts.get((t2,t1),0))
    C[indivTissuesMap[t2], indivTissuesMap[t1]] = int(pairwiseTissueCounts.get((t1,t2),0) +
                                                  pairwiseTissueCounts.get((t2,t1),0))

In [None]:
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(C, ax = ax, xticklabels=indivTissues, yticklabels=indivTissues, annot=True, fmt='.0f')
plt.show()

In [None]:
interestTissues = [ 'Adipose - Subcutaneous', 'Adipose - Visceral (Omentum)', 'Muscle - Skeletal', "Whole Blood" ]
interestTissuesIndex = [ indivTissuesMap[t] for t in interestTissues ]

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(C[interestTissuesIndex,:][:,interestTissuesIndex], ax = ax, xticklabels=interestTissues, yticklabels=interestTissues, annot=True, fmt='.1f')
plt.show()

## Reactome access
This does not really work yet, so the example is commented out.

In [None]:
#reactome = biu.db.Reactome(where=where + '/reactome')
#print(reactome)
#for r in reactome.getPathway("R-HSA-1236973"):
#    print(r)

## Cosmic access

In [None]:
cosmic = biu.db.Cosmic("t.gehrmann@lumc.nl", "Cosmic_password1")
print(cosmic)

In [None]:
for r in cosmic.vcfCoding.query(1, 1, 69270):
    print(r)

In [None]:
for r in cosmic.vcfNonCoding.query(1, 1, 69270):
    print(r)

## Gene mapping

### Do gene mapping with pickled maps
Lookups are faster, but initialization is slow and memory usage is higher.

In [None]:
import biu
hm = biu.maps.Human(where="/exports/molepi/tgehrmann/data/")
print(hm)

def exampleMapping(GMO):
    # GMO : Gene Mapping Object
    symbol = "MTOR"
    geneid = GMO.getSymbolGeneID(symbol)[0]
    print("%s -> %s" % (symbol, geneid))
    symbol = GMO.getGeneIDSymbol(geneid)[0]
    print("%s -> %s" % (geneid, symbol))
    ensembl = GMO.getSymbolEnsembl(symbol)[0]
    print("%s -> %s" % (symbol, ensembl))
    symbol = GMO.getEnsemblSymbol(ensembl)[0]
    print("%s -> %s" % (ensembl, symbol))
#edef

def exampleMappingSilent(GMO):
    # GMO : Gene Mapping Object
    symbol = "MTOR"
    geneid = GMO.getSymbolGeneID(symbol)[0]
    symbol = GMO.getGeneIDSymbol(geneid)[0]
    ensembl = GMO.getSymbolEnsembl(symbol)[0]
    symbol = GMO.getEnsemblSymbol(ensembl)[0]
#edef

In [None]:
exampleMapping(hm)

### Mapping with SQLite instead of pickled Maps
Initialization is fast, but individual lookups are slower.
Because initialization is so cheap, we can query a larger number of structures, including the gene2refseq index and the uniprotmap, which are prohibitively large for the pickled map.
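
The trade-off can be sketched with the standard library alone. This is not BIU's schema or code: the table name `symbol2geneid` and the two toy rows are assumptions for illustration, and a real SQLite map would live in a file on disk rather than in memory.

```python
import sqlite3

rows = [("MTOR", "2475"), ("TP53", "7157")]  # toy rows for illustration only

# Map approach: everything is built in memory up front, lookups are dict accesses.
symbol_map = {}
for symbol, geneid in rows:
    symbol_map.setdefault(symbol, []).append(geneid)

# SQLite approach: opening the database is cheap, each lookup is an indexed query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE symbol2geneid (symbol TEXT, geneid TEXT)")
con.execute("CREATE INDEX idx_symbol ON symbol2geneid (symbol)")
con.executemany("INSERT INTO symbol2geneid VALUES (?, ?)", rows)

def sqlite_lookup(symbol):
    cur = con.execute("SELECT geneid FROM symbol2geneid WHERE symbol = ?", (symbol,))
    return [r[0] for r in cur.fetchall()]

map_result = symbol_map["MTOR"]
sql_result = sqlite_lookup("MTOR")
```

Both approaches return the same answer; they differ in when the cost is paid (at start-up for the map, per query for SQLite).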

In [None]:
where = '/exports/molepi/tgehrmann/data/'
import biu as biu
biu.config.settings.setWhere(where)
print(biu.config.settings.getWhere())
hms = biu.maps.HumanS()
print(hms)

In [None]:
exampleMapping(hms)

### Compare Map vs SQLite speeds

In [None]:
print("Map Lookup")
%timeit exampleMappingSilent(hm)
print("SQLite lookup")
%timeit exampleMappingSilent(hms)

## HAGR access
Unlike the other datasets, HAGR is downloaded in its entirety as soon as the class is instantiated (the data comes in ZIP files, and I don't have a nice solution for loading those lazily yet).

In [None]:
import biu as biu
hagr = biu.db.HAGR(where = '/exports/molepi/tgehrmann/data/')
print(hagr)

In [None]:
hagr.human_genes

## Access GO annotations

In [None]:
where = '/exports/molepi/tgehrmann/data/'
import biu as biu
biu.config.settings.setWhere(where)
print(biu.config.settings.getWhere())
go = biu.db.GO()
print(go)

In [None]:
print("Number of genes annotated with GO:0002250: %d" % len(go.getAnnotated("GO:0002250")))

print("Number of annotations for P78540: %d" % len(go.getAnnots("P78540")))

## Access KEGG annotations

In [None]:
where = '/exports/molepi/tgehrmann/data/'
import biu as biu
biu.config.settings.setWhere(where)
print(biu.config.settings.getWhere())
kegg = biu.db.KEGG()
hms = biu.maps.HumanS()

In [None]:
print(kegg)

In [None]:
print("Number of pathways MTOR is in: %d" % len(kegg.getGenePathways(hms.getSymbolGeneID("MTOR")[0])))

print("Number of genes in path:hsa05230: %d" % len(kegg.getPathwayGenes("path:hsa05230")))

In [None]:
print(kegg.getPathwayInfo("hsa05230"))

In [None]:
fig, axes = plt.subplots(figsize=(12,4), ncols=2, nrows=1)
axes = axes.flatten()

# How many genes are there per kegg pathway?
genesPerPathway = [ len(kegg.getPathwayGenes(p)) for p in kegg.getPathways() ]
pathwaysPerGene = [ len(kegg.getGenePathways(g)) for g in kegg.getGenes() ]

axes[0].hist(genesPerPathway, bins=50)
axes[0].set_xlabel("Number of genes per pathway")
axes[1].hist(pathwaysPerGene, bins=50)
axes[1].set_xlabel("Number of pathways per gene")
plt.show()

## Access LLS data

In [None]:
lls = biu.db.LLS()

In [None]:
for r in lls.queryRegions([ ("1", 100, 100000), ("1", 100, 100000)]):
    print(r)

## Access BBMRI data

In [None]:
bbmri = biu.db.BBMRI()

## Store some variables persistently in a SQLite database

In [None]:
pDict = biu.formats.SQLDict("test")
print(pDict)

In [None]:
pDict["test"] = { 5: "hello", "aha" : [ 1, 4, "345"]}
print("test -> ", pDict["test"])
print("yest -> ", pDict["yest"])

for x in pDict:
    print(x)
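
The idea of a dictionary that persists in SQLite can be sketched in a few lines. This is not `biu.formats.SQLDict` itself: `TinySQLDict`, its table layout, and the use of pickle for values are assumptions made for illustration. The sketch raises `KeyError` for missing keys; I have not checked what SQLDict does in that case.

```python
import os
import pickle
import sqlite3
import tempfile

class TinySQLDict:
    """Hypothetical minimal persistent dict; not biu.formats.SQLDict itself."""

    def __init__(self, path):
        self._con = sqlite3.connect(path)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")
        self._con.commit()

    def __setitem__(self, key, value):
        # Pickle the value so nested lists and non-string dict keys survive.
        self._con.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, pickle.dumps(value)))
        self._con.commit()

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __iter__(self):
        return iter([r[0] for r in self._con.execute("SELECT key FROM kv")])

    def close(self):
        self._con.close()

db_path = os.path.join(tempfile.mkdtemp(), "test.sqlite")
store = TinySQLDict(db_path)
store["test"] = {5: "hello", "aha": [1, 4, "345"]}
store.close()

# Re-open the same file: the value is still there.
reopened = TinySQLDict(db_path)
value = reopened["test"]
keys = list(reopened)
reopened.close()
```

The point of the design is that values outlive the Python session, because every assignment is committed to the database file.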

## Access miRmine database

In [None]:
import biu
where = '/exports/molepi/tgehrmann/data/'
biu.config.settings.setWhere(where)

In [None]:
mir = biu.db.MiRmine()

In [None]:
print(mir)

In [None]:
mir.getExpr(["DRX003170", "DRX003171", "DRX017209"])

In [None]:
set(mir._info["Tissue"][mir._info["Tissue"].apply(lambda x: not(pd.isnull(x)))].values)

## Use pipelines

### Use the VEP pipeline

In [None]:
import biu as biu
where = '/exports/molepi/tgehrmann/data/'
biu.config.settings.setWhere(where)

lls = biu.db.LLS()
varList = list([ r for r in lls.query(1, 10483, 10495)])
vep = biu.pipelines.VEP(varList)
vep.getAnnotations()

### Use the LiftOver Pipeline

In [None]:
import biu as biu
where = '/exports/molepi/tgehrmann/data/'
biu.config.settings.setWhere(where)

lls = biu.db.LLS()
varList = [ (r.CHROM, r.POS-1, r.POS) for r in lls.query(5, 42423775, 42426000, filt=['VQLOW'])]
varLift = biu.pipelines.LiftOver(varList)
varLift.getLiftOver()

In [None]:
varLift.getLiftOver().values

In [None]:
print(lls)

In [None]:
import biu as biu
import matplotlib.pylab as plt
import seaborn as sns
import numpy as np
import pandas as pd

gnomad = biu.db.Gnomad(localCopy = { "vcf" : "/exports/molepi/tgehrmann/GAVIN-reimp/gnomAD/gnomad.vcf.bgz",
                                     "vcf_tbi" : "/exports/molepi/tgehrmann/GAVIN-reimp/gnomAD/gnomad.vcf.bgz.tbi"})

filters = [ "AMBIGUOUS","VQLOW","NVLOC","CALLRATE","MULTI","RECMULTI"]
types = [ "snp" ]

print(gnomad.vcf)

gnomad.query('Y', 2655002, 2655003)
gnomad.query('Y', 2655024, 2655025)
v = gnomad.query('Y', 2709618, 2709619)
#gnomad.summary(v, [1])

print(gnomad.vcf)

gnomad.query('1', 2655002, 2709619, extract="summary")


In [None]:
print(gnomad.vcf)

In [None]:
print(list(r.query(20, 14369, 14371))[0])

In [None]:
import vcf
r = vcf.Reader(open('docs/example_files/example.vcf','r'))
for rec in r:
    print(rec)

In [None]:
r.samples

In [None]:
import intervaltree as it
t = it.IntervalTree()

In [None]:
i = it.Interval(0, 1, 'test')

In [None]:
t[0:1] = 'test'

In [None]:
t[0]

In [None]:
class Parent(object):
    def __init__(self, a):
        self.__a = a
    
    def printa(self):
        print(self.__a)

class Child(Parent):
    def __init__(self, a):
        Parent.__init__(self, a)

        
c = Child(6)
c.printa()

# Methylation

In [119]:
M = pd.read_csv("/home/tgehrmann/methylated_bases.tsv", delimiter="\t")
M["ratio"] = M["numCs"] / M["coverage"]
M["#contig"] = M["#contig"].apply(lambda x: '_'.join(x.split('_')[:2]))
M = M[(M["ratio"] >= 0.4) & (M["ratio"] <= 0.6)]

G = biu.formats.GFF3("/home/tgehrmann/AgabiA15p1.genes.gff3")

differentiable = pd.read_csv("/home/tgehrmann/DESEQ_input.tsv", delimiter="\t")
p1_diff = differentiable["genegroup"].apply(lambda x: x.split(',')[0] + '|gene').values

D: GFF input source is file.


In [120]:
import intervaltree

def makeIndex(gff, genes):
    T = { seqid: intervaltree.IntervalTree() for seqid in gff.seqids }
    n = len(genes)
    for i, gene in enumerate(genes):
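        e = gff.getIDEntry(gene)  # use the gff argument, not the global G
        if e is None:
            continue
        print("\r%s - %d/%d" % (e.seqid, i+1, n), end="")
        # Flanking regions only: 1 kb before the start and 1 kb after the end of each gene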
        T[e.seqid][e.start-1000:e.start] = gene + '_upstream'
        T[e.seqid][e.end:e.end+1000] = gene + '_downstream'
        #Full gene & up/downstream
        #T[e.seqid][e.start-1000:e.end+1000] = gene + '_full'
    #efor
    return T
#edef
idx = makeIndex(G, p1_diff)
        
        

scaffold_3 - 6631/66311

In [121]:
methGenes = {}
for i, row in M[(M["ratio"] >= 0.4) & (M["ratio"] <= 0.6)].iterrows():
    pos = row.start
    seqid = row["#contig"]
    res = [ r[-1] for r in idx[seqid][pos] ]
    for r in res:
        if r not in methGenes:
            methGenes[r] = []
        #fi
        methGenes[r].append((seqid, pos))
    #efor

In [122]:
methGenes.keys()

dict_keys(['AgabiA15p1|601|gene_downstream', 'AgabiA15p1|617|gene_upstream', 'AgabiA15p1|616|gene_downstream', 'AgabiA15p1|816|gene_downstream', 'AgabiA15p1|1189|gene_downstream', 'AgabiA15p1|1268|gene_downstream', 'AgabiA15p1|1269|gene_downstream', 'AgabiA15p1|1288|gene_downstream', 'AgabiA15p1|1290|gene_upstream', 'AgabiA15p1|1289|gene_downstream', 'AgabiA15p1|1291|gene_upstream', 'AgabiA15p1|1290|gene_downstream', 'AgabiA15p1|1303|gene_downstream', 'AgabiA15p1|3570|gene_upstream', 'AgabiA15p1|3570|gene_downstream', 'AgabiA15p1|3631|gene_upstream', 'AgabiA15p1|3632|gene_upstream', 'AgabiA15p1|3632|gene_downstream', 'AgabiA15p1|3631|gene_downstream', 'AgabiA15p1|3703|gene_downstream', 'AgabiA15p1|3704|gene_upstream', 'AgabiA15p1|3708|gene_downstream', 'AgabiA15p1|4078|gene_downstream', 'AgabiA15p1|4081|gene_downstream', 'AgabiA15p1|4095|gene_upstream', 'AgabiA15p1|4402|gene_downstream', 'AgabiA15p1|4403|gene_upstream', 'AgabiA15p1|4408|gene_upstream', 'AgabiA15p1|4407|gene_downstream'

In [123]:
print(len(set([ mg.split('_')[0] for mg in methGenes.keys()])))
diffmeth = set([ mg.split('_')[0] for mg in methGenes.keys()])
diffmeth = set([ '|'.join(dmg.split('|')[:-1]) for dmg in diffmeth if dmg != "" ])

p1_diff = set([ '|'.join(dmg.split('|')[:-1]) for dmg in p1_diff ])

334


In [124]:
COMPOST_DIFFREG = pd.read_csv("/home/tgehrmann/diff_regulated_compost.tsv", delimiter="\t")
COMPOST_DIFFREG["id"] = COMPOST_DIFFREG["id"].apply(lambda x: x.split(',')[0])
COMPOST_DIFFREG = set(COMPOST_DIFFREG[COMPOST_DIFFREG.padj < 0.05].id.unique())

JORDI_DIFFREG = pd.read_csv("/home/tgehrmann/diff_regulated.tsv", delimiter="\t")
JORDI_DIFFREG["id"] = JORDI_DIFFREG["id"].apply(lambda x: x.split(',')[0])
JORDI_DIFFREG = set(JORDI_DIFFREG[JORDI_DIFFREG.padj < 0.05].id.unique())

DIFFREG = JORDI_DIFFREG | COMPOST_DIFFREG

In [125]:

from scipy.stats import chi2_contingency
sigdiffreg_all      = DIFFREG
sigdiffreg_compost  = COMPOST_DIFFREG
sigdiffreg_jordi    = JORDI_DIFFREG
sigdiffreg_overlap  = sigdiffreg_compost & sigdiffreg_jordi
sigdiffreg_com_uniq = sigdiffreg_compost - sigdiffreg_overlap
sigdiffreg_jor_uniq = sigdiffreg_jordi - sigdiffreg_overlap

def enrich_methylation(genes, sigdiffreg, MGENES):
    # Chi-square test on the 2x2 table (differentially regulated or not) x (methylated or not)
    totalgenes   = set(genes)
    totaldiff    = set(sigdiffreg)
    totaldiff_no = totalgenes - totaldiff
    totalmeth    = set(MGENES)
    totalmeth_no = totalgenes - totalmeth

    diff_meth     = totaldiff & totalmeth
    diff_nometh   = totaldiff & totalmeth_no
    nodiff_meth   = totaldiff_no & totalmeth
    nodiff_nometh = totaldiff_no & totalmeth_no
    
    abcd = [len(diff_meth), len(diff_nometh)], [len(nodiff_meth), len(nodiff_nometh)]

    #print(abcd)
    res = chi2_contingency(abcd)[1]  # index 1 is the p-value
    #print(res)
    
    return res, abcd
#edef

diffsets = [ ("all", sigdiffreg_all), ("compost", sigdiffreg_compost), 
            ("jordi", sigdiffreg_jordi), ("overlap", sigdiffreg_overlap), 
            ("compost_unique", sigdiffreg_com_uniq), ("jordi_unique", sigdiffreg_jor_uniq)]

for (title, diffset) in diffsets:
    print(title, enrich_methylation(p1_diff, diffset, diffmeth))
#[ [x[0][1]] + x[1] + x[2] for x in enrichments_mgenes]

all (0.0028797005331946484, ([34, 377], [300, 5920]))
compost (0.08688076471811172, ([8, 74], [326, 6223]))
jordi (0.0006154488818513568, ([33, 335], [301, 5962]))
overlap (0.0008666163397255266, ([7, 32], [327, 6265]))
compost_unique (0.6413442183475865, ([1, 42], [333, 6255]))
jordi_unique (0.020960381867411244, ([26, 303], [308, 5994]))
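
To see what the p-values above are made of, the test can be written out by hand for a 2x2 table. The sketch below assumes scipy's default behaviour for 2x2 tables (a chi-square test with Yates' continuity correction and one degree of freedom); `chi2_2x2_pvalue` is a name I made up, and it is only a pure-Python illustration, not a replacement for `chi2_contingency`.

```python
import math

def chi2_2x2_pvalue(table):
    """Yates-corrected chi-square p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    observed = [[a, b], [c, d]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of the two properties.
            expected = rows[i] * cols[j] / n
            # Yates' continuity correction: shrink |O - E| by at most 0.5.
            diff = abs(observed[i][j] - expected)
            diff -= min(0.5, diff)
            stat += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# The "all" table from the output above.
p_all = chi2_2x2_pvalue([[34, 377], [300, 5920]])
```

Under that assumption, `p_all` should agree with the value printed for the "all" set; a table whose rows have identical proportions gives a p-value of 1.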


In [126]:
len(p1_diff)

6631

In [127]:
p1_diff

{'AgabiA15p1|5531',
 'AgabiA15p1|2205',
 'AgabiA15p1|9593',
 'AgabiA15p1|2631',
 'AgabiA15p1|9563',
 'AgabiA15p1|510',
 'AgabiA15p1|3985',
 'AgabiA15p1|5912',
 'AgabiA15p1|4480',
 'AgabiA15p1|7144',
 'AgabiA15p1|9273',
 'AgabiA15p1|4027',
 'AgabiA15p1|5613',
 'AgabiA15p1|9531',
 'AgabiA15p1|10053',
 'AgabiA15p1|2695',
 'AgabiA15p1|9333',
 'AgabiA15p1|3153',
 'AgabiA15p1|694',
 'AgabiA15p1|8421',
 'AgabiA15p1|6581',
 'AgabiA15p1|4769',
 'AgabiA15p1|10533',
 'AgabiA15p1|6363',
 'AgabiA15p1|3995',
 'AgabiA15p1|10634',
 'AgabiA15p1|10240',
 'AgabiA15p1|8522',
 'AgabiA15p1|10381',
 'AgabiA15p1|9400',
 'AgabiA15p1|4351',
 'AgabiA15p1|3270',
 'AgabiA15p1|4356',
 'AgabiA15p1|1059',
 'AgabiA15p1|8349',
 'AgabiA15p1|9945',
 'AgabiA15p1|5866',
 'AgabiA15p1|3749',
 'AgabiA15p1|9968',
 'AgabiA15p1|4138',
 'AgabiA15p1|723',
 'AgabiA15p1|6241',
 'AgabiA15p1|630',
 'AgabiA15p1|9124',
 'AgabiA15p1|131',
 'AgabiA15p1|2097',
 'AgabiA15p1|10854',
 'AgabiA15p1|4117',
 'AgabiA15p1|4251',
 'AgabiA15p1|1790',

# Filestamping

In [78]:
class RDM(object):
    def __init__(self, path, mode='w', file=False, nouser=False, suffix=True, **meta):
        import os
        import pathlib
        
        # Keep the directory as a string ('.' when no directory is given)
        self.path = os.path.dirname(path) or '.'
        self.filename = os.path.basename(path)
        self.prefix = '.'.join(self.filename.split('.')[:-1]) if suffix else self.filename
        self.suffix = self.filename.split('.')[-1] if suffix else None
        self.jsonfile = '%s/rdm_filedescriptions.json' % self.path
        # Make sure the target directory exists before anything is written to it
        pathlib.Path(self.path).mkdir(parents=True, exist_ok=True)
        
        meta['source'] = meta.get('source', os.getcwd())
        meta['user']   = os.getlogin()
        
        self._session = {
            'mode' : mode,
            'file' : file,
            'meta' : meta,
            'nouser' : nouser
        }
        
    #edef
    
    def _glob(self, **kwargs):
        import glob
        globstring = '{path}/{prefix}.{stamp}{dot}{suffix}'.format(
            path=self.path,
            prefix=self.prefix,
            stamp=self._rdmstamp(**kwargs)['stamp'],
            dot='.' if self.suffix is not None else '',
            suffix=self.suffix if self.suffix is not None else '')
        return sorted(glob.glob(globstring))
    #edef
    
    def _rdmstamp(self, nouser=False, nodate=False):
        import datetime
        import os
        
        date = datetime.datetime.now().isoformat()
        user = os.getlogin()

        return {
            'user' : user,
            'date' : date,
            'stamp' : 'rdm_{user}_{date}'.format(user=user if not nouser else '*', date=date if not nodate else '*').replace('.',':')
        }
    #edef
    
    def _rdmstamp_from_filename(self, filename):
        stamp = filename[::-1][:filename[::-1].find('_mdr.')][::-1].split('.')[0].split('_')
        return {
            'user' : stamp[0],
            'date' : stamp[1],
            'stamp' : 'rdm_{user}_{date}'.format(user=stamp[0], date=stamp[1]).replace('.',':')
        }
    #edef    
    
    def __enter__(self):
        import json
        import os
        
        if self._session['mode'] not in ['r','w','rw']:
            raise NotImplementedError
        #fi
        
        if self._session['mode'] in [ 'r', 'rw' ]:
            g = self._glob(nouser=self._session['nouser'], nodate=True)
            #if len(g) == 0:
            #    raise FileNotFoundError
            #else
            stampnow = self._rdmstamp()
            self._session['stamp'] = self._rdmstamp_from_filename(g[-1]) if len(g) > 0 else stampnow
            self._session['meta']['modified'] = stampnow['date'] if 'w' in self._session['mode'] else self._session['stamp']['date']
        #fi
        
        if self._session['mode'] == 'w':
            self._session['stamp'] = self._rdmstamp()
            self._session['meta']['modified'] = self._session['stamp']['date']
        #fi
        
        self._session['filename'] = '{path}/{prefix}.{stamp}{dot}{suffix}'.format(
            path=self.path,
            prefix=self.prefix,
            stamp=self._session['stamp']['stamp'],
            dot='.' if self.suffix is not None else '',
            suffix=self.suffix if self.suffix is not None else '')
        
        if self._session['file']:
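            # NOTE: open() does not accept 'rw', so file=True only works with mode 'r' or 'w'
            fd = open(self._session['filename'], self._session['mode'])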
            self._session['fd'] = fd
            return fd
        #fi
        
        return self._session['filename']
    #edef
    
    def _loadjson(self):
        # Imports are local to each method in this class, so they are needed here too
        import json
        import os

        if os.path.exists(self.jsonfile):
            with open(self.jsonfile, 'r') as ifd:
                return json.loads(ifd.read())
            #ewith
        else:
            return {}
        #fi
    #edef
    
    def _writejson(self, d):
        import json

        with open(self.jsonfile, 'w') as ofd:
            ofd.write(json.dumps(d, indent='    '))
        #ewith
    #edef

    def __exit__(self, *args):
        if self._session['file']:
            self._session['fd'].close()
        #fi
        
        js = self._loadjson()
        js[self._session['filename']] = self._session['meta']
        self._writejson(js)
    #edef
#eclass

D = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])

with RDM('my_new_file.txt','rw', source='repos/BIU/example.ipynb') as ofd:
    D.to_csv(ofd, index=False)
    
with RDM('my_new_file.txt', 'r') as ifd:
    X = pd.read_csv(ifd)

./my_new_file.rdm_thies_2023-11-20T13:42:43:370113.txt
./my_new_file.rdm_thies_2023-11-20T13:42:43:370113.txt


In [74]:
X

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [38]:
? FileNotFoundError

Init signature: FileNotFoundError(self, /, *args, **kwargs)
Docstring:      File not found.
Type:           type
Subclasses:     JoblibFileNotFoundError, ExecutableNotFoundError


In [33]:
import datetime
d = datetime.datetime.now()
d.isoformat()

'2023-11-17T13:57:19.000285'

In [21]:
? datetime.datetime

Init signature: datetime.datetime(self, /, *args, **kwargs)
Docstring:     
datetime(year, month, day[, hour[, minute[, second[, microsecond[,tzinfo]]]]])

The year, month and day arguments are required. tzinfo may be None, or an
instance of a tzinfo subclass. The remaining arguments may be ints.
File:           /mnt/b/thies/miniconda/envs/biu/lib/python3.8/datetime.py
Type:           type
Subclasses:     ABCTimestamp, _NaT
