# Project 03: Gibbs Sampling (Random Algorithm)

In [1]:
import random
import numpy as np
import bamnostic as bs
import seqlogo
from pprint import pprint

# Import functions for building sequence motifs and identifying sequences that match a motif
from data_readers import *
import seq_ops
import motif_ops
import project03
import numba

from joblib import Parallel, delayed

from utils import utils
from sequence_database import sequence_box
from collections import Counter, defaultdict

from scipy.special import softmax 
import time

print("Great, everything is up to date")



Great, everything is up to date


---
## Implement Gibbs Sampler


Gibbs sampling is a Markov chain Monte Carlo (MCMC) approach that can be used to identify enriched sequence patterns. Here we will implement a Gibbs sampler to identify motifs in a set of regions.

Important considerations:
- We will need to score each sequence with a PWM using the `score_kmer()` or `score_sequence()` functions
    - You will need to consult the help documentation and libraries to determine how best to use these functions.
- These sites are often not strand-specific, so scores on both the positive and negative strands should be considered
- To select a random sequence, use `random.randint()` or `numpy.random.randint()`
- To select a new position $m$ (as defined below) use `random.choices()` or `numpy.random.choice()`  
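
The strand consideration above can be sketched as follows. This is a minimal illustration, not the project's `score_kmer()` or `score_sequence()`: the helper names `reverse_complement`, `score_kmer_sketch`, and `best_strand_scores` are hypothetical, the PWM is assumed to be a 4 x k array with rows ordered A, C, G, T, and reads are assumed to contain only those four bases.

```python
import numpy as np

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def score_kmer_sketch(kmer, pwm):
    """Sum the PWM entries for each base of the k-mer (hypothetical helper)."""
    return sum(pwm[BASES.index(base), pos] for pos, base in enumerate(kmer))

def best_strand_scores(seq, pwm):
    """Score every k-mer window on both strands and keep the higher score."""
    k = pwm.shape[1]
    scores = []
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        fwd = score_kmer_sketch(kmer, pwm)
        rev = score_kmer_sketch(reverse_complement(kmer), pwm)
        scores.append(max(fwd, rev))
    return np.array(scores)

# Toy log-odds PWM for k = 3 that favours "ACG" (rows: A, C, G, T)
pwm = np.array([
    [ 2.0, -1.0, -1.0],
    [-1.0,  2.0, -1.0],
    [-1.0, -1.0,  2.0],
    [-1.0, -1.0, -1.0],
])
scores = best_strand_scores("TTACGTT", pwm)
```

Taking the maximum of the two strand scores is one possible design choice; summing the two scores, or keeping both and remembering the strand, would be equally reasonable under the same assumptions.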

Assumptions: 
- We know $k$, the length of the expected motif
- Each sequence contains the motif



```
GibbsMotifFinder(DNA, k-length)
    random pick of k-length sequences from each line of DNA as Motifs
    for j ← 1 to 10000 or Motifs stops changing
        i ← Random(N) where N is number of DNA entries
        PWM ← PWM constructed from all Motifs except for Motifi
        Motifi ← select position m from PWM-scored k-mers in DNAi in probabilistic fashion from score distribution
    return PFM
```
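
The pseudocode above could be turned into Python roughly as shown below. This is a sketch under stated assumptions rather than the `project03.GibbsMotifFinder` implementation: the names `build_pfm` and `gibbs_motif_sketch` are hypothetical, only the forward strand is scored, a pseudocount of 1 is added so that no k-mer gets zero weight, and the loop runs for a fixed number of iterations instead of testing whether the motifs have stopped changing.

```python
import random
import numpy as np

BASES = "ACGT"

def build_pfm(motifs, pseudocount=1.0):
    """Count bases per motif column; rows are A, C, G, T."""
    k = len(motifs[0])
    pfm = np.full((4, k), pseudocount)
    for motif in motifs:
        for pos, base in enumerate(motif):
            pfm[BASES.index(base), pos] += 1
    return pfm

def gibbs_motif_sketch(dna, k, max_iter=1000, seed=0):
    """Minimal Gibbs sampler: resample one motif per iteration."""
    rng = random.Random(seed)
    # Random initial k-mer start in every sequence
    starts = [rng.randrange(len(s) - k + 1) for s in dna]
    motifs = [s[i:i + k] for s, i in zip(dna, starts)]
    for _ in range(max_iter):
        i = rng.randrange(len(dna))            # sequence to resample
        others = motifs[:i] + motifs[i + 1:]   # leave Motif_i out
        ppm = build_pfm(others)
        ppm = ppm / ppm.sum(axis=0)            # column-wise probabilities
        seq = dna[i]
        # A_l: probability of each k-mer in DNA_i under the profile
        weights = [
            np.prod([ppm[BASES.index(b), p] for p, b in enumerate(seq[l:l + k])])
            for l in range(len(seq) - k + 1)
        ]
        # Draw the new start position in proportion to its weight
        m = rng.choices(range(len(weights)), weights=weights)[0]
        motifs[i] = seq[m:m + k]
    return build_pfm(motifs, pseudocount=0.0), motifs

dna = ["TTTACGTACCC", "GGACGTACGG", "ACGTACAAAA"]
pfm, motifs = gibbs_motif_sketch(dna, k=6, max_iter=50)
```

The function returns both the final count matrix and the motif list so that convergence can be inspected from either one.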

Probability of choosing position $m$: $P(m) = \frac{A_{m}}{\sum_{l}A_{l}}$, where $A_{l}$ is the score of the k-mer starting at position $l$ in DNAi
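
As a small worked example of this formula, assume three candidate positions with scores 1, 3, and 6 (made-up values). The normalization only yields valid probabilities when every $A_{l}$ is non-negative; if the scores are log-odds values that can be negative, one option (an assumption here, not something the assignment prescribes) is to pass them through a softmax first, which is written out with NumPy below.

```python
import numpy as np

# Made-up scores A_l for three candidate start positions
scores = np.array([1.0, 3.0, 6.0])
probs = scores / scores.sum()          # P(m) = A_m / sum_l A_l -> 0.1, 0.3, 0.6

rng = np.random.default_rng(0)
m = rng.choice(len(scores), p=probs)   # sampled position index

# Alternative for scores that may be negative (e.g. log-odds): softmax
log_odds = np.array([-2.0, 0.5, 3.0])
shifted = np.exp(log_odds - log_odds.max())   # subtract max for stability
softmax_probs = shifted / shifted.sum()
```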


**Note:** I have also added a function to `motif_ops.py` that will calculate the information content of your motifs. This is useful to observe the progression of your Gibbs sampler as well as a measure of convergence. You can use this function as `IC = pfm_ic(pfm)`. You should expect a slow increase of IC until it plateaus such as in the plot below from your lecture slides:

<center><img src='figures/Gibbs_Sampling.png' width='600px'/></center>
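
For intuition about what an information-content measure reports, here is a minimal sketch. It is not the `pfm_ic` function from `motif_ops.py`, whose internals are not shown here: the name `information_content_sketch` is hypothetical, the matrix is assumed to be 4 x k, a uniform background is assumed, and a tiny pseudocount is added to avoid taking the log of zero.

```python
import numpy as np

def information_content_sketch(pfm, pseudocount=1e-9):
    """Total information content in bits of a 4 x k frequency matrix."""
    pfm = np.asarray(pfm, dtype=float) + pseudocount
    ppm = pfm / pfm.sum(axis=0)                   # column-wise probabilities
    entropy = -(ppm * np.log2(ppm)).sum(axis=0)   # Shannon entropy per column
    return float((2.0 - entropy).sum())           # 2 bits is the DNA maximum

# A matrix with no preference at any position carries about 0 bits
uniform = np.ones((4, 6))
ic_uniform = information_content_sketch(uniform)

# A perfectly conserved motif of length 6 carries about 12 bits
conserved = np.zeros((4, 6))
conserved[0, :] = 10
ic_conserved = information_content_sketch(conserved)
```

Under these assumptions, a sampler that is converging on a real motif should move from values near the uniform case toward larger totals.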

---
# Setting up the data

This BAM file contains a subsampled set of aligned ChIP‑seq reads from a p53 immunoprecipitation experiment in human K562 cells treated with the anthracycline drug daunorubicin. The original SRA experiment SRX5865974 (run SRR9090854) reports 31.2 million Illumina NextSeq 500 ChIP‑seq read pairs from K562 wild‑type cells exposed to daunorubicin, using a p53 antibody to pull down p53‑bound chromatin fragments before library preparation and sequencing. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC4366240/)

### What the data represent

- Biological system: human K562 leukemia cells (Homo sapiens) treated with daunorubicin, a DNA‑damaging chemotherapeutic known to stabilize and activate p53. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC6561911/)
- Assay: ChIP‑seq using an antibody against p53, so reads should be enriched around genomic regions where p53 is bound after drug treatment. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC4526040/)
- Sequencing: Illumina NextSeq 500, with the raw run SRR9090854 corresponding to experiment SRX5865974. [github](https://github.com/ncbi/sra-tools/issues/213)
- BAM file: `SRR9090854.subsampled_5pct.bam` is a coordinate‑sorted alignment file containing ~5% of the original mapped reads, typically created by random down‑sampling the full BAM to reduce file size and speed up exploratory analyses while preserving the overall distribution of p53 binding events. [ecseq](https://www.ecseq.com/support/ngs-snippets/how-to-extract-a-list-of-specific-read-IDs-from-a-BAM-file)

### How `bamnostic` is used and what your code does

The `bamnostic` package provides a pure‑Python interface to BAM files that mirrors the `pysam` API, including an `AlignmentFile` class whose iterator yields `AlignedSegment` objects representing individual aligned reads. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html) In your code:

```python
bam_path = "data/SRR9090854.subsampled_5pct.bam"

seqs = [read.seq for read in bs.AlignmentFile(bam_path)]
```

- `bs.AlignmentFile(bam_path)` opens the BAM file as an `AlignmentFile` object in binary read mode (default `'rb'`), reading the BAM header (reference contigs, read groups, etc.) and preparing a streaming interface to all aligned records. [github](https://github.com/betteridiot/bamnostic/blob/master/docs/source/quickstart.rst)
- Iterating over `AlignmentFile` (`for read in bs.AlignmentFile(...)`) returns each aligned read as a `bamnostic.AlignedSegment` object, which exposes properties analogous to SAM fields such as query name, flags, reference name, position, mapping quality, CIGAR string, and the original read **sequence**. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html)
- The `read.seq` attribute is the query (read) sequence string stored in the BAM, corresponding to the full read sequence (including any unaligned bases), as opposed to `query_alignment_sequence`, which would only contain the aligned portion. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html)
- The list comprehension `[read.seq for read in ...]` consumes the entire BAM stream and collects the nucleotide sequences from every subsampled ChIP‑seq read into a Python list `seqs`, which can then be used for downstream tasks such as motif discovery, k‑mer analysis, or quality checks on read content. [ucdavis-bioinformatics-training.github](https://ucdavis-bioinformatics-training.github.io/2022-Feb-Introduction-To-Python-For-Bioinformatics/python/python5)

In summary, your dataset is a 5% random sample of p53‑ChIP‑seq alignments from daunorubicin‑treated K562 cells, and the `bamnostic` code opens the subsampled BAM and extracts the raw read sequences from each aligned fragment into memory as a list.
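
Because the list comprehension loads every read into memory, it can help during development to cap the number of reads and drop sequences the sampler cannot use. The sketch below assumes only that each read object exposes a `.seq` attribute, as described above; `collect_sequences` and `FakeRead` are hypothetical names, and the stand-in reads are used so the example runs without the BAM file.

```python
from collections import namedtuple
from itertools import islice

def collect_sequences(reads, max_reads=None, min_length=1):
    """Collect sequences from any iterable of objects with a .seq attribute."""
    seqs = []
    # islice stops after max_reads items; None means no cap
    for read in islice(reads, max_reads):
        seq = read.seq
        # Skip empty, short, or ambiguous (e.g. N-containing) reads
        if seq and len(seq) >= min_length and set(seq) <= set("ACGT"):
            seqs.append(seq)
    return seqs

# Stand-in reads for illustration only
FakeRead = namedtuple("FakeRead", "seq")
reads = [FakeRead("ACGTAC"), FakeRead("ACNNTA"), FakeRead("AC"), FakeRead("TTTTTT")]

subset = collect_sequences(reads, max_reads=3, min_length=6)
everything = collect_sequences(reads, min_length=6)
```

With the real data, the analogous call would pass `bs.AlignmentFile(bam_path)` as the `reads` argument.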

---
# Project Start

# Functions

# Our code
Check notes in each cell to see if it's what you're trying to do

In [2]:
# Create the seqs list. This only needs to run once and takes a while (~3-5 minutes for Vic).
bam_path = "data/SRR9090854.subsampled_5pct.bam"

seqs = [read.seq for read in bs.AlignmentFile(bam_path)]


In [3]:
#Check that seqs has initialized (and view some data)
pprint(seqs[0:10])

['CCTAACCCTAACCCTAACCCTAACCCTATCCAGATCG',
 'ACCCTAACCCTAACCCAAACCCTAACCCTAACAGATC',
 'CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC',
 'CTAACCCTAACCCTAACCCTAACCCTAACCCTAAC',
 'CTATCCCTAACCCTAACCCTAACCCTAACCCTAACC',
 'CGATATCCTAACCCTAACCCTAACCCTAACCCTAACC',
 'ATCTACCCTAACCCTAACCCTAACCCTAACCCTAAC',
 'TACCCCTAACCCTAACCCTAACCCTAACCCTAACCCT',
 'CCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCC',
 'CCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCG']


In [13]:
# Run several independent samplers in parallel to check convergence; tallying their motifs gives a "consensus" sequence.
# Skip this cell if you only want the PFM for seqlogo.

threads = 16

results = Parallel(n_jobs=threads)([delayed(project03.GibbsMotifFinder)(k=6, speed="fast", rtol=1e-5, max_iter=6000) for _ in range(threads)])

top_k = consensus(results)
print(top_k)

[('AAAAAA', 118190), ('TTTTTT', 117764), ('ATTTTT', 64784), ('AAAAAT', 64463), ('TATTTT', 53983), ('AAAATA', 53425), ('TTTTTA', 50967), ('TTTCTT', 50642), ('TAAAAA', 50264), ('AGAAAA', 49118), ('TTTTCT', 48549), ('TTCTTT', 48204), ('AGGCTG', 47928), ('CAGCCT', 47603), ('AAGAAA', 47555), ('TTTTTG', 46688), ('AAAGAA', 46036), ('CAAAAA', 45589), ('CCTCCC', 45276), ('TTATTT', 45180)]
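
Several entries in this list are reverse complements of each other with similar counts (for example `AAAAAA`/`TTTTTT` and `AGGCTG`/`CAGCCT`). The definition of `consensus` is not shown in this notebook, so the following is only a sketch of how such a tally could merge the two strands, assuming each run returns a list of motif strings; `canonical` and `tally_motifs` are hypothetical names.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def tally_motifs(runs, top_n=20):
    """Count canonical k-mers across all runs and return the most common."""
    counts = Counter(canonical(m) for motifs in runs for m in motifs)
    return counts.most_common(top_n)

# Two made-up runs for illustration
runs = [["AAAAAA", "TTTTTT", "AGGCTG"], ["CAGCCT", "AAAAAA"]]
top = tally_motifs(runs)
```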


In [4]:
# generate your pwm to get seqlogo - this will take a few seconds
pwm = project03.GibbsMotifFinder(speed="fast", rtol=1e-5, atol=1e-10)

processed_data.npz exists, reading from path
data ready :)
array([[850847, 850816, 852667, 852757, 853186, 849427],
       [694227, 692144, 687858, 686268, 683276, 685789],
       [685325, 684541, 685707, 686649, 691857, 693290],
       [853098, 855996, 857265, 857823, 855178, 854991]], dtype=int32)
beginning fast iteration
Iteration 397
array([[850888, 850829, 852670, 852763, 853175, 849433],
       [694217, 692135, 687848, 686262, 683270, 685775],
       [685297, 684550, 685681, 686633, 691850, 693284],
       [853095, 855983, 857298, 857839, 855202, 855005]], dtype=int32)

After 398 iterations, final motif list: ['TCCAGA', 'CCTAAC', 'AACCCT', 'AACCCT', 'CTAACC', 'CTAACC', 'CTAACC', 'CCCTAA', 'CTAACC', 'TAACCC', 'TGACCT', 'TCCGCC', 'AAGCCT', 'GGCCGC', 'GGGCAA', 'TAGGGC', 'CAGGGC', 'TAGACT', 'GGCACT', 'TGTATA']


In [5]:
# generate your pwm to get seqlogo - this will take a few seconds
pwm = project03.GibbsMotifFinder(speed="pythonic", rtol=1e-5, atol=1e-10)

processed_data.npz exists, reading from path
data ready :)

array([[851287, 851667, 854659, 852958, 852773, 847891],
       [692817, 692029, 686559, 686237, 683763, 686388],
       [686676, 684793, 685759, 687361, 692575, 695636],
       [852717, 855008, 856520, 856941, 854386, 853582]], dtype=int32)

After 0 iterations, final motif list: ['CCTAAC', 'AACAGA', 'CCCTAA', 'AACCCT', 'ATCCCT', 'TAACCC', 'CCTAAC', 'ACCCCT', 'CGCGGT', 'CAGCCG', 'CGCCCG', 'CAGAGG', 'AGAGAC', 'TCAGAA', 'TGCAGG', 'CTGCAG', 'GAGTGG', 'TGCAGG', 'GGGCAC', 'CTGCAG']


---
# Driver Program
Don't change any of the code here. If you have completed the project by following the coding contract, the code below should work.

In [6]:
#Marcus' code:
#If everything is correct, we should not need to edit this
#update: I edited it (pointing to project03 python file for functions)

# Run the gibbs sampler:
promoter_pfm = project03.GibbsMotifFinder(seqs, 10)

# Plot the final pfm that is generated: 
seqlogo.seqlogo(seqlogo.CompletePm(pfm = promoter_pfm.T))

processed_data.npz exists, reading from path
data ready :)

array([[849643, 849576, 852657, 853449, 853070, 853525, 854504, 851736,
        852715, 848506],
       [696939, 693818, 692106, 689535, 688185, 685567, 682907, 681496,
        683066, 685170],
       [685265, 684120, 682315, 682892, 684672, 687758, 688210, 692177,
        694247, 696185],
       [851650, 855983, 856419, 857621, 857570, 856647, 857876, 858088,
        853469, 853636]], dtype=int32)

After 0 iterations, final motif list: ['CTAACCCTAA', 'AACCCTAACC', 'CCCTAACCCT', 'CCTAACCCTA', 'TAACCCTAAC', 'TAACCCTAAC', 'CTAACCCTAA', 'ACCCTAACCC', 'TAACCCTAAC', 'ACCCTAACCC', 'AACTGTGCTC', 'AGAGGACAAC', 'AGCCTACGGG', 'CAGAAAAGCC', 'TGCTGGCGAC', 'GCGACTAGGG', 'TGGTGGCCAG', 'TTGCTTAGAC', 'CTGGCGCCGG', 'CTGCAGGGCC']


AttributeError: 'list' object has no attribute 'T'
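
The `AttributeError` indicates that `GibbsMotifFinder` returned a plain Python list here, and lists have no `.T` attribute. If the returned list is already a nested 4 x k list of counts, wrapping it as `np.asarray(promoter_pfm).T` would be enough. If it is instead the final motif list printed above (an assumption, since the return value is not shown), a count matrix has to be built first. A minimal sketch of that second case, with the hypothetical helper `motifs_to_pfm`:

```python
import numpy as np

def motifs_to_pfm(motifs, alphabet="ACGT"):
    """Build a 4 x k count matrix from equal-length motif strings."""
    k = len(motifs[0])
    pfm = np.zeros((len(alphabet), k), dtype=int)
    for motif in motifs:
        for pos, base in enumerate(motif):
            pfm[alphabet.index(base), pos] += 1
    return pfm

# Three made-up motifs for illustration
motifs = ["CTAACC", "CTAACC", "TAACCC"]
pfm = motifs_to_pfm(motifs)

# A NumPy array supports .T, giving the k x 4 layout used in the driver cell
pfm_t = pfm.T
```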