# Project 03: Gibbs Sampling (Random Algorithm)

In [4]:
import random
import numpy as np
import bamnostic as bs
import seqlogo as sl

#import functions for building sequence motifs & identifying seqs matching a motif
from data_readers import get_fasta, get_gff
from seq_ops import get_seq
from motif_ops import build_pfm, build_pwm, pfm_ic, score_kmer



---
## Implement Gibbs Sampler


Gibbs sampling is a Markov chain Monte Carlo (MCMC) approach that can be used to find enriched sequence patterns. Here we will implement a method to identify motifs from a set of regions.

Important considerations:
- We will need to score each sequence with a PWM using the `score_kmer()` or `score_sequence()` functions
    - You will need to look through the help documentation and libraries to work out how best to use these functions.
- These sites are often not strand-specific, so scores on both the positive and negative strands should be considered
- To select a random sequence, use `random.randint()` (upper bound inclusive) or `numpy.random.randint()` (upper bound exclusive)
- To select a new position $m$ (as defined below), use `random.choices()` or `numpy.random.choice()`
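
The strand and random-selection points above can be sketched as follows. `reverse_complement` and `best_strand_score` are hypothetical helpers, not functions from `motif_ops`, and the scoring function is passed in as `score_fn` because the exact signature of `score_kmer()` is not shown in this notebook.

```python
import random
import numpy as np

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]

def best_strand_score(kmer, score_fn):
    """Score a k-mer on both strands and keep the higher score.

    score_fn is any callable that maps a k-mer string to a number.
    """
    return max(score_fn(kmer), score_fn(reverse_complement(kmer)))

# The two randint functions treat the upper bound differently:
n_seqs = 5
i_std = random.randint(0, n_seqs - 1)  # upper bound is inclusive
i_np = np.random.randint(0, n_seqs)    # upper bound is exclusive
```

Taking the maximum over strands is only one option; summing the two strand weights before sampling is another reasonable choice.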

Assumptions: 
- We know $k$, the length of the expected motif
- Each sequence contains the motif



```
GibbsMotifFinder(DNA, k-length)
    Motifs ← random k-length sequence picked from each line of DNA
    for j ← 1 to 10000 or until Motifs stop changing
        i ← Random(N) where N is number of DNA entries
        PWM ← PWM constructed from all Motifs except for Motif_i
        Motif_i ← select position m from PWM-scored k-mers in DNA_i in probabilistic fashion from score distribution
    return PFM constructed from Motifs
```
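
The leave-one-out step in the pseudocode (building the PWM from every motif except the one being resampled) might look like the sketch below. `toy_pfm` and `toy_pwm` are hypothetical stand-ins written only for illustration; the pseudocount, the uniform background, and the A/C/G/T row order are assumptions, and the real `build_pfm()` and `build_pwm()` in `motif_ops` may behave differently.

```python
import numpy as np

BASES = "ACGT"

def toy_pfm(motifs):
    """Count matrix of shape (4, k); rows are A, C, G, T (assumed layout)."""
    k = len(motifs[0])
    pfm = np.zeros((4, k))
    for motif in motifs:
        for pos, base in enumerate(motif):
            pfm[BASES.index(base), pos] += 1
    return pfm

def toy_pwm(pfm, pseudocount=1.0, background=0.25):
    """Log2-odds matrix against a uniform background, with pseudocounts."""
    smoothed = pfm + pseudocount
    probs = smoothed / smoothed.sum(axis=0)
    return np.log2(probs / background)

motifs = ["ACGT", "ACGA", "TCGT", "ACCT"]
i = 2  # index of the sequence whose motif is being resampled
others = motifs[:i] + motifs[i + 1:]  # every motif except motif i
pwm = toy_pwm(toy_pfm(others))
```

The pseudocount keeps the log-odds finite when a base never appears at a position among the remaining motifs.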

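Probability of choosing position $m$: $P(m) = \frac{A_{m}}{\sum_{l}A_{l}}$, where the sum runs over all candidate positions $l$ in DNA$_i$ and $A_{l}$ is the non-negative weight derived from the PWM score of the $k$-mer starting at position $l$.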
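
As a minimal sketch of this sampling step, the snippet below turns a list of scores into probabilities and draws one position. It assumes the scores are log2-odds values, so that $A_l = 2^{\text{score}_l}$ is always positive; if your scoring function already returns non-negative weights, the exponentiation should be dropped. `sample_position` is a hypothetical name.

```python
import numpy as np

def sample_position(scores, rng):
    """Draw an index m with probability A_m / sum_l A_l.

    Assumes scores are log2-odds, so A_l = 2 ** score_l is a positive weight.
    """
    weights = np.exp2(np.asarray(scores, dtype=float))
    probs = weights / weights.sum()
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
m, probs = sample_position([-1.0, 0.0, 3.0], rng)
```

Here the third position carries weight 8 out of a total of 9.5, so it is chosen most of the time but not always, which is what lets the sampler escape poor early choices.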


**Note:** I have also added a function to `motif_ops.py` that will calculate the information content of your motifs. This is useful to observe the progression of your Gibbs sampler as well as a measure of convergence. You can use this function as `IC = pfm_ic(pfm)`. You should expect a slow increase of IC until it plateaus such as in the plot below from your lecture slides:

<center><img src='figures/Gibbs_Sampling.png' width='600px'/></center>
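
For intuition about what the information content measures, here is a small sketch using one common definition: 2 bits minus the Shannon entropy at each position, summed over positions, with a uniform background. This is an assumed definition for illustration only; `pfm_ic()` in `motif_ops.py` may use a different background or pseudocount, and `toy_information_content` is a hypothetical name.

```python
import numpy as np

def toy_information_content(pfm, pseudocount=1e-9):
    """Total information content (bits) of a 4 x k count matrix.

    Uses IC = sum over positions of (2 + sum_b p_b * log2 p_b),
    which assumes a uniform background. pfm_ic may differ.
    """
    counts = np.asarray(pfm, dtype=float) + pseudocount
    probs = counts / counts.sum(axis=0)
    return float(np.sum(2.0 + np.sum(probs * np.log2(probs), axis=0)))

# Two perfectly conserved positions versus two uninformative positions
conserved = np.array([[10, 0], [0, 10], [0, 0], [0, 0]])
uniform = np.full((4, 2), 5)
```

A fully conserved column contributes close to 2 bits and a uniform column contributes close to 0, so a rising total suggests the sampled motifs are becoming more similar to each other.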

---
# Setting up the data

This BAM file contains a subsampled set of aligned ChIP‑seq reads from a p53 immunoprecipitation experiment in human K562 cells treated with the anthracycline drug daunorubicin. The original SRA experiment SRX5865974 (run SRR9090854) reports 31.2 million Illumina NextSeq 500 ChIP‑seq read pairs from K562 wild‑type cells exposed to daunorubicin, using a p53 antibody to pull down p53‑bound chromatin fragments before library preparation and sequencing. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC4366240/)

### What the data represent

- Biological system: human K562 leukemia cells (Homo sapiens) treated with daunorubicin, a DNA‑damaging chemotherapeutic known to stabilize and activate p53. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC6561911/)
- Assay: ChIP‑seq using an antibody against p53, so reads should be enriched around genomic regions where p53 is bound after drug treatment. [pmc.ncbi.nlm.nih](https://pmc.ncbi.nlm.nih.gov/articles/PMC4526040/)
- Sequencing: Illumina NextSeq 500, with the raw run SRR9090854 corresponding to experiment SRX5865974. [github](https://github.com/ncbi/sra-tools/issues/213)
- BAM file: `SRR9090854.subsampled_5pct.bam` is a coordinate‑sorted alignment file containing ~5% of the original mapped reads, typically created by random down‑sampling the full BAM to reduce file size and speed up exploratory analyses while preserving the overall distribution of p53 binding events. [ecseq](https://www.ecseq.com/support/ngs-snippets/how-to-extract-a-list-of-specific-read-IDs-from-a-BAM-file)

### How `bamnostic` is used and what your code does

The `bamnostic` package provides a pure‑Python interface to BAM files that mirrors the `pysam` API, including an `AlignmentFile` class whose iterator yields `AlignedSegment` objects representing individual aligned reads. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html) In your code:

```python
bam_path = "data/SRR9090854.subsampled_5pct.bam"

seqs = [read.seq for read in bs.AlignmentFile(bam_path)]
```

- `bs.AlignmentFile(bam_path)` opens the BAM file as an `AlignmentFile` object in binary read mode (default `'rb'`), reading the BAM header (reference contigs, read groups, etc.) and preparing a streaming interface to all aligned records. [github](https://github.com/betteridiot/bamnostic/blob/master/docs/source/quickstart.rst)
- Iterating over `AlignmentFile` (`for read in bs.AlignmentFile(...)`) returns each aligned read as a `bamnostic.AlignedSegment` object, which exposes properties analogous to SAM fields such as query name, flags, reference name, position, mapping quality, CIGAR string, and the original sequencing **sequence**. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html)
- The `read.seq` attribute is the query (read) sequence string stored in the BAM, corresponding to the full read sequence (including any unaligned bases), as opposed to `query_alignment_sequence`, which would only contain the aligned portion. [bamnostic.readthedocs](https://bamnostic.readthedocs.io/en/latest/bamnostic.html)
- The list comprehension `[read.seq for read in ...]` consumes the entire BAM stream and collects the nucleotide sequences from every subsampled ChIP‑seq read into a Python list `seqs`, which can then be used for downstream tasks such as motif discovery, k‑mer analysis, or quality checks on read content. [ucdavis-bioinformatics-training.github](https://ucdavis-bioinformatics-training.github.io/2022-Feb-Introduction-To-Python-For-Bioinformatics/python/python5)
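
Raw reads can be shorter than $k$ or contain ambiguous bases such as `N`, and it is not shown here whether `build_pfm()` tolerates them. A cautious option is to filter the list before sampling. `clean_reads` below is a hypothetical helper, not part of `bamnostic` or `motif_ops`.

```python
def clean_reads(seqs, k):
    """Keep upper-cased reads that are at least k long and contain only A/C/G/T."""
    cleaned = []
    for seq in seqs:
        if seq is None:  # skip records without a stored sequence
            continue
        seq = seq.upper()
        if len(seq) >= k and set(seq) <= set("ACGT"):
            cleaned.append(seq)
    return cleaned

example = ["ACGTACGTAC", "ACGNNACGTA", "ACG", "acgtacgtacgt"]
kept = clean_reads(example, k=10)
```

In the notebook this could be applied as `seqs = clean_reads(seqs, k)` before calling the sampler.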

In summary, your dataset is a 5% random sample of p53‑ChIP‑seq alignments from daunorubicin‑treated K562 cells, and the `bamnostic` code opens the subsampled BAM and extracts the raw read sequences from each aligned fragment into memory as a list.

In [None]:
bam_path = "data/SRR9090854.subsampled_5pct.bam"

seqs = [read.seq for read in bs.AlignmentFile(bam_path)] 

---
# Project Start

In [2]:
def initialize_random_motif(seqs, k):
    '''Pick one random k-mer from each sequence as the starting motif set.'''
    motif_sequence_list = []
    for seq in seqs:
        # high is exclusive, so add 1 to allow the last valid start position
        motif_start_index = np.random.randint(low=0, high=len(seq) - k + 1)
        motif_sequence_list.append(seq[motif_start_index:motif_start_index + k])
    return motif_sequence_list

In [3]:
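def update_motifSeqs_list(motif_list, new_motif, seq_ind):
    '''Replace the motif for sequence seq_ind and rebuild the PFM.'''
    motif_list[seq_ind] = new_motif
    # Recompute the PFM from the updated motifs
    new_pfm = build_pfm(motif_list)
    return motif_list, new_pfm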



In [None]:
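def select_new_motif(motif_list, seqs, k):
    '''
    Function iterates 1 time. Omits one motif, recalculates the pwm, scores every k-mer
    in the omitted sequence, and chooses a new motif stochastically.
    '''
    omitted_seq_ind = np.random.randint(low=0, high=len(seqs))
    # Leave out the current motif of the chosen sequence, not the sequence itself
    temp_motif_list = motif_list[:omitted_seq_ind] + motif_list[omitted_seq_ind + 1:]
    pwm = build_pwm(build_pfm(temp_motif_list))

    seq = seqs[omitted_seq_ind]
    kmers = [seq[m:m + k] for m in range(len(seq) - k + 1)]
    # ASSUMPTION: score_kmer(kmer, pwm) returns a log2-odds score; check help(score_kmer)
    weights = np.array([2.0 ** score_kmer(kmer, pwm) for kmer in kmers])
    # Sample a start position with probability proportional to its weight
    new_start = np.random.choice(len(kmers), p=weights / weights.sum())
    return kmers[new_start], omitted_seq_ind
select_new_motif(initialize_random_motif(seqs, 10), seqs, 10)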


In [None]:
def GibbsMotifFinder(seqs, k, seed=None):
    '''
    Function to find a pfm from a list of strings using a Gibbs sampler

    Args:
        seqs (str list): a list of sequences, not necessarily of the same length
        k (int): the length of motif to find
        seed (int, default=None): seed for the random number generators

    Returns:
        pfm (numpy array): dimensions are 4xlength
    '''
    # Seed the global generators, since the helper functions draw from np.random
    random.seed(seed)
    np.random.seed(seed)

    max_iterations = 10000  # upper bound from the pseudocode above
    patience = 1000         # assumed cutoff: consecutive unchanged updates before stopping

    # Select starting random motifs in each sequence
    motif_list = initialize_random_motif(seqs, k)
    # Create initial PFM and record its information content
    pfm = build_pfm(motif_list)
    ic_history = [pfm_ic(pfm)]  # kept so the IC progression can be inspected or plotted

    unchanged = 0
    for _ in range(max_iterations):
        new_motif, seq_ind = select_new_motif(motif_list, seqs, k)
        # Count how many updates in a row left the motif set unchanged
        if new_motif == motif_list[seq_ind]:
            unchanged += 1
        else:
            unchanged = 0
        motif_list, pfm = update_motifSeqs_list(motif_list, new_motif, seq_ind)
        ic_history.append(pfm_ic(pfm))
        if unchanged >= patience:
            break

    # Always return a PFM so the driver code can plot it
    return pfm

---
# Driver Program
Don't change any of the code here. If you have completed the project according to the coding-by-contract specification, the following code should work.

In [None]:
# Run the gibbs sampler:
promoter_pfm = GibbsMotifFinder(seqs,10 )

# Plot the final pfm that is generated: 
sl.seqlogo(sl.CompletePm(pfm = promoter_pfm.T))