### Inputs:
- `bamfile`: Path to the BAM file containing alignment data.
- `assembly_`: Path to the reference genome in FASTA format.

### Outputs:
- `assembly`: Dictionary containing chromosome names as keys and their sequences as values.
  - Example:
    ```python
    {
        'chr1': 'ATCGATCG...',
        'chr2': 'GCTAGCTA...'
    }
    ```
- `assembly_sequence_length`: Dictionary containing chromosome names as keys and their sequence lengths as values.
  - Example:
    ```python
    {
        'chr1': 248956422,
        'chr2': 242193529
    }
    ```
- Prints the time taken to load the reference genome and create the dictionaries.

### Description:
This code imports necessary libraries, sets up the environment, and loads alignment data from a BAM file. It reads a reference genome from a FASTA file, creating dictionaries to store chromosome sequences and their lengths. Additionally, it measures and prints the time taken to perform these operations.


In [30]:
import pysam
import numpy as np
from Bio import SeqIO
import time 
import matplotlib.patches as patches
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import itertools
import pandas as pd 
from tabulate import tabulate
import csv
import random

import os
# Change the working directory
os.chdir('/private/home/yxu267/anaconda3/envs/dimelo/lib/python3.10/site-packages')

# Verify the change
print(os.getcwd())



np.set_printoptions(threshold=np.inf)
min_quality_score = 8

#Load the bam file 
bamfile = pysam.AlignmentFile(
    "/private/groups/migalab/dan/06_11_24_R1041_UL_DiMeLo_CENPAyoung_1/20240611_1126_1H_PAW33460_814408d8/pod5/06_11_24_R1041_UL_DiMeLo_CENPAyoung_1_5mA_6mC_winnowmap_MD_mA_mC.bam",
    "rb") 


assembly_ = open("/private/groups/migalab/dan/reference/hg002v1.0.1.fasta", "r")

start_time = time.time()

#Load the reference genome and make it into a dictionary 
fasta_sequences = SeqIO.parse(assembly_, "fasta")
assembly={}
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    assembly[name] = sequence

#Make a dictionary for all the chromosomes and their corresponding sequence length 
assembly_sequence_length = {}    
for chromosome in assembly:
    assembly_sequence_length[chromosome] = len(assembly[chromosome])
    
end_time = time.time()
elapsed_time = end_time - start_time
print (elapsed_time, "seconds")
assembly_.close()



[W::hts_idx_load3] The index file is older than the data file: /private/groups/migalab/dan/06_11_24_R1041_UL_DiMeLo_CENPAyoung_1/20240611_1126_1H_PAW33460_814408d8/pod5/06_11_24_R1041_UL_DiMeLo_CENPAyoung_1_5mA_6mC_winnowmap_MD_mA_mC.bam.csi


62.05266785621643 seconds


### Inputs:
- `input_file`: Path to the BED file containing chromosome and active array data.

### Outputs:
- `active_dict`: Dictionary where chromosome names are keys and the values are lists of active array blocks.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```

### Description:
The purpose of this code is to create an active array dictionary where each chromosome name serves as a key. The corresponding value is a list of active array blocks represented by their start and end positions. The code reads the input BED file, processes each line to extract chromosome information and active array positions, and populates the `active_dict` dictionary accordingly. It handles changes in chromosome number and parental status to reset or continue the numbering of active blocks as needed.


In [10]:
'''The purpose of the code here is to make an active array where the chromosome name is
the dictionary key. The active array blocks are in lists inside the key.
'''
import os
input_file = '/private/groups/migalab/dan/data_analysis/alpha_bed/hg002v1.0.fasta.manualAlpha.cenSat_H1L_merged.bed'
active_dict = {}
with open(input_file, 'r') as infile:  
    num = 0 
    previous_chr_num = ""
    previous_parental_status = ""
    for i in infile:
        chr_num = i.split ('_')[0]
        parental_status = i.split ('_')[1][0:8]
        if chr_num == previous_chr_num and parental_status !=previous_parental_status:
            num = 0 
        elif chr_num == previous_chr_num and parental_status ==previous_parental_status:
            pass 
        else: 
            num = 0 
        active = i.split ()
        if active[0] not in active_dict:
            
            active_dict[active[0]] = [[int(active[1]) ,int(active[2])]]
        else:
            active_dict[active[0]].append([int(active[1]) ,int(active[2])])
        previous_chr_num = chr_num 
        previous_parental_status = parental_status


### Inputs:
- `active_dict`: Dictionary where chromosome names are keys and the values are lists of active array blocks.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```

### Outputs:
- `active_array_range`: Dictionary where chromosome names are keys and the values are lists of active array sub-ranges, each sub-range having a length of 1000.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [2000, 3000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```

### Description:
This code creates a new dictionary, `active_array_range`, which contains chromosome names as keys and lists of active array sub-ranges as values. For each chromosome in the input dictionary `active_dict`, the code iterates over the active array blocks and subdivides each block into smaller ranges of 1000 base pairs. These sub-ranges are then added to the `active_array_range` dictionary. The resulting dictionary provides a more granular representation of the active array blocks.


In [14]:
active_array_range={}
for chromosome in active_dict:
    active_array_range[chromosome] = []
    for fragment in range (active_dict[chromosome][0][0], active_dict[chromosome][0][1],1000):

        active_array_range[chromosome].append ([fragment,fragment+1000])
        

### Inputs:
- `input_file`: Path to the BED file containing CDR regions data.

### Outputs:
- `CDR_dict`: Dictionary where chromosome names are keys and the values are lists of CDR regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```

### Description:
This code formulates the CDR (Conserved Domain Region) regions and lists them for each chromosome. It reads the input BED file and processes each line to extract chromosome numbers and CDR start and end positions. For each chromosome, it adds the CDR regions to the `CDR_dict` dictionary. If a chromosome is not already in the dictionary, it initializes a new list for that chromosome. If the chromosome is already present, it appends the new CDR region to the existing list.


In [31]:
''' here in the code, I am formulating the CDR regions and listing the CDRs in each and every chromosome'''
input_file = '/private/groups/migalab/dan/data_analysis/young_old_analysis/HG002_DiMeLo_CENPA_youngpassage.hmmCDR_only_CDR_Dan_certified.bed'
CDR_dict = {}
with open(input_file, 'r') as infile:  
    for i in infile:
        chr_num = i.split('\t')[0]
        CDR_start = int (i.split('\t')[1]) 
        CDR_end = int(i.split('\t')[2].split('\n')[0]) 
        if chr_num not in CDR_dict:
            CDR_dict[chr_num] = [[CDR_start ,CDR_end]]
        elif chr_num in CDR_dict:  
            CDR_dict[chr_num].append ([CDR_start ,CDR_end])





### Inputs:
- `CDR_dict`: Dictionary where chromosome names are keys and the values are lists of CDR regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```

### Outputs:
- `CDR_adjacent`: Dictionary where chromosome names are keys and the values are lists of CDR adjacent regions, with each adjacent region being 1000 base pairs long.
  - Example:
    ```python
    {
        'chr1': [[-1, 999], [2001, 3001], [2999, 3999], [4001, 5001]],
        'chr2': [[-501, 499], [1501, 2501]]
    }
    ```

### Description:
This code creates a dictionary, `CDR_adjacent`, to store the regions adjacent to the CDR (Conserved Domain Region) regions for each chromosome. For each chromosome in the `CDR_dict` dictionary, it iterates over the CDR regions and calculates the left and right adjacent regions, each 1000 base pairs long. These adjacent regions are then added to the `CDR_adjacent` dictionary under the corresponding chromosome key. The resulting dictionary provides information on the genomic regions immediately adjacent to each CDR region.


In [16]:
'''Based on the CDR regions, I am obtaining CDR adjacent regions in the same format as the CDR data set above'''
CDR_adjacent = {}
for chromosome in CDR_dict: 
    CDR_adjacent[chromosome] =[]
    for CDR in CDR_dict[chromosome]: 
        #print (CDR) 
        CDR_adjacent_left_space = [int(CDR[0]) - 1001, int(CDR[0]) - 1]
        CDR_adjacent_right_space = [int(CDR[1]) + 1, int(CDR[1]) + 1001]
        
        CDR_adjacent[chromosome].append (CDR_adjacent_left_space)
        CDR_adjacent[chromosome].append (CDR_adjacent_right_space)
        
    

### Inputs:
- `CDR_dict`: Dictionary where chromosome names are keys and the values are lists of CDR regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```
- `active_dict`: Dictionary where chromosome names are keys and the values are lists of active array blocks represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[0, 100000]],
        'chr2': [[0, 100000]]
    }
    ```

### Outputs:
- `none_CDR_active`: Dictionary where chromosome names are keys and the values are lists of non-CDR active regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[0, 999], [2001, 2999], [4001, 99999]],
        'chr2': [[0, 499], [1501, 99999]]
    }
    ```

### Description:
This code generates a dictionary, `none_CDR_active`, to store non-CDR active regions for each chromosome. It uses the `CDR_dict` and `active_dict` dictionaries to determine the regions that are active but not part of the CDRs. For each chromosome, it calculates the regions from the start of the active block to the start of the first CDR, the regions between consecutive CDRs, and the regions from the end of the last CDR to the end of the active block. These non-CDR regions are added to the `none_CDR_active` dictionary under the corresponding chromosome key. The resulting dictionary provides information on the active genomic regions that are not part of any CDR.


In [17]:
none_CDR_active = {}
for chromosome in CDR_dict: 
    none_CDR_active[chromosome] = [] 
    active_start = active_dict[chromosome][0][0]
    active_end = active_dict[chromosome][0][1]
    CDR_start_position = int (CDR_dict[chromosome][0][0]) - 1 
    CDR_end_position = int (CDR_dict[chromosome][-1][1]) + 1
    none_CDR_active[chromosome].append ([active_start, CDR_start_position])
    for coordinate in range(len(CDR_dict[chromosome]) - 1):
        # Generate in-between coordinates
        
        start_of_next = int (CDR_dict[chromosome][coordinate + 1][0]) - 1 
        end_of_current = int (CDR_dict[chromosome][coordinate][1]) + 1 
        # Ensure there's no overlap and the next start is greater than the current end
        if int(start_of_next) > int(end_of_current):
            none_CDR_active[chromosome].append([end_of_current, start_of_next])
    
    none_CDR_active[chromosome].append ([CDR_end_position, active_end])


### Inputs:
- `segment_num`: Number of random segments to pick.
- `chr_name`: Chromosome name for which the random regions are to be generated.
- `H1L_active_dict`: Dictionary where chromosome names are keys and the values are lists of active array blocks represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[0, 100000]],
        'chr2': [[0, 100000]]
    }
    ```
- `assembly_sequence_length`: Dictionary containing chromosome names as keys and their sequence lengths as values.
  - Example:
    ```python
    {
        'chr1': 248956422,
        'chr2': 242193529
    }
    ```

### Outputs:
- `chromosome_arm_random_region_dict`: Dictionary where chromosome names are keys and the values are lists of randomly picked chromosome arm regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[0, 99999], [200000, 299999], [400000, 499999]],
        'chr2': [[0, 99999], [200000, 299999], [400000, 499999]]
    }
    ```

### Description:
This code defines a function `chromosome_arm_random_region` to generate random regions in the chromosome arms for a specified chromosome. It first calculates the active array regions for the chromosome and determines the excluded portions. Then, it randomly selects a defined number of segments from the non-active regions, ensuring no duplicate random numbers are chosen. The selected regions are expanded to chromosome positions and added to a list. The function returns a list of randomly picked regions for the chromosome.

The code then creates a dictionary, `chromosome_arm_random_region_dict`, by applying the `chromosome_arm_random_region` function to each chromosome in `CDR_dict`. The resulting dictionary stores the randomly picked chromosome arm regions for each chromosome.


In [18]:
def chromosome_arm_random_region (segment_num, chr_name, H1L_active_dict, assembly_sequence_length): 
    # create variables to contain regions chosen 
    chromosome_arm_regions = []
    excluded_portion = []
    H1L_active_length = 0 
    
    for active_region in H1L_active_dict[chr_name]: 
        #calculate the length of the active array region for each chromosome 
        H1L_active_length = active_region[1] - active_region[0]
        
        #defining where the chromosome arm regions are
        
        #The left side of the active array 
        chromosome_arm_regions.append([0,active_region[0] - 1])
        
        #The right side of the active array 
        chromosome_arm_regions.append([active_region[1] + 1,assembly_sequence_length[chr_name]])
        
        #calculate the percentage portion of where the active array is in and add them to the exclusion bin 
        start_portion = active_region[0] / H1L_active_length
        end_portion = active_region[1] / H1L_active_length
        excluded_portion.append (int(start_portion))
        excluded_portion.append (int(end_portion))
        
    #calculate the total portions of the active array 
    total_segment_amount = int(assembly_sequence_length[chr_name] / H1L_active_length)
    
    
    #pick defined amount of random numbers between 0 and the pre defined amount of random numbers 
    random_numbers = []
    for num in range(segment_num):
        while True: 
            #if the same random number gets picked twice, repeat 
            current_random_number = random.randint(0, total_segment_amount)
            if current_random_number not in (excluded_portion and random_numbers):
                break
            
            
        random_numbers.append(current_random_number)
    
    #Expand the chromosome portion number to chromosome position number by multiplying 
    random_picked_regions = [num * H1L_active_length for num in random_numbers]
    uncoded_region_list = []
    
    #Make a dictionary that contains randomly picked region for each chromosome 
    for item in random_picked_regions: 
        arms_region_start = item
        arms_region_end = item + H1L_active_length
        uncoded_region_list.append([arms_region_start, arms_region_end])
        
    return uncoded_region_list


chromosome_arm_random_region_dict = {}
for chromosome in CDR_dict: 
    chromosome_arm_random_region_dict[chromosome] = chromosome_arm_random_region (3,
                                                                             chromosome, 
                                                                             active_dict, 
                                                                             assembly_sequence_length)


### Inputs:
- `cdr_adjacent`: Dictionary containing chromosome names as keys and lists of regions as values. The regions are represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[-1, 999], [2001, 3001]],
        'chr2': [[-501, 499], [1501, 2501]]
    }
    ```

### Outputs:
- `filtered_cdr_adjacent`: Dictionary with the same structure as the input but with invalid regions removed. Regions where the end position is not greater than the start position are excluded.
  - Example:
    ```python
    {
        'chr1': [[2001, 3001]],
        'chr2': [[1501, 2501]]
    }
    ```

### Description:
The function `check_lists` filters out invalid regions from a dictionary of chromosome regions. For each chromosome, it iterates through the list of regions and checks if the end position of each region is greater than its start position. If a region is invalid, it is excluded, and a message is printed indicating the removal of the invalid region. The function returns a new dictionary with only valid regions.

The `check_lists` function can be used to validate and filter regions in various dictionaries, such as `CDR_dict`, `CDR_adjacent`, `none_CDR_active`, and `chromosome_arm_random_region_dict`, ensuring that only valid regions are retained for further analysis.


In [25]:
def check_lists(cdr_adjacent):
    filtered_cdr_adjacent = {}
    
    for chromosome, regions in cdr_adjacent.items():
        valid_regions = []
        for region in regions:
            if region[1] > region[0]:
                valid_regions.append(region)
            else:
                print(f"Removed invalid region in {chromosome}: {region[1]} is not larger than {region[0]}")
        if valid_regions:
            filtered_cdr_adjacent[chromosome] = valid_regions

    return filtered_cdr_adjacent
CDR_dict= check_lists(CDR_dict)
CDR_adjacent = check_lists(CDR_adjacent)
none_CDR_active = check_lists(none_CDR_active)
chromosome_arm_random_region_dict = check_lists(chromosome_arm_random_region_dict)



### Inputs:
- `mod_no_dash`: Numpy array of the mod without any insertions and deletions.
- `alignment_dash`: Alignment sequence with dashes (insertions).
- `target_start_no_dash`: The starting position of the subset without dashes.
- `target_end_no_dash`: The ending position of the subset without dashes.

### Outputs:
- `mod_subset`: Numpy array representing the subset of the mod without any insertions and deletions.

### Description:
The function `mod_subset_producing_step` isolates the desired regions (subset) from a numpy array of mod sequences (`mod_no_dash`) without insertions and deletions. It takes an alignment sequence with dashes (`alignment_dash`) and target start and end positions without dashes. The function creates a mask to identify non-dash positions, generates cumulative counts for non-dash positions, and maps these counts to the positions in the original alignment sequence.

The function calculates the start and end positions of the target region within the dashed alignment sequence. It then extracts the subset of the alignment sequence, removes dashes, and calculates the corresponding start and end positions within the mod numpy array. The resulting subset of the mod numpy array (`mod_subset`) is returned, representing the desired region without any insertions or deletions.


In [21]:
'''
The idea of this function is to isolate the the desired regions (here in the function, it is called the subset) in the mod 
numpy array without dashes(insertions)'''

def mod_subset_producing_step (mod_no_dash,alignment_dash,target_start_no_dash,target_end_no_dash):
    #mod_no_dash = is the numpy array of the mod without any insertions and deletions
    # alignment_dash = is the alignment sequence with the dashes in it 
    # target_start = it's the subset starting position WITHOUT the dashes!!! 


    # Create a mask to identify non-dash positions
    mask = [char != '-' for char in alignment_dash]

    # Generate cumulative counts only for True values in the mask
    cumulative_counts = list(itertools.accumulate(mask))


    
    # Create the final indexes list
    indexes = [count - 1 if is_non_dash else '-' for count, is_non_dash in zip(cumulative_counts, mask)]



    target_start_dash = indexes.index (target_start_no_dash)

        
    try:
        target_end_dash = indexes.index (target_end_no_dash)
    except ValueError: 
        target_end_dash = indexes[-1]




    #obtain dashed alignment 
    alignment_dash_sequence_pre_subset = alignment_dash[0:target_start_dash]
    alignment_dash_sequence_subset = alignment_dash[target_start_dash:target_end_dash]

    #create no dash alignment 
    alignment_no_dash_sequence_pre_subset = alignment_dash_sequence_pre_subset.replace("-","")
    alignment_no_dash_sequence_subset = alignment_dash_sequence_subset.replace("-","")

    subset_no_dash_start = len(alignment_no_dash_sequence_pre_subset)
    subset_no_dash_end = subset_no_dash_start + len(alignment_no_dash_sequence_subset)

    #make mod_no_dash alignment
    mod_subset = mod_no_dash[subset_no_dash_start:subset_no_dash_end]

    return mod_subset





### Inputs:
- `chromosome_coordinates`: Dictionary containing chromosome names as keys and lists of regions represented by their start and end positions.
  - Example:
    ```python
    {
        'chr1': [[1000, 2000], [3000, 4000]],
        'chr2': [[500, 1500]]
    }
    ```
- `name`: Name identifier for the output file.
- `bamfile`: BAM file object containing read data.
- `mod_tag`: Modification tag to calculate the density for (e.g., 'A' for adenine).

### Outputs:
- `data_table`: List of lists containing chromosome name, region, average region density, and normalized region base coverage.
  - Example:
    ```python
    [
        ['chr1', [1000, 2000], 0.123456, 0.987654],
        ['chr2', [500, 1500], 0.654321, 0.876543]
    ]
    ```
- CSV file: Saved to the specified directory with region density scores.

### Description:
The `region_read_mA_density_calculator` function calculates the density of modifications (e.g., methylation) in specified regions of chromosomes. It takes a dictionary of chromosome coordinates, a name identifier, a BAM file, and a modification tag. For each region in each chromosome, the function fetches reads from the BAM file, aligns them, and calculates the modification density. It handles different scenarios where regions and reads overlap in various ways, ensuring that only valid reads and regions are processed. The function then compiles the data into a table and saves it as a CSV file. It also prints the name identifier and the formatted table.


In [1]:

def region_read_mA_density_calculator (chromosome_coordinates,name,bamfile,mod_tag): 
    data_table = [] 

    #get each chromosome
    for chr_name in chromosome_coordinates:
        for region in chromosome_coordinates[chr_name]:
            region_density = []
            region_base = 0 
            
            region_start_index = int(region[0])
            region_end_index = int(region[1])

            
            for read in bamfile.fetch(chr_name,region_start_index,region_end_index):

                #make an if statement to check a specific read front, middle, end regions 
                #setting read start, end, density, length variables 
                    
                #Get the starting and ending positions of the reads 
                read_start_position = read.reference_start
                read_end_position = read.reference_end
                read_density = 0 
        
                #Get sequence information which shows deletions and insertions 
                sequence = read.get_aligned_pairs(matches_only=False, with_seq = True)


                #make a numpy of the sequence length which eliminates the deletion
                
                read_sequence_insertion_included = ''
                genomic_alignment_sequence_deletion_mistach_included = ''
                
                for item in sequence:
                    if item[0] is None:
                        read_sequence_insertion_included+='-'
                    elif item[1] is None:
                        genomic_alignment_sequence_deletion_mistach_included += '-'
                    else: 
                        read_sequence_insertion_included+=item[2]
                        genomic_alignment_sequence_deletion_mistach_included +=item[2]

                
                read_sequence_insertion_included = read_sequence_insertion_included.upper()
                genomic_alignment_sequence_deletion_mistach_included = genomic_alignment_sequence_deletion_mistach_included.upper()


                genomic_alignment_sequence_deletion_mistach_included_mask = np.array(
                    [char != '-' for char in genomic_alignment_sequence_deletion_mistach_included])

                #take sequence length excluding insertions 
                insertions = read_sequence_insertion_included.count ("-")
                no_insertion_no_deletion_sequence_length = len(read_sequence_insertion_included) 
                
                
                
                #removing reads shorter than 50000 
                #if no_insertion_no_deletion_sequence_length < 50000:
                #    continue 

                #make a mod np array with the length of the read length
                mod=read.modified_bases_forward
                
                #make a mod score with its original length 
                mod_score = np.zeros(len(genomic_alignment_sequence_deletion_mistach_included),)

                
                #make transfer mA positions to mod np array corresponded to their sequence positions 
                try:
                    if mod_tag == 'A':
                        for indices, values in mod[('A', 0, 'a')]:
                            mod_score[indices] = values
                        mod_score = mod_score[genomic_alignment_sequence_deletion_mistach_included_mask]
                        


                    elif mod_tag == 'CG':
                        for indices, values in mod[('C', 0, 'm')]:
                            mod_score[indices] = values
                        mod_score = mod_score[genomic_alignment_sequence_deletion_mistach_included_mask]
                    
                    if read.is_reverse:
                            mod_score = mod_score[::-1]


                # No mod would return KeyError 
                except KeyError:
                    continue

                
                    


                # if the regions are longer than the reads 
                if (region_end_index - region_start_index) > (read_end_position - read_start_position):
                    # scenario 4: if the reads are inside the region
                    if (region_end_index >= read_end_position) and (region_start_index <= read_start_position): 
                        mod_start = 0
                        mod_end = len(read_sequence_insertion_included)
                    
                    # scenario 5: if the reads cover the later part of the region
                    elif (region_end_index < read_end_position) and (region_start_index > read_start_position): 
                        mod_start = 0
                        mod_end = no_insertion_no_deletion_sequence_length - read_end_position - region_end_index

                    # scenario 6: if the reads cover the starting part of the region 
                    elif (region_end_index > read_end_position) and (region_start_index > read_start_position): 
                        mod_start = region_start_index - read_start_position 
                        mod_end = no_insertion_no_deletion_sequence_length

                        
                
                # if the reads are longer than the region selected 
                elif (region_end_index - region_start_index) <= (read_end_position - read_start_position):
                    # scenario 1: when the defined region is inside the read
                    if (read_start_position <= region_start_index) and (read_end_position >= region_end_index):
                        mod_start = region_start_index - read_start_position 
                        mod_end = region_end_index - read_start_position

                    # scenario 3: when the defined region covers a bit of the end of the read
                    elif (read_end_position < region_end_index) and (read_end_position > region_start_index):
                        mod_start = region_start_index - read_start_position

                        mod_end = no_insertion_no_deletion_sequence_length

                    # scenario 2: when the defined region covers a bit of the beginning of the read
                    elif (read_start_position > region_start_index) and (read_start_position < region_end_index):
                        mod_start = 0
                        mod_end = region_end_index - read_start_position 


                #use the defined starting and ending positons in the region to subset mod numpy
                if (region_start_index - read_start_position) > (no_insertion_no_deletion_sequence_length - insertions):
                    continue
                try:
                    trimmed_mod_score = mod_subset_producing_step (mod_score,read_sequence_insertion_included,mod_start,mod_end)
                except ValueError:
                    continue
                
            
                region_base += (mod_end - mod_start)
                #removing all the zeros 
                mod_no_zeros = trimmed_mod_score[trimmed_mod_score != 0]
                m_mod_tag = len (mod_no_zeros)
                

                #Getting the total amount of As in the subsetted region of the sequence 
                total_mod_tag = read_sequence_insertion_included[mod_start:mod_end].count(mod_tag)
                
                
                #calculate read density
                try:
                    read_density = m_mod_tag / total_mod_tag
                    
                except ZeroDivisionError:
                    pass
                region_density.append (read_density)

                
                    
                
            #calculate averaged region density average 
            try:
                region_density_average = sum(region_density)/len(region_density)
            
            except ZeroDivisionError:
                region_density_average = 0
            data_table.append ([chr_name,region,region_density_average,region_base/(region_end_index - region_start_index)])
            #print (chr_name, region_end_index - region_start_index, region_base )

        
    table = tabulate(data_table, headers="firstrow", tablefmt="fancy_grid", floatfmt=".18f")
    
    filename = f"/private/groups/migalab/dan/data_analysis/young_old_analysis/{name}_region_density_scores_{mod_tag}.csv"
    with open(filename, "w", newline="") as csvfile:
        writer = csv.writer(csvfile,delimiter="\t")
        writer.writerows(data_table)

    print (name)
    print (table)
    

#region_read_mA_density_calculator(test_dict,'test_dict',bamfile)

#for scenario in scenarios: 
    #region_read_mA_density_calculator (scenario,str(scenario))
    

In [None]:

import unittest
from unittest.mock import MagicMock, patch
class TestRegionReadMADensityCalculator(unittest.TestCase):

    @patch('pysam.AlignmentFile')
    def test_region_read_mA_density_calculator(self, mock_bamfile):
        # Mock chromosome names
        mock_bamfile.references = ['chr1']
        
        # Define mock reads and return values
        mock_reads = []
        for i in range(100):
            mock_read = MagicMock()
            mock_read.reference_start = 100
            mock_read.reference_end = 500
            # Create modified bases with mA every other A position

            mock_read.modified_bases_forward= {('A', 0, 'a'): [(0, 254), (4, 254), (8, 254),(16, 254), (24, 254),(32, 254), (36, 254), (40, 254),(48, 254), (52, 254)]
                                               +[(104, 254), (108, 254), (112, 254),(116, 254), (152, 254)]
                                               +[(j, 250) for j in range(200, 300, 4)]
                                               +[(396, 254)]
                                               }
            
            # Mock the sequence as a long string of 'A's
            mock_read.get_aligned_pairs.return_value = [(j, j, 'A') for j in range(0, 400)]
            
            mock_read.is_reverse = False
            
            mock_reads.append(mock_read)

        # Mock the fetch method
        mock_bamfile.fetch.side_effect = lambda chr_name, start, end: iter(mock_reads) if chr_name == 'chr1' else iter([])

        # Define input
        chromosome_coordinates = {'chr1': [[100, 200]]}
        name = 'test_output'
        mod_tag = 'A'
        
        # Call the function
        region_read_mA_density_calculator(chromosome_coordinates, name, mock_bamfile,mod_tag)
        
        # Check if the file was created and contains the expected data
        with open('test_output_region_density_scores.csv', 'r') as file:
            content = file.read()
        


if __name__ == '__main__':
    unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestRegionReadMADensityCalculator))


In [None]:

import unittest
from unittest.mock import MagicMock, patch
class TestRegionReadMADensityCalculator(unittest.TestCase):

    @patch('pysam.AlignmentFile')
    def test_region_read_mA_density_calculator(self, mock_bamfile):
        # Mock chromosome names
        mock_bamfile.references = ['chr1']
        
        # Define mock reads and return values
        mock_reads = []
        for i in range(100):
            mock_read = MagicMock()
            mock_read.reference_start = 100
            mock_read.reference_end = 600
            # Create modified bases with mA every other A position

            mock_read.modified_bases_forward= {('A', 0, 'a'): 
                                               
                                               [(j, 250) for j in range(0, 125)]
                                               +[(j, 250) for j in range(126, 227)]
                                               +[(j, 250) for j in range(251, 326)]
                                               +[(j, 250) for j in range(376, 411)]
                                               }
            
            # Mock the sequence as a long string of 'A's

            # Initialize the list to store the tuples
            mock_sequence = []

            # Initialize the second position counter
            second_pos = 0

            # Generate the tuples
            for i in range(500):
                if i % 5 == 0:
                    mock_sequence.append((i, None, None))
                else:
                    mock_sequence.append((i, second_pos, 'A'))
                    second_pos += 1

                # Reset the second position counter if it exceeds 400
                if second_pos > 400:
                    second_pos = 0
            mock_read.get_aligned_pairs.return_value = mock_sequence
            
            mock_read.is_reverse = False
            
            mock_reads.append(mock_read)

        # Mock the fetch method
        mock_bamfile.fetch.side_effect = lambda chr_name, start, end: iter(mock_reads) if chr_name == 'chr1' else iter([])

        # Define input
        chromosome_coordinates = {'chr1': [[400, 500]]}
        name = 'test_output'
        
        # Call the function
        region_read_mA_density_calculator(chromosome_coordinates, name, mock_bamfile,'A')
        
        # Check if the file was created and contains the expected data
        with open('test_output_region_density_scores.csv', 'r') as file:
            content = file.read()
        

if __name__ == '__main__':
    unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestRegionReadMADensityCalculator))


In [None]:

import unittest
from unittest.mock import MagicMock, patch
class TestRegionReadMADensityCalculator(unittest.TestCase):

    @patch('pysam.AlignmentFile')
    def test_region_read_mA_density_calculator(self, mock_bamfile):
        # Mock chromosome names
        mock_bamfile.references = ['chr1']
        
        # Define mock reads and return values
        mock_reads = []
        for i in range(100):
            mock_read = MagicMock()
            mock_read.reference_start = 100
            mock_read.reference_end = 500
            # Create modified bases with mA every other A position

            mock_read.modified_bases_forward= {('A', 0, 'a'): 
                                               
                                               [(i, 250) if (i + 1) % 4 != 0 else (i, 0) for i in range(0,133)]
                                               +[(i, 250) if (i + 1) % 4 != 0 else (i, 0) for i in range(134,202)]

                                               
                                             
                                               }
            
            # Mock the sequence as a long string of 'A's

            # Initialize the list to store the tuples
            # Initialize the list to store the tuples
            mock_sequence = []

            # Initialize the counters for the first and second positions
            first_pos = 0
            second_pos = 100

            # Generate the tuples
            for i in range(400):
                if (i + 1) % 4 == 0:
                    mock_sequence.append((None, second_pos, 'A'))
                else:
                    mock_sequence.append((first_pos, second_pos, 'A'))
                    first_pos += 1
                
                second_pos += 1



            mock_read.get_aligned_pairs.return_value = mock_sequence
            
            mock_read.is_reverse = False
            
            mock_reads.append(mock_read)

        # Mock the fetch method
        mock_bamfile.fetch.side_effect = lambda chr_name, start, end: iter(mock_reads) if chr_name == 'chr1' else iter([])

        # Define input
        chromosome_coordinates = {'chr1': [[233, 366]]}
        name = 'test_output'
        
        # Call the function
        region_read_mA_density_calculator(chromosome_coordinates, name, mock_bamfile)
        
        # Check if the file was created and contains the expected data
        with open('test_output_region_density_scores.csv', 'r') as file:
            content = file.read()
        

if __name__ == '__main__':
    unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestRegionReadMADensityCalculator))





In [27]:
CDR_dict= check_lists(CDR_dict)
CDR_adjacent = check_lists(CDR_adjacent)
none_CDR_active = check_lists(none_CDR_active)
chromosome_arm_random_region_dict = check_lists(chromosome_arm_random_region_dict)

### Inputs:
- `bar_plot_dataset`: List of dictionaries containing chromosome regions to analyze.
  - Example:
    ```python
    [CDR_dict, none_CDR_active, CDR_adjacent]
    ```
- `bar_plot_dataset_names`: List of names corresponding to each dictionary in `bar_plot_dataset`.
  - Example:
    ```python
    ['CENPA_young_CDR_dict', 'CENPA_young_none_CDR_active_dict', 'CDR_young_adjacent']
    ```
- `bamfile`: BAM file object containing read data.

### Outputs:
- Calls the `region_read_mA_density_calculator` function for each dictionary in `bar_plot_dataset` and saves the results to CSV files.

### Description:
This code snippet uses parallel processing to calculate the modification density for different sets of chromosome regions using the `region_read_mA_density_calculator` function. It iterates over the list of dictionaries (`bar_plot_dataset`), each containing different sets of chromosome regions, and their corresponding names (`bar_plot_dataset_names`). For each dictionary, it calls `region_read_mA_density_calculator`, passing the dictionary, its name, the BAM file, and the modification tag ('A'). The results are saved to CSV files named according to the dataset names.


In [32]:
from joblib import Parallel, delayed

bar_plot_dataset = [CDR_dict, none_CDR_active, CDR_adjacent]
bar_plot_dataset_names = [ 'CENPA_young_CDR_dict', 'CENPA_young_none_CDR_active_dict','CDR_young_adjacent']
def get_variable_name(var, locals_dict):
    for name, value in locals_dict.items():
        if value is var:
            return name
    return None




for dictionary in range (0, len (bar_plot_dataset)): 

    region_read_mA_density_calculator (bar_plot_dataset[dictionary],bar_plot_dataset_names[dictionary],bamfile,'A')
    

CENPA_young_CDR_dict
╒══════════════════╤════════════════════════╤═════════════════════════╤═══════════════════════╕
│ chr10_MATERNAL   │ [42097255, 42102611]   │   0.0069820373839237625 │    47.218073188946974 │
╞══════════════════╪════════════════════════╪═════════════════════════╪═══════════════════════╡
│ chr10_MATERNAL   │ [42139864, 42150884]   │    0.017002641198485494 │ 52.622958257713250418 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr10_MATERNAL   │ [42214295, 42277197]   │    0.022570038330719324 │ 48.115274554068236057 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr10_MATERNAL   │ [42281736, 42284735]   │    0.013577724292006266 │ 55.083361120373460551 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr10_PATERNAL   │ [41440315, 41449687]   │    0.014497516356293321 │ 44.305591122492529621 │
├──────────────────

In [7]:
region_read_mA_density_calculator (CDR_dict,'halo_old_passaged_',bamfile,'A')
#region_read_mA_density_calculator (CDR_dict,'halo_young_passaged_',bamfile,'CG')

halo_old_passaged_
╒══════════════════╤════════════════════════╤═════════════════════════╤═══════════════════════╕
│ chr10_MATERNAL   │ [42092482, 42095482]   │   0.0073657216745463515 │                 42.14 │
╞══════════════════╪════════════════════════╪═════════════════════════╪═══════════════════════╡
│ chr10_PATERNAL   │ [41439071, 41442071]   │    0.015194368586081421 │ 44.005666666666670039 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr11_MATERNAL   │ [54050943, 54053943]   │    0.010462415629489748 │ 36.886666666666663161 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr11_PATERNAL   │ [52510375, 52513375]   │    0.004933663976291847 │ 34.397333333333335759 │
├──────────────────┼────────────────────────┼─────────────────────────┼───────────────────────┤
│ chr12_MATERNAL   │ [35770477, 35773477]   │    0.008693122013280614 │ 55.251333333333334963 │
├──────────────────┼─