### Inputs:
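- Working directory: `/private/home/yxu267/anaconda3/envs/dimelo/lib/python3.10/site-packages`. The script changes into this directory before loading data, so the relative file path for `CENPA_chrom_arm_df` is resolved against it.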

### Outputs:
- Prints the current working directory to verify the change.
- Three tab-separated files (saved with a `.csv` extension, no header row) are loaded into pandas DataFrames:
  1. `CENPA_chrom_arm_df`: Contains data for chromosome arm random regions.
     - File path: `CENPA_chromosome_arm_random_region_dict_region_density_scores.csv`
  2. `CENPA_CDR_df`: Contains data for CENPA old CDR regions.
     - File path: `/private/groups/migalab/dan/data_analysis/young_old_analysis/CENPA_old_CDR_dict_region_density_scores_A.csv`
  3. `CENPA_non_active_df`: Contains data for CENPA old non-CDR active regions.
     - File path: `/private/groups/migalab/dan/data_analysis/young_old_analysis/CENPA_old_none_CDR_active_dict_region_density_scores_A.csv`

### DataFrame Columns:
- `CENPA_chrom_arm_df`:
  - `Chromosome`: Chromosome name
  - `Start/End`: Start and end positions
  - `Density`: Modification density
  - `Coverage`: Region coverage
- `CENPA_CDR_df`:
  - `Chromosome`: Chromosome name
  - `Start/End`: Start and end positions
  - `Density`: Modification density
  - `Coverage`: Region coverage
- `CENPA_non_active_df`:
  - `Chromosome`: Chromosome name
  - `Start/End`: Start and end positions
  - `Density`: Modification density
  - `Coverage`: Region coverage

### Description:
This script changes the working directory, prints it to verify the change, and imports the required libraries. It then defines the column names and loads three tab-separated files (no header row) into pandas DataFrames: chromosome arm random regions, CENPA old CDR regions, and CENPA old non-CDR active regions. Rows whose chromosome name contains `chrX_MATERNAL` or `chrY_PATERNAL` are removed from each DataFrame, leaving only the autosomes (chr1 to chr22). Finally, the list `chromosomes` (`chr1` to `chr22`) is defined for use in later cells.
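
The sex-chromosome filter can be sketched on a toy table. The rows below are made up for illustration; only the column names and the `chrN_MATERNAL` / `chrN_PATERNAL` naming come from the notebook. Assuming sex chromosomes are always named with a `chrX` or `chrY` prefix, a single regular expression would remove both in one pass:

```python
import pandas as pd

column_names = ['Chromosome', 'Start/End', 'Density', 'Coverage']

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame(
    [
        ['chr1_MATERNAL', '[1000, 2000]', 0.5, 30],
        ['chr1_PATERNAL', '[1500, 2500]', 0.7, 28],
        ['chrX_MATERNAL', '[100, 900]', 0.2, 25],
        ['chrY_PATERNAL', '[300, 800]', 0.1, 12],
    ],
    columns=column_names,
)

# str.match anchors at the start of the string, so this flags chrX... and chrY...
is_sex_chrom = toy_df['Chromosome'].str.match(r'chr[XY]')
autosomal_df = toy_df[~is_sex_chrom]

print(autosomal_df['Chromosome'].tolist())
```

The original index labels are kept, as in the cell below, which matters later because regions are looked up by index label.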


In [None]:
import os
# Change the working directory
os.chdir('/private/home/yxu267/anaconda3/envs/dimelo/lib/python3.10/site-packages')

# Verify the change
print(os.getcwd())

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import ast



column_names = ['Chromosome', 'Start/End', 'Density', 'Coverage']
# Load the CSV file into a DataFrame
CENPA_chrom_arm_df = pd.read_csv('CENPA_chromosome_arm_random_region_dict_region_density_scores.csv',
                              header=None,
                              names=column_names,
                              sep='\t')

CENPA_CDR_df = pd.read_csv('/private/groups/migalab/dan/data_analysis/young_old_analysis/CENPA_old_CDR_dict_region_density_scores_A.csv',
                           header=None,
                           names=column_names,
                           sep='\t')

CENPA_non_active_df = pd.read_csv('/private/groups/migalab/dan/data_analysis/young_old_analysis/CENPA_old_none_CDR_active_dict_region_density_scores_A.csv',
                                  header=None,
                                  names=column_names,
                                  sep='\t')

# These lines remove the X and Y chromosomes

CENPA_chrom_arm_df = CENPA_chrom_arm_df[~CENPA_chrom_arm_df['Chromosome'].str.contains('chrX_MATERNAL')]
CENPA_chrom_arm_df = CENPA_chrom_arm_df[~CENPA_chrom_arm_df['Chromosome'].str.contains('chrY_PATERNAL')]

CENPA_non_active_df = CENPA_non_active_df[~CENPA_non_active_df['Chromosome'].str.contains('chrX_MATERNAL')]
CENPA_non_active_df = CENPA_non_active_df[~CENPA_non_active_df['Chromosome'].str.contains('chrY_PATERNAL')]

CENPA_CDR_df = CENPA_CDR_df[~CENPA_CDR_df['Chromosome'].str.contains('chrX_MATERNAL')]
CENPA_CDR_df = CENPA_CDR_df[~CENPA_CDR_df['Chromosome'].str.contains('chrY_PATERNAL')]

chromosomes = [f'chr{i}' for i in range(1, 23)]


### Inputs:
- Reference genome file in FASTA format:
  - File path: `/private/groups/migalab/dan/reference/hg002v1.0.1.fasta`

### Outputs:
- `assembly`: Dictionary containing chromosome names as keys and their sequences as values.
  - Example:
    ```python
    {
        'chr1_PATERNAL': 'ATCGATCG...',
        'chr1_MATERNAL': 'GCTAGCTA...'
    }
    ```
- `assembly_sequence_length`: Dictionary containing chromosome names as keys and their sequence lengths as values.
  - Example:
    ```python
    {
        'chr1_PATERNAL': 248956422,
        'chr1_MATERNAL': 242193529
    }
    ```

### Description:
This script parses a reference genome in FASTA format and loads it into a dictionary. It opens the reference genome file, reads each sequence using `SeqIO.parse`, and stores the sequences in the `assembly` dictionary with chromosome names as keys. It also creates a dictionary `assembly_sequence_length` to store the length of each chromosome sequence. Finally, it closes the reference genome file. The resulting dictionaries are prepared for further analysis, with `assembly` containing the full sequences and `assembly_sequence_length` containing the corresponding lengths of each chromosome.
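
To make the shape of the two dictionaries concrete, here is a small dependency-free sketch that parses an in-memory FASTA string. It is not the notebook's method (the cell below uses `SeqIO.parse`); the helper `read_fasta` and the toy sequences are hypothetical. It assumes the record name is the first whitespace-delimited token of each header line, which is also how Biopython sets `record.id` for FASTA input.

```python
import io

def read_fasta(handle):
    """Return {record_name: sequence} from a FASTA file handle (hypothetical helper)."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if any
            if name is not None:
                sequences[name] = ''.join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    # Store the final record
    if name is not None:
        sequences[name] = ''.join(chunks)
    return sequences

# Toy FASTA content, for illustration only
toy_fasta = io.StringIO(
    ">chr1_PATERNAL toy record\nATCGAT\nCGAA\n"
    ">chr1_MATERNAL toy record\nGCTAGCTA\n"
)

assembly = read_fasta(toy_fasta)
assembly_sequence_length = {name: len(seq) for name, seq in assembly.items()}
print(assembly_sequence_length)
```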


In [None]:
from Bio import SeqIO
# Load the reference genome; later cells count A bases per region from it


assembly_ = open("/private/groups/migalab/dan/reference/hg002v1.0.1.fasta", "r")

#Load the reference genome and make it into a dictionary 
fasta_sequences = SeqIO.parse(assembly_, "fasta")
assembly={}
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    assembly[name] = sequence

#Make a dictionary for all the chromosomes and their corresponding sequence length 
assembly_sequence_length = {}    
for chromosome in assembly:
    assembly_sequence_length[chromosome] = len(assembly[chromosome])
    
assembly_.close()



### Inputs:
- `region_df`: DataFrame containing region data with columns 'Chromosome', 'Start/End', and 'Density'.
  - Example:
    ```python
    {
        'Chromosome': ['chr1_PATERNAL', 'chr1_MATERNAL'],
        'Start/End': ['[1000, 2000]', '[3000, 4000]'],
        'Density': [0.5, 0.7]
    }
    ```
- `assembly`: Dictionary containing chromosome sequences with chromosome names as keys and their sequences as values.
  - Example:
    ```python
    {
        'chr1_PATERNAL': 'ATCGATCG...',
        'chr1_MATERNAL': 'GCTAGCTA...'
    }
    ```

### Outputs:
- `region_weighted_df`: DataFrame containing chromosome names, the absolute percentage difference in mA density between paternal and maternal chromosomes (relative to the maternal value), and the average mA density of the two haplotypes.
  - Columns: `Chromosome`, `Difference`, `Average`
  - Example:
    ```python
    {
        'Chromosome': ['chr1'],
        'Difference': [10.0],
        'Average': [0.6]
    }
    ```
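- `region_combined_df`: DataFrame containing one row per haplotype with the chromosome name, a start and end position, and the comprehensive mA density. As written, `Start/End` holds the coordinates of the last region processed for that haplotype, not the span of all its regions.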
  - Columns: `Chromosome`, `Start/End`, `Density`
  - Example:
    ```python
    {
        'Chromosome': ['chr1_PATERNAL'],
        'Start/End': [[1000, 2000]],
        'Density': [0.5]
    }
    ```

### Description:
The `comprehensive_mA_density_calculation` function calculates a single comprehensive mA density per haplotype for the given regions. It takes a DataFrame containing region data and a dictionary with chromosome sequences. For each autosome (chr1 to chr22) and each haplotype, it counts the `A` bases in every region of the reference sequence, multiplies each count by that region's density to estimate the number of mA bases, and divides the summed mA estimate by the summed `A` count. The result is an average of the region densities weighted by `A` count. The function then averages the paternal and maternal values and computes the absolute percentage difference between them, relative to the maternal value. Results are stored in two DataFrames: `region_weighted_df` (differences and averages, one row per chromosome) and `region_combined_df` (comprehensive densities, one row per haplotype). The function returns both.

The function is applied to three datasets: `CENPA_CDR_df`, `CENPA_chrom_arm_df`, and `CENPA_non_active_df`, producing corresponding weighted and combined DataFrames for each dataset.
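
The weighting can be sketched for a single chromosome as follows. The helper `weighted_density`, the toy sequence, and the region values are hypothetical; the arithmetic mirrors the description above. The guard for regions without any `A` bases is an added assumption and is not part of the notebook's function.

```python
import ast

def weighted_density(sequence, regions):
    """A-count-weighted mean density over (start_end_string, density) pairs."""
    total_A = 0
    total_mA = 0.0
    for start_end, density in regions:
        # 'Start/End' values are stored as strings such as '[1000, 2000]'
        start, end = ast.literal_eval(start_end)
        a_count = sequence[int(start):int(end)].count('A')
        total_A += a_count
        total_mA += a_count * density
    # Assumption: return NaN instead of dividing by zero when no A bases are found
    return total_mA / total_A if total_A else float('nan')

toy_sequence = 'AAAATTTTAACC'  # hypothetical haplotype sequence

# Paternal: two regions with 4 and 2 A bases; maternal: one region with 4 A bases
pat = weighted_density(toy_sequence, [('[0, 4]', 0.5), ('[8, 12]', 0.2)])
mat = weighted_density(toy_sequence, [('[0, 8]', 0.5)])

average = (pat + mat) / 2
difference = abs((pat - mat) / mat * 100)
print(pat, mat, average, difference)
```

Here the paternal value works out to (4 × 0.5 + 2 × 0.2) / 6 ≈ 0.4, so the region with more `A` bases pulls the result toward its own density.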


In [None]:
   
def comprehensive_mA_density_calculation(region_df, assembly):   
    region_weighted_df = pd.DataFrame(columns=['Chromosome', 'Difference', 'Average'])
    region_combined_df = pd.DataFrame(columns=['Chromosome', 'Start/End', 'Density'])

    # Autosomes chr1 to chr22
    chromosomes = [f'chr{i}' for i in range(1, 23)]

    for chrom in chromosomes:

        maternal_col = f'{chrom}_MATERNAL'
        paternal_col = f'{chrom}_PATERNAL'

        # Grab the row indices for each haplotype and put them in separate lists
        MAT_indices = region_df.index[region_df['Chromosome'] == maternal_col].tolist()
        PAT_indices = region_df.index[region_df['Chromosome'] == paternal_col].tolist()

        # Collect the start and end positions of each haplotype's regions into separate lists
        PAT_CDR_coordinates_from_density = region_df.loc[PAT_indices]['Start/End'].tolist()
        MAT_CDR_coordinates_from_density = region_df.loc[MAT_indices]['Start/End'].tolist()
        
        PAT_CDR_coordinates_total_A = 0 
        PAT_CDR_coordinates_total_mA = 0 

        # Estimate the total mA per haplotype: the A count of each region in the reference genome
        # times the region's density (densities were calculated in the previous script)
        for coordinate in range (len(PAT_CDR_coordinates_from_density)): 
            PAT_CDR_coordinates_start_end = ast.literal_eval(PAT_CDR_coordinates_from_density[coordinate])
            PAT_CDR_coordinates_start = int (PAT_CDR_coordinates_start_end[0])
            PAT_CDR_coordinates_end = int (PAT_CDR_coordinates_start_end[1])
            PAT_CDR_A_count = assembly[paternal_col][PAT_CDR_coordinates_start:PAT_CDR_coordinates_end].count('A')
            PAT_CDR_coordinates_total_A += PAT_CDR_A_count
            PAT_CDR_coordinates_total_mA += PAT_CDR_A_count * region_df['Density'][PAT_indices[coordinate]]


        PAT_CDR_coordinates_comprehensive_density = PAT_CDR_coordinates_total_mA / PAT_CDR_coordinates_total_A
        new_row = pd.DataFrame({
                                    'Chromosome': [paternal_col],
                                    'Start/End': [[PAT_CDR_coordinates_start, PAT_CDR_coordinates_end]],
                                    'Density': [PAT_CDR_coordinates_comprehensive_density]
                                })

        region_combined_df = pd.concat([region_combined_df, new_row], ignore_index=True)


        MAT_CDR_coordinates_total_A = 0 
        MAT_CDR_coordinates_total_mA = 0 

        for coordinate in range (len(MAT_CDR_coordinates_from_density)): 
            MAT_CDR_coordinates_start_end = ast.literal_eval(MAT_CDR_coordinates_from_density[coordinate])
            MAT_CDR_coordinates_start = int (MAT_CDR_coordinates_start_end[0])
            MAT_CDR_coordinates_end = int (MAT_CDR_coordinates_start_end[1])
            MAT_CDR_A_count = assembly[maternal_col][MAT_CDR_coordinates_start:MAT_CDR_coordinates_end].count('A')
            MAT_CDR_coordinates_total_A += MAT_CDR_A_count
            MAT_CDR_coordinates_total_mA += MAT_CDR_A_count * region_df['Density'][MAT_indices[coordinate]]
   
        MAT_CDR_coordinates_comprehensive_density = MAT_CDR_coordinates_total_mA / MAT_CDR_coordinates_total_A
        new_row = pd.DataFrame({
                                    'Chromosome': [maternal_col],
                                    'Start/End': [[MAT_CDR_coordinates_start, MAT_CDR_coordinates_end]],
                                    'Density': [MAT_CDR_coordinates_comprehensive_density]
                                })
        region_combined_df = pd.concat([region_combined_df, new_row], ignore_index=True)


        chromosome_average = (PAT_CDR_coordinates_comprehensive_density + MAT_CDR_coordinates_comprehensive_density) / 2


        diff = abs (((PAT_CDR_coordinates_comprehensive_density - MAT_CDR_coordinates_comprehensive_density)/ MAT_CDR_coordinates_comprehensive_density)*100)
        regional_new_row = pd.DataFrame({
                                    'Chromosome': [chrom],
                                    'Difference': [diff],
                                    'Average': [chromosome_average]
                                })
        
        region_weighted_df = pd.concat([region_weighted_df, regional_new_row], ignore_index=True)
    return region_weighted_df, region_combined_df

CENPA_CDR_A_weighted_df, CENPA_CDR_A_combined_df = comprehensive_mA_density_calculation(CENPA_CDR_df, assembly)
CENPA_chrom_arm_weighted_df, CENPA_chrom_arm_A_combined_df = comprehensive_mA_density_calculation(CENPA_chrom_arm_df, assembly)
CENPA_non_active_weighted_df, CENPA_non_active_A_combined_df = comprehensive_mA_density_calculation(CENPA_non_active_df, assembly)



### Inputs:
- `df`: DataFrame containing chromosome density data with columns 'Chromosome', 'Start/End', and 'Density'.
- `type_label`: String label indicating the type of region (e.g., 'non_CDR_active', 'CDR', 'chrom_arm').
- `type_pos`: x-axis position of the paternal points for this region type; the maternal points are placed at `type_pos + 0.2`.

### Outputs:
- `strip_plot_df`: DataFrame formatted for strip plotting with columns 'Chromosome', 'Density', 'Type', and 'Type_Pos'.
  - Example:
    ```python
    {
        'Chromosome': ['chr1_PATERNAL', 'chr1_MATERNAL'],
        'Density': [0.5, 0.7],
        'Type': ['PAT_non_CDR_active', 'MAT_non_CDR_active'],
        'Type_Pos': [0.5, 0.7]
    }
    ```
- `combined_df`: Concatenated DataFrame from the processed DataFrames for different region types.

### Description:
The script defines a function `process_df` that reshapes a DataFrame of chromosome density data for strip plotting. For each autosome, it averages the `Density` values of the paternal and of the maternal haplotype (using 0 when a haplotype has no rows) and records each average with a type label (`PAT_<type_label>` or `MAT_<type_label>`) and an x-axis position (`type_pos` for paternal, `type_pos + 0.2` for maternal). The rows are collected into a new DataFrame, which is returned.

The script then applies `process_df` to three datasets: `CENPA_non_active_A_combined_df`, `CENPA_CDR_A_combined_df`, and `CENPA_chrom_arm_A_combined_df`, each with different region types ('non_CDR_active', 'CDR', and 'chrom_arm'). The processed DataFrames are concatenated into a single DataFrame `combined_df`.

For plotting, the script defines a position mapping for each type of region and calculates the median density values. It uses seaborn to create a scatter plot, with different colors and styles for each type. Lines are drawn between paternal and maternal points for each chromosome and region type. Median density values are represented by bars on the plot. The plot is customized with appropriate labels, a title, and a legend, and displayed using `plt.show()`.
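
The per-type medians that the bars represent can also be computed in one step with `groupby`. This is a sketch on made-up values, assuming a table with the same `Type` and `Density` columns as `combined_df`. For labels present in the table, it gives the same values as filtering on each label in turn.

```python
import pandas as pd

# Hypothetical strip-plot table, for illustration only
toy_combined_df = pd.DataFrame({
    'Chromosome': ['chr1_PATERNAL', 'chr2_PATERNAL', 'chr3_PATERNAL',
                   'chr1_MATERNAL', 'chr2_MATERNAL', 'chr3_MATERNAL'],
    'Density': [0.10, 0.30, 0.20, 0.40, 0.60, 0.50],
    'Type': ['PAT_CDR'] * 3 + ['MAT_CDR'] * 3,
})

# One median per type label
medians_dict = toy_combined_df.groupby('Type')['Density'].median().to_dict()
print(medians_dict)
```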


In [None]:

# Function to process the data and create a DataFrame for strip plot
def process_df(df, type_label,type_pos):
    strip_plot_df = pd.DataFrame(columns=['Chromosome', 'Density', 'Type','Type_Pos'])
    new_rows = []
    for chrom in chromosomes:
        maternal_col = f'{chrom}_MATERNAL'
        paternal_col = f'{chrom}_PATERNAL'

        MAT_indices = df.index[df['Chromosome'] == maternal_col].tolist()
        PAT_indices = df.index[df['Chromosome'] == paternal_col].tolist()

        PAT_values = df.loc[PAT_indices]['Density'].tolist()
        MAT_values = df.loc[MAT_indices]['Density'].tolist()

        PAT_density_avg = np.mean(PAT_values) if PAT_values else 0
        MAT_density_avg = np.mean(MAT_values) if MAT_values else 0

        # Append new row data for paternal and maternal columns to the list
        new_rows.append({'Chromosome': paternal_col,
                         'Density': PAT_density_avg,
                         'Type': f'PAT_{type_label}',
                         'Type_Pos': type_pos})
        
        new_rows.append({'Chromosome': maternal_col,
                         'Density': MAT_density_avg,
                         'Type': f'MAT_{type_label}',
                         'Type_Pos': type_pos + 0.2})

    # Create a DataFrame from the list of new rows and concatenate with the existing DataFrame
    new_rows_df = pd.DataFrame(new_rows)
    strip_plot_df = pd.concat([strip_plot_df, new_rows_df], ignore_index=True)

    return strip_plot_df

# Assuming 'chromosomes' and the DataFrames are already defined

strip_plot_non_active_df = process_df(CENPA_non_active_A_combined_df, 'non_CDR_active',0.5)
strip_plot_CDR_df = process_df(CENPA_CDR_A_combined_df, 'CDR', 1.6)
strip_plot_chrom_arm_df = process_df(CENPA_chrom_arm_A_combined_df, 'chrom_arm', 1.0)

# Concatenate the processed DataFrames
combined_df = pd.concat([strip_plot_non_active_df, 
                         strip_plot_chrom_arm_df,
                         strip_plot_CDR_df], ignore_index=True)

# x-axis position for each type (must match the positions passed to process_df above)
type_pos_map = {'PAT_non_CDR_active': 0.5,
                'MAT_non_CDR_active': 0.7,
               'PAT_chrom_arm':1.0,
               'MAT_chrom_arm':1.2,
                'PAT_CDR': 1.6,
                'MAT_CDR': 1.8}
#combined_df['Type_Pos'] = combined_df['Type'].map(type_pos_map)
print (combined_df)

medians_dict = {type_label: combined_df[combined_df['Type'] == type_label]['Density'].median() for type_label in type_pos_map.keys()}

# Plotting
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Type_Pos', y='Density', hue='Type', style='Type', data=combined_df, s=100, palette=['blue', 'red', 'green', 'orange','black','pink'])

# Drawing lines between the points
for chrom in chromosomes:
    for type_label in ['non_CDR_active', 'CDR','chrom_arm']:
        pat_row = combined_df[(combined_df['Chromosome'] == f'{chrom}_PATERNAL') & (combined_df['Type'] == f'PAT_{type_label}')]
        mat_row = combined_df[(combined_df['Chromosome'] == f'{chrom}_MATERNAL') & (combined_df['Type'] == f'MAT_{type_label}')]
        if not pat_row.empty and not mat_row.empty:
            plt.plot([type_pos_map[f'PAT_{type_label}'], type_pos_map[f'MAT_{type_label}']],
                     [pat_row['Density'].values[0], mat_row['Density'].values[0]],
                     color='gray', linestyle='--')

for type_label, median in medians_dict.items():
    plt.bar(type_pos_map[type_label], median, width=0.1, color='blue', alpha=0.5)

            
# Customizing the plot
plt.xticks(list(type_pos_map.values()), list(type_pos_map.keys()))
plt.title('CENPA density values between homologs in adaptive sequencing in old passaged cells')
plt.xlabel('Type')
plt.ylabel('Density')
plt.legend(title='Type', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()


### Inputs:
- `CENPA_CDR_A_combined_df`: DataFrame containing columns 'Chromosome' and 'Density'.
  - Example:
    ```python
    {
        'Chromosome': ['chr1_PATERNAL', 'chr1_MATERNAL'],
        'Density': [0.5, 0.7]
    }
    ```

### Outputs:
- Bar plot displaying the density values for CENPA CDR between paternal and maternal chromosomes.

### Description:
This script processes a DataFrame (`CENPA_CDR_A_combined_df`) to add new columns indicating whether each entry is 'Paternal' or 'Maternal' and to create a simplified chromosome label without the paternal/maternal designation. The script then sets up a color palette to differentiate between paternal and maternal data points.

Using seaborn, the script creates a bar plot where the x-axis represents the chromosomes and the y-axis represents the density values. The data is grouped by the new 'Type' column ('Paternal' or 'Maternal') and colored accordingly. The plot is customized with a title and axis labels, and displayed using `plt.show()`.
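
The two derived columns can be sketched on a toy table. This variant splits the name once on the first underscore instead of using `str.replace` and a lambda, as the cell below does. It assumes every chromosome name has the form `chrN_PATERNAL` or `chrN_MATERNAL`, and the sample values are made up.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame({
    'Chromosome': ['chr1_PATERNAL', 'chr1_MATERNAL', 'chr2_PATERNAL', 'chr2_MATERNAL'],
    'Density': [0.5, 0.7, 0.4, 0.6],
})

# Split "chr1_PATERNAL" into the chromosome label and the haplotype suffix
parts = toy_df['Chromosome'].str.split('_', n=1, expand=True)
toy_df['Chromosome_Label'] = parts[0]
toy_df['Type'] = parts[1].str.capitalize()  # 'PATERNAL' -> 'Paternal'

print(toy_df[['Chromosome_Label', 'Type']])
```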


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'CENPA_CDR_A_combined_df' is already defined and contains 'Chromosome' and 'Density'
# Add a simplified chromosome label without the paternal/maternal suffix
CENPA_CDR_A_combined_df['Chromosome_Label'] = CENPA_CDR_A_combined_df['Chromosome'].str.replace('_PATERNAL', '').str.replace('_MATERNAL', '')

# Add a new column to indicate if the data is 'Paternal' or 'Maternal'
CENPA_CDR_A_combined_df['Type'] = CENPA_CDR_A_combined_df['Chromosome'].apply(lambda x: 'Paternal' if 'PATERNAL' in x else 'Maternal')

# Set the color palette
palette = {'Paternal': 'blue', 'Maternal': 'red'}

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x='Chromosome_Label', y='Density', hue='Type', data=CENPA_CDR_A_combined_df, palette=palette, errorbar=None)

plt.title('CENPA CDR density values between homologs in adaptive sequencing in old passaged cells')
plt.xlabel('Chromosome')
plt.ylabel('Density')

plt.show()
