### Figure 1E; Figure 5C; Supplementary Table 2

Analysis for the increasing pyrimidine dinucleotide sequence set. Given an increasing number of pyrimidine dinucleotides in 3 sequence contexts, does the signal for CPD, 6-4PP, and UV-DDB increase?


### Overview


**Supplementary Table 2A-C:**

Supplementary Table 2A-C contains increasing pyrimidine dinucleotide count measurements for CPD, 6-4PP, and UV-DDB, respectively, in UV and non-UV conditions. The files are written in **csv** format, to be manually organized and placed in an **xlsx** table.

**Supplementary Table 2D:**

Supplementary Table 2D contains the Jonckheere trend statistics. This notebook generates csv files for use in the Fig1_Fig5_FigS8_TableS2_TableS8_Jonckheere_Statistics.rmd script.

**Figure 1E:**

The output for F1E is 2 svg files containing a row of line plots for CPD and 6-4PP respectively.

**Figure 5C:**

The output for F5C is 3 svg files containing a line plot for sequences A, B, and C. 

### File Input and Output

This notebook covers the analysis for the increasing pyrimidine dinucleotide library, which is used in Figures 1E and 5C and Table S2. It takes as **input** the following files:


| Input File | Associated Figure | Associated Table |
| --- | --- | --- |
| CPD_WC_ID32_alldata.txt | NA | Table S2 |
| CPD_UV_ID33_alldata.txt | Figure 1E | Table S2 |
| 64PP_WC_ID34_alldata.txt | NA | Table S2 |
| 64PP_UV_ID35_alldata.txt | Figure 1E | Table S2 |
| UVDDB_WC_ID28_alldata.txt | NA | Table S2 |
| UVDDB_WC_ID29_alldata.txt | NA | Table S2 |
| UVDDB_UV_ID30_alldata.txt | Figure 5C | Table S2 |
| UVDDB_UV_ID31_alldata.txt | Figure 5C | Table S2 |


And generates the following **output**:

#### (1) Figure Output:

| Output File | Associated Figure | Description |
| --- | --- | --- |
| CPD_F1E.svg | Figure 1E | Top row plots in Figure 1E showing CPD results |
| 64PP_F1E.svg | Figure 1E | Bottom row plots in Figure 1E showing 6-4PP results |
| Fig5C_Sequence_A.svg | Figure 5C | Line plots for Sequence A in Figure 5C |
| Fig5C_Sequence_B.svg | Figure 5C | Line plots for Sequence B in Figure 5C |
| Fig5C_Sequence_C.svg | Figure 5C | Line plots for Sequence C in Figure 5C |


#### (2) Table Output:

- Table_S2A.csv
- Table_S2B.csv
- Table_S2C.csv
- Tables for use in Fig1_Fig5_FigS7_TableS2_TableS7_Jonheere_Statistics.rmd


### 3rd Party Packages

1. Bokeh - Creating plots
2. Numpy - Array usage
3. Pandas - Dataframe usage
4. Scipy - Linear regression

### UV Bind Analysis Core Imports

- uac.ols: OLS analysis trained on sequences that cannot form pyrimidine dimers
- uac.plot_range_from_x_y: Creates a tuple to draw a plot range based on 2 lists
- uac.scale_uv_on_non_uv: Scales UV values on non-UV values based on sequences that cannot form pyrimidine dimers

** Additional details can be found in the uvbind_analysis_core.py script.

### Abbreviations:

- df: DataFrame
- pydi: Pyrimidine dinucleotide


### Imports and meta data

In [1]:
import itertools
import os

from bokeh.layouts import gridplot
from bokeh.palettes import Category10
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, BooleanFilter, CDSView, ColumnDataSource, GroupFilter
from bokeh.io import export_svg, export_png
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests


import uvbind_analysis_core as uac

#Files and Folders
ALLDATA_FOLDER = "../../Data/AllData_Files"
FILES = (("CPD_UV", f"{ALLDATA_FOLDER}/CPD_UV_ID33_alldata.txt"),
         ("SixFour_UV", f"{ALLDATA_FOLDER}/64PP_UV_ID35_alldata.txt"),
         ("UVDDB_r1", f"{ALLDATA_FOLDER}/UVDDB_UV_ID30_alldata.txt"),
         ("UVDDB_r2", f"{ALLDATA_FOLDER}/UVDDB_UV_ID31_alldata.txt"),
         ("UVDDB_NUVr1", f"{ALLDATA_FOLDER}/UVDDB_WC_ID28_alldata.txt"),
         ("UVDDB_NUVr2", f"{ALLDATA_FOLDER}/UVDDB_WC_ID29_alldata.txt"),
         ("CPD_NUV", f"{ALLDATA_FOLDER}/CPD_WC_ID32_alldata.txt"),
         ("SixFour_NUV", f"{ALLDATA_FOLDER}/64PP_WC_ID34_alldata.txt"))
OUTPUT_FOLDER_F1 = "../Figure_1"
OUTPUT_FOLDER_F5 = "../Figure_5"
OUTPUT_FOLDER_T2 = "../Table_S2"
# Figure 1E Parameters
CPD_RANGE = (np.log(1200), np.log(25000))
CPD_Y_TICKER = (np.log(2500), np.log(5000), np.log(15000))
SIXFOUR_Y_TICKER = (np.log(5000), np.log(15000), np.log(50000))
SIXFOUR_RANGE = (np.log(3000), np.log(80000))
F1E_CIRCLE_SIZE = 10
F1E_COLOR_PALETTE = ("black", "#d67c35", "#128fcb")
# Figure 5C Parameters
F5C_PALETTE = ["#1b9e77",'#d95f02','#7570b3','#e7298a']
F5C_Y_TICKERS = ([np.log(20000), np.log(25000), np.log(35000)],
                 [np.log(25000), np.log(30000), np.log(40000)],
                 [np.log(20000), np.log(30000), np.log(50000)])
F5C_CIRCLE_SIZE = 25

In [2]:
# Ensure output folders exist
for i in (OUTPUT_FOLDER_F1, OUTPUT_FOLDER_F5, OUTPUT_FOLDER_T2):
    os.makedirs(i, exist_ok=True)

### Table S2A-C

Given the alldata files:

1. Read each file with uac.process_alldata_file()
2. Filter for the increasing pyrimidine dinucleotide sequence group
    - Only these sequences contain P6 in their name
3. Count the features for pyrimidine dinucleotides (TT, TC, CT, CC)
4. Filter for sequences that contain zero or one type of pyrimidine dinucleotide
    - For example, a sequence with 2 or more TTs would pass, but one with both a TT and a TC would not
5. Save this to a list of pre-median aggregation dataframes to later create the supplementary table (**TS2b Output**)
6. Aggregate the sequence classes by median and transform into natural log space
7. Classify rows by which feature they contain
    - If no feature, classify as all features, creating a new row for each one
8. Save this as a tabular dataset (**TS2c Output**)
9. Relabel the rows for sequences without pyrimidine dinucleotides so that they are copied into each group (TT, TC, CT, CC) as the count of 0, instead of forming a separate group N.
10. Return the dataframe.

This is done for CPD, 6-4PP, and UV-DDB measurements. 
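
Steps 3, 4, and 7 can be illustrated with a minimal, self-contained sketch. It assumes, as `count_pydi` in this notebook does, that each pyrimidine dinucleotide is counted together with its reverse complement. The `PydiCounts` tuple, the helper names, and the toy sequences are hypothetical stand-ins for illustration; they are not taken from uvbind_analysis_core.py or from the array design, and `classify` reads the group from the sequence rather than from the probe name.

```python
from collections import namedtuple

# Hypothetical stand-in for uac.Pydi_Tuple: one field per dinucleotide
PydiCounts = namedtuple("PydiCounts", ["TT", "TC", "CT", "CC"])

# Each pyrimidine dinucleotide and its reverse complement share a field
FIELD_INDEX = {"TT": 0, "AA": 0, "TC": 1, "GA": 1,
               "CT": 2, "AG": 2, "CC": 3, "GG": 3}


def count_dinucleotides(sequence):
    """Count overlapping pyrimidine dinucleotides on both strands (step 3)."""
    counts = [0, 0, 0, 0]
    for position in range(len(sequence) - 1):
        dinucleotide = sequence[position:position + 2]
        if dinucleotide in FIELD_INDEX:
            counts[FIELD_INDEX[dinucleotide]] += 1
    return PydiCounts._make(counts)


def has_single_type(sequence):
    """True if at most one dinucleotide type is present (step 4)."""
    return count_dinucleotides(sequence).count(0) >= 3


def classify(sequence):
    """Return the dinucleotide type present, or 'N' if there is none (step 7)."""
    counts = count_dinucleotides(sequence)
    for field, value in zip(counts._fields, counts):
        if value > 0:
            return field
    return "N"


# Toy sequences, not from the array design
toy = ["ACGTTACGTTAC", "ACGTTACGTCAC", "ACGCACGTACGC"]
kept = [s for s in toy if has_single_type(s)]
groups = {s: classify(s) for s in kept}
```

In this toy set the second sequence contains both a TT and a TC, so it is dropped, while the third has no pyrimidine dinucleotide and is kept as the baseline group.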

#### (1) Functions:

In [3]:
def count_pydi(string):
    """Returns a Pydi Tuple of counts.
    
    Given a string, counts the number of pyrimidine 
    dinucleotides and returns a namedtuple with each
    field representing a dinucleotide in both orientations.
    """
    # List to update with counts
    counts = [0, 0, 0, 0]
    # Dictionary of index positions in counts to update
    pydi_dict = {"TT":0,
                 "AA":0,
                 "TC":1,
                 "GA":1,
                 "CT":2,
                 "AG":2,
                 "CC":3,
                 "GG":3}
    # For each dinucleotide position, update counts
    for position in range(len(string) - 1):
        dinucleotide = string[position:position+2]
        if dinucleotide in pydi_dict:
            counts[pydi_dict[dinucleotide]] += 1
    # Return the counts list as a Pydi_Tuple
    return uac.Pydi_Tuple._make(counts)

def validate_label(label, sequence):
    """Returns True if the label and sequence have the same pydi counts."""
    label_pydi_counts = count_pydi(label)
    sequence_pydi_counts = count_pydi(sequence)
    return label_pydi_counts == sequence_pydi_counts


def validate_pydi_description(labels, sequences):
    """Raises a ValueError if any label does not describe its sequence."""
    for label, sequence in zip(labels, sequences):
        if not validate_label(label, sequence):
            raise ValueError(f"""The label sequence pair:\n{label}\n{sequence} does not match.""")


def pydi_group(string):
    """Returns the pydi named in the string, or "N" if there is none.

    Raises a ValueError if more than one type of pydi is counted.
    """
    total_count = count_pydi(string)
    if total_count.count(0) < 3:
        raise ValueError("Pyrimidine dinucleotide group is ambiguous.")
    for pydi in total_count._fields:
        if pydi in string:
            return pydi
    return "N"

def increasing_pydi_pipeline(file: str, name: str) -> pd.DataFrame:
    """Pipeline for increasing pyrimidine dinucleotide groups."""
    # Read the input file
    df = uac.process_alldata_file(file, False, False)
    # Query for the increasing pyrimidine dinucleotide set
    df = df[df["Name"].str.contains('P6')].reset_index(drop=True)
    # Add a column indicating the sequence group
    df["Sequence_Set"] = df["Name"].apply(lambda x: x.split('_')[0])
    # Check that the sequence name correctly describes the sequence
    validate_pydi_description(df["Name"], df["Sequence"])
    # Filter for only sequences with 1 or 0 types of pyrimidine dinucleotides
    df = df[df["Sequence"].apply(lambda x: count_pydi(x).count(0) >= 3)]
    df = df.reset_index(drop=True)
    # Add a group and count column for the sequence
    df["Group"] = df["Name"].apply(lambda x: pydi_group(x))
    df["Count"] = df["Name"].apply(lambda x: sum(count_pydi(x)))
    df["Sequence_Replicate"] = df["Name"].apply(lambda x: int(x.split('_')[-1][1:]))
    # Rearrange the column order
    df = df[["Sequence_Set","Group", "Count","Sequence_Replicate", "Sequence", "Signal"]]
    df["Experiment"] = name
    return df

def subtable_from_df(experiment, dataframe):
    pivot_index = ["Sequence_Set", "Group", "Count", "Sequence_Replicate", "Sequence"]
    subtable = dataframe[dataframe["Experiment"].str.contains(experiment)].reset_index(drop=True)
    subtable = subtable.pivot(index=pivot_index, values="Signal", columns="Experiment")
    subtable = subtable.reset_index()
    subtable = subtable.sort_values(by=["Sequence_Set", "Group", "Count", "Sequence", "Sequence_Replicate"])
    return subtable

#### (2) Analysis:

In [4]:
# Create a dataframe from all files in FILES using the output from increasing_pydi_pipeline
dfs = []
for name, file in FILES:
    dfs.append(increasing_pydi_pipeline(file, name))
result = pd.concat(dfs)
result = result.reset_index(drop=True)
# Convert the sequence set from the labels in the AMADID design to the final labels
name_dict = {"MITF":"A", "p53":"B", "TBP":"C"}
result["Sequence_Set"] = result["Sequence_Set"].apply(lambda x: name_dict[x])
# Sort the dataframe
result = result.sort_values(by=["Experiment", "Sequence_Set", "Group", "Count", "Sequence"])
result = result.reset_index(drop=True)
# Rename N to No_PyDi to avoid confusion with an "N" base
result["Group"] = result["Group"].apply(lambda x: "No_PyDi" if x == "N" else x)
# Subset sequences to the variable region (first 25bp)
result["Sequence"] = result["Sequence"].apply(lambda x: x[:25])
# Show full table
result

Unnamed: 0,Sequence_Set,Group,Count,Sequence_Replicate,Sequence,Signal,Experiment
0,A,CC,1,1,GTATGCCACGCACGTGCGTACATAC,24,CPD_NUV
1,A,CC,1,10,GTATGCCACGCACGTGCGTACATAC,30,CPD_NUV
2,A,CC,1,5,GTATGCCACGCACGTGCGTACATAC,25,CPD_NUV
3,A,CC,1,4,GTATGCCACGCACGTGCGTACATAC,26,CPD_NUV
4,A,CC,1,8,GTATGCCACGCACGTGCGTACATAC,31,CPD_NUV
...,...,...,...,...,...,...,...
6918,C,TT,3,1,GTACGTTACGTATTATATATTGTAC,29736,UVDDB_r2
6919,C,TT,3,2,GTACGTTACGTATTATATATTGTAC,27600,UVDDB_r2
6920,C,TT,3,5,GTACGTTACGTATTATATATTGTAC,24281,UVDDB_r2
6921,C,TT,3,8,GTACGTTACGTATTATATATTGTAC,24890,UVDDB_r2


In [5]:
# Create individual subtables for use in the xlsx file
for label, dataset in (("A", "CPD"), ("B", "SixFour"), ("C", "UVDDB")):
    table = subtable_from_df(dataset, result)
    table.to_csv(f"{OUTPUT_FOLDER_T2}/Table_S2{label}.csv", index=None)

### Supplementary Table 2D

Create tables for each group to be run through an R script for a Jonckheere trend test (F5_FS8_TS5_Statistics.rmd).
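
As an illustration of what a trend test of this kind measures, the sketch below computes the textbook Jonckheere-Terpstra statistic for groups ordered by dinucleotide count. It is a simplified illustration only: it is not the R script's implementation, it does not compute a p-value, and the function name and toy values are assumptions made for this example.

```python
from itertools import combinations


def jonckheere_statistic(groups):
    """Jonckheere-Terpstra statistic for groups listed in their expected order.

    For every pair of groups (earlier, later), count the value pairs where
    the later value is larger; ties contribute one half.
    """
    statistic = 0.0
    for earlier, later in combinations(groups, 2):
        for x in earlier:
            for y in later:
                if y > x:
                    statistic += 1.0
                elif y == x:
                    statistic += 0.5
    return statistic


# Toy signals for counts 0, 1, and 2 of one dinucleotide (not measured data)
toy_groups = [[10, 12, 11], [14, 13, 15], [18, 17, 19]]
jt = jonckheere_statistic(toy_groups)

# Largest possible value: every cross-group pair is increasing
max_jt = sum(len(a) * len(b) for a, b in combinations(toy_groups, 2))
```

A statistic close to its maximum indicates that the signal rises consistently with the count, which is the question this sequence set is designed to answer.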

In [6]:
# Add the no pyrimidine dinucleotide sequences into each dinucleotide label as a count of 0
n_group_sequences = result[result["Group"] == "No_PyDi"].reset_index(drop=True)
for i in ("TT", "TC", "CT", "CC"):
    dataframe_addition = n_group_sequences.copy()
    dataframe_addition["Group"] = i
    result = pd.concat([result, dataframe_addition])
result = result[result["Group"] != 'No_PyDi']
result = result.reset_index(drop=True)

# Create csv files to perform the Jonckheere test in R 
for i in [x[0] for x in FILES]:
    table = subtable_from_df(i, result)
    for sequence_set in ("A", "B", "C"):
        for group in ("TT", "TC", "CT", "CC"):
            out = table[(table["Group"] == group) & (table["Sequence_Set"] == sequence_set)].reset_index(drop=True)
            out.to_csv(f"{OUTPUT_FOLDER_T2}/Table_S2_{i}_{sequence_set}_{group}_For_Statistics.csv", index=None)

### Figure 1E

Plot CPD and 6-4PP data by pyrimidine dinucleotide feature.

Creates 2 svg files, each a row of line plots with circles drawn at each point. Each plot corresponds to a pyrimidine dinucleotide (TT, TC, CT, or CC), and the lines in the plot correspond to measurements from the 3 sequence contexts (A, B, and C). Each x axis group is the count of the given pyrimidine dinucleotide.
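
Each plotted point is the median signal for one count of one dinucleotide, in natural log space, with the sequences lacking pyrimidine dinucleotides serving as count 0 for every group. A minimal sketch of that calculation for a single sequence context and experiment, using only the standard library and made-up signal values, could look like this (the notebook itself does this with pandas and also groups by sequence set and experiment):

```python
import math
from collections import defaultdict
from statistics import median

# Toy rows: (group, count, signal). Values are illustrative only.
rows = [("No_PyDi", 0, 1000), ("No_PyDi", 0, 1200),
        ("TT", 1, 2000), ("TT", 1, 2400), ("TT", 2, 4000),
        ("CC", 1, 1500), ("CC", 1, 1700)]

# Copy the sequences without pyrimidine dinucleotides into every group as count 0
baseline = [row for row in rows if row[0] == "No_PyDi"]
labelled = [row for row in rows if row[0] != "No_PyDi"]
for group in ("TT", "TC", "CT", "CC"):
    labelled.extend((group, count, signal) for _, count, signal in baseline)

# Median signal per (group, count), then natural log
signals = defaultdict(list)
for group, count, signal in labelled:
    signals[(group, count)].append(signal)
points = {key: math.log(median(values)) for key, values in signals.items()}

# x and y values for the TT line, ordered by count
tt_line = sorted((count, y) for (group, count), y in points.items() if group == "TT")
```

Here `tt_line` holds the three (count, ln median signal) pairs that one line in the TT panel would connect.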

#### (1) Functions

In [7]:
def plot_by_dinucleotide(df: pd.DataFrame,
                         output: str,
                         y_range: tuple,
                         circle_size: int,
                         ticker: list,
                         palette: tuple):
    """Exports a one-row grid of line plots, one per pyrimidine dinucleotide."""
    pydi = ("TT", "TC", "CT", "CC")
    seq_sets = ("A", "B", "C")
    tickers = [ticker, [], [], []]
    plots = []
    # For each set of dinucleotide and y ticker to plot
    for dinuc, y_ticks in zip(pydi, tickers):
        # Create figure object
        p = figure(plot_width=150, plot_height=200, y_range=y_range)
        # For each sequence and color set
        for seq_set, color in zip(seq_sets, palette):
            # Filter the data for the pyrimidine dinucleotide and sequence
            pdf = df[(df["Sequence_Set"] == seq_set) & (df["Group"] == dinuc)].reset_index(drop=True)
            # Draw the circles and lines for that dataframe
            p.circle(pdf["Count"],
                     pdf["Median_Signal_ln"],
                     color=color,
                     size=circle_size)
            p.line(pdf["Count"],
                   pdf["Median_Signal_ln"],
                   color=color,
                   line_width=2)
        # Set an empty x axis ticker
        p.xaxis.ticker = []
        # Set y axis ticker
        p.yaxis.ticker = y_ticks
        # Remove grid lines, label text, and toolbar
        p.xgrid.grid_line_color = None
        p.ygrid.grid_line_color = None
        p.xaxis.major_label_text_font_size = '0pt'
        p.yaxis.major_label_text_font_size = '0pt'
        p.toolbar_location = None
        # Settings for the border
        p.outline_line_width = 1
        p.outline_line_color = 'black'
        # Set backend to svg
        p.output_backend = 'svg'
        # Add to list of plots
        plots.append(p)
    # Create a grid of plots from the plots list with 1 row
    grid = gridplot([plots])
    # Export the grid of plots
    export_svg(grid, filename=output)


#### (2) Analysis

In [8]:
# Calculate medians and transform into natural log space
# Select Signal before aggregating so non-numeric columns are not passed to the median
result_medians = result.groupby(by=["Sequence_Set",
                                    "Group",
                                    "Count",
                                    "Experiment"])["Signal"].median()
result_medians = result_medians.reset_index()
result_medians = result_medians.rename(columns={"Signal":"Median_Signal"})
result_medians["Median_Signal_ln"] = np.log(result_medians["Median_Signal"])
result_medians = result_medians.sort_values(by=["Experiment", "Sequence_Set", "Group", "Count"]).reset_index(drop=True)

cpd_results = result_medians[result_medians["Experiment"] == "CPD_UV"].reset_index(drop=True)
sixfour_results = result_medians[result_medians["Experiment"] == "SixFour_UV"].reset_index(drop=True)

plot_by_dinucleotide(cpd_results,
                     f"{OUTPUT_FOLDER_F1}/CPD_F1E.svg",
                     CPD_RANGE,
                     F1E_CIRCLE_SIZE,
                     CPD_Y_TICKER,
                     F1E_COLOR_PALETTE)
plot_by_dinucleotide(sixfour_results,
                     f"{OUTPUT_FOLDER_F1}/64PP_F1E.svg",
                     SIXFOUR_RANGE,
                     F1E_CIRCLE_SIZE,
                     SIXFOUR_Y_TICKER,
                     F1E_COLOR_PALETTE)

### Figure 5B - UV-DDB replicates
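
The cell below reports replicate agreement as the squared `rvalue` from `scipy.stats.linregress`, which for a regression with a single predictor equals the squared Pearson correlation. A minimal standard-library sketch of that quantity, with a hypothetical helper name and toy values rather than measured signals:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance ** 2 / (variance_x * variance_y)


# Toy replicate signals (illustrative only)
replicate_1 = [9.5, 10.0, 10.4, 10.9]
replicate_2 = [9.6, 9.9, 10.5, 10.8]
agreement = r_squared(replicate_1, replicate_2)
```

Values near 1 indicate that the two replicates rank and scale the probes almost identically.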

In [9]:
F5B_RANGE = (np.log(10000), np.log(60000))
F5B_TICKS = [np.log(15000), np.log(30000), np.log(60000)]
F5B_CIRCLE_SIZE = 10


uvddb_r1 = uac.process_alldata_file(f"{ALLDATA_FOLDER}/UVDDB_UV_ID30_alldata.txt")
uvddb_r2 = uac.process_alldata_file(f"{ALLDATA_FOLDER}/UVDDB_UV_ID31_alldata.txt")
uvddb = pd.merge(uvddb_r1, uvddb_r2, on=["Name", "Sequence", "Has_PyDi"], suffixes=("_r1", "_r2"))

# Calculate R2 
regression = stats.linregress(uvddb["Signal_r1"], uvddb["Signal_r2"])
rsquared = regression.rvalue ** 2
# Save
with open(f"{OUTPUT_FOLDER_F5}/Correlation_F5B.txt", 'w') as file:
    file.write(f"Correlation = {rsquared}\n")
print(rsquared)

# Plot scatterplot
source = ColumnDataSource(uvddb)
p = figure(plot_width=800, plot_height=800,
           x_range=F5B_RANGE, y_range=F5B_RANGE)
p.circle("Signal_r1",
         "Signal_r2",
         source=source,
         color = "black",
         size=F5B_CIRCLE_SIZE)
p.xaxis.ticker = F5B_TICKS
p.yaxis.ticker = F5B_TICKS
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.major_label_text_font_size = '0pt'
p.yaxis.major_label_text_font_size = '0pt'
p.toolbar_location = None
export_png(p, filename=f"{OUTPUT_FOLDER_F5}/Fig5B_UVDDB_scatter.png")

0.9032640394558213


'/home/zmielko/Documents/UV_Project/GitHub_Directory/Analysis/Figure_5/Fig5B_UVDDB_scatter.png'

### Figure 5C - Plot UV-DDB by sequence 

Creates 3 svg files, each containing a line plot for one sequence context (A, B, or C). Within each plot, each line corresponds to a pyrimidine dinucleotide. Each x axis category corresponds to a count of that pyrimidine dinucleotide.

In [10]:
uvddb_results = result_medians[result_medians["Experiment"] == "UVDDB_r1"].reset_index(drop=True)

for sequence_set, y_ticker in zip(("A", "B", "C"), F5C_Y_TICKERS):
    # Plot data
    p = figure(plot_width=600, plot_height=800)
    for color, yy in zip(F5C_PALETTE, ("TT", "TC", "CT", "CC")):
        plot_df = uvddb_results[(uvddb_results["Group"] == yy) &
                                (uvddb_results["Sequence_Set"] == sequence_set)].reset_index(drop=True)
        p.circle(plot_df["Count"],
                 plot_df["Median_Signal_ln"],
                 color=color,
                 size=F5C_CIRCLE_SIZE)
        p.line(plot_df["Count"],
               plot_df["Median_Signal_ln"],
               color=color,
               line_width=5)
    p.xgrid.visible = False
    p.yaxis.ticker = y_ticker
    p.xaxis.major_tick_line_color = None  # turn off x-axis major ticks
    p.xaxis.minor_tick_line_color = None  # turn off x-axis minor ticks
    p.xaxis.major_label_text_font_size = '0pt'
    p.yaxis.major_label_text_font_size = '0pt'
    p.output_backend = 'svg'
    export_svg(p, filename=f"{OUTPUT_FOLDER_F5}/Fig5C_Sequence_{sequence_set}.svg")