### Increasing Pyrimidine Dinucleotides

Analysis for the increasing pyrimidine dinucleotide sequence set. Given an increasing number of pyrimidine dinucleotides in 3 sequence contexts, does the signal for CPD, 6-4PP, and UV-DDB increase?

1. Figure 1e (F1e) - CPD and 6-4PP signal with increasing number of pyrimidine dinucleotides
2. Figure 5c (F5c) - UV-DDB signal with increasing number of pyrimidine dinucleotides
3. Table S2 (TS2) - Table with the data used to generate F1e and F5c

### Overview

This notebook covers the analysis for the increasing pyrimidine dinucleotide library, which is used in Figures 1e and 5c and Table S2. It takes as **input** the following files:


| Input File | Associated Figure | Associated Table |
| --- | --- | --- |
| CPD_UV_ID33_alldata.txt | Figure 1e | Table S2 |
| 64PP_UV_ID35_alldata.txt | Figure 1e | Table S2 |
| UVDDB_UV_ID30_alldata.txt | Figure 5c | Table S2 |
| UVDDB_UV_ID31_alldata.txt | Figure 5c | Table S2 |


And generates the following **output**:

#### (1) Figure Output:

| Output File | Associated Figure | Desciption |
| --- | --- | --- |
| CPD_F1e.svg | Figure 1e | Top row plots in F1e showing CPD results |
| 64PP_F1e.svg | Figure 1e | Bottom row plots in F1e showing 6-4PP results |



#### (2) Table Output:

- Table_S2b.csv: Table with data used to generate F1e and F5c

### 3rd Party Packages

1. Bokeh - Creating plots
2. Numpy - Array usage
3. Pandas - Dataframe usage
4. Scipy - Linear regression

### UV Bind Analysis Core Imports

- uac.ols: OLS analysis trained on sequences that cannot form pyrimidine dimers
- uac.plot_range_from_x_y: Creates a tuple to draw a plot range based on 2 lists
- uac.scale_uv_on_non_uv: Scales UV values on Non-UV based on sequences that cannot form pyrimidine dinucleotides

** Additional details can be found in the uvbind_analysis_core.py script.

### Abbreviations:

- df: DataFrame
- pydi: Pyrimidine dinucleotide


### Imports and meta data

In [1]:
import itertools

from bokeh.layouts import gridplot
from bokeh.palettes import Category10
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import HoverTool, BooleanFilter, CDSView, ColumnDataSource, GroupFilter
from bokeh.io import export_svg, export_png
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests


import uvbind_analysis_core as uac

#Files and Folders
ALLDATA_FOLDER = "../../Data/AllData_Files"
FILES = (("CPD", f"{ALLDATA_FOLDER}/CPD_UV_ID33_alldata.txt"),
         ("64PP", f"{ALLDATA_FOLDER}/64PP_UV_ID35_alldata.txt"),
         ("UVDDB_r1", f"{ALLDATA_FOLDER}/UVDDB_UV_ID30_alldata.txt"),
         ("UVDDB_r2", f"{ALLDATA_FOLDER}/UVDDB_UV_ID31_alldata.txt"))
OUTPUT_F1 = "../Grant"
OUTPUT_T2 = "../Grant"
# Figure 1E Parameters
CPD_RANGE = (np.log(1200), np.log(25000))
CPD_Y_TICKER = (np.log(2500), np.log(5000), np.log(15000))
SIXFOUR_Y_TICKER = (np.log(5000), np.log(15000), np.log(50000))
SIXFOUR_RANGE = (np.log(3000), np.log(80000))
F1E_CIRCLE_SIZE = 10
F1E_COLOR_PALETTE = ("black", "#d67c35", "#128fcb")
# Figure 5C Parameters
TFS = ("MITF", "TBP", "p53")
MITF_TICKS = [np.log(20000), np.log(25000), np.log(35000)]
TBP_TICKS = [np.log(25000), np.log(30000), np.log(40000)]
P53_TICKS = [np.log(20000), np.log(30000), np.log(50000)]
F5C_CIRCLE_SIZE=25
TICK_TUP = (MITF_TICKS, TBP_TICKS, P53_TICKS)

### TS2: Process and organize data

Given the alldata files: 

1. Read the files as done in uac.process_alldata_file()
2. Filter for the increasing pyrimidine dinucleotide sequence group
    - These sequences uniquely contain a P6 in the name
3. Count the features for pyrimidine dinucleotides (TT, TC, CT, CC)
4. Filter for sequences with zero or one pyrimidine dinucleotide feature
    - Ex/ A sequence with 2+ TTs would pass, but not one with a TT and TC
5. Save this to a list of pre-median aggregation dataframes to later create the supplementary table (**TS2b Output**)
6. Aggregate the sequence classes by median and transform into natural log space
7. Classify rows by which feature they contain
    - If no feature, classify as all features, creating a new row for each one
8. Save this as a tabular dataset (**TS2c Output**)
9. Relable and place the rows of sequences without pyrimidine dinucleotides such that they are copied and placed into each group (TT, TC, CT, CC) as their 0 count instead of being in a seperate group of N. 
10. Return the dataframe. 

This is done for CPD, 6-4PP, and UV-DDB measurements. 

#### (1) Functions:

In [64]:
def count_pydi(string):
    """Returns a Pydi Tuple of counts.
    
    Given a string, counts the number of pyrimidine 
    dinucleotides and returns a namedtuple with each
    field representing a dinucleotide in both orientations.
    """
    # List to update with counts
    counts = [0, 0, 0, 0]
    # Dictionary of index positions in counts to update
    pydi_dict = {"TT":0,
                 "AA":0,
                 "TC":1,
                 "GA":1,
                 "CT":2,
                 "AG":2,
                 "CC":3,
                 "GG":3}
    # For each dinucleotide position, update counts
    for position in range(len(string) - 1):
        dinucleotide = string[position:position+2]
        if dinucleotide in pydi_dict:
            counts[pydi_dict[dinucleotide]] += 1
    # Return the counts list as a Pydi_Tuple
    return uac.Pydi_Tuple._make(counts)

def validate_label(label, sequence):
    label_pydi_counts = count_pydi(label)
    sequence_pydi_counts = count_pydi(sequence)
    return label_pydi_counts == sequence_pydi_counts
        

def validate_pydi_description(labels, sequences):
    for label, sequence in zip(labels, sequences):
        if validate_label(label, sequence) is False:
            raise ValueError(f"""The label sequence pair:\n{label}\n{sequence} does not match.""")
            
def pydi_group(string):
    total_count = count_pydi(string)
    if total_count.count(0) < 3:
        raise ValueError("Pyrimidine dinucleotide group is ambiguous.")
    for pydi in total_count._fields:
        if uac.reverse_complement(pydi) in string or pydi in string:
            return pydi
    return "N"

def increasing_pydi_pipeline(file: str, name: str) -> pd.DataFrame:
    """Pipeline for increasing pyrimidine dinucleotide groups."""
    # Read the input file
    df = uac.process_alldata_file(file, False, False)
    # Query for the increasing pyrimidine dinucleotide set
    df = df[df["Name"].str.contains('P7')].reset_index(drop=True)
    # Add a column indicating the sequence group
    df["Sequence_Set"] = df["Name"].apply(lambda x: x.split('_')[0])
    # Check that the sequence name correctly describes the sequence
    validate_pydi_description(df["Name"], df["Sequence"])
    # Filter for only sequences with 1 or 0 types of pyrimidine dinucleotides
    df = df[df["Sequence"].apply(lambda x: count_pydi(x).count(0) >= 3)]
    df = df.reset_index(drop=True)
    # Add a group and count column for the sequence
    df["Group"] = df["Name"].apply(lambda x: pydi_group(x))
    df["Count"] = df["Name"].apply(lambda x: sum(count_pydi(x)))
    # Rearrange the column order
    df = df[["Sequence_Set","Group", "Count","Sequence", "Signal"]]
    df["Experiment"] = name
    return df

In [63]:
t = pd.read_csv(FILES[1][1], sep='\t')
t = t[~t["Sequence"].isna()]
t[t["Name"].str.contains("P6")]

Unnamed: 0,Column,Row,Name,ID,Sequence,Cy3,Cy3Flags,Alexa488,Alexa488Flags,Cy3Exp,Obs/Exp,Alexa488Norm,Alexa488Median,Alexa488Adjusted
23,24,1,p53_P6_TC&P12_CC&P18_TT_r07,Ctrl_UV_DC1_37273,ACGCATCATGTGCCACACATTGTGCGCACACATACACATACACACA...,1,,31659,,1,1,31659.0,17401.0,28442.339348
29,30,1,TBP_P6_CT&P12_CC&P18_TC_r01,Ctrl_UV_DC1_00043,GTACGCTACGTACCATATATCGTACGCACACATACACATACACACA...,1,,31353,,1,1,31353.0,17192.0,28509.856270
32,33,1,p53_P6_TC&P12_CC&P18_N_r06,Ctrl_UV_DC1_31247,ACGCATCATGTGCCACACATGTGCGCGCACACATACACATACACAC...,1,,23268,,1,1,23268.0,17602.0,20665.188274
57,58,1,MITF_P6_N&P12_CT&P18_TC_r04,Ctrl_UV_DC1_18668,GTATGTACGCACTGTGCGTCACATACGCACACATACACATACACAC...,1,,14999,,1,1,14999.0,16865.0,13903.312600
139,140,1,p53_P6_N&P12_TT&P18_CT_r02,Ctrl_UV_DC1_06681,ACGCACATGTGTTACACACTGTGCGCGCACACATACACATACACAC...,1,,11480,,1,1,11480.0,10418.0,17226.611634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62895,112,328,TBP_P6_CC&P12_CT&P18_CC_r05,Ctrl_UV_DC1_24593,GTACGCCACGTACTATATACCGTACGCACACATACACATACACACA...,1,,11684,,1,1,11684.0,14667.0,12453.533238
62922,139,328,TBP_P6_TC&P12_N&P18_N_r08,Ctrl_UV_DC1_43004,GTACGTCACGTATATATACGTACGCGCACACATACACATACACACA...,1,,10739,,1,1,10739.0,16353.0,10266.176665
62929,146,328,TBP_P6_CC&P12_CT&P18_TT_r04,Ctrl_UV_DC1_18392,GTACGCCACGTACTATATATTGTACGCACACATACACATACACACA...,1,,15134,,1,1,15134.0,15592.5,15173.309091
62937,154,328,p53_P6_CT&P12_CT&P18_TT_r08,Ctrl_UV_DC1_43395,ACGCACTATGTGCTACACATTGTGCGCACACATACACATACACACA...,1,,14194,,1,1,14194.0,14692.0,15103.103866


#### (2) Analysis:

Input: FILES

Output: Table_S2b.csv, Table_S2c.csv

In [65]:
# Read process data for all files in pipeline, then save to a file
dfs = []
for name, file in FILES:
    dfs.append(increasing_pydi_pipeline(file, name))
result = pd.concat(dfs)
result = result.reset_index(drop=True)
name_dict = {"MITF":"A", "P53":"B", "TBP":"C"}
result["Sequence_Set"] = result["Sequence_Set"].apply(lambda x: name_dict[x])
result = result.sort_values(by=["Experiment", "Sequence_Set", "Group", "Count"])
result = result.reset_index(drop=True)
result.to_csv(f"{OUTPUT_T2}/Table_S2b.csv", index=False)
# Calculate medians and transform into natural log space
result_medians = result.groupby(by=["Sequence_Set",
                                    "Group",
                                    "Count",
                                    "Experiment"]).aggregate(np.median)
result_medians = result_medians.reset_index()
result_medians = result_medians.rename(columns={"Signal":"Median_Signal"})
result_medians["Median_Signal_ln"] = result_medians["Median_Signal"].apply(lambda x: np.log(x))
result_medians = result_medians.sort_values(by=["Experiment", "Sequence_Set", "Group", "Count"]).reset_index(drop=True)
result_medians.to_csv(f"{OUTPUT_T2}/Table_S2c.csv", index=False)
result_medians

Unnamed: 0,Sequence_Set,Group,Count,Experiment,Median_Signal,Median_Signal_ln
0,A,CC,1,64PP,5544.5,8.620562
1,A,CC,2,64PP,7003.5,8.854165
2,A,CC,3,64PP,9003.0,9.105313
3,A,CT,1,64PP,4556.5,8.424310
4,A,CT,2,64PP,5393.5,8.592950
...,...,...,...,...,...,...
151,C,TC,2,UVDDB_r2,37325.5,10.527432
152,C,TC,3,UVDDB_r2,45991.5,10.736212
153,C,TT,1,UVDDB_r2,24636.0,10.111964
154,C,TT,2,UVDDB_r2,25700.5,10.154266


In [None]:
dfs[0]

In [66]:
n_group_sequences = result[result["Group"] == "N"].reset_index(drop=True)
for i in ("TT", "TC", "CT", "CC"):
    dataframe_addition = n_group_sequences.copy()
    dataframe_addition["Group"] = i
    result = pd.concat([result, dataframe_addition])
result = result[result["Group"] != 'N']
result = result.reset_index(drop=True)
result

Unnamed: 0,Sequence_Set,Group,Count,Sequence,Signal,Experiment
0,A,CC,1,GTATGTACGCACGTGCGTGGCATACGCACACATACACATACACACA...,4373,64PP
1,A,CC,1,GTATGTGGCGCACGTGCGTACATACGCACACATACACATACACACA...,3635,64PP
2,A,CC,1,GTATGTACGCACGGTGCGTACATACGCACACATACACATACACACA...,8659,64PP
3,A,CC,1,GTATGTACGCACGGTGCGTACATACGCACACATACACATACACACA...,8449,64PP
4,A,CC,1,GTATGTACGCACGTGCGTGGCATACGCACACATACACATACACACA...,3660,64PP
...,...,...,...,...,...,...
3821,C,CC,0,GTACGTACGTATATATACGTACGTACGCACACATACACATACACAC...,26096,UVDDB_r2
3822,C,CC,0,GTACGTACGTATATATACGTACGTACGCACACATACACATACACAC...,24832,UVDDB_r2
3823,C,CC,0,GTACGTACGTATATATACGTACGTACGCACACATACACATACACAC...,25483,UVDDB_r2
3824,C,CC,0,GTACGTACGTATATATACGTACGTACGCACACATACACATACACAC...,24102,UVDDB_r2


In [17]:
statistics_results = []
for experiment in list(set(result["Experiment"])):
    for sequence_set in list(set(result["Sequence_Set"])):
        for group in list(set(result["Group"])):
            # Get data subset
            comparison_df = result[(result["Experiment"] == experiment) &
                                   (result["Sequence_Set"] == sequence_set) &
                                   (result["Group"] == group)].reset_index(drop=True)
            group_results = []
            for count_a, count_b in itertools.combinations((0, 1, 2, 3), 2):
                count_a_values = comparison_df[comparison_df["Count"] == count_a]["Signal"]
                count_b_values = comparison_df[comparison_df["Count"] == count_b]["Signal"]
                p_value = stats.ranksums(count_a_values, count_b_values).pvalue
                group_results.append((experiment, sequence_set, group, count_a, count_b, p_value))
            group_df = pd.DataFrame(group_results)
            group_df = group_df.rename(columns={0:"Experiment", 1:"Sequence_Set", 2:"Group", 3:"Count_A", 4:"Count_B", 5:"P_Value"})
            group_df["FDR"] = multipletests(group_df["P_Value"], method="fdr_bh")[1]
            statistics_results.append(group_df)
statistics_df = pd.concat(statistics_results).reset_index(drop=True)
statistics_df = statistics_df.sort_values(by=["Experiment", "Sequence_Set", "Group", "Count_A", "Count_B"]).reset_index(drop=True)
statistics_df.to_csv(f"{OUTPUT_T2}/Table_S2d.csv")

### Fig 1; Plot CPD, 6-4PP data by pyrimidine dinucleotide feature

Creates 2 svg files, each a row of line plots with circles drawn at each point. Each plot corresponds to a pyrimidine dinucleotide (TT. TC. CT, or CC) and the lines in the plot correspond to measurements from 3 different sequences. Each x axis group is the count of the given pyrimidine dinucleotide. 

In [18]:
statistics_df

Unnamed: 0,Experiment,Sequence_Set,Group,Count_A,Count_B,P_Value,FDR
0,64PP,A,CC,0,1,0.031146,0.037376
1,64PP,A,CC,0,2,0.000302,0.000907
2,64PP,A,CC,0,3,0.000212,0.000907
3,64PP,A,CC,1,2,0.041507,0.041507
4,64PP,A,CC,1,3,0.000526,0.001052
...,...,...,...,...,...,...,...
283,UVDDB_r2,C,TT,0,2,0.416732,0.652319
284,UVDDB_r2,C,TT,0,3,0.705457,0.846548
285,UVDDB_r2,C,TT,1,2,0.859184,0.859184
286,UVDDB_r2,C,TT,1,3,0.416732,0.652319


In [67]:
result_med = result.groupby(by=["Sequence_Set", "Group", "Count", "Experiment"]).aggregate(np.median)
result_med = result_med.reset_index()
result_med

Unnamed: 0,Sequence_Set,Group,Count,Experiment,Signal
0,A,CC,0,64PP,4200.0
1,A,CC,0,CPD,2209.5
2,A,CC,0,UVDDB_r1,18645.5
3,A,CC,0,UVDDB_r2,18141.5
4,A,CC,1,64PP,5544.5
...,...,...,...,...,...
187,C,TT,2,UVDDB_r2,25700.5
188,C,TT,3,64PP,22609.0
189,C,TT,3,CPD,8510.0
190,C,TT,3,UVDDB_r1,30982.0


In [43]:
cpd_results = result_med[result_med["Experiment"] == "CPD"].reset_index(drop=True)
cpd_results_ln = cpd_results.copy()
cpd_results_ln["Signal"] = cpd_results_ln["Signal"].apply(lambda x: np.log(x))
cpd_results_ln

Unnamed: 0,Sequence_Set,Group,Count,Experiment,Signal
0,A,CC,0,CPD,7.700522
1,A,CC,1,CPD,7.611842
2,A,CC,2,CPD,7.451532
3,A,CC,3,CPD,7.327452
4,A,CT,0,CPD,7.700522
5,A,CT,1,CPD,7.755125
6,A,CT,2,CPD,7.886081
7,A,CT,3,CPD,8.04719
8,A,TC,0,CPD,7.700522
9,A,TC,1,CPD,7.860764


In [44]:
cpd_results

Unnamed: 0,Sequence_Set,Group,Count,Experiment,Signal
0,A,CC,0,CPD,2209.5
1,A,CC,1,CPD,2022.0
2,A,CC,2,CPD,1722.5
3,A,CC,3,CPD,1521.5
4,A,CT,0,CPD,2209.5
5,A,CT,1,CPD,2333.5
6,A,CT,2,CPD,2660.0
7,A,CT,3,CPD,3125.0
8,A,TC,0,CPD,2209.5
9,A,TC,1,CPD,2593.5


In [38]:
sixfour_results = result_med[result_med["Experiment"] == "64PP"].reset_index(drop=True)
sixfour_results_ln = sixfour_results.copy()
sixfour_results_ln["Signal"] = sixfour_results_ln["Signal"].apply(lambda x: np.log(x))
sixfour_results_ln

Unnamed: 0,Sequence_Set,Group,Count,Experiment,Signal
0,A,CC,0,64PP,8.34284
1,A,CC,1,64PP,8.620562
2,A,CC,2,64PP,8.854165
3,A,CC,3,64PP,9.105313
4,A,CT,0,64PP,8.34284
5,A,CT,1,64PP,8.42431
6,A,CT,2,64PP,8.59295
7,A,CT,3,64PP,8.670515
8,A,TC,0,64PP,8.34284
9,A,TC,1,64PP,9.168789


In [68]:
# Redo to accompdate for result_medians ####
def plot_by_dinucleotide(df: pd.DataFrame,
                         output: str,
                         y_range: tuple,
                         circle_size: int,
                         ticker: list,
                         palette: tuple):
    """Generates a set of line plots """
    figures = []
    colors = palette
    pydi = ("TT", "TC", "CT", "CC")
    seqs = ("MITF", "P53", "TBP")
    tickers = [ticker, [], [], []]
    plots = []
    # For each set of dinucleotide and y ticker to plot
    for dinuc, y_ticks in zip(pydi, tickers):
        # Create figure object
        p = figure(plot_width=150, plot_height=200, y_range=y_range)
        # For each sequence and color set
        for seq, color in zip(("A", "B", "C"), colors):
            # Filter the data for the pyrimidine dinucleotide and sequence
            pdf = df[(df["Sequence_Set"] == seq) & (df["Group"] == dinuc)].reset_index(drop=True)
            # Draw the circles and lines for that dataframe
            p.circle(pdf["Count"],
                     pdf["Signal"],
                     color=color,
                     size=circle_size)
            p.line(pdf["Count"],
                   pdf["Signal"],
                   color=color,
                   line_width=2)
        # Set an empty x axsis ticker
        p.xaxis.ticker = []
        # Set y axis ticker
        p.yaxis.ticker = y_ticks
        # Remove grid lines, label text, and toolbar
        p.xgrid.grid_line_color = None
        p.ygrid.grid_line_color = None
        p.xaxis.major_label_text_font_size = '0pt'
        p.yaxis.major_label_text_font_size = '0pt'
        p.toolbar_location = None
        # Settings for the border
        p.outline_line_width = 1
        p.outline_line_color = 'black'
        # Set backend to svg
        p.output_backend = 'svg'
        # Add to list of plots
        plots.append(p)
    # Create a grid of plots from the plots list with 1 row
    grid = gridplot([plots])
    # Export the grid of plots
    #export_svg(grid, filename=output)
    return grid

CPD_RANGE = (np.log(1400), np.log(10000))
CPD_Y_TICKER = (np.log(2000), np.log(4000), np.log(8000))
# Plot CPD and 6-4PP data
a = plot_by_dinucleotide(cpd_results_ln,
                     f"{OUTPUT_F1}/CPD.svg",
                     CPD_RANGE,
                     F1E_CIRCLE_SIZE,
                     CPD_Y_TICKER,
                     F1E_COLOR_PALETTE)
show(a)

In [69]:
SIXFOUR_Y_TICKER = (np.log(5000), np.log(15000), np.log(50000))
SIXFOUR_RANGE = (np.log(3000), np.log(80000))
b = plot_by_dinucleotide(sixfour_results_ln,
                     f"{OUTPUT_F1}/64PP.svg",
                     SIXFOUR_RANGE,
                     F1E_CIRCLE_SIZE,
                     SIXFOUR_Y_TICKER,
                     F1E_COLOR_PALETTE)
show(b)

### Fig 5; Plot UV-DDB by sequence 

Creates 1 svg file which contains a row of line plots. Each plot corresponds to a sequence context. Within the plots, each line corresponds to a pyrimidine dinucleotide. Each x axis category corresponds to a count of that pyrimidine dinucleotide. 

In [None]:
# Plot data
dataframe = uvddb_results
source = ColumnDataSource(dataframe)
figures = []
colors = ["#1b9e77",
    '#d95f02',
    '#7570b3',
    '#e7298a']
pydi = ("TT", "TC", "CT", "CC")
p = figure(plot_width=600, plot_height=800)
for color, py in zip(colors, pydi) :
    df = query_group(dataframe, py)
    p.circle(df[py],
              df["Signal"],
              color = color,
              size=CIRCLE_SIZE)
    p.line(df[py],
              df["Signal"],
              color = color,
          line_width=5)
p.xgrid.visible = False
#p.ygrid.visible = False
p.yaxis.ticker = TICK
p.xaxis.major_tick_line_color = None  # turn off x-axis major ticks
p.xaxis.minor_tick_line_color = None  # turn off x-axis minor ticks
p.xaxis.major_label_text_font_size = '0pt'
p.yaxis.major_label_text_font_size = '0pt'
show(p)



In [None]:
# Plot data
CIRCLE_SIZE=25
df = plot_df
source = ColumnDataSource(df)
figures = []
colors = ['black', 'cyan', 'gold']
pydi = ("TT", "TC", "CT", "CC")
seqs = ("MITF", "TBP", "p53")
for dinuc in pydi:
    p = figure(plot_width=600, plot_height=800, y_range=CPD_RANGE)
    for seq, color in zip(seqs, colors):
        pdf = df[(df["Sequence"] == seq) & (df["PyDi"] == dinuc)].reset_index(drop=True)
        p.circle(pdf["Count"],
                 pdf["Signal"],
                 color=color,
                 size=CIRCLE_SIZE)
        p.line(pdf["Count"],
                 pdf["Signal"],
                 color=color,
                 line_width=5)
    p.xaxis.ticker = []
    p.yaxis.ticker = [np.log(2000), np.log(10000), np.log(20000)]
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.xaxis.major_label_text_font_size = '0pt'
    p.yaxis.major_label_text_font_size = '0pt'
    p.toolbar_location = None
    show(p)

### UVDDB Replicate Plot - Figure 5b

In [None]:
uvddb_r1 = uac.process_alldata_file(UVDDB_FILES[0])
uvddb_r2 = uac.process_alldata_file(UVDDB_FILES[1])
uvddb = pd.merge(uvddb_r1, uvddb_r2, on=["Name", "Sequence", "Has_PyDi"], suffixes=("_r1", "_r2"))

# Calculate R2 
regression = stats.linregress(uvddb["Signal_r1"], uvddb["Signal_r2"])
rsquared = regression.rvalue ** 2

# Plot scatterplot
source = ColumnDataSource(uvddb)
p = figure(plot_width=800, plot_height=800,
           x_range=F5B_RANGE, y_range=F5B_RANGE)
p.circle("Signal_r1",
         "Signal_r2",
         source=source,
         color = "black",
         size=F5B_CIRCLE_SIZE)
p.xaxis.ticker = F5B_TICKS
p.yaxis.ticker = F5B_TICKS
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.major_label_text_font_size = '0pt'
p.yaxis.major_label_text_font_size = '0pt'
p.toolbar_location = None
#p.output_backend = "svg"
show(p)
#export_png(p, filename="../Scratch/UVDDB_PLOTS/UVDDB_scatter.png")

### UVDDB increasing pyrimidine dinucleotide plot - Figure 5c

In [None]:
TFS = ("MITF", "TBP", "p53")
MITF_TICKS = [np.log(20000), np.log(25000), np.log(35000)]
TBP_TICKS = [np.log(25000), np.log(30000), np.log(40000)]
P53_TICKS = [np.log(20000), np.log(30000), np.log(50000)]
TICK_TUP = (MITF_TICKS, TBP_TICKS, P53_TICKS)

for TF, TICK in zip(TFS, TICK_TUP):
    df = process_file(FILE, False, False)
    mitf_probes = df[(df["Name"].str.contains('P6')) &
                   (df["Name"].str.startswith(TF))]
    mitf_probes = mitf_probes.reset_index(drop=True)
    features = ("TT", "TC", "CT", "CC")
    for i in features:
        result = []
        for seq in mitf_probes["Sequence"]:
            result.append(dimer_count(seq, i))
        mitf_probes[i] = result
    mitf_probes
    only_one = []
    for row in mitf_probes.itertuples():
        if [row.TT, row.TC, row.CT, row.CC].count(0) >= 3:
            only_one.append(1)
        else:
            only_one.append(0)
    mitf_probes["Select"] = only_one
    mitf_probes = mitf_probes[mitf_probes["Select"] == 1]
    mitf_probes = mitf_probes.reset_index(drop=True)
    mitf_probes = mitf_probes[["TT", "TC", "CT", "CC", "Alexa488"]]
    mitf_probes = mitf_probes.groupby(by=["TT", "TC", "CT", "CC"]).aggregate(np.median)
    mitf_probes = mitf_probes.reset_index()
    mitf_probes["Alexa488"] = mitf_probes["Alexa488"].apply(lambda x: np.log(x))
    mitf_probes
    # Plot data
    CIRCLE_SIZE=25
    dataframe = mitf_probes
    source = ColumnDataSource(dataframe)
    figures = []
    colors = ["#1b9e77",
        '#d95f02',
        '#7570b3',
        '#e7298a']
    pydi = ("TT", "TC", "CT", "CC")
    p = figure(plot_width=600, plot_height=800)
    for color, py in zip(colors, pydi) :
        df = query_group(dataframe, py)
        p.circle(df[py],
                  df["Alexa488"],
                  color = color,
                  size=CIRCLE_SIZE)
        p.line(df[py],
                  df["Alexa488"],
                  color = color,
              line_width=5)
    p.xgrid.visible = False
    #p.ygrid.visible = False
    p.yaxis.ticker = TICK
    p.xaxis.major_tick_line_color = None  # turn off x-axis major ticks
    p.xaxis.minor_tick_line_color = None  # turn off x-axis minor ticks
    p.xaxis.major_label_text_font_size = '0pt'
    p.yaxis.major_label_text_font_size = '0pt'
    show(p)