### Consensus Site Flanks and Variations

Analysis of variations in binding sites and binding site flanks across non-UV and UV conditions.

1. Figure 3
2. Figure S4
3. Table S4

### Overview 

**Figure 3A-C: MITF Flanks**

UV-Bind data were measured for MITF binding to the E-box CACGTG with all possible 2-nt 5' and 3' flanks. Sequences were grouped into those with a **T**CACGTG**A** or **C**CACGTG**G** flanking context. The variation in fold-change decrease is shown for all sequences within each group, relative to a prediction interval trained on library sequences that cannot form pyrimidine dimers because they lack pyrimidine dinucleotides.



For a given transcription factor and consensus site (MITF, F3; CREB1, EGR1, MITF F5S), binding was measured for all sequences generated by taking a sliding window of 2 bp and creating all possible variations over a given region. The sequences are compared to a prediction interval trained on all sequences in the library without pyrimidine dinucleotides.
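
The sliding-window idea can be sketched as follows. This is a minimal illustration, not the code used to design the array: the function name `dinucleotide_variants` and the choice of region are assumptions, and the actual library may include or exclude variants differently.

```python
from itertools import product

BASES = "ACGT"


def dinucleotide_variants(sequence, start, end):
    """Return all variants made by replacing each 2-nt window in [start, end).

    Hypothetical helper: every window is replaced by each dinucleotide that
    differs from the original, so the unmodified sequence is never returned.
    Single-base changes shared by two overlapping windows are kept once.
    """
    variants = set()
    for i in range(start, end - 1):  # i is the left index of a 2-nt window
        original = sequence[i:i + 2]
        for pair in product(BASES, repeat=2):
            replacement = "".join(pair)
            if replacement != original:
                variants.add(sequence[:i] + replacement + sequence[i + 2:])
    return sorted(variants)


# Vary the 6-bp E-box core of one of the labels listed in table_s_vars_meta
seq = "TACGCACGTGCGTAC"
core_start = seq.index("CACGTG")
variants = dinucleotide_variants(seq, core_start, core_start + 6)
print(len(variants), variants[:3])
```

Using a set removes the single-base variants that two overlapping windows would otherwise generate twice.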

### 3rd Party Packages

1. Bokeh - Creating plots
2. Numpy - Array usage
3. Pandas - Dataframe usage

### UV Bind Analysis Core Imports

- uac.ols: OLS analysis trained on sequences that cannot form pyrimidine dimers
- uac.plot_range_from_x_y: Creates a tuple to draw a plot range based on 2 lists
- uac.scale_uv_on_non_uv: Scales UV values on Non-UV based on sequences that cannot form pyrimidine dinucleotides

** Additional details can be found in the uvbind_analysis_core.py script.
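
As a rough illustration of what "cannot form pyrimidine dimers" means for a sequence, the check below flags any pyrimidine dinucleotide on either strand. It is a sketch under that assumption, not the implementation in uvbind_analysis_core.py, and the name `has_pyrimidine_dinucleotide` is hypothetical.

```python
# Pyrimidine dinucleotides on the given strand
PYRIMIDINE_DINUCLEOTIDES = ("CC", "CT", "TC", "TT")
# Purine dinucleotides on the given strand pair with pyrimidine
# dinucleotides on the complementary strand
PURINE_DINUCLEOTIDES = ("AA", "AG", "GA", "GG")


def has_pyrimidine_dinucleotide(sequence):
    """Return True if either strand contains a pyrimidine dinucleotide."""
    sequence = sequence.upper()
    motifs = PYRIMIDINE_DINUCLEOTIDES + PURINE_DINUCLEOTIDES
    return any(motif in sequence for motif in motifs)


# Strictly alternating purine/pyrimidine sequence: no dimer can form
print(has_pyrimidine_dinucleotide("ACGTATGCACATACACGCGTATGTACGC"))
# Contains "AA" and "TC", so a dimer can form on one of the strands
print(has_pyrimidine_dinucleotide("GTATGTAATCACGTGAATACATAC"))
```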

### Abbreviations:

- df: DataFrame
- dict: Dictionary
- nuv: Non-UV
- nsig: Non-Significant
- sig: Significant
- supp: Supplemental
- var: variant

In [1]:
# Imports
from __future__ import annotations
from collections import namedtuple
import math
import os

from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import BooleanFilter, CDSView, GroupFilter, Label
from bokeh.models import Grid, HBar, Span, IndexFilter, Text
from bokeh.io import export_svg
import numpy as np
import pandas as pd
import scipy.stats as stats
import uvbind_analysis_core as uac

# Global variables
DATA_FOLDER = "../../Data/AllData_Files"
META_DATA = "Meta_Data/Meta_Data_Binding_Site_Variations.csv"
META_DATA_FIG_S5 = "Meta_Data/Meta_Data_Binding_Site_Variations_FigS5_Plots.csv"
FORMATTED_VAR_NAMES = "Meta_Data/Array_86684_Formatted_Variation_Names_v2.csv"
OUTPUT_MAIN = "../Figure_3"
OUTPUT_SUPP = "../Figure_S4"
OUTPUT_TABLE = "../Table_S4"
OUTPUT_TEXT = "../Text_Data"
# Figure 3a-b
F3A_PLOT_RANGE = (np.log(500), np.log(75000))
F3A_TA_COLOR = "#1f78b4"
F3A_CG_COLOR = "#33a02c"
F3A_ND_COLOR = "lightgrey"
F3A_TICKER = (np.log(500), np.log(5000), np.log(50000))
F3A_CIRCLE_SIZE = 20
# Figure 3c
F3C_PLOT_RANGE = (np.log(300), np.log(75000))
F3C_CIRCLE_SIZE = 20
F3C_TICKER = [np.log(500), np.log(5000), np.log(50000)]
FC3_COLOR_PALETTE = ('#f4987a', "grey")
# Figure 3e
F3E_TICKER = (np.log(500), np.log(5000), np.log(50000))
F3E_PALETTE = ('#8db0fe', '#f4987a', "grey")
# Figure 5S
PALETTE_FS5 = ('#8db0fe', '#f4987a',  "gray")
# Supplementary Table S4
table_s_vars_meta = (("CREB1", 0, "CGGCTGACGTCAGCCAC"),
                     ("CREB1", 1, "CATATGACGTCATATGT"),
                     ("MITF", 8, "TACGCACATGCGTAC"),
                     ("MITF", 9, "TACGCACGTGCGTAC"),
                     ("MITF", 10, "TACGCATGTGCGTAC"),
                     ("EGR1", 3, "TGCACGCCCACGCGTATG"),
                     ("EGR1", 6, "ATACGCGTGGGCGTGCAT"))

In [2]:
# Ensure output folders exist
for i in (OUTPUT_MAIN, OUTPUT_SUPP, OUTPUT_TABLE, OUTPUT_TEXT):
    os.makedirs(i, exist_ok=True)

### Process files - CREB1, EGR1, MITF

Data from CREB1, EGR1, and MITF are used throughout the analysis and processed in the same manner. Each pair of alldata files is read, and sequences with 2+ replicates and without flags are kept. The UV values are scaled on Non-UV values using a linear regression fit to all sequences without pyrimidine dinucleotides. An OLS regression is then trained on sequences without pyrimidine dinucleotides, and a 99% prediction interval is generated.

The results are added to a dictionary where the key is the protein name and value is a dataframe containing values from both conditions and results from the OLS.
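
One way such a scaling step could work is sketched below on synthetic data. It assumes the scaling inverts a straight-line fit of UV on non-UV signal over the sequences without pyrimidine dinucleotides; the function `scale_on_reference` is hypothetical and the actual uac.scale_uv_on_non_uv may differ in detail.

```python
import numpy as np


def scale_on_reference(x, y, train_mask):
    """Rescale y so that regressing it on x over train_mask gives y = x.

    Assumed approach: fit y = slope * x + intercept on the training rows,
    then invert that line for every row.
    """
    slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
    return (y - intercept) / slope


# Synthetic ln-signals (illustrative only, not array data)
rng = np.random.default_rng(0)
nuv = rng.uniform(6, 11, size=200)
uv_prescale = 0.65 * nuv + 4.6 + rng.normal(0, 0.05, size=200)
no_pydi = np.zeros(200, dtype=bool)
no_pydi[:50] = True  # pretend the first 50 rows lack pyrimidine dinucleotides

uv = scale_on_reference(nuv, uv_prescale, no_pydi)
new_slope, new_intercept = np.polyfit(nuv[no_pydi], uv[no_pydi], deg=1)
print(new_slope, new_intercept)
```

After this transformation, a refit on the training rows has a slope of 1 and an intercept of 0 up to floating-point error, which is why a predicted UV value equals the non-UV value.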

**Input:** Meta data, alldata.txt files specified in the meta data.

**Output:** Dictionary with keys as the protein name and values as dataframes with the following columns:

*Created from uac.scale_uv_on_non_uv*

1. Name: Label given to the sequence from the design, minus the r# suffix where # is the replicate number as it represents the median of those sequences.
2. Sequence: The 60bp sequence in the UV-Bind array
3. Signal_Non_UV: The natural log transformed value of the median non-UV signal.
4. Has_PyDi: Boolean (True/False) indicating if the sequence contains a pyrimidine dinucleotide in either orientation.
5. Signal_UV: The natural log transformed value of the median UV signal after scaling based on the Signal_Non_UV values.
6. Signal_UV_PreScale: The natural log transformed value of the median UV signal before scaling.

*Created from uac.ols*

7. Prediction_Upper: The upper bound of the 99% prediction interval for Signal_UV based on an OLS regression (uac.ols)
8. Prediction_Lower: The lower bound of the 99% prediction interval for Signal_UV based on an OLS regression (uac.ols)
9. Predicted: The predicted Signal_UV value from the OLS. This value is equal to the Signal_Non_UV since the slope is transformed to be 1 during the scaling process.
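
For readers unfamiliar with prediction intervals, the sketch below computes one for a simple linear regression using the standard textbook formula. It is not the uac.ols implementation; the function name `ols_prediction_interval` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def ols_prediction_interval(x_train, y_train, x_new, alpha=0.01):
    """Return (lower, predicted, upper) for new x values.

    Uses predicted +/- t * s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx),
    where s is the residual standard error with n - 2 degrees of freedom.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n = len(x_train)
    fit = stats.linregress(x_train, y_train)
    predicted = fit.intercept + fit.slope * x_new
    residuals = y_train - (fit.intercept + fit.slope * x_train)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    sxx = np.sum((x_train - x_train.mean()) ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    half_width = t_crit * s * np.sqrt(
        1 + 1 / n + (x_new - x_train.mean()) ** 2 / sxx)
    return predicted - half_width, predicted, predicted + half_width


# Synthetic training data (illustrative only, not array data)
rng = np.random.default_rng(1)
x_train = rng.uniform(6, 11, size=100)
y_train = x_train + rng.normal(0, 0.15, size=100)
lower, predicted, upper = ols_prediction_interval(x_train, y_train, [7.0, 9.0])
print(lower, predicted, upper)
```

The interval is symmetric around the predicted value, which matches the pattern in the displayed tables where Prediction_Upper and Prediction_Lower sit at equal distances from Predicted.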


#### (1) Read and display meta data

In [3]:
meta_data = pd.read_csv(META_DATA)
meta_data

Unnamed: 0,Protein,Non_UV_File,UV_File
0,CREB1,CREB1_WC_ID38_alldata.txt,CREB1_UV_ID39_alldata.txt
1,EGR1,EGR1_WC_ID40_alldata.txt,EGR1_UV_ID41_alldata.txt
2,MITF,MITF_WC_ID36_alldata.txt,MITF_UV_ID37_alldata.txt


#### (2) Read files and concatenate into a single dataframe

In [4]:
dfs = []
for row in meta_data.itertuples():
    # Read data, scale UV on Non_UV values with no pydi
    df = uac.scale_uv_on_non_uv(f"{DATA_FOLDER}/{row.Non_UV_File}",
                                f"{DATA_FOLDER}/{row.UV_File}")
    # Run OLS analysis
    df = uac.ols(df,
                 column_x="Signal_Non_UV",
                 column_y="Signal_UV",
                 alpha=0.01,
                 column_damage="Has_PyDi")
    df["Protein"] = row.Protein
    # Add to df list
    dfs.append(df)
data_df = pd.concat(dfs).reset_index(drop=True)
data_df

Unnamed: 0,Name,Sequence,Signal_Non_UV,Has_PyDi,Signal_UV,Signal_UV_PreScale,Prediction_Upper,Prediction_Lower,Predicted,Protein
0,ClstD_CREB1_H0_TD25_S0,TTGGAAACCTTTGGAGGGAATTTCCCGCACACATACACATACACAC...,6.864848,True,8.414366,9.079890,7.284846,6.444850,6.864848,CREB1
1,ClstD_CREB1_H0_TD25_S0_O2,GGAAATTCCCTCCAAAGGTTTCCAACGCACACATACACATACACAC...,7.175490,True,8.861874,9.368455,7.593654,6.757326,7.175490,CREB1
2,ClstD_CREB1_H0_TD25_S1,TGGAAACCTTTGGAGGGAATTTCCGCGCACACATACACATACACAC...,6.980076,True,8.715496,9.274066,7.399360,6.560792,6.980076,CREB1
3,ClstD_CREB1_H0_TD25_S1_O2,CGGAAATTCCCTCCAAAGGTTTCCACGCACACATACACATACACAC...,7.193686,True,8.897863,9.391661,7.611751,6.775620,7.193686,CREB1
4,ClstD_CREB1_H0_TD25_S2,TGGAAACCTTTGGAGGGAATTTCCCCGCACACATACACATACACAC...,6.709304,True,8.450171,9.102978,7.130329,6.288280,6.709304,CREB1
...,...,...,...,...,...,...,...,...,...,...
36151,kD_Zif268_F3_GTG_O2,CGGATATATACGCCCACACTATATACGCACACATACACATACACAC...,6.113682,True,6.231159,7.103733,7.147866,5.079498,6.113682,MITF
36152,kD_Zif268_F3_TCG,TATATATCGTGGGCGTATATATCCGCGCACACATACACATACACAC...,5.624018,True,5.885331,6.804060,6.659675,4.588360,5.624018,MITF
36153,kD_Zif268_F3_TCG_O2,CGGATATATACGCCCACGATATATACGCACACATACACATACACAC...,6.034285,True,6.230684,7.103322,7.068680,4.999889,6.034285,MITF
36154,kD_Zif268_consensus,TATATAGCGTGGGCGTATATATCCGCGCACACATACACATACACAC...,5.549076,True,5.716789,6.658011,6.584995,4.513157,5.549076,MITF


### Supplementary Table 4A - Nondamageable

Filter the dataframe for CREB1, EGR1, and MITF for sequences that do not have pyrimidine dinucleotides. Output as a csv file. 

In [5]:
training_df = data_df[~data_df["Has_PyDi"]].reset_index(drop=True)
training_df = training_df[["Protein", "Sequence", "Signal_Non_UV", "Signal_UV", "Signal_UV_PreScale", "Prediction_Upper", "Prediction_Lower", "Predicted"]]
training_df.to_csv(f"{OUTPUT_TABLE}/Table_S4A_Nondamageable_Sequences.csv", index=False)
# Display the result
training_df

Unnamed: 0,Protein,Sequence,Signal_Non_UV,Signal_UV,Signal_UV_PreScale,Prediction_Upper,Prediction_Lower,Predicted
0,CREB1,ACGTATGCACATACACGCGTATGTACGCACACATACACATACACAC...,8.565602,8.465462,9.112838,8.979166,8.152039,8.565602
1,CREB1,ACGTATGCACGTACACGCGTATGTACGCACACATACACATACACAC...,8.786686,8.867689,9.372204,9.200070,8.373302,8.786686
2,CREB1,ACGTATGCACGCGCACGCGTATGTACGCACACATACACATACACAC...,8.477724,8.372278,9.052750,8.891402,8.064047,8.477724
3,CREB1,ACGTATGCACGTGCACGCGTATGTACGCACACATACACATACACAC...,9.290260,9.078969,9.508443,9.703805,8.876715,9.290260
4,CREB1,ACGTATGCACGCATACGCGTATGTACGCACACATACACATACACAC...,8.948651,8.843025,9.356300,9.362000,8.535302,8.948651
...,...,...,...,...,...,...,...,...
1336,MITF,GTATGTACGCACGTGTATATATACGCGCACACATACACATACACAC...,9.118280,9.310621,9.772211,10.152285,8.084275,9.118280
1337,MITF,GTGTATATATACGTGCGTACATACGCGCACACATACACATACACAC...,7.416980,7.066097,7.827241,8.449211,6.384749,7.416980
1338,MITF,GTATGTATATATACGCGTACATACGCGCACACATACACATACACAC...,6.988874,6.667496,7.481837,8.021430,5.956318,6.988874
1339,MITF,GTATGTGTATATATACGTACATACGCGCACACATACACATACACAC...,6.654153,6.427802,7.274133,7.687178,5.621127,6.654153


### Supplementary Materials - Scaling UV values in UV vs non-UV comparisons

Generates a report to use for the supplementary materials section, "Scaling UV values in UV vs non-UV comparisons".

#### (1) Functions and Classes

In [6]:

Regression_Report = namedtuple("Regression_Report", ["Condition_X",
                                                     "Condition_Y",
                                                     "Prescale_R2",
                                                     "Prescale_Slope",
                                                     "Prescale_Slope_CI",
                                                     "Prescale_Intercept",
                                                     "Prescale_Intercept_CI",
                                                     "Postscale_R2",
                                                     "Postscale_Slope",
                                                     "Postscale_Slope_CI",
                                                     "Postscale_Intercept",
                                                     "Postscale_Intercept_CI"])


def generate_regression_report(prescale_regression,
                               postscale_regression,
                               len_values,
                               column_x,
                               column_y,
                               ci_percent=0.95):
    tinv = lambda p, df: abs(stats.t.ppf(p/2, df))
    ts = tinv(1 - ci_percent, len_values-2)
    return  Regression_Report(Condition_X=column_x,
                             Condition_Y=column_y,
                             Prescale_R2=prescale_regression.rvalue ** 2,
                             Prescale_Slope=prescale_regression.slope,
                             Prescale_Slope_CI=prescale_regression.stderr * ts,
                             Prescale_Intercept=prescale_regression.intercept,
                             Prescale_Intercept_CI=prescale_regression.intercept_stderr * ts,
                             Postscale_R2=postscale_regression.rvalue ** 2,
                             Postscale_Slope=postscale_regression.slope,
                             Postscale_Slope_CI=postscale_regression.stderr * ts,
                             Postscale_Intercept=postscale_regression.intercept,
                             Postscale_Intercept_CI=postscale_regression.intercept_stderr * ts)


#### (2) Analysis

In [7]:
# Generate regression statistics for scaling
reports = []
for protein in ("CREB1", "EGR1", "MITF"):
    protein_training_df = training_df[training_df["Protein"] == protein].reset_index(drop=True)
    # Prescale
    prescale_regression = stats.linregress(protein_training_df["Signal_Non_UV"],
                                           protein_training_df["Signal_UV_PreScale"])
    postscale_regression = stats.linregress(protein_training_df["Signal_Non_UV"],
                                            protein_training_df["Signal_UV"])
    reg_report = generate_regression_report(prescale_regression,
                                            postscale_regression,
                                            len(protein_training_df),
                                            f"Non_UV_{protein}",
                                            f"UV_{protein}")
    reports.append(reg_report)
report_df = pd.DataFrame(reports)
report_df.to_csv(f"{OUTPUT_TEXT}/Scaling_Report.csv")

### Prepare data for use as a dictionary


#### (1) Read data as a dictionary

In [8]:
# Create dictionary of OLS comparisons
data_dict = {}
for row in meta_data.itertuples():
    # Read data, scale UV on Non_UV values with no pydi
    df = uac.scale_uv_on_non_uv(f"{DATA_FOLDER}/{row.Non_UV_File}",
                                f"{DATA_FOLDER}/{row.UV_File}")
    # Run OLS analysis
    df = uac.ols(df,
                 column_x="Signal_Non_UV",
                 column_y="Signal_UV",
                 alpha=0.01,
                 column_damage="Has_PyDi")
    # Add to dictionary
    data_dict[row.Protein] = df
# Show example
print("Example dataframe in data_dict for:", meta_data["Protein"][0])
data_dict[meta_data["Protein"][0]]

Example dataframe in data_dict for: CREB1


Unnamed: 0,Name,Sequence,Signal_Non_UV,Has_PyDi,Signal_UV,Signal_UV_PreScale,Prediction_Upper,Prediction_Lower,Predicted
0,ClstD_CREB1_H0_TD25_S0,TTGGAAACCTTTGGAGGGAATTTCCCGCACACATACACATACACAC...,6.864848,True,8.414366,9.079890,7.284846,6.444850,6.864848
1,ClstD_CREB1_H0_TD25_S0_O2,GGAAATTCCCTCCAAAGGTTTCCAACGCACACATACACATACACAC...,7.175490,True,8.861874,9.368455,7.593654,6.757326,7.175490
2,ClstD_CREB1_H0_TD25_S1,TGGAAACCTTTGGAGGGAATTTCCGCGCACACATACACATACACAC...,6.980076,True,8.715496,9.274066,7.399360,6.560792,6.980076
3,ClstD_CREB1_H0_TD25_S1_O2,CGGAAATTCCCTCCAAAGGTTTCCACGCACACATACACATACACAC...,7.193686,True,8.897863,9.391661,7.611751,6.775620,7.193686
4,ClstD_CREB1_H0_TD25_S2,TGGAAACCTTTGGAGGGAATTTCCCCGCACACATACACATACACAC...,6.709304,True,8.450171,9.102978,7.130329,6.288280,6.709304
...,...,...,...,...,...,...,...,...,...
12049,kD_Zif268_F3_GTG_O2,CGGATATATACGCCCACACTATATACGCACACATACACATACACAC...,7.954899,True,8.616049,9.209940,8.369750,7.540048,7.954899
12050,kD_Zif268_F3_TCG,TATATATCGTGGGCGTATATATCCGCGCACACATACACATACACAC...,7.694165,True,8.556751,9.171703,8.109917,7.278413,7.694165
12051,kD_Zif268_F3_TCG_O2,CGGATATATACGCCCACGATATATACGCACACATACACATACACAC...,7.962416,True,8.696832,9.262031,8.377244,7.547588,7.962416
12052,kD_Zif268_consensus,TATATAGCGTGGGCGTATATATCCGCGCACACATACACATACACAC...,7.495264,True,8.253660,8.976262,7.911844,7.078684,7.495264


### Supplementary Table 4B - MITF Flanks

1. Query for the MITF flank variations sequence set
2. Separate probes into groups
3. Organize dataframe into 3 groups:
    - TCACGTGA context: mitf_ta
    - CCACGTGG context: mitf_cg
    - Sequences used to train the OLS model: mitf_nd
4. Concatenate the groups into a single dataframe to use as the source data for creating figures 3a-c.

**Input:** Dataframe from data_dict["MITF"]

**Output:** A Dataframe with the following additional columns:

1. Flank: The flanking context of the E-box. The first 2 letters correspond with the left flank and the last 2 letters correspond with the right flank. Ex/ ATAA -> **AT**CACGTG**AA**.
2. Group: The group of flanking contexts the row belongs to.
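
The relationship between the 4-letter Flank value, the full site, and the Group label can be sketched as below. The helper names `flank_to_site` and `flank_group` are hypothetical; the patterns mirror the `.TA.` and `.CG.` expressions used in the grouping code.

```python
import re

CORE = "CACGTG"


def flank_to_site(flank, core=CORE):
    """Place the core between the left (first 2 nt) and right (last 2 nt) flank."""
    return flank[:2] + core + flank[2:]


def flank_group(flank):
    """Return "TA", "CG", or None based on the two bases adjacent to the core."""
    if re.fullmatch(r".TA.", flank):
        return "TA"  # TCACGTGA context
    if re.fullmatch(r".CG.", flank):
        return "CG"  # CCACGTGG context
    return None


for flank in ("ATAA", "ACGG", "AAAA"):
    print(flank, flank_to_site(flank), flank_group(flank))
```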

#### (1) Subset the data for MITF and count the number of sequences for OLS training

In [9]:
# Number of probes used to train the OLS model
mitf = data_dict["MITF"]
count_no_pydi = len(mitf) - sum(mitf["Has_PyDi"])
print("Sequence count for OLS model:", f"{count_no_pydi} of {len(mitf)}")

Sequence count for OLS model: 447 of 12053


#### (2) Organize data

In [10]:
# Query mitf flank probes
mitf_flank = mitf[mitf["Name"].str.startswith("Flank_MITF")]
mitf_flank = mitf_flank.reset_index(drop=True)
mitf_flank["Flank"] = mitf_flank["Name"].apply(lambda x: x.split('_')[-1])
# Separate into flanking contexts for the analysis
mitf_ta = mitf_flank[mitf_flank["Flank"].str.contains(
    r".TA.")].reset_index(drop=True)
mitf_cg = mitf_flank[mitf_flank["Flank"].str.contains(
    r".CG.")].reset_index(drop=True)
mitf_nd = mitf[~mitf["Has_PyDi"]].reset_index(drop=True)
# Group labels
mitf_ta["Group"] = "TA"
mitf_cg["Group"] = "CG"
mitf_nd["Flank"] = "NN"
mitf_nd["Group"] = "ND"
# Concatenate dfs
plot_df = pd.concat([mitf_ta, mitf_cg, mitf_nd]).reset_index(drop=True)
plot_df = plot_df[plot_df["Name"].apply(lambda x: "MITF" in x)].reset_index(drop=True)
plot_df

Unnamed: 0,Name,Sequence,Signal_Non_UV,Has_PyDi,Signal_UV,Signal_UV_PreScale,Prediction_Upper,Prediction_Lower,Predicted,Flank,Group
0,Flank_MITF_CACGTG_ATAA,GTATGTAATCACGTGAATACATACGCGCACACATACACATACACAC...,11.090340,True,8.548932,9.112176,12.132489,10.048190,11.090340,ATAA,TA
1,Flank_MITF_CACGTG_ATAC,GTATGTAATCACGTGACTACATACGCGCACACATACACATACACAC...,11.090340,True,10.287640,10.618836,12.132489,10.048190,11.090340,ATAC,TA
2,Flank_MITF_CACGTG_ATAG,GTATGTAATCACGTGAGTACATACGCGCACACATACACATACACAC...,11.090340,True,8.781920,9.314070,12.132489,10.048190,11.090340,ATAG,TA
3,Flank_MITF_CACGTG_ATAT,GTATGTAATCACGTGATTACATACGCGCACACATACACATACACAC...,11.090340,True,9.626711,10.046115,12.132489,10.048190,11.090340,ATAT,TA
4,Flank_MITF_CACGTG_CTAA,GTATGTACTCACGTGAATACATACGCGCACACATACACATACACAC...,10.530775,True,6.853485,7.643004,11.569956,9.491595,10.530775,CTAA,TA
...,...,...,...,...,...,...,...,...,...,...,...
143,Pos_MITF_GCACGTGC_P16,GTATGTACATGCACGCGCACGTGCACGCACACATACACATACACAC...,9.039849,False,9.492806,9.930081,10.073664,8.006033,9.039849,NN,ND
144,Pos_MITF_GCACGTGC_P2,GTGCACGTGCGCACGCATACATACACGCACACATACACATACACAC...,10.921920,False,10.795631,11.059031,11.963121,9.880718,10.921920,NN,ND
145,Pos_MITF_GCACGTGC_P4,GTATGCACGTGCACGCATACATACACGCACACATACACATACACAC...,10.831796,False,10.831762,11.090340,11.872510,9.791082,10.831796,NN,ND
146,Pos_MITF_GCACGTGC_P6,GTATGTGCACGTGCGCATACATACACGCACACATACACATACACAC...,10.685801,False,10.606644,10.895266,11.725753,9.645850,10.685801,NN,ND


#### (3) Output table

In [11]:
table_s = plot_df.copy()
table_s = table_s[["Group", "Flank", "Sequence", "Signal_Non_UV", "Signal_UV", "Prediction_Upper", "Prediction_Lower", "Predicted"]]
table_s = table_s.reset_index(drop=True)
table_s["Sequence"] = table_s["Sequence"].apply(lambda x: x[:25])
# Keep only the flank groups (TA, CG); drop the OLS training group (ND)
table_s = table_s[table_s["Group"] != "ND"].reset_index(drop=True)
table_s.to_csv(f"{OUTPUT_TABLE}/Table_S4B_MITF_Flanks.csv")

### Figure 3A - MITF Flanks Scatterplot

Figure 3a is a scatterplot of the 3 groups from the code cell above with different colors for each group and a dashed line for the OLS prediction interval.

The bokeh plot is drawn by first creating a bokeh figure object with the figure() function. Then, for each group to be drawn in the scatterplot:

1. A CDSView is created for the group. This can be used as a filter parameter (view) that determines which data are drawn by a method of the plot.
2. The .circle method of the bokeh plot draws circles in the scatterplot for the group with the specified view and color parameters.

Then dashed lines for the prediction interval are drawn with the .line method. Tick marks are defined, and the labels, grids, minor ticks, and toolbar that are created by default are turned off.

**Input:** Dataframe (plot_df) with the following columns:

1. Signal_Non_UV
2. Signal_UV
3. Flank
4. Group
5. Prediction_Upper
6. Prediction_Lower

**Output:** A scatterplot showing groups for TA, CG, and those used to train the OLS model. 




In [12]:
# Plot 
line_df = plot_df.sort_values(by="Signal_Non_UV").reset_index(drop=True)
# Define X and Y columns in the dataframe to plot
column_x = "Signal_Non_UV"
column_y = "Signal_UV"
# Create bokeh figure object
plot = figure(width=800,
              height=800,
              x_range=F3A_PLOT_RANGE,
              y_range=F3A_PLOT_RANGE)
# Define bokeh ColumnDataSource
source = ColumnDataSource(plot_df)
# For each group, create a bokeh view and draw circles for a scatterplot
for group, color in zip(("TA", "CG", "ND"),
                        (F3A_TA_COLOR, F3A_CG_COLOR, F3A_ND_COLOR)):
    view = CDSView(source=source,
                   filters=[GroupFilter(column_name='Group', group=group)])
    plot.circle(column_x,
                column_y,
                source=source,
                view=view,
                size=F3A_CIRCLE_SIZE,
                color=color)
# Draw 2 line glyphs for the prediction intervals
for i in ("Prediction_Upper", "Prediction_Lower"):
    plot.line(line_df[column_x],
              line_df[i],
              line_dash="dashed",
              color="black",
              line_width=3)
# Define tickers
plot.xaxis.ticker = F3A_TICKER
plot.yaxis.ticker = F3A_TICKER
# Remove labels, grids, minor ticks, and toolbars
plot.xaxis.major_label_text_font_size = '0pt'
plot.yaxis.major_label_text_font_size = '0pt'
plot.xgrid.grid_line_color = None
plot.ygrid.grid_line_color = None
plot.xaxis.minor_tick_line_color = None
plot.yaxis.minor_tick_line_color = None
plot.toolbar_location = None
# Update backend to svg
plot.output_backend='svg'
export_svg(plot,
           filename=f"{OUTPUT_MAIN}/Panel_A_Flank_Context_Scatterplot.svg")

['../Figure_3/Panel_A_Flank_Context_Scatterplot.svg']

### Figure 3B-C - MITF flank variations

Figures 3b-c are barplots of the fold change in non-ln space for the TA context (b) and CG context (c). The plots are generated in the following manner:

1. A label column is created which takes the flank column and puts a core_sequence (CACGTG) between the first and second pair of letters. 
2. Columns representing the data in non-ln space are generated.
3. A Fold_Change column is created which is the Non_UV signal divided by the UV signal in non-ln space.
4. A bokeh figure is created with one horizontal bar per sequence, sorted by fold change.

**Input:** Dataframes for the TA and CG flanking contexts

**Output:** Horizontal barplots representing fold change decreases in non-ln space. 
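
As a worked example of the fold-change calculation, the snippet below uses the Signal_Non_UV and Signal_UV values shown above for the ATAA flank. The function name `fold_change_decrease` is hypothetical.

```python
import math


def fold_change_decrease(ln_non_uv, ln_uv):
    """Fold change of non-UV over UV signal, given natural-log signals."""
    return math.exp(ln_non_uv) / math.exp(ln_uv)


# Values from the Flank_MITF_CACGTG_ATAA row displayed earlier
fc = fold_change_decrease(11.090340, 8.548932)
print(round(fc, 2))
```

Dividing in non-ln space is equivalent to exponentiating the difference of the ln values, so the same result could be obtained with `math.exp(ln_non_uv - ln_uv)`.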

#### (1) Functions

In [13]:
def fold_change_barplot(df, core_sequence, bar_color):
    """Return a barplot for fold-changes."""
    # Add label for y axis
    df["label"] = df["Flank"].apply(lambda x: x[:2] + core_sequence + x[2:])
    # Transform data into non-ln space
    df["Nonln_Signal_Non_UV"] = np.exp(df["Signal_Non_UV"])
    df["Nonln_Signal_UV"] = np.exp(df["Signal_UV"])
    df["Nonln_Predicted"] = np.exp(df["Predicted"])
    df["Fold_Change"] = df["Nonln_Signal_Non_UV"] / df["Nonln_Signal_UV"]
    df = df.sort_values(by="Fold_Change").reset_index(drop=True)
    # Plot dataframe df
    source=ColumnDataSource(df)
    # Make a figure object with a y range of categories
    p = figure(y_range=df["label"],
               x_range=(0,50),
               height=800,
               width=800,
               toolbar_location=None,
               tools="")
    # Add horizontal bars
    glyph = HBar(y="label",
                 right="Fold_Change",
                 left=0,
                 height=0.8,
                 fill_color=bar_color)
    p.add_glyph(source, glyph)
    # Set x axis start to 0
    p.x_range.start = 0
    # Remove gridlines
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

#### (2) Analysis

In [14]:
for df, color, flank, panel in ((mitf_ta, "#1f78b4", "TA", "Panel_B"),
                                (mitf_cg, "#33a02c", "CG", "Panel_C")):
    plot = fold_change_barplot(df, "CACGTG", color)
    plot.output_backend='svg'
    export_svg(plot, filename=f"{OUTPUT_MAIN}/{panel}_{flank}_Barplot.svg")

### Figure 3D-E; Figure S4; Supplementary Table 4C-E - Process Data

For dinucleotide variation data in a sliding window, the names are reformatted to accommodate easier parsing, and a column indicating significance is added. Whether a value is within the prediction interval is also needed for coloring plots and producing tables. The following code block creates a dictionary of dataframes with the transcription factor as the key and a dataframe as the value. Each dataframe has an additional column, 'Is_Significant', indicating if the value is outside the prediction interval.

**Input:** 
1. Meta data from FORMATTED_VAR_NAMES
2. Dictionary of dataframes with protein names as the key, data_dict. (The data_dict variable from **Process files - CREB1, EGR1, MITF**).

**Output:** A dictionary, variation_dict, which contains the same keys as data_dict, but with formatted sequence labels and a column indicating if the value is outside the prediction interval.
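
The interval test itself can also be written as a single vectorized pandas expression. The sketch below is an alternative illustration, not the function used in this notebook; `outside_interval` is a hypothetical name, and the three rows reuse values displayed earlier for MITF.

```python
import pandas as pd


def outside_interval(df):
    """Return a boolean Series: True where Signal_UV is outside the interval."""
    within = df["Signal_UV"].between(df["Prediction_Lower"],
                                     df["Prediction_Upper"])
    return ~within


example = pd.DataFrame({
    "Signal_UV": [8.548932, 10.287640, 10.795631],
    "Prediction_Lower": [10.048190, 10.048190, 9.880718],
    "Prediction_Upper": [12.132489, 12.132489, 11.963121],
})
example["Is_Significant"] = outside_interval(example)
print(example)
```

`Series.between` is inclusive on both ends by default, so a value exactly on a bound counts as inside the interval, which matches the strict `>` and `<` comparisons in the loop-based version.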

#### (1) Functions

In [15]:
def add_significance_column(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Add a column to a dataframe indicating significance based on OLS.
    
    Given a dataframe with the following columns:
    1. Signal_UV
    2. Prediction_Upper
    3. Prediction_Lower
    
    Create a column, Is_Significant, which contains True if a value in
    Signal_UV is outside the prediction interval or False if it is within the
    prediction interval.
    """
    bool_list = []
    for row in dataframe.itertuples():
        if ((row.Signal_UV > row.Prediction_Upper) or
            (row.Signal_UV < row.Prediction_Lower)):
            bool_list.append(True)
        else:
            bool_list.append(False)
    dataframe["Is_Significant"] = bool_list
    return dataframe



#### (2) Analysis

In [16]:
# Read formatted names
formatted_names = pd.read_csv(FORMATTED_VAR_NAMES)
variation_dict = {} # Dictionary of variation dataframes
for i in data_dict:
    # Add the formatted name information to the dataframe
    variant_df = pd.merge(data_dict[i],
                          formatted_names,
                          on=["Name", "Sequence"])
    # Add a column indicating significance
    variant_df = add_significance_column(variant_df)
    # Add the dataframe to a dictionary
    variation_dict[i] = variant_df

### Supplementary Table 4C-E

For each protein, organize the data and output to csv. 

In [17]:
formatted_s_tables = {"EGR1":[], "CREB1":[], "MITF":[]}
for protein, var_id, label in table_s_vars_meta:
    df = variation_dict[protein]
    df = df[df["Variant_ID"] == var_id].reset_index(drop=True)
    df["Label"] = label
    df = df[["Label", "Position","Variant", "Sequence", "Signal_Non_UV", "Signal_UV", "Signal_UV_PreScale", "Prediction_Upper", "Prediction_Lower", "Predicted"]]
    formatted_s_tables[protein].append(df)
for i in formatted_s_tables:
    result = pd.concat(formatted_s_tables[i]).reset_index(drop=True)
    result = result.sort_values(by=["Label", "Position", "Variant"]).reset_index(drop=True)
    result.to_csv(f"{OUTPUT_TABLE}/Table_S4_{i}_Variations.csv", index=False)
    

### Figure 3D - MITF Variations Scatterplot

Scatterplot of the E-box variation sequences for MITF. 

**Input:** MITF binding information as stored in variation_dict["MITF"]. 

**Output:** Scatterplot comparing MITF binding in non-UV and UV conditions relative to the prediction interval. 


#### (1) Query for the dataset and output the number of sequences outside the prediction interval for use in text. 

In [18]:
# Query for e-box sequence set
mitf_variants = variation_dict["MITF"]
mitf_ebox = mitf_variants[mitf_variants["Variant_ID"] == 9]
mitf_ebox = mitf_ebox.reset_index(drop=True)
# Information, total count of significant probes
significant_sequence_count = sum(mitf_ebox["Is_Significant"])
total_sequence_count = len(mitf_ebox)
print(f"{significant_sequence_count} of {total_sequence_count} sequences are outside the prediction interval for figure 3D.")

7 of 172 sequences are outside the prediction interval for figure 3D.


#### (2) Draw the plot

In [19]:
# Plot
p = figure(width=800,
           height=800,
           x_range=F3C_PLOT_RANGE,
           y_range=F3C_PLOT_RANGE)
source = ColumnDataSource(mitf_ebox)
# Boolean filters
significant_filter = list(source.data["Is_Significant"])
non_significant_filter = list(
    map(lambda x: not x, source.data["Is_Significant"]))
# Draw scatterplot
for bool_filter, color in ((significant_filter, FC3_COLOR_PALETTE[0]),
                           (non_significant_filter, FC3_COLOR_PALETTE[1])):
    view = CDSView(source=source,
                   filters=[BooleanFilter(bool_filter)])
    p.circle("Signal_Non_UV",
             "Signal_UV",
             source=source,
             view=view,
             size=F3C_CIRCLE_SIZE,
             color=color)
# Draw 2 line glyphs for the prediction intervals
line_df = mitf_ebox.sort_values(by="Signal_Non_UV").reset_index(drop=True)
for i in ("Prediction_Upper", "Prediction_Lower"):
    p.line(line_df["Signal_Non_UV"],
           line_df[i],
           line_dash="dashed",
           color="black",
           line_width=3)
# Set tick marks
p.xaxis.ticker = F3C_TICKER
p.yaxis.ticker = F3C_TICKER
# Remove labels, grids, minor ticks, and the toolbar
p.xaxis.major_label_text_font_size = '0pt'
p.yaxis.major_label_text_font_size = '0pt'
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.minor_tick_line_color = None
p.yaxis.minor_tick_line_color = None
p.toolbar_location = None
p.output_backend = 'svg'
export_svg(p, filename=f"{OUTPUT_MAIN}/Panel_D_Variation_Scatterplot.svg")

['../Figure_3/Panel_D_Variation_Scatterplot.svg']

### Figure 3E - MITF Variation Plot

Custom plots that show the variants relative to the "wildtype" sequence. If a variant is outside of the prediction interval, it is drawn as a box with the letters of the variant inside. If it is within the prediction interval, it is drawn as a dot instead.

**Input:** MITF binding information as stored in variation_dict["MITF"]. 

**Output:** Variation positional plots of non-UV and UV conditions in the same scale saved as svg files. 

#### (1) Functions

In [20]:
def plot_df_from_var_df(tf, var_id, yticker, output):
    t = variation_dict[tf].copy()
    t = t[t["Variant_ID"] == var_id]
    # Remove WT 
    wt = t[t["Position"] < 0]
    wt_len = len(wt)
    if wt_len != 1:
        raise ValueError(f"WT Search found {wt_len} entries.")
    wt_nuv = float(wt["Signal_Non_UV"].iloc[0])
    wt_uv = float(wt["Signal_UV"].iloc[0])
    var = t[t["Position"] > 0].reset_index(drop=True)
    # Calculate relative plot positions 
    min_pos = min(var["Position"])
    nuv_plot_positions = []
    for row in var.itertuples():
        # If single mutant
        if len(row.Variant) == 1:
            pos = row.Position - min_pos + 1
        # If double mutant
        elif len(row.Variant) == 2:
            pos = row.Position - min_pos + 1.5
        else:
            raise ValueError("Variant length not 1 or 2.")
        nuv_plot_positions.append(pos)
    var["Plot_Positions_NUV"] = nuv_plot_positions
    max_pos = max(nuv_plot_positions)
    offset = max_pos + 2
    var["Plot_Positions_UV"] = var["Plot_Positions_NUV"] + offset - 1
    #########
    colors = []
    for row in var.itertuples():
        if row.Signal_UV > row.Prediction_Upper:
            colors.append('#8db0fe')
        elif row.Signal_UV < row.Prediction_Lower:
            colors.append('#f4987a')
        else:
            colors.append("grey")
    var["Color"] = colors
    # Convert to a long format df
    nuv_data = var[["Signal_Non_UV", "Is_Significant", "Variant", "Plot_Positions_NUV", "Color"]]
    nuv_data = nuv_data.rename(columns={"Signal_Non_UV":"Signal",
                                        "Plot_Positions_NUV":"Plot_Position"})
    nuv_data["Group"] = "NUV"
    uv_data = var[["Signal_UV", "Is_Significant", "Variant", "Plot_Positions_UV", "Color"]]
    uv_data = uv_data.rename(columns={"Signal_UV":"Signal", "Plot_Positions_UV":"Plot_Position"})
    uv_data["Group"] = "UV"
    plot_df = pd.concat([nuv_data, uv_data]).reset_index(drop=True)
    max_plot_pos = max(plot_df["Plot_Position"])
    circle_size=5
    plot_y_offset = (max(plot_df["Signal"]) - min(plot_df["Signal"])) * 0.1
    y_range = (min(plot_df["Signal"]) - plot_y_offset, max(plot_df["Signal"]) + plot_y_offset)
    x_range = (0, max_plot_pos + 1)
    source = ColumnDataSource(plot_df)
    p = figure(height=500, width=1000, y_range=y_range, x_range=x_range)
    view_sig = CDSView(source=source,
                       filters=[BooleanFilter(source.data["Is_Significant"])])
    view_nsig = CDSView(source=source,
                        filters=[BooleanFilter(~source.data["Is_Significant"])])
    # Plot non-significant variants as dots
    p.circle("Plot_Position",
             "Signal",
             source=source,
             view=view_nsig,
             color="grey",
             size=circle_size)
    # Plot boxes and letters in overlapping fashion
    for row in plot_df.itertuples():
        if row.Is_Significant:
            p.square(row.Plot_Position,
                     row.Signal,
                     size=30,
                     color=row.Color,
                     line_color='black')
            if len(row.Variant) == 1:
                xoff = -5
            else:
                xoff = -10
                if "C" in row.Variant or "G" in row.Variant:
                    xoff += -2
            # For text to be layered it needs to be a Text glyph, which requires a custom source to do inv
            textattr = {"Plot_Position": [row.Plot_Position], "Signal": [row.Signal], "Variant":[row.Variant]}
            textsource = ColumnDataSource(data=textattr)
            glyph = Text(x="Plot_Position",
                         y="Signal",
                         text="Variant",
                         x_offset = xoff,
                         y_offset = 5,
                         text_color="black")
            p.add_glyph(textsource, glyph)
    # Plot WT lines
    p.line(x=(0, offset - 1), y = (wt_nuv, wt_nuv), line_width=3, color="black")
    p.line(x=(offset - 1, max_plot_pos + 1), y = (wt_uv, wt_uv), line_width=3, color="black")
    # Plot dividing line
    plot_sep = Span(location=offset - 1,
                   dimension='height',
                   line_color='black',
                   line_width=2)
    p.add_layout(plot_sep)
    # Tick marks
    p.xaxis.major_tick_line_color = None
    p.xaxis.minor_tick_line_color = None
    p.xaxis.major_label_text_font_size = '0pt'
    p.yaxis.major_label_text_font_size = '0pt'
    gridTicks = list(range(math.floor(min(plot_df["Plot_Position"])), math.ceil(max(plot_df["Plot_Position"]))))
    gridTicks = list(map(lambda x: x + 0.5, gridTicks))
    start, end = gridTicks[0] - 1, gridTicks[-1] + 1
    p.xgrid[0].ticker = [start] + gridTicks + [end]
    p.yaxis.ticker = yticker 
    p.background_fill_color = None
    p.border_fill_color = None
    #show(p)
    p.output_backend = "svg"
    export_svg(p, filename=output)
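
The x-axis layout in the function above can be illustrated in isolation. The sketch below reproduces only the position arithmetic: single-nucleotide variants sit on integer positions, dinucleotide variants sit halfway between two positions, and the UV panel is the non-UV panel shifted to the right of a separator. The helper `plot_positions` and the example input are hypothetical and are not part of the notebook.

```python
def plot_positions(variants):
    """Compute x positions for the non-UV and UV panels.

    variants: list of (position, variant_string) pairs for non-WT rows.
    Returns (nuv_positions, uv_positions, separator_x).
    """
    min_pos = min(pos for pos, _ in variants)
    nuv = []
    for pos, var in variants:
        if len(var) == 1:
            # Single mutant: centered on its own position
            nuv.append(pos - min_pos + 1)
        elif len(var) == 2:
            # Double mutant: centered between two adjacent positions
            nuv.append(pos - min_pos + 1.5)
        else:
            raise ValueError("Variant length not 1 or 2.")
    offset = max(nuv) + 2
    # UV panel repeats the non-UV layout, shifted past the separator
    uv = [x + offset - 1 for x in nuv]
    return nuv, uv, offset - 1


# Hypothetical input: positions 10-12 with one dinucleotide variant
nuv, uv, sep = plot_positions([(10, "A"), (10, "AC"), (11, "G"), (12, "T")])
print(nuv, uv, sep)
```

With this arithmetic the separator falls one unit to the right of the last non-UV position and one unit to the left of the first UV position, so the two panels have equal margins.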



#### (2) Analysis

In [21]:
plot_df_from_var_df("MITF", 9, (np.log(500), np.log(5000), np.log(50000)), f"{OUTPUT_MAIN}/Figure_3e.svg")
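
The tick positions in the call above are passed as natural logarithms of the raw values, which suggests the plotted signal columns are in ln space; that reading is an assumption based on the call, not something stated here. A minimal sketch of the conversion, using only the standard library and the same raw values:

```python
import math

# Raw tick values as used in the call above
raw_ticks = (500, 5000, 50000)

# Convert to natural-log space, matching np.log in the notebook
ln_ticks = [math.log(t) for t in raw_ticks]

# Ticks that differ by a constant factor are evenly spaced in ln space
spacing = [b - a for a, b in zip(ln_ticks, ln_ticks[1:])]
print(ln_ticks, spacing)
```

Because each raw tick is ten times the previous one, the ticks end up separated by ln(10) on the plotted axis.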

### Supplementary Figure 4


#### (1) Read and display meta data

In [22]:
# Read meta data
fig_s5_meta = pd.read_csv(META_DATA_FIG_S5)
fig_s5_meta

Unnamed: 0,Protein,Variant_ID,Tick_Mark_A,Tick_Mark_B,Tick_Mark_C
0,CREB1,0,2500,10000,40000
1,CREB1,1,2500,10000,40000
2,EGR1,3,2500,10000,40000
3,EGR1,6,2500,10000,40000
4,MITF,8,500,5000,30000
5,MITF,10,500,5000,30000


#### (2) Analysis

In [23]:
for row in fig_s5_meta.itertuples():
    # Read and convert tick marks to ln space
    tick_marks = [row.Tick_Mark_A, row.Tick_Mark_B, row.Tick_Mark_C]
    tick_marks_ln = list(map(np.log, tick_marks))
    # Generate positional plots
    plot_df_from_var_df(row.Protein,
                        row.Variant_ID,
                        tick_marks_ln,
                        output = f"{OUTPUT_SUPP}/{row.Protein}_{row.Variant_ID}.svg")