### Supplementary Figure 1

This notebook contains the code used to count the number of pyrimidine dinucleotides in the UV-Bind universal sequence design and compare it to the original protein binding microarray design. 

### File Input and Output

**Input:**

| Input File | Associated Figure | Description |
| --- | --- | --- |
| Universal_UV_Bind_Meta_Data_All_9mer.csv | Supplementary Figure 1 | Library design for universal UV-Bind. |
| 8x15k_v2_sequences.txt | Supplementary Figure 1 | Library design for universal PBMs. |

**Output:**

Design_Comparison_Pyrimidine_Dinucleotide_Counts.csv - Median pyrimidine dinucleotide count for each design, written as one `label,count` row per design and sequence region.

### 3rd Party Packages

1. NumPy - Median calculation
2. Pandas - Dataframe usage

### UV-Bind Analysis Core Imports

- uac.count_overlapping_kmers: Counts overlapping k-mers from a given set
- uac.PYDI: Tuple of pyrimidine dinucleotides in both orientations

Additional details can be found in the uvbind_analysis_core.py script.
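For readers without the core script at hand, the following is a minimal sketch of what counting overlapping k-mers could look like. It is not the implementation in uvbind_analysis_core.py: the names `PYDI_SKETCH` and `count_overlapping_kmers_sketch` are hypothetical, and the contents of the tuple are an assumption (the four pyrimidine dinucleotides plus their reverse complements, as one reading of "both orientations").

```python
from __future__ import annotations

# Hypothetical stand-in for uac.PYDI (assumed contents, not confirmed):
# pyrimidine dinucleotides on the forward strand plus their reverse complements.
PYDI_SKETCH = ("CC", "CT", "TC", "TT", "GG", "AG", "GA", "AA")


def count_overlapping_kmers_sketch(sequence: str, kmers: set[str], k: int) -> int:
    """Count every length-k window of `sequence` that is a member of `kmers`.

    Windows advance one base at a time, so overlapping matches are all counted.
    """
    return sum(sequence[i:i + k] in kmers for i in range(len(sequence) - k + 1))


# "ATTTCG" has windows AT, TT, TT, TC, CG; three of them are in the set.
example_count = count_overlapping_kmers_sketch("ATTTCG", set(PYDI_SKETCH), 2)
```

Sliding the window by a single base is what makes the count "overlapping": a run such as TTT contributes two TT dinucleotides rather than one.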

### Abbreviations

- df: DataFrame
- pydi: Pyrimidine dinucleotide
- seq: Sequence
- seqs: Sequences
- upbm: Universal protein binding microarray

#### Imports and Global Variables

In [1]:
from __future__ import annotations
import os

import pandas as pd
import numpy as np

import uvbind_analysis_core as uac

UV_DESIGN_FILE = "../../Design/Universal_UV_Bind/Universal_UV_Bind_Meta_Data_All_9mer.csv"
UPBM_DESIGN_FILE = "../../Data/External_Data/8x15k_v2_sequences.txt"
OUTPUT_FOLDER = "../Figure_S1"
OUTPUT_FILE = f"{OUTPUT_FOLDER}/Design_Comparison_Pyrimidine_Dinucleotide_Counts.csv"

In [2]:
# Create output folder if not already present
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

### Comparing Pyrimidine Dinucleotide Counts Among Designs

The median pyrimidine dinucleotide count is calculated for the 8x15k uPBM design and for the universal UV-Bind design, using both the universal portion of each sequence and the full sequence. Both designs contain all possible 7-mers; the 8x15k uPBM design was introduced with the Seed-and-Wobble method.
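As a toy illustration of the comparison, the sketch below computes a per-sequence count and then takes the median for each of two made-up sequence sets. The sequences, the labels, and the names `PYDI_TOY` and `toy_pydi_count` are all hypothetical, and the dinucleotide set is an assumed stand-in for uac.PYDI.

```python
from statistics import median

# Assumed dinucleotide set (both strands); hypothetical stand-in for uac.PYDI.
PYDI_TOY = {"CC", "CT", "TC", "TT", "GG", "AG", "GA", "AA"}


def toy_pydi_count(seq: str) -> int:
    """Count overlapping dinucleotide windows of `seq` found in PYDI_TOY."""
    return sum(seq[i:i + 2] in PYDI_TOY for i in range(len(seq) - 1))


# Made-up sequences, used only to show the shape of the calculation.
toy_designs = {
    "Design_A": ["ACGTAC", "ATTTCG", "CCGTAT"],  # per-sequence counts: 0, 3, 1
    "Design_B": ["ACACAC", "GTGTGT", "ATATAT"],  # per-sequence counts: 0, 0, 0
}

# One summary value per design: the median of its per-sequence counts.
toy_medians = {
    label: median(map(toy_pydi_count, seqs)) for label, seqs in toy_designs.items()
}
```

Summarizing each design by its median gives a single value per design that is not dominated by a few sequences with unusually high counts.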

In [3]:
def median_pydi_count(sequence_list: list[str]) -> float:
    """Return median count of pyrimidine dinucleotides in sequence iterable."""
    def pydi_count(sequence):
        """Return the count of pyrimidine dinucleotides in a single sequence."""
        return uac.count_overlapping_kmers(sequence, set(uac.PYDI), 2)
    return np.median(list(map(pydi_count, sequence_list)))

# Ensure output directory exists
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
# Read uvbind design file
uvbind_universal = pd.read_csv(UV_DESIGN_FILE)
# Read original PBM design file
upbm_design = pd.read_csv(UPBM_DESIGN_FILE, sep='\t')
upbm_design = upbm_design.dropna(subset=["Name"])
upbm_design = upbm_design[upbm_design["Name"].str.startswith("9mer")]
upbm_design = upbm_design.reset_index(drop=True)
upbm_design["Universal_Seq"] = upbm_design["Sequence"].apply(lambda x: x[1:36])
# Compute the median count for each design and sequence region, save to output file
with open(OUTPUT_FILE, 'w') as file_object:
    for seqs, label in ((uvbind_universal["Substring"], "UV_Bind_Universal"),
                        (uvbind_universal["Sequence"], "UV_Bind_Full"),
                        (upbm_design["Universal_Seq"], "UPBM_Universal"),
                        (upbm_design["Sequence"], "UPBM_Full")):
        count = median_pydi_count(seqs)
        file_object.write(f"{label},{count}\n")