# CS525 FINAL PROJECT
### GROUP MEMBERS: Ted Monyak, Jack Forman




In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sys
import os
import json

simplecnnDir = os.path.join(os.getcwd(), 'Model_SimpleCNN_Best')
deepcnnDir = os.path.join(os.getcwd(), 'Model_DeepCNN_Best')
transformerDir = os.path.join(os.getcwd(), 'Model_DNATransformer_Best')

dirs = [simplecnnDir, deepcnnDir, transformerDir]

# Introduction

### Dataset
##### Training, Validation, Testing Dataset

##### Biological Dataset

To examine the biological relevance and generalizability of our models, we curated an additional dataset from [Ensembl Plants](https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/fasta/arabidopsis_thaliana/pep/), consisting of approximately 27,000 protein-coding genes from the *Arabidopsis thaliana* genome. The aim of this dataset is to investigate the chromatin accessibility of these genes across various plant tissues. This is biologically significant, as different tissues express distinct sets of proteins, and gene accessibility plays a key regulatory role in this expression. 

We expect the results to reveal two main patterns:  
1. A core set of genes with consistent accessibility across all tissues  
2. A set of genes exhibiting tissue-specific accessibility profiles
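
The two expected patterns can be separated with a simple rule. The sketch below is only an illustration, not the analysis used in this project: it assumes each gene has one predicted accessibility score per tissue, and the function name `classify_genes`, the `0.5` threshold, the gene names, and the tissue names are all hypothetical.

```python
def classify_genes(accessibility, threshold=0.5):
    """Group genes by how many tissues they are accessible in.

    accessibility: dict mapping gene -> dict mapping tissue -> score.
    threshold: hypothetical cutoff above which a gene counts as accessible.
    """
    groups = {"core": [], "tissue_specific": [], "inaccessible": []}
    for gene, scores in accessibility.items():
        open_tissues = [t for t, s in scores.items() if s >= threshold]
        if scores and len(open_tissues) == len(scores):
            groups["core"].append(gene)             # accessible everywhere
        elif open_tissues:
            groups["tissue_specific"].append(gene)  # accessible in some tissues
        else:
            groups["inaccessible"].append(gene)     # accessible nowhere
    return groups


# Toy scores (made up for illustration only).
toy_scores = {
    "gene_a": {"root": 0.9, "leaf": 0.8, "flower": 0.7},
    "gene_b": {"root": 0.9, "leaf": 0.1, "flower": 0.2},
    "gene_c": {"root": 0.1, "leaf": 0.2, "flower": 0.1},
}
groups = classify_genes(toy_scores)
print(groups)
```

A fixed threshold is the crudest possible choice; it is used here only to make the two categories concrete.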



### Model Architectures

Several neural network architectures were tested to evaluate their effectiveness in predicting chromatin accessibility. The simplest, the **Simple CNN**, consisted of a single convolutional layer followed by a fully connected layer. This minimal architecture offers advantages in interpretability and computational efficiency. However, its simplicity may limit its ability to capture higher-order patterns in the sequence data that could be important for accurately modeling chromatin accessibility.
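
To make the role of a convolutional layer concrete, the sketch below scans a one-hot encoded DNA sequence with a single filter in plain Python. It is an illustration under stated assumptions rather than the project's model code (which lives in SRC/Models): the helper names, the hand-set "TATA" filter, and the example sequence are all hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_relu(encoded, kernel):
    """Slide one filter (width x 4) along the sequence and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(0.0, score))
    return activations


# Hypothetical hand-set filter that rewards the motif "TATA".
tata_filter = one_hot("TATA")
activations = conv1d_relu(one_hot("GGTATAGG"), tata_filter)

# Position where the filter responds most strongly.
best_position = max(range(len(activations)), key=activations.__getitem__)
print(activations, best_position)
```

In a trained network the filter weights are learned from data rather than set by hand, and many filters are applied in parallel.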

A more complex architecture, the **Deep CNN**, consisted of three convolutional layers followed by three fully connected layers. This model has significantly greater capacity to learn complex features and hierarchical patterns within the sequence data, while still retaining a relatively straightforward structure. Its depth allows it to capture more nuanced relationships that may be missed by simpler models.

The final architecture explored was a simple transformer-based model, referred to as the **DNA Transformer**. It consisted of three convolutional layers followed by a transformer block, which was then connected to a single linear output layer. This architecture is designed to capture both local and global patterns in the sequence data. The convolutional layers help extract localized features, while the transformer enables the model to learn long-range dependencies and higher-order structures that may be critical for understanding chromatin accessibility.

The source code for all three models is in `SRC/Models`.



# Methods



# Results

### Training



In [None]:
# One panel per model, side by side (keyword names are nrows/ncols)
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 6))

for i, model_dir in enumerate(dirs):
    # Each model directory holds a losses.json with per-epoch losses
    with open(os.path.join(model_dir, 'losses.json'), 'r') as f:
        losses = json.load(f)
    train_loss = losses['train_loss']
    val_loss = losses['validation_loss']
    epochs = range(1, len(train_loss) + 1)
    ax[i].plot(epochs, train_loss, label='Train Loss', color='blue')
    ax[i].plot(epochs, val_loss, label='Validation Loss', color='orange')
    ax[i].set_title(os.path.basename(model_dir))
    ax[i].set_xlabel('Epochs')
    ax[i].set_ylabel('Loss')
    ax[i].legend()

plt.tight_layout()  # keep axis labels from overlapping between panels
plt.show()


        

### Simple CNN
<style>
img {
    display: block;
    margin-left: 0;
    margin-right: auto;
    width: 30%;
    height: 30%;
}
</style>


The accuracy graphs compare each model's predictions with the experimental values and show how well the model learns and generalizes. Points on the diagonal *y = x* are perfect predictions. On the training set, the predictions track the diagonal without lying exactly on it; a perfect fit there would suggest that the model had memorized (overfit) the training data rather than learned general patterns.

In addition, the Pearson correlation coefficient is used to assess how well the predicted values correlate with the experimental (ground truth) data. A value of 1 indicates perfect linear correlation.
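
As a reference for how this coefficient is computed, here is a minimal pure-Python sketch. The function name `pearson_r` and the toy values are assumptions for illustration, not the project's evaluation code; `numpy.corrcoef(x, y)[0, 1]` returns the same quantity.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


# Toy values (made up): predictions that are a scaled, shifted copy of the truth.
observed = [0.1, 0.4, 0.5, 0.9]
shifted = [2 * v + 1 for v in observed]
r_shifted = pearson_r(shifted, observed)
print(r_shifted)
```

The toy case shows a limitation worth keeping in mind: predictions that are a scaled and shifted copy of the true values still reach a correlation of 1 (up to floating-point rounding) even though they do not lie on the *y = x* line, so the coefficient measures linear agreement rather than exact agreement.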

Among the models, the **Simple CNN** performs worse on the training data than the more complex architectures. However, it generalizes well to both the validation and testing datasets, suggesting that its simplicity helps prevent overfitting and allows it to capture robust, generalizable features.

![image](Model_SimpleCNN_Best/TrainingAccuracy.png)

![image](Model_SimpleCNN_Best/ValAccuracy.png)
 
![image](Model_SimpleCNN_Best/TestingAccuracy.png)

### Deep CNN
<style>
img {
    display: block;
    margin-left: 0;
    margin-right: auto;
    width: 30%;
    height: 30%;
}
</style>



![image](Model_DeepCNN_Best/TrainingAccuracy.png)
![image](Model_DeepCNN_Best/ValAccuracy.png)
![image](Model_DeepCNN_Best/TestingAccuracy.png)


### DNA Transformer
<style>
img {
    display: block;
    margin-left: 0;
    margin-right: auto;
    width: 30%;
    height: 30%;
}
</style>
Discussion about training accuracy ![image](Model_DNATransformer_Best/TrainingAccuracy.png)
Discussion about validation accuracy ![image](Model_DNATransformer_Best/ValAccuracy.png)
Discussion about testing accuracy ![image](Model_DNATransformer_Best/TestingAccuracy.png)


### Filters

### Gene Differences in Tissues

<style>
img {
    display: block;
    margin-left: 0;
    margin-right: auto;
    width: 80%;
    height: 30%;
}
</style>



![image](AccessibilityHeatmap.png) 

# Discussion