## Understanding Promoter Sequences

### What is a Promoter?

Promoters are regions of DNA located **upstream** (5' direction) of a gene and are crucial for the **initiation of transcription**. They contain sequences that enable transcription factors and RNA polymerase to bind, starting the process of gene expression.

### Key Features of Promoter Sequences

1. **Core Promoter**:
   - Close to the Transcription Start Site (TSS).
   - Includes sequences like the **TATA box** (TATAAA) that allow RNA polymerase to initiate transcription.

2. **Proximal Promoter**:
   - Located upstream of the core promoter.
   - Contains **regulatory elements** such as the **CAAT box** and **GC box**, which influence transcription activity.

3. **Distal Promoter**:
   - Can be located far upstream, and may contain **enhancer sequences** that regulate gene expression through DNA looping interactions.

### Key Motifs in Promoter Sequences

1. **TATA Box**:
   - A conserved sequence that is essential for RNA polymerase binding.
   - Sequence: `TATAAA`.

2. **CpG Islands**:
   - Regions rich in **C** and **G** nucleotides.
   - Often **unmethylated** in actively transcribed genes, playing a role in gene regulation.

3. **CAAT Box**:
   - Regulatory element involved in transcription factor binding.
   - Sequence: `GGCCAATCT`.

4. **GC Box**:
   - Rich in **G** and **C** nucleotides and involved in transcription regulation.
   - Sequence: `GGGCGG`.

### Transcription Start Site (TSS)

- The **TSS** is where RNA polymerase begins transcription.
- Promoter regions are typically defined as sequences located **upstream** (5') of the TSS, often around **1-2 kb** in length.

### Function of Promoter Sequences

1. **Gene Expression Regulation**:
   - Promoters determine when and where a gene is expressed by interacting with transcription factors and RNA polymerase.
   
2. **Tissue-Specific Expression**:
   - Different promoters can be activated in specific tissues or conditions, ensuring appropriate gene expression patterns.
   
3. **External Stimuli Response**:
   - Promoters can contain elements that allow genes to respond to changes in the environment or specific signaling molecules.

### Promoter Sequence Length

- Promoters typically range from **1-2 kb** upstream of the TSS, though long-range regulatory elements may extend further upstream or downstream.
- The core regulatory elements are often found within the first **500-1000 bases** upstream of the TSS.

### Promoter Elements in Eukaryotic vs. Prokaryotic Systems

- **Eukaryotic Promoters**: More complex, containing various regulatory elements such as enhancers and silencers.
- **Prokaryotic Promoters**: Simpler, typically consisting of **-35** and **-10** regions recognized by bacterial RNA polymerase.

### Importance of Understanding Promoter Sequences

- **Gene Regulation**: Promoters play a pivotal role in regulating gene expression across various biological contexts.
- **Gene Expression Analysis**: Promoter analysis is crucial for studying how genes are regulated and how they respond to different signals.
- **Biotechnological Applications**: Understanding promoters is essential for gene therapy and genetic engineering applications.

### References

- Fickett, J. W. (1982). *Recognition of protein coding regions in DNA sequences.* Nucleic Acids Research, 10(17), 5303-5318.
- Hatzis, P., & Talianidis, I. (2002). *Dynamics of enhancer-promoter communication during differentiation-induced gene activation.* Molecular Cell, 10(6), 1467-1477.
- [Promoter Dynamics in Gene Regulation and Expression](https://biologyinsights.com/promoter-dynamics-in-gene-regulation-and-expression/)
(https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/General_Biology_%28Boundless%29/16%3A_Gene_Expression/16.06%3A_Eukaryotic_Gene_Regulation_-_The_Promoter_and_the_Transcription_Machinery)
- [Promoter (genetics) - Wikipedia](https://en.wikipedia.org/wiki/Promoter_%28genetics%29)

In [None]:
import os
import pandas as pd
from Bio import SeqIO

# Constants for file paths
FASTA_FILE = "Homo_sapiens.GRCh38.dna.primary_assembly.fa"  # Path to the genome FASTA file
GTF_FILE = "Homo_sapiens.GRCh38.110.gtf"  # Path to the GTF file

# Load genome assembly from FASTA file
def load_genome(fasta_path):
    print("Loading genome from FASTA file...")
    genome = {rec.id: str(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta")}
    return genome

# Load and process the GTF file (annotations)
def load_gtf(gtf_path):
    print("Loading and processing GTF file...")
    columns = ['seqname', 'source', 'feature', 'start', 'end', 
               'score', 'strand', 'frame', 'attribute']
    
    gtf_df = pd.read_csv(gtf_path, sep='\t', comment='#', header=None, names=columns, low_memory=False)
    gtf_df[['start', 'end']] = gtf_df[['start', 'end']].astype(int)
    
    # Filter out the rows that are not genes (only keep 'gene' entries)
    gtf_df = gtf_df[gtf_df['feature'] == 'gene']
    return gtf_df

# Function to explore genome data and GTF annotation
def explore_data(genome, gtf_df):
    # 1. Number of chromosomes in the genome
    chromosomes = list(genome.keys())
    
    # Split chromosomes into main chromosomes and additional chromosomes (unplaced scaffolds, alternate loci)
    main_chromosomes = [chrom for chrom in chromosomes if chrom in [str(i) for i in range(1, 23)] + ['X', 'Y']]
    additional_chromosomes = [chrom for chrom in chromosomes if chrom not in main_chromosomes]

    num_main_chromosomes = len(main_chromosomes)
    num_additional_chromosomes = len(additional_chromosomes)
    
    # Total number of chromosomes
    total_chromosomes = num_main_chromosomes + num_additional_chromosomes
    
    # 2. Print chromosome types and counts
    print(f"\nTotal number of chromosomes in the genome: {total_chromosomes}")
    print(f"Main chromosomes (1-22, X, Y): {num_main_chromosomes} chromosomes")
    print(f"Additional chromosomes (unplaced scaffolds, alternate loci): {num_additional_chromosomes} chromosomes")
    
    print(f"\nChromosomes present in the genome (first 5 of each type):")
    print(f"Main chromosomes (first 5): {main_chromosomes[:5]}")
    print(f"Additional chromosomes (first 5): {additional_chromosomes[:5]}")
    
    print("\nTotal number of chromosomes printed: 194 (includes both main and additional chromosomes).")
    print("Additional chromosomes include unplaced scaffolds and alternate loci that are present in the FASTA file but not part of the main 24 human chromosomes.")

    # 3. Total number of genes
    total_genes = gtf_df['attribute'].nunique()  # Count unique gene IDs
    print(f"\nTotal number of genes in the GTF file: {total_genes}")
    
    # Gene Type Distribution
    print("\nGene Types Distribution:")
    gene_types = gtf_df['attribute'].str.extract(r'gene_biotype "([^"]+)"')[0].value_counts()
    print(gene_types)
    
    # 4. Check for missing or invalid values in the GTF dataframe
    missing_values = gtf_df.isnull().sum()
    print("\nMissing values in the GTF file:")
    print(missing_values)
    
    # 5. Check for duplicated gene entries in the GTF
    duplicated_genes = gtf_df[gtf_df.duplicated(subset=['attribute'])]
    if not duplicated_genes.empty:
        print("\nDuplicated genes found in the GTF file:")
        print(duplicated_genes[['attribute', 'seqname', 'start', 'end']])
    else:
        print("\nNo duplicated genes found in the GTF file.")
    
    # 6. Chromosome lengths in the genome
    chromosome_lengths = {chrom: len(seq) for chrom, seq in genome.items()}
    print("\nChromosome lengths (top 5 chromosomes):")
    for chrom in list(chromosome_lengths.keys())[:5]:
        print(f"{chrom}: {chromosome_lengths[chrom]} bases")
    
    # 7. Validate that all chromosomes in the GTF are present in the genome
    gtf_chromosomes = gtf_df['seqname'].unique()
    missing_chromosomes = [chrom for chrom in gtf_chromosomes if chrom not in genome]
    
    if missing_chromosomes:
        print("\nMissing chromosomes in the genome FASTA file (present in GTF):")
        print(missing_chromosomes)
    else:
        print("\nAll chromosomes in the GTF are present in the genome FASTA file.")
    
    # 8. Check the range of gene positions (start and end)
    print("\nChecking gene start and end positions (sample):")
    print(gtf_df[['seqname', 'start', 'end']].head())
    
    # 9. Check for gene annotations with invalid or overlapping positions
    invalid_positions = gtf_df[gtf_df['start'] >= gtf_df['end']]
    if not invalid_positions.empty:
        print("\nInvalid gene positions (start >= end):")
        print(invalid_positions)
    else:
        print("\nNo invalid gene positions (start < end).")
    
    # 10. Gene Biotype Distribution
    print("\nGene Biotypes Distribution:")
    gene_biotypes = gtf_df['attribute'].str.extract(r'gene_biotype "([^"]+)"')[0].value_counts()
    print(gene_biotypes)
    
    # 11. Number of genes on main human chromosomes (1-22, X, Y)
    relevant_chromosomes = [str(i) for i in range(1, 23)] + ['X', 'Y']
    genes_on_main_chromosomes = gtf_df[gtf_df['seqname'].isin(relevant_chromosomes)]
    num_genes_on_main_chromosomes = genes_on_main_chromosomes['attribute'].nunique()
    print(f"\nNumber of genes on the 24 human chromosomes (1-22, X, Y): {num_genes_on_main_chromosomes}")

# Main workflow

# Load the genome (FASTA) and GTF annotation data
genome = load_genome(FASTA_FILE)
gtf_df = load_gtf(GTF_FILE)

# Explore the data and check for missing or incorrect information
explore_data(genome, gtf_df)


# Promoter Identification Model Training: Data Selection and Justification

## Overview of the Dataset

The dataset used for this analysis includes gene annotations from the **GTF file** and genomic sequences from the **FASTA file**. Here is a summary of the key details that we will use for **promoter identification**:

### 1. **Chromosomes and Gene Distribution**
- **Total number of chromosomes in the genome**: 194
- **Main chromosomes (1-22, X, Y)**: 24 chromosomes (representing the primary human genome)
- **Additional chromosomes (unplaced scaffolds, alternate loci)**: 170 chromosomes
- **Total number of genes in the GTF file**: 62,754
- **Number of genes on the main chromosomes (1-22, X, Y)**: 62,663

### 2. **Gene Types Distribution**
The dataset includes genes categorized by their biological function or type. Below is the distribution of gene types:

- **Protein-coding genes**: 20,070
- **Non-coding RNA genes** (lncRNA, miRNA, etc.): 37,684
- **Pseudogenes**: 13,221
- **Other gene types** (such as rRNA, snRNA, etc.)

---

## Considerations for Promoter Identification Model Training

When training a **promoter identification model**, selecting the most relevant genes and chromosomes is essential to ensure the model is focused and biologically meaningful. Below, we discuss the reasons for **focusing only on protein-coding genes** and **main human chromosomes** for this study.

### 1. **Gene Type Selection: Focusing Only on Protein-Coding Genes**

#### **Why We Focus Only on Protein-Coding Genes**
- **Protein-coding genes** are essential for most biological processes, and their **promoter regions** are the most well-characterized and widely studied. These genes have a well-defined **transcription start site (TSS)**, making them ideal candidates for **promoter identification**.
  - **Total protein-coding genes**: 20,070

- **Justification for Exclusion of Non-Coding Genes**:
  - Non-coding genes, such as **lncRNA** and **miRNA**, are involved in **regulatory** functions, but their **promoter regions** are less well-defined and may vary considerably across different tissues and conditions. 
  - The primary goal of this study is to build a model for **identifying promoter regions**, and non-coding genes may complicate the model training due to their variability and lesser focus on the canonical promoter structure.
  - As a result, **non-coding genes** (such as **lncRNA**, **miRNA**, **snRNA**, etc.) are **excluded** from this study.

- **Justification for Exclusion of Pseudogenes**:
  - **Pseudogenes** are non-functional gene copies that are generally not transcribed or translated. Since they do not play an active role in gene expression, their **promoter regions** are not relevant for this study.
  - **Pseudogenes** are **excluded** from the dataset, as they do not contribute useful information for promoter identification.

---

### 2. **Chromosome Type Selection: Limiting to Main Human Chromosomes (1-22, X, Y)**

#### **Why We Limit to Main Human Chromosomes**
- The **main chromosomes (1-22, X, Y)** contain the majority of the functional genes in the human genome and are the primary focus of **promoter studies**.
  - These chromosomes represent the canonical human genome, with well-established gene positions and transcriptional regulation mechanisms.
  - **Total genes on main chromosomes (1-22, X, Y)**: 62,663

#### **Justification for Exclusion of Additional Chromosomes**
- **Additional chromosomes** (unplaced scaffolds, alternate loci) do not represent well-characterized, functional regions of the human genome. These chromosomes often contain fragmented or incomplete gene sequences, making them less relevant for studying promoter regions.
- The **additional chromosomes** may not have clear gene start and end positions, and their promoters may be difficult to define. Including them could **complicate the model** and lead to inaccurate predictions.
- Therefore, **additional chromosomes** are **excluded** from this study.

---

## Recommended Approach for Promoter Identification Model Training

Given the above considerations, the recommended approach for **promoter identification model training** is:

1. **Primary Focus on Protein-Coding Genes**:
   - Train the model using only **protein-coding genes**, which are the most biologically relevant and well-studied genes for promoter identification.
   - **Total protein-coding genes**: 20,070

2. **Limiting to Main Human Chromosomes (1-22, X, Y)**:
   - Use genes located on the **main human chromosomes** (1-22, X, Y), which account for the majority of functional human genes.
   - **Total genes on main chromosomes**: 62,663

3. **Exclusion of Non-Coding Genes and Pseudogenes**:
   - Exclude **non-coding genes** (such as **lncRNAs**, **miRNAs**, etc.) and **pseudogenes** as their promoters are less relevant for this study and could introduce unnecessary complexity.

4. **Exclusion of Additional Chromosomes**:
   - Exclude genes located on **additional chromosomes** (unplaced scaffolds, alternate loci) as they are typically uncharacterized and non-functional.

---

## Conclusion

For this **promoter identification study**, we have focused on **protein-coding genes** located on the **main human chromosomes (1-22, X, Y)**. This dataset offers the most relevant and well-defined regions for identifying **promoter sequences**. By excluding **non-coding genes**, **pseudogenes**, and **additional chromosomes**, we ensure that the model is trained on a biologically relevant and manageable dataset, improving the **accuracy** and **effectiveness** of promoter prediction.



# Extracting Promoter Regions and Preparing Data for Preprocessing

In [1]:
import pandas as pd
from Bio import SeqIO
import re

# Constants for file paths
FASTA_FILE = "Homo_sapiens.GRCh38.dna.primary_assembly.fa"  # Path to the genome FASTA file
GTF_FILE = "Homo_sapiens.GRCh38.110.gtf"  # Path to the GTF file

# Load genome assembly from FASTA file
def load_genome(fasta_path):
    print("Loading genome from FASTA file...")
    genome = {rec.id: str(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta")}
    return genome

# Load and process the GTF file (annotations)
def load_gtf(gtf_path):
    print("Loading and processing GTF file...")
    columns = ['seqname', 'source', 'feature', 'start', 'end', 
               'score', 'strand', 'frame', 'attribute']
    
    gtf_df = pd.read_csv(gtf_path, sep='\t', comment='#', header=None, names=columns, low_memory=False)
    gtf_df[['start', 'end']] = gtf_df[['start', 'end']].astype(int)
    
    # Filter for coding genes (transcripts, exons, and coding genes)
    coding_genes_df = gtf_df[gtf_df['feature'].isin(['exon', 'CDS'])]
    return coding_genes_df

# Filter out the genes on chromosomes of interest (1-22, X, Y)
def filter_relevant_chromosomes(gtf_df):
    relevant_chromosomes = [str(i) for i in range(1, 23)] + ['X', 'Y']
    filtered_df = gtf_df[gtf_df['seqname'].isin(relevant_chromosomes)]
    print(f"Filtered genes on chromosomes 1-22, X, Y: {filtered_df.shape[0]} entries.")
    return filtered_df

# Function to extract promoter sequence (2kb upstream of TSS)
def extract_promoter_sequence(fasta_dict, chromosome, gene_start, promoter_length=2000):
    """
    Extracts the promoter region from a chromosome sequence, 2kb upstream of TSS.
    """
    start_pos = max(gene_start - promoter_length, 0)
    end_pos = gene_start
    promoter_seq = fasta_dict[chromosome][start_pos:end_pos]
    return promoter_seq

# Calculate GC content of a sequence
def calculate_gc_content(promoter_seq):
    """
    Calculates the GC content of the promoter sequence.
    """
    gc_count = promoter_seq.count('G') + promoter_seq.count('C')
    gc_content = gc_count / len(promoter_seq) if len(promoter_seq) > 0 else 0
    return gc_content

# Detect TATA box (e.g., TATAAA or slight variations)
def detect_tata_box(promoter_seq):
    """
    Detects TATA box (sequence: TATAAA, variations like TATAAG, etc.).
    """
    tata_box_pattern = re.compile(r'TATA[A|G|C|T]{1,3}')
    match = tata_box_pattern.search(promoter_seq)
    return match.group(0) if match else None

# CpG island detection based on CpG content (a simple method for demonstration)
def detect_cpg_island(promoter_seq):
    """
    Detect CpG islands in the promoter sequence (simplified version).
    CpG islands are characterized by a high number of CpG sites.
    """
    cpg_count = sum([1 for i in range(len(promoter_seq) - 1) 
                     if promoter_seq[i:i+2] == 'CG'])
    cpg_content = cpg_count / len(promoter_seq)
    # A CpG island is considered if the CpG content is higher than a threshold
    return cpg_content > 0.5  # Arbitrary threshold for demonstration

# Check for missing nucleotides (N's in the sequence)
def check_for_missing_nucleotides(promoter_seq):
    """
    Checks if there are any missing nucleotides (represented as 'N') in the promoter sequence.
    """
    return 'N' in promoter_seq

# Check if promoter length is below the required threshold (2 kb)
def check_promoter_length(promoter_seq, threshold=2000):
    """
    Checks if the extracted promoter sequence length is below the specified threshold (2kb).
    """
    return len(promoter_seq) < threshold

# Main workflow to process coding genes and extract promoter regions
def main():
    # Load the genome (FASTA) and GTF annotation data
    genome = load_genome(FASTA_FILE)
    gtf_df = load_gtf(GTF_FILE)
    
    # Filter coding genes and focus on chromosomes 1-22, X, and Y
    coding_genes_df = filter_relevant_chromosomes(gtf_df)
    
    # List to store extracted features
    extracted_data = []

    # Loop through coding genes and extract features
    for _, gene in coding_genes_df.iterrows():
        gene_id = gene['attribute'].split('"')[1]  # Extract gene ID from the 'attribute' field
        chromosome = gene['seqname']
        gene_start = gene['start']
        gene_end = gene['end']
        strand = gene['strand']
        
        # Extract promoter sequence (2 kb upstream of TSS)
        promoter_seq = extract_promoter_sequence(genome, chromosome, gene_start)
        
        # Check if promoter length is valid (>=2 kb)
        is_valid_length = not check_promoter_length(promoter_seq)
        
        # Calculate GC content of the promoter sequence
        gc_content = calculate_gc_content(promoter_seq)
        
        # Detect TATA box in the promoter
        tata_box = detect_tata_box(promoter_seq)
        
        # Detect CpG island in the promoter
        cpg_island = detect_cpg_island(promoter_seq)
        
        # Check for missing nucleotides in the promoter sequence
        has_missing_nucleotides = check_for_missing_nucleotides(promoter_seq)
        
        # Store extracted data
        extracted_data.append({
            "Gene_ID": gene_id,
            "Chromosome": chromosome,
            "Promoter_Sequence": str(promoter_seq),
            "GC_Content": gc_content,
            "Gene_Length": gene_end - gene_start,
            "TATA_Box": tata_box,
            "CpG_Island": cpg_island,
            "Strand": strand,
            "Valid_Promoter_Length": is_valid_length,
            "Missing_Nucleotides": has_missing_nucleotides
        })

     
    # Create a DataFrame to store the extracted features
    extracted_df = pd.DataFrame(extracted_data)

            
        # Print quality assurance checks for each gene
    
    extracted_df.head()

    # Save extracted features to a CSV file
    extracted_df.to_csv("coding_genes_promoter_features_quality_checked.csv", index=False)
    print("\nExtracted features and quality checks saved to 'coding_genes_promoter_features_quality_checked.csv'.")

    # Perform additional analysis as needed (e.g., correlation analysis between GC content and gene length, etc.)
    print("\nDescriptive statistics of Gene Length and GC Content:")
    print(extracted_df[['Gene_Length', 'GC_Content']].describe())

# Run the script

main()

Loading genome from FASTA file...
Loading and processing GTF file...
Filtered genes on chromosomes 1-22, X, Y: 2534535 entries.

Extracted features and quality checks saved to 'coding_genes_promoter_features_quality_checked.csv'.

Descriptive statistics of Gene Length and GC Content:
        Gene_Length    GC_Content
count  2.534535e+06  2.534535e+06
mean   2.240494e+02  4.531070e-01
std    5.549804e+02  9.095463e-02
min    0.000000e+00  5.000000e-04
25%    8.400000e+01  3.805000e-01
50%    1.240000e+02  4.385000e-01
75%    1.810000e+02  5.195000e-01
max    3.472990e+05  8.280000e-01
