# 🧬 ORF Finder & Gene Prediction Tool

This project identifies **Open Reading Frames (ORFs)** across all six reading frames in a DNA sequence.

It reports:
- Strand and frame (+1, +2, +3, -1, -2, -3), stored as separate `Strand` and `Frame` columns
- Start & stop positions
- ORF length

---

✅ Tools: Biopython, pandas, Google Colab (file upload), Python standard library


In [1]:
!pip install biopython

from Bio import SeqIO
from Bio.Seq import Seq
from google.colab import files

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


## 📂 Upload a DNA FASTA File


In [9]:
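uploaded = files.upload()  # opens the Colab file picker; returns {filename: bytes}
fasta_file = list(uploaded.keys())[0]  # use the first uploaded file

# SeqIO.read expects exactly one record and raises ValueError otherwise
record = SeqIO.read(fasta_file, "fasta")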
sequence = record.seq
print(f"✅ Sequence ID: {record.id}")
print(f"🧬 Sequence Length: {len(sequence)}")

Saving sequence.fasta to sequence (1).fasta
✅ Sequence ID: Long_Test_DNA
🧬 Sequence Length: 156


## 🔬 Finding ORFs in 6 Reading Frames
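We scan each frame codon by codon: an ORF opens at the first start codon (ATG) and closes at the next in-frame stop codon (TAA, TAG, or TGA). Positions are 0-based, `Stop` is end-exclusive, and `Length` includes the stop codon. For the reverse strand, positions refer to the reverse-complemented sequence. An ORF with no in-frame stop codon before the end of the sequence is not reported.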
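To make the six reading frames concrete, here is a minimal standard-library sketch that splits a short sequence into complete codons for each frame. The helper names (`reverse_complement`, `codons_in_frame`, `six_frames`) and the 10 bp toy sequence are illustrative assumptions and are not part of the notebook, which uses Biopython's `Seq` objects instead.

```python
def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def codons_in_frame(dna: str, offset: int) -> list:
    """Split dna into complete codons, starting at a 0-based offset of 0, 1, or 2."""
    # Stopping at len(dna) - 2 drops any trailing partial codon.
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to their codon lists."""
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons_in_frame(dna, offset)
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"-{offset + 1}"] = codons_in_frame(rc, offset)
    return frames


toy = "ATGAAATAGC"  # hypothetical toy sequence, not the uploaded FASTA
frames = six_frames(toy)
for label, codons in frames.items():
    print(label, codons)
```

In frame +1 the toy sequence reads `ATG`, `AAA`, `TAG`, which is a start codon, one internal codon, and a stop codon. The other five frames contain no `ATG` followed by an in-frame stop.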


In [10]:
stop_codons = ["TAA", "TAG", "TGA"]
start_codon = "ATG"

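def find_orfs(seq, strand, frame_offset):
    """Return ORFs (ATG through the first in-frame stop codon) in one reading frame.

    Positions are 0-based offsets into `seq`, and "Stop" is end-exclusive.
    For the '-' strand, `seq` is the reverse complement, so positions refer to it.
    """
    orfs = []
    in_orf = False
    orf_start = None
    seq_len = len(seq)
    i = frame_offset
    while i < seq_len - 2:  # stop before a trailing partial codon
        codon = str(seq[i:i+3])  # plain string, so comparisons do not depend on Seq
        if not in_orf and codon == start_codon:
            # The first ATG opens the ORF; later in-frame ATGs count as internal codons
            in_orf = True
            orf_start = i
        elif in_orf and codon in stop_codons:
            # The first in-frame stop closes the ORF; the stop codon is included
            orfs.append({
                "Strand": strand,
                "Frame": frame_offset+1,
                "Start": orf_start,
                "Stop": i+3,
                "Length": (i+3) - orf_start
            })
            in_orf = False
            orf_start = None
        i += 3
    # An ORF still open here has no in-frame stop codon and is not reported
    return orfs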

all_orfs = []

# Forward strand
for frame in range(3):
    all_orfs.extend(find_orfs(sequence, '+', frame))

# Reverse complement
rev_seq = sequence.reverse_complement()
for frame in range(3):
    all_orfs.extend(find_orfs(rev_seq, '-', frame))

print(f"✅ Total ORFs found: {len(all_orfs)}")


✅ Total ORFs found: 9
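Because `find_orfs` receives the reverse-complemented sequence for the `-` strand, any positions it reports there are offsets into that reverse complement. In this run all nine ORFs are on the `+` strand, so the results are unaffected. The sketch below shows one way to map such an interval back to forward-strand coordinates. The helper `to_forward_coords` and the 13 bp toy sequence are assumptions for illustration, not part of the notebook.

```python
def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def to_forward_coords(start: int, stop: int, seq_len: int) -> tuple:
    """Map a 0-based, end-exclusive interval on the reverse complement
    to the matching interval on the forward strand."""
    return seq_len - stop, seq_len - start


forward = "CTATTTCATGGGG"          # hypothetical toy sequence
rc = reverse_complement(forward)   # "CCCCATGAAATAG"

# On the reverse complement, the ORF ATG AAA TAG occupies [4, 13).
rc_start, rc_stop = 4, 13
fwd_start, fwd_stop = to_forward_coords(rc_start, rc_stop, len(forward))

print(fwd_start, fwd_stop)  # 0 9
# Reverse-complementing the forward-strand slice recovers the ORF.
print(reverse_complement(forward[fwd_start:fwd_stop]))  # ATGAAATAG
```

The mapping keeps the interval length unchanged, so the `Length` column needs no adjustment.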


## 📊 ORF Summary Table
The table lists the strand, frame, start, stop, and length of each ORF, sorted from longest to shortest.


In [11]:
import pandas as pd

df_orfs = pd.DataFrame(all_orfs)
df_orfs = df_orfs.sort_values(by="Length", ascending=False)
df_orfs.reset_index(drop=True, inplace=True)

# Show top 10 longest ORFs
df_orfs.head(10).style.bar(subset=["Length"], color='#5fba7d')

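| | Strand | Frame | Start | Stop | Length |
|---|---|---|---|---|---|
| 0 | + | 1 | 0 | 66 | 66 |
| 1 | + | 1 | 99 | 147 | 48 |
| 2 | + | 2 | 40 | 88 | 48 |
| 3 | + | 2 | 88 | 118 | 30 |
| 4 | + | 3 | 110 | 122 | 12 |
| 5 | + | 3 | 17 | 26 | 9 |
| 6 | + | 1 | 147 | 156 | 9 |
| 7 | + | 3 | 26 | 35 | 9 |
| 8 | + | 1 | 66 | 72 | 6 |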
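Since `Length` runs from the start codon through the stop codon, every value in the table is a multiple of three. A small sketch of the arithmetic, using a hypothetical helper `orf_size` that is not part of the notebook, converts an ORF's coordinates into base pairs, codons, and encoded amino acids (codons minus the stop codon):

```python
def orf_size(start: int, stop: int) -> tuple:
    """Return (length in bp, total codons, amino acids) for a 0-based,
    end-exclusive ORF interval that includes its stop codon."""
    length = stop - start
    codons = length // 3
    amino_acids = codons - 1  # the stop codon encodes no residue
    return length, codons, amino_acids


# Longest and shortest ORFs from the summary table
print(orf_size(0, 66))   # (66, 22, 21)
print(orf_size(66, 72))  # (6, 2, 1)
```

The 6 bp entry is an `ATG` immediately followed by a stop codon, which is why very short ORFs are usually filtered out before any downstream interpretation.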


## 🔍 Filter ORFs longer than 100 bp


In [12]:
long_orfs = df_orfs[df_orfs['Length'] > 100]
print(f"ORFs longer than 100bp: {len(long_orfs)}")
long_orfs.head(10)

ORFs longer than 100bp: 0


| | Strand | Frame | Start | Stop | Length |
|---|---|---|---|---|---|
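The 100 bp cutoff returns nothing here because the longest ORF in the 156 bp test sequence is 66 bp. As a minimal sketch, the same strict `>` filter can take the threshold as a parameter. The function name `filter_orfs` is an assumption for illustration, and the records are copied from the summary table rather than recomputed.

```python
# ORF records copied from the summary table (all on the '+' strand)
orfs = [
    {"Strand": "+", "Frame": 1, "Start": 0, "Stop": 66, "Length": 66},
    {"Strand": "+", "Frame": 1, "Start": 99, "Stop": 147, "Length": 48},
    {"Strand": "+", "Frame": 2, "Start": 40, "Stop": 88, "Length": 48},
    {"Strand": "+", "Frame": 2, "Start": 88, "Stop": 118, "Length": 30},
    {"Strand": "+", "Frame": 3, "Start": 110, "Stop": 122, "Length": 12},
    {"Strand": "+", "Frame": 3, "Start": 17, "Stop": 26, "Length": 9},
    {"Strand": "+", "Frame": 1, "Start": 147, "Stop": 156, "Length": 9},
    {"Strand": "+", "Frame": 3, "Start": 26, "Stop": 35, "Length": 9},
    {"Strand": "+", "Frame": 1, "Start": 66, "Stop": 72, "Length": 6},
]


def filter_orfs(orfs: list, min_length: int) -> list:
    """Keep ORFs strictly longer than min_length bp, longest first."""
    kept = [orf for orf in orfs if orf["Length"] > min_length]
    return sorted(kept, key=lambda orf: orf["Length"], reverse=True)


for threshold in (100, 40):
    print(f"ORFs longer than {threshold}bp: {len(filter_orfs(orfs, threshold))}")
```

With these records, a 100 bp threshold keeps 0 ORFs and a 40 bp threshold keeps 3. A suitable cutoff depends on the organism and sequence length, so the value is best treated as a tunable parameter.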


## ✅ Conclusion

This tool:
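- Identifies ORFs (ATG through the first in-frame stop codon) in all 6 reading frames.
- Reports strand, frame, start, stop, and length for each ORF.
- Filters ORFs by length to shortlist potential gene regions. On the 156 bp test sequence the longest ORF is 66 bp, so none pass the 100 bp cutoff.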

