In [23]:
def load_fasta(filepath):
    """
    Load a FASTA file and return the DNA sequence as a single continuous string.
    Header lines starting with '>' are skipped.
    """
    sequence = []
    with open(filepath, "r") as f:
        for line in f:
            line = line.strip()
            if not line.startswith(">"):
                sequence.append(line)
    return "".join(sequence)

In [24]:
class BruteForce:
    """
    Simple brute-force substring search for DNA sequences.
    Checks every starting position in the genome and returns all match positions.
    """

    def __init__(self, genome, pattern):
        self.genome = genome
        self.pattern = pattern  # the codon to search for

    def search(self):
        """Return a list of 0-based start indices where the pattern occurs."""
        M = len(self.pattern)
        N = len(self.genome)
        matches = []

        # Check every starting position
        for i in range(N - M + 1):
            j = 0

            # Compare the characters one by one
            while j < M and self.genome[i + j] == self.pattern[j]:
                j += 1

            # If we matched the full pattern, record the index
            if j == M:
                matches.append(i)
        return matches

    def contains(self):
        """Return True if the pattern appears at least once in the genome."""
        return len(self.search()) > 0


In [25]:
### Sanity Check Using a simple example

# Example DNA sequence (10 chars)
genome = "ACGTACGTAA"   # N = 10

# Example pattern (3 chars)
pattern = "CGT"   # M = 3

# Create an instance of BruteForce
bf = BruteForce(genome, pattern)

# Run search
matches = bf.search()

print("Genome:  ", genome)
print("Pattern: ", pattern)
print("Matches: ", matches)

Genome:   ACGTACGTAA
Pattern:  CGT
Matches:  [1, 5]


# https://www.ncbi.nlm.nih.gov/nuccore/NG_000007.3?report=fasta

The genomic region we selected from the NCBI RefSeqGene record NG_000007.3 corresponds to the human β-globin (HBB) gene cluster. This interval includes the entire HBB gene, with all of its exons and introns, and the surrounding 5′ and 3′ flanking regulatory DNA. Most importantly, it contains codon 6 of the HBB gene, where the classic sickle-cell mutation occurs (the DNA change GAG → GTG, causing Glu → Val in the protein). Because this FASTA segment captures the complete gene plus adjacent regulatory sequences, it contains the site we need to search in order to detect the sickle-cell mutation.

This exact genomic region is saved in our file `HBB_region.fasta` for use in our analysis pipeline.
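Because the mutation is defined at a specific codon, a bare three-base search cannot tell a GTG at codon 6 from a GTG anywhere else. The sketch below illustrates this on two hypothetical toy fragments (they are not read from `HBB_region.fasta`). The helper names `find_all` and `codon_at`, the `cds_start` offset, and the convention of counting the start codon as codon 0 are all assumptions made for this illustration.

```python
def find_all(text, pattern):
    """Return every 0-based index where pattern occurs (overlaps allowed)."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def codon_at(coding_seq, codon_number, cds_start=0):
    """Return the codon with the given number, counting the start codon as 0.

    cds_start is a hypothetical offset of the start codon within coding_seq.
    """
    start = cds_start + 3 * codon_number
    return coding_seq[start:start + 3]


# Hypothetical toy coding fragments, differing only at codon 6 (GAG vs GTG).
normal_fragment = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
variant_fragment = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"

# A plain search finds GTG in both fragments, so it cannot separate them.
print("GTG in normal: ", find_all(normal_fragment, "GTG"))
print("GTG in variant:", find_all(variant_fragment, "GTG"))

# Reading the codon in frame distinguishes the two fragments.
print("Codon 6 normal: ", codon_at(normal_fragment, 6))
print("Codon 6 variant:", codon_at(variant_fragment, 6))
```

Applying the same idea to the real record would require the actual position of the start codon within the genomic sequence, which this sketch does not supply.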

In [None]:
if __name__ == "__main__":
    # TODO: we may need to derive the cDNA sequence from `HBB_region.fasta` first.
    genome = load_fasta("HBB_region.fasta")
    
    pattern = "GTG"   # sickle-cell mutation codon
  

    bf = BruteForce(genome, pattern)
    result = bf.search()

    print("Found matches at positions:", result)

Found matches at positions: []


# Time and Space Complexity

## Time Complexity: O(N * M)
The brute-force string search checks every starting position in the text to see whether the pattern matches, comparing characters one by one.
If the text length is N and the pattern length is M, the algorithm aligns the pattern with the first character, compares up to M characters, then shifts by one position and repeats.
It therefore tries (N − M + 1) starting positions.

Because each attempt compares at most M characters, the worst case costs (N − M + 1) × M comparisons, which is O(N × M).

This method is simple and easy to implement, but because it “starts over” at the next position after each mismatch, the number of comparisons can grow very large when the text or pattern is long.
For massive sequences such as DNA (millions of bases), brute-force matching can become slow, especially with long or repetitive patterns.
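To make the worst case concrete, here is a minimal sketch that counts character comparisons. The function name `count_comparisons` is hypothetical and not part of the `BruteForce` class; the loop structure mirrors the same brute-force scan. The input is chosen so that every starting position matches all but the last pattern character.

```python
def count_comparisons(text, pattern):
    """Brute-force search that also counts character comparisons."""
    n, m = len(text), len(pattern)
    comparisons = 0
    matches = []
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1              # one character comparison
            if text[i + j] != pattern[j]:
                break                     # mismatch: move to the next start
            j += 1
        if j == m:
            matches.append(i)
    return matches, comparisons


# Worst-case style input: each attempt fails only on the last character.
text = "A" * 10        # N = 10
pattern = "AAAB"       # M = 4
matches, comparisons = count_comparisons(text, pattern)

print("Matches:    ", matches)
print("Comparisons:", comparisons)
print("(N - M + 1) * M =", (len(text) - len(pattern) + 1) * len(pattern))
```

With N = 10 and M = 4 there are 7 starting positions and each one costs 4 comparisons, so the count equals (N − M + 1) × M = 28.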

## Space Complexity: O(1)
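Brute-force matching does not copy the text or the pattern, and the comparison itself needs only two indices (i, j).

Since this working memory does not grow with the input sizes (N, M) and stays constant,
the auxiliary space complexity is O(1).
Note that the `matches` list in our `search` method is output storage: it grows with the number of matches found, so returning all k match positions takes O(k) additional space.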
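One way to keep the working memory constant even when there are many matches is to yield positions one at a time instead of collecting them in a list. The sketch below is an alternative formulation, not the notebook's `BruteForce` class; the function names `iter_matches` and `contains` are assumptions for illustration.

```python
def iter_matches(genome, pattern):
    """Yield each 0-based match position lazily, using O(1) extra space."""
    n, m = len(genome), len(pattern)
    for i in range(n - m + 1):
        j = 0
        # Compare the characters one by one
        while j < m and genome[i + j] == pattern[j]:
            j += 1
        if j == m:
            yield i


def contains(genome, pattern):
    """Return True at the first match, without scanning the rest."""
    return next(iter_matches(genome, pattern), None) is not None


genome = "ACGTACGTAA"
pattern = "CGT"

for position in iter_matches(genome, pattern):
    print("Match at", position)

print("Contains pattern:", contains(genome, pattern))
```

A caller that needs every position can still write `list(iter_matches(genome, pattern))`, which brings back the O(k) output storage.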