# Genome Analysis with `GenomeVisualizer`

This notebook demonstrates how to use the `GenomeVisualizer` toolbox to analyze genomic DNA sequences.  
We explore functions related to:
* basic k-mer analysis
* reverse complements
* replication origin detection

---

### Import

In [None]:
import GenomeVisualizer

### Load genome sequence from a file

In [None]:
genome = GenomeVisualizer.load_genome_from_txt("ecoli.txt")
print("Genome length:", len(genome))
print("First 100 bases:", genome[:100])

### Reverse complement - primer design use case

Understanding the reverse complement of a DNA sequence is essential in molecular biology, especially in primer design.  
When designing primers for PCR amplification or sequencing, the primer must bind to the complementary strand of the DNA in the correct orientation.

The reverse complement is the sequence of the opposite strand read 5′→3′, which is exactly what a primer carrying the original sequence would hybridize to. It is obtained by reversing the sequence and swapping each base for its Watson–Crick pair:

- A ↔ T  
- C ↔ G

A primer written as the reverse complement of a target region therefore anneals to that region in the antiparallel orientation required for extension.
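The reverse-and-swap rule described above can be sketched in plain Python. This is a minimal illustration, not the `GenomeVisualizer` implementation: the helper name `reverse_complement` is hypothetical, and the sketch assumes an uppercase sequence containing only A, C, G and T.

```python
# Hypothetical stand-alone helper; assumes uppercase A/C/G/T input only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AGCTTAGGCTA"))  # TAGCCTAAGCT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse-complement routine.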

In [None]:
sequence = "AGCTTAGGCTA"
rc = GenomeVisualizer.ReverseComplement(sequence)
print("Sequence:", sequence)
print("Reverse Complement (reverse primer):", rc)

### Basic k-mer analysis

In genomic analysis, a **k-mer** is a substring of length **k** extracted from a DNA sequence.  
Studying the frequency of k-mers in a genome is a foundational method to uncover biological signals such as:

- promoter regions,
- repetitive elements,
- binding sites,
- and horizontal gene transfer signals.

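By counting the occurrences of each k-mer, we can identify which patterns are overrepresented. Overrepresented k-mers are candidates for biologically relevant sequences, although in a short window a high count can also arise by chance.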
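The counting step can be sketched with the standard library alone. The function names `kmer_counts` and `most_frequent` below are hypothetical and chosen for illustration; they are not claimed to match how `FrequencyMap` or `FrequentWords` are implemented in the toolbox.

```python
from collections import Counter


def kmer_counts(text, k):
    """Count every overlapping substring of length k in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def most_frequent(text, k):
    """Return, in alphabetical order, all k-mers that share the highest count."""
    counts = kmer_counts(text, k)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == top)


example = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
print(most_frequent(example, 4))  # ['CATG', 'GCAT']
```

Windows overlap, so a sequence of length n yields n - k + 1 k-mers; ties for the highest count are all reported rather than picking one arbitrarily.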

#### 🔬 Most frequent 5-mers in the first 1000 bases:

In [None]:
subsequence = genome[:1000]
k = 5

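# FrequencyMap is used here as a mapping from each k-mer to its count
freq_map = GenomeVisualizer.FrequencyMap(subsequence, k)
# FrequentWords is expected to return the k-mer(s) with the highest count
top_kmers = GenomeVisualizer.FrequentWords(subsequence, k)

# Sort (k-mer, count) pairs by descending count
sorted_freq_map = sorted(freq_map.items(), key=lambda item: item[1], reverse=True)

print(f"Most frequent {k}-mer(s):", top_kmers)
print(f"Top 15 {k}-mers by count:")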
for kmer, count in sorted_freq_map[:15]:
    print(f"{kmer}: {count}")

### Visualize base composition using a symbol array

In genomics, tracking how often a given nucleotide occurs in a window that slides along the genome reveals regions enriched in that base.  
For instance, **C-rich or G-rich domains** may signal structural features, regulatory zones, or genome organization patterns.
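A symbol array of this kind can be sketched as follows. The sketch rests on assumptions that the notebook does not state: the window is half the genome length, the genome is treated as circular, and the result maps each start position to the symbol count in its window. The name `symbol_array` is hypothetical, and the toolbox's `FasterSymbolArray` may be defined differently.

```python
def symbol_array(genome, symbol):
    """Map each start position to the count of symbol in a half-genome window.

    Assumptions: window length is len(genome) // 2 and the genome is circular.
    """
    n = len(genome)
    window = n // 2
    # Append the first half so windows can wrap around the end of the genome.
    extended = genome + genome[:window]
    counts = {0: extended[:window].count(symbol)}
    for i in range(1, n):
        counts[i] = counts[i - 1]
        # The base leaving the window on the left.
        if extended[i - 1] == symbol:
            counts[i] -= 1
        # The base entering the window on the right.
        if extended[i + window - 1] == symbol:
            counts[i] += 1
    return counts


print(symbol_array("AAAAGGGG", "A"))
# {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}
```

Updating each count from the previous one touches only the two bases that leave and enter the window, so the whole array is built in time linear in the genome length instead of recounting every window from scratch.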

#### 🔬 Frequency of 'C' nucleotides in the E. coli genome:

In [None]:
# Toy example: symbol array for 'A' on a short sequence (first 10 entries shown)
symbol_array = GenomeVisualizer.FasterSymbolArray("AAAAGGGG", "A")
for i, val in list(symbol_array.items())[:10]:
    print(f"Position {i}: {val}")

In [None]:
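# Count of 'C' in the first half of the genome
array = {}
array[0] = GenomeVisualizer.PatternCount(genome[0:len(genome)//2], "C")
print(array[0])
# True if at least one 'C' occurs in the first 100 bases
print("C" in genome[0:100])
# Preview only: printing the full genome would flood the cell output
print(genome[:100])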