
# Amino Acid Frequency Analysis
**Goal:** Parse one or more protein FASTA files, compute amino acid frequencies, and visualize the distribution.

**What to do:**
1. Upload one or more FASTA files (or paste UniProt sequences) into this notebook.
2. Run the parsing cell to read sequences.
3. Compute counts and normalized frequencies.
4. Plot bar charts.
5. Write 3 to 5 sentences of insights in the *Discussion* section.

> Tip: Keep this notebook clean and readable. Use headings and short comments so reviewers see your process clearly.



## 1. Setup
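Install and import libraries. In Google Colab, uncomment the `!pip install biopython` line in the cell below before running it. In a local Jupyter environment Biopython may already be installed; if it is not, the notebook falls back to a minimal built-in FASTA parser.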


In [None]:

# If you are in Colab, uncomment the next line to install Biopython
# !pip install biopython

import io
import textwrap
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

try:
    from Bio import SeqIO
    BIOPYTHON_AVAILABLE = True
except ImportError:  # Biopython is not installed; use the fallback parser
    BIOPYTHON_AVAILABLE = False

print("Biopython available:", BIOPYTHON_AVAILABLE)



## 2. Input Sequences
Choose **one** of the two options below.

**Option A: Upload a FASTA file**  
- In Colab: use the upload widget or mount Drive, then set `fasta_path` below.
- Locally: place your `.fasta` file next to this notebook and set `fasta_path`.

**Option B: Paste FASTA text**  
- Paste one or more sequences in standard FASTA format in the cell provided.


In [None]:

# === Option A: set a FASTA path if you have a file ===
# Example: fasta_path = "example.fasta"
fasta_path = ""  # leave blank if you will paste sequences in Option B

# === Option B: paste FASTA content directly here ===
fasta_text = """>seq1
MTEITAAMVKELRESTGAGMMDCKNALSETQHEWAYK
>seq2
MADSEQVVVVVVVPGKVFAGIIGLALAGLLALVSS
"""



## 3. Parse Sequences
This cell reads sequences from the FASTA file if `fasta_path` is set, and otherwise from the pasted text block.


In [None]:

def read_sequences(fasta_path: str, fasta_text: str):
    records = []
    if fasta_path:
        if BIOPYTHON_AVAILABLE:
            records = list(SeqIO.parse(fasta_path, "fasta"))
        else:
            # Minimal fallback parser if Biopython is not available
            with open(fasta_path, "r") as f:
                content = f.read()
            records = []
            header = None
            seq_lines = []
            for line in content.splitlines():
                if line.startswith(">"):
                    if header is not None:
                        records.append((header, "".join(seq_lines).replace(" ", "").upper()))
                    header = line[1:].strip()
                    seq_lines = []
                else:
                    seq_lines.append(line.strip())
            if header is not None:
                records.append((header, "".join(seq_lines).replace(" ", "").upper()))
            # Convert to a simple record-like dict list
            records = [{"id": h, "seq": s} for h, s in records]
    else:
        # Parse from pasted text
        lines = fasta_text.strip().splitlines()
        header = None
        seq_lines = []
        for line in lines:
            if line.startswith(">"):
                if header is not None:
                    records.append({"id": header, "seq": "".join(seq_lines).replace(" ", "").upper()})
                header = line[1:].strip()
                seq_lines = []
            else:
                seq_lines.append(line.strip())
        if header is not None:
            records.append({"id": header, "seq": "".join(seq_lines).replace(" ", "").upper()})
    return records

records = read_sequences(fasta_path, fasta_text)

print(f"Parsed {len(records)} sequences.")
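for r in records[:5]:
    # Records are Biopython SeqRecord objects or plain dicts from the fallback parser.
    # SeqRecord has no .get(), so branch on the type instead of calling r.get().
    if isinstance(r, dict):
        rid, seq = r["id"], r["seq"]
    else:
        rid, seq = r.id, str(r.seq)
    print("-", rid, len(seq), "aa")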
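
Before counting, it can help to check whether any parsed sequence contains characters outside the 20 standard amino acids (for example `X`, `B`, `Z`, or `*`), because the frequency table in the next section skips them silently. The following is a minimal sketch rather than part of the original notebook: `find_nonstandard` is a hypothetical helper name, the example records are made-up toy data, and records are assumed to be in the dict form produced by the fallback parser.

```python
from collections import Counter

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def find_nonstandard(seq: str) -> Counter:
    """Count characters in seq that are not one of the 20 standard amino acids."""
    return Counter(ch for ch in seq.upper() if ch not in STANDARD_AA)

# Toy records in the same dict form as the fallback parser output (illustrative only)
example_records = [
    {"id": "clean", "seq": "MTEITAAMVK"},
    {"id": "ambiguous", "seq": "MKXXLBZ*"},
]

# Map each record id to its non-standard characters and their counts
report = {rec["id"]: dict(find_nonstandard(rec["seq"])) for rec in example_records}

for rid, extras in report.items():
    print(rid, extras if extras else "no non-standard characters")
```

An empty result for every record means the reported lengths and frequencies cover the full sequences.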



## 4. Compute Amino Acid Frequencies
This cell builds a table with one row per sequence, holding counts and relative frequencies for the 20 standard amino acids. Characters outside that alphabet are ignored.
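
As a worked example of the arithmetic, the sketch below counts residues in a single toy sequence (not a real protein) and divides each count by the number of standard residues. It mirrors the counting approach used in this notebook but is only an illustration; the sequence `MKKLLA` is an assumption chosen to keep the numbers easy to check by hand.

```python
from collections import Counter

AA20 = "ACDEFGHIKLMNPQRSTVWY"

seq = "MKKLLA"  # toy sequence: M x1, K x2, L x2, A x1

# Count only the 20 standard residues
counts = Counter(aa for aa in seq if aa in AA20)
total = sum(counts.values())  # 6 standard residues

# Relative frequency = count / total; residues that never occur get 0.0
freqs = {aa: counts[aa] / total for aa in AA20}

print("K count:", counts["K"])            # 2
print("K freq:", round(freqs["K"], 3))    # 2 / 6, about 0.333
print("sum of freqs:", round(sum(freqs.values()), 6))
```

The frequencies of a non-empty sequence sum to 1, which is a quick sanity check on any row of the table.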


In [None]:

AA20 = list("ACDEFGHIKLMNPQRSTVWY")

def sequences_to_dataframe(records):
    def get_seq(r):
        # Works for both Biopython SeqRecord and dict fallback
        try:
            return str(r.seq)
        except Exception:
            return str(r.get("seq"))
    table = []
    for r in records:
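        # SeqRecord has no .get(), so branch on the type instead of calling r.get()
        rid = r["id"] if isinstance(r, dict) else r.id
        seq = get_seq(r).upper()  # Biopython does not uppercase sequences for us
        # Only the 20 standard residues are counted; X, B, Z, * and similar are skipped
        counts = Counter(aa for aa in seq if aa in AA20)
        n_standard = sum(counts.values())
        total = n_standard or 1  # avoid division by zero for empty sequences
        # "length" is the number of standard residues counted, which may differ from len(seq)
        row = {"sequence_id": rid, "length": n_standard}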
        for aa in AA20:
            row[f"{aa}_count"] = counts.get(aa, 0)
            row[f"{aa}_freq"] = counts.get(aa, 0) / total
        table.append(row)
    return pd.DataFrame(table)

df = sequences_to_dataframe(records)
df.head()



## 5. Visualize Aggregate Frequencies
This cell sums the counts across all sequences and averages the per-sequence frequencies, then plots both as bar charts.
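
Averaging per-sequence frequencies gives every sequence equal weight regardless of its length, which is not the same as pooling all residues and dividing by the total. The sketch below illustrates the difference on two made-up toy sequences; the names `mean_freq_A` and `pooled_freq_A` are hypothetical and not used elsewhere in this notebook.

```python
from collections import Counter

# Toy data: a short all-alanine sequence and a longer glycine-rich one
seqs = {"short": "AAAA", "long": "A" + "G" * 11}  # 4 and 12 residues

# Mean of per-sequence frequencies: each sequence counts equally
per_seq_freq_A = [s.count("A") / len(s) for s in seqs.values()]
mean_freq_A = sum(per_seq_freq_A) / len(per_seq_freq_A)

# Pooled frequency: each residue counts equally, so long sequences dominate
pooled = Counter("".join(seqs.values()))
pooled_freq_A = pooled["A"] / sum(pooled.values())

print("mean of per-sequence frequencies:", round(mean_freq_A, 4))
print("pooled frequency:", round(pooled_freq_A, 4))
```

Here alanine is about 54% by the per-sequence average but about 31% when pooled. Which number is appropriate depends on whether the question is about a typical sequence or about the residue pool as a whole.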


In [None]:

# Aggregate across sequences
count_cols = [f"{aa}_count" for aa in AA20]
freq_cols = [f"{aa}_freq" for aa in AA20]

agg_counts = df[count_cols].sum(axis=0)
agg_freqs = (df[freq_cols].mean(axis=0))  # average frequency per sequence

# Prepare tidy frames for plotting
counts_df = pd.DataFrame({
    "AminoAcid": [c.split("_")[0] for c in count_cols],
    "Count": agg_counts.values
}).sort_values("AminoAcid")

freqs_df = pd.DataFrame({
    "AminoAcid": [c.split("_")[0] for c in freq_cols],
    "Frequency": agg_freqs.values
}).sort_values("AminoAcid")

display(counts_df.head(), freqs_df.head())

# Plot counts
plt.figure(figsize=(10, 4))
plt.bar(counts_df["AminoAcid"], counts_df["Count"])
plt.title("Amino Acid Counts (Aggregate)")
plt.xlabel("Amino Acid")
plt.ylabel("Count")
plt.show()

# Plot frequencies
plt.figure(figsize=(10, 4))
plt.bar(freqs_df["AminoAcid"], freqs_df["Frequency"])
plt.title("Amino Acid Frequencies (Average per sequence)")
plt.xlabel("Amino Acid")
plt.ylabel("Frequency")
plt.show()



## 6. Discussion
Write 3 to 5 sentences summarizing what you found. For example:
- Which amino acids were most common in your sequences?
- Did the profile match expectations for your protein family?
- What could explain any unusual enrichments or depletions?
- What would you analyze next if you had more time?



## 7. Next Steps (optional)
- Compare two protein families side by side.
- Add a GC content or codon usage analysis if you switch to nucleotide sequences.
- Try a small classification: enzyme vs. non-enzyme using sequence features.
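
As one possible starting point for the side-by-side comparison, the sketch below computes pooled frequencies for two groups of sequences and ranks residues by the absolute difference. It is a minimal illustration under stated assumptions: `pooled_frequencies` is a hypothetical helper, and `family_a` and `family_b` are made-up toy sequences, not real protein families.

```python
from collections import Counter

AA20 = "ACDEFGHIKLMNPQRSTVWY"

def pooled_frequencies(sequences):
    """Pooled relative frequency of each standard amino acid across sequences."""
    counts = Counter(aa for seq in sequences for aa in seq.upper() if aa in AA20)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {aa: counts[aa] / total for aa in AA20}

# Toy groups: one lysine-rich, one rich in acidic residues (illustrative only)
family_a = ["MKKKLA", "KKRA"]
family_b = ["MDDEEA", "EEDA"]

freq_a = pooled_frequencies(family_a)
freq_b = pooled_frequencies(family_b)

# Positive difference: enriched in family A; negative: enriched in family B
diff = {aa: freq_a[aa] - freq_b[aa] for aa in AA20}
ranked = sorted(AA20, key=lambda aa: abs(diff[aa]), reverse=True)

for aa in ranked[:3]:
    print(aa, round(freq_a[aa], 2), round(freq_b[aa], 2), round(diff[aa], 2))
```

With real data, each family would come from its own FASTA file, and the two frequency dictionaries could be plotted as grouped bars.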
