# One-Hot Encoding in Bioinformatics & Machine Learning

One-hot encoding converts categorical data into binary vectors. Each category becomes a column with a `1` indicating presence and `0` indicating absence. This notebook walks through several practical examples relevant to bioinformatics and machine learning.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

---
## Example 1 — DNA Sequence Encoding

DNA has four nucleotides: **A, C, G, T**. Deep-learning models for genomics (e.g., DeepBind, Basset) expect each position in a sequence to be represented as a 4-element binary vector:

| Nucleotide | Encoding |
|:---:|:---:|
| A | [1, 0, 0, 0] |
| C | [0, 1, 0, 0] |
| G | [0, 0, 1, 0] |
| T | [0, 0, 0, 1] |

A sequence of length *L* becomes a matrix of shape **(L, 4)**.

In [None]:
def one_hot_dna(sequence: str) -> np.ndarray:
    """One-hot encode a DNA sequence into an (L, 4) array.

    Symbols outside A/C/G/T (e.g., N) are left as all-zero rows.
    """
    mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    seq_upper = sequence.upper()
    ohe = np.zeros((len(seq_upper), 4), dtype=np.float32)
    for i, base in enumerate(seq_upper):
        if base in mapping:
            ohe[i, mapping[base]] = 1.0
    return ohe

seq = "ACGTTAGC"
encoded = one_hot_dna(seq)

print(f"Sequence: {seq}")
print(f"Shape:    {encoded.shape}  (positions x nucleotides)\n")
print(pd.DataFrame(encoded, columns=['A', 'C', 'G', 'T'],
                   index=list(seq)))
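
For long sequences, the per-base Python loop can become a bottleneck. The sketch below is one possible vectorized variant built on a NumPy lookup table. The name `one_hot_dna_vectorized` and the lookup-table layout are illustrative choices, not part of the original notebook, and the function assumes ASCII input. As in `one_hot_dna`, symbols other than A/C/G/T produce all-zero rows.

```python
import numpy as np

def one_hot_dna_vectorized(sequence: str) -> np.ndarray:
    """Vectorized one-hot encoding; non-ACGT symbols become all-zero rows."""
    # Rows 0-3 are the one-hot vectors for A, C, G, T; row 4 is the all-zero "unknown" row.
    table = np.vstack([np.eye(4, dtype=np.float32),
                       np.zeros((1, 4), dtype=np.float32)])
    # Point every possible byte value at row 4, then overwrite the four known bases.
    byte_to_row = np.full(256, 4, dtype=np.intp)
    for row, base in enumerate("ACGT"):
        byte_to_row[ord(base)] = row
    # Interpret the (ASCII) sequence as an array of byte codes and look up each row.
    codes = np.frombuffer(sequence.upper().encode("ascii"), dtype=np.uint8)
    return table[byte_to_row[codes]]

encoded_vec = one_hot_dna_vectorized("ACGTNAGC")
print(encoded_vec.shape)        # (8, 4)
print(encoded_vec.sum(axis=1))  # the row for 'N' sums to 0
```

The result has the same layout as the loop-based version, so either function can feed the same downstream model.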

---
## Example 2 — Protein / Amino-Acid Encoding

Proteins are built from 20 standard amino acids. One-hot encoding produces an **(L, 20)** matrix, a common input to protein structure and function predictors.

In [None]:
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_protein(sequence: str) -> np.ndarray:
    """One-hot encode a protein sequence into an (L, 20) array."""
    ohe = np.zeros((len(sequence), 20), dtype=np.float32)
    for i, aa in enumerate(sequence.upper()):
        if aa in AA_INDEX:
            ohe[i, AA_INDEX[aa]] = 1.0
    return ohe

peptide = "MKTLLILAG"
encoded_aa = one_hot_protein(peptide)

print(f"Peptide: {peptide}")
print(f"Shape:   {encoded_aa.shape}  (positions x amino acids)\n")
print(pd.DataFrame(encoded_aa, columns=AMINO_ACIDS, index=list(peptide)))
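
Models usually consume batches, and peptides rarely share one length. A minimal sketch of one common approach, zero-padding to the longest sequence and returning a boolean mask, follows. The function name `one_hot_protein_batch` and the mask convention are assumptions for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_protein_batch(sequences: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Encode peptides into a zero-padded (N, L_max, 20) array plus an (N, L_max) mask."""
    max_len = max(len(s) for s in sequences)
    batch = np.zeros((len(sequences), max_len, 20), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for n, seq in enumerate(sequences):
        mask[n, :len(seq)] = True  # marks real positions, including unknown residues
        for i, aa in enumerate(seq.upper()):
            if aa in AA_INDEX:
                batch[n, i, AA_INDEX[aa]] = 1.0
    return batch, mask

batch, mask = one_hot_protein_batch(["MKTLLILAG", "MKV", "ACDXF"])
print(batch.shape)       # (3, 9, 20)
print(mask.sum(axis=1))  # true lengths: [9 3 5]
```

The mask matters because padding and non-standard residues (such as `X` here) both appear as all-zero rows. Without it, the two cases are indistinguishable.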

---
## Example 3 — Encoding Categorical Clinical / Sample Data with scikit-learn

In clinical genomics and ML pipelines, sample metadata often includes categorical variables (e.g., tissue type, disease status). Scikit-learn's `OneHotEncoder` handles this cleanly.

In [None]:
# Simulated clinical metadata
samples = pd.DataFrame({
    'tissue':  ['liver', 'brain', 'liver', 'kidney', 'brain', 'kidney'],
    'disease': ['healthy', 'tumor', 'tumor', 'healthy', 'healthy', 'tumor'],
    'gene_expression': [5.2, 8.1, 7.3, 4.9, 6.0, 9.5]
})
print("Raw data:")
print(samples, "\n")

# Dense output; `sparse_output` requires scikit-learn >= 1.2 (older versions use `sparse`)
encoder = OneHotEncoder(sparse_output=False)
cat_cols = ['tissue', 'disease']
ohe_array = encoder.fit_transform(samples[cat_cols])
ohe_df = pd.DataFrame(ohe_array, columns=encoder.get_feature_names_out(cat_cols))

result = pd.concat([ohe_df, samples[['gene_expression']].reset_index(drop=True)], axis=1)
print("One-hot encoded:")
print(result)

---
## Example 4 — Encoding Codons

A codon is a triplet of nucleotides that specifies an amino acid or a stop signal. There are **64** possible codons (4³). One-hot encoding codons is useful for codon-usage bias studies and translation-rate prediction models.

In [None]:
from itertools import product

# Build the 64-codon vocabulary
CODONS = [''.join(c) for c in product('ACGT', repeat=3)]
CODON_INDEX = {c: i for i, c in enumerate(CODONS)}

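def one_hot_codons(cds: str) -> tuple[np.ndarray, list[str]]:
    """One-hot encode a coding sequence by codon.

    Returns the (N_codons, 64) array and the list of codons.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be divisible by 3")
    codons = [cds[i:i+3] for i in range(0, len(cds), 3)]
    ohe = np.zeros((len(codons), 64), dtype=np.float32)
    for i, codon in enumerate(codons):
        if codon in CODON_INDEX:  # codons with non-ACGT symbols stay all-zero
            ohe[i, CODON_INDEX[codon]] = 1.0
    return ohe, codons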

cds = "ATGGCTAACTGA"  # Met-Ala-Asn-Stop
ohe_codons, codon_list = one_hot_codons(cds)

print(f"CDS:    {cds}")
print(f"Codons: {codon_list}")
print(f"Shape:  {ohe_codons.shape}  (codons x 64 possible codons)\n")

# Show only the non-zero columns for readability
df = pd.DataFrame(ohe_codons, columns=CODONS, index=codon_list)
non_zero_cols = df.columns[df.any()]
print(df[non_zero_cols])
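
One link to codon-usage analysis: summing a codon one-hot matrix over its rows gives per-codon counts directly. The sketch below illustrates this on a made-up toy CDS. The sequence and the variable names `counts` and `freqs` are hypothetical, and the encoding is repeated so the cell runs on its own.

```python
from itertools import product
import numpy as np

CODONS = [''.join(c) for c in product('ACGT', repeat=3)]
CODON_INDEX = {c: i for i, c in enumerate(CODONS)}

cds = "ATGGCTGCTAACGCTTGA"  # hypothetical toy CDS: Met-Ala-Ala-Asn-Ala-Stop
codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

ohe = np.zeros((len(codons), 64), dtype=np.float32)
for i, codon in enumerate(codons):
    ohe[i, CODON_INDEX[codon]] = 1.0

# Column sums of a one-hot matrix are per-codon counts.
counts = ohe.sum(axis=0).astype(int)
freqs = counts / counts.sum()

# Print only the codons that actually occur.
for codon in CODONS:
    if counts[CODON_INDEX[codon]]:
        print(codon, counts[CODON_INDEX[codon]], round(freqs[CODON_INDEX[codon]], 3))
```

Here `GCT` appears three times out of six codons, so its relative frequency is 0.5.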

---
## Example 5 — Using pandas `get_dummies` for SNP Genotype Encoding

Single-nucleotide polymorphisms (SNPs) are commonly represented as categorical genotypes (e.g., AA, AG, GG). `pd.get_dummies` provides a quick one-hot encoding for this kind of data.

In [None]:
snp_data = pd.DataFrame({
    'sample': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'rs1234': ['AA', 'AG', 'GG', 'AG', 'AA'],
    'rs5678': ['CC', 'CT', 'TT', 'CC', 'CT'],
})
print("Raw genotype data:")
print(snp_data, "\n")

encoded_snps = pd.get_dummies(snp_data, columns=['rs1234', 'rs5678'])
print("One-hot encoded genotypes:")
print(encoded_snps)
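
`pd.get_dummies` only creates columns for genotypes that occur in the data, so two cohorts can end up with different column sets. One possible remedy, sketched below on a made-up two-sample subset, is to declare the full genotype vocabulary with `pd.Categorical` before encoding. The frame `snp_small` and the choice of `dtype=int` are illustrative assumptions.

```python
import pandas as pd

snp_small = pd.DataFrame({
    'sample': ['S1', 'S2'],
    'rs1234': ['AA', 'AG'],  # 'GG' does not occur in this hypothetical subset
})

# Declare the full genotype vocabulary so absent genotypes still get a column.
snp_small['rs1234'] = pd.Categorical(snp_small['rs1234'],
                                     categories=['AA', 'AG', 'GG'])

encoded_small = pd.get_dummies(snp_small, columns=['rs1234'], dtype=int)
print(encoded_small)
```

The output keeps an all-zero `rs1234_GG` column, so the feature layout stays stable across datasets.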

---
## Key Takeaways

| Context | Categories | Resulting Shape per Sample |
|---|---|---|
| DNA nucleotides | A, C, G, T | (L, 4) |
| Amino acids | 20 standard | (L, 20) |
| Codons | 64 triplets | (N_codons, 64) |
| Clinical metadata | varies | one column per category level |
| SNP genotypes | e.g., AA/AG/GG | one column per genotype per locus |

**Why one-hot?**
- Avoids imposing a false ordinal relationship between categories (e.g., A < C < G < T has no biological meaning).
- Provides the numeric input that neural networks and many classical ML algorithms require, without adding assumptions about how categories relate.
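- Simple and invertible. It is lossless as long as every symbol is in the vocabulary; in the functions above, out-of-vocabulary symbols such as `N` become all-zero rows and cannot be recovered.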
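
To make the invertibility point concrete, here is a minimal decoding sketch for the DNA case. It assumes the A, C, G, T column order used in Example 1. The function name `decode_dna` and the choice to render all-zero rows as `N` are illustrative assumptions.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))

def decode_dna(ohe: np.ndarray, unknown: str = "N") -> str:
    """Recover a DNA string from an (L, 4) one-hot array."""
    idx = ohe.argmax(axis=1)  # column holding the 1 in each row
    bases = ALPHABET[idx]
    # All-zero rows carry no base information, so mark them as unknown.
    bases = np.where(ohe.sum(axis=1) > 0, bases, unknown)
    return "".join(bases)

ohe = np.eye(4, dtype=np.float32)[[0, 1, 2, 3, 3, 0]]       # encodes "ACGTTA"
ohe = np.vstack([ohe, np.zeros((1, 4), dtype=np.float32)])  # append one all-zero row
print(decode_dna(ohe))  # ACGTTAN
```

The trailing `N` shows the limit of invertibility: an all-zero row records that a position existed, but not which symbol it held.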

**Watch out for:**
- High dimensionality when the number of categories is large (e.g., k-mers for large k).
- Sparsity: one-hot matrices are mostly zeros, so use `scipy.sparse` or keep scikit-learn's `sparse_output=True` (the default) to save memory.
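
The dimensionality concern is easy to quantify for k-mers: a one-hot vocabulary over DNA k-mers needs one column per possible k-mer, i.e. 4^k columns. The short sketch below prints that growth. The helper name `kmer_vocab_size` is hypothetical.

```python
def kmer_vocab_size(k: int, alphabet_size: int = 4) -> int:
    """Number of one-hot columns needed for k-mers over the given alphabet."""
    return alphabet_size ** k

for k in (1, 3, 6, 10):
    print(f"k={k:>2}: {kmer_vocab_size(k):>9,} columns")
```

At k = 3 this is the 64-codon vocabulary from Example 4, and at k = 10 it already exceeds one million columns per position.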
