# Variant Annotation & ClinVar Classification Notes

**Author:** Aina Rif’ah
**Week 2 – Bioinformatics Skills**
**Topic:** Understanding Variant Annotation and ClinVar Labeling

---

## 1. Allele Frequency

**Definition**
An *allele frequency* is the proportion of all alleles in a population represented by a particular allele.
For autosomal loci, each person carries two alleles—one maternal, one paternal—so the total number of alleles equals twice the number of individuals.

**Example Calculation**

$$
\text{Allele frequency of A} = \frac{2(\#AA) + (\#AB) + (\#AC)}{2n}
$$

where *n* is the number of individuals sampled, and #AA, #AB, and #AC are the numbers of individuals carrying each genotype.

* Homozygotes contribute two copies of the same allele.
* Heterozygotes contribute one copy of each.
* The sum of all allele frequencies at a locus equals 1.0.

> *Reference:* Stephenson F.H. (2016). *Forensics and Paternity.* Elsevier eBooks: 439–463.
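The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the reference: the function name `allele_frequencies` and the genotype counts are hypothetical, chosen only to show that homozygotes add two copies, heterozygotes add one copy of each allele, and the frequencies sum to 1.0.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Return {allele: frequency} from {(allele1, allele2): number_of_individuals}."""
    allele_counts = Counter()
    n_individuals = 0
    for (first, second), count in genotype_counts.items():
        # Each individual contributes two alleles. For a homozygote both
        # increments hit the same key, so that allele gains two copies.
        allele_counts[first] += count
        allele_counts[second] += count
        n_individuals += count
    if n_individuals == 0:
        raise ValueError("no individuals were supplied")
    total_alleles = 2 * n_individuals  # autosomal locus: two alleles per person
    return {allele: c / total_alleles for allele, c in allele_counts.items()}


# Hypothetical counts for a three-allele locus (illustrative numbers only).
counts = {
    ("A", "A"): 10,
    ("A", "B"): 20,
    ("A", "C"): 10,
    ("B", "B"): 5,
    ("B", "C"): 3,
    ("C", "C"): 2,
}
freqs = allele_frequencies(counts)
print(freqs)
```

With these 50 individuals, allele A is counted 2(10) + 20 + 10 = 50 times out of 100 alleles, which matches the formula.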

**Relation to Genotype Frequency**
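Genotype frequencies at a locus always sum to 100%.
For example, at a two-allele locus in Hardy-Weinberg equilibrium, if heterozygotes (Aa) make up 50% of the population, then AA and aa each make up 25%.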

> *Reference:* Butler J.M. (2014). *STR Population Data Analysis.* Elsevier eBooks: 239–279.
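The 50% / 25% / 25% split corresponds to Hardy-Weinberg proportions when both alleles have frequency 0.5. The sketch below assumes a biallelic locus under random mating; the function name is hypothetical and the model is an assumption of this example, not something the genotype data guarantee.

```python
def hardy_weinberg_genotype_frequencies(p):
    """Expected genotype frequencies for a biallelic locus with allele A at frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    q = 1.0 - p  # frequency of the other allele, a
    # Random mating gives p^2 (AA), 2pq (Aa), and q^2 (aa).
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


expected = hardy_weinberg_genotype_frequencies(0.5)
print(expected)
```

Observed genotype frequencies that differ from these expectations do not have to follow the 25/50/25 pattern; they only have to sum to 1.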

---

## 2. How ClinVar Determines Clinical Significance

ClinVar **does not assign** pathogenicity; it **aggregates** submissions from laboratories, researchers, and expert panels.
Submitters usually follow the **ACMG/AMP guidelines**, which classify variants into five tiers based on multiple lines of evidence.

| Tier                             | Meaning                                                    | Typical Certainty / Actionability          |
| -------------------------------- | ---------------------------------------------------------- | ------------------------------------------ |
| **Pathogenic**                   | Strong evidence that the variant causes disease.           | High certainty; supports diagnosis/action. |
| **Likely Pathogenic**            | Very likely disease-causing (> 90% certainty).             | Often treated as pathogenic.               |
| **Uncertain Significance (VUS)** | Evidence insufficient or conflicting.                      | Low certainty; not clinically actionable.  |
| **Likely Benign**                | Very likely harmless (> 90% certainty).                    | Unlikely to cause disease.                 |
| **Benign**                       | Strong evidence of non-pathogenicity/common polymorphism.  | High certainty; not disease-causing.       |

---

## 3. Understanding ClinVar Review Status (“Gold Stars”)

The **Review Status** indicates the **quality and consensus** of submitted interpretations and appears as ⭐ stars in ClinVar.

| ⭐ Count | Review Status                                        | Meaning                                                 |
| ------- | ---------------------------------------------------- | ------------------------------------------------------- |
| **4 ⭐** | Practice Guideline                                   | Classification from an official practice guideline.     |
| **3 ⭐** | Reviewed by Expert Panel                             | Verified by expert panel review.                        |
| **2 ⭐** | Criteria Provided, Multiple Submitters, No Conflicts | Independent submitters agree; criteria documented.      |
| **1 ⭐** | Criteria Provided (Single or Conflicting)            | Either single submitter or conflicting interpretations. |
| **0 ⭐** | No Criteria / No Classification                      | Submitted without assertion criteria or not classified. |

**Interpretation Tips**

* Stars reflect **confidence**, not pathogenicity itself.
* A Pathogenic variant with 4 ⭐ is more trustworthy than one with 1 ⭐.
* Always record the Review Status when creating a labeled dataset.
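To record the star rating alongside each variant, the review status text can be mapped to a number. The sketch below is an assumption-laden illustration: the exact status strings vary between ClinVar releases (older files say "conflicting interpretations", for example), so the dictionary keys should be checked against the file actually in use. The names `REVIEW_STATUS_STARS` and `review_stars` are hypothetical.

```python
# Assumed review status strings; verify against the ClinVar release you download.
REVIEW_STATUS_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, conflicting interpretations": 1,  # older wording
    "no assertion criteria provided": 0,
    "no classification provided": 0,
}


def review_stars(review_status):
    """Return the star count for a review status string, or None if it is not recognised."""
    key = review_status.strip().lower()
    return REVIEW_STATUS_STARS.get(key)


print(review_stars("Reviewed by expert panel"))
print(review_stars("criteria provided, single submitter"))
```

Returning `None` for unrecognised strings, rather than 0, keeps unexpected values visible so they can be inspected instead of being silently treated as low confidence.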

---

## 4. Using These Fields in Variant Annotation and Labeling

When building a machine-learning dataset:

1. Use `ClinicalSignificance` to create binary labels (1 = pathogenic / likely pathogenic; 0 = benign / likely benign).
2. Exclude records with uncertain or conflicting interpretations.
3. Optionally filter or weight examples by Review Status (star rating).
4. Keep allele frequency (from gnomAD or dbSNP) as a key feature, since common alleles are usually benign.
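Steps 1 and 2 can be sketched as a small labeling function. This is one possible approach under stated assumptions, not a fixed recipe: the sets of accepted strings, the function name `binary_label`, and the example records (including their IDs) are hypothetical, and real ClinVar values should be inspected before deciding which ones to accept.

```python
# Assumed ClinicalSignificance values that map cleanly to a binary label.
PATHOGENIC = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN = {"benign", "likely benign", "benign/likely benign"}


def binary_label(clinical_significance):
    """Return 1 (pathogenic side), 0 (benign side), or None (exclude the record)."""
    key = clinical_significance.strip().lower()
    if key in PATHOGENIC:
        return 1
    if key in BENIGN:
        return 0
    return None  # uncertain, conflicting, or any other value


# Hypothetical records for illustration only.
records = [
    {"id": "var1", "ClinicalSignificance": "Pathogenic"},
    {"id": "var2", "ClinicalSignificance": "Likely benign"},
    {"id": "var3", "ClinicalSignificance": "Uncertain significance"},
    {"id": "var4", "ClinicalSignificance": "Pathogenic/Likely pathogenic"},
]

labeled = []
for record in records:
    label = binary_label(record["ClinicalSignificance"])
    if label is not None:  # step 2: drop anything that is not clearly labeled
        labeled.append({**record, "label": label})

print(labeled)
```

Matching against an explicit allow-list means any unexpected value is excluded by default, which is the safer failure mode for a training set.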

---

## 5. Summary

* **Allele frequency** quantifies how common a variant is in a population.
* **ClinVar** aggregates variant interpretations; its *Clinical Significance* and *Review Status* fields are crucial for labeling datasets.
* **Higher Review Status** = higher confidence in the submitted classification, whether pathogenic or benign.
* Understanding these concepts is essential before automating variant labeling and model training.

---

*(Compiled as part of Week 2 internship documentation.)*

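```python
# Report the interpreter, platform, and pandas version used for these notes.
import platform
import sys

import pandas as pd
from Bio import Seq  # Biopython (imported but not used in this cell)

print("Python:", sys.version.splitlines()[0])
print("Platform:", platform.platform())
print("pandas:", pd.__version__)
```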