# CHAPTER 1 Literature Review 
## 1.1	GENETIC VARIANTS

Genetic variants arise from changes in the DNA sequence. A gene is a segment of DNA that holds the instructions for making a protein. Gene expression begins with transcription, in which the DNA sequence is copied into a messenger RNA (mRNA) molecule. The mRNA is then translated at the ribosome, where its sequence is read in three-nucleotide units called codons. Each codon specifies a particular amino acid, the building block of a protein. A variant in the DNA sequence is carried over into the mRNA. Depending on where and how the variant occurs, the affected codon may still specify the same amino acid, or it may specify a different one, so that a different amino acid is incorporated into the protein.

Most genetic variants are benign: they alter the DNA sequence without impairing the function of the resulting protein or the regulation of the gene, and they do not cause disease. Benign variants contribute to normal differences between people, such as eye colour or blood type. However, when a variant detrimentally alters a gene's function, preventing the resulting protein from working correctly, it is classified as pathogenic and is expected to cause or significantly contribute to a genetic disorder. In addition, some variants, especially common ones, act as risk factors: they do not directly cause a disease but can increase a person's risk of developing certain complex conditions such as heart disease or diabetes.
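
The transcription and translation steps described in this section can be illustrated with a minimal Python sketch. It assumes the input is the coding strand of the DNA (so the mRNA is the same sequence with U in place of T) and uses the standard genetic code; the function names `transcribe` and `translate` and the example sequence are illustrative assumptions, not part of any pipeline described here.

```python
from itertools import product

# Standard genetic code. Codons are enumerated in UCAG order at each position;
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGGAGGGTAA")  # hypothetical four-codon gene
print(mrna)             # AUGGGAGGGUAA
print(translate(mrna))  # MGG
```

The two glycine codons GGA and GGG both translate to `G`, which is why a change from one to the other leaves the protein unchanged.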

**1.1.1	SNVs**
Single-nucleotide variants (SNVs) constitute the most frequent class of genetic variation. Each SNV reflects a difference in a single nucleotide (or letter). For a given SNV, the DNA letter at that genomic position might be a C in one person but a T in another.

**1.1.1.1	Synonymous** 
A synonymous variant, often called a silent mutation, is a single-nucleotide substitution that does not change the encoded amino acid. This is possible because the genetic code is degenerate: most amino acids are specified by more than one three-base codon (e.g., GGA and GGG both code for the amino acid Glycine). Although traditionally considered phenotypically neutral, a synonymous variant can still have biological consequences. It can replace a codon with one whose matching transfer RNA (tRNA) is less abundant, affecting the rate of protein synthesis (translational efficiency); change mRNA secondary structure, affecting its stability, splicing, or translation initiation; or disrupt Exonic Splicing Enhancers (ESEs), leading to aberrant splicing and potential loss of functional protein.

**1.1.1.2	Missense** 
A missense variant, a type of nonsynonymous substitution, changes a codon to one that specifies a different amino acid. Its impact on protein function is highly variable and depends on the physicochemical difference between the original and substituted amino acids and on the location of the change within the protein structure:
•	Conservative Missense: The substituted amino acid has similar chemical properties to the original (e.g., Valine and Leucine, both of which are hydrophobic). These often have minimal impact.
•	Non-conservative Missense: The substituted amino acid has significantly different properties (e.g., charged to hydrophobic, as seen in Sickle Cell Disease, where Valine replaces Glutamic Acid). These frequently disrupt the protein's folding, stability, or interaction sites, often leading to a loss of function or gain of a toxic function.

**1.1.1.3	Nonsense** 
A nonsense variant (also called a stop-gain mutation) is a substitution that changes an amino acid-coding codon (a "sense" codon) into one of the three stop codons (UAA, UAG, or UGA in mRNA). This leads to the premature termination of translation, resulting in a truncated and typically non-functional polypeptide. The short protein product is often unstable and rapidly degraded, or, if stable, it lacks critical C-terminal functional domains, rendering it inactive. Nonsense variants are generally considered highly deleterious and are a common cause of severe genetic disorders, such as the subset of Cystic Fibrosis cases caused by a premature stop codon in the CFTR gene.
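
The three SNV consequence classes described in Sections 1.1.1.1 to 1.1.1.3 can be told apart by translating the codon before and after the substitution. The sketch below is a minimal illustration under the standard genetic code; the function name `classify_snv` and its arguments are hypothetical, and real annotation tools also take transcript structure and strand into account.

```python
from itertools import product

# Standard genetic code over DNA codons, enumerated in TCAG order;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, pos: int, alt_base: str) -> str:
    """Classify a substitution at 0-based position `pos` within a coding codon."""
    alt_codon = ref_codon[:pos] + alt_base + ref_codon[pos + 1:]
    if alt_codon == ref_codon:
        raise ValueError("alternate base is identical to the reference base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid encoded
    if alt_aa == "*":
        return "nonsense"     # sense codon becomes a stop codon
    if ref_aa == "*":
        return "other"        # a stop codon is lost; not covered in the text
    return "missense"         # a different amino acid is encoded


print(classify_snv("GGA", 2, "G"))  # synonymous (Gly -> Gly)
print(classify_snv("GAG", 1, "T"))  # missense   (Glu -> Val)
print(classify_snv("CAG", 0, "T"))  # nonsense   (Gln -> stop)
```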

**1.1.2 INDEL** 
An INDEL is a type of genetic variant in which a segment of DNA is either inserted into or deleted from a sequence. The term is a blend of "insertion" and "deletion" and is used when it cannot be determined whether nucleotides were gained or lost, or when referring to both event types collectively. The effect of an INDEL within the coding region of a gene is primarily determined by whether the number of nucleotides inserted or deleted is a multiple of three, which gives two main classes.

**1.1.2.1 Frameshift** 
A frameshift variant (or out-of-frame INDEL) results when the number of inserted or deleted nucleotides is not divisible by three (e.g., one, two, or four bases). This changes the way the ribosome reads the mRNA strand: because the reading frame is shifted, every codon downstream of the mutation is misread. The result is typically a completely new and often nonsensical sequence of amino acids that continues until a premature stop codon is encountered, often shortly after the mutation site. The resulting polypeptide is usually severely truncated (shortened) and non-functional. Furthermore, the altered mRNA is often destroyed by a cellular quality control mechanism called Nonsense-Mediated mRNA Decay (NMD). Frameshift variants are therefore typically highly deleterious and a common cause of genetic disorders.
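
The shift in reading frame can be made concrete by splitting a sequence into codons before and after a single-base deletion. This is a minimal sketch: the helper `codons` and the example sequence are hypothetical and chosen only for illustration.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive three-base codons, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


reference = "ATGGCCAAAGGGTTCTAA"           # hypothetical coding sequence
mutated = reference[:5] + reference[6:]    # delete one base (index 5, in codon 2)

print(codons(reference))  # ['ATG', 'GCC', 'AAA', 'GGG', 'TTC', 'TAA']
print(codons(mutated))    # ['ATG', 'GCA', 'AAG', 'GGT', 'TCT']

# Indices of codons that differ between the two reading frames.
changed = [
    i for i, (ref, alt) in enumerate(zip(codons(reference), codons(mutated)))
    if ref != alt
]
print(changed)  # [1, 2, 3, 4]
```

Every codon from the deletion onwards is different, and the original stop codon `TAA` is no longer read in frame.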

**1.1.2.2 In-frame**
An in-frame variant occurs when the number of nucleotides inserted or deleted is a multiple of three (e.g., a deletion of three bases or an insertion of six bases). Because the number of bases added or removed corresponds to whole codons, the original reading frame is preserved. The resulting protein differs from the normal protein by having one or more amino acids added or missing, but the amino acid sequence downstream of the change remains correct. The effect on the protein's function is variable: a small in-frame deletion in a non-critical loop might have little effect, whereas the loss of a large functional domain, or even of a single critical amino acid (as in the most common Cystic Fibrosis variant, F508del, a three-base deletion in CFTR that removes one phenylalanine), can result in a non-functional protein.
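
The divisible-by-three rule that separates frameshift from in-frame INDELs reduces to a single modulo operation on the difference in allele lengths. The sketch below assumes VCF-style reference and alternate allele strings for a coding-region variant; `classify_indel` is a hypothetical helper name.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding INDEL from its reference and alternate alleles."""
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("alleles have equal length; not an insertion or deletion")
    kind = "insertion" if diff > 0 else "deletion"
    # The reading frame is preserved only if the net change is a multiple of three.
    frame = "in-frame" if abs(diff) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(classify_indel("ACTT", "A"))     # in-frame deletion (3 bases removed)
print(classify_indel("A", "AG"))       # frameshift insertion (1 base added)
print(classify_indel("A", "AGGCTTA"))  # in-frame insertion (6 bases added)
```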

**1.1.3 Structural variants** 
Structural variants are large-scale genomic differences that involve at least 50 nucleotides, and as many as thousands of nucleotides or more, that have been inserted, deleted, inverted or moved from one part of the genome to another. Tandem repeats that contain more than 50 nucleotides are considered structural variants; in fact, such large tandem repeats account for nearly half of the structural variants present in human genomes. When a structural variant changes the total number of nucleotides involved, it is called a copy-number variant (CNV). CNVs are thus distinguished from other structural variants, such as inversions and translocations, which often do not change the total number of nucleotides.
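
The size-based distinction between SNVs, INDELs, and structural variants can be sketched as a simple classifier over reference and alternate allele strings. The 50-nucleotide threshold is the one given in this section; the function names, the allele representation, and the example sequences are illustrative assumptions, and dedicated structural-variant callers use far richer evidence.

```python
SV_MIN_LENGTH = 50  # minimum size of a structural variant, as stated in the text


def size_class(ref: str, alt: str) -> str:
    """Assign a size-based class from reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if max(len(ref), len(alt)) >= SV_MIN_LENGTH:
        return "structural variant"
    return "INDEL"


def is_cnv(ref: str, alt: str) -> bool:
    """A structural variant that changes the total number of nucleotides."""
    return size_class(ref, alt) == "structural variant" and len(ref) != len(alt)


segment = "ACGT" * 15  # hypothetical 60-base segment
# An inversion replaces the segment with its reverse complement (same length).
inverted = segment.translate(str.maketrans("ACGT", "TGCA"))[::-1]

print(size_class("C", "T"))                                      # SNV
print(size_class("ACTT", "A"))                                   # INDEL
print(size_class("A" + segment, "A"), is_cnv("A" + segment, "A"))  # structural variant True
print(size_class(segment, inverted), is_cnv(segment, inverted))    # structural variant False
```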


## 1.2 VARIANT DATABASES
**1.2.1 ClinVar** 
ClinVar is a freely accessible, public archive maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) that focuses on the clinical significance of human genetic variations. Its primary role is to aggregate reports submitted by clinical testing laboratories, research groups, and expert panels regarding the relationship between a genetic variant and an observed disease or condition. For a single genetic variant, ClinVar can contain multiple submissions from different laboratories, which may agree (consensus) or disagree (conflicting interpretations) on the clinical impact. The database transparently reports all submitted interpretations, which are typically classified using the five-tier system (Pathogenic, Likely Pathogenic, Uncertain Significance, Likely Benign, Benign) recommended by organisations such as the American College of Medical Genetics and Genomics (ACMG). This centralised resource is crucial for standardising clinical variant interpretation and improving diagnostic consistency across the healthcare community.
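
The idea of consensus versus conflicting submissions can be illustrated with a deliberately simplified sketch. It is not ClinVar's actual aggregation procedure: the grouping of the five tiers into three broad categories, the function name `summarise`, and the output labels are assumptions made for illustration only.

```python
# Five-tier labels mapped to three broad groups (an illustrative simplification).
TIER_GROUP = {
    "Pathogenic": "pathogenic",
    "Likely pathogenic": "pathogenic",
    "Uncertain significance": "uncertain",
    "Likely benign": "benign",
    "Benign": "benign",
}


def summarise(submissions: list[str]) -> str:
    """Summarise the interpretations submitted for a single variant."""
    if not submissions:
        raise ValueError("at least one submission is required")
    groups = {TIER_GROUP[label] for label in submissions}
    if len(groups) > 1:
        return "Conflicting interpretations"
    # All submissions fall in one group: report the distinct tiers in tier order.
    tiers = sorted(set(submissions), key=list(TIER_GROUP).index)
    return "/".join(tiers)


print(summarise(["Benign", "Benign"]))                 # Benign
print(summarise(["Likely pathogenic", "Pathogenic"]))  # Pathogenic/Likely pathogenic
print(summarise(["Benign", "Pathogenic"]))             # Conflicting interpretations
```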

**1.2.2 dbSNP** 
The Database of Short Genetic Variations (dbSNP) is a public archive developed and hosted by the National Center for Biotechnology Information (NCBI). Although the "SNP" in its name stands for single nucleotide polymorphism, the database has expanded to catalogue a wide range of short genetic variations, including single-nucleotide variants (SNVs), small insertions and deletions (indels), and microsatellites. dbSNP serves as a central identifier registry for these variations, assigning a unique Reference SNP (rs) ID number to each distinct variant that is submitted. It is a comprehensive repository of where variation exists in the genome and of what type, collecting data from various sources and research projects. Crucially, dbSNP also provides information on population frequency, molecular consequence, and genomic mapping, making it a foundational resource for linking a genomic location to the existence of a genetic change, regardless of its clinical relevance.

**1.2.3 gnomAD**
The Genome Aggregation Database (gnomAD) is a large, publicly available resource that aggregates and harmonises exome and whole-genome sequencing data from more than a hundred thousand individuals drawn from various large-scale sequencing projects worldwide. The core purpose of gnomAD is to provide highly accurate allele frequency data, that is, the proportion of sampled chromosomes carrying a given variant in different human populations. This information is indispensable for variant interpretation, particularly in rare disease diagnostics. The general rule is that a genetic variant observed at a significant frequency in the general population (as documented in gnomAD) is unlikely to be the cause of a severe rare disease. Therefore, gnomAD data allows clinicians and researchers to quickly filter out millions of common and benign variants, significantly streamlining the process of identifying truly rare, potentially pathogenic mutations. It is widely considered the most comprehensive and widely used population frequency dataset in clinical and research genetics.
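
The frequency-based filtering rule can be sketched in a few lines. Allele frequency is computed as the allele count divided by the allele number (the total number of chromosomes genotyped at that site). The variant names, the counts, and the 0.1% cut-off below are hypothetical values chosen for illustration; appropriate thresholds depend on the disease and its inheritance pattern.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Proportion of sampled chromosomes that carry the variant allele."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return allele_count / allele_number


MAX_AF = 0.001  # hypothetical cut-off for "rare"

# Hypothetical (allele count, allele number) pairs for three variants.
variants = {
    "variant_A": (3, 250_000),
    "variant_B": (40_000, 250_000),
    "variant_C": (0, 250_000),
}

# Keep only variants that are rare enough to remain candidates.
rare = [
    name for name, (ac, an) in variants.items()
    if allele_frequency(ac, an) <= MAX_AF
]
print(rare)  # ['variant_A', 'variant_C']
```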


```python
import platform
import sys

import pandas as pd
from Bio import Seq  # Biopython sequence module

# Report the software environment used for the analyses.
print("Python:", sys.version.splitlines()[0])
print("Platform:", platform.platform())
print("pandas:", pd.__version__)
```
