<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Genetic Terminology</h1>

Citation: Elston RC, Satagopan JM, Sun S. Genetic terminology. Methods Mol Biol. 2012;850:1-9. doi:10.1007/978-1-61779-555-8_1

"Common terms used in genetics with multiple meanings are explained and the terminology used in subsequent chapters is defined. Statistical Human Genetics has existed as a discipline for over a century, and during that time the meanings of many of the terms used have evolved, largely driven by molecular discoveries, to the point that molecular and statistical geneticists often have difficulty understanding each other. It is therefore imperative, now that so much of molecular genetics is becoming an in silico and statistical science, that the authors have well-defined, common terminology."

**<span style="color:#DC143C;">Gene Concept</span>**

"The concept of a gene (the word itself was introduced by Bateson) is due to Mendel, who used the German word “Factor”. Mendel used the word in the same way that we might call “hot” and “cold” factors, not in the way that we call “temperature” a factor. In other words, his Factor was the level of what statisticians now call a factor. In the original terminology, still used by some population geneticists, genes occur in pairs on homologous chromosomes. In this terminology the four blood groups A, B, O and AB (defined in terms of agglutination reactions) are determined by three (allelic) genes: A, B and O."

"Nowadays molecular geneticists do not call these three factors genes, but rather “alleles”, defined as “alternative forms” of a gene that can occur at the same locus, or place, in the genome. Whereas Drosophila geneticists used to talk of two loci for a gene, and human geneticists used to talk of two genes at a locus, modern geneticists talk of “two alleles of a gene” or “two alleles at a locus”; this last, which is nowadays so common, is the terminology that will thus be used in this book. It then follows (rather awkwardly) that two alleles at the same locus are allelic to each other, whereas two alleles that are at different loci are non-allelic to each other. A gene is commonly defined as a DNA sequence that has a function, meaning a class of similar DNA sequences all involved in the same particular molecular function, such as the formation of the ABO red cell antigens. (Note the common illogical use of the phrase “cloning genes” by molecular geneticists when, by their own terminology, “cloning alleles” is meant). Some restrict the word gene to protein-coding genes, but there are many more sequences of DNA that have function by virtue of being transcribed to RNA without ever being translated to DNA and protein-coding, so this restricted definition of a gene would appear to be unwarranted.”.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450815/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
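nRowsRead = 1000  # set to None to read the whole file
# The file is semicolon-delimited and encoded as ISO-8859-2
df = pd.read_csv('../input/cusersmarildownloadsgenescsv/genes.csv', delimiter=';', encoding="ISO-8859-2", nrows=nRowsRead)
df.attrs['name'] = 'genes.csv'  # keep the file name as metadata instead of setting an ad hoc attribute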
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
df.head()

# **<span style="color:#DC143C;">Gene, allele, locus, site</span>**

"A locus is the location on the genome of a gene, such as the “ABO gene”. By any definition a gene must involve more than one nucleotide base pair. Single nucleotide polymorphisms (SNPs) thus do not occur at loci, but rather in and around loci, and in this book we shall not write SNP markers as being “at” loci. Because of the confusion that occurs when SNPs are described as occurring at loci, some use the term “gene-locus”, but we shall always use the term locus for the location of a functional gene. We shall, however, allow SNP markers to have alleles and use the original term for their locations: “sites” within loci or, more generally, sites within the region of a locus or anywhere in the genome."

"If in the population only one allele occurs at a site or locus, we shall say that it is monomorphic, or monoallelic, in that population. If two alleles occur, as is common for SNPs, we shall use the original term diallelic which, apart from having precedence, is etymologically sounder than the now commonly used term biallelic. If many alleles occur, we shall describe the polymorphism as polyallelic or multiallelic (the former term is arguably more logical, the latter more common). When there are just two alleles at a locus, the one with the smaller population frequency is called the minor allele. In genetics, the term allele" “frequency”--which is strictly speaking a count--is used to mean relative frequency, i.e. the proportion of all such alleles at that locus among the members of a population; thus the term minor allele frequency is often used for diallelic markers."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450815/
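
The relative-frequency convention quoted above can be made concrete with a minimal sketch. The genotype list is invented for illustration, and `allele_frequencies` and `describe_site` are hypothetical helpers, not part of the cited chapter or of the dataset used in this notebook.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: relative frequency} from a list of two-allele genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def describe_site(freqs):
    """Label a site by the number of alleles observed in the population."""
    if len(freqs) == 1:
        return "monomorphic"
    if len(freqs) == 2:
        return "diallelic"
    return "polyallelic"


# Hypothetical SNP genotypes for 10 people (20 alleles in total).
genotypes = [("A", "A")] * 6 + [("A", "G")] * 3 + [("G", "G")] * 1

freqs = allele_frequencies(genotypes)
maf = min(freqs.values())  # minor allele frequency, meaningful for a diallelic site

print(describe_site(freqs), freqs, maf)
```

Here the raw counts are 15 copies of `A` and 5 of `G`; dividing by the 20 alleles sampled gives the relative frequencies that geneticists usually just call "frequencies".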

In [None]:
df.isnull().sum()

**<span style="color:#DC143C;">Polymorphism, Mutation</span>**

"The A, B, O and AB blood types comprise a polymorphism, in the sense that they are alternative phenotypes that commonly occur in the population. A polymorphic locus was originally defined operationally as a polymorphism-determining locus at which the least common allele occurs with a “frequency” of at least 1%; but a more appropriate definition would be a locus at which the most common allele occurs with a “frequency” of at most 99%. Different alleles arise at a locus as a result of mutation, or sudden change in the genetic material."

"Mutation is a relatively rare event, caused for example by an error in replication. Thus all alleles are by origin mutant alleles, and a genetic polymorphism was conceived of as a locus at which the frequency of the least common allele has a frequency too large to be maintained in the population solely by recurrent mutation."

"Many authors now use the term mutation for any rare allele, and the term polymorphism for any common allele."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450815/
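
The two operational definitions of a polymorphic locus quoted above agree for diallelic loci but can disagree once there are three or more alleles. The sketch below illustrates this with invented allele frequencies; the function names are hypothetical and the 1% and 99% thresholds are taken from the quoted text.

```python
def polymorphic_by_least_common(freqs, threshold=0.01):
    """Original definition: the least common allele has frequency >= 1%."""
    return len(freqs) > 1 and min(freqs) >= threshold


def polymorphic_by_most_common(freqs, threshold=0.99):
    """Alternative definition: the most common allele has frequency <= 99%."""
    return len(freqs) > 1 and max(freqs) <= threshold


# Hypothetical relative allele frequencies at two loci.
diallelic = [0.95, 0.05]
triallelic = [0.985, 0.008, 0.007]

for name, freqs in [("diallelic", diallelic), ("triallelic", triallelic)]:
    print(name,
          polymorphic_by_least_common(freqs),
          polymorphic_by_most_common(freqs))
```

For the triallelic example the rarest allele falls below 1%, so the original definition rejects the locus, even though 1.5% of alleles differ from the most common one and the alternative definition accepts it.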

# **<span style="color:#DC143C;">Amyotrophic Lateral Sclerosis Genes</span>**

In [None]:
df["Gene"].value_counts()

# **<span style="color:#DC143C;">Allelic association, linkage disequilibrium, gametic phase disequilibrium</span>**

"If the alleles at one locus are not distributed in the population independently of the alleles at another locus, the two loci exhibit allelic association. If this association is a result of a mixture of subpopulations (such as ethnicities or religious groups) within each of which there is random mating, the association is often denoted as “spurious”. In such a case there is true association, but the cause is not of primary genetic interest. If the association is not due to this kind of population structure, it is either due to linkage disequilibrium (LD) or gametic phase disequilibrium (GPD); in the former case the loci are linked, i.e. they co-segregate in families, in the latter case they are not linked, i.e. they segregate independently in families. Owing to an unintended original definition, loci that are not linked have often been mistakenly described as being in LD."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450815/
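
Non-independence of alleles at two loci is commonly quantified with the coefficient D = p(AB) - p(A)p(B) and its normalized form r². These measures are standard in population genetics but are not defined in the quoted passage, so the sketch below is an illustration under that assumption; the haplotype frequencies are invented and `association` is a hypothetical helper.

```python
def association(hap_freqs):
    """Compute D and r^2 for two diallelic loci.

    hap_freqs maps the haplotypes 'AB', 'Ab', 'aB', 'ab' to relative
    frequencies that sum to 1.
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b          # zero when alleles are independent
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical haplotype frequencies.
independent = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
associated = {"AB": 0.40, "Ab": 0.10, "aB": 0.10, "ab": 0.40}

d_ind, r2_ind = association(independent)
d_assoc, r2_assoc = association(associated)
print(d_ind, r2_ind)
print(round(d_assoc, 4), round(r2_assoc, 4))
```

A non-zero D only shows that allelic association exists. It cannot by itself distinguish linkage disequilibrium from gametic phase disequilibrium or from association caused by population structure; that distinction requires knowing whether the loci are linked and how the sample was drawn.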

In [None]:
df["Associated_ND"].value_counts()

In [None]:
df["Associated_ND"].value_counts().plot.bar(color=['blue', 'red','lime','purple'], title='ALS Associated ND Genes');

"ALSgeneScanner is a pipeline designed for the analysis of NGS data of ALS patients. It perfoms alignment, variant calling, structural variant callin and repeat expansion calling as well as variant annotation using Annovar. It restricts the analysis to a subset of genes (~150) which have been shown to be associated with ALS. A complete list of the included genes is available as a Google Spreadsheet. It also prioritize variants according to the scientific evidence of the gene association and the effect prediction of the variant. At present this can only be done using the reference genome hg19."

"ALSgeneScanner is based on the DNAscan analysis framework. For a detailed description of all ALSgeneScanner components please read the ALSgeneScanner preprint and the DNAscan preprint."

Citation: Alfredo Iacoangeli et al. ALSgeneScanner: a pipeline for the analysis and interpretation of DNA NGS data of ALS patients. bioRxiv, 2018

https://github.com/KHP-Informatics/ALSgeneScanner

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

In [None]:
df["Phenotype_influence"].value_counts()

In [None]:
df["Phenotype_influence"].value_counts().plot.bar(color=['blue', 'red', 'lime', 'purple', 'chartreuse', 'coral', 'darkorchid', 'fuchsia'], title='Phenotype Influence of ALS Genes');

**<span style="color:#DC143C;">Longer Survival Genes</span>**

In [None]:
longsurv = df[(df['Phenotype_influence']=='Longer survival')].reset_index(drop=True)
longsurv.head()

**<span style="color:#DC143C;">Primarily Bulbar Onset Genes</span>**

In [None]:
primbulb = df[(df['Phenotype_influence']=='Primarily bulbar onset')].reset_index(drop=True)
primbulb.head()

**<span style="color:#DC143C;">Shorter Survival Genes</span>**

In [None]:
shorsurv = df[(df['Phenotype_influence']=='Shorter survival')].reset_index(drop=True)
shorsurv.head()

**<span style="color:#DC143C;">Limb-Onset Genes</span>**

In [None]:
limbons = df[(df['Phenotype_influence']=='Limb-onset')].reset_index(drop=True)
limbons.head()

**<span style="color:#DC143C;">Late Age of Onset Genes</span>**

In [None]:
latagons = df[(df['Phenotype_influence']=='Late age of onset')].reset_index(drop=True)
latagons.head()

**<span style="color:#DC143C;">Early Age of Onset and Shorter Survival Genes</span>**

In [None]:
earlshorsurv = df[(df['Phenotype_influence']=='Early age of onset and shorter survival')].reset_index(drop=True)
earlshorsurv.head()

**<span style="color:#DC143C;">Limb-Onset, Early Age of Onset and Shorter Survival Genes</span>**

In [None]:
limbearlshor = df[(df['Phenotype_influence']=='Limb-onset, early age of onset and shorter survival')].reset_index(drop=True)
limbearlshor.head()
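
The per-phenotype filters above all follow one pattern, so they could also be produced in a single pass with `groupby`. The sketch below assumes only the `Gene` and `Phenotype_influence` column names from this notebook; the rows are made-up placeholders, not values from `genes.csv`.

```python
import pandas as pd

# Placeholder rows for illustration only; gene names are not real.
toy = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "Phenotype_influence": ["Longer survival", "Shorter survival",
                            "Longer survival", None],
})

# One sub-frame per phenotype label; rows with a missing label are
# skipped because groupby drops missing keys by default.
by_phenotype = {
    label: group.reset_index(drop=True)
    for label, group in toy.groupby("Phenotype_influence")
}

for label, group in by_phenotype.items():
    print(label, list(group["Gene"]))
```

With the real data, `by_phenotype['Longer survival']` would hold the same rows as `longsurv`, and any new phenotype label in the file would be picked up without adding another cell.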

In [None]:
df["Key_reference"].value_counts()

**<span style="color:#DC143C;">Acknowledgments</span>**

"ALSgeneScanner: a pipeline for the analysis and interpretation of DNA sequencing data of ALS patients is an EU Joint Programme - Neurodegenerative Disease Research (JPND) project. The project is supported through the following funding organisations under the aegis of JPND - www.jpnd.eu (United Kingdom, Medical Research Council (MR/L501529/1; MR/R024804/1) and Economic and Social Research Council (ES/L008238/1)) and through the Motor Neurone Disease Association. That study represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London and by awards establishing the Farr Institute of Health Informatics Research at UCLPartners, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust (grant MR/K006584/1). The work leading up to this publication was funded by the European Community’s Horizon 2020 Programme (H2020-PHC-2014-two-stage; grant agreement number 633413). Sequence data used in this research were in part obtained from the UK National DNA Bank for MND Research, funded by the MND Association and the Wellcome Trust. The authors would like to thank people with MND and their families for their participation in this project."

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553