In [1]:
#imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

## The Human Gene module of SFARI Gene serves as a comprehensive, up-to-date reference for all known human genes associated with autism spectrum disorders (ASD).

The Human Gene module is the central component of SFARI Gene. The other modules of SFARI Gene (Animal Model, Gene Scoring, and Copy Number Variant (CNV)) are seamlessly integrated into the Human Gene module via a single search engine. The first Human Gene module was released in January of 2007, and in 2009 it was featured in the journal Nucleic Acids Research. There are currently ~800 genes in the Human Gene module that have potential links to ASD. These genes fall into four genetic categories: Rare, Syndromic, Association and Functional.

## Genetic Categories

1. Rare

Rare single gene variants, disruptions/mutations, and submicroscopic deletions/duplications directly linked to ASD

2. Syndromic

Genes implicated in syndromes in which a significant subpopulation develops symptoms of autism (examples: Angelman Syndrome, Fragile X Syndrome)

3. Association

Small risk-conferring common polymorphisms identified from genetic association studies in idiopathic ASD

4. Functional

Functional candidates not yet genetically linked with ASD

In [2]:
# Load the Human Gene module data
genes = pd.read_csv('./sfari_genes.csv')

## The current version of the Copy Number Variant (CNV) module of SFARI Gene focuses on curation of recurrent CNVs associated with ASD.

CNVs are segments of DNA, typically greater than 1,000 base pairs in length, that vary in copy number from person to person. These submicroscopic deletions and duplications are increasingly thought to be involved in the pathogenesis of a wide range of human diseases, including neuropsychiatric disorders such as ASD.

In [3]:
# Load the Copy Number Variant (CNV) module data
gene_cnvs = pd.read_csv('./sfari_gene_cnvs.csv')

## Our gene scoring system takes into account all available evidence supporting a gene's relevance to ASD risk and places each gene into a category reflecting the overall strength of that evidence.

SFARI Gene is a comprehensive database that includes any gene associated with autism risk, regardless of the nature of the evidence supporting its link to ASD. Given this approach and the potentially large number of false positives it invites, we recognize the importance of establishing a ranking system that gives users an estimate of the strength of the evidence in favor of each gene. In collaboration with our curators at MindSpec and a team of expert autism geneticists, we've established a set of criteria that allows us to rank genes into one of four categories, enabling users to easily identify the genes whose association with autism risk is most likely to hold up over time.

In [4]:
# Load the Gene Scoring module data
gene_scores = pd.read_csv('./sfari_gene_scoring.csv')

## The Mouse Models module of SFARI Gene examines data from mouse models used in laboratory research to elucidate the underlying causes of ASD.

Through rigorous curation of primary scientific research, the Mouse Models module provides integrated coverage of the latest discoveries at the molecular, cellular, and behavioral levels in ASD.

The data gleaned from genetic and induced mouse models can be used to better our understanding of the basic mechanisms of autism and help improve treatments for ASD. We envision that our work will also provide a platform for data mining, bioinformatics, and computational strategies that can be used to develop robust predictive models of these disorders.

## The Mouse Models module of SFARI Gene utilizes a unique classification system in order to more effectively navigate the intricacies of the mouse models.

For the genetic models, each reported model is assigned a name that consists of the gene name, a chronologically ordered model number, the model construct (allele type, such as Knock Out, Knock In, etc.), and the genotype (Homozygous, Heterozygous, or Hemizygous). We also classify publications that report the first model of a gene as "primary," with every subsequent publication recorded as "additional."

Nomenclature Sample:

Genetic Models:

`Gene Name_Model#_Construct_Genotype`

Example: `NLGN3_1_KO_HM`

Induced Models:

`Inducer Abbreviation_Model#_Model Type_Model Subtype`

Example: `VPA_1_IN_C`
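The naming scheme above can be split back into its four parts programmatically. The following is a minimal sketch, not part of SFARI Gene itself: `ModelName` and `parse_model_name` are hypothetical helper names, and the sketch assumes that no individual field contains an underscore and that stray spaces around the separators can be ignored.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Four underscore-separated fields of a mouse model name (hypothetical helper)."""
    subject: str       # gene name (genetic models) or inducer abbreviation (induced models)
    model_number: int  # chronologically ordered model number
    construct: str     # construct (genetic models) or model type (induced models)
    variant: str       # genotype (genetic models) or model subtype (induced models)


def parse_model_name(name: str) -> ModelName:
    """Split a model name such as 'NLGN3_1_KO_HM' into its four fields.

    Assumes no field contains an underscore; whitespace around fields is dropped.
    Raises ValueError if the name does not have exactly four fields or if the
    model number is not an integer.
    """
    parts = [part.strip() for part in name.split("_")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields, got {len(parts)}: {name!r}")
    subject, number, construct, variant = parts
    return ModelName(subject, int(number), construct, variant)


# The two examples from the nomenclature sample
print(parse_model_name("NLGN3_1_KO_HM"))
print(parse_model_name("VPA_1_IN_C"))
```

Because genetic and induced models share the same four-field layout, one parser covers both; which interpretation applies to `construct` and `variant` depends on the table the name came from.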

In [5]:
# Load the Mouse Models module data: genetic models
mouse_genes = pd.read_csv('./sfari_mouse_genes.csv')

In [6]:
# Load the Mouse Models module data: CNV models
mouse_gene_cnvs = pd.read_csv('./sfari_mouse_gene_cnvs.csv')

In [7]:
# Load the Mouse Models module data: inbred strains
mouse_inbred = pd.read_csv('./sfari_mouse_inbred.csv')

In [8]:
# Load the Mouse Models module data: induced models
mouse_induced = pd.read_csv('./sfari_mouse_induced.csv')

In [9]:
# Load the Mouse Models module data: rescue models
mouse_rescue = pd.read_csv('./sfari_mouse_rescue.csv')
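
Each `pd.read_csv` call above fails with a `FileNotFoundError` if its file is absent, so it can help to check for all eight files up front. The sketch below uses only the standard library. The file names are the ones read in this notebook, while `SFARI_FILES` and `missing_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File names taken from the read_csv calls in this notebook
SFARI_FILES = [
    "sfari_genes.csv",
    "sfari_gene_cnvs.csv",
    "sfari_gene_scoring.csv",
    "sfari_mouse_genes.csv",
    "sfari_mouse_gene_cnvs.csv",
    "sfari_mouse_inbred.csv",
    "sfari_mouse_induced.csv",
    "sfari_mouse_rescue.csv",
]


def missing_files(data_dir=".", filenames=SFARI_FILES):
    """Return the names in `filenames` that are not regular files under `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]


# Check the notebook's working directory, where the CSVs are expected to live
missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected SFARI CSV files are present.")
```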