# Structural Analysis Script
Created by Samuel Horovatin February 2022.

This script takes correctly formatted hapmap data as input (see http://augustogarcia.me/statgen-esalq/Hapmap-and-VCF-formats-and-its-integration-with-onemap/) and generates summary statistics from it.

Two forms of analysis are currently produced: trait-based analysis and line-based analysis. The analysis content was inspired by the paper found here: https://humgenomics.biomedcentral.com/articles/10.1186/s40246-018-0156-4

**Trait Based Analysis**
* SNP call rate: The proportion of lines with a non-missing SNP call for each trait. Traits with a call rate below 0.95 are considered for removal. In the SNP data I am using, the last 16 lines are devoid of SNP data, so an additional SNP call rate is calculated with those lines excluded.
* Expected Hardy-Weinberg equilibrium (HWE): The expected HWE for each trait across all lines.
* Actual Hardy-Weinberg equilibrium (HWE): The actual HWE for each trait across all lines.
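
The statistics listed above can be illustrated on a single toy SNP. The following is a minimal sketch, not the script's actual implementation. It assumes genotypes are two-character strings such as `AG`, that `NN` marks a missing call, and that "expected" and "actual" HWE are read as the expected heterozygote frequency (2pq) and the observed heterozygote frequency. The names `MISSING`, `snp_call_rate`, and `heterozygosity` are hypothetical.

```python
from collections import Counter

MISSING = "NN"  # assumption: code used for a missing genotype call


def snp_call_rate(genotypes):
    """Fraction of lines with a non-missing call for one SNP."""
    genotypes = list(genotypes)
    if not genotypes:
        return float("nan")
    called = [g for g in genotypes if g != MISSING]
    return len(called) / len(genotypes)


def heterozygosity(genotypes):
    """Return (expected, observed) heterozygote frequency for one biallelic SNP."""
    called = [g for g in genotypes if g != MISSING]
    if not called:
        return float("nan"), float("nan")
    # Count individual alleles across all called genotypes.
    allele_counts = Counter("".join(called))
    total = sum(allele_counts.values())
    p = max(allele_counts.values()) / total  # major allele frequency
    q = 1.0 - p                              # minor allele frequency (biallelic assumption)
    expected = 2 * p * q                     # heterozygote frequency expected under HWE
    observed = sum(g[0] != g[1] for g in called) / len(called)
    return expected, observed


# Toy SNP genotyped in five lines, one of which has no call.
toy_snp = ["AA", "AG", "GG", "NN", "AA"]
call_rate = snp_call_rate(toy_snp)
expected_het, observed_het = heterozygosity(toy_snp)
print(call_rate, expected_het, observed_het)
```

Both functions accept any iterable of genotype strings, so under the same assumptions they could be applied to one row of genotype columns at a time. A trimmed call rate like the one described above would pass only the lines that carry SNP data.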

In [None]:
import pandas as pd
import numpy as np

# Name/location of correctly formatted hapmap. Utilizes output of hapmap_gen.ipynb
HAPMAP = "./hapmaps/wheat_hapmap_gen_8222SNP.txt"

# Column headers used within the formatted hapmap
HAPMAP_HEADERS = ['rs#','alleles','chrom','pos','strand','assembly#','center', 'protLSID', 'assayLSID', 'panelLSID', 'QCcode']

# Name/location of tab-delimited Trait Analysis output file
TRAIT_OUTPUT = "./8222SNP_Trait_Analysis.txt"

# Column headers for interesting population statistics in output file
TRAIT_HEADERS = ['rs#', 'snpCallRate', 'snpCallRateTrim-16', 'HWE_Expected', 'HWE_Actual', 'ViolatesHWE', 'MAF']

In [None]:
# Load in the relevant data
hap_df = pd.read_csv(HAPMAP, sep='\t')