# Introduction to Statistical Genetics
## 02-404/02-704, Fall 2024
## Code and Data File Documentation

For all homework assignments in this course, you may use the provided file `CB_02704.py`, which contains several predefined functions. A common dataset of genetic data is also provided and should be used for the assignments.

At the beginning of `CB_02704.py`, set the path variable to the directory in which you store your data, for example:

```python
path = "Users/jonathanzhu/Documents/statgen/data"
```

You can then import this file using `import CB_02704` or `from CB_02704 import *`, depending on your preferences.

```python
# we opt for the latter
from CB_02704 import *
```



### About Data Files
The provided dataset includes files of three different types. These will end with `.snp`, `.ind`, or `.geno`, and the preceding name will correspond to the population from which the data are derived. The list of populations is as follows:

* ASW: African-Americans in the USA
* CEU: northern Europeans in the USA
* CHB: Chinese in China
* CHD: Chinese in the USA
* GIH: Gujarati-Americans in the USA
* HapMap3: the 1,260 samples from 11 populations collected in the HapMap3 Project.
* JPT: Japanese in Japan
* LWK: Luhya in Kenya
* MKK: Maasai in Kenya
* MXL: Mexican-Americans in the USA
* TSI: Tuscan in Italy
* YRI: Yoruba in Nigeria

Each of these populations has three files. We go over the types, and the functions that work with them, here:

#### .snp Files
Each row of a `.snp` file describes one SNP: its identifier, its chromosome number, its genetic position in morgans (a column of zeroes in these files), its physical position on the chromosome, and its two alleles.

The main function to read these files is `read_snp_pop(pop)`. This function reads the .snp file into a pandas dataframe, as in the below usage.

```python
chb_snp = read_snp_pop("CHB")
chb_snp.head()
```

| | chromosome | morgans | position | ref | alt |
|---|---|---|---|---|---|
| rs3131972 | 1 | 0.0 | 742584 | G | A |
| rs3131969 | 1 | 0.0 | 744045 | G | A |
| rs3131967 | 1 | 0.0 | 744197 | C | T |
| rs1048488 | 1 | 0.0 | 750775 | T | C |
| rs12562034 | 1 | 0.0 | 758311 | G | A |


The function creates a dataframe indexed by SNP identifier, with columns for the chromosome of the SNP (`chromosome`), its genetic position in morgans (`morgans`), its physical position (`position`), the ancestral allele (`ref`), and the derived allele (`alt`).

This function relies on some other helper functions; see the end of this file for documentation on those.
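To make the file layout concrete, here is a minimal sketch of how text in this layout could be parsed with pandas. It is not the code in `CB_02704.py`: the name `parse_snp_text`, the whitespace delimiter, and the in-memory sample are assumptions for illustration, with column names taken from the output above.

```python
import io

import pandas as pd

# Hypothetical excerpt of a .snp file, assumed to be whitespace-delimited:
# SNP ID, chromosome, genetic position, physical position, two alleles.
snp_text = """\
rs3131972 1 0.0 742584 G A
rs3131969 1 0.0 744045 G A
rs3131967 1 0.0 744197 C T
"""


def parse_snp_text(text):
    """Parse .snp-style text into a dataframe indexed by SNP ID."""
    return pd.read_csv(
        io.StringIO(text),
        sep=r"\s+",
        header=None,
        names=["snp", "chromosome", "morgans", "position", "ref", "alt"],
        index_col="snp",
    )


snp_df = parse_snp_text(snp_text)
print(snp_df.loc["rs3131969", "position"])  # 744045
```

Indexing by SNP identifier, as the course function does, makes it easy to look up a single SNP by name with `.loc`.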

#### .ind Files
Each file ending with `.ind` contains one row per individual in the population, so the number of rows equals the number of individuals. Each row gives some identifying characteristics of the individual: a unique ID, sex, and population.

This file is read using `read_ind_pop(pop)`, which takes a population name and reads the corresponding file into a pandas dataframe, as in the example below.

```python
chb_pop = read_ind_pop("CHB")
chb_pop.head()
```

| | sex | pop |
|---|---|---|
| NA18597 | F | CHB |
| NA18615 | F | CHB |
| NA18557 | M | CHB |
| NA18628 | F | CHB |
| NA18745 | M | CHB |


This function relies on some other helper functions; see the end of this file for documentation on those.
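As a rough illustration of how the `.ind` table can be combined with genotype data, the sketch below rebuilds the five individuals from the example output by hand and uses the `sex` column to pick out columns of a small genotype matrix. The matrix `toy_geno` is made up, and the sketch assumes that the rows of the `.ind` table are in the same order as the columns of the genotype matrix.

```python
import numpy as np
import pandas as pd

# The five individuals from the example output, rebuilt by hand.
ind = pd.DataFrame(
    {"sex": ["F", "F", "M", "F", "M"], "pop": ["CHB"] * 5},
    index=["NA18597", "NA18615", "NA18557", "NA18628", "NA18745"],
)

# Made-up genotypes: 2 SNPs (rows) x 5 individuals (columns).
toy_geno = np.array([[2, 1, 2, 2, 0],
                     [0, 1, 1, 2, 2]])

# Boolean selector over individuals; assumes row i of `ind`
# describes column i of `toy_geno`.
is_female = (ind["sex"] == "F").to_numpy()
female_geno = toy_geno[:, is_female]

print(ind["sex"].value_counts().to_dict())  # {'F': 3, 'M': 2}
print(female_geno.shape)  # (2, 3)
```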

#### .geno Files
Each file ending with `.geno` has as many rows as the matching `.snp` file. Each row is a string of digits: 0, 1, or 2 for a genotype, or 9 for a missing value. The number of characters in a row matches the number of individuals in the matching `.ind` file. In other words, each SNP forms a row and each individual forms a column. Each genotype digit gives the number of copies (0, 1, or 2) of one of the SNP's alleles carried by that individual.
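The layout can be sketched with a tiny made-up example. The code below is not the course implementation: `parse_geno_text` is a hypothetical name, and the sample text is invented. It turns each line into a row of integers and masks any 9, which is the missing-value code.

```python
import numpy as np

# Made-up .geno excerpt: 3 SNPs (rows) x 4 individuals (columns).
geno_text = """\
2120
0192
2221
"""


def parse_geno_text(text):
    """Turn .geno-style text into a masked integer matrix (9 = missing)."""
    rows = [
        [int(ch) for ch in line.strip()]
        for line in text.splitlines()
        if line.strip()
    ]
    data = np.array(rows, dtype=np.int8)
    # Mask every entry equal to the missing-value code.
    return np.ma.masked_equal(data, 9)


geno = parse_geno_text(geno_text)
print(geno.shape)        # (3, 4)
print(geno.mask[1, 2])   # True: second SNP, third individual is missing
```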

Two functions read these files, and both return a NumPy masked array. The first, `read_geno_pop(pop)`, reads the entire file, as in the example below. NOTE: Because these files are large, this may take a while!

```python
chb_geno = read_geno_pop("CHB")
chb_geno[5]
```

```
masked_array(data=[2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1,
                   2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1,
                   2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2,
                   2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1,
                   2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False
```

The result is a NumPy masked array, which holds a data matrix and a boolean mask in one object. A `True` in the mask marks a missing value, which is coded as 9 in the data.

Because these files are large, a second function reads only the rows for a single chromosome, given as an integer: `read_geno_pop_chr(pop, chr)`, as in the following example:

```python
chb_geno_chr1 = read_geno_pop_chr("CHB", 1)
chb_geno_chr1[5]
```

```
masked_array(data=[2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1,
                   2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1,
                   2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2,
                   2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1,
                   2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False
```

This function relies on some other helper functions; see the end of this file for documentation on those.
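Because both readers return masked arrays, NumPy's masked operations skip missing genotypes automatically. The sketch below uses a small made-up matrix (not course data) to count the non-missing genotypes per SNP and to estimate the frequency of the counted allele. Halving the mean assumes each individual carries two copies of the chromosome.

```python
import numpy as np

# Made-up genotype matrix: 2 SNPs x 4 individuals; 9 marks missing.
raw = np.array([[2, 1, 9, 2],
                [0, 9, 1, 1]])
geno = np.ma.masked_equal(raw, 9)

# Number of non-missing genotypes for each SNP (row).
n_called = geno.count(axis=1)

# Mean allele count over non-missing individuals, halved to give the
# frequency of the counted allele (assumes diploid genotypes).
freq = geno.mean(axis=1) / 2

print(n_called.tolist())  # [3, 3]
print(freq)
```

Treating the 9s as ordinary numbers would inflate the means, which is why the mask matters here.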

### Other Helpful Functions

The variable `SNPs` is predefined in the accompanying `.py` file and simply stores the output of `read_snp_pop('HapMap3')`. It contains information about all the SNPs recorded in the HapMap3 project.

Related to this is the function `get_chr_range(chr)`. This function takes in a chromosome number and returns the start and stop positions for the SNPs on that chromosome, found specifically in the `SNPs` variable.
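The exact return format of `get_chr_range(chr)` is not spelled out here, so the following is only a sketch of one plausible reading: the start and stop are row offsets into the SNP table, with the stop exclusive. The function `chr_row_range` and the table `toy_snps` are hypothetical, and the sketch assumes that all SNPs on a chromosome occupy contiguous rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `SNPs` dataframe, sorted by chromosome.
toy_snps = pd.DataFrame(
    {"chromosome": [1, 1, 1, 2, 2, 3],
     "position": [100, 250, 900, 50, 75, 10]},
    index=["rsA", "rsB", "rsC", "rsD", "rsE", "rsF"],
)


def chr_row_range(snps, chrom):
    """Return (start, stop) row offsets for `chrom`, stop exclusive.

    Assumes the SNPs of each chromosome sit in contiguous rows.
    """
    rows = np.flatnonzero(snps["chromosome"].to_numpy() == chrom)
    if rows.size == 0:
        raise ValueError(f"no SNPs found on chromosome {chrom}")
    return int(rows[0]), int(rows[-1]) + 1


start, stop = chr_row_range(toy_snps, 2)
print(start, stop)                       # 3 5
print(list(toy_snps.index[start:stop]))  # ['rsD', 'rsE']
```

Under this reading, the same `start:stop` slice could be applied to the rows of a genotype matrix whose rows follow the SNP table's order.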

### Further Documentation

* `pname(name)`: Takes in a file name and adds the prespecified path to the beginning; returns this string.
* `popen(name)`: Opens the file specified in the `name` argument.
* `read_snp(file)`: Performs the actual reading of the .snp file into a pandas dataframe, taking in a filename string and returning the dataframe. Due to the need to use multiple population data files in these assignments, we recommend using `read_snp_pop(pop)` instead.
* `read_ind(file)`: Performs the actual reading of the .ind file into a pandas dataframe, taking in a filename string and returning the dataframe. Due to the need to use multiple population data files in these assignments, we recommend using `read_ind_pop(pop)` instead.
* `read_geno(file)`: Performs the actual reading of the .geno file into a numpy matrix, taking in a filename string and returning the matrix. Due to the need to use multiple population data files in these assignments, we recommend using `read_geno_pop(pop)` or `read_geno_pop_chr(pop, chr)` instead.
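For intuition, path helpers of this kind might look roughly like the sketch below. The names `data_path`, `full_name`, and `pop_file` are hypothetical and are not the definitions in `CB_02704.py`. The sketch assumes that each data file is named after its population followed by the extension, as described in the data file section.

```python
import os

# Hypothetical data directory; set this to wherever the data are stored.
data_path = "statgen/data"


def full_name(name, base=data_path):
    """Prefix a file name with the data directory."""
    return os.path.join(base, name)


def pop_file(pop, ext, base=data_path):
    """Build the path of a population's file, e.g. CHB + snp -> CHB.snp."""
    return full_name(f"{pop}.{ext}", base)


print(pop_file("CHB", "snp"))
```

Using `os.path.join` rather than string concatenation avoids problems with a missing or doubled path separator.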