# Flow Control Exercise: Convert genotypes

If most of this material is brand new to you, your goal is to complete Part 1. If you are more experienced, please move on to Part 2 and Part 3. And don't forget to talk to your neighbor!

### Motivation:

A biologist is interested in the genetic basis of height. She measures the heights of many subjects and sends their DNA samples to a core facility for genotyping on arrays. These arrays determine the DNA bases at variable sites of the genome (known as single nucleotide polymorphisms, or SNPs). Since humans are diploid, i.e. have two copies of each chromosome, each data point is a pair of DNA bases, one from each of the two chromosomes in an individual. At each SNP there are only three possible genotypes, e.g. AA, AG, and GG for an A/G SNP. To test the correlation between a SNP genotype and height, she wants to perform a regression with an additive genetic model. However, she cannot do this with the data in its current form. She needs to convert the genotypes, e.g. AA, AG, and GG, to the numbers 0, 1, and 2, respectively (in this example, the number corresponds to the number of G bases the person has at that SNP). Since she has too much data to do this manually, e.g. in Excel, she comes to you for ideas on how to transform the data efficiently.

## Part 1:

Create a new list which has the converted genotype for each subject ('AA' -> 0, 'AG' -> 1, 'GG' -> 2).

In [4]:
genos = ['AA', 'GG', 'AG', 'AG', 'GG']
genos_new = []
# Use your knowledge of if/else statements and loop structures below.
for i in genos:
    genos_new.append(i.count('G'))
genos_new

[0, 2, 1, 1, 2]
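The comment in the cell above suggests if/else statements, while the solution counts the 'G' characters in each genotype. As a sketch of the more explicit alternative, assuming only the three genotypes 'AA', 'AG', and 'GG' occur, the same conversion could be written with if/elif (the name `genos_if_else` is introduced here only for illustration):

```python
genos = ['AA', 'GG', 'AG', 'AG', 'GG']
genos_if_else = []  # hypothetical name, kept separate from genos_new

for geno in genos:
    # Map each genotype string to its number of G bases
    if geno == 'AA':
        genos_if_else.append(0)
    elif geno == 'AG':
        genos_if_else.append(1)
    elif geno == 'GG':
        genos_if_else.append(2)

print(genos_if_else)  # [0, 2, 1, 1, 2]
```

Counting 'G' is shorter and does not need one branch per genotype; the if/elif form makes the mapping explicit.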

Check your work:
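No check cell is shown for this part. One possible check, assuming the expected result is the list given in the task description, is a direct comparison:

```python
genos = ['AA', 'GG', 'AG', 'AG', 'GG']
genos_new = []
for geno in genos:
    # The number of 'G' characters is the converted genotype
    genos_new.append(geno.count('G'))

# Compare the converted list with the expected values
print(genos_new == [0, 2, 1, 1, 2])  # True
```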

## Part 2:

Sometimes there are errors and the genotype cannot be determined. Adapt your code from above to deal with this problem (in this example, missing data are assigned 'NA' for "Not Available").

In [5]:
genos_w_missing = ['AA', 'NA', 'GG', 'AG', 'AG', 'GG', 'NA']
genos_w_missing_new = []
# The missing data should not be converted to a number, but remain 'NA' in the new list
for i in genos_w_missing:
    if i == 'NA':
        # Keep the missing-data code unchanged
        genos_w_missing_new.append(i)
    else:
        genos_w_missing_new.append(i.count('G'))
genos_w_missing_new

[0, 'NA', 2, 1, 1, 2, 'NA']
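The explicit test for 'NA' matters because `'NA'.count('G')` returns 0, so without the test a missing genotype would be silently recorded as 0 and become indistinguishable from AA. A minimal sketch that wraps this logic in a reusable function follows; the function name `convert_genotype` and its parameters `counted_allele` and `missing` are hypothetical, chosen only for this illustration:

```python
def convert_genotype(geno, counted_allele='G', missing='NA'):
    """Return the number of copies of counted_allele, or the missing code unchanged."""
    # Test for missing data first, so that 'NA' is never counted as a genotype
    if geno == missing:
        return missing
    return geno.count(counted_allele)


genos_w_missing = ['AA', 'NA', 'GG', 'AG', 'AG', 'GG', 'NA']
converted = []
for geno in genos_w_missing:
    converted.append(convert_genotype(geno))

print(converted)           # [0, 'NA', 2, 1, 1, 2, 'NA']
print('NA'.count('G'))     # 0, which is why the missing-data test is needed
```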

## Part 3:

The file genos.txt has a column of genotypes. Read in the data and convert the genotypes as above. Hint: You'll need to use the built-in string method strip to remove the new-line characters (see the example of reading in a file above; we will cover string methods in the next section).
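To illustrate the hint, here is a small sketch of what `strip` does to a line as it would be read from a file. The string `'AG\n'` is a made-up example line, not taken from genos.txt:

```python
line = 'AG\n'  # a line read from a file usually ends with a new-line character

print(repr(line))            # 'AG\n'
print(repr(line.strip()))    # 'AG'

# Without strip, the comparison fails because of the trailing new-line
print(line == 'AG')          # False
print(line.strip() == 'AG')  # True
```

The same applies to the missing-data code: `'NA\n' == 'NA'` is False, so the comparison with 'NA' only works after stripping.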

In [6]:
# Store the genotypes from genos.txt in this list
genos_from_file = []

# Read all lines; the with-block closes the file automatically
with open('genos.txt') as f:
    lines = f.readlines()

for line in lines:
    # Remove the trailing new-line character before comparing
    geno = line.strip()
    if geno == 'NA':
        genos_from_file.append(geno)
    else:
        genos_from_file.append(geno.count('G'))
genos_from_file

[2,
 2,
 1,
 1,
 0,
 0,
 2,
 2,
 2,
 0,
 'NA',
 1,
 0,
 0,
 2,
 0,
 1,
 2,
 1,
 0,
 1,
 2,
 2,
 1,
 1,
 0,
 0,
 1,
 0,
 2,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 'NA',
 0,
 1,
 2,
 0,
 2,
 0,
 2,
 1,
 2,
 2,
 0,
 2,
 2,
 2,
 2,
 1,
 0,
 2,
 2,
 0,
 2,
 0,
 1,
 1,
 2,
 1,
 1,
 0,
 0,
 2,
 2,
 1,
 0,
 2,
 0,
 2,
 2,
 0,
 1,
 0,
 2,
 'NA',
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 2,
 2,
 1,
 2,
 0,
 1,
 1,
 2,
 1,
 2,
 0,
 1,
 2,
 2,
 1,
 0,
 0,
 1,
 2,
 1,
 1,
 0,
 2,
 0,
 1,
 2,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 2,
 1,
 2,
 2,
 2,
 0,
 0,
 0,
 0,
 1,
 1,
 2,
 2,
 2,
 2,
 0,
 1,
 1,
 0,
 1,
 2,
 1,
 0,
 0,
 1,
 0,
 2,
 0,
 1,
 'NA',
 1,
 1,
 0,
 2,
 0,
 2,
 2,
 1,
 2,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 2,
 2,
 2,
 0,
 0,
 0,
 2,
 2,
 1,
 1,
 2,
 1,
 0,
 2,
 0,
 1,
 0,
 2,
 1,
 0,
 0,
 0,
 2,
 'NA',
 0,
 0,
 0,
 0,
 2,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 2,
 2,
 1,
 0,
 2,
 0,
 0,
 2,
 0,
 2,
 2,
 1,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 1,
 2,

Check your work:

In [10]:
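# Compare the first 15 converted genotypes with the expected values.
# This should evaluate to True once genos_from_file holds only the values read from genos.txt.
genos_from_file[:15] == [2, 2, 1, 1, 0, 0, 2, 2, 2, 0, 'NA', 1, 0, 0, 2]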
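To try the Part 3 logic without genos.txt at hand, one option is to write a few genotypes to a temporary file and read them back. This is only a sketch: the four genotypes below are made up for illustration and are not the contents of genos.txt.

```python
import os
import tempfile

# Write a small hypothetical genotype file, one genotype per line
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('GG\nAG\nNA\nAA\n')
tmp.close()

converted = []
with open(tmp.name) as f:
    for line in f:
        # Remove the new-line character before comparing or counting
        geno = line.strip()
        if geno == 'NA':
            converted.append(geno)
        else:
            converted.append(geno.count('G'))

# Clean up the temporary file
os.remove(tmp.name)

print(converted)  # [2, 1, 'NA', 0]
```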