In [2]:
# Blue Eye Gene Variant Lab
# Zeke Van Dehy
# 10/22/2020

# Blue Eye #

A longer description of what to do in this lab can be found here:

https://genomeintelligence.org/?p=775 (explanation of the science)
https://genomeintelligence.org/?p=792 

The assignment this year is a little different from the last time we did this lab, so go by what's in this notebook and not what's on the webpage. If you need a challenge, you can add the additional SNPs that are not already in blueeyepanel.txt and come up with a way to report each of them.

Your package of files contains one descriptive file, blueeyepanel.txt, which DEFINES the blue eye genetic pattern. The other three files are individual data: one person with the blue eye pattern, one without, and one that is a partial match.

A line in the blueeyepanel.txt file looks like this:

```AA, rs7495174, BEH1```

A line in a 23andMe file looks like this:

```rs4778138	15	28335820        GG```

What do we want from each of these files? That's the hard part.

Your goal is to write a program that reads in the defining blue eye pattern and then uses the information to *test* the other three files to see if they match the pattern.

The thing that connects the blueeyepanel.txt pattern file and the 23andme user files is the SNP IDENTIFIER, the ID that starts with "rs". If there is an rs001 in blueeyepanel.txt and an rs001 in a 23andme file, you can be sure that they refer to the same well-characterized SNP site in the human genome. The rs number is a UNIQUE IDENTIFIER for that human SNP.
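
As a minimal sketch of that idea, the snippet below pulls the rs identifier out of one line of each file type and compares them. The helper names `panel_rsid` and `user_rsid` are hypothetical, and the two sample lines are built from the examples shown in this notebook rather than read from the real files.

```python
# Sample lines modeled on the examples in this notebook (not read from disk).
panel_line = "AA, rs7495174, BEH1"
user_line = "rs7495174\t15\t28344238\tGG"

def panel_rsid(line):
    """Return the rs identifier from a blueeyepanel.txt-style line."""
    # The panel format is "genotype, rsid, haplotype".
    return line.split(",")[1].strip()

def user_rsid(line):
    """Return the rs identifier from a 23andMe-style line."""
    # The 23andMe format is whitespace-separated, with the rsid first.
    return line.split()[0]

# Equal rs identifiers mean both lines describe the same SNP site.
same_site = panel_rsid(panel_line) == user_rsid(user_line)
print(same_site)
```

Because the rs number is unique, a plain string comparison is enough to decide whether two lines describe the same site.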

We're going to talk about how you should proceed in class. But I've included some ideas below. THERE IS MORE THAN ONE WAY TO SOLVE THIS. My way is not the only way and maybe not even the best.

# First, get the SNPs that are assayed in the 23andme file #

Not all of the SNPs in the blue eye panel are assayed by 23andMe, because some of them occur in haplotype blocks where measuring one of them predicts the values of the others with 100% certainty, so only that one needs to be measured.
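
One way to see which panel SNPs are missing from an individual's file is to compare the two sets of rs identifiers. This is only a sketch: `unassayed_snps` is a hypothetical helper, and the two short ID lists are illustrative samples assembled from IDs that appear in this notebook, not the full files.

```python
def unassayed_snps(panel_ids, assayed_ids):
    """Return panel rs IDs absent from the individual's file, in panel order."""
    assayed = set(assayed_ids)   # set membership tests are fast
    seen = set()                 # the panel can list the same rs ID twice
    missing = []
    for rsid in panel_ids:
        if rsid not in assayed and rsid not in seen:
            missing.append(rsid)
            seen.add(rsid)
    return missing

# Illustrative sample data (assumed, not the complete panel or user file).
panel_ids = ["rs4778241", "rs7170852", "rs2238289", "rs4778241"]
assayed_ids = ["rs4778138", "rs4778241", "rs7495174"]

print(unassayed_snps(panel_ids, assayed_ids))
```

Any rs ID reported here simply has no line in the individual's file, so there is nothing to compare for it.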

I could make a list of everything that is assayed in a 23andme file by going through the lines one by one and checking what kind of line each one is. If the line starts with a #, it's a comment, and I don't want to process it. If the line contains MT, it is part of the mitochondrial genome, and I don't need to worry about it since none of the blue eye SNPs are on the mitochondrial genome.
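
The line-by-line check described above could be sketched as a small predicate. `is_data_line` is a hypothetical helper, and it makes one assumption that differs from a plain substring test: it looks for MT only in the chromosome column (the second field). The sample lines are made up for illustration, including the mitochondrial one.

```python
def is_data_line(line):
    """True for a nuclear-genome SNP line; False for blanks, comments, and MT lines."""
    if not line.strip() or line.startswith("#"):
        return False
    fields = line.split()
    # Assumed layout: rsid, chromosome, position, genotype.
    return len(fields) >= 4 and fields[1] != "MT"

sample_lines = [
    "# rsid\tchromosome\tposition\tgenotype",   # comment line
    "rs4778138\t15\t28335820\tGG",              # data line from this notebook
    "i3001754\tMT\t7979\tA",                    # hypothetical mitochondrial line
    "",                                         # blank line
]

assayed_ids = [line.split()[0] for line in sample_lines if is_data_line(line)]
print(assayed_ids)
```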

How would we extract the assayed SNP IDs from one of the 23andme files?

In [3]:
def readFile(filename):
    """Return [(id, chromosome, position, genotype)] from a 23andme file."""
    records = []
    with open(filename) as fo:
        for line in fo:
            # Skip blank lines, comment lines, and mitochondrial (MT) lines.
            if not line.strip() or line.startswith("#") or "MT" in line:
                continue
            # Fields are whitespace-separated: id, chromosome, position, genotype.
            rsid, chromosome, position, genotype = line.split()[:4]
            records.append((rsid, chromosome, position, genotype))
    return records

In [4]:
partblue = readFile("partblue-23andMe.txt")[:5]
for i in partblue:
    print(i)

('rs4778138', '15', '28335820', 'GG')
('rs4778241', '15', '28338713', 'AA')
('rs7495174', '15', '28344238', 'GG')
('rs1129038', '15', '28356859', 'TT')
('rs12593929', '15', '28359258', 'AA')


# Then, make a set of parallel lists that define the blue eye pattern #

There's more than one way to set up a data structure that will work to have your blue eye pattern defined. However, because the lists we are making are of the same length and there is a one to one relationship between the unique identifier (rs#) and the genotype information at that site, parallel lists accessed with a common index will get the job done.

You could also use a list of three-value tuples that explicitly ties each SNP's values together, then iterate over the list, but you'd have to unpack the individual values inside each tuple to make conditionals based on them.
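
To make the two options concrete, here is a small sketch of both. The three sample entries are copied from the BEH1 rows of the pattern; the variable names are just illustrative choices.

```python
# Three parallel lists: position i in each list describes the same SNP.
ids = ["rs4778138", "rs4778241", "rs7495174"]
genotypes = ["AA", "CC", "AA"]
haplotypes = ["BEH1", "BEH1", "BEH1"]

# Option 1: parallel lists accessed with a common index.
for i in range(len(ids)):
    print(ids[i], genotypes[i], haplotypes[i])

# Option 2: zip ties the values into tuples; unpack them in the loop header
# so each value has a name you can use in a conditional.
pattern = list(zip(ids, genotypes, haplotypes))
for rsid, genotype, haplotype in pattern:
    if genotype == "AA":
        print(rsid, "expects AA in", haplotype)
```

Both versions hold the same information; the difference is whether the link between the three values is implicit (a shared index) or explicit (one tuple per SNP).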

How would we extract the values from the blueeye file and put them in parallel lists?

In [11]:
def getBlueEyePattern():
    """Return [(id, genotype, haplotype)] from blueeyepanel.txt."""
    pattern = []
    with open("blueeyepanel.txt") as fo:
        for line in fo:
            line = line.strip()
            # Each line looks like "AA, rs7495174, BEH1".
            genotype, rsid = line.split(", ")[:2]
            # Splitting on "," alone keeps the space after the comma,
            # so the haplotype is stored with a leading space (' BEH1').
            haplotype = line.split(",")[2]
            pattern.append((rsid, genotype, haplotype))
    return pattern
blueEyePattern = getBlueEyePattern()
for i in blueEyePattern:
    print(i)

('rs4778241', 'CC', ' h-1')
('rs1129038', 'AA', ' h-1')
('rs12593929', 'AA', ' h-1')
('rs12913832', 'GG', ' h-1')
('rs7183877', 'CC', ' h-1')
('rs3935591', 'CC', ' h-1')
('rs7170852', 'AA', ' h-1')
('rs2238289', 'TT', ' h-1')
('rs3940272', 'CC', ' h-1')
('rs8028689', 'TT', ' h-1')
('rs2240203', 'TT', ' h-1')
('rs916977', 'CC', ' h-1')
('rs11631797', 'GG', ' h-1')
('rs4778138', 'AA', ' BEH1')
('rs4778241', 'CC', ' BEH1')
('rs7495174', 'AA', ' BEH1')
('rs1129038', 'TT', ' BEH2')
('rs12913832', 'GG', ' BEH2')
('rs916977', 'CC', ' BEH3')
('rs1667394', 'TT', ' BEH3')


# Then, check the individual's SNP file to see if the genotypes match #

The rs# identifier connects the individual 23andme file to the blue eye pattern file.

The event that triggers writing a line in the report is finding a SNP ID in the 23andme file that is also in the list of rs# values in our blue eye pattern. That event triggers three things:

- we use a .index command to find the position of the rs# in the blueeye lists
- we get the values from that index position across all three lists
- we check the genotype value in the defined pattern against the individual's genotype
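
The three steps above can be sketched with parallel lists as follows. `lookup` is a hypothetical helper and the short lists are sample values taken from the pattern. One caveat worth knowing: `.index` returns only the first position where a value occurs, so for an rs# that is listed under two haplotypes (rs4778241 appears in both h-1 and BEH1) this sketch reports only the first entry.

```python
# Sample parallel lists; rs4778241 appears twice, as it does in the panel.
ids = ["rs4778241", "rs1129038", "rs4778241"]
genotypes = ["CC", "AA", "CC"]
haplotypes = ["h-1", "h-1", "BEH1"]

def lookup(rsid, user_genotype):
    """Return (haplotype, reference GT, match?) for the first entry with rsid, or None."""
    if rsid not in ids:
        return None                      # SNP is not part of the pattern
    i = ids.index(rsid)                  # step 1: find the position
    ref = genotypes[i]                   # step 2: read across the lists
    return haplotypes[i], ref, user_genotype == ref   # step 3: compare

print(lookup("rs4778241", "AA"))
print(lookup("rs0000000", "AA"))
```

Checking `rsid not in ids` first avoids the `ValueError` that `.index` raises when the value is missing.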

We might want to combine information from the two files somehow at this point to help with the final step.

In [12]:
def compareBlueEye(individual):
    """Given [(id, chromosome, position, genotype)] for an individual,
    return report lines comparing each shared SNP to the blue eye pattern.
    Each line holds: rs#, pattern, user GT, reference GT, match?"""
    # Read the pattern file once, not once per line of the individual's file.
    pattern = getBlueEyePattern()
    ret = []
    for ind_line in individual:
        for blue_line in pattern:
            # If the individual's SNP is in the pattern, check the genotype.
            if ind_line[0] == blue_line[0]:
                ret.append(ind_line[0] + ",\t" + blue_line[2] + ",\t\t\t" + ind_line[3] + ",\t" + blue_line[1] + ",\t\t" + str(ind_line[3] == blue_line[1]))
    return ret
partblue = readFile("partblue-23andMe.txt")
for i in compareBlueEye(partblue):
    print(i)

rs4778138,	 BEH1,			GG,	AA,		False
rs4778241,	 h-1,			AA,	CC,		False
rs4778241,	 BEH1,			AA,	CC,		False
rs7495174,	 BEH1,			GG,	AA,		False
rs1129038,	 h-1,			TT,	AA,		False
rs1129038,	 BEH2,			TT,	TT,		True
rs12593929,	 h-1,			AA,	AA,		True
rs12913832,	 h-1,			GG,	GG,		True
rs12913832,	 BEH2,			GG,	GG,		True
rs7183877,	 h-1,			CC,	CC,		True
rs3935591,	 h-1,			CC,	CC,		True
rs8028689,	 h-1,			TT,	TT,		True
rs2240203,	 h-1,			TT,	TT,		True
rs916977,	 h-1,			CC,	CC,		True
rs916977,	 BEH3,			CC,	CC,		True
rs1667394,	 BEH3,			TT,	TT,		True


# Finally, we make a report #

Have your program step through the blueeye lists and report:

rs#, pattern, reference GT, user GT, match?
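
A minimal sketch of such a report, assuming the pattern is a list of (rs#, genotype, haplotype) tuples and the individual is a list of (rs#, chromosome, position, genotype) tuples. `make_report` is a hypothetical helper and the sample data is a small excerpt, not the full files. It steps through the pattern, looks up the user's genotype in a dictionary keyed by rs#, and emits the columns in the order listed above.

```python
def make_report(pattern, individual):
    """Return rows of (rs#, pattern, reference GT, user GT, match?)."""
    # Dictionary keyed by rs# gives direct access to the user's genotype.
    user_gt = {rsid: gt for rsid, _chrom, _pos, gt in individual}
    rows = []
    for rsid, ref_gt, haplotype in pattern:
        if rsid in user_gt:              # skip SNPs the file does not assay
            usr = user_gt[rsid]
            rows.append((rsid, haplotype, ref_gt, usr, ref_gt == usr))
    return rows

# Small illustrative excerpt (assumed sample, not the complete data).
pattern = [("rs4778138", "AA", "BEH1"),
           ("rs1129038", "TT", "BEH2"),
           ("rs7170852", "AA", "h-1")]
individual = [("rs4778138", "15", "28335820", "GG"),
              ("rs1129038", "15", "28356859", "TT")]

row_format = "{:<12}{:<10}{:<14}{:<9}{}"   # fixed-width columns
print(row_format.format("rs#", "pattern", "reference GT", "user GT", "match?"))
for row in make_report(pattern, individual):
    print(row_format.format(*row))
```

Fixed-width format fields keep the columns aligned regardless of how long each rs# is, which tab characters alone do not guarantee.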

In [13]:
individualFiles = ["partblue-23andMe.txt", "notblue-23andMe.txt", "blueeye-23andMe.txt"]
individuals = [readFile(file) for file in individualFiles]

for filename, individual in zip(individualFiles, individuals):
    print(filename)
    print("rs#,\t\tpattern,\treference GT,\tuser GT,\tmatch?")
    for line in compareBlueEye(individual):
        print(line)
    print()

partblue-23andMe.txt
rs#,		pattern,	reference GT,	user GT,	match?
rs4778138,	 BEH1,			GG,	AA,		False
rs4778241,	 h-1,			AA,	CC,		False
rs4778241,	 BEH1,			AA,	CC,		False
rs7495174,	 BEH1,			GG,	AA,		False
rs1129038,	 h-1,			TT,	AA,		False
rs1129038,	 BEH2,			TT,	TT,		True
rs12593929,	 h-1,			AA,	AA,		True
rs12913832,	 h-1,			GG,	GG,		True
rs12913832,	 BEH2,			GG,	GG,		True
rs7183877,	 h-1,			CC,	CC,		True
rs3935591,	 h-1,			CC,	CC,		True
rs8028689,	 h-1,			TT,	TT,		True
rs2240203,	 h-1,			TT,	TT,		True
rs916977,	 h-1,			CC,	CC,		True
rs916977,	 BEH3,			CC,	CC,		True
rs1667394,	 BEH3,			TT,	TT,		True

notblue-23andMe.txt
rs#,		pattern,	reference GT,	user GT,	match?
rs4778138,	 BEH1,			GG,	AA,		False
rs4778241,	 h-1,			AA,	CC,		False
rs4778241,	 BEH1,			AA,	CC,		False
rs7495174,	 BEH1,			GG,	AA,		False
rs1129038,	 h-1,			AA,	AA,		True
rs1129038,	 BEH2,			AA,	TT,		False
rs12593929,	 h-1,			GG,	AA,		False
rs12913832,	 h-1,			TT,	GG,		False
rs12913832,	 BEH2,			TT,	GG,		False
rs7183877,	 h-