# Identify procedural reasons for 'all-nothing-all' mutations
## April 4, 2017

'All-nothing-all' mutations are at 100% frequency in one sampled generation, disappear completely (0%) in the next generation sampled, then reappear at 100% frequency. Potential explanations:

1. Biological - Frequency-dependent selection? 
2. Procedural - Undersampled populations? Missing coverage during sequencing? breseq frequency threshold for polymorphism mode (5-95%) not met?

Input: 
1. HTML file from gdtools COMPARE
2. GD files (annotated.gd) from breseq output.

Tasks:
1. Identify rows with the 'all-nothing-all' (a-n-a) pattern in the input HTML file, by mutation position.
2. For each a-n-a, identify the mutation position and the generation(s) at which 0% frequency occurs.
3. For each 0% frequency occurrence in an a-n-a, refer to the relevant annotated.gd file to identify the reason for the 0% frequency.
4. Classify the procedural reasons and report summary counts.
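
A minimal sketch of how Tasks 3-4 might fit together, assuming the 0% observations and the annotated.gd evidence rows have already been reduced to simple Python records. Everything here is hypothetical: the positions, the `REASON_A`/`REASON_B` labels (placeholders, not real breseq reject values), and the `classify` helper are for illustration only.

```python
from collections import Counter

# Hypothetical evidence rows, loosely mirroring fields pulled from annotated.gd.
# Positions and reject labels are made up; REASON_A / REASON_B are placeholders.
evidence = [
    {"position": 1000, "generation": 300, "reject": "REASON_A"},
    {"position": 2000, "generation": 500, "reject": ""},
    {"position": 3000, "generation": 300, "reject": "REASON_A"},
    {"position": 5000, "generation": 780, "reject": "REASON_B"},
]

# Hypothetical a-n-a hits: (position, generation at which frequency was 0%).
ana_zero_points = [(1000, 300), (2000, 500), (3000, 300), (4000, 780)]


def classify(position, generation, evidence_rows):
    """Return a reason label for one 0% observation."""
    for row in evidence_rows:
        if row["position"] == position and row["generation"] == generation:
            # An empty reject field means evidence exists but was not rejected.
            return row["reject"] or "evidence present, no reject reason"
    return "no evidence row found"


# Task 4: tally how often each reason explains a 0% observation.
reasons = Counter(classify(p, g, evidence) for p, g in ana_zero_points)
for reason, count in reasons.most_common():
    print(reason, count)
```

Keeping "no evidence row found" as its own class separates cases where breseq saw something and rejected it from cases where nothing was recorded at that position at all (for example, missing coverage).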

import re
from bs4 import BeautifulSoup
with open('/Users/ymseah/Documents/compare/UA3.html') as html_file:
    ua3 = BeautifulSoup(html_file, 'html.parser')

In [141]:
#Task 1-2

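#1. Identify line-generation headers
line_generations = []
#Anchor the pattern so only the ancestor column ('0') matches, not e.g. 'UA3-100'
next_line_gen = ua3.find('th', string=re.compile('^0$'))
line_gen = next_line_gen.string
#Collect header labels until the 'annotation' column is reached
while line_gen != 'annotation':
    line_generations.append(line_gen)
    #Advance three parse-tree elements to reach the next header label
    next_line_gen = next_line_gen.next_element.next_element.next_element
    line_gen = next_line_gen.string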

In [142]:
print(line_generations)

['0', 'UA3-100', 'UA3-300', 'UA3-500', 'UA3-780', 'UA3-1000']


In [None]:
#2. Identify ref genome and polymorphism position
#3. Identify frequencies across generations
#4. Link 2. and 3.
table_rows_0 = ua3.find_all('tr', class_='alternate_table_row_0')
table_rows_1 = ua3.find_all('tr', class_='alternate_table_row_1')

#5. Identify a-n-a pattern.
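
Step 5 is not implemented yet. One possible sketch, assuming each mutation row has already been reduced to a list of percentages in the same order as `line_generations`. The function name `find_ana_zero_indices` and the example frequencies are hypothetical. Treating a run of consecutive 0% samples flanked by 100% on both sides as a-n-a is also an assumption (it follows the "generation(s)" wording in Task 2).

```python
def find_ana_zero_indices(freqs, high=100.0, low=0.0):
    """Return indices of `low` values that sit between two `high` values.

    A run of consecutive `low` values counts only if the sample immediately
    before the run and the sample immediately after it are both `high`.
    """
    hits = []
    i = 0
    n = len(freqs)
    while i < n:
        if freqs[i] == low:
            # Find the end of this run of low values (exclusive index j).
            j = i
            while j < n and freqs[j] == low:
                j += 1
            # Require a high value on both flanks of the run.
            if i > 0 and j < n and freqs[i - 1] == high and freqs[j] == high:
                hits.extend(range(i, j))
            i = j
        else:
            i += 1
    return hits


# Header labels as printed above; the frequencies are made-up example values.
generations = ['0', 'UA3-100', 'UA3-300', 'UA3-500', 'UA3-780', 'UA3-1000']
example_freqs = [0, 100, 0, 100, 100, 100]

zero_generations = [generations[i] for i in find_ana_zero_indices(example_freqs)]
print(zero_generations)  # ['UA3-300']
```

The leading 0% in the ancestor column is not reported, because nothing at 100% precedes it.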


In [121]:
'''
#alternative script - string sorting messes up original order
find_line_generations = set(ua3.find_all('th', string=re.compile('UA3')))
line_generations=['0']
for each in find_line_generations:
    line_generations.append(each.string)
line_generations.sort()
print(line_generations)
'''

['0', 'UA3-100', 'UA3-1000', 'UA3-300', 'UA3-500', 'UA3-780']
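
The misordering above comes from plain string sorting, which places 'UA3-1000' before 'UA3-300'. A possible workaround, assuming every label is either '0' or ends in '-<generation number>', is to sort on the numeric suffix. The helper name `generation_number` is made up for this sketch.

```python
labels = ['UA3-100', 'UA3-1000', '0', 'UA3-300', 'UA3-500', 'UA3-780']


def generation_number(label):
    """Return the trailing generation number of a label as an int."""
    # 'UA3-300' -> '300'; the ancestor label '0' has no dash and is used as-is.
    return int(label.rsplit('-', 1)[-1])


ordered = sorted(labels, key=generation_number)
print(ordered)
# ['0', 'UA3-100', 'UA3-300', 'UA3-500', 'UA3-780', 'UA3-1000']
```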


In [35]:
import numpy as np
import pandas as pd
import re
UA3_100 = pd.read_table('/Users/ymseah/Documents/sic_UA3-15_breseq/annotated.gd', comment='#', names=range(50), dtype=str)
UA3_100.insert(0, 'generation', 100)
UA3_300 = pd.read_table('/Users/ymseah/Documents/sic_UA3.45_breseq/annotated.gd', comment='#', names=range(50), dtype=str)
UA3_300.insert(0, 'generation', 300)
UA3_500 = pd.read_table('/Users/ymseah/Documents/sic_UA3-76_breseq/annotated.gd', comment='#', names=range(50), dtype=str)
UA3_500.insert(0, 'generation', 500)
UA3_780 = pd.read_table('/Users/ymseah/Documents/sic_UA3.118_breseq/annotated.gd', comment='#', names=range(50), dtype=str)
UA3_780.insert(0, 'generation', 780)
UA3_1000 = pd.read_table('/Users/ymseah/Documents/sic_UA3_S2_L001_breseq/output/evidence/annotated.gd', comment='#', names=range(50), dtype=str)
UA3_1000.insert(0, 'generation', 1000)

UA3_df  = pd.concat([UA3_100, UA3_300, UA3_500, UA3_780, UA3_1000], ignore_index=True)
UA3_df.insert(0, 'line', 'UA3')
UA3_df.insert(2, 'frequency', 0.0)
UA3_df.insert(3, 'gene_product', '')
UA3_df.insert(4, 'gene_position', '')
UA3_df.insert(5, 'reject', '')

In [36]:
for row in UA3_df.itertuples():
    #check each column
    col_index = 6
    while col_index < 50:
        #1. polymorphism frequencies
        if re.match('frequency=', str(UA3_df.loc[row[0], col_index])):
            UA3_df.loc[row[0], 'frequency'] = re.sub('frequency=', '', str(UA3_df.loc[row[0], col_index]))
        #2. gene products
        elif re.match('gene_product=', str(UA3_df.loc[row[0], col_index])):
            UA3_df.loc[row[0], 'gene_product'] = re.sub('gene_product=', '', str(UA3_df.loc[row[0], col_index]))
        #3. polymorphism rejection reasons
        elif re.match('reject=', str(UA3_df.loc[row[0], col_index])):
            UA3_df.loc[row[0], 'reject'] = re.sub('reject=', '', str(UA3_df.loc[row[0], col_index]))
        #4. gene annotations
        elif re.match('gene_position=', str(UA3_df.loc[row[0], col_index])):
            UA3_df.loc[row[0], 'gene_position'] = re.sub('gene_position=', '', str(UA3_df.loc[row[0], col_index]))
        col_index += 1
    #set frequencies type to float; non-numeric or missing values become 0.0
    #(a leading-digit regex would also zero out values written as e.g. '0.25')
    try:
        UA3_df.loc[row[0], 'frequency'] = float(UA3_df.loc[row[0], 'frequency'])
    except ValueError:
        UA3_df.loc[row[0], 'frequency'] = 0.0
    #set positions (col 4) type to int
    UA3_df.loc[row[0], 4] = int(UA3_df.loc[row[0], 4])
    #set reject col to 'NA' for rows with no evidence IDs (col 2 == '.') and no reject reason given.
    if UA3_df.loc[row[0], 'reject'] == '' and UA3_df.loc[row[0], 2] == '.':
        UA3_df.loc[row[0], 'reject'] = 'NA'
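
The column-by-column regex scan above could also be written as a single pass over the fields of one row. The sketch below uses only the standard library and assumes the optional fields are tab-separated `key=value` strings, as in the loop above. The helper names `parse_gd_fields` and `to_frequency`, the sample line, and its `PLACEHOLDER` reject value are hypothetical.

```python
WANTED = ("frequency", "gene_product", "gene_position", "reject")


def parse_gd_fields(fields, wanted=WANTED):
    """Collect selected key=value entries from the fields of one row."""
    out = {key: "" for key in wanted}
    for field in fields:
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = str(field).partition("=")
        if sep and key in out:
            out[key] = value
    return out


def to_frequency(text):
    """Convert a frequency string to float, falling back to 0.0."""
    try:
        return float(text)
    except ValueError:
        return 0.0


# Made-up example line; the reject value is a placeholder.
line = "RA\t112\t.\tNC_002937\t716874\t0\tA\tG\tfrequency=2.5e-01\treject=PLACEHOLDER"
parsed = parse_gd_fields(line.split("\t"))
print(parsed["reject"], to_frequency(parsed["frequency"]))
```

Using `float()` with a fallback handles plain decimals, scientific notation, and empty strings alike, rather than depending on which digit the value starts with.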

In [50]:
UA3_df.rename(columns = {0: 'entry_type', 1: 'item_id', 2: 'evidence_ids', 3: 'ref_genome', 4:'position'}, inplace=True)
ua3df_sub = UA3_df[['line', 'generation', 'frequency', 'gene_product', 'gene_position', 'reject', 'entry_type', 'item_id', 'evidence_ids', 'ref_genome', 'position']].copy()
ua3df_sub.to_csv('/Users/ymseah/Documents/ua3.csv', index=False)

Unnamed: 0,line,generation,frequency,gene_product,gene_position,reject,entry_type,item_id,evidence_ids,ref_genome,position
0,UA3,100,1,hypothetical protein/hypothetical protein,intergenic (-125/+57),,INS,1,109,NC_002937,42867
1,UA3,100,1,oligopeptide/dipeptide ABC transporter peripla...,intergenic (-395/-38),,INS,2,110,NC_002937,211389
2,UA3,100,1,"ISDvu4, transposase, truncation/glyceraldehyde...",intergenic (-61/+148),,MOB,3,14941495,NC_002937,629936
3,UA3,100,1,methyl-accepting chemotaxis protein/precorrin-...,intergenic (+16/+194),,SNP,4,112,NC_002937,716874
4,UA3,100,1,histidinol dehydrogenase/hypothetical protein,intergenic (-96/+34),,DEL,5,113,NC_002937,882512
5,UA3,100,1,hypothetical protein/hypothetical protein,intergenic (+125/+43),,INS,6,116,NC_002937,1313341
6,UA3,100,1,geranylgeranyl diphosphate synthase,122,,SNP,7,117,NC_002937,1426830
7,UA3,100,1,hypothetical protein,316,,SNP,8,118,NC_002937,1773256
8,UA3,100,1,hypothetical protein,coding (227/486 nt),,DEL,9,119,NC_002937,1773345
9,UA3,100,1,lipoprotein,304,,SNP,10,120,NC_002937,1913197
