# Genomic Sequencing - Week 3

### How do we sequence antibiotics (Part I)?

In August 1928, Scottish microbiologist Alexander Fleming noticed that one of his cultures of infection-causing *Staphylococcus* bacteria was contaminated with *Penicillium* fungus. The colony of *Staphylococcus* surrounding the fungus had been destroyed. He named the bacteria-killing substance **penicillin**. In practice, the discovery would not matter for more than a decade, because it was so difficult to isolate the **antibiotic** agent (the compound that actually killed the bacteria) from the fungus. In 1942, Russian biologists Gregory Gause and Maria Brazhnikova noticed that the *Bacillus brevis* bacterium killed the pathogenic bacterium *Staphylococcus aureus*. They successfully isolated the antibiotic compound from *Bacillus brevis* and named it Gramicidin Soviet. American scientists eventually found a moldy cantaloupe in Illinois that allowed them to produce 2 million doses of penicillin in time for the invasion of Normandy in 1944.

The next difficult problem was to elucidate the chemical composition of these antibiotic compounds. Gause was unable to do this for Gramicidin Soviet; however, English biochemist Richard Synge was able to demonstrate that such antibiotics are short amino acid sequences (mini-proteins) called **peptides**. Since then, the mass production of antibiotics has created an evolutionary arms race between scientists and pharma companies on one side and pathogenic bacteria on the other: new antibiotic drugs are developed while bacteria acquire resistance against them. Modern medicine has largely prevailed over the last six decades; however, in the last decade we have witnessed an alarming rise in antibiotic-resistant bacterial infections that cannot be cured by even the most powerful antibiotics. A mutated strain of the bacterium Gause studied in 1942, **Methicillin-resistant *Staphylococcus aureus* (MRSA)**, is now a leading cause of death from infections.

Modern research has attempted to accelerate the development of new antibiotics. A difficult problem in antibiotic research is **sequencing** newly discovered antibiotics, that is, determining the order of amino acids that make up the antibiotic peptide.

#### How peptides are encoded by the genome

Consider **Tyrocidine B1**, one of the many antibiotics produced by *Bacillus brevis*. Its sequence of 10 amino acids is shown below. We want to figure out how *Bacillus brevis* made this antibiotic.
<img src="img/Capture.PNG">

The **Central Dogma of Molecular Biology** states that DNA makes RNA makes protein.
1. A gene from a genome is first **transcribed** into a strand of RNA composed of ribonucleotides ((A)denine, (G)uanine, (C)ytosine, and (U)racil).
2. The RNA is then **translated** into the amino acid sequence of a protein.
<img src="img/central_dogma.jpg">

From a computational perspective, DNA to RNA transcription is straightforward: the RNA string is the DNA string with every T replaced by U. Translation occurs by partitioning the RNA into non-overlapping 3-mers called **codons**. Each codon corresponds to one of 20 amino acids, or to a stop signal, according to the **genetic code**.
<img src="img/genetic_code.png" width="400">
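
The transcription and translation rules described above can also be sketched without an external codon file. This is a minimal sketch, assuming the standard genetic code with codons enumerated in the conventional U, C, A, G order; the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, `transcribe`, and `translate` are illustrative and are not part of the notebook's own code.

```python
from itertools import product

# Assumption: standard genetic code, codons enumerated with bases in U, C, A, G order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna):
    """Return the RNA string for a DNA string (every T becomes U)."""
    return dna.replace("T", "U")

def translate(rna):
    """Translate complete codons from position 0 until the first stop codon."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate(transcribe("ATGGCC")))  # MA
```

Building the table from a 64-character string keeps the example self-contained; reading the mapping from a data file works equally well.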

In [1]:
# Code reused from previous weeks
def revComp(pattern):
    # Map each nucleotide to its complement, then read the strand in reverse
    complement = {'A': 'T',
                  'T': 'A',
                  'G': 'C',
                  'C': 'G'}
    return ''.join(complement[x] for x in pattern[::-1])

In [2]:
def dnaToRna(dna):
    '''
    Transcribe DNA string to RNA string
    
    Args:
        dna (string): DNA string to be transcribed
    
    Returns:
        string: Transcribed RNA string
    '''
    return dna.replace('T', 'U')

def proteinTrans(rna, stop=False):
    '''
    Protein Translation
    Translate an RNA string into an amino acid string
    
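    Args:
        rna (string): RNA string to be translated.
        stop (bool, optional): Defaults to False. When True, translation halts at the first stop codon.
            When False, stop codons appear as '*' in the output.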
        
    Returns:
        string: Translated amino acid string
    '''
    with open("Data/rnaToAmino.txt") as inFile:
        data = inFile.readlines()
        data = [x.strip() for x in data]
    transMap = {}
    for codons in data:
        tCodon, amino = codons.split(" ")
        if amino == 'Stop':
            transMap[tCodon] = '*'
        else:
            transMap[tCodon] = amino
    prot = ''
    for i in range(0, len(rna), 3):
        codon = rna[i:i+3]
        if len(codon) < 3:
            # Ignore a trailing partial codon (avoids a KeyError when stop=True)
            break
        if stop and transMap[codon] == '*':
            break
        prot += transMap[codon]
    return prot

In [3]:
# test/run proteinTrans function

# dnaToRna("ATGGCCATGGCCCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA")
proteinTrans("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA", stop=True)

# with open("Data/dataset_96_4.txt") as inFile:
#     data = inFile.readlines()
# rna = data[0].strip()
# proteinTrans(rna)

'MAMAPRTEINSTRING'

Looking at the genetic code, thousands of different RNA strings can encode the same peptide, because most amino acids are encoded by more than one codon. For Tyrocidine B1, there are 6,144 different DNA strings of length 30 that encode the peptide. So how would we find where in the *Bacillus brevis* genome the sequence encoding the antibiotic peptide lies? There are 3 ways to divide a DNA string into codons for translation, one starting at each of the first three positions of the string. These different ways of dividing the DNA string are called **reading frames**. Since DNA is double-stranded, a genome has 6 reading frames (3 on each strand).
<img src="img/reading_frames.png" width="400">
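
The count of 6,144 and the six reading frames can be checked with a short sketch. It assumes the standard genetic code (the number of codons per amino acid is read off the same 64-character layout used in most codon tables); the names `CODONS_PER_AA`, `count_encodings`, `reverse_complement`, and `six_reading_frames` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter

# Assumption: standard genetic code; each letter appears once per codon that encodes it.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODONS_PER_AA = Counter(AMINO_ACIDS)

def count_encodings(peptide):
    """Number of distinct DNA (or RNA) strings that translate into the peptide."""
    total = 1
    for aa in peptide:
        total *= CODONS_PER_AA[aa]
    return total

def reverse_complement(dna):
    """Reverse the strand and swap A<->T, C<->G."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

def six_reading_frames(dna):
    """Three offsets on the given strand, then three on its reverse complement."""
    return [strand[offset:]
            for strand in (dna, reverse_complement(dna))
            for offset in range(3)]

print(count_encodings("VKLFPWFNQY"))  # 6144
print(six_reading_frames("ATGGCC"))
```

The product works because codon choices at different positions are independent: V and P each have 4 codons, L has 6, W has 1, and the remaining six residues have 2 each.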

In [4]:
def peptideEncode(dna, peptide):
    '''
    Find substrings of a genome encoding a given amino acid sequence.
    Note this is crude as it does not take into account reading frames or stop codons
    
    Args:
        dna (string): DNA string to search through.
        peptide (string): Target amino acid sequence.
        
    Returns:
        list[string]: Collection of all DNA substrings that encode the target amino acid sequence.
    '''
    res = []
    for i in range(len(dna) - (3 * len(peptide)) + 1):
        frame = dna[i:i+(3*len(peptide))]
        if proteinTrans(dnaToRna(frame)) == peptide or proteinTrans(dnaToRna(revComp(frame))) == peptide:
            res.append(frame)
    return res

In [5]:
# test/run peptideEncode function

peptideEncode("ATGGCCATGGCCCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA", "MA")

# with open("Data/dataset_96_7.txt") as inFile:
#     data = inFile.readlines()

# dna = data[0].strip()
# pep = data[1].strip()
# print(*peptideEncode(dna, pep), sep="\n")

['ATGGCC', 'GGCCAT', 'ATGGCC']

Using this function, solve the Peptide Encoding Problem for *Bacillus brevis* and Tyrocidine B1 (Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr) [VKLFPWFNQY]. How many starting positions in *Bacillus brevis* encode this peptide?

In [None]:
%%time
with open("Data/Bacillus_brevis.txt") as inFile:
    data = inFile.readlines()

dna = ''.join([x.strip() for x in data])
print(len(peptideEncode(dna, "VKLFPWFNQY")))

No 30-mer in the *Bacillus brevis* genome encodes Tyrocidine B1 as written. One complication is that, unbeknownst to Gause or Synge, tyrocidines and gramicidins are actually cyclic peptides, so a peptide of length 10 has 10 different linear representations, and any one of them could be the form encoded in the genome.
<img src="img/tyrocidine_linear_representations.png" width="550">
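
The linear representations of a cyclic peptide are simply its rotations. A minimal sketch follows; `linear_representations` is an illustrative name, not a function defined elsewhere in the notebook.

```python
def linear_representations(cyclic_peptide):
    """Return every rotation of the peptide, one per possible starting residue."""
    n = len(cyclic_peptide)
    return [cyclic_peptide[i:] + cyclic_peptide[:i] for i in range(n)]

for representation in linear_representations("VKLFPWFNQY"):
    print(representation)
```

Each of these strings could be passed to `peptideEncode` in turn to search the genome for every representation.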

This does not make much sense, because according to the Central Dogma every peptide must be encoded by the genome. To test this, in 1963 Edward Tatum inhibited the **ribosome**, the molecular machine responsible for protein translation, within *Bacillus brevis*. All protein production stopped, except for tyrocidines and gramicidins, so there must be some non-ribosomal mode of production. In 1969, Fritz Lipmann demonstrated that tyrocidines and gramicidins are **non-ribosomal peptides (NRPs)**, synthesized not by the ribosome but by an enzyme called **NRP synthetase**. This enzyme assembles antibiotic peptides one amino acid at a time, without any RNA or genetic code. NRPs are not limited to antibiotics; many of them are anti-tumor agents and immunosuppressors.

So, back to the original problem: sequencing these peptide antibiotics is difficult because we cannot infer them from the genome and because they are cyclic. We will describe the standard process for sequencing linear peptides, which is **not** directly applicable to NRP analysis (because NRPs are cyclic).

The primary equipment used is a **mass spectrometer**, which shatters molecules into pieces and then weighs the fragments. The mass spec measures mass in **daltons (Da)**, where 1 Da is approximately equal to the mass of a single nuclear particle (proton or neutron). We will approximate the mass of a molecule by simply adding the number of protons and neutrons found in the molecule's atoms, giving us the molecule's **integer mass**.
* **Note** 1 Da is not exactly equal to the mass of a proton/neutron, and we may need to account for different naturally occurring isotopes. Amino acids typically have non-integer masses (for example, G = 57.02 Da). However, for simplicity we will consider the integer mass only.
<img src="img/integer_mass_table.png">
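
The integer mass of a peptide is the sum of the integer masses of its amino acids. The sketch below hard-codes the commonly used integer mass table as an assumption (the notebook itself reads the values from `Data/integer_mass_table.txt`, whose contents are not shown here); `INTEGER_MASS` and `integer_mass` are illustrative names.

```python
# Assumption: the standard integer mass table (in Da) for the 20 amino acids.
INTEGER_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99,
    'T': 101, 'C': 103, 'I': 113, 'L': 113, 'N': 114,
    'D': 115, 'K': 128, 'Q': 128, 'E': 129, 'M': 131,
    'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def integer_mass(peptide):
    """Sum the integer masses of the peptide's amino acids."""
    return sum(INTEGER_MASS[aa] for aa in peptide)

print(integer_mass("VKLFPWFNQY"))  # 1322
```

Note that I/L (113) and K/Q (128) share an integer mass, so the table contains only 18 distinct values, and a mass spectrum cannot tell those pairs apart.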

Within the mass spec, one molecule of Tyrocidine B1 (VKLFPWFNQY) could break into many different fragments
* Example: LFP & WFNQYVK or PWFN & QYVKLF

The collection of all the fragment masses generated by the mass spec is called an **experimental spectrum**.
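
To make the fragmentation concrete, the sketch below enumerates the ways a cyclic peptide can break at two bonds into a fragment and its complement. It is only an illustration of the idea under the assumption that each molecule breaks into exactly two pieces; `fragment_pairs` is a hypothetical helper.

```python
def fragment_pairs(cyclic_peptide):
    """List (fragment, remainder) pairs obtained by cutting the ring at two bonds."""
    n = len(cyclic_peptide)
    doubled = cyclic_peptide * 2  # lets slices wrap around the end of the ring
    pairs = []
    for start in range(n):
        for length in range(1, n):
            fragment = doubled[start:start + length]
            remainder = doubled[start + length:start + n]
            pairs.append((fragment, remainder))
    return pairs

pairs = fragment_pairs("VKLFPWFNQY")
print(len(pairs))                    # 90
print(("LFP", "WFNQYVK") in pairs)   # True
```

A cyclic peptide of length n therefore contributes n(n - 1) subpeptide masses to its theoretical spectrum, plus 0 and the mass of the whole peptide.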

In [1]:
def calcMass(peptide, total=True):
    '''
    Calculate Mass of a peptide
    
    Args:
        peptide (string): Peptide string to calculate mass.
        total (bool, optional): Defaults to True. When true returns total mass
            When false returns list of constituent amino acid masses.
            
    Returns:
        int or list(int): Total mass of peptide or collection of masses of each amino acid in peptide.
    '''
    intMass = {}
    with open("Data/integer_mass_table.txt") as inFile:
        for key, mass in [x.strip().split(" ") for x in inFile.readlines()]:
            intMass[key] = int(mass)
    if total:
        return sum([intMass[x] for x in peptide])
    else:
        return [intMass[x] for x in peptide]

def linearSpectrum(peptide):
    '''
    Generate the theoretical mass spectrum of a given peptide, assuming it is linear
    
    Args:
        peptide (string): Peptide string to calculate the theoretical mass spectrum of.
        
    Returns:
        list[int]: Sorted theoretical mass spectrum of a peptide.
    '''
    intMass = {}
    with open("Data/integer_mass_table.txt") as inFile:
        for key, mass in [x.strip().split(" ") for x in inFile.readlines()]:
            intMass[key] = int(mass)
            
    prefMass = [0]
    for i in range(len(peptide)):
        prefMass.append(prefMass[i] + intMass[peptide[i]])
    linSpec = [0]
    for i in range(len(prefMass)):
        for j in range(i + 1, len(prefMass)):
            linSpec.append(prefMass[j] - prefMass[i])
    return sorted(linSpec)

In [2]:
# test/run linearSpectrum function

linearSpectrum("NQEL")

# with open("Data/dataset_4912_2.txt") as inFile:
#     data = inFile.readlines()
# print(*linearSpectrum(data[0].strip()), sep= " ")

[0, 113, 114, 128, 129, 242, 242, 257, 370, 371, 484]

In [3]:
def cyclicSpectrum(peptide, prnt=False, ret=True):
    '''
    Generate Theoretical Mass Spectrum for a given cyclical peptide
    
    Args:
        peptide (string): Peptide string to calculate the theoretical mass spectrum of.
        prnt (bool, optional): Defaults to False. If True prints space-separated, sorted mass spectrum.
        ret (bool, optional): Defaults to True. If True returns sorted list of mass spectrum.
        
    Returns:
        list[int], optional: Sorted list of theoretical mass spectrum of the given peptide.
    '''
    intMass = {}
    with open("Data/integer_mass_table.txt") as inFile:
        for key, mass in [x.strip().split(" ") for x in inFile.readlines()]:
            intMass[key] = int(mass)
            
    prefMass = [0]
    for i in range(len(peptide)):
        prefMass.append(prefMass[i] + intMass[peptide[i]])
    pepMass = prefMass[len(peptide)]
    cycSpec = [0]
    for i in range(len(prefMass)):
        for j in range(i+1, len(prefMass)):
            cycSpec.append(prefMass[j] - prefMass[i])
            if i > 0 and j < len(peptide):
                cycSpec.append(pepMass - cycSpec[-1])
    cycSpec = sorted(cycSpec)
    if prnt:
        print(*cycSpec, sep=" ")
    if ret:
        return cycSpec

In [4]:
# test/run cyclicSpectrum function

cyclicSpectrum("LEQN", prnt=True, ret=False)

# with open("Data/dataset_98_4.txt") as inFile:
#     data = inFile.readlines()
# cyclicSpectrum(data[0].strip(),prnt=True, ret=False)

0 113 114 128 129 227 242 242 257 355 356 370 371 484


In [10]:
aaMass = [57, 71, 87, 97, 99, 101, 103, 113, 114, 115, 128, 129, 131, 137, 147, 156, 163, 186]
masses = {}
def countPepFromMass(m):
    '''
    Given a peptide mass calculate how many possible peptides could have that mass
    
    Args:
        m (int): Integer mass of a peptide
        
    Returns:
        int: Number of possible linear peptides that have mass equal to m 
    '''
    if m < 0:
        return 0
    elif m == 0:
        return 1
    else:
        count = 0
        for aa in aaMass:
            t = m - aa
            if t in masses:
                count += masses[t]
            else:
                temp = countPepFromMass(t)
                masses[t] = temp
                count += temp
    return count

In [11]:
# test/run countPepFromMass function

# print(countPepFromMass(1024))

with open("Data/dataset_99_2.txt") as inFile:
    n = int(inFile.readlines()[0].strip())

print(f"Number of peptides with mass= {n}: {countPepFromMass(n)}")

Number of peptides with mass= 1306: 33829429664367
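
The recursive `countPepFromMass` relies on a module-level memo dictionary. As an alternative, the same count can be built bottom-up with a table. This is a minimal sketch, not the notebook's method; `AA_MASSES` and `count_peptides_with_mass` are illustrative names, and the 18 masses are the same ones listed in `aaMass`.

```python
# The 18 distinct integer masses of the 20 amino acids (I/L and K/Q coincide).
AA_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114,
             115, 128, 129, 131, 137, 147, 156, 163, 186]

def count_peptides_with_mass(m):
    """Count ordered sequences of amino acid masses that sum to m."""
    if m < 0:
        return 0
    counts = [0] * (m + 1)
    counts[0] = 1  # the empty peptide
    for total in range(1, m + 1):
        # The last residue has some mass aa; the rest must sum to total - aa.
        counts[total] = sum(counts[total - aa] for aa in AA_MASSES if aa <= total)
    return counts[m]

print(count_peptides_with_mass(128))  # 3: one residue of 128, or 57+71, or 71+57
```

The table version avoids shared global state and recursion, at the cost of computing counts for every mass up to `m`.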


In [1]:
def specComp(pep, spectrum):
    '''
    Determines if a given peptide has a linear spectrum consistent with a given spectrum.
    A peptide is consistent with a spectrum if every theoretical mass from the linear spectrum of the peptide
    is contained within the spectrum (counting repeated masses).
    
    Args:
        pep (string): Peptide to generate linear spectrum and compare.
        spectrum (list[int]): Spectrum of interest.
        
    Returns:
        bool: True if the peptide spectrum is consistent with the spectrum of interest, False otherwise.
    '''
    linSpec = linearSpectrum(pep)
    test = spectrum[:]
    for mass in linSpec[:]:
        if mass in test:
            linSpec.remove(mass)
            test.remove(mass)
    return len(linSpec) == 0

def cycloPepSeq(spectrum, prnt=False, ret=True):
    '''
    Cyclopeptide Sequencing
    Find all peptide combinations that result in a theoretical cyclical spectrum that is the same as the given spectrum.
    
    Args:
        spectrum (list[int]): Spectrum to find matching peptides for.
        prnt (bool, optional): Defaults to False. When true print formatted list of all peptide mass combinations
        ret (bool, optional): Defaults to True. When true returns list of all possible peptides with matching spectrum.
        
    Returns:
        list[string]: Collection of all possible peptides that have a theoretical cyclical spectrum
            equal to the given spectrum.
     '''
    candPep = {""}
    final = []
    
    aaToMass = {}
    MassToAA = {}
    with open("Data/integer_mass_table.txt") as inFile:
        for key, mass in [x.strip().split(" ") for x in inFile.readlines()]:
            aaToMass[key] = int(mass)
            MassToAA[int(mass)] = key
    aa = []
    for mass in spectrum:
        if mass <= 186:
            if mass in MassToAA:
                aa.append(MassToAA[mass])
        else:
            break
            
    while candPep:
        # Branch: extend every candidate by each amino acid whose mass appears in the spectrum
        temp = set()
        for cand in candPep:
            for a in aa:
                temp.add(cand + a)
        candPep = temp
        # Bound: record full matches and keep only candidates that may still grow into one
        temp = set()
        for pep in candPep:
            if calcMass(pep) in spectrum:
                if cyclicSpectrum(pep) == spectrum and pep not in final:
                    final.append(pep)
                else:
                    temp.add(pep)
            elif calcMass(pep) <= max(spectrum) and specComp(pep, spectrum):
                temp.add(pep)
        candPep = temp
    if prnt:
        st = ''
        for pep in final:
            for i, mass in enumerate(calcMass(pep, total=False)):
                if i != 0:
                    st += '-'
                st += str(mass)
            st += ' '
        print(st)
    if ret:
        return final

In [6]:
%%time
# test/run cycloPepSeq function

# spec = [0, 113, 128, 186, 241, 299, 314, 427]
# cycloPepSeq(spec, prnt=True, ret=False)

with open("Data/dataset_100_6.txt") as inFile:
    data = inFile.readlines()

data = data[0].strip().split(' ')
data = [int(x) for x in data]
cycloPepSeq(data, prnt=True, ret=False)

128-147-137-99-101-115-129-186 101-115-129-186-128-147-137-99 115-101-99-137-147-128-186-129 147-128-186-129-115-101-99-137 129-115-101-99-137-147-128-186 137-147-128-186-129-115-101-99 129-186-128-147-137-99-101-115 101-99-137-147-128-186-129-115 186-129-115-101-99-137-147-128 137-99-101-115-129-186-128-147 99-101-115-129-186-128-147-137 99-137-147-128-186-129-115-101 128-186-129-115-101-99-137-147 186-128-147-137-99-101-115-129 115-129-186-128-147-137-99-101 147-137-99-101-115-129-186-128 
Wall time: 4.27 s


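Now that we can generate candidate peptides from a given spectrum, can we recover Tyrocidine B1 (VKLFPWFNQY) from its spectrum? **YES**, up to rotation and reversal of the cyclic peptide, and with K reported as Q because both have integer mass 128.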

In [10]:
with open("Data/Tyrocidine_B1_theoretical_spectrum.txt") as inFile:
    data = inFile.readlines()
    data = [int(x) for x in data[0].split(' ')]
print(cycloPepSeq(data))

['QNFWPFLQVY', 'PFLQVYQNFW', 'FWPFLQVYQN', 'LQVYQNFWPF', 'WPFLQVYQNF', 'NFWPFLQVYQ', 'QLFPWFNQYV', 'VYQNFWPFLQ', 'FPWFNQYVQL', 'QVYQNFWPFL', 'FNQYVQLFPW', 'YVQLFPWFNQ', 'VQLFPWFNQY', 'QYVQLFPWFN', 'NQYVQLFPWF', 'FLQVYQNFWP', 'YQNFWPFLQV', 'PWFNQYVQLF', 'LFPWFNQYVQ', 'WFNQYVQLFP']
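
The 20 strings in this output are the 10 rotations of the peptide read in each of the two directions, with Q standing in for K. A short sketch can confirm that a reported string and VKLFPWFNQY describe the same cyclic peptide at the level of integer masses. The mass table is assumed to be the standard one, and `same_cyclic_peptide` is a hypothetical helper, not part of the notebook.

```python
# Assumption: the standard integer mass table (in Da) for the 20 amino acids.
INTEGER_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99,
    'T': 101, 'C': 103, 'I': 113, 'L': 113, 'N': 114,
    'D': 115, 'K': 128, 'Q': 128, 'E': 129, 'M': 131,
    'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def same_cyclic_peptide(p, q):
    """True if p and q have equal mass sequences up to rotation and reversal."""
    a = [INTEGER_MASS[x] for x in p]
    b = [INTEGER_MASS[x] for x in q]
    if len(a) != len(b):
        return False
    rotations = [a[i:] + a[:i] for i in range(len(a))]
    return b in rotations or b[::-1] in rotations

print(same_cyclic_peptide("VKLFPWFNQY", "QNFWPFLQVY"))  # True
```

Comparing mass sequences rather than letters is what makes K/Q (and I/L) interchangeable here, which mirrors what the spectrum alone can distinguish.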


###### Code below is to answer quiz questions

In [29]:
# 2
seq = ['CCACGUACUGAAAUUAAC',
       'CCAAGAACAGAUAUCAAU',
       'CCAAGUACAGAGAUUAAC',
       'CCGAGGACCGAAAUCAAC']

for s in seq:
    if proteinTrans(s) == 'PRTEIN':
        print(s)

# 3
with open("Data/rnaToAmino.txt") as inFile:
    data = inFile.readlines()
data = [x.strip() for x in data]
aaDict = {}
for x in data:
    codon, aa = x.split(' ')
    if aa not in aaDict:
        aaDict[aa] = 1
    else:
        aaDict[aa] += 1
res = 1
for aa in 'CYCLIC':
    res *= aaDict[aa]
print(f"\nCYCLIC combos: {res}\n")

# 4
print(calcMass('W'))
print()

# 5
peps = ['MLAT',
        'ALTM',
        'TAIM',
        'MTAI',
        'TMLA',
        'MAIT']

for pep in peps:
    if cyclicSpectrum(pep) == [0, 71, 101, 113, 131, 184, 202, 214, 232, 285, 303, 315, 345, 416]:
        print(pep)
        
# 6
print("")   
peps = ['QCV',
        'AVQ',
        'TVQ',
        'TCE',
        'ETC',
        'TCQ']

for pep in peps:
    if specComp(pep, [0, 71, 99, 101, 103, 128, 129, 199, 200, 204, 227, 230, 231, 298, 303, 328, 330, 332, 333]):
        print(pep)

CCACGUACUGAAAUUAAC
CCGAGGACCGAAAUCAAC

CYCLIC combos: 288

186

ALTM
MAIT

TVQ
ETC
TCQ
