## Protein Structure Preparation

#### `Author: Simon Hackl`
#### `Project: The OMPeome of Treponema pallidum`
#### `Contact: simon.hackl@uni-tuebingen.de`
#### `Date: 15.02.2022`

This _Python_ notebook documents and walks through the steps of protein structure preparation.

### 1. Collect Protein Structures
For this project, a set of $36$ OMP-coding genes of interest was considered. For $24$ of them, a structure had already been resolved by Hawley _et al._ (https://doi.org/10.1128/JB.00082-21). The respective structures were downloaded from the publication and stored in the `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/` directory.

For the remaining $12$ OMPs, no structure was known; the following procedure was therefore applied to predict their structures.

### 2. Extraction and Translation of Gene Sequences

The starting point for the protein structure prediction was the extraction of the nucleotide sequences of the respective genes. The chosen reference genome sequence and annotation were downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_021490.2 and stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ReferenceGenome/NC_021490.2.fasta` and `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ReferenceGenome/NC_021490.2.gff3`. A list with the names of the OMP-coding genes of interest with no resolved structure was created at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureGeneNames.txt`.

The locus and strand orientation of each gene in the chosen reference were determined by matching its TPXXXX gene name with the value of the old_locus_tag attribute (format: TPANIC\_XXXX) in the reference genome annotation.
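The name matching described above can be illustrated with a minimal sketch. The helper names `parse_gff3_attributes` and `to_old_locus_tag`, the gene name `TP0123`, and the example annotation line with its coordinates are made up for illustration and are not taken from the project files.

```python
def parse_gff3_attributes(column9):
    """Split the ninth GFF3 column ("key=value;key=value") into a dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def to_old_locus_tag(gene_name, prefix="TPANIC_"):
    """Map a gene name such as "TP0123" to the old locus tag "TPANIC_0123"."""
    if not gene_name.startswith("TP"):
        raise ValueError("expected a TPXXXX gene name, got " + repr(gene_name))
    return prefix + gene_name[2:]


# Hypothetical annotation line; the coordinates and IDs are not real.
example_line = "NC_021490.2\tRefSeq\tgene\t100\t400\t.\t-\t.\tID=gene-X;old_locus_tag=TPANIC_0123"
columns = example_line.split("\t")
attributes = parse_gff3_attributes(columns[8])

# The gene is found when its derived tag equals the annotated old_locus_tag.
print(attributes["old_locus_tag"] == to_old_locus_tag("TP0123"), columns[6])
```

Columns 4, 5 and 7 of the matched line (indices 3, 4 and 6 after splitting) then give the locus start, locus end and strand orientation.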

Next, the nucleotide sequences of the genes were extracted from the reference genome according to this information and translated into their respective amino-acid sequences using the `Expasy Translate` tool (https://web.expasy.org/translate/). The obtained raw sequences were stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureSequences.fasta`. The sequences were then truncated to their maximal-length open reading frame (as indicated by `Expasy Translate`), i.e. if a sequence did not start with an M residue, all residues before the first M were removed. The suffix \_Morf was appended to the headers of the affected sequences. The resulting sequences were stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureSequences_MaxORF.fasta`.
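The truncation rule stated above (drop everything before the first M and mark the header) could be sketched as follows. This is only an illustration of the rule, not the procedure that produced the stored file. The function name `truncate_to_first_met` and the example sequence are hypothetical, and leaving sequences without any M untouched is an assumption.

```python
def truncate_to_first_met(header, sequence, suffix="_Morf"):
    """Remove all residues before the first M and tag the header if anything was removed."""
    first_met = sequence.find("M")
    if first_met <= 0:
        # 0: the sequence already starts with M; -1: no M at all (assumed: leave unchanged).
        return header, sequence
    return header + suffix, sequence[first_met:]


# Hypothetical header and sequence for illustration only.
print(truncate_to_first_met("TP0123", "LKVMASTM"))
# → ('TP0123_Morf', 'MASTM')
```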

In [None]:
import os
from subprocess import Popen, PIPE

# Parse information from the genome annotation and store it in a dictionary accessible by the old_locus_tag attribute.
annotatedGenes = { }
with open( "./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ReferenceGenome/NC_021490.2.gff3", "r" ) as annotatedGenesFile :
    line = annotatedGenesFile.readline( )
    while line :
        if line.startswith( "#" ) : # Skip comment and directive lines
            line = annotatedGenesFile.readline( )
            continue
        else :
            splitLine = line.split( "\t" )
            if splitLine[ 0 ] == "NC_021490.2" and splitLine[ 1 ] == "RefSeq" and splitLine[ 2 ] == "gene" :
                attributes = splitLine[ 8 ].strip( ).split( ";" )
                for attribute in attributes :
                    key, value = attribute.split( "=", 1 ) # Split at the first "=" only
                    if key == "old_locus_tag" :
                        annotatedGenes[ value ] = {
                            "locusStart": int( splitLine[ 3 ] ),
                            "locusEnd": int( splitLine[ 4 ] ),
                            "strandOrientation": splitLine[ 6 ]
                        }
        line = annotatedGenesFile.readline( )
        
# Parse the reference genome sequence.
referenceGenome = ""
with open( "./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ReferenceGenome/NC_021490.2.fasta", "r" ) as referenceGenomeFile :
    line = referenceGenomeFile.readline( ) # Skip fasta header
    line = referenceGenomeFile.readline( )
    while line :
        referenceGenome += line.strip( )
        line = referenceGenomeFile.readline( )
        
# For each of the OMP coding genes of interest the respective gene sequence is extracted and translated by accessing
# the Expasy Translate tool.
with open( "./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureGeneNames.txt", "r" ) as OMPGenesFile :
    translatedGenes = { }
    line = OMPGenesFile.readline( )
    while line :
        geneName = line.strip( )
        print( "Processing Gene: " + geneName )
        geneOldLocusTag = "TPANIC_" + geneName[ 2: ]
        geneStart = annotatedGenes[ geneOldLocusTag ][ "locusStart" ]
        geneEnd = annotatedGenes[ geneOldLocusTag ][ "locusEnd" ]
        geneOrientation = annotatedGenes[ geneOldLocusTag ][ "strandOrientation" ]
        geneSequence = referenceGenome[ geneStart - 1 : geneEnd ]
        print( "- %Ns: " + str( 100 * geneSequence.count( 'N' ) / len( geneSequence ) ) ) # Report as a percentage
        print( "- Orientation: " + geneOrientation )
        process = Popen(
            [
                'wsl', 
                'curl',
                '-s',
                '-d',
                "'dna_sequence='" + geneSequence + "'&output_format=fasta'",
                'https://web.expasy.org/cgi-bin/translate/dna2aa.cgi'
                #'>',
                #'./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/OMPSequences/' + geneName + '.fasta'
            ],
            stdout=PIPE, stderr=PIPE
        )
        stdout, stderr = process.communicate( )
        response = stdout.decode( encoding = 'UTF-8' ).split( "\n" )
        translatedSequence = ""
        entryCounter = 0
        for entry in response :
            if entry.startswith( ">" ) :
                entryCounter += 1
                continue
            if geneOrientation == "-" and entryCounter == 1 :
                translatedSequence += entry
            elif geneOrientation == "-" and entryCounter > 1 :
                break
            elif geneOrientation == "+" and entryCounter == 4 :
                translatedSequence += entry
            elif geneOrientation == "+" and entryCounter > 4 :
                break
        print( "- Translated Sequence: \n" + translatedSequence.strip( "-" ) )
        translatedGenes[ geneName ] = translatedSequence.strip( "-" )
        line = OMPGenesFile.readline( )
        print( )
with open( "./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureSequences.fasta", "w+" ) as OMPSequencesFile :
    for key, value in translatedGenes.items( ) :
        OMPSequencesFile.write( ">" + key + "\n" )
        sequenceChunks = [ value[ i : i + 80 ] for i in range( 0, len( value ), 80 ) ]
        for sequenceChunk in sequenceChunks :
            OMPSequencesFile.write( sequenceChunk + "\n" )
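The cell above delegates translation to the Expasy web service. As an offline cross-check, and explicitly not the method used in this notebook, the same step could be sketched locally with the standard genetic code. The names `reverse_complement` and `translate_gene` are hypothetical. Stop codons are written as `*` here, whereas the cell above strips `-` characters from the Expasy response. Alternative bacterial start codons such as GTG or TTG are translated literally (V or L) in this sketch.

```python
from itertools import product

# Standard genetic code in NCBI ordering (bases T, C, A, G at each codon position).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_gene(dna, strand):
    """Translate frame 1 of a gene extracted from the forward reference strand."""
    coding = reverse_complement(dna) if strand == "-" else dna.upper()
    protein = []
    for i in range(0, len(coding) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(coding[i:i + 3], "X"))
    return "".join(protein).rstrip("*")


print(translate_gene("ATGAAATAA", "+"))  # → MK
print(translate_gene("TTATTTCAT", "-"))  # → MK
```

Comparing such a local translation with the stored Expasy result is one way to confirm that the frame matching the gene's strand orientation was selected.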

### 3. Truncation of Signal Peptides

In the next step, possible signal peptides were removed from the translated gene sequences before structure prediction. To do so, the file `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureSequences_MaxORF.fasta` was used as input for `SignalP 5.0` (https://services.healthtech.dtu.dk/service.php?SignalP-5.0) with the following parameters:

    - Output format: Long output
    - Organism group: Gram-negative
    
The results of the `SignalP 5.0` signal peptide prediction were stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/SignalP_Results.zip`. The predicted signal peptides were removed manually from the amino-acid sequences, and the suffix \_trc was added to the fasta header of every sequence from which a signal peptide was removed. The resulting sequences were stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/TPOMPNoStructureSequences_MaxORF_SignalP.fasta`.
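The removal was done by hand in this project. A scripted equivalent might look like the sketch below. The function name `remove_signal_peptide`, the parameter `cleavage_after` (the number of N-terminal residues belonging to the predicted signal peptide) and the example values are assumptions for illustration. Reading the cleavage position from the `SignalP 5.0` output is deliberately left out.

```python
def remove_signal_peptide(header, sequence, cleavage_after, suffix="_trc"):
    """Cut a predicted signal peptide from the N-terminus and tag the header.

    cleavage_after: length of the predicted signal peptide; 0 or None means
    that no signal peptide was predicted and the entry is returned unchanged.
    """
    if not cleavage_after:
        return header, sequence
    if not 0 < cleavage_after < len(sequence):
        raise ValueError("cleavage position lies outside the sequence")
    return header + suffix, sequence[cleavage_after:]


# Hypothetical entry: the first 8 residues are treated as the signal peptide.
print(remove_signal_peptide("TP0123_Morf", "MKKLLAAASQDE", 8))
# → ('TP0123_Morf_trc', 'SQDE')
```

For example, a cleavage site predicted between residues 21 and 22 would correspond to `cleavage_after=21`.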

### 4. Structure Prediction Using Robetta/RoseTTAFold 

The resulting protein sequences were finally used as input for structure prediction with `RoseTTAFold` via the `Robetta` web server (https://robetta.bakerlab.org/). The predicted structures were stored at `./R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/`. The file names include the following tags:
    
    - .morf if residues were truncated from the start of the sequence to obtain the maximal open reading frame.
    - .trc if a signal peptide sequence was truncated.
    - .pred if the structure was predicted herein.
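
The naming convention above can be decoded programmatically, for instance when later steps need to distinguish predicted from downloaded structures. The following is a small sketch. The function name `describe_structure_file`, the example file name and its `.pdb` extension are hypothetical and not confirmed by the project files.

```python
# Meaning of each tag, as listed above.
TAGS = {
    "morf": "truncated to the maximal open reading frame",
    "trc": "signal peptide removed",
    "pred": "structure predicted in this notebook",
}


def describe_structure_file(file_name):
    """Return the meanings of all known tags found in a dot-separated file name."""
    parts = file_name.split(".")
    return [meaning for tag, meaning in TAGS.items() if tag in parts]


# Hypothetical file name for illustration only.
for meaning in describe_structure_file("TP0123.morf.trc.pred.pdb"):
    print(meaning)
```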