**mzTab**

_General_

mzTab is a lightweight, tab-delimited file format for proteomics data. Its primary target audience is researchers outside of proteomics, so it should be easy to parse and contain only the minimal information required to evaluate the results of a proteomics experiment. The aim of the format is to present those results in a computationally accessible overview. It does not aim to provide the detailed evidence for the results, or to allow the process that led to them to be recreated. Both of these functions are covered by links to more detailed representations in other formats, in particular mzIdentML and mzQuantML [Ref1](https://code.google.com/archive/p/mztab/). mzTab can be used on its own or together with those formats [Ref2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189001/).

**Warning**: Although mzTab can be used to report a detailed view of the data, it explicitly does not aim to capture the whole complexity and evidence trail of a proteomics study. Even the most complex mzTab files still include simplifications and assumptions about the experimental results. This is the case, for instance, for identification results (e.g. protein inference/grouping is only supported to a limited extent) and quantification results (e.g. the coordinates for isotope patterns in quantified two-dimensional “features” cannot be fully reported). This missing information can be reported using the existing PSI standard formats mzIdentML and mzQuantML.

_File content_

Sections (each line starts with a three-letter prefix):
- MTD: metadata - was deliberately kept flexible, and the majority of fields are optional. Therefore, it is possible to report different levels of experimental annotation depending on the interest of the producer of the files, ranging from basic annotations to the complete
- PRH: protein header
- PRT: protein identifications
- PEH: peptide header
- PEP: peptide identifications - is used to report aggregated quantification data based on several PSMs.
- PSH: peptide-spectrum match header
- PSM: peptide-spectrum match - its `unique` column indicates whether the peptide was unambiguously assigned to a given protein.
- SMH: small molecule header
- SML: small molecules identifications
- COM: comments 
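Because every line carries its section prefix as the first tab-separated field, a file can be split into sections without any dedicated library. The following is a minimal sketch under that assumption; the sample content and the `split_sections` helper are made up for illustration and are far smaller than a real mzTab file.

```python
from collections import defaultdict

# Made-up miniature mzTab content; real files have many more columns and rows.
SAMPLE = "\n".join([
    "COM\tExample file",
    "MTD\tmzTab-version\t1.0.0",
    "PRH\taccession\tdescription",
    "PRT\tPROT_A\tExample protein",
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tPEPTIDER\t1\tPROT_A",
    "PSM\tANOTHERK\t2\tPROT_A",
])


def split_sections(text):
    """Group the tab-separated fields of each line by its three-letter prefix."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sections
        prefix, *fields = line.split("\t")
        sections[prefix].append(fields)
    return dict(sections)


sections = split_sections(SAMPLE)
print(sorted(sections))          # prefixes found in the sample
print(sections["PSH"][0])        # column names of the PSM table
print(len(sections["PSM"]))      # number of PSM rows
```

In practice a parser such as pyteomics also pairs each header line (PRH, PEH, PSH, SMH) with its data rows, which this sketch leaves to the caller.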


<!-- ![Fig1](/home/tiago/documents/lncRNA/Study/Fig1_mztabe_content.jpg) -->


In [1]:
from pyteomics import mztab

### 1. Coffee

In [2]:
coffie = mztab.MzTab("/home/tiago/documents/lncRNA/notebooks/studies/2experimentos.pride.mztab")

In [3]:
coffie.keys()

odict_keys(['PRT', 'PEP', 'PSM', 'SML'])

- PRT: protein identifications 
- PEP: peptide identifications
- PSM: peptide-spectrum match - its `unique` column indicates whether the peptide was unambiguously assigned to a given protein.
- SML: small molecules identifications

In [4]:
coffie.metadata["mzTab-type"]

'Identification'

In [5]:
coffie["PRT"]


| accession (index) | accession | description | taxid | species | database | database_version | search_engine | best_search_engine_score[1] | search_engine_score[1]_ms_run[1] | search_engine_score[1]_ms_run[2] | ... | num_peptides_unique_ms_run[15] | num_peptides_unique_ms_run[16] | num_peptides_unique_ms_run[17] | num_peptides_unique_ms_run[18] | num_peptides_unique_ms_run[19] | num_peptides_unique_ms_run[20] | num_peptides_unique_ms_run[21] | ambiguity_members | modifications | protein_coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A0A068U1Z5_COFCA | A0A068U1Z5_COFCA | Eukaryotic translation initiation factor 3 sub... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | | |
| A0A068TTF0_COFCA | A0A068TTF0_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | 464-UNIMOD:4 | |
| A0A068U9U8_COFCA | A0A068U9U8_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | | |
| A0A068URX2_COFCA | A0A068URX2_COFCA | ATP-dependent Clp protease proteolytic subunit... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | | |
| A0A068TYH5_COFCA | A0A068TYH5_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| A0A068UX04_COFCA | A0A068UX04_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | | ... | | | | | | | | | | |
| A0A068ULJ5_COFCA | A0A068ULJ5_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | 0.997500 | ... | | | | | | | | | | |
| A0A068UYE2_COFCA | A0A068UYE2_COFCA | Lipoxygenase OS=Coffea canephora GN=GSCOC_T000... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | 0.999993 | ... | | | | | | | | | 585-UNIMOD:4 | |
| A0A068VF06_COFCA | A0A068VF06_COFCA | Coffea canephora DH200=94 genomic scaffold, sc... | | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | | | 0.923192 | ... | | | | | | | | A0A068V8A1_COFCA | | |
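The `modifications` column in the output above holds values such as `464-UNIMOD:4`, that is, a position followed by a modification accession. A minimal sketch of splitting such cells is shown below. The `parse_modifications` helper is hypothetical, and it assumes only the simple comma-separated `position-accession` form; ambiguous positions or entries carrying extra parameters would need more careful parsing.

```python
def parse_modifications(cell):
    """Split a cell like '464-UNIMOD:4' into (position, accession) pairs.

    Assumes the simple 'position-accession' form, comma-separated.
    Missing values (NaN, empty string, 'null', '0') yield an empty list.
    """
    if not isinstance(cell, str) or cell.strip() in ("", "null", "0"):
        return []
    pairs = []
    for entry in cell.split(","):
        # Split on the first hyphen only: the accession may contain other symbols.
        position, _, accession = entry.strip().partition("-")
        pairs.append((int(position), accession))
    return pairs


print(parse_modifications("464-UNIMOD:4"))
print(parse_modifications("464-UNIMOD:4,585-UNIMOD:4"))
print(parse_modifications(float("nan")))
```

With a loaded table, the helper could be applied column-wise, for example `coffie["PRT"].modifications.apply(parse_modifications)`.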


In [6]:
with open("desc.txt","w+") as description: 
    for d in coffie["PRT"].description.values:
        description.write(f'{d}\n')

In [7]:
coffie["PSM"]

| PSM_ID (index) | sequence | PSM_ID | accession | unique | database | database_version | search_engine | search_engine_score[1] | search_engine_score[2] | search_engine_score[3] | ... | exp_mass_to_charge | calc_mass_to_charge | spectra_ref | pre | post | start | end | opt_global_mzidentml_original_ID | opt_global_cv_MS:1002217_decoy_peptide | opt_global_cv_PRIDE:0000091_rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | VFGPHQWEILR | 1 | A0A068U1Z5_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 20.95 | 25.0 | 0.996025 | ... | 461.25412 | 461.250733 | ms_run[3]:index=262 | | | 364 | 374 | Spec_5760_VFGPHQWEILR | 0 | 1 |
| 2 | YLEDKTSVPYEPVYSDEQAR | 2 | A0A068TTF0_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 35.16 | 25.0 | 0.999713 | ... | 797.04852 | 797.044699 | ms_run[8]:index=1954 | | | 312 | 331 | Spec_57572_YLEDKTSVPYEPVYSDEQAR | 0 | 1 |
| 3 | YLEDKTSVPYEPVYSDEQAR | 3 | A0A068TTF0_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 35.26 | 25.0 | 0.999544 | ... | 797.04919 | 797.044699 | ms_run[20]:index=1775 | | | 312 | 331 | Spec_25966_YLEDKTSVPYEPVYSDEQAR | 0 | 1 |
| 4 | YLEDKTSVPYEPVYSDEQAR | 4 | A0A068TTF0_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 19.21 | 25.0 | 0.999363 | ... | 797.05011 | 797.044699 | ms_run[17]:index=1996 | | | 312 | 331 | Spec_35544_YLEDKTSVPYEPVYSDEQAR | 0 | 1 |
| 5 | YLEDKTSVPYEPVYSDEQAR | 5 | A0A068TTF0_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 23.32 | 25.0 | 0.999418 | ... | 797.04919 | 797.044699 | ms_run[15]:index=1870 | | | 312 | 331 | Spec_44982_YLEDKTSVPYEPVYSDEQAR | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16925 | LFILDYHDMLLPFIEGMNSLPGR | 16925 | A0A068UYE2_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 35.74 | 25.0 | 0.999683 | ... | 897.79639 | 897.794099 | ms_run[2]:index=654 | | | 504 | 526 | Spec_67915_LFILDYHDMLLPFIEGMNSLPGR | 0 | 1 |
| 16926 | IVNKWNTALIGLMTYFR | 16926 | A0A068VF06_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 44.30 | 25.0 | 0.999500 | ... | 680.71399 | 680.708199 | ms_run[2]:index=308 | | | 1298 | 1314 | Spec_67515_IVNKWNTALIGLMTYFR | 0 | 1 |
| 16927 | IVNKWNTALIGLMTYFR | 16927 | A0A068VF06_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 23.94 | 25.0 | 0.997619 | ... | 680.71197 | 680.708199 | ms_run[4]:index=226 | | | 1298 | 1314 | Spec_64756_IVNKWNTALIGLMTYFR | 0 | 1 |
| 16926 | IVNKWNTALIGLMTYFR | 16926 | A0A068V8A1_COFCA | | uniprot-taxonomy%3Acoffee.fasta | | [Mascot, X!Tandem, Scaffold] | 44.30 | 25.0 | 0.999500 | ... | 680.71399 | 680.708199 | ms_run[2]:index=308 | | | 1299 | 1315 | Spec_67515_IVNKWNTALIGLMTYFR | 0 | 1 |
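In the output above, PSM_ID 16926 appears twice with different accessions, and the `unique` column is empty for the rows shown. One way to find such shared PSMs regardless of that column is to count the distinct proteins per PSM_ID. This is a minimal sketch: the small DataFrame is copied by hand from the last three rows of the output, and only the columns needed for the check are kept.

```python
import pandas as pd

# Last three rows of the PSM output above, reduced to the relevant columns.
psm = pd.DataFrame({
    "PSM_ID": [16926, 16927, 16926],
    "sequence": ["IVNKWNTALIGLMTYFR"] * 3,
    "accession": ["A0A068VF06_COFCA", "A0A068VF06_COFCA", "A0A068V8A1_COFCA"],
})

# Count the distinct protein accessions reported for each PSM.
proteins_per_psm = psm.groupby("PSM_ID")["accession"].nunique()

# PSMs mapped to more than one protein are not unambiguous assignments.
shared = proteins_per_psm[proteins_per_psm > 1]
print(shared.index.tolist())
```

On the full table, the same two lines could be run on `coffie["PSM"]` after a `reset_index(drop=True)`, since PSM_ID is both the index name and a column there.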


How can UniProt information be accessed using the accessions stored in the mzTab file?

In [8]:
import urllib.parse
import urllib.request
from io import StringIO

import requests
from Bio import SeqIO


def get_sequences(accession: list, file_name: str) -> str:
    """Map mzTab accessions to UniProtKB accessions and save matching FASTA records."""
    # Convert the mzTab protein accessions to UniProtKB accessions.
    # Note: 'uploadlists' is UniProt's legacy ID-mapping endpoint.
    databases_url = 'https://www.uniprot.org/uploadlists/'
    params = {
        'from': 'ACC+ID',  # UniProtKB AC/ID, as stored in the mzTab file
        'to': 'ACC',       # UniProtKB accession
        'format': 'list',
        'query': " ".join(accession),
    }
    query = urllib.parse.urlencode(params).encode('utf-8')
    req = urllib.request.Request(databases_url, query)
    with urllib.request.urlopen(req) as accession_file:
        response = accession_file.read()
    # Drop the empty strings left by the trailing newline.
    uniprot_accession = [acc for acc in response.decode('utf-8').split("\n") if acc]

    # Get the protein sequences from UniProt, keeping only uncharacterized proteins.
    uniprot_url = "https://www.uniprot.org/uniprot/"
    out_path = f"{file_name}.fasta"
    with open(out_path, "w") as fasta:
        for acc in uniprot_accession:
            uniprot_response = requests.get(f"{uniprot_url}{acc}.fasta")
            for seq_info in SeqIO.parse(StringIO(uniprot_response.text), 'fasta'):
                if "Uncharacterized protein" in seq_info.description:
                    fasta.write(f">{seq_info.description}\n")
                    fasta.write(f"{seq_info.seq}\n")
    # Return the output path; the file handle is already closed at this point.
    return out_path

        

In [9]:
coffie_accession = coffie["PRT"].accession.values
#get_sequences(accession = coffie_accession,file_name = "teste")

In [None]:
import requests

for accessions in coffie_accession[:10]:
    try:
        databases_url = 'https://www.uniprot.org/uploadlists/'
        params = {
            'from': 'ACC+ID',  # UniProtKB AC/ID, as stored in the mzTab file
            'to': 'EMBL',      # nucleotide database accession, fetched from ENA below
            'format': 'list',
            'query': f'{accessions}'
        }
        query = urllib.parse.urlencode(params).encode('utf-8')
        req = urllib.request.Request(databases_url, query)
        with urllib.request.urlopen(req) as accession_file:
            response = accession_file.read()
        ena_accession = response.decode('utf-8').split("\n")
        print(accessions, ena_accession[0])
        url = f"https://www.ebi.ac.uk/ena/browser/api/fasta/{ena_accession[0]}?lineLimit=1000"
        sequence_request = requests.get(url).text.split("\n")
        print(sequence_request)
    except Exception as error:
        # Report the failure instead of silently skipping the accession.
        print(f"{accessions}: {error}")

In [83]:
import requests


def download_sequences(mztab_data, out_file):
    """Write one ENA nucleotide sequence per protein of the PRT table to <out_file>.fasta."""
    uniprot_url = 'https://www.uniprot.org/uploadlists/'
    with open(f"{out_file}.fasta", "w") as sequence_fasta_file:
        for data in mztab_data["PRT"].itertuples():
            params = {
                'from': 'ACC+ID',  # UniProtKB AC/ID, as stored in the mzTab file
                'to': 'EMBL',      # nucleotide database accession, fetched from ENA below
                'format': 'list',
                'query': f'{data.accession}'
            }
            query = urllib.parse.urlencode(params).encode('utf-8')
            mapping_request = urllib.request.Request(uniprot_url, query)
            with urllib.request.urlopen(mapping_request) as request_data:
                request_response = request_data.read()
            ena_accession = request_response.decode("utf-8").split('\n')
            ena_url = f"https://www.ebi.ac.uk/ena/browser/api/fasta/{ena_accession[0]}?lineLimit=1000"
            ena_seq_request = requests.get(ena_url).text.split("\n")
            # Use the protein description as the FASTA header and join the sequence lines.
            fasta_id = data.description
            fasta_seq = "".join(ena_seq_request[1:])
            sequence_fasta_file.write(f">{fasta_id}\n")
            sequence_fasta_file.write(f"{fasta_seq}\n")



In [84]:
download_sequences(coffie,"cafe.test")

KeyboardInterrupt: 
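The run above was interrupted; the loop issues two HTTP requests for every row of the PRT table. One possible way to reduce the number of mapping requests is to send several accessions per query, as `get_sequences` already does with a space-joined list. The sketch below only shows the chunking step. The `batched` helper and the batch size of 2 are illustrative assumptions, and whether the mapping service accepts a given batch size is not verified here.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# First five accessions from the PRT output above.
accessions = [
    "A0A068U1Z5_COFCA",
    "A0A068TTF0_COFCA",
    "A0A068U9U8_COFCA",
    "A0A068URX2_COFCA",
    "A0A068TYH5_COFCA",
]

# One space-joined query string per batch, in the form used for the 'query' parameter.
queries = [" ".join(chunk) for chunk in batched(accessions, 2)]
for query in queries:
    print(query)
```

Each query string would then replace the single `data.accession` value in `params`, so the number of mapping requests drops from one per protein to one per batch.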

BLAST

1. qseqid: query or source (e.g., gene) sequence ID
2. sseqid: subject or target (e.g., reference genome) sequence ID
3. pident: percentage of identical matches
4. length: alignment length (sequence overlap)
5. mismatch: number of mismatches
6. gapopen: number of gap openings
7. qstart: start of alignment in query
8. qend: end of alignment in query
9. sstart: start of alignment in subject
10. send: end of alignment in subject
11. evalue: expect value
12. bitscore: bit score
13. qseq: aligned part of query sequence
14. sseq: aligned part of subject sequence
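The fields listed above can be used as column names when reading a BLAST result table. The sketch below assumes the results were written as tab-separated text with exactly these 14 columns in this order. The sample hit is made up for illustration, and `read_blast_table` is a hypothetical helper rather than part of any BLAST tooling.

```python
import csv
from io import StringIO

BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qseq", "sseq",
]

# One made-up hit: 20 aligned positions with a single mismatch (95% identity).
SAMPLE = (
    "query_1\tsubject_1\t95.0\t20\t1\t0\t1\t20\t101\t120\t1e-05\t38.2\t"
    "ACGTACGTACGTACGTACGT\tACGTACGTACGAACGTACGT\n"
)


def read_blast_table(handle):
    """Read tab-separated BLAST hits into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    return list(reader)


hits = read_blast_table(StringIO(SAMPLE))
best = hits[0]
# Values are read as strings, so numeric fields need an explicit conversion.
print(best["qseqid"], best["sseqid"], float(best["pident"]), float(best["evalue"]))
```

For a real result file, `StringIO(SAMPLE)` would be replaced by an open file handle.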