## NCBI + UniProt results

Once we had both the NCBI and the UniProt results for each protein, we compiled them into a single table and compared them. The UniProt results complemented the information obtained from NCBI because, for some proteins, data such as function are specified in greater detail. We also found differences between the two databases in 22 of the results, which are highlighted in blue. In those cases we took the UniProt results to be the more reliable ones, since the corresponding entries are reviewed.
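
The comparison described above can be sketched as follows. This is a minimal illustration rather than the code that produced our table: the function name `find_differences` and the sample records are hypothetical, and it assumes each database yields one dictionary of fields per locus tag.

```python
def find_differences(ncbi, uniprot, fields):
    """Return {tag: [fields whose values differ]} for tags present in both sources."""
    diff = {}
    for tag in sorted(set(ncbi) & set(uniprot)):
        changed = [
            f for f in fields
            if f in ncbi[tag] and f in uniprot[tag] and ncbi[tag][f] != uniprot[tag][f]
        ]
        if changed:
            diff[tag] = changed
    return diff


# Hypothetical sample records, not taken from our data files
ncbi = {
    "tagA": {"gene": "abcD", "EC_number": "1.1.1.1"},
    "tagB": {"gene": "efgH", "EC_number": None},
}
uniprot = {
    "tagA": {"gene": "abcD", "EC_number": "1.1.1.2"},
    "tagB": {"gene": "efgH", "EC_number": None},
}

differences = find_differences(ncbi, uniprot, ["gene", "EC_number"])
print(differences)  # {'tagA': ['EC_number']}
```

A result of this shape (locus tag mapped to the list of differing fields) is what the highlighting step in the code cell further down iterates over.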


Using the NCBI and UniProt results, specifically the functions and the biological process classes, we built a colour chart that groups the genes/proteins by the functional biological class to which they belong. Some proteins have more than one associated function; in those cases we coloured the rectangle representing the protein according to the function we consider most important.


![colour_chart](img/colour_chart.png)
![legenda](img/legenda.png)

__Figure 1 -__ Functional classification of the genes/proteins and the corresponding legend.
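
The colouring rule behind the chart can be sketched as below. This is an illustration under assumptions: the class names, the colours, and the convention that the first listed class is the most important one are all hypothetical. The chart itself reflects our own judgement of importance, which this code does not reproduce.

```python
# Hypothetical class-to-colour mapping (not the palette used in Figure 1)
CLASS_COLOURS = {
    "metabolism": "lightgreen",
    "transcription": "khaki",
}


def primary_colour(classes, colours=CLASS_COLOURS, default="lightgrey"):
    """Pick the colour of the first listed class, assumed to be the most important."""
    if not classes:
        return default
    return colours.get(classes[0], default)


# Hypothetical assignments: tag -> classes ordered by importance
assignments = {
    "tagA": ["transcription", "metabolism"],
    "tagB": ["metabolism"],
    "tagC": [],
}

chart = {tag: primary_colour(classes) for tag, classes in assignments.items()}
print(chart)
```

Proteins without any assigned class fall back to the default colour, so every rectangle in the chart receives exactly one colour.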

In [15]:
import os, sys, inspect, math
import pandas as pd
from IPython.display import display, HTML

def import_modules():
    """
    Make the modules developed for this project importable.
    """
    current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
    parent_dir = os.path.dirname(current_dir)
    sys.path.insert(0, parent_dir)

def itemize(l):
    """
    Build an HTML list.
    """
    # missing values (None or NaN) become an empty cell
    if l is None or (isinstance(l, float) and math.isnan(l)):
        return ""
    if isinstance(l, dict):
        return itemize_dict(l)

    html = "<ul>"
    for i in l:
        html += "<li>"
        if isinstance(i, dict):
            html += itemize_dict(i)
        else:
            # str() so that non-string items do not raise a TypeError
            html += str(i)
        html += "</li>"
    html += "</ul>"
    return html

def itemize_dict(d):
    """
    Build an HTML list from a dictionary.
    """
    html = "<ul>"
    for k in d:
        html += "<li><strong>" + k + ":</strong> " + str(d[k]) + "</li>"  
    html += "</ul>"
    return html

def pretty_print(v):
    """
    Replace NaN and None values with an empty string.
    """
    if isinstance(v, float) and math.isnan(v):
        return ""
    if v is None:
        return ""
    return v

def background_it(v):
    """
    Wrap the value in a div with a blue background.
    """
    return "<div style=\"background-color:powderblue;\">" + str(v) + "</div>"

def shorten_it(v):
    """
    Return the first 10 characters followed by "..." (short strings are returned unchanged).
    """
    if len(v) <= 10:
        return v
    return v[:10] + "..."

def main():
    import_modules()
    import util.rw as rw
    
    # show all rows
    pd.options.display.max_rows = 250
    
    # do not truncate cell contents (None replaces the deprecated -1)
    pd.set_option('display.max_colwidth', None)

    cdd = rw.read_json("files/domains.json")
    ncbi = rw.read_json("files/.ncbi_uniprot.json")
    diff = rw.read_json("files/.ncbi_uniprot_diff.json")

    columns_to_itemize = [
        "location",
        "accessions",
        "cofactors",
        "comment_functions",
        "molecular_functions",
        "biological_processes",
        "domains",
        "locations",
        "pdbs",
        "modified_residues"
    ]
        
    columns_to_pp = [
        "gene",
        "EC_number",
        "uniprot_id",
        "protein_id",
        "organism",
        "length",
        "mass",
        "translation"
    ]

    columns_to_show = [
        "short_name",
        "product",
        "gene",
        "EC_number",
        "accessions",
        "status",
        "type",
        "uniprot_id",
        "protein_id",
        "organism",
        "location",
        "length",
        "mass",
        "comment_functions",
        "molecular_functions",
        "biological_processes",
        "domains",
        "locations",
        "cofactors",
        "pdbs",
        "modified_residues",
        "translation"
    ]
    
    df = pd.DataFrame(ncbi).transpose()
    
    # add the domains, following the DataFrame index so rows stay aligned
    df["domains"] = [
        cdd[tag]["domains"] if tag in cdd else []
        for tag in df.index
    ]
        
    for p in columns_to_itemize:
        df[p] = df[p].apply(itemize)
        
    for p in columns_to_pp:
        df[p] = df[p].apply(pretty_print)
        
    # show only the beginning of the translated sequence
    df["translation"] = df["translation"].apply(shorten_it)
    
    # highlight the information that differs between NCBI and UniProt
    for tag in diff:
        for p in diff[tag]:
            # .loc avoids chained assignment, which may silently fail to update df
            df.loc[tag, p] = background_it(df.loc[tag, p])

    display(HTML(df[columns_to_show].to_html(escape=False)))

    
main()

tag,short_name,product,gene,EC_number,accessions,status,type,uniprot_id,protein_id,organism,location,length,mass,comment_functions,molecular_functions,biological_processes,domains,locations,cofactors,pdbs,modified_residues,translation
lpg0232,Q5ZYX9_LEGPH,"Transcriptional regulator np20, Fur family",np20,,Q5ZYX9,unreviewed,mRNA,Q5ZYX9,YP_094286.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 270569; strand: 1; start: 270036,177.0,20453.0,,"DNA binding; transcription factor activity, sequence-specific DNA binding",,accession: COG0735; name: Fur; desc: Fe2+ or Zn2+ uptake regulation protein [Inorganic ion transport and metabolism],,,,,MIGCCLIIFP...
lpg0233,Q5ZYX8_LEGPH,Benzoylformate decarboxylase,mdlC,4.1.1.7,Q5ZYX8,unreviewed,mRNA,Q5ZYX8,YP_094287.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 272278; strand: -1; start: 270686,530.0,58497.0,,benzoylformate decarboxylase activity; magnesium ion binding; thiamine pyrophosphate binding,,"accession: cd02002; name: TPP_BFDC; desc: Thiamine pyrophosphate (TPP) family, BFDC subfamily, TPP-binding module | accession: pfam02776; name: TPP_enzyme_N; desc: Thiamine pyrophosphate enzyme, N-terminal TPP binding domain | accession: pfam00205; name: TPP_enzyme_M; desc: Thiamine pyrophosphate enzyme, central domain | accession: COG0028; name: IlvB; desc: Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme",,,,,MKKTGSDVLK...
lpg0234,Q5ZYX7_LEGPH,SidE,sidE,,Q5ZYX7,unreviewed,mRNA,Q5ZYX7,YP_094288.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 277121; strand: -1; start: 272577,1514.0,171651.0,,,,accession: pfam12252; name: SidE; desc: Dot/Icm substrate protein This family of proteins is found in bacteria,,,,,MLIFKSQILI...
lpg0235,Q5ZYX6_LEGPH,Uncharacterized protein,,,Q5ZYX6,unreviewed,mRNA,Q5ZYX6,YP_094289.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 277987; strand: -1; start: 277484,167.0,19095.0,,carbon-sulfur lyase activity,metabolic process,accession: cl01553; name: GFA super family; desc: Glutathione-dependent formaldehyde-activating enzyme,,,,,MKKAFRIMAT...
lpg0236,Q5ZYX5_LEGPH,Uncharacterized protein,,,Q5ZYX5,unreviewed,mRNA,Q5ZYX5,YP_094290.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 280039; strand: -1; start: 278060,659.0,77097.0,,,,,,,,,MRYTNIELLK...
lpg0237,Q5ZYX4_LEGPH,Lipolytic enzyme,mhpC,,Q5ZYX4,unreviewed,mRNA,Q5ZYX4,YP_094291.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 281114; strand: 1; start: 280320,264.0,29403.0,,,,accession: pfam12695; name: Abhydrolase_5; desc: Alpha/beta hydrolase family | accession: cl21494; name: Abhydrolase super family; desc: alpha/beta hydrolases,,,,,MATLKINGVD...
lpg0238,Q5ZYX3_LEGPH,Glycine betaine aldehyde dehydrogenase,gbsA,1.2.1.8,Q5ZYX3,unreviewed,mRNA,Q5ZYX3,YP_094292.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 282597; strand: 1; start: 281131,488.0,52945.0,,betaine-aldehyde dehydrogenase activity,,accession: cd07119; name: ALDH_BADH-GbsA; desc: Bacillus subtilis NAD+-dependent betaine aldehyde dehydrogenase-like,,,,,MEIYKMYIDG...
lpg0239,Q5ZYX2_LEGPH,4-aminobutyrate aminotransferase,gabT,2.6.1.19,Q5ZYX2,unreviewed,mRNA,Q5ZYX2,YP_094293.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 283924; strand: 1; start: 282572,450.0,49049.0,,4-aminobutyrate transaminase activity; pyridoxal phosphate binding,gamma-aminobutyric acid metabolic process,accession: cl18945; name: AAT_I super family; desc: Aspartate aminotransferase (AAT) superfamily (fold type I) of pyridoxal phosphate (PLP)-dependent enzymes | accession: COG0160; name: GabT; desc: 4-aminobutyrate aminotransferase or related aminotransferase,,,,,MKHQLVGTKL...
lpg0240,Q5ZYX1_LEGPH,DNA repair protein,recN,,Q5ZYX1,unreviewed,mRNA,Q5ZYX1,YP_094294.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 284787; strand: 1; start: 284008,259.0,29552.0,,,,,,,,,MNDIMWYQNI...
lpg0241,GLSA_LEGPH,Glutaminase,,3.5.1.2,Q5ZYX0,reviewed,mRNA,Q5ZYX0,YP_094295.1,Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513),end: 285979; strand: 1; start: 285047,310.0,33970.0,,glutaminase activity,glutamine metabolic process,accession: TIGR03814; name: Gln_ase; desc: glutaminase A,,,,,MSSKLLTIQL...


[Index](index.html) | [Previous](cdd_results.html) | [Next](alignments.html)