# General Thoughts on PPI prediction and Machine Learning
## Definition of an interaction
PHI-base defines an interaction as: an observable function of one gene, on one host and on one tissue type [Urban, 2022](https://academic.oup.com/nar/article/50/D1/D837/6426057).
## Mimicry of host molecules by pathogens
### Date: 27-04-2022
I have been thinking along these lines for a while now. If I were a pathogen, the molecules I evolve to infect the host should closely resemble the host proteins that normally bind the receptors I am targeting. [Urban, 2022](https://academic.oup.com/nar/article/50/D1/D837/6426057) cites an article that discusses pathogen mimicry.

If this is the case, then a machine learning model that is trained to predict host-host interactions could be fine-tuned with pathogenic data to predict host-pathogen interactions.

### Method
I would first train a Transformer model on host-host interactions, then reload its parameters and fine-tune it on host-pathogen data. With the better models I have now, I should give this a go.
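
A minimal sketch of that two-stage idea in PyTorch, under loud assumptions: the `PairClassifier` architecture, the vocabulary size, the learning rates and the random placeholder batches are all invented for illustration and are not the model or data from these notes. It only shows the mechanics of training, saving the parameters, reloading them into a fresh model and continuing training with a smaller learning rate.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)

VOCAB_SIZE = 25  # assumed: 20 amino acids plus a few special tokens
MAX_LEN = 32     # assumed toy length of a tokenized protein pair


class PairClassifier(nn.Module):
    """Tiny Transformer encoder that scores one tokenized protein pair.

    Positional encodings are omitted to keep the sketch short.
    """

    def __init__(self, d_model=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens):
        hidden = self.encoder(self.embed(tokens))
        # Mean-pool over sequence positions, then emit one logit per pair.
        return self.head(hidden.mean(dim=1)).squeeze(-1)


def random_batch(n):
    """Placeholder data: random tokens and random interact / no-interact labels."""
    tokens = torch.randint(0, VOCAB_SIZE, (n, MAX_LEN))
    labels = torch.randint(0, 2, (n,)).float()
    return tokens, labels


def train(model, tokens, labels, lr, steps=5):
    """Run a few full-batch Adam steps and return the last loss value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(tokens), labels)
        loss.backward()
        optimizer.step()
    return loss.item()


# Stage 1: "pre-train" on (placeholder) host-host pairs and save the parameters.
pretrained = PairClassifier()
train(pretrained, *random_batch(16), lr=1e-3)
checkpoint = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save(pretrained.state_dict(), checkpoint)

# Stage 2: reload into a fresh model and fine-tune on (placeholder)
# host-pathogen pairs with a smaller learning rate.
checkpoint.seek(0)
finetuned = PairClassifier()
finetuned.load_state_dict(torch.load(checkpoint))
final_loss = train(finetuned, *random_batch(8), lr=1e-4)
print(f"fine-tuning loss after 5 steps: {final_loss:.3f}")
```

The smaller fine-tuning learning rate is a common choice to avoid overwriting what was learned in the first stage; whether it helps here would have to be checked on the real host-pathogen data.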

* Another way to look at this is to find functionally homologous proteins in pathogens and hosts. Could we classify similarity and family?

* Could we create a dataset of pathogen target proteins and their endogenous interactors, then fine-tune this model with pathogenic proteins?
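
As a very rough first pass at the similarity question, one could rank host proteins by how many short subsequences (k-mers) they share with a pathogen protein. The sketch below uses Jaccard similarity over 3-mers; the sequences and names are invented toy data, and k-mer overlap is only a crude stand-in, since mimicry may be structural rather than visible at the sequence level.

```python
def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


# Hypothetical toy sequences, not real proteins.
host_proteins = {
    "host_ligand": "MKTAYIAKQR",
    "host_other": "GGSGGSGGSG",
}
pathogen_protein = "MKTAYIAKQL"

# Rank host proteins by similarity to the pathogen protein.
ranking = sorted(
    host_proteins,
    key=lambda name: jaccard(host_proteins[name], pathogen_protein),
    reverse=True,
)
best = ranking[0]
print(ranking)
```

For real data, an alignment-based score or learned embeddings would likely be more informative than raw k-mer overlap.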

## Loading Large Language Models from HuggingFace
### Date: 27-04-2022
Since the ProtTrans models on HuggingFace use the same vocabulary as the Lanchintin models, I should see if I can load the pre-trained BERT models and perform the training on the host-pathogen datasets I have.

Lanchintin has trained his own LMs but has not made them available.

### Some questions
A big question is how Lanchintin trained the LMs: did he attach the CNN to BERT during LM training, or did he first train the LMs and then use CNN + BERT to fine-tune them? In the second case, I can use the ProtTrans models directly from HuggingFace.
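
A hedged sketch of what loading a ProtTrans BERT checkpoint could look like, assuming the `Rostlab/prot_bert` checkpoint on HuggingFace is the one meant here. That checkpoint's tokenizer expects upper-case residues separated by spaces, with the rare amino acids U, Z, O and B mapped to X. The helper names `prepare_sequence` and `load_prot_bert` are made up for this example, and the loader is not called below because it needs the `transformers` package and a model download.

```python
import re


def prepare_sequence(sequence: str) -> str:
    """Format a raw amino-acid string the way the ProtBert tokenizer expects."""
    # Map rare amino acids to X, then separate residues with spaces.
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def load_prot_bert(model_name: str = "Rostlab/prot_bert"):
    """Load tokenizer and encoder; downloads the weights on first use."""
    # Imported lazily so the preprocessing above works without `transformers`.
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = BertModel.from_pretrained(model_name)
    return tokenizer, model


print(prepare_sequence("MKTUZ"))  # prints "M K T X X"
```

Whether the loaded weights fit the CNN + BERT setup depends on the answer to the question above, so this only covers the loading step.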

## Ensembl Bacteria databases
### Date: 27-04-2022
PHI-base provides annotation of pathogen-host interactions in the Ensembl Bacteria databases.

I need to see if I can download more PPIs, particularly interactions between humans and bacteria, to extend the database I already have.

## PHI-base
### Date: 27-04-2022
PHI-base is a curated, multi-host pathogen interaction database. It contains interactions between pathogens and plant, human and fungal hosts. Each interaction leads to a specific phenotypic effect.

In [1]:
import pandas as pd
phibase = pd.read_csv('../data/phi_base/phi-base_current.csv', sep=',', encoding='ISO-8859-1')
phibase.head()


Unnamed: 0,Record ID,PHI_MolConn_ID,Protein ID source,Protein ID,Gene ID source,Gene ID,AA sequence,NT sequence,Sequence Strain,Gene,...,Ref. detail,Author email,Comments,Author reference,Year,Curation details,File name,Batch no.,Curation date,Curator organization
0,Record 1,PHI:3,Uniprot,P26215,EMBL,AAA79885,,,SB111,PGN1,...,,,Expression during all infection stages. pathog...,Scott-Craig et al.,1990.0,,,PHI-base Vers.3.3,38476,Rres
1,Record 2,PHI:7,Uniprot,P22287,EMBL,CAA42824,,,race 5,AVR9,...,,,,Van Kan et al.,1991.0,,,PHI-base Vers.3.3,38476,Rres
2,Record 3,PHI:12,Uniprot,Q01886,EMBL,AAA33023,,,SB111,HTS1,...,,,pathogen formerly called Cochliobolus carbonum...,Panaccione et al.,1992.0,,,PHI-base Vers.3.3,38476,Rres
3,Record 4,PHI:14,Uniprot,P0C017,EMBL,AAB09711,,,clinical isolate,ADE2,...,,,,Perfect et al.,1993.0,,,PHI-base Vers.3.3,38476,Rres
4,Record 5,PHI:14,Uniprot,P0C017,EMBL,AAB09711,,,clinical isolate,ADE2,...,,,,Perfect et al.,1993.0,,,PHI-base Vers.3.3,38476,Rres


In [2]:
phibase

Unnamed: 0,Record ID,PHI_MolConn_ID,Protein ID source,Protein ID,Gene ID source,Gene ID,AA sequence,NT sequence,Sequence Strain,Gene,...,Ref. detail,Author email,Comments,Author reference,Year,Curation details,File name,Batch no.,Curation date,Curator organization
0,Record 1,PHI:3,Uniprot,P26215,EMBL,AAA79885,,,SB111,PGN1,...,,,Expression during all infection stages. pathog...,Scott-Craig et al.,1990.0,,,PHI-base Vers.3.3,38476,Rres
1,Record 2,PHI:7,Uniprot,P22287,EMBL,CAA42824,,,race 5,AVR9,...,,,,Van Kan et al.,1991.0,,,PHI-base Vers.3.3,38476,Rres
2,Record 3,PHI:12,Uniprot,Q01886,EMBL,AAA33023,,,SB111,HTS1,...,,,pathogen formerly called Cochliobolus carbonum...,Panaccione et al.,1992.0,,,PHI-base Vers.3.3,38476,Rres
3,Record 4,PHI:14,Uniprot,P0C017,EMBL,AAB09711,,,clinical isolate,ADE2,...,,,,Perfect et al.,1993.0,,,PHI-base Vers.3.3,38476,Rres
4,Record 5,PHI:14,Uniprot,P0C017,EMBL,AAB09711,,,clinical isolate,ADE2,...,,,,Perfect et al.,1993.0,,,PHI-base Vers.3.3,38476,Rres
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18187,Record 18188,PHI:7054,Uniprot,P13470,Genbank,AE014940_2,,,UA159,gtfC,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18188,Record 18189,PHI:7055,Uniprot,P49331,Genbank,AE014932_2,,,UA159,gtfD,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18189,Record 18190,PHI:8088,Uniprot,Q8DUS1,Genbank,AE014924_6,,,UA159,rgpf (SMU.830),...,J Bacteriol. 2017 Nov 14;199(24). pii: e00497-17,Robert_Quivey@urmc.rochester.edu,The loss of rgpF caused a perturbation of memb...,C J Kovacs 2017,2017.0,1 gene,Kovacs-2017-RgpF is required for maintenance o,BATCH_082,May-18,MC
18190,Record 18191,PHI:9976,Uniprot,no data found,Genbank,AFM51859,,,,Mbov_0503,...,Microorganisms. 2020 Jan 23;8(2),aizhen@mail.hzau.edu.cn,Although Mbov_0503 disruption was only associa...,X Zhu 2020,2020.0,1 gene,Zhu-2020-Mbov_0503 Encodes a Novel Cytoadhesin,BATCH_099B,43891,MC


## Host species in PHI-base

In [3]:
set(phibase['Host species'].values)

{'Acanthamoeba castellanii (no common name found)',
 'Acanthamoeba polyphaga (no common name found)',
 'Acipenser sinensis (related: chinese sturgeon)',
 'Actinidia chinensis (no common name found)',
 'Acyrthosiphon pisum (related: pea aphid)',
 'Adalia bipunctata (related: two-spotted ladybird beetle)',
 'Aedes aegypti (related: yellow fever mosquito)',
 'Aeschynomene virginica (no common name found)',
 'Agaricus bisporus (related: cultivated mushroom)',
 'Ageratina adenophora (no common name found)',
 'Allium cepa (related: onion)',
 'Anas platyrhynchos (related: mallard)',
 'Anguilla anguilla (related: european eel)',
 'Anopheles stephensi (related: asian malaria mosquito)',
 'Anthurium andraeanum (related: oilcloth flower)',
 'Aphididae (related: aphids)',
 'Apis mellifera (related: honey bee)',
 'Apium graveolens (no common name found)',
 'Arabidopsis thaliana (related: thale cress)',
 'Arabidopsis thaliana (related: thale cress) (non-host bioassay: nicotiana benthamiana)',
 'Arab

## Filter host species for human entries

In [4]:
human_phibase = phibase[phibase['Host species'] == "Homo sapiens (related: human)"]

In [5]:
human_phibase

Unnamed: 0,Record ID,PHI_MolConn_ID,Protein ID source,Protein ID,Gene ID source,Gene ID,AA sequence,NT sequence,Sequence Strain,Gene,...,Ref. detail,Author email,Comments,Author reference,Year,Curation details,File name,Batch no.,Curation date,Curator organization
193,Record 194,PHI:198,Uniprot,P46590,EMBL,AAC41649,,,B792,ALS1,...,,,Buccal reconstituted human epithelium,,,,,PHI-base Vers.3.3,38476,Rres
194,Record 195,PHI:198,Uniprot,P46590,EMBL,AAC41649,,,B792,ALS1,...,,,Human umbilical vein endothelial cells,,,,,PHI-base Vers.3.3,38476,Rres
258,Record 259,PHI:252,Uniprot,Q5A7S7,EMBL,EAK98810,,,SC5314,FKH2,...,,,Forkhead Transcription Factor. Tissue Culture,Benson et al.,2002.0,,,PHI-base Vers.3.3,38476,Rres
483,Record 484,PHI:453,Uniprot,Q5ACU3,EMBL,EAL00423,,,SC5314,PMT5,...,,,Engineered human oral mucosa,Rouabhia et al.,2005.0,,,PHI-base Vers.3.3,39206,Rres
500,Record 501,PHI:468,Uniprot,Q6VBJ0,EMBL,AAQ82691,,,,EPA1,...,,,Reduced adhesion to human epithelial cells,Cormack et al,1999.0,,,PHI-base Vers.3.3,39206,Rres
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18184,Record 18185,PHI:7051,Uniprot,Q8DTF1,Genbank,AE014973_2,,,UA159,gbpC,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18185,Record 18186,PHI:7052,Uniprot,Q8DTB7,Genbank,AE014976_8,,,UA159,epsC,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18186,Record 18187,PHI:7053,Uniprot,P08987,Genbank,AE014940_1,,,UA159,gtfB,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18187,Record 18188,PHI:7054,Uniprot,P13470,Genbank,AE014940_2,,,UA159,gtfC,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC


In [6]:
set(human_phibase['Host species'].values)

{'Homo sapiens (related: human)'}
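
The check above only confirms what the exact-match filter kept. The full host list contains labels with extra annotations appended (for example the second `Arabidopsis thaliana` entry), so human records with a different suffix, if any exist, would be dropped silently. The sketch below compares an exact match with a prefix match on a small invented table; the second and third host labels are hypothetical and were not taken from PHI-base.

```python
import pandas as pd

# Toy stand-in for the PHI-base table; only the first label is from the real output.
toy = pd.DataFrame(
    {
        "Host species": [
            "Homo sapiens (related: human)",
            "Homo sapiens (related: human) (hypothetical extra annotation)",
            "Mus musculus (related: house mouse)",
            None,
        ],
        "Pathogen species": [
            "Candida albicans",
            "Escherichia coli",
            "Escherichia coli",
            "Bacillus cereus",
        ],
    }
)

# Exact match, as in the cell above.
exact = toy[toy["Host species"] == "Homo sapiens (related: human)"]

# Prefix match; na=False treats missing host labels as non-matches.
prefix = toy[toy["Host species"].str.startswith("Homo sapiens", na=False)]

print(len(exact), len(prefix))  # prints "1 2"
```

On the real table, comparing `len(exact)` with `len(prefix)` would show whether the 1135-entry count misses any human records.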

In [7]:
print(f'Number of entries for humans in phibase is {len(human_phibase)}.')


Number of entries for humans in phibase is 1135.


In [8]:
print('The pathogens in PhiBASE for humans')
set(human_phibase['Pathogen species'].values)

The pathogens in PhiBASE for humans


{'Acinetobacter baumannii',
 'Actinobacillus pleuropneumoniae',
 'Aspergillus fumigatus',
 'Bacillus cereus',
 'Bordetella bronchiseptica',
 'Bordetella pertussis',
 'Borreliella burgdorferi',
 'Brucella abortus',
 'Brucella melitensis',
 'Burkholderia cenocepacia',
 'Burkholderia contaminans',
 'Burkholderia pseudomallei',
 'Campylobacter jejuni',
 'Candida albicans',
 'Candida glabrata',
 'Candida parapsilosis',
 'Chlamydia pneumoniae',
 'Chlamydia trachomatis',
 'Citrobacter rodentium',
 'Clostridioides difficile',
 'Corynebacterium diphtheriae',
 'Corynebacterium ulcerans',
 'Coxiella burnetii',
 'Cronobacter turicensis',
 'Cronobacter universalis',
 'Cryptococcus neoformans',
 'Edwardsiella tarda',
 'Enterococcus faecalis',
 'Escherichia coli',
 'Exophiala dermatitidis',
 'Francisella tularensis',
 'Haemophilus ducreyi',
 'Haemophilus influenzae',
 'Helicobacter pylori',
 'Kingella kingae',
 'Klebsiella pneumoniae',
 'Lactococcus lactis',
 'Legionella pneumophila',
 'Leishmania me

In [17]:
human_phibase.drop(['Gene ID', 'Gene ID source', 'NT sequence', 'Chr location', 'Gene/Protein modification', 'Modified gene/protein Id'], axis=1)

Unnamed: 0,Record ID,PHI_MolConn_ID,Protein ID source,Protein ID,AA sequence,Sequence Strain,Gene,Interacting partner(s),Interacting partner(s) Id,Multiple mutation,...,Ref. detail,Author email,Comments,Author reference,Year,Curation details,File name,Batch no.,Curation date,Curator organization
193,Record 194,PHI:198,Uniprot,P46590,,B792,ALS1,,,,...,,,Buccal reconstituted human epithelium,,,,,PHI-base Vers.3.3,38476,Rres
194,Record 195,PHI:198,Uniprot,P46590,,B792,ALS1,,,,...,,,Human umbilical vein endothelial cells,,,,,PHI-base Vers.3.3,38476,Rres
258,Record 259,PHI:252,Uniprot,Q5A7S7,,SC5314,FKH2,,,,...,,,Forkhead Transcription Factor. Tissue Culture,Benson et al.,2002.0,,,PHI-base Vers.3.3,38476,Rres
483,Record 484,PHI:453,Uniprot,Q5ACU3,,SC5314,PMT5,,,,...,,,Engineered human oral mucosa,Rouabhia et al.,2005.0,,,PHI-base Vers.3.3,39206,Rres
500,Record 501,PHI:468,Uniprot,Q6VBJ0,,,EPA1,,,,...,,,Reduced adhesion to human epithelial cells,Cormack et al,1999.0,,,PHI-base Vers.3.3,39206,Rres
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18184,Record 18185,PHI:7051,Uniprot,Q8DTF1,,UA159,gbpC,,,,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18185,Record 18186,PHI:7052,Uniprot,Q8DTB7,,UA159,epsC,,,,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18186,Record 18187,PHI:7053,Uniprot,P08987,,UA159,gtfB,,,PHI:7054; PHI:7055,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC
18187,Record 18188,PHI:7054,Uniprot,P13470,,UA159,gtfC,,,PHI:7053; PHI:7055,...,Infect Immun. 2016 Oct 17;84(11):3206-3219,rmgraner@fop.unicamp.br,UAcov showed increased exvivo survival in huma...,L A Alves 2016,2016.0,6 genes,Alves-2016-CovR regulates Streptococcus mutans,BATCH_067B,May-17,MC


In [10]:
import panel as pn
pn.extension('tabulator')

In [13]:
human_phibase_widget = pn.widgets.DataFrame(human_phibase, autosize_mode='fit_columns', name='Human Phibase Interactions')
human_phibase_widget