<h1>Analysis of the epitope database</h1>

<h2> Number of unique molecule names </h2>

In [4]:
import DatabaseConnection as db
import pandas as pd

In [17]:
sql = """SELECT Mol_name, count(Mol_name) as total
FROM epitope_analyzer.PDB_Mol_Info
GROUP BY Mol_name
ORDER BY total desc
LIMIT 100;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,Mol_name,total
0,BETA-2-MICROGLOBULIN,1951
1,SPIKE GLYCOPROTEIN,1014
2,CELLULAR TUMOR ANTIGEN P53,408
3,"HLA CLASS I HISTOCOMPATIBILITY ANTIGEN, A-2 AL...",402
4,HEAT SHOCK PROTEIN HSP 90-ALPHA,332
...,...,...
95,RHO-ASSOCIATED PROTEIN KINASE 1,69
96,GALACTOSE-BINDING LECTIN,67
97,CARBONIC ANHYDRASE 9,66
98,SNACLEC RHODOCETIN SUBUNIT DELTA,66


In [18]:
import plotly.express as px

# Bar chart of the 100 most frequent molecule names
fig = px.bar(p_data, x="Mol_name", y="total")
fig.show()


<p> The entries are concentrated in a few molecule names, such as: <ul>
<li> BETA-2-MICROGLOBULIN </li>
<li> SPIKE GLYCOPROTEIN </li>
<li> CELLULAR TUMOR ANTIGEN P53 </li>
</ul>
Are these molecules distributed over many PDB structures, or are they concentrated in a few PDB oligomeric complexes?
</p>
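
<p> The question above comes down to counting, for each PDB entry, how many rows carry a given molecule name. Since the same query is needed for every molecule of interest, a small helper can build it from the name. This is only a sketch: the helper name <code>copies_per_pdb_query</code> and the quote escaping are assumptions, not part of the original <code>DatabaseConnection</code> module, and a parameterized query would be preferable if that module supports one. </p>

```python
# Template for the per-PDB copy count; only the molecule name changes.
QUERY_TEMPLATE = """SELECT Pdb, count(Pdb) as Mol_x_pdb,
max(Organism_TaxID) as TaxID,
max(Organism_Scientific) as Scientific_Name,
max(Organism_common) as Common_Name,
min(PDB_Mol_Id) as Mol_Id
FROM epitope_analyzer.PDB_Mol_Info
WHERE Mol_Name = '{mol_name}'
GROUP BY Pdb
ORDER BY Mol_x_pdb DESC;"""


def copies_per_pdb_query(mol_name: str) -> str:
    """Return the per-PDB copy-count query for one molecule name.

    Hypothetical helper: it only fills in the template above.
    """
    # Double any single quote so the name stays a valid SQL string literal.
    # This is basic escaping, not a substitute for parameterized queries.
    safe_name = mol_name.replace("'", "''")
    return QUERY_TEMPLATE.format(mol_name=safe_name)


query = copies_per_pdb_query("SPIKE GLYCOPROTEIN")
print(query)
```

<p> The resulting string can be passed to <code>db.getDataFromAQuerry_descrip</code> in the same way as the hand-written queries. </p>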

In [21]:
sql = """SELECT Pdb, count(Pdb) as Mol_x_pdb,
max(Organism_TaxID) as TaxID,
max(Organism_Scientific) as Scientific_Name,
max(Organism_common) as Common_Name,
min(PDB_Mol_Id) as Mol_Id
FROM epitope_analyzer.PDB_Mol_Info
WHERE Mol_Name = 'BETA-2-MICROGLOBULIN'
GROUP BY Pdb
ORDER BY Mol_x_Pdb DESC;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,Pdb,Mol_x_pdb,TaxID,Scientific_Name,Common_Name,Mol_Id
0,6NCA,20,9606,HOMO SAPIENS,HUMAN,3
1,4L3C,14,9606,HOMO SAPIENS,HUMAN,2
2,4L29,14,9606,HOMO SAPIENS,HUMAN,2
3,3CIQ,12,9606,HOMO SAPIENS,HUMAN,1
4,6GK3,8,9606,HOMO SAPIENS,HUMAN,1
...,...,...,...,...,...,...
1239,7C9V,1,9606,HOMO SAPIENS,HUMAN,6
1240,6LA6,1,9606,HOMO SAPIENS,HUMAN,6
1241,6ILM,1,9606,HOMO SAPIENS,HUMAN,6
1242,6LA7,1,9606,HOMO SAPIENS,HUMAN,6


<p> BETA-2-MICROGLOBULIN is present in 1244 different PDB files </p>

In [22]:
sql = """SELECT Pdb, count(Pdb) as Mol_x_pdb,
max(Organism_TaxID) as TaxID,
max(Organism_Scientific) as Scientific_Name,
max(Organism_common) as Common_Name,
min(PDB_Mol_Id) as Mol_Id
FROM epitope_analyzer.PDB_Mol_Info
WHERE Mol_Name = 'SPIKE GLYCOPROTEIN'
GROUP BY Pdb
ORDER BY Mol_x_Pdb DESC;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,Pdb,Mol_x_pdb,TaxID,Scientific_Name,Common_Name,Mol_Id
0,7TF2,6,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,,1
1,5W9I,6,1335626,MIDDLE EAST RESPIRATORY SYNDROME-RELATED CORON...,,1
2,7TF4,6,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,,1
3,7TF1,6,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,,1
4,7TF5,6,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,,1
...,...,...,...,...,...,...
379,7JN5,1,694009,HUMAN SARS CORONAVIRUS,SARS-COV,3
380,7K8M,1,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,2019-NCOV,3
381,7JV2,1,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,2019-NCOV,3
382,7JVA,1,2697049,SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2;,2019-NCOV,3


<p> SPIKE GLYCOPROTEIN is present in 384 different PDB files </p>

In [23]:
sql = """SELECT Pdb, count(Pdb) as Mol_x_pdb,
max(Organism_TaxID) as TaxID,
max(Organism_Scientific) as Scientific_Name,
max(Organism_common) as Common_Name,
min(PDB_Mol_Id) as Mol_Id
FROM epitope_analyzer.PDB_Mol_Info
WHERE Mol_Name = 'CELLULAR TUMOR ANTIGEN P53'
GROUP BY Pdb
ORDER BY Mol_x_Pdb DESC;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,Pdb,Mol_x_pdb,TaxID,Scientific_Name,Common_Name,Mol_Id
0,4D1M,12,7955,DANIO RERIO,ZEBRAFISH,1
1,2H1L,12,9606,HOMO SAPIENS,HUMAN,2
2,7BWN,8,9606,HOMO SAPIENS,HUMAN,2
3,4CZ7,6,7955,DANIO RERIO,ZEBRAFISH,1
4,4D1L,6,7955,DANIO RERIO,ZEBRAFISH,1
...,...,...,...,...,...,...
178,6S9Q,1,9606,HOMO SAPIENS,HUMAN,2
179,6RX2,1,9606,HOMO SAPIENS,HUMAN,2
180,6RKK,1,9606,HOMO SAPIENS,HUMAN,2
181,6RWI,1,9606,HOMO SAPIENS,HUMAN,2


<p> CELLULAR TUMOR ANTIGEN P53 is present in 183 different PDB files </p>
<p> After analyzing the 3 most frequent molecules, we observe that each of them is distributed among many PDB files.
<br> There are some exceptions (6NCA, 7TF2, 4D1M), but not enough to change the overall distribution of the molecules.</p>
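
<p> The conclusion above can be backed with a few numbers computed from any of the per-PDB tables (columns <code>Pdb</code> and <code>Mol_x_pdb</code>). The sketch below is one possible way to do it: the function name <code>summarize_copies</code> is hypothetical, and the data frame holds toy values for illustration only, not values from the database. </p>

```python
import pandas as pd


def summarize_copies(per_pdb: pd.DataFrame) -> dict:
    """Summarize how the copies of one molecule spread across PDB entries."""
    counts = per_pdb["Mol_x_pdb"]
    return {
        # Number of PDB entries that contain the molecule
        "pdb_files": int(len(counts)),
        # Total number of rows (copies) over all entries
        "total_copies": int(counts.sum()),
        # Fraction of entries holding exactly one copy
        "single_copy_fraction": float((counts == 1).mean()),
        # Share of all copies held by the single largest entry
        "largest_entry_share": float(counts.max() / counts.sum()),
    }


# Toy values for illustration only, not taken from the database.
toy = pd.DataFrame(
    {"Pdb": ["AAAA", "BBBB", "CCCC", "DDDD"], "Mol_x_pdb": [4, 2, 1, 1]}
)
summary = summarize_copies(toy)
print(summary)
```

<p> A high single-copy fraction together with a small largest-entry share would support the reading that a molecule is spread over many PDB files rather than concentrated in a few complexes. </p>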

<h2> Molecule organism distribution </h2>

In [27]:
sql = """SELECT count(Organism_TaxID) as total,
Organism_TaxID as TaxID,
max(Organism_Scientific) as Scientific_Name,
max(Organism_common),
max(Mol_Name) as Molecule,
min(PDB_Mol_Id) as Mol_Id
FROM epitope_analyzer.PDB_Mol_Info
GROUP BY TaxID
ORDER BY total DESC
LIMIT 150;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,total,TaxID,Scientific_Name,max(Organism_common),Molecule,Mol_Id
0,32396,9606,HUMAN,"MOUSE,HUMAN",[PYRUVATE DEHYDROGENASE [LIPOAMIDE]] KINASE IS...,1
1,8408,10090,"MUS MUSCULUS, HOMO SAPIENS","MOUSE, MOUSE",ZV-67 ANTIBODY FAB LIGHT CHAIN,1
2,1995,9913,BOS TAURUS,COW,"VITAMIN D-DEPENDENT CALCIUM-BINDING PROTEIN, I...",1
3,1771,,TRYPANOSOMA CRUZI,ZEBRAFISH,ZAP-70 PEPTIDE,1
4,1656,559292,SACCHAROMYCES CEREVISIAE S288C,YEAST,UBIQUITIN-LIKE PROTEIN SMT3,1
...,...,...,...,...,...,...
145,24,274,THERMUS THERMOPHILUS,,TT0826,1
146,24,6100,AEQUOREA VICTORIA,JELLYFISH,GREEN FLUORESCIENT PROTEIN UV,1
147,24,10559,BOVINE PAPILLOMAVIRUS TYPE 1,,REPLICATION PROTEIN E1,1
148,24,11786,MURINE LEUKEMIA VIRUS,,GAG PROTEIN,1


In [25]:
import plotly.express as px

# Bar chart of the number of molecules per organism
fig = px.bar(p_data, x="Scientific_Name", y="total")
fig.show()

<h3> Search by taxonomy id </h3>
<p> Return the molecules that belong to a given taxonomy ID (here 9913, BOS TAURUS) </p>

In [31]:
sql = """SELECT Mol_Name, count(Mol_Name) as total
FROM epitope_analyzer.PDB_Mol_Info
WHERE Organism_TaxID like '%9913%'
GROUP BY Mol_Name
ORDER BY total DESC;"""
data, titles = db.getDataFromAQuerry_descrip(sql)
p_data = pd.DataFrame( data, columns=titles)
p_data

Unnamed: 0,Mol_Name,total
0,CYTOCHROME C OXIDASE SUBUNIT 1,94
1,CYTOCHROME C OXIDASE SUBUNIT 2,94
2,CYTOCHROME C OXIDASE SUBUNIT 3,94
3,CYTOCHROME C OXIDASE SUBUNIT 6B1,78
4,CYTOCHROME C OXIDASE SUBUNIT 6C,76
...,...,...
198,F145 VH,1
199,B77 VH,1
200,RHODOPSIN,1
201,F145 VL,1
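
<p> The taxonomy search above uses a substring pattern (<code>like '%9913%'</code>), which would also match any longer ID that merely contains those digits. If that is not intended, the match can be made exact per ID. The sketch below shows the idea in pandas; it assumes that <code>Organism_TaxID</code> may hold several comma-separated IDs, which is a guess about the data, and both the helper <code>has_taxid</code> and the toy rows are hypothetical. </p>

```python
import pandas as pd


def has_taxid(taxid_field, wanted: str) -> bool:
    """Return True if `wanted` is one of the comma-separated IDs in the field."""
    # Missing values never match
    if taxid_field is None or pd.isna(taxid_field):
        return False
    # Split on commas and compare whole tokens, not substrings
    tokens = [token.strip() for token in str(taxid_field).split(",")]
    return wanted in tokens


# Toy rows for illustration only, not taken from the database.
toy = pd.DataFrame(
    {
        "Mol_Name": ["MOL A", "MOL B", "MOL C", "MOL D"],
        "Organism_TaxID": ["9913", "99130", "9606, 9913", None],
    }
)
mask = toy["Organism_TaxID"].apply(lambda field: has_taxid(field, "9913"))
matches = toy.loc[mask, "Mol_Name"].tolist()
print(matches)
```

<p> With these toy rows, a substring search would also return MOL B, while the token comparison keeps only the rows whose ID list contains 9913 itself. </p>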
