# Exploratory data analysis of T-cell epitopes from IEDB

The dataset was downloaded from the IEDB [resource](http://www.iedb.org/database_export_v3.php).

The dataset has a two-level column index, which makes it convenient to query.
The selector further down builds the final dataset by keeping only the rows where:
1. The epitope's Object Type is Linear peptide
2. The epitope's Organism Name is not Homo sapiens
3. The Host Name is Homo sapiens, Homo sapiens Black, or Homo sapiens Caucasian
4. The 1st in vivo Process Type is Occurrence of infectious disease
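
To illustrate how a two-level column index is queried, here is a minimal sketch on a toy frame. The column names mirror the ones used in this notebook, but the rows are made up for illustration and are not taken from IEDB. It selects columns with `(level 0, level 1)` tuples, which is equivalent to the attribute-style access used later in the notebook.

```python
import pandas as pd

# Toy frame with the same two-level column layout (rows are invented for illustration).
columns = pd.MultiIndex.from_tuples([
    ("Epitope", "Object Type"),
    ("Epitope", "Organism Name"),
    ("Host", "Name"),
    ("1st in vivo Process", "Process Type"),
])
toy = pd.DataFrame(
    [
        ["Linear peptide", "Hepatitis B virus", "Homo sapiens", "Occurrence of infectious disease"],
        ["Linear peptide", "Homo sapiens", "Homo sapiens", "Occurrence of autoimmune disease"],
        ["Discontinuous peptide", "Hepatitis B virus", "Mus musculus", "Administration in vivo"],
        ["Linear peptide", "Hepatitis E virus", "Homo sapiens Caucasian", "Occurrence of infectious disease"],
    ],
    columns=columns,
)

human_hosts = ["Homo sapiens", "Homo sapiens Black", "Homo sapiens Caucasian"]

# A (level 0, level 1) tuple selects a single column as a Series;
# the four conditions are combined element-wise with &.
mask = (
    (toy[("Epitope", "Object Type")] == "Linear peptide")
    & (toy[("Epitope", "Organism Name")] != "Homo sapiens")
    & (toy[("Host", "Name")].isin(human_hosts))
    & (toy[("1st in vivo Process", "Process Type")] == "Occurrence of infectious disease")
)
selected = toy[mask]
print(selected.shape)
```

Tuple access also works for level-0 names that are not valid Python identifiers, such as `1st in vivo Process`.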

In [5]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [6]:
# The file is a .tar.gz read as a plain gzip stream, so tar metadata may leak into the header row.
df = pd.read_csv('data/tcell_full_v3.tar.gz', compression="gzip", header=[0, 1])

In [8]:
df.shape

(365868, 143)

In [9]:
df.columns.get_level_values(0).unique()

Index(['./._tcell_full_v3.csv', 'Reference', 'Epitope', 'Related Object',
       'Host', '1st in vivo Process', '2nd in vivo Process',
       'In Vitro Process', 'Adoptive Transfer', 'Immunization Comments',
       'Assay', 'Effector Cells', 'TCR', 'Antigen Presenting Cells', 'MHC',
       'Assay Antigen', 'Assay Comments'],
      dtype='object')
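
The first entry, `'./._tcell_full_v3.csv'`, looks like a tar member name rather than a real column group. It appears to be a macOS AppleDouble metadata entry that got into the header because the archive was read as a plain gzip stream. A possible way to avoid this, sketched below, is to open the archive with `tarfile` and pass only the real CSV member to pandas. The helper name `read_csv_from_tar` and the in-memory demo archive are hypothetical and are not part of the original notebook.

```python
import io
import os
import tarfile

import pandas as pd


def read_csv_from_tar(path=None, fileobj=None, **read_csv_kwargs):
    """Read the first real .csv member of a tar archive, skipping '._*' metadata entries."""
    with tarfile.open(name=path, fileobj=fileobj, mode="r:*") as tar:
        for member in tar.getmembers():
            base = os.path.basename(member.name)
            if member.isfile() and base.endswith(".csv") and not base.startswith("._"):
                return pd.read_csv(tar.extractfile(member), **read_csv_kwargs)
    raise FileNotFoundError("no .csv member found in archive")


def _add_member(tar, name, payload):
    # Helper for the demo only: write raw bytes into the archive under a given name.
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny in-memory archive that mimics the suspected layout (contents are made up).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    _add_member(tar, "./._demo.csv", b"not a csv, just metadata bytes")
    _add_member(
        tar,
        "./demo.csv",
        b"Epitope,Epitope,Host\nDescription,Organism Name,Name\nAAA,Virus X,Homo sapiens\n",
    )
buf.seek(0)

demo = read_csv_from_tar(fileobj=buf, header=[0, 1])
print(list(demo.columns.get_level_values(0).unique()))
```

With the real file, the call would presumably be `read_csv_from_tar('data/tcell_full_v3.tar.gz', header=[0, 1])`. The column count and any column-position-based code further down would then need to be rechecked, since the outputs shown in this notebook come from the original loading step.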

In [11]:
ndf = df[(df.Epitope['Object Type'] == 'Linear peptide') & 
         (df.Epitope['Organism Name'] != 'Homo sapiens') & 
         (df.Host.Name.isin(['Homo sapiens', 'Homo sapiens Black', 'Homo sapiens Caucasian'])) & 
         (df['1st in vivo Process']['Process Type'] == 'Occurrence of infectious disease')]
ndf.shape

(57836, 143)

In [14]:
ndf.Assay["Qualitative Measure"].value_counts()

Negative                 40680
Positive                 16410
Positive-Low               543
Positive-High              167
Positive-Intermediate       36
Name: Qualitative Measure, dtype: int64

In [17]:
ndf[ndf.Assay["Qualitative Measure"].str.startswith("Positive")].Epitope.Description.nunique()

7835

In [18]:
ndf[ndf.Assay["Qualitative Measure"].str.startswith("Negative")].Epitope.Description.nunique()

32878
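
The two counts above are computed separately, so an epitope that was tested more than once can show up in both of them. As a minimal sketch on made-up data (the peptide strings and the flat column layout are assumptions for illustration), the `Positive*` labels can be collapsed into a binary label, counted per label, and then checked for overlap:

```python
import pandas as pd

# Invented assay rows: one epitope description and one qualitative measure per row.
toy = pd.DataFrame({
    "Description": ["AAA", "AAA", "BBB", "CCC", "CCC", "DDD"],
    "Qualitative Measure": [
        "Positive", "Negative", "Positive-Low", "Negative", "Negative", "Positive-High",
    ],
})

# Collapse Positive, Positive-Low, Positive-High, ... into a single "Positive" label.
is_positive = toy["Qualitative Measure"].str.startswith("Positive")
toy["Label"] = is_positive.map({True: "Positive", False: "Negative"})

# Number of distinct epitopes per binary label.
unique_per_label = toy.groupby("Label")["Description"].nunique()

# Epitopes that have at least one positive and at least one negative assay.
positive = set(toy.loc[toy["Label"] == "Positive", "Description"])
negative = set(toy.loc[toy["Label"] == "Negative", "Description"])
both = positive & negative

print(unique_per_label)
print(sorted(both))
```

Whether such conflicting epitopes should be dropped, kept as positive, or resolved by majority vote is a modelling decision that this notebook does not settle.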

In [21]:
ndf[ndf.Assay["Qualitative Measure"].str.startswith("Negative")].Epitope["Organism Name"].value_counts()[:10]

Mycobacterium tuberculosis                               21160
Hepatitis C virus subtype 1a                              4694
Human alphaherpesvirus 2                                  2306
Hepatitis C virus subtype 1b                              1829
Brucella melitensis                                        970
Hepatitis E virus                                          901
Hepacivirus C                                              646
Severe acute respiratory syndrome-related coronavirus      552
Japanese encephalitis virus                                444
Hepatitis B virus                                          425
Name: Organism Name, dtype: int64

In [43]:
# ndf[(ndf.Assay["Qualitative Measure"].str.startswith("Negative")) &
#     (ndf.MHC.Class == "I")].Epitope["Organism Name"].nunique()

In [42]:
# ndf[(ndf.Assay["Qualitative Measure"].str.startswith("Positive")) &
#     (ndf.MHC.Class == "I")].Epitope["Organism Name"].nunique()

In [41]:
# ndf[ndf.Assay["Qualitative Measure"].str.startswith("Positive")].Epitope["Organism Name"].value_counts()[:10]

In [40]:
# ndf[(ndf.Epitope["Organism Name"] == "Mycobacterium tuberculosis") &
#     (ndf.Assay["Qualitative Measure"] == "Negative")]["MHC"]["Class"].value_counts()

In [39]:
# ndf[(ndf.Epitope["Organism Name"] == "Mycobacterium tuberculosis") &
#     (ndf.Assay["Qualitative Measure"] == "Negative")].Assay[]