In [1]:
import pertpy as pt
import scanpy as sc
import pandas as pd

Global seed set to 0


This notebook explores the CellLineMetaData.lookup function in pertpy. In general, the lookup object offers the following functions:
- Summarize the metadata in the database, e.g. the number of cell lines, the number of genes/proteins measured
- Give an overview of the possible reference_id (cell line identifiers in the metadata)
- Given a list of unique query_id values (cell line identifiers in adata.obs), report how many of them match identifiers in the metadata
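
The matching step in the last bullet can be thought of as a set comparison between query identifiers and reference identifiers. The following is a minimal sketch in plain Python, not pertpy's implementation: `count_matches` is a hypothetical helper, and the identifier `ACH-999999` is made up for illustration.

```python
def count_matches(query_ids, reference_ids):
    """Return (n_found, n_not_found) for the unique query ids."""
    unique_queries = set(query_ids)      # duplicates in adata.obs collapse to one id
    reference = set(reference_ids)       # identifiers available in the metadata
    found = unique_queries & reference   # ids present on both sides
    return len(found), len(unique_queries - found)


# Toy data: the first three ids appear in the DepMap overview, the last one is invented.
reference_ids = ["ACH-000016", "ACH-000032", "ACH-000033"]
query_ids = ["ACH-000016", "ACH-000016", "ACH-999999"]

n_found, n_not_found = count_matches(query_ids, reference_ids)
print(f"{n_not_found} cell lines are not found, {n_found} cell lines are found")
```

Working on unique identifiers keeps the counts independent of how many cells share a cell line.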

In [2]:
# Here we use two AnnData objects as examples
adata_dialogue = pt.dt.dialogue_example() 
adata_dialogue

AnnData object with n_obs × n_vars = 5374 × 6329
    obs: 'nCount_RNA', 'nFeature_RNA', 'cellQ', 'gender', 'location', 'clinical.status', 'cell.subtypes', 'pathology', 'origin', 'subset'
    var: 'name'

In [3]:
# Label every cell as 'MCF7' so that this dataset has a cell line identifier to query
adata_dialogue.obs['cell_line_name'] = 'MCF7'

In [4]:
adata_mc = pt.dt.mcfarland_2020()
adata_mc

AnnData object with n_obs × n_vars = 182875 × 32738
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID'
    var: 'ensembl_id', 'ncounts', 'ncells'

In [5]:
# create the metadata object
pt_metadata = pt.tl.CellLineMetaData()

In [6]:
# In order to annotate cell line metadata, 
# we can create a LookUp object specific for CellLineMetaData by calling lookup()
cl_lookup = pt_metadata.lookup()

In [9]:
# Each lookup function is named after the type of metadata it describes. It:
# - summarizes the available metadata for the specified source
# - gives an overview of the reference identifiers available in the metadata
# - prints the default parameters used for annotation
cl_lookup.cell_lines()

To summarize: in the DepMap cell line annotation you can find: 
1840 cell lines
29 meta data, including 
- DepMap_ID
- cell_line_name
- stripped_cell_line_name
- CCLE_Name
- alias
- COSMICID
- sex
- source
- RRID
- WTSI_Master_Cell_ID
- sample_collection_site
- primary_or_metastasis
- primary_disease
- Subtype
- age
- Sanger_Model_ID
- depmap_public_comments
- lineage
- lineage_subtype
- lineage_sub_subtype
- lineage_molecular_subtype
- default_growth_pattern
- model_manipulation
- model_manipulation_details
- patient_id
- parent_depmap_id
- Cellosaurus_NCIt_disease
- Cellosaurus_NCIt_id
- Cellosaurus_issues
Overview of possible cell line reference identifiers: 
    DepMap_ID cell_line_name stripped_cell_line_name                                    CCLE_Name
0  ACH-000016         SLR 21                   SLR21                                 SLR21_KIDNEY
1  ACH-000032     MHH-CALL-3                MHHCALL3  MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
2  ACH-000033      NCI-H1819       

In [10]:
cl_lookup.cell_lines(cell_line_source = "Cancerrxgene")
# For Cancerrxgene, stripped_cell_line_name is the default reference identifier

To summarize: in the cell line annotation from the project Genomics of Drug Sensitivity in Cancer you can find: 
978 cell lines
9 meta data, including 
- TCGA Classfication
- Model ID
- Tissue
- COSMIC ID
- Tissue sub-type
- stripped_cell_line_name
- cell_line_name
- GDSC1
- GDSC2
Overview of possible cell line reference identifiers: 
  cell_line_name stripped_cell_line_name   Model ID  COSMIC ID
0        SNU-283                  SNU283  SIDM00215    1659929
1     MHH-CALL-4                MHHCALL4  SIDM00376     908133
2           SC-1                     SC1  SIDM00400    1331030
3          NCCIT                   NCCIT  SIDM00655     908441
4       BONNA-12                 BONNA12  SIDM00984     906695
Default is to annotate cell line meta data from DepMap database. 
However, to annotate cell line meta data from the project Genomics of Drug Sensitivity in Cancer, 
Default parameters are:
- query_id: stripped_cell_line_name
- reference_id: stripped_cell_line_name
- cell_line_informat
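
Judging from the rows shown above, stripped_cell_line_name appears to be cell_line_name with separators such as hyphens and spaces removed. Below is a minimal sketch of that normalization, assuming only non-alphanumeric characters are dropped. The helper `strip_cell_line_name` is hypothetical and not part of pertpy, and the databases may apply further rules (for example, case changes) that this sketch does not cover.

```python
import re


def strip_cell_line_name(name):
    """Remove every character that is not a letter or digit (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


# Names taken from the overview tables printed in this notebook
for name in ["SNU-283", "MHH-CALL-4", "SC-1", "SLR 21"]:
    print(name, "->", strip_cell_line_name(name))
```

Normalizing names this way before a lookup can help when the query names use different punctuation than the reference names.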

In [11]:
# You can also pass a list of unique query ids to check how many cell lines are matched in the metadata.
# The result can differ depending on the identifier and the source of cell line annotation you choose.
cl_lookup.cell_lines(query_id_list=adata_dialogue.obs['cell_line_name'].unique(), reference_id="cell_line_name") # Default reference_id is DepMap_ID

To summarize: in the DepMap cell line annotation you can find: 
1840 cell lines
29 meta data, including 
- DepMap_ID
- cell_line_name
- stripped_cell_line_name
- CCLE_Name
- alias
- COSMICID
- sex
- source
- RRID
- WTSI_Master_Cell_ID
- sample_collection_site
- primary_or_metastasis
- primary_disease
- Subtype
- age
- Sanger_Model_ID
- depmap_public_comments
- lineage
- lineage_subtype
- lineage_sub_subtype
- lineage_molecular_subtype
- default_growth_pattern
- model_manipulation
- model_manipulation_details
- patient_id
- parent_depmap_id
- Cellosaurus_NCIt_disease
- Cellosaurus_NCIt_id
- Cellosaurus_issues
Overview of possible cell line reference identifiers: 
    DepMap_ID cell_line_name stripped_cell_line_name                                    CCLE_Name
0  ACH-000016         SLR 21                   SLR21                                 SLR21_KIDNEY
1  ACH-000032     MHH-CALL-3                MHHCALL3  MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
2  ACH-000033      NCI-H1819       

In [12]:
cl_lookup.cell_lines(query_id_list=adata_mc.obs['DepMap_ID'].unique())

To summarize: in the DepMap cell line annotation you can find: 
1840 cell lines
29 meta data, including 
- DepMap_ID
- cell_line_name
- stripped_cell_line_name
- CCLE_Name
- alias
- COSMICID
- sex
- source
- RRID
- WTSI_Master_Cell_ID
- sample_collection_site
- primary_or_metastasis
- primary_disease
- Subtype
- age
- Sanger_Model_ID
- depmap_public_comments
- lineage
- lineage_subtype
- lineage_sub_subtype
- lineage_molecular_subtype
- default_growth_pattern
- model_manipulation
- model_manipulation_details
- patient_id
- parent_depmap_id
- Cellosaurus_NCIt_disease
- Cellosaurus_NCIt_id
- Cellosaurus_issues
Overview of possible cell line reference identifiers: 
    DepMap_ID cell_line_name stripped_cell_line_name                                    CCLE_Name
0  ACH-000016         SLR 21                   SLR21                                 SLR21_KIDNEY
1  ACH-000032     MHH-CALL-3                MHHCALL3  MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
2  ACH-000033      NCI-H1819       

In [13]:
cl_lookup.cell_lines(query_id_list=adata_mc.obs['DepMap_ID'].unique(), cell_line_source="Cancerrxgene")
# DepMap_ID is not available for Cancerrxgene, so stripped_cell_line_name is used by default

To summarize: in the cell line annotation from the project Genomics of Drug Sensitivity in Cancer you can find: 
978 cell lines
9 meta data, including 
- TCGA Classfication
- Model ID
- Tissue
- COSMIC ID
- Tissue sub-type
- stripped_cell_line_name
- cell_line_name
- GDSC1
- GDSC2
Overview of possible cell line reference identifiers: 
  cell_line_name stripped_cell_line_name   Model ID  COSMIC ID
0        SNU-283                  SNU283  SIDM00215    1659929
1     MHH-CALL-4                MHHCALL4  SIDM00376     908133
2           SC-1                     SC1  SIDM00400    1331030
3          NCCIT                   NCCIT  SIDM00655     908441
4       BONNA-12                 BONNA12  SIDM00984     906695
Default is to annotate cell line meta data from DepMap database. 
However, to annotate cell line meta data from the project Genomics of Drug Sensitivity in Cancer, 
Default parameters are:
- query_id: stripped_cell_line_name
- reference_id: stripped_cell_line_name
- cell_line_informat

In [14]:
# We can do the same for the bulk RNA expression annotation
cl_lookup.bulk_rna_expression()

To summarize: in the RNA-Seq Data for broad cell line only, you can find: 
925 cell lines
37263 genes
5 meta data, including 
- model_id
- model_name
- gene_id
- read_count
- fpkm
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00499      22RV1
1  SIDM00499      22RV1
2  SIDM00499      22RV1
3  SIDM00499      22RV1
4  SIDM00499      22RV1
Default parameters to annotate bulk RNA expression: 
- query_id: cell_line_name
- reference_id: model_name
- cell_line_source: broad
- bulk_rna_information: read_count


In [15]:
cl_lookup.bulk_rna_expression(cell_line_source="sanger")

To summarize: in the RNA-Seq Data for Sanger cell line only, you can find: 
442 cell lines
37263 genes
5 meta data, including 
- model_id
- model_name
- gene_id
- read_count
- fpkm
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM01240      451Lu
1  SIDM01240      451Lu
2  SIDM01240      451Lu
3  SIDM01240      451Lu
4  SIDM01240      451Lu
Default parameters to annotate bulk RNA expression: 
- query_id: cell_line_name
- reference_id: model_name
- cell_line_source: broad
- bulk_rna_information: read_count


In [16]:
cl_lookup.bulk_rna_expression(
    cell_line_source="broad",
    query_id_list=adata_dialogue.obs['cell_line_name'].unique(),
)

To summarize: in the RNA-Seq Data for broad cell line only, you can find: 
925 cell lines
37263 genes
5 meta data, including 
- model_id
- model_name
- gene_id
- read_count
- fpkm
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00499      22RV1
1  SIDM00499      22RV1
2  SIDM00499      22RV1
3  SIDM00499      22RV1
4  SIDM00499      22RV1
Default parameters to annotate bulk RNA expression: 
- query_id: cell_line_name
- reference_id: model_name
- cell_line_source: broad
- bulk_rna_information: read_count
0 cell lines are not found in the metadata.
1 cell lines are found! 


In [17]:
cl_lookup.bulk_rna_expression(
    cell_line_source="sanger",
    query_id_list=adata_dialogue.obs['cell_line_name'].unique(),
)

To summarize: in the RNA-Seq Data for Sanger cell line only, you can find: 
442 cell lines
37263 genes
5 meta data, including 
- model_id
- model_name
- gene_id
- read_count
- fpkm
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM01240      451Lu
1  SIDM01240      451Lu
2  SIDM01240      451Lu
3  SIDM01240      451Lu
4  SIDM01240      451Lu
Default parameters to annotate bulk RNA expression: 
- query_id: cell_line_name
- reference_id: model_name
- cell_line_source: broad
- bulk_rna_information: read_count
1 cell lines are not found in the metadata.
0 cell lines are found! 


In [18]:
# For the other dataset (adata_mc), no matching cell lines are found in the metadata.
# One option is to annotate cell line metadata through annotate_cell_line first,
# to see whether that yields additional cell line identifiers.
# This notebook focuses on the lookup function only, so annotate_cell_line is not called here.
cl_lookup.bulk_rna_expression(
    cell_line_source="broad",
    reference_id="model_id",
    query_id_list=adata_mc.obs['DepMap_ID'].unique(),
)

To summarize: in the RNA-Seq Data for broad cell line only, you can find: 
925 cell lines
37263 genes
5 meta data, including 
- model_id
- model_name
- gene_id
- read_count
- fpkm
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00499      22RV1
1  SIDM00499      22RV1
2  SIDM00499      22RV1
3  SIDM00499      22RV1
4  SIDM00499      22RV1
Default parameters to annotate bulk RNA expression: 
- query_id: cell_line_name
- reference_id: model_name
- cell_line_source: broad
- bulk_rna_information: read_count
209 cell lines are not found in the metadata.
0 cell lines are found! 
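
The misses above are consistent with an identifier-scheme mismatch: the DepMap_ID values shown earlier start with `ACH-`, while the model_id values start with `SIDM`. A quick sanity check before running a lookup is to compare identifier prefixes. This is a minimal sketch in plain Python; `id_prefixes` is a hypothetical helper, not a pertpy function, and the 4-character prefix length is an assumption chosen to fit these two schemes.

```python
from collections import Counter


def id_prefixes(ids, length=4):
    """Count how often each leading substring occurs among the identifiers."""
    return Counter(str(identifier)[:length] for identifier in ids)


# Example identifiers copied from the overview tables in this notebook
query_ids = ["ACH-000016", "ACH-000032", "ACH-000033"]   # DepMap_ID style
reference_ids = ["SIDM00499", "SIDM01240", "SIDM00483"]  # model_id style

query_prefixes = id_prefixes(query_ids)
reference_prefixes = id_prefixes(reference_ids)
shared = set(query_prefixes) & set(reference_prefixes)

print("query prefixes:", dict(query_prefixes))
print("reference prefixes:", dict(reference_prefixes))
print("schemes overlap:", bool(shared))
```

If the prefixes do not overlap, no exact-match lookup between the two columns can succeed, and a different reference_id or an intermediate annotation step is needed.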


In [19]:
cl_lookup.protein_expression()

To summarize: in the proteomics data you can find: 
948 cell lines
8457 proteins
6 meta data, including 
- uniprot_id
- model_id
- model_name
- protein_intensity
- zscore
- symbol
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00483    SK-GT-4
1  SIDM00689        JM1
2  SIDM01259      GR-ST
3  SIDM00846       HeLa
4  SIDM00958     CML-T1
Default parameters to annotate protein expression: 
- query_id: cell_line_name
- reference_id: model_name
- bulk_rna_information: read_count
- protein_information: protein_intensity
- protein_id: uniprot_id


In [20]:
cl_lookup.protein_expression(query_id_list = adata_dialogue.obs['cell_line_name'].unique())

To summarize: in the proteomics data you can find: 
948 cell lines
8457 proteins
6 meta data, including 
- uniprot_id
- model_id
- model_name
- protein_intensity
- zscore
- symbol
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00483    SK-GT-4
1  SIDM00689        JM1
2  SIDM01259      GR-ST
3  SIDM00846       HeLa
4  SIDM00958     CML-T1
Default parameters to annotate protein expression: 
- query_id: cell_line_name
- reference_id: model_name
- bulk_rna_information: read_count
- protein_information: protein_intensity
- protein_id: uniprot_id
0 cell lines are not found in the metadata.
1 cell lines are found! 


In [21]:
cl_lookup.protein_expression(
    query_id_list=adata_mc.obs['DepMap_ID'].unique(),
    reference_id="model_id",
)

To summarize: in the proteomics data you can find: 
948 cell lines
8457 proteins
6 meta data, including 
- uniprot_id
- model_id
- model_name
- protein_intensity
- zscore
- symbol
Overview of possible cell line reference identifiers: 
    model_id model_name
0  SIDM00483    SK-GT-4
1  SIDM00689        JM1
2  SIDM01259      GR-ST
3  SIDM00846       HeLa
4  SIDM00958     CML-T1
Default parameters to annotate protein expression: 
- query_id: cell_line_name
- reference_id: model_name
- bulk_rna_information: read_count
- protein_information: protein_intensity
- protein_id: uniprot_id
209 cell lines are not found in the metadata.
0 cell lines are found! 


In [22]:
cl_lookup.ccle_expression()

To summarize: in the CCLE expression data you can find: 
1406 cell lines
19221 genes
Only DepMap_ID is allowed to use as `reference_id`


In [23]:
cl_lookup.ccle_expression(query_id_list = adata_dialogue.obs['cell_line_name'].unique())

To summarize: in the CCLE expression data you can find: 
1406 cell lines
19221 genes
Only DepMap_ID is allowed to use as `reference_id`
1 cell lines are not found in the metadata.
0 cell lines are found! 


In [24]:
cl_lookup.ccle_expression(query_id_list = adata_mc.obs['DepMap_ID'].unique())

To summarize: in the CCLE expression data you can find: 
1406 cell lines
19221 genes
Only DepMap_ID is allowed to use as `reference_id`
1 cell lines are not found in the metadata.
208 cell lines are found! 


In [25]:
cl_lookup.driver_genes()

To summarize: in the DepMap_Sanger driver gene annotation for intogen genes, you can find: 
3333 driver genes
18 meta data including: 
- symbol
- transcript
- cohort
- cancer_type
- methods
- mutations
- samples
- %_samples_cohort
- qvalue_combination
- role
- cgc_gene
- cgc_cancer_gene
- domain
- 2d_clusters
- 3d_clusters
- excess_mis
- excess_non
- excess_spl


In [26]:
pt_metadata.driver_gene_intogen.columns

Index(['symbol', 'transcript', 'cohort', 'cancer_type', 'methods', 'mutations',
       'samples', '%_samples_cohort', 'qvalue_combination', 'role', 'cgc_gene',
       'cgc_cancer_gene', 'domain', '2d_clusters', '3d_clusters', 'excess_mis',
       'excess_non', 'excess_spl'],
      dtype='object')

In [27]:
pt_metadata.driver_gene_cosmic.columns

Index(['Gene Symbol', 'Name', 'Entrez GeneId', 'Genome Location', 'Tier',
       'Hallmark', 'Chr Band', 'Somatic', 'Germline', 'Tumour Types(Somatic)',
       'Tumour Types(Germline)', 'Cancer Syndrome', 'Tissue Type',
       'Molecular Genetics', 'Role in Cancer', 'Mutation Types',
       'Translocation Partner', 'Other Germline Mut', 'Other Syndrome',
       'Synonyms'],
      dtype='object')