## All Cancers Integration Notebook 01: Preprocessing 

This Jupyter Notebook preprocesses the AnnData objects from all paediatric cancer datasets and outputs a single concatenated AnnData object for scVI integration.

| Dataset GEO Term     | Sample Number (Patient Number) |
|----------------------|--------------------------------|
| GSE132509            | 11 (11)                        |
| GSE236351            | 7 (7)                          |
| GSE148218            | 8 (6)                          |
| GSE154109*           | 15 (15)                        |
| GSE235923            | 31 (20)                        |
| GSE235063            | 75 (28)                        |
| GSE227122            | 16 (11)                        |
| GSE102130*           | 10 (6)                         |
| GSE119926*           | 25 (25)                        |
| GSE155446/GSE156053  | 30 (28)                        |
| GSE141460            | 28 (21)                        |
| GSE189939            | 4 (4)                          |
| GSE125969/GSE126025  | 26 (26)                        |
| GSE231860/GSE231859  | 19 (19)                        |
| GSE221776            | 39 (39)                        |
| GSE198896*           | 14 (12)                        |
| GSE162454*           | 3 (3)                          |
| GSE152048*           | 6 (6)                          |
| GSE243347            | 27 (11)                        |
| GSE195709            | 4 (4)                          |
| GSE174376            | 18 (16)                        |
| EGAD00001008345*     | 28 (28)                        |
| GSE137804            | 22 (22)                        |
| GSE192906            | 10 (10)                        |
| GSE140819            | 13 (7)                         |
| GSE216176            | 17 (16)                        |
| GSE147766            | 19 (17)                        |
| PRJNA737188          | 2 (2)                          |
| GSE249995            | 4 (4)                          |
| GSE168434            | 10 (7)                         |
| GSE223373            | 3 (1)                          |

In total, 29 of the 31 datasets are usable.

In [16]:
import anndata
import scanpy as sc
import pandas as pd
from rich import print
from scipy.sparse import issparse

import warnings
warnings.filterwarnings("ignore")

In [17]:
# Load all AnnData objects into a list

from pathlib import Path

cancer_types = ['01_leukemia', '02_brain_tumor', '03_bone_cancer', '04_rhabdomyosarcoma', '05_neuroblastoma', '06_retinoblastoma', '07_kidney_cancer']
data_directory = Path('/scratch/user/s4543064/Xiaohan_Summer_Research/data')

adatas = {}

for cancer_type in cancer_types:
    print(cancer_type)
    
    cancer_directory = data_directory / cancer_type

    for dataset in cancer_directory.iterdir():
        print(dataset.stem, len(list(dataset.glob('*'))))

        for file in dataset.iterdir():
            if "_uni.h5ad" in file.name:
                sample_name = dataset.stem + '_' + file.stem
                adata = sc.read_h5ad(file)
                adata.var_names_make_unique()

                # Check if the index of obs has duplicates
                if adata.obs.index.duplicated().any():
                    print(file.stem, 'originally has duplicated index')
                adata.obs.index = adata.obs.index + '_' + adata.obs['sample_barcode'].astype(str)
                if adata.obs.index.duplicated().any():
                    print(file.stem, 'still has duplicated index')

                # Check if the count matrix has NaN values
                count_df = adata.to_df()
                if count_df.isnull().any().any():
                    print(file.stem, 'has NaN values in the count matrix')
                    # .X has no fillna(); fill on the DataFrame and write the (dense) values back
                    adata.X = count_df.fillna(0).to_numpy()

                # Check if the count matrix is a sparse matrix
                if not issparse(adata.X):
                    print(sample_name, 'has a wrong format for .X')

                adatas[sample_name] = adata

print(len(adatas))

In [18]:
# Sum the unique sample barcodes of every object: 528 in total, although only 491 distinct barcodes exist (explained below)
sample_count = 0
for sample in adatas:
    adata = adatas[sample]
    sample_count += len(adata.obs['sample_barcode'].unique())
sample_count

528

In [22]:
# Check whether any sample barcode appears in more than one AnnData object
samples = set()
for sample in adatas:
    sample_list = adatas[sample].obs['sample_barcode'].unique().tolist()
    for barcode in sample_list:
        if barcode in samples:
            print(sample, barcode)
    samples.update(sample_list)
print(len(samples))

For the dataset GSE221776, the provided data are split into separate CD4 and CD8 objects. Because both objects share the same donors, their sample barcodes are counted twice when the unique sample barcodes of all loaded AnnData objects are summed.

Thus, in total we have 491 samples.
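
The double counting described above can be illustrated with a minimal sketch. The dictionary holds toy data (object names and barcodes are hypothetical) standing in for the `sample_barcode` values of each loaded AnnData object. Summing the per-object unique counts overcounts barcodes shared between objects, whereas a set union counts each barcode once.

```python
from collections import Counter

# Hypothetical toy data: object name -> sample barcode of each cell.
barcodes_per_object = {
    "dataset_A_CD4": ["donor1", "donor1", "donor2"],
    "dataset_A_CD8": ["donor1", "donor2", "donor2"],
    "dataset_B": ["donor3"],
}

# Summing unique barcodes per object counts shared donors more than once.
summed = sum(len(set(b)) for b in barcodes_per_object.values())

# A set union counts every barcode exactly once.
distinct = set().union(*barcodes_per_object.values())

# Barcodes that occur in more than one object explain the difference.
occurrences = Counter(b for bs in barcodes_per_object.values() for b in set(bs))
shared = sorted(b for b, n in occurrences.items() if n > 1)

print(summed, len(distinct), shared)
```

In this toy case the summed count is larger than the number of distinct barcodes, and `shared` lists the barcodes responsible.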

In [4]:
adatas.keys()

dict_keys(['GSE235923_GSM7512002_Sample4D_uni', 'GSE235923_GSM7512005_Sample5R_uni', 'GSE235923_GSM7512020_Sample16E_uni', 'GSE235923_GSM7512000_Sample3D_uni', 'GSE235923_GSM7512018_Sample15E_uni', 'GSE235923_GSM7512021_Sample17D_uni', 'GSE235923_GSM7512025_Sample19D_uni', 'GSE235923_GSM7512007_Sample6E_uni', 'GSE235923_GSM7512026_Sample19E_uni', 'GSE235923_GSM7512003_Sample5D_uni', 'GSE235923_GSM7512017_Sample14E_uni', 'GSE235923_GSM7512006_Sample6D_uni', 'GSE235923_GSM7511999_Sample2D_uni', 'GSE235923_GSM7512027_Sample20D_uni', 'GSE235923_GSM7512011_Sample9D_uni', 'GSE235923_GSM7512009_Sample7D_uni', 'GSE235923_GSM7511998_Sample1D_uni', 'GSE235923_GSM7512004_Sample5E_uni', 'GSE235923_GSM7512012_Sample10D_uni', 'GSE235923_GSM7512019_Sample16D_uni', 'GSE235923_GSM7512016_Sample14D_uni', 'GSE235923_GSM7512023_Sample18Dx_uni', 'GSE235923_GSM7512028_Sample20E_uni', 'GSE235923_GSM7512013_Sample11D_uni', 'GSE235923_GSM7512024_Sample18E_uni', 'GSE235923_GSM7512010_Sample8D_uni', 'GSE235923_G

In [5]:
# Find an AnnData object that has a complete .var dataframe
adatas['GSE235923_GSM7512002_Sample4D_uni']

AnnData object with n_obs × n_vars = 4268 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode'
    var: 'gene_ids', 'feature_types'

In [6]:
adatas['GSE235923_GSM7512002_Sample4D_uni'].var

Unnamed: 0,gene_ids,feature_types
MIR1302-2HG,ENSG00000243485,Gene Expression
FAM138A,ENSG00000237613,Gene Expression
OR4F5,ENSG00000186092,Gene Expression
AL627309.1,ENSG00000238009,Gene Expression
AL627309.3,ENSG00000239945,Gene Expression
...,...,...
AC233755.2,ENSG00000277856,Gene Expression
AC233755.1,ENSG00000275063,Gene Expression
AC240274.1,ENSG00000271254,Gene Expression
AC213203.1,ENSG00000277475,Gene Expression


In [7]:
# Find out common genes among all AnnData objects
common_genes = set(adatas['GSE235923_GSM7512002_Sample4D_uni'].var_names)
for sample in adatas:
    common_genes.intersection_update(adatas[sample].var_names)
    # print(len(common_genes))
print(len(common_genes))

We have 7,662 common genes for all cancer datasets.
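
As a minimal sketch of the intersection step, the snippet below works on toy gene lists (the sample and gene names are only illustrative). It also sorts the result, which is an optional design choice and not what the notebook does: a plain `set` has no defined order, so sorting would make the gene order reproducible between sessions.

```python
from functools import reduce

# Hypothetical toy data: object name -> gene names in .var_names.
genes_per_object = {
    "sample_1": ["CD3E", "MS4A1", "LYZ", "GAPDH"],
    "sample_2": ["GAPDH", "LYZ", "CD3E"],
    "sample_3": ["LYZ", "GAPDH", "NKG7"],
}

# Intersect the gene sets of all objects.
common = reduce(set.intersection, (set(g) for g in genes_per_object.values()))

# Sort for a reproducible gene order.
common_sorted = sorted(common)
print(common_sorted)
```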

In [8]:
# Check the total number of cells at the beginning
total_cells = 0

# Iterate through each AnnData object in the list
for sample in adatas:
    # Add the number of cells in the current AnnData object to the total
    total_cells += adatas[sample].shape[0]  # 'shape[0]' gives the number of cells

# Print the total number of cells
print("Total number of cells:", total_cells)

At the beginning, we have a total of 1,944,497 cells.

### <span style="color:yellow">**Preprocessing:**</span> normalization & log transformation

Use a QC function adapted from the preprocessing module of dandelion to flag outlier cells based on the number of detected genes and the mitochondrial content.

Modified parameters:

`max_genes: int = 6000`

`mito_cutoff: Optional[int] = None`
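
The per-cell rule that results from these parameters can be sketched in a few lines. This is a simplified illustration, not the function itself: `flag_cell` and `in_keep_cluster` are hypothetical names, and `in_keep_cluster` stands for whether the cell falls into the mixture-model cluster that is kept.

```python
def flag_cell(n_genes: int, in_keep_cluster: bool,
              min_genes: int = 200, max_genes: int = 6000) -> bool:
    """Return True if the cell should be filtered out."""
    # A cell is flagged when its gene count is outside [min_genes, max_genes] ...
    outside_gene_range = (n_genes < min_genes) or (n_genes > max_genes)
    # ... or when it belongs to the cluster that is not kept.
    return outside_gene_range or not in_keep_cluster


flags = [
    flag_cell(150, True),    # too few genes
    flag_cell(2500, True),   # passes
    flag_cell(2500, False),  # wrong cluster
    flag_cell(7000, True),   # too many genes
]
print(flags)
```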

In [9]:
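from typing import Optional

from sklearn.mixture import GaussianMixture

def recipe_scanpy_qc(
    adata: anndata.AnnData,
    mito_startswith: str = "MT-",
    max_genes: int = 6000,
    min_genes: int = 200,
    mito_cutoff: Optional[int] = None,
):
    """Flag low-quality cells in `adata.obs["filter_rna"]` (modifies `adata.obs` in place)."""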
    
    _adata = adata.copy()
    # run basic scanpy pipeline
    sc.pp.filter_cells(_adata, min_genes=0)
    _adata.var["mt"] = _adata.var_names.str.startswith(mito_startswith)
    sc.pp.calculate_qc_metrics(
        _adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
    )
    if mito_cutoff is None:
        # use a model-based method to determine the cut off
        # for mitochondrial content
        gmm = GaussianMixture(
            n_components=2, max_iter=1000, covariance_type="full", random_state=0
        )
        X = _adata.obs[["pct_counts_mt", "n_genes_by_counts"]]
        try:
            _adata.obs["gmm_pct_count_clusters"] = gmm.fit(X).predict(X)
            # use a simple metric to workout which cluster
            # is the one that contains lower mito content?
            A1 = (
                _adata[_adata.obs["gmm_pct_count_clusters"] == 0]
                .obs["pct_counts_mt"]
                .mean()
            )
            B1 = (
                _adata[_adata.obs["gmm_pct_count_clusters"] == 1]
                .obs["pct_counts_mt"]
                .mean()
            )
            A2 = (
                _adata[_adata.obs["gmm_pct_count_clusters"] == 0]
                .obs["n_genes_by_counts"]
                .mean()
            )
            B2 = (
                _adata[_adata.obs["gmm_pct_count_clusters"] == 1]
                .obs["n_genes_by_counts"]
                .mean()
            )
            if (A1 > B1) and (A2 < B2):
                keepdict = {0: False, 1: True}
            else:
                keepdict = {1: False, 0: True}
            _adata.obs["gmm_pct_count_clusters_keep"] = [
                keepdict[x] for x in _adata.obs["gmm_pct_count_clusters"]
            ]
        
            _adata.obs["filter_rna"] = (
                (
                    pd.Series(
                        [
                            ((n < min_genes) or (n > max_genes))
                            for n in _adata.obs["n_genes_by_counts"]
                        ],
                        index=_adata.obs.index,
                    )
                )
                | ~(_adata.obs.gmm_pct_count_clusters_keep)
            )
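        except Exception as e:
            # The GMM cannot be fitted, e.g. when the sample has fewer than 2 cells;
            # flag every cell so that the sample is excluded downstream
            print(adata, 'could not be clustered by the GMM:', e)
            _adata.obs["filter_rna"] = True
    else:
        # Assumed behaviour for a fixed cutoff: also flag cells whose
        # mitochondrial percentage reaches `mito_cutoff`
        _adata.obs["filter_rna"] = (
            (_adata.obs["n_genes_by_counts"] < min_genes)
            | (_adata.obs["n_genes_by_counts"] > max_genes)
            | (_adata.obs["pct_counts_mt"] >= mito_cutoff)
        )

    bool_dict = {True: "True", False: "False"}
    _adata.obs["filter_rna"] = [bool_dict[x] for x in _adata.obs["filter_rna"]]

    # remove the intermediate column that is no longer needed
    _adata.obs = _adata.obs.drop(
        columns=["gmm_pct_count_clusters"],
        errors="ignore",
    )
    adata.obs = _adata.obs.copy()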
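
The way the function decides which of the two mixture-model clusters to keep can be isolated in a small sketch. It assumes the cluster labels are already available (the toy values and the helper name `choose_keep_cluster` are hypothetical) and only reproduces the comparison of cluster means.

```python
from statistics import mean


def choose_keep_cluster(pct_mt, n_genes, labels):
    """Return the label (0 or 1) of the cluster whose cells are kept."""
    mt0 = mean(m for m, lab in zip(pct_mt, labels) if lab == 0)
    mt1 = mean(m for m, lab in zip(pct_mt, labels) if lab == 1)
    genes0 = mean(g for g, lab in zip(n_genes, labels) if lab == 0)
    genes1 = mean(g for g, lab in zip(n_genes, labels) if lab == 1)
    # Cluster 0 is discarded only if it has a higher mean mitochondrial
    # percentage AND a lower mean gene count than cluster 1.
    if mt0 > mt1 and genes0 < genes1:
        return 1
    return 0


# Hypothetical toy data: two healthy-looking cells and two high-mito cells.
pct_mt = [2.0, 3.0, 40.0, 55.0]
n_genes = [3000, 2500, 400, 300]

keep_a = choose_keep_cluster(pct_mt, n_genes, labels=[1, 1, 0, 0])
keep_b = choose_keep_cluster(pct_mt, n_genes, labels=[0, 0, 1, 1])
print(keep_a, keep_b)
```

Note that cluster 0 is kept in every other case, including ambiguous ones where only one of the two conditions holds.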

In [10]:
adatas_filtered = {}

for sample in adatas:
    # print(sample)
    adata = adatas[sample]

    # Do QC and filtering
    recipe_scanpy_qc(adata)
    try:
        adata = adata[adata.obs['filter_rna'] == 'False']
    except Exception as e:
        print(sample, "may not have the 'filter_rna' column")

    # We need at least 1 cell passing the filtering for the downstream processing
    if adata.shape[0] > 0:
        # Subset for common genes (copy so that the view can be modified safely)
        adata = adata[:, list(common_genes)].copy()

        # Store the raw counts
        adata.layers['counts'] = adata.X.copy()

        # Do normalization
        sc.pp.normalize_total(adata, target_sum=1e4)

        # Do the log transformation
        sc.pp.log1p(adata)

        # Freeze the state in `.raw`
        adata.raw = adata  
        
        adatas_filtered[sample] = adata

In [11]:
len(adatas_filtered)

369

Three samples have no cells that pass the filtering.

In [12]:
adatas_filtered.keys()

dict_keys(['GSE235923_GSM7512002_Sample4D_uni', 'GSE235923_GSM7512005_Sample5R_uni', 'GSE235923_GSM7512020_Sample16E_uni', 'GSE235923_GSM7512000_Sample3D_uni', 'GSE235923_GSM7512018_Sample15E_uni', 'GSE235923_GSM7512021_Sample17D_uni', 'GSE235923_GSM7512025_Sample19D_uni', 'GSE235923_GSM7512007_Sample6E_uni', 'GSE235923_GSM7512026_Sample19E_uni', 'GSE235923_GSM7512003_Sample5D_uni', 'GSE235923_GSM7512017_Sample14E_uni', 'GSE235923_GSM7512006_Sample6D_uni', 'GSE235923_GSM7511999_Sample2D_uni', 'GSE235923_GSM7512027_Sample20D_uni', 'GSE235923_GSM7512011_Sample9D_uni', 'GSE235923_GSM7512009_Sample7D_uni', 'GSE235923_GSM7511998_Sample1D_uni', 'GSE235923_GSM7512004_Sample5E_uni', 'GSE235923_GSM7512012_Sample10D_uni', 'GSE235923_GSM7512019_Sample16D_uni', 'GSE235923_GSM7512016_Sample14D_uni', 'GSE235923_GSM7512023_Sample18Dx_uni', 'GSE235923_GSM7512028_Sample20E_uni', 'GSE235923_GSM7512013_Sample11D_uni', 'GSE235923_GSM7512024_Sample18E_uni', 'GSE235923_GSM7512010_Sample8D_uni', 'GSE235923_G

In [13]:
print(adatas_filtered['GSE235923_GSM7512002_Sample4D_uni'].X.expm1().sum(axis = 1))
# print(adatas_filtered['GSE235063_GSM7494302_AML22_REM_processed_uni'].X.expm1().sum(axis = 1))

In [14]:
# Check the total number of cells after filtering
total_cells = 0

# Iterate through each AnnData object in the list
for sample in adatas_filtered:
    # Add the number of cells in the current AnnData object to the total
    total_cells += adatas_filtered[sample].shape[0]  # 'shape[0]' gives the number of cells

# Print the total number of cells
print("Total number of cells:", total_cells)

We have 1,300,958 cells after filtering.
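
For reference, the effect of the filtering can be expressed as a retention rate, using only the two totals reported in this notebook:

```python
cells_before = 1_944_497  # total cells before QC (reported above)
cells_after = 1_300_958   # total cells after QC

removed = cells_before - cells_after
retained_fraction = cells_after / cells_before

print(f"removed {removed:,} cells; retained {retained_fraction:.1%}")
```

That is, 643,539 cells were removed and roughly 66.9% of the cells were retained.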

In [15]:
# Concatenate all AnnData objects
adatas_filtered_all = anndata.concat(adatas_filtered, join='outer')

# Provide the gene_id info in .var
columns_toadd = adatas_filtered['GSE235923_GSM7512002_Sample4D_uni'].var.columns.tolist()
adatas_filtered_all.var[columns_toadd] = adatas_filtered['GSE235923_GSM7512002_Sample4D_uni'].var[columns_toadd]

adatas_filtered_all

AnnData object with n_obs × n_vars = 1300958 × 7662
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'filter_rna', 'cell_type_from_paper', 'malignant_from_paper', 'age_months', 'age', 'sex', 'recurrent', 'cancer_subtype', 'metastatic'
    var: 'gene_ids', 'feature_types'
    layers: 'counts'

The shape of `adatas_filtered_all` is `n_obs × n_vars = 1300958 × 7662`

So we have 1,300,958 cells and 7,662 genes across all cancer types.

In [16]:
adatas_filtered_all.obs

Unnamed: 0,cancer_type,dataset,tissue,uni_barcode,disease_progression,sample_barcode,n_genes,n_genes_by_counts,total_counts,total_counts_mt,...,gmm_pct_count_clusters_keep,filter_rna,cell_type_from_paper,malignant_from_paper,age_months,age,sex,recurrent,cancer_subtype,metastatic
GSE235923_AAACCTGAGCTAGTGG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGCTAGTGG-1,diagnosis,GSM7512002_Sample4D,2376,2376,11016.0,943.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGGATGCGT-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGGATGCGT-1,diagnosis,GSM7512002_Sample4D,3565,3565,21188.0,1127.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGGGTGTTG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGGGTGTTG-1,diagnosis,GSM7512002_Sample4D,2886,2886,11618.0,894.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGTTCGCGC-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGTTCGCGC-1,diagnosis,GSM7512002_Sample4D,2986,2986,13508.0,661.0,...,True,False,,,,,,,,
GSE235923_AAACCTGCAATGAATG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGCAATGAATG-1,diagnosis,GSM7512002_Sample4D,2187,2187,9031.0,414.0,...,True,False,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSE223373_TTCACGCAGCGAGTAAACGTATCA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,1976,1976,8004.0,3758.0,...,True,False,,,,,,,,
GSE223373_TTCACGCAGTCGTAGAAAGACGGA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,2120,2120,16639.0,12168.0,...,True,False,,,,,,,,
GSE223373_TTCACGCAGTGTTCTAGTACGCAA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,2009,2009,6494.0,2877.0,...,True,False,,,,,,,,
GSE223373_TTCACGCATAGGATGATCTTCACA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,3274,3274,9095.0,433.0,...,True,False,,,,,,,,


In [17]:
len(adatas_filtered_all.obs['sample_barcode'].unique())

488

So 491 - 488 = 3 samples did not pass the filtering.
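
To find out which samples were lost, a set difference is enough. The sketch below uses toy barcodes (hypothetical values); in the notebook, the two sets could presumably be built from the `sample_barcode` values before and after filtering.

```python
# Hypothetical toy data standing in for the sample barcodes before and after QC.
samples_before = {"S1", "S2", "S3", "S4", "S5"}
samples_after = {"S1", "S3", "S4"}

# Samples present before QC but absent afterwards.
dropped = sorted(samples_before - samples_after)
print(len(dropped), dropped)
```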

In [18]:
adatas_filtered_all.var

Unnamed: 0,gene_ids,feature_types
LGALS9,ENSG00000168961,Gene Expression
SSBP2,ENSG00000145687,Gene Expression
PPP3CC,ENSG00000120910,Gene Expression
AUP1,ENSG00000115307,Gene Expression
TOM1,ENSG00000100284,Gene Expression
...,...,...
MKKS,ENSG00000125863,Gene Expression
C1orf52,ENSG00000162642,Gene Expression
MED13,ENSG00000108510,Gene Expression
ABCA2,ENSG00000107331,Gene Expression


In [20]:
adatas_filtered_all.obs['age'] = adatas_filtered_all.obs['age'].astype(str)

In [21]:
# Save the concatenated adatas_filtered_all object
adatas_filtered_all.write_h5ad('/scratch/user/s4543064/Xiaohan_Summer_Research/write/08_all_cancer/all_cancer_meta_anndata.h5ad', compression='gzip')

In [3]:
# Load the concatenated adatas_filtered_all object
adatas_filtered_all = sc.read_h5ad('/scratch/user/s4543064/Xiaohan_Summer_Research/write/08_all_cancer/all_cancer_meta_anndata.h5ad')
adatas_filtered_all

AnnData object with n_obs × n_vars = 1300958 × 7662
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'filter_rna', 'cell_type_from_paper', 'malignant_from_paper', 'age_months', 'age', 'sex', 'recurrent', 'cancer_subtype', 'metastatic'
    var: 'gene_ids', 'feature_types'
    layers: 'counts'

In [4]:
adatas_filtered_all.X.expm1().sum(axis = 1)

matrix([[ 9999.99934143],
        [10000.00051373],
        [ 9999.99982159],
        ...,
        [10000.00001831],
        [ 9999.99974831],
        [10000.00053825]])
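
The row sums above are close to 10,000 because `.X` holds counts that were scaled to `target_sum=1e4` per cell and then log1p-transformed, so applying `expm1` undoes the log and recovers the scaled total. A minimal pure-Python sketch of this round trip for a single cell, with hypothetical raw counts:

```python
import math

counts = [3, 0, 12, 5]  # hypothetical raw counts for one cell
target_sum = 1e4

# Scale the cell so that its counts sum to target_sum, then apply log1p.
total = sum(counts)
normalized = [c / total * target_sum for c in counts]
logged = [math.log1p(x) for x in normalized]

# expm1 inverts log1p, so the recovered total matches target_sum
# up to floating-point error.
recovered_total = sum(math.expm1(x) for x in logged)
print(recovered_total)
```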

In [5]:
adatas_filtered_all.var

Unnamed: 0,gene_ids,feature_types
LGALS9,ENSG00000168961,Gene Expression
SSBP2,ENSG00000145687,Gene Expression
PPP3CC,ENSG00000120910,Gene Expression
AUP1,ENSG00000115307,Gene Expression
TOM1,ENSG00000100284,Gene Expression
...,...,...
MKKS,ENSG00000125863,Gene Expression
C1orf52,ENSG00000162642,Gene Expression
MED13,ENSG00000108510,Gene Expression
ABCA2,ENSG00000107331,Gene Expression


In [6]:
adatas_filtered_all.obs

Unnamed: 0,cancer_type,dataset,tissue,uni_barcode,disease_progression,sample_barcode,n_genes,n_genes_by_counts,total_counts,total_counts_mt,...,gmm_pct_count_clusters_keep,filter_rna,cell_type_from_paper,malignant_from_paper,age_months,age,sex,recurrent,cancer_subtype,metastatic
GSE235923_AAACCTGAGCTAGTGG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGCTAGTGG-1,diagnosis,GSM7512002_Sample4D,2376,2376,11016.0,943.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGGATGCGT-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGGATGCGT-1,diagnosis,GSM7512002_Sample4D,3565,3565,21188.0,1127.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGGGTGTTG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGGGTGTTG-1,diagnosis,GSM7512002_Sample4D,2886,2886,11618.0,894.0,...,True,False,,,,,,,,
GSE235923_AAACCTGAGTTCGCGC-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGAGTTCGCGC-1,diagnosis,GSM7512002_Sample4D,2986,2986,13508.0,661.0,...,True,False,,,,,,,,
GSE235923_AAACCTGCAATGAATG-1_GSM7512002_Sample4D,acute_myeloid_leukemia,GSE235923,bone_marrow,GSE235923_AAACCTGCAATGAATG-1,diagnosis,GSM7512002_Sample4D,2187,2187,9031.0,414.0,...,True,False,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSE223373_TTCACGCAGCGAGTAAACGTATCA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,1976,1976,8004.0,3758.0,...,True,False,,,,,,,,
GSE223373_TTCACGCAGTCGTAGAAAGACGGA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,2120,2120,16639.0,12168.0,...,True,False,,,,,,,,
GSE223373_TTCACGCAGTGTTCTAGTACGCAA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,2009,2009,6494.0,2877.0,...,True,False,,,,,,,,
GSE223373_TTCACGCATAGGATGATCTTCACA_GSM6946667_WT-RBG-N_matrix,Nephroblastoma,GSE223373,Kidney,,,GSM6946667_WT-RBG-N_matrix,3274,3274,9095.0,433.0,...,True,False,,,,,,,,


In [7]:
import numpy as np

np.max(adatas_filtered_all.layers['counts'][:2000, ].toarray())

1312.0

In [8]:
# Select highly variable genes
sc.pp.highly_variable_genes(
    adatas_filtered_all,
    # n_top_genes=1200,
    # layer="counts",
    batch_key="sample_barcode",
    # flavor="seurat_v3",
    subset=True,
)

X.shape:  (1, 3660)


It seems that one sample has only 1 cell that passed the filtering, so its gene variance is arbitrarily set to 0.
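
Samples with too few cells for a per-batch variance estimate can be listed before selecting highly variable genes. The sketch below uses toy data; the threshold `min_cells = 2` is an assumption, chosen because a variance needs at least two observations.

```python
from collections import Counter

# Hypothetical toy data: one sample barcode per cell.
sample_barcodes = ["S1"] * 5 + ["S2"] * 1 + ["S3"] * 3

cells_per_sample = Counter(sample_barcodes)
min_cells = 2  # assumed threshold
too_small = sorted(s for s, n in cells_per_sample.items() if n < min_cells)
print(too_small)
```

Whether such samples should be dropped or merged into another batch is a separate decision that this sketch does not make.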

In [9]:
adatas_filtered_all

AnnData object with n_obs × n_vars = 1300958 × 559
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'filter_rna', 'cell_type_from_paper', 'malignant_from_paper', 'age_months', 'age', 'sex', 'recurrent', 'cancer_subtype', 'metastatic'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'hvg'
    layers: 'counts'

There are 559 highly variable genes across all cancer datasets.

In [10]:
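# Check the presence of TCR genes (we don't expect to see any)
# NOTE: this inspects the column names of .var, which are never gene symbols, so the result is always empty;
# to check gene symbols, iterate over adatas_filtered_all.var_names instead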
tr_columns = [col for col in adatas_filtered_all.var.columns if col.startswith('TR')]
tr_columns

[]
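
A check on gene symbols rather than column names could look like the sketch below. The regular expression is a heuristic and an assumption, based on the usual naming of TCR genes (`TRA`/`TRB`/`TRG`/`TRD` followed by a V, D, J or C segment); a plain `startswith('TR')` test would also match unrelated genes such as `TRIM28` or `TRADD`. The gene list is toy data and `find_tcr_genes` is a hypothetical helper.

```python
import re

# Heuristic pattern (an assumption) for TCR gene symbols, e.g. TRAC, TRBV7-2, TRBC1.
TCR_PATTERN = re.compile(r"^TR[ABGD](?:[VJD]\d|C\d?$)")


def find_tcr_genes(gene_names):
    """Return the gene names that look like TCR genes."""
    return [g for g in gene_names if TCR_PATTERN.match(g)]


# Hypothetical toy gene list.
genes = ["TRAC", "TRBV7-2", "TRBC1", "TRADD", "TRIM28", "CD3E"]
tcr_genes = find_tcr_genes(genes)
print(tcr_genes)
```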

In [11]:
adatas_filtered_all.var

Unnamed: 0,gene_ids,feature_types,highly_variable,means,dispersions,dispersions_norm,highly_variable_nbatches,highly_variable_intersection
RAMP1,ENSG00000132329,Gene Expression,True,0.395341,1.404469,0.826235,232,False
FECH,ENSG00000066926,Gene Expression,True,0.352477,1.648399,1.286079,203,False
TNFRSF12A,ENSG00000006327,Gene Expression,True,0.182145,1.344135,0.729749,233,False
ETS2,ENSG00000157557,Gene Expression,True,0.509476,1.472557,0.684707,255,False
HLA-B,ENSG00000234745,Gene Expression,True,2.638680,2.702799,1.673440,198,False
...,...,...,...,...,...,...,...,...
KCNQ1OT1,ENSG00000269821,Gene Expression,True,0.958201,1.925883,1.477188,331,False
GYPC,ENSG00000136732,Gene Expression,True,1.115444,2.065587,1.616435,303,False
LY96,ENSG00000154589,Gene Expression,True,0.266592,1.373795,0.930653,245,False
DCAF12,ENSG00000198876,Gene Expression,True,0.371687,1.612170,1.117161,176,False


In [12]:
# Save the hvg-subsetted adatas_filtered_all object
adatas_filtered_all.write_h5ad('/scratch/user/s4543064/Xiaohan_Summer_Research/write/08_all_cancer/all_cancer_meta_anndata_hvg.h5ad', compression='gzip')

In [13]:
meta_anndata_hvg = sc.read_h5ad('/scratch/user/s4543064/Xiaohan_Summer_Research/write/08_all_cancer/all_cancer_meta_anndata_hvg.h5ad')
meta_anndata_hvg

AnnData object with n_obs × n_vars = 1300958 × 559
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'filter_rna', 'cell_type_from_paper', 'malignant_from_paper', 'age_months', 'age', 'sex', 'recurrent', 'cancer_subtype', 'metastatic'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'hvg'
    layers: 'counts'

In [14]:
meta_anndata_hvg.raw.to_adata()

AnnData object with n_obs × n_vars = 1300958 × 7662
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode', 'disease_progression', 'sample_barcode', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'filter_rna', 'cell_type_from_paper', 'malignant_from_paper', 'age_months', 'age', 'sex', 'recurrent', 'cancer_subtype', 'metastatic'
    uns: 'hvg'