# Extracting sample SL040 from the GBM test dataset
This notebook shows how I extracted sample SL040 from the full GBM test dataset. The extracted sample was then used for further analysis, including examining prediction results and running a clustering analysis on this sample alone.

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import anndata as ad
import re

#### Run this if you want all the cells in the sample:
**Note**: For consistency with the analyses presented in the report, I initially subset all cells labeled SL040, healthy and tumor cells alike. The same extraction also works on the tumor cells only, by loading the tumor-only file further down instead.

In [2]:
test = sc.read_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/GBM_202406.h5ad')

#### Run this if you only want tumor cells in the sample:

In [3]:
#test = sc.read_h5ad(
#    '/hpc/hers_basak/rnaseq_data/Silettilab/proj/GBM/output/data/GBM_Tumor_Dissociation.h5ad')

#### Consolidate the pieces of each tumor under a single label
This code adds a column, `stackedSamples`, that consolidates the pieces of a single tumor (denoted by A, B, C, etc.) into one label. This is done to improve the clarity and visual presentation of the data.

Specifically, the function:

- **Removes the final letter if present, unless the suffix is '_ATAC' or '_nuclei'**.
- **If the suffix is '_ATAC' or '_nuclei' and there is a single letter directly before the underscore, removes that letter and keeps the suffix**.

In [4]:
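def remove_trailing_letter(sample):
    # Drop a single letter directly before an '_ATAC' or '_nuclei' suffix
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])(_ATAC)$', r'\1\3', sample)
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])(_nuclei)$', r'\1\3', sample)
    # Otherwise drop a trailing letter from a purely alphanumeric name
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])$', r'\1', sample)
    return sample

# New column holding one consolidated label per tumor
test.obs['stackedSamples'] = test.obs['Sample'].apply(remove_trailing_letter)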
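
To make the two rules concrete, here is a small sketch that applies the same three substitutions to a handful of sample names. The names are made up to exercise each rule and are not taken from the dataset; the function is restated only so the block runs on its own.

```python
import re


def remove_trailing_letter(sample):
    # Same three substitutions as in the cell above
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])(_ATAC)$', r'\1\3', sample)
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])(_nuclei)$', r'\1\3', sample)
    sample = re.sub(r'^([A-Za-z0-9]+)([A-Za-z])$', r'\1', sample)
    return sample


# Hypothetical sample names, chosen only to illustrate each case
examples = ['SL040A', 'SL040B_ATAC', 'KS414C_nuclei', 'SL040', 'SL040_ATAC']
consolidated = {name: remove_trailing_letter(name) for name in examples}

for name, label in consolidated.items():
    print(f'{name} -> {label}')
# SL040A -> SL040
# SL040B_ATAC -> SL040_ATAC
# KS414C_nuclei -> KS414_nuclei
# SL040 -> SL040
# SL040_ATAC -> SL040_ATAC
```

One caveat: the last substitution strips any trailing letter from a purely alphanumeric name, so a name whose final letter is part of the identifier itself would also be shortened. It is worth checking the unique values of `Sample` before relying on the new column.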

#### I subsetted and saved the samples I'm interested in. It's straightforward to save all of them using a simple loop.

In [5]:
# .copy() turns each view into an independent AnnData object before it is written
SL040 = test[test.obs['stackedSamples'] == 'SL040'].copy()
KS414 = test[test.obs['stackedSamples'] == 'KS414'].copy()
SL040.write_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/SL040_refined.h5ad')
KS414.write_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/KS414_refined.h5ad')
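
The loop mentioned above could look roughly like the following. This is a minimal sketch, not the code used for the report: `save_each_sample`, `out_dir`, and `suffix` are hypothetical names, and it assumes the `stackedSamples` column has already been added and that each label is safe to use in a file name.

```python
from pathlib import Path


def save_each_sample(adata, out_dir, key="stackedSamples", suffix="_refined.h5ad"):
    """Write one .h5ad file per distinct value of adata.obs[key]."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for label in adata.obs[key].unique():
        # Subsetting gives a view; .copy() makes it independent before writing
        subset = adata[adata.obs[key] == label].copy()
        path = out_dir / f"{label}{suffix}"
        subset.write_h5ad(path)
        written.append(path)
    return written


# Example usage (output directory taken from the cell above):
# save_each_sample(
#     test, '/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets')
```

Each subset is copied in memory before it is written, so for a dataset of this size it may be preferable to restrict the loop to the labels that are actually needed.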

#### Check if they were correctly saved

In [6]:
sc.read_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/SL040_refined.h5ad')

AnnData object with n_obs × n_vars = 135482 × 59480
    obs: 'Age', 'AneuploidMeta', 'Aneuploid_Manual', 'Aneuploid_expression', 'CellCycleFraction', 'Chemistry', 'Clones', 'ClonesAbberations', 'Clusters', 'ClustersProbability', 'ClustersSecondary', 'ClustersSecondaryProbability', 'Dissociation', 'Donor', 'DoubletFlag', 'DoubletScore', 'GBmapPredicted', 'ManualAnnotation', 'ManualAnnotationSample', 'ManualAnnotationSample_new', 'MitoFraction', 'NGenes', 'PrevClusters', 'Sample', 'SampleID', 'Sex', 'Tissue', 'TopLevelCluster', 'TopLevelCluster2', 'TotalUMIs', 'UnsplicedFraction', 'ValidCells', 'stackedSamples'
    var: 'Chromosome', 'End', 'Gene', 'GeneNonzeros', 'GeneTotalUMIs', 'SelectedFeatures', 'Start', 'StdevExpression', 'ValidGenes'
    obsm: 'Factors'

In [8]:
sc.read_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/KS414_refined.h5ad')

AnnData object with n_obs × n_vars = 6131 × 59480
    obs: 'Age', 'AneuploidMeta', 'Aneuploid_Manual', 'Aneuploid_expression', 'CellCycleFraction', 'Chemistry', 'Clones', 'ClonesAbberations', 'Clusters', 'ClustersProbability', 'ClustersSecondary', 'ClustersSecondaryProbability', 'Dissociation', 'Donor', 'DoubletFlag', 'DoubletScore', 'GBmapPredicted', 'ManualAnnotation', 'ManualAnnotationSample', 'ManualAnnotationSample_new', 'MitoFraction', 'NGenes', 'PrevClusters', 'Sample', 'SampleID', 'Sex', 'Tissue', 'TopLevelCluster', 'TopLevelCluster2', 'TotalUMIs', 'UnsplicedFraction', 'ValidCells', 'stackedSamples'
    var: 'Chromosome', 'End', 'Gene', 'GeneNonzeros', 'GeneTotalUMIs', 'SelectedFeatures', 'Start', 'StdevExpression', 'ValidGenes'
    obsm: 'Factors'
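
Printing the summary confirms that each file loads and has the expected columns, but not that it contains only the intended sample. A stricter check could be sketched as follows; `check_single_sample` is a hypothetical helper (not part of scanpy or anndata) and it assumes the `stackedSamples` column created earlier.

```python
def check_single_sample(adata, expected, key="stackedSamples"):
    """Return the number of cells after confirming every cell is labeled `expected`."""
    labels = set(adata.obs[key].unique())
    if labels != {expected}:
        found = sorted(map(str, labels))
        raise ValueError(f"expected only {expected!r} in obs[{key!r}], found {found}")
    return adata.n_obs


# Example usage (path taken from the cells above):
# import scanpy as sc
# adata = sc.read_h5ad(
#     '/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/SL040_refined.h5ad')
# check_single_sample(adata, 'SL040')
```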