# Create Dataset Subset

This notebook creates a smaller subset of the merged Gambardella-Kinker dataset for local testing. The subset keeps the same structure and all genes of the original dataset but contains only a small random sample of its cells.

## 1. Import Required Libraries and Load Data
Import the necessary libraries and load the original dataset.

In [1]:
import scanpy as sc
import numpy as np

# Read the original dataset
print("Loading original dataset...")
adata = sc.read_h5ad('../output/merged_gambardella_kinker_common_genes.h5ad')
print(f"Original dataset shape: {adata.shape} (cells × genes)")

Loading original dataset...
Original dataset shape: (64943, 18235) (cells × genes)


## 2. Create Sample Subset

Create a smaller subset by:
1. Randomly selecting a portion of cells
2. Keeping all the genes
3. Maintaining the same structure and metadata
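
The selection step can be illustrated on its own, without loading the dataset. The sketch below is a minimal example under stated assumptions: it hardcodes the cell count reported above (64,943), uses NumPy's `default_rng` generator instead of the global seed used in this notebook (so the drawn indices will differ from the notebook's), and sorts the indices so the subset would keep the original cell order. The variable names are illustrative.

```python
import numpy as np

n_obs = 64943        # number of cells reported for the original dataset
fraction = 0.005     # keep 0.5% of the cells

# Number of cells to keep, rounded down
n_cells = int(n_obs * fraction)

# A local generator avoids touching NumPy's global random state
rng = np.random.default_rng(42)

# Draw unique row positions, then sort them to preserve the original order
cell_indices = np.sort(rng.choice(n_obs, size=n_cells, replace=False))

print(n_cells)  # 324
```

Because the genes are untouched, only row positions are needed; indexing an AnnData object with such an array selects cells and leaves every gene column in place.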

In [2]:
# Set random seed for reproducibility
np.random.seed(42)

# Define the fraction of cells to keep (0.005 = 0.5%)
fraction = 0.005

# Calculate the number of cells to keep
n_cells = int(adata.n_obs * fraction)

# Randomly select cell indices
cell_indices = np.random.choice(adata.n_obs, size=n_cells, replace=False)

# Create the subset as an independent copy rather than a view of the original
adata_subset = adata[cell_indices].copy()

print(f"Subset dataset shape: {adata_subset.shape} (cells × genes)")

# Check the distribution of TP53 status in the subset
print("\nTP53 status distribution in the subset:")
print(adata_subset.obs['TP53_status'].value_counts())

Subset dataset shape: (324, 18235) (cells × genes)

TP53 status distribution in the subset:
TP53_status
Missense_Mutation    216
Nonsense_Mutation     30
Frame_Shift_Del       26
Frame_Shift_Ins       23
Splice_Site           10
In_Frame_Del           7
Intron                 7
Silent                 4
In_Frame_Ins           1
Name: count, dtype: int64
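
A purely random draw at this fraction can leave rare categories with a single cell or none (here `In_Frame_Ins` has one cell). If a test needs every category represented, one option is to sample within each category instead. The following is a sketch under assumptions, not what this notebook does: `stratified_indices` is a hypothetical helper, and the toy `labels` array stands in for `adata.obs['TP53_status']`.

```python
import numpy as np

def stratified_indices(labels, fraction, min_per_group=1, seed=42):
    """Return sorted row positions sampled per category (hypothetical helper)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = []
    for category in np.unique(labels):
        positions = np.flatnonzero(labels == category)
        # Keep roughly `fraction` of the group, but at least `min_per_group`
        # and never more than the group actually contains
        n_keep = min(len(positions), max(min_per_group, int(len(positions) * fraction)))
        picked.append(rng.choice(positions, size=n_keep, replace=False))
    return np.sort(np.concatenate(picked))

# Toy labels standing in for adata.obs['TP53_status']
labels = np.array(["Missense_Mutation"] * 1000 + ["Silent"] * 40 + ["In_Frame_Ins"] * 3)
idx = stratified_indices(labels, fraction=0.05)

# Count how many cells of each category were selected
values, counts = np.unique(labels[idx], return_counts=True)
selected = dict(zip(values.tolist(), counts.tolist()))
print(selected)
```

With the real object, the resulting positions could be used in the same way as the random indices, for example `adata[idx].copy()`.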


## 3. Save Subset Data

Save the subset to a new h5ad file.

In [3]:
# Save the subset to a new file
output_file = '../output/local_testing_merged_gambardella_kinker_common_genes.h5ad'
adata_subset.write(output_file)
print(f"\nSubset saved to: {output_file}")

# Verify the file size difference
import os

original_size = os.path.getsize('../output/merged_gambardella_kinker_common_genes.h5ad') / (1024 * 1024)  # MB
subset_size = os.path.getsize(output_file) / (1024 * 1024)  # MB

print("\nFile sizes:")
print(f"Original: {original_size:.2f} MB")
print(f"Subset: {subset_size:.2f} MB")
print(f"Size reduction: {((original_size - subset_size) / original_size * 100):.1f}%")


Subset saved to: ../output/local_testing_merged_gambardella_kinker_common_genes.h5ad

File sizes:
Original: 2370.21 MB
Subset: 12.48 MB
Size reduction: 99.5%
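
The reported reduction follows directly from the two file sizes. As a quick arithmetic check, the sketch below recomputes it from the printed values; `size_reduction_percent` is an illustrative helper name, not part of the notebook.

```python
def size_reduction_percent(original_mb, subset_mb):
    """Percentage by which the subset file is smaller than the original."""
    return (original_mb - subset_mb) / original_mb * 100

# Sizes in MB taken from the output above
reduction = size_reduction_percent(2370.21, 12.48)
print(f"Size reduction: {reduction:.1f}%")  # Size reduction: 99.5%
```

The subset holds 0.5% of the cells and about 0.53% of the bytes, so file size scales roughly with the number of cells here.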


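## 4. Inspect the Subset

Print the shape, the metadata columns, and a small corner of the expression matrix to confirm the structure of the subset.

In [4]: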
# Inspect an AnnData object: obs columns, var columns, and the top-left 5 × 5 block of X

def inspect_anndata(adata_obj, name="AnnData"):
    print(f"--- {name} ---")
    print(f"Shape: {adata_obj.shape} (cells × genes)\n")
    print("obs columns:")
    print(adata_obj.obs.columns.tolist())
    print("\nvar columns:")
    print(adata_obj.var.columns.tolist())
    print("\nX (first 5 rows, first 5 columns):")
    print(adata_obj.X[:5, :5])
    print("\n")

inspect_anndata(adata_subset, name="adata_subset")

--- adata_subset ---
Shape: (324, 18235) (cells × genes)

obs columns:
['Cell_line', 'TP53_status', 'dataset']

var columns:
[]

X (first 5 rows, first 5 columns):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (5, 5)>
  Coords	Values
  (0, 1)	1.0
  (1, 1)	5.0
  (2, 1)	3.0
  (2, 3)	1.0
  (3, 3)	1.0
  (4, 1)	2.0
  (4, 3)	2.0
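
The sparse printout lists only the non-zero entries, as (row, column) coordinates with their values. To make that layout easier to read, the sketch below rebuilds the same 5 × 5 block as a dense NumPy array from the coordinates shown above. It illustrates the format only and does not touch `adata_subset`.

```python
import numpy as np

# (row, column, value) triplets copied from the printed sparse block
entries = [
    (0, 1, 1.0),
    (1, 1, 5.0),
    (2, 1, 3.0),
    (2, 3, 1.0),
    (3, 3, 1.0),
    (4, 1, 2.0),
    (4, 3, 2.0),
]

# Start from zeros and fill in the stored elements
dense = np.zeros((5, 5))
for row, col, value in entries:
    dense[row, col] = value

print(dense)
```

On the real object, `adata_subset.X[:5, :5].toarray()` should give the same dense view, assuming `X` is a SciPy sparse matrix as the output indicates.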


