# Inspecting H5AD File: human_lung.h5ad

This notebook loads and inspects the contents of the H5AD file `human_lung.h5ad`, read from the location stored in `file_path`. It provides an overview of the data structure, including the main expression matrix, cell metadata, gene metadata, and other components stored in the AnnData object.

## 1. Import Required Libraries

We will use `scanpy` to read the H5AD file and `anndata` to work with the AnnData object. Additionally, `pandas` and `numpy` are used for data manipulation and inspection.

In [1]:
import scanpy as sc
import anndata
import pandas as pd
import numpy as np

# Limit scanpy logging to errors and warnings (verbosity level 1)
sc.settings.verbosity = 1

## 2. Load the H5AD File

Load the H5AD file from the specified path and store it as an AnnData object.

In [2]:
# Path to the H5AD file
file_path = "/gpfs/flash/home/jcw/Project/scDiffusion/data/data/Human_PF_Lung/human_lung.h5ad"

# Load the H5AD file
adata = sc.read_h5ad(file_path)

# Display basic information about the AnnData object
print(adata)

AnnData object with n_obs × n_vars = 114396 × 27281
    obs: 'barcodes', 'celltype', 'n_genes', 'status'
    var: 'n_cells'


In [3]:
adata.obs.head()

                   barcodes                        celltype  n_genes   status
0   F01157_AAACCTGAGCATCATC                             AT2     4259  Control
3   F01157_AAACCTGTCATGCTCC                             AT2     3492  Control
7   F01157_AAACGGGGTTACGACT                             AT2     3604  Control
8   F01157_AAACGGGGTTAGTGGG                             AT2     3856  Control
10  F01157_AAACGGGTCCTTTCGG  Proliferating Epithelial Cells     4104  Control


In [4]:
adata.var.head()

               n_cells
RP11-34P13.3         8
RP11-34P13.7        89
RP11-34P13.8         3
RP11-34P13.14        7
FO538757.3          29


## 3. Inspect the Main Data Matrix (X)

The `X` attribute contains the main data matrix, typically the cell × gene expression matrix. Let's check its value range, shape, and storage type.

In [13]:
adata.X.min(), adata.X.max()

(0.0, 8318.198)
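
The maximum above (8318.198) is not a whole number, which suggests that `X` may hold normalized or otherwise transformed values rather than raw integer counts. The sketch below is one way to test that; `looks_like_raw_counts` is a hypothetical helper written for this notebook, not a scanpy or anndata function, and the toy matrices only stand in for `adata.X`.

```python
import numpy as np
import scipy.sparse as sp


def looks_like_raw_counts(X):
    """Return True if every value in X is a non-negative whole number.

    Hypothetical helper for illustration; not part of scanpy or anndata.
    """
    # For sparse input only the stored values need checking:
    # implicit zeros are valid counts by definition.
    values = X.tocsr().data if sp.issparse(X) else np.asarray(X).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy matrices standing in for adata.X (values are made up for the demo)
counts = sp.csr_matrix(np.array([[0, 2, 0], [5, 0, 1]], dtype=np.float32))
scaled = sp.csr_matrix(np.array([[0.0, 1.5, 0.0], [8318.198, 0.0, 1.0]], dtype=np.float32))

print("counts look raw:", looks_like_raw_counts(counts))
print("scaled look raw:", looks_like_raw_counts(scaled))
```

In the notebook the call would be `looks_like_raw_counts(adata.X)`. A `False` result only says the values are not integers; it does not identify which transformation was applied.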

In [5]:
# Display the shape of the main data matrix
print(f"Main data matrix shape (cells × genes): {adata.X.shape}")

# Check the type of the matrix (e.g., sparse or dense)
print(f"Matrix type: {type(adata.X)}")

# Display a small sample of the matrix (first 5 rows and columns)
print("Sample of X (first 5 rows, 5 columns):")
print(adata.X[:5, :5].toarray() if hasattr(adata.X, 'toarray') else adata.X[:5, :5])

Main data matrix shape (cells × genes): (114396, 27281)
Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
Sample of X (first 5 rows, 5 columns):
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
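
The top-left 5 × 5 corner is all zeros, which is common for sparse single-cell matrices and says little about the values. As a rough sketch, the block below picks the rows and columns with the most stored entries and shows that sub-matrix instead. `densest_block` is a hypothetical helper, and the toy array is a stand-in for `adata.X`.

```python
import numpy as np
import scipy.sparse as sp


def densest_block(X, n_rows=5, n_cols=5):
    """Return (row_idx, col_idx, block) for the rows and columns
    holding the most stored entries. Hypothetical helper for illustration."""
    X = sp.csr_matrix(X)
    # Stored entries per row/column are the differences between
    # consecutive index pointers of the CSR/CSC representations.
    per_row = np.diff(X.indptr)
    per_col = np.diff(X.tocsc().indptr)
    row_idx = np.sort(np.argsort(per_row)[::-1][:n_rows])
    col_idx = np.sort(np.argsort(per_col)[::-1][:n_cols])
    block = X[row_idx][:, col_idx].toarray()
    return row_idx, col_idx, block


# Toy matrix standing in for adata.X
toy = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 2],
    [0, 3, 0, 0],
    [0, 4, 5, 6],
])
rows, cols, block = densest_block(toy, n_rows=2, n_cols=2)
print("rows:", rows, "cols:", cols)
print(block)
```

With the real object, `adata.obs_names[rows]` and `adata.var_names[cols]` would label the block. Note that converting a matrix of this size to CSC takes additional memory.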


## 4. Inspect Cell Metadata (obs)

The `obs` attribute contains metadata for each cell, such as cell type, batch, or cluster labels. Let's display its structure and a few rows.

In [6]:
# Display the columns in obs
print("Cell metadata columns:")
print(adata.obs.columns.tolist())

# Display the first few rows of obs
print("\nFirst 5 rows of cell metadata (obs):")
print(adata.obs.head())

Cell metadata columns:
['barcodes', 'celltype', 'n_genes', 'status']

First 5 rows of cell metadata (obs):
                   barcodes                        celltype  n_genes   status
0   F01157_AAACCTGAGCATCATC                             AT2     4259  Control
3   F01157_AAACCTGTCATGCTCC                             AT2     3492  Control
7   F01157_AAACGGGGTTACGACT                             AT2     3604  Control
8   F01157_AAACGGGGTTAGTGGG                             AT2     3856  Control
10  F01157_AAACGGGTCCTTTCGG  Proliferating Epithelial Cells     4104  Control
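
The first rows only hint at what the columns contain. A minimal sketch of how the categorical columns could be summarized with plain pandas follows; the small `obs` frame simply re-enters the five rows printed above so that the example runs without the file, and in the notebook `adata.obs` would take its place.

```python
import pandas as pd

# Toy stand-in for adata.obs, re-entering the five rows printed above
obs = pd.DataFrame(
    {
        "barcodes": [
            "F01157_AAACCTGAGCATCATC",
            "F01157_AAACCTGTCATGCTCC",
            "F01157_AAACGGGGTTACGACT",
            "F01157_AAACGGGGTTAGTGGG",
            "F01157_AAACGGGTCCTTTCGG",
        ],
        "celltype": ["AT2", "AT2", "AT2", "AT2", "Proliferating Epithelial Cells"],
        "n_genes": [4259, 3492, 3604, 3856, 4104],
        "status": ["Control"] * 5,
    },
    index=[0, 3, 7, 8, 10],
)

# Number of cells per annotated cell type
celltype_counts = obs["celltype"].value_counts()

# Cell type by condition contingency table
celltype_by_status = pd.crosstab(obs["celltype"], obs["status"])

# Distribution of detected genes per cell
n_genes_summary = obs["n_genes"].describe()

print(celltype_counts)
print(celltype_by_status)
print(n_genes_summary)
```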


## 5. Inspect Gene Metadata (var)

The `var` attribute contains metadata for each gene, such as gene names or IDs. Let's check its structure.

In [None]:
# Display the columns in var
print("Gene metadata columns:")
print(adata.var.columns.tolist())

# Display the first few rows of var
print("\nFirst 5 rows of gene metadata (var):")
print(adata.var.head())
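
In this file `var` holds a single column, `n_cells`. Assuming it records the number of cells in which each gene was detected (the file itself does not say so), a sketch like the following could list rarely detected genes. `low_detection_genes` is a hypothetical helper and the threshold of 10 cells is arbitrary; the toy frame repeats the five rows shown by `adata.var.head()` earlier.

```python
import pandas as pd

# Toy stand-in for adata.var, repeating the rows shown by adata.var.head()
var = pd.DataFrame(
    {"n_cells": [8, 89, 3, 7, 29]},
    index=["RP11-34P13.3", "RP11-34P13.7", "RP11-34P13.8", "RP11-34P13.14", "FO538757.3"],
)


def low_detection_genes(var, min_cells=10):
    """Return the names of genes whose n_cells value is below min_cells.

    Hypothetical helper; min_cells=10 is an arbitrary illustration value.
    """
    return var.index[var["n_cells"] < min_cells].tolist()


rare = low_detection_genes(var, min_cells=10)
print(rare)
```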

## 6. Inspect Additional Components

The AnnData object may contain additional components like `obsm` (multidimensional cell annotations), `varm` (multidimensional gene annotations), `layers` (additional matrices), `uns` (unstructured metadata), `obsp` (pairwise cell relationships), and `varp` (pairwise gene relationships). Let's check their contents.

In [None]:
# Inspect obsm (multidimensional cell annotations)
print("Multidimensional cell annotations (obsm):")
print(list(adata.obsm.keys()))
for key in adata.obsm.keys():
    print(f"  {key}: shape = {adata.obsm[key].shape}")

# Inspect varm (multidimensional gene annotations)
print("\nMultidimensional gene annotations (varm):")
print(list(adata.varm.keys()))
for key in adata.varm.keys():
    print(f"  {key}: shape = {adata.varm[key].shape}")

# Inspect layers (additional matrices)
print("\nAdditional data layers (layers):")
print(list(adata.layers.keys()))
for key in adata.layers.keys():
    print(f"  {key}: shape = {adata.layers[key].shape}")

# Inspect uns (unstructured metadata)
print("\nUnstructured metadata (uns):")
print(list(adata.uns.keys()))

# Inspect obsp (pairwise cell relationships)
print("\nPairwise cell relationships (obsp):")
print(list(adata.obsp.keys()))
for key in adata.obsp.keys():
    print(f"  {key}: shape = {adata.obsp[key].shape}")

# Inspect varp (pairwise gene relationships)
print("\nPairwise gene relationships (varp):")
print(list(adata.varp.keys()))
for key in adata.varp.keys():
    print(f"  {key}: shape = {adata.varp[key].shape}")
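
The six blocks above repeat the same pattern, so the inspection could also be written as one loop. The sketch below assumes only that each attribute behaves like a mapping; `summarize_mappings` and `FakeArray` are hypothetical names, and the keys `X_umap` and `counts` are made-up examples rather than contents of this file.

```python
from types import SimpleNamespace


def summarize_mappings(adata, names=("obsm", "varm", "layers", "uns", "obsp", "varp")):
    """Collect {attribute: {key: shape or None}} for mapping-like attributes.

    Hypothetical helper; entries without a .shape (common in uns) map to None.
    """
    summary = {}
    for name in names:
        mapping = getattr(adata, name)
        summary[name] = {key: getattr(mapping[key], "shape", None) for key in mapping.keys()}
    return summary


class FakeArray:
    """Minimal stand-in that exposes only a .shape attribute."""

    def __init__(self, shape):
        self.shape = shape


# Toy object standing in for adata; keys are invented for the demo
toy = SimpleNamespace(
    obsm={"X_umap": FakeArray((5, 2))},
    varm={},
    layers={"counts": FakeArray((5, 3))},
    uns={"note": "demo"},
    obsp={},
    varp={},
)

summary = summarize_mappings(toy)
for name, entries in summary.items():
    print(f"{name}: {entries}")
```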

## 7. Basic Statistics

Calculate some basic statistics about the data, such as the total number of cells, genes, and non-zero entries in the expression matrix.

In [None]:
# Number of cells and genes
n_cells, n_genes = adata.X.shape
print(f"Number of cells: {n_cells}")
print(f"Number of genes: {n_genes}")

# Number of non-zero entries in the expression matrix
# (count_nonzero() skips explicitly stored zeros, which .nnz would include)
non_zero = adata.X.count_nonzero() if hasattr(adata.X, 'toarray') else np.count_nonzero(adata.X)
print(f"Number of non-zero entries: {non_zero}")

# Sparsity of the matrix
sparsity = 1 - (non_zero / (n_cells * n_genes))
print(f"Sparsity of the matrix: {sparsity:.2%}")
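
The same non-zero counting can be done per cell, which gives the number of detected genes for each row. A sketch under stated assumptions: `detected_genes_per_cell` is a hypothetical helper, and comparing its result with `adata.obs['n_genes']` is only meaningful if that column was computed from the same matrix, which the file does not confirm. The toy matrix also shows how `.nnz` can differ from the true non-zero count when a zero is stored explicitly.

```python
import numpy as np
import scipy.sparse as sp


def detected_genes_per_cell(X):
    """Count non-zero values in each row of X (sparse or dense).

    Hypothetical helper for illustration.
    """
    if sp.issparse(X):
        # (X != 0) drops explicitly stored zeros before summing per row
        return np.asarray((X != 0).sum(axis=1)).ravel()
    return np.count_nonzero(np.asarray(X), axis=1)


# Toy 2 x 3 CSR matrix with one explicitly stored zero (row 0, column 1)
data = np.array([1.0, 0.0, 2.0])
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 3])
toy = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

per_cell = detected_genes_per_cell(toy)
stored = toy.nnz               # stored entries, including the explicit zero
non_zero = toy.count_nonzero() # entries that are actually non-zero
sparsity = 1 - non_zero / (toy.shape[0] * toy.shape[1])

print("detected genes per cell:", per_cell)
print("stored entries:", stored, "non-zero entries:", non_zero)
print(f"sparsity: {sparsity:.2%}")
```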

## 8. Notes

- If the file path is incorrect or the file is not accessible, you may encounter a `FileNotFoundError`. Make sure `file_path` points to an existing, readable H5AD file.
- The notebook assumes that the H5AD file contains standard single-cell RNA-seq data. If the file contains non-standard components, you may need to modify the inspection steps.
- To explore specific components further (e.g., visualize UMAP or specific metadata), additional code can be added.
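
For the first note, a small pre-flight check can surface a wrong path before the load is attempted. The sketch below uses only the standard library. `check_h5ad_path` is a hypothetical helper, and it assumes the HDF5 signature sits at byte offset 0, which does not hold for files written with a user block.

```python
import tempfile
from pathlib import Path

# The 8-byte signature that starts an HDF5 file without a user block
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def check_h5ad_path(path):
    """Return path as a Path if it exists and starts with the HDF5 signature.

    Hypothetical helper; raises FileNotFoundError or ValueError otherwise.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"H5AD file not found: {path}")
    with path.open("rb") as handle:
        header = handle.read(len(HDF5_SIGNATURE))
    if header != HDF5_SIGNATURE:
        raise ValueError(f"{path} does not start with the HDF5 signature")
    return path


# Demo on a temporary file that carries only the signature bytes
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.h5ad"
    demo.write_bytes(HDF5_SIGNATURE + b"\x00" * 8)
    checked = check_h5ad_path(demo)
    print("looks like HDF5:", checked.name)
```

In the notebook, calling `check_h5ad_path(file_path)` before `sc.read_h5ad(file_path)` would fail early with a clearer message. Passing the check does not guarantee that the file is a valid AnnData container.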