# Tutorial 01: Getting Started with ScpTensor

This tutorial introduces the basics of ScpTensor, a Python library for single-cell proteomics (SCP) data analysis.

## Learning Objectives

By the end of this tutorial, you will:
- Understand the ScpTensor data structure (ScpContainer -> Assay -> ScpMatrix)
- Load and explore example datasets
- Inspect data and metadata
- Understand the mask code system for tracking missing values
- Save and load data

---

## 1. Installation and Setup

First, let's import ScpTensor and set up our environment.

In [None]:
import numpy as np
import polars as pl

# Import ScpTensor
import scptensor
from scptensor import (
    Assay,
    MaskCode,
    ScpContainer,
    ScpMatrix,
)
from scptensor.datasets import load_toy_example

# Check version
print(f"ScpTensor version: {scptensor.__version__}")
print("\nLibraries imported successfully!")

## 2. Understanding the ScpTensor Data Structure

ScpTensor uses a hierarchical data structure:

ScpContainer (top-level)
|-- obs: pl.DataFrame           # Sample metadata (n_samples x metadata)
|-- assays: Dict[str, Assay]    # Named assay registry
'-- history: List[ProvenanceLog]  # Operation audit trail

Assay (feature-space)
|-- var: pl.DataFrame            # Feature metadata (n_features x metadata)
'-- layers: Dict[str, ScpMatrix] # Named layer registry

ScpMatrix (physical storage)
|-- X: Union[np.ndarray, sp.spmatrix]  # Values
|-- M: Union[np.ndarray, sp.spmatrix, None]  # Mask codes
'-- metadata: MatrixMetadata     # Quality scores

Key Concepts:

- Container: Top-level object holding all your data
- Assay: Represents a feature space (e.g., proteins, peptides)
- Layer: Represents a transformed version of the data (e.g., raw, log, imputed)
- Mask (M): Tracks the provenance of each value (valid, missing, imputed, etc.)
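
To make the nesting concrete, here is a minimal sketch of the three levels using plain dataclasses. The Sketch* classes are hypothetical stand-ins written for this illustration, not the ScpTensor classes. They only mirror the field names from the diagram above, with lists of rows standing in for DataFrames and arrays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchMatrix:
    """Stand-in for ScpMatrix: values X plus an optional mask M of the same shape."""
    X: List[List[float]]
    M: Optional[List[List[int]]] = None


@dataclass
class SketchAssay:
    """Stand-in for Assay: feature metadata plus named layers."""
    var: List[dict]
    layers: Dict[str, SketchMatrix] = field(default_factory=dict)


@dataclass
class SketchContainer:
    """Stand-in for ScpContainer: sample metadata, named assays, and a history."""
    obs: List[dict]
    assays: Dict[str, SketchAssay] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


# Two samples (rows) by three features (columns).
raw = SketchMatrix(
    X=[[1.0, 2.0, 0.0], [4.0, 0.0, 6.0]],
    M=[[0, 0, 2], [0, 1, 0]],  # 0 = valid; non-zero codes flag non-valid entries
)
sketch_assay = SketchAssay(var=[{"protein_id": f"P{i}"} for i in range(3)])
sketch_assay.layers["raw"] = raw
sketch = SketchContainer(obs=[{"sample_id": "S0"}, {"sample_id": "S1"}])
sketch.assays["proteins"] = sketch_assay

# Walk the hierarchy from the top: container -> assay -> layer -> value.
value = sketch.assays["proteins"].layers["raw"].X[1][2]
print(value)
```

Note that the number of obs rows matches the rows of X and the number of var rows matches its columns, which is the invariant the diagram implies.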

## 3. Loading an Example Dataset

ScpTensor provides built-in example datasets for learning and testing.

In [None]:
# Load the toy example dataset
container = load_toy_example()

# Display basic information
print("Container loaded successfully!")
print(f"\n{container}")

Expected Output:

Container loaded successfully!

ScpContainer with 100 samples and 1 assay
  Assay: proteins (100 samples, 50 features)
  Layers: ['raw']

## 4. Exploring the Container

Let's explore the different components of the container.

### 4.1 Sample Metadata (obs)

The obs DataFrame contains metadata for each sample (cell).

In [None]:
# View sample metadata
print("Sample Metadata (obs):")
print("=" * 40)
print(container.obs)
print("\nShape:", container.obs.shape)
print("\nColumns:", container.obs.columns)  # polars already returns a list of names

### 4.2 Accessing Assay and Layers

In [None]:
# Access the proteins assay
assay = container.assays["proteins"]
print(f"Assay name: {assay.name}")
print(f"Number of features: {assay.n_features}")
print(f"Available layers: {list(assay.layers.keys())}")

### 4.3 Feature Metadata (var)

In [None]:
# View feature metadata
print("Feature Metadata (var):")
print("=" * 40)
print(assay.var.head(10))

### 4.4 Data Matrix (X) and Mask (M)

In [None]:
# Access the raw data matrix
matrix = assay.layers["raw"]

print(f"Data matrix (X) shape: {matrix.X.shape}")
print(f"Mask matrix (M) shape: {matrix.M.shape}")
print(f"\nData type of X: {matrix.X.dtype}")
print(f"Data type of M: {matrix.M.dtype}")
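
The structure diagram in Section 2 lists M as optional and allows sparse storage, so reading matrix.M.shape may fail on a layer that has no mask. Below is a defensive sketch. It assumes, without confirmation from the library, that a missing mask means every value is valid. The helper as_dense_mask is hypothetical and not part of ScpTensor.

```python
import numpy as np


def as_dense_mask(M, shape):
    """Return a dense int8 mask of the given shape.

    Assumption: a missing mask (None) means every entry is valid (code 0).
    """
    if M is None:
        return np.zeros(shape, dtype=np.int8)
    if hasattr(M, "toarray"):  # SciPy sparse matrices expose toarray()
        M = M.toarray()
    M = np.asarray(M, dtype=np.int8)
    if M.shape != tuple(shape):
        raise ValueError(f"mask shape {M.shape} does not match data shape {tuple(shape)}")
    return M


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
print(as_dense_mask(None, X_demo.shape))
print(as_dense_mask([[0, 1], [2, 0]], X_demo.shape))
```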

### 4.5 Viewing the Data

In [None]:
# View a subset of the data (first 5 samples, first 5 features)
print("First 5 samples x 5 features of the data matrix:")
print(matrix.X[:5, :5])

print("\nCorresponding mask codes:")
print(matrix.M[:5, :5])

## 5. Understanding Mask Codes

The mask matrix M tracks the provenance of each value:

Code | Name        | Description
-----|-------------|-------------------------------------
0    | VALID       | Detected value
1    | MBR         | Missing Between Runs (random)
2    | LOD         | Limit of Detection (systematic)
3    | FILTERED    | Removed by QC
5    | IMPUTED     | Filled by imputation

In [None]:
# Count mask codes in the data
from scptensor import count_mask_codes

mask_counts = count_mask_codes(matrix.M)
print("Mask code distribution:")
print("=" * 30)
for code, name in MaskCode.names().items():
    count = mask_counts.get(code, 0)
    percentage = (count / matrix.M.size) * 100
    print(f"{code} ({name:12s}): {count:6d} ({percentage:5.2f}%)")
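
For intuition, the same kind of tally can be reproduced with plain NumPy. This is a sketch of the idea, not the implementation of count_mask_codes. The CODE_NAMES dictionary is copied from the table above, and the small mask array is made up for illustration.

```python
import numpy as np

# Code names copied from the table above (illustrative; not read from the library).
CODE_NAMES = {0: "VALID", 1: "MBR", 2: "LOD", 3: "FILTERED", 5: "IMPUTED"}

M_demo = np.array(
    [
        [0, 0, 2, 0],
        [1, 0, 0, 5],
        [0, 2, 0, 0],
    ],
    dtype=np.int8,
)

# np.unique returns the distinct codes and how often each occurs.
codes, counts = np.unique(M_demo, return_counts=True)
tally = {int(code): int(count) for code, count in zip(codes, counts)}

for code, name in CODE_NAMES.items():
    count = tally.get(code, 0)  # codes absent from the mask count as zero
    print(f"{code} ({name:8s}): {count:3d} ({100 * count / M_demo.size:5.1f}%)")
```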

## 6. Container Properties and Methods

In [None]:
# Access container properties
print(f"Number of samples: {container.n_samples}")
print(f"Number of assays: {len(container.assays)}")
print(f"Assay names: {list(container.assays.keys())}")

# For the proteins assay
proteins_assay = container.assays["proteins"]
print(f"\nNumber of features in 'proteins': {proteins_assay.n_features}")
print(f"Number of layers in 'proteins': {len(proteins_assay.layers)}")
print(f"Layer names: {list(proteins_assay.layers.keys())}")

## 7. Computing Basic Statistics

In [None]:
# Get the data matrix and its mask (this cell assumes both are dense arrays)
X = matrix.X
M = matrix.M

# Copy X as floats and set every non-valid entry (mask code != 0) to NaN
X_masked = X.copy().astype(float)
X_masked[M != 0] = np.nan

# Compute statistics
print("Data Statistics (considering only valid values):")
print("=" * 50)
print(f"Mean intensity: {np.nanmean(X_masked):.4f}")
print(f"Median intensity: {np.nanmedian(X_masked):.4f}")
print(f"Std intensity: {np.nanstd(X_masked):.4f}")
print(f"Min intensity: {np.nanmin(X_masked):.4f}")
print(f"Max intensity: {np.nanmax(X_masked):.4f}")
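
The summary above pools every value into a single number. Per-feature summaries follow from the same masking step by reducing along axis 0 (samples). Here is a small sketch on made-up data, assuming dense arrays with samples in rows and features in columns:

```python
import numpy as np

X_demo = np.array([[10.0, 0.0, 3.0], [12.0, 5.0, 0.0], [14.0, 7.0, 0.0]])
M_demo = np.array([[0, 2, 0], [0, 0, 1], [0, 0, 2]], dtype=np.int8)

# Hide every non-valid entry (mask code != 0) behind NaN.
X_valid = np.where(M_demo == 0, X_demo, np.nan)

# Reduce over samples (axis 0) to get one number per feature.
feature_mean = np.nanmean(X_valid, axis=0)
valid_fraction = (M_demo == 0).mean(axis=0)

print(feature_mean)
print(valid_fraction)
```

The valid fraction per feature is a quick way to spot features that are mostly missing before trusting their means.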

## 8. Exploring Metadata

In [None]:
# Sample metadata by batch
print("Sample distribution by batch:")
# polars >= 0.20.5: GroupBy.len() replaces the deprecated GroupBy.count()
batch_counts = container.obs.group_by("batch").len().sort("batch")
print(batch_counts)

print("\nSample distribution by cell type:")
celltype_counts = container.obs.group_by("cell_type").len().sort("cell_type")
print(celltype_counts)

## 9. Provenance Tracking

ScpTensor automatically records the operations performed on a container in container.history, a list of ProvenanceLog entries.

In [None]:
# View operation history
print("Operation History:")
print("=" * 40)
for i, log in enumerate(container.history):
    print(f"{i + 1}. {log.action}: {log.description}")
    print(f"   Parameters: {log.params}")
    print()
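
The loop above reads three attributes from each entry: action, description, and params. Below is a sketch of how such an audit trail could be kept. The SketchLog dataclass and the log_step helper are hypothetical and not part of ScpTensor, and the action names are invented for illustration. Only the three attribute names come from the loop above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class SketchLog:
    """One audit-trail entry: what was done, and with which settings."""
    action: str
    description: str
    params: Dict[str, Any] = field(default_factory=dict)


def log_step(history: List[SketchLog], action: str, description: str, **params: Any) -> None:
    """Append one entry to the end of the history; existing entries are left alone."""
    history.append(SketchLog(action=action, description=description, params=dict(params)))


sketch_history: List[SketchLog] = []
log_step(sketch_history, "log_transform", "Log-transformed the raw layer", base=2, offset=1.0)
log_step(sketch_history, "filter_features", "Dropped sparse features", min_valid_fraction=0.5)

for i, log in enumerate(sketch_history, start=1):
    print(f"{i}. {log.action}: {log.description}")
    print(f"   Parameters: {log.params}")
```

Keeping the history append-only means the order of entries reproduces the order in which the steps were applied.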

## 10. Saving and Loading Data

ScpTensor supports multiple file formats for saving and loading data. The cell below writes the toy container to CSV and to NPZ.

In [None]:
import os

from scptensor import save_csv, save_npz

# Create output directory
output_dir = "tutorial_output"
os.makedirs(output_dir, exist_ok=True)

# Save to CSV format
csv_path = os.path.join(output_dir, "toy_example.csv")
save_csv(container, csv_path)
print(f"Data saved to: {csv_path}")

# Save to NPZ format (compressed, faster)
npz_path = os.path.join(output_dir, "toy_example.npz")
save_npz(container, npz_path)
print(f"Data saved to: {npz_path}")
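
An .npz file is NumPy's zip archive of named arrays. The sketch below round-trips a value matrix and its mask with np.savez_compressed and np.load. It illustrates the file format only: the keys X and M are chosen here for illustration, and the layout that save_npz actually writes is not described in this tutorial.

```python
import os
import tempfile

import numpy as np

X_demo = np.array([[1.5, 0.0], [2.5, 3.5]])
M_demo = np.array([[0, 2], [0, 0]], dtype=np.int8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_layer.npz")

    # Write both arrays into one compressed archive under the keys "X" and "M".
    np.savez_compressed(path, X=X_demo, M=M_demo)

    # np.load returns a lazy archive; index it by key to read each array.
    with np.load(path) as archive:
        X_back = archive["X"]
        M_back = archive["M"]

# Values and dtypes survive the round trip unchanged.
print(np.array_equal(X_demo, X_back), np.array_equal(M_demo, M_back))
print(M_back.dtype)
```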

## 11. Creating Your Own Container

You can also create a container from your own data.

In [None]:
# Example: Create a simple container from scratch
n_samples = 20
n_features = 10

# Generate some random data
np.random.seed(123)
X = np.random.lognormal(mean=2, sigma=0.5, size=(n_samples, n_features))

# Create a simple mask (all valid)
M = np.zeros((n_samples, n_features), dtype=np.int8)

# Create metadata
obs = pl.DataFrame(
    {
        "sample_id": [f"Sample_{i}" for i in range(n_samples)],
        "condition": ["Control" if i % 2 == 0 else "Treatment" for i in range(n_samples)],
    }
)

var = pl.DataFrame(
    {
        "protein_id": [f"Protein_{i}" for i in range(n_features)],
        "gene_name": [f"Gene_{i}" for i in range(n_features)],
    }
)

# Create the container
custom_container = ScpContainer(
    obs=obs.with_columns(pl.Series("_index", obs["sample_id"])),
    sample_id_col="sample_id",
)

# Create assay and add to container
custom_assay = Assay(var=var, feature_id_col="protein_id")
custom_assay.add_layer("raw", ScpMatrix(X=X, M=M))
custom_container.add_assay("proteins", custom_assay)

print(f"Custom container created: {custom_container}")
print(
    f"Shape: {custom_container.n_samples} samples x {custom_container.assays['proteins'].n_features} features"
)
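
The example above marks every value as valid. Real data usually contains gaps, so a more realistic mask flags them. The sketch below builds a mask from a small matrix in which missing measurements are stored as NaN and assigns them code 2 (LOD) from the table in Section 5. Treating every NaN as LOD is an assumption made for illustration; the appropriate code depends on why a value is missing.

```python
import numpy as np

VALID, LOD = 0, 2  # codes taken from the mask code table in Section 5

X_demo = np.array(
    [
        [8.2, np.nan, 5.1],
        [7.9, 6.4, np.nan],
    ]
)

# Start from an all-valid mask, then flag the gaps.
M_demo = np.full(X_demo.shape, VALID, dtype=np.int8)
M_demo[np.isnan(X_demo)] = LOD

# The mask must have the same shape as the values it describes.
assert M_demo.shape == X_demo.shape
print(M_demo)
```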

## Summary

In this tutorial, you learned:

1. Data Structure: The ScpContainer -> Assay -> ScpMatrix hierarchy
2. Loading Data: Using built-in datasets (load_toy_example())
3. Inspecting Data: Accessing obs, var, X, and M
4. Mask Codes: Understanding the provenance tracking system
5. Saving/Loading: Using save_csv() and save_npz()
6. Creating Containers: Building containers from scratch

Next Steps:
- Tutorial 02: Quality Control and Normalization
- Tutorial 03: Imputation and Batch Correction
- Tutorial 04: Clustering and Visualization

Additional Resources:
- API Reference: docs/design/API_REFERENCE.md
- GitHub Repository: https://github.com/your-org/scptensor