# Install required packages

In [11]:
!pip install \
  cytotrace2-py \
  scanpy \
  anndata \
  pandas \
  scipy \
  scikit-learn \
  torch \
  matplotlib \
  seaborn \
  gdown

Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting pandas==1.5.3
  Using cached pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting matplotlib==3.7.1
  Using cached matplotlib-3.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting seaborn==0.12.2
  Using cached seaborn-0.12.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scanpy==1.9.3
  Using cached scanpy-1.9.3-py3-none-any.whl.metadata (6.1 kB)
Collecting anndata==0.9.1
  Using cached anndata-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting scikit-learn==1.2.2
  Downloading scikit_learn-1.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting scipy==1.10.1
  Downloading scipy-1.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)

# Import packages and suppress irrelevant warnings

In [2]:
import warnings
import logging
import matplotlib

# Suppress the warning raised when the Helvetica font cannot be found
warnings.filterwarnings("ignore", message="findfont: Font family")
# Suppress font-related log messages from matplotlib's font manager
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

# Suppress the deprecation warning from cytotrace2_py
warnings.filterwarnings(
    "ignore",
    message="pkg_resources is deprecated as an API.*",
    category=UserWarning,
    module="cytotrace2_py.common.gen_utils"
)
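To see how the `message` argument behaves, here is a minimal, standard-library-only sketch. The helper name `visible_warnings` and the example warning texts are made up for illustration; the point is that `message` is a regular expression matched against the start of the warning text, so only warnings beginning with that prefix are hidden.

```python
import warnings


def visible_warnings(messages):
    """Emit each message as a UserWarning and return the ones not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything by default
        # Same kind of filter as in the cell above: `message` is a regex
        # that must match the beginning of the warning text
        warnings.filterwarnings("ignore", message="findfont: Font family")
        for text in messages:
            warnings.warn(text, UserWarning)
    return [str(w.message) for w in caught]


visible = visible_warnings([
    "findfont: Font family 'Helvetica' not found.",  # hidden by the filter
    "some other warning",                             # still reported
])
print(visible)
```

Because filters added later take precedence, the `ignore` rule wins over the catch-all `always` rule for matching messages.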

In [3]:
import gdown
import os
import requests
import pandas as pd
from cytotrace2_py.cytotrace2_py import cytotrace2
import matplotlib.pyplot as plt

# Download the expression file and the annotation file

In [8]:
os.makedirs("data", exist_ok=True)

outputs = ["data/Pancreas_10x_downsampled_expression.txt", "data/Pancreas_10x_downsampled_annotation.txt"]
file_ids = ["11eI1gSBoBqn9ccvBbthZ2nPW3CENsKbT", "1UESeZJDl2qWYnSu0VQQA5igpEbtxZPgq"]
for i, f_id in enumerate(file_ids):
    url = f"https://drive.google.com/uc?id={f_id}"
    # Download only if the file doesn't already exist
    if not os.path.exists(outputs[i]):
        gdown.download(url, outputs[i], quiet=False)
    else:
        print("Dataset already downloaded.")

Downloading...
From (original): https://drive.google.com/uc?id=11eI1gSBoBqn9ccvBbthZ2nPW3CENsKbT
From (redirected): https://drive.google.com/uc?id=11eI1gSBoBqn9ccvBbthZ2nPW3CENsKbT&confirm=t&uuid=68755ad8-7735-4428-93df-890d346e4a1c
To: /content/data/Pancreas_10x_downsampled_expression.txt

100%|██████████| 160M/160M [00:01<00:00, 87.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1UESeZJDl2qWYn

# What is the expression file?

In [9]:
expression = pd.read_csv("data/Pancreas_10x_downsampled_expression.txt", sep='\t')

### Columns = individual cells

* Each column label (e.g. `CTCTAATAGGAGCGAG_1_2`) is a **`cellID`**
* These come from the sequencing process (droplet barcodes) — they **identify individual cells**, not **cell types**
* So `cellID ≠ cell type`
* The corresponding **cell type or "phenotype"** must come from a **separate annotation file**


### Rows = individual genes

* Each row is a gene symbol (e.g., `Xkr4`, `Rp1`, `Gm1992`)


### Values = expression levels

* The values are typically raw counts or normalized values (e.g., counts per million, CPM)
  * e.g., how many times mRNA for `Xkr4` was detected in that cell
* A `0` means no transcripts of that gene were detected in that cell
* A `10`, `500`, etc. means that many transcripts of the gene were detected (or a normalized value of that magnitude)
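The layout described above can be checked on a small toy table. This is only a sketch: the cell names (`CELL_A`, ...) and all values are invented, and it assumes the first column of the file holds the gene symbols, which is why `index_col=0` is passed here.

```python
import io

import pandas as pd

# Toy tab-separated table in the assumed layout: first column = gene symbol,
# remaining columns = one cell each (all values are made up)
toy_text = (
    "gene\tCELL_A\tCELL_B\tCELL_C\n"
    "Xkr4\t0\t3\t0\n"
    "Rp1\t10\t0\t0\n"
    "Gm1992\t0\t0\t500\n"
)
toy_expression = pd.read_csv(io.StringIO(toy_text), sep="\t", index_col=0)

# Rows are genes, columns are cells
n_genes, n_cells = toy_expression.shape
print(f"{n_genes} genes x {n_cells} cells")

# Expression of one gene in one cell
print(toy_expression.loc["Xkr4", "CELL_B"])

# Fraction of entries that are zero (single-cell matrices are usually sparse)
zero_fraction = (toy_expression == 0).to_numpy().mean()
print(round(zero_fraction, 3))
```

The same three checks (shape, a single lookup, the zero fraction) are a quick way to confirm that a real expression file was parsed with genes as rows and cells as columns.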


# What is the annotation file?

In [None]:
annotation = pd.read_csv("data/Pancreas_10x_downsampled_annotation.txt", sep='\t')

In [None]:
annotation.head()

The first column holds the **cellID** of each cell and the second column holds its **phenotype**.

> In single-cell RNA-seq, you're measuring **gene expression in thousands (or millions) of individual cells**, so each cell **must have a unique identifier** — that's what the `cellID` is for.


### Why `cellID` is essential:

1. **Each column of the expression file = one cell**

   * You're working with a matrix: **genes × cells**
   * So you need a unique `cellID` to know which expression profile belongs to which cell

2. **No pre-existing IDs in the body**

   * Your body doesn't label cells with barcodes
   * So the sequencing pipeline assigns artificial IDs during the experiment

3. **You need it to match annotations**

   * If you cluster cells (e.g., by expression), or label them by type (like "Epsilon cell"), those labels must be linked to **specific cells** using their `cellID`


### What it looks like in practice:

| `cellID`               | `gene1` | `gene2` | ... | `phenotype`             |
| ---------------------- | ------- | ------- | --- | ----------------------- |
| `GGTATTGAGTCGTACT_1_0` | 5       | 0       | ... | Epsilon cell            |
| `GTAACTGGTCACTGGC_1_3` | 0       | 8       | ... | Immature endocrine cell |

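* `cellID` lets you join expression data and annotation; the table above shows the expression matrix transposed (one row per cell) with `phenotype` attached
* Without `cellID`, you'd have no way to say **which cell is which**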
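A minimal sketch of that join, using toy data that mirrors the table above. The gene names and values are placeholders, and the sketch assumes both tables are indexed by `cellID`.

```python
import pandas as pd

# Toy expression matrix (genes x cells); values are invented for illustration
toy_expression = pd.DataFrame(
    {"GGTATTGAGTCGTACT_1_0": [5, 0], "GTAACTGGTCACTGGC_1_3": [0, 8]},
    index=["gene1", "gene2"],
)

# Toy annotation, indexed by cellID
toy_annotation = pd.DataFrame(
    {"phenotype": ["Epsilon cell", "Immature endocrine cell"]},
    index=pd.Index(
        ["GGTATTGAGTCGTACT_1_0", "GTAACTGGTCACTGGC_1_3"], name="cellID"
    ),
)

# Transpose so each row is a cell, then join on the shared cellID index
cells_by_genes = toy_expression.T
cells_by_genes.index.name = "cellID"
joined = cells_by_genes.join(toy_annotation, how="inner")
print(joined)
```

An inner join keeps only cells present in both tables, so comparing `len(joined)` with the number of columns in the expression matrix is a simple way to spot cell IDs that failed to match.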


In [None]:
# List all possible cell types
annotation['phenotype'].unique()

## What are these "phenotypes", e.g. Alpha, Beta, Epsilon, and endocrine precursor cells?

These are all **cell types** found in the **pancreas**, specifically in the islets of Langerhans, a region involved in hormone production (like insulin and glucagon).

They’re biologically distinct cell types, each with a specific function, and each with a unique gene expression signature that allows us to identify them in single-cell RNA-seq data.

> A cell type is defined by the set of genes it expresses, especially marker genes that are unique or highly active in that type.

* These phenotype labels were most likely assigned by clustering the gene expression profiles and then checking marker genes in each cluster.
  
> Clustering + marker gene expression $\rightarrow$ label = Alpha, Beta, etc.

* Each type expresses specific marker genes. For example:

  * Beta cells: `Ins1`/`Ins2` (insulin; `INS` in human)
  * Alpha cells: `Gcg` (glucagon; `GCG` in human)
  * Epsilon cells: `Ghrl` (ghrelin; `GHRL` in human)
  * Precursor cells: lower or mixed expression of these markers

* Same cell type means similar overall gene expression profile. But there is still some natural variability:
    - Due to the cell cycle
    - Environmental cues
    - Technical noise in the experiment

### How was this likely done?

1. The original expression matrix was clustered (e.g., with Louvain or Leiden)
2. For each cluster, marker genes were checked:

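   * If cells express insulin (`Ins1`/`Ins2` in mouse, `INS` in human), they’re Beta
   * If they express glucagon (`Gcg` in mouse, `GCG` in human), they’re Alpha
   * If they express none of these markers, or only low or mixed levels, they are likely not yet fully differentiated and are labeled as precursors or immature.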
3. The label was saved as `phenotype`
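The steps above can be sketched on toy data. This is an illustration only, not the procedure used for this dataset: the cluster IDs are given rather than computed by Louvain or Leiden, the expression values are invented, the marker columns use mouse gene symbols (`Ins1`, `Gcg`, `Ghrl`) for the markers listed above, and the cutoff `MIN_MEAN_EXPRESSION` is a hypothetical threshold.

```python
import pandas as pd

# Toy per-cell table: a precomputed cluster ID plus expression of three markers.
# All values are invented for illustration.
cells = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "Ins1": [90, 110, 1, 0, 2, 3],
        "Gcg": [0, 2, 80, 120, 1, 4],
        "Ghrl": [0, 0, 1, 0, 3, 2],
    }
)

marker_to_label = {"Ins1": "Beta", "Gcg": "Alpha", "Ghrl": "Epsilon"}
MIN_MEAN_EXPRESSION = 10  # hypothetical cutoff for "clearly expressed"


def label_cluster(mean_markers):
    """Label a cluster by its strongest marker, or 'Precursor' if all are low."""
    top_marker = mean_markers.idxmax()
    if mean_markers[top_marker] < MIN_MEAN_EXPRESSION:
        return "Precursor"
    return marker_to_label[top_marker]


# Step 2: average marker expression per cluster, then pick a label
cluster_means = cells.groupby("cluster")[list(marker_to_label)].mean()
cluster_labels = cluster_means.apply(label_cluster, axis=1)

# Step 3: save the label for every cell as `phenotype`
cells["phenotype"] = cells["cluster"].map(cluster_labels)
print(cluster_labels.to_dict())
```

In a real analysis the cutoff and the marker list would be chosen by inspecting the data, and labels are usually confirmed with several markers per cell type rather than one.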


# To find the cell type of a column:

In [None]:
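# Assuming you have expression (genes × cells) and annotation (cellID → phenotype)
# Example: look up the cell type for one column
cell_id = "CTCTAATAGGAGCGAG_1_2"
# .loc looks up by index label, so make sure the cell IDs are the index.
# If they were read as an ordinary first column, use that column as the index.
by_cell = annotation if cell_id in annotation.index else annotation.set_index(annotation.columns[0])
by_cell.loc[cell_id]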

# Run Cytotrace2

In [None]:
plt.rcParams['font.family'] = 'DejaVu Sans' # Use a generic font

In [None]:
from cytotrace2_py.cytotrace2_py import cytotrace2

results = cytotrace2(
    "data/Pancreas_10x_downsampled_expression.txt",
    annotation_path="data/Pancreas_10x_downsampled_annotation.txt",
    species="mouse",
)