# Quick Start: scRNA-seq Data Processing with Scanpy



This notebook is designed for learning basic single-cell RNA-seq (scRNA-seq) data processing workflows using **Scanpy**.



You can run it locally (e.g. in Jupyter) or on **Google Colab**.



---



## Step 1: Open This Notebook in Google Colab (optional)



There is no single fixed link for all users, because it depends on where you store this notebook. Use one of the options below:



### Option A: Upload the notebook file

1. Download this notebook file (`1_Quick_Start_Single_Cell.ipynb`) to your computer.

2. Open Colab: [https://colab.research.google.com](https://colab.research.google.com)

3. In Colab, go to **File → Upload notebook** and select the downloaded file.



### Option B: Open directly from GitHub

1. Make sure this notebook is in a public GitHub repository.

2. Open Colab: [https://colab.research.google.com](https://colab.research.google.com)

3. Click the **GitHub** tab.

4. Paste the GitHub URL of this notebook (from your browser) into the search box.

5. Click the notebook name to open it in Colab.


## Step 2: Install Required Packages



Before starting the analysis, install the main packages we will use (packages that are already present are left as they are):



- `scanpy` and `anndata` for single-cell analysis and data structures

- `umap-learn` for dimensionality reduction

- `leidenalg` and `python-igraph` for clustering

- `gseapy` for enrichment analysis

- `seaborn` and `matplotlib` (already in Colab) for plotting



**Instructions:**



1. If you are on **Colab**, run the next code cell once when you start the notebook.

2. If you are running **locally**, you can either:

   - Use the next cell (it will install into your current environment), or

   - Install these packages with `pip`/`conda` in your environment beforehand.


In [None]:
%%capture
# Colab / Environment Setup
# Run this cell once at the start of your session.
# On Google Colab, many core packages are pre-installed, so this
# installs only the single-cell analysis packages that are missing.
# Note: %%capture must be the first line of the cell; it hides the pip output.
!pip install -q scanpy anndata umap-learn leidenalg python-igraph gseapy seaborn
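
After the install cell has run, it can be useful to confirm which packages are actually available in the current environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the names passed to `pip` above.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cell
PACKAGES = ["scanpy", "anndata", "umap-learn", "leidenalg",
            "python-igraph", "gseapy", "seaborn"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before continuing.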


In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sc.set_figure_params(dpi=200)  # Higher-resolution inline figures

In [None]:
# Adjust this path to wherever VAL_and_DIS.h5ad is stored in your environment
adata_from_web = sc.read_h5ad('/data/sarkar_lab/Projects/teaching/IGP_S26_single_cell/data/VAL_and_DIS.h5ad')

In [None]:
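adata = adata_from_web.raw.to_adata()  # Rebuild an AnnData object from the values stored in .raw

In [None]:
adata.raw = adata.copy()  # Keep an untouched copy in .raw before normalizing

In [None]:
adata.X.toarray()  # Inspect the values as a dense array (assumes adata.X is sparse; memory-heavy for large datasets)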

In [None]:
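# Normalize, transform, and scale counts
sc.pp.normalize_total(adata)          # Rescale each cell to the same total (default target: median total across cells)
adata.X = np.arcsinh(adata.X).copy()  # Compress large values; unlike log, arcsinh is defined at zero
sc.pp.scale(adata)                    # Center each gene to zero mean and scale to unit variance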
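
To make the three steps concrete, here is a dependency-free sketch of the same arithmetic on a tiny, hypothetical count matrix. It is not Scanpy's implementation: the helper names are made up for illustration, and the scaling step assumes the sample standard deviation is used.

```python
import math
import statistics

# Hypothetical toy count matrix: 3 cells (rows) x 3 genes (columns)
counts = [
    [10, 0, 5],
    [20, 4, 1],
    [0, 6, 9],
]

def normalize_total(rows):
    """Rescale each cell so its counts sum to the median per-cell total."""
    totals = [sum(row) for row in rows]
    target = statistics.median(totals)
    return [[v * target / t for v in row] for row, t in zip(rows, totals)]

def scale_columns(rows):
    """Center each gene to mean 0 and scale to unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) or 1.0 for c in columns]  # leave constant genes unscaled
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

normalized = normalize_total(counts)
transformed = [[math.asinh(v) for v in row] for row in normalized]
scaled = scale_columns(transformed)

for row in scaled:
    print([round(v, 3) for v in row])
```

After normalization every cell sums to the same total, zeros stay at zero under `asinh`, and after scaling each gene has mean 0 and unit standard deviation across cells.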

In [None]:
adata.X

In [None]:
adata.var_names

In [None]:
adata.var

## Step 3: Make `var_names` Interpretable



Right now, the features in `adata.var_names` may be internal IDs (for example, Ensembl IDs or generic feature IDs), which are not easy to interpret.



In many datasets, `adata.var` also contains a more human-readable column with gene symbols or feature names (for example, a column called `feature_name` or `gene_symbol`).



In the next cell, we:



1. Keep the current IDs in a separate column (`'original_ids'`).

2. Replace `adata.var_names` with a readable feature-name column from `adata.var` (if available).

3. Call `adata.var_names_make_unique()` to avoid duplicate names.


In [None]:
# Try to replace adata.var_names with a more readable feature-name column

# 1. Preserve the original IDs
adata.var['original_ids'] = adata.var_names

# 2. Choose a readable feature-name column if present
candidate_columns = ['feature_name', 'gene_symbol', 'gene_name']

for col in candidate_columns:
    if col in adata.var.columns:
        # Use plain values so the Series index and name are not carried over
        adata.var_names = adata.var[col].astype(str).values
        break
else:
    # If none of the expected columns exist, keep var_names as-is
    print('No feature-name column (feature_name / gene_symbol / gene_name) found in adata.var.')

# 3. Ensure var_names are unique
adata.var_names_make_unique()

adata.var.head()  # Show the updated var table
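
Gene symbols are not guaranteed to be unique, which is why the cell above ends with `adata.var_names_make_unique()`. The sketch below imitates the documented effect of that call (repeated names receive `-1`, `-2`, ... suffixes) on a plain list. The function `make_unique` is a hypothetical stand-in written for illustration, not the library's code.

```python
from collections import Counter

def make_unique(names, join="-"):
    """Append '-1', '-2', ... to repeated names; the first occurrence is kept as-is."""
    taken = set(names)
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)
        else:
            candidate = f"{name}{join}{seen[name]}"
            # Skip suffixes that would collide with an existing name
            while candidate in taken:
                seen[name] += 1
                candidate = f"{name}{join}{seen[name]}"
            taken.add(candidate)
            unique.append(candidate)
        seen[name] += 1
    return unique

print(make_unique(["Actb", "Gapdh", "Actb", "Actb"]))
# ['Actb', 'Gapdh', 'Actb-1', 'Actb-2']
```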

In [None]:
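# Flag mitochondrial genes by prefix ('mt-' matches mouse gene symbols; human symbols use 'MT-')
adata.var['Mitochondrial'] = adata.var.index.str.startswith('mt-')
# Compute per-cell and per-gene QC metrics from the values stored in .raw
sc.pp.calculate_qc_metrics(adata, qc_vars=['Mitochondrial'], use_raw=True, inplace=True)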
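
As an illustration of what the mitochondrial QC metric measures, the following sketch computes the percentage of mitochondrial counts per cell by hand. The gene symbols and counts are hypothetical, and the prefix test is made case-insensitive here as an assumption so that both mouse-style and human-style symbols match.

```python
# Hypothetical gene symbols and raw counts for two cells
genes = ["mt-Nd1", "mt-Co1", "Actb", "Gapdh"]
cells = {
    "cell_1": [5, 5, 60, 30],
    "cell_2": [40, 10, 30, 20],
}

def is_mitochondrial(symbol, prefix="mt-"):
    """Case-insensitive prefix test, so 'mt-Nd1' (mouse) and 'MT-ND1' (human) both match."""
    return symbol.lower().startswith(prefix.lower())

def pct_mitochondrial(counts, gene_names):
    """Percentage of a cell's total counts that come from mitochondrial genes."""
    total = sum(counts)
    mito = sum(c for c, g in zip(counts, gene_names) if is_mitochondrial(g))
    return 100.0 * mito / total if total else 0.0

for cell, counts in cells.items():
    print(cell, pct_mitochondrial(counts, genes))
# cell_1 10.0
# cell_2 50.0
```

In the real workflow the corresponding per-cell value is written by `calculate_qc_metrics` to `adata.obs['pct_counts_Mitochondrial']`; a high percentage is commonly read as a sign of a stressed or damaged cell.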

In [None]:
sc.pp.pca(adata,random_state=0)

In [None]:
# We have found that setting K to the square root of the total number of cells is effective
neighborhood_k = np.sqrt(adata.n_obs).astype(int)
sc.pp.neighbors(adata, n_neighbors=neighborhood_k, use_rep='X_pca', random_state=0)  # Build the KNN graph from distances in PCA space
sc.tl.leiden(adata, resolution=0.5, random_state=0)  # Resolution of 0.5; higher values (e.g. 2) yield more, finer clusters. This step may take a while.
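
To get a feel for the square-root heuristic used for `n_neighbors`, this small sketch prints the resulting K for a few hypothetical dataset sizes. `math.isqrt` floors the square root, which matches the truncation performed by `.astype(int)` above.

```python
import math

def neighborhood_size(n_cells):
    """Floor of sqrt(n_cells), matching the truncation done by .astype(int)."""
    return math.isqrt(n_cells)

for n in (500, 5_000, 50_000):
    print(n, neighborhood_size(n))
# 500 22
# 5000 70
# 50000 223
```

K grows slowly with dataset size, so a 100-fold increase in cells raises K only about 10-fold.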

In [None]:
# UMAP visualization
sc.tl.umap(adata,random_state=0)
sc.pl.umap(adata,color=['leiden'],legend_loc='on data',title='Leiden Clusters')

In [None]:
# Differential gene expression testing using the .raw values
sc.tl.rank_genes_groups(adata,groupby='leiden',use_raw=True,n_genes=200,method='wilcoxon')
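
The `method='wilcoxon'` option ranks genes with a Wilcoxon rank-sum (Mann-Whitney) test, comparing each cluster against all remaining cells. The sketch below computes only the U statistic for one gene on hypothetical expression values. Scanpy additionally converts rank sums into scores and p-values, which is not reproduced here, and the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def rank_sum_u(group, rest):
    """Mann-Whitney U statistic for `group` versus `rest`."""
    ranks = average_ranks(list(group) + list(rest))
    rank_sum = sum(ranks[: len(group)])
    return rank_sum - len(group) * (len(group) + 1) / 2

# Hypothetical expression of one gene inside a cluster and in all other cells
in_cluster = [5.0, 7.0, 9.0]
other_cells = [1.0, 2.0, 7.0, 3.0]
print(rank_sum_u(in_cluster, other_cells))
# 10.5
```

The maximum possible U here is 3 × 4 = 12, so a value of 10.5 indicates that the gene is expressed more highly in the cluster than in almost every other cell.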

In [None]:
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5, use_raw=False)

In [None]:
sc.pl.rank_genes_groups_heatmap(adata, n_genes=5, show_gene_labels=True, use_raw=False)