# Using CellTypist for cell type classification
This notebook demonstrates cell type classification of scRNA-seq query data: the most likely cell type labels are retrieved from either the built-in CellTypist models or user-trained custom models.

*This is my modified version, which also draws on information from the GitHub repository.*

Only the main steps and key parameters are introduced in this notebook. Refer to the detailed [Usage](https://github.com/Teichlab/celltypist#usage) section if you want to learn more.

## Install CellTypist

In [22]:
!pip install celltypist



In [2]:
import celltypist
from celltypist import models
import scanpy as sc
import pandas as pd

In [10]:
#from google.colab import drive
#drive.mount('/content/drive')


In [3]:
#import os
#import gzip

In [2]:
# Enabling `force_update = True` will overwrite existing (old) models.
models.download_models(force_update = True)

📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 44
📂 Storing models in /home/seriph/.celltypist/data/models
💾 Downloading model [1/44]: Immune_All_Low.pkl
💾 Downloading model [2/44]: Immune_All_High.pkl
💾 Downloading model [3/44]: Adult_CynomolgusMacaque_Hippocampus.pkl
💾 Downloading model [4/44]: Adult_Human_PancreaticIslet.pkl
💾 Downloading model [5/44]: Adult_Human_Skin.pkl
💾 Downloading model [6/44]: Adult_Mouse_Gut.pkl
💾 Downloading model [7/44]: Adult_Mouse_OlfactoryBulb.pkl
💾 Downloading model [8/44]: Adult_Pig_Hippocampus.pkl
💾 Downloading model [9/44]: Adult_RhesusMacaque_Hippocampus.pkl
💾 Downloading model [10/44]: Autopsy_COVID19_Lung.pkl
💾 Downloading model [11/44]: COVID19_HumanChallenge_Blood.pkl
💾 Downloading model [12/44]: COVID19_Immune_Landscape.pkl
💾 Downloading model [13/44]: Cells_Fetal_Lung.pkl
💾 Downloading model [14/44]: Cells_Intestinal_Tract.pkl
💾 Downloading model [15/44]: Cells_Lung_Airway.pk

In [4]:
models.models_path

'/home/seriph/.celltypist/data/models'

In [13]:
models.models_description()

👉 Detailed model information can be found at `https://www.celltypist.org/models`


| | model | description |
|---|---|---|
| 0 | Immune_All_Low.pkl | immune sub-populations combined from 20 tissue... |
| 1 | Immune_All_High.pkl | immune populations combined from 20 tissues of... |
| 2 | Adult_CynomolgusMacaque_Hippocampus.pkl | cell types from the hippocampus of adult cynom... |
| 3 | Adult_Human_PancreaticIslet.pkl | cell types from pancreatic islets of healthy a... |
| 4 | Adult_Human_Skin.pkl | cell types from human healthy adult skin |
| 5 | Adult_Mouse_Gut.pkl | cell types in the adult mouse gut combined fro... |
| 6 | Adult_Mouse_OlfactoryBulb.pkl | cell types from the olfactory bulb of adult mice |
| 7 | Adult_Pig_Hippocampus.pkl | cell types from the adult pig hippocampus |
| 8 | Adult_RhesusMacaque_Hippocampus.pkl | cell types from the hippocampus of adult rhesu... |
| 9 | Autopsy_COVID19_Lung.pkl | cell types from the lungs of 16 SARS-CoV-2 inf... |


In [35]:
# Select the description by column name instead of by position
models.models_description().loc[40, 'description']

👉 Detailed model information can be found at `https://www.celltypist.org/models`


'cell types from the adult mouse isocortex (neocortex) and hippocampal formation'

## cl525 from Loom dataset E15.0

In [19]:
input = '../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/RawDataCL525.csv'

In [49]:
clusters = pd.read_csv("../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/CellClusters.csv",index_col=0)

In [50]:
clusters

| | x |
|---|---|
| 10X74_4_A_1:CATAAAACCTGAACx | 2 |
| 10X73_3_A_1:CCATCGTGAGGAGCx | 2 |
| 10X73_3_A_1:GATCCCTGTTGACGx | 3 |
| 10X73_3_A_1:GCCGACGAAGTCACx | 2 |
| 10X74_4_A_1:GCATTGGAACACACx | 1 |
| ... | ... |
| 10X74_4_A_1:CTTGAACTACGTACx | 3 |
| 10X74_4_A_1:CAGACAACAGGTCTx | 2 |
| 10X74_4_A_1:CGGAATTGTGACACx | 3 |
| 10X73_3_A_1:TAGCGATGGGAACGx | 1 |


In [63]:
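# Convert the one-column DataFrame to a Series
cluster_labels = clusters.squeeze()
# Prefix the cell names with 'X' and replace ':' with '.' so that they
# match the cell names used in the expression matrix
cluster_labels.index = 'X' + cluster_labels.index.str.replace(':', '.')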
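
The renaming above mirrors the way R sanitises column names (as `make.names` does when a table is read with `check.names = TRUE`): characters that are not valid in an R name become `.`, and names that do not start with a letter get an `X` prefix. The sketch below is a simplified, hypothetical re-implementation of that rule (`r_style_name` is not part of pandas or CellTypist), and it assumes the expression matrix was exported from R. It is only meant to illustrate why both index transformations are needed.

```python
import re


def r_style_name(name: str) -> str:
    """Approximate R's make.names() for one cell barcode (simplified)."""
    # Any character other than letters, digits, '.' or '_' becomes '.'.
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # A valid R name starts with a letter, or with '.' not followed by a
    # digit; otherwise R prepends 'X'.
    if not re.match(r"[A-Za-z]|\.(?!\d)", cleaned):
        cleaned = "X" + cleaned
    return cleaned


# One barcode from each dataset style used in this notebook.
barcodes = ["10X74_4_A_1:CATAAAACCTGAACx", "AAACATACCACTAG-1"]
renamed = [r_style_name(b) for b in barcodes]
print(renamed)
# ['X10X74_4_A_1.CATAAAACCTGAACx', 'AAACATACCACTAG.1']
```

With pandas, the same function could be applied to a whole index with `cluster_labels.index.map(r_style_name)`.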


In [64]:
cluster_labels

Cells
X10X74_4_A_1.CATAAAACCTGAACx    2
X10X73_3_A_1.CCATCGTGAGGAGCx    2
X10X73_3_A_1.GATCCCTGTTGACGx    3
X10X73_3_A_1.GCCGACGAAGTCACx    2
X10X74_4_A_1.GCATTGGAACACACx    1
                               ..
X10X74_4_A_1.CTTGAACTACGTACx    3
X10X74_4_A_1.CAGACAACAGGTCTx    2
X10X74_4_A_1.CGGAATTGTGACACx    3
X10X73_3_A_1.TAGCGATGGGAACGx    1
X10X74_4_A_1.GCTACGCTATAAGGx    3
Name: Cl, Length: 826, dtype: int64

In [55]:
# Load your input data
input_data = pd.read_csv(input, index_col=0).transpose()

In [57]:
input_data

Unnamed: 0,Lamc1,Lama1,Hs3st1,Fabp3,Nrg2,Kdelr3,Bend4,Gjb4,Mogs,Lamb1,...,Smyd3,Gm4285,Gm38250,Zmym3,Fam3a,Gpr155,Scg5,Vps37a,Pcf11,Gpatch1
X10X74_4_A_1.CATAAAACCTGAACx,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
X10X73_3_A_1.CCATCGTGAGGAGCx,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
X10X73_3_A_1.GATCCCTGTTGACGx,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,1,1,2,0,0
X10X73_3_A_1.GCCGACGAAGTCACx,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
X10X74_4_A_1.GCATTGGAACACACx,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,3,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X10X74_4_A_1.CTTGAACTACGTACx,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,5,0,0,0
X10X74_4_A_1.CAGACAACAGGTCTx,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,5,0,2,0
X10X74_4_A_1.CGGAATTGTGACACx,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,5,0,0,0
X10X73_3_A_1.TAGCGATGGGAACGx,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,3,0,0,0


In [65]:
input_data.index.equals(cluster_labels.index)

True
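
`Index.equals` returns a single boolean, which is enough when it is `True` but not very informative otherwise. Below is a minimal sketch of a more detailed check on toy data with made-up cell names. It assumes that the labels passed to `over_clustering` are matched to cells by position, so that order matters; `compare_cell_indexes` is a helper written for this example only.

```python
import pandas as pd


def compare_cell_indexes(data_index, label_index):
    """Report how two collections of cell names differ."""
    data_index = pd.Index(data_index)
    label_index = pd.Index(label_index)
    return {
        # True only if both contain the same names in the same order.
        "identical": data_index.equals(label_index),
        "only_in_data": data_index.difference(label_index).tolist(),
        "only_in_labels": label_index.difference(data_index).tolist(),
    }


# Toy example: same cells, different order.
cells = pd.Index(["c1", "c2", "c3"])
labels = pd.Series([2, 1, 3], index=["c2", "c1", "c3"])

report = compare_cell_indexes(cells, labels.index)
print(report)

# If no cell is missing on either side, reorder the labels so that they
# follow the rows of the expression matrix.
if not report["only_in_data"] and not report["only_in_labels"]:
    labels = labels.reindex(cells)
print(labels.tolist())
```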

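Annotate the cells with the `Mouse_Isocortex_Hippocampus.pkl` model, passing the existing cluster labels to `over_clustering` so that majority voting is performed within those clusters.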

In [74]:
predictions = celltypist.annotate(input, model = 'Mouse_Isocortex_Hippocampus.pkl',over_clustering=cluster_labels, transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/', prefix = "Cl525_")

📁 Input file is '../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/RawDataCL525.csv'
⏳ Loading data
🔬 Input data has 826 cells and 13857 genes
🔗 Matching reference genes in the model
🧬 2790 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
🗳️ Majority voting the predictions
✅ Majority voting done!
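
As a rough illustration of the majority voting step reported in the log, the sketch below assigns every cell the most frequent predicted label within its cluster. It is a simplified stand-in written for this notebook, not CellTypist's implementation (which, for example, also offers a minimum-proportion setting), and the labels and cluster ids are invented toy values.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, clusters):
    """Replace each cell's label with the dominant label of its cluster."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, clusters):
        labels_by_cluster[cluster].append(label)

    # Most frequent label per cluster.
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in clusters]


# Toy per-cell predictions and the cluster each cell belongs to.
predicted = ["Neuron", "Neuron", "Glia", "Glia", "Glia", "Neuron"]
clusters = [1, 1, 1, 2, 2, 2]

voted = majority_vote(predicted, clusters)
print(voted)
# ['Neuron', 'Neuron', 'Neuron', 'Glia', 'Glia', 'Glia']
```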


In [77]:
predictions = celltypist.annotate(input, model = 'Mouse_Isocortex_Hippocampus.pkl', over_clustering = cluster_labels, transpose_input = True, majority_voting = True, mode = 'prob match', p_thres = 0.3)
# Note: the folder and prefix are the same as in the 'best match' run,
# so the tables written by that run are overwritten.
predictions.to_table(folder = '../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/', prefix = "Cl525_")

📁 Input file is '../Data/MouseCortexFromLoom/NotCTSingleClusters/cl525/RawDataCL525.csv'
⏳ Loading data
🔬 Input data has 826 cells and 13857 genes
🔗 Matching reference genes in the model
🧬 2790 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
🗳️ Majority voting the predictions
✅ Majority voting done!
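
The two `mode` values used for this cluster differ in how a label is derived from the per-class probabilities. As I understand the CellTypist documentation, `'best match'` keeps the single highest-scoring class, whereas `'prob match'` keeps every class whose probability exceeds `p_thres`, which can yield no label or several. The sketch below illustrates that decision rule on invented probabilities. The `'Unassigned'` placeholder and the `|` separator are assumptions about the output format and should be checked against the written tables.

```python
def best_match(probabilities):
    """Return the class with the highest probability."""
    return max(probabilities, key=probabilities.get)


def prob_match(probabilities, p_thres=0.5):
    """Return every class above the threshold, or a placeholder if none."""
    passed = [label for label, p in probabilities.items() if p > p_thres]
    # Assumed conventions: 'Unassigned' for no hit, '|' between several hits.
    return "|".join(passed) if passed else "Unassigned"


# Invented per-class probabilities for one cell. They need not sum to 1
# when each class is scored independently.
cell = {"TypeA": 0.45, "TypeB": 0.35, "TypeC": 0.05}

print(best_match(cell))               # TypeA
print(prob_match(cell, p_thres=0.3))  # TypeA|TypeB
print(prob_match(cell, p_thres=0.5))  # Unassigned
```

Lowering `p_thres` to 0.3, as in the cell above, therefore makes it easier for a cell to receive one or more labels than a stricter threshold would.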


## CD14 cleaned

In [1]:
input = '../Data/CD14Cleaned/CD14_Monocytes_cleaned.csv'

In [21]:
df = pd.read_csv(input,index_col=0)

In [29]:
df

Unnamed: 0,AAACATACCACTAG.1,AAACATACGTTCAG.1,AAACATTGACGGTT.1,AAACATTGCTTCGC.1,AAACATTGGGCAAG.1,AAACGGCTACGGAG.1,AAACGGCTAGTCAC.1,AAACGGCTCAGCTA.1,AAAGACGACCGTAA.1,AAAGACGACTCGCT.1,...,TTTCACGAGGCGAA.1,TTTCAGTGCCATAG.1,TTTCAGTGTCCTGC.1,TTTCAGTGTGTGCA.1,TTTCCAGATTGCGA.1,TTTCGAACACAGCT.1,TTTCGAACGCTAAC.1,TTTCGAACTCCTGC.1,TTTGACTGTGTAGC.1,TTTGCATGTCACCC.1
MIR1302-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
FAM138A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
OR4F5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
RP11-34P13.7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
RP11-34P13.8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AC145205.1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CU459201.1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AC002321.2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
cotan_clusters = pd.read_csv("../Data/CD14Cleaned/MergeCotanClusters.csv",index_col=0)

In [56]:
cotan_clusters

AAACATACCACTAG.1    1
AAACATACGTTCAG.1    1
AAACATTGACGGTT.1    1
AAACATTGCTTCGC.1    1
AAACATTGGGCAAG.1    2
                   ..
TTTCGAACACAGCT.1    1
TTTCGAACGCTAAC.1    1
TTTCGAACTCCTGC.1    1
TTTGACTGTGTAGC.1    1
TTTGCATGTCACCC.1    2
Name: x, Length: 2438, dtype: int64

In [54]:
cotan_clusters = cotan_clusters.squeeze()

In [55]:
# Replace '-' with '.' in the cell names so that they match the expression matrix
cotan_clusters.index = cotan_clusters.index.str.replace('-', '.')

In [32]:
df = df.T
len(df.index)

2438

In [59]:
df.index.equals(cotan_clusters.index)

True

In [58]:
predictions = celltypist.annotate(input, model = 'Immune_All_Low.pkl',over_clustering=cotan_clusters, transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/CD14Cleaned/', prefix = "CD14_Immune_Cleaned_Low_COTAN_cluster_")

📁 Input file is '../Data/CD14Cleaned/CD14_Monocytes_cleaned.csv'
⏳ Loading data
🔬 Input data has 2438 cells and 32738 genes
🔗 Matching reference genes in the model
🧬 5278 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
🗳️ Majority voting the predictions
✅ Majority voting done!
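
The log line "features used for prediction" reflects how many input genes could be matched to the model's genes: here 5278 of the 32738 input genes, roughly 16%. A small sketch of how such an overlap could be inspected before annotating is shown below. The gene lists are toy values and `feature_overlap` is a hypothetical helper. Matching is assumed to be exact and case-sensitive, so mouse-style symbols such as `Lamc1` would not match human-style `LAMC1`.

```python
def feature_overlap(input_genes, model_genes):
    """Count genes shared between the query data and a model's features."""
    input_set, model_set = set(input_genes), set(model_genes)
    shared = input_set & model_set
    return {
        "n_shared": len(shared),
        "fraction_of_input": len(shared) / len(input_set),
        "fraction_of_model": len(shared) / len(model_set),
    }


# Toy gene lists; a real check would use the matrix columns and the
# gene list of the chosen model.
query_genes = ["Lamc1", "Scg5", "Gm4285", "Pcf11"]
model_features = ["Scg5", "Lamc1", "Sox2"]

overlap = feature_overlap(query_genes, model_features)
print(overlap["n_shared"])           # 2
print(overlap["fraction_of_input"])  # 0.5
```

A very low overlap can indicate a species or naming mismatch between the query data and the model.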


## Mouse Brain La Manno - Loom file E13.5

In [5]:
input = '../Data/MouseCortexFromLoom/e13.5_ForebrainDorsal_cleaned.csv'

In [6]:
predictions = celltypist.annotate(input, model = 'Developing_Mouse_Brain.pkl', transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/MouseCortexFromLoom/', prefix = "E135_Devel_Mouse_Brain_")

📁 Input file is '../Data/MouseCortexFromLoom/e13.5_ForebrainDorsal_cleaned.csv'
⏳ Loading data
🔬 Input data has 4981 cells and 14282 genes
🔗 Matching reference genes in the model
🧬 5981 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!
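
When no `over_clustering` is supplied, the over-clustering resolution is chosen from the number of cells. The logs in this notebook show a resolution of 5 for 4981 and 2467 cells and 10 for 8562 cells. The sketch below encodes the size thresholds as I recall them from the CellTypist source. Only the first two branches are confirmed by these logs, so the remaining values are an assumption to verify against the installed version. `default_resolution` is a name made up for this example.

```python
def default_resolution(n_cells: int) -> int:
    """Pick an over-clustering resolution from the dataset size.

    Thresholds above 20000 cells are recalled from memory and are not
    confirmed by the logs in this notebook.
    """
    thresholds = [
        (5_000, 5),
        (20_000, 10),
        (40_000, 15),    # assumption
        (100_000, 20),   # assumption
        (200_000, 25),   # assumption
    ]
    for upper_bound, resolution in thresholds:
        if n_cells < upper_bound:
            return resolution
    return 30            # assumption


# Cell counts of the three Loom datasets annotated in this notebook.
for n in (4981, 8562, 2467):
    print(n, default_resolution(n))
```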


## Mouse Brain La Manno - Loom file E15.0

In [6]:
input = '../Data/MouseCortexFromLoom/e15.0_ForebrainDorsal_cleaned.csv'

In [7]:
predictions = celltypist.annotate(input, model = 'Developing_Mouse_Brain.pkl', transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/MouseCortexFromLoom/', prefix = "E150_Devel_Mouse_Brain_")

📁 Input file is '../Data/MouseCortexFromLoom/e15.0_ForebrainDorsal_cleaned.csv'
⏳ Loading data
🔬 Input data has 8562 cells and 14120 genes
🔗 Matching reference genes in the model
🧬 5902 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 10
🗳️ Majority voting the predictions
✅ Majority voting done!


## Mouse Brain La Manno - Loom file E17.5

In [8]:
input = '../Data/MouseCortexFromLoom/e17.5_ForebrainDorsal_cleaned.csv'

In [9]:
predictions = celltypist.annotate(input, model = 'Developing_Mouse_Brain.pkl', transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/MouseCortexFromLoom/', prefix = "E175_Devel_Mouse_Brain_")

📁 Input file is '../Data/MouseCortexFromLoom/e17.5_ForebrainDorsal_cleaned.csv'
⏳ Loading data
🔬 Input data has 2467 cells and 14227 genes
🔗 Matching reference genes in the model
🧬 5949 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!


## Cortical cells DGE E13.5 (mouse)

In [10]:
input = '../Data/Yuzwa_MouseCortex/CorticalCells_GSM2861511_E135_cleaned.csv'

In [11]:
predictions = celltypist.annotate(input, model = 'Developing_Mouse_Brain.pkl', transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/Yuzwa_MouseCortex/', prefix = "E13_5_Devel_Mouse_Brain_")

📁 Input file is '../Data/Yuzwa_MouseCortex/CorticalCells_GSM2861511_E135_cleaned.csv'
⏳ Loading data
🔬 Input data has 1112 cells and 17082 genes
🔗 Matching reference genes in the model
🧬 6136 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!


## Cortical cells DGE E17.5 (mouse)

In [12]:
input = '../Data/Yuzwa_MouseCortex/CorticalCells_GSM2861514_E175_cleaned.csv'

In [13]:
predictions = celltypist.annotate(input, model = 'Developing_Mouse_Brain.pkl', transpose_input = True, majority_voting = True,mode = 'best match')
predictions.to_table(folder = '../Data/Yuzwa_MouseCortex/', prefix = "E17_5_Devel_Mouse_Brain_")

📁 Input file is '../Data/Yuzwa_MouseCortex/CorticalCells_GSM2861514_E175_cleaned.csv'
⏳ Loading data
🔬 Input data has 874 cells and 17085 genes
🔗 Matching reference genes in the model
🧬 6158 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!
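
The five annotation cells that use `Developing_Mouse_Brain.pkl` differ only in the input file, output folder and prefix. Below is a minimal sketch of how they could be driven from one list. `JOBS` and `run_jobs` are hypothetical names, and the annotation function is passed in as an argument so that the loop can be tried with a stand-in before calling `celltypist.annotate` for real.

```python
LOOM_DIR = "../Data/MouseCortexFromLoom/"
YUZWA_DIR = "../Data/Yuzwa_MouseCortex/"

# (input file, output folder, output prefix), copied from the cells above.
JOBS = [
    (LOOM_DIR + "e13.5_ForebrainDorsal_cleaned.csv", LOOM_DIR, "E135_Devel_Mouse_Brain_"),
    (LOOM_DIR + "e15.0_ForebrainDorsal_cleaned.csv", LOOM_DIR, "E150_Devel_Mouse_Brain_"),
    (LOOM_DIR + "e17.5_ForebrainDorsal_cleaned.csv", LOOM_DIR, "E175_Devel_Mouse_Brain_"),
    (YUZWA_DIR + "CorticalCells_GSM2861511_E135_cleaned.csv", YUZWA_DIR, "E13_5_Devel_Mouse_Brain_"),
    (YUZWA_DIR + "CorticalCells_GSM2861514_E175_cleaned.csv", YUZWA_DIR, "E17_5_Devel_Mouse_Brain_"),
]


def run_jobs(jobs, annotate_fn, model="Developing_Mouse_Brain.pkl"):
    """Annotate each input file and write its tables; return the prefixes used."""
    finished = []
    for input_file, folder, prefix in jobs:
        predictions = annotate_fn(
            input_file,
            model=model,
            transpose_input=True,   # the CSV files are gene-by-cell
            majority_voting=True,
            mode="best match",
        )
        predictions.to_table(folder=folder, prefix=prefix)
        finished.append(prefix)
    return finished


# Real run (requires celltypist and the data files):
# import celltypist
# run_jobs(JOBS, celltypist.annotate)
```

Keeping the prefixes in one place also makes it easier to spot two runs that would write to the same files.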


----------------------------------------------------------