# Tutorial

In this tutorial, we use a PBMC dataset profiled with CITE-seq as an example to illustrate the main functions of GFPA.

In [1]:
import scanpy as sc
import gfpa
import matplotlib.pyplot as plt
import numpy as np
from skbio.stats.composition import clr

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Load a PBMC dataset.

In [3]:
pbmc = sc.read_h5ad("/home/hh/bigdata/hh/GFPA/test_pbmc.h5ad")

In [4]:
pbmc = gfpa.tl.separate_out_protein_expression(pbmc, protein_prefix="AB_", set_protein_obsm_name="protein_expression", layers="raw")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11864/11864 [00:00<00:00, 712921.71it/s]
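To make the separation step concrete, here is a minimal NumPy sketch of splitting a count matrix into RNA and protein features by name prefix. It is not gfpa's implementation; the toy feature names and counts are invented for illustration, and only the "AB_" prefix comes from the call above.

```python
import numpy as np

# Hypothetical toy data: 3 cells x 4 features, two of them antibody-derived.
feature_names = np.array(["CD3E", "AB_CD3", "MS4A1", "AB_CD19"])
counts = np.array([
    [5, 120, 0, 3],
    [0, 4, 7, 95],
    [2, 80, 1, 6],
])

# Boolean mask over columns: True where the feature name carries the protein prefix.
is_protein = np.char.startswith(feature_names, "AB_")

rna_counts = counts[:, ~is_protein]      # columns that would remain as RNA features
protein_counts = counts[:, is_protein]   # columns that would be stored separately

print(feature_names[~is_protein], rna_counts.shape)
print(feature_names[is_protein], protein_counts.shape)
```

In the tutorial itself, the protein columns end up under the protein_expression key given by set_protein_obsm_name.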


Extract the CD4 cells from the PBMC dataset. This step is optional: gfpa can also analyze the entire PBMC dataset.

In [5]:
pbmc_cd4 = gfpa.tl.celldata(pbmc, obs="initial_clustering", sp_obs="CD4")

🍪 729 cells are extracted.


gfpa normalizes the RNA data itself, so pass raw counts. Please do not input RNA data that has already been normalized.

In [6]:
gfpa.pp.rna_normalization(pbmc_cd4, layers="raw")

🫘 11735 types of genes are normalized.
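Because the input must be raw counts, a quick sanity check before normalization can help. The sketch below is an assumption-laden illustration: looks_like_raw_counts is a hypothetical helper, and the library-size scaling followed by log1p is a common single-cell convention, not necessarily what gfpa.pp.rna_normalization does internally.

```python
import numpy as np

def looks_like_raw_counts(matrix):
    """Return True if every value is a non-negative whole number."""
    matrix = np.asarray(matrix, dtype=float)
    return bool(np.all(matrix >= 0) and np.all(matrix == np.round(matrix)))

# Hypothetical raw counts: 2 cells x 3 genes.
raw = np.array([[10.0, 0.0, 90.0],
                [5.0, 5.0, 10.0]])

# Assumed normalization: scale each cell to 10,000 total counts, then log1p.
library_size = raw.sum(axis=1, keepdims=True)
scaled = raw / library_size * 1e4
log_normalized = np.log1p(scaled)

print(looks_like_raw_counts(raw))             # raw counts pass the check
print(looks_like_raw_counts(log_normalized))  # already-normalized values do not
```

Running a check like this on the layer you pass (here, "raw") is a cheap way to avoid normalizing the data twice.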


gfpa also normalizes the protein data itself. Please do not input protein data that has already been normalized. Note: you need to specify where the protein data is stored (here, the protein_expression entry in obsm).

In [7]:
gfpa.pp.protein_normalization(pbmc_cd4, protein_obsm="protein_expression")

🥚 129 types of proteins are normalized.
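The first cell imports clr from skbio, which suggests that a centered log-ratio (CLR) transform is relevant for the protein counts, although the cells shown here do not confirm what gfpa.pp.protein_normalization does internally. As a hedged illustration only, a per-cell CLR with a pseudocount can be sketched in NumPy as follows; clr_per_cell and the toy counts are hypothetical.

```python
import numpy as np

def clr_per_cell(counts, pseudocount=1.0):
    """Centered log-ratio per cell: log counts minus the cell's mean log count."""
    log_counts = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return log_counts - log_counts.mean(axis=1, keepdims=True)

# Hypothetical antibody counts: 2 cells x 3 proteins.
protein_counts = np.array([[120, 3, 40],
                           [4, 95, 10]])

protein_clr = clr_per_cell(protein_counts)
print(protein_clr.round(3))
print(protein_clr.sum(axis=1).round(6))  # each row of a CLR transform sums to zero
```

The pseudocount is there because antibody counts can be zero and the logarithm of zero is undefined.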


gfpa requires a *.gmt file, which it converts into a gene function data set. You can download *.gmt files from MSigDB at the following address: https://www.gsea-msigdb.org/gsea/msigdb/index.jsp

In [8]:
gfpa.tl.geneset(pbmc_cd4, gmt_filepath="h.all.v2022.1.Hs.symbols.gmt")

🍽 Reading functional gene collection.


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 60804.64it/s]

🥗 50 functional gene collections are extracted.
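For reference, a *.gmt file is a tab-separated text file in which each line holds a gene set name, a description field, and then the member genes. The sketch below parses that layout with the standard library; read_gmt and the toy content are hypothetical and are not part of gfpa.

```python
import io

def read_gmt(handle):
    """Parse GMT lines (name, description, gene1, gene2, ...) into a dict."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets

# Hypothetical two-line GMT content standing in for a downloaded file.
toy_gmt = io.StringIO(
    "TOY_SET_A\thttp://example.org/a\tCD3E\tCD4\tIL7R\n"
    "TOY_SET_B\tna\tMS4A1\tCD79A\n"
)

gene_sets = read_gmt(toy_gmt)
for name, genes in gene_sets.items():
    print(name, len(genes), genes)
```

The hallmark file used above contains 50 such lines, which matches the 50 collections reported by gfpa.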





The time needed to calculate the GFPA scores depends on the size of the cell subset and of the gene function data set.

In [None]:
gfpa.tl.scores(pbmc_cd4)

🙂 GFPA scores are being calculated.


 52%|████████████████████████████████████████████████████████████████████████▎                                                                  | 26/50 [00:04<00:04,  5.92it/s]

The GFPA scores can be saved as a *.csv file.

In [None]:
gfpa.tl.to_csv(pbmc_cd4, filepath="pbmc_cd4_gfpa.csv")

You can use the following function to quickly preview the top GFPA scores.

In [None]:
gfpa.tb.top(pbmc_cd4, n=10)

You can use the following function to quickly view the correlation between a gene set and a protein.

In [None]:
geneset_name = "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"
protein_name = "AB_CD99"
gfpa.pl.scatter(pbmc_cd4, geneset_name, protein_name, scatter_kws={'color': '#71AD47'}, line_kws={'color': 'black'})
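To show what such a scatter plot with a fitted line summarizes, here is a small NumPy sketch that computes a Pearson correlation and a least-squares line on simulated values. The data are synthetic, the variable names are hypothetical, and this is not how gfpa computes its scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-cell values: a gene set score and a correlated protein level.
geneset_score = rng.normal(size=200)
protein_level = 0.8 * geneset_score + rng.normal(scale=0.5, size=200)

# Pearson correlation between the two per-cell vectors.
pearson_r = np.corrcoef(geneset_score, protein_level)[0, 1]

# Least-squares straight line, the kind of trend line drawn over a scatter plot.
slope, intercept = np.polyfit(geneset_score, protein_level, deg=1)

print(f"r = {pearson_r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```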

You can also view the relationship between each gene in the gene set and the protein. The higher the weight, the stronger the relationship between that gene and the protein.

In [None]:
plt.figure(figsize = (18,6))
plt.rcParams.update({'font.size': 10})

geneset_name = "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"
protein_name = "AB_CD99"
gfpa.pl.weight(pbmc_cd4, geneset_name, protein_name, model="rf", color="#C55A12")

In [None]:
gfpa.tb.weight(pbmc_cd4, geneset_name, protein_name, model="rf")
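The model="rf" argument suggests that the weights come from a random forest, although the cells above do not show how gfpa derives them. Under that assumption, the idea of per-gene weights can be sketched with scikit-learn feature importances on synthetic data; the gene names and values below are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic expression for 4 hypothetical genes across 300 cells.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression = rng.normal(size=(300, 4))

# Simulated protein level driven mostly by GENE_A and weakly by GENE_B.
protein = (3.0 * expression[:, 0]
           + 0.5 * expression[:, 1]
           + rng.normal(scale=0.3, size=300))

# Fit a random forest that predicts the protein level from gene expression.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(expression, protein)

# Impurity-based importances act as per-gene weights and sum to 1.
weights = dict(zip(gene_names, forest.feature_importances_))
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
for gene, weight in ranked:
    print(f"{gene}: {weight:.3f}")
```

In this toy setup the gene that drives the simulated protein level receives the largest weight, which is the reading suggested for the gfpa weight plot and table.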

In [None]:
pbmc_cd4.uns["gfpa"].to_csv("temp.csv")

In [None]:
pbmc_cd4.obsm