# What this notebook is about


We apply the scanpy pipeline described in the PBMC3k tutorial (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)
to a glioblastoma dataset.

**Context** The dataset contains gene expression data from brain cancer (glioblastoma) cells.
Scanpy ("Single-Cell Analysis in Python") is a Python package for analyzing such data. It is the Python counterpart of the R package Seurat.



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [1]:
pip install scanpy

In [1]:
import scanpy as sc

In [1]:
path = '/kaggle/input/human-glioblastoma-dataset/' # matrix.mtx
adata = sc.read_10x_mtx(path)
print(type(adata))
df = adata.to_df()
if len(np.unique(df.columns)) != df.shape[1]:
    print('Non unique column (gene) names')
df

In [1]:
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

# Preprocessing

In [1]:
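# Total counts per gene (column sums); adata.X may be a sparse matrix
gene_totals = np.asarray(adata.X.sum(axis=0)).ravel()
print((gene_totals == 0).sum(), 'genes with zero expression in all cells')
# Share of non-zero entries in the cells x genes matrix
nonzero_pct = (adata.X != 0).sum() / (adata.X.shape[0] * adata.X.shape[1]) * 100
print(np.round(nonzero_pct, 2), '% of non-zero entries in the data')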

In [1]:
# Show those genes that yield the highest fraction of counts in each single cells, across all cells.
sc.pl.highest_expr_genes(adata, n_top=20 )

In [1]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

Let us assemble some information about mitochondrial genes, which are important for quality control.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):

High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
With `sc.pp.calculate_qc_metrics`, we can compute many metrics very efficiently.

In [1]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
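
To make the mitochondrial metric concrete, here is a minimal NumPy sketch of how a per-cell percentage such as `pct_counts_mt` can be derived from a count matrix. The gene names and counts are invented for illustration, and this is not scanpy's internal implementation, only the arithmetic the metric stands for.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); all values are invented.
gene_names = np.array(["MT-CO1", "MT-ND1", "GAPDH", "ACTB"])
counts = np.array([
    [10, 10, 40, 40],
    [1, 1, 49, 49],
    [30, 30, 20, 20],
])

# Same naming rule as in the cell above: mitochondrial genes start with "MT-".
is_mt = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)         # all counts per cell
mt_counts = counts[:, is_mt].sum(axis=1)  # mitochondrial counts per cell
pct_counts_mt = 100.0 * mt_counts / total_counts

print(pct_counts_mt)
```

In this toy example the third cell has most of its counts in mitochondrial genes, which is the pattern the quoted passage associates with poor-quality cells.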

In [1]:
adata.var['mt'].sum()

In [1]:
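# List every gene name starting with 'MT' (case-insensitive) to check how
# mitochondrial genes are named; names like MTOR would also match this prefix.
for t in df.columns:
    if t.upper().startswith('MT'):
        print(t)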

A violin plot of some of the computed quality measures:

* the number of genes expressed in the count matrix
* the total counts per cell
* the percentage of counts in mitochondrial genes

In [1]:
# sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)

Total-count normalize (library-size correct) the data matrix X to 10,000 reads per cell, so that counts become comparable among cells.

In [1]:
sc.pp.normalize_total(adata, target_sum=1e4)

In [1]:
sc.pp.log1p(adata)
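
As a rough illustration of what the normalization and log-transform steps do to the numbers, the sketch below repeats them with plain NumPy on an invented two-cell matrix. It assumes the simplest behaviour (scale every cell to the same total, then apply `log(1 + x)`) and ignores any further options the scanpy functions offer.

```python
import numpy as np

# Two invented cells with the same composition but different sequencing depth.
counts = np.array([
    [1.0, 3.0, 6.0],     # 10 counts in total
    [10.0, 30.0, 60.0],  # 100 counts in total
])
target_sum = 1e4

# Library-size correction: rescale every cell so its counts sum to target_sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# Log transform with a pseudocount of 1, so zeros stay at zero.
logged = np.log1p(normalized)

print(normalized)
print(logged)
```

After rescaling, both cells have identical profiles, which is the sense in which counts "become comparable among cells".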

In [1]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)


In [1]:
sc.pl.highly_variable_genes(adata)


In [1]:
adata.raw = adata

In [1]:
# Keep only the highly variable genes; .copy() turns the view into an independent object
adata = adata[:, adata.var.highly_variable].copy()


In [1]:
# sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])

In [1]:
sc.pp.scale(adata, max_value=10)


In [1]:
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca(adata)#, color='CST3')


In [1]:
sc.pl.pca_variance_ratio(adata, log=True)


In [1]:
#adata.write(results_file)
adata

# Computing the neighborhood graph

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of reproducing Seurat’s results, let’s take the following values.

In [1]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
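
The idea behind a neighborhood graph on the PCA representation can be sketched with a brute-force k-nearest-neighbor search. The coordinates below are random stand-ins for PCA scores, and `n_neighbors = 2` is chosen only to keep the toy example small. The scanpy call above additionally converts distances into connectivity weights, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical PCA coordinates: 6 cells x 2 components, two well-separated groups.
pcs = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(3, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(3, 2)),
])
n_neighbors = 2

# Pairwise Euclidean distances between all cells.
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # a cell is not counted as its own neighbor

# Indices of the n_neighbors closest cells for every cell.
knn_idx = np.argsort(dist, axis=1)[:, :n_neighbors]
print(knn_idx)
```

Each cell ends up connected only to cells from its own group, which is the structure that the later embedding and clustering steps build on.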

# Embedding the neighborhood graph

We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running the PAGA-based initialization (commented out in the next cell):

In [1]:
#sc.tl.paga(adata)
#sc.pl.paga(adata, plot=False)  # remove `plot=False` if you want to see the coarse-grained graph
#sc.tl.umap(adata, init_pos='paga')

sc.tl.umap(adata)


In [1]:
sc.pl.umap(adata) # , color=[ 'CST3', 'NKG7', 'PPBP']  


# Clustering the neighborhood graph

Like Seurat and many others, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [1]:
!pip3 install leidenalg

In [1]:
sc.tl.leiden(adata)
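
To give a feel for the quantity that modularity-based community detection tries to maximize, the sketch below evaluates Newman's modularity for two partitions of an invented six-node graph. It only scores given partitions. It does not implement the Leiden search itself, and the helper name `modularity` is introduced here for illustration.

```python
import numpy as np

# Toy undirected graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n_nodes = 6
A = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0


def modularity(adjacency, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    labels = np.asarray(labels)
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_community = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m  # edges expected by chance
    return ((adjacency - expected) * same_community).sum() / two_m


q_triangles = modularity(A, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_mixed = modularity(A, [0, 1, 0, 1, 0, 1])      # communities cut across triangles
print(q_triangles, q_mixed)
```

The partition that follows the two triangles scores higher than the mixed one, so an optimizer of this objective would prefer it.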


In [1]:
sc.pl.umap(adata, color = ['leiden',])# , color=[ 'CST3', 'NKG7'])

# Finding marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.



In [1]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
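
As a hedged illustration of the kind of score a t-test ranking produces, the sketch below computes a Welch-type t statistic for a single gene, comparing cells inside one cluster against all remaining cells. The expression values are simulated, and using the unequal-variance (Welch) form is an assumption about what `method='t-test'` resembles, not a statement about scanpy's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated log-normalized expression of one gene (values are invented).
in_cluster = rng.normal(loc=2.0, scale=0.5, size=50)  # cells in the cluster
rest = rng.normal(loc=0.5, scale=0.5, size=200)       # all other cells


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)


score = welch_t(in_cluster, rest)
print(score)
```

A large positive score means the gene is expressed more strongly inside the cluster than outside it. Ranking all genes by such a score yields a marker list per cluster.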

The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Soneson & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for Python, the recent diffxpy.

In [1]:
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
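
For comparison with the t-test, the rank-based idea behind the Wilcoxon rank-sum (Mann-Whitney U) test can be sketched as follows. The helper `rank_sum_u` is hypothetical, the values are invented, and ties are not handled, so this is only the core statistic rather than the full test with p-values.

```python
import numpy as np


def rank_sum_u(a, b):
    """Mann-Whitney U statistic of sample `a` versus `b` (assumes no ties)."""
    combined = np.concatenate([a, b])
    # Assign ranks 1..N to the pooled values.
    ranks = np.empty(combined.size)
    ranks[np.argsort(combined)] = np.arange(1, combined.size + 1)
    rank_sum_a = ranks[: a.size].sum()
    # Subtract the smallest rank sum that `a` could possibly have.
    return rank_sum_a - a.size * (a.size + 1) / 2


a = np.array([5.1, 6.2, 7.3])       # expression in the cluster (invented)
b = np.array([1.0, 2.5, 3.7, 4.4])  # expression in the other cells (invented)
u = rank_sum_u(a, b)
print(u)
```

Because only ranks enter the statistic, a few extreme expression values influence it less than they would influence a t statistic.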

In [1]:
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

In [1]:
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)


In [1]:
if 0:
    result = adata.uns['rank_genes_groups']
    groups = result['names'].dtype.names
    pd.DataFrame(
        {group + '_' + key[:1]: result[key][group]
        for group in groups for key in ['names', 'pvals']}).head(5)

In [1]:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)

In [1]:
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)


