# Gene Expression Analysis

## Differential Expression Analysis (DEA) using PyDESeq2

The PyDESeq2 package is a Python implementation of the [DESeq2 R package](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) for differential expression analysis (DEA) of bulk RNA-seq data. It aims to make DEA accessible to Python users.

As PyDESeq2 is a re-implementation of DESeq2 from scratch, you may see some differences from DESeq2 in the returned values or in the available features.

### Installation

In [None]:
%pip install pydeseq2

In [5]:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

### Data loading with pandas

To perform differential expression analysis (DEA), PyDESeq2 requires two types of inputs:

* A count matrix of shape ‘number of samples’ x ‘number of genes’, containing [read counts](http://talipzengin.github.io/CENG4525/2023/count_table.csv) (non-negative integers),
* Metadata (or annotations, or “column” data) of shape ‘number of samples’ x ‘number of variables’, containing sample annotations that will be used to split the data into cohorts.

Both should be provided as pandas dataframes.
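As a minimal sketch of the two inputs, the toy tables below use made-up sample and gene names (`sample1`, `geneA` and so on are placeholders, not taken from the dataset used here). They only illustrate the expected shapes and the shared sample index.

```python
import pandas as pd

# Hypothetical toy data: 4 samples x 3 genes of non-negative integer counts.
toy_counts = pd.DataFrame(
    [[10, 0, 250], [12, 3, 240], [55, 1, 30], [60, 0, 25]],
    index=["sample1", "sample2", "sample3", "sample4"],
    columns=["geneA", "geneB", "geneC"],
)

# Matching metadata: one row per sample, one column per annotation variable.
toy_metadata = pd.DataFrame(
    {"Condition": ["C", "C", "RS", "RS"]},
    index=toy_counts.index,
)

print(toy_counts.shape)    # (number of samples, number of genes)
print(toy_metadata.shape)  # (number of samples, number of variables)
```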

In [None]:
import pandas as pd
counts = pd.read_csv('count_table.csv')
counts

Set the DataFrame index (row labels) to the 'Geneid' column, so that each row is labelled by its gene ID. By default, the new index replaces the existing one.

In [None]:
counts = counts.set_index('Geneid')
counts

### Data filtering

#### Low count filtering
We filter out genes that have fewer than 10 read counts in total. Note that there are no such genes in this synthetic dataset.

In [None]:
counts = counts[counts.sum(axis = 1) >= 10]
counts

Note that the counts data is in a ‘number of genes’ x ‘number of samples’ format, whereas ‘number of samples’ x ‘number of genes’ is required. To fix this issue, we transpose the counts dataframe.

In [None]:
counts = counts.T
counts
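To make the two steps above concrete, here is a small sketch on a made-up genes x samples table (the names `geneA`, `s1` and so on are hypothetical). It applies the same total-count threshold and then transposes the result.

```python
import pandas as pd

# Hypothetical genes x samples table, as it would look after set_index.
toy = pd.DataFrame(
    {"s1": [5, 0, 100], "s2": [4, 1, 90], "s3": [6, 2, 10]},
    index=["geneA", "geneB", "geneC"],
)

# Keep genes whose total count across all samples is at least 10.
kept = toy[toy.sum(axis=1) >= 10]

# Transpose so that rows are samples and columns are genes.
kept_t = kept.T
print(kept_t)
```

Here `geneB` has only 3 reads in total and is dropped, leaving a 3 samples x 2 genes table.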

In this example, the metadata contains two columns, Sample and Condition. Sample holds the sample names and is used as the index; Condition is a two-level annotation (C or RS). Here, we will only use the Condition factor.

In [18]:
metadata = pd.DataFrame(zip(counts.index, ['C','C','C','C', 'RS', 'RS', 'RS', 'RS']),
                        columns = ['Sample', 'Condition'])

In [None]:
metadata = metadata.set_index('Sample')
metadata
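Because the metadata is built by hand, it can be worth checking that it describes the same samples, in the same order, as the count matrix. The helper below is a hypothetical sketch (`check_alignment` is not part of PyDESeq2), shown on toy tables.

```python
import pandas as pd


def check_alignment(counts: pd.DataFrame, metadata: pd.DataFrame) -> None:
    """Raise ValueError unless both tables have the same sample labels in the same order."""
    if not counts.index.equals(metadata.index):
        raise ValueError("counts and metadata must share the same sample index")


# Toy tables with made-up names, used only to exercise the helper.
toy_counts = pd.DataFrame({"geneA": [1, 2], "geneB": [3, 4]}, index=["s1", "s2"])
toy_metadata = pd.DataFrame({"Condition": ["C", "RS"]}, index=["s1", "s2"])

check_alignment(toy_counts, toy_metadata)  # passes silently when the indexes match
```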

### Single factor analysis
As in the PyDESeq2 getting started example, we will use the Condition column as our design factor.

#### Read counts modeling with the *DeseqDataSet* class
We start by creating a *DeseqDataSet* object from the counts and metadata prepared above.

In [22]:
dds = DeseqDataSet(
    counts=counts,
    metadata=metadata,
    design_factors="Condition",
    refit_cooks=True
)

# For several factors, pass a list: design_factors=["gender", "condition"] corresponds to the design formula ~ gender + condition

Once a *DeseqDataSet* has been initialized, we can run the *deseq2()* method to fit dispersions and log fold changes (LFCs).

In [None]:
dds.deseq2()

The *DeseqDataSet* class extends the *AnnData* class.

In [None]:
print(dds)

### Statistical analysis with the DeseqStats class

Now that dispersions and LFCs have been fitted, we can proceed with statistical tests to compute p-values and adjusted p-values for differential expression. This is the role of the *DeseqStats* class.

In [27]:
stat_res = DeseqStats(dds, contrast = ('Condition','RS','C'))

*PyDESeq2* computes p-values using *Wald tests*. This is done with the *summary()* method, which runs the whole statistical analysis, including Cook's distance filtering and multiple testing adjustment.

In [None]:
stat_res.summary()

The results are then stored in the results_df attribute (stat_res.results_df).

In [None]:
res = stat_res.results_df
res
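The `padj` column holds p-values adjusted for multiple testing. For intuition, here is a minimal pure-Python sketch of the Benjamini-Hochberg procedure, the default adjustment in DESeq2. It is an illustration only, not PyDESeq2's own code, and `benjamini_hochberg` is a hypothetical helper name; the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


toy_padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(toy_padj)
```

Each adjusted value is at least as large as the raw p-value, which is why fewer genes pass a threshold on `padj` than on `pvalue`.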

It is often more convenient to have the results as a CSV. Hence, we may export stat_res.results_df as CSV, using pandas.DataFrame.to_csv().

In [30]:
stat_res.results_df.to_csv("results.csv")

## Gene Annotation

We have identified the differentially expressed genes, but so far they are labelled only with Ensembl gene IDs. Gene names are more readable and memorable, so we need to convert the IDs. Several R and Python packages can convert Ensembl gene IDs to HGNC symbols (gene names) and/or Entrez gene IDs; here we use sanbomics.

In [None]:
%pip install sanbomics

In [34]:
from sanbomics.tools import id_map
gene_mapper = id_map(species = 'human')

In [None]:
res['Symbol'] = res.index.map(gene_mapper.mapper)
res

**Let's filter for significantly differentially expressed genes (adjusted p-value below the significance threshold: padj < 0.05) whose expression changes more than two-fold in either direction (|log2FoldChange| > 1).**

In [None]:
sig_genes = res[(res.padj < 0.05) & (abs(res.log2FoldChange) > 1)]
sig_genes
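The threshold of 1 on the absolute log2 fold change corresponds to a two-fold change because fold change equals 2 raised to the log2 fold change. The short sketch below illustrates this; `fold_change` is a hypothetical helper used only here.

```python
import math


def fold_change(log2_fc: float) -> float:
    """Convert a log2 fold change back to a plain fold change."""
    return 2.0 ** log2_fc


print(fold_change(1.0))   # 2.0: expression doubled
print(fold_change(-1.0))  # 0.5: expression halved
print(math.log2(4.0))     # 2.0: a four-fold increase is a log2 fold change of 2
```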

## PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is a widely used dimensionality reduction technique for datasets with many features per observation. It applies a linear transformation that maps the data onto a new coordinate system whose axes, the principal components, are ordered by how much of the data's variance they capture, so a small number of components retains most of the variation in the original dataset. This makes high-dimensional data easier to interpret and visualize. The first two principal components are commonly plotted against each other to reveal clusters of closely related samples.
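As a rough illustration of the idea, PCA can be sketched with a singular value decomposition of the centered data matrix. The random matrix below stands in for a samples x genes table; it is not the dataset used here, and this is not how scanpy is invoked later.

```python
import numpy as np

# Hypothetical data: 8 samples x 50 features, with the last 4 samples shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))
X[4:] += 2.0

# Center each feature, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coords = U * S                    # sample coordinates on the principal components
explained = S**2 / np.sum(S**2)   # fraction of variance captured by each component
pc12 = coords[:, :2]              # first two components, as used for a 2D plot

print(pc12.shape)
print(explained[:2])
```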

In [None]:
%pip install scanpy

In [40]:
import scanpy as scp

In [None]:
dds

First, let's compute the PCA coordinates, loadings and variance decomposition. The command below uses the implementation from the *scikit-learn* machine learning package.

In [45]:
scp.tl.pca(dds)

Let's draw our PCA plot:

In [None]:
scp.pl.pca(dds, color = 'Condition', size = 300)

## GSEA (Gene Set Enrichment Analysis)

In [None]:
%pip install gseapy

In [53]:
import gseapy as gp
from gseapy.plot import gseaplot

In [None]:
res

In [None]:
ranking = res[['Symbol', 'padj']].dropna().sort_values('padj', ascending = True)
ranking

In [57]:
ranking = ranking.drop_duplicates('Symbol')

In [None]:
ranking
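The ranking above uses `padj`, which says how significant a gene is but not in which direction it changes. Preranked GSEA is often run on a signed metric instead, so that up- and down-regulated genes fall at opposite ends of the list. As a hedged alternative, the sketch below ranks a toy table by a `stat` column (assumed here to correspond to the Wald statistic column of `results_df`); the gene names and values are made up.

```python
import pandas as pd

# Hypothetical results table; two Ensembl-style IDs map to the same symbol.
toy_res = pd.DataFrame(
    {
        "Symbol": ["GENE_A", "GENE_B", "GENE_B", "GENE_C", None],
        "stat": [5.2, -3.1, -0.4, 0.8, 2.0],
    },
    index=["ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4", "ENSG_5"],
)

# Drop unmapped genes, then keep the entry with the largest |stat| per symbol.
toy = toy_res.dropna(subset=["Symbol", "stat"])
toy = toy.loc[toy["stat"].abs().sort_values(ascending=False).index]
toy = toy.drop_duplicates("Symbol")

# Rank from most up-regulated to most down-regulated.
toy_ranking = toy[["Symbol", "stat"]].sort_values("stat", ascending=False)
print(toy_ranking)
```

Which metric to rank on is an analysis choice; the signed statistic is only one common option.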

In [59]:
manual_set = {'things':['STAU2', 'USP53', 'SERPINE1', 'TMEM178B', 'PSAP']}

In [60]:
# To list the available gene set libraries, uncomment the next line:
# gp.get_library_name()

In [None]:
pre_res = gp.prerank(rnk = ranking,
                     gene_sets = ['GO_Biological_Process_2021', manual_set],
                     seed = 6, permutation_num = 100)

In [None]:
out = []

for term in list(pre_res.results):
    out.append([term,
               pre_res.results[term]['fdr'],
               pre_res.results[term]['es'],
               pre_res.results[term]['nes']])

out_df = pd.DataFrame(out, columns = ['Term','fdr', 'es', 'nes']).sort_values('fdr').reset_index(drop = True)
out_df

In [None]:
out_df[0:11]

In [None]:
out_df.iloc[0].Term

## Heatmap

In [77]:
import numpy as np
import seaborn as sns

In [None]:
dds.layers['normed_counts']

In [79]:
dds.layers['log1p'] = np.log1p(dds.layers['normed_counts'])

In [None]:
dds.layers['log1p']
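`np.log1p(x)` computes log(1 + x), so zero counts map to zero instead of negative infinity and large counts are compressed. A small sketch on made-up values:

```python
import numpy as np

# Hypothetical normalized counts, including a zero.
values = np.array([0.0, 1.0, 9.0, 99.0])

logged = np.log1p(values)  # natural log of (1 + value)
print(logged)
```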

In [None]:
sig_genes

In [None]:
dds_sig_genes = dds[:, sig_genes.index]
dds_sig_genes

In [85]:
grapher = pd.DataFrame(dds_sig_genes.layers['log1p'].T,
                       index=dds_sig_genes.var_names, columns=dds_sig_genes.obs_names)

In [None]:
sns.clustermap(grapher, z_score=0, cmap = 'RdYlBu_r')
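With `z_score=0`, each row (gene) is standardized before plotting, so colors show how a gene varies across samples rather than how highly it is expressed overall. The sketch below reproduces that idea by hand on a toy table; the use of the sample standard deviation is an assumption about seaborn's internals, and the names are made up.

```python
import pandas as pd

# Hypothetical genes x samples table on very different scales.
toy = pd.DataFrame(
    {"s1": [1.0, 10.0], "s2": [2.0, 20.0], "s3": [3.0, 30.0]},
    index=["geneA", "geneB"],
)

row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1)  # sample standard deviation (ddof=1)

# Subtract each row's mean and divide by its standard deviation.
z = toy.sub(row_mean, axis=0).div(row_std, axis=0)
print(z)
```

After scaling, both toy genes show the same pattern even though their raw values differ ten-fold.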

## Volcano Plot

In [89]:
from sanbomics.plots import volcano

In [None]:
res

In [None]:
volcano(res, symbol='Symbol')
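A volcano plot conventionally places each gene at its log2 fold change on the x-axis and at -log10 of its (adjusted) p-value on the y-axis, so strongly changed and highly significant genes end up in the upper corners. The sketch below computes these coordinates for one made-up gene; `volcano_point` is a hypothetical helper, and the sanbomics `volcano` function may differ in its details.

```python
import math


def volcano_point(log2_fc: float, padj: float) -> tuple:
    """Return (x, y) for one gene: x is log2 fold change, y is -log10(padj)."""
    return log2_fc, -math.log10(padj)


x, y = volcano_point(1.5, 0.001)
print(x, y)
```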