### Gene Correlation Matrix: Analyzing Expression Data in AnnData Format

This Jupyter Notebook demonstrates how to download and analyze gene expression data in AnnData format for a specified tissue or cell type. It removes genes that are not expressed in any of the selected cells and computes Pearson correlations between a gene of interest and every other gene.

#### Steps:

1. **Open and Query Data:**
   - Access gene expression data using `cellxgene_census.open_soma()`.
   - Query data for 'Homo sapiens' and filter by cell type ('adipocyte') or tissue type ('Adipose').

2. **Filter Genes:**
   - Set gene names (`adata.var['feature_id']`).
   - Filter out genes with zero expression values.

3. **Convert to DataFrame:**
   - Convert the filtered expression data into a Pandas DataFrame.

4. **Filter Samples:**
   - Remove samples where the gene of interest has zero expression.
   - Print the percentage of samples with non-zero expression for the gene of interest.

5. **Calculate Correlations:**
   - Compute Pearson correlation coefficients for the gene of interest.
   - Return the top 500 most positively and negatively correlated genes, and print the first 10 of each list.

#### Usage Instructions:

- Modify the `cell_type`, `tissue_type`, and `gene_of_interest` variables as needed.
- Use `get_coexpression_matrix(gene, tissue, cell_type, k=500)` to obtain top correlated genes.

#### Notes:

- Ensure access to the appropriate AnnData formatted dataset.
- Experiment with different parameters to explore gene expression correlations.


### Downloading Dependencies

To run this notebook, ensure you have the necessary libraries installed:

- `cellxgene_census` for accessing gene expression data.
- `pandas` for data manipulation and analysis.
- `numpy` for numerical operations.
- `scipy.stats` for statistical calculations, including Pearson correlation (`pearsonr`).
- `gseapy` for the gene set enrichment analysis in the final cells.
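
Before running the install cell, it can help to see which of these packages are already importable. The snippet below is a minimal sketch using only the standard library; the `REQUIRED_MODULES` list and the `missing_modules` helper are names chosen for this illustration, not part of the notebook.

```python
from importlib.util import find_spec

# Import names used in this notebook (gseapy is only needed for the enrichment cells).
REQUIRED_MODULES = ["cellxgene_census", "pandas", "numpy", "scipy", "gseapy"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```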


In [1]:
# %pip install cellxgene-census pandas numpy scipy
%pip install cellxgene-census

Collecting fsspec==2024.5.0.* (from s3fs>=2021.06.1->cellxgene-census)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Collecting pandas (from tiledbsoma~=1.9.1->cellxgene-census)
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading fsspec-2024.5.0-py3-none-any.whl (316 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.1/316.1 kB 5.8 MB/s eta 0:00:00
Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: fsspec, pandas
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.3.1
    Uninstalling fsspec-2024.3.1:
      Successfully uninstalled fsspec-2024.3.1
  Attempting uninstall: pandas
    Found existing installation: pandas 1.4.4
    Uninstalling pandas-1.4.4:
      Successfully uninstalled pandas-1.4.4
ERROR: pip's dependency resolver does not c

### Importing Required Libraries

This code cell imports necessary libraries for data analysis:

- `cellxgene_census`: Imports functionality for working with cellxgene data.
- `pandas` (`pd`): Imports the Pandas library for data manipulation and analysis.
- `numpy` (`np`): Imports NumPy for numerical computing operations.
- `pearsonr` from `scipy.stats`: Imports the `pearsonr` function specifically for computing Pearson correlation coefficients.


In [2]:
import cellxgene_census
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda11x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------



### Function to Get Co-Expression Matrix

The following code defines a function `get_coexpression_matrix` that performs the following steps:

1. **Open and Query Data:**
   - Uses `cellxgene_census.open_soma()` to access the gene expression data.
   - Queries data for the organism 'Homo sapiens' and filters by a specific cell type (`cell_type`) using `obs_value_filter=f"cell_type == '{cell_type}'"`.

2. **Ensure Correct Gene Names and Filter Genes:**
   - Checks and sets gene names (`adata.var['feature_id']`).
   - Filters out genes that are not expressed in any cell, i.e. genes whose column of `adata.X > 0` sums to zero.

3. **Convert Data to DataFrame:**
   - Converts the filtered expression data into a Pandas DataFrame (`df_expression`).

4. **Filter Samples Based on Gene of Interest Expression:**
   - Filters out samples where the gene of interest has zero expression.
   - Calculates and prints the percentage of samples with non-zero expression for the gene of interest.

5. **Calculate Pearson Correlations:**
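   - Computes Pearson correlation coefficients between the gene of interest and every other gene in `df_expression`, using only the samples in which the gene of interest is expressed, and keeps the correlations with p < 0.05.
   - Sorts the remaining genes by correlation and returns the top `k` and bottom `k` entries (500 each by default).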

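The function is then run with an example gene (`ENSG00000140718`) and the specified tissue and cell type (`Adipose` and `adipocyte`), and the first 10 entries of each returned list are printed. As written, only `cell_type` is used in the query; the `tissue` argument takes effect only if the `obs_value_filter` line is swapped for the tissue-based variant noted in the code comment.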
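
Steps 2 to 4 can be illustrated on a small dense matrix before running them on the real data. This is a sketch on made-up values: the matrix `X` and the gene names `geneA`, `geneB`, `geneC` are invented for the example, and a dense NumPy array stands in for the sparse `adata.X`.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 4 cells (rows) x 3 genes (columns); values are made up.
X = np.array([
    [0, 2, 0],
    [0, 0, 0],
    [0, 5, 1],
    [0, 1, 3],
])
gene_ids = np.array(["geneA", "geneB", "geneC"])

# Step 2: count, per gene, the cells with expression > 0 and drop genes with none.
cells_expressing = (X > 0).sum(axis=0)
keep = cells_expressing > 0

# Step 3: build a DataFrame with one column per remaining gene.
df_expression = pd.DataFrame(X[:, keep], columns=gene_ids[keep])

# Step 4: keep only cells in which the gene of interest is expressed.
gene = "geneB"
non_zero_samples = df_expression[df_expression[gene] > 0]
non_zero_percentage = 100 * len(non_zero_samples) / len(df_expression)

print(list(df_expression.columns))
print(f"Non-zero samples for {gene}: {non_zero_percentage:.2f}%")
```

Here `geneA` is dropped because no cell expresses it, and the second cell is dropped in step 4 because `geneB` is zero there.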

#### Usage Instructions:

- Call `get_coexpression_matrix(gene, tissue, cell_type, k=500)` with your desired gene, tissue type, and cell type.
- The function returns three values: a list of `(gene, correlation)` pairs for the top `k` most positively correlated genes, a list for the bottom `k` of the ranking, and the gene IDs that passed the expression filter.
- Modify `k` to adjust the number of top correlations returned.

Example:
```python
top_positive, top_negative, all_genes = get_coexpression_matrix('ENSG00000140718', 'Adipose', 'adipocyte', k=500)
```


In [48]:
def get_coexpression_matrix(gene, tissue, cell_type, k=500):
    with cellxgene_census.open_soma() as census:
        # Query the data for a specific organism and cell types
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter=f"cell_type == '{cell_type}'", # use obs_value_filter=f"tissue_general == '{tissue}'" if you want to filter with tissue type
            column_names={"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]},
        )

        # Ensure the gene names are set correctly
        if 'feature_id' in adata.var.columns:  # Adjust column name as needed
            adata.var_names = adata.var['feature_id']
        else:
            print("Gene names column 'feature_id' not found in var DataFrame")

        # Filter out genes with zero expression values
        gene_expression_sum = np.array((adata.X > 0).sum(axis=0)).flatten()
        adata_filtered = adata[:, gene_expression_sum > 0]
        genes = adata_filtered.var['feature_id']
        # Convert the filtered expression data to a DataFrame
        df_expression = pd.DataFrame(adata_filtered.X.toarray(), columns=genes)

        # Check if the gene of interest is in the dataset
        if gene in df_expression.columns:
            # Filter out samples where the gene of interest is not expressed (expression value = 0)
            non_zero_samples = df_expression[df_expression[gene] > 0]
            
            # Calculate percentage of samples with non-zero expression for the gene of interest
            total_samples = df_expression.shape[0]
            non_zero_sample_count = non_zero_samples.shape[0]
            non_zero_percentage = (non_zero_sample_count / total_samples) * 100

            print(f"Total samples: {total_samples}")
            print(f"Samples with non-zero expression for '{gene}': {non_zero_sample_count} ({non_zero_percentage:.2f}%)")

            # Calculate Pearson correlation coefficients and p-values
            correlations = {}
            for g in non_zero_samples.columns:
                if g != gene:
                    corr, p_value = pearsonr(non_zero_samples[gene], non_zero_samples[g])
                    if p_value < 0.05:
                        correlations[g] = corr
                        
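            # Sort by correlation, highest first. top_negative is the tail of this
            # descending list, so its most negative entry comes last.
            sorted_correlations = sorted(correlations.items(), key=lambda x: x[1], reverse=True)
            top_positive = sorted_correlations[:k]
            top_negative = sorted_correlations[-k:]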

            return top_positive, top_negative, genes
        else:
            print(f"Gene of interest '{gene}' not found in the dataset.")
            # Return three values so callers can unpack the result as on success
            return [], [], []
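
The function above calls `pearsonr` once per gene. As a hedged alternative, the same correlation coefficients can be computed in one matrix operation. The sketch below is not the notebook's method: `correlate_with_gene` is a hypothetical helper, it works on made-up toy data, and it does not compute p-values, so the `p < 0.05` filter is not reproduced.

```python
import numpy as np
import pandas as pd


def correlate_with_gene(df, gene):
    """Pearson r between `gene` and every other column of `df`.

    Constant columns have zero variance, so their correlation is undefined
    and comes back as NaN; they are dropped from the result.
    """
    values = df.to_numpy(dtype=float)
    centered = values - values.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    target = centered[:, df.columns.get_loc(gene)]
    with np.errstate(divide="ignore", invalid="ignore"):
        r = centered.T @ target / (norms * np.linalg.norm(target))
    result = pd.Series(r, index=df.columns).drop(gene)
    return result.dropna().sort_values(ascending=False)


# Toy data (made up): geneB rises with geneA, geneC falls, geneD is constant.
toy = pd.DataFrame({
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [5.0, 5.0, 5.0, 5.0],
})
print(correlate_with_gene(toy, "geneA"))
```

If p-values are needed, the per-gene `pearsonr` loop from the function remains the direct route.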

### Selecting Tissue or Cell Type and Gene of Interest

The code cell below sets the tissue type, cell type, and gene of interest used in the analysis:


In [4]:
gene_of_interest = 'ENSG00000140718'
tissue_type = 'Adipose'
cell_type = 'adipocyte'
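
Since the function filters on `cell_type` only, one way to make `tissue_type` take part in the query is to build the filter string from both values. The helper below is a hypothetical sketch: `build_obs_filter` is not part of the notebook, joining clauses with `and` is assumed to be accepted by the `obs_value_filter` syntax, and whether `'Adipose'` matches a value in the `tissue_general` column should be checked against the data.

```python
def build_obs_filter(cell_type=None, tissue=None):
    """Build an obs value filter string from an optional cell type and tissue."""
    clauses = []
    if cell_type is not None:
        clauses.append(f"cell_type == '{cell_type}'")
    if tissue is not None:
        # Column name taken from the comment in get_coexpression_matrix
        clauses.append(f"tissue_general == '{tissue}'")
    if not clauses:
        raise ValueError("Provide a cell type, a tissue, or both.")
    return " and ".join(clauses)


print(build_obs_filter(cell_type="adipocyte"))
print(build_obs_filter(cell_type="adipocyte", tissue="Adipose"))
```

The returned string could then be passed as `obs_value_filter` in place of the hard-coded f-string.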

### Running the Co-Expression Matrix Function

The following code calls the `get_coexpression_matrix` function to obtain the top 500 most positively and negatively correlated genes for a specified gene of interest, tissue type, and cell type.


In [49]:
top_positive, top_negative, all_genes = get_coexpression_matrix(gene_of_interest, tissue_type, cell_type, k=500)
len(all_genes)

The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.


Total samples: 81378
Samples with non-zero expression for 'ENSG00000140718': 53193 (65.37%)


  corr, p_value = pearsonr(non_zero_samples[gene], non_zero_samples[g])


36959
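
The message in the output above suggests pinning the Census release so that repeated runs see the same data. A minimal sketch of that suggestion follows; `open_pinned_census` and `CENSUS_VERSION` are names introduced for this example, while the `census_version` argument is the one quoted in the message.

```python
CENSUS_VERSION = "2024-07-01"  # the "stable" release named in the message above


def open_pinned_census(version=CENSUS_VERSION):
    """Open the Census at a fixed release so repeated runs query the same data."""
    import cellxgene_census  # imported here so the sketch loads without the package

    return cellxgene_census.open_soma(census_version=version)
```

Inside `get_coexpression_matrix`, `with open_pinned_census() as census:` could replace the unpinned `cellxgene_census.open_soma()` call.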

### Printing Top Correlated Genes

The following code prints the top 10 most positively correlated genes with the specified gene of interest. Each gene is listed along with its Pearson correlation coefficient.

In [39]:
print("Top 10 most positively correlated genes:")
for gene, corr in top_positive[:10]:
    print(f"{gene}: correlation = {corr:.3f}")


Top 10 most positively correlated genes:
ENSG00000181722: correlation = 0.700
ENSG00000230590: correlation = 0.682
ENSG00000116117: correlation = 0.675
ENSG00000090905: correlation = 0.675
ENSG00000131558: correlation = 0.667
ENSG00000144357: correlation = 0.664
ENSG00000184903: correlation = 0.663
ENSG00000152818: correlation = 0.658
ENSG00000144036: correlation = 0.656
ENSG00000164330: correlation = 0.654


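The following code prints the first 10 entries of `top_negative`, each with its Pearson correlation coefficient. Because `top_negative` is the tail of a list sorted in descending order, its first entries are the least negative of the bottom `k`, which is why the values shown are small and positive. The most negatively correlated genes sit at the end of the list (`top_negative[-10:]`).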

In [40]:
print("\nTop 10 most negatively correlated genes:")
for gene, corr in top_negative[:10]:
    print(f"{gene}: correlation = {corr:.3f}")


Top 10 most negatively correlated genes:
ENSG00000206474: correlation = 0.009
ENSG00000255307: correlation = 0.009
ENSG00000136918: correlation = 0.009
ENSG00000173080: correlation = 0.009
ENSG00000287076: correlation = 0.009
ENSG00000233932: correlation = 0.009
ENSG00000206262: correlation = 0.009
ENSG00000272788: correlation = 0.009
ENSG00000180697: correlation = 0.009
ENSG00000231815: correlation = 0.009
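
To list the most negative correlations first, the tail slice can be re-sorted in ascending order. The sketch below uses made-up gene names and values in place of the real `sorted_correlations`.

```python
# Toy (gene, correlation) pairs in descending order, mimicking `sorted_correlations`;
# the gene names and values are made up.
sorted_correlations = [
    ("geneA", 0.70), ("geneB", 0.20), ("geneC", 0.01),
    ("geneD", -0.15), ("geneE", -0.40),
]
k = 3

# As in the function: the tail of a descending list, least negative first.
top_negative = sorted_correlations[-k:]

# Re-sort ascending so the most negative correlation comes first.
most_negative_first = sorted(top_negative, key=lambda pair: pair[1])

print(top_negative[0])
print(most_negative_first[0])
```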


In [43]:
import pickle

# Load the Ensembl ID -> HGNC symbol mapping, closing the file afterwards
with open("./data/ensembl_to_hgnc.pkl", "rb") as f:
    ensembl_to_hgnc_map = pickle.load(f)

In [51]:
top_positive_hgnc = [(ensembl_to_hgnc_map.get(gene, gene), corr) for gene, corr in top_positive]
top_negative_hgnc = [(ensembl_to_hgnc_map.get(gene, gene), corr) for gene, corr in top_negative]
all_genes_hgnc = [ensembl_to_hgnc_map.get(gene, gene) for gene in all_genes]

In [50]:
all_genes_hgnc[:10]

['TSPAN6',
 'TNMD',
 'DPM1',
 'SCYL3',
 'FIRRM',
 'FGR',
 'CFH',
 'FUCA2',
 'GCLC',
 'NFYA']
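
The mapping cells rely on `dict.get(key, key)`, which leaves an Ensembl ID unchanged when no HGNC symbol is known. The sketch below shows that fallback on a two-entry illustrative mapping; the second ID in `pairs` and both correlation values are made up.

```python
# Illustrative two-entry mapping; the notebook loads the full one from a pickle file.
toy_map = {
    "ENSG00000140718": "FTO",
    "ENSG00000000003": "TSPAN6",
}
pairs = [("ENSG00000140718", 0.5), ("ENSG00000999999", 0.1)]  # second ID is made up

# dict.get(key, key) falls back to the Ensembl ID when no symbol is known.
named = [(toy_map.get(gene_id, gene_id), corr) for gene_id, corr in pairs]

# Collect IDs that stayed unmapped so they can be reviewed before enrichment.
unmapped = [gene_id for gene_id, _ in pairs if gene_id not in toy_map]

print(named)
print(unmapped)
```

Unmapped IDs are passed on as Ensembl IDs, so they may not match the symbols used in a gene set library; counting them first is a cheap sanity check.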

In [52]:
import gseapy as gp

library = "GO_Biological_Process_2023"
organism = "Human"

# Enrichment of the positively correlated genes, using all expressed genes as background
res = gp.enrichr(
    gene_list=[gene[0] for gene in top_positive_hgnc],
    gene_sets=library,
    background=all_genes_hgnc,
    organism=organism,
    outdir=None,
).results
res.drop("Gene_set", axis=1, inplace=True)
# Split "Name (GO:xxxxxxx)" into a GO ID column and a plain term name
res.insert(1, "ID", res["Term"].apply(
    lambda x: x.split("(")[1].split(")")[0]))
res["Term"] = res["Term"].apply(lambda x: x.split("(")[0])
# Keep only significantly enriched terms
res = res[res["Adjusted P-value"] < 0.05]
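
The post-processing above splits each term on `"("`, which leaves a trailing space on the term name. A hedged alternative is a single regular expression that captures the name and the GO accession together. The sketch below runs on a two-row toy frame whose term strings are copied from the enrichment output in this notebook; it is not the notebook's own code.

```python
import pandas as pd

# Two term strings and adjusted p-values copied from the enrichment output in this notebook.
toy = pd.DataFrame({
    "Term": [
        "Cell Junction Disassembly (GO:0150146)",
        "Skeletal Muscle Contraction (GO:0003009)",
    ],
    "Adjusted P-value": [0.025517, 0.034657],
})

# Capture the text before the final "(GO:...)" as the name and the accession as the ID.
parts = toy["Term"].str.extract(r"^(?P<Term>.*?)\s*\((?P<ID>GO:\d+)\)\s*$")
tidy = pd.concat([parts, toy[["Adjusted P-value"]]], axis=1)

# Keep only significantly enriched terms, as in the cell above.
tidy = tidy[tidy["Adjusted P-value"] < 0.05]
print(tidy)
```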

In [54]:
# case insensitive search
res[res["Term"].str.contains("adipose", case=False)]

| | Term | ID | P-value | Adjusted P-value | Old P-value | Old adjusted P-value | Odds Ratio | Combined Score | Genes |
|---|---|---|---|---|---|---|---|---|---|
| 131 | Positive Regulation Of Adipose Tissue Developm... | GO:1904179 | 8.3e-05 | 0.001539 | 0 | 0 | 55.012575 | 517.118355 | NCOA1;NCOA2;PPARG |
| 188 | Regulation Of Adipose Tissue Development | GO:1904177 | 0.000375 | 0.004885 | 0 | 0 | 27.50327 | 216.995159 | NCOA1;NCOA2;PPARG |


In [33]:

# Negative correlation
res_neg = gp.enrichr(gene_list=[gene[0] for gene in top_negative_hgnc],
                                gene_sets=library,
                                background=all_genes_hgnc,
                                organism=organism,
                                outdir=None).results
# res_neg.drop("Gene_set", axis=1, inplace=True)
# res_neg.insert(1, "ID", res_neg["Term"].apply(
#     lambda x: x.split("(")[1].split(")")[0]))
# res_neg["Term"] = res_neg["Term"].apply(lambda x: x.split("(")[0])
# res_neg = res_neg[res_neg["Adjusted P-value"] < 0.05]

In [34]:
res_neg.head()

| | Gene_set | Term | P-value | Adjusted P-value | Old P-value | Old adjusted P-value | Odds Ratio | Combined Score | Genes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GO_Biological_Process_2023 | Cell Junction Disassembly (GO:0150146) | 2.4e-05 | 0.025517 | 0 | 0 | 110.031187 | 1169.910518 | C1QB;DKK1;C1QC |
| 1 | GO_Biological_Process_2023 | Skeletal Muscle Contraction (GO:0003009) | 0.000109 | 0.034657 | 0 | 0 | 19.593548 | 178.744303 | TNNC1;MYH8;TCAP;TNNI3 |
| 2 | GO_Biological_Process_2023 | Striated Muscle Contraction (GO:0006941) | 0.00012 | 0.034657 | 0 | 0 | 8.670636 | 78.251506 | SMPX;TNNC1;MYL2;MYH8;TCAP;TNNI3 |
| 3 | GO_Biological_Process_2023 | Synapse Pruning (GO:0098883) | 0.000131 | 0.034657 | 0 | 0 | 44.008853 | 393.4438 | C1QB;DKK1;C1QC |
| 4 | GO_Biological_Process_2023 | Inflammatory Response (GO:0006954) | 0.000354 | 0.074883 | 0 | 0 | 3.688551 | 29.311142 | IL1A;CXCL6;VCAM1;HP;CCR5;CCL18;FCGR2B;FOLR2;AI... |
