### Imports

In [34]:
import anndata as ad
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from umap import UMAP
import numpy as np
import warnings
warnings.filterwarnings("ignore")  # Some log2(0) values in the DE analysis raise RuntimeWarnings; we suppress them because they only indicate that a gene is not significant


### Reading in Data

In [35]:
%%time
#Change this to the appropriate data path
data_path = './KOLF_Chroma_mixscape_output_x_pert.h5ad'
adata = ad.read_h5ad(data_path) #Note that the .X is already set to the X_pert layer if you are following the single cell best practices guide

CPU times: user 1.65 s, sys: 2.64 s, total: 4.28 s
Wall time: 9.63 s


### Visualizing perturbation similarities through correlation analysis
1. The data provided to you has been filtered through the entire [single cell best practices](https://www.sc-best-practices.org/conditions/perturbation_modeling.html#analysing-single-pooled-crispr-screens) pipeline, as described in the earlier talk (link).
2. Download the data (link) and read the .h5ad file into Python. Read up on the data structure of [AnnData objects](https://anndata.readthedocs.io/en/latest/) and explore the .obs/.var of the downloaded data. The .X is the X_pert layer computed by [pertpy](https://pertpy.readthedocs.io/en/latest/tutorials/notebooks/mixscape.html), filtered down to the top 6000 highly variable genes.
3. Compute a mean gene expression vector for each perturbation.
4. Compute the pairwise Pearson correlation matrix between all perturbations.
5. Read up on what [UMAP](https://umap-learn.readthedocs.io/en/latest/) does in general. Use the correlation matrix as a precomputed feature matrix as input to UMAP to get a 2-dimensional embedding of each perturbation.
6. Use the nearest n=3 neighbors with spread = 1.0 to compute the UMAP.
7. Use [networkx](https://networkx.org/documentation/stable/tutorial.html) to plot each perturbation as a node, using the UMAP embeddings as the X/Y position of each node. Draw edges between each node and its top 5 Pearson correlates.
8. Which perturbations cluster together in the network? Do these have known biological relationships? How does the clustering change as you vary the number of UMAP neighbors?
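
Steps 3 and 4, plus the "top correlates" lookup needed for the edges in step 7, can be tried on synthetic data before working with the real AnnData object. The snippet below is a minimal sketch and not the reference solution: the toy matrix, the `labels` array, and the helper names `mean_profiles_and_correlation` and `top_k_correlates` are hypothetical. It assumes a dense cells-by-genes array; a sparse `.X` would need converting first (for example with `.toarray()`).

```python
import numpy as np
import pandas as pd


def mean_profiles_and_correlation(X, labels, gene_names):
    """Return (mean expression per perturbation, Pearson correlation, distance)."""
    expr = pd.DataFrame(X, columns=gene_names)
    # One row per perturbation: mean over all cells carrying that label
    mean_expr = expr.groupby(np.asarray(labels)).mean()
    # DataFrame.corr correlates columns, so transpose to correlate perturbations
    corr = mean_expr.T.corr(method="pearson")
    # High correlation -> small distance; values lie in [0, 2]
    dist = 1.0 - corr
    return mean_expr, corr, dist


def top_k_correlates(corr, k):
    """For each perturbation, list the k most correlated *other* perturbations."""
    result = {}
    for name in corr.index:
        others = corr.loc[name].drop(name)  # exclude the self-correlation of 1.0
        result[name] = list(others.nlargest(k).index)
    return result


# Toy data (hypothetical): A and B share a profile, C is anti-correlated with A
rng = np.random.default_rng(0)
profile_a = rng.normal(size=50)
profiles = {
    "A": profile_a,
    "B": profile_a + 0.1 * rng.normal(size=50),
    "C": -profile_a,
}
labels = np.repeat(["A", "B", "C"], 20)  # 20 cells per perturbation
X = np.vstack([profiles[l] + 0.1 * rng.normal(size=50) for l in labels])
gene_names = [f"gene_{i}" for i in range(X.shape[1])]

mean_expr, corr, dist = mean_profiles_and_correlation(X, labels, gene_names)
neighbors = top_k_correlates(corr, k=1)
print(corr.round(2))
print(neighbors)
```

Whether you pass `corr` to UMAP as a feature matrix (as step 5 suggests) or pass `dist` with `metric="precomputed"` is a design choice worth comparing; the sketch returns both so either route can be tried.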

In [None]:
%%time
def create_perturbation_network(adata, target_genes_category='perturbation_target_genes'):
    # Filter cells with mixscape_class_global as 'KO' which correspond to perturbed cells
    # Note that "KO" actually represents a Knock-Down in this application as CRISPRi is used
    
    # Get the list of perturbations
    
    
    # Compute average gene expression vector for all cells with each perturbation


    # Create DataFrame for the expression data
    

    # Compute pairwise Pearson correlation matrix between all perturbation average gene expression vectors
    

    # Convert correlation to distance; want a small distance between high correlates
    

    # Compute the 2D UMAP embedding for each perturbation, using the correlation vector as the feature vector for each perturbation
    

    # Store 2D embeddings as positions in a dictionary for NetworkX Visualization
    

    # Initialize the networkx graph, where nodes are each perturbation


    # Add nodes and edges, positioning nodes based on the UMAP, and draw edges between each perturbation and its top 5 Pearson correlates


    # Visualize the graph
    
    print("Done")

create_perturbation_network(adata)

### Computing Differentially Expressed Genes

1. For each perturbation, compute the number of differentially expressed genes (DEGs) relative to the non-targeting controls (NTCs). What is an appropriate DE test to use (e.g. parametric or non-parametric)? Which FDR and log fold change cutoffs did you choose? Justify your selection.
2. Investigate the DEGs for a perturbation of interest. Is there any interesting biological relationship between them?
3. Can you leverage parallel computing to significantly speed up the differential expression analysis?
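
The counting step in question 1 can be illustrated independently of the DE test itself. The sketch below is a minimal example under stated assumptions: `benjamini_hochberg` and `count_degs` are hypothetical helper names, the p-values and log2 fold changes are made up, and the cutoffs (FDR < 0.05, |log2FC| >= 1) are placeholders rather than recommended values. Note that `sc.tl.rank_genes_groups` already reports adjusted p-values, so the manual correction is shown only to make the procedure explicit.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def count_degs(pvals_adj, log2fc, fdr=0.05, lfc=1.0):
    """Count significantly up- and down-regulated genes (placeholder cutoffs)."""
    pvals_adj = np.asarray(pvals_adj, dtype=float)
    log2fc = np.asarray(log2fc, dtype=float)
    significant = pvals_adj < fdr
    n_up = int(np.sum(significant & (log2fc >= lfc)))
    n_down = int(np.sum(significant & (log2fc <= -lfc)))
    return n_up, n_down


# Made-up per-gene results for a single perturbation versus NTCs
raw_pvals = [0.01, 0.04, 0.03, 0.005]
log2fc = [2.0, -1.5, 0.2, -3.0]

adjusted = benjamini_hochberg(raw_pvals)
n_up, n_down = count_degs(adjusted, log2fc)
print(adjusted, n_up, n_down)
```

Because each perturbation is compared against the same NTC cells independently of the others, a function like `count_degs` applied per perturbation is a natural unit of work to distribute when exploring question 3.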

In [None]:
%%time
def create_differential_gene_expression_plot(adata, pvalue_threshold=-np.log10(0.05)):

    # Keep only cells with a mixscape_class_global of "KO" or "NT", corresponding to perturbed cells or non-targeting controls
    # Note that "KO" actually represents a Knock-Down in this application as CRISPRi is used
    adata_ko = adata[adata.obs['mixscape_class_global'].isin(['NT',"KO"])].copy()

    # Extract a list of each perturbed gene
    genes_of_interest = [] # Adjust this accordingly

    # Initialize a data structure to store the number of significantly upregulated / downregulated genes
    

    for gene_of_interest in genes_of_interest: # TODO: Speed this up (e.g. parallel processing)
        # Isolate cells that are either NTCs or having a perturbation corresponding to the gene of interest
        

        # Step 2: Differential Expression Analysis using sc.tl.rank_genes_groups (what is an appropriate DE test to use? why?)
        

        # Step 3: Extract the results from the DE analysis
        

        # Step 4: Count the number of significantly upregulated and downregulated genes
        

    # Plot the number of significantly upregulated and downregulated genes per perturbation
   
    print("Done")

create_differential_gene_expression_plot(adata)