# Transcription Factor Project
**Robin Anwyl, UCSD Subramaniam Lab, Winter Quarter 2025**

**Goal:** Analyze the hiPSC Perturb-seq dataset from the Mali lab (Nourreddine et al preprint) to investigate the effects of transcription factor knockouts (TF KOs).

**Guiding Questions:**
*  What TF KOs are present in the dataset? 
*  When clustering the dataset, can we associate a unique set of TFs with each cluster? Do these clusters have any functional significance?
*  What gene targets are associated with these TFs?
*  Which TFs are involved in co-regulation, and which genes are co-regulated? Which TFs are regulated by other TFs?
*  How do TFs influence chromatin remodeling? Which sets of writers and erasers are regulated by these TFs?

***

# Import statements

In [2]:
import mudata as md
import anndata as ad
import rapids_singlecell as rsc
import scanpy as sc
import pandas as pd
import numpy as np
# import requests

# Working with the original Perturb-Seq dataset

We will read in the Perturb-seq single-cell dataset `.h5mu` file as a `MuData` object, `mdata`.

`mdata` contains two `AnnData` objects: `rna` (cell-by-gene matrix) and `crispr` (cell-by-knockout matrix)
- `rna`: 5386783 cells x 38606 features  
    - Features = expression of each human gene measured in the dataset
    - `rna.obs` = cell barcodes, `rna.var` = features = genes
- `crispr`: 5386783 cells x 35989 features
    - Features = CRISPR sgRNAs (3 sgRNAs per gene target * 11739 gene targets, plus 478 non-targeting control sgRNAs)
    - `crispr.obs` = cell barcodes, `crispr.var` = features = knockouts

We will be working with `rna` for the majority of the analysis.

In [3]:
mdata = md.read_h5mu("/home/data/Mali_project/KOLF_Pan_Genome_Aggregate.h5mu")
rna = mdata["rna"]



## QC and preprocessing 

Cell Ranger counting and aggregation have already been performed on the data. The file `protospacer_calls_per_cell.csv` records which CRISPR guide RNA (sgRNA) each cell, identified by its barcode, received. We will use this information to assign sgRNAs to the cell barcodes in `rna.obs`. During QC, Nourreddine et al kept only cells that received a single sgRNA, so we will add this metadata only to cells with exactly one sgRNA.

In [None]:
protospacer_calls = pd.read_csv("/home/data/Mali_project/protospacer_calls_per_cell.csv", index_col=0)
single_guide_cells = protospacer_calls[protospacer_calls['num_features'] == 1] # barcodes in protospacer_calls with single sgRNA
barcode_to_sgRNA = single_guide_cells['feature_call'].to_dict()
rna.obs['sgRNA'] = rna.obs.index.map(lambda barcode: barcode_to_sgRNA.get(barcode, None))
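The assignment step can be checked on a toy table before running it on millions of barcodes. The sketch below is a minimal illustration that reuses the `num_features` and `feature_call` column names from the cell above; the barcodes, guide names, and the multi-guide string format are made up, and `obs` is a hypothetical stand-in for `rna.obs`.

```python
import pandas as pd

# Toy stand-in for protospacer_calls_per_cell.csv (barcodes and guide names are made up).
protospacer_calls = pd.DataFrame(
    {
        "num_features": [1, 2, 1],
        "feature_call": ["GENE_A-1", "GENE_A-2|GENE_B-1", "GENE_C-3"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Toy stand-in for rna.obs: cell_4 has no protospacer call at all.
obs = pd.DataFrame(index=["cell_1", "cell_2", "cell_3", "cell_4"])

# Keep only barcodes with exactly one detected sgRNA.
single_guide_cells = protospacer_calls[protospacer_calls["num_features"] == 1]

# Align on the barcode index; barcodes without a single-guide call become NaN.
obs["sgRNA"] = single_guide_cells["feature_call"].reindex(obs.index)

print(obs)
print(obs["sgRNA"].notna().sum(), "of", len(obs), "cells have a single sgRNA")
```

Cells with several guides and cells with no call both end up with a missing `sgRNA` value, so a later `notna()` filter drops them together.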

Next, we subset `rna` to the cells that were assigned an sgRNA.

In [None]:
rna_single_guides = rna[rna.obs['sgRNA'].notna(), :]  # Cells with an assigned sgRNA

Now we will filter for genes that are expressed in at least one cell. On the full matrix, this call runs out of memory.

In [None]:
expressed_genes = rna_single_guides.X.getnnz(axis=0) > 0  # Genes expressed in at least one cell

MemoryError: Unable to allocate 60.0 GiB for an array with shape (8051880520,) and data type int64
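The `MemoryError` above asks for an int64 array with 8,051,880,520 elements, which at 8 bytes each comes to 60 GiB. One way to stay within memory is to scan the matrix in blocks of rows, so that temporary arrays scale with one block rather than with the whole matrix. The sketch below is a minimal illustration, assuming the expression matrix is an in-memory SciPy CSR matrix; the function name `genes_expressed_in_any_cell`, the `row_mask` and `chunk_size` parameters, and the toy matrix are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy import sparse


def genes_expressed_in_any_cell(X, row_mask=None, chunk_size=100_000):
    """Return a boolean mask over columns (genes) with a nonzero value in any kept row.

    X is assumed to be a CSR matrix (cells x genes). row_mask, if given, is a boolean
    array selecting the cells to consider. Rows are processed chunk_size at a time.
    """
    n_cells, n_genes = X.shape
    expressed = np.zeros(n_genes, dtype=bool)
    for start in range(0, n_cells, chunk_size):
        stop = min(start + chunk_size, n_cells)
        block = X[start:stop]  # row slice of the CSR matrix
        if row_mask is not None:
            block = block[row_mask[start:stop]]  # keep only the selected cells
        nonzero = block.data != 0  # ignore explicitly stored zeros
        expressed[block.indices[nonzero]] = True
    return expressed


# Toy matrix: 4 cells x 4 genes; the last cell is excluded by the mask.
X = sparse.csr_matrix(np.array([
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
]))
keep_cells = np.array([True, True, True, False])

expressed_kept = genes_expressed_in_any_cell(X, row_mask=keep_cells, chunk_size=2)
expressed_all = genes_expressed_in_any_cell(X, chunk_size=2)
print(expressed_kept)
print(expressed_all)
```

On the notebook's objects this might be called as `genes_expressed_in_any_cell(rna.X, rna.obs['sgRNA'].notna().to_numpy())`, which also avoids materializing the subset through `rna_single_guides.X`. Whether that resolves the error depends on how `rna.X` is stored, which the notebook does not show.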

## Determining which KOs are TF KOs

Extract the set of all gene knockouts from `crispr.var` and save as a .txt file.

In [5]:
# crispr = mdata["crispr"]
# crispr_sgrnas = crispr.var.index.tolist()
# crispr_genes = sorted({sgrna[:-2] for sgrna in crispr_sgrnas})  # Strip the 2-character guide suffix
# print(crispr_genes[:10])
# with open("gene_knockouts.txt", "w") as f:
#     for gene in crispr_genes:
#         f.write(gene + "\n")
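The cell above strips the last two characters from each sgRNA name, which only works if every name ends in a separator plus a single-digit guide index. A slightly more defensive sketch splits on the last separator instead and counts guides per target as a sanity check. The sgRNA names and the `-` separator below are hypothetical; the real naming scheme in `crispr.var` (including the non-targeting controls) may differ.

```python
from collections import Counter

# Hypothetical sgRNA names; the real names in crispr.var may follow another scheme.
sgrna_names = ["GENE_A-1", "GENE_A-2", "GENE_A-3", "GENE_B-1", "GENE_B-2", "GENE_B-10"]


def target_of(sgrna_name, sep="-"):
    """Return the gene target by dropping everything after the last separator."""
    target, _, _guide_index = sgrna_name.rpartition(sep)
    # Names without a separator are returned unchanged.
    return target if target else sgrna_name


guides_per_target = Counter(target_of(name) for name in sgrna_names)
gene_targets = sorted(guides_per_target)

print(gene_targets)
print(guides_per_target)
```

Targets whose guide count differs from the expected 3 would be worth inspecting by hand before the gene list is written to disk.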

Read the KO .txt file in.

In [6]:
# with open("gene_knockouts.txt") as f:
#     gene_kos = f.read().split()
# print(gene_kos[:10])
# print(len(gene_kos))

### Comparing Perturb-seq KOs to human TFs from Lambert et al 2018 study

Lambert et al (2018) identified 1639 human transcription factors; this list of TFs is publicly available as a spreadsheet of each studied gene and whether or not it is a TF. 

Read in the spreadsheet and view the first few rows and columns:

In [7]:
# lambert_csv = pd.read_csv("Lambert_2018_TFs.csv")
# print(lambert_csv.iloc[:10, :4])

Filter the TFs from the full list:

In [8]:
# lambert_tfs = lambert_csv[lambert_csv.iloc[:,3] == "Yes"].iloc[:,1].tolist()
# print(lambert_tfs[:10])
# print(len(lambert_tfs))

Find the intersection of the set of 11702 Perturb-seq KOs and the set of 1639 transcription factors:

In [9]:
# tfs = sorted(set(gene_kos) & set(lambert_tfs))
# print(tfs[:10])
# print(len(tfs))

Some TFs in the Lambert set are missing from the gene KO set. To check that they are truly absent rather than listed under a different name, use the Ensembl ID of each missing TF to look up alternate gene names (synonyms), then search the gene KO set for these synonyms.

In [10]:
# # Get missing TFs
# missing_tfs = set(lambert_tfs) - set(tfs)
# print(f"Lambert TFs not present in Perturb-seq KOs: {missing_tfs}")
# # Get Ensembl ID for each missing TF
# missing_tf_ensembl = lambert_csv[lambert_csv.iloc[:, 1].isin(missing_tfs)].iloc[:, 0].tolist()
# print(f"Ensembl IDs for Lambert TFs not in Perturb-seq KOs: {missing_tf_ensembl}")

# # Generate gene synonyms for each missing TF
# def get_ensembl_synonyms(ensembl_id):
#     # Use the Ensembl REST API xrefs/id endpoint to look up gene synonyms for a given Ensembl ID
#     url = "https://rest.ensembl.org/xrefs/id/" + ensembl_id + "?content-type=application/json"
#     response = requests.get(url)
#     if response.status_code != 200:
#         # Report the failure and return an empty list so callers always receive a list
#         print(f"Error fetching data for {ensembl_id}: {response.status_code}")
#         return []
#     synonyms = []
#     for xref in response.json():
#         if xref.get('synonyms'):
#             synonyms.extend(xref['synonyms'])
#     return synonyms
# synonym_list = []
# for ensembl_id in missing_tf_ensembl:
#     alt_names = list(set(get_ensembl_synonyms(ensembl_id)))
#     synonym_list.append(alt_names)
# print(f"Gene synonyms for Lambert TFs not in Perturb-seq KOs: {synonym_list}")

# # Search gene KO set for these synonyms
# alt_name_tfs = list()
# for gene_synonyms in synonym_list:
#     for synonym in gene_synonyms:
#         if synonym in gene_kos:
#             alt_name_tfs.append(synonym)
# print(f"TF gene synonyms present in Perturb-seq KOs: {alt_name_tfs}")
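The search above collects matching synonyms into a flat list, which loses track of which missing TF each synonym belongs to. A minimal sketch of the same search that keeps that link is shown below. The function name `match_synonyms` and all gene names are hypothetical placeholders, and the synonym dictionary stands in for the results of the Ensembl lookups.

```python
def match_synonyms(synonyms_by_tf, gene_kos):
    """Map each missing TF to the synonyms that appear in the knockout list."""
    ko_set = set(gene_kos)  # set membership avoids rescanning the list for every synonym
    matches = {}
    for tf, synonyms in synonyms_by_tf.items():
        found = sorted(set(synonyms) & ko_set)
        if found:
            matches[tf] = found
    return matches


# Hypothetical data: placeholder names, not real genes.
synonyms_by_tf = {"TF_X": ["TFX1", "OLD_X"], "TF_Y": ["TFY2"], "TF_Z": []}
gene_kos = ["GENE_A", "OLD_X", "GENE_B"]

matches = match_synonyms(synonyms_by_tf, gene_kos)
print(matches)
```

An empty result corresponds to the case where no synonym of any missing TF is present in the knockout list.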

None of the synonyms were found in the Perturb-seq gene KO set; thus, the final set of TFs in common between the gene KO set and the Lambert TF set is 1610 TFs. Save the final set of TFs as a .txt file.

In [11]:
# with open("tfs_1610.txt", "w") as f:
#     for tf in tfs:
#         f.write(tf + "\n")