# Transcription Factor Extraction
**Author:** Robin Anwyl

**Objective:** Read in the Perturb-seq hiPSC CRISPR knockout (KO) scRNA-seq dataset from the Mali Lab and extract the transcription factors (TFs) from the full list of gene knockouts.
***

Read in the scRNA-seq dataset, which is stored as an `.h5mu` file, and assign it to the variable `mdata`.

In [None]:
import mudata
mdata = mudata.read_h5mu("/home/data/Mali_project/KOLF_Pan_Genome_Aggregate.h5mu")

Rich representation of `mdata`, which is a `MuData` object

In [None]:
with mudata.set_options(display_style = "html", display_html_expand = 0b000):
    display(mdata)



Access the `rna` cell-by-gene matrix and `crispr` cell-by-knockout matrix, which are `AnnData` objects.
- `rna`: 5386783 cells x 38606 features  
    - Features = expression of each gene in the human genome
- `crispr`: 5386783 cells x 35989 features
    - Features = CRISPR sgRNAs (3 sgRNAs per gene target * 11739 gene targets, plus 478 non-targeting control sgRNAs)

In [None]:
rna_adata = mdata['rna']
crispr_adata = mdata['crispr']

`rna` and `crispr` both have `obs` and `var` attributes.

- `obs` = cell barcodes, one per single cell; the same for both `rna` and `crispr`
- `rna.var` = human genes; `crispr.var` = CRISPR sgRNAs, named by the gene each one knocks out

In [None]:
rna_adata.var

| | gene_ids | feature_types |
|---|---|---|
| DDX11L2 | ENSG00000290825 | Gene Expression |
| MIR1302-2HG | ENSG00000243485 | Gene Expression |
| FAM138A | ENSG00000237613 | Gene Expression |
| ENSG00000290826 | ENSG00000290826 | Gene Expression |
| OR4F5 | ENSG00000186092 | Gene Expression |
| ... | ... | ... |
| ENSG00000277836 | ENSG00000277836 | Gene Expression |
| ENSG00000278633 | ENSG00000278633 | Gene Expression |
| ENSG00000276017 | ENSG00000276017 | Gene Expression |
| ENSG00000278817 | ENSG00000278817 | Gene Expression |


In [None]:
crispr_adata.var

| | gene_ids | feature_types |
|---|---|---|
| A1BG_1 | A1BG_1 | CRISPR Guide Capture |
| A1BG_2 | A1BG_2 | CRISPR Guide Capture |
| A1BG_3 | A1BG_3 | CRISPR Guide Capture |
| A1CF_1 | A1CF_1 | CRISPR Guide Capture |
| A1CF_2 | A1CF_2 | CRISPR Guide Capture |
| ... | ... | ... |
| ZZEF1_2 | ZZEF1_2 | CRISPR Guide Capture |
| ZZEF1_3 | ZZEF1_3 | CRISPR Guide Capture |
| ZZZ3_1 | ZZZ3_1 | CRISPR Guide Capture |
| ZZZ3_2 | ZZZ3_2 | CRISPR Guide Capture |
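
The sgRNA names above appear to follow a `<target>_<guide number>` pattern. As a rough sanity check on the "3 sgRNAs per gene target" description, the guides per target could be tallied along these lines. This is only a sketch: `sgrna_names` is a small hypothetical list standing in for the real index, and it assumes every name ends in an underscore followed by the guide number.

```python
from collections import Counter

# Hypothetical toy list in the "<target>_<guide number>" pattern (not the real index)
sgrna_names = ["A1BG_1", "A1BG_2", "A1BG_3", "A1CF_1", "A1CF_2", "ZZZ3_1"]

# Split on the last underscore so the target name is kept whole
guides_per_target = Counter(name.rsplit("_", 1)[0] for name in sgrna_names)

# Targets that do not have exactly 3 guides in this list
incomplete = {target: n for target, n in guides_per_target.items() if n != 3}

print(guides_per_target)
print(incomplete)
```

In the notebook itself, `sgrna_names` would presumably be replaced by `crispr_adata.var_names`; how the non-targeting control sgRNAs are named is not shown here, so they may need separate handling.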


Get the list of all gene KOs by stripping the guide number from each sgRNA name. This list can then be compared against a database of human TFs to narrow it down to the TF KOs.

In [22]:
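# sgRNA names follow the pattern "<gene>_<guide number>", e.g. "A1BG_1"
crispr_sgrnas = crispr_adata.var.index.tolist()
# Strip the guide suffix by splitting on the last underscore; unlike slicing off
# the last two characters, this also handles guide numbers with more than one digit
crispr_genes = sorted({sgrna.rsplit("_", 1)[0] for sgrna in crispr_sgrnas})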
print(crispr_genes[:10])
# with open("gene_knockouts.txt", "w") as f:
#     for gene in crispr_genes:
#         f.write(gene + "\n")

['A1BG', 'A1CF', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAGAB', 'AAMP', 'AAR2', 'AARD']
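
The comparison against a TF database can be sketched as a set intersection. Everything below is illustrative: `ko_genes_demo` and `tf_symbols` are small hypothetical stand-ins for the real knockout list and for whichever human TF reference list is chosen, and the sketch assumes both use the same gene-symbol convention.

```python
# Hypothetical stand-ins for the real knockout list and TF reference list
ko_genes_demo = ["A1BG", "A1CF", "GATA1", "SOX2", "ZZZ3"]
tf_symbols = {"GATA1", "SOX2", "POU5F1"}

# Knockout targets that are also listed as TFs
tf_knockouts = sorted(set(ko_genes_demo) & tf_symbols)

# TFs in the reference list with no matching knockout target
tfs_not_targeted = sorted(tf_symbols - set(ko_genes_demo))

print(tf_knockouts)      # ['GATA1', 'SOX2']
print(tfs_not_targeted)  # ['POU5F1']
```

Checking the unmatched TFs as well can help catch symbol mismatches (for example aliases or outdated names) between the two lists.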
