# Transcription Factor Analysis - TF Project Part 2

**Author:** Robin Anwyl

**Objective:** 1610 transcription factors (TFs) were identified among the Perturb-seq gene knockouts. The next broad goal of the project is to build a TF interaction network. The methodology for building that network depends on exactly what information the Perturb-seq dataset contains, so in this notebook we take a closer look at the dataset.
***

Import packages

In [None]:
import mudata
import pandas as pd
import re

Read in Perturb-seq dataset

In [None]:
mdata = mudata.read_h5mu("/home/data/Mali_project/KOLF_Pan_Genome_Aggregate.h5mu")
with mudata.set_options(display_style = "html", display_html_expand = 0b000):
    display(mdata)


0,1,2,3
crispr,bool,numpy.ndarray,
rna,bool,numpy.ndarray,


Read in list of 1610 TFs in the Perturb-seq dataset

In [67]:
with open("tfs_1610.txt") as f:
    tfs = f.read().split()
print(tfs[:100])
print(len(tfs))

['ADNP', 'ADNP2', 'AEBP1', 'AEBP2', 'AHCTF1', 'AHDC1', 'AHR', 'AHRR', 'AIRE', 'AKAP8', 'AKAP8L', 'AKNA', 'ALX1', 'ALX3', 'ALX4', 'ANHX', 'ANKZF1', 'AR', 'ARGFX', 'ARHGAP35', 'ARID2', 'ARID3A', 'ARID3B', 'ARID3C', 'ARID5A', 'ARID5B', 'ARNT', 'ARNT2', 'ARX', 'ASCL1', 'ASCL2', 'ASCL3', 'ASCL4', 'ASCL5', 'ASH1L', 'ATF1', 'ATF2', 'ATF3', 'ATF4', 'ATF5', 'ATF6', 'ATF6B', 'ATF7', 'ATMIN', 'ATOH1', 'ATOH7', 'ATOH8', 'BACH1', 'BACH2', 'BARHL1', 'BARHL2', 'BARX1', 'BARX2', 'BATF', 'BATF2', 'BATF3', 'BAZ2A', 'BAZ2B', 'BBX', 'BCL11A', 'BCL11B', 'BCL6', 'BCL6B', 'BHLHA15', 'BHLHA9', 'BHLHE22', 'BHLHE23', 'BHLHE40', 'BHLHE41', 'BNC1', 'BNC2', 'BPTF', 'BRF2', 'BSX', 'CAMTA1', 'CAMTA2', 'CARF', 'CASZ1', 'CBX2', 'CC2D1A', 'CCDC17', 'CDC5L', 'CDX1', 'CDX2', 'CDX4', 'CEBPA', 'CEBPB', 'CEBPD', 'CEBPE', 'CEBPG', 'CEBPZ', 'CENPA', 'CENPB', 'CENPS', 'CENPT', 'CENPX', 'CGGBP1', 'CHAMP1', 'CHCHD3', 'CIC']
1610


For this project, we only want to consider single-TF knockout and non-targeting control (NTC) samples. The CRISPR perturbation metadata file "protospacer_calls_per_cell.csv" records which CRISPR guide RNA(s) each cell received. When a cell received more than one guide, the "feature_call" and "num_umis" fields list the guides and their UMI counts separated by "|". First, we will read in and view the file.

In [None]:
protospacer_df = pd.read_csv("/home/data/Mali_project/protospacer_calls_per_cell.csv")
display(protospacer_df[108885:108890]) # Representative subset of file

Unnamed: 0,cell_barcode,num_features,feature_call,num_umis
108885,TGCGACGTCAACTCTT-24,1,ADNP2-1,20
108886,AACGTCACATAATCCG-25,2,ADNP2-1|MACROH2A2-1,11|11
108887,ACGCACGAGCCTATTG-25,2,ADNP2-1|Non-Targeting-498,21|22
108888,AGAGCCCCACGCTGCA-25,1,ADNP2-1,44
108889,AGGGTTTGTTACCCAA-25,2,ADNP2-1|ZNF33A-3,7|19
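
The pipe-delimited fields shown above can be unpacked into one (guide, UMI count) pair per guide. The following is a minimal sketch rather than part of the original analysis: the helper name `parse_guide_calls` is hypothetical, and it assumes that "feature_call" and "num_umis" always contain the same number of "|"-separated entries, as they do in the rows displayed.

```python
def parse_guide_calls(feature_call, num_umis):
    """Pair each guide in a feature_call string with its UMI count.

    Hypothetical helper: assumes both fields are "|"-delimited and aligned.
    """
    guides = feature_call.split("|")
    # num_umis may arrive as an int (single guide) or a string such as "11|11"
    umis = [int(u) for u in str(num_umis).split("|")]
    if len(guides) != len(umis):
        raise ValueError("feature_call and num_umis have different lengths")
    return list(zip(guides, umis))


# Values copied from the rows displayed above
single = parse_guide_calls("ADNP2-1", 20)
double = parse_guide_calls("ADNP2-1|ZNF33A-3", "7|19")
print(single)
print(double)
```

The same helper could be applied row by row to `protospacer_df` if per-guide UMI counts are needed later, for example to set a minimum-UMI threshold.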


Next, we will use the sgRNA-per-cell information and the list of 1610 TFs to generate a subset of cell barcodes corresponding to single-TF KO and NTC cell samples only.

In [None]:
tfs = set(tfs)  # A set gives constant-time membership tests
# Filter function
def is_single_tf_or_ntc(feature_call):
    """Return True for a single non-targeting guide, or for exactly one TF guide
    optionally accompanied by non-targeting guides."""
    # Case 1: the cell received exactly one guide (a TF guide or a non-targeting guide).
    # Note: [:-2] strips a suffix such as "-1", so guide indices are assumed to be single digits.
    if "|" not in feature_call:
        if re.match(r"Non-Targeting-\d+", feature_call) or feature_call[:-2] in tfs:
            return True
    # Case 2: sort the guides into TF guides and non-targeting guides
    guide_rnas = set(feature_call.split("|"))
    tf_kos, nt_rnas = set(), set()
    for g in guide_rnas:
        if re.match(r"Non-Targeting-\d+", g):
            nt_rnas.add(g)
        elif g[:-2] in tfs:
            tf_kos.add(g)
    if guide_rnas - tf_kos - nt_rnas:  # Some guide is neither a TF guide nor non-targeting
        return False
    # Keep the cell only if exactly one TF guide is present
    # (cells carrying several non-targeting guides and no TF guide are therefore excluded)
    return len(tf_kos) == 1
# Test the filter function on a representative subset of file
test_df = protospacer_df[108885:110000]
filter_test = test_df[test_df["feature_call"].apply(is_single_tf_or_ntc)]
display(filter_test)

Unnamed: 0,cell_barcode,num_features,feature_call,num_umis
108885,TGCGACGTCAACTCTT-24,1,ADNP2-1,20
108887,ACGCACGAGCCTATTG-25,2,ADNP2-1|Non-Targeting-498,21|22
108888,AGAGCCCCACGCTGCA-25,1,ADNP2-1,44
108891,ATTACTCGTCCACGCA-25,1,ADNP2-1,9
108894,CACTGAATCAAAGAAC-25,1,ADNP2-1,24
...,...,...,...,...
109976,TGAGGGAAGCGCCTCA-28,1,ADNP2-3,40
109979,TTCTAGTGTCACTTCC-28,1,ADNP2-3,49
109989,GACTTCCAGCACTCTA-29,2,ADNP2-3|Non-Targeting-923,4|5
109992,GGATGTTTCTTACGTT-29,1,ADNP2-3,77
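
The filter above recovers the gene name by slicing off the last two characters of each guide name, which only works for single-digit guide indices. The sketch below shows one alternative that splits on the last hyphen instead. It is not the function used to produce the results in this notebook, and it differs in two labeled ways: it counts distinct targeted genes rather than distinct guides, and it labels cells that carry only non-targeting guides as "ntc" even when there are several. The names `guide_to_gene` and `classify_call` are hypothetical, and the guide format `<GENE>-<index>` is an assumption based on the rows shown.

```python
import re

NTC_PATTERN = re.compile(r"Non-Targeting-\d+")


def guide_to_gene(guide):
    """Strip the trailing '-<index>' from a guide name, e.g. 'ADNP2-3' -> 'ADNP2'."""
    return guide.rsplit("-", 1)[0]


def classify_call(feature_call, tf_names):
    """Return 'single_tf', 'ntc', or 'other' for one feature_call string."""
    targeted_genes = set()
    for guide in feature_call.split("|"):
        if NTC_PATTERN.fullmatch(guide):
            continue  # Non-targeting guides never disqualify a cell
        gene = guide_to_gene(guide)
        if gene not in tf_names:
            return "other"  # A guide targets a gene outside the TF list
        targeted_genes.add(gene)
    if not targeted_genes:
        return "ntc"  # Only non-targeting guides were detected
    return "single_tf" if len(targeted_genes) == 1 else "other"


# Toy TF set for illustration only; the notebook uses the full 1610-TF list
toy_tfs = {"ADNP2", "ZNF33A"}
for call in ["ADNP2-1", "ADNP2-1|Non-Targeting-498", "ADNP2-1|MACROH2A2-1",
             "ADNP2-1|ZNF33A-3", "Non-Targeting-498", "ADNP2-12"]:
    print(call, "->", classify_call(call, toy_tfs))
```

Because these rules differ from `is_single_tf_or_ntc`, applying this sketch to the full file could select a different set of barcodes than the one saved in this notebook.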


Save the list of cell barcodes to "filtered_barcodes.txt". The cell below is commented out; uncomment it to regenerate the file.

In [None]:
# Run filter function and save list of cell barcodes as .txt file
# filtered_df = protospacer_df[protospacer_df["feature_call"].apply(is_single_tf_or_ntc)]
# filtered_barcodes = filtered_df["cell_barcode"]
# with open("filtered_barcodes.txt", "w") as f:
#     for barcode in filtered_barcodes:
#         f.write(f"{barcode}\n")

Read in "filtered_barcodes.txt" to get the list of cell barcodes corresponding to single-TF KO or NTC samples.

In [85]:
with open("filtered_barcodes.txt") as f:
    barcodes = f.read().split()
print(barcodes[:10])
print(len(barcodes))


['CATTGTTCACAGTGAG-1', 'CCCTCTCAGGTTCATC-1', 'CCTTTGGGTAACCCTA-2', 'CGCCATTGTTCGTTCC-2', 'GATGCTACATCGCCTT-2', 'GGTGTTAAGGTCGTGA-3', 'TGAATCGAGTATCTGC-3', 'GAGGGTATCCGCTGTT-4', 'GCTCAAATCGAGCACC-7', 'GGGCGTTTCGATTTCT-7']
628136
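
A few quick checks on the barcode list can catch problems before it is used to subset the dataset. The sketch below is illustrative only: `summarize_barcodes` is a hypothetical helper, and treating the number after the last hyphen as a library identifier is an assumption about how the aggregated barcodes were named, not something stated in this notebook.

```python
from collections import Counter


def summarize_barcodes(barcodes, total_cells):
    """Report size, uniqueness, retained fraction, and per-suffix counts."""
    return {
        "n": len(barcodes),
        "n_unique": len(set(barcodes)),  # Should equal n if there are no duplicates
        "fraction": len(barcodes) / total_cells,
        # Assumption: the suffix after the last "-" identifies the source library
        "per_suffix": Counter(b.rsplit("-", 1)[-1] for b in barcodes),
    }


# First three barcodes from the output above; total_cells=10 is a toy value
example = ["CATTGTTCACAGTGAG-1", "CCCTCTCAGGTTCATC-1", "CCTTTGGGTAACCCTA-2"]
summary = summarize_barcodes(example, total_cells=10)
print(summary)
```

Calling `summarize_barcodes(barcodes, total_cells=5386783)` on the full list would give the retained fraction directly, which is roughly 0.117 for 628,136 barcodes.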


Out of the 5,386,783 cells in the dataset, 628,136 (about 11.7%) correspond to single-TF knockout or NTC samples.