## Get target genes for all siRNAs in rxrx1

The rxrx1 dataset from Recursion uses siRNAs as cell perturbations. The metadata lists the ThermoFisher ID of each siRNA but not the gene(s) it targets. This information could be useful for batch effect correction: if the batches of two siRNAs cluster together, one possible explanation is that both siRNAs target the same gene.<br>
This notebook demonstrates how to retrieve the target genes of the siRNAs used in rxrx1 from the ThermoFisher website (https://www.thermofisher.com/de/de/home.html).

In [1]:
import os
import urllib.request
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from bcc.utils import get_sirna_target_gene



In [2]:
if not os.path.exists("../data/rxrx1/metadata.csv"):
    raise FileNotFoundError("Please download the metadata from https://www.rxrx.ai/rxrx1 and put the file metadata.csv in the directory data/rxrx1")

In [3]:
meta = pd.read_csv("../data/rxrx1/metadata.csv")
meta = meta.set_index("site_id")

In [4]:
thermofisher_ids = np.unique(meta["sirna"].values)

In [5]:
target_gene = []

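For each siRNA, the target gene(s) are retrieved from the ThermoFisher website by web scraping, using the helper `get_sirna_target_gene`. The placeholder ID `EMPTY` is mapped to a missing value instead of being looked up.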

In [6]:
%%time

for sirna in thermofisher_ids:
    if sirna == "EMPTY":
        target_gene.append(pd.NA)
    else:
        target_gene.append(get_sirna_target_gene(sirna))


CPU times: user 1min, sys: 1.21 s, total: 1min 1s
Wall time: 9min 11s
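The loop above needs about nine minutes of wall time, most of it spent waiting on the network, and re-running the notebook repeats every request. One way to avoid this is to cache the results on disk. The following is a minimal sketch under stated assumptions, not part of `bcc.utils`: `cached_lookup`, `fake_lookup`, and the cache file name are hypothetical, and `fake_lookup` only stands in for `get_sirna_target_gene` so that the example runs offline.

```python
import json
import os
import tempfile


def cached_lookup(ids, lookup, cache_path):
    """Map each ID to lookup(ID), reusing results stored in a JSON cache file."""
    # Load earlier results if the cache file already exists.
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)

    for sirna in ids:
        if sirna in cache:
            continue  # already retrieved in a previous run
        # As in the notebook, "EMPTY" is stored as a missing value, not looked up.
        cache[sirna] = None if sirna == "EMPTY" else lookup(sirna)
        # Write after every new entry so an interrupted run keeps its progress.
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)

    return [cache[s] for s in ids]


# Hypothetical stand-in for get_sirna_target_gene; it records each call.
calls = []


def fake_lookup(sirna):
    calls.append(sirna)
    return {"s1174": "JAG1", "s134": "UQCR11"}[sirna]


cache_file = os.path.join(tempfile.mkdtemp(), "sirna_cache.json")
ids = ["EMPTY", "s1174", "s134"]
first = cached_lookup(ids, fake_lookup, cache_file)
second = cached_lookup(ids, fake_lookup, cache_file)  # served from the cache
```

In the notebook, the call would presumably look like `cached_lookup(thermofisher_ids, get_sirna_target_gene, "some_cache.json")`. Rewriting the whole file after each entry is wasteful but simple, and with roughly a thousand IDs the cost is negligible compared to the network requests.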


In [7]:
df = pd.DataFrame({"sirna": thermofisher_ids, "target_gene": target_gene})
df

| | sirna | target_gene |
|---|---|---|
| 0 | EMPTY | |
| 1 | n337250 | PKHD1 |
| 2 | s1174 | JAG1 |
| 3 | s12279 | RPS6KA3 |
| 4 | s134 | UQCR11 |
| ... | ... | ... |
| 1134 | s706 | PGRMC2 |
| 1135 | s7128 | IDS |
| 1136 | s766 | HEMK1 |
| 1137 | s8645 | MECP2 |


In [8]:
df.to_csv("../data/rxrx1/sirna_to_targets.csv", index=False)

### Test which targets occur multiple times

Now that each siRNA is annotated with its target gene(s), we can check how many siRNAs in the rxrx1 dataset target the same gene.

In [9]:
d = df[df.target_gene.notna()]
d[d["target_gene"].duplicated()]

| | sirna | target_gene |
|---|---|---|
| 50 | s18583 | IQCB1 |


In [10]:
d[d["target_gene"] == "IQCB1"]

| | sirna | target_gene |
|---|---|---|
| 49 | s18582 | IQCB1 |
| 50 | s18583 | IQCB1 |


In [11]:
pcs = meta[meta["well_type"]=="positive_control"]["sirna"].values
print("s18582" in pcs)
print("s18583" in pcs)

True
False


In [12]:
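# Entries for siRNAs with several target genes are split on "_" so that
# every gene is counted separately.
x = []
for s in [i.split("_") for i in d["target_gene"].values]:
    for item in s:
        x.append(item)
x = pd.DataFrame({"target_gene": x})
# Genes that appear more than once in the flattened list
x[x["target_gene"].duplicated()]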

| | target_gene |
|---|---|
| 58 | IQCB1 |


Only IQCB1 is targeted by more than one siRNA, namely s18582 and s18583.<br>
One of them (s18582) is a positive control siRNA; the other (s18583) is not.<br>
No siRNA that targets more than one gene shares a target with any other siRNA in the rxrx1 dataset.
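The two checks above (duplicates among whole entries and duplicates after splitting on `_`) can also be combined into a single pass that reports, for each shared gene, which siRNAs target it. This is a minimal sketch using only the standard library. The function name `sirnas_per_gene` and the multi-gene entry `sX` / `GENEA_GENEB` are hypothetical and serve only as illustration; the other pairs are taken from the tables above.

```python
from collections import defaultdict


def sirnas_per_gene(pairs, sep="_"):
    """Return {gene: [siRNA, ...]} for genes targeted by more than one siRNA."""
    by_gene = defaultdict(list)
    for sirna, target in pairs:
        if target is None:
            continue  # no target gene known (e.g. the EMPTY placeholder)
        # An entry may name several genes joined by `sep`; count each one.
        for gene in target.split(sep):
            by_gene[gene].append(sirna)
    # Keep only genes that are hit by at least two siRNAs.
    return {gene: ids for gene, ids in by_gene.items() if len(ids) > 1}


pairs = [
    ("EMPTY", None),
    ("s1174", "JAG1"),
    ("s18582", "IQCB1"),
    ("s18583", "IQCB1"),
    ("sX", "GENEA_GENEB"),  # hypothetical multi-gene entry, not from rxrx1
]
shared = sirnas_per_gene(pairs)
```

With the notebook's data, the input could be built as `zip(d["sirna"], d["target_gene"])`, since `d` already excludes rows without a target gene.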