### Script to analyze NIS-Seq data for Figure 1E

---

The goal of this script is to compare the percentage of dictionary-matching nuclei sequenced by *NIS-Seq* or [*in-situ sequencing*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886477/). 


The input files for this script were generated using the web-based [*NIS-Seq image analysis suite*](https://jsb-lab.bio/opticalscreening/) and consist of the following files for each cell type:


1. **CellType_NuclearIntensities_normal.txt**
* *tile*: Tile number, according to order of files loaded; counting from 0
* *nucleus*: Number of the nucleus in that tile, as defined by the masks loaded. Cellpose masks start counting at 1.
* *intensity*: For each nucleus, the maximum intensity over all cycles and channels is indicated; intensity is calculated as the integrated pixel intensity across nuclear masks.
    
    
2. **CellType_NuclearIntensities_scrambled.txt**
* *tile*: Tile number, according to order of files loaded; counting from 0
* *nucleus*: Number of the nucleus in that tile, as defined by the masks loaded. Cellpose masks start counting at 1.
* *intensity*: For each nucleus, the maximum intensity over all cycles and channels is indicated; intensity is calculated as the integrated pixel intensity across nuclear masks.
    
    
3. **CellType_NuclearSequences_normal.txt**
* *tile*: Tile number, according to order of files loaded; counting from 0
* *nucleus*: Number of the nucleus in that tile, as defined by the masks loaded. Cellpose masks start counting at 1.
* *x*: x-coordinate (in pixels) of the center of each nuclear mask in the first cycle.
* *y*: y-coordinate (in pixels) of the center of each nuclear mask in the first cycle.
* *sequence*: Library-matched consensus sequence detected in a nucleus.
* *max_intensity*: For each nucleus, the maximum intensity over all cycles and channels is indicated; intensity is calculated as the integrated pixel intensity across nuclear masks.
        
    
4. **CellType_NuclearSequences_scrambled.txt**
* *tile*: Tile number, according to order of files loaded; counting from 0
* *nucleus*: Number of the nucleus in that tile, as defined by the masks loaded. Cellpose masks start counting at 1.
* *x*: x-coordinate (in pixels) of the center of each nuclear mask in the first cycle.
* *y*: y-coordinate (in pixels) of the center of each nuclear mask in the first cycle.
* *sequence*: Library-matched consensus sequence detected in a nucleus.
* *max_intensity*: For each nucleus, the maximum intensity over all cycles and channels is indicated; intensity is calculated as the integrated pixel intensity across nuclear masks.
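
To make the layout described above concrete, here is a minimal sketch that parses a made-up two-row excerpt of a `*_NuclearSequences_*.txt` file. The header names are assumed to match the column names listed above, and all values (coordinates, sequences, intensities) are invented for illustration; it uses only the standard library so it runs without the real data.

```python
import csv
import io

# Hypothetical excerpt of a tab-separated NuclearSequences file (all values invented)
example_tsv = (
    "tile\tnucleus\tx\ty\tsequence\tmax_intensity\n"
    "0\t1\t120\t340\tACGTACGTACGTAC\t15234.0\n"
    "0\t7\t512\t88\tTTGCAACGGTACCA\t9876.5\n"
)

# Parse the tab-separated text into one dictionary per nucleus
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
columns = list(rows[0].keys())
print(columns)

for row in rows:
    # tile counts from 0, nucleus (Cellpose mask label) counts from 1
    print(int(row["tile"]), int(row["nucleus"]), row["sequence"])
```

The NuclearIntensities files would be read the same way, with only the *tile*, *nucleus* and *intensity* columns.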
    
---

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
# Derive celltypes from file names

data_location = 'NIS-Seq_data/'
# Sort the unique prefixes so that the cell types are processed in a reproducible order
celltypes = sorted({x.split('_')[0] for x in os.listdir(data_location) if x[0] != '.'})
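
As an illustration of how the cell type names are derived, the sketch below applies the same prefix rule to a hypothetical directory listing. The file names are assumptions chosen for the example (only `HeLa` appears elsewhere in this notebook); hidden files such as `.DS_Store` are skipped.

```python
# Hypothetical directory listing; the real names depend on the exported data
file_names = [
    "HeLa_NuclearIntensities_normal.txt",
    "HeLa_NuclearSequences_normal.txt",
    "THP1_NuclearIntensities_scrambled.txt",
    ".DS_Store",
]

# Everything before the first underscore is taken as the cell type
celltypes_example = sorted(
    {name.split("_")[0] for name in file_names if not name.startswith(".")}
)
print(celltypes_example)  # ['HeLa', 'THP1']
```

Note that this rule truncates a cell type name that itself contains an underscore, so such files would need to be renamed first.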

In [3]:
# Count total vs. sequenced nuclei per tile

tile_num = 196
tile_counts = {'normal': pd.DataFrame(), 'scrambled': pd.DataFrame()}

# Iterate through celltypes
for celltype in celltypes:
    
    # Load normal NIS-Seq results
    nucs = pd.read_csv(data_location + f'{celltype}_NuclearIntensities_normal.txt', sep = '\t')
    seqs = pd.read_csv(data_location + f'{celltype}_NuclearSequences_normal.txt', sep = '\t')
    
    # Load scrambled NIS-Seq results
    nucs_sc = pd.read_csv(data_location + f'{celltype}_NuclearIntensities_scrambled.txt', sep = '\t')
    seqs_sc = pd.read_csv(data_location + f'{celltype}_NuclearSequences_scrambled.txt', sep = '\t')
    
    # Count total vs. sequenced nuclei in each tile (normal NIS-Seq results)
    counts = pd.DataFrame([(celltype, (nucs.tile == tile).sum(), (seqs.tile == tile).sum())
                           for tile in range(tile_num)])
    tile_counts['normal'] = pd.concat([tile_counts['normal'], counts])
    
    # Count total vs. sequenced nuclei in each tile (scrambled NIS-Seq results)
    counts_sc = pd.DataFrame([(celltype, (nucs_sc.tile == tile).sum(), (seqs_sc.tile == tile).sum())
                              for tile in range(tile_num)])
    tile_counts['scrambled'] = pd.concat([tile_counts['scrambled'], counts_sc])

# Rename columns
tile_counts['normal'] = tile_counts['normal'].rename({0: 'celltype', 1: 'nucs', 2: 'seqs'}, axis=1)
tile_counts['scrambled'] = tile_counts['scrambled'].rename({0: 'celltype', 1: 'nucs', 2: 'seqs'}, axis=1)
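
The per-tile counting above can be illustrated on toy data. The following is a minimal sketch in plain Python rather than the notebook's pandas code; the tile indices and the tile count of 4 are invented. It shows that a tile without any nuclei still gets a row with zero counts, which matters later when the tiles are split by position.

```python
from collections import Counter

tile_num_example = 4  # hypothetical; the notebook uses 196 tiles

# Tile index of every segmented nucleus, and of every nucleus with a
# library-matched sequence (invented values)
nuclei_tiles = [0, 0, 0, 1, 1, 3, 3, 3, 3]
sequenced_tiles = [0, 0, 1, 3, 3, 3]

nuclei_per_tile = Counter(nuclei_tiles)
sequenced_per_tile = Counter(sequenced_tiles)

# A Counter returns 0 for missing keys, so empty tiles (here tile 2) are kept
tile_rows = [
    (tile, nuclei_per_tile[tile], sequenced_per_tile[tile])
    for tile in range(tile_num_example)
]
for row in tile_rows:
    print(row)
```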

In [4]:
# Check example

#tile_counts['normal'].query(f'celltype=="HeLa"').to_csv('Fig1E_HeLa_tiles.tsv', sep='\t')

In [5]:
# Split the imaged area (=tiles) into two halves

dic = {'normal': {}, 'scrambled': {}}

for scrambled in ['normal', 'scrambled']:
    for celltype in celltypes:
        
        df_ = tile_counts[scrambled].query(f'celltype=="{celltype}"')
        total_r1 = df_.iloc[:(tile_num//2)].nucs.sum()
        total_r2 = df_.iloc[(tile_num//2):].nucs.sum()
        sequenced_r1 = df_.iloc[:(tile_num//2)].seqs.sum()
        sequenced_r2 = df_.iloc[(tile_num//2):].seqs.sum()
        #print(scrambled,celltype,total_r1,sequenced_r1,total_r2,sequenced_r2)

        percent_mapped_r1 = sequenced_r1 / total_r1 * 100
        percent_mapped_r2 = sequenced_r2 / total_r2 * 100
        #print(scrambled,celltype,percent_mapped_r1,percent_mapped_r2)

        dic[scrambled][celltype] = (percent_mapped_r1, percent_mapped_r2)

In [6]:
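# Reformat table and save: one row per cell type, two columns per condition
# (one for each half of the imaged area), so the column names repeat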

df = pd.DataFrame(dic['normal']).T.rename({0: 'normal', 1: 'normal'}, axis=1)
df['0'] = pd.DataFrame(dic['scrambled']).T[0]
df['1'] = pd.DataFrame(dic['scrambled']).T[1]
df = df.rename({'0': 'scrambled', '1': 'scrambled'}, axis=1)
df.to_csv('Fig1E_NIS-Seq.tsv', sep='\t')