# Non-Natural Aptamer Array (N2A2) Data Processing

Soh Lab, Stanford University

*Last updated August 2019*

## 1. Introduction

Data from the N2A2 has three primary components:
1. FASTQ (sequence, x_f, y_f, quality score) for each sequencing index
2. locs (x_l, y_l) for each cluster on each tile
3. cifs (intensities [integer]) for each cluster on each tile and cycle

The locs and cifs both describe the same clusters on each tile, so the primary goal is to associate each cluster with the appropriate FASTQ file via the shared x,y locations.

*Note: The fastq (x_f, y_f) differ from the locs (x_l, y_l) by a rounding scheme, so the locs coordinates have to be converted with the appropriate formula before matching*
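
To make the conversion concrete, here is a minimal sketch of the matching step. It assumes the commonly cited Illumina convention `round(10 * value + 1000)` for turning locs coordinates into FASTQ coordinates. That formula, the tie-breaking rule, and the names `locs_to_fastq_coords` and `cluster_idx` are assumptions for illustration; the version in `n2a2_utils.py` should be treated as authoritative. All data values are made up.

```python
import numpy as np
import pandas as pd


def locs_to_fastq_coords(xy_locs):
    """Convert floating-point locs coordinates to integer FASTQ-style coordinates.

    Assumption: FASTQ coordinate = round(10 * locs coordinate + 1000).
    """
    xy_locs = np.asarray(xy_locs, dtype=float)
    # floor(v + 0.5) rounds halves up; the exact tie-breaking rule is an assumption
    return np.floor(10.0 * xy_locs + 1000.0 + 0.5).astype(int)


# Toy locs data for one tile (row order = cluster order in the locs/cif files)
locs = pd.DataFrame({"x_l": [12.34, 250.06, 1800.91],
                     "y_l": [5.12, 99.98, 2047.33]})
xy = locs_to_fastq_coords(locs[["x_l", "y_l"]].to_numpy())
locs["x"] = xy[:, 0]
locs["y"] = xy[:, 1]
locs["cluster_idx"] = np.arange(len(locs))

# Toy FASTQ records for the same tile (only two of the three clusters were read)
reads = pd.DataFrame({"seq": ["ACGT", "TTGA"],
                      "x": [1123, 19009],
                      "y": [1051, 21473]})

# An inner join on the shared integer coordinates links each read to its cluster
matched = reads.merge(locs[["x", "y", "cluster_idx"]], on=["x", "y"], how="inner")
print(matched)
```

The `cluster_idx` column can then be used to look up the intensity of each matched read in the cif data for that tile.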

## 2. Processing Overview

The data is processed in this order:
1. Data from the initial run folder is first separated into three folders (fastq, locs, cifs) inside a primary run folder (<run_path>)
2. FASTQ data (seq, x_f, y_f) is extracted from the gzipped (.gz) fastq files and split by tile into .csv files (seq,x,y) under a sub-directory (directory name: <fastq_name>_tile_data)
3. Sequence-intensity data is generated for each fastq and exported as .csv files in a new child directory (<run_path>/intensities), one file per fastq and channel (A,T,C,G).
 * The names of the files are <fastq_name>_<processing_tag>_<channel_tag>.csv
 * Data is formatted to have seq,x,y,int_1,int_2,...,int_n for the n cycles
 * Filtering by sequence can be performed in this step to remove non-compliant sequences under that same sequencing index (if so, the processing tag will be 'filt')
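
Step 2 can be sketched roughly as follows. This is not the code in `n2a2_utils.py`: the function name `split_fastq_by_tile` and the per-tile file name `<tile>.csv` are hypothetical, and the sketch relies only on the standard Illumina FASTQ header layout (`@instrument:run:flowcell:lane:tile:x:y`). The demo read names are made up.

```python
import csv
import gzip
import os
import tempfile
from collections import defaultdict


def split_fastq_by_tile(fastq_gz_path, out_dir):
    """Write one <tile>.csv (seq,x,y) per tile found in a gzipped FASTQ file."""
    records = defaultdict(list)
    with gzip.open(fastq_gz_path, "rt") as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break  # end of file
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality string (unused in this sketch)
            # Header layout: @instrument:run:flowcell:lane:tile:x:y <read info>
            fields = header.split(" ")[0].split(":")
            tile, x, y = fields[4], int(fields[5]), int(fields[6])
            records[tile].append((seq, x, y))
    os.makedirs(out_dir, exist_ok=True)
    for tile, rows in records.items():
        with open(os.path.join(out_dir, f"{tile}.csv"), "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["seq", "x", "y"])
            writer.writerows(rows)
    return sorted(records)


# Demo on a two-read FASTQ written to a temporary directory
demo_fastq = (
    "@M00001:1:000000000-ABCDE:1:1101:15000:1200 1:N:0:1\nACGT\n+\nIIII\n"
    "@M00001:1:000000000-ABCDE:1:2119:9000:2500 1:N:0:1\nTTGA\n+\nIIII\n"
)
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo_S1_L001_R1_001.fastq.gz")
    with gzip.open(gz_path, "wt") as handle:
        handle.write(demo_fastq)
    tile_dir = os.path.join(tmp, "demo_S1_tile_data")
    tiles = split_fastq_by_tile(gz_path, tile_dir)
    with open(os.path.join(tile_dir, "1101.csv")) as handle:
        first_tile_rows = handle.read().splitlines()
```

The sketch holds every record in memory before writing, which keeps it short; for large FASTQ files, writing each tile as soon as the tile number changes would use less memory.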

Further processing (for example, removing faulty tiles) can be performed afterwards.

## 3. Usage Instructions

Cells in the notebook should be run sequentially unless specified otherwise. Support functions are included as an 'n2a2_utils.py' file in the same directory, so please check the code resources or contact the author if you need the supporting functions file.

Make sure Python 3.x is installed, along with the libraries imported in the setup cell (numpy, matplotlib, pandas, scipy).

## 4. Google Colaboratory (Optional)

If processing data using Google Colaboratory, first make a copy of this notebook and the support file ('n2a2_utils.py'), then mount your Google Drive (run the appropriate cells below).


## 5. Running the Notebook!

### Connect to Google Drive

In [1]:
# Mount Google Drive and access via your credentials
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Setup (run these first)

In [2]:
# Import the libraries to be used
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
import os
import sys
import scipy.optimize

In [3]:
# Import the functions used to process
sys.path.insert(1, 'drive/Shared drives/imager_v1/04_processing_code/20181108_NNinsulinF')
from n2a2_utils import *

### Edit the run specific details

In [4]:
# Use the full path to the top directory containing the three subfolders (fastq,locs,cifs)
run_path='drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF'

# Default starting cycle is 93
cycle_start=93

# Custom names for the cifs (usually descriptive of what each cycle contains)
# These names will also be used to name the cycles in the exported files
cycle_names=['FM',
             'ins_1_uM_ser_0',
             'ins_10_uM_ser_0',
             'ins_25_uM_ser_0',
             'FM',
             'ins_1_uM_ser_1',
             'ins_10_uM_ser_1',
             'ins_25_uM_ser_1',
             'FM']

# Define the fastq names up to the S# mark in a list
fastq_list=['FM_S1','insS1_S2','insR2_S3','tyroapt_S4','ksl2b_S5']

In [5]:
# Rename the cycle directories in cifs using the custom cycle names
rename_cycle_directories(run_path,cycle_start,cycle_names)

Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C93.1_FM
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C94.1_ins_1_uM_ser_0
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C95.1_ins_10_uM_ser_0
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C96.1_ins_25_uM_ser_0
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C97.1_FM
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C98.1_ins_1_uM_ser_1
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C99.1_ins_10_uM_ser_1
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C100.1_ins_25_uM_ser_1
Renamed to: drive/Shared drives/imager_v1/03_imager_runs/20181108_NNinsulinF/cifs/C101.1_FM
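
Judging from the output above, each cycle folder ends up named `C<cycle>.1_<name>`, with cycles numbered consecutively from `cycle_start`. The following sketch predicts those names without touching the disk, which can help catch a wrong `cycle_start` or a missing entry in `cycle_names` before renaming. The helper `expected_cycle_dirs` is hypothetical and is not part of `n2a2_utils.py`.

```python
def expected_cycle_dirs(cycle_start, cycle_names):
    """Return the folder names implied by the renaming pattern C<cycle>.1_<name>."""
    return [f"C{cycle_start + i}.1_{name}" for i, name in enumerate(cycle_names)]


# Same values as in the run-specific cell
demo_names = ['FM',
              'ins_1_uM_ser_0', 'ins_10_uM_ser_0', 'ins_25_uM_ser_0',
              'FM',
              'ins_1_uM_ser_1', 'ins_10_uM_ser_1', 'ins_25_uM_ser_1',
              'FM']
predicted = expected_cycle_dirs(93, demo_names)
for folder in predicted:
    print(folder)
```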


In [6]:
# Parse data
for fastq_name in fastq_list:
    fastq_separate_extract(run_path,fastq_name)

Working on FASTQ file: FM_S1_L001_R1_001
 - Current Tile: 2119
 - Completed after 6.73 seconds
Working on FASTQ file: insS1_S2_L001_R1_001
 - Current Tile: 2119
 - Completed after 0.58 seconds
Working on FASTQ file: insR2_S3_L001_R1_001
 - Current Tile: 2119
 - Completed after 61.23 seconds
Working on FASTQ file: tyroapt_S4_L001_R1_001
 - Current Tile: 2119
 - Completed after 3.16 seconds
Working on FASTQ file: ksl2b_S5_L001_R1_001
 - Current Tile: 2119
 - Completed after 5.48 seconds


### Filtering by sequence (optional)

To pre-filter the sequences for QC and reduce the final file sizes, use the appropriate notations as defined here:
* `regex_formats` (list of options)
 * `'single'` : For a single sequence (e.g. a control sequence or fiducial mark)
 * `'primers'` : For a variable region flanked by constant regions (i.e. FP, RP)
 * `'none'` : Skip filtering for this fastq
* `regex_seqs` (list of regex formats)
 * `'single'` : Use the sequence to be filtered (e.g. `'TCGATGCAGTACTGCGTAGCTA'`)
 * `'primers'` : `['<FP>','<RP>']` for the flanking constant regions (*Note: depending on the read length, parts of the sequence may be cut off*)
 * `'none'` : `'none'`
* `seq_lengths` (list of lengths) : Use the tolerated lengths for the variable regions (the `'primers'` option)
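
As an illustration of how these three options could translate into regular expressions, here is a minimal sketch. The function `build_filter_regex` and the exact patterns are assumptions; the patterns that `n2a2_utils.py` builds internally may differ. The toy primers and reads are made up.

```python
import re


def build_filter_regex(fmt, seqs, lengths=None):
    """Return a compiled pattern for one fastq, or None when filtering is skipped."""
    if fmt == "single":
        # Keep reads that contain the exact sequence
        return re.compile(re.escape(seqs))
    if fmt == "primers":
        fp, rp = seqs
        lo, hi = lengths
        # Capture a variable region of lo..hi bases between the two constant regions
        return re.compile(f"{fp}([ACGTN]{{{lo},{hi}}}){rp}")
    if fmt == "none":
        return None
    raise ValueError(f"Unknown regex format: {fmt!r}")


# Toy primers with a tolerated variable-region length of 3 to 5 bases
pattern = build_filter_regex("primers", ["AAGG", "CCTT"], [3, 5])

good = pattern.search("AAGG" + "ACGT" + "CCTT")   # variable region of 4 bases
short = pattern.search("AAGG" + "AC" + "CCTT")    # variable region of 2 bases

print(good.group(1))   # variable region of the accepted read
print(short)           # None, because the region is shorter than tolerated
```

Capturing the variable region in a group makes it easy to export only that region if the flanking primers are not needed downstream.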

In [7]:
# Set to True to filter; otherwise set to False and ignore the rest of this cell
filter_sequences=True

# Edit the contents below as necessary
if filter_sequences:
    # Need a format for each fastq (see instructions above)
    # Note: These lists have to be same lengths as the number of fastq files
    regex_formats=['single','primers','primers','single','single']

    FM_seq='ACCGACGGAACGCCAAAGAAACGCAAGG'
    ksl2b_seq='AGCAGCACAGAGGTCAGATGCAATTGGGCCCGTCCGTATGGTGGGTCCTATGCGTGCTACCGTGAA'
    tyroapt_seq='TGGAGCTTGGATTGATGTGGTGTGTGAGTGCGGTGCCC'
    FP,RP='GCGCATACCAGCTTATTCAATT','GCCGAGATTGCACTTACTATCT'
    RP_short='ACTTACTATCT'

    regex_seqs=[FM_seq,[FP,RP_short],[FP,RP_short],tyroapt_seq,ksl2b_seq]
    # Example of sequence lengths for a random region of 40 bases with a tolerated two-base difference
    rand_region_len=40
    seq_len_tol=2
    seq_lengths_rand=[rand_region_len-seq_len_tol,rand_region_len+seq_len_tol]

    # Definition of lengths
    seq_lengths=[[],seq_lengths_rand,seq_lengths_rand,[],[]]

    # Package into one variable
    regex_input=(regex_formats,regex_seqs,seq_lengths)
else:
    regex_input=None
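
Because the three lists must line up with `fastq_list` entry by entry, a quick consistency check before the long processing step can save a failed run. The helper below is a hypothetical sketch and is not part of `n2a2_utils.py`; it assumes `regex_input` is either `None` or the `(regex_formats, regex_seqs, seq_lengths)` tuple built above.

```python
def check_regex_input(fastq_list, regex_input):
    """Raise ValueError if the filter lists do not line up with fastq_list."""
    if regex_input is None:
        return  # filtering disabled, nothing to check
    regex_formats, regex_seqs, seq_lengths = regex_input
    named_lists = [("regex_formats", regex_formats),
                   ("regex_seqs", regex_seqs),
                   ("seq_lengths", seq_lengths)]
    for name, values in named_lists:
        if len(values) != len(fastq_list):
            raise ValueError(
                f"{name} has {len(values)} entries, expected {len(fastq_list)}")
    for fastq_name, fmt, lengths in zip(fastq_list, regex_formats, seq_lengths):
        # The 'primers' format needs a [min, max] pair for the variable region
        if fmt == "primers" and len(lengths) != 2:
            raise ValueError(f"{fastq_name}: 'primers' needs [min_len, max_len]")


# Toy example with two fastq files (sequences are made up)
demo_fastqs = ["ctrl_S1", "lib_S2"]
demo_input = (["single", "primers"],
              ["ACGTACGT", ["AAGG", "CCTT"]],
              [[], [38, 42]])
check_regex_input(demo_fastqs, demo_input)  # passes silently
```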

### Connect and write out the sequence-intensity data!

In [8]:
tile_list=np.concatenate((np.arange(1101,1119+1),np.arange(2101,2119+1)))
cycle_nums=np.arange(cycle_start,cycle_start+len(cycle_names))
cycle_list=retrieve_cif_names(run_path,cycle_nums)

In [9]:
write_fastq_intensities(cycle_list,tile_list,run_path,fastq_list,regex_input,filter_output=filter_sequences)

Working on tile: 1101
 - locs imported
 - intensities imported
 - files written
 - time: 21.19284725189209
Working on tile: 1102
 - locs imported
 - intensities imported
 - files written
 - time: 24.630675315856934
Working on tile: 1103
 - locs imported
 - intensities imported
 - files written
 - time: 27.05755352973938
Working on tile: 1104
 - locs imported
 - intensities imported
 - files written
 - time: 24.26839852333069
Working on tile: 1105
 - locs imported
 - intensities imported
 - files written
 - time: 24.600151300430298
Working on tile: 1106
 - locs imported
 - intensities imported
 - files written
 - time: 22.982799768447876
Working on tile: 1107
 - locs imported
 - intensities imported
 - files written
 - time: 25.09532403945923
Working on tile: 1108
 - locs imported
 - intensities imported
 - files written
 - time: 24.85816717147827
Working on tile: 1109
 - locs imported
 - intensities imported
 - files written
 - time: 25.131545782089233
Working on tile: 1110
 - locs imp