# 2. Sequence data analysis
Sequence analysis uses the fastq.gz files that are created by the sequencer as a starting point. These are compressed [fastq](https://en.wikipedia.org/wiki/FASTQ_format) text files containing the metadata (name, position, etc.), sequence and quality of each sequenced cluster. 
All these sequences have to be aligned to the input or reference sequence(s), because the reads may contain insertions, deletions or mismatches. If multiple samples are present, alignment also determines which sequence belongs to which sample. The output of sequence alignment is a SAM file, which is needed for the step where the single-molecule and sequencing data are linked.

For our dataset we obtained two fastq files.
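
To make the fastq layout concrete, here is a minimal sketch of reading records from a gzip-compressed fastq file, assuming the common layout of four lines per record (header, sequence, separator, quality). The function name `read_fastq_records` and the in-memory example data are hypothetical and not part of papylio.

```python
import gzip
import io


def read_fastq_records(handle):
    """Yield (name, sequence, quality) tuples from an open fastq text handle.

    Assumes four lines per record, without wrapped sequence lines.
    """
    while True:
        header = handle.readline().rstrip('\n')
        if not header:
            break  # End of file reached
        sequence = handle.readline().rstrip('\n')
        handle.readline()  # Separator line starting with '+'
        quality = handle.readline().rstrip('\n')
        yield header[1:], sequence, quality  # Drop the leading '@'


# Hypothetical two-read example, compressed in memory instead of read from disk
example = b'@read1 info\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n'
with gzip.open(io.BytesIO(gzip.compress(example)), 'rt') as handle:
    records = list(read_fastq_records(handle))

for name, sequence, quality in records:
    print(name, sequence, quality)
```

For a real file, the same function could be used with `gzip.open(path, 'rt')`.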

In [1]:
from pathlib2 import Path
import papylio as pp
import matplotlib.pyplot as plt
import subprocess
import gzip

In [2]:
experiment_path = Path(r'C:\Users\user\Desktop\SPARXS example dataset')

In [3]:
sequencing_data_path = (experiment_path / 'Sequencing data').absolute()
sequencing_data_path

WindowsPath('C:/Users/user/Desktop/SPARXS example dataset/Sequencing data')

## Create reference fasta file

A reference fasta file needs to be created containing the reference sequences. These are the sequences to which the sequencing data will be aligned and should thus contain the (general) sequences that are present in the sequenced sample.
Below, the reference sequences are specified in a dictionary. The name of each reference sequence can be chosen by the user.

In [4]:
reference_sequences = {
    'CalSeq': 'CCAACAATGCCTAGCCGATCCGTAATGCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAATGCCATCTCGTATGCCGTCTTCTGCTTG',
    'HJ_general': 'CCCACCGCTCNNCTCAACTGGGTTTTCCCAGTTGAGNNCTTGCTAGGGTTTTCCCTAGCAAGNNGCTGCTACGGTTTTCCGTAGCAGCNNGAGCGGTGGG'
}

In [5]:
def create_reference_fasta(folder_path, reference_sequences):
    with open(Path(folder_path) / 'Reference.fasta', 'wt') as fasta_file:
        for key, value in reference_sequences.items():
            fasta_file.write('>' + key + '\n' + value + '\n\n')

In [6]:
create_reference_fasta(sequencing_data_path, reference_sequences)
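
Because Bowtie 2 treats `N` as the only degenerate base code (see the aligner settings in this section), it can help to check the reference sequences before writing them. The following is a minimal sketch of such a check; the function `find_unsupported_bases` and the example dictionary are hypothetical.

```python
def find_unsupported_bases(reference_sequences, allowed='ACGTN'):
    """Return {name: [unexpected characters]} for sequences with characters outside `allowed`."""
    problems = {}
    for name, sequence in reference_sequences.items():
        unexpected = sorted(set(sequence.upper()) - set(allowed))
        if unexpected:
            problems[name] = unexpected
    return problems


# Hypothetical example: the second sequence uses the degenerate codes R and Y
example_references = {'ok': 'ACGTNNACGT', 'degenerate': 'ACGRYT'}
problems = find_unsupported_bases(example_references)
print(problems)
```

An empty dictionary means that all sequences contain only `A`, `C`, `G`, `T` and `N`.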

## Merge fastq files

When using or configuring index reads, the sequencer may produce multiple fastq files.
You can combine these into a single file.

In [7]:
def merge_gz_files(input_files, output_file):
    with open(output_file, 'wt') as file_out:
        for input_file in input_files:
            with gzip.open(input_file, 'rt') as file_in:
                for line in file_in:
                    file_out.write(line)

In [8]:
list(sequencing_data_path.glob('*_R1_001.fastq.gz'))

[WindowsPath('C:/Users/user/Desktop/SPARXS example dataset/Sequencing data/Main_S1_L001_R1_001.fastq.gz'),
 WindowsPath('C:/Users/user/Desktop/SPARXS example dataset/Sequencing data/Undetermined_S0_L001_R1_001.fastq.gz')]

Combining two specific files:

In [9]:
merge_gz_files(input_files = [sequencing_data_path / 'Main_S1_L001_R1_001.fastq.gz',
                              sequencing_data_path / 'Undetermined_S0_L001_R1_001.fastq.gz'],
               output_file = sequencing_data_path / 'Read1.fastq')

Or combining all read 1 fastq files in the folder:

In [10]:
# merge_gz_files(sequencing_data_path.glob('*_R1_001.fastq.gz'), sequencing_data_path / 'Read1.fastq')
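
After merging, a quick sanity check is to compare the number of reads in the merged file with the sum over the input files. Below is a minimal sketch, assuming four lines per fastq record; `count_fastq_reads` is a hypothetical helper and the files are small temporary examples rather than the actual dataset.

```python
import gzip
import tempfile
from pathlib import Path


def count_fastq_reads(file_path):
    """Count reads in a plain or gzip-compressed fastq file (four lines per read)."""
    file_path = Path(file_path)
    opener = gzip.open if file_path.suffix == '.gz' else open
    with opener(file_path, 'rt') as handle:
        number_of_lines = sum(1 for _ in handle)
    if number_of_lines % 4 != 0:
        raise ValueError(f'{file_path.name} has {number_of_lines} lines, not a multiple of 4')
    return number_of_lines // 4


with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    record = '@read\nACGT\n+\nIIII\n'
    inputs = [folder / 'a.fastq.gz', folder / 'b.fastq.gz']
    for path, number_of_reads in zip(inputs, (2, 3)):
        with gzip.open(path, 'wt') as handle:
            handle.write(record * number_of_reads)

    # Merge in the same way as merge_gz_files: decompress and concatenate
    merged = folder / 'Read1.fastq'
    with open(merged, 'wt') as file_out:
        for path in inputs:
            with gzip.open(path, 'rt') as file_in:
                file_out.write(file_in.read())

    input_total = sum(count_fastq_reads(path) for path in inputs)
    merged_total = count_fastq_reads(merged)

print(input_total, merged_total)
```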

## Run aligner
Sequence alignment was performed using Bowtie 2, version 2.5.3.

To start using bowtie2, download and unzip the bowtie2 binaries from [Sourceforge](https://sourceforge.net/projects/bowtie-bio/files/bowtie2/), as explained in the [bowtie2 manual](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).

The steps below run several bowtie2 commands that can be run in the terminal or command prompt.

We found that on Windows it was necessary to replace line 118 of the bowtie2-build file with
`if not (os.path.exists(build_bin_spec) | os.path.exists(build_bin_spec + '.exe')):`


In [11]:
# Specify the location of Bowtie2
bowtie2_path = r"C:\Users\user\Desktop\bowtie2-2.5.3-mingw-aarch64"

In [12]:
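# Assuming the use of a conda environment named "papylio".
# Note: the activation only applies to the shell started by this call and does not carry over to later subprocess calls.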
subprocess.run("conda activate papylio".split(' '), shell=True, capture_output=True) 

CompletedProcess(args=['conda', 'activate', 'papylio'], returncode=0, stdout=b'', stderr=b'')

The following line runs bowtie2-build, which builds the index files from Reference.fasta. For more information about the settings, see the [bowtie2 manual](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).

In [13]:
out = subprocess.run(
    [
        'python', bowtie2_path + r'\bowtie2-build', 
        "--quiet",
        "Reference.fasta", 
        "Reference",
    ], 
    cwd=str(sequencing_data_path.absolute()), 
    shell=True, 
    capture_output=True)

print(out.stderr.decode())
print(out.stdout.decode())


C:\Users\user\Desktop\bowtie2-2.5.3-mingw-aarch64\bowtie2-build-s
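
Since `--quiet` suppresses most output, one way to confirm that the index was built is to check for the index files. The sketch below assumes the six `.bt2` files that bowtie2-build writes for a small index (large indexes use the `.bt2l` extension instead); `missing_index_files` is a hypothetical helper, demonstrated on a temporary folder.

```python
import tempfile
from pathlib import Path

# File suffixes of a small Bowtie 2 index
BOWTIE2_INDEX_SUFFIXES = ['.1.bt2', '.2.bt2', '.3.bt2', '.4.bt2', '.rev.1.bt2', '.rev.2.bt2']


def missing_index_files(folder_path, index_name='Reference'):
    """Return the names of expected index files that are not present in the folder."""
    folder_path = Path(folder_path)
    return [index_name + suffix for suffix in BOWTIE2_INDEX_SUFFIXES
            if not (folder_path / (index_name + suffix)).exists()]


# Demonstration with empty placeholder files, leaving one index file out on purpose
with tempfile.TemporaryDirectory() as folder:
    for suffix in BOWTIE2_INDEX_SUFFIXES[:-1]:
        (Path(folder) / ('Reference' + suffix)).touch()
    missing = missing_index_files(folder)

print(missing)
```

For the dataset in this notebook, the call would be `missing_index_files(sequencing_data_path)`; an empty list indicates that all files are present.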



The following line runs the bowtie2 aligner. 

In [14]:
out = subprocess.run(
    [
        bowtie2_path + r'\bowtie2',
        '-x', 'Reference',
        '-U', 'Read1.fastq',
        '-S', 'Alignment.sam',
        '--local',
        '--np', '0',
        '--very-sensitive-local',
        '-L', '7',
        '--n-ceil', 'L,0,1',
        '--threads', '10',
        '--norc',
    ],
    cwd=str(sequencing_data_path.absolute()), 
    shell=True, 
    capture_output=True)

print(out.stderr.decode())
print(out.stdout.decode())

1358397 reads; of these:
  1358397 (100.00%) were unpaired; of these:
    237629 (17.49%) aligned 0 times
    1118683 (82.35%) aligned exactly 1 time
    2085 (0.15%) aligned >1 times
82.51% overall alignment rate
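
The summary that bowtie2 writes to stderr can also be parsed, for example to log the alignment rate per run. This is a minimal sketch based only on the summary format shown above for unpaired reads; `parse_bowtie2_summary` is a hypothetical helper and other bowtie2 modes may print a different layout.

```python
import re


def parse_bowtie2_summary(summary_text):
    """Extract read counts and the overall alignment rate from an unpaired bowtie2 summary."""
    total = int(re.search(r'^(\d+) reads; of these:', summary_text, re.MULTILINE).group(1))
    counts = {}
    for count, label in re.findall(r'^\s*(\d+) \([\d.]+%\) aligned (.+?)$', summary_text, re.MULTILINE):
        counts[label.strip()] = int(count)
    rate = float(re.search(r'([\d.]+)% overall alignment rate', summary_text).group(1))
    return total, counts, rate


# The summary text as printed above
summary_text = """1358397 reads; of these:
  1358397 (100.00%) were unpaired; of these:
    237629 (17.49%) aligned 0 times
    1118683 (82.35%) aligned exactly 1 time
    2085 (0.15%) aligned >1 times
82.51% overall alignment rate
"""

total, counts, rate = parse_bowtie2_summary(summary_text)
print(total, counts, rate)
```

In the notebook, `out.stderr.decode()` could be passed instead of the literal text.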




The `--local` setting allows soft clipping of the ends of the reads. The `--very-sensitive-local` preset may be a good place to start.
The `--norc` setting prevents alignment to the reverse complement of the reference. Depending on the length of the sequence region that is identical among similar samples, the seed length for searching will need to be adjusted using the `-L` setting. If the reference contains `N`s, it is important to set the `--np` and `--n-ceil` options. Of all degenerate base codes, Bowtie 2 currently supports only `N`. In addition, the `--score-min` option may be used to change the threshold for including alignments.
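
Options such as `--n-ceil` and `--score-min` take a function string of the form `type,constant,coefficient`. According to the bowtie2 manual, `L` denotes a linear function, `S` a square-root function and `G` a natural-log function of the read length. The sketch below evaluates such strings to show what `L,0,1` means; `evaluate_bowtie2_function` is a hypothetical helper and not part of bowtie2 or papylio.

```python
import math


def evaluate_bowtie2_function(function_string, read_length):
    """Evaluate a bowtie2 function string such as 'L,0,1' for a given read length."""
    kind, constant, coefficient = function_string.split(',')
    constant, coefficient = float(constant), float(coefficient)
    if kind == 'L':  # Linear: constant + coefficient * x
        return constant + coefficient * read_length
    if kind == 'S':  # Square root: constant + coefficient * sqrt(x)
        return constant + coefficient * math.sqrt(read_length)
    if kind == 'G':  # Natural log: constant + coefficient * ln(x)
        return constant + coefficient * math.log(read_length)
    raise ValueError(f'Function type {kind!r} is not covered by this sketch')


# With '--n-ceil L,0,1' the ceiling on ambiguous positions equals the read length,
# so the N positions do not cause alignments to be filtered out.
ceiling_used_here = evaluate_bowtie2_function('L,0,1', 100)
# For comparison, the default listed in the manual is 'L,0,0.15'
ceiling_default = evaluate_bowtie2_function('L,0,0.15', 100)
print(ceiling_used_here, ceiling_default)
```

Combined with `--np 0`, which sets the penalty for ambiguous positions to zero, this lets reads align across the `N` stretches of the reference without being penalized or discarded.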

For more information about the settings, see the [bowtie2 manual](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).

Bowtie2 produces a file named Alignment.sam in the [SAM file format](https://en.wikipedia.org/wiki/SAM_(file_format)). This file contains all original sequences and, for each sequence, information on how it is best aligned in the form of a [CIGAR string](https://en.wikipedia.org/wiki/Sequence_alignment#Representations).
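
As an illustration of how a CIGAR string encodes an alignment, the following sketch splits it into (length, operation) pairs and computes how many reference and read bases it spans, following the operation definitions of the SAM format. The helper names and the example CIGAR string are hypothetical.

```python
import re

CIGAR_PATTERN = re.compile(r'(\d+)([MIDNSHP=X])')
CONSUMES_REFERENCE = set('MDN=X')  # Operations that advance along the reference
CONSUMES_READ = set('MIS=X')       # Operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    return [(int(length), operation) for length, operation in CIGAR_PATTERN.findall(cigar)]


def cigar_lengths(cigar):
    """Return the number of reference bases and read bases covered by the CIGAR string."""
    operations = parse_cigar(cigar)
    reference_length = sum(length for length, operation in operations if operation in CONSUMES_REFERENCE)
    read_length = sum(length for length, operation in operations if operation in CONSUMES_READ)
    return reference_length, read_length


# Hypothetical example: 5 soft-clipped bases, 20 matches or mismatches,
# a 2-base deletion, 10 matches or mismatches and 3 soft-clipped bases
operations = parse_cigar('5S20M2D10M3S')
reference_length, read_length = cigar_lengths('5S20M2D10M3S')
print(operations, reference_length, read_length)
```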

## Import sequencing data into experiment

Retrieving specific data from the text-based SAM file is relatively slow. Additionally, the SAM file only states how to align the sequence but does not contain the actual aligned sequence. Therefore, we convert the SAM file to a NetCDF file for fast data retrieval and to enable inclusion of the actual aligned sequence.
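
To illustrate what "the actual aligned sequence" means, here is a minimal sketch that applies a CIGAR string to a read so that each remaining character corresponds to one reference position. It is not necessarily how papylio performs this step: insertions and soft-clipped bases are simply dropped, deletions are filled with a gap character, and the result starts at the alignment start position rather than at the start of the reference. The function name and example data are hypothetical.

```python
import re


def align_read_to_reference(read_sequence, cigar, gap_character='-'):
    """Return the read bases placed on reference coordinates according to the CIGAR string."""
    aligned = []
    read_position = 0
    for length, operation in re.findall(r'(\d+)([MIDNSHP=X])', cigar):
        length = int(length)
        if operation in 'M=X':
            # Match or mismatch: copy the read bases
            aligned.append(read_sequence[read_position:read_position + length])
            read_position += length
        elif operation in 'IS':
            # Insertion or soft clip: bases present in the read only, skipped here
            read_position += length
        elif operation in 'DN':
            # Deletion or skipped region: reference positions without read bases
            aligned.append(gap_character * length)
        # H and P advance neither the read nor the reference
    return ''.join(aligned)


# Hypothetical read with 2 soft-clipped bases, one inserted base and one deleted base
aligned_sequence = align_read_to_reference('GGACGTTTACC', '2S4M1I2M1D2M')
print(aligned_sequence)
```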

In [15]:
exp = pp.Experiment(experiment_path)

Import files: 100%|██████████████████████████████████████████████████████████████| 4190/4190 [00:00<00:00, 5935.58it/s]



File(Single-molecule data - bead slide\Bead slide TIRF 561 001) used as mapping

Initialize experiment: 
C:\Users\user\Desktop\SPARXS example dataset


In [16]:
aligned_sam_filepath = sequencing_data_path.joinpath('Alignment.sam')
index1_fastq_filepath = None # If the index1 was sequenced as well, specify the path to the fastq file to import it into the sequencing data.
extract_sequence_subset = [10, 11, 36, 37, 62, 63, 88, 89] # Positions in the sequence to be separately extracted for easy lookup

exp.import_sequencing_data(aligned_sam_filepath, index1_file_path=index1_fastq_filepath, remove_duplicates=True,
                           add_aligned_sequence=True, extract_sequence_subset=extract_sequence_subset)

Determine number of primary alignments: 1358401it [00:04, 272598.57it/s]
Parse sam file: 100%|████████████████████████████████████████████████████████████████| 136/136 [01:35<00:00,  1.43it/s]


Arguments:
- `remove_duplicates`: Depending on the bowtie2 configuration, the SAM file can contain multiple alignments for a single sequence. These can cause problems later on in the analysis. If set to `True` (recommended), duplicate alignments will be removed and only the best alignment will be kept.
- `add_aligned_sequence`: If set to `True` (recommended), the aligned sequence will be added to the dataset.
- `extract_sequence_subset`: A subset of the aligned sequence can be added as a separate entry in the dataset, which is useful to easily distinguish the sequences. The variable can be set to the indexes in the sequence to be used.
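
The positions passed to `extract_sequence_subset` above coincide with the zero-based indexes of the `N` positions in the `HJ_general` reference. The sketch below derives them from the reference instead of typing them by hand and shows what extracting such a subset from an aligned sequence looks like; the example aligned sequence is hypothetical and the extraction is an illustration, not papylio's internal code.

```python
hj_general = ('CCCACCGCTCNNCTCAACTGGGTTTTCCCAGTTGAGNNCTTGCTAGGGTTTTCCCTAGCAAG'
              'NNGCTGCTACGGTTTTCCGTAGCAGCNNGAGCGGTGGG')

# Zero-based indexes of all N positions in the reference
n_positions = [index for index, base in enumerate(hj_general) if base == 'N']
print(n_positions)

# Hypothetical aligned sequence: the reference with the N positions filled in
example_aligned = list(hj_general)
for position, base in zip(n_positions, 'ACGTACGT'):
    example_aligned[position] = base
example_aligned = ''.join(example_aligned)

# The subset contains only the variable bases, which is enough to tell the sequences apart
subset = ''.join(example_aligned[position] for position in n_positions)
print(subset)
```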