<a href="https://colab.research.google.com/github/wjoonkim/tutorials/blob/main/behi5003_fall2025/aav_illumina_seq_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BEHI 5003 Fall 2025
## Tutorial: DNA sequencing data processing and analysis
### Author: Won Joon Kim
### Affiliation: Biomolecular Engineering Lab (BEL), CBE, HKUST

I acknowledge the use of generative AI (Gemini Agent in Colab) in writing the code below.

Only the raw input data and outline were adapted from
the Jupyter notebook authored by Mr. Mingyi Sun (BEL, CBE, HKUST)
during the Fall 2024 offering of BIEN 6930B (this course).

I sincerely thank the authors and maintainers of the packages used in this tutorial!

# 1. Set up the Colab environment for data processing and analysis

In [None]:
# Install Biopython for sequence analysis in Python
try:
    import google.colab
    # Running on Google Colab, so install Biopython first
    !pip install biopython
except ImportError:
    pass

In [None]:
# Install conda for command-line tools
!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
import os
import sys

from urllib.request import urlretrieve

import Bio

print("Python version:", sys.version_info)
print("Biopython version:", Bio.__version__)

In [None]:
# Install the cutadapt package from bioconda
!conda install -c bioconda cutadapt -y -q

In [None]:
# read basic info about the cutadapt package
!cutadapt

# 2. Explore what the raw Illumina_demo_pre.fastq file looks like

Where is the raw data stored (i.e., the file path)?

In [None]:
# list everything in the current directory in a long, human-interpretable format
!ls -lSh

Lines of code that start with an exclamation mark (i.e., `!`) are `bash` commands in Google Colab. Everything that comes after the `!` is sent to the "terminal" (where you can talk to your computer more directly) instead of being run by the Python interpreter.

This is exactly what a FASTQ file looks like:

In [None]:
# print out the first 8 lines of the Illumina_demo_pre.fastq file
!head -n 8 /content/Illumina_demo_pre.fastq

A FASTQ file is nothing but a text file containing sequencing reads (here, DNA) together with their per-base quality scores!

Each read from the sequencer occupies 4 consecutive lines. The most important lines are:
- the second, which contains the DNA sequence (where you see `A`, `T`, `G`, `C`, and sometimes `N`), and
- the fourth, which contains the Phred quality scores for each base call (where you see `F`, `#`, `:`, `,`, etc.).
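
To make the four-line layout concrete, here is a minimal sketch that splits one record by hand and decodes its quality string. It assumes the Phred+33 encoding (each quality character's ASCII code minus 33), which is what recent Illumina instruments write. The record below is made up for illustration and is not taken from `Illumina_demo_pre.fastq`.

```python
# A made-up FASTQ record (hypothetical read, for illustration only)
record_text = """@read_1
ACGTN
+
FF:,#
"""

# Each record is exactly four lines: header, sequence, separator, qualities
header, sequence, separator, quality = record_text.splitlines()

# Assuming Phred+33 encoding: score = ASCII code of the character - 33
phred_scores = [ord(char) - 33 for char in quality]

print(header)        # read identifier, starts with "@"
print(sequence)      # base calls
print(phred_scores)  # one score per base: [37, 37, 25, 11, 2]
```

Biopython performs the same decoding for you when you read `letter_annotations["phred_quality"]` from a record parsed with the `"fastq"` format.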

You may refer to:
- this [link](https://en.wikipedia.org/wiki/FASTQ_format) for more information on the FASTQ file format, and
- this [link](https://en.wikipedia.org/wiki/Phred_quality_score) for more information on Phred quality scores.

A word of caution! Hardcoding "absolute paths" (i.e., paths that start with `/`, e.g., `/content/Illumina_demo_pre.fastq`) is not a good habit, because the code breaks as soon as the file lives somewhere else.

The cells below refer to files via "relative paths" (i.e., paths that do not start with `/`), which is a better habit but requires you to be extra careful when organizing your files.
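
As a small illustration of the difference, the sketch below uses Python's standard `pathlib` module to compare a relative path with the absolute path it resolves to. The file name is the one used in this tutorial. The resolved location depends on your current working directory (on Colab this is usually `/content`, but treat that as an assumption about the runtime).

```python
from pathlib import Path

# Relative path: interpreted from the current working directory
relative_path = Path("Illumina_demo_pre.fastq")

# Absolute path that the relative path currently points to
absolute_path = relative_path.resolve()

print("Current working directory:", Path.cwd())
print("Relative path:", relative_path, "| is absolute?", relative_path.is_absolute())
print("Resolves to:", absolute_path, "| is absolute?", absolute_path.is_absolute())
print("File exists?", relative_path.exists())
```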

How many reads are there?

`grep -c` is a handy `bash` command for quickly counting how many lines in a text(-based) file (like sequencing data) match a pattern!

In [None]:
# count all the lines in Illumina_demo_pre.fastq that start with "@"
# (note: a quality line can also start with "@", so this may slightly overcount reads)
!grep -c '^@' Illumina_demo_pre.fastq
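
One caveat with counting header lines: `@` is also a valid quality character (it encodes Phred 31 under Phred+33), so a quality line may begin with `@` and inflate the count. A more robust sketch, assuming every record takes exactly four lines (no wrapped sequences), divides the total number of lines by four. The helper name `count_fastq_reads` and the tiny demo file are hypothetical.

```python
import os
import tempfile


def count_fastq_reads(path):
    """Count reads assuming each record spans exactly four lines."""
    with open(path, "r") as handle:
        num_lines = sum(1 for _ in handle)
    if num_lines % 4 != 0:
        raise ValueError(f"{path} has {num_lines} lines, which is not a multiple of 4.")
    return num_lines // 4


# Demo on a tiny made-up file whose second quality line starts with "@"
demo_text = "@read_1\nACGT\n+\nFFFF\n@read_2\nACGT\n+\n@FFF\n"
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

num_header_like_lines = sum(line.startswith("@") for line in demo_text.splitlines())
num_reads = count_fastq_reads(demo_path)
os.remove(demo_path)

# 3 lines start with "@", but the file holds only 2 reads
print(num_header_like_lines, num_reads)
```

To apply it to the tutorial data, call `count_fastq_reads("Illumina_demo_pre.fastq")` and compare the result with the `grep` count.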

Let's import it into Python to see the base call quality in numbers!

In [None]:
from Bio import SeqIO

fastq_file = 'Illumina_demo_pre.fastq'

try:
    # Open the fastq file and read the first record
    with open(fastq_file, "r") as handle:
        first_read = next(SeqIO.parse(handle, "fastq"))
        # Illumina_demo_pre_seqrecordlist = list(SeqIO.parse(handle, "fastq"))

    # Print the first read
    print(first_read.id)
    print(first_read.seq)
    print(first_read.letter_annotations["phred_quality"])

except FileNotFoundError:
    print(f"Error: The file {fastq_file} was not found.")
except StopIteration:
    print(f"Error: The file {fastq_file} is empty or does not contain valid fastq records.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# line plot of phred qualities in the first read
# Assuming 'first_read' object exists from a previous cell
if 'first_read' in locals():
    phred_qualities = first_read.letter_annotations["phred_quality"]

    sns.lineplot(x=range(len(phred_qualities)), y=phred_qualities)
    plt.title("Phred Quality Scores Across Positions in the First Read")
    plt.xlabel("Position in Read")
    plt.ylabel("Phred Quality Score")
    plt.ylim(0, 42) # Phred scores typically range from 0 to 41
    plt.grid(True, alpha=0.5)
    plt.show()
else:
    print("Error: 'first_read' object not found. Please run the cell to import the first read first.")

Let's check how long the first raw read is!

In [None]:
# number of bases in the first read ('first_read' comes from an earlier cell)
len(first_read.seq)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO

fastq_file = 'Illumina_demo_pre.fastq'
sequence_lengths = []

try:
    # Read all sequence lengths from the fastq file
    with open(fastq_file, "r") as handle:
        for record in SeqIO.parse(handle, "fastq"):
            sequence_lengths.append(len(record.seq))

    # Plot a histogram of the sequence lengths
    sns.histplot(sequence_lengths, bins=50, kde=False, edgecolor=None)
    plt.title('Distribution of Sequence Lengths in Raw FASTQ File')
    plt.xlabel('Sequence Length (bp)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.5)
    plt.show()

except FileNotFoundError:
    print(f"Error: The file {fastq_file} was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

**Question 1: What can you say about the library length?**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO

all_quals = []

try:
    # Read all quality scores from the fastq file
    with open(fastq_file, "r") as handle:
        for record in SeqIO.parse(handle, "fastq"):
            all_quals.extend(record.letter_annotations['phred_quality'])

    # Plot a histogram of the quality scores
    sns.histplot(all_quals, bins=100, kde=False, edgecolor=None)
    plt.title("Distribution of Phred Quality Scores in Raw FASTQ File")
    plt.xlabel("Phred Quality Score")
    plt.ylabel("Frequency")
    plt.grid(True, alpha=0.5)
    plt.show()

except FileNotFoundError:
    print(f"Error: The file {fastq_file} was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

**Question 2: What can you say about the sequencing quality?**

$ \text{Phred quality score } Q = -10\log_{10}(p_{\text{error}}) $
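
Rearranging the formula gives $p_{\text{error}} = 10^{-Q/10}$. The short sketch below applies it to a few round scores so you can attach an error rate to the numbers on the histogram. The helper name `phred_to_error_probability` is made up for this example.

```python
def phred_to_error_probability(q):
    """Invert Q = -10 * log10(p_error) to get p_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p_error = phred_to_error_probability(q)
    print(f"Q = {q}: p_error = {p_error:.4f} (about 1 error in {round(1 / p_error)} bases)")
```

Every increase of 10 in Q corresponds to a tenfold drop in the probability that the base call is wrong.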

# 3. Remove "adapter" sequences from the data
Here, "adapter" denotes the constant sequences flanking the desired 7-mer insertion library.

In [None]:
import os
# check the number of cores in the Colab runtime
num_cores = os.cpu_count()
print(f"Number of CPU cores available: {num_cores}")

In [None]:
# (DO NOT CHANGE) set up file paths
raw_input_file = "Illumina_demo_pre.fastq"
untrimmed_output_file = "Illumina_demo_pre_untrimmed.fastq" # discard
lt21_output_file = "Illumina_demo_pre_lt21.fastq" # discard
gt21_output_file = "Illumina_demo_pre_gt21.fastq" # discard
len21_output_file = "Illumina_demo_pre_len21.fastq" # keep

**Question 3: What are the adapter sequences and library length?**

Hint: You can find the necessary information to put in the prompt boxes on the right from today's slides!

In [None]:
#@markdown - define adapter sequences
fiveprime_adapter_seq = "" #@param {type:"string"}
threeprime_adapter_seq = "" #@param {type:"string"}

#@markdown - define length of library (in base pairs)
library_len = 0 #@param {type:"raw"}
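
Before running the trimming step, it can help to sanity-check what you typed into the form, since an empty adapter or a zero length tends to produce a confusing error later. The sketch below is one possible check, not part of the original workflow. The function name `check_cutadapt_inputs` is hypothetical, and restricting adapters to A, C, G, T, and N is a simplifying assumption. The example values are invented and are not the answers to Question 3.

```python
def check_cutadapt_inputs(five_prime, three_prime, library_len):
    """Return a list of problems found in the form inputs (empty list means OK)."""
    problems = []
    for name, seq in (("5' adapter", five_prime), ("3' adapter", three_prime)):
        if not seq:
            problems.append(f"{name} is empty.")
        elif set(seq.upper()) - set("ACGTN"):
            problems.append(f"{name} contains characters other than A, C, G, T, N.")
    if not isinstance(library_len, int) or library_len <= 0:
        problems.append("Library length must be a positive integer.")
    return problems


# Hypothetical values, not the answers to Question 3
bad_result = check_cutadapt_inputs("", "ACGTXX", 0)
good_result = check_cutadapt_inputs("ACGTACGT", "TTGGCCAA", 12)
print(bad_result)
print(good_result)
```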

In [None]:
import subprocess

# (DO NOT CHANGE) Define the cutadapt command and arguments
command = [
    "cutadapt",
    "--cores", f"{num_cores}", # Using all available cores
    "--adapter", f"{fiveprime_adapter_seq}...{threeprime_adapter_seq}",
    "--untrimmed-output", untrimmed_output_file,
    "--minimum-length", f"{library_len}",
    "--too-short-output", lt21_output_file,
    "--maximum-length", f"{library_len}",
    "--too-long-output", gt21_output_file,
    "--output", len21_output_file,
    raw_input_file,
]

# (DO NOT CHANGE) Execute the command
try:
    subprocess.run(command, check=True)
    print("Cutadapt command executed successfully.")
except subprocess.CalledProcessError as e:
    print(f"Error executing cutadapt command: {e}")
except FileNotFoundError:
    print("Error: cutadapt command not found. Make sure cutadapt is installed and in your PATH.")
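
To build intuition for what the command above asks cutadapt to do, here is a deliberately simplified pure-Python sketch: it looks for an exact copy of a 5' flank and then a 3' flank, keeps what lies between them, and sorts the result by length. Real cutadapt tolerates mismatches and indels, and its rules for when each flank is required may differ from this sketch, so this is not a substitute for it. The function name, the flank sequences, and the reads are all invented for illustration.

```python
def extract_insert(read, five_prime, three_prime):
    """Return the bases between exact copies of the two flanks, or None if either is missing."""
    start = read.find(five_prime)
    if start == -1:
        return None
    start += len(five_prime)
    end = read.find(three_prime, start)
    if end == -1:
        return None
    return read[start:end]


# Hypothetical flanks and reads (not the real adapters from the slides)
five_prime, three_prime, expected_len = "AAGG", "CCTT", 6
reads = ["TTAAGGACGTACCCTTGG", "AAGGACGCCTT", "AAGGACGTACGTCCTT", "ACGTACGTACGT"]

bins = {"kept": [], "too_short": [], "too_long": [], "untrimmed": []}
for read in reads:
    insert = extract_insert(read, five_prime, three_prime)
    if insert is None:
        bins["untrimmed"].append(read)    # flanks not found: read left as-is
    elif len(insert) < expected_len:
        bins["too_short"].append(insert)
    elif len(insert) > expected_len:
        bins["too_long"].append(insert)
    else:
        bins["kept"].append(insert)       # exactly the expected length

print(bins)
```

The four bins mirror the four output files defined earlier: the length-matched reads, the too-short reads, the too-long reads, and the untrimmed reads.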

# 4. Check the desired output file

**Question 4: What is the desired output FASTQ file, and how do the first 4 lines look?**

Hint: You can use a previous `bash` command (i.e., one that starts with `!`) on the newly created file!

In [None]:
# replace this comment (starting from "#") with your code here

**Question 5: How many reads are left after keeping only the desired lengths?**

In [None]:
# your code here

# 5. Convert into amino acid residues

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
import pandas as pd
from tqdm import tqdm

def count_unique_dna_seq(fastq_file):
    """
    Reads DNA sequences from a fastq file, counts the occurrences of each unique sequence,
    and stores the results in a pandas DataFrame.

    Parameters:
    fastq_file (str): Path to the input fastq file.

    Returns:
    pd.DataFrame: DataFrame containing unique sequences and their counts.
    """
    sequence_counts = {}

    # Read sequences from the fastq file with tqdm progress bar
    for record in tqdm(SeqIO.parse(fastq_file, "fastq"), desc="Parsed: ", unit=" sequences", dynamic_ncols=True):
        sequence = str(record.seq)
        if sequence in sequence_counts:
            sequence_counts[sequence] += 1
        else:
            sequence_counts[sequence] = 1

    # Convert the dictionary to a pandas DataFrame
    df = pd.DataFrame(list(sequence_counts.items()), columns=['sequence', 'count'])
    df['length'] = df['sequence'].apply(len)
    df['amino_acid'] = df['sequence'].apply(lambda seq: str(Seq(seq).translate()))

    return df

def count_unique_aa_seq(df):
    """
    Counts the occurrences of each unique amino acid sequence in the DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame containing DNA sequences, their counts, lengths, and translations.

    Returns:
    pd.DataFrame: One row per unique amino acid sequence, with 'count' (total number of reads)
        and 'occurrence' (number of distinct DNA sequences that encode it).
    """
    # Group by amino_acid and aggregate the sum of counts and the occurrence
    amino_acid_counts_df = df.groupby('amino_acid').agg(
        count=('count', 'sum'),
        occurrence=('amino_acid', 'count')
    ).reset_index()

    return amino_acid_counts_df

def filter_codons(df):
    """
    Filters out sequences with a '*' (stop codon) or 'X' (codon containing an ambiguous base call, e.g., 'N') in the amino_acid column.

    Parameters:
    df (pd.DataFrame): DataFrame containing unique amino acid sequences, their summed counts, and occurrences.

    Returns:
    tuple: A tuple containing two DataFrames:
        - Filtered DataFrame without sequences containing "*" or "X".
        - DataFrame containing sequences that were filtered out.
    """
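    # Flag translations that contain a stop codon ('*') or an ambiguous residue ('X')
    has_invalid_residue = df['amino_acid'].str.contains(r'[*X]')
    filtered_df = df[~has_invalid_residue]
    filtered_out_df = df[has_invalid_residue]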

    return filtered_df, filtered_out_df
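
The three helper functions above boil down to count, translate, aggregate, and filter. The sketch below replays that logic on a handful of invented reads using only the standard library, so each step can be checked by hand. It builds the standard genetic code itself rather than calling Biopython, and it maps any codon containing a non-ACGT base to `X`, which is a simplification of how `Seq.translate()` treats ambiguous bases.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate in frame; unknown codons (e.g., containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Invented reads (two different DNA sequences encode the same peptide "AK")
reads = ["GCTAAA", "GCTAAA", "GCCAAG", "GCTTAA", "NNNAAA"]

dna_counts = Counter(reads)  # reads per unique DNA sequence
aa_counts, aa_occurrence = Counter(), Counter()
for dna, count in dna_counts.items():
    peptide = translate(dna)
    aa_counts[peptide] += count   # total reads per peptide
    aa_occurrence[peptide] += 1   # distinct DNA sequences per peptide

# Keep only peptides without a stop codon or an ambiguous residue
valid = {aa: n for aa, n in aa_counts.items() if "*" not in aa and "X" not in aa}

print(dict(aa_counts))      # {'AK': 3, 'A*': 1, 'XK': 1}
print(dict(aa_occurrence))  # {'AK': 2, 'A*': 1, 'XK': 1}
print(valid)                # {'AK': 3}
```

Here `AK` has a count of 3 but an occurrence of 2, because three reads come from two distinct DNA sequences (`GCTAAA` and `GCCAAG`) that are synonymous at the amino acid level.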

In [None]:
input_file = "Illumina_demo_pre_len21.fastq"
dna_df = count_unique_dna_seq(input_file)
dna_output_file = "Illumina_demo_pre_len21_dna.csv"
dna_df.to_csv(dna_output_file, index=False)
print(f"\n{len(dna_df)} DNA variants saved.")
display(dna_df)

In [None]:
amino_acid_df = count_unique_aa_seq(dna_df)
amino_acid_output_file = "Illumina_demo_pre_len21_aa.csv"
amino_acid_df.to_csv(amino_acid_output_file, index=False)
print(f"{len(amino_acid_df)} raw amino acid variants saved.")
display(amino_acid_df)

In [None]:
filtered_amino_acid_df, filtered_out_amino_acid_df = filter_codons(amino_acid_df)
filtered_amino_acid_output_file = "Illumina_demo_pre_len21_aa_filtered.csv"
filtered_amino_acid_df.to_csv(filtered_amino_acid_output_file, index=False)
print(f"{len(filtered_amino_acid_df)} filtered amino acid variants saved.")
display(filtered_amino_acid_df) # this is the desired output dataframe!

In [None]:
filtered_out_amino_acid_output_file = "Illumina_demo_pre_len21_aa_filtered_out.csv"
filtered_out_amino_acid_df.to_csv(filtered_out_amino_acid_output_file, index=False)
print(f"{len(filtered_out_amino_acid_df)} filtered out amino acid variants saved.")
display(filtered_out_amino_acid_df)

**Question 6: How many valid amino-acid level variants are there in this library?**

# 6. Visualize the 7-mer insertion variants in our data

In [None]:
display(filtered_amino_acid_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Function to calculate the frequencies of each character at each position in the sequences
def calculate_frequencies(df):
    frequencies = {}
    for index, row in df.iterrows():
        value = row['amino_acid']
        count = row['count']
        for i, char in enumerate(value):
            freq_dict = frequencies.get(i + 1, {})
            freq_dict[char] = freq_dict.get(char, 0) + count
            frequencies[i + 1] = freq_dict
    return frequencies
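
As a quick sanity check of the idea behind `calculate_frequencies`, the sketch below tallies read-weighted residue counts per position for three invented peptides using only the standard library, then converts the counts at each position into fractions. Normalizing is an optional extra step that this tutorial's heatmap does not perform. The peptides and counts are made up.

```python
from collections import defaultdict

# Invented (peptide, read count) pairs
peptide_counts = [("AKG", 6), ("AKT", 3), ("SKG", 1)]

# position (1-based) -> residue -> summed read count
position_counts = defaultdict(lambda: defaultdict(int))
for peptide, count in peptide_counts:
    for position, residue in enumerate(peptide, start=1):
        position_counts[position][residue] += count

# Convert counts to fractions so that every position sums to 1
position_fractions = {
    position: {residue: n / sum(residues.values()) for residue, n in residues.items()}
    for position, residues in position_counts.items()
}

for position in sorted(position_fractions):
    print(position, position_fractions[position])
```

Fractions make positions comparable across libraries with different sequencing depths, whereas raw counts scale with the total number of reads.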

In [None]:
# Calculate the frequencies of each character at each position
freq = calculate_frequencies(filtered_amino_acid_df)

# Convert the frequencies dictionary to a DataFrame
freq_df = pd.DataFrame(freq)

display(freq_df)

In [None]:
# Plot a heatmap of the frequencies
plt.figure(figsize=(3, 6))
sns.heatmap(freq_df,
            cmap='YlGnBu',
            xticklabels=range(1, freq_df.shape[1] + 1),
            linewidths=0.5,
            cbar_kws={'label': 'Frequency'}) # add "Frequency" as colormap title
plt.xlabel('Position in 7-mer')
plt.ylabel('Amino Acid Residue')
plt.show()

**Question 7: What can you say about this library's amino acid composition?**

# You made it to the end of the tutorial!
# Congratulations, you just analyzed your first AAV sequencing data!