Is there a way to filter fastq reads based on Qscore before paired end merging #575

@simmimourya

I have forward and reverse FASTQ files from Illumina paired-end sequencing, and I'm merging these reads using vsearch, which works great. However, I need to filter the reads based on a quality score (Qscore) threshold of 30 before merging. Specifically, within each forward and reverse FASTQ file, I want to discard any reads that have a quality score below this threshold. Once the reads are filtered, I want to perform a paired-end merge on the filtered reads.

Did I miss something? Also, is this the right way to approach the problem?

Here’s the code I’m using and the error I encountered during filtering:

import os
import subprocess

def filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False):

    filtered_r1_path = os.path.join(output_dir, "filtered_r1.fastq")
    filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
    
    # Filter R1 and R2 reads
    subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
    subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
    
    # Merge filtered reads
    merged_output_prefix = os.path.join(output_dir, "merged")
    merged_output_file = f"{merged_output_prefix}.fastq"
    
    if not os.path.exists(merged_output_file) or merge_override:
        subprocess.run(f"vsearch --fastq_mergepairs {filtered_r1_path} --reverse {filtered_r2_path} --fastqout {merged_output_file}", shell=True, check=True)
    else:
        print(f"Using existing merged file: {merged_output_file}")
    
    return merged_output_file

# Example usage
r1_path = "forward.fastq.gz"
r2_path = "reverse.fastq.gz"
output_dir = '../results/'
filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)

This is the error I'm seeing upon running the script:

vsearch v2.28.1_linux_x86_64, 124.5GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading input file

Fatal error: FASTQ quality value (16) below qmin (30)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[35], line 36
     34 r2_path = "reverse.fastq.gz"
     35 output_dir = '../results/'
---> 36 filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)

Cell In[35], line 19, in filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold, merge_override)
     16 filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
     18 # Filter R1 and R2 reads
---> 19 subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
     20 subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
     22 # Merge filtered reads

File /opt/conda/lib/python3.9/subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command 'vsearch --fastq_filter forward.fastq.gz --fastqout ../results/filtered_r1.fastq --fastq_qmin 30' returned non-zero exit status 1.
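
A note on the error, offered as a reading of the message rather than a confirmed diagnosis: "FASTQ quality value (16) below qmin (30)" suggests that `--fastq_qmin` sets the lowest quality value vsearch will accept in the input file, not a per-read filter, so raising it to 30 makes vsearch reject the file. vsearch does have quality-based options for `--fastq_filter`, such as `--fastq_maxee` and `--fastq_truncqual`. Separately, filtering R1 and R2 independently can drop different reads from each file, which leaves the pairs out of sync for merging. The sketch below filters both files together in plain Python. It assumes that "quality score" means the mean Phred score of a read, that the files use Phred+33 encoding with four lines per record, and that both files list mates in the same order. All function names are hypothetical.

```python
import gzip
from statistics import mean


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumption: Phred+33 encoding)."""
    return mean(ord(ch) - offset for ch in quality) if quality else 0.0


def filter_pairs(r1_records, r2_records, threshold=30, offset=33):
    """Yield (r1, r2) pairs in which BOTH mates reach the mean-quality threshold.

    Dropping a pair as a unit keeps the two output files in the same order.
    Note: zip() stops at the shorter input, so unequal inputs are not detected.
    """
    for r1, r2 in zip(r1_records, r2_records):
        if (mean_phred(r1[2], offset) >= threshold
                and mean_phred(r2[2], offset) >= threshold):
            yield r1, r2


def filter_paired_fastq(r1_in, r2_in, r1_out, r2_out, threshold=30):
    """Write the surviving pairs to two uncompressed FASTQ files; return the count."""
    kept = 0
    with open_text(r1_in) as f1, open_text(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for r1, r2 in filter_pairs(read_fastq(f1), read_fastq(f2), threshold):
            o1.write("{}\n{}\n+\n{}\n".format(*r1))
            o2.write("{}\n{}\n+\n{}\n".format(*r2))
            kept += 1
    return kept
```

With this in place, something like `filter_paired_fastq("forward.fastq.gz", "reverse.fastq.gz", "../results/filtered_r1.fastq", "../results/filtered_r2.fastq")` would produce two synchronized files that can then be passed to `vsearch --fastq_mergepairs` as in the original script, without `--fastq_qmin`. A mean-score cutoff is only one possible criterion; an expected-error cutoff is a common alternative.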
