# Assembling RADseq data with ipyrad



Restriction site-Associated DNA sequencing (RADseq) is a high-throughput genotyping technique used in molecular biology and genomics. It is one method for generating reduced representation libraries, which allow us to prepare and sequence hundreds to thousands of regions from across the genome without sequencing the entire genome. RADseq methods use restriction enzymes to cut up the genome and then sequence the DNA regions adjacent to these cut sites. The idea is that within a species, or among relatively closely related species, restriction enzyme cut sites should mostly be in the same places, which allows shared loci to be selected across samples without needing to develop specific probes.


One of the most commonly used RADseq approaches, and the one that we'll use here, is double digest RADseq (ddRADseq), which uses two enzymes to cut up the genome and then a size selection step to further reduce the total set of loci. Ideally, this results in fewer loci that require less sequencing effort and that overlap among samples.

We’ll be working with empirical double digest RADseq data [(Peterson et al. 2012)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037135) that I (Sean Harrington) generated as part of my PhD research. The data are for a species of rattlesnake, the red diamond rattlesnake (*Crotalus ruber*), that is distributed across the Baja California peninsula and into southern California. I was interested in identifying whether there is any population structure in *C. ruber* and inferring what population genetic and environmental forces have produced any existing structure. The data are single-end reads generated on an Illumina HiSeq. My analyses of these data are published in [Harrington et al. 2018](https://onlinelibrary.wiley.com/doi/full/10.1111/jbi.13114).



The dataset is reasonably small and we should be able to quickly process and analyze it.

We will use [ipyrad](https://ipyrad.readthedocs.io/en/master/) to process and assemble the raw data into alignments. ipyrad is a flexible Python-based pipeline for taking various types of restriction-site associated data, processing them, and generating aligned datasets.

ipyrad is capable of generating datasets either by mapping your raw reads to a reference genome or using a de novo assembly method that does not require a reference. We will use the de novo method here.

If you need help with ipyrad outside of this workshop for specific issues, you can always post [here](https://app.gitter.im/#/room/#dereneaton_ipyrad:gitter.im). The developers are very responsive to queries.

ipyrad is certainly not the only option for assembling RADseq data. [Stacks](https://catchenlab.life.illinois.edu/stacks/) and [dDocent](https://www.ddocent.com//) are other popular options, or there are various ways to manually assemble or map RADseq data.









## Files and basic setup

The files we will use are:

- all_ruber.fastq.gz
- barcodes_samples.txt
- names_ruber_all.txt

We will copy these from the Google Cloud Storage bucket for this tutorial into your GCP instance.

In [None]:
!gsutil -m cp -r gs://radseq_cloud/ .

## fastq format

Before we start doing anything with the data, it's worth seeing what the raw data look like. The standard format for raw data for genomic sequences is fastq.

Let's take a look at the first 8 lines of the fastq file:

In [None]:
!zcat radseq_cloud/ruber_data/all_ruber.fastq.gz | head -n 8

- Note that these reads are gzipped (compressed; the file name ends in `.gz`). You cannot look at them directly with `head`; instead, use `zcat`, which reads gzipped files, and pipe the output to `head`. Fastq files are typically gzipped to save disk space, and most genomics programs can read gzipped fastqs.

Each read from the sequencer is represented by 4 lines: the first 4 lines are the first read, the second set of 4 lines are the second read, etc. For each read, the first line is the header, and always starts with `@`. This contains a sequence identifier and various information about the read, often including information about the sequencing run. The second line, after the header, is the actual DNA sequence of the read. The next line always starts with `+` and may contain either no additional text, or the sequence identifier and extra information, as in the header. Line 4 for each read, following the `+` line, indicates the quality score for each DNA base in the read. This line will be exactly the same length as the DNA sequence in the second line, with e.g., the 4th character in this line corresponding to the quality of the 4th base in the sequence, etc.
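The four-line structure described above can be sketched in a few lines of Python. This is only an illustration and not part of the ipyrad workflow: the function names `read_fastq` and `phred_scores` are made up for this example, the example read is invented, and the quality decoding assumes the Phred+33 encoding used by current Illumina instruments.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of fastq lines."""
    it = iter(lines)
    while True:
        # Each read occupies exactly 4 consecutive lines
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in qual]


# A made-up read, for illustration only
example = ["@read1 hypothetical header\n", "TGCAGGAC\n", "+\n", "IIIIFF##\n"]
records = list(read_fastq(example))
scores = phred_scores(records[0][2])
```

To try this on the real data, the same function can be given an open file handle, for example one returned by `gzip.open("radseq_cloud/ruber_data/all_ruber.fastq.gz", "rt")`.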

<br>

### Installing ipyrad

Next up, we'll install Miniforge and use mamba to install ipyrad into your instance.


In [4]:
# Download the latest Miniforge installer
! curl -fL -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"

# Install Miniforge in batch mode, suppressing the installer output
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3 > /dev/null 2>&1


# Add the Miniforge bin directory to the system PATH environment variable
import os
os.environ["PATH"] = os.environ["HOME"] + "/miniforge3/bin:" + os.environ["PATH"]

# Verify that mamba is available in the PATH
! mamba --version

# Use mamba to install ipyrad
! mamba install ipyrad -c conda-forge -c bioconda -y

# Optionally, use pip to install a pinned version of psutil (disabled here)
# ! pip install psutil==6.1.1

# Work around a psutil compatibility issue by removing its compiled extension
! rm -f /opt/conda/lib/python3.10/site-packages/psutil/_psutil_linux.cpython-310-x86_64-linux-gnu.so

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 74.3M  100 74.3M    0     0   170M      0 --:--:-- --:--:-- --:--:--  170M
mamba 1.5.12
conda 24.11.3

Looking for: ['ipyrad']

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache

Pinned packages:
  - python 3.10.*


Transaction

  Prefix: /opt/conda

  All requested packages already installed


## Running ipyrad


First, we need to generate a params file that contains the parameters we need to specify for ipyrad. In your scripts directory, run:

In [5]:
!ipyrad -n ruber_denovo

ipyrad.assemble.utils.IPyradError: 
    Error: Params file already exists: params-ruber_denovo.txt
    Use force argument to overwrite.
    


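This will create a params file with the defaults that ipyrad uses; we can modify these as needed. Whatever comes after `-n` is what the assembly will be named. If a params file with that name already exists, ipyrad will refuse to overwrite it unless you use the force argument, as in the output above.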

Let’s go look at and edit that.



We’ll change a few of these parameters:

- [1]: This is where output will go. Edit this to `/home/jupyter/RADseq_cloud_learn/ipyrad_out`

- [2]: This needs to reflect the path to the `all_ruber.fastq.gz` file, which is: `/home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/all_ruber.fastq.gz`

- [3]: This needs to be the path to `barcodes_samples.txt`: `/home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/barcodes_samples.txt`

- [7]: The datatype should be `ddrad`

- [8]: The restriction overhang is `TGCAGG, GATC`. These are the overhangs created by the restriction enzymes used in the ddRAD protocol for these data. I find these to be a pain to figure out; how to do so is covered in the ipyrad params documentation.

- [14]: This is the threshold for clustering reads into loci within samples. This is an important parameter that can have large effects on your final dataset. The default of `0.85` is good for phylogenetic datasets, but for population genetics, you will often want to use a higher threshold like `0.9` or `0.93`. Let's use `0.9` here.

- [27]: change to `*`, this will generate all output formats that ipyrad is currently capable of
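If you would rather make these edits from the notebook than in a text editor, something like the following sketch could work. It assumes that every parameter line in the params file has the form `value ## [index] [name]: description`, as in the `min_samples_locus` line shown later in this tutorial. The function `set_params` and the two-parameter excerpt are hypothetical and exist only for this illustration; this is not an ipyrad feature.

```python
import re

# Matches the "## [N]" marker that labels each parameter line
PARAM_INDEX = re.compile(r"##\s*\[(\d+)\]")


def set_params(text, updates):
    """Return params-file text with the values of the given parameter indices replaced.

    `updates` maps a parameter index (int) to its new value.
    Lines without a "## [N]" marker, or with an index not in `updates`, are kept as-is.
    """
    out = []
    for line in text.splitlines():
        match = PARAM_INDEX.search(line)
        if match and int(match.group(1)) in updates:
            comment = line[match.start():]  # keep the "## [N] [name]: ..." part
            value = str(updates[int(match.group(1))])
            line = f"{value:<30} {comment}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a params file, for illustration only
sample = (
    "------- ipyrad params file (excerpt) -------\n"
    "rad                            ## [7] [datatype]: Datatype\n"
    "0.85                           ## [14] [clust_threshold]: Clustering threshold\n"
)
edited = set_params(sample, {7: "ddrad", 14: "0.9"})
print(edited)
```

On the real file, you would read `params-ruber_denovo.txt` into a string, pass it through the function with all of the changes listed above, and write the result back. Check the edited file by eye before running ipyrad.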

The rest of these are at generally reasonable values, although depending on your data, you may want to modify some of these. The parameters are all well documented [here](https://ipyrad.readthedocs.io/en/latest/6-params.html).

For our final dataset, we'll want to set parameter [21], `min_samples_locus`, to something higher so that we end up with a reasonable amount of missing data, but we'll deal with this later.

We'll start by running steps 1-5:

In [7]:
# Run ipyrad with those parameters for steps 1-5 and using 16 cores
!ipyrad -p params-ruber_denovo.txt -s 12345 -c 16


 -------------------------------------------------------------
  ipyrad [v.0.9.104]
  Interactive assembly and analysis of RAD-seq data
 ------------------------------------------------------------- 
  Parallel connection | fresh-noncontainer: 16 cores
  
  Step 1: Demultiplexing fastq data to Samples

  Encountered an Error.
  Message:     Error: Step 1 requires that you enter one of the following:
        (1) a sorted_fastq_path
        (2) a raw_fastq_path + barcodes_path
    
^C
KeyboardInterrupt


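This should take around 30 minutes. If you instead get the Step 1 error shown in the output above, check that the paths in parameters [2] and [3] of the params file have been filled in. While the assembly is running, familiarize yourself with the steps in ipyrad, which are thoroughly documented [here](https://ipyrad.readthedocs.io/en/master/7-outline.html).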


## Branching an assembly

We only ran steps 1-5 above because the fastq file that we started with includes mostly individuals of the red diamond rattlesnake, *Crotalus ruber*, but also a few individuals of other rattlesnake species to serve as outgroups. Right now, we want to make a dataset that includes only *C. ruber* individuals, which we can run some popgen analyses on in the next session.

ipyrad includes functionality to make new “branches” of the assembly using different parameters and/or including/excluding different individuals, and we’ll take advantage of that functionality here.

- If we wanted to include all samples in the same dataset, we could've just run all 7 steps at once.

To create a new branch with only the desired individuals:

In [6]:
# branch the assembly
!ipyrad -p params-ruber_denovo.txt -b ruber_only_denovo radseq_cloud/ruber_data/names_ruber_all.txt

ipyrad.assemble.utils.IPyradError: 
            Could not find saved Assembly file (.json) in expected location.
            Checks in: [project_dir]/[assembly_name].json
            Checked: /home/jupyter/RADseq_cloud_learn/ruber_denovo.json
            


This will use our old assembly and params file to generate a new branch, with the params file `params-ruber_only_denovo.txt`, that includes only the samples listed in the `names_ruber_all.txt` file.

We need to further edit this file to change parameter [21], `min_samples_locus`. This parameter defines how many individual samples a locus must have data for in order to be included in the final dataset, and so it controls the amount of missing data in the final dataset. Here, let's set this to `26`: this is about 75% of individuals and should result in a matrix that is ~75% or greater complete.

Use your favorite text editor and make this change in the file params-ruber_only_denovo.txt:

`26               ## [21] [min_samples_locus]: Min # samples per locus for output`
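The reasoning behind a value like this can be sketched as a small calculation: count the samples in the names file and take the desired fraction of them. The helper below is hypothetical, assumes a names file with one sample name per line, and uses made-up sample names. The value of 26 used in this tutorial was chosen by hand as roughly 75% of individuals, so this sketch is not guaranteed to reproduce it exactly.

```python
import math


def min_samples_for_completeness(sample_names, fraction=0.75):
    """Smallest number of samples that covers at least `fraction` of all samples."""
    names = [name.strip() for name in sample_names if name.strip()]
    return math.ceil(fraction * len(names))


# Made-up names standing in for the lines of a names file
hypothetical_names = [f"sample_{i}" for i in range(8)]
threshold = min_samples_for_completeness(hypothetical_names)
print(threshold)
```

Because every retained locus has data for at least this many samples, each locus is at least that fraction complete, which is why the overall matrix ends up at or above the target.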

Once that change has been made, run the final 2 steps in ipyrad. This should be very fast.

In [None]:
!ipyrad -p params-ruber_only_denovo.txt -s 67 -c 16

## Examining the output


Before you start analyzing your data, you should always take a look at the output stats.

Take a look at the `ruber_only_denovo_stats.txt` file in the `ipyrad_out/ruber_only_denovo_outfiles` directory by opening it in JupyterLab.


There should be about 2498 loci recovered in the assembly (last column of row `total_filtered_loci`). If we scroll down a bit in the table `The number of loci recovered for each Sample`, we can see that SD_Field_0506 has almost no loci shared with other samples, and SD_Field_1453 has only about half as many loci as most samples. We’ll want to remove these samples before moving on. 

- Note that SD_Field_0506 is an obviously failed sample, but for SD_Field_1453, you would likely want to try out some preliminary downstream analyses with and without this sample – I’ve already analyzed these data and decided it’s best to remove it.
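If you have many samples, a rough automated check can help point your eye at candidates for removal. The sketch below flags samples whose locus count falls well below the median. The function `flag_low_coverage`, the 60% cutoff, and the counts are all invented for illustration and are not values from this dataset. In practice, you would copy the per-sample counts from the stats file.

```python
from statistics import median


def flag_low_coverage(loci_per_sample, fraction=0.6):
    """Return sample names with fewer loci than `fraction` times the median count."""
    cutoff = fraction * median(loci_per_sample.values())
    return sorted(name for name, n in loci_per_sample.items() if n < cutoff)


# Made-up locus counts: one failed sample and one with about half the typical count
hypothetical_counts = {
    "sample_A": 2400,
    "sample_B": 2350,
    "sample_C": 2300,
    "sample_D": 1200,
    "sample_E": 15,
}
flagged = flag_low_coverage(hypothetical_counts)
print(flagged)
```

A check like this is only a starting point. Whether a borderline sample should be dropped is still a judgment call that depends on how it behaves in downstream analyses.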

## Branch to remove low data samples

Start by making a new names file that excludes SD_Field_0506 and SD_Field_1453: copy `names_ruber_all.txt` to a new file called `names_ruber_reduced.txt` in the same directory, then delete the lines containing `SD_Field_0506` and `SD_Field_1453`.
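The same edit can be scripted. This is a minimal sketch, assuming the names file contains one sample name per line. The function `write_reduced_names` is hypothetical, and the demonstration runs on a temporary file with made-up names alongside the two samples to drop.

```python
import tempfile
from pathlib import Path


def write_reduced_names(src, dst, exclude):
    """Copy sample names from src to dst, skipping any name in `exclude`."""
    exclude = set(exclude)
    kept = [name for name in Path(src).read_text().split() if name not in exclude]
    Path(dst).write_text("\n".join(kept) + "\n")
    return kept


# Demonstration on a temporary file with placeholder sample names
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "names_all.txt"
    dst = Path(tmp) / "names_reduced.txt"
    src.write_text("sample_A\nSD_Field_0506\nsample_B\nSD_Field_1453\n")
    kept = write_reduced_names(src, dst, ["SD_Field_0506", "SD_Field_1453"])
    reduced_text = dst.read_text()

print(kept)
```

In this notebook, `src` would be `radseq_cloud/ruber_data/names_ruber_all.txt` and `dst` would be `radseq_cloud/ruber_data/names_ruber_reduced.txt`, which is the path the branching command expects.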


Then do the branching and run step 7 on that new branch:

In [None]:
# branch
!ipyrad -p params-ruber_only_denovo.txt -b ruber_reduced_denovo radseq_cloud/ruber_data/names_ruber_reduced.txt

In [None]:
!ipyrad -p params-ruber_reduced_denovo.txt -s 7 -c 16

Look at the stats for the new assembly in `ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovo_stats.txt`

You should now see a slight decrease in the number of loci (I see 2451), but pretty good coverage across individuals, with no single sample having massive amounts of missing data. This looks like a good dataset to move forward with.

We have all sorts of variously formatted data files in the output directory. We'll upload the contents of that directory, as well as some of the metadata, to a bucket so that we can use those data in the downstream analyses in other notebooks.

In [None]:
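# Create a new bucket (bucket names are globally unique, so if this name is
# already taken, choose another and use it in the commands below)
! gsutil mb -l us-east4 gs://ruber-ipyrad-out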

# Copy the ipyrad output
! gsutil -m cp /home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/* gs://ruber-ipyrad-out/

# Copy over the file of locality coordinates for each sample
! gsutil cp /home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/Localities.csv gs://ruber-ipyrad-out/