# Assembling RADseq data with ipyrad



Restriction site-Associated DNA sequencing (RADseq) is a high-throughput genotyping technique used in molecular biology and genomics. It is one method to generate reduced representation libraries, which allow us to prepare and sequence hundreds to thousands of genomic regions from across the genome without sequencing the entire genome. RADseq methods use restriction enzymes to cut up the genome and then sequence DNA regions that are adjacent to these cut sites. The idea is that within the same species or relatively closely related species, restriction enzyme cut sites should mostly be at the same places, allowing shared loci to be selected across samples without needing to develop specific probes.


One of the most commonly used RADseq approaches, and the one that we'll use here, is double digest RADseq (ddRADseq), which uses two enzymes to cut up the genome and then a size selection step to further reduce the total set of loci. Ideally, this results in fewer loci that require less sequencing effort and that overlap among samples.


This figure from [(Peterson et al. 2012)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037135) shows how sequencing is targeted at specific regions by RADseq and ddRADseq approaches:

<img src="images/ddrad_peterson.png" width=50% />


There are various other RADseq approaches, including the original formulation of RADseq that used a single enzyme and no size selection, as described [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0003376), or 3RAD, which incorporates an extra enzyme to cut apart adapter dimers, as described [here](https://peerj.com/articles/7724/). Use of different strategies, restriction enzymes, and size selection windows will affect what loci you target from your organisms of interest, and the best strategy will depend on genomic properties of your target organism and your desired sequencing effort.


<br>

We’ll be working with empirical double digest RADseq data that I (Sean Harrington) generated as part of my PhD research. The data are for a species of rattlesnake, the red diamond rattlesnake (*Crotalus ruber*), that is distributed across the Baja California peninsula and into southern California.



<img src="images/ruber.jpg" width=50% />





I was interested in identifying whether there is any population structure in *C. ruber* and inferring what population genetic and environmental forces have resulted in any existing structure. The data are single-end reads generated on an Illumina HiSeq. My analyses of these data are published in [Harrington et al. 2018](https://onlinelibrary.wiley.com/doi/full/10.1111/jbi.13114).





The dataset is reasonably small and we should be able to quickly process and analyze it.

We will use [ipyrad](https://ipyrad.readthedocs.io/en/master/) to process and assemble the raw data into alignments. ipyrad is a flexible Python-based pipeline for taking various types of restriction-site associated data, processing them, and generating aligned datasets.

ipyrad is capable of generating datasets either by mapping your raw reads to a reference genome or using a de novo assembly method that does not require a reference. We will use the de novo method here.

If you need help with ipyrad, you can always post [here](https://app.gitter.im/#/room/#dereneaton_ipyrad:gitter.im). The developers are very responsive to queries.

ipyrad is certainly not the only option for assembling RADseq data. [Stacks](https://catchenlab.life.illinois.edu/stacks/) and [dDocent](https://www.ddocent.com//) are other popular options, or there are various ways to manually assemble or map RADseq data.




We will be using a Python kernel throughout this tutorial, but all commands will be bash commands, and so will be preceded by `!`.




## Files and basic setup

The files we will use are:

- all_ruber.fastq.gz
- barcodes_samples.txt
- names_ruber_all.txt

We will copy these from the google bucket for this tutorial into your GCP instance.

In [None]:
! gsutil -m cp -r gs://radseq_cloud/ .

## fastq format

Before we start doing anything with the data, it's worth seeing what the raw data looks like. The standard format of raw genomic sequence data is fastq.

Let's take a look at the first 8 lines of the fastq file. The following command will help us:

In [None]:
! zcat radseq_cloud/ruber_data/all_ruber.fastq.gz | head -n 8

- Note that these reads are gzipped (compressed; the filename ends in .gz). You cannot look at them directly with `head`; instead, use `zcat`, which reads gzipped files, and pipe the output to `head`. Fastq files are typically gzipped to save disk space, and most genomics programs can read gzipped fastqs.

You should see this:


<img src="images/fastq.png" width=70% />

Each read from the sequencer is represented by four lines. For example, the first four lines correspond to the first read, the next four lines to the second read, and so on. Here's the breakdown of each line for a single read:

1. **Line 1 (Header)**:  
   This line always begins with `@` and contains the sequence identifier along with optional information about the read, such as details about the sequencing run.
  
2. **Line 2 (DNA Sequence)**:  
   This line contains the actual DNA sequence of the read, represented as a string of nucleotide bases (A, T, C, G).

3. **Line 3 (+ Separator)**:  
   This line starts with a `+` and may either be empty or contain additional information, such as the sequence identifier again. This line serves primarily as a separator.

4. **Line 4 (Quality Scores)**:  
   This line contains the quality scores for each base in the DNA sequence. The number of characters in this line matches the length of the sequence from Line 2. Each character corresponds to the quality score for the respective base in the sequence (e.g., the 4th character in this line represents the quality of the 4th base in the sequence).

This structure repeats for each read in the FASTQ file.
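The four-line layout described above can also be read directly in Python. This is only a minimal sketch: the two reads are made up for illustration (they are not taken from `all_ruber.fastq.gz`), the helper names `read_fastq` and `phred_scores` are hypothetical, and the quality decoding assumes the common Phred+33 encoding.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        # Each read occupies exactly four consecutive lines
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            break  # end of file (or a truncated final record)
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header, sequence, quality

def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33 encoding)."""
    return [ord(char) - offset for char in quality]

# Two made-up reads, for illustration only
example = io.StringIO(
    "@read_1\nTGCAGGAC\n+\nIIIIHHG#\n"
    "@read_2\nTGCAGGTT\n+\nIIIIIIII\n"
)

for header, sequence, quality in read_fastq(example):
    print(header, sequence, phred_scores(quality))
```

To try this on the real data, a text handle from `gzip.open("radseq_cloud/ruber_data/all_ruber.fastq.gz", "rt")` could be passed to `read_fastq` in place of the in-memory example.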


## Installing ipyrad

Next up, we'll use mamba to install ipyrad. If mamba is not already installed, you can install it with:
```
# Download the latest Miniforge installer (includes mamba)
! curl -fL -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"

# Install Miniforge without output
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3 > /dev/null 2>&1
# If you want to see all the output you can remove `> /dev/null 2>&1` which redirects output so that we don't see it


# Add the Miniforge bin directory to the system PATH environment variable
import os
os.environ["PATH"] = os.environ["HOME"] + "/miniforge3/bin:" + os.environ["PATH"]
```

In [None]:
# Verify that mamba/conda is available in the PATH
! mamba --version

# Use mamba to install ipyrad - similarly without output using `> /dev/null 2>&1`
! mamba install ipyrad -c conda-forge -c bioconda -y > /dev/null 2>&1

## Running ipyrad


First, we need to generate a params file that contains the parameters we need to specify for ipyrad. In your scripts directory, run:

In [None]:
! ipyrad -n ruber_denovo

This will create a params file with the defaults that ipyrad uses; we can modify these as needed. Whatever comes after the `-n` is what the assembly will be named.

Let’s go look at that and edit it. You can directly use the editor that the JupyterLab interface provides. In the left pane, if it's not already open and active, click on the folder icon, then double click the "params-ruber_denovo.txt" file, and it will open in a text editor.

<img src="images/params_file_location.png" width=50% />


In the editor that opens up, I uncheck "Wrap Words" from the "View" menu so that each line of the params file stays on a single line.


We’ll change a few of these parameters:

- [1]: This is where output will be written, edit this to `/home/jupyter/RADseq_cloud_learn/ipyrad_out`

- [2]: this needs to reflect the path to the `all_ruber.fastq.gz` file, which is: `/home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/all_ruber.fastq.gz`

- [3]: this needs to be the path to `barcodes_samples.txt`: `/home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/barcodes_samples.txt`

- [7]: datatype should be `ddrad`

- [8]: restriction overhang is: `TGCAGG, GATC`. These are the overhangs created by the restriction enzymes used to generate these ddRAD data. The correct overhangs can be difficult to work out; this is covered in the ipyrad params documentation.

- [14]: This is the clustering threshold for clustering reads into loci within samples. This is an important parameter that can have large effects on your final dataset. The default of `0.85` is good for phylogenetic datasets, but for population genetics, you will often want to use a higher threshold like `0.9` or `0.93`. Let's use `0.9` here.

- [27]: change to `*`, this will generate all output formats that ipyrad is currently capable of

The rest of these are at generally reasonable values, although depending on your data, you may want to modify some of these. The parameters are all well documented [here](https://ipyrad.readthedocs.io/en/latest/6-params.html).

For our final dataset, we'll want to set parameter [21] `min_samples_locus` to something higher to end up with a reasonable amount of missing data, but we'll deal with this later.
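To get a feel for what the clustering threshold in parameter [14] means, here is a deliberately simplified sketch. It compares two made-up, equal-length reads position by position, whereas ipyrad clusters reads using alignment-based similarity (via vsearch), so treat this only as an illustration of the threshold, not of ipyrad's actual method. The function names are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

def would_cluster(seq_a, seq_b, threshold):
    """True if the two sequences are at least `threshold` identical."""
    return identity(seq_a, seq_b) >= threshold

# Made-up 100 bp reads that differ at 12 positions (88% identical)
read_a = "A" * 100
read_b = "A" * 88 + "C" * 12

print(identity(read_a, read_b))             # 0.88
print(would_cluster(read_a, read_b, 0.85))  # True: grouped at the default
print(would_cluster(read_a, read_b, 0.90))  # False: split at the stricter value
```

A higher threshold therefore splits more divergent reads into separate clusters, which is why it suits population-level data where true alleles at a locus differ by only a few bases.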

Your final file should look like this (also showing the view menu where you can uncheck "Wrap Words"):

<img src="images/edited_params.png" width=100% />


Close out of that file and save when prompted.
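If you would rather make these edits from code than in the text editor, something like the following could work. This is a sketch under assumptions: it relies on each line of the params file having the form `value ## [N] [name]: description`, the helper `set_param` is hypothetical (it is not part of ipyrad), and the example lines are abbreviated stand-ins for a real params file, so the descriptions in your file may differ.

```python
import re

def set_param(params_text, index, value):
    """Return params_text with the value of parameter [index] replaced."""
    # Capture the old value (group 1) and the '## [N] ...' comment (group 2)
    pattern = re.compile(rf"^(.*?)\s*(## \[{index}\] .*)$")
    new_lines, found = [], False
    for line in params_text.splitlines():
        match = pattern.match(line)
        if match:
            # Keep the comment, swap in the new value, pad for readability
            line = f"{str(value):<30} {match.group(2)}"
            found = True
        new_lines.append(line)
    if not found:
        raise KeyError(f"parameter [{index}] not found")
    return "\n".join(new_lines) + "\n"

# Abbreviated example lines, for illustration only
example_params = (
    "0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly\n"
    "4                              ## [21] [min_samples_locus]: Min # samples per locus for output\n"
)

updated = set_param(example_params, 14, "0.9")
print(updated)
```

For a real file, the text could be read with `pathlib.Path("params-ruber_denovo.txt").read_text()` and written back with `write_text`. It is worth opening the file afterwards to confirm the edit looks as expected.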

<br>
<br>


We'll start by running steps 1-5:

In [None]:
# Run ipyrad with those parameters for steps 1-5 and using 16 cores
! ipyrad -p params-ruber_denovo.txt -s 12345 -c 16

This should take around 15 minutes. While that's running, familiarize yourself with the steps in ipyrad, which are thoroughly documented [here](https://ipyrad.readthedocs.io/en/master/7-outline.html).
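As a rough illustration of what the first step (demultiplexing) does with the barcodes file, the sketch below assigns reads to samples by exact barcode match at the start of each read. It assumes the barcodes file lists one sample name and one barcode per line, separated by whitespace. The sample names, barcodes, and reads are made up, and the function names are hypothetical. ipyrad's real implementation is more involved (for example, it can tolerate barcode mismatches), so this is not its actual code.

```python
def parse_barcodes(text):
    """Map barcode -> sample name from 'sample<whitespace>barcode' lines."""
    barcode_to_sample = {}
    for line in text.splitlines():
        if line.strip():
            sample, barcode = line.split()
            barcode_to_sample[barcode] = sample
    return barcode_to_sample

def assign_read(sequence, barcode_to_sample):
    """Return (sample, trimmed_sequence) on an exact barcode match,
    otherwise (None, sequence)."""
    for barcode, sample in barcode_to_sample.items():
        if sequence.startswith(barcode):
            # Strip the barcode so the read starts at the restriction overhang
            return sample, sequence[len(barcode):]
    return None, sequence

# Made-up barcodes and reads, for illustration only
barcodes = parse_barcodes("sample_A\tAACCA\nsample_B\tCGATC\n")

print(assign_read("AACCATGCAGGACGT", barcodes))  # matches sample_A
print(assign_read("GGGGGTGCAGGACGT", barcodes))  # no matching barcode
```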


## Branching an assembly

We only completed steps 1–5 in the process outlined above because the initial FASTQ file contains primarily *Crotalus ruber* (red diamond rattlesnake) individuals, but also a few representatives from other rattlesnake species included as outgroups. Steps 1–5 assemble loci within individual samples, but they do not yet involve alignment across individuals. At this stage, our goal is to generate a dataset containing only *C. ruber* individuals for subsequent population genomics analyses in the next session. After we remove the outgroup individuals from the dataset, we will execute the final steps to create the necessary alignments across individuals.

The ipyrad tool offers functionality to create new "branches" of the assembly using different parameters, such as including or excluding specific individuals. We will use this capability to isolate the desired individuals for our analysis.

If we wanted to include all samples in the dataset, we could have run all seven steps in a single process.
To create a new branch containing only the selected individuals, follow these steps:

In [None]:
# branch the assembly
! ipyrad -p params-ruber_denovo.txt -b ruber_only_denovo radseq_cloud/ruber_data/names_ruber_all.txt

This will use our old assembly and params file to generate a new branch, with params file `params-ruber_only_denovo.txt`, that includes only the samples listed in the `names_ruber_all.txt` file.

We need to further edit this file to change parameter [21] `min_samples_locus`. This parameter defines how many individual samples a locus must have data for in order to be included in the final dataset, and so it controls the amount of missing data in the final dataset. Here, let's set this to `26`. This is about 75% of individuals and should result in a matrix that is ~75% or greater complete.
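The arithmetic behind a choice like this is simple enough to sketch. The helper `min_samples_for` is hypothetical, and the sample count of 40 is made up for illustration (it is not the number of samples in this dataset). It rounds up so that the requested completeness is a guaranteed minimum, whereas the value used in this tutorial was chosen by hand as "about 75%".

```python
import math

def min_samples_for(completeness, n_samples):
    """Smallest number of samples a locus must have data for so that at
    least `completeness` (a fraction between 0 and 1) of samples are covered."""
    return math.ceil(completeness * n_samples)

# Hypothetical dataset of 40 samples, aiming for loci present in >= 75% of them
n_samples = 40
print(min_samples_for(0.75, n_samples))  # 30
```

For your own data, `n_samples` would be the number of names in the file you branch with.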

As we did for the "params-ruber_denovo.txt", open the new "params-ruber_only_denovo.txt" file in the text editor and make this change:

`26               ## [21] [min_samples_locus]: Min # samples per locus for output`

Once that change has been made, run the final 2 steps in ipyrad. This should be fast and take just a minute or two.

In [None]:
! ipyrad -p params-ruber_only_denovo.txt -s 67 -c 16

## Examining the output


Before you start analyzing your data, you should always take a look at the output stats.

Navigate to the `ipyrad_out/ruber_only_denovo_outfiles` directory and open the `ruber_only_denovo_stats.txt` file in the text editor.


There should be about 2498 loci assembled and retained in the assembly (last column of row `total_filtered_loci`):

<img src="images/recovered_loci.png" width=50% />





<br>

If we scroll down a bit in the table `The number of loci recovered for each Sample`, we can see that SD_Field_0506 has almost no loci shared with other samples, and SD_Field_1453 has only about half as many loci as most samples. We’ll want to remove these samples before moving on. 

<img src="images/drop_samples.png" width=50% />

- Note that SD_Field_0506 is an obviously failed sample, but for SD_Field_1453, you would likely want to try out some preliminary downstream analyses with and without this sample – I’ve already analyzed these data and decided it’s best to remove it.
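Spotting low-data samples by eye works well for a dataset this size. For larger datasets, a rule of thumb could be sketched as below. The per-sample counts are made up for illustration (they are not the values from this assembly), the function name is hypothetical, and the cutoff of 60% of the median is an arbitrary assumption rather than a recommendation from ipyrad.

```python
from statistics import median

def flag_low_samples(loci_per_sample, min_fraction=0.6):
    """Return samples whose locus count is below min_fraction of the median."""
    cutoff = min_fraction * median(loci_per_sample.values())
    return sorted(name for name, n in loci_per_sample.items() if n < cutoff)

# Made-up locus counts, for illustration only
example_counts = {
    "sample_A": 2400,
    "sample_B": 2350,
    "sample_C": 2300,
    "sample_D": 1200,  # roughly half as many loci as most samples
    "sample_E": 15,    # an obviously failed sample
}

print(flag_low_samples(example_counts))  # ['sample_D', 'sample_E']
```

A flag from a rule like this is only a prompt to look more closely. As noted above, borderline samples are best judged by running preliminary analyses with and without them.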

## Branch to remove low data samples

Start by making a new names file to exclude SD_Field_0506 and SD_Field_1453 called `names_ruber_reduced.txt`. To do this, go into `radseq_cloud/ruber_data`, right click on `names_ruber_all.txt`, click "duplicate", and then right click and rename the duplicate file to `names_ruber_reduced.txt`.


Open `names_ruber_reduced.txt` in the text editor and delete the lines containing `SD_Field_0506` and `SD_Field_1453` and save and close the file.
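The same names file could instead be produced with a few lines of Python. This sketch assumes the names file holds one sample name per line. The helper `write_reduced_names` is hypothetical, and the demonstration runs on a small temporary file with placeholder names (`sample_A`, `sample_B`) rather than on the real `names_ruber_all.txt`.

```python
import tempfile
from pathlib import Path

def write_reduced_names(all_names_path, reduced_path, drop):
    """Copy a one-name-per-line file, skipping any names listed in `drop`."""
    names = [line.strip() for line in Path(all_names_path).read_text().splitlines()]
    kept = [name for name in names if name and name not in drop]
    Path(reduced_path).write_text("\n".join(kept) + "\n")
    return kept

# Demonstration on a temporary file so that no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    all_path = Path(tmp) / "names_all_example.txt"
    all_path.write_text("sample_A\nSD_Field_0506\nsample_B\nSD_Field_1453\n")
    kept = write_reduced_names(
        all_path,
        Path(tmp) / "names_reduced_example.txt",
        drop={"SD_Field_0506", "SD_Field_1453"},
    )
    print(kept)  # ['sample_A', 'sample_B']
```

For the real files, the two paths would be `radseq_cloud/ruber_data/names_ruber_all.txt` and `radseq_cloud/ruber_data/names_ruber_reduced.txt`.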


Then do the branching and run step 7 on that new branch:

In [None]:
# branch
! ipyrad -p params-ruber_only_denovo.txt -b ruber_reduced_denovo radseq_cloud/ruber_data/names_ruber_reduced.txt

In [None]:
# Run step 7
! ipyrad -p params-ruber_reduced_denovo.txt -s 7 -c 16

Look at the stats for the new assembly in `ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovo_stats.txt`

You should now see a slight decrease in the number of loci (I see 2451), but pretty good coverage across individuals, with no single sample having massive amounts of missing data. This looks like a good dataset to move forward with.

We have all sorts of variously formatted data files in the output directory. This data will all be saved in your Google Cloud instance and you can access it from the other notebooks, so long as you don't delete your instance. If you are going to delete your instance, you should first download your files or add them to a bucket. 


This is an example of how you could create a bucket and upload the results from ipyrad into it, and then also copy the file with coordinates into it. `your-new-bucket` is a placeholder for whatever you would actually call your bucket. Keep in mind that if you actually create a bucket and put data into it, there will be a cost associated with that storage as long as it persists.

In [None]:
# Create a new bucket
# ! gsutil mb -l us-east4 gs://your-new-bucket

# Copy the ipyrad output
# ! gsutil -m cp /home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/*  gs://your-new-bucket/

# Copy over the file of locality coordinates for each sample
# ! gsutil cp /home/jupyter/RADseq_cloud_learn/radseq_cloud/ruber_data/Localities.csv gs://your-new-bucket/
