# Example _de novo_ RADseq assembly using _pyRAD_

----------  

Please direct questions about _pyRAD_ analyses to the Google group ([link](https://groups.google.com/forum/#!forum/pyrad-users)).  

--------------  



+  This tutorial is a walkthrough of a single-end RADseq analysis. If you have not yet read the [__full tutorial__](http://www.dereneaton.com/software/pyrad), you should start there for a broader description of how _pyRAD_ works. If you are new to RADseq analyses, this tutorial provides a simple overview of how to execute _pyRAD_, what the data files look like, how to check that your analysis is working, and what the output formats look like.  



+  Most cells in this tutorial begin with the header (%%bash), indicating that the code should be executed in a command-line shell, for example by copying and pasting the text into your terminal (but excluding the %%bash header). Cells that start with `!` also run a shell command; drop the `!` when typing them into a terminal.

-------------  


Begin by executing the command below. This will download an example simulated RADseq data set and unarchive it into your current directory.

In [1]:
pwd

u'/Users/sr320/git-repos/nb-2016/util'

In [2]:
mkdir rad-seq-example

In [3]:
cd rad-seq-example/

/Users/sr320/git-repos/nb-2016/util/rad-seq-example


In [5]:
%%bash
wget -q dereneaton.com/downloads/simRADs.zip
unzip simRADs.zip

Archive:  simRADs.zip
  inflating: simRADs.barcodes        
  inflating: simRADs_R1.fastq.gz     


----------------  

#### The two necessary files below should now be located in your current directory.

+ simRADs_R1.fastq.gz : Illumina fastQ formatted reads (gzip compressed)
+ simRADs.barcodes : barcode map file  

-----------------   



In [6]:
ls

simRADs.barcodes     simRADs.zip          simRADs_R1.fastq.gz


In [8]:
!head simRADs.barcodes

1A0	CATCAT
1B0	TTTTAA
1C0	AGGGGA
1D0	TAAGGT
2E0	TTTATA
2F0	GAGTAT
2G0	ATAGAG
2H0	ATGAGG
3I0	GGGTTT
3J0	TTAAAA


#### We begin by creating the params.txt file which is used to set all parameters for an analysis.

In [9]:
%%bash
## I have pyRAD in my $PATH so that I can call it by simply typing pyRAD.
## If you haven't done this then you will need to type the full path to 
## the pyRAD script to execute it.

## call pyRAD with the (-n) option
pyRAD -n

	new params.txt file created


------------   

The params file lists on each line one parameter followed by a __##__ mark, after which any comments can be left. In the comments section there is a description of the parameter and in parentheses the step of the analysis affected by the parameter. Lines 1-12 are required, the remaining lines are optional. The params.txt file is further described in the general tutorial.

#### Let's take a look at the default settings. 

In [10]:
%%bash
cat params.txt

./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                    
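Because every line follows the same `value ## N. description` layout, a single setting can be read back with a short awk filter. The helper below is a minimal sketch: `get_param` is a hypothetical name (not part of _pyRAD_), and it assumes the value itself never contains `##`.

```shell
# get_param FILE N : print the value stored on parameter line N.
# Hypothetical helper; assumes the "value ## N. description" layout shown above.
get_param() {
    awk -v n="$2" '
        index($0, "## " n ". ") {      # the line whose comment starts with "## N. "
            sub(/[ \t]*##.*$/, "")     # drop the comment and the padding before it
            print
            exit
        }' "$1"
}

# Example: get_param params.txt 10   (the clustering threshold)
```

This is handy for confirming that an automated edit of params.txt changed the line you intended.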

#### To change parameters you can edit params.txt in any text editor. Here to automate things I use the script below.

In [33]:
%%bash
## Note: `sed -i ''` is the BSD/macOS form; with GNU sed use `sed -i` (no empty string).
sed -i '' 's/^2  *## 7\./3                         ## 7./' params.txt
sed -i '' 's/^\.88  *## 10\./.85                       ## 10./' params.txt
sed -i '' 's/^.*## 14\..*$/c85m4p3                   ## 14. outprefix... /' params.txt
sed -i '' 's/^.*## 24\..*$/8                         ## 24. maxH raised ... /' params.txt
sed -i '' 's/^.*## 30\..*$/*                         ## 30. all output formats... /' params.txt

## If vsearch is not in your $PATH, point line 4 at its full path instead
## (the path below is specific to my machine; note the | delimiter, since the path contains /):
## sed -i '' 's|^vsearch  *## 4\.|/Applications/bioinfo/vsearch-1.11.1-osx-x86_64/bin/vsearch ## 4.|' params.txt


#### Let's have a look at the changes:

In [34]:
%%bash
cat params.txt

./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
3                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                    

--------------   

__Let's take a look at what the raw data look like.__

Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file (de-multiplexing goes much faster if they happen to be split into multiple files). The files may be compressed with gzip so that they have a .gz ending, but they do not need to be. The location of these files should be entered on line 2 of the params file. Below are the first three reads in the example file.

In [27]:
!gunzip -c simRADs_R1.fastq.gz > simRADs_R1.fastq

In [28]:
%%bash
head -n 12 simRADs_R1.fastq | cut -c 1-90

@lane1_fakedata0_R1_0 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


------------   

Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. The fourth line holds the quality scores for the base calls, which in this case are arbitrarily high because the data were simulated. 

These are 100 bp single-end reads prepared as RADseq. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data. 
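The read layout described above can be checked directly on the sequence lines. The sketch below is only an illustration: `split_reads` is a hypothetical helper, it assumes every read starts with a fixed-length barcode followed immediately by the overhang, and it prints just the first 20 bases of the remaining sequence.

```shell
# split_reads BARLEN OVERHANG < reads.fastq
# For each read print: barcode, overhang, first 20 bases of the locus, and a flag.
split_reads() {
    awk -v b="$1" -v site="$2" '
        NR % 4 == 2 {                                   # line 2 of each 4-line record
            barcode  = substr($0, 1, b)
            overhang = substr($0, b + 1, length(site))
            rest     = substr($0, b + length(site) + 1, 20)
            flag     = (overhang == site) ? "ok" : "no_cut_site"
            print barcode "\t" overhang "\t" rest "\t" flag
        }'
}

# Example: head -n 12 simRADs_R1.fastq | split_reads 6 TGCAG
```

For the three reads shown above, the barcode column would read TTTTAA, which the barcodes file assigns to sample 1B0.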

----------------   

### Step 1: de-multiplexing

This step uses information in the barcodes file to sort data into a separate file for each sample.  Below is the barcodes file, with sample names and their barcodes each on a separate line with a tab between them.

In [29]:
%%bash
cat simRADs.barcodes

1A0	CATCAT
1B0	TTTTAA
1C0	AGGGGA
1D0	TAAGGT
2E0	TTTATA
2F0	GAGTAT
2G0	ATAGAG
2H0	ATGAGG
3I0	GGGTTT
3J0	TTAAAA
3K0	GGATTG
3L0	AAGAAG
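Since reads are assigned to samples purely by barcode, it can be worth confirming that no barcode or sample name appears twice before running step 1. The following is a minimal sketch with a hypothetical helper name, `check_barcodes`; it assumes the tab-separated two-column layout shown above.

```shell
# check_barcodes FILE : report duplicated barcodes or sample names.
# Returns a non-zero exit status if a duplicate is found.
check_barcodes() {
    awk -F '\t' '
        NF < 2          { next }                                   # skip blank lines
        seen_bar[$2]++  { print "duplicate barcode: " $2;     bad = 1 }
        seen_name[$1]++ { print "duplicate sample name: " $1; bad = 1 }
        END             { if (!bad) print "barcodes OK"; exit bad + 0 }
    ' "$1"
}

# Example: check_barcodes simRADs.barcodes
```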


Step 1 writes the de-multiplexed data to a new file for each sample in a new directory created within the working directory called fastq/.

In [37]:
%%bash
pyRAD -p params.txt -s 1



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 .

You can see that this created a new file for each sample in the directory 'fastq/'

In [38]:
%%bash
ls fastq/

1A0_R1.fq.gz
1B0_R1.fq.gz
1C0_R1.fq.gz
1D0_R1.fq.gz
2E0_R1.fq.gz
2F0_R1.fq.gz
2G0_R1.fq.gz
2H0_R1.fq.gz
3I0_R1.fq.gz
3J0_R1.fq.gz
3K0_R1.fq.gz
3L0_R1.fq.gz


#### The statistics for step 1

A new directory called stats will also have been created. Each step of the _pyRAD_ analysis will create a new stats output file in this directory. The stats output for step 1 is below:

In [39]:
%%bash
cat stats/s1.sorting.txt

file    	Nreads	cut_found	bar_matched
simRADs_R1.fastq	480000	480000	480000


sample	true_bar	obs_bars	N_obs
3L0    	AAGAAG    	AAGAAG	40000   
1C0    	AGGGGA    	AGGGGA	40000   
2G0    	ATAGAG    	ATAGAG	40000   
2H0    	ATGAGG    	ATGAGG	40000   
1A0    	CATCAT    	CATCAT	40000   
2F0    	GAGTAT    	GAGTAT	40000   
3K0    	GGATTG    	GGATTG	40000   
3I0    	GGGTTT    	GGGTTT	40000   
1D0    	TAAGGT    	TAAGGT	40000   
3J0    	TTAAAA    	TTAAAA	40000   
2E0    	TTTATA    	TTTATA	40000   
1B0    	TTTTAA    	TTTTAA	40000   

nomatch  	_            	0
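A quick sanity check on this file is to confirm that the per-sample counts add up to the number of reads in the input. The sketch below assumes the two-table, tab-separated layout printed above (one observed barcode per sample); `check_sorting` is a hypothetical helper, not a _pyRAD_ command.

```shell
# check_sorting FILE : compare total reads with the sum of per-sample counts.
check_sorting() {
    awk -F '\t' '
        NF < 2               { next }                       # skip blank lines
        $2 ~ /^Nreads/       { section = "files";   next }  # header of the file table
        $2 ~ /^true_bar/     { section = "samples"; next }  # header of the sample table
        $1 ~ /^nomatch/      { nomatch = $NF; next }
        section == "files"   { total   += $2 }
        section == "samples" { matched += $NF }
        END { printf "total=%d matched=%d nomatch=%d\n", total, matched, nomatch }
    ' "$1"
}

# Example: check_sorting stats/s1.sorting.txt
```

For the output above, the twelve samples at 40000 reads each account for all 480000 reads, with nothing left unmatched.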


### Step 2: quality filtering

This step filters reads based on quality scores, and can also be used to detect Illumina adapters in your reads, which are sometimes a problem with homebrew-type library preparations. Here the filter setting is left at its default value of 0, meaning reads are filtered only on the quality scores of their base calls. The filtered files are written to a new directory called edits/.

In [40]:
%%bash
pyRAD -p params.txt -s 2



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	step 2: editing raw reads 
	............

In [41]:
%%bash
ls edits/

1A0.edit
1B0.edit
1C0.edit
1D0.edit
2E0.edit
2F0.edit
2G0.edit
2H0.edit
3I0.edit
3J0.edit
3K0.edit
3L0.edit


The filtered data are written in fasta format (quality scores removed) into a new directory called edits/. Below I show a preview of the file which you can view most easily using the `less` command (I use `head` here to make it fit in the text window better).

In [42]:
%%bash
head -n 10 edits/1A0.edit | cut -c 1-80

>1A0_0_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_1_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_2_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_3_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_4_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
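To see how many reads each sample kept after filtering, you can count the header lines in each .edit file. This is a small sketch with a hypothetical helper name, `count_fasta`; it relies only on every record starting with a `>` line, as in the preview above.

```shell
# count_fasta FILE... : print each file name and its number of fasta records.
count_fasta() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Example: count_fasta edits/*.edit
```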


### Step 3: clustering within-samples

Step 3 de-replicates and then clusters reads within each sample at the set clustering threshold, and writes the clusters to new files in a directory called clust.xx, where xx is the threshold (here clust.85).

In [43]:
%%bash
pyRAD -p params.txt -s 3



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity. Running 3 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 1C0 finished, 2000 loci
	sample 1A0 finished, 2000 loci
	sample 1B0 finished, 2000 loci
	sample 1D0 finished, 2000 loci
	sample 2E0 finished, 2000 loci
	sample 2F0 finished, 2000 loci
	sample 2G0 finished, 2000 loci
	sample 2H0 finished, 2000 loci
	sample 3I0 finished, 2000 loci
	sample 3K0 finished, 2000 loci
	sample 3J0 finished, 2000 loci
	sample 3L0 finished, 2000 loci


Once again, I recommend you use the unix command 'less' (or `zless`, since the files are gzip-compressed) to look at the clustS files. These contain each cluster separated by "//". For the first few clusters below you can see that there are one or two alleles in the cluster and one or a few reads that contain a (simulated) sequencing error. 

In [44]:
%%bash
## decompress to stdout first; plain `less` treats the .gz file as binary here
gunzip -c clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80

---------------


The stats output tells you how many clusters were found, and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide if you wish to increase or decrease the mindepth before it is applied for making consensus base calls in steps 4 & 5.

In [45]:
%%bash
head -n 40 stats/s3.clusters.txt


taxa	total	dpt.me	dpt.sd	d>5.tot	d>5.me	d>5.sd	badpairs
1A0	2000	20.0	0.0	2000	20.0	0.0	0
1B0	2000	20.0	0.0	2000	20.0	0.0	0
1C0	2000	20.0	0.0	2000	20.0	0.0	0
1D0	2000	20.0	0.0	2000	20.0	0.0	0
2E0	2000	20.0	0.0	2000	20.0	0.0	0
2F0	2000	20.0	0.0	2000	20.0	0.0	0
2G0	2000	20.0	0.0	2000	20.0	0.0	0
2H0	2000	20.0	0.0	2000	20.0	0.0	0
3I0	2000	20.0	0.0	2000	20.0	0.0	0
3J0	2000	20.0	0.0	2000	20.0	0.0	0
3K0	2000	20.0	0.0	2000	20.0	0.0	0
3L0	2000	20.0	0.0	2000	20.0	0.0	0

    ## total = total number of clusters, including singletons
    ## dpt.me = mean depth of clusters
    ## dpt.sd = standard deviation of cluster depth
    ## >N.tot = number of clusters with depth greater than N
    ## >N.me = mean depth of clusters with depth greater than N
    ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
    ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)

HISTOGRAMS

    
sample: 1A0
bins	depth_histogram	cnts
   :	0------------50-------------100
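When deciding whether to change the mindepth, it can help to express the table at the top of this file as the share of clusters in each sample that pass the depth cutoff. The sketch below assumes the eight-column, tab-separated table shown above, with the passing count in column 5; `depth_summary` is a hypothetical helper name.

```shell
# depth_summary FILE : per sample, print passing/total clusters and the percentage.
depth_summary() {
    awk -F '\t' '
        $1 == "taxa"       { in_table = 1; next }   # header row of the table
        in_table && NF < 8 { exit }                 # the table ends at the first short line
        in_table && $2 > 0 { printf "%s\t%d/%d\t%.1f%%\n", $1, $5, $2, 100 * $5 / $2 }
    ' "$1"
}

# Example: depth_summary stats/s3.clusters.txt
```

In the simulated data every sample has 2000 of 2000 clusters passing; with real data a low percentage for a sample suggests shallow sequencing for that individual.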

### Steps 4 & 5: Call consensus sequences

#### Step 4 jointly infers the error-rate and heterozygosity across samples.

In [66]:
%%bash
pyRAD -p params.txt -s 4



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	............

In [None]:
%%bash
less stats/Pi_E_estimate.txt

taxa	H	E
3K0	0.00135982	0.00048078	
1C0	0.00134858	0.00048372	
1D0	0.00135375	0.00048822	
3I0	0.00129751	0.00048694	
2H0	0.00133223	0.00049211	
2F0	0.00135365	0.0004995	
2E0	0.00126915	0.00051556	
1B0	0.00149924	0.00049663	
1A0	0.00136043	0.00051028	
3J0	0.00144422	0.0005089	
2G0	0.00138185	0.00051206	
3L0	0.00143349	0.00051991	
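To summarize these estimates across samples, the two columns can be averaged with awk. This is a minimal sketch with a hypothetical helper name, `mean_H_E`; it assumes the three-column layout shown above and counts each sample once even if the file lists it more than once.

```shell
# mean_H_E FILE : average heterozygosity (H) and error rate (E) over samples.
mean_H_E() {
    awk '
        NR == 1 || NF < 3 { next }               # skip the header and blank lines
        !seen[$1]++ { h += $2; e += $3; n++ }    # use the first row seen for each sample
        END { if (n) printf "n=%d mean_H=%.6f mean_E=%.6f\n", n, h / n, e / n }
    ' "$1"
}

# Example: mean_H_E stats/Pi_E_estimate.txt
```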


#### Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs.

In [None]:
%%bash
pyRAD -p params.txt -s 5

#### The stats output for step 5

In [None]:
%%bash
less stats/s5.consens.txt

### Step 6: Cluster across samples

Step 6 clusters consensus sequences across samples. It will print its progress to the screen. This uses 6 threads by default. If you enter 0 for param 37 it will use all available processors. 

In [None]:
%%bash
pyRAD -p params.txt -s 6 

### Step 7: Assemble final data sets

The final step is to output data only for the loci that you want to have included in your data set. This filters once again for potential paralogs or highly repetitive regions, and includes options to minimize the amount of missing data in the output. 

In [None]:
%%bash
pyRAD -p params.txt -s 7

### Final stats output

In [None]:
%%bash
less stats/c85m4p3.stats

---------------  

## Output formats ##

We created 11 output files from our analysis: the standard two (.loci and .excluded_loci), as well as the 9 additional ones listed in the params file. These are all shown below.

In [None]:
%%bash 
ls outfiles/

### Loci format  
The ".loci" file contains each locus listed in a fasta-like format that also shows which sites are variable below each locus. Autapomorphies are listed as '-' and shared SNPs as '*'. This is a custom format that is human readable and also used as input to perform D-statistic tests in pyRAD. This is the easiest way to visualize your results. I recommend viewing the file with the command `less`. Below I use `head` and `cut` to make it easy to view in this window.

In [None]:
%%bash 
head -n 39 outfiles/c85m4p3.loci | cut -c 1-75
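The '-' and '*' markers described above can also be tallied to get a rough count of variable sites. The sketch below rests on assumptions about the file layout that are not shown in this tutorial: that the marker line closing each locus starts with "//" and that any locus label on that line follows a "|". `count_snps` is a hypothetical helper name.

```shell
# count_snps FILE : tally variable-site markers on the "//" line that closes each locus.
# Assumes marker lines start with "//" and the locus label follows a "|".
count_snps() {
    awk '
        /^\/\// {
            sub(/\|.*$/, "")               # drop the trailing locus label, if any
            loci++
            auta   += gsub(/-/, "-")       # gsub returns the number of matches
            shared += gsub(/\*/, "*")
        }
        END { printf "loci=%d autapomorphies=%d shared_snps=%d\n", loci, auta, shared }
    ' "$1"
}

# Example: count_snps outfiles/c85m4p3.loci
```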

### PHY format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.phy | cut -c 1-85

### NEX format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.nex | cut -c 1-85

### Alleles format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.alleles | cut -c 1-85

### STRUCTURE (.str) format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.str | cut -c 1-20

### GENO (.geno) format (used in _Admixture_)

In [None]:
%%bash 
head -n 40 outfiles/c85m4p3.geno 

### SNPs format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.snps | cut -c 1-85

### UNLINKED_SNPs format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.unlinked_snps | cut -c 1-85

## OTHER FORMATS  

You may also produce some more complicated formatting options that involve pooling individuals into groups or populations. This can be done for the "treemix" and "migrate" outputs, which are formatted for input into the programs _TreeMix_ and _migrate-n_, respectively. Grouping individuals into populations is done with the final lines of the params file as shown below, in a way similar to the assignment of individuals into clades for hierarchical clustering (see full tutorial). 

Each line designates a group, and has three arguments that are separated by space or tab. The first is the group name, the second is the minimum number of individuals that must have data in that group for a locus to be included in the output, and the third is a list of the members of that group. Lists of taxa can include comma-separated names and wildcard selectors, like below. Example:


In [None]:
%%bash 
## append group designations to the params file
echo "pop1 4 1A0,1B0,1C0,1D0 " >> params.txt
echo "pop2 4 2E0,2F0,2G0,2H0 " >> params.txt
echo "pop3 4 3* " >> params.txt

## view params file
cat params.txt
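Before running step 7 it can be useful to preview which samples a member list selects, particularly when it uses a wildcard. The sketch below is an illustration only: `group_members` is a hypothetical helper, and it assumes the wildcard behaves like a shell glob, which may not match _pyRAD_'s own matching in every case.

```shell
# group_members MEMBERS NAME... : print the names matched by a comma-separated member list.
# The body runs in a subshell so the changes to IFS and globbing stay local.
group_members() (
    set -f                     # keep "3*" from expanding against files in the directory
    patterns=$1; shift
    IFS=,                      # split the member list on commas
    for name in "$@"; do
        for pat in $patterns; do
            case $name in
                $pat) echo "$name"; break ;;
            esac
        done
    done
)

# Example: group_members '3*' $(cut -f 1 simRADs.barcodes)
```

With the twelve sample names in this tutorial, the pattern `3*` would select 3I0, 3J0, 3K0 and 3L0, which meets the minimum of 4 individuals set for pop3.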

## Creating population output files  
Now if we run _pyRAD_ with the 'm' (migrate) or 't' (treemix) output options, it will create their output files. 

In [None]:
%%bash 
pyRAD -p params.txt -s 7

## TREEMIX format

In [None]:
%%bash 
gunzip -c outfiles/c85m4p3.treemix.gz | head -n 30

## MIGRATE-n FORMAT

In [None]:
%%bash 
head -n 40 outfiles/c85m4p3.migrate | cut -c 1-85