# Example _de novo_ RADseq assembly using _pyRAD_

# Modification to start looking at Ostrea data

----------  

Please direct questions about _pyRAD_ analyses to the Google Group ([link](https://groups.google.com/forum/#!forum/pyrad-users))  

--------------  



+  This tutorial is a walkthrough of a single-end RADseq analysis. If you have not yet read the [__full tutorial__](http://www.dereneaton.com/software/pyrad), start there for a broader description of how _pyRAD_ works. If you are new to RADseq analyses, this tutorial provides a simple overview of how to execute _pyRAD_, what the data files look like, how to check that your analysis is working, and what the expected output formats are.  



+  Each cell in this tutorial begins with the header (%%bash) indicating that the code should be executed in a command line shell, for example by copying and pasting the text into your terminal (but excluding the %%bash header).

-------------  


The original tutorial begins by downloading an example simulated RADseq data set and unarchiving it into the current directory. Here, the cells below instead create and move into a working directory for the Ostrea data.

In [16]:
pwd

u'/Volumes/web/halfshell/working-directory/16-05-17'

In [4]:
cd /Volumes/web/halfshell/working-directory


/Volumes/web/halfshell/working-directory


In [5]:
mkdir 16-05-17

In [6]:
cd 16-05-17

/Volumes/web/halfshell/working-directory/16-05-17


In [7]:
ls | head

------------   

The params file lists one parameter per line, followed by a __##__ mark, after which any comments can be left. Each comment describes the parameter and gives, in parentheses, the step of the analysis affected by it. Lines 1-12 are required; the remaining lines are optional. The params.txt file is further described in the general tutorial.

### evolving params file

In [20]:
%%bash
cat params.txt

./                        ## 1. Working directory                                 (all)
./*.fq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
/Applications/bioinfo/vsearch-1.11.1-osx-x86_64/bin/vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
/Applications/bioinfo/muscle3.8.31_i86darwin64                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                      ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
6                         ## 7. N processors (parallel)                           (all)
5                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
gbs                    

#### To change parameters you can edit params.txt in any text editor. Here to automate things I use the script below.
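
No script appears in this copy of the notebook, so here is a minimal sketch of one way to automate the edit. It assumes each parameter line has the form `value ## N. description`, as in the params file shown above; the function name `set_param` is hypothetical and is not part of _pyRAD_.

```python
def set_param(text, number, value):
    """Return params text with the value on parameter line `number` replaced.

    Assumes each parameter line looks like: "<value>   ## <number>. <description>".
    """
    out = []
    for line in text.splitlines():
        head, sep, comment = line.partition("##")
        # The comment starts with the parameter number, e.g. " 8. Mindepth: ..."
        if sep and comment.strip().startswith(f"{number}."):
            line = f"{str(value):<26}## {comment.strip()}"
        out.append(line)
    return "\n".join(out) + "\n"


# Two lines copied from the params file above.
example = (
    "6                         ## 7. N processors (parallel)                           (all)\n"
    "5                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)\n"
)

# Raise the minimum depth (parameter 8) from 5 to 10.
updated = set_param(example, 8, 10)
print(updated)
```

To apply this to the real file, read `params.txt` into a string, pass it through `set_param`, and write the result back.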

--------------   

__Let's take a look at what the raw data look like.__

Your input data will be in FASTQ format, usually ending in .fq or .fastq. Your data could be split among multiple files or all within a single file (de-multiplexing goes much faster if they are split into multiple files). The files may be compressed with gzip so that they have a .gz ending, but they do not need to be. The location of these files should be entered on line 2 of the params file. Below are the first three reads in the example file.

## Sample Description

<img src="http://eagle.fish.washington.edu/cnidarian/skitch/Genotype_by_sequencing_November_2015_·_RobertsLab_project-olympia_oyster-genomic_Wiki_🔊_1CEB70ED.png" alt="Genotype_by_sequencing_November_2015_·_RobertsLab_project-olympia_oyster-genomic_Wiki_🔊_1CEB70ED.png"/>

In [14]:
mkdir fastq

In [15]:
!cp /Volumes/web/nightingales/O_lurida/20160223_gbs/*1.fq.gz fastq/

In [18]:
ls fastq/

160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz
1HL_10A_1.fq.gz*
1HL_11A_1.fq.gz*
1HL_12A_1.fq.gz*
1HL_13A_1.fq.gz*
1HL_14A_1.fq.gz*
1HL_15A_1.fq.gz*
1HL_16A_1.fq.gz*
1HL_17A_1.fq.gz*
1HL_19A_1.fq.gz*
1HL_1A_1.fq.gz*
1HL_20A_1.fq.gz*
1HL_21A_1.fq.gz*
1HL_22A_1.fq.gz*
1HL_23A_1.fq.gz*
1HL_24A_1.fq.gz*
1HL_25A_1.fq.gz*
1HL_26A_1.fq.gz*
1HL_27A_1.fq.gz*
1HL_28A_1.fq.gz*
1HL_29A_1.fq.gz*
1HL_2A_1.fq.gz*
1HL_31A_1.fq.gz*
1HL_33A_1.fq.gz*
1HL_34A_1.fq.gz*
1HL_35A_1.fq.gz*
1HL_3A_1.fq.gz*
1HL_4A_1.fq.gz*
1HL_5A_1.fq.gz*
1HL_6A_1.fq.gz*
1HL_7A_1.fq.gz*
1HL_8A_1.fq.gz*
1HL_9A_1.fq.gz*
1NF_10A_1.fq.gz

In [16]:
!gunzip *.gz

^C


In [17]:
ls | head

1HL_10A_1.fq*
1HL_10A_2.fq*
1HL_10A_2.fq.gz*
1HL_11A_1.fq.gz*
1HL_11A_2.fq.gz*
1HL_12A_1.fq.gz*
1HL_12A_2.fq.gz*
1HL_13A_1.fq.gz*
1HL_13A_2.fq.gz*
1HL_14A_1.fq.gz*


In [28]:
%%bash
less simRADs_R1.fastq | head -n 12 | cut -c 1-90

@lane1_fakedata0_R1_0 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


------------   

Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer, and the fourth line holds the quality scores for the base calls. In this case the scores are arbitrarily high because the data were simulated. 
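
The four-line layout can be read with a few lines of Python. This is a sketch, not what _pyRAD_ does internally; the function name `read_fastq` and the two demo reads are made up for illustration, and Phred+33 quality encoding is assumed.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        name, seq, spacer, qual = record
        if not name.startswith("@") or not spacer.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {name!r}")
        yield name[1:], seq, qual


# Two hypothetical reads, built so that sequence and quality lengths match.
demo_seq = "TTTTAATGCAGTGAGTGG"
fastq_lines = [
    "@read_0", demo_seq, "+", "B" * len(demo_seq),
    "@read_1", demo_seq, "+", "B" * len(demo_seq),
]

records = list(read_fastq(fastq_lines))
for name, seq, qual in records:
    # Assuming Phred+33, 'B' (ASCII 66) decodes to a quality of 33.
    print(name, len(seq), min(ord(c) - 33 for c in qual))
```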

These are 100 bp single-end reads prepared as RADseq. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data. 
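
As an illustration of that layout, the sketch below slices the start of the first example read into barcode, overhang, and sequence data. The function name `split_read` is hypothetical; the barcode length (6) and overhang (TGCAG) come from the description above.

```python
BARCODE_LEN = 6
OVERHANG = "TGCAG"


def split_read(seq, barcode_len=BARCODE_LEN, overhang=OVERHANG):
    """Split a raw read into (barcode, overhang, remaining sequence)."""
    barcode = seq[:barcode_len]
    cut_site = seq[barcode_len:barcode_len + len(overhang)]
    if cut_site != overhang:
        raise ValueError(f"expected overhang {overhang!r}, found {cut_site!r}")
    insert = seq[barcode_len + len(overhang):]
    return barcode, cut_site, insert


# First 47 bases of the first simulated read shown above.
read = "TTTTAATGCAGTGAGTGGCCATGCAATATATATTTACGGGCGCATAG"
barcode, cut_site, insert = split_read(read)
print(barcode, cut_site, insert)
```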

----------------   

### Step 1: de-multiplexing

In [None]:
De-multiplexing was already done by BGI.

In [8]:
pwd

u'/Volumes/web-1/halfshell/working-directory/16-04-15c'

### Step 2: quality filtering

In [21]:
%%bash
pyRAD -p params.txt -s 2



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	step 2: editing raw reads 
	................................................................................................

In [22]:
%%bash
ls edits/

1HL_10A_1.edit
1HL_11A_1.edit
1HL_12A_1.edit
1HL_13A_1.edit
1HL_14A_1.edit
1HL_15A_1.edit
1HL_16A_1.edit
1HL_17A_1.edit
1HL_19A_1.edit
1HL_1A_1.edit
1HL_20A_1.edit
1HL_21A_1.edit
1HL_22A_1.edit
1HL_23A_1.edit
1HL_24A_1.edit
1HL_25A_1.edit
1HL_26A_1.edit
1HL_27A_1.edit
1HL_28A_1.edit
1HL_29A_1.edit
1HL_2A_1.edit
1HL_31A_1.edit
1HL_33A_1.edit
1HL_34A_1.edit
1HL_35A_1.edit
1HL_3A_1.edit
1HL_4A_1.edit
1HL_5A_1.edit
1HL_6A_1.edit
1HL_7A_1.edit
1HL_8A_1.edit
1HL_9A_1.edit
1NF_10A_1.edit
1NF_11A_1.edit
1NF_12A_1.edit
1NF_13A_1.edit
1NF_14A_1.edit
1NF_15A_1.edit
1NF_16A_1.edit
1NF_17A_1.edit
1NF_18A_1.edit
1NF_19A_1.edit
1NF_1A_1.edit
1NF_20A_1.edit
1NF_21A_1.edit
1NF_22A_1.edit
1NF_23A_1.edit
1NF_24A_1.edit
1NF_25A_1.edit
1NF_26A_1.edit
1NF_27A_1.edit
1NF_28A_1.edit
1NF_29A_1.edit
1NF_2A_1.edit
1NF_30A_1.edit
1NF_31A_1.edit
1NF_32A_1.edit
1NF_33A_1.edit
1NF_4A_1.edit
1NF_5A_1.edit
1NF_6A_1.edit
1NF_7A_1.edit
1NF_8A_1.edit
1NF_9A_1.edit
1SN_10A_1.edit
1SN_11A_1.edit
1SN_12A_1.edit
1SN_13A_1.ed

The filtered data are written in fasta format (quality scores removed) into a new directory called edits/. Below I show a preview of the file which you can view most easily using the `less` command (I use `head` here to make it fit in the text window better).

In [42]:
%%bash
head -n 10 edits/1A0.edit | cut -c 1-80

>1A0_0_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_1_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_2_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_3_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
>1A0_4_r1
TGCAGTGAGTGGCCATGCAATATATATTTACGGGCTCATAGAGACCCTCAAGACTGCCAACCGGGTGAATCACTATTTGC
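
The quality filter in step 2 is controlled by line 9 of the params file (NQual: the maximum number of sites with quality below 20, set to 4 here). The sketch below shows only that counting rule under the assumption of Phred+33 encoding; it is not _pyRAD_'s implementation, which is described in the full tutorial, and the function names are hypothetical.

```python
def count_low_quality(qual, min_q=20, offset=33):
    """Count base calls whose quality score is below min_q."""
    return sum(1 for char in qual if ord(char) - offset < min_q)


def passes_filter(qual, max_low=4, min_q=20, offset=33):
    """True if the read has at most max_low low-quality sites."""
    return count_low_quality(qual, min_q, offset) <= max_low


# Hypothetical quality strings: '#' decodes to Q2 and 'B' to Q33 under Phred+33.
good = "B" * 20
borderline = "#" * 4 + "B" * 16
bad = "#" * 5 + "B" * 15

for label, qual in [("good", good), ("borderline", borderline), ("bad", bad)]:
    print(label, count_low_quality(qual), passes_filter(qual))
```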


### Step 3: clustering within-samples

Step 3 de-replicates and then clusters reads within each sample by the set clustering threshold and writes the clusters to new files in a directory called clust.xx, where xx is the clustering threshold (clust.88 for this run).

In [23]:
%%bash
pyRAD -p params.txt -s 3



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 96 samples at 
	        '.88' similarity. Running 6 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 1HL_28A_1 finished, 222733 loci
	sample 1NF_29A_1 finished, 208385 loci
	sample 1SN_23A_1 finished, 239636 loci
	sample 1NF_4A_1 finished, 229489 loci
	sample 1SN_30A_1 finished, 245445 loci
	sample 1NF_13A_1 finished, 254548 loci
	sample 1HL_17A_1 finished, 210640 loci
	sample 1HL_23A_1 finished, 206951 loci
	sample 1NF_31A_1 finished, 225088 loci
	sample 1HL_19A_1 finished, 227411 loci
	sample 1SN_14A_1 finished, 224374 loci
	sample 1SN_21A_1 finished, 255586 loci
	sample 1HL_11A_1 finished, 248490 loci
	sample 1NF_2A_1 finished, 213124 loci
	sample 1NF_32A

Once again, I recommend you use the unix command `less` to look at the clustS files. These contain each cluster separated by "//". For the first few clusters below you can see that there are one or two alleles in the cluster and one or a few reads that contain a (simulated) sequencing error. 

In [44]:
%%bash
less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80

"clust.85/1A0.clustS.gz" may be a binary file.  See it anyway? 

---------------
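
The "may be a binary file" message above most likely appears because `less` was given the gzip-compressed file without a decompression step. One alternative is to read the file with Python's `gzip` module and split it on the "//" separators. The sketch below writes a small made-up file to a temporary directory so it can run anywhere; the function name `read_clusters`, the demo contents, and the read names are hypothetical.

```python
import gzip
import os
import tempfile


def read_clusters(path):
    """Return a list of clusters, each a list of lines, from a gzipped clustS file."""
    clusters, current = [], []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("//"):
                # A separator closes the current cluster (if it has any lines).
                if current:
                    clusters.append(current)
                current = []
            elif line:
                current.append(line)
    if current:
        clusters.append(current)
    return clusters


# Hypothetical contents: two clusters separated by "//".
demo = ">read_a\nACGT\n>read_b\nACGA\n//\n>read_c\nGGCC\n//\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.clustS.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(demo)
    clusters = read_clusters(path)

print(len(clusters), [len(cluster) for cluster in clusters])
```

For this run the real files would be under `clust.88/` rather than `clust.85/`.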


The stats output tells you how many clusters were found and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide whether to increase or decrease the mindepth before it is applied for making consensus base calls in steps 4 & 5.

In [25]:
%%bash
head -n 40 stats/s3.clusters.txt


taxa	total	dpt.me	dpt.sd	d>4.tot	d>4.me	d>4.sd	badpairs
1HL_10A_1	214259	10.557	58.189	80791	25.057	92.955	0
1HL_11A_1	248490	10.108	55.431	88283	25.292	91.045	0
1HL_12A_1	212718	8.428	55.01	74289	20.75	91.814	0
1HL_13A_1	188068	10.091	62.775	71921	23.46	100.068	0
1HL_14A_1	162946	8.863	47.184	60154	20.894	76.155	0
1HL_15A_1	174131	8.961	41.968	64243	21.168	67.352	0
1HL_16A_1	179175	9.342	50.905	66488	22.091	81.995	0
1HL_17A_1	210640	12.256	65.195	86125	27.361	100.04	0
1HL_19A_1	227411	11.168	64.315	86180	26.56	102.625	0
1HL_1A_1	181357	8.836	46.895	65464	21.267	76.478	0
1HL_20A_1	196210	10.522	60.132	73341	25.17	96.588	0
1HL_21A_1	208214	8.888	47.476	70596	22.838	79.696	0
1HL_22A_1	185962	8.033	43.009	63342	20.121	72.16	0
1HL_23A_1	206951	12.385	63.232	88083	26.646	95.071	0
1HL_24A_1	187361	10.38	55.382	71337	24.343	87.973	0
1HL_25A_1	191963	10.912	54.612	77634	24.281	84.102	0
1HL_26A_1	219979	10.208	61.922	80543	24.846	100.661	0
1HL_27A_1	210750	10.728	58.494	84222	24.131	90.89	0
1H
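
One quick way to use this table is to compute, per sample, the fraction of clusters that pass the depth cutoff (`d>4.tot` divided by `total`). The sketch below uses the first three rows shown above and assumes the columns are tab-separated; the function name `passing_fraction` is hypothetical.

```python
import csv
import io

# Header and first three rows copied from the stats output above.
stats_text = (
    "taxa\ttotal\tdpt.me\tdpt.sd\td>4.tot\td>4.me\td>4.sd\tbadpairs\n"
    "1HL_10A_1\t214259\t10.557\t58.189\t80791\t25.057\t92.955\t0\n"
    "1HL_11A_1\t248490\t10.108\t55.431\t88283\t25.292\t91.045\t0\n"
    "1HL_12A_1\t212718\t8.428\t55.01\t74289\t20.75\t91.814\t0\n"
)


def passing_fraction(text):
    """Map each sample to the fraction of its clusters that pass the depth cutoff."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["taxa"]: int(row["d>4.tot"]) / int(row["total"]) for row in reader}


fractions = passing_fraction(stats_text)
for taxon, fraction in fractions.items():
    print(f"{taxon}\t{fraction:.3f}")
```

In these three samples a little over a third of the clusters pass the cutoff.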

### Steps 4 & 5: Call consensus sequences

#### Step 4 jointly infers the error-rate and heterozygosity across samples.

In [24]:
%%bash
pyRAD -p params.txt -s 4



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	................................................................................................

In [26]:
%%bash
less stats/Pi_E_estimate.txt

taxa	H	E
1NF_1A_1	0.01185106	0.00226008	
1NF_19A_1	0.01143968	0.00219682	
1HL_14A_1	0.01171074	0.00226455	
1SN_24A_1	0.01193965	0.00226339	
1NF_18A_1	0.01197555	0.0021366	
1NF_12A_1	0.01122061	0.00189642	
1HL_5A_1	0.01127751	0.00221106	
1HL_15A_1	0.01183613	0.0021897	
1NF_25A_1	0.01152226	0.0020995	
1HL_6A_1	0.01178931	0.00229567	
1SN_29A_1	0.0112643	0.00221565	
1HL_16A_1	0.01177196	0.0021416	
1NF_11A_1	0.01146095	0.0020609	
1HL_1A_1	0.01176947	0.00225293	
1HL_22A_1	0.0118077	0.00223998	
1NF_10A_1	0.0113126	0.00216754	
1NF_27A_1	0.01192426	0.002178	
1HL_8A_1	0.01161779	0.00224845	
1HL_24A_1	0.01143772	0.00209474	
1HL_33A_1	0.01175588	0.00206054	
1HL_31A_1	0.01186919	0.00216658	
1NF_20A_1	0.01145001	0.00204476	
1HL_4A_1	0.01149712	0.00216607	
1NF_28A_1	0.01145683	0.00215104	
1HL_13A_1	0.01161161	0.00217329	
1NF_33A_1	0.01129083	0.00213716	
1NF_15A_1	0.01110835	0.00212711	
1NF_6A_1	0.01141639	0.00202308	
1HL_2A_1	0.0113882	0.00212527	
1NF_30A_1	0.01124972	0.00210509	
1SN_16A_1	0.01150705
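
Step 5 reports a single H and E pair for all samples. One plausible reading, which is an assumption and not confirmed in this notebook, is that these are the across-sample means of the table above. The sketch below averages the first three rows only, so its result is not expected to match the step 5 values; the function name `mean_h_e` is hypothetical.

```python
from statistics import mean

# Header and first three rows copied from stats/Pi_E_estimate.txt above.
estimates_text = (
    "taxa\tH\tE\n"
    "1NF_1A_1\t0.01185106\t0.00226008\n"
    "1NF_19A_1\t0.01143968\t0.00219682\n"
    "1HL_14A_1\t0.01171074\t0.00226455\n"
)


def mean_h_e(text):
    """Return (mean heterozygosity, mean error rate) over all complete rows."""
    h_values, e_values = [], []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated rows
        h_values.append(float(fields[1]))
        e_values.append(float(fields[2]))
    return mean(h_values), mean(e_values)


mean_h, mean_e = mean_h_e(estimates_text)
print(f"H={mean_h:.5f} E={mean_e:.5f}")
```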

#### Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs.

In [27]:
%%bash
pyRAD -p params.txt -s 5



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 5: creating consensus seqs for 96 samples, using H=0.01146 E=0.00214
	................................................................................................

#### The stats output for step 5

In [None]:
%%bash
less stats/s5.consens.txt

### Step 6: Cluster across samples

Step 6 clusters consensus sequences across samples. It will print its progress to the screen. This uses 6 threads by default. If you enter 0 for param 37 it will use all available processors. 

In [28]:
%%bash
pyRAD -p params.txt -s 6 

vsearch v1.11.1_osx_x86_64, 32.0GB RAM, 8 cores
https://github.com/torognes/vsearch


	finished clustering




     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 6: clustering across 96 samples at '.88' similarity 

Reading file /Volumes/web/halfshell/working-directory/16-05-17/clust.88/cat.haplos_ 100%
631997213 nt in 6607451 seqs, min 33, max 152, avg 96
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 333349 Size min 1, max 262, avg 19.8
Singletons: 85039, 1.3% of seqs, 25.5% of clusters
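
The summary figures printed by vsearch can be cross-checked from the three counts in the log above (sequences, clusters, singletons). This is plain arithmetic, shown as a sketch:

```python
# Counts copied from the vsearch log above.
n_seqs = 6607451
n_clusters = 333349
n_singletons = 85039

avg_cluster_size = n_seqs / n_clusters
singleton_pct_of_seqs = 100 * n_singletons / n_seqs
singleton_pct_of_clusters = 100 * n_singletons / n_clusters

print(f"avg cluster size: {avg_cluster_size:.1f}")
print(f"singletons: {singleton_pct_of_seqs:.1f}% of seqs, "
      f"{singleton_pct_of_clusters:.1f}% of clusters")
```

About a quarter of the across-sample clusters contain a single consensus sequence, that is, a locus recovered in only one of the 96 samples.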


### Step 7: Assemble final data sets

The final step is to output data only for the loci that you want to have included in your data set. This filters once again for potential paralogs or highly repetitive regions, and includes options to minimize the amount of missing data in the output. 

In [29]:
%%bash
pyRAD -p params.txt -s 7

	ingroup 1HL_10A_1,1HL_11A_1,1HL_12A_1,1HL_13A_1,1HL_14A_1,1HL_15A_1,1HL_16A_1,1HL_17A_1,1HL_19A_1,1HL_1A_1,1HL_20A_1,1HL_21A_1,1HL_22A_1,1HL_23A_1,1HL_24A_1,1HL_25A_1,1HL_26A_1,1HL_27A_1,1HL_28A_1,1HL_29A_1,1HL_2A_1,1HL_31A_1,1HL_33A_1,1HL_34A_1,1HL_35A_1,1HL_3A_1,1HL_4A_1,1HL_5A_1,1HL_6A_1,1HL_7A_1,1HL_8A_1,1HL_9A_1,1NF_10A_1,1NF_11A_1,1NF_12A_1,1NF_13A_1,1NF_14A_1,1NF_15A_1,1NF_16A_1,1NF_17A_1,1NF_18A_1,1NF_19A_1,1NF_1A_1,1NF_20A_1,1NF_21A_1,1NF_22A_1,1NF_23A_1,1NF_24A_1,1NF_25A_1,1NF_26A_1,1NF_27A_1,1NF_28A_1,1NF_29A_1,1NF_2A_1,1NF_30A_1,1NF_31A_1,1NF_32A_1,1NF_33A_1,1NF_4A_1,1NF_5A_1,1NF_6A_1,1NF_7A_1,1NF_8A_1,1NF_9A_1,1SN_10A_1,1SN_11A_1,1SN_12A_1,1SN_13A_1,1SN_14A_1,1SN_15A_1,1SN_16A_1,1SN_17A_1,1SN_18A_1,1SN_19A_1,1SN_1A_1,1SN_20A_1,1SN_21A_1,1SN_22A_1,1SN_23A_1,1SN_24A_1,1SN_25A_1,1SN_26A_1,1SN_27A_1,1SN_28A_1,1SN_29A_1,1SN_2A_1,1SN_30A_1,1SN_31A_1,1SN_32A_1,1SN_3A_1,1SN_4A_1,1SN_5A_1,1SN_6A_1,1SN_7A_1,1SN_8A_1,1SN_9A_1
	addon 
	exclude 
	
	final stats written to:
	 /Volumes/w



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

......

### Final stats output

In [None]:
%%bash
less stats/c85m4p3.stats

---------------  

## Output formats ##

The original tutorial run created 11 output files: the standard two (.loci and .excluded_loci) plus the 9 additional formats listed in the params file. This run wrote only the standard two, as shown below.

In [30]:
%%bash 
ls outfiles/

gbs-001.excluded_loci
gbs-001.loci


### Loci format  
The ".loci" file contains each locus listed in a fasta-like format that also shows which sites are variable below each locus. Autapomorphies are listed as '-' and shared SNPs as '*'. This is a custom format that is human readable and is also used as input to perform D-statistic tests in pyRAD. This is the easiest way to visualize your results. I recommend viewing the file with the command `less`. Below I use `head` and `cut` to make it easy to view in this window.

In [31]:
%%bash 
head -n 39 outfiles/gbs-001.loci | cut -c 1-75

>1HL_11A_1    NGTGACCTCGAGCATGTGAC--ATTTCAAAGCCAAATTAACTTTTAGAGAGAAAAAYCCC-
>1HL_34A_1    TGTGACCTCGAGCATGTGAC--ATTTCAAAGCCAAATTAACTTTTAGASAGAAAAMCCCCA
>1NF_9A_1     TGTGACCTCGAGCATGTGAC--ATTTCAAMGCCAAATTAACTTTTAGAGAGAAAAACCCCA
>1SN_11A_1    WGTGACCTCGAGCATGTGACNNATTTCAAAGCCAAATTAACTTTTAGANAGAAAA-YCCCA
>1SN_13A_1    TGTGACCTCGAGCATGTGRC--ATTTCAAAGCCAAATTAACTTTTAGAGAGAAAA-YSCCA
>1SN_14A_1    TGTGACCTCGAGCATGTGAC--ATTTCAAAGCCAAATTAACTTTTAGASAGAAAAMCCCCA
//            -                 -          -                  *      **-   
>1HL_10A_1    CTATAGATATACAAACACTATGTANTCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_11A_1    CTATAGATATACAAACACTATGTAATCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_14A_1    CTATAGATATACAAACACTATGTAATCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_15A_1    CTATAGATATACAAACACTATGTAATCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_16A_1    CTATAGATATACAAACACTATGTAATCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_17A_1    CTATAGATATACAAACACTATGTAATCTAAGCCTTTCGGGTACAGGCTCGTCAATATACTC
>1HL_19A_1  
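
A .loci file can also be summarized programmatically. The sketch below assumes only what is described above: sample lines start with '>', and each locus ends with a line starting with "//" that marks autapomorphies with '-' and shared SNPs with '*'. The function name `parse_loci`, the demo text, and the sample names are made up for illustration.

```python
def parse_loci(text):
    """Return one dict per locus: its sequences and counts of marked variable sites."""
    loci, samples = [], {}
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if len(parts) != 2:
                continue  # skip truncated lines with a name but no sequence
            name, seq = parts
            samples[name] = seq
        elif line.startswith("//"):
            marks = line[2:]
            loci.append({
                "samples": samples,
                "autapomorphies": marks.count("-"),
                "shared_snps": marks.count("*"),
            })
            samples = {}
    return loci


# Hypothetical two-locus excerpt in the same layout as the output above.
demo_loci = (
    ">sampleA    ACGTACGT\n"
    ">sampleB    ACGTACCT\n"
    ">sampleC    ACTTACCT\n"
    "//            -   *\n"
    ">sampleA    GGGTTT\n"
    ">sampleB    GGGTTT\n"
    "//\n"
)

loci = parse_loci(demo_loci)
for index, locus in enumerate(loci, start=1):
    print(index, len(locus["samples"]), locus["autapomorphies"], locus["shared_snps"])
```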

### PHY format

In [32]:
%%bash 
head -n 50 outfiles/gbs-001.phy | cut -c 1-85

head: outfiles/gbs-001.phy: No such file or directory


### NEX format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.nex | cut -c 1-85

### Alleles format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.alleles| cut -c 1-85

### STRUCTURE (.str) format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.str | cut -c 1-20

### GENO (.geno) format (used in _Admixture_)

In [None]:
%%bash 
head -n 40 outfiles/c85m4p3.geno 

### SNPs format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.snps | cut -c 1-85

### UNLINKED_SNPs format

In [None]:
%%bash 
head -n 50 outfiles/c85m4p3.unlinked_snps | cut -c 1-85

## OTHER FORMATS  

You may also produce some more complicated formatting options that involve pooling individuals into groups or populations. This can be done for the "treemix" and "migrate" outputs, which are formatted for input into the programs _TreeMix_ and _migrate-n_, respectively. Grouping individuals into populations is done with the final lines of the params file as shown below, and similar to the assignment of individuals into clades for hierarchical clustering (see full tutorial). 

Each line designates a group, and has three arguments that are separated by space or tab. The first is the group name, the second is the minimum number of individuals that must have data in that group for a locus to be included in the output, and the third is a list of the members of that group. Lists of taxa can include comma-separated names and wildcard selectors, like below. Example:


In [None]:
%%bash 
## append group designations to the params file
echo "pop1 4 1A0,1B0,1C0,1D0 " >> params.txt
echo "pop2 4 2E0,2F0,2G0,2H0 " >> params.txt
echo "pop3 4 3* " >> params.txt

## view params file
cat params.txt
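
To check which samples a group line would select before running step 7, the three-field layout (name, minimum count, member list) can be parsed as sketched below. Shell-style matching from Python's `fnmatch` module stands in for the wildcard selector; how _pyRAD_ resolves wildcards internally is not shown in this notebook, so treat that as an assumption. The function name `parse_group` and the sample list are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_group(line, sample_names):
    """Return (group name, minimum count, matching samples) for one group line."""
    name, min_count, members = line.split(None, 2)
    patterns = [p.strip() for p in members.split(",") if p.strip()]
    matched = [s for s in sample_names
               if any(fnmatchcase(s, p) for p in patterns)]
    return name, int(min_count), matched


# Hypothetical sample names in the style of the group lines above.
samples = ["1A0", "1B0", "1C0", "1D0",
           "2E0", "2F0", "2G0", "2H0",
           "3I0", "3J0", "3K0", "3L0"]

groups = [parse_group(line, samples) for line in (
    "pop1 4 1A0,1B0,1C0,1D0 ",
    "pop2 4 2E0,2F0,2G0,2H0 ",
    "pop3 4 3* ",
)]
for name, min_count, members in groups:
    print(name, min_count, members)
```

For the Ostrea samples in this run, a selector such as `1HL*` would be the analogous pattern, again assuming shell-style matching.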

## Creating population output files  
Now if we run _pyRAD_ with the 'm' (migrate) or 't' (treemix) output options, it will create their output files. 

In [None]:
%%bash 
pyRAD -p params.txt -s 7

## TREEMIX format

In [None]:
%%bash 
less outfiles/c85m4p3.treemix.gz | head -n 30

## MIGRATE-n FORMAT

In [None]:
%%bash 
head -n 40 outfiles/c85m4p3.migrate | cut -c 1-85