
This repo contains a tutorial on how to conduct basic population genomic analyses (i.e., compute genotype likelihoods, ancestry and relatedness) using population-level sampling and HTS data.


siriusb-nox/PopGen_DARWIN_2024


Workshop "Principles on population genomic analyses" - 4-8 March 2024

1. Introduction

This repository contains a tutorial guide to basic population genetic analysis, using as input Illumina sequencing reads derived from population-level sampling. The tutorial is based on the study of Pérez-Escobar et al. (2021) and uses read data obtained from 32 date palms and their close wild relatives (Phoenix dactylifera, P. atlantica, P. canariensis). It also includes data produced in intermediate steps so that the tutorial can be run from start to finish.

This tutorial is intended for users with a basic knowledge of programming and is designed to run in UNIX environments. Participants should ideally have experience with the shell and with text file manipulation (e.g., using awk, sed and grep), as well as some experience coding in R (for plotting figures). The workshop will be run on pre-configured laptops (Ubuntu 22.04). A basic introduction to the UNIX environment with some useful commands is available here.

This tutorial requires the following programs/dependencies (it is highly recommended to have these installed before starting the tutorial). Please make sure that the dependencies on which these programs run are also available:

  1. PALEOMIX: PALEOMIX is a bioinformatics pipeline designed to analyze contemporary and ancient DNA (aDNA) sequencing data in a population genomics framework. The pipeline has three modules, but here we will work only with the "bam_pipeline" module. It begins with adapter trimming and quality filtering of the read data. The filtered reads are then aligned to a reference genome using BWA or Bowtie2. The output is a set of BAM files. Read trimming, quality filtering and read mapping are all configured and controlled through a *.yaml file.
  2. samtools: This program reads, writes and manipulates *.sam and *.bam files (i.e., reads mapped against a reference genome).
  3. AdapterRemoval: AdapterRemoval is a tool designed to remove adapter sequences and low-quality bases from high-throughput sequencing data.
  4. BWA/Bowtie2: These programs align short sequencing reads against a reference genome. Both rely on the Burrows-Wheeler Transform (BWT) to index the reference genome and map short reads to it quickly; Bowtie2's index is an FM-index, a data structure built on the BWT. Both tools can map reads using different sensitivity thresholds, resulting in faster but less accurate or slower but more comprehensive alignment strategies.
  5. ANGSD: This program calculates Genotype Likelihoods (GLs), and several other metrics derived from GLs, using a series of BAM files as input. We will also explore related utilities that work on its output, including NGSadmix (to conduct admixture/structure analysis) and PCAngsd (principal component analysis from Genotype Likelihoods). The output of both NGSadmix and PCAngsd is visualised in R (through the package ggplot2).
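Before starting, it can help to confirm that each program in the list above is reachable from the terminal. The following is a minimal sketch, not part of the original tutorial: `check_tools` is a hypothetical helper, and the executable names passed to it are assumptions that may differ on your installation (for example, depending on how PCAngsd was installed).

```shell
#!/bin/sh
# Hypothetical helper: report which of the given executables are on PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The executable names below are assumptions; adjust them to your setup.
check_tools paleomix samtools AdapterRemoval bwa bowtie2 angsd NGSadmix pcangsd \
    || echo "Install the missing programs (or add them to PATH) before starting."
```

Any tool reported as MISSING is either not installed or installed in a directory that is not listed in PATH.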

2. Workshop structure

This tutorial is divided into three steps:

A. Read trimming, mapping and validation

B. Genotype Likelihood analysis

C. Principal component and admixture/structure analyses

Figure 1: Simplified view of the tutorial pipeline

Important

The base data needed to run this tutorial are available in the different subfolders of this repo and will be copied to your local machine (e.g., to /home/ontasia*/Documents/ONT-workshop-March-2024/fastq/ and /home/ontasia*/Documents/ONT-workshop-March-2024/BAM_CP/).

2.1. Pipeline configuration

In any bioinformatics pipeline, it is essential to state which programs the pipeline depends on and where they are located. All the files needed to execute this tutorial are available at /home/ontasia*/Documents.

Users who have the programs installed in a UNIX environment on their personal computers can make them available in the current terminal session by appending each program's directory to the PATH variable, for example:

PATH=$PATH:/directory/of/the/folder/programx
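The same idea can be wrapped in a small function that avoids adding a directory twice. This is a minimal sketch under stated assumptions: `add_to_path` is a hypothetical helper name, and /directory/of/the/folder/programx is the placeholder path from the command above, not a real location.

```shell
#!/bin/sh
# Hypothetical helper: append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$PATH:$1" ;;      # otherwise append it at the end
    esac
    export PATH
}

# Placeholder directory; replace it with the folder that holds your program.
add_to_path /directory/of/the/folder/programx

# Confirm the directory is now listed (PATH is split into one entry per line).
echo "$PATH" | tr ':' '\n' | grep -Fx /directory/of/the/folder/programx
```

A change made this way lasts only for the current session. Assuming bash is your shell (the default on Ubuntu), adding the PATH line to ~/.bashrc applies it to every new terminal.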
