# PCA and Friends Session 2
This is the second session, planned for September 24, 2021. We're going to compute principal components for real genomic data and get our feet wet with some of the standard tools.

## Prerequisites
In order to follow this session, we recommend that you first install miniconda (https://docs.conda.io/en/latest/miniconda.html). Once installed, you can use our environment file to easily install all the tools you need for this session.

First, clone the GitHub repository if you haven't done so already:

In [None]:
!git clone https://github.com/stschiff/exp_dat_reading_group_2021.git
%cd exp_dat_reading_group_2021/session_2


If you have already cloned the repository you can make sure you have the latest update by running `git pull` inside of it.

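Once miniconda has been installed, you can create and activate the environment with the two commands shown here. Note that each `!` command in a notebook runs in its own subshell, so `conda activate` does not carry over to later cells; it is best to run these commands (without the leading `!`) in a terminal before starting the notebook: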

In [None]:
!conda env create -f environment.yml
!conda activate PCA_session_2

That's it. Now you should have all the tools ready.

## Getting genotype data
We are going to use Poseidon (https://poseidon-framework.github.io/#/) to easily retrieve genotype data together with some useful annotation. The tool for accessing the Poseidon package repository is named `trident`, and if you followed the recommendation for installing the conda environment above, you should have it installed already. To make sure, check with `which trident` and `trident --version`.
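The same check can be done from Python inside the notebook. This is a minimal sketch using only the standard library; the helper name `find_tools` is made up for illustration, and the list of tools to look for is an assumption you can adapt.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


# Report whether trident is visible to the kernel running this notebook.
for tool, path in find_tools(["trident"]).items():
    print(f"{tool}: {path or 'not found (is the conda environment active?)'}")
```

If `trident` is reported as not found, the notebook kernel was most likely started outside the conda environment.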

`trident` is a command line tool to manage Poseidon packages. Here we'll use it to automatically download packages that we need for this session. You can list all available packages like so:

In [2]:
!trident list --remote --packages

Downloading sample list from remote
Preparing output table
found 146 packages
.-----------------------------------------.----------------.
|                  Title                  | Nr Individuals |
| 2010_RasmussenNature                    | 1              |
| 2012_KellerNatureCommunications         | 1              |
| 2012_MeyerScience                       | 6              |
| 2012_PattersonGenetics                  | 1036           |
| 2012_PickrellNatureCommunications       | 9              |
| 2014_FuNature                           | 1              |
| 2014_GambaNatureCommunications          | 13             |
| 2014_LazaridisNature                    | 1222           |
| 2014_MalaspinasCurrentBiology           | 2              |
| 2014_OlaldeNature                       | 1              |
| 2014_RaghavanNature                     | 6              |
| 2014_RaghavanScience                    | 4              |
| 2014_RasmussenNature                    | 3              |
| 2014_

Here we specifically need the packages `2012_PattersonGenetics`, `2014_LazaridisNature` and `2019_Jeong_InnerEurasia`, which contain many present-day individuals from around the world, and `2014_RaghavanNature`, which contains a famous roughly 24,000-year-old individual from Siberia. Let's fetch those packages into a local folder called `session_2/poseidon-repository` within this repository:

In [8]:
!mkdir -p poseidon-repository
!trident fetch -d poseidon-repository -f "*2012_PattersonGenetics*,*2014_LazaridisNature*,*2019_Jeong_InnerEurasia*,*2014_RaghavanNature*"

Searching POSEIDON.yml files... 3 found
Initializing packages... 
[2K[1G> 1 [2K[1G> 2 [2K[1G> 3 
Packages loaded: 3
Downloading package list from remote
Determine requested packages... 4 requested and available
Comparing local and remote package state
Handling packages
2012_PattersonGenetics                  local 1.0.1 = remote 1.0.1
2014_LazaridisNature                    local 1.1.2 = remote 1.1.2
2014_RaghavanNature                     > 14.8MB to download
[2K[1G> 5.4% [2K[1G> 10.8% [2K[1G> 16.2% [2K[1G> 21.6% [2K[1G> 27.0% [2K[1G> 32.4% [2K[1G> 37.8% [2K[1G> 43.2% [2K[1G> 48.6% [2K[1G> 54.1% [2K[1G> 59.5% [2K[1G> 64.9% [2K[1G> 70.3% [2K[1G> 75.7% [2K[1G> 81.1% [2K[1G> 86.5% [2K[1G> 91.9% [2K[1G> 97.3% [2K[1G> 100.0% 
2019_Jeong_InnerEurasia                 local 1.1.2 = remote 1.1.2


Great, now we have those packages. You can check out the files, e.g.:

In [6]:
!ls poseidon-repository/2014_LazaridisNature

2014_LazaridisNature.bed  2014_LazaridisNature.bim  2014_LazaridisNature.janno
2014_LazaridisNature.bib  2014_LazaridisNature.fam  POSEIDON.yml


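Here you can see three genotype files in binary PLINK format (`.bed`, `.bim` and `.fam`), an annotation file ending in `.janno`, a bibliography file ending in `.bib`, and the package definition file `POSEIDON.yml`.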
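To get a feel for these files, you can peek into the `.fam` file, which is plain text with one line per individual and six whitespace-separated columns (family ID, individual ID, father, mother, sex, phenotype). The sketch below counts individuals per value of the first column. It assumes that the first column holds the group name, which may not be true for every package, and the helper name `count_groups` and the example lines are made up for illustration.

```python
from collections import Counter


def count_groups(fam_lines):
    """Count individuals per value of the first .fam column (assumed to be the group name)."""
    counts = Counter()
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        counts[fields[0]] += 1
    return counts


# Made-up example lines in .fam layout, not taken from a real package.
example = [
    "GroupA ind1 0 0 1 1",
    "GroupA ind2 0 0 2 1",
    "GroupB ind3 0 0 0 1",
]
print(count_groups(example).most_common())
```

To apply this to a downloaded package, pass an open file handle instead of the example list, e.g. `count_groups(open("poseidon-repository/2014_LazaridisNature/2014_LazaridisNature.fam"))`.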

You can also inspect these packages in more detail using `trident`. For example:

In [9]:
!trident list --groups -d poseidon-repository/

Searching POSEIDON.yml files... 4 found
Initializing packages... 
[2K[1G> 1 [2K[1G> 2 [2K[1G> 3 [2K[1G> 4 
Packages loaded: 4
Preparing output table
found 367 groups/populations
.-----------------------------------------------.-------------------------.----------------.
|                     Group                     |        Packages         | Nr Individuals |
| AA                                            | 2014_LazaridisNature    | 12             |
| Abazin                                        | 2019_Jeong_InnerEurasia | 8              |
| Abazin_outlier                                | 2019_Jeong_InnerEurasia | 2              |
| Abkhasian                                     | 2014_LazaridisNature    | 9              |
| Adygei                                        | 2012_PattersonGenetics  | 16             |
| Adygei                                        | 2019_Jeong_InnerEurasia | 15             |
| Afar.WGA                                      | 2014_LazaridisNature

or:

In [10]:
!trident summarise -d poseidon-repository

Searching POSEIDON.yml files... 4 found
Initializing packages... 
[2K[1G> 1 [2K[1G> 2 [2K[1G> 3 [2K[1G> 4 
Packages loaded: 4
.------------------------.--------------------------------------------------------------.
|        Summary         |                            Value                             |
| Nr Individuals         | 3029                                                         |
| Individuals            | ABA-035, ABA-048, ABA-052, ABA-056, ABA-065, ABA-069, ABA-0… |
| Nr Groups              | 351                                                          |
| Groups                 | Russian: 71, Yoruba: 70, Bashkir: 53, Spanish: 53, Turkish:… |
| Nr Publications        | 4                                                            |
| Publications           | PattersonGenetics2012, LazaridisNature2014, RaghavanNature2… |
| Nr Countries           | 82                                                           |
| Countries              | Russia: 1008, Pakistan: 195, 

OK, for further analysis we want to merge these four packages. In `trident` we can use the `forge` command for that. But we first need a population list that specifies which groups we would like to extract and merge. For this session, such a list is already provided, named `forge_file.txt`.
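To make the idea of such a selection list more concrete, here is a small Python sketch that classifies entries of a selection. It assumes the entity syntax in which `*name*` denotes a whole package (as in the `fetch` call above), `<name>` denotes a single individual, and a bare name denotes a group, with entries separated by commas or line breaks and `#` starting a comment. Please check the `trident` documentation for the authoritative syntax. The function names and the example entries are hypothetical and do not reflect the actual contents of `forge_file.txt`.

```python
def classify_entity(token):
    """Return (kind, name) for one selection entry, under the assumed syntax."""
    token = token.strip()
    if len(token) > 2 and token.startswith("*") and token.endswith("*"):
        return ("package", token[1:-1])
    if len(token) > 2 and token.startswith("<") and token.endswith(">"):
        return ("individual", token[1:-1])
    return ("group", token)


def parse_selection(text):
    """Split a selection text into classified entries, ignoring '#' comments."""
    entities = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comment, if any
        for token in line.split(","):
            if token.strip():
                entities.append(classify_entity(token))
    return entities


# Hypothetical entries for illustration only.
example = "*2014_RaghavanNature*, GroupA\n<ind1>  # a single individual"
for kind, name in parse_selection(example):
    print(kind, name)
```

A list like this lets you mix whole packages, groups and single individuals in one merge, which is why it is worth reading through `forge_file.txt` before running `forge`.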