# PCA and Friends Session 2
This is the second session, planned for September 24, 2021. We're going to compute principal components for real genomic data and get our feet wet with some of the standard tools.
 
## Prerequisites (requires your terminal)
In order to follow this session, we recommend that you first install miniconda (https://docs.conda.io/en/latest/miniconda.html). Once installed, you can use our environment file to easily install all the tools you need for this session.

First, clone the GitHub repository if you haven't done so already:

```{bash}
git clone https://github.com/stschiff/exp_dat_reading_group_2021.git
cd exp_dat_reading_group_2021/session_2
```


If you have already cloned the repository you can make sure you have the latest update by running `git pull` inside of it.

Once miniconda has been installed, you can install the environment via:

```{bash}
conda env create -f environment.yml
conda activate PCA_session_2
```

That's it. Now you should have all the tools ready.

### _Note_:

At the time of writing this tutorial, the version of `trident` on conda was v0.18.1. We would like to use the much faster version v0.21.0, which can be installed with the following commands:

_On a Mac:_

```
conda install -c https://169038-42372094-gh.circle-artifacts.com/0/tmp/artifacts/packages poseidon-trident
```

_On Linux:_

```
conda install -c https://169039-42372094-gh.circle-artifacts.com/0/tmp/artifacts/packages poseidon-trident
```

## Getting genotype data
We are going to use Poseidon (https://poseidon-framework.github.io/#/) to easily retrieve genotype data together with some useful annotation. The tool for accessing the Poseidon package repository is named `trident`, and if you followed the recommendation for installing the conda environment above, you should have it installed already. To make sure, check with `which trident` and `trident --version`.
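The check described above can be scripted. The following is a minimal sketch, not part of the original session: `version_ge` is a hypothetical helper, and the way the version number is pulled out of the `trident --version` output is an assumption about its format.

```bash
# Hypothetical helper: succeeds if version $1 is at least version $2 (both X.Y.Z).
version_ge() {
  # Sort the two versions numerically field by field and take the lower one.
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n1)
  [ "$lowest" = "$2" ]
}

if command -v trident >/dev/null 2>&1; then
  # Assumption: the first X.Y.Z string printed by --version is the trident version.
  installed=$(trident --version 2>&1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
  if version_ge "$installed" "0.21.0"; then
    echo "trident $installed is recent enough"
  else
    echo "trident $installed is older than 0.21.0"
  fi
else
  echo "trident not found on PATH; is the conda environment activated?"
fi
```

If the script reports an older version, the note above describes how to install v0.21.0.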

`trident` is a command line tool to manage Poseidon packages. Here we'll use it to automatically download packages that we need for this session. You can list all available packages like so:

In [2]:
!trident list --remote --packages

Downloading sample list from remote
Preparing output table
found 146 packages
.-----------------------------------------.----------------.
|                  Title                  | Nr Individuals |
| 2010_RasmussenNature                    | 1              |
| 2012_KellerNatureCommunications         | 1              |
| 2012_MeyerScience                       | 6              |
| 2012_PattersonGenetics                  | 1036           |
| 2012_PickrellNatureCommunications       | 9              |
| 2014_FuNature                           | 1              |
| 2014_GambaNatureCommunications          | 13             |
| 2014_LazaridisNature                    | 1222           |
| 2014_MalaspinasCurrentBiology           | 2              |
| 2014_OlaldeNature                       | 1              |
| 2014_RaghavanNature                     | 6              |
| 2014_RaghavanScience                    | 4              |
| 2014_RasmussenNature                    | 3              |
| 2014_

Here we specifically need the packages `2012_PattersonGenetics`, `2014_LazaridisNature` and `2019_Jeong_InnerEurasia`, which contain a lot of present-day individuals from around the world, and `2014_RaghavanNature`, which contains a famous 22,000-year-old individual from Siberia. Let's fetch those packages and copy them into a local folder called `session_2/poseidon-repository` within this repository:

In [3]:
!mkdir -p poseidon-repository
# This will take a few seconds to pull the data from the server
!trident fetch -d poseidon-repository -f "*2012_PattersonGenetics*,*2014_LazaridisNature*,*2019_Jeong_InnerEurasia*,*2014_RaghavanNature*"

trident v0.21.0 for poseidon v2.4.0
https://poseidon-framework.github.io

Searching POSEIDON.yml files... 2 found
Checking Poseidon versions... 
Initializing packages... 
> 2 
Packages loaded: 2
Downloading package list from remote
Determine requested packages... 4 requested and available
Comparing local and remote package state
Handling packages
2012_PattersonGenetics                  > 68.3MB to download
> 100.0% 
2014_LazaridisNature                    local 1.1.2 = remote 1.1.2
2014_RaghavanNature                     > 14.8MB to download
> 37.8% 

Great, now we have those packages. You can check out the files, e.g.:

In [4]:
!ls poseidon-repository/2014_LazaridisNature

2014_LazaridisNature.bed   2014_LazaridisNature.fam
2014_LazaridisNature.bib   2014_LazaridisNature.janno
2014_LazaridisNature.bim   POSEIDON.yml


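Here you can see the three genotype files (`.bed`, `.bim` and `.fam`), an annotation file ending in `.janno`, a `.bib` file with the literature references, and the `POSEIDON.yml` file that defines the package.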
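Since the genotype data is in PLINK format, the `.fam` file has one line per individual and the `.bim` file has one line per SNP, so a plain line count gives a quick overview of a package. A minimal sketch, where `plink_counts` is a hypothetical helper and not part of `trident`:

```bash
# Hypothetical helper: print the number of individuals and SNPs for a PLINK file prefix.
plink_counts() {
  prefix=$1
  # tr strips the padding that some wc implementations add around the count.
  n_ind=$(wc -l < "${prefix}.fam" | tr -d ' ')
  n_snp=$(wc -l < "${prefix}.bim" | tr -d ' ')
  echo "${n_ind} individuals, ${n_snp} SNPs"
}

# Only run on the fetched package if its files are actually present.
pkg=poseidon-repository/2014_LazaridisNature/2014_LazaridisNature
if [ -f "${pkg}.fam" ] && [ -f "${pkg}.bim" ]; then
  plink_counts "$pkg"
fi
```

The number of individuals reported this way should agree with the `Nr Individuals` column that `trident list` shows for the same package.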

You can also view lots of things about those packages using `trident`. For example:

In [5]:
!trident list --groups -d poseidon-repository/

trident v0.21.0 for poseidon v2.4.0
https://poseidon-framework.github.io

Searching POSEIDON.yml files... 5 found
Checking Poseidon versions... 
Initializing packages... 
> 5 
Packages loaded: 5
Preparing output table
found 370 groups/populations
.-----------------------------------------------.-------------------------.----------------.
|                     Group                     |        Packages         | Nr Individuals |
| AA                                            | 2014_LazaridisNature    | 12             |
| Abazin                                        | 2019_Jeong_InnerEurasia | 8              |
| Abazin_outlier                                | 2019_Jeong_InnerEurasia | 2              |
| Abkhasian                                     | 2014_LazaridisNature    | 9              |
| Adygei                                        | 2012_PattersonGenetics  | 16             |
| Adygei                                      

or:

In [6]:
!trident summarise -d poseidon-repository

trident v0.21.0 for poseidon v2.4.0
https://poseidon-framework.github.io

Searching POSEIDON.yml files... 5 found
Checking Poseidon versions... 
Initializing packages... 
> 5 
Packages loaded: 5
.------------------------.--------------------------------------------------------------.
|        Summary         |                            Value                             |
| Nr Individuals         | 3033                                                         |
| Individuals            | ABA-035, ABA-048, ABA-052, ABA-056, ABA-065, ABA-069, ABA-0… |
| Nr Groups              | 354                                                          |
| Groups                 | Russian: 71, Yoruba: 70, Bashkir: 53, Spanish: 53, Turkish:… |
| Nr Publications        | 5                                                            |
| Publications           | PattersonGenetics2012, LazaridisNature2014, RaghavanNature2… |
| Nr Countries           | 83

OK, for further analysis we want to merge these packages into one. In `trident` we can use the `forge` command for that. But we first need a population list that specifies what we would like to extract and merge. For this session, such a list is already provided, named `forge_file.txt`.
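If you later write your own population list, it can be worth checking it for blank lines and repeated names before forging. The following is a minimal sketch under the assumption that the list contains one group name per line, as in the file provided here; `check_pop_list` is a hypothetical helper, not a `trident` feature.

```bash
# Hypothetical helper: report blank lines and duplicated entries in a population list.
check_pop_list() {
  list=$1
  # grep -c exits non-zero when nothing matches, so guard it with || true.
  n_blank=$(grep -c '^[[:space:]]*$' "$list" || true)
  # Ignore blank lines, then count names that occur more than once.
  n_dup=$(grep -v '^[[:space:]]*$' "$list" | sort | uniq -d | wc -l | tr -d ' ')
  echo "${n_blank} blank lines, ${n_dup} duplicated entries"
}

# Only run on the session's list if it is present in the current directory.
if [ -f forge_file.txt ]; then
  check_pop_list forge_file.txt
fi
```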

Let's look at the `forge_file.txt`:

In [7]:
!head forge_file.txt

Abkhasian
Adygei
Albanian
Aleut
Altaian
Ami
Armenian
Atayal
Avar
Azeri


OK, so there are many population names here. Let's count them:

In [2]:
!wc -l forge_file.txt

     120 forge_file.txt


So there are 120 populations. Let's use them to forge a new Poseidon package that contains only the genotype data and metadata for individuals that belong to one of these 120 groups. `forge` takes a number of options (check them out using `trident forge --help`); here we're just using a basic set of them (note that this will take a few minutes):

In [10]:
!trident forge -d poseidon-repository -o forged_package -n PCA_package_1 --forgeFile forge_file.txt --intersect

trident v0.21.0 for poseidon v2.4.0
https://poseidon-framework.github.io

Searching POSEIDON.yml files... 5 found
Checking Poseidon versions... 
Initializing packages... 
> 5 
Packages loaded: 5
4 packages contain data for this forging operation
Creating new package directory: forged_package
Creating new package entity
Creating POSEIDON.yml
Creating .bib file
Compiling genotype data
Processing SNPs...
> 62000