# Recreating the NearEastPublic dataset analyses
We used the publicly available data from Lazaridis, I., Patterson, N., Mittnik, A. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413 (2014). https://doi.org/10.1038/nature13673

## Note on smartpca settings
In their original publication, Lazaridis et al. set `numoutlieriters: 0` for the smartpca analyses. The `shrinkmode` option was not yet available in smartpca at that time, so we set `shrinkmode: NO` to match their setup.
The HO array is often used as a reference for mapping ancient samples (especially the HO-west subset). Many analyses use the default smartpca settings, which correspond to `numoutlieriters: 5`, so we additionally analyzed this setting. Since its introduction, `shrinkmode` has been applied frequently, so we also ran all analyses with `shrinkmode: YES`.

This leads to four tested settings for the HO-west dataset:
- `numoutlieriters: 0` and `shrinkmode: NO`
- `numoutlieriters: 0` and `shrinkmode: YES`
- `numoutlieriters: 5` and `shrinkmode: NO`
- `numoutlieriters: 5` and `shrinkmode: YES`

To economize on resources, we did not test all four settings for the HO-global set. Instead, we only analyzed the `numoutlieriters: 0` and `shrinkmode: NO` combination. 
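
The four combinations listed above can be enumerated programmatically. The following is a minimal sketch and not the code used for the analyses: the helper names `smartpca_settings` and `format_settings` are hypothetical, while `numoutlieriters` and `shrinkmode` are the smartpca parameter names discussed above.

```python
import itertools

# Values tested for each smartpca option (see the list above).
NUMOUTLIERITERS = (0, 5)
SHRINKMODE = ("NO", "YES")


def smartpca_settings():
    """Return every (numoutlieriters, shrinkmode) combination as a dict."""
    return [
        {"numoutlieriters": n, "shrinkmode": s}
        for n, s in itertools.product(NUMOUTLIERITERS, SHRINKMODE)
    ]


def format_settings(settings):
    """Render one combination in the `key: value` syntax of a parameter file."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


for settings in smartpca_settings():
    print(format_settings(settings))
    print()
```

For the HO-global set, only the first entry of this list (`numoutlieriters: 0`, `shrinkmode: NO`) would be used.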

In [2]:
# pathlib and tempfile are used directly in the cells below
import pathlib
import tempfile

from filter_merge_utils import *

Run the following cell to download the dataset from the Reich Lab website and to create the directory layout used below.

In [5]:
# Each `!` line runs in its own subshell, so a `cd` does not carry over to the next line;
# all paths are therefore given relative to the notebook directory.
! wget https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz
! mkdir -p NearEastPublic/raw NearEastPublic/global NearEastPublic/westEurasia NearEastPublic/westEurasia_ancient
! tar -xf NearEastPublic.tar.gz -C NearEastPublic/raw

In [3]:
base_dir = pathlib.Path("NearEastPublic")
dataset_prefix = base_dir / "raw" / "HumanOriginsPublic2068"

## Global EigenDataset
Next, we filter the dataset to reproduce the dataset of global samples.
For this, we use the list of global populations as provided in the supplement of the publication.

In [4]:
global_prefix = base_dir / "global" / "HumanOriginsPublic2068.global"
global_population_file = pathlib.Path(f"{global_prefix}.populations.txt")

global_set = get_pop_set_from_string("AA, Algonquin, Ami, Atayal, Basque, BedouinB, Biaka, Bougainville, Brahui, Cabecar, Chipewyan, Chukchi, Damara, Datog, Dinka, Esan, Eskimo, Georgian, Gui, GujaratiD, Hadza, Han, Itelmen, Ju_hoan_North, Kalash, Karitiana, Kharia, Korean, Koryak, LaBrana, Lahu, Lodhi, Loschbour, MA1, Mala, Mandenka, Masai, Mbuti, Mozabite, Naxi, Nganasan, Onge, Papuan, Pima, Sandawe, Sardinian, She, Somali, Stuttgart, Surui, Tubalar, Ulchi, Vishwabrahmin, Yoruba")
save_pop_set(global_set, global_population_file)

# filter the global populations
filter_dataset(
    prefix_in=dataset_prefix,
    prefix_out=global_prefix,
    poplistname=global_population_file,
    redo=False
)
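
The helpers `get_pop_set_from_string` and `save_pop_set` come from the local `filter_merge_utils` module, whose source is not shown here. As an assumption about their behavior, the sketch below parses a comma-separated string into a set of names and writes one name per line, which is the layout a population list file normally has. The function names `parse_population_string` and `write_population_file` are hypothetical stand-ins, not the module's API.

```python
import pathlib
import tempfile


def parse_population_string(populations: str) -> set[str]:
    """Split a comma-separated string into a set of stripped, non-empty names."""
    return {name.strip() for name in populations.split(",") if name.strip()}


def write_population_file(populations: set[str], path: pathlib.Path) -> None:
    """Write one population name per line, sorted for reproducible output."""
    path.write_text("\n".join(sorted(populations)) + "\n")


# Duplicate names collapse because the result is a set.
example = parse_population_string("Basque, Han, Yoruba, Han")

with tempfile.TemporaryDirectory() as tmpdir:
    out = pathlib.Path(tmpdir) / "example.populations.txt"
    write_population_file(example, out)
    print(out.read_text())
```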

## West-Eurasian EigenDataset
Next, we filter the dataset to reproduce the dataset of West-Eurasian samples.
For this, we use the list of West-Eurasian populations as provided in the supplement of the publication.

In [7]:
westeurasian_prefix = base_dir / "westEurasia" / "HumanOriginsPublic2068.westEurasian"
westeurasian_population_file = pathlib.Path(f"{westeurasian_prefix}.populations.txt")

westeurasian_set = get_pop_set_from_string("Abkhasian, Adygei, Albanian, Armenian, Ashkenazi_Jew, Jew_Ashkenazi, Balkar, Basque, BedouinA, BedouinB, Belarusian, Bergamo, Bulgarian, Canary_Islanders, Canary_Islander, Chechen, Croatian, Cypriot, Czech, Druze, English, Estonian, Finnish, French, French_South, Georgian, Georgian_Jew, Jew_Georgian, Greek, Hungarian, Icelandic, Iranian, Iranian_Jew, Jew_Iranian, Iraqi_Jew, Jew_Iraqi, Italian_South, Jordanian, Kumyk, LaBrana, Lebanese, Lezgin, Libyan_Jew, Jew_Libyan, Lithuanian, Loschbour, Maltese, Mordovian, Moroccan_Jew, Jew_Moroccan, Motala12, Motala_merge, North_Ossetian, Norwegian, Orcadian, Palestinian, Russian, Sardinian, Saudi, Scottish, Sicilian, Spanish, Spanish_North, Stuttgart, Syrian, Tunisian_Jew, Jew_Tunisian, Turkish, Turkish_Jew, Jew_Turkish, Tuscan, Ukrainian, Yemenite_Jew, Jew_Yemenite")

save_pop_set(westeurasian_set, westeurasian_population_file)

# filter the westeurasian populations
filter_dataset(
    prefix_in=dataset_prefix,
    prefix_out=westeurasian_prefix,
    poplistname=westeurasian_population_file,
    redo=True
)
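
The population list above contains several populations under two spellings (for example `Ashkenazi_Jew` and `Jew_Ashkenazi`), apparently to cover both naming variants. A quick way to see which spelling a dataset actually uses is to compare the requested names with the labels in the `.ind` file. The sketch below is an illustration under the assumption that the `.ind` file follows the usual EIGENSTRAT layout (sample ID, sex, population label, separated by whitespace); the function names and the toy sample IDs are made up.

```python
import pathlib
import tempfile


def populations_in_indfile(ind_file: pathlib.Path) -> set[str]:
    """Collect the population labels (third column) of an EIGENSTRAT .ind file."""
    labels = set()
    for line in ind_file.read_text().splitlines():
        fields = line.split()
        if len(fields) >= 3:
            labels.add(fields[2])
    return labels


def missing_populations(requested: set[str], ind_file: pathlib.Path) -> set[str]:
    """Return the requested population names that do not occur in the .ind file."""
    return requested - populations_in_indfile(ind_file)


# Toy .ind file with made-up sample IDs, for demonstration only.
with tempfile.TemporaryDirectory() as tmpdir:
    toy_ind = pathlib.Path(tmpdir) / "toy.ind"
    toy_ind.write_text("S1 M Basque\nS2 F Jew_Ashkenazi\n")
    requested = {"Basque", "Ashkenazi_Jew", "Jew_Ashkenazi"}
    print(sorted(missing_populations(requested, toy_ind)))
```

Names reported as missing are harmless in a population list as long as the alternative spelling is present.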

### Merging the west-eurasian samples with ancient samples

In [8]:
ancient_prefix = base_dir / "raw" / "AncientLazaridis2016"
merged_prefix = base_dir / "westEurasia_ancient" / "AncientLazaridis2016_ModernWestEurasia"

# The ancient samples contain some samples we want to exclude prior to PCA (e.g. Chimp sequences)
with tempfile.TemporaryDirectory() as tmpdir:
    tmpdir = pathlib.Path(tmpdir)
    tmp_merged_prefix = tmpdir / "merged"
    merge_datasets(
        prefix_ds1=westeurasian_prefix,
        prefix_ds2=ancient_prefix,
        prefix_out=tmp_merged_prefix,
        redo=True
    )

    ind_file = pathlib.Path(f"{tmp_merged_prefix}.ind")
    ind_df = indfile_to_dataframe(ind_file)

    # population list passed to filter_dataset: every population that is not explicitly excluded
    keep_populations = tmpdir / "keep.poplist.txt"
    exclude = ["Mota", "Denisovan", "Chimp", "Mbuti.DG", "Altai",
               "Vi_merge", "Clovis", "Kennewick", "Chuvash", "Ust_Ishim",
               "AG2", "MA1", "MezE", "hg19ref", "Kostenki14"]
    keep = [p for p in ind_df.population.unique() if p not in exclude]
    # write_text opens, writes, and closes the file in one call
    keep_populations.write_text("\n".join(keep))

    filter_dataset(
        prefix_in=tmp_merged_prefix,
        prefix_out=merged_prefix,
        poplistname=keep_populations,
        redo=True
    )

# finally, save the population names of the modern samples in a specific file such that we can later use it for the PCA projection
ancient_populations = ["Anatolia_ChL", "Anatolia_N", "Armenia_ChL", "Armenia_EBA", "Armenia_MLBA", "CHG", "EHG", "Europe_EN", "Europe_LNBA", "Europe_MNChL", "Iberia_BA", "Iran_ChL", "Iran_HotuIIIb", "Iran_LN", "Iran_N", "Levant_BA", "Levant_N", "Natufian", "SHG", "Steppe_EMBA", "Steppe_Eneolithic", "Steppe_IA", "Steppe_MLBA", "Switzerland_HG", "WHG"]

ind_file = pathlib.Path(f"{merged_prefix}.ind")
ind_df = indfile_to_dataframe(ind_file)

modern = [p for p in ind_df.population.unique() if p not in ancient_populations]
modern_populations = base_dir / "westEurasia_ancient" / "modern.poplist.txt"
# write_text opens, writes, and closes the file in one call
modern_populations.write_text("\n".join(modern))

parameter file: /var/folders/c3/6cf6l4n106v0gfcqwr1bt8nr0000gn/T/tmpjq5ork21
geno1: NearEastPublic/westEurasia/HumanOriginsPublic2068.westEurasian.geno
snp1: NearEastPublic/westEurasia/HumanOriginsPublic2068.westEurasian.snp
ind1: NearEastPublic/westEurasia/HumanOriginsPublic2068.westEurasian.ind
geno2: NearEastPublic/raw/AncientLazaridis2016.geno
snp2: NearEastPublic/raw/AncientLazaridis2016.snp
ind2: NearEastPublic/raw/AncientLazaridis2016.ind
genooutfilename: /var/folders/c3/6cf6l4n106v0gfcqwr1bt8nr0000gn/T/tmpnbouugix/merged.geno
snpoutfilename: /var/folders/c3/6cf6l4n106v0gfcqwr1bt8nr0000gn/T/tmpnbouugix/merged.snp
indoutfilename: /var/folders/c3/6cf6l4n106v0gfcqwr1bt8nr0000gn/T/tmpnbouugix/merged.ind
hashcheck: NO
packed geno read OK
end of inpack
packed geno read OK
end of inpack
numsnps: 621799  numindivs: 1107
packedancestrymap output
##end of mergeit run
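
The file `modern.poplist.txt` is written so that it can later be used for the PCA projection. In smartpca, `poplistname` restricts the computation of the principal components to the listed populations, and samples from all other populations are projected onto them. The sketch below shows how such a parameter file could be assembled. It is an illustration under stated assumptions, not the configuration used for the analyses: the function name, the output prefix, and the choice of `lsqproject: YES` are hypothetical.

```python
import pathlib


def smartpca_projection_parfile(prefix, poplist, out_prefix,
                                numoutlieriters=0, shrinkmode="NO"):
    """Build the text of a smartpca parameter file that projects all
    populations not listed in `poplist` onto the computed components."""
    lines = [
        f"genotypename: {prefix}.geno",
        f"snpname: {prefix}.snp",
        f"indivname: {prefix}.ind",
        f"evecoutname: {out_prefix}.evec",
        f"evaloutname: {out_prefix}.eval",
        f"poplistname: {poplist}",
        "lsqproject: YES",  # assumption: least-squares projection of the ancient samples
        f"numoutlieriters: {numoutlieriters}",
        f"shrinkmode: {shrinkmode}",
    ]
    return "\n".join(lines) + "\n"


base_dir = pathlib.Path("NearEastPublic")
merged_prefix = base_dir / "westEurasia_ancient" / "AncientLazaridis2016_ModernWestEurasia"
modern_populations = base_dir / "westEurasia_ancient" / "modern.poplist.txt"

# The output prefix below is a hypothetical name chosen for this example.
parfile = smartpca_projection_parfile(
    prefix=merged_prefix,
    poplist=modern_populations,
    out_prefix=base_dir / "westEurasia_ancient" / "pca_projection",
)
print(parfile)
```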
