# Genomic Data of 2155 Dogs

For our third empirical study, we use genomic data of 2155 dogs published by Morrill et al. (2022; https://doi.org/10.1126/science.abk0639).

To run this notebook, you first need to download the supplementary data from the Dryad repository: https://doi.org/10.5061/dryad.g4f4qrfr0.

You only need the `DarwinsArk.zip` and `GeneticData.zip` archives. Move them into a new directory called `DogGenomes` and unpack them there.
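The unpacking step can be sketched in Python as follows. This is a minimal sketch, not part of the original analysis: the helper name `unpack_archives` is hypothetical, and it assumes that each archive contains a top-level folder of the same name (`DarwinsArk`, `GeneticData`), which is the layout the cells in this notebook expect.

```python
import pathlib
import zipfile


def unpack_archives(target_dir: pathlib.Path, archive_names: list[str]) -> list[pathlib.Path]:
    """Unpack each named ZIP archive found in target_dir into target_dir.

    Hypothetical helper; returns the archives that were actually unpacked.
    """
    unpacked = []
    for name in archive_names:
        archive = target_dir / name
        if not archive.exists():
            # Missing archives are reported and skipped rather than raising
            print(f"Archive not found, skipping: {archive}")
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
        unpacked.append(archive)
    return unpacked


if __name__ == "__main__":
    unpack_archives(pathlib.Path("DogGenomes"), ["DarwinsArk.zip", "GeneticData.zip"])
```

If the archives turn out to have a different internal layout, adjust `darwins_ark` and `genetic_data` in the setup cell accordingly.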

Following the scripts provided under DOI https://doi.org/10.5281/zenodo.5808329, we first restricted the dataset to dogs with genetic data. We then extracted two groups of dogs:
- pure-bred (n = 601) -> `pure_bred`
- highly-admixed (n = 1071) -> `highly_admixed`

Using PLINK, we extracted the genetic data for each group based on the IDs of its dogs and converted the result to EIGENSTRAT format.

Note that we did not apply additional LD pruning or MAF filtering, as the provided data was already filtered by the authors of the study.

In [18]:
import pandas as pd
import pathlib
import subprocess
import tempfile

from pandora.converter import run_convertf, FileFormat

PLINK = "plink"  # TODO: Replace with a path to PLINK if not in $PATH
CONVERTF = "convertf"  # TODO: Replace with a path to convertf if not in $PATH

base_dir = pathlib.Path("DogGenomes")
darwins_ark = base_dir / "DarwinsArk"
genetic_data = base_dir / "GeneticData"
plink_prefix = genetic_data / "DarwinsArk_gp-0.70_snps-only_maf-0.02_geno-0.20_hwe-midp-1e-20_het-0.25-1.00_N-2155"

In [19]:
# Filtering according to the scripts provided under DOI https://doi.org/10.5281/zenodo.5808329
dogs = pd.read_csv(darwins_ark / "DarwinsArk_20191115_dogs.csv")
answers = pd.read_csv(darwins_ark / "DarwinsArk_20191115_answers.csv")
breedcalls = pd.read_csv(darwins_ark / "DarwinsArk_20191115_breedcalls.csv")

dogs_surveyed = answers.dog.unique()
dogs_filtered = dogs.loc[dogs.id.isin(dogs_surveyed) | dogs.id.isin(breedcalls.dog)].copy()
dogs_filtered["surveyed"] = dogs_filtered.id.isin(dogs_surveyed)
dogs_filtered["confirmed_purebred"] = dogs_filtered.conf & dogs_filtered.surveyed

dogs_with_genetic_data = dogs_filtered.loc[lambda x: x.id.isin(breedcalls.dog.unique())]

max_pct_per_dog = breedcalls.groupby("dog").pct.max().reset_index()
merged = max_pct_per_dog.merge(dogs_with_genetic_data, left_on="dog", right_on="id")

confirmed_purebred = dogs_with_genetic_data.loc[dogs_with_genetic_data.confirmed_purebred].id
print(f"Number of confirmed purebred dogs: {confirmed_purebred.shape[0]}")

highly_admixed = merged.loc[lambda x: x.pct < 0.45].id
print(f"Number of highly admixed dogs: {highly_admixed.shape[0]}")

Number of confirmed purebred dogs: 601
Number of highly admixed dogs: 1071
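To illustrate the highly-admixed criterion used above, here is a small sketch on made-up data. It mirrors the `groupby("dog").pct.max()` step in plain Python so that it runs without the dataset. The function name `select_highly_admixed` and the toy records are hypothetical, and we assume that `pct` holds the per-breed ancestry fraction of a dog.

```python
from collections import defaultdict

# Hypothetical breed calls: (dog ID, breed label, ancestry fraction)
toy_breedcalls = [
    (1, "breed_a", 0.90), (1, "breed_b", 0.10),
    (2, "breed_a", 0.40), (2, "breed_b", 0.35), (2, "breed_c", 0.25),
    (3, "breed_a", 0.45), (3, "breed_b", 0.30), (3, "breed_c", 0.25),
]


def select_highly_admixed(breedcalls, threshold=0.45):
    """Return the sorted IDs of dogs whose largest breed fraction is below threshold."""
    max_pct = defaultdict(float)
    for dog, _breed, pct in breedcalls:
        # Track the largest single-breed fraction seen for each dog
        max_pct[dog] = max(max_pct[dog], pct)
    return sorted(dog for dog, pct in max_pct.items() if pct < threshold)


print(select_highly_admixed(toy_breedcalls))
```

Only dog 2 is selected here: dog 1 is dominated by a single breed, and dog 3 sits exactly at the threshold, which the strict `<` comparison excludes.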


In [26]:
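def filter_and_convert(prefix: pathlib.Path, ids: pd.Series):
    """Subset the full PLINK dataset to the given dog IDs and convert the result to EIGENSTRAT.

    Reads from the global `plink_prefix` and writes all output files to `prefix`.
    """
    # Write one dog ID per line to a temporary file that is passed to PLINK via --keep
    with tempfile.NamedTemporaryFile(mode="w") as tmpfile:
        tmpfile.write("\n".join(map(str, ids)))
        # Flush so that PLINK sees the complete file contents
        tmpfile.flush()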
        cmd = [
            str(PLINK),
            "--bfile",
            str(plink_prefix),
            "--keep",
            tmpfile.name,
            "--make-bed",
            "--out",
            str(prefix),
        ]
        subprocess.run(cmd, check=True)

    run_convertf(
        convertf=CONVERTF,
        in_prefix=prefix,
        in_format=FileFormat.PACKEDPED,
        out_prefix=prefix,
        out_format=FileFormat.EIGENSTRAT,
    )

In [27]:
pure_bred_prefix = genetic_data / "pure_bred"
filter_and_convert(pure_bred_prefix, confirmed_purebred)

highly_admixed_prefix = genetic_data / "highly_admixed"
filter_and_convert(highly_admixed_prefix, highly_admixed)

FileNotFoundError: [Errno 2] No such file or directory: 'plink'
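The `FileNotFoundError` above is raised by `subprocess.run` when the `plink` executable cannot be found, which is why the `PLINK` and `CONVERTF` variables in the setup cell carry a TODO note. As a sketch of how to catch this earlier, one could resolve both executables before running any command. The helper name `resolve_executable` is hypothetical and not part of the original notebook.

```python
import shutil


def resolve_executable(name_or_path: str) -> str:
    """Return the resolved path of an executable, or raise a descriptive error."""
    resolved = shutil.which(name_or_path)
    if resolved is None:
        raise FileNotFoundError(
            f"Executable '{name_or_path}' not found; "
            "install it or set the variable to its full path."
        )
    return resolved


# Intended usage in the setup cell (commented out so the sketch runs anywhere):
# PLINK = resolve_executable("plink")
# CONVERTF = resolve_executable("convertf")
```

With this check in place, a missing tool is reported in the setup cell rather than inside `filter_and_convert`.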