# Recreating the Çayönü dataset analyses

In their analyses, Altınışık et al. place 13 Çayönü samples on the West Eurasian subset of the Human Origins (NearEastPublic) dataset.

N. Ezgi Altınışık et al., A genomic snapshot of demographic and cultural dynamism in Upper Mesopotamia during the Neolithic Transition. Sci. Adv. 8, eabo3609 (2022). https://doi.org/10.1126/sciadv.abo3609

**Note: run the NearEastPublic.ipynb notebook first, as these analyses need the West Eurasian dataset.**
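
One way to confirm that this prerequisite is met is to check that the three EIGENSTRAT files of the West Eurasian dataset exist before continuing. The following is a minimal sketch: the helper name `missing_eigenstrat_files` is hypothetical, and the prefix is the one used in the cells of this notebook.

```python
import pathlib

def missing_eigenstrat_files(prefix):
    """Return the EIGENSTRAT files (.geno, .snp, .ind) that are missing for a prefix."""
    # Append each suffix to the full prefix string, since the prefix itself contains dots
    candidates = [pathlib.Path(f"{prefix}{suffix}") for suffix in (".geno", ".snp", ".ind")]
    return [path for path in candidates if not path.exists()]

prefix = pathlib.Path("NearEastPublic") / "westEurasia" / "HumanOriginsPublic2068.westEurasian"
missing = missing_eigenstrat_files(prefix)
if missing:
    print("Run NearEastPublic.ipynb first; missing files:", *missing, sep="\n  ")
```

An empty list means all three files are present and the cells of this notebook can be run.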



In [None]:
# ! mkdir Cayonu && cd Cayonu && wget https://zenodo.org/api/records/7305608/files-archive && mv files-archive files-archive.zip
# ! cd Cayonu && unzip files-archive.zip && rm files-archive.zip
# ! cd Cayonu && mkdir westEurasian_ancient

In [8]:
import pathlib
import shutil

from filter_merge_utils import merge_datasets, indfile_to_dataframe

base_dir = pathlib.Path("Cayonu")
ancient_prefix = base_dir / "Cayonu.HO"
west_eurasia_prefix = pathlib.Path("NearEastPublic") / "westEurasia" / "HumanOriginsPublic2068.westEurasian"

In [5]:
# The SNP naming schemes of the NearEastPublic HO data and the Cayonu data differ, so we rename the HO SNPs to be able to merge the data.
# In the original HO data, SNP names look like this: "rs112869874".
# In the Cayonu data, SNP names follow the convention chromosomeID_physicalPositionOnChromosome.

renamed_west_eurasia_prefix = base_dir / "westEurasia" / "HumanOriginsPublic2068.westEurasian.renamed"
# the prefix is a file name stem, not a directory, so we create only its parent directory
renamed_west_eurasia_prefix.parent.mkdir(exist_ok=True, parents=True)

# the ind and geno files are not affected by this, so we can simply copy them
for file_suffix in [".ind", ".geno"]:
    src = f"{west_eurasia_prefix}{file_suffix}"
    dst = f"{renamed_west_eurasia_prefix}{file_suffix}"
    shutil.copy(src, dst)

# the snp file is affected, so we read each line in the HO SNP file and write the line with the new identifier
src = pathlib.Path(f"{west_eurasia_prefix}.snp")
dst = pathlib.Path(f"{renamed_west_eurasia_prefix}.snp")

# open both files in one with statement so that both are closed afterwards
with src.open() as snp_in, dst.open("w") as snp_out:
    for snp_line in snp_in:
        # SNP line looks like this (name, chromosome, genetic position, physical position, allele 1, allele 2):
        # rs3094315     1        0.020130          752566 G A
        orig_name, chrom_id, relative_pos, absolute_pos, allele1, allele2 = snp_line.split()
        new_name = f"{chrom_id}_{absolute_pos}"
        new_snp_line = "\t".join([new_name, chrom_id, relative_pos, absolute_pos, allele1, allele2]) + "\n"
        snp_out.write(new_snp_line)
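
Since the renaming exists only to make the SNP names of the two datasets match, it can be worth checking how many identifiers the renamed file shares with the Cayonu SNP file before merging. The following is a minimal sketch; the helper names `read_snp_ids` and `shared_snp_count` are hypothetical, and the file paths assume the directory layout used in this notebook.

```python
import pathlib

def read_snp_ids(snp_file):
    """Collect the SNP identifiers (first column) of an EIGENSTRAT .snp file."""
    with pathlib.Path(snp_file).open() as handle:
        return {line.split()[0] for line in handle if line.strip()}

def shared_snp_count(snp_file_a, snp_file_b):
    """Count the SNP identifiers that are present in both files."""
    return len(read_snp_ids(snp_file_a) & read_snp_ids(snp_file_b))

renamed_snp = pathlib.Path("Cayonu/westEurasia/HumanOriginsPublic2068.westEurasian.renamed.snp")
ancient_snp = pathlib.Path("Cayonu/Cayonu.HO.snp")
# Only run the check if both files are available
if renamed_snp.exists() and ancient_snp.exists():
    print("shared SNP identifiers:", shared_snp_count(renamed_snp, ancient_snp))
```

A count of zero would suggest that the two naming schemes still differ.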

In [9]:
merged_prefix = base_dir / "westEurasian_ancient" / "Cayonu_ModernWestEurasia"
merged_prefix.parent.mkdir(exist_ok=True)

# Merge the renamed modern West Eurasian data with the ancient Cayonu data.
# Note: the ancient samples contain some samples we want to exclude prior to PCA (e.g. Chimp sequences)
merge_datasets(
    prefix_ds1=renamed_west_eurasia_prefix,
    prefix_ds2=ancient_prefix,
    prefix_out=merged_prefix,
    redo=False
)

ancient_ind_file = pathlib.Path(f"{ancient_prefix}.ind")
ancient_ind_data = indfile_to_dataframe(ancient_ind_file)
ancient_populations = ancient_ind_data.population.unique().tolist()

# Finally, save the population names of the modern samples to a file so that we can later use it for the PCA projection
ind_file = pathlib.Path(f"{merged_prefix}.ind")
ind_df = indfile_to_dataframe(ind_file)

modern = [p for p in ind_df.population.unique() if p not in ancient_populations]
modern_populations = base_dir / "westEurasian_ancient" / "modern.poplist.txt"
# write_text closes the file after writing and returns the number of characters written
modern_populations.write_text("\n".join(modern))

parameter file: /var/folders/c3/6cf6l4n106v0gfcqwr1bt8nr0000gn/T/tmpsp7x1gj7
geno1: Cayonu/westEurasia/HumanOriginsPublic2068.westEurasian.renamed.geno
snp1: Cayonu/westEurasia/HumanOriginsPublic2068.westEurasian.renamed.snp
ind1: Cayonu/westEurasia/HumanOriginsPublic2068.westEurasian.renamed.ind
geno2: Cayonu/Cayonu.HO.geno
snp2: Cayonu/Cayonu.HO.snp
ind2: Cayonu/Cayonu.HO.ind
genooutfilename: Cayonu/westEurasian_ancient/Cayonu_ModernWestEurasia.geno
snpoutfilename: Cayonu/westEurasian_ancient/Cayonu_ModernWestEurasia.snp
indoutfilename: Cayonu/westEurasian_ancient/Cayonu_ModernWestEurasia.ind
hashcheck: NO
packed geno read OK
end of inpack
numsnps: 621799  numindivs: 826
packedancestrymap output
##end of mergeit run


526
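
The helper `indfile_to_dataframe` comes from the local `filter_merge_utils` module, which is not shown in this notebook. As a rough illustration of the modern/ancient split above, here is a standard-library sketch. It assumes the usual three-column EIGENSTRAT `.ind` layout (sample ID, sex, population label); the function names `read_populations` and `modern_only` are hypothetical and not part of `filter_merge_utils`.

```python
import pathlib

def read_populations(ind_file):
    """Return the population labels of an .ind file in order of first appearance."""
    populations = []
    with pathlib.Path(ind_file).open() as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            # Assumed columns: sample ID, sex, population label
            population = fields[2]
            if population not in populations:
                populations.append(population)
    return populations

def modern_only(merged_populations, ancient_populations):
    """Keep the populations of the merged data that do not occur in the ancient data."""
    ancient = set(ancient_populations)
    return [p for p in merged_populations if p not in ancient]
```

With the files from this notebook, `modern_only(read_populations("Cayonu/westEurasian_ancient/Cayonu_ModernWestEurasia.ind"), read_populations("Cayonu/Cayonu.HO.ind"))` should give the list that is written to `modern.poplist.txt`, provided the layout assumption holds.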