## **Summary Code Note**
This file summarizes the code, functions, and packages that we wrote and ran to retrieve mtDNA sequences from NCBI, clean the raw data into the final Dataset 3 (which includes only complete genomes), and use the summary information of each sequence to create the Isolate Explanation table. </br>
To show how the expected outcomes were obtained and to give readers a full picture of the Retrieving Data and Data Wrangling parts of the Methods & Materials section of the article, we show only the code that calls the modules and functions of the main packages, plus some extra code (loops) that repeats the same functions. For details of how the packages, modules, and functions were written, see the git repository below:

https://github.com/vy-phung/Haplogroup.git

Notes:
- Most of the input files and the saved outputs are stored in the [google drive folder](https://drive.google.com/drive/folders/15OPBImGAG51vukHfADV3-zG8RKU8A9fe?usp=drive_link). </br>
The paths of those input and output files are shown in the code below, in the same order as in the Google Drive folder above.


**CONTENTS:**
1. Get Data from NCBI
* Entrez Direct
* Missing Data
2. Data wrangling
* Dataset 1
* Dataset 2
* Dataset 3
3. Tables
* Isolate Explanation Table
* Table 1
* Table 2
* Table 3 and its subtables

### **1. Get Data from NCBI**

#### **Entrez Direct**

- To retrieve mtDNA sequences from 11 countries in Southeast Asia, we used [Entrez Direct](https://www.ncbi.nlm.nih.gov/books/NBK179288/) to search with the common keywords: </br>
`"Homo sapiens AND mitochondrion AND <Country Name>"`

- Below is a bash script file we wrote to get and then save the collected data. The file "countries.txt" in DataList contains the names of 11 countries (Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, Timor-Leste, Viet Nam). The collected fasta files were saved in OldCountryFasta folder.

In [None]:
# Create a bash script that retrieves mtDNA of the 11 countries using the keywords mentioned above
script = r'''#!/bin/bash
# Download and install Entrez Direct
sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
export PATH=${HOME}/edirect:${PATH}

# Retrieve the data of the 11 countries (countries.txt is comma-separated)
DataList=/content/drive/MyDrive/OUCRUwork/RetrieveData/others/extraData/countries.txt
Field_Separator=$IFS
IFS=,
for val in $(cat "$DataList")
do
  # the redirection stays on the same line as efetch so that the output reaches the file
  ${HOME}/edirect/esearch -db nucleotide -query "Homo sapiens AND mitochondrion AND $val" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > "/content/drive/MyDrive/RetrieveData/OldCountryFasta/$val.fasta"
done
IFS=$Field_Separator
'''
# Write the script verbatim so that $val, $(...) and the quotes are not expanded while the file is created
with open('/content/drive/MyDrive/OUCRUwork/RetrieveData/others/codes/NCBIBash.sh', 'w') as script_file:
    script_file.write(script)
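As a minimal sketch of the keyword pattern described above, the query for each country can also be built in Python. The helper name `build_queries` is hypothetical and is not part of the Haplogroup repository; the sketch only assumes that the country list is a single comma-separated line, as in "countries.txt".

```python
def build_queries(countries_text):
    """Return a {country: query} mapping for a comma-separated country list."""
    # strip() removes stray spaces and the trailing newline of the file
    countries = [name.strip() for name in countries_text.split(",") if name.strip()]
    return {
        country: f"Homo sapiens AND mitochondrion AND {country}"
        for country in countries
    }


# Example with three of the eleven countries
for country, query in build_queries("Brunei,Cambodia,Viet Nam\n").items():
    print(country, "->", query)
```

Stripping each name matters because a trailing newline in the list would otherwise end up in the last query and in the last output file name.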

In [1]:
def saveFile(name, content):
  # Overwrite the file `name` with `content` (print adds a trailing newline);
  # the `with` block closes the file automatically
  with open(name, "w") as external_file:
    print(content, file=external_file)

In [32]:
file = '''
#!/bin/bash

greeting () {
  echo "Hello $1"
}

#greeting "Joe"
'''
saveFile("hello.sh",file)

In [33]:
! cat hello.sh


#!/bin/bash

greeting () {
  echo "Hello $1"
}

#greeting "Joe"



In [42]:
! source hello.sh; greeting "Jay"

Hello Jay


#### **Missing data**

After retrieving the data with the common keywords above, we realized that some data were still missing and could only be found by searching with more specific keywords. Below is how we retrieved the additional data for each country.

**Myanmar** </br>
New Data:
- 21219640.Inland post-glacial dispersal in East Asia revealed by mitochondrial haplogroup M9a'b: covers Myanmar and Vietnam, so the records must be filtered by country; keywords for Myanmar: HM346895, HM346896
- 24467713.Summerer et al. (2014).txt: Myanmar: of the 371 files of this article, 327 short coding-region sequences (JX288765-JX289091) were missing (keywords: Large-scale mitochondrial DNA analysis in Southeast Asia reveals evolutionary effects of cultural isolation in the multi-ethnic population of Myanmar)
- 25826227.Li et al. (2015).txt: all 937 files are from Myanmar, but only 92 files already existed (keywords: Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene)


In [None]:
! for i in HM346895 HM346896; do ${HOME}/edirect/esearch -db nucleotide -query $i -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Myanmar/extra/$i.fasta; done

# $i is only defined inside the loop above, so an explicit output file is needed here.
# Myanmar2.fasta is an assumed name that follows the pattern used for the other countries.
! ${HOME}/edirect/esearch -db nucleotide -query "Large-scale mitochondrial DNA analysis in Southeast Asia reveals evolutionary effects of cultural isolation in the multi-ethnic population of Myanmar" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Myanmar/extra/Myanmar2.fasta

# append (>>) so that the result of the previous query is not overwritten
! ${HOME}/edirect/esearch -db nucleotide -query "Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Myanmar/extra/Myanmar2.fasta
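Accession ranges such as JX288765-JX289091 can be expanded into individual accession numbers to check how many records a range covers. The sketch below is only an illustration: the function name `expand_accession_range` is hypothetical, and it assumes that both endpoints share the same letter prefix and the same number of digits.

```python
import re


def expand_accession_range(first, last):
    """List every accession from `first` to `last`, inclusive."""
    pattern = re.compile(r"^([A-Z_]+)(\d+)$")
    first_match, last_match = pattern.match(first), pattern.match(last)
    if not first_match or not last_match:
        raise ValueError("accessions must be letters followed by digits")
    prefix, start = first_match.groups()
    last_prefix, end = last_match.groups()
    if prefix != last_prefix or len(start) != len(end) or int(start) > int(end):
        raise ValueError("range endpoints are not compatible")
    width = len(start)  # keep leading zeros, if any
    return [f"{prefix}{number:0{width}d}" for number in range(int(start), int(end) + 1)]


accessions = expand_accession_range("JX288765", "JX289091")
print(len(accessions), accessions[0], accessions[-1])  # 327 JX288765 JX289091
```

The count of 327 agrees with the number of missing sequences noted for Summerer et al. (2014) above.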

**Philippines** </br>
New data:
+ 21281460.Loo et al. (2011).txt: all 46 files are from the Philippines, but only 12 already existed (keyword: Genetic affinities between the Yami tribe people of Orchid Island and the Philippine Islanders of the Batanes archipelago and Philippines)
+ 21796613.Scholes et al. (2011).txt: 60 files are from the Philippines, but only 9 already existed (keywords: Genetic diversity and evidence for population admixture in Batak Negritos from Palawan)
+ '28535779.Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia.Larruga et al. (2017)': keywords: "Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia AND Philippines"

In [None]:
!${HOME}/edirect/esearch -db nucleotide -query 'Genetic affinities between the Yami tribe people of Orchid Island and the Philippine Islanders of the Batanes archipelago and Philippines' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Philippines/extra/Philippines2.fasta

# append (>>) so that the results of the previous query in Philippines2.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query 'Genetic diversity and evidence for population admixture in Batak Negritos from Palawan' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Philippines/extra/Philippines2.fasta

!${HOME}/edirect/esearch -db nucleotide -query 'Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia AND Philippines' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Philippines/extra/Philippines2.fasta

**Singapore** </br>
New data:
+ 36382667.Sui et al. (2023).txt: there are 7 files, of which 5 already existed; it is unclear whether the remaining 2 reference sequences are from Singapore (keywords: Death associated protein‑3 (DAP3) and DAP3 binding cell death enhancer‑1 (DELE1) in human colorectal cancer, and their impacts on clinical outcome and chemoresistance)
+ 37025097.Zhao et al. (2023).txt: there are 14 files, of which 11 already existed; it is unclear whether the remaining 3 belong to Singapore

In [None]:
!${HOME}/edirect/esearch -db nucleotide -query "Death associated protein‑3 (DAP3) and DAP3 binding cell death enhancer‑1 (DELE1) in human colorectal cancer, and their impacts on clinical outcome and chemoresistance" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Singapore/extra/Singapore2.fasta

# append (>>) so that the results of the previous query in Singapore2.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query "Effect of COP1 in Promoting the Tumorigenesis of Gastric Cancer by Down-Regulation of CDH18 via PI3K/AKT Signal Pathway" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Singapore/extra/Singapore2.fasta

**Thailand** </br>
New data:
+ 11310578.Mitochondrial DNA polymorphisms in Thailand.Fucharoen et al. (2001)
+ 19148289.The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers.Jin et al. (2009): Thai (80, keyword: The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers and Thailand), Vietnam (84, keyword: Viet Nam)
+ 27837350.Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the spread of Tai–Kadai languages. Kutanan et al. (2017): Thai, Lao (search on the table of isolate name): total 1234
+ 32304863.A Matrilineal Genetic Perspective of Hanging Coffin Custom in Southern China and Northern Thailand (its old name is Unpublished.The Population History and Cultural Dispersal Pattern of Hanging.Zhang et al. (2020))


In [None]:
!${HOME}/edirect/esearch -db nucleotide -query 'Mitochondrial DNA polymorphisms in Thailand' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Thailand/extra/Thailand2.fasta

# append (>>) so that the results of the previous query in Thailand2.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query 'The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers AND Thailand ' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Thailand/extra/Thailand2.fasta

'''27837350.Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the
spread of Tai–Kadai languages. Kutanan et al. (2017): Thai, Lao (search on the table of isolate name):
total 1234: Laos: LUA101-LUA149: LA1 + VIE101-VIE149: LA2; the others are Thai'''
# download data for both Thai and Lao
!${HOME}/edirect/esearch -db nucleotide -query "Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the spread of Tai–Kadai languages " -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Laos/extra/Laos_Thai.fasta

!${HOME}/edirect/esearch -db nucleotide -query "The Population History and Cultural Dispersal Pattern of Hanging" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/RetrieveData/Dataset1/Thailand/extra/Thailand.fasta
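Because the Kutanan et al. (2017) download mixes Thai and Lao sequences, each record has to be assigned to a country from its isolate name. The sketch below encodes the rule noted in the cell above (LUA101-LUA149 and VIE101-VIE149 are from Laos, everything else is Thai). The function name `assign_country` and the exact isolate spelling (letters followed directly by digits) are assumptions for illustration.

```python
import re

# Isolate prefixes of the two Lao populations (LA1 and LA2) noted above
LAOS_PREFIXES = {"LUA": "LA1", "VIE": "LA2"}


def assign_country(isolate):
    """Return 'Laos' for LUA101-LUA149 and VIE101-VIE149, otherwise 'Thailand'."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", isolate.strip())
    if match:
        prefix, number = match.group(1), int(match.group(2))
        if prefix in LAOS_PREFIXES and 101 <= number <= 149:
            return "Laos"
    return "Thailand"


for name in ["LUA101", "VIE149", "LUA150"]:
    print(name, assign_country(name))
```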

**Laos** </br>
New Data:
- 21333001.Southeast Asian diversity: first insights into the complex mtDNA structure of Laos.Bodner et al. (2011): 214

In [None]:
# append (>>) so that the Thai and Lao sequences already saved in Laos_Thai.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query "Southeast Asian diversity: first insights into the complex mtDNA structure of Laos" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Laos/extra/Laos_Thai.fasta

**Timor-Leste** </br>
New data:
+ Genetic admixture history of Eastern Indonesia as revealed by Y-chromosome and mitochondrial DNA analysis.Mona et al. (2009): no clear location/country is given, but this article covers the countries Indonesia and Timor-Leste (330 files)
+ 25757516.Gomes et al. (2015).txt: this study has 324 files, all from East Timor (Timor-Leste) (KJ655583-KJ655889: D-loop, KJ676774-KJ676790: complete genome) (keywords: Human settlement history between Sunda and Sahul: a focus on East Timor (Timor-Leste) and the Pleistocenic mtDNA diversity)

In [None]:
!${HOME}/edirect/esearch -db nucleotide -query 'Genetic admixture history of Eastern Indonesia as revealed by Y-chromosome and mitochondrial DNA analysis' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Timor-Leste/extra/Timor2.fasta

# append (>>) so that the results of the previous query in Timor2.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query 'Human settlement history between Sunda and Sahul: a focus on East Timor (Timor-Leste) and the Pleistocenic mtDNA diversity' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Timor-Leste/extra/Timor2.fasta

**Viet Nam** </br>
New data:
+ 21219640.Inland post-glacial dispersal in East Asia revealed by mitochondrial haplogroup M9a'b: Myanmar, Vietnam: keywords for VN: HM346881, HM346883, HM346885, HM346886, HM346889
+ .Direct Submission.VN.Phan et al. (2016): Vietnam (DQ834255, DQ834258)
+ 19148289.The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers.Jin et al. (2009): Vietnam (84, keyword: The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers and Viet Nam)
+ 20513740.Tracing the Austronesian footprint in Mainland Southeast Asia: a perspective from mitochondrial DNA.Peng et al. (2010): 335 (Cham+Kinh)
+ '.Direct Submission.Phan et al. (2016)': searching with the keyword “Phan (2016) AND Homo sapiens AND mitochondrion” returns 10 files for Viet Nam, 2 of which already existed. These records have no title, so the article could not be found, and there is no explanation for the isolate; it is only labeled “VN”

In [None]:
! for i in HM346881 HM346883 HM346885 HM346886 HM346889; do ${HOME}/edirect/esearch -db nucleotide -query $i -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Vietnam/extra/$i.fasta; done

! for i in DQ834255 DQ834258; do ${HOME}/edirect/esearch -db nucleotide -query $i -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Vietnam/extra/$i.fasta; done

!${HOME}/edirect/esearch -db nucleotide -query 'The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers AND Viet Nam' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Vietnam/extra/Vietnam2.fasta

# append (>>) so that the results of the previous query in Vietnam2.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query 'Tracing the Austronesian footprint in Mainland Southeast Asia: a perspective from mitochondrial DNA' -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Vietnam/extra/Vietnam2.fasta

!${HOME}/edirect/esearch -db nucleotide -query "Phan (2016) AND Homo sapiens AND mitochondrion" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/RetrieveData/Dataset1/Vietnam/extra/Vietnam.fasta

**Malaysia** </br>
1. Search term "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malaysia ":
- Missing data for Malaysia: 267 files, all of which are control-region sequences
2. Search term "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malay":
- Missing 4 complete genomes of Malaysia: </br>
2: 9_N21(Tor57), 10_M21c(Tor61): Aboriginal Malay (Semelai) (using keyword Malay) </br>
2: 7_N22(Tor55), 12_M22(Tor63): Aboriginal Malay (Temuan) (using keyword Malayu)
3. 16982817.Hill et al. (2006): 6 files, of which one (ORA131B) already existed and the others did not (keywords: Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Malaysia)
4. 22729749.Evolutionary history of continental southeast asians: 'early train' hypothesis based on genetic analysis of mitochondrial and autosomal DNA data.Jinam et al. (2012): Malay: 86 genomes: 23 Bidayuh (BD); 24 Jehai (JH); 21 Seletar (SL); 18 Temuan (TM)

In [None]:
# add more 267 files having control region for Malaysia
!${HOME}/edirect/esearch -db nucleotide -query "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malaysia " -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/RetrieveData/others/extraData/Malaysia_extra.fasta

# add more files for complete genome of Malaysia (using key word Malay) (3 files)
!${HOME}/edirect/esearch -db nucleotide -query "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malay " -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/RetrieveData/others/extraData/Malaysia_extra_complete.fasta

# using key word Malayu (2 files)
!${HOME}/edirect/esearch -db nucleotide -query "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malayu " -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/RetrieveData/others/extraData/Malaysia_extra_complete.fasta

!${HOME}/edirect/esearch -db nucleotide -query "Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Malaysia" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Malaysia/extra/Malaysia3.fasta

# append (>>) so that the results of the previous query in Malaysia3.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query "Evolutionary history of continental southeast asians: 'early train' hypothesis based on genetic analysis of mitochondrial and autosomal DNA data " -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/Database/Dataset1/Malaysia/extra/Malaysia3.fasta

**Indonesia** </br>
New Data:
- '21407194.Larger mitochondrial DNA than Y-chromosome differences between.Gunnarsdottir et al. (2011)'. Keyword: Larger mitochondrial DNA than Y-chromosome differences between matrilocal and patrilocal groups from Sumatra (72 files)
- 21407194.Gunnarsdottir et al. (2011): HM596654
- 16982817.Hill et al. (2006): 97 files, of which 4 already existed (DQ981465-68) and the others did not (keyword: Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Indonesia)
- Unpublished.Ngili et al. (2009).txt: all 206 files are from Indonesia (check again whether there are duplicates with the old data)

In [None]:
!${HOME}/edirect/esearch -db nucleotide -query "Larger mitochondrial DNA than Y-chromosome differences between matrilocal and patrilocal groups from Sumatra" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/RetrieveData/Dataset1/Indonesia/extra/Indonesia.fasta

# append (>>) so that the results of the previous queries in Indonesia.fasta are not overwritten
!${HOME}/edirect/esearch -db nucleotide -query "HM596654" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/RetrieveData/Dataset1/Indonesia/extra/Indonesia.fasta

!${HOME}/edirect/esearch -db nucleotide -query "Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Indonesia" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/RetrieveData/Dataset1/Indonesia/extra/Indonesia.fasta

!${HOME}/edirect/esearch -db nuccore -query "Ngili (2009)" -sort "Date Released" | ${HOME}/edirect/efetch -format fasta > /content/drive/MyDrive/Database/Dataset1/Indonesia/extra/Indonesia3.fasta

### **2. Data wrangling**

In this Data Wrangling section, we used the code we wrote to produce the outputs. The packages containing the specific functions were cloned from the [git link](https://github.com/vy-phung/Haplogroup.git) mentioned at the beginning. </br>
After cloning, the packages were imported so that their functions could be run.

In [None]:
# git clone
! git clone https://github.com/vy-phung/Haplogroup.git

#### **Dataset1**
After retrieving the raw data above, we wrote the function `splitSeq` to create the folder Dataset1. </br>
This folder contains only unique data, meaning that the duplicated mtDNA sequences retrieved from NCBI above were removed. </br>
Dataset 1 still contains reference sequences, D-loop sequences, non-Homo sapiens sequences, etc.
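The removal of duplicated records can be pictured with the small sketch below. It is not the repository's implementation: the function name `unique_records` is hypothetical, and it assumes that two records are duplicates when the first word of their FASTA header (the accession number) is identical.

```python
def unique_records(fasta_text):
    """Return (header, sequence) pairs, keeping the first record of each accession."""
    records = {}    # accession -> [header, list of sequence lines]
    current = None  # accession being filled, or None while skipping a duplicate
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            accession = (header.split() or [""])[0]
            if accession in records:
                current = None  # duplicate: ignore its sequence lines
            else:
                current = accession
                records[accession] = [header, []]
        elif current is not None:
            records[current][1].append(line)
    return [(header, "".join(parts)) for header, parts in records.values()]


# Toy input: the sequences are placeholders, not real mtDNA
example = """>HM346895.1 first copy
ACGT
ACGT
>HM346896.1 another record
TTGA
>HM346895.1 repeated download
ACGT
ACGT
"""
for header, sequence in unique_records(example):
    print(header, len(sequence))
```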

**Instruction of splitSeq**</br>
**Code:** </br>
`splitSeq(nameFile, country, seqFolder, seqNameFolder, newCountryFasta, exist)`

  **Explanation:**
  - **nameFile:** input file
  - **country:** the country of the input file to run
  - **seqFolder:** name of folder of the output fasta files after running and splitting files
  - **seqNameFolder:** name of another folder that stores the other outputs besides the FASTA files above. For example, after running ReadSummary, it holds the information on the reference articles, such as authors, titles, PubMed IDs, etc.
  - **newCountryFasta**: name of the folder in which we saved one large new FASTA file that contains the same sequences collected from NCBI, except that they are labeled in the order "Country.Isolate.AccessionNumber.Haplogroup"
  - **exist:** accepts only "Yes" or "No". "Yes" means that some sequences in the large input FASTA file have already been processed or already exist, so the function runs only on the remaining ones. "No" means that none of the sequences in the input file exist or have been processed.
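To make the label order "Country.Isolate.AccessionNumber.Haplogroup" concrete, here is a minimal sketch of building and reading such a label. The helper names `make_label` and `parse_label` are hypothetical, the isolate and haplogroup values in the example are made up, and the parser assumes that the accession number and the haplogroup contain no dot.

```python
def make_label(country, isolate, accession, haplogroup):
    """Join the four fields in the order Country.Isolate.AccessionNumber.Haplogroup."""
    return ".".join([country, isolate, accession, haplogroup])


def parse_label(label):
    """Split a label back into its four fields.

    The country is taken from the left and the haplogroup and accession
    from the right, so dots inside the isolate name are tolerated.
    """
    country, rest = label.split(".", 1)
    rest, haplogroup = rest.rsplit(".", 1)
    isolate, accession = rest.rsplit(".", 1)
    return {
        "Country": country,
        "Isolate": isolate,
        "AccessionNumber": accession,
        "Haplogroup": haplogroup,
    }


label = make_label("Myanmar", "MM.12", "HM346895", "M9a1")
print(label)
print(parse_label(label))
```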

In [None]:
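# Download and install Entrez Direct
!sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
# `!export` only changes a temporary subshell, so update PATH of the notebook process instead
import os
os.environ["PATH"] = os.path.expanduser("~/edirect") + os.pathsep + os.environ["PATH"]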


Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "export PATH=/root/edirect:\${PATH}" >> ${HOME}/.bashrc

or manually edit the PATH variable assignment in your .bashrc file.

Would you like to do that automatically now? [y/N]
y
OK, done.

To activate EDirect for this terminal session, please execute the following:

export PATH=${HOME}/edirect:${PATH}



In [None]:
# Download haplogrep
!curl -sL haplogrep.now.sh | bash

In [None]:
# calling function
from DataWrangling import Dataset1, openFile
# running function to get 11 countries data
for country in openFile("/content/drive/MyDrive/OUCRUwork/RetrieveData/others/extraData/countries.txt").split(','):
    nameFile = '/content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset1/'+ country +'/old/'+country+'.fasta'
    seqFolder = '/content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset1/' + country + '/fasta/'
    seqNameFolder = '/content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset1/' + country + '/'
    newCountryFasta = '/content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset1/' + country + '/new/'
    exist = 'No'
    Dataset1.splitSeq(nameFile, country, seqFolder,seqNameFolder,newCountryFasta, exist)

After running the `splitSeq` function, there are 3 main outputs:
+ One new FASTA file per sequence of the input country, saved in the declared folder under its new label in the order "Country.Isolate.AccessionNumber.Haplogroup".
+ The additional files seqName.txt and refName.txt, which store the summary information of the sequences for later use (creating the Isolate Explanation Table and checking for duplicated sequences).
+ One large FASTA file containing all the sequences of the input country. Its content is exactly the same as the large original FASTA file retrieved from NCBI, but its sequences are labeled with the new name in the order "Country.Isolate.AccessionNumber.Haplogroup".

#### **Dataset2**

After creating Dataset1, we ran the function `remove(CountryFolder, country)` to remove reference and non-Homo sapiens sequences.

**Explanation:**

- **CountryFolder**: the name of the folder of that country saving the output </br>
- **country**: name of the input country

In [None]:
# calling function
from DataWrangling.Dataset2_3 import remove
from DataWrangling import openFile
# run function
for country in openFile("/content/drive/MyDrive/OUCRUwork/RetrieveData/others/extraData/countries.txt").split(','):
    CountryFolder = '/content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset2/' + country + '/fasta/'
    remove(CountryFolder, country)

The final output consists of the sequences that are neither reference sequences nor non-Homo sapiens sequences. All of these sequences were saved in the new folder named Dataset 2.
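A filter of this kind can be sketched from the FASTA header alone. This is an illustration and not the `remove` function itself: the name `keep_for_dataset2` is hypothetical, and the sketch assumes that a reference sequence is a RefSeq record (RefSeq accessions such as NC_012920.1 contain an underscore) and that human records mention "Homo sapiens" in the header. The header texts in the example are illustrative.

```python
def keep_for_dataset2(header):
    """Return True if a FASTA header looks like a non-reference human record."""
    words = header.lstrip(">").split()
    accession = words[0] if words else ""
    is_reference = "_" in accession          # assumption: RefSeq-style accession
    is_human = "homo sapiens" in header.lower()
    return is_human and not is_reference


headers = [
    ">NC_012920.1 Homo sapiens mitochondrion, complete genome",
    ">HM346895.1 Homo sapiens mitochondrion, complete genome",
    ">XX000001.1 Pan troglodytes mitochondrion, complete genome",
]
for header in headers:
    print(keep_for_dataset2(header), header)
```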

#### **Dataset3**
After creating Dataset2, we ran the function `removeControlReg(country)` to remove the control-region (D-loop) sequences from Dataset 2, so that only complete genomes remain in this dataset folder.

**Explanation:**
- **country**: input country

In [None]:
# calling function
from DataWrangling.Dataset2_3 import removeControlReg
from DataWrangling import openFile
# run function
for country in openFile("/content/drive/MyDrive/OUCRUwork/RetrieveData/others/extraData/countries.txt").split(','):
    removeControlReg(country)

The final output consists only of complete genomes, with all control-region sequences removed. All of these sequences were saved in the new folder named Dataset 3, which is the clean dataset used for all the data analyses and results of this study.
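One way to picture this step is a check on the header text and the sequence length. The sketch below is not the `removeControlReg` function: the name `is_complete_genome`, the header keywords, and the 16,000 bp threshold are assumptions chosen for illustration (the human mitochondrial genome is roughly 16.5 kb long).

```python
def is_complete_genome(header, sequence, min_length=16000):
    """Return True if a record looks like a complete mitochondrial genome."""
    text = header.lower()
    # records annotated as control region / D-loop are rejected outright
    if "d-loop" in text or "control region" in text:
        return False
    # otherwise accept an explicit annotation or a genome-sized sequence
    return "complete genome" in text or len(sequence) >= min_length


print(is_complete_genome("Homo sapiens mitochondrion, complete genome", "ACGT"))
print(is_complete_genome("Homo sapiens mitochondrial D-loop, partial sequence", "ACGT" * 100))
print(is_complete_genome("Homo sapiens mitochondrion", "A" * 16500))
```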

### **3. Tables**

In the folder RetrieveData/others, there is a file named translation.txt that contains the explanation of the isolates of all 4932 sequences; the functions below use it to create the output tables.

#### **Isolate Explanation Table**

In brief, these are the steps we followed to create the Full Isolate Explanation Table:
1. We ran the first cell below to create an original Isolate Explanation Table.
2. We used the file "Changed_SEA_haplogroups.xlsx" (see the Table1 section below for how this file was created) and added its columns (References, Ethnicity, Location, Language, Language Family, Haplogroup) to the original Isolate Explanation Table by matching rows on AccessionNumber.
3. We used the column References to create the new columns PubmedID, Title, Years, and Author(s), and we added the column "names", which holds the names we used to label each sequence.
4. From the column "Haplogroup", we ran the code in the third cell below to add the columns Haplogroup1, Haplogroup2, and Haplogroup3.
5. We ran the code in the second cell below to add the polymorphism column.
6. We added a column "Ancient and Present" by looking for keywords in each article of the column "References" and classifying whether its published sequences were ancient or present-day.
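Step 3 can be sketched with a regular expression, assuming that a reference name follows the pattern "PubmedID.Title.Author et al. (Year)" seen in the reference names listed earlier in this file. The function name `split_reference` and the pattern are assumptions; reference names that deviate from the pattern (for example unpublished ones, or ones without a title) return `None` and would need separate handling.

```python
import re

# PubmedID . Title . Authors (Year), e.g. "21333001.<title>.Bodner et al. (2011)"
REFERENCE_PATTERN = re.compile(
    r"^(?P<PubmedID>\d+)\.(?P<Title>.+)\.\s*(?P<Authors>[^.]+ et al\.) \((?P<Year>\d{4})\)$"
)


def split_reference(reference):
    """Return a dict with PubmedID, Title, Authors and Year, or None if no match."""
    match = REFERENCE_PATTERN.match(reference.strip())
    return match.groupdict() if match else None


reference = (
    "21333001.Southeast Asian diversity: first insights into the complex "
    "mtDNA structure of Laos.Bodner et al. (2011)"
)
print(split_reference(reference))
```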

In [None]:
# run function
'''Creating an original Isolate Explanation Table which only includes 4 columns:
Country, Isolate, AccessionNumber, and Explanation.'''
from Tables.IsoTab import *
import pandas as pd
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
# build one table per country, then stack them into a single table
frames = []
for country in countries.split():
  table = explainIso(country)
  frames.append(IsoTab(table,country))
  print('finish',country)
output = pd.concat(frames)
output.to_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/IsolateExplanation.xlsx')

Adding polymorphism

In [None]:
'''After creating the Isolate Explanation Table above, we wanted to add a polymorphism column for each sequence.
The functions below were used to create that column.
The input contains all the fasta files of the 11 countries in Dataset3'''

# call function
from Tables.IsoTab import polymorphism
# create input
! ls /content/drive/MyDrive/OUCRUwork/RetrieveData/Dataset3/*/fasta/* > Inputs.txt
# run function
polymorphism.polymorphism('Inputs.txt')
polymorphism.tablePoly('Inputs.txt')

Adding columns Haplo1, Haplo2, Haplo3 from using the original Haplogroup column

In [None]:
from Tables.IsoTab import polymorphism
table3a = pd.read_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table3/table3a_CountryAndEthnicity.xlsx', index_col='Country and Ethnicity')
table = pd.concat([table3a.iloc[[0,1],:],table3a.iloc[2:,:].astype(float)])
haplo = polymorphism.groupHaplo(table)

#### **Table1**

Running the cell below in this Table1 section creates 2 outputs:
1. Eleven tables, one for each of the eleven SEA countries, each with the columns:  
Country, References, Explanation, Ethnicity, Location, Population (or Label), Language, Sample size, and the haplogroup columns (each column is named after a different haplogroup)
2. A table named "SEA_haplogroups", which has the same columns but combines all 11 tables above into one large file.

After creating the "SEA_Haplogroups" table, we added a new column "Language Family" by looking up each entry of the column "Ethnicity" on the website World Ethnic Group. We also deleted the column Population (or Label) because it was unnecessary. Moreover, in the columns Ethnicity, Location, and Language, we added new information after carefully rereading the papers of the sequences, and for uncertain entries we included the nearby location or the closely related ethnicity or language. </br>
All these changes were saved in the new file named **"Changed_SEA_Haplogroups"**.

In [None]:
# call function
from Tables.tables import table1
# run function
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
for country in countries.split():
  data = table1(country)
  data.to_csv('/content/drive/MyDrive/RetrieveData/tables/table1/'+country+'1.csv')
  print(country,'finish')
# merge all 11 countries
countries = 'Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
df = pd.read_csv('/content/drive/MyDrive/RetrieveData/tables/table1/Brunei1.csv')
for country in countries.split():
  df1 = pd.read_csv('/content/drive/MyDrive/RetrieveData/tables/table1/'+country+'1.csv')
  df = pd.concat([df, df1], ignore_index=True, sort=False)
# save the total 11 countries files
df = df.fillna(0)
df = df.replace(0,'-')
df.to_csv('/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv')
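The merge in the cell above stacks tables whose haplogroup columns differ from country to country, so a haplogroup that is absent from a country (or has a count of 0) ends up as '-'. A small pure-Python sketch of that idea follows. The function name `combine_tables` is hypothetical, rows are plain dictionaries instead of pandas objects, and the haplogroup counts in the example are made up.

```python
def combine_tables(tables):
    """Concatenate lists of row dictionaries, writing '-' for missing or zero cells."""
    # collect every column name in order of first appearance
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    combined = []
    for rows in tables:
        for row in rows:
            new_row = {}
            for name in columns:
                value = row.get(name, 0)
                new_row[name] = "-" if value == 0 else value
            combined.append(new_row)
    return combined


brunei = [{"Country": "Brunei", "M7": 2}]
laos = [{"Country": "Laos", "B4": 1, "M7": 0}]
for row in combine_tables([brunei, laos]):
    print(row)
```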

#### **Table2**

There are 2 main outputs of the Table 2 section after running the cell below:
1. Eleven haplogroup frequency tables, one for each of the eleven SEA countries; the number of haplogroups differs between countries, and some countries share haplogroups while others do not.
2. One large haplogroup frequency table covering the haplogroups of all 11 countries together.

All the tables above have the same columns:
- Haplogroup (name of the haplogroup)
- Number of Individuals (the number of times that haplogroup appears)
- Frequency (the percentage of that haplogroup among all haplogroup assignments, within that country for the first output and across the 11 countries for the second output).
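The three columns can be reproduced for a list of haplogroup assignments with `collections.Counter`. This is a minimal sketch and not the `table2` function from the repository: the name `haplogroup_frequency` is hypothetical, the example haplogroups are made up, and rounding the percentage to two decimals is an arbitrary choice.

```python
from collections import Counter


def haplogroup_frequency(haplogroups):
    """Return one row per haplogroup with its count and percentage frequency."""
    counts = Counter(haplogroups)
    total = sum(counts.values())
    return [
        {
            "Haplogroup": name,
            "Number of Individuals": count,
            "Frequency": round(100 * count / total, 2),
        }
        for name, count in counts.most_common()  # most frequent first
    ]


for row in haplogroup_frequency(["M7", "B4", "M7", "F1"]):
    print(row)
```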

In [None]:
# call function
from Tables.tables import table2
# run function (the result gets its own name so that the imported function is not overwritten)
haplo_frequency = table2('/content/drive/MyDrive/RetrieveData/tables/table2/Haplogroups.csv')
haplo_frequency.to_csv('/content/drive/MyDrive/RetrieveData/tables/table2/Haplofrequency.csv')

#### **Table3**

In this Table3 section we created 4 main outputs:
1. table3a_CountryAndEthnicity:
> - Haplogroup column (names of the haplogroups)
> - Ethnicities of the 11 countries, with each country highlighted in a different color. </br>
> - For each country, there are: </br>
>> - A Total column (the number of sequences of that specific haplogroup / the total number of sequences of all haplogroups in that country). </br>
>> - Next to the Total column, one column for each ethnicity appearing in that country (the number of sequences of the specific haplogroup in that ethnicity / the total number of sequences of all haplogroups in that ethnicity).
2. table3b_Ethnicity:
> - Haplogroup column (names of the haplogroups)
> - A Total column (the number of sequences of that specific haplogroup / the total number of sequences of all haplogroups). </br>
> - Next to the Total column, one column for each ethnicity appearing in the 11 countries (the number of sequences of the specific haplogroup in that ethnicity / the total number of sequences of all haplogroups in that ethnicity).
3. table3c_LanguageFamily: </br>
has the same column format as table3b_Ethnicity, but the independent variable is changed from Ethnicity to Language Family.
4. table3d_CountryAndLanguageFamily: </br>
has the same column format as table3a_CountryAndEthnicity, but the independent variable is changed from Ethnicity to Language Family.
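The cell values described above (count of one haplogroup over the count of all haplogroups, overall and per group) can be sketched as follows. The function name `haplogroup_by_group` is hypothetical, the example haplogroups are made up, and writing each cell as a "count/total" string is an assumption about the layout; the same function works for ethnicities or language families as the grouping variable.

```python
from collections import Counter, defaultdict


def haplogroup_by_group(records):
    """records: list of (group, haplogroup) pairs.

    Returns {haplogroup: {"Total": "n/N", group: "n/N", ...}}.
    """
    per_group = defaultdict(Counter)
    overall = Counter()
    for group, haplogroup in records:
        per_group[group][haplogroup] += 1
        overall[haplogroup] += 1
    grand_total = sum(overall.values())
    table = {}
    for haplogroup in sorted(overall):
        row = {"Total": f"{overall[haplogroup]}/{grand_total}"}
        for group, counts in per_group.items():
            row[group] = f"{counts[haplogroup]}/{sum(counts.values())}"
        table[haplogroup] = row
    return table


records = [("Kinh", "M7"), ("Kinh", "B4"), ("Cham", "M7")]
for haplogroup, row in haplogroup_by_group(records).items():
    print(haplogroup, row)
```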

table 3a_CountryAndEthnicity

In [None]:
# call function
from Tables.tables.table3.table3 import createTable3ad
# run function for Ethnicity
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
data = ''
groups = ['Country','Ethnicity']
for country in countries.split():
  df = createTable3ad(country,groups,'/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table1/Changed_SEA_haplogroups.csv',groups[-1])
  if len(data) < 1:
    data = df
  else:
    add = df.drop(['Haplogroup'],axis=1)
    data = pd.concat([data,add],axis=1)
  print(country,'finish')

In [None]:
data.to_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table3/table3a_CountryAndEthnicity.xlsx')

table3d_CountryAndLanguageFamily

In [None]:
# call function
from Tables.tables.table3.table3 import createTable3ad
# run function for Language Family
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
data = ''
groups = ['Country','Language family']
for country in countries.split():
  df = createTable3ad(country,groups,'/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table1/Changed_SEA_haplogroups.csv',groups[-1])
  if len(data) < 1:
    data = df
  else:
    add = df.drop(['Haplogroup'],axis=1)
    data = pd.concat([data,add],axis=1)
  print(country,'finish')

In [None]:
data.to_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table3/table3d_CountryAndLanguageFamily.xlsx')

Table 3b,c

In [None]:
# call function
from Tables.tables.table3.table3 import createTable3bc
# run function to get table3b_Ethnicity
groups = ['Ethnicity']
data = createTable3bc(groups,'/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table1/Changed_SEA_haplogroups.csv','Ethnicity')
data.to_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table3/table3b.xlsx')

In [None]:
# run function to get table3c_LanguageFamily
from Tables.tables.table3.table3 import createTable3bc
groups = ['Language family']
data = createTable3bc(groups,'/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table1/Changed_SEA_haplogroups.csv','Language family')
data.to_excel('/content/drive/MyDrive/OUCRUwork/RetrieveData/tables/table3/table3c.xlsx')