## Example: genomic data

Advances in biotechnology have made it possible to collect ever more data about living organisms.  
**Genetic sequences** are one of the most common data types used to compare species. They are well suited to computational analysis because they can be represented as strings over four nucleotides: A, C, G, and T.  
The complete human genome, which determines everything from eye color to susceptibility to certain diseases, consists of roughly $3 \times 10^{9}$ DNA base pairs.

During reproduction, the DNA of both parents is copied and combined. This process is not perfect, so errors, called *mutations*, occur.  
Over time, the accumulation of mutations leads to the emergence of different species, so closely related species tend to have more similar genetic sequences.
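
To make the link between mutations and sequence similarity concrete, here is a minimal sketch that counts the fraction of positions at which two sequences differ. The helper `hamming_fraction` and the three toy sequences are made up for illustration and are not real gene data. The sketch also assumes the sequences have equal length and are already aligned, which is generally not true for real genes.

```python
def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Made-up toy sequences (assumption: not taken from any database)
ancestor         = "ATGACCCCAATACGC"
close_relative   = "ATGACCCCAATACGT"  # 1 substitution
distant_relative = "ATGTCCGCAATTCGT"  # 4 substitutions

print(hamming_fraction(ancestor, close_relative))
print(hamming_fraction(ancestor, distant_relative))
```

The sequence with fewer accumulated substitutions ends up closer to the ancestor under this measure.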

From the NCBI nucleotide database, we have retrieved the mitochondrial **cytochrome b** (*cytb*) gene sequences for 30 species.
Mitochondrial genes like *cytb* are often used in phylogenetic analysis because they are inherited almost exclusively through the maternal line and, in animals, generally accumulate mutations faster than nuclear genes, so they vary enough to tell even closely related species apart.  
This makes them particularly useful for studying evolutionary relationships between species.

First, we download the data from the internet.

In [1]:
from Bio import Entrez
from Bio import SeqIO
import json

species = [
    # Primates (incl. Neanderthal)
    ("Homo sapiens",                    "NC_012920.1"),
    ("Homo sapiens neanderthalensis",   "NC_011137.1"),
    ("Pan troglodytes",                 "NC_001643.1"), #chimpanzee
    ("Gorilla gorilla",                 "NC_001645.1"), 
    ("Pongo abelii",                    "NC_002083.1"), #orangutan

    # Mammals (other; predators & herbivores / omnivores)
    ("Canis lupus familiaris",          "NC_002008.4"),
    ("Equus caballus",                  "NC_001640.1"),
    ("Bos taurus",                      "NC_006853.1"),
    ("Felis catus",                     "NC_001700.1"),
    ("Mus musculus",                    "NC_005089.1"),
    ("Panthera tigris",                 "NC_010642.1"),  
    ("Panthera leo",                    "NC_028306.1"),  
    ("Ursus arctos",                    "NC_003427.1"),   
    ("Cervus elaphus",                  "NC_027844.1"),  
    ("Ovis aries",                      "NC_001941.1"),  

    # Birds
    ("Gallus gallus",                   "AY235570.1"),
    ("Taeniopygia guttata",             "NC_007897.1"),
    ("Columba livia",                   "NC_025926.1"),
    ("Spheniscus demersus",             "NC_008434.1"),
    ("Anas platyrhynchos",              "NC_009684.1"),

    # Reptiles / Amphibians
    ("Bufo gargarizans",                "KU321581"),
    ("Chamaeleo calyptratus",           "NC_012420.1"),
    ("Aneides aeneus",                  "OM743432"),
    ("Xenopus laevis",                  "NC_001573.1"),
    ("Chelonoidis carbonarius",         "OQ789392"),

    # Fish
    ("Takifugu rubripes",               "NC_004299.1"),
    ("Danio rerio",                     "NC_002333.2"),
    ("Salvelinus malma malma",           "MF680544"),
    ("Cyprinus carpio",                 "X61010.1"),
    ("Salmo salar",                     "NC_001960.1"),
]

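# Data loading
import time

Entrez.email = "a@gmail.com"  # NCBI asks for a contact address
target_gene = "cytb"
infile = "../data/seqs.json"
max_tries = 5
seqs = dict()
for name, sid in species:
    print("Loading ...", name)
    gene_seq = None  # reset so a missing gene cannot reuse the previous species
    for attempt in range(max_tries):
        try:
            handle = Entrez.efetch(db="nucleotide", rettype="gb",
                                   retmode="text", id=sid)
            rec = SeqIO.read(handle, "gb")
            handle.close()
            break
        except Exception:
            time.sleep(1)  # brief pause before retrying
    else:
        raise RuntimeError(f"Could not download {sid} ({name})")

    # Take the first gene/CDS feature annotated as the target gene
    for feature in rec.features:
        if feature.type in ("gene", "CDS"):
            gene_name = feature.qualifiers.get("gene", [""])[0]
            if gene_name.lower() == target_gene.lower():
                gene_seq = feature.extract(rec.seq)
                break
    if gene_seq is None:
        raise ValueError(f"No {target_gene} feature found in {sid} ({name})")
    seqs[name] = str(gene_seq)

with open(infile, "w") as f:
    json.dump(seqs, f)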

Loading ... Homo sapiens
Loading ... Homo sapiens neanderthalensis
Loading ... Pan troglodytes
Loading ... Gorilla gorilla
Loading ... Pongo abelii
Loading ... Canis lupus familiaris
Loading ... Equus caballus
Loading ... Bos taurus
Loading ... Felis catus
Loading ... Mus musculus
Loading ... Panthera tigris
Loading ... Panthera leo
Loading ... Ursus arctos
Loading ... Cervus elaphus
Loading ... Ovis aries
Loading ... Gallus gallus
Loading ... Taeniopygia guttata
Loading ... Columba livia
Loading ... Spheniscus demersus
Loading ... Anas platyrhynchos
Loading ... Bufo gargarizans
Loading ... Chamaeleo calyptratus
Loading ... Aneides aeneus
Loading ... Xenopus laevis
Loading ... Chelonoidis carbonarius
Loading ... Takifugu rubripes
Loading ... Danio rerio
Loading ... Salvelinus malma malma
Loading ... Cyprinus carpio
Loading ... Salmo salar


In [2]:
sequences = json.load(open("../data/seqs.json"))

print(len(sequences["Homo sapiens"]))
print(sequences["Homo sapiens"])

1141
ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTCGGCTCACTCCTTGGCGCCTGCCTGATCCTCCAAATCACCACAGGACTATTCCTAGCCATGCACTACTCACCAGACGCCTCAACCGCCTTTTCATCAATCGCCCACATCACTCGAGACGTAAATTATGGCTGAATCATCCGCTACCTTCACGCCAATGGCGCCTCAATATTCTTTATCTGCCTCTTCCTACACATCGGGCGAGGCCTATATTACGGATCATTTCTCTACTCAGAAACCTGAAACATCGGCATTATCCTCCTGCTTGCAACTATAGCAACAGCCTTCATAGGCTATGTCCTCCCGTGAGGCCAAATATCATTCTGAGGGGCCACAGTAATTACAAACTTACTATCCGCCATCCCATACATTGGGACAGACCTAGTTCAATGAATCTGAGGAGGCTACTCAGTAGACAGTCCCACCCTCACACGATTCTTTACCTTTCACTTCATCTTGCCCTTCATTATTGCAGCCCTAGCAACACTCCACCTCCTATTCTTGCACGAAACGGGATCAAACAACCCCCTAGGAATCACCTCCCATTCCGATAAAATCACCTTCCACCCTTACTACACAATCAAAGACGCCCTCGGCTTACTTCTCTTCCTTCTCTCCTTAATGACATTAACACTATTCTCACCAGACCTCCTAGGCGACCCAGACAATTATACCCTAGCCAACCCCTTAAACACCCCTCCCCACATCAAGCCCGAATGATATTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTAGGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCTAGCAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCT

##### Question 5-3-1

How could you compare animal species when their genetic sequences are given as character strings?  
The first idea is to convert the data into a **vector space**, where we can compute distances between species.

💡 *Hint:* You can break sequences into smaller parts and count the occurrences of individual characters, pairs, triples, ... **k-mers**.  
You can also take the position within the sequence into account if you wish.
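
As a small illustration of the hint, the sketch below counts overlapping k-mers in a toy string with `collections.Counter`. The helper name `count_kmers` and the toy sequence are assumptions made for this example. It returns a dictionary-like count, not the fixed-length vector that the exercise asks for.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

toy = "ATGATGA"  # made-up sequence, not real gene data
counts = count_kmers(toy, 2)
print(counts)
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the counts here sum to 6. Counts for sequences of different lengths can be made comparable by dividing by that total.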

Complete and use the `seq_to_kmer_count` function, which converts a character string into a vector representing the counts of all possible k-mers.

Convert the data into the appropriate format, perform **hierarchical clustering**, and visualize the results.  
Are the species positioned in a meaningful way on the dendrogram?  
You should obtain an image similar to this one:

<img src="../slike/nizi_dendrogram.png"></img>
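
The clustering step could look roughly like the sketch below, which runs SciPy's agglomerative clustering on a tiny made-up count matrix. The labels, the counts, the average linkage, and the city-block metric are all assumptions chosen for illustration. They are not necessarily the settings behind the image above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up k-mer count vectors for four hypothetical species
labels = ["a1", "a2", "b1", "b2"]
X = np.array([
    [5, 1, 0, 2],
    [4, 1, 1, 2],
    [0, 6, 3, 0],
    [1, 5, 3, 0],
], dtype=float)

# Normalize each row to frequencies so sequence length does not dominate
X = X / X.sum(axis=1, keepdims=True)

# Agglomerative clustering; Z has one row per merge
Z = linkage(X, method="average", metric="cityblock")

# Cut the tree into two flat clusters
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(labels, groups)))
```

To draw the tree, `scipy.cluster.hierarchy.dendrogram(Z, labels=labels)` can be called with Matplotlib installed. Other linkage methods and distance metrics may give a different tree, so it is worth trying a few.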

In [3]:
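from itertools import product
import numpy as np

def seq_to_kmer_count(seq, k=4):
    """
    Convert the sequence seq into a vector x of k-mer counts.
         AAAA AAAC AAAG AAAT ... TTTG TTTT
    x = [   1    1    2   10 ...   12    7]
    len(x) == 4 ** k; seq contains len(seq) - k + 1 overlapping k-mers.
    """
    # All overlapping k-tuples in seq, e.g. ('A', 'T', 'G', 'A') for k=4
    ktuples = list(zip(*[seq[i:] for i in range(k)]))
    # All 4**k possible k-mers, in the alphabetical order shown above
    kmers   = list(product("ACGT", repeat=k))

    x = np.zeros((len(kmers), ))
    ### Your code here ###

    return x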

[Odgovor](205-3.ipynb#Odgovor-5-3-1)

[Answer](205-3.ipynb#Answer-5-3-1)