# External Database Description Initialization

In [1]:
import sys
sys.path.append("/Users/kemalinecik/git_nosync/idtrack")

import os
import time
import pickle
import idtrack
import copy

Initialize the idtrack API first.

In [2]:
local_dir = "/Users/kemalinecik/Downloads/idtrack_temp"
idt = idtrack.API(local_repository=local_dir)
idt.configure_logger()

Let's initialize idtrack for human. First, look up the formal organism name and the latest Ensembl release.

In [3]:
organism_formal_name, last_ensembl_release = idt.get_ensembl_organism("homo sapiens")
organism_formal_name, last_ensembl_release

2025-05-27 10:48:56 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.


('homo_sapiens', 114)

In [4]:
dm = idt.get_database_manager(organism_name='homo_sapiens', last_ensembl_release=114)

2025-05-27 10:48:58 INFO:database_manager: Available MySQL databases for homo_sapiens in 38 assembly and 0 release is being fetched.
2025-05-27 10:48:59 INFO:database_manager: Available MySQL databases for homo_sapiens in 38 assembly and 114 release is being fetched.
2025-05-27 10:49:00 INFO:database_manager: Exporting to the following file `homo_sapiens_assembly-38.h5` with key `ens114_common_availabledatabases`


```python
dm = idtrack.DatabaseManager(
    organism='homo_sapiens',
    ensembl_release=None,
    ignore_before=90,
    ignore_after=95,
    form=copy.deepcopy(idtrack.DB.backbone_form),
    local_repository=local_dir,
)
```

In [5]:
%%time
_ = dm.create_database_content(just_download=True)

2025-05-27 10:49:08 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `79`
2025-05-27 10:49:09 INFO:database_manager: Raw table for `gene` on ensembl release `79` was downloaded for following columns: gene_id, stable_id, version.
2025-05-27 10:49:09 INFO:database_manager: Exporting to the following file `homo_sapiens_assembly-38.h5` with key `ens79_mysql_gene_COL_gene_id_COL_stable_id_COL_version`
2025-05-27 10:49:10 INFO:database_manager: Exporting to the following file `homo_sapiens_assembly-38.h5` with key `ens79_processed_idsraw_gene_gene`
2025-05-27 10:50:16 INFO:database_manager: Raw table for `object_xref` on ensembl release `79` was downloaded for following columns: ensembl_id, ensembl_object_type, xref_id, object_xref_id.
2025-05-27 10:50:16 INFO:database_manager: Exporting to the following file `homo_sapiens_assembly-38.h5` with key `ens79_mysql_object_xref_COL_ensembl_id_COL_ensembl_object_type_COL_object

CPU times: user 4h 54min 43s, sys: 12min 46s, total: 5h 7min 29s
Wall time: 6h 37min 38s


In [6]:
%%time
df = dm.create_database_content()
dm.external_inst.create_template_yaml(df)

2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `79`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `80`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `81`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `82`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `83`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `84`
2025-05-27 17:26:46 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`,

CPU times: user 1.61 s, sys: 569 ms, total: 2.18 s
Wall time: 2.76 s


See the file `homo_sapiens_externals_template.yml` in `local_dir`.

In [8]:
!ls -l /Users/kemalinecik/Downloads/idtrack_temp

total 61807128
-rw-r--r--@ 1 kemalinecik  staff  15889966491 May 27 17:26 homo_sapiens_assembly-37.h5
-rw-r--r--@ 1 kemalinecik  staff  15755222398 May 27 14:08 homo_sapiens_assembly-38.h5
-rw-r--r--@ 1 kemalinecik  staff        55077 May 27 17:26 homo_sapiens_externals_template.yml

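The shell listing above relies on `ls`, which is not available on every platform. As a portable alternative, the following minimal sketch lists the repository contents with `pathlib`. The names `list_repository` and `format_size` are hypothetical helpers introduced here for illustration; they are not part of idtrack.

```python
from pathlib import Path


def list_repository(directory):
    """Return (file name, size in bytes) pairs for regular files, sorted by name."""
    root = Path(directory)
    return sorted(
        (entry.name, entry.stat().st_size)
        for entry in root.iterdir()
        if entry.is_file()  # skip sub-directories
    )


def format_size(num_bytes):
    """Render a byte count with a binary unit suffix (B, KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

For example, `for name, size in list_repository(local_dir): print(name, format_size(size))` prints one line per file in the repository directory.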

Modify a copy of the template file (`homo_sapiens_externals_template.yml` in `local_dir`) following the instructions in the docstring of the `create_template_yaml` method.

Here, these external databases are chosen:
- `Clone_based_ensembl_gene`
- `Clone_based_vega_gene`
- `EntrezGene`
- `HGNC Symbol`
- `Havana gene`
- `NCBI gene`
- `NCBI gene (formerly Entrezgene)`
- `RFAM`
- `UniProtKB Gene Name`
- `Vega gene`
- `Vega_gene`
- `CCDS`
- `Havana transcript`
- `RefSeq_mRNA`
- `RefSeq_mRNA_predicted`
- `RefSeq_ncRNA`
- `RefSeq_ncRNA_predicted`
- `Havana translation`
- `RefSeq_peptide`
- `RefSeq_peptide_predicted`
- `Uniprot/SPTREMBL`
- `Uniprot/SWISSPROT`

Save the modified file as `homo_sapiens_externals_modified.yml` in `local_dir`.

This YAML file is needed to initialize the graph.
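Before loading the modified file, it can help to review exactly what was changed relative to the template. A minimal sketch using the standard-library `difflib` follows. The name `yaml_diff` is a hypothetical helper, not an idtrack function, and it compares the two files as plain text without interpreting the YAML.

```python
import difflib
from pathlib import Path


def yaml_diff(template_path, modified_path):
    """Return unified-diff lines showing how the modified file differs from the template."""
    template_lines = Path(template_path).read_text().splitlines()
    modified_lines = Path(modified_path).read_text().splitlines()
    return list(
        difflib.unified_diff(
            template_lines,
            modified_lines,
            fromfile=Path(template_path).name,
            tofile=Path(modified_path).name,
            lineterm="",  # lines were split without newlines, so add none back
        )
    )
```

Calling `print("\n".join(yaml_diff(template, modified)))` with the two `.yml` paths from `local_dir` shows removed lines prefixed with `-` and added lines prefixed with `+`. An empty result means the two files are identical.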

In [9]:
!ls -l /Users/kemalinecik/Downloads/idtrack_temp

total 61807256
-rw-r--r--@ 1 kemalinecik  staff  15889966491 May 27 17:26 homo_sapiens_assembly-37.h5
-rw-r--r--@ 1 kemalinecik  staff  15755222398 May 27 14:08 homo_sapiens_assembly-38.h5
-rw-r--r--@ 1 kemalinecik  staff        65514 May 27 17:59 homo_sapiens_externals_modified.yml
-rw-r--r--@ 1 kemalinecik  staff        55077 May 27 17:26 homo_sapiens_externals_template.yml


In [11]:
%%time
yaml_dict = dm.external_inst.load_modified_yaml()

CPU times: user 73.4 ms, sys: 8.31 ms, total: 81.7 ms
Wall time: 111 ms


If this step is skipped, the default external database configuration is used.
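Whether the modified file is actually in place can be checked up front. The sketch below assumes the `<organism>_externals_modified.yml` naming seen in the directory listing above. The name `find_modified_yaml` is a hypothetical helper, not part of idtrack.

```python
from pathlib import Path


def find_modified_yaml(directory, organism="homo_sapiens"):
    """Return the path of the modified externals YAML file, or None if it is absent."""
    # File name pattern is an assumption taken from the directory listing.
    candidate = Path(directory) / f"{organism}_externals_modified.yml"
    return candidate if candidate.is_file() else None
```

`find_modified_yaml(local_dir)` returns `None` when the file is missing, which makes a misplaced or misnamed file easy to spot before the graph is initialized.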