## Bot: Add GenBank Assembly accessions
#### As an example, we will create a bot to add GenBank Assembly accessions to bacterial organisms in Wikidata

An assembly is a specific sample from a biological organism that was sequenced and analyzed. Well-studied organisms or strains are often sequenced multiple times, with each data set deposited into repositories. In this bot, we'll add the assembly information to bacterial strains that have been sequenced only once. wikigenomes.org uses this data to annotate bacterial genomes.


See: https://www.wikidata.org/wiki/Q21079489#P4333

In [1]:
# make the cells wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

### Data Source
GenBank is a repository for genomic information. It provides a flat file with information about organisms [here](ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt)

In [2]:
# Download data
!wget -nc ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

In [3]:
from time import strftime, gmtime
from tqdm import tqdm
import pandas as pd
from datetime import datetime
from wikidataintegrator import wdi_core, wdi_login, wdi_helpers, wdi_property_store

# create a file called local.py with your credentials
from local import WDUSER, WDPASS

In [4]:
PROPS = {
    'NCBI Taxonomy ID': 'P685',
    'GenBank Assembly accession': 'P4333',
    'stated in': 'P248',
    'retrieved': 'P813',
    'reference URL': 'P854'
}

ITEMS = {
    'GenBank': 'Q901755'
}

### load and pre-process data

In [5]:
# load the tab-separated file
df = pd.read_csv("prokaryotes.txt", sep='\t', low_memory=False)
# filter for complete genomes only
df = df.query("Status == 'Complete Genome'")
# create a dict where the key is the taxID, value is the list of accessions for that taxID
d = df.groupby("TaxID").agg({'Assembly Accession': lambda x: list(x)}).to_dict()['Assembly Accession']
# filter out the ones where there is more than one accession
d = {k: v[0] for k, v in d.items() if len(v) == 1}
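To make the filtering step concrete, here is a minimal sketch of the same logic on a toy table, written without pandas. The rows and accession strings are made-up placeholders, not real GenBank records, and the names `rows` and `toy_d` are only for illustration.

```python
from collections import defaultdict

# Toy rows standing in for prokaryotes.txt: (TaxID, Status, Assembly Accession).
# These accessions are placeholders, not real GenBank records.
rows = [
    (100, "Complete Genome", "GCA_000000001.1"),
    (200, "Complete Genome", "GCA_000000002.1"),
    (200, "Complete Genome", "GCA_000000003.1"),  # second assembly for taxid 200
    (300, "Contig", "GCA_000000004.1"),           # not a complete genome
]

# Group accessions by taxid, keeping complete genomes only
accessions_by_taxid = defaultdict(list)
for taxid, status, accession in rows:
    if status == "Complete Genome":
        accessions_by_taxid[taxid].append(accession)

# Keep only the taxids that have exactly one accession
toy_d = {taxid: accs[0] for taxid, accs in accessions_by_taxid.items() if len(accs) == 1}
print(toy_d)
```

Taxid 200 is dropped because it has two assemblies, and taxid 300 is dropped because its status is not 'Complete Genome', so only taxid 100 survives.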

In [6]:
# preview 10 items. key is taxid, value is genbank assembly
print(list(d.items())[:10])

[(679936, 'GCA_000237975.1'), (385025, 'GCA_000299965.1'), (1146883, 'GCA_000284015.1'), (1366, 'GCA_002310475.1'), (712710, 'GCA_001717525.1'), (1335303, 'GCA_000464955.2'), (9, 'GCA_900128725.1'), (172042, 'GCA_002355935.1'), (1335307, 'GCA_000439695.1'), (196620, 'GCA_000011265.1')]


In [7]:
# ~ 5k items to do
len(d)

4958

### Core properties
WDI allows you to define core properties which should have unique values across all of Wikidata. WDI will automatically check that these are unique and throw exceptions on failure. These core props are also used to retrieve items if the QID is not known.

#### You can add to the core properties by defining another, as below

In [8]:
wdi_property_store.wd_properties['P4333'] = {
    'core_id': True
}

### Login

In [9]:
# you can login very easily!
login = wdi_login.WDLogin(WDUSER, WDPASS)

Successfully logged in as Gstupp


### Create References

In [10]:
# We can define a helper function to create the reference statements
def create_reference(genbank_id):
    stated_in = wdi_core.WDItemID(ITEMS['GenBank'], PROPS['stated in'], is_reference=True)
    retrieved = wdi_core.WDTime(strftime("+%Y-%m-%dT00:00:00Z", gmtime()), PROPS['retrieved'], is_reference=True)
    url = "https://www.ncbi.nlm.nih.gov/genome/?term={}".format(genbank_id)
    ref_url = wdi_core.WDUrl(url, PROPS['reference URL'], is_reference=True)
    return [stated_in, retrieved, ref_url]
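The two string values that go into the reference can be checked on their own, without touching Wikidata. The sketch below uses only the standard library; `format_retrieved` is a hypothetical helper introduced here so the timestamp can be produced for a fixed moment instead of the current time.

```python
from time import gmtime, strftime

def format_retrieved(epoch_seconds=None):
    # Hypothetical helper: same format string as in create_reference.
    # With epoch_seconds=None, gmtime() uses the current time.
    return strftime("+%Y-%m-%dT00:00:00Z", gmtime(epoch_seconds))

# A fixed moment (the Unix epoch) makes the result reproducible
example_time = format_retrieved(0)
print(example_time)

# The reference URL is built from the accession, here the first one in the preview above
example_url = "https://www.ncbi.nlm.nih.gov/genome/?term={}".format("GCA_000237975.1")
print(example_url)
```

Note that the time of day is always written as 00:00:00, so the retrieved value only carries the date.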

### The Meat of the Bot

In [11]:
def run_one(taxid, genbank_id):
    # create a statement for the ncbi tax id
    ncbi_statement = wdi_core.WDExternalID(str(taxid), PROPS['NCBI Taxonomy ID'])
    # we are going to retrieve the item to be modified based on the NCBI Taxonomy ID, which should already exist on all organisms.
    try:
        item = wdi_core.WDItemEngine(data=[ncbi_statement], domain="organism", search_only=True, item_name="organism")
    except wdi_core.ManualInterventionReqException as e:
        # if more than one item has this NCBI tax id, this will throw an error!
        # instead, catch it and log the error
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "", str(e), type(e))
        wdi_core.WDItemEngine.log("ERROR", msg)
        return
    
    if item.wd_item_id:
        # if the item exists, create the genbank statement
        reference = create_reference(genbank_id)
        genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
        # create the item object, specifying the qid
        item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=item.wd_item_id)
        # use this helper method to perform the write. It automatically writes to a log file and captures errors
        # wdi also has an automatic backoff and retry functionality
        wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                              login=login, edit_summary="update GenBank Assembly accession")
    else:
        # if the item doesn't exist, log it and skip
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "No organism found with taxid {}".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)

### Run!

In [12]:
# this will take a while to run (5000 items * 1 sec/item ≈ 1.4 hrs)
for taxid, genbank_id in tqdm(d.items()):
    #run_one(taxid, genbank_id)
    pass

100%|██████████| 4958/4958 [00:00<00:00, 1207417.94it/s]


### The Meat of the Bot V2 (fast run mode)

In [13]:
# instead of using wdi and search_only to retrieve the item, we'll do it manually, all at once
tax_qid_map = wdi_helpers.id_mapper(PROPS['NCBI Taxonomy ID'], return_as_set=True)
# filter out those where the same taxid is used across more than one item
tax_qid_map = {k:list(v)[0] for k,v in tax_qid_map.items() if len(v)==1}
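As a rough illustration of the filter above, here is the same dict comprehension applied to a small hand-written mapping. The taxids and QIDs are placeholders, and the assumption that the keys are strings is inferred from the bot code, which converts `taxid` with `str()` before the lookup.

```python
# Toy stand-in for the mapping of taxid -> set of QIDs.
# The taxids and QIDs below are placeholders for illustration only.
toy_map = {
    "9001": {"Q111"},          # one item uses this taxid: keep
    "9002": {"Q222", "Q333"},  # two items share this taxid: ambiguous, drop
}

# Same comprehension as above: keep only unambiguous taxids
one_to_one = {k: list(v)[0] for k, v in toy_map.items() if len(v) == 1}
print(one_to_one)

# With string keys, an integer taxid has to be converted before the lookup
print(str(9001) in one_to_one, 9001 in one_to_one)
```

Dropping the ambiguous taxids here plays the same role as catching `ManualInterventionReqException` in the first version of the bot.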

In [14]:
def run_one(taxid, genbank_id):
    # get the QID
    taxid = str(taxid)
    if taxid not in tax_qid_map:
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "organism with taxid {} not found or skipped".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)
        return None
    qid = tax_qid_map[taxid]
    reference = create_reference(genbank_id)
    genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
    
    # create the item object, specifying the qid
    item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=qid, fast_run=True, 
                                 fast_run_base_filter={PROPS['GenBank Assembly accession']: ''})

    wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                          login=login, edit_summary="update GenBank Assembly accession")

In [None]:
# if no write is required, this will finish in a minute
for taxid, genbank_id in tqdm(d.items()):
    run_one(taxid, genbank_id)

In [19]:
# Check out the log
open("logs/WD_bot_run-20171013_13:20.log").read().split("\n")[-10:]

 'INFO;10/13/2017 13:31:02;GCA_000007765.2;P4333;Q21064832;UPDATE;None',
 '']
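The log lines are semicolon-delimited, so they can be split into fields for a quick summary. This is a minimal sketch using the line shown above; the field names in `FIELD_NAMES` are guesses based on that one line, not names defined by WDI.

```python
# The log line shown in the output above
log_line = "INFO;10/13/2017 13:31:02;GCA_000007765.2;P4333;Q21064832;UPDATE;None"

# Assumed field names, inferred from the example line only
FIELD_NAMES = ["level", "timestamp", "record_id", "property", "qid", "action", "error"]

# Naive split: this would break if a message itself contained a semicolon
record = dict(zip(FIELD_NAMES, log_line.split(";")))
print(record["level"], record["record_id"], record["qid"], record["action"])
```

For a full log file, the standard `csv` module with `delimiter=';'` would be a sturdier choice than a plain `split`.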

### What if we want to update the retrieved date?
In version 2, only the value is checked; the reference is never checked. We can modify the bot so that the reference is also checked, and updated if the retrieved date is older than X days (e.g. 180 days).
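The "older than X days" test itself is simple date arithmetic. The sketch below shows the idea with the standard library; `is_stale` is a hypothetical helper for illustration and is not how `ref_handlers` is implemented.

```python
from datetime import datetime, timedelta

def is_stale(retrieved, now, days=180):
    # Hypothetical helper: True if the retrieved date is more than `days` days before `now`
    return now - retrieved > timedelta(days=days)

# Fixed dates so the example is reproducible
now = datetime(2018, 6, 1)
old_ref = datetime(2017, 10, 13)    # more than 180 days before `now`
recent_ref = datetime(2018, 3, 1)   # less than 180 days before `now`

print(is_stale(old_ref, now), is_stale(recent_ref, now))
```

Under this rule, only the first reference would be rewritten with a fresh retrieved date; the second would be left alone, which avoids an edit on every run.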

In [20]:
from wikidataintegrator import ref_handlers
from functools import partial

update_retrieved_if_new = partial(ref_handlers.update_retrieved_if_new, days=180)

def run_one(taxid, genbank_id):
    # get the QID
    taxid = str(taxid)
    if taxid not in tax_qid_map:
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "organism with taxid {} not found or skipped".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)
        return None
    qid = tax_qid_map[taxid]
    reference = create_reference(genbank_id)
    genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
    
    # create the item object, specifying the qid
    item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=qid, fast_run=True, 
                                 fast_run_base_filter={PROPS['GenBank Assembly accession']: ''},
                                 global_ref_mode='CUSTOM', fast_run_use_refs=True,
                                 ref_handler=update_retrieved_if_new)

    wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                          login=login, edit_summary="update GenBank Assembly accession")


In [None]:
for taxid, genbank_id in tqdm(d.items()):
    run_one(taxid, genbank_id)