# UKBB Cohort Curation

## Introduction 

This tool facilitates cohort curation from several data sources in the UK Biobank.

Currently, we support querying:
* the main datafile which includes survey responses, self-reported conditions, and hospital recorded conditions
* the gp_clinical table within the SQL database accessible from the data portal, which contains data recorded by general practitioners

This notebook walks through how to use the module in a curation pipeline. Typically, such a pipeline starts with some general search terms that are used to query the databases within UK Biobank. In this example, the relevant search terms are 'borderline glaucoma', 'eye disease', and 'optical coherence tomography'. The aim is to create a cohort that has a 'borderline glaucoma' eye condition, does not have any other 'eye disease', and has 'optical coherence tomography' imaging data available.

Let's start by importing the module

In [1]:
import ukbcohort as uk

Next, we build a dictionary of the relevant datafield:code combinations, grouped by the conditional logic that defines our cohort. In this example, we want every participant to have a 'borderline glaucoma' condition, have OCT data available, and not have any other eye disease. We express this logic with the keys "any_of", "all_of", and "none_of", each followed by a list of datafield, code pairs. Conditions where any one match is enough, such as variations of 'borderline glaucoma', belong to the "any_of" key. Conditions that are necessary, such as the availability of 'optical coherence tomography' data, belong to the "all_of" key. Conditions that are undesired, such as variations of other 'eye diseases', belong to the "none_of" key.

In [2]:
cohort_dictionary = {"all_of": [], "any_of": [["read_3", "F4252"], ["read_3", "F4251"]], "none_of": []}
{"any_of": [["read_3", "XE18j"], ["read_3", "F4251"], ["read_3", "F4252"], ["read_3", "XE18j"], ["read_3", "XE18j"]]}
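Criteria dictionaries that are edited by hand can pick up misspelled keys or repeated datafield, code pairs. The sketch below shows one way to tidy such a dictionary before querying. `normalise_criteria` is a hypothetical helper written for this walkthrough, not a function of the ukbcohort module, and it assumes each entry is a two-item pair.

```python
ALLOWED_KEYS = ("all_of", "any_of", "none_of")


def normalise_criteria(criteria):
    """Return a copy with all three keys present and duplicate pairs removed.

    Hypothetical helper, not part of ukbcohort.
    """
    unknown = set(criteria) - set(ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")

    cleaned = {}
    for key in ALLOWED_KEYS:
        unique_pairs = []
        for pair in criteria.get(key, []):
            pair = tuple(pair)
            if len(pair) != 2:
                raise ValueError(f"expected a (datafield, code) pair, got {pair!r}")
            # Keep the first occurrence only, preserving the original order.
            if pair not in unique_pairs:
                unique_pairs.append(pair)
        cleaned[key] = [list(pair) for pair in unique_pairs]
    return cleaned


tidy = normalise_criteria(
    {"any_of": [["read_3", "XE18j"], ["read_3", "F4251"], ["read_3", "XE18j"]]}
)
print(tidy)
```

Missing keys are filled with empty lists so that later steps can rely on all three keys being present.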

The graphic below displays a Venn diagram representing the conditional logic specified in the dictionary above.

<img width='50%' src="cohort_selection.png" />
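The "all_of", "any_of", and "none_of" logic can also be read as plain set operations on participant IDs. The following is a minimal sketch of that reading, not the module's actual implementation. The function `select_cohort`, the participant IDs, and the `example_oct_field` and `example_eye_field` pairs are all made up for illustration; only the two `read_3` codes come from the dictionary above. It also assumes that an empty "any_of" or "all_of" list imposes no constraint.

```python
def select_cohort(universe, matches, criteria):
    """Combine per-criterion ID sets according to all_of / any_of / none_of.

    universe: set of all candidate participant IDs.
    matches:  dict mapping (datafield, code) tuples to sets of matching IDs.
    criteria: dict with optional "all_of", "any_of", "none_of" lists of pairs.
    """
    def ids_for(pair):
        return matches.get(tuple(pair), set())

    cohort = set(universe)

    # all_of: a participant must match every listed pair.
    for pair in criteria.get("all_of", []):
        cohort &= ids_for(pair)

    # any_of: a participant must match at least one listed pair.
    any_pairs = criteria.get("any_of", [])
    if any_pairs:
        cohort &= set().union(*(ids_for(pair) for pair in any_pairs))

    # none_of: a participant must match none of the listed pairs.
    for pair in criteria.get("none_of", []):
        cohort -= ids_for(pair)

    return cohort


# Made-up participant IDs and placeholder field names, for illustration only.
universe = {1, 2, 3, 4, 5, 6}
matches = {
    ("read_3", "F4252"): {1, 2},
    ("read_3", "F4251"): {2, 3},
    ("example_oct_field", "available"): {1, 2, 3, 5},
    ("example_eye_field", "other_disease"): {3, 6},
}
criteria = {
    "all_of": [["example_oct_field", "available"]],
    "any_of": [["read_3", "F4252"], ["read_3", "F4251"]],
    "none_of": [["example_eye_field", "other_disease"]],
}
cohort = select_cohort(universe, matches, criteria)
print(sorted(cohort))  # [1, 2]
```

Participant 3 matches a glaucoma code and has imaging data but is removed by the "none_of" rule, while participant 5 has imaging data but no glaucoma code.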

The next step is to query the databases using this cohort_dictionary. The module queries the 'main' dataset and, optionally, the 'gp_clinical' data as well. Querying both databases gives the most comprehensive search of the cohort within UK Biobank. We have to specify the path to the main dataset and pass in the credentials for the data portal website in order to access the GP clinical database. We can also optionally choose to write the participant IDs within our cohort to a file within "write_dir".

In [3]:
main_filename = "../dataFiles/main_head100.csv"
cred_path = "../credentials.py"
out_filename = "example_cohort.txt"
write_dir = "example_cohort"

# NOTE: driver_path and driver_type are not defined anywhere above;
# assign them before running this cell.
queries = uk.query.create_queries(cohort_criteria=cohort_dictionary, main_filename=main_filename,
                                  portal_access=True)
cohort_ids = uk.query.query_databases(cohort_criteria=cohort_dictionary, queries=queries, main_filename=main_filename,
                                      credentials_path=cred_path, write_dir=write_dir,
                                      driver_path=driver_path, driver_type=driver_type,
                                      timeout=120, portal_access=True, out_filename=out_filename, write=True)


AttributeError: module 'ukbcohort' has no attribute 'query'