# UKBB Cohort Curation

## Introduction 

This tool facilitates cohort curation from several data sources in the UK Biobank (UKBB).

Currently, we support querying:
* the main datafile which includes survey responses, self-reported conditions, and hospital recorded conditions
* the gp_clinical table on the UKBB website, which contains data recorded by general practitioners

In [1]:
import ukbcohort as uk



## Define environment variables and paths

This tool relies on a number of files. Some come provided in this repo, some are system specific. 
To install the driver for a headless browser, please refer to the [README](https://github.ibm.com/isabeki/ukbcohort/blob/master/README.md). 
This step is necessary to interact with the UKBB database website.

This tool also assumes that you have been granted access to UKBB data for a specific project. 
A credentials file stores the access details (application ID, username, and password). 
Presently, the main dataset needs to be downloaded from the UKBB website.



In [2]:
downloadDirectory='../dataFiles'

pathToShowcase='../dataFiles/showcase_toyData.csv'
pathToCoding='../dataFiles/codings_toyData.csv'
pathToReadcode='../dataFiles/readcodes_toyData.csv'

pathToMain='../toy-data/ukb41268_head100.csv'

pathToCredentials = '.'
driverType = 'chrome'
pathToDriver = "prototype_notebooks/going_headless/chromedriver"
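Because several of these files are system specific, it can help to confirm that every configured path exists before running the later cells. The following is a minimal sketch, not part of `ukbcohort`; the helper name `find_missing_paths` is hypothetical and the dictionary simply mirrors the variables defined above.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the names of all entries whose path does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]


# Hypothetical helper input: the values mirror the variables defined above.
configured = {
    "downloadDirectory": "../dataFiles",
    "pathToShowcase": "../dataFiles/showcase_toyData.csv",
    "pathToCoding": "../dataFiles/codings_toyData.csv",
    "pathToReadcode": "../dataFiles/readcodes_toyData.csv",
    "pathToMain": "../toy-data/ukb41268_head100.csv",
    "pathToDriver": "prototype_notebooks/going_headless/chromedriver",
}

missing = find_missing_paths(configured)
if missing:
    print("Missing paths:", ", ".join(missing))
else:
    print("All configured paths exist.")
```

Paths are resolved relative to the directory the notebook runs in, so a reported miss may only mean the notebook was started from a different folder.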


If you want to download the latest showcase and coding files from the UKBB website, run the following cell:

In [3]:
uk.utils.download_latest_files(downloadDirectory=downloadDirectory)

## Decide which conditions are of interest

The curation of the cohort happens in three steps:
1. Choose a list of relevant search terms and interactively go through all fields that contain your terms, either in the description of a field directly or in one of the associated codes. Keep in mind that **_not having_ a condition may be as important as _having_ a condition.** All these fields should be included after the first step.
2. Decide which of the tagged conditions are mandatory fields in your target cohort, optional fields, or fields your cohort should not have.
3. Query the database with the resulting dictionary.

### Step 1: Choose a list of relevant search terms:

Start by defining a list of conditions you want to look out for. This list should include conditions relevant to your cohort (regardless of whether they should be excluded or included in the end). 

Let's start with an example. We want to end up with a cohort in which each patient has had an OCT (optical coherence tomography) scan, has borderline glaucoma, and has never had cancer.

In [3]:
searchTerms = ['borderline glaucoma', 'cancer', 'optical-coherence tomography']

Next, we search the showcase, coding descriptions, and readcodes to find possibly relevant fields. 
* We have the chance to include **'any'** codes for a particular field. This is something we would do for cancer diagnosis, since we will want to exclude anyone who has had 'any' cancer diagnosis.
* Alternatively, we can **'choose'** which codes to include for a field. An example here is that we might want to pick a very particular diagnosis (like borderline glaucoma). 
* Entries that seem irrelevant can be skipped by hitting enter. 

In [4]:
# construct the dataframe with the right files
searchDf = uk.filter.construct_search_df(pathToShowcase=pathToShowcase, pathToCoding=pathToCoding, pathToReadcode=pathToReadcode)
# filter the dataframe to only contain conditions that match the search terms
searchDf = uk.filter.construct_candidate_df(searchDf=searchDf, searchTerms=searchTerms)
# interactively filter conditions
searchDict = uk.filter.select_conditions(searchDf=searchDf)

The following fields have potentially relevant values. Please choose if you want to include all patients who have any value in this field [a], none [hit enter], or if you would like to choose specific values [c].


Include gp_clinical, read_3? [a/c/_]  c


Please choose which codes to include [i] or skip entry [hit enter], skip rest of field [s].


    Include CA - Bone cancer? [i/_/s]  i
    Include Bone cancer? [i/_/s]  i
    Include Cancer of cervix? [i/_/s]  i
Include gp_clinical, read_2? [a/c/_]  c


Please choose which codes to include [i] or skip entry [hit enter], skip rest of field [s].


    Include No history of breast cancer? [i/_/s]  
    Include Borderline glaucoma? [i/_/s]  i
Include Optical-coherence tomography device ID? [a/c/_]  a
Include Cancer code, self-reported? [a/c/_]  a
Include Diagnoses - main ICD9? [a/c/_]  c


Please choose which codes to include [i] or skip entry [hit enter], skip rest of field [s].


    Include 3650 Borderline glaucoma? [i/_/s]  i
Include Cancer year/age first occurred? [a/c/_]  a
Include Non-cancer illness code, self-reported? [a/c/_]  c


Please choose which codes to include [i] or skip entry [hit enter], skip rest of field [s].


    Include gynaecological disorder (not cancer)? [i/_/s]  
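The session above shows why some prompts need to be skipped: a plain text match on 'cancer' also hits entries such as "No history of breast cancer" or "gynaecological disorder (not cancer)". The sketch below illustrates that kind of case-insensitive substring matching on toy rows. It is an assumption about the general idea, not the actual logic of `uk.filter.construct_candidate_df`; the function name, the `field`/`meaning` keys, and the unrelated "Standing height" row are made up for illustration.

```python
def find_candidates(rows, search_terms):
    """Keep rows whose field description or code meaning mentions a search term."""
    terms = [term.lower() for term in search_terms]
    hits = []
    for row in rows:
        text = (row["field"] + " " + row["meaning"]).lower()
        if any(term in text for term in terms):
            hits.append(row)
    return hits


# Toy rows with hypothetical keys; most values are copied from the prompts above.
rows = [
    {"field": "gp_clinical, read_3", "meaning": "Bone cancer"},
    {"field": "gp_clinical, read_2", "meaning": "Borderline glaucoma"},
    {"field": "Optical-coherence tomography device ID", "meaning": ""},
    {"field": "Non-cancer illness code, self-reported",
     "meaning": "gynaecological disorder (not cancer)"},
    {"field": "Standing height", "meaning": ""},  # unrelated row added for contrast
]
search_terms = ['borderline glaucoma', 'cancer', 'optical-coherence tomography']

candidates = find_candidates(rows, search_terms)
for row in candidates:
    print(row["field"], "|", row["meaning"])
```

The "not cancer" row is matched along with the genuine cancer entries, which is why the interactive review step matters.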


### Step 2: Decide if conditions are mandatory, optional, or should be excluded

The searchDict now holds every field and code selected in Step 1. 
By default, querying with this dictionary would result in a dataset of people who have any of the included conditions (the union of people with cancer, OCT, and borderline glaucoma). 
For our example cohort, we need to refine this logic. 

In [5]:
searchDict = uk.filter.update_inclusion_logic(searchDict=searchDict, searchDf=searchDf)

Please choose if the following conditions are mandatory (each patient in your cohort will have this condition) [m], optional (all patients in your cohort will have one or more of these conditions) [o], or undesired (none of the patients in your cohort will have this condition) [e]


gp_clinical, read_3, Bone cancer e
gp_clinical, read_3, Bone cancer e
gp_clinical, read_3, Cancer of cervix e
gp_clinical, read_2, Borderline glaucoma o
Optical-coherence tomography device ID, any m
Cancer code, self-reported, any e
Diagnoses - main ICD9, 3650 Borderline glaucoma i
Cancer year/age first occurred, any e


#### Please note

<img width='50%' src="cohort_selection.png" />

Using **m**, we can set a field + value as a mandatory entry. Any patient returned in the cohort will have this condition. In our case, we want any patient returned to have OCT images. 

When several fields can contain the same diagnosis, we can use **o** to include any of them. In our cohort, we don't care who diagnosed borderline glaucoma. We are happy to include the union of people who told their GP of their condition and people who were diagnosed with it in a hospital.

Using **e** on a key means that patients who have had this condition are removed from the final set of patients. In our case, we want every patient with any cancer condition to be removed. 
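The three options can be read as set operations on patient IDs: intersect the mandatory sets, intersect that with the union of the optional sets, then subtract the excluded sets. The following is a minimal sketch of that reading, assuming the tool combines the conditions as described above; `combine_cohort` is a hypothetical helper and the IDs are toy values.

```python
def combine_cohort(mandatory, optional, excluded):
    """Combine per-condition sets of patient IDs into one cohort.

    mandatory: list of sets; a patient must appear in every one
    optional:  list of sets; a patient must appear in at least one
    excluded:  list of sets; a patient must appear in none
    """
    cohort = None
    if mandatory:
        cohort = set.intersection(*mandatory)
    if optional:
        any_optional = set.union(*optional)
        cohort = any_optional if cohort is None else cohort & any_optional
    if cohort is None:
        cohort = set()
    if excluded:
        cohort = cohort - set.union(*excluded)
    return cohort


# Toy patient IDs, made up for illustration.
has_oct = {1, 2, 3, 4}            # mandatory: OCT taken
glaucoma_gp = {2, 5}              # optional: borderline glaucoma recorded by a GP
glaucoma_hospital = {3, 6}        # optional: borderline glaucoma diagnosed in hospital
any_cancer = {3, 7}               # excluded: any cancer record

cohort = combine_cohort(
    mandatory=[has_oct],
    optional=[glaucoma_gp, glaucoma_hospital],
    excluded=[any_cancer],
)
print(sorted(cohort))
```

Here patient 3 has an OCT and a hospital glaucoma diagnosis but is dropped because of the cancer record, leaving only patient 2.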

### Step 3: Use the searchDict to query all databases

In [6]:
queryStrings = uk.query.createQueryStrings(searchDict=searchDict, pathToMain=pathToMain)

In [7]:
eids = uk.query.query_databases(searchDict=searchDict, queryStrings=queryStrings, pathToMain=pathToMain, pathToCredentials=pathToCredentials, pathToDriver=pathToDriver, driverType=driverType)

Querying pg_clinical table with: SELECT distinct eid FROM gp_clinical WHERE read_2 = 'F450.'
Querying pg_clinical table with: SELECT distinct eid FROM gp_clinical WHERE read_3 = 'XE1vd' OR read_3 = 'XE1vd' OR read_3 = 'XE1vi'
Querying main dataset with: (t5270_0_0.notnull() or t5270_1_0.notnull())
Querying main dataset with: (t20001_0_0.notnull() or t20001_0_1.notnull() or t20001_0_2.notnull() or t20001_0_3.notnull() or t20001_0_4.notnull() or t20001_0_5.notnull() or t20001_1_0.notnull() or t20001_1_1.notnull() or t20001_1_2.notnull() or t20001_1_3.notnull() or t20001_1_4.notnull() or t20001_1_5.notnull() or t20001_2_0.notnull() or t20001_2_1.notnull() or t20001_2_2.notnull() or t20001_2_3.notnull() or t20001_2_4.notnull() or t20001_2_5.notnull() or t20001_3_0.notnull() or t20001_3_1.notnull() or t20001_3_2.notnull() or t20001_3_3.notnull() or t20001_3_4.notnull() or t20001_3_5.notnull()) or (t84_0_0.notnull() or t84_0_1.notnull() or t84_0_2.notnull() or t84_0_3.notnull() or t84_0_4.no
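To see what the first `gp_clinical` query in the log returns, the same SQL can be tried against a small in-memory table. This sketch uses Python's built-in `sqlite3` module rather than the UKBB website; the table layout (only `eid`, `read_2`, `read_3`) and the rows are assumptions made up for illustration.

```python
import sqlite3

# Toy stand-in for the gp_clinical table; the real table lives on the UKBB website.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gp_clinical (eid INTEGER, read_2 TEXT, read_3 TEXT)")
conn.executemany(
    "INSERT INTO gp_clinical VALUES (?, ?, ?)",
    [
        (1, "F450.", None),
        (1, "F450.", None),   # repeated record for the same patient
        (2, None, "XE1vd"),
        (3, "F450.", None),
    ],
)

# Query string copied from the log above.
query = "SELECT distinct eid FROM gp_clinical WHERE read_2 = 'F450.'"
matching_eids = sorted(row[0] for row in conn.execute(query))
print(matching_eids)
conn.close()
```

`distinct` ensures a patient with several matching GP records is counted once, which matters when the resulting IDs are later combined across data sources.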

In [34]:
# eids[0] raised "KeyError: 0"; assuming eids is a collection of patient IDs, count its entries instead
print('The final cohort contains {} patients'.format(len(eids)))
