## Clever Data Generation

### New workflow:

Generate the table in a single query for efficiency. Start with the `tbl_srcode` table as the base and join all other features to it. The CLEVER source data is therefore a copy of the clinical codings in the primary care data, with each entry augmented with info on:

* person/date/snomedcode - already in srcode table
* concept_name - taken from concept tables
* flags indicating whether codes relate to sub-groups of interest, as per Mai's keyword searches over the concept names - currently "speech and language" and "eye and vision", though additional scripts for concepts like "DCD"/"Social" have yet to be provided
* flags indicating whether persons belong to different cohorts - currently taken from the education data, identifying CIN/CLA/exclusions

The query starts with the `SELECT` list, which gathers the features of interest from the source data tables. Each group of selected columns corresponds to a subquery on the source table it draws from. The subqueries are evaluated as they are `JOIN`ed to the `tbl_srcode` table, so subsets of codes and cohorts can be established within the same single query.
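
The flag pattern described above (a `LEFT JOIN` to a filtered subquery, then a `CASE` on whether the joined key is `NULL`) can be tried out on toy data. This is a minimal sketch using Python's built-in `sqlite3` rather than BigQuery; the table names `srcode` and `concept`, the codes, and the concept names are all made up for illustration and are not the real CLEVER data.

```python
import sqlite3

# Hypothetical toy tables standing in for tbl_srcode and the concept table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE srcode (person_id INTEGER, snomedcode TEXT);
    CREATE TABLE concept (concept_code TEXT, concept_name TEXT, vocabulary_id TEXT);
    INSERT INTO srcode VALUES (1, 'A1'), (2, 'B2'), (3, 'C3');
    INSERT INTO concept VALUES
        ('A1', 'Speech delay', 'SNOMED'),
        ('B2', 'Asthma', 'SNOMED');
""")

# The subquery keeps only codes in the subgroup of interest. Rows of srcode
# without a match get NULL for speech.concept_code, which becomes flag 0.
rows = con.execute("""
    SELECT sr.person_id,
           sr.snomedcode,
           CASE WHEN speech.concept_code IS NOT NULL THEN 1 ELSE 0 END AS is_speech
    FROM srcode sr
    LEFT JOIN (
        SELECT concept_code FROM concept
        WHERE vocabulary_id = 'SNOMED'
          AND LOWER(concept_name) LIKE '%speech%'
    ) speech
    ON sr.snomedcode = speech.concept_code
    ORDER BY sr.person_id
""").fetchall()

print(rows)  # one output row per srcode row, each with a 0/1 flag
```

Because the subquery returns at most one row per code here, the `LEFT JOIN` adds a column without changing the row count of the base table.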

The script is commented to clarify this logic as it progresses:

In [None]:
%%bigquery


---------------------------------------------------------------------------------
CREATE OR REPLACE TABLE `yhcr-prd-phm-bia-core.CB_1741_Relins.clever_data_test` AS
SELECT 
---------------------------------------------------------------------------------
    
    /* #########################################################################
    Select statements are divided on the basis of the source table as per the 
    join statements below. Each group of select statements corresponds with a 
    join statement in the second part of the query. See the join statements (also 
    divided in this way) for more information on the data sources.
    ########################################################################## */
    
    /* features from the tbl_srcode table in primary care */
    
    -----------------------------------------------------------------------------
    sr.person_id, sr.dateevent, sr.snomedcode, 
    -----------------------------------------------------------------------------
    
    /* text description of each snomed from the concept table */
    
    -----------------------------------------------------------------------------
    snomed.concept_name,
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for codes that are part of `speech` subgroup */
    
    -----------------------------------------------------------------------------
    CASE 
        WHEN speech.concept_code IS NOT NULL THEN TRUE 
        ELSE FALSE 
    END AS is_speech_and_language,
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for codes that are part of `eye` subgroup */
    
    -----------------------------------------------------------------------------
    CASE 
        WHEN eye.concept_code IS NOT NULL THEN TRUE 
        ELSE FALSE 
    END AS is_eye_and_vision, 
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for individuals that have a PERM exclusion 
       on their academic record */
    
    -----------------------------------------------------------------------------
    CASE
        WHEN perm.person_id IS NOT NULL THEN TRUE
        ELSE FALSE
    END AS has_been_perm_excluded,
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for individuals that have a FIXD exclusion 
       on their academic record */
        
    -----------------------------------------------------------------------------
    CASE
        WHEN fixd.person_id IS NOT NULL THEN TRUE
        ELSE FALSE
    END AS has_been_fixd_excluded,
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for individuals who appear in the children in need 
       data */
        
    -----------------------------------------------------------------------------
    CASE
        WHEN cin.person_id IS NOT NULL THEN TRUE
        ELSE FALSE
    END AS is_child_in_need,
    -----------------------------------------------------------------------------
    
    /* a TRUE/FALSE feature for individuals who appear in the looked after 
       children data */
        
    -----------------------------------------------------------------------------
    CASE
        WHEN cla.person_id IS NOT NULL THEN TRUE
        ELSE FALSE
    END AS is_looked_after_child,
    -----------------------------------------------------------------------------
    
    /* features from the demographics table. further detail below:
        
       Creates an `age` feature as the difference in years from the date of  
       birth to the date the event was recorded */
        
    -----------------------------------------------------------------------------
    FLOOR(DATE_DIFF(sr.dateevent, demo.DOB_formatted, DAY) / 365.25) AS age,
    -----------------------------------------------------------------------------
    
    /* the `census_ethnicity` feature is formatted as 
       "<ethnic group> - <ethnic subgroup>" as per the 2011 UK census categories. 
       These two statements parse the group and subgroup from the 
       `census_ethnicity` feature. */
        
    -----------------------------------------------------------------------------
    CASE
        WHEN REGEXP_EXTRACT(demo.census_ethnicity, r'^(.+?):') IS NOT NULL 
        THEN REGEXP_EXTRACT(demo.census_ethnicity, r'^(.+?):')
        ELSE "unknown"        
    END AS ethnic_group, 
    CASE
        WHEN REGEXP_EXTRACT(demo.census_ethnicity, r':(.+?)-') IS NOT NULL 
        THEN REGEXP_EXTRACT(demo.census_ethnicity, r':(.+?)-')
        ELSE "unknown"        
    END AS ethnic_subgroup, 
    -----------------------------------------------------------------------------
    
    /* The sex features are recorded as concept codes in the demographics table. 
       This statement converts the codes into "male"/"female"/"unknown" 
       categories. */
    
    -----------------------------------------------------------------------------
    CASE
        WHEN demo.remapped_gender = 45766034 THEN "male"            
        WHEN demo.remapped_gender = 45766035 THEN "female"            
        ELSE "unknown"        
    END AS sex, 
    demo.LSOA AS LSOA_code, 
    -----------------------------------------------------------------------------
    
/* ##############################################################################
The source table is the primary care `tbl_srcode` table. This query creates a 
copy of each row of `tbl_srcode` with a subset of 3 columns. Each 
subsequent join adds either demographic features (in the case of the 
demographics table) or a boolean TRUE/FALSE feature. The boolean features can 
relate either to the clinical codes, in the case of conditions of interest 
like speech or vision codes, or to the individual person_id in the case of 
education or social care features. More detail is given for each specific 
join below:
############################################################################## */
    
/* begin with every row from `tbl_srcode` as basis */

---------------------------------------------------------------------------------
FROM yhcr-prd-phm-bia-core.CB_FDM_PrimaryCare_V8.tbl_srcode sr 
---------------------------------------------------------------------------------

/* Join a description for each of the snomed codes in the table. The concept 
   table contains codings other than snomed, so it needs subsetting to 
   snomed codes only */ 

---------------------------------------------------------------------------------
LEFT JOIN (
  SELECT concept_code, concept_name FROM CB_CDM_VOCAB.concept
  WHERE vocabulary_id = "SNOMED"
) snomed 
ON sr.snomedcode = CAST(snomed.concept_code AS STRING)
---------------------------------------------------------------------------------

/* Join a concept code for each snomed that contains either "speech" or "Speech" 
   in its description. The concept code will be NULL for any snomeds that don't 
   match with this subgroup. This will be used to create the boolean TRUE/FALSE 
   `is_speech_and_language` feature above. */ 
    
---------------------------------------------------------------------------------
LEFT JOIN (
  SELECT concept_code
  FROM CB_CDM_VOCAB.concept
  WHERE (vocabulary_id = "SNOMED")
    AND (valid_end_date > CAST('2017-12-30' AS DATE)) 
    AND (concept_name LIKE '%speech%' OR concept_name LIKE '%Speech%')
) speech 
ON sr.snomedcode = CAST(speech.concept_code AS STRING)
---------------------------------------------------------------------------------

/* Join a concept code for each snomed that contains one of the substrings below 
   relating to vision. As with the speech snomeds, this will return a NULL value
   for any unrelated codes, which is used above to create the TRUE/FALSE
   `is_eye_and_vision` feature */ 

---------------------------------------------------------------------------------
LEFT JOIN (
  SELECT concept_code FROM CB_CDM_VOCAB.concept
  WHERE vocabulary_id = "SNOMED"
    AND (valid_end_date > CAST('2017-12-30' AS DATE)) 
    AND (LOWER(concept_name) LIKE '%vision%' OR LOWER(concept_name) LIKE '%eye%') 
    AND domain_id IN("Observation", "Procedure") 
    AND LOWER(concept_name) NOT LIKE '%adult%' 
    AND LOWER(concept_name) NOT LIKE '%revision%' 
    AND LOWER(concept_name) NOT LIKE '%provision%' 
    AND LOWER(concept_name) NOT LIKE '%supervision%' 
    AND LOWER(concept_name) NOT LIKE '%division%'
) eye 
ON sr.snomedcode = CAST(eye.concept_code AS STRING)
---------------------------------------------------------------------------------

/* Join with the exclusions table for individuals with a permanent exclusion
   on their academic record. person_ids that don't have a corresponding exclusion
   will return NULL */

---------------------------------------------------------------------------------
LEFT JOIN (
  SELECT person_id FROM `CB_FDM_DepartmentForEducation.exclusions_cleaned`
  WHERE CATEGORY = "PERM"
  GROUP BY person_id
) perm
ON sr.person_id = perm.person_id
---------------------------------------------------------------------------------

/* Join with the exclusions table for individuals with a fixed exclusion
   on their academic record as above */

---------------------------------------------------------------------------------
LEFT JOIN (
  SELECT person_id FROM `CB_FDM_DepartmentForEducation.exclusions_cleaned`
  WHERE CATEGORY = "FIXD"
  GROUP BY person_id
) fixd
ON sr.person_id = fixd.person_id
---------------------------------------------------------------------------------

/* Join with the children in need table to id unique individuals therein. As 
   above, person_ids that don't appear in this dataset will return NULL. */

---------------------------------------------------------------------------------
LEFT JOIN (
    SELECT DISTINCT person_id 
    FROM `CB_FDM_DepartmentForEducation.src_ChildrenInNeed`
) cin
ON sr.person_id = cin.person_id
---------------------------------------------------------------------------------

/* Join with the looked after children table to id unique individuals therein as 
   above */

---------------------------------------------------------------------------------
LEFT JOIN (
    SELECT DISTINCT person_id 
    FROM `CB_FDM_DepartmentForEducation.src_ChildrenLookedAfter`
) cla
ON sr.person_id = cla.person_id
---------------------------------------------------------------------------------

/* Join with the demographics table for information on age, sex, ethnicity and 
geolocation */ 

---------------------------------------------------------------------------------
LEFT JOIN `CB_STAGING_DATABASE.src_DemoGraphics_MASTER` demo
ON sr.person_id = demo.person_id
---------------------------------------------------------------------------------
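
The two derived demographic features in the query (the `age` arithmetic and the `census_ethnicity` regexes) are easy to check in isolation. The following is a rough Python sketch that mirrors those expressions; the helper names `age_in_years` and `parse_ethnicity` and the sample ethnicity strings are hypothetical, and Python's `re` is assumed to behave like BigQuery's regex engine for these simple patterns.

```python
import math
import re
from datetime import date

def age_in_years(event_date, dob):
    """Mirror FLOOR(DATE_DIFF(event, dob, DAY) / 365.25)."""
    return math.floor((event_date - dob).days / 365.25)

def parse_ethnicity(value):
    """Mirror the two REGEXP_EXTRACT / CASE expressions in the query."""
    text = value or ""  # a NULL input falls through to "unknown"
    group = re.search(r'^(.+?):', text)      # text before the first colon
    subgroup = re.search(r':(.+?)-', text)   # text between a colon and a hyphen
    return (
        group.group(1) if group else "unknown",
        subgroup.group(1) if subgroup else "unknown",
    )

# Made-up strings: only the first has both a colon and a hyphen.
print(parse_ethnicity("White: British - example"))
print(parse_ethnicity("White - British"))
print(age_in_years(date(2020, 6, 15), date(2010, 6, 16)))
```

Two things are worth checking against the real data. The patterns only match values that contain a colon (and, for the subgroup, a later hyphen), so a value written purely as `<group> - <subgroup>` would come out as "unknown" for both features, and the extracted subgroup keeps any surrounding spaces. Dividing by 365.25 is also an approximation, so the computed age can differ by one on or very near a birthday.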

## Notes:

### Identifying relevant concepts:

There's a major issue with the original logic that joins the raw `concept_name`s to the `SRCode` table. Simply joining `concept_name` to `CB_FDM_PrimaryCare_V8.tbl_srcode` by `concept_code` takes a 689M row table and inflates it to 1B+ rows. This happens because many individual `concept_code`s are duplicated in the `CB_CDM_VOCAB.concept` table, several of which map to wildly different `concept_name`s, for example:

`concept_code` 001 has the following `concept_name`s:

    “Cholera”
    “Heart transplant or implant of heart assist system w MCC”
    “Craniotomy Age >17 with Complications, Comorbidities”
    “Central Nervous System and Cranial Nerves, Bypass”

The entries in the `CB_CDM_VOCAB.concept` table are derived from multiple vocabularies (e.g. SNOMED, ICD10, CTV3, etc.), so joining by `concept_code` alone means that a code found in more than one vocabulary joins several times to each matching row in the `SRCode` table, duplicating those rows with contradictory `concept_name`s. The concepts need to be filtered so that only SNOMED concepts are joined, by keeping only rows that have the value `SNOMED` in the `vocabulary_id` column.
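
The inflation and the fix can be reproduced on a few rows. This is an illustrative sketch with Python's built-in `sqlite3`, not the BigQuery tables themselves; the table contents, concept ids, and concept names are invented, and only the shape of the problem (one code appearing under several `vocabulary_id` values) follows the description above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE srcode (person_id INTEGER, snomedcode TEXT);
    CREATE TABLE concept (
        concept_id INTEGER, concept_code TEXT,
        concept_name TEXT, vocabulary_id TEXT
    );
    INSERT INTO srcode VALUES (1, '001'), (2, '001'), (3, '002');
    -- Hypothetical rows: code '001' exists in three vocabularies.
    INSERT INTO concept VALUES
        (10, '001', 'Name from SNOMED', 'SNOMED'),
        (11, '001', 'Name from another vocabulary', 'ICD10'),
        (12, '001', 'Name from a third vocabulary', 'CTV3'),
        (13, '002', 'Another SNOMED name', 'SNOMED');
""")

# Joining on the code alone: every srcode row with '001' matches three concepts.
unfiltered = con.execute("""
    SELECT COUNT(*) FROM srcode sr
    LEFT JOIN concept c ON sr.snomedcode = c.concept_code
""").fetchone()[0]

# Restricting the concept side to SNOMED keeps one match per srcode row.
filtered = con.execute("""
    SELECT COUNT(*) FROM srcode sr
    LEFT JOIN (
        SELECT concept_code, concept_name FROM concept
        WHERE vocabulary_id = 'SNOMED'
    ) c ON sr.snomedcode = c.concept_code
""").fetchone()[0]

# Diagnostic: which codes appear more than once in the concept table?
duplicated = con.execute("""
    SELECT concept_code, COUNT(*) FROM concept
    GROUP BY concept_code HAVING COUNT(*) > 1
""").fetchall()

print(unfiltered, filtered, duplicated)
```

Comparing the joined row count with the row count of the base table, as done here, is a quick sanity check that could be run after each join when extending the main query.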
