## Notes on variables

Summary of cleaning:

* `person_id`: self explanatory
* `tbl_autism_amalgamated_ptl_oct2022_start_date`: random column added for fdm spec - dropping as FDM builder handles this
* `tbl_autism_amalgamated_ptl_oct2022_end_date`: random column added for fdm spec - dropping as FDM builder handles this
* `date`: every entry is "31/10/2022". Drop.
* `trust`: BDCT is coded as "BDCT" and "bDCT" (the repeated "BDCT" in the value counts presumably differs only in whitespace) - normalise all these
* `department`: One entry coded as "cAMHS" - change to "CAMHS"
* `gender_assigned_at_birth`: Lots of variations of M/F/Trans - normalise these.
* `year_of_birth`: Seems to *mostly* agree with demographics. Combine with `month_of_birth` to form a `dob` variable, then drop.
* `month_of_birth`: As above. Drop
* `age_will_update_automatically`: Not necessary - can calculate as required. Drop.
* `pathway_allocated`: looks weird and already advised to ignore. Drop.
* `asd_adhd`: describes ASD/ADHD assessments (?) in one string variable e.g. "ASD"/"ASD & ADHD" etc. Convert to two boolean variables, `asd_assessment` and `adhd_assessment`, then drop.
* `lac`: Messy combo of Ys and Ns. Cleaned and converted to a 1.0/0.0 flag (stored as float so missing values stay NaN).
* `ehcp`: similar to above. cleaned.
* `started_assessment`: Advised to ignore. Drop.
* `started_october_2022`: Advised to ignore. Drop.
* `qb_dd_mm_yyyy`: 4 random entries that can't be converted - cleaning as with other date variables
* `cognitive_assessment`: 3 random non-date entries. Nothing too concerning
* `referral_date`: Just dates - convert to datetime.
* `breach_date_13_weeks_to_first_contact`: Just dates. Convert to datetime.
* `weeks_waiting_until_case_closed_will_update_automatically`: Numeric, but calculated from other columns so not needed. Drop.
* `date_closed_dd_mm_yyyy_when_closed_from_neuro`: Mostly dates, a couple of corrections and a few duplicates. Seems mostly fine so correct as usual. 
* `gp`: GP names as strings - possibly data risk. Check.
* `ethnicity`: odd codes that aren't intuitive - A, C, F, H, J, NSP (maybe not specified). Seems to be a decent amount of disagreement with demographics ethnicity even considering only legible codings. No changes. Leave for now.
* `ccg`: Lots of potential overlap i.e. "Bradford"/"Bradford District"/"Bradford City" - looks like it could do with cleaning. Not sure how to aggregate Bradford CCGs and not sure if this info is useful. Check.
* `first_contact_autism_assessment_dd_mm_yyyy`: A couple of entries with multiple dates e.g. "26/08/2020 + 25/11/2020". Check if multi-date entries should be treated differently
* `school_info_dd_mm_yyyy`: A handful of non-date entries, meaning unclear: "Transfer of Care"/"Outstanding"/"Not in Education"/"Home Schooled"/"In Progress"/"Needed"/"Not in school"/"Requested". Check if non-dates should be considered as non-NA
* `school_obs_dd_mm_yyyy`: A few text entries that say "YES" or other odd strings and quite a few multi-date entries - cleaning as with other date variables. Check about "YES".
* `ados_dd_mm_yyyy`: A few text entries that say "School" or other odd strings and a few multi-date entries. Check about random entries.
* `salt_dd_mm_yyyy`: Quite a few text entries that say "x" and some other odd strings, loads of multi-dates. Check about "x" and random entries.
* `mdt_dd_mm_yyyy`: Huge number of "Clinical Partners" entries (about 1/3) and significant number of "Helios". Treating as other date variables for the moment but may need to capture these entries. Check.
* `feedback_given_dd_mm_yyyy`: Loads of entries with multiple dates. A few "Helios" and "Clinical Partners" entries. Treated as usual at the moment. Query.
* `jac_report_circulated_dd_mm_yyyy`: Lots of notes like "closed without decision"/"D/Cd ...". Treat as normal now. Query.
* `referral_source`: Loads of categories, looks like many could be cleaned into groups e.g. ['Paediatrician', 'Paediatrics', 'Community Paediatrics', 'Community paediatrics', 'Comm Paeds', 'Paediatricians', 'Paeditrician', 'Community Paediatric', 'Comm Paeds/GP', 'paediatrics', 'Community Paeds', 'community paeds', 'Paeditircian', 'Fast-track due to reccomendation from paeds for ADHD', 'Community Paed', 'S Bowring Paediatrics', 'Paeds', 'Paeditrician ', 'Dr Bowring - Paediatrics', 'Paedrician']. Have a proper go at cleaning and query.
* `diagnosis`: Seems to be a combination of ASD, ADHD, "Disch" (presumably discharged) and "No Diagnosis" but lots of categories. For the moment will create `asd_diagnosis` and `adhd_diagnosis` variables with obs that obviously denote one of the two. Entries for "diagnosis agreed" can probably be cross-referenced with assessment type to mark diagnosis type as ASD/ADHD. Possibly also "Disch-Complete". Check.
* `outsourced_yn`: Mostly No. Lots of categories otherwise, could maybe be grouped. Query.

In [None]:
import re
import pandas as pd
import numpy as np
from datetime import datetime, date, timedelta

In [None]:
# from FDMBuilder.FDM_helpers import clear_dataset
# clear_dataset("CB_FDM_ASD_PTL")

In [None]:
project_id = "yhcr-prd-phm-bia-core"
dd_loc = "CB_STAGING_DATABASE_WAREHOUSE_FDM_Format.tbl_autism_amalgamated_ptl_oct2022"
sql = f"SELECT * FROM `{project_id}.{dd_loc}` dd "
asd_data = pd.read_gbq(sql)

In [None]:
asd_data.head()

In [None]:
asd_data.info()

# Cleaning

## tbl_autism_amalgamated_ptl_oct2022_start/end_date

Columns added to meet the FDM spec - drop them, as FDMBuilder (below) handles start/end dates

In [None]:
asd_data.drop(["tbl_autism_amalgamated_ptl_oct2022_start_date",
               "tbl_autism_amalgamated_ptl_oct2022_end_date"], 
              inplace=True, 
              axis=1)

## Date

Drop "Date" as every entry is "31/10/2022"

In [None]:
asd_data.date.value_counts()

In [None]:
asd_data.drop("date", inplace=True, axis=1)

## Trust

BDCT is coded as "BDCT" and "bDCT" (the repeated "BDCT" in the value counts presumably differs only in whitespace) - normalise all these

In [None]:
asd_data.trust.value_counts()

In [None]:
asd_data["trust"] = asd_data.trust.apply(
    lambda x: "BDCT" if type(x) == str and x.strip().lower() == "bdct" else x
)

In [None]:
asd_data.trust.value_counts()

## Department

One entry coded as "cAMHS" - change to "CAMHS"

In [None]:
asd_data.department.value_counts()

In [None]:
asd_data["department"] = asd_data.department.apply(
    lambda x: "CAMHS" if type(x) == str and x.strip().lower() == "camhs" else x
)

In [None]:
asd_data.department.value_counts()

## Gender_assigned_at_birth

Lots of variations of M/F/Trans - normalise these.

In [None]:
asd_data.gender_assigned_at_birth.value_counts()

In [None]:
normalised_genders = {
    'Male': "M",  'M': "M",  'Male ': "M",  'male': "M",  ' Male': "M", 
    'F': "F",  'Female': "F",  'female': "F", 
    'Transgender': "Transgender", 'Female but identifies as male': "Transgender", 
    'Indeterminate': "Indeterminate"
}

asd_data["gender"] = asd_data.gender_assigned_at_birth.apply(
    lambda x: normalised_genders[x] if type(x) == str else x
)

In [None]:
asd_data.gender.value_counts()
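
The lookup above needs one dictionary key per spelling and raises a `KeyError` on any variant that isn't listed. A minimal sketch of an alternative, assuming it's acceptable to ignore case and surrounding whitespace before the lookup and to pass unrecognised values through for review. `canonical_genders` and `normalise_gender` are hypothetical names, not part of the notebook:

```python
# Keys are lower-cased with whitespace collapsed, so one key covers
# "Male", "male", " Male", "Male " etc.
canonical_genders = {
    "m": "M", "male": "M",
    "f": "F", "female": "F",
    "transgender": "Transgender",
    "female but identifies as male": "Transgender",
    "indeterminate": "Indeterminate",
}

def normalise_gender(value):
    """Map a raw entry to its normalised code; pass through anything unrecognised."""
    if not isinstance(value, str):
        return value  # keep None/NaN as they are
    key = " ".join(value.split()).lower()  # collapse whitespace, ignore case
    return canonical_genders.get(key, value)

examples = ["Male ", " male", "F", "Female but identifies as male", None, "Unknown"]
print([normalise_gender(v) for v in examples])
```

Anything that comes back unchanged (like "Unknown" here) still shows up in `value_counts()`, so new spellings get spotted rather than crashing the cell.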

## Year_of_birth/Month_of_birth/Age_will_update_automatically

Can drop `age_will_update_automatically` - age can be calculated as required.

Birth years/months mostly seem to agree with demographics and have fewer missing entries, so will keep for the moment - will convert to a datetime `dob` to make life easier

In [None]:
demo_loc = "CB_STAGING_DATABASE.src_DemoGraphics_MASTER"
sql = ("SELECT person_id, dob_formatted AS demo_dob "
       f"FROM `{project_id}.{demo_loc}` demo "
       f"WHERE EXISTS(SELECT person_id FROM `{project_id}.{dd_loc}` dd WHERE dd.person_id = demo.person_id)"
      )
test_dobs = pd.read_gbq(sql)
test_dobs["demo_dob"] = test_dobs.demo_dob.astype("datetime64[ns]")
test_dobs = test_dobs.merge(asd_data[["person_id", "year_of_birth", "month_of_birth"]], 
                            on="person_id")
test_dobs["year"] = (test_dobs.year_of_birth
 .apply(lambda x: re.sub("[^0-9]+", "", x) if type(x) == str else x)
 .replace("", np.nan)  # np.nan, not None: pandas < 1.4 treats value=None as a forward-fill
 .astype(float)
 .apply(lambda x: None if x < 1980 or x > 2023 else x)
)
test_dobs["month"] = (test_dobs.month_of_birth
 .apply(lambda x: re.sub("[^0-9]+", "", x) if type(x) == str else x)
 .replace("", np.nan)
 .astype(float)
 .apply(lambda x: None if x < 1 or x > 12 else x)
)
def convert_date(x):
    if np.isnan(x.year) or np.isnan(x.month):
        output = None
    else:
        output = date(int(x.year), int(x.month), 15)
    return output
test_dobs["dd_date"] = test_dobs.apply(convert_date, axis=1).astype("datetime64[ns]")
test_dobs["days_diff"] = (abs(test_dobs.dd_date - test_dobs.demo_dob))
diff_dobs = sum(test_dobs.days_diff > timedelta(days=0))
diff_dobs_100 = sum(test_dobs.days_diff > timedelta(days=100)) 
diff_dobs_year = sum(test_dobs.days_diff > timedelta(days=365))
print(f"""
    No of entries where d.o.bs differ = {diff_dobs} ({np.round(diff_dobs / asd_data.person_id.nunique() * 100, 2)}%)
    No of entries where d.o.bs differ by 100 days or more = {diff_dobs_100} ({np.round(diff_dobs_100 / asd_data.person_id.nunique() * 100, 2)}%)
    No of entries where d.o.bs differ by one year or more = {diff_dobs_year} ({np.round(diff_dobs_year / asd_data.person_id.nunique() * 100, 2)}%)
""")

In [None]:
asd_data["year_ob"] = (asd_data.year_of_birth
 .apply(lambda x: re.sub("[^0-9]+", "", x) if type(x) == str else x)
 .replace("", np.nan)  # np.nan, not None: pandas < 1.4 treats value=None as a forward-fill
 .astype(float)
 .apply(lambda x: None if x < 1980 or x > 2023 else x)
)
asd_data["month_ob"] = (asd_data.month_of_birth
 .apply(lambda x: re.sub("[^0-9]+", "", x) if type(x) == str else x)
 .replace("", np.nan)
 .astype(float)
 .apply(lambda x: None if x < 1 or x > 12 else x)
)
def convert_date(x):
    if np.isnan(x.year_ob) or np.isnan(x.month_ob):
        output = None
    else:
        output = date(int(x.year_ob), int(x.month_ob), 15)
    return output
asd_data["dob"] = asd_data.apply(convert_date, axis=1).astype("datetime64[ns]")
dob_drop_cols = ["year_of_birth", "month_of_birth", "year_ob", "month_ob", 
                 "age_will_update_automatically"]
asd_data.drop(dob_drop_cols, inplace=True, axis=1)
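
The year/month cleaning is written out twice above (once for the comparison with demographics, once for the final `dob` column). A sketch of one reusable helper that does the same job, assuming both columns share an index and that the 1980-2023 year window used above is the intended plausibility check. `digits_to_number` and `mid_month_dob` are hypothetical names:

```python
import re
import pandas as pd

def digits_to_number(series):
    """Strip non-digit characters from string entries, then coerce to numbers (NaN if nothing is left)."""
    cleaned = series.map(lambda x: re.sub(r"\D+", "", x) if isinstance(x, str) else x)
    return pd.to_numeric(cleaned, errors="coerce")

def mid_month_dob(year, month, min_year=1980, max_year=2023):
    """Build a date of birth on the 15th of the month; NaT where year or month is unusable."""
    year = digits_to_number(year)
    month = digits_to_number(month)
    # Blank out implausible values instead of raising
    year = year.where(year.between(min_year, max_year))
    month = month.where(month.between(1, 12))
    valid = year.notna() & month.notna()
    parts = pd.DataFrame({
        "year": year[valid].astype(int),
        "month": month[valid].astype(int),
        "day": 15,
    })
    # Only valid rows are assembled; reindex puts NaT back for the rest
    return pd.to_datetime(parts).reindex(year.index)

toy_years = pd.Series(["2010", "c.2012", None, "1900"])
toy_months = pd.Series(["3", "11", "5", "6"])
print(mid_month_dob(toy_years, toy_months))
```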

## Ethnicity

odd codes that aren't intuitive - A, C, F, H, J, NSP (maybe not specified). Seems to be a decent amount of disagreement with demographics ethnicity even considering only legible codings. Maybe needs checking. No changes.

In [None]:
ethnic_group_regex = "REGEXP_EXTRACT(demo.census_ethnicity, r'^(.+?):')"
ethnic_group = f"""
    CASE
        WHEN {ethnic_group_regex} IS NOT NULL THEN {ethnic_group_regex}
        ELSE "Unknown"
    END AS ethnic_group
"""
sql = (f"SELECT person_id, {ethnic_group} "
       f"FROM `{project_id}.{demo_loc}` demo "
       f"WHERE EXISTS(SELECT person_id FROM `{project_id}.{dd_loc}` dd WHERE dd.person_id = demo.person_id)"
      )
test_eths = pd.read_gbq(sql)
test_eths = test_eths.merge(asd_data[["person_id", "ethnicity"]], 
                            on="person_id")

In [None]:
test_eths.groupby(["ethnic_group", "ethnicity"]).agg("count").sort_values("person_id", ascending=False).head(20)

## GP

Too varied to do anything with - probs not useful anyways. Leave for now

In [None]:
asd_data.gp.value_counts()

## CCG

Lots of potential overlap i.e. "Bradford"/"Bradford District"/"Bradford City" - looks like it could do with cleaning. Not sure how to aggregate Bradford CCGs and not sure if this info is useful, so leave for now

In [None]:
asd_data.ccg.value_counts()

## Pathway_allocated

Looks v odd - advised to ignore so will drop.

In [None]:
asd_data.pathway_allocated.value_counts()

In [None]:
asd_data.drop("pathway_allocated", axis=1, inplace=True)

## ASD_ADHD

Codes for ASD and ADHD - separate into two binary variables

In [None]:
asd_data.asd_adhd.value_counts()

In [None]:
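def parse_asd_adhd_strings(string):
    # Missing values and explicit negatives (anything containing "no ") count as neither
    if not isinstance(string, str) or "no " in string.lower():
        return (False, False)
    # Split into words so that "asd"/"asc"/"adhd" only match as whole tokens
    string_list = re.split(r"\W+", string.lower())
    contains_asd = "asd" in string_list or "asc" in string_list
    contains_adhd = "adhd" in string_list
    return contains_asd, contains_adhd

# Pass the index explicitly so the new columns line up with asd_data's rows
asd_data[["asd_assessment", "adhd_assessment"]] = pd.DataFrame(
    asd_data.asd_adhd.apply(parse_asd_adhd_strings).to_list(),
    index=asd_data.index,
)
asd_data.drop("asd_adhd", axis=1, inplace=True)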

## LAC

Messy combo of Ys and Ns. Cleaned and converted to a 1.0/0.0 flag (stored as float so missing values stay NaN).

In [None]:
asd_data.lac.value_counts()

In [None]:
corrected_lac = {
    'No': False, 'Yes': True, 'YES': True, 'y': True, 'No ': False, 'Y': True, 
    'LAC': True
}
asd_data["lac"] = (asd_data
                   .lac
                   .apply(lambda x: corrected_lac[x] if not x is None else None)
                   .astype("float"))

In [None]:
asd_data.lac.value_counts()

## EHCP

similar to above. Clean

In [None]:
asd_data.ehcp.value_counts()

In [None]:
asd_data["ehcp"] = (asd_data
                   .ehcp
                   .apply(lambda x: corrected_lac[x] if not x is None else None)
                   .astype("float"))

In [None]:
asd_data.ehcp.value_counts()
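
Both `lac` and `ehcp` end up as floats above because a plain bool column can't hold missing values. A sketch of an alternative using pandas' nullable `boolean` dtype, assuming unrecognised entries should become missing rather than raise a `KeyError`. `parse_flag` and the token sets are hypothetical; the tokens are taken from `corrected_lac`, plus a bare "n", which is an assumption:

```python
import pandas as pd

TRUE_TOKENS = {"y", "yes", "lac"}
FALSE_TOKENS = {"n", "no"}  # "n" is assumed; it isn't in corrected_lac

def parse_flag(value):
    """Return True/False for recognised yes/no entries, pd.NA for anything else."""
    if not isinstance(value, str):
        return pd.NA
    token = value.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return pd.NA

raw = pd.Series(["No", "Yes", "YES", "y", "No ", "LAC", None, "maybe"])
flags = raw.map(parse_flag).astype("boolean")
print(flags)
```

Whether the downstream FDM build is happy with a nullable boolean column hasn't been checked here, so treat the float version above as the safe default.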

## Started_assessment

Advised to ignore so will drop

In [None]:
asd_data.drop("started_assessment", axis=1, inplace=True)

## Started_October_2022

Same as above

In [None]:
asd_data.drop("started_october_2022", axis=1, inplace=True)

## Helper functions

In [None]:
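def no_nas(string):
    """Return None for entries that are just "NA" (e.g. "N/A", "n.a."); pass everything else through."""
    if type(string) != str:
        return string
    elif re.sub("[^a-z]+", "", string.lower()) == "na":
        return None
    else:
        return string
    
def convert_date_string(date_string, return_non_dates=False):
    """Parse a d/m/y (or m/y) string.

    return_non_dates=False: return the date as "YYYY/MM/DD", or None if it can't be parsed.
    return_non_dates=True: return the original string only if it is NOT a clean single
    date (text, multiple dates, unparseable), otherwise None.
    """
    date_string = no_nas(date_string)
    if date_string is None:
        return None
    date_list = re.split(r"[^\w]+", date_string)
    if len(date_list) == 1:
        return date_string if return_non_dates else None
    elif len(date_list) > 3:
        if return_non_dates:
            return date_string 
        else:
            # multiple dates or trailing text: keep the first date only
            day, month, year = date_list[:3]
    elif len(date_list) == 2:
        # month and year only: assume mid-month
        day = "15"
        month, year = date_list
    else:
        day, month, year = date_list
    m_format = "%b" if len(month) == 3 else "%m"
    y_format = "%y" if len(year) == 2 else "%Y"
    datetime_string = f"{day} {month} {year}"
    try:
        date_obj = datetime.strptime(datetime_string, f'%d {m_format} {y_format}')
        return date_obj.strftime('%Y/%m/%d') if not return_non_dates else None
    except ValueError:
        return date_string if return_non_dates else None
    
def print_problem_dates(date_series):
    """Print every entry that can't be parsed as a clean single date."""
    for entry in date_series:
        problem = convert_date_string(entry, return_non_dates=True)
        if problem is not None:
            print(problem)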

## first_contact_autism_assessment_dd_mm_yyyy

A couple of entries with multiple dates e.g. "26/08/2020 + 25/11/2020" - for the moment cleaning dates and adjusting multiple date entries to first date. Check if multi-date entries should be treated differently

Remove "dd_mm_yyyy" from the column name - it's verbose and no longer accurate once the column is converted to datetime

In [None]:
asd_data.first_contact_autism_assessment_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
# manually convert a few dates:

fca_corrections = {
    "21/12/20220": "21/12/2022",
    "27/108/2021": "27/08/2021",
    "10/08/22`": "10/08/22",
}
asd_data["first_contact_autism_assessment"] = (
    asd_data.
    first_contact_autism_assessment_dd_mm_yyyy.
    apply(lambda x: fca_corrections[x] if x in fca_corrections.keys() else x)
)
asd_data.drop("first_contact_autism_assessment_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

In [None]:
asd_data["first_contact_autism_assessment_notes"] = (
    asd_data.
    first_contact_autism_assessment. 
    apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["first_contact_autism_assessment"] = (
    asd_data.
    first_contact_autism_assessment.
    apply(convert_date_string).
    astype("datetime64[ns]")
)

In [None]:
asd_data.first_contact_autism_assessment.min()

In [None]:
asd_data.first_contact_autism_assessment.max()
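
The cleaning above keeps only the first date of a multi-date entry. If the answer to the query is that later dates matter too, something like the following could pull out every date in an entry. This is a sketch only, assuming dates are always written as d/m/yy or d/m/yyyy with slashes. `extract_all_dates` is a hypothetical helper and is not used elsewhere in the notebook:

```python
import re
from datetime import datetime

# dd/mm/yy or dd/mm/yyyy; the word boundaries stop "20220" being read as "2022"
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\b")

def extract_all_dates(entry):
    """Return every valid date found in a free-text entry, in order of appearance."""
    if not isinstance(entry, str):
        return []
    found = []
    for day, month, year in DATE_PATTERN.findall(entry):
        fmt = "%d/%m/%y" if len(year) == 2 else "%d/%m/%Y"
        try:
            found.append(datetime.strptime(f"{day}/{month}/{year}", fmt).date())
        except ValueError:
            continue  # looks like a date but isn't one, e.g. 31/02/2021
    return found

print(extract_all_dates("26/08/2020 + 25/11/2020"))
print(extract_all_dates("Outstanding"))
```

Malformed entries such as "21/12/20220" return an empty list here, so they would still need the manual corrections dictionary first.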

## school_info_dd_mm_yyyy

A handful of non-date entries, meaning unclear: "Transfer of Care"/"Outstanding"/"Not in Education"/"Home Schooled"/"In Progress"/"Needed"/"Not in school"/"Requested" - for the moment cleaning dates and recording non-dates as NAs. Check if non-dates should be considered as non-NA

As before remove "dd_mm_yyyy"

In [None]:
asd_data.school_info_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
# manually convert a few dates:

asd_data["school_info"] = (
    asd_data
    .school_info_dd_mm_yyyy
    .apply(lambda x: "17/09/2021" if x == "17/09/2021 (in Helios Report)" else x)
)
asd_data.drop("school_info_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

In [None]:
asd_data["school_info_notes"] = (
    asd_data
    .school_info
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["school_info"] = (
    asd_data
    .school_info
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

## school_obs_dd_mm_yyyy

A few text entries that say "YES" or other odd strings and quite a few multi-date entries - cleaning as with other date variables. Check about "YES".

In [None]:
asd_data.school_obs_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
asd_data["school_obs"] = (
    asd_data
    .school_obs_dd_mm_yyyy
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

asd_data["school_obs_notes"] = (
    asd_data
    .school_obs_dd_mm_yyyy
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)

asd_data.drop("school_obs_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## ados_dd_mm_yyyy

A few text entries that say "School" or other odd strings and a few multi-date entries - cleaning as with other date variables. Check about random entries.

In [None]:
asd_data.ados_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
asd_data["ados"] = (
    asd_data
    .ados_dd_mm_yyyy
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

asd_data["ados_notes"] = (
    asd_data
    .ados_dd_mm_yyyy
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data.drop("ados_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## salt_dd_mm_yyyy

Quite a few text entries that say "x" and some other odd strings, loads of multi-dates - cleaning as with other date variables. Check about "x" and random entries.

In [None]:
asd_data.salt_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
salt_corrections = {
    "25/01/022": "25/01/22",
    "June 2020 - TC": "15/06/2020",
    "07/12/20121": "07/12/2021",
}
asd_data["salt"] = asd_data.salt_dd_mm_yyyy.apply(
    lambda x: salt_corrections[x] if x in salt_corrections.keys() else x
)

In [None]:
asd_data["salt_notes"] = (
    asd_data
    .salt  # use the corrected column, otherwise the manual corrections are thrown away
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["salt"] = (
    asd_data
    .salt
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

asd_data.drop("salt_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## qb_dd_mm_yyyy

4 random entries that can't be converted - cleaning as with other date variables

In [None]:
asd_data.qb_dd_mm_yyyy.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
asd_data["qb"] = asd_data.qb_dd_mm_yyyy.apply(
    lambda x: "09/02/2022" if x == "09/02/2022 - 4pm" else x
)

In [None]:
asd_data["qb_notes"] = (
    asd_data
    .qb  # use the corrected column, otherwise the manual correction is thrown away
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["qb"] = (
    asd_data
    .qb
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

asd_data.drop("qb_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## cognitive_assessment

3 random non-date entries. Nothing too concerning

In [None]:
asd_data.cognitive_assessment.apply(
    lambda x: convert_date_string(x, return_non_dates=True)
).value_counts()

In [None]:
asd_data["cognitive_assessment_date"] = (
    asd_data
    .cognitive_assessment
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

asd_data["cognitive_assessment_date_notes"] = (
    asd_data
    .cognitive_assessment
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)

asd_data.drop("cognitive_assessment",
              axis=1,
              inplace=True)

## mdt_dd_mm_yyyy

Huge number of "Clinical Partners" entries (about 1/3) and significant number of "Helios". Treating as other date variables for the moment but may need to capture these entries. Check.

In [None]:
(asd_data
 .mdt_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
(asd_data
 .mdt_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .apply(lambda x: x if type(x) == str and len(x) < 14 else None)
 .value_counts())

In [None]:
mdt_corrections = {
    "00/09/2020": "15/09/2020",
    "27/108/2021": "27/08/2021"
}
asd_data["mdt"] = asd_data.mdt_dd_mm_yyyy.apply(
    lambda x: mdt_corrections[x] if x in mdt_corrections.keys() else x
)

In [None]:
asd_data["mdt_notes"] = (
    asd_data
    .mdt
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["mdt"] = (
    asd_data
    .mdt
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)
asd_data.drop("mdt_dd_mm_yyyy", 
              axis=1, 
              inplace=True)
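
If the "Clinical Partners"/"Helios" entries do need capturing, one option is a separate provider column alongside the date and notes columns. A sketch, assuming "Helios" in this column is a misspelling of the provider listed as "Healios" in the outsourced notes further down (that is a guess and needs confirming). `PROVIDER_PATTERNS` and `extract_provider` are hypothetical names:

```python
import re

PROVIDER_PATTERNS = {
    "Clinical Partners": re.compile(r"clinical\s+partners", re.IGNORECASE),
    # Matches both "Helios" and "Healios" (assumed to be the same provider)
    "Healios": re.compile(r"hea?lios", re.IGNORECASE),
}

def extract_provider(entry):
    """Return the provider named in an entry, or None if no known provider is mentioned."""
    if not isinstance(entry, str):
        return None
    for provider, pattern in PROVIDER_PATTERNS.items():
        if pattern.search(entry):
            return provider
    return None

for example in ["Clinical Partners", "17/09/2021 (in Helios Report)", "12/03/2021", None]:
    print(repr(example), "->", extract_provider(example))
```

This would have to run on the raw string column before it is converted to datetime, e.g. `asd_data.mdt_dd_mm_yyyy.apply(extract_provider)`.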

## feedback_given_dd_mm_yyyy

Loads of entries with multiple dates. A few "Helios" and "Clinical Partners" entries. Treated as usual at the moment. Query.

In [None]:
(asd_data
 .feedback_given_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
(asd_data
 .feedback_given_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .apply(lambda x: x if type(x) == str and len(x) < 14 else None)
 .value_counts())

In [None]:
feedback_corrections = {
    "00/09/2020": "15/09/2020",
    "006/04/2022": "06/04/2022",
    "02/08/25021": "02/08/2021",
    "01/11/222": "01/11/2022",
    "17/0821": "17/08/21"
    
}
asd_data["feedback_given"] = asd_data.feedback_given_dd_mm_yyyy.apply(
    lambda x: feedback_corrections[x] if x in feedback_corrections.keys() else x
)

In [None]:
# notes first, from the corrected strings, before the column is overwritten with datetimes
asd_data["feedback_given_notes"] = (
    asd_data
    .feedback_given
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)

asd_data["feedback_given"] = (
    asd_data
    .feedback_given
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)
    
asd_data.drop("feedback_given_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## jac_report_circulated_dd_mm_yyyy

Lots of notes like "closed without decision"/"D/Cd ...". Treat as normal now. Query.

In [None]:
(asd_data
 .jac_report_circulated_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
(asd_data
 .jac_report_circulated_dd_mm_yyyy
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .apply(lambda x: x if type(x) == str and len(x) < 14 else None)
 .value_counts())

In [None]:
jac_corrections = {
    "06/04/22 ADHD": "06/04/22",
    "010/09/22": "10/09/22",
    "90/06/2022": "09/06/2022",
    "23/05/222": "23/05/2022",
    "23/4/22/": "23/04/22",
    "27/09/20222": "27/09/2022",
    "00/09/2020": "15/09/2020",
    "0703/2021": "07/03/2021",
    "03/011/2022": "03/11/2022",
    "`01/09/21": "01/09/21"
}
asd_data["jac_report_circulated"] = asd_data.jac_report_circulated_dd_mm_yyyy.apply(
    lambda x: jac_corrections[x] if x in jac_corrections.keys() else x
)

In [None]:
asd_data["jac_report_circulated_notes"] = (
    asd_data
    .jac_report_circulated
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["jac_report_circulated"] = (
    asd_data
    .jac_report_circulated
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)
asd_data.drop("jac_report_circulated_dd_mm_yyyy", 
              axis=1, 
              inplace=True)

## referral_date

Just dates - convert to datetime.


In [None]:
(asd_data
 .referral_date
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
asd_data["referral_date"] = (
    asd_data
    .referral_date
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

## referral_source

Loads of categories, looks like many could be cleaned into groups e.g. ['Paediatrician', 'Paediatrics', 'Community Paediatrics', 'Community paediatrics', 'Comm Paeds', 'Paediatricians', 'Paeditrician', 'Community Paediatric', 'Comm Paeds/GP', 'paediatrics', 'Community Paeds', 'community paeds', 'Paeditircian', 'Fast-track due to reccomendation from paeds for ADHD', 'Community Paed', 'S Bowring Paediatrics', 'Paeds', 'Paeditrician ', 'Dr Bowring - Paediatrics', 'Paedrician']. Have a proper go at cleaning and query.


In [None]:
asd_data.referral_source.value_counts()[:20]
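
A first pass at the grouping could be rule-based. The sketch below only covers the paediatrics variants listed above, all of which contain "paed" once case is ignored; every other group would need its own rule once the categories are agreed. `REFERRAL_GROUPS` and `group_referral_source` are hypothetical names:

```python
import re

# (group name, pattern) pairs, checked in order; only paediatrics is defined so far
REFERRAL_GROUPS = [
    ("Paediatrics", re.compile(r"paed", re.IGNORECASE)),
]

def group_referral_source(source):
    """Map a raw referral source to a group; unmatched entries are only stripped of whitespace."""
    if not isinstance(source, str):
        return source
    for group, pattern in REFERRAL_GROUPS:
        if pattern.search(source):
            return group
    return source.strip()

paediatric_variants = [
    'Paediatrician', 'Paediatrics', 'Community Paediatrics', 'Comm Paeds',
    'Paeditrician', 'Comm Paeds/GP', 'Paeditircian', 'Paedrician',
    'Fast-track due to reccomendation from paeds for ADHD',
    'S Bowring Paediatrics', 'Dr Bowring - Paediatrics',
]
print({group_referral_source(v) for v in paediatric_variants})
```

Note that mixed entries such as "Comm Paeds/GP" land in "Paediatrics" under this rule, which is a judgement call to include in the query.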

## breach_date_13_weeks_to_first_contact

Just dates. Convert to datetime.

In [None]:
(asd_data
 .breach_date_13_weeks_to_first_contact
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
asd_data["breach_date_13_weeks_to_first_contact"] = (
    asd_data
    .breach_date_13_weeks_to_first_contact
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)

## weeks_waiting_until_case_closed_will_update_automatically

Calculated from other columns (it updates automatically), so not needed - drop

In [None]:
asd_data.drop("weeks_waiting_until_case_closed_will_update_automatically",
              axis=1,
              inplace=True)

## diagnosis

Seems to be a combination of ASD, ADHD, "Disch" (presumably discharged) and "No Diagnosis" but lots of categories. For the moment will create `asd_diagnosis` and `adhd_diagnosis` variables with obs that obviously denote one of the two. Entries for "diagnosis agreed" can probably be cross-referenced with assessment type to mark diagnosis type as ASD/ADHD. Possibly also "Disch-Complete". Check.

In [None]:
asd_data.diagnosis.value_counts()

In [None]:
asd_data[["asd_diagnosis", "adhd_diagnosis"]] = pd.DataFrame(
    asd_data.diagnosis.apply(parse_asd_adhd_strings).to_list()
)

In [None]:
diag_agreed = asd_data.diagnosis == 'Disch-Diagnosis agreed'
asd_data.loc[diag_agreed & asd_data.asd_assessment,"asd_diagnosis"] = True
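
The cell above only applies the "diagnosis agreed" cross-reference to ASD. If the check confirms the same logic holds for ADHD, both could be handled in one loop. A sketch on a toy frame (the rows are made up for illustration, not real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "diagnosis": ["Disch-Diagnosis agreed", "Disch-Diagnosis agreed", "No Diagnosis"],
    "asd_assessment": [True, False, True],
    "adhd_assessment": [False, True, False],
    "asd_diagnosis": [False, False, False],
    "adhd_diagnosis": [False, False, False],
})

diag_agreed = toy["diagnosis"] == "Disch-Diagnosis agreed"
for condition in ("asd", "adhd"):
    # Mark a diagnosis only where that condition was actually assessed
    assessed = toy[f"{condition}_assessment"]
    toy.loc[diag_agreed & assessed, f"{condition}_diagnosis"] = True

print(toy[["diagnosis", "asd_diagnosis", "adhd_diagnosis"]])
```

A child assessed for both conditions would be marked with both diagnoses under this rule, which may not be right and is worth raising in the same query.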

## date_closed_dd_mm_yyyy_when_closed_from_neuro

Mostly dates, a couple of corrections and a few duplicates. Seems mostly fine so correct as usual. 

In [None]:
(asd_data
 .date_closed_dd_mm_yyyy_when_closed_from_neuro
 .apply(lambda x: convert_date_string(x, return_non_dates=True))
 .value_counts())

In [None]:
# manually convert a few dates:

closed_corrections = {
    "06/0//6/22": "06/06/22",
    "20/04/022": "20/04/2022",
    "27/10/20222": "27/10/2022",
}
asd_data["date_closed_when_closed_from_neuro"] = (
    asd_data.
    date_closed_dd_mm_yyyy_when_closed_from_neuro.
    apply(lambda x: closed_corrections[x] if x in closed_corrections.keys() else x)
)

In [None]:
asd_data["date_closed_notes"] = (
    asd_data
    .date_closed_when_closed_from_neuro
    .apply(lambda x: convert_date_string(x, return_non_dates=True))
)
    
asd_data["date_closed_when_closed_from_neuro"] = (
    asd_data
    .date_closed_when_closed_from_neuro
    .apply(convert_date_string)
    .astype("datetime64[ns]")
)
asd_data.drop("date_closed_dd_mm_yyyy_when_closed_from_neuro",
              axis=1,
              inplace=True)

## outsourced_yn

Mostly No. Lots of categories otherwise, could maybe be grouped. Potential groups:

* No
* Clinical Partners
* Healios
* Outsourced
* Socrates
* No longer assessed

Query.

In [None]:
asd_data.outsourced_yn.value_counts()
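
If the proposed groups are agreed, the mapping could be done with ordered rules. A sketch, assuming the raw entries mention the provider by name and that "Helios" spellings belong with Healios. The example strings are invented, and `OUTSOURCED_RULES` and `group_outsourced` are hypothetical names:

```python
import re

# First matching rule wins, so "No longer assessed" must come before the plain "No" rule
OUTSOURCED_RULES = [
    ("No longer assessed", re.compile(r"no longer", re.IGNORECASE)),
    ("Clinical Partners", re.compile(r"clinical\s+partners", re.IGNORECASE)),
    ("Healios", re.compile(r"hea?lios", re.IGNORECASE)),
    ("Socrates", re.compile(r"socrates", re.IGNORECASE)),
    ("No", re.compile(r"^\s*no?\s*$", re.IGNORECASE)),
    ("Outsourced", re.compile(r"^\s*(y|yes|outsourced)\s*$", re.IGNORECASE)),
]

def group_outsourced(value):
    """Map a raw outsourced entry to one of the proposed groups; leave unknowns untouched."""
    if not isinstance(value, str):
        return value
    for group, pattern in OUTSOURCED_RULES:
        if pattern.search(value):
            return group
    return value  # unrecognised: leave for manual review

for example in ["No ", "No longer assessed - moved", "Yes - Clinical Partners", "Helios", "something else"]:
    print(repr(example), "->", group_outsourced(example))
```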

In [None]:
asd_data.info()

In [None]:
asd_data.to_gbq("CB_FDM_ASD_PTL.tbl_autism_amalgamated_ptl_oct2022",
                location="europe-west2",
                if_exists="replace")

# Build FDM

In [None]:
from FDMBuilder.FDMTable import *
from FDMBuilder.FDMDataset import *
from FDMBuilder.testing_helpers import *

In [None]:
asd_table = FDMTable(
    source_table_id = "CB_FDM_ASD_PTL.tbl_autism_amalgamated_ptl_oct2022",
    dataset_id="CB_FDM_ASD_PTL"
)

In [None]:
asd_table.head()

In [None]:
asd_table.quick_build(fdm_start_date_cols="referral_date",
                      fdm_start_date_format="DMY")

In [None]:
dataset = FDMDataset(
    dataset_id="CB_FDM_ASD_PTL"
)
dataset.build(extract_end_date="2023-01-01")

The build flags one entry as a problem because its DOB in demographics is "2024-01-01". That date is obviously wrong (it falls after the 2023-01-01 extract end date), so just add the flagged entry back into the table, drop the `fdm_problem` column and call it a day

In [None]:
asd_table.recombine()
asd_table.drop_column("fdm_problem")