# All Questions

## RQs

* RQ1: What are the pathways to an ASD diagnosis and support for children and young people in Bradford and how do they differ by ethnicity and SEP? 
    * RQ1.A) What is the demographic makeup of the children and young people who received an Autism diagnosis in Bradford between X and X? 
    * RQ1.B) Do referral and diagnosis rates differ across different areas of Bradford? If so, what factors might explain the variations?
    * RQ1.C) How does the pathway differ by ethnicity and SES? 
    * RQ1.D) What other factors seem to influence the pathway?

## Variables

* Demographics at time of diagnosis – age, ethnicity, gender, SEP (household income by postcode and/or maternal education level), coexisting diagnoses e.g. ADHD, LD etc. 
* Age of first record of service involvement and which service(s)
* Age at first referral to CAMHS or paediatrics and referral source
* Number of services involved between referral and diagnosis (ideally also nature of involvement/intervention e.g. child protection plan, SAL therapy input etc.)
* The above 3 factors by ethnicity and SEP and then both ethnicity and SEP combined 
* The date of first referral for an Autism diagnostic assessment for those identified with an Autism diagnosis (this should hopefully be logged by GPs if not by services, so there should be an identifiable date on the system)
* The age the identified children were when the first referral for an Autism assessment was made
* Average age of diagnosis and the demographics Leanne asked for in the email below (gender, socioeconomic status, ethnicity) AND:
* home location (postcode would be fine – just would like to see district differences)
* presence of any additional diagnosis (e.g. mental health difficulty, physical health condition, learning disability, learning difficulty etc.)
* Whether they are classed as looked after by the local authority (e.g. foster care, kinship arrangement etc.) or are currently/previously on a child in need or child protection plan
* Any past and current involvement of other services e.g. CAMHS, Speech and language, social care etc., - it would be good to look at involvement of services pre-referral, between referral and assessment, and post diagnosis
* Whether the child is currently on an EHCP and the date this was started in relation to the diagnostic assessment 

## Analysis

* For the whole cohort (and also for those currently 0-25 years old): Average age of autism diagnosis by: Gender, Ethnicity, Socio-economic status, Presence of an intellectual disability. It would also be helpful to know the interaction of the above factors in terms of their impact on average age of autism diagnosis.
* Multiple logistic regressions to explore unique effect of multiple independent variables (in this case SES, ethnicity etc.) on a single dependent/outcome variable (ASD diagnosis, age of diagnosis etc.), whilst accounting for all other explanatory variables, and examination of the interactions between these variables
* A Latent Class Analysis (LCA) to explore the relationships between the observed variables, and to identify combinations of factors that frequently occur together to affect the outcome (age at diagnosis)

# Quick Wins

* What is the demographic makeup of the children and young people who received an Autism diagnosis in Bradford between X and X? 
* Whether the child is currently on an EHCP and the date this was started in relation to the diagnostic assessment 
* Whether they are classed as looked after by the local authority (e.g. foster care, kinship arrangement etc.) or are currently/previously on a child in need or child protection plan
* For the whole cohort (and also for those currently 0-25 years old): Average age of autism diagnosis by: Gender, Ethnicity, Socio-economic status, Presence of an intellectual disability. It would also be helpful to know the interaction of the above factors in terms of their impact on average age of autism diagnosis.
* Demographics at time of diagnosis – age, ethnicity, gender, SEP (household income by postcode and/or maternal education level), coexisting diagnoses e.g. ADHD, LD etc.
* home location (postcode would be fine – just would like to see district differences)
* Multiple logistic regressions to explore unique effect of multiple independent variables (in this case SES, ethnicity etc.) on a single dependent/outcome variable (ASD diagnosis, age of diagnosis etc.), whilst accounting for all other explanatory variables, and examination of the interactions between these variables

In [None]:
import pandas as pd
import numpy as np
from tableone import TableOne
import matplotlib.pyplot as plt

In [None]:
%%bigquery asd_data
SELECT * FROM `yhcr-prd-phm-bia-core.CY_ASD_data.ASD_master_tab`

## Basic Demographic Makeup

At the moment we have data on:

* Age
* Ethnic Group
* Sex
* Children in need
* Children in care
* Child protection plans
* I've thrown exclusions data in there as a bonus

I'm working on getting some SEP stats. At the moment there isn't anything of good quality that I know of that covers the whole cohort reliably. I'm hoping we can use pseudonymised addresses to get IMD by postcode; we currently only have partial postcodes, which aren't granular enough to link to the census data.
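As a rough idea of what the IMD link could look like once full postcodes are available, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the `postcode`, `lsoa_code` and `imd_decile` columns, the lookup table, and the toy values are all hypothetical and none of them exist in `ASD_master_tab` yet.

```python
import pandas as pd

# Toy stand-in for the cohort; the `postcode` column is hypothetical.
cohort = pd.DataFrame({
    "person_id": [1, 2, 3],
    "postcode": ["BD1 1AA", "bd2 2bb", "BD9 9ZZ"],
})

# Hypothetical postcode -> LSOA -> IMD decile lookup (made-up values).
imd_lookup = pd.DataFrame({
    "postcode": ["BD11AA", "BD22BB"],
    "lsoa_code": ["LSOA_A", "LSOA_B"],
    "imd_decile": [1, 4],
})


def normalise_postcode(s: pd.Series) -> pd.Series:
    """Upper-case and strip whitespace so formatting differences don't break the join."""
    return s.str.upper().str.replace(r"\s+", "", regex=True)


cohort["postcode_key"] = normalise_postcode(cohort["postcode"])
imd_lookup["postcode_key"] = normalise_postcode(imd_lookup["postcode"])

with_imd = cohort.merge(
    imd_lookup[["postcode_key", "lsoa_code", "imd_decile"]],
    on="postcode_key",
    how="left",       # keep everyone, even if their postcode has no match
    validate="m:1",   # fail loudly if the lookup contains duplicate postcodes
    indicator=True,   # adds a `_merge` column so unmatched rows can be counted
)

# Number of cohort members we could not attach an IMD decile to.
unmatched = int((with_imd["_merge"] == "left_only").sum())
```

Counting `unmatched` up front would tell us how much of the cohort any SEP analysis could actually cover.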


In [None]:
# Columns to include in the summary table
columns = [
    'diagnosis_date', 'age', 'age_at_diagnosis', 'ethnic_group', 'sex',
    'perm_exclusion', 'fixed_term_exclusion', 'has_protection_plan',
    'in_care', 'child_in_need'
]
# Columns to report as counts and percentages rather than as continuous summaries
categorical = ["ethnic_group", "sex", "perm_exclusion", "fixed_term_exclusion",
               "has_protection_plan", "in_care", "child_in_need"]
table_1 = TableOne(asd_data, columns=columns, categorical=categorical)
table_1

The above is for the whole cohort i.e. anybody with one of the ASD diagnostic codes. I'm no expert, but the ages of some of the individuals don't make sense to me:

In [None]:
asd_data.age.hist(bins=114)

I'm leaving things as is for the moment, but I suggest we might want to set an upper age limit.
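If we do go for an upper age limit, applying it could be as simple as the sketch below. The `MAX_AGE` value is a placeholder rather than a recommendation, and the helper name `apply_age_limit` is made up for illustration.

```python
import pandas as pd

MAX_AGE = 100  # placeholder only, to be agreed with the research team


def apply_age_limit(df: pd.DataFrame, max_age: int = MAX_AGE) -> pd.DataFrame:
    """Keep rows with an age between 0 and max_age; report how many were dropped."""
    keep = df["age"].between(0, max_age)  # missing ages evaluate to False
    print(f"Dropping {(~keep).sum()} of {len(df)} rows outside 0-{max_age}")
    return df[keep].copy()


# Toy data standing in for asd_data
toy = pd.DataFrame({"age": [4, 17, 63, 113, None]})
limited = apply_age_limit(toy)
```

Printing the number of dropped rows makes it easy to see how sensitive the cohort size is to whichever limit we pick.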

One of the requested analyses was for individuals currently aged 0-25. Note that the filter below is `age < 25`, so it excludes anyone aged exactly 25:

In [None]:
table_1 = TableOne(
    asd_data[asd_data.age < 25], 
    columns, 
    categorical)
table_1

For any of the analyses above and/or below, I can easily subset the data by age (or any of the above variables) so feel free to ask for any subgroup data of interest.

## Age at Diagnosis

We still don't have a better source for age at diagnosis than estimating it from the first date an ASD SNOMED code enters the individual's primary care record. This might not be the best methodology, but it's all we have for now.
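For clarity, the estimate amounts to something like the sketch below: take the earliest ASD-coded record per person and subtract their date of birth. The table layout and the `person_id`, `code_date` and `date_of_birth` column names are assumptions for illustration and may not match how the master table was actually built.

```python
import pandas as pd

# Toy primary care records: one row per ASD SNOMED code entry (hypothetical layout).
records = pd.DataFrame({
    "person_id": [1, 1, 2],
    "code_date": pd.to_datetime(["2015-06-01", "2012-03-15", "2020-01-10"]),
})
people = pd.DataFrame({
    "person_id": [1, 2],
    "date_of_birth": pd.to_datetime(["2008-03-15", "2001-07-01"]),
})

# Earliest ASD code per person is treated as the diagnosis date.
first_code = (records
              .groupby("person_id", as_index=False)["code_date"].min()
              .rename(columns={"code_date": "diagnosis_date"}))

est = people.merge(first_code, on="person_id", how="left")

# Age in years at the first code; 365.25 approximates leap years.
est["age_at_diagnosis"] = (
    (est["diagnosis_date"] - est["date_of_birth"]).dt.days / 365.25
)
```

One consequence of this approach is that a code entered late (for example when historic notes are summarised into the record) would overstate the age at diagnosis.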

The following are some very ugly graphs with average age at diagnosis broken down by the different demographics - I can make these a lot more visually appealing in the future but we're having some issues with our virtual environments at the moment, so they'll have to do for now. 

### By Gender - Whole Cohort:

In [None]:
(asd_data
 .groupby("sex")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Gender < 25 years old:

In [None]:
(asd_data[asd_data.age < 25]
 .groupby("sex")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Ethnic Group - Whole Cohort:

In [None]:
(asd_data
 .groupby("ethnic_group")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Ethnic Group < 25 years old:

In [None]:
(asd_data[asd_data.age < 25]
 .groupby("ethnic_group")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))
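The request also mentioned the interaction between these factors. A simple first look, before any modelling, could be a cross-tabulated summary like the sketch below. It runs on toy data with made-up group labels; with the real data the same calls would be applied to `asd_data`. Including the group counts matters because a mean based on a handful of individuals can be misleading.

```python
import pandas as pd

# Toy data standing in for asd_data (group labels are made up)
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M", "F"],
    "ethnic_group": ["A", "B", "A", "B", "A", "A"],
    "age_at_diagnosis": [10.0, 12.0, 6.0, 9.0, 8.0, 14.0],
})

# Count, mean and median age at diagnosis for every ethnic group x sex combination
summary = (toy
           .groupby(["ethnic_group", "sex"])["age_at_diagnosis"]
           .agg(["count", "mean", "median"]))

# Reshape the means so rows are ethnic groups and columns are sexes
mean_table = summary["mean"].unstack("sex")
```

`mean_table.plot(kind="bar")` would then give a grouped bar chart in the same style as the plots above.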

## Geographic Distribution (within Bradford)

The following is a ward-level breakdown and choropleth map of the address data for each individual in the cohort. It looks a little odd given the most densely populated areas (to my knowledge) aren't host to many ASD diagnoses, but I'll leave the interpretation to folk with more first-hand knowledge of the problem. Should definitely be looked into further though:

In [None]:
from google.cloud import bigquery
import contextily as cx
import geopandas

sql = """
    SELECT *
    FROM `yhcr-prd-phm-bia-core.CY_LOOKUPS.tbl_ward_boundaries`
"""
ward_gdf = bigquery.Client().query(sql).to_geodataframe()

# Per ward: count individuals and flag wards where any LSOA name mentions Bradford
ward_counts = (asd_data
               .groupby("ward_code")["lsoa_name"]
               .agg(n="count",
                    contains_bradford=lambda x: x.str.contains("Bradford", na=False).any())
               .reset_index())
# Merging from the GeoDataFrame side keeps the geometry column and CRS intact
# (assumes ward_code is the shared key, as in the original implicit merge)
ward_counts = ward_gdf.merge(ward_counts, on="ward_code")

### Top 20 Wards:

In [None]:
ward_counts[["ward_name", "n"]].sort_values("n", ascending=False).head(20)

### Choropleth Map of residence:

In [None]:
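# Web Mercator (EPSG:3857) is the projection the basemap tiles use
ward_counts = ward_counts.to_crs(epsg=3857)
ax = ward_counts[ward_counts.contains_bradford].plot(
    column="n",
    alpha=0.5,
    edgecolor="k",
    linewidth=2,
    cmap="OrRd",
    legend=True,
    figsize=(20, 20))
# Stamen tiles are no longer available through contextily's providers in recent
# releases; CartoDB.Positron is a similar light basemap used here as a substitute.
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)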
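One possible reason the map looks odd is that it shows raw counts, which aren't adjusted for how many children and young people live in each ward. A rate per 1,000 residents might be easier to interpret. The sketch below shows the idea on toy numbers; the `population_0_25` column and the ward population table are hypothetical, and we would need to source real ward-level population estimates.

```python
import pandas as pd

# Toy stand-ins: ward-level diagnosis counts and a hypothetical population table
ward_counts_toy = pd.DataFrame({
    "ward_code": ["W1", "W2", "W3"],
    "n": [30, 30, 5],
})
ward_population = pd.DataFrame({
    "ward_code": ["W1", "W2", "W3"],
    "population_0_25": [3000, 10000, 1000],
})

rates = ward_counts_toy.merge(
    ward_population, on="ward_code", how="left", validate="1:1"
)

# Diagnoses per 1,000 residents aged 0-25
rates["rate_per_1000"] = 1000 * rates["n"] / rates["population_0_25"]
```

In the toy data, W1 and W2 have the same count but very different rates, which is the kind of difference a count-based map hides.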

## Next Steps - Regression Analyses

### Diagnosis:

I can run the requested regression on diagnosis really quickly, but first we need to decide which cohort to test. The most sensible suggestions I can think of are:

* Individuals in the primary care data - we should think pretty carefully about setting a max age for this, as the data quality almost certainly gets worse the older the individuals are
* Individuals in the education data - this sets a nice limit on the age, and we can use some of the education census data if we like (things like free school meals and special educational needs)
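Whichever cohort we pick, the model itself would look roughly like the sketch below, which uses `statsmodels` on synthetic data. The `asd_diagnosis` outcome, the `fsm` (free school meals) flag and the group labels are all hypothetical stand-ins, and the simulated effect is made up purely so the example runs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic cohort: every column here is a hypothetical stand-in
rng = np.random.default_rng(0)
n = 500
cohort = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "ethnic_group": rng.choice(["A", "B", "C"], size=n),
    "fsm": rng.integers(0, 2, size=n),
})

# Simulated binary outcome with a made-up higher probability for sex == "M"
p = 0.1 + 0.1 * (cohort["sex"] == "M")
cohort["asd_diagnosis"] = (rng.random(n) < p).astype(int)

# Logistic regression: each coefficient is adjusted for the other predictors
model = smf.logit(
    "asd_diagnosis ~ C(sex) + C(ethnic_group) + fsm", data=cohort
).fit(disp=0)

# Exponentiated coefficients are odds ratios relative to the reference levels
odds_ratios = np.exp(model.params)
```

Interactions could be added by replacing `+` with `*` between the terms of interest, for example `C(sex) * C(ethnic_group)`.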

### Age at Diagnosis:

I can run this quickly with the existing cohort, but I'd want to set some age limits to avoid the probably naff data on folk being diagnosed at the older ages. Let me know what would be sensible here and I'll run those quickly.
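One thing to flag: age at diagnosis is a continuous outcome, so a logistic regression doesn't really fit it. The sketch below swaps in an ordinary linear regression with an interaction term instead, which is my suggestion rather than what was originally requested. It runs on synthetic data; the `MAX_AGE_AT_DIAGNOSIS` cut-off is a placeholder and the simulated effect is made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for asd_data (group labels are made up)
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "ethnic_group": rng.choice(["A", "B"], size=n),
})
# Made-up effect: sex == "F" is diagnosed about two years later on average
toy["age_at_diagnosis"] = (
    9.0 + 2.0 * (toy["sex"] == "F") + rng.normal(0, 1.5, size=n)
).clip(lower=0)

MAX_AGE_AT_DIAGNOSIS = 25  # placeholder cut-off, to be agreed
subset = toy[toy["age_at_diagnosis"] <= MAX_AGE_AT_DIAGNOSIS]

# `*` expands to both main effects plus their interaction
fit = smf.ols(
    "age_at_diagnosis ~ C(sex) * C(ethnic_group)", data=subset
).fit()
```

`fit.summary()` gives the coefficient table; the interaction row shows whether the sex difference varies by ethnic group.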

Let me know thoughts on both of these and I can get on with it - should be a pretty quick turnaround once I know.