# ASD Cohort - Initial Analysis

The following is a (very) rough analysis of the ASD cohort, based on some of the questions Danielle and Leanne asked in our discussions last year. I realise it isn't reportable in this format. It's intended as a primer, so that I can gauge what is needed: where further detail or different formatting is required, and so on.

Please ignore the code in the cells - our platform makes it prohibitively difficult to remove.

In [None]:
import pandas as pd
from google.cloud import bigquery
import contextily as cx
import geopandas
import numpy as np
from tableone import TableOne
import matplotlib.pyplot as plt

In [None]:
%%bigquery asd_data
SELECT * FROM `yhcr-prd-phm-bia-core.CY_ASD_data.ASD_master_tab`

## Basic Demographic Makeup

At the moment we have data on:

* Age
* Ethnic Group
* Sex
* Children in need
* Children in care
* Child protection plans
* I've thrown exclusions data in there as a bonus

I'm working on getting some socioeconomic position (SEP) stats - at the moment I don't know of anything of good enough quality to span the whole cohort reliably. I'm hoping we can use pseudonymised addresses to get IMD by postcode. At the moment we only have partial postcodes, which aren't granular enough to link to the census data.
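
If full postcodes do become available, the IMD link could look something like the sketch below. The column names `postcode` and `imd_decile`, the helper `attach_imd`, and the toy postcodes are all assumptions for illustration - none of them exist in the current tables.

```python
import pandas as pd

def attach_imd(cohort: pd.DataFrame, imd_lookup: pd.DataFrame) -> pd.DataFrame:
    """Left-join a (hypothetical) postcode-to-IMD lookup onto the cohort."""
    def normalise(postcodes: pd.Series) -> pd.Series:
        # Upper-case and strip spaces so "aa1 1aa" and "AA11AA" match
        return postcodes.str.upper().str.replace(" ", "", regex=False)

    left = cohort.assign(pc_key=normalise(cohort["postcode"]))
    right = imd_lookup.assign(pc_key=normalise(imd_lookup["postcode"]))
    right = right[["pc_key", "imd_decile"]]
    # validate="m:1" raises if the lookup holds duplicate postcodes
    linked = left.merge(right, on="pc_key", how="left", validate="m:1")
    return linked.drop(columns="pc_key")

# Made-up postcodes and deciles, for illustration only
toy_cohort = pd.DataFrame({"postcode": ["aa11aa", "BB2 2BB", "ZZ9 9ZZ"]})
toy_lookup = pd.DataFrame({"postcode": ["AA1 1AA", "BB2 2BB"],
                           "imd_decile": [3, 7]})
linked = attach_imd(toy_cohort, toy_lookup)
match_rate = linked["imd_decile"].notna().mean()
```

The match rate is worth reporting alongside any IMD breakdown, since unmatched postcodes drop out of it.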

The following are Table 1 stats for the full cohort, i.e. anybody with one of the ASD diagnostic codes in their primary care records:

In [None]:
# Variables to summarise in Table 1
columns = [
    "diagnosis_date", "age", "age_at_diagnosis", "ethnic_group", "sex",
    "perm_exclusion", "fixed_term_exclusion", "has_protection_plan",
    "in_care", "child_in_need",
]
# Variables to summarise as categories rather than as continuous values
categorical = [
    "ethnic_group", "sex", "perm_exclusion", "fixed_term_exclusion",
    "has_protection_plan", "in_care", "child_in_need",
]
table_1 = TableOne(asd_data, columns=columns, categorical=categorical)
table_1

I'm no expert, but the ages of some of the individuals don't make sense to me:

In [None]:
asd_data.age.hist(bins=114)

I'm leaving things as is for the moment, but I suggest we might want to set an upper age limit.
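
To help pick a cut-off, a small table of how many individuals sit above each candidate limit might be useful. This is a minimal sketch: the function name and the candidate thresholds are illustrative assumptions, not agreed values, and the toy data stands in for `asd_data`.

```python
import pandas as pd

def count_above_age(df: pd.DataFrame, thresholds, age_col: str = "age") -> pd.DataFrame:
    """Count and percentage of individuals older than each candidate limit."""
    ages = df[age_col].dropna()
    rows = []
    for limit in thresholds:
        n_above = int((ages > limit).sum())
        pct = round(100 * n_above / len(ages), 1) if len(ages) else 0.0
        rows.append({"max_age": limit, "n_above": n_above, "pct_above": pct})
    return pd.DataFrame(rows)

# Toy ages for illustration only; in the notebook this would be asd_data
toy = pd.DataFrame({"age": [4, 12, 19, 33, 67, 102, 113]})
summary = count_above_age(toy, thresholds=[25, 65, 100])
print(summary)
```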

One of the research questions concerned individuals under the age of 25, so the same table for that subgroup is included below:

In [None]:
# Same Table 1, restricted to individuals currently under 25
table_1 = TableOne(
    asd_data[asd_data.age < 25],
    columns=columns,
    categorical=categorical)
table_1

For any of the analyses above and/or below, I can easily subset the data by age (or any of the above variables) so feel free to ask for any subgroup data of interest.
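
As a rough idea of how that subsetting works, here is a minimal sketch. The helper `subset_cohort` is hypothetical, and the values in the toy data (e.g. `"F"`/`"M"` for sex) are assumptions about the coding rather than what is in the real table.

```python
import pandas as pd

def subset_cohort(df: pd.DataFrame, max_age=None, **equals) -> pd.DataFrame:
    """Keep rows younger than max_age that match every column=value filter."""
    mask = pd.Series(True, index=df.index)
    if max_age is not None:
        mask &= (df["age"] < max_age)
    for column, value in equals.items():
        mask &= (df[column] == value)
    return df[mask]

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [6, 17, 24, 25, 40],
    "sex": ["F", "M", "F", "F", "M"],
    "in_care": [False, True, False, True, False],
})
under_25_female = subset_cohort(toy, max_age=25, sex="F")
in_care_any_age = subset_cohort(toy, in_care=True)
```

The result can be passed to `TableOne` or any of the plots in place of the full cohort.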

## Age at Diagnosis

We still have no better option than estimating age at diagnosis from the first date an ASD SNOMED code appears in the individual's primary care record. This might not be the best method, but it's all we have for now.
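
For anyone wanting to check the logic, the estimate amounts to something like the sketch below. It is not necessarily how the existing `age_at_diagnosis` column was derived: the input tables and the column names `person_id`, `code_date` and `date_of_birth` are assumptions for illustration.

```python
import pandas as pd

def estimate_age_at_diagnosis(codes: pd.DataFrame, people: pd.DataFrame) -> pd.DataFrame:
    """Age in whole years at each person's earliest ASD code date."""
    # Earliest date an ASD code appears for each person
    first_code = (codes.groupby("person_id", as_index=False)["code_date"].min()
                  .rename(columns={"code_date": "diagnosis_date"}))
    out = people.merge(first_code, on="person_id", how="inner")
    dob, dx = out["date_of_birth"], out["diagnosis_date"]
    # Subtract a year if the birthday had not yet occurred in the diagnosis year
    before_birthday = (dx.dt.month < dob.dt.month) | (
        (dx.dt.month == dob.dt.month) & (dx.dt.day < dob.dt.day))
    out["age_at_diagnosis"] = dx.dt.year - dob.dt.year - before_birthday.astype(int)
    return out

# Toy records for illustration only
codes = pd.DataFrame({
    "person_id": [1, 1, 2],
    "code_date": pd.to_datetime(["2018-01-01", "2015-06-01", "2020-03-10"]),
})
people = pd.DataFrame({
    "person_id": [1, 2],
    "date_of_birth": pd.to_datetime(["2010-09-15", "2012-03-10"]),
})
estimated = estimate_age_at_diagnosis(codes, people)
```

Anyone whose diagnosis was first recorded elsewhere, or was back-entered when they registered with a practice, will get a later date than the true one under this approach.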

The following are some very ugly graphs of mean age at diagnosis broken down by the different demographics - I can make these a lot more visually appealing in the future, but we're having some issues with our virtual environments at the moment, so they'll have to do for now.

### By Gender - Whole Cohort:

In [None]:
(asd_data
 .groupby("sex")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Gender < 25 years old:

In [None]:
(asd_data[asd_data.age < 25]
 .groupby("sex")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Ethnic Group - Whole Cohort:

In [None]:
(asd_data
 .groupby("ethnic_group")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))

### By Ethnic Group < 25 years old:

In [None]:
(asd_data[asd_data.age < 25]
 .groupby("ethnic_group")["age_at_diagnosis"]
 .agg("mean").plot(kind="bar"))
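
A bar of means hides how many people are in each group, so a table of counts, means and medians alongside the charts above might help with interpretation. This is a minimal sketch on toy data; the helper name `age_at_diagnosis_summary` is hypothetical.

```python
import pandas as pd

def age_at_diagnosis_summary(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Group size, mean and median age at diagnosis for each level of `by`."""
    return (df.groupby(by)["age_at_diagnosis"]
              .agg(n="count", mean="mean", median="median")
              .round(1)
              .sort_values("n", ascending=False))

# Toy data for illustration only; in the notebook this would be asd_data
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M", "M"],
    "age_at_diagnosis": [4, 6, 11, 3, 5, 7, 9],
})
summary = age_at_diagnosis_summary(toy, by="sex")
print(summary)
```

The same call works with `by="ethnic_group"`, and with the under-25 subset in place of the full table.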

## Geographic Distribution (within Bradford)

The following is a ward-level breakdown and choropleth map of the address data for each individual in the cohort. It looks a little odd, given that the most densely populated areas (to my knowledge) don't show many ASD diagnoses, but I'll leave the interpretation to folk with more first-hand knowledge of the problem. It should definitely be looked into further, though:

In [None]:

sql = """
    SELECT *
    FROM `yhcr-prd-phm-bia-core.CY_LOOKUPS.tbl_ward_boundaries`
"""
ward_gdf = bigquery.Client().query(sql).to_geodataframe()

# True if any LSOA name in the ward mentions Bradford (missing names count as no match)
contains_bradford = lambda x: x.str.contains("Bradford", na=False).any()
# Per ward: number of individuals with an LSOA recorded, plus the Bradford flag
ward_counts = (asd_data
               .groupby("ward_code")["lsoa_name"]
               .agg(n="count", contains_bradford=contains_bradford)
               .reset_index())
ward_counts = geopandas.GeoDataFrame(
    ward_counts.merge(ward_gdf)
)

### Top 20 Wards:

In [None]:
ward_counts[["ward_name", "n"]].sort_values("n", ascending=False).head(20)

### Choropleth Map of residence:

In [None]:
ward_counts = ward_counts.to_crs(epsg=3857)
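# Plot Bradford wards only, shaded by the number of individuals
ax = ward_counts[ward_counts.contains_bradford].plot(
    column="n",
    alpha=0.5,
    edgecolor="k",
    linewidth=2,
    cmap="OrRd",
    figsize=(20, 20))
# Stamen tiles may not be available in recent contextily/xyzservices releases;
# if this line fails, swap in another provider such as cx.providers.CartoDB.Positron
cx.add_basemap(ax, source=cx.providers.Stamen.TonerLite)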
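
One way to look into the pattern noted above is to map a rate per 1,000 residents rather than a raw count, since wards differ a lot in size. The sketch below assumes a ward population table with a `population` column, which is not in the tables used here; the helper name and toy figures are made up for illustration.

```python
import pandas as pd

def add_rate_per_1000(ward_counts: pd.DataFrame, ward_population: pd.DataFrame) -> pd.DataFrame:
    """Attach ward population and compute cohort members per 1,000 residents."""
    merged = ward_counts.merge(ward_population, on="ward_code", how="left")
    merged["rate_per_1000"] = 1000 * merged["n"] / merged["population"]
    return merged

# Made-up wards and populations, for illustration only
toy_counts = pd.DataFrame({"ward_code": ["W1", "W2"], "n": [30, 12]})
toy_population = pd.DataFrame({"ward_code": ["W1", "W2"],
                               "population": [15000, 4000]})
rates = add_rate_per_1000(toy_counts, toy_population)
```

In this toy example the ward with fewer individuals has the higher rate, which is the kind of reversal a raw-count map can hide.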

## Next Steps - Regression analyses

### Diagnosis:

I can run the regression on diagnosis I've been asked for very quickly, but first we need to decide which cohort to test. The most sensible suggestions I can think of are:

* Individuals in the primary care data - we should think pretty carefully about setting a max age for this, as it's pretty certain that data quality gets worse the older the individuals are
* Individuals in the education data - this sets a nice limit on the age, and we can use some of the education census data if we like (things like free school meals and special educational needs)

### Age at Diagnosis:

I can run this quickly with the existing cohort, but I'd want to set some age limits to avoid the probably naff data on folk being diagnosed at the older ages. Let me know what would be sensible here and I'll get on with it.
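
To give a flavour of what this would involve, here is a minimal sketch of a linear regression of age at diagnosis on dummy-coded predictors, restricted to an age range. The model form, the helper name and the limits used in the example are assumptions rather than agreed choices, and the toy data is made up. It uses plain least squares from numpy, so it gives coefficients only; a proper run would use a stats package that also reports standard errors.

```python
import numpy as np
import pandas as pd

def fit_age_at_diagnosis_ols(df, predictors, min_age, max_age):
    """Least-squares fit of age_at_diagnosis on dummy-coded predictors."""
    # Apply the age limits to age at diagnosis (inclusive at both ends)
    cohort = df[df["age_at_diagnosis"].between(min_age, max_age)]
    cohort = cohort.dropna(subset=predictors + ["age_at_diagnosis"])
    # One column per non-reference category; first category is the reference
    X = pd.get_dummies(cohort[predictors], drop_first=True, dtype=float)
    X.insert(0, "intercept", 1.0)
    y = cohort["age_at_diagnosis"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
    return pd.Series(coef, index=X.columns)

# Toy data for illustration only; the last row falls outside the limit
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "age_at_diagnosis": [4, 6, 8, 10, 60],
})
coefs = fit_age_at_diagnosis_ols(toy, ["sex"], min_age=0, max_age=25)
```

With a single binary predictor the intercept is the mean for the reference group and the other coefficient is the difference in means.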

## Other questions:

Most of the discussion around pathways and interactions with other services is going to be very difficult without guidance from somebody who knows exactly what information we're looking for and (most importantly) where to find it in the Connected Bradford data. Aside from Mai, who I know has been working on some basic path analysis, I can't think of anyone who'd be able to provide this info. Indeed, I'm pretty sure a lot of the data won't even exist, based on some of the conversations we've been having with Diane Daley and on the state of the secondary use data that we already have.

Questions around interactions with other conditions might be more doable, but only if we can reasonably expect the conditions to be coded in the primary care data, i.e. the issues mentioned above with the secondary use services will make looking for data from other services (CAMHS and the like) very difficult. I can pretty quickly check for anything of interest, so we can at least get a quick answer on what's possible in this domain.
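
The quick check would be something like the sketch below: for each condition of interest, count how many cohort members have at least one matching code. The long-format events table, its column names (`person_id`, `snomed_code`) and the helper name are assumptions, and the codes shown are placeholders rather than real SNOMED codes.

```python
import pandas as pd

def condition_coverage(cohort_ids, coded_events: pd.DataFrame, code_lists: dict) -> pd.DataFrame:
    """Number and percentage of cohort members with any code for each condition."""
    in_cohort = coded_events[coded_events["person_id"].isin(cohort_ids)]
    n_cohort = len(cohort_ids)
    rows = []
    for condition, codes in code_lists.items():
        matched = in_cohort.loc[in_cohort["snomed_code"].isin(codes), "person_id"]
        n_people = int(matched.nunique())
        pct = round(100 * n_people / n_cohort, 1) if n_cohort else 0.0
        rows.append({"condition": condition, "n_people": n_people, "pct_of_cohort": pct})
    return pd.DataFrame(rows)

# Placeholder codes and toy events, for illustration only
cohort_ids = [1, 2, 3, 4]
events = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 5],
    "snomed_code": ["CODE_A", "CODE_A", "CODE_B", "CODE_A", "CODE_A"],
})
code_lists = {"condition_x": ["CODE_A"], "condition_y": ["CODE_B", "CODE_C"]}
coverage = condition_coverage(cohort_ids, events, code_lists)
```

A very low count for a condition we'd expect to be common would suggest it isn't reliably coded in primary care.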