Notes:

* I'm tempted to remove the "All Other" ethnicity stats from the visualisations, as it's too heterogeneous a group to really say much about and it detracts from the comparison of Whites/Asians which are more reliable/informative

## ASD Cohort Description

This study explores the demographic features of individuals aged 18 years or younger who are identified as having an Autism Spectrum Disorder (ASD) in the Connected Bradford primary care dataset. The primary cohort is defined as any individual with one or more SNOMED codes for an ASD diagnosis in their primary care record. Information on age and ethnic group (using the XXXX census ??GROUP DESCRIPTION?? - include figure with short names and full descriptions) is taken from the primary care record and forms the basis of this analysis. An "age at diagnosis" variable is defined as the individual's age on the date at which the first ASD SNOMED code appears in their record. The vast majority of the cohort fall in either the "White" or "Asian" ethnic groups, so the remaining ethnicities have been combined into an "All Other" ethnic group, as low numbers of records prevent meaningful analysis of their subgroupings.
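
As a minimal sketch of the "age at diagnosis" definition above (not the derivation used to build the `ASD_master_tab` table, which is not shown in this notebook), the function below takes a date of birth and the dates on which ASD codes were recorded, and returns the age in completed years on the earliest of those dates. The function name, argument names and example dates are all hypothetical.

```python
from datetime import date


def age_at_diagnosis(date_of_birth, asd_code_dates):
    """Age in completed years on the date of the earliest ASD code.

    Hypothetical helper for illustration only.
    """
    first_code = min(asd_code_dates)
    # Has the birthday already occurred in the year of the first code?
    had_birthday = (first_code.month, first_code.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return first_code.year - date_of_birth.year - (0 if had_birthday else 1)


# Made-up example: three ASD codes recorded out of date order
example_age = age_at_diagnosis(
    date(2012, 6, 15),
    [date(2019, 3, 1), date(2017, 2, 10), date(2021, 9, 30)],
)
print(example_age)  # 4
```

Only the earliest code date matters here, which is one reason the variable is crude: it reflects when a code first reached the primary care record, not necessarily when the diagnosis was made.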

The likelihood of diagnosis, based on the demographic features in the ASD cohort, is also described. The cohort for this analysis is defined as individuals 18 years old or younger appearing in the Connected Bradford education census data, and the outcome variable is a diagnosis of ASD, defined as an individual from the census data who also appears in the ASD cohort. Sex and ethnic group observations are taken from the primary care data or, where absent, are determined from the education census records. Adjusted odds of ASD diagnosis and confidence intervals for each of the demographic variables are calculated using logistic regression.

## Limitations

* Absence of a record in the ASD SNOMED codes or the census data is assumed to mean no ASD diagnosis or no presence in Bradford, respectively (kids could be NEET)
* Cohort has yet to be validated
* Age at diagnosis is crude

In [None]:
import pandas as pd
from google.cloud import bigquery
import contextily as cx
import numpy as np
from tableone import TableOne
import matplotlib.pyplot as plt
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px
import geopandas

In [None]:
asd_sql = "SELECT * FROM `yhcr-prd-phm-bia-core.CB_ASD_data.ASD_master_tab`"
asd_data = pd.read_gbq(asd_sql)

In [None]:
asd_data = asd_data[asd_data.age <= 18]
ethnic_group_map = {
    "Unknown": None,
    "Black or African or Caribbean or Black British": "Other",
    "Mixed multiple ethnic groups": "Other",
    "Other ethnic group": "Other",
    "Asian or Asian British": "Asian",
    "White": "White"
}
asd_data["ethnic_group_full"] = asd_data.ethnic_group
asd_data["ethnic_group"] = asd_data.ethnic_group.apply(
    lambda x: ethnic_group_map[x]
)
asd_data["sex"] = asd_data.sex.apply(
    lambda x: None if x == "Unknown" else x
)

In [None]:
asd_data.head()

In [None]:
megan_data =  asd_data.pivot_table(values="age_at_diagnosis", index="person_id", columns=["ethnic_group", "sex"]).reset_index(drop=True)[["Asian", "White"]]

In [None]:
megan_data.Asian.Male.dropna().median()

### Ethnic group descriptions

* White: "White"
* Asian: "Asian or Asian British"
* All other: "Black or African or Caribbean or Black British"/"Mixed Multiple Ethnic Groups"/"Other Ethnic Group"

## Basic Demographics

In [None]:
columns =  [
    'ethnic_group', 
]
categorical = ["ethnic_group"] 
table_1 = TableOne(
    asd_data[asd_data.age <= 18], 
    columns, 
    categorical,
    groupby="sex")
table_1

In [None]:
asd_data[asd_data.ethnic_group.isna()].groupby("sex").size()

In [None]:
pd.DataFrame(asd_data.groupby(["sex", "ethnic_group"]).size())

In [None]:
sex_stats = (asd_data[asd_data.age <= 18]
             .groupby(["sex"])["age_at_diagnosis"] 
             .agg(["mean", "median", "std"]))
sex_stats

In [None]:
sex_stats.to_csv("sex_stats.csv")

In [None]:
eth_stats = (asd_data[asd_data.age <= 18]
             .groupby(["ethnic_group"])["age_at_diagnosis"] 
             .agg(["mean", "median", "std"]))
eth_stats

In [None]:
eth_stats.to_csv("eth_stats.csv")

In [None]:
sex_eth_stats = (asd_data[asd_data.age <= 18]
             .groupby(["ethnic_group", "sex"])["age_at_diagnosis"] 
             .agg(["mean", "median", "std"]))
sex_eth_stats

In [None]:
sex_eth_stats.to_csv("sex_ethnic_group_stats.csv")

Table XXX shows the demographic breakdown of the ASD cohort. The ethnicities of the cohort do not appear to deviate significantly from those of the overall Bradford district (reference - this is general pop, really need figures for younger age group!!). The majority of the cohort is male (77.4%) ** insert comment about the actual breakdown of ethnicities in Bradford in this age group based on census **

Brad ethnicities reference - https://ubd.bradford.gov.uk/media/1682/2021-census-ethnic-group-religion-and-language.pdf

## Age at Diagnosis


In [None]:
color_discrete_map = {
    "Asian": "#EDAD06",
    "Other": "#661100",
    "White": "#332288",
    "Female": "#0F7732",
    "Male": "#882255"
}
# #EDAD06 #882255 #44AA99
labels = {
    "age_at_diagnosis": "Age at Diagnosis", 
    "sex": "Sex", 
    "ethnic_group": "Ethnic Group"
}

In [None]:
age_hist = px.histogram(
    asd_data,
    x="age_at_diagnosis",
    labels=labels,
    template="simple_white",
    nbins=18
)
age_hist.show(renderer="jpg")

In [None]:
age_hist.write_image("plots/age_hist.jpg", scale=2)

Figure XXXX shows the distribution of age at diagnosis - it appears to follow a bi-modal distribution with a sharp peak of diagnoses at age 4 and a shallower peak at age 9. ** maybe some commentary on the reasons for this / if this distribution is to be expected **
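
A rough way to check where the peaks sit, without reading them off the plot, is to count diagnoses per whole year of age and look for ages whose count exceeds both neighbours. The sketch below does this on made-up toy data (not the cohort data). The function name `local_peaks` is hypothetical, and this is a simple descriptive check that is sensitive to noise, not a formal test of bimodality.

```python
from collections import Counter


def local_peaks(ages):
    """Return the integer ages whose count is strictly higher than both neighbours."""
    counts = Counter(int(a) for a in ages)
    peaks = []
    for age in sorted(counts):
        left = counts.get(age - 1, 0)   # count for the year below (0 if absent)
        right = counts.get(age + 1, 0)  # count for the year above (0 if absent)
        if counts[age] > left and counts[age] > right:
            peaks.append(age)
    return peaks


# Synthetic ages at diagnosis, shaped to have two bumps
toy_ages = (
    [3] * 2 + [4] * 6 + [5] * 3 + [6] * 2 + [7] * 2
    + [8] * 3 + [9] * 4 + [10] * 2 + [11] * 1
)
toy_peaks = local_peaks(toy_ages)
print(toy_peaks)  # [4, 9]
```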

In [None]:
age_sex_box = px.box(
    asd_data,
    y="age_at_diagnosis", 
    x="sex",
    color="sex", 
    width=600, 
    height=500,
    labels=labels,
    template="simple_white",
    color_discrete_map=color_discrete_map
)
age_sex_box.update_layout(showlegend=False)
age_sex_box.show(renderer="jpg")

In [None]:
age_sex_box.write_image("plots/age_sex_box.jpg", scale=2)

In [None]:
age_eth_box = px.box(
    asd_data, 
    y="age_at_diagnosis",
    x="ethnic_group",
    color="ethnic_group",
    width=800,
    height=500,  
    labels=labels,
    template="simple_white",
    color_discrete_map=color_discrete_map
)
age_eth_box.update_layout(showlegend=False)
age_eth_box.show(renderer="jpg")

In [None]:
age_eth_box.write_image("plots/age_eth_box.jpg", scale=2)

Subdivision of age at diagnosis by sex shows little difference between the two sexes, with a slightly higher age at diagnosis amongst females. By ethnicity, a much higher average age at diagnosis can be seen among White individuals, with the bulk of diagnoses occurring at much younger ages among the other ethnic groups.

In [None]:
white_asian = asd_data.ethnic_group.apply(lambda x: x in ["Asian", "White"])
age_eth_hist = px.histogram(
    asd_data[white_asian],
    x="age_at_diagnosis",
    color="ethnic_group",
    facet_row="ethnic_group",
    color_discrete_map=color_discrete_map,
    labels=labels,
    template="simple_white",
    nbins=18
)
age_eth_hist.for_each_annotation(lambda a: a.update(text=""))
age_eth_hist.show(renderer="jpg")

In [None]:
age_eth_hist.write_image("plots/age_eth_hist.jpg", scale=2)

Further comparison of the distribution of ages at diagnosis in the White and Asian ethnic groups reveals markedly different patterns. A distinct bimodal pattern of diagnosis ages is observed in the White group, with peaks in diagnosis at 4 years of age and then at 9, whereas in the Asian group a long-tailed unimodal distribution can be seen, with a peak only at age 4.

In [None]:
age_sex_eth_box = px.box(
    asd_data, 
    x="ethnic_group",
    y="age_at_diagnosis",
    color="sex",
    color_discrete_map=color_discrete_map,
    labels=labels,
    template="simple_white",
    width=800,
    height=500
)
age_sex_eth_box.show(renderer="jpg")

In [None]:
age_sex_eth_box.write_image("plots/age_sex_eth_box.jpg", scale=2)

Comparison of the sex differences in age at diagnosis between the different ethnic groups shows that females are diagnosed later on average in the White and Other ethnic groups, but are diagnosed considerably earlier in the Asian group.

## Regression on Age at Diagnosis

I'm leaving this out for two main reasons:

1. I think the descriptive analysis above says all you need to say about the relationships between the variables we're looking at, and the regression data adds nothing
2. (more importantly!) we've shown above that age at diagnosis is bimodal for the whites group but unimodal for asians - simple gaussian regression assumes a unimodal outcome variable and certainly isn't well suited to a comparison of the covariates in this case

If we're desperate to come up with some sort of model for the data, we'd need to discuss a bit more, and it would be considerably more complex as a piece of work

In [None]:
asd_data = asd_data.join(asd_data.ethnic_group.str.get_dummies())
asd_data.loc[asd_data.ethnic_group.isna(), ["Asian", "Other", "White"]] = None
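# Keep unknown sex as missing (rather than coding it as 0) so those rows drop out of the regression
asd_data["Male"] = (asd_data.sex == "Male").astype(float).where(asd_data.sex.notna())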

In [None]:
age_reg = smf.glm("age_at_diagnosis ~ Male + Asian + Other",   
                  data=asd_data).fit()
age_reg.summary()

## Geographic Distribution (within Bradford)

In [None]:
sql = """
    SELECT *
    FROM `yhcr-prd-phm-bia-core.CY_LOOKUPS.tbl_ward_boundaries`
"""
ward_gdf = bigquery.Client().query(sql).to_geodataframe()

contains_bradford = lambda x: x.str.contains("Bradford").any()
ward_counts = (asd_data[["ward_code", "lsoa_name"]]
               .groupby("ward_code")
               .agg([("n", "count"), ("contains_bradford", contains_bradford)])
               .reset_index())
ward_counts.columns = ["ward_code", "n", "contains_bradford"]
ward_counts = geopandas.GeoDataFrame(
    ward_counts.merge(ward_gdf)
)

In [None]:
ward_counts.sort_values("n", ascending=False).head(4)

In [None]:
ward_counts[ward_counts.contains_bradford].n.value_counts().sort_index()

### Choropleth Map of residence:

In [None]:
ward_counts = ward_counts.to_crs(epsg=3857)
ax = ward_counts[ward_counts.contains_bradford].plot(
    column="n",   
    alpha=0.5,  
    edgecolor="k",  
    linewidth=1,    
    legend=True, 
    cmap="OrRd",   
    scheme="User_Defined",
    classification_kwds=dict(bins=[30, 50, 70, 90, 110]),
    figsize=(8,8),
)
cx.add_basemap(ax, source=cx.providers.Stamen.TonerLite)
plt.axis("off")
plt.savefig("plots/asd_map.jpg", dpi=200)

Figure XXX shows a choropleth map of the home residences of the individuals with an ASD diagnosis, by census ward. ** I don't really know the geography or geographic makeup of Bradford well enough to make any intelligent comments here - I don't know whether the concentration of diagnoses in Keighley is worth pointing out? **

## Likelihood of Diagnosis

In [None]:
def return_yr_date_diff_sql(from_date, to_date, var_name):
    diff_fn = f"DATE_DIFF({to_date}, {from_date}, DAY) / 365.25"
    return f"FLOOR({diff_fn}) AS {var_name}"
age = return_yr_date_diff_sql("demo.DOB_formatted", "DATE('2023-03-20')", "age")
age_dec = "DATE_DIFF(DATE('2023-03-20'), demo.DOB_formatted, DAY) / 365.25 AS age"
ethnic_group_regex = "REGEXP_EXTRACT(demo.census_ethnicity, r'^(.+?):')"
ethnic_group = f"""
    CASE
        WHEN {ethnic_group_regex} IS NOT NULL THEN {ethnic_group_regex}
        ELSE "Unknown"
    END AS ethnic_group
"""

sex = """
    CASE
        WHEN demo.remapped_gender = 45766034 THEN "Male"
        WHEN demo.remapped_gender = 45766035 THEN "Female"
        ELSE "Unknown"
    END AS sex
"""
project = "yhcr-prd-phm-bia-core"
census_table = f"`{project}.CB_FDM_DepartmentForEducation.src_census`"
demographics_table = f"`{project}.CB_STAGING_DATABASE.src_DemoGraphics_MASTER`"
# build SQL query
census_sql = f"""
    SELECT census.person_id, {age_dec}, {sex}, {ethnic_group}, AcademicYear, 
        CensusDate, CensusTerm, FSMEligible, SENprovision,  SENprovisionMajor, 
        SENUnitIndicator, 
    FROM {census_table} census
    LEFT JOIN {demographics_table} demo
    ON census.person_id = demo.person_id
"""
census_data_all = pd.read_gbq(census_sql, progress_bar_type="tqdm")

In [None]:
census_data_all.info()

In [None]:
census_data_all

In [None]:
census_data = census_data_all[census_data_all.age <= 19.05]

In [None]:
census_data_all.age.value_counts().sort_index()[19:20]

In [None]:
census_data

In [None]:
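def any_sen_provision(provisions):
    # True if any of the person's census records has a SENprovision value other than "N"
    return any(prov != "N" for prov in provisions)

# Flag everyone in the ASD cohort, then left-join onto one row per census individual
asd_data["asd"] = True
census_agg = (census_data
              .groupby(["person_id", "sex", "ethnic_group", "age"])
              .agg({"FSMEligible": "any", "SENprovision": any_sen_provision})
              .reset_index()
              .merge(asd_data[["person_id", "asd"]], 
                     on="person_id", 
                     how="left")
              # Individuals absent from the ASD cohort get asd=False
              .fillna(False))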

In [None]:
census_agg["ethnic_group_full"] = census_agg.ethnic_group
census_agg["ethnic_group"] = census_agg.ethnic_group.apply(
    lambda x: ethnic_group_map[x]
)
census_agg["sex"] = census_agg.sex.apply(
    lambda x: None if x == "Unknown" else x
)

In [None]:
columns = ['sex', 'ethnic_group']
categorical = ['sex', 'ethnic_group']
table_1 = TableOne(
    census_agg, 
    columns, 
    categorical,
    groupby="asd")
table_1

In [None]:
census_agg[(census_agg.sex == "Female") & (census_agg.ethnic_group.isna()) & census_agg.asd].shape

In [None]:
pd.crosstab(index=[census_agg.sex, census_agg.ethnic_group], 
            columns=census_agg.asd)

In order to compare the likelihood of ASD diagnosis between the different demographic groups, a cohort of individuals aged 18 years or younger from the Connected Bradford education census data was established. Table XXX shows the demographic breakdown of this cohort, along with a comparison of individuals with and without an ASD diagnosis. A much greater proportion of individuals with an ASD diagnosis are male and/or White than of those without a diagnosis.

In [None]:
census_agg = census_agg.join(census_agg.ethnic_group.str.get_dummies())
census_agg.loc[census_agg.ethnic_group.isna(), ["Asian", "Other", "White"]] = None
census_agg["Male"] = (census_agg.sex == "Male").astype(int)
census_agg["ASD"] = census_agg.asd.astype(int)

In [None]:
census_agg.info()

In [None]:
diag_reg = smf.logit("ASD ~ Male + White + Other", 
                    data=census_agg.dropna()).fit()

In [None]:
diag_reg.summary()

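The following are the odds ratios for diagnosis, relative to an Asian, female baseline (the model formula includes the Male, White and Other indicators, so Asian and female are the reference categories) - hopefully pretty self-explanatory but let me know if there are any questions: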

In [None]:
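def get_odds(model):
    # Exponentiate the logit coefficients and their 95% confidence bounds
    # (conf_int() defaults to alpha=0.05, i.e. the 2.5% and 97.5% limits)
    conf = model.conf_int()
    conf.columns = ['2.5%', '97.5%']
    conf['Odds Ratio'] = model.params
    # Drop the intercept row, which is not an odds ratio
    return np.exp(conf.iloc[1:][['Odds Ratio', '2.5%', '97.5%']])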

get_odds(diag_reg)
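
As a rough sanity check on the adjusted odds ratios above, an unadjusted odds ratio for a single binary variable can be computed directly from a 2x2 table of counts, with an approximate 95% confidence interval from the standard error of the log odds ratio (Woolf's method). This is a sketch under stated assumptions: the function name `odds_ratio_ci` and the counts are made up for illustration, and the result is unadjusted, so it will not match the logistic regression output exactly.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_noncases,
                  unexposed_cases, unexposed_noncases, z=1.96):
    """Unadjusted odds ratio and approximate 95% CI from a 2x2 table.

    Hypothetical helper; all four counts must be greater than zero.
    """
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases
    )
    # Standard error of log(OR) is the root of the summed reciprocal counts
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / exposed_noncases
        + 1 / unexposed_cases + 1 / unexposed_noncases
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts: 30 of 1000 "exposed" and 10 of 1000 "unexposed" have the outcome
estimate, lower, upper = odds_ratio_ci(30, 970, 10, 990)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

In practice the four counts could be read from the `pd.crosstab` output for one variable at a time (for example Male vs Female by `asd`).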