# asd_diagnoses_checks

Sanity checks of ASD diagnoses across the PTL and the primary care V4, V7 and staging databases.

## Highlights:

* **V4/Staging Differences**: V4 primary care contains far fewer individuals with ASD diagnostic codes than staging primary care
* **V7/Staging**: Pretty similar numbers of diagnosed individuals. The difference is likely just a more recent refresh in staging, but this needs querying
* **Diagnoses PTL vs V7**: Lots of diagnoses in PTL not found in primary care, seems to be more prevalent in more recent diagnoses but still plenty missing in earlier examples.

TODO:

* Comparison of "diagnosis dates" and JAC dates

In [None]:
from google.cloud import bigquery
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
import pandas as pd
import plotly.express as px
import numpy as np
import matplotlib

## V4/Staging Differences

The V4 version of the primary care data contains far fewer individuals with ASD diagnostic codes than the staging version. This is potentially a big issue as all of the prior analyses of ASD diagnoses are based on the V4 dataset.

In [None]:
def plot_venn(list1, list2, table_names):
    plt.rcParams['figure.facecolor'] = 'white'
    plt.figure(figsize=(5,5), dpi=150)
    v = venn2([set(list1), set(list2)], set_labels=None)
    plt.legend(handles=v.patches, labels=table_names)
    plt.show()

In [None]:
asd_snomed_codes = [
    "35919005", "442314000", "23560001", "231536004", "718393002", "408856003", 
    "373618009", "71961003", "702450004", "723332005", "712884004", 
    "39951000119105", "870307006", "870308001", "870305003", "870306002",
    "870303005", "870304004", "870269009", "870270005", "870268001", "870266002",
    "870267006", "870264004", "870265003", "870262000", "870263005", "870260008",
    "870261007", "870280009", "870282001", "68618008", "432091002", "708037001",
    "719600006", "766824003", "722287002", "771512003", "733623005", "43614003",
    "702732007", "408857007", "783089006", "191692007", "191693002", "191690004",
    "771448004", "770790004", "191689008" 
]
codes_str = ', '.join([f"'{code}'" for code in asd_snomed_codes])
staging_codes_tbl = "CB_STAGING_DATABASE_PrimaryCare.tbl_SRCode"
staging_query = f"""
    SELECT DISTINCT person_id
    FROM {staging_codes_tbl}
    WHERE SNOMEDCode IN ({codes_str})
"""
staging_ids = pd.read_gbq(staging_query).dropna()
staging_ids = list(staging_ids.person_id)
v4_codes_tbl = "CB_FDM_PrimaryCare_v4.tbl_SRCode"
v4_query = f"""
    SELECT DISTINCT person_id
    FROM {v4_codes_tbl}
    WHERE src_snomedcode IN ({codes_str})
"""
v4_ids = pd.read_gbq(v4_query)
v4_ids = list(v4_ids.person_id)

v7_codes_tbl = "CB_FDM_PrimaryCare_V7.tbl_srcode"
v7_query = f"""
    WITH codes AS ( 
        SELECT person_id, dateeventrecorded 
        FROM {v7_codes_tbl}
        WHERE snomedcode IN ({codes_str})
    )
    SELECT person_id, MIN(dateeventrecorded) AS diagnosis_date
    FROM codes
    GROUP BY person_id
"""
v7_data = pd.read_gbq(v7_query)
v7_ids = list(v7_data.person_id)

plot_venn(v4_ids, staging_ids, ["V4", "Staging"])

In [None]:
print(v7_query)

The issues seem to have been fixed in V7. The V4 discrepancy still needs to be queried, as a lot of work has been done with the V4 cohort.

In [None]:
plot_venn(v7_ids, staging_ids, ["V7", "Staging"])

## PTL Diagnoses vs Primary Care

We'll focus on a comparison between V7 and the PTL, as the V4 data seems to be missing a lot of diagnoses.

In [None]:
%%bigquery ptl_data
SELECT * 
FROM `yhcr-prd-phm-bia-core.CB_FDM_ASD_PTL.tbl_autism_amalgamated_ptl_oct2022`

In [None]:
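# Keep only rows with an ASD assessment; copy so later column assignments
# modify this frame rather than a view of the original.
ptl_data = ptl_data[ptl_data.asd_assessment].copy()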

In [None]:
ptl_ids = list(ptl_data[ptl_data.asd_diagnosis].person_id.unique())
plot_venn(ptl_ids, v7_ids, ["PTL Diagnoses", "Primary Care Diagnoses"])

In [None]:
636/(1366+636)

About 32% of diagnoses in the PTL aren't in the V7 primary care data. A few of the entries don't have a JAC report, which is circulated to GPs and prompts the update of primary care records:

In [None]:
# True where a JAC report circulation date is recorded
jac_circulated = ptl_data.jac_report_circulated_dd_mm_yyyy.notna()
no_jac_diagnoses = len(ptl_data[~jac_circulated & ptl_data.asd_diagnosis].person_id.unique())
print(f"{no_jac_diagnoses} individuals have an ASD diagnosis but no JAC")

In [None]:
jac_ids = list(ptl_data[jac_circulated & ptl_data.asd_diagnosis].person_id.unique())
plot_venn(jac_ids, v7_ids, ["jac_ids", "v7_PC"])

In [None]:
581/(1244+581)

The missing diagnoses in primary care don't seem to be explained by a missing JAC report: restricting to diagnoses with a circulated JAC still leaves about 32% absent from V7.
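
The two proportions above were typed in by hand from the Venn diagram counts. A minimal sketch of computing them directly from the ID lists follows. The helper name `overlap_summary` and the toy IDs are assumptions for illustration and are not part of the original notebook.

```python
def overlap_summary(ids_a, ids_b):
    """Count IDs unique to each list and shared by both,
    plus the fraction of A that is missing from B."""
    a, b = set(ids_a), set(ids_b)
    only_a = len(a - b)   # in A but not in B
    both = len(a & b)     # in both
    only_b = len(b - a)   # in B but not in A
    missing_frac = only_a / len(a) if a else float("nan")
    return {"only_a": only_a, "both": both, "only_b": only_b,
            "missing_frac": missing_frac}


# Toy IDs for illustration only, not real person_ids.
toy_ptl = [1, 2, 3, 4, 4]
toy_pc = [3, 4, 5]
summary = overlap_summary(toy_ptl, toy_pc)
print(summary)
```

In the notebook this would be called as `overlap_summary(ptl_ids, v7_ids)` or `overlap_summary(jac_ids, v7_ids)`, so the percentages stay in sync with the data after a refresh.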

Splitting the data by the date the JAC was sent suggests that a larger share of recent diagnoses is missing from primary care, but plenty are still missing among much earlier diagnoses.

In [None]:
# Vectorised membership test: True where the person_id has an ASD code in V7
ptl_data["diagnosed_in_v7"] = ptl_data.person_id.isin(v7_ids)

fig = px.histogram(ptl_data, 
                   x="jac_report_circulated_dd_mm_yyyy", 
                   color="diagnosed_in_v7",
                   nbins=200
                   )
fig.update_xaxes(range=["2019-03-01", "2023"])
fig.update_layout(width=750, 
                  height=500, 
                  xaxis_title="Date JAC Report Circulated",
                  yaxis_title=None,
                  legend=dict(title="Diagnosis in Primary Care",
                              x=0.4, y=1.25))
fig.show()
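
The histogram shows the trend visually. As a rough complement, the share of diagnoses missing from primary care can be tabulated per JAC year. This is a sketch on a made-up frame that reuses the notebook's column names. The toy values, the `jac_year` column and the `by_year` table are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for ptl_data; the values are invented for illustration.
toy = pd.DataFrame({
    "jac_report_circulated_dd_mm_yyyy": pd.to_datetime(
        ["2019-05-01", "2019-11-20", "2021-02-03", "2021-07-15", "2022-09-30"]
    ),
    "diagnosed_in_v7": [True, True, True, False, False],
})

# Group by the calendar year of the JAC date and count rows and matches.
by_year = (
    toy.assign(jac_year=toy.jac_report_circulated_dd_mm_yyyy.dt.year)
    .groupby("jac_year")["diagnosed_in_v7"]
    .agg(n="size", found="sum")
)
# Fraction of each year's diagnoses with no matching code in primary care.
by_year["missing_frac"] = 1 - by_year["found"] / by_year["n"]
print(by_year)
```

Rows with no JAC date fall out of the grouping, so they would need counting separately.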

In [None]:
ptl_data = ptl_data.merge(v7_data, on="person_id", how="left")

In [None]:
ptl_data.loc[:,"jac_report_circulated_dd_mm_yyyy"] = (
    ptl_data.jac_report_circulated_dd_mm_yyyy.dt.tz_localize(None)
)
ptl_data.loc[:,"diagnosis_lag"] = (
    (ptl_data.diagnosis_date-ptl_data.jac_report_circulated_dd_mm_yyyy)
    .dt.days / 365
)

In [None]:
# Negative lag: the primary care code was recorded before the JAC report was circulated
ptl_data[ptl_data.diagnosis_lag < 0][["person_id", "diagnosis_date", "jac_report_circulated_dd_mm_yyyy"]]

In [None]:
# Positive lag: the primary care code was recorded after the JAC report was circulated
ptl_data[ptl_data.diagnosis_lag > 0][["person_id", "diagnosis_date", "jac_report_circulated_dd_mm_yyyy"]]