# Primary Care Dataset Error

A significant error in the primary care dataset has recently come to light. A fault in the data transformation process that takes the raw warehouse data and uploads it to Google Cloud Platform caused large numbers of records to be deleted in error. Unfortunately, the exact cause and nature of the problem cannot be described, because the scripts that contained the errors are no longer available. It is clear that the error has been present since the initial versions of the dataset were uploaded to the Connected Bradford platform, and that it has only been corrected very recently.

These issues have a considerable effect on our cohort of individuals with ASD diagnoses. We identified 4911 individuals with a diagnosis of ASD from the erroneous primary care datasets; after correction, this number rose to 6786. The demographic breakdown of the ASD cohort changed once the erroneously deleted records were included, which indicates that the deleted data were not missing at random.
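
The scale of the change follows from simple arithmetic on the two cohort sizes quoted above. The following is a minimal sketch with illustrative variable names. Reading the net increase as the number of restored individuals assumes that everyone in the erroneous cohort also appears in the corrected one, which has not been verified here.

```python
# Cohort sizes quoted in the text above.
n_erroneous = 4911  # individuals with an ASD diagnosis before correction
n_corrected = 6786  # individuals with an ASD diagnosis after correction

# Net change in cohort size.
n_added = n_corrected - n_erroneous

# Relative increase over the erroneous cohort, and the share of the
# corrected cohort that the increase accounts for.
pct_increase = 100 * n_added / n_erroneous
pct_of_corrected = 100 * n_added / n_corrected

print(f"Net increase: {n_added}")
print(f"Increase relative to erroneous cohort: {pct_increase:.1f}%")
print(f"Share of corrected cohort: {pct_of_corrected:.1f}%")
```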

In [None]:
import pandas as pd
from google.cloud import bigquery
import contextily as cx
import geopandas
import numpy as np
from tableone import TableOne
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
%%bigquery asd_data_v4
SELECT * FROM `yhcr-prd-phm-bia-core.CB_ASD_data.ASD_master_tab`

In [None]:
%%bigquery asd_data_v7
SELECT * FROM `yhcr-prd-phm-bia-core.CB_ASD_data.ASD_master_tab_v7`

In [None]:
asd_data_v4.columns

In [None]:
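# Flag individuals in v7 whose person_id does not appear in v4.
# A vectorised isin() avoids scanning the ID list once per row.
v4_ids = list(asd_data_v4.person_id.unique())
asd_data_v7["missing_v4"] = ~asd_data_v7.person_id.isin(v4_ids)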
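
As a cross-check on the flag above, an outer merge with `indicator=True` shows how the two sets of IDs overlap, including whether any IDs appear in v4 but not in v7. This is a sketch on toy stand-in frames (`toy_v4`, `toy_v7`, and their IDs are hypothetical, not real data); in the notebook the same pattern could be applied to `asd_data_v4` and `asd_data_v7`.

```python
import pandas as pd

# Toy stand-ins for asd_data_v4 and asd_data_v7 (hypothetical IDs, not real data).
toy_v4 = pd.DataFrame({"person_id": [1, 2, 3]})
toy_v7 = pd.DataFrame({"person_id": [1, 2, 3, 4, 5]})

# Outer-join the unique IDs; the `_merge` column records where each ID was found:
# "left_only" = only in v4, "right_only" = only in v7, "both" = in both versions.
merged = toy_v4[["person_id"]].drop_duplicates().merge(
    toy_v7[["person_id"]].drop_duplicates(),
    on="person_id",
    how="outer",
    indicator=True,
)

n_only_v4 = int((merged["_merge"] == "left_only").sum())
n_only_v7 = int((merged["_merge"] == "right_only").sum())
n_both = int((merged["_merge"] == "both").sum())

print(f"Only in v4: {n_only_v4}, only in v7: {n_only_v7}, in both: {n_both}")
```

A non-zero "only in v4" count would mean that the earlier cohort is not a strict subset of the later one, which would be worth investigating before interpreting `missing_v4`.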

In [None]:
asd_data_v4

In [None]:
pd.set_option("display.max_rows", None)  # show every row of the summary tables
# Variables to summarise; those listed in `categorical` are treated as categorical.
columns = ['age', 'age_at_diagnosis', 'ethnic_group', 'sex',
           'perm_exclusion', 'fixed_term_exclusion', 'has_protection_plan',
           'in_care', 'child_in_need']
categorical = ['ethnic_group', 'sex',
               'perm_exclusion', 'fixed_term_exclusion', 'has_protection_plan',
               'in_care', 'child_in_need']
# Summary table for the v4 cohort.
table_1_v4 = TableOne(
    asd_data_v4,
    columns=columns,
    categorical=categorical)
table_1_v4

In [None]:
# Summary table for the v7 cohort, using the same variables as for v4.
table_1_v7 = TableOne(
    asd_data_v7,
    columns=columns,
    categorical=categorical)
table_1_v7

In [None]:
# Repeat the v7 summary, stratified by whether each individual was absent from v4.
pd.set_option("display.max_rows", None)
columns = ['age', 'age_at_diagnosis', 'ethnic_group', 'sex',
           'perm_exclusion', 'fixed_term_exclusion', 'has_protection_plan',
           'in_care', 'child_in_need', "missing_v4"]
categorical = ['ethnic_group', 'sex',
               'perm_exclusion', 'fixed_term_exclusion', 'has_protection_plan',
               'in_care', 'child_in_need', "missing_v4"]
table_1_v7 = TableOne(
    asd_data_v7,
    columns=columns,
    categorical=categorical,
    groupby="missing_v4")
table_1_v7
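
The stratified table supports a visual comparison of the two groups. One way to test formally whether absence from v4 is associated with a categorical variable is a chi-square test of independence. The sketch below uses `scipy.stats.chi2_contingency`, which is an assumption and not part of the original notebook, and runs on toy data (`toy` and its values are hypothetical, not real records). In the notebook, `asd_data_v7` and any of the categorical columns could be substituted.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (hypothetical, not real records): one categorical variable plus
# the flag marking whether the individual was absent from v4.
toy = pd.DataFrame({
    "sex": ["M"] * 40 + ["F"] * 10 + ["M"] * 20 + ["F"] * 30,
    "missing_v4": [False] * 50 + [True] * 50,
})

# Cross-tabulate the variable against the missingness flag.
contingency = pd.crosstab(toy["sex"], toy["missing_v4"])

# Chi-square test of independence. For 2x2 tables SciPy applies
# Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value suggests that missingness is associated with the variable, which is inconsistent with the records having been deleted completely at random. It does not by itself identify the mechanism behind the deletions.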