Primary Questions:
- What factors influence Indigenous healthcare access?
- How do patterns of healthcare access differ among Indigenous groups?

> Health Gaps: Compare healthcare access between Indigenous and non-Indigenous populations

> Within-group Analysis: compare First Nations, Métis, and Inuit populations, or rural vs urban Indigenous populations

In [4]:
import pandas as pd

# Health Indicators

## Load Data

In [5]:
def data_loader(data_type, table_name):
    """Load the CSV for one table from data/<data_type>/<table_name>-eng/."""
    path = f'data/{data_type}/{table_name}-eng/{table_name}.csv'
    df = pd.read_csv(path)
    return df

In [6]:
hi_df = data_loader('health-indicators', '13100924')
hi_df.head()

Unnamed: 0,REF_DATE,GEO,DGUID,Age group,Sex,Indigenous Identity,Indicators,Characteristics,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,"Total, Indigenous identity","Perceived health, very good or excellent",Number of persons,Number,223,units,0,v1643071352,1.1.1.1.1.1,516400.0,,,,0
1,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,"Total, Indigenous identity","Perceived health, very good or excellent","Low 95% confidence interval, number of persons",Number,223,units,0,v1643071353,1.1.1.1.1.2,491400.0,,,,0
2,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,"Total, Indigenous identity","Perceived health, very good or excellent","High 95% confidence interval, number of persons",Number,223,units,0,v1643071354,1.1.1.1.1.3,543600.0,,,,0
3,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,"Total, Indigenous identity","Perceived health, very good or excellent",Percent,Percent,239,units,0,v1643071355,1.1.1.1.1.4,49.6,,,,1
4,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,"Total, Indigenous identity","Perceived health, very good or excellent","Low 95% confidence interval, percent",Percent,239,units,0,v1643071356,1.1.1.1.1.5,47.9,,,,1


In [14]:
# Check data types
hi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413100 entries, 0 to 413099
Data columns (total 19 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ref_date             413100 non-null  object 
 1   geo                  413100 non-null  object 
 2   dguid                413100 non-null  object 
 3   age_group            413100 non-null  object 
 4   sex                  413100 non-null  object 
 5   indigenous_identity  413100 non-null  object 
 6   indicators           413100 non-null  object 
 7   characteristics      413100 non-null  object 
 8   uom                  413100 non-null  object 
 9   uom_id               413100 non-null  int64  
 10  scalar_factor        413100 non-null  object 
 11  scalar_id            413100 non-null  int64  
 12  vector               413100 non-null  object 
 13  coordinate           413100 non-null  object 
 14  value                185604 non-null  float64
 15  status           

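For percentage rows, the `value` column gives the share of people in each Indigenous identity group (and sex category) who fall under a given indicator. Other rows hold counts of persons or confidence-interval bounds, as the `characteristics` column indicates.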

## Rename Column Names


In [8]:
hi_df.columns

Index(['REF_DATE', 'GEO', 'DGUID', 'Age group', 'Sex', 'Indigenous Identity',
       'Indicators', 'Characteristics', 'UOM', 'UOM_ID', 'SCALAR_FACTOR',
       'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS', 'SYMBOL',
       'TERMINATED', 'DECIMALS'],
      dtype='object')

In [11]:
# change column names to lower case
hi_df.columns = hi_df.columns.str.lower()

In [12]:
# replace spaces with underscores
hi_df.columns = hi_df.columns.str.replace(' ', '_')

In [13]:
hi_df.columns

Index(['ref_date', 'geo', 'dguid', 'age_group', 'sex', 'indigenous_identity',
       'indicators', 'characteristics', 'uom', 'uom_id', 'scalar_factor',
       'scalar_id', 'vector', 'coordinate', 'value', 'status', 'symbol',
       'terminated', 'decimals'],
      dtype='object')
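
The two renaming steps can also be chained into a single expression. A minimal sketch, assuming a hypothetical helper name `normalize_columns` and a toy frame standing in for `hi_df`:

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, underscore-separated column names."""
    out = df.copy()
    # regex=False treats the space as a literal character
    out.columns = out.columns.str.lower().str.replace(' ', '_', regex=False)
    return out

# Toy frame with a few of the original column names (no rows needed)
demo = pd.DataFrame(columns=['REF_DATE', 'Age group', 'Indigenous Identity'])
normalized = normalize_columns(demo)
print(list(normalized.columns))
```

Returning a copy leaves the raw frame untouched, which makes the cell safe to re-run.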

## Statistics of the Data

In [15]:
# statistics of the numerical columns
hi_df.describe()

Unnamed: 0,uom_id,scalar_id,value,symbol,terminated,decimals
count,413100.0,413100.0,185604.0,0.0,0.0,413100.0
mean,231.0,0.0,62337.85,,,0.5
std,8.00001,0.0,454568.0,,,0.500001
min,223.0,0.0,0.0,,,0.0
25%,223.0,0.0,22.5,,,0.0
50%,231.0,0.0,59.9,,,0.5
75%,239.0,0.0,9700.0,,,1.0
max,239.0,0.0,25489200.0,,,1.0


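`scalar_id` has a standard deviation of 0 and identical min and max, so every value in that column is the same (0).

`symbol` and `terminated` have a count of 0, so they contain no non-null values.

The maximum of `value` is far above 100 because the column mixes percentages with counts of persons (`uom_id` 223, "Number", in the preview above). These large values are counts, not errors. Keeping only values of 100 or less is a rough way to isolate the percentage rows.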

In [17]:
# only keep values <= 100
hi_filtered_df = hi_df[hi_df['value'] <= 100]
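
Since the preview shows that `value` holds both counts and percentages, selecting rows by their unit is a more direct way to isolate percentages than a numeric threshold. The sketch below is one possible alternative, not the filter used in this notebook. It runs on a small toy frame with the same `uom` and `value` columns, and the small count of 50.0 is a made-up value for illustration.

```python
import pandas as pd

# Toy stand-in for hi_df; 50.0 is a hypothetical small count
toy = pd.DataFrame({
    'uom': ['Number', 'Number', 'Percent', 'Percent'],
    'value': [516400.0, 50.0, 49.6, 47.9],
})

# Select by unit instead of by magnitude, then drop suppressed (NaN) values
pct_rows = toy[toy['uom'] == 'Percent'].dropna(subset=['value'])
print(pct_rows['value'].tolist())
```

A unit-based filter cannot let a small count through, and it would keep a legitimate value of exactly 100%.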

In [18]:
# categorical columns
hi_filtered_df.describe(include=['object'])

Unnamed: 0,ref_date,geo,dguid,age_group,sex,indigenous_identity,indicators,characteristics,uom,scalar_factor,vector,coordinate,status
count,110910,110910,110910,110910,110910,110910,110910,110910,110910,110910,110910,110910,23700
unique,2,17,17,5,3,5,27,3,1,1,60273,60273,1
top,2015/2018,Canada,2021A000011124,"Total, 18 years and over",Both sexes,Non-Indigenous people,Has a regular healthcare provider,Percent,Percent,units,v1643071355,1.1.1.1.1.4,E
freq,56601,10887,10887,30207,42759,37350,4620,36970,110910,110910,2,2,23700


`uom` and `scalar_factor` each have a single unique value, so they carry no information for the analysis. `status` also shows one unique value (`E`), but it is non-null in only 23,700 of the 110,910 rows.
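
Constant columns can also be found programmatically. The sketch below assumes a hypothetical helper `constant_columns` and a toy frame. It counts NaN as a distinct value, so a column such as `status`, which is only partly filled, is not mistaken for a constant.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list[str]:
    """Names of columns with at most one distinct value, counting NaN as a value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame: 'status' mixes 'E' with missing values, so it is not constant
toy = pd.DataFrame({
    'uom': ['Percent', 'Percent', 'Percent'],
    'scalar_factor': ['units', 'units', 'units'],
    'status': ['E', None, None],
    'value': [49.6, 47.9, 51.3],
})

to_drop = constant_columns(toy)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```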

In [20]:
# unique values in the categorical columns
for col in ['ref_date', 'geo', 'age_group', 'sex', 'indigenous_identity','indicators', 'characteristics']:
    print(f'{col}: {hi_filtered_df[col].nunique()} unique values')
    print(hi_filtered_df[col].unique())
    print()

ref_date: 2 unique values
['2015/2018' '2019/2022']

geo: 17 unique values
['Canada' 'Atlantic provinces' 'Newfoundland and Labrador'
 'Prince Edward Island' 'Nova Scotia' 'New Brunswick' 'Quebec' 'Ontario'
 'Prairie provinces' 'Manitoba' 'Saskatchewan' 'Alberta'
 'British Columbia' 'Territories' 'Yukon' 'Northwest Territories'
 'Nunavut']

age_group: 5 unique values
['Total, 18 years and over' '18 to 34 years' '35 to 49 years'
 '50 to 64 years' '65 years and over']

sex: 3 unique values
['Both sexes' 'Males' 'Females']

indigenous_identity: 5 unique values
['Total, Indigenous identity' 'First Nations (North American Indian)'
 'Métis' 'Inuk (Inuit)' 'Non-Indigenous people']

indicators: 27 unique values
['Perceived health, very good or excellent' 'Perceived health, good'
 'Perceived health, fair or poor'
 'Perceived mental health, very good or excellent'
 'Perceived mental health, good' 'Perceived mental health, fair or poor'
 'Perceived life stress, most days quite a bit or extremely 

In [48]:
healthcare_df[
    (healthcare_df['indigenous_group'] == 'First Nations') &
    (healthcare_df['gender'] == 'Men+') &
    (healthcare_df['healthcare_access_experience'] == 'Yes, had an unmet health care need in the past 12 months')
]

Unnamed: 0,ref_date,geo,dguid,healthcare_access_experience,indigenous_group,gender,statistics,uom,uom_id,scalar_factor,scalar_id,vector,coordinate,value,status,symbol,terminated,decimals
30,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,Percentage,Percent,239,units,0,v1663833882,1.2.1.2.1,24.4,,,,1
31,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,Low 95% confidence interval,Percent,239,units,0,v1663833883,1.2.1.2.2,19.2,,,,1
32,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,High 95% confidence interval,Percent,239,units,0,v1663833884,1.2.1.2.3,30.3,,,,1


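Each combination has three rows: the point estimate and the lower and upper bounds of its 95% confidence interval. We'll keep only the point estimate by filtering for `statistics == 'Percentage'`.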

In [50]:
# filter statistics == 'Percentage'
hc_filtered_df = healthcare_df[healthcare_df['statistics'] == 'Percentage']
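
Filtering to `'Percentage'` discards the confidence bounds. If they are needed later, one alternative is to pivot `statistics` into columns so that each group keeps its estimate and interval on one row. This is a sketch on a toy frame built from the three rows shown above. The short column names `pct`, `ci_low`, and `ci_high` are assumptions chosen for illustration.

```python
import pandas as pd

# The three First Nations / Men+ rows from the output above
toy = pd.DataFrame({
    'indigenous_group': ['First Nations'] * 3,
    'gender': ['Men+'] * 3,
    'healthcare_access_experience': [
        'Yes, had an unmet health care need in the past 12 months'
    ] * 3,
    'statistics': [
        'Percentage',
        'Low 95% confidence interval',
        'High 95% confidence interval',
    ],
    'value': [24.4, 19.2, 30.3],
})

keys = ['indigenous_group', 'gender', 'healthcare_access_experience']
wide = (
    toy.pivot(index=keys, columns='statistics', values='value')
       .rename(columns={
           'Percentage': 'pct',
           'Low 95% confidence interval': 'ci_low',
           'High 95% confidence interval': 'ci_high',
       })
       .reset_index()
)
wide.columns.name = None  # drop the leftover 'statistics' axis label
print(wide[['pct', 'ci_low', 'ci_high']])
```

`pivot` raises an error if a key combination appears more than once per statistic, which doubles as a check that the keys identify each row uniquely.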

## Drop Unnecessary Columns

In [51]:
hc_filtered_df.columns

Index(['ref_date', 'geo', 'dguid', 'healthcare_access_experience',
       'indigenous_group', 'gender', 'statistics', 'uom', 'uom_id',
       'scalar_factor', 'scalar_id', 'vector', 'coordinate', 'value', 'status',
       'symbol', 'terminated', 'decimals'],
      dtype='object')

In [52]:
selected_cols = ['indigenous_group', 'gender', 'healthcare_access_experience', 'value']

hc_filtered_df = hc_filtered_df[selected_cols]

## Drop Missing Values 

In [53]:
# Check NA value counts and proportions
na_counts = hc_filtered_df.isna().sum()
na_proportions = hc_filtered_df.isna().mean()
na_summary = pd.DataFrame({'count': na_counts, 'proportion': na_proportions})
na_summary

Unnamed: 0,count,proportion
indigenous_group,0,0.0
gender,0,0.0
healthcare_access_experience,0,0.0
value,12,0.022989


NA values make up only ~2% of the rows (12 values), so we can drop them.

In [56]:
# Drop rows with NA values
hc_cleaned_df = hc_filtered_df.dropna()
print(f"Dropped {hc_filtered_df.shape[0] - hc_cleaned_df.shape[0]} rows with NA values")

Dropped 12 rows with NA values
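
Before dropping, it can be worth checking whether the missing values cluster in one group, since that would bias group comparisons. A minimal sketch on a toy frame follows. The group and gender labels other than `First Nations` and `Men+`, and all values except 24.4, are made up for illustration.

```python
import pandas as pd

# Toy stand-in for hc_filtered_df; the NaN pattern here is hypothetical
toy = pd.DataFrame({
    'indigenous_group': ['First Nations', 'First Nations', 'Inuit', 'Inuit'],
    'gender': ['Men+', 'Women+', 'Men+', 'Women+'],
    'value': [24.4, 30.1, None, None],
})

# Count missing values per group
na_by_group = toy['value'].isna().groupby(toy['indigenous_group']).sum()
print(na_by_group)
```

If one group accounts for most of the missing rows, dropping them removes that group disproportionately even when the overall share is small.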


Compare the mean `value` for "No" and "Yes" responses:

In [67]:
hc_cleaned_df[hc_cleaned_df['healthcare_access_experience'].str.contains("No")]['value'].mean()

54.84027777777778

In [68]:
hc_cleaned_df[hc_cleaned_df['healthcare_access_experience'].str.contains("Yes")]['value'].mean()

42.09074074074075
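
`str.contains("No")` matches the substring anywhere in the label, so it would also pick up any category that merely contains "No" (for example in a word such as "Not"). A stricter variant anchors the match at the start of the label and computes both means in one pass. This is a sketch on a toy frame. The "No, ..." label text and all values except 24.4 are assumptions, since the full category list is not shown here.

```python
import pandas as pd

# Toy stand-in for hc_cleaned_df; the 'No' label wording is hypothetical
toy = pd.DataFrame({
    'healthcare_access_experience': [
        'Yes, had an unmet health care need in the past 12 months',
        'No, did not have an unmet health care need in the past 12 months',
        'Yes, had an unmet health care need in the past 12 months',
        'No, did not have an unmet health care need in the past 12 months',
    ],
    'value': [24.4, 75.6, 30.0, 70.0],
})

# Capture a leading 'Yes' or 'No' as a whole word; other labels become NaN
answer = toy['healthcare_access_experience'].str.extract(r'^(Yes|No)\b', expand=False)

# groupby skips NaN keys, so unmatched labels are left out of both means
means = toy.groupby(answer)['value'].mean()
print(means)
```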

## Store Cleaned Data as Parquet

In [None]:
# save as parquet
hc_cleaned_df.to_parquet('data/healthcare-access/41100081-eng/healthcare_access.parquet', index=False)

# Second Healthcare Access Dataset

In [58]:
healthcare_df2 = data_loader('healthcare-access', '41100040')
healthcare_df2.head()

Unnamed: 0,REF_DATE,GEO,DGUID,Aboriginal identity,Age group,Sex,Access to and use of health care services,Statistics,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2017,Canada,2016A000011124,"Total, Aboriginal identity","Total, 15 years and over",Both sexes,"Total, regular medical doctor status",Number of persons,Persons,249,units,0,v1234407391,1.1.1.1.1.1,998520.0,,,,0
1,2017,Canada,2016A000011124,"Total, Aboriginal identity","Total, 15 years and over",Both sexes,"Total, regular medical doctor status",Percent,Percent,239,units,0,v1234407392,1.1.1.1.1.2,100.0,,,,1
2,2017,Canada,2016A000011124,"Total, Aboriginal identity","Total, 15 years and over",Both sexes,"Total, regular medical doctor status",Low 95% confidence interval,Percent,239,units,0,v1234407393,1.1.1.1.1.3,100.0,,,,1
3,2017,Canada,2016A000011124,"Total, Aboriginal identity","Total, 15 years and over",Both sexes,"Total, regular medical doctor status",High 95% confidence interval,Percent,239,units,0,v1234407394,1.1.1.1.1.4,100.0,,,,1
4,2017,Canada,2016A000011124,"Total, Aboriginal identity","Total, 15 years and over",Both sexes,Has a regular medical doctor,Number of persons,Persons,249,units,0,v1234407395,1.1.1.1.2.1,794220.0,,,,0


In [59]:
healthcare_df2.describe()

Unnamed: 0,REF_DATE,UOM_ID,SCALAR_ID,VALUE,SYMBOL,TERMINATED,DECIMALS
count,191100.0,191100.0,191100.0,92828.0,0.0,0.0,191100.0
mean,2017.0,241.5,0.0,3077.978569,,,0.75
std,0.0,4.330138,0.0,18807.493791,,,0.433014
min,2017.0,239.0,0.0,0.1,,,0.0
25%,2017.0,239.0,0.0,30.2,,,0.75
50%,2017.0,239.0,0.0,84.6,,,1.0
75%,2017.0,241.5,0.0,100.0,,,1.0
max,2017.0,249.0,0.0,998520.0,,,1.0


In [62]:
# keep rows with VALUE below 100; larger values are person counts or 100% totals, not outliers
hc_cleaned_df2 = healthcare_df2[healthcare_df2['VALUE'] < 100]

In [63]:
hc_cleaned_df2.describe()

Unnamed: 0,REF_DATE,UOM_ID,SCALAR_ID,VALUE,SYMBOL,TERMINATED,DECIMALS
count,52716.0,52716.0,52716.0,52716.0,0.0,0.0,52716.0
mean,2017.0,239.125199,0.0,43.849014,,,0.98748
std,0.0,1.111908,0.0,30.205379,,,0.111191
min,2017.0,239.0,0.0,0.1,,,0.0
25%,2017.0,239.0,0.0,16.4,,,1.0
50%,2017.0,239.0,0.0,36.5,,,1.0
75%,2017.0,239.0,0.0,74.1,,,1.0
max,2017.0,249.0,0.0,99.9,,,1.0
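
The summary above still has a `UOM_ID` maximum of 249 (the "Persons" unit in the preview), which suggests that some small person counts passed the `VALUE < 100` filter. One way to audit a threshold filter is to count the surviving rows per unit. The sketch below uses a toy frame; apart from 998520.0 and 100.0, the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for healthcare_df2; 80.0 is a hypothetical small count
toy2 = pd.DataFrame({
    'UOM': ['Persons', 'Persons', 'Percent', 'Percent', 'Percent'],
    'VALUE': [998520.0, 80.0, 100.0, 79.5, None],
})

kept = toy2[toy2['VALUE'] < 100]

# Any unit other than 'Percent' here means counts leaked through the filter
unit_counts = kept['UOM'].value_counts().to_dict()
print(unit_counts)
```

In this toy case the filter keeps one person count and drops the 100.0 percent row. Both effects are avoided by filtering on `UOM == 'Percent'` instead.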
