Primary Questions:
- What factors influence Indigenous healthcare access?
- How do patterns of healthcare access differ among Indigenous groups?

> Health Gaps: Compare healthcare access between Indigenous and non-Indigenous populations

> Within-group Analysis: Compare First Nations, Métis, and Inuit populations, or rural and urban Indigenous populations

In [36]:
import pandas as pd

# Healthcare Access

## Load Data

In [37]:
def data_loader(data_type, table_name):
    """Load data/<data_type>/<table_name>-eng/<table_name>.csv into a DataFrame."""
    path = f'data/{data_type}/{table_name}-eng/{table_name}.csv'
    return pd.read_csv(path)

In [38]:
healthcare_df = data_loader('healthcare-access', '41100081')
healthcare_df.head()

Unnamed: 0,REF_DATE,GEO,DGUID,Selected characteristics of health care access and experiences,Indigenous group,Gender,Statistics,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2024,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,"Total, gender",Percentage,Percent,239,units,0,v1663833852,1.1.1.1.1,100.0,,,,1
1,2024,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,"Total, gender",Low 95% confidence interval,Percent,239,units,0,v1663833853,1.1.1.1.2,100.0,,,,1
2,2024,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,"Total, gender",High 95% confidence interval,Percent,239,units,0,v1663833854,1.1.1.1.3,100.0,,,,1
3,2024,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,Men+,Percentage,Percent,239,units,0,v1663833855,1.1.1.2.1,100.0,,,,1
4,2024,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,Men+,Low 95% confidence interval,Percent,239,units,0,v1663833856,1.1.1.2.2,100.0,,,,1


In [39]:
# Check data types
healthcare_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1566 entries, 0 to 1565
Data columns (total 18 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   REF_DATE                                                        1566 non-null   int64  
 1   GEO                                                             1566 non-null   object 
 2   DGUID                                                           1566 non-null   object 
 3   Selected characteristics of health care access and experiences  1566 non-null   object 
 4   Indigenous group                                                1566 non-null   object 
 5   Gender                                                          1566 non-null   object 
 6   Statistics                                                      1566 non-null   object 
 7   UOM                                                

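The `VALUE` column holds, for each Indigenous group and gender category, the percentage of respondents who reported a given healthcare access characteristic or experience. Depending on the `Statistics` column, a row contains either that percentage or the lower or upper bound of its 95% confidence interval.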

## Rename Column Names


In [40]:
healthcare_df.columns

Index(['REF_DATE', 'GEO', 'DGUID',
       'Selected characteristics of health care access and experiences',
       'Indigenous group', 'Gender', 'Statistics', 'UOM', 'UOM_ID',
       'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS',
       'SYMBOL', 'TERMINATED', 'DECIMALS'],
      dtype='object')

In [41]:
# change column names to lower case
healthcare_df.columns = healthcare_df.columns.str.lower()

In [42]:
# replace spaces with underscores
healthcare_df.columns = healthcare_df.columns.str.replace(' ', '_')

In [43]:
# rename `selected characteristics of health care access and experiences` to `healthcare_access_experience`
healthcare_df.rename(columns={'selected_characteristics_of_health_care_access_and_experiences': 'healthcare_access_experience'}, inplace=True)
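
The three renaming steps above can also be expressed as a single helper. The following is a minimal sketch rather than the notebook's own code: `clean_columns` is a hypothetical name, and the toy frame only reuses a few of the raw column names listed earlier.

```python
import pandas as pd

# Toy frame (no rows) that reuses a few of the raw column names
raw = pd.DataFrame(columns=[
    'REF_DATE',
    'Selected characteristics of health care access and experiences',
    'Indigenous group',
    'VALUE',
])

def clean_columns(df):
    """Return a copy with lower-case, underscore-separated column names."""
    out = df.copy()
    # Lower-case first, then replace literal spaces with underscores
    out.columns = out.columns.str.lower().str.replace(' ', '_', regex=False)
    # Shorten the long descriptive column name
    return out.rename(columns={
        'selected_characteristics_of_health_care_access_and_experiences':
            'healthcare_access_experience'
    })

cleaned = clean_columns(raw)
print(list(cleaned.columns))
```

Returning a copy instead of renaming in place leaves the raw frame untouched, which makes the cell safe to re-run.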


## Statistics of the Data

In [44]:
# statistics of the numerical columns
healthcare_df.describe()

Unnamed: 0,ref_date,uom_id,scalar_id,value,symbol,terminated,decimals
count,1566.0,1566.0,1566.0,1530.0,0.0,0.0,1566.0
mean,2024.0,239.0,0.0,51.694575,,,1.0
std,0.0,0.0,0.0,34.336181,,,0.0
min,2024.0,239.0,0.0,2.0,,,1.0
25%,2024.0,239.0,0.0,20.9,,,1.0
50%,2024.0,239.0,0.0,43.25,,,1.0
75%,2024.0,239.0,0.0,89.275,,,1.0
max,2024.0,239.0,0.0,100.0,,,1.0


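`ref_date`, `uom_id`, `scalar_id`, and `decimals` each have a standard deviation of 0 and identical minimum and maximum, so every row holds the same value.

`symbol` and `terminated` have a count of 0, so they contain no non-null values.

Therefore, `value` is the only numerical column useful for analysis. Its count of 1530 out of 1566 rows shows that 36 values are missing.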

In [45]:
# categorical columns
healthcare_df.describe(include=['object'])

Unnamed: 0,geo,dguid,healthcare_access_experience,indigenous_group,gender,statistics,uom,scalar_factor,vector,coordinate,status
count,1566,1566,1566,1566,1566,1566,1566,1566,1566,1566,36
unique,1,1,58,3,3,3,1,1,1566,1566,1
top,Canada,2021A000011124,"Total, unmet health care needs in the past 12 ...",First Nations,"Total, gender",Percentage,Percent,units,v1663833852,1.1.1.1.1,F
freq,1566,1566,27,522,522,522,1566,1566,1,1,36


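`geo`, `dguid`, `uom`, and `scalar_factor` each have a single unique value across all 1566 rows, so they are not useful for analysis. `vector` and `coordinate` are unique in every row, so they act as row identifiers rather than analytical variables. `status` is populated in only 36 rows, always with the flag `F`.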
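
The same screening can be done programmatically instead of reading the `describe()` output by eye. This is a minimal sketch on a toy frame: `uninformative_columns` is a hypothetical helper, and the toy values are for illustration only.

```python
import pandas as pd

def uninformative_columns(df):
    """Return (all-null columns, single-valued columns) of df."""
    # nunique ignores missing values, so an all-null column counts 0
    n_unique = df.nunique(dropna=True)
    empty = n_unique[n_unique == 0].index.tolist()
    constant = n_unique[n_unique == 1].index.tolist()
    return empty, constant

# Toy frame for illustration only
toy = pd.DataFrame({
    'geo': ['Canada', 'Canada', 'Canada'],
    'symbol': [None, None, None],
    'indigenous_group': ['First Nations', 'Métis', 'Inuk (Inuit)'],
    'value': [24.4, 19.2, None],
})

empty, constant = uninformative_columns(toy)
print('all-null:', empty)
print('constant:', constant)
```

A sparsely populated column with one distinct non-null value would also land in the constant list, so it is worth checking non-null counts before dropping anything the helper reports.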

In [49]:
# unique values in the categorical columns
for col in ['indigenous_group', 'gender', 'statistics']:
    print(f'{col}: {healthcare_df[col].nunique()} unique values')
    print(healthcare_df[col].unique())
    print()

indigenous_group: 3 unique values
['First Nations' 'Métis' 'Inuk (Inuit)']

gender: 3 unique values
['Total, gender' 'Men+' 'Women+']

statistics: 3 unique values
['Percentage' 'Low 95% confidence interval' 'High 95% confidence interval']



For each estimate, we can be 95% confident that the true percentage lies between the `Low 95% confidence interval` and `High 95% confidence interval` values.

In [48]:
healthcare_df[
    (healthcare_df['indigenous_group'] == 'First Nations') &
    (healthcare_df['gender'] == 'Men+') &
    (healthcare_df['healthcare_access_experience'] == 'Yes, had an unmet health care need in the past 12 months')
]

Unnamed: 0,ref_date,geo,dguid,healthcare_access_experience,indigenous_group,gender,statistics,uom,uom_id,scalar_factor,scalar_id,vector,coordinate,value,status,symbol,terminated,decimals
30,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,Percentage,Percent,239,units,0,v1663833882,1.2.1.2.1,24.4,,,,1
31,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,Low 95% confidence interval,Percent,239,units,0,v1663833883,1.2.1.2.2,19.2,,,,1
32,2024,Canada,2021A000011124,"Yes, had an unmet health care need in the past...",First Nations,Men+,High 95% confidence interval,Percent,239,units,0,v1663833884,1.2.1.2.3,30.3,,,,1
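
An alternative to discarding the interval bounds is to pivot the three `statistics` rows into columns, so each estimate stays next to its confidence interval. The sketch below rebuilds only the three rows shown above; the column names `estimate`, `ci_low`, `ci_high`, and `ci_width` are assumptions introduced for illustration.

```python
import pandas as pd

# The three First Nations / Men+ rows from the output above, in long format
long_df = pd.DataFrame({
    'indigenous_group': ['First Nations'] * 3,
    'gender': ['Men+'] * 3,
    'healthcare_access_experience':
        ['Yes, had an unmet health care need in the past 12 months'] * 3,
    'statistics': ['Percentage',
                   'Low 95% confidence interval',
                   'High 95% confidence interval'],
    'value': [24.4, 19.2, 30.3],
})

# One row per group/gender/experience, one column per statistic
wide_df = (
    long_df
    .pivot(index=['indigenous_group', 'gender', 'healthcare_access_experience'],
           columns='statistics', values='value')
    .rename(columns={
        'Percentage': 'estimate',
        'Low 95% confidence interval': 'ci_low',
        'High 95% confidence interval': 'ci_high',
    })
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'statistics' axis label

# Interval width is a rough indicator of how precise the estimate is
wide_df['ci_width'] = wide_df['ci_high'] - wide_df['ci_low']
print(wide_df[['estimate', 'ci_low', 'ci_high', 'ci_width']])
```

Keeping the bounds makes it possible to judge later whether an apparent difference between two groups could be explained by sampling uncertainty alone.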


For the rest of the analysis, we keep only the point estimates, i.e. rows where `statistics == 'Percentage'`.

In [50]:
# filter statistics == 'Percentage'
hc_filtered_df = healthcare_df[healthcare_df['statistics'] == 'Percentage']

## Drop Unnecessary Columns

In [51]:
hc_filtered_df.columns

Index(['ref_date', 'geo', 'dguid', 'healthcare_access_experience',
       'indigenous_group', 'gender', 'statistics', 'uom', 'uom_id',
       'scalar_factor', 'scalar_id', 'vector', 'coordinate', 'value', 'status',
       'symbol', 'terminated', 'decimals'],
      dtype='object')

In [52]:
selected_cols = ['indigenous_group', 'gender', 'healthcare_access_experience', 'value']

hc_filtered_df = hc_filtered_df[selected_cols]

## Drop Missing Values 

In [53]:
# Check NA value counts and proportions
na_counts = hc_filtered_df.isna().sum()
na_proportions = hc_filtered_df.isna().mean()
na_summary = pd.DataFrame({'count': na_counts, 'proportion': na_proportions})
na_summary

Unnamed: 0,count,proportion
indigenous_group,0,0.0
gender,0,0.0
healthcare_access_experience,0,0.0
value,12,0.022989


Missing values make up only ~2% of the rows (12 of 522), so we can drop them.
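
Before dropping, it may be worth checking whether the missing values cluster in one group and whether they coincide with a `status` flag, since non-random gaps can bias group comparisons. The following is a sketch on a toy frame with made-up values, not the real table. Note that `status` was removed from `hc_filtered_df` above, so on the real data this check would have to run against `healthcare_df`.

```python
import pandas as pd

# Illustrative values only; not taken from the real table
toy = pd.DataFrame({
    'indigenous_group': ['First Nations', 'First Nations', 'Métis',
                         'Inuk (Inuit)', 'Inuk (Inuit)'],
    'value': [24.4, 10.0, 20.0, None, None],
    'status': [None, None, None, 'F', 'F'],
})

# Share of missing `value` entries within each group
na_by_group = toy['value'].isna().groupby(toy['indigenous_group']).mean()
print(na_by_group)

# Cross-tabulate missing values against the presence of a status flag
ct = pd.crosstab(toy['value'].isna(), toy['status'].notna(),
                 rownames=['value_missing'], colnames=['has_status'])
print(ct)
```

If the missing share is much higher for one group, dropping those rows removes more information about that group than the overall ~2% figure suggests.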

In [56]:
# Drop rows with NA values
hc_cleaned_df = hc_filtered_df.dropna()
print(f"Dropped {hc_filtered_df.shape[0] - hc_cleaned_df.shape[0]} rows with NA values")

Dropped 12 rows with NA values
