## Introduction

Quality of life (QoL) refers to an individual's perception of their position in life, covering physical health, mental health, and social relationships, relative to their expectations and cultural context (World Health Organization, 1995).  In health research, QoL is affected by many factors, including health drivers and health barriers.  Health drivers such as physical activity and mental health consultations typically have a positive influence on QoL, while health barriers like smoking, alcohol consumption, and cannabis use are often linked to negative health outcomes (Haskell et al. 2007; Rehm et al., 2009).  

This project aims to explore the relationship between these health drivers and barriers, focusing on their impact on physical and mental health outcomes.  Using Visualizations and correlation statistics, we will analyze variables such as body mass index (BMI), self-reported health status, and stress to evaluate the influence of health barriers.  Following this, we will perform subgroup analyses to determine how age, sex, region, and education are impacted by these health barriers. Depending on our findings we will either help identify the most vulnerable populations and suggest targeted interventions to improve their QoL, or try to suggest more appropriate ways to analyze this dataset. 

## Dataset and Guiding Questions

We chose the Canadian Community Health Survey as our dataset. Specifically, we will be using the 2019-2020 microdata file available at Statistics Canada (Statistics Canada, 2023).  

The Canadian Community Health Survey is an annual survey that collects data from respondents across Canada and differentiates them by their province and respective health regions. The survey is voluntary response but is designed to have a large representative sample, all identifying data is removed prior to the release of the microdata file. The survey collects data on a wide variety of health indicators, potential health determinants, and plenty of demographic data to assist in analysis. The datafile comes in the form of a CSV file and comes with a data dictionary and a very accessible format given the sheer size of the CSV file.  

The dataset has 691 individual columns including the respondent’s record number with the columns representing the variables. For each of these 691 variables there are 108,252 individual responses as the rows for the data file. We plan to choose a specific number of variables and analyze them according to the questions listed below. 

Our guiding questions can be listed into two categories:

a. Health Drivers Vs. Health Barriers
   1. How does alcohol consumption affect mental health among different age groups?
   2. How does cannabis use impact stress levels?
   3. What is the relationship between smoking and physical health outcomes?
   4. Which demographic groups (age, sex, region) are most affected by health barriers?
   
b. Health Drivers Vs. Health Improvements
   1. What is the relationship between regular exercise and self-reported health status?
   2. How does stress influence the maintenance or improvement of mental health?

In [4]:
#Make sure all appropriate libraries loaded
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import plotly.express as px
from pydataset import data


In [7]:
#Load dataset attached to submission
df=pd.read_csv('pumf_cchs.csv')

In [15]:
#This should filter all of the variables we want on the general level. You will need to determine which variables you will use
#IMPORTANT I did not include all the CCC values for chronic disease here. I can look and see what we need, and try to adjust shortly
#If you think anything is missing here, feel free to change
df_general=df.filter(items=[
    'ALC_015',
    'ALC_020',
    'GEN_010',
    'GEN_015',
    'GEN_020',
    'GEN_025',
    'CCC_195',
    'GEN_005',
    'HWTDGISW',
    'CAN_015',
    'SMK_005',
    'SMK_060',
    'PAADVACV',
    'PAA_030',
    'PAA_060',
    'PAA_095',
    'SBE_005',
    'SBE_010',
    'PAA_005',
    'DHHGAGE',
    'DHH_SEX',
    'GEOGPRV',
    'EHG2DVH3',
    'HWT_050',
    'PEX_005',
    'ADM_RNO1',
    'GENDVHDI'
    ])
print(df_general.head())

   ALC_015  ALC_020  GEN_010  GEN_015  GEN_020  GEN_025  CCC_195  GEN_005  \
0      5.0      3.0      9.0      3.0      2.0      2.0      2.0      3.0   
1      1.0      1.0      4.0      3.0      3.0      6.0      1.0      3.0   
2     96.0     96.0      7.0      3.0      3.0      6.0      2.0      2.0   
3     96.0     96.0      8.0      3.0      3.0      6.0      2.0      3.0   
4     96.0     96.0      0.0      5.0      4.0      6.0      2.0      5.0   

   HWTDGISW  CAN_015  ...  SBE_010  PAA_005  DHHGAGE  DHH_SEX  GEOGPRV  \
0       1.0      2.0  ...      6.0      2.0      3.0      2.0     47.0   
1       2.0      2.0  ...      6.0      2.0      5.0      1.0     47.0   
2       2.0      2.0  ...      1.0      6.0      5.0      2.0     59.0   
3       2.0      2.0  ...      6.0      6.0      5.0      1.0     13.0   
4       2.0      2.0  ...      6.0      6.0      4.0      1.0     46.0   

   EHG2DVH3  HWT_050  PEX_005  ADM_RNO1  GENDVHDI  
0       3.0      3.0     96.0      1000 

## Analysis

### Health Drivers Vs. Health Barriers

#### Question 1 - How does alcohol consumption affect mental health among different age groups? 

To determine how Alcohol Consumption Affected Mental Health Among Different Age Groups, the data was first cleaned.  Due to the size of the initial dataset, any data columns containing unnecessary variables were removed.  Following this, it was made sure there were no missing values, and removed any values that were of no interest to my analysis, such as ‘valid skip’, do not know, and refuse to say.​  As they were not pertinent to the analysis, we felt it was safe to remove them.

In [None]:
#Make a copy of our initial filtered data, and then filter for our question
df3=df_general.copy()
df_q3=df3.filter(items=['ALC_015','ALC_020','GEN_010','GEN_015','GEN_020','GEN_025','CCC_195','DHHGAGE'])

In [None]:
#Some basic info on the dataset
print(df_q3.shape)
print(df_q3.info())

In [None]:
#Ensure no null values
print(df_q3.isnull().sum())

In [None]:
#Removed invalid values that relected responses like unknown, did not answer, or valid skip
df_q3=df_q3[~df_q3['GEN_010'].isin(range(97,100))]
df_q3=df_q3[~df_q3['GEN_015'].isin(range(7,10))]
df_q3=df_q3[~df_q3['GEN_020'].isin(range(7,9))]
df_q3=df_q3[~df_q3['GEN_025'].isin(range(6,10))]
df_q3=df_q3[~df_q3['ALC_015'].isin(range(96,100))]
df_q3=df_q3[~df_q3['ALC_020'].isin(range(96,100))]
df_q3=df_q3[~df_q3['CCC_195'].isin(range(7,9))]

Upon inspection of the data dictionary, it was discovered that some likret scales went in the direction of negative to positive, while other went from positive to negative.  To avoid confusion, the variable order was changed so that all progressed in a single logical direction​.

In [None]:
#Convert GEN_015,_020,_025 so they progress in a logical orderm with 1=more negative and 5=more positive
gen_015_new_order={1:5,2:4,3:3,4:2,5:1}
gen_020_new_order={1:5,2:4,3:3,4:2,5:1}
gen_025_new_order={1:5,2:4,3:3,4:2,5:1}

df_q3['GEN_015'] = df_q3['GEN_015'].map(gen_015_new_order)
df_q3['GEN_020'] = df_q3['GEN_020'].map(gen_015_new_order)
df_q3['GEN_025'] = df_q3['GEN_025'].map(gen_015_new_order)

In [None]:
#Create labels that represent the values listed in the data dictionary
perceived_mental_health={1: 'Poor', 2: 'Fair', 3: 'Good', 4: 'Very good', 5: 'Excellent'}
perceived_life_stress= {1: 'Extremely stressful',2: 'Quite a bit stressful',3: 'A bit stressful',4: 'Not very stressful',5: 'Not at all stressful'}
perceived_work_stress={1: 'Extremely stressful',2: 'Quite a bit stressful',3: 'A bit stressful',4: 'Not very stressful',5: 'Not at all stressful'}
alc_freq={
    1: 'Less than once/month',2: 'Once/month',3: '2-3 times/month',
    4: 'Once/week',5: '2-3 times/week',6: '4-6 times/week',7: 'Every day'}
alc_binge_freq= {1: 'Never',2: 'Less than once/month',3: 'Once/month',4: '2-3 times/month',
                 5: 'Once/week',6: 'More than once/week',}
age_group={1:"12 to 17 years",2:"18 to 34 years",3:"35 to 49 years",4:"50 to 64 years",5:"65 and older"}

In [None]:
#Create new columns that list the previously created labels
df_q3["perceived_mental_health"] = df_q3[["GEN_015"]].copy().replace({"GEN_015": perceived_mental_health})
df_q3["perceived_life_stress"] = df_q3[["GEN_020"]].copy().replace({"GEN_020": perceived_life_stress})
df_q3["perceived_work_stress"] = df_q3[["GEN_025"]].copy().replace({"GEN_025": perceived_work_stress})
df_q3["alc_freq"] = df_q3[["ALC_015"]].copy().replace({"ALC_015": alc_freq})
df_q3["alc_binge_freq"] = df_q3[["ALC_020"]].copy().replace({"ALC_020": alc_binge_freq})
df_q3["age_group"] = df_q3[["DHHGAGE"]].copy().replace({"DHHGAGE": age_group})

In [None]:
#Make a copy of the dataset, so if we make a mistake we don't need to start from the beginning
df_q3a=df_q3.copy()

In [None]:
#Create a correlation amongst our variables of interest, and then plot them in a heatmap
labels = {
    'ALC_015': 'Alcohol Use Frequency',
    'ALC_020': 'Binge Drinking Frequency',
    'GEN_010': 'Life Satisfaction',
    'GEN_015': 'Self-Perceived Mental Health',
    'GEN_020': 'Perceived Life Stress',
    'CCC_195': 'Mood Disorders',
    'DHHGAGE': 'Age Group'
}
correlations = df_q3a[['ALC_015', 'ALC_020', 'GEN_010', 'GEN_015', 'GEN_020', 'CCC_195','DHHGAGE']].corr()
correlations.rename(columns=labels,index=labels, inplace=True)
print("Correlation Matrix:")
print(correlations)

#correlation heatmap
sns.heatmap(correlations, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Between Alcohol Use and Mental Health Variables")

plt.show()


Based on our correlation of all our variables of interest, we most see weak relationships, with a few moderate relationships, none of which were very surprising.  The strongest relationships were seen between alcohol consumpiton frequency and binge drinking frequency, and between life satisfaction and self-perceived mental health.  We also see a weak correlation between any alcohol consumption variables and mental health variables.

In [None]:
#Create multiple correlations based on age group
def correlation_by_age_group(df_q3a, age_groups):
    subset = df[df['DHHGAGE'] == age_groups]
    correlation_matrix = subset[['ALC_015', 'ALC_020', 'GEN_010', 'GEN_015', 'GEN_020', 'CCC_195']].corr()
    return correlation_matrix

correlation_age_2 = correlation_by_age_group(df_q3a, 2)
correlation_age_3 = correlation_by_age_group(df_q3a, 3)
correlation_age_4 = correlation_by_age_group(df_q3a, 4)
print(correlation_age_2,'\n')
print(correlation_age_3,'\n')
print(correlation_age_4,'\n')

In [None]:
#Plot the different heatmap for each correlation based on age group
labels = {
    'ALC_015': 'Alcohol Use Frequency',
    'ALC_020': 'Binge Drinking Frequency',
    'GEN_010': 'Life Satisfaction',
    'GEN_015': 'Self-Perceived Mental Health',
    'GEN_020': 'Perceived Life Stress',
    'CCC_195': 'Mood Disorders',
    'DHHGAGE': 'Age Group'
}


correlation_age_2 = correlation_by_age_group(df_q3a, 2)
correlation_age_3 = correlation_by_age_group(df_q3a, 3)
correlation_age_4 = correlation_by_age_group(df_q3a, 4)

correlation_age_2.rename(columns=labels,index=labels, inplace=True)
correlation_age_3.rename(columns=labels,index=labels, inplace=True)
correlation_age_4.rename(columns=labels,index=labels, inplace=True)

#heatmaps for the 3 age groups
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

sns.heatmap(correlation_age_2, annot=True, cmap='coolwarm', center=0, ax=axes[0, 0])
axes[0, 0].set_title('Correlation Heatmap for Age Group 18-34')

sns.heatmap(correlation_age_3, annot=True, cmap='coolwarm', center=0, ax=axes[0, 1])
axes[0, 1].set_title('Correlation Heatmap for Age Group 35-49')

sns.heatmap(correlation_age_4, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])
axes[1, 0].set_title('Correlation Heatmap for Age Group 50-64')

fig.delaxes(axes[1, 1])  #rmv empty plot

plt.tight_layout()
plt.show()

From the above output, we can see the correlations of our variables of interest, segmented by age categories. By segmenting by age categories we are able to highlight how differening age groups experience the relationships between alcohol use and mental health differently. A few things to note are:

- The relationship between alcohol consumption frequency and binge drinking frequency is stronger for all age groups.
- The relationship between self-perceived mental health and life satisfaction is highest in the 18-34 age group.
- The relationship between self-perceived mental health and mood disorders has a negative correlation for all age groups, suggesting that individuals with mood disorders are will more frequently have lower self-perceived mental health
- The relationship between perceived life stress and mood disorders has a negative correlation for all age groups, suggesting that those with mood disorders are more life to have higher perceived life stress

Despite this, similar to our previous correlation with all variables, we continue to see a weak correlation between alcohol consumption variables and mental health variables, indicating that there are additional variables that play a large role in affecting mental health.

In [None]:
#Look at the mean value for the various age groups for each variable of interest
age_group_means = df_q3a.groupby('age_group')[['ALC_015', 'ALC_020', 'GEN_010', 'GEN_015', 'GEN_020', 'GEN_025', 'CCC_195']].mean()

print(age_group_means)

In [None]:
#Creates a dataframe to give the proportion of GEN_015, grouped by ALC_015
avg_counts_alc_freq=df_q3a.groupby('ALC_015', as_index=False)['GEN_015'].value_counts(normalize=True, sort=False)
avg_counts_alc_freq

In [None]:
#Plotting alcohol consumption freq and perceived mental health proportions
#change the color palette
cmap = plt.get_cmap('coolwarm') 
colors = cmap(np.linspace(0, 1, 5))

plt.figure().set_figwidth(16)
plt.bar(avg_counts_alc_freq['ALC_015'].unique()-0.2, avg_counts_alc_freq['proportion'][::5], width=0.1, label='Poor', color=colors[0])
plt.bar(avg_counts_alc_freq['ALC_015'].unique()-0.1, avg_counts_alc_freq['proportion'][1::5], width=0.1, label='Fair', color=colors[1])
plt.bar(avg_counts_alc_freq['ALC_015'].unique(), avg_counts_alc_freq['proportion'][2::5], width=0.1, label='Good', color=colors[2])
plt.bar(avg_counts_alc_freq['ALC_015'].unique()+0.1, avg_counts_alc_freq['proportion'][3::5], width=0.1, label='Very good', color=colors[3])
plt.bar(avg_counts_alc_freq['ALC_015'].unique()+0.2, avg_counts_alc_freq['proportion'][4::5], width=0.1, label='Excellent', color=colors[4])
plt.xticks(avg_counts_alc_freq['ALC_015'].unique()+0.1/2,('Less than once/month','Once/month','2-3 times/month','Once/week','2-3 times/week',
                                                    '4-6 times/week','Every day'))
plt.xlabel('Perceived Mental Health Level by Alcohol Consumption Frequency')
plt.ylabel('Proportion')
plt.legend()
plt.show()


Based on output above, we can see that as alcohol consumption increase from less than once per month to 2-3 times per week, the proportion of individuals self perceived mental health improves slightly, with a larger proportion reporting excellent perceived mental health when consuming alcohol 2-3 times per week.  However for individuals that drink 4-6 times per week, or every day, the proportion that report excellent mental health drops. This may indicate that those drinking more frequently may temporarily increase their perceived mental health, but increasing the alcohol consumption frequency by too much will lead to worse self perceived mental health. 

In [None]:
#Creates a dataframe to give the proportion of GEN_015, grouped by ALC_020
avg_counts_binge_freq=df_q3a.groupby('ALC_020', as_index=False)['GEN_015'].value_counts(normalize=True, sort=False)
avg_counts_binge_freq

In [None]:
#Plotting Binge freq and perceived mental health proportions
#change the color palette
cmap = plt.get_cmap('coolwarm')
colors = cmap(np.linspace(0, 1, 5))

plt.figure().set_figwidth(16)
plt.bar(avg_counts_binge_freq['ALC_020'].unique()-0.2,avg_counts_binge_freq['proportion'][::5],width=0.1,label='Poor',color=colors[0])
plt.bar(avg_counts_binge_freq['ALC_020'].unique()-0.1,avg_counts_binge_freq['proportion'][1::5],width=0.1, label='Fair',color=colors[1])
plt.bar(avg_counts_binge_freq['ALC_020'].unique(),avg_counts_binge_freq['proportion'][2::5],width=0.1, label='Good',color=colors[2])
plt.bar(avg_counts_binge_freq['ALC_020'].unique()+0.1,avg_counts_binge_freq['proportion'][3::5],width=0.1, label='Very good',color=colors[3])
plt.bar(avg_counts_binge_freq['ALC_020'].unique()+0.2,avg_counts_binge_freq['proportion'][4::5],width=0.1, label='Excellent',color=colors[4])
plt.xticks(avg_counts_binge_freq['ALC_020'].unique()+0.1/2,('Never','Less than once/month','Once/month','2-3 times/month','Once/week',
                                                    'More than once/week'))
plt.xlabel('Perceived Mental Health Level by Binge Drinking Frequency')
plt.ylabel('Proportion')
plt.legend()
plt.show()

Based on output above, we can see that as binge drinking frequency increase, the proportion of individuals self perceived mental health steadily decreases, with individuals who binge drink more than once a week reporting the lowest proportions of very good and excellent mental health, while reporting the highest proportions of poor and fair mental health.  This suggests a possible link between binge drinking and worse perceived mental health.

In [None]:
#Plots average self-perceived mental health by alcohol consumption
avg_scores_num=df_q3a.groupby(
    ['ALC_015','ALC_020'])['GEN_015'].mean().reset_index()
alc_015_labels = {
    1: 'Less than once/month',
    2: 'Once/month',
    3: '2-3 times/month',
    4: 'Once/week',
    5: '2-3 times/week',
    6: '4-6 times/week',
    7: 'Every day'
}

# Mapping for ALC_020
alc_020_labels = {
    1: 'Never',
    2: 'Less than once/month',
    3: 'Once/month',
    4: '2-3 times/month',
    5: 'Once/week',
    6: 'More than once/week',
}
cmap = plt.get_cmap('coolwarm')
colors = cmap(np.linspace(0, 1, len(alc_020_labels)))
# Replace the numeric values with the corresponding labels
avg_scores_num['ALC_015'] = avg_scores_num['ALC_015'].map(alc_015_labels)
avg_scores_num['ALC_020'] = avg_scores_num['ALC_020'].map(alc_020_labels)

alc_020_order = list(alc_020_labels.values())  # Ensure correct order for categories
color_map = dict(zip(alc_020_order, colors))
# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(data=avg_scores_num, x='ALC_015', y='GEN_015', hue='ALC_020', errorbar='sd',palette=color_map)
plt.title('Average Self-Perceived Mental Health by Alcohol Consumption')
plt.xlabel('Frequency of Alcohol Use')
plt.ylabel('Average Self-Perceived Mental Health')
plt.legend(title='Binge Drinking Frequency', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

To visualize the relationship between alcohol consumption and mental health among different age group, a barchart was created to compare the average self-perceived mental health by both frequency of alcohol use and frequency of binge drinking.  One of the main things noticed was the effect binge drinking had on mental health, with the avg self-perceived mental health of the most frequent binge drinkers dropping by almost 0.5 compared to those that did not​

In [None]:
#Plots average self-perceived mental health by age group and alcohol consumption
avg_scores2 = df_q3a.groupby(['DHHGAGE', 'ALC_015'])['GEN_015'].mean().reset_index()
alc_015_labels = {
    1: 'Less than once/month',
    2: 'Once/month',
    3: '2-3 times/month',
    4: 'Once/week',
    5: '2-3 times/week',
    6: '4-6 times/week',
    7: 'Every day'
}
age_group={1:"12 to 17 years",2:"18 to 34 years",3:"35 to 49 years",4:"50 to 64 years",5:"65 and older"}

cmap = plt.get_cmap('coolwarm')
colors = cmap(np.linspace(0, 1, len(alc_015_labels)))

avg_scores2['ALC_015'] = avg_scores2['ALC_015'].map(alc_015_labels)
avg_scores2['DHHGAGE'] = avg_scores2['DHHGAGE'].map(age_group)

alc_015_order = list(alc_015_labels.values())  # Ensure correct order for categories
color_map = dict(zip(alc_015_order, colors))

# Create a bar plot

plt.figure(figsize=(12, 6))
sns.barplot(x='DHHGAGE', y='GEN_015', hue='ALC_015', data=avg_scores2,palette=color_map)
plt.title('Average Self-Perceived Mental Health by Age Group and Alcohol Consumption')
plt.xlabel('Age Group')
plt.ylabel('Average Self-Perceived Mental Health')
plt.legend(title='Frequency of Alcohol Use',bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

From the above figure, we can see that as frequency of alcohol consumption increase, there is a slight increase in the average perceived mental health, most notably in those 18-34 and 50-64.  However, as individuals drink 4-6 times per week, or more, there is a decrease in average self perceived mental health.

In [None]:
#Plots average self-perceived mental health by age group and binge frequency
avg_scores3 = df_q3a.groupby(['DHHGAGE', 'ALC_020'])['GEN_015'].mean().reset_index()
alc_020_labels = {
    1: 'Never',
    2: 'Less than once/month',
    3: 'Once/month',
    4: '2-3 times/month',
    5: 'Once/week',
    6: 'More than once/week',
}
age_group={1:"12 to 17 years",2:"18 to 34 years",3:"35 to 49 years",4:"50 to 64 years",5:"65 and older"}
cmap = plt.get_cmap('coolwarm')
colors = cmap(np.linspace(0, 1, len(alc_020_labels)))

avg_scores3['ALC_020'] = avg_scores3['ALC_020'].map(alc_020_labels)
avg_scores3['DHHGAGE'] = avg_scores3['DHHGAGE'].map(age_group)

alc_020_order = list(alc_020_labels.values())  
color_map = dict(zip(alc_020_order, colors))
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='DHHGAGE', y='GEN_015', hue='ALC_020', data=avg_scores3,palette=color_map)
plt.title('Average Self-Perceived Mental Health by Age Group and Frequency of Binge Drinking')
plt.xlabel('Age Group')
plt.ylabel('Average Self-Perceived Mental Health')
plt.legend(title='Frequency of Binge Drinking',bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

Here we can see that as the frequency of binge drinking occurs, the average self perceived mental health of individuals decrease amongst all age groups.

#### Question 2 -    How does cannabis use impact stress levels? 

To explore the impact of cannabis use on one's stress level, we want to extract the interested varibles to this question that are cannabis use, perceived life stress and perceived work stress. Then these vriables will be compared and analyzd to determine the imacts of cannabis use. From the initial survey data, values that are no interest to this question and may create problems to the analysis such as "valide skip", "not state", "don't know" and "refusal" were removed from the columns. All values from varibles are converted into integer type as on the inital dataset, each survey code is represented by an integer. 

In [None]:
df_q4=df.copy() # make a copy of initial filtered data and filter for the varibles relavant to this question. 
df_q4=df_q4.filter(items=['GEN_020','GEN_025','CAN_015']) 
print(df_q4.head())

In [None]:
df_q4.to_csv('Question 4.csv',index=False) #Create a new csv for analysis for this guiding question. 

In [None]:
df_q4.info() #A peek to the dataset info for its data types and columns names. 
df_q4.describe()
df_q4.dtypes

In [None]:

df_q4.isnull().sum() #check for missing values

As we can see, after running the below code, all the values from interested varibles have been converted to Int32 type. 

In [None]:
df_q4[['CAN_015','GEN_020','GEN_025']]=df_q4[['CAN_015','GEN_020','GEN_025']].astype(int) #Converts Cannabis Use CAN_015, GEN_020, GEN_025 into integer type.
df_q4.info()

In [None]:
#Removing irrelavent values from perceived life stress veriable
df_q4=df_q4.drop(df_q4[df_q4['GEN_020']==7].index)  
df_q4=df_q4.drop(df_q4[df_q4['GEN_020']==8].index)

In [None]:
#Removing irrelavent values from perceived work stress varible
df_q4=df_q4.drop(df_q4[df_q4['GEN_025']==6].index)
df_q4=df_q4.drop(df_q4[df_q4['GEN_025']==7].index)
df_q4=df_q4.drop(df_q4[df_q4['GEN_025']==8].index)
df_q4=df_q4.drop(df_q4[df_q4['GEN_025']==9].index)

In [None]:
#Removing irrelavent values from cannabis use variable
df_q4=df_q4.drop(df_q4[df_q4['CAN_015']==7].index)
df_q4=df_q4.drop(df_q4[df_q4['CAN_015']==8].index)
df_q4=df_q4.drop(df_q4[df_q4['CAN_015']==9].index)

After running the above code, we can see from below that the minimum value is 1 and maximum is 5. values of 6 ,7 ,8, 9 have been removed from dataset. 

In [None]:
df_q4.describe()

In [None]:
# Group preceived life stress by cannabis use to prepare for the data analysis. 
df4_stress=df_q4.groupby('GEN_020', as_index=False)['CAN_015'].value_counts(normalize=True,sort=False)
display(df4_stress)

In [None]:
# Create a bar chart to visualize the impact of cannabis use to perceived life stress. 
plt.figure(figsize=(12, 6))
plt.bar(df4_stress['GEN_020'].unique(),df4_stress['proportion'][::2],width=0.3,label='Poeple Used Cannabis in Past 12 Months')
plt.bar(df4_stress['GEN_020'].unique()+0.3,df4_stress['proportion'][1::2],width=0.3, label="People Didn't Use Cannabis in Past 12 Months")
plt.xticks(df4_stress['GEN_020'].unique()+0.3/2,('Not at all stressful','Not very stressful','A bit stressful','Quite bit stressful','Extremely stressful'))
plt.xlabel('Preceived Life Stress by Canabis Use')
plt.ylabel('Percent')
plt.legend()
plt.show()

Based on the visualization of the graph, we can see that people who used cannais and feeling stressful is slightly more than people who used cannabis feeling not stressful. This might show us that using cannabis can still increase one's life stress level. From the group of people who didn't use cannabis, the amount of people feeling not stressful is also more than the amount of people feeling stressful. This tells us that not using cannabis will result in a less stressful life. However, the results can also be influnced by the factor that the number of people who didn't use cannabis is significantly more than the number of people who used cannabis in this dataset.

In [None]:
# Group perceived work stress results by cannabis use. 
df4_stress=df_q4.groupby('GEN_025', as_index=False)['CAN_015'].value_counts(normalize=True,sort=False)
display(df4_stress)

In [None]:
# Plot a bar chart to visulize the relationship of cannabis use and perceived work stress. 
plt.figure(figsize=(12, 6))
plt.bar(df4_stress['GEN_025'].unique(),df4_stress['proportion'][::2],width=0.3,label='Poeple Used Cannabis in Past 12 Months')
plt.bar(df4_stress['GEN_025'].unique()+0.3,df4_stress['proportion'][1::2],width=0.3, label="People Didn't Use Cannabis in Past 12 Months")
plt.xticks(df4_stress['GEN_025'].unique()+0.3/2,('Not at all stressful','Not very stressful','A bit stressful','Quite bit stressful','Extremely stressful'))
plt.xlabel('Preceived Work Stress by Canabis Use')
plt.ylabel('Percent')
plt.legend()
plt.show()

From the visualization of this graph, we can see that the amount of people who used canabis and feeling stressful is almost the same as the amount of people who used cannabis and feeling not stressful. This may reveal that the cannabis use does not make significant impact to one's work stress. As we can see people who didn't use canabis can still feel stressful. We didn't see a tendency of increassing stress level in both groups. This may reveal that using canabis does not make work much more stressful. 

#### Question 3 - What is the relationship between smoking and physical health outcomes

#### Question 4 - Which demographic groups (age, sex, region) are most affected by health barriers? 

This is a more comprehensive question that invovled different demogaphic of people. The health barries refer to cannabis use, smoking frequency and alcohol consumption. In the initial dataset, the demographic groups includes people of different gender, age, region and education level. This analysis will explore the impact of each health barries to each demographic group one by one. It's crucial for our analysis to further consider people from differnt demographic groups as the health barries ussually affects peole differntly to their background. To begin data analysis, there are several steps to conduct the data cleaning. The data cleaning process removes the values that are undesired to answering this quiding questions such as "don't know", "refusal", "not stated" and "valid skip". Then it will conver the values of all the desired varibels into integer type as the survey response received are all in format of integer. 

In [None]:
df_q5=df.copy() # make a copy of initial filtered data and filter for the varibles relavant to this question. 
df_q5=df_q5.filter(items=['CAN_015','SMK_005','SMMK_060','ALC_015','ALC_020','DHHGAGE','DHH_SEX','GEOGPRV','EHG2DVH3'])
print(df_q5.head())

In [None]:
df_q5.to_csv('Question 5.csv',index=False) #Create a new csv for analysis for this guiding question. 

In [None]:
#A peek to the dataset info for its data types and columns names. 
df_q5.info()
df_q5.describe()
df_q5.dtypes

In [None]:
df_q5.isnull().sum() #check for missing values

Data Cleaning:

In [None]:
#Removing undesired values from cannabis use variable.
df_q5=df_q5.drop(df_q5[df_q5['CAN_015']==7].index)
df_q5=df_q5.drop(df_q5[df_q5['CAN_015']==8].index)
df_q5=df_q5.drop(df_q5[df_q5['CAN_015']==9].index)

In [None]:
#Removing undesired values from smoking frequency variable.
df_q5=df_q5.drop(df_q5[df_q5['SMK_005']==7].index)
df_q5=df_q5.drop(df_q5[df_q5['SMK_005']==8].index)
df_q5=df_q5.drop(df_q5[df_q5['SMK_005']==9].index)

In [None]:
#Removing undesired values from alcohol consumption variable.
df_q5=df_q5.drop(df_q5[df_q5['ALC_015']==96].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_015']==97].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_015']==98].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_015']==99].index)

In [None]:
#Removing undesired values from alcohol consumption (drinking frequency) variable.
df_q5=df_q5.drop(df_q5[df_q5['ALC_020']==96].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_020']==97].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_020']==98].index)
df_q5=df_q5.drop(df_q5[df_q5['ALC_020']==99].index)

In [None]:
#Removing undesired values from household education level variable.
df_q5=df_q5.drop(df_q5[df_q5['EHG2DVH3']==9].index)

In [None]:
# Converted interested varibles into integer type
df_q5['CAN_015'] = df_q5['CAN_015'].astype(int)
df_q5['GEOGPRV'].dtypes
df_q5['SMK_005'] = df_q5['SMK_005'].astype(int)
df_q5['SMK_005'].dtypes
df_q5['ALC_015'] = df_q5['ALC_015'].astype(int)
df_q5['ALC_015'].dtypes
df_q5['ALC_020'] = df_q5['ALC_020'].astype(int)
df_q5['ALC_020'].dtypes
df_q5['DHH_SEX'] = df_q5['DHH_SEX'].astype(int)
df_q5['DHH_SEX'].dtypes
df_q5['DHHGAGE'] = df_q5['DHHGAGE'].astype(int)
df_q5['DHHGAGE'].dtypes
df_q5['GEOGPRV'] = df_q5['GEOGPRV'].astype(int)
df_q5['GEOGPRV'].dtypes
df_q5['EHG2DVH3'] = df_q5['EHG2DVH3'].astype(int)
df_q5['EHG2DVH3'].dtypes

After running the above code, we can see from below that the maximum values shows the undesired values have been removed from all the interested varibels, and all the values have been converted to integer type. 

In [None]:
df_q5.describe()

In [None]:
#Group cannabis use results based on sex type. 
df5_cansex=df_q5.groupby('DHH_SEX', as_index=False)['CAN_015'].value_counts(normalize=True,sort=False)
display(df5_cansex)

In [None]:
# Create a bar plot to show the cannabis use infromation based on each gender type. 
plt.figure(figsize=(10,7))
plt.bar(df5_cansex['DHH_SEX'].unique(),df5_cansex['proportion'][::2],width=0.3,label='Poeple Used Cannabis in Past 12 Months')
plt.bar(df5_cansex['DHH_SEX'].unique()+0.3,df5_cansex['proportion'][1::2],width=0.3, label="People Didn't Use Cannabis in Past 12 Months")
plt.xticks(df5_cansex['DHH_SEX'].unique()+0.3/2,('Male','Female'))
plt.xlabel('Gender Impacted by Canabis')
plt.ylabel('Percent')
plt.legend()
plt.show()

From the above bar plot, we can find that females uses canabis slightly more than males. 

In [None]:
#Create a pie chart to visualize the proportaion of male and female using cannabis. 
cannabis_sex_data=df_q5[['CAN_015', 'DHH_SEX']].dropna()
cannabis_sex=cannabis_sex_data.groupby('DHH_SEX')['CAN_015'].sum()
DHH_SEX_labels={
    1: "Male",
    2: "Female"
}
cannabis_sex.index=cannabis_sex.index.map(DHH_SEX_labels)

plt.figure(figsize=(5,5))
cannabis_sex.plot(kind='pie',autopct='%1.1f%%',colors=['blue','orange'],startangle=90,textprops={'fontsize':12})
plt.title('Cannabis Use by Sex')
plt.ylabel('')  
plt.show()

A supplementry pie chart is created to better visulize the proportion of male and female using cannabis. We can see that 8% more of females use cannabis compared to males. 

In [None]:
#Create a bar plot to explore the smoking frequency by each gender. 
smoke_sex=df_q5[['SMK_005', 'DHH_SEX']].dropna()
smoke_sex=smoke_sex.groupby('DHH_SEX')['SMK_005'].mean()
DHH_SEX_labels={
    1: "Male",
    2: "Female"
}
smoke_sex.index=smoke_sex.index.map(DHH_SEX_labels)

plt.figure(figsize=(8,6))
plt.ylim(0,3)
smoke_sex.plot(kind='bar', color=['skyblue', 'pink'])
plt.title('Average Smoke Frequency by Gender')
plt.xlabel('Sex')
plt.ylabel('Average Smoke Frequency')
plt.xticks(rotation=0)  
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.show()

Based on the bar chart, we can interpret that females smoke slighly more often than males. 

In [None]:
#Create a pie chart to visualize the proportaion of male and female for their smoking frequency. 
smoking_sex=df_q5[['SMK_005', 'DHH_SEX']].dropna()
smoking_bysex=smoking_sex.groupby('DHH_SEX')['SMK_005'].sum()
DHH_SEX_labels={
    1: "Male",
    2: "Female"
}
smoking_bysex.index=smoking_bysex.index.map(DHH_SEX_labels)

plt.figure(figsize=(5, 5))
smoking_bysex.plot(kind='pie', autopct='%1.1f%%', colors=['skyblue', 'pink'], startangle=90, textprops={'fontsize': 12})
plt.title('Smoke Frequency by Sex')
plt.ylabel('')  
plt.show()

A supplementry pie chart is created to better visulize the proportion of male and female for their smoking frequency. We can see that females smoke 8% more often than males. 

In [None]:
# Create bar chart for alcohol comsumptions based on two gender types. 
alcohol_sex=df_q5[['ALC_015', 'ALC_020', 'DHH_SEX']].dropna()
alc_015_bysex=alcohol_sex.groupby('DHH_SEX')['ALC_015'].mean()
alc_020_bysex=alcohol_sex.groupby('DHH_SEX')['ALC_020'].mean()
DHH_SEX_labels={
    1: "Male",
    2: "Female"
}
alc_015_bysex.index=alc_015_bysex.index.map(DHH_SEX_labels)
alc_020_bysex.index=alc_020_bysex.index.map(DHH_SEX_labels)

plt.figure(figsize=(14,6))

plt.subplot(1,2,1)  
alc_015_bysex.plot(kind='bar',color=['blue','orange'])
plt.title('Average Alcohol Consumption Frequency by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Alcohol Consumption Frequency')
plt.xticks(rotation=0)  
plt.grid(axis='y',linestyle='--',alpha=0.7)

plt.subplot(1,2,2) 
alc_020_bysex.plot(kind='bar',color=['blue','orange'])
plt.title('Average Alcohol Consumption (Drink 4+/5+ One Occasion) by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Alcohol Consumption (Drink 4+/5+ One Occasion')
plt.xticks(rotation=0)  
plt.grid(axis='y',linestyle='--',alpha=0.7)

From the above bar plots, we found that males generally consumes more alcohols than femles in terms of both alcohol comspmtion frequency and drinking 4+/5+ cans on one occasion. 

In [None]:
# Plot a bar chart to investigate the cannabis use by differnt age group. 
data_filtered=df_q5[['CAN_015', 'DHHGAGE']].dropna()
cannabis_use_different_age=data_filtered.groupby('DHHGAGE')['CAN_015'].mean()
DHHGAGE_labels={
    1: '12 to 17 years',
    2: '18 to 34 years',
    3: '35 to 49 years',
    4: '50 to 64 years',
    5: '65 years and older'
}
cannabis_use_different_age.index=cannabis_use_different_age.index.map(DHHGAGE_labels)

plt.figure(figsize=(10,6))
cannabis_use_different_age.plot(kind='bar',color='olive')
plt.title('Average Cannabis Use by Different Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Cannabis Use')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.show()

Based on the above graph, we can tell that people who are at 12 to 17 years and 65 years and older use cannabis most often, with age group 50 years to 64 years and 30 years to 49 years following closely behind. Poeple of 18 years to 34 years use cannabis least often. 

In [None]:
# Plot a bar chart to investigate the smoking freqnecy by differnt age group. 
smoke_data_filtered=df_q5[['SMK_005', 'DHHGAGE']].dropna()
smoke_byage=smoke_data_filtered.groupby('DHHGAGE')['SMK_005'].mean()
DHHGAGE_labels={
    1: "12 to 17 years",
    2: "18 to 34 years",
    3: "35 to 49 years",
    4: "50 to 64 years",
    5: "65 years and older"
}
smoke_byage.index=smoke_byage.index.map(DHHGAGE_labels)

plt.figure(figsize=(10,6))
smoke_byage.plot(kind='bar',color='purple')
plt.title('Average Smoke Frequency by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Smoke Frequency')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.show()

From the bar plot, we found that people people who are at 12 to 17 years and 65 years and older smoke most often. However, the smoking frequency for the rest of age groups are approximately the same. 

In [None]:
# Create two bar charts to analyze the affect of alcohol consumption to each age group. 
alcohol_data_filtered=df_q5[['ALC_015', 'ALC_020', 'DHHGAGE']].dropna()
alc_015_by_age=alcohol_data_filtered.groupby('DHHGAGE')['ALC_015'].mean()
alc_020_by_age=alcohol_data_filtered.groupby('DHHGAGE')['ALC_020'].mean()
DHHGAGE_labels={
    1: "12 to 17 years",
    2: "18 to 34 years",
    3: "35 to 49 years",
    4: "50 to 64 years",
    5: "65 years and older"
}
alc_015_by_age.index=alc_015_by_age.index.map(DHHGAGE_labels)
alc_020_by_age.index=alc_020_by_age.index.map(DHHGAGE_labels)

plt.figure(figsize=(14,6))
plt.subplot(1,2,1)  
alc_015_by_age.plot(kind='bar',color='coral')
plt.title('Average Alcohol Consumption Frequency in Past 12 Months')
plt.xlabel('Age Group')
plt.ylabel('Average Alcohol Consumption Frequency')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)

plt.subplot(1,2,2) 
plt.ylim(0,2.5)
alc_020_by_age.plot(kind='bar',color='mediumblue')
plt.title('Average Alcohol Consumption (Drink 4+/5+ One Occasion)')
plt.xlabel('Age Group')
plt.ylabel('Average Alcohol Consumption (Drink 4+/5+ One Occasion)')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)

Based on the above graphs, people at the age of 12 to 17 years drink least often. This is becuase the law forbids minors to drink alcohol. From the age of 18 years to 34 years, people start to drink more often, and then reach the maximum alcohol comsumption frequcny at the age of 50 years to 64 years. However, people at younger age (18 years to 34 years) tend to drink most of alcohol at once. After the age of 35 years, people start to drink less and less alcohol at one occasion. This might be the result of physically aging that the body is less able to handle alcohols over time. 

In [None]:
# Create a bar chart to analyze cannabis use by region. 
cannabis_region=df_q5[['CAN_015', 'GEOGPRV']].dropna()
GEOGPRV_labels={
    10: "NEWFOUNDLAND AND LABRADOR",
    11: "PRINCE EDWARD ISLAND",
    12: "NOVA SCOTIA",
    13: "NEW BRUNSWICK",
    24: "QUEBEC",
    35: "ONTARIO",
    46: "MANITOBA",
    47: "SASKATCHEWAN",
    48: "ALBERTA",
    59: "BRITISH COLUMBIA",
    60: "YUKON/NORTHWEST/NUNAVUT TERRITORIES"
}
cannabis_region['GEOGPRV']=cannabis_region['GEOGPRV'].map(GEOGPRV_labels)
cannabis_byregion=cannabis_region.groupby('GEOGPRV')['CAN_015'].mean()

plt.figure(figsize=(10,7))
plt.ylim(0,2)
cannabis_byregion.plot(kind='bar', color='olive')
plt.title('Average Cannabis Use by Region')
plt.xlabel('Region')
plt.ylabel('Average Cannabis Use')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.tight_layout()
plt.show()

The above plot reveals that people among different region use cannabis approximately same fequently except for the YUKON/NORTHWEST/NUNAVUT territories. This may be due to the less population living in these territories. 

In [None]:
# Create bar plot for smoking frequency for people at each region. 
smoke_region=df_q5[['SMK_005', 'GEOGPRV']].dropna()
GEOGPRV_labels={
    10: "NEWFOUNDLAND AND LABRADOR",
    11: "PRINCE EDWARD ISLAND",
    12: "NOVA SCOTIA",
    13: "NEW BRUNSWICK",
    24: "QUEBEC",
    35: "ONTARIO",
    46: "MANITOBA",
    47: "SASKATCHEWAN",
    48: "ALBERTA",
    59: "BRITISH COLUMBIA",
    60: "YUKON/NORTHWEST/NUNAVUT TERRITORIES"
}
smoke_region['GEOGPRV']=smoke_region['GEOGPRV'].map(GEOGPRV_labels)
smoke_byregion=smoke_region.groupby('GEOGPRV')['SMK_005'].mean()

plt.figure(figsize=(10,7))
smoke_byregion.plot(kind='bar',color='purple')
plt.title('Average Smoke Frequency by Region')
plt.xlabel('Region')
plt.ylabel('Average Smoke Frequency')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.tight_layout()
plt.show()

Based on the plot, we found that the smoking frequency for people at each provinces are approximately the same excep for the three territories and Alberta. The reason behind why Albertans smoke less frequently comapred to people at other provinces needs to be further investigated.

In [None]:
alcohol_data_filtered=df_q5[['ALC_015', 'ALC_020', 'GEOGPRV']].dropna()
alc_015_region=alcohol_data_filtered.groupby('GEOGPRV')['ALC_015'].mean()
alc_020_region=alcohol_data_filtered.groupby('GEOGPRV')['ALC_020'].mean()
GEOGPRV_labels={
    10: "NEWFOUNDLAND AND LABRADOR",
    11: "PRINCE EDWARD ISLAND",
    12: "NOVA SCOTIA",
    13: "NEW BRUNSWICK",
    24: "QUEBEC",
    35: "ONTARIO",
    46: "MANITOBA",
    47: "SASKATCHEWAN",
    48: "ALBERTA",
    59: "BRITISH COLUMBIA",
    60: "YUKON/NORTHWEST/NUNAVUT TERRITORIES"
}
alc_015_region.index=alc_015_region.index.map(GEOGPRV_labels)
alc_020_region.index=alc_020_region.index.map(GEOGPRV_labels)

plt.figure(figsize=(14,6))

plt.subplot(1,2,1)  
alc_015_region.plot(kind='bar',color='coral')
plt.title('Average Alcohol Consumption Frequency (Past 12 Months) By Region')
plt.xlabel('Region Group')
plt.ylabel('Average Alcohol Consumption Frequency')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)

plt.subplot(1,2,2)  
alc_020_region.plot(kind='bar',color='mediumblue')
plt.title('Average Alcohol Consumption (Drink 4+/5+ One Occasion) By Region')
plt.xlabel('Region Group')
plt.ylabel('Average Alcohol Consumption (Drink 4+/5+ One Occasion)')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)

In [None]:
cannabis_education=df_q5[['CAN_015', 'EHG2DVH3']].dropna()
EHG2DVH3_labels={
    1: "Less than secondary school graduation",
    2: "Secondary school graduation, no post-secondary education",
    3: "Post-secondary certificate/diploma/university degree"
}
cannabis_education['EHG2DVH3']=cannabis_education['EHG2DVH3'].map(EHG2DVH3_labels)
cannabis_education=cannabis_education.groupby('EHG2DVH3')['CAN_015'].mean()

plt.figure(figsize=(8.5,7))
cannabis_education.plot(kind='bar',color='olive',width=0.25)
plt.title('Average Cannabis Use by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Cannabis Use')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
smoke_education=df_q5[['SMK_005', 'EHG2DVH3']].dropna()
EHG2DVH3_labels={
    1: "Less than secondary school graduation",
    2: "Secondary school graduation, no post-secondary education",
    3: "Post-secondary certificate/diploma/university degree"
}
smoke_education['EHG2DVH3']=smoke_education['EHG2DVH3'].map(EHG2DVH3_labels)
smoke_education=smoke_education.groupby('EHG2DVH3')['SMK_005'].mean()

plt.figure(figsize=(8, 7))
plt.ylim(0,3)
smoke_education.plot(kind='bar', color='purple', width=0.3)
plt.title('Average Smoke Frequency by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Smoke Frequency')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
alcohol_education=df_q5[['ALC_015', 'ALC_020', 'EHG2DVH3']].dropna()
alc_015_education=alcohol_education.groupby('EHG2DVH3')['ALC_015'].mean()
alc_020_education=alcohol_education.groupby('EHG2DVH3')['ALC_020'].mean()

EHG2DVH3_labels={
    1: "Less than secondary school graduation",
    2: "Secondary school graduation, no post-secondary education",
    3: "Post-secondary certificate/diploma/university degree"
}
alc_015_education.index=alc_015_education.index.map(EHG2DVH3_labels)
alc_020_education.index=alc_020_education.index.map(EHG2DVH3_labels)

plt.figure(figsize=(14,8))

plt.subplot(1,2,1)
alc_015_education.plot(kind='bar',color='coral')
plt.title('Average Alcohol Consumption by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Alcohol Consumption')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)

plt.subplot(1,2,2)
alc_020_education.plot(kind='bar',color='mediumblue')
plt.title('Average Alcohol Consumption by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Alcohol Consumption')
plt.xticks(rotation=45,ha='right')
plt.grid(axis='y',linestyle='--',alpha=0.7)
plt.tight_layout()
plt.show()

### Health Drivers Vs. Health Improvements

#### Question 5 - What is the relationship between regular exercise and self-reported health status? 
In order to answer our 5th question, I’ll start by charting reported health by reported physical activity against each other. Afterwards I’ll break down how different demographics and other relevant variables provided in the dataset affect how the two variables of interest affect their results. To begin I’ll give some details on reported health and reported physical activity. Reported health is broken down into 5 responses ranging from “Poor” to “Excellent” and reported physical health is broken down into if the respondent’s activity is “Above” or “Below” recommended activity guidelines.

In [None]:
df_question6=df_general.copy()
df_question6=df_question6.filter(items=[
    'GEN_005',
    'HWTDGISW',
    'PAADVACV',
    'DHHGAGE',
    'DHH_SEX',
    'GEOGPRV',
    'EHG2DVH3'
])
#print(df_question6.head())
#setting some print and summary functions to comments to remove clutter 

Concerning data cleaning I have cleaned the data to select the variables I will be working with and then removed rows with response that are not useful to my analysis. Responses such as “valid skip”, “refused”, and “don’t know” are not of interest to my analysis and have been removed from the dataset I will be working with. Finally, I run code to check for missing values in my dataset.

In [None]:
df_question6=df_question6.drop(df_question6[df_question6['GEN_005']==8].index)
df_question6=df_question6.drop(df_question6[df_question6['GEN_005']==7].index)
df_question6=df_question6.drop(df_question6[df_question6['HWTDGISW']==6].index)
df_question6=df_question6.drop(df_question6[df_question6['HWTDGISW']==9].index)
df_question6=df_question6.drop(df_question6[df_question6['PAADVACV']==9].index)
df_question6=df_question6.drop(df_question6[df_question6['PAADVACV']==6].index)
df_question6=df_question6.drop(df_question6[df_question6['PAADVACV']==3].index)
df_question6=df_question6.drop(df_question6[df_question6['EHG2DVH3']==9].index)
df_question6.isnull().sum()


I added a correlation heat map but there are no strong correlations between any of the variables included in my analysis.

In [None]:
cor6 = df_question6.corr()

labels6=['General Health', 'BMI', 'Physical Activity', 'Age', 'Sex','Location', 'Education']
sns.heatmap(cor6, annot=True, cmap='coolwarm', center=0, xticklabels=labels6, yticklabels=labels6)
plt.title("Correlation Between Physical Activity and Health Variables")
plt.show()

To start I’ve visualized the initial two variables physical activity and reported health. To do this I grouped physical activity by the reported health categories and visualised them in a bar chart. The initial plot shows a clear trend between the reported health and how much of that population is above the recommended activity guideline. As the reported health decreases the difference between the proportion of people above and below the activity guideline shows downwards and upward trends respectively.

In [None]:
df6_health=df_question6.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
plt.figure().set_figwidth(10)
plt.bar(df6_health['GEN_005'].unique(),df6_health['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
plt.bar(df6_health['GEN_005'].unique()+0.3,df6_health['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
plt.xticks(df6_health['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
plt.xlabel('Physical Activity By Health Status')
plt.ylabel('Percent')
plt.legend()
plt.show()

I created subplots showing the first relationship but divided up by age groups. As can be seen in the graphs for the younger age groups there is no discernible trend between physical activity and reported health. This plot indicates that the younger age group are overall more active than older age groups but there are other factors that are more important to the younger age groups reported health. The age groups over 50 years old do show the trend in the initial plot. From this it could be reasonably concluded that as age increases physical activity becomes more important to ones reported health.

In [None]:
df_question6_age1=df_question6.drop(df_question6[df_question6['DHHGAGE']!=2].index)
df6_health_age1=df_question6_age1.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
df_question6_age2=df_question6.drop(df_question6[df_question6['DHHGAGE']!=3].index)
df6_health_age2=df_question6_age2.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
df_question6_age3=df_question6.drop(df_question6[df_question6['DHHGAGE']!=4].index)
df6_health_age3=df_question6_age3.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
df_question6_age4=df_question6.drop(df_question6[df_question6['DHHGAGE']!=5].index)
df6_health_age4=df_question6_age4.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)



fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0,0].bar(df6_health_age1['GEN_005'].unique(),df6_health_age1['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[0,0].bar(df6_health_age1['GEN_005'].unique()+0.3,df6_health_age1['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[0,0].set_xticks(df6_health_age1['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[0,0].set_xlabel('Physical Activity By Health Status 18-34')
axes[0,0].set_ylabel('Percent')
axes[0,0].legend()

axes[0,1].bar(df6_health_age2['GEN_005'].unique(),df6_health_age2['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[0,1].bar(df6_health_age2['GEN_005'].unique()+0.3,df6_health_age2['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[0,1].set_xticks(df6_health_age2['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[0,1].set_xlabel('Physical Activity By Health Status 35-49')
axes[0,1].set_ylabel('Percent')
axes[0,1].legend()

axes[1,0].bar(df6_health_age3['GEN_005'].unique(),df6_health_age3['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[1,0].bar(df6_health_age3['GEN_005'].unique()+0.3,df6_health_age3['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[1,0].set_xticks(df6_health_age3['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[1,0].set_xlabel('Physical Activity By Health Status 50-64')
axes[1,0].set_ylabel('Percent')
axes[1,0].legend()

axes[1,1].bar(df6_health_age4['GEN_005'].unique(),df6_health_age4['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[1,1].bar(df6_health_age4['GEN_005'].unique()+0.3,df6_health_age4['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[1,1].set_xticks(df6_health_age4['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[1,1].set_xlabel('Physical Activity By Health Status 65 plus')
axes[1,1].set_ylabel('Percent')
axes[1,1].legend()

plt.show()

Another breakdown was the two main variables of interest by the respondent’s level of education. The plots show that if a respondent did not complete high school, they are less likely to be active than respondents who completed high school or have a post secondary degree. There were minimal differences between those that completed high school or hold a post secondary degree. All plots follow the trend shown in the first graph.

In [None]:
df_question6edu1=df_question6.drop(df_question6[df_question6['EHG2DVH3']!=1].index)
df6_edu1=df_question6edu1.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
df_question6edu2=df_question6.drop(df_question6[df_question6['EHG2DVH3']!=2].index)
df6_edu2=df_question6edu2.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)
df_question6edu2=df_question6.drop(df_question6[df_question6['EHG2DVH3']!=3].index)
df6_edu3=df_question6edu2.groupby('GEN_005', as_index=False)['PAADVACV'].value_counts(normalize=True,sort=False)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0,0].bar(df6_edu1['GEN_005'].unique(),df6_edu1['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[0,0].bar(df6_edu1['GEN_005'].unique()+0.3,df6_edu1['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[0,0].set_xticks(df6_edu1['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[0,0].set_xlabel('Physical Activity By Health Status No HS Diploma')
axes[0,0].set_ylabel('Percent')
axes[0,0].legend()

axes[0,1].bar(df6_edu2['GEN_005'].unique(),df6_edu2['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[0,1].bar(df6_edu2['GEN_005'].unique()+0.3,df6_edu2['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[0,1].set_xticks(df6_edu2['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[0,1].set_xlabel('Physical Activity By Health Status HS Diploma')
axes[0,1].set_ylabel('Percent')
axes[0,1].legend()

axes[1,0].bar(df6_edu3['GEN_005'].unique(),df6_edu3['proportion'][::2],width=0.3,label='Percent Above Activity Guideline')
axes[1,0].bar(df6_edu3['GEN_005'].unique()+0.3,df6_edu3['proportion'][1::2],width=0.3, label='Percent Below Activity Guideline')
axes[1,0].set_xticks(df6_edu3['GEN_005'].unique()+0.3/2,('Excellent','Very Good','Good','Fair','Poor'))
axes[1,0].set_xlabel('Physical Activity By Health Status Uni Degree')
axes[1,0].set_ylabel('Percent')
axes[1,0].legend()
fig.delaxes(axes[1, 1])
plt.show()

#### Question 6 - How does stress influence the maintenance or improvement of mental health?  

For our 6th question I started by charting reported mental health by the respondent’s stress level. After that ill break down the relationship by relevant demographic variables in the dataset along with some other visualizations of the data. Much like question 6 reported mental health is broken down into categories ranging from “poor” to “excellent”. Reported stress level is broken down into 5 categories ranging from “not at all stressful” to “extremely stressful”.

In [None]:
df_question7=df_general.copy()
df_question7=df_question7.filter(items=[
    'GEN_015',
    'GEN_020',
    'HWTDGISW',
    'PAADVACV',
    'DHHGAGE',
    'DHH_SEX',
    'GEOGPRV',
    'EHG2DVH3',
    'HWT_050'
])
#print(df_question7.head())

Like question 5 for data cleaning, I selected variables of interest and put them into a new data set for me to perform my analysis. I cleaned unwanted variables like “don’t know” and “refused” as they are not of interest to my analysis. Lastly, I checked the dataset for missing values.
For the initial graph plotting stress level by reported mental health status. As can be seen in the graph there is a distinct trend shown as respondents mental health decreases the percentage of the mental health level reporting higher stress levels grows. This trend shows the relationship that one could expect that stress does seem have be a determining factor or mental health. 

In [None]:
df_question7=df_question7.drop(df_question7[df_question7['GEN_015']==9].index)
df_question7=df_question7.drop(df_question7[df_question7['GEN_015']==8].index)
df_question7=df_question7.drop(df_question7[df_question7['GEN_015']==7].index)
df_question7=df_question7.drop(df_question7[df_question7['GEN_020']==8].index)
df_question7=df_question7.drop(df_question7[df_question7['GEN_020']==7].index)
df_question7=df_question7.drop(df_question7[df_question7['HWTDGISW']==6].index)
df_question7=df_question7.drop(df_question7[df_question7['HWTDGISW']==9].index)
df_question7=df_question7.drop(df_question7[df_question7['PAADVACV']==9].index)
df_question7=df_question7.drop(df_question7[df_question7['PAADVACV']==6].index)
df_question7=df_question7.drop(df_question7[df_question7['PAADVACV']==3].index)
df_question7=df_question7.drop(df_question7[df_question7['HWT_050']==6].index)
df_question7=df_question7.drop(df_question7[df_question7['HWT_050']==7].index)
df_question7=df_question7.drop(df_question7[df_question7['HWT_050']==8].index)
df_question7=df_question7.drop(df_question7[df_question7['HWT_050']==9].index)
df_question7=df_question7.drop(df_question7[df_question7['EHG2DVH3']==9].index)
df_question7.isnull().sum()

The correlation matrix with a heat map is included but it does not show any particular results of interest.

In [None]:
cor7 = df_question7.corr()

labels7=['Mental Health', 'Stress Level', 'BMI', 'Physical Activity', 'Age', 'Sex','Location', 'Education','Self Percieved Weight']
sns.heatmap(cor7, annot=True, cmap='coolwarm', center=0, xticklabels=labels7, yticklabels=labels7)
plt.title("Correlation Between Stress and Mental Health Variables")
plt.show()

For the first breakdown by demographics, I have broken down the first graph by the age groups reported in the dataset. In each graph the general trend of higher stress levels being seen at the lower reported mental health states. However, as age increases the number of respondents in the extremes of the reported stress levels differs significantly with older age groups responding more in the extremes.

In [None]:
df_healthstress=df_question7.groupby('GEN_015', as_index=False)['GEN_020'].value_counts(normalize=True, sort=False)
plt.figure().set_figwidth(10)
plt.bar(df_healthstress['GEN_015'].unique()-0.2,df_healthstress['proportion'][::5],width=0.1,label='Not at all stressful')
plt.bar(df_healthstress['GEN_015'].unique()-0.1,df_healthstress['proportion'][1::5],width=0.1, label='Not very stressful')
plt.bar(df_healthstress['GEN_015'].unique(),df_healthstress['proportion'][2::5],width=0.1, label='A bit stressful')
plt.bar(df_healthstress['GEN_015'].unique()+0.1,df_healthstress['proportion'][3::5],width=0.1, label='Quite a bit stressful')
plt.bar(df_healthstress['GEN_015'].unique()+0.2,df_healthstress['proportion'][4::5],width=0.1, label='Extremely stressful')
plt.xticks(df_healthstress['GEN_015'].unique(),('Excellent','Very Good','Good','Fair','Poor'))
plt.xlabel('Stress Level by Mental Health Status')
plt.ylabel('Percent')
plt.legend()
plt.show()

For my last visualisation I plotted reported mental health by self perceived weight to see how one’s self-perceived weight influences their mental health at all. In the graph there is a trend shown, as the respondents mental health decreases the percentage of the population in each mental health category reporting they view themselves as “just right” decreases. Likewise, the amount of each category’s population reporting that their view themselves as “underweight” or “overweight” increases as reported mental health decreases.

In [None]:
df_question7_age1=df_question7.drop(df_question7[df_question7['DHHGAGE']!=2].index)
df7_health_age1=df_question7_age1.groupby('GEN_015', as_index=False)['GEN_020'].value_counts(normalize=True,sort=False)
df_question7_age2=df_question7.drop(df_question7[df_question7['DHHGAGE']!=3].index)
df7_health_age2=df_question7_age2.groupby('GEN_015', as_index=False)['GEN_020'].value_counts(normalize=True,sort=False)
df_question7_age3=df_question7.drop(df_question7[df_question7['DHHGAGE']!=4].index)
df7_health_age3=df_question7_age3.groupby('GEN_015', as_index=False)['GEN_020'].value_counts(normalize=True,sort=False)
df_question7_age4=df_question7.drop(df_question7[df_question7['DHHGAGE']!=5].index)
df7_health_age4=df_question7_age4.groupby('GEN_015', as_index=False)['GEN_020'].value_counts(normalize=True,sort=False)

temp=pd.DataFrame({'GEN_015':5,'GEN_015':1,'proportion':0},index=[20])
df7_health_age1=pd.concat([df7_health_age1.iloc[:20], temp, df7_health_age1.iloc[20:]],ignore_index=True)
df7_health_age1.iloc[20,0]=5
df7_health_age1.iloc[20,1]=1
df7_health_age1.iloc[20,2]=0

temp=pd.DataFrame({'GEN_015':5,'GEN_015':1,'proportion':0},index=[20])
df7_health_age2=pd.concat([df7_health_age2.iloc[:20], temp, df7_health_age2.iloc[20:]],ignore_index=True)
df7_health_age2.iloc[20,0]=5
df7_health_age2.iloc[20,1]=1
df7_health_age2.iloc[20,2]=0

temp=pd.DataFrame({'GEN_015':5,'GEN_015':1,'proportion':0},index=[20])
df7_health_age4=pd.concat([df7_health_age4.iloc[:20], temp, df7_health_age4.iloc[20:]],ignore_index=True)
df7_health_age4.iloc[20,0]=5
df7_health_age4.iloc[20,1]=1
df7_health_age4.iloc[20,2]=0
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0,0].bar(df7_health_age1['GEN_015'].unique()-0.2,df7_health_age1['proportion'][::5],width=0.1,label='Not at all stressful')
axes[0,0].bar(df7_health_age1['GEN_015'].unique()-0.1,df7_health_age1['proportion'][1::5],width=0.1, label='Not very stressful')
axes[0,0].bar(df7_health_age1['GEN_015'].unique(),df7_health_age1['proportion'][2::5],width=0.1, label='A bit stressful')
axes[0,0].bar(df7_health_age1['GEN_015'].unique()+0.1,df7_health_age1['proportion'][3::5],width=0.1, label='Quite a bit stressful')
axes[0,0].bar(df7_health_age1['GEN_015'].unique()+0.2,df7_health_age1['proportion'][4::5],width=0.1, label='Extremely stressful')
axes[0,0].set_xticks(df7_health_age1['GEN_015'].unique(),('Excellent','Very Good','Good','Fair','Poor'))
axes[0,0].set_xlabel('Stress Level by Mental Health Status 18-34')
axes[0,0].set_ylabel('Percent')
axes[0,0].legend()

axes[0,1].bar(df7_health_age2['GEN_015'].unique()-0.2,df7_health_age2['proportion'][::5],width=0.1,label='Not at all stressful')
axes[0,1].bar(df7_health_age2['GEN_015'].unique()-0.1,df7_health_age2['proportion'][1::5],width=0.1, label='Not very stressful')
axes[0,1].bar(df7_health_age2['GEN_015'].unique(),df7_health_age2['proportion'][2::5],width=0.1, label='A bit stressful')
axes[0,1].bar(df7_health_age2['GEN_015'].unique()+0.1,df7_health_age2['proportion'][3::5],width=0.1, label='Quite a bit stressful')
axes[0,1].bar(df7_health_age2['GEN_015'].unique()+0.2,df7_health_age2['proportion'][4::5],width=0.1, label='Extremely stressful')
axes[0,1].set_xticks(df7_health_age2['GEN_015'].unique(),('Excellent','Very Good','Good','Fair','Poor'))
axes[0,1].set_xlabel('Stress Level by Mental Health Status 34-49')
axes[0,1].set_ylabel('Percent')
axes[0,1].legend()

axes[1,0].bar(df7_health_age3['GEN_015'].unique()-0.2,df7_health_age3['proportion'][::5],width=0.1,label='Not at all stressful')
axes[1,0].bar(df7_health_age3['GEN_015'].unique()-0.1,df7_health_age3['proportion'][1::5],width=0.1, label='Not very stressful')
axes[1,0].bar(df7_health_age3['GEN_015'].unique(),df7_health_age3['proportion'][2::5],width=0.1, label='A bit stressful')
axes[1,0].bar(df7_health_age3['GEN_015'].unique()+0.1,df7_health_age3['proportion'][3::5],width=0.1, label='Quite a bit stressful')
axes[1,0].bar(df7_health_age3['GEN_015'].unique()+0.2,df7_health_age3['proportion'][4::5],width=0.1, label='Extremely stressful')
axes[1,0].set_xticks(df7_health_age3['GEN_015'].unique(),('Excellent','Very Good','Good','Fair','Poor'))
axes[1,0].set_xlabel('Stress Level by Mental Health Status 50-64')
axes[1,0].set_ylabel('Percent')
axes[1,0].legend()

axes[1,1].bar(df7_health_age4['GEN_015'].unique()-0.2,df7_health_age4['proportion'][0::5],width=0.1,label='Not at all stressful')
axes[1,1].bar(df7_health_age4['GEN_015'].unique()-0.1,df7_health_age4['proportion'][1::5],width=0.1, label='Not very stressful')
axes[1,1].bar(df7_health_age4['GEN_015'].unique(),df7_health_age4['proportion'][2::5],width=0.1, label='A bit stressful')
axes[1,1].bar(df7_health_age4['GEN_015'].unique()+0.1,df7_health_age4['proportion'][3::5],width=0.1, label='Quite a bit stressful')
axes[1,1].bar(df7_health_age4['GEN_015'].unique()+0.2,df7_health_age4['proportion'][4::5],width=0.1, label='Extremely stressful')
axes[1,1].set_xticks(df7_health_age4['GEN_015'].unique(),('Excellent','Very Good','Good','Fair','Poor'))
axes[1,1].set_xlabel('Stress Level by Mental Health Status 65 plus')
axes[1,1].set_ylabel('Percent')
axes[1,1].legend()

plt.show()

## Discussion and Conclusion

From our analysis we saw that individuals who moderately consume alcohol may report a better self-perceived mental health and life satisfaction.  However, if alcohol consumption becomes excessive, especially in the case of frequent binge drinking, we can see a noticeable decrease in self-perceived mental health and life satisfacation. This trend is noticeable across all age groups to varying degrees.


## References

1. Canadian Community Health Survey – Annual Component (CCHS). Govermnent of Canada, Statistics Canada. (2023, December 29). https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&Id=1531795 

2. Canadian Community Health Survey: Public Use Microdata File. Govermnent of Canada, Statistics Canada (2024, September 22). https://www150.statcan.gc.ca/n1/en/catalogue/82M0013x 

3. Haskell, W. L., Lee, I. M., Pate, R. R., Powell, K. E., Benjamin, G. A., & Flegal, K. M. (2007). Physical activity and public health: Updated recommendation for adults from the American College of Sports Medicine and the American Heart Association. Circulation, 116(9), 1081-1093. 

4. Rehm, J., Taylor, B., & Room, R. (2006). Global burden of disease from alcohol, illicit drugs and tobacco. Drug and Alcohol Review, 25(6), 503-513.  