# Research Analysis Proposal: CSCS
---

### **Analysis 1: Impact of Social Media on Perceived Social Connection**

**Research Question:**  
What is the relationship between the frequency of social media usage and social connection levels with friends among Canadian respondents in 2022?

**Variables:**  
1. **CONNECTION_social_media_time_per_day** (independent quantitative variable): This measures how many hours respondents have spent using social media per day in the past week. 
<br>
Visualization: <br>
KDE - A KDE can illustrate the distribution of hours spent on social media. This is useful because it helps visualize the common ranges of social media usage among respondents and identify any patterns or outliers in the data. It illustrates the data with a smooth curve, representing all of the data, making it easy to see where most of the data falls and if there are any trends.

<br>

2. **CONNECTION_social_time_friends_p7d** (dependent quantitative variable): This measures how many hours respondents have spent time with friends in the past week. It reflects the extent of real-life social interactions, which is the outcome of interest in this analysis.
<br>
Visualization: <br>
Box Plot - A box plot clearly shows the median and interquartile range for how many hours respondents have spent with their friends. It indicates outliers which can be useful for this data since some people may respond very differently for this type of question. 
<br>

**Assumptions:**
<br>
- Linearity: There is a linear relationship between the independent variable (social media usage) and the dependent variable (time spent with friends). This can be assessed visually using scatter plots.
- Independence: Observations are independent of each other. Each respondent's social media usage and time spent with friends should not be influenced by others' responses.
- Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable. 
- Normality of Residuals: The residuals of the model should be approximately normally distributed. This can be assessed using a histogram or a Q-Q plot of the residuals.

**Planned Analysis: Simple Linear Regression**  
<br>
$$Y = \beta_0 + \beta_1 x + \epsilon$$
<br>
The coefficient 𝛽₁ will indicate the direction and strength of the relationship.
<br>

Null Hypothesis (H₀): 𝛽₁ = 0 (no relationship between social media usage and time spent with friends).

Alternative Hypothesis (H₁): 𝛽₁ ≠ 0 (there is a relationship, either positive or negative, between social media usage and time spent with friends).

--

*Estimate the Slope Coefficient (𝛽₁):*

After fitting the regression, I will obtain an estimate of the slope, 𝛽₁, which represents the relationship between social media usage and time spent with friends. The sign of the slope shows whether the association is positive or negative, and its p-value shows whether the association is statistically significant.

*Calculate the p-value for 𝛽₁:*

The p-value associated with the slope 𝛽₁ tells us the probability of observing a slope estimate at least this extreme if the null hypothesis (𝛽₁ = 0) were true.

*Set a Significance Level α=0.05.*

If the p-value for 𝛽₁ < 0.05: Reject the null hypothesis.
This means that there is significant evidence that social media usage is related to the time spent with friends.

If the p-value for 𝛽₁ ≥ 0.05: Fail to reject the null hypothesis.
This means that there is not enough evidence to conclude that social media usage is related to time spent with friends.

*Check the Confidence Interval for 𝛽₁:*

I can also check whether the 95% confidence interval for 𝛽₁ includes zero.
A 95% confidence interval that excludes zero leads to the same conclusion as a p-value below 0.05: the relationship is statistically significant.
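
The slope estimate, p-value, and confidence interval steps can be sketched with the Python standard library. This is a minimal illustration, not the final analysis code: the function name and the toy values are made up, and the p-value uses a normal approximation to the t distribution, which is only reasonable for a large sample.

```python
import math
from statistics import NormalDist, mean


def simple_linear_regression(x, y, alpha=0.05):
    """Least-squares fit of y = b0 + b1 * x with a two-sided test of H0: b1 = 0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Standard error of the slope from the residual variance (n - 2 degrees of freedom).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)

    # Assumption: normal approximation to the t distribution (large samples only).
    std_normal = NormalDist()
    z = slope / se_slope
    p_value = 2 * std_normal.cdf(-abs(z))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (slope - z_crit * se_slope, slope + z_crit * se_slope)
    return {"slope": slope, "intercept": intercept, "p_value": p_value, "ci": ci}


# Made-up toy values for illustration, not CSCS data.
social_media_hours = [0.5, 1, 2, 3, 4, 5, 6, 8]
friend_hours = [12, 10, 11, 8, 7, 7, 4, 3]

result = simple_linear_regression(social_media_hours, friend_hours)
reject_h0 = result["p_value"] < 0.05
ci_excludes_zero = not (result["ci"][0] <= 0 <= result["ci"][1])
print(result, reject_h0, ci_excludes_zero)
```

With this approximation, the decision from the p-value and the decision from the confidence interval always agree, which is why the interval serves as a cross-check rather than a separate test.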

**Visualization** - Scatter Plot <br> 
A scatter plot will visually illustrate any potential linear relationship and can help in interpreting the regression results. It allows for a clear visual representation of the relationship (or lack thereof) between the two variables.  

**Hypothesized Results**<br>  
If a negative relationship is confirmed, it would suggest that higher social media engagement reduces time available for face-to-face interactions, supporting concerns about the impact of digital interaction on real-life social relationships.

If a positive relationship is found, it might suggest that social media has a complementary role in social interactions, potentially enabling more in-person meetings or supporting stronger social connections.

If no significant relationship is found, it would indicate that social media usage does not appear to meaningfully impact time spent with friends, suggesting that digital and real-life social worlds may operate independently in this context.

These findings could have important implications for understanding how digital communication affects real-life social relationships and may inform recommendations for balancing online and offline social interactions.

---

### **Analysis 2: Effect of Income on Participation in Social Activities**

**Research Question:**  
Is there an association between the income levels of Canadians and their participation in social activities in 2022?

**Variables:**  
1. **DEMO_household_income** (independent quantitative variable separated into categories): Respondents' yearly household income until Dec. 31, 2022.<br>
Visualization: <br>
Bar Plot - Since the survey question already groups income into ranges for respondents to pick from, a bar plot is the best visualization tool. It can effectively display the distribution of household incomes among respondents, separated into the income ranges already given in the data.
<br>

2. **CONNECTION_activities_meeting_organization_p3m** (dependent qualitative ordinal variable): How often respondents have attended a meeting of other organization(s) (i.e. outside of work).<br>
Visualization: <br>
Bar Plot - A bar plot can be used to display the frequency of responses for social activity participation. Each bar would represent a category of participation (e.g., "Never," "Monthly," etc.), with the height of the bars indicating the number of respondents in each category. A bar plot is the best choice for this qualitative variable since it gives an overview of how often the respondents engage socially.
<br>

**Assumptions:**

- Independence of Observations: Each respondent's income and social activity responses should be independent of every other respondent's. One individual's responses should not affect another's.
- Sampling Distribution: The sampling distribution of the sample mean should be approximately normal.
- Both variables in the analysis should be categorical (income ranges and levels of participation frequency).
- Expected Frequency:
    - The expected frequency for each cell in the contingency table should be 5 or more. This helps to ensure that the Chi-Square approximation is valid. If any expected frequency is less than 5, the test results may not be reliable.
    - If any expected frequency is less than 5 and the table is larger than 2x2, Fisher's Exact Test with Monte Carlo simulation will be used instead.
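
The expected-frequency check can be sketched as follows. Each expected count is (row total × column total) / grand total. The function name and the counts are hypothetical and only illustrate the arithmetic.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: rows are income ranges, columns are participation levels.
observed = [
    [30, 12, 3],
    [25, 20, 5],
    [10, 15, 2],
]

expected = expected_counts(observed)

# Cells whose expected count falls below 5 break the Chi-Square rule of thumb.
small_cells = [
    (i, j)
    for i, row in enumerate(expected)
    for j, value in enumerate(row)
    if value < 5
]
print(small_cells)
```

In this made-up table every cell in the last column has an expected count below 5, so the fallback test would be the appropriate choice.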

**Summary Statistics**

Mean and Median: These two will help us see which income group reports higher or lower levels of social engagement on average.

Frequency Counts: Count the number of respondents in each social engagement category for each income group to observe distributions.

**Planned Analysis: Chi-Square Hypothesis Test** 
<br> 
Null Hypothesis (H₀): There is no association between income levels and frequency of participation in social activities. In other words, income levels do not affect or correlate with social activity participation frequency.

Alternative Hypothesis (H₁): There is an association between income levels and frequency of participation in social activities. This means that income levels and social activity participation frequency are related.
<br> 

--

Set up contingency table for income levels and participation frequency.

Run Chi-Square Test; check p-value and expected counts.
- Compute the Chi-Square Statistic
- Determine Degrees of Freedom
- Find the P-value: Compare the calculated Chi-Square statistic to a Chi-Square distribution with the appropriate degrees of freedom to determine the p-value.
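
The three steps above can be sketched with the standard library alone. The helper names and the counts are hypothetical. The p-value comes from a series expansion of the regularized incomplete gamma function, which is one way to evaluate the Chi-Square upper tail without a statistics package.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a Chi-Square distribution with dof degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = dof / 2, x / 2
    # Series for the regularized lower incomplete gamma function P(a, x / 2).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi_square_test(table):
    """Return (statistic, degrees of freedom, p-value) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, chi2_sf(statistic, dof)


# Hypothetical counts: two income groups by three participation levels.
observed = [
    [20, 30, 50],
    [40, 30, 30],
]
statistic, dof, p_value = chi_square_test(observed)
print(statistic, dof, p_value)
```

The degrees of freedom are (rows − 1) × (columns − 1), so the real table, with more income ranges and participation levels, would have more than the 2 degrees of freedom in this toy example.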

If the assumptions for Chi-Square are not met (an expected frequency is under 5), perform Fisher’s Exact Test with Monte Carlo simulation, since the table is larger than 2x2.
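
One way to approximate Fisher's Exact Test on a larger table is to simulate tables with the same row and column totals and count how often a simulated table is no more probable than the observed one. The sketch below does this by shuffling column labels across respondents. The function names, the simulation count, and the counts in the table are assumptions for illustration.

```python
import math
import random


def table_log_prob(table):
    """Log-probability of a table given its margins, up to an additive constant."""
    return -sum(math.lgamma(n + 1) for row in table for n in row)


def monte_carlo_fisher(table, n_sim=2000, seed=0):
    """Monte Carlo p-value for Fisher's Exact Test on an r x c table."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per respondent.
    rows = [i for i, row in enumerate(table) for n in row for _ in range(n)]
    cols = [j for row in table for j, n in enumerate(row) for _ in range(n)]
    observed_lp = table_log_prob(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)  # shuffling labels keeps both sets of margins fixed
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        # Count simulated tables at least as extreme (no more probable) than the observed one.
        if table_log_prob(sim) <= observed_lp + 1e-7:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Hypothetical sparse table with small expected counts in the last column.
observed = [
    [30, 12, 3],
    [25, 20, 5],
    [10, 15, 2],
]
p_value = monte_carlo_fisher(observed)
print(p_value)
```

The `+ 1` in the numerator and denominator counts the observed table itself, so the estimated p-value is never exactly zero.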

If the p-value is less than 0.05: Reject the null hypothesis (H₀), suggesting a statistically significant association between income levels and social activity participation.

If the p-value is 0.05 or greater: Fail to reject the null hypothesis, suggesting there is not enough evidence to support an association between income levels and social activity participation.

**Visualization** - Stacked Bar Plot
<br> A stacked bar plot shows the distribution of participation frequency within each income level, giving a clear overview of how often people in different income categories engage in social activities.
Since the income is categorized into ranges, the stacked bar plot can visually highlight any trends or disparities in social activity participation associated with those income levels.

**Hypothesized Results**  
<br> Based on prior research, we might hypothesize that income level influences participation in social activities. However, it is also possible that there may be no significant association, indicating that factors other than income (e.g., time availability, cultural preferences) could drive social participation across income groups. 

If income significantly affects social participation, it may indicate that economic resources support social engagement. If no significant difference is found, this could suggest that social participation is accessible across income levels.  Understanding the role of income in social participation could guide policies aimed at promoting inclusivity and access to social opportunities for lower-income groups.

---

### **Analysis 3: Association Between Age and Loneliness**

**Research Question:**  
Is there a relationship between age and loneliness levels among Canadian respondents in 2022?

**Variables:**  
1. **DEMO_age** (independent quantitative variable): Ages of respondents. <br>
Visualization: <br>
Histogram - A histogram is the best visualization method for the ages of the respondents because they can be separated into bins of different age ranges. This makes it easier to analyse trends among different age groups such as teenagers, middle-aged adults, and seniors.
<br>

2. **LONELY_dejong_emotional_social_loneliness_scale_miss** (dependent qualitative ordinal variable): The extent to which the following phrase applies to respondents' situations. - "I miss having people around." <br>
Visualization: <br>
Bar Plot - A bar plot can clearly show the frequency of data in each qualitative option that was given to the respondents. These ordinal categories can then be organized into a bar plot with the frequency of data for each, making it easy to see if there were more common answers than others.

**Assumptions:**

- Independence of Observations: Each respondent's age and loneliness responses should be independent of every other respondent's. One individual's responses should not affect another's.
- Sampling Distribution: The sampling distribution of the sample mean should be approximately normal.
- Both variables in the analysis should be categorical (age groups and levels of loneliness).
- Expected Frequency:
    - The expected frequency for each cell in the contingency table should be 5 or more. This helps to ensure that the Chi-Square approximation is valid. If any expected frequency is less than 5, the test results may not be reliable.
    - If any expected frequency is less than 5 and the table is larger than 2x2, Fisher's Exact Test with Monte Carlo simulation will be used instead.

**Summary Statistics**

Mean and Median: These two will help us see which age group reports higher or lower levels of loneliness on average.

Frequency Counts: Count the number of respondents in each loneliness category for each age group to observe distributions.

**Planned Analysis: Chi-Square Hypothesis Test** <br>  
Null Hypothesis (H₀): There is no association between age categories and loneliness levels. In other words, age does not influence feelings of loneliness.

Alternative Hypothesis (H₁): There is an association between age categories and loneliness levels, meaning age does influence feelings of loneliness.
<br> 

--

Create Age Categories: Divide the continuous age variable into categorical bins (e.g., 18-24, 25-34, etc.).

Set up contingency table for age and loneliness levels.
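
The binning and contingency-table steps can be sketched as below. Only the first two bins (18-24, 25-34) come from the plan above. The remaining bin edges, the loneliness labels, and the respondent records are placeholders chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of each age bin. Bins past 25-34 are assumed, not taken from the survey.
AGE_EDGES = [18, 25, 35, 45, 55, 65]
AGE_LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# Placeholder response labels for the loneliness item.
LONELINESS_LEVELS = ["No", "More or less", "Yes"]


def age_group(age):
    """Map an age to its bin label, or None if the age is below the first edge."""
    idx = bisect_right(AGE_EDGES, age) - 1
    return AGE_LABELS[idx] if idx >= 0 else None


def contingency_table(pairs, row_labels, col_labels):
    """Count (row label, column label) pairs into a table of lists."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


# Made-up (age, loneliness response) records, not CSCS data.
respondents = [
    (19, "Yes"), (23, "No"), (31, "More or less"),
    (47, "No"), (68, "Yes"), (72, "Yes"),
]

pairs = [(age_group(age), level) for age, level in respondents]
table = contingency_table(pairs, AGE_LABELS, LONELINESS_LEVELS)
for label, row in zip(AGE_LABELS, table):
    print(label, row)
```

Each row of `table` is one age group and each column is one loneliness level, which is the layout the Chi-Square test expects.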

Run Chi-Square Test; check p-value and expected counts.

If the assumptions for Chi-Square are not met (an expected frequency is under 5), perform Fisher’s Exact Test with Monte Carlo simulation, since the table is larger than 2x2.

If the p-value is less than 0.05: Reject the null hypothesis (H₀), suggesting a statistically significant association between age and loneliness levels.

If the p-value is 0.05 or greater: Fail to reject the null hypothesis, suggesting there is not enough evidence to support an association between age and loneliness levels.

**Visualization** - Box Plot <br>
A box plot can show the distribution of ages for each loneliness level, giving a visual sense of any trends. It makes it easy to see which median ages correspond to which loneliness levels.


**Hypothesized Results** <br>

Based on prior research, it’s expected that older age groups may report higher loneliness levels, especially relating to missing having people around. This could be due to lifestyle changes: their kids may have moved away, so they might have fewer people around them at home. Younger age groups may also report higher loneliness levels, which could also reflect their lifestyle. However, there could also be no association at all between age and loneliness, indicating that loneliness could be dependent on other factors.

Understanding age and loneliness associations can guide mental health and social support initiatives, improving targeted interventions based on age demographics.

---


Project Group Request: Edie Chen, Jason Li, and Zain Elsayed