# GCAP3226 Week 3: Regression Analysis In-Class Exercise

## Instructions
This notebook contains exercises for you to complete during class. Follow the prompts and write your code in the designated cells. Make sure to interpret your results and answer any questions provided.

**Dataset**: `GCAP3226_week3.csv` (Hong Kong Waste Charging Policy Survey Data)

**Learning Objectives**:
- Practice simple and multiple linear regression analysis
- Compare forward and backward selection methods
- Interpret regression results
- Create exploratory visualizations
- Understand the relationship between support levels and various factors


## Package Installation and Setup

**Your Task**: Install the required packages and import the necessary libraries.

**Required packages**:
- pandas
- numpy  
- scikit-learn
- statsmodels
- matplotlib

**Instructions**: 
1. Write code to install any missing packages using pip
2. Import all necessary libraries for the analysis
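
One way to approach step 1 is to check which packages are missing before installing anything. The sketch below uses only the standard library; the `REQUIRED` mapping and the `find_missing` helper are illustrative names, not part of the course materials.

```python
import importlib.util

# Map pip package names to their import names (scikit-learn imports as sklearn).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "matplotlib": "matplotlib",
}


def find_missing(required):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = find_missing(REQUIRED)
if missing:
    # In a notebook cell, these can be installed with: %pip install <names>
    print("Missing packages:", " ".join(missing))
else:
    print("All required packages are installed.")
```

The provided Task 3 code refers to `np` and `df`, so importing numpy as `np` and pandas as `pd` (and loading the data into `df`) keeps your cells compatible with it.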


In [2]:
# Write your code here for package installation and imports


## Task 1: Data Loading and Initial Exploration

**Your Task**: Load the dataset and perform initial data exploration.

**Instructions**:
1. Load the `GCAP3226_week3.csv` dataset using pandas
2. Display the first few rows to understand the data structure
3. Check the data types and basic information about the dataset
4. Identify the key variables you'll be working with, especially:
   - `support_info` (response variable)
   - `government_consideration` (main predictor)
   - Other potential predictors
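
The exploration steps above can be sketched as follows. To keep the example runnable on its own, it reads a tiny made-up CSV from memory; the values are invented for illustration and are not survey data. In class, you would pass the real file name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny made-up CSV so the sketch runs on its own; in class, use
# pd.read_csv("GCAP3226_week3.csv") instead.
toy_csv = io.StringIO(
    "support_info,government_consideration,fairness\n"
    "4,3,5\n"
    "2,1,2\n"
    "5,4,\n"
    "3,3,4\n"
)
df_toy = pd.read_csv(toy_csv)

print(df_toy.head())        # first rows
df_toy.info()               # column dtypes and non-null counts
print(df_toy.describe())    # summary statistics for numeric columns
print(df_toy.isna().sum())  # number of missing values per column
```

Checking `isna().sum()` early is useful because the later tasks ask you to handle missing values before fitting any model.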


In [3]:
# Write your code here to load and explore the dataset
# 1. Load the dataset


# 2. Display first few rows


# 3. Check data info


# 4. Check basic statistics



## Task 2: Simple Linear Regression

**Your Task**: Run a simple linear regression of support level (`support_info`) on government consideration (`government_consideration`) and interpret the results.

**Instructions**:
1. Prepare the data for simple linear regression:
   - Use `support_info` as the dependent variable (y)
   - Use `government_consideration` as the independent variable (x)
   - Handle any missing values appropriately

2. Fit the simple linear regression model using both sklearn and statsmodels

3. Create a scatter plot with the regression line

4. **Interpret the results**:
   - What is the coefficient for government_consideration?
   - Is the relationship statistically significant? (check p-value)
   - What does the R-squared tell us?
   - How would you interpret the coefficient in practical terms?

**Questions to Answer**:
- What is the relationship between government consideration and support level?
- Is this relationship statistically significant?
- How much of the variance in support level is explained by government consideration?


In [4]:
# Write your code here for simple linear regression
# 1. Prepare the data


# 2. Fit model using sklearn


# 3. Fit model using statsmodels for statistical tests


# 4. Print results


# 5. Create visualization


**Your Interpretation Space**: Write your interpretation of the simple linear regression results here.

*Questions to address*:
1. Coefficient interpretation: What does the coefficient mean?
2. Statistical significance: Is the relationship significant?
3. Model fit: How well does the model explain the data?
4. Practical meaning: What does this tell us about the relationship?


## Task 3: Multiple Linear Regression with Variable Selection

**Your Task**: Fit a multiple linear regression model using the **backward** selection method, then compare the results with the forward selection results in the provided notebook.

**Instructions**:
1. **Data Preparation**:
   - Handle missing values appropriately
   - Consider centering continuous variables for better interpretability


2. **Backward Selection**:
   - Start with all predictors
   - Remove one variable at a time, dropping the one with the highest p-value and refitting the model after each removal
   - Stop when all remaining variables are statistically significant (for example, p < 0.05)
   - Record the final model

3. **Comparison**:
   - Compare which variables were selected by each method
   - Compare R-squared and adjusted R-squared values
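
To see why centering helps interpretability, the short sketch below fits a line to synthetic data before and after centering the predictor. The slope is unchanged, while the intercept becomes the predicted outcome at the predictor's mean (which, for a single predictor, equals the mean of the outcome).

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(2)
x = rng.uniform(1, 5, size=100)
y = 3.0 + 0.7 * x + rng.normal(0, 0.5, size=100)

# np.polyfit with deg=1 returns [slope, intercept]
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)

x_c = x - x.mean()  # centered predictor
slope_c, intercept_c = np.polyfit(x_c, y, deg=1)

print("slope (raw, centered):    ", slope_raw, slope_c)
print("intercept (raw, centered):", intercept_raw, intercept_c)
print("mean of y:                ", y.mean())
```

With several centered predictors, the intercept can be read as the expected outcome for a respondent who is average on every centered variable and in the reference category of each dummy.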


**Questions to Answer**:
- Do both methods arrive at the same final model?


In [None]:
# Data preparation
# Recode household monthly income into a single numeric variable
# (one representative value per bracket, in HK$ thousands)
conditions = [
    df['HouseholdMonthlyIncomeRange_Below15k'] == 1,
    df['HouseholdMonthlyIncomeRange_15,001-30,000'] == 1,
    df['HouseholdMonthlyIncomeRange_30,001-50,000'] == 1,
    df['HouseholdMonthlyIncomeRange_50,001-70,000'] == 1,
    df['HouseholdMonthlyIncomeRange_AboveHK70k'] == 1
]
# Bracket midpoints; 10 and 90 are assumed values for the lowest and open-ended top brackets
income_values = [10, 22.5, 40, 60, 90]

df['income'] = np.select(conditions, income_values, default=np.nan)
print("Income variable created with these value counts:")
print(df['income'].value_counts().sort_index())


# Recode education level into a single ordinal variable
conditions = [
    df['HighestEducationLevel_Primaryorbelow'] == 1,
    df['HighestEducationLevel_Secondary'] == 1,
    df['HighestEducationLevel_DiplomaorBachelor'] == 1,
    df['HighestEducationLevel_Masterorabove'] == 1
]
education_values = [1, 2, 3, 4]  # ordinal values for education levels

df['education'] = np.select(conditions, education_values, default=np.nan)
print("\nEducation variable created with these value counts:")
print(df['education'].value_counts().sort_index())

# Recode age from binary columns to a single 'age' column
# (approximate bracket midpoints; 70 is an assumed value for the open-ended 65+ bracket)
age_columns = ['AgeRange_18-24', 'AgeRange_25-34', 'AgeRange_35-44',
               'AgeRange_45-54', 'AgeRange_55-64', 'AgeRange_65+']
midpoints = [20, 30, 40, 50, 60, 70]

# Assign each respondent the midpoint of the age bracket flagged with 1
df['age'] = np.nan
for col, midpoint in zip(age_columns, midpoints):
    df.loc[df[col] == 1, 'age'] = midpoint

# Drop rows with missing age (imputing is an alternative); .copy() guards
# against SettingWithCopyWarning when new columns are added below
df = df.dropna(subset=['age']).copy()

# Verify the new 'age' column (print is needed because this is not the last line of the cell)
print(df['age'].value_counts().sort_index())



# Center the continuous variables: subtract each column's mean and store
# the result in a new column with a '_c' suffix
continuous_vars = ['fairness', 'government_consideration', 'recycling_effort', 'income',
                   'education', 'recycle_frequency', 'household_size', 'total_score',
                   'policy_helpfulness', 'waste_severity', 'age']
for var in continuous_vars:
    df[f'{var}_c'] = df[var] - df[var].mean()


In [None]:
# Use centered versions of continuous variables
features = ['fairness_c', 'government_consideration_c', 'policy_helpfulness_c', 'waste_severity_c', 'recycling_effort_c',
            'LocalResidentcode', 'DailyWasteBags_More than 1 bag', 'DailyWasteBags_Exactly 1 bag',
            'HousingType_Other', 'HousingType_Private housing', 'HousingType_Subsidized housing',
            'age_c', 'income_c', 'education_c', 'recycle_frequency_c', 'household_size_c', 'total_score_c']

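# Keep only rows with complete data for the predictors and the response
model_data = df[features + ['support_info']].dropna()
X_multi = model_data[features]
y_multi = model_data['support_info']

# Fit multiple linear regression model using backward selection
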

**Your Comparison and Analysis**: Write your comparison of forward vs backward selection results here. Are the selected variables the same?


## Task 4: Exploratory Data Analysis and Visualization

**Your Task**: Create visualizations to explore the relationship between support level and potential explanatory variables of your choice.

**Instructions**:
1. **Variable Selection**: Choose 2 additional variables from the dataset that you think might be interesting to explore in relation to support level. Consider variables like:
   - Demographic variables (age, income, education)
   - Attitudinal variables (fairness, policy_helpfulness, waste_severity)
   - Behavioral variables (recycling_effort, recycle_frequency)
   - Other variables that interest you

2. **Create Visualizations**: For each selected variable, create appropriate visualizations:
   - Scatter plots for continuous variables
   - Box plots or bar charts for categorical variables
   - Consider creating grouped visualizations (e.g., support level by different categories)

3. **Analysis and Interpretation**: For each visualization:
   - Describe what you observe
   - Identify any patterns or relationships
   - Discuss potential implications for understanding support levels
   - Consider how these findings relate to your regression results

**Questions to Answer**:
- What patterns do you observe in the data?
- Are there any surprising relationships?
- How do these visual findings support or contradict your regression results?
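
The two plot types suggested above can be sketched with matplotlib as follows. The data are synthetic, and the single `housing` column is a hypothetical stand-in: the survey file stores housing type as separate dummy columns, so you would need to adapt the grouping step for the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the survey)
rng = np.random.default_rng(4)
n = 150
toy = pd.DataFrame({
    "fairness": rng.integers(1, 6, size=n),
    "housing": rng.choice(["Public", "Private", "Subsidized"], size=n),
})
toy["support_info"] = 1 + 0.6 * toy["fairness"] + rng.normal(0, 1, size=n)

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot with a least-squares line for a numeric predictor
ax_scatter.scatter(toy["fairness"], toy["support_info"], alpha=0.5)
slope, intercept = np.polyfit(toy["fairness"], toy["support_info"], deg=1)
xs = np.linspace(toy["fairness"].min(), toy["fairness"].max(), 50)
ax_scatter.plot(xs, intercept + slope * xs, color="red")
ax_scatter.set_xlabel("fairness")
ax_scatter.set_ylabel("support_info")

# Box plot of the outcome for each category
groups = sorted(toy["housing"].unique())
ax_box.boxplot(
    [toy.loc[toy["housing"] == g, "support_info"].to_numpy() for g in groups]
)
ax_box.set_xticks(range(1, len(groups) + 1))
ax_box.set_xticklabels(groups)
ax_box.set_ylabel("support_info")

fig.tight_layout()
```

When the predictor takes only a few integer values, many points overlap in the scatter plot, so lowering `alpha` or adding a small random jitter to the x-values can make the pattern easier to see.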


In [None]:
# Write your code here for exploratory visualizations
# Choose 2 variables and create visualizations


**Your Visualization Analysis**: Write your analysis and interpretation of the visualizations here.


## Summary and Reflection

**Final Questions to Consider**:
- What is your overall assessment of the relationship between government consideration and support for the waste charging policy?
- What would you recommend to policymakers based on your analysis?


---

**End of Exercise**

*Congratulations on completing the regression analysis exercise! You have practiced:*
- *Simple and multiple linear regression*
- *Backward variable selection and comparison with forward selection*
- *Exploratory data analysis and visualization*
- *Statistical interpretation and practical application*

*Remember to save your notebook and review your work before submission.*
