# Data Visualization for MSW Charging Scheme in Hong Kong

This Jupyter Notebook guides you through data visualization techniques using Python, focusing on the Municipal Solid Waste Charging Scheme dataset in Hong Kong. This aligns with the *Empowering Citizens through Data: Participatory Policy Analysis for Hong Kong* course, supporting SDG 12 (Responsible Consumption and Production). You will learn to:
1. Load and examine a dataset.
2. Understand its structure and variables.
3. Visualize categorical data with frequency tables, bar charts, and pie charts.
4. Analyze continuous data with summary statistics, box-whisker plots, and histograms.
5. Explore relationships using scatter plots.
6. Save visualizations to a directory.

Use GitHub Copilot to assist by writing prompts (e.g., comments like `# Write code to...`) to generate Python code. Libraries required: `pandas`, `matplotlib`, `seaborn`.



## Section 0: Install / Import Relevant Libraries

If Python is newly installed on your laptop, add a new Code cell and run `%pip install pandas matplotlib seaborn` to install the required libraries.


**Task**: Use GitHub Copilot to generate the import statements for `pandas`, `matplotlib`, and `seaborn`. Write a prompt as a comment (e.g., `# Write code to import pandas, matplotlib, and seaborn`) and let Copilot suggest the code.

In [None]:
# Write python to import pandas, matplotlib, and seaborn 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



## Section 1: Load and Examine the Dataset

**Tasks**: 
1. Create a new folder called `GCAP3226` on your desktop
2. Save `GCAP3226_week2.csv` in the newly created `GCAP3226` folder
3. File -> Open Folder -> Find the newly created `GCAP3226` folder -> Select Folder
4. Write a prompt for GitHub Copilot to generate Python code that loads `GCAP3226_week2.csv` into a pandas DataFrame and displays the first five rows.
5. Run the code.

In [None]:
# Example: Write code to load GCAP3226_week2.csv from the working directory into a pandas DataFrame; show the first five rows
df = pd.read_csv('GCAP3226_week2.csv')
df.head()   

## Section 2: Understand the Dataset Structure and the Variables

**Task**: Use GitHub Copilot to generate code that displays dataset information and summary statistics, and add comments explaining each line; write the prompt as a comment at the top of the next cell, then run the code.


In [None]:
# Example: Write code to display dataset info and summary statistics; add comment lines to explain what each line of code does
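One possible result of the prompt above is sketched below. The six rows are made up so that the cell runs on its own; in the notebook, skip the `pd.DataFrame(...)` block and use the `df` loaded in Section 1.

```python
import pandas as pd

# Made-up stand-in rows (NOT the real survey data); column names are taken
# from later sections of this notebook.
df = pd.DataFrame({
    "support_level": [1, 2, 3, 4, 5, 4],
    "Distance_artificial": [120.0, 450.5, 300.0, 80.0, 960.0, 510.0],
    "food_waste_behavior": ["never_seen", "seen_not_used", "seen_and_used",
                            "never_seen", "seen_and_used", "seen_not_used"],
})

# Number of rows and columns
print(df.shape)

# Column names, non-null counts, and data types
df.info()

# Count, mean, std, min, quartiles, and max for each numeric column
summary = df.describe()
print(summary)

# Number of missing values in each column
missing = df.isna().sum()
print(missing)
```

By default `describe()` covers only the numeric columns, so text columns such as `food_waste_behavior` are left out of the summary.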
        

## Section 3: Categorical Data Visualization

**Task**: Use GitHub Copilot to generate code for a 
- frequency table
- bar chart, and 
- pie chart 

for the variable named `support_level`. 

Make sure the figures are easy to read: give each one a title, axis labels, and legible category labels.

In [None]:
# Example: Write Python code to create a frequency table for the support_level variable
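A minimal sketch of a frequency table, assuming `support_level` holds integer codes from 1 to 5. The ten responses are invented for illustration; with the real data, replace `responses` with `df["support_level"]`.

```python
import pandas as pd

# Invented responses standing in for df["support_level"]
responses = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 3], name="support_level")

# Count each response and order the rows by scale value rather than by frequency
counts = responses.value_counts().sort_index()

# Express each count as a percentage of all responses
percent = (counts / counts.sum() * 100).round(1)

freq_table = pd.DataFrame({"count": counts, "percent": percent})
print(freq_table)
```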


In [None]:
# Example: Write Python code to create a bar chart for support_level, ordered from strongly oppose (1) to strongly support (5), with labels and axis titles
# Example: Write Python code to label the 1-5 scale as 1=Strongly oppose, 2=Oppose, 3=Neutral, 4=Support, 5=Strongly support
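The two prompts above could lead to something like the following sketch. The responses are invented, and the title and colour are arbitrary choices; only the 1-5 label mapping comes from the prompt.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Label mapping taken from the prompt above
labels = {1: "Strongly oppose", 2: "Oppose", 3: "Neutral",
          4: "Support", 5: "Strongly support"}

# Invented responses standing in for df["support_level"]
responses = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 3], name="support_level")

# reindex keeps all five categories in scale order, even one with no responses
counts = responses.value_counts().reindex(list(labels), fill_value=0)

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar([labels[k] for k in counts.index], counts.values, color="steelblue")
ax.bar_label(bars)  # print the count above each bar
ax.set_xlabel("Support level")
ax.set_ylabel("Number of respondents")
ax.set_title("Support for the MSW charging scheme")
ax.tick_params(axis="x", rotation=20)
fig.tight_layout()
```

Ordering by scale value rather than by frequency keeps the Likert categories in their natural sequence, which makes the shape of opinion easier to read.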

In [None]:
# Write Python code to create a pie chart for support_level, explicitly label the 1-5 scale as 1=Strongly oppose, 2=Oppose, 3=Neutral, 4=Support, 5=Strongly support   
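A hedged sketch of the pie chart, again with invented responses in place of `df["support_level"]`. The start angle and slice direction are presentation choices, not requirements of the task.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Label mapping taken from the prompt above
labels = {1: "Strongly oppose", 2: "Oppose", 3: "Neutral",
          4: "Support", 5: "Strongly support"}

# Invented responses standing in for df["support_level"]
responses = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 3], name="support_level")
counts = responses.value_counts().reindex(list(labels), fill_value=0)

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    counts.values,
    labels=[labels[k] for k in counts.index],
    autopct="%1.1f%%",   # show each slice's share of responses
    startangle=90,       # start at the top of the circle
    counterclock=False,  # lay slices out clockwise in scale order
)
ax.set_title("Support for the MSW charging scheme")
fig.tight_layout()
```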

In [None]:
# Write Python code to create bar charts for support_level and support_after_info; put the plots in a 1x2 grid. Refine the output using GitHub Copilot Chat and apply the new code if necessary.
# Get the maximum count for y-axis limit
# Bar chart for support_level
# Bar chart for support_after_info
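The outline above can be sketched as follows, assuming both variables use the same 1-5 scale. The data are invented. Both panels get the same y-axis limit so that bar heights can be compared directly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows standing in for the two survey columns
df = pd.DataFrame({
    "support_level":      [1, 2, 2, 3, 4, 4, 4, 5, 5, 3],
    "support_after_info": [2, 2, 3, 3, 4, 4, 5, 5, 5, 4],
})
scale = [1, 2, 3, 4, 5]
columns = ["support_level", "support_after_info"]

# Count responses per scale point, keeping empty categories as zero
counts = {col: df[col].value_counts().reindex(scale, fill_value=0) for col in columns}

# Get the maximum count for the shared y-axis limit
y_max = max(c.max() for c in counts.values())

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
for ax, col in zip(axes, columns):
    ax.bar(scale, counts[col].values, color="steelblue")
    ax.set_title(col)
    ax.set_xlabel("Response (1 = Strongly oppose, 5 = Strongly support)")
    ax.set_xticks(scale)
    ax.set_ylim(0, y_max + 1)
axes[0].set_ylabel("Number of respondents")
fig.tight_layout()
```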

In [None]:
# Write code to create bar charts for perceived fairness, government_consideration, policy_helpfulness, and waste_severity with correct Likert scales and labels
# Define variables, titles, and Likert scale options for each
# Calculate max count for y-axis scaling

**More Tasks**: 
- Show the distribution of respondents across living districts. The district information is stored as one 0/1 column per district, so you may ask GitHub Copilot for the steps needed.
- Generate a cross table showing the percentage of respondents in each district who report seeing a food waste bin (`food_waste_behavior`).

In [None]:
# Example:
# Write Python code to generate the living district distribution of respondents (now the living district information is coded by 0/1)
# Step 1. Count the number of columns which contain the string "HongKongDistrict" in the dataframe
# Step 2. Combine all district columns into one Series and count frequencies
    # Extract district name after the underscore
    # Sum the values (number of participants for this district)
# Step 3. Generate a bar chart of the frequency table above, sorted by frequency in descending order.
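The three steps above can be sketched as follows. The rows are invented, and only three of the district columns are shown; the sketch assumes every district column name contains `HongKongDistrict` followed by an underscore and the district name.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows: one 0/1 column per district, plus one unrelated column
df = pd.DataFrame({
    "HongKongDistrict_CW":          [1, 0, 0, 0, 0, 1],
    "HongKongDistrict_KowloonCity": [0, 1, 0, 1, 1, 0],
    "HongKongDistrict_North":       [0, 0, 1, 0, 0, 0],
    "support_level":                [3, 4, 2, 5, 4, 1],
})

# Step 1: find and count the district columns
district_cols = [c for c in df.columns if "HongKongDistrict" in c]
print(len(district_cols))

# Step 2: summing a 0/1 column gives the number of respondents in that district
district_counts = df[district_cols].sum()
# Keep only the district name after the underscore
district_counts.index = [c.split("_", 1)[1] for c in district_cols]
district_counts = district_counts.sort_values(ascending=False)
print(district_counts)

# Step 3: bar chart sorted by frequency in descending order
ax = district_counts.plot(kind="bar", figsize=(8, 5), color="steelblue")
ax.set_xlabel("District")
ax.set_ylabel("Number of respondents")
ax.set_title("Respondents by living district")
ax.figure.tight_layout()
```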

In [None]:
# Generate a cross table showing the percentage of respondents in each district who report seeing a food waste bin (`food_waste_behavior`).
# The district information is coded as 0/1 columns, for example: HongKongDistrict_CW, HongKongDistrict_KowloonCity, HongKongDistrict_Other, HongKongDistrict_North, HongKongDistrict_Southern.
# The values of food_waste_behavior are never_seen, seen_not_used, seen_and_used.
# Step 1. Create a DataFrame to hold district and food waste behavior
    # Filter respondents living in this district
        # Count food waste behavior
# Step 2. Create a pivot table for better visualization
# Step 3. Visualize the pivot table using a grouped bar chart
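A compact alternative to the loop outlined above is sketched below. Instead of filtering district by district, it collapses the 0/1 columns into a single district label with `idxmax` and then calls `pd.crosstab`. It assumes each respondent has exactly one district column set to 1; the rows are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows: two district columns and the behavior variable
df = pd.DataFrame({
    "HongKongDistrict_CW":    [1, 1, 0, 0, 0, 0],
    "HongKongDistrict_North": [0, 0, 1, 1, 1, 1],
    "food_waste_behavior": ["never_seen", "seen_and_used", "never_seen",
                            "seen_not_used", "seen_not_used", "seen_and_used"],
})

# Collapse the 0/1 columns into one label per respondent
# (assumes exactly one district column equals 1 in every row)
district_cols = [c for c in df.columns if c.startswith("HongKongDistrict_")]
district = (
    df[district_cols]
    .idxmax(axis=1)
    .str.replace("HongKongDistrict_", "", regex=False)
    .rename("district")
)

# Row percentages: share of each behavior within each district
cross_pct = pd.crosstab(district, df["food_waste_behavior"], normalize="index") * 100
print(cross_pct.round(1))

# Share of respondents who have seen a bin, whether or not they used it
seen_pct = 100 - cross_pct["never_seen"]
print(seen_pct.round(1))

# Grouped bar chart of the cross table
ax = cross_pct.plot(kind="bar", figsize=(8, 5))
ax.set_xlabel("District")
ax.set_ylabel("Percent of district respondents")
ax.set_title("Food waste bin awareness by district")
ax.figure.tight_layout()
```

With `normalize="index"`, each row sums to 100 percent, so districts with different numbers of respondents can be compared on the same footing.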

## Section 4: Analyze Continuous Data (vs. Categorical Data)

A box-whisker plot, or box plot, is a graphical representation of the five-number summary of a dataset: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It is a standardized way of displaying the distribution of data based on these five key statistics. Many plotting tools, including matplotlib with its default settings, extend the whiskers only as far as 1.5 times the interquartile range beyond the box and draw more extreme values as individual outlier points.

A histogram is a type of bar chart used to show the distribution of continuous numerical data. It divides the data into intervals called bins and displays the frequency of data points within each bin as a vertical bar.

**Task**: Use GitHub Copilot to generate Python code for summary statistics, a box-whisker plot, and a histogram of the distance in metres from each respondent to the nearest recycling facility (`Distance_artificial`). Write prompts as comments and run the code.

In [None]:
# Write code to display summary statistics for Distance_artificial
# Write code to create a box-whisker plot and a histogram for Distance_artificial and put them side by side in a 1x2 grid.
# Box-whisker plot
# Histogram
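A minimal sketch of the prompts above, using ten invented distances in place of `df["Distance_artificial"]`. The bin count of 10 is an arbitrary starting point; try several values with the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented distances in metres, including one unusually large value
distance = pd.Series(
    [80, 120, 150, 200, 250, 300, 320, 400, 450, 1500],
    name="Distance_artificial",
    dtype=float,
)

# Summary statistics: count, mean, std, min, quartiles, max
print(distance.describe())

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(12, 5))

# Box-whisker plot
ax_box.boxplot(distance.dropna().to_numpy())
ax_box.set_title("Box plot of distance to nearest facility")
ax_box.set_ylabel("Distance (m)")
ax_box.set_xticks([])  # a single box needs no x tick

# Histogram
ax_hist.hist(distance.dropna().to_numpy(), bins=10, edgecolor="black")
ax_hist.set_title("Histogram of distance to nearest facility")
ax_hist.set_xlabel("Distance (m)")
ax_hist.set_ylabel("Number of respondents")

fig.tight_layout()
```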

## Section 5: Explore Relationships Between Variables

**Task**: Use GitHub Copilot to generate Python code to explore the relationship between the distance from each respondent to the nearest recycling facility (`Distance_artificial`) and recycling effort (`recycling_effort`). Write a prompt as a comment and run the code.

In [None]:
# Write code to create a scatter plot with jitter of `Distance_artificial` vs. `recycling_effort`
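A hedged sketch of a jittered scatter plot. It assumes `recycling_effort` is recorded as whole numbers, so that many points would otherwise sit on top of each other; the rows and the jitter width of 0.15 are invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows standing in for the two survey columns
df = pd.DataFrame({
    "Distance_artificial": [80, 120, 150, 200, 250, 300, 320, 400, 450, 900],
    "recycling_effort":    [5, 4, 5, 4, 3, 3, 4, 2, 2, 1],
})

# Small random vertical offsets so overlapping points become visible;
# a fixed seed makes the figure reproducible
rng = np.random.default_rng(42)
jitter = rng.uniform(-0.15, 0.15, size=len(df))

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["Distance_artificial"], df["recycling_effort"] + jitter, alpha=0.6)
ax.set_xlabel("Distance to nearest recycling facility (m)")
ax.set_ylabel("Recycling effort (jittered)")
ax.set_title("Recycling effort vs. distance to nearest facility")
fig.tight_layout()
```

Jitter changes only where points are drawn, not the underlying data, so keep it small relative to the gap between scale values.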

## Section 6: Save Visualizations to a Designated Directory

**Task**: Use GitHub Copilot to generate code that checks whether the `plots` directory exists and creates it if it does not. Then modify the earlier visualization code to save any one plot in PNG format to `plots/`.

In [None]:
# Write code to check if plots directory exists, create it if not, and list files  
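One way to do this is sketched below with `pathlib`. The small bar chart and the file name `example_bar_chart.png` are placeholders; in the notebook, call `savefig` on one of the figures created earlier.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Create the plots directory if it does not already exist
plots_dir = Path("plots")
plots_dir.mkdir(exist_ok=True)

# Placeholder figure with invented counts
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["Oppose", "Neutral", "Support"], [3, 2, 5], color="steelblue")
ax.set_ylabel("Number of respondents")

# Save as PNG; bbox_inches="tight" trims extra white space around the figure
out_path = plots_dir / "example_bar_chart.png"
fig.savefig(out_path, dpi=300, bbox_inches="tight")

# List the files now in the directory
print(sorted(p.name for p in plots_dir.iterdir()))
```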

## Reflection and Policy Insights

**Tasks**: 
- Summarize key insights from visualizations (100-150 words) (CILO 3).
- Propose one more question that can be answered using visualization. 