# <font size="6">**Introduction**</font>

This project analyzes city-level waste management data sourced from the World Bank Group.
The data provides information on municipal solid waste (MSW) generation, recycling and composting rates, waste treatment methods, and financial aspects related to waste management.

---
# <font size="6">**Purpose**</font>

The main purpose of this analysis is to:
- Understand how cities and countries manage municipal solid waste
- Identify trends in waste generation and waste diversion (recycling and composting)
- Highlight opportunities for improvement in waste management practices

---
# <font size="6">**Scope**</font>

This analysis will cover:
- Total MSW generation at both the country and city levels
- Diversion rates (recycling and composting) by city and country
- Waste management costs and strategies for disposal
- The geographical distribution of MSW and diversion efforts

We will focus on a dataset provided by the World Bank, which includes information for several countries and cities around the world.

---
# **How the Data is Accessed**

To begin the analysis:
1. We connect to Google Drive using Colab’s `drive.mount` function to access the data stored in the cloud.
2. The dataset is loaded into a pandas DataFrame (`df`), which is used for data manipulation and analysis.
3. We inspect the data by looking at the columns and unique values within them to better understand the structure.

The data file, named `city_level_data_0_0.csv`, contains detailed information about solid waste management practices across different cities and countries.

---
# **Major Steps of the Project**

The analysis is divided into the following major steps:
1. **Connect to Google Drive and Acquire the Data**:
   - Mount Google Drive to access the CSV file in your Colab environment.
2. **Load and Inspect the Dataset**:
   - Load the data into pandas and check the columns, types, and unique values (countries and cities).
3. **Group and Analyze the Data**:
   - Summarize waste generation by country and city.
   - Calculate diversion rates (recycling and composting percentages).
   - Group cities by waste generation and diversion rates to identify trends.
4. **Save Processed Data**:
   - After cleaning and transforming the data, save the results as a new CSV file for future reference.



---
# <font size="6">**Step 1: Connect to Google Drive and Acquire the Data**</font>

In the first step of the project, we will connect to Google Drive to access the data stored in the cloud. The dataset we're using is obtained from the World Bank Group, specifically the **[World Bank's Waste Management Data](https://data.worldbank.org/indicator/EN.ATM.WAST.ZS)**. The data file, named `city_level_data_0_0.csv`, contains information on solid waste management practices across various cities and countries.

---
## **Acquiring the Data**

The dataset is publicly available for download on the World Bank Group website. To download the dataset:

1. Visit the **[World Bank Data Portal](https://data.worldbank.org/indicator/EN.ATM.WAST.ZS)**.
2. Navigate to the section on municipal solid waste management or search for "waste management data."
3. Download the relevant CSV file to your local system or directly to your Google Drive.

Once the file is downloaded, you can upload it to your Google Drive. In this project, the data is stored in a folder called "Colab Notebooks" on Google Drive for easy access in the Colab environment.

---
## **1.1 Mounting Google Drive**

We will use the following code to mount your Google Drive and make the file accessible:

In [1]:
# Step 1.1
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


When you run this cell, it mounts your Google Drive in the Colab environment. The `drive.mount('/content/gdrive')` call prompts you to authorize access and grant the necessary permissions. Once the drive is mounted, the message `Mounted at /content/gdrive` appears in the output. Your Google Drive is now linked to the Colab environment, and you can access the files stored on it.

- Why this is important: Mounting Google Drive is necessary because it allows Colab to access files from your cloud storage without needing to upload them directly each time. This is particularly useful for large datasets, like the one we are using in this project.

---
## **1.2 Importing Libraries**
Next, we need to import the necessary libraries. We'll use pandas to work with tables of data and numpy for numerical calculations.

In [2]:
# Step 1.2
import numpy as np
import pandas as pd

In this step, we import the two key libraries we will be using for data analysis:

- numpy: A fundamental library for numerical computing. It allows us to perform complex mathematical operations and manipulate data efficiently. In this case, it will help with any calculations or transformations we need to perform on the dataset later.

- pandas: A powerful data manipulation and analysis library for Python. It helps us work with tables (DataFrames) of data, allowing us to read, clean, and analyze large datasets. Pandas is particularly useful for operations like sorting, grouping, filtering, and summarizing data.

These two libraries will be critical as we analyze the waste management data across different cities and countries.

---
## **1.3 Loading the Dataset**
After setting up the environment and importing the necessary libraries, we can now load the dataset into the Colab environment. The following line reads the dataset from your Google Drive using pandas and stores it in a variable called df. The dataset file, city_level_data_0_0.csv, contains data on waste management in various cities.

In [3]:
# Step 1.3
df = pd.read_csv('gdrive/My Drive/Colab Notebooks/city_level_data_0_0.csv')

This line of code loads the dataset from your Google Drive into a pandas DataFrame. The pd.read_csv() function is used to read CSV files and load them into a DataFrame. Here, the file path 'gdrive/My Drive/Colab Notebooks/city_level_data_0_0.csv' specifies the location of the CSV file in your Google Drive. When the file is successfully loaded, the data is stored in the variable df.

- Why this is important: This step is essential because it allows us to bring the dataset into the Colab environment, making it ready for analysis. The df variable will now contain all the data from the CSV file, and we can begin manipulating it using pandas methods.

- What to expect: After running this code, the data will be accessible in the df variable, which can be further explored and analyzed using various pandas functions.
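If the drive is not mounted or the path is mistyped, `pd.read_csv()` fails with a fairly generic error. The sketch below shows one way to fail early with a clearer message. The helper name `load_city_data` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_city_data(csv_path):
    """Read a CSV into a DataFrame, failing early if the file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; check that Google Drive is mounted and the path is correct"
        )
    return pd.read_csv(path)


# Usage in Colab, with the path from the cell above:
# df = load_city_data('gdrive/My Drive/Colab Notebooks/city_level_data_0_0.csv')
# print(df.shape)  # (number of rows, number of columns)
```

Printing `df.shape` right after loading is a quick way to confirm that the file was read and to see how many rows and columns it has.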



---
## **Summary of Step 1: Connect to Google Drive and Acquire the Data**

In Step 1, we began by setting up the environment to access and work with the data. First, we connected our Google Drive to the Colab environment. This is crucial because it allows us to store and retrieve the dataset directly from the cloud. By mounting the Google Drive, we ensure that the data file is easily accessible for subsequent analysis.

Next, we acquired the dataset, which is publicly available from the **World Bank Group's website**. We focused on the **[World Bank's Waste Management Data](https://data.worldbank.org/indicator/EN.ATM.WAST.ZS)**, specifically the `city_level_data_0_0.csv` file. This dataset contains detailed information on waste management practices across various cities and countries.

After mounting Google Drive, we loaded the dataset into our Colab environment using Python’s `pandas` library, which lets us manipulate and analyze the data. We have not inspected the contents yet; checking the columns and values is the first task in Step 2.

By the end of Step 1, we have successfully connected to Google Drive, acquired the data, and ensured that it is ready for analysis in the next steps. This prepares us to dive into the dataset, explore its structure, and begin cleaning and analyzing it in Step 2.


---
# <font size="6">**Step 2: Data Exploration and Preprocessing**</font>

In Step 2 of this project, we will focus on understanding the structure and contents of the dataset we’ve acquired. At this stage, the data is raw and may contain some inconsistencies, missing values, or inappropriate data types. The goal of this step is to explore and clean the data so that it is ready for analysis and visualization.

---
### What we will do in this step:
- **Explore the dataset's columns**: We will start by reviewing the columns available in the dataset to understand the variables it contains. This gives us insight into the types of information we can work with.
- **Identify unique values**: For key columns, such as `country_name`, we will check for unique values to understand the geographical coverage of the data and ensure there are no unexpected entries or missing data.
- **Handle data types**: Some columns may not be in the correct format for numerical analysis. We’ll convert any columns that should contain numbers (such as waste generation data) to the appropriate numeric types.
- **Check for missing values**: We will also inspect the dataset for missing values. In real-world datasets, it’s common to encounter gaps in the data. We’ll identify these missing values and decide whether to remove them or replace them with appropriate values (e.g., the mean or median).

This process is crucial because clean, well-organized data is the foundation for effective analysis. By the end of Step 2, we will have a clearer picture of the data’s structure, and we’ll be ready to start deeper analysis in the following steps.
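The missing-value check described above can be sketched as follows. This example uses a small toy DataFrame with made-up values rather than the real dataset; the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "city_name": ["A", "B", "C", "D"],
    "total_msw_total_msw_generated_tons_year": [100.0, np.nan, 250.0, np.nan],
    "waste_treatment_recycling_percent": [10.0, 5.0, np.nan, 20.0],
})

# Count missing values per column, then express them as a share of all rows.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

On the real data, the same two lines applied to `df` would show which columns are too sparse to analyze and which only need a few rows dropped or filled.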


---
## **2.1 Checking Unique Country Names**
To better understand the data, it's helpful to see the unique countries represented in the dataset. The following code extracts all unique country names from the country_name column of the DataFrame.

In [4]:
# Step 2.1
print(df['country_name'].unique())

['Afghanistan' 'Angola' 'Albania' 'United Arab Emirates' 'Argentina'
 'Armenia' 'American Samoa' 'Australia' 'Austria' 'Azerbaijan' 'Burundi'
 'Belgium' 'Benin' 'Burkina Faso' 'Bangladesh' 'Bulgaria' 'Bahrain'
 'Bosnia and Herzegovina' 'Belarus' 'Belize' 'Bolivia' 'Brazil' 'Bhutan'
 'Botswana' 'Canada' 'Switzerland' 'Chile' 'China' 'Côte d’Ivoire'
 'Cameroon' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Colombia' 'Comoros'
 'Costa Rica' 'Cuba' 'Cyprus' 'Czech Republic' 'Germany' 'Djibouti'
 'Denmark' 'Dominican Republic' 'Algeria' 'Ecuador' 'Egypt, Arab Rep.'
 'Spain' 'Estonia' 'Ethiopia' 'Finland' 'Fiji' 'France'
 'Micronesia, Fed. Sts.' 'Gabon' 'United Kingdom' 'Georgia' 'Ghana'
 'Guinea' 'Gambia, The' 'Equatorial Guinea' 'Greece' 'Guatemala'
 'Honduras' 'Croatia' 'Haiti' 'Hungary' 'Indonesia' 'Isle of Man' 'India'
 'Ireland' 'Iran, Islamic Rep.' 'Iraq' 'Israel' 'Italy' 'Jordan' 'Japan'
 'Kazakhstan' 'Kenya' 'Kyrgyz Republic' 'Cambodia' 'Kiribati'
 'Korea, Rep.' 'Kuwait' 'Lao PDR' 'Lebanon' 'Li

**How it works:**

- The df['country_name'] accesses the country_name column in the DataFrame.

- The .unique() function then returns an array of the unique values in this column, meaning it will list all the different countries present in the dataset.

**What to expect:** This code will display a list of all unique country names found in the dataset, allowing you to quickly identify the countries represented. For example, you may see country names like 'Afghanistan', 'Brazil', 'India', and many others. This step helps confirm the geographical scope of the data, which is critical when analyzing waste management practices across different countries.
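Beyond listing the names, it can help to know how many countries there are and how many cities each one contributes. A minimal sketch on toy data (the rows are illustrative, not the real dataset):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "country_name": ["India", "India", "Brazil", "Korea, Rep.", "India"],
    "city_name": ["Kochi", "Mysore", "Sao Paulo", "Seoul", "Kanpur"],
})

# Number of distinct countries.
n_countries = toy["country_name"].nunique()

# Number of rows (cities) per country, most frequent first.
cities_per_country = toy["country_name"].value_counts()

print(n_countries)
print(cities_per_country)
```

Running the same calls on `df["country_name"]` would show whether a few countries dominate the dataset, which matters when comparing country-level totals later.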

---
## **2.2 Inspecting the Columns in the Dataset**

To better understand the structure of the dataset, it's important to check the column names. This can help you know what data you have available to work with. The following code prints a list of all the column names in the dataset.

In [5]:
# Step 2.2
print(df.columns.tolist())

['iso3c', 'region_id', 'country_name', 'income_id', 'city_name', 'additional_data_annual_budget_for_waste_management_year', 'additional_data_annual_solid_waste_budget_year', 'additional_data_annual_swm_budget_2017_year', 'additional_data_annual_swm_budget_year', 'additional_data_annual_waste_budget_year', 'additional_data_collection_ton', 'additional_data_number_of_scavengers_on_dumpsites_number', 'additional_data_other_user_fees_na', 'additional_data_swm_contract_arrangement_1_year_contract_period', 'additional_data_swm_contract_arrangement_3_year_contract_period', 'additional_data_total_annual_costs_to_collect_and_dispose_of_city_s_waste_year', 'additional_data_total_swm_expenditures_year', 'additional_data_total_waste_management_budget_year', 'communication_list_of_channels_through_which_the_city_collects_feedback_from_it_residents_on_issues_related_to_solid_waste_services_na', 'communication_summary_of_key_solid_waste_information_made_periodically_available_to_the_public_na', 'comp

**How it works:**

- df.columns retrieves the names of all the columns in the DataFrame df.

- The .tolist() method converts these column names from an Index object into a regular Python list, which makes it easier to display and manipulate the list of columns.

**Why this is important:** This step helps you quickly understand the structure of the dataset. It provides a list of all the attributes or features available in the dataset, such as 'country_name', 'city_name', 'total_msw_total_msw_generated_tons_year', and so on. Knowing the column names is crucial when you want to refer to specific attributes of the data for analysis.

**What to expect:** When you run this code, it will print a list of column names, allowing you to see the variables available in the dataset. For example, you might see a list that includes 'country_name', 'city_name', 'waste_collection_cost_recovery_household_fee_amount_na', 'waste_management_cost_open_dump_na', etc. This list helps guide you in selecting the columns you want to analyze or clean.
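Because the column list is long, filtering it by keyword can make it easier to find related columns. The sketch below uses a toy DataFrame with a few column names from this notebook; `columns_containing` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# Empty toy frame carrying a few column names used in this notebook.
toy = pd.DataFrame(columns=[
    "country_name",
    "city_name",
    "total_msw_total_msw_generated_tons_year",
    "waste_treatment_recycling_percent",
    "waste_treatment_compost_percent",
])


def columns_containing(frame, keyword):
    """Return the column names that contain `keyword` (hypothetical helper)."""
    return [col for col in frame.columns if keyword in col]


treatment_cols = columns_containing(toy, "waste_treatment")
print(treatment_cols)
```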


---
## **2.3 Converting Data Types for Numeric Columns**
Some columns in the dataset may not be in the appropriate format for numerical analysis. For example, columns containing numbers might have been loaded as text. To perform calculations on these columns, we need to ensure they are converted into numeric types.

For instance, the column 'total_msw_total_msw_generated_tons_year' represents the total amount of municipal solid waste generated, and we need it to be numeric to perform any mathematical operations on it.


In [None]:
# Step 2.3
df["total_msw_total_msw_generated_tons_year"] = pd.to_numeric(
    df["total_msw_total_msw_generated_tons_year"],
    errors="coerce"
)


**What this does:** The pd.to_numeric() function attempts to convert the values in the specified column into numeric values. The errors="coerce" argument ensures that any non-numeric values are replaced with NaN, allowing us to handle any invalid or missing data appropriately.

**Why this is important:** Converting the column to a numeric type is necessary before summing or averaging its values. If the column were left as text, calculations would either raise errors or silently produce wrong results (for example, summing strings concatenates them instead of adding them).
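Since `errors="coerce"` replaces unparseable entries with NaN without any warning, it is worth checking how many values were lost in the conversion. A small sketch with made-up raw values:

```python
import pandas as pd

# Made-up raw values: two parseable, two unparseable, one already missing.
raw = pd.Series(["1200", "3,500", "n/a", "42.5", None])
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed.
newly_missing = raw.notna() & converted.isna()

print(converted.tolist())           # [1200.0, nan, nan, 42.5, nan]
print(raw[newly_missing].tolist())  # ['3,500', 'n/a']
```

Inspecting the entries flagged by `newly_missing` shows whether they are true gaps or formatting issues (such as thousands separators) that could be cleaned before converting.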

---
## **Summary of Step 2: Data Exploration and Preprocessing**
By the end of Step 2, you should have a clear understanding of the structure and content of the dataset. We have listed the countries and columns it contains and converted the waste generation column to a numeric type, with unparseable entries set to NaN. This prepares the dataset for the next steps, where we perform more detailed analysis.

---
# <font size="6">**Step 3: Data Analysis and Insights**</font>

In Step 3, we take the data we acquired and begin analyzing it to uncover trends and insights. This step focuses on understanding the patterns in waste generation, waste diversion (recycling and composting), and comparing these patterns across countries and cities. We will break down the data in several key ways:

1. **Total Waste by Country**: We will calculate the total waste generated by each country, which will help us identify the biggest waste producers globally.
2. **Waste Diversion Rate**: We will calculate how much waste is being diverted from landfills by looking at recycling and composting rates.
3. **Top Cities for Waste Diversion**: We will identify the cities with the highest and lowest diversion rates, shedding light on where waste management practices are most effective.
4. **Cities Generating the Most Waste**: We will also look at which cities are generating the most waste, highlighting urban areas with the highest waste challenges.

Now, let's go through each analysis step by step, explain the code we’ll use, and discuss the purpose behind it.

---
## **3.1 Total MSW by Country**
In this step, we group the data by country and calculate the total waste generated by each country. Sorting the countries in descending order allows us to identify the largest producers of municipal solid waste.

In [None]:
# Step 3.1
waste_by_country = df.groupby("country_name")["total_msw_total_msw_generated_tons_year"].sum()
waste_by_country = waste_by_country.sort_values(ascending=False)
print(waste_by_country.head(10))

country_name
India                 2.075418e+07
Brazil                8.903979e+06
Russian Federation    7.989254e+06
China                 7.903000e+06
Saudi Arabia          6.580000e+06
Mexico                5.784915e+06
Egypt, Arab Rep.      5.475000e+06
Pakistan              5.280906e+06
Vietnam               4.909250e+06
South Africa          4.540491e+06
Name: total_msw_total_msw_generated_tons_year, dtype: float64


**How this works:**
We use the groupby() function to group the data by the "country_name" column. Then, we sum the total waste generated (total_msw_total_msw_generated_tons_year) for each country. Finally, we sort the results in descending order to identify the countries that generate the most waste.

**Why it's important:**
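This step shows which countries account for the most municipal solid waste among the cities in the dataset. Because the data is city-level, each total covers only the cities reported for that country, not the country as a whole, so the ranking partly reflects how many cities each country contributes. With that caveat, the top entries can help target areas for policy changes or improvements in waste management practices.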
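To put each total in context, the sum can be reported alongside the number of cities that actually have a value. A minimal sketch on toy data (the figures are made up):

```python
import pandas as pd

# Toy rows; the tonnages are made up for illustration.
toy = pd.DataFrame({
    "country_name": ["India", "India", "Brazil", "China"],
    "total_msw_total_msw_generated_tons_year": [500.0, 300.0, 700.0, None],
})

summary = (
    toy.groupby("country_name")["total_msw_total_msw_generated_tons_year"]
    .agg(total_tons="sum", cities_reporting="count")
    .sort_values("total_tons", ascending=False)
)
print(summary)
```

Note that a country whose cities all have missing values gets a total of 0 from `sum`, which is easy to misread as "no waste"; the `cities_reporting` column makes that case visible.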



---
## **3.2 City-Level Waste Analysis**
We now look at waste generation on a city level. We’ll extract the city name and the total waste generated and sort the cities by their waste generation.

In [6]:
# Step 3.2
all_city_waste = df[["city_name", "total_msw_total_msw_generated_tons_year"]].copy()
all_city_waste = all_city_waste.dropna()
all_city_waste = all_city_waste.sort_values(by="total_msw_total_msw_generated_tons_year", ascending=False)

**How this works:**
Here, we create a new DataFrame that only contains the "city_name" and "total_msw_total_msw_generated_tons_year" columns. We drop any rows with missing data (dropna()), and then we sort the cities by the total waste generated in descending order.

**Why it's important:**
This step gives us a view of which cities are generating the most waste, helping to pinpoint urban areas with the greatest waste management challenges.
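The same select, drop, and sort sequence can be written as one chain. The variant below is a sketch on toy data, and it differs from the cell above in two assumed details: it drops rows only when the waste figure is missing (`subset=`), and it resets the index so that row labels match the ranking.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values.
toy = pd.DataFrame({
    "city_name": ["A", "B", "C"],
    "total_msw_total_msw_generated_tons_year": [5000.0, np.nan, 9000.0],
})

ranked = (
    toy.dropna(subset=["total_msw_total_msw_generated_tons_year"])
    .sort_values("total_msw_total_msw_generated_tons_year", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```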

---
## **3.3 Waste Diversion Rate Calculation**
We now calculate the waste diversion rate for each city by adding the recycling and composting percentages.


In [None]:
# Step 3.3
df["diversion_rate"] = df["waste_treatment_recycling_percent"] + df["waste_treatment_compost_percent"]

**How this works:**
We add the "waste_treatment_recycling_percent" and "waste_treatment_compost_percent" columns to create a new column, "diversion_rate," which represents the percentage of waste that is diverted from landfills. If either percentage is missing for a city, the sum is NaN, so that city has no diversion rate.

**Why it's important:**
This step helps us understand how much of the waste is being managed through recycling and composting, reducing the burden on landfills. This is a key metric in sustainable waste management practices.
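How missing percentages are handled changes which cities receive a diversion rate. The sketch below compares plain addition with a more lenient alternative that treats a single missing component as 0. The lenient version is an assumption for illustration, not the method used in this notebook, and whether it is appropriate depends on what a missing value means in the source data.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up percentages.
toy = pd.DataFrame({
    "city_name": ["A", "B", "C"],
    "waste_treatment_recycling_percent": [20.0, 15.0, np.nan],
    "waste_treatment_compost_percent": [10.0, np.nan, np.nan],
})

recycling = toy["waste_treatment_recycling_percent"]
compost = toy["waste_treatment_compost_percent"]

# Plain addition: any missing component makes the result missing.
strict = recycling + compost

# Lenient alternative: a single missing component counts as 0,
# but the result stays NaN when both components are missing.
lenient = recycling.add(compost, fill_value=0)

print(strict.tolist())   # [30.0, nan, nan]
print(lenient.tolist())  # [30.0, 15.0, nan]
```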

---
## **3.4 Average Diversion Rate by Country**
Next, we group the data by country and calculate the average diversion rate.

In [None]:
# Step 3.4
diversion_by_country = df.groupby("country_name")["diversion_rate"].mean()
diversion_by_country = diversion_by_country.sort_values(ascending=False)
print(diversion_by_country.head(10))


country_name
Belgium                 67.000
Korea, Rep.             65.000
Thailand                60.550
Iran, Islamic Rep.      54.000
Canada                  51.650
India                   45.394
France                  40.510
United Arab Emirates    29.000
Mexico                  24.790
Lebanon                 22.500
Name: diversion_rate, dtype: float64


**How this works:**
We use groupby() again to group the data by country, then we calculate the average diversion rate for each country. Sorting the values in descending order allows us to identify the countries with the highest diversion rates.

**Why it's important:**
This helps us identify which countries are most effective in diverting waste from landfills, which is a critical measure of sustainability and environmental responsibility.
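A country average is only as reliable as the number of cities behind it. In the output above, Belgium's 67.000 is the same value that Liege shows in the city-level ranking, which suggests that some averages may rest on a single city. The sketch below, on toy data, reports the mean together with the number of cities that have a diversion rate.

```python
import pandas as pd

# Toy rows; country names and rates are placeholders.
toy = pd.DataFrame({
    "country_name": ["X", "X", "X", "Y"],
    "diversion_rate": [10.0, 20.0, 30.0, 67.0],
})

by_country = (
    toy.groupby("country_name")["diversion_rate"]
    .agg(mean_rate="mean", n_cities="count")
    .sort_values("mean_rate", ascending=False)
)
print(by_country)
```

Here country Y ranks first on the strength of one city, while X's lower average summarizes three.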

---
## **3.5 Top and Bottom Cities for Waste Diversion**
Next, we find the cities with the highest and lowest diversion rates, providing insight into where waste diversion practices are most successful and where improvements are needed.

In [None]:
# Step 3.5
top_diversion = df[["city_name", "diversion_rate"]].sort_values(by="diversion_rate", ascending=False).head(10)
low_diversion = df[["city_name", "diversion_rate"]].sort_values(by="diversion_rate", ascending=True).head(10)

print("Top 10 Cities for Diversion:")
print(top_diversion)

print("\nBottom 10 Cities for Diversion:")
print(low_diversion)


Top 10 Cities for Diversion:
            city_name  diversion_rate
148  Pimpri-Chinchwad           79.70
171        Coimbatore           73.40
177            Mysore           72.00
26              Liege           67.00
201             Seoul           65.00
139             Kochi           61.90
318           Bangkok           60.55
149            Kanpur           54.66
180            Tehran           54.00
52            Toronto           51.65

Bottom 10 Cities for Diversion:
                                  city_name  diversion_rate
123                             Tegucigalpa            3.00
208  Dehiwala Mt. Lavinia Municipal Council            3.90
151                                  Tenali            4.75
45                                   La Paz            5.00
178                                Warangal            5.32
324                                  Vava’u            6.60
210                             Trincomalee            9.70
344                                   Ha

**How this works:**
We create two new DataFrames: one for the cities with the highest diversion rates (top_diversion) and another for the cities with the lowest diversion rates (low_diversion). We then print the top and bottom 10 cities.

**Why it's important:**
By identifying the top and bottom cities, we can learn from the best practices of leading cities in waste diversion and understand the challenges faced by cities with low diversion rates.
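An equivalent way to pick the extremes is with `nlargest` and `nsmallest`, after explicitly dropping cities that have no diversion rate. This is a sketch on toy data, offered as an alternative to the sort-and-head approach above rather than a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up rates; city B has no diversion rate.
toy = pd.DataFrame({
    "city_name": ["A", "B", "C", "D"],
    "diversion_rate": [79.7, np.nan, 3.0, 51.65],
})

# Keep only cities with a diversion rate, then take the extremes.
valid = toy.dropna(subset=["diversion_rate"])
top = valid.nlargest(2, "diversion_rate")
bottom = valid.nsmallest(2, "diversion_rate")

print(top)
print(bottom)
```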



---
## **3.6 Top Cities for Waste Generation**
Lastly, we identify the cities that generate the most waste, which will give us a better understanding of the urban centers with the greatest waste management challenges.


In [None]:
# Step 3.6
top_cities_waste = df[["city_name", "total_msw_total_msw_generated_tons_year"]]
top_cities_waste = top_cities_waste.sort_values(by="total_msw_total_msw_generated_tons_year", ascending=False).head(10)
print(top_cities_waste)


          city_name  total_msw_total_msw_generated_tons_year
58          Beijing                                7903000.0
296          Moscow                                5500000.0
85            Cairo                                5475000.0
227     México City                                4705945.0
48        Sao Paulo                                4700000.0
302          Riyadh                                4380000.0
318         Bangkok                                4190000.0
110          London                                3560990.0
47   Rio De Janeiro                                3368499.0
201           Seoul                                3353985.0


**How this works:**
We extract the "city_name" and "total_msw_total_msw_generated_tons_year" columns from the original DataFrame and sort them in descending order to show the cities generating the most waste.

**Why it's important:**
This step allows us to see which cities have the highest waste generation. By identifying these cities, we can target resources and strategies for improving waste management in these areas, which face the biggest challenges.


---

# <font size="5">**Summary of Step 3: Data Analysis and Insights**</font>

In this step, we transformed raw waste data into meaningful insights by grouping, calculating, and ranking key metrics. We identified the countries and cities that generate the most municipal solid waste, as well as those leading and lagging in waste diversion efforts.  
By understanding these patterns, we now have a clear view of where waste management is most effective and where challenges are greatest. These insights will guide the next steps in deeper analysis and potential solutions for more sustainable waste management practices.

---
# <font size="6">**Step 4: Save and Export the Cleaned Data**</font>

After exploring, cleaning, and analyzing our dataset, the final step is to save the processed information for future use.  
Instead of visualizing the data through charts, in this project we are exporting the cleaned dataset into a new CSV file.  
This ensures that all the work we've done — from formatting numbers correctly to selecting relevant columns — is preserved and easy to share, reuse, or further analyze.

Saving your work into a CSV file is a critical habit in data projects. It acts like a "checkpoint" where you can always return without needing to redo all the previous processing steps.


In [None]:
all_city_waste.to_csv("all_city_waste.csv", index=False)

**How and Why:**

- to_csv() is a pandas function that writes the DataFrame into a .csv file.

- We set index=False so that the DataFrame index (the row numbers) is not saved as a separate column in the CSV file, keeping the output clean.

- This new file all_city_waste.csv contains the names of the cities and their total municipal solid waste (MSW) generated — ready for reporting, visualization, or further modeling.

By doing this, you lock in all the hard work and create a lightweight version of the dataset that is easy to share and work with later. The file is written to the Colab session's working directory, from which it can be downloaded.
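A quick way to confirm the export is to read the file back and compare it with the DataFrame that was written. The sketch below does this with toy data in a temporary directory, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for all_city_waste; the values are made up.
toy = pd.DataFrame({
    "city_name": ["A", "B"],
    "total_msw_total_msw_generated_tons_year": [7903000.0, 5500000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "all_city_waste.csv")
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when the columns, dtypes, and values survived the round trip.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```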

---
# <font size="5">**Summary for Step 4**</font>

In this final step, we exported our cleaned and sorted dataset into a CSV file called `all_city_waste.csv`.  
This file contains key information about each city’s total municipal solid waste generation, which is now easy to access without re-running all the cleaning and processing steps.  
Saving cleaned datasets is an essential practice in any data workflow — it not only protects your work but also sets you up for efficient future analysis, collaboration, or presentation.

# <font size="6">**Final Project Summary**</font>

In this project, we explored municipal solid waste management data from the World Bank Group.  
We started by connecting to Google Drive and accessing the raw dataset, ensuring we had the right environment to work with large data files.  
Next, we performed an initial inspection of the data, reviewing available columns and unique country names to understand the scope and variety of information captured.  
We then moved into cleaning and analyzing the data — converting important columns to numeric types, summarizing waste generation by country and city, and calculating diversion rates to see how well different cities and countries manage recycling and composting efforts.  
Finally, we exported a clean, ready-to-use CSV file that holds valuable insights for future use, reporting, or further visualization.

Throughout this project, we practiced essential data handling skills: loading, inspecting, cleaning, analyzing, and saving — key stages in any serious data science or analytics workflow.  
By the end, we transformed a large and messy dataset into a clear, structured resource that answers important questions about global waste management trends.
