<div style="
  text-align: center; 
  color: white; 
  background: linear-gradient(135deg, #2e8b57, #3cb371); 
  padding: 25px; 
  border-radius: 12px;
  box-shadow: 0 4px 10px rgba(0,0,0,0.2);
  max-width: 600px;
  margin: 30px auto;
">
  <h1 style="margin: 0; font-size: 2.2em;">Waste Management Prediction</h1>
  <p style="font-size: 18px; margin-top: 8px; opacity: 0.95;">
    Optimizing sustainability through data-driven insights
  </p>
</div>


# 1. Problem Statement
> *The objective of this hackathon is to develop a machine learning model that
predicts the recycling rate (the target variable, “Recycling Rate (%)”) for a given set
of city-specific waste management attributes, such as waste type, population
density, and municipal efficiency. This is primarily a regression problem, where
the goal is to minimize prediction error (e.g., Root Mean Squared Error, RMSE). The
challenge uses the Waste Management and Recycling in India dataset,
enabling participants to explore data-driven insights that can optimize waste
management systems, reduce environmental impact, and inform urban policymaking.
Participants will analyze features like waste generation, cost of
management, and landfill capacity to derive actionable solutions, simulating real-world
decision-making in sustainable urban planning.*
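
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal NumPy sketch follows; the function name `rmse` and the sample numbers are illustrative and not part of the hackathon materials.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical recycling rates (%) and predictions, for illustration only
print(rmse([68, 56, 53], [65, 56, 57]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.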


# 2. Data Explanation
> *The Waste Management and Recycling in India dataset is a comprehensive
collection of attributes related to waste generation, recycling, and disposal
practices across various Indian cities. Sourced from simulated data based on real-world
waste management trends in India, this dataset spans 2019–2023 and covers
multiple waste types, including plastic, organic, electronic, construction, and
hazardous waste. It supports predictive models for
sustainable waste management, a data-driven approach to optimize municipal
systems, reduce landfill dependency, and enhance recycling efforts. Each row in the
dataset represents a unique combination of city, waste type, and management
attributes, paired with outcomes like recycling rates or disposal methods, making it
a suitable resource for building and evaluating machine learning models.*

##  Dataset Overview

- **Source**: Simulated data based on Indian municipal waste practices  
- **Target**: Predict **Recycling Rate (%)** (regression)

###  Features

- `City/District`: Name of city (categorical)  
- `Waste Type`: Type of waste (categorical)  
- `Waste Generated (Tons/Day)`: Daily waste amount  
- `Recycling Rate (%)`: % recycled (target)  
- `Population Density`: People per km²  
- `Municipal Efficiency Score`: 1–10 rating  
- `Disposal Method`: Composting, Incineration, Landfill, or Recycling (categorical; a feature here, though it could serve as a classification target)  
- `Cost of Waste Management`: ₹ per ton  
- `Awareness Campaigns Count`: Campaigns per year  
- `Landfill Name`: Site name  
- `Landfill Location`: Latitude & Longitude  
- `Landfill Capacity`: Max tons  
- `Year`: 2019–2023

###  Challenges

- Encoding categorical features  
- Handling class imbalance  
- Feature engineering for time & location  
- Managing missing/noisy data  
- Combining mixed data types
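
Two of the challenges above, location features and categorical encoding, can be sketched with pandas alone. This is an illustration on a two-row hypothetical frame that borrows the dataset's column names; the new column names `landfill_lat` and `landfill_long` are assumptions, and this notebook itself defers encoding to the modeling stage.

```python
import pandas as pd

# Tiny hypothetical sample using the dataset's column names
sample = pd.DataFrame({
    "Waste Type": ["Plastic", "Construction"],
    "Disposal Method": ["Composting", "Landfill"],
    "Landfill Location (Lat, Long)": ["22.4265, 77.4931", "22.4265, 77.4931"],
})

# Split the "lat, long" string into two numeric columns
coords = sample["Landfill Location (Lat, Long)"].str.split(",", expand=True)
sample["landfill_lat"] = coords[0].str.strip().astype(float)
sample["landfill_long"] = coords[1].str.strip().astype(float)
sample = sample.drop(columns="Landfill Location (Lat, Long)")

# One-hot encode the low-cardinality categorical columns
encoded = pd.get_dummies(sample, columns=["Waste Type", "Disposal Method"])
print(encoded.columns.tolist())
```

One-hot encoding suits columns with few distinct values, such as waste type; a column with many values, such as city, may need a different encoding.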

# Import Libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import scipy.stats as stats

import warnings
warnings.filterwarnings("ignore")

# 3. Data Preparation

> Data cleaning and preprocessing

## 3.1 Data Ingestion

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/MasteriNeuron/datasets/refs/heads/main/Waste_Management_and_Recycling_India.csv")
df.to_csv("../data/raw/Waste_Management_and_Recycling_India.csv", index=False)
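
`to_csv` does not create missing parent directories, so the write above fails if `../data/raw` does not exist yet. The following is a small sketch of a cached load that creates the folder first; `load_or_download` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path
import pandas as pd

def load_or_download(url, local_path):
    """Read the local CSV if present; otherwise download it and cache a copy."""
    local_path = Path(local_path)
    if local_path.exists():
        return pd.read_csv(local_path)
    frame = pd.read_csv(url)  # pandas can read a CSV directly from a URL
    local_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    frame.to_csv(local_path, index=False)
    return frame

# Usage (same URL and path as in the cell above):
# df = load_or_download(url, "../data/raw/Waste_Management_and_Recycling_India.csv")
```

Reading from the cached copy on later runs also keeps the notebook reproducible if the remote file changes.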

In [6]:
df.head()

| | City/District | Waste Type | Waste Generated (Tons/Day) | Recycling Rate (%) | Population Density (People/km²) | Municipal Efficiency Score (1-10) | Disposal Method | Cost of Waste Management (₹/Ton) | Awareness Campaigns Count | Landfill Name | Landfill Location (Lat, Long) | Landfill Capacity (Tons) | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | Plastic | 6610 | 68 | 11191 | 9 | Composting | 3056 | 14 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 1 | Mumbai | Organic | 1181 | 56 | 11191 | 5 | Composting | 2778 | 12 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 2 | Mumbai | E-Waste | 8162 | 53 | 11191 | 8 | Incineration | 3390 | 13 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 3 | Mumbai | Construction | 8929 | 56 | 11191 | 5 | Landfill | 1498 | 14 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 4 | Mumbai | Hazardous | 5032 | 44 | 11191 | 7 | Recycling | 2221 | 16 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |


## Metadata: column info and dtypes

In [None]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   City/District                      850 non-null    object
 1   Waste Type                         850 non-null    object
 2   Waste Generated (Tons/Day)         850 non-null    int64 
 3   Recycling Rate (%)                 850 non-null    int64 
 4   Population Density (People/km²)    850 non-null    int64 
 5   Municipal Efficiency Score (1-10)  850 non-null    int64 
 6   Disposal Method                    850 non-null    object
 7   Cost of Waste Management (₹/Ton)   850 non-null    int64 
 8   Awareness Campaigns Count          850 non-null    int64 
 9   Landfill Name                      850 non-null    object
 10  Landfill Location (Lat, Long)      850 non-null    object
 11  Landfill Capacity (Tons)           850 non-null    int64 
 12  Year    
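
`df.info()` reports non-null counts per column; counting missing values directly makes the same check explicit. A sketch on a hypothetical two-column frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Recycling Rate (%)": [68.0, np.nan, 53.0],
})

missing_per_column = demo.isna().sum()               # missing values in each column
print(missing_per_column[missing_per_column > 0])    # show only columns with gaps
```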

## 3.2 Duplicates

In [8]:
df.duplicated().sum()

np.int64(0)
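
`df.duplicated()` only flags rows that match in every column. If each combination of city, waste type, and year is expected to appear once (an assumption about the dataset's grain, not something the source states), a stricter check restricts the comparison to those key columns. A sketch on a hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({
    "City/District": ["Mumbai", "Mumbai", "Mumbai"],
    "Waste Type": ["Plastic", "Plastic", "Organic"],
    "Year": [2019, 2019, 2019],
    "Recycling Rate (%)": [68, 70, 56],  # differs, so a full-row check misses the repeat
})

key = ["City/District", "Waste Type", "Year"]
full_row_dupes = demo.duplicated().sum()        # rows identical in every column
key_dupes = demo.duplicated(subset=key).sum()   # rows repeating the key columns only
print(full_row_dupes, key_dupes)
```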

## 3.3 Unique values and counts
> **Unique values for categorical columns**

In [None]:
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}:\n {df[col].unique()}")

City/District:
 ['Mumbai' 'Delhi' 'Bengaluru' 'Chennai' 'Kolkata' 'Hyderabad' 'Pune'
 'Ahmedabad' 'Jaipur' 'Lucknow' 'Surat' 'Kanpur' 'Nagpur' 'Patna' 'Bhopal'
 'Thiruvananthapuram' 'Indore' 'Vadodara' 'Guwahati' 'Coimbatore' 'Ranchi'
 'Amritsar' 'Jodhpur' 'Varanasi' 'Ludhiana' 'Agra' 'Meerut' 'Nashik'
 'Rajkot' 'Madurai' 'Jabalpur' 'Allahabad' 'Visakhapatnam' 'Gwalior']
Waste Type:
 ['Plastic' 'Organic' 'E-Waste' 'Construction' 'Hazardous']
Disposal Method:
 ['Composting' 'Incineration' 'Landfill' 'Recycling']
Landfill Name:
 ['Mumbai Landfill' 'Delhi Landfill' 'Bengaluru Landfill'
 'Chennai Landfill' 'Kolkata Landfill' 'Hyderabad Landfill'
 'Pune Landfill' 'Ahmedabad Landfill' 'Jaipur Landfill' 'Lucknow Landfill'
 'Surat Landfill' 'Kanpur Landfill' 'Nagpur Landfill' 'Patna Landfill'
 'Bhopal Landfill' 'Thiruvananthapuram Landfill' 'Indore Landfill'
 'Vadodara Landfill' 'Guwahati Landfill' 'Coimbatore Landfill'
 'Ranchi Landfill' 'Amritsar Landfill' 'Jodhpur Landfill'
 'Varanasi Landf
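
The visible output suggests each `Landfill Name` is the city name followed by "Landfill". Whether two categorical columns carry the same information can be checked by looking at their distinct pairs. A sketch on a hypothetical frame; `is_one_to_one` is an illustrative helper, not part of the notebook:

```python
import pandas as pd

def is_one_to_one(frame, left, right):
    """True if each value of `left` pairs with exactly one value of `right`, and vice versa."""
    pairs = frame[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

demo = pd.DataFrame({
    "City/District": ["Mumbai", "Mumbai", "Delhi"],
    "Landfill Name": ["Mumbai Landfill", "Mumbai Landfill", "Delhi Landfill"],
})
print(is_one_to_one(demo, "City/District", "Landfill Name"))
```

If the mapping holds on the full dataset, one of the two columns adds no information and can be dropped before encoding.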

# 4. Data Cleaning

* Normalize the column names: strip surrounding whitespace, lowercase, and replace spaces and hyphens with underscores.


In [14]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')
df.head()

| | city/district | waste_type | waste_generated_(tons/day) | recycling_rate_(%) | population_density_(people/km²) | municipal_efficiency_score_(1_10) | disposal_method | cost_of_waste_management_(₹/ton) | awareness_campaigns_count | landfill_name | landfill_location_(lat,_long) | landfill_capacity_(tons) | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | Plastic | 6610 | 68 | 11191 | 9 | Composting | 3056 | 14 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 1 | Mumbai | Organic | 1181 | 56 | 11191 | 5 | Composting | 2778 | 12 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 2 | Mumbai | E-Waste | 8162 | 53 | 11191 | 8 | Incineration | 3390 | 13 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 3 | Mumbai | Construction | 8929 | 56 | 11191 | 5 | Landfill | 1498 | 14 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
| 4 | Mumbai | Hazardous | 5032 | 44 | 11191 | 7 | Recycling | 2221 | 16 | Mumbai Landfill | 22.4265, 77.4931 | 45575 | 2019 |
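
The normalized names still carry units and symbols such as `(%)`, `/`, and `₹`, which are awkward in attribute access and formulas. One possible further step, not applied in this notebook, drops the parenthesised units and collapses the remaining non-alphanumeric runs into underscores; `simplify_name` is a hypothetical helper.

```python
import re

def simplify_name(name):
    """Drop '(...)' unit suffixes, replace other symbols with underscores, and lowercase."""
    name = re.sub(r"\(.*?\)", "", name)           # remove parenthesised units
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)    # non-alphanumeric runs -> underscore
    return name.strip("_").lower()

print(simplify_name("Recycling Rate (%)"))
print(simplify_name("City/District"))
```

Applying it would look like `df.columns = [simplify_name(c) for c in df.columns]`; any later cell that refers to the longer names would need to be updated to match.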


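In [None]:
# Validate the year column. Parse it as text with an explicit format: without one,
# pd.to_datetime reads integers as nanoseconds since the epoch, giving 1970 for every row.
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y', errors='coerce').dt.year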
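
One pitfall with this kind of conversion: when `pd.to_datetime` receives plain integers and no `format`, it interprets them as nanoseconds since the Unix epoch, so a value like 2019 lands in 1970. Passing the year as a string with `format='%Y'` avoids that. A small demonstration on made-up values:

```python
import pandas as pd

years = pd.Series([2019, 2023])

# Integers without a format are treated as epoch offsets (nanoseconds by default)
naive = pd.to_datetime(years, errors="coerce").dt.year

# Parsing the year as text with an explicit format keeps the intended value
parsed = pd.to_datetime(years.astype(str), format="%Y", errors="coerce").dt.year

print(naive.tolist(), parsed.tolist())
```

Note that `.dt.year` returns integers again, so the column ends up as a validated integer year rather than a datetime.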

In [18]:
df.to_csv("../data/processed/Waste_Management_and_Recycling_India_cleaned.csv", index=False)

# Summary insights:
- No null values found.
- No duplicate rows found.
- No mis-entered values found.
- Validated the `year` column by parsing it as a date and keeping the integer year.
- Skipped preprocessing here; it will be handled in a Pipeline in model_training.ipynb.
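
The checks summarized above can be made repeatable as assertions. A minimal sketch, assuming the lowercase column names produced in the cleaning step and the value ranges given in the dataset description (0 to 100 for recycling rate, 1 to 10 for efficiency score, 2019 to 2023 for year); `validate` is a hypothetical helper and the sample rows are illustrative.

```python
import pandas as pd

def validate(frame):
    """Raise AssertionError if the frame violates the basic quality checks."""
    assert not frame.isna().any().any(), "null values present"
    assert not frame.duplicated().any(), "duplicate rows present"
    assert frame["recycling_rate_(%)"].between(0, 100).all(), "recycling rate out of range"
    assert frame["municipal_efficiency_score_(1_10)"].between(1, 10).all(), "score out of range"
    assert frame["year"].between(2019, 2023).all(), "year out of range"

demo = pd.DataFrame({
    "recycling_rate_(%)": [68, 56],
    "municipal_efficiency_score_(1_10)": [9, 5],
    "year": [2019, 2019],
})
validate(demo)  # passes silently on this illustrative sample
```

Running such a check right before saving the cleaned file catches regressions if an earlier cell changes.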