<a href="https://colab.research.google.com/github/tasish/Urban-Logistics-Latency-Predictor/blob/Asish/Urban_Logistics_Latency_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Urban-Logistics-Latency-Predictor**

## **1. Introduction**
In on-demand delivery services, the accuracy of the Estimated Time of Arrival (ETA) directly affects user retention and fleet efficiency. Inaccurate estimates lead to customer dissatisfaction and poor resource allocation. This project addresses that problem with a machine learning model that forecasts delivery duration from geospatial data, traffic conditions, and delivery agent metrics. More reliable estimates support better expectation management and help streamline the "last-mile" delivery process.

## **2. Methodology**
To construct a reliable prediction engine, we implemented a structured data science pipeline consisting of four key phases:

* **Data Preprocessing:**
    The raw dataset underwent a rigorous cleaning process to ensure data integrity. We systematically handled missing values, corrected format inconsistencies, and removed statistical outliers to establish a high-quality baseline for training.

* **Feature Extraction:**
    We transformed raw data points into predictive signals. Key features were engineered from the dataset, including geospatial distance (derived from latitude/longitude), agent profile metrics (Age, Ratings), and temporal variables (Order Time). These inputs were crucial in helping the model understand the complex factors that influence travel time.

* **Algorithm Selection & Training:**
    We compared several regression algorithms: **Linear Regression, Decision Trees, Random Forest,** and **XGBoost**. To check that each model generalizes to new data, we used cross-validation, which scores a model on held-out folds and so exposes unstable or overfitted models before final selection.

* **Performance Evaluation:**
    Model quality was quantified with two standard regression metrics: **Mean Squared Error (MSE)**, the average of the squared prediction errors (lower is better, and large errors are penalized more heavily), and **R-squared (R²)**, the proportion of variance in delivery time that the model explains. These metrics guided the final selection of the most accurate algorithm.
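
The geospatial distance feature mentioned under Feature Extraction can be illustrated with a small sketch. The notebook imports `geodesic` from `geopy`, which works on an ellipsoidal Earth model; the version below instead uses the haversine (great-circle) formula on a spherical Earth so that it runs with the standard library alone, and its results will differ slightly from `geodesic`. The function name `haversine_km` and the sample coordinates are hypothetical and are not taken from the dataset.

```python
import math

# Mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical pickup and drop-off coordinates, for illustration only.
pickup = (12.9716, 77.5946)
dropoff = (12.9352, 77.6245)
distance_km = haversine_km(*pickup, *dropoff)
print(f"Distance feature: {distance_km:.2f} km")
```

In a pandas workflow, a function like this would typically be applied row by row to the pickup and drop-off coordinate columns to produce a single distance column.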
    
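The cross-validated model comparison described under Algorithm Selection & Training can be sketched as follows. The notebook itself imports `cross_val_score` from scikit-learn; this standard-library version only illustrates the idea of k-fold validation. The data are synthetic, and the two candidate models (a mean-only baseline and a one-feature least-squares line) are simplified stand-ins, not the models trained in this project. All function names here are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_mse(xs, ys, fit, predict, k=5, seed=0):
    """Return one held-out MSE per fold for the given fit/predict pair."""
    folds = k_fold_indices(len(ys), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        preds = predict(model, [xs[j] for j in test_idx])
        scores.append(mean([(ys[j] - p) ** 2 for j, p in zip(test_idx, preds)]))
    return scores


# Candidate 1: always predict the training mean (baseline).
def fit_mean(xs, ys):
    return mean(ys)


def predict_mean(model, xs):
    return [model] * len(xs)


# Candidate 2: one-feature ordinary least squares.
def fit_linear(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def predict_linear(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Synthetic data: delivery minutes grow roughly linearly with distance in km.
rng = random.Random(42)
distances = [float(d) for d in range(1, 21)]
minutes = [10 + 3 * d + rng.gauss(0, 2) for d in distances]

baseline_scores = cross_validate_mse(distances, minutes, fit_mean, predict_mean)
linear_scores = cross_validate_mse(distances, minutes, fit_linear, predict_linear)
print("baseline mean MSE:", round(mean(baseline_scores), 2))
print("linear mean MSE:  ", round(mean(linear_scores), 2))
```

Comparing the mean and the spread of the per-fold scores, rather than a single train/test split, is what lets the comparison flag a model that fits the training data well but generalizes poorly.
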

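The two evaluation metrics described under Performance Evaluation have short definitions that are worth seeing in code. The notebook imports `mean_squared_error` and `r2_score` from scikit-learn; the hand-written versions below are only meant to show what those metrics compute. The function names and the sample delivery times are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance in y_true explained by the predictions."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual vs. predicted delivery times in minutes.
actual = [30, 45, 25, 60]
predicted = [28, 50, 27, 55]

print("MSE:", mse(actual, predicted))
print("R^2:", round(r_squared(actual, predicted), 4))
```

Because MSE is expressed in squared minutes, taking its square root (RMSE) is a common way to report the error in the same unit as the target.
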
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statistics
from geopy.distance import geodesic  # Geodesic distance between two latitude/longitude points

# Machine Learning & Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb  # Gradient boosting for high-precision latency prediction

# Metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings

# Configuration
warnings.filterwarnings('ignore')
sns.set_theme(style="whitegrid")  # Sets a professional visual theme for all plots