<a href="https://colab.research.google.com/github/wesha-904/Amazon-Delivery-Time-Prediction/blob/main/Amazon_Delivery_Time_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Amazon Delivery Time Prediction



##### **Project Type** - EDA
##### **Contribution** - Individual
##### **Name** - Anwesha Singh


# **Project Summary -**

This project focuses on predicting e-commerce delivery times using machine learning to help improve customer satisfaction and logistics efficiency. The dataset includes details such as store and delivery locations, traffic, weather, agent performance, and product category.

Key steps included:

- Data Cleaning & Preprocessing: Removed duplicates, handled missing values, standardized traffic/weather labels, and engineered features such as geospatial distance and time-based attributes.

- Exploratory Data Analysis (EDA): Identified patterns in delivery delays, the impact of traffic/weather, and agent performance trends using visualizations (scatter plots, heatmaps, box plots).

- Model Development: Built and compared regression models (Linear Regression, Random Forest, Gradient Boosting/XGBoost) using metrics like RMSE, MAE, and R².

- Model Tracking: Used MLflow to log experiments, hyperparameters, and performance metrics for version control and comparison.

- Deployment: Developed an interactive Streamlit web app that allows users to input order details and get real-time delivery time predictions.

Impact:

- Improved visibility into delivery delays based on real-world factors.

- Provided a scalable solution to enhance last-mile delivery logistics and optimize resource allocation.

- Tech Stack: Python, Pandas, Scikit-learn, XGBoost, MLflow, Streamlit, Geopy, Matplotlib/Seaborn

# **GitHub Link -**

[Anwesha's Github](https://github.com/wesha-904/Amazon-Delivery-Time-Prediction)

# **Problem Statement**


This project aims to predict delivery times for e-commerce orders based on a variety of factors such as product size, distance, traffic conditions, and shipping method.

Using the provided dataset, learners will preprocess, analyze, and build regression models to accurately estimate delivery times.

The final application will allow users to input relevant details and receive estimated delivery times via a user-friendly interface.

# Implementation

## 1. Install & Import Required Libraries

In [1]:
!pip install mlflow streamlit geopy xgboost --quiet

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import geodesic
from datetime import datetime
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import mlflow
import mlflow.sklearn
import warnings
warnings.filterwarnings('ignore')


[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/26.7 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/26.7 MB[0m [31m37.6 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/26.7 MB[0m [31m123.7 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━[0m [32m18.4/26.7 MB[0m [31m281.8 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m26.7/26.7 MB[0m [31m288.6 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m26.7/26.7 MB[0m [31m288.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.7/26.7 MB[0m [31m109.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.2 MB[0m [31m?[0m eta

## 2. Load Dataset

In [2]:
url = "https://raw.githubusercontent.com/wesha-904/Amazon-Delivery-Time-Prediction/main/amazon_delivery.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Order_ID,Agent_Age,Agent_Rating,Store_Latitude,Store_Longitude,Drop_Latitude,Drop_Longitude,Order_Date,Order_Time,Pickup_Time,Weather,Traffic,Vehicle,Area,Delivery_Time,Category
0,ialx566343618,37,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,11:30:00,11:45:00,Sunny,High,motorcycle,Urban,120,Clothing
1,akqg208421122,34,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,19:45:00,19:50:00,Stormy,Jam,scooter,Metropolitian,165,Electronics
2,njpu434582536,23,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,08:30:00,08:45:00,Sandstorms,Low,motorcycle,Urban,130,Sports
3,rjto796129700,38,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,18:00:00,18:10:00,Sunny,Medium,motorcycle,Metropolitian,105,Cosmetics
4,zguw716275638,32,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,13:30:00,13:45:00,Cloudy,High,scooter,Metropolitian,150,Toys


## 3. Data Cleaning & Preprocessing

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43739 entries, 0 to 43738
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order_ID         43739 non-null  object 
 1   Agent_Age        43739 non-null  int64  
 2   Agent_Rating     43685 non-null  float64
 3   Store_Latitude   43739 non-null  float64
 4   Store_Longitude  43739 non-null  float64
 5   Drop_Latitude    43739 non-null  float64
 6   Drop_Longitude   43739 non-null  float64
 7   Order_Date       43739 non-null  object 
 8   Order_Time       43739 non-null  object 
 9   Pickup_Time      43739 non-null  object 
 10  Weather          43648 non-null  object 
 11  Traffic          43739 non-null  object 
 12  Vehicle          43739 non-null  object 
 13  Area             43739 non-null  object 
 14  Delivery_Time    43739 non-null  int64  
 15  Category         43739 non-null  object 
dtypes: float64(5), int64(2), object(9)
memory usage: 5.3+ MB


In [4]:
# Check missing values
df.isnull().sum()

Unnamed: 0,0
Order_ID,0
Agent_Age,0
Agent_Rating,54
Store_Latitude,0
Store_Longitude,0
Drop_Latitude,0
Drop_Longitude,0
Order_Date,0
Order_Time,0
Pickup_Time,0


In [5]:
# fill with median (less affected by outliers)
df['Agent_Rating'].fillna(df['Agent_Rating'].median(), inplace=True)

In [6]:
# Fill missing Weather with the mode
df['Weather'].fillna(df['Weather'].mode()[0], inplace=True)

In [7]:
# Check missing values
df.isnull().sum()

Unnamed: 0,0
Order_ID,0
Agent_Age,0
Agent_Rating,0
Store_Latitude,0
Store_Longitude,0
Drop_Latitude,0
Drop_Longitude,0
Order_Date,0
Order_Time,0
Pickup_Time,0


In [8]:
# Drop duplicate rows if any
df.drop_duplicates(inplace=True)

In [9]:
# --- Standardize string columns ---
df['Weather'] = df['Weather'].str.strip().str.lower()
df['Traffic'] = df['Traffic'].str.strip().str.lower()
df['Vehicle'] = df['Vehicle'].str.strip().str.lower()
df['Area'] = df['Area'].str.strip().str.lower()
df['Category'] = df['Category'].str.strip().str.lower()

## 4. Feature Engineering

In [10]:
# Order_Date is datetime
df['Order_Date'] = pd.to_datetime(df['Order_Date'], errors='coerce')


In [11]:
# Distance in KM
df['Distance_km'] = df.apply(
    lambda row: geodesic(
        (row['Store_Latitude'], row['Store_Longitude']),
        (row['Drop_Latitude'], row['Drop_Longitude'])
    ).km,
    axis=1
)

# Time-based features
df['Order_DayOfWeek'] = df['Order_Date'].dt.dayofweek
df['Order_Month'] = df['Order_Date'].dt.month
df['Is_Weekend'] = df['Order_DayOfWeek'].apply(lambda x: 1 if x >= 5 else 0)

# One-hot encode categories
df = pd.get_dummies(df, columns=['Weather', 'Traffic', 'Vehicle', 'Area', 'Category'], drop_first=True)


## 5. Define Features & Target

In [12]:
X = df.drop(['Order_ID', 'Order_Date', 'Order_Time', 'Pickup_Time', 'Delivery_Time'], axis=1)
y = df['Delivery_Time']

## 6. Train-Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((34991, 40), (8748, 40))

## 7. Train Models & Evaluate

In [14]:
def evaluate_model(model, X_train, y_train, X_test, y_test):
    #Fit the model, make predictions, and return RMSE, MAE, R2
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return rmse, mae, r2

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=300, random_state=42)
}

results = {}
for name, model in models.items():
    rmse, mae, r2 = evaluate_model(model, X_train, y_train, X_test, y_test)
    results[name] = {"RMSE": rmse, "MAE": mae, "R2": r2}

results_df = pd.DataFrame(results).T
results_df


Unnamed: 0,RMSE,MAE,R2
Linear Regression,33.301216,26.227718,0.58371
Random Forest,22.505748,17.323064,0.809865
Gradient Boosting,23.591013,18.644501,0.791085


## 8. Log & Compare with MLflow

In [15]:
mlflow.set_experiment("amazon_delivery_time_prediction")

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Log parameters & metrics
        mlflow.log_param("model", name)
        mlflow.log_metric("RMSE", rmse)
        mlflow.log_metric("MAE", mae)
        mlflow.log_metric("R2", r2)

        # Log the trained model itself
        mlflow.sklearn.log_model(model, name)

        print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R²={r2:.2f}")


2025/10/04 10:18:45 INFO mlflow.tracking.fluent: Experiment with name 'amazon_delivery_time_prediction' does not exist. Creating a new experiment.


Linear Regression: RMSE=33.30, MAE=26.23, R²=0.58




Random Forest: RMSE=22.51, MAE=17.32, R²=0.81




Gradient Boosting: RMSE=23.59, MAE=18.64, R²=0.79


In [16]:
# Random Forest — Hyperparameter Search
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt']
}

rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=30,      # number of random combos
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=2,
    random_state=42,
    n_jobs=-1
)

rf_random.fit(X_train, y_train)
print("Best RF Params:", rf_random.best_params_)
print("Best RF CV RMSE:", -rf_random.best_score_)


Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best RF Params: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best RF CV RMSE: 23.990689563704716


In [17]:
# Gradient Boosting — Hyperparameter Search
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=42)

param_grid_gb = {
    'n_estimators': [100, 200, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.6, 0.8, 1.0]
}

gb_random = RandomizedSearchCV(
    estimator=gbr,
    param_distributions=param_grid_gb,
    n_iter=30,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=2,
    random_state=42,
    n_jobs=-1
)

gb_random.fit(X_train, y_train)
print("Best GB Params:", gb_random.best_params_)
print("Best GB CV RMSE:", -gb_random.best_score_)


Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best GB Params: {'subsample': 1.0, 'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 6, 'learning_rate': 0.05}
Best GB CV RMSE: 22.933766546890826


In [18]:
# Evaluate Tuned Model on Test Data

# Random Forest
best_rf = rf_random.best_estimator_
y_pred_rf = best_rf.predict(X_test)
print("Tuned Random Forest → RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)),
      "MAE:", mean_absolute_error(y_test, y_pred_rf),
      "R²:", r2_score(y_test, y_pred_rf))

# Gradient Boosting
best_gb = gb_random.best_estimator_
y_pred_gb = best_gb.predict(X_test)
print("Tuned Gradient Boosting → RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_gb)),
      "MAE:", mean_absolute_error(y_test, y_pred_gb),
      "R²:", r2_score(y_test, y_pred_gb))



Tuned Random Forest → RMSE: 23.080863346048368 MAE: 18.032958719492076 R²: 0.8000230223819007
Tuned Gradient Boosting → RMSE: 22.224494754919032 MAE: 17.35230438420869 R²: 0.814587207781807


## 9. Save Best Model

In [19]:
import joblib

best_model = best_gb  # tuned Gradient Boosting
joblib.dump(best_model, "gradient_boosting_tuned_model.pkl")
print("Tuned Gradient Boosting model saved as gradient_boosting_tuned_model.pkl")


Tuned Gradient Boosting model saved as gradient_boosting_tuned_model.pkl


## 10. Log to MLflow

In [20]:
import mlflow
import mlflow.sklearn

mlflow.set_experiment("amazon_delivery_time_prediction")

with mlflow.start_run(run_name="GradientBoosting_Tuned"):
    mlflow.log_params(gb_random.best_params_)
    mlflow.log_metric("CV_RMSE", -gb_random.best_score_)

    # Evaluate on test set
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    import numpy as np
    y_pred = best_model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    mlflow.log_metric("Test_RMSE", rmse)
    mlflow.log_metric("Test_MAE", mae)
    mlflow.log_metric("Test_R2", r2)

    mlflow.sklearn.log_model(best_model, "gradient_boosting_tuned_model")
    print(f"Tuned Gradient Boosting logged to MLflow → Test RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.2f}")




Tuned Gradient Boosting logged to MLflow → Test RMSE: 22.22, MAE: 17.35, R²: 0.81


# 11. Interactive Web UI (Streamlit) for Predictions

In [34]:
!pip install streamlit localtunnel

[31mERROR: Could not find a version that satisfies the requirement localtunnel (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for localtunnel[0m[31m
[0m

In [41]:
%%writefile app.py
import streamlit as st
import joblib
import numpy as np
from geopy.distance import geodesic
import pandas as pd
import datetime

# -------------------------------
# Load the tuned Gradient Boosting model
# -------------------------------
model = joblib.load("gradient_boosting_tuned_model.pkl")

st.set_page_config(page_title="Amazon Delivery Time Prediction", page_icon="🚚")
st.title("🚚 Amazon Delivery Time Prediction")
st.markdown("Enter order and delivery details below to predict the estimated delivery time (in **minutes**).")

# -------------------------------
# User Inputs
# -------------------------------
st.header("📍 Location Details")
store_lat = st.number_input("Store Latitude", value=12.9716)
store_lon = st.number_input("Store Longitude", value=77.5946)
drop_lat = st.number_input("Drop Latitude", value=12.9260)
drop_lon = st.number_input("Drop Longitude", value=77.6762)

distance = geodesic((store_lat, store_lon), (drop_lat, drop_lon)).km
st.info(f"Calculated Distance: **{distance:.2f} km**")

st.header("👤 Delivery Agent")
agent_age = st.slider("Agent Age", 18, 60, 30)
agent_rating = st.slider("Agent Rating", 1.0, 5.0, 4.5)

st.header("⏰ Time Information")
order_date = st.date_input("Order Date", datetime.date.today())
weekend_choice = st.selectbox("Is it Weekend?", ["No", "Yes"])
is_weekend = 1 if weekend_choice == "Yes" else 0

st.header("🌦️ Delivery Conditions")
weather = st.selectbox("Weather", ["fog", "sandstorms", "stormy", "sunny", "windy"])
traffic = st.selectbox("Traffic", ["jam", "low", "medium", "nan"])
vehicle = st.selectbox("Vehicle", ["motorcycle", "scooter", "van"])
area = st.selectbox("Area", ["urban", "semi-urban", "other"])
category = st.selectbox(
    "Product Category",
    ["books", "clothing", "cosmetics", "electronics", "grocery", "home", "jewelry",
     "kitchen", "outdoors", "pet supplies", "shoes", "skincare", "snacks", "sports", "toys"]
)

# -------------------------------
# Prepare Features (must match training columns)
# -------------------------------
input_dict = {
    "Agent_Age": [agent_age],
    "Agent_Rating": [agent_rating],
    "Store_Latitude": [store_lat],
    "Store_Longitude": [store_lon],
    "Drop_Latitude": [drop_lat],
    "Drop_Longitude": [drop_lon],
    "Distance_km": [distance],
    "Order_DayOfWeek": [order_date.weekday()],
    "Order_Month": [order_date.month],
    "Is_Weekend": [is_weekend],
}

def add_onehot(prefix, options, selected):
    for opt in options:
        key = f"{prefix}_{opt}"
        input_dict[key] = [1 if opt == selected else 0]

add_onehot("Weather", ["fog", "sandstorms", "stormy", "sunny", "windy"], weather)
add_onehot("Traffic", ["jam", "low", "medium", "nan"], traffic)
add_onehot("Vehicle", ["motorcycle", "scooter", "van"], vehicle)
add_onehot("Area", ["other", "semi-urban", "urban"], area)
add_onehot("Category", ["books","clothing","cosmetics","electronics","grocery","home",
                        "jewelry","kitchen","outdoors","pet supplies","shoes",
                        "skincare","snacks","sports","toys"], category)

input_df = pd.DataFrame(input_dict)

# ---- Align columns with model ----
trained_cols = list(model.feature_names_in_)
for col in trained_cols:
    if col not in input_df.columns:
        input_df[col] = 0
input_df = input_df[trained_cols]

# -------------------------------
# Prediction
# -------------------------------
if st.button("Predict Delivery Time"):
    prediction = model.predict(input_df)[0]

    # Convert to hours (1 decimal)
    hours = prediction / 60
    if hours < 1:
        hours_display = f"≈ {int(prediction)} min"
    else:
        # Round to nearest 0.5 hour for readability
        rounded_hours = round(hours * 2) / 2
        hours_display = f"≈ {rounded_hours} h"

    st.success("✅ Prediction Complete")

    # Big metric display
    col1, col2 = st.columns([1, 1])
    with col1:
        st.metric(label="Estimated Delivery Time (minutes)", value=f"{prediction:.0f} min")
    with col2:
        st.metric(label="Approx Time", value=hours_display)

    # Extra note
    st.caption("⏱️ This is based on historical delivery data — actual time may vary due to traffic or weather.")


Overwriting app.py


Run The code block below , after 10-15 seconds click on the link :

Your quick Tunnel has been created!

 Visit it at (it may take some time to be reachable):  |
2025-10-04T10:58:40Z INF |  https://findlaw-sessions-emphasis-then.trycloudflare.com   

Use the website as long as the code block is running.

In [43]:
!pip install streamlit cloudflared -q
!streamlit run app.py --server.headless true & npx cloudflared tunnel --url http://localhost:8501

[1G[0K⠙
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8501[0m
[34m  External URL: [0m[1mhttp://34.44.130.19:8501[0m
[0m
[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K[90m2025-10-04T11:00:19Z[0m [32mINF[0m Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee, are subject to the Cloudflare Online Services Terms of Use (https://www.cloudflare.com/website-terms/), and Cloudflare reserves the right to investigate your use of Tunnels for violations of such terms. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudfla