# Order Delivery Time Prediction

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Objectives
The objective of this assignment is to build a regression model that predicts the delivery time for orders placed through Porter. The model will use various features such as the items ordered, the restaurant location, the order protocol, and the availability of delivery partners.

The key goals are:
- Predict the delivery time for an order based on multiple input features
- Improve delivery time predictions to optimise operational efficiency
- Understand the key factors influencing delivery time to enhance the model's accuracy

## Data Pipeline
The data pipeline for this assignment will involve the following steps:
1. **Data Loading**
2. **Data Preprocessing and Feature Engineering**
3. **Exploratory Data Analysis**
4. **Model Building**
5. **Model Inference**

## Data Understanding
The dataset contains information on orders placed through Porter, with the following columns:

| Field                     | Description                                                                                 |
|---------------------------|---------------------------------------------------------------------------------------------|
| market_id                 | Integer ID representing the market where the restaurant is located.                         |
| created_at                | Timestamp when the order was placed.                                                        |
| actual_delivery_time      | Timestamp when the order was delivered.                                                     |
| store_primary_category    | Category of the restaurant (e.g., fast food, dine-in).                                      |
| order_protocol            | Integer representing how the order was placed (e.g., via Porter, call to restaurant, etc.). |
| total_items               | Total number of items in the order.                                                         |
| subtotal                  | Final price of the order.                                                                   |
| num_distinct_items        | Number of distinct items in the order.                                                      |
| min_item_price            | Price of the cheapest item in the order.                                                    |
| max_item_price            | Price of the most expensive item in the order.                                              |
| total_onshift_dashers     | Number of delivery partners on duty when the order was placed.                              |
| total_busy_dashers        | Number of delivery partners already occupied with other orders.                             |
| total_outstanding_orders  | Number of orders pending fulfillment at the time of the order.                              |
| distance                  | Total distance from the restaurant to the customer.                                         |


## **Importing Necessary Libraries**

In [15]:
# Import essential libraries for data manipulation and analysis

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

## **1. Loading the data**
Load 'porter_data_1.csv' as a DataFrame

In [16]:
# Importing the file porter_data_1.csv (Google Drive is already mounted by the first cell)
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/porter_data_1.csv')
df.head()


Unnamed: 0,market_id,created_at,actual_delivery_time,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,distance
0,1.0,2015-02-06 22:24:17,2015-02-06 23:11:17,4,1.0,4,3441,4,557,1239,33.0,14.0,21.0,34.44
1,2.0,2015-02-10 21:49:25,2015-02-10 22:33:25,46,2.0,1,1900,1,1400,1400,1.0,2.0,2.0,27.6
2,2.0,2015-02-16 00:11:35,2015-02-16 01:06:35,36,3.0,4,4771,3,820,1604,8.0,6.0,18.0,11.56
3,1.0,2015-02-12 03:36:46,2015-02-12 04:35:46,38,1.0,1,1525,1,1525,1525,5.0,6.0,8.0,31.8
4,1.0,2015-01-27 02:12:36,2015-01-27 02:58:36,38,1.0,2,3620,2,1425,2195,5.0,5.0,7.0,8.2


In [17]:
#check shape
df.shape

(175777, 14)

In [18]:
df.columns

Index(['market_id', 'created_at', 'actual_delivery_time',
       'store_primary_category', 'order_protocol', 'total_items', 'subtotal',
       'num_distinct_items', 'min_item_price', 'max_item_price',
       'total_onshift_dashers', 'total_busy_dashers',
       'total_outstanding_orders', 'distance'],
      dtype='object')

## **2. Data Preprocessing and Feature Engineering** <font color = red>[15 marks]</font> <br>

#### **2.1 Fixing the Datatypes**  <font color = red>[5 marks]</font> <br>
The timestamp columns are currently stored as `object` (string) dtype and need to be converted to `datetime` so that time differences and date parts (hour, day of week) can be computed.

In [19]:
df.dtypes

Unnamed: 0,0
market_id,float64
created_at,object
actual_delivery_time,object
store_primary_category,int64
order_protocol,float64
total_items,int64
subtotal,int64
num_distinct_items,int64
min_item_price,int64
max_item_price,int64


##### **2.1.1** <font color = red>[2 marks]</font> <br>
Convert date and time fields to appropriate data type

In [20]:
# Convert 'created_at' and 'actual_delivery_time' columns to datetime format

df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])

# Verify the conversion
print(df[['created_at', 'actual_delivery_time']].dtypes)



created_at              datetime64[ns]
actual_delivery_time    datetime64[ns]
dtype: object


##### **2.1.2**  <font color = red>[3 marks]</font> <br>
Convert categorical fields to appropriate data type

In [21]:
# Convert categorical features to category type

categorical_columns = ['market_id', 'store_primary_category', 'order_protocol']

for col in categorical_columns:
    df[col] = df[col].astype('category')

# Verify the conversion
print(df[categorical_columns].dtypes)

market_id                 category
store_primary_category    category
order_protocol            category
dtype: object


#### **2.2 Feature Engineering** <font color = red>[5 marks]</font> <br>
Calculate the time taken to execute the delivery as well as extract the hour and day at which the order was placed

##### **2.2.1** <font color = red>[2 marks]</font> <br>
Calculate the time taken using the features `actual_delivery_time` and `created_at`

In [22]:
# Calculate time taken in minutes

df['delivery_time_mins'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds() / 60

# Preview the new column
print(df[['created_at', 'actual_delivery_time', 'delivery_time_mins']].head())


           created_at actual_delivery_time  delivery_time_mins
0 2015-02-06 22:24:17  2015-02-06 23:11:17                47.0
1 2015-02-10 21:49:25  2015-02-10 22:33:25                44.0
2 2015-02-16 00:11:35  2015-02-16 01:06:35                55.0
3 2015-02-12 03:36:46  2015-02-12 04:35:46                59.0
4 2015-01-27 02:12:36  2015-01-27 02:58:36                46.0


##### **2.2.2** <font color = red>[3 marks]</font> <br>
Extract the hour at which the order was placed and which day of the week it was. Drop the unnecessary columns.

In [23]:
# Extract the hour and day of week from the 'created_at' timestamp

df['order_hour'] = df['created_at'].dt.hour
df['order_dayofweek'] = df['created_at'].dt.dayofweek


# Create a categorical feature 'isWeekend' (1 for Saturday/Sunday, 0 otherwise)

df['isWeekend'] = (df['created_at'].dt.dayofweek >= 5).astype(int)
df['isWeekend'] = df['isWeekend'].astype('category')

df.head()

Unnamed: 0,market_id,created_at,actual_delivery_time,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,distance,delivery_time_mins,order_hour,order_dayofweek,isWeekend
0,1.0,2015-02-06 22:24:17,2015-02-06 23:11:17,4,1.0,4,3441,4,557,1239,33.0,14.0,21.0,34.44,47.0,22,4,0
1,2.0,2015-02-10 21:49:25,2015-02-10 22:33:25,46,2.0,1,1900,1,1400,1400,1.0,2.0,2.0,27.6,44.0,21,1,0
2,2.0,2015-02-16 00:11:35,2015-02-16 01:06:35,36,3.0,4,4771,3,820,1604,8.0,6.0,18.0,11.56,55.0,0,0,0
3,1.0,2015-02-12 03:36:46,2015-02-12 04:35:46,38,1.0,1,1525,1,1525,1525,5.0,6.0,8.0,31.8,59.0,3,3,0
4,1.0,2015-01-27 02:12:36,2015-01-27 02:58:36,38,1.0,2,3620,2,1425,2195,5.0,5.0,7.0,8.2,46.0,2,1,0


In [24]:
# Drop unnecessary columns

# Drop raw timestamp columns
df.drop(columns=['created_at', 'actual_delivery_time'], inplace=True)

df.head()

Unnamed: 0,market_id,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,distance,delivery_time_mins,order_hour,order_dayofweek,isWeekend
0,1.0,4,1.0,4,3441,4,557,1239,33.0,14.0,21.0,34.44,47.0,22,4,0
1,2.0,46,2.0,1,1900,1,1400,1400,1.0,2.0,2.0,27.6,44.0,21,1,0
2,2.0,36,3.0,4,4771,3,820,1604,8.0,6.0,18.0,11.56,55.0,0,0,0
3,1.0,38,1.0,1,1525,1,1525,1525,5.0,6.0,8.0,31.8,59.0,3,3,0
4,1.0,38,1.0,2,3620,2,1425,2195,5.0,5.0,7.0,8.2,46.0,2,1,0


#### **2.3 Creating training and validation sets** <font color = red>[5 marks]</font> <br>

##### **2.3.1** <font color = red>[2 marks]</font> <br>
 Define target and input features

In [25]:
# Define target variable (y) and features (X)
y = df.pop('delivery_time_mins')
X = df

X.columns

Index(['market_id', 'store_primary_category', 'order_protocol', 'total_items',
       'subtotal', 'num_distinct_items', 'min_item_price', 'max_item_price',
       'total_onshift_dashers', 'total_busy_dashers',
       'total_outstanding_orders', 'distance', 'order_hour', 'order_dayofweek',
       'isWeekend'],
      dtype='object')

##### **2.3.2** <font color = red>[3 marks]</font> <br>
 Split the data into training and test sets

In [26]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split  # re-imported here so the cell runs on its own

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# Display the shape of the splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

## **3. Exploratory Data Analysis on Training Data** <font color = red>[20 marks]</font> <br>
1. Analyzing the correlation between variables to identify patterns and relationships
2. Identifying and addressing outliers to ensure the integrity of the analysis
3. Exploring the relationships between variables and examining the distribution of the data for better insights

#### **3.1 Feature Distributions** <font color = red> [7 marks]</font> <br>


In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation
numerical_cols = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_train.select_dtypes(include='category').columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)

##### **3.1.1** <font color = red>[3 marks]</font> <br>
Plot distributions for numerical columns in the training set to understand their spread and any skewness

In [None]:
# Plot distributions for all numerical columns


# Set up plot style
sns.set(style='whitegrid')
plt.figure(figsize=(16, 22))

# Plot distribution for each numerical column
for i, col in enumerate(numerical_cols):
    plt.subplot(len(numerical_cols) // 2 + 1, 2, i + 1)
    sns.histplot(X_train[col], kde=True, bins=30, color='steelblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

##### **3.1.2** <font color = red>[2 marks]</font> <br>
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns
# Count plots are used because KDE curves and numeric bins do not apply to categories

plt.figure(figsize=(20, 18))
for i, col in enumerate(categorical_cols):
    plt.subplot(len(categorical_cols) // 2 + 1, 2, i + 1)
    sns.countplot(x=X_train[col], color='steelblue')
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

##### **3.1.3** <font color = red>[2 marks]</font> <br>
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of delivery_time_mins


plt.figure(figsize=(10, 6))
sns.histplot(y_train, kde=True, bins=50, color='steelblue')

# Title and labels
plt.title('Distribution of Delivery Time (minutes)', fontsize=14)
plt.xlabel('Delivery Time (mins)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

plt.tight_layout()
plt.show()


#### **3.2 Relationships Between Features** <font color = red>[3 marks]</font> <br>

##### **3.2.1** <font color = red>[3 marks]</font> <br>
Scatter plots for important numerical and categorical features to observe how they relate to `delivery_time_mins`

In [None]:
# Scatter plot to visualise the relationship between delivery_time_mins and other features

plt.figure(figsize=(18, 16))
for i, col in enumerate(numerical_cols):
    plt.subplot(len(numerical_cols) // 2 + 1, 2, i + 1)
    sns.scatterplot(x=X_train[col], y=y_train, alpha=0.6, color='steelblue')
    plt.title(f'Delivery Time vs {col}')
    plt.xlabel(col)
    plt.ylabel('Delivery Time (mins)')

plt.tight_layout()
plt.show()

In [None]:
# Show the distribution of delivery_time_mins for different hours

plt.figure(figsize=(12, 6))
sns.boxplot(x=X_train['order_hour'], y=y_train)
plt.title('Distribution of Delivery Time by Order Hour')
plt.xlabel('Order Hour')
plt.ylabel('Delivery Time (mins)')
plt.show()

#### **3.3 Correlation Analysis** <font color = red>[5 marks]</font> <br>
Check correlations between numerical features to identify which variables are strongly related to `delivery_time_mins`

##### **3.3.1** <font color = red>[3 marks]</font> <br>
Plot a heatmap to display correlations

In [None]:
train_data = X_train.copy()
train_data['delivery_time_mins'] = y_train
numerical_cols_a = train_data.select_dtypes(include=['int64', 'float64']).columns

plt.figure(figsize=(12, 8))
correlation_matrix = train_data[numerical_cols_a].corr()

sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("Correlation Matrix including Target Variable")
plt.show()

In [None]:
# Calculate the correlation of each numerical feature with the target variable
correlation_matrix = X_train[numerical_cols].corrwith(y_train).sort_values(ascending=False)

# Plot these correlations as a single-column heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix.to_frame(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation of Numerical Features with Delivery Time')
plt.show()

##### **3.3.2** <font color = red>[2 marks]</font> <br>
Drop the columns with weak correlations with the target variable

In [None]:
# Drop 3 weakly correlated columns from the training dataset.
# The same columns are dropped from X_test in section 4.3, so both sets keep identical features.
columns_to_drop = ['min_item_price', 'order_dayofweek', 'order_hour']
X_train = X_train.drop(columns=columns_to_drop)

print("Shape of X_train after dropping columns:", X_train.shape)

#### **3.4 Handling the Outliers** <font color = red>[5 marks]</font> <br>



##### **3.4.1** <font color = red>[2 marks]</font> <br>
Visualise potential outliers for the target variable and other numerical features using boxplots

In [None]:
# Boxplot for delivery_time_mins
plt.figure(figsize=(8, 6))
sns.boxplot(y=y_train)
plt.title('Boxplot of Delivery Time (mins)')
plt.ylabel('Delivery Time (mins)')
plt.show()

# Boxplots for other numerical features
numerical_cols_after_drop = X_train.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(16, 18))
for i, col in enumerate(numerical_cols_after_drop):
    plt.subplot(len(numerical_cols_after_drop) // 2 + 1, 2, i + 1)
    sns.boxplot(y=X_train[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)

plt.tight_layout()
plt.show()

##### **3.4.2** <font color = red>[3 marks]</font> <br>
Handle outliers present in all columns

In [None]:
# Handle outliers

# Create a copy to work on
train_data = pd.concat([X_train, y_train], axis=1)

def cap_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower_bound, lower_bound,
                       np.where(df[col] > upper_bound, upper_bound, df[col]))
    return df

# Apply the cap_outliers function to delivery_time_mins and the other numerical columns
clean_cols_after_drop = train_data.select_dtypes(include=np.number).columns.tolist()

for col in clean_cols_after_drop:
    train_data = cap_outliers(train_data, col)

X_train = train_data.drop('delivery_time_mins', axis=1)
y_train = train_data['delivery_time_mins']

# Boxplot for delivery_time_mins after capping outliers
plt.figure(figsize=(8, 6))
sns.boxplot(y=y_train)
plt.title('Boxplot of Delivery Time (mins)')
plt.ylabel('Delivery Time (mins)')
plt.show()

# Boxplots for other numerical features after capping outlier
numerical_cols_after_drop = X_train.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(16, 18))
for i, col in enumerate(numerical_cols_after_drop):
    plt.subplot(len(numerical_cols_after_drop) // 2 + 1, 2, i + 1)
    sns.boxplot(y=X_train[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)

plt.tight_layout()
plt.show()

## **4. Exploratory Data Analysis on Validation Data** <font color = red>[optional]</font> <br>
Optionally, perform EDA on the test data to check whether its distributions match those of the training data

In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation
numerical_cols = X_test.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_test.select_dtypes(include='category').columns.tolist()


#### **4.1 Feature Distributions**


##### **4.1.1**
Plot distributions for numerical columns in the validation set to understand their spread and any skewness

In [None]:
# Plot distributions for all numerical columns


# Set up plot style
sns.set(style='whitegrid')
plt.figure(figsize=(16, 18))

# Plot distribution for each numerical column
for i, col in enumerate(numerical_cols):
    plt.subplot(len(numerical_cols) // 2 + 1, 2, i + 1)
    sns.histplot(X_test[col], kde=True, bins=30, color='steelblue')
    plt.title(f'Distribution of {col} (Test Set)')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

##### **4.1.2**
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns
# Count plots are used because KDE curves and numeric bins do not apply to categories

plt.figure(figsize=(20, 18))
for i, col in enumerate(categorical_cols):
    plt.subplot(len(categorical_cols) // 2 + 1, 2, i + 1)
    sns.countplot(x=X_test[col], color='steelblue')
    plt.title(f'Distribution of {col} (Test Set)')
plt.tight_layout()
plt.show()

##### **4.1.3**
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of delivery_time_mins

plt.figure(figsize=(10, 6))
sns.histplot(y_test, kde=True, bins=50, color='steelblue')

# Title and labels
plt.title('Distribution of Delivery Time (minutes) (Test Set)', fontsize=14)
plt.xlabel('Delivery Time (mins)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

plt.tight_layout()
plt.show()

#### **4.2 Relationships Between Features**
Scatter plots for numerical features to observe how they relate to each other, especially to `delivery_time_mins`

In [None]:
# Scatter plot to visualise the relationship between delivery_time_mins and other features

plt.figure(figsize=(18, 16))
for i, col in enumerate(numerical_cols):
    plt.subplot(len(numerical_cols) // 2 + 1, 2, i + 1)
    sns.scatterplot(x=X_test[col], y=y_test, alpha=0.6, color='steelblue')
    plt.title(f'Delivery Time vs {col} (Test Set)')
    plt.xlabel(col)
    plt.ylabel('Delivery Time (mins)')

plt.tight_layout()
plt.show()

#### **4.3** Drop the columns with weak correlations with the target variable

In [None]:
# Drop the weakly correlated columns from test dataset

columns_to_drop = ['min_item_price', 'order_dayofweek', 'order_hour']
X_test = X_test.drop(columns=columns_to_drop)
print("Shape of X_test after dropping columns:", X_test.shape)


## **5. Model Building** <font color = red>[15 marks]</font> <br>

#### **Import Necessary Libraries**

In [None]:
# Import libraries
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

#### **5.1 Feature Scaling** <font color = red>[3 marks]</font> <br>

In [None]:
# Apply scaling to the numerical columns

scaler = MinMaxScaler()

num_cols = X_train.select_dtypes(include=np.number).columns.to_list()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print("Numerical columns in X_train after scaling:")
display(X_train[num_cols].head())

print("Numerical columns in X_test after scaling:")
display(X_test[num_cols].head())

Note that linear regression is agnostic to feature scaling. However, with feature scaling, we get the coefficients to be somewhat on the same scale so that it becomes easier to compare them.
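
A minimal sketch of the point above, using synthetic data rather than the Porter dataset: the fitted values are identical with and without min-max scaling, and only the coefficients change. The helper `fit_predict` and all variable names are illustrative assumptions, and the scaling is done by hand to mirror what `MinMaxScaler` does with its default range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two features on very different scales
X = np.column_stack([rng.uniform(0, 50, 200), rng.uniform(500, 5000, 200)])
y = 10 + 0.8 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 1, 200)

def fit_predict(features, target):
    """Fit OLS with an intercept and return (coefficients, fitted values)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta, design @ beta

# Min-max scaling by hand, equivalent to MinMaxScaler's default [0, 1] range
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

beta_raw, pred_raw = fit_predict(X, y)
beta_scaled, pred_scaled = fit_predict(X_scaled, y)

print("Raw coefficients:   ", beta_raw[1:])
print("Scaled coefficients:", beta_scaled[1:])
print("Same predictions:", np.allclose(pred_raw, pred_scaled))
```

The scaled coefficients are directly comparable with each other because every feature now spans the same 0 to 1 range.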

#### **5.2 Build a linear regression model** <font color = red>[5 marks]</font> <br>

You can choose from the libraries *statsmodels* and *scikit-learn* to build the model.

In [None]:
# Create/Initialise the model
model = LinearRegression()

In [None]:
# Train the model using the training data
lr = model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
# Find results for evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R2): {r2:.2f}")

Note that we have 12 (depending on how you select features) training features. However, not all of them would be useful. Let's say we want to take the most relevant 8 features.

We will use Recursive Feature Elimination (RFE) here.

For this, you can look at the coefficients / p-values of features from the model summary and perform feature elimination, or you can use the RFE module provided with *scikit-learn*.

#### **5.3 Build the model and fit RFE to select the most important features** <font color = red>[7 marks]</font> <br>

For RFE, we will start with all features and use the RFE method to recursively reduce the number of features one-by-one.

After analysing the results of these iterations, we select the one that has a good balance between performance and number of features.
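
One way to run these iterations is sketched below on synthetic data from `make_regression`, not on the Porter features. The names `X_demo`, `y_demo`, `X_val` and `scores` are assumptions made for the example; in the notebook the loop would run over `X_train` and `y_train`, ideally scoring on a separate validation split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 12 features, of which 5 carry signal
X_demo, y_demo = make_regression(n_samples=500, n_features=12, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scores = {}
for k in range(1, X_demo.shape[1] + 1):
    # Keep the k features that RFE ranks highest, then refit on them only
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X_tr, y_tr)
    model_k = LinearRegression().fit(selector.transform(X_tr), y_tr)
    scores[k] = model_k.score(selector.transform(X_val), y_val)  # validation R-squared

for k, r2 in scores.items():
    print(f"{k:2d} features -> validation R2 = {r2:.3f}")
```

A reasonable choice is the smallest `k` after which the validation score stops improving noticeably.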

In [None]:
# Select features with RFE

# Run RFE so that the 8 highest-ranked features are kept
lm = LinearRegression()

rfe = RFE(lm, n_features_to_select=8)             # RFE fits the estimator itself
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Build the final model with selected number of features

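# Creating the X_train dataframe with the RFE-selected variables
# (cast to float so that statsmodels receives a purely numeric design matrix)
X_train_rfe = X_train[col].astype(float)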

# Adding a constant variable
X_train_rfe = sm.add_constant(X_train_rfe)

lm = sm.OLS(y_train,X_train_rfe).fit()   # Running the linear model

#Let's see the summary of our linear model
print(lm.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe.drop(['const'], axis=1)
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

## **6. Results and Inference** <font color = red>[5 marks]</font> <br>

#### **6.1 Perform Residual Analysis** <font color = red>[3 marks]</font> <br>

In [None]:
# Perform residual analysis using plots like residuals vs predicted values, Q-Q plot and residual histogram

# Get predictions on the training data using the final model (lm).
# The model was fitted on X_train_rfe, which includes the constant column.
y_train_pred = lm.predict(X_train_rfe)

# Calculate residuals
residuals = y_train - y_train_pred

# Plot the histogram of the error terms
fig = plt.figure(figsize=(10, 6))
sns.histplot(residuals, bins=20, kde=True)                  # distplot is deprecated in recent seaborn releases
fig.suptitle('Error Terms', fontsize=20)                    # Plot heading
plt.xlabel('Errors', fontsize=18)                           # X-label
plt.show()


[Your inferences here:]



#### **6.2 Perform Coefficient Analysis** <font color = red>[2 marks]</font> <br>

Perform coefficient analysis to find how changes in features affect the target.
Also, the features were scaled, so interpret the scaled and unscaled coefficients to understand the impact of feature changes on delivery time.


In [None]:
# Compare the scaled vs unscaled features used in the final model

Additionally, we can analyse the effect of a unit change in a feature. In other words, because we have scaled the features, a unit change in the features will not translate directly to the model. Use scaled and unscaled coefficients to find how will a unit change in a feature affect the target.

In [None]:
# Analyze the effect of a unit change in a feature, say 'total_items'
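
A minimal sketch of the conversion, assuming min-max scaling as in section 5.1. Since `x_scaled = (x - min) / (max - min)`, a coefficient on the scaled feature divided by `(max - min)` gives the change in the target per one original unit. The data below is synthetic, and the "1.5 minutes per item" effect is a made-up value for illustration, not a result from the Porter model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature: total_items between 1 and 20, true effect 1.5 min per item
total_items = rng.integers(1, 21, size=300).astype(float)
delivery_time = 30 + 1.5 * total_items + rng.normal(0, 2, 300)

x_min, x_max = total_items.min(), total_items.max()
items_scaled = (total_items - x_min) / (x_max - x_min)   # MinMaxScaler's default behaviour

# Fit y = b0 + b1 * x_scaled by least squares
design = np.column_stack([np.ones_like(items_scaled), items_scaled])
(b0_scaled, b1_scaled), *_ = np.linalg.lstsq(design, delivery_time, rcond=None)

# Convert back to original units
b1_unscaled = b1_scaled / (x_max - x_min)     # minutes per additional item
b0_unscaled = b0_scaled - b1_unscaled * x_min

# Cross-check against a fit on the raw feature
design_raw = np.column_stack([np.ones_like(total_items), total_items])
beta_raw, *_ = np.linalg.lstsq(design_raw, delivery_time, rcond=None)

print(f"Scaled coefficient:   {b1_scaled:.3f} (effect of going from min to max)")
print(f"Unscaled coefficient: {b1_unscaled:.3f} (effect of one extra item)")
print("Matches raw fit:", np.allclose([b0_unscaled, b1_unscaled], beta_raw))
```

For the notebook's model, the per-feature range would come from the fitted scaler (for `MinMaxScaler`, the `data_min_` and `data_max_` attributes).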



Note:
The coefficients on the original scale might differ greatly in magnitude from the scaled coefficients, but they both describe the same relationships between variables.

Interpretation is key: Focus on the direction and magnitude of the coefficients on the original scale to understand the impact of each variable on the response variable in the original units.

Include conclusions in your report document.

## Subjective Questions <font color = red>[20 marks]</font>

Answer the following questions only in the notebook. Include the visualisations/methodologies/insights/outcomes from all the above steps in your report.

#### Subjective Questions based on Assignment

##### **Question 1.** <font color = red>[2 marks]</font> <br>

Are there any categorical variables in the data? From your analysis of the categorical variables from the dataset, what could you infer about their effect on the dependent variable?

**Answer:**
> Yes. After preprocessing, `market_id`, `store_primary_category`, `order_protocol` and the engineered `isWeekend` flag are stored as categorical variables.

**From the OLS summary:**

`isWeekend` is a binary categorical variable that marks whether the order was placed on a weekend (1) or a weekday (0).

The coefficient of `isWeekend` is **+1.444**.

An order placed on a weekend takes about 1.44 minutes longer on average, holding the other variables constant.

**Note:** The other categorical variables (such as `store_primary_category` or `order_protocol`) may have been dropped during feature selection or encoded as dummies, and they are not included in this particular model output.




---



##### **Question 2.** <font color = red>[1 mark]</font> <br>
What does `test_size = 0.2` refer to during splitting the data into training and test sets?

**Answer:**
> It means that the dataset is divided into two parts, training data and test data; `test_size = 0.2` indicates that 20% of the rows are held out as test data and the remaining 80% are used for training.
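
As a small illustration on dummy data (the arrays `X_demo` and `y_demo` are placeholders, not the Porter data), splitting 1,000 rows with `test_size=0.2` leaves 800 rows for training and 200 for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(1000).reshape(-1, 1)   # 1,000 dummy rows
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)
print(len(X_tr), len(X_te))   # 800 200
```

`random_state` only fixes which rows land in each part; the 80/20 proportion comes from `test_size`.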



---



##### **Question 3.** <font color = red>[1 mark]</font> <br>
Looking at the heatmap, which one has the highest correlation with the target variable?  

**Answer:**
> `distance` has the highest correlation with the delivery time in minutes (`delivery_time_mins`).



---



##### **Question 4.** <font color = red>[2 marks]</font> <br>
What was your approach to detect the outliers? How did you address them?

**Answer:**

> I used box plots to identify the outliers visually and then capped (clipped) them at bounds derived from the interquartile range (IQR):

- `IQR = Q3 - Q1`
- `lower_bound = Q1 - 1.5 * IQR`
- `upper_bound = Q3 + 1.5 * IQR`
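
A worked example of these bounds on ten made-up delivery times (the values are invented for illustration, and `np.clip` stands in for the nested `np.where` used in the notebook):

```python
import numpy as np

# Made-up delivery times in minutes; 300 is an obvious outlier
times = np.array([30.0, 35.0, 40.0, 42.0, 45.0, 47.0, 50.0, 55.0, 60.0, 300.0])

q1, q3 = np.percentile(times, [25, 75])   # 40.5 and 53.75
iqr = q3 - q1                             # 13.25
lower_bound = q1 - 1.5 * iqr              # 20.625
upper_bound = q3 + 1.5 * iqr              # 73.625

# Values outside the bounds are pulled back to the nearest bound
capped = np.clip(times, lower_bound, upper_bound)
print(capped)
```

Only the 300-minute value changes (it becomes 73.625); capping keeps the row instead of deleting it, so no training data is lost.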



---



##### **Question 5.** <font color = red>[2 marks]</font> <br>
Based on the final model, which are the top 3 features significantly affecting the delivery time?

**Answer:**
>Based on coefficient magnitude and statistical significance, the top 3 features influencing delivery time are total_outstanding_orders, total_onshift_dashers, and distance. High outstanding orders substantially increase delivery time, while having more dashers reduces it. Longer delivery distances also contribute to delays. These insights can guide operational improvements in workload management and delivery logistics.



| Rank | Feature                        | Coefficient  | Direction | Interpretation                                                                                                                                                                   |
| ---- | ------------------------------ | ------------ | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1    | **`total_outstanding_orders`** | **+70.2251** | Positive  | Moving this feature across its full min-max scaled range (0 to 1) is associated with about **70 more minutes** of delivery time, so a heavy backlog significantly delays deliveries. |
| 2    | **`total_onshift_dashers`**    | **−50.3741** | Negative  | More dashers on shift **reduce** delivery time; high workforce availability speeds up delivery.                                                                                  |
| 3    | **`distance`**                 | **+22.3346** | Positive  | A longer distance between store and customer **increases** delivery time.                                                                                                        |




---



#### General Subjective Questions

##### **Question 6.** <font color = red>[3 marks]</font> <br>
Explain the linear regression algorithm in detail

**Answer:**
>Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (the one you want to predict) and one or more independent variables (the features you use for prediction).

Here's a breakdown of the algorithm:

**The Goal:** The primary goal of linear regression is to find the best-fitting straight line that describes the relationship between the variables. This line is represented by a linear equation.


**The Equation:**

**Simple Linear Regression:** When you have one independent variable (x) and one dependent variable (y), the equation is: y = β₀ + β₁x + ε

**Where:**

- y is the dependent variable.
- β₀ is the y-intercept (the value of y when x is 0).
- β₁ is the slope of the line (how much y changes for a one-unit change in x).
- x is the independent variable.
- ε is the error term (the part of y that the model cannot explain).

**Multiple Linear Regression:** When you have multiple independent variables (x₁, x₂, ..., xₙ), the equation expands to: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

**Where:**

- y is the dependent variable.
- β₀ is the y-intercept.
- β₁, β₂, ..., βₙ are the coefficients for each independent variable (representing the change in y for a one-unit change in that variable, holding others constant).
- x₁, x₂, ..., xₙ are the independent variables.
- ε is the error term.

**Finding the Best Fit:** The "best-fitting" line is determined by minimizing the difference between the actual observed values of the dependent variable and the values predicted by the linear equation. This difference is called the residual.

**Cost Function:** To quantify the overall error of the model, a cost function is used. The most common choices for linear regression are the Mean Squared Error (MSE) and the Sum of Squared Errors (SSE), which compute the average and the sum of the squared residuals, respectively. Squaring the residuals ensures that both positive and negative errors contribute to the cost and penalizes larger errors more heavily.
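
As a small worked example of these two quantities, the sketch below computes SSE and MSE by hand with NumPy. The observed and predicted values are made up for illustration and do not come from the Porter dataset.

```python
import numpy as np

# Hypothetical observed and predicted delivery times in minutes (illustration only).
y_true = np.array([30.0, 45.0, 50.0, 38.0])
y_pred = np.array([32.0, 41.0, 53.0, 38.0])

# Residual = actual value minus predicted value.
residuals = y_true - y_pred          # [-2, 4, -3, 0]

# SSE sums the squared residuals; MSE averages them.
sse = np.sum(residuals ** 2)         # 4 + 16 + 9 + 0 = 29
mse = sse / len(y_true)              # 29 / 4 = 7.25

print(f"SSE = {sse}, MSE = {mse}")
```

The residual of 4 contributes 16 to the SSE while the residual of −2 contributes only 4, which is the "larger errors are penalized more heavily" effect.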


**Minimizing the Cost Function:** The process of finding the coefficients (β values) that minimize the cost function is called optimization.

There are two main methods for this:

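**Ordinary Least Squares (OLS):** This is an analytical method that uses calculus to find the coefficients that minimize the sum of squared residuals directly. Setting the partial derivatives of the cost to zero gives a set of linear equations (the normal equations), which have a closed-form solution.

**Gradient Descent:** This is an iterative optimization algorithm. It starts with initial values for the coefficients and then repeatedly adjusts them in the direction of the negative gradient, the direction in which the cost decreases fastest, until the updates become negligible. Because the squared-error cost is convex, a suitably small learning rate leads to the global minimum.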
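
The closed-form route can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the notebook's own model: the data, the true coefficients (3, 2, −1) and the variable names are all assumptions. It solves the normal equations (XᵀX)β = Xᵀy and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 3 + 2*x1 - 1*x2 + small noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(n), X])

# Normal equations: (X^T X) beta = X^T y, solved without forming an explicit inverse.
beta_closed = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# NumPy's least-squares solver should agree with the closed-form answer.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("normal equations:", beta_closed)
print("np.linalg.lstsq :", beta_lstsq)
```

Solving the linear system with `np.linalg.solve` is generally preferable to inverting XᵀX explicitly, since it is cheaper and numerically more stable.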

**Assumptions of Linear Regression:** For the results of linear regression to be valid and reliable, certain assumptions should ideally be met:

**Linearity:** There should be a linear relationship between the independent and dependent variables.

**Independence:** The observations should be independent of each other.

**Homoscedasticity:** The variance of the residuals should be constant across all levels of the independent variables.

**Normality:** The residuals should be normally distributed.

**No Multicollinearity:** In multiple linear regression, the independent variables should not be highly correlated with each other.
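
One of these assumptions, the absence of multicollinearity, is often checked with the variance inflation factor (VIF). The sketch below is a hand-rolled NumPy version on synthetic columns. The helper name `vif` and the data are assumptions for illustration. It regresses each feature on the others and reports 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X, shape (n_samples, n_features)."""
    n, p = X.shape
    factors = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept).
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors[j] = 1.0 / (1.0 - r2)
    return factors

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)                # independent of a
c = a + 0.05 * rng.normal(size=500)     # almost a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 1))))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. Here the near-duplicate columns `a` and `c` land far above that range, while `b` stays close to 1.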

In summary, linear regression is a powerful yet simple algorithm that models linear relationships by finding the line (or hyperplane) that minimizes the squared errors between predicted and actual values, typically using OLS or gradient descent.




---



##### **Question 7.** <font color = red>[2 marks]</font> <br>
Explain the difference between simple linear regression and multiple linear regression

**Answer:**
>**Simple Linear Regression:** When you have one independent variable (x) and one dependent variable (y), the equation is: y = β₀ + β₁x + ε

**Where:**

* y is the dependent variable.
* β₀ is the y-intercept (the value of y when x is 0).
* β₁ is the slope of the line (how much y changes for a one-unit change in x).
* x is the independent variable.
* ε is the error term (the part of y that the model cannot explain).

**Multiple Linear Regression:**

When you have multiple independent variables (x₁, x₂, ..., xₙ), the equation expands to: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

**Where:**

* y is the dependent variable.
* β₀ is the y-intercept.
* β₁, β₂, ..., βₙ are the coefficients for each independent variable (representing the change in y for a one-unit change in that variable, holding others constant).
* x₁, x₂, ..., xₙ are the independent variables.
* ε is the error term.

**Key difference:** Simple linear regression fits a line using a single predictor, while multiple linear regression fits a hyperplane using several predictors, and each coefficient is interpreted with the other predictors held constant.
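
The contrast between the two models can be sketched by fitting both to the same synthetic data. The feature names `distance` and `num_items`, the generating coefficients, and the helper functions are assumptions chosen for illustration; they are not the notebook's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic features and target (assumption): time depends on both features.
distance = rng.uniform(1, 20, size=n)
num_items = rng.integers(1, 6, size=n).astype(float)
delivery_time = 10 + 1.5 * distance + 2.0 * num_items + rng.normal(scale=1.0, size=n)

def fit_ols(features, y):
    """Least-squares fit with an intercept; returns (coefficients, predictions)."""
    A = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Simple regression: one predictor, so two coefficients (intercept and slope).
beta_simple, pred_simple = fit_ols(distance, delivery_time)

# Multiple regression: two predictors, so three coefficients.
beta_multi, pred_multi = fit_ols(np.column_stack([distance, num_items]), delivery_time)

print("simple  :", beta_simple, "R2 =", r_squared(delivery_time, pred_simple))
print("multiple:", beta_multi, "R2 =", r_squared(delivery_time, pred_multi))
```

Because the target here really does depend on both features, the multiple regression recovers coefficients close to the generating values and explains more of the variance than the single-feature fit.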



---



##### **Question 8.** <font color = red>[2 marks]</font> <br>
What is the role of the cost function in linear regression, and how is it minimized?

**Answer:**
>**Cost Function:** To quantify the overall error of the model, a cost function is used. The most common choices for linear regression are the Mean Squared Error (MSE) and the Sum of Squared Errors (SSE), which compute the average and the sum of the squared residuals, respectively. Squaring the residuals ensures that both positive and negative errors contribute to the cost and penalizes larger errors more heavily.


**Minimizing the Cost Function:** The process of finding the coefficients (β values) that minimize the cost function is called optimization.

There are two main methods for this:

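**Ordinary Least Squares (OLS):** This is an analytical method that uses calculus to find the coefficients that minimize the sum of squared residuals directly. Setting the partial derivatives of the cost to zero gives a set of linear equations (the normal equations), which have a closed-form solution.

**Gradient Descent:** This is an iterative optimization algorithm. It starts with initial values for the coefficients and then repeatedly adjusts them in the direction of the negative gradient, the direction in which the cost decreases fastest, until the updates become negligible. Because the squared-error cost is convex, a suitably small learning rate leads to the global minimum.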
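
The iterative route can be sketched as batch gradient descent on the MSE. This is a minimal illustration on synthetic data; the function name `gradient_descent`, the learning rate, and the iteration count are assumptions, not tuned values from this notebook. The result is compared against NumPy's least-squares solution.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Minimize MSE = mean((X @ beta - y)**2) with plain batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)                       # start from all-zero coefficients
    for _ in range(n_iters):
        errors = X @ beta - y
        grad = (2.0 / n) * (X.T @ errors)    # gradient of the MSE w.r.t. beta
        beta -= lr * grad                    # step against the gradient
    return beta

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=n)   # synthetic data (assumption)
X = np.column_stack([np.ones(n), x])                # intercept column + feature

beta_gd = gradient_descent(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", beta_gd)
print("least squares   :", beta_exact)
```

Both approaches minimize the same cost, so they should agree closely. Gradient descent becomes the practical choice when the dataset is too large for the closed-form solve, at the price of having to choose a learning rate and, usually, to scale the features.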




---



##### **Question 9.** <font color = red>[2 marks]</font> <br>
Explain the difference between overfitting and underfitting.



**Answer:**

>**Underfitting**

* The model is too simple.
* It doesn’t learn enough from the data.
* It makes a lot of mistakes even on the training data.

➤ Example:
Trying to predict house prices using only the number of bedrooms while ignoring other important details like location, size, or age.

**Overfitting**

* The model is too complex for the amount of data available.
* It memorizes the training data, including noise and mistakes.
* It does great on training data but fails on new data.

➤ Example:
You memorize all the questions and answers from a practice test, but in the real exam, when the questions change slightly, you can’t answer them.
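
The same contrast can be seen numerically by fitting polynomials of increasing degree to a small noisy sample. This is a sketch under assumed settings: the sine-shaped signal, the noise level, and the degrees 1, 4 and 14 are chosen purely for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_signal(x):
    return np.sin(2 * np.pi * x)

# Small training set and a larger held-out test set, both with noise.
x_train = np.linspace(0, 1, 15)
y_train = true_signal(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = true_signal(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 4, 14):
    # Least-squares polynomial fit; degree 14 passes through all 15 training points.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The degree-1 model has a high error on both sets (underfitting). The degree-14 model drives the training error to nearly zero but does much worse on the test set (overfitting). The middle degree balances the two.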



---



##### **Question 10.** <font color = red>[3 marks]</font> <br>
How do residual plots help in diagnosing a linear regression model?

**Answer:**
>Residual plots help you see if your model is doing a good job. If residuals are randomly scattered around zero, your model is likely valid. If you see patterns, curves, or uneven spread, it suggests problems with your model, such as non-linearity, missing variables, or non-constant error variance (heteroscedasticity).

A residual plot shows:

* **X-axis:** Predicted values (or an independent variable)

* **Y-axis:** Residuals

This helps us visually check how well the linear regression model fits the data.
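
A residual plot of this kind can be sketched as follows. The data is synthetic and deliberately curved so that a straight-line fit leaves a visible pattern. The helper `plot_residuals` is a hypothetical name, and it assumes matplotlib is installed (as it is in Colab).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a curved relationship (assumption), fitted with a straight line.
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x ** 2 + rng.normal(scale=1.0, size=x.size)

A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted

def plot_residuals(fitted, residuals):
    """Scatter residuals against predicted values with a reference line at zero."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it
    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals, s=10, alpha=0.6)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Predicted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs. predicted values")
    return ax

# A numeric hint of the pattern: residuals track the squared distance from the mean of x.
curvature = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
print(f"mean residual = {residuals.mean():.2e}, curvature correlation = {curvature:.2f}")
```

Calling `plot_residuals(fitted, residuals)` in a notebook cell draws the plot. For this data the points trace a U shape around the zero line rather than a random cloud, which signals that the linear form is missing a non-linear term.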