# Porter Neural Network Regression

**Porter** is India's largest marketplace for intra-city logistics. A leader in the country's $40 billion intra-city logistics market, Porter strives to improve the lives of 1,50,000+ driver-partners by providing them with consistent earnings and independence. To date, the company has served 5+ million customers.

Porter works with a wide range of restaurants to deliver their items directly to customers.

Porter has a number of delivery partners available to deliver food from various restaurants. It wants an estimated delivery time that it can show customers, based on what they are ordering, where they are ordering from, and the delivery partners available.

This dataset contains the data required to train a regression model that estimates delivery time from those features.

**Data Dictionary**

Each row in this file corresponds to one unique delivery. Each column corresponds to a feature as explained below.
1. market_id : integer id for the market where the restaurant lies
2. created_at : the timestamp at which the order was placed
3. actual_delivery_time : the timestamp when the order was delivered
4. store_primary_category : category for the restaurant
5. order_protocol : integer code for the order protocol (how the order was placed, i.e. through Porter, call to restaurant, pre-booked, third party, etc.)
6. total_items : total number of items in the order
7. subtotal : final price of the order
8. num_distinct_items : the number of distinct items in the order
9. min_item_price : price of the cheapest item in the order
10. max_item_price : price of the costliest item in the order
11. total_onshift_partners : number of delivery partners on duty at the time the order was placed
12. total_busy_partners : number of delivery partners attending to other tasks
13. total_outstanding_orders : total number of orders to be fulfilled at the moment

## Module 1 (Building a Data Scientist Mindset)

#### Before We Touch Any Code: Ask Questions First
This is the habit that separates seniors from **juniors**. A junior opens the dataset and starts coding. A **senior interrogates** the problem first.

#### The 5 Questions You Always Ask First
1. **What are we predicting and why does it matter?** We're predicting delivery time for Porter (India's logistics platform). This matters because:
- Underestimating = angry customers.
- Overestimating = losing business to competitors.
- Either way, every minute of error has a real cost.
2. **Who consumes this prediction? Is it a dashboard for ops teams? A real-time API call when a customer places an order?**
- This changes how fast our model needs to be and how accurate it needs to be.
3. **What does "good" or "success" look like?**
- In regression, classification-style accuracy means nothing because the target is continuous. We need to choose an error metric upfront.
4. **What data do we actually have, and does it make logical sense?**
- Do the columns make intuitive sense for predicting delivery time? Are there obvious things missing that should be there?
5. **What could go wrong or be misleading? Data leakage? Time-based issues? Biased collection? And so on.**
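
To make question 3 concrete, here is a minimal sketch of three common regression error measures: mean absolute error, root mean squared error, and mean signed error (bias). The helper name `regression_errors` and the sample numbers are assumptions invented for illustration; they are not taken from the Porter data, and the notebook has not yet committed to any particular metric.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, RMSE and mean signed error for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true  # positive = overestimate, negative = underestimate
    return {
        "mae": float(np.mean(np.abs(residuals))),         # average miss in minutes
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),  # penalises large misses more
        "mean_bias": float(np.mean(residuals)),           # systematic over/under-estimation
    }

# Hypothetical delivery times in minutes (made up for this example)
actual = [30.0, 45.0, 60.0, 50.0]
predicted = [35.0, 40.0, 50.0, 50.0]

metrics = regression_errors(actual, predicted)
print(metrics)
```

MAE reads directly as "minutes off on average", RMSE weights the occasional very late delivery more heavily, and the sign of the mean bias hints at whether the model leans toward underestimating (angry customers) or overestimating (lost business).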

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# load data
data_path = "data/porter_dataset.csv"
original_df = pd.read_csv(data_path)

# create a copy of the original so the raw data stays untouched
df = original_df.copy(deep=True)
df.head()  # top 5 rows

| | market_id | created_at | actual_delivery_time | store_id | store_primary_category | order_protocol | total_items | subtotal | num_distinct_items | min_item_price | max_item_price | total_onshift_partners | total_busy_partners | total_outstanding_orders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2015-02-06 22:24:17 | 2015-02-06 23:27:16 | df263d996281d984952c07998dc54358 | american | 1.0 | 4 | 3441 | 4 | 557 | 1239 | 33.0 | 14.0 | 21.0 |
| 1 | 2.0 | 2015-02-10 21:49:25 | 2015-02-10 22:56:29 | f0ade77b43923b38237db569b016ba25 | mexican | 2.0 | 1 | 1900 | 1 | 1400 | 1400 | 1.0 | 2.0 | 2.0 |
| 2 | 3.0 | 2015-01-22 20:39:28 | 2015-01-22 21:09:09 | f0ade77b43923b38237db569b016ba25 | NaN | 1.0 | 1 | 1900 | 1 | 1900 | 1900 | 1.0 | 0.0 | 0.0 |
| 3 | 3.0 | 2015-02-03 21:21:45 | 2015-02-03 22:13:00 | f0ade77b43923b38237db569b016ba25 | NaN | 1.0 | 6 | 6900 | 5 | 600 | 1800 | 1.0 | 1.0 | 2.0 |
| 4 | 3.0 | 2015-02-15 02:40:36 | 2015-02-15 03:20:26 | f0ade77b43923b38237db569b016ba25 | NaN | 1.0 | 3 | 3900 | 3 | 1100 | 1600 | 6.0 | 6.0 | 9.0 |


In [7]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
print("---"*10)
print("Dtypes of all the features")
print(df.dtypes)
print("---"*10)
print("Columns list")
print(df.columns.tolist())

There are 197428 rows and 14 columns.
------------------------------
Dtypes of all the features
market_id                   float64
created_at                   object
actual_delivery_time         object
store_id                     object
store_primary_category       object
order_protocol              float64
total_items                   int64
subtotal                      int64
num_distinct_items            int64
min_item_price                int64
max_item_price                int64
total_onshift_partners      float64
total_busy_partners         float64
total_outstanding_orders    float64
dtype: object
------------------------------
Columns list
['market_id', 'created_at', 'actual_delivery_time', 'store_id', 'store_primary_category', 'order_protocol', 'total_items', 'subtotal', 'num_distinct_items', 'min_item_price', 'max_item_price', 'total_onshift_partners', 'total_busy_partners', 'total_outstanding_orders']


#### Immediate Observations
1. We have no target column.
- There is no delivery-time-in-minutes column. Instead we have `created_at` and `actual_delivery_time`, both stored as strings (`object` dtype).
- So we have to create the target variable ourselves: `actual_delivery_time` - `created_at`.
2. Five float columns that should probably be integers.
`market_id`, `order_protocol`, `total_onshift_partners`, `total_busy_partners`, `total_outstanding_orders` are float64. Why? Because they have NaN values hiding in them. Pandas' default integer dtype (`int64`) can't store NaN, so it upcasts the column to float. This is a red flag for missing data.
3. `store_id` is a hash string. It's a categorical identifier with potentially hundreds or even thousands of unique values. We need to check its cardinality before deciding how to handle it.
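
The upcasting described in observation 2 can be reproduced on a toy column. This is a small sketch with values invented for illustration (not taken from the Porter data); it also shows pandas' nullable `Int64` extension dtype as one possible way to keep integer semantics alongside missing values, which is not necessarily the approach this notebook goes on to use.

```python
import numpy as np
import pandas as pd

# Toy columns, made up for this example
complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, np.nan])

print(complete.dtype)  # integer dtype: no missing values to represent
print(with_gap.dtype)  # float64: the NaN forces an upcast

# The nullable extension dtype stores integers plus a missing marker (<NA>)
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())  # the missing entry is still counted as missing
```

Seeing float64 on a column that is conceptually an ID or a count is therefore a cheap early signal to go and count the missing values.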

In [8]:
# Check missing values
print("Missing values:")
print(df.isnull().sum())
print("---"*10)

# Engineer the target variable
df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])
df['delivery_time_mins'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds() / 60

print("Target variable stats:")
print(df['delivery_time_mins'].describe())
print("---"*10)

# Check store_id cardinality
print(f"Unique stores: {df['store_id'].nunique()}")
print(f"Unique markets: {df['market_id'].nunique()}")
print(f"Unique categories: {df['store_primary_category'].nunique()}")
print(f"Unique order protocols: {df['order_protocol'].nunique()}")

Missing values:
market_id                     987
created_at                      0
actual_delivery_time            7
store_id                        0
store_primary_category       4760
order_protocol                995
total_items                     0
subtotal                        0
num_distinct_items              0
min_item_price                  0
max_item_price                  0
total_onshift_partners      16262
total_busy_partners         16262
total_outstanding_orders    16262
dtype: int64
------------------------------
Target variable stats:
count    197421.000000
mean         48.470956
std         320.493482
min           1.683333
25%          35.066667
50%          44.333333
75%          56.350000
max      141947.650000
Name: delivery_time_mins, dtype: float64
------------------------------
Unique stores: 6743
Unique markets: 6
Unique categories: 74
Unique order protocols: 7
