# Overview
Missing packages have plagued the postal and shipping industry for as long as anyone can remember. The problem is especially costly today, in terms of both money and reputation, because e-commerce companies regularly ship valuable goods to their paying customers.

In this notebook, I showcase how logistic regression can cut costs for an e-commerce company by building a model that identifies which parcels are worth insuring.

In this hypothetical example, we managed to save **$457253.78** in insurance costs for 22,581 parcels.

### Note:
1. Insurance rates have been defined as **insurance_cost = max($20, 10% of the cost of the order)** for this example.

2. We are assuming that a random 5% of all parcels get stolen. Hence, we **calibrate the model** to account for this imbalance.

3. Total revenue generated from the 22,581 parcels stands at **$6531261.43**.

**By : [Naimish Mani B](https://www.linkedin.com/in/naimish-balaji-a6182b180/)**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Introduction

There are 3 cases in which a customer can claim not to have received a package they paid for:

1. Package was stolen before the customer could receive it.
2. Package was lost in transit.
3. The customer is lying to get a refund.

Regardless of the case, the company usually offers the customer a refund, which ultimately amounts to lost revenue. To minimise these losses, the e-commerce company can opt to insure the packages it ships out.

But insurance itself costs money. For young startups that are struggling to maintain good cash flow, insuring all packages might not be feasible, so they might want to insure only selected packages.

---

# The Dataset

First, let's load the dataset into memory and list its columns. Next, we inspect it by calling `df.head()`.

In [None]:
df = pd.read_csv('/kaggle/input/ecommerce-purchase-history-from-jewelry-store/jewelry.csv')
print(df.columns)
df.head()

In [None]:
# Calculate total revenue
df['price'].sum()

For the task at hand, we'll only be requiring the following columns:
   - event_time
   - order_id
   - user_id
   - price

Hence, we drop the other columns directly.

In [None]:
dataset = df.drop(columns=['product_id', 'quantity', 'category_id', 'category_code', 'brand', 'gender', 'color', 'metal', 'gem'])
dataset.head()

In [None]:
print("Number of elements: ", len(dataset))
print("Number of unique orders: ", len(dataset['order_id'].unique()))

As we can see, the number of unique orders is smaller than the total number of rows in the dataset. This is because some orders contain multiple products.

**For simplicity's sake, we will assume that each order gets shipped in a separate box, and hence each box has to be insured separately.**

In [None]:
dataset.describe()

Next, we convert all timestamps from their current format to Unix time (in milliseconds), to make comparisons easier down the line.

In [None]:
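# Drop the last 4 characters of each timestamp string before parsing
dataset['event_time'] = dataset['event_time'].str.slice(0, -4)
# Convert to whole seconds since the Unix epoch, then scale to milliseconds
dataset['event_time'] = (pd.DatetimeIndex(dataset['event_time']).astype(np.int64) // 10**9) * 1000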
dataset.head()
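
The conversion above can also be written without relying on the integer representation of the parsed timestamps. The following is a minimal sketch under assumptions: the sample strings are hypothetical and only assumed to resemble the `event_time` column (a timestamp followed by a 4-character ` UTC` suffix), and the names `raw`, `parsed`, `epoch` and `unix_ms` are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample values; the real column is only assumed to look like this
raw = pd.Series(["1970-01-02 00:00:00 UTC", "2000-01-01 00:00:00 UTC"])

# Strip the 4-character suffix (as in the cell above) and parse as UTC timestamps
parsed = pd.to_datetime(raw.str.slice(0, -4), utc=True)

# Whole milliseconds elapsed since the Unix epoch
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_ms = (parsed - epoch) // pd.Timedelta(milliseconds=1)

print(unix_ms.tolist())  # [86400000, 946684800000]
```

Dividing a time difference by a one-millisecond `Timedelta` states the unit explicitly, so the result does not depend on how pandas stores the timestamps internally.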

Now, we calculate the number of previous times the customer has made purchases on the site. We will be using this value as an input to the model later.

In [None]:
def f(x, y):
    # Rows by the same user (y) with an event_time strictly earlier than x
    foo = dataset[(dataset['user_id'] == y) & (dataset['event_time'] < x)]
    return len(foo)
    
prev_orders = [f(x, y) for x, y in zip(dataset['event_time'], dataset['user_id'])]
prev_orders[-100:]
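
The list comprehension above filters the whole dataset once per row, which gets slow as the dataset grows. As a possible alternative, here is a sketch that computes the same count with a grouped rank. The frame `toy` and the helper `count_prev_orders` are hypothetical names used only for illustration, and the toy values are made up.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `dataset`
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 1, 2],
    "event_time": [10, 20, 5, 20, 7],
})

def count_prev_orders(frame):
    """Number of rows by the same user with a strictly earlier event_time."""
    # rank(method="min") is 1 + the number of strictly smaller values in the group
    ranks = frame.groupby("user_id")["event_time"].rank(method="min")
    # Rows with a missing user_id or event_time come back as NaN; count them as 0
    return (ranks - 1).fillna(0).astype(int)

toy["prev_orders"] = count_prev_orders(toy)
print(toy["prev_orders"].tolist())  # [0, 1, 0, 1, 1]
```

Rows that share the same user and timestamp do not count each other, which matches the strict `<` comparison used in `f`.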

Next, we add this column to the dataset and get rid of the 'order_id' column, as we don't need it anymore.

In [None]:
# Add new column to dataset
dataset['prev_orders'] = prev_orders
# Delete old column from dataset
dataset = dataset.drop(columns=['order_id'])
dataset.head()

Unfortunately, the dataset does not record whether the customer received their order. Hence, we generate fictional data for this using the following assumptions:

- Each package has a 5% chance of getting stolen.
- This probability is random and does not depend on any other parameter.

These assumptions are implemented using the `random` module in Python.

In [None]:
from random import random
l = [0] * len(dataset['price'])
for i in range(len(l)):
    if random() < 0.05:
        l[i] = 1

dataset['lost'] = l
print(dataset['lost'].value_counts())
dataset.head()
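
Because the cell above draws from an unseeded generator, every run produces different labels and therefore different cost figures. If reproducibility matters, one option is to use a seeded generator. This is only a sketch: the helper `simulate_lost` and the seed value are assumptions, not part of the original notebook.

```python
from random import Random

def simulate_lost(n_parcels, p_lost=0.05, seed=0):
    """Return a list of 0/1 labels where each parcel is lost with probability p_lost."""
    rng = Random(seed)  # private, seeded generator so reruns give identical labels
    return [1 if rng.random() < p_lost else 0 for _ in range(n_parcels)]

labels = simulate_lost(100_000)
# Fraction of parcels marked as lost; should be close to 0.05
print(sum(labels) / len(labels))
```

In the notebook this would replace the loop that fills `l`, for example `dataset['lost'] = simulate_lost(len(dataset))`.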

Next, we need to calculate the insurance cost for each order, according to the following definition:

**insurance_cost = max($20, 10% of the cost of the order)**

This is implemented below.

In [None]:
def g(x):
    return max(20, x*0.1)
    
insurance = [g(x) for x in dataset['price']]
insurance[-10:]
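
The same rule can be applied to the whole column at once instead of element by element. The sketch below assumes prices are held in a pandas Series; `MIN_PREMIUM`, `RATE` and `insurance_cost` are hypothetical names, and the sample prices are made up.

```python
import pandas as pd

MIN_PREMIUM = 20.0  # minimum premium in dollars, from the rule above
RATE = 0.10         # premium as a fraction of the order price

def insurance_cost(prices):
    """Vectorised max($20, 10% of price) for a Series of prices."""
    return (prices * RATE).clip(lower=MIN_PREMIUM)

prices = pd.Series([50.0, 200.0, 1000.0])
print(insurance_cost(prices).tolist())
```

Orders priced below $200 all pay the $20 minimum, and above that the premium grows in proportion to the price.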

In [None]:
dataset['insurance'] = insurance
dataset = dataset.drop(columns=['event_time', 'user_id'])
dataset.head()

Now that we have the whole dataset ready, we split it up into train and test sets using sklearn.

In [None]:
from sklearn.model_selection import train_test_split

dataset = dataset.dropna()
train, test = train_test_split(dataset, test_size=0.2)
print(train.head())
print(len(train))
print(test.head())
print(len(test))

In [None]:
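# Column order at this point: price, prev_orders, lost, insurance
# Features: price and prev_orders
x_train = train.iloc[:,[0, 1]]
x_test = test.iloc[:,[0, 1]]
# Target: lost
y_train = train.iloc[:, 2]
y_test = test.iloc[:, 2]
# Insurance cost per parcel
i_train = train.iloc[:, 3]
i_test = test.iloc[:, 3]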

In [None]:
x_train.isnull().sum()

In [None]:
y_train.value_counts()

As we can see, the dataset is heavily imbalanced. There is therefore a real risk that the model does not "learn" anything and simply always guesses "0" (do not insure), which is of no use to us. To counter this, we weight the classes in the logistic regression (`class_weight='balanced'`) and calibrate the predicted probabilities with scikit-learn's cross-validated calibrated classifier, `CalibratedClassifierCV`, as implemented below.

For the model, we are using **logistic regression**, with the **ROC AUC score** as our evaluation metric.
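
To get a feel for what `class_weight='balanced'` does in the model cell, the sketch below reproduces the weighting formula given in the scikit-learn documentation, `n_samples / (n_classes * count_of_class)`, in plain Python. The helper `balanced_class_weights` and the 95/5 toy split are hypothetical and only mirror the assumed 5% loss rate.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels: 95 delivered parcels (0) and 5 lost parcels (1)
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights both classes contribute the same total weight to the loss, so the rare "lost" class is not drowned out by the majority class.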

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.calibration import CalibratedClassifierCV


clf = LogisticRegression(class_weight='balanced')
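# Pass the estimator positionally: the keyword was renamed from `base_estimator` to `estimator` in scikit-learn 1.2
calibrated_clf = CalibratedClassifierCV(clf, cv=3, method='isotonic')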
calibrated_clf.fit(x_train, y_train.values.ravel())
y_pred = calibrated_clf.predict_proba(x_test)[:, 1]
roc_auc_score(y_test, y_pred)

Although the AUC value looks bad, this is exactly what is to be expected: the labels were generated at random, independently of the features, so no model can do much better than chance (an AUC of about 0.5).
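
This can be checked with a small simulation. The sketch below is pure Python and uses the rank-sum (Mann-Whitney) formula for AUC as a stand-in for `roc_auc_score`. The helper `rank_auc`, the seed and the sample size are assumptions made for illustration.

```python
from random import Random

def rank_auc(labels, scores):
    """ROC AUC via the rank-sum formula; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the 1-based ranks of the positive examples
    pos_rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = Random(0)
n = 20_000
# Labels: 5% positives, drawn independently of everything else
labels = [1 if rng.random() < 0.05 else 0 for _ in range(n)]
# Scores: random numbers that carry no information about the labels
scores = [rng.random() for _ in range(n)]

auc = rank_auc(labels, scores)
print(round(auc, 3))
```

Whatever scores a model produces, they cannot be systematically related to labels that were drawn independently of the features, so the AUC hovers around 0.5.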

In [None]:
y_test_pred = pd.DataFrame(y_pred, columns=['prediction'])
y_test_pred.head()

In [None]:
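# Work on a copy so that adding columns does not modify (or warn about) x_test
x_final = x_test.copy()
# Flatten the single-column prediction frame into a plain list
pred = [x[0] for x in y_test_pred.values.tolist()]
x_final['prediction'] = pred
x_final[-10:]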

In [None]:
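# Expected loss per parcel: price times the predicted probability of loss
E_x = x_final['price'] * x_final['prediction']
# Insurance cost per parcel, using g() defined earlier
ins = [g(x) for x in x_final['price']]
E_x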

In [None]:
x_final['E_x'] = E_x
x_final['ins'] = ins
x_final['lost'] = y_test
x_final

In [None]:
x_final['price'].sum()

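Now, we calculate the total cost under three different insurance strategies. In each case, the cost is the insurance paid for the insured parcels plus the price of every uninsured parcel that was lost.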

In [None]:
# Strategy 1: insure every parcel
print("Total cost (insuring everything): ", x_final['ins'].sum())
# Strategy 2: insure only parcels whose price exceeds the insurance cost
by_price = x_final['price'] > x_final['ins']
print("Total cost (insuring everything where insurance cost is less than actual cost): ",
      x_final['ins'][by_price].sum()
      + x_final['price'][~by_price & (x_final['lost'] == 1)].sum())
# Strategy 3: insure only parcels whose expected loss exceeds the insurance cost
by_model = x_final['E_x'] > x_final['ins']
print("Total cost (insuring according to model): ",
      x_final['ins'][by_model].sum()
      + x_final['price'][~by_model & (x_final['lost'] == 1)].sum())
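
The model-based strategy above boils down to a per-parcel rule: insure when the expected loss (price times predicted probability of loss) exceeds the insurance cost. The sketch below spells that rule out for single parcels. The helper names `insurance_cost` and `should_insure` and the example numbers are hypothetical.

```python
def insurance_cost(price):
    """Insurance rule used in this notebook: max($20, 10% of the price)."""
    return max(20, price * 0.1)

def should_insure(price, p_lost):
    """Insure only if the expected loss exceeds the insurance cost."""
    expected_loss = price * p_lost
    return expected_loss > insurance_cost(price)

print(should_insure(1000, 0.05))  # False: expected loss 50 vs insurance 100
print(should_insure(1000, 0.20))  # True: expected loss 200 vs insurance 100
print(should_insure(100, 0.15))   # False: expected loss 15 vs insurance 20
print(should_insure(100, 0.25))   # True: expected loss 25 vs insurance 20
```

For orders of $200 or more the insurance cost is 10% of the price, so the rule reduces to a predicted loss probability above 0.10. For cheaper orders the $20 minimum applies, and the required probability is 20 divided by the price.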

# Inference
From the above, we can see that using the model saves **$457253.78**, roughly 7% of the **$6531261.43** in revenue from these parcels.

# Concluding Remarks

There are a couple of things to note here. For starters, the loss labels were generated synthetically, so results like the one above may or may not carry over to real-world datasets. Also, the model could be improved by taking the following features into consideration:

- Zip code to where the parcel is being delivered
- Medium of shipment (air, freight, or cargo; each might have its own effect)
- Company used to facilitate the shipment