Buyers spend a significant amount of time browsing e-commerce stores, and since the pandemic e-commerce has seen a boom in the number of users across domains. Meanwhile, store owners are looking to attract customers by using algorithms that leverage customer behavior patterns.

Tracking customer activity is also a great way to understand customer behavior and figure out what can be done to serve customers better. Machine learning and AI have already played a significant role in designing recommendation engines that attract customers by predicting their buying patterns.

In this competition, given visitors' session data, we challenge the Machinehack community to come up with a regression algorithm that predicts the time a buyer will spend on the platform.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

%matplotlib inline

In [None]:
train = pd.read_csv('../input/machinehack-buyers-time-prediction-challenge/ParticipantData_BTPC/Train.csv')
train['date'] = pd.to_datetime(train['date'])
test = pd.read_csv('../input/machinehack-buyers-time-prediction-challenge/ParticipantData_BTPC/Test.csv')
test['date'] = pd.to_datetime(test['date'])
sample_sub = pd.read_csv('../input/machinehack-buyers-time-prediction-challenge/ParticipantData_BTPC/Sample Submission.csv')

Column details:
- session_id - Unique identifier for every row
- session_number - Session type identifier
- client_agent - Client-side software details
- device_details -  Client-side device details
- date - Datestamp of the session
- purchased - Binary value for any purchase done
- added_in_cart - Binary value for cart activity
- checked_out -  Binary value for checking out successfully
- time_spent - Total time spent in seconds (Target Column)


Skills:
- Regression Modeling
- Advanced feature engineering with datestamp and text datatypes
- Optimizing RMSLE score as a metric to generalize well on unseen data
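
For reference, RMSLE (root mean squared logarithmic error) is the square root of the mean of the squared differences between log(1 + prediction) and log(1 + actual). The block below is a minimal sketch of that definition in plain Python, not the competition's official scorer. The `rmsle` helper and the sample values are made up for illustration. Because the metric is undefined for negative values, the sketch clips negative predictions to zero before scoring.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if any(v < 0 for v in y_true) or any(v < 0 for v in y_pred):
        raise ValueError("RMSLE is undefined for negative values")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Made-up session durations in seconds, for illustration only.
actual = [30.0, 120.0, 600.0]
predicted = [45.0, -5.0, 550.0]

# A regressor can output negative times; clip them to zero before scoring.
clipped = [max(p, 0.0) for p in predicted]
score = rmsle(actual, clipped)
print(round(score, 5))
```

Because the error is taken on a log scale, the metric penalizes relative rather than absolute differences, so a prediction that is off by a fixed number of seconds costs more on a short session than on a long one.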

In [None]:
train.info()

In [None]:
train.head()

In [None]:
test.info()

In [None]:
# test.head()

### Feature Engineering

In [None]:
train['month'] = train.date.dt.month
test['month'] =  test.date.dt.month
train['day'] = train.date.dt.day
test['day'] = test.date.dt.day
train['year'] = train.date.dt.year
test['year'] = test.date.dt.year
# def week_day(date):
    # return date.weekday()
train['week_day'] = train.date.dt.weekday
test['week_day'] = test.date.dt.weekday

# Saturday (5) and Sunday (6) count as the weekend
train['weekend'] = train['week_day'].isin([5, 6]).astype(int)
test['weekend'] = test['week_day'].isin([5, 6]).astype(int)

In [None]:
train = pd.concat([train, pd.get_dummies(train['week_day'], prefix='week', dtype='int64')], axis=1)
test = pd.concat([test, pd.get_dummies(test['week_day'], prefix='week',  dtype='int64')], axis=1)

In [None]:
train.info()

In [None]:
train['device_details'].value_counts()

In [None]:
phone = ['iPhone - iOS', 'iPhone - Web', 'Android Phone - Android', 'iPad - Web', 'iPhone - MobileWeb', 'Android Tablet - Web', 'Unknown - MobileWeb', 'Android Phone - Web', 'iPad - iOS', 'Android Phone - MobileWeb', 'Android Tablet - Android', 'Android Tablet - MobileWeb']
desktop = ['Desktop - Chrome', 'Desktop - Safari', 'Desktop - IE', 'Desktop - Firefox']
# label = {'phone': 0, 'desktop': 1}

def clean_details(data):
    # 'Other - Other' rows are resolved from the user agent: Android means phone
    is_other = data['device_details'] == 'Other - Other'
    android_agent = data['client_agent'].astype(str).str.contains('Android', regex=False)
    # Every other row is a phone only if its label is in the phone list
    is_phone = (is_other & android_agent) | data['device_details'].isin(phone)
    data['device_details'] = np.where(is_phone, 'phone', 'desktop')
    data = pd.concat([data, pd.get_dummies(data['device_details'], prefix='device', dtype='int64')], axis=1)
    data.drop(columns=['device_details'], inplace=True)
    return data

In [None]:
train = clean_details(train)
test = clean_details(test)

In [None]:
# train['device_details'].value_counts(dropna=False)

In [None]:
def make_column(data):
    # A session is flagged as 'buy' when it reached the cart or the checkout
    data['buy'] = ((data['added_in_cart'] == 1) | (data['checked_out'] == 1)).astype(int)
    return data
train = make_column(train)
test = make_column(test)

In [None]:
# Time spent corresponding to date
plt.title('Time spent corresponding to date')
plt.scatter(train['date'], train['time_spent'])
plt.xlabel('date')
plt.ylabel('time_spent (seconds)')
plt.show()

In [None]:
# Time spent corresponding to the week days
weeks = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
week_day =  train['week_day'].apply(lambda x: weeks[x])
plt.title('Time spent corresponding to week days')
plt.scatter(train['time_spent'], week_day)
plt.show()

In [None]:
# Removing outliers: drop sessions longer than 14000 seconds
train = train.drop(index=train.index[train['time_spent'] > 14000])

In [None]:
plt.figure(figsize=(15, 8))
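# numeric_only=True skips text columns, which raise an error in recent pandas versions
sns.heatmap(train.corr(numeric_only=True), annot=True, fmt='.2f', linewidths=0.5)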
plt.show()

In [None]:
features = ['purchased', 'added_in_cart', 'checked_out']
X = train[features]
y = train['time_spent']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)

### Model

In [None]:
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
import xgboost as xg

In [None]:
xgb_r = xg.XGBRegressor(objective='reg:squarederror', n_estimators=1, random_state=123).fit(X_train, y_train)
y_pred = xgb_r.predict(X_test)
# RMSLE is undefined for negative values, so clip negative predictions to 0
y_pred = np.clip(y_pred, 0, None)

In [None]:
print(np.round(np.sqrt(mean_squared_log_error(y_test, y_pred)), 5))

In [None]:
test.head()

In [None]:
pred = xgb_r.predict(test[features])
# Clip negative predictions to 0, as done for the validation predictions
pred = np.clip(pred, 0, None)

In [None]:
sample_sub['time_spent'] = pd.Series(pred)
# sample_sub.to_csv('sample_sub.csv', index=False)