### Problem
Buyers spend a significant amount of time browsing e-commerce stores, and since the pandemic e-commerce has seen a boom in the number of users across domains. Meanwhile, store owners are looking to attract customers with algorithms that leverage customer behavior patterns.

Tracking customer activity is also a great way to understand customer behavior and figure out what can be done to serve customers better. Machine learning and AI have already played a significant role in recommendation engines that attract customers by predicting their buying patterns.

`In this competition, given visitors' session data, we challenge the MachineHack community to come up with a regression algorithm that predicts the time a buyer will spend on the platform.`

#### What is the metric in this competition?
The submission will be evaluated using the RMSLE metric. 

One can use `np.sqrt(mean_squared_log_error(actual, predicted))` to calculate it.
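
To make the metric concrete, here is a minimal sketch of RMSLE written directly with NumPy. The helper name `rmsle` and the sample numbers are made up for illustration; the formula itself is the usual one based on `log(1 + x)`, which is also what `mean_squared_log_error` uses.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error, computed on log(1 + x)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) = log(1 + x), so a zero-second session does not break the metric
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Hypothetical session lengths in seconds, for illustration only
actual = [10.0, 100.0, 1000.0]
predicted = [12.0, 90.0, 1500.0]
print(round(rmsle(actual, predicted), 4))
```

Because the error is measured on logs, the metric penalises relative rather than absolute mistakes: being 50 seconds off matters far more for a 100-second session than for a 5000-second one.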

## In this notebook
1. Basic EDA with comments, plus data cleaning with feature removal and feature extraction.
2. Four techniques, namely `linear regression`, `XGBoost`, `decision tree` and `random forest`, are used to predict the time spent, and the RMSLE on the train data is displayed for each.
3. To check your RMSLE on the test data, click [here](https://www.machinehack.com/hackathons/buyers_time_prediction_challenge/submission)

In [None]:
# importing the libraries

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xg
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score, mean_squared_error,mean_squared_log_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform, randint

# filter the warnings
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 100)

### Train data

In [None]:
# reading the train data and looking at the top 5 rows

df = pd.read_csv('/kaggle/input/buyers-time-prediction-challenge/ParticipantData_BTPC/Train.csv')
df.head()

In [None]:
# checking for null values in columns
df.isnull().sum()

In [None]:
# looking for big picture of the data
df.info()

> Note:
- Most of the independent variables are of type object and a few are binary. After cleaning and transformation, the object variables will also be converted into binary variables.
- date is not stored in the right data type: it is currently an object and should be changed to datetime.
- client_agent has 160 missing values.

---
#### session id
Unique identifier for every row

In [None]:
df.session_id.nunique()

---
#### session number
Session type identifier

In [None]:
df.session_number.nunique()

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.scatterplot(x=df.session_number, y=df.time_spent)
plt.title('session number vs time spent', size=14)

plt.subplot(1,2,2)
sns.scatterplot(x=df[(df.session_number<2000)&(df.time_spent<15000)]['session_number'], 
                y=df[(df.session_number<2000)&(df.time_spent<15000)]['time_spent'])
plt.title('zooming in first plot where session number < 2000 and time spent < 15000', size=14)

plt.show()

> Note:<br>
No clear relationship is visible between the two: the points are scattered without any pattern, so session number does not look like a significant variable for predicting the time spent. **So we will drop this variable.**

---
#### device details
Client-side software details

In [None]:
# looking for count
df.device_details.value_counts()

In [None]:
# looking the mean time spent from each device
df.groupby('device_details')['time_spent'].mean().round(2).sort_values(ascending=False)

In [None]:
# splitting the column into two (device and browser) and removing the white spaces
df[['device','browser']] = df['device_details'].str.split('-',expand=True)
df['device'] = df['device'].str.strip()
df['browser'] = df['browser'].str.strip()
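
The split above relies on every value containing exactly one hyphen. A minimal sketch of a slightly more defensive variant follows; the sample strings are hypothetical values in the assumed `device - browser` layout, not rows from the competition data.

```python
import pandas as pd

# Hypothetical values in the "device - browser" layout assumed above
sample = pd.DataFrame({'device_details': ['Desktop - Chrome',
                                          'iPhone - iOS',
                                          'Android Phone - Android']})

# n=1 splits on the first hyphen only, so the result always has two columns
parts = sample['device_details'].str.split('-', n=1, expand=True)
sample['device'] = parts[0].str.strip()
sample['browser'] = parts[1].str.strip()
print(sample[['device', 'browser']])
```

With `n=1` a value that happens to contain a second hyphen would keep it inside the browser part instead of producing a third column.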

#### device
Created from device_details column

In [None]:
# checking the mean time spent from each device
df.groupby('device')['time_spent'].mean().round(2).sort_values(ascending=False)

In [None]:
device_index = df.groupby('device')['time_spent'].mean().sort_values(ascending=False).index
device_value = df.groupby('device')['time_spent'].mean().sort_values(ascending=False).values

plt.figure(figsize=(10,5))
sns.barplot(y=device_index, x=device_value, color='pink')
plt.xlabel('Time spent (in seconds)', size=12)
plt.yticks(size=12)
plt.ylabel('')
plt.title('Mean time spent by users from their device',size=15)
plt.show()

> Note:<br>
People using a **desktop** or an **iPad** spend the most time on the website, followed by Android phone and iPhone users.

In [None]:
desktop = df[df.device=='Desktop'].groupby('browser')['time_spent'].mean().round(2).sort_values()
plt.figure(figsize=(10,5))
desktop.plot(kind='bar',color='green',alpha=0.4)
plt.ylabel('Time spent (in seconds)', size=12)
plt.xticks(size=12,rotation=45)
plt.xlabel('')
plt.title(f'Mean time spent from different desktop\n{desktop}',size=12)
plt.show()

> Note:<br>
On desktop, people use four types of browser, among which **Firefox** has the highest mean time spent. **So people who log in from a desktop using the Firefox browser are the ones who spend the most time.**

In [None]:
plt.figure(figsize=(15,5))

ipad = df[df.device=='iPad'].groupby('browser')['time_spent'].mean().round(2).sort_values()
iphone = df[df.device=='iPhone'].groupby('browser')['time_spent'].mean().round(2).sort_values()

plt.subplot(1,2,1)
ipad.plot(kind='bar',color='green',alpha=0.4)
plt.ylabel('Time spent (in seconds)', size=12)
plt.xticks(size=12,rotation=45)
plt.xlabel('')
plt.title(f'Mean time spent from different ipad\n{ipad}',size=12)

plt.subplot(1,2,2)
iphone.plot(kind='bar',color='green',alpha=0.4)
plt.ylabel('Time spent (in seconds)', size=12)
plt.xticks(size=12,rotation=45)
plt.xlabel('')
plt.title(f'Mean time spent from different iphone\n{iphone}',size=12)

plt.show()

> Note:
- On the iPad, people who use iOS, i.e. the website's app, spend the most time.
- On the iPhone, people using the app also spend the most time.
- Note, however, that iPad app users spend more time than iPhone app users. The difference is 190 seconds, i.e. more than 3 minutes.

In [None]:
plt.figure(figsize=(15,5))

android_phone = df[df.device=='Android Phone'].groupby('browser')['time_spent'].mean().round(2).sort_values()
android_tablet = df[df.device=='Android Tablet'].groupby('browser')['time_spent'].mean().round(2).sort_values()

plt.subplot(1,2,1)
android_phone.plot(kind='bar',color='green',alpha=0.4)
plt.ylabel('Time spent (in seconds)', size=12)
plt.xticks(size=12,rotation=45)
plt.xlabel('')
plt.title(f'Mean time spent from different android phone\n{android_phone}',size=12)

plt.subplot(1,2,2)
android_tablet.plot(kind='bar',color='green',alpha=0.4)
plt.ylabel('Time spent (in seconds)', size=12)
plt.xticks(size=12,rotation=45)
plt.xlabel('')
plt.title(f'Mean time spent from different android tablet\n{android_tablet}',size=12)

plt.show()

> Note:
- On Android phones, people who use Android, i.e. the website's app downloaded from the app store, spend the most time.
- On Android tablets, people using the app also spend the most time.
- MobileWeb is the second highest on phones and the lowest on tablets (graphically), but the time spent via MobileWeb on the two devices is not far apart: the difference is 33 seconds.

In [None]:
df.groupby(['device','purchased'])['time_spent'].mean().unstack().plot(kind='bar',figsize=(12,5),color=['grey','red'],alpha=0.7)
plt.xticks(rotation=45,size=12)
plt.xlabel('')
plt.ylabel('mean time spent (in seconds)',size=12)
plt.title('mean time spent by people who purchased using different devices', size=15)
plt.show()

> Note:<br>
This graph shows that people who purchased spent about three times as much time on the website as those who didn't purchase.

In [None]:
# merging the categories into 4 - phone, tablet, desktop and other

df['device'] = df['device'].replace(('Android Phone','Android Tablet','Unknown','iPad','iPhone'),
                                   ('Phone','Tablet','Other','Tablet','Phone'))

#### browser
Created from device_details column

In [None]:
# checking the mean time spent from different browser
df.groupby('browser')['time_spent'].mean().round(2).sort_values(ascending=False)

In [None]:
browser_index = df.groupby('browser')['time_spent'].mean().sort_values(ascending=False).index
browser_value = df.groupby('browser')['time_spent'].mean().sort_values(ascending=False).values

plt.figure(figsize=(10,5))
sns.barplot(x=browser_index, y=browser_value, color='pink')
plt.ylabel('Time spent (in seconds)', size=12)
plt.xlabel('')
plt.xticks(size=12,rotation=45)
plt.title('Time spent by users on the website coming from different browser',size=15)
plt.show()

In [None]:
df.groupby(['browser','purchased'])['time_spent'].mean().unstack().plot(kind='bar',figsize=(12,5),color=['grey','red'],alpha=0.7)
plt.xticks(rotation=45,size=12)
plt.xlabel('')
plt.title('mean time spent by people who purchased using different browsers', size=15)
plt.show()

In [None]:
df['browser'] = df['browser'].replace(('Android','Chrome','Firefox','IE','MobileWeb','Other','Safari','Web','iOS'),
                                     ('App','Web','Web','Web','Web','Other','Web','Web','App'))

In [None]:
df.shape

---
#### date
Datestamp of the session

In [None]:
# converting to datetime
df['date'] = pd.to_datetime(df['date'])

In [None]:
print('Minimum date in the data:',min(df['date']))
print('Maximum date in the data:',max(df['date']))

In [None]:
# extracting the month, day, weekday and week from the date
df['month'] = df['date'].dt.month_name()
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['week'] = (df.day - 1) // 7 + 1
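
As a quick check of the week-of-month formula, here is a small worked example. The dates are arbitrary and chosen only to hit the boundaries between weeks.

```python
import pandas as pd

# Arbitrary dates picked to sit on the week boundaries
dates = pd.to_datetime(pd.Series(['2020-03-01', '2020-03-07', '2020-03-08',
                                  '2020-03-29', '2020-03-31']))

# Days 1-7 map to week 1, days 8-14 to week 2, and so on
week_of_month = (dates.dt.day - 1) // 7 + 1
print(week_of_month.tolist())
```

Note that week 5 only covers days 29 to 31, so any per-week average for week 5 rests on fewer sessions than the other weeks.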

#### month

In [None]:
# mean time spent in each month
df.groupby('month')['time_spent'].mean().round(2)

In [None]:
month_index = df.groupby('month')['time_spent'].mean().sort_values(ascending=False).index
month_value = df.groupby('month')['time_spent'].mean().sort_values(ascending=False).values

plt.figure(figsize=(12,5))
sns.barplot(x=month_index, y=month_value, color='pink')
plt.ylabel('Time spent (in seconds)', size=12)
plt.xlabel('')
plt.xticks(size=11,rotation=45)
plt.title('Mean time spent on website each month',size=15)
plt.show()

In [None]:
# created new columns based upon time spent
df['is_september'] = df['month'].apply(lambda x: 1 if x=='September' else 0)
df['is_apr_may_july'] = df['month'].apply(lambda x: 1 if (x=='April' or x=='May' or x=='July') else 0)

#### weekday

In [None]:
# mean time spent on each day of week
df.groupby('weekday')['time_spent'].mean().round(2)

In [None]:
week_index = df.groupby('weekday')['time_spent'].mean().sort_values(ascending=False).index
week_value = df.groupby('weekday')['time_spent'].mean().sort_values(ascending=False).values

plt.figure(figsize=(10,5))
sns.barplot(x=week_index, y=week_value, color='pink')
plt.ylabel('Time spent (in seconds)', size=12)
plt.xlabel('')
plt.ylim(500,800)
plt.title('Mean time spent on website on each day of week',size=15)
plt.show()

In [None]:
# created new column
df['is_friday'] = df['weekday'].apply(lambda x: 1 if x=='Friday' else 0)
df['is_mon_tue_sat'] = df['weekday'].apply(lambda x: 1 if (x=='Monday' or x=='Tuesday' or x=='Saturday') else 0)
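
The same flags can be built without `apply`. Below is a minimal sketch using vectorised comparisons; the small `weekday` series is made up for illustration.

```python
import pandas as pd

weekday = pd.Series(['Friday', 'Monday', 'Sunday', 'Saturday'])  # illustrative values

# A boolean comparison cast to int gives the same 0/1 flag as the lambda
is_friday = (weekday == 'Friday').astype(int)

# isin replaces the chain of "or" conditions
is_mon_tue_sat = weekday.isin(['Monday', 'Tuesday', 'Saturday']).astype(int)

print(is_friday.tolist(), is_mon_tue_sat.tolist())
```

This is only a stylistic alternative: the resulting columns are identical, but the intent is easier to read when the list of days grows.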

#### day

In [None]:
day_index = df.groupby('day')['time_spent'].mean().index
day_value = df.groupby('day')['time_spent'].mean().values

plt.figure(figsize=(15,5))
plt.plot(day_index, day_value, color='purple')
plt.ylabel('Time spent (in seconds)', size=12)
plt.xlabel('Day of month', size=12)
plt.xlim(1,31)
plt.hlines(y=round(df.time_spent.mean(),2), xmin=1,xmax=31,linestyles='dashed',label='mean time')
plt.title('Mean time spent on website on each day throughout the year',size=15)
plt.legend(loc='center',fontsize=12)
plt.show()

> Note:<br>
From the above line plot, the time spent on the website from the 5th to the 18th is above the mean time. Several explanations are possible, such as offers in that period, the arrival of new stock on the website, or higher purchasing power in that period.

In [None]:
# purchased is binary, so summing it counts the sessions that ended in a purchase
# (count() would return the number of sessions instead)
dayp = df.groupby('day')['purchased'].sum()

plt.figure(figsize=(15,5))
sns.lineplot(x=dayp.index, y=dayp.values, color='purple')
plt.ylabel('Number of purchases', size=12)
plt.xlabel('Day of month', size=12)
plt.xlim(1,31)
# marking the 5th and the 18th, the period highlighted in the previous plot
plt.axvline(5, linestyle='dashed')
plt.axvline(18, linestyle='dashed')
plt.title('Number of purchases on website on each day throughout the year',size=15)
plt.show()

#### week of months

In [None]:
week_index = df.groupby('week')['time_spent'].mean().index
week_value = df.groupby('week')['time_spent'].mean().values

plt.figure(figsize=(15,5))
sns.lineplot(x=week_index, y=week_value, color='purple')
plt.ylabel('Time spent (in seconds)', size=12)
plt.xlabel('Week number', size=12)
plt.xlim(1,5)
plt.hlines(y=round(df.time_spent.mean(),2),xmin=1,xmax=5,linestyles='dashed',label='mean time')
plt.title('Mean time spent on website on each week of month',size=15)
plt.legend(loc='center',fontsize=12)
plt.show()

---
#### purchased
Binary value for any purchase done

In [None]:
df.purchased.value_counts(normalize=True).round(2)

In [None]:
df.groupby('purchased')['time_spent'].mean().round(2)

---
#### added in cart
Binary value for cart activity

In [None]:
# proportion of rows
df.added_in_cart.value_counts(normalize=True).round(2)

In [None]:
# mean time spent
df.groupby('added_in_cart')['time_spent'].mean().round(2)

---
#### checked out
Binary value for checking out successfully

In [None]:
# proportion of rows
df.checked_out.value_counts(normalize=True).round(3)

In [None]:
# mean time spent
df.groupby('checked_out')['time_spent'].mean().round(2)

---
#### time spent
Total time spent in seconds (Target Column)

In [None]:
df.time_spent.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
plt.figure(figsize=(10,5))

# histplot with kde=True replaces the deprecated distplot
sns.histplot(df.time_spent, kde=True)
plt.xlabel('Time spent(in seconds)', size=12)
plt.title('Distribution of time spent on website', size=15)

plt.show()

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(np.log(df.time_spent))
plt.title('Log distribution of time spent', size=15)

plt.subplot(1,2,2)
sns.boxplot(np.sqrt(df.time_spent))
plt.title('Square root distribution of time spent', size=15)

plt.show()

In [None]:
# replacing time_spent with its natural log
df['time_spent'] = np.log(df['time_spent'])
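
The notebook uses `np.log` here, which works as long as no session has a length of zero. As a hedged alternative (not what the cell above does), `np.log1p` stays finite at zero and matches the `log(1 + x)` used inside RMSLE, so the RMSE measured on the transformed target equals the RMSLE on the original scale. The numbers below are hypothetical.

```python
import numpy as np

time_spent = np.array([0.0, 30.0, 600.0, 7200.0])   # hypothetical seconds
predicted = np.array([5.0, 25.0, 700.0, 6000.0])    # hypothetical predictions

log_target = np.log1p(time_spent)   # finite even for a zero-length session
restored = np.expm1(log_target)     # expm1 is the inverse of log1p

# RMSE in log1p space ...
rmse_log_space = np.sqrt(np.mean((np.log1p(predicted) - log_target) ** 2))
# ... is the same quantity as RMSLE computed from the raw seconds
rmsle_direct = np.sqrt(np.mean((np.log(1 + predicted) - np.log(1 + time_spent)) ** 2))

print(rmse_log_space, rmsle_direct)
```

If this variant were used, predictions would have to be mapped back with `np.expm1` instead of `np.exp`.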

---
#### client agent
Client-side software details

In [None]:
# created a function to find out the platform the user used

def platform(x):
    if x.find('Product')!=-1:
        return 'Product'
    elif x.find('Chrome')!=-1:
        return 'Chrome'
    elif x.find('Safari')!=-1:
        return 'Safari'
    elif x.find('Mozilla')!=-1:
        return 'Mozilla'
    else:
        return 'Other'

In [None]:
# created a function to find out the device used by the user

import re
def device_used(x):
    if x.lower().find('windows')!=-1:
        return 'windows'
    elif x.lower().find('iphone')!=-1:
        return 'iphone'
    elif x.lower().find('ipad')!=-1:
        return 'ipad'
    elif x.lower().find('android')!=-1:
        return 'android'
    elif x.lower().find('macintosh')!=-1:
        return 'macintosh'
    elif x.lower().find('linux')!=-1:
        return 'linux'
    elif len(re.findall("cfnetwork|cros", x.lower())) > 0:
        return 'apple_device'
    else:
        return 'unknown'

In [None]:
# applying the above two functions on client_agent column

df['platform'] = df['client_agent'].apply(lambda x: platform(str(x)))
df['server'] = df['client_agent'].apply(lambda x: device_used(str(x)))
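
To see what the two functions return, here is a compact, table-driven sketch of the same first-match-wins logic. The names `first_match`, `platform_of` and `device_of` and the user-agent strings are hypothetical and only serve the illustration.

```python
# Ordered (keyword, label) rules; the first match wins, as in the if/elif chains
PLATFORM_RULES = [('Product', 'Product'), ('Chrome', 'Chrome'),
                  ('Safari', 'Safari'), ('Mozilla', 'Mozilla')]
DEVICE_RULES = [('windows', 'windows'), ('iphone', 'iphone'), ('ipad', 'ipad'),
                ('android', 'android'), ('macintosh', 'macintosh'), ('linux', 'linux'),
                ('cfnetwork', 'apple_device'), ('cros', 'apple_device')]

def first_match(text, rules, default):
    """Return the label of the first keyword found in text, else the default."""
    for keyword, label in rules:
        if keyword in text:
            return label
    return default

def platform_of(agent):
    return first_match(str(agent), PLATFORM_RULES, 'Other')        # case-sensitive

def device_of(agent):
    return first_match(str(agent).lower(), DEVICE_RULES, 'unknown')  # case-insensitive

# Hypothetical user-agent strings
chrome_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/90.0 Safari/537.36')
iphone_ua = ('Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) '
             'AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/604.1')

for ua in (chrome_ua, iphone_ua, float('nan')):
    print(platform_of(ua), device_of(ua))
```

The last case shows what happens to the missing `client_agent` values: `str()` turns them into the text `nan`, which matches no keyword and therefore falls into `Other` and `unknown`.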

In [None]:
df.platform.value_counts()

In [None]:
df.server.value_counts()

In [None]:
df.head()

In [None]:
# dropping the unnecessary columns

df.drop(['session_number','client_agent','device_details','platform','server','date','day',
         'weekday','month','week','checked_out'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# making the dummy variables for device and browser
# dtype=int keeps the dummies numeric (pandas 2.x returns booleans by default)

devicepd = pd.get_dummies(df.device,drop_first=True,prefix='device',dtype=int)
browserpd = pd.get_dummies(df.browser,drop_first=True,prefix='browser',dtype=int)
df = pd.concat([df,devicepd,browserpd], axis=1)
df.drop(['device','browser'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.drop('session_id', axis=1, inplace=True)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df = df[['time_spent', 'purchased', 'added_in_cart', 'is_september',
       'is_apr_may_july', 'is_friday', 'is_mon_tue_sat',
       'device_Other', 'device_Phone', 'device_Tablet', 'browser_Other','browser_Web']]

In [None]:
# looking for the correlation matrix

plt.figure(figsize=(15,10))
sns.heatmap(df.iloc[:,1:].corr().round(2), annot=True, cmap="YlGnBu")
plt.show()

In [None]:
# device_Other and browser_Other are highly correlated
# dropping one of the two variables

df.drop('device_Other',axis=1,inplace=True)

In [None]:
df.shape

In [None]:
# splitting the data into train and test
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 444)

In [None]:
# using min-max scaler to transform time spent as all the other variables are between 0-1
df_train[['time_spent']] = scaler.fit_transform(df_train[['time_spent']])

df_train.head()

In [None]:
# splitting the x's and y
y_train = df_train.pop('time_spent')
X_train = df_train

In [None]:
# only transform the test time spent column

df_test[['time_spent']] = scaler.transform(df_test[['time_spent']])
df_test.head()

In [None]:
# splitting the x's and y
y_test = df_test.pop('time_spent')
X_test = df_test

### Test Data
- The data on which the actual prediction has to be made and whose result is to be submitted for final evaluation.
- `All the changes that have been made to the train data should also be made to the test data`.

In [None]:
df1 = pd.read_csv('/kaggle/input/buyers-time-prediction-challenge/ParticipantData_BTPC/Test.csv')
df1.head()

In [None]:
df1['platform'] = df1['client_agent'].apply(lambda x: platform(str(x)))
df1['server'] = df1['client_agent'].apply(lambda x: device_used(str(x)))

In [None]:
df1.head()

In [None]:
df1[['device','browser']] = df1['device_details'].str.split('-',expand=True)
df1['device'] = df1['device'].str.strip()
df1['browser'] = df1['browser'].str.strip()
df1['device'] = df1['device'].replace(('Android Phone','Android Tablet','Unknown','iPad','iPhone'),
                                   ('Phone','Tablet','Other','Tablet','Phone'))
df1['browser'] = df1['browser'].replace(('Android','Chrome','Firefox','IE','MobileWeb','Other','Safari','Web','iOS'),
                                     ('App','Web','Web','Web','Web','Other','Web','Web','App'))

In [None]:
df1['date'] = pd.to_datetime(df1['date'])
df1['month'] = df1['date'].dt.month_name()
df1['day'] = df1['date'].dt.day
df1['weekday'] = df1['date'].dt.day_name()
df1['week'] = (df1.day - 1) // 7 + 1

In [None]:
df1['is_september'] = df1['month'].apply(lambda x: 1 if x=='September' else 0)
df1['is_apr_may_july'] = df1['month'].apply(lambda x: 1 if (x=='April' or x=='May' or x=='July') else 0)
df1['is_friday'] = df1['weekday'].apply(lambda x: 1 if x=='Friday' else 0)
df1['is_mon_tue_sat'] = df1['weekday'].apply(lambda x: 1 if (x=='Monday' or x=='Tuesday' or x=='Saturday') else 0)

In [None]:
# dtype=int keeps the dummies numeric (pandas 2.x returns booleans by default)
devicepd1 = pd.get_dummies(df1.device,drop_first=True,prefix='device',dtype=int)
browserpd1 = pd.get_dummies(df1.browser,drop_first=True,prefix='browser',dtype=int)
df1 = pd.concat([df1,devicepd1,browserpd1], axis=1)
df1.drop(['device','browser'], axis=1, inplace=True)

In [None]:
df1.head()

In [None]:
df1.drop(['session_id','session_number','client_agent','device_details','checked_out','date','month','day','week',
          'weekday','platform','server'], axis=1,inplace=True)

In [None]:
df1.drop(['device_Other'],axis=1,inplace=True)

In [None]:
df1.head()

In [None]:
df1.columns

In [None]:
df.columns
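
Comparing the two column lists by eye works here, but dummy encoding the train and test data separately can silently produce different columns when a category is missing from one of them. A minimal sketch of one way to guard against that, using made-up frames: encode the test data without `drop_first` and then reindex it to the training columns.

```python
import pandas as pd

# Made-up data: the test frame has no 'Tablet' or 'Other' rows
train = pd.DataFrame({'device': ['Desktop', 'Phone', 'Tablet', 'Other']})
test = pd.DataFrame({'device': ['Desktop', 'Phone', 'Phone']})

train_d = pd.get_dummies(train['device'], prefix='device', drop_first=True, dtype=int)

# Encode every category present in the test data, then keep exactly the
# training columns; categories absent from the test data become all-zero columns
test_d = pd.get_dummies(test['device'], prefix='device', dtype=int)
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_aligned)
```

Reindexing also drops the baseline column that `drop_first` removed from the training data, so the test matrix ends up with the same columns in the same order.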

### I. Linear Regression

In [None]:
# Running the linear model 
lm1 = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# Looking for summary
print(lm1.summary())

In [None]:
# checking the VIFs

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
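
For reference, the VIF of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy and includes an intercept in each auxiliary regression. To my understanding, `variance_inflation_factor` does not add a constant itself, so its values can differ from these when `X_train` is passed without one. The helper name `vif_with_intercept` and the data are made up.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress each column on the others plus a constant."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        # 1 / (1 - R^2) simplifies to total variance / residual variance
        vifs.append(target.var() / residual.var())
    return np.array(vifs)

a = np.arange(10.0)
near_copy = a + 0.1 * np.array([1, -1] * 5)   # almost identical to a
print(vif_with_intercept(np.column_stack([a, near_copy])))
```

Two nearly identical columns give a very large VIF, while uncorrelated columns give a VIF of 1, which is why a low threshold is a reasonable stopping rule.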

In [None]:
# deleting the variable 'is_apr_may_july' as it is insignificant 
X_train = X_train.drop(['is_apr_may_july'],axis=1)

In [None]:
# Running the linear model 
lm2 = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# Looking for summary
print(lm2.summary())

In [None]:
# checking again the VIFs

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# deleting the variable 'device_Tablet' as it is insignificant 
X_train = X_train.drop(['device_Tablet'],axis=1)

In [None]:
# Running the linear model 
lm3 = sm.OLS(y_train,sm.add_constant(X_train)).fit()

# Looking for summary
print(lm3.summary())

In [None]:
# checking again the VIFs

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> Note:<br>
As all the variables are now significant and every VIF is below 3, we will stop here; these are our final columns.

In [None]:
# predicting on train data
y_train_time = lm3.predict(sm.add_constant(X_train))

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
sns.histplot((y_train - y_train_time), bins = 20, kde = True)  # replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20)               
plt.xlabel('Errors', fontsize = 18)                         
plt.show()

In [None]:
mse = mean_squared_error(y_train, y_train_time)
msle = mean_squared_log_error(y_train,y_train_time)
r_squared = r2_score(y_train,y_train_time)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

In [None]:
print('The final columns are:')
fcol = list(X_train.columns)
print(fcol)

In [None]:
# Including only those variables which are selected by third model in training set 

X_test = X_test[fcol]

In [None]:
# Making prediction using the third model on test data
y_test_time = lm3.predict(sm.add_constant(X_test))

In [None]:
mse = mean_squared_error(y_test, y_test_time)
msle = mean_squared_log_error(y_test,y_test_time)
r_squared = r2_score(y_test, y_test_time)  # evaluate R-squared on the held-out split, not on train

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

##### Prediction on test set

In [None]:
pred_test_lr = lm3.predict(sm.add_constant(df1[fcol])) # using only selected columns

In [None]:
y_final_lr = np.array(pred_test_lr).reshape(-1,1)

In [None]:
# using inverse transform and exponential function to get back the actual seconds predicted
y_final_lr = scaler.inverse_transform(y_final_lr)
y_final_lr = np.exp(y_final_lr)
y_final_lr
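
The two steps above undo the preprocessing in reverse order: first the min-max scaling, then the log. A minimal sketch of the same arithmetic without the scaler object follows, assuming the default `feature_range=(0, 1)`; the helper `to_seconds` and the numbers are hypothetical.

```python
import numpy as np

def to_seconds(pred_scaled, log_min, log_max):
    """Undo min-max scaling to [0, 1], then undo the natural-log transform."""
    pred_log = np.asarray(pred_scaled, dtype=float) * (log_max - log_min) + log_min
    return np.exp(pred_log)

log_time = np.log(np.array([20.0, 300.0, 5000.0]))   # hypothetical training targets
log_min, log_max = log_time.min(), log_time.max()

# What MinMaxScaler produces with its default settings
scaled = (log_time - log_min) / (log_max - log_min)

restored = to_seconds(scaled, log_min, log_max)
print(restored)
```

One consequence worth keeping in mind: the RMSLE values printed in the evaluation cells are computed on the scaled log target, so they are not directly comparable with the leaderboard score. Applying this back-transformation to both the true values and the predictions first would give the RMSLE in seconds.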

In [None]:
# storing it into the dataframe

df_lr = pd.DataFrame(y_final_lr)
df_lr.rename(columns={0:'time_spent_lr'},inplace=True)
df_lr = df_lr.round(4)
df_lr.head()

### II. XG Boost

In [None]:
# Instantiation 
xgb_r = xg.XGBRegressor(objective ='reg:squarederror', n_estimators = 10, random_state = 123)  # 'reg:linear' is deprecated
  
# Fitting the model 
xgb_r.fit(X_train, y_train) 
  
# Predict the model 
pred = xgb_r.predict(X_test)

mse = mean_squared_error(y_test, pred)
r_squared = r2_score(y_test, pred)

print('RMSE:',np.sqrt(mse))
print('R-squared value:',r_squared)

In [None]:
# running the cross validation

def display_scores(scores):
    print("Scores: {0}\nMean: {1:.3f}\nStd: {2:.3f}".format(scores, np.mean(scores), np.std(scores)))

xg_model = xg.XGBRegressor(objective="reg:squarederror", random_state=42)  # 'reg:linear' is deprecated

scores = cross_val_score(xg_model, X_train, y_train, scoring="neg_mean_squared_error", cv=5)

display_scores(np.sqrt(-scores))

In [None]:
def report_best_scores(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
# use randomizedsearchcv to find the parameters

xgb_model = xg.XGBRegressor()

params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(100, 150), # default 100
    "subsample": uniform(0.6, 0.4)
}

search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=200, cv=3, verbose=1, 
                            n_jobs=1, return_train_score=True)

search.fit(X_train, y_train)

report_best_scores(search.cv_results_, 1)

In [None]:
# Instantiation 
xgb_model1 = xg.XGBRegressor(objective ='reg:squarederror', n_estimators = 144, random_state = 123)  # 'reg:linear' is deprecated
  
# Fitting the model 
xgb_model1.fit(X_train, y_train) 
  
# Predict the model 
pred = xgb_model1.predict(X_train)

mse = mean_squared_error(y_train, pred)
msle = mean_squared_log_error(y_train, pred)
r_squared = r2_score(y_train, pred)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

In [None]:
xgb_model1.feature_importances_

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
sns.histplot((y_train - pred), bins = 20, kde = True)  # replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20) 
plt.xlabel('Errors', fontsize = 18)                         
plt.show()

In [None]:
# Making prediction on test data
pred_test = xgb_model1.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, pred_test)
msle = mean_squared_log_error(y_test, pred_test)
r_squared = r2_score(y_test, pred_test)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

##### Prediction on test data

In [None]:
pred_test_xgb = xgb_model1.predict(df1[fcol])

In [None]:
y_final_xgb = np.array(pred_test_xgb).reshape(-1,1)

In [None]:
# using inverse transform and exponential function to get back the actual seconds predicted
y_final_xgb = scaler.inverse_transform(y_final_xgb)
y_final_xgb = np.exp(y_final_xgb)
y_final_xgb

In [None]:
# storing in the dataframe
df_xgb = pd.DataFrame(y_final_xgb)
df_xgb.rename(columns={0:'time_spent_xgb'},inplace=True)
df_xgb = df_xgb.round(4)
df_xgb.head()

In [None]:
df_xgb.shape

### III. Decision Tree Regressor

In [None]:
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)

In [None]:
# depth of the decision tree
print('Depth of the Decision Tree: ', dtr.get_depth())

# checking the training score (score() returns R-squared for a regressor, not accuracy)
print('R-squared on training: ',dtr.score(X_train, y_train))

In [None]:
# Implementing grid search

parameter_grid = {
    'max_depth' : [24,25,26,27,28,29,30],
    'max_features': [0.3, 0.5, 0.7]
    }

gridsearch = GridSearchCV(estimator=dtr, param_grid=parameter_grid, scoring='neg_mean_squared_error', cv=5)

gridsearch.fit(X_train, y_train)

print(gridsearch.best_params_)

In [None]:
# Implementing random search

parameter_grid = {
    'max_depth' : [24,25,26,27,28,29,30],
    'max_features': [0.3, 0.5, 0.7,0.9]
    }

randomsearch = RandomizedSearchCV(estimator=dtr, param_distributions=parameter_grid, n_iter= 10, cv=5)
randomsearch.fit(X_train, y_train)

print(randomsearch.best_params_)

In [None]:
# final model
dtr1 = DecisionTreeRegressor(max_depth=27, max_features=0.5 ,random_state=10)

# fitting the model
dtr1.fit(X_train, y_train)

# Training score
print(dtr1.score(X_train, y_train).round(4))

In [None]:
from sklearn import tree

fig = plt.figure(figsize=(15,10))
_ = tree.plot_tree(dtr1, feature_names=list(X_train.columns), max_depth=2, filled=True)

In [None]:
dtr1.feature_importances_

In [None]:
# Predict the model 
pred_dt = dtr1.predict(X_train)

mse = mean_squared_error(y_train, pred_dt)
msle = mean_squared_log_error(y_train, pred_dt)
r_squared = r2_score(y_train, pred_dt)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
sns.histplot((y_train - pred_dt), bins = 20, kde = True)  # replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20) 
plt.xlabel('Errors', fontsize = 18)
plt.show()

In [None]:
# Making prediction on test data
pred_test_dt = dtr1.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, pred_test_dt)
msle = mean_squared_log_error(y_test, pred_test_dt)
r_squared = r2_score(y_test, pred_test_dt)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

##### Prediction on test data

In [None]:
pred_test_dt = dtr1.predict(df1[fcol])

In [None]:
y_final_dt = np.array(pred_test_dt).reshape(-1,1)

# using inverse transform and exponential function to get back the actual seconds predicted
y_final_dt = scaler.inverse_transform(y_final_dt)
y_final_dt = np.exp(y_final_dt)
y_final_dt

In [None]:
df_dt = pd.DataFrame(y_final_dt)
df_dt.rename(columns={0:'time_spent_dt'},inplace=True)
df_dt = df_dt.round(4)
df_dt.head()

### IV. Random Forest

In [None]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

In [None]:
# checking the training score (score() returns R-squared for a regressor, not accuracy)
print('R-squared on training: ',rfr.score(X_train, y_train))

In [None]:
# Implementing grid search

parameter_grid = {
    'max_depth' : [24,25,26,27,28,29,30],
    'max_features': [0.3, 0.5, 0.7]
    }

gridsearch = GridSearchCV(estimator=rfr, param_grid=parameter_grid, scoring='neg_mean_squared_error', cv=5)

gridsearch.fit(X_train, y_train)

print(gridsearch.best_params_)

In [None]:
# Implementing random search

parameter_grid = {
    'max_depth' : [24,25,26,27,28,29,30],
    'max_features': [0.3, 0.5, 0.7,0.9]
    }

randomsearch = RandomizedSearchCV(estimator=rfr, param_distributions=parameter_grid, n_iter= 10, cv=5)
randomsearch.fit(X_train, y_train)

print(randomsearch.best_params_)

In [None]:
rfr1 = RandomForestRegressor(max_depth=26, max_features=0.7)
rfr1.fit(X_train, y_train)

# checking the training score (score() returns R-squared for a regressor, not accuracy)
print('R-squared on training: ',rfr1.score(X_train, y_train))

In [None]:
rfr1.feature_importances_

In [None]:
# Predict the model 
pred_rf = rfr1.predict(X_train)

mse = mean_squared_error(y_train, pred_rf)
msle = mean_squared_log_error(y_train, pred_rf)
r_squared = r2_score(y_train, pred_rf)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
sns.histplot((y_train - pred_rf), bins = 20, kde = True)  # replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20)              
plt.xlabel('Errors', fontsize = 18)                     
plt.show()

In [None]:
# Making prediction on test data
pred_test_rf = rfr1.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, pred_test_rf)
msle = mean_squared_log_error(y_test, pred_test_rf)
r_squared = r2_score(y_test, pred_test_rf)

print('RMSE:',np.sqrt(mse))
print('RMSLE:',np.sqrt(msle))
print('R-squared value:',r_squared)

##### Prediction on test data

In [None]:
pred_test_rf = rfr1.predict(df1[fcol])

In [None]:
y_final_rf = np.array(pred_test_rf).reshape(-1,1)

# using inverse transform and exponential function to get back the actual seconds predicted
y_final_rf = scaler.inverse_transform(y_final_rf)
y_final_rf = np.exp(y_final_rf)
y_final_rf

In [None]:
df_rf = pd.DataFrame(y_final_rf)
df_rf.rename(columns={0:'time_spent_rf'},inplace=True)
df_rf = df_rf.round(4)
df_rf.head()

**Thank you!**