**Description**: The dataset consists of feature vectors belonging to 12,330 sessions. The data were constructed so that each session belongs to a different user within a 1-year period, to avoid any bias toward a specific campaign, special day, user profile, or period. Of the 12,330 sessions, 84.5% (10,422) are negative class samples that did not end with shopping, and the rest (1,908) are positive class samples that ended with shopping.
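
As a quick sanity check on the class balance quoted above, the following sketch recomputes the percentages from the session counts. The "majority-class baseline" is an added illustration, not part of the original dataset description: it is the accuracy a trivial model would reach by always predicting "no purchase".

```python
# Session counts quoted in the description above
total_sessions = 12330
negative_sessions = 10422  # sessions that did not end with a purchase
positive_sessions = 1908   # sessions that ended with a purchase

negative_share = negative_sessions / total_sessions
positive_share = positive_sessions / total_sessions

# Accuracy of a trivial model that always predicts the larger class
majority_baseline_accuracy = max(negative_share, positive_share)

print(f"negative: {negative_share:.1%}, positive: {positive_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because the classes are imbalanced, any model evaluated on this data should be compared against that baseline rather than against 50%.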

Additional Variable Information: The dataset comprises 10 numerical and 8 categorical attributes, with the 'Revenue' attribute serving as the class label. Metrics such as "Administrative," "Administrative Duration," "Informational," "Informational Duration," "Product Related," and "Product Related Duration" quantify the number of pages visited and the time spent on different page categories during a session. These values are derived from the URL information of the visited pages, dynamically updated in real-time as users navigate through the site. The features "Bounce Rate," "Exit Rate," and "Page Value" correspond to metrics measured by "Google Analytics" for each page in the e-commerce site. "Bounce Rate" indicates the percentage of visitors who enter a page and leave without triggering additional requests to the analytics server. "Exit Rate" calculates the percentage of pageviews that were the last in a session. Meanwhile, "Page Value" represents the average value of a page visited before completing an e-commerce transaction. The "Special Day" feature gauges the proximity of site visits to specific occasions (e.g., Mother’s Day, Valentine's Day), where transactions are more likely to occur. This attribute's value considers e-commerce dynamics, such as the duration between the order date and delivery date. For instance, around Valentine’s Day, the value is nonzero between February 2 and February 12, zero before and after unless close to another special day, reaching a maximum of 1 on February 8. Additionally, the dataset includes information on the operating system, browser, region, traffic type, visitor type (returning or new), a Boolean indicator for weekend visits, and the month of the year.
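
To make the "Bounce Rate" and "Exit Rate" definitions above concrete, here is a minimal sketch on a handful of made-up sessions. The page names and session paths are hypothetical and are not taken from the dataset; the formulas follow the definitions in the paragraph above (bounces divided by entrances, exits divided by pageviews).

```python
from collections import Counter

# Hypothetical sessions: each list is the ordered sequence of pages in one visit
sessions = [
    ["home"],
    ["home", "product", "cart"],
    ["product"],
    ["home", "product"],
    ["product", "home"],
]

entrances, bounces, pageviews, exits = Counter(), Counter(), Counter(), Counter()
for pages in sessions:
    entrances[pages[0]] += 1      # first page of the session
    exits[pages[-1]] += 1         # last page of the session
    pageviews.update(pages)       # every page counts as one pageview
    if len(pages) == 1:
        bounces[pages[0]] += 1    # single-page session: a bounce

# Bounce rate: share of sessions entering on a page that left without another request
bounce_rate = {page: bounces[page] / n for page, n in entrances.items()}
# Exit rate: share of a page's pageviews that were the last in their session
exit_rate = {page: exits[page] / n for page, n in pageviews.items()}

print(bounce_rate)
print(exit_rate)
```

Note that a page with no entrances (here `cart`) has an exit rate but no bounce rate, which is one reason the two metrics can differ even though they are usually strongly correlated.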



In [None]:
!pip install requests
!pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future



In [None]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o


In [None]:
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import logging
import csv
import optparse
import time
import json
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
from yellowbrick.regressor import ResidualsPlot
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

In [None]:
min_mem_size=6
run_time=222

In [None]:
pct_memory=0.5
virtual_memory=psutil.virtual_memory()
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))
print(min_mem_size)

In [None]:
port_no=random.randint(5555,55555)

try:
  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
except Exception:
  # If the cluster failed to start, there are no logs to download and nothing to shut down
  logging.critical('h2o.init failed', exc_info=True)
  sys.exit(2)

In [None]:
url = "https://github.com/shwetackhade/Data-Science-Engineering-Methods-and-Tools/blob/main/online_shoppers_intention.csv?raw=true"
df = h2o.import_file(path = url)
dff = pd.read_csv('https://github.com/shwetackhade/Data-Science-Engineering-Methods-and-Tools/blob/main/online_shoppers_intention.csv?raw=true')

In [None]:
dff.head()

In [None]:
dff.describe()

In [None]:
# Count missing values per column instead of printing the full boolean frame
dff.isnull().sum()

In [None]:
df.types

In [None]:
dff.dtypes

In [None]:
dff['Revenue'].value_counts()

In [None]:
dff.shape

In [None]:
dff.Revenue = dff.Revenue.astype(int)
dff.Weekend = dff.Weekend.astype(int)
dff.VisitorType = dff.VisitorType.replace(
    {'Returning_Visitor': '0',
    'Other': '2',
    'New_Visitor': '1',
    }).astype(int)
dff.Month = dff.Month.replace(
    {'Jan': '1',
    'Feb': '2',
    'Mar': '3',
    'Apr': '4',
    'May': '5',
    'June': '6',
    'Jul': '7',
    'Aug': '8',
    'Sep': '9',
    'Oct': '10',
    'Nov': '11',
    'Dec': '12',
    }).astype(int)

In [None]:
dff.Month

In [None]:
pct_rows=0.80
df_train, df_test = df.split_frame([pct_rows])

In [None]:
print(df_train.shape)
print(df_test.shape)

In [None]:
X=df.columns
print(X)

In [None]:
y_numeric ='Revenue'
X.remove(y_numeric)
print(X)


**H2O AutoML Execution**

In [None]:
aml = H2OAutoML(max_runtime_secs=run_time, seed=1)


In [None]:
aml.train(x=X,y=y_numeric,training_frame=df_train)


**Interpreting the above results**



In [None]:
print(aml.leaderboard)

**Analysing the relation between all variables**

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Encode categorical variables
#dff_encoded = pd.get_dummies(dff, columns=['Month', 'VisitorType'], drop_first=True)

# Select relevant columns, ensuring all are numeric
Multic = dff[['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'Revenue']]

# Compute VIF
vif = pd.DataFrame()
vif["variables"] = Multic.columns
vif["VIF"] = [variance_inflation_factor(Multic.values, i) for i in range(Multic.shape[1])]
vif


In [None]:
import statsmodels.formula.api as smf #OLS model Library
results = smf.ols('Revenue ~ Administrative + Administrative_Duration + Informational + Informational_Duration + ProductRelated + ProductRelated_Duration + BounceRates + ExitRates + PageValues + SpecialDay + Month + OperatingSystems + Browser + Region + TrafficType + VisitorType + Weekend', data=dff).fit()
results.summary()

In [None]:
dff.corr()


In [None]:
#Representing Matrix as a plot
from IPython.core.pylabtools import figsize
f,ax=plt.subplots(figsize=(10,6))

sns.heatmap(dff.corr(),center=0, linewidths=0.8,cmap='coolwarm',annot=True, annot_kws={"size": 9})
plt.title('Variable Correlation')

In [None]:
sns.pairplot(dff)


**H2O AutoML Re-execution on new model**
Dropping the variables that are not significant for determining Revenue and passing this new model through H2O AutoML again. Here, we repeat the entire process exactly as above while ignoring the unnecessary features.

Dropping OperatingSystems, Browser, Weekend and TrafficType


In [None]:
df1=df.drop(['OperatingSystems', 'Browser','Weekend','TrafficType'], axis=1)

In [None]:
df1_train, df1_test = df1.split_frame([pct_rows])


In [None]:
X1=df1.columns
print(X1)

In [None]:
#Separate dependent variable from independent variables
y1_numeric ='Revenue'
X1.remove(y1_numeric)
print(X1)

In [None]:
aml1 = H2OAutoML(max_runtime_secs=run_time, seed=1)


In [None]:
aml1.train(x=X1,y=y1_numeric,training_frame=df1_train)


In [None]:
print(aml1.leaderboard)


In [None]:
#assign index values to all the models generated
model_index=0
glm_index=0
glm_model=''
aml1_leaderboard_df1=aml1.leaderboard.as_data_frame()
models_dict={}
for m in aml1_leaderboard_df1['model_id']:
  models_dict[m]=model_index
  if 'StackedEnsemble' not in m:
    break
  model_index=model_index+1

for m in aml1_leaderboard_df1['model_id']:
  if 'GLM' in m:
    models_dict[m]=glm_index
    break
  glm_index=glm_index+1
models_dict

In [None]:
#print the index value of best model
print(model_index)
best_model1 = h2o.get_model(aml1.leaderboard[model_index,'model_id'])

In [None]:
best_model1.algo


In [None]:
#plot variables in order of their importance for Revenue prediction
if best_model1.algo in ['gbm','drf','xrt','xgboost']:
    best_model1.varimp_plot()

In [None]:
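# glm_index equals the leaderboard length when no GLM was found
if glm_index != 0 and glm_index < len(aml1_leaderboard_df1):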
  print(glm_index)
  glm_model1=h2o.get_model(aml1.leaderboard[glm_index,'model_id'])
  print(glm_model1.algo)
  glm_model1.std_coef_plot()

**Checking if assumptions violated**

In [None]:
df.head()


In [None]:
dff.describe()

In [None]:
#Separating the predictor and target variables
A=dff.drop(['Revenue'],axis=1)
B=dff['Revenue']

In [None]:
A_train,A_test,b_train,b_test=train_test_split(A,B,test_size=0.2,random_state=42)


In [None]:
# A (features) and B (target) are pandas objects here, not H2O frames.
# Splitting them separately with the same random_state yields the same row partition.
split_frames = train_test_split(A,test_size=0.2,random_state=42) # 80% of the rows go to the first frame
A_train = split_frames[0]
A_test = split_frames[1]

split_frames_b = train_test_split(B,test_size=0.2,random_state=42)  # same split applied to the target
b_train = split_frames_b[0]
b_test = split_frames_b[1]



In [None]:
model1 = sm.OLS(b_train,sm.add_constant(A_train[X])).fit()

In [None]:
b_pred = model1.predict(sm.add_constant(A_train[X]))


In [None]:
residuals = b_train-b_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))

In [None]:
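# sns.distplot is deprecated in recent seaborn releases; histplot draws the same histogram with a KDE overlay
p = sns.histplot(residuals,kde=True)
p = plt.title('Normality of error terms/residuals')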


In [None]:
import pylab
import scipy.stats as stats
stats.probplot(residuals, dist="norm", plot=pylab)
pylab.show()

In [None]:
sns.histplot(dff.Revenue, kde = True)


Ridge Regularization in H2O



In [None]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
house_glm = H2OGeneralizedLinearEstimator(family = 'gaussian', lambda_ = 0, compute_p_values = True)
house_glm_regularization = H2OGeneralizedLinearEstimator(family = 'gaussian', lambda_ = .001, alpha = 0)


In [None]:
df1_train.types["Revenue"]

In [None]:
df1_train["Revenue"] = df1_train["Revenue"].asnumeric()

In [None]:
house_glm_regularization.train(x = X1, y = y1_numeric, training_frame = df1_train)

In [None]:
#Model details without regularization
house_glm.train(x = X1, y = y1_numeric, training_frame = df1_train)



DATA REPORT

In [None]:
exa = aml1.explain(df1_test)


Hyperparameter Tuning


In [None]:
s = dff['Revenue']

t = dff.drop(['Revenue'], axis = 1)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

In [None]:
t_train, t_test, s_train, s_test = train_test_split (t, s, random_state = 42, test_size = 0.2)


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

mode = RandomForestRegressor()

param_vals = {
    'max_depth': [200, 500, 800, 1100],
    'n_estimators': [100, 200, 300, 400],
    'min_samples_split': [2, 3, 5],
}

# 'accuracy' is a classification metric and cannot score a regressor,
# so a regression metric is used instead
random_rf = RandomizedSearchCV(estimator=mode, param_distributions=param_vals,
                               n_iter=10, scoring='neg_root_mean_squared_error', cv=5,
                               refit=True, n_jobs=-1)

# Training and prediction
random_rf.fit(t_train, s_train)
preds = random_rf.best_estimator_.predict(t_test)

In [None]:
random_rf.best_params_


**CONCLUSION**:

AutoML was used for Revenue prediction, with the Variance Inflation Factor (VIF), p-values, and additional tests guiding which independent variables to exclude. The H2O.ai framework was used to train and test models on the online shoppers purchase intention dataset and identified 'gbm' as the best model. The findings indicate that the approach can predict whether a session ends in a purchase (Revenue) to a certain degree. However, the model's predictive precision has limitations and needs further improvement. Future work on these models may benefit from strategies such as outlier removal and further use of ensemble or boosting techniques to improve prediction accuracy.



1) Is the relationship significant?
Ans: A relationship is considered statistically significant if the p-value associated with a variable is below 0.05. The p-value is the probability of observing a result at least as extreme as the one observed, assuming the null hypothesis (here, that the variable's coefficient is zero) is true. A low p-value is evidence against the null hypothesis, so it can be rejected. In this model, p-values were obtained from the OLS (Ordinary Least Squares) summary. The p-values for 'Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'Browser', 'TrafficType', and 'Weekend' exceeded 0.05, so these variables are not significant at that level. The remaining variables have p-values below 0.05, which indicates a significant relationship between them and Revenue.
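
The p-values in an OLS summary come from the coefficient estimates and their standard errors. The sketch below reproduces that calculation with NumPy on synthetic data (the variable names `relevant` and `noise_feature` are hypothetical, and the data are not from this dataset). It uses a normal approximation to the t-distribution, which is an assumption that is reasonable only for large samples.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 400
relevant = rng.normal(size=n)        # synthetic predictor that truly affects y
noise_feature = rng.normal(size=n)   # synthetic predictor unrelated to y
y = 1.5 * relevant + rng.normal(size=n)

# Design matrix with an explicit intercept column
design = np.column_stack([np.ones(n), relevant, noise_feature])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coef
dof = n - design.shape[1]                       # residual degrees of freedom
sigma2 = residuals @ residuals / dof            # estimated error variance
cov = sigma2 * np.linalg.inv(design.T @ design) # covariance of the estimates
std_err = np.sqrt(np.diag(cov))
t_stats = coef / std_err

# Two-sided p-value, normal approximation: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, p in zip(["const", "relevant", "noise_feature"], p_values):
    print(f"{name}: p = {p:.4f}, significant at 0.05: {p < 0.05}")
```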

2) Are any model assumptions violated?
Ans:
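- Linear relationship - The relationship between the dependent variable and each independent variable should be linear when the other variables are held constant. When the target variable is plotted against the independent variables, a linear relation is observed for a few of them. Hence this assumption is treated as not violated, although the evidence for the remaining variables is weak.

- Normality of the error distribution - The residuals should be approximately normally distributed, i.e., their histogram should form a bell curve and the Q-Q plot should follow a straight line. (Homoscedasticity is a separate assumption: the residuals should have constant variance across fitted values.) For this model, the residual plots were judged to satisfy the normality check.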
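
The residual checks described above can be complemented with a few numbers. The following is a minimal sketch on synthetic data with a 0/1 target (chosen only to resemble a binary column such as Revenue; it is not the notebook's data). It computes the mean of the residuals, their skewness and excess kurtosis (both near zero for a normal distribution), and a rough constant-variance comparison between low and high fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
# Synthetic binary target, loosely mimicking a 0/1 outcome
y = (x[:, 0] + rng.normal(size=n) > 1.0).astype(float)

design = np.column_stack([np.ones(n), x])   # intercept + two features
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
residuals = y - fitted

# With an intercept in the model, OLS residuals average to zero by construction
mean_residual = residuals.mean()

# Rough normality indicators
std = residuals.std()
skewness = np.mean((residuals - mean_residual) ** 3) / std ** 3
excess_kurtosis = np.mean((residuals - mean_residual) ** 4) / std ** 4 - 3.0

# Rough homoscedasticity check: residual variance for low vs. high fitted values
order = np.argsort(fitted)
low_var = residuals[order[: n // 2]].var()
high_var = residuals[order[n // 2:]].var()

print(f"mean residual: {mean_residual:.2e}")
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
print(f"residual variance (low fitted): {low_var:.4f}, (high fitted): {high_var:.4f}")
```

A mean residual of zero alone says little, since it holds for any OLS fit with an intercept; the shape and variance checks carry the actual diagnostic information.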


3) Is there any multicollinearity in the model?
Ans: Multicollinearity occurs in a model when there is a high correlation between two or more independent variables. This condition is problematic because it diminishes the reliability of the statistical significance of individual independent variables. To identify multicollinearity, one can utilize a correlation matrix or compute the Variance Inflation Factor (VIF) for each variable. In a correlation matrix, a coefficient near +1 or -1 indicates a strong correlation between two variables. A VIF value exceeding 10 suggests the presence of multicollinearity. In the discussed model, although no variables exhibited a VIF greater than 10, certain variables had p-values higher than 0.05. Removing these variables and reassessing the model led to the desired results. Presently, there is a significant correlation observed between Bounce Rate and Exit Rate.
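
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this directly with NumPy on synthetic columns (`a`, `b`, `c` are hypothetical, not dataset columns). It includes an intercept in each auxiliary regression; to my understanding, statsmodels' `variance_inflation_factor` does not add one automatically, so results can differ unless a constant column is supplied.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of `features`."""
    n, k = features.shape
    out = np.empty(k)
    for j in range(k):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r_squared)
    return out

rng = np.random.default_rng(42)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)  # nearly a copy of a: strongly collinear
c = rng.normal(size=1000)            # independent of a and b

vif_values = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vif_values):
    print(f"{name}: VIF = {value:.1f}")
```

The two collinear columns get large VIFs while the independent column stays close to 1, which is the pattern the threshold of 10 is meant to detect.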

4) In the multivariate models are predictor variables independent of all the other predictor variables?
Ans: Variables are said to be independent when there is no relation between them. This can be checked with a correlation matrix, or by inspecting pairwise plots for any pattern. The correlation matrix computed for this model shows that ExitRates and BounceRates are correlated with each other. Apart from that pair, the predictors appear to be independent of each other.
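
Reading a large correlation heatmap by eye is error-prone, so it can help to list the pairs above a threshold. This is a small sketch on synthetic data: the column names are borrowed from the dataset for readability, but the values are generated here, and the 0.8 threshold is an assumed cut-off rather than one used in the analysis above.

```python
import numpy as np

def correlated_pairs(data, names, threshold=0.8):
    """Return (name_i, name_j, r) for every column pair with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs

rng = np.random.default_rng(1)
bounce = rng.uniform(0, 0.2, size=2000)
exit_ = bounce + rng.normal(scale=0.02, size=2000)   # synthetic: built from bounce
page_values = rng.exponential(scale=5.0, size=2000)  # synthetic: independent

names = ["BounceRates", "ExitRates", "PageValues"]
pairs = correlated_pairs(np.column_stack([bounce, exit_, page_values]), names)
print(pairs)
```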

5) In multivariate models rank the most significant predictor variables and exclude insignificant ones from the model.
Ans: The variable importance plot displays the variables from most to least important. For my model, PageValues, Month, BounceRates, and ProductRelated are the top 4 most important variables for determining Revenue. The VIF and p-values for OperatingSystems, Browser, Weekend, and TrafficType were higher than the ideal values, so those variables were excluded from the model.

6) Does the model make sense?
Ans: For a model to make sense, it should satisfy the assumptions above and have p-values and VIFs within their respective acceptable ranges. The RMSE should be as low as possible relative to the minimum and maximum values of the target variable. On these criteria, the model makes sense overall. To increase accuracy, some additional variables can be dropped depending on their importance. Furthermore, outliers can be removed, or a boosting or ensemble model can be used.
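
To judge RMSE "relative to the minimum and maximum values of the target", one option is to divide it by the target's range. The sketch below does this for a few made-up actual and predicted values; the numbers are purely illustrative and are not outputs of the models in this notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only: a 0/1 target and some hypothetical predictions
actual = [0, 0, 1, 0, 1]
predicted = [0.1, 0.2, 0.7, 0.0, 0.4]

error = rmse(actual, predicted)
target_range = max(actual) - min(actual)
normalized_error = error / target_range  # RMSE as a fraction of the target's range

print(f"RMSE: {error:.4f}, normalized by range: {normalized_error:.4f}")
```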

7) Does regularization help?
Ans: Regularization adds a penalty term to the loss function so that the coefficients do not take extreme values. This reduces overfitting to noise in the training data and makes predictions on test data more stable. Its main purpose is to reduce validation loss and improve how well the model generalizes. For this model, Ridge regularization (alpha = 0, lambda = 0.001) was applied on the training data. Root Mean Square Error (RMSE) and R2 were calculated twice, once without regularization and once with it, and the values were the same in both cases. Hence it can be concluded that, at this penalty strength, regularization does not help for this model.
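
One possible reason the regularized and unregularized fits look identical is that a very small penalty barely changes the coefficients. The sketch below illustrates this with the textbook closed-form ridge solution, w = (XᵀX + λI)⁻¹Xᵀy, on synthetic data. It is an assumption-laden illustration: H2O's GLM may scale its penalty differently, so the λ values here are not directly comparable with `lambda_` in the cells above.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution; lam = 0 gives ordinary least squares."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 3.0])   # synthetic ground truth
y = X @ true_w + rng.normal(scale=0.5, size=200)

w_ols = ridge_coefficients(X, y, lam=0.0)
w_small = ridge_coefficients(X, y, lam=0.001)   # tiny penalty
w_large = ridge_coefficients(X, y, lam=100.0)   # strong penalty

print("OLS norm:        ", np.linalg.norm(w_ols))
print("lambda=0.001 norm:", np.linalg.norm(w_small))
print("lambda=100 norm:  ", np.linalg.norm(w_large))
```

With a tiny penalty the coefficients are almost unchanged, while a larger penalty shrinks them noticeably, so trying several penalty strengths is a more informative test than a single small value.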

8) Which independent variables are significant?
Ans: Variables are significant when their p-value is less than 0.05. Except for 'Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'Browser', 'TrafficType', and 'Weekend', all other variables have p-values below 0.05. So it can be said that all of the remaining variables are significant.

9) Which hyperparameters are important?
Ans: Hyperparameter tuning searches for the set of hyperparameter values, including combinations of interacting hyperparameters, that gives the best performance on a given dataset. It evaluates different candidate values and selects the combination with the best score. For this model, tuning was performed with a randomized search over a RandomForestRegressor. The best hyperparameters found for this model are: {'n_estimators': 100, 'min_samples_split': 2, 'max_depth': 500}
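
To show what a randomized search does conceptually, the sketch below samples 10 of the 48 possible combinations from the same grid and keeps the best one. The `toy_score` function is a hypothetical stand-in for a cross-validated model score, so the "best" combination it returns says nothing about the real model.

```python
import itertools
import random

param_grid = {
    "max_depth": [200, 500, 800, 1100],
    "n_estimators": [100, 200, 300, 400],
    "min_samples_split": [2, 3, 5],
}

def toy_score(params):
    # HYPOTHETICAL scoring function standing in for cross-validation;
    # a real search would train and evaluate a model here
    return -abs(params["n_estimators"] - 200) / 100 - 0.1 * params["min_samples_split"]

# Every combination in the grid: 4 * 4 * 3 = 48
all_combinations = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

# Randomized search: evaluate only a random subset, drawn without replacement
rng = random.Random(0)
sampled = rng.sample(all_combinations, k=10)
best = max(sampled, key=toy_score)

print(f"evaluated {len(sampled)} of {len(all_combinations)} combinations")
print("best sampled combination:", best)
```

Because only a subset is evaluated, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.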

LICENSE: