## Project: Prediction of online shopper intention 


### Objectives

The main goals of this project are to:

Q1) Classify whether an online customer will generate revenue, based on user activity and characteristics.

Q2) Identify the major session or user features that drive shopper intention.

To achieve these goals, the project first conducts an exploratory analysis to understand the characteristics of shoppers and non-shoppers.

The project then evaluates the following models and selects the best-performing one for implementation.

   a) Classification models: logit, SVM

   b) Instance-based: kNN

   c) Clustering: K-means, to check whether segmentation is possible

Also, the project will investigate whether near-real-time prediction of shopper intention is possible.

### Dataset

 The dataset is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
 
 The dataset consists of 10 numerical and 8 categorical attributes.
 
 The 'Revenue' attribute is used as the class label. It has 2 classes indicating 
 
     TRUE  = Revenue Generated and 
     FALSE = No Revenue Generated. 
 

#### Data Field description for numerical attributes

**Administrative**: Number of pages visited by the visitor about account management 

**Administrative duration**: Total amount of time (in seconds) spent by the visitor on account management related pages

**Informational**: Number of pages visited by the visitor about Web site, communication and address information of the shopping site

**Informational duration**: Total amount of time (in seconds) spent by the visitor on informational pages 

**Product related**: Number of product related pages visited by the visitor

**Product related duration**: Total amount of time (in seconds) spent by the visitor on product related pages 

**Bounce rate**: Average bounce rate value of the pages visited by the visitor (percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.)

**Exit rate**: Average exit rate value of the pages visited by the visitor (for all pageviews to the page, the percentage that were the last in the session, that is, how often the page was the exit point from the website.)

**Page value**: Average page value of the pages visited by the visitor (average value for a web page that a user visited before completing an e-commerce transaction. It tells you which specific pages of the site offer the most value)

**Special day**: Closeness of the site visiting time to a special day on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine’s Day, this value takes a nonzero value between February 2 and February 12, zero before and after these dates unless it is close to another special day, and its maximum value of 1 on February 8.

#### Data Field description for categorical attributes

**The variable type of the categorical/boolean variables below is to be changed to numeric.**

**OperatingSystems**:  Operating system of the visitor  

**Browser**: Browser of the visitor  

**Region**: Geographic region from which the session has been started by the visitor 

**TrafficType**: Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct) 

**VisitorType**: Visitor type as ‘‘New Visitor,’’ ‘‘Returning Visitor,’’ and ‘‘Other’’  

**Weekend**: Boolean value indicating whether the date of the visit is weekend  

**Month**: Month value of the visit date  

**Target Variable**

**Revenue**: Class label indicating whether the visit has been finalized with a transaction 
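
The type change for the categorical and boolean fields could look like the minimal sketch below. The string labels (`"Nov"`, `"June"`, `"Returning_Visitor"` and so on), the integer codes and the `encode_row` helper are assumptions made for illustration; they should be checked against the labels actually present in the raw file.

```python
# Hypothetical mappings; the exact string labels in the raw file are assumed.
MONTH_CODES = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "June": 6,
               "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}
VISITOR_CODES = {"New_Visitor": 0, "Returning_Visitor": 1, "Other": 2}


def encode_row(row):
    """Return a copy of `row` with categorical/boolean fields as integers."""
    encoded = dict(row)
    encoded["Month"] = MONTH_CODES[row["Month"]]
    encoded["VisitorType"] = VISITOR_CODES[row["VisitorType"]]
    encoded["Weekend"] = int(row["Weekend"])   # False -> 0, True -> 1
    encoded["Revenue"] = int(row["Revenue"])   # False -> 0, True -> 1
    return encoded


# One made-up session, for illustration only.
sample = {"Month": "Nov", "VisitorType": "Returning_Visitor",
          "Weekend": False, "Revenue": True}
encoded_sample = encode_row(sample)
print(encoded_sample)
```

Integer codes impose an artificial order on nominal variables such as VisitorType, so one-hot encoding may be a better fit for models like logit.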

### Step 1: Data Understanding

  
There are 12,330 entries and 18 columns in total. The data is imbalanced: 84.5% (10,422) of the samples are negative class samples that did not end with shopping, and the rest (1,908) are positive class samples that ended with shopping.
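
The class shares quoted above can be reproduced from the two counts alone. The snippet below is a small arithmetic check, not part of the original analysis code; the variable names are chosen for illustration.

```python
total_sessions = 12330
positive_sessions = 1908  # sessions that ended with shopping
negative_sessions = total_sessions - positive_sessions

negative_share = negative_sessions / total_sessions
positive_share = positive_sessions / total_sessions

print(f"negative: {negative_sessions} ({negative_share:.1%})")
print(f"positive: {positive_sessions} ({positive_share:.1%})")
```

A classifier that always predicts the negative class would already reach about 84.5% accuracy on this data, which is useful context when reading the accuracy figures of the models.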
 
The statistical summary of the numerical variables is reviewed to see the distribution of values.
 
|Feature   | mean | std | min | max
|----------|--------|-------|-------|----
|Administrative| 2.315166| 3.321784| 0.0| 27.000000
|Administrative_Duration| 80.818611| 176.779107| 0.0| 3398.750000
|Informational| 0.503569| 1.270156| 0.0| 24.000000
|Informational_Duration| 34.472398| 140.749294| 0.0| 2549.375000
|ProductRelated| 31.731468| 44.475503| 0.0| 705.000000
|ProductRelated_Duration| 1194.746220| 1913.669288| 0.0| 63973.522230
|BounceRates| 0.022191| 0.048488| 0.0| 0.200000
|ExitRates| 0.043073| 0.048597| 0.0| 0.200000
|PageValues| 5.889258| 18.568437| 0.0| 361.763742
|SpecialDay| 0.061427| 0.198917| 0.0| 1.000000


The independent attributes are highly positively skewed, and their kurtosis values indicate that the distributions are highly peaked. For these reasons, the distributions are considered non-normal. An Anderson-Darling test for normality confirms this observation.
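
Skewness and kurtosis can be computed directly from central moments. The sketch below uses the plain moment-based definitions and a made-up list of durations; the project may have used a library routine with a different bias correction, so the exact numbers are not comparable to the ones behind the statement above.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; positive values mean a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Made-up session durations (seconds): many small values and one long session.
durations = [0, 0, 0, 5, 10, 20, 40, 900]
print(round(skewness(durations), 2), round(excess_kurtosis(durations), 2))
```

The Anderson-Darling test itself is not reproduced here; SciPy provides it as `scipy.stats.anderson`.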
  
The number of distinct values of each categorical variable is shown below.

|Feature   | No. of distinct values
|----------|--------
|Month| 10
|OperatingSystems| 8
|Browser| 13
|Region| 9
|TrafficType| 20
|VisitorType| 3
|Weekend| 2

Revenue generation is examined by 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType' and 'Weekend'. Over the year, May had the most visits, contributing 27% of site traffic, followed by November with 24%. February had the fewest visits, at about 1.5%. However, most conversions to revenue were recorded in November. While only 14% of returning users made a purchase, 25% of new users who visited the site ended up purchasing. About 15% of weekday hits generated revenue, compared with about 17% of weekend hits.
 


## Step 2: Data Preparation

### Missing Value

There are no missing values identified in this dataset. 

### Outlier detection
 
Some values are scattered well beyond the Q3 + interquartile range boundaries. In this context, most of them appear to be valid values. However, the maximum values of the attributes below appear to be actual outliers and are removed.

a) Administrative_Duration - the row with the maximum value is removed

b) Informational_Duration - the row with the maximum value is removed

c) ProductRelated_Duration - the 2 rows with the largest values are removed

The total number of instances is now 12326.
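
The boundary check and the removal of the largest rows can be sketched as follows. The multiplier `k = 1.5`, the helper names `upper_fence` and `drop_largest`, and the sample durations are assumptions for illustration; the source does not state which multiplier or code was used.

```python
import statistics


def upper_fence(values, k=1.5):
    """Upper outlier boundary Q3 + k * IQR (k = 1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def drop_largest(rows, column, count=1):
    """Return `rows` without the `count` rows holding the largest `column` values."""
    by_value = sorted(range(len(rows)), key=lambda i: rows[i][column], reverse=True)
    dropped = set(by_value[:count])
    return [row for i, row in enumerate(rows) if i not in dropped]


# Made-up durations in seconds; the last value plays the role of an extreme maximum.
durations = [0, 10, 20, 30, 40, 50, 60, 5000]
rows = [{"ProductRelated_Duration": d} for d in durations]

fence = upper_fence(durations)
flagged = [d for d in durations if d > fence]
cleaned = drop_largest(rows, "ProductRelated_Duration", count=1)
print(fence, flagged, len(cleaned))
```

Removing only the top rows, rather than everything above the fence, keeps the many large but plausible values in the data.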

The value counts of the nominal variables look as expected, so no treatment is applied.

### Feature Selection 
Initially, a logit model is trained and tested with all features to benchmark performance. Feature selection methods are then applied to see whether they improve the model's performance.

The accuracy of the logit model using all features is 88%.

A filtering method and two wrapper methods, recursive feature elimination (RFE) and step forward feature selection, are used.

**Filtering Method:** 

Redundant features among the independent variables are removed. The following pairs of attributes convey the same information. The heatmap and scatterplot matrix confirm the strong correlation within each pair, and only the attribute from each pair that has the stronger relationship with the target variable is retained.

a) Administrative and Administrative_Duration - Removed Administrative_Duration

b) Informational and Informational_Duration - Removed Informational_Duration

c) ProductRelated and ProductRelated_Duration - Removed ProductRelated

d) BounceRates and ExitRates - Removed BounceRates
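
The pairwise rule used in the list above (keep the attribute more strongly related to the target) can be sketched with a hand-written Pearson correlation. The `threshold` of 0.8, the helper `keep_from_pair` and the toy values are assumptions; the source only says the pairs are strongly correlated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def keep_from_pair(columns, a, b, target, threshold=0.8):
    """Keep both features unless they are strongly correlated; in that case
    keep only the one with the larger absolute correlation to the target."""
    if abs(pearson(columns[a], columns[b])) < threshold:
        return [a, b]
    corr_a = abs(pearson(columns[a], columns[target]))
    corr_b = abs(pearson(columns[b], columns[target]))
    return [a] if corr_a >= corr_b else [b]


# Made-up values, for illustration only.
columns = {
    "BounceRates": [0.01, 0.02, 0.03, 0.04],
    "ExitRates": [0.01, 0.02, 0.03, 0.05],
    "Revenue": [0, 0, 0, 1],
}
kept = keep_from_pair(columns, "BounceRates", "ExitRates", "Revenue")
print(kept)
```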


A chi-square test is conducted to assess the association between each categorical variable and the target variable. The test indicates that OperatingSystems and TrafficType are not significant.
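
For a two-level variable such as Weekend, the chi-square test of independence against Revenue reduces to a 2x2 table and can be written out by hand. The sketch below is an illustration under stated assumptions: the counts are made up, no continuity correction is applied, and the p-value formula is only valid for one degree of freedom. Variables with more levels, such as OperatingSystems or TrafficType, need the general chi-square distribution.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows are weekday/weekend, columns are no revenue/revenue.
observed = [[850, 150], [250, 50]]
statistic, p_value = chi_square_2x2(observed)
print(round(statistic, 3), round(p_value, 3))
```

A variable is usually treated as not significant when the p-value exceeds the chosen level, for example 0.05.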


**Wrapper methods:**

*RFE method*: used to find the features with significant importance for predicting revenue. The optimal number of features is 8, with an accuracy of 88%.

*Step forward feature selection*: the optimal number of features is 10, with an accuracy of 88%.
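
Step forward selection is a greedy loop: start with no features and repeatedly add the one that improves the score the most. The sketch below shows that loop with a toy scoring function; `forward_select`, `toy_score` and the gain values are hypothetical. In the project the score would presumably be the accuracy of the logit model on the candidate feature set, and the source does not say which implementation was used.

```python
def forward_select(candidates, score, max_features):
    """Greedy step forward selection: add the feature that most improves `score`."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no remaining candidate improves the score
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score


# Toy scoring: made-up per-feature gains minus a small penalty per feature.
TOY_GAIN = {"PageValues": 0.30, "ExitRates": 0.10, "Month": 0.05, "Browser": 0.0}


def toy_score(features):
    return sum(TOY_GAIN[f] for f in features) - 0.01 * len(features)


selected, best = forward_select(list(TOY_GAIN), toy_score, max_features=10)
print(selected, round(best, 2))
```

RFE works in the opposite direction, starting from all features and dropping the least important one at each step. scikit-learn offers `RFE` and `SequentialFeatureSelector` as possible off-the-shelf options.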



#### Feature Importance summary of different feature selection methods

|All features | Filtering | RFE | Step forward
|----------------|-----------------------|---------------|----------------
|ExitRates | ExitRates | ExitRates | ExitRates
|BounceRates | VisitorType | BounceRates | BounceRates
|SpecialDay | Weekend | SpecialDay | Weekend
|VisitorType | Month | VisitorType | VisitorType
|Weekend | PageValues | Informational | PageValues
|Month | Informational | Month | OperatingSystems
|PageValues | Browser | Weekend | Region
|OperatingSystems | Region | PageValues | Browser
|Browser | Administrative | | TrafficType
|Informational | ProductRelated_Duration | | Administrative_Duration
|Region | | |
|Administrative | | |
|TrafficType | | |
|ProductRelated | | |
|Informational_Duration | | |
|ProductRelated_Duration | | |
|Administrative_Duration | | |

#### Logit model performance with different feature selection


|Feature selection|No. of features| Accuracy (%) |Precision (%)|Recall (%)|F1-score|AUC
|----------------|------------|----------|---------|--------|--------|----
|All Features    |   17       | 88       |67       |    84  |    0.71 | 0.89
|Filtering       |   10       | 88       |66       |    83  |    0.70 | 0.88
|RFE             |   8        | 88       |67       |    86  |    0.71 | 0.88
|SForward        |   10       | 88       |66       |    85  |    0.70 | 0.89

SForward (step forward selection) seems to be the better option at this point.
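
For reference, accuracy, precision, recall and F1-score are all derived from the confusion matrix. The sketch below uses made-up counts, not the project's results, and the helper name `classification_metrics` is chosen for illustration. It does not show how the reported figures were averaged across classes, which the source does not state.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up confusion matrix counts, for illustration only.
metrics = classification_metrics(tp=60, fp=20, fn=40, tn=880)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because the positive class is rare, precision and recall on that class are usually more informative than accuracy when comparing the feature sets.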
