### Why I chose this dataset

The data used in this analysis is the Online Shoppers Purchasing Intention data set from the UC Irvine Machine Learning Repository. The data set was formed so that each session belongs to a different user in a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period. Its primary purpose is to predict the purchasing intention of a visitor to this particular store’s website. The dataset has very few missing values, and all of its features plausibly relate to purchasing intention. That is why I chose it.

## Column Descriptions:
**Administrative:** This is the number of pages of this type (administrative) that the user visited.

**Administrative_Duration:** This is the amount of time spent in this category of pages.

**Informational:** This is the number of pages of this type (informational) that the user visited.

**Informational_Duration:** This is the amount of time spent in this category of pages.

**ProductRelated:** This is the number of pages of this type (product related) that the user visited.

**ProductRelated_Duration:** This is the amount of time spent in this category of pages.

**BounceRates:** The percentage of visitors who enter the website through that page and exit without triggering any additional tasks.

**ExitRates:** The percentage of pageviews on the website that end at that specific page.

**PageValues:** The average value of the page averaged over the value of the target page and/or the completion of an eCommerce transaction. <br>
[More information about how this is calculated](https://support.google.com/analytics/answer/2695658?hl=en)

**SpecialDay:** This value represents the closeness of the browsing date to special days or holidays (e.g. Mother's Day or Valentine's Day) on which the transaction is more likely to be finalized. More information about how this value is calculated is given below.

**Month:** Contains the month the pageview occurred, in string form.

**OperatingSystems:** An integer value representing the operating system that the user was on when viewing the page.

**Browser:** An integer value representing the browser that the user was using to view the page.

**Region:** An integer value representing which region the user is located in.

**TrafficType:** An integer value representing what type of traffic the user is categorized into. <br>
[Read more about traffic types here.](https://www.practicalecommerce.com/Understanding-Traffic-Sources-in-Google-Analytics)

**VisitorType:** A string representing whether a visitor is New Visitor, Returning Visitor, or Other.

**Weekend:** A boolean representing whether the session is on a weekend.

**Revenue:** A boolean representing whether or not the user completed the purchase.


### Necessary Imports

In [None]:
# Data Analysis and visualization tools
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
import plotly as py
import plotly.graph_objs as go

#statistics tools
import statsmodels.api as sm
import scipy.stats as st
from scipy.stats import shapiro, mannwhitneyu, chi2_contingency

#scikit learn framework
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.naive_bayes import GaussianNB

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading Database
data = pd.read_csv('../input/online-shoppers-intention/online_shoppers_intention.csv')

# shape of the data(number of rows vs number of column)
data.shape

In [None]:
type(data)

In [None]:
# Displaying some rows of the data
data.head()

**Attribute Information:**

The dataset consists of 10 numerical and 8 categorical attributes. The 'Revenue' attribute can be used as the class label. Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1908) were positive class samples ending with shopping.

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.

The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.

The value of "Exit Rate" feature for a specific web page is calculated as for all pageviews to the page, the percentage that were the last in the session. The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction.
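The two rate definitions above can be made concrete with a small sketch. It assumes a simplified reading of the definitions and uses made-up toy sessions, not the real traffic log; the function name `bounce_and_exit_rates` is hypothetical and only serves the illustration.

```python
from collections import Counter

def bounce_and_exit_rates(sessions):
    """Return {page: (bounce_rate, exit_rate)} for a list of sessions.

    Each session is an ordered list of page names (toy data).
    """
    entrances = Counter()  # sessions that started on the page
    bounces = Counter()    # single-page sessions that started on the page
    views = Counter()      # total pageviews of the page
    exits = Counter()      # sessions whose last pageview was the page
    for pages in sessions:
        if not pages:
            continue
        entrances[pages[0]] += 1
        if len(pages) == 1:
            bounces[pages[0]] += 1
        exits[pages[-1]] += 1
        views.update(pages)
    rates = {}
    for page in views:
        # a page nobody entered through gets a bounce rate of 0 by convention here
        bounce = bounces[page] / entrances[page] if entrances[page] else 0.0
        rates[page] = (bounce, exits[page] / views[page])
    return rates

# Made-up sessions, for illustration only
toy_sessions = [
    ["home"],
    ["home", "product", "cart"],
    ["product"],
    ["home", "product"],
]
rates = bounce_and_exit_rates(toy_sessions)
for page, (bounce, exit_rate) in sorted(rates.items()):
    print(f"{page}: bounce={bounce:.2f}, exit={exit_rate:.2f}")
```

In this toy example, "home" is the entrance page of three sessions and one of them has a single pageview, so its bounce rate is 1/3. "product" is viewed three times and is the last page twice, so its exit rate is 2/3.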

The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine's Day) in which the sessions are more likely to be finalized with transaction.

The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine’s Day, this value is nonzero between February 2 and February 12, zero before and after this period unless the date is close to another special day, and reaches its maximum of 1 on February 8.

The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.

In [None]:
# Information of data
data.info()

In [None]:
# description of the data
data.describe()

In [None]:
# Null data checking 
data.isnull().sum()

In [None]:
# missing percentage of the data (share of null values per column, in percent)
missing_percentage = data.isnull().sum() / data.shape[0] * 100
print(missing_percentage)

### Univariate Analysis with Visualization
- Revenue
- Weekend
- Operating System
- Browser
- Month
- VisitorType
- TrafficType
- Region

## Revenue

In [None]:
data['Revenue'].value_counts()

In [None]:
# checking the Distribution of customers on Revenue

plt.rcParams['figure.figsize'] = (13, 5)

plt.subplot(1, 2, 1)
sns.countplot(x = data['Revenue'], palette = 'pastel')
plt.title('Buy or Not', fontsize = 15)
plt.xlabel('Revenue or not', fontsize = 15)
plt.ylabel('count', fontsize = 15)
plt.show()

## Weekend

In [None]:
data['Weekend'].value_counts()

In [None]:
# checking the Distribution of customers on Weekend

plt.rcParams['figure.figsize'] = (13,5)
plt.subplot(1, 2, 2)
sns.countplot(x = data['Weekend'], palette = 'inferno')
plt.title('Purchase on Weekends', fontsize = 30)
plt.xlabel('Weekend or not', fontsize = 15)
plt.ylabel('count', fontsize = 15)

plt.show()

**What do we observe here?**
- The counts above show that both `Revenue` and `Weekend` are highly imbalanced.

## Operating Systems

In [None]:
# checking how many sessions each operating system has
data['OperatingSystems'].value_counts()

In [None]:
# plotting a pie chart for Operating Systems

plt.rcParams['figure.figsize'] = (18, 7)
size = [6601, 2585, 2555, 589]
colors = ['violet', 'magenta', 'pink', 'blue']
labels = "2", "1", "3", "others"

plt.subplot(1, 2, 2)
plt.pie(size, colors = colors, labels = labels, shadow = True, autopct = '%.2f%%', startangle=90)
plt.title('Different Operating Systems', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()




**What is the observation point here?**
- The top 3 operating systems cover about 95% of this dataset, so we should focus on them to grow the business.
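The 95% figure can be checked from the sizes hard-coded in the pie chart cell. The sketch below also shows one way to pool rare categories into an "others" bucket instead of typing the sizes by hand. Both helper names are hypothetical, and the `toy_counts` values are made up. As an assumption, the real counts could be fed in with `data['OperatingSystems'].value_counts().to_dict()`.

```python
def top_n_with_others(counts, n):
    """Keep the n most frequent categories and pool the rest as 'others'."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    grouped = dict(ranked[:n])
    remainder = sum(count for _, count in ranked[n:])
    if remainder:
        grouped["others"] = remainder
    return grouped

def share_of_top(grouped):
    """Percentage of all observations that fall outside the 'others' bucket."""
    total = sum(grouped.values())
    return 100 * (total - grouped.get("others", 0)) / total

# Sizes as hard-coded in the pie chart cell above
os_sizes = {"2": 6601, "1": 2585, "3": 2555, "others": 589}
print(round(share_of_top(os_sizes), 2))  # 95.22

# Hypothetical raw counts, only to show the grouping step
toy_counts = {"a": 50, "b": 30, "c": 15, "d": 3, "e": 2}
print(top_n_with_others(toy_counts, 3))  # {'a': 50, 'b': 30, 'c': 15, 'others': 5}
```

Deriving the sizes from the counts keeps the chart in sync with the data if the dataset is filtered or updated.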

## Browsers

In [None]:
# checking how many sessions each browser has
data['Browser'].value_counts()

In [None]:
# plotting a pie chart for browsers
plt.rcParams['figure.figsize'] = (18, 7)

size = [7961, 2462, 736, 467,174, 163, 300]
colors = ['orange', 'yellow', 'pink', 'crimson', 'lightgreen', 'cyan', 'blue']
labels = "2", "1", "4", "5", "6", "10", "others"

plt.subplot(1, 2, 2)
plt.pie(size, colors = colors, labels = labels, shadow = True, autopct = '%.1f%%', startangle = 90)
plt.title('Different Browsers', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()

**What is the observation point here?**
- About 90% of users used one of the top 3 browsers.

## Month

In [None]:
data['Month'].value_counts()

In [None]:
# creating a donut chart for the share of sessions in each month
size = [3364, 2998, 1907, 1727, 549, 448, 433, 432, 288, 184]
colors = ['yellow', 'pink', 'lightblue', 'crimson', 'lightgreen', 'orange', 'cyan', 'magenta', 'violet', 'pink', 'lightblue', 'red']
labels = "May", "November", "March", "December", "October", "September", "August", "July", "June", "February"
explode = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

circle = plt.Circle((0, 0), 0.6, color = 'white')

plt.rcParams['figure.figsize'] = (18, 7)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
plt.title('Month', fontsize = 30)
p = plt.gcf()
p.gca().add_artist(circle)
plt.axis('off')
plt.legend()
plt.show()

## Visitor Type

In [None]:
data['VisitorType'].value_counts()

In [None]:
# plotting a pie chart for Visitors

plt.rcParams['figure.figsize'] = (18, 7)
size = [10551, 1694, 85]
colors = ['lightGreen', 'green', 'pink']
labels = "Returning Visitor", "New Visitor", "Others"
explode = [0, 0, 0.1]
plt.subplot(1, 2, 1)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
plt.title('Different Visitors', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()

**What is the observation point here?**
- More than 85% of visitors are returning visitors, which is a large share. This information may be helpful for marketing.

### Traffic Type

In [None]:
data['TrafficType'].value_counts()

In [None]:
# visualizing the distribution of different traffic around the TrafficType
plt.rcParams['figure.figsize'] = (18, 7)

plt.subplot(1, 2, 1)
plt.hist(data['TrafficType'], color = 'lightblue')
plt.title('Distribution of different Traffic', fontsize = 30)
plt.xlabel('TrafficType Codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.grid()
plt.show()

**What is the observation point here?**
- The traffic type counts are not normally (Gaussian) distributed. They fall off sharply after the first few codes, roughly like an exponential decay, so this skew needs to be handled with care.
- There are 20 different traffic type codes.

## Region

In [None]:
data['Region'].value_counts()

In [None]:
# visualizing the distribution of the users around the Region
plt.rcParams['figure.figsize'] = (18, 7)

plt.subplot(1, 2, 1)
plt.hist(data['Region'], color = 'lightgreen')
plt.title('Distribution of users(Customers)', fontsize = 30)
plt.xlabel('Region Codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

plt.show()

**What is the observation point here?**
- The user counts per region are not normally (Gaussian) distributed. They fall off sharply after the first few codes, roughly like an exponential decay, so this skew needs to be handled with care.
- There are 9 different region codes.

## Special Day

In [None]:
data['SpecialDay'].value_counts()

In [None]:
# visualizing the distribution of the users around the SpecialDay
plt.rcParams['figure.figsize'] = (18, 7)

plt.subplot(1, 2, 1)
plt.hist(data['SpecialDay'], color = 'lightblue')
plt.title('Distribution of users(Customers)', fontsize = 30)
plt.xlabel('SpecialDay', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

plt.show()

## Bi-Variate Analysis with Visualization
- Administrative duration vs revenue
- Informational duration vs revenue
- product related duration vs revenue
- exit rate vs revenue
- page values vs revenue
- bounce rates vs revenue
- weekend vs Revenue
- Traffic Type vs Revenue
- visitor type vs revenue
- region vs Revenue

### Administrative duration vs Revenue

In [None]:
# boxenplot for Administrative duration vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.boxenplot(x = data['Administrative_Duration'], y = data['Revenue'], palette = 'pastel', orient = 'h')
plt.title('Admin. duration vs Revenue', fontsize = 30)
plt.xlabel('Admin. duration', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)
plt.show()

**What is the observation point here?**
- `Administrative_Duration` is strongly right-skewed (roughly exponential in shape) for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the not purchased (`False`) sessions in `Administrative_Duration`.

## Informational duration vs Revenue

In [None]:
# boxenplot for Informational duration vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.boxenplot(x = data['Informational_Duration'], y = data['Revenue'], palette = 'rainbow', orient = 'h')
plt.title('Info. duration vs Revenue', fontsize = 30)
plt.xlabel('Info. duration', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)

plt.show()

**What is the observation point here?**
- `Informational_Duration` is strongly right-skewed (roughly exponential in shape) for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the not purchased (`False`) sessions in `Informational_Duration`.

## Product Related Duration vs Revenue

In [None]:
# boxen plot product related duration vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.boxenplot(x = data['ProductRelated_Duration'], y = data['Revenue'], palette = 'inferno', orient = 'h')
plt.title('Product Related Duration vs Revenue', fontsize = 30)
plt.xlabel('Product Related Duration', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)
plt.show()

**What is the observation point here?**
- `ProductRelated_Duration` is strongly right-skewed (roughly exponential in shape) for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the not purchased (`False`) sessions in `ProductRelated_Duration`.

## Exit Rates vs Revenue

In [None]:
# boxenplot for exit rates vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.boxenplot(x = data['ExitRates'], y = data['Revenue'], palette = 'dark', orient = 'h')
plt.title('Exit Rates vs Revenue', fontsize = 30)
plt.xlabel('Exit Rates', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)
plt.show()

**What is the observation point here?**
- `ExitRates` looks roughly normally (Gaussian) distributed for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the not purchased (`False`) sessions in `ExitRates`.

## Page Values vs Revenue

In [None]:
# strip plot for page values vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.stripplot(x = data['PageValues'], y = data['Revenue'], palette = 'spring', orient = 'h')
plt.title('Page Values vs Revenue', fontsize = 30)
plt.xlabel('PageValues', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)
plt.show()


**What is the observation point here?**
- `PageValues` is strongly right-skewed (roughly exponential in shape) for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the purchased (`True`) sessions in `PageValues`.
- Most importantly, `PageValues` appears to have a strong influence on whether a session ends in a purchase (`True`).

## Bounce Rates vs Revenue

In [None]:
# strip plot for bounce rates vs revenue
plt.rcParams['figure.figsize'] = (8, 5)

sns.stripplot(x = data['BounceRates'], y = data['Revenue'], palette = 'autumn', orient = 'h')
plt.title('Bounce Rates vs Revenue', fontsize = 30)
plt.xlabel('Bounce Rates', fontsize = 15)
plt.ylabel('Revenue', fontsize = 15)
plt.show()


**What is the observation point here?**
- `BounceRates` is strongly right-skewed (roughly exponential in shape) for both purchased (`True`) and not purchased (`False`) sessions.
- There are also many outliers among the not purchased (`False`) sessions in `BounceRates`.
- `BounceRates` appears to have a strong influence on whether a product is bought or not.

## Weekend vs Revenue

In [None]:
# bar plot for weekend vs Revenue
df = pd.crosstab(data['Weekend'], data['Revenue'])
df.div(df.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['orange', 'crimson'])
plt.title('Weekend vs Revenue', fontsize = 30)
plt.show()

**What is the observation Point here?**
- We see here `Weekend` is also a boolean column. 
- There is nothing significant to describe here.
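The stacked bar plots in this part of the notebook all rely on the same row-wise normalization of a crosstab. As a minimal sketch on made-up toy data (the `toy` frame is hypothetical), the manual `div` step and the `normalize="index"` option of `pd.crosstab` give the same shares:

```python
import pandas as pd

# Toy sessions (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    "Weekend": [False, False, False, False, True, True],
    "Revenue": [False, False, False, True, False, True],
})

counts = pd.crosstab(toy["Weekend"], toy["Revenue"])

# Row-wise normalization as done in the cells of this section:
# every row is divided by its own total, so each row sums to 1
shares_manual = counts.div(counts.sum(axis=1).astype(float), axis=0)

# Built-in alternative that normalizes over each row
shares_builtin = pd.crosstab(toy["Weekend"], toy["Revenue"], normalize="index")

print(shares_manual)
print(shares_builtin)
```

Because every row sums to 1, the bars are comparable even when the groups have very different sizes, which matters here given how imbalanced `Weekend` is.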

## Traffic Type vs Revenue

In [None]:
# bar plot for traffic type vs revenue

df = pd.crosstab(data['TrafficType'], data['Revenue'])
df.div(df.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['lightblue', 'blue'])
plt.title('Traffic Type as Revenue', fontsize = 30)
plt.show()

**What is the observation point here?**
- `TrafficType` is a categorical column.
- The purchase share differs from one category to another. Some traffic types, such as 2, 7, 16 and 20, show a noticeably higher share of purchases.

## Visitor type vs revenue

In [None]:
# bar plot for visitor type vs revenue
df = pd.crosstab(data['VisitorType'], data['Revenue'])
df.div(df.sum(1).astype(float), axis=0).plot(kind = 'bar', stacked = True, figsize =(15, 5), color = ['lightgreen', 'green'])
plt.title('Visitor Type vs Revenue', fontsize = 30)
plt.show()

**What is the observation point here?**
- `VisitorType` is also a categorical column.
- The purchase share differs between the categories. New_Visitor sessions show a higher share of purchases.

## Region vs Revenue

In [None]:
# bar plot for region vs revenue

df = pd.crosstab(data['Region'], data['Revenue'])
df.div(df.sum(1).astype(float), axis=0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['pink', 'yellow'])
plt.title('Region vs Revenue', fontsize = 30)
plt.show()

**What is the observation point here?**
- `Region` is also a categorical column.
- In this visualization, the purchase share is almost the same in every region.

## Multi-variate Analysis
- month vs pagevalues w.r.t. revenue
- month vs exitrates w.r.t. revenue
- month vs bounceRates w.r.t. Revenue
- visitor type vs BounceRates w.r.t. revenue
- visitor type vs page values w.r.t. revenue
- visitor type vs exit rates w.r.t. revenue
- region vs pagevalues w.r.t. revenue
- region vs exit rates w.r.t. revenue

### month vs pagevalues w.r.t. revenue

In [None]:
# boxplot for month vs pagevalues w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)
sns.boxplot(x = data['Month'], y = data['PageValues'], hue = data['Revenue'], palette = 'spring')
plt.title('month vs pagevalues w.r.t. revenue', fontsize = 25)
plt.show()

**What is the observation point here?**
- `PageValues` per `Month` looks roughly normally (Gaussian) distributed for sessions where users purchased a product.
- There are also a lot of outliers here.

### month vs exitrates w.r.t. revenue

In [None]:
# boxplot for month vs exitrates w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)
sns.boxplot(x = data['Month'], y = data['ExitRates'], hue = data['Revenue'], palette = 'inferno')
plt.title('month vs exitrates w.r.t. revenue', fontsize = 25)
plt.show()

**What is the observation point here?**
- `ExitRates` per `Month` looks roughly normally (Gaussian) distributed, both for sessions with and without a purchase.
- There are also a lot of outliers here.

### month vs bounceRates w.r.t. Revenue

In [None]:
# boxplot for month vs bounceRates w.r.t. Revenue
plt.rcParams['figure.figsize'] = (10, 7)

sns.boxplot(x = data['Month'], y = data['BounceRates'], hue = data['Revenue'], palette = 'autumn')
plt.title("month vs bounceRates w.r.t. Revenue", fontsize = 25)
plt.show()

**What is the observation point here?**
- `BounceRates` per `Month` looks roughly normally (Gaussian) distributed for purchasing sessions in some months, while in other months it looks exponentially distributed.
- There are also a lot of outliers here.

## VisitorType vs BounceRates w.r.t. revenue

In [None]:
# boxplot for visitorType vs BounceRates w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)

sns.boxplot(x = data['VisitorType'], y = data['BounceRates'], hue = data['Revenue'], palette = 'pastel')
plt.title('visitor type vs BounceRates w.r.t. revenue', fontsize = 25)
plt.show()

**What is the observation point here?**
- `BounceRates` per `VisitorType` looks roughly normally (Gaussian) distributed for Returning_Visitor sessions that ended in a purchase, while New_Visitor and Other sessions look exponentially distributed.
- There are also a lot of outliers here.

## visitor type vs page values w.r.t. revenue

In [None]:
# violin plot for visitor type vs page values w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)

sns.violinplot(x = data['VisitorType'], y = data['PageValues'], hue = data['Revenue'], palette = 'Reds')
plt.title('visitor type vs page values w.r.t. revenue', fontsize = 25)
plt.show()

**What is the observation point here?**
- `PageValues` per `VisitorType` looks exponentially distributed, both for sessions with and without a purchase.
- There are also a lot of outliers here.

## visitor type vs exit rates w.r.t. revenue

In [None]:
# violin plot for visitor type vs exit rates wrt revenue
plt.rcParams['figure.figsize'] = (10, 7)

sns.violinplot(x = data['VisitorType'], y = data['ExitRates'], hue = data['Revenue'], palette = 'Purples')
plt.title('visitor type vs exit rates w.r.t. revenue', fontsize = 25)
plt.show()

**What is the observation point here?**
- `ExitRates` per `VisitorType` looks exponentially distributed, both for sessions with and without a purchase.
- There are a lot of outliers for Returning_Visitor sessions.

## region vs pagevalues w.r.t. revenue

In [None]:
# violin plot for region vs pagevalues w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)

sns.violinplot(x = data ['Region'], y = data['PageValues'], hue = data['Revenue'])
plt.title('region vs pagevalues w.r.t. revenue', fontsize = 25)
plt.show()


**What is the observation point here?**
- `PageValues` per `Region` looks exponentially distributed, both for sessions with and without a purchase.
- There are also a lot of outliers here.

## region vs exit rates w.r.t. revenue

In [None]:
# violin plot for region vs exit rates w.r.t. revenue
plt.rcParams['figure.figsize'] = (10, 7)
sns.violinplot(x = data['Region'], y = data['ExitRates'], hue = data['Revenue'], palette = 'Greens')
plt.title("region vs exit rates w.r.t. revenue", fontsize = 25)
plt.show()

**What is the observation point here?**
- `ExitRates` per `Region` looks roughly normally (Gaussian) distributed, both for sessions with and without a purchase.
- There are also a lot of outliers here.

### Summary table of multivariate Feature Analysis

In [None]:
multivariate_feature_analysis = [
    ['month vs pagevalues', 'Revenue', 'Gaussian', 'High', 'Low', 'Low', 'High'],
    ['month vs exitrates' , 'Revenue', 'Gaussian', 'Low', 'High', 'Medium', 'Medium'],
    ['month vs bounceRates' , 'Revenue', 'Gaussian', 'Low', 'High', 'Medium', 'High'],
    ['visitor type vs BounceRates' , 'Revenue', 'Exponential', 'Low', 'High', 'Low', 'High'],
    ['visitor type vs page values' , 'Revenue', 'Exponential', 'Low', 'High', 'High', 'Medium'],
    ['visitor type vs exit rates', 'Revenue', 'Exponential', 'High', 'Low', 'High', 'Medium'],
    ['region vs pagevalues', 'Revenue', 'Exponential', 'Low', 'High', 'High', 'High'],
    ['region vs exit rates', 'Revenue', 'Gaussian', 'High', 'High', 'High', 'Medium']
]
feature_summary = pd.DataFrame(multivariate_feature_analysis, columns=['Multivariate_features', 'W.R.T', 'Distribution', 'Revenue_True', 'Revenue_False', 'Outliers', 'Importance'])
feature_summary

## Statistical Tests
- categorical column vs target column
- Numerical column vs target column

### categorical column vs target column

In [None]:
cat_cols=['Administrative','Informational','ProductRelated','Month','OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'SpecialDay']

A chi-squared test, also written as χ² test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. 

#### chi2_contingency
Chi-square test of independence of variables in a contingency table.

This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed. The expected frequencies are computed based on the marginal sums under the assumption of independence; see `scipy.stats.contingency.expected_freq`. The number of degrees of freedom is (expressed using numpy functions and attributes):

dof = observed.size - sum(observed.shape) + observed.ndim - 1
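To make the expected frequencies and the degrees-of-freedom formula concrete, here is a minimal sketch that recomputes the Pearson statistic by hand with NumPy for a two-dimensional table. The helper name `chi_square_independence` and the toy table are hypothetical, and the sketch applies no continuity correction.

```python
import numpy as np

def chi_square_independence(observed):
    """Pearson chi-square statistic, degrees of freedom and expected counts
    for a 2-D contingency table (no continuity correction)."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, cols)
    # expected counts under independence: row total * column total / grand total
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = observed.size - sum(observed.shape) + observed.ndim - 1
    return statistic, dof, expected

# Toy 2x3 table (rows: Revenue False/True, columns: three made-up categories)
toy = [[30, 20, 10],
       [10, 20, 30]]
stat, dof, expected = chi_square_independence(toy)
print(stat, dof)
print(expected)
```

For this table every expected count is 20, the statistic is 20 and `dof` is 2, which equals (rows - 1) * (columns - 1). Note that `chi2_contingency` applies Yates' continuity correction by default when there is only one degree of freedom, so for 2x2 tables its statistic will differ from this uncorrected version.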

In [None]:
# check whether Revenue is influenced by each categorical column
# Null Hypothesis, H0 = the proportion of revenue is the same across the categories
# Alternative Hypothesis, H1 = the proportion of revenue differs in at least two categories
scol = []
spval = []
ss = []
for n in cat_cols:
    scol.append(n)
    cp = chi2_contingency(pd.crosstab(data[n], data['Revenue']))[1]
    spval.append(round(cp, 4))
    if cp < 0.05:
        # reject the Null Hypothesis
        ss.append('*') # significant
    else:
        # fail to reject the Null Hypothesis
        ss.append('**') # not significant
        

In [None]:
pd.DataFrame({'Feature': scol, 'P-Value': spval, 'Significance': ss})

### Numerical Column vs Target

In [None]:
numerical_columns=['Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates', 'ExitRates', 'PageValues']

#### Shapiro-Wilk test
The null hypothesis for the Shapiro-Wilk test is that `a variable is normally distributed in some population.`
A different way to say the same is that a variable’s values are a simple random sample from a normal distribution. As a rule of thumb, we
reject the null hypothesis if `p < 0.05`.

#### MannWhitneyu
Compute the Mann-Whitney rank test on samples x and y.
### Mann-Whitney rank test
In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
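The link between the U statistic and the probability of X being greater than Y can be shown with a brute-force sketch. This is only an illustration of the definition on made-up numbers, with ties counted as one half. The function name `mann_whitney_u` is hypothetical and this is not how `scipy.stats.mannwhitneyu` is implemented.

```python
def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj; ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up samples, for illustration only
x = [1.0, 3.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0]

u_x = mann_whitney_u(x, y)
u_y = mann_whitney_u(y, x)

# U divided by the number of pairs estimates P(X > Y) (plus half of P(X == Y))
p_x_greater = u_x / (len(x) * len(y))
print(u_x, u_y, p_x_greater)  # 3.0 9.0 0.25
```

The two statistics always add up to the number of pairs, here 3 * 4 = 12. Under the null hypothesis each would be expected near half of that, so a value of 3 against 9 points toward X tending to be smaller than Y.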

#### levene (scipy.stats)
Perform Levene test for equal variances.

The Levene test tests the null hypothesis that all input samples are from populations with equal variances. Levene’s test is an alternative to Bartlett’s test (`bartlett`) in the case where there are significant deviations from normality.

### Levene's test
In statistics, Levene's test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures assume that variances of the populations from which different samples are drawn are equal. Levene's test assesses this assumption. 

In [None]:
from scipy.stats import levene, ttest_ind

# Two-sample comparison: t-test when its assumptions hold, Mann-Whitney U otherwise
tcol = []
tpval = []
ts = []
for n in numerical_columns:
    tcol.append(n)
    # splitting into 2 groups (Revenue = False, Revenue = True)
    g1 = data[n][data['Revenue'] == False]
    g2 = data[n][data['Revenue'] == True]
    # Test for normality (Shapiro-Wilk test)
    # H0: Data is normal
    # H1: Data is not normal
    # if p < 0.05 --- reject Null Hypothesis
    s, p = shapiro(g1)
    s1, p1 = shapiro(g2)
    normal = p > 0.05 and p1 > 0.05
    # Test for equal variances (Levene test), only needed if both groups look normal
    equal_var = False
    if normal:
        w, lp = levene(g1, g2)
        equal_var = lp > 0.05
    if normal and equal_var:
        # both assumptions hold, so the two-sample t-test applies
        stat, pval = ttest_ind(g1, g2)
    else:
        # normality or equal variance fails, so use the non-parametric Mann-Whitney U test
        stat, pval = mannwhitneyu(g1, g2)
    # always record exactly one p-value and one flag per feature
    tpval.append(round(pval, 4))
    if pval < 0.05:
        ts.append('*')  # significant
    else:
        ts.append('**') # not significant
        

In [None]:
pd.DataFrame({'Feature': tcol, 'P-Value': tpval, 'Significance': ts})

## Outliers

In [None]:
plt.figure(figsize=(62, 20))
data.boxplot();

In [None]:
# identify outliers with the 3-standard-deviation rule
from numpy import mean
from numpy import std
out_per = []
for i in numerical_columns:
    data_mean, data_std = mean(data[i]), std(data[i])

    # define the cut-off at 3 standard deviations from the mean
    cut_off = data_std * 3
    lower, upper = data_mean - cut_off, data_mean + cut_off
    print(i, ': \n')

    # identify outliers
    outliers = [x for x in data[i] if x < lower or x > upper]

    num_out = len(outliers)
    print('Identified outliers: %d' % num_out)
    # keep the observations that fall inside the cut-off
    outliers_removed = [x for x in data[i] if x >= lower and x <= upper]
    num_nout = len(outliers_removed)
    print('Non-outlier observations: %d' % num_nout)
    outlier_percent = (num_out / (num_out + num_nout)) * 100
    print('Percent of outliers:', outlier_percent, '\n')
    out_per.append(outlier_percent)

# Visualization of Outliers

In [None]:
Outliers = pd.DataFrame({'Feature': numerical_columns, '% Of Outliers': out_per})
outlier_sorted = Outliers.sort_values('% Of Outliers', ascending = False)
outlier_sorted

In [None]:
plt.rcParams['figure.figsize'] = (8, 5)
sns.barplot(y = outlier_sorted['Feature'], x = outlier_sorted['% Of Outliers'], palette = 'GnBu_d')
plt.title('Percent of Outliers by column')
plt.ylabel('Column Name')
plt.show()

# Clustering Analysis
> **Trying to learn the user characteristics in terms of time spent on the website**
- Administrative Duration vs Bounce Rate
- Informative Duration vs Bounce Rates
- Administrative Duration vs Exit Rates

> **Where from the Users of the Website come?**
- Region vs Traffic Type
- Administrative Duration vs Region

**The Elbow Method to find the optimal number of clusters**
- Run a clustering algorithm (e.g., K-Means) for different values of k, for instance by varying k from 1 to 10 clusters.
- For each k, calculate the total within-cluster sum of squares (WCSS).
- Plot the curve of WCSS against the number of clusters k.
- The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
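The WCSS used in these steps is the quantity that scikit-learn's `KMeans` exposes as `inertia_`. As a minimal sketch of what it measures, the hypothetical helper `wcss` below computes it with NumPy for a given assignment of points to clusters, on made-up toy points:

```python
import numpy as np

def wcss(points, labels):
    """Within-cluster sum of squares: squared distance of every point to the
    mean (centroid) of the cluster it is assigned to, summed over all points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data: two well-separated pairs of points
pts = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
one_cluster = wcss(pts, [0, 0, 0, 0])
two_clusters = wcss(pts, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # 104.0 4.0
```

Going from one cluster to two drops the WCSS from 104 to 4 in this toy example. Adding further clusters can only reduce it by at most the remaining 4, and that flattening is the bend the elbow method looks for.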


In [None]:
# Imputing Missing Values with 0
data.fillna(0, inplace = True)

#checking the no. of null values after imputing
data.isnull().sum().sum()

## Administrative Duration vs Bounce Rates

In [None]:
# Time spent by the users on the website vs bounce rates
# let's cluster Administrative duration and bounce rates into different groups.
# preparing the dataset: columns 1 and 6 are Administrative_Duration and BounceRates
x = data.iloc[:, [1, 6]].values

# checking the shape of the dataset
print("Shape of the dataset: ", x.shape)

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
               init = 'k-means++',
               max_iter = 200,
               n_init = 10, 
               random_state = 0,
               algorithm = 'elkan',
               tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)
    
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.xlabel('No. of Clusters')
plt.ylabel('within-cluster sum of square')
plt.show()

**What is the observation point here?**
- According to the above plot, the maximum bend is at k = 3, so the optimal number of clusters for Administrative Duration and Bounce Rates is 3. The next step is to plot the clusters.


### Visualizing the Cluster using scatter plot

In [None]:
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 200, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'orange', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'red', label = 'General Customers')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'cyan', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Administrative Duration vs BounceRates', fontsize = 20)
plt.grid()
plt.xlabel('Administrative Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()

**What is the observation point here?**
- This clustering plot suggests that customers who spend longer on administrative pages are much less likely to bounce, that is, to navigate away from the website after viewing just one page.
- There are three groups. The group of customers with the shortest administrative duration has the highest chance of navigating away from the website.
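One caveat when reading these clusters: duration values run into the hundreds or thousands while bounce rates stay below 1, so the Euclidean distance used by K-means is dominated by duration. The sketch below shows one common way to put both features on a comparable scale before clustering. It uses synthetic stand-in data, not the notebook's `data`, and whether scaling changes the clusters for this dataset is an assumption that would need checking.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins: durations in seconds, bounce rates between 0 and 0.2.
duration = rng.exponential(scale=80.0, size=300)
bounce = rng.uniform(0.0, 0.2, size=300)
features = np.column_stack([duration, bounce])

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

km_scaled = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels_scaled = km_scaled.fit_predict(scaled)

# Number of points assigned to each cluster.
print(np.bincount(labels_scaled))
```

The cluster centres returned after scaling are in standardized units, so they would need to be transformed back before being plotted on the original axes.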


## Informational Duration vs Bounce Rates

In [None]:
# cluster analysis of Informational Duration vs Bounce Rates
x = data.iloc[:, [3, 6]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
                init = 'k-means++',
                max_iter = 200,
                n_init = 10,
                random_state = 0,
                algorithm = 'elkan',
                tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)
    
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('within-cluster sum of square')
plt.show()

**What is the observation point here?**
- According to the above plot, the sharpest bend is at 2 clusters, so the optimal number of clusters for Informational Duration and Bounce Rates is 2. The next step is to plot the clusters.


In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 200, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'cyan', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Informational Duration vs Bounce Rates', fontsize = 20)
plt.grid()
plt.xlabel('Informational Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()

**What is the observation point here?**
- This clustering plot suggests that customers who spend longer on informational pages are much less likely to bounce, that is, to navigate away from the website after viewing just one page.
- There are two groups. The pink group contains the customers who stay for the shortest informational duration and have the highest chance of navigating away from the website.


## Administrative Duration vs Exit Rates

In [None]:
# cluster analysis of Administrative Duration vs Exit Rates
x = data.iloc[:, [1, 7]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
               init = 'k-means++',
               max_iter = 200,
               n_init = 10, 
               random_state = 0,
               algorithm = 'elkan',
               tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)
    
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()

**What is the observation point here?**
- According to the above plot, the sharpest bend is at 3 clusters, so the optimal number of clusters for Administrative Duration and Exit Rates is 3. The next step is to plot the clusters.


In [None]:
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 200, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'magenta', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'cyan', label = 'General Customers')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'pink', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Administrative Duration vs Exit Rates', fontsize = 20)
plt.grid()
plt.xlabel('Administrative Duration')
plt.ylabel('Exit Rates')
plt.legend()
plt.show()

**What is the observation point here?**
- This clustering plot suggests that customers who spend longer on administrative pages are much less likely to exit, that is, to navigate away from the website.
- There are three groups. The magenta group contains the customers who stay for the shortest administrative duration and have the highest chance of navigating away from the website.


## Where Do the Users of the Website Come From?

## Region vs Traffic Type

In [None]:
# Region vs TrafficType clustering
x = data.iloc[:, [13, 14]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
                init = 'k-means++',
                max_iter = 200,
                n_init = 10,
                random_state = 0,
                algorithm = 'elkan',
                tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)
    
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('within-cluster sum of squares')
plt.show()

**What is the observation point here?**
- According to the above plot, the maximum bend at index 2, that is number of optimal no. of Clusters for Region and Traffic Type is 2. Let's go the next step, i.e., Plotting the Clusters.


In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 200, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'lightgreen', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Region vs Traffic Type', fontsize = 20)
plt.xlabel('Region')
plt.ylabel('Traffic')
plt.legend()
plt.grid()
plt.show()

**What is the observation point here?**
- From this clustering plot, we can say that customers from Regions 2, 4 and 5 generate less traffic than customers from the other regions.
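Region and Traffic Type are category codes rather than measurements, so a plain count table can be a useful cross-check on what the scatter plot shows. The sketch below uses a small made-up frame named `toy`, not the notebook's `data`, and assumes the columns are called `Region` and `TrafficType`.

```python
import pandas as pd

# Made-up sessions for illustration; the real column values may differ.
toy = pd.DataFrame({
    'Region': [1, 1, 2, 3, 3, 3],
    'TrafficType': [2, 2, 1, 2, 3, 3],
})

# Rows are regions, columns are traffic types, cells are session counts.
counts = pd.crosstab(toy['Region'], toy['TrafficType'])
print(counts)
```

Applied to the real columns, the row totals of such a table would show directly which regions contribute the fewest sessions.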

## Administrative Duration vs Region

In [None]:
# cluster analysis of Administrative Duration vs Region
x = data.iloc[:, [1, 13]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
                init = 'k-means++',
                max_iter = 300,
                n_init = 10,
                random_state = 0,
                algorithm = 'elkan',
                tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)
    
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('Within cluster sum of the square')
plt.show()

**What is the observation point here?**
- According to the above plot, the sharpest bend is at 2 clusters, so the optimal number of clusters for Administrative Duration and Region is 2. The next step is to plot the clusters.


In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'cyan', label = 'Unproductive Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'magenta', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Administrative Duration vs Region', fontsize = 20)
plt.grid()
plt.xlabel('Administrative Duration')
plt.ylabel('Region Type')
plt.legend()
plt.show()

**What is the observation point here?**
- This clustering plot suggests that customers who spend longer on administrative pages are very unlikely to come from Region types 2 and 4.

# Data Preprocessing
- One Hot and Label Encoding
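As a quick illustration of what these two encodings do, here is a minimal sketch on a made-up three-row frame (`toy` is hypothetical, not the notebook's `data`). One-hot encoding expands a string column into one indicator column per category, while label encoding maps each class to an integer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Month': ['Feb', 'Mar', 'Feb'],
    'Revenue': [False, True, False],
})

# One-hot encode the string column: Month becomes Month_Feb and Month_Mar.
encoded = pd.get_dummies(toy, columns=['Month'])

# Label encode the boolean target: False -> 0, True -> 1.
encoded['Revenue'] = LabelEncoder().fit_transform(encoded['Revenue'])
print(encoded)
```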

In [None]:
data.info()

In [None]:
# one hot encoding
df1 = pd.get_dummies(data)
df1.head()

In [None]:
df1.info()

In [None]:
# Label encoding of revenue
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1['Revenue'] = le.fit_transform(df1['Revenue'])
df1['Revenue'].value_counts()

In [None]:
df1['Weekend'].value_counts()

In [None]:
# Label encoding of weekend

df1['Weekend'] = le.fit_transform(df1['Weekend'])
df1['Weekend'].value_counts()

In [None]:
# Splitting dependent and independent variables(columns)
y = df1['Revenue']
x = df1.drop(['Revenue'], axis = 1)

# checking the shapes
print("Shape of x: ", x.shape)
print("Shape of y: ", y.shape)

In [None]:
# Splitting of the Data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size =  0.2, random_state = 0)

# checking the shapes

print("Shape of x_train :", x_train.shape)
print("Shape of y_train :", y_train.shape)
print("Shape of x_test :", x_test.shape)
print("Shape of y_test :", y_test.shape)

# Modelling: Support Vector Machine

In [None]:
# model define and training
model = svm.SVC()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

# evaluating the model
print("Training Accuracy: ", model.score(x_train, y_train))
print("Testing Accuracy: ", model.score(x_test, y_test))

The model just built with a support vector machine gives a training accuracy of 99.5% and a testing accuracy of 82.85%.
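A large gap between training and testing accuracy often points to overfitting, and kernel SVMs in particular are sensitive to feature scale. The sketch below shows one way to standardize the features inside a pipeline before fitting `SVC`. It runs on synthetic data from `make_classification`, so the scores it prints say nothing about this dataset; whether scaling narrows the gap here is an assumption that would need to be tested on `x_train` and `x_test`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (roughly 85% negative, 15% positive).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

train_acc = scaled_svm.score(X_tr, y_tr)
test_acc = scaled_svm.score(X_te, y_te)
print("Training Accuracy: ", train_acc)
print("Testing Accuracy: ", test_acc)
```

Putting the scaler in the pipeline keeps test-set statistics out of the training step, which is the main reason to prefer it over scaling the whole frame up front.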

In [None]:
# confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (6, 6)
# fmt = 'd' prints the counts as plain integers instead of scientific notation
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()

In [None]:
# classification report
cr = metrics.classification_report(y_test, y_pred)
print(cr)

# Modelling: Naive Bayes

In [None]:
# model define and training
model = GaussianNB()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

# evaluating the model
print("Training Accuracy: ", model.score(x_train, y_train))
print("Testing Accuracy: ", model.score(x_test, y_test))

The model just built with Naive Bayes (Gaussian) gives a training accuracy of 79.83% and a testing accuracy of 79.03%.

In [None]:
# confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (6, 6)
# fmt = 'd' prints the counts as plain integers instead of scientific notation
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()

In [None]:
# classification report
cr = metrics.classification_report(y_test, y_pred)
print(cr)

The two models above show different levels of accuracy. Although the support vector machine gave higher accuracy, I would choose the Naive Bayes algorithm. The heatmap of the confusion matrix shows that the support vector machine does not identify any revenue-generating session: its precision and recall for the Revenue = True class are both 0.00, so every actual purchase is counted as a false negative. In other words, the support vector machine does not give a useful solution here.
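To see why accuracy alone can mislead on imbalanced classes, consider a classifier that always predicts "no purchase". The sketch below uses ten made-up labels (not the notebook's `y_test`) and reads the four confusion-matrix cells directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: 8 sessions without a purchase, 2 with one.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that always predicts "no purchase".
y_always_negative = np.zeros_like(y_true)

cm_toy = confusion_matrix(y_true, y_always_negative)
# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm_toy.ravel()

accuracy = accuracy_score(y_true, y_always_negative)
recall_positive = recall_score(y_true, y_always_negative, zero_division=0)

print(cm_toy)
print("accuracy:", accuracy, "recall for the positive class:", recall_positive)
```

Accuracy stays high because the majority class dominates, while recall for the positive class is zero because both purchases are missed.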

## Summary Table Based on the Two Learning Algorithms

In [None]:
precision = ['Naive Bayes', 'precision', 0.91, 0.42]
recall = ['Naive Bayes', 'recall', 0.83, 0.61]
f1_score = ['Naive Bayes', 'f1_score', 0.87, 0.50 ]
precision2 = ['Support Vector Machine', 'precision', 0.83, 0.00 ]
recall2 = ['Support Vector Machine', 'recall', 1.0, 0.00 ]
f1_score2 = ['Support Vector Machine', 'f1_score', 0.91, 0.00 ]
table = pd.DataFrame([precision, precision2, recall, recall2, f1_score, f1_score2])
table.columns = ['model_name', 'metrics', 'Is_Revenue(False)', 'Is_Revenue(True)']
table

In [None]:
table.hist()