<hr>

# Exploratory Data Analysis (EDA)

<style>
h1 {
    text-align: center;
    color: orange;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<hr>

## 🛍️ Online Shoppers Purchasing Intention Dataset

Dataset source: Kaggle – Online Shoppers Intention Dataset (https://www.kaggle.com/datasets/henrysue/online-shoppers-intention/data)

---

## 1. 📌 Project Objective

The objective of this analysis is to explore the **Online Shoppers Purchasing Intention Dataset** to understand user behavior during online shopping sessions and identify factors that influence purchase decisions (`Revenue`).

The dataset contains **12,330 user sessions** and **18 columns**: 17 features describing session behavior and context, plus the Boolean target `Revenue`, which indicates whether a purchase was made (`True`) or not (`False`).

---

## 2. 📊 Dataset Overview


### 🔢 Dataset Shape
- Rows: 12,330 sessions
- Columns: 18 (17 features plus the target)


### 🎯 Target Variable
- `Revenue` (Boolean)
  - `True` → Purchase completed
  - `False` → No purchase


### 🧾 Feature Categories

#### 🔹 Numerical Features
- `Administrative`
- `Administrative_Duration`
- `Informational`
- `Informational_Duration`
- `ProductRelated`
- `ProductRelated_Duration`
- `BounceRates`
- `ExitRates`
- `PageValues`
- `SpecialDay`

#### 🔹 Categorical Features
- `OperatingSystems`
- `Browser`
- `Region`
- `TrafficType`
- `VisitorType`
- `Weekend`
- `Month`
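
Four of the categorical features (`OperatingSystems`, `Browser`, `Region`, `TrafficType`) are stored as integer codes, so pandas treats them as numeric unless told otherwise. A minimal sketch of making the split explicit, assuming the standardized column names that appear in the loaded CSV later in this notebook (`os`, `browser`, and so on). The helper name `cast_categoricals` is hypothetical, and the two-row frame is copied from the `df.head()` output only to keep the example self-contained:

```python
import pandas as pd

# Standardized names, as they appear in the loaded CSV later in this notebook.
NUMERICAL = ["admin", "admin_duration", "info", "info_duration", "prod_related",
             "prod_related_duration", "bounce_rate", "exit_rate", "page_value",
             "special_day"]
CATEGORICAL = ["os", "browser", "region", "traffic_type", "visitor_type",
               "weekend", "month"]


def cast_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every categorical feature present has dtype 'category'."""
    present = [col for col in CATEGORICAL if col in df.columns]
    return df.astype({col: "category" for col in present})


# Two rows taken from the df.head() output, reduced to a few columns.
sample = pd.DataFrame({
    "prod_related": [1, 2],
    "month": ["Feb", "Feb"],
    "os": [1, 2],
    "visitor_type": ["Returning_Visitor", "Returning_Visitor"],
    "weekend": [False, False],
})
typed = cast_categoricals(sample)
print(typed.dtypes)
```

Casting the integer-coded columns keeps `describe()` and correlation matrices from treating a browser or region code as a quantity.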

---

## 3. 🔍 Data Cleaning

- Standardized column names.
- One-hot encoded categorical columns.
- Checked for missing values.
- Verified data types.
- Confirmed no duplicate rows.
- Validated target distribution.

**Observation:**  
The dataset does not contain null values.
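
The checks listed above can be collected in one small helper. This is a sketch rather than the code used for this notebook: the function name `quality_report` and the toy frame are illustrative assumptions, and `revenue` is assumed to be the standardized target column name.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, target: str = "revenue") -> dict:
    """Summarize missing values, duplicate rows, dtypes, and target shares."""
    return {
        "missing_per_column": df.isnull().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy frame for illustration only (not the real dataset).
toy = pd.DataFrame({
    "prod_related": [1, 2, 10, 53],
    "bounce_rate": [0.2, 0.0, 0.02, 0.007143],
    "revenue": [False, False, False, True],
})
report = quality_report(toy)
print(report["missing_per_column"])
print(report["n_duplicate_rows"])
print(report["target_share"])
```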

---

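## 4. 📈 Univariate Analysis

### 4.1 Numerical Features

- Most sessions have low values for `Administrative` and `Informational` pages (medians of 1 and 0, respectively).
- `ProductRelated` pages have higher counts than the other page types (median 18, mean about 31.7).
- `ProductRelated_Duration` is strongly right-skewed: the mean (about 1,195) is roughly twice the median (about 599).
- `BounceRates` and `ExitRates` are heavily right-skewed, with medians of about 0.003 and 0.025 against a maximum of 0.2.
- `PageValues` is zero for most sessions (the 75th percentile is 0), but significantly higher when a purchase occurs.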
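
One way to put numbers on the skewness observations is to compare mean and median and to measure the share of exact zeros per column. The sketch below assumes the standardized column names. The helper name `skew_summary` is hypothetical, and the ten rows are the ones shown in the `df.head()` and `df.tail()` outputs of this notebook, so the printed values describe that small sample only:

```python
import pandas as pd


def skew_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Per-column skewness, mean, median, and share of zero values."""
    numeric = df[columns]
    return pd.DataFrame({
        "skew": numeric.skew(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "share_zero": (numeric == 0).mean(),
    })


# The ten rows shown by df.head() and df.tail(), three columns only.
rows = pd.DataFrame({
    "prod_related_duration": [0.0, 64.0, 0.0, 2.666667, 627.5,
                              1783.791667, 465.75, 184.25, 346.0, 21.25],
    "bounce_rate": [0.2, 0.0, 0.2, 0.05, 0.02,
                    0.007143, 0.0, 0.083333, 0.0, 0.0],
    "page_value": [0.0, 0.0, 0.0, 0.0, 0.0,
                   12.241717, 0.0, 0.0, 0.0, 0.0],
})
summary = skew_summary(rows, list(rows.columns))
print(summary.round(3))
```

A positive `skew` together with a mean well above the median is the usual signature of a long right tail.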

### 4.2 Categorical Features

- The majority of sessions come from **returning visitors**.
- Most traffic occurs in specific months (e.g., May, November).
- Most sessions occur on weekdays rather than weekends.
- Some browsers and operating systems dominate traffic.
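
Category shares such as these are typically read off `value_counts(normalize=True)`. A small sketch, assuming the standardized column names; `category_shares` is a hypothetical helper, and the ten rows come from the `df.head()` and `df.tail()` outputs of this notebook, so the shares are not representative of the full dataset:

```python
import pandas as pd


def category_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows per category, sorted from most to least frequent."""
    return df[column].value_counts(normalize=True).rename("share")


# The ten rows shown by df.head() and df.tail(), categorical columns only.
rows = pd.DataFrame({
    "month": ["Feb"] * 5 + ["Dec"] + ["Nov"] * 4,
    "visitor_type": ["Returning_Visitor"] * 9 + ["New_Visitor"],
    "weekend": [False, False, False, False, True, True, True, True, False, True],
})
shares = {col: category_shares(rows, col) for col in rows.columns}
for col, share in shares.items():
    print(col)
    print(share)
```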

---

## 5. 📊 Target Variable Analysis

### Class Distribution (Binary)

- Approximately 84–85% of sessions result in **no purchase**.
- Approximately 15–16% result in a **purchase**.

**Conclusion:**  
The dataset is imbalanced, which should be considered during modeling.
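
A practical consequence of this imbalance is that a model which always predicts "no purchase" already reaches the majority-class share as its accuracy. The sketch below shows the calculation on a toy target with an 85/15 split chosen to mirror the proportions above; the helper name `class_balance` is hypothetical and the data is synthetic:

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and share of each class in a binary target."""
    counts = target.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Synthetic target: 17 non-purchases and 3 purchases (85% / 15%).
toy_target = pd.Series([False] * 17 + [True] * 3, name="revenue")
balance = class_balance(toy_target)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = float(balance["share"].max())
print(balance)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any classifier trained on this data should therefore be judged against that baseline rather than against 50%.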

---

## 6. 🔎 Bivariate Analysis

### 6.1 Numerical Features vs Revenue

- `ProductRelated_Duration` is significantly higher for sessions that ended in purchase.
- `PageValues` is strongly associated with purchases.
- Lower `BounceRates` and `ExitRates` are associated with higher purchase probability.
- `SpecialDay` does not show a strong influence.
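
Comparisons like these can be produced by grouping on the target and aggregating each numerical feature. A minimal sketch, assuming the standardized column names; `compare_by_target` is a hypothetical helper and the toy values are synthetic, chosen only to show the shape of the output:

```python
import pandas as pd


def compare_by_target(df: pd.DataFrame, columns: list,
                      target: str = "revenue") -> pd.DataFrame:
    """Mean and median of each feature, split by the target value."""
    return df.groupby(target)[columns].agg(["mean", "median"])


# Synthetic sessions for illustration only.
toy = pd.DataFrame({
    "prod_related_duration": [0.0, 64.0, 627.5, 1783.8, 346.0, 2100.0],
    "page_value": [0.0, 0.0, 0.0, 12.2, 0.0, 30.0],
    "bounce_rate": [0.20, 0.00, 0.02, 0.007, 0.00, 0.00],
    "revenue": [False, False, False, True, False, True],
})
by_target = compare_by_target(
    toy, ["prod_related_duration", "page_value", "bounce_rate"]
)
print(by_target)
```

Because the features are heavily skewed, the median column is often the more trustworthy of the two when comparing groups.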

### 6.2 Categorical Features vs Revenue

- Returning visitors are more likely to purchase.
- Purchases tend to increase in certain months.
- Weekend effect appears moderate.
- Traffic source may influence purchase likelihood.
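
For categorical features, the quantity of interest is the conversion rate per category, that is, purchases divided by sessions. The sketch below assumes the standardized column names; `conversion_rate` is a hypothetical helper and the six sessions are synthetic:

```python
import pandas as pd


def conversion_rate(df: pd.DataFrame, column: str,
                    target: str = "revenue") -> pd.DataFrame:
    """Sessions, purchases, and conversion rate for each category of `column`."""
    grouped = df.groupby(column)[target].agg(sessions="size", purchases="sum")
    grouped["conversion_rate"] = grouped["purchases"] / grouped["sessions"]
    return grouped.sort_values("conversion_rate", ascending=False)


# Synthetic sessions for illustration only.
toy = pd.DataFrame({
    "month": ["Feb", "Feb", "Nov", "Nov", "Nov", "May"],
    "revenue": [False, False, True, False, True, False],
})
rates = conversion_rate(toy, "month")
print(rates)
```

Reporting `sessions` next to the rate helps avoid over-reading categories that have only a handful of observations.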

---

## 7. 🔥 Correlation Analysis

- Strong correlation between:
  - `ProductRelated` and `ProductRelated_Duration`
  - `BounceRates` and `ExitRates`
- `PageValues` has strong positive correlation with `Revenue`.
- Most other features show moderate to low correlation.
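
A correlation matrix of this kind can be computed with `DataFrame.corr` after casting the Boolean target to a number. The sketch below uses Pearson correlation, which is the pandas default; given the skew noted earlier, passing `method="spearman"` may be a reasonable alternative. The helper name `correlation_with_target` is hypothetical, the data is synthetic, and `bounce_rate` is deliberately constructed as half of `exit_rate` so that the pair is perfectly correlated:

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "revenue",
                            method: str = "pearson"):
    """Return the full correlation matrix and the features ranked by
    their correlation with the target."""
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.corr(method=method)
    ranked = corr[target].drop(target).sort_values(ascending=False)
    return corr, ranked


# Synthetic sessions for illustration only.
toy = pd.DataFrame({
    "exit_rate": [0.20, 0.10, 0.05, 0.02, 0.04, 0.01],
    "page_value": [0.0, 0.0, 0.0, 12.0, 0.0, 30.0],
    "revenue": [False, False, False, True, False, True],
})
toy["bounce_rate"] = toy["exit_rate"] / 2  # constructed to correlate perfectly

corr, ranked = correlation_with_target(toy)
print(corr.round(2))
print(ranked.round(2))
```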

---

## 8. 📌 Key Insights

1. **Product engagement (`PageValues`, `ProductRelated_Duration`) is the strongest indicator of purchase.**
2. **Longer product-page durations are associated with a higher purchase probability.**
3. **Lower bounce and exit rates are associated with conversion.**
4. **Returning visitors are more likely to convert.**
5. The dataset is imbalanced and requires appropriate handling in predictive modeling.

---

## 9. 🚀 Next Steps

- Handle class imbalance (e.g., SMOTE or class weighting).
- Encode categorical variables.
- Train classification models (Logistic Regression, Random Forest, XGBoost).
- Evaluate using Precision, Recall, F1-score, and ROC-AUC.
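
To illustrate why class weighting and the listed metrics matter here, the sketch below computes "balanced" class weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`, and derives precision, recall, and F1 from confusion counts. It is a dependency-light illustration in NumPy, not a replacement for the scikit-learn utilities. The function names are hypothetical and the labels are synthetic:

```python
import numpy as np


def balanced_class_weights(y) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    return {cls.item(): float(y.size / (len(classes) * n))
            for cls, n in zip(classes, counts)}


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (purchase) class."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Synthetic labels: 17 non-purchases and 3 purchases.
y_true = [False] * 17 + [True] * 3
weights = balanced_class_weights(y_true)
print(weights)

# A model that never predicts a purchase is 85% accurate but finds no buyers.
always_negative = [False] * 20
scores = precision_recall_f1(y_true, always_negative)
print(scores)
```

The minority class receives the larger weight, and the always-negative model scores zero on all three metrics despite its high accuracy, which is why accuracy alone is not on the evaluation list.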

---

## 10. Conclusion

The EDA reveals that user engagement metrics, particularly related to product pages and session behavior, are strong predictors of purchasing intention. These insights can help businesses optimize user experience and improve conversion rates.


<hr>

## 0️⃣ IMPORT


<style>
h1 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: black;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)


<hr>

## 0️⃣ DATA READING


<hr>

In [17]:
# Load the dataset
df = pd.read_csv('../data/processed/online_shoppers_intention_01_standard.csv')

<hr>

## 0️⃣ DATA UNDERSTANDING


<hr>

In [None]:
# Basic EDA
# shape of the dataset
print("Shape of the dataset:", df.shape)

# first 5 rows
print("First 5 rows of the dataset:")
display(df.head())

# last 5 rows
print("Last 5 rows of the dataset:")
display(df.tail())


# data types
display(df.dtypes)

# missing values
print("Missing values in each column:")
display(df.isnull().sum())



# summary statistics
display(df.describe())

# columns in the dataset
display(df.columns.to_list())

(12330, 18)

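| | admin | admin_duration | info | info_duration | prod_related | prod_related_duration | bounce_rate | exit_rate | page_value | special_day | month | os | browser | region | traffic_type | visitor_type | weekend | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.5 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |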


| | admin | admin_duration | info | info_duration | prod_related | prod_related_duration | bounce_rate | exit_rate | page_value | special_day | month | os | browser | region | traffic_type | visitor_type | weekend | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | Dec | 4 | 6 | 1 | 1 | Returning_Visitor | True | False |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.75 | 0.0 | 0.021333 | 0.0 | 0.0 | Nov | 3 | 2 | 1 | 8 | Returning_Visitor | True | False |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.25 | 0.083333 | 0.086667 | 0.0 | 0.0 | Nov | 3 | 2 | 1 | 13 | Returning_Visitor | True | False |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.0 | 0.0 | 0.021053 | 0.0 | 0.0 | Nov | 2 | 2 | 3 | 11 | Returning_Visitor | False | False |
| 12329 | 0 | 0.0 | 0 | 0.0 | 3 | 21.25 | 0.0 | 0.066667 | 0.0 | 0.0 | Nov | 3 | 2 | 1 | 2 | New_Visitor | True | False |


admin                      int64
admin_duration           float64
info                       int64
info_duration            float64
prod_related               int64
prod_related_duration    float64
bounce_rate              float64
exit_rate                float64
page_value               float64
special_day              float64
month                     object
os                         int64
browser                    int64
region                     int64
traffic_type               int64
visitor_type              object
weekend                     bool
revenue                     bool
dtype: object

| | admin | admin_duration | info | info_duration | prod_related | prod_related_duration | bounce_rate | exit_rate | page_value | special_day | os | browser | region | traffic_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | 2.124006 | 2.357097 | 3.147364 | 4.069586 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | 0.911325 | 1.717277 | 2.401591 | 4.025169 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 | 2.0 | 2.0 | 1.0 | 2.0 |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 | 2.0 | 2.0 | 3.0 | 2.0 |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 |
| max | 27.0 | 3398.75 | 24.0 | 2549.375 | 705.0 | 63973.52223 | 0.2 | 0.2 | 361.763742 | 1.0 | 8.0 | 13.0 | 9.0 | 20.0 |


['admin',
 'admin_duration',
 'info',
 'info_duration',
 'prod_related',
 'prod_related_duration',
 'bounce_rate',
 'exit_rate',
 'page_value',
 'special_day',
 'month',
 'os',
 'browser',
 'region',
 'traffic_type',
 'visitor_type',
 'weekend',
 'revenue']

<hr>

## 0️⃣ DATA PREPARATION


<hr>

- Dropping irrelevant columns and rows
- Identifying duplicated columns
- Renaming Columns
- Feature Creation
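
The preparation steps above could look roughly like the sketch below when applied to the raw Kaggle file. Several parts are assumptions: the rename mapping is inferred by matching the original feature names in the overview to the standardized names in the loaded CSV, "duplicated columns" is interpreted as duplicated column names, and `prepare`, `drop_columns`, and the `total_pages` feature are hypothetical illustrations rather than steps this notebook is known to perform. The three sample rows use values from the `df.head()` and `df.tail()` outputs.

```python
import pandas as pd

# Inferred mapping from the original column names to the standardized ones.
RENAME_MAP = {
    "Administrative": "admin",
    "Administrative_Duration": "admin_duration",
    "Informational": "info",
    "Informational_Duration": "info_duration",
    "ProductRelated": "prod_related",
    "ProductRelated_Duration": "prod_related_duration",
    "BounceRates": "bounce_rate",
    "ExitRates": "exit_rate",
    "PageValues": "page_value",
    "SpecialDay": "special_day",
    "Month": "month",
    "OperatingSystems": "os",
    "Browser": "browser",
    "Region": "region",
    "TrafficType": "traffic_type",
    "VisitorType": "visitor_type",
    "Weekend": "weekend",
    "Revenue": "revenue",
}


def prepare(raw: pd.DataFrame, drop_columns=()) -> pd.DataFrame:
    """Drop unwanted columns, rename, check for name clashes, add a feature."""
    df = raw.drop(columns=list(drop_columns), errors="ignore")
    df = df.rename(columns=RENAME_MAP)

    # Fail loudly if renaming produced two columns with the same name.
    duplicated_names = df.columns[df.columns.duplicated()].tolist()
    if duplicated_names:
        raise ValueError(f"Duplicated column names: {duplicated_names}")

    # Hypothetical engineered feature: total pages viewed in the session.
    df["total_pages"] = df["admin"] + df["info"] + df["prod_related"]
    return df


# Three sessions with the original column names (subset of columns).
raw = pd.DataFrame({
    "Administrative": [3, 4, 0],
    "Informational": [0, 0, 0],
    "ProductRelated": [53, 15, 1],
    "Revenue": [False, False, False],
})
prepared = prepare(raw)
print(prepared)
```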

<hr>

## 0️⃣ FEATURE UNDERSTANDING


<hr>

### Univariate Analysis
- Plotting Feature Distributions
    - Histogram
    - KDE
    - BoxPlot
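
A possible starting point for these three plots is sketched below using only Matplotlib and NumPy. The KDE is a hand-rolled Gaussian kernel estimate with a normal-reference bandwidth (`1.06 * std * n^(-1/5)`), used here to keep the example dependency-free; `seaborn.kdeplot` would be the more usual choice in this notebook. The function names are hypothetical, and the ten values are the `prod_related_duration` entries from the `df.head()` and `df.tail()` outputs:

```python
import numpy as np
import matplotlib.pyplot as plt


def kde_curve(values, grid):
    """Gaussian kernel density estimate evaluated on `grid`.

    Uses the normal-reference bandwidth h = 1.06 * std * n^(-1/5),
    so `values` must not be constant.
    """
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    h = 1.06 * values.std(ddof=1) * values.size ** (-1 / 5)
    z = (grid[:, None] - values[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (values.size * h * np.sqrt(2 * np.pi))


def plot_distribution(values, name):
    """Histogram with KDE overlay (left) and box plot (right)."""
    values = np.asarray(values, dtype=float)
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(values, bins=10, density=True, alpha=0.6)
    grid = np.linspace(values.min(), values.max(), 200)
    ax_hist.plot(grid, kde_curve(values, grid))
    ax_hist.set_title(f"{name}: histogram + KDE")

    ax_box.boxplot(values)
    ax_box.set_title(f"{name}: box plot")

    fig.tight_layout()
    return fig


# prod_related_duration for the ten rows shown by df.head() and df.tail().
durations = [0.0, 64.0, 0.0, 2.666667, 627.5,
             1783.791667, 465.75, 184.25, 346.0, 21.25]
fig = plot_distribution(durations, "prod_related_duration")
```

In a notebook the returned figure renders inline; for the full dataset, a log-scaled x-axis may make the right-skewed features easier to read.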