# E-Commerce Revenue Prediction System

A machine learning project that predicts whether an online shopping visitor will make a purchase (`Revenue` = TRUE/FALSE) using a k-nearest neighbors classifier. Built with Python and scikit-learn, it processes visitor session data and reports purchase predictions along with performance metrics.

# Problem
E-commerce platforms need to quickly understand which visitors are most likely to make a purchase. Accurately predicting purchase intent from real-time session behavior enables smarter decisions across the business, including:

- Targeted and efficient marketing
- Personalized on-site experiences
- Higher conversion through optimized user flows
- Better allocation of operational and advertising resources

This project builds a model that identifies high-intent shoppers based on their browsing patterns, helping e-commerce teams act proactively rather than reactively.

# Data
The dataset consists of 12,330 e-commerce visitor sessions, each represented as a single row. It includes 17 behavioral and contextual features and a binary target variable (`Revenue`) indicating whether the session resulted in a purchase.

Dataset provided by Sakar, C.O., Polat, S.O., Katircioglu, M. et al., Neural Comput & Applic (2018).

# Evaluation
The model is evaluated using metrics that highlight its ability to correctly identify both purchasing and non-purchasing visitors:

- Sensitivity (True Positive Rate): proportion of actual purchasers correctly identified
- Specificity (True Negative Rate): proportion of non-purchasers correctly identified
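- Correct/Incorrect counts: overall accuracy breakdown

Both rates guard against division by zero: if the test set contains no actual purchasers (or no actual non-purchasers), the corresponding rate is reported as 0.0.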
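
To make the two rates and the zero-denominator guard concrete, here is a small worked sketch. The helper name `rates` and the counts are hypothetical, chosen only so the arithmetic is easy to follow; they are not results from this project.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity), using 0.0 when a denominator is zero."""
    actual_positives = tp + fn   # sessions that really ended in a purchase
    actual_negatives = tn + fp   # sessions that really did not
    sensitivity = tp / actual_positives if actual_positives else 0.0
    specificity = tn / actual_negatives if actual_negatives else 0.0
    return sensitivity, specificity


# Hypothetical counts for illustration only.
sens, spec = rates(tp=30, fn=70, tn=850, fp=50)
print(f"Sensitivity: {sens:.2f}")   # 30 / (30 + 70)
print(f"Specificity: {spec:.2f}")   # 850 / (850 + 50)

# Edge case: no actual positives, so sensitivity falls back to 0.0.
print(rates(tp=0, fn=0, tn=10, fp=0))
```

A sensitivity of exactly 0.0 can therefore mean either that every purchaser was missed or that there were no purchasers to find, so it is worth checking the class counts whenever it appears.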

# Model

This project uses scikit-learn's `KNeighborsClassifier`, a k-nearest neighbors (KNN) classifier, to predict purchase intent. The model is instance-based and non-parametric: it stores the training examples rather than learning explicit parameters. With a default setting of k = 3 (configurable), each prediction is a majority vote among the three training sessions closest to the visitor in feature space.
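
The idea of voting among nearest neighbors can be sketched in a few lines of plain Python. This is a simplified illustration and not the scikit-learn implementation: the function name `knn_predict` and the toy points are invented here, and it assumes Euclidean distance with unweighted votes, which matches the `KNeighborsClassifier` defaults.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-feature data, invented for illustration only.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, query=(0.3, 0.3)))  # near the 0 cluster
print(knn_predict(train_points, train_labels, query=(4.9, 5.0)))  # near the 1 cluster
```

Because nothing is learned ahead of time, all the work happens at prediction time, when distances to the stored sessions are computed.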

In [25]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

TEST_SIZE = 0.3

In [19]:
df = pd.read_csv("shopping_data.csv")
print("Dataset loaded successfully!")
df.head(5)

Dataset loaded successfully!


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [None]:
# Load data from a CSV file and return (evidence, labels) as lists
def load_data(file):
    # Load CSV into pandas DataFrame
    df = pd.read_csv(file)
    
    # Month to index mapping
    month_to_index = {
        "Jan": 0, "Feb": 1, "Mar": 2, "Apr": 3, "May": 4, "June": 5,
        "Jul": 6, "Aug": 7, "Sep": 8, "Oct": 9, "Nov": 10, "Dec": 11
    }
    
    # Convert Month to index
    df['Month'] = df['Month'].map(month_to_index)
    
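    # Encode VisitorType as 1 for returning visitors and 0 otherwise
    df['VisitorType'] = (df['VisitorType'] == 'Returning_Visitor').astype(int)
    # pandas already parses Weekend and Revenue as booleans, so cast them
    # directly; comparing against the string 'True' would make every value 0
    df['Weekend'] = df['Weekend'].astype(int)
    df['Revenue'] = df['Revenue'].astype(int)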
    
    # Define evidence columns in the desired order
    evidence_columns = [
        'Administrative',
        'Administrative_Duration',
        'Informational',
        'Informational_Duration',
        'ProductRelated',
        'ProductRelated_Duration',
        'BounceRates',
        'ExitRates',
        'PageValues',
        'SpecialDay',
        'Month',
        'OperatingSystems',
        'Browser',
        'Region',
        'TrafficType',
        'VisitorType',
        'Weekend'
    ]
    
    # Extract evidence and labels
    evidence = df[evidence_columns].values.tolist()
    labels = df['Revenue'].values.tolist()
    
    return (evidence, labels)

evidence, labels = load_data("shopping_data.csv")


print(f"Evidence: {evidence[0]}")
print(f"Labels: {labels[0]}")



Evidence: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.2, 0.2, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
Labels: 0
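
A label of 0 is expected for this first row, but it is worth confirming that the labels are not all 0. The sketch below uses a two-row inline sample (invented here, in the same shape as the `Weekend` and `Revenue` columns) to show how pandas parses `True`/`False` text and what happens when a boolean column is compared with the string `'True'`.

```python
import io

import pandas as pd

# Two-row sample standing in for the Weekend and Revenue columns.
sample_csv = io.StringIO("Weekend,Revenue\nFalse,True\nTrue,False\n")
sample = pd.read_csv(sample_csv)

print(sample.dtypes)  # both columns are parsed as bool, not as strings

# Comparing a boolean column with the string 'True' matches nothing.
as_string = (sample["Revenue"] == "True").astype(int).tolist()
# Casting the boolean column directly keeps the real values.
as_bool = sample["Revenue"].astype(int).tolist()

print(as_string)  # [0, 0]
print(as_bool)    # [1, 0]
```

On the real data, a quick `sum(labels)` after loading gives the number of purchase sessions and should be well above zero.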


In [None]:
# Split into train and test sets
    
X_train, X_test, Y_train, Y_test = train_test_split(
    evidence, labels, test_size=TEST_SIZE
)
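
The split above is random and unseeded, so results change from run to run, and the share of purchasers in the test set can drift. One optional variation, not part of the original project, is to pass `stratify` and `random_state` to `train_test_split`. The data below is synthetic and the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 80 negatives and 20 positives.
X = [[float(i)] for i in range(100)]
y = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    stratify=y,        # keep the class ratio the same in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

# Both splits keep the 80/20 class ratio of the full data.
print(Counter(y_train), Counter(y_test))
```

Stratifying matters most when one class is rare, since an unlucky random split could otherwise leave very few purchasers in the test set.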

In [30]:
# Train a k-nearest neighbors model (k is configurable, default 3)
def train_model(X_train, Y_train, n_neighbors=3):
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, Y_train)
    return model

model = train_model(X_train, Y_train)
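
KNN compares sessions by distance, so features with large numeric ranges (such as the duration columns) can dominate features with small ranges (such as `BounceRates`). The project as written does not scale its features. The sketch below shows one possible variation using `StandardScaler` in a pipeline; `train_scaled_model` is a hypothetical helper and the toy data is invented to make the effect visible.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_scaled_model(X_train, Y_train, n_neighbors=3):
    """Hypothetical variant of train_model that standardizes features first."""
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=n_neighbors),
    )
    model.fit(X_train, Y_train)
    return model


# Toy data: the first feature separates the classes, the second is large-scale noise.
X_toy = [[0.0, 100.0], [0.1, 900.0], [0.2, 500.0], [1.0, 120.0], [1.1, 880.0], [0.9, 510.0]]
y_toy = [0, 0, 0, 1, 1, 1]
queries = [[0.05, 850.0], [1.05, 150.0]]

raw_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
scaled_model = train_scaled_model(X_toy, y_toy)

print(raw_model.predict(queries))     # distances driven by the noisy second feature
print(scaled_model.predict(queries))  # both features contribute comparably
```

On this toy data the unscaled model follows the noisy second feature and gets both queries wrong, while the scaled pipeline recovers the classes. Whether scaling helps on the real dataset would need to be measured.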

In [31]:
# Make predictions on test data
predictions = model.predict(X_test)

In [32]:
# Evaluate model performance
def evaluate(labels, predictions):
    true_positives = 0  # Accurate true predictions (1 predicted as 1)
    false_positives = 0  # Inaccurate true predictions (0 predicted as 1)
    true_negatives = 0  # Accurate false predictions (0 predicted as 0)
    false_negatives = 0  # Inaccurate false predictions (1 predicted as 0)

    # Check if labels and predictions are the same length
    if len(labels) != len(predictions):
        raise ValueError("Labels and predictions must be the same length")

    for i in range(len(labels)):
        if labels[i] == 1 and predictions[i] == 1:
            true_positives += 1
        elif labels[i] == 0 and predictions[i] == 1:
            false_positives += 1
        elif labels[i] == 0 and predictions[i] == 0:
            true_negatives += 1
        elif labels[i] == 1 and predictions[i] == 0:
            false_negatives += 1

    # Calculate sensitivity and specificity, guarding against zero denominators
    actual_positives = true_positives + false_negatives
    actual_negatives = true_negatives + false_positives

    if actual_positives != 0:
        sensitivity = true_positives / actual_positives
    else:
        sensitivity = 0.0

    if actual_negatives != 0:
        specificity = true_negatives / actual_negatives
    else:
        specificity = 0.0

    return (sensitivity, specificity)

sensitivity, specificity = evaluate(Y_test, predictions)

# Print results
correct = sum(actual == predicted for actual, predicted in zip(Y_test, predictions))
print(f"Correct: {correct}")
print(f"Incorrect: {len(Y_test) - correct}")
print(f"True Positive Rate: {100 * sensitivity:.2f}%")
print(f"True Negative Rate: {100 * specificity:.2f}%")


Correct: 3699
Incorrect: 0
True Positive Rate: 0.00%
True Negative Rate: 100.00%

The output above comes from a run in which every label was 0, so it does not measure model performance. Zero incorrect predictions together with a 0.00% true positive rate means the test set contained no sessions labeled as purchases, and `evaluate` falls back to a sensitivity of 0.0 when there are no actual positives. This happens when `Revenue` is compared with the string `'True'` even though pandas has already parsed the column as booleans. With the labels encoded from the boolean values, both rates depend on the random split and will vary between runs.
