1. Extract the features from Postgres.

2. Split the data into training and testing sets.

3. Train a Random Forest Classifier (a good fit for non-linear relationships in tabular e-commerce data).

4. Evaluate the results with a classification report (per-class precision, recall, and F1).
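
Step 2 deserves a closer look because buyers are a small minority of sessions (6,743 of the 100,000 test rows reported further down). The sketch below shows what a stratified split guarantees: each class is divided at the same ratio, so the buyer rate is the same in both sets. It is a hand-rolled illustration on toy labels, not scikit-learn's implementation; `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels (assumption): 7% buyers (1) and 93% non-buyers (0).
labels = [1] * 70 + [0] * 930
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"buyer rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```

Without stratification, a random 20% slice of an imbalanced dataset can end up with noticeably more or fewer buyers than the training set, which makes the test metrics harder to trust.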

In [4]:
#pip install scikit-learn joblib

In [6]:
import os
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# 1. Database Connection
load_dotenv('../.env')
engine = create_engine(f"postgresql://postgres:{os.getenv('DB_PASS')}@localhost:5432/ecommerce_db")

# 2. Load Features (capped at 500k rows for efficient training on M3)
# Note: LIMIT without ORDER BY returns an unspecified subset of rows, not a random sample.
query = "SELECT * FROM customer_features LIMIT 500000"
df = pd.read_sql(query, engine)

# 3. Preprocessing and Train-Test Split
# We drop 'purchase_count' and 'cart_count' because they leak the label (too close to the answer).
# Identifier and timestamp columns go too; we want to predict from views, duration, and timing.

features_to_drop = [
    'user_session', 'user_id', 'session_start', 
    'label_purchased', 'purchase_count', 'cart_count'
]

X = df.drop(features_to_drop, axis=1)
y = df['label_purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 4. Training
print("🚀 Training Random Forest Model...")
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# 5. Evaluation
y_pred = model.predict(X_test)
print("\n--- Model Performance ---")
print(classification_report(y_test, y_pred))

# 6. Save the Model for Phase D
os.makedirs('../models', exist_ok=True)
joblib.dump(model, '../models/purchase_predictor.pkl')
print("\n✅ Model saved to models/purchase_predictor.pkl")

🚀 Training Random Forest Model...

--- Model Performance ---
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     93257
           1       0.97      0.85      0.91      6743

    accuracy                           0.99    100000
   macro avg       0.98      0.92      0.95    100000
weighted avg       0.99      0.99      0.99    100000


✅ Model saved to models/purchase_predictor.pkl


Precision (0.97) for Class 1: 

    When the model predicts a user will buy, it is correct 97% of the time. This matters for marketing: very few discounts would go to sessions that do not convert.

Recall (0.85) for Class 1: 

    The model catches 85% of all actual buyers in the test set and misses the remaining 15%. Identifying roughly 17 out of 20 buyers from browsing patterns alone is a strong result.

F1-Score (0.91): 

    F1 is the harmonic mean of precision and recall. A score of 0.91 on the minority class indicates the model is not simply predicting "no purchase" for everyone, despite the 93/7 class imbalance.
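
The arithmetic behind these three numbers can be checked by hand. The sketch below recomputes F1 from the reported precision and recall, then works backwards to the approximate confusion-matrix cells they imply. The inputs are the rounded figures from the report, so the cell counts are estimates rather than the model's exact output.

```python
# Figures copied from the classification report (rounded to two decimals there).
precision, recall = 0.97, 0.85
buyers, non_buyers = 6743, 93257  # test-set support for class 1 and class 0

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Approximate confusion-matrix cells implied by the rounded metrics.
tp = round(recall * buyers)       # buyers correctly flagged
fn = buyers - tp                  # buyers the model missed
fp = round(tp / precision) - tp   # non-buyers wrongly flagged
tn = non_buyers - fp              # non-buyers correctly left alone

# Baseline for comparison: always predicting "no purchase".
baseline_accuracy = non_buyers / (buyers + non_buyers)

print(f"F1 = {f1:.2f}")
print(f"approx. TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"always-0 baseline accuracy = {baseline_accuracy:.3f}")
```

The baseline line shows why accuracy alone says little here: a model that never predicts a purchase would already score about 93%. For exact cell counts, the notebook can call `confusion_matrix(y_test, y_pred)`, which is imported above but never used; its rows are actual classes and its columns are predicted classes.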

In [7]:
import numpy as np

importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top Behavioral Predictors:")
for f in range(X.shape[1]):
    print(f"{f + 1}. {X.columns[indices[f]]} ({importances[indices[f]]:.4f})")

Top Behavioral Predictors:
1. total_interactions (0.2943)
2. session_duration_sec (0.2072)
3. view_count (0.1743)
4. view_to_cart_ratio (0.1577)
5. unique_products_viewed (0.0945)
6. avg_price_interacted (0.0438)
7. start_hour (0.0254)
8. is_weekend (0.0028)


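Total Interactions (0.29): 
    
    This is the strongest signal: the model leans most heavily on how much a user engages with the site (clicks, scrolls).

Session Duration (0.21): 
    
    Time spent on the site is the second biggest factor. Importances do not show the direction of an effect, but this is consistent with "slow shopping" converting more often than "impulse clicking."

View to Cart Ratio (0.16): 
    
    Our engineered feature is in the top 4. Its importance is well above that of item price (0.04) or time of day (0.03), which suggests that how a user filters their choices matters more to the model than either. Because 'cart_count' was dropped for being too close to the label, it is worth confirming that this ratio does not reintroduce the same information.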
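
To put the ranking in proportion, the sketch below accumulates the importances printed above. Impurity-based importances from a scikit-learn random forest are normalized to sum to 1, so each value can be read as a share of the model's total split gain. They say how much a feature was used, not whether it pushes a prediction up or down. The values are copied from the output; nothing is recomputed from the model.

```python
# Importances copied from the notebook output above.
importances = {
    "total_interactions": 0.2943,
    "session_duration_sec": 0.2072,
    "view_count": 0.1743,
    "view_to_cart_ratio": 0.1577,
    "unique_products_viewed": 0.0945,
    "avg_price_interacted": 0.0438,
    "start_hour": 0.0254,
    "is_weekend": 0.0028,
}

total = sum(importances.values())
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# Print each feature with its running share of the total importance.
running = 0.0
for rank, (name, score) in enumerate(ranked, start=1):
    running += score
    print(f"{rank}. {name:<24} {score:.4f}  cumulative {running / total:.1%}")

top4_share = sum(score for _, score in ranked[:4]) / total
```

The top four features account for about 83% of the total, while `start_hour` and `is_weekend` together contribute under 3%.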