# Coding Challenge

### Submitted by  
**Alfonso Cervantes Barragán**  
*Data Scientist*  
a.cervantes2012@hotmail.com



## Submitted Preprocessing and Modeling

In [16]:
"""
Baseline Model - Simple Logistic Regression
Use this as your starting point
"""

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
import joblib

# I’m keeping this to reduce noise for the exercise,
# but I personally prefer leaving warnings on and fixing the root cause.
import warnings
warnings.filterwarnings("ignore")


# Load data
train = pd.read_csv("ecommerce_returns_train.csv")
test = pd.read_csv("ecommerce_returns_test.csv")

def preprocess(df):
    """Simple preprocessing pipeline"""
    df_processed = df.copy()

    # Encode categorical: product_category
    le_category = LabelEncoder()

    # I suggest fixing leakage here.
    # Right now, if you call preprocess(test), you're fitting the encoder on test too.
    # My recommendation: fit the encoder once on TRAIN and then reuse it to transform TEST.
    df_processed["product_category_encoded"] = le_category.fit_transform(
        df_processed["product_category"]
    )

    # Handle missing sizes (Fashion items only have sizes)
    if df_processed["size_purchased"].notna().any():
        most_common_size = df_processed["size_purchased"].mode()[0]

        # I recommend avoiding inplace=True.
        # It can create subtle side effects and pandas is moving away from it.
        # Cleaner option: df_processed["size_purchased"] = df_processed["size_purchased"].fillna(most_common_size)
        df_processed["size_purchased"].fillna(most_common_size, inplace=True)

        le_size = LabelEncoder()

        # Same recommendation as category: fit on TRAIN only, transform on TEST with the same encoder.
        # I’d also handle “unknown” sizes that appear only in test (otherwise production will break).
        df_processed["size_encoded"] = le_size.fit_transform(
            df_processed["size_purchased"]
        )
    else:
        # I suggest always creating the column to avoid crashing later when selecting features.
        df_processed["size_encoded"] = 0

    # Feature selection
    feature_cols = [
        "customer_age", "customer_tenure_days", "product_category_encoded",
        "product_price", "days_since_last_purchase", "previous_returns",
        "product_rating", "size_encoded", "discount_applied"
    ]

    # I recommend a quick NaN check here (and then impute/log).
    # Missing values can quietly distort training and evaluation.
    X = df_processed[feature_cols]
    y = df_processed["is_return"]

    return X, y


# Prepare data
X_train, y_train = preprocess(train)
X_test, y_test = preprocess(test)

# I’d change this design: preprocess() creates new encoders every time,
# so train/test can end up with different mappings.
# My recommendation: fit encoders once on train, reuse them everywhere (test + production), and persist them.

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# I suggest saving feature_cols too.
# Column order mismatches are a very common production bug.

# Train baseline model
baseline_model = LogisticRegression(random_state=42, max_iter=1000)

# If returns are the minority class, I recommend trying class_weight="balanced".
# It's a small change that usually improves recall for the return class in a measurable way.
baseline_model.fit(X_train_scaled, y_train)

# Predictions
y_pred = baseline_model.predict(X_test_scaled)

# I strongly recommend storing predict_proba too and tuning thresholds (0.3/0.4/0.5).
# That's the simplest way to control precision vs recall without changing the model.

# Basic evaluation
print("Baseline Model Performance")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# I wouldn’t stop at accuracy here.
# I recommend adding confusion matrix + recall/F1 for class 1 (returns), since accuracy can look fine while missing returns.

# Save artifacts
joblib.dump(baseline_model, "baseline_model.pkl")
joblib.dump(scaler, "scaler.pkl")

# I recommend saving the encoders (and feature_cols) as artifacts too.
# Otherwise you won't be able to reproduce the same preprocessing in production.

print("\n" + "=" * 50)
print("YOUR TASK: Evaluate thoroughly and improve this baseline")
print("=" * 50)


Baseline Model Performance
Accuracy: 0.7475

Classification Report:
              precision    recall  f1-score   support

           0       0.75      1.00      0.86      1495
           1       0.00      0.00      0.00       505

    accuracy                           0.75      2000
   macro avg       0.37      0.50      0.43      2000
weighted avg       0.56      0.75      0.64      2000


YOUR TASK: Evaluate thoroughly and improve this baseline


**Observations – Alfonso –**
The model looks “good” based on accuracy (0.7475), but in practice it is useless for detecting returns: for class 1 it has precision = 0.00, recall = 0.00, and F1 = 0.00, meaning it predicts essentially every order as “no return.” The 0.7475 accuracy is simply the share of non-returned orders in the test set (1495 of 2000).


## Evaluation of the model

In [17]:
import pandas as pd
import numpy as np
import joblib

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

# Load test data
test = pd.read_csv("ecommerce_returns_test.csv")

# Load artifacts
model = joblib.load("baseline_model.pkl")
scaler = joblib.load("scaler.pkl")

# Reuse preprocessing (as in baseline)
X_test, y_test = preprocess(test)
X_test_scaled = scaler.transform(X_test)

# Predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]


## Metrics

In [11]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")


Accuracy : 0.7475
Precision: 0.0000
Recall   : 0.0000
F1-score : 0.0000


## Confusion matrix

In [12]:
cm = confusion_matrix(y_test, y_pred)
print(cm)


[[1495    0]
 [ 505    0]]


## Performance by Product Category

In [18]:
# Create an evaluation copy of the test data
test_eval = test.copy()
test_eval["y_true"] = y_test.values
test_eval["y_pred"] = y_pred

# Review model performance by product category
performance_by_category = (
    test_eval
    .groupby("product_category")
    .apply(lambda df: pd.Series({
        "accuracy": accuracy_score(df["y_true"], df["y_pred"]),
        "recall": recall_score(df["y_true"], df["y_pred"]),
        "support": len(df)
    }))
    .sort_values("recall")
)

# Inspect categories where recall is weakest
performance_by_category.head(10)



| product_category | accuracy | recall | support |
|---|---|---|---|
| Electronics | 0.828666 | 0.0 | 607.0 |
| Fashion | 0.686594 | 0.0 | 1104.0 |
| Home_Decor | 0.809689 | 0.0 | 289.0 |


**Observation – Alfonso:**
From my perspective, the model shows acceptable accuracy by category, but recall is zero across all of them. This tells me that it is not detecting returns in any product group and is mainly doing well by predicting the majority class. I want to be very clear about this limitation.

## Key Questions – Part 1

**Strengths and weaknesses**
I see this model as a solid baseline in terms of simplicity: it is fast, easy to interpret, and straightforward to run. Its biggest weakness, however, is clear: it **does not detect returns**. On the test set it predicts “no return” for every order.

**Where does it fail most?**
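It fails on exactly the orders that matter: recall for returns is 0.0 in every product category. The impact is largest in Fashion, the biggest category (1,104 test orders), where accuracy is lowest (0.69 versus 0.81 to 0.83 elsewhere). Since the model never predicts a return, that lower accuracy means Fashion has the highest share of returns, all of them missed. My working assumption is that return patterns there are harder to capture with a simple linear model.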

**Is accuracy the right metric? Why/why not?**
In my opinion, accuracy is not the right metric on its own here. With imbalanced data, accuracy can look acceptable even when the model completely misses the class we care about. That’s why I focus more on recall and F1, which expose the real performance issues more honestly.
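
To make the point concrete, here is a minimal sketch that rebuilds the test-set labels from the class counts in the confusion matrix above and scores a constant "always no return" predictor. The variable names are illustrative; only the two counts come from the notebook output.

```python
# Class counts taken from the test-set confusion matrix in this notebook.
n_no_return, n_return = 1495, 505

y_true = [0] * n_no_return + [1] * n_return
y_pred = [0] * len(y_true)  # constant predictor: "no return" for every order

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall_returns = true_positives / n_return

print(f"Accuracy of the constant predictor: {accuracy:.4f}")
print(f"Recall for returns: {recall_returns:.4f}")
```

The constant predictor reproduces the baseline's accuracy while catching no returns, which is why accuracy alone cannot separate a trained model from a trivial rule on this dataset.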


## Part 2 – Business-Aligned Metrics

### Tasks

1. Define **“success”** in clear business terms.  
2. Recommend **2–3 metrics** aligned with business goals.  
3. Analyze **false positive vs. false negative** trade-offs.  
4. Calculate the **financial impact** of model predictions.  
5. Determine the **optimal decision threshold**.

In this problem, I do not define success as achieving high accuracy, but as **reducing the overall cost of product returns**. A return costs **\$18**, while a preventive intervention costs **\$3** and reduces the probability of a return by **35%**. Given this imbalance, missing a real return is significantly more expensive than intervening unnecessarily, so the model should be evaluated based on financial impact rather than purely technical metrics.

The cost structure makes the trade-off very clear. Correctly identifying a return and acting costs \$3 and can avoid losing \$18, which is a strong positive outcome. An unnecessary intervention costs only \$3 and is acceptable. However, failing to detect a real return results in losing the full \$18. This means false negatives are far more damaging than false positives.
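
The arithmetic behind this trade-off can be sketched as follows. This is one possible reading of the numbers, not a definitive cost model: it assumes that an intervention on an order that would otherwise be returned prevents the return 35% of the time, that true negatives cost nothing, and that the function and constant names are purely illustrative.

```python
RETURN_COST = 18.0        # cost of one return (from the text)
INTERVENTION_COST = 3.0   # cost of one preventive intervention (from the text)
RETURN_REDUCTION = 0.35   # assumption: an intervention prevents 35% of would-be returns


def expected_cost(tn, fp, fn, tp):
    """Expected total cost of a confusion matrix under the assumptions above."""
    cost_tp = INTERVENTION_COST + (1 - RETURN_REDUCTION) * RETURN_COST
    cost_fp = INTERVENTION_COST
    cost_fn = RETURN_COST
    # True negatives are assumed to cost nothing, so tn does not contribute.
    return tp * cost_tp + fp * cost_fp + fn * cost_fn


def break_even_probability():
    """Return probability above which intervening is cheaper in expectation."""
    return INTERVENTION_COST / (RETURN_REDUCTION * RETURN_COST)


# Baseline confusion matrix from this notebook: [[1495, 0], [505, 0]].
baseline_cost = expected_cost(tn=1495, fp=0, fn=505, tp=0)
print(f"Baseline (no interventions): ${baseline_cost:,.2f}")
print(f"Break-even return probability: {break_even_probability():.3f}")
```

Under this reading, an intervention pays for itself in expectation once the calibrated return probability exceeds roughly 0.48, so the decision threshold deserves an explicit cost-based check rather than a default.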

Because of this, **accuracy alone is not an appropriate metric**. In the baseline, accuracy was around **75\%**, yet the model failed to detect any returns at all. This shows that accuracy can look acceptable even when the model provides no real business value. Instead, I focus primarily on **recall for the return class**, while using precision to keep intervention costs under control.

I also do not treat the default probability threshold of 0.5 as optimal. I would explicitly test lower thresholds (e.g., 0.3 or 0.4) and select the one that minimizes total expected cost. A model would be considered ready for a pilot deployment if it clearly reduces expected losses compared to doing nothing, improves recall for returns, and keeps intervention volume within reasonable limits. Even a simple baseline model can deliver meaningful business value under these criteria.
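
A threshold search of this kind could look like the sketch below. It is illustrative only: the labels and probabilities are made-up toy values, the helper names are hypothetical, and the cost model assumes an intervention prevents 35% of would-be returns.

```python
RETURN_COST = 18.0        # cost of one return (from the text)
INTERVENTION_COST = 3.0   # cost of one intervention (from the text)
RETURN_REDUCTION = 0.35   # assumption: share of would-be returns an intervention prevents


def total_expected_cost(y_true, y_proba, threshold):
    """Expected cost when every order with probability >= threshold gets an intervention."""
    cost = 0.0
    for actual, p in zip(y_true, y_proba):
        flagged = p >= threshold
        if flagged:
            cost += INTERVENTION_COST
        if actual == 1:
            # A flagged return is still expected to happen 65% of the time.
            cost += RETURN_COST * ((1 - RETURN_REDUCTION) if flagged else 1.0)
    return cost


def best_threshold(y_true, y_proba, candidates):
    """Candidate threshold with the lowest total expected cost."""
    return min(candidates, key=lambda t: total_expected_cost(y_true, y_proba, t))


# Toy values, NOT results from this notebook.
toy_y_true = [0, 0, 1, 1, 0, 1, 0, 0]
toy_y_proba = [0.10, 0.35, 0.45, 0.80, 0.55, 0.30, 0.20, 0.60]

for t in (0.3, 0.4, 0.5):
    print(f"threshold={t:.1f} -> expected cost ${total_expected_cost(toy_y_true, toy_y_proba, t):.2f}")
print("best threshold:", best_threshold(toy_y_true, toy_y_proba, (0.3, 0.4, 0.5)))
```

In practice `y_true` and `y_proba` would come from a validation split rather than the test set, so the chosen threshold is not tuned on the data used for the final evaluation. Note also that `class_weight="balanced"` shifts predicted probabilities upward for the minority class, which is one more reason to pick the threshold empirically.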



## Part 3: Model Improvement (20 minutes) 

**Hypothesis 1 – Class imbalance**
Returns are typically a minority class. Using `class_weight="balanced"` should improve recall and F1 for returns without changing the algorithm.
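
For intuition, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula to the test-set class counts purely as an illustration, since the training counts are not shown in this notebook; the helper name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights following the documented 'balanced' heuristic:
    n_samples / (n_classes * count_of_that_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Illustration only: test-set counts (1495 non-returns, 505 returns).
weights = balanced_class_weights({0: 1495, 1: 505})
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these counts, each return weighs about three times as much as a non-return in the loss, which is what pushes the model away from the "always no return" solution.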

**Hypothesis 2 – Regularization tuning**
Light tuning of the regularization strength (`C`) can reduce under/over-regularization and improve generalization.

**Methodological improvement**
I fixed the preprocessing inconsistency (and the potential leakage) by fitting the encoders and scaler **only on training data** and reusing them for test and production.


**Validation & Overfitting Check**

* Preprocessing fitted **only on TRAIN** → no leakage.
* Hyperparameters selected via **5-fold CV**.
* Train and test F1 are comparable (0.3688 vs. 0.3644) → no strong overfitting signal.
* Same preprocessing object persisted for production reproducibility.
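
An alternative way to get the same guarantees, not the approach used in the cell below, is to bundle preprocessing and model into a single scikit-learn `Pipeline`. The sketch swaps `LabelEncoder` for `OneHotEncoder(handle_unknown="ignore")`, so unseen categories are encoded as all zeros instead of raising an error. The toy frames and the reduced column list are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["product_price", "previous_returns"]
categorical_cols = ["product_category", "size_purchased"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

pipeline = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=2000)),
])

# Tiny made-up frames, only to show the fit/predict flow.
toy_train = pd.DataFrame({
    "product_price": [20.0, 85.0, 40.0, 120.0, 15.0, 60.0],
    "previous_returns": [0, 2, 1, 3, 0, 1],
    "product_category": ["Fashion", "Fashion", "Electronics",
                         "Fashion", "Home_Decor", "Electronics"],
    "size_purchased": ["M", "L", np.nan, "S", np.nan, np.nan],
    "is_return": [0, 1, 0, 1, 0, 0],
})
toy_test = pd.DataFrame({
    "product_price": [55.0, 30.0],
    "previous_returns": [1, 0],
    "product_category": ["Fashion", "Toys"],  # "Toys" never appears in training
    "size_purchased": ["XXL", np.nan],        # "XXL" never appears in training
})

feature_cols = numeric_cols + categorical_cols
pipeline.fit(toy_train[feature_cols], toy_train["is_return"])  # all fitting happens on train
proba = pipeline.predict_proba(toy_test[feature_cols])[:, 1]
print(proba)
```

Because the imputer, encoder, scaler, and model live in one object, a single `joblib.dump(pipeline, ...)` would persist the whole preprocessing chain, and passing the pipeline to `GridSearchCV` would refit the preprocessing inside each fold.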



In [23]:
"""
Improved Baseline Model - Logistic Regression (No Classes)

What I changed vs. the original baseline:
- I fit encoders and the scaler on TRAIN only (to avoid leakage).
- I handled class imbalance to stop the model from ignoring returns.
- I tuned regularization lightly (small grid on C) to improve generalization.
- I compare TRAIN vs TEST metrics to sanity-check overfitting.
"""

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import GridSearchCV
import joblib

# ------------------------------------------------------------
# Load data
# ------------------------------------------------------------
train = pd.read_csv("ecommerce_returns_train.csv")
test = pd.read_csv("ecommerce_returns_test.csv")

# ------------------------------------------------------------
# Encode categorical variables (fit on TRAIN only)
# ------------------------------------------------------------
le_category = LabelEncoder()
le_size = LabelEncoder()

train = train.copy()
test = test.copy()

# Fit on train, reuse on test (prevents leakage / mapping mismatch)
train["product_category_encoded"] = le_category.fit_transform(train["product_category"])
test["product_category_encoded"] = le_category.transform(test["product_category"])

# Handle size feature consistently
# I fit size encoder on TRAIN and map unseen sizes in TEST to a safe fallback
if train["size_purchased"].notna().any():
    most_common_size = train["size_purchased"].mode()[0]

    train["size_purchased"] = train["size_purchased"].fillna(most_common_size)
    test["size_purchased"] = test["size_purchased"].fillna(most_common_size)

    train["size_encoded"] = le_size.fit_transform(train["size_purchased"])

    # Keep production behavior stable: unseen sizes fall back to the most common train size
    test_sizes = test["size_purchased"].map(lambda x: x if x in le_size.classes_ else most_common_size)
    test["size_encoded"] = le_size.transform(test_sizes)
else:
    # Always create the column so feature selection never breaks
    train["size_encoded"] = 0
    test["size_encoded"] = 0

# ------------------------------------------------------------
# Feature selection
# ------------------------------------------------------------
feature_cols = [
    "customer_age", "customer_tenure_days", "product_category_encoded",
    "product_price", "days_since_last_purchase", "previous_returns",
    "product_rating", "size_encoded", "discount_applied"
]

X_train = train[feature_cols]
y_train = train["is_return"]

X_test = test[feature_cols]
y_test = test["is_return"]

# ------------------------------------------------------------
# Scaling (fit on TRAIN only)
# ------------------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ------------------------------------------------------------
# Model + light tuning
# ------------------------------------------------------------
# My assumption: the returns class is minority, so I balance it to improve recall/F1.
param_grid = {"C": [0.1, 1.0, 3.0]}

base_model = LogisticRegression(
    random_state=42,
    max_iter=2000,
    class_weight="balanced"
)

grid = GridSearchCV(
    base_model,
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1
)

grid.fit(X_train_scaled, y_train)
best_model = grid.best_estimator_

# ------------------------------------------------------------
# Evaluation (quick generalization check)
# ------------------------------------------------------------
train_pred = best_model.predict(X_train_scaled)
test_pred = best_model.predict(X_test_scaled)

print("Improved Model Performance")
print("=" * 50)
print(f"Best params: {grid.best_params_}")

print("\nTRAIN metrics")
print(f"Accuracy: {accuracy_score(y_train, train_pred):.4f}")
print(f"F1 (returns): {f1_score(y_train, train_pred):.4f}")

print("\nTEST metrics")
print(f"Accuracy: {accuracy_score(y_test, test_pred):.4f}")
print(f"F1 (returns): {f1_score(y_test, test_pred):.4f}")
print("\nClassification Report (TEST):")
print(classification_report(y_test, test_pred))

# ------------------------------------------------------------
# Save artifacts (so inference uses the exact same preprocessing)
# ------------------------------------------------------------
joblib.dump(best_model, "model.pkl")
joblib.dump(scaler, "scaler.pkl")
joblib.dump(le_category, "le_category.pkl")
joblib.dump(le_size, "le_size.pkl")
joblib.dump(feature_cols, "feature_cols.pkl")

print("\nModels saved for production reproducibility")


Improved Model Performance
Best params: {'C': 0.1}

TRAIN metrics
Accuracy: 0.5503
F1 (returns): 0.3688

TEST metrics
Accuracy: 0.5465
F1 (returns): 0.3644

Classification Report (TEST):
              precision    recall  f1-score   support

           0       0.77      0.56      0.65      1495
           1       0.28      0.51      0.36       505

    accuracy                           0.55      2000
   macro avg       0.53      0.54      0.51      2000
weighted avg       0.65      0.55      0.58      2000


Models saved for production reproducibility


**Observations:** What stands out to me is how consistently the model behaves between training and test data. The F1 scores are almost identical (0.3688 vs. 0.3644), which suggests it is not overfitting. The grid search settled on the lowest C value (0.1), i.e. the strongest regularization I tested, which I read as a sign that the current features do not carry a very strong signal. I also see the expected trade-off from balancing the classes: recall for returns rises from 0.00 to 0.51, but precision is only 0.28, barely above the 25% return rate in the test set, so the model raises many false alarms. Accuracy drops from 0.75 to 0.55. I do not see that as a failure in itself, because the baseline's accuracy came entirely from predicting the majority class, but it does reflect the limits of the current features. The real improvement is that the model now actually flags returns, which gives the business something concrete to act on and to tune through the decision threshold.


## Part 4: Deployment Planning (10 minutes) 

## Task 1: Production Monitoring Plan 

### Metrics to Track
In production, I would not focus on accuracy. What I care about is how the model behaves on returns, since that’s where the business impact is. I would track recall and F1-score for the return class to make sure we are still catching a meaningful portion of returns. I would also monitor precision, because too many false positives can quickly become an operational problem. On top of model metrics, I would continuously track data quality indicators such as missing values, unseen categories, and basic shifts in feature distributions.

### Detecting Model Degradation
To me, degradation is rarely a single bad number. I would look for consistent patterns: a sustained drop in return recall or F1 compared to the validation baseline, or sudden changes in the percentage of orders predicted as returns. Even if accuracy stays flat, these signals usually indicate that customer behavior or product mix has changed. I would complement this with simple distribution checks on key features to confirm whether the input data is drifting.
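
One common option for the simple distribution checks mentioned here is the Population Stability Index (PSI), which compares binned shares of a feature between a reference sample and recent production data. The sketch below is illustrative: the function name, the bin count, the epsilon floor, and the sample prices are all assumptions, not part of the submitted work.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature.

    Bins are equal-width over the range of the reference sample; values in
    `current` that fall outside that range are clamped to the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) // width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up prices: the "current" sample is shifted upward by 40.
reference_prices = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
current_prices = [p + 40 for p in reference_prices]

print(f"PSI, same data:    {population_stability_index(reference_prices, reference_prices):.4f}")
print(f"PSI, shifted data: {population_stability_index(reference_prices, current_prices):.4f}")
```

Identical distributions give a PSI of zero and larger shifts give larger values, so the index can be tracked per feature over time alongside the predicted return rate.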

### Alerts to Configure
I would configure alerts that are informative rather than reactive. For example, if return recall drops by more than 10–15% for several consecutive periods, that should trigger an investigation. I would also alert on sharp week-over-week changes in the predicted return rate and on spikes in unseen categories or missing values. These alerts are meant to prompt analysis, not automatic retraining.
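
The recall alert described above could be expressed as a small rule like the one below. This is a sketch under assumptions: the drop is read as relative to the validation baseline, "several consecutive periods" is set to three, the history values are made up, and the function name is hypothetical. The baseline of 0.51 is the test-set recall of the improved model.

```python
def recall_alert(recall_history, baseline_recall,
                 max_relative_drop=0.10, consecutive_periods=3):
    """True when recall stayed below the allowed floor for the last N periods."""
    if len(recall_history) < consecutive_periods:
        return False  # not enough history to call it a sustained drop
    floor = baseline_recall * (1 - max_relative_drop)
    recent = recall_history[-consecutive_periods:]
    return all(r < floor for r in recent)


BASELINE_RECALL = 0.51  # recall for returns on the test set

# Made-up weekly recall values for illustration.
sustained_drop = [0.52, 0.45, 0.44, 0.43]
single_dip = [0.52, 0.45, 0.50, 0.43]

print("sustained drop ->", recall_alert(sustained_drop, BASELINE_RECALL))
print("single dip     ->", recall_alert(single_dip, BASELINE_RECALL))
```

Requiring several consecutive breaches keeps one noisy week from paging anyone, which matches the intent that alerts prompt analysis rather than automatic retraining.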

### Retraining Strategy
I see retraining as a controlled decision, not an automatic reaction. Under stable conditions, I would retrain on a regular cadence, such as quarterly. If monitoring shows persistent degradation that cannot be explained by data quality issues, I would trigger an earlier retraining cycle. In all cases, retraining would reuse the same preprocessing pipeline and be validated against a recent holdout set.

### Rollback Criteria
A rollback should be simple and fast. I would roll back if a newly deployed model performs worse than the current production model on live or shadow metrics, or if it creates excessive false positives that disrupt operations. In those cases, reverting to the last stable model is safer than trying to tune a degraded one in production.


## Task 2: Stakeholder Summary

This model is designed to identify orders that are likely to be returned, with the goal of helping the business act earlier and reduce avoidable return costs. From my perspective, the model’s value is not in maximizing accuracy, but in improving our ability to flag a meaningful portion of returns while keeping operational impact under control. In its current form, the model detects roughly half of all returns (recall of 0.51 on the test set), although only about 28% of the orders it flags are actually returned. That is already an actionable signal compared to a purely reactive process.

In business terms, this means the model can be used to prioritize manual reviews, adjust fulfillment strategies, or trigger targeted interventions for high-risk orders. The expected return on investment comes from reducing handling, logistics, and customer service costs associated with returns. Even a modest reduction in unnecessary returns can generate positive ROI, given that the cost of deploying and maintaining the model is relatively low.

There are, however, clear risks to consider. Customer behavior, product mix, and promotional strategies change over time, and any of these can degrade model performance. To mitigate this, I recommend continuous monitoring of return-related metrics and data quality indicators, along with a controlled retraining strategy. The model has been built with reproducibility in mind, which allows us to retrain or roll back safely when needed.

Success after launch should be measured using business-facing metrics rather than technical ones alone. Key indicators include the percentage of returns successfully flagged, the operational cost of false positives, and the overall impact on return-related expenses. I would also recommend running a controlled A/B test, where a subset of orders uses model-driven actions while the rest follows the current process, to quantify incremental business value.

Overall, I see this model as a solid first deployment: conservative, explainable, and easy to monitor. It is not a final solution, but a practical and low-risk step toward more data-driven decision-making around returns.
