## Instructions {-}

1. This is the template for the code and final report on the Prediction Problem.

2. You may modify the template if you see fit, but it should have the information asked below.

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x_train = pd.read_csv("train_X.csv")
y_train = pd.read_csv("train_y.csv")
x_test = pd.read_csv("public_private_X.csv")
train = pd.merge(x_train, y_train, how='inner', on='ID')

## 1) Data Preprocessing

Steps for cleaning and preparing the dataset

In [2]:
train_x = train.drop(columns=["ID", "ON_TIME_AND_COMPLETE"])
train_y = train["ON_TIME_AND_COMPLETE"]

# Columns excluded from modeling; errors='ignore' skips any that are absent
drop_cols = ['RESERVABLE_INDICATOR', 'PURCHASE_ORDER_DUE_DATE', 'ORDER_DATE', 'DIVISION_CODE', 'PRODUCT_STATUS']
train_x.drop(columns=drop_cols, inplace=True, errors='ignore')
x_test.drop(columns=drop_cols, inplace=True, errors='ignore')


## 2) Feature Engineering

Techniques used to create meaningful features

In [None]:
# Define categorical and numeric variables
cat_vars = ['DIVISION_NUMBER', 'ORDER_DAY_OF_WEEK', "DUE_DATE_WEEKDAY", "PURCHASE_ORDER_TYPE", "PRODUCT_CLASSIFICATION"]
num_vars = [col for col in train_x.columns if col not in cat_vars]

# Impute NaNs before one-hot encoding
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Impute numeric columns
X_train_num = pd.DataFrame(numeric_imputer.fit_transform(train_x[num_vars]), columns=num_vars)
X_test_num = pd.DataFrame(numeric_imputer.transform(x_test[num_vars]), columns=num_vars)

# Impute categorical columns
X_train_cat = pd.DataFrame(categorical_imputer.fit_transform(train_x[cat_vars]), columns=cat_vars)
X_test_cat = pd.DataFrame(categorical_imputer.transform(x_test[cat_vars]), columns=cat_vars)

# Combine imputed numeric and categorical
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

# One-hot encoding
X_train = pd.get_dummies(X_train, columns=cat_vars, drop_first=True)
X_test = pd.get_dummies(X_test, columns=cat_vars, drop_first=True)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Polynomial features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = pd.DataFrame(poly.fit_transform(X_train), columns=poly.get_feature_names_out(X_train.columns))
X_test_poly = pd.DataFrame(poly.transform(X_test), columns=poly.get_feature_names_out(X_test.columns))

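# Flag interaction terms built from two dummies of the same categorical variable.
# Such products are always zero, because at most one dummy per variable equals 1.
def same_interactions(feature_name, categorical_features):
    # PolynomialFeatures joins the names of interacting columns with a space
    terms = feature_name.split(" ")
    for cat in categorical_features:
        # Every term must come from the same variable, and there must be at least two terms
        if all(cat in term for term in terms) and len(terms) >= 2:
            return True
    return False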

# Remove redundant interactions
features_to_remove = [col for col in X_train_poly.columns if same_interactions(col, cat_vars)]
X_train_poly_filtered = X_train_poly.drop(columns=features_to_remove, errors="ignore")
X_test_poly_filtered = X_test_poly.drop(columns=features_to_remove, errors="ignore")
print(f"Removed {len(features_to_remove)} redundant categorical interactions.")

# Scale the data
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_poly_filtered), columns=X_train_poly_filtered.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_poly_filtered), columns=X_test_poly_filtered.columns)


Removed 587 redundant categorical interactions.
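
The same-category filter above can be checked on a handful of toy feature names. This is a minimal sketch: the dummy names and the numeric column `QUANTITY` are hypothetical and only mimic the `prefix_value` names that `pd.get_dummies` produces. Because the function splits names on spaces, it assumes no category value itself contains a space.

```python
# Toy check of the same-category filter; all feature names below are made up.
def same_interactions(feature_name, categorical_features):
    terms = feature_name.split(" ")
    for cat in categorical_features:
        if all(cat in term for term in terms) and len(terms) >= 2:
            return True
    return False

cat_vars = ["ORDER_DAY_OF_WEEK", "PURCHASE_ORDER_TYPE"]
names = [
    "ORDER_DAY_OF_WEEK_Mon ORDER_DAY_OF_WEEK_Tue",  # same variable: product is always 0
    "ORDER_DAY_OF_WEEK_Mon PURCHASE_ORDER_TYPE_B",  # two different variables: kept
    "QUANTITY ORDER_DAY_OF_WEEK_Mon",               # numeric x dummy: kept
    "ORDER_DAY_OF_WEEK_Mon",                        # main effect: kept
]

removed = [n for n in names if same_interactions(n, cat_vars)]
kept = [n for n in names if n not in removed]
print(removed)
```

Only the first name is flagged, since it is the only one whose terms all come from the same categorical variable.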


## 3) Developing the Model

In [5]:
from sklearn.model_selection import StratifiedKFold

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_scaled, train_y, test_size=0.2, random_state=42)
C_values = [0.1, 1, 10, 100, 1000]
l1_ratios = [0.25, 0.5, 0.75]
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = LogisticRegressionCV(
    Cs=C_values,
    l1_ratios=l1_ratios,
    cv=cv_folds,
    penalty="elasticnet",
    solver="saga",
    max_iter=100,  # saga may not fully converge within this limit
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)
model.fit(X_train_split, y_train_split)





## 4) Model Fine Tuning

Including cross-validation, regularization, and hyperparameter tuning, etc

In [7]:
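# C_ and l1_ratio_ hold the values selected by cross-validation
# (l1_ratios_ is only the candidate grid, so its first entry is always 0.25)
optimal_c = model.C_[0]
optimal_l1_ratio = model.l1_ratio_[0]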

print(f"Optimal C value: {optimal_c}")
print(f"Optimal L1 ratio: {optimal_l1_ratio}")

train_accuracy = model.score(X_train_split, y_train_split)
val_accuracy = model.score(X_val, y_val)
print(f"Train Accuracy: {train_accuracy}")
print(f"Validation Accuracy: {val_accuracy}")

val_probs = model.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.1, 0.9, 0.05)

best_threshold = 0
best_accuracy = 0

# Keep the first threshold that gives the highest validation accuracy
for threshold in thresholds:
    val_preds = (val_probs >= threshold).astype(int)
    threshold_accuracy = accuracy_score(y_val, val_preds)
    if threshold_accuracy > best_accuracy:
        best_threshold = threshold
        best_accuracy = threshold_accuracy

print(f"Best threshold: {best_threshold}")
print(f"Best accuracy: {best_accuracy}")

test_probs = model.predict_proba(X_test_scaled)[:, 1]
y_preds = (test_probs >= best_threshold).astype(int)

Optimal C value: 1000.0
Optimal L1 ratio: 0.25
Train Accuracy: 0.8233312921004287
Validation Accuracy: 0.8185157972079353
Best threshold: 0.5000000000000001
Best accuracy: 0.8185157972079353
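
The threshold sweep can also be written as a small helper, shown here on toy data. This is a sketch, not the code that produced the numbers above: the labels and probabilities are made up, and the grid is built with `np.linspace` and rounded so that values print as `0.5` rather than `0.5000000000000001`. Since the threshold is chosen on the same validation set used to report accuracy, the reported figure is likely somewhat optimistic.

```python
import numpy as np

def best_threshold(y_true, probs, thresholds):
    """Return the first threshold with the highest accuracy, and that accuracy."""
    best_t, best_acc = None, -1.0
    for t in thresholds:
        preds = (probs >= t).astype(int)
        acc = float(np.mean(preds == y_true))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Hypothetical labels and predicted probabilities for six validation rows
y_true = np.array([0, 0, 1, 1, 1, 0])
probs = np.array([0.22, 0.47, 0.41, 0.72, 0.91, 0.33])

# 16 evenly spaced thresholds from 0.10 to 0.85, rounded to two decimals
thresholds = np.round(np.linspace(0.10, 0.85, 16), 2)

t, acc = best_threshold(y_true, probs, thresholds)
print(t, acc)
```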


Please note that your code should be reproducible, well-structured, and include clear comments for readability.
Your code should generate the csv file that you submitted.

In [8]:
original_test = pd.read_csv("public_private_X.csv")
submission = pd.DataFrame({"ID": original_test["ID"], "ON_TIME_AND_COMPLETE": y_preds})
submission.to_csv("final_report.csv", index=False)

## 5) Key Takeaways (Short Reflection)

* Provide a brief summary of your key takeaways from this prediction problem.
* Reflect on challenges faced and lessons learned from your model building process.

The biggest challenge was applying identical transformations to the training and test data while keeping runtime reasonable. I tried building scikit-learn pipelines for this, but runtime increased so sharply that I instead applied each transformation to both datasets manually. I also ran into the tradeoff between model complexity, accuracy, and runtime: many of my earlier submissions were highly complex and scored well on the training data, but they overfit and produced subpar test accuracy on Kaggle. Finally, it was difficult to decide which variables to transform, which interaction terms to include, and how to exclude irrelevant ones.
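
For reference, one way the shared preprocessing could be expressed as a single scikit-learn pipeline is sketched below. It is not the approach used in this report: it leaves out the interaction terms and the elastic-net search, and the toy data, the column `QUANTITY`, and the category values are hypothetical. The point is only that fitting on the training data and calling `predict` on the test data applies the same imputation, encoding, and scaling to both.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in data; column names and values are hypothetical
toy_train = pd.DataFrame({
    "QUANTITY": [10.0, np.nan, 35.0, 20.0, 5.0, 50.0],
    "ORDER_DAY_OF_WEEK": ["Mon", "Tue", "Tue", "Mon", "Fri", "Tue"],
})
toy_y = pd.Series([1, 0, 1, 1, 0, 0])
toy_test = pd.DataFrame({
    "QUANTITY": [15.0, np.nan],
    "ORDER_DAY_OF_WEEK": ["Sat", "Mon"],  # "Sat" never appears in training
})

num_cols = ["QUANTITY"]
cat_cols = ["ORDER_DAY_OF_WEEK"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation, then standardization
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    # Categorical columns: mode imputation, then one-hot encoding;
    # categories unseen during fit are encoded as all zeros
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# All preprocessing statistics are learned from the training data only
clf.fit(toy_train, toy_y)
toy_preds = clf.predict(toy_test)
print(toy_preds)
```

Passing such a pipeline to cross-validation would also refit the imputers and scaler inside each fold, which avoids learning the scaling statistics from validation rows.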