### Regression Task: Target Selection

Target Variable: 'days_since_prior_order'

We chose to predict the time interval between orders, rather than reorder counts, because of its strategic value. Predicting when a customer will return allows for:

Precision Retargeting: Sending marketing incentives close to the moment a customer is likely to restock.

Dynamic Scheduling: Optimizing logistics and supply chain operations based on predicted temporal demand.

Customer Churn Prevention: Identifying deviations from predicted ordering cycles to intervene before a customer stops using the platform.

### Feature Isolation
To ensure a robust regression model for predicting days_since_prior_order, the following technical steps were implemented:

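Feature Isolation: We isolated behavioral features by explicitly removing the identifiers (user_id, product_id), the classification target (target), and the regression target itself (days_since_prior_order) from the feature set. Dropping the two targets prevents data leakage, and dropping the identifiers keeps the model from memorizing individual users or products instead of learning from behavior.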

Categorical Integrity: Categorical columns were converted to strings before imputation. pandas raises an error (a TypeError in recent versions) when a categorical column is filled with a value outside its existing categories, so fillna(0) would otherwise fail on these columns.
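To illustrate why the conversion matters, the sketch below uses a small hypothetical Series (not the project data). It shows that filling a categorical column with a value outside its categories is rejected, and that one way around this is to cast to object before filling. The exact exception type has varied across pandas versions, so the example catches both TypeError and ValueError.

```python
import pandas as pd

# Hypothetical toy column; the real dataset's categorical columns are not shown here.
s = pd.Series(["produce", None, "dairy"], dtype="category")

# Filling with a value that is not an existing category is rejected by pandas.
try:
    s.fillna(0)
    raised = False
except (TypeError, ValueError):
    raised = True

# Casting to object first keeps missing entries as NaN, so they can be filled
# explicitly with a placeholder string before converting everything to str.
filled = s.astype(object).fillna("0").astype(str)

print(raised)
print(filled.tolist())
```

The placeholder "0" is an arbitrary choice for this illustration; any string that cannot collide with a real category would serve the same purpose.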

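Memory-Efficient Imputation: Applied a zero-filling strategy (fillna(0)) across the feature set and the target variable. This approach was chosen because it leaves no missing values in the matrix, adds no extra columns, and needs no fitted imputer, which keeps the memory footprint low for the 10-million-row dataset. One caveat: a filled 0 in days_since_prior_order is indistinguishable from a genuine same-day reorder, so the model treats both cases identically.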
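The effect of zero-filling the target can be checked on a small example. The sketch below uses hypothetical values (not taken from the project data) and compares zero-filling with one possible alternative, dropping rows whose target is missing. It is meant only to show how the two choices change the target distribution, not to prescribe either.

```python
import pandas as pd

# Hypothetical toy frame: NaN marks rows with no prior order to measure from.
df = pd.DataFrame({
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 7.0, 14.0, None, 30.0],
})

# Option used in this notebook: treat missing intervals as 0 days.
y_zero = df["days_since_prior_order"].fillna(0)

# Alternative (an assumption, not the notebook's method): drop rows without a target.
y_drop = df["days_since_prior_order"].dropna()

print(y_zero.mean())  # mean pulled toward 0 by the filled rows
print(y_drop.mean())  # mean of the observed intervals only
```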

Deterministic Splitting: Used an 80/20 train-test split with a fixed random_state (42) so that the same rows land in each set on every run and evaluation results are reproducible.
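As a quick check of the reproducibility claim, the sketch below splits a small hypothetical array twice with the same random_state and confirms that both calls select the same rows. The names X_demo and y_demo are illustrative stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real feature matrix and target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state...
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# ...select exactly the same rows for the training and test sets.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split, a_train.shape, a_test.shape)
```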

In [None]:
# --- Feature Selection & Data Cleaning ---
# Imported here so the cell runs on its own.
from sklearn.model_selection import train_test_split

# Removing identifiers, the classification target, and the regression target to isolate features.
excluded = ['user_id', 'product_id', 'target', 'days_since_prior_order']
features = [col for col in My_Data_Aggregated.columns if col not in excluded]

X = My_Data_Aggregated[features].copy()

# Handling categorical columns to prevent a TypeError during fillna(0).
# Cast to object first so missing entries stay NaN and can be filled with the
# placeholder '0'; a direct astype(str) can turn NaN into the literal string 'nan'.
for col in X.select_dtypes(include=['category']).columns:
    X[col] = X[col].astype(object).fillna('0').astype(str)

# Filling missing values with 0 for both features and the target variable.
X = X.fillna(0)
y = My_Data_Aggregated['days_since_prior_order'].fillna(0)

# Splitting the dataset into Training (80%) and Testing (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# RAM Optimization: the full feature frame is no longer needed once the split exists.
import gc

del X
gc.collect()

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
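
Because memory is a concern at this scale, it can help to measure the footprint directly. The sketch below uses a hypothetical toy column to compare the memory of a categorical column with the same data stored as strings. The exact numbers depend on the pandas version and the data, so none are quoted here.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times.
cat_col = pd.Series(["produce", "dairy", "bakery"] * 10_000, dtype="category")
str_col = cat_col.astype(str)

# deep=True counts the actual string payloads, not just the references to them.
cat_bytes = cat_col.memory_usage(deep=True)
str_bytes = str_col.memory_usage(deep=True)

print(f"category: {cat_bytes:,} bytes")
print(f"string:   {str_bytes:,} bytes")
```

On low-cardinality columns the string version is typically much larger, which is worth weighing against the convenience of the string conversion used above.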