# ----------------------------------------------------------------------------
# Title: Assignment 8.2
# Author: Surenther Selvaraj
# Date: 01 Nov 2025
# Modified By: Surenther Selvaraj
# Description: Term Project Milestone 2: Data Preparation
# Data: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
# ----------------------------------------------------------------------------

## Term Project Milestone 2: Data Preparation Plan
This document outlines the initial data preparation and feature engineering strategy for the classification model. It focuses on identifying and dropping features that are non-predictive, redundant, or otherwise problematic for modeling, and on the transformations the retained features require.

### 1. Feature Drop Strategy

To optimize the model's performance and prevent data leakage or multicollinearity, several features from the raw dataset will be excluded.

| Feature to Drop  | Rationale for Exclusion  |  Category |
|---|---|---|
| ID  | This is a unique customer identifier. It has no predictive power and should be dropped to prevent the model from memorizing individual observations.  | Non-Predictive  |
| Dt_Customer  | Date of customer enrollment. While time since enrollment might be relevant, the more actionable and direct time-based feature, Recency, is already available and highly predictive (as shown in Graph 3). As a raw date, the column also has high cardinality.  | Redundant/High Cardinality  |
| Z_CostContact  | Standardized cost of customer contact. This column is constant across all customers in the dataset (a value of 3), so it has zero variance and carries no predictive information.  | Zero-Variance |
| Z_Revenue | Standardized revenue from contact. This column is constant across all customers (a value of 11), so it has zero variance and carries no predictive information.  | Zero-Variance |
| AcceptedCmp1 to AcceptedCmp5 | Response to previous campaigns. These features represent the outcome of past offers. Using them to predict the current campaign's Response (which is effectively AcceptedCmp6) creates severe data leakage, as a positive response to a previous similar campaign is an unnaturally strong predictor for the current one. Dropping these ensures the model generalizes to new campaigns.  | Data Leakage |
| Complain | Customer complaint status. While technically a potential predictor, the count of complaints is extremely low, leading to highly skewed classes that provide minimal signal for a classification model focused on positive response prediction.  | Near Zero-Variance |

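The drop table above could be applied with a small pandas helper along the following lines. This is a minimal sketch, not the project's final code: the helper name `drop_planned_features`, the constant `DROP_COLUMNS`, and the two-row `demo` frame are hypothetical and exist only for illustration. The check on the two `Z_` columns simply re-verifies the zero-variance claim before the columns are discarded.

```python
import pandas as pd

# Columns named in the drop table (AcceptedCmp1 through AcceptedCmp5 expanded).
DROP_COLUMNS = (
    ["ID", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Complain"]
    + [f"AcceptedCmp{i}" for i in range(1, 6)]
)


def drop_planned_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns listed in the drop table."""
    # Re-check the zero-variance rationale before discarding the columns.
    for col in ("Z_CostContact", "Z_Revenue"):
        if col in df.columns and df[col].nunique(dropna=False) > 1:
            raise ValueError(f"{col} is not constant; revisit the drop decision")
    # errors="ignore" keeps the helper usable if a column is already absent.
    return df.drop(columns=DROP_COLUMNS, errors="ignore")


# Made-up rows for illustration only (not taken from the Kaggle file).
demo = pd.DataFrame({
    "ID": [1, 2],
    "Dt_Customer": ["2012-09-04", "2014-03-08"],
    "Z_CostContact": [3, 3],
    "Z_Revenue": [11, 11],
    "Complain": [0, 0],
    **{f"AcceptedCmp{i}": [0, 1] for i in range(1, 6)},
    "Income": [58138.0, 46344.0],
    "Response": [1, 0],
})

cleaned = drop_planned_features(demo)
print(list(cleaned.columns))
```

Because `DataFrame.drop` returns a new frame, the raw data stays available for later checks, such as confirming how rare complaints actually are.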

### 2. Features to Keep and Transform

The remaining features will be kept as they are highly relevant to the business problem (customer wealth, engagement, and past purchase behavior).

| Feature Type  | Features to Keep |  Required Transformation |
|---|---|---|
| Target  | Response  | None (already binary: 0 or 1)  |
| Demographic  | Income, Year_Birth (used to calculate Age), Marital_Status, Education, Kidhome, Teenhome  | Age is derived. Marital_Status and Education require one-hot encoding.  |
| Behavior  | Recency | None (numerical, may require scaling).  |
| Spending  | MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds | None (numerical, may require scaling).  |
| Channel  | NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth | None (numerical, may require scaling).  |
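
The transformations listed in the table above might be sketched as follows. Several details are assumptions rather than part of the plan: the reference year used to derive Age (`REFERENCE_YEAR`), the choice of z-score standardization for the "may require scaling" columns, and the helper names `prepare_features` and `standardize`. The `demo` rows are made up for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # assumption: the plan does not state which year Age is measured from


def standardize(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Z-score the given numeric columns (one possible scaling choice)."""
    out = df.copy()
    for col in columns:
        std = out[col].std(ddof=0)
        # Guard against constant columns, which would divide by zero.
        out[col] = (out[col] - out[col].mean()) / (std if std > 0 else 1.0)
    return out


def prepare_features(df: pd.DataFrame, numeric_cols: list[str],
                     reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Derive Age, one-hot encode the categoricals, and scale numeric columns."""
    out = df.copy()
    # Age is derived from Year_Birth; the raw year is then no longer needed.
    out["Age"] = reference_year - out["Year_Birth"]
    out = out.drop(columns=["Year_Birth"])
    # One-hot encode the two categorical features named in the table.
    out = pd.get_dummies(out, columns=["Marital_Status", "Education"], dtype=int)
    return standardize(out, numeric_cols)


# Made-up rows for illustration only (not taken from the Kaggle file).
demo = pd.DataFrame({
    "Year_Birth": [1957, 1984],
    "Marital_Status": ["Single", "Married"],
    "Education": ["Graduation", "PhD"],
    "Income": [58138.0, 26646.0],
    "Response": [1, 0],
})

prepared = prepare_features(demo, numeric_cols=["Income"])
print(prepared.columns.tolist())
```

In a real pipeline the scaling statistics would normally be computed on the training split only and then reused on the test split, so that no information leaks from the held-out data.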