# 01 - Data Cleaning and Preprocessing

This notebook loads the raw customer purchase dataset, checks its structure and missing values, converts categorical columns to the `category` dtype, and saves a cleaned version for further analysis.


In [1]:
import pandas as pd

In [2]:
# Load the raw data
df = pd.read_csv('../data/raw/customer_purchase_data.csv') 

# Show first few rows
df.head()

|   | Age | Gender | AnnualIncome | NumberOfPurchases | ProductCategory | TimeSpentOnWebsite | LoyaltyProgram | DiscountsAvailed | PurchaseStatus |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 66120.267939 | 8 | 0 | 30.568601 | 0 | 5 | 1 |
| 1 | 20 | 1 | 23579.773583 | 4 | 2 | 38.240097 | 0 | 5 | 0 |
| 2 | 27 | 1 | 127821.306432 | 11 | 2 | 31.633212 | 1 | 0 | 1 |
| 3 | 24 | 1 | 137798.62312 | 19 | 3 | 46.167059 | 0 | 4 | 1 |
| 4 | 31 | 1 | 99300.96422 | 19 | 1 | 19.823592 | 0 | 0 | 1 |


## Step 1: Dataset Overview and Missing Value Check

In this step, we examine the structure of the dataset (column types and the number of rows and columns) and check whether any values are missing.

In [3]:
# Check shape and data types
print("Dataset shape:", df.shape)
df.info()


Dataset shape: (1500, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 1500 non-null   int64  
 1   Gender              1500 non-null   int64  
 2   AnnualIncome        1500 non-null   float64
 3   NumberOfPurchases   1500 non-null   int64  
 4   ProductCategory     1500 non-null   int64  
 5   TimeSpentOnWebsite  1500 non-null   float64
 6   LoyaltyProgram      1500 non-null   int64  
 7   DiscountsAvailed    1500 non-null   int64  
 8   PurchaseStatus      1500 non-null   int64  
dtypes: float64(2), int64(7)
memory usage: 105.6 KB


### Interpretation:
* No missing values (all columns have 1500 non-null values)
* All columns are numeric (`int64` or `float64`)
* Some of these columns are categorical in meaning (such as `Gender`, `LoyaltyProgram`, and `PurchaseStatus`) even though they are stored as numbers.
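
The non-null counts from `df.info()` can also be confirmed with an explicit per-column count. The sketch below is illustrative only: it uses a small hypothetical frame named `toy` with missing values deliberately inserted, since the real dataset has none. In the notebook, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately inserted gaps (not the real data).
toy = pd.DataFrame({
    "Age": [40, 20, np.nan],
    "AnnualIncome": [66120.27, np.nan, 127821.31],
    "Gender": [1, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells; zero would mean no imputation is needed.
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

On the actual dataset, every count should be zero, matching the `df.info()` output above.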

---

## Step 2: Convert Categorical Variables

Although all variables are stored as numeric types, some represent categories.  
We'll convert those to the `category` dtype for clarity and better EDA handling.


In [4]:
# Convert numeric columns that are actually categorical
categorical_cols = ['Gender', 'ProductCategory', 'LoyaltyProgram', 'PurchaseStatus']
for col in categorical_cols:
    df[col] = df[col].astype('category')

# Confirm changes
df.dtypes

Age                      int64
Gender                category
AnnualIncome           float64
NumberOfPurchases        int64
ProductCategory       category
TimeSpentOnWebsite     float64
LoyaltyProgram        category
DiscountsAvailed         int64
PurchaseStatus        category
dtype: object
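
Before converting, it can help to confirm that each candidate column really has only a few distinct values. The following is a minimal sketch on a hypothetical frame named `toy` (the values are illustrative, not taken from the full dataset); it also shows that `astype` accepts a dictionary, which converts several columns in one call as an alternative to the loop above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded dataset.
toy = pd.DataFrame({
    "Gender": [1, 0, 1, 1],
    "ProductCategory": [0, 2, 2, 3],
    "AnnualIncome": [66120.27, 23579.77, 127821.31, 137798.62],
})
candidate_cols = ["Gender", "ProductCategory"]

# A small number of distinct values suggests the column encodes categories.
distinct_counts = toy[candidate_cols].nunique()
print(distinct_counts)

# Convert all candidate columns at once; other columns keep their dtype.
toy = toy.astype({col: "category" for col in candidate_cols})
print(toy.dtypes)
```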

## Step 3: Save the Cleaned Dataset

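We save the cleaned dataset into the `data/processed` folder so it can be reused during the EDA and modeling phases. Note that CSV does not store pandas dtypes, so the `category` conversions have to be reapplied when the file is loaded again.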


In [5]:
# Save the cleaned dataframe to a new CSV
df.to_csv('../data/processed/cleaned_customer_data.csv', index=False)
print("Cleaned dataset saved successfully.")


Cleaned dataset saved successfully.
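
A minimal sketch of what happens to the `category` dtype on a CSV round trip, using an in-memory buffer and a hypothetical frame named `toy` instead of the real file. The variable names here are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    "Gender": [1, 0, 1],
    "AnnualIncome": [66120.27, 23579.77, 127821.31],
})
toy["Gender"] = toy["Gender"].astype("category")

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read infers the dtype from the text, so the column is integer again.
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# Reapplying the conversion after loading restores the category dtype.
buffer.seek(0)
restored = pd.read_csv(buffer).astype({"Gender": "category"})
print(restored.dtypes)
```

In a later notebook, the same `astype` call could follow `pd.read_csv('../data/processed/cleaned_customer_data.csv')` with the four categorical column names used in Step 2.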
