# Shopping Trends Data Preprocessing

This notebook walks through the preprocessing steps for the `shopping_trends.csv` dataset, preparing it for clustering and analysis.

## Step 1: Import Required Libraries

We use pandas for data manipulation, numpy for numerical operations, and scikit-learn for preprocessing.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Step 2: Load the Dataset

Read the CSV file into a pandas DataFrame and display the first few rows.

In [None]:
df = pd.read_csv('shopping_trends.csv')
df.head()

## Step 3: Check for Missing Values

Missing values can affect analysis and modeling. We count the null values in each column; this cell only reports them and does not fill or drop anything.

In [None]:
print("Missing Values:\n", df.isnull().sum())
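
The check above only reports nulls. If a column did contain them, one common option is to fill numeric columns with the median and categorical columns with the most frequent value. The following is a minimal sketch on a made-up toy frame; `toy` and `impute_simple` are hypothetical names, and whether imputation suits this dataset is an assumption rather than something the notebook establishes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Gender": ["Male", "Female", None, "Female"],
})


def impute_simple(frame):
    """Fill numeric nulls with the column median and other nulls with the column mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


toy_filled = impute_simple(toy)
print(toy_filled)
```

Dropping the affected rows with `df.dropna()` is the simpler alternative when only a few rows are incomplete.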

## Step 4: Drop Irrelevant Columns

Columns like `Customer ID` are identifiers and do not contribute to clustering. We remove them.

In [None]:
df = df.drop(columns=['Customer ID'], errors='ignore')
df.head()

## Step 5: Handle Duplicate Rows

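Duplicate rows can bias analysis by over-weighting repeated records. Because `Customer ID` was dropped in Step 4, two rows now count as duplicates whenever all remaining columns match. We count such rows and then remove them.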

In [None]:
print("Number of duplicate rows:", df.duplicated().sum())

In [None]:
df = df.drop_duplicates()
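
The order of Steps 4 and 5 matters for what counts as a duplicate. As a small illustration on made-up data (the `toy` frame is hypothetical), a unique identifier column hides rows that are otherwise identical:

```python
import pandas as pd

# Rows 0 and 1 differ only in their identifier.
toy = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "Age": [30, 30, 45],
    "Category": ["Clothing", "Clothing", "Footwear"],
})

# With the identifier present, every row is unique.
dupes_with_id = int(toy.duplicated().sum())

# Without it, the second row repeats the first.
dupes_without_id = int(toy.drop(columns=["Customer ID"]).duplicated().sum())

print("With ID:", dupes_with_id)
print("Without ID:", dupes_without_id)
```

Whether two customers with identical attributes should be collapsed into one row is a modeling choice; keeping them preserves how often each profile occurs.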

## Step 6: Identify Numerical and Categorical Columns

We list the numerical and categorical columns by name so that each group receives the appropriate preprocessing in Step 9. The names must match the CSV headers exactly.

In [None]:
numerical_cols = ['Age', 'Purchase Amount (USD)', 'Review Rating', 'Previous Purchases']

categorical_cols = ['Gender', 'Item Purchased', 'Category', 'Location', 'Size', 'Color', 
                    'Season', 'Subscription Status', 'Payment Method', 'Shipping Type', 
                    'Discount Applied', 'Promo Code Used', 'Preferred Payment Method', 
                    'Frequency of Purchases']

## Step 7: Detect Outliers in Numerical Columns

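Outliers can skew results, and standardization in particular is sensitive to them because it relies on the mean and standard deviation. We use the Interquartile Range (IQR) method: a value is flagged if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. This step only reports outliers; it does not change the data.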

In [None]:
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    print(f"Outliers in {col}:\n", outliers)
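
Printing every outlier can get long on a large dataset. The same IQR rule can be wrapped in a small helper that returns a boolean mask, which makes it easy to count or filter outliers. This is a sketch on a made-up series; `iqr_outlier_mask` and its `k` parameter are hypothetical names, not part of pandas.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values with one obvious extreme.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100], name="Purchase Amount (USD)")
mask = iqr_outlier_mask(toy)

print("Outlier count:", int(mask.sum()))
print(toy[mask])
```

On the real data, `{col: int(iqr_outlier_mask(df[col]).sum()) for col in numerical_cols}` would give a per-column count.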

## Step 8: Cap Outliers (Optional)

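To reduce the impact of extreme values, we cap `Purchase Amount (USD)` at its 95th percentile. This is a separate, simpler rule than the IQR bounds from Step 7: only the upper tail is clipped, and the other numerical columns are left unchanged. Skip this cell to keep the original values.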

In [None]:
df['Purchase Amount (USD)'] = df['Purchase Amount (USD)'].clip(upper=df['Purchase Amount (USD)'].quantile(0.95))
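
To see what the cap does, the sketch below applies the same `clip` call to a made-up series of 100 values (the `amounts` series is hypothetical) and counts how many entries change.

```python
import pandas as pd

# Made-up amounts 1.0 .. 100.0 standing in for the real column.
amounts = pd.Series([float(v) for v in range(1, 101)], name="Purchase Amount (USD)")

cap = amounts.quantile(0.95)       # 95th percentile of the toy data
capped = amounts.clip(upper=cap)   # values above the cap are set to the cap

n_changed = int((capped != amounts).sum())
print("Cap:", cap)
print("Values changed:", n_changed)
print("Minimum unchanged:", capped.min() == amounts.min())
```

Only values above the cap are replaced; the lower tail is untouched. Passing `lower=` as well would clip both tails.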

## Step 9: Preprocessing Pipeline

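We use scikit-learn's `ColumnTransformer` to standardize the numerical features (zero mean, unit variance) and one-hot encode the categorical features. `drop='first'` omits the first category of each feature to avoid redundant columns, and `sparse_output=False` (available in scikit-learn 1.2 and later) returns a dense array.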

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_cols)
    ])

## Step 10: Apply Preprocessing

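Fit the `ColumnTransformer` from Step 9 on the DataFrame and transform it in one call. The result is a NumPy array with the scaled numerical columns first, followed by the one-hot columns.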

In [None]:
X = preprocessor.fit_transform(df)

## Step 11: Retrieve Feature Names

After one-hot encoding, we get the new feature names for the transformed data.

In [None]:
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
feature_names = numerical_cols + list(cat_feature_names)
X = pd.DataFrame(X, columns=feature_names)

## Step 12: Save Preprocessed Data

Export the cleaned and transformed dataset to a new CSV file for further analysis.

In [None]:
X.to_csv('preprocessed_shopping_trends.csv', index=False)
print("Preprocessed data saved to 'preprocessed_shopping_trends.csv'")

## Step 13: Preview Preprocessed Data

Display the first few rows of the final preprocessed dataset.

In [None]:
print("Preprocessed Data (first 5 rows):\n", X.head())

## Summary of Preprocessing Steps

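- **Missing Values:** Counted nulls per column; no imputation or row removal is performed in this notebook.
- **Duplicates:** Removed rows that are identical across all remaining columns.
- **Outliers:** Flagged with the 1.5 × IQR rule; capped `Purchase Amount (USD)` at its 95th percentile.
- **Encoding:** Categorical variables one-hot encoded, dropping the first category of each feature.
- **Scaling:** Numerical features standardized with `StandardScaler`.
- **Irrelevant Columns:** Dropped the `Customer ID` identifier.
- **Saving:** Exported the preprocessed data to `preprocessed_shopping_trends.csv`.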