# 🛍️ E-commerce Customer Segmentation using K-Means Clustering

This notebook segments e-commerce customers by computing RFM features (Recency, Frequency, Monetary) for each customer and clustering them with K-Means.

## 📦 Step 1: Import Required Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import datetime as dt

sns.set_theme(style="whitegrid")
```

## 📥 Step 2: Load Dataset

```python
# Load dataset (change filename as needed)
df = pd.read_csv("data.csv", encoding='ISO-8859-1')
df.head()
```

## 🧹 Step 3: Basic Data Exploration

```python
# Check shape
print("Shape:", df.shape)

# Check column info
df.info()

# Describe numerical data
df.describe()
```

## ❓ Step 4: Check Missing and Duplicate Values

```python
# Missing values
missing = df.isnull().sum()
print("Missing values:\n", missing)

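# Drop rows with missing CustomerID (they cannot be assigned to a customer)
df = df.dropna(subset=["CustomerID"])

# Count exact duplicate rows, then drop them so they are not double-counted
duplicates = df.duplicated().sum()
print("Duplicate rows:", duplicates)
df = df.drop_duplicates()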
```

## 🔍 Step 5: Data Cleaning

```python
# Remove canceled orders (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# Filter out negative or zero quantities and prices;
# .copy() avoids SettingWithCopyWarning on the column assignments below
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create TotalPrice = Quantity * UnitPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

df.shape
```

## 🧠 Step 6: Feature Engineering – RFM Metrics

```python
# Snapshot date: 1 day after last invoice
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

# Group by customer
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                   # Frequency
    'TotalPrice': 'sum'                                       # Monetary
}).reset_index()

rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
rfm.head()
```
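
To make the aggregation above concrete, here is a minimal sketch on a hand-made toy table. The three customers, their invoices, and the names `toy`, `toy_snapshot`, and `toy_rfm` are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy transactions (invented values, illustration only)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2024-01-01", "2024-01-10", "2024-01-05",
        "2024-01-08", "2024-01-08", "2024-01-12",
    ]),
    "TotalPrice": [10.0, 20.0, 5.0, 7.5, 2.5, 30.0],
})

# Snapshot date: one day after the most recent invoice
toy_snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# Same three metrics as above, written with named aggregation
toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (toy_snapshot - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),   # distinct invoices, not line items
    Monetary=("TotalPrice", "sum"),
).reset_index()

print(toy_rfm)
```

Customer 3 has three line items but only two distinct invoices, so its Frequency is 2. This is why the notebook uses `nunique` rather than `count`.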

## ⚖️ Step 7: Normalize RFM Features

```python
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])
```
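
As a quick sanity check on what the scaling step does, the sketch below reproduces `StandardScaler` by hand on a small made-up array. The values in `X_toy` are hypothetical. The point is only that each column is centered on its mean and divided by its population standard deviation, so the three features contribute on comparable scales to the Euclidean distances K-Means uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-like rows: [Recency, Frequency, Monetary] (illustration only)
X_toy = np.array([
    [1.0, 50.0, 1000.0],
    [10.0, 5.0, 500.0],
    [30.0, 2.0, 120.0],
    [200.0, 1.0, 20.0],
])

# Manual z-score per column; np.std defaults to the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# StandardScaler applies the same transformation
scaled_toy = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled_toy))
print(scaled_toy.mean(axis=0).round(6), scaled_toy.std(axis=0).round(6))
```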

## 📉 Step 8: Elbow Method for Optimal Clusters

```python
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(rfm_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('No. of Clusters')
plt.ylabel('WCSS')
plt.show()
```

## 🎯 Step 9: K-Means Clustering

```python
# Assume 4 clusters, read off the elbow curve (adjust for your data)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)

# Silhouette Score
score = silhouette_score(rfm_scaled, rfm['Cluster'])
print("Silhouette Score:", score)
```
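
The elbow can be hard to read, so one option is to compare the silhouette score across several values of k as a cross-check. The sketch below is not part of the original workflow. It uses synthetic blobs from `make_blobs` as a stand-in for `rfm_scaled`, and the names `X_demo`, `sil_scores`, and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: three well-separated 3-D blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10, -10], [0, 0, 0], [10, 10, 10]],
    cluster_std=1.0,
    random_state=42,
)

sil_scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

# Higher is better; values lie in [-1, 1]
best_k = max(sil_scores, key=sil_scores.get)
for k, s in sil_scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

On real RFM data the peak is usually less clear-cut than on these blobs, so treat the result as one signal alongside the elbow curve and business interpretability.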

## 📊 Step 10: Cluster Summary

```python
cluster_summary = rfm.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean',
    'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Count'}).reset_index()

cluster_summary
```
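
The per-cluster means in the summary are the K-Means centroids expressed in original units. A minimal sketch, on made-up data, of recovering them directly from a fitted model with `inverse_transform`. The six rows and the names `toy_rfm`, `toy_scaler`, and `toy_kmeans` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: three recent big spenders, three lapsed small ones
toy_rfm = pd.DataFrame({
    "Recency":   [2, 3, 4, 200, 210, 220],
    "Frequency": [20, 22, 21, 1, 2, 1],
    "Monetary":  [5000.0, 5200.0, 5100.0, 50.0, 60.0, 40.0],
})
features = ["Recency", "Frequency", "Monetary"]

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_rfm[features])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_scaled)

# Centroids live in scaled space; map them back to days, orders, and currency
centroids = pd.DataFrame(
    toy_scaler.inverse_transform(toy_kmeans.cluster_centers_),
    columns=features,
)
print(centroids.round(1))
```

Cluster numbering is arbitrary, so identify segments by their centroid values rather than by label.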

## 📈 Step 11: Visualize Clusters

```python
# Pairplot of the RFM features only (excludes CustomerID), colored by cluster
sns.pairplot(rfm, vars=['Recency', 'Frequency', 'Monetary'], hue='Cluster', palette='tab10')
plt.suptitle("Cluster-wise RFM Distribution", y=1.02)
plt.show()
```

## 💾 Step 12: Save Results

```python
# Export clustered customers to CSV
rfm.to_csv("clustered_customers.csv", index=False)
```

## ✅ Project Complete!

You have now segmented customers of an e-commerce dataset using RFM features and K-Means clustering. The resulting segments can help a business tailor its marketing to each group.

## 🌐 Optional: Streamlit Dashboard (app.py)

To deploy an interactive dashboard:

```python
# Save as app.py

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('clustered_customers.csv')

st.title("Customer Segmentation Dashboard")

# Sidebar filter
cluster = st.sidebar.selectbox("Select Cluster", sorted(df['Cluster'].unique()))

# Show summary statistics for the selected cluster's RFM features
selected = df[df['Cluster'] == cluster]
st.subheader(f"Cluster {cluster} Summary")
st.write(selected[['Recency', 'Frequency', 'Monetary']].describe())

# Plot the RFM distributions of the selected cluster
st.subheader(f"RFM Distribution for Cluster {cluster}")
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
for axis, col in zip(ax, ['Recency', 'Frequency', 'Monetary']):
    sns.histplot(selected[col], ax=axis)
st.pyplot(fig)
```

Run it with:

```bash
streamlit run app.py
```

## 🧬 Optional: Dimensionality Reduction with PCA and t-SNE

### PCA Visualization

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_components = pca.fit_transform(rfm_scaled)

rfm['PCA1'] = pca_components[:, 0]
rfm['PCA2'] = pca_components[:, 1]

plt.figure(figsize=(8,6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=rfm, palette='Set2')
plt.title("Customer Segments via PCA")
plt.show()
```
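
A 2-D PCA plot is only as faithful as the variance its two components retain, which `explained_variance_ratio_` reports. The sketch below runs on synthetic data standing in for `rfm_scaled`. The array `X_demo` and its correlation structure are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for rfm_scaled: two correlated columns plus one independent
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca_demo = PCA(n_components=2).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_

# Share of total variance captured by each component, and by both together
print("Per component:", ratios.round(3))
print("Total retained:", ratios.sum().round(3))
```

If the total retained is low, clusters that look overlapping in the scatter plot may still be well separated in the full feature space.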

### t-SNE Visualization

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_components = tsne.fit_transform(rfm_scaled)

rfm['TSNE1'] = tsne_components[:, 0]
rfm['TSNE2'] = tsne_components[:, 1]

plt.figure(figsize=(8,6))
sns.scatterplot(x='TSNE1', y='TSNE2', hue='Cluster', data=rfm, palette='Set1')
plt.title("Customer Segments via t-SNE")
plt.show()
```