# SmartCart Customer Segmentation Project

## Objective
The goal of this project is to analyze customer purchasing behavior and apply clustering techniques to segment customers based on their shopping patterns.

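## Dataset Description
The dataset (`smartcart_customers.csv`) has one row per customer. Its columns include:
- Income
- Recency
- NumWebPurchases
- NumCatalogPurchases
- NumStorePurchases
- ID, Year_Birth, Dt_Customer, Education, Marital_Status
- Kidhome, Teenhome
- MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds
- Complain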


## Import Basic Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Load Dataset

In [None]:
data=pd.read_csv("smartcart_customers.csv")
data.head()

## Data Understanding

In [None]:
data.info()

In [None]:
data.isnull().sum()

## Data Cleaning

In [None]:
from sklearn.impute import SimpleImputer
# Fill missing Income values with the column median.
si=SimpleImputer(strategy="median")
data["Income"]=si.fit_transform(data[["Income"]]).ravel()

In [None]:
data["Income"].isnull().sum()

## Feature Engineering

In [None]:
data=data.drop("ID",axis=1)

In [None]:
# Age is computed against a hard-coded reference year (2026).
data["Age"]=2026-data["Year_Birth"]
data=data.drop("Year_Birth",axis=1)


In [None]:
data.columns

In [None]:
data["Dt_Customer"].dtype

In [None]:
data["Dt_Customer"]=pd.to_datetime(data["Dt_Customer"],format="%d-%m-%Y")
data["Customer_tenure"]=data["Dt_Customer"].max()-data["Dt_Customer"]
data=data.drop("Dt_Customer",axis=1)
data["Customer_tenure"]=data["Customer_tenure"].dt.days

In [None]:
# We are segmenting customers, not products, so the per-category spend columns are merged into one total.
data["Total_Spent"]=data["MntFruits"]+data["MntMeatProducts"]+data["MntFishProducts"]+data["MntWines"]+data["MntSweetProducts"]+data["MntGoldProds"]

data["Children"]=data["Kidhome"]+data["Teenhome"]
data=data.drop(["MntFruits","MntMeatProducts","MntFishProducts","MntWines","MntSweetProducts","Kidhome","Teenhome","MntGoldProds"],axis=1)

In [None]:
data.info()

In [None]:
data["Marital_Status"].unique()

In [None]:
data["Education"].unique()

In [None]:
data["Marital_Status"]=data["Marital_Status"].map({
    "Single":"Single","Married":"Together","Together":"Together",
    "Divorced":"Single","Widow":"Single","Alone":"Single",
    "Absurd":"Single","YOLO":"Single"
})

In [None]:
data["Education"]=data["Education"].map({
    "PhD":"PostGraduate","Master":"PostGraduate","Graduation":"Graduate",
    "Basic":"UnderGraduate","2n Cycle":"UnderGraduate"
})
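
Note that `Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN`, so an unlisted category would silently disappear from the encoded data. A minimal sketch of a check before mapping, using a small made-up series (`demo`) with a hypothetical unseen category `"Engaged"` rather than the real data:

```python
import pandas as pd

# Mapping taken from the cell above.
marital_map = {
    "Single": "Single", "Married": "Together", "Together": "Together",
    "Divorced": "Single", "Widow": "Single", "Alone": "Single",
    "Absurd": "Single", "YOLO": "Single",
}

# Made-up sample; "Engaged" is a hypothetical category that is not in the mapping.
demo = pd.Series(["Single", "Married", "Widow", "Engaged"], name="Marital_Status")

# Categories present in the data but absent from the mapping keys.
unmapped = sorted(set(demo.dropna().unique()) - set(marital_map))
print("Unmapped categories:", unmapped)

# After mapping, every unmapped value has become NaN.
mapped = demo.map(marital_map)
print(mapped.isnull().sum(), "value(s) became NaN")
```

Running the same check on `data["Marital_Status"]` and `data["Education"]` before the `map` calls would show whether any category is being lost.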

In [None]:
data.shape

## Outlier Detection and Removal


In [None]:
data.columns

In [None]:
cols=["Income","Age","Complain","Children","Customer_tenure"]
sns.pairplot(data[cols])

In [None]:
# Drop implausible ages (100 or older) and extreme incomes (300,000 or more) seen in the pair plot.
data=data[data["Age"]<100]
data=data[data["Income"]<300000]
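
The cut-offs above are fixed by eye. As an alternative (not what this notebook does), thresholds can be derived from the data with the interquartile range (IQR) rule. A minimal sketch on made-up income values; `iqr_bounds` and `demo` are hypothetical names introduced only for illustration:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) bounds at k * IQR below Q1 and above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes with one extreme value.
demo = pd.DataFrame({"Income": [30000, 42000, 51000, 58000, 64000, 72000, 666666]})

low, high = iqr_bounds(demo["Income"])
# Keep only the rows whose income lies inside the bounds.
filtered = demo[demo["Income"].between(low, high)]
print(low, high, len(filtered))
```

The multiplier `k=1.5` is the conventional default; a larger value keeps more of the tail.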

In [None]:
corr=data.corr(numeric_only=True)


In [None]:
sns.heatmap(corr,annot=True,fmt=".2f",cmap="coolwarm",annot_kws={"size":8})

## Data Encoding

In [None]:
data_cleaned=pd.get_dummies(data,columns=["Education","Marital_Status"],dtype=int)

## Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
data_scaled=sc.fit_transform(data_cleaned)

## Visualization of Data

In [None]:
from sklearn.decomposition import PCA
pca_2=PCA(n_components=2)
data_pca_2=pca_2.fit_transform(data_scaled)
pca_2.explained_variance_ratio_

In [None]:
sns.scatterplot(x=data_pca_2[:,0],y=data_pca_2[:,1])

In [None]:
pca_3=PCA(n_components=3)
data_pca_3=pca_3.fit_transform(data_scaled)
pca_3.explained_variance_ratio_  # share of variance captured by each of the 3 components; their sum is the total variance retained
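
How much variance a given number of components retains can be checked with the cumulative explained variance ratio. The sketch below uses a synthetic stand-in matrix (`X_demo`, a hypothetical name) instead of `data_scaled`, and the 90% threshold is an assumption chosen only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_scaled: 6 features driven by 2 latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # assumed threshold, not taken from the notebook
n_needed = int(np.searchsorted(cum_var, threshold) + 1)
print(cum_var.round(3), n_needed)
```

Applying the same steps to `data_scaled` would show whether 3 components are enough for the chosen threshold.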

In [None]:
fig=plt.figure(figsize=(8,6))
ax=fig.add_subplot(111,projection="3d")
ax.scatter(data_pca_3[:,0],data_pca_3[:,1],data_pca_3[:,2])
plt.show()

## Elbow Method

In [None]:
from sklearn.cluster import KMeans
wcss=[]
for k in range(2,10):
    model = KMeans(n_clusters=k,random_state=42)
    model.fit(data_pca_3)
    wcss.append(model.inertia_)

In [None]:
sns.lineplot(x=range(2,10),y=wcss,marker="o")

In [None]:
from kneed import KneeLocator
kneedle=KneeLocator(range(2,10),wcss,curve="convex",direction="decreasing")
kneedle.knee
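
The elbow can be ambiguous, so a second criterion such as the silhouette score is sometimes used as a cross-check. This is an addition, not part of the original analysis. A minimal sketch on synthetic blobs (`X_demo` is a hypothetical stand-in for `data_pca_3`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_pca_3.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Silhouette score for each candidate k (higher means better-separated clusters).
sil_scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores, best_k)
```

If the silhouette optimum and the knee disagree, it is worth inspecting both candidate values of `k` before fixing the number of clusters.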

## KMeans

In [None]:
best_kmean_model=KMeans(n_clusters=5,random_state=42)
labels_kmean=best_kmean_model.fit_predict(data_pca_3)
fig=plt.figure(figsize=(8,6))
ax=fig.add_subplot(111,projection="3d")
ax.scatter(data_pca_3[:,0],data_pca_3[:,1],data_pca_3[:,2],c=labels_kmean)
plt.show()

## Agglomerative Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
best_agg_model=AgglomerativeClustering(n_clusters=5)
labels_agg=best_agg_model.fit_predict(data_pca_3)
fig=plt.figure(figsize=(8,6))
ax=fig.add_subplot(111,projection="3d")
ax.scatter(data_pca_3[:,0],data_pca_3[:,1],data_pca_3[:,2],c=labels_agg)
plt.show()

### Both models produce almost the same clusters, so either one can be used

In [None]:
data["Cluster"]=labels_agg
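
The agreement between the two models is judged visually above. It can also be quantified, for example with the adjusted Rand index (1.0 means identical partitions, values near 0 mean chance-level agreement) and a cross-tabulation of the two label sets. A minimal sketch on synthetic blobs; `X_demo`, `labels_km` and `labels_ag` are hypothetical stand-ins for `data_pca_3`, `labels_kmean` and `labels_agg`:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for data_pca_3.
X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

labels_km = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_demo)
labels_ag = AgglomerativeClustering(n_clusters=5).fit_predict(X_demo)

# The adjusted Rand index ignores how the cluster ids are numbered.
ari = adjusted_rand_score(labels_km, labels_ag)

# Rows: KMeans clusters, columns: agglomerative clusters.
overlap = pd.crosstab(labels_km, labels_ag, rownames=["kmeans"], colnames=["agglomerative"])
print(round(ari, 3))
print(overlap)
```

Cluster ids are arbitrary, so cluster 0 of one model need not correspond to cluster 0 of the other; the cross-tabulation shows which ids match.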

## Cluster Analysis

In [None]:
sns.countplot(x=data["Cluster"],palette="Set1",hue=data["Cluster"])

In [None]:
sns.scatterplot(x=data["Total_Spent"],y=data["Income"],hue=data["Cluster"],palette="Set1")
# Total_Spent and Income are highly correlated in the heatmap,
# so plotting them against each other helps show how the clusters differ.

In [None]:
cluster_summary=data.groupby("Cluster").mean(numeric_only=True)
cluster_summary

- Cluster 0: High Income, High Spenders
- Cluster 1: Low to Mid Income, Moderate Spenders
- Cluster 2: Low Income, Low Spenders
- Cluster 3: High Income, Moderate Spenders
- Cluster 4: Low Income, Moderate Spenders
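
To make these interpretations usable downstream, the cluster ids can be mapped to readable segment names. A minimal sketch on a small made-up frame (`demo`); the `Segment` column name is hypothetical, and the names are the ones listed above:

```python
import pandas as pd

# Segment names taken from the cluster interpretation above.
segment_names = {
    0: "High Income, High Spenders",
    1: "Low to Mid Income, Moderate Spenders",
    2: "Low Income, Low Spenders",
    3: "High Income, Moderate Spenders",
    4: "Low Income, Moderate Spenders",
}

# Made-up cluster assignments standing in for data["Cluster"].
demo = pd.DataFrame({"Cluster": [0, 2, 4, 1, 3, 2]})

# Attach the readable name and count customers per segment.
demo["Segment"] = demo["Cluster"].map(segment_names)
segment_counts = demo["Segment"].value_counts()
print(segment_counts)
```

Because cluster ids depend on the fitted model, the mapping should be rechecked against `cluster_summary` whenever the clustering is rerun.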

## Conclusion
The customers were segmented into 5 clusters, characterized here by income and spending behavior.
These segments can help businesses design targeted marketing strategies and improve customer engagement.