
### Customer Segmentation for Marketing Campaigns

**Objective:** Segment customers based on their behavior and demographics to optimize marketing campaigns.

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

- **Step 1: Initialize Spark Session**: Starts a Spark session to utilize Spark's distributed computing capabilities.

In [None]:
# Step 1: Initialize Spark Session
spark = SparkSession.builder \
    .appName("Customer Segmentation") \
    .getOrCreate()

- **Step 2: Load and Prepare Data**: Loads synthetic customer data from a CSV file into a Spark DataFrame, taking column names from the header row and inferring column types. In a real-time scenario, this step would instead read from a streaming or continuously updated source.

In [None]:
# Step 2: Load data into Spark DataFrame
data = spark.read.csv("/FileStore/tables/customer_segment_data.csv", header=True, inferSchema=True)
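
The CSV file itself is not included here, so the following is a minimal sketch of how a comparable synthetic file could be generated with the Python standard library. The column names match the ones used later in this notebook; the value ranges, row count, and output location (the system temp directory) are assumptions made for illustration, not properties of the original dataset.

```python
import csv
import os
import random
import tempfile

def write_synthetic_customers(path, n_rows=100, seed=42):
    """Write a CSV of made-up customers with the columns this notebook expects."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["CustomerID", "Age", "Income", "SpendingScore"])
        for customer_id in range(1, n_rows + 1):
            age = rng.randint(18, 70)                      # assumed age range
            income = round(rng.uniform(20000, 150000), 2)  # assumed income range
            spending_score = rng.randint(1, 100)           # assumed score range
            writer.writerow([customer_id, age, income, spending_score])

# Hypothetical output location; point spark.read.csv at this path to use the file.
csv_path = os.path.join(tempfile.gettempdir(), "customer_segment_data.csv")
write_synthetic_customers(csv_path)
```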

- **Step 3: Feature Engineering**: Uses `VectorAssembler` to combine the selected feature columns into a single vector column named "features", the input format that Spark ML estimators expect. Only `CustomerID` and the assembled vector are kept.

In [None]:
# Step 3: Feature Engineering
feature_columns = ["Age", "Income", "SpendingScore"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data).select("CustomerID", "features")

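- **Step 4: Scale Features**: Rescales each feature to unit standard deviation with `StandardScaler` (`withStd=True`, `withMean=False`, so values are not mean-centered). Scaling matters for distance-based algorithms like K-Means: without it, `Income`, whose raw values are orders of magnitude larger than `Age`, would dominate every distance computation.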

In [None]:
# Step 4: Scale Features
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(data)
data = scalerModel.transform(data)
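
To make the effect of `withStd=True, withMean=False` concrete, here is a minimal plain-Python sketch of the same idea: divide each column by its sample standard deviation and leave the mean alone. It is an illustration, not Spark's implementation; the rows are hypothetical, and mapping a zero-variance column to 0.0 is how Spark is understood to behave rather than something this notebook demonstrates.

```python
from statistics import stdev

def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation (no mean centering)."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]
    # A constant column has std 0; emit 0.0 instead of dividing by zero.
    return [
        [value / s if s > 0 else 0.0 for value, s in zip(row, stds)]
        for row in rows
    ]

# Hypothetical (Age, Income, SpendingScore) rows, not taken from the dataset.
rows = [
    [20.0, 30000.0, 10.0],
    [40.0, 60000.0, 50.0],
    [60.0, 90000.0, 90.0],
]
scaled = scale_to_unit_std(rows)

# After scaling, every column has sample standard deviation 1,
# so Income no longer outweighs Age in a distance computation.
for row in scaled:
    print(row)
```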

- **Step 5: Build and Train Clustering Model**: Initializes a K-Means model (`KMeans`) with `k=3` clusters and a fixed seed for reproducibility, then fits it to the prepared dataset.

In [None]:
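# Step 5: Build and Train Clustering Model (K-Means)
# KMeans reads the "features" column by default, so point it at the scaled
# vector explicitly; otherwise the scaling from Step 4 is never used.
kmeans = KMeans(featuresCol="scaledFeatures", k=3, seed=42)
model = kmeans.fit(data)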
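
For intuition about what fitting does, the sketch below runs the classic Lloyd iteration in plain Python: assign each point to its nearest centroid, recompute each centroid as the mean of its members, and stop when nothing changes. It is a simplification under stated assumptions: the points are hypothetical, and centroids are initialized naively from the first `k` points, whereas Spark's `KMeans` uses the `k-means||` initialization by default and runs distributed.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def lloyd_kmeans(points, k, max_iter=20):
    # Naive initialization (assumption): take the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: label every point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical 2-D points forming three obvious groups.
points = [[1.0, 1.0], [5.0, 5.0], [9.0, 1.0],
          [1.2, 0.8], [5.2, 4.8], [9.2, 0.8]]
centroids, labels = lloyd_kmeans(points, k=3)
```

The result of this procedure depends on the starting centroids, which is one reason the notebook fixes `seed=42`.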

- **Step 6: Assign Cluster Labels**: Applies the trained model to the dataset, adding a `prediction` column that holds each customer's cluster label.

In [None]:
# Step 6: Assign Cluster Labels
predictions = model.transform(data)

In [None]:
predictions.show()

+----------+--------------------+--------------------+----------+
|CustomerID|            features|      scaledFeatures|prediction|
+----------+--------------------+--------------------+----------+
|         1|[58.0,92549.49450...|[4.18086341281472...|         2|
|         2|[25.0,74102.62278...|[1.80209629862703...|         2|
|         3|[19.0,97496.48990...|[1.36959318695654...|         2|
|         4|[65.0,100394.3858...|[4.68545037643029...|         2|
|         5|[35.0,120840.1601...|[2.52293481807784...|         0|
|         6|[33.0,69449.14339...|[2.37876711418768...|         2|
|         7|[32.0,138105.5690...|[2.30668326224260...|         0|
|         8|[26.0,58655.90722...|[1.87418015057211...|         1|
|         9|[65.0,96508.26803...|[4.68545037643029...|         2|
|        10|[24.0,75747.87869...|[1.73001244668195...|         2|
|        11|[61.0,93921.83118...|[4.39711496864996...|         2|
|        12|[65.0,27829.93956...|[4.68545037643029...|         1|
|        1

#### Silhouette Score

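Silhouette analysis measures how well each point fits its assigned cluster compared with the nearest other cluster, which gives insight into the density and separation of the clusters. The coefficient ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters. `ClusteringEvaluator` returns the mean silhouette coefficient over all samples, using squared Euclidean distance by default.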

In [None]:
from pyspark.ml.evaluation import ClusteringEvaluator

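# Make predictions (same as Step 6; repeated so this cell can run on its own)
predictions = model.transform(data)

# Evaluate clustering by computing the Silhouette score.
# Score the clusters in the same feature column K-Means was trained on;
# ClusteringEvaluator would otherwise default to "features".
evaluator = ClusteringEvaluator(featuresCol=kmeans.getFeaturesCol())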
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

Silhouette with squared euclidean distance = 0.7417477569091551
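
As a rough illustration of what this number means, the sketch below computes the mean silhouette coefficient in plain Python with squared Euclidean distance. It follows the textbook definition (a: mean distance to the other members of a point's own cluster; b: smallest mean distance to another cluster) and is not guaranteed to match Spark's internals in every edge case. The points and labelings are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so only the divisor changes).
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_distance(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical points: two tight pairs far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the real groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labeling that follows the real groups scores close to 1, while one that cuts across them scores below 0.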


- **Step 7: Real-time Prediction Example**: Simulates real-time prediction by creating a new customer record, passing it through the same assembler and the already-fitted scaler, and predicting its cluster with the trained model.

In [None]:
# Step 7: Real-time Prediction Example (Assume streaming or real-time data)
# For real-time, simulate new data arriving and predict clusters
new_data = spark.createDataFrame([(1001, 35, 90000, 70)], ["CustomerID", "Age", "Income", "SpendingScore"])

# Assemble features, then scale with the already-fitted scalerModel (no refitting on new data)
new_data = assembler.transform(new_data)
new_data = scalerModel.transform(new_data)

# Predict clusters
predicted_cluster = model.transform(new_data).select("CustomerID", "prediction").first()

print(f"Predicted cluster for CustomerID {predicted_cluster['CustomerID']}: {predicted_cluster['prediction']}")

Predicted cluster for CustomerID 1001: 2
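
Conceptually, predicting a cluster for a new record means scaling it with the statistics learned at training time and picking the nearest cluster center. The plain-Python sketch below shows that logic. The standard deviations and centers are hypothetical placeholders, not values from the trained model; in Spark the fitted centers are available from `model.clusterCenters()`.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def predict_cluster(raw_point, training_stds, centers):
    """Scale a raw point with training-time stds, then pick the nearest center."""
    scaled = [value / s for value, s in zip(raw_point, training_stds)]
    return min(range(len(centers)),
               key=lambda i: squared_distance(scaled, centers[i]))

# Hypothetical per-feature standard deviations learned during training
# (Age, Income, SpendingScore).
training_stds = [10.0, 20000.0, 25.0]

# Hypothetical cluster centers, expressed in the scaled feature space.
centers = [
    [2.0, 2.0, 1.0],
    [4.0, 4.5, 3.0],
    [5.5, 2.5, 2.0],
]

# The new customer from the example above: Age 35, Income 90000, SpendingScore 70.
cluster = predict_cluster([35.0, 90000.0, 70.0], training_stds, centers)
print(f"Nearest center index: {cluster}")
```

The key design point is that the new record is scaled with the statistics from the training data. Refitting the scaler on a single incoming record would place it in a different feature space from the cluster centers.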
