# Customer Churn Analysis with PySpark

This notebook walks through loading a customer churn dataset with PySpark, training a Decision Tree classifier on it, and evaluating the result.

### Objectives:
- Load the dataset and resolve header/schema issues.
- Create a `features` column using `VectorAssembler`.
- Convert the `churn` column into a numeric `label` using `StringIndexer`.
- Train a Decision Tree model and save it to disk.
- Evaluate the model with a confusion matrix, precision, recall, and F1-score.
- Plot the ROC curve.

## Step 1: Initialize PySpark and Load Data
We'll start by initializing a Spark session and loading the dataset.

In [None]:
%pip install numpy matplotlib

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("Customer Churn Analysis").getOrCreate()

# Load the CSV file with inferred schema
data = spark.read.csv("customer_churn_prod.csv", header=True, inferSchema=True)

# Show the first few rows
data.show(5)

# Print schema to verify column types
data.printSchema()

### Challenge:
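- Identify and drop any unnecessary columns, such as `_c0`. Spark assigns the name `_c0` to a first column whose header is empty, which is often a row index written out when the CSV was exported.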

In [None]:
# Drop the `_c0` column if it exists
if "_c0" in data.columns:
    data = data.drop("_c0")

# Verify remaining columns
print(data.columns)

## Step 2: Assemble Features
Combine numerical columns into a single `features` column using `VectorAssembler`.

In [None]:
from pyspark.ml.feature import VectorAssembler

# List of numerical feature columns
feature_columns = [
    "account_length", "number_vmail_messages", "total_day_minutes",
    "total_day_calls", "total_day_charge", "total_eve_minutes",
    "total_eve_calls", "total_eve_charge", "total_night_minutes",
    "total_night_calls", "total_night_charge", "total_intl_minutes",
    "total_intl_calls", "total_intl_charge", "number_customer_service_calls"
]

# Assemble features into a vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data)

# Show the first few rows with the new features column
data.select("features", "churn").show(5)

## Step 3: Index the `churn` Column
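Convert the `churn` column into a numeric `label` column using `StringIndexer`. By default, `StringIndexer` gives index 0 to the most frequent value, so the majority class (usually the non-churners) becomes label 0 and the churners become label 1. The ROC code in Step 7 relies on label 1 being the churn class, so check the mapping on your data.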

In [None]:
from pyspark.ml.feature import StringIndexer

# Index the `churn` column
indexer = StringIndexer(inputCol="churn", outputCol="label")
data = indexer.fit(data).transform(data)

# Show the updated DataFrame with the label column
data.select("features", "label").show(5)
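To make the index assignment concrete, here is a minimal pure-Python sketch of the default `frequencyDesc` ordering: values are ranked by descending count, with ties broken alphabetically. The helper `frequency_desc_index` and the sample values are hypothetical and only mimic the behaviour; they are not part of PySpark.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up churn values, for illustration only
sample_churn = ["no", "no", "yes", "no", "yes", "no"]
mapping = frequency_desc_index(sample_churn)
print(mapping)
```

If churners were the majority in a dataset, this ordering would give them label 0 instead, which would flip the meaning of the positive class in later steps.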

## Step 4: Split the Dataset
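Split the dataset into training (about 70%) and test (about 30%) sets. `randomSplit` assigns rows at random, so the actual proportions vary slightly; the fixed `seed` keeps the split repeatable across runs on the same data.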

In [None]:
# Split the data into training and test sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=123)

# Print the number of rows in each split
print(f"Training Rows: {train_data.count()}, Testing Rows: {test_data.count()}")
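The row counts printed above will rarely be an exact 70/30 split. The sketch below shows why, using plain Python: each row gets one uniform random draw and lands in whichever weight range contains it. This is a conceptual illustration under that assumption, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, [0.7, 0.3], seed=123)
print(len(train), len(test))
```

Every row ends up in exactly one split, but the sizes only approximate the requested weights, and the approximation gets tighter as the dataset grows.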

## Step 5: Train a Decision Tree Model and Store It
Train a Decision Tree model on the training data and save it to disk. Spark writes the model as a directory of metadata and data files rather than a single file.

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

# Train a Decision Tree Classifier
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5)
model = dt.fit(train_data)

# Save the trained model
model_path = "/home/Chapter 5/trained_model_decision_tree"
model.write().overwrite().save(model_path)

print(f"Model saved at: {model_path}")

## Step 6: Evaluate the Model
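Compute the confusion matrix, then the weighted precision, weighted recall, and F1-score. These metrics are averaged over both classes, weighted by how often each class occurs, so they describe overall performance rather than performance on the churn class alone.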

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import col

# Generate predictions
predictions = model.transform(test_data)

# Confusion Matrix
confusion_matrix = predictions.groupBy("label", "prediction").count()
confusion_matrix.show()

# Calculate weighted precision, weighted recall, and F1-score (averaged over both classes)
evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

precision = evaluator_precision.evaluate(predictions)
recall = evaluator_recall.evaluate(predictions)
f1_score = evaluator_f1.evaluate(predictions)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1_score:.2f}")
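To see how the weighted metrics relate to the confusion matrix, the following pure-Python sketch recomputes them from `(label, prediction)` counts. The function `weighted_metrics` and the counts are made up for illustration; with real data, the counts would come from the `confusion_matrix` DataFrame above.

```python
def weighted_metrics(counts):
    """counts maps (label, prediction) -> number of rows."""
    classes = sorted({l for l, _ in counts} | {p for _, p in counts})
    total = sum(counts.values())
    precision = recall = f1 = 0.0
    for c in classes:
        tp = counts.get((c, c), 0)
        predicted = sum(n for (_, p), n in counts.items() if p == c)
        actual = sum(n for (l, _), n in counts.items() if l == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # share of rows whose true label is c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Hypothetical confusion-matrix counts: 85 non-churners, 15 churners
sample_counts = {(0, 0): 80, (0, 1): 5, (1, 0): 10, (1, 1): 5}
w_precision, w_recall, w_f1 = weighted_metrics(sample_counts)
print(f"{w_precision:.2f} {w_recall:.2f} {w_f1:.2f}")
```

In this made-up example the weighted recall equals the overall accuracy (85 of 100 rows correct), even though only 5 of the 15 churners are caught. On an imbalanced churn dataset it is therefore worth reading the per-class rows of the confusion matrix alongside the weighted scores.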

## Step 7: Plot the ROC Curve
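Compute the area under the ROC curve (AUC) with `BinaryClassificationEvaluator`, then build the curve manually by sweeping a threshold over the predicted probability of class 1 (churn).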

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import matplotlib.pyplot as plt
import numpy as np

# Evaluate Area Under ROC (AUC)
evaluator_roc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator_roc.evaluate(predictions)
print(f"Area Under ROC: {auc:.2f}")

# Extract labels and the predicted probability of the positive class (label 1)
rows = predictions.select("label", "probability").collect()
labels = np.array([row["label"] for row in rows])
scores = np.array([row["probability"][1] for row in rows])

# Calculate TPR and FPR manually for the ROC curve.
# The last threshold sits just above 1 so the curve always ends at (0, 0),
# even when some leaves predict a probability of exactly 1.0.
thresholds = np.append(np.linspace(0, 1, 100), 1.0 + 1e-9)
tpr_list, fpr_list = [], []

positives = np.sum(labels == 1)
negatives = np.sum(labels == 0)

for threshold in thresholds:
    predicted_positive = scores >= threshold
    tp = np.sum(predicted_positive & (labels == 1))
    fp = np.sum(predicted_positive & (labels == 0))

    tpr = tp / positives if positives > 0 else 0
    fpr = fp / negatives if negatives > 0 else 0

    tpr_list.append(tpr)
    fpr_list.append(fpr)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_list, tpr_list, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], 'r--', label="Random Classifier")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid()
plt.show()
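As a sanity check on the plotted curve, the area under manually computed ROC points can be estimated with the trapezoidal rule and compared with the evaluator's AUC. The sketch below is pure Python with made-up labels and scores; `roc_points` and `trapezoid_auc` are hypothetical helpers. It uses each distinct score as a threshold, which suits a decision tree because the model emits only as many distinct probabilities as it has leaves.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct score threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # Every distinct score, plus +inf so the curve starts at (0, 0)
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((fpr, tpr))
    return points

def trapezoid_auc(points):
    """Integrate TPR over FPR with the trapezoidal rule."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and positive-class probabilities, for illustration only
sample_labels = [1, 1, 0, 1, 0, 0]
sample_scores = [0.9, 0.9, 0.9, 0.4, 0.4, 0.1]
points = roc_points(sample_labels, sample_scores)
manual_auc = trapezoid_auc(points)
print(f"Manual AUC: {manual_auc:.3f}")
```

With real predictions, the manual value should land close to the number reported by `BinaryClassificationEvaluator`; a large gap would suggest that the positive class or the probability index is not the one assumed here.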