# Banking Capstone Project - Big Data Analytics

## Project Overview
This project applies big data analytics to the banking sector using Apache Spark on Databricks. It covers five use cases, each addressing a distinct business problem in banking operations.

## Objectives
1. **Customer Churn Prediction**: Identify customers at risk of leaving to improve retention strategies
2. **Personalized Marketing**: Optimize credit card offers through customer segmentation
3. **ATM Operations**: Analyze downtime patterns to enhance service reliability
4. **Employee Performance**: Evaluate sales performance and identify top contributors
5. **Insurance Renewal**: Predict policy renewals for better customer engagement

## Technologies Used
- **Databricks**: Cloud-based big data platform
- **Apache Spark (PySpark)**: Distributed data processing
- **Python**: Data analysis and machine learning
- **Spark MLlib**: Machine learning algorithms

## Dataset Location
`/Volumes/workspace/default/capstone-project/`

---

## Use Case 1: Customer Churn Prediction

**Objective:** Identify customers at risk of churning to improve retention strategies and reduce customer attrition.

**Dataset:** `Customer-Churn-Records.csv`

**Key Features:**
- Customer demographics (Age, Gender, Geography)
- Account information (Balance, Tenure, Number of Products)
- Engagement metrics (IsActiveMember, HasCrCard)
- Target variable: Exited (1 = Churned, 0 = Retained)

In [0]:
# Import necessary libraries
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Load Customer Churn Data
df_churn = spark.read.csv('/Volumes/workspace/default/capstone-project/Customer-Churn-Records.csv', header=True, inferSchema=True)

print("Dataset loaded successfully!")
print(f"Total Records: {df_churn.count()}")
print(f"Total Columns: {len(df_churn.columns)}")

# Display first few rows
display(df_churn.limit(10))

Dataset loaded successfully!
Total Records: 10000
Total Columns: 18


RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425
6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1,1,5,DIAMOND,484
7,15592531,Bartlett,822,France,Male,50,7,0.0,2,1,1,10062.8,0,0,2,SILVER,206
8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1,1,2,DIAMOND,282
9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0,0,3,GOLD,251
10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0,0,3,GOLD,342


In [0]:
# Data Cleaning and Exploration - Customer Churn

# Check for missing values
print("Missing Values:")
df_churn.select([F.count(when(col(c).isNull(), c)).alias(c) for c in df_churn.columns]).show()

# Data Schema
print("\nData Schema:")
df_churn.printSchema()

# Summary Statistics
print("\nSummary Statistics:")
display(df_churn.describe())

# Churn Rate Analysis
print("\nChurn Rate Analysis:")
churn_analysis = df_churn.groupBy("Exited").count().withColumn("Percentage", (col("count")/df_churn.count())*100)
display(churn_analysis)

Missing Values:
+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+--------+------------------+---------+------------+
|RowNumber|CustomerId|Surname|CreditScore|Geography|Gender|Age|Tenure|Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|Complain|Satisfaction Score|Card Type|Point Earned|
+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+--------+------------------+---------+------------+
|        0|         0|      0|          0|        0|     0|  0|     0|      0|            0|        0|             0|              0|     0|       0|                 0|        0|           0|
+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+--------+------------------+---------+------------+


Data Schema:
root
 |--

summary,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
count,10000.0,10000.0,10000,10000.0,10000,10000,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000,10000.0
mean,5000.5,15690940.5694,,650.5288,,,38.9218,5.0128,76485.88928799961,1.5302,0.7055,0.5151,100090.2398809998,0.2038,0.2044,3.0138,,606.5151
stddev,2886.8956799071675,71936.18612274864,,96.65329873613037,,,10.487806451704609,2.8921743770496824,62397.40520238573,0.5816543579989895,0.4558404644751333,0.4997969284589204,57510.49281769815,0.4028421380377401,0.40328265979381,1.4059186394390228,,225.92483921713327
min,1.0,15565701.0,Abazu,350.0,France,Female,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0,0.0,1.0,DIAMOND,119.0
max,10000.0,15815690.0,Zuyeva,850.0,Spain,Male,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0,1.0,5.0,SILVER,1000.0



Churn Rate Analysis:


Exited,count,Percentage
1,2038,20.380000000000003
0,7962,79.62
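
The class split above also fixes the accuracy of the most trivial possible model. As a minimal sketch in plain Python (the counts are copied from the output above; the variable names are illustrative and not part of the notebook), the snippet below derives the churn rate and the accuracy of always predicting the majority class:

```python
# Class counts copied from the churn-rate output above (Exited -> count).
class_counts = {1: 2038, 0: 7962}

total = sum(class_counts.values())
churn_rate = class_counts[1] / total

# A model that always predicts the majority class is correct
# exactly as often as that class occurs in the data.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(f"Churn rate: {churn_rate:.2%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

A classifier has to beat roughly 80% accuracy to improve on that baseline, which is one reason a threshold-independent metric such as AUC-ROC is a more informative yardstick for this dataset.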


In [0]:
# Visualization - Churn by Age Group

# Create age groups
df_churn_viz = df_churn.withColumn("AgeGroup",
    when(col("Age") < 25, "18-24")
    .when(col("Age") < 35, "25-34")
    .when(col("Age") < 50, "35-49")
    .otherwise("50+")
)

# Churn by Age Group
print("Churn Distribution by Age Group:")
age_churn = df_churn_viz.groupBy("AgeGroup", "Exited").count().orderBy("AgeGroup")
display(age_churn)

# Geography-wise Churn Analysis
print("\nChurn by Geography:")
geo_churn = df_churn.groupBy("Geography", "Exited").count()
display(geo_churn)

Churn Distribution by Age Group:


AgeGroup,Exited,count
18-24,1,40
18-24,0,417
25-34,0,2972
25-34,1,250
35-49,1,1114
35-49,0,3812
50+,1,634
50+,0,761




Churn by Geography:


Geography,Exited,count
France,1,811
Spain,0,2064
Spain,1,413
Germany,0,1695
Germany,1,814
France,0,4203
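
The two outputs above list raw counts, which makes groups of different sizes hard to compare. As a rough sketch in plain Python, the snippet below converts those counts into per-group churn rates. The counts are copied from the outputs; `churn_rates` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
# (group, Exited) -> count, copied from the two outputs above.
age_counts = {
    ("18-24", 1): 40, ("18-24", 0): 417,
    ("25-34", 1): 250, ("25-34", 0): 2972,
    ("35-49", 1): 1114, ("35-49", 0): 3812,
    ("50+", 1): 634, ("50+", 0): 761,
}
geo_counts = {
    ("France", 1): 811, ("France", 0): 4203,
    ("Spain", 1): 413, ("Spain", 0): 2064,
    ("Germany", 1): 814, ("Germany", 0): 1695,
}


def churn_rates(counts):
    """Return {group: churned / (churned + retained)} for each group."""
    groups = sorted({group for group, _ in counts})
    return {
        g: counts[(g, 1)] / (counts[(g, 1)] + counts[(g, 0)])
        for g in groups
    }


age_rates = churn_rates(age_counts)
geo_rates = churn_rates(geo_counts)

for title, rates in (("Age group", age_rates), ("Geography", geo_rates)):
    print(title)
    for group, rate in rates.items():
        print(f"  {group}: {rate:.1%}")
```

Read as rates, the pattern is clearer than in the counts: customers aged 50+ churn at roughly 45%, against under 9% for customers younger than 35, and Germany's churn rate (about 32%) is about twice that of France or Spain (about 16% each).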



In [0]:
# Predictive Model - Random Forest Classifier for Churn Prediction

# Select features for modeling
feature_cols = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"]

from pyspark.ml import Pipeline

# Index categorical variables (Gender, Geography) with StringIndexer
gender_indexer = StringIndexer(inputCol="Gender", outputCol="GenderIndex")
geo_indexer = StringIndexer(inputCol="Geography", outputCol="GeographyIndex")

# Create feature vector
assembler = VectorAssembler(
    inputCols=feature_cols + ["GenderIndex", "GeographyIndex"],
    outputCol="features"
)

# Split data into training and testing
train_data, test_data = df_churn.randomSplit([0.7, 0.3], seed=42)

# Build Random Forest model
rf = RandomForestClassifier(labelCol="Exited", featuresCol="features", numTrees=100)

# Create pipeline
pipeline = Pipeline(stages=[gender_indexer, geo_indexer, assembler, rf])

# Train model
print("Training Random Forest model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

print("\nModel Predictions (sample):")
display(predictions.select("CustomerId", "Exited", "prediction", "probability").limit(20))

# Evaluate model
evaluator = BinaryClassificationEvaluator(labelCol="Exited", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"\nModel Performance - AUC-ROC Score: {auc:.4f}")

Training Random Forest model...

Model Predictions (sample):


CustomerId,Exited,prediction,probability
15792365,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.8011187461517614"",""0.1988812538482386""]}"
15737173,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.9070090016777038"",""0.09299099832229615""]}"
15691483,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.9035831915929265"",""0.0964168084070735""]}"
15597945,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.9022588309355258"",""0.09774116906447416""]}"
15725737,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.8245170782872384"",""0.17548292171276159""]}"
15625047,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.8835365977195713"",""0.11646340228042867""]}"
15738191,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.9188444314277061"",""0.081155568572294""]}"
15700772,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.7885188468930471"",""0.211481153106953""]}"
15589475,1,1.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.3937162468191552"",""0.6062837531808447""]}"
15732963,0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.908570965858193"",""0.09142903414180702""]}"




Model Performance - AUC-ROC Score: 0.8313
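
AUC-ROC can be read as the probability that a randomly chosen churner receives a higher churn score than a randomly chosen retained customer. The sketch below illustrates that reading in plain Python, using only the ten sample rows displayed above. `pairwise_auc` is a hypothetical helper, not a Spark API, and the scores are the second entry of each `probability` vector (the churn class), rounded to four decimals. Spark's `BinaryClassificationEvaluator` computes the metric from the ROC curve over the whole test set, so the sample value is not expected to match 0.8313.

```python
# (Exited label, P(churn)) for the ten sample rows shown above.
sample = [
    (0, 0.1989), (0, 0.0930), (0, 0.0964), (0, 0.0977), (0, 0.1755),
    (0, 0.1165), (0, 0.0812), (0, 0.2115), (1, 0.6063), (0, 0.0914),
]


def pairwise_auc(rows):
    """Fraction of (churner, retained) pairs in which the churner scores higher.

    Ties count as half a correctly ordered pair.
    """
    pos = [score for label, score in rows if label == 1]
    neg = [score for label, score in rows if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


sample_auc = pairwise_auc(sample)
print(f"AUC on the displayed sample: {sample_auc:.4f}")
```

In this sample the single churner has the highest score, so every churner/retained pair is ordered correctly and the sample AUC is 1.0. The full test set contains many harder pairs, which is why the reported score is lower.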
