# 🏄 LBB: Predicting Credit Card Balance

A multinational bank wants to use machine learning to study the factors that influence its cardholders' **credit card balance**. Predicting the balance helps the bank improve both service quality and profitability, and the analysis also helps it understand cardholder behavior.


## 1. Read data `credit card`

We will explore the basic concepts of linear regression using `credit_card.csv`, a dataset with information on 310 credit card holders at a bank.

In [1]:
# Start (or reuse) a Spark session for this notebook
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LBB DSS Big Data with Pyspark Januari 2025").getOrCreate()

In [2]:
credit_card = spark.read.csv('data_input/credit_card.csv', header=True, inferSchema=True)

In [3]:
credit_card.show(5)

+------------+-----+------+-----+---+---------+------+-------+-------+---------+-------+
|      Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married|Ethnicity|Balance|
+------------+-----+------+-----+---+---------+------+-------+-------+---------+-------+
|  221.741881| 3606|   283|    2| 34|       11|  Male|     No|    Yes|Caucasian|    333|
|11241.300625| 6645|   483|    3| 82|       15|Female|    Yes|    Yes|    Asian|    903|
|10939.695649| 7075|   514|    4| 71|       11|  Male|     No|     No|    Asian|    580|
|22178.357776| 9504|   681|    3| 36|       11|Female|     No|     No|    Asian|    964|
| 3122.797924| 4897|   357|    2| 68|       16|  Male|     No|    Yes|Caucasian|    331|
+------------+-----+------+-----+---+---------+------+-------+-------+---------+-------+
only showing top 5 rows



**Description:**

- `Income`: Customer's annual income (in units of $100)
- `Limit`: Credit limit
- `Rating`: Score assigned to an individual based on creditworthiness. Higher is better
- `Cards`: Number of credit cards the customer holds
- `Age`: Customer's age
- `Education`: Level/years of education completed by the customer
- `Gender`: Customer's gender
    - Male
    - Female
- `Student`: Whether the customer is a student
    - Yes -> Student
    - No -> Not a student
- `Married`: Marital status
    - Yes -> Married
    - No -> Not married
- `Ethnicity`: Customer's ethnicity
    - African American
    - Asian
    - Caucasian
- `Balance`: Average spending on the credit card over 3 months

**Data assumption**: Balance is computed as the average spending over the billing cycle (3 months in this case). For example, if a cardholder spends `$400`, `$500`, and `$600` over the 3 months, Balance is recorded as `$500`.

## 2. Data Pre-processing

### 1️⃣ Check Data Structure and Data Types

In [4]:
# Inspect the column names and the inferred data types
credit_card.printSchema()

root
 |-- Income: double (nullable = true)
 |-- Limit: integer (nullable = true)
 |-- Rating: integer (nullable = true)
 |-- Cards: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Education: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Student: string (nullable = true)
 |-- Married: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Balance: integer (nullable = true)



In [5]:
credit_card.describe().show()

+-------+------------------+------------------+-----------------+------------------+------------------+------------------+------+-------+-------+----------------+-----------------+
|summary|            Income|             Limit|           Rating|             Cards|               Age|         Education|Gender|Student|Married|       Ethnicity|          Balance|
+-------+------------------+------------------+-----------------+------------------+------------------+------------------+------+-------+-------+----------------+-----------------+
|  count|               310|               310|              310|               310|               310|               310|   310|    310|    310|             310|              310|
|   mean|3928.2701042225817| 5485.467741935484|405.0516129032258| 2.996774193548387| 55.60645161290323|13.425806451612903|  NULL|   NULL|   NULL|            NULL|670.9870967741936|
| stddev|  6180.70944183338|2052.4517434400805|137.9673894937973|1.4267404339958059|17.34179409

### 2️⃣ Dummy Variable Encoding

- **List categorical columns**

In [6]:
# Categorical columns to encode as dummy variables
cat_columns = ['Student','Married','Ethnicity']

- **Create empty lists for the pipeline stages**

In [7]:
# These lists collect the StringIndexer and OneHotEncoder stages
indexers = []
encoders = []

- **Define the `StringIndexer` and `OneHotEncoder` stages**

In [8]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

In [9]:
# Build one indexer/encoder pair per categorical column
for col in cat_columns:
    # Create a StringIndexer for each column
    indexer = StringIndexer(inputCol=col, outputCol=f"{col}_index", stringOrderType='alphabetAsc', handleInvalid="keep")
    indexers.append(indexer)
    
    # Create a OneHotEncoder for the indexed column
    encoder = OneHotEncoder(inputCol=f"{col}_index", outputCol=f"{col}_encoded")
    encoders.append(encoder)
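
To make the encoded columns easier to read, here is a minimal plain-Python sketch of what the two stages do conceptually. It is not Spark's implementation, and the helper names `string_index` and `one_hot` are hypothetical. It assumes the behavior configured in the loop above: labels sorted alphabetically, one extra index reserved for unseen values because of `handleInvalid="keep"`, and the encoder's default of dropping the last slot.

```python
def string_index(labels, value):
    """Return the position of `value` among the alphabetically sorted labels.

    Unseen values get the extra index len(labels), mimicking handleInvalid="keep".
    """
    ordered = sorted(set(labels))
    return ordered.index(value) if value in ordered else len(ordered)


def one_hot(index, n_labels):
    """Dense one-hot vector with n_labels slots.

    n_labels + 1 indices exist (the labels plus the "unseen" bucket); the last
    one is dropped, so an unseen value becomes the all-zero vector.
    """
    vector = [0.0] * n_labels
    if index < n_labels:
        vector[index] = 1.0
    return vector


ethnicities = ["Caucasian", "Asian", "African American"]
idx = string_index(ethnicities, "Caucasian")
print(idx, one_hot(idx, len(ethnicities)))  # 2 [0.0, 0.0, 1.0]
```

Under these assumptions every seen label keeps its own slot, so a two-level column such as `Student` yields two indicator values that always sum to 1.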

- **Create the pipeline**

In [10]:
# Pipeline chains the stages so that they run in order
from pyspark.ml import Pipeline

# Combine indexers and encoders into a single pipeline
pipeline = Pipeline(stages = indexers + encoders)

- **Fit pipeline**

In [11]:
# Fit the indexers and encoders on the full dataset
pipeline_model = pipeline.fit(credit_card)

In [12]:
credit_card = pipeline_model.transform(credit_card)

### 3️⃣ Splitting Train-Test Data

Split the data using the following settings:
- Split proportion = 80:20
- Seed = 123

In [13]:
# 80% train, 20% test, with a fixed seed for reproducibility
train_data, test_data = credit_card.randomSplit([0.8, 0.2], seed = 123)
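
The following plain-Python sketch illustrates the idea behind a weighted random split: each row is assigned to a side independently, so the resulting sizes are close to 80:20 but not exactly so. It is only an illustration under that assumption. The helper `random_split` is hypothetical, and it does not reproduce Spark's sampling or the specific rows Spark selects for seed 123.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(310))  # stand-in for the 310 cardholder records
train, test = random_split(rows, 0.8, seed=123)
print(len(train), len(test))
```

Rerunning with the same seed returns the same split, which is the reason for fixing the seed in the notebook.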

### 4️⃣ Feature Selection

Select the features and combine them into a single vector using `VectorAssembler()`.

- **List the predictor columns**

In [14]:
credit_card.show(2)

+------------+-----+------+-----+---+---------+------+-------+-------+---------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+
|      Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married|Ethnicity|Balance|Student_index|Married_index|Ethnicity_index|Student_encoded|Married_encoded|Ethnicity_encoded|
+------------+-----+------+-----+---+---------+------+-------+-------+---------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+
|  221.741881| 3606|   283|    2| 34|       11|  Male|     No|    Yes|Caucasian|    333|          0.0|          1.0|            2.0|  (2,[0],[1.0])|  (2,[1],[1.0])|    (3,[2],[1.0])|
|11241.300625| 6645|   483|    3| 82|       15|Female|    Yes|    Yes|    Asian|    903|          1.0|          1.0|            1.0|  (2,[1],[1.0])|  (2,[1],[1.0])|    (3,[1],[1.0])|
+------------+-----+------+-----+---+---------+------+-------+-------+---------+-----

In [15]:
# Numeric columns plus the one-hot encoded categorical columns
feature_predictor = ['Income','Limit','Rating','Cards','Age','Education',
                     'Student_encoded','Married_encoded','Ethnicity_encoded']

- **Combine all features**

In [16]:
# VectorAssembler merges the listed columns into one vector column
from pyspark.ml.feature import VectorAssembler

In [17]:
assembler = VectorAssembler(inputCols = feature_predictor,
                            outputCol = 'predictors')

- **Transform the data**

In [18]:
# Add the `predictors` vector column to both splits
train_data = assembler.transform(train_data)
test_data = assembler.transform(test_data)

## 3. Training the Regression Model

- **Build and train the model**

In [19]:
# Linear regression estimator from Spark ML
from pyspark.ml.regression import LinearRegression

In [20]:
lr = LinearRegression(labelCol = 'Balance', # target variable (actual values)
                      featuresCol = 'predictors', # feature vector produced by VectorAssembler
                      predictionCol = 'balance_predict') # new column that stores the predictions

- **Fit the model on the training data**

In [21]:
# Estimate the coefficients from the training split
model = lr.fit(train_data)

- **Show the model coefficients**

In [22]:
# One coefficient per slot of the `predictors` vector
print("Coefficients:", model.coefficients)

Coefficients: [-0.049300603643392074,0.22248148940186638,1.054978883281222,24.32622367920828,-1.6488877607281467,1.0971767885862416,-245.58907514018634,245.58907514045146,3.280235437750275,-3.2802354372254325,1.3211368253149776,11.824302906323647,-9.89832034667852]
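
The 13 coefficients are easier to read once each one is paired with the slot it belongs to. The sketch below does that in plain Python. The slot names are an assumption: they follow the order of `feature_predictor` and the alphabetical label order within each encoded column, which is consistent with the indices shown in the notebook's tables but is not printed by Spark here.

```python
# Coefficients copied from the notebook output above
coefficients = [
    -0.049300603643392074, 0.22248148940186638, 1.054978883281222,
    24.32622367920828, -1.6488877607281467, 1.0971767885862416,
    -245.58907514018634, 245.58907514045146,
    3.280235437750275, -3.2802354372254325,
    1.3211368253149776, 11.824302906323647, -9.89832034667852,
]

# ASSUMPTION: slot order = feature_predictor order, labels sorted alphabetically
slot_names = [
    "Income", "Limit", "Rating", "Cards", "Age", "Education",
    "Student=No", "Student=Yes",
    "Married=No", "Married=Yes",
    "Ethnicity=African American", "Ethnicity=Asian", "Ethnicity=Caucasian",
]

named = dict(zip(slot_names, coefficients))
for name, value in named.items():
    print(f"{name:<28}{value:>12.3f}")
```

Under this mapping the two `Student` coefficients are equal and opposite (about -245.59 and +245.59), as are the two `Married` coefficients. That pattern is what one would expect when both levels of a two-level variable keep their own indicator, because the two indicators always sum to 1. Only the difference is interpretable: the model estimates that a student's balance is roughly 491 higher than a non-student's, other predictors held equal.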


In [23]:
test_data.show(2)

+----------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+--------------------+
|    Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married|       Ethnicity|Balance|Student_index|Married_index|Ethnicity_index|Student_encoded|Married_encoded|Ethnicity_encoded|          predictors|
+----------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+--------------------+
|110.313009| 2923|   232|    3| 25|       18|Female|     No|    Yes|African American|    191|          0.0|          1.0|            0.0|  (2,[0],[1.0])|  (2,[1],[1.0])|    (3,[0],[1.0])|[110.313009,2923....|
|115.240225| 3746|   280|    2| 44|       17|Female|     No|    Yes|       Caucasian|    410|          0.0|          1.0|            2.0|  (2,[0],[1.0])|  (2,[1],[1

- **Make predictions on the test data**

In [24]:
predictions = model.transform(test_data)

In [25]:
predictions.show(5)

+----------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+--------------------+------------------+
|    Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married|       Ethnicity|Balance|Student_index|Married_index|Ethnicity_index|Student_encoded|Married_encoded|Ethnicity_encoded|          predictors|   balance_predict|
+----------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+-------------+-------------+---------------+---------------+---------------+-----------------+--------------------+------------------+
|110.313009| 2923|   232|    3| 25|       18|Female|     No|    Yes|African American|    191|          0.0|          1.0|            0.0|  (2,[0],[1.0])|  (2,[1],[1.0])|    (3,[0],[1.0])|[110.313009,2923....| 99.51008591269942|
|115.240225| 3746|   280|    2| 44|       17|Female|     No|    Yes|       Caucasian|   

In [26]:
predictions.select('balance','balance_predict').show(5)

+-------+------------------+
|balance|   balance_predict|
+-------+------------------+
|    191| 99.51008591269942|
|    410|265.03669827122997|
|    602| 469.6684371110115|
|    210| 93.96072854292493|
|    955| 734.5066881158222|
+-------+------------------+
only showing top 5 rows



## 4. Model Evaluation

- **Compute the MAE**

In [27]:
# model.summary holds metrics computed on the training data
summary_hasil = model.summary

In [28]:
summary_hasil.meanAbsoluteError

83.91709294166968

- **Compute the RMSE**

In [29]:
# Root mean squared error from the training summary
summary_hasil.rootMeanSquaredError

102.39806633146488
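
As a hand check of what MAE and RMSE measure, the sketch below applies the two formulas to the five test rows previewed earlier in the notebook (`Balance` against `balance_predict`). It is a plain-Python illustration only. Five rows are far too few to judge the model, and the result is not comparable to the training-summary values above.

```python
import math

# Actual and predicted balances copied from the five-row preview
actual = [191, 410, 602, 210, 955]
predicted = [99.51008591269942, 265.03669827122997, 469.6684371110115,
             93.96072854292493, 734.5066881158222]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average size of the errors, ignoring sign
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error, penalizes large misses more
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
```

On these five rows every prediction falls below the actual balance, and RMSE exceeds MAE because the largest miss (the row with a balance of 955) is weighted more heavily.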

In [30]:
summary_hasil.r2

0.9394115273124833

In [31]:
summary_hasil.r2adj

0.9359720070289125
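
The relationship between the two values can be sketched with the standard adjusted R² formula, which penalizes R² for the number of predictors. The function name `adjusted_r2` is hypothetical, and `n_train = 243` is an assumption: the notebook never prints the training row count, but this value is consistent with the reported R² and adjusted R² for 13 predictor slots.

```python
def adjusted_r2(r2, n_rows, n_predictors):
    """Standard adjusted R^2 for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


r2 = 0.9394115273124833   # reported by summary_hasil.r2
n_predictors = 13         # length of the coefficient vector
n_train = 243             # ASSUMPTION: training row count is not shown in the notebook

print(adjusted_r2(r2, n_train, n_predictors))
```

With many rows relative to the number of predictors, the correction factor stays close to 1, which is why the two values differ only in the third decimal place.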

<div class="alert alert-success" role="alert">

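Interpretation of results:

-   The model fits very well, with R² above 93%: it explains roughly 94% of the variance in `Balance`. These metrics come from `model.summary`, which describes the fit on the training data, so they measure training performance; `model.evaluate(test_data)` would give the corresponding test-set values.

-   MAE (about 83.9) and RMSE (about 102.4) are small relative to the scale of the target, whose mean is about 671.

-   The small gap between R² (0.939) and adjusted R² (0.936) suggests that the number of predictors is not inflating R², so the model is reasonably efficient with relevant features.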

</div>