# 📥 1. Load the Data

In [0]:
# Load the CSV file, using the first row as the header and inferring column types

df = spark.read.csv("/FileStore/tables/creditcard.csv", header=True, inferSchema=True)
df.printSchema()



root
 |-- Time: double (nullable = true)
 |-- V1: double (nullable = true)
 |-- V2: double (nullable = true)
 |-- V3: double (nullable = true)
 |-- V4: double (nullable = true)
 |-- V5: double (nullable = true)
 |-- V6: double (nullable = true)
 |-- V7: double (nullable = true)
 |-- V8: double (nullable = true)
 |-- V9: double (nullable = true)
 |-- V10: double (nullable = true)
 |-- V11: double (nullable = true)
 |-- V12: double (nullable = true)
 |-- V13: double (nullable = true)
 |-- V14: double (nullable = true)
 |-- V15: double (nullable = true)
 |-- V16: double (nullable = true)
 |-- V17: double (nullable = true)
 |-- V18: double (nullable = true)
 |-- V19: double (nullable = true)
 |-- V20: double (nullable = true)
 |-- V21: double (nullable = true)
 |-- V22: double (nullable = true)
 |-- V23: double (nullable = true)
 |-- V24: double (nullable = true)
 |-- V25: double (nullable = true)
 |-- V26: double (nullable = true)
 |-- V27: double (nullable = true)
 |-- V28: double (nulla

# 🔍 2. Check for Missing Values

In [0]:
from pyspark.sql import functions as F

# Count nulls per column; the F alias avoids shadowing Python's built-in sum()
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|Time| V1| V2| V3| V4| V5| V6| V7| V8| V9|V10|V11|V12|V13|V14|V15|V16|V17|V18|V19|V20|V21|V22|V23|V24|V25|V26|V27|V28|Amount|Class|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|   0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|     0|    0|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+



The credit card data contains no missing values: every column has a null count of 0.

# ⚖️ 3. Check Class Imbalance

In [0]:
df.groupBy("Class").count().show()

+-----+------+
|Class| count|
+-----+------+
|    1|   492|
|    0|284315|
+-----+------+



The dataset is heavily imbalanced: fraudulent transactions make up only 0.17% of the total (492 of 284,807). A model trained on data this skewed tends to favor the majority class, so it can reach high accuracy while detecting very little fraud. Techniques such as SMOTE balance a dataset by creating synthetic examples of the minority class, but PySpark's MLlib has no built-in SMOTE implementation. As a practical alternative, I undersample the legitimate transactions so that the model sees a less skewed mix of classes and can better identify fraudulent activity.
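The 0.17% figure can be checked directly from the class counts printed above. The following is a minimal sketch in plain Python; the dictionary `class_counts` is a hypothetical name introduced here for illustration, and its values are copied from the `groupBy("Class")` output.

```python
# Class counts copied from the groupBy("Class") output above.
# `class_counts` is an illustrative name, not part of the notebook.
class_counts = {0: 284315, 1: 492}

total = sum(class_counts.values())

# Share of fraudulent transactions in the full dataset.
fraud_share = class_counts[1] / total

# Number of legitimate transactions per fraudulent one.
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"Total transactions: {total}")
print(f"Fraud share: {fraud_share:.2%}")
print(f"Legitimate per fraud: {imbalance_ratio:.1f}")
```

With several hundred legitimate transactions for each fraudulent one, a model that always predicts "legitimate" would already be right more than 99% of the time, which is why accuracy alone is a poor guide here.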




# 4. Undersampling

In [0]:
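# Split the data by class
fraud = df.filter(df.Class == 1)
non_fraud = df.filter(df.Class == 0)

# Target roughly 4 legitimate transactions per fraudulent one
ratio = 4
fraction = ratio * fraud.count() / non_fraud.count()

# sample() does not guarantee an exact row count, so the result is approximate
non_fraud_sample = non_fraud.sample(withReplacement=False, fraction=fraction, seed=42)

# Combine all fraud rows with the sampled legitimate rows
df_balanced = fraud.union(non_fraud_sample)
df_balanced.groupBy("Class").count().show()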

+-----+-----+
|Class|count|
+-----+-----+
|    1|  492|
|    0| 1932|
+-----+-----+
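The sampled class has 1,932 rows rather than exactly 4 × 492 = 1,968, because `sample` does not guarantee an exact count. As a rough sanity check, the sketch below assumes each legitimate row is kept independently with probability equal to the sampling fraction (Bernoulli sampling) and compares the observed count with the expected one. The variable names are illustrative, and the counts are copied from the outputs above.

```python
import math

# Counts copied from the class-distribution output earlier in the notebook.
fraud_count = 492
non_fraud_count = 284315
ratio = 4  # target: about 4 legitimate rows per fraud row

# Fraction passed to sample(). Assumption: each legitimate row is kept
# independently with this probability.
fraction = ratio * fraud_count / non_fraud_count

# Expected sample size and its standard deviation under that assumption.
expected_rows = non_fraud_count * fraction
std_rows = math.sqrt(non_fraud_count * fraction * (1 - fraction))

observed_rows = 1932  # count reported in the output above

print(f"fraction = {fraction:.6f}")
print(f"expected = {expected_rows:.0f} +/- {std_rows:.0f}, observed = {observed_rows}")
```

Under this assumption the observed count falls within about one standard deviation of the target, so the small shortfall is ordinary sampling variation. If an exact count were required, a different approach (for example, ordering by a random column and taking a fixed number of rows) would be needed.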



# 💾 5. Save Cleaned Data

In [0]:
df_balanced.write.mode("overwrite").parquet("/FileStore/tables/creditcard_balanced.parquet")
