# PySpark SQL Case Study: Analysis of Credit Card, Loan, and Transaction Data
This case study uses PySpark and Spark SQL to analyze three financial datasets: credit card customer records, loan records, and transaction history.


### Step 1: Uploading and Reading the Data
Each dataset is uploaded to the workspace volume, read into a Spark DataFrame, and registered as a temporary view so that it can be queried with SQL.

#### Credit Card Dataset Overview
The credit card dataset contains customer demographics (geography, gender, age), account attributes (credit score, tenure, balance, number of products, estimated salary), and status flags (`IsActiveMember`, `Exited`).

In [0]:
# Read the credit card CSV (header row only, so every column is loaded as a string)
credit_df = spark.read.option("header", True).csv("/Volumes/workspace/default/data/credit card.csv")
credit_df.show()
credit_df.printSchema()
credit_df.createOrReplaceTempView("credit_card")


+---------+----------+---------+-----------+---------+------+---+------+---------+-------------+--------------+---------------+------+
|RowNumber|CustomerId|  Surname|CreditScore|Geography|Gender|Age|Tenure|  Balance|NumOfProducts|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+---------+-----------+---------+------+---+------+---------+-------------+--------------+---------------+------+
|        1|  15634602| Hargrave|        619|   France|Female| 42|     2|        0|            1|             1|      101348.88|     1|
|        2|  15647311|     Hill|        608|    Spain|Female| 41|     1| 83807.86|            1|             1|      112542.58|     0|
|        3|  15619304|     Onio|        502|   France|Female| 42|     8| 159660.8|            3|             0|      113931.57|     1|
|        4|  15701354|     Boni|        699|   France|Female| 39|     1|        0|            2|             0|       93826.63|     0|
|        5|  15737888| Mitchell|        850|    Spain|F

1. Credit card users in Spain

In [0]:
%sql
SELECT COUNT(*) AS spain_users
FROM credit_card
WHERE Geography = 'Spain'


spain_users
2477


2. Members eligible and active in the bank (active and not exited)

In [0]:
%sql
SELECT COUNT(*) AS eligible_active_members
FROM credit_card
WHERE IsActiveMember = 1 AND Exited = 0


eligible_active_members
4416


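#### Loan Dataset Overview
The loan dataset records each customer's demographics, income and expenditure, loan category and amount, and repayment indicators such as overdue counts, returned cheques, and dishonoured bills.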


In [0]:
# Read loan.csv (header row only, so every column is loaded as a string)
loan_df = spark.read.option("header", True).csv("/Volumes/workspace/default/data/loan.csv")
loan_df.show()
loan_df.printSchema()
loan_df.createOrReplaceTempView("loan")


+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+------------+----------------+------------------+
|Customer_ID|Age|Gender|         Occupation|Marital Status|Family Size|Income|Expenditure|Use Frequency|     Loan Category|Loan Amount|Overdue| Debt Record| Returned Cheque| Dishonour of Bill|
+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+------------+----------------+------------------+
|    IB14001| 30|  MALE|       BANK MANAGER|        SINGLE|          4| 50000|      22199|            6|           HOUSING| 10,00,000 |      5|      42,898|               6|                 9|
|    IB14008| 44|  MALE|          PROFESSOR|       MARRIED|          6| 51000|      19999|            4|          SHOPPING|     50,000|      3|      33,999|               1|                 5|
|    IB14012| 30|FEMALE|           

1. Number of loans in each category

In [0]:
%sql
SELECT `Loan Category`, COUNT(*) AS total_loans
FROM loan
GROUP BY `Loan Category`


Loan Category,total_loans
AUTOMOBILE,60
COMPUTER SOFTWARES,35
BUILDING,7
RESTAURANTS,41
ELECTRONICS,14
DINNING,14
BOOK STORES,7
AGRICULTURE,12
HOUSING,67
BUSINESS,24


2. Number of people who have taken a loan of more than 1 lakh (100,000)

In [0]:
%sql
SELECT COUNT(*) AS count_above_1_lakh
FROM loan
WHERE TRY_CAST(REPLACE(TRIM(`Loan Amount`), ',', '') AS INT) > 100000



count_above_1_lakh
450
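
The `TRY_CAST(REPLACE(TRIM(...)))` expression in the query above is needed because loan amounts are stored as text with Indian-style digit grouping and stray spaces (for example `10,00,000 `). A minimal plain-Python sketch of the same cleaning logic follows. The helper name `parse_amount` and the `"n/a"` sample value are hypothetical, added only for illustration.

```python
from typing import Optional

def parse_amount(raw: Optional[str]) -> Optional[int]:
    """Trim, drop comma separators, and convert to int.

    Returns None instead of raising when the text is not a whole number,
    which is the same idea as TRY_CAST yielding NULL.
    """
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None

# The first two values appear in the loan table; "n/a" and None are made-up bad inputs.
amounts = ["10,00,000 ", "50,000", "n/a", None]
parsed = [parse_amount(a) for a in amounts]

# Count only the rows that parsed and exceed 1 lakh.
above_one_lakh = sum(1 for p in parsed if p is not None and p > 100000)
print(parsed, above_one_lakh)
```

Rows that fail to parse are excluded from the count rather than stopping the query, which is why a tolerant cast suits free-text amount columns.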


3. Number of people with income greater than 60000 rupees

In [0]:
%sql
SELECT COUNT(*) AS high_income_people
FROM loan
WHERE CAST(Income AS INT) > 60000


high_income_people
198


4. People with 2 or more returned cheques and income less than 50000

In [0]:
%sql
SELECT COUNT(*) AS people_count
FROM loan
WHERE TRY_CAST(` Returned Cheque` AS INT) >= 2
  AND TRY_CAST(Income AS INT) < 50000


people_count
137
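
The backticked name in the query above includes a leading space (`` ` Returned Cheque` ``) because the CSV header carries that whitespace, and the transaction file has the same problem on both sides of `WITHDRAWAL AMT` and `DEPOSIT AMT`. One way to avoid such fragile names is to normalize the headers once after loading. The sketch below shows the idea in plain Python. `normalize_column` is a hypothetical helper, and applying the result in Spark (for instance with `DataFrame.toDF(*new_names)`) is an assumption about how it would be wired in, not something the notebook does.

```python
import re

def normalize_column(name: str) -> str:
    """Trim whitespace, lower-case, and replace runs of other characters with '_'."""
    cleaned = name.strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", cleaned).strip("_")

# Header names copied from the loan and transaction tables in this notebook.
headers = [" Returned Cheque", " WITHDRAWAL AMT ", "Loan Category", "BALANCE AMT"]
renamed = {h: normalize_column(h) for h in headers}

for old, new in renamed.items():
    print(repr(old), "->", new)
```

With normalized names, queries no longer depend on invisible whitespace inside backticks.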


5. People with 2 or more returned cheques who are single

In [0]:
%sql
SELECT COUNT(*) AS single_with_2plus_returned
FROM loan
WHERE TRY_CAST(` Returned Cheque` AS INT) >= 2
  AND `Marital Status` = 'Single'


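single_with_2plus_returned
0

Note: the count is 0 because the string comparison is case-sensitive and the loan table stores this column in upper case (for example `SINGLE`), so `'Single'` matches no rows.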
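
Label comparisons like the marital-status filter are more robust when both sides are normalized before comparing. The plain-Python sketch below illustrates the difference. `normalize_label` is a hypothetical helper and the `" single "` sample is invented for illustration. In Spark SQL the analogous expression would be `UPPER(TRIM(...))` on the column.

```python
from typing import Optional

def normalize_label(value: Optional[str]) -> Optional[str]:
    """Trim and upper-case a label; pass None through unchanged."""
    return value.strip().upper() if value is not None else None

# "SINGLE" and "MARRIED" appear in the loan table; " single " and None are made up.
statuses = ["SINGLE", "MARRIED", " single ", None]

# An exact, case-sensitive comparison finds nothing.
exact_matches = sum(1 for s in statuses if s == "Single")

# Comparing normalized values finds both spellings.
normalized_matches = sum(1 for s in statuses if normalize_label(s) == "SINGLE")

print(exact_matches, normalized_matches)
```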


6. People with expenditure over 50000/month

In [0]:
%sql
SELECT COUNT(*) AS high_spenders
FROM loan
WHERE CAST(Expenditure AS INT) > 50000


high_spenders
6


7. Members eligible for a credit card

In [0]:
%sql
SELECT COUNT(*) AS eligible_members
FROM loan
WHERE TRY_CAST(Income AS INT) > 50000
  AND ` Debt Record` = 'No'


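eligible_members
0

Note: the count is 0 because the ` Debt Record` column holds amounts (for example `42,898`), not Yes/No flags, so the comparison with `'No'` matches no rows. A meaningful eligibility rule would need a numeric condition on that column.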


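#### Transaction Dataset Overview
The transaction dataset lists account-level entries with the transaction details, value date, withdrawal and deposit amounts, and the running balance.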

In [0]:
# Read txn.csv (header row only, so every column is loaded as a string)
txn_df = spark.read.option("header", True).csv("/Volumes/workspace/default/data/txn.csv")
txn_df.show()
txn_df.printSchema()
txn_df.createOrReplaceTempView("transaction")


+-------------+--------------------+----------+----------------+-------------+-----------+
|   Account No| TRANSACTION DETAILS|VALUE DATE| WITHDRAWAL AMT | DEPOSIT AMT |BALANCE AMT|
+-------------+--------------------+----------+----------------+-------------+-----------+
|409000611074'|TRF FROM  Indiafo...| 29-Jun-17|            NULL|      1000000|    1000000|
|409000611074'|TRF FROM  Indiafo...|  5-Jul-17|            NULL|      1000000|    2000000|
|409000611074'|FDRL/INTERNAL FUN...| 18-Jul-17|            NULL|       500000|    2500000|
|409000611074'|TRF FRM  Indiafor...|  1-Aug-17|            NULL|      3000000|    5500000|
|409000611074'|FDRL/INTERNAL FUN...| 16-Aug-17|            NULL|       500000|    6000000|
|409000611074'|FDRL/INTERNAL FUN...| 16-Aug-17|            NULL|       500000|    6500000|
|409000611074'|FDRL/INTERNAL FUN...| 16-Aug-17|            NULL|       500000|    7000000|
|409000611074'|FDRL/INTERNAL FUN...| 16-Aug-17|            NULL|       500000|    7500000|

1. Maximum withdrawal amount in transactions

In [0]:
%sql
SELECT MAX(CAST(` WITHDRAWAL AMT ` AS DOUBLE)) AS max_withdrawal
FROM transaction


max_withdrawal
459447546.4


2. Minimum withdrawal amount for each account

In [0]:
%sql
SELECT `Account No`, MIN(CAST(` WITHDRAWAL AMT ` AS DOUBLE)) AS min_withdrawal
FROM transaction
WHERE ` WITHDRAWAL AMT ` IS NOT NULL
GROUP BY `Account No`


Account No,min_withdrawal
409000611074',120.0
409000438620',0.34
409000493201',2.1
409000425051',1.25
409000493210',0.01
409000438611',0.2
409000405747',21.0
1196711',0.25
1196428',0.25
409000362497',0.97


3. Maximum deposit amount for each account

In [0]:
%sql
SELECT `Account No`, MAX(CAST(` DEPOSIT AMT ` AS DOUBLE)) AS max_deposit
FROM transaction
WHERE ` DEPOSIT AMT ` IS NOT NULL
GROUP BY `Account No`


Account No,max_deposit
409000611074',3000000.0
409000438620',544800000.0
409000493201',1000000.0
409000425051',15000000.0
409000493210',15000000.0
409000438611',170250000.0
409000405747',202100000.0
1196711',500000000.0
1196428',211959442.2
409000362497',200000000.0


4. Minimum deposit amount for each account

In [0]:
%sql
SELECT `Account No`, MIN(CAST(` DEPOSIT AMT ` AS DOUBLE)) AS min_deposit
FROM transaction
WHERE ` DEPOSIT AMT ` IS NOT NULL
GROUP BY `Account No`


Account No,min_deposit
409000611074',1320.0
409000438620',0.07
409000493201',0.9
409000425051',1.0
409000493210',0.01
409000438611',0.03
409000405747',500.0
1196711',1.01
1196428',1.0
409000362497',0.03


5. Sum of balance in every bank account

In [0]:
%sql
SELECT `Account No`, SUM(CAST(`BALANCE AMT` AS DOUBLE)) AS total_balance
FROM transaction
GROUP BY `Account No`


Account No,total_balance
409000611074',1615533622.0
409000438620',-7122918679513.672
409000493201',1042083182.9499984
409000425051',-3772118411.649988
409000493210',-3275849521320.957
409000438611',-2494865770683.3955
409000405747',-24310804706.700016
1196711',-16047649810127.5
1196428',-81418498130721.0
409000362497',-52860004792808.0
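
Totals such as `1042083182.9499984` in the output above suggest binary floating-point rounding, which is expected when summing currency values as `DOUBLE`. The sketch below reproduces the effect in plain Python and shows that a decimal type keeps the cents exact. The sample values are invented for illustration. In Spark SQL, casting to a `DECIMAL(p, s)` type instead of `DOUBLE` would be the analogous change, with precision and scale chosen to fit the data.

```python
from decimal import Decimal

# Ten made-up entries of 0.10 each, stored as text like the CSV columns.
entries = ["0.10"] * 10

# Summing as binary floats accumulates a small rounding error.
float_total = sum(float(e) for e in entries)

# Summing as decimals keeps the result exact.
decimal_total = sum(Decimal(e) for e in entries)

print(float_total)    # slightly below 1.0
print(decimal_total)  # exactly 1.00
```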


6. Number of transactions on each date

In [0]:
%sql
SELECT `VALUE DATE`, COUNT(*) AS transaction_count
FROM transaction
GROUP BY `VALUE DATE`
ORDER BY `VALUE DATE`


VALUE DATE,transaction_count
1-Apr-17,1
1-Aug-15,75
1-Aug-16,85
1-Aug-17,65
1-Aug-18,144
1-Dec-15,96
1-Dec-16,106
1-Dec-17,45
1-Dec-18,97
1-Feb-16,97
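
The result above is ordered alphabetically rather than chronologically (`1-Apr-17` comes before `1-Aug-15`) because `VALUE DATE` is a string column. The plain-Python sketch below shows the difference between the two orderings using dates taken from that output. `parse_value_date` is a hypothetical helper, and the `%b` month parsing assumes an English locale. In Spark SQL, ordering by a parsed date such as `TO_DATE(`VALUE DATE`, 'd-MMM-yy')` would be the analogous fix, though that pattern is an assumption to verify against the data.

```python
from datetime import date, datetime

def parse_value_date(text: str) -> date:
    """Parse a value date such as '29-Jun-17' (day, abbreviated month, two-digit year)."""
    return datetime.strptime(text.strip(), "%d-%b-%y").date()

# Dates copied from the query output above.
dates = ["1-Apr-17", "1-Aug-15", "1-Aug-16", "1-Dec-15"]

# Sorting the raw strings keeps the alphabetical order seen in the output.
as_text = sorted(dates)

# Sorting by the parsed date gives chronological order.
chronological = sorted(dates, key=parse_value_date)

print(as_text)
print(chronological)
```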


7. Accounts with a withdrawal of more than 1 lakh (100,000)

In [0]:
%sql
SELECT DISTINCT `Account No`
FROM transaction
WHERE CAST(` WITHDRAWAL AMT ` AS DOUBLE) > 100000


Account No
409000611074'
409000438620'
409000493201'
409000425051'
409000493210'
409000438611'
409000405747'
1196711'
1196428'
409000362497'
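
Every account number in the output carries a trailing apostrophe (for example `409000611074'`), which appears to come from the source file. If these values are later joined or compared with clean identifiers, stripping that character first avoids silent mismatches. A minimal plain-Python sketch follows. `clean_account_no` is a hypothetical helper, and the third sample value (without the apostrophe) is invented to show deduplication.

```python
def clean_account_no(raw: str) -> str:
    """Trim whitespace and remove any trailing apostrophes from an account number."""
    return raw.strip().rstrip("'")

# The first two values appear in the output above; the third is a made-up clean duplicate.
accounts = ["409000611074'", "1196711'", "409000611074"]

# After cleaning, the two spellings of the same account collapse into one.
unique_accounts = sorted({clean_account_no(a) for a in accounts})
print(unique_accounts)
```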
