# Fraud Detection

<table>
    <tbody>
        <tr>
            <td>Problems</td>
            <td>Adanya fraud pada transaksi kartu kredit yang dapat merugikan bank.</td>
        </tr>
        <tr>
            <td>Goals</td>
            <td>Memahami tren dan distribusi fraud.</td>
        </tr>
        <tr>
            <td>Bussiness Objective</td>
            <td>Membangun model machine learning guna membantu fraud detection dengan efektif dan cepat.  </td>
        </tr>
    </tbody>
</table>


<table>
    <thead>
        <tr>
            <th>Feature Name</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>trans_date_trans_time</td>
            <td>Waktu dan tanggal transaksi dilakukan.</td>
        </tr>
        <tr>
            <td>cc_num</td>
            <td>Nomor kartu kredit yang digunakan untuk transaksi.</td>
        </tr>
        <tr>
            <td>merchant</td>
            <td>Nama pedagang atau toko tempat transaksi terjadi.</td>
        </tr>
        <tr>
            <td>category</td>
            <td>Kategori pedagang berdasarkan jenis barang/jasa yang dijual.</td>
        </tr>
        <tr>
            <td>amt</td>
            <td>Jumlah uang yang dikeluarkan dalam transaksi.</td>
        </tr>
        <tr>
            <td>first</td>
            <td>Nama depan pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>last</td>
            <td>Nama belakang pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>gender</td>
            <td>Jenis kelamin pemegang kartu (<code>M</code> untuk pria, <code>F</code> untuk wanita).</td>
        </tr>
        <tr>
            <td>street</td>
            <td>Alamat jalan pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>city</td>
            <td>Kota tempat tinggal pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>state</td>
            <td>Negara bagian tempat tinggal pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>zip</td>
            <td>Kode pos alamat pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>lat</td>
            <td>Garis lintang alamat pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>long</td>
            <td>Garis bujur alamat pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>city_pop</td>
            <td>Jumlah populasi kota tempat tinggal pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>job</td>
            <td>Profesi atau pekerjaan pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>dob</td>
            <td>Tanggal lahir pemegang kartu kredit.</td>
        </tr>
        <tr>
            <td>trans_num</td>
            <td>Nomor unik transaksi.</td>
        </tr>
        <tr>
            <td>unix_time</td>
            <td>Waktu transaksi dalam format UNIX timestamp.</td>
        </tr>
        <tr>
            <td>merch_lat</td>
            <td>Garis lintang lokasi pedagang tempat transaksi dilakukan.</td>
        </tr>
        <tr>
            <td>merch_long</td>
            <td>Garis bujur lokasi pedagang tempat transaksi dilakukan.</td>
        </tr>
        <tr>
            <td>is_fraud</td>
            <td>Indikator apakah transaksi tersebut adalah penipuan (<code>1</code> untuk fraud, <code>0</code> untuk bukan fraud).</td>
        </tr>
    </tbody>
</table>


In [58]:
from pyspark.sql import SparkSession # import library ini berfungsi untuk membuat sesi pyspark
import pyspark.sql.functions as F # import library ini berfungsi untuk memanipulasi data
from pyspark.sql.types import * # import library ini untuk mendefinisaikan skema DataFrame
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
import plotly.express as px

In [59]:
spark = SparkSession.builder.appName('Fraud').getOrCreate()

In [60]:
fraud = spark.read.csv('fraudTrain.csv', header=True)
fraud.show(5, truncate=False)

+---+---------------------+----------------+----------------------------------+-------------+------+---------+-------+------+----------------------------+--------------+-----+-----+-------+---------+--------+---------------------------------+----------+--------------------------------+----------+------------------+-----------+--------+
|_c0|trans_date_trans_time|cc_num          |merchant                          |category     |amt   |first    |last   |gender|street                      |city          |state|zip  |lat    |long     |city_pop|job                              |dob       |trans_num                       |unix_time |merch_lat         |merch_long |is_fraud|
+---+---------------------+----------------+----------------------------------+-------------+------+---------+-------+------+----------------------------+--------------+-----+-----+-------+---------+--------+---------------------------------+----------+--------------------------------+----------+------------------+--------

In [61]:
rows = fraud.count()
col = fraud.columns
print(f'Jumlah baris: {rows}, kolom: {len(col)}')

Jumlah baris: 1296675, kolom: 23


In [62]:
fraud.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- trans_date_trans_time: string (nullable = true)
 |-- cc_num: string (nullable = true)
 |-- merchant: string (nullable = true)
 |-- category: string (nullable = true)
 |-- amt: string (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- city_pop: string (nullable = true)
 |-- job: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- trans_num: string (nullable = true)
 |-- unix_time: string (nullable = true)
 |-- merch_lat: string (nullable = true)
 |-- merch_long: string (nullable = true)
 |-- is_fraud: string (nullable = true)



In [63]:
fraud = fraud.withColumn('is_fraud', F.col('is_fraud').cast('int'))
fraud = fraud.withColumn('amt', F.col('amt').cast('int'))
fraud.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- trans_date_trans_time: string (nullable = true)
 |-- cc_num: string (nullable = true)
 |-- merchant: string (nullable = true)
 |-- category: string (nullable = true)
 |-- amt: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- city_pop: string (nullable = true)
 |-- job: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- trans_num: string (nullable = true)
 |-- unix_time: string (nullable = true)
 |-- merch_lat: string (nullable = true)
 |-- merch_long: string (nullable = true)
 |-- is_fraud: integer (nullable = true)



## Data Cleaning

### Missing Values

In [64]:
fraud.select([F.sum(F.col(c).isNull().cast('int')).alias(c) for c in fraud.columns]).show()

+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|  0|                    0|     0|       0|       0|  0|    0|   0|     0|     0|   0|    0|  0|  0|   0|       0|  0|  0|        0|        0|        0|         0|       0|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+



### Duplicates

In [65]:
count = fraud.count()
distinct = fraud.distinct().count()
duplicate = count - distinct
duplicate

0

## Manipulation

In [66]:
fraud = fraud.drop('first', 'last', 'lat', 'long', 'trans_num', 'unix_time', 'merch_lat', 'merch_long', '_c0', 'street')

### Date Time

In [67]:
fraud.show(5, truncate=False)

+---------------------+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+---------------------------------+----------+--------+
|trans_date_trans_time|cc_num          |merchant                          |category     |amt|gender|city          |state|zip  |city_pop|job                              |dob       |is_fraud|
+---------------------+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+---------------------------------+----------+--------+
|2019-01-01 00:00:18  |2703186189652095|fraud_Rippin, Kub and Mann        |misc_net     |4  |F     |Moravian Falls|NC   |28654|3495    |Psychologist, counselling        |1988-03-09|0       |
|2019-01-01 00:00:44  |630423337322    |fraud_Heller, Gutmann and Zieme   |grocery_pos  |107|F     |Orient        |WA   |99160|149     |Special educational needs teacher|1978-06-21|0       |
|2019-01-01 00:00:51  |38859492057661  |fraud

In [68]:
fraud = fraud.withColumn('year', F.year('trans_date_trans_time'))
fraud = fraud.withColumn('month', F.month('trans_date_trans_time'))
fraud = fraud.withColumn('day', F.dayofweek('trans_date_trans_time'))
fraud = fraud.withColumn('hour', F.hour('trans_date_trans_time'))
fraud.show(5)

+---------------------+----------------+--------------------+-------------+---+------+--------------+-----+-----+--------+--------------------+----------+--------+----+-----+---+----+
|trans_date_trans_time|          cc_num|            merchant|     category|amt|gender|          city|state|  zip|city_pop|                 job|       dob|is_fraud|year|month|day|hour|
+---------------------+----------------+--------------------+-------------+---+------+--------------+-----+-----+--------+--------------------+----------+--------+----+-----+---+----+
|  2019-01-01 00:00:18|2703186189652095|fraud_Rippin, Kub...|     misc_net|  4|     F|Moravian Falls|   NC|28654|    3495|Psychologist, cou...|1988-03-09|       0|2019|    1|  3|   0|
|  2019-01-01 00:00:44|    630423337322|fraud_Heller, Gut...|  grocery_pos|107|     F|        Orient|   WA|99160|     149|Special education...|1978-06-21|       0|2019|    1|  3|   0|
|  2019-01-01 00:00:51|  38859492057661|fraud_Lind-Buckridge|entertainment|220| 

In [69]:
fraud = fraud.drop('trans_date_trans_time')

### Job

In [70]:
fraud.select('job').show(5, truncate=False)

+---------------------------------+
|job                              |
+---------------------------------+
|Psychologist, counselling        |
|Special educational needs teacher|
|Nature conservation officer      |
|Patent attorney                  |
|Dance movement psychotherapist   |
+---------------------------------+
only showing top 5 rows



In [71]:
fraud = fraud.withColumn('professional', F.split(fraud['job'], ', ')[0]) \
            .withColumn('specialization', F.when(F.split(fraud['job'], ', ')[1].isNull(), F.lit('Non-Specific'))
                                          .otherwise(F.split(fraud['job'], ', ')[1]))

fraud.show(5, truncate=False)

+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+---------------------------------+----------+--------+----+-----+---+----+---------------------------------+--------------+
|cc_num          |merchant                          |category     |amt|gender|city          |state|zip  |city_pop|job                              |dob       |is_fraud|year|month|day|hour|professional                     |specialization|
+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+---------------------------------+----------+--------+----+-----+---+----+---------------------------------+--------------+
|2703186189652095|fraud_Rippin, Kub and Mann        |misc_net     |4  |F     |Moravian Falls|NC   |28654|3495    |Psychologist, counselling        |1988-03-09|0       |2019|1    |3  |0   |Psychologist                     |counselling   |
|630423337322    |fraud_Heller, Gutmann and Ziem

In [72]:
fraud = fraud.drop('job')

### DOB

In [73]:
fraud = fraud.withColumn('year_birth_cust', F.year('dob'))

In [74]:
fraud = fraud.drop('dob')

In [75]:
fraud.show(5, truncate=False)

+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+--------+----+-----+---+----+---------------------------------+--------------+---------------+
|cc_num          |merchant                          |category     |amt|gender|city          |state|zip  |city_pop|is_fraud|year|month|day|hour|professional                     |specialization|year_birth_cust|
+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+--------+----+-----+---+----+---------------------------------+--------------+---------------+
|2703186189652095|fraud_Rippin, Kub and Mann        |misc_net     |4  |F     |Moravian Falls|NC   |28654|3495    |0       |2019|1    |3  |0   |Psychologist                     |counselling   |1988           |
|630423337322    |fraud_Heller, Gutmann and Zieme   |grocery_pos  |107|F     |Orient        |WA   |99160|149     |0       |2019|1    |3  |0   |Special educational n

## Exploratory Data Analysis (EDA)

In [76]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [77]:
fraud_pandas = fraud.sample(withReplacement=False, fraction=0.05, seed=42).toPandas()
fraud_pandas.shape

(64626, 17)

In [78]:
fraud_pandas.head()

Unnamed: 0,cc_num,merchant,category,amt,gender,city,state,zip,city_pop,is_fraud,year,month,day,hour,professional,specialization,year_birth_cust
0,6011860238257910,fraud_Lebsack and Sons,misc_net,327,F,Lahoma,OK,73754,1078,0,2019,1,3,0,Programme researcher,broadcasting/film/video,1952
1,4687263141103,fraud_Kunze Inc,grocery_pos,108,F,Bigelow,MN,56117,399,0,2019,1,3,0,Economist,Non-Specific,1977
2,3592325941359225,fraud_Mraz-Herzog,gas_transport,59,F,Hopewell,VA,23860,31970,0,2019,1,3,1,Purchasing manager,Non-Specific,1935
3,180040027502291,fraud_Pouros-Conroy,shopping_pos,9,F,New York City,NY,10162,1577385,0,2019,1,3,1,Audiological scientist,Non-Specific,1957
4,30376238035123,fraud_Medhurst PLC,shopping_net,215,F,Sixes,OR,97476,217,0,2019,1,3,1,Retail merchandiser,Non-Specific,1928


In [79]:
fraud_pandas.describe()

Unnamed: 0,amt,is_fraud,year,month,day,hour,year_birth_cust
count,64626.0,64626.0,64626.0,64626.0,64626.0,64626.0,64626.0
mean,71.117909,0.005834,2019.286077,6.136555,3.727416,12.798904,1973.151843
std,188.125524,0.076155,0.451929,3.404582,2.132037,6.827461,17.420764
min,1.0,0.0,2019.0,1.0,1.0,0.0,1924.0
25%,9.0,0.0,2019.0,3.0,2.0,7.0,1962.0
50%,47.0,0.0,2019.0,6.0,3.0,14.0,1975.0
75%,83.0,0.0,2020.0,9.0,6.0,19.0,1987.0
max,17897.0,1.0,2020.0,12.0,7.0,23.0,2005.0


In [80]:
fraud_pandas['is_fraud'] = fraud_pandas['is_fraud'].map({0: 'No', 1:'Yes'})

In [81]:
target_value = fraud_pandas['is_fraud'].value_counts().reset_index()
target_value.columns = ['is_fraud', 'count']
target_value

Unnamed: 0,is_fraud,count
0,No,64249
1,Yes,377


In [82]:
target = px.pie(target_value, names='is_fraud', values='count',title='Distribution of Fraud Status', color_discrete_sequence=px.colors.sequential.Viridis)
target.show()

Terdapat 0.06% transaksi credit card yang fraud.

In [83]:
fraud_merchant_yes = fraud_pandas[fraud_pandas['is_fraud'] == 'Yes'].groupby('merchant')['is_fraud'].count().reset_index()
fraud_merchant_15 = fraud_merchant_yes.sort_values(by='is_fraud', ascending=False).head(15)
fraud_merchant_15

Unnamed: 0,merchant,is_fraud
53,fraud_Doyle Ltd,6
92,fraud_Hudson-Ratke,5
109,fraud_Kiehn-Emmerich,5
174,"fraud_Reichert, Shanahan and Hayes",5
54,fraud_DuBuque LLC,5
62,fraud_Fisher Inc,5
50,fraud_Dooley Inc,4
145,fraud_McDermott-Weimann,4
84,"fraud_Heller, Gutmann and Zieme",4
120,"fraud_Kovacek, Dibbert and Ondricka",4


In [84]:
merchant = px.bar(fraud_merchant_15, x ='merchant', y='is_fraud',title='Merchant Bank vs Fraud Status', color_discrete_sequence=px.colors.sequential.Viridis)
merchant.update_layout(xaxis_tickangle=-45)
merchant.show()

Merchant paling banyak mengalami fraud adalah Doyle Ltd.

In [85]:
fraud_gender_yes = fraud_pandas[fraud_pandas['is_fraud'] == 'Yes'].groupby('gender')['is_fraud'].count().reset_index()
fraud_gender_yes

Unnamed: 0,gender,is_fraud
0,F,185
1,M,192


In [86]:
gender = px.pie(fraud_gender_yes, names='gender', values='is_fraud',title='Gender vs Fraud Status', color_discrete_sequence=px.colors.sequential.Viridis)
gender.show()

Gender yang paling banyak melakukan fraud adalah laki-laki.

In [87]:
fraud_category_yes = fraud_pandas[fraud_pandas['is_fraud'] == 'Yes'].groupby('category')['is_fraud'].count().reset_index()
fraud_category = fraud_category_yes.sort_values(by='is_fraud', ascending=False)
fraud_category

Unnamed: 0,category,is_fraud
4,grocery_pos,106
11,shopping_net,68
8,misc_net,49
12,shopping_pos,46
2,gas_transport,25
7,kids_pets,15
0,entertainment,14
3,grocery_net,12
1,food_dining,9
5,health_fitness,8


In [88]:
category = px.bar(fraud_category, x ='category', y='is_fraud',title='Category Transaction vs Fraud', color_discrete_sequence=px.colors.sequential.Viridis)
category.show()

Kategori transaksi yang paling sering terjadinya fraud adalah grocery_pos dan diikuti shopping_net.

In [89]:
fraud_month_yes = fraud_pandas[fraud_pandas['is_fraud'] == 'Yes'].groupby('month')['is_fraud'].count().reset_index()
fraud_month = fraud_month_yes.sort_values(by='month', ascending=True)
fraud_month

Unnamed: 0,month,is_fraud
0,1,47
1,2,44
2,3,42
3,4,37
4,5,50
5,6,35
6,7,22
7,8,19
8,9,26
9,10,13


In [90]:
fig = px.line(fraud_month, x='month', y='is_fraud', title='Month vs Fraud Status', color_discrete_sequence=px.colors.sequential.Viridis)
fig.show()

Fraud terbanyak ada pada bulan Mei.

### Recommendation

Rekomendasi untuk mengurangi fraud pada transaksi kartu kredit:
1. Mengingkatkan deteksi fraud pada merchant yang rentan.
2. Mengawasi ketat transaksi shopping_net dan grocery_pos.
3. Berikan pengawasan yang ketat pada transaksi bulan Mei.
4. Lakukan analisis lebih lanjut mengenai gender yang lebih sering melakukan fraud (laki-laki).

## Feature Engineering

### Add New Feature

In [38]:
fraud.show(5, truncate=False)

+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+--------+----+-----+---+----+---------------------------------+--------------+---------------+
|cc_num          |merchant                          |category     |amt|gender|city          |state|zip  |city_pop|is_fraud|year|month|day|hour|professional                     |specialization|year_birth_cust|
+----------------+----------------------------------+-------------+---+------+--------------+-----+-----+--------+--------+----+-----+---+----+---------------------------------+--------------+---------------+
|2703186189652095|fraud_Rippin, Kub and Mann        |misc_net     |4  |F     |Moravian Falls|NC   |28654|3495    |0       |2019|1    |3  |0   |Psychologist                     |counselling   |1988           |
|630423337322    |fraud_Heller, Gutmann and Zieme   |grocery_pos  |107|F     |Orient        |WA   |99160|149     |0       |2019|1    |3  |0   |Special educational n

In [39]:
# Usia Cust (0: Child, 1: Young Adult, 2: Adult, 3: Senior)
fraud = fraud.withColumn('Age', F.col('year') - F.col('year_birth_cust'))

fraud = fraud.withColumn(
    'Age_group',
    F.when(F.col('Age') < 18, 0)
    .when(F.col('Age') < 35, 1)
    .when(F.col('Age') < 60, 2)
    .otherwise(3)
)

In [40]:
# Location Type (0: Rural, 1: Urban)
fraud = fraud.withColumn('Location_type',
                         F.when(F.col('city_pop') < 10000, 0).otherwise(1))

In [41]:
#Time of day (0: Night, 1: Morning, 2: Afternoon, 3: Evening)
fraud = fraud.withColumn('Time_of_day',
                         F.when((F.col('hour') >= 0) & (F.col('hour') < 6), 0)
                         .when((F.col('hour') >= 6) & (F.col('hour') < 12), 1)
                         .when((F.col('hour') >= 12) & (F.col('hour') < 18), 2)
                         .otherwise(3)
                         )

In [42]:
fraud = fraud.withColumn('is_weekend', F.when(F.col('day').isin([6,7]), 1).otherwise(0))

In [43]:
fraud = fraud.withColumn('is_weekday', F.when(F.col('day').isin([1, 2, 3, 4, 5]), 1).otherwise(0))

In [44]:
#Amount Category (0: Low, 1: Medium, 2: High)
fraud = fraud.withColumn('amt_category',
                         F.when((F.col('amt')) < 50, 0)
                         .when((F.col('amt') >= 50) & (F.col('amt') < 100), 1)
                         .otherwise(2))

### Encoding Data

#### Binary Encoded

In [45]:
indexer = StringIndexer(inputCol='gender', outputCol='gender_index')
fraud = indexer.fit(fraud).transform(fraud)
fraud.select('gender', 'gender_index').distinct().show(5)

+------+------------+
|gender|gender_index|
+------+------------+
|     M|         1.0|
|     F|         0.0|
+------+------------+



In [46]:
encoder = OneHotEncoder(inputCol='gender_index', outputCol='gender_ohe', dropLast=False)
fraud = encoder.fit(fraud).transform(fraud)
fraud.select('gender', 'gender_index', 'gender_ohe').distinct().show()

+------+------------+-------------+
|gender|gender_index|   gender_ohe|
+------+------------+-------------+
|     F|         0.0|(2,[0],[1.0])|
|     M|         1.0|(2,[1],[1.0])|
+------+------------+-------------+



#### Frequency Encoded

In [47]:
list_column_cat = ['cc_num', 'merchant', 'category', 'city', 'state', 'zip', 'professional', 'specialization']
total_rows = fraud.count()

for col_name in list_column_cat:
      col_freq = fraud.groupBy(col_name).agg(F.count('*').alias(f'{col_name}_count'))
      col_freq = col_freq.withColumn(f'{col_name}_freq', F.col(f'{col_name}_count') / total_rows)
      col_freq = col_freq.withColumn(f'{col_name}_freq', F.round(F.col(f'{col_name}_freq'), 4))
      fraud = fraud.join(col_freq.select(col_name, f'{col_name}_freq'), on=col_name, how='left')


### Delete Columns

In [48]:
fraud = fraud.drop('year', 'year_birth_cust', 'city_pop', 'hour', 'merchant', 'category', 'gender', 'city', 'state', 'zip', 'professional', 'specialization', 'cc_num', 'amt')


In [49]:
fraud.show(5, truncate=False)

+--------+-----+---+---+---------+-------------+-----------+----------+----------+------------+------------+-------------+-----------+-------------+-------------+---------+----------+--------+-----------------+-------------------+
|is_fraud|month|day|Age|Age_group|Location_type|Time_of_day|is_weekend|is_weekday|amt_category|gender_index|gender_ohe   |cc_num_freq|merchant_freq|category_freq|city_freq|state_freq|zip_freq|professional_freq|specialization_freq|
+--------+-----+---+---+---------+-------------+-----------+----------+----------+------------+------------+-------------+-----------+-------------+-------------+---------+----------+--------+-----------------+-------------------+
|0       |1    |3  |57 |2        |0            |0          |0         |1         |2           |1.0         |(2,[1],[1.0])|4.0E-4     |0.0015       |0.0725       |4.0E-4   |0.0043    |4.0E-4  |4.0E-4           |0.7592             |
|0       |1    |3  |52 |2        |0            |0          |0         |1    

In [50]:
correlations = [Row(Feature=column, Correlation=fraud.corr(column, 'is_fraud')) for column in fraud.columns if column != 'is_fraud' and column != 'gender_ohe']

corr_df = spark.createDataFrame(correlations)

corr_df.orderBy(F.col('Correlation').desc()).show(truncate=False)

+-------------------+----------------------+
|Feature            |Correlation           |
+-------------------+----------------------+
|amt_category       |0.09002173080895591   |
|Age                |0.012453487015932442  |
|day                |0.009620213899482673  |
|Age_group          |0.008378485572780228  |
|merchant_freq      |0.007847201076862887  |
|gender_index       |0.007641534190320515  |
|Time_of_day        |0.006634277521023595  |
|is_weekend         |0.005966119560024192  |
|category_freq      |0.00582024851767668   |
|Location_type      |0.001992969263110647  |
|specialization_freq|2.5635106250906274E-4 |
|state_freq         |-3.593222640601406E-4 |
|is_weekday         |-0.005966119560024182 |
|professional_freq  |-0.0071998289680903675|
|month              |-0.012409331585155019 |
|city_freq          |-0.03872798246254171  |
|zip_freq           |-0.052732403984976355 |
|cc_num_freq        |-0.055325505497714776 |
+-------------------+----------------------+



In [51]:
fraud = fraud.drop('specialization_freq', 'state_freq', 'Location_type', 'Age', 'is_weekend', 'category_freq', 'is_weekday')

## Modeling

### Vector Assembler

In [52]:
feature_columns = [col_name for col_name in fraud.columns if col_name != 'is_fraud' and col_name != 'gender_ohe']
assembler = VectorAssembler(inputCols=feature_columns, outputCol='features_e')
fraud = assembler.transform(fraud)

### Split Data

In [53]:
train, test = fraud.randomSplit([0.8, 0.2], seed=42)

### Train vs Test

In [54]:
models = {
    'Logistic Regression': LogisticRegression(featuresCol='features_e', labelCol='is_fraud'),
    'Random Forest': RandomForestClassifier(featuresCol='features_e', labelCol='is_fraud'),
    'Gradient Boosting Tree Classifier': GBTClassifier(featuresCol='features_e', labelCol='is_fraud'),
}

In [56]:
binary_eval = BinaryClassificationEvaluator(labelCol='is_fraud', metricName='areaUnderROC')
multi_eval = MulticlassClassificationEvaluator(labelCol='is_fraud', predictionCol='prediction')

results = []

for model_name, model in models.items():
    pipeline = Pipeline(stages=[model])
    pipeline_model = pipeline.fit(train)

    train_predictions = pipeline_model.transform(train)
    test_predictions = pipeline_model.transform(test)
    train_auc = binary_eval.evaluate(train_predictions)
    train_accuracy = multi_eval.setMetricName('accuracy').evaluate(train_predictions)
    train_f1 = multi_eval.setMetricName('f1').evaluate(train_predictions)
    train_recall = multi_eval.setMetricName('weightedRecall').evaluate(train_predictions)

    test_auc = binary_eval.evaluate(test_predictions)
    test_accuracy = multi_eval.setMetricName('accuracy').evaluate(test_predictions)
    test_f1 = multi_eval.setMetricName('f1').evaluate(test_predictions)
    test_recall = multi_eval.setMetricName('weightedRecall').evaluate(train_predictions)

    results.append({  'Model': model_name,
                      'Train AUC': train_auc,
                      'Train Accuracy': train_accuracy,
                      'Train Recall': train_recall,
                      'Train F1-Score': train_f1,
                      'Test AUC': test_auc,
                      'Test Accuracy': test_accuracy,
                      'Test Recall': test_recall,
                      'Test F1-Score': test_f1,
                      })

results_rows = [Row(**r) for r in results]
df_baseline = spark.createDataFrame(results_rows)
df_baseline.show(truncate=True)

+--------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|               Model|         Train AUC|    Train Accuracy|      Train Recall|    Train F1-Score|          Test AUC|     Test Accuracy|       Test Recall|     Test F1-Score|
+--------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| Logistic Regression|0.8176106123846839|0.9941930273927985|0.9941930273927985| 0.991297995870254|0.8283295456251195|0.9942847654243755|0.9941930273927985|0.9914353375152369|
|       Random Forest|0.8615129029599123|0.9941930273927985|0.9941930273927985| 0.991297995870254|0.8666070475043218|0.9942847654243755|0.9941930273927985|0.9914353375152369|
|Gradient Boosting...|0.9327852329592852|0.9943472118629532| 0.994347211862953|0.9917796627107782|0.9325247366623882| 0.99441

1. Model tidak mengalami overfitting.
2. Model memiliki performa yang baik.
3. Model unggul pada nilai AUC, Recall dan F1-Score dibandingkan model lainnya.