<a href="https://colab.research.google.com/github/rjpost20/Anomalous-Bank-Transactions-Detection-Project/blob/main/Notebook-1_Intro-EDA-Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/rjpost20/Anomalous-Bank-Transactions-Detection-Project/blob/main/data/AdobeStock_319163865.jpeg?raw=true">
Image by <a href="https://stock.adobe.com/contributor/200768506/andsus?load_type=author&prev_url=detail" >AndSus</a> on Adobe Stock

# Phase 5 Project: *Detecting Anomalous Financial Transactions*

## Notebook 1: Intro, EDA and Preprocessing

### By Ryan Posternak

Flatiron School, Full-Time Live NYC<br>
Project Presentation Date: August 25th, 2022<br>
Instructor: Joseph Mata

## Goal: 

*This is a project for learning purposes. The *** is not involved with this project in any way.*

<br>

# Overview and Business Understanding

<br>

# Data Understanding

<br>

# Initial Exploratory Data Analysis

### Google Colab compatibility downloads

In [1]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
!pip install pyspark==3.0.0
!pip install -q findspark
import findspark
findspark.init()


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import libraries, packages and modules

In [3]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, IntegerType, DoubleType, TimestampType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

import matplotlib.pyplot as plt
plt.style.use('seaborn')
import seaborn as sns
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('retina')
from IPython.display import HTML, display
%matplotlib inline

In [4]:
# Check colab GPU info

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)

Not connected to a GPU


In [5]:
# Set text to wrap in Google colab notebook

def set_css():
    display(HTML('''
    <style>
    pre {
        white-space: pre-wrap;
    }
    </style>
    '''))
get_ipython().events.register('pre_run_cell', set_css)

In [6]:
# Initialize Spark Session

# spark = SparkSession.builder.master('local[*]').getOrCreate()
spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

### Description of Features

**Dataset 1 – Transactions:**

`MessageId` - Globally unique identifier within this dataset for individual transactions<br>
`UETR` - The Unique End-to-end Transaction Reference—a 36-character string enabling traceability of all individual transactions associated with a single end-to-end transaction<br>
`TransactionReference` - Unique identifier for an individual transaction<br>
`Timestamp` - Time at which the individual transaction was initiated<br>
`Sender` - Institution (bank) initiating/sending the individual transaction<br>
`Receiver` - Institution (bank) receiving the individual transaction<br>
`OrderingAccount` - Account identifier for the originating ordering entity (individual or organization) for end-to-end transaction<br>
`OrderingName` - Name for the originating ordering entity<br>
`OrderingStreet` - Street address for the originating ordering entity<br>
`OrderingCountryCityZip` - Remaining address details for the originating ordering entity<br>
`BeneficiaryAccount` - Account identifier for the final beneficiary entity (individual or organization) for end-to-end transaction<br>
`BeneficiaryName` - Name for the final beneficiary entity<br>
`BeneficiaryStreet` - Street address for the final beneficiary entity<br>
`BeneficiaryCountryCityZip` - Remaining address details for the final beneficiary entity<br>
`SettlementDate` - Date the individual transaction was settled<br>
`SettlementCurrency` - Currency used for transaction<br>
`SettlementAmount` - Value of the transaction net of fees/transfer charges/forex<br>
`InstructedCurrency` - Currency of the individual transaction as instructed to be paid by the Sender<br>
`InstructedAmount` - Value of the individual transaction as instructed to be paid by the Sender<br>
`Label` - Boolean indicator of whether the transaction is anomalous or not. This is the target variable for the prediction task.<br>
<br>
**Dataset 2 – Banks:**

`Bank` - Identifier for the bank<br>
`Account` - Identifier for the account<br>
`Name` - Name of the account<br>
`Street` - Street address associated with the account<br>
`CountryCityZip` - Remaining address details associated with the account<br>
`Flags` - Enumerated data type indicating potential issues or special features that have been associated with an account. Flag definitions are below:<br>
00 - No flags<br>
01 - Account closed<br>
03 - Account recently opened<br>
04 - Name mismatch<br>
05 - Account under monitoring<br>
06 - Account suspended<br>
07 - Account frozen<br>
08 - Non-transaction account<br>
09 - Beneficiary deceased<br>
10 - Invalid company ID<br>
11 - Invalid individual ID<br>
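
For later preprocessing, the flag codes above could be kept in a small lookup table. This is only a sketch: the names `FLAG_DESCRIPTIONS` and `describe_flag` are hypothetical and are not part of the dataset or of this project's code. Code `02` does not appear in the list above, so it is left out here as well.

```python
# Hypothetical lookup table built from the flag definitions listed above
FLAG_DESCRIPTIONS = {
    "00": "No flags",
    "01": "Account closed",
    "03": "Account recently opened",
    "04": "Name mismatch",
    "05": "Account under monitoring",
    "06": "Account suspended",
    "07": "Account frozen",
    "08": "Non-transaction account",
    "09": "Beneficiary deceased",
    "10": "Invalid company ID",
    "11": "Invalid individual ID",
}


def describe_flag(code):
    """Return the description for a flag code given as an int or a string."""
    # Assumption: codes may lose their leading zero if read as integers,
    # so pad back to two characters before the lookup.
    key = str(code).zfill(2)
    return FLAG_DESCRIPTIONS.get(key, "Unknown flag")


print(describe_flag(4))
print(describe_flag("11"))
```

Returning `"Unknown flag"` for unlisted codes, rather than raising an error, keeps the helper safe to apply across a whole column.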

### Read in Data

In [7]:
# Read in transactions data csv file to a Spark DataFrame
transactions_df = spark.read.csv('/content/drive/MyDrive/Colab Notebooks/transaction_dataset.csv', header=True, inferSchema=True)

# Read in banks data csv file to a Spark DataFrame
# banks_df = spark.read.csv('/content/drive/MyDrive/Colab Notebooks/bank_dataset.csv', header=True, inferSchema=True)

### Initial EDA

In [8]:
# Print shape of dataframes
print(f"transactions_df:  {transactions_df.count():,} Rows, {len(transactions_df.columns)} Columns")
# print(f"banks_df:  {banks_df.count():,} Rows, {len(banks_df.columns)} Columns")

transactions_df:  4,691,725 Rows, 20 Columns


In [9]:
transactions_df.printSchema()

root
 |-- MessageId: string (nullable = true)
 |-- Timestamp: string (nullable = true)
 |-- UETR: string (nullable = true)
 |-- Sender: string (nullable = true)
 |-- Receiver: string (nullable = true)
 |-- TransactionReference: string (nullable = true)
 |-- OrderingAccount: string (nullable = true)
 |-- OrderingName: string (nullable = true)
 |-- OrderingStreet: string (nullable = true)
 |-- OrderingCountryCityZip: string (nullable = true)
 |-- BeneficiaryAccount: string (nullable = true)
 |-- BeneficiaryName: string (nullable = true)
 |-- BeneficiaryStreet: string (nullable = true)
 |-- BeneficiaryCountryCityZip: string (nullable = true)
 |-- SettlementDate: integer (nullable = true)
 |-- SettlementCurrency: string (nullable = true)
 |-- SettlementAmount: double (nullable = true)
 |-- InstructedCurrency: string (nullable = true)
 |-- InstructedAmount: double (nullable = true)
 |-- Label: integer (nullable = true)
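
The schema shows `SettlementDate` inferred as an integer rather than a date. Below is a minimal plain-Python sketch of how such a value could be parsed, assuming the integer encodes the date as `yyMMdd` (for example `220101` for 1 January 2022). That layout is an assumption, not something the dataset description confirms, and the helper name `parse_settlement_date` is hypothetical.

```python
from datetime import date, datetime


def parse_settlement_date(value) -> date:
    """Parse an integer such as 220101 into a date, assuming a yyMMdd layout."""
    # zfill restores a leading zero that would be lost when the value is read as an integer
    text = str(value).zfill(6)
    return datetime.strptime(text, "%y%m%d").date()


print(parse_settlement_date(220101))
```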



In [None]:
# banks_df.printSchema()

In [10]:
# Print first row of transactions dataframe
transactions_df.show(n=1, vertical=True, truncate=False)

-RECORD 0---------------------------------------------------------
 MessageId                 | TRA7CGN3FF                           
 Timestamp                 | 2022-01-01 00:00:00                  
 UETR                      | f474fdb3-4675-4fff-ab7e-3469f82bd6a7 
 Sender                    | DPSUFRPP                             
 Receiver                  | ABVVUS6S                             
 TransactionReference      | PETX22-FXIDA-7054                    
 OrderingAccount           | FR90714755422956984353               
 OrderingName              | PHACELIA HETEROPHYLLA                
 OrderingStreet            | 3| RUE HAMON                         
 OrderingCountryCityZip    | FR/42859 SAINTE AURÉLIE              
 BeneficiaryAccount        | 611024064274704358                   
 BeneficiaryName           | PAPAVER CALIFORNICUM                 
 BeneficiaryStreet         | 2584 CHARLES PLACE                   
 BeneficiaryCountryCityZip | US/ROJASLAND| DC 58442           

In [None]:
# Print first 5 rows of banks dataframe
# banks_df.show(n=5, truncate=False)

In [None]:
# Print number of null/missing values in each column of transactions_df
transactions_df_null = transactions_df.select([F.count(F.when(F.col(c).contains('None') | \
                                                              F.col(c).contains('NULL') | \
                                                              (F.col(c) == '' ) | \
                                                              F.col(c).isNull() | \
                                                              F.isnan(c), c ))\
                                               .alias(c) for c in transactions_df.columns])

print('Number of null/missing values per column:\n')
transactions_df_null.show(vertical=True, truncate=False)
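
The condition in the cell above treats several different things as missing: true nulls, NaN, empty strings, and any value containing the text `None` or `NULL`. The plain-Python sketch below mirrors that per-value rule for illustration only; the function name `looks_missing` and the sample values are hypothetical. Because the check is a substring match, a legitimate value that happens to contain `NULL` would also be counted.

```python
import math


def looks_missing(value):
    """Illustrative mirror of the per-value rule used in the Spark null count."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    text = str(value)
    # Case-sensitive substring match, as with Column.contains:
    # 'None' or 'NULL' anywhere in the value counts as missing.
    return text == "" or "None" in text or "NULL" in text


# Hypothetical sample values, including one false positive ('ANNULLED LTD')
samples = ["", None, float("nan"), "NULL", "ANNULLED LTD", "DPSUFRPP", 0.0]
print([looks_missing(v) for v in samples])
```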

In [None]:
# Print number of null/missing values in each column of banks_df
# banks_df_null = banks_df.select([F.count(F.when(F.col(c).contains('None') | \
#                                                 F.col(c).contains('NULL') | \
#                                                 (F.col(c) == '' ) | \
#                                                 F.col(c).isNull() | \
#                                                 F.isnan(c), c ))\
#                                  .alias(c) for c in banks_df.columns])

# print('Number of null/missing values per column:\n')
# banks_df_null.show(vertical=True, truncate=False)

In [None]:
# Print number of unique values in each column in transactions_df; sample 10% of df for efficiency
transactions_df_unique = transactions_df.sample(False, 0.1).agg(*(F.countDistinct(F.col(c)) for c in transactions_df.columns))

print(f"Number of unique values per column (in sample of 10% of dataframe):\n")
transactions_df_unique.show(vertical=True, truncate=False)

In [None]:
# Print number of unique values in each column in banks_df; sample 10% of df for efficiency
# banks_df_unique = banks_df.sample(False, 0.1).agg(*(F.countDistinct(F.col(c)) for c in banks_df.columns))

# print(f"Number of unique values per column (in sample of 10% of dataframe):\n")
# banks_df_unique.show(vertical=True, truncate=False)

In [11]:
# Show value counts for 'Label' column (classification target) in transactions_df
class_counts = transactions_df.groupBy('Label').count().withColumn('percent', F.col('count')/transactions_df.count())

class_counts.show(truncate=10)

+-----+-------+----------+
|Label|  count|   percent|
+-----+-------+----------+
|    1|   4900|0.00104...|
|    0|4686825|0.99895...|
+-----+-------+----------+



**Remarks:**
- The dataset is extremely imbalanced: only 4,900 of the 4,691,725 transactions (about 0.1%) belong to the positive (anomalous) class. This class imbalance will need to be addressed as part of the modeling process.
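
One common way to handle this kind of imbalance is to weight each class inversely to its frequency. This is just one option and not necessarily the approach taken later in this project. The sketch below computes such weights from the counts shown above, using `n_samples / (n_classes * class_count)`; the variable names are illustrative.

```python
# Class counts taken from the value-counts output above
counts = {0: 4_686_825, 1: 4_900}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weighting: rarer classes receive proportionally larger weights
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

positive_rate = counts[1] / n_samples
print(f"Positive class rate: {positive_rate:.4%}")
for label, weight in class_weights.items():
    print(f"Label {label}: weight {weight:.3f}")
```

With weights like these, the total weight of each class is equal, so a weighted loss no longer rewards predicting the majority class for every row.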

# Feature Engineering

Steps:
1. Join dataframes
2. Convert columns to their correct data types

In [None]:
# Specify join condition
# join_condition = (transactions_df.OrderingAccount == banks_df.Account) | (transactions_df.BeneficiaryAccount == banks_df.Account)

# Join dataframes
# preprocessed_df = transactions_df.join(banks_df, on=join_condition, how='left')

# Unpersist old dataframes from memory
# display(transactions_df.unpersist())
# display(banks_df.unpersist())
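
A note of caution on the commented-out join above: with an `OR` condition, a transaction whose ordering account and beneficiary account both appear in the banks table matches two bank rows, so a left join returns it twice. The plain-Python sketch below illustrates the effect. All data in it is made up, and `left_join_or` is a hypothetical helper that imitates the join logic rather than calling Spark.

```python
# Hypothetical toy data: two transactions and three bank accounts
transactions = [
    {"MessageId": "T1", "OrderingAccount": "A", "BeneficiaryAccount": "B"},
    {"MessageId": "T2", "OrderingAccount": "A", "BeneficiaryAccount": "Z"},
]
banks = [
    {"Account": "A", "Flags": "00"},
    {"Account": "B", "Flags": "05"},
    {"Account": "C", "Flags": "00"},
]


def left_join_or(transactions, banks):
    """Left join on OrderingAccount == Account OR BeneficiaryAccount == Account."""
    joined = []
    for t in transactions:
        matches = [b for b in banks
                   if b["Account"] in (t["OrderingAccount"], t["BeneficiaryAccount"])]
        if matches:
            # One output row per matching bank row
            joined.extend({**t, **b} for b in matches)
        else:
            # A left join keeps unmatched transactions with null bank columns
            joined.append({**t, "Account": None, "Flags": None})
    return joined


rows = left_join_or(transactions, banks)
print(len(transactions), "transactions ->", len(rows), "joined rows")
```

Joining the banks table twice, once per account column, is one possible way to keep a single row per transaction.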

In [None]:
# Print shape of joined dataframe
# print(f"{preprocessed_df.count():,} Rows, {len(preprocessed_df.columns)} Columns")

### Convert `Timestamp` column to TimestampType

In [12]:
# Convert 'Timestamp' column to TimestampType
transactions_df = transactions_df.withColumn('Timestamp', transactions_df.Timestamp.cast(TimestampType()))

# dtypes returns (column name, type) tuples, so check the type at index 1
assert transactions_df.select('Timestamp').dtypes[0][1] == 'timestamp'

## Feature Engineering

### Create `SenderHourFreq` feature: hour frequency for each sender

In [40]:
# Extract the hour from the timestamp with Spark's built-in function
# (avoids Python UDF overhead and returns null for null timestamps)
transactions_df = transactions_df.withColumn('Hour', F.hour(F.col('Timestamp')))

# Create DataFrame of unique senders
senders = transactions_df.select('Sender').distinct()

# Create column of unique senders concatenated with hours
transactions_df = transactions_df.withColumn('SenderHour', F.concat(F.col('Sender'), F.col('Hour').cast(StringType())))
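
The cell above builds the `SenderHour` key but stops short of the frequency itself. As a plain-Python illustration of what a `SenderHourFreq` feature could look like, the sketch below counts how many transactions share the same sender and hour. The notebook does not define the feature at this point, so that definition is an assumption, and the rows are made up.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: (Sender, Timestamp)
rows = [
    ("DPSUFRPP", datetime(2022, 1, 1, 0, 0)),
    ("DPSUFRPP", datetime(2022, 1, 1, 0, 30)),
    ("DPSUFRPP", datetime(2022, 1, 1, 9, 15)),
    ("ABVVUS6S", datetime(2022, 1, 1, 0, 5)),
]

# Same key as the Spark column: sender concatenated with the hour
keys = [f"{sender}{ts.hour}" for sender, ts in rows]

# Frequency of each sender-hour key across the data
sender_hour_freq = Counter(keys)

# Attach the frequency back to each row
features = [(key, sender_hour_freq[key]) for key in keys]
print(features)
```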

In [39]:
transactions_df.show(n=3, vertical=True)

-RECORD 0-----------------------------------------
 MessageId                 | TRA7CGN3FF           
 Timestamp                 | 2022-01-01 00:00:00  
 UETR                      | f474fdb3-4675-4ff... 
 Sender                    | DPSUFRPP             
 Receiver                  | ABVVUS6S             
 TransactionReference      | PETX22-FXIDA-7054    
 OrderingAccount           | FR907147554229569... 
 OrderingName              | PHACELIA HETEROPH... 
 OrderingStreet            | 3| RUE HAMON         
 OrderingCountryCityZip    | FR/42859 SAINTE A... 
 BeneficiaryAccount        | 611024064274704358   
 BeneficiaryName           | PAPAVER CALIFORNICUM 
 BeneficiaryStreet         | 2584 CHARLES PLACE   
 BeneficiaryCountryCityZip | US/ROJASLAND| DC ... 
 SettlementDate            | 220101               
 SettlementCurrency        | USD                  
 SettlementAmount          | 1.74631905316E9      
 InstructedCurrency        | EUR                  
 InstructedAmount          | 1.