<img src="https://github.com/rjpost20/Anomalous-Bank-Transactions-Detection-Project/blob/main/data/AdobeStock_319163865.jpeg?raw=true">
Image by <a href="https://stock.adobe.com/contributor/200768506/andsus?load_type=author&prev_url=detail" >AndSus</a> on Adobe Stock

# Phase 5 Project: *Detecting Anomalous Financial Transactions*

## Notebook 2: Modeling, Analysis and Results

### By Ryan Posternak

Flatiron School, Full-Time Live NYC<br>
Project Presentation Date: August 25th, 2022<br>
Instructor: Joseph Mata

<br>

# Imports and Reading in Data

### Google Colab compatibility downloads

In [3]:
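# Refresh package lists and install a headless Java 8 JDK (Spark runs on the JVM)
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack the Spark 3.3.0 binaries built for Hadoop 3
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
# Point the environment at the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
# Install PySpark and findspark, then let findspark locate the Spark installation
!pip install pyspark==3.3.0
!pip install -q findspark
import findspark
findspark.init()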
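
If the download or the Java install fails quietly, the environment variables set above can end up pointing at directories that do not exist. The following is a minimal sketch of a sanity check, not part of the original notebook; the helper name `missing_install_dirs` is hypothetical.

```python
import os

def missing_install_dirs(env=None, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables in `names` that are unset or point to a missing directory."""
    # Hypothetical helper: `env` defaults to the real process environment
    env = os.environ if env is None else env
    return [name for name in names
            if not env.get(name) or not os.path.isdir(env[name])]

problems = missing_install_dirs()
if problems:
    print("Unset or missing:", ", ".join(problems))
```

An empty result suggests that both paths exist, although it does not confirm that the installations themselves are complete.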


In [4]:
# Connect to Google drive
from google.colab import drive, files
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
from itertools import chain
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, IntegerType, DoubleType, TimestampType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

import matplotlib.pyplot as plt
plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in matplotlib 3.6 and later
import seaborn as sns
# import matplotlib_inline.backend_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('retina')
from IPython.display import HTML, display
%matplotlib inline

In [7]:
# Check Colab GPU info

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)

Not connected to a GPU


In [8]:
# Set text to wrap in Google Colab notebook

def set_css():
    display(HTML('''
    <style>
      pre {
          white-space: pre-wrap;
      }
    </style>
    '''))
get_ipython().events.register('pre_run_cell', set_css)

In [9]:
# Initialize Spark Session

# spark = SparkSession.builder.master('local[*]').getOrCreate()
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Colab")
    .config('spark.ui.port', '4050')
    .getOrCreate()
)

spark.version

'3.3.0'

In [10]:
# Read in resampled_df (training) and test_df_preprocessed (testing) data csv files to Spark DataFrames - Colab
resampled_df = spark.read.csv('/content/drive/MyDrive/Colab Notebooks/resampled_df.csv', header=True, inferSchema=True)
test_df_preprocessed = spark.read.csv('/content/drive/MyDrive/Colab Notebooks/test_df_preprocessed.csv', header=True, inferSchema=True)

In [11]:
# Print shape of dataframes
print(f"resampled_df:  {resampled_df.count()} Rows, {len(resampled_df.columns)} Columns")
print(f"test_df_preprocessed:  {test_df_preprocessed.count()} Rows, {len(test_df_preprocessed.columns)} Columns")

resampled_df:  454052 Rows, 10 Columns
test_df_preprocessed:  651584 Rows, 10 Columns


In [15]:
# Print schema of training dataframe
resampled_df.printSchema()

root
 |-- MessageId: string (nullable = true)
 |-- SettlementAmount: double (nullable = true)
 |-- InstructedCurrency: string (nullable = true)
 |-- InstructedAmount: double (nullable = true)
 |-- Label: integer (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- SenderHourFreq: integer (nullable = true)
 |-- SenderCurrencyFreq: integer (nullable = true)
 |-- SenderCurrencyAmtAvg: double (nullable = true)
 |-- SenderReceiverFreq: integer (nullable = true)



In [16]:
# Print schema of test dataframe
test_df_preprocessed.printSchema()

root
 |-- MessageId: string (nullable = true)
 |-- SettlementAmount: double (nullable = true)
 |-- InstructedCurrency: string (nullable = true)
 |-- InstructedAmount: double (nullable = true)
 |-- Label: integer (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- SenderHourFreq: integer (nullable = true)
 |-- SenderCurrencyFreq: integer (nullable = true)
 |-- SenderCurrencyAmtAvg: double (nullable = true)
 |-- SenderReceiverFreq: integer (nullable = true)
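
The two schemas printed above match on inspection. As a sketch of how that comparison could be automated, the hypothetical helper below (`schema_mismatches`, not part of the original notebook) compares the `(column, type)` pairs that a Spark DataFrame exposes through its `dtypes` attribute. It is written in plain Python so it can be tried without a Spark session.

```python
def schema_mismatches(train_dtypes, test_dtypes):
    """Compare two lists of (column, type) pairs, as returned by DataFrame.dtypes.

    Returns a dict mapping each mismatched column to (train_type, test_type);
    a column missing from one side is reported with None on that side.
    """
    train, test = dict(train_dtypes), dict(test_dtypes)
    names = list(train) + [c for c in test if c not in train]
    return {c: (train.get(c), test.get(c))
            for c in names if train.get(c) != test.get(c)}

# Usage in the notebook (assumes the DataFrames loaded above):
# schema_mismatches(resampled_df.dtypes, test_df_preprocessed.dtypes)

# Stand-alone illustration with a subset of the columns shown above
train_dtypes = [("MessageId", "string"), ("SettlementAmount", "double"), ("Label", "int")]
test_dtypes = [("MessageId", "string"), ("SettlementAmount", "double"), ("Label", "int")]
print(schema_mismatches(train_dtypes, test_dtypes))
```

An empty dict means that names and types agree. The helper ignores column order and nullability, which would need a separate check if they matter downstream.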



In [19]:
# Display first 5 rows of resampled_df dataframe
resampled_df.drop('MessageId').show(n=5, truncate=False)

+----------------+------------------+----------------+-----+----+--------------+------------------+--------------------+------------------+
|SettlementAmount|InstructedCurrency|InstructedAmount|Label|Hour|SenderHourFreq|SenderCurrencyFreq|SenderCurrencyAmtAvg|SenderReceiverFreq|
+----------------+------------------+----------------+-----+----+--------------+------------------+--------------------+------------------+
|2.28982271419E9 |EUR               |2.04576315036E9 |0    |11  |536611        |1402533           |1.8253516160989076E8|1629189           |
|2.18142740428E9 |EUR               |1.94892109737E9 |0    |12  |70646         |1402533           |1.8253516160989076E8|1629189           |
|5407563.54      |EUR               |4831201.23      |0    |9   |50129         |1402533           |1.8253516160989076E8|1629189           |
|3273391.59      |GBP               |2630693.23      |0    |11  |536611        |218987            |3691762.8543079915  |1629189           |
|375812.89       |US

In [20]:
# Display first 5 rows of test_df_preprocessed dataframe
test_df_preprocessed.drop('MessageId').show(n=5, truncate=False)

+----------------+------------------+----------------+-----+----+--------------+------------------+--------------------+------------------+
|SettlementAmount|InstructedCurrency|InstructedAmount|Label|Hour|SenderHourFreq|SenderCurrencyFreq|SenderCurrencyAmtAvg|SenderReceiverFreq|
+----------------+------------------+----------------+-----+----+--------------+------------------+--------------------+------------------+
|1585724.74      |JPY               |1.61477062E8    |0    |22  |3455          |20944             |9.956381033034086E11|4362              |
|2862743.17      |GBP               |2862743.17      |0    |13  |30133         |529744            |1674064.4259055888  |261994            |
|5139383.79      |GBP               |4130341.39      |0    |16  |45272         |218987            |3691762.8543079915  |1629189           |
|6107360.55      |EUR               |5456410.75      |0    |11  |536611        |1402533           |1.8253516160989076E8|1629189           |
|3365712.14      |GB

In [17]:
# Print number of null/missing values in each column of resampled_df
resampled_df_null = resampled_df.select([F.count(F.when(F.col(c).isNull() | F.isnan(c), c))\
                                         .alias(c) for c in resampled_df.columns])

print('Number of null/missing values per column:\n')
resampled_df_null.show(vertical=True, truncate=False)

Number of null/missing values per column:

-RECORD 0-------------------
 MessageId            | 0   
 SettlementAmount     | 0   
 InstructedCurrency   | 0   
 InstructedAmount     | 0   
 Label                | 0   
 Hour                 | 0   
 SenderHourFreq       | 0   
 SenderCurrencyFreq   | 0   
 SenderCurrencyAmtAvg | 0   
 SenderReceiverFreq   | 0   



In [18]:
# Print number of null/missing values in each column of test_df_preprocessed
test_df_preprocessed_null = test_df_preprocessed.select([F.count(F.when(F.col(c).isNull() | F.isnan(c), c))\
                                                         .alias(c) for c in test_df_preprocessed.columns])

print('Number of null/missing values per column:\n')
test_df_preprocessed_null.show(vertical=True, truncate=False)

Number of null/missing values per column:

-RECORD 0-------------------
 MessageId            | 0   
 SettlementAmount     | 0   
 InstructedCurrency   | 0   
 InstructedAmount     | 0   
 Label                | 0   
 Hour                 | 0   
 SenderHourFreq       | 0   
 SenderCurrencyFreq   | 0   
 SenderCurrencyAmtAvg | 0   
 SenderReceiverFreq   | 0   
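
The two null checks above repeat the same expression and apply `F.isnan` to every column, including string columns such as `InstructedCurrency`. NaN exists only for floating-point values, so one possible refactoring is a shared helper that applies `isnan` to `float` and `double` columns only. This is a sketch under that assumption; the names `nan_capable_columns` and `count_missing` are hypothetical.

```python
def nan_capable_columns(dtypes):
    """Return names of columns whose Spark type can hold NaN (float or double)."""
    return [name for name, dtype in dtypes if dtype in ("float", "double")]

def count_missing(df):
    """Return a one-row DataFrame with the null/NaN count of each column of `df`."""
    # Imported lazily so the pure-Python helper above works without Spark installed
    import pyspark.sql.functions as F

    nan_cols = set(nan_capable_columns(df.dtypes))
    exprs = []
    for c in df.columns:
        cond = F.col(c).isNull()
        if c in nan_cols:
            # Only floating-point columns get the extra NaN test
            cond = cond | F.isnan(F.col(c))
        exprs.append(F.count(F.when(cond, c)).alias(c))
    return df.select(exprs)

# Usage in the notebook (assumes the DataFrames loaded above):
# count_missing(resampled_df).show(vertical=True, truncate=False)
# count_missing(test_df_preprocessed).show(vertical=True, truncate=False)
```

For the data shown here, the counts should match the outputs above, because none of the columns contain null or NaN values.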

