## Run Data Pipeline in Jupyter

The dataset has already been imported into a PostgreSQL database, so this section skips the data processing implementation details; see the “Dataset Acquisition” section above. In terms of data volume, the file credit_card_transactions-ibm_v2.csv is 2.35 GB, which is too large to use in full for machine learning, so the pipeline works on a two-year subset.

We queried the database table transactions to understand the data distribution. The query found 3,281 fraud transactions in 2015 and 3,579 in 2016, out of 3,410,295 records across both years. This volume is sufficient to build the data model.
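A yearly count like the one described above can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for PostgreSQL, the rows are made-up toy data, and only the column names `transaction_datetime` and `fraud_detected` are taken from the query used later in this section.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_datetime TEXT, fraud_detected INTEGER)"
)

# Toy rows, not the real dataset
rows = [
    ("2015-03-01 10:00:00", 0),
    ("2015-07-15 23:10:00", 1),
    ("2016-01-02 08:30:00", 0),
    ("2016-05-20 14:45:00", 1),
    ("2016-11-11 02:05:00", 1),
    ("2017-02-01 09:00:00", 1),  # outside the 2015-2016 window
]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Count all records and fraud records per year within the two-year window
query = """
SELECT strftime('%Y', transaction_datetime) AS tx_year,
       COUNT(*)                             AS total_count,
       SUM(fraud_detected)                  AS fraud_count
FROM transactions
WHERE transaction_datetime >= '2015-01-01'
  AND transaction_datetime <  '2017-01-01'
GROUP BY tx_year
ORDER BY tx_year
"""
yearly_counts = conn.execute(query).fetchall()
conn.close()

for tx_year, total_count, fraud_count in yearly_counts:
    print(tx_year, total_count, fraud_count)
```

In PostgreSQL the year would typically be extracted with a function such as `EXTRACT` or `date_part` instead of SQLite's `strftime`.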

In [61]:
# Set download_data to true if you want to download the data from database
download_data = False

In [62]:
if download_data:
    import pandas as pd
    from sqlalchemy import create_engine
    import os
    import fastparquet

    # Define your PostgreSQL database credentials
    db_username = 'postgres'
    db_password = 'postgres'
    db_host = 'localhost'
    db_port = '5432'
    db_name = 'postgres'

    # Create the connection string
    connection_string = f'postgresql://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}'

    # Create the database engine
    engine = create_engine(connection_string)

    # Define your SQL query
    query = """
    SELECT
        c.id as customer_id
      , c.address
      , c.birth_month
      , c.birth_year
      , c.credit_card_count
      , c.credit_score
      , c.current_age
      , c.email
      , c.first_name
      , c.gender
      , c.last_name
      , c.latitude
      , c.longitude
      , c.per_capita_income
      , c.retirement_age
      , c.total_debt
      , c.yearly_income
      , cd.id as card_id 
      , cd.account_open_date
      , cd.card_brand
      , cd.card_expiration_date
      , cd.card_index
      , cd.card_number
      , cd.card_on_dark_web
      , cd.card_type
      , cd.credit_limit
      , cd.cvv_code
      , cd.has_chip
      , cd.number_cards_issued
      , cd.pin_last_changed_year
      , tx.id as transaction_id
      , tx.fraud_detected
      , tx.merchant_city
      , tx.merchant_id
      , tx.merchant_mcc_code
      , tx.merchant_state
      , tx.merchant_zip
      , tx.transaction_amount
      , tx.transaction_datetime
      , tx.transaction_error
      , tx.transaction_type
    FROM  
        transactions tx
      , cards cd
      , customers c
    WHERE
          tx.transaction_datetime >= '2015-01-01'
      AND tx.transaction_datetime < '2017-01-01'
      AND tx.customer_id = cd.customer_id
      AND tx.card_index = cd.card_index
      AND cd.customer_id = c.id
    """
    if not os.path.exists('data'):
        os.makedirs('data', exist_ok=True)
        
    if not os.path.exists('data/transactions.parquet'):
        # Read the SQL query into a pandas DataFrame
        df = pd.read_sql_query(query, engine)
        # Cast the customer ID to a 64-bit integer
        df['customer_id'] = df['customer_id'].astype('int64')
        # Convert the date columns to datetime
        df['account_open_date'] = pd.to_datetime(df['account_open_date'])
        df['card_expiration_date'] = pd.to_datetime(df['card_expiration_date'])
        # fastparquet.write('data/transactions.parquet', df)
        df.to_parquet('data/transactions.parquet', engine='fastparquet')
    else:
        df = pd.read_parquet('data/transactions.parquet', engine='fastparquet')

    display(df)

## Data Profiling

Data profiling is a technique for observing the data distribution of each feature. It gives an overview of the data by reporting minimum and maximum values, the mean, null or infinite values, and counts. This helps identify outliers, which can then be handled through data cleansing. Data profiling is an important starting point for further analysis and for understanding what cleaning the dataset needs before moving on to tasks such as building machine learning models.

Exploratory Data Analysis (EDA) takes its name from a ground-breaking book in the field of data analysis. The book introduces and explains the principles of exploratory data analysis, which involves analyzing datasets to summarize their main characteristics using statistical graphics and other visualization methods (Tukey, 1977).

The EDA framework helps describe a dataset's features, expose its underlying structure, extract important variables, identify anomalies and outliers, and test underlying assumptions. Here are some problems that an EDA report may reveal:

- **Missing Values** - EDA reveals columns with missing values. Depending on the proportion missing, these values are either imputed or removed.
- **Outliers** - EDA helps detect outliers, observations that differ greatly from the rest. These unusual values may be genuine or erroneous.
- **Distribution of Data** - EDA shows how each feature is distributed. Skewed data may cause some machine learning algorithms to perform worse than expected.
- **Correlation** - EDA identifies correlated features. Highly correlated features cause multicollinearity in linear regression models.
- **Constant Features** - EDA identifies constant features, which carry no useful information and should be removed.
- **Categorical Variables** - EDA shows how many categorical variables exist and what their distinct categories are. Categories with very low counts may need special treatment.
- **Feature Magnitude** - EDA shows whether features are measured on different scales. Scale-sensitive machine learning algorithms need features on a uniform scale.

The YData Profiling API was adopted to provide a quick EDA of the preliminary features and labels in the dataset before any further ML tasks are undertaken. The detailed EDA report can be found at <project_root>/project/mlops_pipeline/data/ydata_reports/initial_eda_report.html.

**Remarks**: The following snippet may take around 20 minutes to run.

In [63]:
import numpy as np
from ydata_profiling import ProfileReport

import polars as pl
import os

df = pl.read_parquet('data/transactions.parquet')

# Build the profiling report with all correlation measures enabled
initial_eda_report = ProfileReport(
    df.to_pandas(),
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": True},
        "phi_k": {"calculate": True},
    },
    interactions={
        "targets": ["fraud_detected"],
    },
)

# Create EDA report directory
if not os.path.exists("data/ydata_reports"):
    os.makedirs("data/ydata_reports", exist_ok=True)
        
# Save EDA Report
initial_eda_report.to_file("data/ydata_reports/initial_eda_report.html")


## Data Cleansing

The Data Cleansing stage handles missing values, outliers, and feature engineering. Missing values can be imputed or the affected instances removed, depending on the proportion of missing data and the specific requirements of the analysis or model.

To keep the data processing pipeline efficient and coherent, Exploratory Data Analysis (EDA) is merged with the data cleaning step. This makes it easier to spot missing values, outliers, and other red flags during clean-up. EDA thus ensures that the data is not only clean but also understood, which leads to more precise and dependable analyses and models, and that all relevant information about the dataset has been found before any decision on further pre-processing is made.

In [64]:
# import pandas as pd
import polars as pl

# Load the data from the parquet file downloaded in the previous step
df = pl.read_parquet('data/transactions.parquet')

### Constant Features

Within EDA, a sensible first step is to find constant features and remove them, because they carry no information and cannot contribute to predictive modeling. A “constant feature” is any feature whose value is the same across all records in the dataset.

The EDA report identified the following column as a constant feature:

- **card_on_dark_web** - has False value only

### Missing Values

Columns with missing values can be spotted during EDA. Depending on the percentage of values missing, they are either dropped or filled in. In this dataset, "merchant_state" has 428339 missing values (12.6%), "merchant_zip" has 449930 missing values (13.2%), and "transaction_error" has 3355771 missing values (98.4% of the total rows). For “transaction_error”, a missing value means that no error occurred, so filling it with ‘No Error’ is sufficient. For "merchant_state" and "merchant_zip", given the project timeline, transactions with missing values are excluded at this phase.

### Unique Values

A feature whose values are all unique offers no repeatable pattern for the model to learn. Unique values can also hint at outliers, values that fall well outside the expected range and can hurt model performance. The EDA report flagged the feature “transaction_id” as unique. This feature is system generated and has no significance for predicting the target.

### Zero Values

The EDA report flagged zero values in "total_debt" and "card_index". These alerts are ignored. A debt of zero dollars makes sense for card holders with a healthy record, while card_index is an identifier with no meaning for the target. Handling of "card_index" is discussed later.

### PII Data

To respect data privacy, all PII data will be removed from the dataset, including “first_name”, “last_name”, “address”, “email”, and “birth_month”.

### High Correlation

Highly correlated features duplicate information, so removing them reduces the dataset size. After reviewing the high correlation alerts in the EDA report, the attributes below will be removed, together with system-generated columns that have no relevance to the prediction:

- account_open_date
- card_id
- card_index
- card_number
- credit_limit
- customer_id
- current_age
- cvv_code
- email
- has_chip
- merchant_id
- merchant_state
- merchant_zip
- per_capita_income
- yearly_income


### DateTime Features

Time is a crucial signal in identifying fraud: it reveals the trends and sequence of fraudulent activity, supports real-time risk evaluation, and makes the model more effective. Fraudsters often operate on schedules, so transactions at unusual times or a burst of transactions within a short period can be suspicious. Time also helps establish a client's normal behaviour and deviations from it, which aids fraud detection while reducing false alarms. Temporal context matters as well in real-time fraud prevention systems, which must take into account the time of day, the region, and the time of year, all of which affect genuine transactions. Accurately captured time variables improve the overall performance of fraud detection models.

### Categorical Variables

It is important to identify and analyze categorical variables as they can greatly impact one’s analysis and predictive models. Some categorical variables may need special treatment like encoding, grouping or even excluding them altogether from the analysis if they contain too many categories (high cardinality) or very few observations (low frequency).

Weighted Target Encoding is an encoding scheme that combines global and local statistics to improve the encoding of categorical features. The overall mean (for instance, the overall fraud rate) and the per-category mean are blended using a smoothing weight driven by category frequency. The encoded value is computed as (weight * category_mean) + ((1 - weight) * global_mean), where weight = n / (n + min_samples) and n is the number of records in the category.
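The effect of the smoothing weight is easiest to see with numbers. The sketch below applies the formula to two hypothetical categories that share the same raw fraud rate but differ in size; the function name `weighted_target_encode` and all figures are illustrative assumptions, not values from the dataset.

```python
def weighted_target_encode(category_count, category_mean, global_mean, min_samples=100):
    """Blend a category's fraud rate with the global rate.

    weight = n / (n + min_samples), as in the formula above.
    """
    weight = category_count / (category_count + min_samples)
    return weight * category_mean + (1 - weight) * global_mean


global_fraud_rate = 0.002  # hypothetical overall fraud rate

# Rare category: 10 transactions with a raw fraud rate of 10%
rare = weighted_target_encode(10, 0.10, global_fraud_rate)

# Frequent category: 10,000 transactions with the same raw fraud rate
frequent = weighted_target_encode(10_000, 0.10, global_fraud_rate)

print(f"rare category encoded value:     {rare:.6f}")
print(f"frequent category encoded value: {frequent:.6f}")
```

The rare category is pulled strongly towards the global rate because ten observations are weak evidence, while the frequent category keeps a value close to its own fraud rate.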

This approach is particularly relevant to fraud detection because it can:

- mitigate data sparsity by taking into account both infrequent and frequent fraud patterns
- increase the robustness of the encodings by incorporating both global and local metrics
- work well with non-homogeneous category distributions in the transactional data
- help improve the signal-to-noise ratio
- preserve the importance of the encoded feature

Moreover, the method's main benefit over basic target encoding is its adaptive smoothing of rare categories, which makes it appropriate for fraud detection, where rare patterns deviate from the norm. The smoothing parameter can be tuned to favour either the global trend or the category-specific fraud rates.

Weighted Target Encoding ensures that every categorical variable takes a numerical value, which the subsequent steps of the MLOps pipeline require. It is applied to:
- gender
- card_brand
- card_type
- merchant_city
- merchant_mcc_code
- transaction_error
- transaction_type

### Target Encoding


Finally, the fraud_detected column in encoded_df is mapped to 0 and 1 so that the target column is numeric.

### Duplicate Rows

Duplicate rows are dropped; doing so has no significant impact on the final prediction.

In [65]:
import polars as pl

# Drop constant features
cleaned_df = df.drop("card_on_dark_web")

# Fix missing values
cleaned_df = cleaned_df.with_columns(
    pl.col("transaction_error").fill_null("No Error")
)

# Drop rows with missing values
cleaned_df = cleaned_df.drop_nulls()

# Drop specified columns
cleaned_df = cleaned_df.drop([
    "transaction_id",
    "first_name", "last_name", "email", "address", "birth_month",
    "account_open_date", "card_id", "card_index", "card_number",
    "credit_limit", "customer_id", "current_age", "cvv_code",
    "has_chip", "merchant_id", "merchant_state", "merchant_zip",
    "per_capita_income", "yearly_income"
])

# Extract datetime components
cleaned_df = cleaned_df.with_columns([
    pl.col("card_expiration_date").dt.year().alias("card_expiration_year"),
    pl.col("card_expiration_date").dt.month().alias("card_expiration_month"),
    pl.col("transaction_datetime").dt.year().alias("transaction_year"),
    pl.col("transaction_datetime").dt.month().alias("transaction_month"),
    pl.col("transaction_datetime").dt.day().alias("transaction_day"),
    pl.col("transaction_datetime").dt.hour().alias("transaction_hour")
])

# Drop original datetime columns
cleaned_df = cleaned_df.drop(["card_expiration_date", "transaction_datetime"])

def weighted_target_encoding(df, categorical_columns, target_col='fraud_detected', min_samples=100):
    df_encoded = df.clone()
    global_mean = df.select(pl.col(target_col)).mean().item()
    
    for col in categorical_columns:
        # Calculate category-level statistics
        category_stats = (
            df.group_by(col)  
            .agg([
                pl.col(target_col).count().alias("count"),
                pl.col(target_col).mean().alias("mean")
            ])
        )
        
        # Calculate weights and encoded values
        category_stats = category_stats.with_columns([
            (pl.col("count") / (pl.col("count") + min_samples)).alias("weight")
        ]).with_columns([
            (pl.col("weight") * pl.col("mean") + 
             (1 - pl.col("weight")) * global_mean).alias("encoded_value")
        ])
        
        # Join encoded values back to original dataframe
        df_encoded = df_encoded.join(
            category_stats.select([pl.col(col), pl.col("encoded_value")]),
            on=col,
            how="left"
        ).with_columns([
            pl.col("encoded_value").alias(f"{col}_encoded")
        ]).drop("encoded_value")
    
    return df_encoded

# Apply encoding
categorical_columns = [
    'gender', 'card_brand', 'card_type', 'merchant_city',
    'merchant_mcc_code', 'transaction_error', 'transaction_type'
]

encoded_df = weighted_target_encoding(
    df=cleaned_df,
    categorical_columns=categorical_columns,
    min_samples=100
)

# Convert fraud_detected to numeric
encoded_df = encoded_df.with_columns([
    pl.col("fraud_detected").fill_null(False).cast(pl.Int8)
])

# Drop duplicates
encoded_df = encoded_df.unique()

These encoding mappings should be persisted as JSON files for later database upload, and the encoded DataFrame should be persisted as a checkpoint.

In [66]:
if not os.path.exists('../payment-solution/model'):
    os.makedirs('../payment-solution/model', exist_ok=True)
    
# Save gender encoding mapping
gender_encoded_df = encoded_df.select(["gender", "gender_encoded"]).unique()
gender_encoded_df.write_json('../payment-solution/model/gender_encoded.json')

# Save card_brand encoding mapping
card_brand_encoded_df = encoded_df.select(["card_brand", "card_brand_encoded"]).unique()
card_brand_encoded_df.write_json('../payment-solution/model/card_brand_encoded.json')

# Save card_type encoding mapping
card_type_encoded_df = encoded_df.select(["card_type", "card_type_encoded"]).unique()
card_type_encoded_df.write_json('../payment-solution/model/card_type_encoded.json')

# Save merchant_city encoding mapping
merchant_city_encoded_df = encoded_df.select(["merchant_city", "merchant_city_encoded"]).unique()
merchant_city_encoded_df.write_json('../payment-solution/model/merchant_city_encoded.json')

# Save merchant_mcc_code encoding mapping
merchant_mcc_code_encoded_df = encoded_df.select(["merchant_mcc_code", "merchant_mcc_code_encoded"]).unique()
merchant_mcc_code_encoded_df.write_json('../payment-solution/model/merchant_mcc_code_encoded.json')

# Save transaction_error encoding mapping
transaction_error_encoded_df = encoded_df.select(["transaction_error", "transaction_error_encoded"]).unique()
transaction_error_encoded_df.write_json('../payment-solution/model/transaction_error_encoded.json')

# Save transaction_type encoding mapping
transaction_type_encoded_df = encoded_df.select(["transaction_type", "transaction_type_encoded"]).unique()
transaction_type_encoded_df.write_json('../payment-solution/model/transaction_type_encoded.json')

# Drop original categorical columns
encoded_df = encoded_df.drop(categorical_columns)

In [67]:
encoded_df.write_parquet('data/encoded_transactions.parquet')

### Feature Magnitude

Feature magnitude issues arise when features in a dataset have different scales. This can degrade the performance of scale-sensitive algorithms. Standardization was applied to handle such cases before using the data for further analysis or model training.
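Standardization replaces each value x with (x - mean) / std. The sketch below shows the arithmetic on two made-up features with very different scales, using only the standard library. The population standard deviation is used because that is what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]


total_debt = [0.0, 50_000.0, 100_000.0, 450_000.0]  # toy values
card_count = [1.0, 2.0, 3.0, 6.0]                    # toy values

debt_z = standardize(total_debt)
count_z = standardize(card_count)

print([round(z, 3) for z in debt_z])
print([round(z, 3) for z in count_z])
```

After the transformation both features have mean 0 and standard deviation 1, so neither dominates purely because of its units.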

In [68]:
import polars as pl

# Read the parquet file
encoded_df = pl.read_parquet('data/encoded_transactions.parquet')

display(encoded_df.collect_schema())

Schema([('birth_year', Int64),
        ('credit_card_count', Int64),
        ('credit_score', Int64),
        ('latitude', Float64),
        ('longitude', Float64),
        ('retirement_age', Int64),
        ('total_debt', Float64),
        ('number_cards_issued', Int64),
        ('pin_last_changed_year', Int64),
        ('fraud_detected', Int8),
        ('transaction_amount', Float64),
        ('card_expiration_year', Int32),
        ('card_expiration_month', Int8),
        ('transaction_year', Int32),
        ('transaction_month', Int8),
        ('transaction_day', Int8),
        ('transaction_hour', Int8),
        ('gender_encoded', Float64),
        ('card_brand_encoded', Float64),
        ('card_type_encoded', Float64),
        ('merchant_city_encoded', Float64),
        ('merchant_mcc_code_encoded', Float64),
        ('transaction_error_encoded', Float64),
        ('transaction_type_encoded', Float64)])

In [69]:
def missing_values(df: pl.DataFrame):
    null_counts = df.null_count()
    
    # Print each column's missing values on a separate line
    print("Missing values in each column:")
    for column in df.columns:
        count = null_counts.get_column(column)[0]
        print(f"{column}: {count}")
  
# Get and print missing values count
missing_values(encoded_df)

Missing values in each column:
birth_year: 0
credit_card_count: 0
credit_score: 0
latitude: 0
longitude: 0
retirement_age: 0
total_debt: 0
number_cards_issued: 0
pin_last_changed_year: 0
fraud_detected: 0
transaction_amount: 0
card_expiration_year: 0
card_expiration_month: 0
transaction_year: 0
transaction_month: 0
transaction_day: 0
transaction_hour: 0
gender_encoded: 0
card_brand_encoded: 0
card_type_encoded: 0
merchant_city_encoded: 0
merchant_mcc_code_encoded: 0
transaction_error_encoded: 0
transaction_type_encoded: 0


The following boxplot shows that the encoded data has a serious feature magnitude issue. Most features sit close to zero, while total debt ranges from 0 to over 450,000. This imbalance may affect the input to the upcoming deep learning model.

In [70]:
import altair as alt
import polars as pl

def show_boxplot(df: pl.DataFrame) -> alt.Chart:
    # Boxplot for visualizing feature magnitudes
    boxplot = alt.Chart(df).transform_fold(
        df.columns,  # Polars columns are already in list format
        as_=['Feature', 'Value']
    ).mark_boxplot().encode(
        x='Feature:N',
        y='Value:Q'
    ).properties(
        width=600,
        height=400,
        title='Feature Magnitude Boxplot'
    )
    return boxplot

# Sample the data using Polars
sample_df = (
    encoded_df
    .sample(n=5000, seed=42)
    .drop("fraud_detected")
)

# Show the boxplot
show_boxplot(sample_df).show()

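To solve the magnitude issue, StandardScaler was used to put all numerical features on a comparable scale; each feature is then shifted by its minimum so that all values are zero or positive. Figure 12 shows the improvement after applying the scaling method.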

In [71]:

from sklearn.preprocessing import StandardScaler

import json
import os
import polars as pl
import numpy as np

def fix_feature_magnitude(data: pl.DataFrame) -> pl.DataFrame:
    # Split features and target
    X = data.drop("fraud_detected")
    y = data.select("fraud_detected")
    
    # Convert to numpy for StandardScaler
    X_numpy = X.to_numpy()
    
    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_numpy)
    
    # Create new dataframe with scaled features
    df_standardscaled = pl.DataFrame(
        X_scaled,
        schema={col: pl.Float64 for col in X.columns}
    )
    
    # Ensure all values are zero or positive
    min_values = df_standardscaled.select([
        pl.col("*").min()
    ]).row(0)
    
    # Subtract minimum values from each column
    df_standardscaled = df_standardscaled.with_columns([
        (pl.col(col) - min_val).alias(col)
        for col, min_val in zip(df_standardscaled.columns, min_values)
    ])
    
    # Add back the target variable from y
    df_standardscaled = df_standardscaled.with_columns([
        y.to_series().alias("fraud_detected")
    ])
    
    # Save scaler parameters
    scaler_params = {
        'columns': X.columns,
        'mean': scaler.mean_.tolist(),
        'std': scaler.scale_.tolist()
    }
    
    if not os.path.exists('data'):
        os.makedirs('data', exist_ok=True)
        
    if not os.path.exists('../payment-solution/model'):
        os.makedirs('../payment-solution/model', exist_ok=True)
        
    with open('../payment-solution/model/scaler_params.json', 'w') as f:
        json.dump(scaler_params, f)
    
    return df_standardscaled

# Apply scaling
scaled_df = fix_feature_magnitude(encoded_df)

# Ensure fraud_detected column is correctly set
scaled_df = scaled_df.with_columns([
    encoded_df.select("fraud_detected").to_series()
])

In [72]:
import altair as alt
import polars as pl

def show_boxplot(df: pl.DataFrame) -> alt.Chart:
    # Boxplot for visualizing feature magnitudes
    boxplot = alt.Chart(df).transform_fold(
        df.columns,  # Polars columns are already in list format
        as_=['Feature', 'Value']
    ).mark_boxplot().encode(
        x='Feature:N',
        y='Value:Q'
    ).properties(
        width=600,
        height=400,
        title='Feature Magnitude Boxplot'
    )
    return boxplot

# Sample the data using Polars
sample_df = (
    scaled_df
    .sample(n=5000, seed=42)
    .drop("fraud_detected")
)

# Show the boxplot
show_boxplot(sample_df).show()

After applying the StandardScaler, the boxplot above shows that the feature magnitudes are far more balanced across all features.

A checkpoint is required to persist the scaled DataFrame.

In [73]:
import polars as pl

# Save the scaled data to a parquet file
scaled_df.write_parquet('data/scaled_transactions.parquet')

### Outliers

Outliers are observations that deviate significantly from other data points. They can be detected in several ways during Exploratory Data Analysis (EDA). The previous boxplot marks points beyond its whiskers as outliers, and the winsorizing step later in this section caps each feature at its 5th and 95th percentiles.
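The interquartile range (IQR) rule, which the detection function later in this section applies per feature, can be illustrated on a small list. This is a standard-library sketch with made-up amounts; the linear-interpolation quartile is an assumption, and libraries differ in their default quartile conventions, so bounds may vary slightly.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1] on pre-sorted data."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    q1 = percentile(ordered, 0.25)
    q3 = percentile(ordered, 0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [v for v in values if v < lower_bound or v > upper_bound]


amounts = [10, 12, 11, 13, 12, 14, 11, 300]  # toy transaction amounts
flagged = iqr_outliers(amounts)
print(flagged)
```

Because a record counts as an outlier if any single feature falls outside its bounds, applying the rule to many features at once can flag a large share of the rows.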

A study carried out in 2021 compared different outlier detection techniques to help data scientists select an algorithm for building a better model (Agarwal & Gupta, 2021). The researchers concluded that the Angle-based Outlier Detection (ABOD) and One-class SVM (OCSVM) techniques improved data analysis and machine learning model performance the most across classifiers. In addition, for each classifier, specific outlier detection techniques performed best.
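For reference, a One-class SVM detector can be sketched with scikit-learn's `OneClassSVM`. This is an illustration on synthetic two-dimensional data, not the pipeline's implementation; the `nu` value and the toy points are assumptions chosen for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic data: 200 points around the origin plus a few far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([inliers, far_points])

# nu is an upper bound on the fraction of training errors,
# which roughly controls the share of points treated as outliers
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = detector.fit_predict(X)  # +1 = inlier, -1 = outlier

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X)} points as outliers")
```

Kernel SVM training cost grows quickly with the number of samples, so on a dataset with millions of rows such a detector would probably have to be fitted on a sample.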

The code below does not use OCSVM; it flags outliers with the interquartile range (IQR) rule and then caps them by winsorizing.

In [74]:
import polars as pl

scaled_df = pl.read_parquet('data/scaled_transactions.parquet')
display(scaled_df.collect_schema())

Schema([('birth_year', Float64),
        ('credit_card_count', Float64),
        ('credit_score', Float64),
        ('latitude', Float64),
        ('longitude', Float64),
        ('retirement_age', Float64),
        ('total_debt', Float64),
        ('number_cards_issued', Float64),
        ('pin_last_changed_year', Float64),
        ('transaction_amount', Float64),
        ('card_expiration_year', Float64),
        ('card_expiration_month', Float64),
        ('transaction_year', Float64),
        ('transaction_month', Float64),
        ('transaction_day', Float64),
        ('transaction_hour', Float64),
        ('gender_encoded', Float64),
        ('card_brand_encoded', Float64),
        ('card_type_encoded', Float64),
        ('merchant_city_encoded', Float64),
        ('merchant_mcc_code_encoded', Float64),
        ('transaction_error_encoded', Float64),
        ('transaction_type_encoded', Float64),
        ('fraud_detected', Int8)])

In [75]:
import polars as pl
import numpy as np

def detect_outliers_using_iqr(df: pl.DataFrame) -> pl.DataFrame:
    # Split features and target
    X = df.drop("fraud_detected")
    y = df.select("fraud_detected")
    
    # Calculate IQR for each column
    outlier_mask = pl.Series([False] * len(df))
    
    for col in X.columns:
        q1 = X.select(pl.col(col).quantile(0.25)).item()
        q3 = X.select(pl.col(col).quantile(0.75)).item()
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        
        col_outliers = (X.select(pl.col(col) < lower_bound).to_series() | 
                       X.select(pl.col(col) > upper_bound).to_series())
        outlier_mask = outlier_mask | col_outliers
    
    return X.with_columns([
        pl.Series("Outlier", outlier_mask),
        y.to_series().alias("fraud_detected")
    ])

df_with_outliers = detect_outliers_using_iqr(scaled_df)


In [76]:
# Get overall statistics
total_records = len(df_with_outliers)
outlier_count = df_with_outliers.filter(pl.col("Outlier") == True).height

print("#### Overall Outlier Statistics ####")
print(f"Total records: {total_records}")
print(f"Total outliers detected: {outlier_count}")
print(f"Overall percentage: {(outlier_count/total_records)*100:.2f}%\n")

#### Overall Outlier Statistics ####
Total records: 2957641
Total outliers detected: 1733075
Overall percentage: 58.60%
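Winsorizing, which the next cell applies, caps extreme values rather than removing rows. The sketch below is a simplified pure-Python version of the idea on ten made-up values with 10% limits on each side; it is not SciPy's implementation, and the notebook itself uses limits of 0.05.

```python
def winsorize_values(values, lower_limit=0.1, upper_limit=0.1):
    """Clamp the lowest and highest fractions of values to the nearest kept value."""
    n = len(values)
    ordered = sorted(values)
    low_idx = int(lower_limit * n)       # how many values to clamp at the bottom
    high_idx = n - int(upper_limit * n)  # index just past the last kept value
    low_cap = ordered[low_idx]
    high_cap = ordered[high_idx - 1]
    return [min(max(v, low_cap), high_cap) for v in values]


scores = [-500, 2, 3, 4, 5, 6, 7, 8, 9, 9000]  # toy values with two extremes
capped = winsorize_values(scores, 0.1, 0.1)
print(capped)
```

The row count is unchanged; only the two extreme values are replaced by their nearest neighbours inside the limits.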



In [77]:
from scipy.stats.mstats import winsorize
import polars as pl
import numpy as np
import json
import os

def fix_outliers_using_winsorizing(df: pl.DataFrame, outlier_features: list, limits: list) -> pl.DataFrame:
    # Create a copy of the DataFrame
    df_winsorized = df.clone()
    winsor_params = {}
    
    for feature in outlier_features:
        # Convert column to numpy array for winsorizing
        feature_values = df_winsorized.get_column(feature).to_numpy()
        lower_bound = np.percentile(feature_values, 5)  # 5th percentile
        upper_bound = np.percentile(feature_values, 95) # 95th percentile
        winsor_params[feature] = {
            'lower_bound': float(lower_bound),
            'upper_bound': float(upper_bound),
            'limits': limits
        }
        
        # Apply winsorizing; winsorize returns a masked array, so convert it to a plain ndarray
        winsorized_values = np.asarray(winsorize(feature_values, limits=limits))
        
        # Update the column with winsorized values
        df_winsorized = df_winsorized.with_columns([
            pl.Series(name=feature, values=winsorized_values)
        ])
    
    # Save the parameters for all features once, after the loop
    os.makedirs('../payment-solution/model', exist_ok=True)
    with open('../payment-solution/model/winsor_params.json', 'w') as f:
        json.dump(winsor_params, f)
    
    return df_winsorized

feature_columns = df_with_outliers.drop(["Outlier", "fraud_detected"]).columns

# Apply winsorizing
winsorized_df = fix_outliers_using_winsorizing(
    scaled_df, 
    outlier_features=feature_columns,
    limits=[0.05, 0.05]
)

# Display first few rows
print(winsorized_df.head())

# If you want to see the effect of winsorizing, you can compare statistics:
for feature in feature_columns:
    print(f"\n{feature} statistics:")
    print("Original:")
    print(scaled_df.select(pl.col(feature)).describe())
    print("\nWinsorized:")
    print(winsorized_df.select(pl.col(feature)).describe())


shape: (5, 24)
[Table output truncated: the 24 columns are the 23 f64 feature columns plus fraud_detected.]

In [78]:
winsorized_df.write_parquet('data/winsorized_transactions.parquet')

### Distribution of Data

Skewed feature distributions can affect how statistical analyses and machine learning models perform. To address this, the data should be centered at zero with a standard deviation of one, and strongly skewed features should additionally be transformed toward a symmetric distribution. This section measures the skewness of each feature and discusses approaches to dealing with skewed data using Python.
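As a rough illustration of what the skewness statistic measures, the sketch below computes the standard biased sample skewness (third central moment divided by the variance raised to the power 1.5) with NumPy and labels it using the same 0.5 and 1.0 cut-offs as the notebook cell that follows. The function names and the two small arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def sample_skewness(values) -> float:
    """Biased sample skewness: m3 / m2**1.5, using central moments."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return float(m3 / m2 ** 1.5)


def interpret_skewness(skew: float) -> str:
    """Bucket a skewness value with the 0.5 / 1.0 thresholds used in this section."""
    if abs(skew) < 0.5:
        return "Approximately symmetric"
    if abs(skew) < 1:
        return "Moderately skewed"
    return "Highly skewed"


# Hypothetical toy samples: one symmetric, one with a long right tail
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = np.array([1.0, 1.0, 1.0, 1.0, 10.0])

for name, data in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skew_value = sample_skewness(data)
    print(f"{name}: skewness={skew_value:.3f} -> {interpret_skewness(skew_value)}")
```

A positive value indicates a longer right tail and a negative value a longer left tail, which is why the notebook sorts by the absolute value.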

In [79]:
import polars as pl

winsorized_df = pl.read_parquet('data/winsorized_transactions.parquet')

display(winsorized_df.collect_schema())

Schema([('birth_year', Float64),
        ('credit_card_count', Float64),
        ('credit_score', Float64),
        ('latitude', Float64),
        ('longitude', Float64),
        ('retirement_age', Float64),
        ('total_debt', Float64),
        ('number_cards_issued', Float64),
        ('pin_last_changed_year', Float64),
        ('transaction_amount', Float64),
        ('card_expiration_year', Float64),
        ('card_expiration_month', Float64),
        ('transaction_year', Float64),
        ('transaction_month', Float64),
        ('transaction_day', Float64),
        ('transaction_hour', Float64),
        ('gender_encoded', Float64),
        ('card_brand_encoded', Float64),
        ('card_type_encoded', Float64),
        ('merchant_city_encoded', Float64),
        ('merchant_mcc_code_encoded', Float64),
        ('transaction_error_encoded', Float64),
        ('transaction_type_encoded', Float64),
        ('fraud_detected', Int8)])

In [80]:
import polars as pl


# Set display options for better visibility
pl.Config.set_tbl_rows(23)  # Show all 23 rows
pl.Config.set_tbl_cols(-1)  # Show all columns


# Drop fraud_detected column
X = winsorized_df.drop("fraud_detected")
y = winsorized_df.select("fraud_detected")

# Create lists to store column names and skewness values
columns = []
skew_values = []

# Calculate skewness for each column
for col in X.columns:
    skew_value = X.select(pl.col(col).skew()).item()
    columns.append(col)
    skew_values.append(skew_value)

# Create a DataFrame with the results
skewness_df = pl.DataFrame({
    "Column": columns,
    "Skewness": skew_values
})

# Sort by absolute skewness value to see most skewed columns first (optional)
skewness_df = skewness_df.with_columns(
    pl.col("Skewness").abs().alias("abs_skew"),
    pl.when(pl.col("Skewness").abs() < 0.5)
    .then(pl.lit("Approximately symmetric"))
    .when(pl.col("Skewness").abs() < 1)
    .then(pl.lit("Moderately skewed"))
    .otherwise(pl.lit("Highly skewed"))
    .alias("Interpretation")
).sort("abs_skew", descending=True).drop("abs_skew")

# Print the results
print("\nSkewness DataFrame:")
print(skewness_df)


Skewness DataFrame:
shape: (23, 3)
┌───────────────────────────┬───────────┬─────────────────────────┐
│ Column                    ┆ Skewness  ┆ Interpretation          │
│ ---                       ┆ ---       ┆ ---                     │
│ str                       ┆ f64       ┆ str                     │
╞═══════════════════════════╪═══════════╪═════════════════════════╡
│ merchant_city_encoded     ┆ 2.193569  ┆ Highly skewed           │
│ merchant_mcc_code_encoded ┆ 1.845289  ┆ Highly skewed           │
│ card_type_encoded         ┆ 1.785501  ┆ Highly skewed           │
│ transaction_type_encoded  ┆ 1.530993  ┆ Highly skewed           │
│ transaction_error_encoded ┆ -1.0      ┆ Highly skewed           │
│ longitude                 ┆ -0.820656 ┆ Moderately skewed       │
│ total_debt                ┆ 0.524274  ┆ Moderately skewed       │
│ transaction_amount        ┆ 0.471247  ┆ Approximately symmetric │
│ birth_year                ┆ -0.454947 ┆ Approximately symmetric │
│ card_brand

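The Box-Cox transformation estimates the power parameter (lambda) that brings the data as close as possible to a normal distribution, which reduces skewness. It requires strictly positive inputs, so columns that contain zero or negative values are shifted before the transformation is applied.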
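The following sketch shows the idea on a single synthetic column: `scipy.stats.boxcox` fits lambda by maximum likelihood, and the same lambda can later be reapplied to new values with the closed-form formula. The synthetic log-normal data and the helper name `boxcox_with_lambda` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)

# Right-skewed, strictly positive synthetic sample (stand-in for a skewed feature)
values = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Fit: boxcox returns the transformed data and the maximum-likelihood lambda
transformed, fitted_lambda = boxcox(values)


def boxcox_with_lambda(x, lmbda: float) -> np.ndarray:
    """Apply y = (x**lmbda - 1) / lmbda, or log(x) when lmbda == 0."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda


# Reapplying the stored lambda reproduces the fitted transformation
reapplied = boxcox_with_lambda(values, fitted_lambda)

print(f"fitted lambda:   {fitted_lambda:.4f}")
print(f"skewness before: {skew(values):.3f}")
print(f"skewness after:  {skew(transformed):.3f}")
```

Storing lambda (and any shift applied to make the data positive) is what allows the same transformation to be repeated on unseen transactions at prediction time.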

In [81]:
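import json
import os

import numpy as np
import polars as pl
from scipy.stats import boxcox

# Box-Cox parameters per column, saved to JSON below for reuse at prediction time
boxcox_params = {}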
    
def apply_boxcox_transform(df: pl.DataFrame, skewness_threshold: float = 0.5) -> pl.DataFrame:
    # Create a copy of the DataFrame
    df_transformed = df.clone()
    
    # Box-Cox parameters are collected in the module-level boxcox_params dict
    # Calculate skewness for each column
    skewness_values = {}
    for col in df.columns:
        skew_value = df.get_column(col).skew()
        skewness_values[col] = skew_value
    
    # Apply Box-Cox only to columns with abs(skewness) >= threshold
    transformed_cols = []
    for col in df.columns:
        if abs(skewness_values.get(col, 0)) >= skewness_threshold:
            # Get column values as numpy array
            values = df.get_column(col).to_numpy()
            
            # Ensure all values are positive for Box-Cox
            min_val = values.min()
            if min_val <= 0:
                values = values - min_val + 1  # Shift to make all values positive
            
            try:
                # Apply Box-Cox transformation
                transformed_values, lambda_param = boxcox(values)
                
                # Store parameters
                boxcox_params[col] = {
                    'min_value': float(min_val),
                    'lambda': float(lambda_param)
                }
                
                
                # Update the DataFrame with transformed values
                transformed_cols.append(
                    pl.Series(name=col, values=transformed_values)
                )
                
                print(f"Applied Box-Cox to {col} (skewness: {skewness_values[col]:.3f})")
            except Exception as e:
                print(f"Could not transform {col}: {str(e)}")
        else:
            # Keep original column if skewness is below threshold
            transformed_cols.append(pl.col(col))
    
    # Create new DataFrame with transformed columns
    df_transformed = df_transformed.with_columns(transformed_cols)
    
    return df_transformed

# Apply the transformation
boxcoxed_df = apply_boxcox_transform(winsorized_df, skewness_threshold=0.5)


# Print skewness before and after transformation
print("\nSkewness Comparison:")
print("Column | Before | After")
print("-" * 40)
for col in boxcoxed_df.columns:
    before_skew = winsorized_df.get_column(col).skew()
    after_skew = boxcoxed_df.get_column(col).skew()
    print(f"{col}: {before_skew:.3f} | {after_skew:.3f}")
    
# Restore the original target column so the label is never transformed
boxcoxed_df = boxcoxed_df.with_columns([
    winsorized_df["fraud_detected"].alias("fraud_detected")
])

# Ensure the output directory exists
os.makedirs('../payment-solution/model', exist_ok=True)

# Save Box-Cox parameters to JSON
with open('../payment-solution/model/boxcox_params.json', 'w') as f:
    json.dump(boxcox_params, f, indent=2)


Applied Box-Cox to longitude (skewness: -0.821)
Applied Box-Cox to total_debt (skewness: 0.524)
Applied Box-Cox to card_type_encoded (skewness: 1.786)
Applied Box-Cox to merchant_city_encoded (skewness: 2.194)
Applied Box-Cox to merchant_mcc_code_encoded (skewness: 1.845)
Could not transform transaction_error_encoded: Data must not be constant.
Applied Box-Cox to transaction_type_encoded (skewness: 1.531)


  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return np.real(special.logsumexp(2 * logxmu, axis=0)) - np.log(len(logx))
  tmp2 = (xb - xc) * (fb - fa)


Applied Box-Cox to fraud_detected (skewness: 54.087)

Skewness Comparison:
Column | Before | After
----------------------------------------
birth_year: -0.455 | -0.455
credit_card_count: -0.009 | -0.009
credit_score: -0.272 | -0.272
latitude: -0.353 | -0.353
longitude: -0.821 | -0.329
retirement_age: -0.184 | -0.184
total_debt: 0.524 | -0.014
number_cards_issued: -0.036 | -0.036
pin_last_changed_year: 0.123 | 0.123
transaction_amount: 0.471 | 0.471
card_expiration_year: -0.327 | -0.327
card_expiration_month: 0.023 | 0.023
transaction_year: -0.004 | -0.004
transaction_month: -0.013 | -0.013
transaction_day: -0.012 | -0.012
transaction_hour: 0.239 | 0.239
gender_encoded: -0.064 | -0.064
card_brand_encoded: -0.408 | -0.408
card_type_encoded: 1.786 | 0.557
merchant_city_encoded: 2.194 | 0.020
merchant_mcc_code_encoded: 1.845 | 0.485
transaction_error_encoded: -1.000 | -1.000
transaction_type_encoded: 1.531 | 1.531
fraud_detected: 54.087 | 54.087


In [82]:
boxcoxed_df.write_parquet('data/boxcoxed_transactions.parquet')

### Feature Selection

Feature importance analysis determines which features in a dataset have the greatest influence on the outcome. This helps identify features that deserve more attention and features that can be discarded entirely, depending on their impact. One of the most popular methods for determining feature importance is the Random Forest Regressor (RFR). Speiser (2021), for example, used a Random Forest to select features for a medical prediction model.

RFR with Recursive Feature Elimination with Cross-Validation (RFECV) is a powerful feature selection technique that is used in many machine learning applications. It uses the Random Forest model to rank features by importance and then recursively eliminates the least relevant ones, scoring each candidate subset with 5-fold cross-validation. Combining RFR and RFECV can reduce overfitting while selecting an optimal subset of features for further analysis, and it measures how much each feature contributes to predicting the target variable accurately. The code in this section uses a lighter variant of the idea: it fits the Random Forest once on a stratified sample of 100,000 rows and keeps the features whose importance exceeds the mean importance, using scikit-learn's SelectFromModel.
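For comparison, a minimal sketch of the recursive elimination procedure described above is shown here on a small synthetic dataset. It uses scikit-learn's `RFECV` with a `RandomForestRegressor` and 5-fold cross-validation; the synthetic data, the tiny forest, and the scoring choice are assumptions chosen to keep the example fast, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Hypothetical toy data: 8 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Small forest so that the repeated refits stay cheap
forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)

# Drop one feature per step and score every subset size with 5-fold CV
rfecv = RFECV(
    estimator=forest,
    step=1,
    cv=5,
    scoring="neg_mean_squared_error",
    min_features_to_select=1,
)
rfecv.fit(X_demo, y_demo)

print("Number of features kept:", rfecv.n_features_)
print("Selection mask:", rfecv.support_)
print("Elimination ranking (1 = kept):", rfecv.ranking_)
```

Because every elimination step refits the forest once per fold, this procedure is considerably more expensive than a single `SelectFromModel` fit, which matters on a dataset with millions of rows.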

In [83]:
import polars as pl

boxcoxed_df = pl.read_parquet('data/boxcoxed_transactions.parquet')

display(boxcoxed_df.collect_schema())
display(boxcoxed_df.shape)

# Print missing values of boxcoxed_df
display(boxcoxed_df.null_count())

Schema([('birth_year', Float64),
        ('credit_card_count', Float64),
        ('credit_score', Float64),
        ('latitude', Float64),
        ('longitude', Float64),
        ('retirement_age', Float64),
        ('total_debt', Float64),
        ('number_cards_issued', Float64),
        ('pin_last_changed_year', Float64),
        ('transaction_amount', Float64),
        ('card_expiration_year', Float64),
        ('card_expiration_month', Float64),
        ('transaction_year', Float64),
        ('transaction_month', Float64),
        ('transaction_day', Float64),
        ('transaction_hour', Float64),
        ('gender_encoded', Float64),
        ('card_brand_encoded', Float64),
        ('card_type_encoded', Float64),
        ('merchant_city_encoded', Float64),
        ('merchant_mcc_code_encoded', Float64),
        ('transaction_error_encoded', Float64),
        ('transaction_type_encoded', Float64),
        ('fraud_detected', Int8)])

(2957641, 24)

birth_year,credit_card_count,credit_score,latitude,longitude,retirement_age,total_debt,number_cards_issued,pin_last_changed_year,transaction_amount,card_expiration_year,card_expiration_month,transaction_year,transaction_month,transaction_day,transaction_hour,gender_encoded,card_brand_encoded,card_type_encoded,merchant_city_encoded,merchant_mcc_code_encoded,transaction_error_encoded,transaction_type_encoded,fraud_detected
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [84]:
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
import numpy as np

# Separate features and target
X = boxcoxed_df.drop("fraud_detected")
y = boxcoxed_df.select("fraud_detected")

# Convert to numpy for easier manipulation
X_numpy = X.to_numpy()
y_numpy = y.to_numpy().ravel()

# Stratified sampling to maintain fraud ratio
sample_size = 100000
stratified_indices = train_test_split(
    np.arange(len(y_numpy)),
    train_size=sample_size/len(y_numpy),
    stratify=y_numpy,
    random_state=42
)[0]

X_sampled = X_numpy[stratified_indices]
y_sampled = y_numpy[stratified_indices]

# Split the sampled data
X_train, X_test, y_train, y_test = train_test_split(
    X_sampled,
    y_sampled,
    train_size=0.8,
    test_size=0.2,
    stratify=y_sampled,
    random_state=42
)

# Initialize Random Forest
rfr_model = RandomForestRegressor(
    n_estimators=50,
    max_depth=8,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    n_jobs=-1,
    random_state=0
)

# Feature selector
selector = SelectFromModel(
    estimator=rfr_model,
    prefit=False,
    max_features=10,
    threshold='mean'
)

# Fit the selector
selector.fit(X_train, y_train)

# Get selected features and importance
selected_mask = selector.get_support()
feature_names = np.array(X.columns)  # Convert columns to numpy array
selected_features = feature_names[selected_mask]
feature_importance = selector.estimator_.feature_importances_[selected_mask]

# Create feature importance DataFrame with percentages
total_importance = feature_importance.sum()
feature_importance_df = pl.DataFrame({
    "Feature": selected_features.tolist(),  # Convert numpy array to list
    "Importance": feature_importance,
    "Percentage": (feature_importance / total_importance) * 100
}).sort("Importance", descending=True)

print("\nFeature Importance:")
print(feature_importance_df)

# Print missing values of boxcoxed_df
display(boxcoxed_df.null_count())


Feature Importance:
shape: (9, 3)
┌───────────────────────────┬────────────┬────────────┐
│ Feature                   ┆ Importance ┆ Percentage │
│ ---                       ┆ ---        ┆ ---        │
│ str                       ┆ f64        ┆ f64        │
╞═══════════════════════════╪════════════╪════════════╡
│ merchant_city_encoded     ┆ 0.178335   ┆ 25.535405  │
│ transaction_amount        ┆ 0.082314   ┆ 11.786365  │
│ latitude                  ┆ 0.074745   ┆ 10.702586  │
│ merchant_mcc_code_encoded ┆ 0.072532   ┆ 10.385699  │
│ total_debt                ┆ 0.070271   ┆ 10.061938  │
│ credit_score              ┆ 0.066112   ┆ 9.466462   │
│ longitude                 ┆ 0.058404   ┆ 8.362797   │
│ transaction_day           ┆ 0.049227   ┆ 7.048753   │
│ birth_year                ┆ 0.046442   ┆ 6.649994   │
└───────────────────────────┴────────────┴────────────┘


birth_year,credit_card_count,credit_score,latitude,longitude,retirement_age,total_debt,number_cards_issued,pin_last_changed_year,transaction_amount,card_expiration_year,card_expiration_month,transaction_year,transaction_month,transaction_day,transaction_hour,gender_encoded,card_brand_encoded,card_type_encoded,merchant_city_encoded,merchant_mcc_code_encoded,transaction_error_encoded,transaction_type_encoded,fraud_detected
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [85]:
feature_importance_df.write_parquet('data/feature_importance.parquet')

## Model Development

LightGBM (Light Gradient Boosting Machine) is one of the most effective and efficient machine learning algorithms for large datasets. According to Huang et al. (2020), LightGBM was developed as an improved gradient boosting implementation capable of overcoming the challenges of large-scale data, which is relevant to most operational environments, including fraud detection.

We chose LightGBM for model development because it is fast and efficient. It employs a technique called GOSS (Gradient-based One-Side Sampling), which keeps the training instances with large gradients and randomly samples only a fraction of those with small gradients when searching for split values, cutting the cost of scanning every sample. LightGBM also supports categorical features out of the box, so no heavy pre-processing is required.
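The sampling step of GOSS can be sketched in a few lines of NumPy. The sketch keeps the `top_rate` fraction of instances with the largest absolute gradients, draws an `other_rate` fraction at random from the remainder, and up-weights the sampled small-gradient instances by `(1 - top_rate) / other_rate` so the gradient statistics stay approximately unbiased. The parameter names mirror LightGBM's GOSS settings, but the function itself is an illustrative assumption and not LightGBM's internal code.

```python
import numpy as np


def goss_sample(gradients, top_rate: float = 0.2, other_rate: float = 0.1, seed: int = 0):
    """Return (indices, weights) of the instances used for one GOSS round."""
    gradients = np.asarray(gradients, dtype=float)
    n_rows = len(gradients)
    n_top = int(top_rate * n_rows)
    n_other = int(other_rate * n_rows)

    # Rank instances by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]

    # Randomly sample from the remaining small-gradient instances
    rng = np.random.default_rng(seed)
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Large-gradient rows keep weight 1; sampled rows are amplified
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights


# Hypothetical gradients for 100 training instances
demo_gradients = np.linspace(-1.0, 1.0, 100)
demo_idx, demo_weights = goss_sample(demo_gradients)
print("Rows used:", len(demo_idx), "of", len(demo_gradients))
print("Weight of sampled small-gradient rows:", demo_weights[-1])
```

With these example rates, only 30% of the rows are scanned when evaluating split candidates.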

Its accuracy is also high, and many classical machine learning algorithms cannot compete with it. In the analysis by Huang et al. (2020), LightGBM achieved a higher AUC (Area Under the Curve) and accuracy than both logistic regression and SVM-based models.

LightGBM is a good tool for model development because it implements strategies to deal with missing values and provides regularization to avoid overfitting. In addition, the model is interpretable, since it can report which features contributed to the predictions made. The use of LightGBM is therefore likely to support effective model building with high predictive accuracy and interpretability.

### Model Training

The LightGBM model was trained and tested exclusively for binary classification. It used the gbdt boosting method, a feature fraction of 0.8 for column sampling, and early stopping after 50 rounds without improvement to avoid overfitting. The model used 9 features and was trained on 2,366,112 data points, comprising 808 positive instances and 2,365,304 negative instances. Early stopping selected the first round as the best iteration, at which the model recorded an Area Under the Curve (AUC) of 0.983679 and a binary log loss of 1.28113 on the training set.

In [86]:
import polars as pl

feature_importance_df = pl.read_parquet('data/feature_importance.parquet')
boxcoxed_df = pl.read_parquet('data/boxcoxed_transactions.parquet')


In [87]:
import polars as pl
import sklearn
from sklearn.model_selection import train_test_split
import fastparquet
import pyarrow
import numpy as np
import pandas as pd

selected_features = feature_importance_df["Feature"].to_list()
label = 'fraud_detected'

# First split the fraud cases to ensure representation in both sets
fraud_df = boxcoxed_df.filter(pl.col(label) == 1).to_pandas()
non_fraud_df = boxcoxed_df.filter(pl.col(label) == 0).to_pandas()

# Split fraud data
X_fraud = fraud_df[selected_features]
y_fraud = fraud_df[label]
X_fraud_train, X_fraud_test, y_fraud_train, y_fraud_test = train_test_split(
    X_fraud, 
    y_fraud, 
    test_size=0.2, 
    random_state=42
)

# Split non-fraud data
X_non_fraud = non_fraud_df[selected_features]
y_non_fraud = non_fraud_df[label]
X_non_fraud_train, X_non_fraud_test, y_non_fraud_train, y_non_fraud_test = train_test_split(
    X_non_fraud, 
    y_non_fraud, 
    test_size=0.2, 
    random_state=42
)

# Combine the splits
X_train = pd.concat([X_fraud_train, X_non_fraud_train])
y_train = pd.concat([y_fraud_train, y_non_fraud_train])
X_test = pd.concat([X_fraud_test, X_non_fraud_test])
y_test = pd.concat([y_fraud_test, y_non_fraud_test])


# Check feature statistics
print("\nFeature Statistics:")
print(X_train.describe())

# Check for missing values
print("\nMissing values in features:")
print(X_train.isnull().sum())

# Check feature correlations (each pair is reported once)
correlations = X_train.corr()
high_corr = np.where(np.abs(correlations) > 0.95)
print("\nHighly correlated features (>0.95):")
for i, j in zip(*high_corr):
    if i < j:
        print(f"{selected_features[i]} - {selected_features[j]}: {correlations.iloc[i, j]:.3f}")



Feature Statistics:
       merchant_city_encoded  transaction_amount      latitude  \
count           2.366112e+06        2.366112e+06  2.366112e+06   
mean           -2.945747e+00        6.788619e+00  3.168152e+00   
std             1.889664e+00        5.680224e-01  9.220367e-01   
min            -6.308005e+00        5.626906e+00  1.380197e+00   
25%            -4.364887e+00        6.404201e+00  2.475310e+00   
50%            -3.149890e+00        6.652239e+00  3.363984e+00   
75%            -1.396720e+00        7.116522e+00  3.898761e+00   
max             3.537207e-01        8.074484e+00  4.598690e+00   

       merchant_mcc_code_encoded    total_debt  credit_score     longitude  \
count               2.366112e+06  2.366112e+06  2.366112e+06  2.366112e+06   
mean                1.052607e-01  6.762707e-01  3.409843e+00  2.037173e+01   
std                 1.119477e-01  4.385967e-01  8.934951e-01  1.013658e+01   
min                 0.000000e+00  0.000000e+00  1.557801e+00  3.250482e+

In [None]:
import lightgbm as lgbm
from sklearn.metrics import precision_recall_curve, average_precision_score, roc_auc_score

# Weight positives by the negative-to-positive ratio to offset class imbalance
pos_scale = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

SEARCH_PARAMS = {
    "learning_rate": 0.01,
    "max_depth": 8,
    "feature_fraction": 0.8,
    "subsample": 0.8,
    "min_child_samples": 5,
    "min_child_weight": 0.001
}

FIXED_PARAMS = {
    "objective": "binary",
    "metric": ["auc", "binary_logloss"],
    "boosting": "gbdt",
    "early_stopping_rounds": 50,
    "verbose": 1,
    "scale_pos_weight": pos_scale, 
    "n_estimators": 1000,
    "random_state": 42
}

TRAINING_RUN_PARAMS = {**FIXED_PARAMS, **SEARCH_PARAMS}

clf = lgbm.LGBMClassifier(**TRAINING_RUN_PARAMS)
clf.fit(
    X_train, 
    y_train, 
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_metric=['auc', 'binary_logloss'],
    eval_names=['train', 'valid']
)


[LightGBM] [Info] Number of positive: 808, number of negative: 2365304
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010974 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1656
[LightGBM] [Info] Number of data points in the train set: 2366112, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000341 -> initscore=-7.981855
[LightGBM] [Info] Start training from score -7.981855
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1]	train's auc: 0.983679	train's binary_logloss: 1.28113	valid's auc: 0.964953	valid's binary_logloss: 1.28214


### Model Testing

On the validation set, the model reached an AUC of 0.964953 with a binary log loss of 1.28214. Key model parameters were a scale_pos_weight of 2927.3564356435645 (the ratio of negative to positive training instances), a max_depth of 8, and a learning rate of 0.01. On the test data, the average precision score was 0.0141 and the AUC score was 0.9650. The high AUC shows that the model ranks fraudulent transactions above legitimate ones well, which is consistent with the findings of Huang et al. (2020). The average precision is much lower because fraud makes up only about 0.03% of the records: it is well above the roughly 0.0003 expected from a random ranking, but it indicates that precision remains low at most decision thresholds.
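The class-weight and initial-score figures that appear in the training log can be reproduced with a few lines of arithmetic. This is a small sketch that uses the class counts reported by LightGBM; the variable names are illustrative.

```python
import math

# Class counts reported in the LightGBM training log
n_positive = 808
n_negative = 2_365_304

# Ratio of negatives to positives, passed to LightGBM as scale_pos_weight
scale_pos_weight = n_negative / n_positive

# Average positive rate and its log-odds, which LightGBM logs as
# pavg and initscore before the first boosting round
pavg = n_positive / (n_positive + n_negative)
init_score = math.log(pavg / (1.0 - pavg))

print(f"scale_pos_weight = {scale_pos_weight:.4f}")
print(f"pavg             = {pavg:.6f}")
print(f"initscore        = {init_score:.6f}")
```

The positive rate also serves as a reference point for the average precision score, since a ranking with no skill is expected to score close to it.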

In [89]:
y_predict_test = clf.predict(X_test)

# Get predictions and evaluate
y_pred_proba = clf.predict_proba(X_test)[:, 1]
y_pred = clf.predict(X_test)

# Calculate metrics
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"Average Precision Score: {avg_precision:.4f}")
print(f"AUC Score: {auc_score:.4f}")

Average Precision Score: 0.0141
AUC Score: 0.9650
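The precision, recall, and threshold arrays returned by `precision_recall_curve` can also be used to pick an operating threshold instead of relying on the default 0.5 cut-off. The sketch below does this on synthetic scores by maximizing the F1 score; the synthetic labels and scores, the `_demo` variable names, and the choice of F1 as the selection criterion are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Hypothetical imbalanced labels (about 1% positives) and noisy scores
y_true_demo = (rng.random(20_000) < 0.01).astype(int)
scores_demo = np.clip(
    rng.normal(loc=0.2, scale=0.1, size=y_true_demo.size) + 0.4 * y_true_demo,
    0.0,
    1.0,
)

precision_demo, recall_demo, thresholds_demo = precision_recall_curve(
    y_true_demo, scores_demo
)

# precision/recall have one more entry than thresholds; drop the final point
p = precision_demo[:-1]
r = recall_demo[:-1]
f1_demo = 2 * p * r / np.maximum(p + r, 1e-12)

best_index = int(np.argmax(f1_demo))
best_threshold = float(thresholds_demo[best_index])

print(f"Best F1 threshold: {best_threshold:.3f}")
print(f"Precision: {p[best_index]:.3f}, Recall: {r[best_index]:.3f}")
```

In a fraud setting the criterion would normally reflect business costs, for example a minimum recall or a cap on the number of alerts, rather than F1 alone.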


## Model Deployment

The final model was exported to the Open Neural Network Exchange (ONNX) format for deployment. Because ONNX is framework-neutral, a model can be built in one framework and served from another without compatibility issues between the two. It also makes models portable, from cloud servers to edge devices, so deployment is device-agnostic. ONNX runtimes apply high-performance optimizations suited to production, which keeps inference fast and efficient. As an industry-standard model format, ONNX helps keep models usable in the long term, avoids framework lock-in, and makes updates and migration less of a hassle. Lastly, ONNX is supported across multiple machine learning frameworks, tools, and compilers, which is ideal for teams working with various technologies during development.

In [90]:
import os

import onnxmltools
from onnxmltools.convert import convert_lightgbm
from onnxmltools.convert.common.data_types import FloatTensorType

# Define initial types for your features
initial_types = [
    ('float_input', FloatTensorType([None, len(selected_features)]))
]

# Convert the LightGBM model to ONNX
onnx_model = convert_lightgbm(
    model=clf,
    initial_types=initial_types,
    target_opset=None,
    name='LightGBM_FraudDetector'
)

# Ensure the output directory exists
os.makedirs('../payment-solution/model', exist_ok=True)
    
# Save the model
onnx_model_path = "../payment-solution/model/lightgbm_fraud_detector.onnx"
onnxmltools.utils.save_model(onnx_model, onnx_model_path)

The maximum opset needed by this model is only 9.
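Once saved, the model could be served with ONNX Runtime roughly as sketched below. The helper names are hypothetical, and the handling of the second output is an assumption: converted classifiers commonly return the label first and the probabilities second, either as a list of `{class_label: probability}` dictionaries or as a two-column array, so the sketch accepts both. Inputs are cast to float32 because the model was declared with `FloatTensorType`, and they are expected to have gone through the same winsorizing and Box-Cox steps as the training data.

```python
import numpy as np


def load_session(model_path: str):
    """Open an exported ONNX model on CPU (onnxruntime is imported lazily)."""
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def predict_fraud_probability(session, features) -> np.ndarray:
    """Return the fraud-class probability for each row of preprocessed features."""
    # The model expects float32 input with shape [n_rows, n_features]
    batch = np.asarray(features, dtype=np.float32)
    if batch.ndim == 1:
        batch = batch.reshape(1, -1)

    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: batch})

    # Assumption: outputs[0] is the predicted label, outputs[1] the probabilities
    probabilities = outputs[1]
    if isinstance(probabilities, list):
        # List of {class_label: probability} dicts, one per row
        return np.array([row[1] for row in probabilities], dtype=float)
    # Plain array with one column per class
    return np.asarray(probabilities)[:, 1].astype(float)
```

A call would then look like `predict_fraud_probability(load_session(onnx_model_path), row_values)`, where `row_values` holds the 9 selected features in training order.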
