| Benford's Law | Fraud detection |
| ----------- | ----------- |

Benford's law describes the distribution of leading digits in many naturally occurring datasets. In such data, the digit 1 appears as the leading digit about 30% of the time, and each larger digit appears less often, down to about 4.6% for the digit 9. For example, in a dataset of numbers from real-life sources, you would expect to see the digit 1 as the leading digit about six and a half times as often as the digit 9.
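These percentages come from the formula P(d) = log10(1 + 1/d) for d = 1 to 9. A minimal sketch that tabulates them (the name `benford_probability` is illustrative and not part of the notebook):

```python
from math import log10

def benford_probability(d):
    """Expected share of numbers whose leading digit is d (1-9) under Benford's law."""
    return log10(1 + 1 / d)

# Expected share for every possible leading digit
expected = {d: benford_probability(d) for d in range(1, 10)}
for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine shares sum to 1, with roughly 30.1% for the digit 1 and 4.6% for the digit 9.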

In [1]:
import pandas as pd
import numpy as np
import random

from collections import defaultdict
from math import log10, sqrt



In [3]:
# import from directory
from employee import employee_df
from transactions import trans_df

    TransactionDescription TransactionDate SubmittedDate  Amount
0               Hotel stay      2022-03-27    2022-04-03   89.44
1               Hotel stay      2022-04-19    2022-04-29  192.07
2            Entertainment      2022-11-24    2022-11-27  137.54
3               Car rental      2022-02-02    2022-02-04  195.13
4            Entertainment      2022-05-05    2022-05-13  183.58
..                     ...             ...           ...     ...
995        Office supplies      2022-11-15    2022-11-19  103.43
996             Car rental      2022-12-24    2023-01-02  197.24
997       Meal with client      2022-10-29    2022-11-07  128.92
998       Meal with client      2022-08-23    2022-09-01  115.33
999          Entertainment      2022-01-12    2022-01-23  148.48

[1000 rows x 4 columns]


In [4]:
# look at employee df
employee_df.head(5)

Unnamed: 0,EmployeeID,FirstName,LastName,Username
0,1,John,Smith,JSmith
1,2,Jane,Doe,JDoe
2,3,Bob,Williams,BWilliams
3,4,Alice,Johnson,AJohnson
4,5,Mike,Brown,MBrown


In [5]:
# look at transactions df
trans_df.head(5)

Unnamed: 0,TransactionDescription,TransactionDate,SubmittedDate,Amount
0,Hotel stay,2022-03-27,2022-04-03,89.44
1,Hotel stay,2022-04-19,2022-04-29,192.07
2,Entertainment,2022-11-24,2022-11-27,137.54
3,Car rental,2022-02-02,2022-02-04,195.13
4,Entertainment,2022-05-05,2022-05-13,183.58


In [6]:
# copy the transactions into a new dataframe so trans_df stays unchanged
benfords_df = trans_df.copy()

In [7]:
# get employee ids (primary key)
employee_ids = employee_df.iloc[:,0]

type(employee_ids)

pandas.core.series.Series

In [8]:
# convert series to list
employee_ids_list = employee_ids.to_list()

# employee_ids_list

In [9]:
# use apply() to assign a random employee id from the list to each row
benfords_df['EmployeeID'] = benfords_df['Amount'].apply(lambda x: random.choice(employee_ids_list))

In [10]:
benfords_df

Unnamed: 0,TransactionDescription,TransactionDate,SubmittedDate,Amount,EmployeeID
0,Hotel stay,2022-03-27,2022-04-03,89.44,0004
1,Hotel stay,2022-04-19,2022-04-29,192.07,0013
2,Entertainment,2022-11-24,2022-11-27,137.54,0100
3,Car rental,2022-02-02,2022-02-04,195.13,0073
4,Entertainment,2022-05-05,2022-05-13,183.58,0039
...,...,...,...,...,...
995,Office supplies,2022-11-15,2022-11-19,103.43,0012
996,Car rental,2022-12-24,2023-01-02,197.24,0007
997,Meal with client,2022-10-29,2022-11-07,128.92,0027
998,Meal with client,2022-08-23,2022-09-01,115.33,0034


The data is generated randomly. If the amounts span several orders of magnitude, they should roughly comply with Benford's law.
Let's see, though.
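Whether randomly generated amounts follow Benford's law depends on how they are generated. The sketch below compares two hypothetical generators (neither is taken from the `transactions` module, whose code is not shown here): amounts drawn uniformly from a narrow range, and amounts drawn log-uniformly across four orders of magnitude. Only the second is expected to come close to Benford's law.

```python
import random

def leading_digit(value):
    # Strip the sign, leading zeros and the decimal point: 0.052 -> "52" -> "5"
    return str(abs(value)).lstrip("0.")[0]

def share_of_ones(values):
    """Share of values whose leading digit is 1."""
    return sum(leading_digit(v) == "1" for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000

# Assumption: a narrow uniform range, similar in spirit to expense amounts
narrow = [rng.uniform(50, 200) for _ in range(n)]
# Assumption: log-uniform amounts between 1 and 10,000
wide = [10 ** rng.uniform(0, 4) for _ in range(n)]

print(f"narrow uniform range: {share_of_ones(narrow):.1%} start with 1")
print(f"log-uniform range:    {share_of_ones(wide):.1%} start with 1")
```

In the narrow range roughly two thirds of the amounts start with 1, because 100 to 200 covers two thirds of the interval, while the log-uniform sample lands near the 30% that Benford's law predicts.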

In [11]:
def test_benfords_law(data):
    # Count the number of times each digit appears as the leading digit in the dataset
    leading_digits = defaultdict(int)
    for number in data:
        if number == 0:
            continue  # zero has no leading digit
        # Strip the sign, leading zeros and decimal point, then take the first character
        # (plain str(number)[0] would return '0' for 0.05 and '-' for negative numbers)
        leading_digit = str(abs(number)).lstrip('0.')[0]
        leading_digits[leading_digit] += 1

    # Calculate the expected distribution of leading digits according to Benford's law
    expected_distribution = {str(d): log10(d+1) - log10(d) for d in range(1, 10)}

    # Compare the observed and expected counts for every digit 1-9,
    # including digits that never appear in the data
    total = sum(leading_digits.values())
    for d, expected_share in expected_distribution.items():
        expected_count = expected_share * total
        # Heuristic threshold, not a chi-squared test: flag a digit whose count
        # is more than two standard deviations (approximated by sqrt(expected_count)) off
        if abs(leading_digits[d] - expected_count) > 2 * sqrt(expected_count):
            return False

    return True

In [12]:
# test on the data
test_benfords_law(benfords_df['Amount'])

False
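The check above uses a per-digit threshold rather than an actual chi-squared test. For comparison, here is a minimal sketch of a chi-squared goodness-of-fit test over all nine digits. The function name, the hard-coded 5% critical value for 8 degrees of freedom, and the sample data are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter
from math import log10

# Critical value of the chi-squared distribution with 8 degrees of freedom
# (9 digits minus 1) at the 5% significance level
CHI2_CRITICAL_DF8_5PCT = 15.507

def benford_chi_squared(data):
    """Return the chi-squared statistic of the leading-digit counts against Benford's law."""
    # Leading digit of each nonzero value (sign, leading zeros and decimal point stripped)
    digits = [str(abs(x)).lstrip("0.")[0] for x in data if x != 0]
    observed = Counter(digits)
    n = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected_count = n * log10(1 + 1 / d)
        statistic += (observed[str(d)] - expected_count) ** 2 / expected_count
    return statistic

# Hypothetical amounts that all start with 1, for illustration only
sample = [100 + i for i in range(100)]
statistic = benford_chi_squared(sample)

# A statistic above the critical value rejects conformity at the 5% level
print(statistic > CHI2_CRITICAL_DF8_5PCT)
```

If SciPy is available, `scipy.stats.chisquare` can compute the statistic and a p-value from the observed and expected counts instead of relying on a hard-coded critical value.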

In [13]:
# return the benford offenders

def benford_offenders(data):
    # Count the number of times each digit appears as the leading digit in the dataset
    leading_digits = defaultdict(int)
    for number in data:
        if number == 0:
            continue  # zero has no leading digit
        # Strip the sign, leading zeros and decimal point, then take the first character
        leading_digit = str(abs(number)).lstrip('0.')[0]
        leading_digits[leading_digit] += 1

    # Calculate the expected distribution of leading digits according to Benford's law
    expected_distribution = {str(d): log10(d+1) - log10(d) for d in range(1, 10)}

    # Collect every digit whose observed count deviates from the expected count
    total = sum(leading_digits.values())
    noncompliant_digits = set()
    for d, expected_share in expected_distribution.items():
        expected_count = expected_share * total
        # Heuristic threshold, not a chi-squared test: more than two standard
        # deviations (approximated by sqrt(expected_count)) from the expected count
        if abs(leading_digits[d] - expected_count) > 2 * sqrt(expected_count):
            noncompliant_digits.add(d)

    # Return every value whose leading digit was flagged
    return [row for row in data
            if row != 0 and str(abs(row)).lstrip('0.')[0] in noncompliant_digits]


In [None]:
benford_offenders(benfords_df['Amount'])

In [20]:
total = len(benfords_df['Amount'])

total

1000

In [18]:
offenders = benford_offenders(benfords_df['Amount'])
total_offenders = len(offenders)

total_offenders

872

In [25]:
perc_offend = ((total_offenders/total)*100)

perc_offend

87.2

| END OF PROGRAM |
|----------------|

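Hmm, so this data doesn't follow Benford's law.
Nearly 90% of rows start with a digit that occurs much more or much less often than Benford's law predicts. That doesn't by itself mean the data isn't random: the amounts visible above range from about 89 to 197, and Benford's law generally holds only for data spanning several orders of magnitude.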
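A per-digit breakdown shows which leading digits account for the flagged rows. The sketch below applies the same two-standard-deviation heuristic to the ten amounts visible in the notebook output above. The function name `digit_breakdown` is illustrative, and ten rows are far too few for a meaningful test, so this only demonstrates the mechanics.

```python
from collections import Counter
from math import log10, sqrt

def digit_breakdown(data):
    """Return (digit, observed count, expected count, flagged) for digits 1-9."""
    digits = [str(abs(x)).lstrip("0.")[0] for x in data if x != 0]
    observed = Counter(digits)
    n = len(digits)
    rows = []
    for d in range(1, 10):
        expected_count = n * log10(1 + 1 / d)
        # Same two-standard-deviation heuristic as the notebook's functions
        flagged = abs(observed[str(d)] - expected_count) > 2 * sqrt(expected_count)
        rows.append((d, observed[str(d)], round(expected_count, 1), flagged))
    return rows

# The ten amounts shown in the printed transactions table
sample = [89.44, 192.07, 137.54, 195.13, 183.58,
          103.43, 197.24, 128.92, 115.33, 148.48]
for row in digit_breakdown(sample):
    print(row)
```

Nine of these ten amounts start with 1, so the digit 1 is the only one flagged, which is what you would expect when most amounts fall between 100 and 200.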