# What Is Benford's Law and How Is It Used for Fraud Detection?
A recent episode of the Netflix show 'Connected' talked about a pattern followed by many naturally occurring sets of numbers, called **Benford's Law**. According to this strange law, the probability that a number in such a set has leading digit 'd' is log(1+1/d) with base 10. In other words, given a dataset that has not been manipulated, the first digits of the numbers in a column will most likely occur with frequencies of [30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6] percent for the digits 1-9 respectively.
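
As a quick check, the percentages quoted above can be reproduced directly from the formula. This is a minimal sketch using only the standard library; the name `benford_probs` is introduced here for illustration and is not used elsewhere in the notebook.

```python
import math

# Expected first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
benford_probs = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print each digit with its expected frequency as a percentage
for d, p in benford_probs.items():
    print(f"{d}: {p * 100:.1f}%")
```

The nine probabilities sum to 1, because the terms log10((d+1)/d) telescope to log10(10).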

This law is used to catch financial fraud by raising red flags when a dataset looks suspicious. It's mind-boggling that so many datasets that involve numbers and are not handcrafted follow Benford's Law.

Forensic accountant Mark Nigrini has developed five digit tests. They are listed below in the order in which they are performed:
* The first digit test
* The second digit test
* The first two digits test
* The first three digits test
* The last two digits test

The first and second digit tests are done to determine if the dataset appears reasonable. In this notebook, only the first digit test is performed on Covid-India data.
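
For reference, the expected distribution for the second digit test also follows from the same logarithmic rule: the probability of second digit d2 is the sum of log10(1 + 1/(10*d1 + d2)) over all first digits d1 from 1 to 9. The sketch below only computes these expected values; the helper name `second_digit_prob` is hypothetical and the test itself is not run in this notebook.

```python
import math

def second_digit_prob(d2):
    """Benford probability that the second significant digit equals d2 (0-9)."""
    # Sum over every possible first digit d1 of P(first two digits == d1 d2)
    return sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))

second_digit_probs = [second_digit_prob(d2) for d2 in range(10)]

for d2, p in enumerate(second_digit_probs):
    print(f"{d2}: {p * 100:.2f}%")
```

The second-digit distribution is much flatter than the first-digit one, so deviations are harder to spot by eye.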

To determine whether the first digits of a dataset conform with Benford's Law, I used the **Chi-Square** test. The critical value of the chi-square distribution at a significance level of 0.05 with 8 degrees of freedom (#categories - 1) is 15.51. So if the chi-square statistic exceeds 15.51, we can conclude that the distribution doesn't follow Benford's Law!
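
To make the statistic concrete: it is the sum over the digits 1-9 of (observed - expected)^2 / expected. Pearson's chi-square is usually computed on raw counts, with the expected count for digit d being N * log10(1 + 1/d) for N observations; computing it on percentages instead is equivalent to assuming N = 100. The following is a minimal sketch under that assumption. The function name `chi_square_from_counts` and the example counts are made up for illustration and do not come from any dataset used here.

```python
import math

def chi_square_from_counts(observed_counts):
    """Chi-square statistic for first-digit counts (digits 1-9) against Benford's Law."""
    total = sum(observed_counts)
    stat = 0.0
    for d, observed in enumerate(observed_counts, start=1):
        expected = total * math.log10(1 + 1 / d)  # expected count for digit d
        stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_VALUE = 15.51  # chi-square, 8 degrees of freedom, significance level 0.05

# Made-up counts for illustration only (not taken from any dataset)
example_counts = [30, 18, 12, 10, 8, 7, 6, 5, 4]
stat = chi_square_from_counts(example_counts)
print("Conforms with Benford's Law:", stat < CRITICAL_VALUE)
```

Because the statistic grows with the sample size, the same percentage deviation is more significant in a large dataset than in a small one.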



In [None]:

import numpy as np 
import pandas as pd 
import seaborn as sns
import random
import matplotlib.pyplot as plt
import math

In [None]:
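def firstDigit(n):
    """Return the first (leading) digit of a non-negative integer n."""
    n = int(n)
    while n >= 10:
        n //= 10  # integer division avoids floating-point rounding
    return n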
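
The function above is enough for case counts, which are non-negative integers. If the same test were applied to data with fractions or negative values (for example financial amounts), a more general helper would be needed. A possible sketch, where `first_digit_any` is a hypothetical name that is not used in the rest of this notebook:

```python
def first_digit_any(x):
    """Leading significant digit of any nonzero number (hypothetical helper)."""
    x = abs(float(x))
    if x == 0:
        raise ValueError("zero has no leading significant digit")
    # Scientific notation always starts with the leading significant digit,
    # e.g. 0.0072 -> '7.2000...e-03' and 4321 -> '4.3210...e+03'
    return int(f"{x:.15e}"[0])

print(first_digit_any(4321), first_digit_any(0.0072), first_digit_any(-950))
```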

In [None]:
BENFORD = [30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6]

A typical Benford distribution looks like the plot below:

In [None]:
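plt.figure(figsize=(15,10))
plt.plot(range(1, 10), BENFORD)  # x-axis shows the digits 1-9
plt.xlabel('First digit')
plt.ylabel('Frequency (%)')
plt.title("Benford's Law distribution of first digits")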

In [None]:
#Source: https://towardsdatascience.com/frawd-detection-using-benfords-law-python-code-9db8db474cf8
def chi_square_test(data_count,expected_counts):
    """Return True if the chi-square statistic is below the critical value
    (8 degrees of freedom, significance level 0.05), i.e. the data conforms."""
    chi_square_stat = 0  # chi-square test statistic
    for data, expected in zip(data_count,expected_counts):
        # squared deviation from the expected value, scaled by the expected value
        chi_square_stat += math.pow(data - expected, 2) / expected

    print("\nChi-squared Test Statistic = {:.3f}".format(chi_square_stat))
    print("Critical value at a significance level of 0.05 is 15.51.")
    return chi_square_stat < 15.51

## COVID WORLDWIDE

Before moving on to India, let's first check how well the worldwide data conforms with Benford's Law.

In [None]:
covid_daily = pd.read_csv('../input/corona-virus-report/day_wise.csv')
covid_daily.head()

In [None]:
confirmed_fd = []
confirmed = covid_daily.Confirmed.values

for i in confirmed:
    confirmed_fd.append(firstDigit(i))

In [None]:
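# Count first digits in digit order (1-9); digits that never occur get a count of 0
confired_fd_counts = pd.Series(confirmed_fd).value_counts().reindex(range(1, 10), fill_value=0).values

confired_fd_percent = (confired_fd_counts/np.sum(confired_fd_counts))*100

plt.figure(figsize=(15,10))
plt.plot(range(1, 10), confired_fd_percent)
plt.plot(range(1, 10), BENFORD)
plt.legend(['Covid worldwide','BENFORD'])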

In [None]:
chi_square_test(confired_fd_percent,BENFORD)

Since `chi_square_test` returned `True`, the above dataset conforms with Benford's Law for first digits. No foul play here.

## STATE-WISE FOR INDIA

In [None]:
india_daily = pd.read_csv('../input/covid19-in-india/covid_19_india.csv')

In [None]:
india_daily.head(-5)

In [None]:
india_daily['State/UnionTerritory'].unique()

This dataset has typos in the name of Telangana state, so replace the misspelt variants with the correct spelling.

In [None]:
# Map every misspelt variant to 'Telangana'; assigning back avoids chained inplace replacement
india_daily['State/UnionTerritory'] = india_daily['State/UnionTerritory'].replace(
    {'Telengana': 'Telangana', 'Telangana***': 'Telangana', 'Telengana***': 'Telangana'})

In [None]:
india_daily['State/UnionTerritory'].unique()

In [None]:
#Containers for states that failed to conform with Benford's Law
confirmed_sus_states = []
cured_sus_states = []
death_sus_states = []

In [None]:
def covid_distribution(state_name, col):
    state = india_daily[india_daily['State/UnionTerritory'] == state_name]
    state = state[state[col]!=0]
    if(len(state)>100):
        state_fd = []
        state_confirmed = state[col].values

        for i in state_confirmed:
            state_fd.append(firstDigit(i))


        # Count first digits in digit order (1-9); digits that never occur get a count of 0
        confired_fd_counts = pd.Series(state_fd).value_counts().reindex(range(1, 10), fill_value=0).values
        confired_fd_percent = (confired_fd_counts/np.sum(confired_fd_counts))*100

        if(chi_square_test(confired_fd_percent,BENFORD)):
            title = "{0} {1} conforms with Benford's Law".format(state_name, col)
        else:
            title = "{0} {1} does not conform with Benford's Law (possible manipulation)".format(state_name, col)

            if(col=='Confirmed'):
                confirmed_sus_states.append(state_name)
            elif(col=='Cured'):
                cured_sus_states.append(state_name)
            else:
                death_sus_states.append(state_name)

        plt.figure(figsize=(15,10))
        plt.plot(range(1, 10), confired_fd_percent)
        plt.plot(range(1, 10), BENFORD)
        plt.legend(['Statewise distribution',"Benford's distribution"])
        plt.title(title)
        plt.show()
    else:
        print("{0} for #{1} cases doesn't have enough records to run Benford's Law".format(state_name,col))

In [None]:
states = ['Kerala', 'Telangana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
       'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
       'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
       'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
       'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
       'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
       'Goa',  'Assam', 'Jharkhand', 'Arunachal Pradesh',
       'Tripura', 'Nagaland', 'Meghalaya', 
       'Sikkim', 'Dadra and Nagar Haveli and Daman and Diu']

## Distributions of Confirmed cases for each state

In [None]:
for state in states:
    covid_distribution(state,'Confirmed' )

In [None]:
print('List of states that have suspicious entries for number of Confirmed cases are: ',confirmed_sus_states)

## Distribution of Cured for each state

In [None]:
for state in states:
    covid_distribution(state,'Cured' )

In [None]:
print('List of states that have suspicious entries for number of Cured cases are: ',cured_sus_states)

## Distribution of Deaths for each state

In [None]:
for state in states:
    covid_distribution(state,'Deaths' )

In [None]:
print('List of states that have suspicious entries for number of Deaths are: ',death_sus_states)

In [None]:
len(death_sus_states)

Out of the 35 states and union territories listed above, 12 have suspicious numbers for #Deaths!

In [None]:
#States that have suspicious behaviour on all three: Confirmed, Cured, Deaths

states_all_sus = set(death_sus_states).intersection(set(cured_sus_states)).intersection(set(confirmed_sus_states))

In [None]:
print(states_all_sus)

It looks like most of the states' reported numbers are suspicious. Any thoughts on this weird phenomenon?

[NOTE]: The second digit test still needs to be performed; this notebook will be updated with it soon.
