Defining isBenford Using KS Statistics

In [10]:
import numpy as np
import pandas as pd
np.random.seed(6011993)
from pandas import DataFrame,Series
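def isBenford_ks(x):
    """Check whether the first digits of x follow Benford's law using a KS-style statistic.

    Returns (statistic, cutoff, statistic <= cutoff, frequency table).
    Assumes every value has absolute value >= 1, so the first character is the leading digit.
    """
    # Extract the first digit from the string form of each absolute value
    first_digits = Series(np.array([s[0] for s in np.abs(x).astype('str')]).astype('int'))
    counts = first_digits.value_counts().sort_index()
    x = counts.to_frame('Count')  # name the column explicitly so it works across pandas versions
    x['Actual Distribution'] = x['Count'] / x['Count'].sum()
    # Expected distribution according to Benford's law: P(d) = log10(1 + 1/d)
    x['Expected Distribution'] = np.log10(1 + 1.0 / x.index)
    x['Cumulative Difference'] = np.abs(x['Actual Distribution'].cumsum()
                                        - x['Expected Distribution'].cumsum())
    actual = x['Cumulative Difference'].max()
    cutoff = 1.36 / np.sqrt(x['Count'].sum())  # approximate 5% KS critical value

    return (actual, cutoff, actual <= cutoff, x)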
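
The two formulas inside isBenford_ks can be checked on their own. The sketch below is illustrative only: benford_probs and ks_cutoff are hypothetical helper names that do not appear in the notebook, and 1.36 is taken to be the commonly quoted large-sample coefficient for a 5% KS test.

```python
import numpy as np

def benford_probs():
    """Benford's expected probability for each first digit 1..9 (hypothetical helper)."""
    digits = np.arange(1, 10)
    return np.log10(1 + 1.0 / digits)

def ks_cutoff(n, coeff=1.36):
    """Approximate KS critical value for a sample of size n (hypothetical helper)."""
    return coeff / np.sqrt(n)

probs = benford_probs()
print(probs.round(6))      # expected share of each first digit
print(probs.sum())         # the nine probabilities should add up to 1
print(ks_cutoff(1000))     # cutoff for a sample of 1000 values
```

For 1000 values this gives the same cutoff that appears in the outputs of this notebook.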

Generating random variables and checking that their absolute values are at least 1, so that the first digit can be extracted

In [11]:
y = (np.random.normal(1,1,1000))*100000
z = (np.random.normal(0,1,1000))*100000
x = (np.random.uniform(0,1,1000))*100000

print (np.sum(np.abs(x)<1))
print (np.sum(np.abs(y)<1))
print (np.sum(np.abs(z)<1))

0
0
0
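
The check matters because isBenford_ks reads the first character of each number's string form. A minimal sketch of what happens for a magnitude below 1, using made-up sample values:

```python
import numpy as np

# Made-up sample values: one large, one below 1, one negative
values = np.array([12345.678, 0.05, -7.2])

# Same approach as isBenford_ks: first character of the string form of |value|
first_chars = [str(s)[0] for s in np.abs(values).astype('str')]
print(first_chars)
```

For 0.05 the first character is '0' rather than the leading digit 5, which is why the samples are scaled up until every magnitude is at least 1.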


Running the KS test to see whether they follow Benford's distribution

In [12]:
x_actual,x_cutoff,x_isBenford,x_dist =  isBenford_ks(x)
y_actual,y_cutoff,y_isBenford,y_dist =  isBenford_ks(y)
z_actual,z_cutoff,z_isBenford,z_dist =  isBenford_ks(z)
print ("For x Distribution:")
print (x_dist)
print ("Actual KS Statistic: {0}, Cut-off: {1}, isBenford = {2}".format(x_actual,\
x_cutoff,x_isBenford))
print ("For y Distribution:")
print (y_dist)
print ("Actual KS Statistic: {0}, Cut-off: {1}, isBenford = {2}".format(y_actual,\
y_cutoff,y_isBenford))
print ("For z Distribution:")
print (z_dist)
print ("Actual KS Statistic: {0}, Cut-off: {1}, isBenford = {2}".format(z_actual,\
z_cutoff,z_isBenford))

For x Distribution:
   Count  Actual Distribution  Expected Distribution  Cumulative Difference
1    112                0.112               0.301030           1.890300e-01
2    121                0.121               0.176091           2.441213e-01
3    110                0.110               0.124939           2.590600e-01
4    113                0.113               0.096910           2.429700e-01
5     95                0.095               0.079181           2.271513e-01
6    112                0.112               0.066947           1.820980e-01
7    119                0.119               0.057992           1.210900e-01
8    107                0.107               0.051153           6.524251e-02
9    111                0.111               0.045757           1.110223e-16
Actual KS Statistic: 0.2590599913279624, Cut-off: 0.04300697617828996, isBenford = False
For y Distribution:
   Count  Actual Distribution  Expected Distribution  Cumulative Difference
1    428                0.428      

As the output above shows, the KS statistic for each of the random variables is greater than the cutoff value, so none of them obeys Benford's law.
For the uniform distribution, every first digit is equally likely, which contradicts Benford's law.
For the normal distributions, most of the data is concentrated around the mean, so the first digits are not free to spread out according to Benford's law. These are not naturally occurring datasets.
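
The claim about the uniform distribution can be checked on a larger sample. This is a sketch under stated assumptions: first_digit is a hypothetical helper (not the notebook's string-slicing approach), and the seed and sample size are arbitrary.

```python
import numpy as np

def first_digit(v):
    """Leading significant digit of a nonzero number (hypothetical helper)."""
    # Scientific notation always starts with the leading digit, e.g. '4.2e+03'
    return int(f"{abs(v):.15e}"[0])

rng = np.random.default_rng(0)             # seed chosen arbitrarily
sample = rng.uniform(0, 1, 100_000) * 100_000
sample = sample[sample > 0]                # guard against an exact zero

digits = np.array([first_digit(v) for v in sample])
observed = np.bincount(digits, minlength=10)[1:] / len(digits)
benford = np.log10(1 + 1.0 / np.arange(1, 10))

# Print digit, observed share, Benford share
for d in range(1, 10):
    print(d, round(observed[d - 1], 3), round(benford[d - 1], 3))
```

Each observed share should sit close to 1/9, far from the roughly 30% that Benford's law expects for the digit 1.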


Now, let's see whether the products of randomly generated numbers follow Benford's law. We generate the product and check that the absolute values are at least 1, so that the first digit can be extracted

In [13]:
x = (np.random.uniform(0,1,1000))
y = (np.random.normal(1,1,1000))
z = (np.random.normal(0,1,1000))
p = x * y * z * 1000000000000
print (np.sum(np.abs(p)<1))

0


In [14]:
p_actual,p_cutoff,p_isBenford,p_dist =  isBenford_ks(p)
print ("For Product:")
print (p_dist)
print ("Actual KS Statistic: {0}, Cut-off: {1}, isBenford = {2}".format(p_actual,\
p_cutoff,p_isBenford))

For Product:
   Count  Actual Distribution  Expected Distribution  Cumulative Difference
1    301                0.301               0.301030               0.000030
2    183                0.183               0.176091               0.006879
3    128                0.128               0.124939               0.009940
4     88                0.088               0.096910               0.001030
5     81                0.081               0.079181               0.002849
6     77                0.077               0.066947               0.012902
7     54                0.054               0.057992               0.008910
8     54                0.054               0.051153               0.011757
9     34                0.034               0.045757               0.000000
Actual KS Statistic: 0.012901959985743061, Cut-off: 0.04300697617828996, isBenford = True


As the output above shows, the KS statistic for the product of the 3 random variables is less than the cutoff value, so the product obeys Benford's law. This is because a product of 3 random variables behaves like a naturally occurring dataset.
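
One common explanation, offered here as an assumption rather than something the notebook demonstrates, is that multiplying variables spreads the values over more orders of magnitude. The sketch below compares that spread for each factor and for the product; magnitude_spread is a hypothetical helper, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n = 100_000
x = rng.uniform(0, 1, n)
y = rng.normal(1, 1, n)
z = rng.normal(0, 1, n)
p = x * y * z

def magnitude_spread(values):
    """Width of the central 90% of log10|values| (hypothetical helper)."""
    logs = np.log10(np.abs(values[values != 0]))
    low, high = np.percentile(logs, [5, 95])
    return high - low

spreads = {name: magnitude_spread(v)
           for name, v in [("x", x), ("y", y), ("z", z), ("p", p)]}
for name, width in spreads.items():
    print(name, round(width, 2))
```

The product should cover a noticeably wider range of magnitudes than any single factor.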

Now, let us test Benford's law on a Series that is the product of the 3 random distributions, keeping only the values whose absolute product is at least 0.1 and rounding them to 1 decimal place. We generate the product and check that the absolute values are at least 1, so that the first digit can be extracted

In [15]:
x = (np.random.uniform(0,1,1000))
y = (np.random.normal(1,1,1000))
z = (np.random.normal(0,1,1000))
p = x * y * z
p = p[np.abs(p)>=0.1]
p = np.round(p,decimals = 1)*10
print (np.sum(np.abs(p)<1))

0


In [16]:
p_actual,p_cutoff,p_isBenford,p_dist =  isBenford_ks(p)
print ("For Product after Rounding:")
print (p_dist)
print ("Actual KS Statistic: {0}, Cut-off: {1}, isBenford = {2}".format(p_actual,\
p_cutoff,p_isBenford))

For Product after Rounding:
   Count  Actual Distribution  Expected Distribution  Cumulative Difference
1    189             0.291667               0.301030               0.009363
2    141             0.217593               0.176091               0.032138
3     95             0.146605               0.124939               0.053804
4     68             0.104938               0.096910               0.061832
5     50             0.077160               0.079181               0.059812
6     37             0.057099               0.066947               0.049964
7     29             0.044753               0.057992               0.036725
8     14             0.021605               0.051153               0.007177
9     25             0.038580               0.045757               0.000000
Actual KS Statistic: 0.06183246479978366, Cut-off: 0.05342584568965026, isBenford = False


As the output above shows, the KS statistic for the rounded product of the 3 random variables is greater than the cutoff value, so the rounded values do not obey Benford's law. There are two reasons. First, we had to discard every value with magnitude below 0.1, because rounding to 1 decimal place would turn many of them into 0. Second, once rounded, the data no longer behaves like naturally occurring numbers.
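
Both effects can be measured on a larger sample. This is a rough sketch, not part of the original analysis: first_digit is a hypothetical helper, and the seed and sample size are arbitrary.

```python
import numpy as np

def first_digit(v):
    """Leading significant digit of a nonzero number (hypothetical helper)."""
    return int(f"{abs(v):.15e}"[0])

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n = 100_000
p = rng.uniform(0, 1, n) * rng.normal(1, 1, n) * rng.normal(0, 1, n)

kept = p[np.abs(p) >= 0.1]                 # same filter as the notebook
rounded = np.round(kept, decimals=1)       # same rounding as the notebook

before = np.array([first_digit(v) for v in kept])
after = np.array([first_digit(v) for v in rounded])

dropped_share = 1 - len(kept) / n
changed_share = np.mean(before != after)
print("share of values dropped by the filter:", round(dropped_share, 3))
print("share of kept values whose first digit changed:", round(changed_share, 3))
```

A value such as 0.17 rounds to 0.2, so its first digit moves from 1 to 2. This is one way the rounding step can distort the first-digit distribution.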