In [None]:
import pandas as pd 
import numpy as np
import pymc3 as pm


import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import arviz as az
from matplotlib import pyplot as plt 
ProgressBar().register()

Based on https://www.kaggle.com/code/raddar/the-data-has-random-uniform-noise-added/notebook.  

In [None]:
df = dd.read_csv('/kaggle/input/amex-default-prediction/train_data.csv')

The dataset is very large, but for our model it matters little whether we use 1 million rows or 200 thousand:     
the shape of the distribution stays roughly the same.
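
One way to make "roughly the same" concrete is to compare a few quantiles of the full column with those of a subsample. The sketch below uses synthetic Laplace data as a stand-in (it is not the real P_3 column, and the location and scale are invented), so it only illustrates the check, not the result on the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a large column (NOT the real P_3 data).
full = rng.laplace(loc=0.6, scale=0.1, size=1_000_000)
# Subsample without replacement, similar in spirit to Series.sample(...).
sample = rng.choice(full, size=200_000, replace=False)

# Compare a handful of quantiles between the full data and the subsample.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
full_q = np.quantile(full, qs)
sample_q = np.quantile(sample, qs)
max_gap = np.max(np.abs(full_q - sample_q))

print("full:  ", np.round(full_q, 4))
print("sample:", np.round(sample_q, 4))
print("largest quantile gap:", max_gap)
```

If the largest gap is small relative to the spread of the data, the subsample is a reasonable proxy for the full column.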

To speed up the model, we will work with a sample. I chose the P_3 field because its distribution looks clean.     
Using this field as an example, I will try to learn something about the noise in the data (thanks to https://www.kaggle.com/raddar for finding it).      

In [None]:
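# Drop missing values before sampling so the sample keeps all 100,000 rows.
p3data = df['P_3'].compute().dropna()
p3data_sample = p3data.sample(100000, random_state=0)  # fixed seed for reproducibility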

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(12.5, 5.5)

axs[0].hist(p3data_sample, bins=500);
axs[1].hist(p3data, bins=500);

The shape of the distribution is preserved, so our sample is fine.     
Now let's model the distribution. Its shape resembles a Laplace distribution. Following raddar, we assume that the noise is uniformly distributed.     
We also assume that the lower and upper bounds of the noise are drawn from normal distributions, with the upper bound constrained to be no less than the lower.    
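
To see what these assumptions imply, here is a small simulation of a Laplace variable with independent uniform noise added. All numbers (location 0.6, scale 0.1, bounds 0 and 0.01) are assumed for illustration and are not estimates from the data. For independent noise on [lower, upper], the mean shifts by (lower + upper) / 2 and the variance grows by (upper - lower)^2 / 12.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

lower, upper = 0.0, 0.01                          # assumed noise bounds
clean = rng.laplace(loc=0.6, scale=0.1, size=n)   # hypothetical noise-free values
noise = rng.uniform(lower, upper, size=n)         # independent uniform noise
noisy = clean + noise

mean_shift = noisy.mean() - clean.mean()          # theory: (lower + upper) / 2
noise_var = noise.var()                           # theory: (upper - lower)**2 / 12

print("mean shift:", mean_shift, "theory:", (lower + upper) / 2)
print("noise variance:", noise_var, "theory:", (upper - lower) ** 2 / 12)
print("Laplace variance (2 * scale**2):", 2 * 0.1 ** 2)
```

The noise variance is several orders of magnitude smaller than the Laplace variance, which is why the noise is hard to spot in a histogram of the whole column.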

In [None]:
with pm.Model() as noise_model:

    # Following raddar, assume the lower bound of the noise is close to 0.
    lower_bound_noise = pm.Normal('lower_bound_noise', mu=0, sigma=.02)
    # The upper bound cannot be less than the lower bound.
    BoundedNormal = pm.Bound(pm.Normal, lower=lower_bound_noise)
    # Assume the upper bound is close to 0.01.
    upper_bound_noise = BoundedNormal('upper_bound_noise', mu=0.01, sigma=.02)

    noise = pm.Uniform('noise', lower=lower_bound_noise, upper=upper_bound_noise)

    # Empirical ranges, read approximately off the histogram of the sample.
    mu = pm.Uniform('mu', lower=.3, upper=.8)
    b = pm.Uniform('sigma', lower=.0, upper=.5)  # Laplace scale b, registered under the name 'sigma'

    # Note: `+ noise` is applied to the observed node after the likelihood is defined,
    # so the Laplace likelihood itself is evaluated on the raw sample.
    P_3 = pm.Laplace('P_3', mu=mu, b=b, observed=p3data_sample) + noise

Okay, now we want to sample from the posterior.    
Our first goal is to check that the model works. How? First, we want to see that the chains ended up exploring roughly the same distribution.    
Second, we want to see that data simulated from the model looks approximately like the data we observed.      
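
Whether chains "worked out the same way" can also be checked numerically. The sketch below implements the classic Gelman-Rubin potential scale reduction factor in plain NumPy on made-up chains; the function name `gelman_rubin` is hypothetical, and ArviZ's own `az.rhat` uses a more refined rank-normalized split variant by default, so the numbers will not match it exactly.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()      # W: mean within-chain variance
    between = n_draws * chain_means.var(ddof=1)     # B: between-chain variance
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(2)
good = rng.normal(0.0, 1.0, size=(2, 2000))   # two chains drawn from the same distribution
bad = good + np.array([[0.0], [3.0]])         # second chain shifted away from the first

rhat_good = gelman_rubin(good)
rhat_bad = gelman_rubin(bad)
print("agreeing chains:", rhat_good)
print("disagreeing chains:", rhat_bad)
```

Values close to 1 suggest that the chains agree; values well above 1 suggest that they are exploring different regions.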

NOTE:    

My PC has more than 16 GB of RAM, which lets me use more than 2000 draws. Kaggle notebooks run out of memory when you try to plot the posterior predictive, or anything else, with more than 2000 draws. 
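
A back-of-the-envelope estimate suggests why memory becomes the bottleneck. The sketch below assumes the posterior predictive holds one float64 value per draw, per chain and per observation, with up to 100,000 observations; the helper name `ppc_size_gib` is made up for illustration, and real memory use will be higher because of intermediate copies.

```python
def ppc_size_gib(n_draws, n_chains, n_obs, bytes_per_value=8):
    """Approximate size in GiB of an array of posterior predictive values."""
    return n_draws * n_chains * n_obs * bytes_per_value / 2**30

# 2000 draws, 2 chains, 100,000 observations, float64 values (assumed)
size = ppc_size_gib(2000, 2, 100_000)
print(f"{size:.2f} GiB")
```

Under these assumptions the array alone takes roughly 3 GiB, and it scales linearly with the number of draws.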

In [None]:
with noise_model:
    posterior = pm.sample(draws=2000, tune=10000, cores=2, progressbar=True)

In [None]:
with noise_model:
    az.plot_trace(posterior, var_names=["lower_bound_noise", "upper_bound_noise"]);

The lower-bound chains do not match each other perfectly, but that is acceptable here: with more draws and more chains they agree much better.     
What about the posterior predictive samples?

In [None]:
with noise_model:  
    posterior_pred = pm.sample_posterior_predictive(posterior)

In [None]:
a = az.from_pymc3(posterior_predictive=posterior_pred, model=noise_model)
az.plot_ppc(a);    

Nice, the model produces a distribution very close to the observed one. Final step!

In [None]:
a = az.from_pymc3(posterior, model=noise_model)
az.plot_posterior(a, var_names=['lower_bound_noise', 'upper_bound_noise'], ref_val= [0, 0.01]);
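
The posterior plot reports an HDI (highest density interval) for each bound: the narrowest interval that contains a given share of the draws, 94% by default in ArviZ. As a rough illustration of the idea, the sketch below computes such an interval in plain NumPy on made-up draws. The function name `hdi` and the simulated values are assumptions for the example, not output of the model above.

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing roughly `prob` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.floor(prob * n))        # index offset spanned by each candidate interval
    widths = x[k:] - x[: n - k]        # widths of all candidate intervals
    i = int(np.argmin(widths))         # the narrowest one starts here
    return x[i], x[i + k]

rng = np.random.default_rng(3)
# Hypothetical posterior draws for an upper bound centred on 0.01.
draws = rng.normal(loc=0.01, scale=0.002, size=20_000)
lo, hi = hdi(draws)
print("94% HDI:", lo, hi)
print("contains 0.01:", lo < 0.01 < hi)
```

A reference value such as 0 or 0.01 falling inside the HDI is consistent with the corresponding hypothesis about that bound.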

We can see that raddar was right, and we can accept the hypothesis that there is artificial noise in our data. If you apply the same method to other fields whose distributions look reasonable, the result will be similar. As the data sample grows and the number of posterior draws increases, the HDI will tend to 0 for the lower bound and to 0.01 for the upper bound. Thanks for reading!