Initial version forked from SCTP-working(LGB) https://www.kaggle.com/dromosys/sctp-working-lgb courtesy of Dromosys

This kernel is intended to examine a statistical hypothesis about the population which contest entries sometimes assume but may not always test. 

This formal statistical testing may help you validate your contest-winning strategy!

I noticed that several entries in this contest look at the TRAINING set and compute the percentage of targets equal to "1". These entries go on to hypothesize that the percentage of targets equal to "1" in the TEST set is the same. On the strength of this hypothesis, they use stratified k-fold segmentation and other techniques that assume a known probability. Are these techniques really justified, and how can we test this hypothesis?

Statistical analysis offers us a straightforward way to test this hypothesis, as follows:

* We begin with a Hypothesis H-0 and assume it is correct. We also identify the opposite (or alternate) hypothesis H-a.
* We then choose a level of significance we are targeting, alpha, which equals the probability that we find H-0 false when it really is true. Common values for alpha are 5%, 10%, or even as low as 1%. Note that this also gives us a confidence level c = 1 - alpha.
* Next, we take random samples (of sufficient size) from the population and test whether any of them give statistically significant evidence against H-0. If H-0 is true, we expect only a small fraction of them (about alpha) to be statistically significant.
* Finally, we will extend our findings to further statistical inferences about the data that are useful for this competition.

Note: we can never prove our hypothesis; we can only disprove it, or fail to disprove it!
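The steps above can be sketched for a single sample as a two-sided one-proportion z-test. This is only a minimal illustration, not the notebook's own code: the function name `one_proportion_z_test` and the numbers (430 ones out of 4000, tested against p0 = 0.10) are hypothetical, and the standard error here is computed from the hypothesized p0 rather than from the sample's p_hat.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alpha=0.05):
    """Two-sided z-test of H-0: p == p0 for one sample proportion (illustrative helper)."""
    p_hat = successes / n
    std_error = sqrt(p0 * (1 - p0) / n)           # standard error assuming H-0 is true
    z = (p_hat - p0) / std_error                  # how many standard errors p_hat is from p0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # edge of the two-tailed rejection region
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, abs(z) > z_crit


# Hypothetical sample: 430 targets equal to "1" among 4000 rows, H-0 says p = 0.10
z, p_value, reject = one_proportion_z_test(430, 4000, 0.10)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H-0 at alpha = 5%: {reject}")
```

A result of `reject == False` does not prove H-0; it only means this sample failed to disprove it.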

In [None]:
# First some "import" housekeeping
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('whitegrid')
import os
import time
import lightgbm as lgb
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.model_selection import GridSearchCV
import math
from math import sqrt

In [None]:
# Here is the data 
print(os.listdir("../input"))
train_df = pd.read_csv('../input/train.csv')
print('Rows: ',train_df.shape[0],'Columns: ',train_df.shape[1])
train_df.info()

Here are the first few examples from the training set

In [None]:
train_df.head()

In [None]:
# The usual split of the data into an X and a y array
X = train_df.drop(['ID_code','target'],axis=1)
y = train_df['target']

Here are the target value counts. It is from these target value counts that we will form our hypothesis. 

In statistics terminology, the Santander problem is a Bernoulli trial of the random variable X. 
* The probability of having a "1" for the target is p, and so the expected value of X is p, and the variance is p * (1 - p)
* If we take n samples from the Santander population and add up all the ones that had a target of "1", the total T will follow a binomial distribution, with an integer value anywhere from 0 to n.
* The binomial distribution is such that the probability of having k targets = 1 is P(T = k) = (n choose k) * p^k * (1 - p)^(n - k)
* This makes the expected value of T = np, the variance of T = n * p * (1 - p), and the standard deviation of T = sqrt(n * p * (1 - p)) or equivalently, (n * p * (1 - p)) ^ 0.5
* Since we cannot sample the entire Santander population, we can only make estimates. We can estimate p as a value we call p_hat
* Using the above sample from the population, p_hat is by definition T / n, so the variance of p_hat is p * (1 - p) / n, and the standard deviation (sometimes called standard error because we're only talking about a sample) is the square root of the variance.
* We fully expect that p_hat approaches p as n approaches infinity.
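As a quick sanity check on the binomial facts listed above, the probabilities can be summed exactly for a small case. This is a sketch with hypothetical values (n = 50, p = 0.1), not the competition data; `binomial_pmf` is an illustrative helper.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(T = k) for T ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n, p = 50, 0.1  # hypothetical sample size and probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                              # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))                # should equal n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should equal n * p * (1 - p)

print("sum of probabilities:", total)
print("mean of T:", mean, "vs n * p =", n * p)
print("variance of T:", variance, "vs n * p * (1 - p) =", n * p * (1 - p))
print("variance of p_hat = T / n:", variance / n**2, "vs p * (1 - p) / n =", p * (1 - p) / n)
```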

**How confident are we** in the use of p_hat (or "p from the training set") as the value p in our Bayes theorem, and therefore the use of "StratifiedKFold" etc.?

* First we see the **probability p that the target is "1"** in the training set as being simply the number of "1" targets divided by the whole set; where the probability of "not-p" is q = 1 - p.
* Then we calculate the **minimum number of samples** we need to take from the training set such that, within those samples, p_hat (the fraction of "1" targets) is approximately normally distributed. This is true if np ≥ 5 and nq ≥ 5, where n is the number of items in each sample.
* If the above is true, and we repeatedly take samples of size n from the training set, then the mean value of p_hat for those samples will be mu(p_hat) = p, and the standard deviation will be sigma(p_hat) = ( (p * q) / n ) ^ 0.5
* We can create what is called a "two-tailed" hypothesis stating that 95% of p_hat values will fall within the range of p plus or minus 1.96 sigma(p_hat)

* We can then find confidence intervals, after we shift and stretch the distribution until it has a standard normal distribution (mean = 0, standard deviation = 1), by using Z-Tables for the standard normal distribution
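To make these quantities concrete, here is a small sketch of the arithmetic. It assumes p = 0.10049 (the training-set value quoted in this notebook) and a hypothetical sample size of n = 4000; it takes the 1.96 multiplier from Python's `statistics.NormalDist` instead of a Z-table.

```python
from math import ceil, sqrt
from statistics import NormalDist

p = 0.10049          # assumed probability of target == 1
q = 1 - p
n = 4000             # hypothetical number of rows per sample

# Smallest n for which both n * p >= 5 and n * q >= 5 hold
n_min = ceil(5 / min(p, q))

# Standard deviation of p_hat for samples of size n
sigma_p_hat = sqrt(p * q / n)

# Two-tailed 95% range: p plus or minus z * sigma(p_hat)
z_95 = NormalDist().inv_cdf(0.975)
low, high = p - z_95 * sigma_p_hat, p + z_95 * sigma_p_hat

print("minimum sample size:", n_min)
print("sigma(p_hat) for n =", n, ":", sigma_p_hat)
print(f"95% of p_hat values expected in [{low:.5f}, {high:.5f}]")
```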


In [None]:
split = train_df['target'].value_counts()
print("target value counts ")
print( split)
tot = split[0] + split[1]
p = split[1]/(split[0] + split[1])
q = 1 - p
sigma = sqrt((p * q) / tot)
# If np ≥ 5 and nq ≥ 5 for a binomial distribution, then the sampling can have distribution for p-hat that is approximately normal
# and since p is smaller than q, we have
min_sample = int((5 / p)+1)
print("Probability p of target being 1: ", p )
print("Probability q of target being 0: ", q )
print("Minimum number of samples of test set for an approximately normal distribution is: ", min_sample )
# Note: sigma above uses n = tot (the whole training set), not the minimum sample size
print('Standard error of p_hat for a sample the size of the whole training set ', sigma)
# Pass the column as x= explicitly; the 'whitegrid' style was already set with the imports
sns.countplot(x=train_df['target'])

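So what this means is that if we repeatedly take a sample of n rows from our training set and calculate p_hat in each sample, the p_hat values should form an approximately normal distribution with a mean of p and a standard deviation of sqrt(p * q / n). Note that the sigma computed above uses n = the whole training set, so it is much smaller than the spread we should expect from smaller samples.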

* With 200,000 samples in our training set, we can perform at most 4,000 tests, since each test must contain at least 50 samples.
* Let's try that with 50 tests (folds) taking 4000 random samples each.
* NOTE: we are not using "StratifiedKFold" and instead we use simply "KFold" because we purposely do not want to assume the probability "p_hat"
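The difference between the two splitting styles can be illustrated without the competition data. The sketch below is a plain-Python stand-in, not scikit-learn's implementation: it builds hypothetical labels (200,000 rows, exactly 10% ones), then compares a shuffled split with a split that deals ones and zeros evenly across folds, which is the idea behind stratification.

```python
import random
from statistics import pstdev

random.seed(12345)
n_rows, n_fold = 200_000, 50   # hypothetical sizes
n_ones = 20_000                # hypothetical: exactly 10% of rows have target 1
fold_size = n_rows // n_fold

# Plain shuffled split: each fold gets whatever mix of ones and zeros chance provides
shuffled = [1] * n_ones + [0] * (n_rows - n_ones)
random.shuffle(shuffled)
plain_p_hats = [sum(shuffled[i * fold_size:(i + 1) * fold_size]) / fold_size
                for i in range(n_fold)]

# Stratified-style split: deal the ones and the zeros round-robin across folds
ones = [1] * n_ones
zeros = [0] * (n_rows - n_ones)
strat_p_hats = []
for i in range(n_fold):
    fold = ones[i::n_fold] + zeros[i::n_fold]
    strat_p_hats.append(sum(fold) / len(fold))

print("spread of p_hat, shuffled split:  ", pstdev(plain_p_hats))
print("spread of p_hat, stratified split:", pstdev(strat_p_hats))
```

Because the stratified-style split forces every fold to the same p_hat, it carries no information about how much p_hat would vary in fresh samples, which is why the experiment here relies on the shuffled split.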

In [None]:
n_fold = 50 # number of sets of samples to randomly take from the training set
n = tot / n_fold
all_p_hats = np.zeros(n_fold)
all_std_errors = np.zeros(n_fold)

folds = KFold(n_splits=n_fold, shuffle=True, random_state = 12345)
for fold_n, (remainder, this_fold) in enumerate(folds.split(X,y)):
    y_this_fold = y.iloc[this_fold]
    n = y_this_fold.count()
    p_hat = y_this_fold.sum() / n
    std_error = sqrt((p_hat * (1 - p_hat)) / n)
#    print("Fold k = ", fold_n, "has a probability p of target being 1: ", p_hat  )
    all_p_hats[fold_n] = p_hat
    all_std_errors[fold_n] = std_error
        
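# Each fold holds n rows, so the predicted spread of p_hat is sqrt(p * q / n),
# not the sigma computed earlier from the whole training set
expected_sigma = sqrt((p * q) / n)
print('Mean of distribution ', all_p_hats.mean(), 'as compared to the predicted ', p)
print('Standard Deviation of distribution', all_p_hats.std(), 'as compared to the predicted  ', expected_sigma)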
# Use Freedman–Diaconis rule to pick a good number of bins 
bin_width = 2.0 * ((np.percentile(all_p_hats, q=75) - np.percentile(all_p_hats, q=25)) / np.cbrt(n_fold))
bins = int ( (all_p_hats.max() - all_p_hats.min() ) / bin_width )
sns.distplot(all_p_hats, bins=bins)

You can try different values for "n_fold" and thereby change the number of experiments, but be sure not to exceed 4,000: beyond that, each fold holds fewer than 50 samples and our assumption of an approximately normal distribution is violated.

So from the above, it looks like the mean is spot-on, whereas the standard deviation is a bit bigger than expected (a bit wide). What does this mean to our hypothesis?

In order to formally test a hypothesis, we need to formally state the hypothesis.
* H-0, our null hypothesis, states that the probability of the target being equal to "1" in the population is p = 10.049 %
* We choose a level of significance alpha such as 5%, which would equivalently give us a confidence level c = 95%
* We repeat our tests (as per the above) and check whether any of them give statistically significant evidence that H-0 is false. If H-0 is true, we expect only a small fraction of them (about alpha) to be statistically significant.

First, shift and rescale the distribution into a "standard" normal distribution (zero mean, variance = 1). This gives us a "Z-value" for each of our samples (folds).
* The graph should look roughly the same, just shifted and rescaled.

In [None]:
all_zs = np.zeros(n_fold)
for j in range (n_fold) :
    p_hat = all_p_hats[j]
    std_error = all_std_errors[j]
    z = (p_hat - p) / std_error
#    print("Fold k = ", j, "has a z value of ", z  )
    all_zs[j] = z
bin_width = 2.0 * ((np.percentile(all_zs, q=75) - np.percentile(all_zs, q=25)) / np.cbrt(n_fold))
bins = int ( (all_zs.max() - all_zs.min() ) / bin_width )
sns.distplot(all_zs, bins=bins)

Besides the null and alternate hypotheses, we also need a level of "significance".
* If we choose a level of significance alpha such as 5%, it equivalently gives us a confidence level c = 95%
* We now examine our z values again, this time comparing them to the rejection regions. This comparison is known as the Z-test.

For each alpha, count how many z values exceed the + or - z0 rejection regions for our given confidence level.
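The z0 cut-offs for a two-tailed test do not have to be read from a Z-table; they can be derived from alpha. As a sketch, using only Python's standard library (`statistics.NormalDist`, available from Python 3.8), the critical value is the point that leaves alpha / 2 in each tail:

```python
from statistics import NormalDist

alphas = (0.5, 0.4, 0.2, 0.1, 0.05, 0.01, 0.001, 0.0001)

# For a two-tailed test, z0 leaves alpha / 2 of the probability in each tail
z0s_computed = [NormalDist().inv_cdf(1 - alpha / 2) for alpha in alphas]

for alpha, z0 in zip(alphas, z0s_computed):
    print(f"alpha = {alpha:<7} confidence = {(1 - alpha) * 100:.2f}%  z0 = {z0:.3f}")
```

This also makes it easy to try alpha values that are not in a printed table.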

In [None]:
# Choose a few Alphas
alphas = (.5, .4, .2, 0.1, 0.05, 0.01, 0.001, 0.0001)
z0s = (.675, .84, 1.28, 1.645, 1.960, 2.576, 3.290, 3.891)
num_alphas = len(alphas)
rej_count = np.zeros(num_alphas, dtype = int)
expected_rej_count = np.zeros(num_alphas)
for j in range(num_alphas) :
    expected_rej_count[j] = int(alphas[j] * n_fold)

for i in range (n_fold):
    this_z = all_zs[i]
    for j in range(num_alphas) :
        alpha = alphas[j]
        z0 = z0s[j] 
        if this_z > z0 :
            rej_count[j] += 1
        if this_z < -z0 :
            rej_count[j] += 1

print ("Alpha values we will test")
print(alphas)
print ("Expected number (alpha percent out of ",n_fold,") that will fail the z-test")
print(expected_rej_count)
print ("Actual number that failed the z-test")
print(rej_count)
print("Confidence level = 1 - Alpha")
for j in range(num_alphas) :
    if (expected_rej_count[j] < rej_count[j]) :
        print ("For confidence level of", ((1 - alphas[j])*100), "percent, the hypothesis is proven FALSE")
    else : 
        print ("For confidence level of", ((1 - alphas[j])*100), "percent, the hypothesis may still be true")


CONCLUSION:
* Whenever the hypothesis is proven FALSE for a particular confidence level, we can no longer have that level of confidence that p and sigma will hold true for the population when p_hat and the standard error are measured for a sample from that population.
* If we prove our hypothesis FALSE at a certain confidence level, but it may still be true at GREATER confidence levels, then we might want to attribute that to the random nature of the selection (and may want to run more tests). In either case, we would stick with the HIGHEST CONFIDENCE LEVEL LOWER THAN A PROVEN FALSE LEVEL.
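One way to judge whether an excess of rejections could be down to "the random nature of the selection" is to ask how likely that count is when H-0 is true. The sketch below is an optional extra check, not part of the procedure above: it treats the 50 folds as independent trials that each reject with probability alpha (an approximation, since the folds partition one data set), and the observed counts of 3 and 6 are hypothetical.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(at least k of n independent tests reject) when each rejects with probability alpha."""
    return sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i) for i in range(k, n + 1))


n_fold = 50
alpha = 0.05
for observed in (3, 6):  # hypothetical rejection counts
    chance = prob_at_least(observed, n_fold, alpha)
    print(f"{observed} or more rejections out of {n_fold} at alpha = {alpha}: "
          f"probability {chance:.3f} if H-0 is true")
```

Under these assumptions, a count only slightly above the expected alpha * n_fold is quite likely by chance alone, so it is weak evidence against H-0 on its own.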

NEXT STEPS:
* Try additional numbers of folds, and/or different random selection methods (remember, using the above K-folds still makes sure every training sample is selected once and only once, rather than re-scrambling every fold)
* Once you have a good feel for your best confidence level, re-examine assumptions.

ASSUMPTIONS TO RE-EXAMINE:
* StratifiedKFold is teaching the classification algorithm an assumed value of p with a high confidence level, but is that a good assumption?
* Alternatively, randomly selecting from the training set (KFold) lets the p_hat dominate the training for that fold, which varies as we can see above.
* Finally, we may choose a different training / validation split methodology altogether. For example...
* How close are the training set samples to the test set samples (classification notwithstanding)? 
* Perhaps an adversarial validation scheme is warranted.

I hope this was helpful to you!