### Sampling

In [None]:
import pandas as pd
df = pd.read_csv("sample")
df

#### Simple Random Sampling

Say you want to select a subset of a population in which each member of the subset has an equal probability of being chosen.

Below we select 5,000 sample points from a dataset.

In [None]:
sample_df = df.sample(5000)

In [None]:
# sample_df
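
As a small self-contained illustration (the `toy` frame below is made up and is not the `sample` file loaded above), `DataFrame.sample` accepts either an absolute size `n` or a fraction `frac`, and `random_state` makes the draw repeatable:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"id": range(1000), "value": [i % 7 for i in range(1000)]})

# Draw 100 rows without replacement; every row is equally likely to be picked
by_count = toy.sample(n=100, random_state=42)

# Draw 10% of the rows instead of a fixed count
by_fraction = toy.sample(frac=0.1, random_state=42)

print(len(by_count), len(by_fraction))
```

Fixing `random_state` is optional, but it makes it easier to rerun the notebook and get the same subset.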

#### Stratified Sampling

Assume that we need to estimate the average number of votes for each candidate in an election. Assume that the country has 3 towns:

Town A has 1 million factory workers,

Town B has 2 million workers, and

Town C has 3 million retirees.

We could draw a random sample of size 60 from the entire population, but there is some chance that the sample turns out to be poorly balanced across these towns. Such a sample is biased and can cause a significant error in estimation.

Instead, if we take random samples of 10, 20 and 30 from Towns A, B and C respectively, we can produce a smaller error in estimation for the same total sample size.

You can do something like this pretty easily with Python:

In [None]:
X = df.iloc[:,:2]
y = df.iloc[:,2]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(X)

In [None]:
plt.hist(X_train)
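
The town example can also be sketched directly with pandas. This is only an illustration under stated assumptions: the `population` frame is hypothetical (scaled down by a factor of 1,000), and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical population, scaled down 1000x: 1,000 / 2,000 / 3,000 people in towns A / B / C
population = pd.DataFrame({"town": ["A"] * 1000 + ["B"] * 2000 + ["C"] * 3000})
population["person_id"] = range(len(population))

# Sample the same fraction (1%) inside every town so the sample keeps the 1:2:3 proportions
stratified = population.groupby("town").sample(frac=0.01, random_state=0)

counts = stratified["town"].value_counts().sort_index()
print(counts.to_dict())
```

Each town contributes 10, 20 and 30 rows respectively, which is the proportional allocation described above, whereas a plain `population.sample(60)` only matches those proportions on average.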

#### Reservoir Sampling

I love this problem statement:

Say you have a stream of items of large and unknown length that you can only iterate over once.

Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

How can we do that?

Let us assume we have to sample 5 objects out of an infinite stream such that each element has an equal probability of getting selected.

In [None]:
import random

def generator(limit):
    # Yield the integers 1..limit one at a time, like a stream we can only read once
    number = 0
    while number < limit:
        number += 1
        yield number

# Create a stream generator
stream = generator(10000)

# Reservoir sampling from the stream
k = 5
reservoir = []
for i, element in enumerate(stream):
    if i < k:
        # Fill the reservoir with the first k items
        reservoir.append(element)
    else:
        # Item number i+1 enters the reservoir with probability k/(i+1)
        probability = k / (i + 1)
        if random.random() < probability:
            # Select item in stream and replace one of the k items already selected
            reservoir[random.randrange(k)] = element

In [None]:
print(reservoir)

It can be proved mathematically that each element of the stream has the same probability of ending up in the sample.

How?

It always helps to think of a smaller problem when it comes to mathematics.

So, let us think of a stream of only 3 items and we have to keep 2 of them.

We see the first item, we hold it in the list as our reservoir has space. We see the second item, we hold it in the list as our reservoir has space.

We see the third item. Here is where things get interesting. We choose the third item to be in the list with probability 2/3.

Let us now look at the probability of the first item staying in the reservoir:

The probability of removing the first item is the probability of item 3 getting selected multiplied by the probability of item 1 being randomly chosen as the replacement candidate from the 2 items in the reservoir. That probability is:

2/3 * 1/2 = 1/3

Thus the probability of item 1 remaining selected is:

1 - 1/3 = 2/3

We can make exactly the same argument for the second item, and we can extend it to many items.

Thus each item has the same probability of getting selected: 2/3, or in general k/n.
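
The 2/3 figure can also be checked empirically. The sketch below wraps the same algorithm in a hypothetical helper, `reservoir_sample`, and repeats the 3-item, k = 2 experiment many times with a fixed seed; the trial count is an arbitrary choice.

```python
import random
from collections import Counter

def reservoir_sample(stream, k, rng):
    """Keep k items from an iterable of unknown length, each with equal probability."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            # The first k items always go in
            reservoir.append(element)
        elif rng.random() < k / (i + 1):
            # Later items replace a uniformly chosen slot with probability k/(i+1)
            reservoir[rng.randrange(k)] = element
    return reservoir

rng = random.Random(0)
trials = 30000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(iter([1, 2, 3]), k=2, rng=rng))

# Fraction of trials in which each item ended up in the reservoir
frequencies = {item: counts[item] / trials for item in (1, 2, 3)}
print(frequencies)
```

All three frequencies should land close to 2/3, matching the k/n result.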


#### Random Undersampling and Oversampling

All too often we encounter an imbalanced dataset.

A widely adopted technique for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

Let us first create some example imbalanced data.

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)
X = pd.DataFrame(X)
X['target'] = y

We can now do random oversampling and undersampling using:

In [None]:
num_0 = len(X[X['target']==0])
num_1 = len(X[X['target']==1])

print(num_0,num_1)

# random undersample
undersampled_data = pd.concat([ X[X['target']==0].sample(num_1) , X[X['target']==1] ])

In [None]:
print(len(undersampled_data))

# random oversample

oversampled_data = pd.concat([ X[X['target']==0] , X[X['target']==1].sample(num_0, replace=True) ])
print(len(oversampled_data))
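
A quick way to confirm that the resampling did what we wanted is to compare class counts before and after. The sketch below repeats the same two steps on a small hypothetical frame (`data`, with a made-up `feature` column) so that it runs on its own:

```python
import pandas as pd

# Hypothetical imbalanced frame: 90 rows of class 0 and 10 rows of class 1
data = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})
majority = data[data["target"] == 0]
minority = data[data["target"] == 1]

# Undersample: keep only as many majority rows as there are minority rows
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Oversample: draw minority rows with replacement until they match the majority
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])

under_counts = under["target"].value_counts().to_dict()
over_counts = over["target"].value_counts().to_dict()
print(under_counts, over_counts)
```

Note that the oversampled frame necessarily contains exact duplicates of minority rows, since 90 draws are made from only 10 distinct rows.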

In [None]:
df1 = df.sample(100)
X = df1.iloc[:,:2]
y = df1.iloc[:,2]
n_0 = df1.loc[df1['smoker']==0,].shape[0]
n_1 = df1.loc[df1['smoker']==1,].shape[0]
n_0, n_1

In [None]:
df_under = pd.concat([df1.loc[df1['smoker']==0,].sample(n_1), df1.loc[df1['smoker']==1,]])
df_under.shape

In [None]:
df_over = pd.concat([df1.loc[df1['smoker']==0,], df1.loc[df1['smoker']==1,].sample(n_0, replace=True)])
df_over.shape

In [None]:
plt.hist(df);

In [None]:
plt.hist(df_under);

In [None]:
plt.hist(df_over);

#### Undersampling and Oversampling using imbalanced-learn

imbalanced-learn (imblearn) is a Python package for tackling the curse of imbalanced datasets.

It provides a variety of methods to undersample and oversample.

a. Undersampling using Tomek Links:

One such method is Tomek Links. Tomek links are pairs of examples from opposite classes that are each other's nearest neighbours.

In this algorithm, we remove the majority-class element of each Tomek link, which gives a classifier a cleaner decision boundary.

In [None]:
# !pip uninstall scikit-learn

In [None]:
from imblearn.under_sampling import TomekLinks

In [None]:
df = df.sample(1000)
X = df.iloc[:,:2]
y = df.iloc[:,2]

In [None]:
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
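
To make the idea concrete, here is a minimal NumPy sketch of how Tomek links can be found by hand. It is an illustration only, not imblearn's implementation: `find_tomek_links` is a hypothetical helper, Euclidean distance is an assumption, and the toy points are made up.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; the diagonal is set to infinity
    # so that a point is never its own nearest neighbour
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical toy data on a line: class 0 is the majority (4 points vs 3)
X_toy = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.1, 0.0],
                  [3.0, 0.0], [3.2, 0.0], [-1.0, 0.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 0])

links = find_tomek_links(X_toy, y_toy)
# For each link, the majority-class (class 0) member is the one to drop
majority_to_drop = [i if y_toy[i] == 0 else j for i, j in links]
print(links, majority_to_drop)
```

Only the pair at x = 1.0 and x = 1.1 qualifies: the two points are each other's nearest neighbour and carry different labels, so the class-0 point of that pair is the one that would be removed.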

b. Oversampling using SMOTE:

In SMOTE (Synthetic Minority Oversampling Technique) we synthesize elements for the minority class, in the vicinity of already existing elements.

In [None]:
plt.hist(X)
# plt.hist(y)

In [None]:
n_0 = df.loc[df['smoker']==0,].shape[0]
n_1 = df.loc[df['smoker']==1,].shape[0]
n_0, n_1

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)
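
The core idea of synthesizing points "in the vicinity" of existing ones can be sketched as interpolation between a minority point and one of its nearest minority neighbours. This is a simplified illustration, not imblearn's code: `smote_like_samples`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments joining minority points to their k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise distances among minority points only
    dist = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest minority neighbours of every point
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))      # pick a minority point
        j = neighbours[i, rng.integers(k)]     # pick one of its neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

# Hypothetical minority-class points
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like_samples(X_min, n_new=20)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.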

In [None]:
plt.hist(X_tl)
# plt.hist(y_tl)


In [None]:
plt.hist(X_sm)
# plt.hist(y_sm)

There are a variety of other methods in the imblearn package for both undersampling (Cluster Centroids, NearMiss, etc.) and oversampling (ADASYN and bSMOTE) that you can check out.

### Multiple hypothesis testing

In [None]:
import pandas as pd
# Import the data samples
from multipy.data import neuhaus
# Import the FWER methods
from multipy.fwer import bonferroni, holm_bonferroni
# Import the FDR methods (LSU is another name for the BH method)
from multipy.fdr import lsu

Let’s assume we have 15 features, and we already did our hypothesis testing for each feature.

In [None]:
pvals = neuhaus()
df = pd.DataFrame({'Features': ['Feature {}'.format(i) for i in range(1,len(pvals)+1  )], 'P-value':pvals})
# df

In [None]:
# Now, let's apply the Bonferroni correction to our data sample
# Set alpha to the desired significance level
df['bonferroni'] = bonferroni(pvals, alpha = 0.05)
df

With the function from MultiPy, we end up with either True or False results. True means we reject the null hypothesis, while False means we fail to reject it.
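
The rule behind the Bonferroni correction is simple enough to write by hand: with m tests, each p-value is compared against alpha / m. The sketch below is an illustration on hypothetical p-values (not the Neuhaus data), and it assumes a "less than or equal" comparison, which may differ from a given library at exact equality.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0 for every p-value at or below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]

# Hypothetical p-values, not the Neuhaus data
example_pvals = [0.001, 0.012, 0.025, 0.038, 0.2]
bonferroni_result = bonferroni_reject(example_pvals)
print(bonferroni_result)
```

With five tests the per-test threshold is 0.01, so only the first p-value survives even though four of the five are below the uncorrected 0.05 level.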

With the Bonferroni correction, only three features are considered significant. Let’s try the Holm-Bonferroni method to see if there is any difference in the result.

In [None]:
df['holm_bonferroni'] = holm_bonferroni(pvals, alpha = 0.05)
df

No change at all in the result. It seems the conservative FWER methods have restricted the significant results we can get. Let’s see if there is any difference if we use the BH method.

In [None]:
# Set the q parameter to the FDR you want to control
df['benjamini_hochberg'] = lsu(pvals, q=0.05)
df

The less strict FDR method gives a different result from the FWER methods. In this case, we have four significant features. FDR control is, after all, more lenient when it comes to flagging features.
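
The difference between the two families is easier to see when the procedures are written out. The sketch below implements the Holm step-down rule and the Benjamini-Hochberg step-up rule in plain Python. Both functions are hypothetical helpers written for illustration (not MultiPy's code), and the p-values are made up.

```python
def holm_reject(pvalues, alpha=0.05):
    """Step-down: walk up from the smallest p-value and stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg_reject(pvalues, q=0.05):
    """Step-up: find the largest rank i with p_(i) <= i * q / m, then reject ranks 1..i."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    last_passing = -1
    for rank, idx in enumerate(order):
        if pvalues[idx] <= (rank + 1) * q / m:
            last_passing = rank
    reject = [False] * m
    for rank in range(last_passing + 1):
        reject[order[rank]] = True
    return reject

# Hypothetical p-values, not the Neuhaus data
example_pvals = [0.001, 0.012, 0.025, 0.038, 0.2]
holm_result = holm_reject(example_pvals)
bh_result = benjamini_hochberg_reject(example_pvals)
print(sum(holm_result), sum(bh_result))
```

On these made-up values Holm rejects two hypotheses and BH rejects four, which illustrates why controlling the FDR is the more permissive choice.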

If you want to learn more about the methods available for Multiple Hypothesis Correction, you might want to visit the MultiPy homepage.

In [None]:
from statsmodels.stats.multitest import multipletests

In [None]:
reject, p_value_corrected, sidak_corr, bonf_corr = multipletests(pvals, alpha=0.05, method='fdr_bh')

In [None]:
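# Use alphac_* names for the corrected alpha levels so they do not shadow
# the bonferroni function imported from MultiPy above
reject, p_value_corrected, alphac_sidak, alphac_bonf = multipletests(pvals, alpha=0.05, method='sidak')
df['sidak'] = reject
reject, p_value_corrected, alphac_sidak, alphac_bonf = multipletests(pvals, alpha=0.05, method='holm-sidak')
df['holm-sidak'] = reject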

In [None]:
df


    reject: a boolean array of length $m$, where True means the null hypothesis can be rejected and False means it cannot
    pvals_corrected: an array of length $m$ with the corrected p-values
    alphacSidak: the Šidák-corrected significance level
    alphacBonf: the Bonferroni-corrected significance level



Šidák correction


Like the Bonferroni correction, the Šidák correction adjusts $\alpha$ (the significance level used to test each individual hypothesis). It also keeps $\mathrm{FWER} \leq \alpha$.
Let us work out what the Šidák correction is. $P(V \geq 1)=1-P(V=0) \leq 1-\left(1-\alpha_{1}\right)^{m}=\alpha$, where $\alpha$ is the significance level we set for the family of hypotheses and $\alpha_{1}$ is the significance level we are looking for, used to test each individual hypothesis.
Expressing $\alpha_{1}$ in terms of $\alpha$ gives $\alpha_{1}=1-(1-\alpha)^{1 / m}$.
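
The formula is easy to check numerically. In the sketch below, `sidak_alpha` is a hypothetical helper, and m = 15 is chosen to match the 15 features used earlier:

```python
def sidak_alpha(alpha, m):
    """Per-test significance level that gives family-wise level alpha for m independent tests."""
    return 1 - (1 - alpha) ** (1 / m)

alpha, m = 0.05, 15
per_test = sidak_alpha(alpha, m)
bonferroni_level = alpha / m

# Plugging the per-test level back into 1 - (1 - alpha_1)^m should recover alpha
recovered = 1 - (1 - per_test) ** m
print(per_test, bonferroni_level, recovered)
```

The Šidák level comes out slightly larger than the Bonferroni level alpha / m, so the correction is marginally less conservative.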

Holm-Šidák method


As in the earlier method that bears Holm's name, the p-values are adjusted iteratively. In the same way, we sort our p-values in ascending order and compare them against levels adjusted according to the Šidák correction:
$$
\begin{array}{l}
\alpha_{1}=1-(1-\alpha)^{\frac{1}{m}} \\
\ldots \\
\alpha_{i}=1-(1-\alpha)^{\frac{1}{m-i+1}} \\
\ldots \\
\alpha_{m}=\alpha
\end{array}
$$
The method has several properties:
1. It controls the FWER at significance level $\alpha$ if the test statistics are jointly independent.
2. If the test statistics are jointly independent, no procedure that controls the FWER at level $\alpha$ can be more powerful than the Holm-Šidák method.
3. For large $m$ it differs little from Holm's method.
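
The step-down levels above can be sketched in a few lines of plain Python. The helpers `holm_sidak_levels` and `holm_sidak_reject` are hypothetical names written for illustration (this is not the statsmodels implementation), and the p-values are made up.

```python
def holm_sidak_levels(alpha, m):
    """Levels alpha_i = 1 - (1 - alpha)^(1 / (m - i + 1)) for i = 1..m."""
    return [1 - (1 - alpha) ** (1 / (m - i + 1)) for i in range(1, m + 1)]

def holm_sidak_reject(pvalues, alpha=0.05):
    """Step-down: compare sorted p-values with the levels and stop at the first failure."""
    m = len(pvalues)
    levels = holm_sidak_levels(alpha, m)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 corresponds to i = 1
        if pvalues[idx] <= levels[rank]:
            reject[idx] = True
        else:
            break
    return reject

# Hypothetical p-values
example_pvals = [0.001, 0.012, 0.025, 0.038, 0.2]
levels = holm_sidak_levels(0.05, len(example_pvals))
holm_levels = [0.05 / (len(example_pvals) - i) for i in range(len(example_pvals))]
holm_sidak_result = holm_sidak_reject(example_pvals)
print(levels, holm_sidak_result)
```

The levels grow from roughly 0.0102 for the smallest p-value up to 0.05 for the largest, and each one is at least as large as the corresponding Holm level alpha / (m - i + 1).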