# Sampling Methods

### Learning Goals

- Introduce a few normality tests.
- Brief introduction to sklearn.
- Generate synthetic data to test algorithms.
- Understand Bootstrap Sampling.
- Understand Random Walk and Monte Carlo Simulation.


## Part I. Testing for Normality

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from random import random
import scipy.stats as stats
warnings.simplefilter(action='ignore', category=FutureWarning)

### Is the actual data statistically different than the computed normal curve?

#### Graphical
• Q-Q probability plots

• Cumulative frequency (P-P) plots
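As a rough sketch of what a Q-Q plot measures, the snippet below uses `scipy.stats.probplot` to compare the ordered sample values with theoretical normal quantiles. The two synthetic samples, their sizes, and the seed are assumptions chosen for illustration. The returned correlation `r` is close to 1 when the points fall near a straight line, which is what a normal sample should produce.

```python
import numpy as np
import scipy.stats as stats

# Assumed synthetic data: one normal sample and one right-skewed sample
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=5, size=200)
skewed_sample = rng.exponential(scale=5, size=200)

# probplot returns (theoretical quantiles, ordered sample values) and a
# least-squares fit (slope, intercept, r) through those points
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"Q-Q correlation, normal sample: {r_normal:.3f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.3f}")
```

Passing `plot=plt` to `stats.probplot` draws the points and the fitted line instead of only returning the numbers.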

#### Statistical
• Jarque-Bera test

• Shapiro-Wilk test

• Kolmogorov-Smirnov test

• D’Agostino test

The hypotheses used are:

${H_0}$: The sample data is not significantly different than a
normal population.

${H_a}$: The sample data is significantly different than a normal
population.
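The hypotheses translate into a simple decision rule on the p-value. The sketch below assumes a significance level of 0.05 and uses a hypothetical helper, `normality_decision`, that is not part of SciPy. The exponential sample is synthetic and chosen only because it is clearly non-normal.

```python
import numpy as np
import scipy.stats as stats

ALPHA = 0.05  # assumed significance level


def normality_decision(p_value, alpha=ALPHA):
    """Hypothetical helper: turn a p-value into a decision about H0."""
    if p_value < alpha:
        return "reject H0: the sample looks non-normal"
    return "fail to reject H0: no evidence against normality"


# Assumed synthetic data: a strongly right-skewed sample
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=500)

stat, p = stats.shapiro(skewed)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p = {p:.3g}")
print(normality_decision(p))
```

Failing to reject ${H_0}$ does not prove normality; it only means the test found no evidence against it at the chosen level.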

In [None]:
np.random.seed(4)
population_age = stats.poisson.rvs(loc = 18, mu = 35, size = 10)
sns.kdeplot(population_age, fill=True, label="population age")
plt.show()

#### D’Agostino and Pearson’s test

In [None]:
k2, p = stats.normaltest(population_age)
print(f"k2 = {k2:.3f}, p = {p:.3f}")

#### Shapiro-Wilk test

In [None]:
stats.shapiro(population_age)

#### Kolmogorov-Smirnov test 

In [None]:
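# Compare against a normal with the sample's mean and std (the default is N(0, 1)).
stats.kstest(population_age, 'norm', args=(population_age.mean(), population_age.std(ddof=1)))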

#### Jarque-Bera goodness of fit test
- Only reliable for a large enough number of data samples (>2000), so the result on a small sample is illustrative only.

In [None]:
stats.jarque_bera(population_age)

#### Anderson-Darling test

In [None]:
stats.anderson(population_age, dist='norm')

- Different normality tests can produce very different p-values for the same sample.
- This is because each test examines a different part of the distribution (center, tails) or a different moment (skewness, kurtosis).
    

## Permutation Test


${H_0}$: the samples come from the same distribution (treatment = control).

${H_a}$: the distributions of the samples are different (e.g., treatment != control).

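**Note** that there are (n+m)! permutations of the combined data, where n is the number of records in the treatment sample, and m is the number of records in the control sample. Only $\binom{n+m}{n}$ of them give distinct splits into two groups, which is still too many to enumerate for all but small samples, so in practice a random subset of permutations is usually evaluated.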

1. Compute the difference (here: the difference in means) between sample x and sample y
2. Combine all measurements into a single dataset
3. Draw a permuted dataset from all possible permutations of the dataset in 2.
4. Divide the permuted dataset into two datasets x' and y' of size n and m, respectively
5. Compute the difference (here: the difference in means) between sample x' and sample y' and record this difference
6. Repeat steps 3-5 until all permutations are evaluated
7. Return the p-value: the number of recorded differences that were at least as extreme as the original difference from 1., divided by the total number of permutations

Here, the p-value is defined as the probability, given the null hypothesis (no difference between the samples) is true, that we obtain results that are at least as extreme as the results we observed (i.e., the sample difference from 1.).
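The steps above can be sketched directly in NumPy. This is a minimal illustration, not the mlxtend implementation: the function name `permutation_test_mean`, its parameters, and the toy samples are assumptions. It draws a fixed number of random permutations instead of enumerating all of them, and it counts differences that are at least as extreme as the observed one (two-sided, using the absolute difference in means).

```python
import numpy as np


def permutation_test_mean(x, y, num_rounds=1000, seed=0):
    """Approximate two-sided permutation test on the difference in means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    observed = abs(x.mean() - y.mean())          # step 1
    combined = np.concatenate([x, y])            # step 2

    at_least_as_extreme = 0
    for _ in range(num_rounds):                  # step 6 (random subset)
        permuted = rng.permutation(combined)     # step 3
        x_new, y_new = permuted[:n], permuted[n:]    # step 4
        diff = abs(x_new.mean() - y_new.mean())  # step 5
        if diff >= observed:
            at_least_as_extreme += 1

    return at_least_as_extreme / num_rounds      # step 7


# Toy samples (assumed): clearly separated groups versus identical groups
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [101.0, 102.0, 103.0, 104.0, 105.0]
print(permutation_test_mean(a, b))  # small p-value expected
print(permutation_test_mean(a, a))  # observed difference is 0, so p = 1.0
```

Increasing `num_rounds` reduces the sampling noise in the estimated p-value at the cost of more computation.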

In [None]:
# If you do not have mlxtend already installed uncomment one of the install methods below.
#conda install -c conda-forge mlxtend 
# !pip install mlxtend  

treatment = [ 28.44,  29.32,  31.22,  29.58,  30.34,  28.76,  29.21,  30.4 ,
              31.12,  31.78,  27.58,  31.57,  30.73,  30.43,  30.31,  30.32,
              29.18,  29.52,  29.22,  30.56]
control = [ 33.51,  30.63,  32.38,  32.52,  29.41,  30.93,  49.78,  28.96,
            35.77,  31.42,  30.76,  30.6 ,  23.64,  30.54,  47.78,  31.98,
            34.52,  32.42,  31.32,  40.72]

from mlxtend.evaluate import permutation_test
p_value = permutation_test(treatment, control,
                           method='approximate',
                           seed=0)
print(p_value)

## Part II. SKLEARN and Data Generation


![sklearn.png](attachment:sklearn.png)

# scikit-learn

**Machine Learning in Python**
- https://scikit-learn.org/stable/
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import make_blobs

## `make_blobs()`
- This function generates isotropic Gaussian blobs for clustering and classification problems, similar to the ones we saw earlier with the Naive Bayes algorithm.
- Official documentation for this function can be found [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html).

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');

## `make_regression()`

- This function creates datasets that can be used to test regression algorithms such as linear regression.
- We can control the number of samples, the number of input features, the level of noise, and much more.
- Here is how we import this function:

In [None]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=2, n_targets=2, noise=0.6)

# plot the first feature against the first target
plt.scatter(X[:, 0], y[:, 0])
plt.show()

## Part III. Bootstrap Sampling


![Bootstrap-.jpg](attachment:Bootstrap-.jpg)

### What is Bootstrap Sampling?

- The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.
- It works by iteratively resampling a dataset with replacement.

### Bootstrap Algorithm

1. Choose a number of bootstrap samples to perform
2. Choose a sample size
3. For each bootstrap sample  
    a. Draw a sample with replacement with the chosen size  
    b. Calculate the statistic on the sample
4. Calculate the mean of the calculated sample statistics.
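The algorithm above can be sketched with NumPy alone. The function name `bootstrap_mean`, the default of 1000 bootstrap samples, and the toy data are assumptions for illustration; the statistic here is the sample mean. The percentile interval at the end is an extra that is often reported alongside the estimate, not part of the four steps.

```python
import numpy as np


def bootstrap_mean(data, n_boot=1000, sample_size=None, seed=0):
    """Bootstrap estimate of the mean, following the four steps above."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    if sample_size is None:
        sample_size = len(data)                  # step 2: chosen sample size

    boot_stats = np.empty(n_boot)                # step 1: number of samples
    for i in range(n_boot):                      # step 3
        sample = rng.choice(data, size=sample_size, replace=True)  # 3a
        boot_stats[i] = sample.mean()            # 3b

    return boot_stats.mean(), boot_stats         # step 4


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
estimate, boot_means = bootstrap_mean(data)
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"Bootstrap estimate of the mean: {estimate:.2f}")
print(f"95% percentile interval: ({low:.2f}, {high:.2f})")
```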

### Note
- We have to determine the sample size and the number of repetitions to perform.
- **The observations not included in a given bootstrap sample are called the out-of-bag samples, or OOB for short.** They will be useful when it is time to test machine learning algorithms.
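One way to identify out-of-bag observations is to resample row positions rather than values, so that repeated values in the data are handled correctly. The sketch below is an assumption about how this could be done with NumPy; the variable names and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10])

# Resample positions (with replacement) instead of values
indices = np.arange(len(data))
boot_idx = rng.choice(indices, size=len(data), replace=True)

# Out-of-bag positions are those never drawn into the bootstrap sample
oob_idx = np.setdiff1d(indices, boot_idx)

boot, oob = data[boot_idx], data[oob_idx]
print("Bootstrap sample:", boot)
print("OOB sample:", oob)
```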

### Illustration

![bootrap_concept.png](attachment:bootrap_concept.png)

In [None]:
# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,10,10]
# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=10)
print('Bootstrap Sample: %s' % boot)
# out of bag observations
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)

### Why is Bootstrapping Important?

- Used in ensemble methods.
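- Foundation for state-of-the-art machine learning algorithms such as **Random Forest**, which trains each tree on a bootstrap sample; boosting libraries such as **XGBoost** can use related row subsampling.
- The out-of-bag samples provide a test set for "free", which can reduce the need for separate cross-validation.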

![randomforest.png](attachment:randomforest.png)

## Part IV. Monte Carlo (MCMC)

###  What is Monte Carlo?


- Popular method for obtaining information about distributions, especially for estimating posterior distributions in Bayesian inference. 


###  Monte Carlo

- Monte Carlo can be used for estimating a parameter by randomly simulating values repeatedly. 
- Monte Carlo is the practice of estimating the properties of a distribution by examining random samples from the distribution. 

### Example
Instead of finding the mean of a normal distribution by calculating it directly from the distribution's equations, a Monte Carlo approach draws a large number of random samples from the distribution and calculates their sample mean.
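A minimal sketch of this example follows. The true mean of 3.0, the standard deviation of 2.0, the sample sizes, and the seed are all assumed values chosen for illustration. The estimate should move closer to the true mean as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, true_std = 3.0, 2.0   # assumed parameters of the distribution

estimates = {}
for n in (100, 10_000, 1_000_000):
    samples = rng.normal(true_mean, true_std, size=n)
    estimates[n] = samples.mean()   # Monte Carlo estimate of the mean
    print(f"n = {n:>9,}: estimated mean = {estimates[n]:.4f}")
```

The error of the estimate shrinks roughly in proportion to $1/\sqrt{n}$, so each extra digit of accuracy costs about 100 times more samples.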

### Illustration
- https://www.youtube.com/watch?v=BfS2H1y6tzQ

## Module 3 Final Project

https://learn.co/tracks/data-science-career-v2/module-3-probability-sampling-and-ab-testing/end-of-module-3-project/module-3-final-project