# What is sampling?

Sampling is the process of selecting a subset of units from a population. It makes it possible to gather information and to draw conclusions about the population based on the statistics of that sample.
Sampling techniques can be categorized into two groups:

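1. probability sampling: where every member of the population has a known, non-zero probability of being selected (in the simplest case, the same probability for every member)

2. non-probability sampling: where the probability of selecting each element is unknown, and some elements may have no chance of being selected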
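
As a minimal sketch of the difference between the two groups, the snippet below draws one sample of each kind from a hypothetical population of 100 unit ids. The population, the seed, and the variable names are illustrative assumptions, not part of the ruler example that follows.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
population = np.arange(1, 101)   # hypothetical population of 100 unit ids

# Probability sampling: every unit has a known, non-zero chance (here 10/100)
probability_sample = rng.choice(population, size=10, replace=False)

# Non-probability (convenience) sampling: take the first units at hand,
# so units 11..100 have no chance of being selected
convenience_sample = population[:10]
```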



## Probability Sampling

Four probability sampling methods are covered here:

1. simple random sampling
2. systematic sampling
3. cluster sampling
4. stratified random sampling



To practice the different sampling methods, let's generate a population: the lengths of rulers produced in a factory. Each ruler is supposed to be 12 inches long; however, due to the production techniques, the length is not exact.



In [1]:
import pandas as pd
import numpy as np

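number_of_rullers = 1000

# fix the seed so the generated population is reproducible
np.random.seed(42)
# ruler lengths are normally distributed around 12 in with a standard deviation of 0.05 in
data = {'ruller_id': np.arange(1, number_of_rullers+1).tolist(),
        'Length_in': np.random.normal(loc=12, scale=0.05, size=number_of_rullers)}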


In [3]:
data = pd.DataFrame(data, columns=['ruller_id', 'Length_in'])

In [4]:
data.head()

Unnamed: 0,ruller_id,Length_in
0,1,12.024836
1,2,11.993087
2,3,12.032384
3,4,12.076151
4,5,11.988292


In [5]:
# true mean of the population

population_mean = data['Length_in'].mean()

## Simple Random Sampling

The simple random sampling method selects units purely at random, so every ruler has the same probability of being selected. Here we draw 10 rulers.

In [7]:
simple_random_sample = data.sample(n=10).sort_values(by='ruller_id')
simple_random_mean = simple_random_sample['Length_in'].mean()
simple_random_sample

Unnamed: 0,ruller_id,Length_in
79,80,11.900622
84,85,11.959575
316,317,12.034098
366,367,12.011205
421,422,12.087767
474,475,12.082248
593,594,12.016683
836,837,12.077525
872,873,11.931034
886,887,12.029196
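
A simple random sample changes from run to run unless the random state is fixed. The sketch below shows one way to make the draw repeatable with the `random_state` parameter of `DataFrame.sample`. The stand-in `population` frame and the seed values are assumptions made so the snippet runs on its own; it does not reproduce the exact rulers shown above.

```python
import numpy as np
import pandas as pd

# Stand-in population (assumed: 1000 rulers, lengths ~ N(12, 0.05))
rng = np.random.default_rng(42)
population = pd.DataFrame({
    'ruller_id': np.arange(1, 1001),
    'Length_in': rng.normal(loc=12, scale=0.05, size=1000),
})

# The same random_state gives the same 10 rulers on every call
first_draw = population.sample(n=10, random_state=0)
second_draw = population.sample(n=10, random_state=0)

# sample() draws without replacement by default, so no ruler appears twice
repeatable_mean = first_draw['Length_in'].mean()
```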


## Systematic Sampling

The systematic sampling method selects rulers at a fixed sampling interval (e.g. every 100th unit is selected from a given process or population).

In [9]:
 
indexes = np.arange(0,len(data),step=100)
systematic_sample = data.iloc[indexes]
systematic_mean = systematic_sample['Length_in'].mean()
systematic_sample

Unnamed: 0,ruller_id,Length_in
0,1,12.024836
100,101,11.929231
200,201,12.017889
300,301,11.95855
400,401,11.920279
500,501,12.046309
600,601,12.037849
700,701,11.973864
800,801,12.046914
900,901,12.018434
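
The cell above always starts at the first ruler. A common variant picks the starting point at random within the first interval and then keeps the fixed step. The sketch below illustrates that variant; the helper `systematic_sample`, the stand-in `population` frame, and the seed are hypothetical names and values introduced only for this example.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, step, rng):
    """Return every `step`-th row of df, starting at a random offset below `step`."""
    start = int(rng.integers(0, step))  # random start in [0, step)
    return df.iloc[start::step]

# Stand-in population (assumed: 1000 rulers, lengths ~ N(12, 0.05))
rng = np.random.default_rng(42)
population = pd.DataFrame({
    'ruller_id': np.arange(1, 1001),
    'Length_in': rng.normal(loc=12, scale=0.05, size=1000),
})

random_start_sample = systematic_sample(population, step=100, rng=rng)
random_start_mean = random_start_sample['Length_in'].mean()
```

With 1000 rulers and a step of 100, any start below 100 still yields exactly 10 rulers.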


## Cluster Sampling

The cluster sampling method divides the population into clusters and then samples whole clusters. In this example the clusters have equal size and are selected at fixed intervals. Let's divide the data into clusters of 2 rulers and select 5 clusters at an interval of 100 clusters (i.e. every 200 rulers).

In [12]:
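number_of_clusters = 500
# integer division, because np.repeat expects an integer repeat count
cluster_size = len(data) // number_of_clusters
# consecutive rulers share a cluster id: 1, 1, 2, 2, 3, 3, ...
data['cluster_id'] = np.repeat(np.arange(1, number_of_clusters+1), cluster_size)
data.head()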

Unnamed: 0,ruller_id,Length_in,cluster_id
0,1,12.024836,1
1,2,11.993087,1
2,3,12.032384,2
3,4,12.076151,2
4,5,11.988292,3


In [20]:
# keep every 100th cluster, starting with cluster 1
mask = data['cluster_id'].isin([1,101,201,301,401])
cluster_sample = data[mask]
cluster_mean = cluster_sample['Length_in'].mean()
cluster_sample

Unnamed: 0,ruller_id,Length_in,cluster_id
0,1,12.024836,1
1,2,11.993087,1
200,201,12.017889,101
201,202,12.028039,101
400,401,11.920279,201
401,402,11.970031,201
600,601,12.037849,301
601,602,11.953892,301
800,801,12.046914,401
801,802,11.974198,401


In [22]:
data.drop(columns='cluster_id', inplace=True)
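
Cluster sampling is often done by choosing the clusters at random rather than at fixed intervals. The sketch below shows that variant under the same setup (clusters of 2 rulers, 5 clusters selected). The stand-in `population` frame, the seed, and the names `chosen_clusters` and `random_cluster_sample` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in population (assumed: 1000 rulers, lengths ~ N(12, 0.05))
rng = np.random.default_rng(42)
population = pd.DataFrame({
    'ruller_id': np.arange(1, 1001),
    'Length_in': rng.normal(loc=12, scale=0.05, size=1000),
})

cluster_size = 2
# Rows 0-1 form cluster 1, rows 2-3 form cluster 2, and so on
population['cluster_id'] = np.arange(len(population)) // cluster_size + 1

# Choose 5 distinct clusters at random, then keep every ruler in them
chosen_clusters = rng.choice(population['cluster_id'].unique(), size=5, replace=False)
random_cluster_sample = population[population['cluster_id'].isin(chosen_clusters)]
random_cluster_mean = random_cluster_sample['Length_in'].mean()
```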

## Stratified Random Sampling

The stratified random sampling method divides the population into subgroups (i.e. strata) and draws a random sample from each stratum. Here each stratum contributes in proportion to its size, so every unit has the same probability of being selected.


In [48]:
# 'green' is listed twice, so its stratum holds 400 rulers; the other strata hold 200 each
data['ruller_strata'] = np.repeat(['white', 'red', 'blue', 'green', 'green'], len(data) // 5)
data.head()

Unnamed: 0,ruller_id,Length_in,ruller_strata
0,1,12.024836,white
1,2,11.993087,white
2,3,12.032384,white
3,4,12.076151,white
4,5,11.988292,white


In [49]:
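from sklearn.model_selection import StratifiedShuffleSplit

# test_size=10 draws 10 rulers in total, preserving the share of each stratum
split = StratifiedShuffleSplit(n_splits=1, test_size=10)

# split() yields (train, test) index arrays; the test indices form the sample
for _, sample_idx in split.split(data, data['ruller_strata']):
    stratified_random_sample = data.iloc[sample_idx].sort_values(by='ruller_id')

stratified_random_sample_mean = stratified_random_sample['Length_in'].mean()
stratified_random_sample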

Unnamed: 0,ruller_id,Length_in,ruller_strata
122,123,12.07014,white
177,178,12.072677,white
323,324,12.104619,red
379,380,11.959585,red
422,423,11.987552,blue
466,467,12.030708,blue
717,718,12.007515,green
722,723,11.932591,green
732,733,12.020413,green
941,942,12.043623,green


In [50]:
stratified_random_sample.groupby('ruller_strata').mean().drop(['ruller_id'],axis=1)

ruller_strata,Length_in
blue,12.00913
green,12.001035
red,12.032102
white,12.071408
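
The same kind of proportional stratified draw can also be sketched with pandas alone, using `groupby(...).sample(frac=...)` (available in pandas 1.1 and later). This is an alternative to the scikit-learn approach above, not a replacement for it. The stand-in `population` frame, the seed, and the 1% fraction are assumptions chosen so that the sample again contains 10 rulers.

```python
import numpy as np
import pandas as pd

# Stand-in population with the same strata layout as above:
# 'green' covers two blocks of 200 rulers, the other colours one block each
rng = np.random.default_rng(42)
population = pd.DataFrame({
    'ruller_id': np.arange(1, 1001),
    'Length_in': rng.normal(loc=12, scale=0.05, size=1000),
    'ruller_strata': np.repeat(['white', 'red', 'blue', 'green', 'green'], 200),
})

# Draw 1% of each stratum, so larger strata contribute more rulers
proportional_sample = (
    population.groupby('ruller_strata')
              .sample(frac=0.01, random_state=0)
              .sort_values(by='ruller_id')
)
strata_counts = proportional_sample['ruller_strata'].value_counts()
```

Here `strata_counts` holds 4 rulers for green and 2 for each of the other colours, matching the split in the output above.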


## Comparison of the results

In [51]:
# Create a dictionary with the mean outcomes for each sampling method and the real mean
outcomes = {'sample_mean':[simple_random_mean,systematic_mean,cluster_mean, stratified_random_sample_mean],
           'population_mean':population_mean}

# Transform dictionary into a data frame
outcomes = pd.DataFrame(outcomes, index=['Simple Random Sampling','Systematic Sampling','Cluster Sampling', 'Stratified Sampling'])

# Add a value corresponding to the absolute error
outcomes['abs_error'] = abs(outcomes['population_mean'] - outcomes['sample_mean'])

# Sort data frame by absolute error
outcomes.sort_values(by='abs_error')

Unnamed: 0,sample_mean,population_mean,abs_error
Systematic Sampling,11.997416,12.000967,0.003551
Cluster Sampling,11.996701,12.000967,0.004265
Simple Random Sampling,12.012995,12.000967,0.012029
Stratified Sampling,12.022942,12.000967,0.021976
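
The table above comes from a single draw of 10 rulers per method, so the ordering of the methods may differ under another random seed. As a rough sketch of that variability, the snippet below repeats a simple random sample many times and records the absolute error of each draw. The stand-in `lengths` series, the seeds, and the choice of 200 repetitions are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in population (assumed: 1000 ruler lengths ~ N(12, 0.05))
rng = np.random.default_rng(42)
lengths = pd.Series(rng.normal(loc=12, scale=0.05, size=1000))
true_mean = lengths.mean()

# Repeat a simple random sample of 10 rulers with 200 different seeds
# and record how far each sample mean lands from the population mean
abs_errors = np.array([
    abs(lengths.sample(n=10, random_state=seed).mean() - true_mean)
    for seed in range(200)
])

print('mean abs error:', abs_errors.mean())
print('largest abs error:', abs_errors.max())
```

The same loop could be wrapped around the other three methods to compare their typical errors rather than one outcome each.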
