# Probability Sampling

## 1. Random Sampling

Under random sampling, every element of the population has an equal probability of being selected.

![](https://miro.medium.com/max/340/1*wl0Ex5ydLiuJGxHkZZrqhw.png)

In [1]:
# Import Library
import random

population = 100

data = range(population)

print(random.sample(data,5))

[78, 82, 65, 86, 85]
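The equal-probability claim can be checked empirically. The sketch below is only an illustration: the seed, the number of trials, and the names `rng`, `trials`, and `counts` are assumptions made for this example. It repeats the draw many times and counts how often each element is selected; the counts should all land near the same expected value.

```python
import random
from collections import Counter

# Fixed seed (an assumption for this sketch) so the run is repeatable
rng = random.Random(0)

population = range(100)
sample_size = 5
trials = 10_000  # hypothetical number of repeated draws

# Count how many times each element ends up in a sample
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(population, sample_size))

# Each element is expected to be picked trials * sample_size / 100 times
expected = trials * sample_size / len(population)
print(expected, min(counts.values()), max(counts.values()))
```

If some elements were favoured, the minimum and maximum counts would drift far from the expected value.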


## 2. Stratified Sampling

1. Group the entire population into subpopulations (strata) by some common property, for example the class labels in a typical ML classification task.
2. Then randomly sample from each group individually, so that the groups keep the same ratio in the sample as they had in the entire population.

![](https://miro.medium.com/max/333/1*htUih4E3pQfl9uXQq_fDoA.png)

In [8]:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

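| | team | position | assists | rebounds |
|---|---|---|---|---|
| 0 | A | G | 5 | 11 |
| 1 | A | G | 7 | 8 |
| 2 | A | F | 7 | 10 |
| 3 | A | G | 8 | 6 |
| 4 | B | F | 5 | 6 |
| 5 | B | F | 7 | 9 |
| 6 | B | C | 6 | 6 |
| 7 | B | C | 9 | 10 |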


In [9]:
df.groupby('team', group_keys=False).apply(lambda x: x.sample(2))

| | team | position | assists | rebounds |
|---|---|---|---|---|
| 0 | A | G | 5 | 11 |
| 3 | A | G | 8 | 6 |
| 6 | B | C | 6 | 6 |
| 5 | B | F | 7 | 9 |


### Example 2: Stratified Sampling Using Proportions

In [10]:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

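| | team | position | assists | rebounds |
|---|---|---|---|---|
| 0 | A | G | 5 | 11 |
| 1 | A | G | 7 | 8 |
| 2 | B | F | 7 | 10 |
| 3 | B | G | 8 | 6 |
| 4 | B | F | 5 | 6 |
| 5 | B | F | 7 | 9 |
| 6 | B | C | 6 | 6 |
| 7 | B | C | 9 | 10 |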


In [11]:
import numpy as np

#define total sample size desired
N = 4

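#perform stratified random sampling: draw from each team in proportion to its size,
#then shuffle the sampled rows and reset the index
(df.groupby('team', group_keys=False)
   .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
   .sample(frac=1)
   .reset_index(drop=True))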

| | team | position | assists | rebounds |
|---|---|---|---|---|
| 0 | B | F | 7 | 9 |
| 1 | B | G | 8 | 6 |
| 2 | B | F | 5 | 6 |
| 3 | A | G | 7 | 8 |
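The proportional allocation can also be wrapped in a small helper, which makes it easy to confirm that the strata keep their original ratio. This is a minimal sketch, not the notebook's own method: the function name `stratified_sample` and its `seed` parameter are hypothetical, and it loops over the groups instead of using `groupby(...).apply(...)`.

```python
import pandas as pd

def stratified_sample(df, column, n, seed=None):
    """Draw about n rows, allocating them to each stratum in proportion to its size.

    Hypothetical helper for illustration. Because each stratum's share is
    rounded, the total can differ slightly from n for other inputs.
    """
    parts = []
    for _, group in df.groupby(column):
        k = int(round(n * len(group) / len(df)))  # this stratum's share of n
        parts.append(group.sample(k, random_state=seed))
    return pd.concat(parts)

# Same team split as above: 2 rows for team A, 6 rows for team B
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9]})

sample = stratified_sample(df, 'team', n=4, seed=0)
print(sample['team'].value_counts().to_dict())
```

With 2 of 8 rows in team A and 6 of 8 in team B, a sample of 4 should contain 1 row from A and 3 from B, the same 1:3 ratio as the population.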


## 3. Cluster Sampling

1. Divide the entire population into subgroups (clusters).
2. Each subgroup, taken as a whole, has characteristics similar to those of the entire population.
3. Instead of sampling individuals, randomly select entire subgroups.

![](https://miro.medium.com/max/468/1*2g1Tp81I9i5Sn4GMwk_sHg.png)

### Example:

In [13]:
import random
import numpy as np

In [14]:
clusters=5
pop_size = 100
sample_clusters=2

In [15]:
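#assign cluster ids 1 to 5 sequentially, 20 consecutive elements per cluster
#(np.repeat needs an integer repeat count, hence // instead of /)
cluster_ids = np.repeat(range(1, clusters + 1), pop_size // clusters)
#random.sample needs a sequence (sets are rejected since Python 3.11), so sample from a range
cluster_to_select = random.sample(range(1, clusters + 1), sample_clusters)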

In [16]:
#positions (0-based) whose cluster id was selected
indexes = [i for i, x in enumerate(cluster_ids) if x in cluster_to_select]
#map positions back to the population elements 1..100
cluster_associated_elements = [idx + 1 for idx in indexes]

In [17]:
print (cluster_associated_elements)

[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
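The same selection can be written more compactly with a boolean mask. The following is a sketch under stated assumptions: the seed and the names `rng`, `elements`, `chosen`, and `selected` are introduced here for illustration, and `np.isin` replaces the explicit index loop.

```python
import random
import numpy as np

clusters = 5
pop_size = 100
sample_clusters = 2

rng = random.Random(0)  # fixed seed (an assumption) for a repeatable run

# Population elements 1..100 and their cluster ids (20 consecutive elements each)
elements = np.arange(1, pop_size + 1)
cluster_ids = np.repeat(np.arange(1, clusters + 1), pop_size // clusters)

# Pick whole clusters at random, then keep every element belonging to them
chosen = rng.sample(range(1, clusters + 1), sample_clusters)
selected = elements[np.isin(cluster_ids, chosen)]

print(sorted(chosen), selected.tolist())
```

Every element of a chosen cluster is kept, so the sample size is always `sample_clusters` times the cluster size (40 here).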


## 4. Systematic Sampling

1. Systematic sampling selects items from the population at regular, predefined intervals (that is, fixed and periodic intervals).

![](https://miro.medium.com/max/385/1*CBh9pbPc2qNlZcskXUX6mw.png)

In [18]:
population = 100
step = 5
#take every 5th element of the population 1..100, starting from element 1
sample = list(range(1, population + 1, step))
print(sample)

[1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96]
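For populations that are not simply a run of integers, the fixed-interval idea can be expressed with slicing. This is a minimal sketch: the helper `systematic_sample` and its `start` parameter are hypothetical names introduced here, not part of the notebook.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item, beginning at position `start` (0-based).

    Hypothetical helper for illustration; it works on any finite iterable.
    """
    return list(items)[start::step]

population = range(1, 101)

# Same interval as above: elements 1, 6, 11, ...
print(systematic_sample(population, step=5))

# A different starting position shifts the whole sample: 3, 8, 13, ...
print(systematic_sample(population, step=5, start=2))
```

The interval stays fixed in both calls; only the first selected element changes.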


## 5. Multistage Sampling

1. Stack multiple sampling methods one after the other.
2. For example, cluster sampling can be used to choose clusters from the population.
3. Random sampling can then be performed to choose elements from each cluster to form the final set.

![](https://miro.medium.com/max/463/1*WIY6mlmFDzHj4oomMMkoWg.png)

**Implementation**

In [19]:
import numpy as np

In [20]:
clusters=5
pop_size = 100
sample_clusters=2
sample_size=5

In [21]:
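#assign cluster ids 1 to 5 sequentially, 20 consecutive elements per cluster
#(np.repeat needs an integer repeat count, hence // instead of /)
cluster_ids = np.repeat(range(1, clusters + 1), pop_size // clusters)
#random.sample needs a sequence (sets are rejected since Python 3.11), so sample from a range
cluster_to_select = random.sample(range(1, clusters + 1), sample_clusters)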

In [22]:
#positions (0-based) whose cluster id was selected
indexes = [i for i, x in enumerate(cluster_ids) if x in cluster_to_select]
#map positions back to the population elements 1..100
cluster_associated_elements = [idx + 1 for idx in indexes]

In [23]:
print (random.sample(cluster_associated_elements, sample_size))

[66, 69, 63, 78, 70]
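The cell above draws the final sample from the pooled elements of the selected clusters, so the split between clusters can be uneven. A variant that samples separately within each selected cluster might look like the sketch below. The names `members`, `per_cluster`, and `final_sample`, as well as the seed, are assumptions made for this illustration.

```python
import random

clusters = 5
pop_size = 100
sample_clusters = 2
per_cluster = 5  # hypothetical: how many elements to draw from each chosen cluster

rng = random.Random(0)  # fixed seed (an assumption) for a repeatable run
cluster_size = pop_size // clusters

# Cluster c holds the consecutive elements (c-1)*20 + 1 .. c*20
members = {
    c: list(range((c - 1) * cluster_size + 1, c * cluster_size + 1))
    for c in range(1, clusters + 1)
}

# Stage 1: cluster sampling, pick whole clusters at random
chosen = rng.sample(sorted(members), sample_clusters)

# Stage 2: random sampling inside each chosen cluster
final_sample = {c: rng.sample(members[c], per_cluster) for c in chosen}
print(final_sample)
```

Sampling within each cluster guarantees that every selected cluster contributes exactly `per_cluster` elements to the final set.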
