# Stratified Sampling

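Stratified sampling is a sampling technique in which the population is divided into distinct, non-overlapping subgroups (strata) based on a shared characteristic, and a random sample is then drawn from each stratum. With proportional allocation, each stratum's sample size is proportional to its share of the population; with equal allocation, the same number of units is drawn from every stratum.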
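As a rough illustration of proportional allocation, the sketch below computes the per-stratum sample size as n_h = n * N_h / N, rounded to the nearest integer. The helper name `proportional_allocation` and the stratum sizes are hypothetical, chosen only for this example.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Return the per-stratum sample size round(n * N_h / N)."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }

# Hypothetical stratum sizes, used only for illustration
sizes = {"Marketing": 50, "Finance": 30, "Operations": 20}
allocation = proportional_allocation(sizes, total_sample=10)
print(allocation)  # {'Marketing': 5, 'Finance': 3, 'Operations': 2}
```

Because each size is rounded independently, the allocations may not always add up to exactly the requested total when the shares are not round numbers.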

## Example

Let's assume we have a dataset with employee IDs, ages, and departments. We want to perform stratified sampling to select a sample of employees in which every department is represented.

In [2]:
import pandas as pd
import random

### Create sample employee data with employee ID, age, and department

In [16]:
# Number of employees
num_emp = 100

# Generate sequential employee IDs (1 to num_emp)
emp_ids = range(1, num_emp + 1)

# Generate random ages for employees
ages = [random.randint(18, 25) for _ in range(num_emp)]

# Randomly assign each employee to one of three departments
departments = [random.choice(['Marketing', 'Finance', 'Operations']) for _ in range(num_emp)]
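The cell above draws ages and departments without a seed, so rerunning it produces different values from the outputs shown below. One way to make the data reproducible, sketched here under the assumption that a fixed seed is acceptable, is a local `random.Random` generator. The helper `make_employees` and the seed value 42 are hypothetical.

```python
import random

def make_employees(num_emp, seed):
    """Generate ages and departments reproducibly from a fixed seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    ages = [rng.randint(18, 25) for _ in range(num_emp)]
    departments = [
        rng.choice(["Marketing", "Finance", "Operations"]) for _ in range(num_emp)
    ]
    return ages, departments

first = make_employees(100, seed=42)
second = make_employees(100, seed=42)
print(first == second)  # True: the same seed yields the same data
```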

In [17]:
emp_ids

range(1, 101)

In [18]:
str(ages)

'[20, 21, 18, 25, 18, 18, 22, 24, 23, 23, 22, 21, 23, 20, 23, 19, 18, 18, 23, 22, 24, 18, 18, 20, 23, 21, 23, 21, 21, 23, 22, 21, 21, 24, 23, 20, 18, 24, 19, 20, 18, 22, 20, 25, 20, 19, 22, 22, 22, 23, 24, 20, 19, 23, 25, 20, 19, 25, 18, 19, 25, 22, 25, 24, 21, 25, 19, 23, 21, 24, 23, 20, 24, 24, 19, 19, 19, 23, 20, 23, 19, 24, 21, 25, 21, 24, 18, 18, 23, 21, 19, 23, 21, 21, 22, 18, 20, 23, 19, 18]'

In [19]:
str(departments)

"['Operations', 'Marketing', 'Marketing', 'Operations', 'Operations', 'Marketing', 'Operations', 'Finance', 'Finance', 'Operations', 'Finance', 'Marketing', 'Finance', 'Marketing', 'Marketing', 'Operations', 'Finance', 'Finance', 'Finance', 'Marketing', 'Marketing', 'Finance', 'Finance', 'Operations', 'Operations', 'Operations', 'Operations', 'Operations', 'Operations', 'Finance', 'Finance', 'Operations', 'Marketing', 'Finance', 'Operations', 'Operations', 'Finance', 'Operations', 'Finance', 'Operations', 'Marketing', 'Marketing', 'Marketing', 'Operations', 'Marketing', 'Operations', 'Marketing', 'Operations', 'Finance', 'Finance', 'Marketing', 'Marketing', 'Finance', 'Operations', 'Operations', 'Marketing', 'Operations', 'Operations', 'Marketing', 'Finance', 'Operations', 'Finance', 'Finance', 'Finance', 'Marketing', 'Operations', 'Operations', 'Operations', 'Finance', 'Finance', 'Marketing', 'Operations', 'Finance', 'Finance', 'Operations', 'Marketing', 'Operations', 'Finance', 'Oper

In [20]:
len(departments)

100

### Create Dataframe

In [21]:
df = pd.DataFrame({'emp_id': emp_ids, 'Age': ages, 'Department': departments})

In [22]:
df

Unnamed: 0,emp_id,Age,Department
0,1,20,Operations
1,2,21,Marketing
2,3,18,Marketing
3,4,25,Operations
4,5,18,Operations
...,...,...,...
95,96,18,Operations
96,97,20,Marketing
97,98,23,Finance
98,99,19,Finance


### Perform stratified sampling

In [27]:
# Define sample size for each stratum
sample_size = 10 

In [30]:
# Define strata based on department
strata = df['Department'].unique()
strata

array(['Operations', 'Marketing', 'Finance'], dtype=object)

In [31]:
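# Collect one sample per stratum, then combine them at the end
samples = []

In [32]:
for stratum in strata:
    # Select the rows that belong to this department
    stratum_data = df[df['Department'] == stratum]
    # Draw sample_size rows at random (raises ValueError if the stratum is smaller)
    sampled_stratum = stratum_data.sample(n=sample_size, random_state=42)
    samples.append(sampled_stratum)

# Concatenating once avoids starting from an empty DataFrame, which can change column dtypes
sampled_data = pd.concat(samples)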
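The per-stratum loop above can also be written as a single call with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The sketch below rebuilds a small frame with the same columns so that it runs on its own; `demo_df` and its evenly spread department assignment are assumptions made for the example, not the data from this notebook.

```python
import random
import pandas as pd

rng = random.Random(0)  # arbitrary seed, assumed only for reproducibility
num_emp = 100
dept_names = ["Marketing", "Finance", "Operations"]

# Stand-in for df: departments are assigned round-robin so every stratum has at least 10 rows
demo_df = pd.DataFrame({
    "emp_id": range(1, num_emp + 1),
    "Age": [rng.randint(18, 25) for _ in range(num_emp)],
    "Department": [dept_names[i % 3] for i in range(num_emp)],
})

# Draw 10 rows from every department in one call
sampled = demo_df.groupby("Department").sample(n=10, random_state=42)
print(sampled["Department"].value_counts())
```

The result keeps the original row index, so the sampled rows can still be traced back to `demo_df`.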

### Stratified Sampled Data

In [35]:
sampled_data

Unnamed: 0,emp_id,Age,Department
43,44,25,Operations
34,35,23,Operations
9,10,23,Operations
74,75,19,Operations
92,93,21,Operations
65,66,25,Operations
23,24,20,Operations
66,67,19,Operations
60,61,25,Operations
39,40,20,Operations


The resulting DataFrame `sampled_data` contains a stratified sample of 30 employees: 10 drawn at random from each of the three departments, so every department is equally represented.
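If the sample should instead mirror each department's share of the population, the `frac` argument of `DataFrameGroupBy.sample` may be a better fit than a fixed `n`. The sketch below assumes a hypothetical population with unequal department sizes (50, 30, and 20) and a sampling fraction of 20%; neither comes from the data above.

```python
import pandas as pd

# Hypothetical population with unequal department sizes (50 / 30 / 20)
population = pd.DataFrame({
    "emp_id": range(1, 101),
    "Department": ["Marketing"] * 50 + ["Finance"] * 30 + ["Operations"] * 20,
})

# Draw 20% of every department, so larger departments contribute more rows
proportional = population.groupby("Department").sample(frac=0.2, random_state=42)

# Compare department shares in the population and in the sample
shares = pd.DataFrame({
    "population": population["Department"].value_counts(normalize=True),
    "sample": proportional["Department"].value_counts(normalize=True),
})
print(shares)
```

With these sizes the sample holds 10, 6, and 4 rows, so the two columns of `shares` match.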