# Cluster Sampling

Cluster sampling is a sampling technique in which the population is divided into groups (clusters) and a random sample of clusters is selected. In one-stage cluster sampling, which is what this example uses, every individual within the chosen clusters is included in the sample. It is often used when sampling individuals directly from the entire population is impractical or too costly.
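
The idea can be sketched in plain Python before bringing in pandas. This is a minimal illustration, assuming the population is already grouped into named clusters; the `population` dictionary and the `one_stage_cluster_sample` helper are hypothetical names made up for this sketch.

```python
import random

def one_stage_cluster_sample(population, num_clusters, seed=None):
    """Pick num_clusters cluster names at random and return every member of each."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    chosen = rng.sample(sorted(population), num_clusters)
    # One-stage cluster sampling: keep all individuals in the chosen clusters
    sample = [member for name in chosen for member in population[name]]
    return chosen, sample

# Hypothetical population: three clusters of individuals
population = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
}

chosen, sample = one_stage_cluster_sample(population, num_clusters=2, seed=0)
print(chosen, sample)
```

Note that the randomness applies only to which clusters are picked; once a cluster is chosen, none of its members is left out.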

## Example

Let's assume we have a dataset with employee IDs, ages, and departments. We want to perform cluster sampling by randomly selecting a few departments (clusters) and then selecting all employees from those departments.

In [2]:
import pandas as pd
import random

### Create sample employee data with employee ID, age, and department

In [3]:
# Number of employees
num_emp = 100

# Generate sequential employee IDs (1 to num_emp)
emp_ids = range(1, num_emp + 1)

# Generate random ages for employees
ages = [random.randint(18, 25) for _ in range(num_emp)]

# Randomly assign each employee to one of three departments
departments = [random.choice(['Marketing', 'Finance', 'Operations']) for _ in range(num_emp)]
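
Because the ages and departments are drawn with `random` and no seed is set, each run produces different values, so a fresh run will not reproduce the outputs shown in this example. A minimal sketch of making the draws repeatable, assuming a fixed seed is acceptable for a demonstration (the seed value 42 and the `generate_employees` helper are arbitrary, illustrative choices and not part of the original notebook):

```python
import random

def generate_employees(num_emp, seed):
    """Generate ages and departments reproducibly from a given seed."""
    rng = random.Random(seed)  # dedicated generator seeded for repeatability
    ages = [rng.randint(18, 25) for _ in range(num_emp)]
    departments = [rng.choice(['Marketing', 'Finance', 'Operations']) for _ in range(num_emp)]
    return ages, departments

ages_a, departments_a = generate_employees(100, seed=42)
ages_b, departments_b = generate_employees(100, seed=42)

# Same seed, same data: both comparisons hold
print(ages_a == ages_b and departments_a == departments_b)
```
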

In [4]:
emp_ids

range(1, 101)

In [5]:
str(ages)

'[20, 21, 22, 18, 20, 19, 24, 23, 18, 23, 18, 19, 23, 21, 21, 19, 24, 25, 22, 20, 20, 21, 23, 24, 22, 21, 18, 21, 18, 25, 24, 25, 22, 22, 22, 21, 22, 23, 24, 24, 19, 21, 20, 25, 18, 25, 25, 21, 18, 19, 25, 21, 25, 18, 24, 21, 21, 21, 19, 18, 25, 25, 24, 23, 25, 21, 19, 23, 25, 20, 23, 22, 18, 21, 25, 20, 18, 24, 25, 22, 21, 23, 22, 20, 19, 24, 18, 22, 23, 23, 18, 22, 25, 21, 19, 19, 25, 20, 21, 18]'

In [6]:
str(departments)

"['Finance', 'Marketing', 'Marketing', 'Operations', 'Marketing', 'Operations', 'Operations', 'Finance', 'Marketing', 'Finance', 'Finance', 'Marketing', 'Finance', 'Finance', 'Finance', 'Operations', 'Operations', 'Finance', 'Marketing', 'Operations', 'Finance', 'Operations', 'Operations', 'Operations', 'Finance', 'Operations', 'Finance', 'Finance', 'Operations', 'Operations', 'Marketing', 'Operations', 'Finance', 'Marketing', 'Marketing', 'Marketing', 'Finance', 'Finance', 'Operations', 'Finance', 'Finance', 'Operations', 'Finance', 'Marketing', 'Operations', 'Finance', 'Finance', 'Finance', 'Finance', 'Marketing', 'Marketing', 'Operations', 'Marketing', 'Operations', 'Finance', 'Finance', 'Operations', 'Operations', 'Finance', 'Finance', 'Marketing', 'Marketing', 'Marketing', 'Finance', 'Operations', 'Marketing', 'Marketing', 'Operations', 'Operations', 'Finance', 'Operations', 'Finance', 'Operations', 'Finance', 'Operations', 'Finance', 'Marketing', 'Operations', 'Marketing', 'Finan

In [7]:
len(departments)

100

### Create Dataframe

In [8]:
df = pd.DataFrame({'emp_id': emp_ids, 'Age': ages, 'Department': departments})

In [9]:
df

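| | emp_id | Age | Department |
|---|---|---|---|
| 0 | 1 | 20 | Finance |
| 1 | 2 | 21 | Marketing |
| 2 | 3 | 22 | Marketing |
| 3 | 4 | 18 | Operations |
| 4 | 5 | 20 | Marketing |
| ... | ... | ... | ... |
| 95 | 96 | 19 | Marketing |
| 96 | 97 | 25 | Marketing |
| 97 | 98 | 20 | Operations |
| 98 | 99 | 21 | Finance |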


### Perform cluster sampling

In [10]:
# Define number of clusters
num_clusters = 2

In [11]:
# Randomly select clusters (departments)
clusters = random.sample(list(df['Department'].unique()), num_clusters)
clusters

['Marketing', 'Finance']

In [12]:
# Keep every employee whose department is one of the selected clusters
sampled_data = df[df['Department'].isin(clusters)]

### Cluster Sampled Data

In [13]:
sampled_data

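| | emp_id | Age | Department |
|---|---|---|---|
| 0 | 1 | 20 | Finance |
| 1 | 2 | 21 | Marketing |
| 2 | 3 | 22 | Marketing |
| 4 | 5 | 20 | Marketing |
| 7 | 8 | 23 | Finance |
| ... | ... | ... | ... |
| 93 | 94 | 21 | Finance |
| 95 | 96 | 19 | Marketing |
| 96 | 97 | 25 | Marketing |
| 98 | 99 | 21 | Finance |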


The resulting DataFrame, sampled_data, contains a cluster sample of employees: every employee from the randomly chosen departments and none from the others.
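
The selection and filtering steps can also be wrapped in one reusable function. The following is a sketch under stated assumptions rather than a definitive implementation: `cluster_sample`, its parameters, and the small `demo` DataFrame are hypothetical names introduced here, and the optional `seed` is an addition that makes the choice of clusters repeatable.

```python
import random
import pandas as pd

def cluster_sample(df, cluster_col, num_clusters, seed=None):
    """Return the chosen cluster labels and all rows belonging to them."""
    labels = sorted(df[cluster_col].unique())
    if num_clusters > len(labels):
        raise ValueError("num_clusters exceeds the number of available clusters")
    rng = random.Random(seed)  # local generator; pass a seed for repeatable picks
    chosen = rng.sample(labels, num_clusters)
    # Keep every row whose cluster label was selected
    return chosen, df[df[cluster_col].isin(chosen)]

# Small made-up dataset for illustration
demo = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'Age': [20, 21, 22, 18, 20, 19],
    'Department': ['Finance', 'Marketing', 'Marketing',
                   'Operations', 'Marketing', 'Operations'],
})

chosen, sampled = cluster_sample(demo, 'Department', num_clusters=2, seed=1)
print(chosen)
print(sampled['Department'].value_counts())
```

Counting rows per department in the result is a quick sanity check: only the chosen departments should appear, each with its full head count from the original data.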