# Cluster Sampling

**Cluster sampling** is a probability sampling method in which the population is divided into separate groups, called clusters, and a random selection of those clusters is taken to represent the population. Instead of sampling individuals directly, whole clusters are sampled, and data is collected from all or some of the individuals within the selected clusters.

**Key Points**

1. Clusters: The population is divided into clusters, often based on geographic locations, institutions, or other natural groupings. Ideally, each cluster is a small-scale representation of the larger population.

2. Random Selection of Clusters: Clusters are randomly selected from the total number of clusters. The number of clusters selected depends on the size of the population and desired sample size.

3. Data Collection: Once the clusters are selected, either all individuals within the selected clusters are surveyed (one-stage cluster sampling), or a random sample of individuals from within each selected cluster is surveyed (two-stage cluster sampling).

**Example**

Let’s say you want to survey students in a country, but it's impractical to randomly sample students from all schools. Instead, you:

1. Divide the population by schools (clusters).
2. Randomly select a number of schools (clusters).
3. Survey all students from the selected schools (one-stage) or randomly sample students from these schools (two-stage).

**Types of Cluster Sampling**

1. One-Stage Cluster Sampling: Entire clusters are selected, and all individuals within these clusters are surveyed.

2. Two-Stage Cluster Sampling: Clusters are selected, but only a random sample of individuals from each cluster is surveyed.
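
The difference between the two designs can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed procedure: the `students` table, its `school` and `score` columns, and the choice of 5 schools and 10 students per school are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: 20 schools (clusters) with 30 students each
students = pd.DataFrame({
    "school": np.repeat(np.arange(20), 30),
    "score": rng.normal(70, 10, size=600),
})

# Stage 1 (both designs): randomly select 5 schools without replacement
chosen = rng.choice(students["school"].unique(), size=5, replace=False)

# One-stage: survey every student in the selected schools
one_stage = students[students["school"].isin(chosen)]

# Two-stage: randomly sample 10 students within each selected school
two_stage = one_stage.groupby("school").sample(n=10, random_state=0)

print(len(one_stage), len(two_stage))
```

Both samples come from the same five schools; the two-stage sample simply keeps a random subset of the students in each one.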

**Advantages of Cluster Sampling**

1. Cost-effective: It reduces travel and administrative costs when dealing with large populations spread over large areas.
2. Easier data collection: Since the sample is grouped, data collection can be more manageable.

**Disadvantages**

1. Less accurate: Individuals within a cluster tend to resemble each other, so a handful of clusters can misrepresent the population if those clusters happen to be atypical.
2. Higher sampling error: For the same number of individuals surveyed, cluster sampling usually has a larger sampling error than simple random sampling, because of differences between clusters.
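
A small simulation can make the higher sampling error concrete. The sketch below assumes a made-up population in which each cluster has its own mean (so members of a cluster resemble each other), then compares how much the sample mean varies across repeated draws of 100 individuals under each method. The population sizes and spreads are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(42)

n_clusters, cluster_size = 50, 20

# Hypothetical population: each cluster has its own mean, plus individual noise
cluster_means = rng.normal(0, 5, size=n_clusters)
population = cluster_means[:, None] + rng.normal(0, 1, size=(n_clusters, cluster_size))

def srs_mean():
    # Simple random sample of 100 individuals from the whole population
    return rng.choice(population.ravel(), size=100, replace=False).mean()

def cluster_mean():
    # One-stage cluster sample: 5 whole clusters = 100 individuals
    picked = rng.choice(n_clusters, size=5, replace=False)
    return population[picked].mean()

# Spread of the sample mean over 2000 repeated samples of each kind
srs_sd = np.std([srs_mean() for _ in range(2000)])
cluster_sd = np.std([cluster_mean() for _ in range(2000)])

print(f"SRS spread: {srs_sd:.3f}, cluster spread: {cluster_sd:.3f}")
```

With clusters this different from one another, the cluster-sample means spread out noticeably more than the simple-random-sample means, even though both designs survey the same number of individuals.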

In [None]:
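import pandas as pd

# Load the full dataset; it is treated as the population
df = pd.read_csv('wnba.csv')

# Randomly pick 4 teams (clusters) from the unique team names
sampled_teams = pd.Series(df['Team'].unique()).sample(4, random_state=0)
print(sampled_teams)

# Keep every player on the selected teams (one-stage cluster sampling).
# The team codes are hard-coded; if the draw above prints different teams, update them to match.
df_pho = df[df['Team'] == 'PHO']
df_ind = df[df['Team'] == 'IND']
df_min = df[df['Team'] == 'MIN']
df_atl = df[df['Team'] == 'ATL']

# Combine the four clusters into one sample
all_data = pd.concat([df_pho, df_ind, df_min, df_atl])
print(all_data)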


# Calculate population means for the whole dataset (true population values)
population_mean_height = df['Height'].mean()
population_mean_age = df['Age'].mean()
population_mean_BMI = df['BMI'].mean()
population_mean_points = df['PTS'].mean()

# Calculate the means for Height, Age, BMI, and PTS from the selected cluster data
cluster_mean_height = all_data['Height'].mean()
cluster_mean_age = all_data['Age'].mean()
cluster_mean_BMI = all_data['BMI'].mean()
cluster_mean_points = all_data['PTS'].mean()

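# Sampling error = population mean - sample (cluster) mean
sampling_error_height = population_mean_height - cluster_mean_height
sampling_error_age = population_mean_age - cluster_mean_age
sampling_error_BMI = population_mean_BMI - cluster_mean_BMI
sampling_error_points = population_mean_points - cluster_mean_points

print(sampling_error_height, sampling_error_age, sampling_error_BMI, sampling_error_points)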

**Explanation:**

**Purpose:**

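The code demonstrates one-stage cluster sampling on the WNBA dataset: four teams (clusters) are drawn at random, every player on those teams is kept, and the sample means of height, age, BMI, and points are compared with the means of the full dataset. The differences are the sampling errors, which show that a sample may not perfectly represent the whole population.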
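
The same comparison can be written more compactly by filtering with `isin` on the drawn teams and subtracting the two sets of column means in one step. The sketch below is one possible variant, not the original notebook's code. Since `wnba.csv` is not reproduced here, it uses a tiny stand-in table with the same column names; every value in it is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for wnba.csv (same column names, made-up values)
df = pd.DataFrame({
    "Team": ["PHO", "PHO", "IND", "IND", "MIN", "MIN", "ATL", "ATL", "LA", "LA"],
    "Height": [180, 185, 190, 175, 183, 188, 178, 193, 186, 181],
    "Age": [24, 31, 27, 29, 22, 35, 26, 28, 30, 25],
    "BMI": [22.1, 23.4, 24.0, 21.5, 22.8, 23.9, 21.9, 24.6, 23.1, 22.4],
    "PTS": [120, 340, 210, 95, 410, 60, 180, 275, 330, 150],
})

# Draw 2 teams (clusters) at random and keep all of their players
sampled_teams = pd.Series(df["Team"].unique()).sample(2, random_state=0)
cluster_sample = df[df["Team"].isin(sampled_teams)]

# Sampling error for each variable: population mean minus cluster-sample mean
cols = ["Height", "Age", "BMI", "PTS"]
sampling_errors = df[cols].mean() - cluster_sample[cols].mean()
print(sampling_errors)
```

Filtering on `sampled_teams` directly means the selected rows always match the random draw, so changing `random_state` or the number of teams requires no other edits.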