# Sampling Techniques
1. Random Sampling
2. Stratified Sampling
3. Systematic Sampling
4. Cluster Sampling

## Random Sampling
In this method, every member of the population has an equal chance of being selected, and each selection is made independently of the others.
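As a minimal sketch of the idea, the snippet below draws a simple random sample from a hypothetical list of member IDs with NumPy. The names `population_ids` and `srs_ids` are illustrative and not part of the notebook; the draw is made without replacement, which is also the default behavior of pandas `DataFrame.sample`.

```python
import numpy as np

# Hypothetical population of 1,000 member IDs (illustrative only).
population_ids = np.arange(1, 1001)

# Seeded generator so the draw is reproducible.
rng = np.random.default_rng(seed=0)

# Draw 100 IDs; replace=False means no member is picked twice.
srs_ids = rng.choice(population_ids, size=100, replace=False)

# Number of IDs drawn and number of distinct IDs among them.
print(len(srs_ids), len(set(srs_ids.tolist())))
```

Every ID has the same probability of ending up in `srs_ids`; nothing about a member (age, income, position in the list) influences the draw.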

In [None]:


import numpy as np
import pandas as pd

# Creating a dummy dataset
data = pd.DataFrame({
    'ID': range(1, 1001),  # 1000 individuals
    'Gender': np.random.choice(['Male', 'Female'], size=1000),
    'Age': np.random.randint(18, 70, size=1000),
    'Income': np.random.randint(20000, 150000, size=1000)  # Random income values
})

data.head()

In [3]:
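# Creating bins for the Age and Income variables.
# Note: pd.cut uses right-closed intervals by default, so a value equal to the lowest
# edge (Age == 18 or Income == 20000) falls outside every bin and becomes NaN unless
# include_lowest=True is passed. This is why the age-group proportions below sum to less than 1.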
age_bins = [18, 30, 40, 50, 60, 70]
income_bins = [20000, 50000, 80000, 110000, 140000, 170000]
data['Age_Group'] = pd.cut(data['Age'], bins=age_bins)
data['Income_Group'] = pd.cut(data['Income'], bins=income_bins)

data.head()

Unnamed: 0,ID,Gender,Age,Income,Age_Group,Income_Group
0,1,Female,39,53055,"(30, 40]","(50000, 80000]"
1,2,Female,48,123023,"(40, 50]","(110000, 140000]"
2,3,Male,39,40846,"(30, 40]","(20000, 50000]"
3,4,Male,69,86241,"(60, 70]","(80000, 110000]"
4,5,Female,54,45993,"(50, 60]","(20000, 50000]"


In [28]:
# Random sampling

# Draw a simple random sample of 100 people
random_sample = data.sample(n=100, random_state=42)
print("Random Sample:")
random_sample.head()

Random Sample:


Unnamed: 0,ID,Gender,Age,Income,Age_Group,Income_Group
521,522,Male,68,28490,"(60, 70]","(20000, 50000]"
737,738,Male,33,106411,"(30, 40]","(80000, 110000]"
740,741,Male,39,39744,"(30, 40]","(20000, 50000]"
660,661,Female,63,74745,"(60, 70]","(50000, 80000]"
411,412,Male,54,57772,"(50, 60]","(50000, 80000]"


In [44]:
# Compare group proportions in the population and in the random sample

# Age
print('proportion of age groups in the population')
print(data['Age_Group'].value_counts() / len(data))
print()

print('proportion of age groups in the sample')
print(random_sample['Age_Group'].value_counts() / len(random_sample))

# Income
print('proportion of Income groups in the population')
print(data['Income_Group'].value_counts() / len(data))
print()

print('proportion of Income groups in the sample')
print(random_sample['Income_Group'].value_counts() / len(random_sample))

proportion of age groups in the population
Age_Group
(18, 30]    0.257
(30, 40]    0.186
(40, 50]    0.183
(50, 60]    0.176
(60, 70]    0.176
Name: count, dtype: float64

proportion of age groups in the sample
Age_Group
(18, 30]    0.25
(50, 60]    0.22
(60, 70]    0.21
(40, 50]    0.16
(30, 40]    0.14
Name: count, dtype: float64
proportion of Income groups in the population
Income_Group
(110000, 140000]    0.241
(80000, 110000]     0.238
(50000, 80000]      0.232
(20000, 50000]      0.222
(140000, 170000]    0.067
Name: count, dtype: float64

proportion of Income groups in the sample
Income_Group
(50000, 80000]      0.32
(80000, 110000]     0.22
(110000, 140000]    0.22
(20000, 50000]      0.17
(140000, 170000]    0.07
Name: count, dtype: float64


## Stratified Sampling

Simple random sampling leaves the composition of the sample to chance, so the sample's proportions for important variables can differ from the population's. Stratified sampling addresses this by dividing the population into groups (strata) and sampling from each one in proportion to its size. For example, if gender is expected to affect the outcome and the population has a male-to-female ratio of 8:2, stratified sampling ensures that the sample keeps roughly the same ratio. Simple random sampling offers no such guarantee.
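The 8:2 example can be sketched on a toy population. Everything here (the `population` frame, the 5% sampling fraction, the seed) is an assumption chosen for illustration and is separate from the notebook's dataset.

```python
import pandas as pd

# Hypothetical population: 800 males and 200 females (an 8:2 ratio).
population = pd.DataFrame({
    "ID": range(1, 1001),
    "Gender": ["Male"] * 800 + ["Female"] * 200,
})

# Simple random sample of 50: the gender split is left to chance.
simple = population.sample(n=50, random_state=1)

# Proportional stratified sample: take the same 5% from each gender.
parts = [
    group.sample(frac=0.05, random_state=1)
    for _, group in population.groupby("Gender")
]
stratified = pd.concat(parts)

print(simple["Gender"].value_counts(normalize=True))
print(stratified["Gender"].value_counts(normalize=True))
```

Because each stratum contributes exactly 5% of its members, the stratified sample contains 40 males and 10 females by construction, while the split in the simple random sample varies from draw to draw.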

In [27]:
# Stratified sampling (stratifying by Age and Income)

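# Calculate the number of samples needed for each group
# (observed=True keeps only the Age/Income combinations that actually occur)
total_samples = 100
group_sizes = data.groupby(['Age_Group', 'Income_Group'], observed=True).size()
sample_sizes = (group_sizes / group_sizes.sum()) * total_samples

# Sampling fraction for each group; with proportional allocation this is the same
# value for every group (total_samples divided by the number of rows that fall into a bin)
sampling_frac = sample_sizes / group_sizes

# Stratified sampling: draw that fraction from every (Age_Group, Income_Group) group
stratified_sample = data.groupby(['Age_Group', 'Income_Group'], observed=True).apply(lambda x: x.sample(frac=sampling_frac[x.name]))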

# Reset index to remove multi-index caused by groupby
stratified_sample = stratified_sample.reset_index(drop=True)

# Displaying the samples
print("Stratified Sample:")
stratified_sample.head()


Stratified Sample:


Unnamed: 0,ID,Gender,Age,Income,Age_Group,Income_Group
0,683,Female,20,43405,"(18, 30]","(20000, 50000]"
1,952,Male,30,43280,"(18, 30]","(20000, 50000]"
2,350,Male,19,46799,"(18, 30]","(20000, 50000]"
3,598,Female,24,45628,"(18, 30]","(20000, 50000]"
4,436,Female,26,21107,"(18, 30]","(20000, 50000]"


In [43]:
# Compare group proportions in the population and in the stratified sample

# Age
print('proportion of age groups in the population')
print(data['Age_Group'].value_counts() / len(data))
print()

print('proportion of age groups in the sample')
print(stratified_sample['Age_Group'].value_counts() / len(stratified_sample))

# Income
print('proportion of Income groups in the population')
print(data['Income_Group'].value_counts() / len(data))
print()

print('proportion of Income groups in the sample')
print(stratified_sample['Income_Group'].value_counts() / len(stratified_sample))

proportion of age groups in the population
Age_Group
(18, 30]    0.257
(30, 40]    0.186
(40, 50]    0.183
(50, 60]    0.176
(60, 70]    0.176
Name: count, dtype: float64

proportion of age groups in the sample
Age_Group
(18, 30]    0.26
(30, 40]    0.19
(40, 50]    0.19
(50, 60]    0.18
(60, 70]    0.18
Name: count, dtype: float64
proportion of Income groups in the population
Income_Group
(110000, 140000]    0.241
(80000, 110000]     0.238
(50000, 80000]      0.232
(20000, 50000]      0.222
(140000, 170000]    0.067
Name: count, dtype: float64

proportion of Income groups in the sample
Income_Group
(20000, 50000]      0.23
(50000, 80000]      0.23
(80000, 110000]     0.23
(110000, 140000]    0.23
(140000, 170000]    0.08
Name: count, dtype: float64


## Systematic Sampling
Systematic sampling selects every nth member of the population. For example, if the population size is 100 and we want a sample of 10, the sampling interval is 100 / 10 = 10, so we take every 10th element. We first choose a random starting point within the first 10 elements, say the 5th element, and then select every 10th element after it. The sample therefore consists of the elements at positions 5, 15, 25, 35, 45, 55, 65, 75, 85, and 95.
The advantage of this technique is that the sample is simple to draw and is spread evenly across the whole population. The disadvantage is that if the population has a periodic pattern whose period lines up with the sampling interval, the sample can systematically over- or under-represent part of that pattern, which biases the results.
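The worked example can be reproduced in a few lines. This is only a sketch: the `population` list and the artificial periodic `values` are assumptions made up to illustrate the selection rule and the periodicity risk.

```python
# Hypothetical population of 100 elements, labelled by position 1..100.
population = list(range(1, 101))

sample_size = 10
step = len(population) // sample_size  # sampling interval: 100 // 10 = 10

start_position = 5  # the 5th element (1-based), as in the example

# Python lists are 0-based, so the 5th element sits at index 4.
systematic = population[start_position - 1::step]
print(systematic)
# [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]

# Periodic pattern with the same period as the interval:
# only positions 5, 15, 25, ... carry the value 1.
values = [1 if pos % 10 == 5 else 0 for pos in population]
picked = values[start_position - 1::step]

population_mean = sum(values) / len(values)  # 0.1
sample_mean = sum(picked) / len(picked)      # 1.0
print(population_mean, sample_mean)
```

In this contrived case the sample mean is ten times the population mean, because the starting point and the interval happen to line up with the pattern.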


In [26]:
# Systematic sampling

# Select a random starting index from 0 to 9 (iloc positions are 0-based)
random_index = np.random.randint(0, 10)

print(f'Random index selected for sampling is {random_index}')

systematic_sample = data.iloc[random_index::10] # Select every 10th individual


print("\nSystematic Sample:")
systematic_sample.head()


Random index selected for sampling is 8

Systematic Sample:


Unnamed: 0,ID,Gender,Age,Income,Age_Group,Income_Group
8,9,Female,57,78468,"(50, 60]","(50000, 80000]"
18,19,Female,26,48352,"(18, 30]","(20000, 50000]"
28,29,Male,36,126269,"(30, 40]","(110000, 140000]"
38,39,Male,26,44157,"(18, 30]","(20000, 50000]"
48,49,Female,42,105469,"(40, 50]","(80000, 110000]"


## Cluster Sampling
Cluster sampling divides the population into clusters and then selects entire clusters, rather than individual elements, to form the sample. Clusters are typically formed from natural groupings within the population; for example, a cluster might consist of the households in a particular neighborhood. The advantage of cluster sampling is that it handles large, diverse populations efficiently.
However, cluster sampling can introduce bias if the clusters are poorly defined or if the selected clusters do not reflect the diversity of the population.
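A minimal sketch of the neighborhood example follows. The `households` frame, its column names, and the choice of 3 clusters are hypothetical; the point is only that whole clusters are drawn and every member of a chosen cluster is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical households, each belonging to one of 20 neighborhoods.
households = pd.DataFrame({
    "household_id": range(1, 201),
    "neighborhood": np.repeat(np.arange(20), 10),  # 10 households each
})

# Randomly choose 3 whole neighborhoods (the clusters) ...
chosen = rng.choice(households["neighborhood"].unique(), size=3, replace=False)

# ... and keep every household inside them.
cluster_sample = households[households["neighborhood"].isin(chosen)]

print(sorted(chosen.tolist()), len(cluster_sample))
```

Unlike the other techniques, the unit of randomization here is the neighborhood, not the household, so the sample is only as diverse as the chosen neighborhoods.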


In [48]:
# Cluster sampling

# Let's create a function that splits the DataFrame into clusters, randomly selects one cluster, and returns it as the sample

import random as rd

def sample_cluster(dataframe, clusters, state=None):
    print('define variables')
    length = len(dataframe)
    print(f'  - length: {length}')
    element_max = length / clusters
    print(f'  - elements by cluster: {element_max}')
    
    cluster_list = []
    cluster_id = 0
    element_count = 0
    
    print('define clusters')
    for _ in dataframe.iterrows():
        cluster_list.append(cluster_id)
        element_count += 1
        if element_count > (element_max - 1):
            element_count = 0
            cluster_id += 1
    
    dataframe['cluster'] = cluster_list
    print(' - cluster list')
    print(dataframe['cluster'].value_counts())
    print('')
    rd.seed(state)
    cluster_selected = rd.randint(0, clusters - 1)
    print('cluster selected:', cluster_selected)
    dataframe_clustered = dataframe[dataframe['cluster'] == cluster_selected]
    print('cluster size:', dataframe_clustered.shape[0], '\n')
    
    
    #proportion of age groups in the sample 

    print('proportion of age groups in the sample')

    print(dataframe_clustered['Age_Group'].value_counts() / len(dataframe_clustered))
    
    print()
    
    print('proportion of Income groups in the sample')

    # proportion of income groups in the sample

    print(dataframe_clustered['Income_Group'].value_counts() / len(dataframe_clustered))
    
    return dataframe_clustered

sample_cluster(data, 10)

define variables
  - length: 1000
  - elements by cluster: 100.0
define clusters
 - cluster list
cluster
0    100
1    100
2    100
3    100
4    100
5    100
6    100
7    100
8    100
9    100
Name: count, dtype: int64

cluster selected: 0
cluster size: 100 

proportion of age groups in the sample
Age_Group
(18, 30]    0.29
(40, 50]    0.22
(50, 60]    0.20
(60, 70]    0.18
(30, 40]    0.10
Name: count, dtype: float64

proportion of Income groups in the sample
Income_Group
(110000, 140000]    0.30
(20000, 50000]      0.23
(50000, 80000]      0.19
(80000, 110000]     0.19
(140000, 170000]    0.09
Name: count, dtype: float64


Unnamed: 0,ID,Gender,Age,Income,Age_Group,Income_Group,cluster
0,1,Female,39,53055,"(30, 40]","(50000, 80000]",0
1,2,Female,48,123023,"(40, 50]","(110000, 140000]",0
2,3,Male,39,40846,"(30, 40]","(20000, 50000]",0
3,4,Male,69,86241,"(60, 70]","(80000, 110000]",0
4,5,Female,54,45993,"(50, 60]","(20000, 50000]",0
...,...,...,...,...,...,...,...
95,96,Female,31,143076,"(30, 40]","(140000, 170000]",0
96,97,Male,62,131058,"(60, 70]","(110000, 140000]",0
97,98,Female,27,35376,"(18, 30]","(20000, 50000]",0
98,99,Male,57,87750,"(50, 60]","(80000, 110000]",0


### Code explanation:
   Input Parameters:

    dataframe: Input the DataFrame containing the data points to be clustered.
    clusters: The number of clusters to create.
    state: (Optional) The random seed for reproducibility.
   
   Initialization:

    The function initializes variables such as length, which represents the total number of data points in the DataFrame, and element_max, which represents the maximum number of elements per cluster.
    It initializes an empty list cluster_list to store cluster assignments and sets cluster_id and element_count to track cluster assignment and element counts, respectively.

   Cluster Assignment:

    The function iterates over each row in the DataFrame using iterrows().
    For each row, it assigns a cluster ID to the cluster_list based on the current cluster_id. It also increments element_count.
    If element_count exceeds element_max - 1, it resets element_count to 0 and increments cluster_id to move to the next cluster.

   Dataframe Modification:

    The function adds a new column 'cluster' to the DataFrame, storing the assigned cluster IDs. The column is added in place, so the DataFrame passed in by the caller is modified.

   Cluster Selection:

    It seeds Python's random module with state via rd.seed(state), then selects a random cluster using rd.randint(0, clusters - 1) and stores the selected cluster ID in cluster_selected.

   Output:

    The function filters the DataFrame to include only the data points belonging to the selected cluster.
    It prints information about the selected cluster, including the cluster size.
    It returns the filtered DataFrame containing data points belonging to the selected cluster.

   Print Statements:

    The function includes print statements for logging and debugging purposes, providing information about variable values, cluster assignments, and the selected cluster.
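The steps described above can also be sketched more compactly without a row-by-row loop. This is an alternative illustration, not the notebook's function: `assign_clusters` and `pick_cluster` are hypothetical helper names, the block size is rounded up (which differs from the loop above when the length is not divisible by the number of clusters), and the input DataFrame is left unmodified.

```python
import numpy as np
import pandas as pd

def assign_clusters(dataframe, clusters):
    """Return a copy with a 'cluster' column of consecutive, near-equal blocks."""
    block_size = int(np.ceil(len(dataframe) / clusters))
    # Row positions 0..block_size-1 get ID 0, the next block gets ID 1, and so on.
    cluster_ids = np.arange(len(dataframe)) // block_size
    return dataframe.assign(cluster=cluster_ids)

def pick_cluster(dataframe, clusters, seed=None):
    """Label the rows, pick one cluster ID at random, and return its rows."""
    labelled = assign_clusters(dataframe, clusters)
    rng = np.random.default_rng(seed)
    # Choose among the cluster IDs that actually occur in the data.
    selected = rng.choice(labelled["cluster"].unique())
    return labelled[labelled["cluster"] == selected]

# Toy data standing in for the notebook's dataset.
toy = pd.DataFrame({"ID": range(1, 1001)})
picked = pick_cluster(toy, clusters=10, seed=0)
print(picked["cluster"].iloc[0], len(picked))
```

Using `DataFrame.assign` returns a new frame, so calling `pick_cluster` repeatedly does not leave a stale `cluster` column on the original data.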
   

In [47]:
# proportion of age groups in the population 

# Age 

print('proportion of age groups in the population')

print(data['Age_Group'].value_counts() / len(data))

print()


# Income 

print('proportion of Income groups in the population')

print(data['Income_Group'].value_counts() / len(data))

print()


proportion of age groups in the population
Age_Group
(18, 30]    0.257
(30, 40]    0.186
(40, 50]    0.183
(50, 60]    0.176
(60, 70]    0.176
Name: count, dtype: float64

proportion of Income groups in the population
Income_Group
(110000, 140000]    0.241
(80000, 110000]     0.238
(50000, 80000]      0.232
(20000, 50000]      0.222
(140000, 170000]    0.067
Name: count, dtype: float64



## Conclusions

1. Random sampling is easy to perform and takes less time than the other techniques, but the sample may not fully represent the population's characteristics. In our case, the age and income proportions in the random sample differed from those in the population.

2. Stratified sampling keeps the sample representative with respect to the stratification variables; it maintained the population's proportions better than random sampling.

3. Systematic sampling spreads the sample evenly across the population, but it can give biased results when the population contains a periodic pattern that lines up with the sampling interval.

4. Cluster sampling is efficient for large populations, but it requires careful definition and selection of the clusters and can be a lengthy process. In our example, the proportions in the selected cluster differed noticeably from the population's (for instance, 0.10 versus 0.186 for the (30, 40] age group).
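The comparisons above were made by reading the printed proportions side by side. One way to put a single number on them is sketched below; `max_proportion_gap` is a hypothetical helper, and the toy `population` stands in for the notebook's dataset.

```python
import numpy as np
import pandas as pd

def max_proportion_gap(population, sample, column):
    """Largest absolute difference between population and sample shares."""
    pop_share = population[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # Align on the population's categories; a category missing
    # from the sample counts as a share of 0.
    sample_share = sample_share.reindex(pop_share.index, fill_value=0.0)
    return float((pop_share - sample_share).abs().max())

# Toy population with a single categorical column (illustrative only).
rng = np.random.default_rng(seed=0)
population = pd.DataFrame({
    "ID": range(1, 1001),
    "Gender": rng.choice(["Male", "Female"], size=1000),
})

sample = population.sample(n=100, random_state=0)
gap = max_proportion_gap(population, sample, "Gender")
print(gap)
```

A gap of 0 means the sample reproduces the population's proportions exactly for that column; larger values mean a less representative sample. The same function could be applied to each technique's sample and to columns such as `Age_Group` or `Income_Group`.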
    