# Stratified Sampling in Python
#### This kernel gives a simple solution for stratified sampling in Python.

According to [Wikipedia](https://en.wikipedia.org/wiki/Stratified_sampling), "in statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations." This method can be advantageous because it keeps in the sample the same proportion of each subpopulation (stratum) that is present in the population. A simple random sample does not guarantee this.
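As a minimal sketch of that difference, the toy example below draws a 10% sample both ways. The `toy` DataFrame, its `group` column and the 90/10 split are made up for illustration, and the sketch relies on pandas' built-in `DataFrame.sample` and `DataFrameGroupBy.sample` rather than on the functions defined later in this kernel.

```python
import pandas as pd

# Hypothetical toy population: 90% of rows in group "a", 10% in group "b"
toy = pd.DataFrame({"group": ["a"] * 900 + ["b"] * 100})

# Simple random sample: the share of each group depends on the draw
simple = toy.sample(frac=0.1, random_state=0)

# Stratified sample: draw 10% from within each group separately
stratified = toy.groupby("group").sample(frac=0.1, random_state=0)

print(toy["group"].value_counts(normalize=True))
print(simple["group"].value_counts(normalize=True))
print(stratified["group"].value_counts(normalize=True))
```

The stratified sample reproduces the population shares by construction, while the shares in the simple random sample can drift from them, especially for small groups.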

There are many use cases for stratified sampling. The main idea is that we want to mimic a population with a sample. It is widely used to generate comparable samples when the objective is to perform hypothesis testing of any kind.

I have faced this situation many times, so I developed a Python module with functions that perform stratified sampling on a pandas DataFrame. I hope it can be useful in your endeavors.

For this example I will use a dataset with information about customers. It will be useful since it contains lots of variables we can use to perform stratified sampling.

*This module can also be found in [this GitHub repo](https://github.com/flaviobossolan/python_stratified_sampling)*

In [None]:
# Required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/kaggle/input/bank-additional-full.csv', sep=';')
df.head()

In [None]:
df.describe()

In [None]:
# the functions:
def stratified_sample(df, strata, size=None, seed=None, keep_index= True):
    '''
    It samples data from a pandas dataframe using strata. These functions use
    proportionate stratification:
    n1 = (N1/N) * n
    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error. In this case we use 0.02
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    :seed: sampling seed
    :keep_index: if True, it keeps a column with the original population index indicator
    
    Returns
    -------
    A sampled pandas dataframe based on a set of strata.
    Examples
    --------
    >> df.head()
    	id  sex age city 
    0	123 M   20  XYZ
    1	456 M   25  XYZ
    2	789 M   21  YZX
    3	987 F   40  ZXY
    4	654 M   45  ZXY
    ...
    # This returns a sample stratified by sex and city containing 30% of the size of
    # the original data
    >> stratified = stratified_sample(df=df, strata=['sex', 'city'], size=0.3)
    Requirements
    ------------
    - pandas
    - numpy
    '''
    population = len(df)
    size = __smpl_size(population, size)
    # copy so that adding the helper column does not touch a view of df
    tmp = df[strata].copy()
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

    # collect the sample drawn from each stratum, then concatenate once
    samples = []
    for i in range(len(tmp_grpd)):
        # number of rows to draw from this stratum
        n = int(tmp_grpd.iloc[i]['samp_size'])

        # build the query that selects the rows of this stratum
        conditions = []
        for stratum in strata:
            value = tmp_grpd.iloc[i][stratum]
            if isinstance(value, str):
                value = "'" + value + "'"
            conditions.append(stratum + ' == ' + str(value))
        qry = ' & '.join(conditions)

        samples.append(
            df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
        )

    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    stratified_df = pd.concat(samples, ignore_index=True)

    return stratified_df



def stratified_sample_report(df, strata, size=None):
    '''
    Generates a dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error. In this case we use 0.02
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Returns
    -------
    A dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    '''
    population = len(df)
    size = __smpl_size(population, size)
    # copy so that adding the helper column does not touch a view of df
    tmp = df[strata].copy()
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)
    return tmp_grpd


def __smpl_size(population, size):
    '''
    A function to compute the sample size. If not informed, a sampling 
    size will be calculated using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error. In this case we use 0.02
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Parameters
    ----------
        :population: population size
        :size: sample size (default = None)
    Returns
    -------
    Calculated sample size to be used in the functions:
        - stratified_sample
        - stratified_sample_report
    '''
    if size is None:
        cochran_n = round(((1.96)**2 * 0.5 * 0.5)/ 0.02**2)
        n = round(cochran_n/(1+((cochran_n -1) /population)))
    elif size >= 0 and size < 1:
        n = round(population * size)
    elif size < 0:
        raise ValueError('Parameter "size" must be an integer or a proportion between 0 and 0.99.')
    elif size >= 1:
        n = size
    return n

Note that the functions above are already documented in their docstrings.

Let's first take a look at the "stratified_sample_report" function:

In [None]:
help(stratified_sample_report)

This function returns a report with the number of rows in each stratum of the population (`size`) and the number of rows that stratum would receive in the sample (`samp_size`).
Note that these functions perform proportionate stratified sampling.
If we don't specify a sample size, the function calculates one using [Cochran's formula](https://en.wikipedia.org/wiki/Cochran%27s_theorem) with a 95% confidence level and a 2% margin of error.
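To make the default sample size concrete, here is a minimal sketch of the calculation described in the docstrings. The helper name `cochran_sample_size`, its keyword arguments and the population of 50,000 are assumptions chosen for illustration; the default values (1.96, 0.5 and 0.02) are the ones used in `__smpl_size`.

```python
def cochran_sample_size(population, z=1.96, p=0.5, e=0.02):
    """Illustrative helper: Cochran's sample size with finite population adjustment."""
    q = 1 - p
    # unadjusted size, as if the population were infinite
    cochran_n = (z ** 2 * p * q) / e ** 2
    # adjustment for a finite population of the given size
    adjusted = cochran_n / (1 + (cochran_n - 1) / population)
    return cochran_n, adjusted


# hypothetical population of 50,000 rows
n0, n_adj = cochran_sample_size(population=50_000)
print(round(n0), round(n_adj))
```

The adjusted size is always smaller than the unadjusted one, and the gap grows as the population shrinks. Unlike `__smpl_size`, this sketch does not round the intermediate value before applying the adjustment.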

Let's say we want to draw a random stratified sample from the above dataset based on the variable "marital". Let's first compute the sampling report:

In [None]:
stratified_sample_report(df, ['marital'])

This means that the original population (our imported DataFrame) contains 4,612 divorced people. A stratified sample would contain 264 divorced people in order to keep the same population proportion.

Note that here we did not specify the sample size. If we specify one, the sample size may change:

In [None]:
# here I want a sample size of 10K rows
stratified_sample_report(df, ['marital'], 10000)

Note how the sample size changes. The important thing here is that the proportions are kept even though the sample size increases. This gives a sample that mimics the population in the variable "marital".

Here we used only one variable. We could add more:

In [None]:
stratified_sample_report(df, ['marital', 'education'], 10000)

Here we see that in a sample containing 10K rows, we would have 119 rows of people that are divorced and have an education level of basic 4y.

Please note that some cases will have zero representation in the sample. For instance, take a look at row 20: single and illiterate have such a small representation in this population that it will not be considered in a sample of 10K rows.
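The sketch below shows why this happens, using the proportionate allocation formula from the docstring with rounding to whole rows. The stratum names and counts are hypothetical and are not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical stratum sizes: one large, one medium, one very small
counts = pd.Series({"large": 8_994, "medium": 1_002, "tiny": 4}, name="size")
population = counts.sum()  # 10,000 rows in total
sample_size = 1_000

# Proportionate allocation: n_h = (N_h / N) * n, rounded to whole rows
allocation = (sample_size / population * counts).round().astype(int)

print(allocation)
print("total rows allocated:", allocation.sum())
```

The `tiny` stratum is entitled to 0.4 rows, which rounds down to zero. Because each stratum is rounded independently, the total number of allocated rows can also differ slightly from the requested sample size.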

#### Creating a Stratified Sampled DataFrame
Let's take a look at the function's parameters:

In [None]:
help(stratified_sample)

The help tells us that it uses the same parameters as the previous function, plus two more. `seed` lets us get the same sample each time, and `keep_index` controls whether the original DataFrame index is kept as a column or discarded.
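Both behaviours come from plain pandas calls that `stratified_sample` makes internally, `DataFrame.sample(random_state=...)` and `DataFrame.reset_index(drop=...)`. The short sketch below illustrates them on a made-up DataFrame (`toy` and its column `x` are assumptions for the example).

```python
import pandas as pd

# Hypothetical data: 100 rows labelled 0..99
toy = pd.DataFrame({"x": range(100)})

# The same seed gives identical rows in the same order
first = toy.sample(n=5, random_state=123)
second = toy.sample(n=5, random_state=123)
print(first.equals(second))

# drop=False keeps the original row labels in an "index" column,
# which corresponds to keep_index=True in stratified_sample
kept = first.reset_index(drop=False)
dropped = first.reset_index(drop=True)
print(kept.columns.tolist(), dropped.columns.tolist())
```

Keeping the original index is handy when the sampled rows need to be traced back to, or excluded from, the full population later on.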

Let's create a sample of 10K rows using the variables age, marital and education as strata:

In [None]:
# sample
sample_df = stratified_sample(df, ['age', 'marital', 'education'], size=10000, seed=123, keep_index= True)
sample_df.head()

In [None]:
sample_df.tail()

Here we can see a sample of about 10K rows that maintains the same proportions as the population across the chosen strata.
I hope this kernel may be useful in your statistical studies.

Please upvote if you liked this content, it helps me a lot!
Thanks!