# Sampling: Expenditure per capita of entrepreneurs in DKI Jakarta in 2020

## A. Introduction & Background

DKI Jakarta is known for thriving entrepreneurial activity across sectors such as fashion, food and beverage, services, and technology. In 2020, however, Indonesia, including DKI Jakarta, faced economic challenges caused by the COVID-19 pandemic. The widespread disruption forced many businesses to adapt in order to survive and tested the resilience of businesses and entrepreneurs all over the world, making 2020 a year of significant change.

This study aims to understand how entrepreneurs spent their money during that period. Analysing entrepreneurs' spending throughout 2020 can reveal how they adapted to the new normal, which can guide future economic resilience efforts.



## B. Import Library

In [113]:
# Import library
import numpy as np
import pandas as pd
import math
import copy



## C. Import Raw Data Set

In [114]:
# Import raw dataset as initial data
initial_2020_gender_exp = pd.read_csv('Sampling Project - Rifa Aisya Putri - Expenditure by Gender.csv')
initial_2020_working_status = pd.read_csv('Sampling Project - Rifa Aisya Putri - Total Workers.csv')
initial_2020_gender = pd.read_csv('Sampling Project - Rifa Aisya Putri - Dist gender.csv')
initial_2020_city = pd.read_csv('Sampling Project - Rifa Aisya Putri - Dist Population.csv')
initial_2020_district = pd.read_csv('Sampling Project - Rifa Aisya Putri - Kecamatan.csv')

In [115]:
initial_2020_city

Unnamed: 0,City,Distribution
0,Jakarta Selatan,21.1
1,Jakarta Timur,28.8
2,Jakarta Pusat,10.2
3,Jakarta Barat,23.3
4,Jakarta Utara,16.6


## D. Sampling Design

### Population Stages

**Population**
- First Stage: All regencies/cities in DKI Jakarta
- Second Stage: All districts in regencies/cities in DKI Jakarta
- Third Stage: All entrepreneurs/self-employed individuals in districts in DKI Jakarta

**Target Population**
<br>The target population includes all individuals who are entrepreneurs or self-employed workers in DKI Jakarta.

**Sampling Frame**
<br>To obtain information from the entire population, I used the data from [Jakarta Bureau of Statistics](https://jakarta.bps.go.id/).

**Sampling Unit**
- First Stage: All regencies/cities in DKI Jakarta
- Second Stage: All districts in regencies/cities in DKI Jakarta
- Third Stage: All entrepreneurs/self-employed individuals in districts in DKI Jakarta

**Observation Unit**
<br> The observation unit for this study is entrepreneurs in DKI Jakarta, as they are the subjects from whom data on per capita spending will be collected.

**Analysis Unit**
<br>The analysis unit for this study is at the level of entrepreneurs.

**Characteristics Studied**
<br>The characteristic studied is the expenditure per capita of entrepreneurs in DKI Jakarta.

**Estimated characteristic values**
<br>The estimated characteristic values are average expenditure, proportion of expenditure, and total expenditure.

### Sampling Method
The sampling approach used is multistage sampling, which breaks the sampling down into several stages to keep the sample representative and manageable. First, we stratify the population of DKI Jakarta by gender and by city, since per capita expenditure is expected to differ across these groups.
<br>Following this, we implement cluster sampling within each stratum. This involves selecting specific clusters or geographic units, such as districts ("kecamatan") within DKI Jakarta. Within the chosen clusters, we perform simple random sampling to select individual entrepreneurs, which helps keep the sampling process unbiased and representative.
<br>Once the data is collected, we can analyze it to estimate the expenditure per capita of entrepreneurs in DKI Jakarta for the year 2020.


## E. Estimate The Total Sample

With multistage sampling, our sampling process encompasses four stages:

- Stage 1: Dividing the population into two strata (gender)

- Stage 2: Dividing the population into five strata (city)

- Stage 3: Cluster sampling for districts within each stratum

- Stage 4: Sampling using the Simple Random Sampling (SRS) method

### Stage 1: Dividing/Categorizing the population into two strata (Gender)

This categorization is based on the initial hypothesis that per capita expenditure differs by gender (female or male). Prior research reports that females' per capita expenditure is approximately IDR 9.04 million, roughly 39% less than the IDR 15 million recorded for males (source: https://jurnal.bppk.kemenkeu.go.id/jurnalbppk/article/download/626/331). Hence, the population is divided into two gender categories using data provided by BPS DKI Jakarta.


As the cost and variance of the population are not available, we first set up the cost function with assumed values and then calculate the variance.
<br>$ C = c_{0} + c_{1}n + c_{2}nm $

In [116]:
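# Cost assumptions (illustrative values, since actual costs are not available)
c0 = 4_000_000   # fixed (overhead) cost
c1 = 50_000      # cost per primary sampling unit (city)
c2 = 50_000      # cost per sampled element
N = 5            # number of cities in DKI Jakarta
n_ = 5           # number of cities (PSUs) included
M = 2000         # assumed average cluster size
nm = 2000        # assumed prior total sample size (n x m)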

In [117]:
initial_2020_gender_exp

Unnamed: 0,City,Male,Female,Mean
0,South Jakarta,27360314,23003910,2717039
1,East Jakarta,21873567,17113162,1866273
2,Central Jakarta,19390719,16577167,2388623
3,West Jakarta,23697197,19398457,2236259
4,North Jakarta,26025198,18087289,2297793
5,DKI Jakarta,22614576,16742345,2257991


In [118]:
# Extract the province-level mean (row 5, "DKI Jakarta") from the "Mean" column
clean_2020_exp = initial_2020_gender_exp.loc[5, "Mean"]

# Squared deviation of each city mean from the province-level mean
initial_2020_gender_exp["Variance"] = (initial_2020_gender_exp["Mean"] - clean_2020_exp) ** 2
initial_2020_gender_exp

Unnamed: 0,City,Male,Female,Mean,Variance
0,South Jakarta,27360314,23003910,2717039,210725066304
1,East Jakarta,21873567,17113162,1866273,153442991524
2,Central Jakarta,19390719,16577167,2388623,17064719424
3,West Jakarta,23697197,19398457,2236259,472279824
4,North Jakarta,26025198,18087289,2297793,1584199204
5,DKI Jakarta,22614576,16742345,2257991,0


$$
m_{\text{opt}} = \sqrt{\frac{c_1 \sigma_w^2}{c_2\left(\sigma_b^2 - \sigma_w^2/\bar{M}\right)}}
$$

$$
\sigma_b^2 = \frac{1}{N-1} \sum_{i=1}^{N} (\mu_i - \mu)^2
$$

$$
\sigma_w^2 = \frac{1}{N} \sum_{i=1}^{N} \sigma_i^2
$$


As we have already calculated the variance column, we can now calculate σ_b^2 and σ_w^2 based on the formulas above. Finally, we can get the value of m_opt.

In [119]:
# Calculate σ_b^2 and σ_w^2
sigma_b = (1/(N-1))*(initial_2020_gender_exp["Variance"].sum())
sigma_w = (1/(N))*(initial_2020_gender_exp["Variance"].sum())
print("The value of σ_b^2 is", sigma_b)
print("The value of σ_w^2 is", sigma_w)

The value of σ_b^2 is 95822314070.0
The value of σ_w^2 is 76657851256.0


In [120]:
# Calculate value of m_opt
m_opt = math.sqrt(c1*sigma_w/(c2*(sigma_b-(sigma_w/M))))
print("The value of m_opt is", m_opt)

The value of m_opt is 0.8946061301216421


The next step is to calculate the total allocation cost (C) with 5 PSUs, corresponding to the number of selected cities, and an assumed prior sample size of 2,000. This can be computed as follows:

In [121]:
# Calculate total allocation cost with C formula
C = c0 + c1*n_+c2*nm
print("Total allocation cost is",C)

Total allocation cost is 104250000


The total allocation cost has already been calculated, so we can use the following equation to determine the required sample size.

$$
n = \frac{C - c_{0}}{c_{1} + c_{2} m_{\text{opt}}}
$$


In [122]:
# Calculate total sample or n
n = (C-c0)/(c1+c2*m_opt)
n = math.ceil(n)
print("Total sample needed is", n)

Total sample needed is 1059
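
The cost and sample-size steps above can be gathered into one function as a cross-check. This is a minimal sketch rather than the notebook's own code: the function name `two_stage_design` and its argument names are made up for illustration, and it follows the notebook in deriving both σ_b^2 and σ_w^2 from the same squared deviations of the city means.

```python
import math


def two_stage_design(city_means, overall_mean, c0, c1, c2, m_bar, n_psu, prior_size):
    """Return (m_opt, n) for the cost model C = c0 + c1*n + c2*n*m.

    Hypothetical helper: the name and signature are illustrative only.
    """
    n_cities = len(city_means)
    # Squared deviation of each city mean from the overall mean
    sq_dev = [(m - overall_mean) ** 2 for m in city_means]
    sigma_b2 = sum(sq_dev) / (n_cities - 1)  # between-cluster variance
    sigma_w2 = sum(sq_dev) / n_cities        # proxy used in the notebook for sigma_w^2

    # Optimal number of elements per cluster
    m_opt = math.sqrt(c1 * sigma_w2 / (c2 * (sigma_b2 - sigma_w2 / m_bar)))

    # Budget implied by the assumed prior design, then the affordable sample size
    budget = c0 + c1 * n_psu + c2 * prior_size
    n = math.ceil((budget - c0) / (c1 + c2 * m_opt))
    return m_opt, n


# City means and province mean taken from the "Mean" column above
city_means = [2717039, 1866273, 2388623, 2236259, 2297793]
m_opt, n = two_stage_design(city_means, 2257991, 4_000_000, 50_000, 50_000,
                            m_bar=2000, n_psu=5, prior_size=2000)
print(m_opt, n)
```

Keeping the assumptions as function arguments makes it easy to see how sensitive the sample size is to the assumed costs.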


The total sample is allocated across the gender strata using proportional allocation, resulting in the following sample sizes:

In [123]:
# Duplicate the data of gender distribution
df_gender_2020 = copy.copy(initial_2020_gender)
df_gender_2020["Total Percentage"] = df_gender_2020["Total Percentage"]/100
df_gender_2020["Total Sample"] = round(df_gender_2020["Total Percentage"]*n,0)
df_gender_2020



Unnamed: 0,Gender,Total Percentage,Total Sample
0,Female,0.495,524.0
1,Male,0.505,535.0



### Stage 2: Dividing/Categorizing the population into five strata (City)

This categorization is undertaken because we assume there are differences in per capita expenditure between cities. We divide the population into five cities: South Jakarta, East Jakarta, Central Jakarta, West Jakarta, and North Jakarta, using data from BPS DKI Jakarta. We again calculate the sample size using proportional allocation.

In [125]:
# Collect the gender and city labels as lists
gender = df_gender_2020["Gender"].values.tolist()
city = initial_2020_city["City"].values.tolist()

In [128]:
sampling_design = pd.DataFrame({'Gender' : ['Female','Female','Female','Female','Female','Male','Male','Male','Male','Male'],
                                 'City' : ['South Jakarta', 'East Jakarta', 'Central Jakarta', 'West Jakarta', 'North Jakarta',
                                            'South Jakarta', 'East Jakarta', 'Central Jakarta', 'West Jakarta', 'North Jakarta']})
sampling_design

Unnamed: 0,Gender,City
0,Female,South Jakarta
1,Female,East Jakarta
2,Female,Central Jakarta
3,Female,West Jakarta
4,Female,North Jakarta
5,Male,South Jakarta
6,Male,East Jakarta
7,Male,Central Jakarta
8,Male,West Jakarta
9,Male,North Jakarta


In [129]:
# City shares as a plain array, so values are assigned by position rather than by index label
city_share = initial_2020_city["Distribution"].iloc[0:5].to_numpy() / 100

# Calculate total sample for female (rows 0-4) and male (rows 5-9)
female_sample = np.round(city_share * df_gender_2020["Total Sample"][0])
male_sample = np.round(city_share * df_gender_2020["Total Sample"][1])

# Assign the whole column at once to avoid chained assignment on a slice
sampling_design["Total Sample"] = np.concatenate([female_sample, male_sample])


In [130]:
sampling_design

Unnamed: 0,Gender,City,Total Sample
0,Female,South Jakarta,111.0
1,Female,East Jakarta,151.0
2,Female,Central Jakarta,53.0
3,Female,West Jakarta,122.0
4,Female,North Jakarta,87.0
5,Male,South Jakarta,113.0
6,Male,East Jakarta,154.0
7,Male,Central Jakarta,55.0
8,Male,West Jakarta,125.0
9,Male,North Jakarta,89.0
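
Rounding each cell independently does not guarantee that the city allocations add up to the gender total: the male rows above sum to 536 rather than 535. One possible remedy is largest-remainder allocation, sketched below. The function name `largest_remainder` is an illustrative choice and not part of the original analysis.

```python
import math


def largest_remainder(total, shares):
    """Split `total` into integers proportional to `shares`, summing exactly to `total`."""
    share_sum = sum(shares)
    exact = [total * s / share_sum for s in shares]
    counts = [math.floor(x) for x in exact]
    leftover = total - sum(counts)
    # Give the remaining units to the strata with the largest fractional parts
    order = sorted(range(len(shares)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


city_shares = [21.1, 28.8, 10.2, 23.3, 16.6]  # South, East, Central, West, North
female = largest_remainder(524, city_shares)
male = largest_remainder(535, city_shares)
print(female, sum(female))
print(male, sum(male))
```

With this rule the female allocation is unchanged, while the male allocation for Central Jakarta drops by one so that the total matches 535.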


### Stage 3: Cluster sampling for districts within each stratum


The number of district clusters to sample in each city is determined with the cluster sampling formulas below, where $N$ is the number of districts in the city, $M_i$ is the size of cluster $i$, $y_i$ is the cluster total, $\mu$ is the overall mean, and $B$ is an assumed value:

$$
n = \frac{N \sigma_c^2}{N D + \sigma_c^2}
$$

$$
\sigma_c^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu M_i)^2
$$

$$
D = \frac{B^2}{4N^2}
$$
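
As a small worked example of the sample-size formula, the sketch below evaluates it for a single city. The function name `cluster_sample_size` and the input values are illustrative assumptions, not figures taken from the data.

```python
def cluster_sample_size(n_clusters, sigma_c2, bound):
    """Clusters to sample: n = N*sigma_c^2 / (N*D + sigma_c^2), with D = B^2 / (4*N^2)."""
    d = bound ** 2 / (4 * n_clusters ** 2)
    return n_clusters * sigma_c2 / (n_clusters * d + sigma_c2)


# Illustrative inputs only: 8 districts, an assumed sigma_c^2 and an assumed B
n = cluster_sample_size(n_clusters=8, sigma_c2=4.0e12, bound=10_000_000)
print(n, round(n))
```

Because the denominator always exceeds σ_c^2 / N times N, the result stays below the number of districts, and a larger B lowers the number of clusters required.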

In [131]:
# Calculate the total sample needed for each city
sample_per_city = sampling_design.groupby(["City"]).sum(numeric_only=True)
sample_per_city 

Unnamed: 0_level_0,Total Sample
City,Unnamed: 1_level_1
Central Jakarta,108.0
East Jakarta,305.0
North Jakarta,176.0
South Jakarta,224.0
West Jakarta,247.0


In [132]:
# Total sample from all cities
total_sample_agg = sampling_design["Total Sample"].sum()
print("Total sample needed is ",total_sample_agg)

Total sample needed is  1060.0


In [133]:
# Display district table
initial_2020_district

Unnamed: 0,City,Total district
0,South Jakarta,10
1,East Jakarta,10
2,Centra Jakarta,8
3,West Jakarta,8
4,North Jakarta,6


In [134]:
# Extract total sample values for each city (groupby output is sorted alphabetically by city)
sample_district = sample_per_city.iloc[0:5, 0].tolist()

# Display the result
sample_district


[108.0, 305.0, 176.0, 224.0, 247.0]

In [135]:
# Input the data into the table
initial_2020_district["M District"] = sample_district
initial_2020_district

Unnamed: 0,City,Total district,M District
0,South Jakarta,10,108.0
1,East Jakarta,10,305.0
2,Centra Jakarta,8,176.0
3,West Jakarta,8,224.0
4,North Jakarta,6,247.0
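
Note that `groupby` returns the cities in alphabetical order, whereas the district table lists them as South, East, Central, West, North, so a positional assignment pairs the totals with different cities. A label-based alignment could look like the sketch below. The small frames are rebuilt inline from the tables above, `design` and `districts` are illustrative names, and the city names are assumed to be spelled identically in both tables (the "Centra Jakarta" entry in the source table would need correcting first).

```python
import pandas as pd

cities = ["South Jakarta", "East Jakarta", "Central Jakarta", "West Jakarta", "North Jakarta"]
design = pd.DataFrame({
    "City": cities * 2,
    "Total Sample": [111, 151, 53, 122, 87, 113, 154, 55, 125, 89],
})
districts = pd.DataFrame({"City": cities, "Total district": [10, 10, 8, 8, 6]})

# groupby sorts the keys alphabetically, so align on the city label, not on position
sample_per_city = design.groupby("City")["Total Sample"].sum()
districts["M District"] = districts["City"].map(sample_per_city)
print(districts)
```

Mapping on the `City` column keeps each total attached to its own city regardless of row order.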


In [136]:
# Make an assumption for the value of B, setting it at 10,000,000
B = 10_000_000

# Number of districts per city (N in the formulas above)
n_district = initial_2020_district["Total district"].iloc[0:5]

# Calculate "y" and "miu x m"
initial_2020_gender_exp["y"] = initial_2020_gender_exp["Mean"][0:5]*initial_2020_district["M District"]
initial_2020_gender_exp["miu x m"] = clean_2020_exp*initial_2020_district["M District"]

# Calculate "(y-miu x m)^2"
initial_2020_gender_exp["(y-miu x m)^2"] = (initial_2020_gender_exp["y"] - initial_2020_gender_exp["miu x m"])**2

# Calculate "sigma c"
initial_2020_gender_exp["sigma c"] = initial_2020_gender_exp["(y-miu x m)^2"] / (n_district - 1)

# Calculate "D"
initial_2020_gender_exp["D"] = B**2 / (4 * n_district**2)

# Calculate "n" with rounding
initial_2020_gender_exp["n"] = round(
    (n_district * initial_2020_gender_exp["sigma c"])
    / (n_district * initial_2020_gender_exp["D"] + initial_2020_gender_exp["sigma c"])
)

# Display the DataFrame
initial_2020_gender_exp

Unnamed: 0,City,Male,Female,Mean,Variance,y,miu x m,(y-miu x m)^2,sigma c,D,n
0,South Jakarta,27360314,23003910,2717039,210725066304,293440212.0,243863028.0,2457897000000000.0,273099700000000.0,250000000000.0,10.0
1,East Jakarta,21873567,17113162,1866273,153442991524,569213265.0,688687255.0,1.427403e+16,1586004000000000.0,250000000000.0,10.0
2,Central Jakarta,19390719,16577167,2388623,17064719424,420397648.0,397406416.0,528596700000000.0,75513820000000.0,390625000000.0,8.0
3,West Jakarta,23697197,19398457,2236259,472279824,500922016.0,505789984.0,23697110000000.0,3385302000000.0,390625000000.0,4.0
4,North Jakarta,26025198,18087289,2297793,1584199204,567554871.0,557723777.0,96650410000000.0,19330080000000.0,694444400000.0,5.0
5,DKI Jakarta,22614576,16742345,2257991,0,,,,,,


### Stage 4: Sampling using Simple Random Sampling method or SRS

In [137]:
# Create a new column in "sampling_design" DataFrame to store the "Total Sample District"
# The five per-city values are repeated for the female (rows 0-4) and male (rows 5-9) strata
district_n = initial_2020_gender_exp["n"].iloc[0:5].to_numpy()
sampling_design["Total Sample District"] = np.tile(district_n, 2)
sampling_design


Unnamed: 0,Gender,City,Total Sample,Total Sample District
0,Female,South Jakarta,111.0,10.0
1,Female,East Jakarta,151.0,10.0
2,Female,Central Jakarta,53.0,8.0
3,Female,West Jakarta,122.0,4.0
4,Female,North Jakarta,87.0,5.0
5,Male,South Jakarta,113.0,10.0
6,Male,East Jakarta,154.0,10.0
7,Male,Central Jakarta,55.0,8.0
8,Male,West Jakarta,125.0,4.0
9,Male,North Jakarta,89.0,5.0
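
Selecting the districts within a city by simple random sampling could be sketched as follows. The district labels are hypothetical placeholders and the seed is arbitrary; only the counts (8 districts, 4 to select) mirror the West Jakarta row above.

```python
import random

# Hypothetical district labels for one city with 8 districts
districts = [f"District {i}" for i in range(1, 9)]

rng = random.Random(2020)              # fixed seed so the draw can be reproduced
selected = rng.sample(districts, k=4)  # draw 4 districts without replacement
print(selected)
```

The same call can be reused at the last stage to draw individual entrepreneurs from the frame of each selected district.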


The distribution of the sample across the selected districts was calculated manually in Excel, with the following result:

In [138]:
new_value = [
    [10, 10, 10, 10, 10, 10, 10, 10, 11, 11],
    [15, 15, 15, 15, 15, 15, 15, 15, 15, 16],
    [7, 7, 7, 7, 7, 7, 7, 4],
    [31, 31, 30, 30],
    [17, 17, 17, 18, 18],
    [11, 11, 11, 11, 11, 11, 11, 11, 11, 12],
    [15, 15, 15, 15, 15, 15, 15, 15, 15, 16],
    [7, 7, 7, 7, 7, 7, 7, 6],
    [31, 31, 31, 32],
    [18, 18, 18, 18, 17]
]
sampling_design["District Distribution"] = new_value
sampling_design

Unnamed: 0,Gender,City,Total Sample,Total Sample District,District Distribution
0,Female,South Jakarta,111.0,10.0,"[10, 10, 10, 10, 10, 10, 10, 10, 11, 11]"
1,Female,East Jakarta,151.0,10.0,"[15, 15, 15, 15, 15, 15, 15, 15, 15, 16]"
2,Female,Central Jakarta,53.0,8.0,"[7, 7, 7, 7, 7, 7, 7, 4]"
3,Female,West Jakarta,122.0,4.0,"[31, 31, 30, 30]"
4,Female,North Jakarta,87.0,5.0,"[17, 17, 17, 18, 18]"
5,Male,South Jakarta,113.0,10.0,"[11, 11, 11, 11, 11, 11, 11, 11, 11, 12]"
6,Male,East Jakarta,154.0,10.0,"[15, 15, 15, 15, 15, 15, 15, 15, 15, 16]"
7,Male,Central Jakarta,55.0,8.0,"[7, 7, 7, 7, 7, 7, 7, 6]"
8,Male,West Jakarta,125.0,4.0,"[31, 31, 31, 32]"
9,Male,North Jakarta,89.0,5.0,"[18, 18, 18, 18, 17]"
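
The manual split can also be cross-checked programmatically. The sketch below assumes the intended rule is a near-equal split of each stratum total across its selected districts, which is an assumption about the spreadsheet step; `split_evenly` is an illustrative helper name.

```python
def split_evenly(total, parts):
    """Split `total` into `parts` integers that differ by at most 1 and sum to `total`."""
    total, parts = int(total), int(parts)
    base, extra = divmod(total, parts)
    return [base] * (parts - extra) + [base + 1] * extra


# (Total Sample, Total Sample District) for the female strata in the table above
rows = [(111, 10), (151, 10), (53, 8), (122, 4), (87, 5)]
for total, k in rows:
    alloc = split_evenly(total, k)
    assert sum(alloc) == total  # each split must add back up to the stratum total
    print(total, k, alloc)
```

Checking that every list sums to its stratum total is a quick way to catch transcription slips in a manually prepared table.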


## F. Data Collection Method

Data collection is a crucial component of this study of per capita expenditure in DKI Jakarta in 2020. We use phone and online messenger surveys, which are cheaper and faster than traditional methods such as face-to-face interviews or mailed questionnaires. To begin, we draw a random sample of phone numbers from the target population, and trained interviewers contact these numbers to collect data through a series of questions.

<br> However, there are some important considerations to bear in mind. The sample must truly reflect the population, and interviewers need to be skilled in phone interviewing to obtain reliable answers. Costs can still be a factor, some people may be unavailable or unwilling to participate, and responses may not always be completely accurate.

## G. Parameter Estimation Method

Following the execution of the sampling method as described previously, the data obtained from the samples can be used to estimate the expenditure per capita of self-employed individuals in DKI Jakarta's population. The parameters of interest for inference include the total, mean, and variance. The formulas are as follows:

a. An unbiased estimator of the population
<br>![image](https://github.com/rifaisyap/sampling_project/assets/134842689/7e9898c8-fd7d-409f-843c-9f63db2b33ad)


b. The variance formula
<br>![image](https://github.com/rifaisyap/sampling_project/assets/134842689/116c1419-8b43-406b-88c0-a80ee38d53f1)


c. The mean formula
<br>![image](https://github.com/rifaisyap/sampling_project/assets/134842689/cf4077f1-d068-4e7f-9584-d851993f1be5)

<br> Once the sampling is completed, the sample results can be entered into the formulas above to derive sample statistics that closely estimate the population parameters.
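
As a rough illustration of how stratum-level results might be combined, the sketch below uses the simplest stratified form, in which each stratum mean is weighted by its population share. This is an assumption and may differ from the formulas in the images above; the sample means and the population size are hypothetical placeholders, not survey results.

```python
def stratified_mean(weights, stratum_means):
    """Weighted mean: sum of W_h * ybar_h, with the weights normalised to sum to 1."""
    w_sum = sum(weights)
    return sum(w * m for w, m in zip(weights, stratum_means)) / w_sum


weights = [21.1, 28.8, 10.2, 23.3, 16.6]             # city population shares (%)
sample_means = [2.7e6, 1.9e6, 2.4e6, 2.2e6, 2.3e6]   # hypothetical stratum sample means
population_size = 100_000                            # hypothetical number of entrepreneurs

mean_hat = stratified_mean(weights, sample_means)
total_hat = population_size * mean_hat
print(mean_hat, total_hat)
```

A full estimator for this design would also need to account for the cluster stage, which this sketch leaves out.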


## H. Analysis & Conclusion

Based on the results of the sampling method, it can be concluded that:

<br><b>Sampling Design Overview</b>
<br>The sampling design above leads to several key insights. First, it highlights the critical role of prior knowledge in the design process, which allows better-informed assumptions about parameter values. Second, the sampling method is multistage: the first stratification is based on gender and the second on cities. Cluster sampling is then implemented at the district level, while simple random sampling is applied at the lowest level.

<br><b>Sample Size and Cost Analysis</b>
<br>The sampling design calculations show that about 1,060 samples (1,059 before the per-stratum allocations are rounded) are needed to collect reliable data about entrepreneurs in DKI Jakarta. The total cost for this sampling is Rp 104,250,000, which is relatively reasonable for stakeholders with survey needs.

<br><b>Flexible Data Collection Method</b>
<br>To facilitate the sampling process, a flexible data collection method is used, combining telephone and online messenger channels to gather information on participants’ monthly expenses. This approach supports efficient resource allocation and gives surveyors flexibility, making data collection more adaptable.

## I. Advantages and Disadvantages of The Sampling Method


<br>The advantages of this sampling design are:

- Efficiency: Multi-stage sampling is efficient here because the population is large and widely dispersed.

- Geographic Coverage: Samples are obtained from different geographic areas or strata, so the sample represents a diverse range of locations within DKI Jakarta.

- Diversity: Multi-stage sampling allows for the inclusion of various strata or groups within the population, ensuring that different subpopulations are represented in the sample.

- Resource Allocation: Researchers can allocate resources more efficiently by focusing on specific strata or clusters that are of particular interest.

<br>The disadvantages are:

- Complexity: Multi-stage sampling can be more complex to design and execute compared to simpler sampling methods, which may require a greater understanding of the population's structure and characteristics.

- Potential for Bias: The use of multiple stages introduces the potential for bias at each stage, which can affect the overall representativeness of the final sample and its generalizability to the population.

- Increased Costs: Implementing multi-stage sampling can be more expensive than single-stage methods due to the need for additional resources, such as selecting primary sampling units, conducting multiple stages of sampling, and managing the logistics of the process.



## J. Initial Data Source

- https://www.bps.go.id/indicator/40/461/2/pengeluaran-per-kapita-yang-disesuaikan-menurut-jenis-kelamin.html 
- https://jakarta.bps.go.id/indicator/26/897/1/-metode-baru-pengeluaran-per-kapita-disesuaikan.html 
- https://jakarta.bps.go.id/indicator/6/1283/1/jumlah-pekerja-menurut-status-pekerjaan-di-provinsi-dki-jakarta.html 
- https://jakarta.bps.go.id/indicator/12/1270/1/jumlah-penduduk-menurut-kabupaten-kota-di-provinsi-dki-jakarta-.html
