In [1]:
insurance <- read.csv("insurance.csv")

insurance <- na.omit(insurance) #removing NA observations
head(insurance) #check the dataset loaded
N <- nrow(insurance) #population size 
N

,age,sex,bmi,children,smoker,region,charges
,<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<dbl>
1,19,female,27.9,0,yes,southwest,16884.924
2,18,male,33.77,1,no,southeast,1725.552
3,28,male,33.0,3,no,southeast,4449.462
4,33,male,22.705,0,no,northwest,21984.471
5,32,male,28.88,0,no,northwest,3866.855
6,31,female,25.74,0,no,southeast,3756.622


In [2]:
N.h <- tapply(insurance$charges, insurance$region, length) #population size for different regions
regions <- names(N.h) # name of the regions
regions 
N.h


n <- 400 #sample size total
n.h.prop <- round((N.h / N) * n) # proportional allocation: each stratum's sample size is proportional to its population size
n.h.prop
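
Rounding each stratum's share can make the allocated sizes add up to slightly more or less than `n`, so it is worth checking the total. The sketch below illustrates that check. The stratum sizes are assumed for illustration only (the notebook does not print `N.h`), and giving any shortfall or excess to the largest stratum is just one simple convention, not the notebook's method.

```r
# Assumed stratum sizes for illustration; the notebook's actual N.h output is not shown
N.h <- c(northeast = 324, northwest = 325, southeast = 364, southwest = 325)
N <- sum(N.h)   # population size
n <- 400        # total sample size

# Proportional allocation, rounded to whole observations
n.h.prop <- round(N.h / N * n)

# Rounding can make the total drift away from n, so check it explicitly
if (sum(n.h.prop) != n) {
  # One simple convention (an assumption): let the largest stratum absorb the difference
  largest <- which.max(N.h)
  n.h.prop[largest] <- n.h.prop[largest] + (n - sum(n.h.prop))
}

n.h.prop
sum(n.h.prop)
```

With these assumed sizes the rounded allocation already sums to 400, so the adjustment branch is not triggered.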

In [3]:
set.seed(0)  # Set a seed for reproducibility
stratified_sample <- NULL  # Start empty; rbind() below builds up the stratified sample

# Loop over each region to create a stratified sample
for (i in seq_along(regions)) {
  # Get row indices for the current region (regions are in alphabetical order, starting with northeast)
  row_indices <- which(insurance$region == regions[i])
  
  # Sample the indices without replacement
  sample_indices <- sample(row_indices, n.h.prop[i], replace = FALSE)
  
  # Extract the sampled rows (all columns) and append them to the stratified sample
  stratified_sample <- rbind(stratified_sample, insurance[sample_indices, ])
}


ybar.h.prop <- tapply(stratified_sample$charges, stratified_sample$region, mean) # sample mean of charges in each stratum
var.h.prop <- tapply(stratified_sample$charges, stratified_sample$region, var) # sample variance of charges in each stratum
se.h.prop <- sqrt((1 - n.h.prop / N.h) * var.h.prop / n.h.prop) # standard error of each stratum mean, with finite population correction

ybar.str.prop <- sum(N.h / N * ybar.h.prop) # estimated population mean under stratified sampling
se.str.prop <- sqrt(sum((N.h / N)^2 * se.h.prop^2)) # standard error of the estimated population mean
str.prop <- c(ybar.str.prop, se.str.prop) # estimate and its standard error, combined


ybar.h.prop
var.h.prop 
se.h.prop
rbind(ybar.h.prop, se.h.prop) #binding the mean and standard error for each strata
ybar.str.prop
se.str.prop
str.prop

,northeast,northwest,southeast,southwest
ybar.h.prop,13551.318,13308.585,15004.426,13942.864
se.h.prop,1064.675,1010.793,1224.407,1063.141
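
The per-region sampling loop can also be written without growing a data frame inside the loop. The sketch below shows one possible vectorized version on a toy data frame; the data, the stratum labels, and the per-stratum sizes are all hypothetical. It uses `sample.int()` on positions rather than `sample()` on the index vector, because `sample(x, ...)` treats a length-one numeric `x` as the range `1:x`, which would matter if a stratum ever contained a single row.

```r
set.seed(0)

# Toy population (hypothetical data, not the insurance data set)
toy <- data.frame(
  region  = rep(c("a", "b", "c"), times = c(50, 30, 20)),
  charges = rnorm(100, mean = 10000, sd = 2000)
)

# Assumed per-stratum sample sizes, named by stratum
n.h <- c(a = 10, b = 6, c = 4)

# Split the row indices by stratum
idx.by.stratum <- split(seq_len(nrow(toy)), toy$region)

# Within each stratum, draw `size` positions without replacement and map them back to row indices
sampled.idx <- unlist(Map(
  function(idx, size) idx[sample.int(length(idx), size)],
  idx.by.stratum,
  n.h[names(idx.by.stratum)]
))

toy.sample <- toy[sampled.idx, ]
table(toy.sample$region)
```

Each stratum contributes exactly its allocated number of rows, and no row can be drawn twice because the strata do not overlap and sampling within a stratum is without replacement.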


In [4]:
# Obtain a 95% C.I. for our estimate of population mean
lower <- ybar.str.prop - 1.96 * se.str.prop
upper <- ybar.str.prop + 1.96 * se.str.prop
c(lower, upper)

We also performed stratified sampling on our data. The parameter of interest is the population mean of the medical costs billed by health insurance to individuals in the U.S. We stratified by region because we expect medical costs to vary from one region to another. There are four regions in our data set, so we obtained four strata and used proportional allocation to decide the sample size for each stratum. The total sample size is 400. Under proportional allocation, we sampled 97 individuals each from the Northeast, Northwest, and Southwest, and 109 individuals from the Southeast.

We then used R to draw a sample without replacement from each region and calculated the sample mean and standard error for each stratum. To obtain the stratified estimate of the population mean, we multiplied each stratum's sample mean by its share of the population, $N_h/N$, and summed the results. This gave an estimate of 13982.78; that is, the population mean medical cost of individuals in the U.S. is estimated to be 13982.78 dollars. We obtained the standard error of this estimate by applying the formula

$$\mathrm{SE}(\bar{y}_{str}) = \sqrt{\sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}},$$

where $N_h$, $n_h$, and $s_h^2$ are the population size, sample size, and sample variance of charges in stratum $h$, and $H = 4$ is the number of strata. This gave a standard error of 551.71 dollars.

With the estimate and its standard error, we also constructed a 95% confidence interval. Its lower and upper bounds are the estimate minus and plus 1.96 times the standard error, which gives (12901.42, 15064.14). In the context of our project, the interpretation is that over repeated sampling, approximately 95% of the confidence intervals constructed this way would capture the true population mean of medical costs of individuals in the U.S.
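
The weighted-mean estimate, its standard error, and the confidence interval described above can be bundled into one small function. The sketch below is a hypothetical helper, `stratified_mean_ci()`, which is not part of the original notebook. The per-stratum means and standard errors are copied from the printed output; the stratum sizes `N.h` are an assumption, since the notebook does not show them.

```r
# Hypothetical helper (not from the original notebook): combines per-stratum
# means and standard errors into a stratified estimate with a normal-based CI
stratified_mean_ci <- function(N.h, ybar.h, se.h, z = 1.96) {
  W.h <- N.h / sum(N.h)              # stratum weights N_h / N
  est <- sum(W.h * ybar.h)           # weighted mean of the stratum means
  se  <- sqrt(sum(W.h^2 * se.h^2))   # standard error of the weighted mean
  c(estimate = est, se = se, lower = est - z * se, upper = est + z * se)
}

# Per-stratum sample means and standard errors as printed in the notebook output
ybar.h <- c(northeast = 13551.318, northwest = 13308.585,
            southeast = 15004.426, southwest = 13942.864)
se.h   <- c(northeast = 1064.675, northwest = 1010.793,
            southeast = 1224.407, southwest = 1063.141)

# Assumed stratum sizes (not shown in the notebook)
N.h <- c(northeast = 324, northwest = 325, southeast = 364, southwest = 325)

res <- stratified_mean_ci(N.h, ybar.h, se.h)
round(res, 2)
```

If the assumed stratum sizes match the data, the result should land close to the reported estimate of 13982.78 and standard error of 551.71. The standard errors passed in already include the finite population correction, so the helper does not apply it again.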