# Assignment 2, Question 2

In [1]:
library(data.table)
library(weights)
library(autumn)
library(survey)

Loading required package: Hmisc


Attaching package: 'Hmisc'


The following objects are masked from 'package:base':

    format.pval, units


Loading required package: grid

Loading required package: Matrix

Loading required package: survival


Attaching package: 'survey'


The following object is masked from 'package:Hmisc':

    deff


The following object is masked from 'package:graphics':

    dotchart




# Full joint adjustment

In [2]:
df = fread('survey_data.csv')

In [3]:
# unweighted mean of drug_use within each gender x age_group cell
df[, mean(drug_use), .(gender, age_group)]

gender,age_group,V1
<chr>,<chr>,<dbl>
male,18-30,0.24193548
male,31-50,0.18567639
male,51+,0.08080808
female,18-30,0.12668464
female,31-50,0.05592105
female,51+,0.03526448


In [4]:
sample_counts = df[, .(group_counts=.N), .(gender, age_group)]
sample_counts[, sample_prop := group_counts/sum(group_counts)]

In [5]:
# census distribution of the six gender x age_group cells (proportions sum to 1)
proportions = c(0.1589, 0.2119, 0.1094, 0.1682, 0.2310, 0.1206)
gender = c(rep("male", 3), rep("female", 3))
age_group = rep(c("18-30", "31-50", "51+"), 2)

# Build the census table; row order matches `proportions` (male cells first, then female)
census = data.table(gender = gender, age_group = age_group, census_proportion = proportions)


In [6]:
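# post-stratification weight per cell = census share / sample share
tab = merge(sample_counts, census, by=c('gender', 'age_group'))
tab[, weights := census_proportion/sample_prop]
# attach the cell weight to every respondent
df = merge(df, tab[, .(gender, age_group, weights)], by=c('gender', 'age_group'))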

In [7]:
summary(df$weights)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6379  0.7979  0.9521  1.0000  1.1803  2.3206 

In [8]:
# check: the weighted cell shares should reproduce the census proportions
test = df[, .(group = sum(weights)), .(gender, age_group)]
test[, prop := group / sum(group)]
test

gender,age_group,group,prop
<chr>,<chr>,<dbl>,<dbl>
female,18-30,353.22,0.1682
female,31-50,485.1,0.231
female,51+,253.26,0.1206
male,18-30,333.69,0.1589
male,31-50,444.99,0.2119
male,51+,229.74,0.1094


In [9]:
# estimate without weights
ds = svydesign(ids=~1,data=df)
sm = svymean(~drug_use, ds)
print(sm)
confint(sm)

"No weights or probabilities supplied, assuming equal probability"


            mean     SE
drug_use 0.11095 0.0069


Unnamed: 0,2.5 %,97.5 %
drug_use,0.0975163,0.1243885


In [10]:
# joint adjustment
ds = svydesign(ids=~1,data=df, weights=~weights)
sm = svymean(~drug_use, ds)
print(sm)
confint(sm)


            mean     SE
drug_use 0.12511 0.0079


Unnamed: 0,2.5 %,97.5 %
drug_use,0.1095774,0.1406382
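
The weighted estimate is higher than the unweighted one because men, who report more drug use, make up about 34% of the sample but 48% of the census. As a quick cross-check, the post-stratified mean should equal the census-weighted average of the cell means. The sketch below uses only numbers already printed in this notebook; the variable names `census_prop`, `cell_mean` and `post_strat_mean` are introduced here for illustration.

```r
# Census cell shares and unweighted cell means, copied from the cells above
# (order: male 18-30, 31-50, 51+, then female 18-30, 31-50, 51+)
census_prop <- c(0.1589, 0.2119, 0.1094, 0.1682, 0.2310, 0.1206)
cell_mean   <- c(0.24193548, 0.18567639, 0.08080808,
                 0.12668464, 0.05592105, 0.03526448)

# Post-stratified mean = sum over cells of (census share * cell mean)
post_strat_mean <- sum(census_prop * cell_mean)
round(post_strat_mean, 5)  # 0.12511, the same value svymean() reports
```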


In [11]:
design_effect(df$weights)
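
The design effect can also be computed by hand. The sketch below assumes that `design_effect()` reports Kish's approximation for unequal weights (worth confirming against the autumn documentation); `kish_deff` is a hypothetical helper, and the weights are toy values rather than the survey weights.

```r
# Kish's approximate design effect due to unequal weighting:
# deff = n * sum(w^2) / sum(w)^2, which equals 1 + CV(w)^2
kish_deff <- function(w) {
  length(w) * sum(w^2) / sum(w)^2
}

# Toy weights (made up for illustration, not taken from survey_data.csv)
w_equal   <- rep(1, 4)
w_unequal <- c(0.5, 0.5, 1.5, 1.5)

kish_deff(w_equal)    # 1: equal weights carry no variance penalty
kish_deff(w_unequal)  # 1.25

# Effective sample size = n / deff
length(w_unequal) / kish_deff(w_unequal)  # 3.2
```

A deff of 1.25 means the weighted sample is roughly as informative as an equally weighted sample 1/1.25 = 80% of its size.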

# Raking

In [12]:
target = list(
    gender = c('male'=0.4802, 'female'=0.5198), 
    age_group = c('18-30'=0.3271, '31-50'=0.4429, '51+'=0.23)
)
target = normalize(target)

df = harvest(df, target, weight_column = 'rake_weights')
diagnose_weights(df, target, df$rake_weights)

variable,level,prop_original,prop_weighted,target,error_original,error_weighted
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
gender,male,0.3447619,0.4802,0.4802,0.135438095,6.071699e-12
gender,female,0.6552381,0.5198,0.5198,0.135438095,6.067924e-12
age_group,18-30,0.2947619,0.3271,0.3271,0.032338095,2.275957e-15
age_group,31-50,0.4690476,0.4429,0.4429,0.026147619,1.249001e-14
age_group,51+,0.2361905,0.23,0.23,0.006190476,2.581269e-15
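
To make the raking step less of a black box, here is a minimal base-R sketch of iterative proportional fitting, the idea behind raking. It is not necessarily how `harvest()` is implemented. The cell counts are made-up toy numbers; only the marginal targets come from the cell above.

```r
# Toy sample counts by gender x age_group (made-up numbers, NOT the survey data)
counts <- matrix(c( 60, 130,  75,
                   140, 250, 145),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("male", "female"),
                                 age_group = c("18-30", "31-50", "51+")))

# Marginal targets, as in the `target` list above
target_gender <- c(male = 0.4802, female = 0.5198)
target_age    <- c("18-30" = 0.3271, "31-50" = 0.4429, "51+" = 0.23)

# Start from the sample cell shares and alternately rescale rows and columns
sample_share <- counts / sum(counts)
fitted <- sample_share
for (i in 1:100) {
  fitted <- fitted * (target_gender / rowSums(fitted))           # match gender margin
  fitted <- sweep(fitted, 2, target_age / colSums(fitted), "*")  # match age margin
  if (max(abs(rowSums(fitted) - target_gender)) < 1e-10) break   # both margins agree
}

# Weight for each respondent in a cell = fitted share / sample share (mean weight 1)
cell_weights <- fitted / sample_share
round(rowSums(fitted), 4)  # gender margin after raking
round(colSums(fitted), 4)  # age margin after raking
```

Only the margins are constrained, so the fitted cell shares keep the sample's gender-by-age association rather than the census one.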


In [13]:
design_effect(df$rake_weights)

In [14]:
summary(df$rake_weights)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7281  0.7281  0.8466  1.0000  1.2928  1.5032 

In [15]:
ds = svydesign(ids=~1, data=df, weights=~rake_weights)
sm = svymean(~drug_use, ds)
print(sm)
confint(sm)

            mean    SE
drug_use 0.12811 0.008


Unnamed: 0,2.5 %,97.5 %
drug_use,0.1123591,0.1438537


# Which solution is better?

The joint adjustment (post-stratification on the full gender-by-age table) produces more extreme weights than raking (maximum 2.32 versus 1.50) and a higher design effect ("deff"). In return, it reproduces every cell of the census table, so if drug use depends on the interaction between gender and age, its estimate should be slightly less biased than the raking estimate, although slightly less precise. Raking matches only the two marginal distributions: the weights are less extreme and the deff is lower, but the information in the joint distribution is lost. When the joint distribution is available (which is not always the case), one option is to keep the joint adjustment and trim the most extreme weights, which improves the design effect at the price of a small amount of bias.
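
The trimming idea can be sketched as follows. This is one simple variant (cap, rescale to mean 1, repeat) under assumptions of my own: `trim_weights`, `cap`, and the toy weights are hypothetical, and the cap of 1.5 is arbitrary rather than a recommendation.

```r
# Cap weights at `cap`, rescale so the mean weight is 1, and repeat until
# the rescaled weights no longer exceed the cap (up to a small tolerance).
trim_weights <- function(w, cap, max_iter = 50) {
  for (i in seq_len(max_iter)) {
    w <- pmin(w, cap)   # trim the extreme weights
    w <- w / mean(w)    # restore mean 1 (this pushes weights up slightly)
    if (max(w) <= cap * (1 + 1e-8)) break
  }
  w
}

# Kish's approximate design effect, used here to compare before and after
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2

# Toy weights with mean 1 (made up for illustration, not the survey weights)
w <- c(0.6, 0.7, 0.8, 0.9, 2.0)
w_trimmed <- trim_weights(w, cap = 1.5)

kish_deff(w)          # design effect before trimming
kish_deff(w_trimmed)  # smaller after trimming
```

After trimming, the weighted cell shares no longer match the census exactly, which is where the extra bias comes from.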