/
LearningActivity-BDA.Rmd
189 lines (160 loc) · 5.74 KB
/
LearningActivity-BDA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Learning activity - Bayesian Data Analysis"
author: "Eva Ascarza and Tian Zheng"
date: "December 21, 2015"
output: html_document
---
In this learning activity, we will first simulate some customer behavior data according to a model and then we will use Bayesian modeling and `rstan` to model the simulated data. Simulation is a great way of learning about how a method works.
If you have not done so already, you need to first install `rstan` by typing `install.packages("rstan"")` in your console windown (lower left panel in Rstudio).
```{r}
require(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores()-2)
```
### Simulate the data
We will simulate a data set with 5000 invidiuals drawn from a population of musume members. Among museum members,
+ $X_1$ `gender`: Approximately 60% are females.
+ $X_2$ `fotography`: 50% expressed interest for fotography.
+ $X_3$ `sculpture`:30% expressed interest for sculpture.
+ $X_4$ `digital_content`: The average number of digital contents consumed by the members is 0.5.
+ $X_5$ `months_since`: The average length since last visit is 5 months.
```{r}
N <- 5000
# we assume 5000 members
female <- rbinom(N,1,0.6)
# simulate 60% of female
fotography <- rbinom(N,1,0.5)
# simulate 50% of customers who expressed interest for fotography
sculpture <- rbinom(N,1,0.3)
# simulate 50% of customers who expressed interest for sculpture
digital_content <- rpois(N,0.5)
# simulate digital content
months_since <- rpois(N,5)
# simulate months since last visit, with a max of 12
months_since[months_since>12] <- 12
```
### simulate renewal behavior and prepare the data
We are interested in whether these members will renew their membership. $Y$ (`renewal`) equals 1 if the member decides to renew, 0 otherwise. We assume the following [probit model](https://en.wikipedia.org/wiki/Probit_model) for Y given X.
$$
P(Y=1 | X_1, X_2, X_3, X_4, X_5) = \Phi(\beta_0 +\beta_1 X_1 +\beta_2 X_2 + \beta_3 X_3 +\beta_4 X_4 + \beta_5 X_5)
$$
where $\Phi(\cdot)$ is the [cumulative distribution function (CDF)](https://en.wikipedia.org/wiki/Cumulative_distribution_function) of the standard normal distribution.
```{r}
beta0 <- 0.6
beta1 <- 0.9
beta2 <- 0.6
beta3 <- 0.01
beta4 <- -0.01
beta5 <- -0.2
prob_simul <- pnorm(beta0 + beta1*female + beta2*fotography + beta3*sculpture + beta4*digital_content + beta5*months_since)
renewal <- rbinom(N,1,prob_simul)
```
### Prepare data for Stan
We first create an X matrix that combines all the input variables, including a column corresponding to the intercept. `cbind` is the `R` command for combining vectors as columns.
```{r}
X <- cbind(rep(1,N),
female,
fotography,
sculpture,
digital_content,
months_since)
# assign column names
colnames(X) <- c("intercept",
"female",
"fotography",
"sculpture",
"digital_content",
"months_since")
K<-dim(X)[2]
```
### Code for Stan
`stan` uses a specific modeling syntax. It requires the specification of the types of data and parameters, in addition to model statements.
```{r}
probit <- '
data{
int<lower=0> N; # number of observations
int<lower=0> K; # number of parameters
int<lower=0,upper=1> y[N];
vector[K] X[N];
}
parameters{
vector[K] beta;
}
model{
beta ~ cauchy(0,5);
for(j in 1:N)
y[j] ~ bernoulli(Phi_approx(dot_product(X[j],beta)));
}
'
```
### Run STAN
The initialization step of `stan` may take a little while but the running time should
be just a couple of minuets.
```{r}
data_list <- list(N=N,K=K,y=renewal,X=X)
fit <- stan(model_code=probit,
data=data_list,
warmup=2000,
iter=4000,
chains=2)
```
### SUMMARY of restuls
In the summary of output, the posterior distribution for each model coefficient is summaried using its percentiles.
```{r}
print(fit)
fitlist <- extract(fit)
```
#### Convergence
Convergence plots show whether the draws from posterior distributions are well mixed together.
```{r}
par(mfrow=c(2,3))
plot(fitlist$beta[,1],type="l",xlab="",ylab="Intercept")
plot(fitlist$beta[,2],type="l",xlab="",ylab="Gender Coefficient")
plot(fitlist$beta[,3],type="l",xlab="",ylab="Fotography Coefficient")
plot(fitlist$beta[,4],type="l",xlab="",ylab="Sculpture Coefficient")
plot(fitlist$beta[,5],type="l",xlab="",ylab="Digital content Coefficient")
plot(fitlist$beta[,6],type="l",xlab="",ylab="Months since login")
```
#### Histograms of posterior distributions
Posterior distributions of model coefficients.
```{r}
par(mfrow=c(2,3))
par(bg="white", fg="black",
col.lab="black", col.axis="black",
col.main="black", lwd=1)
#Gender
param <- fitlist$beta[,2]
min <- min(0,min(param))
max <- max(0,max(param))
hist(param,breaks=seq(min-0.05,max+0.05,0.05),main="Female",xlab="", ylab="",yaxt='n')
axis(1, lab=T , lwd=2)
abline(v=beta1, col=2)
#PHOTO
param <- fitlist$beta[,3]
min <- min(0,min(param))
max <- max(0,max(param))
hist(param,breaks=seq(min-0.04,max+0.04,0.04),main="Photography",xlab="", ylab="",yaxt='n')
axis(1, lab=T , lwd=2)
abline(v=beta2, col=2)
#SCULPTURE
param <- fitlist$beta[,4]
min <- min(0,min(param))
max <- max(0,max(param))
hist(param,breaks=seq(min-0.03,max+0.03,0.03),main="Sculpture",xlab="", ylab="",yaxt='n')
axis(1, lab=T , lwd=2)
abline(v=beta3, col=2)
#ONLINE CONTENT
param <- fitlist$beta[,5]
min <- min(0,min(param))
max <- max(0,max(param))
hist(param,breaks=seq(min-0.02,max+0.02,0.02),main="Downloads",xlab="", ylab="",yaxt='n')
axis(1, lab=T , lwd=2)
abline(v=beta4, col=2)
#VISIT
param <- fitlist$beta[,6]
min <- min(0,min(param))
max <- max(0,max(param))
hist(param,breaks=seq(min-0.01,max+0.01,0.01),main="Months since visit",xlab="", ylab="",yaxt='n')
axis(1, lab=T , lwd=2)
abline(v=beta5, col=2)
```