# Chapter 9: Conditional Expectation

## Mystery prize simulation

We can use simulation to show that in Example 9.1.7, the example of bidding on a mystery prize with unknown value, any bid will lead to a negative payout on average. First choose a bid `b` (we chose $0.6$); then simulate a large number of hypothetical mystery prizes and store them in `v`:

In [None]:
b <- 0.6
nsim <- 10^5
v <- runif(nsim)

The bid is accepted if `b > (2/3)*v`. To get the average profit conditional on an accepted bid, we use square brackets to keep only those values of `v` satisfying the condition:

In [None]:
mean(v[b > (2/3)*v]) - b

This value is negative for any bid `b > 0`, as you can check by experimenting with different values of `b`.
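One way to run that experiment over several bids at once is sketched below. The vector `bids`, the name `avg_profit`, and the seed are our own illustrative choices, not part of Example 9.1.7. The `theory` column relies on the fact that, for $b < 2/3$, an accepted bid means $V < 3b/2$, so $V$ is conditionally uniform on $(0, 3b/2)$ with mean $3b/4$, which gives an average profit of $-b/4$.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
nsim <- 10^5
v <- runif(nsim)                 # simulated prize values

bids <- c(0.1, 0.3, 0.5, 0.6)    # hypothetical bids to compare, all below 2/3

# average profit conditional on the bid being accepted, one entry per bid
avg_profit <- sapply(bids, function(bid) mean(v[bid > (2/3)*v]) - bid)

# simulated values side by side with the theoretical value -b/4
cbind(bid = bids, simulated = avg_profit, theory = -bids/4)
```

Every entry in the `simulated` column should be negative and close to the matching entry in the `theory` column.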

## Time until $HH$ vs. $HT$

To verify the results of Example 9.1.9, we can start by generating a long sequence of fair coin tosses. This is done with the `sample` command. We use `paste` with the `collapse=""` argument to turn these tosses into a single string of $H$'s and $T$'s:

In [None]:
paste(sample(c("H","T"),100,replace=TRUE),collapse="")

A sequence of length 100 is enough to virtually guarantee that both $HH$ and $HT$ will have appeared at least once.

To determine how many tosses are required on average to see $HH$ and $HT$, we need to generate many sequences of coin tosses. For this, we use our familiar friend `replicate`:

In [None]:
r <- replicate(10^3, paste(sample(c("H","T"), 100, replace=TRUE), collapse=""))

Now `r` contains a thousand sequences of coin tosses, each of length $100$. To find the first appearance of $HH$ in each of these sequences, you can use the `str_locate` command from the `stringr` package.

In [None]:
install.packages("stringr") # comment out if already installed
library(stringr)

After you've installed and loaded the package,

In [None]:
t <- str_locate(r,"HH")

creates a two-column table `t`, whose columns contain the starting and ending positions of the first appearance of $HH$ in each sequence of coin tosses. (Use `head(t)` to display the first few rows of the table and get an idea of what your results look like.) What we want are the ending positions, given by the second column. In particular, we want the average value of the second column, which is an approximation of the average waiting time for $HH$:

In [None]:
mean(t[,2])

Is your answer around 6? If you try again with `"HT"` instead of `"HH"`, is your answer around 4?
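If you would rather not install a package, the same check can be sketched in base R with `regexpr`, which returns the starting position of the first match in each string (or `-1` if there is no match). Adding 1 to the starting position gives the ending position of a two-letter pattern. The seed, the larger number of replications, and the names `end_HH` and `end_HT` are our own choices for illustration.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility

# 10^4 sequences of 100 fair coin tosses, each collapsed into one string
r <- replicate(10^4, paste(sample(c("H","T"), 100, replace=TRUE), collapse=""))

# ending position of the first HH and the first HT in each sequence
# (a sequence with no match would give 0 here, which is extremely unlikely
# for sequences of length 100)
end_HH <- as.integer(regexpr("HH", r, fixed=TRUE)) + 1
end_HT <- as.integer(regexpr("HT", r, fixed=TRUE)) + 1

# approximate average waiting times for the two patterns
c(HH = mean(end_HH), HT = mean(end_HT))
```

The two averages should land near 6 and 4, respectively.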

## Linear regression

In Example 9.3.10, we derived formulas for the slope and intercept of a linear regression model, which can be used to predict a response variable using an explanatory variable. Let's try to apply these formulas to a simulated dataset:

In [None]:
x <- rnorm(100)
y <- 3 + 5*x + rnorm(100)

The vector `x` contains $100$ realizations of the random variable $X \sim \mathcal N(0,1)$, and the vector `y` contains $100$ realizations of the random variable $Y=a+bX+\epsilon$ where $\epsilon \sim \mathcal N(0,1)$. As we can see, the true values of $a$ and $b$ for this dataset are $3$ and $5$, respectively. We can visualize the data as a scatterplot with `plot(x,y)`.

Now let's see if we can get good estimates of the true $a$ and $b$, using the formulas in Example 9.3.10:

In [None]:
b <- cov(x,y) / var(x)
a <- mean(y) - b*mean(x)

Here `cov(x,y)`, `var(x)`, and `mean(x)` provide the sample covariance, sample variance, and sample mean, estimating the quantities $\text{Cov}(X,Y)$, $\text{Var}(X)$, and $E(X)$, respectively. (We have discussed sample mean and sample variance in detail in earlier chapters. Sample covariance is defined analogously, and is a natural way to estimate the true covariance.)
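To make "defined analogously" concrete, here is a small sketch that computes the sample covariance by hand and compares it with `cov`. It regenerates data in the same way as above so that it runs on its own; the seed and the names `n` and `manual_cov` are our own choices. Like the sample variance, the sample covariance divides by $n-1$ rather than $n$.

```r
set.seed(9)                      # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- 3 + 5*x + rnorm(100)

n <- length(x)

# average product of deviations from the sample means, with divisor n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# the two values should agree up to floating-point error
c(manual = manual_cov, builtin = cov(x, y))
```

Replacing `y` with `x` in the formula recovers the sample variance, so `cov(x,x)` and `var(x)` agree.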

You should find that `b` is close to $5$ and `a` is close to $3$. These estimated values define the _line of best fit_. The `abline` command lets us plot the line of best fit on top of our scatterplot:

In [None]:
plot(x,y)
abline(a=a,b=b)

The first argument to `abline` is the intercept of the line, and the second argument is the slope.
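As a cross-check, R's built-in `lm` function fits the same line by least squares, so its coefficients should match the values from the formulas. The sketch below regenerates data in the same way as above so that it runs on its own; the seed and the name `fit` are our own choices.

```r
set.seed(9)                      # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- 3 + 5*x + rnorm(100)

# slope and intercept from the formulas
b <- cov(x,y) / var(x)
a <- mean(y) - b*mean(x)

# least-squares fit of y on x; coef(fit) returns (intercept, slope)
fit <- lm(y ~ x)

# the two rows should agree up to floating-point error
rbind(formulas = c(a, b), lm = unname(coef(fit)))
```

After `plot(x,y)`, calling `abline(fit)` draws the same line as `abline(a=a,b=b)`.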