# How to design efficient experiments

In this lecture we are going to:

* Properly define what an efficient design is
* Use the mathematical definition to create good designs
* Give some final guidelines on design of experiments for choice modelling

In [None]:
import pandas as pd
import numpy as np

In [None]:
betas = np.matrix(' 1 1; 2 2')

In [None]:
def choice_prob(betas, X):
  V = np.matmul(X, betas)
  P = np.exp(V)
  return P / np.sum(P, axis = 1)

# Working example: Designing a smartphone
We are trying to design a smartphone. We want to create a survey that asks potential customers about their preferences, and then model those preferences using a choice model.
 For the sake of exposition, let's imagine that we can alter three attributes for two alternatives.
The attributes are:
 * **Price**
 * **Screen Size**
 * **Processor Speed**

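The two alternatives are identified by the 'operating system', either 'apple' or
'android'. (In the column names used below, the `os_*` columns hold the processor speed of each alternative.)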

In [None]:
colnames = ['price_apple', 'size_apple', 'os_apple', 'price_android', 'size_android', 'os_android']

When creating the survey, we have to decide how many choice situations to create. Ideally, we would test a wide range of price, screen size and processor combinations.

But let's assume that in the survey we ask each respondent to choose between only two alternatives at a time. For example, the respondent is asked
the following question:

*Which of these two smartphones would you buy?*



| Attribute | Apple | Android |
| ----------- | ----------- | ----------- |
| Price (AUD) | 800 | 1200 |
| Screen Size (in) | 4.7 | 5.8 |
| Processor (GHz) | 3.2 | 1.8 |

In order to get a good model out of our survey, it is intuitive that we should get data from different values for the attributes of the alternatives.

Therefore we would have to create several variations of the question.

But just how many variations?

Let's simplify and consider only a few levels per attribute:
 * **Price**: 800 AUD, 1000 AUD, 1200 AUD
 * **Screen Size**: 4.7 in, 5.8 in
 * **Processor**: 1.8 GHz, 3.2 GHz

Note that this is a strong reduction of the possible values that would
create 'feasible' smartphones. Even ignoring pricing, there is much more variety
in the Screen Size and Processor attributes in the current market.


How many variations?
For each smartphone we have 12 different combinations of price, size and speed,
coming from the 3 levels for price, 2 levels for size and 2 levels for speed:
$$3 \times 2 \times 2 = 12$$

In [None]:
from sklearn.utils.extmath import cartesian
pd.DataFrame(cartesian(([800.0, 1000.0, 1200.0], [4.7, 5.8], [1.8, 3.2])), columns=['price', 'size', 'speed'])

Unnamed: 0,price,size,speed
0,800.0,4.7,1.8
1,800.0,4.7,3.2
2,800.0,5.8,1.8
3,800.0,5.8,3.2
4,1000.0,4.7,1.8
5,1000.0,4.7,3.2
6,1000.0,5.8,1.8
7,1000.0,5.8,3.2
8,1200.0,4.7,1.8
9,1200.0,4.7,3.2


And this is for each type of smartphone, either apple or android.

If we want to compare all possible apple smartphones against all possible
android smartphones, we would have $12 \times 12 = 144$ possible choice questions.

In [None]:
full_factorial = pd.DataFrame(cartesian(([800.0, 1000.0, 1200.0], [4.7, 5.8], [1.8, 3.2], [800.0, 1000.0, 1200.0], [4.7, 5.8], [1.8, 3.2])),
                              columns=colnames)
full_factorial

Unnamed: 0,price_apple,size_apple,os_apple,price_android,size_android,os_android
0,800.0,4.7,1.8,800.0,4.7,1.8
1,800.0,4.7,1.8,800.0,4.7,3.2
2,800.0,4.7,1.8,800.0,5.8,1.8
3,800.0,4.7,1.8,800.0,5.8,3.2
4,800.0,4.7,1.8,1000.0,4.7,1.8
...,...,...,...,...,...,...
139,1200.0,5.8,3.2,1000.0,5.8,3.2
140,1200.0,5.8,3.2,1200.0,4.7,1.8
141,1200.0,5.8,3.2,1200.0,4.7,3.2
142,1200.0,5.8,3.2,1200.0,5.8,1.8


We would have to sample at least 144 people (assuming one question per subject) just to get one observation per choice situation.

In a more 'realistic' scenario, the number of potential choice situations quickly gets out of control: there are too many variations to collect a sample for each one.
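How fast the number of choice situations grows can be checked with a few lines. This is only a sketch: the helper `n_choice_situations` and the 'richer' survey with more levels are hypothetical, chosen for illustration.

```python
from math import prod

def n_choice_situations(levels_per_attribute, n_alternatives):
    # Profiles per alternative = product of the number of levels of each attribute.
    profiles = prod(levels_per_attribute)
    # Every alternative can show any profile, so the counts multiply.
    return profiles ** n_alternatives

# The smartphone example: 3 prices, 2 sizes, 2 speeds, 2 alternatives.
print(n_choice_situations([3, 2, 2], 2))   # 144

# A hypothetical richer survey: 5 prices, 4 sizes, 4 speeds, 3 alternatives.
print(n_choice_situations([5, 4, 4], 3))   # 512000
```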

The practical question now becomes:

**Assuming a limited budget of people we can ask, which choice situations should we choose among the possible in order to estimate the model as well as possible?**

The solution comes from defining what we mean by *'estimate the model as well as possible'*. This measure is often called the **'efficiency'** of the experiment,
and its definition is to some extent **arbitrary**. There are many ways of defining efficiency;
we will see the ones that can be considered the most popular.



# Efficiency in experimental designs

Plainly, when we design an experiment we have to deal with the problem that we cannot perform the 'ideal' experiment. The efficiency of an experiment is, intuitively, a measure of how much we lose in a given experiment compared to the ideal experiment. The less we lose, the more efficient the experiment.

Most measures of the efficiency of an experiment are based on the variance of the estimated coefficients of the model, that is, on the covariance matrix of the estimator of the $\beta$s.

# Covariance matrix for the $\beta$

*We will illustrate the concept with standard linear regression and then add the 'choice probability layer' on top of it later. You can also think of this as modelling the utilities, which are linear in the Multinomial Logit.*


Recall from basic statistics that the sample mean of
a vector of observations $x$ of size $N$ is:
$$ \text{sample_mean}(x) = \frac{1}{N}\sum_{i=1}^N x_i$$

In a random sample of independent observations coming from a normal distribution $N(\mu, \sigma^2)$, the sample mean is itself a random variable,
distributed as a normal $N(\mu,\frac{\sigma^2}{N})$.

As long as the $x_i$ are independent and identically distributed with finite variance, even if they are not Gaussian, for $N$ 'large enough' the sample mean is approximately $N(\mu,\frac{\sigma^2}{N})$, with $\mu$ and $\sigma^2$ the mean and variance of the distribution of $x$.
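A quick simulation can make this statement concrete. The sketch below draws many samples from an exponential distribution (an arbitrary non-Gaussian choice made here, as are the sample size and the number of repetitions) and checks that the sample means have mean close to $\mu$ and variance close to $\sigma^2/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50            # observations per sample
reps = 20000      # number of repeated samples

# Exponential data with scale 2: mean 2, variance 4, clearly not Gaussian.
samples = rng.exponential(scale=2.0, size=(reps, N))
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # close to mu = 2
print(sample_means.var())    # close to sigma^2 / N = 4 / 50 = 0.08
```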

When we do least squares linear regression under the typical assumptions of linearity, independence and gaussianity,

 $$ y = X\beta^* + \varepsilon$$

 with $X$ the matrix of observations (one column per variable), $\beta^*$ the true underlying coefficients and $\varepsilon$ i.i.d. Gaussian noise with mean 0 and variance $\sigma^2$.

The least squares estimator of the $\beta$ is also normal with distribution:
 $$N(\beta^*, \sigma^2 (X'X)^{-1})$$

The important part for us is the term $(X'X)^{-1}$: the $\sigma^2$ is given and we cannot experiment with it, but when designing an experiment we get to decide the values of $X$.

To simplify, assume that $\sigma^2=1$ (we do not lose much, since we can get there by scaling everything by a constant). This simplifies the expression for the covariance matrix of the least squares estimator:

$$\text{covariance}(\beta) = (X'X)^{-1}$$
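This expression can be checked numerically. The following sketch fixes an arbitrary design `X` (random numbers, purely for illustration), simulates many data sets with $\sigma^2 = 1$, and compares the empirical covariance of the least squares estimates with $(X'X)^{-1}$. The design, the true coefficients and the number of repetitions are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, reps = 30, 20000
X = rng.uniform(-1.0, 1.0, size=(n_obs, 2))   # a fixed, arbitrary design
beta_true = np.array([1.0, -2.0])

# Simulate many data sets y = X beta* + eps with sigma^2 = 1.
noise = rng.normal(size=(reps, n_obs))
Y = X @ beta_true + noise                      # shape (reps, n_obs)

# Least squares estimate for every simulated data set.
beta_hats = Y @ np.linalg.pinv(X).T            # shape (reps, 2)

empirical = np.cov(beta_hats, rowvar=False)
theoretical = np.linalg.inv(X.T @ X)
print(empirical)
print(theoretical)
```

The two printed matrices should agree up to simulation noise.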

For example, consider a model

$$Y = X\beta + \varepsilon$$

with 3 variables, so $\beta$ is a vector of three numbers $\beta_1, \beta_2, \beta_3$, each the coefficient of one variable.


Remember that the covariance matrix contains the variances of the individual coefficients (denoted by $\sigma^2_{\beta_i}$) and their covariances (denoted by $\sigma_{\beta_i \beta_j}$).

We know that the smaller the variance of the estimator, the more likely we are to be close to the true values when estimating. So, in the covariance matrix, 'big numbers are bad'.

$$\text{covariance}(\beta) = (X'X)^{-1} =
\begin{pmatrix}
\sigma^2_{\beta_1} & \sigma_{\beta_1 \beta_2} & \sigma_{\beta_1 \beta_3}\\
\sigma_{\beta_1 \beta_2} & \sigma^2_{\beta_2} & \sigma_{\beta_2 \beta_3} \\
\sigma_{\beta_1 \beta_3} & \sigma_{\beta_2 \beta_3} & \sigma^2_{\beta_3}
\end{pmatrix}$$

So we have to find values for $X$ that make the covariance matrix of $\beta$ 'small'. The smaller the covariance matrix, the more precise the estimate of $\beta$.
So if we have a limited number of samples, we will select the combinations of attribute levels that produce the smallest covariance matrix possible.
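As a small illustration of how the choice of $X$ changes the covariance, the sketch below compares two hypothetical designs with four observations each: one where the variable takes well separated values and one where the values are clustered. Both designs, and the helper `cov_design`, are made up for this example.

```python
import numpy as np

def cov_design(X):
    # Covariance of the least squares estimator when sigma^2 = 1.
    return np.linalg.inv(X.T @ X)

# Two hypothetical designs: an intercept column and one variable.
spread = np.array([[1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [1.0, 1.0]])
clustered = np.array([[1.0, -0.1], [1.0, -0.1], [1.0, 0.1], [1.0, 0.1]])

print(cov_design(spread))     # variance of the slope: 0.25
print(cov_design(clustered))  # variance of the slope: 25
```

With the same number of observations, the clustered design estimates the slope with a variance 100 times larger.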



# A-efficiency

We have mentioned that we want to make the covariance of the estimation of $\beta$ as small as possible.

**What do we mean by small?**

 The way we measure the 'size' of the matrix is what we mean by the **efficiency** of the experiment.

One idea for measuring the size of the matrix, that is, how large its numbers are, is the following:
the average variance of the coefficients, which is proportional to the sum of the diagonal (the trace) of the variance-covariance matrix. So from all possible variations of the experiment, we pick the one that produces a covariance matrix with the smallest sum of its diagonal.

This idea is called *A-efficiency*, and it is defined in proper mathematical terms as
$$ A= \frac{\text{trace}((X'X)^{-1})}{K+1}$$
with $K$ the number of variables to estimate in the model.

When we pick the design with the best A-efficiency we call it *A-optimal*.

There are many other measures of efficiency, identified by letters $A, D, E, G, ...$
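A minimal sketch of the A-efficiency formula in code follows. The function name `aeffic_reg` and the toy design are assumptions of this example, and $K$ is taken to be the number of columns of $X$.

```python
import numpy as np

def aeffic_reg(X):
    # A-efficiency as defined above: trace((X'X)^-1) / (K + 1),
    # taking K as the number of columns of X (an assumption of this sketch).
    X = np.asarray(X, dtype=float)
    cov = np.linalg.inv(X.T @ X)
    return np.trace(cov) / (cov.shape[0] + 1)

# Hypothetical toy design: intercept plus one variable at levels -1 and 1.
toy = np.array([[1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [1.0, 1.0]])
print(aeffic_reg(toy))   # (0.25 + 0.25) / 3
```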



# D-efficiency

D-efficiency is considered the most popular measure of efficiency, although there is no clear rule for which measure to use; it depends on the application.

The D-efficiency is defined in terms of the determinant of the covariance matrix of $\beta$

$D = \text{det}((X'X)^{-1})^{\frac{1}{K+1}}$

In [None]:
def cov_reg(X):
  X = np.matrix(X)
  return np.linalg.inv(np.matmul(np.transpose(X), X))

In [None]:
np.transpose(full_factorial)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,134,135,136,137,138,139,140,141,142,143
price_apple,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,...,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
size_apple,4.7,4.7,4.7,4.7,4.7,4.7,4.7,4.7,4.7,4.7,...,5.8,5.8,5.8,5.8,5.8,5.8,5.8,5.8,5.8,5.8
os_apple,1.8,1.8,1.8,1.8,1.8,1.8,1.8,1.8,1.8,1.8,...,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2
price_android,800.0,800.0,800.0,800.0,1000.0,1000.0,1000.0,1000.0,1200.0,1200.0,...,800.0,800.0,1000.0,1000.0,1000.0,1000.0,1200.0,1200.0,1200.0,1200.0
size_android,4.7,4.7,5.8,5.8,4.7,4.7,5.8,5.8,4.7,4.7,...,5.8,5.8,4.7,4.7,5.8,5.8,4.7,4.7,5.8,5.8
os_android,1.8,3.2,1.8,3.2,1.8,3.2,1.8,3.2,1.8,3.2,...,1.8,3.2,1.8,3.2,1.8,3.2,1.8,3.2,1.8,3.2


In [None]:
COV_ff = cov_reg(full_factorial)

In [None]:
pd.DataFrame(COV_ff*100000, columns = ['price_apple', 'size_apple', 'cost_apple', 'price_android', 'size_android', 'cost_android'])

Unnamed: 0,price_apple,size_apple,cost_apple,price_android,size_android,cost_android
0,0.0226,-1.592868,-0.468263,-0.003442,-1.592868,-0.468263
1,-1.592868,1558.489201,-216.716677,-1.592868,-737.194913,-216.716677
2,-0.468263,-216.716677,1353.524333,-0.468263,-216.716677,-63.709227
3,-0.003442,-1.592868,-0.468263,0.0226,-1.592868,-0.468263
4,-1.592868,-737.194913,-216.716677,-1.592868,1558.489201,-216.716677
5,-0.468263,-216.716677,-63.709227,-0.468263,-216.716677,1353.524333


Let's see what happens if we 'do nothing', that is, if we just pick variations at random until we reach the given sample size.

*Repeat to see how the covariance varies*

In [None]:
N = 70
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], N, replace=False), :]
sub_fact

array([[ 800. ,    4.7,    1.8, 1200. ,    4.7,    1.8],
       [1000. ,    5.8,    3.2,  800. ,    4.7,    3.2],
       [1000. ,    5.8,    1.8, 1200. ,    5.8,    1.8],
       [ 800. ,    4.7,    1.8,  800. ,    5.8,    3.2],
       [ 800. ,    5.8,    1.8, 1200. ,    5.8,    3.2],
       [1000. ,    4.7,    1.8, 1000. ,    5.8,    3.2],
       [1200. ,    5.8,    1.8, 1000. ,    4.7,    1.8],
       [1000. ,    5.8,    3.2,  800. ,    4.7,    1.8],
       [1200. ,    4.7,    1.8, 1200. ,    4.7,    1.8],
       [1200. ,    5.8,    1.8, 1000. ,    5.8,    1.8],
       [1000. ,    4.7,    3.2, 1200. ,    5.8,    1.8],
       [1000. ,    5.8,    3.2, 1000. ,    5.8,    1.8],
       [1200. ,    4.7,    1.8, 1200. ,    5.8,    3.2],
       [ 800. ,    5.8,    3.2,  800. ,    5.8,    3.2],
       [1200. ,    5.8,    1.8, 1200. ,    5.8,    3.2],
       [1200. ,    4.7,    3.2, 1000. ,    4.7,    1.8],
       [1200. ,    4.7,    3.2,  800. ,    5.8,    3.2],
       [1200. ,    5.8,    1.8,

In [None]:
pd.DataFrame(cov_reg(sub_fact)*100000, columns = ['price_apple', 'size_apple', 'cost_apple', 'price_android', 'size_android', 'cost_android'])

Unnamed: 0,price_apple,size_apple,cost_apple,price_android,size_android,cost_android
0,0.050852,-2.796748,0.196867,-0.017855,-2.71378,-1.925642
1,-2.796748,3316.496338,-682.938769,-2.806583,-1754.019175,-321.622883
2,0.196867,-682.938769,2620.282514,-0.712264,-370.01904,-183.545185
3,-0.017855,-2.806583,-0.712264,0.051881,-2.820997,-0.814672
4,-2.71378,-1754.019175,-370.01904,-2.820997,3239.80687,-463.704691
5,-1.925642,-321.622883,-183.545185,-0.814672,-463.704691,2849.337329


Let's compare both covariances element by element: the one from the full experiment versus the one from the reduced experiment.

In [None]:
pd.DataFrame(cov_reg(sub_fact) / cov_reg(full_factorial))

Unnamed: 0,0,1,2,3,4,5
0,2.250093,1.755794,-0.42042,5.187866,1.703707,4.112311
1,1.755794,2.12802,3.151298,1.761969,2.379315,1.484071
2,-0.42042,3.151298,1.935896,1.521077,1.707386,2.880983
3,5.187866,1.761969,1.521077,2.295637,1.771018,1.739775
4,1.703707,2.379315,1.707386,1.771018,2.078813,2.139682
5,4.112311,1.484071,2.880983,1.739775,2.139682,2.105125


In [None]:
def deffic_reg(X):
  covX = cov_reg(X)
  return np.power( np.linalg.det(covX), 1 / (covX.shape[0] + 1) )

When we calculate the D-efficiency (smaller is better), we see that
the full experiment is roughly twice as efficient.

In [None]:
deffic_reg(full_factorial)*1000

0.5913907394643236

In [None]:
deffic_reg(sub_fact)*1000

1.1138452778901702

So now we will try to do something a bit more clever: we will try to pick the design with the best D-efficiency.
But how many design variations are there? Maybe we can compute them all and pick the best?

Let's say we want designs of size 70 when the full factorial has 144 rows. The number of designs is the number of combinations without
repetition...


In [None]:
from math import factorial

factorial(144) / factorial(70) / factorial(144 - 70)

1.4007495090837087e+42

No, there are far too many. But let's quickly check 100,000 random designs and see how much improvement we can get.

In [None]:
def calc_sub_effic():
 N = 70
 sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], N, replace=False), :]
 return deffic_reg(sub_fact)*1000



np.min([calc_sub_effic() for _ in range(100000)])

1.0397622938773254

The best efficiency found in this random search is about 1.04 (the exact value changes from run to run, since the search is random).

There are better algorithms to search for good designs, that is, to 'optimize' the experimental design.
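As one illustration of such a search, here is a minimal sketch of a random-swap local search: start from a random design, repeatedly replace one row with an unused row of the full factorial, and keep the swap only when the D-efficiency value decreases. This is a simple heuristic written for this example, not a specific published algorithm. The names `d_error` and `swap_search`, the number of iterations and the seed are all assumptions.

```python
import itertools
import numpy as np

# Rebuild the 144-row full factorial used in this lecture.
levels = [[800.0, 1000.0, 1200.0], [4.7, 5.8], [1.8, 3.2]]
candidates = np.array(list(itertools.product(*(levels + levels))))

def d_error(X):
    # Same convention as deffic_reg: det((X'X)^-1) ** (1 / (K + 1)).
    cov = np.linalg.inv(X.T @ X)
    return np.linalg.det(cov) ** (1.0 / (cov.shape[0] + 1))

def swap_search(candidates, n_rows, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(candidates), n_rows, replace=False)
    start = best = d_error(candidates[chosen])
    for _ in range(n_iter):
        trial = chosen.copy()
        unused = np.setdiff1d(np.arange(len(candidates)), chosen)
        # Replace one randomly chosen row by a row not yet in the design.
        trial[rng.integers(n_rows)] = rng.choice(unused)
        value = d_error(candidates[trial])
        if value < best:          # keep the swap only if it improves the design
            chosen, best = trial, value
    return chosen, start, best

chosen, start, best = swap_search(candidates, n_rows=70)
print(start * 1000, best * 1000)
```

Because a swap is accepted only when it improves the criterion, the final value can never be worse than the starting one, although the search may stop at a local optimum.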

# D-efficiency for Discrete Choice

We have seen the definition of D-efficiency for standard linear regression.
It is based on the covariance matrix for the least squares estimator for the coefficients. For discrete choice, the D-efficiency is technically the same, but the covariance matrix that it works on is slightly different.

In choice modelling, specifically in the multinomial logit (MNL), the coefficients are
not estimated by least squares but by maximum likelihood. Moreover, the MNL transforms the linear predictions with the softmax transformation. The resulting covariance matrix is:

$$\text{covariance}(\beta) = (Z' P Z )^{-1}$$

when working with $J$ alternatives:
* $P$ is the matrix of choice probabilities computed by the model.
* $Z$ is similar to the design matrix, but 'centered' using the choice probabilities. Basically, from each row of observations we subtract the weighted mean of the variables across all alternatives. The weights are the choice probabilities computed by the model.

 $$z_{jn} = x_{jn} - \sum_{i=1}^Jx_{in}P_{in}$$

 The $x_{in}$ represents the attributes of alternative $i$ for individual $n$.

 Just to clarify, if all choice probabilities are equal, we get
  $$z_{jn} = x_{jn} - \overline{x_n}$$
   with $\overline{x_n}$ denoting the simple mean of the variables across alternatives.
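The centering can be illustrated for a single choice situation. The sketch below uses the two smartphones from the example question and assumes all coefficients are zero, so both choice probabilities are 0.5 and the weighted mean reduces to the simple mean. The variable names are chosen for this example.

```python
import numpy as np

# One choice situation: rows are alternatives (apple, android),
# columns are attributes (price, size, speed).
x_n = np.array([[800.0, 4.7, 3.2],
                [1200.0, 5.8, 1.8]])
beta = np.zeros(3)                 # assumed coefficients: all zero

v = x_n @ beta                     # utilities of each alternative
p = np.exp(v) / np.exp(v).sum()    # choice probabilities: [0.5, 0.5]

x_bar = p @ x_n                    # probability-weighted mean of the attributes
z_n = x_n - x_bar                  # centered attributes z_jn
print(z_n)
```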


**There is an important difference compared to linear regression:** the
expression for the covariance contains the choice probabilities, which themselves depend on the true values of the coefficients $\beta^*$!

The solution is to consider some 'prior' values for $\beta^*$. For example, it is typical to start with all $\beta^*_i = 0$ to design the pilot experiment, then get a better estimate from the pilot and use it to calculate the efficiency for the final experiment.
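To make the expression concrete, here is a sketch that evaluates $Z'PZ$ as a sum over choice situations and alternatives of $P_{jn} z_{jn} z_{jn}'$, using a zero prior. It assumes that the same coefficients apply to every alternative. The function name `mnl_information`, the `(N, J, K)` array layout and the four-row design are assumptions of this example.

```python
import numpy as np

def mnl_information(X_alts, beta):
    # X_alts: shape (N, J, K) = choice situations x alternatives x attributes.
    V = X_alts @ beta                               # utilities, shape (N, J)
    V = V - V.max(axis=1, keepdims=True)            # avoid overflow in exp
    P = np.exp(V)
    P = P / P.sum(axis=1, keepdims=True)            # choice probabilities
    x_bar = (P[:, :, None] * X_alts).sum(axis=1, keepdims=True)
    Z = X_alts - x_bar                              # centered attributes z_jn
    # Sum over n and j of P_jn * z_jn z_jn'
    return np.einsum('nj,njk,njl->kl', P, Z, Z)

# Hypothetical design: 4 choice situations, 2 alternatives, 3 attributes.
design = np.array([
    [[800.0, 4.7, 1.8], [1000.0, 5.8, 3.2]],
    [[1200.0, 5.8, 1.8], [800.0, 4.7, 3.2]],
    [[1000.0, 4.7, 3.2], [1200.0, 5.8, 1.8]],
    [[800.0, 5.8, 3.2], [1200.0, 4.7, 1.8]],
])
prior = np.zeros(3)                 # zero prior, as for a pilot design
info = mnl_information(design, prior)
cov = np.linalg.inv(info)
print(cov)
```

With two alternatives and a zero prior, the result only depends on the attribute differences between the two alternatives, which is a convenient sanity check.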


In [None]:
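def cov_mnl(Xj, J, betas):
  # Split the design into one block of attribute columns per alternative
  # (this version handles exactly two alternatives; J is not used).
  Xj = np.hsplit(np.array(Xj), 2)
  # Linear utilities of both alternatives, using the same coefficients.
  V = np.hstack( [np.matmul(Xj[0], betas[0].T ), np.matmul(Xj[1], betas[0].T )])
  P = np.exp(V)
  PP = P / np.sum(P, axis = 1)  # choice probabilities (softmax over alternatives)
  # Diagonal matrix with the choice probabilities of the first alternative.
  P0D = np.diag(np.asarray(PP[:, 0]).ravel())
  return np.linalg.inv(np.matmul( np.matmul(Xj[0].T, P0D), Xj[0]))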

And now we calculate

In [None]:
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 10, replace=False), :]

In [None]:
betas = [ np.matrix('0.5 0.1 1.1')]
betas[0]

matrix([[0.5, 0.1, 1.1]])

In [None]:
pd.DataFrame(cov_mnl(sub_fact, 2, betas))

Unnamed: 0,0,1,2
0,9e-06,-0.00232,0.000663
1,-0.00232,0.834454,-0.59347
2,0.000663,-0.59347,0.80061


In [None]:
def deffic_mnl(X, J, betas):
  covX = cov_mnl(X, J, betas)
  return np.power( np.linalg.det(covX), 1 / (covX.shape[0] + 1) )

In [None]:
deffic_mnl(sub_fact, 2, betas)

0.015101961211840247

In [None]:
deffic_mnl(full_factorial, 2, betas)

0.001539231431536287

# Relationship to the principles of design of experiments

Recall the four principles

1. Level balance
2. Orthogonality
3. Minimal level overlap
4. Utility balance


These principles are all summarized by the D-efficiency, in the sense that they are 'rules of thumb' for creating designs with good efficiency. Nowadays we can just put the computer to work to find a good design; before that, designs were picked manually by following the principles. It is still important to get an intuition of how they work.


# Example: Level balance and overlap



In [None]:
np.random.seed(1234)
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 20, replace=False), :]
sub_fact

array([[ 800. ,    4.7,    1.8, 1000. ,    5.8,    1.8],
       [1200. ,    4.7,    1.8, 1000. ,    4.7,    3.2],
       [1000. ,    4.7,    3.2,  800. ,    5.8,    3.2],
       [1000. ,    5.8,    3.2, 1000. ,    4.7,    1.8],
       [1200. ,    4.7,    3.2, 1200. ,    4.7,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    4.7,    1.8],
       [1200. ,    4.7,    3.2,  800. ,    5.8,    3.2],
       [1200. ,    4.7,    1.8,  800. ,    5.8,    1.8],
       [ 800. ,    5.8,    1.8,  800. ,    5.8,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    3.2],
       [1000. ,    4.7,    3.2, 1000. ,    5.8,    1.8],
       [1000. ,    4.7,    1.8, 1200. ,    5.8,    1.8],
       [1000. ,    4.7,    1.8,  800. ,    5.8,    3.2],
       [ 800. ,    5.8,    1.8,  800. ,    4.7,    3.2],
       [1000. ,    5.8,    1.8, 1000. ,    5.8,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    1.8],
       [1000. ,    5.8,    3.2,  800. ,    4.7,    3.2],
       [ 800. ,    5.8,    3.2,

Pick unbalanced levels for attribute 1, the price of the first alternative (almost all rows are 800):

In [None]:

sub_fact = sub_fact[[ 0, 5, 8, 9, 13, 15,  17, 2],:]


In [None]:
sub_fact

array([[ 800. ,    4.7,    1.8, 1000. ,    5.8,    1.8],
       [ 800. ,    5.8,    3.2, 1000. ,    4.7,    1.8],
       [ 800. ,    5.8,    1.8,  800. ,    5.8,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    3.2],
       [ 800. ,    5.8,    1.8,  800. ,    4.7,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    1.8],
       [ 800. ,    5.8,    3.2,  800. ,    4.7,    1.8],
       [1000. ,    4.7,    3.2,  800. ,    5.8,    3.2]])
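Level balance can be checked by counting how often each level appears in each column. The sketch below copies the hand-picked design shown above; the helper `level_counts` is a name introduced for this example.

```python
import numpy as np

# The hand-picked design shown above.
design = np.array([
    [ 800.0, 4.7, 1.8, 1000.0, 5.8, 1.8],
    [ 800.0, 5.8, 3.2, 1000.0, 4.7, 1.8],
    [ 800.0, 5.8, 1.8,  800.0, 5.8, 3.2],
    [ 800.0, 5.8, 3.2, 1000.0, 5.8, 3.2],
    [ 800.0, 5.8, 1.8,  800.0, 4.7, 3.2],
    [ 800.0, 5.8, 3.2, 1000.0, 5.8, 1.8],
    [ 800.0, 5.8, 3.2,  800.0, 4.7, 1.8],
    [1000.0, 4.7, 3.2,  800.0, 5.8, 3.2],
])

def level_counts(column):
    # How many times each level appears in one column of the design.
    levels, counts = np.unique(column, return_counts=True)
    return dict(zip(levels.tolist(), counts.tolist()))

for j in range(design.shape[1]):
    print(j, level_counts(design[:, j]))
```

The first column shows the imbalance: 800 appears seven times, 1000 once, and 1200 never.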

In [None]:
deffic_mnl(sub_fact, 2, betas)

0.025075612050753034

Compare with random experiments of the same size (look at the values from a random search of experiments of 8 rows; smaller is better).
Most of them will be more efficient than our hand-picked unbalanced design.

In [None]:
np.random.seed(1234)
[deffic_mnl(np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 8, replace=False), :], 2, betas) for i in range(20)]

[0.012616982133300486,
 0.011695402568277412,
 0.03184815359360849,
 0.03509450126161441,
 0.012031835107409962,
 0.012735934085607309,
 0.012301612107096209,
 0.024778969051915976,
 0.012241126754476505,
 0.015347405493874052,
 0.016536761463146362,
 0.015980132366145514,
 0.017566893133088412,
 0.008988701322920645,
 0.013424721792247204,
 0.02550199312206731,
 0.05737907706040104,
 0.016783149094670952,
 0.011277858292034334,
 0.019826977666191882]

# Orthogonality

We pick rows in which attributes (columns) 2 and 3 are perfectly correlated, so the design cannot tell their effects apart.

In [None]:
np.random.seed(1234)
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 20, replace=False), :]
sub_fact
sub_fact_orth = sub_fact[[ 0, 1, 3, 7, 9, 11, 12, 15],:]
sub_fact_orth

array([[ 800. ,    4.7,    1.8, 1000. ,    5.8,    1.8],
       [1200. ,    4.7,    1.8, 1000. ,    4.7,    3.2],
       [1000. ,    5.8,    3.2, 1000. ,    4.7,    1.8],
       [1200. ,    4.7,    1.8,  800. ,    5.8,    1.8],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    3.2],
       [1000. ,    4.7,    1.8, 1200. ,    5.8,    1.8],
       [1000. ,    4.7,    1.8,  800. ,    5.8,    3.2],
       [ 800. ,    5.8,    3.2, 1000. ,    5.8,    1.8]])
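The lack of orthogonality can be verified by computing the correlation between the two columns. The sketch below copies columns 2 and 3 (screen size and processor speed of the first alternative) from the design shown above.

```python
import numpy as np

# Columns 2 and 3 of the design shown above.
size = np.array([4.7, 4.7, 5.8, 4.7, 5.8, 4.7, 4.7, 5.8])
speed = np.array([1.8, 1.8, 3.2, 1.8, 3.2, 1.8, 1.8, 3.2])

# Pearson correlation between the two attributes.
corr = np.corrcoef(size, speed)[0, 1]
print(corr)   # essentially 1: the two attributes always move together
```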

In [None]:

deffic_mnl(sub_fact_orth, 2, betas)

0.028968245815118584

# The workflow

1) Define attributes and levels


2) Pilot Study to get some tentative Betas

3) Design of the Experiment

4) Design the Survey

5) Conduct the survey and data analysis

# Recommendations


* **Which variables should we choose?**
 Create an exhaustive list of attributes, then reduce it to between 3 and 7 by discarding some and merging others (important combinations of a pair of attributes). For example, screen size and speed can be merged if they do not really vary independently (i.e. there are no small fast smartphones): just create a new categorical attribute with a few levels for the realistic combinations.

* **How do we choose the levels?**
 Try a large range and pick the best subset using a computer.

* **How many alternatives?**
 People can handle 2 to 3 alternatives before getting into decision fatigue.