# PRACTICE FOR THE FINAL

---
---


## Education choices
We have a dataset from a discrete choice experiment on preferences for children's education. The researchers want to establish the effects of the cost of education, the use of a foreign language at school, and the distance to the school.

The experiment presents households with several choice situations, each one on a different card. In each situation, the household has to decide between two alternatives.
As in many choice experiments, the alternatives are used just to compare the effects of the attributes. Unlike, for example, a transportation choice with alternatives train, car and bus, here the alternatives do not encode information; we can just consider them as 'alternative A' and 'alternative B'.
**Alternative A should be indistinguishable from alternative B if their attributes in the choice situation are equal.**
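
The statement in bold can be illustrated with a minimal sketch of a binary logit without alternative-specific constants. The function names `prob_A` and `utility` and the coefficient values are made up for illustration; they are not estimates from the dataset.

```python
import math

def prob_A(v_a, v_b):
    """Binary logit probability of choosing alternative A."""
    return 1.0 / (1.0 + math.exp(v_b - v_a))

# Hypothetical coefficients (illustrative values, not estimates).
b_cost, b_dist = -5e-6, -4e-4

def utility(cost, distance):
    # No alternative-specific constant: utility depends on attributes only.
    return b_cost * cost + b_dist * distance

# Same attributes on both sides of the card -> same utility -> a 50/50 choice.
p_equal = prob_A(utility(200000, 3000), utility(200000, 3000))
print(p_equal)
```

With equal attributes the two utilities coincide, so the label 'A' or 'B' carries no information of its own.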

Each household may answer several choice situations; we keep track of each household using the variable *id* in the dataset.

---
---

# Description of the dataset

Survey variables
 * **choice:** The response variable (1=alternative A, 2=alternative B).
 * **id:** Household ID.

\

Attributes

 * **cost_A, cost_B:** Yearly cost in dollars.
 * **foreign_A, foreign_B:** Whether the school uses a foreign language as the default language for all units (except when teaching the local language unit).
 * **distance_A, distance_B:** Distance to the school, in meters.

\
  
Socioeconomic characteristics

 * **male:**  1 if the child in the household is male, 0 otherwise.
 * **female:**  1 if the child in the household is female, 0 otherwise.
 * **parent_educ:** Education level of the head of the family (0=no formal education, 1=high school, 2=undergrad, 3=postgrad).

---
---



# Preparing the environment
*The preparation and dataset loading code is given to the students; you may modify it.*

In [1]:
!pip install biogeme



Load the packages.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
import biogeme.distributions as dist




---
---


In [3]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)


def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
 av_auto = V.copy()
 for key, value in av_auto.items():
   av_auto[key] = 1
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V.copy()
 return bgm_model, bgm_model.estimate()



def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
  av_auto = None
  targets = None
  if hasattr(qbus_bgm_model, 'ord_probs'):
    av_auto = qbus_bgm_model.ord_probs.copy()
    targets = qbus_bgm_model.ord_probs.copy()
  else:
    av_auto = qbus_bgm_model.utility_dic.copy()
    targets = qbus_bgm_model.utility_dic.copy()

  for key, value in av_auto.items():
    av_auto[key] = 1
  for key, value in targets.items():
    if hasattr(qbus_bgm_model, 'nest_tuple'):
      targets[key] = models.nested(qbus_bgm_model.utility_dic, av_auto, qbus_bgm_model.nest_tuple, key)
    else:
      if hasattr(qbus_bgm_model, 'ord_probs'):
       pass  # ordered models: targets already holds the stored probability expressions
       #targets[key] = qbus_bgm_model.ord_probs[key]
      else:
       targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

  bgm_db = db.Database('simul', pred_pd_df)
  globals().update(bgm_db.variables)
  bgm_pred_model = bio.BIOGEME(bgm_db, targets)
  simulatedValues = bgm_pred_model.simulate(betas)
  return simulatedValues



def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix



def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)



In [4]:
def qbus_estimate_ordered_bgm(V, ord_alt_ids, pd_df, tgtvar_name, modelname='ord_bgm'):
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)

 taus_map = {ord_alt_ids[0]: exp.Beta('tau1', -1, None, None, 0) }
 i = 1
 for id in ord_alt_ids[1:-1]:
  taus_map[id] = taus_map[ ord_alt_ids[i-1] ] + exp.Beta('delta_'+ str(i + 1), i, 0, None, 0)
  i = i + 1

 # V is the single latent utility expression passed in by the caller.
 alt_probs_map = {ord_alt_ids[0]: dist.logisticcdf( taus_map[ord_alt_ids[0] ] - V) }

 i = 1
 for id in ord_alt_ids[1:-1]:
  alt_probs_map[id] = dist.logisticcdf( taus_map[id] - V) - dist.logisticcdf( taus_map[ ord_alt_ids[i-1] ] - V)
  i = i + 1

 alt_probs_map[ord_alt_ids[i] ] = 1 - dist.logisticcdf( taus_map[ord_alt_ids[i-1]] - V)

 logprob = exp.log(exp.Elem(alt_probs_map, bgm_db.variables[tgtvar_name]))

 #logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V
 bgm_model.ord_probs = alt_probs_map.copy()
 return bgm_model, bgm_model.estimate()

def qbus_estimate_mixed_bgm(V, pd_df, tgtvar_name, panelvar_name=None, n_draws=50, seed=1, modelname='bgmdef'):
 do_panel = panelvar_name is not None

 av_auto = V.copy()
 for key, value in av_auto.items():
   av_auto[key] = 1
 bgm_db = db.Database(modelname + '_db', pd_df)
 if (do_panel):
   bgm_db.panel(panelvar_name)

 globals().update(bgm_db.variables)
 #logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 obsprob = models.logit(V, av_auto, bgm_db.variables[tgtvar_name])
 if (do_panel):
  condprobIndiv = exp.PanelLikelihoodTrajectory(obsprob)
 else:
  condprobIndiv = obsprob
 logprob = exp.log(exp.MonteCarlo(condprobIndiv))
 bgm_model  = bio.BIOGEME(bgm_db,logprob,numberOfDraws=n_draws, seed=seed)
 bgm_model.utility_dic = V.copy()
 return bgm_model, bgm_model.estimate()




def qbus_estimate_nested_bgm(V, pd_df, nests,  tgtvar_name, modelname='bgmdef'):
 av_auto = V.copy()
 for key, value in av_auto.items():
   av_auto[key] = 1
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprobnest = models.lognested (V, av_auto , nests , bgm_db.variables[tgtvar_name] )
 #logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprobnest )
 bgm_model.utility_dic = V.copy()
 bgm_model.nest_tuple = nests
 return bgm_model, bgm_model.estimate()

In [5]:
def calc_mnl_cov(design_m, cprobs, num_alt, attrs_per_alt):
  P_rep = np.repeat(cprobs.to_numpy(), np.repeat(attrs_per_alt, num_alt), axis=1)
  num_cols = num_alt * attrs_per_alt
  XP_rep = np.repeat((design_m.to_numpy()*P_rep).sum(axis=1).T.reshape(-1,1), num_cols, axis=1)
  Z = design_m - XP_rep
  ZPZ = np.matmul(Z.T, P_rep*Z.to_numpy())
  covMNL = np.linalg.pinv(ZPZ)
  # Fall back to a large diagonal matrix when the pseudo-inverse has zero determinant.
  if np.linalg.det(covMNL):
    return covMNL
  return np.eye(covMNL.shape[0])*1000

def d_effic(covMAT):
  return np.power( np.linalg.det(covMAT), 1 / (covMAT.shape[0] + 1) )

---
---

# Load the datasets

*Auxiliary code is provided to load the dataset*


In [6]:

url = 'https://drive.google.com/file/d/1flpaR4wwM9DaToAo5urt7T1Kfgb7zUN3/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
educ_pd = pd.read_csv(path)


In [7]:
educ_pd.head(7)

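| | id | choice | cost_A | foreign_A | distance_A | cost_B | foreign_B | distance_B | male | female | parent_educ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 201218 | 0 | 3248 | 409082 | 1 | 2463 | 1 | 0 | 2 |
| 1 | 8 | 1 | 106655 | 1 | 5375 | 204348 | 0 | 2508 | 0 | 1 | 1 |
| 2 | 8 | 2 | 302259 | 0 | 4349 | 205217 | 1 | 2113 | 0 | 1 | 1 |
| 3 | 8 | 2 | 307518 | 0 | 4188 | 406835 | 1 | 2975 | 1 | 0 | 2 |
| 4 | 9 | 1 | 208701 | 1 | 4218 | 104187 | 0 | 5247 | 0 | 1 | 1 |
| 5 | 10 | 2 | 204284 | 0 | 2427 | 103413 | 1 | 2850 | 0 | 1 | 1 |
| 6 | 10 | 2 | 302429 | 0 | 4202 | 208732 | 1 | 2165 | 0 | 1 | 2 |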



___
___


# 1) Fit a multinomial logit model to act as reference model, using cost and distance as attributes.
*Hint: Do not consider alternative-specific constants.*

For this one, we give you a little hint:

In [8]:
bgm_edu = db.Database('edu', educ_pd)
globals().update(bgm_edu.variables)

B_cost = exp.Beta( 'B_cost', 0, None, None, 0)
B_dist = exp.Beta( 'B_dist', 0, None, None, 0)
B_fore = exp.Beta( 'B_fore', 0, None, None, 0)


V_A = B_cost*cost_A + B_dist*distance_A #+ B_fore*foreign_A
V_B = B_cost*cost_B + B_dist*distance_B #+ B_fore*foreign_B

V = {1: V_A ,
  2: V_B 
  }
av = {1: 1,
  2: 1
 }

logprob = models.loglogit (V , av , choice )
bgm_model = bio.BIOGEME ( bgm_edu, logprob )
bgm_model.modelName = 'my first multinomial logit'
results = bgm_model.estimate()



results.getEstimatedParameters()

| | Value | Rob. Std err | Rob. t-test | Rob. p-value |
|---|---|---|---|---|
| B_cost | -4e-06 | 4.371428e-07 | -9.613835 | 0.0 |
| B_dist | -0.000369 | 3.187013e-05 | -11.566283 | 0.0 |
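
To make the estimates concrete, the following sketch plugs the displayed coefficients into the logit formula for the first row shown by `educ_pd.head(7)`. It is only an illustration: `B_cost` is printed rounded in the table, so the resulting probabilities are approximate, and the lowercase variable names are ours.

```python
import math

# Coefficients as displayed in the results table (b_cost is shown rounded).
b_cost, b_dist = -4e-06, -0.000369

# First row of educ_pd.head(7): attributes of alternatives A and B.
row = {"cost_A": 201218, "distance_A": 3248,
       "cost_B": 409082, "distance_B": 2463}

v_a = b_cost * row["cost_A"] + b_dist * row["distance_A"]
v_b = b_cost * row["cost_B"] + b_dist * row["distance_B"]

# Logit probabilities: softmax over the two utilities.
den = math.exp(v_a) + math.exp(v_b)
p_a, p_b = math.exp(v_a) / den, math.exp(v_b) / den
print(round(p_a, 3), round(p_b, 3))
```

Alternative A is farther away but much cheaper in this row, and the model assigns it the higher probability, in line with the observed choice (`choice = 1`).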


---

# 2) Use parents' education as a characteristic interacting with cost. Comment on the results (signs of the variables, interpretation of the interaction).

In [9]:
B_cost = exp.Beta( 'B_cost', 0, None, None, 0)
B_dist = exp.Beta( 'B_dist', 0, None, None, 0)
B_paredu = exp.Beta( 'B_paredu', 0, None, None, 0)

V_A = B_cost*cost_A + B_dist*distance_A + B_paredu*cost_A*parent_educ
V_B = B_cost*cost_B + B_dist*distance_B + B_paredu*cost_B*parent_educ

V = {1: V_A , 2: V_B }

model_2, results_2 = qbus_estimate_bgm(V, educ_pd, 'choice', 'model_2')
results_2.getEstimatedParameters()

You have not defined a name for the model. The output files are named from the model name. The default is [biogemeModelDefaultName]


| | Value | Rob. Std err | Rob. t-test | Rob. p-value |
|---|---|---|---|---|
| B_cost | -6e-06 | 1.141183e-06 | -4.988998 | 6.06932e-07 |
| B_dist | -0.000367 | 3.184888e-05 | -11.512212 | 0.0 |
| B_paredu | 1e-06 | 8.500272e-07 | 1.449734 | 0.1471327 |
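
One way to read the interaction is to compute the effective cost coefficient for each education level, since cost enters the utility as `(B_cost + B_paredu*parent_educ) * cost`. The sketch below uses the point estimates as displayed (they are rounded in the printed table); the lowercase names are ours.

```python
# Point estimates as displayed in the results table (rounded).
b_cost, b_paredu = -6e-06, 1e-06

levels = {0: "no formal education", 1: "high school",
          2: "undergrad", 3: "postgrad"}

# Effective marginal utility of cost for each education level.
effective = {lvl: b_cost + b_paredu * lvl for lvl in levels}
for lvl, name in levels.items():
    print(f"{name:>20}: {effective[lvl]:.1e}")
```

The effective coefficient stays negative at every level and shrinks in magnitude as education increases, but the robust p-value of `B_paredu` (about 0.147) means this difference is not statistically significant at the usual levels.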


---

# 3) Imagine that we model parents' education as an additive characteristic (not an interaction, just adding to the utilities). However, in the context of our discrete choice experiment, this does not make sense. Explain why.
*Hint: No programming required but you might fit some models if it helps you.*
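
As a starting point for the argument, here is a small numerical check, assuming the same additive term is added to both utilities. The utilities and the coefficient `b_paredu` are hypothetical values chosen only for illustration.

```python
import math

def prob_A(v_a, v_b):
    """Binary logit probability of choosing alternative A."""
    return 1.0 / (1.0 + math.exp(v_b - v_a))

v_a, v_b = -2.0, -2.5   # hypothetical attribute-based utilities
b_paredu = 0.7          # hypothetical additive coefficient

base = prob_A(v_a, v_b)
# Add b_paredu * parent_educ to BOTH utilities, for education levels 0..3.
shifted = [prob_A(v_a + b_paredu * e, v_b + b_paredu * e) for e in range(4)]
print(base, shifted)
```

Only the difference between the utilities matters in the logit, so the probabilities are the same for every education level and every value of `b_paredu`.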



In [10]:
educ_pd_interaction = educ_pd.copy()
educ_pd_interaction['cost_A_paredu'] = educ_pd_interaction['cost_A'] * educ_pd_interaction['parent_educ']
educ_pd_interaction['cost_B_paredu'] = educ_pd_interaction['cost_B'] * educ_pd_interaction['parent_educ']

qbus_update_globals_bgm(educ_pd_interaction)

B_cost = exp.Beta( 'B_cost', 0, None, None, 0)
B_dist = exp.Beta( 'B_dist', 0, None, None, 0)
B_paredu = exp.Beta( 'B_paredu', 0, None, None, 0)

V_A = B_cost*cost_A + B_dist*distance_A + B_paredu*cost_A_paredu
V_B = B_cost*cost_B + B_dist*distance_B + B_paredu*cost_B_paredu

V_3 = {1: V_A , 2: V_B }

model_3, results_3 = qbus_estimate_bgm(V_3, educ_pd_interaction, 'choice')
results_3.getEstimatedParameters()

You have not defined a name for the model. The output files are named from the model name. The default is [biogemeModelDefaultName]


| | Value | Rob. Std err | Rob. t-test | Rob. p-value |
|---|---|---|---|---|
| B_cost | -6e-06 | 1.142833e-06 | -5.021388 | 5.129954e-07 |
| B_dist | -0.00037 | 3.194461e-05 | -11.569075 | 0.0 |
| B_paredu | 1e-06 | 8.509465e-07 | 1.466112 | 0.1426177 |


---

# 4) Create a new multinomial logit model, including foreign language and gender variables in the model in Exercise 1. Is this a better model than the outcome of Exercise 1? 
*Hint: Pay attention to the way the variables are included (adding, interaction, per-alternative, etc.)*

---

# 5) Fit a mixed logit model (no panel) using the specification of Exercise 1, considering that the effect of distance is random. Comment on the results (changes in all coefficients with respect to Exercise 1, variance of the random coefficient).

**You might use a small number of draws in the Monte Carlo simulation, or consider a smaller dataset if you run into problems.**
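
To see what the Monte Carlo simulation does, the sketch below approximates a mixed logit choice probability in plain Python by averaging the logit probability over normal draws of the distance coefficient. It is independent of Biogeme; the function names and the values of `mu_dist` and `sigma_dist` are hypothetical.

```python
import math
import random

def logit_prob_A(b_cost, b_dist, row):
    """Binary logit probability of alternative A for one choice situation."""
    v_a = b_cost * row["cost_A"] + b_dist * row["distance_A"]
    v_b = b_cost * row["cost_B"] + b_dist * row["distance_B"]
    return 1.0 / (1.0 + math.exp(v_b - v_a))

def mixed_prob_A(b_cost, mu_dist, sigma_dist, row, n_draws=50, seed=1):
    """Average the logit probability over draws b_dist ~ N(mu_dist, sigma_dist)."""
    rng = random.Random(seed)
    draws = [rng.gauss(mu_dist, sigma_dist) for _ in range(n_draws)]
    return sum(logit_prob_A(b_cost, b, row) for b in draws) / n_draws

row = {"cost_A": 201218, "distance_A": 3248,
       "cost_B": 409082, "distance_B": 2463}

# Hypothetical parameter values, for illustration only.
p_mixed = mixed_prob_A(-4e-6, -4e-4, 2e-4, row)
p_fixed = mixed_prob_A(-4e-6, -4e-4, 0.0, row)  # sigma = 0 collapses to plain logit
print(p_mixed, p_fixed)
```

More draws reduce the simulation noise in the averaged probability, at the price of a longer estimation.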

---

# 6) Consider the mixed logit with panel information about the household. Compare the results to Exercise 5.

---

# 7) Calculate the willingness to pay for reducing the distance to the school, based on the results of Exercise 6.
*Hint: consider the distribution of the WTP, since the coefficient for distance is random.*
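
A sketch of how the WTP distribution can be simulated, assuming a fixed cost coefficient and a normally distributed distance coefficient. The WTP per meter is the ratio `B_dist / B_cost`, in dollars per year. All parameter values below are hypothetical placeholders for the estimates of Exercise 6.

```python
import random
import statistics

b_cost = -4e-6                      # hypothetical fixed cost coefficient
mu_dist, sigma_dist = -4e-4, 2e-4   # hypothetical mean and std. dev. of B_dist

rng = random.Random(0)
# WTP (dollars per year) for one meter less of distance, one value per draw.
wtp = [rng.gauss(mu_dist, sigma_dist) / b_cost for _ in range(10000)]

mean_wtp = statistics.mean(wtp)
median_wtp = statistics.median(wtp)
share_negative = sum(w < 0 for w in wtp) / len(wtp)
print(mean_wtp, median_wtp, share_negative)
```

Because a normal distribution has unbounded support, some draws of `B_dist` are positive and imply a negative WTP; the share of such draws is worth reporting alongside the mean and the median.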

# 8) The researchers want to test whether their survey method introduces some form of 'undesirable' systematic effect. In particular, they want to know if there are systematic differences between alternative A and alternative B (for example, because alternative A appears at the top of the survey card and B at the bottom). How would you test that? What would be the result of the test?

*Hint: You might want to compare some models to others.*
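
When two nested models are compared, the likelihood ratio statistic is `-2 * (LL_restricted - LL_unrestricted)`, compared against a chi-square critical value with as many degrees of freedom as extra parameters. The sketch below shows the arithmetic only; the log-likelihood values are hypothetical, and the critical values are the standard 5% chi-square quantiles.

```python
def likelihood_ratio_stat(ll_restricted, ll_unrestricted):
    """LR statistic for two nested models estimated on the same data."""
    return -2.0 * (ll_restricted - ll_unrestricted)

# Chi-square critical values at the 5% level, indexed by degrees of freedom.
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}

ll_reference = -1250.0   # hypothetical final log-likelihood, reference model
ll_extended = -1249.2    # hypothetical final log-likelihood, one extra parameter

stat = likelihood_ratio_stat(ll_reference, ll_extended)
reject = stat > CHI2_CRIT_5PCT[1]
print(stat, reject)
```

With these made-up numbers the statistic is below the critical value, so the restricted model would not be rejected. In the notebook, the helper `qbus_likeli_ratio_test_bgm` performs the comparison directly on two estimation results.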

# 9) Based on the experiment, the government is thinking about increasing the number of schools in the area, with the idea of reducing the distance that the children have to travel. Assume that the rows of data in the survey represent the population. Building one school in the area would reduce the distance by 10%, two schools by 15%, and three schools by 18%. Assuming a budget of 4 million dollars per school, how many schools should we build if we want to generate a net benefit for the population in less than 30 years?


*Hint: It can be 0, 1, 2 or 3, the net benefit can come from WTP*
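
The arithmetic can be organised as in the sketch below, which compares the total WTP for the saved distance over 30 years with the construction cost. It ignores discounting, and the WTP value and the list of distances are hypothetical; in the exercise they would come from the estimated model and from the survey rows.

```python
SCHOOL_COST = 4_000_000                            # dollars per school
REDUCTION = {0: 0.0, 1: 0.10, 2: 0.15, 3: 0.18}    # distance reduction by number of schools
HORIZON_YEARS = 30

def net_benefit(n_schools, distances_m, wtp_per_m_year):
    """Undiscounted net benefit over the horizon for a given number of schools."""
    saved_m = sum(d * REDUCTION[n_schools] for d in distances_m)
    return HORIZON_YEARS * wtp_per_m_year * saved_m - SCHOOL_COST * n_schools

# Hypothetical inputs: 1000 households living 3 km away, WTP of 0.5 dollars
# per meter per year.
distances = [3000.0] * 1000
wtp = 0.5

results = {n: net_benefit(n, distances, wtp) for n in REDUCTION}
best = max(results, key=results.get)
print(results, best)
```

Since each additional school yields a smaller extra reduction in distance while costing the same, the net benefit does not necessarily grow with the number of schools.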

# A1) Use a fixed effects model to estimate whether male-children households are less affected by distance than female-children households.

# A2) In an alternative view of the dataset, we want to study the education level as the choice variable. Using as covariates the attributes of option 'A' and the characteristics, fit an ordered logit and comment on the results.
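
As a reminder of the structure of an ordered logit, the probability of each level is the difference between logistic CDFs evaluated at consecutive thresholds, `P(y = k) = F(tau_k - V) - F(tau_{k-1} - V)`. The sketch below computes these probabilities in plain Python; the thresholds and utility values are hypothetical.

```python
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_probs(v, taus):
    """Probabilities of the len(taus) + 1 ordered levels, given increasing thresholds."""
    cdfs = [logistic_cdf(t - v) for t in taus]
    probs = [cdfs[0]]                                               # lowest level
    probs += [cdfs[k] - cdfs[k - 1] for k in range(1, len(taus))]   # middle levels
    probs.append(1.0 - cdfs[-1])                                    # highest level
    return probs

taus = [-1.0, 0.5, 2.0]   # hypothetical thresholds for the 4 education levels
probs_low = ordered_logit_probs(0.3, taus)
probs_high = ordered_logit_probs(2.0, taus)
print(probs_low, probs_high)
```

A larger utility `V` shifts probability mass towards the higher levels, which is how the signs of the coefficients are interpreted in an ordered model.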

# A3) Continuing on the alternative view: Compare two different orders to the default order given in the dataset, and decide if they are better than the default.