# PRACTICE for the QBUS3840 In-semester exam

# Rubric
* The **marking scheme** is simple: each question has some points assigned. The points for each question are then divided between:
  * Code: 50% if it works OK, 35% if there are minor problems, 20% if it does not work but is well explained.
  * Text explanations: 40% if the text is clearly written and complete (all points are addressed), decisions are properly justified (the right reasons are given for the answer), and it demonstrates knowledge of the topic by explaining nuances and alternatives. The mark degrades from 40% as the text fails to achieve that.
  * Appearance: the remaining 10%. Structure the answer in sections if needed. Use properly sized cells (no overly large code cells). Use an even mix of code cells and explanations instead of very few large cells. Code should be readable.



# Guidelines

* The exam will be a Colab Jupyter notebook that you have to fill in and then upload to Canvas.

* You will have **120 minutes** to do the exam. This is an **important point**: try to become familiar with the functions needed to run biogeme, pandas, numpy, etc. Even though you have full access to the course material and can consult online programming forums for Python issues, looking things up takes extra time. You can also prepare your own auxiliary functions to reduce the 'verbosity' of raw biogeme code. We will see this in the practice notebook.

* **The questions will be very similar to what you will see in this practice notebook.** The point of the exam is to prove that you can do a basic analysis with a multinomial logit and use biogeme as a tool. The exam will differ from the typical exercises mostly in the type of data, the variables involved and the 'what-if scenario' questions.

* The answers should be technically correct, but **the explanation in the text cells should demonstrate knowledge** of what you are doing. Why did you make that decision? For example, for a variable transformation, why did you choose that particular one? Why did you choose to add that variable to the model? What do you expect it to do? After the results: are they as expected? Please do not be afraid of being 'too obvious' when explaining something. When explaining coefficients: do they have the expected sign? What is the interpretation with respect to the reference alternative? Perfect code with no explanation will net you 50-60% of the marks. In the opposite case, if you get stuck on a Python issue but know 'conceptually' how to answer, a good text explanation and some pseudocode can net you up to half marks.


* There will be no data cleaning involved, and the dataset will have full availability. You can create a full availability dictionary to pass to biogeme by just setting all entries to 1: `av = {1:1, 2:1, 3:1, ...}`. We will see this in the practice.

* Please **do not identify yourself explicitly in the answers**, for example by writing your name. Besides that, you are free to express yourself.

* The 'visual appearance' part of the exam accounts for a small percentage of the mark, 10%. Aim for clarity: do not leave very large code cells followed or preceded by large text cells, and interleave them so that the notebook is more natural to follow. Divide your answers into sections if they become long or address different topics. Do not write very long outputs; for example, do not print the full dataset. It is critical that the main part of each answer is clearly identified with its own text cell and code cell. A bad example would be a large text cell explaining all the steps and then a large code cell that prints the output, with the answers buried in the middle of the code. Of course, if your answer does not require code (it might happen), do not force a code chunk in. We will mark the visual part by reading the notebook: if something stands out in a negative sense, it subtracts points.

* There can be one or two small 'theoretical' questions that can be answered directly by understanding some theoretical concepts. The question can also
be solved practically by estimating a model to 'try' the ideas.
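
As a small illustration of the full-availability dictionary mentioned in the guidelines above, the sketch below builds it from the keys of a utility dictionary instead of typing it by hand. The names `full_availability` and `V_example` are hypothetical, and the string values only stand in for biogeme utility expressions.

```python
def full_availability(utilities):
    """Return a dict marking every alternative in `utilities` as available (1)."""
    return {alt: 1 for alt in utilities}

# Placeholder utilities; in the exam these would be biogeme expressions.
V_example = {1: 'V_beach', 2: 'V_pier', 3: 'V_boat', 4: 'V_charter'}
av = full_availability(V_example)
print(av)
```

Deriving the keys from the utility dictionary keeps the two dictionaries consistent if an alternative is added or removed.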




# The practice problem
We will model a dataset on the choice of 'recreational fishing' mode: each fisher chooses whether to go on a fishing trip from the beach, from the pier, on a public charter boat or on a private boat. The data was collected via phone interview, and
the attributes of the alternatives are the cost of the trip and the 'catch rate', the expected number of catches per hour for the particular species of fish that each fisher was targeting on their trip.
The only socio-economic characteristic is income. In fact, the dataset was used to study different transformations of the income and price variables and how they influence utility, drawing deeper consequences for economic theory.

The reference study, including a more detailed description of the dataset, can be found [here (Section IV Data and references therein)](https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1017&context=econ_las_pubs)

## Description of the dataset

Each row represents a different fisher (customer); fishers are 'independent' of each other.

The variables in the dataset are:

**mode**: a categorical variable indicating the fishing mode selected for the trip. It is encoded in numbers, with the code:
 1. Beach
 2. Pier
 3. Private boat
 4. Charter boat

**price_x**:  Cost of the fishing mode, variable in dollars. Where x stands for one of the alternatives, e.g. price_beach is the cost of the fishing from the beach in one fishing trip.

**catch_x**: Catch rate, in catches per hour. Where x stands for one of the alternatives, e.g. catch_beach is the catch rate of the beach alternative.

**income**: Monthly income of the recreational fisher, in dollars.


---
---

# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [1]:
!pip install biogeme



Load the packages, feel free to change the names.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp

# Load the dataset

# <font color='red'>IMPORTANT</font>
Enter your student id in the cell below and run the dataset loading code.


In [3]:
student_id = 520005325

In [4]:
np.random.seed(student_id)
version = np.random.randint(1007) % 2

if (version):
  path = 'https://raw.githubusercontent.com/pmontman/pub-choicemodels/main/data/fishing.csv'
  fish_pd = pd.read_csv(path)
else:
  path = 'https://raw.githubusercontent.com/pmontman/pub-choicemodels/main/data/fishing.csv'
  fish_pd = pd.read_csv(path)
  fish_pd[ ['price_beach', 'catch_beach', 'price_boat', 'catch_boat']] =  fish_pd[ ['price_boat', 'catch_boat', 'price_beach', 'catch_beach']]

A simple look at the dataset.

In [5]:
fish_pd.head(5)

Unnamed: 0,mode,price_beach,price_pier,price_boat,price_charter,catch_beach,catch_pier,catch_boat,catch_charter,income
0,4,157.93,157.93,157.93,182.93,0.2601,0.0503,0.0678,0.5391,7083.3317
1,4,10.534,15.114,15.114,34.534,0.1574,0.0451,0.1049,0.4671,1249.9998
2,3,24.334,161.874,161.874,59.334,0.2413,0.4522,0.5333,1.0266,3749.9999
3,2,55.93,15.134,15.134,84.93,0.1643,0.0789,0.0678,0.5391,2083.3332
4,3,41.514,106.93,106.93,71.014,0.1082,0.0503,0.0678,0.324,4583.332


---
---

# 1) Fit a model with alternative specific constants and shared parameters for price and catch rate. Select one of the alternatives as the reference (pick the one that you prefer). Comment on the results: signs of the variables and alternative specific constants.

In [6]:
def qbus_update_globals_bgm(pd_df):
    # Expose the dataset columns as biogeme variables in the global namespace
    globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
    # Full availability: every alternative is available to every individual
    av_auto = {key: 1 for key in V}
    bgm_db = db.Database(modelname + '_db', pd_df)
    globals().update(bgm_db.variables)
    logprob = models.loglogit(V, av_auto, bgm_db.variables[tgtvar_name])
    bgm_model = bio.BIOGEME(bgm_db, logprob)
    # Keep the utilities so that the model can be simulated later
    bgm_model.utility_dic = V.copy()
    return bgm_model, bgm_model.estimate()

# The helper returns nothing, so there is no result to assign
qbus_update_globals_bgm(fish_pd)


# The last argument is the status flag: 0 = estimated, 1 = fixed at the initial value.
# ASC_pier is fixed at 0, so the pier is the reference alternative.
ASC_beach = exp.Beta ( 'ASC_beach' ,0, None , None ,0)
ASC_pier = exp.Beta ( 'ASC_pier' ,0, None , None ,1)
ASC_boat = exp.Beta ( 'ASC_boat' ,0, None , None ,0)
ASC_charter = exp.Beta ( 'ASC_charter' ,0, None , None ,0)
B_price = exp.Beta ( 'B_price' ,0, None , None ,0)
B_catch = exp.Beta ( 'B_catch' ,0, None , None ,0)


V_beach = ASC_beach + B_price*price_beach + B_catch*catch_beach
V_pier = ASC_pier + B_price*price_pier + B_catch*catch_pier
V_boat = ASC_boat + B_price*price_boat + B_catch*catch_boat
V_charter = ASC_charter + B_price*price_charter + B_catch*catch_charter

V_base = {1: V_beach,
     2: V_pier,
     3: V_boat,
     4: V_charter}

model_base, results_base = qbus_estimate_bgm(V_base, fish_pd, 'mode', 'fish')
results_base.getEstimatedParameters()

You have not defined a name for the model. The output files are named from the model name. The default is [biogemeModelDefaultName]


Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_beach,-0.251721,0.127699,-1.971202,0.048701
ASC_boat,0.841877,0.090719,9.280051,0.0
ASC_charter,0.874607,0.104903,8.337258,0.0
B_catch,0.148717,0.091371,1.627622,0.103605
B_price,0.000688,0.000506,1.360759,0.17359


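We set the pier as the reference alternative (its ASC is fixed at 0) and estimate the multinomial logit model with biogeme. Relative to the pier, and with price and catch rate held equal, the beach has a negative constant of -0.2517, while the private boat (0.8419) and the charter boat (0.8746) have positive constants, so both boat modes are preferred to the pier beyond what price and catch rate explain. The catch rate coefficient is positive (0.1487), as expected: a higher catch rate raises the utility of any alternative. The price coefficient is also slightly positive (0.0007), which contradicts the usual expectation that a higher price lowers utility. However, neither the catch rate coefficient (p = 0.10) nor the price coefficient (p = 0.17) is significant at the 5% level, so this specification with shared parameters does not pin down either effect reliably and needs further investigation.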
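
One way to make the constants easier to read is to turn them into odds against the reference alternative. In a multinomial logit, when two alternatives have the same price and catch rate, the ratio of their choice probabilities is the exponential of the difference in their constants. The sketch below applies this to the rounded estimates from the results table above; the helper name `odds_vs_reference` is hypothetical.

```python
import math

# Estimated constants copied from the results table (pier is the reference, ASC = 0).
asc = {'beach': -0.251721, 'pier': 0.0, 'boat': 0.841877, 'charter': 0.874607}

def odds_vs_reference(asc_value):
    """Odds of an alternative against the reference when all attributes are equal."""
    return math.exp(asc_value)

odds = {name: odds_vs_reference(value) for name, value in asc.items()}
for name, value in odds.items():
    print(f'{name}: {value:.3f}')
```

Values below 1 indicate an alternative that is chosen less often than the pier under equal attributes, and values above 1 one that is chosen more often.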

---
---

# 2) Calculate the willingness to pay for increasing the catch rate and comment on the interpretation



In [7]:
results_base.getBetaValues()['B_catch'] / results_base.getBetaValues()['B_price']

216.1296990680225

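The ratio B_catch / B_price is about 216.13. Read as a willingness to pay (WTP), it would suggest that a fisher is willing to pay an additional 216.13 dollars per trip for a one-unit increase in the catch rate (one more catch per hour), all else being equal. For example, if the catch rate goes from 5 fish per hour to 6 fish per hour, the fisher would pay an extra 216.13 dollars. This number should be treated with caution for two reasons. First, the WTP is usually defined as -B_catch / B_price, which is positive only when the price coefficient is negative; here B_price is positive, so with the usual definition the WTP is about -216.13 and has no sensible economic interpretation. Second, neither coefficient is significant at the 5% level in Exercise 1, so the ratio is very imprecise.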
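
The sign convention can be made explicit with a small helper. This is a minimal sketch that assumes a utility that is linear in price and catch rate, so the WTP is the negative ratio of the two coefficients; `willingness_to_pay` and `betas_example` are hypothetical names, and the values are the rounded estimates from the Exercise 1 table, so the result differs slightly from the cell output.

```python
def willingness_to_pay(betas, attribute, price):
    """WTP for one extra unit of `attribute`: -beta_attribute / beta_price."""
    return -betas[attribute] / betas[price]

# Rounded estimates from the Exercise 1 results table.
betas_example = {'B_catch': 0.148717, 'B_price': 0.000688}
wtp = willingness_to_pay(betas_example, 'B_catch', 'B_price')
print(round(wtp, 2))
```

A negative result, as obtained here, signals that the price coefficient has the wrong sign for a WTP interpretation.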

---
---

# 3) Fit per-alternative parameters for cost and catch rate. Add one variable that has not been considered, apply a transformation of your choosing (to the new or other variables) and estimate a new model. Comment on the results and compare the new model to the model in Exercise 1. What changes are relevant? Is the new model a better fit?


In [8]:
fish_pd['log_income'] = np.log(fish_pd['income'])
qbus_update_globals_bgm(fish_pd)

B_price_beach = exp.Beta ( 'B_price_beach' ,0, None , None ,0)
B_price_pier = exp.Beta ( 'B_price_pier' ,0, None , None ,0)
B_price_boat = exp.Beta ( 'B_price_boat' ,0, None , None ,0)
B_price_charter = exp.Beta ( 'B_price_charter' ,0, None , None ,0)

B_catch_beach = exp.Beta ( 'B_catch_beach' ,0, None , None ,0)
B_catch_pier = exp.Beta ( 'B_catch_pier' ,0, None , None ,0)
B_catch_boat = exp.Beta ( 'B_catch_boat' ,0, None , None ,0)
B_catch_charter = exp.Beta ( 'B_catch_charter' ,0, None , None ,0)

B_log_income_beach = exp.Beta ( 'B_log_income_beach' ,0, None , None ,0)
B_log_income_pier = exp.Beta ( 'B_log_income_pier' ,0, None , None ,0)
B_log_income_boat = exp.Beta ( 'B_log_income_boat' ,0, None , None ,0)
B_log_income_charter = exp.Beta ( 'B_log_income_charter' ,0, None , None ,0)


V_beach_adv = ASC_beach + B_price_beach*price_beach + B_catch_beach*catch_beach + B_log_income_beach*log_income
V_pier_adv = ASC_pier + B_price_pier*price_pier + B_catch_pier*catch_pier + B_log_income_pier*log_income
V_boat_adv = ASC_boat + B_price_boat*price_boat + B_catch_boat*catch_boat + B_log_income_boat*log_income
V_charter_adv = ASC_charter + B_price_charter*price_charter + B_catch_charter*catch_charter + B_log_income_charter*log_income

V_adv = {1: V_beach_adv,
     2: V_pier_adv,
     3: V_boat_adv,
     4: V_charter_adv}

model_adv, results_adv = qbus_estimate_bgm(V_adv, fish_pd, 'mode', 'fish_adv')
results_adv.getEstimatedParameters()




You have not defined a name for the model. The output files are named from the model name. The default is [biogemeModelDefaultName]


Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_beach,4.336826,1.749591,2.478766,0.01318377
ASC_boat,3.027895,1.412361,2.143853,0.03204465
ASC_charter,1.91053,1.297449,1.472527,0.1408786
B_catch_beach,-0.458508,0.358053,-1.280557,0.2003491
B_catch_boat,-2.782359,0.438866,-6.339883,2.299401e-10
B_catch_charter,0.085188,0.093338,0.912679,0.3614116
B_catch_pier,3.188682,0.717647,4.443243,8.861313e-06
B_log_income_beach,-0.420517,0.127944,-3.286715,0.001013632
B_log_income_boat,-0.038327,0.099031,-0.387021,0.6987405
B_log_income_charter,0.110245,0.07889,1.397459,0.1622757


In [9]:
import biogeme.tools as tools
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)
qbus_likeli_ratio_test_bgm(results_adv, results_base, 0.95)

LRTuple(message='H0 can be rejected at level 95.0%', statistic=383.4144222546133, threshold=3.940299136119061)

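ASC values: All estimated ASCs are now positive (the pier remains the reference). ASC_beach changes sign, from -0.25 in Exercise 1 to 4.34, and remains significant (p = 0.013), while ASC_charter is no longer significant (p = 0.14). The constants now also absorb part of the effect of the new income terms, so they are not directly comparable with those of Exercise 1.
Catch rate: Different alternatives have different coefficients for catch rate, and they vary in significance. The coefficient for the pier is significantly positive (3.19), the coefficient for the private boat is significantly negative (-2.78), which is counterintuitive, and the coefficients for the beach and the charter boat are not significant.
Price: Like catch rates, price effects also vary across alternatives and have differing significance. For example, the price for pier fishing is significant and negative.
Income: The log of income has been introduced, and it has varying effects. For beach fishing the coefficient is negative and significant (-0.42, p = 0.001), so higher-income fishers are less likely to fish from the beach. The coefficients for the private boat and the charter boat are not significant. Because income takes the same value for every alternative, only differences between the income coefficients are identified, so one of them should be fixed as a reference, in the same way as the ASCs.
Fit per alternative: The model now captures more nuanced behavior by allowing parameters to vary across alternatives, thus potentially offering a better fit.

Likelihood ratio test: The statistic is 383.41, far above the threshold, so the restricted model of Exercise 1 is rejected and the new model fits significantly better. Note that the last argument passed to the test is the significance level: 0.95 was used, which gives the threshold of 3.94, whereas the conventional choice is 0.05. The conclusion does not change, because the statistic is so large.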
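
The strength of the evidence can be checked by hand with the p-value of the likelihood ratio statistic. The sketch below is only an illustration: it assumes 10 degrees of freedom (inferred from the reported threshold of 3.94, not stated in the output) and uses the closed-form upper tail of the chi-square distribution, which is valid for an even number of degrees of freedom. The function name `chi2_sf_even_df` is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError('the closed form used here needs a positive even df')
    half = x / 2.0
    total = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * total

lr_statistic = 383.4144222546133   # copied from the test output above
df_assumed = 10                    # assumption, not reported in the output

p_value = chi2_sf_even_df(lr_statistic, df_assumed)
# 18.307 is the usual 5% critical value for 10 degrees of freedom.
p_at_conventional_threshold = chi2_sf_even_df(18.307, df_assumed)
print(p_value < 0.05, round(p_at_conventional_threshold, 3))
```

Under these assumptions the p-value is far below any usual significance level, which agrees with rejecting the model of Exercise 1.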

---
---

# 4) Calculate the accuracy of that model and confusion matrix, comment on the results.


In [10]:
def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix

def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
  av_auto = qbus_bgm_model.utility_dic.copy()
  for key, value in av_auto.items():
   av_auto[key] = 1

  targets = qbus_bgm_model.utility_dic.copy()
  for key, value in targets.items():
   targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

  bgm_db = db.Database('simul', pred_pd_df)
  globals().update(bgm_db.variables)
  bgm_pred_model = bio.BIOGEME(bgm_db, targets)
  simulatedValues = bgm_pred_model.simulate(betas)
  return simulatedValues

betas = results_adv.getBetaValues()
simulatedValues = qbus_simulate_bgm(model_adv, betas, fish_pd)
accu, confusion_matrix = qbus_calc_accu_confusion(simulatedValues, fish_pd, 'mode')
print(accu)
confusion_matrix

0.4450084602368866


Predicted,1,2,3,4
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,8,57,14,55
2,6,69,19,84
3,5,12,202,199
4,8,15,182,247


The model's accuracy is approximately 44.5% (526 of 1182 fishers). This is better than random guessing among four alternatives (25%) and somewhat better than always predicting the most frequent mode, the charter boat (452 of 1182, about 38%), but it is still far from a highly accurate model. The confusion matrix shows that the model is most accurate for Charter Boat (Category 4) and Private Boat (Category 3), with 247 and 202 correct predictions, respectively, although it often confuses the two with each other: 199 private boat trips are predicted as charter and 182 charter trips as private boat. It struggles with Beach (Category 1) and Pier (Category 2). Only 8 of 134 beach trips are predicted correctly, with most assigned to Pier (57) or Charter Boat (55), and pier trips are misclassified mostly as Charter Boat (84). The model rarely predicts Beach at all (27 predictions in total). Overall, while the model captures some aspects of the choice behavior, there is considerable room for improvement, particularly in separating the shore modes from the boat modes and the private boat from the charter boat.
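
The figures used to read a confusion matrix can be reproduced with a few lines of plain Python. This sketch copies the matrix from the output above and computes the accuracy, a majority-class baseline and the per-mode recall; the variable names are illustrative and not part of the course helpers.

```python
# Confusion matrix copied from the output above: rows are actual, columns predicted.
labels = ['beach', 'pier', 'boat', 'charter']
confusion = [
    [8, 57, 14, 55],
    [6, 69, 19, 84],
    [5, 12, 202, 199],
    [8, 15, 182, 247],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(4)) / total
# Baseline: always predict the most frequent actual mode.
majority_baseline = max(sum(row) for row in confusion) / total
# Recall: share of each actual mode that is predicted correctly.
recall = {labels[i]: confusion[i][i] / sum(confusion[i]) for i in range(4)}

print(f'accuracy={accuracy:.3f}, majority baseline={majority_baseline:.3f}')
for name, value in recall.items():
    print(f'recall {name}: {value:.3f}')
```

Comparing the accuracy with the majority baseline is usually more informative than comparing it with uniform random guessing when the modes have very different frequencies.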

---
---

# 5) Suppose that the company that runs the charter boats is offering a 75% discount for the population with a monthly income under 2100 dollars. What would be the market share for each of the alternatives in the new situation? Use your model in exercise 3.

In [11]:
fish_pd_2 = fish_pd.copy()
# Select the low-income fishers and use .loc, which avoids the chained assignment that triggers a pandas warning.
# NOTE: a factor of 0.75 lowers the price by 25%; a 75% discount corresponds to a factor of 0.25. The outputs shown use 0.75.
low_income = fish_pd_2['income'] < 2100
fish_pd_2.loc[low_income, 'price_charter'] *= 0.75
# Mean charter price before and after the discount
print(fish_pd['price_charter'].mean())
print(fish_pd_2['price_charter'].mean())
fish_pd_2.head()


84.37924365482233
80.26969796954315



Unnamed: 0,mode,price_beach,price_pier,price_boat,price_charter,catch_beach,catch_pier,catch_boat,catch_charter,income,log_income
0,4,157.93,157.93,157.93,182.93,0.2601,0.0503,0.0678,0.5391,7083.3317,8.8655
1,4,10.534,15.114,15.114,25.9005,0.1574,0.0451,0.1049,0.4671,1249.9998,7.130899
2,3,24.334,161.874,161.874,59.334,0.2413,0.4522,0.5333,1.0266,3749.9999,8.229511
3,2,55.93,15.134,15.134,63.6975,0.1643,0.0789,0.0678,0.5391,2083.3332,7.641724
4,3,41.514,106.93,106.93,71.014,0.1082,0.0503,0.0678,0.324,4583.332,8.430182


In [18]:
simulatedValues_2 = qbus_simulate_bgm(model_adv, betas, fish_pd_2)
adjusted_price_place = np.mean(simulatedValues_2,axis=0)
price_place = np.mean(simulatedValues,axis=0)
print('---Previous market share----')
print(price_place)
print('---Adjusted market share--')
adjusted_price_place

---Previous market share----
1    0.113366
2    0.150594
3    0.353637
4    0.382402
dtype: float64
---Adjusted market share--


1    0.113416
2    0.150678
3    0.353730
4    0.382176
dtype: float64

The model suggests that the discount for the population with a monthly income under 2100 dollars would have a minimal impact on the market share of each alternative. Note that the scenario was coded with a price factor of 0.75, which is a 25% reduction; a 75% discount would use a factor of 0.25, so the shares above understate the size of the price change. The market shares for Beach, Pier and Private Boat increase very slightly, whereas the market share for Charter Boat decreases very slightly. A lower charter price reducing the charter share is the opposite of what we would expect, and in this model it can only happen if the estimated price coefficient for the charter boat is positive, which mirrors the positive price coefficient found in Exercise 1. The small size of the change suggests that price plays a minor role in the fitted model, that the low-income group is a small segment of the overall market, or that the model does not capture all relevant influences. Overall, despite what might seem like a substantial discount, the predicted market shares are largely unchanged, and the direction of the change is a reason to be cautious about using this model for pricing scenarios.
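
The market shares reported above are the averages of the individual choice probabilities. As a minimal, self-contained sketch of that calculation (independent of biogeme, with made-up utilities and hypothetical function names), the logit probabilities of each individual are computed first and then averaged per alternative.

```python
import math

def logit_probabilities(utilities):
    """Multinomial logit probabilities for one individual (all alternatives available)."""
    max_u = max(utilities.values())  # subtract the maximum for numerical stability
    expu = {alt: math.exp(u - max_u) for alt, u in utilities.items()}
    denom = sum(expu.values())
    return {alt: e / denom for alt, e in expu.items()}

def market_shares(utilities_per_person):
    """Average the individual probabilities to get the predicted share of each alternative."""
    n = len(utilities_per_person)
    shares = {alt: 0.0 for alt in utilities_per_person[0]}
    for utilities in utilities_per_person:
        for alt, p in logit_probabilities(utilities).items():
            shares[alt] += p / n
    return shares

# Two made-up individuals with made-up utilities, for illustration only.
sample = [{1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
          {1: -1.0, 2: 0.0, 3: 1.0, 4: 1.0}]
shares = market_shares(sample)
print(shares)
```

Averaging probabilities, rather than counting the most likely alternative for each individual, keeps the shares consistent with the uncertainty in each individual prediction.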

---
---

# 6) Due to poor weather conditions at sea, the fishing trips that go farther away from the coast (both private and charter boats) are going to see their catch rate cut by half during the season. What would be the expected impact on the total revenue from fishing trips during the season? Assume that everything else stays the same (the same fishers still go for a trip and the remaining variables do not change). Use your model in exercise 3.

In [19]:
fish_pd_3 = fish_pd.copy()
fish_pd_3['catch_boat'] *= 0.5
fish_pd_3['catch_charter'] *= 0.5

In [22]:
simulatedValues_adv = qbus_simulate_bgm(model_adv, betas, fish_pd)
simulatedValues_weather = qbus_simulate_bgm(model_adv, betas, fish_pd_3)

revenue_adv = simulatedValues_adv.to_numpy() * fish_pd[['price_beach', 'price_pier', 'price_boat', 'price_charter']].to_numpy()
revenue_weather = simulatedValues_weather.to_numpy() * fish_pd[['price_beach', 'price_pier', 'price_boat', 'price_charter']].to_numpy()

print('previous revenue:', revenue_adv.sum())
print('revenue with bad weather:', revenue_weather.sum())

previous revenue: 109978.26653354138
revenue with bad weather: 115938.36533888773


The model suggests an unexpected outcome: despite the catch rate being cut by half for the trips that go farther from the coast (both private and charter boats), the expected total revenue from fishing trips during the season increases from approximately 109,978 to 115,938 dollars, a rise of about 5.4%. Revenue here is the sum, over all fishers and all four modes, of the choice probability times the price of the mode. The result follows from the estimated coefficients rather than from any plausible behavior: the catch rate coefficient for the private boat is significantly negative (-2.78), so halving its catch rate raises its utility in the model, while the coefficient for the charter boat is small and not significant (0.085), so its utility barely changes. The model therefore shifts probability toward the private boat, and expected revenue changes according to the prices of the modes that gain and lose share. Since a negative catch rate coefficient is counterintuitive, the model's parameters probably do not capture the true price and catch sensitivity across the alternatives, and the model is not a reliable predictor in this scenario. Taken at face value, it implies that poor weather would increase total revenue, but this conclusion should not be trusted without a better specified model.
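
The revenue figure used in this exercise is an expectation: each fisher contributes the price of every mode weighted by the probability of choosing it. The sketch below shows the same calculation in plain Python with made-up probabilities and prices; `expected_revenue` and the example data are hypothetical.

```python
def expected_revenue(probabilities, prices):
    """Sum over individuals and alternatives of choice probability times price."""
    total = 0.0
    for probs_i, prices_i in zip(probabilities, prices):
        total += sum(probs_i[alt] * prices_i[alt] for alt in probs_i)
    return total

# Made-up probabilities and prices for two fishers, for illustration only.
probs_example = [{1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4},
                 {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}]
prices_example = [{1: 10.0, 2: 10.0, 3: 50.0, 4: 100.0},
                  {1: 20.0, 2: 20.0, 3: 20.0, 4: 20.0}]
revenue = expected_revenue(probs_example, prices_example)
print(revenue)
```

Because prices are unchanged in the weather scenario, any change in expected revenue comes only from probability moving between modes with different prices.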