# Modelling Choice Behaviour - Group E
## London Passenger Mode Choice
Students : DAMBREVILLE Nathan, DELPLANQUE Maxime, POULY Thimothée

### **<u>I/ Imports</u>**
#### **1 - Libraries**

In [1]:
# %conda env create -f env/MCB_E.yml
# %pip install biogeme

import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from biogeme.expressions import Beta, Variable, log, exp
from biogeme.results_processing import get_pandas_estimated_parameters, html_output



#### **2 - Data**

In [2]:
DATA_FOLDER = 'data/'
data = pd.read_csv(DATA_FOLDER + 'lpmc09.dat', sep='\t')
data.head()

| | trip_id | household_id | person_n | trip_n | travel_mode | purpose | fueltype | faretype | bus_scale | survey_year | ... | dur_pt_access | dur_pt_rail | dur_pt_bus | dur_pt_int | pt_interchanges | dur_driving | cost_transit | cost_driving_fuel | cost_driving_ccharge | driving_traffic_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 1 | 1 | 1 | 4 | 3 | 1 | 5 | 0.0 | 1 | ... | 0.241389 | 0.0 | 0.122222 | 0.0 | 0 | 0.132222 | 0.0 | 0.5 | 0.0 | 0.065126 |
| 1 | 43 | 10 | 0 | 0 | 3 | 3 | 6 | 1 | 1.0 | 1 | ... | 0.072778 | 0.0 | 0.344722 | 0.120556 | 1 | 0.167778 | 3.0 | 0.44 | 0.0 | 0.145695 |
| 2 | 46 | 12 | 0 | 0 | 4 | 3 | 1 | 5 | 0.0 | 1 | ... | 0.136389 | 0.0 | 0.070278 | 0.0 | 0 | 0.072222 | 0.0 | 0.24 | 0.0 | 0.107692 |
| 3 | 53 | 12 | 1 | 3 | 4 | 3 | 2 | 1 | 1.0 | 1 | ... | 0.0825 | 0.0 | 0.061944 | 0.0 | 0 | 0.0625 | 1.5 | 0.17 | 0.0 | 0.124444 |
| 4 | 65 | 13 | 1 | 3 | 3 | 5 | 1 | 5 | 0.0 | 1 | ... | 0.050833 | 0.216667 | 0.590556 | 0.237778 | 2 | 0.863889 | 0.0 | 2.6 | 0.0 | 0.675884 |


### **<u>II/ Model 0</u>**

**Description:**
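- This model includes alternative-specific constants (walking is the reference alternative, with no constant), travel time for every alternative, and cost for public transport and driving.
- Cost and travel time are associated with generic parameters, i.e. the same coefficient applies to all alternatives.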

#### **1 - Creation of the model**

In [3]:
# Create the Biogeme database
database = db.Database("lpmc09", data)

# Define the base variables
travel_mode = Variable('travel_mode')
dur_walking = Variable('dur_walking')
dur_cycling = Variable('dur_cycling')
dur_pt_bus = Variable('dur_pt_bus')
dur_pt_access = Variable('dur_pt_access')
dur_pt_int = Variable('dur_pt_int')
dur_driving = Variable('dur_driving')
cost_transit = Variable('cost_transit')
cost_driving_fuel = Variable('cost_driving_fuel')
cost_driving_ccharge = Variable('cost_driving_ccharge')

# Create new variables
dur_pt_tot = dur_pt_bus + dur_pt_access + dur_pt_int
cost_drive = cost_driving_fuel + cost_driving_ccharge

# Define the ASC to be estimated
asc_pt = Beta('asc_pt', 0, None, None, 0)
asc_cycling = Beta('asc_cycling', 0, None, None, 0)
asc_driving = Beta('asc_driving', 0, None, None, 0)

# Define the Betas to be estimated
beta_cost = Beta('beta_cost', 0, None, None, 0)
beta_time = Beta('beta_time', 0, None, None, 0)

# Define the utility functions per alternative
v_walking = dur_walking * beta_time
v_cycling = asc_cycling + dur_cycling * beta_time
v_pt = asc_pt + dur_pt_tot * beta_time + cost_transit * beta_cost
v_drive = asc_driving + dur_driving * beta_time + cost_drive * beta_cost

# Define the association between alternatives and utility functions
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}
logprob = models.loglogit(V, None, travel_mode)

# Initialisation of the Biogeme object
model_0 = bio.BIOGEME(database, logprob)
model_0.model_name = 'model_0'

#### **2 - Estimating the model's results**

In [4]:
results_model_0 = model_0.estimate()
general_statistics_model_0 = results_model_0.print_general_statistics()
print(general_statistics_model_0)

Number of estimated parameters             5
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4552.633
Final log likelihood                       -4552.633
Likelihood ratio test for the init. model  -0
Rho-square for the init. model             0
Rho-square-bar for the init. model         -0.0011
Akaike Information Criterion               9115.265
Bayesian Information Criterion             9147.851
Final gradient norm                        1.8474E-05
Bootstrapping time                         None


In [5]:
general_statistics_model_0

'Number of estimated parameters             5\nSample size                                5000\nExcluded observations                      0\nInit log likelihood                        -4552.633\nFinal log likelihood                       -4552.633\nLikelihood ratio test for the init. model  -0\nRho-square for the init. model             0\nRho-square-bar for the init. model         -0.0011\nAkaike Information Criterion               9115.265\nBayesian Information Criterion             9147.851\nFinal gradient norm                        1.8474E-05\nBootstrapping time                         None'

In [6]:
print(len("                       "))

23


In [7]:
get_pandas_estimated_parameters(estimation_results=results_model_0)

| | Name | Value | Robust std err. | Robust t-stat. | Robust p-value |
|---|---|---|---|---|---|
| 0 | beta_time | -4.374888 | 0.164549 | -26.587163 | 0.0 |
| 1 | asc_cycling | -3.351221 | 0.099973 | -33.521236 | 0.0 |
| 2 | asc_pt | -0.641355 | 0.058471 | -10.968706 | 0.0 |
| 3 | beta_cost | -0.137783 | 0.013934 | -9.888477 | 0.0 |
| 4 | asc_driving | -0.735068 | 0.067722 | -10.854174 | 0.0 |


#### **3 - Results analysis**
##### a - Overall Model Fit
- **Log-likelihood**: The initial log-likelihood (all parameters at zero) is **-6931.472**, and the final log-likelihood is **-4552.633**. Estimation therefore improves the fit substantially.
- **Likelihood Ratio Test**: The statistic is **4757.678**, far above any conventional chi-square critical value for 5 degrees of freedom. The estimated model fits significantly better than the initial model, in which every alternative has the same probability.
- **Rho-square**: The **Rho-square for the init. model (0.343)** and the **Rho-square-bar for the init. model (0.342)** mean that the model recovers about 34% of the initial log-likelihood. Unlike the R² of a linear regression, this is not a share of explained variance; it is mainly useful for comparing models estimated on the same data.
- **AIC and BIC**: The **Akaike Information Criterion (9115.265)** and **Bayesian Information Criterion (9147.851)** are provided for model comparison. Lower values indicate a better trade-off between fit and complexity, but their absolute values carry no meaning without a competing model.
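The fit statistics quoted above can be reproduced by hand. The sketch below assumes that the initial model is the equal-probability model (all parameters at zero, four alternatives always available), which is consistent with the reported initial log-likelihood. The variable names are ours and not part of the notebook.

```python
import math

n_obs = 5000           # sample size reported by Biogeme
n_alternatives = 4     # walking, cycling, public transport, driving
n_params = 5           # estimated parameters in Model 0
final_ll = -4552.633   # final log-likelihood of Model 0

# Assumption: with all parameters at zero, each alternative has probability 1/4
init_ll = n_obs * math.log(1 / n_alternatives)

lr_stat = -2 * (init_ll - final_ll)
rho_square = 1 - final_ll / init_ll
rho_square_bar = 1 - (final_ll - n_params) / init_ll

print(f"Init log likelihood: {init_ll:.3f}")
print(f"Likelihood ratio:    {lr_stat:.3f}")
print(f"Rho-square:          {rho_square:.3f}")
print(f"Rho-square-bar:      {rho_square_bar:.3f}")
```

The rho-square-bar subtracts the number of parameters from the final log-likelihood, so it penalises model size slightly.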

##### b - Interpretation of Parameter Signs
- **beta_time (-4.37)**: The negative sign indicates that as the time variable increases, the utility decreases. This is intuitive: people generally prefer options that take less time.
- **asc_cycling (-3.35)**: The negative sign for the alternative-specific constant (ASC) for cycling suggests that, all else being equal, individuals have a lower inherent preference for cycling compared to the base alternative, which is walking (its constant is normalised to zero).
- **asc_pt (-0.64)**: The negative ASC for public transport (PT) indicates a lower inherent preference for public transport compared to the base alternative.
- **beta_cost (-0.14)**: The negative sign for cost means that as the cost increases, the utility decreases. This is expected, as people prefer cheaper options.
- **asc_driving (-0.74)**: The negative ASC for driving suggests a lower inherent preference for driving compared to the base alternative.
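To see how these signs translate into choice probabilities, the sketch below plugs the Model 0 estimates into the logit formula for one trip. The trip attributes are hypothetical (they are not taken from the dataset), and the units (hours for durations, GBP for costs) are an assumption.

```python
import math

# Model 0 estimates, copied from the parameter table above
BETA_TIME, BETA_COST = -4.374888, -0.137783
ASC = {"walking": 0.0, "cycling": -3.351221, "pt": -0.641355, "driving": -0.735068}


def utilities(durations, costs):
    """Systematic utilities of Model 0 for one trip (no cost for walking and cycling)."""
    return {
        mode: ASC[mode] + BETA_TIME * durations[mode] + BETA_COST * costs.get(mode, 0.0)
        for mode in ASC
    }


def logit_probabilities(v):
    """Multinomial logit probabilities, shifted by the maximum utility for stability."""
    v_max = max(v.values())
    exp_v = {mode: math.exp(value - v_max) for mode, value in v.items()}
    total = sum(exp_v.values())
    return {mode: e / total for mode, e in exp_v.items()}


# Hypothetical trip: durations in hours, costs in GBP (assumed units)
durations = {"walking": 0.60, "cycling": 0.25, "pt": 0.35, "driving": 0.20}
costs = {"pt": 1.5, "driving": 0.8}

base = logit_probabilities(utilities(durations, costs))

# Same trip with 15 extra minutes of driving time
slower = dict(durations, driving=durations["driving"] + 0.25)
congested = logit_probabilities(utilities(slower, costs))

for mode in ASC:
    print(f"{mode:8s} {base[mode]:.3f} -> {congested[mode]:.3f}")
```

Because `beta_time` is negative, the extra driving time lowers the probability of driving and redistributes it over the other three modes.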

##### c - Statistical Significance
- **Robust p-values**: All parameters have **p-values of 0.0**, which means they are **statistically significant** at any conventional level.
- **Robust t-statistics**: All t-statistics are far from zero (absolute values much greater than 2), further confirming the statistical significance of each parameter.
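As a quick check, each robust t-statistic is the estimate divided by its robust standard error. The sketch below recomputes them from the rounded values in the table and derives two-sided p-values under a normal approximation (an assumption on our side about how the p-values are obtained).

```python
import math

# (value, robust std err) copied from the Model 0 parameter table
estimates = {
    "beta_time": (-4.374888, 0.164549),
    "asc_cycling": (-3.351221, 0.099973),
    "asc_pt": (-0.641355, 0.058471),
    "beta_cost": (-0.137783, 0.013934),
    "asc_driving": (-0.735068, 0.067722),
}

t_stats = {}
p_values = {}
for name, (value, std_err) in estimates.items():
    t_stats[name] = value / std_err
    # Two-sided p-value under the standard normal distribution
    p_values[name] = math.erfc(abs(t_stats[name]) / math.sqrt(2))
    print(f"{name:12s} t = {t_stats[name]:8.3f}   p = {p_values[name]:.2e}")
```

The p-values are so small that they are displayed as 0.0 in the table.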

### **<u>III/ Model 1</u>**

**Description:**
- This model replaces the generic travel-time parameter with alternative-specific ones.

#### **0 - Creation of the model**

In [8]:
# Define alternative-specific time Betas
beta_time_walk = Beta('beta_time_walk', 0, None, None, 0)
beta_time_cycling = Beta('beta_time_cycling', 0, None, None, 0)
beta_time_pt = Beta('beta_time_pt', 0, None, None, 0)
beta_time_drive = Beta('beta_time_drive', 0, None, None, 0)

# Utilities with alternative-specific time parameters
v_walking = dur_walking * beta_time_walk
v_cycling = asc_cycling + dur_cycling * beta_time_cycling
v_pt = asc_pt + dur_pt_tot * beta_time_pt + cost_transit * beta_cost
v_drive = asc_driving + dur_driving * beta_time_drive + cost_drive * beta_cost

# Association and estimation
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}
logprob = models.loglogit(V, None, travel_mode)

model_1 = bio.BIOGEME(database, logprob)
model_1.model_name = 'model_1'


#### **1 - Underlying assumption for alternative-specific time parameters**

Defining separate time parameters for each alternative implies that travellers value travel time differently depending on the mode.
For instance, a minute spent walking may be perceived as more onerous than a minute spent on public transport.
This specification lets the marginal disutility (or sensitivity) to travel time vary by mode, capturing mode-specific perceptions of time.

#### **2 - Estimating the model's results**

##### **a/ Coding the estimation**

In [9]:
# Estimate and display results
results_model_1 = model_1.estimate()
general_statistics_model_1 = results_model_1.print_general_statistics()
print(general_statistics_model_1)
get_pandas_estimated_parameters(estimation_results=results_model_1)

Number of estimated parameters             8
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4338.658
Final log likelihood                       -4338.658
Likelihood ratio test for the init. model  -0
Rho-square for the init. model             0
Rho-square-bar for the init. model         -0.00184
Akaike Information Criterion               8693.316
Bayesian Information Criterion             8745.453
Final gradient norm                        8.5424E-05
Bootstrapping time                         None


| | Name | Value | Robust std err. | Robust t-stat. | Robust p-value |
|---|---|---|---|---|---|
| 0 | beta_time_walk | -7.595577 | 0.409546 | -18.546347 | 0.0 |
| 1 | asc_cycling | -4.37024 | 0.196393 | -22.252547 | 0.0 |
| 2 | beta_time_cycling | -4.354759 | 0.457139 | -9.526125 | 0.0 |
| 3 | asc_pt | -2.486912 | 0.141213 | -17.611026 | 0.0 |
| 4 | beta_time_pt | -1.981675 | 0.171941 | -11.525341 | 0.0 |
| 5 | beta_cost | -0.154665 | 0.013395 | -11.546452 | 0.0 |
| 6 | asc_driving | -1.918849 | 0.140329 | -13.673965 | 0.0 |
| 7 | beta_time_drive | -3.456668 | 0.222864 | -15.510187 | 0.0 |


##### **b/ Results interpretation**

- The `general statistics` show that Model 1 fits better than Model 0: the final log-likelihood rises from -4552.633 to -4338.658, and both the AIC (9115.265 → 8693.316) and the BIC (9147.851 → 8745.453) decrease despite the three additional parameters.
- Compared to the generic `beta_time` of Model 0 (~ -4.375), `beta_time_walk` is about 1.7 times larger in magnitude (~ -7.596). Walking therefore loses attractiveness faster than the other modes as travel time increases.
- `beta_time_cycling` (~ -4.355) is very close to the generic value, so the interpretation given for Model 0 still applies.
- `beta_time_drive` (~ -3.457) is smaller in magnitude than the generic value: an increase in travel time penalises driving less than the Model 0 average suggested.
- `beta_time_pt` (~ -1.982) is the smallest in magnitude: public transport is the mode least penalised by additional travel time.
- `asc_pt` and `asc_driving` become much more negative: `asc_pt` is multiplied by almost 4 (-0.641 → -2.487) and `asc_driving` by about 2.6 (-0.735 → -1.919). This increases the baseline disadvantage of these modes relative to walking.
- `asc_cycling` changes less in relative terms (0: ~ -3.351 vs ~ -4.370 :1). The same goes for `beta_cost` (0: ~ -0.138 vs ~ -0.155 :1).
- All parameters estimated in Model 1 are `statistically significant`: every **Robust t-stat** is below -10 (except `beta_time_cycling`, at -9.5, which is very close) and all **Robust p-values** are reported as 0.0.
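The differences between the time coefficients are easier to read as the effect of a fixed delay. The sketch below computes, for each mode, the utility change caused by 10 extra minutes and the corresponding multiplicative effect on the odds of choosing that mode. It assumes that durations are expressed in hours; the dictionary and variable names are ours.

```python
import math

# Model 1 time coefficients, copied from the parameter table above
beta_time = {
    "walking": -7.595577,
    "cycling": -4.354759,
    "pt": -1.981675,
    "driving": -3.456668,
}
GENERIC_BETA_TIME = -4.374888  # Model 0 value, for comparison

extra_hours = 10 / 60  # assumption: durations are expressed in hours

summary = {}
for mode, beta in beta_time.items():
    delta_v = beta * extra_hours         # change in systematic utility
    odds_factor = math.exp(delta_v)      # factor applied to the odds against an unchanged alternative
    ratio = beta / GENERIC_BETA_TIME     # size relative to the generic Model 0 coefficient
    summary[mode] = (delta_v, odds_factor, ratio)
    print(f"{mode:8s} dV = {delta_v:6.3f}  odds x {odds_factor:.3f}  ratio to Model 0 = {ratio:.2f}")
```

The ordering of the utility changes (walking, then cycling, driving and public transport) mirrors the ordering of the coefficients discussed in the list above.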

#### **3 - Comparing `Model 0` and `Model 1`**

##### **a/ Choice of statistical test**

To compare **Model 0** and **Model 1**, we need to test whether the added parameters in Model 1 (alternative-specific ones) significantly improve the fit of the model compared to Model 0.
Both models are **nested** — Model 0 is a restricted version of Model 1 (obtained by constraining some parameters to be equal across alternatives).

The appropriate test is therefore a **Likelihood Ratio Test (LRT)**.

**Test statistic:**
\[
LR = -2 \times [LL_0 - LL_1]
\]
where

* \(LL_0\) = log-likelihood of Model 0,
* \(LL_1\) = log-likelihood of Model 1.

The statistic follows a **χ² (chi-square) distribution** with degrees of freedom equal to the difference in the number of parameters between the two models (\(df = k_1 - k_0\)).

**Null hypothesis (H₀):**
The more complex Model 1 does *not* provide a statistically significant improvement in fit over Model 0, i.e. the alternative-specific time parameters are all equal, which is exactly the restriction that turns Model 1 into Model 0.
\[
H_0: \beta_{\text{time, walk}} = \beta_{\text{time, cycling}} = \beta_{\text{time, pt}} = \beta_{\text{time, drive}}
\]

**Alternative hypothesis (H₁):**
At least one of the time parameters differs from the others.

**Expected result:**
Since Model 1 introduces alternative-specific parameters, we expect a better fit — that is, a higher log-likelihood and a significant LR statistic (p < 0.05).
Hence, Model 1 is expected to be preferred.

In [11]:
# --- Likelihood Ratio Test between Model 0 and Model 1 ---

import re
import scipy.stats as stats


def read_statistic(general_statistics, key):
    """Return the number printed after `key` in the general statistics string."""
    match = re.search(rf"{re.escape(key)}\s+(-?\d+(?:\.\d+)?)", general_statistics)
    if match is None:
        raise ValueError(f"'{key}' was not found in the general statistics")
    return float(match.group(1))


# Retrieve the final log-likelihood of both estimated models
# (reading the whole number avoids truncating its last digit)
LL0 = read_statistic(general_statistics_model_0, 'Final log likelihood')
LL1 = read_statistic(general_statistics_model_1, 'Final log likelihood')

# Retrieve the number of estimated parameters of each model
k0 = int(read_statistic(general_statistics_model_0, 'Number of estimated parameters'))
k1 = int(read_statistic(general_statistics_model_1, 'Number of estimated parameters'))

# Compute the Likelihood Ratio (LR) statistic
LR = -2 * (LL0 - LL1)

# Degrees of freedom (difference in number of parameters)
dif = k1 - k0

# Compute the p-value using the chi-square survival function (1 - cdf)
p_value = stats.chi2.sf(LR, dif)

# Print results
print("--- Likelihood Ratio Test ---")
print(f"Log-likelihood (Model 0): {LL0:.3f}")
print(f"Log-likelihood (Model 1): {LL1:.3f}")
print(f"LR statistic: {LR:.3f}")
print(f"Degrees of freedom: {dif}")
print(f"p-value: {p_value:.4f}")

# Decision rule at 5% significance level
if p_value < 0.05:
    print("Reject H0 → Model 1 significantly improves the fit. Model 1 is preferred.")
    model_pref = model_1
else:
    print("Fail to reject H0 → Model 1 does not significantly improve the fit. Keep Model 0.")
    model_pref = model_0


--- Likelihood Ratio Test ---
Log-likelihood (Model 0): -4552.633
Log-likelihood (Model 1): -4338.658
LR statistic: 427.950
Degrees of freedom: 3
p-value: 0.0000
Reject H0 → Model 1 significantly improves the fit. Model 1 is preferred.


##### **c/ Results interpretation**

There are two possible outcomes:

<u>Case 1 — Model 1 is preferred (Reject H₀):</u>

* The **LR statistic** is large enough (p < 0.05), meaning that the likelihood improvement due to the additional parameters is statistically significant.
* This indicates that allowing **alternative-specific parameters** captures real differences in travelers’ sensitivities across modes.
* **Model 1 becomes Model_pref**, i.e. the preferred model for the next steps.

<u>Case 2 — Model 1 is *not* preferred (Fail to reject H₀):</u>

* The **LR statistic** is small (p ≥ 0.05), meaning that the improvement in fit is not statistically significant.
* In that case, the more complex Model 1 does not justify its additional parameters.
* **Model 0 remains Model_pref**.

<u>Quantifying the degree of preference</u>

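The **magnitude of the LR statistic** (and its corresponding **p-value**) tells you *how strongly* Model 1 is preferred:

* A **large LR value**, well above the 5% critical value (3.84 for df = 1, 5.99 for df = 2, 7.81 for df = 3) → strong evidence that Model 1 fits significantly better.
* A **small LR value** (close to 0) → almost no gain in explanatory power.

Here, the LR statistic is about 428 with df = 3, far above 7.81, so Case 1 applies and Model 1 is strongly preferred.

You can also look at **information criteria** (AIC, BIC) for a secondary check: lower values indicate a preferred model once complexity is penalised. Both are lower for Model 1 (AIC: 8693.316 vs 9115.265, BIC: 8745.453 vs 9147.851).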
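The information criteria can be recomputed from the log-likelihoods with the standard definitions AIC = 2k - 2LL and BIC = k ln(n) - 2LL. The sketch below uses the values reported in the general statistics; the helper functions are ours.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


n_obs = 5000
models = {
    "model_0": {"ll": -4552.633, "k": 5},
    "model_1": {"ll": -4338.658, "k": 8},
}

criteria = {}
for name, m in models.items():
    criteria[name] = (aic(m["ll"], m["k"]), bic(m["ll"], m["k"], n_obs))
    print(f"{name}: AIC = {criteria[name][0]:.3f}, BIC = {criteria[name][1]:.3f}")
```

The BIC penalises each extra parameter by ln(5000) ≈ 8.5 instead of 2, yet it still favours Model 1 here.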

### **<u>IV/ Model 2</u>**

**Description:**
- This model 

#### **1 - Creation of the model**

### **<u>V/ Model 3</u>**

**Description:**
- This model includes an appropriate non-linear transformation of the `variable` variable.

#### **1 - Creation of the model**

##### Underlying assumption of the non-linear specification defined in this situation

Beyond a certain price, an alternative is already too expensive for the user, so further price increases matter less and less. Taking the logarithm of the price captures this: the marginal disutility of one extra unit of cost decreases as the price increases.
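The diminishing effect can be illustrated numerically. The sketch below is only an illustration under two assumptions of ours: the transformation is log(1 + cost), chosen so that zero costs (which appear in the data) remain valid, and the coefficient value is hypothetical rather than estimated.

```python
import math

BETA_LOG_COST = -0.5  # hypothetical coefficient, for illustration only


def cost_disutility(cost):
    """Utility contribution of cost under a log(1 + cost) transformation."""
    return BETA_LOG_COST * math.log1p(cost)


costs = [0, 1, 2, 5, 10, 20]

# Utility lost when the price rises by one unit, starting from each cost level
marginal = [cost_disutility(c + 1) - cost_disutility(c) for c in costs]

for c, m in zip(costs, marginal):
    print(f"cost {c:2d} -> {c + 1:2d}: utility change = {m:.4f}")
```

Each additional unit of cost still reduces utility, but by a smaller amount as the starting cost grows, which is the behaviour described above.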

In [12]:
# Default preferred model, used before Models 1 and 2 are created
model_pref = "model_0"