# Modelling Choice Behaviour - Group E
## London Passenger Mode Choice
Students : DAMBREVILLE Nathan, DELPLANQUE Maxime, POULY Thimothée

### **<u>I/ Imports</u>**
#### **1 - Libraries**

In [79]:
# %conda env create -f env/MCB_E.yml
# %pip install biogeme

import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from biogeme.expressions import Beta, Variable, log, exp
from biogeme.results_processing import get_pandas_estimated_parameters, html_output
from scipy.stats import norm

#### **2 - Data**

In [80]:
DATA_FOLDER = 'data/'
data = pd.read_csv(DATA_FOLDER + 'lpmc09.dat', sep='\t')
data.head()

Unnamed: 0,trip_id,household_id,person_n,trip_n,travel_mode,purpose,fueltype,faretype,bus_scale,survey_year,...,dur_pt_access,dur_pt_rail,dur_pt_bus,dur_pt_int,pt_interchanges,dur_driving,cost_transit,cost_driving_fuel,cost_driving_ccharge,driving_traffic_percent
0,13,1,1,1,4,3,1,5,0.0,1,...,0.241389,0.0,0.122222,0.0,0,0.132222,0.0,0.5,0.0,0.065126
1,43,10,0,0,3,3,6,1,1.0,1,...,0.072778,0.0,0.344722,0.120556,1,0.167778,3.0,0.44,0.0,0.145695
2,46,12,0,0,4,3,1,5,0.0,1,...,0.136389,0.0,0.070278,0.0,0,0.072222,0.0,0.24,0.0,0.107692
3,53,12,1,3,4,3,2,1,1.0,1,...,0.0825,0.0,0.061944,0.0,0,0.0625,1.5,0.17,0.0,0.124444
4,65,13,1,3,3,5,1,5,0.0,1,...,0.050833,0.216667,0.590556,0.237778,2,0.863889,0.0,2.6,0.0,0.675884


### **<u>II/ Model 0</u>**

**Description:**
- This model includes alternative-specific constants (walking is the reference alternative), travel time for every alternative, and cost for public transport and driving.
- Cost and travel time are associated with generic parameters.
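
As a minimal sketch of what this specification means, the snippet below builds the four utilities with generic time and cost parameters and converts them into multinomial logit probabilities. The coefficient values and trip attributes are hypothetical and only serve as an illustration; they are not the estimates reported later, and `logit_probabilities` is a helper introduced here, not a Biogeme function.

```python
import math

def logit_probabilities(utilities):
    """Return multinomial logit probabilities for a dict {alternative: utility}."""
    v_max = max(utilities.values())  # subtract the max for numerical stability
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

# Hypothetical coefficients and trip attributes (illustrative values only)
asc = {"walk": 0.0, "cycle": -3.0, "pt": -0.5, "drive": -0.5}  # walking is the reference
beta_time, beta_cost = -4.0, -0.1                              # generic parameters
time_h = {"walk": 0.9, "cycle": 0.3, "pt": 0.4, "drive": 0.2}
cost = {"walk": 0.0, "cycle": 0.0, "pt": 1.5, "drive": 0.5}

utilities = {
    alt: asc[alt] + beta_time * time_h[alt] + beta_cost * cost[alt]
    for alt in asc
}
probabilities = logit_probabilities(utilities)
print(probabilities)
```

Because the same `beta_time` and `beta_cost` multiply every alternative's attributes, one extra hour or one extra unit of cost lowers utility by the same amount whatever the mode.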

#### **1 - Creation of the model**

In [81]:
# Create the Biogeme database
database = db.Database("lpmc09", data)

# Define the base variables
travel_mode = Variable('travel_mode')
dur_walking = Variable('dur_walking')
dur_cycling = Variable('dur_cycling')
dur_pt_bus = Variable('dur_pt_bus')
dur_pt_access = Variable('dur_pt_access')
dur_pt_int = Variable('dur_pt_int')
dur_driving = Variable('dur_driving')
cost_transit = Variable('cost_transit')
cost_driving_fuel = Variable('cost_driving_fuel')
cost_driving_ccharge = Variable('cost_driving_ccharge')

# Create new variables
dur_pt_tot = dur_pt_bus + dur_pt_access + dur_pt_int
cost_drive = cost_driving_fuel + cost_driving_ccharge

# Define the ASC to be estimated
asc_pt = Beta('asc_pt', 0, None, None, 0)
asc_cycling = Beta('asc_cycling', 0, None, None, 0)
asc_driving = Beta('asc_driving', 0, None, None, 0)

# Define the Betas to be estimated
beta_cost = Beta('beta_cost', 0, None, None, 0)
beta_time = Beta('beta_time', 0, None, None, 0)

# Define the utility functions per alternative
v_walking = dur_walking * beta_time
v_cycling = asc_cycling + dur_cycling * beta_time
v_pt = asc_pt + dur_pt_tot * beta_time + cost_transit * beta_cost
v_drive = asc_driving + dur_driving * beta_time + cost_drive * beta_cost

# Define the association between alternatives and utility functions
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}
logprob = models.loglogit(V, None, travel_mode)

# Initialisation of the Biogeme object
model_0 = bio.BIOGEME(database, logprob)
model_0.model_name = 'model_0'

#### **2 - Estimating the model's results**

In [82]:
results_model_0 = model_0.estimate()
general_statistics_model_0 = results_model_0.print_general_statistics()
print(general_statistics_model_0)

Number of estimated parameters             5
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4552.633
Final log likelihood                       -4552.633
Likelihood ratio test for the init. model  -0
Rho-square for the init. model             0
Rho-square-bar for the init. model         -0.0011
Akaike Information Criterion               9115.265
Bayesian Information Criterion             9147.851
Final gradient norm                        1.8474E-05
Bootstrapping time                         None


In [83]:
general_statistics_model_0

'Number of estimated parameters             5\nSample size                                5000\nExcluded observations                      0\nInit log likelihood                        -4552.633\nFinal log likelihood                       -4552.633\nLikelihood ratio test for the init. model  -0\nRho-square for the init. model             0\nRho-square-bar for the init. model         -0.0011\nAkaike Information Criterion               9115.265\nBayesian Information Criterion             9147.851\nFinal gradient norm                        1.8474E-05\nBootstrapping time                         None'

In [84]:
print(len("                       "))

23


In [85]:
get_pandas_estimated_parameters(estimation_results=results_model_0)

Unnamed: 0,Name,Value,Robust std err.,Robust t-stat.,Robust p-value
0,beta_time,-4.374888,0.164549,-26.587163,0.0
1,asc_cycling,-3.351221,0.099973,-33.521236,0.0
2,asc_pt,-0.641355,0.058471,-10.968706,0.0
3,beta_cost,-0.137783,0.013934,-9.888477,0.0
4,asc_driving,-0.735068,0.067722,-10.854174,0.0


#### **3 - Results analysis**
##### a - Overall Model Fit
- **Log-likelihood**: The initial log-likelihood is **-6931.472** (the equal-probability model, 5000 × ln(1/4)), and the final log-likelihood is **-4552.633**. The improvement shows that the estimated model fits the data better. These figures correspond to an estimation started from zero-valued parameters; the output printed above reports an initial value equal to the final one, which indicates that this run started from parameter values already at the optimum.
- **Likelihood Ratio Test**: The value is **4757.678**, which is high, indicating a significant improvement in model fit compared to the equal-probability model.
- **Rho-square (Pseudo R²)**: The **Rho-square (0.343)** and the **Rho-square-bar (0.342)** measure the relative gain in log-likelihood over the equal-probability model. They are not a share of explained variance, but a value around 0.34 indicates a clear improvement over that reference.
- **AIC and BIC**: The **Akaike Information Criterion (9115.265)** and **Bayesian Information Criterion (9147.851)** are provided for model comparison. Lower values indicate better fit, but without comparison to other models, their absolute values are less interpretable.
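
The figures quoted in this list can be reproduced from the final log-likelihood alone. The sketch below assumes that all four alternatives are available to every respondent (consistent with the `None` availability passed to `models.loglogit`), so that the reference model assigns probability 1/4 to each mode; `fit_statistics` is a helper name introduced here for illustration.

```python
import math

def fit_statistics(final_ll, n_obs, n_alternatives, n_params):
    """Goodness-of-fit measures relative to the equal-probability model."""
    null_ll = n_obs * math.log(1.0 / n_alternatives)
    return {
        "null_ll": null_ll,
        "lr": -2.0 * (null_ll - final_ll),
        "rho2": 1.0 - final_ll / null_ll,
        "rho2_bar": 1.0 - (final_ll - n_params) / null_ll,
    }

# Values reported for Model 0 in this notebook
stats_model_0 = fit_statistics(final_ll=-4552.633, n_obs=5000,
                               n_alternatives=4, n_params=5)
for name, value in stats_model_0.items():
    print(f"{name}: {value:.3f}")
```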

##### b - Interpretation of Parameter Signs
- **beta_time (-4.37)**: The negative sign indicates that as the time variable increases, the utility decreases. This is intuitive: people generally prefer options that take less time.
- **asc_cycling (-3.35)**: The negative sign for the alternative-specific constant (ASC) for cycling suggests that, all else being equal, individuals have a lower inherent preference for cycling compared to the base alternative, walking.
- **asc_pt (-0.64)**: The negative ASC for public transport (PT) indicates a lower inherent preference for public transport compared to walking.
- **beta_cost (-0.14)**: The negative sign for cost means that as the cost increases, the utility decreases. This is expected, as people prefer cheaper options.
- **asc_driving (-0.74)**: The negative ASC for driving suggests a lower inherent preference for driving compared to walking.

##### c - Statistical Significance
- **Robust p-values**: All parameters have **p-values of 0.0**, which means they are **statistically significant** at any conventional level.
- **Robust t-statistics**: All t-statistics are far from zero (absolute values much greater than 2), further confirming the statistical significance of each parameter.

### **<u>III/ Model 1</u>**

**Description:**
- This model turns the parameter of time from a generic one into an alternative specific one.

#### **0 - Creation of the model**

In [86]:
# Define alternative-specific time Betas
beta_time_walk = Beta('beta_time_walk', 0, None, None, 0)
beta_time_cycling = Beta('beta_time_cycling', 0, None, None, 0)
beta_time_pt = Beta('beta_time_pt', 0, None, None, 0)
beta_time_drive = Beta('beta_time_drive', 0, None, None, 0)

# Utilities with alternative-specific time parameters
v_walking = dur_walking * beta_time_walk
v_cycling = asc_cycling + dur_cycling * beta_time_cycling
v_pt = asc_pt + dur_pt_tot * beta_time_pt + cost_transit * beta_cost
v_drive = asc_driving + dur_driving * beta_time_drive + cost_drive * beta_cost

# Association and estimation
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}
logprob = models.loglogit(V, None, travel_mode)

model_1 = bio.BIOGEME(database, logprob)
model_1.model_name = 'model_1'


#### **1 - Underlying assumption for alternative-specific time parameters**

Defining separate time parameters for each alternative implies that travellers value travel time differently depending on the mode.
For instance, a minute spent walking may be perceived as more onerous than a minute spent on public transport.
This specification lets the marginal disutility (or sensitivity) to travel time vary by mode, capturing mode-specific perceptions of time.

#### **2 - Estimating the model's results**

##### **a/ Coding the estimation**

In [87]:
# Estimate and display results
results_model_1 = model_1.estimate()
general_statistics_model_1 = results_model_1.print_general_statistics()
print(general_statistics_model_1)
parameters_model_1 = get_pandas_estimated_parameters(estimation_results=results_model_1)
print(parameters_model_1)

Number of estimated parameters             8
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4338.658
Final log likelihood                       -4338.658
Likelihood ratio test for the init. model  -0
Rho-square for the init. model             0
Rho-square-bar for the init. model         -0.00184
Akaike Information Criterion               8693.316
Bayesian Information Criterion             8745.453
Final gradient norm                        6.5342E-05
Bootstrapping time                         None
                Name     Value  Robust std err.  Robust t-stat.  \
0     beta_time_walk -7.595577         0.409546      -18.546347   
1        asc_cycling -4.370240         0.196393      -22.252548   
2  beta_time_cycling -4.354759         0.457139       -9.526126   
3             asc_pt -2.486912         0.141213      -17.611026   
4       beta_time_pt -1.981675         0.171941      -11.525341   
5    

##### **b/ Results interpretation**

- The `general statistics` show a modest but real improvement over Model 0: the final log-likelihood rises from -4552.633 to -4338.658 with three additional parameters.
- Compared to the generic `beta_time` of Model 0 (~ -4.375), `beta_time_walk` is almost twice as large in magnitude (~ -7.596). Walking therefore loses attractiveness faster as travel time increases than the generic parameter implied.
- `beta_time_cycling` (~ -4.355) is very close to the generic value, so the interpretation given for Model 0 still holds for cycling.
- `beta_time_drive` is smaller in magnitude than the generic value, meaning that an increase in travel time penalises driving less than the generic parameter implied.
- `beta_time_pt` (~ -1.982) is smaller in magnitude still, so public transport is the mode least penalised by an increase in travel time.
- `asc_pt` and `asc_driving` become several times more negative than in Model 0 (`asc_pt` goes from ~ -0.641 to ~ -2.487), which increases the baseline disutility of these modes relative to walking.
- `asc_cycling` changes less in relative terms (Model 0: ~ -3.351, Model 1: ~ -4.370). The same goes for `beta_cost` (Model 0: ~ -0.138, Model 1: ~ -0.155).
- All parameters estimated in Model 1 are `statistically significant`: their **Robust t-stats** are below -10 (except `beta_time_cycling`, at about -9.5) and their **Robust p-values** are displayed as 0.0.

#### **3 - Comparing `Model 0` and `Model 1`**

##### **a/ Choice of statistical test**

To compare **Model 0** and **Model 1**, we need to test whether the added parameters in Model 1 (alternative-specific ones) significantly improve the fit of the model compared to Model 0.
Both models are **nested** — Model 0 is a restricted version of Model 1 (obtained by constraining some parameters to be equal across alternatives).

The appropriate test is therefore a **Likelihood Ratio Test (LRT)**.

**Test statistic:**
$$
LR = -2 \times (LL_0 - LL_1)
$$
where

* $LL_0$ = log-likelihood of Model 0,
* $LL_1$ = log-likelihood of Model 1.

The statistic follows a **χ² (chi-square) distribution** with degrees of freedom equal to the difference in the number of parameters between the two models ($df = k_1 - k_0$).

**Null hypothesis (H₀):**
The alternative-specific time parameters of Model 1 are all equal, so Model 1 reduces to Model 0 and brings no statistically significant improvement in fit.
$$
H_0: \beta_{\text{time,walk}} = \beta_{\text{time,cycling}} = \beta_{\text{time,pt}} = \beta_{\text{time,drive}}
$$

**Alternative hypothesis (H₁):**
At least one of the alternative-specific time parameters differs from the others.

**Expected result:**
Since Model 1 introduces alternative-specific parameters, we expect a better fit — that is, a higher log-likelihood and a significant LR statistic (p < 0.05).
Hence, Model 1 is expected to be preferred.

In [88]:
# --- Likelihood Ratio Test between Model 0 and Model 1 ---

import scipy.stats as stats
from functions.find_word_in_str import find_word_in_str

# Retrieve general statistics from both estimated models
LL0_start_index = find_word_in_str(total_str=general_statistics_model_0, key='Final log likelihood')[0][1] + 23
LL1_start_index = find_word_in_str(total_str=general_statistics_model_1, key='Final log likelihood')[0][1] + 23
LL0_end_index = LL0_start_index+8
LL1_end_index = LL1_start_index+8
LL0 = float(general_statistics_model_0[LL0_start_index:LL0_end_index])
LL1 = float(general_statistics_model_1[LL1_start_index:LL1_end_index])

# Retrieve number of estimated parameters for each model
k0_index = find_word_in_str(total_str=general_statistics_model_0, key='Number of estimated parameters')[0][1] + 13
k1_index = find_word_in_str(total_str=general_statistics_model_1, key='Number of estimated parameters')[0][1] + 13
k0 = int(general_statistics_model_0[k0_index])
k1 = int(general_statistics_model_1[k1_index])

# Compute the Likelihood Ratio (LR) statistic
LR = -2 * (LL0 - LL1)

# Degrees of freedom (difference in number of parameters)
dif = k1 - k0

# Compute the p-value using the chi-square distribution
p_value = stats.chi2.sf(LR, dif)  # survival function: more accurate than 1 - cdf for tiny p-values

# Print results
print(f"--- Likelihood Ratio Test ---")
print(f"Log-likelihood (Model 0): {LL0:.3f}")
print(f"Log-likelihood (Model 1): {LL1:.3f}")
print(f"LR statistic: {LR:.3f}")
print(f"Degrees of freedom: {dif}")
print(f"p-value: {p_value:.4f}")

# Decision rule at 5% significance level
if p_value < 0.05:
    print("Reject H0 → Model 1 significantly improves the fit. Model 1 is preferred.")
    model_pref = model_1
else:
    print("Fail to reject H0 → Model 1 does not significantly improve the fit. Keep Model 0.")
    model_pref = model_0


|Final log likelihood| was found 1 time(s) in the string.
|Final log likelihood| was found 1 time(s) in the string.
|Number of estimated parameters| was found 1 time(s) in the string.
|Number of estimated parameters| was found 1 time(s) in the string.
--- Likelihood Ratio Test ---
Log-likelihood (Model 0): -4552.630
Log-likelihood (Model 1): -4338.650
LR statistic: 427.960
Degrees of freedom: 3
p-value: 0.0000
Reject H0 → Model 1 significantly improves the fit. Model 1 is preferred.
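
The fixed-offset slicing used in the cell above keeps only eight characters of each log-likelihood (hence -4552.630 instead of -4552.633) and reads a single character for the number of parameters, which would fail for a model with 10 or more parameters. A possible alternative, sketched here under the assumption that the report keeps the `label   value` layout shown earlier, is to extract the numbers with a regular expression. `parse_statistic` is a hypothetical helper, not part of Biogeme.

```python
import re

def parse_statistic(report, label):
    """Return the number printed after `label` in a statistics report string."""
    pattern = re.escape(label) + r"\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    match = re.search(pattern, report)
    if match is None:
        raise ValueError(f"{label!r} not found in report")
    return float(match.group(1))

# Excerpts of the two reports printed earlier in this notebook
report_0 = ("Number of estimated parameters             5\n"
            "Final log likelihood                       -4552.633\n")
report_1 = ("Number of estimated parameters             8\n"
            "Final log likelihood                       -4338.658\n")

ll0 = parse_statistic(report_0, "Final log likelihood")
ll1 = parse_statistic(report_1, "Final log likelihood")
k0 = int(parse_statistic(report_0, "Number of estimated parameters"))
k1 = int(parse_statistic(report_1, "Number of estimated parameters"))

lr = -2.0 * (ll0 - ll1)
df = k1 - k0
CHI2_CRIT_5PCT_DF3 = 7.815  # tabulated 95% quantile of the chi-square with 3 df
print(f"LR = {lr:.3f}, df = {df}, reject H0 at 5%: {lr > CHI2_CRIT_5PCT_DF3}")
```

With the untruncated log-likelihoods the statistic is 427.950 rather than 427.960; the conclusion is unchanged.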


##### **c/ Results interpretation**

There are two possible outcomes:

<u>Case 1 — Model 1 is preferred (Reject H₀):</u>

* The **LR statistic** is large enough (p < 0.05), meaning that the likelihood improvement due to the additional parameters is statistically significant.
* This indicates that allowing **alternative-specific parameters** captures real differences in travelers’ sensitivities across modes.
* **Model 1 becomes Model_pref**, i.e. the preferred model for the next steps.

<u>Case 2 — Model 1 is *not* preferred (Fail to reject H₀):</u>

* The **LR statistic** is small (p ≥ 0.05), meaning that the improvement in fit is not statistically significant.
* In that case, the more complex Model 1 does not justify its additional parameters.
* **Model 0 remains Model_pref**.

<u>Quantifying the degree of preference</u>

The **magnitude of the LR statistic** (and its corresponding **p-value**) tells you *how strongly* Model 1 is preferred:

* The 5% critical values of the χ² distribution are about 3.84 (df = 1), 5.99 (df = 2) and 7.81 (df = 3). Here df = 3 and the LR statistic is about 428, far above the threshold, so the evidence in favour of Model 1 is very strong.
* A **small LR value** (close to 0) would mean almost no gain in explanatory power.

You can also look at **information criteria** (AIC, BIC) for a secondary check: Lower AIC/BIC values indicate a preferred model while penalizing for complexity.
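
As a short illustration of this secondary check, the sketch below recomputes both criteria from the final log-likelihoods and parameter counts reported above, using the usual definitions AIC = 2k - 2LL and BIC = k ln(N) - 2LL. The function names are introduced here for the example only.

```python
import math

def aic(final_ll, n_params):
    """Akaike Information Criterion."""
    return 2 * n_params - 2 * final_ll

def bic(final_ll, n_params, n_obs):
    """Bayesian Information Criterion."""
    return n_params * math.log(n_obs) - 2 * final_ll

# (final log-likelihood, number of parameters) as reported in this notebook
models_summary = {"model_0": (-4552.633, 5), "model_1": (-4338.658, 8)}
N_OBS = 5000

for name, (ll, k) in models_summary.items():
    print(f"{name}: AIC = {aic(ll, k):.3f}, BIC = {bic(ll, k, N_OBS):.3f}")
```

Both criteria are lower for Model 1, which agrees with the likelihood ratio test.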

### **<u>IV/ Model 2</u>**

**Description:**
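- This model extends Model 1 with two additional variables: `female`, which enters the utilities of walking, cycling and public transport with a generic parameter (driving is the reference), and `bus_scale`, which enters the utility of public transport.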

#### **1 - Creation of the model**

In [89]:
# add the additional variables
bus_scale = Variable('bus_scale')
female = Variable('female')

# Define the ASC to be estimated
asc_pt = Beta('asc_pt', 0, None, None, 0)
asc_cycling = Beta('asc_cycling', 0, None, None, 0)
asc_driving = Beta('asc_driving', 0, None, None, 0)

# Define the Betas to be estimated
beta_cost = Beta('beta_cost', 0, None, None, 0)
beta_time_walk = Beta('beta_time_walk', 0, None, None, 0)
beta_time_cycling = Beta('beta_time_cycling', 0, None, None, 0)
beta_time_pt = Beta('beta_time_pt', 0, None, None, 0)
beta_time_drive = Beta('beta_time_drive', 0, None, None, 0)
beta_female = Beta('beta_female', 0, None, None, 0)
beta_bus_scale = Beta('beta_bus_scale', 0, None, None, 0)

# Define the utility functions
v_walking = dur_walking * beta_time_walk + beta_female * female
v_cycling = asc_cycling + dur_cycling * beta_time_cycling + beta_female * female
v_pt = asc_pt + dur_pt_tot * beta_time_pt + cost_transit * beta_cost + beta_bus_scale * bus_scale + beta_female * female
v_drive = asc_driving + dur_driving * beta_time_drive + cost_drive * beta_cost

# Define the association between alternatives and utility functions
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}
logprob2 = models.loglogit(V, None, travel_mode)

# Initialisation of the Biogeme object
model_2 = bio.BIOGEME(database, logprob2)
model_2.model_name = 'model_2'

# Results
results_model_2 = model_2.estimate()
print(results_model_2.print_general_statistics())
parameters_model_2 = get_pandas_estimated_parameters(estimation_results=results_model_2)
print(parameters_model_2)

Number of estimated parameters             10
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4337.219
Final log likelihood                       -4337.219
Likelihood ratio test for the init. model  1.818989e-12
Rho-square for the init. model             2.22e-16
Rho-square-bar for the init. model         -0.00231
Akaike Information Criterion               8694.438
Bayesian Information Criterion             8759.61
Final gradient norm                        1.0089E-03
Bootstrapping time                         None
                Name     Value  Robust std err.  Robust t-stat.  \
0     beta_time_walk -7.604628         0.409880      -18.553312   
1        beta_female  0.077159         0.062418        1.236173   
2        asc_cycling -4.372047         0.196648      -22.232825   
3  beta_time_cycling -4.374083         0.458487       -9.540250   
4             asc_pt -2.438817         0.147399      -1

### **<u>V/ Model 3</u>**

**Description:**
- This model includes an appropriate non-linear transformation of the `cost` variable.

#### **1 - Creation of the model**

##### Underlying assumption of the non-linear specification defined in this situation

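Beyond a certain price, an alternative is already too expensive for the traveller, so further increases matter less and less. A concave transformation of cost, such as a logarithm, captures this diminishing sensitivity: the higher the price already is, the smaller the effect of an additional unit of cost on utility.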
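
To make this assumption concrete, here is a minimal sketch of the Box-Cox transform for strictly positive values, $B(x, \lambda) = (x^\lambda - 1)/\lambda$, with $\log(x)$ as the limit when $\lambda \to 0$. It is a plain-Python illustration, not the Biogeme implementation, and it deliberately rejects zero values, which need separate treatment.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive x; log(x) when lam == 0."""
    if x <= 0:
        raise ValueError("x must be strictly positive")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Effect of one extra unit of cost at increasing cost levels
for lam in (1.0, 0.5, 0.0):
    gains = [box_cox(c + 1.0, lam) - box_cox(c, lam) for c in (1.0, 5.0, 10.0)]
    print(f"lambda = {lam}: {[round(g, 3) for g in gains]}")
```

With λ = 1 the transform is linear (each extra unit counts the same), while for λ < 1 the effect of an extra unit shrinks as the cost grows.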

In [90]:
data["cost_transit"].isna().sum()

np.int64(0)

In [91]:
# Define variables
travel_mode = Variable('travel_mode')
dur_walking = Variable('dur_walking')
dur_cycling = Variable('dur_cycling')
dur_pt_bus = Variable('dur_pt_bus')
dur_pt_access = Variable('dur_pt_access')
dur_pt_int = Variable('dur_pt_int')
dur_driving = Variable('dur_driving')
cost_transit = Variable('cost_transit')
cost_driving_fuel = Variable('cost_driving_fuel')
cost_driving_ccharge = Variable('cost_driving_ccharge')

# Create new variables
dur_pt_tot = dur_pt_bus + dur_pt_access + dur_pt_int
cost_drive = cost_driving_fuel + cost_driving_ccharge

# Define the ASC to be estimated
asc_pt = Beta('asc_pt', 0, None, None, 0)
asc_cycling = Beta('asc_cycling', 0, None, None, 0)
asc_driving = Beta('asc_driving', 0, None, None, 0)

# Define the Betas to be estimated
beta_cost = Beta('beta_cost', 0, None, None, 0)
beta_time_walk = Beta('beta_time_walk', 0, None, None, 0)
beta_time_cycling = Beta('beta_time_cycling', 0, None, None, 0)
beta_time_pt = Beta('beta_time_pt', 0, None, None, 0)
beta_time_drive = Beta('beta_time_drive', 0, None, None, 0)

# Box-Cox transformation of costs
# Starting value 1.01 is close to the linear case (lambda = 1). The log case is the limit lambda -> 0,
# which is undefined for the zero costs present in the data.
lambda_boxcox = Beta('lambda_boxcox', 1.01, -10, 10, 0)
boxcox_cost_transit = models.boxcox(cost_transit, lambda_boxcox)
boxcox_cost_drive = models.boxcox(cost_drive, lambda_boxcox)

# Define the utility functions
v_walking = dur_walking * beta_time_walk
v_cycling = asc_cycling + dur_cycling * beta_time_cycling
v_pt = asc_pt + dur_pt_tot * beta_time_pt + boxcox_cost_transit * beta_cost
v_drive = asc_driving + dur_driving * beta_time_drive + boxcox_cost_drive * beta_cost

# Define the association between alternatives and utility functions
V = {1: v_walking,
     2: v_cycling,
     3: v_pt,
     4: v_drive}

# Define the logit model
logprob = models.loglogit(V, None, travel_mode)

# Initialisation of the Biogeme object
model_3 = bio.BIOGEME(database, logprob)
model_3.model_name = 'model_3'

In [92]:
results_model_3 = model_3.estimate()
print(results_model_3.print_general_statistics())
parameters_model_3 = get_pandas_estimated_parameters(estimation_results=results_model_3)
print(parameters_model_3)

Number of estimated parameters             9
Sample size                                5000
Excluded observations                      0
Init log likelihood                        -4338.54
Final log likelihood                       -4338.54
Likelihood ratio test for the init. model  2.455882e-05
Rho-square for the init. model             2.83e-09
Rho-square-bar for the init. model         -0.00207
Akaike Information Criterion               8695.08
Bayesian Information Criterion             8753.734
Final gradient norm                        1.1499E-03
Bootstrapping time                         None
                Name     Value  Robust std err.  Robust t-stat.  \
0     beta_time_walk -7.695582         0.468093      -16.440299   
1        asc_cycling -4.399111         0.210551      -20.893369   
2  beta_time_cycling -4.486412         0.505014       -8.883743   
3             asc_pt -2.584011         0.144489      -17.883785   
4       beta_time_pt -2.046115         0.221129       -9.2

**Overall Model Fit**
- **Log-likelihood**: The initial and final log-likelihood are both **-4338.54**. Identical values indicate that this run started from parameter values already at the optimum, so the statistics reported "for the init. model" are not informative here.
- **Likelihood Ratio Test**: The value is essentially **0** (2.46e-05), for the same reason: it compares the final model with a starting point that was already optimal.
- **Rho-square**: The reported values are essentially **0** for the same reason. Relative to the equal-probability log-likelihood of -6931.472 quoted for Model 0, the rho-square is 1 - 4338.54/6931.472 ≈ 0.374.
- **AIC and BIC**: The values are **8695.08** and **8753.734**, respectively. Both are slightly higher than those of Model 1 (8693.316 and 8745.453), so these criteria do not favour the additional Box-Cox parameter.
- **Final Gradient Norm**: The value is **1.1499E-03**, which is close to zero, indicating convergence.


**Interpretation of Parameter Signs**
- **Time parameters** (among the values displayed above: `beta_time_walk` -7.70, `beta_time_cycling` -4.49, `beta_time_pt` -2.05): The negative signs indicate that as travel time increases, the utility decreases. This is intuitive and expected: people prefer options that take less time.
- **asc_cycling (-4.40)**: The negative alternative-specific constant (ASC) for cycling suggests a lower inherent preference for cycling compared to the base alternative.
- **asc_pt (-2.58)**: The negative ASC for public transport (PT) indicates a lower inherent preference for public transport compared to the base alternative.
- **lambda_boxcox (0.78)**: This is the estimated lambda for the Box-Cox transformation of the cost variable. A value between 0 and 1 suggests a transformation between a log and a linear relationship.
- **beta_cost (-0.19)**: The negative sign for the cost parameter means that as the cost increases, the utility decreases. This is expected, as people prefer cheaper options.
- **asc_driving (-0.91)**: The negative ASC for driving suggests a lower inherent preference for driving compared to the base alternative.


**Statistical Significance**
- **Robust p-values**:
  - All parameters except `lambda_boxcox` and `beta_cost` have **p-values of 0.0**, indicating they are **statistically significant**
  - `lambda_boxcox` has a p-value of **0.001**, which is still statistically significant.
  - `beta_cost` has a p-value of **0.003**, which is also statistically significant.
- **Robust t-statistics**:
  - All t-statistics are far from zero (absolute values much greater than 2), confirming the statistical significance of each parameter.

In [103]:
results_pref = results_model_1
parameters_pref = parameters_model_1

In [104]:
# Extract estimated betas from a results object as a dict (Name -> Value)
def get_betas_dict(results):
    params = get_pandas_estimated_parameters(estimation_results=results)
    return dict(zip(params['Name'], params['Value']))


def compute_chosen_probs(df, betas, boxcox=False, lambda_name='lambda_boxcox'):
    """
    df : DataFrame containing the raw data
    betas : dict containing the coefficient values (Name -> Value)
    boxcox : boolean indicating whether the Box-Cox transformation is applied
    lambda_name : name of the lambda parameter in the betas dictionary
    """
    # Extract the coefficients
    b_time_pt = betas['beta_time_pt']
    b_time_cycling = betas['beta_time_cycling']
    b_time_walk = betas['beta_time_walk']
    b_time_drive = betas['beta_time_drive']
    b_cost = betas['beta_cost']
    asc_cyc = betas['asc_cycling']
    asc_pt = betas['asc_pt']
    asc_drv = betas['asc_driving']
    lam = betas[lambda_name] if boxcox else None
    
    dur_pt_tot = df['dur_pt_bus'] + df['dur_pt_access'] + df['dur_pt_int']
    cost_drive = df['cost_driving_fuel'] + df['cost_driving_ccharge']

    # Cost transform
    if boxcox:
        eps = 1e-6
        x_pt = np.maximum(df['cost_transit'], eps)
        x_dr = np.maximum(cost_drive, eps)
        cost_pt = (x_pt**lam - 1)/lam if lam != 0 else np.log(x_pt)
        cost_dr = (x_dr**lam - 1)/lam if lam != 0 else np.log(x_dr)
    else:
        cost_pt = df['cost_transit']
        cost_dr = cost_drive

    # Utilities
    V = np.column_stack([
        b_time_walk * df['dur_walking'],
        asc_cyc + b_time_cycling * df['dur_cycling'],
        asc_pt + b_time_pt * dur_pt_tot + b_cost * cost_pt,
        asc_drv + b_time_drive * df['dur_driving'] + b_cost * cost_dr
    ])
    # Logit probabilities
    vmax = V.max(axis=1, keepdims=True)
    expV = np.exp(V - vmax)
    P = expV / expV.sum(axis=1, keepdims=True)

    # Probability of chosen alternative
    chosen = df['travel_mode'].astype(int).values - 1
    p_chosen = np.clip(P[np.arange(len(df)), chosen], 1e-300, 1.0)
    return p_chosen

In [105]:
# Compute chosen probabilities for both models
betas_pref = get_pandas_estimated_parameters(results_pref)
betas_3 = get_pandas_estimated_parameters(results_model_3)

betas_model_pref = dict(zip(betas_pref['Name'], betas_pref['Value']))
betas_model_3 = dict(zip(betas_3['Name'], betas_3['Value']))

p1 = compute_chosen_probs(data, betas_model_pref, boxcox=False)
p2 = compute_chosen_probs(data, betas_model_3, boxcox=True)

# Compute log probabilities and differences
logp1 = np.log(p1)
logp2 = np.log(p2)
r = logp1 - logp2

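# Test statistic: sqrt(N) * mean / std of the per-observation log-likelihood differences
# (this is the form of Vuong's statistic for comparing two models)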
N = len(r)
r_bar = np.mean(r)
s_r = np.std(r, ddof=1)
T = np.sqrt(N) * r_bar / s_r
p_value = 2 * (1 - norm.cdf(abs(T)))

print("=== Cox Test ===")
print(f"Mean diff in log-likelihoods per obs: {r_bar:.6f}")
print(f"Std dev: {s_r:.6f}")
print(f"T statistic: {T:.3f}")
print(f"p-value: {p_value:.5f}")

if p_value < 0.05:
    if r_bar > 0:
        print("→ Reject H0: Model Pref fits significantly better than Model 3.")
    else:
        print("→ Reject H0: Model 3 fits significantly better than Model Pref.")
else:
    print("→ Fail to reject H0: No significant difference in fit.")


=== Cox Test ===
Mean diff in log-likelihoods per obs: 0.005003
Std dev: 0.096713
T statistic: 3.658
p-value: 0.00025
→ Reject H0: Model Pref fits significantly better than Model 3.
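
Since the probabilities are recomputed by hand, it is worth checking that they reproduce the log-likelihoods reported by the estimation. The sketch below uses only figures printed in this notebook: the mean difference per observation times the sample size should equal the difference between the two final log-likelihoods. The tolerance of 1.0 is an arbitrary choice made for this illustration.

```python
def implied_ll_gap(n_obs, mean_log_diff):
    """Total log-likelihood gap implied by a mean per-observation difference."""
    return n_obs * mean_log_diff

# Figures reported earlier in this notebook
N_OBS = 5000
R_BAR = 0.005003       # mean of log p(Model pref) - log p(Model 3)
LL_PREF = -4338.658    # final log-likelihood of Model 1 (Model pref)
LL_MODEL_3 = -4338.54  # final log-likelihood of Model 3

gap_from_test = implied_ll_gap(N_OBS, R_BAR)
gap_from_estimation = LL_PREF - LL_MODEL_3

print(f"Gap implied by the test:        {gap_from_test:.3f}")
print(f"Gap from the estimation output: {gap_from_estimation:.3f}")
if abs(gap_from_test - gap_from_estimation) > 1.0:  # arbitrary tolerance
    print("The recomputed probabilities do not reproduce the estimated log-likelihoods.")
```

The two gaps disagree (about +25.0 against -0.118), so the test result above should be read with caution. One place to look is the Box-Cox branch of `compute_chosen_probs`, where zero costs are floored at `eps` before the transformation; this is a hypothesis to verify, not a confirmed diagnosis.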


### **<u>VI/ Market Share</u>**

**Question 1:** Report the size and weight of each stratum in your sample

Each value of Table 1 is the population size of a stratum. Dividing it by the total population (8673713) gives the proportion of that stratum in London. We then compute the proportion of each stratum in our sample; the weight of a stratum is its London proportion divided by its sample proportion.

We first count the number of observations of each stratum in our sample.

In [96]:
data["age_over_41"] = (data["age"] > 40).astype(int)  # 1 if age > 40, else 0
data_grouped = data.groupby(["age_over_41","female"]).size() # Group by age_over_41 and female

In [97]:
london_female_over_41 = 1765143
london_female_under_40 = 2599058
london_male_over_41 = 1633263
london_male_under_40 = 2676249
london_total = london_female_over_41 + london_female_under_40 + london_male_over_41 + london_male_under_40

data_female_over_41 = data_grouped[1,1]
data_female_under_40 = data_grouped[0,1]
data_male_over_41 = data_grouped[1,0]
data_male_under_40 = data_grouped[0,0]

In [98]:
prop_london_female_over_41 = london_female_over_41 / london_total
prop_london_female_under_40 = london_female_under_40 / london_total
prop_london_male_over_41 = london_male_over_41 / london_total
prop_london_male_under_40 = london_male_under_40 / london_total

In [99]:
prop_data_female_over_41 = data_female_over_41 / data_grouped.sum()
prop_data_female_under_40 = data_female_under_40 / data_grouped.sum()
prop_data_male_over_41 = data_male_over_41 / data_grouped.sum()
prop_data_male_under_40 = data_male_under_40 / data_grouped.sum()
print(prop_data_female_over_41, prop_data_female_under_40, prop_data_male_over_41, prop_data_male_under_40)

0.2404 0.2882 0.2186 0.2528


In [100]:
print("The size of each stratum in our database is:")
print("Female Over 41:", data_female_over_41)
print("Female Under 40:", data_female_under_40)
print("Male Over 41:", data_male_over_41)
print("Male Under 40:", data_male_under_40)
print("Total population in our database:", data_grouped.sum())

weight_female_over_41 = prop_london_female_over_41 / prop_data_female_over_41
weight_female_under_40 = prop_london_female_under_40 / prop_data_female_under_40
weight_male_over_41 = prop_london_male_over_41 / prop_data_male_over_41
weight_male_under_40 = prop_london_male_under_40 / prop_data_male_under_40
print("")
print("The weight of each stratum is:")
print("Weight Female Over 41:", weight_female_over_41)
print("Weight Female Under 40:", weight_female_under_40)
print("Weight Male Over 41:", weight_male_over_41)
print("Weight Male Under 40:", weight_male_under_40)

The size of each stratum in our database is:
Female Over 41: 1202
Female Under 40: 1441
Male Over 41: 1093
Male Under 40: 1264
Total population in our database: 5000

The weight of each stratum is:
Weight Female Over 41: 0.846526159950492
Weight Female Under 40: 1.0397213136760648
Weight Male Over 41: 0.8613921668262056
Weight Male Under 40: 1.2205185952462472


In [101]:
df_weight = pd.DataFrame({
    "stratum_name": ["Female Over 41", "Female Under 40", "Male Over 41", "Male Under 40"],
    'age_over_41': [1, 0, 1, 0],
    'female': [1, 1, 0, 0],
    "size": [data_female_over_41, data_female_under_40, data_male_over_41, data_male_under_40],
    'weight': [weight_female_over_41, weight_female_under_40, weight_male_over_41, weight_male_under_40]
})

In [102]:
df_weight

Unnamed: 0,stratum_name,age_over_41,female,size,weight
0,Female Over 41,1,1,1202,0.846526
1,Female Under 40,0,1,1441,1.039721
2,Male Over 41,1,0,1093,0.861392
3,Male Under 40,0,0,1264,1.220519
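
The same weights can be obtained more compactly. The sketch below uses the population counts of Table 1 and the sample counts reported above, and also checks that the weighted sample size equals the actual sample size. The dictionary layout and the `stratum_weights` helper are choices made for this illustration.

```python
# Population counts (Table 1) and sample counts reported above
population = {
    ("female", "over_41"): 1765143,
    ("female", "under_40"): 2599058,
    ("male", "over_41"): 1633263,
    ("male", "under_40"): 2676249,
}
sample = {
    ("female", "over_41"): 1202,
    ("female", "under_40"): 1441,
    ("male", "over_41"): 1093,
    ("male", "under_40"): 1264,
}

def stratum_weights(population, sample):
    """Weight = population share of the stratum / sample share of the stratum."""
    pop_total = sum(population.values())
    sample_total = sum(sample.values())
    return {
        s: (population[s] / pop_total) / (sample[s] / sample_total)
        for s in population
    }

weights = stratum_weights(population, sample)
for stratum, weight in weights.items():
    print(stratum, round(weight, 6))

# The weights rescale the sample without changing its total size
weighted_size = sum(weights[s] * sample[s] for s in sample)
print("Weighted sample size:", round(weighted_size, 6))
```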
