# Testing Torch Choice Package

This notebook tests the capabilities of the torch-choice package for Python. The first section works through the package vignette; after that, the package is tried on some new data.

## Vignette


Import the required packages

In [57]:
import pandas as pd
import numpy as np
import torch
from torch_choice.utils.easy_data_wrapper import EasyDatasetWrapper
from torch_choice.data import load_mode_canada_dataset
from torch_choice.model import ConditionalLogitModel
from torch_choice import run
from skimpy import skim

In [58]:
if torch.cuda.is_available():
    device = 'cuda'  # use GPU if available
else:
    device = 'cpu'  # use CPU otherwise

In [59]:
car_choice = pd.read_csv("https://raw.githubusercontent.com/gsbDBI/torch-choice/main/tutorials/public_datasets/car_choice.csv")

Display the data

In [60]:
display(car_choice)

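|      | record_id | session_id | consumer_id | car      | purchase | gender | income    | speed | discount | price |
|-----:|----------:|-----------:|------------:|:---------|---------:|-------:|----------:|------:|---------:|------:|
| 0    | 1         | 1          | 1           | American | 1        | 1      | 46.699997 | 10    | 0.94     | 90    |
| 1    | 1         | 1          | 1           | Japanese | 0        | 1      | 46.699997 | 8     | 0.94     | 110   |
| 2    | 1         | 1          | 1           | European | 0        | 1      | 46.699997 | 7     | 0.94     | 50    |
| 3    | 1         | 1          | 1           | Korean   | 0        | 1      | 46.699997 | 8     | 0.94     | 10    |
| 4    | 2         | 2          | 2           | American | 1        | 1      | 26.100000 | 10    | 0.95     | 100   |
| ...  | ...       | ...        | ...         | ...      | ...      | ...    | ...       | ...   | ...      | ...   |
| 3155 | 884       | 884        | 884         | Japanese | 1        | 1      | 20.900000 | 8     | 0.89     | 100   |
| 3156 | 884       | 884        | 884         | European | 0        | 1      | 20.900000 | 7     | 0.89     | 40    |
| 3157 | 885       | 885        | 885         | American | 1        | 1      | 30.600000 | 10    | 0.81     | 100   |
| 3158 | 885       | 885        | 885         | Japanese | 0        | 1      | 30.600000 | 8     | 0.81     | 50    |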


Quickly summarise the dataset

In [61]:
skim(car_choice)

Use the `EasyDatasetWrapper` class to create the choice dataset object

What does each argument mean?
- main_data: the source data frame, in long format (one row per alternative per choice set)
- purchase_record_column: identifier for each choice set (because the data is in long format, a single choice set spans multiple rows)
- choice_column: indicator for whether the alternative was selected (1) or not (0)
- item_name_column: label of the alternative (here, the origin of the car)
- user_index_column: identifier for a single user
- session_index_column: identifier for a single session (accommodates measurement over time?)
- user_observable_columns: variables that differ between users but are fixed within a user (here, gender and income)
- item_observable_columns: variables that differ between items but are the same in every session (here, speed)
- session_observable_columns: variables that differ between sessions but are the same for every item within a session (here, discount)
- itemsession_observable_columns: variables that differ across both items and sessions, i.e. the attributes that change from one choice set to the next (here, price)
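
To decide which list a column belongs in, it can help to check within which grouping the column is constant. The sketch below is plain Python rather than torch-choice code, and the rows are illustrative values loosely modelled on the preview above, not the real data. The helper names `constant_within` and `variation_level` are hypothetical.

```python
from collections import defaultdict

# Illustrative long-format rows (NOT the real car_choice data).
# Consumer 1 appears in two sessions here so that user-level and
# session-level variables can be told apart.
ROWS = [
    {"session_id": 1, "consumer_id": 1, "car": "American", "gender": 1, "speed": 10, "discount": 0.94, "price": 90},
    {"session_id": 1, "consumer_id": 1, "car": "Japanese", "gender": 1, "speed": 8, "discount": 0.94, "price": 110},
    {"session_id": 2, "consumer_id": 1, "car": "American", "gender": 1, "speed": 10, "discount": 0.95, "price": 100},
    {"session_id": 2, "consumer_id": 1, "car": "Japanese", "gender": 1, "speed": 8, "discount": 0.95, "price": 50},
    {"session_id": 3, "consumer_id": 2, "car": "American", "gender": 0, "speed": 10, "discount": 0.81, "price": 100},
    {"session_id": 3, "consumer_id": 2, "car": "Japanese", "gender": 0, "speed": 8, "discount": 0.81, "price": 50},
]


def constant_within(rows, column, key):
    """Return True if `column` takes exactly one value inside each `key` group."""
    values_by_group = defaultdict(set)
    for row in rows:
        values_by_group[row[key]].add(row[column])
    return all(len(values) == 1 for values in values_by_group.values())


def variation_level(rows, column):
    """Classify a column by the first grouping within which it is constant."""
    if constant_within(rows, column, "consumer_id"):
        return "user"
    if constant_within(rows, column, "car"):
        return "item"
    if constant_within(rows, column, "session_id"):
        return "session"
    return "itemsession"


levels = {col: variation_level(ROWS, col) for col in ["gender", "speed", "discount", "price"]}
print(levels)
```

In the real car data every consumer has exactly one session, so a check like this cannot separate user-level from session-level columns there; that distinction has to come from what the variable means.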

In [63]:
data_wrapper_from_columns = EasyDatasetWrapper(
    main_data = car_choice,
    purchase_record_column = 'record_id',
    choice_column = 'purchase',
    item_name_column = 'car',
    user_index_column = 'consumer_id',
    session_index_column = 'session_id',
    user_observable_columns = ['gender', 'income'],
    item_observable_columns = ['speed'],
    session_observable_columns = ['discount'],
    itemsession_observable_columns = ['price'],
    device = device)

data_wrapper_from_columns.summary()

Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.
* purchase record index range: [1 2 3] ... [883 884 885]
* Space of 4 items:
                   0         1         2       3
item name  American  European  Japanese  Korean
* Number of purchase records/cases: 885.
* Preview of main data frame:
      record_id  session_id  consumer_id       car  purchase  gender  \
0             1           1            1  American         1       1   
1             1           1            1  Japanese         0       1   
2             1           1            1  European         0       1   
3             1           1            1    Korean         0       1   
4             2           2            2  American         1       1   
...         ...         ...          ...       ...       ...     ...   
3155        884         884  



In [65]:
car_data = data_wrapper_from_columns.choice_dataset

In [67]:
car_data



ChoiceDataset(num_items=4, num_users=885, num_sessions=885, label=[], item_index=[885], user_index=[885], session_index=[885], item_availability=[885, 4], item_speed=[4, 1], user_gender=[885, 1], user_income=[885, 1], session_discount=[885, 1], itemsession_price=[885, 4, 1], device=cpu)

In [74]:
model = ConditionalLogitModel(
    formula='(itemsession_price|constant) + (intercept|item)',
    dataset=car_data,
    num_items=4)
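
As a rough illustration of what the formula `(itemsession_price|constant) + (intercept|item)` specifies: each alternative gets a utility made of an item-specific intercept plus one shared price coefficient, and the choice probabilities are a softmax over the alternatives available in the choice set. The sketch below is plain Python, not torch-choice code. The coefficient values are made up, and fixing the first item's intercept at 0 as the reference is an assumption (it is consistent with 3 trainable intercepts for 4 items).

```python
import math


def choice_probabilities(prices, intercepts, beta_price, available):
    """Softmax over the available alternatives for one choice set.

    Utility of alternative i is intercepts[i] + beta_price * prices[i].
    Unavailable alternatives are skipped entirely, so a missing (NaN) price
    for an unavailable alternative never enters the calculation.
    """
    utilities = [a + beta_price * p for a, p in zip(intercepts, prices)]
    # Subtract the largest available utility for numerical stability.
    max_u = max(u for u, ok in zip(utilities, available) if ok)
    exp_u = [math.exp(u - max_u) if ok else 0.0 for u, ok in zip(utilities, available)]
    total = sum(exp_u)
    return [e / total for e in exp_u]


# Prices of record 1 in item order (American, European, Japanese, Korean).
prices = [90.0, 50.0, 110.0, 10.0]
# Hypothetical coefficients; item 0 is the assumed reference with intercept 0.
intercepts = [0.0, 0.5, -0.2, 0.1]
beta_price = -0.01

probs = choice_probabilities(prices, intercepts, beta_price, [True, True, True, True])
print(probs)
```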

In [75]:
run(model, car_data, num_epochs=500, learning_rate=0.01, model_optimizer="LBFGS", batch_size=-1)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (itemsession_price[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=1, 1 trainable parameters in total, initialization=normal, device=cpu).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
  )
)
Conditional logistic discrete choice model, expects input features:

X[itemsession_price[constant]] with 1 parameters, with constant level variation.
X[intercept[item]] with 1 parameters, with item level variation.
device=cpu
[Train dataset] ChoiceDataset(num_items=4, num_users=885, num_sessions=885, label=[], item_index=[885], user_index=[885], session_index=[885], item_availability=[885, 4], item_speed=[4, 1], user_gender=[885, 1], user_income=[885, 1], session_discount=[885, 1], itemsession_price=[885, 4, 1], device=cpu)
[Validation dataset] None
[Test dataset] None


C:\Users\steph\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

  | Name  | Type                  | Params | Mode 
--------------------------------------------------------
0 | model | ConditionalLogitModel | 4      | train
--------------------------------------------------------
4         Trainable params
0         Non-trainable params
4         Total params
0.000     Total estimated model params size (MB)
C:\Users\steph\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader`

Epoch 499: 100%|██████████| 1/1 [00:00<00:00, 13.06it/s, v_num=7]

`Trainer.fit` stopped: `max_epochs=500` reached.


Epoch 499: 100%|██████████| 1/1 [00:00<00:00, 12.89it/s, v_num=7]
Time taken for training: 28.415333032608032
Skip testing, no test dataset is provided.
Log-likelihood: [Training] nan, [Validation] N/A, [Test] N/A

| Coefficient                   |   Estimation |   Std. Err. |   z-value |   Pr(>|z|) | Significance   |
|:------------------------------|-------------:|------------:|----------:|-----------:|:---------------|
| itemsession_price[constant]_0 |          nan |         nan |       nan |        nan |                |
| intercept[item]_0             |          nan |         nan |       nan |        nan |                |
| intercept[item]_1             |          nan |         nan |       nan |        nan |                |
| intercept[item]_2             |          nan |         nan |       nan |        nan |                |
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (itemsession_price[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=1, 1 trainable parameters in total, initialization=normal, device=cpu).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
  )
)
Conditional logistic discrete choice model, expects input features:

X[itemsession_price[constant]] with 1 parameters, with constant level variation.
X[intercept[item]] with 1 parameters, with item level variation.
device=cpu
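
The run above reports `nan` for the log-likelihood and for every estimate, and the cause is not established in this notebook. One hypothesis worth checking: 380 of the purchase records contain only three alternatives, so the price of the missing alternative may be stored as NaN once the data is arranged into the `[885, 4, 1]` price tensor. If that is the case, a single NaN can poison the loss, because multiplying NaN by an availability of 0 still gives NaN. The plain-Python sketch below only demonstrates that arithmetic fact with made-up values; it does not reproduce what torch-choice does internally.

```python
import math

# One choice set with four item slots; the fourth alternative is missing,
# so its price is NaN and its availability is 0 (illustrative values).
price = [90.0, 50.0, 110.0, float("nan")]
available = [1.0, 1.0, 1.0, 0.0]


def has_nan(values):
    """Return True if any entry is NaN."""
    return any(math.isnan(v) for v in values)


# Masking by multiplication keeps the NaN: nan * 0.0 is nan.
masked_by_multiplication = [p * a for p, a in zip(price, available)]

# Masking by selection replaces the unavailable entry before any arithmetic.
masked_by_selection = [p if a else 0.0 for p, a in zip(price, available)]

print(has_nan(masked_by_multiplication), has_nan(masked_by_selection))
```

In the notebook itself, a check along the lines of `torch.isnan(car_data.itemsession_price).any()` would test the hypothesis, assuming the tensor is exposed as an attribute under that name (the `__dict__` output for the Mode Canada dataset further down suggests observables are stored this way).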

In [40]:
from torch_choice.model import ConditionalLogitModel
from torch_choice import run
dataset = load_mode_canada_dataset()

No `session_index` is provided, assume each choice instance is in its own session.


In [41]:
dataset.summary()

ChoiceDataset with 2779 sessions, 4 items, single users, 2779 purchase records (observations) .
The most frequent item is 2, it was chosen 1267 times; the least frequent item is 1 it was 10 times; on average, each item was purchased 694.75 times.
4 most frequent items are: 2(1267 times), 0(1039 times), 3(463 times), 1(10 times).
4 least frequent items are: 1(10 times), 3(463 times), 0(1039 times), 2(1267 times).
Attribute Summaries:
Observable Tensor 'itemsession_cost_freq_ovt' with shape torch.Size([2779, 4, 3])
Observable Tensor 'session_income' with shape torch.Size([2779, 1])
                 0
count  2779.000000
mean     54.519611
std      17.514179
min       5.000000
25%      45.000000
50%      55.000000
75%      70.000000
max      70.000000
Observable Tensor 'itemsession_ivt' with shape torch.Size([2779, 4, 1])
device=cpu




In [47]:
dataset.__dict__

{'label': None,
 'item_index': tensor([0, 0, 0,  ..., 2, 2, 2]),
 '_num_items': None,
 '_num_users': None,
 '_num_sessions': None,
 'user_index': None,
 'session_index': tensor([   0,    1,    2,  ..., 2776, 2777, 2778]),
 'item_availability': None,
 'itemsession_cost_freq_ovt': tensor([[[142.8000,   9.0000,  85.0000],
          [ 27.5200,   8.0000,  63.0000],
          [ 71.6300,   0.0000,   0.0000],
          [ 58.2500,   4.0000,  74.0000]],
 
         [[142.8000,   9.0000,  85.0000],
          [ 27.5200,   8.0000,  63.0000],
          [ 71.6300,   0.0000,   0.0000],
          [ 58.2500,   4.0000,  74.0000]],
 
         [[142.8000,   9.0000,  85.0000],
          [ 27.5200,   8.0000,  63.0000],
          [ 71.6300,   0.0000,   0.0000],
          [ 58.2500,   4.0000,  74.0000]],
 
         ...,
 
         [[155.3000,  16.0000, 135.0000],
          [ 27.9600,  24.0000, 130.0000],
          [ 64.9800,   0.0000,   0.0000],
          [ 58.1000,   3.0000, 135.0000]],
 
         [[155.3000, 
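
The parameter totals printed in the model summaries in this notebook (4 for the car model, 13 for the Mode Canada model fitted in the next cell) can be reproduced from the formula terms. The counting rule in the sketch below is inferred from those printed summaries: `constant` terms contribute one coefficient per feature, `item` terms one per feature for every item except a reference item, and `item-full` terms one per feature for every item. The function name `trainable_parameters` is hypothetical and is not part of torch-choice.

```python
def trainable_parameters(variation, num_params, num_items):
    """Number of trainable coefficients for one formula term."""
    if variation == "constant":
        return num_params
    if variation == "item":
        # One item serves as the reference and has no free coefficient.
        return (num_items - 1) * num_params
    if variation == "item-full":
        return num_items * num_params
    raise ValueError(f"unsupported variation: {variation}")


# (term, variation, number of features in the term)
car_terms = [
    ("itemsession_price", "constant", 1),
    ("intercept", "item", 1),
]
mode_canada_terms = [
    ("itemsession_cost_freq_ovt", "constant", 3),
    ("session_income", "item", 1),
    ("itemsession_ivt", "item-full", 1),
    ("intercept", "item", 1),
]

car_total = sum(trainable_parameters(v, k, num_items=4) for _, v, k in car_terms)
mode_canada_total = sum(trainable_parameters(v, k, num_items=4) for _, v, k in mode_canada_terms)
print(car_total, mode_canada_total)
```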

In [9]:
# define the conditional logit model for the Mode Canada transportation dataset loaded above
model = ConditionalLogitModel(
    formula='(itemsession_cost_freq_ovt|constant) + (session_income|item) + (itemsession_ivt|item-full) + (intercept|item)',
    dataset=dataset,
    num_items=4)
# fit the conditional logit model with L-BFGS; batch_size=-1 requests full-batch training
run(model, dataset, num_epochs=500, learning_rate=0.003, batch_size=-1, model_optimizer="LBFGS")

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


No `session_index` is provided, assume each choice instance is in its own session.
ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (itemsession_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, initialization=normal, device=cpu).
    (session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
    (itemsession_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, initialization=normal, device=cpu).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
  )
)
Conditional logistic discrete choice model, expects input features:

X[itemsession_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[sess

C:\Users\steph\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

  | Name  | Type                  | Params | Mode 
--------------------------------------------------------
0 | model | ConditionalLogitModel | 13     | train
--------------------------------------------------------
13        Trainable params
0         Non-trainable params
13        Total params
0.000     Total estimated model params size (MB)
C:\Users\steph\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader`

Epoch 0:   0%|          | 0/1 [00:00<?, ?it/s] 



Epoch 499: 100%|██████████| 1/1 [00:00<00:00, 22.39it/s, v_num=2]

`Trainer.fit` stopped: `max_epochs=500` reached.


Epoch 499: 100%|██████████| 1/1 [00:00<00:00, 21.24it/s, v_num=2]
Time taken for training: 47.50977802276611
Skip testing, no test dataset is provided.
Log-likelihood: [Training] -1874.3446044921875, [Validation] N/A, [Test] N/A

| Coefficient                           |   Estimation |   Std. Err. |    z-value |    Pr(>|z|) | Significance   |
|:--------------------------------------|-------------:|------------:|-----------:|------------:|:---------------|
| itemsession_cost_freq_ovt[constant]_0 |  -0.0335193  |  0.00709538 |  -4.72411  | 2.31125e-06 | ***            |
| itemsession_cost_freq_ovt[constant]_1 |   0.0925774  |  0.00509755 |  18.1612   | 0           | ***            |
| itemsession_cost_freq_ovt[constant]_2 |  -0.0429614  |  0.00322391 | -13.3259   | 0           | ***            |
| session_income[item]_0                |  -0.0885213  |  0.0183422  |  -4.82611  | 1.39227e-06 | ***            |
| session_income[item]_1                |  -0.0279529  |  0.00387153 |  -7.2201 



ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (itemsession_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, initialization=normal, device=cpu).
    (session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
    (itemsession_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, initialization=normal, device=cpu).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
  )
)
Conditional logistic discrete choice model, expects input features:

X[itemsession_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[session_income[item]] with 1 parameters, with item level variation.
X[itemsession_ivt[i