# Sampling benchmarks

Sam Maurer, August 2018 
- updated Sep 2018 to remove old code and add some MergedChoiceTable sampling benchmarks

In [1]:
import sys
print(sys.version)

3.6.2 |Anaconda custom (64-bit)| (default, Sep 21 2017, 18:29:43) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
import numpy as np
import pandas as pd
import random

In [3]:
import choicemodels


## Performance comparison

`random.choices`: with replacement, optional weights  
`random.sample`: without replacement, no weights  
`np.random.choice`: optional replacement, optional weights

For each one, draw 100 samples of 10 alternatives from a universe of 100,000.
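As a quick orientation before the timings, here is a minimal sketch of the three call signatures on a small stand-in universe. The universe size and the weights are arbitrary assumptions chosen for illustration, not the benchmark data.

```python
import random
import numpy as np

universe = list(range(20))            # small stand-in for the 100,000-item universe
weights = [i + 1 for i in universe]   # arbitrary positive weights (illustrative only)

# random.choices: always with replacement; weights are optional and need not sum to 1
with_repl = random.choices(universe, weights=weights, k=10)

# random.sample: always without replacement; this call takes no weights
no_repl = random.sample(universe, k=10)

# np.random.choice: replacement and weights are both optional; p must sum to 1
p = np.array(weights) / np.sum(weights)
np_no_repl = np.random.choice(universe, size=10, replace=False, p=p)

print(with_repl)
print(no_repl)
print(np_no_repl)
```

Only `np.random.choice` covers weighted sampling without replacement in a single call, which is why it is the only option timed for that case.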

In [39]:
n = int(1e5)
vals = np.random.rand(n)
weights = np.random.rand(n)
scaled_weights = weights/weights.sum(0)  # probs that sum to 1

In [29]:
%%timeit 3
    for i in range(100):
        random.choices(vals, k=10)

1000 loops, best of 3: 302 µs per loop


In [33]:
%%timeit 3
    for i in range(100):
        random.choices(vals, weights, k=10)

1 loop, best of 3: 727 ms per loop


In [32]:
%%timeit 3
    for i in range(100):
        random.sample(vals.tolist(), k=10)

10 loops, best of 3: 153 ms per loop


In [36]:
%%timeit 3
    for i in range(100):
        np.random.choice(vals, replace=True, size=10)

1000 loops, best of 3: 701 µs per loop


In [42]:
%%timeit 3
    for i in range(100):
        np.random.choice(vals, replace=False, size=10)

10 loops, best of 3: 136 ms per loop


In [43]:
%%timeit 3
    for i in range(100):
        np.random.choice(vals, replace=True, p=scaled_weights, size=10)

10 loops, best of 3: 70.2 ms per loop


In [44]:
%%timeit 3
    for i in range(100):
        np.random.choice(vals, replace=False, p=scaled_weights, size=10)

10 loops, best of 3: 78.8 ms per loop


Here are the winners for each case, with times scaled relative to the fastest one (approximate):

```
1 ms    replacement, core python  
200 ms  replacement with weights, numpy

400 ms  no replacement, numpy
240 ms  no replacement with weights, numpy
```
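The relative figures can be recomputed from the best timings reported in the cells above. The sketch below does that arithmetic; the dictionary keys are just labels, and the ratios come out at roughly 230, 450 and 260 times the baseline, in the same ballpark as the rounded summary.

```python
# Best measured times from the cells above, in milliseconds per 100 draws
best_ms = {
    'replacement, core python': 0.302,            # random.choices, no weights
    'replacement with weights, numpy': 70.2,      # np.random.choice, replace=True, p=...
    'no replacement, numpy': 136.0,               # np.random.choice, replace=False
    'no replacement with weights, numpy': 78.8,   # np.random.choice, replace=False, p=...
}

# Scale everything relative to the fastest case
baseline = best_ms['replacement, core python']
relative = {label: ms / baseline for label, ms in best_ms.items()}

for label, ratio in relative.items():
    print(f'{ratio:6.0f}x  {label}')
```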

In [45]:
# What's the real-world hit?

n = int(5e6)
vals = np.random.rand(n)
weights = np.random.rand(n)
scaled_weights = weights/weights.sum(0)  # probs that sum to 1

In [46]:
%%timeit 3
    for i in range(100):
        np.random.choice(vals, replace=False, p=scaled_weights, size=100)

1 loop, best of 3: 5.39 s per loop


So drawing 100k samples of 100 without replacement from a universe of 5 million, with weights, would take about 90 minutes on a fast iMac.
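The extrapolation is a straight linear scaling of the timing above, assuming the cost per sample stays constant as the number of samples grows:

```python
seconds_per_100_samples = 5.39   # measured above: 100 draws of size 100 from 5 million
n_samples = 100_000              # target number of samples

# Linear scaling: time grows in proportion to the number of samples drawn
total_seconds = seconds_per_100_samples * n_samples / 100
total_minutes = total_seconds / 60

print(f'{total_seconds:.0f} s, or about {total_minutes:.0f} minutes')
```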

## Integrating MCT with estimation

In [18]:
alts = pd.DataFrame(np.random.rand(10,2), columns=['b','c'])

In [19]:
print(len(alts))
alts.head(3)

10


|   | b | c |
|---|---|---|
| 0 | 0.543301 | 0.50747 |
| 1 | 0.109225 | 0.223447 |
| 2 | 0.387049 | 0.036127 |


In [20]:
n = 100
w = alts.c/alts.c.sum()

obs = pd.DataFrame({'a': np.random.rand(n),
                    'chosen': np.random.choice(range(len(alts)), n, p=w)})

In [21]:
print(len(obs))
obs.head(3)

100


|   | a | chosen |
|---|---|---|
| 0 | 0.138992 | 5 |
| 1 | 0.174695 | 7 |
| 2 | 0.135044 | 6 |


In [22]:
mct = choicemodels.tools.MergedChoiceTable(obs, alts, 'chosen', sample_size=5, replace=False)

print(len(mct.to_frame()))
mct.to_frame().reset_index().head()

500


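|   | obs_id | alt_id | a | b | c | chosen |
|---|---|---|---|---|---|---|
| 0 | 99 | 5 | 0.892908 | 0.975887 | 0.929253 | 1 |
| 1 | 99 | 4 | 0.892908 | 0.241345 | 0.980643 | 0 |
| 2 | 99 | 6 | 0.892908 | 0.34634 | 0.49887 | 0 |
| 3 | 99 | 9 | 0.892908 | 0.219413 | 0.777656 | 0 |
| 4 | 99 | 8 | 0.892908 | 0.562223 | 0.397979 | 0 |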
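The merged table is in long format: one row per observation-alternative pair, with the observation's and the alternative's attributes joined on and a 0/1 `chosen` flag. A rough pandas-only sketch of that idea follows. `build_long_table` is a hypothetical helper, not the choicemodels implementation; it assumes non-chosen alternatives are sampled uniformly without replacement and that `sample_size` counts the chosen alternative, which matches the 500 rows for 100 observations seen above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
alts = pd.DataFrame(np.random.rand(10, 2), columns=['b', 'c'])
obs = pd.DataFrame({'a': np.random.rand(100),
                    'chosen': np.random.choice(len(alts), 100)})

def build_long_table(obs, alts, choice_col, sample_size):
    """Hypothetical helper: chosen alternative plus sampled non-chosen ones."""
    rows = []
    for obs_id, chosen_alt in obs[choice_col].items():
        # Candidate pool excludes the alternative this observation chose
        others = alts.index[alts.index != chosen_alt]
        sampled = np.random.choice(others, size=sample_size - 1, replace=False)
        for alt_id in [chosen_alt] + list(sampled):
            rows.append((obs_id, alt_id, int(alt_id == chosen_alt)))
    long = pd.DataFrame(rows, columns=['obs_id', 'alt_id', 'chosen'])
    # Attach observation attributes, then alternative attributes
    long = long.merge(obs.drop(columns=choice_col), left_on='obs_id', right_index=True)
    long = long.merge(alts, left_on='alt_id', right_index=True)
    # Group rows by observation, chosen alternative first
    long = long.sort_values(['obs_id', 'chosen'], ascending=[True, False])
    return long.reset_index(drop=True)

long_df = build_long_table(obs, alts, 'chosen', sample_size=5)
print(len(long_df))
```

The Python-level loop keeps the sketch readable but would be slow for large tables, which is where the sampling costs measured earlier start to matter.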


In [23]:
m = choicemodels.MultinomialLogit(mct.to_frame(), 
                                  observation_id_col = mct.observation_id_col,
                                  choice_col = mct.choice_col,
                                  model_expression = 'a + b + c')

m.fit()

                  CHOICEMODELS ESTIMATION RESULTS                  
Dep. Var.:                chosen   No. Observations:            100
Model:         Multinomial Logit   Df Residuals:                 96
Method:       Maximum Likelihood   Df Model:                      4
Date:                 2018-08-20   Pseudo R-squ.:             0.124
Time:                      13:24   Pseudo R-bar-squ.:         0.099
AIC:                     290.005   Log-Likelihood:         -141.002
BIC:                     300.425   LL-Null:                -160.944
               coef   std err         z     P>|z|   Conf. Int.
--------------------------------------------------------------
Intercept   -0.0000     0.469    -0.000     1.000             
a            0.0000     0.418     0.000     1.000             
b            0.3333     0.404     0.824     0.410             
c            2.3596     0.456     5.176     0.000             

https://gist.github.com/smmaurer/c3b4f2f7c4d612a4520de119f9f497cf