In [1]:
import warnings; warnings.filterwarnings('ignore')
%matplotlib inline

import pandas as pd
import numpy as np
import oxyba as ox
from importlib import reload; reload(ox);

### What does this utility function do?

Let `X` be a dataset with `N` observations (rows), e.g. `N=14` observations with 5 dimensions each.

In [2]:
np.random.seed(42)
X = np.random.normal(size=(14,5), scale=50).round(1)
N,_ = X.shape

Each observation has a row index, e.g.

In [3]:
rowindicies = list(range(0,N))
rowindicies

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

The function `block_idxmat_shuffle` randomly assigns these row indices to `K` equal-sized blocks (groups), e.g.

In [4]:
K=3

`block_idxmat_shuffle` returns a matrix with `K=3` columns, each containing `trunc(N/K)=4` (i.e. `int(N/K)`) randomly chosen row indices of the dataset `X`, together with an array of the row indices that were left over.

In [5]:
idxmat, dropped = ox.block_idxmat_shuffle(N,K)
idxmat

array([[13,  3,  5],
       [ 1,  7, 11],
       [ 8, 10,  6],
       [ 9,  4,  0]])
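
The behaviour described above can be approximated with plain NumPy. The following is only a minimal sketch under stated assumptions: `block_idxmat_sketch` and its `seed` parameter are hypothetical names introduced here for illustration, and this is not necessarily how `ox.block_idxmat_shuffle` is implemented. It shuffles the row indices, reshapes the first `K * (N // K)` of them into a matrix with `K` columns, and returns the rest as dropped indices.

```python
import numpy as np


def block_idxmat_sketch(N, K, seed=None):
    """Hypothetical stand-in for illustration, not oxyba's implementation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N)        # shuffled row indices 0..N-1
    n_per_block = N // K             # block size, i.e. trunc(N/K)
    n_used = n_per_block * K         # number of indices that fit into K blocks
    # Each column of the reshaped matrix is one block of row indices
    idxmat_sketch = perm[:n_used].reshape(n_per_block, K)
    dropped_sketch = perm[n_used:]   # the N % K leftover indices
    return idxmat_sketch, dropped_sketch


idxmat_sketch, dropped_sketch = block_idxmat_sketch(14, 3, seed=0)
print(idxmat_sketch.shape)   # (4, 3)
print(dropped_sketch.shape)  # (2,)
```

The index values themselves depend on the random generator and seed, so they will generally differ from the output shown above; only the shapes are determined by `N` and `K`.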

For example, block `0` has the row indices

In [6]:
block0_idx = idxmat[:,0]
block0_idx

array([13,  1,  8,  9])

that refer to the `X` observations

In [7]:
X[block0_idx,:]

array([[ 67.8,  -3.6,  50.2,  18.1, -32.3],
       [-11.7,  79. ,  38.4, -23.5,  27.1],
       [ 36.9,   8.6,  -5.8, -15.1, -73.9],
       [-36. , -23. ,  52.9,  17.2, -88.2]])

In our example, there are `N%K = 14%3 = 2` observations left over.
These remainders do not fit into the `K` blocks, which are required to be equal-sized, so their row indices are returned separately in `dropped`.

In [8]:
dropped

array([12,  2])
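
As a quick sanity check of the arithmetic, the blocks and the dropped indices together should cover every row index exactly once. The sketch below hard-codes the values printed by cells `In [5]` and `In [8]` so that it runs on its own; the variable names `idxmat_shown`, `dropped_shown` and `all_idx` are introduced here only for illustration.

```python
import numpy as np

# Values copied from the outputs of cells In [5] and In [8]
idxmat_shown = np.array([[13,  3,  5],
                         [ 1,  7, 11],
                         [ 8, 10,  6],
                         [ 9,  4,  0]])
dropped_shown = np.array([12, 2])
N, K = 14, 3

# Flatten the blocks and append the leftovers
all_idx = np.concatenate([idxmat_shown.ravel(), dropped_shown])

assert idxmat_shown.shape == (N // K, K)           # 4 rows, 3 blocks
assert len(dropped_shown) == N % K                 # 2 leftovers
assert sorted(all_idx.tolist()) == list(range(N))  # each index appears once
print("blocks + dropped cover every row index exactly once")
```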

Let's display all results, i.e. all `K` blocks with their `int(N/K)` randomly assigned observations `X[block_idx,:]`, followed by the dropped observations.

In [9]:
for b in range(K):
    print('\nBlock:',b)
    print(pd.DataFrame(X[idxmat[:,b],:], index=idxmat[:,b]))
    
print('\nDropped observations\n', X[dropped,:] )
print('\nRow indicies of dropped observations:', dropped, '\n')


Block: 0
       0     1     2     3     4
13  67.8  -3.6  50.2  18.1 -32.3
1  -11.7  79.0  38.4 -23.5  27.1
8   36.9   8.6  -5.8 -15.1 -73.9
9  -36.0 -23.0  52.9  17.2 -88.2

Block: 1
       0     1     2     3     4
3  -28.1 -50.6  15.7 -45.4 -70.6
7  -61.0  10.4 -98.0 -66.4   9.8
10  16.2 -19.3 -33.8  30.6  51.5
4   73.3 -11.3   3.4 -71.2 -27.2

Block: 2
       0     1     2     3     4
5    5.5 -57.5  18.8 -30.0 -14.6
11  46.6 -42.0 -15.5  16.6  48.8
6  -30.1  92.6  -0.7 -52.9  41.1
0   24.8  -6.9  32.4  76.2 -11.7

Dropped observations
 [[-24.   -9.3 -55.3 -59.8  40.6]
 [-23.2 -23.3  12.1 -95.7 -86.2]]

Row indicies of dropped observations: [12  2] 



And the original dataset `X` for comparison

In [10]:
print(pd.DataFrame(X))

       0     1     2     3     4
0   24.8  -6.9  32.4  76.2 -11.7
1  -11.7  79.0  38.4 -23.5  27.1
2  -23.2 -23.3  12.1 -95.7 -86.2
3  -28.1 -50.6  15.7 -45.4 -70.6
4   73.3 -11.3   3.4 -71.2 -27.2
5    5.5 -57.5  18.8 -30.0 -14.6
6  -30.1  92.6  -0.7 -52.9  41.1
7  -61.0  10.4 -98.0 -66.4   9.8
8   36.9   8.6  -5.8 -15.1 -73.9
9  -36.0 -23.0  52.9  17.2 -88.2
10  16.2 -19.3 -33.8  30.6  51.5
11  46.6 -42.0 -15.5  16.6  48.8
12 -24.0  -9.3 -55.3 -59.8  40.6
13  67.8  -3.6  50.2  18.1 -32.3
