## Price Optimization Soda Demonstration
This notebook demonstrates in fine-grained detail how the Opalytics Price Optimization application works. We demonstrate the functionality with toy-sized data and also with the Soda Promotion example data set.


-------
To begin, let's create a toy example data set.

In [1]:
from pandas import DataFrame
df = DataFrame({"Column 1": list(range(1, 9)), "Column 2": [x*2+1 for x in range(1,9)], 
                "Column 3": [100-x for x in range(8)]})
df

Unnamed: 0,Column 1,Column 2,Column 3
0,1,3,100
1,2,5,99
2,3,7,98
3,4,9,97
4,5,11,96
5,6,13,95
6,7,15,94
7,8,17,93


I'm not going to identify Dependent/Independent columns at this point, because we're just showing how the different subroutines work.

First, let's look at `train_test_split`, which is used for the "Single Trial" field on the Experiments report.

In [2]:
from sklearn import model_selection
split_one = model_selection.train_test_split(df, test_size = 0.25)
assert len(split_one) == 2
split_one[0]

Unnamed: 0,Column 1,Column 2,Column 3
2,3,7,98
0,1,3,100
7,8,17,93
4,5,11,96
3,4,9,97
6,7,15,94


In [3]:
split_one[1]

Unnamed: 0,Column 1,Column 2,Column 3
5,6,13,95
1,2,5,99


In [4]:
split_two = model_selection.train_test_split(df, test_size = 0.25)
assert len(split_two) == 2
split_two[0]

Unnamed: 0,Column 1,Column 2,Column 3
0,1,3,100
2,3,7,98
4,5,11,96
7,8,17,93
3,4,9,97
6,7,15,94


In [5]:
split_two[1]

Unnamed: 0,Column 1,Column 2,Column 3
1,2,5,99
5,6,13,95


Every time I call `train_test_split`, the rows are shuffled and a `test_size` proportion of them is randomly selected to be the testing rows. The remainder are the training rows. `train_test_split` then returns this segmentation of rows as a (training set, testing set) pair. Because the input here is a DataFrame, each element of the pair is a DataFrame too.
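
The two calls above produced different row orders because nothing fixed the random seed. As a minimal sketch (not part of the original walkthrough), passing scikit-learn's `random_state` argument makes the split repeatable. The seed value `0` and the names `train`, `test`, `train_again` and `test_again` are illustrative assumptions.

```python
from pandas import DataFrame
from sklearn import model_selection

# Rebuild the toy data set so this cell stands on its own.
df = DataFrame({"Column 1": list(range(1, 9)),
                "Column 2": [x * 2 + 1 for x in range(1, 9)],
                "Column 3": [100 - x for x in range(8)]})

# random_state=0 is an arbitrary seed chosen for illustration only.
train, test = model_selection.train_test_split(df, test_size=0.25, random_state=0)
train_again, test_again = model_selection.train_test_split(df, test_size=0.25, random_state=0)

# With 8 rows and test_size=0.25, 2 rows are held out for testing and 6 remain for training.
print(len(train), len(test))

# The same seed selects the same rows in the same order.
print(train.index.equals(train_again.index), test.index.equals(test_again.index))
```

Leaving `random_state` unset, as in the cells above, gives a fresh shuffle on every call.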

---- 
Now let's use the toy data set to examine how `KFold` works.

In [6]:
kf = model_selection.KFold(n_splits=4)
splits = list(kf.split(df))
splits

[(array([2, 3, 4, 5, 6, 7]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5]), array([6, 7]))]

`KFold` does the same sort of thing, except that it produces all of the splits at once and returns row indices instead of the actual data rows. Here it creates 4 train-test splits. The first split uses the first two rows as the testing set, the second split uses the next two rows, and so forth.
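
Since `KFold` hands back positional indices rather than rows, the rows themselves have to be looked up afterwards. The following is a sketch of one way to do that with `DataFrame.iloc`. The list name `folds` and the loop variable names are assumptions made for illustration, not something the application is known to use.

```python
from pandas import DataFrame
from sklearn import model_selection

# Rebuild the toy data set so this cell stands on its own.
df = DataFrame({"Column 1": list(range(1, 9)),
                "Column 2": [x * 2 + 1 for x in range(1, 9)],
                "Column 3": [100 - x for x in range(8)]})

kf = model_selection.KFold(n_splits=4)

# Each entry of folds is a (training rows, testing rows) pair of DataFrames.
folds = []
for train_idx, test_idx in kf.split(df):
    # iloc selects by position, which is what KFold's index arrays contain.
    folds.append((df.iloc[train_idx], df.iloc[test_idx]))

# The testing rows of the first fold are the first two rows of df.
print(folds[0][1])
```

Each pair in `folds` then has the same shape as the result of a single `train_test_split` call.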

If you want, you can shuffle the rows first, but there is no need to. There is no rational reason to believe the set of splits below will yield a more accurate assessment than the set of splits above.

In [7]:
list(model_selection.KFold(n_splits=4, shuffle=True).split(df))

[(array([0, 2, 3, 4, 5, 6]), array([1, 7])),
 (array([1, 2, 3, 5, 6, 7]), array([0, 4])),
 (array([0, 1, 3, 4, 5, 7]), array([2, 6])),
 (array([0, 1, 2, 4, 6, 7]), array([3, 5]))]
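
One structural property is shared by the shuffled and unshuffled variants: every row lands in the testing set of exactly one split. The sketch below checks this by counting test-set appearances. The helper `count_test_appearances` is a hypothetical name, and `random_state=0` is an arbitrary seed used only to make the shuffled run repeatable.

```python
from collections import Counter

import numpy as np
from sklearn import model_selection

n_rows = 8  # same number of rows as the toy data set


def count_test_appearances(kfold, n):
    """Count how many times each row position shows up in a testing set."""
    counts = Counter()
    for _, test_idx in kfold.split(np.arange(n)):
        counts.update(int(i) for i in test_idx)
    return counts


plain = count_test_appearances(model_selection.KFold(n_splits=4), n_rows)
shuffled = count_test_appearances(
    model_selection.KFold(n_splits=4, shuffle=True, random_state=0), n_rows)

# Both variants test every row exactly once; only the grouping of rows differs.
print(plain == shuffled)
```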