## Price Optimization Soda Demonstration
This notebook demonstrates in fine-grained detail how the Opalytics Price Optimization application works. We demonstrate the functionality with toy-sized data and with the Soda Promotion example data set.


-------
To begin, let's create a toy example data set.

In [1]:
from pandas import DataFrame
df = DataFrame({"Column 1": list(range(1, 9)), "Column 2": [x*2+1 for x in range(1,9)], 
                "Column 3": [100-x for x in range(8)]})
df

Unnamed: 0,Column 1,Column 2,Column 3
0,1,3,100
1,2,5,99
2,3,7,98
3,4,9,97
4,5,11,96
5,6,13,95
6,7,15,94
7,8,17,93


I'm not going to identify Dependent/Independent columns at this point, because we're just showing how the different subroutines work.

First, let's look at `train_test_split`, which is used for the "Single Trial" field on the Experiments report.

In [2]:
from sklearn import model_selection
split_one = model_selection.train_test_split(df, test_size = 0.25)
assert len(split_one) == 2
split_one[0]

Unnamed: 0,Column 1,Column 2,Column 3
6,7,15,94
2,3,7,98
0,1,3,100
4,5,11,96
7,8,17,93
3,4,9,97


In [3]:
split_one[1]

Unnamed: 0,Column 1,Column 2,Column 3
1,2,5,99
5,6,13,95


In [4]:
split_two = model_selection.train_test_split(df, test_size = 0.25)
assert len(split_two) == 2
split_two[0]

Unnamed: 0,Column 1,Column 2,Column 3
3,4,9,97
0,1,3,100
1,2,5,99
4,5,11,96
6,7,15,94
5,6,13,95


In [5]:
split_two[1]

Unnamed: 0,Column 1,Column 2,Column 3
2,3,7,98
7,8,17,93


Every time I call `train_test_split`, the rows are shuffled and a `test_size` proportion of them is randomly selected as the testing rows. The remainder are the training rows. `train_test_split` then returns this segmentation of rows as a (training set, testing set) pair of data frames.
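
Because the shuffle is random, the two calls above produce different splits. As a minimal sketch, assuming you want a repeatable split, you can pass the `random_state` argument of `train_test_split`. The seed value `0` and the names `train_a`, `test_a`, `train_b`, `test_b` are arbitrary choices for illustration, not something the application is known to use.

```python
from pandas import DataFrame
from sklearn import model_selection

# Same toy data set as above
df = DataFrame({"Column 1": list(range(1, 9)),
                "Column 2": [x * 2 + 1 for x in range(1, 9)],
                "Column 3": [100 - x for x in range(8)]})

# A fixed random_state pins down the shuffle, so repeated calls agree
train_a, test_a = model_selection.train_test_split(df, test_size=0.25, random_state=0)
train_b, test_b = model_selection.train_test_split(df, test_size=0.25, random_state=0)

# With 8 rows and test_size=0.25, 6 rows go to training and 2 to testing
print(len(train_a), len(test_a))
print(train_a.equals(train_b), test_a.equals(test_b))
```

Every row lands in exactly one of the two pieces, so the training and testing indices together cover the whole data frame.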

---- 
Now let's use the toy data set to examine how `KFold` works.

In [6]:
kf = model_selection.KFold(n_splits=4)
splits = list(kf.split(df))
splits

[(array([2, 3, 4, 5, 6, 7]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5]), array([6, 7]))]

`KFold` does the same sort of thing, except that it creates all of its splits at once and returns row indices instead of actual data rows. Here it creates 4 train-test splits. The first split uses the first two rows as the testing set, the second split uses the next two rows, and so forth.

If you want, you can shuffle the rows, but there is no need to. There is no rational reason to believe the set of splits below will yield a more accurate assessment than the set of splits above.
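
The index arrays returned by `KFold` can be turned back into data rows with positional indexing. The sketch below assumes `DataFrame.iloc` is an acceptable way to do that; the `folds` list is a hypothetical helper used only to collect the results.

```python
from pandas import DataFrame
from sklearn import model_selection

# Same toy data set as above
df = DataFrame({"Column 1": list(range(1, 9)),
                "Column 2": [x * 2 + 1 for x in range(1, 9)],
                "Column 3": [100 - x for x in range(8)]})

kf = model_selection.KFold(n_splits=4)
folds = []
for train_idx, test_idx in kf.split(df):
    # KFold yields row positions, so iloc (positional lookup) recovers the rows
    train_rows = df.iloc[train_idx]
    test_rows = df.iloc[test_idx]
    folds.append((train_rows, test_rows))

# Each fold holds 6 training rows and 2 testing rows
for train_rows, test_rows in folds:
    print(len(train_rows), len(test_rows))
```

Across the four folds, each row appears in a testing set exactly once.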

In [7]:
list(model_selection.KFold(n_splits=4, shuffle=True).split(df))

[(array([1, 2, 3, 4, 5, 7]), array([0, 6])),
 (array([0, 1, 2, 4, 5, 6]), array([3, 7])),
 (array([0, 3, 4, 5, 6, 7]), array([1, 2])),
 (array([0, 1, 2, 3, 6, 7]), array([4, 5]))]

----
### Connect to Predictive Analytics Application

Now let's load some realistic historical data and recreate some of the results we'd see in the Predictive Analytics application.

In [8]:
import pandas
df_hist = pandas.read_excel("soda_sales_historical_data.xlsx")
df_hist[:5]

Unnamed: 0,Product,Sales,Cost Per Unit,Easter Included,Super Bowl Included,Christmas Included,Other Holiday,4 Wk Avg Temp,4 Wk Avg Humidity,Sales M-1 weeks,Sales M-2 weeks,Sales M-3 weeks,Sales M-4 Weeks,Sales M-5 weeks
0,11 Down,51.9,1.6625,No,No,Yes,No,80.69,69.19,17.0,22.4,13.5,14.5,28.0
1,Alpine Stream,55.8,2.2725,No,No,Yes,No,80.69,69.19,2.4,2.2,2.0,1.4,0.5
2,Bright,3385.6,1.3475,No,No,Yes,No,80.69,69.19,301.8,188.8,101.4,81.6,213.8
3,Crisp Clear,63.5,1.66,No,No,Yes,No,80.69,69.19,73.8,69.4,72.8,75.4,57.4
4,Popsi Kola,181.1,1.8725,No,No,Yes,No,80.69,69.19,23.1,22.6,22.1,19.9,23.2


In [9]:
df_hist.shape

(596, 14)

Converting categorical data to numeric is one of the "grunt tasks" automated by the Predictive Analytics app.

In [10]:
from pandas import DataFrame, get_dummies
categorical_columns = ['Product','Easter Included','Super Bowl Included', 
                       'Christmas Included', 'Other Holiday']
df_hist = get_dummies(df_hist, prefix={k:"dmy_%s"%k for k in categorical_columns},
                      columns = list(categorical_columns))
df_hist[:5]

Unnamed: 0,Sales,Cost Per Unit,4 Wk Avg Temp,4 Wk Avg Humidity,Sales M-1 weeks,Sales M-2 weeks,Sales M-3 weeks,Sales M-4 Weeks,Sales M-5 weeks,dmy_Product_11 Down,...,dmy_Product_Koala Kola,dmy_Product_Mr. Popper,dmy_Product_Popsi Kola,dmy_Easter Included_No,dmy_Easter Included_Yes,dmy_Super Bowl Included_No,dmy_Super Bowl Included_Yes,dmy_Christmas Included_No,dmy_Christmas Included_Yes,dmy_Other Holiday_No
0,51.9,1.6625,80.69,69.19,17.0,22.4,13.5,14.5,28.0,1,...,0,0,0,1,0,1,0,0,1,1
1,55.8,2.2725,80.69,69.19,2.4,2.2,2.0,1.4,0.5,0,...,0,0,0,1,0,1,0,0,1,1
2,3385.6,1.3475,80.69,69.19,301.8,188.8,101.4,81.6,213.8,0,...,0,0,0,1,0,1,0,0,1,1
3,63.5,1.66,80.69,69.19,73.8,69.4,72.8,75.4,57.4,0,...,0,0,0,1,0,1,0,0,1,1
4,181.1,1.8725,80.69,69.19,23.1,22.6,22.1,19.9,23.2,0,...,0,0,1,1,0,1,0,0,1,1
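
To see what `get_dummies` does on a smaller scale, here is a sketch on a three-row frame. The `toy` data is made up for illustration and is not part of the soda data set. Passing `dtype=int` is an assumption on my part: it keeps the 0/1 output shown above, since recent pandas versions return booleans by default.

```python
from pandas import DataFrame, get_dummies

# Made-up rows, for illustration only
toy = DataFrame({"Product": ["Bright", "Popsi Kola", "Bright"],
                 "Other Holiday": ["No", "Yes", "No"],
                 "Sales": [10.0, 20.0, 30.0]})

cols = ["Product", "Other Holiday"]
# Each categorical column becomes one 0/1 column per distinct value
encoded = get_dummies(toy, prefix={k: "dmy_%s" % k for k in cols},
                      columns=cols, dtype=int)
print(list(encoded.columns))
```

The numeric `Sales` column passes through unchanged, and each row has exactly one 1 among the dummy columns generated from a given categorical column.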


Let's assume we're only doing one experiment, and that's with Ordinary Least Squares for the '`*all*`' slice.

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
train_data, test_data = model_selection.train_test_split(df_hist, test_size=0.25)

In [13]:
obj = LinearRegression()
obj.fit(y = train_data["Sales"], X = train_data.drop("Sales", axis=1))

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Here is your single trial result

In [14]:
single_trial = obj.score(y = test_data["Sales"], X = test_data.drop("Sales", axis=1))
single_trial

0.68132683960789886
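
The number returned by `LinearRegression.score` is the coefficient of determination, R², measured on the rows passed in. The sketch below checks that by hand. It uses synthetic data as a stand-in, because the soda spreadsheet is not assumed to be available; the coefficients, noise level, and seed are arbitrary.

```python
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the historical data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
pred = model.predict(X_test)
ss_res = ((y_test - pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(score, manual_r2), np.isclose(score, r2_score(y_test, pred)))
```

A score of 1.0 would mean the model predicts the held-out sales perfectly; a score of 0 means it does no better than always predicting the test-set mean.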