# TruEra Python SDK Ingestion Demo: OJ Forecasting

## Pre-requisites: Download and Install TruEra Python Client
1. Download the Python wheel from the [Downloads](/downloads) page.
2. Install the wheel in your Python environment using `pip install truera-*.whl`.

## Pre-requisites: Quickstart Data
If you are not using your own model and data, download the quickstart data zip from the Downloads folder on your deployment.



In [1]:
import pandas as pd
import numpy as np

In [2]:
import pickle

In [3]:
# from pandas_profiling import ProfileReport

In [4]:
import random

In [5]:
import sklearn

from sklearn.model_selection import TimeSeriesSplit
#from sklearn.model_selection import GridSearchCV

#from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import RidgeCV

#from sklearn.pipeline import Pipeline

In [6]:
from sklearn.preprocessing import OneHotEncoder

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Connect to TruEra endpoint
 * Provide your TruEra deployment URI as the connection string.
 * Provide your username and password; the example below uses basic authentication.
 * Creating a `TrueraWorkspace` also verifies connectivity to the TruEra services.
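
Hard-coded credentials in a notebook are easy to leak. The following is a minimal sketch, assuming the connection details are supplied through environment variables. The variable names `TRUERA_URL`, `TRUERA_USER`, and `TRUERA_PASSWORD` and the helper `load_connection_settings` are hypothetical and not part of the TruEra SDK.

```python
import os

# Hypothetical environment variable names; adjust to your own conventions.
REQUIRED_KEYS = ("TRUERA_URL", "TRUERA_USER", "TRUERA_PASSWORD")

def load_connection_settings(env=None):
    """Return (url, username, password) read from environment variables."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[key] for key in REQUIRED_KEYS)
```

The returned values could then be passed to `BasicAuthentication` and `TrueraWorkspace` in place of string literals.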

In [8]:
!pip list | grep truera

truera                   8.0.0
truera-qii               0.40.1

In [9]:
from truera.client.truera_workspace import TrueraWorkspace

In [10]:
from truera.client.truera_authentication import TokenAuthentication
from truera.client.truera_authentication import BasicAuthentication

In [11]:
#connection_string = "<insert-your-url>"
connection_string = "http://se-demo-server.eastus.cloudapp.azure.com"

In [12]:
#auth = BasicAuthentication("<your username>","<your password>")
auth = BasicAuthentication("ailens", "ailens123")

In [13]:
tru = TrueraWorkspace(connection_string, auth)



# Create Project
A project is a collection of models and datasets solving a single problem statement.
Users can be provided access to collaborate on a project.

Here, we set a _remote_ environment to interact with the TruEra Server. _Local_ environments can also be used, to generate explainability analytics using your Domino Workspace compute resources. 

In [10]:
project_name = "Sales Forecasting OJ"

In [11]:
tru.set_environment("remote")

In [12]:
tru.add_project(project_name, score_type="regression")

In [13]:
tru.get_projects()

['Predictive Maintenance',
 'Marketing Models',
 'Healthcare',
 'Customer Lifetime Value',
 'Insurance Premium Default',
 'telecom_churn',
 'Insurance Underwriting',
 'Sales Forecasting',
 'Anomaly Detection v2',
 'Anomaly Detection v3 - Custom Classification Inference',
 'Customer Churn - Monitoring',
 'R-Diamonds_Classifier',
 'Household Income Demo',
 'California Housing Price Prediction',
 'WF_application1',
 'CLI Ingestion Demo',
 'Household Income Demo Test',
 'NY-Mortgages-Demo',
 'NY-Mortgages-Demo-Debug',
 'WF_application2 v2',
 'WF Application 2',
 'Industrial MFG - Predictive Maintenance',
 'Test_Anoosha',
 'Test_2',
 'AdultCensus_DemoNB',
 'retail_inventory_demo',
 'Predictive Maintenance - Industrial Manufacturing',
 'California Housing Price Prediction - Demo',
 'Sales Forecasting OJ']

# Adding a Data Collection
A data collection is a container for two related things:

* Data splits: A set of in-sample data (train, test, validate) or out-of-sample (OOS) / out-of-time (OOT) data to test model quality, stability and generalizability.
* Feature Metadata: An (optional) set of metadata defining the set of features for a set of splits and the various models trained and evaluated on them. This allows you to group features and provide feature descriptions for use throughout the tool.

Note that all splits associated with a data collection are assumed to follow the same set of features. As a general rule of thumb, if a model can read one split in a data collection it should be able to read all other splits in the data collection.
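
The rule of thumb above can be checked before anything is uploaded. This is a small sketch, not part of the TruEra SDK: the helper `check_same_features` is hypothetical and only compares the ordered column names of each split.

```python
def check_same_features(splits):
    """Return the shared column list, or raise if any split deviates.

    `splits` maps a split name to its ordered list of column names.
    """
    reference_name, reference_cols = next(iter(splits.items()))
    for name, cols in splits.items():
        if list(cols) != list(reference_cols):
            # Symmetric difference is empty when only the column order differs.
            diff = sorted(set(cols) ^ set(reference_cols))
            raise ValueError(f"{name!r} differs from {reference_name!r}: {diff}")
    return list(reference_cols)

columns = ["store", "brand", "feat", "price"]
shared = check_same_features({"train": columns, "test": list(columns)})
```

With pandas DataFrames, the column lists could be obtained with `list(df.columns)`.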

Reference: https://docs.microsoft.com/en-us/azure/open-datasets/dataset-oj-sales-simulated?tabs=azureml-opendatasets

In [14]:
tru.add_data_collection("OJ Data")

# Train a sample model
As an illustration, we train scikit-learn `RidgeCV` and `RandomForestRegressor` models on pre-processed data here.


## Data Prep

In [3]:
import pandas as pd 
import numpy as np 

In [4]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

In [5]:
data = pd.read_csv('oj.csv')

In [6]:
data.shape

(28947, 17)

In [7]:
data.describe()

Unnamed: 0,store,week,logmove,feat,price,AGE60,EDUC,ETHNIC,INCOME,HHLARGE,WORKWOM,HVAL150,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
count,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0,28947.0
mean,80.883511,100.459944,9.167864,0.237261,2.282488,0.17313,0.22522,0.155557,10.616735,0.115602,0.359178,0.343766,5.097274,1.207317,2.120359,0.438914
std,35.576511,34.692314,1.019378,0.425411,0.648001,0.061872,0.109945,0.187581,0.282314,0.030168,0.052673,0.239028,3.472386,0.526528,0.729828,0.219248
min,2.0,40.0,4.158883,0.0,0.52,0.058054,0.04955,0.024247,9.867083,0.013506,0.244463,0.002509,0.132097,0.4,0.77253,0.094562
25%,53.0,70.0,8.489616,0.0,1.79,0.1221,0.145985,0.04191,10.456079,0.097938,0.312636,0.123486,2.767046,0.727273,1.626192,0.271673
50%,86.0,101.0,9.03408,0.0,2.17,0.170655,0.22939,0.074656,10.635326,0.111221,0.355635,0.346154,4.650687,1.115385,1.963412,0.383227
75%,111.0,130.0,9.764685,0.0,2.73,0.213949,0.284395,0.187761,10.79696,0.135168,0.402313,0.528313,6.650602,1.538462,2.533672,0.56024
max,137.0,160.0,13.482016,1.0,3.87,0.307398,0.528362,0.995691,11.236197,0.216354,0.472308,0.916699,17.855951,2.571429,4.107902,1.143367


In [8]:
data.brand.describe()

count         28947
unique            3
top       tropicana
freq           9649
Name: brand, dtype: object

In [9]:
data.brand.unique()

array(['tropicana', 'minute.maid', 'dominicks'], dtype=object)

In [47]:
data.columns

Index(['store', 'brand', 'week', 'logmove', 'feat', 'price', 'AGE60', 'EDUC',
       'ETHNIC', 'INCOME', 'HHLARGE', 'WORKWOM', 'HVAL150', 'SSTRDIST',
       'SSTRVOL', 'CPDIST5', 'CPWVOL5'],
      dtype='object')

In [48]:
data.dtypes

store         int64
brand        object
week          int64
logmove     float64
feat          int64
price       float64
AGE60       float64
EDUC        float64
ETHNIC      float64
INCOME      float64
HHLARGE     float64
WORKWOM     float64
HVAL150     float64
SSTRDIST    float64
SSTRVOL     float64
CPDIST5     float64
CPWVOL5     float64
dtype: object

### Variables / Data Dict
Inputs:
- STORE - store number  
- BRAND - brand indicator  
- SSTRDIST - distance to the nearest warehouse store  
- SSTRVOL - ratio of sales of this store to the nearest warehouse store  
- CPDIST5 - average distance in miles to the nearest 5 supermarkets  
- CPWVOL5 - ratio of sales of this store to the average of the nearest five stores  
- FEAT - feature advertisement
- PRICE 

Index:
- WEEK - week number   

Target:
- LOGMOVE - log of the number of units sold  

~~Extra data~~ (Train on this too):
- AGE60 - percentage of the population that is aged 60 or older 
- EDUC - percentage of the population that has a college degree  
- ETHNIC - percent of the population that is black or Hispanic  
- INCOME - median income  
- HHLARGE - percentage of households with 5 or more persons  
- WORKWOM - percentage of women with full-time jobs  
- HVAL150 - percentage of households worth more than $150,000  

In [19]:
data.head()

Unnamed: 0,store,brand,week,logmove,feat,price,AGE60,EDUC,ETHNIC,INCOME,HHLARGE,WORKWOM,HVAL150,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
0,2,tropicana,40,9.018695,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927
1,2,tropicana,46,8.723231,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927
2,2,tropicana,47,8.253228,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927
3,2,tropicana,48,8.987197,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927
4,2,tropicana,50,9.093357,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927


In [None]:
To Do: 
- create timestamps from week
- store as index
- partition as time series
- train model



In [49]:
timestamps = pd.to_datetime(data.week, unit='D',
               origin=pd.Timestamp('2021-01-01'))

In [50]:
timestamps.describe()

count                   28947
unique                    121
top       2021-05-18 00:00:00
freq                      249
first     2021-02-10 00:00:00
last      2021-06-10 00:00:00
Name: week, dtype: object
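
The conversion above treats each week number as a day offset from 2021-01-01, which compresses the weekly series into consecutive days for the simulation. A standard-library sketch of the same mapping (the helper name `week_to_date` is only illustrative):

```python
from datetime import date, timedelta

ORIGIN = date(2021, 1, 1)  # same origin as the pd.to_datetime call

def week_to_date(week):
    """Treat the week number as a day offset from ORIGIN, as the notebook does."""
    return ORIGIN + timedelta(days=int(week))

first, last = week_to_date(40), week_to_date(160)
print(first, last)  # 2021-02-10 2021-06-10
```

Weeks 40 and 160 are the minimum and maximum of the `week` column, so the two dates match the `first` and `last` values reported by `timestamps.describe()`.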

In [51]:
data= data.drop(columns='week')
data['datetime'] = timestamps

In [52]:
df = data.set_index('datetime').sort_index()

In [53]:
df.head()

datetime,store,brand,logmove,feat,price,AGE60,EDUC,ETHNIC,INCOME,HHLARGE,WORKWOM,HVAL150,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
2021-02-10,2,tropicana,9.018695,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927
2021-02-10,59,minute.maid,7.655391,0,2.62,0.110819,0.233036,0.024247,10.71504,0.140676,0.390696,0.292652,0.217275,1.0,3.331154,0.395539
2021-02-10,59,tropicana,8.489616,0,3.19,0.110819,0.233036,0.024247,10.71504,0.140676,0.390696,0.292652,0.217275,1.0,3.331154,0.395539
2021-02-10,124,minute.maid,8.269757,0,3.17,0.119626,0.261876,0.572356,10.258957,0.12495,0.348519,0.416316,6.328747,0.727273,1.793796,0.202169
2021-02-10,56,dominicks,8.565602,1,1.59,0.192889,0.237551,0.041356,10.831825,0.105928,0.362168,0.578125,4.865721,0.533333,2.998578,0.496127


In [58]:
t2 = int(len(df))
t1 = int(len(df)/3)
t1, t2

(9649, 28947)

In [59]:
df_train = df.iloc[:t1,:]
df_holdout = df.iloc[t1:,:]
df_train.shape, df_holdout.shape

((9649, 16), (19298, 16))

In [60]:
len(df_train) + len(df_holdout) == len(df)

True
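
The positional split above keeps the earliest third of the time-sorted rows for training and the rest as holdout. A minimal sketch of the same idea on a plain list; the helper `chronological_split` is hypothetical and uses integer arithmetic to avoid floating-point rounding at the cut point:

```python
def chronological_split(sorted_rows, train_parts=1, total_parts=3):
    """Split time-sorted rows into (train, holdout) by position."""
    cut = len(sorted_rows) * train_parts // total_parts
    return sorted_rows[:cut], sorted_rows[cut:]

# Row positions stand in for the 28947 time-sorted records.
train_rows, holdout_rows = chronological_split(list(range(28947)))
print(len(train_rows), len(holdout_rows))  # 9649 19298
```

Because the cut is positional rather than date-based, rows that share the boundary date can land on both sides of the split, as happens with 2021-03-22 in this notebook.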

## Prepare data for modeling
The following utility function is used in two places in this notebook:
1. Standalone, to generate the training split. It could be generalized to handle any combination of training data, labels, and extra data of interest.
2. Inside the `split_data_export` function, to prepare the simulated production data in the correct format.

In [61]:
pd.set_option('mode.chained_assignment', None)

In [62]:
def data_prep(input, extra_feat, target):
    
    #extra data - for segmentation, don't train upon
    if extra_feat is not None:
        extra_data = input[extra_feat]
        input = input.drop(columns=extra_feat)
    else:
        extra_data = None
    
    #one hot features of type object -- note, be careful in understanding types of "pre" data features before using this method
    cats = input.select_dtypes(include=['object'])
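    # NOTE: `str + DataFrame` concatenates element-wise, so this prints one message per row rather than the column names
    print('The following variables will be one-hot encoded: '+cats)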
    enc = OneHotEncoder(drop=None, sparse=False).fit(cats)
    encoded = enc.transform(cats)
    
    #Create a Pandas DataFrame of the hot encoded column
    ohe_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(), index=input.index)
    #concat with original data, drop original
    input_post = pd.concat([input, ohe_df], axis=1).drop(cats.columns, axis=1)
    print(input.shape, input_post.shape)

    #prep data & labels
    y = input[target]
    X_pre = input.drop(columns=target)
    X_post = input_post.drop(columns=target)
    
    return X_pre, X_post, y, extra_data
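
To make the pre/post distinction concrete, here is a dependency-free sketch of one-hot encoding that follows the same `<column>_<category>` naming seen in the post-processed data. It illustrates the transformation only and is not a replacement for `OneHotEncoder`; the helper `one_hot` is hypothetical.

```python
def one_hot(values, prefix):
    """One-hot encode a list of category labels into a list of dict rows."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": float(value == category) for category in categories}
        for value in values
    ]

rows = one_hot(["tropicana", "minute.maid", "dominicks"], prefix="brand")
print(rows[0])
# {'brand_dominicks': 0.0, 'brand_minute.maid': 0.0, 'brand_tropicana': 1.0}
```

The single pre-processed column `brand` therefore becomes three post-processed columns, which is why the post-processed frame has two more columns than the pre-processed one.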

### Generate data artifacts for training & TruEra ingestion

In [71]:
X_train_pre, X_train_post, y, extra_data = data_prep(df_train, None, 'logmove')


                                                        brand
datetime                                                     
2021-02-10  The following variables will be one-hot encode...
2021-02-10  The following variables will be one-hot encode...
2021-02-10  The following variables will be one-hot encode...
2021-02-10  The following variables will be one-hot encode...
2021-02-10  The following variables will be one-hot encode...
...                                                       ...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...

[9649 rows x 1 columns]
(9649, 16) (9649, 18)


In [72]:
X_train_post.columns

Index(['store', 'feat', 'price', 'AGE60', 'EDUC', 'ETHNIC', 'INCOME',
       'HHLARGE', 'WORKWOM', 'HVAL150', 'SSTRDIST', 'SSTRVOL', 'CPDIST5',
       'CPWVOL5', 'brand_dominicks', 'brand_minute.maid', 'brand_tropicana'],
      dtype='object')

In [76]:
X_train_post.shape

(9649, 17)

In [73]:
X_train_post.head()

datetime,store,feat,price,AGE60,EDUC,ETHNIC,INCOME,HHLARGE,WORKWOM,HVAL150,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5,brand_dominicks,brand_minute.maid,brand_tropicana
2021-02-10,2,0,3.87,0.232865,0.248935,0.11428,10.553205,0.103953,0.303585,0.463887,2.110122,1.142857,1.92728,0.376927,0.0,0.0,1.0
2021-02-10,59,0,2.62,0.110819,0.233036,0.024247,10.71504,0.140676,0.390696,0.292652,0.217275,1.0,3.331154,0.395539,0.0,1.0,0.0
2021-02-10,59,0,3.19,0.110819,0.233036,0.024247,10.71504,0.140676,0.390696,0.292652,0.217275,1.0,3.331154,0.395539,0.0,0.0,1.0
2021-02-10,124,0,3.17,0.119626,0.261876,0.572356,10.258957,0.12495,0.348519,0.416316,6.328747,0.727273,1.793796,0.202169,0.0,1.0,0.0
2021-02-10,56,1,1.59,0.192889,0.237551,0.041356,10.831825,0.105928,0.362168,0.578125,4.865721,0.533333,2.998578,0.496127,1.0,0.0,0.0


In [75]:
X_train_pre.to_csv('pre_train.csv')
X_train_post.to_csv('post_train.csv')
y.to_csv('labels_train.csv')

In [82]:
#extra_data.to_csv('extra_data_train.csv')

## Model Selection / Training

### V1: Ridge Regression
- ~~contains guts for cross validation and grid search opt~~ omit for simplicity of ingestion
- ~~TO DO: save model object~~

In [81]:
### from: https://www.programcreek.com/python/example/120827/sklearn.model_selection.TimeSeriesSplit
### author: carl24k
### Modified for my needs

tscv = TimeSeriesSplit(n_splits=5)

#score_models = {'f1': 'f1', 'recall': 'recall', 'precision': 'precision'}

lin_reg = RidgeCV(cv=tscv)

# Note -- empty param grid .. simplified for v1
#gsearch = GridSearchCV(estimator=retain_reg,scoring=score_models, cv=tscv, verbose=1,
   #                    return_train_score=True,  param_grid={'C' : [1]}, refit='f1') #consider adding penalty, addt'l reg. strengths, and class_weighting to grid search 

#pipe = Pipeline([('scaler', StandardScaler()), ('gsearch', gsearch)])
#pipe.fit(X_train_post,y)

lin_reg.fit(X_train_post,y)
lin_reg.best_score_

#result_df = pd.DataFrame(gsearch.cv_results_)

#save_path = 'log_reg_results.csv'
#result_df.to_csv(save_path, index=False)
#print('Saved test scores to ' + save_path) 

# use a context manager so the file handle is closed after writing
with open('linreg.pkl', "wb") as f:
    pickle.dump(lin_reg, f)

In [82]:
lin_reg.score(X_train_post,y)

0.6180746796278567

### V2: Random Forest
- scikit-learn has no random forest estimator with built-in cross-validation comparable to `RidgeCV`
- simply train on the full training dataset and study the model's behavior in TruEra

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [84]:
random_forest = RandomForestRegressor(verbose=1, n_jobs=-1, random_state=42)

random_forest.fit(X_train_post,y)

# use a context manager so the file handle is closed after writing
with open('rf.pkl', "wb") as f:
    pickle.dump(random_forest, f)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [85]:
random_forest.score(X_train_post,y)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


0.8699881470721936

#### Simulating Holdout Splits

In [271]:
#create an empty dataframe to store information about the timeframe of each monitoring split generated by the following script
mon_splits_df = pd.DataFrame(columns=['name','min','max'])

In [107]:
def split_data_export(n, prod_data, extra_feat, target):
    pre_split_dict = {}
    post_split_dict = {}
    labels_dict = {}
    extra_dict = {}
    
    pre_split_file_names = list()
    post_split_file_names = list()
    label_file_names = list()
    extra_file_names = list()
    monitoring_splits = list()
     
    #prep data -- note use of data prep utility function here
    X_pre, X_post, y, extra_data = data_prep(prod_data, extra_feat, target)
    print(X_pre.shape, X_post.shape, y.shape)
    if extra_data is not None:
        print(extra_data.shape)

    X_splits = np.array_split(X_pre,n)
    X_post_splits = np.array_split(X_post, n)
    y_splits = np.array_split(y,n) 
    if extra_data is not None:
        extra_splits = np.array_split(extra_data, n)

    #populate dicts for each data artifact type, for each split, with names and partitioned data
    for i in range(n):
        pre_split_dict["pre_split_{0}".format(i+1)] = X_splits[i]
        post_split_dict["post_split_{0}".format(i+1)] = X_post_splits[i]
        labels_dict["label_{0}".format(i+1)] = y_splits[i]
        if extra_data is not None:
            extra_dict["extra_{0}".format(i+1)] = extra_splits[i]
   
    ## save csvs, and create files with names of splits (and associated timestamps)
    for key, value in pre_split_dict.items():
        split_name = './split_sim/{}.csv'.format(key)
            
        #data for each split
        pre_split_file_names.append(split_name)   
        value.to_csv(split_name)
        
        #timestamps -- "extra data" for Monitoring purposes
        ## 2022 update: this probably is no longer necessary, either because timestamps are included as index in data files, and/or that we will no longer require manual set of begin & end timestamps in near future, for monitoring ingestion
        value['timestamp'] = value.index
        timestamps = value['timestamp']
        timestamps.to_csv('./split_sim/timestamp_'+str(key)+'.csv', index=None)
        
        monitoring_splits.append([min(timestamps), max(timestamps)])
    
    #post data
    for key, value in post_split_dict.items():
        post_split_name = './split_sim/{}.csv'.format(key)
            
        #data for each split
        post_split_file_names.append(post_split_name)
        value.to_csv(post_split_name)

    ## continued .. labels
    for key, value in labels_dict.items():
        # the same naming scheme is used whether n == 1 (pre-production partition) or n > 1
        label_name = './split_sim/{}.csv'.format(key)
            
        label_file_names.append(label_name)   
        value.to_csv(label_name)
        
    ## continued .. extra data
    if extra_data is not None:
        for key, value in extra_dict.items():
            extra_name = './split_sim/{}.csv'.format(key)

            extra_file_names.append(extra_name)   
            value.to_csv(extra_name)
        
    return pre_split_file_names, post_split_file_names, label_file_names, extra_file_names, monitoring_splits
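
The function above depends on `np.array_split`, which cuts a sequence into `n` contiguous pieces whose sizes differ by at most one, with the larger pieces first. A dependency-free sketch of that partitioning, plus the per-split time window that `monitoring_splits` records, is shown here. The helper `near_equal_chunks` is hypothetical and the timestamps are synthetic, for illustration only.

```python
from datetime import date, timedelta

def near_equal_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more row
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Synthetic, time-sorted timestamps standing in for the 19298 holdout rows.
stamps = [date(2021, 3, 22) + timedelta(days=i // 240) for i in range(19298)]
chunks = near_equal_chunks(stamps, 80)
windows = [(min(chunk), max(chunk)) for chunk in chunks]
print(len(chunks[0]), len(chunks[-1]))  # 242 241
```

With 19298 rows and 80 splits, the first 18 chunks hold 242 rows and the remaining 62 hold 241, which is consistent with the 242-row splits loaded later in the notebook.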

In [87]:
df_holdout.shape

(19298, 16)

In [88]:
min(df_holdout.index)

Timestamp('2021-03-22 00:00:00')

In [89]:
max(df_holdout.index)

Timestamp('2021-06-10 00:00:00')

### Create simulated weekly splits
- use split_data_export function to generate n splits
- also, create "create_splits.csv" for Monitoring split partitions of date range in holdout set

In [108]:
pre_split_file_names, \
post_split_file_names, \
label_file_names, \
extra_file_names, \
monitoring_splits  = split_data_export(80, df_holdout, None, 'logmove')

                                                        brand
datetime                                                     
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
2021-03-22  The following variables will be one-hot encode...
...                                                       ...
2021-06-10  The following variables will be one-hot encode...
2021-06-10  The following variables will be one-hot encode...
2021-06-10  The following variables will be one-hot encode...
2021-06-10  The following variables will be one-hot encode...
2021-06-10  The following variables will be one-hot encode...

[19298 rows x 1 columns]
(19298, 16) (19298, 18)
(19298, 15) (19298, 17) (19298,)


# Create (or set) Project

In [8]:
tru.set_environment("remote")

In [117]:
#tru.add_project(project_name, score_type="logits")
tru.set_project(project_name)

In [118]:
tru.get_projects()

['Predictive Maintenance',
 'Marketing Models',
 'Healthcare',
 'Customer Lifetime Value',
 'Insurance Premium Default',
 'telecom_churn',
 'Insurance Underwriting',
 'Sales Forecasting',
 'Anomaly Detection v2',
 'Anomaly Detection v3 - Custom Classification Inference',
 'Customer Churn - Monitoring',
 'R-Diamonds_Classifier',
 'Household Income Demo',
 'California Housing Price Prediction',
 'WF_application1',
 'CLI Ingestion Demo',
 'Household Income Demo Test',
 'NY-Mortgages-Demo',
 'NY-Mortgages-Demo-Debug',
 'WF_application2 v2',
 'WF Application 2',
 'Industrial MFG - Predictive Maintenance',
 'Test_Anoosha',
 'Test_2',
 'AdultCensus_DemoNB',
 'retail_inventory_demo',
 'Predictive Maintenance - Industrial Manufacturing',
 'California Housing Price Prediction - Demo',
 'Sales Forecasting OJ']

# Adding a Data Collection

In [119]:
tru.set_data_collection("OJ Data")

In [120]:
FEATURE_MAP = {}
for post in X_train_post.columns:
    mapped = None
    for pre in X_train_pre.columns:
        if post.startswith(pre) and (mapped is None or len(mapped) < len(pre)):
            mapped = pre
    if mapped not in FEATURE_MAP:
        FEATURE_MAP[mapped] = []
    FEATURE_MAP[mapped].append(post)

In [121]:
FEATURE_MAP

{'store': ['store'],
 'feat': ['feat'],
 'price': ['price'],
 'AGE60': ['AGE60'],
 'EDUC': ['EDUC'],
 'ETHNIC': ['ETHNIC'],
 'INCOME': ['INCOME'],
 'HHLARGE': ['HHLARGE'],
 'WORKWOM': ['WORKWOM'],
 'HVAL150': ['HVAL150'],
 'SSTRDIST': ['SSTRDIST'],
 'SSTRVOL': ['SSTRVOL'],
 'CPDIST5': ['CPDIST5'],
 'CPWVOL5': ['CPWVOL5'],
 'brand': ['brand_dominicks', 'brand_minute.maid', 'brand_tropicana']}
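
Before uploading the map, it can be worth confirming that every post-processed column is claimed by exactly one known pre-processed column. The prefix-matching loop above would file an unmatched post-processed column under the key `None`, which a check like this one would flag. This is a sketch only; `validate_feature_map` is a hypothetical helper, not a TruEra API.

```python
def validate_feature_map(feature_map, pre_columns, post_columns):
    """Check that each post column is mapped from exactly one known pre column."""
    unknown = [pre for pre in feature_map if pre not in pre_columns]
    mapped = [post for posts in feature_map.values() for post in posts]
    duplicates = sorted({post for post in mapped if mapped.count(post) > 1})
    missing = [post for post in post_columns if post not in mapped]
    if unknown or duplicates or missing:
        raise ValueError(f"unknown={unknown}, duplicates={duplicates}, missing={missing}")
    return True

example_map = {"price": ["price"], "brand": ["brand_dominicks", "brand_tropicana"]}
ok = validate_feature_map(
    example_map,
    pre_columns=["price", "brand"],
    post_columns=["price", "brand_dominicks", "brand_tropicana"],
)
```

In the notebook, the arguments would be `FEATURE_MAP`, `list(X_train_pre.columns)`, and `list(X_train_post.columns)`.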

In [296]:
help(tru.add_feature_metadata)

Help on method add_feature_metadata in module truera.client.truera_workspace:

add_feature_metadata(feature_description_map: Optional[Mapping[str, str]] = None, pre_to_post_feature_map: Optional[Mapping[str, str]] = None, missing_values: Optional[Sequence[str]] = None, force_update: bool = False) method of truera.client.truera_workspace.TrueraWorkspace instance
    Upload metadata describing features and feature groupings to the server.
    
    Args:
        feature_description_map: Map from pre-processed feature name, as provided in the data, to the description of the feature.
        pre_to_post_feature_map: Map from pre-processed human-readable feature name to post-processed model-readable feature name. Ignored if post-processed data is not provided for the data collection.
        missing_values: List of strings to be registered as missing values when reading split data.
        force_update: Overwrite any existing feature metadata.



In [122]:
tru.add_feature_metadata(pre_to_post_feature_map=FEATURE_MAP)

INFO:truera.client.remote_truera_workspace:Uploading feature description for project: Sales Forecasting OJ and data_collection: OJ Data


## Uploading one or more data splits
Now we can upload data to the data collection to prepare for analyzing the model.
Here we upload the training data as a "train" split and two simulated holdout partitions as "test" splits. An entire dataset could instead be uploaded as a single "all" split.
At least one "train" or "all" split is required to generate analysis. You can have zero or more splits of other kinds.
You upload a split by providing:
 * A friendly name to identify the split (required).
 * Input data in the shape the model expects (required). This can be a pandas DataFrame.
 * Labels/target ground-truth values (optional). It is strongly recommended to provide labels when available.
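
A quick consistency check before uploading can catch mismatched inputs early. The sketch below assumes only that each piece supports `len()`, which holds for pandas DataFrames, NumPy arrays, and lists. The helper `check_split_lengths` is hypothetical and not part of the TruEra SDK.

```python
def check_split_lengths(pre_data, post_data=None, label_data=None):
    """Raise if the provided pieces of a split do not share the same row count."""
    lengths = {"pre_data": len(pre_data)}
    if post_data is not None:
        lengths["post_data"] = len(post_data)
    if label_data is not None:
        lengths["label_data"] = len(label_data)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Row counts differ: {lengths}")
    return lengths["pre_data"]

# Lists stand in for the pre-processed data, post-processed data, and labels.
n_rows = check_split_lengths([[1, "a"], [2, "b"]], [[1, 1.0], [2, 0.0]], [0.5, 0.7])
```

The same call could be made with the `pre_data`, `post_data`, and `label_data` objects just before they are passed to `tru.add_data_split`.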

In [318]:
X_train_pre.head()



In [319]:
X_train_post.head()



In [127]:
y.dtypes

dtype('float64')

In [123]:
tru.add_data_split("train", 
                   pre_data=X_train_pre, 
                   post_data = X_train_post, 
                   label_data=y,
                  # extra_data_df=extra_data,
                   split_type="train")



Uploading tmpht2jr0e3 -- ### -- file upload complete.
Uploading tmphliksveo -- ### -- file upload complete.
Uploading tmpaka7zbli -- ### -- file upload complete.


INFO:truera.client.remote_truera_workspace:Data split "train" is added to remote data collection "OJ Data", and set as the data split for the workspace context.


In [109]:
X_val_pre = pd.read_csv('./split_sim/pre_split_1.csv', index_col='datetime')
X_val_post = pd.read_csv('./split_sim/post_split_1.csv', index_col='datetime')
y_val = pd.read_csv('./split_sim/label_1.csv', index_col='datetime')
#extra_val = pd.read_csv('./split_sim/extra_1.csv', index_col='datetime')

In [126]:
y_val.dtypes

logmove    float64
dtype: object

In [110]:
X_val_pre.shape, X_val_post.shape, y_val.shape

((242, 15), (242, 17), (242, 1))

In [129]:
tru.add_data_split("validation", 
                   pre_data=X_val_pre, 
                   post_data = X_val_post, 
                   label_data=y_val.values,
                   #extra_data_df=extra_val,
                   split_type="test")



Uploading tmp6ugtfvp8 -- ### -- file upload complete.
Uploading tmphoiczo4u -- ### -- file upload complete.
Uploading tmpvksjk5cl -- ### -- file upload complete.


INFO:truera.client.remote_truera_workspace:Data split "validation" is added to remote data collection "OJ Data", and set as the data split for the workspace context.


In [130]:
X_test_pre = pd.read_csv('./split_sim/pre_split_2.csv', index_col='datetime')
X_test_post = pd.read_csv('./split_sim/post_split_2.csv', index_col='datetime')
y_test = pd.read_csv('./split_sim/label_2.csv', index_col='datetime')
#extra_test = pd.read_csv('./split_sim/extra_2.csv', index_col='datetime')

In [132]:
X_test_pre.shape, X_test_post.shape, y_test.shape  # , extra_test.shape

((242, 15), (242, 17), (242, 1))
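
The validation and test splits are loaded with the same three `pd.read_csv` calls, differing only in the numeric suffix of the file names. A sketch of that pattern as a function is shown below; `load_split` is a hypothetical helper, and the demo writes tiny invented CSV files to a temporary directory rather than reading `./split_sim`.

```python
import os
import tempfile

import pandas as pd

def load_split(base_dir, split_id, index_col="datetime"):
    """Read the pre, post, and label CSV files for one numbered split."""
    frames = []
    for prefix in ("pre_split", "post_split", "label"):
        path = os.path.join(base_dir, f"{prefix}_{split_id}.csv")
        frames.append(pd.read_csv(path, index_col=index_col))
    pre, post, label = frames
    return pre, post, label

# Demo with made-up files that follow the same naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    idx = pd.Index(["2021-03-23", "2021-03-30"], name="datetime")
    pd.DataFrame({"price": [1.69, 1.59]}, index=idx).to_csv(
        os.path.join(tmp, "pre_split_9.csv"))
    pd.DataFrame({"price": [1.69, 1.59], "feat": [0, 1]}, index=idx).to_csv(
        os.path.join(tmp, "post_split_9.csv"))
    pd.DataFrame({"logmove": [9.1, 8.7]}, index=idx).to_csv(
        os.path.join(tmp, "label_9.csv"))
    pre_demo, post_demo, label_demo = load_split(tmp, 9)

shapes = (pre_demo.shape, post_demo.shape, label_demo.shape)
print(shapes)
```

With the demo files in this notebook, a call such as `load_split('./split_sim', 2)` would be expected to return the same three frames as the cell above, assuming the file layout is unchanged.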

In [133]:
X_test_pre.head()

datetime,store,brand,feat,price,AGE60,EDUC,ETHNIC,INCOME,HHLARGE,WORKWOM,HVAL150,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
2021-03-23,98,minute.maid,0,1.69,0.249201,0.051703,0.164964,10.573596,0.125409,0.299584,0.009843,6.230357,1.5,3.133177,0.381569
2021-03-23,119,minute.maid,0,1.69,0.121575,0.279952,0.049585,10.752719,0.08935,0.462266,0.459406,2.213791,1.153846,2.719594,0.722367
2021-03-23,126,minute.maid,0,1.69,0.107002,0.413222,0.045019,10.980876,0.113699,0.421959,0.573626,5.065201,1.690476,1.85826,0.717799
2021-03-23,111,minute.maid,0,1.69,0.210513,0.096929,0.995691,10.138283,0.157136,0.288515,0.012747,12.190945,1.894737,1.465672,0.289203
2021-03-23,118,dominicks,1,1.59,0.289442,0.224726,0.040669,10.632364,0.090133,0.354977,0.475753,1.945509,0.923077,2.046489,0.513722


In [135]:
tru.add_data_split("test",
                   pre_data=X_test_pre,
                   post_data=X_test_post,
                   label_data=y_test.values,
                   # extra_data_df=extra_test,
                   split_type="test")



Uploading tmpi_44_nso -- ### -- file upload complete.
Uploading tmp419u8vzf -- ### -- file upload complete.
Uploading tmp7u50c5g5 -- ### -- file upload complete.


INFO:truera.client.remote_truera_workspace:Data split "test" is added to remote data collection "OJ Data", and set as the data split for the workspace context.


## Uploading the model
This is the last step before we can start analyzing the model in the TruEra dashboards.
The model type and dependency versions are inferred automatically from the environment and the model object. The friendly name passed as the first argument identifies the model in the TruEra dashboard and lets you refer to it later.
The model is attached automatically to the current data collection, which is the one set by invoking `set_data_collection`.
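
Before uploading, it can help to confirm locally that the model predicts on post-transform data without errors. The sketch below wraps that check in a function; `smoke_test_predict` and the `MeanModel` stand-in are hypothetical and exist only for illustration. Any object with a scikit-learn style `predict` method could be passed instead.

```python
import numpy as np
import pandas as pd

def smoke_test_predict(model, X):
    """Call model.predict(X) and verify one finite prediction per row."""
    preds = np.asarray(model.predict(X), dtype=float)
    if preds.shape[0] != len(X):
        raise ValueError(f"expected {len(X)} predictions, got {preds.shape[0]}")
    if not np.all(np.isfinite(preds)):
        raise ValueError("predictions contain NaN or infinite values")
    return preds

class MeanModel:
    """Hypothetical stand-in model: predicts the mean of each row."""
    def predict(self, X):
        return X.mean(axis=1).to_numpy()

X_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
preds = smoke_test_predict(MeanModel(), X_demo)
print(preds)
```

This only checks that prediction runs and returns sensible values locally; it does not replace the verification performed during the upload itself.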

In [137]:
lin_reg.predict(X_test_post)
# If this runs without error, the model upload in the following step should also succeed.

array([ 9.61043848,  9.55597284,  9.40941478,  9.95176744, 10.55181711,
        9.40000728, 11.36139042, 10.29272375,  9.99942851, 10.47158542,
       11.93879131, 11.57113312, 11.62189656,  9.79207352, 11.56763163,
        8.6299667 , 10.48385454,  8.99160176,  9.01340816,  9.05721319,
       10.14186587,  9.25285589,  8.70109573, 10.15321922, 10.15546948,
        8.3858983 ,  8.85706129,  9.06688224,  8.81005251,  8.69323931,
        8.44106853,  8.82435292, 10.28873237,  8.64592833,  8.74125448,
        8.77716072,  9.86576217,  8.68129165,  9.25475275, 10.07381581,
        9.30541994, 10.05245228,  8.98629458,  9.75834809,  9.07803799,
        8.5799714 ,  8.94960727,  9.32290784, 10.06905438,  9.05660571,
        8.72379839,  9.82485827,  8.92554397,  9.83426576, 10.42327284,
        8.70185701, 10.24977376,  9.05972297,  8.81893752, 10.1834855 ,
       10.15997233,  8.30019816,  9.1228086 ,  9.37258307,  8.23084679,
        9.17423776,  8.740647  ,  8.92868975, 10.18036825,  9.75

In [138]:
random_forest.predict(X_test_post)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


array([ 9.50132964,  9.17849086,  8.88569871,  9.87473697,  9.78972124,
        8.97816422, 10.08107396,  9.87407935,  8.74464725, 10.7046282 ,
       10.52226548, 10.77968491,  8.98399532,  9.74544996, 10.00728824,
        7.255457  , 10.79657738,  7.86815739,  8.60248528,  6.9746196 ,
       10.39138474,  7.88543422,  8.97099101, 10.57230825,  9.79840718,
        8.37943675,  7.66115293,  7.67491037,  8.72050873,  8.55209688,
        8.01059058,  9.49609067,  9.92702279,  8.90499843,  8.78368907,
        7.64874061,  9.65580711,  8.21893679,  7.64617055, 10.83603847,
        8.96061093, 10.20807405,  7.32246759,  9.5729087 ,  8.76166202,
        7.4028114 ,  7.28992555,  7.96382543, 10.10722971,  7.36956971,
        7.88868811,  9.60548742,  7.02264666,  9.81689697, 10.24985659,
        7.26917632, 10.19632595,  7.812906  ,  8.93017938, 10.21317254,
       10.53899718,  8.711796  ,  8.98579492,  7.88449152,  8.17836   ,
        7.32922513,  8.57593002,  7.13812582, 11.30053031,  9.45

In [139]:
tru.add_python_model('Ridge Regression', lin_reg, additional_pip_dependencies='numpy == 1.21.0')

INFO:truera.client.remote_truera_workspace:Uploading sklearn model: RidgeCV
INFO:truera.client.remote_truera_workspace:Using sklearn version 1.0.2


Verification Done
Uploading MLmodel -- ### -- file upload complete.
Uploading tmp979r5yom -- ### -- file upload complete.
Uploading conda.yaml -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.py -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "Ridge Regression" is added and associated with remote data collection "OJ Data". "Ridge Regression" is set as the model for the workspace context.


Model uploaded to: http://se-demo-server.eastus.cloudapp.azure.com/p/Sales%20Forecasting%20OJ/m/Ridge%20Regression/


In [140]:
tru.add_python_model('Random Forest Regressor', random_forest, additional_pip_dependencies='numpy == 1.21.0')

INFO:truera.client.remote_truera_workspace:Uploading sklearn model: RandomForestRegressor
INFO:truera.client.remote_truera_workspace:Using sklearn version 1.0.2


Verification Done
Uploading tmpi25zzocu -- ################# -- file upload complete.
Uploading MLmodel -- ### -- file upload complete.
Uploading conda.yaml -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.py -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "Random Forest Regressor" is added and associated with remote data collection "OJ Data". "Random Forest Regressor" is set as the model for the workspace context.


Model uploaded to: http://se-demo-server.eastus.cloudapp.azure.com/p/Sales%20Forecasting%20OJ/m/Random%20Forest%20Regressor/
