# Before you start with this tutorial

**Tutorial intention:** Provide an example of an iteration and its related steps in a data preparation phase so that you can:

*   Experience the data science lifecycle using Vectice
*   See how simple it is to connect your notebook to Vectice
*   Learn how to structure and log your work using Vectice

**Resources needed:**
*   <b>Tutorial Project: Forecast in-store unit sales (23.1)</b> - You can find it in your personal workspace, which is named after you
*   Vectice Webapp Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/sdk/index.html

## Installing Vectice

In [None]:
%pip install -q -U vectice

## Install optional packages for your project

In [None]:
%pip install -q squarify
%pip install -q plotly
%pip install -q -U matplotlib

## Import packages

In [11]:

# Standard library
import datetime as dt
import os

# Data handling
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Notebook display and plotting
import IPython.display
from IPython.display import display
%matplotlib inline
from matplotlib import pyplot as plt
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

# sklearn
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

# Vectice
import vectice
from vectice import FileDataWrapper

## Reading the data

The datasets used in this project can be found here:<br>
* [items.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv)<br>
* [holidays_events.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/holidays_events.csv)<br>
* [stores.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/stores.csv)<br>
* [oil.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/oil.csv)<br>
* [transactions.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/transactions.csv)<br>
* [train_reduced.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_reduced.csv)

In [None]:
# Download the files locally
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/holidays_events.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/stores.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/oil.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/transactions.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_reduced.csv -q --no-check-certificate

In [12]:
dtypes = {'store_nbr': np.dtype('int64'),
          'item_nbr': np.dtype('int64'),
          'unit_sales': np.dtype('float64'),
          'onpromotion': np.dtype('O')}

items = pd.read_csv("items.csv")
holiday_events = pd.read_csv("holidays_events.csv", parse_dates=['date'])
stores = pd.read_csv("stores.csv")
oil = pd.read_csv("oil.csv", parse_dates=['date'])
transactions = pd.read_csv("transactions.csv", parse_dates=['date'])
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
train = pd.read_csv("train_reduced.csv", parse_dates=['date'], on_bad_lines='skip')

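
As a minimal sketch of what skipping bad lines means in practice, the following reads a small in-memory CSV whose last row has one field too many. The CSV content is hypothetical and only serves as an illustration; it assumes pandas 1.3 or later, where `on_bad_lines` is available.

```python
import io

import pandas as pd

# Hypothetical CSV content: the last data row has an extra field.
raw = (
    "date,store_nbr,unit_sales\n"
    "2017-01-01,1,3.0\n"
    "2017-01-02,1,4.0\n"
    "2017-01-03,1,5.0,unexpected\n"
)

# Rows with too many fields are dropped instead of raising a parser error.
df = pd.read_csv(io.StringIO(raw), parse_dates=["date"], on_bad_lines="skip")

print(len(df))            # number of rows that survived parsing
print(df.dtypes["date"])  # the date column is parsed as a datetime type
```

Skipping rows silently can hide data quality problems, so it may be worth comparing the row count against the raw file when this option is used.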





# Feature engineering

**Here we analyze the data and select the features for our model to be trained on.**

**Train**
id, date, store_nbr, item_nbr, unit_sales, onpromotion

**Items**
item_nbr, family, class, perishable

**Holidays_events**
date, type, locale, locale_name, description, transferred

**Stores**
store_nbr, city, state, type, cluster

**Oil**
date, dcoilwtico

**Transactions**
date, store_nbr, transactions

**Selected features as inputs to the model**

date, holiday.type, holiday.locale, holiday.locale_name, holiday.transferred, store_nbr, store.city, store.state, store.type, store.cluster, transactions, item_nbr, item.family, item.class, onpromotion, perishable, dcoilwtico.

**Selected features as outputs of the model**

transactions per store, unit_sales per item

## Data pipeline

In [13]:
class prepare_data(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("prepare_data -> init")
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # X is the list [train, stores, oil, items, transactions, holiday_events]
        train_stores = X[0].merge(X[1], on='store_nbr')
        train_stores_oil = train_stores.merge(X[2], on='date')
        train_stores_oil_items = train_stores_oil.merge(X[3], on='item_nbr')
        train_stores_oil_items_transactions = train_stores_oil_items.merge(X[4], on=['date', 'store_nbr'])
        # stores and holiday_events both have a 'type' column, so this merge yields 'type_x' and 'type_y'
        train_stores_oil_items_transactions_hol = train_stores_oil_items_transactions.merge(X[5], on='date')
        
        data_df = train_stores_oil_items_transactions_hol.copy(deep = True)
        
        # Fill the empty values
        data_df['onpromotion'] = data_df['onpromotion'].fillna(0)
        # change the bool to int
        data_df['onpromotion'] = data_df['onpromotion'].astype(int)
        data_df['transferred'] = data_df['transferred'].astype(int)

        # change the names
        data_df.rename(columns={'type_x': 'st_type', 'type_y': 'hol_type'}, inplace=True)
        
        # handle date
        data_df['date'] = pd.to_datetime(data_df['date'])
        data_df['date'] = data_df['date'].map(dt.datetime.toordinal)
                
        return data_df
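
Two details of the `transform` method above can be easier to follow on toy data: where the `type_x` and `type_y` column names come from, and what the ordinal date conversion produces. The sketch below uses small hypothetical dataframes (`stores_like` and `holidays_like` are illustrative names, not part of the tutorial data).

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-ins for the merged store data and the holiday data.
stores_like = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "type": ["A", "B"],              # store type
})
holidays_like = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "type": ["Holiday", "Event"],    # holiday type
})

# Both frames share a non-key column named 'type', so pandas appends
# the default suffixes '_x' (left frame) and '_y' (right frame).
merged = stores_like.merge(holidays_like, on="date")
merged = merged.rename(columns={"type_x": "st_type", "type_y": "hol_type"})

# Convert each date to its proleptic Gregorian ordinal (an integer),
# then back again to check that the day is preserved.
ordinals = merged["date"].map(dt.datetime.toordinal)
restored = ordinals.map(dt.date.fromordinal)

print(list(merged.columns))
print(restored.tolist())
```

The ordinal keeps only the calendar day, so any time-of-day information would be lost by this conversion.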

### Custom transform for splitting the data

Here, we split the dataframe into its numerical columns, its categorical columns, and the date column.

In [14]:
# split dataframe into numerical values, categorical values and date
class split_data(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("split_data -> init")
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
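        # Get columns for each type (the date column is handled separately)
        df_ = X.drop(['date'], axis = 1)
        cols = df_.columns
        # select numeric and boolean columns with the public select_dtypes API
        # instead of the private _get_numeric_data()
        num_cols = df_.select_dtypes(include=['number', 'bool']).columns
        # keep the original column order so the result is deterministic
        cat_cols = [col for col in cols if col not in num_cols]
        
        data_num_df = X[num_cols]
        data_cat_df = X[cat_cols]
        data_date_df = X['date']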
        
        return data_num_df, data_cat_df, data_date_df
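
To illustrate the split on a small scale, here is a sketch on a hypothetical five-column dataframe. It assumes, as in the pipeline, that the date has already been converted to an integer ordinal and is set aside before the dtype-based split.

```python
import pandas as pd

# Hypothetical miniature of the merged dataset.
df = pd.DataFrame({
    "date": [736330, 736331],            # ordinal dates
    "store_nbr": [1, 2],
    "unit_sales": [3.0, 4.5],
    "city": ["Quito", "Guayaquil"],
    "family": ["GROCERY I", "DAIRY"],
})

# Set the date aside, then split the remaining columns by dtype.
features = df.drop(columns=["date"])
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = [col for col in features.columns if col not in num_cols]

data_num_df = df[num_cols]
data_cat_df = df[cat_cols]
data_date = df["date"]

print(num_cols)
print(cat_cols)
```

Building `cat_cols` with a list comprehension rather than a set difference keeps the column order stable from run to run.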

Here, we handle missing data, apply a standard scaler to the numerical attributes, and convert the categorical data into numerical form with one-hot encoding.

In [15]:
class process_data(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("process_data -> init")
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        ### numerical data
        # impute nulls in numerical attributes
        imputer = SimpleImputer(strategy="mean", copy=True)
        num_imp = imputer.fit_transform(X[0])
        #########
        data_num_df = pd.DataFrame(num_imp, columns=X[0].columns, index=X[0].index)
        
        # apply standard scaling
        scaler = StandardScaler()
        scaler.fit(data_num_df)
        num_scaled = scaler.transform(data_num_df)
        data_num_df = pd.DataFrame(num_scaled, columns=X[0].columns, index=X[0].index)
        
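        ### categorical data
        # one-hot encode; `sparse_output` replaces `sparse`, which is deprecated since scikit-learn 1.2
        cat_encoder = OneHotEncoder(sparse_output=False)
        data_cat_1hot = cat_encoder.fit_transform(X[1])
        
        # convert it to a dataframe of shape n x 99, where n is the number of rows and 99 is the number of categories
        # reuse the input index so that the later join aligns rows correctly
        data_cat_df = pd.DataFrame(data_cat_1hot, columns=cat_encoder.get_feature_names_out(), index=X[1].index)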
                
        return data_num_df, data_cat_df, X[2]

In [16]:
class join_df(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("join_df -> init")
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # join the numerical, one-hot encoded categorical, and date columns on their shared index
        data_df = X[0].join(X[1])
        data_df = data_df.join(X[2])
        
        return data_df
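
`DataFrame.join` aligns rows on the index, so the pieces being joined need matching index labels. The following sketch, on hypothetical data, shows what happens when they do not match and when they do.

```python
import pandas as pd

# Hypothetical numerical part with a non-default index.
num = pd.DataFrame({"unit_sales": [0.1, 0.2, 0.3]}, index=[10, 11, 12])

# Categorical part built without an index gets a fresh RangeIndex (0, 1, 2).
cat_default = pd.DataFrame({"city_Quito": [1.0, 0.0, 1.0]})

# Categorical part built with the same index as the numerical part.
cat_aligned = pd.DataFrame({"city_Quito": [1.0, 0.0, 1.0]}, index=num.index)

# No shared labels: the joined column is filled with NaN.
misaligned = num.join(cat_default)

# Shared labels: rows line up as intended.
aligned = num.join(cat_aligned)

print(misaligned)
print(aligned)
```

In this tutorial the merged dataframe carries a default index, so the join works either way, but passing the index explicitly guards against silent misalignment if rows are filtered earlier in the pipeline.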

# Push the datasets through the pipeline

In [17]:
pipe_processing = Pipeline([
        ('prepare_data', prepare_data()),
        ('split_data', split_data()),
        ('process_data', process_data()),
        ('join_data', join_df())
    ])

# our prepared data
data_df = pipe_processing.fit_transform([train, stores, oil, items, transactions, holiday_events])
data_df.to_csv("train_clean.csv") #this is the dataset that will be split into a training, testing, and validation dataset

prepare_data -> init
split_data -> init
process_data -> init
join_df -> init
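
For readers new to custom transformers, here is a reduced sketch of the same pattern: each step receives the output of the previous one, and the first step accepts a pair of dataframes rather than a single array. The class names `MergeOnStore` and `DropMissingSales` and the toy data are hypothetical and are not part of the tutorial pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MergeOnStore(BaseEstimator, TransformerMixin):
    """Merge a (sales, stores) pair of dataframes on store_nbr."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        sales, stores = X
        return sales.merge(stores, on="store_nbr")


class DropMissingSales(BaseEstimator, TransformerMixin):
    """Drop rows whose unit_sales value is missing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.dropna(subset=["unit_sales"])


sales = pd.DataFrame({"store_nbr": [1, 2, 1], "unit_sales": [3.0, np.nan, 5.0]})
stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Cuenca"]})

pipe = Pipeline([
    ("merge", MergeOnStore()),
    ("drop_missing", DropMissingSales()),
])

# fit_transform calls fit and transform on each step in order.
result = pipe.fit_transform([sales, stores])
print(result)
```

Inheriting from `TransformerMixin` provides `fit_transform` for free, which is why each class only defines `fit` and `transform`.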




# Document in Vectice
- To log your work to Vectice, connect your notebook to your profile using your personal API token.
- Click on your profile at the top right corner of the Vectice application --> API Tokens --> Create API Token.
- Provide a name and description for the token. We recommend naming the API token "Tutorial_API_Token" so that you do not have to make additional changes to the notebook.
- Save the token file in a location accessible by this code.
- **If you are viewing this notebook in Google Colab, click the folder icon on the left bar and upload the file.**

#### Update the workspace name below to match the workspace that contains your project

In [18]:
my_vectice = vectice.connect(config="Tutorial_API_token.json")
my_workspace = my_vectice.workspace("YOUR WORKSPACE NAME") # replace workspace name
my_project = my_workspace.project("Tutorial Project: Forecast in store unit sales (23.1)")

2023/02/17 11:39:21 INFO vectice.connection: Vectice successfully connected.
2023/02/17 11:39:23 INFO vectice.connection: Your current workspace: .Retail Ops and project: Corp Forecast in-store unit sales
2023/02/17 11:39:24 INFO vectice.connection: Assets with Latest Activity   Asset Type    Name
2023/02/17 11:39:24 INFO vectice.connection: Assets with Latest Activity   Project       'Corp Forecast in-store unit sales'
2023/02/17 11:39:24 INFO vectice.connection: Assets with Latest Activity   Phase         'Data Understanding'
2023/02/17 11:39:24 INFO vectice.connection: Assets with Latest Activity   Iteration      4
2023/02/17 11:39:24 INFO vectice.connection: Assets with Latest Activity   Step          ''


## Capture milestones for the Data Preparation phase

In [19]:
# Get the phase for Data Preparation 
project_dp = my_project.phase("Data Preparation")   

# Let's start a new iteration (or get the currently open iteration)
project_iter = project_dp.iteration()

# Let's select the first step
step = project_iter.step('Select Data')

# Provide context into the origin datasets by attaching them to the step
step.origin_dataset = FileDataWrapper(path="items.csv", name="Items origin")
step.origin_dataset = FileDataWrapper(path="holidays_events.csv", name="Holiday origin")
step.origin_dataset = FileDataWrapper(path="stores.csv", name="Stores origin")
step.origin_dataset = FileDataWrapper(path="oil.csv", name="Oil origin")
step.origin_dataset = FileDataWrapper(path="transactions.csv", name="Transactions origin")

# Done with this step, let's close it and get the next step - all in one line
# Alternatively you can explicitly close the step and manually retrieve the next one
step = step.next_step(message="The datasets for the project have been identified as:")

# Log findings/comments for this milestone, close the step, and get the next one
msg = "As part of our standard Data Pipeline process we applied the following preparation to our datasets:\n - Handling of missing data\n - Applied standard scaler to numerical attributes\n - Converted categorical data into numerical\n - Split values in numerical values, categorical values, and dates"
step = step.next_step(message=msg)

# Log findings/comments for this milestone, close the step, and get the next one
step = step.next_step(message="We selected \"unit sales\" as our model target.\nThe features used in this model are:\n - date\n - holiday.type\n - holiday.locale\n - holiday.locale_name\n - holiday.transferred\n - store_nbr\n - store.city\n - store.state\n - store.type\n - store.cluster\n - transactions\n - item_nbr\n - item.family\n - item.class\n - onpromotion\n - perishable\n - dcoilwtico")

# Log findings/comments for this milestone and close the step
msg = "We processed our origin datasets through our data pipeline to generate a dataset ready for modeling.\n"
msg += f"The resulting modeling dataset contains {data_df.shape[0]} observations and {data_df.shape[1]} features.\n"
msg += "The dataset is ready to be split for modeling."
step.close(message=msg)

# Get the Format data step. Since we used the close() method above we need to specify the step name
step = project_iter.step("Format Data")
step.clean_dataset = FileDataWrapper(path="train_clean.csv", name="Clean&Augmented_Dataset")
# Log findings/comments for this milestone and close the step
step.close(message="We generated a dataset ready for modeling. We also created a data pipeline to make this process repeatable.")


2023/02/17 11:39:26 INFO vectice.models.phase: Iteration number 1 (id 4570) successfully retrieved.
2023/02/17 11:39:27 INFO vectice.models.datasource.datawrapper.file_data_wrapper: File: items_2.csv wrapped successfully.
2023/02/17 11:39:27 INFO vectice.models.phase: Iteration number 1 (id 4570) successfully retrieved.
2023/02/17 11:39:29 INFO vectice.models.step: Code captured and will be linked to asset.
2023/02/17 11:39:34 INFO vectice.models.git_version: Code captured the following changed files; .gitignore, 22.4/samples/SimpleProject_HelloWorld.ipynb, 23.1/samples/howto_captureDatasets.ipynb, 23.1/samples/howto_captureModels.ipynb, 23.1/tutorial/Data_Preparation.ipynb
2023/02/17 11:39:34 INFO vectice.api.client: Successfully registered Dataset(name='Items origin', id=22604, version='Version 2', type=ORIGIN).
2023/02/17 11:39:34 INFO vectice.models.step: Dataset: Items origin with Version: Version 2 already exists.
2023/02/17 11:39:36 INFO vectice.models.step: Successfully added D