# Before you start with this Modeling Notebook

This notebook is part of the Vectice tutorial project notebook series. It illustrates how the forecast model was trained in the "Modeling" phase of the **"Tutorial: Forecast in store-unit sales"** project, which you can find in your personal Vectice workspace.

### Pre-requisites:
Before using this notebook you will need:
* An account in Vectice
* An API token to connect to Vectice through the APIs
* The Phase Id of the project where you want to log your work

Refer to the Vectice Tutorial Guide for more detailed instructions: https://docs.vectice.com/getting-started/tutorial


### Other Resources
*   Vectice Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Install the latest Vectice Python client library

In [None]:
%pip install --q vectice -U

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [3]:
import mlflow

## Get started by connecting to Vectice

You can learn more about the `Connection` object in the [documentation](https://api-docs.vectice.com/reference/vectice/connection/)

<div class="alert" style="color: #383d41; background-color: #e2e3e5; border-color: #d6d8db" role="alert">
<b>Automated code lineage:</b> The code lineage functionalities are not covered in this tutorial because they require setting up a Git repository first.
</div>

**First, we need to authenticate to the Vectice server. Before proceeding further:**

- Visit the Vectice app to create and copy an API token (cf. https://docs.vectice.com/getting-started/create-an-api-token)

- Paste the API token in the code below

In [None]:
import vectice

vec = vectice.connect(api_token="my-api-token") #Paste your API token
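
Pasting a token directly into a notebook makes it easy to commit it by accident. A minimal sketch of one alternative is to read the token from an environment variable. The variable name `VECTICE_API_TOKEN` and the helper `get_api_token` are hypothetical names chosen for this example, not part of the Vectice API.

```python
import os

# Hypothetical variable name; use whatever name suits your setup.
TOKEN_ENV_VAR = "VECTICE_API_TOKEN"


def get_api_token(env=os.environ):
    """Return the API token stored in `env`, or raise a clear error."""
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set the {TOKEN_ENV_VAR} environment variable first.")
    return token


# Usage in the notebook (commented out because it needs a real token):
# vec = vectice.connect(api_token=get_api_token())
```

The helper takes the environment as a parameter so it can be exercised with a plain dictionary instead of the real process environment.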

## Specify which project phase you want to document
In the Vectice UI, navigate to your personal workspace, open your default Tutorial project, go to the Modeling phase, and copy and paste your Phase Id below.

In [None]:
phase = vec.phase("PHA-xxxx") #Paste your own Modeling Phase ID

## Next we are going to create an iteration
An iteration allows you to organize your work in repeatable sequences of steps. You can have multiple iterations within a phase.

In [None]:
model_iteration = phase.create_iteration()

## Retrieve your cleaned Dataset previously created in your Data Preparation phase of the project
You can retrieve a variety of Vectice objects with the `browse('VECTICE-ID')` method, including Phases, Iterations, Datasets, and Models.

In [None]:
cleaned_ds = vec.browse("DTV-xxxx") #Get the ID of your Clean Dataset created in the Data Preparation phase
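
The cells above ship with placeholder IDs (`PHA-xxxx`, `DTV-xxxx`) that must be replaced before they run. The sketch below shows one way to catch a forgotten placeholder early. It assumes that a real ID is a three-letter uppercase prefix, a hyphen, and digits. That format is an assumption made for this example, and `is_real_id` is a hypothetical helper.

```python
import re

# Assumption: IDs look like an uppercase prefix, a hyphen, then digits (e.g. "DTV-59110").
ID_PATTERN = re.compile(r"^[A-Z]{3}-\d+$")


def is_real_id(vectice_id, prefix):
    """True when `vectice_id` has the expected prefix and no leftover placeholder."""
    return vectice_id.startswith(prefix + "-") and bool(ID_PATTERN.match(vectice_id))


print(is_real_id("PHA-xxxx", "PHA"))   # False
print(is_real_id("DTV-59110", "DTV"))  # True
```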

## Log a Dataset version

Use the following code block to log a local Dataset

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/ProductSales%20Cleaned.csv", converters = {'Postal Code': str})
df.to_csv("ProductSales Cleaned.csv")
df.head()

Unnamed: 0,Ship Mode,Segment,Country,City,State,Postal Code,Region,Category,Sub-Category,Sales,Quantity,Discount,Profit
0,Second Class,Consumer,United States,others,others,42420,South,Furniture,Bookcases,261.96,2,0.0,41.9136
1,Second Class,Corporate,United States,Los Angeles,California,90036,West,Office Supplies,Labels,14.62,2,0.0,6.8714
2,Standard Class,Consumer,United States,others,Florida,33311,South,Office Supplies,Storage,22.368,2,0.2,2.5164
3,Standard Class,Consumer,United States,Los Angeles,California,90032,West,Furniture,Furnishings,48.86,7,0.0,14.1694
4,Standard Class,Consumer,United States,Los Angeles,California,90032,West,Office Supplies,Art,7.28,4,0.0,1.9656
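
The `converters = {'Postal Code': str}` argument keeps postal codes as text. Without it, pandas would infer an integer column and any leading zero would be lost. The sketch below illustrates the effect with the standard-library `csv` module and a made-up two-row sample, so it does not depend on the tutorial file.

```python
import csv
import io

# Illustrative two-row sample, not taken from the tutorial dataset.
sample = io.StringIO("City,Postal Code\nCambridge,02139\nLos Angeles,90036\n")
rows = list(csv.DictReader(sample))

as_text = [row["Postal Code"] for row in rows]         # kept as strings
as_number = [int(row["Postal Code"]) for row in rows]  # parsed as integers

print(as_text)    # ['02139', '90036']
print(as_number)  # [2139, 90036]
```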


In [5]:
target="Sales"
X=df.drop([target],axis=1)
y=df[target]
print(X.shape)
print(y.shape)

(7994, 12)
(7994,)


In [6]:
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)
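
As a quick sanity check, the expected split sizes can be worked out by hand. This sketch assumes scikit-learn's behaviour of rounding a fractional `test_size` up to the next whole row and giving the remainder to the training set.

```python
import math

n_samples = 7994   # rows in the dataset, from the shapes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)  # 6395 1599
```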

In [7]:
# Save the modeling train test split datasets as csv files
train_df = X_train.copy()
test_df = X_test.copy()

train_df["Sales"] = y_train
test_df["Sales"] = y_test

train_df.to_csv("train dataset.csv", index=False)
test_df.to_csv("test dataset.csv", index=False)

### Log a modeling Dataset
The Vectice resource will automatically extract pertinent metadata from the local dataset file and collect statistics from the pandas dataframe. This information will be documented within the iteration as part of a Dataset version.

In [None]:
train_ds = vectice.FileResource(paths="train dataset.csv", dataframes=train_df)
test_ds = vectice.FileResource(paths="test dataset.csv", dataframes=test_df)

In [None]:
modeling_dataset = vectice.Dataset.modeling(
        name="ProductSales Modeling",
        training_resource=train_ds,
        testing_resource=test_ds, 
        derived_from=cleaned_ds,
    )

In [None]:
model_iteration.step_model_input_data = modeling_dataset

## Log a Dataset with MLflow

In [8]:
mlflow.set_tracking_uri('https://mlflow-dev.vectice.com/')

In [9]:
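# Wrap the DataFrame in an MLflow dataset: `mlflow.log_input` expects a dataset object,
# not a raw DataFrame. Both calls need an MLflow release that provides `mlflow.data`
# and `mlflow.log_input`; older releases fail with an AttributeError like the one
# recorded in this cell's output.
source_url = "https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/ProductSales%20Cleaned.csv"
dataset = mlflow.data.from_pandas(df, source=source_url, name="ProductSales Cleaned")

with mlflow.start_run():
    # Log the dataset to the MLflow run. The "training" context indicates that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()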

AttributeError: module 'mlflow' has no attribute 'log_input'
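
The error above means that the installed MLflow does not expose `log_input`. One hedged way to avoid a crash is to check for the feature before using it. `supports_dataset_logging` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins for an older and a newer `mlflow` module so the sketch runs without MLflow installed.

```python
import types


def supports_dataset_logging(mlflow_module):
    """True when the module exposes both `log_input` and the `data` namespace."""
    return hasattr(mlflow_module, "log_input") and hasattr(mlflow_module, "data")


# Stand-ins for an older and a newer MLflow module, used only for illustration.
old_mlflow = types.SimpleNamespace(log_metrics=print)
new_mlflow = types.SimpleNamespace(log_input=print, data=object())

print(supports_dataset_logging(old_mlflow))  # False
print(supports_dataset_logging(new_mlflow))  # True
```

In the notebook, the dataset-logging cell could be guarded with `if supports_dataset_logging(mlflow):` and skipped otherwise.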

In [None]:
modeling_dataset = vectice.Dataset.modeling(
    name="ProductSales Modeling",
    training_resource=train_ds,  # FileResource objects created earlier in this notebook
    testing_resource=test_ds,
    derived_from=cleaned_ds,
)

In [None]:
model_iteration.step_model_input_data = modeling_dataset

## Log a Baseline model with MLflow & Vectice

First, to establish a baseline, we log a naive model to Vectice that always returns the average sales.

In [None]:
##  baseline mean absolute error
y_mean=y_train.mean()
y_mean_pred=[y_mean] * len(y_train)
baseline_mae=mean_absolute_error(y_train,y_mean_pred)
round(baseline_mae,2)
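
To make the baseline concrete, here is the same calculation worked through on four made-up values using only the standard library. The numbers are illustrative and are not taken from the tutorial dataset.

```python
from statistics import mean

# Illustrative sales figures, not taken from the tutorial dataset.
y_train = [10.0, 20.0, 30.0, 60.0]

y_mean = mean(y_train)                            # constant prediction: 30.0
abs_errors = [abs(y - y_mean) for y in y_train]   # [20.0, 10.0, 0.0, 30.0]
baseline_mae = sum(abs_errors) / len(abs_errors)

print(baseline_mae)  # 15.0
```

Any trained model should achieve a lower mean absolute error than this constant predictor to be worth keeping.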

In [None]:
mlflow.set_tracking_uri('http://localhost:5000')

In [None]:
experiment_id = mlflow.create_experiment("baseline")

In [None]:
experiment_run = mlflow.start_run(run_name="run-003", experiment_id=experiment_id)

In [None]:
mlflow.log_metrics({"mae_baseline": round(baseline_mae,2)})

In [None]:
mlflow.log_params({"technique": "Constant predictor"})

In [None]:
mlflow.end_run()

In [None]:
run_id = experiment_run.info.run_id

### Log the Baseline model to Vectice

In [None]:
# Baseline model to compare the Ridge regression against
vect_baseline_model = vectice.Model.mlflow(run_id=run_id, client=mlflow, derived_from=modeling_dataset)

In [None]:
model_iteration.step_build_model = vect_baseline_model

### Train a Ridge regressor as a challenger

In [None]:
with mlflow.start_run(run_name='mlFlow_test', experiment_id=experiment_id) as run:
    OHE = OneHotEncoder(handle_unknown='ignore')
    scaler = StandardScaler()

    cat_cols = ['Ship Mode', 'Segment', 'Country', 'City', 'State', 'Postal Code','Region', 'Category', 'Sub-Category']
    num_cols = ['Quantity', 'Discount', 'Profit']

    transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
                                    ('num_cols', scaler, num_cols)])

    model = make_pipeline(transformer,Ridge())
    model.fit(X_train,y_train)
    
    # Making Prediction with the training data
    y_train_pred = model.predict(X_train)

    #Evaluating the model 
    mae_train=mean_absolute_error(y_train, y_train_pred)
    print(round(mae_train,2))
    
    #Making prediction on test
    y_test_pred = model.predict(X_test)
    
    #Evaluating the model 
    mae_test = mean_absolute_error(y_test, y_test_pred)
    print(round(mae_test,2))

In [None]:
# Keep the ID of the Ridge run so it can be passed to Vectice
run_id = run.info.run_id

In [None]:
feature_names = transformer.get_feature_names_out()
feature_importances = model.named_steps['ridge'].coef_

feat_imf = pd.Series(feature_importances, index=feature_names).sort_values()

feat_imf.tail(10).plot(kind="barh")
plt.ylabel("Features")
plt.xlabel("Importance")
plt.title("Feature Importance")
plt.tight_layout()
plt.savefig("Feature Importance.png")
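
The plot above sorts the signed coefficients and keeps the last ten, so it shows the largest positive values. A feature with a large negative coefficient would be left out even though it influences predictions strongly. One possible alternative, sketched here with made-up coefficients, is to rank by absolute value. The feature names follow the `<transformer>__<column>` pattern that `get_feature_names_out()` produces, but the values are purely illustrative.

```python
# Made-up coefficients for illustration; real values come from the fitted Ridge step.
coefficients = {
    "num_cols__Profit": 150.0,
    "num_cols__Discount": -220.0,
    "cat_cols__Category_Furniture": 35.0,
    "num_cols__Quantity": 90.0,
}

# Rank by magnitude so strong negative effects are not pushed to the bottom.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coef in ranked[:3]:
    print(f"{name}: {coef:+.1f}")
```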

In [None]:
vect_model = vectice.Model.mlflow(
    run_id=run_id,
    client=mlflow,
    derived_from=modeling_dataset,         # Pass your modeling dataset to document the lineage
    attachments="Feature Importance.png",  # Pass your feature importance plot as an attachment
)

### Assign the Ridge model to the step

You can add multiple models to a single step by using the `+=` operator.

In [None]:
model_iteration.step_build_model += vect_model

## Add a comment 

Passing a `string` to a step will add a comment.

In [None]:
# Record the validation outcome as a comment on the model validation step
model_iteration.step_model_validation = f"Model passed acceptance criteria\nMAE Train: {round(mae_train,2)}\nMAE Test: {round(mae_test,2)}"
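
The comment above states that the model passed acceptance criteria, but the notebook does not define them. The sketch below shows one hypothetical rule: the test MAE must beat the baseline MAE, and the relative gap between train and test MAE must stay under a threshold. The function name, the rule, and the `max_gap_ratio` default are all assumptions made for this example.

```python
def passes_acceptance(mae_train, mae_test, baseline_mae, max_gap_ratio=0.25):
    """Hypothetical rule: beat the baseline and keep the train/test gap small."""
    beats_baseline = mae_test < baseline_mae
    gap_ratio = abs(mae_test - mae_train) / mae_train
    return beats_baseline and gap_ratio <= max_gap_ratio


# Illustrative numbers only, not results from the tutorial run.
print(passes_acceptance(mae_train=80.0, mae_test=90.0, baseline_mae=150.0))   # True
print(passes_acceptance(mae_train=80.0, mae_test=160.0, baseline_mae=150.0))  # False
```

Writing the rule as a function keeps the comment logged to Vectice tied to an explicit, repeatable check.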

In [None]:
model_iteration.complete()

## 🥇 Congrats! You learned how to successfully use Vectice to auto-document the Modeling phase of the Tutorial Project.
### Next, we encourage you to explore the other notebooks in the tutorial series. You can find them in the Vectice public GitHub repository: https://github.com/vectice/GettingStarted/

✴ You can view your registered assets and comments in the UI by clicking the links in the output messages.

