# Before you start this tutorial

**Tutorial intention:** Provide an example of an iteration and its related steps in a Modeling phase so that you can:

*   Experience the data science lifecycle using Vectice
*   See how simple it is to connect your notebook to Vectice
*   Learn how to structure and log your work using Vectice

**Resources needed:**
*   <b>Tutorial Project: Forecast in-store unit sales (23.1)</b> - You can find it in your personal workspace, which is named after you
*   Dataset ready for modeling: https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_clean.csv
*   Vectice Webapp Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/sdk/index.html

## Installing Vectice

In [None]:
%pip install -q -U vectice

## Install optional packages for your project

In [None]:
%pip install -q squarify
%pip install -q plotly

## Import libraries

In [None]:
# Import mathematical and data science libraries
import pandas as pd  # data science essentials
import matplotlib.pyplot as plt  # essential graphical output
import numpy as np  # mathematical essentials
%matplotlib inline

# Import visualization libraries
import plotly.offline as py
py.init_notebook_mode(connected=True)
# import seaborn as sns  # enhanced graphical output

# Load scikit-learn packages for modeling
from sklearn.model_selection import train_test_split  # split function
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Import the Vectice library
import vectice
from vectice import FileDataWrapper, DatasetSourceUsage

# Import other libraries
import IPython.display  # this is for our data pipeline

## Vectice Config
- To log your work to Vectice, connect your notebook to your profile using your personal API token.
- Click on your profile at the top right corner of the Vectice application --> API Tokens --> Create API Token.
- Provide a name and description for the token. We recommend naming it "Tutorial_API_Token" so that you do not need to make additional changes to the notebook.
- Save the token file in a location accessible by this code.
- **If you are viewing this notebook in Google Colab, click the folder icon on the left bar and upload the file.**


#### Update the workspace name below to match the name of the workspace your project is in

In [None]:
my_vectice = vectice.connect(config="Tutorial_API_token.json")
my_workspace = my_vectice.workspace("YOUR WORKSPACE NAME") # replace workspace name
my_project = my_workspace.project("Tutorial Project: Forecast in store unit sales (23.1)")
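
If the connection step fails, a common cause is that the token file is not where the notebook expects it. The following is a minimal sketch of a check you could run before connecting. It assumes the token file is plain JSON, as its `.json` extension suggests, and it does not inspect the file's contents. `check_token_file` is a hypothetical helper, not part of the Vectice library.

```python
import json
from pathlib import Path


def check_token_file(path):
    """Return the path if the file exists and parses as JSON; raise otherwise.

    Hypothetical helper for illustration only; not part of the Vectice API.
    """
    token_path = Path(path).expanduser()
    if not token_path.is_file():
        raise FileNotFoundError(f"API token file not found: {token_path}")
    with token_path.open(encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    return token_path


# The file name mirrors the `config` argument used in the cell above.
try:
    print("Token file looks usable:", check_token_file("Tutorial_API_token.json"))
except (FileNotFoundError, ValueError) as err:
    print("Token file problem:", err)
```

File names are case-sensitive on many systems, so the name on disk has to match the `config` argument exactly.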

## Capture milestones for the Modeling phase

In [None]:
# Let's pick the first step of the Modeling phase
step = my_project.phase("Modeling").iteration().step("Select Modeling Techniques")

# Here we document the modeling technique that we will use in this iteration
step = step.next_step(message="For this first iteration we are going to use a Linear Regression model to get a base model.")

# Linear Regression Model

### Generate Test Design

* [Dataset ready for modeling](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_clean.csv)          

In [None]:
# Download the file locally
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_clean.csv -q --no-check-certificate

In [None]:
# Read the dataset and replace missing values with 0
model_ds = pd.read_csv("train_clean.csv")
model_ds = model_ds.fillna(0)

# Set split sizes
test_size = 0.40
# We will set the random seed so we always generate the same split.
random_state = 42

# Generate X_train, X_test, y_train, y_test, which we will need for modeling
X = model_ds.drop(['unit_sales'], axis=1)
y = model_ds["unit_sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

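# Save the splits to CSV so they can be logged to Vectice later.
# Note: the "validation" file holds the test targets (y_test), not a separate hold-out set.
X_train.to_csv("traindataset.csv")
X_test.to_csv("testdataset.csv")
y_test.to_csv("validatedataset.csv")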
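
To see why fixing the random seed gives a repeatable split, here is a small dependency-free sketch. It is a stand-in for illustration, not the algorithm scikit-learn uses internally. `seeded_split` and the toy `rows` list are hypothetical names.

```python
import random


def seeded_split(rows, test_size, seed):
    """Shuffle a copy of `rows` with a seeded RNG, then cut off the test fraction.

    Hypothetical helper: it illustrates the idea of a seeded split only.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # toy data standing in for dataset rows
train_a, test_a = seeded_split(rows, test_size=0.40, seed=42)
train_b, test_b = seeded_split(rows, test_size=0.40, seed=42)

print(len(train_a), len(test_a))  # 6 4
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

Calling the function again with a different seed would generally produce a different partition, which is why the seed is recorded alongside the split sizes.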

### Document the split strategy in Vectice

In [None]:
# Document, close the step and get the next one.
step = step.next_step(message=f"We split the dataset into training, testing and validation datasets. {test_size * 100}% of the data is set aside for testing.\n - Training dataset size: {X_train.shape[0]}\n - Testing dataset size: {X_test.shape[0]}\n - Validation dataset size: {y_test.shape[0]}\nOur seed to generate repeatable datasets is {random_state}")

### Linear Regression
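
As background for the cell that follows, this is a minimal sketch of what an ordinary least squares fit computes in the single-feature case. The notebook's model uses every column of the dataset, so this is a simplification. `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points.

    Hypothetical single-feature helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
toy_slope, toy_intercept = fit_line(toy_x, toy_y)
print(toy_slope, toy_intercept)  # 2.0 1.0
```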

In [None]:
# Let's create a linear regression model
model = LinearRegression()

# Fit on the training split and predict on the held-out test split
model.fit(X_train.values, y_train.values)

pred = model.predict(X_test.values)

print(f"predicted responses:\n {pred}")

# Evaluate the predictions against the true test targets
RMSE = np.sqrt(mean_squared_error(y_test.values, pred))
MAE = mean_absolute_error(y_test.values, pred)

print("root_mean_squared_error: ", RMSE)
print("mean_absolute_error: ", MAE)

metrics = {"RMSE": RMSE, "MAE": MAE}
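
For reference, the two metrics logged above can be computed by hand. The sketch below uses only the standard library and toy numbers to show the arithmetic. The names `rmse`, `mae` and `toy_metrics` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


toy_true = [3.0, 5.0, 2.0, 7.0]
toy_pred = [2.0, 5.0, 4.0, 8.0]
# Differences are 1, 0, -2, -1: squared errors sum to 6, absolute errors sum to 4.
toy_metrics = {"RMSE": rmse(toy_true, toy_pred), "MAE": mae(toy_true, toy_pred)}
print(toy_metrics)
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same predictions.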

In [None]:
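# Plot unit sales against the first feature: actual training values (green)
# and the model's predictions on the test set (black)
plt.scatter(X_train.iloc[:, 0].values, y_train, color='g')
plt.scatter(X_test.iloc[:, 0].values, pred, color='k')
plt.savefig("regression_graph.png")
plt.show()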

### Document model and its lineage in Vectice

In [None]:
# Let's log the model we trained, along with its metrics, as a new version of the "Unit Sales Predictor" model in Vectice.
# Define a data wrapper for each of the training, testing and validation datasets
train_ds = FileDataWrapper(name="Modeling Dataset", path="traindataset.csv", usage=DatasetSourceUsage.TRAINING)
test_ds = FileDataWrapper(name="Modeling Dataset", path="testdataset.csv", usage=DatasetSourceUsage.TESTING)
validate_ds = FileDataWrapper(name="Modeling Dataset", path="validatedataset.csv", usage=DatasetSourceUsage.VALIDATION)
step.modeling_dataset = [train_ds, test_ds, validate_ds]

# Attach the model, its metrics and the regression graph to the current iteration
step.iteration.model = vectice.Model(name="Unit Sales Predictor", library="scikit-learn", technique="linear regression", metrics=metrics, attachments="regression_graph.png", predictor=model)
step = step.next_step(message="RMSE= " + str(metrics["RMSE"]) + " and MAE= " + str(metrics["MAE"]))

## Assess Model

In [None]:

step.close(message="As expected, the model performs better; however, this is not good enough and we should try a different method. We recommend trying a Random Forest in a new iteration.")
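
One way to put a number on "good enough" is to compare the model's RMSE with that of a naive predictor that always returns the training mean. The sketch below is an assumption about how such a check could look, not part of the original tutorial. All names and the toy data are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(train_targets, test_targets):
    """RMSE of a naive predictor that always outputs the training mean."""
    mean_value = sum(train_targets) / len(train_targets)
    return rmse(test_targets, [mean_value] * len(test_targets))


toy_train_targets = [2.0, 4.0, 6.0, 8.0]  # mean is 5.0
toy_test_targets = [3.0, 7.0]
toy_model_preds = [4.0, 6.0]  # stand-in for model.predict(...) output

baseline_rmse = mean_baseline_rmse(toy_train_targets, toy_test_targets)
model_rmse = rmse(toy_test_targets, toy_model_preds)
improvement = 1 - model_rmse / baseline_rmse

print(baseline_rmse, model_rmse, improvement)  # 2.0 1.0 0.5
```

An improvement close to zero would suggest the model adds little over predicting the mean, which supports trying a different technique in the next iteration.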