In [14]:
# Install Vectice and other packages
%pip install -q vectice -U
%pip install boto3
%pip install botocore

Note: you may need to restart the kernel to use updated packages.


### Instructions

Paste your API token into the connection cell below and execute it. You can generate a token [here](https://app.vectice.com/account/api-keys).

The dataset used can be found here: https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv

In [15]:
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv -q --no-check-certificate

In [16]:
# Import vectice package
import vectice as vct

# Connect using your API token. Your token can be found here: https://app.vectice.com/account/api-keys
conn = vct.connect(
    api_token='YOUR API TOKEN', 
    host='https://app.vectice.com',
    workspace='Samples'
)

# Open the project
#conn = conn.workspace('Samples')
project = conn.project("How To: Reporting your Milestones")

Workspace 'Samples' successfully retrieved.

For quick access to the workspace in the Vectice web app, visit:
https://app.vectice.com/workspace/dashboard/project-progress?w=2110
Welcome, 'Eric Barre'. You're now successfully connected to the workspace 'Samples' in Vectice.

To access a specific project, use workspace.project(Project ID)
To get a list of projects you can access and their IDs, use workspace.list_projects()

For quick access to the list of projects in the Vectice web app, visit:
https://app.vectice.com/workspace/projects?w=2110


#### Alternate methods of connecting:  
`project = vct.connect(config='~/.config/vectice-config.json')`  
provided the JSON file contains the "WORKSPACE" and "PROJECT" entries,  
OR  
`project = vct.connect(config='~/.config/vectice-config.json', workspace="ws_name", project="project_name")`  
Both will return a project object.

[API Reference](https://api-docs.vectice.com/reference/vectice/connection/)
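
The config-file approach above can be prepared with a few lines of Python. This is a minimal sketch, not the official procedure: the "WORKSPACE" and "PROJECT" keys come from the text above, while the `VECTICE_API_TOKEN` and `VECTICE_HOST` key names and the helper `write_vectice_config` are assumptions made for illustration. Check the API reference for the exact keys your Vectice version expects. The demo writes to a temporary directory rather than `~/.config` so it has no side effects.

```python
import json
import tempfile
from pathlib import Path


def write_vectice_config(path, api_token, host, workspace, project):
    """Write a JSON config file (hypothetical helper, not part of vectice)."""
    config = {
        "VECTICE_API_TOKEN": api_token,  # assumed key name
        "VECTICE_HOST": host,            # assumed key name
        "WORKSPACE": workspace,          # entry named in the text above
        "PROJECT": project,              # entry named in the text above
    }
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


# Demo: write the file into a throwaway directory and read it back.
demo_dir = Path(tempfile.mkdtemp())
config_path = write_vectice_config(
    demo_dir / "vectice-config.json",
    api_token="YOUR API TOKEN",
    host="https://app.vectice.com",
    workspace="Samples",
    project="How To: Reporting your Milestones",
)
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```

Keeping the token in a file outside the notebook avoids pasting it into a cell that might be shared or committed.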

#### Capture your datasets and their usage

This sample uses data from our Vectice S3 bucket. 
     
We will use boto3 as a client.   

In [17]:
from boto3 import client  # Used to create a client and read from S3
from botocore import UNSIGNED
from botocore.client import Config
from vectice import FileResource, S3Resource, Dataset

s3_client = client('s3', config=Config(signature_version=UNSIGNED), region_name='us-west-1')


The first cell illustrates how to create a Vectice dataset object, set it as an origin asset, and add its lineage to the source of the data.

The second cell shows how to tag/attach a clean dataset, ready for modeling, to your project.

The third cell captures the definition of your modeling dataset (a compound dataset: training, testing, validation).

In [18]:
import pandas as pd
# Here would be your code to capture your source datasets as part of your DS workflow
# and build one or more dataframes for your EDA/Data Prep 
# ...
response = s3_client.get_object(Bucket="vectice-examples", Key="Tutorial/ForecastTutorial/stores.csv")
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    stores_df=pd.read_csv(response.get("Body"))
    print(stores_df)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

response = s3_client.get_object(Bucket="vectice-examples", Key="Tutorial/ForecastTutorial/transactions.csv")
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    transaction_df=pd.read_csv(response.get("Body"))
    print(transaction_df)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

Successful S3 get_object response. Status - 200
    store_nbr           city                           state type  cluster
0           1          Quito                       Pichincha    D       13
1           2          Quito                       Pichincha    D       13
2           3          Quito                       Pichincha    D        8
3           4          Quito                       Pichincha    D        9
4           5  Santo Domingo  Santo Domingo de los Tsachilas    D        4
5           6          Quito                       Pichincha    D       13
6           7          Quito                       Pichincha    D        8
7           8          Quito                       Pichincha    D        8
8           9          Quito                       Pichincha    B        6
9          10          Quito                       Pichincha    C       15
10         11        Cayambe                       Pichincha    B        6
11         12      Latacunga                        
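
The cell above repeats the same fetch-and-check pattern for each file, which makes copy-paste slips easy. One possible way to factor it out is sketched below. `fetch_s3_body` and `FakeS3Client` are hypothetical names introduced here; the fake client only mimics the shape of the `get_object` response used above so the sketch runs without network access. Note that a real boto3 client typically raises an exception on a failed request, so the status check is mostly defensive.

```python
import io


def fetch_s3_body(s3_client, bucket, key):
    """Return the streaming body of an S3 object, or raise on a non-200 status."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    if status != 200:
        raise RuntimeError(
            f"Unsuccessful S3 get_object response for {key}. Status - {status}"
        )
    print(f"Successful S3 get_object response for {key}. Status - {status}")
    return response["Body"]


class FakeS3Client:
    """Hypothetical stand-in for a boto3 S3 client, for offline demonstration."""

    def __init__(self, objects):
        self._objects = objects  # maps key -> bytes

    def get_object(self, Bucket, Key):
        if Key not in self._objects:
            return {"ResponseMetadata": {"HTTPStatusCode": 404}}
        return {
            "ResponseMetadata": {"HTTPStatusCode": 200},
            "Body": io.BytesIO(self._objects[Key]),
        }


# Demo with in-memory data instead of the real bucket.
fake_client = FakeS3Client(
    {"Tutorial/ForecastTutorial/stores.csv": b"store_nbr,city\n1,Quito\n"}
)
body = fetch_s3_body(
    fake_client, "vectice-examples", "Tutorial/ForecastTutorial/stores.csv"
)
first_line = body.readline().decode().strip()
print(first_line)
```

With the real client from the cell above, each dataframe would then be loaded in one line, for example `stores_df = pd.read_csv(fetch_s3_body(s3_client, "vectice-examples", "Tutorial/ForecastTutorial/stores.csv"))`.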

In [19]:
# Now that we have identified our data, let's document this milestone
# Where were we...
project.list_phases()

There are 3 phases in the project 'How To: Reporting your Milestones' and a maximum of 10 phases are displayed in the table below:

**Legend**     [C]  Completed     [IP] In Progress
               [IR] In Review     [NS] Not Started




To access a specific phase, use project.phase(Phase ID)

For quick access to the list of phases for this project, visit:
https://app.vectice.com/project?w=2110&resourceId=8218



In [20]:

# Let's instantiate an iteration for the first phase in the list above (Document Dataset)
phase_iter = project.phase("Document Dataset").create_iteration()  # We could have used its ID instead

# Let's have a look at the steps needed to be completed
phase_iter.list_steps()

Phase 'Document Dataset' successfully retrieved.

For quick access to the Phase in the Vectice web app, visit:
https://app.vectice.com/project/phase?w=2110&pageId=35324
New Iteration number '2' created.

For quick access to the Iteration in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539


There are 3 steps in Iteration '2' and a maximum of 10 steps are displayed in the table below:




To access a specific step, use the step shortcut iteration.step_my_step_name
The step reference is referred to as shortcut

For quick access to the list of steps in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539



In [21]:
# Great, I now know the structure of my project and the steps I need to document

# Document the origin datasets used for this iteration - documenting the source data
# Vectice uses your existing objects, like a dataframe, to calculate relevant metrics and publish them straight to Vectice
phase_iter.step_identify_source_datasets = Dataset.origin( 
    name="Stores", 
    resource=S3Resource(
        uris="s3://vectice-examples/Tutorial/ForecastTutorial/stores.csv", 
        dataframes=stores_df
        )
    )

phase_iter.step_identify_source_datasets += Dataset.origin(
    name="Transactions",
    resource=S3Resource(
        uris="s3://vectice-examples/Tutorial/ForecastTutorial/transactions.csv", 
        dataframes=transaction_df
        )
    )

# Easily document datasets from various sources, for example using a FileResource for local files
phase_iter.step_identify_source_datasets += Dataset.origin(name="Items", resource=FileResource(paths="items.csv"))

# More examples here: https://docs.vectice.com/python-api-docs/how-to-register-datasets


# Adding a comment to the step
phase_iter.step_identify_source_datasets = "The datasets for the project have been identified as 'stores.csv' and 'transactions.csv'.\nBoth files are located in the 'vectice-examples' S3 bucket.\nWe also identified a local file to help with feature engineering."

New Version: 'Version 1' of Dataset: 'Stores' added to Step: Identify source datasets
Attachments: None
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

New Version: 'Version 1' of Dataset: 'Transactions' added to Step: Identify source datasets
Attachments: None
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

File: items.csv wrapped successfully.
New Version: 'Version 1' of Dataset: 'Items' added to Step: Identify source datasets
Attachments: None
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

Added Comment to Step: Identify source datasets

Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539
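
The notebook refers to the same S3 objects in two forms: a bucket and key for `get_object`, and an `s3://` URI for `S3Resource`. A small pair of helpers can keep the two in sync so that a typo in one form does not silently point the documentation at a different file. This is only a sketch; `to_s3_uri` and `from_s3_uri` are hypothetical names, not part of boto3 or vectice.

```python
from urllib.parse import urlparse


def to_s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def from_s3_uri(uri):
    """Split an s3:// URI back into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


BUCKET = "vectice-examples"
STORES_KEY = "Tutorial/ForecastTutorial/stores.csv"

stores_uri = to_s3_uri(BUCKET, STORES_KEY)
print(stores_uri)
print(from_s3_uri(stores_uri))
```

Defining the bucket and keys once and deriving the URIs from them means the data you read and the data you document always refer to the same object.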



In [22]:
# Data Scientist code for data preparation, normalization, etc...
# Here you would execute your EDA, and other Data Prep pipeline you might have

response = s3_client.get_object(Bucket="vectice-examples", Key="Tutorial/ForecastTutorial/train_reduced.csv")
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    train_clean_df=pd.read_csv(response.get("Body"))
    print(train_clean_df)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")


Successful S3 get_object response. Status - 200

In [23]:

# This time let's document our clean dataset in Vectice...
phase_iter.step_prep_datasets = Dataset.clean( 
    name="Normalized_Cleaned", 
    resource=S3Resource(
        uris="s3://vectice-examples/Tutorial/ForecastTutorial/train_reduced.csv", 
        dataframes=train_clean_df))

msg = "As part of our standard Data Pipeline process we applied the following preparation to our datasets: - Handling of missing data - Applied standard scaler to numerical attributes - Converted categorical data into numerical - Split values in numerical value.."
phase_iter.step_prep_datasets = msg

New Version: 'Version 1' of Dataset: 'Normalized_Cleaned' added to Step: Prep datasets
Attachments: None
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

Added Comment to Step: Prep datasets

Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539
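
The step comment above packs a list of preparation steps into one long string. One way to keep such comments readable is to build them from a list, as in the sketch below. `build_step_comment` is a hypothetical helper, and the assumption that a newline-separated comment suits your needs follows from the `\n` characters used in the earlier step comment.

```python
def build_step_comment(intro, items):
    """Join an intro sentence and a list of items into a multi-line comment."""
    lines = [intro] + [f"- {item}" for item in items]
    return "\n".join(lines)


prep_steps = [
    "Handling of missing data",
    "Applied standard scaler to numerical attributes",
    "Converted categorical data into numerical",
]
msg = build_step_comment(
    "As part of our standard Data Pipeline process we applied the following "
    "preparation to our datasets:",
    prep_steps,
)
print(msg)
```

The resulting string can be assigned to a step exactly like the hand-written `msg` in the cell above.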



In [24]:
# Hmmmmm, which step am I on again? 
# Vectice makes it easy to follow your progress
phase_iter.list_steps()

# Ok, looks like I am working on the "step_partition_dataset"

There are 3 steps in Iteration '2' and a maximum of 10 steps are displayed in the table below:




To access a specific step, use the step shortcut iteration.step_my_step_name
The step reference is referred to as shortcut

For quick access to the list of steps in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539



In [25]:
# Data Scientist code to generate training, testing, and validation dataframes
# ...

# Define testing, training and validation resources
train_ds = S3Resource(uris="s3://vectice-examples/Tutorial/ForecastTutorial/traindataset.csv")
test_ds = S3Resource(uris="s3://vectice-examples/Tutorial/ForecastTutorial/testdataset.csv")
validate_ds = S3Resource(uris="s3://vectice-examples/Tutorial/ForecastTutorial/validatedataset.csv")

dataset = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=train_ds,
    testing_resource=test_ds,
    validation_resource=validate_ds,
)
# Attach the modeling dataset to the step
phase_iter.step_partition_dataset = dataset
# Document the step with a comment
phase_iter.step_partition_dataset = "We split the cleaned dataset into training, testing, and validation datasets. 40% of the data is set aside for testing, and the seed used to generate repeatable datasets is 42."
# Mark the iteration as complete
phase_iter.complete()

New Version: 'Version 1' of Dataset: 'Modeling Dataset' added to Step: Partition dataset
Attachments: None
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

Added Comment to Step: Partition dataset

Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539

Iteration with index 2 completed.

For quick access to the Iteration in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6539
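
The partitioning code itself is elided in the cell above ("Data Scientist code to generate training, testing, and validation dataframes"). The sketch below shows one way a repeatable split with a 40% test share and seed 42 could look, using only the standard library. It is an illustration, not the notebook's actual method: `split_rows` is a hypothetical helper, and the share of the remaining rows assigned to validation (25% here) is an assumption, since the step comment does not state it.

```python
import random


def split_rows(rows, test_fraction=0.4, validation_fraction=0.25, seed=42):
    """Shuffle rows reproducibly, then split into train, test, validation.

    test_fraction is taken from all rows; validation_fraction (an assumed
    value) is taken from what remains after the test rows are removed.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order

    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    n_validation = int(len(remainder) * validation_fraction)
    validation = remainder[:n_validation]
    train = remainder[n_validation:]
    return train, test, validation


rows = list(range(100))  # stand-in for the rows of the cleaned dataset
train_rows, test_rows, validation_rows = split_rows(rows)
print(len(train_rows), len(test_rows), len(validation_rows))
```

Because the shuffle uses its own seeded `random.Random` instance, rerunning the cell reproduces the same three partitions regardless of other random calls in the notebook.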
