In [None]:
# Install Vectice and other packages
%pip install -q vectice -U
%pip install boto3
%pip install botocore

### Instructions

Paste your API token into the connection cell below and run it. You can generate a token [here](https://app.vectice.com/account/api-keys).   

The dataset used in this notebook can be found here: https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv

In [1]:
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv -q --no-check-certificate
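
If `wget` is not available in your environment, the same download can be sketched with the Python standard library. This is a minimal illustration, not part of the Vectice API: the `download` helper name is hypothetical, and unlike the `wget` call above it keeps certificate verification enabled.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the instructions above
ITEMS_URL = "https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv"


def download(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file already exists."""
    target = Path(destination)
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Usage (requires network access):
# download(ITEMS_URL, "items.csv")
```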

In [3]:
# Import vectice package
import vectice as vct
from vectice import DatasetType, DatasetSourceUsage, Dataset, FileResource, S3Resource


# Connect using your API token. Your token can be found here: https://app.vectice.com/account/api-keys
conn = vct.connect(
    api_token='YOUR API TOKEN', 
    host='https://app.vectice.com',
    workspace='Samples'
)

# Optional Vectice flags
vct.code_capture = False #ON by default

# Open the project
#conn = conn.workspace('Samples')
project = conn.project("How To: Reporting your Milestones")

Welcome, 'Eric Barre'. You`re now successfully connected to the workspace 'Samples' in Vectice.

To access a specific project, use workspace.project(Project ID)
To get a list of projects you can access and their IDs, use workspace.list_projects()

For quick access to the list of projects in the Vectice web app, visit:
https://app.vectice.com/workspace/projects?w=2110


#### Alternate methods of connecting:  
<ul>project = vct.connect(config='~/.config/vectice-config.json')</ul>  
provided the JSON file contains the "WORKSPACE" and "PROJECT" entries,  
OR  
<ul>project = vct.connect(config='~/.config/vectice-config.json', workspace="ws_name", project="project_name")</ul>  
Both calls return a project object.

[API Reference](https://api-docs.vectice.com/reference/vectice/connection/)
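
As a rough sketch, a configuration file with the two entries named above could be generated as follows. Only the "WORKSPACE" and "PROJECT" keys come from this notebook; the `write_vectice_config` helper is hypothetical, and any other entries the file may need (for example the API token and host) are not specified here, so check the API reference for their exact names.

```python
import json
from pathlib import Path


def write_vectice_config(path: str, workspace: str, project: str) -> Path:
    """Write a JSON file holding the WORKSPACE and PROJECT entries."""
    # Only these two keys are named in this notebook; other required
    # entries (token, host) are NOT covered by this sketch.
    config = {"WORKSPACE": workspace, "PROJECT": project}
    target = Path(path).expanduser()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2))
    return target


# Usage:
# write_vectice_config("~/.config/vectice-config.json",
#                      "Samples", "How To: Reporting your Milestones")
```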

#### Capture your datasets and their usage

This sample uses data from our Vectice S3 bucket.

We will use boto3 as the client, configured for unsigned (anonymous) requests.   

In [4]:
from boto3 import client  # Used to create a client and read from S3
from botocore import UNSIGNED
from botocore.client import Config
from vectice import FileResource, S3Resource, GCSResource, DatasetSourceUsage

s3_client = client('s3', config=Config(signature_version=UNSIGNED), region_name='us-west-1')
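
Before registering an S3 file, it can help to confirm that the unsigned client can actually reach it. The sketch below is an optional check, not a Vectice requirement: `s3_object_size` is a hypothetical helper that issues boto3's `head_object` call and returns the reported size. It assumes the bucket allows anonymous HEAD requests.

```python
def s3_object_size(s3_client, bucket_name: str, resource_path: str) -> int:
    """Return the size in bytes of an S3 object using a HEAD request."""
    # head_object raises botocore.exceptions.ClientError if the object
    # is missing or not readable with the current (unsigned) client.
    response = s3_client.head_object(Bucket=bucket_name, Key=resource_path)
    return response["ContentLength"]


# Usage with the client created above (requires network access):
# s3_object_size(s3_client, "vectice-examples",
#                "Tutorial/ForecastTutorial/stores.csv")
```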


The first cell illustrates how to create a Vectice metadataset object, set it as an origin asset, and add its lineage to the source of the data    

The second cell shows how to tag/attach a clean dataset ready for modeling to your project   

The third one captures the definition of your modeling dataset (a compound dataset: training, testing, and validation)

In [5]:
# Data Scientist code to build data frames with data
# ...

# Let's list out the phases and status - Vectice is bi-directional
project.list_phases()

There are 3 phases in the project 'How To: Reporting your Milestones'.

**Legend**     [C]  Completed     [IP] In Progress
               [IR] In Review     [NS] Not Started




>> To access a specific phase, use project.phase(PhaseID)
>> To access an array of phases, use project.phases

For quick access to the list of phases for this project, visit:
https://app.vectice.com/project?w=2110&resourceId=8054



In [6]:
# Let's instantiate an iteration for the second phase in the list above (Document Dataset)
phase_iter = project.phase("Document Dataset").iteration()

# Let's have a look at the steps needed to be completed
phase_iter.list_steps()

Phase 'Document Dataset' successfully retrieved.
New Iteration number 2 created.
Iteration number 2 successfully retrieved.


There are 3 steps in iteration 2.




>> To access a specific step, use iteration.step(Step ID)
>> To access an array of steps, use iteration.steps

For quick access to the list of steps in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6352
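
The cells that follow address each step through an attribute such as `step_identify_source_datasets`. Judging only from the three step names used in this notebook, the attribute appears to be the step name lowercased, with spaces replaced by underscores and a `step_` prefix. The helper below sketches that inferred pattern; it is an assumption drawn from these examples, not a documented Vectice rule.

```python
import re


def step_attribute_name(step_name: str) -> str:
    """Guess the iteration attribute for a step name (inferred pattern)."""
    # Lowercase, collapse runs of non-alphanumerics into underscores,
    # then add the "step_" prefix seen in this notebook.
    slug = re.sub(r"[^0-9a-z]+", "_", step_name.lower()).strip("_")
    return f"step_{slug}"


for name in ("Identify source datasets", "Prep datasets", "Partition dataset"):
    print(name, "->", step_attribute_name(name))
```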



In [7]:
# Great, I now know the structure of my project and the items I need to complete

# Document the original datasets used for this iteration (step_identify_source_datasets)

# Using an S3Resource object for files on AWS S3:
phase_iter.step_identify_source_datasets = Dataset.origin(
    name="Stores",
    resource=S3Resource(
        s3_client=s3_client,
        bucket_name='vectice-examples',
        resource_path="Tutorial/ForecastTutorial/stores.csv",
    ),
)
phase_iter.step_identify_source_datasets += Dataset.origin(
    name="Transactions",
    resource=S3Resource(
        s3_client=s3_client,
        bucket_name='vectice-examples',
        resource_path="Tutorial/ForecastTutorial/transactions.csv",
    ),
)

# Using a FileResource for local files
local_file = FileResource(path="items.csv")
phase_iter.step_identify_source_datasets += Dataset.origin(name="Items",resource=local_file)

# More examples here: https://docs.vectice.com/python-api-docs/how-to-register-datasets


# Adding a comment to the step
phase_iter.step_identify_source_datasets = "The datasets for the project have been identified as 'stores.csv' and 'transactions.csv'.\nBoth files are located under the 'vectice-examples' S3 bucket.\nWe also identified a local file to help with feature engineering."

Successfully registered Dataset(name='Stores', id=23997, version='Version 1', type=ORIGIN).
Existing Dataset: 'Stores' Version: 'Version 1' added to Step: Identify source datasets
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6352

Iteration number 2 successfully retrieved.
Successfully registered Dataset(name='Transactions', id=23998, version='Version 1', type=ORIGIN).
Successfully added Dataset(name='Transactions', id=23998, version='Version 1', type=ORIGIN) to Identify source datasets
File: items.csv wrapped successfully.
Iteration number 2 successfully retrieved.
Successfully registered Dataset(name='Items', id=24414, version='Version 2', type=ORIGIN).
Successfully added Dataset(name='Items', id=24414, version='Version 2', type=ORIGIN) to Identify source datasets
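
Before wrapping a local file in a `FileResource`, a quick sanity check of its contents can catch an empty or truncated download. This is a standard-library sketch only; `describe_csv` is a hypothetical helper and is not part of Vectice.

```python
import csv


def describe_csv(path: str):
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


# Usage, once items.csv has been downloaded:
# columns, rows = describe_csv("items.csv")
```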


In [8]:
# Data Scientist code for data preparation, normalization, etc...
# ...

# Register the cleaned dataset in Vectice
phase_iter.step_prep_datasets = Dataset.clean(
    name="Normalized_Cleaned",
    resource=S3Resource(
        s3_client=s3_client,
        bucket_name='vectice-examples',
        resource_path="Tutorial/ForecastTutorial/train_clean.csv",
    ),
)

msg = "As part of our standard Data Pipeline process we applied the following preparation to our datasets: - Handling of missing data - Applied standard scaler to numerical attributes - Converted categorical data into numerical - Split values in numerical value.."
phase_iter.step_prep_datasets = msg

Successfully registered Dataset(name='Normalized_Cleaned', id=24000, version='Version 1', type=CLEAN).
Existing Dataset: 'Normalized_Cleaned' Version: 'Version 1' added to Step: Prep datasets
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6352
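
The preparation code itself is elided in the cell above. Purely as an illustration of three of the operations named in the step comment (handling missing data, standard scaling, and converting categorical values to numbers), here is a dependency-free sketch. The function names and the choice of mean imputation are assumptions, not the pipeline actually used to produce `train_clean.csv`.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standard_scale(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def encode_categories(values):
    """Map each distinct label to an integer code (sorted order)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


prices = standard_scale(impute_mean([1.0, None, 3.0]))
families, mapping = encode_categories(["dairy", "bakery", "dairy"])
```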



In [9]:
# Hmmmmm, which step am I on again? 
# Vectice makes it easy to follow your progress
phase_iter.list_steps()

# Ok, looks like I am working on the "step_partition_dataset"

There are 3 steps in iteration 2.




>> To access a specific step, use iteration.step(Step ID)
>> To access an array of steps, use iteration.steps

For quick access to the list of steps in the Vectice web app, visit:
https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6352



In [10]:
# Data Scientist code to generate training, testing, and validation dataframes
# ...

# Define testing, training and validation resources
train_ds = S3Resource(s3_client=s3_client, bucket_name='vectice-examples', resource_path="Tutorial/ForecastTutorial/traindataset.csv")
test_ds = S3Resource(s3_client=s3_client, bucket_name='vectice-examples', resource_path="Tutorial/ForecastTutorial/testdataset.csv")
validate_ds = S3Resource(s3_client=s3_client, bucket_name='vectice-examples', resource_path="Tutorial/ForecastTutorial/validatedataset.csv")

dataset = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=train_ds,
    testing_resource=test_ds,
    validation_resource=validate_ds,
)
phase_iter.step_partition_dataset = dataset
# Document the step and automatically attach the dataset
phase_iter.step_partition_dataset = "We split the cleaned dataset into training, testing, and validation datasets. 40% of the data is set aside for testing, and the seed used to generate repeatable datasets is 42."

Successfully registered Dataset(name='Modeling Dataset', id=24415, version='Version 1', type=MODELING).
New Dataset: 'Modeling Dataset' Version: 'Version 1' added to Step: Partition dataset
Link to Step: https://app.vectice.com/project/phase/iteration?w=2110&iterationId=6352
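
The split code is elided in the cell above. As a hedged sketch of a repeatable partition, the helper below shuffles rows with seed 42 and sets aside 40% for testing, as described in the step comment. The validation share is not stated in this notebook, so the 10% default is an assumption, and `split_rows` is a hypothetical name.

```python
import random


def split_rows(rows, test_fraction=0.4, validation_fraction=0.1, seed=42):
    """Return (train, test, validation) lists from a seeded shuffle."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * validation_fraction)  # assumed share
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, validation


train, test, validation = split_rows(range(100))
print(len(train), len(test), len(validation))
```

Using a dedicated `random.Random(seed)` instance keeps the split repeatable without touching the global random state.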

