# Before you start with this Data Preparation Notebook

This notebook is part of the Vectice tutorial project notebook series. It illustrates how to log the assets documented in the "Data Preparation" phase of the **"Tutorial: Forecast in store-unit sales"** project, which you can find in your personal Vectice workspace.

### Pre-requisites:
Before using this notebook you will need:
* An account in Vectice
* An API token to connect to Vectice through the APIs
* The Phase Id of the project where you want to log your work

Refer to the Vectice Tutorial Guide for more detailed instructions: https://docs.vectice.com/getting-started/tutorial


### Other Resources
*   Vectice Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
import mlflow
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Install the latest Vectice Python client library

In [None]:
%pip install -q -U vectice

## Get started by connecting to Vectice

In [None]:
import vectice

vec = vectice.connect(api_token="my-api-token") #Paste your API token
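
A token pasted into a notebook can leak when the notebook is shared or committed. One possible alternative, as a minimal sketch, is to read the token from an environment variable. The variable name `VECTICE_API_TOKEN` and the helper `read_api_token` are illustrative assumptions for this sketch, not something this tutorial defines.

```python
import os


def read_api_token(env_var="VECTICE_API_TOKEN"):
    """Return the API token stored in an environment variable (hypothetical name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token


# In the notebook, the token would then be passed to the same connect call as above:
# vec = vectice.connect(api_token=read_api_token())
```

The helper fails early with a readable message when the variable is unset, rather than sending an empty token to the service.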

## Specify which project phase you want to document
In the Vectice UI, navigate to your personal workspace. Inside your default Tutorial project, go to the Data Preparation phase, then copy and paste your Phase Id below.

In [None]:
phase = vec.phase("PHA-xxxx") #Put your own Data Preparation Phase ID

## Next we are going to create an iteration
An iteration allows you to organize your work in repeatable sequences of steps. You can have multiple iterations within a phase.

In [None]:
prep_iteration = phase.create_iteration()

In [None]:
df_initial = pd.read_csv("https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/SampleSuperstore.csv", converters = {'Postal Code': str})
df_initial.to_csv("SampleSuperstore.csv", index=False)
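
The `converters` argument in the cell above keeps `Postal Code` as text. The sketch below, which uses a small made-up CSV rather than the tutorial data, shows what that protects against: without the converter, pandas infers integers and drops leading zeros.

```python
import io

import pandas as pd

# Two made-up rows; the first postal code starts with a zero
csv_text = "City,Postal Code\nBoston,02108\nHenderson,42420\n"

# Default parsing infers an integer column, so "02108" becomes 2108
as_number = pd.read_csv(io.StringIO(csv_text))

# Converting the column with str keeps the value exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), converters={"Postal Code": str})

print(as_number["Postal Code"].tolist())  # [2108, 42420]
print(as_text["Postal Code"].tolist())    # ['02108', '42420']
```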

## Create MLflow Dataset
This step requires the following:
1. An existing MLflow service
    - This can be local or hosted.

In [None]:
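# mlflow.log_input expects an MLflow dataset, so wrap the DataFrame and record its source URL
source_url = "https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/SampleSuperstore.csv"
mlflow_ds = mlflow.data.from_pandas(df_initial, source=source_url)

with mlflow.start_run():
    # Log the dataset to the MLflow run. The "preparation" context labels how the data is used
    mlflow.log_input(mlflow_ds, context="preparation")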

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()
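
The digest printed above acts as a content fingerprint: the same data gives the same value, and changed data gives a different one. The sketch below illustrates that idea with `hashlib` and `pandas.util.hash_pandas_object`. It is not MLflow's own digest algorithm, and the helper name `dataframe_fingerprint` and the sample values are made up for illustration.

```python
import hashlib

import pandas as pd


def dataframe_fingerprint(df):
    """Return a short hex fingerprint of a DataFrame's column names and values."""
    hasher = hashlib.sha256()
    # Include the column names so that renaming a column changes the fingerprint
    hasher.update(",".join(map(str, df.columns)).encode("utf-8"))
    # hash_pandas_object yields one deterministic uint64 hash per row
    row_hashes = pd.util.hash_pandas_object(df, index=False)
    hasher.update(row_hashes.to_numpy().tobytes())
    return hasher.hexdigest()[:8]


sales = pd.DataFrame({"Quantity": [2, 3, 5], "Profit": [41.9, 219.6, 6.9]})
unchanged = sales.copy()
edited = sales.assign(Profit=[41.9, 219.6, 7.0])

same = dataframe_fingerprint(sales) == dataframe_fingerprint(unchanged)
different = dataframe_fingerprint(sales) != dataframe_fingerprint(edited)
print(same, different)
```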

## Log your origin dataset

In [None]:
from google.cloud import bigquery
# setup bigquery client to pass to Vectice
bq_client = bigquery.Client()

# create origin dataset resource 
origin_ds = vectice.BigQueryResource("tries-and-spikes.tutorial.productsales-origin", df_initial, bq_client)

In [None]:
# create origin dataset with resource 
origin_dataset = vectice.Dataset.origin(
    name="ProductSales Origin",
    resource=origin_ds, 
)

In [None]:
prep_iteration.step_select_data = origin_dataset

## Apply transformations to your origin dataset

In [None]:
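def wrangle(df):
    # Work on a copy so the origin DataFrame is left untouched
    df = df.copy()

    # Reduce cardinality: keep the ten most frequent cities and states, group the rest as "others"
    top_ten_cities = df["City"].value_counts().head(10).index
    df["City"] = df["City"].apply(lambda c: c if c in top_ten_cities else "others")
    top_ten_states = df["State"].value_counts().head(10).index
    df["State"] = df["State"].apply(lambda c: c if c in top_ten_states else "others")

    # Deal with outliers: keep rows whose profit lies between the 10th and 90th percentiles
    low, high = df["Profit"].quantile([0.1, 0.9])
    df = df[df["Profit"].between(low, high)]

    return df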
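
To see what the two transformations do, the sketch below applies the same ideas to a ten-row made-up table. The helper names `keep_top_categories` and `trim_to_quantiles` are illustrative and not part of the notebook.

```python
import pandas as pd


def keep_top_categories(series, n, other="others"):
    """Keep the n most frequent values and replace every other value with a placeholder."""
    top = series.value_counts().head(n).index
    return series.where(series.isin(top), other)


def trim_to_quantiles(df, column, lower=0.1, upper=0.9):
    """Keep the rows whose value in `column` lies between two quantiles (inclusive)."""
    low, high = df[column].quantile([lower, upper])
    return df[df[column].between(low, high)]


toy = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 3 + ["C", "D"],
    "Profit": [-100, 1, 2, 3, 4, 5, 6, 7, 8, 100],
})

# "C" and "D" are outside the two most frequent cities, so both become "others"
toy["City"] = keep_top_categories(toy["City"], n=2)

# The 10th and 90th percentiles are -9.1 and 17.2, so -100 and 100 are dropped
trimmed = trim_to_quantiles(toy, "Profit")

print(toy["City"].value_counts().to_dict())
print(trimmed["Profit"].tolist())
```

Note that trimming at the 10th and 90th percentiles removes up to 20% of the rows by construction, whether or not those rows are extreme.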

In [None]:
df_cleaned = wrangle(df_initial)
df_cleaned.describe()

In [None]:
# Check the Profit distribution for outliers (sns.distplot is deprecated, so use histplot)
sns.histplot(df_cleaned["Profit"], kde=True)
plt.savefig("Profit.png")

In [None]:
# Check the Quantity distribution for outliers (sns.distplot is deprecated, so use histplot)
sns.histplot(df_cleaned["Quantity"], kde=True)
plt.savefig("Quantity.png")
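
A numeric check can complement the plots. The sketch below counts values outside the common 1.5 × IQR fences. This rule differs from the 10th/90th percentile filter used in `wrangle`, and the helper name and sample values are made up for illustration.

```python
import pandas as pd


def count_iqr_outliers(values, k=1.5):
    """Count values lying more than k * IQR below the first or above the third quartile."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Nine ordinary quantities and one extreme value
quantities = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_iqr_outliers(quantities)
print(n_outliers)  # 1
```

In the notebook, the same helper could be called on `df_cleaned["Quantity"]` or `df_cleaned["Profit"]`.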

## Create MLflow Dataset 

In [None]:
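# Save the cleaned data locally (file name chosen here) and record that file as the dataset source
df_cleaned.to_csv("SampleSuperstoreCleaned.csv", index=False)
cleaned_mlflow_ds = mlflow.data.from_pandas(df_cleaned, source="SampleSuperstoreCleaned.csv")

with mlflow.start_run():
    # Log the dataset to the MLflow run. The "cleaned" context marks it as the cleaned data
    mlflow.log_input(cleaned_mlflow_ds, context="cleaned")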

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which retrieves the content from the recorded source location
# to the local filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()

## Log your clean dataset and attach the graphs

In [None]:
# create the cleaned dataset resource from the cleaned DataFrame
prepared_resource = vectice.BigQueryResource("tries-and-spikes.tutorial.productsales-cleaned", df_cleaned, bq_client)

# create the cleaned dataset with the resource
prepared_ds = vectice.Dataset.clean(
    name="ProductSales Cleaned",
    resource=prepared_resource,
    derived_from=origin_dataset,                # Origin dataset, to document the lineage
    attachments=["Profit.png", "Quantity.png"]  # Graph attachments
)
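
The attachments are referenced by file name, so they only make sense if the earlier plotting cells have already saved them. As a minimal sketch, a small check before logging can list any files that are missing. The helper name `find_missing_files` is illustrative, and how Vectice itself reacts to a missing attachment is not covered here.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


missing = find_missing_files(["Profit.png", "Quantity.png"])
if missing:
    print(f"Create these files before logging the dataset: {missing}")
```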

### Log your "ProductSales Cleaned" dataset in your step "Clean Data"

In [None]:
prep_iteration.step_clean_data = prepared_ds

In [None]:
prep_iteration.complete()

## 🥇 Congrats! You learned how to successfully use Vectice to auto-document the Data Preparation phase of the Tutorial Project.
### Next, we encourage you to explore the other notebooks in the tutorial series. You can find them in the Vectice public GitHub repository: https://github.com/vectice/GettingStarted/