# Before you start with this Data Preparation Notebook

This notebook is part of the Vectice tutorial project notebook series. It illustrates how to log the assets documented in the "Data Preparation" phase of the **"Tutorial: Forecast in store-unit sales"** project, which you can find in your personal Vectice workspace.

### Pre-requisites:
Before using this notebook you will need:
* An account in Vectice
* An API key to connect to Vectice through the APIs
* The Phase Id of the project where you want to log your work

Refer to the Vectice Tutorial Guide for more detailed instructions: https://docs.vectice.com/getting-started/tutorial


### Other Resources
*   Vectice Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Install the latest Vectice Python client library

In [None]:
%pip install -q -U vectice

## Get started by connecting to Vectice

In [None]:
import vectice

connect = vectice.connect(api_token="your-api-key") #Paste your API key
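Pasting the key directly into the cell is fine for a quick test, but it is easy to commit it by accident. A minimal sketch of an alternative, assuming you have exported the key in an environment variable: the helper `load_api_token` and the variable name `VECTICE_API_TOKEN` are illustrative choices made for this example, not something this notebook requires.

```python
import os


def load_api_token(env_var: str = "VECTICE_API_TOKEN") -> str:
    """Read the API key from an environment variable (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Vectice API key."
        )
    return token


# Usage in this notebook (needs the vectice package and a valid key):
# connect = vectice.connect(api_token=load_api_token())
```

With this approach the notebook can be shared without editing out the key first.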

## Specify which project phase you want to document
In the Vectice app, navigate to your personal workspace. Inside your default Tutorial project, go to the Data Preparation phase, then copy its Phase Id and paste it below.

In [None]:
phase = connect.phase("PHA-xxxx") #Paste your own Data Preparation Phase ID

## Next we are going to create an iteration
An iteration allows you to organize your work in repeatable sequences. You can have multiple iterations within a phase, and each iteration can be organized into sections.

In [None]:
iteration = phase.create_or_get_current_iteration()

In [None]:
df_initial = pd.read_csv("https://raw.githubusercontent.com/vectice/GettingStarted/main/23.3/tutorial/SampleSuperstore.csv", converters = {'Postal Code': str})
df_initial.to_csv("SampleSuperstore.csv", index=False)
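The `converters = {'Postal Code': str}` argument above matters because pandas would otherwise infer the column as an integer and drop leading zeros. A small sketch on made-up data (the two rows below are invented for illustration and are not taken from the tutorial dataset):

```python
import io

import pandas as pd

# Two invented rows; the first postal code starts with a zero.
csv_text = "City,Postal Code\nBoston,02108\nSeattle,98103\n"

# Default type inference parses the column as integers.
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the codes exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), converters={"Postal Code": str})

print(as_default["Postal Code"].tolist())
print(as_text["Postal Code"].tolist())
```

The first list holds integers, so the leading zero is lost; the second keeps the original strings.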

## Log your origin dataset

In [None]:
origin_ds = vectice.FileResource(paths="SampleSuperstore.csv", dataframes=df_initial)


origin_dataset = vectice.Dataset.origin(
    name="ProductSales Origin",
    resource=origin_ds, 
)

In [None]:
iteration.log(origin_dataset, section = "select data")

## Apply transformation to your origin dataset 

In [None]:
def wrangle(df):
    # Work on a copy so the DataFrame passed in is left unchanged
    df = df.copy()

    # Reducing cardinality
    # Get the top 10 cities by value count
    top_ten_cities = df["City"].value_counts().head(10).index
    # Keep the top 10 cities and relabel every other city as "others"
    df["City"] = df["City"].apply(lambda c: c if c in top_ten_cities else "others")
    # Get the top 10 states by value count
    top_ten_states = df["State"].value_counts().head(10).index
    # Keep the top 10 states and relabel every other state as "others"
    df["State"] = df["State"].apply(lambda c: c if c in top_ten_states else "others")

    # Dealing with outliers
    # Get the 10% and 90% quantiles of the profit distribution
    q_low, q_high = df["Profit"].quantile([0.1, 0.9])
    # Keep only the rows whose profit lies between the two quantiles (inclusive)
    df = df[df["Profit"].between(q_low, q_high)]

    return df
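To see what the two steps in `wrangle` do, it can help to run the same logic on a tiny table. The sketch below is an illustration only: the helper names `reduce_cardinality` and `trim_by_quantiles`, the `top_n` parameter, and the six toy rows are assumptions made for this example, and it uses `Series.where` with `isin` rather than the `apply` call used above.

```python
import pandas as pd


def reduce_cardinality(series: pd.Series, top_n: int) -> pd.Series:
    """Keep the top_n most frequent values and relabel the rest as "others"."""
    top_values = series.value_counts().head(top_n).index
    return series.where(series.isin(top_values), "others")


def trim_by_quantiles(df: pd.DataFrame, column: str, lower=0.1, upper=0.9):
    """Keep rows whose value in `column` lies between the two quantiles."""
    low, high = df[column].quantile([lower, upper])
    return df[df[column].between(low, high)]


# Invented toy data: three cities, two extreme profit values.
toy = pd.DataFrame({
    "City": ["Paris", "Paris", "Paris", "Lyon", "Lyon", "Nice"],
    "Profit": [-500.0, 10.0, 20.0, 30.0, 40.0, 900.0],
})

toy["City"] = reduce_cardinality(toy["City"], top_n=2)
trimmed = trim_by_quantiles(toy, "Profit")
print(toy["City"].tolist())
print(trimmed["Profit"].tolist())
```

Here "Nice" is the least frequent city, so it becomes "others", and the two extreme profits fall outside the 10% to 90% band and are dropped.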

In [None]:
df_cleaned = wrangle(df_initial)
df_cleaned.describe()

In [None]:
#Checking for outliers
sns.displot(df_cleaned["Profit"]);
plt.savefig("Profit.png")

In [None]:
#Checking for outliers
sns.displot(df_cleaned["Quantity"])
plt.savefig("Quantity.png")
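The plots above are a visual check. If you also want a number to go with them, one common option is the 1.5 × IQR rule. This is a swapped-in technique shown as a sketch, not part of the tutorial's own workflow; the helper `count_iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Invented sample with one obviously extreme value.
sample = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50])
print(count_iqr_outliers(sample))
```

In the notebook you could call it as `count_iqr_outliers(df_cleaned["Quantity"])` to complement the histogram.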

In [None]:
df_cleaned.to_csv("ProductSales Cleaned.csv", index=False)

## Log your clean dataset and attach the graphs

In [None]:
cleaned_resource = vectice.FileResource(paths="ProductSales Cleaned.csv", dataframes=df_cleaned)


prepared_ds = vectice.Dataset.clean(
    name="ProductSales Cleaned",
    resource=cleaned_resource,
    derived_from=origin_dataset,                # Origin dataset, used to document the lineage
    attachments=["Profit.png", "Quantity.png"]  # Graphs to attach
)
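The attachments are referenced by file path, so they need to exist on disk when the dataset is logged. A minimal sketch of a pre-flight check, assuming the plots were saved to the working directory as in the cells above; `missing_files` is a hypothetical helper written for this example, and the temporary files only stand in for real plots.

```python
import tempfile
from pathlib import Path


def missing_files(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Demo in a temporary directory: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "Profit.png"
    existing.write_bytes(b"")  # empty stand-in for a saved plot
    absent = Path(tmp) / "Quantity.png"
    demo_missing = missing_files([existing, absent])

print(demo_missing)
```

In the notebook, `missing_files(["Profit.png", "Quantity.png"])` should return an empty list before you log the dataset.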

### Log your "ProductSales Cleaned" dataset in your iteration in the section "Clean data"

In [None]:
iteration.log(prepared_ds, section = "clean data")

In [None]:
iteration.complete()

## 🥇 Congrats! You learned how to successfully use Vectice to auto-document the Data Preparation phase of the Tutorial Project.
### Next, we encourage you to explore the other notebooks in the tutorial series. You can find those notebooks in the Vectice Tutorial Guide: [Want to learn more about the other phases of the tutorial project?](https://docs.vectice.com/getting-started/tutorial#want-to-learn-more-about-the-other-phases-of-the-tutorial-project)