# Before you start with this tutorial

**Tutorial intention:** Provide an example of an iteration and its related steps on a project phase (here, "Data Understanding") so that you can:

*   Experience the data science lifecycle using Vectice
*   See how simple it is to connect your notebook to Vectice
*   Learn how to structure and log your work using Vectice

**Resources needed:**
*   <b>Tutorial Project: Forecast in-store unit sales (23.2)</b> - You can find it in your personal workspace

**Other resources:**
*   Vectice Webapp Documentation: https://docs.vectice.com/
*   Vectice API documentation: https://api-docs.vectice.com/sdk/index.html

# 1. Getting Started         

**First, we need to install Vectice and authenticate to the Vectice server. Before proceeding further:**
*   Visit the Vectice app (https://app.vectice.com/account/api-keys) to create and download an API token. Name it "My Token" (the cell below expects the file "My Token.json")
*   Upload the file to Colab by clicking on the "folder" icon on the left-hand taskbar and selecting "Upload to Session Storage"

Then execute the cell below to install Vectice and connect to your instance.

In [None]:
%pip install -q -U vectice
import vectice as vct

vec = vct.connect(config="My Token.json")
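
If the connection cell fails, a missing or misnamed token file is a likely cause. The sketch below is a hypothetical helper (`check_token_file` is not part of the Vectice SDK) that only checks that the file exists and parses as JSON. It assumes the token is a JSON file, as the `.json` extension suggests, and makes no assumption about the keys it contains.

```python
import json
from pathlib import Path


def check_token_file(path="My Token.json"):
    """Return (ok, message) describing whether the token file looks usable.

    Hypothetical helper for illustration; it is not part of the Vectice SDK.
    """
    token_path = Path(path)
    if not token_path.is_file():
        return False, f"File not found: {token_path}"
    try:
        # Assumption: the token file is plain JSON
        json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return False, f"Not valid JSON: {err.msg}"
    return True, "Token file found and readable as JSON"


ok, message = check_token_file()
print(message)
```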

#### You have successfully installed Vectice in your notebook and connected to your instance. 
#### Wasn't that easy?

## Optional libraries - depending on your project

In [None]:
%pip install -q squarify
%pip install -q plotly
%pip install -q seaborn
%pip install -q -U nbformat

## Import libraries

In [None]:
# importing libraries

import pandas as pd  # data science essentials
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
import matplotlib.pyplot as plt  # essential graphical output
import seaborn as sns  # enhanced graphical output
import numpy as np   # mathematical essentials
import squarify
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px

## Download the datasets and config file used in this notebook

The datasets used in this project can be found here:<br>
* [items.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv)<br>
* [holidays_events.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/holidays_events.csv)<br>
* [stores.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/stores.csv)<br>
* [oil.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/oil.csv)<br>
* [transactions.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/transactions.csv)<br>
* [train_reduced.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_reduced.csv)

#### Execute the cell below to download the files locally

In [None]:
# Download the files locally
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/holidays_events.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/stores.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/oil.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/transactions.csv -q --no-check-certificate
!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_reduced.csv -q --no-check-certificate
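
Outside Colab, `wget` may not be available. As a rough alternative, the same files can be fetched with the Python standard library. This is a minimal sketch: `dataset_url` and `download_missing` are hypothetical helper names, and unlike the `wget` calls above, this version keeps certificate checking enabled.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/"
FILES = [
    "items.csv",
    "holidays_events.csv",
    "stores.csv",
    "oil.csv",
    "transactions.csv",
    "train_reduced.csv",
]


def dataset_url(filename):
    """Build the download URL for one tutorial file."""
    return BASE_URL + filename


def download_missing(filenames=FILES, target_dir="."):
    """Download each file that is not already present; return the names fetched."""
    fetched = []
    for name in filenames:
        destination = Path(target_dir) / name
        if destination.exists():
            continue  # skip files that were already downloaded
        urllib.request.urlretrieve(dataset_url(name), str(destination))
        fetched.append(name)
    return fetched
```

Calling `download_missing()` fetches only the files that are not yet in the working directory, so re-running the cell is cheap.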

#### Great! Let's build dataframes from the files for later use

In [None]:
#read datasets
items = pd.read_csv("items.csv")
holiday_events = pd.read_csv("holidays_events.csv", parse_dates=['date'])
stores = pd.read_csv("stores.csv")
oil = pd.read_csv("oil.csv", parse_dates=['date'])
transactions = pd.read_csv("transactions.csv", parse_dates=['date'])
df = pd.read_csv("train_reduced.csv")
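
The `parse_dates` argument controls whether a column is loaded as text or as datetimes. Note that `train_reduced.csv` is read without it, so any date column it contains would stay as text. The sketch below uses made-up rows (not the tutorial data) to show the difference.

```python
import io

import pandas as pd

# Made-up sample rows, used only to illustrate the behaviour
sample_csv = io.StringIO("date,value\n2016-01-01,1.5\n2016-01-02,2.5\n")

plain = pd.read_csv(sample_csv)
sample_csv.seek(0)  # rewind the buffer so it can be read a second time
parsed = pd.read_csv(sample_csv, parse_dates=["date"])

print(plain["date"].dtype)   # text-like column
print(parsed["date"].dtype)  # datetime column
print(pd.api.types.is_datetime64_any_dtype(parsed["date"]))
```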

### Create Vectice dataset assets

#### First, let's navigate to your personal workspace, get the tutorial project and start an iteration of the 'Data Understanding' phase. Go ahead and execute the cell below.

In [None]:
# Start an iteration of the "Data Understanding" phase (you can also use your own IDs directly)
active_iter = (
    vec.my_workspace
    .project("Tutorial Project: Forecast in store unit sales (23.2)")
    .phase("Data Understanding")
    .create_iteration()
)

#### Let's document the datasets we created for our project

In [None]:
# Provide context into the origin datasets by attaching them to the step
active_iter.step_collect_initial_data = vct.Dataset.origin(name="Items origin",resource=vct.FileResource(paths="items.csv"))
active_iter.step_collect_initial_data += vct.Dataset.origin(name="Holiday origin",resource=vct.FileResource(paths="holidays_events.csv"))
active_iter.step_collect_initial_data += vct.Dataset.origin(name="Stores origin",resource=vct.FileResource(paths="stores.csv"))
active_iter.step_collect_initial_data += vct.Dataset.origin(name="Oil origin",resource=vct.FileResource(paths="oil.csv"))
active_iter.step_collect_initial_data += vct.Dataset.origin(name="Transactions origin",resource=vct.FileResource(paths="transactions.csv"))

active_iter.step_collect_initial_data = "The datasets for the project have been identified"
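
Attaching a file that is not on disk is an easy mistake to make when a download was skipped. Before running the cell above, the paths can be checked with a small, hypothetical helper (`missing_files` is not a Vectice function). The mapping simply repeats the dataset names and files used above.

```python
from pathlib import Path

# Dataset names and files as used in the cell above
ORIGIN_FILES = {
    "Items origin": "items.csv",
    "Holiday origin": "holidays_events.csv",
    "Stores origin": "stores.csv",
    "Oil origin": "oil.csv",
    "Transactions origin": "transactions.csv",
}


def missing_files(name_to_path, base_dir="."):
    """Return the dataset names whose file does not exist under base_dir."""
    return [
        name
        for name, filename in name_to_path.items()
        if not (Path(base_dir) / filename).is_file()
    ]


# An empty list means every file is in place
print(missing_files(ORIGIN_FILES))
```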

#### Great, now we have our datasets and metadata documented in Vectice...pretty straightforward!

# Describe data

#### The following few cells are boilerplate code and not specific to Vectice
#### Your own strategy for describing your datasets will probably be more fleshed out than this

### Collect basic data properties

In [None]:
#provide info about the item dataset
items.info()

In [None]:
#provide info about the stores dataset
stores.info()

In [None]:
#provide info about the holiday_events dataset
holiday_events.info()

In [None]:
#perform date formatting
holiday_events['date'] = pd.to_datetime(holiday_events['date'], format="%Y-%m-%d")

### Document the "Describe Data" step and close it

#### Let's push everything we have learned about our datasets to Vectice. Keeping the context inline keeps things simple

In [None]:
# Document the "Describe Data" step and close it
# Format the dimensions of each dataset (rows = observations, columns = features)
msg = (
    "\nSize of Original Dataset:\n"
    f"Items dataset: Observations: {items.shape[0]} - Features: {items.shape[1]}\n"
    f"Holiday dataset: Observations: {holiday_events.shape[0]} - Features: {holiday_events.shape[1]}\n"
    f"Stores dataset: Observations: {stores.shape[0]} - Features: {stores.shape[1]}\n"
    f"Oil: Observations: {oil.shape[0]} - Features: {oil.shape[1]}\n"
    f"Transactions: Observations: {transactions.shape[0]} - Features: {transactions.shape[1]}"
)

# Document current step and get next one
active_iter.step_describe_data = "The data properties have been reviewed for the datasets identified\n" + msg
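
When more datasets are added, writing one line per dataset by hand gets repetitive. A minimal sketch of a reusable formatter follows; `size_summary` is a hypothetical helper and the shapes below are made up for illustration.

```python
def size_summary(shapes):
    """Build one 'name: Observations - Features' line per dataset.

    `shapes` maps a dataset name to its (rows, columns) tuple,
    for example the value of DataFrame.shape.
    """
    lines = ["Size of Original Dataset:"]
    for name, (rows, cols) in shapes.items():
        lines.append(f"{name}: Observations: {rows} - Features: {cols}")
    return "\n".join(lines)


# Made-up shapes; in the notebook you could pass {"Items dataset": items.shape, ...}
print(size_summary({"Items dataset": (100, 4), "Stores dataset": (10, 5)}))
```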

# Explore Data

### Visualizations

#### Stores Visualizations

In [None]:
#Treemap of store counts across different cities
fig = plt.figure(figsize=(25, 21))
marrimeko=stores.city.value_counts().to_frame().reset_index()
marrimeko.columns = ["city", "count"]
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(sizes=marrimeko['count'].values,label=marrimeko['city'].values,
              color=sns.color_palette('cubehelix_r', 28), alpha=1)
ax.set_xticks([])
ax.set_yticks([])
fig=plt.gcf()
fig.set_size_inches(40,25)
plt.title("Treemap of store counts across different cities", fontsize=18)
fig.savefig('Store1.png', dpi=300)
plt.show()

In [None]:
#Treemap of store counts across different States
fig = plt.figure(figsize=(25, 21))
marrimeko=stores.state.value_counts().to_frame().reset_index()  # count stores per state, not per city
marrimeko.columns = ["state", "count"]
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(sizes=marrimeko['count'].values,label=marrimeko['state'].values,
              color=sns.color_palette('viridis_r', 28), alpha=1)
ax.set_xticks([])
ax.set_yticks([])
fig=plt.gcf()
fig.set_size_inches(40,25)
plt.title("Treemap of store counts across different States", fontsize=18)
fig.savefig('Store2.png', dpi=300)
plt.show()
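
Both treemaps rely on turning `value_counts()` into a two-column table. The column names produced by `to_frame().reset_index()` differ between pandas versions, which is why the cells above rename them explicitly. A sketch of a variant that names both columns in one step, using a hypothetical helper `count_table` and made-up rows:

```python
import pandas as pd

# Made-up store rows for illustration only
sample = pd.DataFrame(
    {
        "city": ["Quito", "Quito", "Cuenca"],
        "state": ["Pichincha", "Pichincha", "Azuay"],
    }
)


def count_table(frame, column):
    """Return a two-column frame: the category and how many rows carry it."""
    return (
        frame[column]
        .value_counts()
        .rename_axis(column)         # name the index after the source column
        .reset_index(name="count")   # move it to a column, counts go to "count"
    )


print(count_table(sample, "city"))
print(count_table(sample, "state"))
```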

##### Inspecting the allocation of clusters to store numbers - Visualizations

In [None]:
#Stacked Barplot of Store types and their cluster distribution
plt.style.use('dark_background')
type_cluster = stores.groupby(['type','cluster']).size()
type_cluster.unstack().plot(kind='bar',stacked=True, colormap= 'PuBu', figsize=(13,11),  grid=False)
plt.title('Stacked Barplot of Store types and their cluster distribution', fontsize=18)
plt.ylabel('Count of clusters in a particular store type', fontsize=16)
plt.xlabel('Store type', fontsize=16)
plt.savefig('Store4.png', dpi=300)
plt.show()
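
The stacked bar plots in this section are all built from the same `groupby(...).size().unstack()` pattern: count rows per pair of categories, then pivot the second category into columns. The sketch below shows the intermediate table on made-up rows; `fill_value=0` is an addition here so that absent combinations show as 0 instead of NaN.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "type": ["A", "A", "B", "B", "B"],
        "cluster": [1, 2, 1, 1, 3],
    }
)

# One row per store type, one column per cluster, cell = number of stores
counts = sample.groupby(["type", "cluster"]).size().unstack(fill_value=0)
print(counts)
```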

#### Holidays Visualization

In [None]:
#Stacked Barplot of locale name against event type
holiday_local_type = holiday_events.groupby(['locale_name', 'type']).size()
holiday_local_type.unstack().plot(kind='bar',stacked=True, colormap= 'magma_r', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of locale name against event type')
plt.ylabel('Count of entries')
plt.savefig('holiday.png')
plt.show()

#### Transactions Visualization

In [None]:
#Distribution of transactions per day from 2013 till 2017
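# Matplotlib 3.6 renamed the seaborn styles to 'seaborn-v0_8-*'; pick whichever name is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')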
plt.figure(figsize=(13,11))
plt.plot(transactions.date.values, transactions.transactions.values, color='darkblue')
plt.ylim(-50, 10000)
plt.title("Distribution of transactions per day from 2013 till 2017")
plt.ylabel('transactions per day', fontsize= 16)
plt.xlabel('Date', fontsize= 16)
plt.savefig('transaction1317.png')
plt.show()

#### Items Visualizations

In [None]:

#Counts of items per family category
x, y = (list(x) for x in zip(*sorted(zip(items.family.value_counts().index, 
                                         items.family.value_counts().values), 
                                        reverse = False)))
trace2 = go.Bar(
    y=items.family.value_counts().values,
    x=items.family.value_counts().index,
    marker=dict(
        color=items.family.value_counts().values,
        colorscale = 'Portland',
        reversescale = False
    ),
    orientation='v',
)

layout = dict(
    title='Counts of items per family category',
     width = 800, height = 800,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
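# plt.savefig would save an empty Matplotlib figure here, so export the Plotly figure instead
# (fig1.write_image requires the optional "kaleido" package)
fig1.write_image('Item1.png')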

In [None]:
#Number of items attributed to a particular item class
x, y = (list(x) for x in zip(*sorted(zip(items['class'].value_counts().index, 
                                         items['class'].value_counts().values), 
                                        reverse = False)))
trace2 = go.Bar(
    x=items['class'].value_counts().index,
    y=items['class'].value_counts().values,
    marker=dict(
        color=items['class'].value_counts().values,
        colorscale = 'Portland',
        reversescale = True
    ),
    orientation='v',
)

layout = dict(
    title='Number of items attributed to a particular item class',
     width = 800, height = 1400,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
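# plt.savefig would save an empty Matplotlib figure here, so export the Plotly figure instead
# (fig1.write_image requires the optional "kaleido" package)
fig1.write_image('Item2.png')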

In [None]:
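#Stacked Barplot of item family against perishable flag
# Matplotlib 3.6 renamed the seaborn styles to 'seaborn-v0_8-*'; pick whichever name is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')
fam_perishable = items.groupby(['family', 'perishable']).size()
fam_perishable.unstack().plot(kind='bar',stacked=True, colormap= 'coolwarm', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of item family against perishable flag')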
plt.ylabel('Count of entries')
plt.savefig('Item3.png')

### Document our findings in Vectice by attaching the visualizations we just created
#### and adding a comment for good measure

In [None]:

active_iter.step_explore_data = "Store1.png"

#Close step, mark it as completed in the webapp and publish message
active_iter.step_explore_data = "We created visualizations for each of our datasets to inspect value distributions, outliers and join candidates across all datasets.\nSome side research also helped explain data inconsistencies that could be observed, caused by black swan events such as the earthquake of 2016 and the pandemic of 2020."

# Verify Data Quality

### Basic EDA

In [None]:

datasets = {"items": items, "holiday_events": holiday_events, "stores": stores, "oil": oil, "transactions": transactions}
for name, ds in datasets.items():
    print(f"Dataset: {name}")
    print(f"There are {len(ds)} rows in the dataset.")
    print("Missing values per column:")
    print(ds.isnull().sum())
    print("Share of missing values per column:")
    print(ds.isna().mean())
    print("-----------------------")
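
A per-column table is often easier to scan than printed output, and it can be reused when documenting findings. A minimal sketch, using a hypothetical helper `missing_report` and made-up rows rather than the tutorial data:

```python
import pandas as pd


def missing_report(frame):
    """Return the missing-value count and percentage for each column."""
    counts = frame.isna().sum()
    # max(..., 1) avoids dividing by zero on an empty frame
    percent = (counts / max(len(frame), 1) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
        "price": [37.1, None, 36.8, None],
    }
)
print(missing_report(sample))
```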

### Document our findings in Vectice

In [None]:
#Close step, mark it as completed in the webapp and publish message
active_iter.step_verify_data_quality = "The information contained in these datasets is accurate and comprehensive.\nAs the information aligns with other trusted resources, the datasets were considered reliable and relevant to the business problem we are trying to solve.\nHowever, this data cannot be used for real-time reporting as it does not update itself.\nFurther data preparation is required."