# Before you start this tutorial

**Tutorial Intention:** Provide an example of an iteration and its related steps in the Data Understanding phase so that you can:

*   Experience the data science lifecycle using Vectice
*   See how simple it is to connect your notebook to Vectice
*   Learn how to structure and log your work using Vectice

**Resources needed:**
*   Forecast Unit Sales Tutorial Project: You can find it in your personal workspace, which is named after you
*   Vectice Webapp Documentation: 
*   Vectice API documentation: 



# PIP Install Packages

In [None]:
!pip3 install --q vectice[github]

## Optional libraries - depending on your project

In [None]:
!pip3 install --q squarify
!pip3 install --q plotly

# Import libraries

In [None]:
# importing libraries
import pandas as pd  # data science essentials
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
import matplotlib.pyplot as plt  # essential graphical output
import seaborn as sns  # enhanced graphical output
import numpy as np   # mathematical essentials
import squarify
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px
#import other libraries
import logging
logging.basicConfig(level=logging.INFO)

In [None]:
# import Vectice library
import vectice
from vectice import FileDataWrapper, DatasetSourceUsage 

## Download the datasets and config file used in this notebook (if on Colab or Jupyter)

## Reading the data

The datasets used in this project can be found here:<br>
* [items.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/items.csv)<br>
* [holidays_events.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/holidays_events.csv)<br>
* [stores.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/stores.csv)<br>
* [oil.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/oil.csv)<br>
* [transactions.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/transactions.csv)<br>
* [train_reduced.csv](https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/train_reduced.csv)
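
The files listed above can also be fetched from inside the notebook. The following is a minimal sketch using only the Python standard library; the base URL and file names are copied from the list above, while the helper names `build_urls` and `download_missing` are hypothetical and not part of Vectice.

```python
import os
import urllib.request

# Base URL and file names copied from the dataset list above.
BASE_URL = "https://vectice-examples.s3.us-west-1.amazonaws.com/Tutorial/ForecastTutorial/"
FILES = [
    "items.csv",
    "holidays_events.csv",
    "stores.csv",
    "oil.csv",
    "transactions.csv",
    "train_reduced.csv",
]


def build_urls(base_url=BASE_URL, files=FILES):
    """Map each file name to its full download URL."""
    return {name: base_url + name for name in files}


def download_missing(dest_dir="."):
    """Download each dataset into dest_dir unless a file with that name already exists."""
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for name, url in build_urls().items():
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
            fetched.append(name)
    return fetched
```

Calling `download_missing()` places the CSV files next to the notebook, which is where the `pd.read_csv` calls further down expect them. Files that are already present are skipped, so the cell can be re-run safely.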

##  Vectice Config     
- To log your work to Vectice, you need to connect your notebook to your profile using your personal API token       
- Click on your profile at the top right corner of the Vectice application --> API Tokens --> Create API Token       
- Provide a name and description for the key. We recommend you name the API Token: "Tutorial_API_Token" to avoid having to make additional changes to the notebook.
- Save it in a location accessible by this code
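
Before connecting, it can be useful to confirm that the token file is where the notebook expects it. The sketch below assumes the token is saved as a JSON file named `Tutorial_API_token.json` (the name used in the connection cell); the helper `check_token_file` is hypothetical and makes no assumption about the keys inside the file.

```python
import json
from pathlib import Path


def check_token_file(path="Tutorial_API_token.json"):
    """Return the resolved path if the file exists and parses as JSON."""
    token_path = Path(path).expanduser().resolve()
    if not token_path.is_file():
        raise FileNotFoundError(
            f"No API token file at {token_path}. "
            "Create one in the Vectice app and save it there."
        )
    with token_path.open() as handle:
        json.load(handle)  # raises json.JSONDecodeError if the content is not valid JSON
    return token_path
```

Running `check_token_file()` before `vectice.connect(...)` turns a missing or corrupted file into an explicit error message that shows the full path that was checked.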

### Update the workspace name below to match the workspace name your project is in

In [None]:
my_vectice = vectice.connect(config=r"Tutorial_API_token.json")
# print(my_vectice.workspaces) # print out a list of workspaces we have access to

my_workspace = my_vectice.workspace("Retail Ops") # replace workspace name
# print(my_workspace.projects) # print out a list of projects we have access to

my_project = my_workspace.project(".Forecast in-store unit sales")

In [None]:
# Get the phase for Data Understanding 
du_phase = my_project.phase("Data Understanding")

In [None]:
# Get the currently active iteration
du_iter = du_phase.iteration()

# Collect Initial Data

In [None]:
# Get the Collect Initial Data step
du_step = du_iter.step("Collect Initial Data")

In [None]:
#read datasets
items = pd.read_csv("items.csv")
holiday_events = pd.read_csv("holidays_events.csv", parse_dates=['date'])
stores = pd.read_csv("stores.csv")
oil = pd.read_csv("oil.csv", parse_dates=['date'])
transactions = pd.read_csv("transactions.csv", parse_dates=['date'])
df = pd.read_csv("train_reduced.csv")

In [None]:
#Wrap datasets to export metadata to Vectice 
items_file_wrapped = FileDataWrapper(path="items.csv", name="Items origin")
holiday_file_wrapped = FileDataWrapper(path="holidays_events.csv", name="Holiday origin")
stores_file_wrapped = FileDataWrapper(path="stores.csv", name="Stores origin")
oil_file_wrapped = FileDataWrapper(path="oil.csv", name="Oil origin")
transactions_file_wrapped = FileDataWrapper(path="transactions.csv", name="Transactions origin")
df_file_wrapped = FileDataWrapper(path="train_reduced.csv", name="Training origin")

In [None]:
#push dataset metadata to Vectice webapp for versioning purposes
my_project.origin_dataset = items_file_wrapped
my_project.origin_dataset = holiday_file_wrapped
my_project.origin_dataset = stores_file_wrapped  # was wrapped above but not pushed
my_project.origin_dataset = oil_file_wrapped
my_project.origin_dataset = transactions_file_wrapped
my_project.origin_dataset = df_file_wrapped

In [None]:
#Close step, mark it as completed in the webapp and publish message
du_step.close(message="We selected all the available datasets")

# Describe data

In [None]:
# Get the Describe Data step
du_step = du_iter.step("Describe Data")

In [None]:
# formatting and printing the dimensions of the dataset (ROWS, COLUMNS)
print(f"""
Size of Original Dataset
                  All | Items | holiday | stores |  oil  | transactions
Observations:  {df.shape[0]}|  {items.shape[0]} |   {holiday_events.shape[0]}   |   {stores.shape[0]}   |  {oil.shape[0]} | {transactions.shape[0]} 
Features:          {df.shape[1]}  |    {items.shape[1]}  |    {holiday_events.shape[1]}    |   {stores.shape[1]}    |    {oil.shape[1]}  | {transactions.shape[1]}  
""")



In [None]:
#provide info about the main dataset
df.info()

In [None]:
# to handle NaN’s: fillna returns a new DataFrame, so assign the result back
df = df.fillna(0)
#pd.DatetimeIndex(df['date']).year # to get the year from the date.
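
Before replacing missing values, it can help to see where they occur. The sketch below is one possible way to do that on a toy frame; `missing_report` is a hypothetical helper, and the `onpromotion` column is an assumption used only for illustration. It also shows that `fillna` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Return the count and share of missing values for columns that have any."""
    counts = frame.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    return report[report["missing"] > 0]


# Toy data standing in for the training set (column names are illustrative).
toy = pd.DataFrame(
    {
        "unit_sales": [1.0, np.nan, 3.0, 4.0],
        "onpromotion": [np.nan, np.nan, 1.0, 0.0],
    }
)

report = missing_report(toy)
filled = toy.fillna(0)  # `toy` itself still contains its NaN values
```

On the real data, `missing_report(df)` would be called before the `fillna` step so that the share of replaced values can be mentioned in the step message.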

In [None]:
#provide info about the item dataset
items.info()

In [None]:
#provide info about the stores dataset
stores.info()

In [None]:
#provide info about the holiday_events dataset
holiday_events.info()

In [None]:
#perform date formatting
holiday_events['date'] = pd.to_datetime(holiday_events['date'], format="%Y-%m-%d")
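
When a date column may contain malformed entries, `pd.to_datetime` can be asked to turn them into `NaT` instead of raising. The following is a small sketch on made-up values, not on the tutorial data, showing how to locate such rows.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately malformed.
raw = pd.Series(["2016-04-16", "2016-04-17", "not a date"])

# errors="coerce" converts anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Rows that failed to parse can then be inspected in their original form
bad_rows = raw[parsed.isna()]
```

If `bad_rows` is empty, every value matched the expected `%Y-%m-%d` format.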

In [None]:
#provide a preview of the main dataset
df.head()

In [None]:
#Close step, mark it as completed in the webapp and publish message
du_step.close(message="NaN values have been replaced by 0 and dates have been set to a consistent format. However, additional cleaning needs to be performed")

# Explore Data

In [None]:
# Get the Explore Data step
du_step = du_iter.step("Explore data")

In [None]:
#describe statistically the main dataset
df.describe()

### Stores Visualizations

In [None]:
#Treemap of store counts across different cities
fig = plt.figure(figsize=(25, 21))
# use the Series directly: the column name produced by to_frame() differs across pandas versions
city_counts = stores['city'].value_counts()
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(sizes=city_counts.values, label=city_counts.index,
              color=sns.color_palette('cubehelix_r', 28), alpha=1)
ax.set_xticks([])
ax.set_yticks([])
fig=plt.gcf()
fig.set_size_inches(40,25)
plt.title("Treemap of store counts across different cities", fontsize=18)
fig.savefig('Store1.png', dpi=300)
plt.show()

In [None]:
#Treemap of store counts across different States
fig = plt.figure(figsize=(25, 21))
# use the Series directly: the column name produced by to_frame() differs across pandas versions
state_counts = stores['state'].value_counts()
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(sizes=state_counts.values, label=state_counts.index,
              color=sns.color_palette('viridis_r', 28), alpha=1)
ax.set_xticks([])
ax.set_yticks([])
fig=plt.gcf()
fig.set_size_inches(40,25)
plt.title("Treemap of store counts across different States", fontsize=18)
fig.savefig('Store2.png', dpi=300)
plt.show()

#### Inspecting the allocation of clusters to store numbers - Visualizations

In [None]:
#Store numbers and the clusters they are assigned to
# Unhide to see the sorted zip order
neworder = [23, 24, 26, 36, 41, 15, 29, 31, 32, 34, 39, 
            53, 4, 37, 40, 43, 8, 10, 19, 20, 33, 38, 13, 
            21, 2, 6, 7, 3, 22, 25, 27, 28, 30, 35, 42, 44, 
            48, 51, 16, 0, 1, 5, 52, 45, 46, 47, 49, 9, 11, 12, 14, 18, 17, 50]

# Finally plot the seaborn heatmap
plt.style.use('dark_background')
plt.figure(figsize=(15,12))
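# keyword arguments are required by recent pandas versions
store_pivot = stores.dropna().pivot(index="store_nbr", columns="cluster", values="store_nbr")
ax = sns.heatmap(store_pivot, cmap='jet', annot=True, linewidths=0, linecolor='white')
plt.title('Store numbers and the clusters they are assigned to')
# save before plt.show(), otherwise an empty figure is written
plt.savefig('Store3.png', dpi=300)
plt.show()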

In [None]:
#Stacked Barplot of Store types and their cluster distribution
plt.style.use('dark_background')
type_cluster = stores.groupby(['type','cluster']).size()
type_cluster.unstack().plot(kind='bar',stacked=True, colormap= 'PuBu', figsize=(13,11),  grid=False)
plt.title('Stacked Barplot of Store types and their cluster distribution', fontsize=18)
plt.ylabel('Count of clusters in a particular store type', fontsize=16)
plt.xlabel('Store type', fontsize=16)
plt.savefig('Store4.png', dpi=300);
plt.show()

### Holidays Visualization

In [None]:
#Stacked Barplot of locale name against event type
holiday_local_type = holiday_events.groupby(['locale_name', 'type']).size()
holiday_local_type.unstack().plot(kind='bar',stacked=True, colormap= 'magma_r', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of locale name against event type')
plt.ylabel('Count of entries')
plt.savefig('holiday.png')
plt.show()

### Transactions Visualization

In [None]:
#Distribution of transactions per day from 2013 till 2017
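# 'seaborn-white' was renamed 'seaborn-v0_8-white' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')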
plt.figure(figsize=(13,11))
plt.plot(transactions.date.values, transactions.transactions.values, color='darkblue')
plt.ylim(-50, 10000)
plt.title("Distribution of transactions per day from 2013 till 2017")
plt.ylabel('transactions per day', fontsize= 16)
plt.xlabel('Date', fontsize= 16)
plt.savefig('transaction1317.png')
plt.show()

In [None]:
#transactions per day
#plt.style.use('seaborn-deep')
#plt.figure(figsize=(13,11))
#plt.plot(df.date.values, df.unit_sales)
#plt.ylim(-50, 10000)
#plt.ylabel('transactions per day')
#plt.xlabel('Date')
#plt.savefig('Transactionspday.png')
#plt.show()


### Items Visualizations

In [None]:

#Counts of items per family category
x, y = (list(x) for x in zip(*sorted(zip(items.family.value_counts().index, 
                                         items.family.value_counts().values), 
                                        reverse = False)))
trace2 = go.Bar(
    y=items.family.value_counts().values,
    x=items.family.value_counts().index,
    marker=dict(
        color=items.family.value_counts().values,
        colorscale = 'Portland',
        reversescale = False
    ),
    orientation='v',
)

layout = dict(
    title='Counts of items per family category',
     width = 800, height = 800,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
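# plt.savefig would write an empty Matplotlib figure; export the Plotly figure instead (needs the kaleido package)
fig1.write_image('Item1.png')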

In [None]:
#Number of items attributed to a particular item class
x, y = (list(x) for x in zip(*sorted(zip(items['class'].value_counts().index, 
                                         items['class'].value_counts().values), 
                                        reverse = False)))
trace2 = go.Bar(
    x=items['class'].value_counts().index,
    y=items['class'].value_counts().values,
    marker=dict(
        color=items['class'].value_counts().values,
        colorscale = 'Portland',
        reversescale = True
    ),
    orientation='v',
)

layout = dict(
    title='Number of items attributed to a particular item class',
     width = 800, height = 1400,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
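# plt.savefig would write an empty Matplotlib figure; export the Plotly figure instead (needs the kaleido package)
fig1.write_image('Item2.png')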

In [None]:
#Stacked Barplot of item family against perishable flag
# 'seaborn-white' was renamed 'seaborn-v0_8-white' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')
fam_perishable = items.groupby(['family', 'perishable']).size()
fam_perishable.unstack().plot(kind='bar',stacked=True, colormap= 'coolwarm', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of item family against perishable flag')
plt.ylabel('Count of entries')
plt.savefig('Item3.png')

In [None]:
#Close step, mark it as completed in the webapp and publish message
du_step.close(message="I generated a total of 10 graphs. Some side research also helped explain some inconsistencies observed in the data, for example the 2016 earthquake")

# Verify Data Quality

In [None]:
# Get the Verify Data Quality step
step = du_iter.step("Verify Data Quality")
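
A few automated checks can support the quality statement recorded for this step. The sketch below is illustrative only: `quality_checks` is a hypothetical helper, the toy frames stand in for `df` and `stores`, and the column names `unit_sales` and `store_nbr` are assumed to match the training and stores files. Negative unit sales are counted rather than treated as errors, since they may represent returns.

```python
import pandas as pd


def quality_checks(sales, stores):
    """Return simple counts that flag potential data quality issues."""
    return {
        "duplicate_rows": int(sales.duplicated().sum()),
        "missing_values": int(sales.isna().sum().sum()),
        "negative_unit_sales": int((sales["unit_sales"] < 0).sum()),
        # sales rows whose store number is absent from the stores table
        "unknown_stores": int((~sales["store_nbr"].isin(stores["store_nbr"])).sum()),
    }


# Toy data with one instance of each issue (values are made up).
toy_sales = pd.DataFrame(
    {"store_nbr": [1, 1, 2, 9], "unit_sales": [5.0, 5.0, -1.0, None]}
)
toy_stores = pd.DataFrame({"store_nbr": [1, 2]})

results = quality_checks(toy_sales, toy_stores)
```

On the tutorial data, the equivalent call would be `quality_checks(df, stores)`, and the resulting counts could be summarized in the message used to close the step.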

In [None]:
#Close step, mark it as completed in the webapp and publish message
step.close(message="The information contained in this dataset is accurate and comprehensive. As the information aligns with other trusted sources, the dataset was considered reliable and relevant to the business problem we are trying to solve. However, this data cannot be used for real-time reporting because it does not update itself.")