# Sales Forecasting with Rasgo

This notebook shows how to perform the data preparation and feature engineering for a sales forecasting model. Starting with [AdventureWorks](https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure) data preloaded in Rasgo, the data will be explored, features created and modeling data extracted.

This analysis will be focused on the internet sales for this company.

## Packages

The documentation for each package used in this tutorial is linked below:
* [numpy](https://numpy.org/doc/stable/)
* [os](https://docs.python.org/3/library/os.html)
* [pandas](https://pandas.pydata.org/docs/)
* [pyrasgo](https://docs.rasgoml.com/rasgo-docs/)
* [scikit-learn](https://scikit-learn.org/stable/)
    * [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)
* [XGBoost](https://xgboost.readthedocs.io/en/latest/)

Uncomment the cell below if you're missing some of these packages on your local machine. 

In [1]:
#!pip install numpy pandas pyrasgo scikit-learn xgboost

### Ensure that pyrasgo is installed and up to date

In [2]:
!pip install pyrasgo --upgrade

!pip show pyrasgo

Collecting pyrasgo
  Downloading pyrasgo-0.4.14-py3-none-any.whl (105 kB)
     |████████████████████████████████| 105 kB 6.4 MB/s            
Installing collected packages: pyrasgo
  Attempting uninstall: pyrasgo
    Found existing installation: pyrasgo 0.4.12
    Uninstalling pyrasgo-0.4.12:
      Successfully uninstalled pyrasgo-0.4.12
Successfully installed pyrasgo-0.4.14
Name: pyrasgo
Version: 0.4.14
Summary: Alpha version of the Rasgo Python interface.
Home-page: https://www.rasgoml.com/
Author: Patrick Dougherty
Author-email: patrick@rasgoml.com
License: GNU Affero General Public License v3 or later (AGPLv3+)
Location: /Users/nick/Git/workspace/venv/lib/python3.9/site-packages
Requires: idna, more-itertools, pandas, pyarrow, pydantic, pyyaml, requests, snowflake-connector-python, snowflake-connector-python, tqdm
Required-by: 


In [3]:
import numpy as np
import os
import pandas as pd
import pyrasgo
from sklearn.metrics import mean_squared_error
import xgboost as xgb

## Access Rasgo

### Create account

Next, click [here](https://app.rasgoml.com/account/register) to create an account on the Rasgo UI. Fill in the required information on the web page.

<p align="center">
  <img src="img/RasgoAccountRegistration.png" alt="Rasgo Account Registration" width="512">
</p>

After submitting the form, you can close the browser tab. You will receive an email from Rasgo asking you to verify your email address. Click the **Verify Email** button in that message.

<p align="center">
  <img src="img/RasgoWelcome.png" alt="Verify Email" width="390">
</p>

This will open a browser tab where you can log into the UI.

### Log into Rasgo UI

Enter your username and password and click **Login**.

<p align="center">
  <img src="img/RasgoLogin.png" alt="Login to Rasgo" width="528">
</p>

to be taken to the Rasgo App homepage.

### Copy your API Key

Click the **API KEY** button in the upper right of the screen

<img src="img/APIKEY.png" alt="Copy API Key" width="128">

to copy your API key to the clipboard.

## Work with PyRasgo

### Connect using your API key

In [4]:
API_KEY = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3MjEiLCJpZCI6NzIxLCJvcmdJZCI6bnVsbCwidXNlcm5hbWUiOiJuaWNraGFydmV5b25saW5lQGdtYWlsLmNvbSIsImV4cCI6NDY5MjY5MjE5NDJ9.RItOPSuTEiRdGi136fBCszKSwTekXR8EdLafkhopriA'

rasgo = pyrasgo.connect(API_KEY)
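
Pasting a key directly into a notebook makes it easy to share by accident. One common alternative is to read it from an environment variable using `os`. The sketch below is only an illustration: the variable name `RASGO_API_KEY` and the helper `load_api_key` are assumptions, not part of PyRasgo.

```python
import os

# Hypothetical variable name; use whatever fits your environment.
ENV_VAR = "RASGO_API_KEY"


def load_api_key(env_var: str = ENV_VAR) -> str:
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Rasgo API key.")
    return key


# Usage: the returned string takes the place of the literal key.
# API_KEY = load_api_key()
```

The resulting `API_KEY` can then be passed to `pyrasgo.connect` exactly as in the cell above.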

### Get list of available datasets

Loop over all available datasets and print out the dataset ID and Name.

In [5]:
datasets = sorted(rasgo.get.datasets(), key=lambda x: x.id)
for ds in datasets:
    print(f"ID: {ds.id}\tDataset: {ds.name}")

ID: 52	Dataset: Adventureworks: Dim Account
ID: 53	Dataset: Adventureworks: Dim Currency
ID: 54	Dataset: Adventureworks: Dim Salesreason
ID: 55	Dataset: Adventureworks: Dim Customer
ID: 56	Dataset: Adventureworks: Dim Promotion
ID: 57	Dataset: Adventureworks: Dim Date
ID: 58	Dataset: Adventureworks: Dim Productcategory
ID: 59	Dataset: Adventureworks: Dim Departmentgroup
ID: 60	Dataset: Adventureworks: Newfact Currencyrate
ID: 61	Dataset: Adventureworks: Dim Geography
ID: 62	Dataset: Adventureworks: Dim Organization
ID: 63	Dataset: Adventureworks: Dim Reseller
ID: 64	Dataset: Adventureworks: Fact Additionalinternationalproductdescription
ID: 65	Dataset: Adventureworks: Fact Productinventory
ID: 66	Dataset: Adventureworks: Fact Callcenter
ID: 67	Dataset: Adventureworks: Fact Resellersales
ID: 68	Dataset: Adventureworks: Fact Internetsalesreason
ID: 69	Dataset: Adventureworks: Fact Salesquota
ID: 70	Dataset: Adventureworks: Prospectivebuyer
ID: 71	Dataset: Adventureworks: Dim Scenario
ID:

The two datasets of interest are 74, which contains the sales information, and 56, which contains details about any promotions in place.

### Examine Internet Sales

In [6]:
internet_sales = rasgo.get.dataset(74)
internet_sales.preview()

Unnamed: 0,PRODUCTKEY,ORDERDATEKEY,DUEDATEKEY,SHIPDATEKEY,CUSTOMERKEY,PROMOTIONKEY,CURRENCYKEY,SALESTERRITORYKEY,SALESORDERNUMBER,SALESORDERLINENUMBER,...,PRODUCTSTANDARDCOST,TOTALPRODUCTCOST,SALESAMOUNT,TAXAMT,FREIGHT,CARRIERTRACKINGNUMBER,CUSTOMERPONUMBER,ORDERDATE,DUEDATE,SHIPDATE
0,310,20101229,20110110,20110105,21768,1,19,6,SO43697,1,...,2171.2942,2171.2942,3578.27,286.2616,89.4568,,,2010-12-29,2011-01-10,2011-01-05
1,346,20101229,20110110,20110105,28389,1,39,7,SO43698,1,...,1912.1544,1912.1544,3399.99,271.9992,84.9998,,,2010-12-29,2011-01-10,2011-01-05
2,346,20101229,20110110,20110105,25863,1,100,1,SO43699,1,...,1912.1544,1912.1544,3399.99,271.9992,84.9998,,,2010-12-29,2011-01-10,2011-01-05
3,336,20101229,20110110,20110105,14501,1,100,4,SO43700,1,...,413.1463,413.1463,699.0982,55.9279,17.4775,,,2010-12-29,2011-01-10,2011-01-05
4,346,20101229,20110110,20110105,11003,1,6,9,SO43701,1,...,1912.1544,1912.1544,3399.99,271.9992,84.9998,,,2010-12-29,2011-01-10,2011-01-05
5,311,20101230,20110111,20110106,27645,1,100,4,SO43702,1,...,2171.2942,2171.2942,3578.27,286.2616,89.4568,,,2010-12-30,2011-01-11,2011-01-06
6,310,20101230,20110111,20110106,16624,1,6,9,SO43703,1,...,2171.2942,2171.2942,3578.27,286.2616,89.4568,,,2010-12-30,2011-01-11,2011-01-06
7,351,20101230,20110111,20110106,11005,1,6,9,SO43704,1,...,1898.0944,1898.0944,3374.99,269.9992,84.3748,,,2010-12-30,2011-01-11,2011-01-06
8,344,20101230,20110111,20110106,11011,1,6,9,SO43705,1,...,1912.1544,1912.1544,3399.99,271.9992,84.9998,,,2010-12-30,2011-01-11,2011-01-06
9,312,20101231,20110112,20110107,27621,1,100,4,SO43706,1,...,2171.2942,2171.2942,3578.27,286.2616,89.4568,,,2010-12-31,2011-01-12,2011-01-07


This looks promising, but it would be more useful to see a single product sorted by date. This can be done with the `filter` and `order` transforms. Filtering requires a product key, and since we have not chosen one yet, we will simply order by *PRODUCTKEY* and *ORDERDATE*.

In [7]:
internet_sales.order(order_by={'PRODUCTKEY':'ASC','ORDERDATE':'ASC'}).preview()

Unnamed: 0,PRODUCTKEY,ORDERDATEKEY,DUEDATEKEY,SHIPDATEKEY,CUSTOMERKEY,PROMOTIONKEY,CURRENCYKEY,SALESTERRITORYKEY,SALESORDERNUMBER,SALESORDERLINENUMBER,...,PRODUCTSTANDARDCOST,TOTALPRODUCTCOST,SALESAMOUNT,TAXAMT,FREIGHT,CARRIERTRACKINGNUMBER,CUSTOMERPONUMBER,ORDERDATE,DUEDATE,SHIPDATE
0,214,20121228,20130109,20130104,12132,1,100,7,SO51181,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-28,2013-01-09,2013-01-04
1,214,20121228,20130109,20130104,16313,1,100,8,SO51180,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-28,2013-01-09,2013-01-04
2,214,20121229,20130110,20130105,12390,1,100,8,SO51191,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-29,2013-01-10,2013-01-05
3,214,20121229,20130110,20130105,11241,1,100,7,SO51192,2,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-29,2013-01-10,2013-01-05
4,214,20121230,20130111,20130106,11338,1,100,8,SO51207,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-30,2013-01-11,2013-01-06
5,214,20121230,20130111,20130106,24604,1,6,9,SO51212,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-30,2013-01-11,2013-01-06
6,214,20121231,20130112,20130107,11061,1,6,9,SO51237,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-31,2013-01-12,2013-01-07
7,214,20121231,20130112,20130107,25625,1,100,8,SO51246,4,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-31,2013-01-12,2013-01-07
8,214,20121231,20130112,20130107,11615,1,98,10,SO51232,2,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-31,2013-01-12,2013-01-07
9,214,20121231,20130112,20130107,28204,1,6,9,SO51234,2,...,13.0863,13.0863,34.99,2.7992,0.8748,,,2012-12-31,2013-01-12,2013-01-07
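
For readers more familiar with pandas, the ordering above (and the filtering mentioned earlier) can be sketched locally. This is a pandas analogue on a few toy rows copied from the previews, not the Rasgo implementation.

```python
import pandas as pd

# Toy rows shaped like the previews above (a subset of columns).
sales = pd.DataFrame({
    "PRODUCTKEY": [310, 214, 346, 214],
    "ORDERDATE": pd.to_datetime(["2010-12-29", "2012-12-29", "2010-12-29", "2012-12-28"]),
    "SALESAMOUNT": [3578.27, 34.99, 3399.99, 34.99],
})

# Order by product, then by date, both ascending.
ordered = sales.sort_values(["PRODUCTKEY", "ORDERDATE"]).reset_index(drop=True)

# Once a product of interest is known, filter to it before sorting.
one_product = sales[sales["PRODUCTKEY"] == 214].sort_values("ORDERDATE")

print(ordered)
print(one_product)
```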


This looks reasonable, so we will use it for our modeling. For future reference, what columns exist in this table?

In [8]:
internet_sales.preview().columns.sort_values()

Index(['CARRIERTRACKINGNUMBER', 'CURRENCYKEY', 'CUSTOMERKEY',
       'CUSTOMERPONUMBER', 'DISCOUNTAMOUNT', 'DUEDATE', 'DUEDATEKEY',
       'EXTENDEDAMOUNT', 'FREIGHT', 'ORDERDATE', 'ORDERDATEKEY',
       'ORDERQUANTITY', 'PRODUCTKEY', 'PRODUCTSTANDARDCOST', 'PROMOTIONKEY',
       'REVISIONNUMBER', 'SALESAMOUNT', 'SALESORDERLINENUMBER',
       'SALESORDERNUMBER', 'SALESTERRITORYKEY', 'SHIPDATE', 'SHIPDATEKEY',
       'TAXAMT', 'TOTALPRODUCTCOST', 'UNITPRICE', 'UNITPRICEDISCOUNTPCT'],
      dtype='object')

*PROMOTIONKEY* is probably important for a sales forecast. Promotion information can be found in dataset 56.

In [9]:
promotion = rasgo.get.dataset(56)
promotion.preview()

Unnamed: 0,PROMOTIONKEY,PROMOTIONALTERNATEKEY,ENGLISHPROMOTIONNAME,SPANISHPROMOTIONNAME,FRENCHPROMOTIONNAME,DISCOUNTPCT,ENGLISHPROMOTIONTYPE,SPANISHPROMOTIONTYPE,FRENCHPROMOTIONTYPE,ENGLISHPROMOTIONCATEGORY,SPANISHPROMOTIONCATEGORY,FRENCHPROMOTIONCATEGORY,STARTDATE,ENDDATE,MINQTY,MAXQTY
0,1,1,No Discount,Sin descuento,Aucune remise,0.0,No Discount,Sin descuento,Aucune remise,No Discount,Sin descuento,Aucune remise,2010-11-29,2014-06-30,0,
1,2,2,Volume Discount 11 to 14,Descuento por volumen (entre 11 y 14),Remise sur quantité (de 11 à 14),0.02,Volume Discount,Descuento por volumen,Remise sur quantité,Reseller,Distribuidor,Revendeur,2010-12-29,2013-12-28,11,14.0
2,3,3,Volume Discount 15 to 24,Descuento por volumen (entre 15 y 24),Remise sur quantité (de 15 à 24),0.05,Volume Discount,Descuento por volumen,Remise sur quantité,Reseller,Distribuidor,Revendeur,2010-12-29,2013-12-28,15,24.0
3,4,4,Volume Discount 25 to 40,Descuento por volumen (entre 25 y 40),Remise sur quantité (de 25 à 40),0.1,Volume Discount,Descuento por volumen,Remise sur quantité,Reseller,Distribuidor,Revendeur,2010-12-29,2013-12-28,25,40.0
4,5,5,Volume Discount 41 to 60,Descuento por volumen (entre 41 y 60),Remise sur quantité (de 41 à 60),0.15,Volume Discount,Descuento por volumen,Remise sur quantité,Reseller,Distribuidor,Revendeur,2010-12-29,2013-12-28,41,60.0
5,6,6,Volume Discount over 60,Descuento por volumen (más de 60),Remise sur quantité (au-delà de 60),0.2,Volume Discount,Descuento por volumen,Remise sur quantité,Reseller,Distribuidor,Revendeur,2010-12-29,2013-12-28,61,
6,7,7,Mountain-100 Clearance Sale,"Liquidación de bicicleta de montaña, 100",Liquidation VTT 100,0.35,Discontinued Product,Descatalogado,Ce produit n'est plus commercialisé,Reseller,Distribuidor,Revendeur,2011-11-12,2011-12-28,0,
7,8,8,Sport Helmet Discount-2002,"Casco deportivo, descuento: 2002",Remise sur les casques sport - 2002,0.1,Seasonal Discount,Descuento de temporada,Remise saisonnière,Reseller,Distribuidor,Revendeur,2011-12-29,2012-01-28,0,
8,9,9,Road-650 Overstock,"Bicicleta de carretera: 650, oferta especial",Déstockage Vélo de route 650,0.3,Excess Inventory,Inventario excedente,Déstockage,Reseller,Distribuidor,Revendeur,2011-12-29,2012-02-28,0,
9,10,10,Mountain Tire Sale,Oferta de cubierta de montaña,Vente de pneus de VTT,0.5,Excess Inventory,Inventario excedente,Déstockage,Customer,Cliente,Client,2012-12-12,2013-02-26,0,


## Sales Data

Work with the sales and promotion data to create the base modeling time-series features for the sales forecasting model.

### Merge Promo data

First, we want to clean up the promotion data so that it keeps only what needs to be added to the sales data. Drop all columns except *PROMOTIONKEY* and *DISCOUNTPCT* from `promotion` using the `drop_columns` transformation.

In [10]:
reduced_promo = promotion.drop_columns(include_cols=['PROMOTIONKEY', 'DISCOUNTPCT'])
reduced_promo.order(order_by={'PROMOTIONKEY':'ASC'}).preview()

Unnamed: 0,PROMOTIONKEY,DISCOUNTPCT
0,1,0.0
1,2,0.02
2,3,0.05
3,4,0.1
4,5,0.15
5,6,0.2
6,7,0.35
7,8,0.1
8,9,0.3
9,10,0.5


Now merge this with the internet sales data using the `join` transformation.

In [11]:
sales_promo = reduced_promo.join(join_table=internet_sales,
                                 join_type='RIGHT',
                                 join_columns={'PROMOTIONKEY':'PROMOTIONKEY'})
sales_promo.order(order_by={'PRODUCTKEY':'ASC', 'ORDERDATE':'ASC'}).preview()

Unnamed: 0,DISCOUNTPCT,PROMOTIONKEY,DUEDATEKEY,EXTENDEDAMOUNT,PRODUCTKEY,SALESORDERLINENUMBER,SALESORDERNUMBER,CUSTOMERPONUMBER,UNITPRICEDISCOUNTPCT,DISCOUNTAMOUNT,...,CARRIERTRACKINGNUMBER,ORDERDATE,SALESTERRITORYKEY,CUSTOMERKEY,DUEDATE,ORDERQUANTITY,PRODUCTSTANDARDCOST,CURRENCYKEY,UNITPRICE,SHIPDATEKEY
0,0.0,1,20130109,34.99,214,4,SO51181,,0.0,0.0,...,,2012-12-28,7,12132,2013-01-09,1,13.0863,100,34.99,20130104
1,0.0,1,20130109,34.99,214,4,SO51180,,0.0,0.0,...,,2012-12-28,8,16313,2013-01-09,1,13.0863,100,34.99,20130104
2,0.0,1,20130110,34.99,214,4,SO51191,,0.0,0.0,...,,2012-12-29,8,12390,2013-01-10,1,13.0863,100,34.99,20130105
3,0.0,1,20130110,34.99,214,2,SO51192,,0.0,0.0,...,,2012-12-29,7,11241,2013-01-10,1,13.0863,100,34.99,20130105
4,0.0,1,20130111,34.99,214,4,SO51207,,0.0,0.0,...,,2012-12-30,8,11338,2013-01-11,1,13.0863,100,34.99,20130106
5,0.0,1,20130111,34.99,214,4,SO51212,,0.0,0.0,...,,2012-12-30,9,24604,2013-01-11,1,13.0863,6,34.99,20130106
6,0.0,1,20130112,34.99,214,4,SO51237,,0.0,0.0,...,,2012-12-31,9,11061,2013-01-12,1,13.0863,6,34.99,20130107
7,0.0,1,20130112,34.99,214,4,SO51246,,0.0,0.0,...,,2012-12-31,8,25625,2013-01-12,1,13.0863,100,34.99,20130107
8,0.0,1,20130112,34.99,214,2,SO51232,,0.0,0.0,...,,2012-12-31,10,11615,2013-01-12,1,13.0863,98,34.99,20130107
9,0.0,1,20130112,34.99,214,2,SO51234,,0.0,0.0,...,,2012-12-31,9,28204,2013-01-12,1,13.0863,6,34.99,20130107
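
The join above is a RIGHT join from the promotion table onto the sales table, so every sales row is kept even if its promotion key has no match. A minimal pandas sketch of the same idea follows. The order number `SO99999` and promotion key `99` are made up to show an unmatched row.

```python
import pandas as pd

promo = pd.DataFrame({
    "PROMOTIONKEY": [1, 2, 10],
    "DISCOUNTPCT": [0.0, 0.02, 0.5],
})
sales = pd.DataFrame({
    "SALESORDERNUMBER": ["SO51181", "SO51180", "SO99999"],  # SO99999 is hypothetical
    "PROMOTIONKEY": [1, 1, 99],                             # 99 has no promotion row
    "SALESAMOUNT": [34.99, 34.99, 10.0],
})

# how="right" keeps every row of the right-hand (sales) table.
sales_promo = promo.merge(sales, on="PROMOTIONKEY", how="right")
print(sales_promo)
```

The unmatched sale keeps its row and gets a missing *DISCOUNTPCT*, which is the behaviour wanted here: no sales should be lost because of a gap in the promotion table.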


### Create Weekly Data

Now, we want to forecast these sales weekly, so we need to extract the week from the *ORDERDATE*. This can be done using the transform `datetrunc`.

In [12]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE':'week'})
salesds.order(order_by={'PRODUCTKEY':'ASC', 'ORDERDATE':'ASC'}).preview()

Unnamed: 0,PROMOTIONKEY,DISCOUNTPCT,DUEDATEKEY,EXTENDEDAMOUNT,PRODUCTKEY,SALESORDERLINENUMBER,SALESORDERNUMBER,CUSTOMERPONUMBER,UNITPRICEDISCOUNTPCT,DISCOUNTAMOUNT,...,TAXAMT,ORDERDATEKEY,CUSTOMERKEY,DUEDATE,ORDERQUANTITY,PRODUCTSTANDARDCOST,CURRENCYKEY,UNITPRICE,SHIPDATEKEY,ORDERDATE_WEEK
0,1,0.0,20130109,34.99,214,4,SO51181,,0.0,0.0,...,2.7992,20121228,12132,2013-01-09,1,13.0863,100,34.99,20130104,2012-12-23
1,1,0.0,20130109,34.99,214,4,SO51180,,0.0,0.0,...,2.7992,20121228,16313,2013-01-09,1,13.0863,100,34.99,20130104,2012-12-23
2,1,0.0,20130110,34.99,214,4,SO51191,,0.0,0.0,...,2.7992,20121229,12390,2013-01-10,1,13.0863,100,34.99,20130105,2012-12-23
3,1,0.0,20130110,34.99,214,2,SO51192,,0.0,0.0,...,2.7992,20121229,11241,2013-01-10,1,13.0863,100,34.99,20130105,2012-12-23
4,1,0.0,20130111,34.99,214,4,SO51207,,0.0,0.0,...,2.7992,20121230,11338,2013-01-11,1,13.0863,100,34.99,20130106,2012-12-30
5,1,0.0,20130111,34.99,214,4,SO51212,,0.0,0.0,...,2.7992,20121230,24604,2013-01-11,1,13.0863,6,34.99,20130106,2012-12-30
6,1,0.0,20130112,34.99,214,4,SO51237,,0.0,0.0,...,2.7992,20121231,11061,2013-01-12,1,13.0863,6,34.99,20130107,2012-12-30
7,1,0.0,20130112,34.99,214,4,SO51246,,0.0,0.0,...,2.7992,20121231,25625,2013-01-12,1,13.0863,100,34.99,20130107,2012-12-30
8,1,0.0,20130112,34.99,214,2,SO51232,,0.0,0.0,...,2.7992,20121231,11615,2013-01-12,1,13.0863,98,34.99,20130107,2012-12-30
9,1,0.0,20130112,34.99,214,2,SO51234,,0.0,0.0,...,2.7992,20121231,28204,2013-01-12,1,13.0863,6,34.99,20130107,2012-12-30
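
In the preview above, 2012-12-28 and 2012-12-29 map to 2012-12-23, while 2012-12-30 maps to itself, which suggests that weeks here start on Sunday. A pandas sketch that reproduces those values is shown below. The Sunday week start is inferred from the preview and may depend on the warehouse configuration.

```python
import pandas as pd

order_dates = pd.Series(
    pd.to_datetime(["2012-12-28", "2012-12-29", "2012-12-30", "2012-12-31"])
)

# dayofweek is Monday=0 ... Sunday=6, so (dayofweek + 1) % 7 is the
# number of days elapsed since the most recent Sunday.
days_since_sunday = (order_dates.dt.dayofweek + 1) % 7
order_week = order_dates - pd.to_timedelta(days_since_sunday, unit="D")

print(order_week.dt.strftime("%Y-%m-%d").tolist())
```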


The new week column is called *ORDERDATE_WEEK*. This is clunky, so let's rename it to *ORDERWEEK* using the `rename` transformation.

In [13]:
# newsalesds = salesds.rename(renames={'ORDERDATE_WEEK': 'ORDERWEEK'})
# newsalesds.order(order_by={'PRODUCTKEY':'ASC', 'ORDERDATE':'ASC'}).preview()

Alternatively, we can just chain these transformations together.

In [14]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'})
salesds.order(order_by={'PRODUCTKEY':'ASC', 'ORDERWEEK':'ASC'}).preview()

Unnamed: 0,ORDERWEEK,ORDERQUANTITY,SALESTERRITORYKEY,TOTALPRODUCTCOST,PRODUCTSTANDARDCOST,FREIGHT,PROMOTIONKEY,SHIPDATE,ORDERDATE,EXTENDEDAMOUNT,...,DUEDATE,REVISIONNUMBER,CARRIERTRACKINGNUMBER,CUSTOMERKEY,CUSTOMERPONUMBER,DISCOUNTAMOUNT,SALESORDERLINENUMBER,CURRENCYKEY,DISCOUNTPCT,TAXAMT
0,2012-12-23,1,7,13.0863,13.0863,0.8748,1,2013-01-04,2012-12-28,34.99,...,2013-01-09,1,,12132,,0.0,4,100,0.0,2.7992
1,2012-12-23,1,8,13.0863,13.0863,0.8748,1,2013-01-05,2012-12-29,34.99,...,2013-01-10,1,,12390,,0.0,4,100,0.0,2.7992
2,2012-12-23,1,7,13.0863,13.0863,0.8748,1,2013-01-05,2012-12-29,34.99,...,2013-01-10,1,,11241,,0.0,2,100,0.0,2.7992
3,2012-12-23,1,8,13.0863,13.0863,0.8748,1,2013-01-04,2012-12-28,34.99,...,2013-01-09,1,,16313,,0.0,4,100,0.0,2.7992
4,2012-12-30,1,8,13.0863,13.0863,0.8748,1,2013-01-07,2012-12-31,34.99,...,2013-01-12,1,,25625,,0.0,4,100,0.0,2.7992
5,2012-12-30,1,8,13.0863,13.0863,0.8748,1,2013-01-06,2012-12-30,34.99,...,2013-01-11,1,,11338,,0.0,4,100,0.0,2.7992
6,2012-12-30,1,9,13.0863,13.0863,0.8748,1,2013-01-06,2012-12-30,34.99,...,2013-01-11,1,,24604,,0.0,4,6,0.0,2.7992
7,2012-12-30,1,9,13.0863,13.0863,0.8748,1,2013-01-07,2012-12-31,34.99,...,2013-01-12,1,,28204,,0.0,2,6,0.0,2.7992
8,2012-12-30,1,10,13.0863,13.0863,0.8748,1,2013-01-07,2012-12-31,34.99,...,2013-01-12,1,,11615,,0.0,2,98,0.0,2.7992
9,2012-12-30,1,9,13.0863,13.0863,0.8748,1,2013-01-07,2012-12-31,34.99,...,2013-01-12,1,,11061,,0.0,4,6,0.0,2.7992


Now we can aggregate this to the product-week level and create aggregations of the *'DISCOUNTAMOUNT'*, *'DISCOUNTPCT'*, *'ORDERQUANTITY'*, *'PRODUCTSTANDARDCOST'*, *'SALESAMOUNT'*, *'TAXAMT'*, *'TOTALPRODUCTCOST'*, *'UNITPRICE'*, *'UNITPRICEDISCOUNTPCT'* using the `aggregate` transform.

In [15]:
salesds_agg = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'}).aggregate(
                                group_by=['PRODUCTKEY', 'ORDERWEEK'],
                                aggregations={'DISCOUNTAMOUNT': ['MIN','MAX','AVG','SUM'], 
                                              'DISCOUNTPCT': ['MIN','MAX', 'AVG', 'SUM'],
                                              'ORDERQUANTITY': ['SUM'],
                                              'PRODUCTSTANDARDCOST': ['AVG', 'SUM'],
                                              'SALESAMOUNT': ['SUM'], 
                                              'TAXAMT': ['SUM'],
                                              'TOTALPRODUCTCOST': ['AVG', 'SUM'],
                                              'UNITPRICE': ['AVG', 'SUM'],
                                              'UNITPRICEDISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM']})
salesds_agg.describe().to_df()

Unnamed: 0,FEATURE,DTYPE,COUNT,NULL_COUNT,UNIQUE_COUNT,MOST_FREQUENT,MEAN,STD_DEV,MIN,_25_PERCENTILE,_50_PERCENTILE,_75_PERCENTILE,MAX
0,DISCOUNTPCT_AVG,FLOAT,6935,0,98,0,0.001397,0.009489,0,0.0,0.0,0.0,0.2
1,TAXAMT_SUM,NUMBER,6935,0,772,62.6392,338.672564,502.119733,0.3992,41.9832,118.776,408.2376,3721.4008
2,TOTALPRODUCTCOST_SUM,NUMBER,6935,0,809,486.7066,2491.390566,3691.452088,1.8663,269.28,826.2926,3247.53,28226.8246
3,DISCOUNTAMOUNT_AVG,FLOAT,6935,0,1,0,0.0,0.0,0,0.0,0.0,0.0,0
4,UNITPRICEDISCOUNTPCT_MAX,FLOAT,6935,0,1,0,0.0,0.0,0,0.0,0.0,0.0,0
5,DISCOUNTAMOUNT_MAX,FLOAT,6935,0,1,0,0.0,0.0,0,0.0,0.0,0.0,0
6,DISCOUNTAMOUNT_MIN,FLOAT,6935,0,1,0,0.0,0.0,0,0.0,0.0,0.0,0
7,PRODUCTSTANDARDCOST_AVG,NUMBER,6935,0,45,1481.9379000000,661.195888,651.4128,0.8565000000,26.1763,461.4448,1251.9813,2171.2942000000
8,UNITPRICEDISCOUNTPCT_AVG,FLOAT,6935,0,1,0,0.0,0.0,0,0.0,0.0,0.0,0
9,UNITPRICE_AVG,NUMBER,6935,0,42,2443.3500000000,1110.230001,1092.563872,2.2900000000,53.99,769.49,2181.5625,3578.2700000000


This gives us statistics for each product over a given week.
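
The product-week aggregation can be pictured as a group-by with several named aggregates. The pandas sketch below uses toy order lines (the values are illustrative, not taken from the dataset) and only a few of the aggregations listed above.

```python
import pandas as pd

# Toy order lines; one row per line item.
lines = pd.DataFrame({
    "PRODUCTKEY": [214, 214, 214, 310],
    "ORDERWEEK": pd.to_datetime(["2012-12-23", "2012-12-23", "2012-12-30", "2012-12-23"]),
    "ORDERQUANTITY": [1, 1, 1, 1],
    "SALESAMOUNT": [34.99, 34.99, 34.99, 3578.27],
    "UNITPRICE": [34.99, 34.99, 34.99, 3578.27],
})

# One output row per (product, week); output columns are named COLUMN_AGG.
weekly = (
    lines.groupby(["PRODUCTKEY", "ORDERWEEK"])
    .agg(
        ORDERQUANTITY_SUM=("ORDERQUANTITY", "sum"),
        SALESAMOUNT_SUM=("SALESAMOUNT", "sum"),
        UNITPRICE_AVG=("UNITPRICE", "mean"),
    )
    .reset_index()
)
print(weekly)
```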

### Time-series feature engineering

For sales forecasting, in addition to the current-week aggregates, we need to know what the values were in prior weeks. The `lag` transform can create these variables for us. In this case we will lag the following variables over *1*, *2*, *3*, and *12* weeks: *'DISCOUNTAMOUNT_AVG'*, *'DISCOUNTPCT_AVG'*, *'ORDERQUANTITY_SUM'*, *'PRODUCTSTANDARDCOST_AVG'*, *'SALESAMOUNT_SUM'*, *'TAXAMT_SUM'*, *'TOTALPRODUCTCOST_SUM'*, *'UNITPRICEDISCOUNTPCT_AVG'*, *'UNITPRICE_AVG'*, *'UNITPRICE_SUM'*.

In [16]:
salesds = salesds_agg.lag(columns=['DISCOUNTAMOUNT_AVG', 'DISCOUNTPCT_AVG', 'ORDERQUANTITY_SUM', 
                                         'PRODUCTSTANDARDCOST_AVG', 'SALESAMOUNT_SUM', 'TAXAMT_SUM', 
                                         'TOTALPRODUCTCOST_SUM','UNITPRICEDISCOUNTPCT_AVG', 
                                         'UNITPRICE_AVG', 'UNITPRICE_SUM'],
                                amounts=[1, 2, 3, 12],
                                order_by=['PRODUCTKEY', 'ORDERWEEK'],
                                partition=['PRODUCTKEY'])
   
salesds.order(order_by={'PRODUCTKEY':'ASC', 'ORDERWEEK':'ASC'}).preview()

Unnamed: 0,PRODUCTKEY,ORDERWEEK,DISCOUNTAMOUNT_MIN,DISCOUNTAMOUNT_MAX,DISCOUNTAMOUNT_AVG,DISCOUNTAMOUNT_SUM,DISCOUNTPCT_MIN,DISCOUNTPCT_MAX,DISCOUNTPCT_AVG,DISCOUNTPCT_SUM,...,LAG_UNITPRICEDISCOUNTPCT_AVG_3,LAG_UNITPRICEDISCOUNTPCT_AVG_12,LAG_UNITPRICE_AVG_1,LAG_UNITPRICE_AVG_2,LAG_UNITPRICE_AVG_3,LAG_UNITPRICE_AVG_12,LAG_UNITPRICE_SUM_1,LAG_UNITPRICE_SUM_2,LAG_UNITPRICE_SUM_3,LAG_UNITPRICE_SUM_12
0,214,2012-12-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,214,2012-12-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,34.99,,,,139.96,,,
2,214,2013-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,34.99,34.99,,,349.9,139.96,,
3,214,2013-01-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,419.88,349.9,139.96,
4,214,2013-01-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,209.94,419.88,349.9,
5,214,2013-01-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,384.89,209.94,419.88,
6,214,2013-02-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,1084.69,384.89,209.94,
7,214,2013-02-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,1539.56,1084.69,384.89,
8,214,2013-02-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,1189.66,1539.56,1084.69,
9,214,2013-02-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,34.99,34.99,34.99,,1154.67,1189.66,1539.56,
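
A lag partitioned by product and ordered by week corresponds to a per-group `shift` in pandas. The sketch below uses the first four *UNITPRICE_SUM* values for product 214 implied by the preview above, plus a made-up product `999` to show that lags do not leak across products.

```python
import pandas as pd

weekly = pd.DataFrame({
    "PRODUCTKEY": [214, 214, 214, 214, 999, 999],  # 999 is a hypothetical product
    "ORDERWEEK": pd.to_datetime([
        "2012-12-23", "2012-12-30", "2013-01-06", "2013-01-13",
        "2012-12-23", "2012-12-30",
    ]),
    "UNITPRICE_SUM": [139.96, 349.90, 419.88, 209.94, 10.0, 20.0],
})

# Sort so that shifting moves backwards in time within each product.
weekly = weekly.sort_values(["PRODUCTKEY", "ORDERWEEK"]).reset_index(drop=True)

for k in (1, 2):
    # shift(k) within each product gives the value from k rows earlier.
    weekly[f"LAG_UNITPRICE_SUM_{k}"] = weekly.groupby("PRODUCTKEY")["UNITPRICE_SUM"].shift(k)

print(weekly)
```

Note that this sketch lags by rows rather than by calendar weeks, so a product with a missing week would see a longer gap than the lag number suggests.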


In addition to lag variables, the moving average of the quantities can be useful. In this case, we'll calculate the moving average over *4* observations of *ORDERQUANTITY_SUM* and *SALESAMOUNT_SUM* using the transform `moving_avg`.

In [17]:
salesds = salesds.moving_avg(input_columns=['ORDERQUANTITY_SUM', 'SALESAMOUNT_SUM'],
                             window_sizes=[4],
                             order_by=['PRODUCTKEY', 'ORDERWEEK'],
                             partition=['PRODUCTKEY'])
    
salesds.order(order_by={'PRODUCTKEY':'ASC', 'ORDERWEEK':'ASC'}).preview()

Unnamed: 0,PRODUCTKEY,ORDERWEEK,DISCOUNTAMOUNT_MIN,DISCOUNTAMOUNT_MAX,DISCOUNTAMOUNT_AVG,DISCOUNTAMOUNT_SUM,DISCOUNTPCT_MIN,DISCOUNTPCT_MAX,DISCOUNTPCT_AVG,DISCOUNTPCT_SUM,...,LAG_UNITPRICE_AVG_1,LAG_UNITPRICE_AVG_2,LAG_UNITPRICE_AVG_3,LAG_UNITPRICE_AVG_12,LAG_UNITPRICE_SUM_1,LAG_UNITPRICE_SUM_2,LAG_UNITPRICE_SUM_3,LAG_UNITPRICE_SUM_12,MEAN_ORDERQUANTITY_SUM_4,MEAN_SALESAMOUNT_SUM_4
0,214,2012-12-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,4.0,139.96
1,214,2012-12-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,,,,139.96,,,,7.0,244.93
2,214,2013-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,,,349.9,139.96,,,8.666,303.2466666
3,214,2013-01-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,419.88,349.9,139.96,,8.0,279.92
4,214,2013-01-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,209.94,419.88,349.9,,9.75,341.1525
5,214,2013-01-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,384.89,209.94,419.88,,15.0,524.85
6,214,2013-02-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,1084.69,384.89,209.94,,23.0,804.77
7,214,2013-02-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,1539.56,1084.69,384.89,,30.0,1049.7
8,214,2013-02-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,1189.66,1539.56,1084.69,,35.5,1242.145
9,214,2013-02-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,34.99,34.99,34.99,,1154.67,1189.66,1539.56,,37.0,1294.63
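
The *MEAN_ORDERQUANTITY_SUM_4* values in the preview are consistent with a trailing window of four rows that includes the current week and averages whatever is available at the start of the series. The pandas sketch below reproduces them from the weekly *ORDERQUANTITY_SUM* values for product 214 shown in this notebook. It is a local analogue, not the Rasgo implementation.

```python
import pandas as pd

# Weekly ORDERQUANTITY_SUM for product 214, as shown in the previews.
qty = [4, 10, 12, 6, 11, 31, 44, 34, 33, 37]

weekly = pd.DataFrame({
    "PRODUCTKEY": 214,
    "ORDERWEEK": pd.date_range("2012-12-23", periods=len(qty), freq="7D"),
    "ORDERQUANTITY_SUM": qty,
})

# Trailing 4-row mean per product; min_periods=1 lets the first rows
# average over fewer than four observations instead of returning NaN.
weekly["MEAN_ORDERQUANTITY_SUM_4"] = (
    weekly.sort_values(["PRODUCTKEY", "ORDERWEEK"])
    .groupby("PRODUCTKEY")["ORDERQUANTITY_SUM"]
    .transform(lambda s: s.rolling(window=4, min_periods=1).mean())
)

print(weekly["MEAN_ORDERQUANTITY_SUM_4"].round(3).tolist())
```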


#### Save result

At this point, the data has been aggregated to the weekly level and multiple transformations have been applied. This could be a good starting point for additional analysis and is useful for visualization. For this reason, we will publish it to Rasgo to make it available for others to use. This can be done with the `rasgo.publish.dataset` function.

In [18]:
weeklysales = rasgo.publish.dataset(dataset=salesds,
                                    name="WKSP: AdventureWorks: weekly sales",
                                    description="Internet Sales data converted to weekly sales")
weeklysales

Dataset(id=2420, name=WKSP: AdventureWorks: weekly sales, status=published, description=Internet Sales data converted to weekly sales)

We can examine this dataset on Rasgo by clicking the link below

In [19]:
print(f"https://app.rasgoml.com/datasets/{weeklysales.id}")

https://app.rasgoml.com/datasets/2420


Using this dataset, we can continue data preparation.

### Create Modeling Data

We can now begin final preparation for modeling with this dataset. To do this, we need to do three things. First, the target (next week's sales) needs to be created. Second, the categorical variables should be encoded. Finally, missing values should be imputed for the numeric columns.

#### Target Creation

Use the `lag` transform with a negative lag value to get next week's sales as the target. While doing this, rename the resulting column to make it clear that it is the target.

In [20]:
modelingds = weeklysales.lag(columns=['SALESAMOUNT_SUM'],
                             amounts=[-1],
                             order_by=['PRODUCTKEY', 'ORDERWEEK'],
                             partition =['PRODUCTKEY']).rename(
                             renames={'LAG_SALESAMOUNT_SUM__1': 'TARGET_SALESAMOUNT'})
modelingds.preview().columns.sort_values()

Index(['DISCOUNTAMOUNT_AVG', 'DISCOUNTAMOUNT_MAX', 'DISCOUNTAMOUNT_MIN',
       'DISCOUNTAMOUNT_SUM', 'DISCOUNTPCT_AVG', 'DISCOUNTPCT_MAX',
       'DISCOUNTPCT_MIN', 'DISCOUNTPCT_SUM', 'LAG_DISCOUNTAMOUNT_AVG_1',
       'LAG_DISCOUNTAMOUNT_AVG_12', 'LAG_DISCOUNTAMOUNT_AVG_2',
       'LAG_DISCOUNTAMOUNT_AVG_3', 'LAG_DISCOUNTPCT_AVG_1',
       'LAG_DISCOUNTPCT_AVG_12', 'LAG_DISCOUNTPCT_AVG_2',
       'LAG_DISCOUNTPCT_AVG_3', 'LAG_ORDERQUANTITY_SUM_1',
       'LAG_ORDERQUANTITY_SUM_12', 'LAG_ORDERQUANTITY_SUM_2',
       'LAG_ORDERQUANTITY_SUM_3', 'LAG_PRODUCTSTANDARDCOST_AVG_1',
       'LAG_PRODUCTSTANDARDCOST_AVG_12', 'LAG_PRODUCTSTANDARDCOST_AVG_2',
       'LAG_PRODUCTSTANDARDCOST_AVG_3', 'LAG_SALESAMOUNT_SUM_1',
       'LAG_SALESAMOUNT_SUM_12', 'LAG_SALESAMOUNT_SUM_2',
       'LAG_SALESAMOUNT_SUM_3', 'LAG_TAXAMT_SUM_1', 'LAG_TAXAMT_SUM_12',
       'LAG_TAXAMT_SUM_2', 'LAG_TAXAMT_SUM_3', 'LAG_TOTALPRODUCTCOST_SUM_1',
       'LAG_TOTALPRODUCTCOST_SUM_12', 'LAG_TOTALPRODUCTCOST_SUM_2',
  

#### Categorical encoding

The only column that needs to be encoded is *PRODUCTKEY*. We can use the `one_hot_encode`, `target_encode` or `label_encode` transforms.

For features with a large number of categories, and when using tree-based modeling algorithms, the `label_encode` transform is a useful way to encode categorical variables. The `target_encode` transform replaces each category with the mean target value of that category. Target encoding is a powerful technique for encoding high-cardinality categorical variables efficiently, and it can help improve model performance. In this case, we will target encode *PRODUCTKEY*.

In [21]:
modelingds_encoded = modelingds.target_encode(column='PRODUCTKEY',
                                      target='TARGET_SALESAMOUNT')

modelingds_encoded.order(order_by={'PRODUCTKEY':'ASC', 'ORDERWEEK':'ASC'}).preview()

Unnamed: 0,TARGET_SALESAMOUNT,LAG_TAXAMT_SUM_1,LAG_TOTALPRODUCTCOST_SUM_2,LAG_TOTALPRODUCTCOST_SUM_3,LAG_UNITPRICE_AVG_2,TAXAMT_SUM,LAG_DISCOUNTPCT_AVG_3,LAG_ORDERQUANTITY_SUM_1,UNITPRICE_SUM,DISCOUNTAMOUNT_AVG,...,LAG_DISCOUNTAMOUNT_AVG_2,LAG_PRODUCTSTANDARDCOST_AVG_12,LAG_PRODUCTSTANDARDCOST_AVG_3,ORDERQUANTITY_SUM,LAG_ORDERQUANTITY_SUM_3,LAG_UNITPRICEDISCOUNTPCT_AVG_2,LAG_UNITPRICE_AVG_1,LAG_UNITPRICE_SUM_12,LAG_UNITPRICE_SUM_3,PRODUCTKEY_TARGET_ENCODED
0,349.9,,,,,11.1968,,,139.96,0.0,...,,,,4,,,,,,1366.452
1,419.88,11.1968,,,,27.992,,4.0,349.9,0.0,...,,,,10,,,34.99,,,1366.452
2,209.94,27.992,52.3452,,34.99,33.5904,,10.0,419.88,0.0,...,0.0,,,12,,0.0,34.99,,,1366.452
3,384.89,33.5904,130.863,52.3452,34.99,16.7952,0.0,12.0,209.94,0.0,...,0.0,,13.0863,6,4.0,0.0,34.99,,139.96,1366.452
4,1084.69,16.7952,157.0356,130.863,34.99,30.7912,0.0,6.0,384.89,0.0,...,0.0,,13.0863,11,10.0,0.0,34.99,,349.9,1366.452
5,1539.56,30.7912,78.5178,157.0356,34.99,86.7752,0.0,11.0,1084.69,0.0,...,0.0,,13.0863,31,12.0,0.0,34.99,,419.88,1366.452
6,1189.66,86.7752,143.9493,78.5178,34.99,123.1648,0.0,31.0,1539.56,0.0,...,0.0,,13.0863,44,6.0,0.0,34.99,,209.94,1366.452
7,1154.67,123.1648,405.6753,143.9493,34.99,95.1728,0.0,44.0,1189.66,0.0,...,0.0,,13.0863,34,11.0,0.0,34.99,,384.89,1366.452
8,1294.63,95.1728,575.7972,405.6753,34.99,92.3736,0.0,34.0,1154.67,0.0,...,0.0,,13.0863,33,31.0,0.0,34.99,,1084.69,1366.452
9,1014.71,92.3736,444.9342,575.7972,34.99,103.5704,0.0,33.0,1294.63,0.0,...,0.0,,13.0863,37,44.0,0.0,34.99,,1539.56,1366.452
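
The target creation and target encoding steps can be sketched together in pandas. The example below uses the first three weekly *SALESAMOUNT_SUM* values for product 214 implied by the preview, plus a made-up product `999`. It is an illustration of the idea, not the Rasgo implementation, and the encoded values differ from the preview because only a few rows are used.

```python
import pandas as pd

weekly = pd.DataFrame({
    "PRODUCTKEY": [214, 214, 214, 999, 999, 999],  # 999 is a hypothetical product
    "ORDERWEEK": pd.to_datetime(["2012-12-23", "2012-12-30", "2013-01-06"] * 2),
    "SALESAMOUNT_SUM": [139.96, 349.90, 419.88, 10.0, 20.0, 30.0],
}).sort_values(["PRODUCTKEY", "ORDERWEEK"]).reset_index(drop=True)

# A negative lag: shift(-1) within each product pulls next week's sales
# onto the current row. The last week of each product has no target.
weekly["TARGET_SALESAMOUNT"] = weekly.groupby("PRODUCTKEY")["SALESAMOUNT_SUM"].shift(-1)

# Target encoding: replace each product by the mean target of that product.
# Missing targets are skipped by the mean.
weekly["PRODUCTKEY_TARGET_ENCODED"] = (
    weekly.groupby("PRODUCTKEY")["TARGET_SALESAMOUNT"].transform("mean")
)

print(weekly)
```

In practice the category means are often computed on the training rows only, so that targets from the evaluation period do not leak into the features.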


#### Imputation

As a final step before modeling, all numeric columns should have missing values replaced by a number. This can be done with the `impute` transformation. If a linear or logistic regression, SVM, or neural network were going to be applied, we might want to impute the mean or median instead. This could be done by passing 'mean' or 'median' through the imputations dictionary.

As the modeling algorithm applied here is tree-based, we can simply impute an extreme value. All of the features created are non-negative or close to zero, so we will impute a very large negative number, *-999,999*.
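
The sentinel imputation described above can be sketched in pandas. When every numeric column receives the same value, the column-to-value dictionary can be built with a comprehension rather than typed out. The toy frame and the choice to skip *PRODUCTKEY* are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = -999999

df = pd.DataFrame({
    "PRODUCTKEY": [214, 214, 214],
    "SALESAMOUNT_SUM": [139.96, 349.90, 419.88],
    "LAG_SALESAMOUNT_SUM_1": [np.nan, 139.96, 349.90],
    "LAG_SALESAMOUNT_SUM_2": [np.nan, np.nan, 139.96],
})

# Map every numeric column except the key to the sentinel value.
numeric_cols = df.select_dtypes(include="number").columns.drop("PRODUCTKEY")
imputation_dict = {col: SENTINEL for col in numeric_cols}

# fillna accepts a {column: value} dictionary.
imputed = df.fillna(value=imputation_dict)
print(imputed)
```
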

In [22]:
imputation_dict = {'DISCOUNTAMOUNT_AVG': -999999,
                   'DISCOUNTAMOUNT_MAX': -999999,
                   'DISCOUNTAMOUNT_MIN': -999999,
                   'DISCOUNTAMOUNT_SUM': -999999,
                   'DISCOUNTPCT_AVG': -999999,
                   'DISCOUNTPCT_MAX': -999999,
                   'DISCOUNTPCT_MIN': -999999,
                   'DISCOUNTPCT_SUM': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_1': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_12': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_2': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_3': -999999,
                   'LAG_DISCOUNTPCT_AVG_1': -999999,
                   'LAG_DISCOUNTPCT_AVG_12': -999999,
                   'LAG_DISCOUNTPCT_AVG_2': -999999,
                   'LAG_DISCOUNTPCT_AVG_3': -999999,
                   'LAG_ORDERQUANTITY_SUM_1': -999999,
                   'LAG_ORDERQUANTITY_SUM_12': -999999,
                   'LAG_ORDERQUANTITY_SUM_2': -999999,
                   'LAG_ORDERQUANTITY_SUM_3': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_1': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_12': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_2': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_3': -999999,
                   'LAG_SALESAMOUNT_SUM_1': -999999,
                   'LAG_SALESAMOUNT_SUM_12': -999999,
                   'LAG_SALESAMOUNT_SUM_2': -999999,
                   'LAG_SALESAMOUNT_SUM_3': -999999,
                   'LAG_TAXAMT_SUM_1': -999999,
                   'LAG_TAXAMT_SUM_12': -999999,
                   'LAG_TAXAMT_SUM_2': -999999,
                   'LAG_TAXAMT_SUM_3': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_1': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_12': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_2': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_3': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_1': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_12': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_2': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_3': -999999,
                   'LAG_UNITPRICE_AVG_1': -999999,
                   'LAG_UNITPRICE_AVG_12': -999999,
                   'LAG_UNITPRICE_AVG_2': -999999,
                   'LAG_UNITPRICE_AVG_3': -999999,
                   'LAG_UNITPRICE_SUM_1': -999999,
                   'LAG_UNITPRICE_SUM_12': -999999,
                   'LAG_UNITPRICE_SUM_2': -999999,
                   'LAG_UNITPRICE_SUM_3': -999999,
                   'MEAN_ORDERQUANTITY_SUM_4': -999999,
                   'MEAN_SALESAMOUNT_SUM_4': -999999,
                   'ORDERQUANTITY_SUM': -999999,
                   'PRODUCTSTANDARDCOST_AVG': -999999,
                   'PRODUCTSTANDARDCOST_SUM': -999999,
                   'SALESAMOUNT_SUM': -999999,
                   'TAXAMT_SUM': -999999,
                   'TOTALPRODUCTCOST_AVG': -999999,
                   'TOTALPRODUCTCOST_SUM': -999999,
                   'UNITPRICEDISCOUNTPCT_AVG': -999999,
                   'UNITPRICEDISCOUNTPCT_MAX': -999999,
                   'UNITPRICEDISCOUNTPCT_MIN': -999999,
                   'UNITPRICEDISCOUNTPCT_SUM': -999999,
                   'UNITPRICE_AVG': -999999,
                   'UNITPRICE_SUM': -999999}
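
Since every column receives the same value, the dictionary above could also be built from a list of column names rather than typed out by hand. The following is a minimal sketch in plain Python; `SENTINEL`, `numeric_columns`, and `imputation_dict_sketch` are illustrative names that do not appear elsewhere in this notebook, and only a handful of the columns are listed.

```python
# Illustrative names only: SENTINEL, numeric_columns and
# imputation_dict_sketch are not part of the original notebook.
SENTINEL = -999999

# A few of the column names from the dictionary above; in practice this
# list would hold every numeric feature column.
numeric_columns = [
    'DISCOUNTAMOUNT_AVG',
    'LAG_SALESAMOUNT_SUM_1',
    'MEAN_SALESAMOUNT_SUM_4',
    'UNITPRICE_SUM',
]

# Map every column to the same extreme value.
imputation_dict_sketch = {column: SENTINEL for column in numeric_columns}
print(imputation_dict_sketch)
```

The resulting dictionary has the same shape as `imputation_dict` and could be passed to `impute` in the same way.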

In [23]:
modelingds_imputed = modelingds_encoded.impute(imputations=imputation_dict)

modelingds_imputed.order(order_by={'PRODUCTKEY':'ASC','ORDERWEEK':'ASC'}).preview()

Unnamed: 0,LAG_DISCOUNTAMOUNT_AVG_2,LAG_ORDERQUANTITY_SUM_12,PRODUCTKEY_TARGET_ENCODED,PRODUCTSTANDARDCOST_SUM,LAG_PRODUCTSTANDARDCOST_AVG_3,LAG_TAXAMT_SUM_3,LAG_TOTALPRODUCTCOST_SUM_1,UNITPRICEDISCOUNTPCT_AVG,UNITPRICEDISCOUNTPCT_MIN,UNITPRICEDISCOUNTPCT_SUM,...,LAG_SALESAMOUNT_SUM_3,PRODUCTKEY,LAG_ORDERQUANTITY_SUM_3,LAG_TAXAMT_SUM_1,LAG_TOTALPRODUCTCOST_SUM_3,LAG_UNITPRICE_AVG_1,LAG_UNITPRICE_SUM_12,MEAN_ORDERQUANTITY_SUM_4,ORDERQUANTITY_SUM,LAG_UNITPRICEDISCOUNTPCT_AVG_12
0,-999999.0,-999999,1366.452,52.3452,-999999.0,-999999.0,-999999.0,0.0,0.0,0.0,...,-999999.0,214,-999999,-999999.0,-999999.0,-999999.0,-999999.0,4.0,4,-999999.0
1,-999999.0,-999999,1366.452,130.863,-999999.0,-999999.0,52.3452,0.0,0.0,0.0,...,-999999.0,214,-999999,11.1968,-999999.0,34.99,-999999.0,7.0,10,-999999.0
2,0.0,-999999,1366.452,157.0356,-999999.0,-999999.0,130.863,0.0,0.0,0.0,...,-999999.0,214,-999999,27.992,-999999.0,34.99,-999999.0,8.666,12,-999999.0
3,0.0,-999999,1366.452,78.5178,13.0863,11.1968,157.0356,0.0,0.0,0.0,...,139.96,214,4,33.5904,52.3452,34.99,-999999.0,8.0,6,-999999.0
4,0.0,-999999,1366.452,143.9493,13.0863,27.992,78.5178,0.0,0.0,0.0,...,349.9,214,10,16.7952,130.863,34.99,-999999.0,9.75,11,-999999.0
5,0.0,-999999,1366.452,405.6753,13.0863,33.5904,143.9493,0.0,0.0,0.0,...,419.88,214,12,30.7912,157.0356,34.99,-999999.0,15.0,31,-999999.0
6,0.0,-999999,1366.452,575.7972,13.0863,16.7952,405.6753,0.0,0.0,0.0,...,209.94,214,6,86.7752,78.5178,34.99,-999999.0,23.0,44,-999999.0
7,0.0,-999999,1366.452,444.9342,13.0863,30.7912,575.7972,0.0,0.0,0.0,...,384.89,214,11,123.1648,143.9493,34.99,-999999.0,30.0,34,-999999.0
8,0.0,-999999,1366.452,431.8479,13.0863,86.7752,444.9342,0.0,0.0,0.0,...,1084.69,214,31,95.1728,405.6753,34.99,-999999.0,35.5,33,-999999.0
9,0.0,-999999,1366.452,484.1931,13.0863,123.1648,431.8479,0.0,0.0,0.0,...,1539.56,214,44,92.3736,575.7972,34.99,-999999.0,37.0,37,-999999.0
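
To make the difference between the imputation strategies concrete, here is a small pandas sketch on toy data. It illustrates the idea only and does not use the Rasgo `impute` transformation. The column name is borrowed from the modeling dataset, but the frame and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a lag feature whose first week has no history.
toy = pd.DataFrame({'LAG_SALESAMOUNT_SUM_1': [np.nan, 139.96, 349.90, 419.88]})
col = toy['LAG_SALESAMOUNT_SUM_1']

# Fill the missing value three different ways for comparison.
imputed = pd.DataFrame({
    'sentinel': col.fillna(-999999),     # extreme value, suits tree models
    'mean': col.fillna(col.mean()),      # common choice for linear models
    'median': col.fillna(col.median()),  # more robust to outliers
})
print(imputed)
```

With the mean or median, the filled row looks like an ordinary observation. With the extreme value, it stays clearly distinguishable from real data.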


#### Train-test split

As this is a time-series problem, a random train-test split won't work: observations from late in the time frame would land in the training set while earlier observations land in the test set, leaking information from the future into the model. The way to avoid this problem is to perform the split based on the date. The `train_test_split` transformation can do this when the date column is passed through the parameter **order_by**.

In [24]:
modelingds_split = modelingds_imputed.train_test_split(order_by=['ORDERWEEK'],
                                                       train_percent=0.8)
    
modelingds_split.order(order_by={'PRODUCTKEY':'ASC', 'ORDERWEEK':'ASC'}).preview()

Unnamed: 0,DISCOUNTAMOUNT_AVG,DISCOUNTPCT_MAX,LAG_DISCOUNTPCT_AVG_3,LAG_ORDERQUANTITY_SUM_1,LAG_UNITPRICEDISCOUNTPCT_AVG_1,LAG_TOTALPRODUCTCOST_SUM_3,LAG_UNITPRICE_AVG_2,ORDERWEEK,LAG_UNITPRICE_AVG_1,LAG_DISCOUNTAMOUNT_AVG_1,...,MEAN_SALESAMOUNT_SUM_4,LAG_DISCOUNTAMOUNT_AVG_12,LAG_SALESAMOUNT_SUM_2,UNITPRICEDISCOUNTPCT_AVG,UNITPRICEDISCOUNTPCT_MAX,LAG_PRODUCTSTANDARDCOST_AVG_12,UNITPRICEDISCOUNTPCT_SUM,SALESAMOUNT_SUM,LAG_PRODUCTSTANDARDCOST_AVG_1,TT_SPLIT
0,0.0,0.0,-999999.0,-999999,-999999.0,-999999.0,-999999.0,2012-12-23,-999999.0,-999999.0,...,139.96,-999999.0,-999999.0,0.0,0.0,-999999.0,0.0,139.96,-999999.0,TRAIN
1,0.0,0.0,-999999.0,4,0.0,-999999.0,-999999.0,2012-12-30,34.99,0.0,...,244.93,-999999.0,-999999.0,0.0,0.0,-999999.0,0.0,349.9,13.0863,TRAIN
2,0.0,0.0,-999999.0,10,0.0,-999999.0,34.99,2013-01-06,34.99,0.0,...,303.2466666,-999999.0,139.96,0.0,0.0,-999999.0,0.0,419.88,13.0863,TRAIN
3,0.0,0.0,0.0,12,0.0,52.3452,34.99,2013-01-13,34.99,0.0,...,279.92,-999999.0,349.9,0.0,0.0,-999999.0,0.0,209.94,13.0863,TRAIN
4,0.0,0.0,0.0,6,0.0,130.863,34.99,2013-01-20,34.99,0.0,...,341.1525,-999999.0,419.88,0.0,0.0,-999999.0,0.0,384.89,13.0863,TRAIN
5,0.0,0.0,0.0,11,0.0,157.0356,34.99,2013-01-27,34.99,0.0,...,524.85,-999999.0,209.94,0.0,0.0,-999999.0,0.0,1084.69,13.0863,TRAIN
6,0.0,0.0,0.0,31,0.0,78.5178,34.99,2013-02-03,34.99,0.0,...,804.77,-999999.0,384.89,0.0,0.0,-999999.0,0.0,1539.56,13.0863,TRAIN
7,0.0,0.0,0.0,44,0.0,143.9493,34.99,2013-02-10,34.99,0.0,...,1049.7,-999999.0,1084.69,0.0,0.0,-999999.0,0.0,1189.66,13.0863,TRAIN
8,0.0,0.0,0.0,34,0.0,405.6753,34.99,2013-02-17,34.99,0.0,...,1242.145,-999999.0,1539.56,0.0,0.0,-999999.0,0.0,1154.67,13.0863,TRAIN
9,0.0,0.0,0.0,33,0.0,575.7972,34.99,2013-02-24,34.99,0.0,...,1294.63,-999999.0,1189.66,0.0,0.0,-999999.0,0.0,1294.63,13.0863,TRAIN
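
The idea behind a date-ordered split can be sketched in pandas on toy data. This is one possible way to do it, under the assumption that the cutoff is chosen so that about 80% of the distinct weeks fall in the training set. It is not necessarily how the Rasgo `train_test_split` transformation is implemented, and all variable names here are illustrative.

```python
import pandas as pd

# Toy weekly data: two products observed over the same ten weeks.
weeks = pd.date_range('2013-01-06', periods=10, freq='W')
toy = pd.DataFrame({
    'PRODUCTKEY': [214] * 10 + [477] * 10,
    'ORDERWEEK': list(weeks) * 2,
})

# Pick the cutoff week so that roughly 80% of distinct weeks are training data.
train_percent = 0.8
unique_weeks = toy['ORDERWEEK'].drop_duplicates().sort_values().reset_index(drop=True)
n_train_weeks = int(len(unique_weeks) * train_percent)
cutoff = unique_weeks.iloc[n_train_weeks - 1]

# Everything up to and including the cutoff is TRAIN, the rest is TEST.
toy['TT_SPLIT'] = 'TEST'
toy.loc[toy['ORDERWEEK'] <= cutoff, 'TT_SPLIT'] = 'TRAIN'
print(toy['TT_SPLIT'].value_counts())
```

Because the split is made on the date, every training row is earlier than every test row, which is the property that prevents leakage.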


#### Save Modeling Dataset

We can now save this modeling dataset so we can return to it in the future.

In [25]:
modeling = rasgo.publish.dataset(dataset=modelingds_split,
                                 name="WKSP: AdventureWorks: Sales Forecast Modeling",
                                 description="Modeling dataset for Internet Sales Forecasting")
modeling

Dataset(id=2421, name=WKSP: AdventureWorks: Sales Forecast Modeling, status=published, description=Modeling dataset for Internet Sales Forecasting)

We can examine this dataset in Rasgo by following the link printed below.

In [26]:
print(f"https://app.rasgoml.com/datasets/{modeling.id}")

https://app.rasgoml.com/datasets/2421


Capture this dataset ID for use in prediction.

In [27]:
ds_id = modeling.id

### Modeling

We are now ready to build the model. First, get the modeling data from Rasgo using `to_df`.

In [28]:
df = modeling.to_df().reset_index(drop=True)

Find the columns that were not read in as numeric datatypes and convert them to numeric values, leaving `ORDERWEEK` and `TT_SPLIT` as they are.

In [29]:
for c in df.select_dtypes(exclude=[np.number]).columns:
    if c not in ['ORDERWEEK', 'TT_SPLIT']:
        df[c] = pd.to_numeric(df[c])

Eliminate the last week of data as there is no target.

In [30]:
df = df[~df.TARGET_SALESAMOUNT.isna()]

#### Train the model

First, split the data using the TT_SPLIT column.

In [31]:
df_train = df[df['TT_SPLIT'] == 'TRAIN'].drop(columns=['TT_SPLIT', 'ORDERWEEK'])
df_test = df[df['TT_SPLIT'] == 'TEST'].drop(columns=['TT_SPLIT', 'ORDERWEEK'])

In [32]:
y_train = df_train['TARGET_SALESAMOUNT']
X_train = df_train.drop(columns=['TARGET_SALESAMOUNT'])
y_test = df_test['TARGET_SALESAMOUNT']
X_test = df_test.drop(columns=['TARGET_SALESAMOUNT'])

#### Fit the model

For illustration purposes, we are just fitting the model with a single set of parameters. In general, you should optimize the hyperparameters before building the final model. That process is beyond the scope of this document.

In [33]:
model = xgb.XGBRegressor(n_estimators=100,
                         max_depth=5,
                         eta=0.01,
                         random_state=1066,
                         subsample=0.7,
                         colsample_bytree=0.7)

model.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, enable_categorical=False,
             eta=0.01, gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.00999999978,
             max_delta_step=0, max_depth=5, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=1066,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.7,
             tree_method='exact', validate_parameters=1, verbosity=None)

#### Check the performance

In [34]:
model.predict(X_test)

array([ 288.2816 , 5252.6167 ,  886.3017 , ...,  201.25864,  128.2966 ,
        113.91244], dtype=float32)

In [35]:
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
rmse

4793.670942182546
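
For reference, the root mean squared error is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the target (sales amount). A minimal sketch of the calculation with NumPy follows; the toy numbers are invented for illustration and are not taken from the model above.

```python
import numpy as np

# Toy actuals and predictions (illustrative values only).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# RMSE: square the errors, average them, then take the square root.
errors = y_true - y_pred
rmse_manual = np.sqrt(np.mean(errors ** 2))
print(rmse_manual)
```

Squaring the errors means a few large misses weigh heavily on the result, which is worth keeping in mind when a handful of products have much higher sales than the rest.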

### Predict on new data

Since our feature engineering was saved in Rasgo, as new data enters the system, it will automatically be prepared for modeling. We can just pull the data in question and make a prediction on it.

In this case, if we are making these predictions each week, we can just pull the most recent week. In this particular data, that is '*2014-01-19*'.

#### Pull the data

Use `to_df` to grab the data for this date, then convert the non-numeric columns to numeric values as was done for the training data. The columns not needed by the model are dropped in the next step.

In [36]:
predictdf = rasgo.get.dataset(ds_id).to_df(filters=["ORDERWEEK = '2014-01-19'"])
for c in predictdf.select_dtypes(exclude=[np.number]).columns:
    if c not in ['ORDERWEEK', 'TT_SPLIT']:
        predictdf[c] = pd.to_numeric(predictdf[c])
predictdf.head()

Unnamed: 0,LAG_PRODUCTSTANDARDCOST_AVG_1,LAG_PRODUCTSTANDARDCOST_AVG_2,LAG_UNITPRICE_AVG_3,TOTALPRODUCTCOST_AVG,LAG_TOTALPRODUCTCOST_SUM_1,LAG_DISCOUNTAMOUNT_AVG_12,LAG_DISCOUNTPCT_AVG_12,LAG_ORDERQUANTITY_SUM_12,LAG_ORDERQUANTITY_SUM_2,LAG_UNITPRICE_SUM_3,...,LAG_DISCOUNTPCT_AVG_2,LAG_PRODUCTSTANDARDCOST_AVG_3,LAG_SALESAMOUNT_SUM_12,LAG_TAXAMT_SUM_1,LAG_UNITPRICE_SUM_2,LAG_SALESAMOUNT_SUM_3,MEAN_SALESAMOUNT_SUM_4,ORDERQUANTITY_SUM,SALESAMOUNT_SUM,TT_SPLIT
0,41.5723,41.5723,53.99,41.5723,83.1446,0.0,0.0,6,3,161.97,...,0.0,41.5723,323.94,8.6384,161.97,161.97,134.975,2,107.98,TEST
1,1.8663,1.8663,4.99,1.8663,82.1172,0.0,0.0,98,29,214.57,...,0.0,1.8663,489.02,17.5648,144.71,214.57,193.3625,39,194.61,TEST
2,13.09,13.09,35.0,13.09,196.35,0.0,0.0,25,17,700.0,...,0.0,13.09,875.0,42.0,595.0,700.0,533.75,9,315.0,TEST
3,1.8663,1.8663,4.99,1.8663,41.0586,0.0,0.0,33,13,114.77,...,0.0,1.8663,164.67,8.7824,64.87,114.77,107.285,28,139.72,TEST
4,38.4923,38.4923,49.99,38.4923,230.9538,0.0,0.0,13,6,399.92,...,0.0,38.4923,649.87,23.9952,299.94,399.92,312.4375,5,249.95,TEST


Now use the model to get the sales forecast. We will create a dataframe to hold the predictions then drop the columns not needed by the model before making the prediction.

In [37]:
salesforecastdf = predictdf[['PRODUCTKEY', 'ORDERWEEK']].copy()
salesforecastdf['forecast'] = model.predict(predictdf.drop(columns=['TT_SPLIT', 'ORDERWEEK', 'TARGET_SALESAMOUNT']))
salesforecastdf

Unnamed: 0,PRODUCTKEY,ORDERWEEK,forecast
0,488,2014-01-19,139.818634
1,477,2014-01-19,138.948486
2,537,2014-01-19,421.024445
3,530,2014-01-19,113.912437
4,228,2014-01-19,242.160858
5,467,2014-01-19,119.229134
6,479,2014-01-19,111.68586
7,473,2014-01-19,118.734268
8,538,2014-01-19,204.513458
9,234,2014-01-19,172.3703
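
Because the prediction data is pulled separately from the training data, its columns may come back in a different order. XGBoost's scikit-learn wrapper generally expects the same feature columns it saw during fitting, so one defensive option is to select the columns by the training column list instead of dropping the unwanted ones. The sketch below shows the idea on a toy frame. `train_columns` is a stand-in for `X_train.columns`, and the other names are illustrative.

```python
import pandas as pd

# Stand-in for list(X_train.columns): the feature order used at fit time.
train_columns = ['PRODUCTKEY', 'SALESAMOUNT_SUM', 'LAG_SALESAMOUNT_SUM_1']

# Toy prediction frame: same features in a different order, plus extras.
toy_predict = pd.DataFrame({
    'TT_SPLIT': ['TEST', 'TEST'],
    'LAG_SALESAMOUNT_SUM_1': [161.97, 144.71],
    'ORDERWEEK': ['2014-01-19', '2014-01-19'],
    'SALESAMOUNT_SUM': [107.98, 194.61],
    'PRODUCTKEY': [488, 477],
})

# Fail early if a feature the model needs is absent.
missing = [c for c in train_columns if c not in toy_predict.columns]
if missing:
    raise ValueError(f"missing feature columns: {missing}")

# Selecting by the training list both drops the extras and fixes the order.
aligned = toy_predict[train_columns]
print(aligned)
```

In this notebook the equivalent call would be along the lines of `model.predict(predictdf[X_train.columns])`, assuming `X_train` is still in memory.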
