# Sales Forecasting with Rasgo

This notebook shows how to perform the data preparation and feature engineering for a sales forecasting model. Starting with [AdventureWorks](https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure) data preloaded in Rasgo, we will explore the data, create features, and extract a modeling dataset.

This analysis will be focused on the internet sales for this company.

## Packages

The documentation for each package used in this tutorial is linked below:
* [numpy](https://numpy.org/doc/stable/)
* [os](https://docs.python.org/3/library/os.html)
* [pandas](https://pandas.pydata.org/docs/)
* [pyrasgo](https://docs.rasgoml.com/rasgo-docs/)
* [scikit-learn](https://scikit-learn.org/stable/)
    * [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)
* [XGBoost](https://xgboost.readthedocs.io/en/latest/)

In [None]:
import numpy as np
import os
import pandas as pd
import pyrasgo
from sklearn.metrics import mean_squared_error
import xgboost as xgb


## Access Rasgo

### Create account

Click [here](https://app.rasgoml.com/account/register) to create an account on the Rasgo UI. Fill in the required information on the web page.

<p align="center">
  <img src="img/RasgoAccountRegistration.png" alt="Rasgo Account Registration" width="512">
</p>

You can then close the browser tab. You will receive an email from Rasgo asking you to verify your email address. Click the **Verify Email** button to verify.

<p align="center">
  <img src="img/RasgoWelcome.png" alt="Verify Email" width="390">
</p>

This will open a browser tab where you can log into the UI.

### Log into Rasgo UI

Enter your username and password and click **Login**.

<p align="center">
  <img src="img/RasgoLogin.png" alt="Login to Rasgo" width="528">
</p>

You will then be taken to the Rasgo App homepage.

### Copy your API Key

Click the **API KEY** button in the upper right of the screen

<img src="img/APIKEY.png" alt="Copy API Key" width="128">

to copy your API key to the clipboard.

### Save API Key as an environment variable

Save the API Key as an environment variable called **RASGO_API_KEY**. This can be done on:
* [Linux](https://unix.stackexchange.com/questions/21598/how-do-i-set-a-user-environment-variable-permanently-not-session)
* [Mac](https://apple.stackexchange.com/questions/395457/how-to-set-environment-variable-permanently-on-macos-catalina)
* [Windows](https://stackoverflow.com/questions/17312348/how-do-i-set-windows-environment-variables-permanently)

## Work with PyRasgo

### Load the API Key from the environment variable

In [None]:
API_KEY = os.getenv('RASGO_API_KEY')
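
`os.getenv` returns `None` when the variable is missing, which would only surface later as a failed connection. A minimal sketch of a guard follows; the helper name `load_api_key` is hypothetical and not part of PyRasgo.

```python
import os


def load_api_key(var_name="RASGO_API_KEY"):
    """Return the API key stored in the environment variable, or raise."""
    key = os.getenv(var_name)
    if not key:
        # Fail early with a readable message instead of a later connection error.
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "set it and restart the notebook kernel."
        )
    return key
```

With this helper, the cell above could read `API_KEY = load_api_key()`.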

### Connect to Rasgo

In [None]:
rasgo = pyrasgo.connect(API_KEY)

### Get list of available datasets

Loop over all available datasets and print out the dataset ID and Name.

In [None]:
datasets = sorted(rasgo.get.datasets(), key=lambda x: x.id)
for ds in datasets:
    print(f"ID: {ds.id}\tDataset: {ds.name}")

Instead of searching through this list, let's look for datasets that have sales and internet in their name.

In [None]:
for ds in datasets:
    if 'sale' in ds.name.casefold() and 'internet' in ds.name.casefold():
        print(f"ID: {ds.id}\tDataset: {ds.name}")

Dataset 74 refers to internet sales. Let's check it.
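
The same name search is repeated later in this notebook, so it could be wrapped in a small helper. The sketch below is illustrative only: `find_datasets` is a hypothetical function, and the stand-in objects use made-up IDs and names in place of the objects returned by `rasgo.get.datasets()`.

```python
from types import SimpleNamespace


def find_datasets(datasets, *keywords):
    """Return the datasets whose name contains every keyword, ignoring case."""
    wanted = [kw.casefold() for kw in keywords]
    return [ds for ds in datasets
            if all(kw in ds.name.casefold() for kw in wanted)]


# Stand-in objects with made-up IDs and names, for illustration only.
demo = [SimpleNamespace(id=1, name="Demo: Internet Sales"),
        SimpleNamespace(id=2, name="Demo: Reseller Sales"),
        SimpleNamespace(id=3, name="Demo: Product")]

for ds in find_datasets(demo, "sale", "internet"):
    print(f"ID: {ds.id}\tDataset: {ds.name}")
```

In the notebook, the call would be `find_datasets(datasets, 'sale', 'internet')`.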

### Examine Internet Sales

In [None]:
internet_sales = rasgo.get.dataset(74)
internet_sales.preview()

This looks promising, but it would be easier to read as a single product sorted by date. This can be done with the `filter` and `order` transforms. Since `filter` requires a product key and we don't know one yet, we will just order by *PRODUCTKEY* and *ORDERDATE*.

In [None]:
internet_sales.order(col_list=['PRODUCTKEY', 'ORDERDATE'], order_method="ASC").preview()

This looks reasonable, so we will use it for our modeling. For future reference, what columns exist in this table?

In [None]:
internet_sales.preview().columns.sort_values()

Interesting fields that may link back to other tables: *CURRENCYKEY*, *CUSTOMERKEY*, *PRODUCTKEY*, *PROMOTIONKEY*, *SALESTERRITORYKEY*.

Not all of these are relevant, but *PRODUCTKEY* and *PROMOTIONKEY* are probably important for a sales forecast. To find the datasets that contain them, search the list of datasets for names containing adventureworks.

### Examine Product and Promotion Data

In [None]:
for ds in datasets:
    if 'adventureworks' in ds.name.casefold():
        print(f"ID: {ds.id}\tDataset: {ds.name}")

Dataset 56 looks like it will contain information on the promotion and 75 on the product. Take a look at the first.

In [None]:
promotion = rasgo.get.dataset(56)
promotion.preview()

In [None]:
promotion.preview().columns.sort_values()

And the product dataset.

In [None]:
product = rasgo.get.dataset(75)
product.preview()

In [None]:
product.preview().columns.sort_values()

## Sales Data

Work with the sales and promotion data to create the base modeling time-series features for the sales forecasting model.

### Merge Promo data

First, we want to clean up the promotion data to keep only what needs to be added to the sales data. Drop all columns except *PROMOTIONKEY* and *DISCOUNTPCT* from promotion using the `drop_columns` transformation.

In [None]:
reduced_promo = promotion.drop_columns(include_cols=['PROMOTIONKEY', 'DISCOUNTPCT'])
reduced_promo.order(col_list=['PROMOTIONKEY'], order_method="ASC").preview()

Now merge this with the internet sales data using the `join` transformation.

In [None]:
sales_promo = reduced_promo.join(join_table=internet_sales,
                                 join_type='RIGHT',
                                 join_columns={'PROMOTIONKEY':'PROMOTIONKEY'})
sales_promo.order(col_list=['PRODUCTKEY', 'ORDERDATE'], order_method="ASC").preview()

### Create Weekly Data

Now, we want to forecast these sales weekly, so we need to extract the week from the *ORDERDATE*. This can be done using the transform `datetrunc`.

In [None]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'})
salesds.order(col_list=['PRODUCTKEY', 'ORDERDATE'], order_method="ASC").preview()

The new week column is called *ORDERDATE_WEEK*. This is clunky, so let's rename it to *ORDERWEEK* using the `rename` transformation.

In [None]:
newsalesds = salesds.rename(renames={'ORDERDATE_WEEK': 'ORDERWEEK'})
newsalesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

Alternatively, we can just chain these transformations together.

In [None]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'})
salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

Now we can aggregate this to the product-week level and create aggregations of the *'DISCOUNTAMOUNT'*, *'DISCOUNTPCT'*, *'ORDERQUANTITY'*, *'PRODUCTSTANDARDCOST'*, *'SALESAMOUNT'*, *'TAXAMT'*, *'TOTALPRODUCTCOST'*, *'UNITPRICE'*, *'UNITPRICEDISCOUNTPCT'* using the `aggregate` transform.

In [None]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'}).aggregate(
                                group_by=['PRODUCTKEY', 'ORDERWEEK'],
                                aggregations={'DISCOUNTAMOUNT': ['MIN', 'MAX', 'AVG', 'SUM'], 
                                              'DISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM'],
                                              'ORDERQUANTITY': ['SUM'],
                                              'PRODUCTSTANDARDCOST': ['AVG', 'SUM'],
                                              'SALESAMOUNT': ['SUM'], 
                                              'TAXAMT': ['SUM'],
                                              'TOTALPRODUCTCOST': ['AVG', 'SUM'],
                                              'UNITPRICE': ['AVG', 'SUM'],
                                              'UNITPRICEDISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM']})
salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

This gives us statistics for each product over a given week.
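
For readers more familiar with pandas, the weekly truncation and aggregation can be sketched locally as follows. This is an illustration on made-up rows, not how Rasgo executes the transforms. The `W-SAT` period is chosen so that weeks start on Sunday, which matches the week date quoted later in this notebook (2014-01-19 is a Sunday); whether `datetrunc` uses the same convention is an assumption.

```python
import pandas as pd

# Made-up order lines, for illustration only.
orders = pd.DataFrame({
    "PRODUCTKEY": [310, 310, 310, 346],
    "ORDERDATE": pd.to_datetime(
        ["2014-01-20", "2014-01-22", "2014-01-27", "2014-01-21"]),
    "ORDERQUANTITY": [1, 1, 1, 1],
    "SALESAMOUNT": [100.0, 150.0, 120.0, 80.0],
})

# Truncate each date to the first day of its week (weeks assumed to start on Sunday).
orders["ORDERWEEK"] = orders["ORDERDATE"].dt.to_period("W-SAT").dt.start_time

# Aggregate to one row per product and week.
weekly = (orders.groupby(["PRODUCTKEY", "ORDERWEEK"])
                .agg(ORDERQUANTITY_SUM=("ORDERQUANTITY", "sum"),
                     SALESAMOUNT_SUM=("SALESAMOUNT", "sum"))
                .reset_index())
print(weekly)
```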

### Time-series feature engineering

For sales forecasting, in addition to the current week's aggregates, we need to know what the sales were in prior weeks. The transform `lag` can create these variables for us. In this case, we will lag the following variables: *'DISCOUNTAMOUNT_AVG'*, *'DISCOUNTPCT_AVG'*, *'ORDERQUANTITY_SUM'*, *'PRODUCTSTANDARDCOST_AVG'*, *'SALESAMOUNT_SUM'*, *'TAXAMT_SUM'*, *'TOTALPRODUCTCOST_SUM'*, *'UNITPRICEDISCOUNTPCT_AVG'*, *'UNITPRICE_AVG'*, *'UNITPRICE_SUM'* over *1*, *2*, *3*, and *12* weeks.

In [None]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'}).aggregate(
                                group_by=['PRODUCTKEY', 'ORDERWEEK'],
                                aggregations={'DISCOUNTAMOUNT': ['MIN', 'MAX', 'AVG', 'SUM'], 
                                              'DISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM'],
                                              'ORDERQUANTITY': ['SUM'],
                                              'PRODUCTSTANDARDCOST': ['AVG', 'SUM'],
                                              'SALESAMOUNT': ['SUM'], 
                                              'TAXAMT': ['SUM'],
                                              'TOTALPRODUCTCOST': ['AVG', 'SUM'],
                                              'UNITPRICE': ['AVG', 'SUM'],
                                              'UNITPRICEDISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM']}).lag(
                                columns=['DISCOUNTAMOUNT_AVG', 'DISCOUNTPCT_AVG', 'ORDERQUANTITY_SUM', 
                                         'PRODUCTSTANDARDCOST_AVG', 'SALESAMOUNT_SUM', 'TAXAMT_SUM', 
                                         'TOTALPRODUCTCOST_SUM','UNITPRICEDISCOUNTPCT_AVG', 
                                         'UNITPRICE_AVG', 'UNITPRICE_SUM'],
                                amounts=[1, 2, 3, 12],
                                order_by=['PRODUCTKEY', 'ORDERWEEK'],
                                partition=['PRODUCTKEY'])
   
salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()
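
As a rough pandas analogue of partitioned lags, the sketch below shifts a column within each product on made-up data. It assumes the lag is counted in rows rather than calendar weeks, so a product with a missing week gets the value from its previous observed week.

```python
import pandas as pd

# Made-up weekly data; product 346 has no row for 2014-01-12.
weekly = pd.DataFrame({
    "PRODUCTKEY": [310, 310, 310, 346, 346],
    "ORDERWEEK": pd.to_datetime(
        ["2014-01-05", "2014-01-12", "2014-01-19", "2014-01-05", "2014-01-19"]),
    "SALESAMOUNT_SUM": [100.0, 150.0, 120.0, 80.0, 90.0],
}).sort_values(["PRODUCTKEY", "ORDERWEEK"])

# Shift within each product so values never cross product boundaries.
for n in (1, 2):
    weekly[f"LAG_SALESAMOUNT_SUM_{n}"] = (
        weekly.groupby("PRODUCTKEY")["SALESAMOUNT_SUM"].shift(n))
print(weekly)
```

In this toy data, the lag-1 value for product 346 on 2014-01-19 comes from 2014-01-05, two calendar weeks earlier, which is worth keeping in mind when products do not sell every week.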

In addition to lag variables, moving averages of the quantities can be useful. In this case, we'll calculate the moving average over *4* observations of *ORDERQUANTITY_SUM* and *SALESAMOUNT_SUM* using the transform `moving_avg`.

In [None]:
salesds = sales_promo.datetrunc(dates={'ORDERDATE': 'week'}).rename(
                                renames={'ORDERDATE_WEEK': 'ORDERWEEK'}).aggregate(
                                group_by=['PRODUCTKEY', 'ORDERWEEK'],
                                aggregations={'DISCOUNTAMOUNT': ['MIN', 'MAX', 'AVG', 'SUM'], 
                                              'DISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM'],
                                              'ORDERQUANTITY': ['SUM'],
                                              'PRODUCTSTANDARDCOST': ['AVG', 'SUM'],
                                              'SALESAMOUNT': ['SUM'], 
                                              'TAXAMT': ['SUM'],
                                              'TOTALPRODUCTCOST': ['AVG', 'SUM'],
                                              'UNITPRICE': ['AVG', 'SUM'],
                                              'UNITPRICEDISCOUNTPCT': ['MIN', 'MAX', 'AVG', 'SUM']}).lag(
                                columns=['DISCOUNTAMOUNT_AVG', 'DISCOUNTPCT_AVG', 'ORDERQUANTITY_SUM', 
                                         'PRODUCTSTANDARDCOST_AVG', 'SALESAMOUNT_SUM', 'TAXAMT_SUM', 
                                         'TOTALPRODUCTCOST_SUM','UNITPRICEDISCOUNTPCT_AVG', 
                                         'UNITPRICE_AVG', 'UNITPRICE_SUM'],
                                amounts=[1, 2, 3, 12],
                                order_by=['PRODUCTKEY', 'ORDERWEEK'],
                                partition=['PRODUCTKEY']).moving_avg(
                                input_columns=['ORDERQUANTITY_SUM', 'SALESAMOUNT_SUM'],
                                window_sizes=[4],
                                order_by=['PRODUCTKEY', 'ORDERWEEK'],
                                partition=['PRODUCTKEY'])
    
salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()
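
The moving average can be pictured with a short pandas sketch on made-up data. It assumes the window covers the current row and the three preceding rows within each product, and that shorter windows at the start of a series are averaged over the rows available; the exact window definition used by `moving_avg` is not confirmed here.

```python
import pandas as pd

# Made-up weekly sales for a single product.
weekly = pd.DataFrame({
    "PRODUCTKEY": [310] * 5,
    "ORDERWEEK": pd.date_range("2014-01-05", periods=5, freq="7D"),
    "SALESAMOUNT_SUM": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Rolling mean over up to 4 rows, computed separately for each product.
weekly["MEAN_SALESAMOUNT_SUM_4"] = (
    weekly.groupby("PRODUCTKEY")["SALESAMOUNT_SUM"]
          .transform(lambda s: s.rolling(window=4, min_periods=1).mean())
)
print(weekly)
```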

#### Save result

At this point, the data has been aggregated to weekly data and multiple transformations have been applied. This could be a good starting point for additional analysis and useful for visualization. For this reason, we will publish it to Rasgo to make it available for others to use. This can be done with the `rasgo.publish.dataset` function.

In [None]:
weeklysales = rasgo.publish.dataset(dataset=salesds,
                                    name="WKSP FULL: AdventureWorks: weekly sales",
                                    description="Internet Sales data converted to weekly sales")
weeklysales

We can examine this dataset on Rasgo by clicking the link below.

In [None]:
print(f"https://app.rasgoml.com/datasets/{weeklysales.id}")

Using this dataset, we can continue data preparation.

#### Capture trends

Lag variables are necessary for time-series models, but trend variables often provide additional value. These can be simple differences or ratios, rates of change such as the difference between two lags divided by the time between the observations (velocity), or a weighted moving average (often giving more weight to the most recent observations). All of these can be calculated using the `math` transformation. In this case, we will calculate only:

* `ORDERQUANTITY_SUM - LAG_ORDERQUANTITY_SUM_3`
* `ORDERQUANTITY_SUM / LAG_ORDERQUANTITY_SUM_12`
* `(SALESAMOUNT_SUM - LAG_SALESAMOUNT_SUM_3) / 4`
* `SALESAMOUNT_SUM / MEAN_SALESAMOUNT_SUM_4`
* `(4*SALESAMOUNT_SUM + 3*LAG_SALESAMOUNT_SUM_1 + 2*LAG_SALESAMOUNT_SUM_2 + LAG_SALESAMOUNT_SUM_3)/10`

In the code, the two ratios wrap their denominators in `NULLIF(..., 0)` so that a zero denominator yields a null instead of a division error.

In [None]:
salesds = weeklysales.math(math_ops=['ORDERQUANTITY_SUM - LAG_ORDERQUANTITY_SUM_3',
                                     'ORDERQUANTITY_SUM / NULLIF(LAG_ORDERQUANTITY_SUM_12, 0)',
                                     '(SALESAMOUNT_SUM - LAG_SALESAMOUNT_SUM_3) / 4',
                                     'SALESAMOUNT_SUM / NULLIF(MEAN_SALESAMOUNT_SUM_4, 0)',
                                     '(4*SALESAMOUNT_SUM + 3*LAG_SALESAMOUNT_SUM_1 + 2*LAG_SALESAMOUNT_SUM_2 + LAG_SALESAMOUNT_SUM_3)/10'])
salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

Unfortunately, by default, the `math` transform derives each new column name from the math expression itself. This gives us the names:
* *ORDERQUANTITY_SUM___LAG_ORDERQUANTITY_SUM_3*
* *ORDERQUANTITY_SUM__NULLIFLAG_ORDERQUANTITY_SUM_12_0*
* *SALESAMOUNT_SUM___LAG_SALESAMOUNT_SUM_3__4*
* *SALESAMOUNT_SUM__NULLIFMEAN_SALESAMOUNT_SUM_4_0*
* *_4SALESAMOUNT_SUM__3LAG_SALESAMOUNT_SUM_1__2LAG_SALESAMOUNT_SUM_2__LAG_SALESAMOUNT_SUM_310*

These do not really represent the concepts well, so we will rename them using the `rename` transform to:
* *ORDERQUANTITY_SUM_DELTA_4*
* *ORDERQUANTITY_SUM_RATIO_12*
* *SALESAMOUNT_SUM_VELOCITY_4*
* *SALESAMOUNT_RATIO_MA_4*
* *SALESAMOUNT_SUM_WMA_4*

In [None]:
salesds = weeklysales.math(math_ops=['ORDERQUANTITY_SUM - LAG_ORDERQUANTITY_SUM_3',
                                     'ORDERQUANTITY_SUM / NULLIF(LAG_ORDERQUANTITY_SUM_12, 0)',
                                     '(SALESAMOUNT_SUM - LAG_SALESAMOUNT_SUM_3) / 4',
                                     'SALESAMOUNT_SUM / NULLIF(MEAN_SALESAMOUNT_SUM_4, 0)',
                                     '(4*SALESAMOUNT_SUM + 3*LAG_SALESAMOUNT_SUM_1 + 2*LAG_SALESAMOUNT_SUM_2 + LAG_SALESAMOUNT_SUM_3)/10']).rename(
                           renames={'ORDERQUANTITY_SUM___LAG_ORDERQUANTITY_SUM_3': 'ORDERQUANTITY_SUM_DELTA_4',
                                    'ORDERQUANTITY_SUM__NULLIFLAG_ORDERQUANTITY_SUM_12_0': 'ORDERQUANTITY_SUM_RATIO_12',
                                    'SALESAMOUNT_SUM___LAG_SALESAMOUNT_SUM_3__4': 'SALESAMOUNT_SUM_VELOCITY_4',
                                    'SALESAMOUNT_SUM__NULLIFMEAN_SALESAMOUNT_SUM_4_0': 'SALESAMOUNT_RATIO_MA_4',
                                    '_4SALESAMOUNT_SUM__3LAG_SALESAMOUNT_SUM_1__2LAG_SALESAMOUNT_SUM_2__LAG_SALESAMOUNT_SUM_310': 'SALESAMOUNT_SUM_WMA_4'})

salesds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()
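
To make the arithmetic behind the trend features concrete, here is a small pandas sketch on made-up values. It assumes the weighted moving average applies weights 4, 3, 2, and 1 to the current week and lags 1, 2, and 3, and it mimics `NULLIF(x, 0)` by replacing zeros with `NaN` before dividing.

```python
import numpy as np
import pandas as pd

# Made-up rows: one with regular sales, one where every value is zero.
df = pd.DataFrame({
    "SALESAMOUNT_SUM": [400.0, 0.0],
    "LAG_SALESAMOUNT_SUM_1": [300.0, 0.0],
    "LAG_SALESAMOUNT_SUM_2": [200.0, 0.0],
    "LAG_SALESAMOUNT_SUM_3": [100.0, 0.0],
    "MEAN_SALESAMOUNT_SUM_4": [250.0, 0.0],
})

# Velocity: change since lag 3, divided by 4 as in the expression above.
df["SALESAMOUNT_SUM_VELOCITY_4"] = (
    df["SALESAMOUNT_SUM"] - df["LAG_SALESAMOUNT_SUM_3"]) / 4

# Ratio to the moving average; a zero denominator becomes NaN (like NULLIF).
df["SALESAMOUNT_RATIO_MA_4"] = (
    df["SALESAMOUNT_SUM"] / df["MEAN_SALESAMOUNT_SUM_4"].replace(0, np.nan))

# Weighted moving average with weights 4, 3, 2, 1 (they sum to 10).
df["SALESAMOUNT_SUM_WMA_4"] = (
    4 * df["SALESAMOUNT_SUM"] + 3 * df["LAG_SALESAMOUNT_SUM_1"]
    + 2 * df["LAG_SALESAMOUNT_SUM_2"] + df["LAG_SALESAMOUNT_SUM_3"]) / 10
print(df)
```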

### Publish to Rasgo

At this point we've created all of the features from the internet sales data. We're not quite ready to model with it (we still need to merge in the product data and perform a last bit of feature engineering), but we'd like to make this work available to others and saved for future analysis. This means we will publish it to Rasgo.

In [None]:
finishedsales = rasgo.publish.dataset(dataset=salesds,
                                      name="WKSP FULL: AdventureWorks: sales forecasting",
                                      description="Internet Sales data set up for sales forecasting")
finishedsales

We can examine this dataset on Rasgo by clicking the link below.

In [None]:
print(f"https://app.rasgoml.com/datasets/{finishedsales.id}")

## Product Data

Let's turn our attention to the product data. First, let's take a quick look again to remind ourselves what is here.

In [None]:
product.preview()

We see a lot of missing data. Looking closer, one of the columns is *FINISHEDGOODSFLAG*, so let's filter on it to see just the finished goods. We can use the transform `filter` to filter the data.

In [None]:
finishedproducts = product.filter(filter_statements=["FINISHEDGOODSFLAG = 1"])
finishedproducts.preview()

That looks better and we can use this data.

### Explore Product Subcategory

The promotion data could simply be merged in, and only *DISCOUNTPCT* was needed from it. Product, however, has a subcategory key, and there is a relevant dataset, 78. Let's explore it.

In [None]:
productsubcategory = rasgo.get.dataset(78)
productsubcategory.preview()

In [None]:
productsubcategory.preview().columns.sort_values()

### Join subcategory to product

We can use the `join` transformation to join the subcategory name to the product information.

In [None]:
finishedproducts2 = finishedproducts.join(join_table=productsubcategory,
                                          join_type='LEFT',
                                          join_columns={'PRODUCTSUBCATEGORYKEY':'PRODUCTSUBCATEGORYKEY'})
finishedproducts2.preview()

There are a lot of columns we don't really need. Let's keep *PRODUCTKEY*, *CLASS*, *COLOR*, *DEALERPRICE*, *ENGLISHDESCRIPTION*, *ENGLISHPRODUCTNAME*, *ENGLISHPRODUCTSUBCATEGORYNAME*, and *STANDARDCOST*.

#### Drop unneeded columns

The transformation `drop_columns` can take either an **include_cols** or **exclude_cols** argument. As we know which columns we want to keep, **include_cols** will be easier.

We could run the transformation on the result of the last step, but these transformations can be chained together as follows.

In [None]:
finishedproducts2 = finishedproducts.join(join_table=productsubcategory,
                                          join_type='LEFT',
                                          join_columns={'PRODUCTSUBCATEGORYKEY':'PRODUCTSUBCATEGORYKEY'}).drop_columns(
                                          include_cols=['PRODUCTKEY', 'CLASS', 'COLOR', 'DEALERPRICE', 'ENGLISHDESCRIPTION', 
                                                        'ENGLISHPRODUCTNAME', 'ENGLISHPRODUCTSUBCATEGORYNAME', 
                                                        'STANDARDCOST'])
finishedproducts2.preview()

This looks like a useful table, so we will publish it to allow us to reuse it in future analysis.

In [None]:
finishedprod = rasgo.publish.dataset(dataset=finishedproducts2,
                                     name="WKSP FULL: AdventureWorks: product details",
                                     description="English language detail for finished products from the product and productsubcategory tables")
finishedprod

We can examine this dataset on Rasgo by clicking the link below.

In [None]:
print(f"https://app.rasgoml.com/datasets/{finishedprod.id}")

### Create Modeling Data

We can now join the product data to the sales data we have been working with.

In [None]:
startingds = finishedsales.join(join_table=finishedprod,
                                join_type='LEFT',
                                join_columns={'PRODUCTKEY': 'PRODUCTKEY'})
startingds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

To prepare this for modeling, we need to do three things. First, the target (next week's sales) needs to be created. Second, the categorical variables should be encoded. Finally, missing values should be imputed for the numeric columns.

#### Target Creation

Use the `lag` transform with a negative lag value to get next week's sales as the target. While doing this, rename the value to make it clear that it is the target.

In [None]:
modelingds = startingds.lag(columns=['SALESAMOUNT_SUM'],
                            amounts=[-1],
                            order_by=['PRODUCTKEY', 'ORDERWEEK'],
                            partition=['PRODUCTKEY']).rename(
                            renames={'LAG_SALESAMOUNT_SUM__1': 'TARGET_SALESAMOUNT'})
modelingds.preview()
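
A negative lag looks forward instead of backward. The pandas sketch below, on made-up data, shows the idea and also shows which rows end up without a target: the last observed week of each product, assuming the lag is partitioned by product as in the cell above.

```python
import pandas as pd

# Made-up weekly sales for two products.
weekly = pd.DataFrame({
    "PRODUCTKEY": [310, 310, 310, 346, 346],
    "ORDERWEEK": pd.to_datetime(
        ["2014-01-05", "2014-01-12", "2014-01-19", "2014-01-05", "2014-01-12"]),
    "SALESAMOUNT_SUM": [100.0, 150.0, 120.0, 80.0, 90.0],
}).sort_values(["PRODUCTKEY", "ORDERWEEK"])

# shift(-1) pulls the next row's value up: next week's sales become the target.
weekly["TARGET_SALESAMOUNT"] = (
    weekly.groupby("PRODUCTKEY")["SALESAMOUNT_SUM"].shift(-1))
print(weekly)
```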

#### Categorical encoding

The columns that need to be encoded are: *CLASS*, *COLOR*, *ENGLISHPRODUCTNAME*, and *ENGLISHPRODUCTSUBCATEGORYNAME*. We will use the `one_hot_encode` transform to encode *CLASS* and *COLOR*.

Since *ENGLISHPRODUCTNAME* and *ENGLISHPRODUCTSUBCATEGORYNAME* contain a large number of categories and we intend to use tree-based modeling algorithms, we will encode *ENGLISHPRODUCTSUBCATEGORYNAME* with the `label_encode` transform. We will encode *ENGLISHPRODUCTNAME* with `target_encode`, which replaces each category with the mean target value for that category. Target encoding is a very powerful technique to encode these high-cardinality categorical variables efficiently and help improve model performance.

In [None]:
modelingds = modelingds.one_hot_encode(column='CLASS').one_hot_encode(
                                       column='COLOR').target_encode(
                                       column='ENGLISHPRODUCTNAME',
                                       target='TARGET_SALESAMOUNT').label_encode(
                                       column='ENGLISHPRODUCTSUBCATEGORYNAME')

modelingds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()
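
The three encodings can be compared side by side in a small pandas sketch on made-up rows. This illustrates the general techniques rather than Rasgo's implementation; in particular, the integer codes assigned by `label_encode` may differ from the alphabetical codes pandas produces here.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "COLOR": ["Red", "Black", "Red", "Blue"],
    "ENGLISHPRODUCTSUBCATEGORYNAME": ["Road Bikes", "Helmets", "Road Bikes", "Tires"],
    "ENGLISHPRODUCTNAME": ["Bike A", "Helmet B", "Bike A", "Tire C"],
    "TARGET_SALESAMOUNT": [1000.0, 30.0, 1200.0, 20.0],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["COLOR"], prefix="COLOR")

# Label encoding: each category mapped to an integer code.
label = df["ENGLISHPRODUCTSUBCATEGORYNAME"].astype("category").cat.codes

# Target encoding: each category replaced by the mean target of that category.
target_enc = df.groupby("ENGLISHPRODUCTNAME")["TARGET_SALESAMOUNT"].transform("mean")

print(one_hot.columns.tolist())
print(label.tolist())
print(target_enc.tolist())
```

Note that the target means in this sketch are computed over all rows. A common precaution is to compute them from training rows only, so that test-set targets do not leak into the features.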

#### Imputation

As a final step before modeling, all numeric columns should have missing values replaced by a number. This can be done by the `impute` transformation. If a linear or logistic regression, SVM, or neural network algorithm were going to be applied, we might want to impute the mean or median. This could be done by passing 'mean' or 'median' through the imputations dictionary.

As the modeling algorithm applied here is tree-based, we can simply impute an extreme value. All of the features created are non-negative or close to zero, so we will impute a very large negative number, *-999,999*.

In [None]:
imputation_dict = {'DEALERPRICE': -999999,
                   'DISCOUNTAMOUNT_AVG': -999999,
                   'DISCOUNTAMOUNT_MAX': -999999,
                   'DISCOUNTAMOUNT_MIN': -999999,
                   'DISCOUNTAMOUNT_SUM': -999999,
                   'DISCOUNTPCT_AVG': -999999,
                   'DISCOUNTPCT_MAX': -999999,
                   'DISCOUNTPCT_MIN': -999999,
                   'DISCOUNTPCT_SUM': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_1': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_12': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_2': -999999,
                   'LAG_DISCOUNTAMOUNT_AVG_3': -999999,
                   'LAG_DISCOUNTPCT_AVG_1': -999999,
                   'LAG_DISCOUNTPCT_AVG_12': -999999,
                   'LAG_DISCOUNTPCT_AVG_2': -999999,
                   'LAG_DISCOUNTPCT_AVG_3': -999999,
                   'LAG_ORDERQUANTITY_SUM_1': -999999,
                   'LAG_ORDERQUANTITY_SUM_12': -999999,
                   'LAG_ORDERQUANTITY_SUM_2': -999999,
                   'LAG_ORDERQUANTITY_SUM_3': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_1': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_12': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_2': -999999,
                   'LAG_PRODUCTSTANDARDCOST_AVG_3': -999999,
                   'LAG_SALESAMOUNT_SUM_1': -999999,
                   'LAG_SALESAMOUNT_SUM_12': -999999,
                   'LAG_SALESAMOUNT_SUM_2': -999999,
                   'LAG_SALESAMOUNT_SUM_3': -999999,
                   'LAG_TAXAMT_SUM_1': -999999,
                   'LAG_TAXAMT_SUM_12': -999999,
                   'LAG_TAXAMT_SUM_2': -999999,
                   'LAG_TAXAMT_SUM_3': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_1': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_12': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_2': -999999,
                   'LAG_TOTALPRODUCTCOST_SUM_3': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_1': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_12': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_2': -999999,
                   'LAG_UNITPRICEDISCOUNTPCT_AVG_3': -999999,
                   'LAG_UNITPRICE_AVG_1': -999999,
                   'LAG_UNITPRICE_AVG_12': -999999,
                   'LAG_UNITPRICE_AVG_2': -999999,
                   'LAG_UNITPRICE_AVG_3': -999999,
                   'LAG_UNITPRICE_SUM_1': -999999,
                   'LAG_UNITPRICE_SUM_12': -999999,
                   'LAG_UNITPRICE_SUM_2': -999999,
                   'LAG_UNITPRICE_SUM_3': -999999,
                   'MEAN_ORDERQUANTITY_SUM_4': -999999,
                   'MEAN_SALESAMOUNT_SUM_4': -999999,
                   'ORDERQUANTITY_SUM': -999999,
                   'ORDERQUANTITY_SUM_DELTA_4': -999999,
                   'ORDERQUANTITY_SUM_RATIO_12': -999999,
                   'PRODUCTSTANDARDCOST_AVG': -999999,
                   'PRODUCTSTANDARDCOST_SUM': -999999,
                   'SALESAMOUNT_RATIO_MA_4': -999999,
                   'SALESAMOUNT_SUM': -999999,
                   'SALESAMOUNT_SUM_VELOCITY_4': -999999,
                   'SALESAMOUNT_SUM_WMA_4': -999999,
                   'STANDARDCOST': -999999,
                   'TAXAMT_SUM': -999999,
                   'TOTALPRODUCTCOST_AVG': -999999,
                   'TOTALPRODUCTCOST_SUM': -999999,
                   'UNITPRICEDISCOUNTPCT_AVG': -999999,
                   'UNITPRICEDISCOUNTPCT_MAX': -999999,
                   'UNITPRICEDISCOUNTPCT_MIN': -999999,
                   'UNITPRICEDISCOUNTPCT_SUM': -999999,
                   'UNITPRICE_AVG': -999999,
                   'UNITPRICE_SUM': -999999}
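
Since every column receives the same fill value, a dictionary like the one above can also be generated rather than typed out. The sketch below builds a shortened version from a few column names used in this notebook; the variable names are illustrative, and the full list of columns would need to be supplied in practice.

```python
FILL_VALUE = -999999

# A few of the columns from this notebook; extend these lists as needed.
lag_base = ["SALESAMOUNT_SUM", "ORDERQUANTITY_SUM"]
lag_amounts = [1, 2, 3, 12]
other_columns = ["DEALERPRICE", "STANDARDCOST"]

# Lag columns follow the LAG_<column>_<amount> naming seen earlier.
columns = other_columns + [f"LAG_{col}_{n}" for col in lag_base for n in lag_amounts]

# Map every column to the same extreme fill value.
imputation_dict_sketch = {col: FILL_VALUE for col in columns}
print(imputation_dict_sketch)
```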

In [None]:
modelingds = modelingds.impute(imputations=imputation_dict)

modelingds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

#### Train-test split

As this is a time-series problem, a random train-test split won't work: observations from near the end of the time frame would land in the training set while earlier observations land in the test set, leaking future information into training. The way to avoid this problem is to perform the split based on the date. The transformation `train_test_split` can do this by passing the date column through the parameter **order_by**.

In [None]:
modelingds = modelingds.train_test_split(order_by=['ORDERWEEK'],
                                         train_percent=0.8)
    
modelingds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()
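
A date-ordered split can be sketched in pandas as follows, on made-up data. It assumes the split sorts rows by date and labels the first 80% as training rows; how `train_test_split` treats rows that share a date at the boundary is not confirmed here.

```python
import pandas as pd

# Made-up weekly rows, for illustration only.
df = pd.DataFrame({
    "ORDERWEEK": pd.date_range("2014-01-05", periods=10, freq="7D"),
    "SALESAMOUNT_SUM": list(range(10)),
})

# Sort by date, then label the earliest 80% of rows TRAIN and the rest TEST.
df = df.sort_values("ORDERWEEK").reset_index(drop=True)
cutoff = int(len(df) * 0.8)
df["TT_SPLIT"] = ["TRAIN" if i < cutoff else "TEST" for i in range(len(df))]
print(df)
```

Every training row is earlier than every test row, so the model is never evaluated on weeks that precede its training data.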

#### Delete unneeded columns

We now have a number of columns that are not needed for modeling (such as the raw categorical columns). We can delete the following from the dataset.
* *CLASS*
* *COLOR*
* *ENGLISHDESCRIPTION*
* *ENGLISHPRODUCTNAME*
* *ENGLISHPRODUCTSUBCATEGORYNAME*


In [None]:
modelingds = modelingds.drop_columns(exclude_cols=['CLASS', 'COLOR', 'ENGLISHDESCRIPTION', 
                                                   'ENGLISHPRODUCTNAME', 'ENGLISHPRODUCTSUBCATEGORYNAME'])
    
modelingds.order(col_list=['PRODUCTKEY', 'ORDERWEEK'], order_method="ASC").preview()

#### Save Modeling Dataset

We can now save this modeling dataset so we can return to it in the future.

In [None]:
modeling = rasgo.publish.dataset(dataset=modelingds,
                                 name="WKSP FULL: AdventureWorks: Sales Forecast Modeling",
                                 description="Modeling dataset for Internet Sales Forecasting")
modeling

We can examine this dataset on Rasgo by clicking the link below.

In [None]:
print(f"https://app.rasgoml.com/datasets/{modeling.id}")

Capture this dataset ID for use in prediction.

In [None]:
ds_id = modeling.id

### Modeling

We are now ready to build the model. First, get the modeling data from Rasgo using `to_df`.

In [None]:
df = modeling.to_df().reset_index(drop=True)

Find the columns with non-numeric datatypes and convert them to numeric, skipping *ORDERWEEK* and *TT_SPLIT*.

In [None]:
for c in df.select_dtypes(exclude=[np.number]).columns:
    if c not in ['ORDERWEEK', 'TT_SPLIT']:
        df[c] = pd.to_numeric(df[c])

Eliminate the rows with no target. Because the target was created per product, this is the last observed week of data for each product.

In [None]:
df = df[~df.TARGET_SALESAMOUNT.isna()]

#### Train the model

First, split the data using the TT_SPLIT column.

In [None]:
df_train = df[df['TT_SPLIT'] == 'TRAIN'].drop(columns=['TT_SPLIT', 'ORDERWEEK'])
df_test = df[df['TT_SPLIT'] == 'TEST'].drop(columns=['TT_SPLIT', 'ORDERWEEK'])

In [None]:
y_train = df_train['TARGET_SALESAMOUNT']
X_train = df_train.drop(columns=['TARGET_SALESAMOUNT'])
y_test = df_test['TARGET_SALESAMOUNT']
X_test = df_test.drop(columns=['TARGET_SALESAMOUNT'])

#### Fit the model

For illustration purposes, we are just fitting the model with a single set of parameters. In general, you should optimize the hyperparameters before building the final model. That process is beyond the scope of this document.

In [None]:
model = xgb.XGBRegressor(n_estimators=100,
                         max_depth=5,
                         eta=0.01,
                         random_state=1066,
                         subsample=0.7,
                         colsample_bytree=0.7)

model.fit(X_train, y_train)

#### Check the performance

In [None]:
model.predict(X_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
rmse
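
An RMSE is easier to judge against a simple baseline. One common choice, sketched below on made-up numbers, is the naive forecast that next week's sales equal this week's sales. The function name `rmse_score` is illustrative.

```python
import numpy as np


def rmse_score(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: this week's sales used as the forecast for next week.
this_week = np.array([100.0, 150.0, 120.0])
next_week = np.array([110.0, 140.0, 120.0])

baseline_rmse = rmse_score(next_week, this_week)
print(baseline_rmse)
```

In this notebook, the equivalent comparison would be `rmse_score(y_test, X_test['SALESAMOUNT_SUM'])`, assuming *SALESAMOUNT_SUM* has no imputed values in the test set. A model RMSE well below the baseline suggests the features add value beyond the most recent observation.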

### Predict on new data

Since our feature engineering was saved in Rasgo, as new data enters the system, it will automatically be prepared for modeling. We can just pull the data in question and make a prediction on it.

In this case, if we are making these predictions each week, we can just pull the most recent week. In this particular data, that is '*2014-01-19*'.

#### Pull the data

Use `to_df` to grab the data for this date, then convert any non-numeric feature columns to numeric as was done for the training data.

In [None]:
predictdf = rasgo.get.dataset(ds_id).to_df(filters=["ORDERWEEK = '2014-01-19'"])
for c in predictdf.select_dtypes(exclude=[np.number]).columns:
    if c not in ['ORDERWEEK', 'TT_SPLIT']:
        predictdf[c] = pd.to_numeric(predictdf[c])
predictdf.head()

Now use the model to get the sales forecast. We will create a dataframe to hold the predictions then drop the columns not needed by the model before making the prediction.

In [None]:
salesforecastdf = predictdf[['PRODUCTKEY', 'ORDERWEEK']].copy()
salesforecastdf['forecast'] = model.predict(predictdf.drop(columns=['TT_SPLIT', 'ORDERWEEK', 'TARGET_SALESAMOUNT']))
salesforecastdf