#### Engineering

I'll start by installing the Creme library:

In [1]:
#pip install git+https://github.com/creme-ml/creme --upgrade

I'm importing the packages that I'm going to need:

In [2]:
import copy
import datetime
import random
import tqdm

In [3]:
from creme import compose
from creme import feature_extraction
from creme import metrics
from creme import optim
from creme import preprocessing
from creme import stats
from creme import stream

In [4]:
from creme import neighbors
from creme import tree
from creme import linear_model

I use this first function to parse the date and extract the day of the week as an integer.

In [5]:
def extract_date(x):
    """Parse the date if needed and add the day of the week as 'wday'."""
    if not isinstance(x['date'], datetime.datetime):
        x['date'] = datetime.datetime.strptime(x['date'], '%Y-%m-%d')
    x['wday'] = x['date'].weekday()
    return x
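
As a quick check, the sketch below repeats the function and applies it to a single observation. The sample `id` and date are the ones used later in the prediction request; any date in the `%Y-%m-%d` format would do.

```python
import datetime


def extract_date(x):
    """Parse x['date'] if needed and add the day of the week as x['wday']."""
    if not isinstance(x['date'], datetime.datetime):
        x['date'] = datetime.datetime.strptime(x['date'], '%Y-%m-%d')
    x['wday'] = x['date'].weekday()  # Monday is 0, Sunday is 6
    return x


x = extract_date({'id': 'HOBBIES_1_001_CA_1', 'date': '2016-05-23'})
print(x['wday'])  # 0, because 2016-05-23 was a Monday
```

Note that ``weekday()`` counts from 0 (Monday) to 6 (Sunday), so the feature takes seven distinct values.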

``get_metadata`` allows you to extract the identifier of the product and the store.

In [6]:
def get_metadata(x):
    key = x['id'].split('_')
    x['store_id'] = f'{key[3]}_{key[4]}'
    x['item_id'] = f'{key[0]}_{key[1]}_{key[2]}'
    return x
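
Here is a small usage sketch of the function on one identifier. The `_validation` suffix in the sample `id` is an assumption about how the identifiers look in the CSV files; the function only reads the first five underscore-separated parts, so the suffix is ignored either way.

```python
def get_metadata(x):
    """Split the id into a store identifier and an item identifier."""
    key = x['id'].split('_')
    x['store_id'] = f'{key[3]}_{key[4]}'
    x['item_id'] = f'{key[0]}_{key[1]}_{key[2]}'
    return x


# Sample id: the '_validation' suffix is an assumption for illustration.
x = get_metadata({'id': 'HOBBIES_1_001_CA_1_validation'})
print(x['store_id'], x['item_id'])  # CA_1 HOBBIES_1_001
```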

Below I define the feature extraction pipeline. I use the ``feature_extraction.TargetAgg`` class to compute features from the target variable of the stream, grouped by store and by day of the week.

In [7]:
extract_features = compose.TransformerUnion(
    compose.Select('wday'),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Mean()),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Var()),
    
    feature_extraction.TargetAgg(by=['store_id', 'wday'], how=stats.Mean()),
    feature_extraction.TargetAgg(by=['store_id', 'wday'], how=stats.Var()),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(3)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(7)),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(30)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(15)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.RollingMean(20)),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(30) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(29) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(28) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(27) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(26) | stats.RollingMean(1)),
    
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(3)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(7)),
    
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(30)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(15)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.RollingMean(20)),
)
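
To make the idea of a target aggregate concrete, here is a minimal pure-Python sketch of a running mean of the target kept separately for each group. `RunningMeanByKey` is a hypothetical class written for illustration only. It is not Creme's implementation of ``feature_extraction.TargetAgg``; it only mimics a mean aggregate grouped by the keys listed in `by`.

```python
import collections


class RunningMeanByKey:
    """Toy target aggregate: one running mean of y per group (illustrative only)."""

    def __init__(self, by):
        self.by = by
        self.count = collections.defaultdict(int)
        self.mean = collections.defaultdict(float)

    def _key(self, x):
        return tuple(x[name] for name in self.by)

    def transform_one(self, x):
        # The feature is the mean of the targets seen so far for this group.
        return self.mean[self._key(x)]

    def fit_one(self, x, y):
        # Incremental mean update: no past observations are stored.
        key = self._key(x)
        self.count[key] += 1
        self.mean[key] += (y - self.mean[key]) / self.count[key]
        return self


agg = RunningMeanByKey(by=['store_id'])
features = []

# Made-up observations: (features, target).
for x, y in [({'store_id': 'CA_1'}, 2), ({'store_id': 'CA_1'}, 4), ({'store_id': 'TX_1'}, 10)]:
    features.append(agg.transform_one(x))  # read the feature first...
    agg.fit_one(x, y)                      # ...then update with the target

print(features)  # [0.0, 2.0, 0.0]
```

In this sketch the feature is read before the aggregate is updated, so the current target never leaks into its own feature.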

Below, I define the global pipeline I want to deploy in production. The pipeline is composed of:

- Extraction of the product identifier.

- Extraction of the day of the week from the date $\in$ {0, 1, ..., 6}, where 0 is Monday.

- Computation of the features.

- Standard scaler that centers and reduces the value of features.

- Model declaration: ``neighbors.KNeighborsRegressor`` or ``linear_model.LinearRegression``.

In [8]:
knn = (
    compose.FuncTransformer(get_metadata) |
    compose.FuncTransformer(extract_date) |
    extract_features |
    neighbors.KNeighborsRegressor(window_size=300, n_neighbors=30, p=2)
)

lm = (
    compose.FuncTransformer(get_metadata) |
    compose.FuncTransformer(extract_date) |
    extract_features |
    linear_model.LinearRegression(optimizer=optim.SGD(0.00005), clip_gradient=1, intercept_lr=0.001)
)

I have chosen to create one model per product. The piece of code below creates a copy of each pipeline for every product and stores the copies in dictionaries.

In [9]:
list_model = []

X_y = stream.iter_csv('./data/sample_submission.csv', target_name='F8')

for x, y in tqdm.tqdm(X_y, position=0):
    
    item_id = '_'.join(x['id'].split('_')[:3])

    if item_id not in list_model:

        list_model.append(item_id)
        
dict_knn = {item_id: copy.deepcopy(knn) for item_id in tqdm.tqdm(list_model, position=0)}
dict_lm  = {item_id: copy.deepcopy(lm) for item_id in tqdm.tqdm(list_model, position=0)}

60980it [00:01, 33220.40it/s]
100%|██████████| 3049/3049 [00:06<00:00, 460.29it/s]
100%|██████████| 3049/3049 [00:05<00:00, 541.71it/s]
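
The reason for ``copy.deepcopy`` is that every product needs its own independent state. The sketch below shows this on a stand-in object: the `template` dictionary is a hypothetical placeholder for a pipeline, and the sample ids are made up in the dataset's format.

```python
import copy

# Stand-in for a pipeline: any object holding mutable state (illustrative assumption).
template = {'n_seen': 0, 'weights': {}}

ids = [
    'HOBBIES_1_001_CA_1_validation',
    'HOBBIES_1_001_CA_2_validation',
    'HOBBIES_1_002_CA_1_validation',
]

# dict.fromkeys removes duplicates while keeping first-seen order.
item_ids = list(dict.fromkeys('_'.join(i.split('_')[:3]) for i in ids))

models = {item_id: copy.deepcopy(template) for item_id in item_ids}

# Updating one product's model leaves the others untouched.
models['HOBBIES_1_001']['n_seen'] += 1
models['HOBBIES_1_001']['weights']['wday'] = 0.5

print(item_ids)                 # ['HOBBIES_1_001', 'HOBBIES_1_002']
print(models['HOBBIES_1_002'])  # {'n_seen': 0, 'weights': {}}
```

With a shallow copy, or with the same object reused for every key, the nested `weights` dictionary would be shared between products.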


I warm up all the models on a subset of the training set. For this pre-training, I selected the last two months of the training set and saved them in CSV format. I use Creme's ``stream.iter_csv`` function to iterate over the training dataset. The loop below consumes very little RAM because the observations are loaded into memory one at a time. I train my models to predict sales 7 days in advance.

In [10]:
import collections 

random.seed(42)

params = dict(target_name='y', converters={'y': int, 'id': str}, parse_dates={'date': '%Y-%m-%d'})

X_y = stream.iter_csv('./data/train_preprocessed.csv', **params)

bar = tqdm.tqdm(X_y, position=0)

metric_knn = collections.defaultdict(lambda: metrics.MAE())
metric_lm  = collections.defaultdict(lambda: metrics.MAE())

for i, (x, y) in enumerate(bar):
    
    item_id  = '_'.join(x['id'].split('_')[:3])

    y_pred_knn = dict_knn[f'{item_id}'].predict_one(x)
    
    y_pred_lm  = dict_lm[f'{item_id}'].predict_one(x)

    metric_knn[f'{item_id}'].update(y, y_pred_knn)
    
    metric_lm[f'{item_id}'].update(y, y_pred_lm)

    dict_knn[f'{item_id}'].fit_one(x=x, y=y)
    
    # Train linear model for 10 epochs on each training example
    for _ in range(10):
        
        dict_lm[f'{item_id}'].fit_one(x=x, y=y)

1829400it [14:54:19, 34.09it/s] 
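
The loop above predicts first, scores the prediction, and only then learns from the observation. Here is a minimal pure-Python sketch of that predict-then-learn pattern. `RunningMean` and `MAE` are toy stand-ins written for illustration, not Creme's classes, and the three observations are made up.

```python
class RunningMean:
    """Toy online regressor: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def fit_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n
        return self


class MAE:
    """Toy running mean absolute error."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)
        return self

    def get(self):
        return self.total / self.n if self.n else 0.0


model, metric = RunningMean(), MAE()

for x, y in [({}, 2), ({}, 4), ({}, 3)]:
    y_pred = model.predict_one(x)  # 1. predict before seeing the target
    metric.update(y, y_pred)       # 2. score the prediction
    model.fit_one(x, y)            # 3. learn from the observation

print(metric.get())
```

Because each observation is scored before the model has learned from it, the resulting error behaves like an out-of-sample estimate without needing a separate validation set.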


Save scores of models for each product after training the models:

In [11]:
import json

scores_knn = {item_id: metric.get() for item_id, metric in metric_knn.items()}
scores_lm  = {item_id: metric.get() for item_id, metric in metric_lm.items()}

with open('scores_knn.json', 'w') as file:
    
    json.dump(scores_knn, file)

with open('scores_lm.json', 'w') as file:
    
    json.dump(scores_lm, file)

In [17]:
scores = {}

for item_id in tqdm.tqdm(scores_knn.keys()):
    
    score_knn = scores_knn[item_id]
    
    score_lm  = scores_lm[item_id]
    
    if score_knn < score_lm:
        
        scores[item_id] = score_knn
        
    else:
        
        scores[item_id] = score_lm

100%|██████████| 3049/3049 [00:00<00:00, 165462.52it/s]


In [None]:
# Average MAE of the KNN models over all products.
# A set comprehension would drop duplicate scores and bias the mean.
sum(scores_knn.values()) / len(scores_knn)

Save models:

In [12]:
import dill

with open('dict_knn.dill', 'wb') as file:
    
    dill.dump(dict_knn, file)
    
with open('dict_lm.dill', 'wb') as file:
    
    dill.dump(dict_lm, file)

Load scores and models:

In [13]:
import json

with open('scores_knn.json', 'rb') as file:
    
    scores_knn = json.load(file)

with open('scores_lm.json', 'rb') as file:
    
    scores_lm = json.load(file)

In [14]:
import dill

with open('dict_knn.dill', 'rb') as file:
    
    dict_knn = dill.load(file)
    
with open('dict_lm.dill', 'rb') as file:
    
    dict_lm = dill.load(file)

For each product, we choose the best model between ``KNeighborsRegressor`` and the linear model, based on the MAE measured during the warm-up:

In [15]:
dict_model = {}

for item_id in tqdm.tqdm(scores_knn.keys()):
    
    score_knn = scores_knn[item_id]
    
    score_lm  = scores_lm[item_id]
    
    if score_knn < score_lm:
        
        dict_model[item_id] = dict_knn[item_id]
        
    else:
        
        dict_model[item_id] = dict_lm[item_id]
        
# Save selected models:
with open('dict_model.dill', 'wb') as file:
    
    dill.dump(dict_model, file)

100%|██████████| 3049/3049 [00:00<00:00, 8539.32it/s]


#### Deployment of the model:

**Now that all the models are pre-trained, I will be able to deploy the pipelines behind an API in a production environment. I will use the [Chantilly](https://github.com/creme-ml/chantilly) library to do so.**

**[Chantilly](https://github.com/creme-ml/chantilly) is a project that aims to make it easier to train Creme models once they are deployed. Chantilly is a minimalist API based on the Flask framework.** Chantilly makes it possible to make predictions, train models and measure model performance in real time. It also provides a dashboard.

Chantilly is a library currently under development. For various reasons, I chose to extract only the Chantilly files I needed for this project.

I chose to deploy my API on Heroku. To do so, I followed this [tutorial](https://stackabuse.com/deploying-a-flask-application-to-heroku/). I chose Heroku because it lets me run my API with a very modest configuration at a low cost. (This modest configuration increases the response time of my API when there are several users.) With a bigger budget, my API could handle a large volume of simultaneous requests.

The main difficulty I encountered when deploying on Heroku was creating the ``Procfile``. The ``Procfile`` is used to initialize the API when it is deployed on Heroku.

Here are its contents:

```web: gunicorn -w 4 "app:create_app()"```
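
The `"app:create_app()"` target tells gunicorn to import the `app` module and call `create_app()` to obtain the application. Chantilly's own factory builds a Flask application; the sketch below is only a dependency-free stand-in that returns a bare WSGI callable, to show the shape of such a factory. The response body and the fake request are assumptions for illustration.

```python
def create_app():
    """Hypothetical application factory: builds and returns a WSGI callable."""

    def app(environ, start_response):
        # Answer every request with a small JSON body.
        body = b'{"status": "ok"}'
        headers = [
            ('Content-Type', 'application/json'),
            ('Content-Length', str(len(body))),
        ]
        start_response('200 OK', headers)
        return [body]

    return app


# Call the application by hand, the way a WSGI server would.
captured = {}


def start_response(status, headers):
    captured['status'] = status


environ = {'REQUEST_METHOD': 'GET', 'PATH_INFO': '/'}
body = b''.join(create_app()(environ, start_response))
print(captured['status'], body)
```

Using a factory means each gunicorn worker builds its own application instance when it starts.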

You will be able to find the whole architecture of my API [here](https://github.com/raphaelsty/M5-Forecasting-Accuracy).

After deploying my Chantilly API on Heroku, I add the regression flavor. Chantilly uses this flavor to select the appropriate metrics (MAE, MSE and SMAPE).

In [None]:
url = 'https://kaggle-creme-ml.herokuapp.com'
#url = 'http://0.0.0.0:5000'

In [None]:
import requests
requests.post(f'{url}/api/init', json= {'flavor': 'regression'})

After initializing the flavor of my API, I upload all the models I've pre-trained. Each model has a name. This name is the name of the product. I have used dill to serialize the model before uploading it to my API.

In [180]:
for model_name, model in tqdm.tqdm(dict_model.items(), position=0):
    r = requests.post(f'{url}/api/model/{model_name}', data=dill.dumps(model))
    break  # Stops after the first model; remove this line to upload all of them

  0%|          | 0/3049 [00:00<?, ?it/s]


In [190]:
model_name

'HOBBIES_1_001'

All the models are now deployed in production and available to make predictions. The models can also be updated on a daily basis. That's it.

![](static/online_learning.png)

**As you may have noticed, online learning reduces the complexity of deploying a machine learning algorithm in production. Moreover, to update the model, we only have to make calls to the API. We don't need to retrain the model from scratch.**


#### Make a prediction by calling the API:

In [191]:
r = requests.post(f'{url}/api/predict', json={
    'id': 1,
    'model': 'HOBBIES_1_001',
    'features': {'date': '2016-05-23', 'id': 'HOBBIES_1_001_CA_1'}
})
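
The call returns a response object, and it is worth checking its status code rather than assuming the request succeeded. Below is a minimal sketch: `describe_response` is a hypothetical helper, and `FakeResponse` is a stand-in that exposes only the `status_code` and `text` attributes of a `requests.Response`, so the example runs without network access.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in exposing the two `requests.Response` attributes used below."""
    status_code: int
    text: str = ''


def describe_response(r):
    """Return a short, human-readable summary of an HTTP response."""
    if 200 <= r.status_code < 300:
        return f'OK ({r.status_code})'
    return f'Failed ({r.status_code}): {r.text}'


# Made-up responses for illustration.
ok = describe_response(FakeResponse(201))
failed = describe_response(FakeResponse(400, 'bad request'))
print(ok)      # OK (201)
print(failed)  # Failed (400): bad request
```

With a real response, `describe_response(r)` can be called directly on the value returned by `requests.post`.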

You can execute these requests yourself. I don't recommend querying my models to blend my predictions with yours, because my models are not intended to be competitive.

If you want my opinion on the competition, I think there's a lot of noise in the data. I haven't observed any particular trends when it comes to day-to-day predictions for all products.


#### Update models with new data:

In [192]:
r = requests.post(f'{url}/api/learn', json={
    'id': 1,
    'model': 'HOBBIES_1_001',
    'ground_truth': 1,
})

In [193]:
r

<Response [400]>

In [44]:
def funct(x):
    yield from x

In [45]:
def gen_treshold(k):

    gen1 = funct([1, 2, 3, 4])
    gen2 = funct([5, 6, 7, 8])

    for bar1, bar2 in zip(gen1, gen2):
        if bar1 > k:
            yield bar1, bar2

In [46]:
for bar1, bar2 in gen_treshold(2):
    print(bar1, bar2)

3 7
4 8


In [50]:
def lookahead(iterable):
    """Pass through all values from the given iterable, augmented by the
    information if there are more values to come after the current one
    (True), or if it is the last value (False).
    """
    # Get an iterator and pull the first value.
    it = iter(iterable)
    try:
        last = next(it)
    except StopIteration:
        # Empty iterable: nothing to yield.
        return
    # Run the iterator to exhaustion (starting from the second value).
    for val in it:
        # Report the *previous* value (more to come).
        yield last, True
        last = val
    # Report the last value.
    yield last, False

In [51]:
for i in lookahead([1, 2, 3]):
    print(i)

(1, True)
(2, True)
(3, False)
