#### Engineering

I'll start by installing the Creme library:

In [1]:
# !pip uninstall creme -y
# !pip install git+https://github.com/creme-ml/creme --upgrade

In [2]:
#!pip install creme

I'm importing the packages that I'm going to need:

In [3]:
import copy
import datetime
import random
import tqdm

In [4]:
from creme import compose
from creme import feature_extraction
from creme import metrics
from creme import neighbors
from creme import preprocessing
from creme import stats
from creme import stream

I use this first function to parse the date and extract the day of the week.

In [5]:
def extract_date(x):
    """Extract features from the date."""
    import datetime
    if not isinstance(x['date'], datetime.datetime):
        x['date'] = datetime.datetime.strptime(x['date'], '%Y-%m-%d')
    x['wday'] = x['date'].weekday()
    return x
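
As a quick check, the function can be applied to a single observation. The sketch below repeats the function so that it runs on its own; the sample row is an assumption (the date is borrowed from the prediction request made later in this notebook).

```python
import datetime


def extract_date(x):
    """Parse the date if needed and add the day of the week as 'wday'."""
    if not isinstance(x['date'], datetime.datetime):
        x['date'] = datetime.datetime.strptime(x['date'], '%Y-%m-%d')
    x['wday'] = x['date'].weekday()  # Monday is 0, Sunday is 6
    return x


row = extract_date({'date': '2016-05-23'})  # hypothetical sample row
print(row['wday'])
```

Calling the function a second time on the same row leaves it unchanged, because the date is only parsed when it is still a string.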

``get_metadata`` extracts the identifiers of the product and the store from the ``id`` field.

In [6]:
def get_metadata(x):
    key = x['id'].split('_')
    x['store_id'] = f'{key[3]}_{key[4]}'
    x['item_id'] = f'{key[0]}_{key[1]}_{key[2]}'
    return x
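
For illustration, here is what the function returns on one identifier. The function is repeated so the snippet is self-contained; the identifier is the one used in the prediction request later in this notebook.

```python
def get_metadata(x):
    """Split the id into a store identifier and an item identifier."""
    key = x['id'].split('_')
    x['store_id'] = f'{key[3]}_{key[4]}'          # e.g. state and store number
    x['item_id'] = f'{key[0]}_{key[1]}_{key[2]}'  # e.g. category, department, item
    return x


row = get_metadata({'id': 'HOBBIES_1_001_CA_1'})
print(row['store_id'], row['item_id'])
```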

Below I define the feature extraction pipeline. I use the module ``feature_extraction.TargetAgg`` to calculate the features on the target variable of the stream.

In [7]:
extract_features = compose.TransformerUnion(
    compose.Select('wday'),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(7) | stats.Mean()),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(7) | stats.Var()),
    
    feature_extraction.TargetAgg(by=['wday'], how=stats.Shift(7) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['wday'], how=stats.Shift(7) | stats.RollingMean(3)),
    feature_extraction.TargetAgg(by=['wday'], how=stats.Shift(7) | stats.RollingMean(7)),
    
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(7) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(7) | stats.RollingMean(3)),
    feature_extraction.TargetAgg(by=['store_id'], how=stats.Shift(7) | stats.RollingMean(7)),
    
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.Shift(7) | stats.RollingMean(1)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.Shift(7) | stats.RollingMean(3)),
    feature_extraction.TargetAgg(by=['wday', 'store_id'], how=stats.Shift(7) | stats.RollingMean(7)),
)
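
To make the ``stats.Shift(7) | stats.RollingMean(3)`` features more concrete, here is a minimal sketch in plain Python of a rolling mean that only sees target values that are at least 7 updates old, kept separately for each group. The class ``ShiftedRollingMean`` and the toy sales values are assumptions made for illustration; this is not Creme's implementation, and its exact semantics may differ.

```python
from collections import defaultdict, deque


class ShiftedRollingMean:
    """Rolling mean over `window` values, ignoring the `shift` most recent ones."""

    def __init__(self, shift, window):
        self.shift = shift
        self.recent = deque()               # values still too recent to be used
        self.window = deque(maxlen=window)  # values old enough to be averaged

    def update(self, y):
        self.recent.append(y)
        if len(self.recent) > self.shift:
            self.window.append(self.recent.popleft())
        return self

    def get(self):
        return sum(self.window) / len(self.window) if self.window else 0.0


# One statistic per store, similar in spirit to by=['store_id'].
by_store = defaultdict(lambda: ShiftedRollingMean(shift=7, window=3))

for sales in range(1, 11):  # toy daily sales 1, 2, ..., 10
    by_store['CA_1'].update(sales)

feature = by_store['CA_1'].get()  # mean of 1, 2, 3; the last 7 values are withheld
print(feature)
```

The shift presumably matters because the models predict 7 days ahead, so the 7 most recent targets are not yet known at prediction time.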

Below, I define the global pipeline I want to deploy in production. The pipeline is composed of:

- Extraction of the product identifier.

- Extraction of the day of the week from the date, $\in$ {0, 1, ..., 6} (``weekday()`` returns 0 for Monday).

- Computation of the features.

- Standard scaler that centers and scales the features (``preprocessing.StandardScaler``; the pipeline in the next cell, as it was run, does not include this step).

- Model declaration ``neighbors.KNeighborsRegressor``.

In [8]:
model = (
    compose.FuncTransformer(get_metadata) |
    compose.FuncTransformer(extract_date) |
    extract_features |
    neighbors.KNeighborsRegressor(window_size=30, n_neighbors=15)
)
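
The last step of the pipeline is a nearest-neighbors regressor that only remembers recent observations. The sketch below illustrates that idea in plain Python: it keeps the last ``window_size`` observations and averages the targets of the ``n_neighbors`` closest ones. The class ``WindowedKNNRegressor``, the Euclidean distance and the unweighted average are assumptions made for illustration; Creme's ``neighbors.KNeighborsRegressor`` may differ in its details.

```python
import math
from collections import deque


class WindowedKNNRegressor:
    """Illustrative only: average the targets of the k nearest recent observations."""

    def __init__(self, window_size=30, n_neighbors=15):
        self.window = deque(maxlen=window_size)  # the oldest observations drop out
        self.n_neighbors = n_neighbors

    def fit_one(self, x, y):
        self.window.append((x, y))
        return self

    def predict_one(self, x):
        if not self.window:
            return 0.0

        def distance(other):
            keys = set(x) | set(other)
            return math.sqrt(sum((x.get(k, 0.0) - other.get(k, 0.0)) ** 2 for k in keys))

        nearest = sorted(self.window, key=lambda pair: distance(pair[0]))[:self.n_neighbors]
        return sum(y for _, y in nearest) / len(nearest)


knn = WindowedKNNRegressor(window_size=3, n_neighbors=2)
for wday, sales in [(0, 1), (1, 3), (5, 10), (6, 20)]:  # toy data
    knn.fit_one({'wday': wday}, sales)

# The first observation has left the window; the two nearest remaining ones are wday 1 and 5.
prediction = knn.predict_one({'wday': 0})
print(prediction)
```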

I have chosen to create one model per product. The piece of code below creates a copy of the pipeline for each product and stores the copies in a dictionary.

In [9]:
list_model = []

X_y = stream.iter_csv('./data/sample_submission.csv', target_name='F8')

for x, y in tqdm.tqdm(X_y, position=0):
    
    item_id = '_'.join(x['id'].split('_')[:3])
    
    if item_id not in list_model:
    
        list_model.append(item_id)
        
dic_models = {item_id: copy.deepcopy(model) for item_id in tqdm.tqdm(list_model, position=0)}

60980it [00:02, 26285.58it/s]
100%|██████████| 3049/3049 [00:05<00:00, 531.69it/s]


In [10]:
len(dic_models)

3049
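
The reason for ``copy.deepcopy`` is that every product needs its own state. The small sketch below shows this with a stand-in model; ``RunningMean`` and the two item identifiers are assumptions made for illustration and are not Creme classes.

```python
import copy


class RunningMean:
    """Tiny stateful stand-in for a model (hypothetical, not a Creme class)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n
        return self


template = RunningMean()
item_ids = ['HOBBIES_1_001', 'HOBBIES_1_002']

# Each product gets an independent copy of the template.
models = {item_id: copy.deepcopy(template) for item_id in item_ids}

# Updating one product's model does not touch the others or the template.
models['HOBBIES_1_001'].update(4.0)
print(models['HOBBIES_1_001'].n, models['HOBBIES_1_002'].n, template.n)
```

Without the deep copy, every key would point to the same object and all products would share one set of statistics.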

I warm up all the models on a subset of the training set. For this pre-training, I selected the last two months of the training set and saved them in CSV format. I use Creme's ``stream.iter_csv`` function to iterate over the training dataset. The pipeline below consumes very little RAM because the observations are loaded into memory one at a time. I train my models to predict sales 7 days in advance.

In [None]:
random.seed(42)

params = dict(target_name='y', converters={'y': int, 'id': str}, parse_dates= {'date': '%Y-%m-%d'})

X_y = stream.simulate_qa(
    X_y    = stream.iter_csv('./data/train_preprocessed.csv', **params), 
    moment = 'date', 
    delay  = datetime.timedelta(days=7)
)

bar = tqdm.tqdm(X_y, position = 0)

metric = metrics.Rolling(metrics.MAE(), 600000)

y_pred = {}

for i, x, y in bar:
    
    item_id  = '_'.join(x['id'].split('_')[:3])
    
    if y is not None:

        dic_models[f'{item_id}'].fit_one(x=x, y=y)
        
        # Update the metric:
        metric = metric.update(y, y_pred.pop(i))  # pop so stored predictions do not pile up
           
        if i % 1000 == 0:
            # Update tqdm progress bar.
            bar.set_description(f'MAE: {metric.get():4f}')

    else:

        y_pred[i] = dic_models[f'{item_id}'].predict_one(x)

MAE: 1.274888: : 1092817it [09:38, 1774.50it/s]
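
The loop above relies on each observation appearing twice: first without its target (the model predicts), then again once the delay has elapsed (the model learns). Here is a minimal sketch of that schedule in plain Python. The function ``simulate_delayed_answers`` and the toy observations are assumptions made for illustration; this is not Creme's ``stream.simulate_qa``, and the ordering of simultaneous events may differ.

```python
import datetime
import heapq


def simulate_delayed_answers(observations, delay):
    """Yield (i, x, None) when x arrives, then (i, x, y) once `delay` has passed."""
    pending = []  # min-heap of (reveal_time, i, x, y)
    for i, (x, y) in enumerate(observations):
        now = x['date']
        # Reveal every target whose delay has elapsed.
        while pending and pending[0][0] <= now:
            _, j, x_old, y_old = heapq.heappop(pending)
            yield j, x_old, y_old
        yield i, x, None
        heapq.heappush(pending, (now + delay, i, x, y))
    # Reveal the targets still pending at the end of the stream.
    while pending:
        _, j, x_old, y_old = heapq.heappop(pending)
        yield j, x_old, y_old


observations = [
    ({'date': datetime.datetime(2016, 5, day)}, sales)
    for day, sales in [(1, 10), (2, 20), (3, 30)]  # toy data
]
events = list(simulate_delayed_answers(observations, datetime.timedelta(days=2)))
order = [(i, y) for i, _, y in events]
print(order)
```

With a delay of two days, the target of the first observation only becomes available when the third observation arrives, which mimics forecasting ahead of time.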

In [None]:
# KNN
# MAE: 1.101585: : 3658800 [26:13, 2324.70it/s]

# LR
#MAE: 1.307144: : 3658800it [19:04, 3197.18it/s]

Save models after training with dill:

In [None]:
import dill

In [None]:
with open('dic_model_knn_lag.dill', 'wb') as file:
    dill.dump(dic_models, file)

#### Deployment of the model:

**Now that all the models are pre-trained, I will be able to deploy the pipelines behind an API in a production environment. I will use the [Chantilly](https://github.com/creme-ml/chantilly) library to do so.**

**[Chantilly](https://github.com/creme-ml/chantilly) is a project that aims to make it easier to train Creme models once they are deployed. Chantilly is a minimalist API based on the Flask framework.** Chantilly lets you make predictions, train models and measure model performance in real time. It also gives access to a dashboard.

Chantilly is a library currently under development. For various reasons, I chose to extract only the files from Chantilly that I needed for this project.

I chose to deploy my API on Heroku. To do so I followed this [tutorial](https://stackabuse.com/deploying-a-flask-application-to-heroku/). I chose Heroku because it lets me run my API with a very modest configuration at a low cost. (This modest configuration increases the response time of my API when there are several users.) With a larger budget, the API could handle a large volume of simultaneous requests.

The main difficulty I encountered when deploying on Heroku was creating the ``Procfile``. The ``Procfile`` is used to initialize the API when it is deployed on Heroku.

Here is its contents:

```web: gunicorn -w 4 "app:create_app()"```

You will be able to find the whole architecture of my API [here](https://github.com/raphaelsty/M5-Forecasting-Accuracy).

After deploying my Chantilly API on Heroku, I add the regression flavor. Chantilly uses this flavor to select the appropriate metrics (MAE, MSE and SMAPE).

In [12]:
url = 'https://kaggle-creme-ml.herokuapp.com'

In [13]:
import requests
requests.post(f'{url}/api/init', json= {'flavor': 'regression'})

<Response [201]>

After initializing the flavor of my API, I upload all the models I've pre-trained. Each model has a name. This name is the name of the product. I have used dill to serialize the model before uploading it to my API.

In [14]:
with open('dic_models.dill', 'rb') as file:
    dic_models = dill.load(file)

In [15]:
for model_name, model in tqdm.tqdm(dic_models.items(), position=0):
    r = requests.post(f'{url}/api/model/{model_name}', data=dill.dumps(model))

100%|██████████| 3049/3049 [23:06<00:00,  2.20it/s]


All the models are now deployed in production and available to make predictions. The models can also be updated on a daily basis. That's it.

![](static/online_learning.png)

**As you may have noticed, the online learning approach reduces the complexity of deploying a machine learning algorithm in production. Moreover, to update the model, we only have to make calls to the API. We don't need to retrain the model from scratch.**


#### Make a prediction by calling the API:

In [26]:
r = requests.post(f'{url}/api/predict', json={
    'id': 1,
    'model': 'HOBBIES_1_001',
    'features': {'date': '2016-05-23', 'id': 'HOBBIES_1_001_CA_1'}
})

In [27]:
print(r.json())

{'model': 'HOBBIES_1_001', 'prediction': 0.8351625255839868}


You can execute these requests yourself. However, I don't recommend querying my models to blend my predictions with yours: my models are not intended to be competitive.

If you want my opinion on the competition, I think there's a lot of noise in the data. I haven't observed any particular trends when it comes to day-to-day predictions for all products.


#### Update models with new data:

In [18]:
r = requests.post(f'{url}/api/learn', json={
    'id': 1,
    'model': 'HOBBIES_1_001',
    'ground_truth': 1,
})
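
The two requests in this notebook share the same ``id``, which suggests that the API uses it to match a ground truth to the features sent at prediction time, so the features do not have to be sent again. The sketch below only builds the two JSON payloads and sends nothing; the helper names ``predict_payload`` and ``learn_payload`` are assumptions made for illustration and are not part of Chantilly.

```python
import json


def predict_payload(request_id, model_name, features):
    """Body for a prediction request (hypothetical helper)."""
    return {'id': request_id, 'model': model_name, 'features': features}


def learn_payload(request_id, model_name, ground_truth):
    """Body for a learning request that reuses the id of an earlier prediction."""
    return {'id': request_id, 'model': model_name, 'ground_truth': ground_truth}


question = predict_payload(
    1, 'HOBBIES_1_001', {'date': '2016-05-23', 'id': 'HOBBIES_1_001_CA_1'}
)
answer = learn_payload(1, 'HOBBIES_1_001', 1)

body = json.dumps(answer)  # what would be sent as the JSON body
print(body)
```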