# Gradient Boosting Notebook

Fork from a version using CatBoost. 

We switch to LightGBM because CatBoost is very RAM-intensive. 

Used Kaggle sources:

- [M5 First Public Notebook Under 0.50](https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes)
- [m5_catboost (inspired on previous notebook)](https://www.kaggle.com/vgarshin/m5-catboost)
    * This notebook is *heavily* inspired by this notebook. 
    * Where sections are almost literally copied, I indicated this in the respective cells as well

## Roadmap / TODO

- Remove low-importance categorical features to save RAM if necessary
- Increase number of training days as much as RAM allows (already done, mostly)
- Further remove RAM requirements by experimenting with 'lower' dtypes
- Make sure to [avoid memory spikes](https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/149754)
- Final prediction as weighted mean of three "alpha based" predictions. See original notebook.
- Use lightgbm.cv() instead! 
- Other notebooks do actually randomly select validation days. Our lower score could be because we do not train on the last day.
    * Given that the time dependencies are captured in the lag features, shuffling should actually not hurt (this does not hold for other methods)
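
The "lower dtypes" item above could be explored along the following lines. This is only a sketch on a made-up toy frame, not the M5 data, and the helper name `downcast_numeric` is hypothetical.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy whose numeric columns use the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # pandas does not go below float32 when downcasting floats
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy data (assumption: values chosen only for illustration)
toy = pd.DataFrame({
    "wday": np.arange(1, 8, dtype="int64"),
    "sell_price": np.array([0.5, 1.25, 2.0, 3.75, 4.5, 5.0, 6.25], dtype="float64"),
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Whether the smaller dtypes are safe depends on the value range of each column, so it is worth checking the range before applying this to the real data.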

# Submission log

For reference, the best score of the CatBoost notebook this one was forked from was [0.67] on CPU, with feature engineering (lag + rolling mean + dates), 1000 iterations, depth 8, max_bin 128, and max_ctr_complexity 1.

Increase MODEL_VERSION below

Most recent at the bottom:

- v0: 0.57666
- v1: 0.58
    * Added month and year date feature
    * Iterations 1000 > 1200
- v2: CRASH
    * Remove month and year again
    * Iterations still 1200
    * Year more training days
- v3: 0.55185
    * BEGIN_DATE '2014-8-01' 
- v4: 0.76206
    * Add a zero threshold of 0.01 for very low predicted sales
    * Setting predictions to 0 already during recursive predictions may too strongly influence predictions for the coming days. I could instead only apply the threshold after making all recursive predictions.
- v5: 0.75
    * Try a lower threshold of 0.001 
    * What is strange is that I can't spot anything being set to 0, so this threshold is clearly too low.
    
v4 and v5 accidentally used only a few days for training... I changed this for speed at one point and forgot to set it back. Back to the drawing board.

- v6: 0.57
    * Fix training data issue of v4 and v5
    * Added feature importance
    * Zero threshold at 0.01, but now only apply after all predictions are done
    * So this does not help, unless the optimized prediction loop has some detrimental effect. But it shouldn't.
- v7: 0.6
    * Remove rolling_window features on id, I don't think it makes sense
    * Clean up load_data, remove unused label_encoding
    * There was duplicate code for dropping everything before BEGIN_DATE that nevertheless increased memory usage. Fixed.
    * Fix: all event_names were NaN. I forgot to set them to int16 in load_data(). Fixed.
    * Now we can safely drop NaNs! 
    * As a result of dropping NaNs, I'll experiment with adding more data. Try with BEGIN_DATE = '2014-01-01'
- v7: 0.58478
    * BEGIN_DATE '2013-08-01' (day 915)
- v8: CRASH
    * Comment out threshold code; clearly does not help.
    * BEGIN_DATE: '2012-01-01' (day 337)
- v9: 0.58478 =v7!
    * Back to '2013-08-01' (day 915)
    * Difference with v7 is that in v7 I forgot to comment out the threshold code
- v10: 0.58478 (so the optimization works fine!)
    * I can't quite get back to the original 0.55 of v3, despite adding more data. What's going on? 
    * Could be caused by: a) dropping NaN (but we added more data instead!) b) mistake in optimized prediction loop c) due to the change in event_type handling, d) removing wnd_feats (unlikely)
    * a) and b) seem very unlikely to me. d) is possible, but the <0.50 notebook also did not use that feature. It could be c), if events were previously not used in the prediction and now actually hurt it slightly.
    * I first want to exclude option c, just to be sure. 
- v11: 
    * Use optimized version
- v??: 0.54597
    * Submission with truly best per category models
    * Weighting 1/2 (individual) 1/2 (category)
- v?? + 1:
    * Weighting 2/3 (individual) 1/3 (category)
- v?? + 2:
    * Weighting 1/3 (individual) 2/3 (category)
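
The weightings listed for the last three versions amount to a weighted mean of two prediction vectors. A minimal sketch with made-up numbers (the helper `blend` and both arrays are hypothetical, not outputs of the actual models):

```python
import numpy as np

# Made-up predictions for four series (assumption: illustration only)
pred_individual = np.array([1.0, 2.0, 0.0, 4.0])
pred_category = np.array([2.0, 2.0, 3.0, 1.0])

def blend(pred_a, pred_b, weight_a):
    """Weighted mean of two prediction vectors; pred_b gets weight 1 - weight_a."""
    return np.average([pred_a, pred_b], axis=0, weights=[weight_a, 1.0 - weight_a])

half = blend(pred_individual, pred_category, 1 / 2)
one_third = blend(pred_individual, pred_category, 1 / 3)
print(half)
print(one_third)
```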

# Preprocessing

## Imports

In [None]:
MODEL_VERSION = 'v10'

import os
import gc #garbage collection
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
from datetime import datetime, timedelta, date # handling dates
from tqdm.notebook import tqdm # progress bars

# LightGBM
import lightgbm as lgb

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Global variables

TODO: BEGIN_DAY is a string, which makes no sense. Convert it to int and refactor.

This is presumably why the following line in `load_data` goes wrong: `sales_train_validation = sales_train_validation[(sales_train_validation['day'] >= BEGIN_DAY)]`
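
To illustrate why comparing day labels as strings is fragile, here is a small sketch. The helper `date_to_day_offset` and the constant `FIRST_DATE` are hypothetical names; the offset is computed the same way as `BEGIN_DAY` in the cell below, but kept as an int.

```python
from datetime import datetime

FIRST_DATE = "2011-01-29"  # first date of the full data, as noted in this notebook

def date_to_day_offset(date_str, first_date=FIRST_DATE):
    """Number of days between first_date and date_str, as an int."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(date_str, fmt) - datetime.strptime(first_date, fmt)).days

begin_day = date_to_day_offset("2013-08-01")
print(begin_day)

# String comparison is lexicographic: 'd' sorts after '9', so every
# 'd_<n>' label compares as >= '915' and the filter removes nothing.
string_check = "d_1" >= str(begin_day)

# Stripping the prefix and comparing ints gives the intended result.
int_check = int("d_1".replace("d_", "")) >= begin_day
print(string_check, int_check)
```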

In [None]:
# Set this to true if you want only one iteration to run, for testing purposes.
TEST_RUN = False

# Do not truncate view when max_cols is exceeded
# ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html
pd.set_option('display.max_columns', 50) 

# Path to Data Folder
KAGGLE_DATA_FOLDER = '/kaggle/input/m5-forecasting-accuracy'

# Paths to Models per Category
PATH_TO_CAT0="/kaggle/input/models-per-cat-with-sale/model_cat0_v10.lgb"
PATH_TO_CAT1="/kaggle/input/models-per-cat-with-sale/model_cat1_v10.lgb"
PATH_TO_CAT2="/kaggle/input/models-per-cat-with-sale/model_cat2_v10.lgb"

# Path to Model over All Categories
PATH_TO_ALL_CATS = "/kaggle/input/lgbmindividualbestsubmission/model_v13_param_tuning.lgb"
BACKWARD_LAG = 60
END_DAY = 1913
# Use this if you do not want to load in all data
# Full data starts at 2011-01-29
#BEGIN_DATE = '2015-02-11' 
#BEGIN_DATE = '2014-8-01'
BEGIN_DATE = '2013-08-01'
#BEGIN_DATE = '2012-01-01'
BEGIN_DAY = str((datetime.strptime(BEGIN_DATE, '%Y-%m-%d') - datetime.strptime('2011-01-29', '%Y-%m-%d')).days)
TRAIN_SPLIT = '2016-03-27'
EVAL_SPLIT = '2016-04-24' # In this phase of the competition, this is the end date
print(datetime.strptime(EVAL_SPLIT, '%Y-%m-%d'))
TASK_TYPE='CPU'



PATH_MODELS = [PATH_TO_ALL_CATS, PATH_TO_CAT0, PATH_TO_CAT1, PATH_TO_CAT2]

## Loading data and preprocessing

### Data types 

TODO:

- Do we want to keep 'd' and e.g. event_name_1 as objects? Maybe they can be converted to int16.
    * Event type and name are currently e.g. "unknown". The type in particular could be represented by an integer.
    * Update: LightGBM rejects the 'object' dtype, so the conversion to int or float is necessary

See e.g. [here](https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes)

```
cal[col] = cal[col].cat.codes.astype("int16")
cal[col] -= cal[col].min()
```

- The snaps are now int16, but they are float32(!) in the notebook above. However, I don't see why they would need to be float32, since I believe they only take two values.
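
A small sketch of what the two-line conversion quoted above does, on a made-up event column (the event names are only examples). Missing values get the category code -1, and subtracting the minimum shifts all codes so that they start at 0.

```python
import pandas as pd

# Toy event column with missing values (assumption: illustration only)
events = pd.Series(["SuperBowl", None, "Easter", None, "SuperBowl"], dtype="category")

# Categories are sorted, so Easter -> 0 and SuperBowl -> 1; missing -> -1
codes = events.cat.codes.astype("int16")
print(codes.tolist())

# Shift so that the smallest code (here the one for "missing") becomes 0
shifted = codes - codes.min()
print(shifted.tolist())
```

Note that the shift only has an effect when the column actually contains missing values; otherwise the minimum code is already 0.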

In [None]:
# N.B. LightGBM specifically requires the 'category' dtype
# E.g. see https://stackoverflow.com/questions/56070396/why-does-categorical-feature-of-lightgbm-not-work
CALENDAR_DTYPES = {
    'date':             'str',
    'wm_yr_wk':         'int16', 
    'weekday':          'category',
    'wday':             'int16', 
    'month':            'int16', 
    'year':             'int16', 
    'd':                'object',
    'event_name_1':     'category',
    'event_type_1':     'category',
    'event_name_2':     'category',
    'event_type_2':     'category',
    'snap_CA':          'int16', 
    'snap_TX':          'int16', 
    'snap_WI':          'int16'
}
PARSE_DATES = ['date']
SALES_PRICES_DTYPES = {
    'store_id':    'category', 
    'item_id':     'category', 
    'wm_yr_wk':    'int16',  
    'sell_price':  'float32'#,
    #'sales': 'float32'
}

### Loading with preprocessing

All steps are now integrated into load_data; I did them sequentially before but that resulted in too much overhead.

#### Convert sales dataframe from wide to long format

Whereas in the "wide" dataframe one row contains columns with the corresponding sales/demand per day (1913 days at the moment), the new "long" dataframe has a new entry for each day. 

The resulting dataframe (assuming you use all days) replaces the 1913 day columns with a single 'day' column plus a 'sales' column, so it has 1919 - 1913 + 2 = 8 columns and 30490 x 1913 = 58,327,370 rows.

This is achieved with [pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) by specifying identifier variables and measured variables ('day' in this case), as well as the name of the output value.

Convert sales_train_validation such that it becomes a function of day with output of sales/demand.
Unpivots everything not set as id_var, so by default value_vars are all day entries.
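
The wide-to-long conversion described above can be tried on a tiny made-up frame (two series, three days; the ids and values are assumptions for illustration):

```python
import pandas as pd

# Wide format: one row per series, one column per day
wide = pd.DataFrame({
    "id": ["A_validation", "B_validation"],
    "item_id": [0, 1],
    "d_1": [3.0, 0.0],
    "d_2": [1.0, 2.0],
    "d_3": [0.0, 5.0],
})

# Everything not listed in id_vars is unpivoted into (day, sales) pairs
long = pd.melt(wide, id_vars=["id", "item_id"], var_name="day", value_name="sales")
print(long.shape)
print(long)
```

The two id columns are kept, the three day columns collapse into 'day' and 'sales', and the row count becomes 2 x 3 = 6.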

#### Merging dataframes

[pandas merge doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

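- "left": left outer join that uses the keys from the left dataframe. Note that `load_data` below calls `merge` without `how`, so pandas falls back to its default inner join.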
- left dataframe is "sales_train_validation", in which we have just defined the "day" column header
- join "day" (e.g. d_1) on "d" from the calendar dataframe (also of form d_1)
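
A minimal sketch of the difference between a left join and the pandas default (inner join), using made-up frames in which the calendar lacks one of the days:

```python
import pandas as pd

# Toy frames (assumption: illustration only); the calendar has no row for d_3
sales = pd.DataFrame({
    "id": ["A", "A", "A"],
    "day": ["d_1", "d_2", "d_3"],
    "sales": [3.0, 1.0, 0.0],
})
calendar = pd.DataFrame({"d": ["d_1", "d_2"], "wm_yr_wk": [11101, 11101]})
calendar = calendar.rename(columns={"d": "day"})

left = sales.merge(calendar, on="day", how="left")  # keeps d_3, fills NaN
inner = sales.merge(calendar, on="day")             # default how="inner", drops d_3
print(len(left), len(inner))
```

When every day in the sales frame has a calendar row, both joins return the same rows.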

In [None]:
def load_data(train=True):
    """
    Load data
    """
    
    ##### SALES_TRAIN_VALIDATION
    
    print("Loading train and validation data")   
    # Dtype magic from https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes
    # Required to make LightGBM deal with categorical values
    numcols = [f"d_{day}" for day in range(int(BEGIN_DAY),END_DAY+1)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dtype = {numcol:"float32" for numcol in numcols} 
    dtype.update({col: "category" for col in catcols if col != "id"})
    
    sales_train_validation = pd.read_csv(os.path.join(KAGGLE_DATA_FOLDER, 'sales_train_validation.csv'),
                                                     usecols=catcols+numcols, dtype=dtype)
    for col in catcols:
        if col != "id":
            sales_train_validation[col] = sales_train_validation[col].cat.codes.astype("int16")
            sales_train_validation[col] -= sales_train_validation[col].min()
    
    if not train:
        # Add columns for future 28 days, 1914-1941
        for day in range(END_DAY+1, END_DAY+28+1):
            sales_train_validation[f"d_{day}"] = 0  # TODO this was np.nan before

        # Then only keep data from the last BACKWARD_LAG days        
        # If we remove the 'd_' prefix, we can compare day numbers
        value_vars = [column for column in sales_train_validation.columns 
                              if (column.startswith('d_') and int(column.replace('d_', ''))>= END_DAY - BACKWARD_LAG)]
    else:
        # Immediately throw away all days before BEGIN_DAY
        # Doing this so early is important because pd.melt increases memory significantly
        value_vars = [col for col in sales_train_validation.columns 
                      if (col.startswith('d_') and (int(col.replace('d_', '')) >= int(BEGIN_DAY)))]
    
    print("Shape:", sales_train_validation.shape )
    print("Memory usage (Mb) before melting:", sales_train_validation.memory_usage().sum() / 1024**2)
    
    sales_train_validation = pd.melt(
        sales_train_validation, 
        id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
        value_vars=value_vars,
        var_name = 'day',
        value_name = 'sales')
    print("Completed melting, new shape:", sales_train_validation.shape )
    print("Columns after melting:", sales_train_validation.columns)
    print("Memory usage (Mb) after melting:", sales_train_validation.memory_usage().sum() / 1024**2)
    
    columns = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
    
    ####### CALENDAR
    
    print("Loading calendar")
    # Parse dates parses the dates as datetime objects! Pandas provides some nice functions on datetime objects.
    calendar = pd.read_csv(os.path.join(KAGGLE_DATA_FOLDER, 'calendar.csv'), dtype=CALENDAR_DTYPES, parse_dates=['date'])
    print("Calendar columns: ", calendar.columns)
    print("Memory usage (Mb) calendar: ", calendar.memory_usage().sum() / 1024**2)
    calendar.rename(columns={'d':'day'}, inplace=True)

    for col, col_dtype in CALENDAR_DTYPES.items():
        if col_dtype == "category":
            calendar[col] = calendar[col].cat.codes.astype("int16")
            calendar[col] -= calendar[col].min()
            
    # Merge sales_train_validation and calendar
    sales_train_validation = sales_train_validation.merge(calendar, on="day", copy=False)
    del calendar; gc.collect()
    print("Merged calendar (in place)")
    print("Memory usage (Mb) after merging calendar:", sales_train_validation.memory_usage().sum() / 1024**2)
    print("Columns after merge:", sales_train_validation.columns)
    
    ####### SELL PRICES
    
    sell_prices = pd.read_csv(os.path.join(KAGGLE_DATA_FOLDER, 'sell_prices.csv'), dtype=SALES_PRICES_DTYPES)
    
    # From https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes
    # TODO investigate normalization step
    for col, col_dtype in SALES_PRICES_DTYPES.items():
        if col_dtype == "category":
            sell_prices[col] = sell_prices[col].cat.codes.astype("int16")
            sell_prices[col] -= sell_prices[col].min()

    print("Memory usage (Mb) sell prices:", sell_prices.memory_usage().sum() / 1024**2)
    
    # Missing sell prices are set to 0 (assignment instead of an in-place fillna on a column selection)
    sell_prices['sell_price'] = sell_prices['sell_price'].fillna(0)
    
    # Merge in sell prices
    sales_train_validation = sales_train_validation.merge(sell_prices, on=["store_id","item_id","wm_yr_wk"], copy=False)
    del sell_prices; gc.collect()
    print("Merged sales prices (in place)")
    print("Memory usage (Mb) after merging sales:", sales_train_validation.memory_usage().sum() / 1024**2)
    print("Columns after merge:", sales_train_validation.columns)
     
    #submission = pd.read_csv(os.path.join(KAGGLE_DATA_FOLDER, 'sample_submission.csv'))  
    #return reduce_mem_usage(calendar), reduce_mem_usage(sell_prices), reduce_mem_usage(sales_train_validation)
    return sales_train_validation

## Feature engineering

Note that `make_date_features` converts 'day' from an object to an integer, so be aware of this side effect. I should probably move this to the `load_data` function.

See [this tutorial of autoregression](https://machinelearningmastery.com/autoregression-models-time-series-forecasting-python/)

Inspiration: from [here](https://www.kaggle.com/vgarshin/m5-catboost), and from [here](https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes). 

The first notebook creates the following features:

- lag_7           float32
- lag_28          float32
- rmean_7_7       float32
- rmean_28_7      float32
- rmean_7_28      float32
- rmean_28_28     float32
- week            int16
- quarter         int16
- mday            int16


### Q&A

Q by Blazej: Why are you calculating rolling means of lags instead of rolling means of the actual values?

A by Vlad-Marius Griguta: 

> Good question. The reason for using lagged values of the target variable is to reduce the effect of self-propagating errors through multiple predictions of the same model.
The objective is to predict 28 days in advance in each series. Therefore, to predict the 1st day in the series you can use the whole series of sales (up to lag1). However, to predict the 8th day you only have actual data for up to lag8 and to predict the whole series you have actuals up to lag28. What people have done at the beginning of the competition was to only use features computed from up to lag28 and apply regression (e.g. lightGBM). This is the safest option, as it does not require the use of 'predictions on predictions'. At the same time, it restrains the capacity of the model to learn features closer to the predicted values. I.e., it underperforms at predicting the 1st day, which could use much more of the latest values in the series than lag28. What this notebook is doing is to find a balance between 'predicting on predictions' and using the latest available information. Using features based on a lag that has some seasonal significance (lag7) seems to give positive results, while the fact that only two features (lag7 and rmean7_7) self-propagate errors keep the over-fitting problem under control.

- Adding month and date seems to lower performance
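
To make the "rolling mean of a lag" idea from the answer above concrete, here is a sketch on made-up data. It uses a lag of 2 and a window of 3 to keep the example small; the notebook itself uses 7 and 28.

```python
import pandas as pd

# Two toy series with six days each (assumption: illustration only)
toy = pd.DataFrame({
    "id": ["A"] * 6 + ["B"] * 6,
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
lag, window = 2, 3

# Lagged sales per series: the value from `lag` days earlier
toy["lag_2"] = toy.groupby("id")["sales"].shift(lag)

# Rolling mean over the lagged column, so it only sees values
# that are at least `lag` days old
toy["rmean_lag_2_3"] = toy.groupby("id")["lag_2"].transform(
    lambda x: x.rolling(window).mean()
)
print(toy)
```

The first `lag` rows of each series have no lag value, and the first `lag + window - 1` rows have no rolling mean, which is where the NaNs in the real features come from.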

In [None]:
# This function is copied from m5_catboost 
# minor change: I renamed 'd' to 'day'
# minor change: I pass dates as strings, not datetime, so I convert them
# The date_features contain pandas functions defined on datetimeIndex,
# e.g. https://www.geeksforgeeks.org/python-pandas-datetimeindex-weekofyear/
def make_lag_features(strain):    
    """
    N.B. If you adjust this function, make sure to adjust lag_features_for_day() below accordingly
    """
    
    # 1. Lagged sales
    print('in dataframe:', strain.shape)
    print("headers:", strain.columns)
    lags = [7, 28]
    lag_cols = ['lag_{}'.format(lag) for lag in lags ]
    for lag, lag_col in zip(lags, lag_cols):
        strain[lag_col] = strain[['id', 'sales']].groupby('id')['sales'].shift(lag)
    print('lag sales done')
    
    # 2. Rolling means
    windows= [7, 28]
    for window in windows:
        for lag, lag_col in zip(lags, lag_cols):
            window_col = f'rmean_{lag_col}_{window}'
            strain[window_col] = strain[['id', lag_col]].groupby('id')[lag_col].transform(
                lambda x: x.rolling(window).mean()
            )
        print(f'Rolling mean sales done for window {window}')
   
    # ATTENTION
    # Shifting creates NaNs because lags for the initial days cannot be computed.
    # They used to be filled with 0 (see the commented-out return); they are now left as NaN.
    #return strain.fillna(0)

    
def make_date_features(dt):
    # 3. New date features (values are corresponding pandas functions)
    date_features = {
        'week': 'weekofyear',
        'quarter': 'quarter',
        'mday': 'day',
        "wday": "weekday"
    }
    
    # Additional potential date features
    # "month": "month",
    # "year": "year",
    # "ime": "is_month_end",
    # "ims": "is_month_start",
    
    for date_feat_name, date_feat_func in date_features.items():
        if date_feat_name not in dt.columns:
            if date_feat_func == 'weekofyear':
                # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is its replacement
                values = dt['date'].dt.isocalendar().week
            else:
                values = getattr(dt['date'].dt, date_feat_func)
            dt[date_feat_name] = values.astype('int16')
    print('date features done')
    dt['day'] = dt['day'].apply(lambda x: int(x.replace('d_', '')))  
    print('out dataframe:', dt.shape)

# Prediction

### Load models

In [None]:
model_all = lgb.Booster(model_file = PATH_MODELS[0])
model_cat0 = lgb.Booster(model_file = PATH_MODELS[1])
model_cat1 = lgb.Booster(model_file = PATH_MODELS[2])
model_cat2 = lgb.Booster(model_file = PATH_MODELS[3])
model_per = [model_cat0, model_cat1, model_cat2]

### Load data and add future days

In [None]:
#%%time
df = load_data(train=False)

# Binary 'sale' feature: 1 when the last cent digit of the price is below 6
df["sale"] = ((df['sell_price'] * 100 % 10) < 6).astype('int8')
make_lag_features(df)
make_date_features(df)

In [None]:
df.info()

In [None]:
drop = ["id", "date", "sales", "day", "wm_yr_wk", "weekday"]#, 'lag_28', 'lag_7', 'lag_7_id_rmean_7', 'lag_7_item_id_rmean_28', 'event_name_2', 'event_type_1', 'lag_7_item_id_rmean_7', 'lag_28_id_rmean_28', 'lag_28_id_rmean_7', 'lag_28_item_id_rmean_28', 'event_name_1', 'event_type_2', 'lag_7_id_rmean_28', 'lag_28_item_id_rmean_7']
train_columns = df.columns[~df.columns.isin(drop)]
print(train_columns)

In [None]:
df.info()

### Prediction loop

Apply the "recursive features" approach here:

- Predict the next day based on last BACKWARD_LAG days
- Perform feature engineering on those days (same as during training)
- Repeat, but now include the day for which we just predicted demand

TODO: [use some weighing scheme](https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50#Changes)

Optimization: only compute the required features for a prediction day for an immense speedup.
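
The recursive approach, with features computed only for the prediction day, can be sketched on toy data as follows. The function `toy_model_predict` is a hypothetical stand-in for a trained model, and the lag of 1 and window of 2 are chosen only so that the second forecast visibly depends on the first.

```python
import numpy as np
import pandas as pd

def toy_model_predict(features):
    """Hypothetical stand-in for a trained model: mean of the two feature columns."""
    return features[["lag_1", "rmean_lag_1_2"]].mean(axis=1).to_numpy()

# Four days of known sales for two series, followed by two empty future days
days = np.arange(1, 7)
df = pd.DataFrame({"id": np.repeat(["A", "B"], len(days)), "day": np.tile(days, 2)})
df["sales"] = np.nan
df.loc[df["day"] <= 4, "sales"] = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

LAG, WINDOW = 1, 2
for day in (5, 6):
    target = df["day"] == day
    ids = df.loc[target, "id"]
    # Lag feature: sales LAG days before the prediction day
    lag_vals = df.loc[df["day"] == day - LAG].set_index("id")["sales"]
    # Rolling mean over the WINDOW days ending at day - LAG
    in_window = (df["day"] <= day - LAG) & (df["day"] > day - LAG - WINDOW)
    rmean_vals = df.loc[in_window].groupby("id")["sales"].mean()
    features = pd.DataFrame({
        "lag_1": lag_vals.reindex(ids).to_numpy(),
        "rmean_lag_1_2": rmean_vals.reindex(ids).to_numpy(),
    })
    # Write the prediction back so that the next day can use it as input
    df.loc[target, "sales"] = toy_model_predict(features)

print(df[df["day"] >= 5])
```

Only the rows of the prediction day receive features, which avoids recomputing shifts and rolling windows over the whole frame on every iteration.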

In [None]:
# TODO if I use this function, something goes spectacularly wrong
# Returns a numpy ndarray!

# Code to just compute features only for the single prediction day
# Adapted with several changes from https://www.kaggle.com/poedator/m5-under-0-50-optimized#Prediction-stage 
def lag_features_for_day(dt, day):
    print(type(dt))
    lags = [7, 28]
    lag_cols = [f"lag_{lag}" for lag in lags]
    for lag, lag_col in zip(lags, lag_cols):
        dt.loc[dt['date'] == str(day), lag_col] = \
            dt.loc[dt['date'] == str(day-timedelta(days=lag)), 'sales'].values
    
    windows = [7, 28]
    for window in windows:
        for lag, lag_col in zip(lags, lag_cols):
            df_window = dt[(dt['date'] <= str(day-timedelta(days=lag))) & (dt['date'] > str(day-timedelta(days=lag+window)))]
            df_window_grouped = df_window.groupby("id").agg({'sales':'mean'}).reindex(dt.loc[dt['date']==str(day),'id'])
            dt.loc[dt['date'] == str(day),f'rmean_{lag_col}_{window}'] = df_window_grouped.sales.values   
    print("Lag features done")
    print(type(dt))


In [None]:
%%time

END_DATE = EVAL_SPLIT
ZERO_THRESHOLD = 0.01
PREDICT_DAYS = 28

# TODO check whether filling with 0 above works correctly; i.e. whether fillna is still needed
#df['sales']=df['sales'].fillna(0)

# Predict from 2016-04-25 on
for f_day in tqdm(range(1,PREDICT_DAYS+1)):
    pred_date = (datetime.strptime(END_DATE, '%Y-%m-%d') + timedelta(days=f_day)).date()
    print(f"Forecasting day {END_DAY+f_day}, date: {str(pred_date)}")
    pred_begin_date = pred_date - timedelta(days=BACKWARD_LAG+1)
    print(pred_begin_date)
    # Select last BACKWARD_LAG days to use for predicting
    
    prediction_data = df[(df['date'] >= str(pred_begin_date)) & (df['date'] <= str(pred_date))].copy()
    
    # Repeat feature engineering
    # Following line does feature engineering on the whole bunch, but in fact we only need features for pred_date
    #make_lag_features(prediction_data)
    
    lag_features_for_day(prediction_data, pred_date)
    
    # Only use the columns you trained on before
    prediction_data = prediction_data.loc[prediction_data['date'] == str(pred_date), train_columns]

    prediction_all_cats = model_all.predict(prediction_data)   
    print("Single model done")

    # Construct a dataframe to save the per category predictions, such that they have the same shape as prediction_all_cats
    prediction_per_cat = pd.DataFrame(data=None, index=prediction_data.index, columns=["sales"])
    #print(prediction_per_cat)
    for category in range(3):
        prediction_data_cat = prediction_data.loc[prediction_data["cat_id"] == category, train_columns].drop("cat_id", axis = 1)
        
        prediction = model_per[category].predict(prediction_data_cat)
        
        prediction_per_cat.loc[prediction_per_cat.index.isin(prediction_data_cat.index), "sales"] = prediction
        
    prediction_per_cat = prediction_per_cat['sales'].to_numpy(dtype='float64')
    
    
    # Get weighted prediction
    
    prediction = np.average([prediction_all_cats, prediction_per_cat], axis = 0, weights=[1./3, 2./3])
    
    print("Prediction", prediction.size, prediction)
    df.loc[df['date'] == str(pred_date), 'sales'] = prediction
    
# If predictions are very close to zero, predict a gap day 
#df.loc[df['sales'] < ZERO_THRESHOLD, 'sales'] = 0  

In [None]:
del prediction_data
gc.collect()

# Submission

Now let's turn the prediction into a submission file. 
We'll wrangle the long dataframe with the predictions into the correct format.

cf. https://medium.com/@durgaswaroop/reshaping-pandas-dataframes-melt-and-unmelt-9f57518c7738 

- Currently we only predict 30490 rows corresponding to the validation set. Later in the competition, another 30490 rows corresponding to the evaluation set will be added. 
- For now, to get the correct submission format, we simply copy the predictions of the first 30490 rows. 


In [None]:
# We are only interested in the predicted days
# We need the id for the row index, the day to calculate F_{x}, and the sales for the prediction values
submission_val = df.loc[df['date'] > END_DATE, ['id', 'day', 'sales']].copy()

# Memory clean-up
#del df; gc.collect()

# Do not make negative predictions
submission_val.loc[submission_val['sales'] < 0, 'sales'] = 0

# Sort on id 
submission_val.sort_values('id', inplace=True)

#submission_val['day'] = submission_val['day'].apply(lambda x: 'F{}'.format(int(x.replace('d_', '')) - END_DAY))
# Use code below if you have used make_date_features on df, because it replaces day with an int already.
submission_val['day'] = submission_val['day'].apply(lambda x: 'F{}'.format(x - END_DAY))
print(submission_val.columns)

Now we have a single 'day' column. Instead, we want to have a separate column for each day.
The reverse of the melt operation is `pivot`.
An extra 'sales' descriptor is introduced that we remove again.
'id' will serve as the index, but we want to reintroduce it as a column for submission with `reset_index()`
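
A small sketch of this reshaping on made-up predictions (three forecast days instead of 28; ids and values are assumptions for illustration). Selecting the columns through an explicit list fixes their order, since pivoting alone does not guarantee F2 comes before F10.

```python
import pandas as pd

# Long format: one row per (id, day) pair
long = pd.DataFrame({
    "id": ["B_validation", "A_validation"] * 3,
    "day": ["F1", "F1", "F2", "F2", "F10", "F10"],
    "sales": [0.5, 1.0, 0.0, 2.0, 4.0, 3.0],
})

f_cols = ["F1", "F2", "F10"]  # desired column order

# pivot gives ('sales', <day>) columns; select 'sales', reorder, restore 'id'
wide = long.pivot(index="id", columns="day")["sales"][f_cols].reset_index()
print(wide.columns.tolist())
print(wide)
```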


In [None]:
# This is required to force the correct ordering after reshaping
f_cols = ['F{}'.format(x) for x in range(1, 28 + 1)]

submission_val = submission_val.pivot(index='id', columns='day')['sales'][f_cols].reset_index(level='id')
print(submission_val.columns)

In [None]:
# Temporary solution, copy the 28 validation days as the 28 evaluation days
submission_eval = submission_val.copy()

submission_eval['id'] = submission_eval['id'].str.replace('validation', 'evaluation')
submission = pd.concat([submission_val, submission_eval], axis=0, sort=False)
#spred_subm.reset_index(drop=True, inplace=True)
print(submission.columns)
print(submission.head(1))

In [None]:
submission.to_csv('submission.csv', index=False)
print('Submission shape', submission.shape)