
# -1. Still a draft
-  requires periodic code cleanup/rewrite/streamlining to keep it from crumbling into "technical debt"

- __note: this is currently the most updated version__

- need to join "train & test" so that both end up with the same set of features,
    - indicate the origin via a column "is_test" to still be able to segregate them again.
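A minimal sketch of why the join helps, on hypothetical toy frames (`train_toy`, `test_toy` and the `event` column are illustrative assumptions; only the `is_test` flag comes from the note above). One-hot encoding the joined frame gives both parts the same feature columns, and the flag allows splitting them again:

```python
import pandas as pd

# Hypothetical toy data: "Carnaval" only occurs in the train part.
train_toy = pd.DataFrame({"event": ["Navidad", "Carnaval"], "sales": [10.0, 12.0]})
test_toy = pd.DataFrame({"event": ["Navidad"]})

# Join both parts and remember where each row came from.
both = pd.concat(
    [train_toy.assign(is_test=False), test_toy.assign(is_test=True)],
    ignore_index=True,
)

# Encoding after the join yields identical dummy columns for both parts.
both = pd.get_dummies(both, columns=["event"])

# Segregate back using the flag.
train_back = both.loc[~both["is_test"]].drop(columns="is_test")
test_back = both.loc[both["is_test"]].drop(columns=["is_test", "sales"])
```

Encoding `test_toy` on its own would not produce an `event_Carnaval` column, which is exactly the mismatch the join avoids.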

# 0. General Workflow <a class="anchor" id="Ch0"></a>:
1. [__Problem def__](#Ch1): 
    - project objective: prediction constraints, selection of relevant models
    - additional objective: explain how/why those results/methods as much as possible 
        - (to convince other DS-experts, as well as experts in the field of the project's topic)
<br>
    - [1.1](#Ch1.1) goal
    - [1.2](#Ch1.2) global strategy
    - [1.3](#Ch1.3) milestones
    
2. [__Data acquisition__](#Ch2):
    - [2.1](#Ch2.a) location
    - [2.2](#Ch2.b) retrieval
    - [2.3](#Ch2.c) variables definitions
    - [2.4](#Ch2.d) early __feature proposals__ from (human) prior understanding
<br>   
3. [__Data pre-processing__](#Ch3):

    - [3.1 train & test](#Ch3.1)  :: [(a) priority](#Ch3.1a) [(b) processing](#Ch3.1b)
    - [3.2 stores & events](#Ch3.2) :: [(a) stores](#Ch3.2) [(b) events](#Ch3.2b) [(c) sp_events](#Ch3.2c)
    - [3.3 paydays](#Ch3.3)
    - [3.4 oil](#Ch3.4)
    - __[3.5 processed storage](#Ch3.5)__

4. [__Model ensembles__](#Ch4):
     
    - [(a)](#Ch4.a) model ensembles construction
    - [4.1 construction](#Ch4.1)
    - [4.2 eval](#Ch4.2)
    - __[4.3 one model pipe](#Ch4.3)__
    - [4.4 global](#Ch4.4)
    
    - [(b)](#Ch4.b) for all models: gain function maximization + estimated error/deviation of the "gain"
    - [(c)](#Ch4.c) for all models: feedback/interpretations for each model (if possible)
    - [(d)](#Ch4.d) unique (top 3?) model selection
<br>
6. [__Prediction and scenarios__](#Ch6):
    - [(a)](#Ch6.a) multiple scenarios: predictions from selected model(s)
    - [(b)](#Ch6.b) key points/insights, applicable to each scenario
<br>
7. [__Bonus: Iterations for the future__](#Ch7):
    - compare how the model(s) are holding up against real data obtained after the project's prediction
    - sources of unexplained variance ? suggest new sources of features for future tasks/models
    - repeat (1.) with both more of the same kind of data and new kinds of data
    
    
[to end](#End)

# 1. Problem def <a class="anchor" id="Ch1"></a>: 

[goto (0.) General Workflow](#Ch0)

## 1.1 Goal of the project [(to kaggle)](https://www.kaggle.com/c/store-sales-time-series-forecasting/overview): <a class="anchor" id="Ch1.1"></a>

1. Primary/Major objective: Predict future sales $Y[s,p](t_{+})$ for each $(s,p) \in \{\text{store}, \text{product-class}\}$, based on previous sales $Y[s,p](t_{-})$ of those same $(s,p)$
    - hence all "time series" machine learning [methods](https://en.wikipedia.org/wiki/Time_series) are (potentially) applicable
    - from the context (= sales, economics) mostly methods from [here](https://link.springer.com/book/10.1007/978-1-4419-0320-4) or equivalent --i.e. stationary processes-- will be tested
<br><br>
2. Secondary objective: What ML-algo provides the 'best' prediction ? 
    - for the [given metric](https://www.kaggle.com/c/store-sales-time-series-forecasting/overview/evaluation): "Root Mean Squared Logarithmic Error"
        - $M(\hat{y})=\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$
    <br><br>
    - and its (estimated) associated deviations on this metric (time-series cross validation)
        
    - (bonus) theoretical (?) deviations on this metric ([error propagation](https://en.wikipedia.org/wiki/Propagation_of_uncertainty)), to first order and with $(\Delta \log)_i = \log (1 + \hat{y}_i) - \log (1 + y_i)$:
        - $\sigma^2(M(\hat{y})) \approx \frac{1}{n^2 M(\hat{y})^2} \left[ \sum_{i=1}^n \frac{(\Delta \log)^2_i}{(1+\hat{y}_i)^2} \sigma^2(\hat{y}_i) + 2 \sum_{i=1}^n \sum_{j=i+1}^n \frac{(\Delta \log)_i (\Delta \log)_j}{(1+\hat{y}_i)(1+\hat{y}_j)} \sigma(\hat{y}_i)\sigma(\hat{y}_j)\rho_{ij}\right]$
<br><br> 
3. Secondary objective: What are the different insights/interpretation to be gathered from those ML-algo ?
<br><br>
4. Minor/major objective: Additional questions to investigate:   
      (Major --> probably what would interest the big boss and the marketing department)  
      (Minor --> here to train myself in ML techniques and in the work of a DS-expert, covering most aspects of a forecast)
    - Can we, from this data alone, infer the consequences of opening a new store ? <br><br>
    - What are the most stable products sales-wise?  I.e. products whose sales do not vary through seasons/years and which are therefore references/markers of (secure) future sales. 
    - Can those products be related to the concepts of "Maslow's pyramid of needs" ? <br><br>
    - Conversely, what are the most volatile products ?
    - What are the characteristics of those volatile products (store-related, time-related, location-related, no relation whatsoever,...) ? 
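The metric $M(\hat{y})$ of the secondary objective can be sketched in a few lines of NumPy. The function name `rmsle` and the explicit check for negative values are assumptions made for this illustration, not part of the competition code:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, as defined in objective 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log(1 + x) is only meaningful here for non-negative sales
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE expects non-negative values")
    delta_log = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(delta_log ** 2)))

score_perfect = rmsle([0.0, 1.0, 5.0], [0.0, 1.0, 5.0])
score_off = rmsle([0.0], [np.e - 1.0])
```

Since the metric is an RMSE on $\log(1+y)$, a model trained with a plain squared-error loss on `log1p(sales)` optimizes it directly, which is the reason for the `to_log_sales`/`from_log_sales` helpers further down.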


## 1.2 Global Strategy: induction: <a class="anchor" id="Ch1.2"></a>
1. Divide and Conquer: segregate the full data $D$ into $D[s,p]$<br>
2. Simple at first: start with (almost) arbitrary pair $(s^* , p^*)$ and proceed with/ try (full) ML-methods on that pair<br>
3. From one to many: then construct the desired pipeline applicable to all pairs (= 2 for loops)<br>
4. cleanup code
5. (Bonus?) Interactions: correlations/mutual info between pairs $(s_i,p_j)$ and $(s_k,p_l)$
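Steps 1 to 3 can be sketched as follows, assuming a frame indexed by (`store_nbr`, `family`, `date`) like `train`. The helper name `apply_per_pair`, the callback `fit_one` and the toy data are hypothetical; a `groupby` over the two index levels stands in for the two for loops:

```python
import pandas as pd

def apply_per_pair(df, fit_one):
    """Apply fit_one to the sub-frame D[s, p] of every (store, family) pair."""
    results = {}
    for (s, p), d_sp in df.groupby(level=["store_nbr", "family"], observed=True):
        # drop the two fixed levels so that fit_one sees a plain date index
        results[(s, p)] = fit_one(d_sp.droplevel(["store_nbr", "family"]))
    return results

# Toy data: 2 stores x 2 families x 3 days.
idx = pd.MultiIndex.from_product(
    [["1", "2"], ["BREAD", "DAIRY"], pd.date_range("2017-01-01", periods=3)],
    names=["store_nbr", "family", "date"],
)
toy = pd.DataFrame({"sales": range(12)}, index=idx)

# Stand-in for a real model: the mean sales of each pair.
per_pair = apply_per_pair(toy, lambda d: d["sales"].mean())
```

Developing `fit_one` on a single pair $(s^*, p^*)$ first and only then passing it to the loop mirrors the "simple at first, from one to many" order.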
 

## 1.3  Milestones: <a class="anchor" id="Ch1.3"></a>
___Done___<br>
A. Start small/compact: aim for minimum intervention to get first (global, but opaque) results fast:

    1. the only goal is the primary objective: merely to provide a submission file
    2. only main train/test as source
    3. less exploratory pre-processing:
                - day/week/month as feature
                - CalendarFourier
    4. models considered: 
                - Linear Regression/ridge
                - Random Forests
                - xgboost
                - HistGradientBoost
    5. business as usual scenario
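To make the "day/week/month + Fourier" features of milestone A concrete, here is a hand-rolled sketch of weekly Fourier terms. It only approximates the idea behind `CalendarFourier` and is not the statsmodels implementation; the function name `weekly_fourier` and the column names are assumptions:

```python
import numpy as np
import pandas as pd

def weekly_fourier(index, order=2):
    """Sine/cosine pairs with a period of one week for a DatetimeIndex."""
    # position inside the week, in [0, 1)
    frac = index.dayofweek.to_numpy() / 7.0
    cols = {}
    for k in range(1, order + 1):
        cols[f"sin_w{k}"] = np.sin(2 * np.pi * k * frac)
        cols[f"cos_w{k}"] = np.cos(2 * np.pi * k * frac)
    return pd.DataFrame(cols, index=index)

# 2017-01-02 is a Monday, so the two weeks below repeat exactly.
dates = pd.date_range("2017-01-02", periods=14)
feats = weekly_fourier(dates, order=2)
```

Unlike the integer `day` column, these features are continuous across the Sunday/Monday boundary, which tends to suit linear models better.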
               
<br>___Currently___<br>

B. Dissect afterwards: explore more complex aspects (more interventions, more time consuming):

    2. include complementary files 
    3. complementary preprocessing:
                - trend segregation
                - event-response features
                


note: general workflow will be applied multiple times
- for the initial/fast/opaque
- for the specific (fixed) pair $(s^* , p^*)$
- for each pair $(s,p)$, as a systematized (pipeline) version of the specific pair $(s^* , p^*)$
- (bonus) for higher order interactions (network)
    - for same store $(s^*)$ and different products $\{(p)\}$ : hence pairs $\{(s^*,p)\}$
    - for same product $(p^*)$ and different stores $\{(s)\}$ : hence pairs $\{(s,p^*)\}$
    - more ?

# 2. Data acquisition <a class="anchor" id="Ch2"></a>: 

[goto (0.) General Workflow](#Ch0)

## (2.1) & (2.2) Location <a class="anchor" id="Ch2.a"></a> and retrieval <a class="anchor" id="Ch2.b"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#import matplotlib as plt
import matplotlib.pyplot as plt
import seaborn as sns

# additional dependencies
#from pandas.tseries.offsets import MonthEnd
import re

from pathlib import Path

#work_path = Path('../input/store-sales-time-series-forecasting')

# specific path to locate different data files
comp_dir = Path('../input/store-sales-time-series-forecasting')


### configs:
DO_REGENERATE_DATA=True

work_path = Path('../working')

submission = pd.DataFrame(columns=["id","sales"])
submission.to_csv('submission.csv',mode='a',  index=False, header=False)

In [None]:
if DO_REGENERATE_DATA:


    # major features
    train = pd.read_csv(
        comp_dir / 'train.csv',
        usecols=['id','store_nbr', 'family', 'date', 'sales', 'onpromotion'],
        dtype={
            'id':'uint32',
            'store_nbr': 'category',
            'family': 'category',
            'sales': 'float32',
            'onpromotion': 'uint32',
        },
        parse_dates=['date'],
        infer_datetime_format=True
    )
    #train['date'] = train.date.dt.to_period('D')
    train = train.set_index(['store_nbr', 'family', 'date']).sort_index()

    
if DO_REGENERATE_DATA:


    #test data
    test = pd.read_csv(
        comp_dir / 'test.csv',
        dtype={
            'store_nbr': 'category',
            'family': 'category',
            'onpromotion': 'uint32',
        },
        parse_dates=['date'],
        infer_datetime_format=True,
    )
    #test['date'] = test.date.dt.to_period('D')
    test = test.set_index(['store_nbr', 'family', 'date']).sort_index()
    test.head()

    # oil prices 
    oil = pd.read_csv(
        comp_dir / 'oil.csv',
        parse_dates=['date'],
        infer_datetime_format=True,)
    #oil['date'] = oil.date.dt.to_period('D')
    # stores categories/types & locations
    stores = pd.read_csv(comp_dir / 'stores.csv')

    # holidays & events
    events = pd.read_csv(comp_dir / 'holidays_events.csv',
        parse_dates=['date'],
        infer_datetime_format=True,)
    #events['date'] = events.date.dt.to_period('D')
    ## to do :
    ### construct a pandas DataFrame which incorporates the suggestions of the (additional) notes : 
    ### -> see "(2.3) Variables definitions" markdown/link


In [None]:
submission = pd.DataFrame(columns=["id","sales"])
submission.to_csv('submission.csv',mode='w',  index=False, header=False)

## 2.3 Variables definitions <a class="anchor" id="Ch2.c"></a>

[meanings of data](https://www.kaggle.com/c/store-sales-time-series-forecasting/data?select=train.csv)

## 2.4 Features proposals <a class="anchor" id="Ch2.d"></a>

[goto (0.) General Workflow](#Ch0)

(see current objective in [milestones](#Ch1.3) )

___If we focus on a specific pair $(s^*,p^*)$:___<br>

0. ___This means that we hypothesise that___
    - the direct influence between stores $\{(s)\}$ is of second or third order (less impactful than the rest of the features/interactions) 
    - the correlations in sales/promotions between different product-types $\{(p)\}$ are also neglected
    - which one of the above interactions (network) is the most impactful (correlated) one will be investigated later (bonus)
<br><br>
1. ___'stores'-data:___
    we will (probably) only need the pair ('city','state') location in 'stores'-data, because:
   - 'city' : there are city dependent holidays/events possible
   - 'state' : idem
   - 'type' : (expected/hypothesised to be only) relevant when comparing different pairs $\{(s,p)\}$, 
        hence this feature will clutter the ML-algo's processing and memory until we analyse/investigate interactions of all pairs $\{(s,p)\}$ (=another pipeline)
   - 'cluster': idem <br><br>

2. ___'events'-data:___
   - is an __interactions feature__ between 'date' and ('city' and/or 'state') and therefore 'store_nbr', certainly worth keeping/including this aspect <br><br>
   - rows marked as transferred (val = True in col = 'transferred') will be dropped (= a duplicate, however with caveats that are expected/hypothesised to be of negligible importance in magnitude and frequency) <br><br>
   - investigate carefully if the effects of holiday/events have a similar effect year on year (= feature importance with meaningful interpretation) <br><br>
   - some events have (probably/hypothetically) more and different impacts on sales (store-wise,and product-wise) than others: 
       - we will avoid the (total) anonymization of the holidays/events
       - we will construct a dictionary of (yearly) repeated events/holidays, non repeating events/holidays will be anonymized (= more volatile/ less reliable) <br><br>
   - given that we will investigate "stationary process" models, there could be "anticipation" and/or "after(burn)" prior and/or posterior respectively to specific events/holidays: those can be modelled in a time lag manner:
       - anticipation: $\sim$ min amount of days/weeks before the next event/holiday happens
       - afterburn : $\sim$ min amount of days/weeks after the previous event/holiday happened <br><br>
   - because in holiday week we are more flexible to go shopping than in a normal week or on events (?):
       - segregate the previously mentioned "time lag"-like features for events-only and holiday-only<br><br>
3. ___'oil'-data:___ 
    - seems to be generally applicable, however stores with gasoline/petrol pumps/distribution may be affected to a greater degree:
        - however only way to evaluate such effect is to compare the sales of different stores being part of the same 'cluster'&'type' and those which differ in 'cluster'&'type'
    - sales may also be affected by the high difference in pricing ($\sim$ time lag), but over which time span ?
        - multiple (finite-)difference price of oil  over different time span as feature
        
4. ___'train'-data:___
    - lagged sales (=sales from before) as features: motivated if stationary process
    - onpromotion: no changes 
        - lagging: promotion does not linger
        - perceived as random: those promotions are (likely) not announced & anticipated prior to the day of promotion, hence --from the buyer's perspective-- at random ?
    - expanded dates into day/week/month + their respective trigonometric version.
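The lagged-sales and oil-price-difference proposals of items 3 and 4 can be sketched with `shift` and `diff`. The helper names `add_lags` and `add_differences`, the column name `price` and the toy values are assumptions for this illustration:

```python
import pandas as pd

def add_lags(df, column, lags):
    """Return a copy of df with `column` shifted by each lag (counted in rows)."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

def add_differences(df, column, spans):
    """Return a copy of df with finite differences of `column` over each span."""
    out = df.copy()
    for span in spans:
        out[f"{column}_diff{span}"] = out[column].diff(span)
    return out

# Toy daily series standing in for the oil price.
toy = pd.DataFrame(
    {"price": [50.0, 51.0, 53.0, 56.0, 60.0, 65.0]},
    index=pd.date_range("2017-01-01", periods=6, name="date"),
)
feats = add_differences(add_lags(toy, "price", lags=[1]), "price", spans=[1, 3])
```

Both operations count rows, not calendar days, so the index should first be extended to the full daily range (as `check_integrity` does) if a lag of 7 is meant to be one week.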
            

# 3. Data pre-processing <a class="anchor" id="Ch3"></a> <a class="anchor" id="Ch3.a"></a> <a class="anchor" id="Ch3.b"></a>: 

[goto (0.) General Workflow](#Ch0)

## 3.0) Setup of general functions

In [None]:
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess


plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)


In [None]:
def make_plot3(df1,x_axis,y_axis,
               df2=pd.DataFrame(),
               df3 =pd.DataFrame(),  title=False, line=1):
   
    # "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6
    style = "seaborn-v0_8-whitegrid"
    if style not in plt.style.available:
        style = "seaborn-whitegrid"
    plt.style.use(style)
    plt.rc(
        "figure",
        autolayout=True,
        figsize=(11, 4),
        titlesize=18,
        titleweight='bold',
    )
    plt.rc(
        "axes",
        labelweight="bold",
        labelsize="large",
        titleweight="bold",
        titlesize=16,
        titlepad=10,
    )
    %config InlineBackend.figure_format = 'retina'

    fig, ax = plt.subplots()
    # draw all frames on the same axes
    df1.plot(ax=ax, **plot_params)
    if (not df2.empty):
        df2.plot(ax=ax, linewidth=line)
    if (not df3.empty):
        df3.plot(ax=ax, linewidth=line, color='red')
    
    ax.set_ylabel(y_axis)
    ax.set_xlabel(x_axis)
    if title:
        plt.title(title)
    plt.show()
    
def make_plot3_rol(df1,x_axis,y_axis,
               df2=pd.DataFrame(),
               df3 =pd.DataFrame(),  title=False, line=1):
    # same plot as make_plot3, on centered 21-day rolling means
    def roll(df):
        return df.rolling(window=21, center=True, min_periods=11).mean()

    # forward title and line instead of overriding them with the defaults
    return make_plot3(roll(df1), x_axis, y_axis,
                      roll(df2), roll(df3),
                      title=title, line=line)

In [None]:
def check_integrity(data):
### ---------------------------------
    ##-# general data check
    missing = data.isna().sum()
    
    print(f'total (initial) missing values {missing.sum()}')
    print('*'*40)
    
    ##-# check dtypes
    types_df = pd.DataFrame(data.dtypes, columns=['type'])
    u_types=types_df['type'].unique()
    print(f'unique_types in data = {u_types}')
    print('*'*40)
    
    ##-# setup to check columns
    cols_u = []
    for ku in np.arange(len(u_types)):
        tupl = types_df.loc[types_df['type']==u_types[ku]].index.to_numpy()
        cols_u.append(tupl)
    print(f'cols_u = {cols_u}')
    print('*'*40)
### ---------------------------------    
    print('*'*40)
    ##-# check index: check index if missing wrt range
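    if isinstance(data.index, pd.MultiIndex):
        print("MultiIndex data is currently not supported")
        # skip the range check; without this, idx_full_range is undefined below
        idx_full_range = data.index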
    elif isinstance(data.index, pd.DatetimeIndex):  
        itype= "DatetimeIndex"
        print( f"index is: {itype}")
        
        idx_start=data.index[0]
        idx_end = data.index[-1]

        idx_full_range=pd.date_range(start=idx_start,end=idx_end)
    elif isinstance(data.index, pd.RangeIndex):
        itype= "RangeIndex"
        print( f"index is:{itype}")
        idx_start=data.index[0]
        idx_end = data.index[-1]
        
        idx_full_range=np.arange(start=idx_start,stop=idx_end+1,dtype=type(idx_start))
    else:
        itype= type(data.index[0])
        print( f"index is:{itype}")
        print("passing index range integrity check")
        idx_full_range = data.index
    ### ---    
    _n_idx = -len(data.index)+len(idx_full_range)
    if (_n_idx!=0):
        print(f"index not at full range! increasing index range by (n={_n_idx})")
        
        DF = pd.DataFrame(index=idx_full_range)
        DF = DF.join(data)
    else:
        print(f"index seems to be at full range")
        DF = data.copy()
    print('*'*40)    
### ---------------------------------    

    ##-# check NaN, per cols/col_type
    for ku in np.arange(len(u_types)):
        print('*'*40)
        print(f'checking type = {u_types[ku]}')
        print('-'*30)
        print('-'*30)
        for kcu in np.arange(len(cols_u[ku])):
            missing_cu = DF[cols_u[ku][kcu]].copy().isna().sum()
            if (missing_cu != 0):
                print(f'missing (n={missing_cu}) values in (col={cols_u[ku][kcu]})')
            else:
                print(f'all ok in (col={cols_u[ku][kcu]})')
    
    ##--# check number of distinct values in obj_cols
            if (u_types[ku]=="object"):
                # nunique() counts the distinct values, not the repeated ones
                nu = DF[cols_u[ku][kcu]].nunique()
                print( f'distinct elements (n={nu})')
                print('-'*30)
    
    #check inconsistent obj_cols entries: need to be tailored to data: not here 
    return types_df, u_types,cols_u, DF

In [None]:
def make_lag(DF, column,lag_lst=(1,),interpol=True):
    # adds the lagged columns to DF in place; nothing is returned
    for lag in lag_lst:
        lagged = DF[column].shift(lag)
        if interpol:
            # fill the gaps created by the shift
            lagged = lagged.interpolate(limit_direction='both')
        DF[column+"_"+str(lag)] = lagged

## 3.1) Train and Test <a class="anchor" id="Ch3.1"></a>
  [goto (0.) General Workflow](#Ch0)  

In [None]:
print('Missing values in train:', train.isna().sum().sum())
print('Missing values in test:', test.isna().sum().sum())

train.head()


In [None]:
lst_s = train.index.get_level_values(0).unique().to_numpy()
lst_p = train.index.get_level_values(1).unique().to_numpy()
lst_t = train.index.get_level_values(2).unique().to_numpy()

print(lst_s, len(lst_s))
print("*"*40)
print(lst_p, len(lst_p))
print("*"*40)
print(lst_t, len(lst_t))

### 3.1a) prioritize (s,p) pairs to be generated <a class="anchor" id="Ch3.1a"></a>
[goto (0.) General Workflow](#Ch0)  

- relevant for [3.5 generate](#Ch3.5)
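As a compact way to obtain such a priority order, the positions of the stores ranked by total sales can be looked up with `Index.get_indexer`. The toy frame and the variable names below are illustrative assumptions:

```python
import pandas as pd

# Toy data: total sales are 3 for store "1", 30 for store "2", 7 for store "3".
toy_idx = pd.MultiIndex.from_product(
    [["1", "2", "3"], ["BREAD", "DAIRY"]], names=["store_nbr", "family"]
)
toy = pd.DataFrame({"sales": [1.0, 2.0, 10.0, 20.0, 3.0, 4.0]}, index=toy_idx)

totals = toy.groupby(level="store_nbr")["sales"].sum()
ref = totals.index                                   # stores in reference order
ranked = totals.sort_values(ascending=False).index   # stores by decreasing sales
order = ref.get_indexer(ranked)                      # positions of ranked stores in ref
```

Here `ref[order[0]]` is the store with the highest total sales, so iterating over `order` processes the most important pairs first.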

In [None]:
train_s_pt = train.groupby(by='store_nbr').sum()
ref_s = train_s_pt.index
imp_s = train_s_pt.sort_values(by="sales",ascending=False).index

In [None]:
print(imp_s[0],ref_s[1])


In [None]:
order_s= []
for s_sort in imp_s:
    for ks in np.arange(len(ref_s)):
        if ref_s[ks] == s_sort:
            order_s.append(ks)
            break

In [None]:
ref_s[order_s[0]]

In [None]:
train_p_st = train.groupby(by='family').sum()
ref_p = train_p_st.index
imp_p = train_p_st.sort_values(by="sales",ascending=False).index
order_p= []

for p_sort in imp_p:
    for kp in np.arange(len(ref_p)):
        if ref_p[kp] == p_sort:
            order_p.append(kp)
            break
ref_p[order_p[0]]

### 3.1b) (s,p) grabbing and processing <a class="anchor" id="Ch3.1b"></a>
[goto (0.) General Workflow](#Ch0)  

In [None]:
def to_log_sales(y):
    z = y.copy()
    z["sales"] = z["sales"].apply(np.log1p)
    return z

def from_log_sales(y):
    z = y.copy()
    z["sales"] = z["sales"].apply(np.expm1)
    return z
    

In [None]:
def select_sp(DF,s,p):
    return DF.xs(s,level="store_nbr").xs(p,level='family')

def join_sp_train_test(train_sp,test_sp):
    tr = train_sp.copy()
    tr["is_test"]=np.full(len(tr),False)
    te = test_sp.copy()
    te["is_test"]=np.full(len(te),True)
    te["sales"] = np.full(len(te),np.nan) 
    sp = pd.concat([tr,te])
    return sp 

def DataProcess_to_Xy(sp,is_test=False):
    
    #X_pre = sp.copy().dropna()   
    X_pre = sp.copy()
    if not is_test:
        y_plt = X_pre.pop("sales")
        y = pd.DataFrame(y_plt,columns=["sales"]).join(X_pre[["id"]])
        y = to_log_sales(y)
    ### ---
    else:
        y = X_pre[["id"]]
        
    fourier = CalendarFourier(freq='W', order=4)
    fourier_2 = CalendarFourier(freq='M', order=4)
    dp = DeterministicProcess(index=y.index,
                              constant=False,
                              order=1,
                              seasonal=False,
                              additional_terms=[fourier,fourier_2],
                              drop=True)

    X = dp.in_sample()
    X = X.join(X_pre,how="outer")
    
    # explicit "day","week" and "month"
    X["day"] = X.index.dayofweek
    X["week"] = X.index.isocalendar().week
    X["month"]= X.index.month
    
    #y, X = y.align(X, join='inner')
    
    
    
    return X,y

In [None]:
tdf, u_t , u_cols, DF= check_integrity(select_sp(DF=train,s= lst_s[0],p =lst_p[7]))

In [None]:
def sp_pipe(DF, is_test=False):
    sp = DF.copy()
    sp["id"] = sp["id"].fillna(-1)
    sp["onpromotion"] = sp["onpromotion"].fillna(0)
    
    ### keeping track of interpolated instances (=less reliable value)
    if not is_test:
        #sp["sales_na"] = sp["sales"].isna() 
        sp["sales"]= sp["sales"].interpolate(limit_direction='both')
    
    return sp

In [None]:
ks0 = 2
kp0 = 13
sp0 = select_sp(DF=train,s=lst_s[ks0],p=lst_p[kp0])
sp0 = sp_pipe(DF=sp0)
sp0.head()

In [None]:
X_pre = sp0.copy()
y_plt = X_pre.pop("sales")
y = pd.DataFrame(y_plt,columns=["sales"]).join(X_pre[["id"]])
fourier = CalendarFourier(freq='W', order=4)
fourier_2 = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(index=y.index,
                              constant=False,
                              order=1,
                              seasonal=False,
                              additional_terms=[fourier,fourier_2],
                              drop=True)
X = dp.in_sample()
X = X.join(X_pre,how="outer")
X["day"] = X.index.dayofweek
X["week"] = X.index.isocalendar().week
X["month"]= X.index.month
    
X.head()

In [None]:
print(len(y),len(X))

## 3.2 "stores" and "events" data:<a class="anchor" id="Ch3.2"></a>
[goto (0.) General Workflow](#Ch0)
### 3.2a "stores"-data:
    this first, because a dictionary built from it is needed for the "events"-data
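A minimal sketch of such a lookup dictionary, on a hypothetical two-row stand-in for the 'stores'-data (`stores_toy` and `loc_dict_toy` are illustrative names). Each store number, as a string, maps to its (state, city) pair:

```python
import pandas as pd

# Toy stand-in with the columns used from the 'stores'-data.
stores_toy = pd.DataFrame(
    {
        "store_nbr": [1, 2],
        "city": ["Quito", "Guayaquil"],
        "state": ["Pichincha", "Guayas"],
    }
)

# Keys are strings because 'store_nbr' is read as a category of strings in train/test.
loc_dict_toy = {
    str(nbr): (state, city)
    for nbr, state, city in zip(
        stores_toy["store_nbr"], stores_toy["state"], stores_toy["city"]
    )
}
```

With this mapping, the events of a store can be filtered by comparing the `locale_name` of each event with the state and the city of the store.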

In [None]:
stores.tail(3)

In [None]:
tdf, u_t , u_cols, DF= check_integrity(stores)

In [None]:
if DO_REGENERATE_DATA:
    N_stores= stores.shape
    print(N_stores)
    store_ids = stores["store_nbr"].to_numpy()
    stores.loc[stores["store_nbr"] == 1].head()

if DO_REGENERATE_DATA:
    locale = []
    for k in store_ids:
        locale.append(stores[["state","city"]].loc[stores["store_nbr"]==k].to_numpy()[0])

    loc_dict = {str(store_ids[ks]):locale[ks] for ks in np.arange(len(store_ids))}

### 3.2b) "events"-data <a class="anchor" id="Ch3.2b"></a>

In [None]:
events.head(3)

In [None]:
tdf, u_t , u_cols, DF= check_integrity(events)

In [None]:
def make_parsed_descrip_1(x):
    # split the description on runs of digits, quotes, spaces, '-' and '+'
    x = tuple(re.split(r"[0-9' \-+]+",x))
    return x

def make_simple_descrip(x):
    if "Recupero" in x:
        return "Recupero"
    
    elif "Traslado" in x:
        z = ""
        for s in x:
            if s!="Traslado":
                z+=s+" "
        return z.rstrip()
    
    elif "Puente" in x:
        z = ""
        for s in x:
            if s!="Puente":
                z+=s+" "
        return z.rstrip()
    
    elif ("Terremoto" in x)and("Manabi"in x):
        return "Terremoto"+" "+"Manabi"
    
    elif ("Mundial" in x)and("futbol"in x):
        return "Mundial de futbol"
    
    else:
        y=""
        for s in x:
            y+= s+" "
        return y.rstrip()
    pass

def is_core_event(x):
    # False as soon as one of the remaining tokens is purely numeric
    y = tuple(re.split(r"[a-z' \-+]+",x))
    for s in y:
        if s.isdigit():
            return False
    return True

In [None]:
from sklearn.preprocessing import OneHotEncoder
# note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

def events_sp_pipe(ks): 
    # events relevant for store s: national ones plus those of its state and city
    state, city = loc_dict[lst_s[ks]]
    is_national = events["locale"]=="National"
    # states are listed under locale "Regional", cities under locale "Local"
    is_regional = (events["locale"]=="Regional") & (events["locale_name"]==state)
    is_local = (events["locale"]=="Local") & (events["locale_name"]==city)
    E_s = pd.concat([events.loc[is_national],
                     events.loc[is_regional],
                     events.loc[is_local]])
    
    col = "description"
    
    E_s["parsed_"+col] = E_s[col].apply(make_parsed_descrip_1)
    E_s["simple_"+col] = E_s["parsed_"+col].apply(make_simple_descrip)
    E_s = E_s.drop("parsed_"+col, axis=1)
    
    E_s["core_event"] = E_s["description"].apply(is_core_event)
    # .loc[mask, column] assigns in place (chained indexing may write to a copy)
    E_s.loc[E_s["type"] == "Bridge", "core_event"] = False
    E_s.loc[E_s["type"] == "Transfer", "core_event"] = True
    E_s.loc[E_s["type"] == "Work Day", "core_event"] = False
    E_s.loc[E_s["transferred"] == True, "core_event"] = False

    # keep only non-transferred events and the derived columns
    E_s = E_s.loc[E_s["transferred"]==False].copy()
    E_s = E_s.drop(["type","transferred","locale","locale_name",col],axis=1)
    
    #OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(E_s["simple_"+col]))
    
    return E_s
    

In [None]:
E_s0 = events_sp_pipe(ks=ks0)
E_s0.head()

### 3.2c matching events onto "sp"-data <a class="anchor" id="Ch3.2c"></a>
[goto (0.) General Workflow](#Ch0)

[goto 4.3 global](#Ch4.3)

In [None]:
def sp_pipe_2(DF,E_s, is_test=False):
    EE = E_s.copy().set_index("date")
    sp = DF.copy()
    #print(f'init (n_sp={len(sp)})')
    
    #sp = pd.concat([sp,EE],axis=1)
    #sp = sp.join(EE)
    cols = ["core_event","simple_description"]
    match_on_dates_2(DF_target=sp,DF_source=EE,cols=cols)
    
    sp["core_event"] = sp["core_event"].fillna(False)
    
    lst_ev = EE["simple_description"].unique()
    
    ### OH encoding
    if True:
        
        sp["sd"] = sp["simple_description"]
        sp = sp.drop("simple_description",axis=1)
        sp = pd.get_dummies(sp, columns = ["sd"])
        
        if not is_test:
            sp = sp.loc[sp["sales"]>=0]
    
    return sp, lst_ev

In [None]:
E_s0["simple_description"].unique()

In [None]:
EE0= E_s0.set_index("date")
EE0

In [None]:
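def match_on_dates_2(DF_target,DF_source,cols):
    # copies `cols` from DF_source into DF_target (in place) where the dates match
    for col in cols:
        # object dtype: the columns receive strings/booleans, not floats
        DF_target[col] = np.full(len(DF_target),np.nan,dtype=object)
        
    for idx in DF_target.index:
        try:
            for col in cols:
                DF_target.loc[idx,col] = DF_source.loc[idx,col]
        except Exception:
            # typically a KeyError: no event on that date, keep NaN
            continue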

In [None]:
sp0_1 , evs0 = sp_pipe_2(DF=sp0,E_s=E_s0, is_test=False)

In [None]:
evs0

In [None]:
def next_date_next_val_2(DF,col_name,col_val,current_idx):
    # first row at or after current_idx where DF[col_name] == col_val
    dates_idxs = DF.loc[current_idx:].loc[DF[col_name]==col_val].index.to_numpy()
    try:
        next_date_idx = dates_idxs[0]
        next_date = DF.loc[next_date_idx]
    except IndexError:
        # no such row
        next_date_idx = np.nan
        next_date = np.nan
    return next_date_idx,next_date

def prev_date_prev_val_2(DF,col_name,col_val,current_idx):
    # last row at or before current_idx where DF[col_name] == col_val
    dates_idxs = DF.loc[:current_idx].loc[DF[col_name]==col_val].index.to_numpy()
    try:
        prev_date_idx = dates_idxs[-1]
        prev_date = DF.loc[prev_date_idx]
    except IndexError:
        # no such row
        prev_date_idx = np.nan
        prev_date = np.nan
    return prev_date_idx,prev_date


In [None]:
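def N_bef(DF, col_name, col_val, target_col, drop=False):
    # days until the next row (current one included) where DF[col_name] == col_val,
    # written in place into DF[target_col]
    # note: `drop` has no effect, the indicator column stays in the caller's DF
    DF[target_col] = np.full(len(DF),np.nan)
    for i in DF.index.to_numpy():
        y,z = next_date_next_val_2(DF,
                         col_name,
                         col_val,
                         current_idx=i)
        if not pd.isna(y):
            # single .loc call: chained indexing may write to a copy
            DF.loc[i,target_col] = pd.to_timedelta(y-i).days

def N_aft(DF, col_name, col_val, target_col,drop=False):
    # days since the previous row (current one included) where DF[col_name] == col_val,
    # written in place into DF[target_col]
    # note: `drop` has no effect, the indicator column stays in the caller's DF
    DF[target_col] = np.full(len(DF),np.nan)
    for i in DF.index.to_numpy():
        y,z = prev_date_prev_val_2(DF,
                         col_name,
                         col_val,
                         current_idx=i)
        if not pd.isna(y):
            DF.loc[i,target_col] = pd.to_timedelta(i-y).days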


In [None]:
t0 = sp0_1.index[1]
a,b = next_date_next_val_2(DF = sp0_1,
                             col_name="core_event",
                             col_val=True,
                             current_idx=t0)
print(a)
print('*'*30)
b

print((t0-a).days)
print('*'*30)
print(a-t0)
print(type(a-t0))
print('*'*30)
(a-t0).days

In [None]:
N_bef(DF=sp0_1,
       col_name="core_event",
       col_val=True,
       target_col="N_bef_core_event")

In [None]:
sp0_1

__note__: "sp_pipe_3" requires lots of computation/time: $O(N \cdot n)$ date lookups, each of which filters the whole frame again
- with $N = N(\text{index_dates}), n = N(\text{lst_evs})$ 
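A possible way around this cost, sketched under the assumption that the date index is sorted: the day counts to the next and since the previous event can be computed for all dates at once with `np.searchsorted`, in roughly $O(N \log N)$ per event column. The function name `days_to_next_and_since_prev` is hypothetical; the current day counts as distance 0, as in the loop-based version:

```python
import numpy as np
import pandas as pd

def days_to_next_and_since_prev(index, is_event):
    """Vectorized day counts to the next / since the previous event date."""
    dates = index.to_numpy().astype("datetime64[D]")
    ev = dates[np.asarray(is_event, dtype=bool)]   # sorted event dates
    n_bef = np.full(len(dates), np.nan)
    n_aft = np.full(len(dates), np.nan)
    if len(ev):
        nxt = np.searchsorted(ev, dates, side="left")       # first event >= date
        has_next = nxt < len(ev)
        n_bef[has_next] = (ev[nxt[has_next]] - dates[has_next]).astype(int)
        prv = np.searchsorted(ev, dates, side="right") - 1  # last event <= date
        has_prev = prv >= 0
        n_aft[has_prev] = (dates[has_prev] - ev[prv[has_prev]]).astype(int)
    return n_bef, n_aft

idx = pd.date_range("2017-01-01", periods=5)
n_bef, n_aft = days_to_next_and_since_prev(idx, [False, True, False, False, True])
```

Dates without a following (or preceding) event keep `NaN`, which the damping functions `h_f1` to `h_f3` map to 0.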

In [None]:
def sp_pipe_3(DF, lst_evs):
    sp = DF.copy()
    
    ### core_events
    if False:
        N_bef(DF=sp,
                  col_name="core_event",
                  col_val=True,
                  target_col="N_bef_core_event")

        N_aft(DF=sp,
                  col_name="core_event",
                  col_val=True,
                  target_col="N_aft_core_event")
    ### ---
    ### lst of events
    cols=[]
    for ev in lst_evs:
        cols.append("sd_"+str(ev))
    
    for kc in np.arange(len(cols)):
        N_bef(DF=sp,
              col_name=cols[kc],
              col_val=True,
              target_col="N_bef_"+str(lst_evs[kc]),
              drop=True)
        
        N_aft(DF=sp,
              col_name=cols[kc],
              col_val=True,
              target_col="N_aft_"+str(lst_evs[kc]),
              drop=True)

    return sp

In [None]:
def h_f1(x):
    if pd.isna(x):
        return 0
    else:
        return 1/(1+abs(x)/7) 

    
def h_f2(x):
    if pd.isna(x):
        return 0
    else:
        return 1/(1+(abs(x)/7)**2) 

def h_f3(x):
    if pd.isna(x):
        return 0
    else:
        return 1/(1+(abs(x)/7)**3) 
    
def make_x_bef_aft(DF,cols,funct=h_f2):
    for col in cols:
        DF["x"+col] = DF[col].apply(funct)
    
    pass

In [None]:
def sp_pipe_4(DF, lst_evs, funct=h_f1):
    sp = DF.copy()
    
    cols_sd = ["sd_"+str(ev) for ev in lst_evs]
    
    sp = sp.drop(cols_sd,axis=1)
    
    # columns holding the day counts before/after each event
    bef_cols = [col for col in sp.columns if col.startswith("N_bef")]
    aft_cols = [col for col in sp.columns if col.startswith("N_aft")]
    bef_aft_cols = bef_cols + aft_cols
    
    make_x_bef_aft(sp,bef_aft_cols,funct=funct)
    
    sp = sp.copy().drop(bef_aft_cols,axis=1)
    
    return sp


## 3.3 ignoring "oil" and "paydays"-data for now <a class="anchor" id="Ch3.3"></a>
<a class="anchor" id="Ch3.4"></a>
[goto (0.) General Workflow](#Ch0)

In [None]:
tdf, u_t , u_cols, DF= check_integrity(oil.set_index("date",drop=False))

## 3.5 global generation and storage of processed data <a class="anchor" id="Ch3.5"></a>

[goto (0.) General Workflow](#Ch0)
- processing the data for each pair (s,p) takes a long time, so:
- generate and store the processed data once,
- then compute the models on the already processed data
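
The "process once, reuse afterwards" idea can be sketched as a small cache helper. This is an illustration only: `load_or_build` and `build_fn` are hypothetical names, and the notebook's own `generate_sp_train` / `generate_sp_test` always regenerate the data.

```python
from pathlib import Path
import pandas as pd

def load_or_build(path, build_fn, force=False):
    """Return the cached CSV at `path` if it exists, else build it and store it.

    `build_fn` is any zero-argument callable returning a DataFrame
    (hypothetical stand-in for the expensive per-(s,p) processing).
    """
    path = Path(path)
    if path.exists() and not force:
        return pd.read_csv(path, index_col=0)
    df = build_fn()
    df.to_csv(path)
    return df
```

A call such as `load_or_build("sp_ks0_kp0.csv", lambda: generate_sp_train(ks=0, kp=0, to_csv=False))` would then only run the processing when no stored file is found.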

In [None]:
def generate_sp_train(ks,kp, funct=h_f1,pipe3=False,pipe2=False, to_csv=True):
    sp_fname = f'sp_ks{ks}_kp{kp}.csv'
    
    s = lst_s[ks]
    p = lst_p[kp]

    E_s = events_sp_pipe(ks)

    sp_tr= select_sp(DF= train,s=s,p=p)
    sp_tr= sp_pipe(DF=sp_tr,is_test=False)
    if pipe2:
        sp_tr, lst_evs = sp_pipe_2(DF=sp_tr,E_s=E_s)
        if pipe3:
            sp_tr = sp_pipe_3(DF=sp_tr, lst_evs=lst_evs)
            sp_tr = sp_pipe_4(DF=sp_tr, lst_evs=lst_evs, funct=funct)

    if to_csv:
        sp_tr.to_csv(sp_fname)
        print("processing of 'train'-data done, stored away")
    return sp_tr

def generate_sp_test(ks,kp, funct=h_f1,pipe3=False,pipe2=False,to_csv=True):
    sp_fname = f'spt_ks{ks}_kp{kp}.csv'
    
    s = lst_s[ks]
    p = lst_p[kp]

    E_s = events_sp_pipe(ks)

    sp_tr= select_sp(DF=test,s=s,p=p)
    sp_tr= sp_pipe(DF=sp_tr,is_test=True)
    if pipe2:
        sp_tr, lst_evs = sp_pipe_2(DF=sp_tr,E_s=E_s,is_test=True)
        if pipe3:
            sp_tr = sp_pipe_3(DF=sp_tr, lst_evs=lst_evs)
            sp_tr = sp_pipe_4(DF=sp_tr, lst_evs=lst_evs, funct=funct)
    
    if to_csv:
        sp_tr.to_csv(sp_fname)
        print("processing of 'test'-data done, stored away")
    return sp_tr
    

__ks = order_s\[0\]__

In [None]:
if True:
    for ks in order_s[0:1]:
        for kp in order_p[0:1]:
            generate_sp_train(ks=ks,kp=kp,pipe3=False)
            generate_sp_test(ks=ks,kp=kp,pipe3=False)

# 4 Models <a class="anchor" id="Ch4"></a>
[goto (0.) General Workflow](#Ch0)
## 4.0 Setup
Reminder of the models to work with:

- Linear Regression / Ridge
- Random Forests
- XGBoost
- HistGradientBoosting

note: the `evaluate` function below is taken from the [sklearn examples](https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html?highlight=time)
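
To see what the time-series cross-validation folds look like, here is a small sketch with the same `TimeSeriesSplit` settings as in the cell below, applied to a dummy array. The 1700 rows are an arbitrary assumption chosen for illustration, not the size of the real data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

ts_cv_demo = TimeSeriesSplit(n_splits=7, gap=0, max_train_size=1000, test_size=100)

# dummy feature matrix: only the number of rows matters for the split
X_demo = np.zeros((1700, 1))
folds = list(ts_cv_demo.split(X_demo))

for k, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {k}: train {train_idx[0]}..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on at most 1000 consecutive rows and tests on the 100 rows that directly follow, so no future information leaks into the training window.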

In [None]:
from sklearn.preprocessing import RobustScaler

from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import HistGradientBoostingRegressor


from sklearn.pipeline import make_pipeline

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_log_error


ts_cv = TimeSeriesSplit(
    n_splits=7,
    gap=0,
    max_train_size=1000,
    test_size=100,
)

#all_splits = list(ts_cv.split(X, y))

def evaluate(model,X,y,cv):
    cv_results = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error","neg_mean_squared_log_error"],
    )
    mae = -cv_results["test_neg_mean_absolute_error"]
    rmse = -cv_results["test_neg_root_mean_squared_error"]
    msle= -cv_results["test_neg_mean_squared_log_error"]
    print(
        f"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}\n"
        f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}\n"
        f"Mean Squared Log Error: {msle.mean():.3f} +/- {msle.std():.3f}"
    )


In [None]:
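def rmsle(y_true,y_pred):
    # RMSLE: take log1p of each series first, then the squared difference
    # note: if "sales" is already log1p-transformed upstream, plain RMSE is the equivalent metric
    return np.sqrt(((np.log1p(y_true) - np.log1p(y_pred)) ** 2).mean())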
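
For reference, the usual definition is $\text{RMSLE} = \sqrt{\frac{1}{N}\sum_i \left(\log(1+\hat{y}_i) - \log(1+y_i)\right)^2}$, i.e. the logarithm is applied to each series before taking the difference. A minimal standalone sketch on raw (non-log) values; `rmsle_reference` is a hypothetical name used to avoid clashing with the notebook's `rmsle`:

```python
import numpy as np

def rmsle_reference(y_true, y_pred):
    """Root mean squared logarithmic error on raw, non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

perfect = rmsle_reference([1.0, 10.0, 100.0], [1.0, 10.0, 100.0])
one_log_unit = rmsle_reference([0.0], [np.e - 1.0])
print(perfect, one_log_unit)
```

The second call predicts $e - 1$ for a true value of 0, so the log difference is one unit.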


## 4.1) models construction: <a class="anchor" id="Ch4.1"></a>
[goto (0.) General Workflow](#Ch0)

### 4.1a simple models

In [None]:
ridge = make_pipeline(RobustScaler(),
                      RidgeCV())

RF = make_pipeline(RobustScaler(),
                   RandomForestRegressor(max_depth=5))

XGB = make_pipeline(RobustScaler(),
                    XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4))
HGB = make_pipeline(HistGradientBoostingRegressor())

### 4.1b Compound Models <a class="anchor" id="Ch4.1b"></a>

[goto (0.) General Workflow](#Ch0)

- primarily to decompose the target into a "trend" part (fitted by `model_1`) and a "detrended" residual part (fitted by `model_2`)
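
The decomposition idea can be illustrated on synthetic data with plain numpy. This is a sketch under stated assumptions: the linear trend, the weekly pattern and the noise level are invented for the example, and the day-of-week mean stands in for the second model.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(140)                      # 20 full weeks of daily data
weekly = np.array([3.0, 0.0, -1.0, 0.5, 1.0, -2.0, -1.5])
y = 10.0 + 0.5 * t + weekly[t % 7] + rng.normal(0.0, 0.1, t.size)

# stage 1: fit the trend only
slope, intercept = np.polyfit(t, y, deg=1)
trend = intercept + slope * t

# stage 2: fit what the trend leaves unexplained (the "detrended" part)
resid = y - trend
dow_effect = np.array([resid[t % 7 == d].mean() for d in range(7)])

# compound prediction = trend + residual model
y_hat = trend + dow_effect[t % 7]

mse_trend = np.mean((y - trend) ** 2)
mse_compound = np.mean((y - y_hat) ** 2)
print(f"trend only: {mse_trend:.3f}, trend + residual model: {mse_compound:.3f}")
```

Both errors are measured in-sample here; the point is only that the second stage picks up structure the trend model cannot represent.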

In [None]:
def default_selector(X):
    cols=X.columns
    cols1 = cols[0:1]
    cols2 = cols[1:]
    return [cols1,cols2,[]]

class CompoundModel:
    def __init__(self, model_1, model_2, model_3=None,sel_funct=default_selector):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None
        self.model_3 = model_3
        self.sel_funct = sel_funct

def CompoundModelFit(self, X, y):
    
    mtx_cols = self.sel_funct(X)
    X_1 = X[mtx_cols[0]]
    X_2 = X[mtx_cols[1]]
    X_3 = X[mtx_cols[2]]
    #print(X_1.shape,X_2.shape,X_3.shape)
    self.model_1.fit(X_1,y)

    y_fit = pd.DataFrame(
        self.model_1.predict(X_1), 
        index=X_1.index, columns=y.columns,
    )

    y_resid = y - y_fit
    y_resid = y_resid.stack().squeeze() # wide to long
    if not (X_2.empty):
        self.model_2.fit(X_2, y_resid)
        
        if not (X_3.empty):
            # second-stage residual: what model_2 leaves unexplained
            y_resid2 = y_resid - self.model_2.predict(X_2)
            self.model_3.fit(X_3, y_resid2)

        else:
            y_resid2 = None
            #print("Note: in 'CompoundModelFit':: X_3 = NaN")
    else:
        print("Warning: in 'CompoundModelFit' :: X_2 is empty, \n check sel_funct \n or check cols in X")
    
    # Save column names for predict method
    self.y_columns = y.columns
    
def CompoundModelPredict(self, X):
    mtx_cols = self.sel_funct(X)
    X_1 = X[mtx_cols[0]]
    X_2 = X[mtx_cols[1]]
    X_3 = X[mtx_cols[2]]
    #print(X_1.shape,X_2.shape,X_3.shape)
    y_pred = pd.DataFrame(
        self.model_1.predict(X_1),
        index=X_1.index, columns=self.y_columns,)
    y_pred = y_pred.stack().squeeze()  # wide to long
    
    if X_2.empty:
        print('Warning: in "CompoundModelPredict" :: X_2 is empty, \n check sel_funct \n or check cols in X')
    else:
        y_pred += self.model_2.predict(X_2)
        # model_3 is only fitted when X_2 is non-empty, so apply it in the same branch
        if not (X_3.empty):
            y_pred += self.model_3.predict(X_3)
    return y_pred.unstack()  # long to wide

### ---

CompoundModel.fit = CompoundModelFit
CompoundModel.predict = CompoundModelPredict

def X_selector2(X):
    cols=list(X.columns.to_numpy())
    K = "trend"
    cols1 = list(["trend",])
    while(K in cols) :
        cols.remove(K)
    return list([cols1,cols,[]])

## 4.2 Evaluation of predictions <a class="anchor" id="Ch4.2"></a>
[goto (0.) General Workflow](#Ch0)

In [None]:
s = lst_s[4]
p = lst_p[7]

sp_tr= select_sp(DF= train,s=s,p=p)
X_1,y_1 = DataProcess_to_Xy(sp_tr, is_test=False)

sp_te = select_sp(DF=test,s=s,p=p)
X_1_te ,y_1_te = DataProcess_to_Xy(sp_te, is_test=True)



In [None]:
print("Ridge")
evaluate(model=ridge,X=X_1,y=y_1["sales"],cv=ts_cv)
print('*'*40)
print("RF")
evaluate(model=RF,X=X_1,y=y_1["sales"],cv=ts_cv)
print('*'*40)
print("XGB")
evaluate(model=XGB,X=X_1,y=y_1["sales"],cv=ts_cv)
print('*'*40)
print("HGB")
evaluate(model=HGB,X=X_1,y=y_1["sales"],cv=ts_cv)
print('*'*40)

In [None]:
def Make_preds_Y(model,X_tr,y_tr,X_te, plotting=False, verbose=False):
    model.fit(X_tr,y_tr[["sales"]])

    yp_tr = pd.DataFrame(model.predict(X_tr), index=X_tr.index,columns=["sales"])
    yp_te = pd.DataFrame(model.predict(X_te), index=X_te.index,columns=["sales"])

    if plotting:
        make_plot3(df1=y_tr["sales"],df2=yp_tr["sales"],df3=yp_te["sales"],x_axis="date",y_axis="sales",title=False)
        make_plot3_rol(df1=y_tr["sales"],df2=yp_tr["sales"],df3=yp_te["sales"],x_axis="date",y_axis="sales",title=False)

    if verbose:
        print(f"training fit rmsle= {rmsle(y_tr['sales'],yp_tr['sales'])} \n")
        #print( f"testing fit rmsle= {rmsle(y_te['sales'],yp_0_te)}")
    
    Yp_te = X_te[["id"]].join(pd.DataFrame(yp_te,index=yp_te.index,columns=["sales"]))
    Yp_tr = X_tr[["id"]].join(pd.DataFrame(yp_tr,index=yp_tr.index,columns=["sales"]))
    return Yp_te,Yp_tr

In [None]:
def Compare_True(Yp_te,Yp_tr,Y_true, plotting=False):
    Y_fit = pd.concat([Yp_tr,Yp_te])
    r = rmsle(Y_fit['sales'],Y_true['sales'])
    print(f"fit rmsle= {r}")
    if plotting:
        make_plot3(df1=Y_true['sales'],df2=Yp_tr['sales'],df3=Yp_te["sales"],x_axis="date",y_axis="sales",title=False)
        make_plot3_rol(df1=Y_true['sales'],df2=Yp_tr['sales'],df3=Yp_te["sales"],x_axis="date",y_axis="sales",title=False)
    pass

## 4.3 one model pipeline : <a class="anchor" id="Ch4.3"></a>

[goto (0.) General Workflow](#Ch0)

In [None]:
def modelize_from_csv(ks,kp, model=XGB, plotting=False,
             on_train_subset=False, from_work=True):
    
    sp_fname=f'sp_ks{ks}_kp{kp}.csv'
    spt_fname=f'spt_ks{ks}_kp{kp}.csv'
    ###-------------------------------------
    
    if from_work:
        sp_tr1 = pd.read_csv(work_path/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
    else:
        sp_tr1 = pd.read_csv(comp_dir/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
    
    sp_tr1 = sp_tr1.set_index("date")
    
    X_1,y_1 = DataProcess_to_Xy(sp_tr1, is_test=False)
    
    if not (on_train_subset is False):
        
        X_tr, X_te, y_tr, y_te = train_test_split(X_1, y_1, test_size=on_train_subset,
                                                      random_state=1, shuffle=False)
        Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                X_tr=X_tr,
                                y_tr=y_tr,
                                X_te=X_te, 
                                plotting=False)
        Compare_True(Yp_sp_te , Yp_sp_tr,y_1, plotting=plotting)
        return Yp_sp_te,Yp_sp_tr
    
    else:

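        # read the test file from the same location as the train file
        base_dir = work_path if from_work else comp_dir
        sp_te1 = pd.read_csv(base_dir/spt_fname,
                     parse_dates=['date'],
                     infer_datetime_format=True)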
    
        sp_te1 = sp_te1.set_index("date")
        
        X_1_te ,y_1_te = DataProcess_to_Xy(sp_te1, is_test=True)
        Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                X_tr=X_1,
                                y_tr=y_1,
                                X_te=X_1_te, 
                                plotting=plotting)
        
        
        return Yp_sp_te,Yp_sp_tr
    
    pass 

In [None]:
def modelize_sp(sp_tr,sp_te=pd.DataFrame(), model=XGB, plotting=False,
             on_train_subset=False):
    
    sp_tr1= sp_tr.copy()
    
    #sp_tr1 = sp_tr1.set_index("date")
    
    X_1,y_1 = DataProcess_to_Xy(sp_tr1, is_test=False)
    
    if not (on_train_subset is False):
        
        X_tr, X_te, y_tr, y_te = train_test_split(X_1, y_1, test_size=on_train_subset,
                                                      random_state=1, shuffle=False)
        Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                X_tr=X_tr,
                                y_tr=y_tr,
                                X_te=X_te, 
                                plotting=False)
        Compare_True(Yp_sp_te , Yp_sp_tr,y_1, plotting=plotting)
        return Yp_sp_te,Yp_sp_tr
    
    elif not(sp_te.empty):
        
        sp_te1 = sp_te.copy()
    
        #sp_te1 = sp_te1.set_index("date")
        
        X_1_te ,y_1_te = DataProcess_to_Xy(sp_te1, is_test=True)
        Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                X_tr=X_1,
                                y_tr=y_1,
                                X_te=X_1_te, 
                                plotting=plotting)
        
        
        return Yp_sp_te,Yp_sp_tr
    
    pass 

In [None]:
__ = modelize_from_csv(ks=order_s[0],kp=order_p[0], model=XGB, plotting=True, on_train_subset=0.15,)

## 4.4 Compound model analysis <a class="anchor" id="Ch4.4"></a>

In [None]:
### Debugging; please ignore
### debug both "modelize" and "generate"
if False:
    ks0 = order_s[0]
    kp0 = order_p[0]
    
    sp_fname = f'sp_ks{ks0}_kp{kp0}.csv'
    DO_GEN = False
    
    ### ----------------------------------------
    if DO_GEN:
        s = lst_s[ks0]
        p = lst_p[kp0]

        E_s = events_sp_pipe(ks0)

        sp_tr= select_sp(DF= train,s=s,p=p)
        sp_tr= sp_pipe(DF=sp_tr,is_test=False)

        sp_tr, lst_evs = sp_pipe_2(DF=sp_tr,E_s=E_s)

        sp_tr = sp_pipe_3(DF=sp_tr, lst_evs=lst_evs)

        sp_tr = sp_pipe_4(DF=sp_tr, lst_evs=lst_evs, funct=h_f1)


        sp_tr.to_csv(sp_fname)
    ### ----------------------------------------
    
    sp_tr1 = pd.read_csv(work_path/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)

    sp_tr1 = sp_tr1.set_index("date")
    X_1,y_1 = DataProcess_to_Xy(sp_tr1, is_test=False)

    all_splits = list(ts_cv.split(X_1, y_1))
    k = 4
    X_tr = X_1.iloc[all_splits[k][0]]
    X_te = X_1.iloc[all_splits[k][1]]

    y_tr = y_1.iloc[all_splits[k][0]]
    y_te = y_1.iloc[all_splits[k][1]]

    model_0 = HGB
    Yp_te,Yp_tr=Make_preds_Y(model= model_0,
                 X_tr=X_tr,
                 y_tr=y_tr,
                 X_te=X_te, 
                 plotting=False,verbose=True)
    Compare_True(Yp_te , Yp_tr,y_1,plotting=True)


In [None]:
CustomModel = CompoundModel(ridge,HGB,sel_funct=X_selector2)

__ = modelize_from_csv(ks=order_s[0],kp=order_p[0], model=CustomModel, plotting=True, on_train_subset=0.15)

In [None]:
def CompoundPred_1(self, X):
    mtx_cols = self.sel_funct(X)
    X_1 = X[mtx_cols[0]]
    X_2 = X[mtx_cols[1]]
    X_3 = X[mtx_cols[2]]
    #print(X_1.shape,X_2.shape,X_3.shape)
    y_pred = pd.DataFrame(
        self.model_1.predict(X_1),
        index=X_1.index, columns=self.y_columns,
    )
    y_pred = y_pred#.stack().squeeze()  # wide to long
    
    
    return y_pred

CompoundModel.pred_1 = CompoundPred_1
    
def CompoundPred_2(self, X):
    mtx_cols = self.sel_funct(X)
    X_1 = X[mtx_cols[0]]
    X_2 = X[mtx_cols[1]]
    X_3 = X[mtx_cols[2]]
    #print(X_1.shape,X_2.shape,X_3.shape)
    y_pred = pd.DataFrame(
        self.model_2.predict(X_2),
        index=X_2.index, columns=self.y_columns,
    )
    y_pred = y_pred#.stack().squeeze()  # wide to long

    return y_pred

CompoundModel.pred_2 = CompoundPred_2

def show_decomposition(self,X,Y_true, on_train_subset=0.15):
    plotting=True
    
    X_tr, X_te, y_tr, y_te = train_test_split(X, Y_true, test_size=on_train_subset,
                                                      random_state=1, shuffle=False)
### global compound performance
    yp_trG = self.predict(X_tr)
    yp_teG = self.predict(X_te)
    Y_fitG = pd.concat([yp_trG,yp_teG])
    print(f"fit rmsle global predict = {rmsle(Y_fitG['sales'],Y_true['sales'])}")
    
### trend graph:
    yp_tr = self.pred_1(X_tr)
    yp_te = self.pred_1(X_te)
    #print(yp_tr.head())
    
    Y_fit = pd.concat([yp_tr,yp_te])
    print(f"fit rmsle pred_1 = {rmsle(Y_fit['sales'],Y_true['sales'])}")
    #print(f"fit rmsle= {rmsle(Y_fit[['sales']],Y_true[['sales']])}")
    if plotting:
        make_plot3(df1=Y_true['sales'],df2=yp_tr['sales'],df3=yp_te["sales"],
                   x_axis="date",y_axis="sales",title=False, line=3)
        
        make_plot3_rol(df1=Y_true['sales'],df2=yp_tr['sales'],df3=yp_te["sales"],
                       x_axis="date",y_axis="sales",title=False, line=5)
### off-trend graph

    Y_true_2 = Y_true-self.pred_1(X)

    yp_tr2 = self.pred_2(X_tr)
    yp_te2 = self.pred_2(X_te)
    
    Y_fit2 = pd.concat([yp_tr2,yp_te2])
    print(f"fit rmsle pred_2 = {rmsle(Y_fit2['sales'],Y_true_2['sales'])}")
    if plotting:
        make_plot3(df1=Y_true_2['sales'],df2=yp_tr2['sales'],df3=yp_te2["sales"],
                   x_axis="date",y_axis="detrended sales",title=False, line=1)
        
        make_plot3_rol(df1=Y_true_2['sales'],df2=yp_tr2['sales'],df3=yp_te2["sales"],
                       x_axis="date",y_axis="detrended sales",title=False, line=1)
    
    pass

CompoundModel.show_decomp = show_decomposition

In [None]:
from_work =True
if True:
    sp_fname=f'sp_ks{ks}_kp{kp}.csv'
    if from_work:
        sp_tr1 = pd.read_csv(work_path/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
    else:
        sp_tr1 = pd.read_csv(comp_dir/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
        
    sp_tr1 = sp_tr1.set_index("date")
    
    X_1,y_1 = DataProcess_to_Xy(sp_tr1, is_test=False)

In [None]:
def Show_decomp(model,ks,kp,on_train_subset=0.15,from_work=True):
    sp_fname=f'sp_ks{ks}_kp{kp}.csv'
    if from_work:
        sp_tr1 = pd.read_csv(work_path/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
    else:
        sp_tr1 = pd.read_csv(comp_dir/sp_fname,
                         parse_dates=['date'],
                         infer_datetime_format=True)
        
    sp_tr1 = sp_tr1.set_index("date")
    
    X_1,y_1 = DataProcess_to_Xy(sp_tr1, is_test=False)
    
    return model.show_decomp(X=X_1,Y_true=y_1,on_train_subset=on_train_subset)

In [None]:
Show_decomp(model=CustomModel,ks=order_s[0],kp=order_p[0],on_train_subset=0.15,from_work=True)

# 5 Global submission <a class="anchor" id="Ch5"></a>
[goto (0.) General Workflow](#Ch0)

In [None]:
def submit_from_csv(ks=ks,kp=kp,model=CustomModel):
    
    Yp_sp_te,_ = modelize_from_csv(ks=ks,kp=kp, model=model, plotting=False, on_train_subset=False, from_work=True)
    Yp_sp_te = from_log_sales(Yp_sp_te)
    Yp_sp_te.to_csv('submission.csv', mode='a', index=False, header=False)
    pass

In [None]:
def submit_sp(sp_tr,sp_te,model=XGB,):
    if (sp_tr.empty) or (sp_te.empty):
        print('WARNING: sp_tr/te is empty')
    else:
        Yp_sp_te,_ = modelize_sp(model=model,sp_tr=sp_tr,sp_te=sp_te, plotting=False, on_train_subset=False)
        Yp_sp_te = from_log_sales(Yp_sp_te)  # same back-transform as in submit_from_csv
        Yp_sp_te.to_csv('submission.csv', mode='a', index=False, header=False)
    pass

In [None]:
to_csv=False

if True:
    for ks in order_s:
        for kp in order_p:
            
            sp_tr = generate_sp_train(ks=ks,kp=kp,pipe3=False,pipe2=False,to_csv=to_csv)
            sp_te = generate_sp_test(ks=ks,kp=kp,pipe3=False,pipe2=False,to_csv=to_csv)
            
            if to_csv:
                submit_from_csv(ks=ks,kp=kp,model=CustomModel)
            else:
                submit_sp(model=CustomModel,sp_tr=sp_tr,sp_te=sp_te)

In [None]:
### NOTE: DEPRECATED

def complete_pipe(ks,kp,model=ridge,plotting=False, on_train_subset=False):
    if False:
        s = lst_s[ks]
        p = lst_p[kp]

        E_s = events_sp_pipe(ks=ks)

        sp_tr= select_sp(DF= train,s=s,p=p)
        sp_tr = sp_pipe(sp_tr)
        sp_tr = sp_pipe_2(sp_tr,E_s)
        X_1,y_1 = DataProcess_to_Xy(sp_tr, is_test=False)

        sp_te = select_sp(DF=test,s=s,p=p)
        sp_te = sp_pipe(sp_te,is_test=True)
        #sp_te = sp_pipe_2(sp_te,E_s)
        X_1_te ,y_1_te = DataProcess_to_Xy(sp_te, is_test=True)

        if on_train_subset is False:
            Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                    X_tr=X_1,
                                    y_tr=y_1,
                                    X_te=X_1_te, 
                                    plotting=plotting)
        else:
            X_tr, X_te, y_tr, y_te = train_test_split(X_1, y_1, test_size=on_train_subset,
                                                          random_state=1, shuffle=False)
            Yp_sp_te,Yp_sp_tr = Make_preds_Y(model,
                                    X_tr=X_tr,
                                    y_tr=y_tr,
                                    X_te=X_te, 
                                    plotting=False)
            Compare_True(Yp_sp_te , Yp_sp_tr,y_1, plotting=plotting)
        return Yp_sp_te,Yp_sp_tr
    return print("function deprecated")

In [None]:
__=complete_pipe(ks=order_s[0],kp=order_p[0],model=XGB,plotting=True,on_train_subset=0.1)

In [None]:
#submission.to_csv('submission.csv',mode='w',  index=False, header=False)

model_loop = XGB
if False:
    for ks in np.arange(len(lst_s[0:2])):
        for kp in np.arange(len(lst_p[0:2])):
            Yp_sp_te,_ = complete_pipe(ks=ks,kp=kp,model=model_loop,plotting=False)
            Yp_sp_te.to_csv('submission.csv', mode='a', index=False, header=False)
    

# Fin <a class="anchor" id="End"></a>

[to (0.) General Workflow](#Ch0)