This notebook collects several experiments using only the counters whose time series are complete or nearly complete.

**EXPERIMENTS DONE**
1. Prediction as the mean of XGBoost Regressor and Gradient Boosting Regressor (BASELINE, LOCAL SUBMISSION)
2. Same as 1, but with a separate model trained on weekly aggregates for the weekly aggregated predictions (WORSE)
3. Add the lag_7 and lag_14 variables (BETTER)
4. Feature selection based on XGBoost's feature_importances_ (SUSPICIOUSLY LARGE improvement)

**EXPERIMENTS TO DO**
- Train on fewer months
- Remove negative predictions (set them to 0 and move on)
- Use CatBoost instead of XGBoost
- Try a weekly aggregated model using ARIMA
- Try a daily model with ARIMA (good counters only)
- Remove outliers (this belongs more in preprocessing)
- Apply logarithms to (almost) everything
- LSTM (unclear what this can handle) (it may be a good idea not to use the whole time series)
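
Two of the pending ideas (setting negative predictions to 0 and working in log space) can be sketched in a few lines. This is only an illustration: the helper names `clip_negative`, `to_log` and `from_log` are hypothetical, the sample values are made up, and the log transform assumes the target is non-negative.

```python
import numpy as np

def clip_negative(pred):
    """Replace negative predictions with 0 (a consumption delta is assumed non-negative)."""
    return np.clip(np.asarray(pred, dtype=float), 0.0, None)

def to_log(y):
    """log(1 + y); defined for y >= 0, so it tolerates zero consumption."""
    return np.log1p(np.asarray(y, dtype=float))

def from_log(y_log):
    """Inverse of to_log, used to bring predictions back to the original scale."""
    return np.expm1(np.asarray(y_log, dtype=float))

raw_pred = np.array([-3.2, 0.0, 15.7])      # hypothetical model output
clean_pred = clip_negative(raw_pred)        # negatives become 0
round_trip = from_log(to_log([0, 10, 250])) # recovers the original values
```

With a log target, the model would be fitted on `to_log(y_train)` and its predictions passed through `from_log` before computing the competition error.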

### Imports, utils and train/test creation

In [1]:
import pandas as pd
import numpy as np
import datetime
from tqdm import tqdm

import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor
import warnings



In [2]:
warnings.simplefilter('ignore')

In [3]:
'''
Given a start date "start_date" and an end date "end_date" (both datetime.date), returns a list of ISO-format 
strings with every date from "start_date" to "end_date", both included.

Example:

start_date = datetime.date(2019, 9 , 30)
end_date = datetime.date(2019, 10, 7)
get_date_range(start_date, end_date)
'''
def get_date_range(start_date, end_date):
    number_of_days = (end_date-start_date).days
    return [(start_date + datetime.timedelta(days = day)).isoformat() for day in range(number_of_days+1)]

'''
This function expects two dataframes with the same format: each of the first seven columns corresponds to a date 
and each row corresponds to a counter index. Position i,j should hold the DELTA of counter i on date j. 
The last two columns should not refer to a daily prediction but to the aggregated prediction 
of week_1 and week_2. Given these two dataframes (one for the prediction and one for the real values), 
the function returns the error according to the competition rules.

Examples:

import pandas as pd
import copy

test = pd.read_pickle('../data/test.pkl')

compute_error(test, test)

test_v3 = copy.copy(test)
test_v3.iloc[:,0] = test_v3.iloc[:,1]
compute_error(test_v3, test)

'''
def compute_error(pred, real):
    daily_rmses = []
    for i in range(7):
        daily_rmses.append((((real.iloc[:,i] - pred.iloc[:,i])**2/len(real.iloc[:,i])).sum())**(1/2))
    rmse_1 = sum(daily_rmses)/7
    
    first_week_pred_sum = pred.iloc[:,7].sum()
    second_week_pred_sum = pred.iloc[:,8].sum()
    first_week_real_sum = real.iloc[:,7].sum()
    second_week_real_sum = real.iloc[:,8].sum()
    
    first_week_rmse = (((first_week_real_sum - first_week_pred_sum)**2)/len(real.iloc[:,7]))**(1/2)
    second_week_rmse = (((second_week_real_sum - second_week_pred_sum)**2)/len(real.iloc[:,8]))**(1/2)
    rmse_2 = (first_week_rmse + second_week_rmse)/2
    
    return (rmse_1 + rmse_2)/2
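
To make the metric easier to check by hand, here is a vectorized sketch of the same formula applied to a toy frame. The name `compute_error_sketch` and the toy data are hypothetical; the arithmetic follows `compute_error` above (mean of the seven daily RMSEs, averaged with the mean of the two weekly-total errors).

```python
import numpy as np
import pandas as pd

def compute_error_sketch(pred, real):
    # Daily part: RMSE of each of the first 7 columns, averaged over the 7 days
    diff = real.iloc[:, :7].to_numpy(dtype=float) - pred.iloc[:, :7].to_numpy(dtype=float)
    rmse_1 = np.sqrt((diff ** 2).mean(axis=0)).mean()
    # Weekly part: error of the column totals, divided by sqrt(number of counters)
    n = len(real)
    weekly = [abs(real.iloc[:, c].sum() - pred.iloc[:, c].sum()) / np.sqrt(n) for c in (7, 8)]
    rmse_2 = sum(weekly) / 2
    return (rmse_1 + rmse_2) / 2

# Toy data: 4 counters, 7 daily columns plus 2 weekly columns, all zeros
real = pd.DataFrame(np.zeros((4, 9)))

pred_day = real.copy()
pred_day.iloc[:, 0] = 2.0        # every counter off by 2 on day 1 only
err_day = compute_error_sketch(pred_day, real)    # (2/7 + 0) / 2 = 1/7

pred_week = real.copy()
pred_week.iloc[:, 7] = 1.0       # week-1 total off by 4 across the 4 counters
err_week = compute_error_sketch(pred_week, real)  # (0 + (4/2 + 0)/2) / 2 = 0.5
```

Note that the weekly part compares totals over all counters, so over- and under-predictions on different counters cancel out there.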

In [4]:
path = '../data/df6.pkl'

df = pd.read_pickle(path)
start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 17)
train = df[df['DATE'].isin(get_date_range(start_date, end_date))]
train = train[train['IS_GOOD']==1]
train.drop(['YEAR_DAY','WEEKDAY','IS_GOOD','DATE'], axis=1, inplace=True)
train['SUN'] = train['SUN'].fillna(train['SUN'].mean())
train['PRECIPITATIONS'] = train['PRECIPITATIONS'].fillna(train['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
test = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test = test[test['IS_GOOD']==1]
test.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test['SUN'] = test['SUN'].fillna(test['SUN'].mean())
test['PRECIPITATIONS'] = test['PRECIPITATIONS'].fillna(test['PRECIPITATIONS'].mean())

print('Train:', train.shape, 'Test:', test.shape)

X_train = train.drop(['DELTA'], axis=1)
y_train = train['DELTA']

X_test = test.drop(['DELTA', 'DATE'], axis=1)

Train: (931203, 14) Test: (37142, 15)
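
The same date filter and NaN filling are repeated for every experiment in this notebook, so a small helper might keep them consistent. The sketch below is an assumption-laden illustration: `make_split` is a hypothetical name, `DATE` is assumed to hold ISO-format strings (as the `isin` filter above suggests), and the option to reuse the training means on the test slice is a suggestion rather than what the cells above do.

```python
import datetime
import pandas as pd

def make_split(df, start, end, fill_cols=('SUN', 'PRECIPITATIONS'), fill_values=None):
    """Return the rows with DATE in [start, end] and the fill values that were used.

    fill_values: optional dict column -> value; defaults to the means of the slice itself.
    """
    dates = [(start + datetime.timedelta(days=d)).isoformat()
             for d in range((end - start).days + 1)]
    out = df[df['DATE'].isin(dates)].copy()   # copy so later edits do not touch df
    if fill_values is None:
        fill_values = {c: out[c].mean() for c in fill_cols}
    for c in fill_cols:
        out[c] = out[c].fillna(fill_values[c])
    return out, fill_values

nan = float('nan')
toy = pd.DataFrame({
    'DATE': ['2020-01-16', '2020-01-17', '2020-01-18', '2020-01-19'],
    'SUN': [1.0, 3.0, nan, nan],
    'PRECIPITATIONS': [0.0, 2.0, nan, 4.0],
})
train_toy, train_means = make_split(toy, datetime.date(2020, 1, 16), datetime.date(2020, 1, 17))
# Reusing the training means also works when a test column is entirely missing
test_toy, _ = make_split(toy, datetime.date(2020, 1, 18), datetime.date(2020, 1, 19),
                         fill_values=train_means)
```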


### XGBR and GBR for all counters

- The final prediction is the mean between XGBR and GBR
- No lags are used
- The week prediction is done just by adding the daily predictions 

In [5]:
path = '../data/df6.pkl'

df = pd.read_pickle(path)
start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 17)
train = df[df['DATE'].isin(get_date_range(start_date, end_date))]
train.drop(['YEAR_DAY','WEEKDAY','IS_GOOD','DATE'], axis=1, inplace=True)
train['SUN'] = train['SUN'].fillna(train['SUN'].mean())
train['PRECIPITATIONS'] = train['PRECIPITATIONS'].fillna(train['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
test = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test['SUN'] = test['SUN'].fillna(test['SUN'].mean())
test['PRECIPITATIONS'] = test['PRECIPITATIONS'].fillna(test['PRECIPITATIONS'].mean())

print('Train:', train.shape, 'Test:', test.shape)

X_train = train.drop(['DELTA'], axis=1)
y_train = train['DELTA']

X_test = test.drop(['DELTA', 'DATE'], axis=1)

Train: (964197, 14) Test: (38458, 15)


In [6]:
model1 = xgb.XGBRegressor(
    n_estimators=1000,
    reg_lambda=1,
    gamma=0,
    max_depth=8
)

model2 = GradientBoostingRegressor()

print('Fitting XGB...')
model1.fit(X_train, y_train)
print('Fitting GB...')
model2.fit(X_train, y_train)
print('End fitting.')

Fitting XGB...


KeyboardInterrupt: 

In [None]:
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)

results_df = pd.DataFrame.from_dict({'ID':test['ID'].values, 
                                     'DATE':test['DATE'].values,
                                     'y_pred1':y_pred1,
                                     'y_pred2':y_pred2})
results_df = results_df.sort_values(['ID','DATE'])
results_df['FINAL'] = results_df[['y_pred1','y_pred2']].mean(axis=1)

In [None]:
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
fechas_test = get_date_range(start_date, end_date)

ID = []
Dia_1 = []
Dia_2 = []
Dia_3 = []
Dia_4 = []
Dia_5 = []
Dia_6 = []
Dia_7 = []
for i, fecha in enumerate(fechas_test[0:7]):
    aux = results_df[results_df['DATE']==fecha]
    ID = list(aux['ID'].values)
    if i==0:
        Dia_1 += list(aux['FINAL'].values)
    if i==1:
        Dia_2 += list(aux['FINAL'].values)
    if i==2:
        Dia_3 += list(aux['FINAL'].values)
    if i==3:
        Dia_4 += list(aux['FINAL'].values)
    if i==4:
        Dia_5 += list(aux['FINAL'].values)
    if i==5:
        Dia_6 += list(aux['FINAL'].values)
    if i==6:
        Dia_7 += list(aux['FINAL'].values)
print(len(ID),len(Dia_1))
final_df = pd.DataFrame.from_dict({'ID':ID,
                                   'Dia_1':Dia_1,
                                  'Dia_2':Dia_2,
                                  'Dia_3':Dia_3,
                                  'Dia_4':Dia_4,
                                  'Dia_5':Dia_5,
                                  'Dia_6':Dia_6,
                                  'Dia_7':Dia_7,})

ID = []
Dia_8 = []
Dia_9 = []
Dia_10 = []
Dia_11 = []
Dia_12 = []
Dia_13 = []
Dia_14 = []
for i, fecha in enumerate(fechas_test[7:14]):
    aux = results_df[results_df['DATE']==fecha]
    ID = list(aux['ID'].values)
    if i==0:
        Dia_8 += list(aux['FINAL'].values)
    if i==1:
        Dia_9 += list(aux['FINAL'].values)
    if i==2:
        Dia_10 += list(aux['FINAL'].values)
    if i==3:
        Dia_11 += list(aux['FINAL'].values)
    if i==4:
        Dia_12 += list(aux['FINAL'].values)
    if i==5:
        Dia_13 += list(aux['FINAL'].values)
    if i==6:
        Dia_14 += list(aux['FINAL'].values)
print(len(ID),len(Dia_11))
final_df2 = pd.DataFrame.from_dict({'ID':ID,
                                   'Dia_8':Dia_8,
                                  'Dia_9':Dia_9,
                                  'Dia_10':Dia_10,
                                  'Dia_11':Dia_11,
                                  'Dia_12':Dia_12,
                                  'Dia_13':Dia_13,
                                  'Dia_14':Dia_14,})

final_df['Semana_1'] = final_df[['Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7']].sum(axis=1)
final_df['Semana_2'] = final_df2[['Dia_8','Dia_9','Dia_10','Dia_11','Dia_12','Dia_13','Dia_14']].sum(axis=1)

final_df2 = final_df.drop('ID', axis=1)
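
The day-by-day loops above turn the long `results_df` (one row per ID and DATE) into one row per counter with seven daily columns and two weekly totals. A possible shorter route is `DataFrame.pivot`; the sketch below uses a hypothetical helper `to_submission` and toy data, and it assumes every counter has exactly one prediction per date.

```python
import datetime
import pandas as pd

def to_submission(results_df, dates):
    """Pivot long predictions (ID, DATE, FINAL) into Dia_1..Dia_7, Semana_1, Semana_2."""
    wide = results_df.pivot(index='ID', columns='DATE', values='FINAL')
    wide = wide[dates]                                    # enforce chronological order
    wide.columns = [f'Dia_{i + 1}' for i in range(len(dates))]
    wide['Semana_1'] = wide[[f'Dia_{i}' for i in range(1, 8)]].sum(axis=1)
    wide['Semana_2'] = wide[[f'Dia_{i}' for i in range(8, 15)]].sum(axis=1)
    # Layout expected by compute_error: 7 daily columns, then the two weekly totals
    return wide[[f'Dia_{i}' for i in range(1, 8)] + ['Semana_1', 'Semana_2']]

start = datetime.date(2020, 1, 18)
dates = [(start + datetime.timedelta(days=d)).isoformat() for d in range(14)]
toy = pd.DataFrame([{'ID': cid, 'DATE': d, 'FINAL': val}
                    for cid, val in [('a', 1.0), ('b', 2.0)]
                    for d in dates])
submission = to_submission(toy, dates)
```

The result is indexed by ID and sorted by it, which matches the ID ordering produced by `sort_values(['ID','DATE'])` in the cells above.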

In [None]:
test = pd.read_pickle('../data/test.pkl')
error = compute_error(final_df2, test)
print('Mean between XGBR and GBR:', round(error,2))

### Train and test sets for weekly predictions instead of adding daily predictions

In [None]:
path = '../data/df6.pkl'

df = pd.read_pickle(path)
start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 17)
week_train = df[df['DATE'].isin(get_date_range(start_date, end_date))]
week_train['YEAR_WEEK'] = (week_train['YEAR_DAY']-1)//7
week_train = week_train[week_train['YEAR_WEEK']!=-1]
week_train.drop(['YEAR_DAY','IS_WEEKEND','WEEKDAY','sin_WEEKDAY','cos_WEEKDAY',
                 'sin_year_day','cos_year_day','IS_GOOD','DATE'], axis=1, inplace = True)

week_train = week_train.groupby(['YEAR_WEEK','ID']).agg({ 
                                     'DELTA':sum,
                                     'PRECIPITATIONS':np.mean,
                                     'MIN_TEMP':min,
                                     'MEAN_TEMP':np.mean,
                                     'MAX_TEMP':max,
                                     'SUN':np.mean,
                                     'MEAN_CONSUMPTION':np.mean,
                                     'VARIANCE_CONSUMPTION':np.mean}).reset_index()

weeks_in_a_year = 50
week_train['sin_YEAR_WEEK'] = np.sin(2*np.pi*week_train['YEAR_WEEK']/weeks_in_a_year)
week_train['cos_YEAR_WEEK'] = np.cos(2*np.pi*week_train['YEAR_WEEK']/weeks_in_a_year) 
week_train.head()
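
The sin/cos pair places each week on a circle so that the last week of the cycle ends up next to the first one. The sketch below illustrates this with hypothetical helpers (`cyclic_encode`, `cyclic_distance`) and a period of 52, which is an assumption. It also shows that when the period is smaller than the largest week index, as with `weeks_in_a_year = 50` and week indices 50 and 51 used later in this notebook, week 50 receives the same encoding as week 0; whether that is intended is not stated here.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def cyclic_distance(a, b, period):
    """Euclidean distance between two values after encoding."""
    (sa, sb), (ca, cb) = cyclic_encode([a, b], period)
    return float(np.hypot(sa - sb, ca - cb))

near = cyclic_distance(0, 51, 52)     # neighbouring weeks across the wrap: small
far = cyclic_distance(0, 26, 52)      # half a cycle apart: the maximum, 2
wrapped = cyclic_distance(0, 50, 50)  # period 50: week 50 coincides with week 0
```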

In [None]:
X_train = week_train.drop(['DELTA','YEAR_WEEK'], axis=1)
y_train = week_train['DELTA']

model1 = xgb.XGBRegressor(
    n_estimators=1000,
    reg_lambda=1,
    gamma=0,
    max_depth=8
)
model2 = GradientBoostingRegressor()

print('Fitting XGB...')
model1.fit(X_train, y_train)
print('Fitting GB...')
model2.fit(X_train, y_train)
print('End fitting.')

In [None]:
path = '../data/df6.pkl'

df = pd.read_pickle(path)
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
week_test = df[df['DATE'].isin(get_date_range(start_date, end_date))]
week_test['YEAR_WEEK'] = (week_test['YEAR_DAY']-1)//7
week_test = week_test[week_test['YEAR_WEEK']!=-1]
week_test.drop(['YEAR_DAY','IS_WEEKEND','WEEKDAY','sin_WEEKDAY','cos_WEEKDAY',
                 'sin_year_day','cos_year_day','IS_GOOD','DATE'], axis=1, inplace = True)

week_test = week_test.groupby(['YEAR_WEEK','ID']).agg({ 
                                     'DELTA':sum,
                                     'PRECIPITATIONS':np.mean,
                                     'MIN_TEMP':min,
                                     'MEAN_TEMP':np.mean,
                                     'MAX_TEMP':max,
                                     'SUN':np.mean,
                                     'MEAN_CONSUMPTION':np.mean,
                                     'VARIANCE_CONSUMPTION':np.mean}).reset_index()

weeks_in_a_year = 50
week_test['sin_YEAR_WEEK'] = np.sin(2*np.pi*week_test['YEAR_WEEK']/weeks_in_a_year)
week_test['cos_YEAR_WEEK'] = np.cos(2*np.pi*week_test['YEAR_WEEK']/weeks_in_a_year) 

X_test = week_test.drop(['DELTA','YEAR_WEEK'], axis=1)
y_test = week_test['DELTA']

In [None]:
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)

In [None]:
week_test['prediction1'] = y_pred1
week_test['prediction2'] = y_pred2
week_test['prediction'] = (week_test['prediction1'] + week_test['prediction2'])/2

In [None]:
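# .values avoids index alignment: the week-51 rows of week_test do not share final_df2's 0..n-1 index,
# so assigning the Series directly would fill Semana_2 with NaN.
# Assumes week_test and final_df2 list the same counters in the same (ID-sorted) order.
final_df2['Semana_1'] = week_test[week_test['YEAR_WEEK']==50]['prediction'].values
final_df2['Semana_2'] = week_test[week_test['YEAR_WEEK']==51]['prediction'].values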

test = pd.read_pickle('../data/test.pkl')
error = compute_error(final_df2, test)
print('Mean between XGBR and GBR:', round(error,2))

In [None]:
week_test

### Including lag_7

In [None]:
path = '../data/df6.pkl'

df = pd.read_pickle(path)
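# Lags are computed per counter; a plain shift over the whole frame would mix different IDs.
# Assumes one row per ID and day, so after sorting, 7 rows back means 7 days back.
df = df.sort_values(['ID','DATE'])
df['LAG_7'] = df.groupby('ID')['DELTA'].shift(7, fill_value=0)
df['LAG_14'] = df.groupby('ID')['DELTA'].shift(14, fill_value=0)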

start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 17)
train = df[df['DATE'].isin(get_date_range(start_date, end_date))]
train = train[train['IS_GOOD']==1]
train.drop(['YEAR_DAY','WEEKDAY','IS_GOOD','DATE'], axis=1, inplace=True)
train['SUN'] = train['SUN'].fillna(train['SUN'].mean())
train['PRECIPITATIONS'] = train['PRECIPITATIONS'].fillna(train['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 24)
test = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test = test[test['IS_GOOD']==1]
test.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test['SUN'] = test['SUN'].fillna(test['SUN'].mean())
test['PRECIPITATIONS'] = test['PRECIPITATIONS'].fillna(test['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 25)
end_date = datetime.date(2020, 1, 31)
test_2 = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test_2 = test_2[test_2['IS_GOOD']==1]
test_2.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test_2['SUN'] = test_2['SUN'].fillna(test_2['SUN'].mean())
test_2['PRECIPITATIONS'] = test_2['PRECIPITATIONS'].fillna(test_2['PRECIPITATIONS'].mean())

print('Train:', train.shape, 'Test:', test.shape, 'Test 2:', test_2.shape)

X_train = train.drop(['DELTA'], axis=1)
y_train = train['DELTA']

X_test = test.drop(['DELTA', 'DATE'], axis=1)
X_test_2 = test_2.drop(['DELTA', 'DATE'], axis=1)
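
LAG_7 and LAG_14 are meant to be the same counter's DELTA 7 and 14 days earlier. The toy sketch below, with a hypothetical helper `add_lags` and a lag of 1 for brevity, shows why grouping by ID matters compared with a shift over the whole frame. It assumes one row per counter and day with no gaps.

```python
import pandas as pd

def add_lags(df, lags=(7, 14), id_col='ID', date_col='DATE', target='DELTA'):
    """Add LAG_<n> columns holding each counter's own target value n rows (days) earlier."""
    out = df.sort_values([id_col, date_col]).copy()
    for lag in lags:
        out[f'LAG_{lag}'] = out.groupby(id_col)[target].shift(lag, fill_value=0)
    return out

toy = pd.DataFrame({
    'ID':    ['a', 'a', 'a', 'b', 'b', 'b'],
    'DATE':  ['2020-01-01', '2020-01-02', '2020-01-03'] * 2,
    'DELTA': [1, 2, 3, 10, 20, 30],
})
lagged = add_lags(toy, lags=(1,))
per_counter = lagged['LAG_1'].tolist()                     # b starts again from 0
whole_frame = toy['DELTA'].shift(1, fill_value=0).tolist() # b's first row inherits a's last value
```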

In [None]:
model1 = xgb.XGBRegressor(
    n_estimators=1000,
    reg_lambda=1,
    gamma=0,
    max_depth=8
)

model2 = GradientBoostingRegressor()

print('Fitting XGB...')
model1.fit(X_train, y_train)
print('Fitting GB...')
model2.fit(X_train, y_train)
print('End fitting.')

In [None]:
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)

results_df = pd.DataFrame.from_dict({'ID':test['ID'].values, 
                                     'DATE':test['DATE'].values,
                                     'y_pred1':y_pred1,
                                     'y_pred2':y_pred2})
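results_df['FINAL'] = results_df[['y_pred1','y_pred2']].mean(axis=1)

# Week 2 has no real LAG_7 at prediction time, so the week-1 predictions are used instead.
# Assigned before sorting so the rows keep the order of test, assumed to match test_2 row for row.
X_test_2['LAG_7'] = results_df['FINAL'].values
results_df = results_df.sort_values(['ID','DATE'])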

y_pred1_2 = model1.predict(X_test_2)
y_pred2_2 = model2.predict(X_test_2)

# Use the week-2 predictions here (y_pred1_2, y_pred2_2), not the week-1 ones
results_df_2 = pd.DataFrame.from_dict({'ID':test_2['ID'].values, 
                                     'DATE':test_2['DATE'].values,
                                     'y_pred1':y_pred1_2,
                                     'y_pred2':y_pred2_2})
results_df_2 = results_df_2.sort_values(['ID','DATE'])
results_df_2['FINAL'] = results_df_2[['y_pred1','y_pred2']].mean(axis=1)
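
The cell above feeds the week-1 predictions into week 2 as LAG_7 by position, which relies on both frames sharing the same row order. A more explicit alternative is to match on ID and on the date seven days later. The sketch below uses a hypothetical helper `lag7_from_predictions`, toy data, and the assumption that DATE holds ISO-format strings.

```python
import datetime
import pandas as pd

def lag7_from_predictions(week1_pred, week2_rows):
    """For each row of week2_rows, return the FINAL prediction of the same ID seven days earlier."""
    shifted = week1_pred[['ID', 'DATE', 'FINAL']].copy()
    # Move each week-1 date forward 7 days so it lines up with the week-2 date it feeds
    shifted['DATE'] = [
        (datetime.date.fromisoformat(d) + datetime.timedelta(days=7)).isoformat()
        for d in shifted['DATE']
    ]
    # A left merge keeps the row order of week2_rows
    merged = week2_rows[['ID', 'DATE']].merge(shifted, on=['ID', 'DATE'], how='left')
    return merged['FINAL'].to_numpy()

# Week-1 predictions ordered by (ID, DATE); week-2 rows ordered by (DATE, ID) on purpose
week1 = pd.DataFrame({
    'ID':    ['a', 'a', 'b', 'b'],
    'DATE':  ['2020-01-18', '2020-01-19', '2020-01-18', '2020-01-19'],
    'FINAL': [1.0, 2.0, 3.0, 4.0],
})
week2 = pd.DataFrame({
    'ID':   ['a', 'b', 'a', 'b'],
    'DATE': ['2020-01-25', '2020-01-25', '2020-01-26', '2020-01-26'],
})
lag7 = lag7_from_predictions(week1, week2)
```

Rows without a matching week-1 prediction would come back as NaN, which makes missing counters visible instead of silently misaligned.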

In [None]:
results_df = pd.concat([results_df, results_df_2])

In [None]:
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
fechas_test = get_date_range(start_date, end_date)

ID = []
Dia_1 = []
Dia_2 = []
Dia_3 = []
Dia_4 = []
Dia_5 = []
Dia_6 = []
Dia_7 = []
for i, fecha in enumerate(fechas_test[0:7]):
    aux = results_df[results_df['DATE']==fecha]
    ID = list(aux['ID'].values)
    if i==0:
        Dia_1 += list(aux['FINAL'].values)
    if i==1:
        Dia_2 += list(aux['FINAL'].values)
    if i==2:
        Dia_3 += list(aux['FINAL'].values)
    if i==3:
        Dia_4 += list(aux['FINAL'].values)
    if i==4:
        Dia_5 += list(aux['FINAL'].values)
    if i==5:
        Dia_6 += list(aux['FINAL'].values)
    if i==6:
        Dia_7 += list(aux['FINAL'].values)
print(len(ID),len(Dia_1))
final_df = pd.DataFrame.from_dict({'ID':ID,
                                   'Dia_1':Dia_1,
                                  'Dia_2':Dia_2,
                                  'Dia_3':Dia_3,
                                  'Dia_4':Dia_4,
                                  'Dia_5':Dia_5,
                                  'Dia_6':Dia_6,
                                  'Dia_7':Dia_7,})

ID = []
Dia_8 = []
Dia_9 = []
Dia_10 = []
Dia_11 = []
Dia_12 = []
Dia_13 = []
Dia_14 = []
for i, fecha in enumerate(fechas_test[7:14]):
    aux = results_df[results_df['DATE']==fecha]
    ID = list(aux['ID'].values)
    if i==0:
        Dia_8 += list(aux['FINAL'].values)
    if i==1:
        Dia_9 += list(aux['FINAL'].values)
    if i==2:
        Dia_10 += list(aux['FINAL'].values)
    if i==3:
        Dia_11 += list(aux['FINAL'].values)
    if i==4:
        Dia_12 += list(aux['FINAL'].values)
    if i==5:
        Dia_13 += list(aux['FINAL'].values)
    if i==6:
        Dia_14 += list(aux['FINAL'].values)
print(len(ID),len(Dia_11))
final_df2 = pd.DataFrame.from_dict({'ID':ID,
                                   'Dia_8':Dia_8,
                                  'Dia_9':Dia_9,
                                  'Dia_10':Dia_10,
                                  'Dia_11':Dia_11,
                                  'Dia_12':Dia_12,
                                  'Dia_13':Dia_13,
                                  'Dia_14':Dia_14,})

final_df['Semana_1'] = final_df[['Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7']].sum(axis=1)
final_df['Semana_2'] = final_df2[['Dia_8','Dia_9','Dia_10','Dia_11','Dia_12','Dia_13','Dia_14']].sum(axis=1)

final_df2 = final_df.drop('ID', axis=1)

In [None]:
test = pd.read_pickle('../data/test.pkl')
error = compute_error(final_df2, test)
print('Mean between XGBR and GBR:', round(error,2))

### Feature selection

In [None]:
for i in zip(X_train.columns, model1.feature_importances_):
    print(i)

In [None]:
best_vars = ['sin_WEEKDAY', 'MIN_TEMP', 'LAG_7']
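
The list above was picked by reading the printed importances. A small sketch of doing the same selection programmatically follows; `top_features` is a hypothetical helper and the importance values are invented for illustration, not taken from the fitted model.

```python
import numpy as np

def top_features(names, importances, k=3):
    """Return the k feature names with the largest importance, most important first."""
    order = np.argsort(importances)[::-1][:k]
    return [names[i] for i in order]

names = ['sin_WEEKDAY', 'MIN_TEMP', 'LAG_7', 'SUN']
importances = np.array([0.20, 0.15, 0.60, 0.05])   # hypothetical values
best = top_features(names, importances, k=3)
```

In the notebook the inputs would be `list(X_train.columns)` and `model1.feature_importances_`. Since the selection and the reported error rely on the same local test weeks, a large gain is worth re-checking on a separate period.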

In [None]:

path = '../data/df6.pkl'

df = pd.read_pickle(path)
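# Lags are computed per counter; a plain shift over the whole frame would mix different IDs.
# Assumes one row per ID and day, so after sorting, 7 rows back means 7 days back.
df = df.sort_values(['ID','DATE'])
df['LAG_7'] = df.groupby('ID')['DELTA'].shift(7, fill_value=0)
df['LAG_14'] = df.groupby('ID')['DELTA'].shift(14, fill_value=0)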

start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 17)
train = df[df['DATE'].isin(get_date_range(start_date, end_date))]
train = train[train['IS_GOOD']==1]
train.drop(['YEAR_DAY','WEEKDAY','IS_GOOD','DATE'], axis=1, inplace=True)
train['SUN'] = train['SUN'].fillna(train['SUN'].mean())
train['PRECIPITATIONS'] = train['PRECIPITATIONS'].fillna(train['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 24)
test = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test = test[test['IS_GOOD']==1]
test.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test['SUN'] = test['SUN'].fillna(test['SUN'].mean())
test['PRECIPITATIONS'] = test['PRECIPITATIONS'].fillna(test['PRECIPITATIONS'].mean())

start_date = datetime.date(2020, 1 , 25)
end_date = datetime.date(2020, 1, 31)
test_2 = df[df['DATE'].isin(get_date_range(start_date, end_date))]
test_2 = test_2[test_2['IS_GOOD']==1]
test_2.drop(['YEAR_DAY','WEEKDAY','IS_GOOD'], axis=1, inplace=True)
test_2['SUN'] = test_2['SUN'].fillna(test_2['SUN'].mean())
test_2['PRECIPITATIONS'] = test_2['PRECIPITATIONS'].fillna(test_2['PRECIPITATIONS'].mean())

print('Train:', train.shape, 'Test:', test.shape, 'Test 2:', test_2.shape)

X_train = train.drop(['DELTA'], axis=1)
y_train = train['DELTA']

X_test = test.drop(['DELTA', 'DATE'], axis=1)
X_test_2 = test_2.drop(['DELTA', 'DATE'], axis=1)

In [None]:
model1 = xgb.XGBRegressor(
    n_estimators=1000,
    reg_lambda=1,
    gamma=0,
    max_depth=8
)

model2 = GradientBoostingRegressor()

print('Fitting XGB...')
model1.fit(X_train[best_vars], y_train)
print('Fitting GB...')
model2.fit(X_train[best_vars], y_train)
print('End fitting.')

In [None]:
y_pred1 = model1.predict(X_test[best_vars])
y_pred2 = model2.predict(X_test[best_vars])

results_df = pd.DataFrame.from_dict({'ID':test['ID'].values, 
                                     'DATE':test['DATE'].values,
                                     'y_pred1':y_pred1,
                                     'y_pred2':y_pred2})
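results_df['FINAL'] = results_df[['y_pred1','y_pred2']].mean(axis=1)

# Week 2 has no real LAG_7 at prediction time, so the week-1 predictions are used instead.
# Assigned before sorting so the rows keep the order of test, assumed to match test_2 row for row.
X_test_2['LAG_7'] = results_df['FINAL'].values
results_df = results_df.sort_values(['ID','DATE'])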

y_pred1_2 = model1.predict(X_test_2[best_vars])
y_pred2_2 = model2.predict(X_test_2[best_vars])

# Use the week-2 predictions here (y_pred1_2, y_pred2_2), not the week-1 ones
results_df_2 = pd.DataFrame.from_dict({'ID':test_2['ID'].values, 
                                     'DATE':test_2['DATE'].values,
                                     'y_pred1':y_pred1_2,
                                     'y_pred2':y_pred2_2})
results_df_2 = results_df_2.sort_values(['ID','DATE'])
results_df_2['FINAL'] = results_df_2[['y_pred1','y_pred2']].mean(axis=1)

In [None]:
results_df = pd.concat([results_df, results_df_2])

In [None]:
# TODO: this definitely needs to be written more cleanly
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
fechas_test = get_date_range(start_date, end_date)

final_df = pd.DataFrame(columns=['ID','Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7'], index=range(2653))

final_df['ID'] = results_df[results_df['DATE']==fechas_test[0]]['ID'].values
final_df['Dia_1'] = results_df[results_df['DATE']==fechas_test[0]]['FINAL'].values
final_df['Dia_2'] = results_df[results_df['DATE']==fechas_test[1]]['FINAL'].values
final_df['Dia_3'] = results_df[results_df['DATE']==fechas_test[2]]['FINAL'].values
final_df['Dia_4'] = results_df[results_df['DATE']==fechas_test[3]]['FINAL'].values
final_df['Dia_5'] = results_df[results_df['DATE']==fechas_test[4]]['FINAL'].values
final_df['Dia_6'] = results_df[results_df['DATE']==fechas_test[5]]['FINAL'].values
final_df['Dia_7'] = results_df[results_df['DATE']==fechas_test[6]]['FINAL'].values
final_df['Semana_1'] = final_df[['Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7']].sum(axis=1)

ID = []
Dia_8 = []
Dia_9 = []
Dia_10 = []
Dia_11 = []
Dia_12 = []
Dia_13 = []
Dia_14 = []
for i, fecha in enumerate(fechas_test[7:14]):
    aux = results_df[results_df['DATE']==fecha]
    ID = list(aux['ID'].values)
    if i==0:
        Dia_8 += list(aux['FINAL'].values)
    if i==1:
        Dia_9 += list(aux['FINAL'].values)
    if i==2:
        Dia_10 += list(aux['FINAL'].values)
    if i==3:
        Dia_11 += list(aux['FINAL'].values)
    if i==4:
        Dia_12 += list(aux['FINAL'].values)
    if i==5:
        Dia_13 += list(aux['FINAL'].values)
    if i==6:
        Dia_14 += list(aux['FINAL'].values)
print(len(ID),len(Dia_11))
final_df2 = pd.DataFrame.from_dict({'ID':ID,
                                   'Dia_8':Dia_8,
                                  'Dia_9':Dia_9,
                                  'Dia_10':Dia_10,
                                  'Dia_11':Dia_11,
                                  'Dia_12':Dia_12,
                                  'Dia_13':Dia_13,
                                  'Dia_14':Dia_14,})

final_df['Semana_1'] = final_df[['Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7']].sum(axis=1)
final_df['Semana_2'] = final_df2[['Dia_8','Dia_9','Dia_10','Dia_11','Dia_12','Dia_13','Dia_14']].sum(axis=1)

final_df2 = final_df.drop('ID', axis=1)

In [None]:
test = pd.read_pickle('../data/test.pkl')
error = compute_error(final_df2, test)
print('Mean between XGBR and GBR:', round(error,2))

In [None]:
results_df.head()

In [None]:
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
fechas_test = get_date_range(start_date, end_date)

results_df[results_df['DATE']==fechas_test[0]]

In [None]:
start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 31)
fechas_test = get_date_range(start_date, end_date)

final_df = pd.DataFrame(columns=['ID','Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7'], index=range(2653))

final_df['ID'] = results_df[results_df['DATE']==fechas_test[0]]['ID'].values
final_df['Dia_1'] = results_df[results_df['DATE']==fechas_test[0]]['FINAL'].values
final_df['Dia_2'] = results_df[results_df['DATE']==fechas_test[1]]['FINAL'].values
final_df['Dia_3'] = results_df[results_df['DATE']==fechas_test[2]]['FINAL'].values
final_df['Dia_4'] = results_df[results_df['DATE']==fechas_test[3]]['FINAL'].values
final_df['Dia_5'] = results_df[results_df['DATE']==fechas_test[4]]['FINAL'].values
final_df['Dia_6'] = results_df[results_df['DATE']==fechas_test[5]]['FINAL'].values
final_df['Dia_7'] = results_df[results_df['DATE']==fechas_test[6]]['FINAL'].values
final_df['Semana_1'] = final_df[['Dia_1','Dia_2','Dia_3','Dia_4','Dia_5','Dia_6','Dia_7']].sum(axis=1)