This notebook builds a test set (essentially the last week of January) and provides some utilities:

- A function that converts a prediction into the format required for submission
- A function that, given a prediction, returns the metric on the set used as test (here, the last week of January)

In [40]:
import pandas as pd
import copy

In [32]:
df = pd.read_pickle('../data/data_v1.pkl') 
df.head()

Unnamed: 0,ID,SAMPLETIME,DELTA,READING,DATE,TIME
0,0,2019-06-13 08:34:09,17.0,369320.0,2019-06-13,08:34:09
1,0,2019-06-13 17:34:10,2.0,369403.0,2019-06-13,17:34:10
2,0,2019-06-13 18:34:10,0.0,369403.0,2019-06-13,18:34:10
3,0,2019-06-13 04:34:10,1.0,369284.0,2019-06-13,04:34:10
4,0,2019-06-13 14:34:10,28.0,369356.0,2019-06-13,14:34:10


In [33]:
df = df.drop(['SAMPLETIME','TIME','READING'],axis=1)
df = df.groupby(['ID','DATE']).sum().reset_index()
df.head()

Unnamed: 0,ID,DATE,DELTA
0,0,2019-02-01,243.0
1,0,2019-02-02,236.0
2,0,2019-02-03,335.0
3,0,2019-02-04,252.0
4,0,2019-02-05,220.0
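
The aggregation step above can be illustrated on a toy frame. This is a minimal sketch with made-up readings (the values below are not taken from `data_v1.pkl`); it only shows how several DELTA readings for the same ID and DATE collapse into one daily total.

```python
import pandas as pd

# Hypothetical readings: counter 0 has two readings on 2019-06-13.
readings = pd.DataFrame({
    'ID': [0, 0, 0, 1],
    'DATE': ['2019-06-13', '2019-06-13', '2019-06-14', '2019-06-13'],
    'DELTA': [17.0, 2.0, 1.0, 4.0],
})

# Same call as in the cell above: one row per (ID, DATE) with the summed DELTA.
daily = readings.groupby(['ID', 'DATE']).sum().reset_index()
print(daily)
```

After this step there is at most one row per ID and DATE, which is the precondition that `get_set_by_days` checks.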


In [3]:
import datetime
from tqdm import tqdm

def get_date_range(start_date, end_date):
    '''
    Given a start date "start_date" and an end date "end_date" (both datetime.date objects), return a list of
    ISO-format date strings covering every day from "start_date" to "end_date", both inclusive.

    Example:

    start_date = datetime.date(2019, 9, 30)
    end_date = datetime.date(2019, 10, 7)
    get_date_range(start_date, end_date)
    '''
    number_of_days = (end_date-start_date).days
    return [(start_date + datetime.timedelta(days = day)).isoformat() for day in range(number_of_days+1)]

def get_set_by_days(df, start_date, end_date):
    '''
    Given a df with at most one entry per ID and DATE, a start_date and an end_date, return a dictionary with one
    'ID' key plus one key per day between start_date and end_date (inclusive). The 'ID' key maps to the list of IDs;
    each day key maps to the list of DELTA values for that day, in the same order as the IDs. If there is no DELTA
    for some ID and DATE, None is stored for that position. Execution is slow because the whole df is filtered
    once per ID and day.

    Example:

    start_date = datetime.date(2019, 9, 30)
    end_date = datetime.date(2019, 10, 7)
    get_set_by_days(df, start_date, end_date)
    '''
    list_of_days = get_date_range(start_date, end_date)
    test_dict = {'ID':[]}
    for i in tqdm(df['ID'].unique()):
        test_dict['ID'].append(i)
        for day in list_of_days:
            filtered_df = df[(df['ID']==i) & (df['DATE']==day)]
            # check that there is at most one entry per ID and DATE
            if len(filtered_df)>1:
                raise Exception("More than one delta value per ID and DATE")
            # if there is exactly one entry, append its DELTA value
            elif len(filtered_df)==1:
                delta = filtered_df.iloc[0]['DELTA']
                if day not in test_dict:
                    test_dict[day] = [delta]
                else:
                    test_dict[day].append(delta)
            # if there is no entry, append None to mark the missing value
            else:
                if day not in test_dict:
                    test_dict[day] = [None]
                else:
                    test_dict[day].append(None)
    return test_dict
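
As a quick check of the date helper, the sketch below repeats `get_date_range` so that it runs on its own and prints a short range. Note that the day keys are ISO strings; since `get_set_by_days` compares them with `df['DATE']`, this presumably requires the DATE column to hold strings in the same `YYYY-MM-DD` format (an assumption inferred from the comparison, not stated in the notebook).

```python
import datetime

def get_date_range(start_date, end_date):
    # Same logic as the notebook helper: both endpoints are included.
    number_of_days = (end_date - start_date).days
    return [(start_date + datetime.timedelta(days=day)).isoformat()
            for day in range(number_of_days + 1)]

days = get_date_range(datetime.date(2019, 9, 30), datetime.date(2019, 10, 2))
print(days)  # ['2019-09-30', '2019-10-01', '2019-10-02']
```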

In [84]:
start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 31)
training_dict = get_set_by_days(df, start_date, end_date)

100%|███████████████████████████████████████████████████████████████████████████| 2747/2747 [16:11:46<00:00, 21.23s/it]
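
The progress bar above reports roughly 16 hours, which comes from filtering the full dataframe once per ID and day. A possible faster alternative is to reshape the frame in one step with `DataFrame.pivot`. The function below is a hypothetical sketch, not part of the original notebook: the name `get_set_by_days_pivot` is made up, it returns a wide dataframe rather than a dictionary, and it assumes DATE holds ISO-format strings. Missing ID/DATE pairs come out as NaN, and `pivot` raises a `ValueError` if an ID has more than one row for the same DATE.

```python
import datetime
import pandas as pd

def get_date_range(start_date, end_date):
    number_of_days = (end_date - start_date).days
    return [(start_date + datetime.timedelta(days=day)).isoformat()
            for day in range(number_of_days + 1)]

def get_set_by_days_pivot(df, start_date, end_date):
    # Hypothetical replacement for get_set_by_days (returns a DataFrame).
    days = get_date_range(start_date, end_date)
    # One row per ID, one column per DATE; duplicates raise ValueError.
    wide = df.pivot(index='ID', columns='DATE', values='DELTA')
    # Keep every ID and exactly the requested days; gaps become NaN.
    wide = wide.reindex(index=pd.Index(df['ID'].unique(), name='ID'),
                        columns=days)
    wide.columns.name = None
    return wide.reset_index()

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'ID': [0, 0, 1],
    'DATE': ['2020-01-01', '2020-01-02', '2020-01-02'],
    'DELTA': [10.0, 20.0, 5.0],
})
wide = get_set_by_days_pivot(toy, datetime.date(2020, 1, 1),
                             datetime.date(2020, 1, 3))
print(wide)
```

The resulting frame has the same layout as `pd.DataFrame.from_dict(training_dict)`: an ID column followed by one column per day.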


In [85]:
all_formated = pd.DataFrame.from_dict(training_dict)

In [86]:
all_formated.to_pickle("../data/counters_in_rows.pkl")

## Test set construction and error computation

In [52]:
df = pd.read_pickle('../data/df6.pkl') 
df=df[['ID','DELTA','DATE']]
start_date = datetime.date(2020,  1, 18)
end_date = datetime.date(2020, 1, 31)
training_dict = get_set_by_days(df, start_date, end_date)
test = pd.DataFrame.from_dict(training_dict)
test.set_index('ID', inplace = True)

start_date = datetime.date(2020, 1 , 18)
end_date = datetime.date(2020, 1, 24)
first_week = get_date_range(start_date, end_date)

start_date = datetime.date(2020, 1 , 25)
end_date = datetime.date(2020, 1, 31)
second_week = get_date_range(start_date, end_date)

test['first_week'] = test[first_week].sum(axis=1)
test['second_week'] = test[second_week].sum(axis=1) 

test.drop(second_week, axis=1, inplace=True)
print(test.shape)
print(test.head())
test.to_pickle('../data/test.pkl')

100%|██████████████████████████████████████████████████████████████████████████████| 2747/2747 [04:08<00:00, 11.06it/s]

(2747, 9)
    2020-01-18  2020-01-19  2020-01-20  2020-01-21  2020-01-22  2020-01-23  \
ID                                                                           
0        421.0       273.0       306.0       292.0       460.0       331.0   
1          0.0       216.0        14.0         3.0         0.0         3.0   
2         28.0        33.0        48.0        35.0        33.0        20.0   
3        485.0       394.0       237.0       297.0       312.0       321.0   
4        365.0       387.0       370.0       293.0       287.0       361.0   

    2020-01-24  first_week  second_week  
ID                                       
0       368.00     2451.00      2222.00  
1         5.67      241.67        54.33  
2        37.00      234.00       272.00  
3       439.00     2485.00      2792.00  
4       203.00     2266.00      2216.00  
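
The weekly aggregation in the cell above can be sketched on a toy frame. To keep the example short, each "week" here has only two days, and all values are made up. It also shows that `sum(axis=1)` skips missing values by default, so a NaN day contributes nothing to the weekly total.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by counter ID, one column per day (hypothetical values).
toy = pd.DataFrame(
    {'2020-01-18': [1.0, 2.0], '2020-01-19': [3.0, np.nan],
     '2020-01-25': [5.0, 6.0], '2020-01-26': [7.0, 8.0]},
    index=pd.Index([0, 1], name='ID'),
)
first_week = ['2020-01-18', '2020-01-19']
second_week = ['2020-01-25', '2020-01-26']

# Row-wise totals per week; NaN entries are skipped by default.
toy['first_week'] = toy[first_week].sum(axis=1)
toy['second_week'] = toy[second_week].sum(axis=1)

# Daily columns are kept only for the first week, as in the notebook.
toy = toy.drop(second_week, axis=1)
print(toy)
```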





In [47]:
def compute_error(pred, real):
    '''
    This function expects two dataframes with the same format: each of the first seven columns corresponds to a date
    and each row corresponds to a counter index, so position i,j holds the DELTA of counter i on date j.
    The last two columns do not refer to a daily prediction but to the aggregated predictions
    of week_1 and week_2. Given these two dataframes (one for the prediction and one for the real values),
    the function returns the error according to the competition rules.

    Examples:

    import pandas as pd
    import copy

    test = pd.read_pickle('../data/test.pkl')

    compute_error(test, test)

    test_v3 = copy.copy(test)
    test_v3.iloc[:,0] = test_v3.iloc[:,1]
    compute_error(test_v3, test)
    '''
    daily_rmses = []
    for i in range(7):
        daily_rmses.append((((real.iloc[:,i] - pred.iloc[:,i])**2/len(real.iloc[:,i])).sum())**(1/2))
    rmse_1 = sum(daily_rmses)/7
    
    first_week_pred_sum = pred.iloc[:,7].sum()
    second_week_pred_sum = pred.iloc[:,8].sum()
    first_week_real_sum = real.iloc[:,7].sum()
    second_week_real_sum = real.iloc[:,8].sum()
    
    first_week_rmse = (((first_week_real_sum - first_week_pred_sum)**2)/len(real.iloc[:,7]))**(1/2)
    second_week_rmse = (((second_week_real_sum - second_week_pred_sum)**2)/len(real.iloc[:,8]))**(1/2)
    rmse_2 = (first_week_rmse + second_week_rmse)/2
    
    return (rmse_1 + rmse_2)/2
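
Written out, the quantity computed by `compute_error` is the following. This is a transcription of the code above, not of the official competition rules, which are not reproduced in this notebook. The symbols are introduced here for illustration: $N$ is the number of counters (rows), $y_{ij}$ and $\hat{y}_{ij}$ are the real and predicted DELTA of counter $i$ on day $j$, and $Y_{iw}$ and $\hat{Y}_{iw}$ are the real and predicted aggregates of counter $i$ for week $w$.

```latex
\mathrm{RMSE}_1 = \frac{1}{7}\sum_{j=1}^{7}
  \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{ij}-\hat{y}_{ij}\right)^{2}}

\mathrm{RMSE}_2 = \frac{1}{2}\sum_{w=1}^{2}
  \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N} Y_{iw}-\sum_{i=1}^{N}\hat{Y}_{iw}\right)^{2}}

\mathrm{error} = \frac{\mathrm{RMSE}_1+\mathrm{RMSE}_2}{2}
```

Note that the weekly term compares the totals over all counters, so per-counter errors of opposite sign can cancel in $\mathrm{RMSE}_2$.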

In [None]:
def to_txt(pred, path = './data/predicition.txt'):
    '''
    This function expects one dataframe with a certain format: each of the first seven columns corresponds to a date
    and each row corresponds to a counter index, so position i,j holds the DELTA of counter i on date j.
    The last two columns do not refer to a daily prediction but to the aggregated predictions
    of week_1 and week_2. The dataframe is saved to a txt file in the format required by the competition rules.

    DISCLAIMER:
    Although the dataframe is saved in csv format, the file must use the .txt extension.
    '''
    pred.to_csv(path, sep = '|', header=False, index=False)
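
To see what the written file looks like, the sketch below applies the same `to_csv` options to a tiny made-up prediction and reads the file back. The temporary path and the values are assumptions for illustration only. Because of `index=False`, the counter ID is not written, so the row order is the only link back to each counter.

```python
import os
import tempfile
import pandas as pd

# Hypothetical prediction with three value columns and ID as index.
pred = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.5, 5.0, 6.0]],
    index=pd.Index([0, 1], name='ID'),
)

# Write to a temporary location with the same options as to_txt.
path = os.path.join(tempfile.mkdtemp(), 'prediction.txt')
pred.to_csv(path, sep='|', header=False, index=False)

with open(path) as fh:
    lines = fh.read().splitlines()
print(lines)  # ['1.0|2.0|3.0', '4.5|5.0|6.0']
```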