This notebook builds a test set (essentially the last week of January) and provides a few utilities:

- A function that converts a prediction into the format required for submission
- A function that, given a prediction, returns the metric on the set used as the test set (here, the last week of January)
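
As a rough illustration of holding out the last week of January, the sketch below splits a daily table into training and test rows. It assumes `DATE` holds ISO-formatted strings (so string comparison follows chronological order) and that the test week runs from 2020-01-25 to 2020-01-31; the helper name `split_train_test` and both constants are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Assumed bounds of the test week (last seven days of January 2020)
TEST_START = "2020-01-25"
TEST_END = "2020-01-31"

def split_train_test(daily_df, test_start=TEST_START, test_end=TEST_END):
    """Hypothetical helper: rows before the test week go to train, rows inside it to test."""
    # ISO date strings sort lexicographically in chronological order
    in_test = daily_df["DATE"].between(test_start, test_end)  # inclusive on both ends
    before_test = daily_df["DATE"] < test_start
    return daily_df[before_test], daily_df[in_test]

# Tiny usage example with made-up values
example = pd.DataFrame({
    "ID": [0, 0, 0],
    "DATE": ["2020-01-24", "2020-01-25", "2020-01-31"],
    "DELTA": [1.0, 2.0, 3.0],
})
train_df, test_df = split_train_test(example)
```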

In [32]:
import pandas as pd

df = pd.read_pickle('../data/data_v1.pkl') 
df.head()

Unnamed: 0,ID,SAMPLETIME,DELTA,READING,DATE,TIME
0,0,2019-06-13 08:34:09,17.0,369320.0,2019-06-13,08:34:09
1,0,2019-06-13 17:34:10,2.0,369403.0,2019-06-13,17:34:10
2,0,2019-06-13 18:34:10,0.0,369403.0,2019-06-13,18:34:10
3,0,2019-06-13 04:34:10,1.0,369284.0,2019-06-13,04:34:10
4,0,2019-06-13 14:34:10,28.0,369356.0,2019-06-13,14:34:10


In [33]:
df = df.drop(['SAMPLETIME','TIME','READING'],axis=1)
df = df.groupby(['ID','DATE']).sum().reset_index()
df.head()

Unnamed: 0,ID,DATE,DELTA
0,0,2019-02-01,243.0
1,0,2019-02-02,236.0
2,0,2019-02-03,335.0
3,0,2019-02-04,252.0
4,0,2019-02-05,220.0


In [34]:
df[df['ID']==0].tail(7)

Unnamed: 0,ID,DATE,DELTA
358,0,2020-01-25,390.0
359,0,2020-01-26,304.0
360,0,2020-01-27,213.0
361,0,2020-01-28,232.0
362,0,2020-01-29,403.0
363,0,2020-01-30,425.0
364,0,2020-01-31,255.0


In [28]:
ultimos_7_dias = df[df['ID']==0].tail(7)
ultimos_7_dias[ultimos_7_dias['DATE']=="2020-01-25"].iloc[0]['DELTA']

390.0

In [80]:
import datetime
from tqdm import tqdm

def get_date_range(start_date, end_date):
    '''
    Given a start date "start_date" and an "end_date", both in datetime.date format, returns a list of strings
    with the dates from "start_date" to "end_date" (both included), in ISO format.

    Example:

    start_date = datetime.date(2019, 9 , 30)
    end_date = datetime.date(2019, 10, 7)
    get_date_range(start_date, end_date)
    '''
    number_of_days = (end_date-start_date).days
    return [(start_date + datetime.timedelta(days = day)).isoformat() for day in range(number_of_days+1)]

def get_set_by_days(df, start_date, end_date):
    '''
    Given a df with at most one entry per ID and DATE, a start_date and an end_date, returns a dictionary with an
    'ID' key plus one key per day between start_date and end_date (both included). The 'ID' value is the list of
    IDs; each day's value is the list of DELTA values for that day, in the same order as the IDs. If an ID has no
    entry for a DATE, None is stored for it. Execution is slow because df is filtered once per ID and DATE.

    Example:

    start_date = datetime.date(2019, 9 , 30)
    end_date = datetime.date(2019, 10, 7)
    get_set_by_days(df, start_date, end_date)
    '''
    list_of_days = get_date_range(start_date, end_date)
    test_dict = {'ID':[]}
    for i in tqdm(df['ID'].unique()):
        test_dict['ID'].append(i)
        for day in list_of_days:
            filtered_df = df[(df['ID']==i) & (df['DATE']==day)]
            # check that there is at most one entry per ID and DATE
            if len(filtered_df)>1:
                raise Exception("More than one delta value per ID and DATE")
            # if there is exactly one entry, use its DELTA value
            elif len(filtered_df)==1:
                delta = filtered_df.iloc[0]['DELTA']
            # if there is no entry, store None
            else:
                delta = None
            test_dict.setdefault(day, []).append(delta)
    return test_dict

In [82]:
start_date = datetime.date(2019, 2 , 1)
end_date = datetime.date(2020, 1, 31)
training_dict = get_set_by_days(df, start_date, end_date)

  0%|▏                                                                             | 7/2747 [02:09<14:07:42, 18.56s/it]


KeyboardInterrupt: 
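
The interrupted run above (about 18 s per ID across 2747 IDs) suggests that filtering the DataFrame once per ID and day does not scale. One possible alternative, sketched below, reshapes the table in a single `pivot` call. This is not the notebook's method: the function name `get_set_by_days_pivot` is hypothetical, it assumes `DATE` holds ISO-formatted strings, it returns a wide DataFrame rather than a dictionary, and missing ID/day pairs come out as `NaN` instead of `None`.

```python
import datetime
import pandas as pd

def get_set_by_days_pivot(df, start_date, end_date):
    """Hypothetical vectorized variant: one row per ID, one column per day in the range."""
    # Same ISO date strings that get_date_range builds
    n_days = (end_date - start_date).days
    days = [(start_date + datetime.timedelta(days=n)).isoformat() for n in range(n_days + 1)]
    # Keep only rows inside the requested range (assumes DATE is an ISO string)
    sub = df[df["DATE"].isin(days)]
    # Same sanity check as the loop version: at most one row per ID and DATE
    if sub.duplicated(subset=["ID", "DATE"]).any():
        raise ValueError("More than one delta value per ID and DATE")
    # IDs in order of first appearance, including IDs with no rows in the range
    ids = df["ID"].unique()
    wide = sub.pivot(index="ID", columns="DATE", values="DELTA")
    # Add missing IDs and days as NaN and fix the column order
    wide = wide.reindex(index=ids, columns=days)
    return wide.rename_axis(index="ID", columns=None).reset_index()

# Tiny usage example with made-up values
example = pd.DataFrame({
    "ID": [0, 0, 1],
    "DATE": ["2020-01-30", "2020-01-31", "2020-01-31"],
    "DELTA": [1.0, 2.0, 3.0],
})
wide_df = get_set_by_days_pivot(example, datetime.date(2020, 1, 30), datetime.date(2020, 1, 31))
```

The result has the same layout that `pd.DataFrame.from_dict` would give for the dictionary built by `get_set_by_days`, so it could in principle be pickled directly.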

In [78]:
two_last_january_weeks_df = pd.DataFrame.from_dict(training_dict)

In [79]:
two_last_january_weeks_df.to_pickle("../data/two_last_january_weeks.pkl")