tsfresh is a Python package that computes more than 750 time series features from each original feature. It also provides a feature selection algorithm that identifies the most predictive features. The challenge is that running tsfresh on a dataframe with over 1 million rows is impractical on most machines. In this notebook it is therefore run separately for each month, and the monthly results are combined into one overall dataframe with the most relevant features.

In [1]:
import numpy as np 
import pandas as pd 

import datetime

from tsfresh import (extract_features, extract_relevant_features, select_features)
from tsfresh.feature_extraction import settings
from tsfresh.utilities.dataframe_functions import impute

In [2]:
train = pd.read_csv('train.csv',index_col='id')
test = pd.read_csv('test.csv',index_col='id')

In [3]:
train

id,date,store_nbr,family,sales,onpromotion
0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,2013-01-01,1,BABY CARE,0.000,0
2,2013-01-01,1,BEAUTY,0.000,0
3,2013-01-01,1,BEVERAGES,0.000,0
4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...
3000883,2017-08-15,9,POULTRY,438.133,0
3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


tsfresh uses the target data to select the most relevant features, and the new features are needed in both the train and test dataframes. A preliminary prediction of the sales target is therefore created for the test dataframe.

A simple model produces these predictions: for each store number and product family, sales are averaged over a 'look-back' window of the 16 days just before the predicted date range, which is the same length as that range. This notebook by Carl McBride Ellis was used as a reference for this process: https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days

In [4]:
train_16_days = train.query("date >= '2017-07-31' ")
def exp_mean_ln(df):
    return np.expm1(np.mean(np.log1p(df['sales'])))
train_average = train_16_days.groupby(['store_nbr', 'family']).apply(exp_mean_ln).to_dict()
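
The helper exp_mean_ln averages in log space: it applies log1p to each sales value, takes the mean, and maps the result back with expm1. The sketch below uses made-up sales values (not taken from the dataset) to show how this differs from a plain arithmetic mean.

```python
import numpy as np

# Toy sales values for one store/family over a short window (illustrative only)
sales = np.array([0.0, 3.0, 15.0])

# Mean in log1p space, mapped back with expm1 (same formula as exp_mean_ln)
log_space_mean = np.expm1(np.mean(np.log1p(sales)))

# Plain arithmetic mean for comparison
arithmetic_mean = sales.mean()

print(log_space_mean)
print(arithmetic_mean)
```

For these toy values the log-space mean is about 3 while the arithmetic mean is 6, so a single large value pulls the log-space average up less.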

In [5]:
test['sales'] = test.set_index(['store_nbr', 'family']).index.map(train_average.get)
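
The lookup above maps each (store_nbr, family) pair to its average through dict.get. A minimal sketch on toy data follows; the dictionary contents and the dataframe are hypothetical. It shows that a pair with no entry in the dictionary ends up with a missing sales value, which may be worth checking before the data is passed on.

```python
import pandas as pd

# Hypothetical averages keyed by (store_nbr, family) tuples
toy_average = {(1, 'BOOKS'): 2.0}

toy_test = pd.DataFrame({
    'store_nbr': [1, 1],
    'family': ['BOOKS', 'PRODUCE'],  # (1, 'PRODUCE') has no entry in toy_average
})

# Each (store_nbr, family) tuple is looked up with dict.get; missing keys return None
keys = toy_test.set_index(['store_nbr', 'family']).index
toy_test['sales'] = keys.map(toy_average.get)

# Count rows that did not receive a preliminary prediction
n_missing = int(toy_test['sales'].isna().sum())
print(toy_test)
print(n_missing)
```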

In [6]:
train_test = pd.concat([train, test], ignore_index=True)

Convert the date column to datetime.

In [7]:
train_test.date = pd.to_datetime(train_test.date)

In [8]:
# infer_datetime_format is deprecated in recent pandas versions, so it is omitted here
oil = pd.read_csv('oil.csv', parse_dates=['date'], index_col='date').to_period('D')

The first day is missing oil data, so that is filled with the same value as the day after.

There are also a number of days, primarily weekends, with no oil data. Those dates are identified and then filled by linear interpolation, which draws a straight line between the two neighbouring observed points and places the missing values on that line.
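
As a small illustration of both fills, the sketch below uses a toy series (values are made up) with a missing first entry and a two-day gap. With pandas' default settings, linear interpolation fills the interior gap but leaves the leading value missing, which is why the first day needs a separate fill.

```python
import numpy as np
import pandas as pd

# Toy price series: leading gap, then a two-day gap (e.g. a weekend)
prices = pd.Series([np.nan, 93.12, np.nan, np.nan, 93.20])

# Linear interpolation fills interior gaps with evenly spaced values
filled = prices.interpolate(method='linear')

# The leading NaN has no earlier observation, so it is still missing here
leading_still_missing = bool(pd.isna(filled.iloc[0]))

# Copy the next value into the first position, as done for the first day of oil data
filled.iloc[0] = filled.iloc[1]
print(filled)
```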

In [9]:
oil = oil.interpolate(method='linear')
oil.iloc[0] = oil.iloc[1]

# Some days are skipped in the oil data. Build a full daily calendar to fill the gaps.

start_date = train_test.date.min()
# Covers the beginning of the train dates through the end of the test dates
number_of_days = 1704
date_list = [(start_date + datetime.timedelta(days = day)).isoformat() for day in range(number_of_days)]

date = (pd.Series(date_list)).to_frame()
date.columns = ['date']
date.date = pd.to_datetime(date.date)
date['date_str'] = date.date.astype(str)
oil['date_str'] = oil.index.astype(str)

oil = pd.merge(date,oil,how='left',on='date_str')
oil = oil.set_index('date').dcoilwtico.interpolate(method='linear').to_frame()
oil['date_str'] = oil.index.astype(str)
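
The calendar-and-merge step could also be expressed with pd.date_range and reindex. The following is only a sketch of that alternative on toy data, not the code used in this notebook; the dates and prices are made up.

```python
import pandas as pd

# Toy oil prices with Saturday and Sunday missing (illustrative values)
toy_oil = pd.Series(
    [50.0, 53.0],
    index=pd.to_datetime(['2017-01-06', '2017-01-09']),
    name='dcoilwtico',
)

# Full daily calendar covering the observed range
calendar = pd.date_range(toy_oil.index.min(), toy_oil.index.max(), freq='D')

# Reindexing inserts NaN for the skipped days; interpolation then fills them
daily = toy_oil.reindex(calendar).interpolate(method='linear')
print(daily)
```

Deriving the calendar from the minimum and maximum dates avoids hard-coding the number of days, at the cost of assuming the series already spans the full range of interest.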

In [10]:
oil

date,dcoilwtico,date_str
2013-01-01,93.140000,2013-01-01
2013-01-02,93.140000,2013-01-02
2013-01-03,92.970000,2013-01-03
2013-01-04,93.120000,2013-01-04
2013-01-05,93.146667,2013-01-05
...,...,...
2017-08-27,46.816667,2017-08-27
2017-08-28,46.400000,2017-08-28
2017-08-29,46.460000,2017-08-29
2017-08-30,45.960000,2017-08-30


In [11]:
train_test['date_str'] = train_test.date.astype(str)
train_test = pd.merge(train_test,oil,how='left',on='date_str')

A month_year column is added as the basis for batching the tsfresh algorithm by each month. 

In [12]:
train_test['month_year'] = train_test.date.dt.to_period('M')

In [13]:
train_test

Unnamed: 0,date,store_nbr,family,sales,onpromotion,date_str,dcoilwtico,month_year
0,2013-01-01,1,AUTOMOTIVE,0.000000,0,2013-01-01,93.14,2013-01
1,2013-01-01,1,BABY CARE,0.000000,0,2013-01-01,93.14,2013-01
2,2013-01-01,1,BEAUTY,0.000000,0,2013-01-01,93.14,2013-01
3,2013-01-01,1,BEVERAGES,0.000000,0,2013-01-01,93.14,2013-01
4,2013-01-01,1,BOOKS,0.000000,0,2013-01-01,93.14,2013-01
...,...,...,...,...,...,...,...,...
3029395,2017-08-31,9,POULTRY,436.545569,1,2017-08-31,47.26,2017-08
3029396,2017-08-31,9,PREPARED FOODS,108.170800,0,2017-08-31,47.26,2017-08
3029397,2017-08-31,9,PRODUCE,1607.282218,1,2017-08-31,47.26,2017-08
3029398,2017-08-31,9,SCHOOL AND OFFICE SUPPLIES,141.254425,9,2017-08-31,47.26,2017-08


Experiments with the date range in earlier Kaggle submissions showed that the best results came from including only the data from 2016 onward.

In [14]:
train_test = train_test.query("date >= '2016-01-01' ")

Set up the list of periods to iterate over.

In [15]:
periods = train_test['month_year'].unique()
print(periods)

<PeriodArray>
['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07',
 '2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01', '2017-02',
 '2017-03', '2017-04', '2017-05', '2017-06', '2017-07', '2017-08']
Length: 20, dtype: period[M]
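
To make the batching idea concrete, here is a minimal sketch on a toy dataframe (the dates are illustrative). It labels each row with its month and then selects one batch per distinct month using a boolean mask, which is the same selection pattern used for the monthly tsfresh runs.

```python
import pandas as pd

# Toy frame spanning two months (illustrative dates)
toy = pd.DataFrame({'date': pd.to_datetime(['2016-01-30', '2016-01-31', '2016-02-01'])})

# Monthly period label for each row
toy['month_year'] = toy['date'].dt.to_period('M')

# One batch per distinct month, selected with a boolean mask
batch_sizes = {
    str(p): int((toy['month_year'] == p).sum())
    for p in toy['month_year'].unique()
}
print(batch_sizes)
```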


tsfresh is run in monthly batches, and the resulting rows for all the features are rolled up into the relevant_features dataframe. The two original non-time-based features, onpromotion and dcoilwtico, are fed into the tsfresh algorithm.

The tsfresh function extract_relevant_features creates the expanded list of tsfresh features for each original feature and then selects the most relevant ones to include in the dataframe. It combines tsfresh's extract_features and select_features functions.

In [16]:
monthly_features = []
for p in periods:
    period_dataset = train_test.loc[train_test['month_year'] == p]
    # Use the row index as the tsfresh id so the features align with y by index
    X = period_dataset.rename_axis('id').reset_index()
    X = X[['id', 'date', 'onpromotion', 'dcoilwtico']]
    y = period_dataset['sales']

    # Extract all features for this month, then keep those relevant to y
    new_features = extract_relevant_features(X, y, column_id='id', column_sort='date')
    monthly_features.append(new_features)

# Stack the monthly results; columns not selected in a given month become NaN
relevant_features = pd.concat(monthly_features)

Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [20:27<00:00, 61.40s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [17:32<00:00, 52.64s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [17:07<00:00, 51.35s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [20:52<00:00, 62.64s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [18:58<00:00, 56.94s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [17:52<00:00, 53.64s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [18:26<00:00, 55.31s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [18:31<00:00, 55.58s/it]
Feature Extraction: 100%|███████████████

There were 50 new features created from this process. 
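
Because feature selection runs separately for each month, the set of selected columns can differ from month to month. The sketch below, with hypothetical column names, shows how pd.concat handles that: the combined dataframe has the union of the columns, and rows from a month where a feature was not selected hold NaN for it. This is a likely explanation for any NaN values in the combined dataframe.

```python
import pandas as pd

# Two toy monthly results with different selected features (hypothetical column names)
january = pd.DataFrame({'feat_a': [1.0], 'feat_b': [2.0]}, index=[0])
february = pd.DataFrame({'feat_a': [3.0]}, index=[1])

# Row-wise concat keeps the union of columns and fills the gaps with NaN
combined = pd.concat([january, february])
print(combined)
```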

In [17]:
relevant_features

Unnamed: 0,onpromotion__sum_values,onpromotion__value_count__value_0,onpromotion__value_count__value_1,onpromotion__count_below__t_0,onpromotion__range_count__max_1__min_-1,"onpromotion__fft_coefficient__attr_""abs""__coeff_0","onpromotion__fft_coefficient__attr_""real""__coeff_0","onpromotion__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)","onpromotion__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)","onpromotion__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)",...,dcoilwtico__quantile__q_0.1,dcoilwtico__minimum,dcoilwtico__absolute_maximum,dcoilwtico__maximum,dcoilwtico__root_mean_square,dcoilwtico__mean,dcoilwtico__median,dcoilwtico__sum_values,dcoilwtico__quantile__q_0.6,dcoilwtico__benford_correlation
1945944,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,0.062915
1945945,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,0.062915
1945946,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,0.062915
1945947,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,0.062915
1945948,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,36.97,0.062915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3029395,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.193940,0.274272,0.387880,...,,,,,,,,,,
3029396,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,,,,,,,,,,
3029397,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.193940,0.274272,0.387880,...,,,,,,,,,,
3029398,9.0,0.0,0.0,0.0,0.0,9.0,9.0,1.745458,2.468450,3.490916,...,,,,,,,,,,
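
In the table above, the dcoilwtico minimum, maximum, mean, median, and sum are identical within each row. A probable reason is that the id column is the row index, so each id contains a single observation. The sketch below reproduces that effect on toy values, using pandas aggregations as a stand-in for tsfresh's feature calculators (an assumption made for illustration only).

```python
import pandas as pd

# Toy long-format input where every row has its own id (as with the row index)
toy = pd.DataFrame({'id': [10, 11, 12], 'dcoilwtico': [36.97, 36.97, 37.50]})

# Per-id aggregates, computed with pandas as a stand-in for tsfresh's calculators
per_id = toy.groupby('id')['dcoilwtico'].agg(['min', 'max', 'mean', 'median', 'sum'])

# With one observation per id, every aggregate equals that observation
all_equal = bool(per_id.nunique(axis=1).eq(1).all())
print(per_id)
print(all_equal)
```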


In [19]:
print(list(relevant_features))

['onpromotion__sum_values', 'onpromotion__value_count__value_0', 'onpromotion__value_count__value_1', 'onpromotion__count_below__t_0', 'onpromotion__range_count__max_1__min_-1', 'onpromotion__fft_coefficient__attr_"abs"__coeff_0', 'onpromotion__fft_coefficient__attr_"real"__coeff_0', 'onpromotion__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)', 'onpromotion__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)', 'onpromotion__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)', 'onpromotion__quantile__q_0.9', 'onpromotion__quantile__q_0.8', 'onpromotion__quantile__q_0.7', 'onpromotion__cwt_coefficients__coeff_0__w_2__widths_(2, 5, 10, 20)', 'onpromotion__quantile__q_0.4', 'onpromotion__abs_energy', 'onpromotion__median', 'onpromotion__quantile__q_0.6', 'onpromotion__root_mean_square', 'onpromotion__maximum', 'onpromotion__absolute_maximum', 'onpromotion__mean', 'onpromotion__benford_correlation', 'onpromotion__quantile__q_0.1', 'onpromotion__quantile__q_0.2', 'onpromotion

Since this process took a long time to run, the output is saved to a CSV file that can then be used in a more complete prediction and submission notebook without having to run the tsfresh algorithm again.

In [25]:
relevant_features.to_csv('tsfresh_features.csv')
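
When the file is loaded in a later notebook, the row ids are in the first, unnamed column of the CSV. A minimal round-trip sketch follows; it uses a toy dataframe with a hypothetical column name and an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Toy stand-in for relevant_features (hypothetical column name)
features = pd.DataFrame({'feat_a': [0.5, 1.5]}, index=[1945944, 1945945])

# Write to an in-memory buffer instead of a file for this sketch
buffer = io.StringIO()
features.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the row ids written in the first, unnamed column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```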