# Feature Engineering and Time Series Feature Generation

### 8th February 2022, created by Yan Ge


In this notebook, we will focus on: 1) time-series feature generation; 2) feature generation with machine learning: a case study of financial marketing with recommender systems.

Sources of this tutorial: 1) https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/; 2) https://github.com/Apress/hands-on-time-series-analylsis-python

# Time Series Feature Generation

In this part, we will use the daily stock price of Apple Inc. over one year (from 28/Dec/2020 to 27/Dec/2021), obtained from Yahoo Finance. First, the time series is loaded as a Pandas Series. We then create a new Pandas DataFrame for the transformed dataset. Next, the columns are added one at a time, with the month and day extracted from the timestamp of each observation in the series. Below is the Python code to do this.

In [64]:
# create date time features of a dataset
from pandas import read_csv
from pandas import DataFrame
# the dates in the file are day-first (dd/mm/yyyy), so parse them with dayfirst=True;
# .squeeze('columns') replaces the deprecated squeeze=True argument of read_csv
series = read_csv('data/AAPL.csv', header=0, index_col=0, parse_dates=True, dayfirst=True).squeeze('columns')
dataframe = DataFrame()
dataframe['month'] = [ts.month for ts in series.index]
dataframe['day'] = [ts.day for ts in series.index]
dataframe['price'] = series.values
print(dataframe.head(15))

    month  day       price
0      12   28  133.990005
1      12   29  138.050003
2      12   30  135.580002
3      12   31  134.080002
4       1    4  133.520004
5       1    5  128.889999
6       1    6  127.720001
7       1    7  128.360001
8       1    8  132.429993
9       1   11  129.190002
10      1   12  128.500000
11      1   13  128.759995
12      1   14  130.800003
13      1   15  128.779999
14      1   19  127.779999
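
The date range above is written day-first (28/Dec/2020), and if the CSV stores its dates the same way, an ambiguous value such as 04/01/2021 can be read month-first. The following is a minimal sketch, using made-up date strings rather than the AAPL file, of how the parsing choice changes the extracted month and day features.

```python
import pandas as pd

# Hypothetical day-first date strings (dd/mm/yyyy); not read from AAPL.csv
raw_dates = ["31/12/2020", "04/01/2021", "05/01/2021"]

# An ambiguous string is read month-first by default ...
month_first = pd.to_datetime("04/01/2021")
# ... and day-first only when asked
day_first = pd.to_datetime("04/01/2021", dayfirst=True)
print(month_first.month, month_first.day)  # 4 1
print(day_first.month, day_first.day)      # 1 4

# An explicit format removes the ambiguity for the whole column
index = pd.to_datetime(raw_dates, format="%d/%m/%Y")
features = pd.DataFrame({"month": index.month, "day": index.day})
print(features)
```

Checking a few ambiguous dates like this before building features is cheap, because a swapped month and day does not raise an error.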


### Lag Features

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the previous time (t-1). The supervised learning problem with shifted values looks as follows:

Value(t-1), Value(t+1)

Value(t-1), Value(t+1)

Value(t-1), Value(t+1)

The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Note that, in this tutorial, shifting the dataset by 1 creates the t-1 column and adds a NaN (unknown) value for the first row. The time series dataset without a shift represents the t+1 column.

Below is an example of creating a lag feature for our daily stock price dataset. The values are extracted from the loaded series and a shifted and unshifted list of these values is created. Each column is also named in the DataFrame for clarity.

In [1]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
temps = DataFrame(series.values)

In [None]:
# shift operation with 1 lag
lag_shift = temps.shift(1)

In [None]:
dataframe = concat([lag_shift, temps], axis=1)
dataframe.columns = ['t-1', 't+1']
print(dataframe.head(5))

You can see that we would have to discard the first row to use the dataset to train a supervised learning model, as it does not contain enough data to work with.
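
As a small sketch of the discard step, the example below builds the lag frame on four made-up values (not the AAPL prices) and drops the row whose lag value is unknown. The variable names are illustrative only.

```python
import pandas as pd

# Made-up prices, used only to show what shift() and dropna() do
prices = pd.DataFrame([10.0, 11.0, 12.0, 13.0])

lagged = prices.shift(1)  # each row now holds the previous row's value
frame = pd.concat([lagged, prices], axis=1)
frame.columns = ["t-1", "t+1"]
print(frame)

# Drop the first row, whose lag value is NaN
supervised = frame.dropna()
print(supervised)
```

After `dropna()`, every remaining row is a complete input/output pair that a supervised model can be trained on.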

The addition of lag features is called the sliding window method, in this case with a window width of 1. It is as though we are sliding our focus along the time series for each observation with an interest in only what is within the window width.

We can expand the window width and include more lagged features. For example, below is the above case modified to include the last 3 observed values to predict the value at the next time step.

In [None]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
temps = DataFrame(series.values)

In [None]:
# shift operation with 1 lag, 2 lags and 3 lags
lag_shift_one = temps.shift(1)
lag_shift_two = temps.shift(2)
lag_shift_three = temps.shift(3)

In [None]:
dataframe = concat([lag_shift_three, lag_shift_two, lag_shift_one, temps], axis=1)
dataframe.columns = ['t-3', 't-2', 't-1', 't+1']
print(dataframe.head(5))

Again, you can see that we must discard the first few rows that do not have enough data to train a supervised model. A difficulty with the sliding window approach is deciding how large to make the window for your problem. A good starting point is a sensitivity analysis: try a suite of different window widths, each of which creates a different “view” of your dataset, and see which results in better-performing models.
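
One way to prepare such a sensitivity analysis is to generate one lag frame per candidate width. The sketch below does this on made-up values; `make_lag_frame` is a hypothetical helper, not part of Pandas, and the model fitting and scoring that would follow are not shown.

```python
import pandas as pd

def make_lag_frame(values, width):
    """Hypothetical helper: columns t-width .. t-1 plus the target t+1."""
    series = pd.Series(values, dtype=float)
    columns = {f"t-{lag}": series.shift(lag) for lag in range(width, 0, -1)}
    columns["t+1"] = series
    # Rows without a full window contain NaN and are dropped
    return pd.DataFrame(columns).dropna().reset_index(drop=True)

prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up values
views = {width: make_lag_frame(prices, width) for width in (1, 2, 3)}
for width, view in views.items():
    print(width, view.shape)
```

Each wider window adds one feature column and removes one usable row, which is part of the trade-off when choosing the width.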

### Rolling Window Statistics

A step beyond adding raw lagged values is to add a summary of the values at previous time steps. We can calculate summary statistics across the values in the sliding window and include these as features in our dataset. Perhaps the most useful is the mean of the previous few values, also called the rolling mean. For example, we can calculate the mean of the previous two values and use that to predict the next value. 

First, the series must be shifted. Then the rolling dataset can be created and the mean calculated on each window of two values.

In [None]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
temps = DataFrame(series.values)
shifted = temps.shift(1)

In [2]:
# rolling window of size 2 over the shifted series
window = shifted.rolling(window=2)

# mean value in the window
means = window.mean()

In [None]:
dataframe = concat([means, temps], axis=1)
dataframe.columns = ['mean(t-2,t-1)', 't+1']
print(dataframe.head(5))

Below is another example that shows a window width of 3 and a dataset comprised of more summary statistics, specifically the minimum, mean, and maximum value in the window. You can see in the code that we are explicitly specifying the sliding window width as a named variable. This lets us use it both in calculating the correct shift of the series and in specifying the width of the window to the rolling() function. In this case, the window width of 3 means we shift the series forward by 2 time steps, which makes the first two rows NaN. The rolling statistics then need 3 values per window, so two more rows are NaN: as the output shows, the first complete window appears in row 4 and summarizes the values of rows 0 to 2.

In [46]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
temps = DataFrame(series.values)
width = 3
shifted = temps.shift(width - 1)
window = shifted.rolling(window=width)
dataframe = concat([window.min(), window.mean(), window.max(), temps], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))

          min        mean         max         t+1
0         NaN         NaN         NaN  133.990005
1         NaN         NaN         NaN  138.050003
2         NaN         NaN         NaN  135.580002
3         NaN         NaN         NaN  134.080002
4  133.990005  135.873337  138.050003  133.520004
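
The output shows that, with `shift(width - 1)`, the first complete window sits on row 4 and covers rows 0 to 2, so the observation immediately before the target is not part of the window. The sketch below, on made-up values, compares this with `shift(1)`. That second alignment is an alternative shown for illustration, not the one used in the cell above.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # made-up values
width = 3

# Alignment used above: the window ends width - 1 steps before the target
mean_gap = prices.shift(width - 1).rolling(window=width).mean()
# Alternative (assumption): the window ends one step before the target
mean_adjacent = prices.shift(1).rolling(window=width).mean()

frame = pd.concat([mean_gap, mean_adjacent, prices], axis=1)
frame.columns = ["mean, shift(width-1)", "mean, shift(1)", "t+1"]
print(frame)
```

With `shift(1)` the window covers the three values directly preceding the target and only the first three rows are NaN. Neither alignment uses the target value itself.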


### Expanding Window Statistics

Another type of window that may be useful includes all previous data in the series. This is called an expanding window and can help with keeping track of the bounds of observable data. Like the rolling() function on DataFrame, Pandas provides an expanding() function that collects sets of all prior values for each time step.

Below is an example of calculating the minimum, mean, and maximum values of the expanding window on the daily stock price dataset. Running the example prints the first 5 rows of the dataset.

In [None]:
# create expanding window features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
temps = DataFrame(series.values)
window = temps.expanding()

In [None]:
# Calculate the min, mean and max value in the window
win_min = window.min()
win_mean = window.mean()
win_max = window.max()

In [None]:
dataframe = concat([win_min, win_mean,win_max, temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))
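
In the expanding-window cell above, the target column is `temps.shift(-1)` rather than `temps`. The expanding window at each row already includes that row's own value, so the value to predict has to be the next observation. A minimal sketch on made-up values:

```python
import pandas as pd

prices = pd.Series([3.0, 1.0, 4.0, 2.0])  # made-up values

window = prices.expanding()  # row i sees values 0..i, inclusive
frame = pd.concat(
    [window.min(), window.mean(), window.max(), prices.shift(-1)], axis=1
)
frame.columns = ["min", "mean", "max", "t+1"]
print(frame)
```

The last row has no following observation, so its `t+1` value is NaN and the row cannot be used for training.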

# Feature generation with machine learning: case study of financial marketing with recommender systems

In this part, we use a real-world dataset from Amazon.com, Inc. In total, this dataset includes 478,235 users and 266,414 items. The data is from http://jmcauley.ucsd.edu/data/amazon/.

In [4]:
import numpy as np
import pandas as pd

In [5]:
# In this example, we only use the first 12,000 user-item pairs. Remove nrows to load the whole dataset.
df = pd.read_csv('data/ratings_Digital_Music.csv', header=None, nrows=12000)
df.head()  # column order: user, item, rating, timestamp

                0           1    2           3
0  A2EFCYXHNK06IS  5555991584  5.0   978480000
1  A1WR23ER5HMAA9  5555991584  5.0   953424000
2  A2IR4Q0GPAFJKW  5555991584  4.0  1393545600
3  A2V0KUVAB9HSYO  5555991584  4.0   966124800
4  A1J0GL9HCA7ELW  5555991584  5.0  1007683200


In [6]:
df.shape

(12000, 4)

In [60]:
n_users = df[0].unique().shape[0]
n_items = df[1].unique().shape[0]
n_rating = df[2].unique().shape[0]

print ('%i unique users' %n_users)
print ('%i unique items' %n_items)
print ('%i unique ratings' %n_rating)

9843 unique users
561 unique items
5 unique ratings


In [61]:
# generate user-item matrix
ratings=df.pivot(index=0, columns=1, values=2)
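
To make the pivot step concrete, the sketch below applies it to three made-up (user, item, rating) triples in the same column order as `df`. Converting the result to a NumPy array with unrated pairs set to 0 is an assumption here, chosen because the later cells index the matrix with `ratings[user, :]` and call `.nonzero()`.

```python
import pandas as pd

# Made-up (user, item, rating) triples in the same column order as above
toy = pd.DataFrame([["u1", "i1", 5.0], ["u1", "i2", 3.0], ["u2", "i2", 4.0]])

matrix = toy.pivot(index=0, columns=1, values=2)  # rows: users, columns: items
dense = matrix.fillna(0).values                   # unrated pairs become 0
print(dense)
```

Note that `pivot` raises an error if the same user-item pair occurs more than once, so duplicated ratings would have to be aggregated or removed first.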

In [62]:
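# fill NaN values with 0 and convert to a NumPy array (the cells below index it as ratings[user, :])
ratings = ratings.fillna(0).values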


In [7]:
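# Compute the percentage of existing user-item ratings. This statistic shows that
# the vast majority of ratings are missing, which is our motivation to develop a recommender system to predict such missing values
n_rated = np.count_nonzero(ratings)
density = 100.0 * n_rated / (ratings.shape[0] * ratings.shape[1])
print('percentage of user-item pairs that have a rating: {:.2f}%'.format(density))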


In [27]:
# This creates a validation dataset by selecting rows (users) that have 35 or more ratings, then randomly selecting 15 of those ratings
# for the validation set and setting those values to 0 in the training set.

def train_test_split(ratings):
    
    validation = np.zeros(ratings.shape)
    train = ratings.copy()  # copy on purpose: train = ratings would alias the array, so zeroing entries below would also modify ratings
    
    for user in np.arange(ratings.shape[0]):
        if len(ratings[user,:].nonzero()[0])>=35:# 35 seems to be best, it depends on sparsity of your user-item matrix
            val_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                        size=15, #tweak this, 15 seems to be optimal
                                        replace=False)
            train[user, val_ratings] = 0
            validation[user, val_ratings] = ratings[user, val_ratings]
    print(validation.shape)
    return train, validation
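
The split can be checked on a tiny matrix. The sketch below is a variant of the function above, not the original: the thresholds are exposed as hypothetical parameters (`min_ratings`, `n_holdout`) and a seeded generator is used so that the result is repeatable.

```python
import numpy as np

def holdout_split(ratings, min_ratings=3, n_holdout=1, seed=0):
    """Sketch of the split above with the thresholds exposed as parameters."""
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    validation = np.zeros(ratings.shape)
    for user in range(ratings.shape[0]):
        rated = ratings[user].nonzero()[0]  # column indices of rated items
        if len(rated) >= min_ratings:
            held = rng.choice(rated, size=n_holdout, replace=False)
            train[user, held] = 0
            validation[user, held] = ratings[user, held]
    return train, validation

# Made-up matrix: user 0 has three ratings, user 1 has only one
toy = np.array([[5.0, 3.0, 4.0, 0.0],
                [0.0, 2.0, 0.0, 0.0]])
toy_train, toy_val = holdout_split(toy)
print(toy_train)
print(toy_val)
```

Two properties are worth verifying after any such split: no rating appears in both sets, and the two sets together reproduce the original matrix.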

In [8]:
# split this dataset into train and validation sets
train, val = train_test_split(ratings)


In [31]:
# P is the latent user feature matrix
# Q is the latent item feature matrix
# make a rating prediction given P and Q
def prediction(P,Q):
    return np.dot(P.T, Q)

In [32]:
lmbda = 0.4 # Regularization parameter
k = 3 #tweak this parameter
m, n = train.shape  # Number of users and items
n_epochs = 30  # Number of epochs
alpha=0.01  # Learning rate

P = 3 * np.random.rand(k,m) # Latent user feature matrix
Q = 3 * np.random.rand(k,n) # Latent item feature matrix
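
With `P` of shape `(k, m)` and `Q` of shape `(k, n)` as initialised above, one natural way to form predictions is the product of `P` transposed with `Q`. This is an assumption, but it is consistent with how `prediction(P[:,u],Q[:,i])` is later called with single columns. A shape check on toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 2, 3, 4        # toy sizes: latent factors, users, items
P = rng.random((k, m))   # one column of k factors per user
Q = rng.random((k, n))   # one column of k factors per item

full = P.T @ Q               # all predicted ratings, shape (m, n)
single = P[:, 1] @ Q[:, 2]   # predicted rating for user 1, item 2
print(full.shape, np.isclose(full[1, 2], single))
```

The same expression therefore serves both the per-pair error inside the training loop and the full prediction matrix used for evaluation.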

In [34]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(predicted, ground_truth):
    # compare only the entries that have a rating (non-zero in ground_truth)
    predicted = predicted[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(ground_truth, predicted))
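
The helper above scores only the entries where the ground truth is non-zero, so predictions for unrated pairs do not affect the error. The following NumPy-only sketch shows the same idea on made-up matrices. `masked_rmse` is a hypothetical name and not the function used in this notebook.

```python
import numpy as np

def masked_rmse(pred, truth):
    """RMSE over the entries where truth is non-zero (i.e. rated)."""
    mask = truth != 0
    return float(np.sqrt(np.mean((pred[mask] - truth[mask]) ** 2)))

truth = np.array([[5.0, 0.0], [0.0, 3.0]])  # 0 = unrated
pred = np.array([[4.0, 2.5], [1.0, 5.0]])
print(masked_rmse(pred, truth))
```

Only the two rated entries contribute, with squared errors 1 and 4, so the result is the square root of 2.5.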

In [35]:
train_errors = []
val_errors = []
# Gradient descent for optimisation
# Only consider items with ratings 
users,items = train.nonzero()      
for epoch in range(n_epochs):
    for u, i in zip(users,items):
        e = train[u, i] - prediction(P[:,u],Q[:,i])  # Calculate error for gradient update
        P[:,u] += alpha * ( e * Q[:,i] - lmbda * P[:,u]) # Update latent user feature matrix
        Q[:,i] += alpha * ( e * P[:,u] - lmbda * Q[:,i])  # Update latent item feature matrix
    
    train_rmse = rmse(prediction(P,Q),train)
    val_rmse = rmse(prediction(P,Q),val) 
    train_errors.append(train_rmse)
    val_errors.append(val_rmse)
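
The update lines in the loop above are consistent with stochastic gradient descent on a regularized squared-error objective. The source does not state the objective, so the formulation below is an assumption: it is the standard one whose gradients reproduce those updates. Here $r_{ui}$ is the observed rating (`train[u, i]`), $p_u$ and $q_i$ are the columns `P[:,u]` and `Q[:,i]`, $\lambda$ is `lmbda`, and $\alpha$ is `alpha`.

$$
\begin{aligned}
L &= \tfrac{1}{2} \sum_{(u,i):\, r_{ui} \neq 0} \left[ \left( r_{ui} - p_u^\top q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right] \\
e_{ui} &= r_{ui} - p_u^\top q_i \\
p_u &\leftarrow p_u + \alpha \left( e_{ui}\, q_i - \lambda\, p_u \right) \\
q_i &\leftarrow q_i + \alpha \left( e_{ui}\, p_u - \lambda\, q_i \right)
\end{aligned}
$$

For a single rated pair, $\partial L / \partial p_u = -e_{ui}\, q_i + \lambda\, p_u$, so a step of size $\alpha$ against the gradient gives the third line, and the fourth follows by symmetry. In the code, the update of `Q[:,i]` uses the value of `P[:,u]` that has already been updated in the preceding line.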

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

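# visualise training process: RMSE per epoch on the training and validation sets
plt.plot(range(1, n_epochs + 1), train_errors, marker='o', label='Training')
plt.plot(range(1, n_epochs + 1), val_errors, marker='v', label='Validation')
plt.xlabel('Epoch')
plt.ylabel('RMSE')
plt.legend()
plt.show()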

### Take a look at prediction vs. actual ratings

In [37]:
SGD_prediction=prediction(P,Q)

In [38]:
estimation= SGD_prediction[val.nonzero()]
ground_truth = val[val.nonzero()]
results=pd.DataFrame({'prediction':estimation, 'actual rating':ground_truth})

In [39]:
results.head()

   prediction  actual rating
0    3.969998            5.0
1    3.944109            5.0
2    4.108701            2.0
3    3.499998            2.0
4    3.885569            5.0
