## Feature Engineering and Target Preparation

The aim is to use the different data sources we have and create meaningful and informative features. 

Reminder: The aim of the project is to develop a model to predict the probability that a machine will fail in the next 48 hours due to a certain component.

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
from utils import load_data

# suppress all warnings to keep the notebook output readable
import warnings

warnings.filterwarnings("ignore")

sns.set_style("darkgrid")
sns.set_palette("Set2")
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

### Data Sources

This dataset covers 100 machines throughout the year 2015 as an hourly time series. It consists of five data sources.

1.	Telemetry Data (PdM_telemetry.csv): For the year 2015, the dataset provides hourly average measurements of voltage, rotation, pressure, and vibration collected from 100 machines.

2.	Errors (PdM_errors.csv): These errors occur while the machines are in operation but do not result in machine shutdown, and thus they are not categorized as failures. The error dates and times are rounded to the nearest hour to align with the telemetry data collection frequency.

3.	Maintenance (PdM_maint.csv): If a machine component is replaced, it is recorded in this table. Component replacements occur in two situations: 1) During regular scheduled visits, technicians perform proactive maintenance by replacing the component. 2) In case of a component breakdown, technicians conduct unscheduled maintenance to replace the faulty component, which is considered a failure and captured in the Failures data. The maintenance data includes records from both 2014 and 2015 and is rounded to the nearest hour to align with the hourly telemetry data collection rate.

4.	Failures (PdM_failures.csv): Each entry in this dataset signifies the replacement of a failed component. 

5.	Machines (PdM_Machines.csv): Contains the model type and age of the Machines.


### Load Data

In [2]:
# Load data
df_telemetry, df_errors, df_maintenance, df_failures, df_machines = load_data()

### Lag Features from Telemetry Data
Lag features are the go-to feature engineering method for time series data. Here they are built from rolling aggregates (mean, standard deviation, minimum, maximum, and so on) that summarize the recent history of the telemetry over a lag window.

In the following, the rolling mean and standard deviation over a 4-hour window are computed at every hourly timestamp to capture short-term behavior. The same statistics over a 48-hour window capture longer-term effects.

The window sizes are set to 4 and 48 here, but the code is configurable and can create features for any window size.

In [3]:
# Window sizes for computing the lagged features
window_long = 48
window_short = 4

In [4]:
def gen_rolling_features(df, window_size):
    """generate rolling mean and standard deviation features for telemetry data

    Parameters
    ----------
    df : DataFrame
        the input dataframe
    window_size : int
        number of hours for rolling window

    Returns
    -------
    DataFrame
        the dataframe with rolling mean and standard deviation features
    """
    # copy the dataframe
    df = df.copy()
    # temp collects the columns to be added back to the dataframe
    temp = []
    for col in ["volt", "rotate", "pressure", "vibration"]:
        # calculate mean
        mean_col = col + "_mean_" + str(window_size) + "h"
        df[mean_col] = df[col].rolling(window=window_size).mean()
        # calculate standard deviation
        std_col = col + "_std_" + str(window_size) + "h"
        df[std_col] = df[col].rolling(window=window_size).std()
        temp.extend([mean_col, std_col])

    # add the machine ID and datetime columns back in
    temp.extend(["machineID", "datetime"])

    # drop all rows with null values generated by rolling mean/std
    telemetry_feat = df[temp].dropna()
    return telemetry_feat

In [5]:
# generate rolling features for the past 4 hour window
rolling_features_short = gen_rolling_features(df_telemetry, window_short)
# generate rolling features for the past 48 hour window
rolling_features_long = gen_rolling_features(df_telemetry, window_long)
# combine the 4h and 48h features; the left join leaves NaNs where the 48h window is not yet full
rolling_telemetry_features = rolling_features_short.merge(
    rolling_features_long, on=["machineID", "datetime"], how="left"
)
# drop missing values
rolling_telemetry_features = rolling_telemetry_features.dropna()
rolling_telemetry_features.head()

Unnamed: 0,volt_mean_4h,volt_std_4h,rotate_mean_4h,rotate_std_4h,pressure_mean_4h,pressure_std_4h,vibration_mean_4h,vibration_std_4h,machineID,datetime,volt_mean_48h,volt_std_48h,rotate_mean_48h,rotate_std_48h,pressure_mean_48h,pressure_std_48h,vibration_mean_48h,vibration_std_48h
44,180.694192,9.610191,485.046875,46.151381,95.428381,4.437964,37.832936,6.458281,1,2015-01-03 05:00:00,170.045337,13.132056,449.71137,44.714296,98.792391,10.826645,39.428496,5.526109
45,180.666104,9.621991,469.576558,39.616761,94.209279,2.410554,42.215156,9.533812,1,2015-01-03 06:00:00,170.074009,13.147313,450.574966,44.498525,98.450176,10.623186,39.59328,5.809134
46,180.472154,9.754595,463.636015,32.65289,97.226801,5.232113,45.090099,8.898822,1,2015-01-03 07:00:00,170.330969,13.124213,452.319701,44.224339,98.637102,10.648371,39.686235,5.906454
47,172.821801,5.835065,501.168525,47.092576,96.874856,5.396217,45.555083,9.089829,1,2015-01-03 08:00:00,170.192459,13.152165,453.186836,46.095287,99.038156,10.096549,39.9955,5.99985
48,172.153274,5.786929,492.435436,52.918165,95.204932,7.893687,49.398795,2.469875,1,2015-01-03 09:00:00,170.379989,13.103691,455.440349,43.311082,98.538116,10.174171,40.132507,6.100367


In [6]:
rolling_telemetry_features.describe()

Unnamed: 0,volt_mean_4h,volt_std_4h,rotate_mean_4h,rotate_std_4h,pressure_mean_4h,pressure_std_4h,vibration_mean_4h,vibration_std_4h,machineID,datetime,volt_mean_48h,volt_std_48h,rotate_mean_48h,rotate_std_48h,pressure_mean_48h,pressure_std_48h,vibration_mean_48h,vibration_std_48h
count,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0,876053.0
mean,170.777802,13.833164,446.605007,46.241373,100.858781,9.244105,40.385049,4.618743,50.502656,2015-07-02 18:14:01.506621184,170.777782,15.07957,446.604958,50.612783,100.858732,10.244763,40.385031,5.083156
min,135.979457,0.110325,198.770337,0.215923,76.437827,0.093697,28.339762,0.074046,1.0,2015-01-01 06:00:00,160.56986,8.393878,276.466068,28.360658,92.905441,5.392215,36.760162,2.883778
25%,165.186221,9.5391,430.400935,31.936551,96.740946,6.371674,38.395826,3.184593,26.0,2015-04-02 12:00:00,168.672296,13.93327,443.654838,46.610087,99.075815,9.300044,39.55551,4.651415
50%,170.384577,13.330501,448.584123,44.595454,100.204714,8.901361,40.133234,4.450365,51.0,2015-07-02 18:00:00,170.196233,15.00567,449.232026,50.171934,100.097147,10.020587,40.067961,5.0115
75%,175.768606,17.576965,465.973672,58.69101,103.821037,11.734377,41.932853,5.86609,76.0,2015-10-02 00:00:00,171.843116,16.134619,454.378088,53.995052,101.192744,10.80478,40.624986,5.402276
max,234.113819,54.584229,573.307528,164.917765,164.875324,32.747118,67.312085,16.488949,100.0,2016-01-01 06:00:00,215.764528,29.78065,483.074597,107.700452,150.060243,28.519275,60.668118,11.555126
std,8.449634,5.84498,29.764544,19.492048,6.815953,3.920148,3.159422,1.952493,28.864584,,3.887628,1.680559,15.191276,6.01908,4.102678,1.589481,1.759595,0.673468
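
One point worth checking: `gen_rolling_features` rolls over the whole frame, so if the telemetry rows are stacked machine after machine, a window at the start of one machine's rows can include the tail of the previous machine's rows. The following is a minimal sketch of a per-machine variant, not the notebook's method. The helper name `gen_rolling_features_per_machine` and the toy values are assumptions made for illustration. The same grouping idea would apply to the rolling error sums in the next section.

```python
import pandas as pd


def gen_rolling_features_per_machine(df, window_size, cols):
    """Rolling mean/std computed separately for each machineID (hypothetical helper)."""
    df = df.sort_values(["machineID", "datetime"]).reset_index(drop=True)
    out = df[["machineID", "datetime"]].copy()
    rolled = df.groupby("machineID")[cols].rolling(window=window_size)
    # groupby-rolling returns a (machineID, original index) MultiIndex;
    # dropping the first level realigns the result with df's row index
    means = rolled.mean().reset_index(level=0, drop=True)
    stds = rolled.std().reset_index(level=0, drop=True)
    for col in cols:
        out[f"{col}_mean_{window_size}h"] = means[col]
        out[f"{col}_std_{window_size}h"] = stds[col]
    # the first (window_size - 1) rows of every machine have no full window
    return out.dropna()


# toy data: two machines, four hourly readings each (values are made up)
hours = ["2015-01-01 06:00", "2015-01-01 07:00", "2015-01-01 08:00", "2015-01-01 09:00"]
toy = pd.DataFrame(
    {
        "machineID": [1, 1, 1, 1, 2, 2, 2, 2],
        "datetime": pd.to_datetime(hours * 2),
        "volt": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)
feat = gen_rolling_features_per_machine(toy, window_size=2, cols=["volt"])
print(feat)
```

With this grouping, each machine loses its own first `window_size - 1` rows, so the row count after `dropna()` would differ from the one reported above.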


### Lag Features from Errors Data

We can build lag features from the errors in the same way. Because the error IDs are categorical, we count the occurrences of each error type within the lag window instead of averaging.


In [7]:
df_errors

Unnamed: 0,datetime,machineID,errorID
0,2015-01-03 07:00:00,1,error1
1,2015-01-03 20:00:00,1,error3
2,2015-01-04 06:00:00,1,error5
3,2015-01-10 15:00:00,1,error4
4,2015-01-22 10:00:00,1,error4
...,...,...,...
3914,2015-11-21 08:00:00,100,error2
3915,2015-12-04 02:00:00,100,error1
3916,2015-12-08 06:00:00,100,error2
3917,2015-12-08 06:00:00,100,error3


Several errors can occur at the same time on the same machine:

In [8]:
error_count_grouped = df_errors.groupby(["datetime", "machineID"]).count()
error_count_grouped = error_count_grouped.reset_index()
error_count_grouped = error_count_grouped.rename(columns={"errorID": "error_count"})
# sort in descending order
error_count_grouped = error_count_grouped.sort_values("error_count", ascending=False)
error_count_grouped.head()

Unnamed: 0,datetime,machineID,error_count
3204,2015-11-18 06:00:00,94,3
273,2015-01-27 06:00:00,63,3
3030,2015-10-31 06:00:00,15,3
65,2015-01-06 06:00:00,12,3
2130,2015-08-05 06:00:00,21,3


In [9]:
# get the records for machine id 94 at time 2015-11-18 06:00:00
df_errors[
    (df_errors["machineID"] == 94) & (df_errors["datetime"] == "2015-11-18 06:00:00")
]

Unnamed: 0,datetime,machineID,errorID
3656,2015-11-18 06:00:00,94,error2
3657,2015-11-18 06:00:00,94,error3
3658,2015-11-18 06:00:00,94,error5


In [10]:
# create a column for each error type
error_count = pd.get_dummies(df_errors.set_index("datetime")).reset_index()
error_count.columns = [
    "datetime",
    "machineID",
    "error1",
    "error2",
    "error3",
    "error4",
    "error5",
]
# combine errors for a given machine in a given hour
error_count = error_count.groupby(["machineID", "datetime"]).sum().reset_index()
error_count.head(13)

Unnamed: 0,machineID,datetime,error1,error2,error3,error4,error5
0,1,2015-01-03 07:00:00,1,0,0,0,0
1,1,2015-01-03 20:00:00,0,0,1,0,0
2,1,2015-01-04 06:00:00,0,0,0,0,1
3,1,2015-01-10 15:00:00,0,0,0,1,0
4,1,2015-01-22 10:00:00,0,0,0,1,0
5,1,2015-01-25 15:00:00,0,0,0,1,0
6,1,2015-01-27 04:00:00,1,0,0,0,0
7,1,2015-03-03 22:00:00,0,1,0,0,0
8,1,2015-03-05 06:00:00,1,0,0,0,0
9,1,2015-03-20 18:00:00,1,0,0,0,0
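
The cell above renames the dummy columns by position, which assumes that `get_dummies` returns them in the order error1 to error5. As a hedged alternative sketch, the column names can be taken from the error values themselves and reindexed against a fixed list, so that an error type missing from the data still gets a column of zeros. The toy frame `toy_errors` and the list `error_types` are made up for illustration.

```python
import pandas as pd

# toy error log (made-up rows in the same shape as df_errors)
toy_errors = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2015-11-18 06:00:00"] * 3 + ["2015-11-19 07:00:00"]
        ),
        "machineID": [94, 94, 94, 94],
        "errorID": ["error2", "error3", "error5", "error2"],
    }
)

# assumed full list of error types; absent types become all-zero columns
error_types = [f"error{i}" for i in range(1, 6)]

# one indicator column per error value, named after the value itself
dummies = (
    pd.get_dummies(toy_errors["errorID"])
    .reindex(columns=error_types, fill_value=0)
    .astype(int)
)

# sum the indicators per machine and hour to get error counts
counts = (
    pd.concat([toy_errors[["machineID", "datetime"]], dummies], axis=1)
    .groupby(["machineID", "datetime"])
    .sum()
    .reset_index()
)
print(counts)
```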


For each telemetry timestamp, attach the number of errors of each type (zero where no error was recorded):

In [11]:
error_count = (
    df_telemetry[["datetime", "machineID"]]
    .merge(error_count, on=["machineID", "datetime"], how="left")
    .fillna(0.0)
)
error_count.head()

Unnamed: 0,datetime,machineID,error1,error2,error3,error4,error5
0,2015-01-01 06:00:00,1,0.0,0.0,0.0,0.0,0.0
1,2015-01-01 07:00:00,1,0.0,0.0,0.0,0.0,0.0
2,2015-01-01 08:00:00,1,0.0,0.0,0.0,0.0,0.0
3,2015-01-01 09:00:00,1,0.0,0.0,0.0,0.0,0.0
4,2015-01-01 10:00:00,1,0.0,0.0,0.0,0.0,0.0


Finally, we can compute the total number of errors of each type over the last 48 hours.

In [12]:
fields = [f"error{i}" for i in range(1, 6)]
for col in fields:
    error_count[col + "_sum_48h"] = error_count[col].rolling(window=window_long).sum()
error_count = error_count.dropna()
# drop the original error count columns
error_count = error_count.drop(fields, axis=1)
error_count.head()

Unnamed: 0,datetime,machineID,error1_sum_48h,error2_sum_48h,error3_sum_48h,error4_sum_48h,error5_sum_48h
47,2015-01-03 05:00:00,1,0.0,0.0,0.0,0.0,0.0
48,2015-01-03 06:00:00,1,0.0,0.0,0.0,0.0,0.0
49,2015-01-03 07:00:00,1,1.0,0.0,0.0,0.0,0.0
50,2015-01-03 08:00:00,1,1.0,0.0,0.0,0.0,0.0
51,2015-01-03 09:00:00,1,1.0,0.0,0.0,0.0,0.0


In [13]:
# first 10 records for machine 1 where error2_sum_48h > 0 and error3_sum_48h > 0
error_count[
    (error_count["machineID"] == 1)
    & (error_count["error2_sum_48h"] > 0)
    & (error_count["error3_sum_48h"] > 0)
].head(10)

Unnamed: 0,datetime,machineID,error1_sum_48h,error2_sum_48h,error3_sum_48h,error4_sum_48h,error5_sum_48h
2592,2015-04-19 06:00:00,1,0.0,1.0,1.0,0.0,0.0
2593,2015-04-19 07:00:00,1,0.0,1.0,1.0,0.0,0.0
2594,2015-04-19 08:00:00,1,0.0,1.0,1.0,0.0,0.0
2595,2015-04-19 09:00:00,1,0.0,1.0,1.0,0.0,0.0
2596,2015-04-19 10:00:00,1,0.0,1.0,1.0,0.0,0.0
2597,2015-04-19 11:00:00,1,0.0,1.0,1.0,0.0,0.0
2598,2015-04-19 12:00:00,1,0.0,1.0,1.0,0.0,0.0
2599,2015-04-19 13:00:00,1,0.0,1.0,1.0,0.0,0.0
2600,2015-04-19 14:00:00,1,0.0,1.0,1.0,0.0,0.0
2601,2015-04-19 15:00:00,1,0.0,1.0,1.0,0.0,0.0


### Features from the maintenance data

#### Days Since Last Replacement from Maintenance

The duration since the last replacement of a component is a crucial factor that is expected to correlate strongly with component failures. As components are utilized over time, their performance tends to degrade, increasing the likelihood of failures.


In [14]:
df_maintenance.head()

Unnamed: 0,datetime,machineID,comp
0,2014-06-01 06:00:00,1,comp2
1,2014-07-16 06:00:00,1,comp4
2,2014-07-31 06:00:00,1,comp3
3,2014-12-13 06:00:00,1,comp1
4,2015-01-05 06:00:00,1,comp4


In [15]:
# create a column for each component type
comp_rep = pd.get_dummies(df_maintenance.set_index("datetime")).reset_index()
comp_rep.columns = ["datetime", "machineID", "comp1", "comp2", "comp3", "comp4"]

# combine replacements for a given machine in a given hour (more than one component can be replaced at the same time)
comp_rep = comp_rep.groupby(["machineID", "datetime"]).sum().reset_index()

# add timepoints where no components were replaced
comp_rep = (
    df_telemetry[["datetime", "machineID"]]
    .merge(comp_rep, on=["datetime", "machineID"], how="outer")
    .fillna(0)
    .sort_values(by=["machineID", "datetime"])
)

components = ["comp1", "comp2", "comp3", "comp4"]
for comp in components:
    # keep the timestamp where the component was replaced, NaT elsewhere
    comp_rep[comp] = comp_rep["datetime"].where(comp_rep[comp] >= 1)

    # forward-fill the most recent date of component change
    comp_rep[comp] = comp_rep[comp].ffill()

# remove dates in 2014 (may have NaN or future component change dates)
comp_rep = comp_rep.loc[comp_rep["datetime"] > pd.to_datetime("2015-01-01")]

# replace dates of most recent component change with days since most recent component change
for comp in components:
    comp_rep[comp] = (comp_rep["datetime"] - comp_rep[comp]) / np.timedelta64(1, "D")

# comp_rep.describe()

comp_rep.head()

Unnamed: 0,datetime,machineID,comp1,comp2,comp3,comp4
0,2015-01-01 06:00:00,1,19.0,214.0,154.0,169.0
1,2015-01-01 07:00:00,1,19.041667,214.041667,154.041667,169.041667
2,2015-01-01 08:00:00,1,19.083333,214.083333,154.083333,169.083333
3,2015-01-01 09:00:00,1,19.125,214.125,154.125,169.125
4,2015-01-01 10:00:00,1,19.166667,214.166667,154.166667,169.166667
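
The same "time since last replacement" idea can also be expressed with `pd.merge_asof`, which looks up the most recent earlier record per machine. This is only a sketch on toy data for a single component, not the notebook's implementation. The frame names and the column names `last_replaced` and `days_since_replacement` are assumptions.

```python
import pandas as pd

# toy timestamps for two machines (made-up values)
timestamps = pd.DataFrame(
    {
        "machineID": [1, 1, 1, 2, 2, 2],
        "datetime": pd.to_datetime(
            [
                "2015-01-01 06:00", "2015-01-02 06:00", "2015-01-03 06:00",
                "2015-01-01 06:00", "2015-01-02 06:00", "2015-01-03 06:00",
            ]
        ),
    }
)

# toy replacement records for one component
replacements = pd.DataFrame(
    {
        "machineID": [1, 2, 2],
        "last_replaced": pd.to_datetime(
            ["2014-12-13 06:00", "2014-12-30 06:00", "2015-01-02 06:00"]
        ),
    }
)

# merge_asof requires both frames to be sorted by their time keys
merged = pd.merge_asof(
    timestamps.sort_values("datetime"),
    replacements.sort_values("last_replaced"),
    left_on="datetime",
    right_on="last_replaced",
    by="machineID",          # never match records across machines
    direction="backward",    # most recent replacement at or before the timestamp
)

# elapsed time in days since the matched replacement
merged["days_since_replacement"] = (
    merged["datetime"] - merged["last_replaced"]
) / pd.Timedelta(days=1)
merged = merged.sort_values(["machineID", "datetime"]).reset_index(drop=True)
print(merged)
```

Because the lookup is keyed on `machineID`, a replacement on one machine can never be carried over to another.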


### Machine Features

The model and age columns can be used directly as features.

In [16]:
df_machines.head()

Unnamed: 0,machineID,model,age
0,1,model3,18
1,2,model4,7
2,3,model3,8
3,4,model3,7
4,5,model3,2
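
Since `model` is a string column, some estimators may need it encoded as numbers first. A minimal sketch of one-hot encoding on a toy copy of the machines table follows. Whether this step is needed depends on the model used later, which this notebook does not specify.

```python
import pandas as pd

# toy machines table (made-up rows in the same shape as df_machines)
machines = pd.DataFrame(
    {
        "machineID": [1, 2, 3],
        "model": ["model3", "model4", "model3"],
        "age": [18, 7, 8],
    }
)

# replace the string column with one 0/1 indicator column per model type
encoded = pd.get_dummies(machines, columns=["model"], dtype=int)
print(encoded)
```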


### Bringing it all together
We merge all the feature data sets we created earlier to get the final feature matrix.

In [17]:
rolling_telemetry_features.head()

Unnamed: 0,volt_mean_4h,volt_std_4h,rotate_mean_4h,rotate_std_4h,pressure_mean_4h,pressure_std_4h,vibration_mean_4h,vibration_std_4h,machineID,datetime,volt_mean_48h,volt_std_48h,rotate_mean_48h,rotate_std_48h,pressure_mean_48h,pressure_std_48h,vibration_mean_48h,vibration_std_48h
44,180.694192,9.610191,485.046875,46.151381,95.428381,4.437964,37.832936,6.458281,1,2015-01-03 05:00:00,170.045337,13.132056,449.71137,44.714296,98.792391,10.826645,39.428496,5.526109
45,180.666104,9.621991,469.576558,39.616761,94.209279,2.410554,42.215156,9.533812,1,2015-01-03 06:00:00,170.074009,13.147313,450.574966,44.498525,98.450176,10.623186,39.59328,5.809134
46,180.472154,9.754595,463.636015,32.65289,97.226801,5.232113,45.090099,8.898822,1,2015-01-03 07:00:00,170.330969,13.124213,452.319701,44.224339,98.637102,10.648371,39.686235,5.906454
47,172.821801,5.835065,501.168525,47.092576,96.874856,5.396217,45.555083,9.089829,1,2015-01-03 08:00:00,170.192459,13.152165,453.186836,46.095287,99.038156,10.096549,39.9955,5.99985
48,172.153274,5.786929,492.435436,52.918165,95.204932,7.893687,49.398795,2.469875,1,2015-01-03 09:00:00,170.379989,13.103691,455.440349,43.311082,98.538116,10.174171,40.132507,6.100367


We merge all the tables with left joins so that every telemetry timestamp is kept.

In [18]:
final_feat = rolling_telemetry_features.merge(
    error_count, on=["datetime", "machineID"], how="left"
)
final_feat = final_feat.merge(comp_rep, on=["datetime", "machineID"], how="left")
final_feat = final_feat.merge(df_machines, on=["machineID"], how="left")
# make machineID and datetime the first two columns
final_feat = final_feat[
    ["machineID", "datetime"]
    + [col for col in final_feat.columns if col not in ["machineID", "datetime"]]
]

final_feat.head()

Unnamed: 0,machineID,datetime,volt_mean_4h,volt_std_4h,rotate_mean_4h,rotate_std_4h,pressure_mean_4h,pressure_std_4h,vibration_mean_4h,vibration_std_4h,...,error2_sum_48h,error3_sum_48h,error4_sum_48h,error5_sum_48h,comp1,comp2,comp3,comp4,model,age
0,1,2015-01-03 05:00:00,180.694192,9.610191,485.046875,46.151381,95.428381,4.437964,37.832936,6.458281,...,0.0,0.0,0.0,0.0,20.958333,215.958333,155.958333,170.958333,model3,18
1,1,2015-01-03 06:00:00,180.666104,9.621991,469.576558,39.616761,94.209279,2.410554,42.215156,9.533812,...,0.0,0.0,0.0,0.0,21.0,216.0,156.0,171.0,model3,18
2,1,2015-01-03 07:00:00,180.472154,9.754595,463.636015,32.65289,97.226801,5.232113,45.090099,8.898822,...,0.0,0.0,0.0,0.0,21.041667,216.041667,156.041667,171.041667,model3,18
3,1,2015-01-03 08:00:00,172.821801,5.835065,501.168525,47.092576,96.874856,5.396217,45.555083,9.089829,...,0.0,0.0,0.0,0.0,21.083333,216.083333,156.083333,171.083333,model3,18
4,1,2015-01-03 09:00:00,172.153274,5.786929,492.435436,52.918165,95.204932,7.893687,49.398795,2.469875,...,0.0,0.0,0.0,0.0,21.125,216.125,156.125,171.125,model3,18
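
A left join keeps every row of the left table, but it can silently add rows if the right table has duplicate keys. As a hedged sketch on toy data, the `validate` argument of `DataFrame.merge` can guard against that. The frames `features` and `machines` here are made up for illustration.

```python
import pandas as pd

# toy feature rows and machine metadata (made-up values)
features = pd.DataFrame(
    {
        "machineID": [1, 1, 2],
        "datetime": pd.to_datetime(
            ["2015-01-03 05:00", "2015-01-03 06:00", "2015-01-03 05:00"]
        ),
        "volt_mean_4h": [180.7, 180.6, 171.2],
    }
)
machines = pd.DataFrame(
    {"machineID": [1, 2], "model": ["model3", "model4"], "age": [18, 7]}
)

# many feature rows per machine, exactly one metadata row per machine
merged = features.merge(machines, on="machineID", how="left", validate="many_to_one")
assert len(merged) == len(features)  # a left join on unique keys adds no rows

# a duplicated key on the right side is reported instead of multiplying rows
duplicated = pd.concat([machines, machines.iloc[[0]]], ignore_index=True)
try:
    features.merge(duplicated, on="machineID", how="left", validate="many_to_one")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True
print(duplicate_caught)
```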


In [19]:
final_feat.columns

Index(['machineID', 'datetime', 'volt_mean_4h', 'volt_std_4h',
       'rotate_mean_4h', 'rotate_std_4h', 'pressure_mean_4h',
       'pressure_std_4h', 'vibration_mean_4h', 'vibration_std_4h',
       'volt_mean_48h', 'volt_std_48h', 'rotate_mean_48h', 'rotate_std_48h',
       'pressure_mean_48h', 'pressure_std_48h', 'vibration_mean_48h',
       'vibration_std_48h', 'error1_sum_48h', 'error2_sum_48h',
       'error3_sum_48h', 'error4_sum_48h', 'error5_sum_48h', 'comp1', 'comp2',
       'comp3', 'comp4', 'model', 'age'],
      dtype='object')

In [20]:
final_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 876053 entries, 0 to 876052
Data columns (total 29 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   machineID           876053 non-null  int64         
 1   datetime            876053 non-null  datetime64[ns]
 2   volt_mean_4h        876053 non-null  float64       
 3   volt_std_4h         876053 non-null  float64       
 4   rotate_mean_4h      876053 non-null  float64       
 5   rotate_std_4h       876053 non-null  float64       
 6   pressure_mean_4h    876053 non-null  float64       
 7   pressure_std_4h     876053 non-null  float64       
 8   vibration_mean_4h   876053 non-null  float64       
 9   vibration_std_4h    876053 non-null  float64       
 10  volt_mean_48h       876053 non-null  float64       
 11  volt_std_48h        876053 non-null  float64       
 12  rotate_mean_48h     876053 non-null  float64       
 13  rotate_std_48h      876053 no

### Target Creation

Since we have the failures data in a separate table, we need to prepare our target variable based on the dates and the time window we selected. 

The objective is to estimate the probability of a machine experiencing a failure within the specified time frame, specifically due to one of four component failures. In order to establish the target variable, a categorical failure feature is generated to serve as the label. Records that fall within a 48-hour window before a failure of component 1 are labeled as "comp1," while those within the same time frame before component 2, 3, or 4 failures are labeled as "comp2," "comp3," or "comp4" respectively. All records that are not within the 48-hour window of any component failure are labeled as "Normal." 

In [21]:
df_failures.head()

Unnamed: 0,datetime,machineID,failure
0,2015-01-05 06:00:00,1,comp4
1,2015-03-06 06:00:00,1,comp1
2,2015-04-20 06:00:00,1,comp2
3,2015-06-19 06:00:00,1,comp4
4,2015-09-02 06:00:00,1,comp4


In [22]:
time_interval = 48  # hours
train_data = final_feat.merge(df_failures, on=["datetime", "machineID"], how="left")
# Sort the DataFrame by machineID and datetime to ensure proper grouping and ordering
train_data = train_data.sort_values(by=["machineID", "datetime"])

# Group the DataFrame by machineID
grouped = train_data.groupby("machineID")
# change failure to categorical feature
train_data["failure"] = train_data["failure"].astype("category")
# Apply backfill within each machineID group
train_data["failure"] = grouped["failure"].fillna(method="bfill", limit=time_interval)

# label the remaining rows as 'Normal'; the category must be added first because failure is categorical
train_data["failure"] = train_data["failure"].cat.add_categories("Normal")
train_data["failure"] = train_data["failure"].fillna("Normal")

train_data.head()

Unnamed: 0,machineID,datetime,volt_mean_4h,volt_std_4h,rotate_mean_4h,rotate_std_4h,pressure_mean_4h,pressure_std_4h,vibration_mean_4h,vibration_std_4h,...,error3_sum_48h,error4_sum_48h,error5_sum_48h,comp1,comp2,comp3,comp4,model,age,failure
0,1,2015-01-03 05:00:00,180.694192,9.610191,485.046875,46.151381,95.428381,4.437964,37.832936,6.458281,...,0.0,0.0,0.0,20.958333,215.958333,155.958333,170.958333,model3,18,Normal
1,1,2015-01-03 06:00:00,180.666104,9.621991,469.576558,39.616761,94.209279,2.410554,42.215156,9.533812,...,0.0,0.0,0.0,21.0,216.0,156.0,171.0,model3,18,comp4
2,1,2015-01-03 07:00:00,180.472154,9.754595,463.636015,32.65289,97.226801,5.232113,45.090099,8.898822,...,0.0,0.0,0.0,21.041667,216.041667,156.041667,171.041667,model3,18,comp4
3,1,2015-01-03 08:00:00,172.821801,5.835065,501.168525,47.092576,96.874856,5.396217,45.555083,9.089829,...,0.0,0.0,0.0,21.083333,216.083333,156.083333,171.083333,model3,18,comp4
4,1,2015-01-03 09:00:00,172.153274,5.786929,492.435436,52.918165,95.204932,7.893687,49.398795,2.469875,...,0.0,0.0,0.0,21.125,216.125,156.125,171.125,model3,18,comp4
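
To make the labeling rule easier to follow, here is a small sketch of the backfill on toy data with a horizon of 2 rows instead of 48. The toy frame and the names `horizon` and `label` are assumptions. The grouped `bfill` is used here as an equivalent way to write the per-machine backfill.

```python
import pandas as pd

# toy data: machine 1 fails at 09:00 (comp4), machine 2 fails at 06:00 (comp1)
toy = pd.DataFrame(
    {
        "machineID": [1, 1, 1, 1, 1, 2, 2, 2],
        "datetime": pd.to_datetime(
            [
                "2015-01-01 06:00", "2015-01-01 07:00", "2015-01-01 08:00",
                "2015-01-01 09:00", "2015-01-01 10:00",
                "2015-01-01 06:00", "2015-01-01 07:00", "2015-01-01 08:00",
            ]
        ),
        "failure": [None, None, None, "comp4", None, "comp1", None, None],
    }
)
toy = toy.sort_values(["machineID", "datetime"]).reset_index(drop=True)

horizon = 2  # number of rows (hours) before a failure that receive its label

# backfill within each machine so a failure never labels another machine's rows
toy["label"] = (
    toy.groupby("machineID")["failure"].bfill(limit=horizon).fillna("Normal")
)
print(toy[["machineID", "datetime", "label"]])
```

Note that the failure row itself keeps its label, so each failure marks `horizon + 1` rows in total.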


That's it. We now have a labeled training dataset. Let's save it as a parquet file.

### Save Data

In [23]:
# save the labeled features as parquet file for later use.
train_data.to_parquet("data/training_data.parquet", index=False)