# Regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load the training data from the CSV file `./data/PM_train.csv`

In [None]:
df = pd.read_csv('./data/PM_train.csv')

In [None]:
df.info()

## Feature engineering

Based on the input data description we walked through in a previous section, an intuitive predictive maintenance question to ask is: "Given the operation and failure history of these aircraft engines, can we predict when an in-service engine will fail?"

We reformulate this question as a regression problem: how many more cycles will an in-service engine last before it fails?

Calculate the maximum cycle count for each engine id

In [None]:
df.groupby(['engine_id'])['cycle'].max()

Create a new column from the maximum cycle count calculated above

In [None]:
df['RUL'] = df.groupby(['engine_id'])['cycle'].transform('max')
df.head()

Subtract the current cycle from that maximum in each row to get the remaining useful life (RUL)

In [None]:
df['RUL'] = df.groupby(['engine_id'])['cycle'].transform('max') - df['cycle']
df.head()
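
To make the RUL calculation concrete, here is a minimal sketch on a made-up table with two engines. The `toy` DataFrame and its values are illustrative assumptions, not rows from `PM_train.csv`; the sketch also assumes that the last recorded cycle of each engine is its failure cycle.

```python
import pandas as pd

# Toy data: engine 1 runs 3 cycles, engine 2 runs 2 cycles (made-up values)
toy = pd.DataFrame({
    'engine_id': [1, 1, 1, 2, 2],
    'cycle':     [1, 2, 3, 1, 2],
})

# Last observed cycle per engine, broadcast back to every row of that engine
last_cycle = toy.groupby('engine_id')['cycle'].transform('max')

# Remaining useful life: cycles left until the last observed cycle
toy['RUL'] = last_cycle - toy['cycle']
print(toy)
```

Each engine counts down to an RUL of 0 on its final recorded cycle.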

In [None]:
df

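Generate a sample feature based on a rolling mean over `s2` with a window of 5 cycles. This first version is not grouped by engine, so windows at the start of one engine still include readings from the previous engine.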

In [None]:
df['a2'] = df['s2'].rolling(5, min_periods=1).mean()
df.head()

Build this rolling mean feature, as well as a rolling standard deviation feature, for all 21 sensors, this time computed separately for each engine

In [None]:
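for i in range(1, 22):
    # The rolling window restarts for each engine; dropping only the group
    # level keeps the original row index so the result aligns with df
    rolling = df.groupby('engine_id')['s' + str(i)].rolling(5, min_periods=1)
    df['a' + str(i)] = rolling.mean().reset_index(level=0, drop=True)
    df['sd' + str(i)] = rolling.std().reset_index(level=0, drop=True)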
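
The difference between an ungrouped and a per-engine rolling window is easiest to see on a small example. The sketch below uses a made-up `toy` table and a window of 2 (rather than 5) to keep the numbers readable; the column names `a2_global` and `a2_engine` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'engine_id': [1, 1, 1, 2, 2, 2],
    's2':        [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Ungrouped: the window at the first row of engine 2 still contains
# the last reading of engine 1
toy['a2_global'] = toy['s2'].rolling(2, min_periods=1).mean()

# Grouped: the window restarts for each engine; dropping the group level
# restores the original row index so the assignment aligns row by row
toy['a2_engine'] = (
    toy.groupby('engine_id')['s2']
       .rolling(2, min_periods=1)
       .mean()
       .reset_index(level=0, drop=True)
)
print(toy)
```

The two columns differ only at the first row of engine 2, where the ungrouped window mixes readings from both engines.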

In [None]:
df.shape

In [None]:
df.head()

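Drop rows with missing values. The rolling standard deviation is undefined when the window holds a single observation, so at least the first cycle of each engine contains NaN.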

In [None]:
df.dropna(inplace=True)
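
A short sketch of where the missing values come from, using a made-up series `s`: with `min_periods=1` the first window contains one value, so its mean exists but its sample standard deviation does not.

```python
import math

import pandas as pd

s = pd.Series([4.0, 6.0, 8.0])  # made-up sensor readings

# The first window holds a single value: the mean is defined,
# but the sample standard deviation (ddof=1) is NaN
roll_mean = s.rolling(5, min_periods=1).mean()
roll_std = s.rolling(5, min_periods=1).std()

print(roll_mean.tolist())
print(roll_std.tolist())
print('first std is NaN:', math.isnan(roll_std.iloc[0]))
```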

Normalize all values to an interval between 0 and 1

In [None]:
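from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Scale every column except the target; assigning by column list
# replaces the columns, so integer columns can become floats
cols = df.columns[df.columns != 'RUL']
df[cols] = scaler.fit_transform(df[cols])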

In [None]:
df.head()
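
As a sanity check on what `MinMaxScaler` does with its default settings, the sketch below compares it with the formula `(x - min) / (max - min)` applied per column. The array `X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 20.0]])

scaled = MinMaxScaler().fit_transform(X)

# Same result by hand: (x - column min) / (column max - column min)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```

Each column is mapped independently, so its smallest value becomes 0 and its largest becomes 1.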

Separate the DataFrame into one containing all features and another containing the target variable

In [None]:
df_X = df.drop(['engine_id', 'RUL'], axis=1)
df_X.info()

In [None]:
df_y = df['RUL']
df_y

Separate train and test data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.2)

### 4-step modelling pattern

**Step 1.** Import the model class you want to use, here linear regression

In [None]:
from sklearn.linear_model import LinearRegression

**Step 2.** Make an instance of the model

In [None]:
model = LinearRegression()

**Step 3.** Train the model on the data; the fitted model stores what it learned.

In [None]:
model.fit(X_train, y_train)

**Step 4.** Use the trained model to predict the results for the test set

In [None]:
y_pred = model.predict(X_test)

Compare the predictions with the actual values using the mean absolute error (MAE), which here is measured in cycles

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
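
The mean absolute error is the average of the absolute differences between actual and predicted values. A minimal sketch with made-up numbers (`y_true` and `y_hat` are illustrative, not results from the models above) shows that the manual formula matches `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100, 50, 10])  # made-up RUL values
y_hat = np.array([90, 65, 10])    # made-up predictions

# Average of |actual - predicted|: (10 + 15 + 0) / 3
manual_mae = np.mean(np.abs(y_true - y_hat))
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(manual_mae, sklearn_mae)
```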

Let's try with a Decision Tree Regression model

In [None]:
from sklearn import tree
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean_absolute_error(y_test, y_pred)

Try again with a Gradient Boosting Regression model

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=50)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean_absolute_error(y_test, y_pred)
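
Because every scikit-learn regressor follows the same fit/predict pattern, the three experiments above can also be written as one loop. The sketch below is one possible arrangement, not part of the original notebook: it runs on synthetic data from `make_regression` so that it is self-contained, and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `models` and `scores` are chosen here to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engine features (not the PM_train data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
    'gbr': GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the same split and record its test MAE
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))

# Report the models from lowest to highest error
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f'{name}: MAE = {mae:.2f}')
```

To use the loop with the engine data, replace the synthetic split with the notebook's `X_train`, `X_test`, `y_train` and `y_test`. Fixing `random_state` makes the comparison repeatable between runs.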