# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **Feature Engineering Notebook**

## Objectives

**1. Cleaning**

Performed within the [data cleaning notebook](https://github.com/roeszler/filter-maintenance-predictor/blob/main/jupyter_notebooks/02_DataCleaning.ipynb).

**2. Data Transformation**
* Processing the data for the modelling stage.
* Transform data into a format that is useful for the algorithm to learn the relationships among the variables.
* Evaluate the use of the following approaches to engineer the variables:
    * ordinal categorical encoding
    * numerical transformation
    * smart correlated selection
    
**3. Feature Extraction**
* Evenly distribute dust type

**4. Feature Selection**

**5. Feature Iteration**



### Inputs

1. Cleaned Test Dataset : `outputs/datasets/collection/dfCleanTrain.csv`

2. Cleaned Train Dataset : `outputs/datasets/collection/dfCleanTrain.csv`

### Outputs

* Generate engineered Train and Test sets, both saved under `outputs/datasets/transformed`

### Conclusions

  * Best approach to engineer variables based on...
  * Transformations that we will consider in our pipeline are...

---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Cleaned Data

In [None]:
import pandas as pd
df_total = pd.read_csv(f'outputs/datasets/cleaned/dfCleanTotal.csv')

---

# Data Transformation

## Ordinal Categorical encoding

#### Convert `Data_No` to a categorical variable

In [None]:
# data_no_total = df_total['Data_No'].map(str)
# df_total['Data_No'] = data_no_total
# df_total.info()

## Numerical transformations

This process can consider transformers like:
* Smooth with SMA
* Logarithmic in base e
* Logarithmic in base 10
* Reciprocal
* Power
* BoxCox
* Yeo Johnson

Note: Part of the challenge with this data is dealing with a continuous dataset that is comprised of data 'bins'. Each `Data_No` data bin represents an individual test cycle. 

When engineering **descriptive statistics** into the dataset, such as a mean or standard deviation computed from a sample of previous values, **we need to treat each bin individually** so that the first rows of a bin are never calculated from the last rows of the previous bin.

To solve this, we loop through the bins in sequence, calculate the progressive statistic within each bin (for example the change in differential pressure, `change_DP`), and append the result to the bins already processed. This builds up a version of the dataset that includes the new calculation:

### Change in Differential Pressure
Include the change in Differential Pressure calculation.

**Note**: The first `change_DP` value in each bin has no previous observation, so we replace it with the first `differential_pressure` value of that bin. 
* This signifies that the first observation starts from a **zero** DP value.

In [None]:
df_change_dp = pd.DataFrame()

# Process each Data_No bin separately so diff() never spans two bins
list_data_nos = list(df_total['Data_No'].unique())
for n in list_data_nos:
    # .copy() avoids modifying a view of df_total
    df_bin = df_total[df_total['Data_No'] == n].copy()

    # First row of a bin has no predecessor: use its own pressure value
    change_dp_calc = df_bin['Differential_pressure'].diff().fillna(df_bin['Differential_pressure'])
    df_bin.insert(loc=7, column='change_DP', value=change_dp_calc)

    df_change_dp = pd.concat([df_change_dp, df_bin], ignore_index=True)
df_total = df_change_dp
df_total.loc[446:451]
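
The same per-bin calculation can also be written without an explicit loop. The sketch below uses a small made-up frame (only the column names come from the notebook) and relies on `groupby(...).diff()`, which restarts the difference inside every `Data_No` group. It is an alternative formulation, not the method used above.

```python
import pandas as pd

# Toy frame with two bins; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    'Data_No': [1, 1, 1, 2, 2],
    'Differential_pressure': [0.5, 1.5, 4.0, 2.0, 2.5],
})

# diff() restarts inside every Data_No group, so the first row of a bin
# is never compared with the last row of the previous bin.
per_bin_diff = toy.groupby('Data_No')['Differential_pressure'].diff()

# The first row of each bin has no predecessor (NaN); fall back to the
# raw reading, i.e. treat the bin as starting from zero pressure.
toy['change_DP'] = per_bin_diff.fillna(toy['Differential_pressure'])
print(toy)
```

Because the result is aligned to the original index, it can be assigned straight back to the frame without concatenating the bins again.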

#### Outliers in differential pressure observations

In each bin we notice that the size and direction of the `change_DP` measure occasionally produces a zero or negative value. This highlights a fluctuation in the **differential pressures**. These may be considered outliers as the pressure gradient across the filter needs time to stabilize. We have considered three main methods to deal with these observations:
* Log transformation
* Winsorize method
* Dropping the outliers

These are handled later in this [feature engineering](https://github.com/roeszler/filter-maintenance-predictor/blob/main/jupyter_notebooks/03_FeatureEngineering.ipynb) notebook.

**Random sample** (Bin 96) to plot and inspect change in Differential Pressure distributions

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(20,5))

bin_96 = df_total[df_total['Data_No'] == 96]
sns.boxplot(x = bin_96['change_DP'], ax=ax1)
sns.histplot(x = bin_96['change_DP'], ax=ax2)
sns.lineplot(x=bin_96['Time'], y=bin_96['Differential_pressure'], ax=ax3)
plt.show()

---

## Smoothing of **Differential Pressure**
*Dealing with outliers in differential pressure observations*

Within each bin, the `differential_pressure` and `change_DP` observations occasionally fluctuate outside the general trend of the data (outliers). An example is the **second to last** observation of `Data_No` bin 98 (index no. 75826), shown below, where the recorded value sits outside the general trend. 

In [None]:
df_total[df_total['Data_No'] == 98].tail()

To quantify such measures, we have considered a tolerance of ±200% change in value; this can be altered depending on what we see when fitting the models.
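
One possible reading of that tolerance is a relative change of more than 200% from the previous reading within the same bin. The sketch below flags such rows on invented data; the `TOLERANCE` constant and the `dp_outlier` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Invented readings for two bins; only the column names come from the notebook.
toy = pd.DataFrame({
    'Data_No': [98, 98, 98, 98, 99, 99],
    'Differential_pressure': [10.0, 11.0, 40.0, 12.0, 5.0, 5.5],
})

TOLERANCE = 2.0  # 200 %, expressed as a fraction (assumed interpretation)

# Previous reading within the same bin (NaN for the first row of a bin).
prev = toy.groupby('Data_No')['Differential_pressure'].shift(1)

# Relative change from the previous reading.
rel_change = (toy['Differential_pressure'] - prev) / prev

# NaN compares as False, so the first row of a bin is never flagged.
toy['dp_outlier'] = rel_change.abs() > TOLERANCE
print(toy)
```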

To **smooth** the variability of this measure we can apply a **mean** (or average) value to the data in various ways. 
This attempts to soften the severity of the changes seen and reduce the instances of values that are higher or lower than the general trend. It effectively reduces variability in the **differential pressure** measure, making it easier to model.

We have considered the following methods to deal with outliers:
* Smooth with Simple Moving Average (SMA)
* Exponentially Weighted Mean (EWM)
* Log transformation
* Dropping the outliers
* Winsorize method

For each calculation that uses previous observations to produce a value, we can reuse the per-bin process applied when calculating **change in differential pressure** above.

#### Smoothing of Differential Pressure with a **Simple Moving Average** (SMA)
A simple moving average (SMA) is an arithmetic **moving average** calculated by adding recent values and dividing by the number of observations in the calculation. Here we reuse the per-bin process to add an SMA of the last four measures, which moves as the observations progress.

* The first three observations of the SMA in each bin are **NaN**, because a full four-point window is not yet available. These could be imputed with arbitrary values, mean values, closest k sample values and/or a MICE (Multiple Imputation by Chained Equations) value that fits a linear regression with the present values.
    * On review, MICE would be the preferred method to impute the missing SMA values. However, considering the progressive nature of the `differential_pressure` variable, the Exponentially Weighted Mean (EWM) is a preferable alternative to the SMA.
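
With a window of four, `rolling(4).mean()` returns NaN until a full window is available. The sketch below shows this on a made-up series, together with `min_periods=1` as one way to avoid the gaps. That option is shown for comparison only and is not what the notebook applies.

```python
import pandas as pd

dp = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='Differential_pressure')

# A full 4-point window first exists at the fourth row,
# so rows 0-2 are NaN.
sma_strict = dp.rolling(4).mean()

# min_periods=1 averages whatever is available so far instead,
# which leaves no gaps to impute.
sma_partial = dp.rolling(4, min_periods=1).mean()

print(pd.DataFrame({'strict': sma_strict, 'partial': sma_partial}))
```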

#### Smoothing of Differential Pressure with an **Exponentially Weighted Mean** (EWM)
The EWM is a moving average that gives older observations lower weightings in the calculation. The weights fall exponentially as the data point gets older, hence the name exponentially weighted.

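* The EWM is calculated as a 4 point EWM (`span=4`), which corresponds to a smoothing factor of α = 2 / (span + 1) = 0.4. Every earlier observation in the bin contributes to the weighted mean, with weights that shrink by a factor of 0.6 per step.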
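
To make the weighting concrete, the sketch below reproduces `ewm(span=4, adjust=False).mean()` with an explicit recursion on a made-up series. The variable names are illustrative only.

```python
import pandas as pd

dp = pd.Series([10.0, 20.0, 30.0, 40.0])

span = 4
alpha = 2 / (span + 1)  # 0.4 for span=4

# Manual recursion: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t
manual = [dp.iloc[0]]
for x in dp.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

# pandas equivalent, with the same arguments as the notebook cell
ewm_pandas = dp.ewm(span=span, adjust=False).mean()

print(manual)
print(ewm_pandas.tolist())
```

Unlike the SMA, the recursion produces a value from the very first row, so there are no leading NaN values to impute.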


In [None]:
df_means = pd.DataFrame()

# Process each Data_No bin separately so the averages never span two bins
list_data_nos = list(df_total['Data_No'].unique())
for n in list_data_nos:
    # .copy() avoids modifying a view of df_total
    df_bin = df_total[df_total['Data_No'] == n].copy()

    # Exponentially weighted mean, smoothing factor 2 / (span + 1) = 0.4
    ewm_calc = df_bin['Differential_pressure'].ewm(span=4, adjust=False).mean()
    df_bin.insert(loc=2, column='4point_EWM', value=ewm_calc)

    # Simple moving average over a 4-row window (NaN until the window is full)
    sma_calc = df_bin['Differential_pressure'].rolling(4).mean()
    df_bin.insert(loc=2, column='4point_SMA', value=sma_calc)

    # Row-to-row change in the EWM; the first row keeps its own EWM value
    change_ewm_calc = df_bin['4point_EWM'].diff().fillna(df_bin['4point_EWM'])
    df_bin.insert(loc=10, column='change_EWM', value=change_ewm_calc)

    df_means = pd.concat([df_means, df_bin], ignore_index=True)
df_total = df_means
df_total.loc[446:451]

## **Logarithmic Transformation** of Differential Pressure

As seen at data collection, the original continuous `differential_pressure` data is right or **positively skewed** at 1.81 and does not follow the shape of a normally distributed bell curve.
* A plot of each **smoothed** variable is included to check that it remains representative of the **differential_pressure** source, which it is.
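
A skewness figure like the one quoted can be obtained with `Series.skew()`. The sketch below uses a synthetic right-skewed sample rather than the project data, so its numbers will not match the 1.81 above. It only illustrates how the measure responds to a log transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed stand-in for the pressure data.
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

skew_raw = raw.skew()          # clearly positive for a right-skewed sample
skew_log = np.log(raw).skew()  # near zero: the log of a lognormal sample is normal

print(round(skew_raw, 2), round(skew_log, 2))
```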

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(20,6))

sns.histplot(x = df_total['Differential_pressure'], ax=ax1)
sns.histplot(x = df_total['4point_SMA'], ax=ax2)
sns.histplot(x = df_total['4point_EWM'], ax=ax3)

ax1.title.set_text('Differential Pressure Histogram\n')
ax2.title.set_text('Simple Moving Average Histogram\n')
ax3.title.set_text('Exponential Weighted Mean Histogram\n')
plt.show()

A **log transformation** brings the distribution of this data closer to a normal distribution, **as far as possible**, which supports a more valid statistical analysis of the data.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(20,6))

df_total_log_dp = df_total['Differential_pressure']
log_dp = np.log(df_total_log_dp)
sns.histplot(x = log_dp, ax=ax1)

df_total_log_sma = df_total['4point_SMA']
log_sma = np.log(df_total_log_sma)
sns.histplot(x = log_sma, ax=ax2)

df_total_log_ewm = df_total['4point_EWM']
log_ewm = np.log(df_total_log_ewm)
sns.histplot(x = log_ewm, ax=ax3)

# plt.title('Log Histogram')
ax1.title.set_text('Log_DP Histogram\n')
ax2.title.set_text('Log_SMA Histogram\n')
ax3.title.set_text('Log_EWM Histogram\n')
plt.show()

**Observations** 
* The shape of the numerical differential_pressure data is improved by a natural logarithmic transformation
    * The untreated values, with mean 78.23, standard deviation 107.34 and median 35.17, indicate the positively skewed nature of the original data
    * The transformed data is much improved, with the mean of 3.18 much closer to the median of 3.89

The transformed data is still affected by the negative skew in the data:
* Pressure measures show unusual behavior at the start of most tests, producing an unusual entry tail of values at zero or below.
    * These can be managed by 
        * adding a constant to stop the value becoming negative, or 
        * indicate the negative as a missing number or 
        * dropping the entire observation
    * Considering these measures indicate a zone at the beginning of the test procedure to 'equalize' the difference in pressure between the areas before and after the filter, we can treat them as outliers and confidently drop the entire row this observation sits in.
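
The three options listed above can be sketched on a made-up series as follows. The shift constant is a hypothetical choice made for illustration, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Invented values, including a negative and a zero reading.
ewm = pd.Series([-0.5, 0.0, 0.8, 3.0, 50.0], name='4point_EWM')

# Option 1: shift by a constant so every value is strictly positive.
shift = 1 - ewm.min()            # hypothetical choice of constant
log_shifted = np.log(ewm + shift)

# Option 2: mark non-positive readings as missing before taking the log.
log_masked = np.log(ewm.where(ewm > 0))

# Option 3: drop the affected rows entirely.
log_dropped = np.log(ewm[ewm > 0])

print(log_shifted.tolist())
print(log_masked.tolist())
print(log_dropped.tolist())
```

Option 1 keeps every row but changes the scale of the variable, option 2 keeps the row count and leaves gaps, and option 3 shortens the dataset.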

**Further EWM transformation**

Remove rows with negative values in the `log_EWM` column (the natural log of `4point_EWM`)

In [None]:
# Remove negative values
df_total.insert(loc=4, column='log_EWM', value=log_ewm)
# Keep rows whose log_EWM is >= 0; NaN and -inf rows are dropped as well
df_total = df_total[df_total['log_EWM'].ge(0)]

# Visualize the data
old_shape = pd.read_csv(f'outputs/datasets/cleaned/dfCleanTotal.csv')
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(20,5))

sns.histplot(x = log_dp, ax=ax1)
sns.histplot(x = df_total['log_EWM'], ax=ax2)

plt.title('log_EWM Histogram')
ax1.title.set_text('Transformed Differential Pressure Histogram\n')
ax2.title.set_text('Transformed Differential Pressure Exponentially Weighted Mean Histogram (Values > 0)\n')
plt.show()
print("Old Shape: ", old_shape.shape)
print("New Shape: ", df_total.shape)

This transformed data is far more useable to train the model. 
* Less variability 
* Negative values removed
* Minimal loss of data

---

### Further Dropping of Outliers
Dropping all negative numbers to reduce the noise of the data

#### Evaluation of detecting outliers using **Inter Quartile Range** (IQR).

In [None]:
df_total_IQR = df_total.copy()

# IQR
Q1 = np.percentile(df_total_IQR, 25, method='midpoint')
Q3 = np.percentile(df_total_IQR, 75,method='midpoint')
IQR = Q3 - Q1
print("Old Shape: ", df_total_IQR.shape)
 
# Upper bound
upper = np.where(df_total_IQR >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df_total_IQR <= (Q1-1.5*IQR))
 
''' Removing the Outliers '''
df_total_IQR.drop(upper[0], inplace = True)
df_total_IQR.drop(lower[0], inplace = True)
 
print("New Shape: ", df_total_IQR.shape)

# df_total_log_IQR = df_total_IQR['Differential_pressure']
# df_total_log_IQR = df_total_IQR['4point_SMA']
# df_total_log_IQR = df_total_IQR['4point_EWM']
df_total_log_IQR = df_total_IQR['log_EWM']
log_dp_IQR = np.log(df_total_log_IQR)
# df_total_IQR['log_DP'] = log_dp
sns.histplot(x = df_total_log_IQR)
plt.title('IQR Histogram of log EWM')
plt.show()

IQR-based outlier removal is not particularly effective on this data once the negative values have been removed. We will not apply this method and will proceed with other techniques.
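
For reference, `np.percentile` called on a whole DataFrame without an `axis` returns a single value across all columns. A sketch of the same 1.5 × IQR rule restricted to one column, on invented values, could look like this:

```python
import pandas as pd

# Invented values with one obvious high outlier.
toy = pd.DataFrame({'log_EWM': [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 9.0]})

q1 = toy['log_EWM'].quantile(0.25)
q3 = toy['log_EWM'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A boolean mask keeps rows inside the fences; no index labels are needed,
# so it also works on frames whose index was not reset after filtering.
inside = toy['log_EWM'].between(lower, upper)
toy_iqr = toy[inside]

print("Old Shape: ", toy.shape)
print("New Shape: ", toy_iqr.shape)
```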

#### **The Winsorize Method**
Winsorization is the process of replacing the extreme values of statistical data in order to limit the effect of outliers on calculations or results obtained from that data. We apply this to our logged, exponentially weighted differential pressure variable `log_EWM`, where outliers are present only at one end of the data:
* The lowest 10% of values are set equal to the value of the data point at the 10th percentile.
* The limits of `(0.1, 0.1)` passed to `winsorize` below also set the highest 10% of values equal to the value at the 90th percentile.
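
`winsorize` takes its `limits` as a `(lower fraction, upper fraction)` pair, so clipping one end or both is a matter of which fraction is non-zero. A minimal sketch on made-up values:

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.arange(1, 11, dtype=float)  # 1.0 ... 10.0, made-up data

# Clip the lowest 10 % only; the upper end is left untouched.
lower_only = np.asarray(winsorize(values, limits=(0.1, 0.0)))

# Clip 10 % at both ends.
both_ends = np.asarray(winsorize(values, limits=(0.1, 0.1)))

print(lower_only)
print(both_ends)
```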

In [None]:
from scipy.stats.mstats import winsorize

WinsorizedArrayMean = np.mean(df_total['log_EWM'])
winz_EWM = winsorize(df_total['log_EWM'],(0.1,0.1))
df_total.insert(loc=5, column='winz_EWM', value=winz_EWM)
WinsorizedArray = df_total['winz_EWM']
plt.boxplot(WinsorizedArray)
plt.title('Winsorized array')
plt.show()
WinsorizedArrayNewMean = np.mean(WinsorizedArray)
print('Old Mean: ', WinsorizedArrayMean)
print('New Mean: ', WinsorizedArrayNewMean)

In [None]:
df_total_log_winz = df_total['winz_EWM']
log_winz = np.log(df_total_log_winz)
sns.histplot(x = log_winz)

Winsorization smooths the data too much and does not add much value to training our models, so we will not apply this method either. 

In [None]:
del df_total['winz_EWM']
del df_total['4point_SMA']
df_total

---

# Feature Extraction

## Evenly distribute dataset by `Dust` type

Both the **train** and **test** sets supplied have data distributed unevenly between 50 test bins. To account for this we wish to assess the measures of central tendency for each Dust class, with the aim of reducing the data size to a more evenly proportioned one between classes.

#### **Train** Dataset

**Considerations**

* The proportion of data that **has reached filter failure** is represented by how close `filter_balance` is to zero or less. 
    * Data with `filter_balance` values approaching zero may be worth keeping and will form part of our heuristic decision process.
* The **mean** is the most frequently used measure of central tendency because it uses all values in the data set to give an average.
    * For data from skewed distributions (like `differential_pressure`), however, the **median** is better than the mean because it isn’t influenced by extremely large values.

In the following calculation, we see a summary of the top ten `Data_No` bins whose final `differential_pressure` observations are highest, that is, closest to or at **600 Pa** (the point of filter failure).

Divide back into **Train** and **Test** sets

In [None]:
last_row = df_total[df_total.Data_No != df_total.Data_No.shift(-1)]
# last_row_descending = last_row.sort_values(by='Dust', ascending=True)
last_row_descending = last_row.sort_values(by='Differential_pressure', ascending=False)
last_row_descending.head(n=10)

---

In [None]:
last_row_train = df_train[df_train.Data_No != df_train.Data_No.shift(-1)]
# last_row_descending = last_row_train.sort_values(by='Dust', ascending=True)
last_row_descending = last_row_train.sort_values(by='Differential_pressure', ascending=False)
last_row_descending.head(n=10)

Note the diagram below showing proportions of `Dust` variable in the **df_train** dataset.
* It shows a disproportionate mix between classes. This will be the first dataset we tidy up.

In [None]:
%matplotlib inline

category_totals = df_train.groupby('Dust')['Differential_pressure'].count().sort_values()
category_totals.plot(kind="barh", title='Proportion of Dust Classes in df_train\n', xlabel='\nObservations', ylabel='Dust Class')
category_totals

#### Our next aim is to 
* Fill these bins with data that best represents a central tendency.
* Make the size of each bin around **5000** observations (similar to the A4 Coarse Dust class bin) 

#### Procedure
* Include a comparison of how far each `differential_pressure` measure **deviates** from the **.median()** value of its bin.
* Order by `filter_balance` to show the sets with data closest to 600 Pa `differential_pressure`.
* Add a cumulative measure of `Data_No` bin sizes to use as a ranking
* Create a dataframe of the A3 Medium Dust class (`Dust` = **1.025**)

Add a calculation of Standard Deviation to the **df_train** set

In [None]:
del df_train['winz_Mean']
del df_train['log_DP']
std_group = df_train.groupby('Data_No').std()
std_group.index.name = None
std_group.loc[:,'Data_No'] = std_group.index
map_std = df_train['Data_No'].map(std_group.set_index('Data_No')['Differential_pressure'])
df_train.loc[:,'std_DP'] = map_std
# df_test.loc[363:368]
df_train.loc[444:453]

Add a calculation of the Coefficient of Variation (standard deviation relative to the mean) to the **df_train** set

In [None]:
import numpy as np
cv = lambda data: np.std(data, ddof=1) / np.mean(data, axis=0) * 100 
var_group = df_train.groupby('Data_No').apply(cv)
var_group.index.name = None
var_group.loc[:,'Data_No'] = var_group.index
map_var = df_train['Data_No'].map(var_group.set_index('Data_No')['Differential_pressure'])
df_train.loc[:,'cv_DP'] = map_var
# df_test.loc[363:368]
df_train.loc[444:453]

The coefficient of variation expresses the standard deviation as a percentage of the mean. As we can see, it does not add value to our understanding of the data, primarily due to the skewed nature of the `differential_pressure` continuous variable. 
* This reinforces the understanding that descriptive statistics using the mean may not be the preferred measure of central tendency.

**Remove Coefficient of Variation and Add Median to df_train**
* Median is the preferred measure of central tendency to observe in a skewed dataset such as this as it is not as affected by larger values.

In [None]:
del df_train['cv_DP']
median_group = df_train.groupby('Data_No').median()
median_group.index.name = None
median_group.loc[:,'Data_No'] = median_group.index
map_median = df_train['Data_No'].map(median_group.set_index('Data_No')['Differential_pressure'])
df_train.loc[:,'median_DP'] = map_median
df_train.loc[444:453]
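
The group, map and assign pattern used for the per-bin statistics can also be expressed with `groupby(...).transform(...)`, which returns one value per original row. The sketch below uses invented values and only the column names from the notebook. It is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Invented readings for two bins.
toy = pd.DataFrame({
    'Data_No': [1, 1, 1, 2, 2, 2],
    'Differential_pressure': [1.0, 2.0, 9.0, 4.0, 4.0, 10.0],
})

by_bin = toy.groupby('Data_No')['Differential_pressure']

# transform() broadcasts each bin's statistic back onto its rows.
toy['std_DP'] = by_bin.transform('std')        # sample std (ddof=1)
toy['median_DP'] = by_bin.transform('median')
toy['cv_DP'] = toy['std_DP'] / by_bin.transform('mean') * 100

# The same pattern gives the number of observations in each bin.
toy['bin_Size'] = by_bin.transform('count')

print(toy)
```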

#### Now we can evaluate the dataframe with just **A3 Dust** in it, ordered by `filter_balance` as a measure of time to filter failure

Map the size of each bin and include a cumulative sum of each **bin size** to help see at which data bin the running total reaches **4500** or more values.

In [None]:
bin_sum = df_train.groupby('Data_No')['Data_No'].count().reset_index(name='bin_Tot')
map_bin = df_train['Data_No'].map(bin_sum.set_index('Data_No')['bin_Tot'])
df_train.loc[:, 'bin_Size'] = map_bin
# df_train.loc[38817:38827]
dust_A3 = df_train[df_train['Dust'] == 1.025]
filter_A3 = dust_A3[dust_A3.Data_No != dust_A3.Data_No.shift(-1)]
df_train_A3 = filter_A3.sort_values(by='Filter_Balance', ascending=True)
df_train_A3['c_Sum'] = df_train_A3['bin_Size'].cumsum()
df_train_A3.head(13)

We can see that, in the current dataframe containing only A3 Medium Dust observations and ordered so that the tests closest to a completed run to failure come first:
* **The top 9 data bins (seen at bin 7) would extract an A3 Medium dust training dataset with 5,109 observations**
* We will now perform a further PDA to evaluate the suitability of these bins
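
Reading the cut-off from the cumulative sum can also be done programmatically. The sketch below assumes one row per bin, already sorted by the chosen ranking. The bin numbers, sizes and the `TARGET` constant are invented for illustration.

```python
import pandas as pd

# One row per bin, already sorted by the ranking of interest.
ranked = pd.DataFrame({
    'Data_No': [7, 3, 12, 5, 9],
    'bin_Size': [1200, 900, 1500, 1100, 800],
})
ranked['c_Sum'] = ranked['bin_Size'].cumsum()

TARGET = 4500  # threshold mentioned in the text

# Count the leading bins whose running total is still below TARGET,
# then add the one bin that takes the total past it.
n_bins = int((ranked['c_Sum'] < TARGET).sum()) + 1
selected = ranked['Data_No'].head(n_bins).tolist()

print(n_bins, selected)
```

If the running total never reaches the target, `head(n_bins)` simply returns every bin.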

#### Rank by Standard Deviations, ordered by `std_DP`
The standard deviation is used to measure the spread of values in a sample.

In [None]:
# dust_A3 = df_train[df_train['Dust'] == 1.025]
# filter_A3 = dust_A3[dust_A3.Data_No != dust_A3.Data_No.shift(-1)]
df_train_A3_std = filter_A3.sort_values(by='std_DP', ascending=True)
df_train_A3_std['c_Sum'] = df_train_A3_std['bin_Size'].cumsum()
df_train_A3_std.head(15)

#### Rank by central tendency of the Median value, ordered by `median_DP`
* The median is the value in the middle of the sorted data

In [None]:
# dust_A3 = df_train[df_train['Dust'] == 1.025]
# filter_A3 = dust_A3[dust_A3.Data_No != dust_A3.Data_No.shift(-1)]
df_train_A3_median = filter_A3.sort_values(by='median_DP', ascending=True)
df_train_A3_median['c_Sum'] = df_train_A3_median['bin_Size'].cumsum()
df_train_A3_median.head(14)

#### Considerations
* T...

---

### Extract these bins from the df_train dataset

Make a separate frame indicating the bin numbers we wish to extract

In [None]:
bin_no = df_train_A3['Data_No'].head(9)
bin_no.to_frame()

Use these references to create a dataframe `df_train_cleaned_A3` that is ready for inclusion in our final dataframe `df_train_clean`.
* Note we disregard the cumulative sum measure as it doesn't add value to further calculations

In [None]:
# .copy() makes this an independent frame rather than a second name for df_train
df_train_copy = df_train.copy()
df_train_cleaned_A3 = df_train_copy[df_train_copy['Data_No'].isin(bin_no)]
df_train_cleaned_A3

#### A Quick Review: 

In [None]:
print('Shape started with: ', dust_A3.shape)
print('Shape we have now: ', df_train_cleaned_A3.shape)

In [None]:
df_train_cleaned_A3

Add these to a new dataset to compare

In [None]:
dust_A2 = df_train[df_train['Dust'] == 0.900]
dust_A3 = df_train_cleaned_A3
dust_A4 = df_train[df_train['Dust'] == 1.200]

df_train_compare = pd.concat([dust_A2, dust_A3, dust_A4], ignore_index = True)
df_train_compare

In [None]:
%matplotlib inline

category_totals = df_train_compare.groupby('Dust')['Differential_pressure'].count().sort_values()
category_totals.plot(kind="barh", title='Proportion of Dust Classes in df_train_compare\n', xlabel='\nObservations', ylabel='Dust Class')
category_totals

---

### Repeat procedure with remaining `Dust` classes and **Test** dataset

Extract, Clear and Replace...

In [None]:
# bin_sum = df_train.groupby('Data_No')['Data_No'].count().reset_index(name='bin_Tot')
# map_bin = df_train['Data_No'].map(bin_sum.set_index('Data_No')['bin_Tot'])
# df_train.loc[:, 'bin_Size'] = map_bin

# df_train.loc[38817:38827]

dust_A2 = df_train[df_train['Dust'] == 0.900]
filter_A2 = dust_A2[dust_A2.Data_No != dust_A2.Data_No.shift(-1)]
df_train_A2 = filter_A2.sort_values(by='Filter_Balance', ascending=True)
df_train_A2['c_Sum'] = df_train_A2['bin_Size'].cumsum()
df_train_A2.head(13)

In [None]:
bin_no = df_train_A2['Data_No'].head(9)
bin_no.to_frame()

In [None]:
# .copy() makes this an independent frame rather than a second name for df_train
df_train_copy = df_train.copy()
df_train_cleaned_A2 = df_train_copy[df_train_copy['Data_No'].isin(bin_no)]
df_train_cleaned_A2

In [None]:
dust_A2 = df_train_cleaned_A2
dust_A3 = df_train_cleaned_A3
dust_A4 = df_train[df_train['Dust'] == 1.200]

df_train_compare = pd.concat([dust_A2, dust_A3, dust_A4], ignore_index = True)
df_train_compare

In [None]:
%matplotlib inline

category_totals = df_train_compare.groupby('Dust')['Differential_pressure'].count().sort_values()
category_totals.plot(kind="barh", title='Proportion of Dust Classes in df_train_compare\n', xlabel='\nObservations', ylabel='Dust Class')
category_totals

### How much data do we need?
* At a bin sample of 9 for both A2 and A3 dust, we currently have a little under 15,000 observations to train the ML models
* We will review this depending on the performance of the models
* Choosing how much data to use is a balance between the risk of overfitting and the risk of underfitting, informed by the qualitative nature of the heuristics surrounding the predictions made
* Nonlinear algorithms (like clustering) may need more data

Short answer is, **we have plenty of extra data at this point that may be less suited to training the model**. 

Notwithstanding, it is still live data and there may be some value in it toward increasing the power of our ML models. 
We do have the option to alter the selected number of dust class bins and possibly augment the smaller A4 dust dataset with the **Synthetic Minority Oversampling Technique (SMOTE)**, or SMOTE NC for categorical data, should we need to.
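
SMOTE synthesises new minority-class rows. As a much simpler stand-in, the sketch below balances classes by random oversampling with replacement, which only duplicates existing rows. It is not SMOTE, and the data is invented. It is shown only to illustrate what evening out the `Dust` classes means in code.

```python
import pandas as pd

# Invented frame: two larger classes and one small one.
toy = pd.DataFrame({
    'Dust': [0.9] * 6 + [1.025] * 6 + [1.2] * 2,
    'Differential_pressure': range(14),
})

# Size of the largest class, used as the target for every class.
target = toy['Dust'].value_counts().max()

parts = []
for dust, grp in toy.groupby('Dust'):
    extra = target - len(grp)
    if extra > 0:
        # Duplicate randomly chosen rows of the smaller class.
        grp = pd.concat([grp, grp.sample(n=extra, replace=True, random_state=0)])
    parts.append(grp)

balanced = pd.concat(parts, ignore_index=True)
print(balanced['Dust'].value_counts())
```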

In [None]:
df_train_cleaned = df_train_compare
df_train_cleaned

---

# Feature Selection

---

# Feature Iteration

---

## Divide Datasets

Divide **df_total** into original **df_test** & **df_train** proportions as supplied

In [None]:
data_no_total = df_total['Data_No'].map(int).round(decimals=0)
df_total['Data_No'] = data_no_total
n = df_total['Data_No'][0:len(df_total)]
df_train = df_total[n < 51].reset_index(drop=True, names='index')
df_test = df_total[n > 50].reset_index(drop=True, names='index')
del df_train['RUL']
df_test

In [None]:
df_train

## Save Datasets 
Save the files to /transformed folder

In [None]:
import os
# exist_ok=True makes the call a no-op when the folder already exists
os.makedirs(name='outputs/datasets/transformed', exist_ok=True)

df_train.to_csv(f'outputs/datasets/transformed/dfTransformedTrain.csv',index=False)
df_test.to_csv(f'outputs/datasets/transformed/dfTransformedTest.csv',index=False)
df_total.to_csv(f'outputs/datasets/transformed/dfTransformedTotal.csv',index=False)

---

# Conclusions and Next steps

#### Conclusions: 
* 

#### Next Steps:
* Regression Model
* Classification Model
* Cluster Model
* Correlation Study