# Engine Predictive Maintenance Analytics

This notebook loads the NASA C-MAPSS FD001 dataset (Kaggle version),
performs preprocessing, builds baseline and improved predictive models,
and exports predictions and metrics for the Streamlit dashboard app.

### Objectives

* The primary objective of this notebook is to predict the Remaining Useful Life (RUL) of a turbofan engine using data analytics and machine learning.
* The secondary objective is to use the dataset to answer a set of business questions and test hypotheses across three data analytics categories: descriptive, diagnostic and predictive analytics.

### Inputs

* The notebook uses the FD001 subset of the NASA C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) Turbofan Engine Degradation dataset (Kaggle version) to build a predictive maintenance model that estimates the Remaining Useful Life (RUL) of engines from sensor data collected over time. FD001 is one of four subsets of the dataset; the subsets differ in the number of operating conditions and fault modes.

### Outputs

* The notebook exports predictions and metrics for the Streamlit dashboard app.
* Code, plots and a predictive model are written, plotted and built at various stages of the notebook.




---

### Import Important Libraries

* All the libraries needed for the project are installed in the virtual environment (.venv).
* They are listed in the requirement.txt file.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

plt.style.use("seaborn-v0_8")
pd.set_option("display.max_columns", None)



---

### Load Dataset

* This section loads the dataset into the notebook.
* Dataset: FD001
  - Train trajectories: 100
  - Test trajectories: 100
  - Conditions: ONE (Sea Level)
  - Fault Modes: ONE (HPC Degradation)
  - Dataset files:
    - train_FD001.txt
    - test_FD001.txt
    - RUL_FD001.txt
    - x.txt

In [4]:
# The Kaggle dataset files are space separated with trailing empty columns
# The following function loads the FD001 files into pandas DataFrame
# The files have no header row, so column names were added using the metadata provided on Kaggle

def load_fd001(data_dir="data"):
    col_names = [
        "engine_id", "cycle",
        "op_setting_1", "op_setting_2", "op_setting_3"
    ] + [f"sensor_{i}" for i in range(1, 22)]

    df_train = pd.read_csv(
        f"{data_dir}/train_FD001.txt",
        sep=r"\s+",
        header=None,
        names=col_names
    )

    df_test = pd.read_csv(
        f"{data_dir}/test_FD001.txt",
        sep=r"\s+",
        header=None,
        names=col_names
    )

    df_rul = pd.read_csv(
        f"{data_dir}/RUL_FD001.txt",
        sep=r"\s+",
        header=None,
        names=["RUL"]
    )

    return df_train, df_test, df_rul

df_train, df_test, df_rul = load_fd001()

df_train.head(), df_test.head(), df_rul.head()





FileNotFoundError: [Errno 2] No such file or directory: 'data/train_FD001.txt'
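
The `FileNotFoundError` above is consistent with the notebook running from a folder that has no `data/` subfolder (the working directory is only switched to the project root in a later section). One way to make loading less sensitive to the working directory is sketched below. It assumes the FD001 files live in a folder named `data` at or above the current directory; `find_data_dir` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def find_data_dir(start=None, folder="data", marker="train_FD001.txt"):
    """Walk upwards from `start` until `<dir>/<folder>/<marker>` exists.

    Hypothetical helper: returns the data folder as a Path, or raises
    FileNotFoundError listing every location that was checked.
    """
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    checked = []
    # Check the starting directory first, then each of its parents in turn
    for base in [start, *start.parents]:
        candidate = base / folder
        checked.append(str(candidate))
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"{marker} not found in any of: {checked}")
```

With this in place, the loader could be called as `load_fd001(find_data_dir())`, since the f-strings inside `load_fd001` accept a `Path` as well as a string.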

In [None]:
print("Training data shape:", df_train.shape)
print("Testing data shape:", df_test.shape)
print("RUL data shape:", df_rul.shape)
# Display first few rows of the training data
print(df_train.head())

In [None]:
# Calculate RUL for training data
rul_train = df_train.groupby("engine_id")["cycle"].max().reset_index()
rul_train.columns = ["engine_id", "max_cycle"]
df_train = df_train.merge(rul_train, on="engine_id", how="left")
df_train["RUL"] = df_train["max_cycle"] - df_train["cycle"]
df_train.drop("max_cycle", axis=1, inplace=True)
print("Training data with RUL:")
print(df_train.head())
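# Prepare test data with RUL
# RUL_FD001.txt holds one value per engine: the RUL at that engine's last recorded cycle
rul_test = df_test.groupby("engine_id")["cycle"].max().reset_index()
rul_test.columns = ["engine_id", "max_cycle"]
# Row i of df_rul is taken to belong to engine_id i + 1 (groupby returns engines in sorted order)
rul_test["final_RUL"] = df_rul["RUL"].values
df_test = df_test.merge(rul_test, on="engine_id", how="left")
# RUL at each cycle = RUL at the last recorded cycle + cycles remaining until that last cycle
df_test["RUL"] = df_test["final_RUL"] + (df_test["max_cycle"] - df_test["cycle"])
df_test.drop(columns=["max_cycle", "final_RUL"], inplace=True)
print("Testing data with RUL:")
print(df_test.head())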
# Feature selection: Drop columns with low variance or irrelevant features
irrelevant_sensors = ["sensor_4", "sensor_5", "sensor_10", "sensor_16", "sensor_18", "sensor_19"]
df_train.drop(columns=irrelevant_sensors, inplace=True)
df_test.drop(columns=irrelevant_sensors, inplace=True)
# Split training data into features and target
X = df_train.drop(columns=["RUL", "engine_id", "cycle"])
y = df_train["RUL"]
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(df_test.drop(columns=["RUL", "engine_id", "cycle"]))
# Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Validate the model
y_val_pred = rf_model.predict(X_val)
mse = mean_squared_error(y_val, y_val_pred)
mae = mean_absolute_error(y_val, y_val_pred)
print(f"Validation MSE: {mse:.2f}")
print(f"Validation MAE: {mae:.2f}")
# Test the model
y_test_pred = rf_model.predict(X_test)
print("Test predictions (first 10):", y_test_pred[:10])
# Visualize feature importance
feature_importances = rf_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({
    "Feature": features,
    "Importance": feature_importances
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=importance_df)
plt.title("Feature Importance from Random Forest Regressor")
plt.show()
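
The cell above computes validation metrics and test predictions but does not yet write them to disk for the dashboard. A minimal sketch of that export step follows; the `outputs` folder and the file names `predictions.csv` and `metrics.json` are hypothetical placeholders, and `export_results` is not part of the original notebook.

```python
import json
from pathlib import Path

import pandas as pd


def export_results(pred_df, metrics, out_dir="outputs"):
    """Write predictions to CSV and metrics to JSON; return both paths.

    `out_dir` and the two file names are hypothetical placeholders.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    pred_file = out_path / "predictions.csv"
    metrics_file = out_path / "metrics.json"
    pred_df.to_csv(pred_file, index=False)
    # Cast to float so NumPy scalars serialise cleanly to JSON
    clean = {name: float(value) for name, value in metrics.items()}
    metrics_file.write_text(json.dumps(clean, indent=2))
    return pred_file, metrics_file
```

In this notebook it might be called as `export_results(df_test[["engine_id", "cycle", "RUL"]].assign(predicted_RUL=y_test_pred), {"val_mse": mse, "val_mae": mae})`.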

---

### Exploratory Data Analysis - EDA

* The following cells carry out EDA steps to understand, clean and prepare the data for analysis.

In [None]:
print("Training data shape:", df_train.shape)
print("Testing data shape:", df_test.shape)
print("RUL data shape:", df_rul.shape)
# Display basic information about the training data
# info() prints its summary directly and returns None, so it is not wrapped in print()
print("Training Data Info:")
df_train.info()
print("\nTraining Data Head:")
print(df_train.head())
print("\nTraining Data Description:")
print(df_train.describe())
print("\nTraining Data Missing Values:")
print(df_train.isnull().sum())
print("\n" + "="*50 + "\n")
# Display basic information about the test data
print("Test Data Info:")
df_test.info()
print("\nTest Data Head:")
print(df_test.head())
print("\nTest Data Description:")
print(df_test.describe())
print("\nTest Data Missing Values:")
print(df_test.isnull().sum())
print("\n" + "="*50 + "\n")
# Display basic information about the RUL data
print("RUL Data Info:")
df_rul.info()
print("\nRUL Data Head:")
print(df_rul.head())
print("\nRUL Data Description:")
print(df_rul.describe())
print("\nRUL Data Missing Values:")
print(df_rul.isnull().sum())

---

### Data Visualisation

* The following cells use data visualisation libraries to plot charts that address the business questions and test the hypotheses.

---

### Descriptive Analytics

* The following cells carry out descriptive analytics on the dataset.
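
As one possible starting point for this section, the sketch below summarises how many cycles each engine ran before its record ends. It assumes the `engine_id` and `cycle` columns defined in `load_fd001`; `engine_lifetimes` is a hypothetical helper, and the small `demo` frame is made-up data standing in for `df_train`.

```python
import pandas as pd


def engine_lifetimes(df):
    """Return the number of recorded cycles per engine as a Series."""
    return df.groupby("engine_id")["cycle"].max().rename("total_cycles")


# Made-up rows standing in for df_train
demo = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})
lifetimes = engine_lifetimes(demo)
# Summary statistics (count, mean, min, max, quartiles) of engine lifetimes
print(lifetimes.describe())
```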

---

### Diagnostic Analytics

* The following cells carry out diagnostic analytics on the dataset.
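
A simple diagnostic that could go here is to rank sensors by how strongly they move with RUL. The sketch below uses Pearson correlation as an assumed choice of measure; `sensor_rul_correlation` is a hypothetical helper, and it expects a frame that already has a `RUL` column and `sensor_*` columns.

```python
import pandas as pd


def sensor_rul_correlation(df, target="RUL"):
    """Pearson correlation of each varying sensor with the target, strongest first."""
    sensor_cols = [c for c in df.columns if c.startswith("sensor_")]
    # Constant sensors have no defined correlation, so skip them up front
    varying = [c for c in sensor_cols if df[c].nunique() > 1]
    corr = df[varying].corrwith(df[target])
    # Order by absolute strength while keeping the sign of each correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

A large absolute value only indicates a linear association with RUL; it does not by itself show that a sensor is useful to the model.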

---

### Feature Engineering, Selection and Scaling

* The following cells conduct Feature Engineering, Feature Selection and Feature Scaling to prepare the data for predictive analytics
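
The earlier modelling cell drops a hard-coded list of sensors. A sketch of how such a list could be checked against the data is shown below: it flags sensor columns whose standard deviation is effectively zero. The threshold value is an assumption, and `low_variance_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def low_variance_columns(df, threshold=1e-8):
    """Return names of sensor columns whose standard deviation is at or below `threshold`."""
    sensor_cols = [c for c in df.columns if c.startswith("sensor_")]
    stds = df[sensor_cols].std()
    return stds[stds <= threshold].index.tolist()
```

Running `low_variance_columns(df_train)` before any columns are dropped would give a data-driven list to compare with the manually chosen one.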

---

### Predictive Analytics

* The following cells perform predictive maintenance analytics on the NASA turbofan engine data.
* Remaining Useful Life (RUL) is used as the target for model training.
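
The earlier modelling cell splits rows at random with `train_test_split`, so cycles from the same engine can land in both the training and validation sets, which may make validation error look better than it would on unseen engines. A sketch of an engine-level split using scikit-learn's `GroupShuffleSplit` follows; `split_by_engine` is a hypothetical helper and the `demo` frame is made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_engine(df, test_size=0.2, random_state=42):
    """Split rows into train/validation so that no engine appears in both."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    train_idx, val_idx = next(splitter.split(df, groups=df["engine_id"]))
    return df.iloc[train_idx], df.iloc[val_idx]


# Made-up frame: 10 engines with 3 cycles each
demo = pd.DataFrame({
    "engine_id": np.repeat(np.arange(1, 11), 3),
    "cycle": np.tile([1, 2, 3], 10),
})
train_part, val_part = split_by_engine(demo)
print("Validation engines:", sorted(val_part["engine_id"].unique()))
```

Features and target would then be taken from each part separately before scaling and fitting.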

---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\User\\Documents\\Sarkima\\Sarki\\MyEducation\\MyCourses\\CodeInstitute\\DataAnalytics\\VSCode\\predictive-maintenance-project\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\User\\Documents\\Sarkima\\Sarki\\MyEducation\\MyCourses\\CodeInstitute\\DataAnalytics\\VSCode\\predictive-maintenance-project'


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
