## Data Science applied to maintenance planning optimization
### Objective

The objective is to reduce maintenance costs for the truck air system by identifying vehicles needing preventive maintenance in advance.

Maintenance Costs

* Inspection without defect: $10
* Preventive repair: $25
* Corrective maintenance: $500

Key Information

* class: Indicates whether the truck had a defect in the air system ("pos") or not ("neg").

### Challenge Activities

1. What steps would you take to solve this problem? Please describe as completely and clearly as possible all the steps that you see as essential for solving the problem.

  Steps to Solve the Problem:

* Problem Understanding: Understand the current maintenance costs and identify the relationship between air system failures and maintenance costs.
* Data Collection: Receive and prepare historical and current maintenance data.
* Exploratory Data Analysis: Identify patterns, distributions, and correlations in the data.
* Predictive Modeling: Train models to predict air system failures before corrective maintenance is needed.
* Model Evaluation: Evaluate models using technical metrics and calculate the expected financial impact.
* Implementation and Monitoring: Implement the model in production, monitor its performance, and regularly reassess for adjustments.

2. Which technical data science metric would you use to solve this challenge? Ex: absolute error, rmse, etc.

* For a classification problem like this one, precision, recall, and F1-score are the key metrics. Recall matters most here: a missed defect (false negative) leads to a $500 corrective maintenance, while a false alarm (false positive) costs only a $10 inspection.

3. Which business metric would you use to solve the challenge?

* The main business metric here is the total maintenance cost. The goal is to reduce costs associated with corrective and preventive maintenance by early identification of air system issues in trucks.

4. How do technical metrics relate to the business metrics?

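* Each prediction outcome has a price: a true positive costs $25 (preventive repair), a false positive costs $10 (inspection without defect), and a false negative costs $500 (corrective maintenance). Higher recall therefore shrinks the most expensive term, and higher precision cuts wasted inspections, so the technical metrics translate directly into the total maintenance cost.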

5. What types of analyses would you like to perform on the customer database?

* Exploratory analysis to understand failure distributions, correlations with other variables (such as truck age, operational conditions, etc.), and identification of seasonal or temporal patterns in the data.
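
A minimal sketch of the first exploratory checks, assuming a pandas DataFrame shaped like the challenge data (a `class` column plus sensor columns). The `toy` frame is a made-up stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "class": ["neg", "neg", "pos", "neg"],
    "aa_000": [76698, 33058, 41040, 12],
    "ab_000": [np.nan, np.nan, np.nan, 0.0],
})

# Class balance as fractions of all rows
class_share = toy["class"].value_counts(normalize=True)

# Fraction of missing values per column, worst column first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Mean of each numeric feature per class, a first hint at separability
per_class_mean = toy.groupby("class").mean(numeric_only=True)

print(class_share, missing_share, per_class_mean, sep="\n\n")
```

Columns with a very high missing share are candidates for removal before modelling.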

6. What techniques would you use to reduce the dimensionality of the problem?

* PCA (Principal Component Analysis) to reduce data dimensionality without losing significant information.
* Feature selection based on importance using methods like Random Forest Feature Importance.
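
As an illustration of the PCA option, here is a small sketch on synthetic data (not the truck data). It assumes scikit-learn and keeps as many components as are needed to explain 95% of the variance; that threshold is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed columns driven by only 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# PCA is scale sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```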

7. What techniques would you use to select variables for your predictive model?

* Use techniques such as correlation analysis, feature importance, and domain knowledge to select the most relevant variables influencing air system failures in trucks.
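
The correlation part of this answer could look like the sketch below, which drops one column from every highly correlated pair. The column names and the 0.95 cutoff are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Hypothetical features: f_b is a near-duplicate of f_a, f_c is independent
features = pd.DataFrame({
    "f_a": base,
    "f_b": base * 2.0 + rng.normal(scale=0.01, size=500),
    "f_c": rng.normal(size=500),
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose absolute correlation exceeds 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = features.drop(columns=to_drop)

print(to_drop, list(selected.columns))
```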

8. What predictive models would you use or test for this problem? Please indicate at least 3.

* Random Forest Classifier
* Gradient Boosting Classifier (e.g., XGBoost)
* Logistic Regression

9. How would you rate which of the trained models is the best?

* Compare performance metrics (precision, recall, F1-score) across models.
* Assess the estimated financial impact of each model to select the best in terms of cost reduction.
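
One way to combine both criteria is to score every candidate model with the challenge price list. The sketch below does this on synthetic imbalanced data; scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and `maintenance_cost` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 5% positives), not the truck data
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)


def maintenance_cost(y_true, y_pred):
    """Cost under the challenge price list: FP = $10, TP = $25, FN = $500."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(10 * fp + 25 * tp + 500 * fn)


models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Fit each model and record the cost of acting on its test-set predictions
costs = {
    name: maintenance_cost(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
best = min(costs, key=costs.get)
print(costs, best)
```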

10. How would you explain the result of your model? Is it possible to know which variables are most important?

* Interpret results including analysis of variable importance to identify which features have the most influence on predicting air system failures.
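
For a tree ensemble, a first view of variable importance comes from the fitted model itself. The sketch below uses synthetic data in which only the first two columns carry signal; the `sensor_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative features in the first two columns
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"sensor_{i}" for i in range(X.shape[1])]  # hypothetical names

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features
ranking = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor high-cardinality features, so a permutation-based check on held-out data is a useful complement.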

11. How would you assess the financial impact of the proposed model?

* Compare current costs with expected costs after implementing the model to calculate estimated savings.

12. What techniques would you use to perform the hyperparameter optimization of the chosen model?

* Use techniques like Grid Search or Random Search to tune model hyperparameters for improved performance.
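
A grid search might be set up as in the sketch below. The data is synthetic, the grid values are arbitrary, and scoring on recall is an assumption that reflects the high cost of missed defects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data, not the truck data
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9], random_state=0)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # missed defects are the expensive error
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` accepts the same estimator and scoring arguments and samples a fixed number of settings, which scales better to larger grids.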

13. What risks or precautions would you present to the customer before putting this model into production?

* Validate model robustness with real-world data.
* Ensure model predictions are interpreted correctly by end-users.
* Consider the impact of false positives and false negatives on actual business operations.
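
Because a false negative ($500) costs far more than a false positive ($10), the default 0.5 decision threshold is not necessarily the cheapest one. The sketch below, on synthetic data, scans thresholds on a validation split and picks the one with the lowest cost; `cost_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, not the truck data
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]  # predicted probability of a defect


def cost_at(threshold):
    """Maintenance cost when every truck with proba >= threshold is sent to the shop."""
    pred = proba >= threshold
    fp = np.sum(pred & (y_val == 0))   # inspection without defect, $10
    tp = np.sum(pred & (y_val == 1))   # preventive repair, $25
    fn = np.sum(~pred & (y_val == 1))  # corrective maintenance, $500
    return int(10 * fp + 25 * tp + 500 * fn)


thresholds = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
costs = [cost_at(t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

print(best_threshold, min(costs), cost_at(0.5))
```

The threshold should be chosen on validation data and then confirmed on a separate test set, otherwise the cost estimate is optimistic.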

14. If your predictive model is approved, how would you put it into production?

* Develop automated pipelines for continuous integration of new data and model predictions.
* Implement ongoing monitoring to ensure the model operates as expected.
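
A minimal sketch of the packaging step, assuming scikit-learn and joblib: preprocessing and model are bundled in one `Pipeline`, saved to disk, and reloaded the way a scoring service would. The data is synthetic and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 1.0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values, as in raw sensor data

# Bundling the imputer with the model guarantees identical preprocessing at scoring time
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "air_system_model.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same_predictions = bool(np.array_equal(pipeline.predict(X), restored.predict(X)))
print(same_predictions)
```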

15. If the model is in production, how would you monitor it?

* Monitor performance metrics (precision, recall) regularly.
* Verify model stability over time and re-evaluate as needed.

16. If the model is in production, how would you know when to retrain it?

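* Retrain when the monitored metrics (for example, recall on newly labelled trucks) degrade or when the input data distribution shifts, and otherwise on a regular schedule, so the model adapts to changes in truck conditions and the operational environment.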
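
One possible retraining trigger is a drift check on the input features. The sketch below computes a population stability index (PSI) with NumPy; the function name, the 10 bins, and the 0.25 alert level are assumptions (0.25 is a common rule of thumb, not a value from this project).

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI of one feature between a reference sample and a current sample."""
    # Bin edges come from reference quantiles; the outer bins are open ended
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.digitize(reference, inner_edges), minlength=bins)
    cur_counts = np.bincount(np.digitize(current, inner_edges), minlength=bins)
    eps = 1e-6  # avoids log(0) when a bin is empty
    ref_share = np.clip(ref_counts / len(reference), eps, None)
    cur_share = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, size=5000)  # feature values at training time
stable_sample = rng.normal(0.0, 1.0, size=5000)    # same distribution in production
shifted_sample = rng.normal(1.0, 1.0, size=5000)   # mean shifted by one standard deviation

psi_stable = population_stability_index(training_sample, stable_sample)
psi_shifted = population_stability_index(training_sample, shifted_sample)

needs_retraining = psi_shifted > 0.25  # assumed alert level
print(round(psi_stable, 4), round(psi_shifted, 4), needs_retraining)
```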

## Load the data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Load the data
df_previous = pd.read_csv('air_system_previous_years.csv')
df_present = pd.read_csv('air_system_present_year.csv')

df_previous

Unnamed: 0,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,neg,76698,na,2130706438,280,0,0,0,0,0,...,1240520,493384,721044,469792,339156,157956,73224,0,0,0
1,neg,33058,na,0,na,0,0,0,0,0,...,421400,178064,293306,245416,133654,81140,97576,1500,0,0
2,neg,41040,na,228,100,0,0,0,0,0,...,277378,159812,423992,409564,320746,158022,95128,514,0,0
3,neg,12,0,70,66,0,10,0,0,0,...,240,46,58,44,10,0,0,0,4,32
4,neg,60874,na,1368,458,0,0,0,0,0,...,622012,229790,405298,347188,286954,311560,433954,1218,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,neg,153002,na,664,186,0,0,0,0,0,...,998500,566884,1290398,1218244,1019768,717762,898642,28588,0,0
59996,neg,2286,na,2130706538,224,0,0,0,0,0,...,10578,6760,21126,68424,136,0,0,0,0,0
59997,neg,112,0,2130706432,18,0,0,0,0,0,...,792,386,452,144,146,2622,0,0,0,0
59998,neg,80292,na,2130706432,494,0,0,0,0,0,...,699352,222654,347378,225724,194440,165070,802280,388422,0,0


In [7]:
# Analysing the proportion of trucks with defects
defects_percentage_previous = (df_previous['class'].value_counts(normalize=True) * 100).get('pos', 0)
defects_percentage_present = (df_present['class'].value_counts(normalize=True) * 100).get('pos', 0)

print(f"Percentage of lorries with defects in previous years' data: {defects_percentage_previous}%")
print(f"Percentage of lorries with defects in the current year's data: {defects_percentage_present}%")

Percentage of lorries with defects in previous years' data: 1.6666666666666667%
Percentage of lorries with defects in the current year's data: 2.34375%


In [8]:
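# Check for missing values. The raw files mark them with the string 'na',
# so count those markers together with any real NaN values.
print((df_previous.isna() | df_previous.isin(['na'])).sum())
print((df_present.isna() | df_present.isin(['na'])).sum())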

class         0
aa_000        0
ab_000    46329
ac_000     3335
ad_000    14861
          ...  
ee_007      671
ee_008      671
ee_009      671
ef_000     2724
eg_000     2723
Length: 171, dtype: int64
class         0
aa_000        0
ab_000    12363
ac_000      926
ad_000     3981
          ...  
ee_007      192
ee_008      192
ee_009      192
ef_000      762
eg_000      762
Length: 171, dtype: int64


### Handling missing values


In [9]:
import numpy as np

# Strip stray whitespace from string cells, then replace the 'na' marker with NaN
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
df_previous = df_previous.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df_previous.replace('na', np.nan, inplace=True)

df_present = df_present.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df_present.replace('na', np.nan, inplace=True)

# Fill missing values with the constant 0 (not a forward fill)
df_previous_filled = df_previous.fillna(0)
df_present_filled = df_present.fillna(0)
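
An alternative worth considering is to let pandas recognize the marker at load time through the `na_values` parameter of `read_csv`, which also keeps the sensor columns numeric. The sketch below uses a three-row inline excerpt (values taken from the table above) instead of the real files; if the raw files pad `na` with spaces, a stripping step is still needed.

```python
import io

import pandas as pd

# Three-row excerpt in the same layout as the challenge files
raw = io.StringIO(
    "class,aa_000,ab_000,ad_000\n"
    "neg,76698,na,280\n"
    "neg,33058,na,na\n"
    "neg,12,0,66\n"
)

# na_values adds 'na' to the strings that pandas parses as NaN
toy = pd.read_csv(raw, na_values="na")

print(toy.dtypes)
print(toy.isna().sum())
```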

### Train the Classification Model with the previous data

We train a Random Forest classifier on the previous years' data and evaluate it on a held-out 20% test split.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Separate features and labels
X_previous = df_previous_filled.drop(columns=['class'])
y_previous = df_previous_filled['class']

# Encode labels as 0 (neg) and 1 (pos)
y_previous = y_previous.map({'neg': 0, 'pos': 1})

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_previous, y_previous, test_size=0.2, random_state=42)

# Training the Random Forest model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Making predictions
y_pred = clf.predict(X_test)

# Evaluating the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.99      1.00      1.00     11788
           1       0.89      0.69      0.78       212

    accuracy                           0.99     12000
   macro avg       0.94      0.84      0.89     12000
weighted avg       0.99      0.99      0.99     12000

[[11770    18]
 [   66   146]]


### Estimating Cost Reduction

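We estimate the cost reduction by comparing the expected cost when acting on the model's predictions with a baseline cost without the model. The baseline charges $500 for every defective truck and a $10 inspection for every truck without a defect.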

In [11]:
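# Baseline ("current") costs: corrective maintenance ($500) for every defective truck
# plus an inspection ($10) for every truck without a defect
current_cost = (y_test.sum() * 500) + ((len(y_test) - y_test.sum()) * 10)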

# Expected costs with predictive modelling
predicted_defects = (y_pred == 1)
predicted_corrective_maintenance = y_test[(y_test == 1) & (predicted_defects == 0)].count()
predicted_preventive_maintenance = y_test[(y_test == 1) & (predicted_defects == 1)].count()
predicted_inspections = y_test[(y_test == 0) & (predicted_defects == 1)].count()

expected_cost = (predicted_corrective_maintenance * 500) + (predicted_preventive_maintenance * 25) + (predicted_inspections * 10)

# Cost reduction
cost_reduction = current_cost - expected_cost
print(f"Estimated cost reduction: ${cost_reduction}")

Estimated cost reduction: $187050
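
The figure can be checked by hand from the confusion matrix printed earlier for the held-out split. The sketch below reproduces it in plain Python and also shows how strongly it depends on the baseline: the second baseline (no routine inspections today) is an assumption that is not part of the original analysis.

```python
# Confusion matrix of the held-out split: [[TN, FP], [FN, TP]]
tn, fp = 11770, 18
fn, tp = 66, 146

INSPECTION, PREVENTIVE, CORRECTIVE = 10, 25, 500

# Cost when acting on the model: missed defects fail, caught defects get a
# preventive repair, false alarms get an inspection
with_model = fn * CORRECTIVE + tp * PREVENTIVE + fp * INSPECTION

# Baseline used in the notebook: $500 per defective truck, $10 per other truck
baseline_notebook = (fn + tp) * CORRECTIVE + (tn + fp) * INSPECTION
savings_notebook = baseline_notebook - with_model

# Alternative assumption: only defective trucks cost anything today
baseline_run_to_failure = (fn + tp) * CORRECTIVE
savings_run_to_failure = baseline_run_to_failure - with_model

print(savings_notebook)        # 187050
print(savings_run_to_failure)  # 69170
```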


### The final results

The final results that will be presented to the executive board need to be evaluated against air_system_present_year.csv.

In [12]:
# Separate features and labels for current year data
X_present = df_present_filled.drop(columns=['class'])
y_present = df_present_filled['class']

# Retrain the RandomForestClassifier on all previous-years data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_previous, y_previous)

# Making predictions using the trained model
y_pred_present = clf.predict(X_present)

# Map numerical predictions back to the original labels
y_pred_present_labels = np.where(y_pred_present == 0, 'neg', 'pos')

# Evaluate the model with data from the current year
print("Evaluating the Model with Current Year Data:")
print(classification_report(y_present, y_pred_present_labels))
print(confusion_matrix(y_present, y_pred_present_labels))

# Calculate current costs based on data from the current year
current_cost_present = (y_present == 'pos').sum() * 500 + (y_present == 'neg').sum() * 10

# Estimate expected costs with the predictive model for the current year
predicted_corrective_maintenance_present = y_present[(y_present == 'pos') & (y_pred_present_labels == 'neg')].count()
predicted_preventive_maintenance_present = y_present[(y_present == 'pos') & (y_pred_present_labels == 'pos')].count()
predicted_inspections_present = y_present[(y_present == 'neg') & (y_pred_present_labels == 'pos')].count()

expected_cost_present = (predicted_corrective_maintenance_present * 500) + (predicted_preventive_maintenance_present * 25) + (predicted_inspections_present * 10)

# Estimated cost reduction for the current year
cost_reduction_present = current_cost_present - expected_cost_present
print(f"Estimated cost reduction for the current year: ${cost_reduction_present}")

Evaluating the Model with Current Year Data:
              precision    recall  f1-score   support

         neg       0.99      1.00      1.00     15625
         pos       0.93      0.71      0.81       375

    accuracy                           0.99     16000
   macro avg       0.96      0.86      0.90     16000
weighted avg       0.99      0.99      0.99     16000

[[15606    19]
 [  108   267]]
Estimated cost reduction for the current year: $282885


The evaluation on the current year data shows promising performance in predicting truck air system failures. Here is an analysis of the results:

Precision and Recall: The model achieves high precision (0.93) for identifying trucks with air system defects ("pos"), indicating that when it predicts a truck needs maintenance, it is correct 93% of the time. The recall (0.71) suggests that it successfully identifies 71% of all trucks actually needing maintenance.

F1-score: The F1-score, the harmonic mean of precision and recall, is 0.81 for trucks with air system defects. Recall is the weaker of the two components, and since each missed defect costs $500, improving recall is where further gains lie.

Accuracy: The overall accuracy of the model is 99%, which is high. However, accuracy is misleading on an imbalanced dataset like this one: a model that always predicts "neg" would already reach 97.7% accuracy, so precision and recall are more informative here.

Confusion Matrix: The confusion matrix further illustrates the model's performance, with 15606 true negatives (correctly identified as not needing maintenance), 267 true positives (correctly identified as needing maintenance), 19 false positives (incorrectly identified as needing maintenance), and 108 false negatives (incorrectly identified as not needing maintenance).
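
The reported metrics follow directly from these four counts, as the short check below shows.

```python
# Current year confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 15606, 19
fn, tp = 108, 267

precision = tp / (tp + fp)  # share of flagged trucks that really had a defect
recall = tp / (tp + fn)     # share of defective trucks that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.2f}")  # 0.93 0.71 0.81 0.99
```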

Cost Reduction Estimate: Based on these predictions, the cost calculation estimates a reduction of $282,885 for the current year. The saving comes from replacing corrective maintenance ($500) with preventive repair ($25) for the 267 defects caught early, measured against a baseline that charges $500 for every defective truck and a $10 inspection for every other truck.

In conclusion, the model demonstrates strong performance in predicting air system failures in trucks, although it still misses 108 of the 375 defective trucks. The estimated cost savings indicate its potential to reduce operational expenses through proactive maintenance. Ongoing monitoring and model refinement will be essential to sustain its effectiveness.