# Vehicle Maintenance Prediction

## Project Planning and Data Exploration

The main objective of this project is to build a reliable, accurate predictive model that classifies vehicles as needing maintenance or not.

We aim to accomplish the following goals:
- __Develop a high-performing classification model:__ We target at least 90% accuracy on a held-out test set. Using the machine learning algorithms covered in class, such as logistic regression and decision trees, we will determine the most effective model for this prediction task.

- __Identify key features contributing to maintenance prediction:__ We will determine which features have the strongest influence on the model's predictions. This can provide valuable insights into factors affecting vehicle maintenance and inform preventative maintenance strategies.

#### IMPORTING LIBRARIES

In [21]:
import pandas as pd
import numpy as np
from vmpred.constant import *
import pickle

pd.set_option('display.max_columns', None)

#### DATASET OVERVIEW

In [22]:
df = pd.read_csv('rawData/vehicle_maintenance_data.csv')

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   vehicle_model         50000 non-null  object 
 1   mileage               50000 non-null  int64  
 2   maintenance_history   50000 non-null  object 
 3   reported_issues       50000 non-null  int64  
 4   vehicle_age           50000 non-null  int64  
 5   fuel_type             50000 non-null  object 
 6   transmission_type     50000 non-null  object 
 7   engine_size           50000 non-null  int64  
 8   odometer_reading      50000 non-null  int64  
 9   last_service_date     50000 non-null  object 
 10  warranty_expiry_date  50000 non-null  object 
 11  owner_type            50000 non-null  object 
 12  insurance_premium     50000 non-null  int64  
 13  service_history       50000 non-null  int64  
 14  accident_history      50000 non-null  int64  
 15  fuel_efficiency    

In [24]:
df.head(10)

Unnamed: 0,vehicle_model,mileage,maintenance_history,reported_issues,vehicle_age,fuel_type,transmission_type,engine_size,odometer_reading,last_service_date,warranty_expiry_date,owner_type,insurance_premium,service_history,accident_history,fuel_efficiency,tire_condition,brake_condition,battery_status,need_maintenance
0,Truck,58765,Poor,0,4,Diesel,Manual,2000,28524,23-11-2023,24-06-2025,First,20782,6,3,13.622204,New,New,Weak,1
1,Van,60353,Poor,1,7,Petrol,Manual,2500,133630,21-09-2023,04-06-2025,First,23489,7,0,13.625307,New,New,Weak,1
2,Car,68072,Poor,0,2,Petrol,Manual,1500,34022,27-06-2023,27-04-2025,First,17979,7,0,14.306302,New,Good,Weak,1
3,Bus,60849,Poor,4,5,Petrol,Manual,2500,81636,24-08-2023,05-11-2025,Second,6220,7,3,18.709467,New,Worn Out,New,1
4,Bus,45742,Poor,5,1,Petrol,Automatic,2000,97162,25-05-2023,14-09-2025,First,16446,6,2,16.977483,Good,Good,Weak,1
5,Truck,31653,Poor,2,1,Electric,Automatic,800,70954,12-08-2023,05-09-2024,Second,16813,5,3,15.954422,Worn Out,Good,New,0
6,Truck,51211,Good,2,8,Petrol,Automatic,2500,145563,13-01-2024,20-07-2025,First,21057,10,0,16.455703,New,Good,New,0
7,Van,79093,Poor,2,2,Petrol,Automatic,2000,132354,12-05-2023,13-02-2026,Third,6498,3,1,12.128404,Good,New,New,1
8,SUV,59673,Poor,2,6,Petrol,Automatic,800,85733,07-04-2023,21-04-2025,Second,12787,9,1,11.558027,Worn Out,Good,Weak,1
9,Car,37001,Poor,2,9,Petrol,Automatic,1500,8554,05-08-2023,14-05-2025,Second,20860,9,1,12.787248,Worn Out,New,New,0


In [25]:
print(f"Number of Samples: {df.shape[0]} \nNumber of Attributes: {df.shape[1]}")

Number of Samples: 50000 
Number of Attributes: 20


In [26]:
print("FEATURE COLUMNS:\n" + "\n".join(colName.title() for colName in df.columns if colName != "need_maintenance"))

FEATURE COLUMNS:
Vehicle_Model
Mileage
Maintenance_History
Reported_Issues
Vehicle_Age
Fuel_Type
Transmission_Type
Engine_Size
Odometer_Reading
Last_Service_Date
Warranty_Expiry_Date
Owner_Type
Insurance_Premium
Service_History
Accident_History
Fuel_Efficiency
Tire_Condition
Brake_Condition
Battery_Status


In [27]:
print("TARGET VARIABLE:", (df.columns.tolist()[-1]))

TARGET VARIABLE: need_maintenance


## Phase 1: DATASET INGESTION

This phase transfers the data from __RootDirectory/rawData__ to __RootDirectory/data/ingestedData__ for further processing.

Refer to [dataIngestion.py](vmpred/component/dataIngestion.py) for the data ingestion process.
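
As a rough illustration only (the real logic lives in dataIngestion.py, which is not reproduced in this notebook), an ingestion step of this kind can be as small as copying the raw CSV into the ingested-data folder. The function name `ingest_raw_data` and the temporary demo paths are hypothetical, and whether the project copies or moves the file is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def ingest_raw_data(raw_file: Path, ingested_dir: Path) -> Path:
    """Copy a raw data file into the ingested-data directory and return the new path."""
    if not raw_file.is_file():
        raise FileNotFoundError(f"Raw data file not found: {raw_file}")
    ingested_dir.mkdir(parents=True, exist_ok=True)
    target = ingested_dir / raw_file.name
    shutil.copy2(raw_file, target)  # copy2 also preserves file timestamps
    return target


# Demo inside a temporary directory so no real project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "rawData" / "vehicle_maintenance_data.csv"
    raw.parent.mkdir(parents=True)
    raw.write_text("vehicle_model,mileage\nTruck,58765\n")

    copied = ingest_raw_data(raw, root / "data" / "ingestedData")
    copied_name = copied.name
    copied_text = copied.read_text()
```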

## Phase 2: DATASET VALIDATION

After data ingestion, we validate the ingested dataset against the predefined schema documented during the data gathering phase.

Schema: [schema.yaml](config/schema.yaml)

Data Validation: [dataValidation.py](vmpred/component/dataValidation.py)

After validation, we save the data in Parquet format to reduce storage space and speed up reads.
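
A minimal sketch of what such a schema check might look like, assuming the schema boils down to expected column names and dtypes. The `EXPECTED_DTYPES` dictionary is a hypothetical excerpt built from three columns of the `df.info()` output, not the contents of schema.yaml, and both function names are illustrative.

```python
import pandas as pd

# Hypothetical schema excerpt; the real definition lives in config/schema.yaml.
EXPECTED_DTYPES = {
    "mileage": "int64",
    "reported_issues": "int64",
    "vehicle_age": "int64",
}


def validate_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of schema problems; an empty list means the frame is valid."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, found {df[column].dtype}")
    for column in df.columns:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


def save_as_parquet(df: pd.DataFrame, path: str) -> None:
    """Persist a validated frame as Parquet (requires pyarrow to be installed)."""
    df.to_parquet(path, engine="pyarrow")


sample = pd.DataFrame(
    {"mileage": [58765, 60353], "reported_issues": [0, 1], "vehicle_age": [4, 7]}
)
ok_problems = validate_schema(sample, EXPECTED_DTYPES)

# Break the sample on purpose: drop one column and change another column's dtype.
broken = sample.drop(columns="vehicle_age").astype({"mileage": "float64"})
bad_problems = validate_schema(broken, EXPECTED_DTYPES)
```

Returning a list of problems rather than raising on the first one makes it easier to log every mismatch in a single validation pass.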

## Phase 3: DATA TRANSFORMATION
- Step1: Handle missing values
- Step2: Drop duplicates
- Step3: Feature engineering (adding new features)
- Step4: Apply feature scaling to numerical features and one-hot encoding to categorical features
- Step5: Split the data into train and test sets with a stratified shuffle split, so that both sets keep the same class balance as the full dataset. The resulting shapes shown below (35,000 training rows and 15,000 test rows) correspond to a 70:30 ratio.

__[dataTransformation.py](vmpred/component/dataTransformation.py)__
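
Steps 4 and 5 can be sketched with scikit-learn as follows. This is not the code from dataTransformation.py: the synthetic data is made up, `StandardScaler` is an assumption (the notebook only says "feature scaling"), and the column lists are a small illustrative subset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the validated data: 200 rows, 80% positive class.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "mileage": rng.integers(30_000, 80_000, n),
    "vehicle_age": rng.integers(1, 10, n),
    "fuel_type": np.resize(["Diesel", "Electric", "Petrol"], n),
    "need_maintenance": np.r_[np.ones(160, dtype=int), np.zeros(40, dtype=int)],
})

numeric_cols = ["mileage", "vehicle_age"]
categorical_cols = ["fuel_type"]

# Step 4: scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
features = preprocessor.fit_transform(df[numeric_cols + categorical_cols])

# Step 5: stratified split. A test fraction of 0.3 reproduces the
# 35,000 / 15,000 row shapes shown in this notebook.
TEST_FRACTION = 0.3
splitter = StratifiedShuffleSplit(n_splits=1, test_size=TEST_FRACTION, random_state=42)
train_idx, test_idx = next(splitter.split(features, df["need_maintenance"]))

X_train, X_test = features[train_idx], features[test_idx]
y_train = df["need_maintenance"].iloc[train_idx]
y_test = df["need_maintenance"].iloc[test_idx]
```

Because the split is stratified, `y_train` and `y_test` both keep the 80% positive share of the synthetic target.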

##### List of New Features:
1. __Time Since Last Service (in days)__: ReferenceDate - last_service_date
2. __Warranty Duration (in days)__: warranty_expiry_date - ReferenceDate
3. __Mileage per Year__: mileage / vehicle_age
4. __Service Frequency__: service_history / vehicle_age
5. __Accident Rate__: accident_history / vehicle_age
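
The five formulas above can be sketched in pandas as shown here. The two rows are copied from the `df.head(10)` output; the reference date is not stated in this notebook, so `REFERENCE_DATE` is a placeholder assumption, and `add_features` is an illustrative name rather than the function used in dataTransformation.py.

```python
import pandas as pd

# First two rows of df.head(10), restricted to the columns needed here.
raw = pd.DataFrame({
    "mileage": [58765, 60353],
    "vehicle_age": [4, 7],
    "service_history": [6, 7],
    "accident_history": [3, 0],
    "last_service_date": ["23-11-2023", "21-09-2023"],
    "warranty_expiry_date": ["24-06-2025", "04-06-2025"],
})

# Assumption: placeholder value, the project's actual reference date is unknown.
REFERENCE_DATE = pd.Timestamp("2024-01-01")


def add_features(df: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Return a copy of df with the five engineered features added."""
    out = df.copy()
    # Dates in the raw file use day-month-year order.
    last_service = pd.to_datetime(out["last_service_date"], format="%d-%m-%Y")
    warranty_end = pd.to_datetime(out["warranty_expiry_date"], format="%d-%m-%Y")

    out["time_since_last_service"] = (reference_date - last_service).dt.days
    out["warranty_duration"] = (warranty_end - reference_date).dt.days
    # Note: a vehicle_age of 0 would give inf or NaN; this sketch has no guard.
    out["mileage_per_year"] = out["mileage"] / out["vehicle_age"]
    out["service_frequency"] = out["service_history"] / out["vehicle_age"]
    out["accident_rate"] = out["accident_history"] / out["vehicle_age"]
    return out


engineered = add_features(raw, REFERENCE_DATE)
print(engineered[["time_since_last_service", "warranty_duration", "mileage_per_year"]])
```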

In [28]:
trainData = pd.read_parquet('data/transformedData/trainData/train_data.parquet', engine='pyarrow')
testData = pd.read_parquet('data/transformedData/testData/test_data.parquet', engine='pyarrow')

In [29]:
print(trainData.shape)
trainData.head(10)

(35000, 41)


Unnamed: 0,mileage,reported_issues,vehicle_age,engine_size,odometer_reading,insurance_premium,service_history,accident_history,fuel_efficiency,time_since_last_service,warranty_duration,mileage_per_year,service_frequency,accident_rate,vehicle_model_Bus,vehicle_model_Car,vehicle_model_Motorcycle,vehicle_model_SUV,vehicle_model_Truck,vehicle_model_Van,maintenance_history_1,maintenance_history_2,maintenance_history_3,fuel_type_Diesel,fuel_type_Electric,fuel_type_Petrol,transmission_type_Automatic,transmission_type_Manual,owner_type_First,owner_type_Second,owner_type_Third,tire_condition_1,tire_condition_2,tire_condition_3,brake_condition_1,brake_condition_2,brake_condition_3,battery_status_1,battery_status_2,battery_status_3,need_maintenance
8364,1.533129,0.879338,-1.214423,-1.204918,-0.115768,1.047117,-0.527175,-0.448022,-1.722842,0.262439,-1.287742,1.435075,0.206088,0.09837,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1
35463,-1.051276,0.879338,-1.56217,1.503508,0.360795,1.71966,1.559876,1.338492,-0.371811,-0.150199,-0.660332,1.517542,4.533543,4.311432,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
49253,-0.923861,-0.876318,-0.171182,-0.089684,-1.422453,1.226812,0.168508,-0.448022,0.510156,1.510668,-0.467663,-0.501031,-0.226658,-0.407198,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1
48935,1.379189,1.464556,-0.171182,1.503508,-1.669461,-0.46493,-1.570701,-1.341279,-1.609334,1.335297,1.276238,-0.075477,-0.76759,-0.744243,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1
27149,-0.329837,-0.876318,-1.214423,1.503508,1.141088,0.321273,-1.570701,0.445235,-1.442569,-0.88263,-0.092206,0.574485,-0.60531,0.940982,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1
38180,0.533738,1.464556,1.567554,-0.88628,0.187451,0.773004,1.212034,-1.341279,-0.765406,0.355282,-0.255233,-0.633401,-0.388937,-0.744243,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1
41453,-0.477596,-0.876318,1.567554,0.706912,-0.84672,-0.532212,0.864192,1.338492,-0.658035,0.736972,-1.623677,-0.726837,-0.443031,-0.238675,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1
18113,-0.021333,0.294119,-0.171182,-0.88628,-0.25669,1.330643,-1.570701,-1.341279,1.273756,-0.892946,-1.277861,-0.334263,-0.76759,-0.744243,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1
20353,-0.955107,-1.461536,0.872059,0.706912,-0.374265,1.103739,0.864192,1.338492,-0.188664,-1.501587,-0.368859,-0.704916,-0.334844,-0.112283,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1
7987,-0.749855,-0.291099,0.176565,1.503508,1.39464,-0.393495,-0.875017,1.338492,-1.174486,1.42814,1.52819,-0.563249,-0.60531,0.09837,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1


In [30]:
print(testData.shape)
testData.head(10)

(15000, 41)


Unnamed: 0,mileage,reported_issues,vehicle_age,engine_size,odometer_reading,insurance_premium,service_history,accident_history,fuel_efficiency,time_since_last_service,warranty_duration,mileage_per_year,service_frequency,accident_rate,vehicle_model_Bus,vehicle_model_Car,vehicle_model_Motorcycle,vehicle_model_SUV,vehicle_model_Truck,vehicle_model_Van,maintenance_history_1,maintenance_history_2,maintenance_history_3,fuel_type_Diesel,fuel_type_Electric,fuel_type_Petrol,transmission_type_Automatic,transmission_type_Manual,owner_type_First,owner_type_Second,owner_type_Third,tire_condition_1,tire_condition_2,tire_condition_3,brake_condition_1,brake_condition_2,brake_condition_3,battery_status_1,battery_status_2,battery_status_3,need_maintenance
38603,1.156856,1.464556,-0.518929,-1.204918,0.367525,-0.495664,-1.222859,-0.448022,-0.631719,1.417824,-0.764077,0.113077,-0.60531,-0.322936,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
19319,-0.278038,0.879338,1.219806,-1.204918,1.641347,1.227089,-0.179333,-0.448022,-0.403261,-0.459677,-1.589096,-0.672099,-0.575258,-0.556995,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1
41146,1.356067,-0.291099,0.524312,0.706912,1.030476,-1.025337,0.864192,1.338492,0.669205,-0.129567,-0.961686,-0.352707,-0.257568,-0.022003,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1
46300,1.452583,-1.461536,-0.171182,0.706912,-0.036256,0.825889,-0.875017,-0.448022,0.061977,-0.583468,0.910661,-0.061915,-0.551217,-0.407198,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1
44052,1.410922,1.464556,-0.171182,-0.88628,-1.360858,1.143748,-1.570701,-1.341279,-0.366086,-0.903262,1.454087,-0.069613,-0.76759,-0.744243,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
26088,-0.630564,-0.291099,-0.171182,-1.204918,1.590358,-0.78279,0.864192,-1.341279,-0.960987,0.582233,0.638949,-0.446836,-0.010285,-0.744243,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0
46037,0.342442,1.464556,-0.866676,1.503508,1.138047,-1.359671,-1.222859,-0.448022,-0.978684,-0.820735,1.236717,0.244993,-0.515155,-0.182501,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1
9711,-1.173553,-0.876318,-0.171182,0.706912,1.378789,1.212414,-0.179333,-0.448022,-0.434762,-0.088303,1.701098,-0.547168,-0.334844,-0.407198,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
17248,0.749128,0.294119,-1.56217,1.503508,-1.476343,0.966683,1.559876,1.338492,-0.203015,-0.335886,-0.329337,3.180921,4.533543,4.311432,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1
20876,-1.432757,-0.876318,-0.866676,-0.089684,-1.09085,-1.114355,-0.875017,-0.448022,-0.365066,-0.892946,-1.080252,-0.301704,-0.334844,-0.182501,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0


## Phase 4: MODEL TRAINING

- Step1: Read the transformed training data for model training
- Step2: Split the data into train and test sets. The confusion matrices below each total 10,500 rows, i.e. 30% of the 35,000 training rows, which corresponds to a 70:30 split.
- Step3: Read the model configurations from [model.yaml](config/model.yaml)
- Step4: Train the 5 classifier models specified in the model configuration and save each one as a pkl file for the model evaluation phase.

__[modelTrainer.py](vmpred/component/modelTrainer.py)__
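
The training loop might look roughly like the sketch below. It assumes a grid search per model (suggested by the `best_params` and cross-validation columns in the results, but not confirmed by this notebook); `MODEL_CONFIG` is a hypothetical stand-in for model.yaml, and the data is synthetic.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for config/model.yaml: estimator plus parameter grid.
MODEL_CONFIG = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "DecisionTreeClassifier": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5]}),
}

# Synthetic data in place of the transformed training set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {}
with tempfile.TemporaryDirectory() as model_dir:
    for name, (estimator, grid) in MODEL_CONFIG.items():
        # 5-fold cross-validated search over the parameter grid.
        search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        search.fit(X_train, y_train)

        # Save the refitted best estimator as <name>.pkl.
        model_path = Path(model_dir) / f"{name}.pkl"
        with open(model_path, "wb") as fh:
            pickle.dump(search.best_estimator_, fh)

        results[name] = {
            "best_params": search.best_params_,
            "cross_val_mean_accuracy": search.best_score_,
            "accuracy": search.score(X_val, y_val),
            "saved": model_path.exists(),
        }
```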

In [31]:
# Model Performance
modelPerformance = pd.read_csv('data/modelPerformance/model_performance.csv')

In [32]:
modelPerformance

| | model_name | best_params | accuracy | cross_val_mean_accuracy | cross_val_std | roc_auc | log_loss | confusion_matrix | precision | recall | f1_score | training_time | prediction_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | {'C': 10, 'solver': 'saga'} | 0.929333 | 0.932449 | 0.001901 | 0.977343 | 0.149679 | [[1613, 369], [373, 8145]] | 0.929388 | 0.929333 | 0.929361 | 5.761629 | 0.003352 |
| 1 | RandomForestClassifier | {'max_depth': 10, 'min_samples_leaf': 1, 'min_... | 0.957905 | 0.958286 | 0.004923 | 0.990226 | 0.154507 | [[1741, 241], [201, 8317]] | 0.957618 | 0.957905 | 0.95774 | 111.901277 | 0.082413 |
| 2 | SGDClassifier | {'alpha': 0.001, 'loss': 'log_loss', 'penalty'... | 0.927333 | 0.932816 | 0.002364 | 0.977265 | 0.150309 | [[1621, 361], [402, 8116]] | 0.927943 | 0.927333 | 0.927618 | 1.971516 | 0.003205 |
| 3 | DecisionTreeClassifier | {'criterion': 'gini', 'max_depth': 5, 'min_sam... | 0.958286 | 0.958571 | 0.00506 | 0.990797 | 0.088404 | [[1745, 237], [201, 8317]] | 0.958027 | 0.958286 | 0.958139 | 6.384347 | 0.004 |
| 4 | XGBClassifier | {'learning_rate': 0.01, 'max_depth': 5, 'n_est... | 0.958286 | 0.958571 | 0.00506 | 0.991411 | 0.156882 | [[1745, 237], [201, 8317]] | 0.958027 | 0.958286 | 0.958139 | 16.449282 | 0.044101 |
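
As a quick sanity check, the `accuracy` column can be recomputed from the `confusion_matrix` column: accuracy is the sum of the diagonal divided by the total count. The sketch below does this for four of the rows. The majority-class baseline assumes scikit-learn's convention that rows are true labels, which this notebook does not state explicitly.

```python
import ast

# confusion_matrix strings copied from the model performance table.
confusion = {
    "LogisticRegression": "[[1613, 369], [373, 8145]]",
    "RandomForestClassifier": "[[1741, 241], [201, 8317]]",
    "SGDClassifier": "[[1621, 361], [402, 8116]]",
    "DecisionTreeClassifier": "[[1745, 237], [201, 8317]]",
}


def accuracy_from_confusion(matrix_text: str) -> float:
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    matrix = ast.literal_eval(matrix_text)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def majority_baseline(matrix_text: str) -> float:
    """Accuracy of always predicting the largest class (rows assumed to be true labels)."""
    matrix = ast.literal_eval(matrix_text)
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


for name, text in confusion.items():
    print(f"{name}: {accuracy_from_confusion(text):.6f}")
# LogisticRegression: 0.929333
# RandomForestClassifier: 0.957905
# SGDClassifier: 0.927333
# DecisionTreeClassifier: 0.958286

baseline = majority_baseline(confusion["LogisticRegression"])
print(f"Majority-class baseline: {baseline:.6f}")
# Majority-class baseline: 0.811238
```

Under that row convention, roughly 81% of the evaluation rows belong to the positive class, so the reported accuracies are best read against a baseline of about 0.81 rather than 0.5.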


## Phase 5: MODEL EVALUATION

In [33]:
validationData = pd.read_parquet('data/transformedData/testData/test_data.parquet', engine='pyarrow')

In [34]:
X = validationData.drop(TARGET_VARIABLE, axis=1)
y = validationData[TARGET_VARIABLE]

In [36]:
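# Load trained models
import os

model_file_path = "model/trainedModel"

# os.listdir returns entries in arbitrary order, so sort for a stable listing
model_files = sorted(f for f in os.listdir(model_file_path) if f.endswith(".pkl"))
print(model_files)

# Key each model by its file name (without ".pkl") rather than by list position,
# so a variable can never be bound to the wrong model
models = {}

for model_file in model_files:
    with open(os.path.join(model_file_path, model_file), "rb") as file:
        models[os.path.splitext(model_file)[0]] = pickle.load(file)

# Access models by name
DT = models["DecisionTreeClassifier"]
LR = models["LogisticRegression"]
RF = models["RandomForestClassifier"]
SVC = models["SGDClassifier"]  # despite the variable name, this is the SGDClassifier
XGB = models["XGBClassifier"]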

['DecisionTreeClassifier.pkl', 'LogisticRegression.pkl', 'RandomForestClassifier.pkl', 'SGDClassifier.pkl', 'XGBClassifier.pkl']


In [37]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

y_pred_DT = DT.predict(X)
accuracy_DT = accuracy_score(y, y_pred_DT)
print("\nAccuracy of Decision Tree classifier on test set:", accuracy_DT)


Accuracy of Decision Tree classifier on test set: 0.9592666666666667


In [38]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

y_pred_LR = LR.predict(X)
accuracy_LR = accuracy_score(y, y_pred_LR)
print("\nAccuracy of LR Classifier on test set:", accuracy_LR)


Accuracy of LR Classifier on test set: 0.9315333333333333


In [39]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

y_pred_RF = RF.predict(X)
accuracy_RF = accuracy_score(y, y_pred_RF)
print("\nAccuracy of RF Classifier on test set:", accuracy_RF)


Accuracy of RF Classifier on test set: 0.9594


In [40]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

y_pred_SVC = SVC.predict(X)
accuracy_SVC = accuracy_score(y, y_pred_SVC)
print("\nAccuracy of SGDClassifier on test set:", accuracy_SVC)


Accuracy of SGDClassifier on test set: 0.9302


In [41]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

y_pred_XGB = XGB.predict(X)
accuracy_XGB = accuracy_score(y, y_pred_XGB)
print("\nAccuracy of XGBClassifier on test set:", accuracy_XGB)


Accuracy of XGBClassifier on test set: 0.9592666666666667
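
The five evaluation cells above repeat the same pattern, so they could be folded into one helper. This is a sketch under assumptions: `evaluate_models` is an illustrative name, and the demo uses synthetic data with freshly fitted stand-in models because the pickled models are not available here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X, y):
    """Return {model name: {"accuracy": float, "confusion_matrix": ndarray}}."""
    report = {}
    for name, model in models.items():
        y_pred = model.predict(X)
        report[name] = {
            "accuracy": accuracy_score(y, y_pred),
            "confusion_matrix": confusion_matrix(y, y_pred),
        }
    return report


# Stand-in data and models for demonstration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
demo_models = {
    "LogisticRegression": LogisticRegression(max_iter=1000).fit(X_demo, y_demo),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo),
}

report = evaluate_models(demo_models, X_demo, y_demo)
for name, metrics in report.items():
    print(f"Accuracy of {name}: {metrics['accuracy']:.4f}")
```

In this notebook the equivalent call would be `evaluate_models(models, X, y)`, using the `models` dictionary and the test features and labels loaded earlier.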
