# Intro

In this project, we aim to predict air pump failures during machine cycles based on known air pressure patterns.
For this task I used a Decision Tree Classifier (a classifier, because we are predicting whether a specific event is likely to occur or not).

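I chose it for the following reasons, with one caveat:
* It is relatively easy to understand and interpret.
* It can handle both categorical and numerical features (in scikit-learn, categorical features must first be encoded as numbers).
* It does not require feature scaling or normalization.
* Caveat: even a small change in the dataset can cause problems, since the learned tree may change substantially.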
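
The interpretability and scaling points in the list above can be checked with a minimal sketch. It uses a small synthetic dataset (an assumption, not the pump data) with hypothetical feature names: `export_text` prints the learned rules as readable thresholds, and a tree fitted on standardized features is compared against one fitted on the raw features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical; not the pump dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 50.0], scale=[1.0, 20.0], size=(300, 2))
y_demo = (X_demo[:, 1] > 55.0).astype(int)

# A shallow tree is easy to read as a set of threshold rules.
raw_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
rules = export_text(raw_tree, feature_names=["feature_a", "pressure_like"])
print(rules)

# Fit the same model on standardized features and compare predictions.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_scaled, y_demo)
agreement = np.mean(raw_tree.predict(X_demo) == scaled_tree.predict(X_scaled))
print(f"Prediction agreement raw vs. scaled: {agreement:.2f}")
```

Because a tree only compares each feature against a threshold, rescaling a feature moves the threshold but should not change which samples fall on each side of it.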


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/labels.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27385 entries, 0 to 27384
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MachineId      27385 non-null  object
 1   MeasurementId  27385 non-null  int64 
 2   PumpFailed     26900 non-null  object
 3   SlowStart      19300 non-null  object
 4   SlowEnd        19300 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.0+ MB


In [4]:
data = pd.read_parquet('data/data.parquet')

In [6]:
merged_data = pd.merge(data, df, on=['MachineId', 'MeasurementId'])
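
By default `pd.merge` performs an inner join, so rows without a matching key pair are dropped silently. The sketch below shows one way to make that visible. The two toy frames are hypothetical and only mirror the column names used here; `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Hypothetical toy frames mirroring the column names used in this notebook.
measurements = pd.DataFrame({
    "MachineId": ["0_0_0", "0_0_0", "0_0_1", "0_0_2"],
    "MeasurementId": [0, 0, 1, 5],
    "Pressure": [0.0, 0.3, 0.7, 0.1],
})
labels = pd.DataFrame({
    "MachineId": ["0_0_0", "0_0_1"],
    "MeasurementId": [0, 1],
    "PumpFailed": [False, True],
})

# how="left" keeps every measurement row; indicator=True adds a "_merge"
# column showing whether a matching label row was found; validate checks
# that each key pair appears at most once in the labels frame.
checked = pd.merge(
    measurements, labels,
    on=["MachineId", "MeasurementId"],
    how="left", indicator=True, validate="many_to_one",
)
unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{len(checked)} rows, {unmatched} without a label")
```

Comparing the row counts before and after the merge in this way would show how many pressure readings end up without a label.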

In [7]:
merged_data.head()

Unnamed: 0,MachineId,MeasurementId,Pressure,PumpFailed,SlowStart,SlowEnd
0,0_0_0,0,0.0,False,False,False
1,0_0_0,0,0.0,False,False,False
2,0_0_0,0,0.0,False,False,False
3,0_0_0,0,0.0,False,False,False
4,0_0_0,0,0.0,False,False,False


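Deleting NaN values

Dropping rows with missing values is probably not ideal; replacing the missing values (for example, with the mean) might be better.

Doing this after the train/test split is also possible, and may be better, because the test set then has no influence on the replacement values.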
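
The idea of filling gaps after the split could be sketched as follows. Note the assumptions: the `df.info()` output shows the gaps in the label columns (PumpFailed, SlowStart, SlowEnd), where a mean is not directly meaningful, so this sketch uses a hypothetical numeric feature with missing values instead. `SimpleImputer` learns the mean from the training split only and reuses it on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy feature with gaps (the real Pressure column may have none).
toy = pd.DataFrame({
    "Pressure": [0.0, 0.2, np.nan, 0.8, 1.0, np.nan, 0.4, 0.6],
    "PumpFailed": [0, 0, 0, 1, 1, 1, 0, 1],
})
X_toy = toy[["Pressure"]]
y_toy = toy["PumpFailed"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Learn the mean from the training split only, then reuse it on the test split.
imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print("Training mean used for filling:", imputer.statistics_[0])
```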

In [8]:
merged_data = merged_data.dropna()

In [9]:
# Importing libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Separate merged data into features (X) and labels (y)
X = merged_data[['MachineId', 'Pressure']]
y = merged_data['PumpFailed']

# Split the data into training and testing sets (no separate validation set is used)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=333)
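
If a validation set were wanted later (for example, for hyperparameter tuning), one possible approach is to call `train_test_split` twice. The sketch below uses hypothetical stand-in data and illustrative split sizes; `stratify` keeps the class proportions similar across the splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, roughly 10% positive labels.
rng = np.random.default_rng(333)
X_all = rng.normal(size=(1000, 2))
y_all = (rng.random(1000) < 0.1).astype(int)

# First hold out 20% as the test set, then split the rest 75/25,
# giving 60% train, 20% validation, 20% test overall.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=333
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=333
)

print(len(X_tr), len(X_val), len(X_te))
print("Positive rate per split:", y_tr.mean(), y_val.mean(), y_te.mean())
```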

In [10]:
# Encode the target labels as integers (LabelEncoder maps each class to 0, 1, ...)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Initialize the Decision Tree model
model = DecisionTreeClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train_encoded)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.89


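An accuracy of 89% means that the model correctly predicted pump failure or non-failure for approximately 89% of the instances in the test set. This looks good, but accuracy alone can be misleading if failures are rare, so the already imported *classification_report* and *confusion_matrix* would give a more complete picture.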
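
To illustrate why accuracy is worth reading alongside other metrics, here is a small sketch with hypothetical labels (not results from the pump model, whose class balance is not shown here). A trivial predictor that never reports a failure still reaches high accuracy when failures are rare, while the confusion matrix shows that it misses every failure.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: 9 non-failures and 1 failure (0 = no failure, 1 = failure).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A trivial "model" that never predicts a failure.
y_always_ok = [0] * 10

acc = accuracy_score(y_true, y_always_ok)
cm = confusion_matrix(y_true, y_always_ok)

print(f"Accuracy: {acc:.2f}")
print(cm)  # rows = true class, columns = predicted class
print(classification_report(y_true, y_always_ok, zero_division=0))
```

The same two calls could be applied to `y_test_encoded` and `y_pred` from the cell above.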

In [11]:
model.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 42,
 'splitter': 'best'}

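Tuning hyperparameters such as *criterion* or *max_depth* is an important next step: with the defaults shown above (*max_depth* of None), the tree keeps growing until its leaves are pure or too small to split, which makes overfitting more likely. The accuracy achieved so far nevertheless suggests that the model is doing reasonably well overall.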
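
One common way to tune *criterion* and *max_depth* is a cross-validated grid search. The following is a minimal sketch under stated assumptions: the data is synthetic (generated with `make_classification`), and the candidate values in `param_grid` are illustrative rather than recommended settings for the pump data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; replace with X_train / y_train_encoded).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)

# Candidate values are illustrative, not recommendations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8, None],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Since the search uses cross-validation on the training data, the test set can stay untouched until the final evaluation.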