## Problem definition and task allocations

In this section, please describe briefly:
- the problems in this project.
We train several classification models for each motor, compare their performance, and select the model that detects faults best on that individual motor.

- how the tasks are allocated among the team members.
Each team member is responsible for the model of one motor, except Juan Baserga, who covers motors 5 and 6.

| Member | Task | 
| --- | --- |
| Giovanni Low | Motor 1 |
| Lara Szterensus | Motor 2 |
| Beatriz Raposo | Motor 3 |
| Merse Szalai | Motor 4 |
| Juan Baserga | Motor 5 |
| Juan Baserga | Motor 6 |



In [3]:
import sys
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import warnings

# Import custom utility function
path_supporting_script = '/kaggle/input/robot-predictive-maintenance/'
sys.path.insert(1, path_supporting_script)
from utility import read_all_test_data_from_path, extract_selected_feature, prepare_sliding_window

warnings.filterwarnings('ignore')

# Load Data
def load_data():
    # pre_processing is defined in the next cell, so that cell must run before load_data() is called.
    train_data_path = '/kaggle/input/robot-predictive-maintenance/training_data/training_data/'
    test_data_path = '/kaggle/input/robot-predictive-maintenance/testing_data/testing_data/'
    df_train = read_all_test_data_from_path(train_data_path, pre_processing, is_plot=False)
    df_test = read_all_test_data_from_path(test_data_path, pre_processing, is_plot=False)
    return df_train, df_test


# Preprocess Data

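- Features Used: The same set of features is used for all motors: the positions, temperatures, and voltages of all six motors (18 features in total).
- Preprocessing: Out-of-range readings are masked and forward-filled (temperature outside 0 to 100, voltage outside 6000 to 9000, position outside 0 to 1000), and each sequence is shifted by its first value to remove sequence-to-sequence variability. Features are then scaled inside each model's pipeline:
  - Logistic Regression: StandardScaler
  - Decision Tree and Random Forest: RobustScaler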
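
The two cleaning steps described above can be illustrated on a toy series. This is a minimal sketch, not the notebook's own function: the helper name `clean` and the sample values are made up for illustration, and only the 0 to 100 temperature range is taken from the preprocessing code.

```python
import numpy as np
import pandas as pd

def clean(series: pd.Series, low: float, high: float) -> pd.Series:
    """Hypothetical helper: mask readings outside [low, high], forward-fill, subtract the first value."""
    masked = series.where(series.between(low, high), np.nan).ffill()
    return masked - masked.iloc[0]

# A temperature trace with one glitch in the middle (250 is outside 0..100).
# The glitch is replaced by the previous reading, then the trace is offset by its first value.
print(clean(pd.Series([25.0, 30.0, 250.0, 31.0]), 0, 100).tolist())

# If the very first reading is the glitch, ffill has nothing to copy forward,
# so the offset is NaN and the whole sequence becomes NaN.
print(clean(pd.Series([250.0, 30.0, 31.0]), 0, 100).isna().all())
```

The second call shows a corner case worth checking in the real data: a sequence that starts with an out-of-range reading would come out entirely as NaN under this scheme.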

In [4]:
# Preprocess Data
def pre_processing(df: pd.DataFrame):
    """Clean a DataFrame in place: mask out-of-range readings, then remove the initial offset."""
    def remove_outliers(df: pd.DataFrame):
        # Readings outside the accepted range become NaN and are replaced by the last valid value.
        # Note: an outlier in the first row cannot be forward-filled and stays NaN.
        for col in df.columns:
            if 'temperature' in col:
                df[col] = df[col].where(df[col] <= 100, np.nan)
                df[col] = df[col].where(df[col] >= 0, np.nan)
                df[col] = df[col].ffill()
            if 'voltage' in col:
                df[col] = df[col].where(df[col] >= 6000, np.nan)
                df[col] = df[col].where(df[col] <= 9000, np.nan)
                df[col] = df[col].ffill()
            if 'position' in col:
                df[col] = df[col].where(df[col] >= 0, np.nan)
                df[col] = df[col].where(df[col] <= 1000, np.nan)
                df[col] = df[col].ffill()
    def remove_seq_variability(df: pd.DataFrame):
        # Express each signal relative to its first reading.
        for col in df.columns:
            if 'temperature' in col or 'voltage' in col or 'position' in col:
                df[col] = df[col] - df[col].iloc[0]
    remove_outliers(df)
    remove_seq_variability(df)
    return df


# Train and evaluate the Logistic Regression model

Used for motors 2 and 4. This is a logistic regression classifier.

- Features: Positions, temperatures, and voltages of all six motors.
- Hyperparameter Tuning: Yes. The regularization strength C is tuned with GridSearchCV over the values [0.001, 0.01, 0.1, 1, 10, 100], using 5-fold stratified cross-validation and the F1 score.
- Imbalance Consideration: Yes. The model uses class_weight='balanced', which weights each class inversely proportional to its frequency.
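
To make the effect of class_weight='balanced' concrete, the sketch below computes the weights scikit-learn would assign for a toy label vector. The labels (8 healthy, 2 faulty) are invented for illustration and are not the project's actual class distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 samples of class 0 and 2 samples of class 1 (illustrative only).
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# Weights as computed by scikit-learn for class_weight='balanced'.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same values from the documented formula: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(np.allclose(weights, manual))
```

With these toy labels the minority class receives a weight four times larger than the majority class, so errors on faulty samples cost more during training.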

In [5]:
# Train and evaluate the Logistic Regression model
def train_logistic_regression(x_tr, y_tr):
    steps = [
        ('standardizer', StandardScaler()),
        ('classifier', LogisticRegression(class_weight='balanced'))
    ]
    pipeline = Pipeline(steps)
    param_grid = {
        'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    grid_search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=cv)
    grid_search.fit(x_tr, y_tr)
    return grid_search


# Train and evaluate the Decision Tree model

Used for motor 3. This model is a Decision Tree classifier.

- Features: Positions, temperatures, and voltages of all six motors.
- Hyperparameter Tuning: Yes. Hyperparameters are tuned using GridSearchCV (5-fold cross-validation, F1 score) over the following grid:
  - criterion (values: ['gini', 'entropy'])
  - max_depth (values: [None, 10, 20, 30])
  - min_samples_split (values: [2, 5, 10])
  - min_samples_leaf (values: [1, 2, 4])
- Imbalance Consideration: Not explicitly handled in the model setup (no class weights or resampling). The F1 scoring used during the grid search is the only element that accounts for the minority class.
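
As a rough guide to the cost of this search, the number of candidate settings can be counted with scikit-learn's ParameterGrid. The grid below copies the values listed above; the variable names `n_candidates` and `n_fits` are just for this sketch.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as listed above for the Decision Tree pipeline.
param_grid = {
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}

# Number of distinct hyperparameter combinations in the grid.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=5); the final refit on all data is not counted here.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

Adding one more value to any of the four lists multiplies the count accordingly, which is worth keeping in mind before widening the grid.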

In [6]:
# Train and evaluate the Decision Tree model
def train_decision_tree(x_tr, y_tr):
    steps = [
        ('scaler', RobustScaler()),  
        ('classifier', DecisionTreeClassifier(random_state=42))    
    ]
    pipeline = Pipeline(steps)
    param_grid = {
        'classifier__criterion': ['gini', 'entropy'],
        'classifier__max_depth': [None, 10, 20, 30],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    }
    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='f1', cv=5)
    grid_search.fit(x_tr, y_tr)
    return grid_search


# Train the Random Forest model

Used for motors 1, 5, and 6. This is a Random Forest classifier trained with fixed hyperparameters. The sliding window size and sample step are both set to 1.

- Features: Positions, temperatures, and voltages of all six motors.
- Hyperparameter Tuning: No. The model is initialized with n_estimators=100 and max_depth=10; criterion, min_samples_split, and min_samples_leaf are left at their scikit-learn defaults.
- Imbalance Consideration: Not explicitly handled in the model setup (no class weights or resampling).
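
If tuning and imbalance handling were added for the Random Forest, one possible shape is sketched below. This is not what the submission used: the function name `train_random_forest_tuned`, the small grid, the 3-fold split, and the synthetic data from make_classification are all assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

def train_random_forest_tuned(x_tr, y_tr):
    """Hypothetical variant: same pipeline layout, plus class weights and a small grid search."""
    pipeline = Pipeline([
        ('scaler', RobustScaler()),
        ('classifier', RandomForestClassifier(
            n_estimators=100, class_weight='balanced', random_state=42)),
    ])
    # Illustrative grid only; the values are not tuned for the robot data.
    param_grid = {
        'classifier__max_depth': [5, 10],
        'classifier__min_samples_leaf': [1, 4],
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=cv)
    search.fit(x_tr, y_tr)
    return search

# Synthetic, imbalanced stand-in data with 18 features (roughly 10% positives).
X, y = make_classification(n_samples=300, n_features=18, weights=[0.9, 0.1], random_state=0)
search = train_random_forest_tuned(X, y)
print(search.best_params_)
```

The returned object exposes `predict` like the plain pipeline, so it could be dropped into the per-motor loop without other changes.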


In [7]:
# Train the Random Forest model
def train_random_forest(x_tr, y_tr):
    rf_pipeline = Pipeline([
        ('scaler', RobustScaler()),
        ('classifier', RandomForestClassifier(n_estimators=100, max_depth=10))
    ])
    rf_pipeline.fit(x_tr, y_tr)
    return rf_pipeline

# Data and submission

- Window Size and Sample Step: The sliding window approach is used with a window size of 1 and a sample step of 1 to prepare the training data, so each sample consists of the features at a single time step.
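
The meaning of the two parameters can be sketched with a small NumPy helper. This is an assumption about how such windowing typically works, not the implementation of `prepare_sliding_window` from the course utility: the function `make_windows` is hypothetical, and it treats `sample_step` as the stride between points kept inside each window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(x: np.ndarray, window_size: int, sample_step: int) -> np.ndarray:
    """Hypothetical windowing: flatten each window of consecutive rows into one feature vector."""
    # Shape after this call: (n_steps - window_size + 1, n_features, window_size).
    windows = sliding_window_view(x, window_size, axis=0)
    # Keep every sample_step-th time point inside each window.
    windows = windows[:, :, ::sample_step]
    # One row per window, with all kept points concatenated.
    return windows.reshape(windows.shape[0], -1)

# Toy sequence: 5 time steps, 2 features.
x = np.arange(10).reshape(5, 2)

# With window_size=1 and sample_step=1 the output is the input itself.
print(np.array_equal(make_windows(x, 1, 1), x))

# A window of 3 steps gives fewer rows and three times as many columns.
print(make_windows(x, 3, 1).shape)
```

Under this reading, the setting used here means the models see no temporal context; a larger window would add past readings as extra features at the cost of a wider input.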

In [8]:
# Load data
df_train, df_test = load_data()

# Prepare the submission dataframe
submission_path = '/kaggle/input/robot-predictive-maintenance/sample_submission.csv'
df_submission = pd.read_csv(submission_path)
df_submission.loc[:, ['data_motor_1_label', 'data_motor_2_label', 'data_motor_3_label', 'data_motor_4_label', 'data_motor_5_label', 'data_motor_6_label']] = -1

# Process each motor separately with the specified models
for motor_idx in range(1, 7):
    print(f"Processing motor {motor_idx}...")

    # Specify the test conditions to include in the training
    df_data_experiment = df_train[df_train['test_condition'].isin(['20240425_093699', '20240425_094425', '20240426_140055',
                                                                   '20240503_164675', '20240503_165189', '20240503_163963',
                                                                   '20240325_155003'])]

    # Define the features
    feature_list_all = [f'data_motor_{i}_position' for i in range(1, 7)] + \
                       [f'data_motor_{i}_temperature' for i in range(1, 7)] + \
                       [f'data_motor_{i}_voltage' for i in range(1, 7)]

    # Extract the features
    df_tr_x, df_tr_y = extract_selected_feature(df_data_experiment, feature_list_all, motor_idx, mdl_type='clf')

    # Prepare the training data based on the defined sliding window
    window_size = 1
    sample_step = 1
    X_train, y_train = prepare_sliding_window(df_x=df_tr_x, y=df_tr_y, window_size=window_size, sample_step=sample_step, mdl_type='clf')

    # Choose the appropriate model for each motor
    if motor_idx in [1, 5, 6]:
        model = train_random_forest(X_train, y_train)
    elif motor_idx in [2, 4]:
        model = train_logistic_regression(X_train, y_train)
    elif motor_idx == 3:
        model = train_decision_tree(X_train, y_train)

    # Prepare the testing dataset
    df_test_x = df_test[feature_list_all + ['test_condition']]
    X_test = prepare_sliding_window(df_x=df_test_x, window_size=window_size, sample_step=sample_step, mdl_type='clf')

    # Make predictions on the test set
    y_pred = model.predict(X_test.drop(columns=['test_condition'], errors='ignore'))

    # Add predictions for the current motor to the submission dataframe
    df_submission[f'data_motor_{motor_idx}_label'] = y_pred

# Save the submission file
df_submission.to_csv('/kaggle/working/submission.csv', index=False)
print("Submission file saved.")

Processing motor 1...
Processing motor 2...
Processing motor 3...
Processing motor 4...
Processing motor 5...
Processing motor 6...
Submission file saved.
