
# Project C: Neural Network Design


#### **Project Objectives**
The aim of this project was to design three predictive models from a dataset derived from a Design of Experiments (DoE) conducted in the LCFC Laboratory. The dataset reflects operator behavior during assembly tasks in two environments: the real world and virtual reality (VR). Although the project is titled Neural Network Design, the models in this notebook are scikit-learn Random Forests. They were built to achieve the following objectives:

1. Classification Model 1: Identify irrelevant postures (biomechanically invalid).
2. Classification Model 2: Predict whether assembly occurred in the real or virtual environment.
3. Regression Model 3: Predict joint angles based on operator and task characteristics.

## I. Dataset Preparation

In [None]:
import pandas as pd

# Load the dataset to examine its structure
file_path = '/content/Project C - Dataset.csv'
dataset = pd.read_csv(file_path)

# Display the first few rows and basic info about the dataset
dataset.info(), dataset.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Data columns (total 1 columns):
 #   Column

(None,
   Subject_Height;Subject_Category;Subject_Sex;Real or VR;Quality_Suit;Nb_parts_assembled;Task1_Right_arm_flexion_min;Task1_Right_arm_flexion_max;Task1_Right_arm_flexion_moy;Task1_Right_arm_flexion_median;Task1_Right_arm_flexion_standard_deviation;Task1_Right_arm_abduction_min;Task1_Right_arm_abduction_max;Task1_Right_arm_abduction_moy;Task1_Right_arm_abduction_median;Task1_Right_arm_abduction_standard_deviation;Task1_Right_lower_arm_min;Task1_Right_lower_arm_max;Task1_Right_lower_arm_moy;Task1_Right_lower_arm_median;Task1_Right_lower_arm_standard_deviation;Task1_Left_arm_flexion_min;Task1_Left_arm_flexion_max;Task1_Left_arm_flexion_moy;Task1_Left_arm_flexion_median;Task1_Left_arm_flexion_standard_deviation;Task1_Left_arm_abduction_min;Task1_Left_arm_abduction_max;Task1_Left_arm_abduction_moy;Task1_Left_arm_abduction_median;Task1_Left_arm_abduction_standard_deviation;Task1_Left_lower_arm_min;Task1_Left_lower_arm_max;Task1_Left_lower_arm_moy;Task1_Left_lower_arm_median;Task1_Left

The first load produced a single column: every field ended up in one string because the file is semicolon-delimited, while pd.read_csv splits on commas by default. The file needs to be re-read with the correct separator to obtain meaningful columns.
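
A minimal sketch of how the separator could be checked before reloading, using only the header line. The helper name `guess_delimiter` and the candidate list are assumptions made for illustration; they are not part of the notebook.

```python
def guess_delimiter(header_line, candidates=(";", ",", "\t", "|")):
    """Return the candidate separator that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("none of the candidate delimiters appear in the header")
    return best


# First few column names, copied from the raw single-column output above.
header = "Subject_Height;Subject_Category;Subject_Sex;Real or VR;Quality_Suit;Nb_parts_assembled"
delimiter = guess_delimiter(header)
print(repr(delimiter))
```

The returned character can then be passed to `pd.read_csv` through its `sep` (or `delimiter`) argument.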

In [None]:
# Load the dataset again, this time using the correct delimiter based on the observed format (';')
dataset = pd.read_csv(file_path, delimiter=';')

# Display the first few rows and structure of the processed dataset
dataset.info(), dataset.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Columns: 206 entries, Subject_Height to Task5_Hips_standard_deviation
dtypes: float64(200), int64(3), object(3)
memory usage: 1.5+ MB


(None,
    Subject_Height Subject_Category Subject_Sex Real or VR  Quality_Suit  \
 0             181       Technician        Male       Real             1   
 1             181       Technician        Male       Real             1   
 2             181       Technician        Male       Real             1   
 3             181       Technician        Male       Real             1   
 4             181       Technician        Male       Real             1   
 
    Nb_parts_assembled  Task1_Right_arm_flexion_min  \
 0                   1                    34.588693   
 1                   2                    35.081084   
 2                   3                    19.778304   
 3                   4                    37.834053   
 4                   5                    26.143465   
 
    Task1_Right_arm_flexion_max  Task1_Right_arm_flexion_moy  \
 0                    85.680273                    60.908047   
 1                    74.032794                    51.567491   
 2         

The dataset is now properly structured with 206 columns.

* General Info: It contains 948 rows and a variety of data types (integer, float, and categorical).
* Columns:
  * Subject_Height, Subject_Category, Subject_Sex, Real or VR, etc.
  * Posture-related measurements: A range of aggregated statistics (min, max, mean, median, and standard deviation) for each task and joint.
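
The posture columns follow a `Task<N>_<joint>_<statistic>` naming pattern (the mean appears under the suffix `moy`). As a sketch, a column name can be split into its parts as shown below. The helper `split_task_column` is hypothetical, and the suffix list is an assumption based on the column names visible in the output above.

```python
# Statistic suffixes seen in the column names; the mean is stored as 'moy'.
STAT_SUFFIXES = ("standard_deviation", "median", "min", "max", "moy")


def split_task_column(name):
    """Split e.g. 'Task1_Right_arm_flexion_moy' into (task, joint, statistic)."""
    task, rest = name.split("_", 1)
    for stat in STAT_SUFFIXES:
        if rest.endswith("_" + stat):
            joint = rest[: -(len(stat) + 1)]
            return task, joint, stat
    raise ValueError(f"unrecognised statistic suffix in {name!r}")


for column in ("Task1_Right_arm_flexion_moy", "Task5_Hips_standard_deviation"):
    print(split_task_column(column))
```

Grouping columns this way makes it easier to select, for example, all statistics of one joint or one statistic across all joints.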

## II. Preprocessing

Step 1: Check for missing or invalid data

In [None]:
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
missing_data_summary = dataset.isnull().sum()

Step 2: Encode categorical variables

In [None]:
categorical_columns = ['Subject_Category', 'Subject_Sex', 'Real or VR']
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    dataset[col] = le.fit_transform(dataset[col])
    label_encoders[col] = le
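
LabelEncoder assigns integer codes in the sorted order of the class labels, so the mapping can be read back from `classes_`. The sketch below uses toy values: 'Real' appears in the output above, while 'VR' is an assumed spelling of the second label.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values; 'VR' is an assumed label, not confirmed by the displayed rows.
env_values = ["Real", "VR", "Real", "VR"]

encoder = LabelEncoder()
codes = encoder.fit_transform(env_values)

# classes_ is sorted, and each label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts codes back to the original labels.
print(list(encoder.inverse_transform([0, 1])))
```

In the notebook, the same lookup would apply to each fitted encoder stored in `label_encoders`.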

Step 3: Normalize numerical features

In [None]:
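# Select every numeric column. The label-encoded categorical columns are integers by now,
# so they (and the targets Quality_Suit and Real or VR) are scaled to [0, 1] as well.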
numerical_columns = dataset.select_dtypes(include=['float64', 'int64']).columns
scaler = MinMaxScaler()
dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])

# Summarize preprocessing steps and cleaned data
processed_summary = {
    "Missing Data (per column)": missing_data_summary[missing_data_summary > 0],
    "Categorical Columns Encoded": categorical_columns,
    "Normalized Columns": numerical_columns.tolist(),
}

dataset.head(), processed_summary

(   Subject_Height  Subject_Category  Subject_Sex  Real or VR  Quality_Suit  \
 0        0.821429               1.0          1.0         0.0           1.0   
 1        0.821429               1.0          1.0         0.0           1.0   
 2        0.821429               1.0          1.0         0.0           1.0   
 3        0.821429               1.0          1.0         0.0           1.0   
 4        0.821429               1.0          1.0         0.0           1.0   
 
    Nb_parts_assembled  Task1_Right_arm_flexion_min  \
 0            0.000000                     0.460804   
 1            0.012658                     0.467363   
 2            0.025316                     0.263492   
 3            0.037975                     0.504040   
 4            0.050633                     0.348292   
 
    Task1_Right_arm_flexion_max  Task1_Right_arm_flexion_moy  \
 0                     0.807299                     0.713584   
 1                     0.695202                     0.603343   


Comments:
* Missing Data: No missing values were found in the dataset.
* Categorical Encoding: Subject_Category, Subject_Sex, and Real or VR were encoded as integers.
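* Normalization: All numeric columns were scaled to [0, 1] with MinMaxScaler. This includes the label-encoded categorical columns and the targets (Quality_Suit, Real or VR), and the scaler was fitted on the full dataset before any train/test split.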
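
Min-max scaling maps each value to (value - min) / (max - min). As a worked sketch, the displayed Nb_parts_assembled values (1 becomes 0.000000 and 2 becomes 0.012658, which is 1/79) imply a range of 1 to 80. That range is inferred from the output rather than read from the data, and the helper names are illustrative.

```python
def min_max_scale(value, lo, hi):
    """Map value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def min_max_unscale(scaled, lo, hi):
    """Invert min_max_scale to recover the original units."""
    return scaled * (hi - lo) + lo


# Range inferred from the displayed rows (assumption): 1 -> 0.0 and 2 -> 1/79.
NB_PARTS_MIN, NB_PARTS_MAX = 1, 80

for raw in (1, 2, 3, 4, 5):
    print(raw, round(min_max_scale(raw, NB_PARTS_MIN, NB_PARTS_MAX), 6))
```

The inverse function is useful when a prediction or an error metric has to be reported in the original units.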

## III. Model Development, Training and Validation

#### *Model 1: Classification Model for Identifying Irrelevant Postures*

**Objective:** Predict if a posture is irrelevant (Quality_Suit = 0) using minimal features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Define features and target for Model 1
features_model_1 = ['Subject_Height', 'Subject_Category', 'Subject_Sex', 'Nb_parts_assembled']
task_columns = [col for col in dataset.columns if 'Task' in col]
features_model_1.extend(task_columns)  # Include task-related statistics
target_model_1 = 'Quality_Suit'

In [None]:
# Prepare the data
X_model_1 = dataset[features_model_1]
y_model_1 = dataset[target_model_1]

In [None]:
# Split data into training and testing sets
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_model_1, y_model_1, test_size=0.3, random_state=42)

In [None]:
# Train a Random Forest Classifier
rf_model_1 = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model_1.fit(X_train_1, y_train_1)

In [None]:
# Make predictions
y_pred_1 = rf_model_1.predict(X_test_1)

In [None]:
# Evaluate the model
accuracy_1 = accuracy_score(y_test_1, y_pred_1)
classification_report_1 = classification_report(y_test_1, y_pred_1)
confusion_matrix_1 = confusion_matrix(y_test_1, y_pred_1)

accuracy_1, classification_report_1

(0.9017543859649123,
 '              precision    recall  f1-score   support\n\n         0.0       0.91      0.45      0.60        47\n         1.0       0.90      0.99      0.94       238\n\n    accuracy                           0.90       285\n   macro avg       0.91      0.72      0.77       285\nweighted avg       0.90      0.90      0.89       285\n')

**Model 1 Approach:**

* Features: Operator characteristics and task metrics.
* Model: Random Forest Classifier.

**Model 1 Results: Classification of Irrelevant Postures**

* Accuracy: 90.2%
* Classification Report:
  * Precision for Quality_Suit = 0: 91%
  * Recall for Quality_Suit = 0: 45%
  * Precision for Quality_Suit = 1: 90%
  * Recall for Quality_Suit = 1: 99%
  * Weighted F1-score across both classes: 89%

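Overall accuracy is high, but it is driven by the majority class: 238 of the 285 test samples have Quality_Suit = 1. Recall for Quality_Suit = 0 is only 45%, so the model misses more than half of the irrelevant postures it is meant to detect (F1-score of 0.60 for that class).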
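
The notebook computes `confusion_matrix_1` but does not display it. The sketch below shows how the per-class figures follow from confusion counts. The counts are reconstructed from the rounded report and the class supports (47 and 238), so they are an assumption, although they reproduce the reported accuracy.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Reconstructed counts for the Quality_Suit = 0 class (assumption, see lead-in).
tp0, fn0 = 21, 26   # 47 truly irrelevant postures: 21 found, 26 missed
fp0, tn0 = 2, 236   # 238 valid postures: 2 wrongly flagged, 236 kept

p0, r0, f0 = precision_recall_f1(tp0, fp0, fn0)
accuracy = (tp0 + tn0) / (tp0 + fn0 + fp0 + tn0)
print(f"precision={p0:.2f} recall={r0:.2f} f1={f0:.2f} accuracy={accuracy:.4f}")
```

Seen this way, 26 of the 47 irrelevant postures in the test set pass undetected, which accuracy alone hides.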

#### *Model 2: Classification Model for Real vs. Virtual Environment*

**Objective:** Predict if the assembly was performed in a real or virtual environment (Real or VR) using minimal features.

In [None]:
# Define features and target for Model 2
features_model_2 = ['Subject_Height', 'Subject_Category', 'Subject_Sex', 'Nb_parts_assembled']
features_model_2.extend(task_columns)  # Include task-related statistics
target_model_2 = 'Real or VR'

In [None]:
# Prepare the data
X_model_2 = dataset[features_model_2]
y_model_2 = dataset[target_model_2]

In [None]:
# Split data into training and testing sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_model_2, y_model_2, test_size=0.3, random_state=42)

In [None]:
# Train a Random Forest Classifier
rf_model_2 = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model_2.fit(X_train_2, y_train_2)

In [None]:
# Make predictions
y_pred_2 = rf_model_2.predict(X_test_2)

In [None]:
# Evaluate the model
accuracy_2 = accuracy_score(y_test_2, y_pred_2)
classification_report_2 = classification_report(y_test_2, y_pred_2)
confusion_matrix_2 = confusion_matrix(y_test_2, y_pred_2)

accuracy_2, classification_report_2

(0.9789473684210527,
 '              precision    recall  f1-score   support\n\n         0.0       0.99      0.97      0.98       150\n         1.0       0.96      0.99      0.98       135\n\n    accuracy                           0.98       285\n   macro avg       0.98      0.98      0.98       285\nweighted avg       0.98      0.98      0.98       285\n')

**Model 2 Approach:**
* Features: Operator characteristics and task metrics.
* Model: Random Forest Classifier.

**Model 2 Results: Classification of Real vs. Virtual Environment**
* Accuracy: 97.9%
* Classification Report:
  * Precision: Real (0): 99% | Virtual (1): 96%
  * Recall: Real (0): 97% | Virtual (1): 99%
  * Weighted F1-score: 98%

The model performs exceptionally well in predicting whether the environment is real or virtual, with balanced performance across both classes.
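
One way to see which inputs drive such a separation is to rank a fitted forest's impurity-based importances. The sketch below runs on synthetic data from `make_classification`, with made-up feature names. In the notebook, `rf_model_2` and `features_model_2` would take the place of `model` and `feature_names`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset (assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Sort features from most to least important.
order = np.argsort(model.feature_importances_)[::-1]
ranked = [(feature_names[i], float(model.feature_importances_[i])) for i in order]

for name, score in ranked[:3]:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favour high-cardinality or correlated features, so they are best read as a first indication rather than a definitive ranking.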

#### *Model 3: Regression Model for Joint Angle Prediction*

**Objective:** Predict joint angles (aggregated by a statistical metric) based on inputs like Subject_Height, Subject_Category, and task-related data.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

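# Define the target and features
# NOTE: task_columns holds every column with 'Task' in its name, including the
# target itself, so the target is also an input feature here (target leakage).
target_model_3 = 'Task1_Right_arm_flexion_moy'
features_model_3 = ['Subject_Height', 'Subject_Category', 'Subject_Sex', 'Nb_parts_assembled']
features_model_3.extend(task_columns)  # Include task-related statistics (and the target)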

In [None]:
# Prepare the data
X_model_3 = dataset[features_model_3]
y_model_3 = dataset[target_model_3]

In [None]:
# Split data into training and testing sets
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_model_3, y_model_3, test_size=0.3, random_state=42)

In [None]:
# Train a Random Forest Regressor
rf_model_3 = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model_3.fit(X_train_3, y_train_3)

In [None]:
# Make predictions
y_pred_3 = rf_model_3.predict(X_test_3)

In [None]:
# Evaluate the model
rmse_3 = np.sqrt(mean_squared_error(y_test_3, y_pred_3))
r2_3 = r2_score(y_test_3, y_pred_3)

rmse_3, r2_3

(0.0021635618419299624, 0.9998989017080997)

**Model 3 Approach:**
* Features: Operator characteristics and task metrics, including the target column itself (it is part of task_columns).
* Model: Random Forest Regressor.

**Model 3 Results: Regression for Joint Angle Prediction**
* RMSE (Root Mean Squared Error): 0.0022, expressed in min-max-scaled units (the target was normalized to [0, 1]) rather than in the original angle units.
* R² (Coefficient of Determination): 0.9999.
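
Because the target was min-max scaled, the RMSE is a fraction of the target's original range. As a rough sketch, that range can be estimated from two rows displayed earlier, where the same column appears before and after scaling. It assumes the angles are in degrees, and the result is approximate because the printed values are rounded.

```python
# (raw, scaled) pairs for Task1_Right_arm_flexion_moy, copied from the outputs above.
raw_a, scaled_a = 60.908047, 0.713584
raw_b, scaled_b = 51.567491, 0.603343

# MinMaxScaler is linear: scaled = (raw - min) / (max - min),
# so the slope between two points gives the range (max - min).
value_range = (raw_a - raw_b) / (scaled_a - scaled_b)

rmse_scaled = 0.0021635618419299624  # value reported by the notebook
rmse_original_units = rmse_scaled * value_range

print(f"estimated range: {value_range:.1f}")
print(f"RMSE in original units: {rmse_original_units:.2f}")
```

Reporting the error in the original units makes it easier to judge against the precision of the motion-capture measurements.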

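These scores should not be read as predictive performance. task_columns includes Task1_Right_arm_flexion_moy, so the regressor receives the target as an input feature, along with the min, max, and median of the same joint. The near-perfect fit reflects this leakage; the model needs to be retrained with the target excluded before its accuracy can be judged.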
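
A minimal sketch of building a feature list that cannot contain the target. The helper `leakage_free_features` is hypothetical. Dropping the sibling statistics of the same task and joint (min, max, median, standard deviation) is a design choice assumed here, since they are likely to nearly determine the mean.

```python
STAT_SUFFIXES = ("standard_deviation", "median", "min", "max", "moy")


def leakage_free_features(columns, target, stat_suffixes=STAT_SUFFIXES):
    """Drop the target and every statistic of the same task/joint from columns."""
    prefix = target
    for stat in stat_suffixes:
        if target.endswith("_" + stat):
            prefix = target[: -(len(stat) + 1)]  # e.g. 'Task1_Right_arm_flexion'
            break
    return [c for c in columns if c != target and not c.startswith(prefix + "_")]


# Small excerpt of column names for illustration.
columns = [
    "Subject_Height",
    "Task1_Right_arm_flexion_min",
    "Task1_Right_arm_flexion_max",
    "Task1_Right_arm_flexion_moy",
    "Task1_Right_arm_flexion_median",
    "Task1_Right_arm_flexion_standard_deviation",
    "Task1_Right_arm_abduction_moy",
    "Task1_Left_arm_flexion_moy",
]
features = leakage_free_features(columns, "Task1_Right_arm_flexion_moy")
print(features)
```

With the full column list, the result could replace `features_model_3` before the split and training steps are rerun.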

## IV. Export Models

In [None]:
import joblib

# Export Model 1
joblib.dump(rf_model_1, 'model_1_irrelevant_postures.pkl')

# Export Model 2
joblib.dump(rf_model_2, 'model_2_real_vs_virtual.pkl')

# Export Model 3
joblib.dump(rf_model_3, 'model_3_joint_angle_prediction.pkl')

print("Models have been successfully exported.")

Models have been successfully exported.


## V. Conclusions

**Feature Selection:**

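* Each model used the four operator and session columns plus every task statistic. No feature reduction was performed, so the minimal-feature goal remains open.
* Task-related metrics (mean, max, etc.) make up nearly all of the inputs; their individual contribution has not yet been measured.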

**Model Performance:**

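* Classification models (1 and 2) achieved high accuracy (90.2% and 97.9%). F1-scores are balanced for Model 2, but Model 1 reaches only 0.60 on the irrelevant-posture class because of its low recall (45%).
* Regression model (3) reports R² = 0.9999, but the target was among its input features, so this figure does not measure predictive accuracy for joint angles.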

**Opportunities for Improvement:**

* Perform sensitivity analysis to identify the most impactful statistical metrics (mean, median, etc.).
* Implement cross-validation to confirm robustness across all models.
* Develop automated pipelines for hyperparameter optimization and feature testing.
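
As a sketch of the cross-validation step, the scaler and the classifier can be wrapped in one pipeline so that scaling is refitted on each training fold only. The data below is synthetic (`make_classification` with an imbalanced class ratio chosen for illustration). In the notebook, `X_model_1` and `y_model_1` would be used instead, starting from the unscaled columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.2, 0.8], random_state=42
)

# Scaling lives inside the pipeline, so each fold fits it on its own training part.
pipeline = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1_macro")

print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Macro F1 is used here because it weights both classes equally, which makes weak minority-class recall visible. The same pipeline object can be passed to `GridSearchCV` for hyperparameter optimization.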