In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/defects-analysis/__results__.html
/kaggle/input/defects-analysis/__notebook__.ipynb
/kaggle/input/defects-analysis/__output__.json
/kaggle/input/defects-analysis/Defects.csv
/kaggle/input/defects-analysis/custom.css
/kaggle/input/defects-analysis/__results___files/__results___20_1.png
/kaggle/input/defects-analysis/__results___files/__results___31_1.png
/kaggle/input/defects-analysis/__results___files/__results___26_1.png
/kaggle/input/defects-analysis/__results___files/__results___33_1.png
/kaggle/input/defects-analysis/__results___files/__results___29_1.png
/kaggle/input/defects-analysis/__results___files/__results___24_1.png
/kaggle/input/defects-analysis/__results___files/__results___22_1.png


In [2]:
# Loading the dataset
df = pd.read_csv("/kaggle/input/defects-analysis/Defects.csv", parse_dates=['defect_date']).drop(columns='Unnamed: 0')
df.head()

Unnamed: 0,defect_id,product_id,defect_type,defect_date,defect_location,severity,inspection_method,repair_cost,month
0,1,15,Structural,2024-06-06,Component,Minor,Visual Inspection,245.47,June
1,2,6,Functional,2024-04-26,Component,Minor,Visual Inspection,26.87,April
2,3,84,Structural,2024-02-15,Internal,Minor,Automated Testing,835.81,February
3,4,10,Functional,2024-03-28,Internal,Critical,Automated Testing,444.47,March
4,5,14,Cosmetic,2024-04-26,Component,Minor,Manual Testing,823.64,April


In [3]:
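from sklearn.preprocessing import LabelEncoder

# Features: the four categorical columns; IDs, dates, and the target are dropped
X = df.drop(columns=['repair_cost', 'month', 'defect_date', 'product_id', 'defect_id'])
y = df['repair_cost']  # regression target
encoders = {}

for col in X:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col])  # classes are numbered in sorted (alphabetical) order
    # Keep the label -> code mapping for later reference
    encoders[col] = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
    print(f'{col} mapping: {encoders[col]}')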

defect_type mapping: {'Cosmetic': 0, 'Functional': 1, 'Structural': 2}
defect_location mapping: {'Component': 0, 'Internal': 1, 'Surface': 2}
severity mapping: {'Critical': 0, 'Minor': 1, 'Moderate': 2}
inspection_method mapping: {'Automated Testing': 0, 'Manual Testing': 1, 'Visual Inspection': 2}


## Label Encoding of Categorical Features

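The categorical columns in the dataset (`defect_type`, `defect_location`, `severity`, `inspection_method`) were converted to integer codes using **Label Encoding** so that scikit-learn estimators can accept them. `LabelEncoder` numbers the classes in alphabetical order, so the codes carry no ranking: for `severity`, Critical (0) < Minor (1) < Moderate (2) does not reflect actual severity.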

### Encoded Mappings

- **defect_type**:  
  - Cosmetic → 0  
  - Functional → 1  
  - Structural → 2  

- **defect_location**:  
  - Component → 0  
  - Internal → 1  
  - Surface → 2  

- **severity**:  
  - Critical → 0  
  - Minor → 1  
  - Moderate → 2  

- **inspection_method**:  
  - Automated Testing → 0  
  - Manual Testing → 1  
  - Visual Inspection → 2  

> **Note:** The mappings are stored in a dictionary (`encoders`) for reference during predictions or inverse transformation.
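
As a minimal sketch of the inverse transformation mentioned in the note, the snippet below rebuilds a label-to-code dictionary in the same shape as `encoders[col]` and flips it to decode integer codes. The toy values and the names `toy`, `forward`, and `inverse` are assumptions for illustration and do not come from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for the real data (illustrative values only)
toy = pd.DataFrame({"severity": ["Minor", "Critical", "Moderate", "Minor"]})

encoder = LabelEncoder()
codes = encoder.fit_transform(toy["severity"])

# Forward mapping (label -> code), cast to plain Python ints
forward = {label: int(code) for code, label in enumerate(encoder.classes_)}

# Inverse mapping (code -> label), obtained by flipping the dictionary
inverse = {code: label for label, code in forward.items()}

decoded = [inverse[int(c)] for c in codes]
print(forward)
print(decoded)
```

When the fitted `LabelEncoder` object itself is still available, `encoder.inverse_transform(codes)` does the same job without a hand-built dictionary.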


In [4]:
# Sample rows 
X.head()

Unnamed: 0,defect_type,defect_location,severity,inspection_method
0,2,0,1,2
1,1,0,1,2
2,2,1,1,0
3,1,1,0,0
4,0,0,1,1


In [5]:
# Sample rows
y.head()

0    245.47
1     26.87
2    835.81
3    444.47
4    823.64
Name: repair_cost, dtype: float64

In [6]:
from sklearn.model_selection import train_test_split, GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train-Test Split

The dataset was split into training and testing sets to evaluate model performance.  

- **Training set:** 80% of the data  
- **Testing set:** 20% of the data  
- **Random state:** 0 (for reproducibility)  

The models learn only from the training set and are scored on the held-out testing set, which they never see during fitting, so the reported metrics estimate performance on unseen data.
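
The effect of `test_size=0.2` and a fixed `random_state` can be checked on stand-in data. The sketch below assumes a hypothetical 1000-row array (`X_demo`, `y_demo`) and is not the notebook's actual dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 4 integer-coded features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(1000, 4))
y_demo = rng.uniform(10, 1000, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```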


In [7]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

lr_fit = lr_model.fit(x_train, y_train)

lr_predict = lr_fit.predict(x_test)

## Linear Regression Model

A **Linear Regression** model was trained to predict `repair_cost` using the encoded categorical features (`defect_type`, `defect_location`, `severity`, `inspection_method`).

- The model was fit on the **training set** (`x_train`, `y_train`).  
- Predictions were made on the **testing set** (`x_test`) to evaluate performance.  

> Linear Regression assumes a **linear relationship** between features and target. Further evaluation metrics will indicate how well this assumption holds for our dataset.


In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, lr_predict)
mse = mean_squared_error(y_test, lr_predict)
r2score = r2_score(y_test, lr_predict)
rmse = np.sqrt(mse)
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R2 Score:", r2score)

Mean Absolute Error: 258.8921654187944
Mean Squared Error: 88639.95181384236
Root Mean Squared Error: 297.72462413082724
R2 Score: -0.004463151391963072


## Linear Regression Model Evaluation

The Linear Regression model was evaluated on the test set using standard regression metrics:

- **Mean Absolute Error (MAE):** 258.89  
  - On average, predictions are off by about ₹259.

- **Mean Squared Error (MSE):** 88,639.95  
  - Measures squared differences between predicted and actual values; higher values indicate larger errors.

- **Root Mean Squared Error (RMSE):** 297.72  
  - Typical magnitude of prediction error, in ₹.

- **R² Score:** -0.004  
  - Indicates the model explains almost none of the variance in repair costs.  
  - A negative R² suggests the model performs **worse than predicting the mean**.

**Key Insight:**  
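- The near-zero R² and an MAE of about ₹259 show that the Linear Regression model is **not capturing any usable pattern in the data**. Likely contributors are the **small dataset (~1000 rows)** and the feature set: four categorical columns whose label-encoded integers a linear model treats as ordered quantities.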
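
The statement that a negative R² means doing worse than predicting the mean follows from the definition `R² = 1 - SS_res / SS_tot`, where `SS_tot` is measured around the mean of the true values. The sketch below uses made-up numbers, not values from `Defects.csv`, and `r2_manual` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 300.0, 500.0, 700.0])  # made-up targets, mean = 400

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Predicting the mean everywhere gives R² = 0 by construction
mean_pred = np.full_like(y_true, y_true.mean())
# A constant prediction that misses the mean by 50 gives a slightly negative R²
off_pred = np.full_like(y_true, y_true.mean() + 50.0)

r2_mean = r2_manual(y_true, mean_pred)
r2_off = r2_manual(y_true, off_pred)
print(round(r2_mean, 4), round(r2_off, 4))
```

Any model whose squared error exceeds that of the constant mean prediction lands below zero, which is the situation the metrics above describe.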


In [9]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=42)

param_grid_rf = {
    'n_estimators': [100, 200, 300],        # number of trees
    'max_depth': [None, 5, 10, 20],         # depth of trees
    'min_samples_split': [2, 5, 10],        # minimum samples to split a node
    'min_samples_leaf': [1, 2, 4]           # minimum samples in leaf node
}

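# Exhaustive search over the grid with 2-fold CV; a regressor's default scoring is R²
grid = GridSearchCV(rf_model, cv=2, param_grid=param_grid_rf)

rf_fit = grid.fit(x_train, y_train)

# Despite its name, `best_param` holds the refitted best model, not a parameter dict
best_param = grid.best_estimator_

rf_predict = best_param.predict(x_test)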

## Random Forest Regressor

A **Random Forest Regressor** was trained to predict `repair_cost` using the encoded categorical features.

- **Parameter Tuning:**  
  - `n_estimators`: [100, 200, 300] → number of trees in the forest  
  - `max_depth`: [None, 5, 10, 20] → maximum depth of each tree  
  - `min_samples_split`: [2, 5, 10] → minimum samples required to split a node  
  - `min_samples_leaf`: [1, 2, 4] → minimum samples in a leaf node  

- **Grid Search with 2-fold cross-validation** was used to find the best hyperparameters.  

- **Best Estimator:** `best_param` contains the model with the optimal combination of hyperparameters.  
- Predictions were made on the **test set** (`x_test`) using this best estimator.  

> Random Forest handles non-linear relationships better than Linear Regression and can automatically capture interactions between categorical features.
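
To get a sense of how much work the grid search does, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that reuses the grid values listed above; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid_rf = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid_rf))

# With cv=2, each candidate is fitted twice (the final refit is not counted here)
n_fits = n_candidates * 2
print(n_candidates, n_fits)
```

After fitting, `grid.best_params_` and `grid.best_score_` report which combination won and its mean cross-validated score.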


In [10]:
rf_mae = mean_absolute_error(y_test, rf_predict)
rf_mse = mean_squared_error(y_test, rf_predict)
rf_r2score = r2_score(y_test, rf_predict)
rf_rmse = np.sqrt(rf_mse)
print("Mean Absolute Error:", rf_mae)
print("Mean Squared Error:", rf_mse)
print("Root Mean Squared Error:", rf_rmse)
print("R2 Score:", rf_r2score)

Mean Absolute Error: 260.7738702424493
Mean Squared Error: 91584.58449236758
Root Mean Squared Error: 302.6294508014175
R2 Score: -0.037831570027557326


## Random Forest Model Evaluation

The Random Forest model was evaluated on the test set using standard regression metrics:

- **Mean Absolute Error (MAE):** 260.77  
  - On average, predictions are off by about ₹261.

- **Mean Squared Error (MSE):** 91,584.58  
  - Measures squared differences between predicted and actual values; higher values indicate larger errors.

- **Root Mean Squared Error (RMSE):** 302.63  
  - Typical magnitude of prediction error, in ₹.

- **R² Score:** -0.038  
  - Indicates the model explains almost none of the variance in repair costs.  
  - Negative R² suggests the model performs worse than simply predicting the mean.

**Key Insight:**  
- Like Linear Regression, the Random Forest model **does not improve predictions**; its test R² (-0.038) is slightly lower than Linear Regression's (-0.004). The **small dataset (~1000 rows)** and the limited, purely categorical feature set are likely contributors.  
- Feature engineering or additional data may be needed to enhance performance.
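
One common feature-engineering step for nominal categories is one-hot encoding, which avoids the artificial ordering that integer codes introduce. The sketch below is only an illustration on toy rows using `pd.get_dummies`; it is not part of the original notebook and makes no claim about whether it would change the scores reported here.

```python
import pandas as pd

# Toy rows with two of the dataset's categorical columns (illustrative values)
toy = pd.DataFrame({
    "defect_type": ["Structural", "Functional", "Cosmetic"],
    "severity": ["Minor", "Critical", "Moderate"],
})

# Each category becomes its own 0/1 column
one_hot = pd.get_dummies(toy, columns=["defect_type", "severity"], dtype=int)
print(one_hot.columns.tolist())
print(one_hot)
```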


In [11]:
from sklearn.ensemble import GradientBoostingRegressor

gr_model = GradientBoostingRegressor(random_state=0)
gr_fit = gr_model.fit(x_train, y_train)
gr_predict = gr_fit.predict(x_test)

## Gradient Boosting Regressor

A **Gradient Boosting Regressor** was trained to predict `repair_cost` using the encoded categorical features.

- Gradient Boosting builds an ensemble of **sequential decision trees**, where each tree tries to correct the errors of the previous one.  
- The model was trained on the **training set** (`x_train`, `y_train`) and used to predict repair costs on the **test set** (`x_test`).  

> Gradient Boosting is generally effective for small to medium datasets and can handle non-linear relationships and interactions between features better than Linear Regression.
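
The sequential error correction can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below uses synthetic data (`X_demo`, `y_demo`) chosen only for illustration, so the numbers say nothing about the defects dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data, assumed purely for demonstration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Training MSE after 1, 2, ..., 50 trees
stage_mse = [
    mean_squared_error(y_demo, pred) for pred in model.staged_predict(X_demo)
]
print(len(stage_mse), stage_mse[0], stage_mse[-1])
```

The training error shrinks as trees are added; whether test error follows depends on how much signal the features carry.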


In [12]:
gr_mae = mean_absolute_error(y_test, gr_predict)
gr_mse = mean_squared_error(y_test, gr_predict)
gr_r2score = r2_score(y_test, gr_predict)
gr_rmse = np.sqrt(gr_mse)  # use this model's own MSE, not the Linear Regression `mse`
print("Mean Absolute Error:", gr_mae)
print("Mean Squared Error:", gr_mse)
print("Root Mean Squared Error:", gr_rmse)
print("R2 Score:", gr_r2score)

Mean Absolute Error: 259.743447217093
Mean Squared Error: 91390.38855205402
Root Mean Squared Error: 302.3084
R2 Score: -0.035630952109752645
R2 Score: -0.035630952109752645


## Gradient Boosting Model Evaluation

The Gradient Boosting model was evaluated on the test set using standard regression metrics:

- **Mean Absolute Error (MAE):** 259.74  
  - On average, predictions are off by about ₹260.

- **Mean Squared Error (MSE):** 91,390.39  
  - Measures squared differences between predicted and actual values.

- **Root Mean Squared Error (RMSE):** 302.31  
  - Typical magnitude of prediction error, in ₹.

- **R² Score:** -0.036  
  - Indicates the model explains almost none of the variance in repair costs.  
  - Negative R² suggests the model performs worse than predicting the mean.

**Key Insight:**  
- Similar to Linear Regression and Random Forest, Gradient Boosting **does not significantly improve predictions**, likely due to the **small dataset (~1000 rows)** and mostly categorical features.


## Final Comparison
| Model                       |   MAE  |  RMSE  | R² Score | Observation                          |
| :-------------------------- | :----: | :----: | :------: | :----------------------------------- |
| Linear Regression           | 258.89 | 297.72 |  -0.004  | Baseline model, low performance      |
| Random Forest Regressor     | 260.77 | 302.63 |  -0.038  | Slightly worse, small dataset impact |
| Gradient Boosting Regressor | 259.74 | 302.31 |  -0.036  | Stable results, similar trend        |
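
A comparison table like this is easier to keep consistent when every model's metrics come from one helper, so that each RMSE is always derived from that model's own MSE. The sketch below is one possible way to do that; `regression_report` is a hypothetical helper and the predictions are made-up constants, not the notebook's models.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # always tied to this model's own MSE
        "R2": r2_score(y_true, y_pred),
    }

# Made-up targets and two constant "models" for illustration
y_true = np.array([100.0, 300.0, 500.0, 700.0])
preds = {
    "always_400": np.full(4, 400.0),
    "always_450": np.full(4, 450.0),
}

report = {name: regression_report(y_true, p) for name, p in preds.items()}
for name, metrics in report.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

In the notebook, the same helper could be called with `y_test` and each of `lr_predict`, `rf_predict`, and `gr_predict`.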
