In [None]:
# Download the Kaggle data sources used in this notebook.
# Note: outside Kaggle's Python environment, some libraries used by
# this notebook may be missing.
import kagglehub
claytonmiller_construction_and_project_management_example_data_path = kagglehub.dataset_download('claytonmiller/construction-and-project-management-example-data')
amirgh83_construction_quality_path = kagglehub.dataset_download('amirgh83/construction-quality')

print('Data source import complete.')


# 1. Introduction

## 1.1 Background
In the construction industry, quality issues such as improper workmanship, incomplete documentation, and material non-compliance frequently lead to costly delays, rework, and disputes. Despite the use of modern project management tools and routine inspections, many issues are still identified reactively — only after they have already caused disruption.

As a construction manager and researcher, I have observed the untapped potential in administrative data generated daily on construction sites. This includes task logs and daily forms submitted by field personnel. Leveraging this data with machine learning offers an opportunity to move from reactive issue detection to proactive risk prediction.

## 1.2 Project Objective
This project explores whether machine learning can be used to predict potential quality issues in construction projects using structured administrative data, without relying on sensors, photos, or BIM.

The key objective is to:  
**Build a machine learning model that predicts whether a newly submitted task record is likely to indicate a quality issue**, using features such as task group (e.g., Safety, QA/QC), cause (e.g., Documentation, Housekeeping), and overdue status. Priority level serves as the proxy label for a quality issue rather than as an input feature.

## 1.3 Key Research Question
**Can we use routine administrative construction data to predict potential quality issues before they occur?**

If successful, this project will:
- Provide a low-cost, scalable solution using data already collected  
- Empower field teams with smart, data-driven insights  
- Reduce risk, rework, and delays across construction projects of all sizes


# 2. Understanding the Dataset

## 2.1 Dataset Description
The dataset used in this project is titled **“Construction and Project Management Example Data”**, sourced from Kaggle. It consists of two primary components:
- Daily forms and reports submitted by field personnel (e.g., site diaries, work plans)
- Task records, including:
  - Task type  
  - Task group (e.g., Safety, QA/QC)  
  - Cause (e.g., Documentation, Housekeeping)  
  - Overdue status  
  - Priority level  
  - Status (Open/Closed)

Each record in this dataset reflects real-world construction field activity, making it highly relevant for quality risk modeling.

## 2.2 Why This Dataset Is Relevant
This dataset is ideal for building a predictive model because:
- It captures real-time field-level decision-making.  
- It includes operational indicators like **Cause**, **OverDue**, and **Priority**, which are commonly monitored by quality managers.  
- It requires no additional data collection tools or hardware — just analysis of existing logs.

**Examples:**
- Tasks marked **"High" priority** often signal urgent quality or safety concerns.  
- Causes like **"Documentation"** or **"Housekeeping"** frequently appear in non-conformance reports.  
- **Overdue tasks** may indicate lags in resolving critical issues.

These features are plausible predictors of potential risk; the exploratory analysis in Section 4 examines how they behave in the data.


# 3. Problem Framing

## 3.1 Machine Learning Problem Type
This project is a **supervised binary classification** task:

- **Target variable:** Whether a task is a quality issue (1 = Yes, 0 = No), derived from Priority  
- **Input features:** Cause, Task Group, Overdue flag, Task Type

## 3.2 Labeling Strategy
A task is labeled as a quality issue if it has a **High** priority.

Quality-related causes such as **Documentation**, **Housekeeping**, or **Access** are not part of the label. They enter the model as input features through the `Cause` column, which avoids leaking the label into the inputs.

Using priority as a proxy label reflects how field engineers and inspectors typically flag and prioritize worksite issues.

## 3.3 Potential Impact
Once trained, the model will:
- Take in a new task record  
- Output a risk probability score  
- Automatically flag risky tasks for early review  

This enables construction teams to:
- Act earlier to prevent quality failures  
- Optimize site inspections and interventions  
- Reduce rework, delays, and client disputes
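
The three steps above (take a record, output a probability, flag it) can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the eight training rows and their labels are made up, `score_task` and `REVIEW_THRESHOLD` are hypothetical names, and the 0.5 cut-off is a placeholder rather than a tuned value.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up task records for illustration only (not rows from the Kaggle dataset).
history = pd.DataFrame({
    "Cause": ["Documentation", "Housekeeping", "Access", "Housekeeping",
              "Documentation", "Access", "Housekeeping", "Documentation"],
    "Task Group": ["QA/QC", "Safety", "Safety", "Safety",
                   "QA/QC", "Site Management", "Safety", "QA/QC"],
    "OverDue": [1, 0, 0, 1, 1, 0, 0, 1],
})
is_issue = [1, 0, 0, 1, 1, 0, 0, 1]  # hypothetical labels

risk_model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Cause", "Task Group"])],
        remainder="passthrough",  # keep the 0/1 OverDue flag as-is
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
risk_model.fit(history, is_issue)

REVIEW_THRESHOLD = 0.5  # assumed cut-off for sending a task to early review


def score_task(record: dict) -> tuple[float, bool]:
    """Return (risk probability, review flag) for one new task record."""
    prob = risk_model.predict_proba(pd.DataFrame([record]))[0, 1]
    return float(prob), bool(prob >= REVIEW_THRESHOLD)


new_task = {"Cause": "Documentation", "Task Group": "QA/QC", "OverDue": 1}
risk, flagged = score_task(new_task)
print(f"risk={risk:.2f}, flagged={flagged}")
```

Wrapping the encoder and classifier in one pipeline means a raw record can be scored directly, without repeating the preprocessing by hand.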


# 4. Exploratory Data Analysis (EDA) on the Dataset

## 4.1 Goal of This Step
The purpose of exploratory data analysis is to better understand the structure of the dataset and uncover meaningful patterns or trends that will inform the selection of features for the predictive model. This step focuses on:

- Understanding how key features are distributed  
- Detecting patterns or anomalies linked to quality issues  
- Spotting missing values, class imbalance, and other data challenges  

The analysis uses the Tasks dataset, where each row represents an issue, observation, or comment logged by a construction project team.

## 4.2 Load and Preview the Data


In [None]:
import pandas as pd

# Load the task dataset
df = pd.read_csv("/kaggle/input/construction-and-project-management-example-data/Construction_Data_PM_Tasks_All_Projects.csv")

# Preview the first few rows
df.head()


✅ This dataset contains 12,424 rows, each representing a construction task or issue logged in the field.
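
Beyond the row count, two quick checks are useful before cleaning: how many values are missing per column, and what share of tasks fall in the "High" priority class. The sketch below runs on a small made-up frame called `demo` (an assumption, with the same column names as the task table) so that it works on its own; pointing the same calls at `df` would give the real figures.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the task table (assumption: same column names as df).
demo = pd.DataFrame({
    "Priority": ["High", "Medium", None, "Low", "Medium", None, "High", "Low"],
    "Cause": ["Documentation", None, "Housekeeping", "Access",
              None, "Housekeeping", "Access", "Documentation"],
    "OverDue": [True, False, False, True, np.nan, False, True, False],
})

# Count of missing entries per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Share of rows whose priority is "High" (the positive class used later).
high_share = (demo["Priority"].str.strip().str.lower() == "high").mean()
print(f"High-priority share: {high_share:.1%}")
```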

## 4.3 Initial Cleaning and Feature Preparation
Before diving into visualizations, we clean and prepare several key columns that will be used in modeling:


In [None]:
# Fill missing values
df["Priority"] = df["Priority"].fillna("None")
df["Cause"] = df["Cause"].fillna("Unknown")
df["Task Group"] = df["Task Group"].fillna("Unknown")

# Convert OverDue from boolean to integer.
# Fill missing values first: NaN would otherwise be cast to True by astype(bool).
df["OverDue"] = df["OverDue"].fillna(False).astype(bool).astype(int)


## 4.4 Feature Distributions and Visual Insights

### 📊 1. Distribution of Task Priorities
Understanding how task priorities are distributed in the dataset is crucial because this feature is directly tied to how construction teams assess the urgency and severity of site issues. In our model, tasks labeled as "High" priority will serve as the positive class, indicating a likely quality issue.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x="Priority", order=df["Priority"].value_counts().index, palette="Set2")
plt.title("Distribution of Task Priorities")
plt.xlabel("Priority Level")
plt.ylabel("Number of Tasks")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**What This Tells Us:**
- Many tasks are labeled "None", "Medium", or include vague custom labels.
- A relatively smaller portion of tasks are labeled "High", which may represent the most urgent or risk-prone issues.
- This imbalance highlights a challenge: the model must learn from limited positive examples, which increases the risk of bias or poor recall.


### 📊 2. Top 10 Causes of Tasks
The `Cause` column identifies why a task was created (e.g., housekeeping failures, documentation issues). These causes often reflect quality or safety problems.


In [None]:
top_causes = df["Cause"].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_causes.values, y=top_causes.index, palette="pastel")
plt.title("Top 10 Causes of Tasks")
plt.xlabel("Number of Tasks")
plt.ylabel("Cause")
plt.tight_layout()
plt.show()


**What This Tells Us:**
- The most frequent causes include Housekeeping, Access, and Documentation.
- These are highly relevant to quality and safety risks and are often flagged in non-compliance reports.
- This supports using `Cause` as a core feature in our model.


### 📊 3. Overdue Status by Task Priority
The `OverDue` column shows whether tasks were resolved on time. We expect higher-priority tasks to be more frequently overdue, indicating unresolved risk.


In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x="Priority", hue="OverDue", order=df["Priority"].value_counts().index)
plt.title("Overdue Status by Task Priority")
plt.xlabel("Priority Level")
plt.ylabel("Task Count")
plt.legend(title="OverDue")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**What This Tells Us:**
- In the plot, High and Medium priority tasks appear to be overdue more often than other tasks.
- Overdue status therefore looks associated with severity and is included as a predictive feature.
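
Because the priority levels differ greatly in size, raw counts can hide the pattern; the overdue rate per priority level is easier to compare. A minimal sketch, assuming `OverDue` is already a 0/1 column, and using a made-up frame `demo` in place of the cleaned `df`:

```python
import pandas as pd

# Made-up records for illustration; replace `demo` with the cleaned `df`.
demo = pd.DataFrame({
    "Priority": ["High", "High", "Medium", "Medium", "Medium", "Medium",
                 "None", "None", "None", "None", "None", "None"],
    "OverDue": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 flag is the fraction of overdue tasks in each level.
overdue_rate = (
    demo.groupby("Priority")["OverDue"]
        .agg(rate="mean", tasks="size")
        .sort_values("rate", ascending=False)
)
print(overdue_rate)
```

Keeping the `tasks` column next to the rate shows how many records each rate is based on, which matters for small groups such as "High".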


### 📊 4. Most Common Task Groups
The `Task Group` field helps us understand which departments most frequently log tasks — such as Safety, QA/QC, and Site Management.


In [None]:
top_groups = df["Task Group"].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_groups.values, y=top_groups.index, palette="muted")
plt.title("Most Common Task Groups")
plt.xlabel("Number of Tasks")
plt.ylabel("Task Group")
plt.tight_layout()
plt.show()


**What This Tells Us:**
- Most tasks are logged by Safety and Site Management teams.
- QA/QC and Design are also involved — likely tied to quality and compliance.
- Task Group provides helpful context for modeling.


## 4.5 Summary of Key Features for Modeling

| Feature      | Description                                      | Role in Prediction                      |
|--------------|--------------------------------------------------|------------------------------------------|
| Priority     | Indicates how urgent or severe a task is         | Used to define the target variable       |
| Cause        | Root cause of the task (e.g., Housekeeping)      | Strong categorical predictor of quality  |
| OverDue      | Whether the task is overdue (1 = Yes, 0 = No)    | Time-based risk signal                   |
| Task Group   | Department responsible for the task              | Provides operational context             |
| Type         | Nature of the task (e.g., Safety Notice)         | May reflect recurring issue patterns     |

These features were selected because they align with how field engineers assess risk, and they are available in structured formats across most construction projects.


# 5. Data Preprocessing for Machine Learning

Before training a machine learning model, we convert the raw construction task data into a clean, structured, and numerical format.

In this section, we will:
1. Define the target variable  
2. Select relevant input features and handle missing values  
3. Split the dataset for training, validation, and testing  
4. Encode categorical features  
5. Scale numeric values  
6. Apply the fitted transformations to each split  

## 5.1 Define the Target Variable

The target variable is what we want the model to predict — whether a task is a quality issue.  
We use the `Priority` column as a proxy. If a task is labeled **High**, it’s considered a potential quality issue (`1`).  
All other priorities are labeled as non-issues (`0`). This creates a **binary classification problem**.


In [None]:
# Create the target variable
df["Priority"] = df["Priority"].fillna("None")
df["target_quality_issue"] = df["Priority"].apply(lambda x: 1 if str(x).strip().lower() == "high" else 0)


## 5.2 Select the Input Features

We select four features likely to influence whether a task is related to quality:

- **Cause**: Why the task was logged (e.g., Documentation, Access)  
- **Task Group**: Responsible team (e.g., Safety, QA/QC)  
- **Type**: Task or form type (e.g., Safety Notice, RFI)  
- **OverDue**: Whether the task is overdue (1) or not (0)  

These were selected because:
- They are directly tied to how field teams monitor and escalate issues  
- They appear in most task records  
- They showed clear patterns during exploratory analysis  


In [None]:
# Fill missing values and select features
df["Cause"] = df["Cause"].fillna("Unknown")
df["Task Group"] = df["Task Group"].fillna("Unknown")
df["Type"] = df["Type"].fillna("Unknown")
df["OverDue"] = df["OverDue"].fillna(False).astype(int)

# Define features and target
features = ["Cause", "Task Group", "Type", "OverDue"]
X = df[features]
y = df["target_quality_issue"]


## 5.3 Split into Training, Validation, and Test Sets

To evaluate the model fairly, we split the dataset into:

- **Training Set (64%)** — used to train the model  
- **Validation Set (16%)** — used to tune and test the model during development  
- **Test Set (20%)** — used for final evaluation  

Stratified sampling ensures that the ratio of quality issues (`1`) to non-issues (`0`) stays consistent in each split.
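
The 64/16/20 percentages follow from applying a 20% hold-out twice: 0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16. The arithmetic can be checked with the reported row count. The rounding rule here is an assumption: the sketch rounds a fractional hold-out size up to a whole number of rows, which is how `train_test_split` is understood to treat a float `test_size`.

```python
import math

N_ROWS = 12_424       # row count reported in Section 4.2
TEST_SHARE = 0.2      # first split: hold out the test set
VAL_SHARE = 0.2       # second split: validation share of the remainder

# Assumption: a fractional hold-out size is rounded up to whole rows.
n_test = math.ceil(TEST_SHARE * N_ROWS)
n_temp = N_ROWS - n_test
n_val = math.ceil(VAL_SHARE * n_temp)
n_train = n_temp - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:<10} {n:>6}  ({n / N_ROWS:.1%})")
```

Under that assumption the validation set has 1,988 rows, which matches the total of the confusion matrix counts reported in Section 7.1.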


In [None]:
from sklearn.model_selection import train_test_split

# First split: temp (train + val) and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Second split: train and val
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=42)

# Display split sizes
print("Training set size:", X_train.shape)
print("Validation set size:", X_val.shape)
print("Test set size:", X_test.shape)


## 5.4 Encode and Scale the Features

Machine learning algorithms require numeric inputs. We process features in two steps:

### 🔤 One-Hot Encoding (Categorical Columns)
We convert categorical features into binary columns. For example:
- `Cause = Documentation` becomes a column called `Cause_Documentation`

### 📏 Standard Scaling (Numeric Columns)
We scale the `OverDue` column so that its values have a mean of 0 and standard deviation of 1.

We'll use **scikit-learn pipelines** to clean, encode, and scale the data consistently.


In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define categorical and numeric columns
categorical_cols = ["Cause", "Task Group", "Type"]
numeric_cols = ["OverDue"]

# Categorical pipeline
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Numeric pipeline
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler())
])

# Combine the pipelines
preprocessor = ColumnTransformer([
    ("cat", categorical_pipeline, categorical_cols),
    ("num", numeric_pipeline, numeric_cols)
])


## 5.5 Apply Transformations to the Data

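We fit the preprocessing pipeline on the training set only, then apply it to the training, validation, and test sets. Fitting on the training data alone keeps information from the held-out sets out of the transformations.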


In [None]:
# Fit on training, transform all
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

# Confirm output shapes
print("Training shape:", X_train_processed.shape)
print("Validation shape:", X_val_processed.shape)
print("Test shape:", X_test_processed.shape)


# 6. Model Training: Logistic Regression

Now that the dataset is preprocessed, we move to model training. The objective is to build a model that can accurately classify construction tasks as either quality issues (1) or non-issues (0) using the selected features.

## 6.1 Selecting Logistic Regression for Baseline Modeling

Logistic Regression is chosen as the initial model because it combines:

- **Simplicity**: Easy to understand and implement.
- **Interpretability**: Coefficients explain each feature’s influence on risk.
- **Efficiency**: Fast to train even on large datasets.
- **Balance Handling**: Supports `class_weight='balanced'` to account for rare quality issues.

This makes it an ideal field-friendly, transparent baseline before testing more complex models.
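
As a rough illustration of what `class_weight='balanced'` does, scikit-learn documents the weights as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the 980/20 label split are hypothetical, not the class counts of this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label vector: 980 non-issues and 20 quality issues.
labels = [0] * 980 + [1] * 20
weights = balanced_class_weights(labels)
print(weights)
```

With these made-up counts, each positive example weighs about 49 times as much as a negative one, which pushes the model toward high recall at the cost of more false positives.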

## 6.2 Training the Model

We use the `scikit-learn` library to initialize and train the logistic regression model.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the logistic regression model
model = LogisticRegression(
    max_iter=1000,              # Allow enough iterations for the solver to converge
    class_weight='balanced',   # Handle class imbalance
    random_state=42            # Reproducibility
)

# Train the model
model.fit(X_train_processed, y_train)


## 6.3 Validation and Evaluation

After training, we evaluate the model on the validation set to assess how well it generalizes to unseen data.


In [None]:
# Predict class labels and probabilities
y_val_pred = model.predict(X_val_processed)
y_val_prob = model.predict_proba(X_val_processed)[:, 1]

# Evaluate performance
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print("Classification Report:\n", classification_report(y_val, y_val_pred))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))


## 📊 6.4 Performance Summary

Key evaluation metrics:

- **Accuracy**: ~83.7%
- **Recall (for quality issues)**: ~97%
- **Precision (for quality issues)**: ~9%
- **ROC AUC Score**: ~0.83

### What These Metrics Mean:

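- **Accuracy** looks high but is not a reliable guide on an imbalanced dataset: with 34 positives among 1,988 validation tasks (see Section 7.1), a model that never flags anything would reach about 98%.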
- **Recall** is excellent (~97%), meaning the model catches almost all true quality issues.
- **Precision** is low (~9%), indicating many false positives — which may be acceptable in safety-critical fields.
- **ROC AUC Score** of ~0.83 shows strong ability to separate risky from normal tasks.

> ROC = Receiver Operating Characteristic  
> AUC = Area Under the Curve: the probability that the model scores a randomly chosen quality issue higher than a randomly chosen non-issue.
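
The ranking interpretation of AUC can be checked by hand on a tiny example. The sketch below counts correctly ordered (positive, negative) pairs directly; `pairwise_auc` is a hypothetical helper and the four labels and scores are made up. On the notebook's own variables, `sklearn.metrics.roc_auc_score(y_val, y_val_prob)` would be the usual way to obtain the score.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for four tasks.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(pairwise_auc(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, hence 0.75.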


## 6.5 Interpretation

The results reflect common trade-offs in construction risk modeling:

- High **recall** ensures we rarely miss real problems.
- Low **precision** means we flag many false alarms, but this is acceptable when safety is a priority.
- The model becomes a proactive assistant, guiding field engineers to investigate high-risk tasks first.

**Example**:  
A superintendent gets a flagged list every morning. Even one early warning about poor documentation could prevent costly rework.


## 6.6 How the Model Aligns with Project Objectives

This model supports our goals in multiple ways:

- **Early Detection**: Predicts which tasks pose quality risks.
- **Explainability**: Coefficients clarify why a task was flagged.
- **Rare Case Handling**: `class_weight='balanced'` addresses low-frequency critical tasks.
- **Efficiency**: Lightweight and easy to retrain or deploy in a construction field environment.

We don’t need a perfect model — we need a reliable one that supports proactive risk management and communicates its logic clearly.


## 6.7 What the Model Learns

Logistic Regression assigns a **coefficient** (weight) to each feature:

- A **positive coefficient** increases the likelihood a task is labeled a quality issue.
- A **negative coefficient** reduces that likelihood.

This allows teams to interpret and trust model outputs — making the results actionable, not just predictive.
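
A small worked example shows how the sign of a coefficient moves the predicted probability. The intercept, the feature names, and the weights below are hypothetical values chosen for illustration, not numbers taken from the trained model, and each listed feature is treated as a simple 0/1 indicator.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept and coefficients, not values from the trained model.
intercept = -2.0
coefficients = {"Cause_Documentation": 1.5, "Task Group_Safety": -0.5, "OverDue": 0.8}


def predict_risk(active_features):
    """Probability for a record whose listed features equal 1 (all others 0)."""
    z = intercept + sum(coefficients[name] for name in active_features)
    return sigmoid(z)


baseline = predict_risk([])
with_doc = predict_risk(["Cause_Documentation"])
with_safety = predict_risk(["Task Group_Safety"])
print(f"baseline={baseline:.3f}, +Documentation={with_doc:.3f}, +Safety={with_safety:.3f}")
```

For the fitted model, pairing `model.coef_[0]` with `preprocessor.get_feature_names_out()` (available in recent scikit-learn versions) should list the learned weight for each encoded column.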


# 7. Model Evaluation and Insights

After training the Logistic Regression model, we assess how well it performs on unseen data using the validation set. This evaluation addresses several key questions:

- How accurate is the model overall?
- Can it reliably detect actual quality issues?
- Does it generate too many false alarms?
- Is it effective enough to support field decision-making?


## 📊 7.1 Confusion Matrix

The confusion matrix breaks predictions into four categories:

- **True Positives (TP)**: Correctly flagged quality issues
- **True Negatives (TN)**: Correctly identified non-issues
- **False Positives (FP)**: Normal tasks incorrectly flagged
- **False Negatives (FN)**: Missed quality issues

In our results:

- ✅ TP = 33
- ❌ FN = 1
- ✅ TN = 1,632
- ❌ FP = 322

👉 The model is **highly sensitive** (it misses only 1 of 34 real issues) but clearly over-cautious (322 false alarms against 33 true positives). This trade-off can be acceptable in construction settings where missing a real issue is riskier than reviewing a false alarm.
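
The headline metrics can be recomputed from these four counts, which is a useful consistency check. The last line adds one extra reference point: the accuracy of a trivial model that never flags any task.

```python
# Counts reported above for the validation set.
tp, fn, tn, fp = 33, 1, 1632, 322

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # share of real issues that were flagged
precision = tp / (tp + fp)       # share of flagged tasks that were real issues
# Accuracy of a trivial model that never flags anything.
never_flag_accuracy = (tn + fp) / total

print(f"n={total}, accuracy={accuracy:.3f}, recall={recall:.3f}, "
      f"precision={precision:.3f}, never-flag accuracy={never_flag_accuracy:.3f}")
```

The recomputed recall (about 97%) and precision (about 9%) agree with the figures in Section 6.4, while the never-flag baseline shows why accuracy alone says little on data this imbalanced.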


## 📈 7.2 ROC Curve (Receiver Operating Characteristic)

The ROC curve shows the trade-off between recall (sensitivity) and specificity across all thresholds.

- **AUC = 0.83**, which is considered strong.
- AUC (Area Under the Curve) ranges from 0 to 1, where 0.5 corresponds to random guessing and 1.0 to perfect ranking.

✅ This means the model does a solid job separating risky from non-risky tasks.


## 📉 7.3 Precision-Recall Curve

In imbalanced datasets like ours, the Precision-Recall Curve is more informative than accuracy. At the default decision threshold of 0.5 used by `predict`:

- **Recall ~97%** → The model catches nearly all true quality issues.
- **Precision ~9%** → Only about 1 in 11 flagged tasks is truly an issue.

This indicates the model casts a wide net. It generates many false positives, but it errs on the side of caution, which is the safer direction in construction risk mitigation.
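
Precision and recall depend on the probability threshold used to flag a task, so the balance can be shifted without retraining. The sketch below illustrates the trade-off on made-up labels and scores; `precision_recall_at` is a hypothetical helper. On the notebook's variables, `sklearn.metrics.precision_recall_curve(y_val, y_val_prob)` would trace the full curve.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when tasks with score >= threshold are flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y == 1 for f, y in zip(flagged, y_true))
    fp = sum(f and y == 0 for f, y in zip(flagged, y_true))
    fn = sum((not f) and y == 1 for f, y in zip(flagged, y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and risk scores for eight tasks.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.55, 0.60, 0.70, 0.80, 0.90]

for threshold in (0.5, 0.75):
    p, r = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, raising the threshold from 0.5 to 0.75 lifts precision from 0.60 to 1.00 while recall drops from 1.00 to 0.67, the same kind of trade-off a field team would weigh when choosing how many flagged tasks to review.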


## 📌 7.4 Summary of Evaluation Results

- **Accuracy ~83.7%**: Good overall but can be misleading in imbalanced datasets.
- **Recall ~97%**: Excellent at detecting real issues.
- **Precision ~9%**: Lower, which leads to more false positives.
- **AUC ~0.83**: Strong at ranking task risk.

🎯 Overall, this model prioritizes safety by flagging anything potentially risky — ideal for field usage.


## ✅ 7.5 Strengths of the Model

1. **High Recall**: Rarely misses actual quality issues.
2. **Interpretable**: Logistic Regression offers feature weights that are easy to explain.
3. **Lightweight & Adaptable**: Easy to retrain with new data.

These make it an excellent baseline tool for field engineers and project managers.


## ⚠️ 7.6 Limitations of the Model

1. **Low Precision**: Generates many false alarms.
2. **Class Imbalance**: With so few “High” priority tasks, the model has limited positive examples to learn from.
3. **Model Simplicity**: Logistic Regression may not capture complex feature interactions.

These are common challenges in early-stage models and provide direction for further improvement.


## 🔧 7.7 What Can Be Improved Next

To boost performance, especially precision, the following strategies can be explored:

- **SMOTE or Undersampling**: Balance the dataset by adjusting class ratios.
- **Advanced Models**: Try Random Forest, XGBoost, or Neural Networks for better pattern recognition.
- **Feature Engineering**: Add context with features like time of day, recurrence history, or project phase.
- **Hyperparameter Tuning**: Adjust regularization, thresholds, and solver choices.

These steps will help build a more robust model that balances caution and accuracy.
