# **ML in Construction: Extended Practice Notebook**

Welcome! This notebook demonstrates **machine learning workflows** in a construction context, using **Concrete Compressive Strength** data as an example.

We'll cover:
1. **Data Loading & Exploration**
2. **Basic Model**: Linear Regression
3. **Advanced Models**: Random Forest & XGBoost
4. **Hyperparameter Tuning**
5. **Discussion of Data Leakage and Best Practices**
6. **Pointers to Real-World Data Sources**

The methods here apply to many construction-related problems (cost estimation, schedule forecasting, safety classification, etc.), even though we focus on concrete strength data for demonstration.


## 1. Import Libraries

We'll use:
- **pandas**, **numpy** for data handling.
- **matplotlib**, **seaborn** for visualization.
- **scikit-learn**, **xgboost** for machine learning models.

> If you haven't installed `xgboost`, you'll need to install it (e.g., `pip install xgboost`).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# For XGBoost:
try:
    from xgboost import XGBRegressor
    xgboost_installed = True
except ImportError:
    print("XGBoost is not installed. Please install via 'pip install xgboost' if you'd like to run XGBRegressor.")
    xgboost_installed = False

sns.set_theme(style="whitegrid")
print("Libraries imported successfully!")

## 2. Load the Dataset

We'll use the **Concrete Compressive Strength** dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength). It contains concrete mixtures with 8 input variables and 1 target (compressive strength).

Make sure you have `Concrete_Data.csv` in the same directory (or update the path below).

In [None]:
try:
    df = pd.read_csv('Concrete_Data.csv')
    print("Data loaded from Concrete_Data.csv")
except FileNotFoundError:
    print("Could not find Concrete_Data.csv. Please place it in the same folder or update the file path.")
    df = None

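if df is not None:
    # A bare expression inside an `if` block is not auto-rendered by Jupyter, so call display() explicitly
    display(df.head())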

### Dataset Info
In most distributed versions of this dataset, the columns appear in the following order (the exact header text may differ between files):
```
1) Cement (component 1) -- quantitative
2) Blast Furnace Slag (component 2) -- quantitative
3) Fly Ash (component 3) -- quantitative
4) Water (component 4) -- quantitative
5) Superplasticizer (component 5) -- quantitative
6) Coarse Aggregate (component 6) -- quantitative
7) Fine Aggregate (component 7) -- quantitative
8) Age (day) -- quantitative
9) Concrete compressive strength (MPa) -- quantitative
```
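
The verbose headers are awkward to type, so it can help to assign short names by position. The sketch below assumes the nine-column order listed above; the short names (`Slag`, `FlyAsh`, `CoarseAgg`, and so on) and the helper `apply_short_names` are hypothetical choices for illustration, not part of the dataset, and the single demo row contains made-up values.

```python
import pandas as pd

# Hypothetical short names, one per column, in the order listed above.
SHORT_NAMES = [
    "Cement", "Slag", "FlyAsh", "Water", "Superplasticizer",
    "CoarseAgg", "FineAgg", "Age", "Strength",
]


def apply_short_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with positional short names; fail loudly on a column-count mismatch."""
    if frame.shape[1] != len(SHORT_NAMES):
        raise ValueError(f"Expected {len(SHORT_NAMES)} columns, got {frame.shape[1]}")
    renamed = frame.copy()
    renamed.columns = SHORT_NAMES
    return renamed


# Tiny stand-in frame with verbose headers (values are made up for illustration).
demo = pd.DataFrame(
    [[300.0, 0.0, 0.0, 180.0, 0.0, 1000.0, 800.0, 28, 35.0]],
    columns=[f"verbose header {i}" for i in range(1, 10)],
)
demo_short = apply_short_names(demo)
print(demo_short.columns.tolist())
```

Because the renaming is positional, it is only safe if your file really follows the order above, so check `df.columns` before relying on it.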


## 3. Data Inspection & Cleaning
We'll look for:
- Missing values
- Basic statistics
- Potential outliers

> This dataset typically has no missing values, but let's confirm. Also, we'll rename the last column to 'Strength' if needed.

In [None]:
if df is not None:
    print("Dataset Shape:", df.shape)
    print("\nColumn Names:", df.columns.tolist())
    print("\nMissing Values per Column:")
    print(df.isna().sum())

    # Rename the target (last) column to 'Strength' so that later cells can rely on that exact name
    if df.columns[-1] != 'Strength':
        df = df.rename(columns={df.columns[-1]: 'Strength'})

    display(df.describe())
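
The cell above covers missing values and basic statistics. For the "potential outliers" item, one common rule of thumb is the 1.5 x IQR fence. Here is a minimal sketch on a synthetic stand-in frame; the helper name `iqr_outlier_counts`, the threshold `k=1.5`, and the demo values are assumptions for illustration, not results from the concrete dataset.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Synthetic stand-in data: one extreme 'Water' value is injected on purpose.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Cement": rng.normal(300, 50, 200),
    "Water": rng.normal(180, 10, 200),
})
demo.loc[0, "Water"] = 400.0

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

On the real data you could call `iqr_outlier_counts(df)`. A flagged value is only a candidate for review: an unusual mix can be perfectly valid, so check it against domain knowledge before dropping anything.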

### Quick Visual Checks

In [None]:
if df is not None:
    # Histograms
    df.hist(figsize=(12,8), bins=20, color='steelblue', edgecolor='black')
    plt.suptitle('Feature Distributions', fontsize=14)
    plt.tight_layout()
    plt.show()

    # Correlation Heatmap
    plt.figure(figsize=(8,6))
    corr = df.corr()
    sns.heatmap(corr, annot=True, cmap='Blues')
    plt.title('Correlation Matrix')
    plt.show()

## 4. Train/Test Split & Avoiding Data Leakage
We'll separate the target variable (**Strength**) from the features. Then we'll create a train/test split. **Important**: Any scaler or other transform must be fitted *after* splitting, on the training set only, so that no statistics from the test set leak into training.

In [None]:
if df is not None:
    # Separate features and target
    X = df.drop('Strength', axis=1)
    y = df['Strength']

    # Create a train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=42)

    print(f"Training set: {X_train.shape[0]} rows, Test set: {X_test.shape[0]} rows")

### (Optional) Feature Scaling
Some advanced models (like neural networks) benefit significantly from scaling. Tree-based models (like Random Forest, XGBoost) are generally less sensitive, but let's demonstrate it for completeness.

> **Note**: We fit the scaler on **X_train** only, then apply the same transform to **X_test**.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled using StandardScaler.")

## 5. Baseline Model: Linear Regression
Let's start with a simple baseline model to see how it performs.

In [None]:
# Initialize
lin_reg = LinearRegression()
# Fit on scaled data
lin_reg.fit(X_train_scaled, y_train)
# Predict
y_pred_lr = lin_reg.predict(X_test_scaled)
# Evaluate
mae_lr = mean_absolute_error(y_test, y_pred_lr)
# RMSE as sqrt(MSE); the `squared=False` argument was removed in scikit-learn 1.6
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Performance:")
print(f"  MAE : {mae_lr:.3f}")
print(f"  RMSE: {rmse_lr:.3f}")
print(f"  R^2 : {r2_lr:.3f}")

### Baseline Interpretation
We have an initial benchmark. Next, we’ll try **more advanced ML models** to see if we can improve performance.
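
To judge whether a benchmark is meaningful, it can help to compare it against a model that ignores the features entirely. The sketch below does this on synthetic data, using scikit-learn's `DummyRegressor` with `strategy="mean"`; the data and the names `X_demo`, `y_demo`, `rmse_mean`, and `rmse_lin` are illustrative assumptions, not results from the concrete dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a linear signal plus noise (illustration only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 20 + 30 * X_demo[:, 0] - 10 * X_demo[:, 1] + rng.normal(0, 2, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The "predict the training mean" model is the floor any real model should beat.
mean_model = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
lin_model = LinearRegression().fit(X_tr, y_tr)

rmse_mean = np.sqrt(mean_squared_error(y_te, mean_model.predict(X_te)))
rmse_lin = np.sqrt(mean_squared_error(y_te, lin_model.predict(X_te)))

print(f"Mean-only RMSE       : {rmse_mean:.3f}")
print(f"Linear regression RMSE: {rmse_lin:.3f}")
```

If a model's RMSE is not clearly below the mean-only RMSE, the features are adding little, whatever the absolute numbers look like.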

## 6. Advanced Model: Random Forest

Random Forest is an **ensemble** of decision trees, often performing very well on tabular data. It can capture non-linear relationships better than linear regression.
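
To see the non-linearity point in isolation, here is a small sketch on synthetic data where the target is a quadratic function of a single feature. A straight line cannot follow the curve, while an ensemble of trees can approximate it piecewise. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic quadratic relationship: y = x^2 + noise, with x symmetric around 0.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.3, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

r2_linear = r2_score(y_te, linear.predict(X_te))
r2_forest = r2_score(y_te, forest.predict(X_te))

print(f"Linear regression R^2: {r2_linear:.3f}")
print(f"Random forest R^2    : {r2_forest:.3f}")
```

Real concrete data is less extreme than this toy case, but relationships such as strength versus age are typically curved rather than linear, which is the kind of pattern tree ensembles can pick up.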

In [None]:
# Initialize RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100,
                               random_state=42)

# Train on scaled data (though for Random Forest, scaling is often less critical)
rf_reg.fit(X_train_scaled, y_train)

# Predict
y_pred_rf = rf_reg.predict(X_test_scaled)

# Evaluate
mae_rf = mean_absolute_error(y_test, y_pred_rf)
# RMSE as sqrt(MSE); the `squared=False` argument was removed in scikit-learn 1.6
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Performance:")
print(f"  MAE : {mae_rf:.3f}")
print(f"  RMSE: {rmse_rf:.3f}")
print(f"  R^2 : {r2_rf:.3f}")

### Feature Importance
Random Forest (and other tree-based models) can provide an estimate of feature importance.
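
The built-in `feature_importances_` values are impurity-based and computed on the training data. A complementary check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the feature labels `signal_strong`, `signal_weak`, and `noise` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two columns drive the target; the third is pure noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 0.5, 400)
demo_names = ["signal_strong", "signal_weak", "noise"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the mean drop in R^2.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=1)

for name, mean_drop in zip(demo_names, perm.importances_mean):
    print(f"  {name}: {mean_drop:.4f}")
```

If the two rankings disagree strongly on your data, that is worth investigating (correlated features are a common cause) before drawing engineering conclusions from either one.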

In [None]:
if df is not None:
    importances = rf_reg.feature_importances_
    feature_names = X_train.columns
    # Pair and sort
    feat_imp = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)

    print("Feature Importances (Random Forest):")
    for name, imp in feat_imp:
        print(f"  {name}: {imp:.4f}")

    # Let's do a quick bar chart
    names, scores = zip(*feat_imp)
    plt.figure(figsize=(8,5))
    # A single color avoids the seaborn >= 0.13 deprecation warning for `palette` without `hue`
    sns.barplot(x=list(scores), y=list(names), color='steelblue')
    plt.title('Random Forest Feature Importance')
    plt.xlabel('Importance')
    plt.ylabel('Features')
    plt.show()

## 7. Advanced Model: XGBoost (Optional)

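**XGBoost** (Extreme Gradient Boosting) is another popular library for tree-based models. Unlike a Random Forest, which averages independently grown trees, gradient boosting builds trees sequentially, with each new tree correcting the errors of the ensemble so far. On many structured (tabular) datasets it matches or outperforms Random Forest in accuracy or training speed, although the outcome depends on the data and on tuning. If you have not installed it, you can skip this step.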
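
If XGBoost is not available, the idea of boosting can still be explored with scikit-learn's own `GradientBoostingRegressor`. Note that this is a stand-in, not XGBoost itself, and its defaults and speed differ. The sketch below uses the synthetic `make_friedman1` data and `staged_predict` to show how the test error changes as trees are added; all settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (illustration only).
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

booster = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
booster.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after each additional tree.
staged_rmse = [
    np.sqrt(mean_squared_error(y_te, pred)) for pred in booster.staged_predict(X_te)
]

print(f"RMSE after 1 tree   : {staged_rmse[0]:.3f}")
print(f"RMSE after {len(staged_rmse)} trees: {staged_rmse[-1]:.3f}")
```

Plotting `staged_rmse` against the number of trees is a quick way to see where additional trees stop helping.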


In [None]:
if xgboost_installed:
    xgb_reg = XGBRegressor(random_state=42)
    xgb_reg.fit(X_train_scaled, y_train)
    
    y_pred_xgb = xgb_reg.predict(X_test_scaled)
    
    mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
    # RMSE as sqrt(MSE); the `squared=False` argument was removed in scikit-learn 1.6
    rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
    r2_xgb = r2_score(y_test, y_pred_xgb)

    print("XGBoost Performance:")
    print(f"  MAE : {mae_xgb:.3f}")
    print(f"  RMSE: {rmse_xgb:.3f}")
    print(f"  R^2 : {r2_xgb:.3f}")
else:
    print("XGBoost not installed. Skipping XGBRegressor.")

## 8. Hyperparameter Tuning

We can improve model performance by tuning key hyperparameters. Let's illustrate with **RandomizedSearchCV** on **RandomForestRegressor**. (We choose random search because it's often more efficient than grid search for multiple parameters.)

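> `RandomizedSearchCV` scores every candidate with k-fold cross-validation on the training set only (here `cv=3`), then refits the best candidate on the full training set. The test set stays untouched until the final evaluation. Watch the training time on larger datasets: the search runs `n_iter` x `cv` fits plus the final refit.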
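
Random search is not limited to fixed lists: `param_distributions` also accepts distributions to sample from, and the fitted search exposes every candidate's score in `cv_results_`. The sketch below shows both on synthetic data; the ranges passed to `scipy.stats.randint` are arbitrary assumptions, not recommended values.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# randint(low, high) samples integers from low up to high - 1.
param_dist = {
    "n_estimators": randint(20, 61),
    "max_depth": randint(3, 11),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=4,  # kept small so the sketch runs quickly
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best first.
cv_table = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(cv_table.to_string(index=False))
```

Comparing `mean_test_score` with `std_test_score` across rows shows whether the "best" candidate is clearly better or within the noise of the other candidates.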

In [None]:
# We'll do a quick random search on RandomForestRegressor.
# Adjust the param_distributions as needed.

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

rf_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    scoring='neg_root_mean_squared_error', # negative RMSE (maximize this)
    n_iter=5, # for demonstration, keep it small
    cv=3,
    random_state=42,
    n_jobs=-1 # use all available CPU cores
)

rf_search.fit(X_train_scaled, y_train)

print("Best Hyperparameters from RandomizedSearchCV:")
print(rf_search.best_params_)

# Evaluate best model on the test set
best_rf = rf_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test_scaled)

mae_best = mean_absolute_error(y_test, y_pred_best_rf)
# RMSE as sqrt(MSE); the `squared=False` argument was removed in scikit-learn 1.6
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best_rf))
r2_best = r2_score(y_test, y_pred_best_rf)

print("\nPerformance of Tuned Random Forest:")
print(f"  MAE : {mae_best:.3f}")
print(f"  RMSE: {rmse_best:.3f}")
print(f"  R^2 : {r2_best:.3f}")

## 9. Data Leakage & Best Practices
1. **Train-Test Split Before Transformations**: We split first, then fitted the scaler on `X_train` only and reused it to transform `X_test`.
2. **No Usage of Future Info**: Do not use any feature that would only be available after the target is known.
3. **Cross-Validation**: Averaging the score over several folds gives a more robust performance estimate than a single train/test split.
4. **Feature Engineering**: Use domain insights; for example, the water/cement ratio is an important quantity in real mix design.
5. **Versioning**: In real projects, keep track of data, transformations, and model versions to ensure reproducibility.
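
Points 1 and 3 can be combined by wrapping the scaler and the model in a scikit-learn `Pipeline`: during cross-validation the scaler is then refitted on each training fold, so the validation fold never influences the scaling statistics. A minimal sketch on synthetic data follows; the fold count and the model choice are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

# The scaler lives inside the pipeline, so it is fitted per training fold.
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
fold_rmse = -neg_rmse  # scikit-learn reports the negated RMSE

print("RMSE per fold:", np.round(fold_rmse, 3))
print(f"Mean RMSE: {fold_rmse.mean():.3f} (std {fold_rmse.std():.3f})")
```

The same pattern works with the notebook's data by passing `X_train` and `y_train` instead of the synthetic arrays.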

## 10. Summary of Results
Let's compare **MAE, RMSE, R²** for each model we tested.

- **Linear Regression**
- **Random Forest** (default)
- **XGBoost** (if installed)
- **Tuned Random Forest**

In [None]:
# Summarize results
results = {
    'LinearRegression': {
        'MAE': mae_lr,
        'RMSE': rmse_lr,
        'R2': r2_lr
    },
    'RandomForest_Default': {
        'MAE': mae_rf,
        'RMSE': rmse_rf,
        'R2': r2_rf
    },
    'RandomForest_Tuned': {
        'MAE': mae_best,
        'RMSE': rmse_best,
        'R2': r2_best
    }
}

if xgboost_installed:
    results['XGBoost'] = {
        'MAE': mae_xgb,
        'RMSE': rmse_xgb,
        'R2': r2_xgb
    }

# Print the table
df_results = pd.DataFrame(results).T
print("\nModel Comparison:")
display(df_results)

## 11. Real-World Data Sources for Construction

1. **UCI ML Repository** – [Concrete Compressive Strength](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) and other datasets.
2. **Kaggle** – Datasets on building permits, civil engineering, or cost estimation.
3. **Open Government Data** – Some city/state portals share inspection logs, building permits, etc.
4. **Company/Internal Data** – Real site logs, sensor data, project schedules (often proprietary). Combine them with domain knowledge to create robust ML solutions.

## 12. Next Steps & Further Practice
- **Feature Engineering**: For real projects, domain knowledge is crucial (e.g., water/cement ratio, plasticizer usage, weather data).
- **Other Models**: Try LightGBM, CatBoost, or neural networks.
- **Cross-Validation**: For final model validation, use k-fold cross-validation.
- **Deployment**: In production, set up model monitoring (MLOps) to track performance over time.
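
As a starting point for the feature-engineering item, here is a minimal sketch that adds a water/cement ratio column. The column names `Water` and `Cement`, the new column `WaterCementRatio`, and the helper name are hypothetical; adapt them to the headers in your file. The two demo rows are made-up values.

```python
import pandas as pd


def add_water_cement_ratio(
    frame: pd.DataFrame, water_col: str = "Water", cement_col: str = "Cement"
) -> pd.DataFrame:
    """Return a copy of the frame with an extra water/cement ratio column."""
    out = frame.copy()
    out["WaterCementRatio"] = out[water_col] / out[cement_col]
    return out


# Made-up mixes for illustration.
demo = pd.DataFrame({"Cement": [300.0, 400.0], "Water": [180.0, 160.0]})
demo_fe = add_water_cement_ratio(demo)
print(demo_fe)
```

Because the ratio is computed row by row, it uses no information from other samples and can safely be added before the train/test split. A mix with zero cement would produce an infinite ratio, so check for that case on real data.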

Thank you for going through this extended notebook!