# Tutorial 1: Interactive Diagnostics - From Concepts to Code

**Course:** AE 597 - Diagnostics and Monitoring  
**Grant:** NCPTT Data-Driven Heritage Preservation

---

## 🎯 Goal: Accessible Data Science for Preservation

This tutorial is designed for **all skill levels**:
1.  **No-Code Explorer:** Use sliders and buttons to understand the concepts.
2.  **Low-Code Student:** Read the explanations to understand *how* it works.
3.  **Code-Curious:** Expand the "Under the Hood" sections to see the Python implementation.

**We will explore:**
-   **Factor Analysis:** Grouping hidden patterns of damage.
-   **Machine Learning:** Finding which conditions best predict overall degradation.

## 📂 Google Colab: How to Load Data
**If you are running this in Google Colab, you must upload the data file first.**

1.  Download `synthetic_adobe_data.csv` from the GitHub repository to your computer.
2.  In Colab, click the **Folder Icon** 📁 on the left sidebar.
3.  Click the **Upload Icon** (page with an upward arrow).
4.  Select `synthetic_adobe_data.csv` from your computer.
5.  Wait for the upload to finish.
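
For the code-curious, the same upload can be triggered from a cell. The sketch below assumes the standard `google.colab.files.upload()` helper, which opens a file picker and saves the chosen files to the working directory. The function name `ensure_data_file` is a hypothetical helper for illustration, not part of the course repository.

```python
import os

DATA_FILE = "synthetic_adobe_data.csv"

def ensure_data_file(path=DATA_FILE):
    """Return True if the data file is present, offering an upload in Colab if not."""
    if os.path.exists(path):
        return True
    try:
        # This module is only importable inside Google Colab.
        from google.colab import files
    except ImportError:
        print(f"Not running in Colab. Place {path} next to this notebook.")
        return False
    files.upload()  # opens a file picker; pick the CSV you downloaded
    return os.path.exists(path)

data_ready = ensure_data_file()
print("Data file present:", data_ready)
```

Outside Colab the helper only reports whether the file is already in place, so the cell is safe to run locally as well.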

## 🛠️ Setup (Run this once)
Press `Shift + Enter` on the cell below to load our tools.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, clear_output
from factor_analyzer import FactorAnalyzer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# Style settings
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load Data
try:
    df = pd.read_csv('synthetic_adobe_data.csv')
    # Preprocess
    cols_to_drop = ['Wall ID', 'Image Name', 'Notes', 'Date', 'Reviewer', 'Wall Rank', 'Section ID']
    df_clean = df.drop(columns=[c for c in cols_to_drop if c in df.columns])
    df_numeric = df_clean.select_dtypes(include=[np.number]).dropna(axis=1, how='all')
    imputer = SimpleImputer(strategy='median')
    df_imputed = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns)
    print("✅ Data Loaded Successfully!")
except FileNotFoundError:
    print("⚠️ Data file not found. Upload 'synthetic_adobe_data.csv' (see the Colab steps above) or run 'generate_synthetic_data.py' first.")

✅ Data Loaded Successfully!
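
The setup cell fills missing scores with each column's median before any analysis runs. To see what that step does, here is a minimal sketch on a made-up table; the column names and values are invented for illustration and are not taken from `synthetic_adobe_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up scores with two gaps (NaN = the reviewer left the field blank)
toy = pd.DataFrame({
    "Cracking": [1.0, 3.0, np.nan, 5.0],
    "Erosion": [2.0, np.nan, 2.0, 4.0],
})

# Same strategy as the setup cell: replace each gap with that column's median
imputer = SimpleImputer(strategy="median")
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_filled)
```

The gap in `Cracking` becomes 3.0 (the median of 1, 3, 5) and the gap in `Erosion` becomes 2.0 (the median of 2, 2, 4). The median is less sensitive than the mean to a few extreme scores, which is one common reason to choose it.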


---
## 🔍 Part 1: Data Explorer (No-Code)

Before analyzing, we must look at our data. 

**Interactive Task:**
1.  Select a variable from the dropdown.
2.  Look at the histogram. Is the data spread out? Clumped together?
3.  *Preservation Insight:* If "Cap Deterioration" is mostly high numbers (4-5), we have a serious roofing problem.

In [2]:
# --- INTERACTIVE WIDGET CODE ---
def plot_distribution(column):
    plt.figure(figsize=(8, 4))
    sns.histplot(df_imputed[column], kde=True, color='teal')
    plt.title(f'Distribution of {column}')
    plt.xlabel('Score / Value')
    plt.ylabel('Count of Walls')
    plt.show()

dropdown = widgets.Dropdown(
    options=sorted(df_imputed.columns),
    value='Total Scr',
    description='Variable:',
    disabled=False,
)

widgets.interactive(plot_distribution, column=dropdown)

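
For the code-curious, the questions asked of the histogram (spread out or clumped, mostly high or mostly low) can also be answered with a few numbers. The sketch below is one possible summary, not part of the original notebook; `summarize_scores` and the sample values are hypothetical.

```python
import pandas as pd

def summarize_scores(scores, high_threshold=4):
    """Summarize a column of condition scores with three simple numbers."""
    scores = pd.Series(scores, dtype=float)
    return {
        "median": scores.median(),
        # Interquartile range: width of the middle 50% of the scores
        "spread_iqr": scores.quantile(0.75) - scores.quantile(0.25),
        # Fraction of walls scoring at or above the threshold
        "share_high": (scores >= high_threshold).mean(),
    }

cap_scores = [5, 4, 4, 5, 3, 4, 5, 2]  # made-up values for illustration
summary = summarize_scores(cap_scores)
print(summary)
```

With the real data, the same call would be `summarize_scores(df_imputed['Cap Deterioration'])`, assuming that column exists in your file. A `share_high` close to 1 matches the "mostly 4-5" pattern described in the task.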

---
## 🧩 Part 2: Factor Analysis (Low-Code)

**The Concept:** 
Imagine you have 30 different symptoms. A doctor groups them into 1 diagnosis (e.g., "Flu"). 
**Factor Analysis** does the same for buildings. It groups 30 damage types into a few "Latent Vulnerabilities."

**Interactive Task:**
1.  Use the slider to change the **Number of Factors**.
2.  Watch the heatmap change.
3.  **Goal:** Find the number where the groups look distinct (dark red blocks for strong positive loadings, dark blue for strong negative ones) and make physical sense.
    *   *Hint: Try 3 factors.*

In [3]:
# --- INTERACTIVE FACTOR ANALYSIS ---
def run_interactive_fa(n_factors=3):
    fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax')
    fa.fit(df_imputed)
    
    # Get loadings
    loadings = pd.DataFrame(
        fa.loadings_, 
        index=df_imputed.columns, 
        columns=[f'Factor {i+1}' for i in range(n_factors)]
    )
    
    # Filter for visibility (hide weak connections)
    significant = loadings[loadings.abs().max(axis=1) > 0.4]
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(significant, annot=True, fmt='.2f', cmap='RdBu_r', center=0, vmin=-1, vmax=1)
    plt.title(f'Factor Analysis with {n_factors} Factors')
    plt.show()

slider_fa = widgets.IntSlider(value=3, min=2, max=6, step=1, description='Factors:')
widgets.interactive(run_interactive_fa, n_factors=slider_fa)

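
Besides judging the heatmap by eye, a common rule of thumb for choosing the number of factors is the Kaiser criterion: count the eigenvalues of the correlation matrix that are greater than 1. The sketch below applies it to synthetic data built from two hidden drivers; the function name and the column names are hypothetical, and the rule is a starting point rather than a replacement for physical judgment.

```python
import numpy as np
import pandas as pd

def suggest_n_factors(data):
    """Count correlation-matrix eigenvalues above 1 (Kaiser criterion)."""
    corr = np.corrcoef(data.to_numpy(), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest first
    return int((eigenvalues > 1).sum()), eigenvalues

# Synthetic demo: four observed variables driven by two hidden causes
rng = np.random.default_rng(0)
n = 500
moisture = rng.normal(size=n)    # hidden driver 1
structural = rng.normal(size=n)  # hidden driver 2
demo = pd.DataFrame({
    "Basal Erosion": moisture + 0.5 * rng.normal(size=n),
    "Coving": moisture + 0.5 * rng.normal(size=n),
    "Crack Width": structural + 0.5 * rng.normal(size=n),
    "Leaning": structural + 0.5 * rng.normal(size=n),
})

n_suggested, eigenvalues = suggest_n_factors(demo)
print("Suggested number of factors:", n_suggested)
print("Eigenvalues:", np.round(eigenvalues, 2))
```

To try it on the tutorial data, pass `df_imputed` instead of `demo`. Columns with a single constant value have no defined correlation, so drop them first if the result contains `nan`.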

### 🧠 Reflection
If you selected **3 Factors**, you likely saw:
1.  **Factor 1:** Sills (Windows)
2.  **Factor 2:** Surface Coats & Lintels
3.  **Factor 3:** Structural Cracks & Out-of-Plane

This tells us we have three main "Diseases" attacking the fort.
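
To read groupings like these off a loadings table without squinting at the heatmap, you can list the variables that load strongly on each factor. This is a minimal sketch using the same 0.4 cutoff as the widget; `top_variables_per_factor` and the example loadings are made up for illustration.

```python
import pandas as pd

def top_variables_per_factor(loadings, threshold=0.4):
    """Map each factor to the variables whose absolute loading exceeds the threshold."""
    groups = {}
    for factor in loadings.columns:
        column = loadings[factor]
        strong = column[column.abs() > threshold]
        # Strongest loadings first, ignoring the sign
        groups[factor] = list(strong.abs().sort_values(ascending=False).index)
    return groups

# Made-up loadings table with the same layout as the one built in the widget
example_loadings = pd.DataFrame(
    {
        "Factor 1": [0.82, 0.75, 0.10, 0.05],
        "Factor 2": [0.08, 0.12, 0.79, -0.66],
    },
    index=["Sill 1", "Sill 2", "Crack Width", "Out-of-Plane"],
)

groups = top_variables_per_factor(example_loadings)
for factor, variables in groups.items():
    print(factor, "->", variables)
```

Inside `run_interactive_fa`, the `loadings` DataFrame could be passed to this function directly.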

---
## 🤖 Part 3: Machine Learning (Low-Code)

**The Concept:**
We want to know: *Which conditions best predict the total damage score?*
A **Random Forest** is like asking 100 preservation experts to vote on what matters most.

**Interactive Task:**
1.  **Trees:** How many experts (trees) do we ask? (10 vs 100)
2.  **Depth:** How complex can their logic be? (Shallow vs Deep)
3.  Observe which feature stays at the top. Is it consistent?

In [4]:
# --- INTERACTIVE RANDOM FOREST ---
def run_interactive_rf(n_trees=100, max_depth=5):
    X = df_imputed.drop(columns=['Total Scr'])
    y = df_imputed['Total Scr']
    
    rf = RandomForestRegressor(n_estimators=n_trees, max_depth=max_depth, random_state=42)
    rf.fit(X, y)
    
    importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(10, 5))
    sns.barplot(x=importances.values, y=importances.index, palette='viridis')
    plt.title(f'Top Predictors (Trees={n_trees}, Depth={max_depth})')
    plt.xlabel('Importance Score')
    plt.show()

slider_trees = widgets.IntSlider(value=100, min=10, max=200, step=10, description='Trees:')
slider_depth = widgets.IntSlider(value=5, min=2, max=20, step=1, description='Depth:')

ui = widgets.HBox([slider_trees, slider_depth])
out = widgets.interactive_output(run_interactive_rf, {'n_trees': slider_trees, 'max_depth': slider_depth})
display(ui, out)

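
The consistency question from the task can also be checked in code: fit the forest under several settings and record which feature comes out on top each time. The sketch below does this on synthetic data where one feature is built to dominate; the helper name, column names, and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_feature_by_setting(X, y, tree_counts=(10, 100), depths=(2, 5)):
    """Fit one forest per (trees, depth) pair and record its top-ranked feature."""
    rows = []
    for n_trees in tree_counts:
        for depth in depths:
            rf = RandomForestRegressor(
                n_estimators=n_trees, max_depth=depth, random_state=42
            )
            rf.fit(X, y)
            importances = pd.Series(rf.feature_importances_, index=X.columns)
            rows.append(
                {"trees": n_trees, "depth": depth, "top_feature": importances.idxmax()}
            )
    return pd.DataFrame(rows)

# Synthetic demo: integer scores 0-5, with "Basal Erosion" weighted most heavily
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.integers(0, 6, size=(300, 3)).astype(float),
    columns=["Basal Erosion", "Cracking", "Animal Activity"],
)
y_demo = 3 * X_demo["Basal Erosion"] + X_demo["Cracking"] + rng.normal(scale=0.5, size=300)

stability = top_feature_by_setting(X_demo, y_demo)
print(stability)
```

If the top feature changes from row to row on your own data, treat the ranking with caution. Importance scores describe what helps the model predict, which is not the same as proving a cause.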

### 🎓 Under the Hood (For the Code-Curious)

How did we do that? Here is the raw Python code used in the widget above.

```python
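# 1. Separate Predictors (X) and Target (y)
#    Use the cleaned, imputed table: the raw `df` still holds text columns and missing values.
X = df_imputed.drop(columns=['Total Scr'])
y = df_imputed['Total Scr']

# 2. Create the Model (random_state=42 makes the run repeatable, as in the widget)
model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# 3. Train the Model
model.fit(X, y)

# 4. Get Importance Scores, labeled by column and sorted from most to least important
scores = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)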
```
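
Importance scores are only worth interpreting if the model predicts reasonably well on walls it has not seen. One way to check, sketched below under the assumption that 5-fold cross-validation suits your sample size, is scikit-learn's `cross_val_score`. The data here is synthetic and the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the predictors and target used in the tutorial
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.integers(0, 6, size=(300, 3)).astype(float),
    columns=["Basal Erosion", "Cracking", "Animal Activity"],
)
y_demo = 3 * X_demo["Basal Erosion"] + X_demo["Cracking"] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# Each fold trains on 80% of the rows and scores R^2 on the remaining 20%
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

print("R^2 per fold:", np.round(r2_scores, 3))
print("Mean R^2:", round(r2_scores.mean(), 3))
```

To run this on the tutorial data, replace `X_demo` and `y_demo` with the `X` and `y` defined above. An R² near 1 means the forest explains most of the variation in the target; a value near 0 means the importance ranking rests on a weak model.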