# Tutorial 1: Interactive Diagnostics - From Concepts to Code

**Grant:** NCPTT Data-Driven Heritage Preservation

---

## 🎯 Goal: Accessible Data Science for Preservation

This tutorial is designed for **all skill levels**:
1.  **No-Code Explorer:** Use sliders and buttons to understand the concepts.
2.  **Low-Code Student:** Read the explanations to understand *how* it works.
3.  **Code-Curious:** Expand the "Under the Hood" sections to see the Python implementation.

**We will explore:**
-   **Factor Analysis:** Grouping hidden patterns of damage.
-   **Machine Learning (Elastic Net):** Identifying which damage types are most strongly associated with overall degradation, using regularized regression.

## 📦 Install Libraries
**Run this cell first to install the required analysis tools.**

In [None]:
!pip install factor_analyzer

## 📂 Google Colab: How to Load Data
**If you are running this in Google Colab, you must upload the data file first.**

1.  Download `synthetic_adobe_data.csv` from the GitHub repository to your computer.
2.  In Colab, click the **Folder Icon** 📁 on the left sidebar.
3.  Click the **Upload Icon** (page with an upward arrow).
4.  Select `synthetic_adobe_data.csv` from your computer.
5.  Wait for the upload to finish.
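
Before moving on, you can confirm that the upload worked. The snippet below is a minimal sketch: the helper name `data_file_status` is made up for this illustration, and it assumes the file was uploaded to the notebook's working folder.

```python
from pathlib import Path

DATA_FILE = "synthetic_adobe_data.csv"  # file name used throughout this tutorial

def data_file_status(filename, folder="."):
    """Return a short message saying whether `filename` exists in `folder`."""
    path = Path(folder) / filename
    if path.is_file():
        size_kb = path.stat().st_size / 1024
        return f"Found {filename} ({size_kb:.1f} KB)"
    return f"{filename} is missing from {path.parent.resolve()}"

print(data_file_status(DATA_FILE))
```

If the message says the file is missing, repeat the upload steps above before running the Setup cell.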

## 🛠️ Setup (Run this once)
Press `Shift + Enter` on the cell below to load our tools.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, clear_output
from factor_analyzer import FactorAnalyzer
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Style settings
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load Data
try:
    df = pd.read_csv('synthetic_adobe_data.csv')
    # Preprocess
    cols_to_drop = ['Wall ID', 'Image Name', 'Notes', 'Date', 'Reviewer', 'Wall Rank', 'Section ID']
    df_clean = df.drop(columns=[c for c in cols_to_drop if c in df.columns])
    df_numeric = df_clean.select_dtypes(include=[np.number]).dropna(axis=1, how='all')
    imputer = SimpleImputer(strategy='median')
    df_imputed = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns)
    print("✅ Data Loaded Successfully!")
except FileNotFoundError:
    print("⚠️ Data file not found. Upload 'synthetic_adobe_data.csv' (see the Colab instructions above) or run 'generate_synthetic_data.py' first.")
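
To see what the preprocessing in the Setup cell does, here is a small sketch on a toy table. The column names `Cracking` and `Erosion` and all values are hypothetical. Only the steps (keep numeric columns, fill gaps with the column median) mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy survey: three walls, one missing score, one text column
toy = pd.DataFrame({
    "Wall ID": ["A", "B", "C"],
    "Cracking": [1.0, np.nan, 5.0],
    "Erosion": [2.0, 4.0, 3.0],
})

# Keep only numeric columns (this drops "Wall ID")
toy_numeric = toy.select_dtypes(include=[np.number])

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy_numeric), columns=toy_numeric.columns
)
print(toy_imputed)
```

The missing `Cracking` score for wall B is filled with 3.0, the median of the two observed values (1 and 5). Median imputation keeps every wall in the analysis, at the cost of slightly understating the spread of the imputed columns.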

---
## 🔍 Part 1: Data Explorer (No-Code)

Before analyzing, we must look at our data. 

**Interactive Task:**
1.  Select a variable from the dropdown.
2.  Look at the histogram. Is the data spread out? Clumped together?
3.  *Preservation Insight:* If "Cap Deterioration" is mostly high numbers (4-5), we have a serious roofing problem.
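
If you prefer numbers to pictures, the same questions can be answered with a few summary statistics. This is a sketch on made-up scores, not values from the tutorial dataset. In the notebook you could apply the same calls to a column such as `df_imputed['Total Scr']`.

```python
import pandas as pd

# Hypothetical condition scores on a 1-5 scale (not taken from the tutorial data)
scores = pd.Series([4, 5, 4, 5, 3, 5, 4, 5], name="Cap Deterioration")

summary = {
    "median": scores.median(),
    "share_high": (scores >= 4).mean(),  # fraction of walls scoring 4 or 5
    "skew": scores.skew(),               # negative = piled up at the high end
}
print(summary)
```

A large `share_high` together with a negative skew is the numeric counterpart of a histogram clumped at the high end of the scale.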

In [2]:
# --- INTERACTIVE WIDGET CODE ---
def plot_distribution(column):
    plt.figure(figsize=(8, 4))
    sns.histplot(df_imputed[column], kde=True, color='teal')
    plt.title(f'Distribution of {column}')
    plt.xlabel('Score / Value')
    plt.ylabel('Count of Walls')
    plt.show()

dropdown = widgets.Dropdown(
    options=sorted(df_imputed.columns),
    value='Total Scr',
    description='Variable:',
    disabled=False,
)

widgets.interactive(plot_distribution, column=dropdown)

---
## 🧩 Part 2: Factor Analysis (Low-Code)

**The Concept:** 
Imagine a patient with 30 different symptoms. A doctor groups them under a single diagnosis (e.g., "Flu").
**Factor Analysis** does the same for buildings. It groups 30 damage types into a few "Latent Vulnerabilities."

**Interactive Task:**
1.  Use the slider to change the **Number of Factors**.
2.  Watch the heatmap change.
3.  **Goal:** Find the number where the groups look distinct (dark red or dark blue blocks) and make physical sense.
    *   *Hint: Try 3 factors.*
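
A common rule of thumb for picking a starting point on the slider is the Kaiser criterion: count how many eigenvalues of the correlation matrix are greater than 1. The sketch below is one such heuristic, not necessarily the method used in the manuscript. The helper name `count_factors_kaiser` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def count_factors_kaiser(data: pd.DataFrame) -> int:
    """Count eigenvalues of the correlation matrix that exceed 1."""
    corr = data.corr().to_numpy()
    eigenvalues = np.linalg.eigvalsh(corr)  # symmetric matrix, real eigenvalues
    return int((eigenvalues > 1.0).sum())

# Hypothetical demo: six variables built from two hidden drivers plus noise
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
demo = pd.DataFrame(
    np.hstack([
        hidden[:, [0]] + 0.3 * rng.normal(size=(200, 3)),
        hidden[:, [1]] + 0.3 * rng.normal(size=(200, 3)),
    ]),
    columns=[f"v{i}" for i in range(6)],
)
print(count_factors_kaiser(demo))
```

In the notebook, `count_factors_kaiser(df_imputed)` would give a suggested number of factors. Treat it as a first guess only. The final choice should still make physical sense for the building.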

In [3]:
# --- INTERACTIVE FACTOR ANALYSIS ---
def run_interactive_fa(n_factors=3):
    fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax')
    fa.fit(df_imputed)
    
    # Get loadings
    loadings = pd.DataFrame(
        fa.loadings_, 
        index=df_imputed.columns, 
        columns=[f'Factor {i+1}' for i in range(n_factors)]
    )
    
    # Filter for visibility (hide weak connections)
    significant = loadings[loadings.abs().max(axis=1) > 0.4]
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(significant, annot=True, fmt='.2f', cmap='RdBu_r', center=0, vmin=-1, vmax=1)
    plt.title(f'Factor Analysis with {n_factors} Factors')
    plt.show()

slider_fa = widgets.IntSlider(value=3, min=2, max=6, step=1, description='Factors:')
widgets.interactive(run_interactive_fa, n_factors=slider_fa)

### 🧠 Reflection
If you selected **3 Factors**, you likely saw:
1.  **Factor 1:** Sill Deterioration (Windows)
2.  **Factor 2:** Surface Coating Degradation (Lintels & Plaster)
3.  **Factor 3:** Structural Instability (Cracking & Out-of-Plane)

This tells us we have three main "Diseases" attacking the fort, matching the findings in the manuscript.
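
To turn a loadings table into named groups like the ones above, one simple approach is to assign each variable to the factor on which it loads most strongly. The loadings below are invented for illustration. In the notebook they would come from `fa.loadings_` inside `run_interactive_fa`.

```python
import pandas as pd

# Hypothetical loadings (variable names and values are made up)
loadings = pd.DataFrame(
    {
        "Factor 1": [0.82, 0.75, 0.10, 0.05],
        "Factor 2": [0.12, 0.08, 0.79, -0.66],
    },
    index=["Sill A", "Sill B", "Plaster", "Lintel"],
)

dominant = loadings.abs().idxmax(axis=1)  # factor with the largest |loading|
strength = loadings.abs().max(axis=1)     # size of that loading

assignment = pd.DataFrame({"factor": dominant, "loading": strength})
assignment = assignment[assignment["loading"] > 0.4]  # same cutoff as the heatmap

for factor, members in assignment.groupby("factor"):
    print(factor, "->", list(members.index))
```

Naming each group (for example "Sill Deterioration") remains a judgment call that depends on knowledge of the building, not on the statistics alone.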

---
## 🤖 Part 3: Machine Learning (Low-Code)

**The Concept:**
We want to know: *What causes the MOST damage?*
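We use **Elastic Net Regression**, a linear regression with a penalty that shrinks coefficients toward zero. Compared with ordinary regression, it copes better with small datasets and correlated predictors, and it can set the coefficients of unimportant variables exactly to zero. Keep in mind that a large coefficient shows a strong association with the target, not proof of cause.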

**Interactive Task:**
1.  **Alpha (Regularization):** Controls how simple the model should be. Higher = Simpler (more zeros).
2.  **L1 Ratio:** Mixes two types of penalties. 1.0 = Lasso (removes variables aggressively), 0.0 = Ridge (shrinks all variables).
3.  Observe which features stay at the top. Are they consistent?
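
The effect of Alpha is easy to check on data where the answer is known. The sketch below uses synthetic data in which only the first of five predictors matters. The data and the helper name `count_zero_coefs` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Hypothetical data: only the first of five predictors actually drives y
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

def count_zero_coefs(alpha, l1_ratio=0.5):
    """Fit an Elastic Net and count coefficients shrunk exactly to zero."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
    model.fit(X, y)
    return int((model.coef_ == 0).sum())

for alpha in (0.01, 1.0, 10.0):
    print(f"alpha={alpha}: {count_zero_coefs(alpha)} of 5 coefficients are zero")
```

With a very large Alpha the penalty wins completely and every coefficient becomes zero, including the one that matters. That is why the top of the slider range gives an empty-looking chart.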

In [4]:
# --- INTERACTIVE ELASTIC NET ---
def run_interactive_en(alpha=1.0, l1_ratio=0.5):
    X = df_imputed.drop(columns=['Total Scr'])
    y = df_imputed['Total Scr']
    
    # Scale data (important for Elastic Net)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Fit Model
    en = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42, max_iter=10000)
    en.fit(X_scaled, y)
    
    # Get coefficients
    coefs = pd.Series(en.coef_, index=X.columns)
    
    # Sort by absolute value
    importances = coefs.abs().sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(10, 5))
    sns.barplot(x=importances.values, y=importances.index, palette='viridis')
    plt.title(f'Top Predictors (Alpha={alpha}, L1 Ratio={l1_ratio})')
    plt.xlabel('Coefficient Magnitude (Importance)')
    plt.show()

slider_alpha = widgets.FloatSlider(value=1.0, min=0.1, max=10.0, step=0.1, description='Alpha:')
slider_l1 = widgets.FloatSlider(value=0.5, min=0.0, max=1.0, step=0.1, description='L1 Ratio:')

ui = widgets.HBox([slider_alpha, slider_l1])
out = widgets.interactive_output(run_interactive_en, {'alpha': slider_alpha, 'l1_ratio': slider_l1})
display(ui, out)
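
Moving sliders by hand is good for intuition, but the settings can also be chosen by cross-validation. The sketch below uses scikit-learn's `ElasticNetCV` on synthetic stand-in data. The data and the grid of L1 ratios are assumptions. This is one possible approach, not the procedure used in the manuscript.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled predictors and target
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(120, 6)))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.5 * rng.normal(size=120)

# Try several L1 ratios; candidate alphas are generated automatically
l1_grid = [0.1, 0.5, 0.9]
cv_model = ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=10000, random_state=42)
cv_model.fit(X, y)

print("Chosen alpha:", cv_model.alpha_)
print("Chosen L1 ratio:", cv_model.l1_ratio_)
```

In the notebook, the same call could be fitted on the scaled predictors and `df_imputed['Total Scr']`, and the chosen values compared with the slider positions that looked most stable.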

### 🎓 Under the Hood (For the Code-Curious)

How did we do that? Here is the raw Python code used in the widget above.

```python
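# 1. Separate Predictors (X) and Target (y)
#    Use the cleaned, imputed table: the raw `df` still has text columns and gaps.
X = df_imputed.drop(columns=['Total Scr'])
y = df_imputed['Total Scr']

# 2. Scale the Data (Crucial for Elastic Net)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Create the Model (same settings as the widget's defaults)
model = ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42, max_iter=10000)

# 4. Train the Model
model.fit(X_scaled, y)

# 5. Get Importance Scores (Coefficients), labeled by variable name
scores = pd.Series(model.coef_, index=X.columns)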
```
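
For readers who want the math behind the two sliders, this is the objective that scikit-learn's `ElasticNet` minimizes, as given in its documentation. Here $n$ is the number of walls, $w$ the coefficient vector, $\alpha$ the Alpha slider, and $\rho$ the L1 Ratio slider (the symbol $\rho$ is our shorthand for `l1_ratio`).

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

The first term measures how well the line fits the data. The second (Lasso) term pushes coefficients to exactly zero, and the third (Ridge) term shrinks all coefficients smoothly. Setting $\rho = 1$ leaves only the Lasso penalty and $\rho = 0$ leaves only the Ridge penalty.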

---
## 🙏 Acknowledgements

This material was developed under a grant from the **National Center for Preservation Technology and Training (NCPTT)**.

*Data-Driven Heritage Preservation: Leveraging Machine Learning for Informed Adobe Conservation Strategies*