# Archaeology Quick Start: Artifact Classification

**Duration:** 10-30 minutes  
**Goal:** Classify archaeological artifacts using a simple machine learning model

## What You'll Learn

- Generate and explore synthetic archaeological artifact data
- Prepare features from artifact measurements and material type
- Train a classification model to identify artifact types
- Evaluate model performance with accuracy, a per-class classification report, and a confusion matrix
- Understand how ML can assist archaeological research

## Dataset

We'll use a **synthetic archaeological dataset** with measurement ranges chosen to resemble real artifact patterns:
- Artifact categories: Pottery, Stone Tool, Metal Object, Ornament
- Features: length, width, thickness, weight, material type
- 500 artifacts, split evenly across the four categories (125 each)
- Measurements are drawn from per-category normal distributions, so they mimic survey data without being real finds

No downloads needed - let's get started!

## 1. Setup and Data Generation

In [None]:
# Import libraries (all pre-installed in Colab/Studio Lab)
import warnings
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 11

print("Libraries loaded successfully!")
print(f"Analysis date: {datetime.now().strftime('%Y-%m-%d')}")

In [None]:
# Generate synthetic archaeological artifact data
np.random.seed(42)


def generate_artifact_data(n_samples=500):
    """Generate synthetic artifact measurements based on real patterns"""

    artifacts = []

    # Pottery: typically round, medium-large, clay
    n_pottery = n_samples // 4
    for _ in range(n_pottery):
        artifacts.append(
            {
                "type": "Pottery",
                "length_cm": np.random.normal(15, 4),
                "width_cm": np.random.normal(12, 3),
                "thickness_cm": np.random.normal(0.8, 0.2),
                "weight_g": np.random.normal(250, 80),
                "material": "Clay",
            }
        )

    # Stone Tools: elongated, smaller, stone
    n_stone = n_samples // 4
    for _ in range(n_stone):
        artifacts.append(
            {
                "type": "Stone Tool",
                "length_cm": np.random.normal(8, 2),
                "width_cm": np.random.normal(4, 1),
                "thickness_cm": np.random.normal(1.5, 0.4),
                "weight_g": np.random.normal(150, 50),
                "material": "Stone",
            }
        )

    # Metal Objects: small-medium, heavy, metal
    n_metal = n_samples // 4
    for _ in range(n_metal):
        artifacts.append(
            {
                "type": "Metal Object",
                "length_cm": np.random.normal(10, 3),
                "width_cm": np.random.normal(3, 1),
                "thickness_cm": np.random.normal(0.3, 0.1),
                "weight_g": np.random.normal(180, 60),
                "material": "Metal",
            }
        )

    # Ornaments: small, light, varied materials
    n_ornament = n_samples - (n_pottery + n_stone + n_metal)
    materials = ["Bone", "Shell", "Stone", "Metal"]
    for _ in range(n_ornament):
        artifacts.append(
            {
                "type": "Ornament",
                "length_cm": np.random.normal(3, 1),
                "width_cm": np.random.normal(2, 0.5),
                "thickness_cm": np.random.normal(0.5, 0.2),
                "weight_g": np.random.normal(15, 10),
                "material": np.random.choice(materials),
            }
        )

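    artifacts_df = pd.DataFrame(artifacts)

    # Normal draws are unbounded, so clip measurements to a small positive
    # floor to avoid non-physical values such as negative weights
    measurement_cols = ["length_cm", "width_cm", "thickness_cm", "weight_g"]
    artifacts_df[measurement_cols] = artifacts_df[measurement_cols].clip(lower=0.1)
    return artifacts_df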


# Generate dataset
df = generate_artifact_data(500)

print(f"Generated {len(df)} artifact records")
print(f"\nArtifact types: {df['type'].unique()}")
print("\nArtifact counts:")
print(df["type"].value_counts())
df.head(10)
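
The generator draws every measurement from a normal distribution, which has no lower bound, so it is worth checking for non-physical values before modelling. The sketch below counts non-positive entries per column on raw, unclipped draws and shows one possible remedy, clipping to a small positive floor. The helper name `count_nonpositive`, the toy frame, and the 0.1 floor are assumptions made for illustration; the toy frame reuses the Ornament parameters from the generator above.

```python
import numpy as np
import pandas as pd

MEASUREMENT_COLUMNS = ["length_cm", "width_cm", "thickness_cm", "weight_g"]


def count_nonpositive(frame, columns=MEASUREMENT_COLUMNS):
    """Return, per column, how many rows hold a value <= 0 (hypothetical helper)."""
    return (frame[columns] <= 0).sum()


# Toy stand-in for the notebook's `df`: raw draws with the Ornament parameters
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "length_cm": rng.normal(3, 1, 1000),
        "width_cm": rng.normal(2, 0.5, 1000),
        "thickness_cm": rng.normal(0.5, 0.2, 1000),
        "weight_g": rng.normal(15, 10, 1000),
    }
)

bad_counts = count_nonpositive(toy)
print("Non-positive values per column (raw draws):")
print(bad_counts)

# One possible remedy (an assumption, not the only option): clip to a small floor
cleaned = toy.copy()
cleaned[MEASUREMENT_COLUMNS] = cleaned[MEASUREMENT_COLUMNS].clip(lower=0.1)
print("\nNon-positive values per column (after clipping):")
print(count_nonpositive(cleaned))
```

In the notebook itself, the same check would be `count_nonpositive(df)`.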

## 2. Exploratory Data Analysis

In [None]:
# Summary statistics by artifact type
print("=== Artifact Measurements by Type ===")
print("\nAverage dimensions:")
print(df.groupby("type")[["length_cm", "width_cm", "thickness_cm", "weight_g"]].mean())

print("\nMaterial distribution:")
print(pd.crosstab(df["type"], df["material"]))

In [None]:
# Visualize artifact dimensions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Length distribution
for artifact_type in df["type"].unique():
    data = df[df["type"] == artifact_type]["length_cm"]
    axes[0, 0].hist(data, alpha=0.5, label=artifact_type, bins=20)
axes[0, 0].set_xlabel("Length (cm)")
axes[0, 0].set_ylabel("Frequency")
axes[0, 0].set_title("Artifact Length Distribution")
axes[0, 0].legend()

# Width distribution
for artifact_type in df["type"].unique():
    data = df[df["type"] == artifact_type]["width_cm"]
    axes[0, 1].hist(data, alpha=0.5, label=artifact_type, bins=20)
axes[0, 1].set_xlabel("Width (cm)")
axes[0, 1].set_ylabel("Frequency")
axes[0, 1].set_title("Artifact Width Distribution")
axes[0, 1].legend()

# Thickness distribution
for artifact_type in df["type"].unique():
    data = df[df["type"] == artifact_type]["thickness_cm"]
    axes[1, 0].hist(data, alpha=0.5, label=artifact_type, bins=20)
axes[1, 0].set_xlabel("Thickness (cm)")
axes[1, 0].set_ylabel("Frequency")
axes[1, 0].set_title("Artifact Thickness Distribution")
axes[1, 0].legend()

# Weight distribution
for artifact_type in df["type"].unique():
    data = df[df["type"] == artifact_type]["weight_g"]
    axes[1, 1].hist(data, alpha=0.5, label=artifact_type, bins=20)
axes[1, 1].set_xlabel("Weight (g)")
axes[1, 1].set_ylabel("Frequency")
axes[1, 1].set_title("Artifact Weight Distribution")
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("Different artifact types show distinct measurement patterns!")

In [None]:
# Length vs Width scatter plot (colored by type)
fig, ax = plt.subplots(figsize=(12, 8))

for artifact_type in df["type"].unique():
    data = df[df["type"] == artifact_type]
    ax.scatter(data["length_cm"], data["width_cm"], label=artifact_type, alpha=0.6, s=100)

ax.set_xlabel("Length (cm)", fontsize=12, fontweight="bold")
ax.set_ylabel("Width (cm)", fontsize=12, fontweight="bold")
ax.set_title("Artifact Dimensions: Length vs Width", fontsize=14, fontweight="bold")
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Types form visible clusters, with some overlap between stone tools and metal objects")
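
The scatter plot suggests that shape, not just size, separates the types. One way to make that explicit is to derive ratio features from the raw measurements. The sketch below is illustrative only: the helper `add_shape_features` and the columns `elongation` and `density_proxy` are hypothetical and are not part of the feature set used later in this notebook. The toy rows reuse the per-type means from the generator.

```python
import pandas as pd


def add_shape_features(frame):
    """Return a copy with two derived columns (illustrative, not from the notebook)."""
    out = frame.copy()
    # Elongation: values above 1 mean the artifact is longer than it is wide
    out["elongation"] = out["length_cm"] / out["width_cm"]
    # Crude density proxy: weight over bounding-box volume (g per cm^3)
    volume = out["length_cm"] * out["width_cm"] * out["thickness_cm"]
    out["density_proxy"] = out["weight_g"] / volume
    return out


# Toy rows built from the per-type means used in the generator above
toy = pd.DataFrame(
    {
        "type": ["Pottery", "Stone Tool", "Metal Object", "Ornament"],
        "length_cm": [15.0, 8.0, 10.0, 3.0],
        "width_cm": [12.0, 4.0, 3.0, 2.0],
        "thickness_cm": [0.8, 1.5, 0.3, 0.5],
        "weight_g": [250.0, 150.0, 180.0, 15.0],
    }
)

shaped = add_shape_features(toy)
print(shaped[["type", "elongation", "density_proxy"]].round(2))
```

Ratios like these are only meaningful when the denominators are strictly positive, so any non-physical measurements should be handled first.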

## 3. Prepare Data for Machine Learning

In [None]:
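# Encode material as a numeric feature.
# LabelEncoder assigns integer codes in alphabetical order; tree-based models
# such as Random Forest tolerate this, but the codes carry no real ordering.
from sklearn.preprocessing import LabelEncoder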

le_material = LabelEncoder()
df["material_encoded"] = le_material.fit_transform(df["material"])

# Prepare features (X) and target (y)
feature_columns = ["length_cm", "width_cm", "thickness_cm", "weight_g", "material_encoded"]
X = df[feature_columns]
y = df["type"]

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} artifacts")
print(f"Testing set: {len(X_test)} artifacts")
print(f"\nFeatures used: {feature_columns}")
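
The cell above turns `material` into a single integer column. An alternative worth knowing is one-hot encoding, which creates one 0/1 indicator column per material and so avoids implying that, say, Stone is "greater than" Clay. The sketch below uses `pd.get_dummies` on a small hand-written frame; the toy values are assumptions for illustration, not rows from the generated dataset.

```python
import pandas as pd

# Toy stand-in for a few rows of the notebook's `df`
toy = pd.DataFrame(
    {
        "length_cm": [15.2, 7.9, 10.4, 3.1],
        "material": ["Clay", "Stone", "Metal", "Shell"],
    }
)

# One 0/1 indicator column per material, replacing the single integer code
encoded = pd.get_dummies(toy, columns=["material"], prefix="material", dtype=int)
print(encoded)
```

For a Random Forest the difference is usually small; it matters more for linear or distance-based models, which treat the integer codes as magnitudes.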

## 4. Train Classification Model

In [None]:
# Train Random Forest classifier
print("Training Random Forest classifier...")
model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
model.fit(X_train, y_train)

print("Model training complete!")

# Feature importance
feature_importance = pd.DataFrame(
    {"feature": feature_columns, "importance": model.feature_importances_}
).sort_values("importance", ascending=False)

print("\n=== Feature Importance ===")
print(feature_importance)

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 6))

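# Plot from the sorted importance table so each label is guaranteed to match its value
plot_data = feature_importance.sort_values("importance")
plot_labels = plot_data["feature"].str.replace(r"(_cm|_g|_encoded)$", "", regex=True).str.title()
ax.barh(plot_labels, plot_data["importance"], color="steelblue", alpha=0.8)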
ax.set_xlabel("Importance", fontsize=12, fontweight="bold")
ax.set_title("Feature Importance in Artifact Classification", fontsize=14, fontweight="bold")
ax.grid(True, alpha=0.3, axis="x")

plt.tight_layout()
plt.show()

print("Feature importance shows which measurements are most useful for classification")
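
The values in `feature_importances_` are impurity-based: they are computed from the training data and can overstate features with many distinct values. Permutation importance is a common cross-check: shuffle one column of the held-out set at a time and measure how much accuracy drops. The sketch below uses `sklearn.inspection.permutation_importance` on a toy two-type sample; the toy data and the `noise` column are assumptions added for illustration, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200  # artifacts per type in this toy sample

# Toy stand-in for the notebook's data, plus a deliberately uninformative column
toy = pd.DataFrame(
    {
        "type": ["Pottery"] * n + ["Ornament"] * n,
        "length_cm": np.concatenate([rng.normal(15, 4, n), rng.normal(3, 1, n)]),
        "weight_g": np.concatenate([rng.normal(250, 80, n), rng.normal(15, 10, n)]),
        "noise": rng.normal(0, 1, 2 * n),
    }
)
features = ["length_cm", "weight_g", "noise"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[features], toy["type"], test_size=0.2, random_state=42, stratify=toy["type"]
)

model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
model.fit(X_train, y_train)

# Shuffle one column at a time on the held-out set and record the accuracy drop
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

comparison = pd.DataFrame(
    {
        "feature": features,
        "impurity_importance": model.feature_importances_,
        "permutation_mean": result.importances_mean,
        "permutation_std": result.importances_std,
    }
).sort_values("permutation_mean", ascending=False)
print(comparison.to_string(index=False))
```

When two features carry the same information, shuffling either one alone may barely change accuracy, so low permutation scores should be read alongside the correlations between features.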

## 5. Evaluate Model Performance

In [None]:
# Make predictions on test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("=== Model Performance ===")
print(f"Overall Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
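
A single 80/20 split gives one accuracy figure that depends on which artifacts happened to land in the test set. Stratified k-fold cross-validation gives a sense of how much that figure varies. The sketch below runs 5-fold cross-validation with `cross_val_score`; the three-type toy sample is an assumption that stands in for the notebook's `X` and `y`, reusing the length and width parameters from the generator.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
n = 100  # artifacts per type in this toy sample

# (length mean, length sd, width mean, width sd) per type, as in the generator
params = {
    "Pottery": (15, 4, 12, 3),
    "Stone Tool": (8, 2, 4, 1),
    "Metal Object": (10, 3, 3, 1),
}
frames = [
    pd.DataFrame(
        {
            "type": name,
            "length_cm": rng.normal(l_mu, l_sd, n),
            "width_cm": rng.normal(w_mu, w_sd, n),
        }
    )
    for name, (l_mu, l_sd, w_mu, w_sd) in params.items()
]
toy = pd.concat(frames, ignore_index=True)
X = toy[["length_cm", "width_cm"]]
y = toy["type"]

model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)

# Each fold keeps the same class proportions as the full sample
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", np.round(scores, 3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

With the notebook's own data, the same two lines (`StratifiedKFold` and `cross_val_score`) can be applied to the full `X` and `y` defined in section 3.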

In [None]:
# Confusion matrix (pass labels explicitly so rows and columns match the tick labels)
artifact_types = list(model.classes_)
cm = confusion_matrix(y_test, y_pred, labels=artifact_types)

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=artifact_types,
    yticklabels=artifact_types,
    cbar_kws={"label": "Count"},
    ax=ax,
)

ax.set_xlabel("Predicted Type", fontsize=12, fontweight="bold")
ax.set_ylabel("True Type", fontsize=12, fontweight="bold")
ax.set_title("Confusion Matrix: Artifact Classification", fontsize=14, fontweight="bold")

plt.tight_layout()
plt.show()

print("Confusion matrix shows where the model makes mistakes")
print("Diagonal = correct predictions, off-diagonal = misclassifications")
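
Raw counts are hard to compare when classes differ in size. Dividing each row by its total turns the diagonal into per-class recall. The sketch below uses the `normalize="true"` option of `confusion_matrix`; the label lists are hand-written for illustration and are not output from the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["Metal Object", "Ornament", "Pottery", "Stone Tool"]

# Hand-written labels for illustration only (not output from the notebook's model)
y_true = ["Metal Object"] * 4 + ["Ornament"] * 2 + ["Pottery"] * 2 + ["Stone Tool"] * 4
y_hat = (
    ["Metal Object"] * 3 + ["Stone Tool"]  # one metal object mistaken for a stone tool
    + ["Ornament"] * 2
    + ["Pottery"] * 2
    + ["Stone Tool"] * 3 + ["Metal Object"]  # one stone tool mistaken for a metal object
)

# normalize="true" divides each row by the number of true samples in that class
cm_norm = confusion_matrix(y_true, y_hat, labels=labels, normalize="true")
recall_table = pd.DataFrame(cm_norm, index=labels, columns=labels).round(2)
print(recall_table)
```

In the notebook, the equivalent call would be `confusion_matrix(y_test, y_pred, normalize="true")`, and the result can be passed to `sns.heatmap` with `fmt=".2f"` instead of `fmt="d"`.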

## 6. Example Predictions

In [None]:
# Show some example predictions
n_examples = 10
example_indices = np.random.choice(len(X_test), n_examples, replace=False)

examples = pd.DataFrame(
    {
        "True Type": y_test.iloc[example_indices].values,
        "Predicted Type": y_pred[example_indices],
        "Length (cm)": X_test.iloc[example_indices]["length_cm"].values,
        "Width (cm)": X_test.iloc[example_indices]["width_cm"].values,
        "Weight (g)": X_test.iloc[example_indices]["weight_g"].values,
    }
)

# Add correctness indicator
examples["Correct"] = examples["True Type"] == examples["Predicted Type"]

print("=== Example Predictions ===")
print(examples.to_string(index=False))

correct_predictions = examples["Correct"].sum()
print(f"\n{correct_predictions}/{n_examples} predictions correct in this sample")
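
The table above shows only the predicted label. A Random Forest can also report class probabilities through `predict_proba`, which helps single out artifacts the model is unsure about, for example candidates for manual review. The sketch below is a minimal illustration: the toy two-type sample (stone tools and metal objects, which overlap in length and width) and the 0.7 threshold are assumptions, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 150  # artifacts per type in this toy sample

# Toy stand-in: two types whose length and width distributions overlap
toy = pd.DataFrame(
    {
        "type": ["Stone Tool"] * n + ["Metal Object"] * n,
        "length_cm": np.concatenate([rng.normal(8, 2, n), rng.normal(10, 3, n)]),
        "width_cm": np.concatenate([rng.normal(4, 1, n), rng.normal(3, 1, n)]),
    }
)
X = toy[["length_cm", "width_cm"]]
y = toy["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
model.fit(X_train, y_train)

# Shape (n_test, n_classes); columns follow the order of model.classes_
proba = model.predict_proba(X_test)

review = pd.DataFrame(
    {
        "predicted": model.classes_[proba.argmax(axis=1)],
        "confidence": proba.max(axis=1),
    },
    index=X_test.index,
)

CONFIDENCE_THRESHOLD = 0.7  # arbitrary cut-off chosen for illustration
review["needs_review"] = review["confidence"] < CONFIDENCE_THRESHOLD

print(review.sort_values("confidence").head())
print(f"{review['needs_review'].sum()} of {len(review)} test artifacts flagged for manual review")
```

These probabilities are averages over the trees' votes rather than calibrated likelihoods, so the threshold is best treated as a ranking aid.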

## 7. Summary and Key Findings

In [None]:
# Generate summary report
print("=" * 60)
print("ARCHAEOLOGICAL CLASSIFICATION ANALYSIS SUMMARY")
print("=" * 60)

print(f"\nDataset: {len(df)} artifacts across {len(df['type'].unique())} categories")
print("\nArtifact Types:")
for artifact_type in sorted(df["type"].unique()):
    count = len(df[df["type"] == artifact_type])
    print(f"  • {artifact_type}: {count} artifacts")

print("\nModel Performance:")
print(f"  • Overall Accuracy: {accuracy:.1%}")
print(f"  • Training Set: {len(X_train)} artifacts")
print(f"  • Testing Set: {len(X_test)} artifacts")

print("\nMost Important Features:")
for _i, row in feature_importance.head(3).iterrows():
    feature_name = row["feature"].replace("_", " ").title()
    print(f"  • {feature_name}: {row['importance']:.3f}")

print("\nKey Insights:")
print("  • Machine learning can effectively classify artifacts based on measurements")
print("  • Physical dimensions (length, width, weight) are strong predictors")
print("  • Material type also contributes to accurate classification")
print(f"  • Model achieves {accuracy:.1%} accuracy on unseen test data")

print("\nArchaeological Applications:")
print("  • Automated artifact categorization for large collections")
print("  • Consistent classification across multiple excavation sites")
print("  • Identification of unusual or hybrid artifact forms")
print("  • Support for archaeological typology development")

print("=" * 60)

## What You Learned

In just 10-30 minutes, you:

1. Generated and explored synthetic archaeological artifact data
2. Visualized measurement patterns across artifact types
3. Trained a machine learning classifier using Random Forest
4. Evaluated model performance with multiple metrics
5. Understood how ML can assist archaeological classification
6. Identified the most important features for artifact identification

## Next Steps

### Ready for More?

**Tier 1: SageMaker Studio Lab (4-8 hours, free)**
- Analyze multi-site archaeological data (imagery, LiDAR, geophysical)
- Train deep learning models on artifact images
- Use persistent storage for 10GB+ datasets
- Build ensemble models with saved checkpoints
- Cross-site comparative analysis

**Tier 2: AWS Starter (4-6 hours, $10-25)**
- Store artifact imagery in S3
- Process images with Lambda functions
- Train models with SageMaker
- Set up automated artifact detection pipeline

**Tier 3: Production Infrastructure (5-7 days, $100-500/month)**
- Multi-site datasets (100GB+)
- Distributed processing with SageMaker
- Real-time artifact identification API
- Integration with archaeological databases
- Full CloudFormation deployment

## Learn More

- **Archaeological Data Science:** [Open Context](https://opencontext.org/)
- **Digital Archaeology:** [Archaeological Institute of America](https://www.archaeological.org/)
- **ML in Archaeology:** Recent papers on automated artifact classification

---

**Generated with [Claude Code](https://claude.com/claude-code)**