# Sketchfab Cultural Heritage Models Analysis

This notebook demonstrates how to use the Sketchfab Data API to retrieve and analyze cultural heritage 3D models for discourse analysis.

**Research Purpose**: Examining how modelers employ historical discourses in their cultural heritage models on Sketchfab.

**API Documentation**: https://docs.sketchfab.com/data-api/v3/

## 1. Setup

First, let's install the required dependencies and import the scraper module.

In [None]:
# Install required packages
!pip install requests pandas matplotlib seaborn wordcloud

In [None]:
# Upload the sketchfab_scraper.py file to Colab
# Option 1: Upload manually using the file browser on the left
# Option 2: Clone from GitHub (uncomment below)
# !git clone https://github.com/your-username/km-sf.git
# import sys
# sys.path.append('/content/km-sf')

# For this example, we'll assume the file is in the current directory
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Import the scraper module
from sketchfab_scraper import SketchfabScraper, quick_search

## 2. API Token (Optional)

While many searches work without authentication, having an API token may provide access to more data and higher rate limits.

**To get your API token:**
1. Log in to Sketchfab
2. Go to https://sketchfab.com/settings/password
3. Find your API token in the settings

**Note**: Keep your API token private!
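
One way to keep the token out of the notebook itself is to read it from an environment variable. The sketch below is illustrative only: the variable name `SKETCHFAB_API_TOKEN` and the helper `load_api_token` are assumptions, not part of the scraper module.

```python
import os
from getpass import getpass


def load_api_token(env_var="SKETCHFAB_API_TOKEN", prompt=False):
    """Return the token from the environment, or None if none is available.

    env_var is a hypothetical variable name chosen for this example.
    With prompt=True, fall back to a hidden interactive prompt.
    """
    token = os.environ.get(env_var, "").strip()
    if not token and prompt:
        token = getpass("Sketchfab API token (leave blank to skip): ").strip()
    # An empty string means "no token", which the notebook represents as None
    return token or None


# No prompt here, so this is safe to run non-interactively
API_TOKEN = load_api_token()
```

The resulting `API_TOKEN` can be passed to `SketchfabScraper(api_token=API_TOKEN, ...)` exactly as in the next cell, so the token never appears in a saved or shared copy of the notebook.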

In [None]:
# Set your API token here (optional)
API_TOKEN = None  # Replace with your token: "your_api_token_here"

# Initialize the scraper
# rate_limit_delay: seconds between requests (increase if you get 429 errors)
scraper = SketchfabScraper(api_token=API_TOKEN, rate_limit_delay=1.5)

print("✓ Scraper initialized successfully!")

## 3. Basic Search Examples

Let's start with some basic searches to retrieve cultural heritage models.
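
For orientation, the scraper calls in this section presumably wrap the Data API's search endpoint. The sketch below shows how such a request could be assembled by hand. It is based on the public API documentation rather than on the scraper's source: the helper `build_search_request` is hypothetical, and the parameter names (`type`, `q`, `categories`, `downloadable`, `sort_by`) and the `Token` authorization scheme should be checked against the documentation linked above.

```python
from urllib.parse import urlencode

# Search endpoint as described in the Data API v3 documentation (assumption)
SEARCH_URL = "https://api.sketchfab.com/v3/search"


def build_search_request(query, api_token=None, **filters):
    """Return (url, params, headers) for a model search; nothing is sent."""
    params = {"type": "models", "q": query}
    for name, value in filters.items():
        if value is None:
            continue  # skip filters that were not set
        if isinstance(value, bool):
            # Query strings carry booleans as lowercase text
            value = "true" if value else "false"
        params[name] = value
    headers = {"Authorization": f"Token {api_token}"} if api_token else {}
    return SEARCH_URL, params, headers


url, params, headers = build_search_request(
    "medieval architecture",
    categories="cultural-heritage-history",
    downloadable=True,
    sort_by="-likeCount",
)
print(f"{url}?{urlencode(params)}")
```

The tuple can be handed to `requests.get(url, params=params, headers=headers, timeout=30)`. The scraper module takes care of pagination and rate limiting, so this is only meant to show what a single request looks like.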

In [None]:
# Search for cultural heritage models with a specific query
# This uses the convenience method that automatically filters by cultural heritage category
df_roman = scraper.search_cultural_heritage(
    query="roman",
    max_results=50  # Limit to 50 results for this example
)

print(f"Found {len(df_roman)} Roman cultural heritage models")
print("\nFirst 5 results:")
df_roman[['name', 'user_username', 'viewCount', 'likeCount']].head()

In [None]:
# Search for ancient Egyptian models
df_egyptian = scraper.search_cultural_heritage(
    query="ancient egypt",
    max_results=50
)

print(f"Found {len(df_egyptian)} Ancient Egyptian models")
df_egyptian[['name', 'user_displayName', 'tags']].head()

In [None]:
# Advanced search with multiple filters
models_filtered = scraper.search_models(
    query="medieval architecture",
    categories='cultural-heritage-history',
    downloadable=True,  # Only downloadable models
    sort_by='-likeCount',  # Sort by most liked
    max_results=30
)

df_medieval = scraper.to_dataframe(models_filtered)
print(f"Found {len(df_medieval)} downloadable medieval architecture models")
df_medieval[['name', 'likeCount', 'isDownloadable']].head(10)

## 4. Comprehensive Cultural Heritage Dataset

Let's create a larger dataset for discourse analysis by searching multiple terms.
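
When the same model is returned for several search terms, dropping duplicates discards the information about which terms retrieved it. If that provenance matters for the analysis, it can be recorded while merging. The following is a minimal sketch on plain dictionaries; the helper `merge_with_provenance` and the `matched_terms` field are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def merge_with_provenance(results_by_term):
    """Merge per-term result lists, keeping one record per uid.

    results_by_term maps a search term to a list of dicts that each have
    a 'uid' key. The first record seen for a uid is kept, and every term
    that returned it is listed under 'matched_terms'.
    """
    merged = {}
    matched_terms = defaultdict(list)
    for term, records in results_by_term.items():
        for record in records:
            uid = record["uid"]
            merged.setdefault(uid, record)  # keep the first occurrence
            if term not in matched_terms[uid]:
                matched_terms[uid].append(term)
    return [
        dict(record, matched_terms=", ".join(matched_terms[uid]))
        for uid, record in merged.items()
    ]


toy_results = {
    "ancient": [{"uid": "a1", "name": "Vase"}, {"uid": "b2", "name": "Temple"}],
    "ruins": [{"uid": "b2", "name": "Temple"}, {"uid": "c3", "name": "Wall"}],
}
for row in merge_with_provenance(toy_results):
    print(row["uid"], "<-", row["matched_terms"])
```

With the DataFrames used in this notebook, each `df_temp` could be converted with `df_temp.to_dict("records")` before merging, and the merged list turned back into a DataFrame with `pd.DataFrame(...)`.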

In [None]:
# Define search terms related to cultural heritage
search_terms = [
    "archaeology",
    "ancient",
    "historical",
    "monument",
    "artifact",
    "ruins",
    "heritage",
    "museum"
]

all_models = []

for term in search_terms:
    print(f"Searching for: {term}...")
    df_temp = scraper.search_cultural_heritage(
        query=term,
        max_results=100,
        sort_by='-relevance'
    )
    all_models.append(df_temp)
    print(f"  Found {len(df_temp)} models")

# Combine all results
df_combined = pd.concat(all_models, ignore_index=True)

# Remove duplicates based on uid
df_combined = df_combined.drop_duplicates(subset=['uid'], keep='first')

print(f"\n✓ Total unique models collected: {len(df_combined)}")

## 5. Data Exploration and Analysis

Now let's analyze the collected data to understand patterns in cultural heritage modeling.

In [None]:
# Basic statistics
print("Dataset Overview:")
print("="*50)
print(f"Total models: {len(df_combined)}")
print(f"Unique users: {df_combined['user_username'].nunique()}")
print(f"Downloadable models: {df_combined['isDownloadable'].sum()}")
print("\nEngagement Statistics:")
print(f"Total views: {df_combined['viewCount'].sum():,}")
print(f"Total likes: {df_combined['likeCount'].sum():,}")
print(f"Average views per model: {df_combined['viewCount'].mean():.1f}")
print(f"Average likes per model: {df_combined['likeCount'].mean():.1f}")

In [None]:
# Most popular models
print("Top 10 Most Viewed Models:")
top_viewed = df_combined.nlargest(10, 'viewCount')[['name', 'user_displayName', 'viewCount', 'likeCount']]
display(top_viewed)

In [None]:
# Most prolific creators
creator_counts = df_combined['user_username'].value_counts().head(10)
print("Top 10 Most Prolific Creators:")
display(creator_counts)

## 6. Visualizations

In [None]:
# Distribution of views and likes
import numpy as np  # required for the np.log10 calls in this cell
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Views distribution (log scale)
df_combined[df_combined['viewCount'] > 0]['viewCount'].apply(lambda x: np.log10(x+1)).hist(bins=50, ax=axes[0])
axes[0].set_title('Distribution of Views (log scale)')
axes[0].set_xlabel('log10(View Count + 1)')
axes[0].set_ylabel('Frequency')

# Likes distribution (log scale)
df_combined[df_combined['likeCount'] > 0]['likeCount'].apply(lambda x: np.log10(x+1)).hist(bins=50, ax=axes[1])
axes[1].set_title('Distribution of Likes (log scale)')
axes[1].set_xlabel('log10(Like Count + 1)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Most common categories (if models have multiple categories)
# Split and count all categories
all_categories = df_combined['categories'].str.split(', ').explode()
category_counts = all_categories.value_counts().head(15)

plt.figure(figsize=(12, 6))
category_counts.plot(kind='barh')
plt.title('Top 15 Categories in Cultural Heritage Models')
plt.xlabel('Count')
plt.ylabel('Category')
plt.tight_layout()
plt.show()

In [None]:
# Publication timeline
# Convert publishedAt to datetime
df_combined['publishedAt'] = pd.to_datetime(df_combined['publishedAt'])
df_combined['publishedYear'] = df_combined['publishedAt'].dt.year
df_combined['publishedMonth'] = df_combined['publishedAt'].dt.to_period('M')

# Plot by year
yearly_counts = df_combined['publishedYear'].value_counts().sort_index()

plt.figure(figsize=(12, 5))
yearly_counts.plot(kind='bar')
plt.title('Cultural Heritage Models Published by Year')
plt.xlabel('Year')
plt.ylabel('Number of Models')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 7. Discourse Analysis: Tags and Descriptions

Analyze the language used by modelers to describe their cultural heritage models.

In [None]:
# Most common tags
all_tags = df_combined['tags'].str.split(', ').explode()
tag_counts = all_tags.value_counts().head(30)

print("Top 30 Most Common Tags:")
display(tag_counts)

In [None]:
# Word cloud of tags
from wordcloud import WordCloud

# Combine all tags into a single string
tags_text = ' '.join(all_tags.dropna().astype(str))

# Create word cloud
wordcloud = WordCloud(width=1200, height=600, background_color='white', 
                     colormap='viridis', max_words=100).generate(tags_text)

plt.figure(figsize=(15, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Tags in Cultural Heritage Models', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

In [None]:
# Analyze descriptions for common terms
import re
from collections import Counter

# Combine all descriptions
all_descriptions = ' '.join(df_combined['description'].fillna('').astype(str))

# Extract words (simple tokenization)
words = re.findall(r'\b[a-z]{4,}\b', all_descriptions.lower())

# Common stopwords to exclude
stopwords = {'this', 'that', 'with', 'from', 'have', 'been', 'were', 'their', 
             'there', 'what', 'when', 'where', 'which', 'who', 'will', 'would',
             'could', 'should', 'about', 'these', 'those', 'more', 'than', 'into',
             'such', 'some', 'other', 'them', 'then', 'also', 'only', 'very',
             'much', 'many', 'most', 'made', 'make'}

# Filter stopwords
filtered_words = [w for w in words if w not in stopwords]

# Count most common
word_counts = Counter(filtered_words).most_common(30)

print("Top 30 Words in Model Descriptions:")
for word, count in word_counts:
    print(f"{word:20s}: {count}")

In [None]:
# Analyze specific discourse-related keywords
discourse_keywords = {
    'preservation': ['preserv', 'conserv', 'restor', 'protect'],
    'authenticity': ['authentic', 'original', 'genuine', 'real', 'actual'],
    'reconstruction': ['reconstruct', 'recreat', 'rebuild', 'remodel'],
    'education': ['educat', 'learn', 'teach', 'study', 'research'],
    'technology': ['scan', 'photogrammetry', 'laser', 'digital', '3d'],
    'heritage': ['heritage', 'cultural', 'historic', 'legacy', 'tradition']
}

# Count occurrences
discourse_counts = {}
descriptions_lower = df_combined['description'].fillna('').str.lower()

for category, keywords in discourse_keywords.items():
    count = sum(descriptions_lower.str.contains('|'.join(keywords), regex=True))
    discourse_counts[category] = count

# Plot
plt.figure(figsize=(10, 6))
# Convert the dict views to lists so matplotlib receives plain sequences
plt.bar(list(discourse_counts.keys()), list(discourse_counts.values()))
plt.title('Discourse Themes in Model Descriptions')
plt.xlabel('Theme')
plt.ylabel('Number of Models Mentioning Theme')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nDiscourse Theme Frequency:")
for theme, count in discourse_counts.items():
    percentage = (count / len(df_combined)) * 100
    print(f"{theme:20s}: {count:4d} models ({percentage:.1f}%)")

## 8. License Analysis

Examine licensing practices in cultural heritage models.

In [None]:
# License distribution
license_counts = df_combined['license_label'].value_counts()

print("License Distribution:")
display(license_counts)

# Plot
plt.figure(figsize=(10, 6))
license_counts.plot(kind='bar')
plt.title('Distribution of Licenses in Cultural Heritage Models')
plt.xlabel('License Type')
plt.ylabel('Number of Models')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 9. Export Data

Save the collected data for further analysis.

In [None]:
# Export to CSV
output_filename = f'cultural_heritage_models_{datetime.now().strftime("%Y%m%d")}.csv'
scraper.export_to_csv(df_combined, output_filename)

print(f"✓ Data exported to {output_filename}")
print(f"  Total records: {len(df_combined)}")
print(f"  Columns: {', '.join(df_combined.columns.tolist())}")

In [None]:
# Export a subset with key columns for qualitative analysis
analysis_columns = [
    'uid', 'name', 'description', 'tags', 'categories',
    'user_username', 'user_displayName', 
    'license_label', 'publishedAt',
    'viewCount', 'likeCount', 'commentCount',
    'isDownloadable', 'viewerUrl'
]

df_analysis = df_combined[analysis_columns]
analysis_filename = f'cultural_heritage_analysis_{datetime.now().strftime("%Y%m%d")}.csv'
scraper.export_to_csv(df_analysis, analysis_filename)

print(f"✓ Analysis subset exported to {analysis_filename}")

## 10. Advanced Search Examples

More sophisticated searches for specific research questions.

In [None]:
# Get all models from a specific user
username = "example_user"  # Replace with actual username

# Uncomment to run:
# user_models = scraper.get_user_models(username, max_results=100)
# df_user = scraper.to_dataframe(user_models)
# print(f"User {username} has {len(df_user)} models")
# display(df_user[['name', 'viewCount', 'likeCount']].head())

In [None]:
# Search for models with specific Creative Commons licenses
cc_models = scraper.search_models(
    query="archaeology",
    categories='cultural-heritage-history',
    licenses=['cc0', 'by', 'by-sa'],  # CC0, Attribution, and Attribution-ShareAlike
    downloadable=True,
    max_results=50
)

df_cc = scraper.to_dataframe(cc_models)
print(f"Found {len(df_cc)} openly licensed archaeology models")
print("\nLicense breakdown:")
display(df_cc['license_label'].value_counts())

In [None]:
# Search for models by polygon count (complexity)
# Useful for understanding modeling approaches

low_poly_models = scraper.search_models(
    query="ancient",
    categories='cultural-heritage-history',
    max_face_count=50000,  # Low-poly models
    max_results=30
)

high_poly_models = scraper.search_models(
    query="ancient",
    categories='cultural-heritage-history',
    min_face_count=500000,  # High-poly models
    max_results=30
)

df_low_poly = scraper.to_dataframe(low_poly_models)
df_high_poly = scraper.to_dataframe(high_poly_models)

print(f"Low-poly models (up to 50k faces): {len(df_low_poly)}")
print(f"High-poly models (at least 500k faces): {len(df_high_poly)}")
print("\nAverage engagement comparison:")
print(f"  Low-poly views:  {df_low_poly['viewCount'].mean():.0f}")
print(f"  High-poly views: {df_high_poly['viewCount'].mean():.0f}")

## Notes and Best Practices

### Rate Limiting
- The scraper includes built-in rate limiting (default: 1.5 seconds between requests)
- If you encounter 429 errors, increase the `rate_limit_delay` parameter
- The scraper will automatically retry once after a 60-second wait
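
The retry behaviour described above can be sketched as follows. This is not the scraper's actual implementation, which is not shown in this notebook. `fetch_with_retry` is a hypothetical helper that takes any zero-argument callable returning an object with a `status_code` attribute, such as a `requests` response.

```python
import time


def fetch_with_retry(fetch, max_retries=1, wait_seconds=60.0, sleep=time.sleep):
    """Call fetch() and repeat it while the server answers 429.

    fetch:        zero-argument callable returning a response-like object
    max_retries:  number of additional attempts after the first call
    wait_seconds: pause before each retry
    sleep:        injectable so the wait can be replaced in tests
    """
    response = fetch()
    attempts = 0
    while response.status_code == 429 and attempts < max_retries:
        sleep(wait_seconds)
        response = fetch()
        attempts += 1
    # May still be a 429 if every retry was rate limited
    return response
```

A call might look like `fetch_with_retry(lambda: requests.get(url, params=params, timeout=30))`. The caller still needs to check the returned status code, because the helper gives up after `max_retries` attempts.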

### Data Ethics
- Respect Sketchfab's Terms of Service
- Use data responsibly for research purposes
- Cite creators when using models or data in publications
- Consider privacy implications when analyzing user data
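
One possible way to act on the privacy point is to replace usernames with stable pseudonyms before sharing an analysis file. The sketch below uses a keyed hash from the standard library. The helper `pseudonymize` and the `user_` prefix are assumptions made for illustration, and whether pseudonymization is appropriate at all depends on the study design and on the citation expectations listed above.

```python
import hashlib
import hmac


def pseudonymize(username, secret_key):
    """Map a username to a stable pseudonym using HMAC-SHA256.

    The same username and key always give the same pseudonym, so
    per-creator counts still work. Keep secret_key out of the notebook
    and out of any published material.
    """
    digest = hmac.new(
        secret_key.encode("utf-8"),
        username.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return f"user_{digest[:12]}"


print(pseudonymize("example_user", "replace-with-a-private-key"))
```

Applied to the export, this could look like `df_analysis['user_pseudonym'] = df_analysis['user_username'].map(lambda u: pseudonymize(u, key))`, followed by dropping the `user_username` and `user_displayName` columns.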

### Further Analysis
- Use the exported CSV files for qualitative analysis
- Consider sentiment analysis on descriptions
- Network analysis of creator communities
- Temporal analysis of modeling trends
- Comparison across different cultural heritage categories
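
For qualitative work on the exported descriptions, a keyword-in-context listing is often a useful first step beyond frequency counts. The following is a minimal sketch with the standard library only. The helper `keyword_in_context` is hypothetical, and matching on a word-initial stem is just one possible choice.

```python
import re


def keyword_in_context(texts, stem, width=40):
    """Return (text_index, left, match, right) for each word starting with stem.

    texts: iterable of description strings; non-string entries are skipped
    stem:  word-initial stem, matched case-insensitively
    width: number of context characters kept on each side
    """
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    rows = []
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            continue  # e.g. missing descriptions
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            rows.append((index, left.strip(), match.group(0), right.strip()))
    return rows


sample = [
    "A faithful reconstruction of the Roman forum.",
    None,
    "Reconstructed from laser scans.",
]
for index, left, word, right in keyword_in_context(sample, "reconstruct", width=20):
    print(f"[{index}] {left} >>{word}<< {right}")
```

With the data from this notebook, `texts` could be `df_combined['description'].fillna('').tolist()`, and the stems could be taken from the `discourse_keywords` dictionary defined in section 7.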

### API Limitations
- Public searches may have different limits than authenticated searches
- Some model data may require authentication
- Consider caching results to avoid redundant API calls
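
Caching can be as simple as writing each result set to a JSON file keyed by the search parameters. The sketch below assumes that the results are JSON-serializable (for example, a list of dictionaries). The helper `cached_call` and the cache layout are illustrative and not part of the scraper module.

```python
import hashlib
import json
from pathlib import Path


def cached_call(cache_dir, key, fetch):
    """Return the cached result for key, calling fetch() only on a miss.

    cache_dir: directory for cache files (created if missing)
    key:       JSON-serializable description of the request, e.g. a dict
    fetch:     zero-argument callable that performs the real request
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the sorted key so equal parameters map to the same file
    digest = hashlib.sha256(
        json.dumps(key, sort_keys=True).encode("utf-8")
    ).hexdigest()
    path = cache_dir / f"{digest}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch()
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

A call might look like `cached_call("cache", {"query": "roman", "max_results": 50}, lambda: scraper.search_models(query="roman", max_results=50))`, assuming `search_models` returns plain Python data. Deleting the cache directory forces fresh requests.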

### Citations
When publishing research using this data, please cite:
- Sketchfab as the data source
- Individual creators where appropriate
- The Sketchfab Data API: https://docs.sketchfab.com/data-api/v3/

---

**Happy researching!** 🏛️📊