This system collects and organises data from KnowYourMeme (KYM) entries including memes, people, sites, events, and cultures. It's designed to be resumable, incremental and respectful to KYM's servers.
Timer delays are set to 31 seconds or higher to avoid overloading the website. Please adhere to the robots.txt file when collecting data. These scripts take a long time, but patience is a virtue (and it's not nice to hammer another website's resources).
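The two politeness rules above (a 31-second minimum delay, and honouring robots.txt) can be sketched with the standard library. The helper names below are illustrative, not the actual scripts' functions:

```python
import random
import time
from urllib.robotparser import RobotFileParser

MIN_DELAY = 31  # seconds; keep this at or above 30

def polite_delay(min_delay=MIN_DELAY, jitter=2.0):
    """Sleep for min_delay seconds plus a little random jitter."""
    time.sleep(min_delay + random.uniform(0.0, jitter))

def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against already-fetched robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

Fetch robots.txt once per run, then call `allowed_by_robots` and `polite_delay` before every request.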
Test Run - Verify everything works:
- Edit `01_collect_index.py`: Set `TEST_MODE = True`
- Run `01_collect_index.py` - Collects ~15 entries per category
- Run `03_scrape_entries.py` - Scrapes the actual data
- Run `04_generate_stats.py` - Merges and analyzes results

Full Collection:
- Edit `01_collect_index.py`: Set `TEST_MODE = False`
- Run `01_collect_index.py` - Gets complete index
- Run `03_scrape_entries.py` - Scrapes all entries (skips test ones)
- Run `04_generate_stats.py` - Final merge and statistics
Function: Gets lists of entries from KYM categories
Configuration options:
- `TEST_MODE = True/False` - Quick test (1 page) or full collection
- `CATEGORIES = ['meme', 'people', 'sites', 'events', 'cultures']` - Which to collect
- `STATUS_TYPES = ['confirmed', 'submissions']` - Entry status to include
- `MAX_PAGES = None` - Limit pages (None = all, number = that many)
Output:
- Creates index file in `data/csv/index_updates/`
- Filename: `index_YYYYMMDD_HHMMSS.csv`
- Also creates a summary JSON with statistics
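Writing a timestamped index file like this is straightforward with the standard library. A minimal sketch (the column names are assumptions; the real script may use different ones):

```python
import csv
import os
from datetime import datetime

# Assumed column names -- the actual script's schema may differ.
FIELDS = ["title", "url", "category", "status"]

def write_index(rows, out_dir="data/csv/index_updates"):
    """Write index rows to a timestamped CSV and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"index_{stamp}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```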
Function: Compares existing index with current KYM to find new/removed entries
Configuration options:
- `BASELINE_INDEX = None` - Which index to compare (None = most recent)
- `QUICK_CHECK = True/False` - Fast scan (2 pages) or thorough check
- `CATEGORIES = None` - Specific categories or auto-detect from baseline
Output:
- `new_entries_TIMESTAMP.csv` - Entries added since baseline
- `removed_entries_TIMESTAMP.csv` - Entries no longer on KYM
- Update report JSON with statistics
Use: After a main collection, when you want to check for updates
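The new/removed comparison boils down to set differences over entry URLs. A minimal sketch:

```python
def diff_indexes(baseline_urls, current_urls):
    """Return (new, removed) entry URLs as sorted lists."""
    baseline, current = set(baseline_urls), set(current_urls)
    # New entries appear in the current scan only; removed ones in the
    # baseline only.
    return sorted(current - baseline), sorted(baseline - current)
```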
Function: Takes an index and scrapes the full page data for each entry
Configuration options:
- `INDEX_FILE = None` - Which index to process (None = most recent)
- `ENTRY_TYPE = 'meme'` - Affects output filename prefix
- `RETRY_FAILED = True/False` - Whether to retry failed URLs
Output:
- Batch files in `data/csv/scraped_data/`
- Format: `meme_batch_TIMESTAMP.csv`
- Updates `progress.json` to track completion
Expected time:
- 31+ seconds per entry
Important: Can be interrupted anytime with Ctrl+C - will resume where it left off
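The interrupt-and-resume behaviour can be implemented by persisting progress after every entry, so a restarted run skips URLs that are already done. An illustrative sketch (the progress-file keys and function names are assumptions, not the actual script's internals):

```python
import json
import os

def load_progress(path):
    """Load progress.json, or start fresh if it does not exist yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "failed": []}

def save_progress(progress, path):
    """Persist progress so an interrupted run can resume."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        json.dump(progress, f, indent=2)

def scrape_all(urls, scrape_one, path="data/logs/progress.json"):
    """Scrape each URL at most once, saving progress after every entry."""
    progress = load_progress(path)
    done = set(progress["completed"])
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        try:
            scrape_one(url)
            progress["completed"].append(url)
        except Exception:
            progress["failed"].append(url)
        save_progress(progress, path)  # safe point for Ctrl+C
    return progress
```

Because progress is written after every entry, Ctrl+C loses at most the entry in flight.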
Function: Combines all batch files and generates statistics
No configuration needed - Automatically finds and processes all batch files
Output:
- `all_memes_merged.csv` - Combined dataset
- Statistics reports
- JSON export of data
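Merging the batch files is essentially concatenating CSVs while keeping a single header. A sketch with the standard library (paths and patterns here are assumptions):

```python
import csv
import glob

def merge_batches(pattern, out_path):
    """Concatenate all batch CSVs matching pattern into one file."""
    rows, fieldnames = [], None
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Take the header from the first batch file found.
            fieldnames = fieldnames or reader.fieldnames
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```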
data/
├── csv/
│ ├── index_updates/ # Index files from script 01
│ ├── scraped_data/ # Batch files from script 03
│ └── statistics/ # Analysis from script 04
├── html/
│ ├── confirmed/ # Saved HTML pages by category
│ ├── people/
│ ├── sites/
│ ├── events/
│ └── cultures/
└── logs/
└── progress.json # Tracks what's been scraped
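The tree above can be created idempotently before the first run, for example:

```python
from pathlib import Path

# Subdirectories from the layout above.
SUBDIRS = [
    "csv/index_updates", "csv/scraped_data", "csv/statistics",
    "html/confirmed", "html/people", "html/sites",
    "html/events", "html/cultures", "logs",
]

def create_layout(root="data"):
    """Create the expected tree; safe to call if it already exists."""
    for sub in SUBDIRS:
        Path(root, sub).mkdir(parents=True, exist_ok=True)
```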
- Set `TEST_MODE = True` in script 01
- Run scripts 01 → 03 → 04
- Verify test data looks good
- Set `TEST_MODE = False` in script 01
- Run scripts 01 → 03 → 04 again (will skip test entries)
- Run script 02 to check for updates
- If new entries found, run script 03 on the new entries file
- Run script 04 to merge everything
Run script 01 or 03 again - it automatically continues from where it stopped. Both save their work incrementally, so no progress is lost.
Edit script 01:
CATEGORIES = ['meme', 'people']  # Just these two
STATUS_TYPES = ['confirmed']  # Only confirmed, not submissions
MAX_PAGES = 10  # First 10 pages only

For each entry, the system collects:
- Basic info (title, URL, ID)
- Metadata (status, type, year, origin)
- Statistics (views, comments, favorites)
- Content (about section, references)
- Media (saves images locally)
- Timestamps (when collected)
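The fields above could map onto a record like this (an illustrative sketch; the real CSV columns and their names may differ):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Entry:
    """Illustrative record of one scraped entry; real columns may differ."""
    # Basic info
    title: str
    url: str
    entry_id: str
    # Metadata
    status: str = "confirmed"
    entry_type: str = "meme"
    year: Optional[int] = None
    origin: str = ""
    # Statistics
    views: int = 0
    comments: int = 0
    favorites: int = 0
    # Content
    about: str = ""
    # Timestamps
    collected_at: str = field(default_factory=lambda: datetime.now().isoformat())
```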
The scraper waits 31 seconds between requests to be respectful:
- 10 entries: ~5 minutes
- 100 entries: ~50 minutes
- 1,000 entries: ~8.5 hours
- 10,000 entries: ~3.5 days
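These figures are just `n_entries × 31 seconds`, rounded. A small helper reproduces them (to within rounding of the list above):

```python
def estimate_runtime(n_entries, delay_seconds=31):
    """Rough wall-clock estimate at one request per delay_seconds."""
    total = n_entries * delay_seconds
    if total < 3600:
        return f"~{total / 60:.0f} minutes"
    if total < 86400:
        return f"~{total / 3600:.1f} hours"
    return f"~{total / 86400:.1f} days"
```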
The system tracks everything in progress.json:
- Completed URLs won't be re-scraped
- Failed URLs can be retried
- You can safely stop and restart anytime
- Test first: Always do a test run before full collection
- Be patient: Full collections take time but are resumable
- Check updates periodically: Use script 02 weekly/monthly
- Respect the server: Don't reduce the delay below 30 seconds
All scripts have configuration sections at the top. Common modifications:
Collect only memes:
CATEGORIES = ['meme']

Skip submissions, only confirmed:
STATUS_TYPES = ['confirmed']

Limit collection for testing:
MAX_PAGES = 5  # Just first 5 pages

Change output naming:
OUTPUT_NAME = "my_custom_index" # In script 01
ENTRY_TYPE = "collection_v2" # In script 03