vemchance/kym_scraper
KYM Scraper - Usage Guide

Overview

This system collects and organises data from KnowYourMeme (KYM) entries, including memes, people, sites, events, and cultures. It is designed to be resumable, incremental, and respectful of KYM's servers.

Important Notes

Request delays are set to 31 seconds or higher to avoid overloading the website. Please adhere to KYM's robots.txt file when collecting data. These scripts take a long time to run, but patience is a virtue (and it's not nice to hammer another website's resources).

Quick Start

First Time Setup

  1. Test Run - Verify everything works:

    • Edit 01_collect_index.py: Set TEST_MODE = True
    • Run 01_collect_index.py - Collects ~15 entries per category
    • Run 03_scrape_entries.py - Scrapes the actual data
    • Run 04_generate_stats.py - Merges and analyzes results
  2. Full Collection:

    • Edit 01_collect_index.py: Set TEST_MODE = False
    • Run 01_collect_index.py - Gets complete index
    • Run 03_scrape_entries.py - Scrapes all entries (skips test ones)
    • Run 04_generate_stats.py - Final merge and statistics
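Both passes above run the same three scripts in order. A minimal runner sketch, assuming the scripts sit in the current directory (the `run_pipeline` helper is illustrative, not part of the repo):

```python
import subprocess
import sys

# The scripts to run, in order. Script 02 is only needed later,
# when checking an existing collection for updates.
PIPELINE = ["01_collect_index.py", "03_scrape_entries.py", "04_generate_stats.py"]

def run_pipeline(runner=subprocess.run):
    """Run each pipeline script in order, stopping at the first failure."""
    for script in PIPELINE:
        runner([sys.executable, script], check=True)
```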

The Four Scripts

01_collect_index.py - Collect Entry Lists

Function: Gets lists of entries from KYM categories

Configuration options:

  • TEST_MODE = True/False - Quick test (1 page) or full collection
  • CATEGORIES = ['meme', 'people', 'sites', 'events', 'cultures'] - Which to collect
  • STATUS_TYPES = ['confirmed', 'submissions'] - Entry status to include
  • MAX_PAGES = None - Limit pages (None = all, number = that many)

Output:

  • Creates index file in data/csv/index_updates/
  • Filename: index_YYYYMMDD_HHMMSS.csv
  • Also creates a summary JSON with statistics
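The timestamped naming convention can be reproduced with `datetime.strftime`. A sketch (the `index_filename` helper is illustrative, not taken from the scripts):

```python
from datetime import datetime
from pathlib import Path

def index_filename(base="data/csv/index_updates", now=None):
    # Produces index_YYYYMMDD_HHMMSS.csv, e.g. index_20240101_120000.csv
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"index_{stamp}.csv"
```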

02_check_updates.py - Find New Entries

Function: Compares existing index with current KYM to find new/removed entries

Configuration options:

  • BASELINE_INDEX = None - Which index to compare (None = most recent)
  • QUICK_CHECK = True/False - Fast scan (2 pages) or thorough check
  • CATEGORIES = None - Specific categories or auto-detect from baseline

Output:

  • new_entries_TIMESTAMP.csv - Entries added since baseline
  • removed_entries_TIMESTAMP.csv - Entries no longer on KYM
  • Update report JSON with statistics

When to use: after a main collection, to check for new or removed entries
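The comparison itself amounts to a set difference on entry URLs. A minimal sketch (the `diff_indexes` helper is illustrative):

```python
def diff_indexes(baseline_urls, current_urls):
    """Return (new, removed) entry URLs between two index snapshots."""
    baseline, current = set(baseline_urls), set(current_urls)
    # New entries are in the current index but not the baseline;
    # removed entries are the reverse.
    return sorted(current - baseline), sorted(baseline - current)
```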

03_scrape_entries.py - Collect Full Entry Data

Function: Takes an index and scrapes the full page data for each entry

Configuration options:

  • INDEX_FILE = None - Which index to process (None = most recent)
  • ENTRY_TYPE = 'meme' - Affects output filename prefix
  • RETRY_FAILED = True/False - Whether to retry failed URLs

Output:

  • Batch files in data/csv/scraped_data/
  • Format: meme_batch_TIMESTAMP.csv
  • Updates progress.json to track completion

Expected time:

  • 31+ seconds per entry

Important: Can be interrupted at any time with Ctrl+C; it will resume where it left off on the next run
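The resume behaviour described above can be sketched as a loop over progress.json; `scrape_one` stands in for the actual page fetcher, and the parameter names here are illustrative:

```python
import json
import time
from pathlib import Path

def scrape_all(urls, scrape_one, progress_path="data/logs/progress.json", delay=31):
    """Scrape each URL at most once, persisting progress so a rerun resumes."""
    progress = Path(progress_path)
    completed = set()
    if progress.exists():
        completed = set(json.loads(progress.read_text()).get("completed", []))
    for url in urls:
        if url in completed:
            continue  # already scraped on a previous run
        scrape_one(url)
        completed.add(url)
        # Save after every entry so Ctrl+C loses at most one URL of work.
        progress.parent.mkdir(parents=True, exist_ok=True)
        progress.write_text(json.dumps({"completed": sorted(completed)}))
        time.sleep(delay)  # keep at 31+ seconds per the usage notes
```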

04_generate_stats.py - Merge and Analyze

Function: Combines all batch files and generates statistics

No configuration needed - the script automatically finds and processes all batch files

Output:

  • all_memes_merged.csv - Combined dataset
  • Statistics reports
  • JSON export of data
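The merge step amounts to concatenating the batch CSVs while writing the header only once. A minimal sketch, assuming the data/csv layout used by the scripts (the `merge_batches` helper is illustrative):

```python
import csv
import glob

def merge_batches(pattern="data/csv/scraped_data/*_batch_*.csv",
                  out_path="all_memes_merged.csv"):
    """Concatenate every batch CSV into one file, keeping one header row."""
    header, rows = None, []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if header is None:
                header = file_header
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)  # number of data rows merged
```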

Directory Structure

data/
├── csv/
│   ├── index_updates/       # Index files from script 01
│   ├── scraped_data/        # Batch files from script 03
│   └── statistics/          # Analysis from script 04
├── html/
│   ├── confirmed/           # Saved HTML pages by category
│   ├── people/
│   ├── sites/
│   ├── events/
│   └── cultures/
└── logs/
    └── progress.json        # Tracks what's been scraped
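The tree above can be created up front with pathlib. A convenience sketch (the scripts create their own directories as needed, so this is optional, and `ensure_dirs` is illustrative):

```python
from pathlib import Path

DIRS = [
    "data/csv/index_updates",
    "data/csv/scraped_data",
    "data/csv/statistics",
    "data/html/confirmed",
    "data/html/people",
    "data/html/sites",
    "data/html/events",
    "data/html/cultures",
    "data/logs",
]

def ensure_dirs(root="."):
    """Create the full data/ tree, ignoring directories that already exist."""
    for d in DIRS:
        (Path(root) / d).mkdir(parents=True, exist_ok=True)
```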

Common Workflows

Starting Fresh

  1. Set TEST_MODE = True in script 01
  2. Run scripts 01 → 03 → 04
  3. Verify test data looks good
  4. Set TEST_MODE = False in script 01
  5. Run scripts 01 → 03 → 04 again (will skip test entries)

Updating Existing Collection

  1. Run script 02 to check for updates
  2. If new entries found, run script 03 on the new entries file
  3. Run script 04 to merge everything

Resuming After Interruption

Run script 01 or 03 again; each saves incrementally and automatically continues from where it stopped.

Collecting Specific Categories

Edit script 01:

CATEGORIES = ['meme', 'people']  # Just these two
STATUS_TYPES = ['confirmed']      # Only confirmed, not submissions
MAX_PAGES = 10                    # First 10 pages only

Data Notes

Data Collected

For each entry, the system collects:

  • Basic info (title, URL, ID)
  • Metadata (status, type, year, origin)
  • Statistics (views, comments, favorites)
  • Content (about section, references)
  • Media (saves images locally)
  • Timestamps (when collected)

Processing Speed

The scraper waits 31 seconds between requests to be respectful:

  • 10 entries: ~5 minutes
  • 100 entries: ~50 minutes
  • 1,000 entries: ~8.5 hours
  • 10,000 entries: ~3.5 days
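The figures above are simple arithmetic on the 31-second delay; a sketch (function name is illustrative):

```python
def estimated_hours(n_entries, delay_seconds=31):
    """Rough lower bound on runtime: one delay per entry, ignoring fetch time."""
    return n_entries * delay_seconds / 3600
```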

Skip Logic

The system tracks everything in progress.json:

  • Completed URLs won't be re-scraped
  • Failed URLs can be retried
  • You can safely stop and restart anytime

Tips

  • Test first: Always do a test run before full collection
  • Be patient: Full collections take time but are resumable
  • Check updates periodically: Use script 02 weekly/monthly
  • Respect the server: Don't reduce the delay below 30 seconds

Modifying Behavior

All scripts have configuration sections at the top. Common modifications:

Collect only memes:

CATEGORIES = ['meme']

Skip submissions, only confirmed:

STATUS_TYPES = ['confirmed']

Limit collection for testing:

MAX_PAGES = 5  # Just first 5 pages

Change output naming:

OUTPUT_NAME = "my_custom_index"  # In script 01
ENTRY_TYPE = "collection_v2"     # In script 03

About

KYM webscraper for SemioMeme
