This system collects and organises data from KnowYourMeme (KYM) entries including memes, people, sites, events, and cultures. It's designed to be resumable, incremental and respectful to KYM's servers.
Timer delays are set to 31 seconds or higher to avoid overloading the website. Please adhere to the robots.txt file when collecting data. These scripts take a long time, but patience is a virtue (and it's not nice to hammer another website's resources).
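The two politeness rules above (a 31-second minimum delay, and honouring robots.txt) can be sketched with the standard library. The helper names below are illustrative, not the actual scripts' functions:

```python
import random
import time
from urllib.robotparser import RobotFileParser

MIN_DELAY = 31  # seconds; keep this at or above 30

def polite_delay(min_delay=MIN_DELAY, jitter=2.0):
    """Sleep for min_delay seconds plus a little random jitter."""
    time.sleep(min_delay + random.uniform(0.0, jitter))

def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against already-fetched robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

Fetch robots.txt once per run, then call `allowed_by_robots` and `polite_delay` before every request.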
Test Run - Verify everything works:
- Edit `01_collect_index.py`: Set `TEST_MODE = True`
- Run `01_collect_index.py` - Collects ~15 entries per category
- Run `03_scrape_entries.py` - Scrapes the actual data
- Run `04_generate_stats.py` - Merges and analyzes results

Full Collection:
- Edit `01_collect_index.py`: Set `TEST_MODE = False`
- Run `01_collect_index.py` - Gets complete index
- Run `03_scrape_entries.py` - Scrapes all entries (skips test ones)
- Run `04_generate_stats.py` - Final merge and statistics
Function: Gets lists of entries from KYM categories
Configuration options:
- `TEST_MODE = True/False` - Quick test (1 page) or full collection
- `CATEGORIES = ['meme', 'people', 'sites', 'events', 'cultures']` - Which to collect
- `STATUS_TYPES = ['confirmed', 'submissions']` - Entry status to include
- `MAX_PAGES = None` - Limit pages (None = all, number = that many)
Output:
- Creates index file in `data/csv/index_updates/`
- Filename: `index_YYYYMMDD_HHMMSS.csv`
- Also creates a summary JSON with statistics
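Writing a timestamped index file like this is straightforward with the standard library. A minimal sketch (the column names are assumptions; the real script may use different ones):

```python
import csv
import os
from datetime import datetime

# Assumed column names -- the actual script's schema may differ.
FIELDS = ["title", "url", "category", "status"]

def write_index(rows, out_dir="data/csv/index_updates"):
    """Write index rows to a timestamped CSV and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"index_{stamp}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```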
Function: Compares existing index with current KYM to find new/removed entries
Configuration options:
- `BASELINE_INDEX = None` - Which index to compare (None = most recent)
- `QUICK_CHECK = True/False` - Fast scan (2 pages) or thorough check
- `CATEGORIES = None` - Specific categories or auto-detect from baseline
Output:
- `new_entries_TIMESTAMP.csv` - Entries added since baseline
- `removed_entries_TIMESTAMP.csv` - Entries no longer on KYM
- Update report JSON with statistics
Use: After a main collection, when you want to check for updates
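The new/removed comparison boils down to set differences over entry URLs. A minimal sketch:

```python
def diff_indexes(baseline_urls, current_urls):
    """Return (new, removed) entry URLs as sorted lists."""
    baseline, current = set(baseline_urls), set(current_urls)
    # New entries appear in the current scan only; removed ones in the
    # baseline only.
    return sorted(current - baseline), sorted(baseline - current)
```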
Function: Takes an index and scrapes the full page data for each entry
Configuration options:
- `INDEX_FILE = None` - Which index to process (None = most recent)
- `ENTRY_TYPE = 'meme'` - Affects output filename prefix
- `RETRY_FAILED = True/False` - Whether to retry failed URLs
Output:
- Batch files in `data/csv/scraped_data/`
- Format: `meme_batch_TIMESTAMP.csv`
- Updates `progress.json` to track completion
Expected time:
- 31+ seconds per entry
Important: Can be interrupted anytime with Ctrl+C - will resume where it left off
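The interrupt-and-resume behaviour can be implemented by persisting progress after every entry, so a restarted run skips URLs that are already done. An illustrative sketch (the progress-file keys and function names are assumptions, not the actual script's internals):

```python
import json
import os

def load_progress(path):
    """Load progress.json, or start fresh if it does not exist yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "failed": []}

def save_progress(progress, path):
    """Persist progress so an interrupted run can resume."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        json.dump(progress, f, indent=2)

def scrape_all(urls, scrape_one, path="data/logs/progress.json"):
    """Scrape each URL at most once, saving progress after every entry."""
    progress = load_progress(path)
    done = set(progress["completed"])
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        try:
            scrape_one(url)
            progress["completed"].append(url)
        except Exception:
            progress["failed"].append(url)
        save_progress(progress, path)  # safe point for Ctrl+C
    return progress
```

Because progress is written after every entry, Ctrl+C loses at most the entry in flight.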
Function: Combines all batch files and generates statistics
No configuration needed - Automatically finds and processes all batch files
Output:
- `all_memes_merged.csv` - Combined dataset
- Statistics reports
- JSON export of data
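Merging the batch files is essentially concatenating CSVs while keeping a single header. A sketch with the standard library (paths and patterns here are assumptions):

```python
import csv
import glob

def merge_batches(pattern, out_path):
    """Concatenate all batch CSVs matching pattern into one file."""
    rows, fieldnames = [], None
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Take the header from the first batch file found.
            fieldnames = fieldnames or reader.fieldnames
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```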
data/
├── csv/
│ ├── index_updates/ # Index files from script 01
│ ├── scraped_data/ # Batch files from script 03
│ └── statistics/ # Analysis from script 04
├── html/
│ ├── confirmed/ # Saved HTML pages by category
│ ├── people/
│ ├── sites/
│ ├── events/
│ └── cultures/
└── logs/
└── progress.json # Tracks what's been scraped
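The tree above can be created idempotently before the first run, for example:

```python
from pathlib import Path

# Subdirectories from the layout above.
SUBDIRS = [
    "csv/index_updates", "csv/scraped_data", "csv/statistics",
    "html/confirmed", "html/people", "html/sites",
    "html/events", "html/cultures", "logs",
]

def create_layout(root="data"):
    """Create the expected tree; safe to call if it already exists."""
    for sub in SUBDIRS:
        Path(root, sub).mkdir(parents=True, exist_ok=True)
```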
- Set `TEST_MODE = True` in script 01
- Run scripts 01 → 03 → 04
- Verify test data looks good
- Set `TEST_MODE = False` in script 01
- Run scripts 01 → 03 → 04 again (will skip test entries)
- Run script 02 to check for updates
- If new entries found, run script 03 on the new entries file
- Run script 04 to merge everything
Run script 01 or 03 again - it automatically continues from where it stopped. Both save their work incrementally, so no progress is lost.
Edit script 01:
CATEGORIES = ['meme', 'people']  # Just these two
STATUS_TYPES = ['confirmed']  # Only confirmed, not submissions
MAX_PAGES = 10  # First 10 pages only

For each entry, the system collects:
- Basic info (title, URL, ID)
- Metadata (status, type, year, origin)
- Statistics (views, comments, favorites)
- Content (about section, references)
- Media (saves images locally)
- Timestamps (when collected)
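The fields above could map onto a record like this (an illustrative sketch; the real CSV columns and their names may differ):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Entry:
    """Illustrative record of one scraped entry; real columns may differ."""
    # Basic info
    title: str
    url: str
    entry_id: str
    # Metadata
    status: str = "confirmed"
    entry_type: str = "meme"
    year: Optional[int] = None
    origin: str = ""
    # Statistics
    views: int = 0
    comments: int = 0
    favorites: int = 0
    # Content
    about: str = ""
    # Timestamps
    collected_at: str = field(default_factory=lambda: datetime.now().isoformat())
```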
The scraper waits 31 seconds between requests to be respectful:
- 10 entries: ~5 minutes
- 100 entries: ~50 minutes
- 1,000 entries: ~8.5 hours
- 10,000 entries: ~3.5 days
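These figures are just `n_entries × 31 seconds`, rounded. A small helper reproduces them (to within rounding of the list above):

```python
def estimate_runtime(n_entries, delay_seconds=31):
    """Rough wall-clock estimate at one request per delay_seconds."""
    total = n_entries * delay_seconds
    if total < 3600:
        return f"~{total / 60:.0f} minutes"
    if total < 86400:
        return f"~{total / 3600:.1f} hours"
    return f"~{total / 86400:.1f} days"
```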
The system tracks everything in progress.json:
- Completed URLs won't be re-scraped
- Failed URLs can be retried
- You can safely stop and restart anytime
- Test first: Always do a test run before full collection
- Be patient: Full collections take time but are resumable
- Check updates periodically: Use script 02 weekly/monthly
- Respect the server: Don't reduce the delay below 30 seconds
All scripts have configuration sections at the top. Common modifications:
Collect only memes:
CATEGORIES = ['meme']

Skip submissions, only confirmed:
STATUS_TYPES = ['confirmed']

Limit collection for testing:
MAX_PAGES = 5  # Just first 5 pages

Change output naming:
OUTPUT_NAME = "my_custom_index" # In script 01
ENTRY_TYPE = "collection_v2" # In script 03