This is a fun experiment to make USCIS civics test prep more engaging. The official study guide only lists the accepted answers for each question, so this project adds both multiple-choice and free-text versions of the questions.
We first ingest the 100 questions provided by USCIS.
We use LLMs to augment the data by:
- Adding incorrect choices for multiple choice questions
- Generating helpful hints
- Providing detailed correct answers
This is where LLMs make the pipeline much simpler: instead of validating answers with a large set of custom rules, we rely on an LLM to validate the user's answers, accounting for:
- Typos
- Semantically correct answers
- Alternative valid explanations
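A minimal sketch of how such a validation step could look. The prompt wording, the JSON reply shape, and the names `build_validation_prompt` and `validate_answer` are assumptions for illustration, not taken from this project. The model call is passed in as a plain callable (`llm`), so any client can be plugged in.

```python
import json
from typing import Callable


def build_validation_prompt(question: str, accepted: list[str], user_answer: str) -> str:
    """Assemble a grading prompt. The wording is illustrative only."""
    accepted_lines = "\n".join(f"- {a}" for a in accepted)
    return (
        "You are grading a USCIS civics test answer.\n"
        f"Question: {question}\n"
        f"Accepted answers:\n{accepted_lines}\n"
        f"User answer: {user_answer}\n"
        "Treat typos, paraphrases, and equivalent explanations as correct.\n"
        'Reply with JSON only: {"correct": true or false, "reason": "..."}'
    )


def validate_answer(
    question: str,
    accepted: list[str],
    user_answer: str,
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's text
) -> dict:
    """Ask the model for a verdict and parse the reply defensively."""
    raw = llm(build_validation_prompt(question, accepted, user_answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # An unparseable reply is treated as "not validated", never as correct.
        return {"correct": False, "reason": "unparseable model reply"}
    if not isinstance(verdict, dict):
        return {"correct": False, "reason": "unexpected model reply"}
    return {
        "correct": verdict.get("correct") is True,
        "reason": str(verdict.get("reason", "")),
    }
```

Failing closed on a malformed reply keeps a model glitch from marking a wrong answer as correct.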
The /scripts folder contains Python utilities that prepare and enrich the USCIS test data:
Base data extraction script that:
- Pulls questions from the official USCIS PDF
- Scrapes current questions from the USCIS website
- Merges both sources to create a comprehensive question bank
- Outputs: merged_questions.json
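The merge step could be sketched as follows, assuming each source yields records with `question` and `answers` fields and that questions are matched on normalized text. The field names and the matching rule are assumptions; the actual script may work differently.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace so near-identical questions match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def merge_questions(pdf_questions: list[dict], web_questions: list[dict]) -> list[dict]:
    """Combine both sources, keeping one entry per question and the union of its answers."""
    merged: dict[str, dict] = {}
    for source in (pdf_questions, web_questions):
        for q in source:
            key = normalize(q["question"])
            # The first source to mention a question decides its display text.
            entry = merged.setdefault(key, {"question": q["question"], "answers": []})
            for answer in q["answers"]:
                if answer not in entry["answers"]:
                    entry["answers"].append(answer)
    return list(merged.values())
```

The result can then be written out with `json.dump` to produce a file such as merged_questions.json.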
Enhances questions for multiple choice by:
- Using GPT-4 to generate 3 plausible but incorrect answers per question
- Ensuring the wrong answers are distinct and based on common misconceptions
- Outputs: questions_with_incorrect.json
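A model may return duplicates or accidentally repeat a correct answer, so a post-processing check is one way to enforce distinctness. The sketch below is an assumption about how that could be done; `pick_distractors` is a hypothetical helper, not part of the project.

```python
def pick_distractors(candidates: list[str], correct_answers: list[str], count: int = 3) -> list[str]:
    """Keep the first `count` candidates that are distinct and not a correct answer."""

    def norm(text: str) -> str:
        # Case- and whitespace-insensitive comparison key.
        return " ".join(text.lower().split())

    taken = {norm(a) for a in correct_answers}
    chosen: list[str] = []
    for candidate in candidates:
        key = norm(candidate)
        if not key or key in taken:
            continue  # empty, duplicate, or identical to a correct answer
        taken.add(key)
        chosen.append(candidate.strip())
        if len(chosen) == count:
            return chosen
    # Too few usable candidates: signal the caller to ask the model again.
    raise ValueError(f"only {len(chosen)} usable distractors, need {count}")
```

Raising on a short list lets the calling script retry the generation instead of shipping a question with fewer than three wrong choices.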
Adds learning aids by:
- Generating concise, contextual hints for each question
- Providing historical context without giving away answers
- Keeping hints under 100 characters when possible
- Outputs: questions_with_hints.json
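The hint constraints above can be checked mechanically after generation. This is a rough sketch under the assumption that a hint "gives away" an answer when it contains an accepted answer verbatim; `hint_problems` is a hypothetical helper, and the length limit is reported as a warning because the source treats it as a soft target.

```python
MAX_HINT_LENGTH = 100  # "under 100 characters when possible"


def hint_problems(hint: str, accepted_answers: list[str], max_length: int = MAX_HINT_LENGTH) -> list[str]:
    """Return a list of issues found in a hint; an empty list means it passes."""
    problems: list[str] = []
    if len(hint) >= max_length:
        problems.append("too long")
    lowered = hint.lower()
    for answer in accepted_answers:
        needle = answer.strip().lower()
        # Simple substring test; paraphrased leaks are not detected.
        if needle and needle in lowered:
            problems.append(f"reveals answer: {answer}")
    return problems
```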
Each script processes data in batches of 10 questions and includes retry logic for API calls. The pipeline runs sequentially: extract → add incorrect answers → add hints, with each step building on the previous output.
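The batching and retry behavior could look roughly like this. The batch size of 10 comes from the description above; the three attempts, the exponential backoff, and the names `batched`, `with_retry`, and `process_all` are assumptions, since the source only says that retry logic exists.

```python
import time
from typing import Callable, Iterator


def batched(items: list, size: int = 10) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(func: Callable[[], list], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep):
    """Call `func`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, ... by default


def process_all(questions: list, enrich_batch: Callable[[list], list], batch_size: int = 10) -> list:
    """Run `enrich_batch` (e.g. one API call per batch) over all questions in order."""
    results: list = []
    for batch in batched(questions, batch_size):
        results.extend(with_retry(lambda: enrich_batch(batch)))
    return results
```

Because each step reads the previous step's output file, a failure that exhausts its retries stops the pipeline at that step rather than passing partial data forward.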
Built by @asisbot