A scalable, production-ready tool for extracting and processing OpenStreetMap data with hierarchical geographic structure.
OSM Data Extractor is a comprehensive pipeline for extracting, processing, and structuring geographic data from OpenStreetMap. It currently focuses on Turkey 🇹🇷 as the primary dataset, with plans to expand to additional countries.
The tool extracts:
- Administrative boundaries (country, region, province, district, neighborhood)
- Street networks organized by region
- Points of Interest (POI) including education, healthcare, government facilities, and more
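As a rough illustration of how the administrative levels listed above can nest into a parent-child structure, here is a minimal sketch. The function name `build_hierarchy` and the `id` / `parent_id` fields are assumptions made for this example and are not taken from the project's code; the `type`, `name`, `admin_level`, and `children` keys follow the output format shown in this README.

```python
import json
from typing import Dict, List


def build_hierarchy(records: List[dict]) -> List[dict]:
    """Nest flat boundary records under their parents.

    Each record is assumed to carry `id`, `name`, `type`, `admin_level`,
    and `parent_id` (None for a top-level boundary). These input field
    names are hypothetical.
    """
    # First pass: create one output node per record, keyed by its id.
    nodes: Dict[str, dict] = {}
    for rec in records:
        nodes[rec["id"]] = {
            "type": rec["type"],
            "name": rec["name"],
            "admin_level": rec["admin_level"],
            "children": [],
        }

    # Second pass: attach each node to its parent, or keep it as a root.
    roots: List[dict] = []
    for rec in records:
        parent = nodes.get(rec.get("parent_id"))
        if parent is None:
            roots.append(nodes[rec["id"]])
        else:
            parent["children"].append(nodes[rec["id"]])
    return roots


if __name__ == "__main__":
    flat = [
        {"id": "p1", "name": "İstanbul", "type": "province",
         "admin_level": 4, "parent_id": None},
        {"id": "d1", "name": "Kadıköy", "type": "district",
         "admin_level": 6, "parent_id": "p1"},
    ]
    print(json.dumps(build_hierarchy(flat), ensure_ascii=False, indent=2))
```

Building the nodes in a first pass means the input order does not matter: a district can appear before its province and still be attached correctly.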
- Hierarchical Data Structure: Organized geographic data with proper parent-child relationships
- Retry Logic: Built-in error handling and automatic retries for API requests
- Progress Tracking: Real-time logging and progress monitoring
- UTF-8 Support: Full support for international characters (Turkish, Arabic, etc.)
- Batch Processing: Efficient data extraction with configurable batch sizes
- Modular Design: Easy to extend for additional countries or data types
- Cloud-Ready: Deployable on GCP, AWS, or Azure
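To give a sense of what the retry behaviour mentioned above might look like, here is a minimal sketch of retrying a request with exponential backoff. It is not the project's actual implementation: the helper name `with_retries` and its parameters (`max_attempts`, `base_delay`, `retry_on`, `sleep`) are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    request: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (base, 2*base, 4*base, ...),
            # with up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap a single API request in a zero-argument function, for example `with_retries(lambda: fetch_region("Ankara"), retry_on=(ConnectionError,))`, where `fetch_region` stands in for whatever function performs the request. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.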
- Python 3.8 or higher
- pip package manager
# Clone the repository
git clone https://github.com/youssef509/OSM-Data-Extractor.git
cd OSM-Data-Extractor
# Create virtual environment
python -m venv .venv
# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the complete extraction pipeline
python run_pipeline.py
# Or run individual extractors
python -m src.extractors.extract_administrative
python -m src.extractors.extract_streets
python -m src.extractors.extract_poi
OSM-Data-Extractor/
├── src/
│   ├── extractors/          # Data extraction modules
│   │   ├── extract_administrative.py
│   │   ├── extract_streets.py
│   │   ├── extract_poi.py
│   │   └── extract_turkey.py
│   └── utils/               # Utility functions
│       └── utils.py
├── data/
│   ├── raw/                 # Raw OSM data files
│   └── processed/           # Processed output files
├── config/                  # Deployment configurations
├── tests/                   # Test suite
├── docs/                    # Documentation
├── config.py                # Application configuration
├── run_pipeline.py          # Main entry point
├── requirements.txt         # Python dependencies
└── README.md
{
  "type": "province",
  "name": "İstanbul",
  "admin_level": 4,
  "children": [
    {
      "type": "district",
      "name": "Kadıköy",
      "admin_level": 6
    }
  ]
}
{
  "region": "İstanbul",
  "streets": [
    {
      "name": "Bağdat Caddesi",
      "type": "primary",
      "surface": "asphalt"
    }
  ]
}
Edit config.py to customize:
- Regions: Add or remove geographic regions
- POI Categories: Define custom points of interest
- API Settings: Adjust timeout, retries, and batch sizes
# Example: Add a new region
REGIONS = [
    'İstanbul', 'Ankara', 'İzmir',
    'Your-New-Region'  # Add here
]
The project is designed to be country-agnostic. To add support for a new country:
- Create a new extractor in src/extractors/extract_[country].py
- Update config.py with country-specific regions
- Run the pipeline with your new extractor
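As a starting point for the first step, a new country extractor will likely need an Overpass query scoped to that country. The sketch below only builds the query string and sends nothing; the function name `build_admin_query` and its parameters are hypothetical, and the existing extractors may construct their queries differently.

```python
def build_admin_query(iso_code: str, admin_level: int, timeout: int = 180) -> str:
    """Return an Overpass QL query for administrative boundaries in a country.

    `iso_code` is the ISO 3166-1 alpha-2 code (e.g. "TR", "FR") and
    `admin_level` is the OSM admin_level of the boundaries to fetch.
    """
    code = iso_code.upper()
    # Reject anything that is not a two-letter code so that arbitrary
    # text cannot end up inside the query string.
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"expected a two-letter ISO code, got {iso_code!r}")
    return (
        f"[out:json][timeout:{timeout}];\n"
        # Select the country area by its ISO code.
        f'area["ISO3166-1"="{code}"][admin_level=2]->.country;\n'
        # Find boundary relations of the requested level inside that area.
        f'relation["boundary"="administrative"]'
        f'["admin_level"="{admin_level}"](area.country);\n'
        "out tags;"
    )


if __name__ == "__main__":
    print(build_admin_query("FR", 4))
```

The resulting string could then be handed to an Overpass client such as OverPy. Which `admin_level` corresponds to regions, provinces, or districts differs per country, so that mapping is worth keeping in config.py alongside the region list.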
Coming Soon: France 🇫🇷, Germany 🇩🇪, Spain 🇪🇸, and more!
Google Cloud Platform (GCP)
# Run the GCP setup script
chmod +x config/gcp-setup.sh
./config/gcp-setup.sh
# SSH to VM
gcloud compute ssh --zone=us-central1-a osm-extractor
# Run the pipeline
python run_pipeline.py
See docs/GCP_DEPLOYMENT.md for detailed instructions.
AWS / Azure
Documentation coming soon!
# Run all tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_extraction.py
- Turkey Complete Dataset: ~24-48 hours
- Single Region: ~1-2 hours
- Memory Usage: ~2-4 GB
- Storage: ~250 GB for full Turkey dataset
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Data sourced from OpenStreetMap contributors
- Built with OverPy and osmium
Youssef - @youssef509
Project Link: https://github.com/youssef509/OSM-Data-Extractor
Note: This project currently focuses on Turkey as the primary dataset. Support for additional countries is planned and contributions are welcome!