OSM Data Extractor

A scalable, production-ready tool for extracting and processing OpenStreetMap data with hierarchical geographic structure.

🌍 Overview

OSM Data Extractor is a comprehensive pipeline for extracting, processing, and structuring geographic data from OpenStreetMap. Currently focused on Turkey 🇹🇷 as the primary dataset, with plans to expand to additional countries.

The tool extracts:

  • Administrative boundaries (country, region, province, district, neighborhood)
  • Street networks organized by region
  • Points of Interest (POI) including education, healthcare, government facilities, and more

✨ Features

  • πŸ—ΊοΈ Hierarchical Data Structure: Organized geographic data with proper parent-child relationships
  • πŸ”„ Retry Logic: Built-in error handling and automatic retries for API requests
  • πŸ“Š Progress Tracking: Real-time logging and progress monitoring
  • 🌐 UTF-8 Support: Full support for international characters (Turkish, Arabic, etc.)
  • ⚑ Batch Processing: Efficient data extraction with configurable batch sizes
  • πŸ”§ Modular Design: Easy to extend for additional countries or data types
  • ☁️ Cloud-Ready: Deployable on GCP, AWS, or Azure
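
The retry behavior listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: the helper name `with_retries`, its parameters, and the exponential backoff schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying failed attempts with exponential backoff.

    Hypothetical helper: the delay doubles after each failure
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            # Give up once the retry budget is exhausted.
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # Keeps static type checkers satisfied.
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A real extractor would probably catch only network-related exceptions rather than every `Exception`.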

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Installation

# Clone the repository
git clone https://github.com/youssef509/OSM-Data-Extractor.git
cd OSM-Data-Extractor

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Basic Usage

# Run the complete extraction pipeline
python run_pipeline.py

# Or run individual extractors
python -m src.extractors.extract_administrative
python -m src.extractors.extract_streets
python -m src.extractors.extract_poi

πŸ“ Project Structure

OSM-Data-Extractor/
├── src/
│   ├── extractors/           # Data extraction modules
│   │   ├── extract_administrative.py
│   │   ├── extract_streets.py
│   │   ├── extract_poi.py
│   │   └── extract_turkey.py
│   └── utils/                # Utility functions
│       └── utils.py
├── data/
│   ├── raw/                  # Raw OSM data files
│   └── processed/            # Processed output files
├── config/                   # Deployment configurations
├── tests/                    # Test suite
├── docs/                     # Documentation
├── config.py                 # Application configuration
├── run_pipeline.py           # Main entry point
├── requirements.txt          # Python dependencies
└── README.md

📊 Output Format

Administrative Boundaries

{
  "type": "province",
  "name": "İstanbul",
  "admin_level": 4,
  "children": [
    {
      "type": "district",
      "name": "Kadıköy",
      "admin_level": 6
    }
  ]
}
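
As an illustration of how this nested structure can be consumed, here is a minimal sketch that walks a record depth-first. It assumes only the `type`, `name`, and `children` keys shown above; the function name `walk` is hypothetical and not part of the project.

```python
from typing import Iterator, Tuple


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, type, name) for a node and all of its descendants."""
    yield depth, node["type"], node["name"]
    # Leaf nodes (e.g. districts without neighborhoods) may omit "children".
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


province = {
    "type": "province",
    "name": "İstanbul",
    "admin_level": 4,
    "children": [
        {"type": "district", "name": "Kadıköy", "admin_level": 6},
    ],
}

for depth, kind, name in walk(province):
    # Indent each entry by its level in the hierarchy.
    print("  " * depth + f"{kind}: {name}")
```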

Streets

{
  "region": "İstanbul",
  "streets": [
    {
      "name": "Bağdat Caddesi",
      "type": "primary",
      "surface": "asphalt"
    }
  ]
}
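
A street record in this shape can be summarized with the standard library alone. The sketch below assumes only the `region`, `streets`, and `type` keys from the example above; `count_by_type` is a hypothetical helper name.

```python
import json
from collections import Counter

# Sample record in the format shown above.
raw = """
{
  "region": "İstanbul",
  "streets": [
    {"name": "Bağdat Caddesi", "type": "primary", "surface": "asphalt"}
  ]
}
"""


def count_by_type(record: dict) -> Counter:
    """Count the streets in one region record, grouped by their "type" value."""
    return Counter(
        street.get("type", "unknown") for street in record.get("streets", [])
    )


record = json.loads(raw)
print(record["region"], dict(count_by_type(record)))
```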

βš™οΈ Configuration

Edit config.py to customize:

  • Regions: Add or remove geographic regions
  • POI Categories: Define custom points of interest
  • API Settings: Adjust timeout, retries, and batch sizes

# Example: Add a new region
REGIONS = [
    'İstanbul', 'Ankara', 'İzmir',
    'Your-New-Region'  # Add here
]
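
To show how a batch size setting might interact with the region list, here is a minimal sketch. `BATCH_SIZE` and `batched` are assumed names for illustration; the actual setting names in config.py may differ.

```python
from typing import Iterator, List, Sequence

# Hypothetical settings; the real names in config.py may differ.
REGIONS = ["İstanbul", "Ankara", "İzmir"]
BATCH_SIZE = 2


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


for batch in batched(REGIONS, BATCH_SIZE):
    # Each batch would be handed to an extractor in turn.
    print(batch)
```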

🌐 Expanding to Other Countries

The project is designed to be country-agnostic. To add support for a new country:

  1. Create a new extractor in src/extractors/extract_[country].py
  2. Update config.py with country-specific regions
  3. Run the pipeline with your new extractor
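
As a rough idea of what step 1 might involve, the sketch below builds an Overpass QL query for a country's administrative boundaries. The README does not say which OSM API the extractors call, so the use of Overpass, the function name `build_admin_query`, and its parameters are all assumptions.

```python
def build_admin_query(iso_code: str, admin_level: int, timeout: int = 180) -> str:
    """Build an Overpass QL query for administrative boundary relations.

    Hypothetical helper: iso_code is an ISO 3166-1 alpha-2 code such as "FR";
    admin_level follows OSM tagging (2 = country; lower tiers vary by country).
    """
    return (
        f"[out:json][timeout:{timeout}];\n"
        # Select the country area by its ISO code.
        f'area["ISO3166-1"="{iso_code}"][admin_level=2]->.country;\n'
        # Find boundary relations of the requested level inside that area.
        f'relation(area.country)["boundary"="administrative"]'
        f'["admin_level"="{admin_level}"];\n'
        "out tags;"
    )


print(build_admin_query("FR", 4))
```

The admin_level values used for regions, provinces, and districts differ between countries, so a new extractor would need its own mapping.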

Coming Soon: France 🇫🇷, Germany 🇩🇪, Spain 🇪🇸, and more!

☁️ Cloud Deployment

Google Cloud Platform (GCP)

# Run the GCP setup script
chmod +x config/gcp-setup.sh
./config/gcp-setup.sh

# SSH to VM
gcloud compute ssh --zone=us-central1-a osm-extractor

# Run the pipeline
python run_pipeline.py

See docs/GCP_DEPLOYMENT.md for detailed instructions.

AWS / Azure

Documentation coming soon!

🧪 Testing

# Run all tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_extraction.py

📈 Performance

  • Turkey Complete Dataset: ~24-48 hours
  • Single Region: ~1-2 hours
  • Memory Usage: ~2-4 GB
  • Storage: ~250 GB for full Turkey dataset

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

📧 Contact

Youssef - @youssef509

Project Link: https://github.com/youssef509/OSM-Data-Extractor


Note: This project currently focuses on Turkey as the primary dataset. Support for additional countries is planned and contributions are welcome!
