Feature/enhanced processing and storage #138

Open: crystalnet wants to merge 9 commits into base: master

Conversation

crystalnet

Hey,

I've been working quite extensively on Fredy because I think it's a great idea!

Here is what I added:

🚀 Enhanced Real Estate Listing System with AI-Powered Processing

📋 Summary

This PR introduces a comprehensive enhancement to the real estate listing system, adding AI-powered listing processing, waypoint calculations, improved error handling, and a new dashboard interface. The changes transform the basic listing scraper into a sophisticated real estate analysis platform.

🎯 Key Features

🤖 GenAI-Enhanced Listing Processing

  • ChatGPT Integration: Added intelligent extraction of custom fields from listing content
  • Custom Fields System: Configurable field extraction with natural language prompts
  • Enhanced Storage: New enhancedListingsStorage with schema validation and enforcement
  • Robust Error Handling: Graceful fallbacks when AI processing fails

🗺️ Waypoint Calculator

  • Travel Time Analysis: Calculate travel times to important locations (work, gym, etc.)
  • Multiple Transport Modes: Support for transit, driving, walking, and cycling
  • Google Maps Integration: Real-time travel data via Google Maps API

📊 Dashboard & UI Enhancements

  • New Dashboard: Comprehensive listing overview with filtering and sorting
  • Enhanced Job Management: Custom fields and waypoints configuration in job creation
  • Improved Navigation: Updated menu structure and routing

🔧 Technical Improvements

Error Handling & Logging

  • Centralized Logging: New logger.js utility with Winston integration
  • Defensive Programming: Robust error handling throughout the extraction pipeline
  • Schema Validation: Ensures database consistency for enhancedListings
  • Graceful Degradation: System continues working even when external APIs fail

Code Quality

  • Type Consistency: Converted all IDs to strings for consistency
  • Modular Architecture: Separated concerns with dedicated extractors and storage
  • Comprehensive Testing: Added extensive test coverage for new features
  • VSCode Configuration: Improved development experience with launch configurations

Performance & Reliability

  • Sequential Processing: Prevents overwhelming target servers
  • Persistent Storage: Enhanced listings stored with proper schema validation
  • Memory Management: Proper cleanup and resource management

📁 Files Changed

Core System (59 files changed, +1053/-325 lines)

  • lib/FredyRuntime.js - Enhanced with AI processing and waypoint calculation
  • lib/services/extractor/ - New extraction pipeline with ChatGPT integration
  • lib/services/storage/enhancedListingsStorage.js - New storage system with schema validation
  • lib/services/waypoint-calculator/ - New travel time calculation service

UI Components

  • ui/src/views/dashboard/ - New dashboard interface
  • ui/src/views/jobs/mutation/ - Enhanced job configuration with custom fields and waypoints
  • ui/src/services/rematch/models/ - Updated state management

Testing & Configuration

  • Comprehensive test suite for new features
  • VSCode launch configurations for debugging
  • Updated package dependencies

🧪 Testing

  • Unit Tests: Enhanced listings storage, waypoint calculator, ChatGPT integration
  • Integration Tests: Full end-to-end workflow testing
  • Error Handling: Tested fallback scenarios and error recovery
  • UI Testing: Dashboard and job configuration interfaces
  • Performance Testing: Sequential processing and delay mechanisms

Breaking Changes

None - This is a feature addition that maintains backward compatibility with existing jobs and configurations.

Migration Notes

  • Existing jobs will continue to work without modification
  • New features (custom fields, waypoints) are opt-in
  • Enhanced listings are stored separately from basic listings
  • No database migration required

📈 Impact

  • Enhanced Data Quality: AI-powered extraction provides richer listing information
  • Better User Experience: Dashboard provides comprehensive overview of listings
  • Improved Reliability: Robust error handling and bot detection prevention
  • Scalability: Modular architecture supports future enhancements

🔮 Future Considerations

  • Consider rate limiting for ChatGPT API usage
  • Monitor Google Maps API usage and costs
  • Evaluate performance impact of sequential processing
  • Consider caching mechanisms for waypoint calculations

Total Changes: 74 files, +3,436 insertions, -7,285 deletions (net -3,849 lines, mostly due to yarn.lock removal)

This PR represents a significant evolution of the real estate listing system, transforming it from a basic scraper into a comprehensive analysis platform with AI capabilities and travel insights.

🏠 Enhanced Real Estate Listing System - Detailed Feature Architecture

1. Custom Fields System

Overview

The custom fields system allows users to define specific attributes they want to extract from real estate listings using natural language processing. This transforms basic listing data into rich, structured information tailored to individual preferences.

Architecture

User Input → Job Configuration → ChatGPT Prompt → AI Extraction → Structured Data

How It Works

  1. User Configuration: Users define custom fields in the job creation interface with:

    • Field Name: Human-readable identifier (e.g., "Price per Square Meter")
    • Question Prompt: Natural language question for ChatGPT (e.g., "What is the price per square meter?")
    • Answer Length: Expected response format (one_word, one_statement, several_sentences)
  2. AI Processing Pipeline:

    // Example custom field configuration
    {
      id: "price_per_sqm",
      name: "Price per Square Meter", 
      questionPrompt: "What is the price per square meter?",
      answerLength: "one_word"
    }
  3. ChatGPT Integration:

    • System generates structured prompts from user questions
    • ChatGPT analyzes listing content and extracts specific values
    • Responses are validated and mapped to field IDs
    • Fallback to empty strings if extraction fails
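The prompt generation and response mapping described above could be sketched roughly as follows. This is an illustration only: `buildPrompt`, `mapAnswers`, `LENGTH_HINTS`, and the idea of asking the model for a JSON reply keyed by field ID are assumptions, not code taken from this PR. The field shape matches the example configuration.

```javascript
// Hypothetical helpers; names and the JSON reply format are assumptions.
const LENGTH_HINTS = {
  one_word: 'Answer with a single word or number.',
  one_statement: 'Answer with one short statement.',
  several_sentences: 'Answer in a few sentences.',
};

// Build one prompt that asks for a JSON object keyed by field id.
function buildPrompt(customFields, listingText) {
  const questions = customFields
    .map((f) => `- "${f.id}": ${f.questionPrompt} ${LENGTH_HINTS[f.answerLength] ?? ''}`.trim())
    .join('\n');
  return [
    'Answer the questions below using only the listing text.',
    'Reply with a JSON object whose keys are the quoted ids.',
    questions,
    'Listing:',
    listingText,
  ].join('\n');
}

// Map the model's raw reply to field ids; any failure falls back to empty strings.
function mapAnswers(customFields, rawReply) {
  let parsed = {};
  try {
    const candidate = JSON.parse(rawReply);
    if (candidate && typeof candidate === 'object' && !Array.isArray(candidate)) {
      parsed = candidate;
    }
  } catch {
    // Unparseable reply: keep the empty object so every field defaults to ''.
  }
  return Object.fromEntries(
    customFields.map((f) => [f.id, parsed[f.id] == null ? '' : String(parsed[f.id])]),
  );
}
```

Because `mapAnswers` always returns one entry per configured field, a malformed or partial reply still yields a listing with a consistent set of keys.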

Functionality

  • Dynamic Field Creation: No code changes needed for new field types
  • Natural Language Processing: Uses AI to understand context and extract precise values
  • Validation & Fallbacks: Graceful handling of extraction failures
  • Schema Enforcement: Ensures all listings have consistent field structure

Security

  • API Key Management: ChatGPT API keys are provided by users and stored securely in backend configuration
  • No Data Persistence: API keys are not stored in listings or transmitted to frontend
  • Rate Limiting: Built-in delays prevent API abuse

2. Waypoints System

Overview

The waypoints system calculates travel times and distances from listings to user-defined important locations, providing crucial insights for location-based decision making.

Architecture

User Waypoints → Google Maps API → Travel Calculations → Enhanced Listings

How It Works

  1. Waypoint Configuration: Users define locations with:

    • Name: Human-readable identifier (e.g., "Work", "Gym")
    • Address: Physical location for geocoding
    • Transport Mode: Preferred travel method (transit, driving, walking, bicycling)
  2. Google Maps Integration:

    // Example waypoint configuration
    {
      id: "work",
      name: "Work",
      location: "Alexanderplatz 1, Berlin",
      transportMode: "transit"
    }
  3. Calculation Process:

    • Geocodes listing address and waypoint addresses
    • Queries Google Maps Distance Matrix API
    • Calculates travel time and distance for each waypoint
    • Stores results as travelTime_work, travelDistance_work fields
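As a rough sketch of the last step, the snippet below turns Distance Matrix result elements into the `travelTime_*` and `travelDistance_*` fields. The helper names are hypothetical, and storing seconds and meters (the `duration.value` and `distance.value` members of a Distance Matrix element) with `null` for failed lookups is an assumption about the stored units, not something the PR states.

```javascript
// Hypothetical helper: turns one Distance Matrix element into listing fields.
// Field naming follows the travelTime_<id> / travelDistance_<id> convention;
// storing seconds and meters, and null on failure, is an assumption.
function toWaypointFields(waypoint, element) {
  const ok = Boolean(element) && element.status === 'OK';
  return {
    [`travelTime_${waypoint.id}`]: ok ? element.duration.value : null,
    [`travelDistance_${waypoint.id}`]: ok ? element.distance.value : null,
  };
}

// Merge the fields of all waypoints; elements[i] belongs to waypoints[i].
function mergeWaypointFields(waypoints, elements) {
  return waypoints.reduce(
    (fields, waypoint, i) => ({ ...fields, ...toWaypointFields(waypoint, elements[i]) }),
    {},
  );
}
```

An element whose status is not `OK` (for example `ZERO_RESULTS`) produces `null` values, so the listing keeps its full set of waypoint columns even when a route cannot be found.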

Functionality

  • Multi-Modal Transport: Support for public transit, car, walking, cycling
  • Real-Time Data: Live travel information from Google Maps
  • Batch Processing: Efficient calculation for multiple waypoints
  • Error Handling: Graceful fallbacks when API calls fail

Security

  • API Key Management: Google Maps API keys provided by users and stored securely
  • No Key Exposure: API keys never transmitted to frontend or stored in listings
  • Usage Monitoring: Built-in logging for API usage tracking

3. Enhanced Listings Processing

Overview

Enhanced listings represent a complete transformation of basic search results into rich, AI-analyzed data with travel insights, providing comprehensive information for informed decision making.

Architecture

Search Results → Expose Fetching → Content Extraction → AI Processing → Travel Calculation → Storage

How It Works

  1. Initial Search: Standard listing discovery via search pages

  2. Expose Fetching:

    • Navigates to individual listing pages
    • Extracts full listing content (not just search snippets)
    • Handles different provider formats (HTML, JSON APIs)
  3. Content Processing:

    // Sequential processing with bot detection prevention
    for (const listing of listings) {
      // Random delay between 2000 and 7000 ms
      await delay(2000 + Math.random() * 5000);
      const exposeContent = await fetchExpose(listing.url);
      const enhancedData = await processWithAI(exposeContent);
      const waypointData = await calculateWaypoints(listing.address);
      await storeEnhancedListing({...listing, ...enhancedData, ...waypointData});
    }
  4. AI Enhancement:

    • Extracts custom fields using ChatGPT
    • Processes natural language content
    • Validates and structures responses
  5. Travel Calculation:

    • Calculates distances to all configured waypoints
    • Provides travel times for different transport modes
    • Handles API failures gracefully
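The sequential loop with random delays and per-listing error isolation might be structured like the sketch below. `processSequentially`, `randomDelayMs`, and the injectable `wait` option are hypothetical names introduced here for illustration; `enhance` stands in for whatever combination of expose fetching, AI processing, and waypoint calculation the pipeline runs.

```javascript
// Hypothetical sketch; function names and defaults are assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Uniform random integer in [minMs, maxMs].
function randomDelayMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Process listings one at a time; a failing listing is recorded and skipped
// so that the remaining listings are still processed.
async function processSequentially(
  listings,
  enhance,
  { minMs = 2000, maxMs = 7000, wait = sleep } = {},
) {
  const enhanced = [];
  const failed = [];
  for (const listing of listings) {
    await wait(randomDelayMs(minMs, maxMs));
    try {
      enhanced.push({ ...listing, ...(await enhance(listing)) });
    } catch (error) {
      failed.push({ id: listing.id, reason: error.message });
    }
  }
  return { enhanced, failed };
}
```

With these defaults the average wait is 4.5 seconds per listing, so a run over 100 listings spends about 7.5 minutes in delays alone, before any fetching or API time.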

Functionality

  • Comprehensive Data: Combines basic listing info with AI insights and travel data
  • Sequential Processing: Prevents bot detection with intelligent delays
  • Error Resilience: Continues processing even when individual listings fail
  • Real-Time Updates: Fresh data on each processing run

4. Enhanced Listings Storage

Overview

A sophisticated storage system designed to handle the complex, schema-enforced data structure of enhanced listings with proper validation and efficient retrieval.

Architecture

Enhanced Listings → Schema Validation → Object Storage → JSON Files → UI Retrieval

How It Works

  1. Schema Management:

    // Dynamic schema generation from job configuration
    const schema = [
      ...basicFields,           // id, title, price, size, link, date_found, details
      ...customFieldColumns,    // User-defined custom fields
      ...waypointColumns        // travelTime_*, travelDistance_* fields
    ];
  2. Storage Structure:

    • Object-based Storage: Listings stored as objects keyed by ID for fast access
    • Schema Enforcement: All listings must conform to defined schema
    • Automatic Backfilling: New fields added to existing listings with defaults
    • Deduplication: Overwrites existing listings with same ID
  3. Data Validation:

    // Ensures all required fields exist
    function validateWithSchema(listing, schema) {
      return schema.every(col => Object.prototype.hasOwnProperty.call(listing, col.id));
    }
  4. File Organization:

    • One JSON file per job: db/enhanced-listings/{jobId}.json
    • Contains both listings object and schema definition
    • Automatic directory creation and file management
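Backfilling and deduplication on an ID-keyed store could look roughly like this. The sketch is not taken from the PR: `backfill`, `upsertListings`, and the optional `default` property on schema columns are assumptions made for illustration, and file I/O is left out.

```javascript
// Hypothetical sketch; the column shape ({ id, default }) is an assumption.
// Adds every schema field that is missing from the listing, using the
// column default or an empty string.
function backfill(listing, schema) {
  const filled = { ...listing };
  for (const col of schema) {
    if (!Object.prototype.hasOwnProperty.call(filled, col.id)) {
      filled[col.id] = 'default' in col ? col.default : '';
    }
  }
  return filled;
}

// Upsert into an object keyed by listing id; a repeated id overwrites the old entry.
function upsertListings(store, newListings, schema) {
  const listings = {};
  for (const [id, existing] of Object.entries(store.listings ?? {})) {
    listings[id] = backfill(existing, schema);
  }
  for (const listing of newListings) {
    listings[String(listing.id)] = backfill(listing, schema);
  }
  return { schema, listings };
}
```

The returned object holds both the schema and the listings, mirroring the one-file-per-job layout described above, and could be serialized with `JSON.stringify` before writing.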

Functionality

  • Schema Evolution: Supports adding/removing fields without data loss
  • Efficient Retrieval: Object storage enables fast lookups by ID
  • Data Integrity: Validation ensures consistent data structure
  • Scalability: File-per-job organization supports large datasets

5. Dashboard Interface

Overview

A comprehensive table-based interface that transforms enhanced listing data into actionable insights, enabling users to compare properties and make informed decisions.

Architecture

Enhanced Listings → Dashboard API → React Table → Filtered/Sorted View

How It Works

  1. Data Retrieval:

    // Fetches enhanced listings for specific job
    const enhancedListings = await getEnhancedListings(jobId);
    const schema = await getSchema(jobId);
  2. Dynamic Column Generation:

    • Generates table columns from schema definition
    • Supports different column types (basic, custom, waypoint)
    • Configurable visibility and sorting
  3. Advanced Filtering:

    • Text search across all fields
    • Numeric range filters for prices, sizes, travel times
    • Multi-select filters for categorical data
  4. Interactive Features:

    // Example table features
    - Sortable columns (price, size, travel time)
    - Filterable data (price range, location)
    - Export functionality
    - Direct links to original listings
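Column generation, range filtering, and sorting as described above might be sketched as plain functions like these. All names and shapes (`buildColumns`, `filterByRange`, `sortBy`, the `type` property on schema columns) are hypothetical and independent of whichever React table component the dashboard actually uses.

```javascript
// Hypothetical helpers; column and filter shapes are assumptions.
function buildColumns(schema) {
  return schema.map((col) => ({
    key: col.id,
    title: col.name ?? col.id,
    type: col.type ?? 'basic', // 'basic' | 'custom' | 'waypoint'
    sortable: true,
  }));
}

// Keep rows whose numeric field lies within [min, max];
// rows with an empty or non-numeric value are dropped.
function filterByRange(rows, field, { min = -Infinity, max = Infinity } = {}) {
  return rows.filter((row) => {
    const raw = row[field];
    if (raw === '' || raw == null) return false;
    const value = Number(raw);
    return Number.isFinite(value) && value >= min && value <= max;
  });
}

// Return a sorted copy; the input array is left untouched.
function sortBy(rows, field, direction = 'asc') {
  const sign = direction === 'desc' ? -1 : 1;
  return [...rows].sort((a, b) => sign * (Number(a[field]) - Number(b[field])));
}
```

For example, `sortBy(filterByRange(rows, 'price', { max: 1000 }), 'travelTime_work')` would list affordable listings ordered by commute time, assuming both fields hold numbers.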

Functionality

  • Comprehensive Overview: All listing data in one view
  • Comparison Tools: Side-by-side property comparison
  • Travel Insights: Visual representation of travel times
  • Custom Field Display: Shows AI-extracted custom information
  • Responsive Design: Works on desktop and mobile devices

Security Features

  • User Authentication: Dashboard access requires login
  • Job Isolation: Users only see listings from their own jobs
  • API Key Protection: No sensitive configuration data exposed
  • Input Validation: All user inputs sanitized and validated

Security & Privacy

API Key Management

  • User-Provided Keys: All API keys (ChatGPT, Google Maps) are provided by users
  • Secure Storage: Keys stored in backend configuration files, not in database
  • No Frontend Exposure: Keys never transmitted to browser or stored in listings
  • Environment-Based: Support for environment variables for production deployments
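One way to combine environment variables with the backend configuration while keeping keys away from the browser is sketched below. The environment variable names, config keys, and function names are all assumptions for illustration; the PR text does not specify them.

```javascript
// Hypothetical sketch; the env variable and config key names are assumptions.
// Environment variables take precedence over values from the config file.
function resolveApiKeys(config, env = process.env) {
  return {
    openAiKey: env.OPENAI_API_KEY || config.openAiKey || '',
    googleMapsKey: env.GOOGLE_MAPS_API_KEY || config.googleMapsKey || '',
  };
}

// What the UI may see: only whether a key is configured, never the key itself.
function toPublicSettings(keys) {
  return {
    hasOpenAiKey: keys.openAiKey !== '',
    hasGoogleMapsKey: keys.googleMapsKey !== '',
  };
}
```

Exposing only booleans lets the frontend disable features that need a missing key without ever receiving the key material.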

Data Protection

  • Local Storage: All data stored locally, no external data transmission
  • User Isolation: Jobs and listings are user-specific
  • No PII Storage: No personal information stored in listings
  • Configurable Retention: Users control data retention policies

Access Control

  • Authentication Required: All enhanced features require user login
  • Job-Level Permissions: Users can only access their own job data
  • Audit Logging: Comprehensive logging for debugging and monitoring

This architecture provides a robust, scalable, and secure foundation for advanced real estate analysis while maintaining user privacy and data security.

@orangecoding
Owner

Dude, that's quite a PR. First and foremost, thanks for the work. I need some time to check it out and go through all of this, which might take 2 weeks since I'm on a business trip starting at the end of this week, so there might be some delays. I'll check it out as soon as possible.

@crystalnet
Author

Yeah, sorry that it became so big... Take your time to review it and feel free to ask me any questions :)

@orangecoding
Owner

hey. I just did a quick check (I need to dig way deeper), just a couple of questions for now.

  1. you removed the config.json which breaks Fredy. Can you add it again?
  2. you introduced some big new concepts into Fredy. Some of them are extremely powerful (although probably only relevant to a smaller group of people, as I assume the regular user won't use AI), but it would be beneficial anyway to describe all of this in the README
  3. why is there a random delay? await delay(2000-7000ms); (and such a big one)
  4. As for the new settings "Google Maps Api Key" & "Open Ai Key", would it make sense to add a little bit of documentation on how to obtain the keys (for non-technical users)?
  5. Can we somehow check if the api keys are valid upon saving?
  6. As for the custom fields in the jobs, I think it makes sense to not let the user add any fields if there is no OpenAI key
  7. I do not understand why (in the sequence of steps) you send out the notifications first and only enhance the listings afterwards. This way the user will never actually get all the information. If I put the notification last, I can actually see the enhancements, but due to the delay, processing takes very long. Depending on the number of listings found (could be north of 100 on the first run), this can theoretically take longer than the interval until Fredy runs again.
  8. You do have a function called _enhanceListings. It might be good to rename it to _enhanceListingsWithAi
  9. For the enhancement step, you extract all the data again. I think this is overkill. Why don't you use the already extracted data instead?

Lastly, if I change the order as mentioned in (7), and use the ai enhancement, I do get tons of these errors:

error: processExpose failed for listing 1c5733e31026f43622146abbf05f4e87b46de0630c4afa0b662b436169dcaeac: No response received from https://www.immonet.de/classified-search?distributionTypes=Buy,Buy_Auction,Compulsory_Auction&estateTypes=House,Apartment&locations=AD08DE2112&order=Default&m=homepage_new_search_classified_search_result {"timestamp":"2025-06-24T10:59:01.028Z"}

package-lock.json

# Config files
config.json
Owner

this must be included, otherwise Fredy will break for everybody when cloning freshly

2 participants