# Cars.com Web Scraper - Optimized Version

## Key Improvements:

### 1. Try/Except for Every Field
- Each field is wrapped in its own `try`/`except` block to prevent crashes
- If one field fails, parsing continues with the next field
- Handles variable HTML structures across listings
- No more script crashes from missing elements
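
A minimal sketch of the per-field guard idea, using plain dictionaries instead of parsed HTML so it runs without any third-party package. The names `safe_field` and `parse_record` are hypothetical and do not appear in the notebook; the notebook itself uses inline `try`/`except` blocks around each selector.

```python
from typing import Any, Callable, Dict


def safe_field(extract: Callable[[], Any], default: Any = None) -> Any:
    """Run one field extractor and return `default` instead of raising."""
    try:
        return extract()
    except Exception:  # broad on purpose, but still lets Ctrl-C through
        return default


def parse_record(raw: Dict[str, str]) -> Dict[str, Any]:
    # Each field is guarded separately, so a bad price does not lose the title.
    return {
        "title": safe_field(lambda: raw["title"].strip()),
        "price": safe_field(lambda: int(raw["price"].replace("$", "").replace(",", ""))),
        "mileage": safe_field(lambda: int(raw["mileage"].replace(",", ""))),
    }


# "mileage" is missing here, so only that field falls back to None.
record = parse_record({"title": " 2020 Honda Civic ", "price": "$25,999"})
print(record)
```

Catching `Exception` rather than using a bare `except:` keeps the guard broad while still allowing `KeyboardInterrupt` to stop a long run.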

### 2. Batch HTML Collection (Memory Efficient)
- **NEW**: Open browser once, collect all HTML, then parse offline
- Saves HTML to disk first, extracts data later
- Avoids OOM errors from frequent browser open/close
- 3x faster than old method

### 3. Single Browser Session Per Page
- Old: Open/close browser for EACH listing (slow, memory leak)
- New: Open browser ONCE per page, get all URLs sequentially
- Significantly reduces lag and OOM issues
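
The session-reuse pattern can be sketched roughly as follows. `FakeBrowser` is a stand-in invented for this example so that it runs without Chrome; in the notebook the real object would be the Selenium driver returned by `init_driver()`. The helper names `browser_session` and `fetch_all` are also assumptions.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Optional


class FakeBrowser:
    """Stand-in for a Selenium driver so the sketch runs without Chrome."""

    opened = 0  # counts how many sessions were started

    def __init__(self) -> None:
        FakeBrowser.opened += 1
        self.closed = False

    def get_html(self, url: str) -> str:
        return f"<html><body>{url}</body></html>"

    def quit(self) -> None:
        self.closed = True


@contextmanager
def browser_session(factory: Callable[[], FakeBrowser]) -> Iterator[FakeBrowser]:
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # always released, even if a page fails


def fetch_all(urls: List[str]) -> Dict[str, Optional[str]]:
    pages: Dict[str, Optional[str]] = {}
    with browser_session(FakeBrowser) as browser:  # opened once for all URLs
        for url in urls:
            try:
                pages[url] = browser.get_html(url)
            except Exception:
                pages[url] = None  # record the failure and keep going
    return pages


pages = fetch_all(["https://example.com/a", "https://example.com/b"])
print(len(pages), FakeBrowser.opened)
```

The `finally` clause is the important part: the browser is closed exactly once regardless of how many individual pages fail.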

### 4. Two-Phase Processing
**Phase 1 - HTML Collection:**
- Step 1: Collect all listing HTML (1 browser session)
- Step 2: Parse listings to find seller/review URLs
- Step 3: Collect all seller HTML (1 browser session)
- Step 4: Collect all review HTML (1 browser session)

**Phase 2 - Data Extraction:**
- Step 5: Parse all HTML offline and save JSON
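
Phase 2 needs no browser at all, which is what makes re-parsing cheap. The sketch below illustrates the idea with a toy regex parser and a temporary directory; `parse_title` and `parse_cached_pages` are hypothetical names, and the notebook's real parsers use BeautifulSoup with far more fields.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def parse_title(html: str) -> Optional[str]:
    """Toy parser: pull the <title> text out of one page."""
    match = re.search(r"<title>(.*?)</title>", html, flags=re.S)
    return match.group(1).strip() if match else None


def parse_cached_pages(html_dir: Path, out_path: Path) -> List[Dict[str, Optional[str]]]:
    """Read every cached page from disk, parse it, and write one JSON file."""
    records = [
        {"file": path.name, "title": parse_title(path.read_text(encoding="utf-8"))}
        for path in sorted(html_dir.glob("*.html"))
    ]
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


# Simulate Phase 1 having already saved two pages to disk.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "html_cache"
    cache.mkdir()
    (cache / "a.html").write_text("<title>Car A</title>", encoding="utf-8")
    (cache / "b.html").write_text("<p>no title</p>", encoding="utf-8")
    records = parse_cached_pages(cache, Path(tmp) / "out.json")

print(records)
```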

### 5. HTML Caching
- All HTML saved to `html_cache/` folder
- Can re-parse without re-scraping
- Resume interrupted scraping
- Debug parsing issues easily
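
One way to make a cache resumable is to derive a stable file name from each URL and skip URLs whose file already exists. This is a sketch under assumptions: the hashing scheme, `cache_path`, and `fetch_missing` are illustrative and may not match how the notebook names files inside `html_cache/`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.html"


def fetch_missing(urls: List[str], cache_dir: Path, fetch: Callable[[str], str]) -> List[str]:
    """Fetch only URLs with no cached file; return the URLs actually fetched."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in urls:
        path = cache_path(cache_dir, url)
        if path.exists():
            continue  # already on disk from an earlier run
        path.write_text(fetch(url), encoding="utf-8")
        fetched.append(url)
    return fetched


def fake_fetch(url: str) -> str:
    """Stand-in for a real page load."""
    return f"<html>{url}</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "html_cache"
    first_run = fetch_missing(["https://example.com/1"], cache_dir, fake_fetch)
    # The second run sees page 1 on disk and fetches only page 2.
    second_run = fetch_missing(
        ["https://example.com/1", "https://example.com/2"], cache_dir, fake_fetch
    )

print(first_run, second_run)
```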

### 6. Data Completeness Tracking
- `_metadata` field tracks data quality
- `categorize_scraping_result()` classifies:
  - **complete**: Full data with reviews
  - **partial_no_reviews**: No reviews (new models)
  - **partial**: Incomplete data
  - **failed**: No data retrieved
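
The four categories could be derived from the parsed dictionary along these lines. The rules below are an assumption made for illustration: the function name `classify_result` is hypothetical, and the notebook's own `categorize_scraping_result()` may apply different criteria. The only parts taken from the notebook are the `post`, `car`, and `_metadata` keys and the `is_complete` flag.

```python
from typing import Any, Dict, Optional


def classify_result(data: Optional[Dict[str, Any]]) -> str:
    """Hypothetical rules; the notebook's categorize_scraping_result() may differ."""
    if not data or not data.get("post"):
        return "failed"
    is_complete = bool((data.get("_metadata") or {}).get("is_complete"))
    has_reviews = bool((data.get("car") or {}).get("reviews"))
    if is_complete and has_reviews:
        return "complete"
    if is_complete:
        return "partial_no_reviews"  # core fields present, review list empty
    return "partial"


sample = {
    "post": {"title": "2020 Honda Civic", "price": 25999},
    "car": {"reviews": None},
    "_metadata": {"is_complete": True},
}
print(classify_result(sample))
```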

## Architecture:

### Old Method (Deprecated):
```
For each URL:
  Open browser → Get listing → Close browser
  Open browser → Get seller → Close browser  
  Open browser → Get reviews → Close browser
  Parse and save
```
**Problems**: Slow, OOM errors, browser lag

### New Method (Recommended):
```
Open browser once:
  Get all listing URLs → Save HTML
Close browser

Open browser once:
  Get all seller URLs → Save HTML
Close browser

Open browser once:
  Get all review URLs → Save HTML
Close browser

Parse all HTML offline → Save JSON
```
**Benefits**: Fast, memory efficient, resumable

## Usage:

### Recommended - Batch Method:
```python
scrape_from_url_files_batch(
    link_folder="car_links",
    output_root="raw_data",
    html_cache_root="html_cache",
    headless=True,
    from_page=1,
    to_page=2,
)
```

### Old Method (if needed):
```python
scrape_from_url_files(
    link_folder="car_links",
    output_root="raw_data",
    delay=2.5,
    headless=True,
    from_page=1,
    to_page=2,
    max_workers=5,
)
```

## Edge Cases Handled:
- New models without reviews
- Missing seller links
- Variable HTML structures
- Network timeouts
- Incomplete listings

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.38.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio<1.0,>=0.31.0 (from selenium)
  Downloading trio-0.32.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting outcome (from trio<1.0,>=0.31.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket<1.0,>=0.12.2->selenium)
  Downloading wsproto-1.3.1-py3-none-any.whl.metadata (5.2 kB)
Downloading selenium-4.38.0-py3-none-any.whl (9.7 MB)
Downloading trio-0.32.0-py3-none-any.whl (512 kB)
Downloading trio_websocket-0.12.2-py3-none-any.whl (21 kB)
Downloadin

In [None]:
!pip install webdriver-manager

In [None]:
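# Install Google Chrome from the official .deb package (standalone, no snap).
# The matching ChromeDriver is downloaded later by webdriver-manager in init_driver().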
!apt-get update
!apt-get install -y wget unzip

# Download and install Chrome
!wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!apt-get install -y ./google-chrome-stable_current_amd64.deb

# Verify installation
!which google-chrome
!google-chrome --version

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 https://cli.github.com/packages stable InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,289 kB]
Get:13 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [5,

In [None]:
# Test Chrome installation
print("Testing Chrome installation...")
!which google-chrome
!google-chrome --version
print("\n✓ Chrome is installed!")

In [None]:
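# Quick test to verify Chrome driver works.
# Run this only after the cell that defines init_driver(); otherwise the
# except block below reports a NameError as the failure.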
print("=" * 50)
print("TESTING CHROME DRIVER INITIALIZATION")
print("=" * 50)

try:
    test_driver = init_driver(headless=True)
    print("\n✅ SUCCESS! Chrome driver is working!")
    print("Attempting to navigate to Google...")
    test_driver.get("https://www.google.com")
    print(f"✅ Page title: {test_driver.title}")
    test_driver.quit()
    print("✅ All tests passed!\n")
except Exception as e:
    print(f"\n❌ FAILED: {e}\n")
    import traceback
    traceback.print_exc()

---

## Code Structure

This notebook is organized into the following sections:

1. **Environment Setup** (Cells 2-3)
   - Install Selenium
   - Install Chrome driver

2. **Library Imports** (Cell 4)
   - All required imports in one place
   - Organized by: Standard library → Third-party → Selenium

3. **Core Utility Functions** (Cell 5)
   - `init_driver()` - Initialize browser
   - `clean_and_convert_to_int()` - Data cleaning
   - `get_html_with_driver()` - Fetch HTML
   - `save_html()` / `load_html()` - HTML caching
   - `get_soup_from_html()` - Parse HTML

4. **HTML Parsing Functions** (Cell 5)
   - `_parse_listing_data()` - Parse car listing
   - `_parse_seller_data()` - Parse seller info
   - `_parse_reviews_data()` - Parse reviews

5. **Scraping Functions** (Cell 5)
   - `scrape_*_from_html()` - Parse from HTML string (NEW)
   - `scrape_*_website()` - Direct scraping (DEPRECATED)
   - `collect_html_batch()` - Batch HTML collection (NEW)
   - `scrape_full_info()` - Complete scraping

6. **URL Crawling Functions** (Cell 6)
   - `crawl_all_listing_urls()` - Collect car URLs from search pages

7. **Batch Scraping Functions** (Cell 7)
   - `scrape_from_url_files_batch()` - NEW batch method (RECOMMENDED)
   - `scrape_from_url_files()` - Old threaded method (DEPRECATED)

8. **Testing Cells** (Cells 8-11)
   - Test individual URLs
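
As a quick illustration of the data-cleaning step, here is a compact restatement of the digit-stripping approach used by `clean_and_convert_to_int()`. The name `digits_to_int` is hypothetical, and this version omits the warning message that the notebook function prints.

```python
import re
from typing import Optional


def digits_to_int(text: Optional[str]) -> Optional[int]:
    """Strip every non-digit character, then convert what is left."""
    if not isinstance(text, str):
        return None
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


for sample in ["$25,999", "(38,191 mi.)", "Not Priced", "$25,999.50"]:
    print(repr(sample), "->", digits_to_int(sample))
```

Note the last sample: because the decimal point is stripped along with everything else, `"$25,999.50"` becomes `2599950`. The approach is only safe for fields that are always whole numbers.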

---

In [None]:
# Standard library imports
import copy
import json
import os
import random
import re
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List, Optional, Set
from urllib.parse import urljoin

# Third-party imports
from bs4 import BeautifulSoup

# Selenium imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# WebDriver Manager (for auto-downloading correct ChromeDriver)
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.core.os_manager import ChromeType

In [None]:
# ============================================================================
# CORE UTILITY FUNCTIONS
# ============================================================================

def init_driver(headless: bool = True) -> webdriver.Chrome:
    """Initializes and configures the Selenium Chrome WebDriver for Colab."""
    chrome_options = Options()
    
    # Set binary location to Google Chrome (not chromium)
    chrome_options.binary_location = '/usr/bin/google-chrome'

    if headless:
        chrome_options.add_argument('--headless=new')  # Use new headless mode

    # Essential arguments for Colab environment
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-software-rasterizer')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-setuid-sandbox')
    chrome_options.add_argument('--remote-debugging-port=9222')  # Important for Colab
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')
    chrome_options.add_argument("--blink-settings=imagesEnabled=false")
    chrome_options.add_argument('--window-size=1920,1080')
    chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36')
    
    # Additional preferences to prevent crashes
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    try:
        print("Initializing Chrome driver with ChromeDriverManager...")
        # Use ChromeDriverManager to auto-download matching ChromeDriver for Google Chrome
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
        print("✓ Chrome driver initialized successfully")
        return driver
    except Exception as e:
        print(f"✗ Failed with ChromeDriverManager: {e}")
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Could not initialize Chrome driver: {e}") from e

def clean_and_convert_to_int(text: Optional[str]) -> Optional[int]:
    """
    Removes all non-digit characters from a string and converts it to an integer.

    Args:
        text (str | None): The input string to clean (e.g., "$25,999", "(38,191 mi.)").

    Returns:
        int | None: The cleaned integer, or None if the input is empty or invalid.
    """
    if not isinstance(text, str) or not text:
        return None

    try:
        # Use regex to remove any character that is not a digit
        cleaned_string = re.sub(r'\D', '', text)
        return int(cleaned_string)
    except (ValueError, TypeError):
        print(f"Warning: Could not convert '{text}' to an integer.")
        return None


def get_html_with_driver(driver: webdriver.Chrome, url: str) -> Optional[str]:
    """
    Get raw HTML using existing driver instance.

    Args:
        driver: Existing Chrome WebDriver instance.
        url: The URL of the page to scrape.

    Returns:
        Raw HTML string or None on failure.
    """
    try:
        print(f"Navigating to {url}...")
        start_time = time.time()

        driver.get(url)
        print(f"Page loaded in {time.time() - start_time:.2f} seconds.")

        time.sleep(random.uniform(0.5, 1.5))

        return driver.page_source

    except Exception as e:
        print(f"Error loading {url}: {e}")
        return None


def save_html(html_content: str, output_path: str) -> bool:
    """Save HTML content to file."""
    try:
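        parent_dir = os.path.dirname(output_path)
        if parent_dir:  # dirname is '' for a bare file name, and makedirs('') raises
            os.makedirs(parent_dir, exist_ok=True)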
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(html_content)
        return True
    except Exception as e:
        print(f"Error saving HTML to {output_path}: {e}")
        return False


def load_html(html_path: str) -> Optional[str]:
    """Load HTML content from file."""
    try:
        with open(html_path, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        print(f"Error loading HTML from {html_path}: {e}")
        return None


def get_soup_from_html(html_content: str) -> Optional[BeautifulSoup]:
    """Parse HTML string to BeautifulSoup object."""
    try:
        return BeautifulSoup(html_content, "html.parser")
    except Exception as e:
        print(f"Error parsing HTML: {e}")
        return None


# ============================================================================
# HTML PARSING FUNCTIONS
# ============================================================================

def _parse_listing_data(soup: BeautifulSoup, url: str = "") -> Dict[str, Any]:
    """
    Parse car listing data from HTML soup with robust error handling.
    Each field is wrapped in its own try/except so parsing continues on errors.
    """
    data = {}

    def safe_get_text(element, default=None):
        """Safely extract text from element."""
        try:
            return element.get_text(strip=True) if element else default
        except Exception:
            return default

    def safe_get_attr(element, attr, default=None):
        """Safely get attribute from element."""
        try:
            return element.get(attr) if element and hasattr(element, 'get') else default
        except Exception:
            return default

    # POST DATA
    post = {}

    # Each field is wrapped in its own try/except.
    # `except Exception` (not a bare `except`) lets KeyboardInterrupt through.
    try:
        post["new_used"] = safe_get_text(soup.select_one("p.new-used"))
    except Exception:
        post["new_used"] = None

    try:
        post["title"] = safe_get_text(soup.select_one("h1.listing-title"))
    except Exception:
        post["title"] = None

    try:
        mileage_text = safe_get_text(soup.select_one("p.listing-mileage"))
        post["mileage"] = clean_and_convert_to_int(mileage_text)
    except Exception:
        post["mileage"] = None

    try:
        price_text = safe_get_text(soup.select_one("span.primary-price"))
        post["price"] = clean_and_convert_to_int(price_text)
    except Exception:
        post["price"] = None

    try:
        payment_button = soup.select_one('spark-button.monthly-payment-est-link')
        payment_value = safe_get_attr(payment_button, 'phx-value-monthly-payment')
        post["monthly_payment"] = clean_and_convert_to_int(payment_value)
    except Exception:
        post["monthly_payment"] = None

    # Basics (Engine, Transmission, etc.)
    try:
        basic_dict = {}
        basics_terms = soup.select_one('section.sds-page-section.basics-section dl.fancy-description-list')
        if basics_terms:
            for term in basics_terms.find_all('dt', recursive=False):
                try:
                    key = safe_get_text(term)
                    value_tag = term.find_next_sibling('dd')
                    if key and value_tag:
                        if key == 'MPG':
                            mpg_span = value_tag.find('span', attrs={'slot': 'trigger'})
                            value = safe_get_text(mpg_span)
                        else:
                            value = safe_get_text(value_tag)
                        basic_dict[key] = value
                except:
                    continue
        post["basics_des"] = basic_dict if basic_dict else None
    except:
        post["basics_des"] = None

    # Features
    try:
        feature_dict = {}
        feature_terms = soup.select_one('section.sds-page-section.features-section dl.fancy-description-list')
        if feature_terms:
            for term in feature_terms.find_all('dt', recursive=False):
                try:
                    key = safe_get_text(term)
                    value_tag = term.find_next_sibling('dd')
                    if key and value_tag:
                        value = safe_get_text(value_tag, default='')
                        value_list = value.replace('\n\n', '').replace('\n', ',').split(",")
                        feature_dict[key] = [v.strip() for v in value_list if v.strip()]
                except:
                    continue
        post["feature_des"] = feature_dict if feature_dict else None
    except:
        post["feature_des"] = None

    # User History
    try:
        user_history_dict = {}
        user_history_terms = soup.select_one('section.sds-page-section.vehicle-history-section dl.fancy-description-list')
        if user_history_terms:
            for term in user_history_terms.find_all('dt', recursive=False):
                try:
                    key = safe_get_text(term)
                    value_tag = term.find_next_sibling('dd')
                    if key and value_tag:
                        user_history_dict[key] = safe_get_text(value_tag)
                except:
                    continue
        post["user_history_des"] = user_history_dict if user_history_dict else None
    except:
        post["user_history_des"] = None

    # Warranty
    try:
        warranty_dict = {}
        warranty_terms = soup.select_one('section.sds-page-section.warranty_section dl.fancy-description-list')
        if warranty_terms:
            for term in warranty_terms.find_all('dt', recursive=False):
                try:
                    key = safe_get_text(term)
                    value_tag = term.find_next_sibling('dd')
                    if key and value_tag:
                        value = safe_get_text(value_tag)
                        warranty_dict[key] = None if value in ['–', '—'] else value
                except:
                    continue
        post["warranty_des"] = warranty_dict if warranty_dict else None
    except:
        post["warranty_des"] = None

    # Images
    try:
        image_terms = soup.select_one('gallery-slides')
        if image_terms:
            images = image_terms.find_all('img', recursive=False)
            image_urls = [safe_get_attr(img, 'src') for img in images]
            # Use `src` as the loop name so it does not shadow the `url` parameter
            post['image'] = [src for src in image_urls if src]
        else:
            post['image'] = None
    except Exception:
        post['image'] = None

    data['post'] = post

    # SELLER DATA
    seller = {}
    try:
        seller_name_tag = soup.select_one('h3.spark-heading-5.heading.seller-name')
        seller['seller_name'] = safe_get_text(seller_name_tag)
    except:
        seller['seller_name'] = None

    try:
        seller_link_tag = soup.select_one('a.sds-rating__link.sds-button-link')
        seller_link = safe_get_attr(seller_link_tag, 'href')

        if seller_link:
            # urljoin handles both relative and already-absolute hrefs
            seller_link = urljoin('https://www.cars.com', seller_link)
            seller_key = seller_link.split('/')[-2] if '/' in seller_link else None
        else:
            seller_key = None

        seller['seller_link'] = seller_link
        seller['seller_key'] = seller_key
    except Exception:
        seller['seller_link'] = None
        seller['seller_key'] = None

    data['seller'] = seller

    # CAR MODEL DATA
    car = {}
    try:
        car_link_tag = soup.select_one('div.mmy-page-link a')

        if car_link_tag:
            car['car_model'] = safe_get_attr(car_link_tag, 'data-slugs')
            car_link = safe_get_attr(car_link_tag, 'href')

            if car_link:
                car_link = 'https://www.cars.com' + car_link
                review_link = car_link + 'consumer-reviews/?page_size=200'
            else:
                review_link = None

            car['car_link'] = car_link
            car['review_link'] = review_link
        else:
            car['car_model'] = None
            car['car_link'] = None
            car['review_link'] = None
    except:
        car['car_model'] = None
        car['car_link'] = None
        car['review_link'] = None

    # Car rating
    try:
        car_rating_tag = soup.select_one('div.vehicle-reviews spark-rating')
        car['car_rating'] = safe_get_attr(car_rating_tag, 'rating')
    except:
        car['car_rating'] = None

    # Rating breakdown
    try:
        ratings = {}
        car_terms = soup.select_one('div.review-breakdown ul.sds-definition-list.review-breakdown--list')
        if car_terms:
            for li in car_terms.select("li"):
                try:
                    name_tag = li.select_one(".sds-definition-list__display-name")
                    value_tag = li.select_one(".sds-definition-list__value")
                    if name_tag and value_tag:
                        name = safe_get_text(name_tag)
                        value = safe_get_text(value_tag)
                        try:
                            ratings[name] = float(value)
                        except (ValueError, TypeError):
                            ratings[name] = value
                except:
                    continue
        car['ratings'] = ratings if ratings else None
    except:
        car['ratings'] = None

    # Percentage recommend
    try:
        percentage_elem = soup.select_one('div.reviews-recommended')
        if percentage_elem:
            percentage_text = safe_get_text(percentage_elem, default='')
            if percentage_text:
                parts = percentage_text.split(' ')
                first_part = parts[0] if parts else ''
                try:
                    car['percentage_recommend'] = float(first_part.rstrip('%')) if first_part else None
                except (ValueError, TypeError):
                    car['percentage_recommend'] = None
            else:
                car['percentage_recommend'] = None
        else:
            car['percentage_recommend'] = None
    except:
        car['percentage_recommend'] = None

    data['car'] = car

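    # Metadata for tracking data completeness.
    # Flags are derived from the parsed values rather than from the tag
    # variables above, which are unbound (NameError) if their try block
    # raised before the assignment.
    data['_metadata'] = {
        'url': url,
        'has_car_link': bool(car.get('car_link')),
        'has_ratings': bool(car.get('car_rating')),
        'has_percentage': car.get('percentage_recommend') is not None,
        'is_complete': all([
            post.get('title'),
            seller.get('seller_name'),
            post.get('price')
        ])
    }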

    return data

def _parse_reviews_data(soup: BeautifulSoup, data: Dict[str, Dict]) -> Dict[str, Dict]:
    """Parse review data with a try/except around each field."""
    output_data = copy.deepcopy(data)

    def safe_get_text(element, default=None):
        try:
            return element.get_text(strip=True) if element else default
        except Exception:
            return default

    # Car Brand and Name
    try:
        car_name_elem = soup.select_one('div.sds-page-section.vehicle-reviews-page h1')
        if car_name_elem:
            output_data['car']['car_name'] = car_name_elem.get_text(strip=True).replace(' consumer reviews', '')
        else:
            output_data['car']['car_name'] = None
    except:
        output_data['car']['car_name'] = None

    try:
        brand_elem = soup.select_one('a[data-linkname="research-make"]')
        output_data['car']['brand'] = safe_get_text(brand_elem)
    except:
        output_data['car']['brand'] = None

    # Reviews
    try:
        listing_reviews_terms = soup.select('div.sds-container.consumer-review-container')
        reviews = []

        if listing_reviews_terms:
            for review_term in listing_reviews_terms:
                try:
                    # Skip if no review body
                    if not review_term.select_one('p.review-body'):
                        continue

                    review_data = {}

                    # Overall Rating
                    try:
                        overall_rating_tag = review_term.select_one('spark-rating')
                        review_data['overall_rating'] = float(overall_rating_tag.get('rating')) if overall_rating_tag and overall_rating_tag.get('rating') else None
                    except:
                        review_data['overall_rating'] = None

                    # Time
                    try:
                        time_tag = review_term.select_one('.review-byline.review-section > div:nth-child(1)')
                        review_data['time'] = safe_get_text(time_tag)
                    except:
                        review_data['time'] = None

                    # User Name and Location
                    try:
                        byline_tag = review_term.select_one('.review-byline.review-section > div:nth-child(2)')
                        if byline_tag:
                            byline_text = byline_tag.get_text(strip=True)
                            try:
                                username = byline_text.split('By ')[1].split(' from ')[0]
                                review_data['user_name'] = username
                                review_data['from'] = byline_text.split(' from ')[1].strip()
                            except (IndexError, AttributeError):
                                review_data['user_name'] = byline_text
                                review_data['from'] = None
                        else:
                            review_data['user_name'] = None
                            review_data['from'] = None
                    except:
                        review_data['user_name'] = None
                        review_data['from'] = None

                    # Review Body
                    try:
                        review_body_tag = review_term.select_one('p.review-body')
                        review_data['review'] = safe_get_text(review_body_tag)
                    except:
                        review_data['review'] = None

                    # Rating Breakdown
                    try:
                        ratings_breakdown = {}
                        breakdown_list = review_term.select_one('.review-breakdown--list')
                        if breakdown_list:
                            for item in breakdown_list.select('li'):
                                try:
                                    key_tag = item.select_one('.sds-definition-list__display-name')
                                    value_tag = item.select_one('.sds-definition-list__value')
                                    if key_tag and value_tag:
                                        key = key_tag.get_text(strip=True)
                                        try:
                                            value = float(value_tag.get_text(strip=True))
                                        except (ValueError, AttributeError):
                                            value = value_tag.get_text(strip=True)
                                        ratings_breakdown[key] = value
                                except:
                                    continue
                        review_data['ratings_breakdown'] = ratings_breakdown
                    except:
                        review_data['ratings_breakdown'] = {}

                    reviews.append(review_data)

                except:
                    # Skip this review if any error
                    continue

        output_data['car']['reviews'] = reviews if reviews else None
    except:
        output_data['car']['reviews'] = None

    return output_data

def _parse_seller_data(soup: BeautifulSoup, data: Dict[str, Dict]) -> Dict[str, Dict]:
    """Parse seller data with a try/except around each field."""
    output_data = copy.deepcopy(data)

    def safe_get_text(element, default=None):
        try:
            return element.get_text(strip=True) if element else default
        except Exception:
            return default

    # Phone
    try:
        phone_data = {}
        phone_terms = soup.select('div.dealer-phone')
        for phone in phone_terms:
            try:
                title_tag = phone.select_one('span.phone-number-title')
                phone_type = title_tag.get_text(strip=True) if title_tag else 'Unknown'

                number_tag = phone.select_one('a.phone-number')
                phone_number = number_tag.get_text(strip=True) if number_tag else None

                if phone_type and phone_number:
                    phone_data[phone_type] = phone_number
            except:
                continue
        output_data['seller']['phone_info'] = phone_data
    except:
        output_data['seller']['phone_info'] = {}

    # Address
    try:
        destination = soup.select_one('div.dealer-address')
        output_data['seller']['destination'] = safe_get_text(destination)
    except:
        output_data['seller']['destination'] = None

    # Hours
    try:
        hours_data = {}
        rows = soup.select('table.dealer-hours tr')
        for row in rows:
            try:
                cells = row.find_all('td')
                if len(cells) == 2:
                    key = cells[0].get_text(strip=True).replace(':', '')
                    value = cells[1].get_text(strip=True)
                    hours_data[key] = value
            except:
                continue
        output_data['seller']['hours'] = hours_data
    except:
        output_data['seller']['hours'] = {}

    # Rating
    try:
        rating = None
        rating_counts = None
        rating_terms = soup.select_one('div.dealer-info-section spark-rating')
        if rating_terms:
            rating_attr = rating_terms.get('rating')
            rating = float(rating_attr) if rating_attr else None
            # Look up the count separately so a missing count span
            # does not discard a rating that was parsed successfully.
            count_tag = rating_terms.select_one('span.test1.sds-rating__link.sds-button-link')
            rating_counts = clean_and_convert_to_int(safe_get_text(count_tag))
        output_data['seller']['seller_rating'] = rating
        output_data['seller']['seller_rating_count'] = rating_counts
    except Exception:
        output_data['seller']['seller_rating'] = None
        output_data['seller']['seller_rating_count'] = None

    # Description
    try:
        description_terms = soup.select_one('div.dealer-description.scrubbed-html')
        output_data['seller']['description'] = safe_get_text(description_terms)
    except:
        output_data['seller']['description'] = None

    # Images
    try:
        image_terms = soup.select('div.media-gallery-section img')
        if image_terms:
            image_urls = []
            for image in image_terms:
                try:
                    image_urls.append(image['src'])
                except:
                    continue
            output_data['seller']['images'] = image_urls
        else:
            output_data['seller']['images'] = None
    except:
        output_data['seller']['images'] = None

    return output_data

# ============================================================================
# SCRAPING FUNCTIONS (New & Deprecated)
# ============================================================================

def scrape_post_from_html(html_content: str, url: str = "") -> Optional[Dict[str, Any]]:
    """
    Parse car listing data from HTML content.

    Args:
        html_content: Raw HTML string.
        url: Original URL (for metadata).

    Returns:
        Dictionary containing scraped data, or None on failure.
    """
    try:
        soup = get_soup_from_html(html_content)
        if not soup:
            return None
        return _parse_listing_data(soup, url)
    except Exception as e:
        print(f"Error parsing listing {url}: {e}")
        return None


def scrape_post_website(url: str, headless: bool = True) -> Optional[Dict[str, Any]]:
    """
    Scrape car listing data from a Cars.com listing page.
    DEPRECATED: Use batch HTML collection instead.
    """
    driver = None
    try:
        driver = init_driver(headless=headless)
        html_content = get_html_with_driver(driver, url)
        if not html_content:
            return None
        return scrape_post_from_html(html_content, url)
    except Exception as e:
        print(f"Error scraping listing {url}: {e}")
        return None
    finally:
        if driver:
            driver.quit()

def scrape_seller_from_html(html_content: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Parse seller data from HTML content."""
    try:
        soup = get_soup_from_html(html_content)
        if not soup:
            return data
        return _parse_seller_data(soup, data)
    except Exception as e:
        print(f"Error parsing seller data: {e}")
        return data


def scrape_review_from_html(html_content: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Parse review data from HTML content."""
    try:
        soup = get_soup_from_html(html_content)
        if not soup:
            return data
        return _parse_reviews_data(soup, data)
    except Exception as e:
        print(f"Error parsing review data: {e}")
        return data


def scrape_seller_website(url: str, data: Optional[Dict[str, Any]], headless: bool = True) -> Optional[Dict[str, Any]]:
    """
    Scrape seller data from a dealer page.
    DEPRECATED: Use batch HTML collection instead.
    """
    driver = None
    try:
        driver = init_driver(headless=headless)
        html_content = get_html_with_driver(driver, url)
        if not html_content:
            return data
        return scrape_seller_from_html(html_content, data)
    except Exception as e:
        print(f"Error scraping seller {url}: {e}")
        return data
    finally:
        if driver:
            driver.quit()


def scrape_review_website(url: str, data: Optional[Dict[str, Any]], headless: bool = True) -> Optional[Dict[str, Any]]:
    """
    Scrape car reviews from a reviews page.
    DEPRECATED: Use batch HTML collection instead.
    """
    driver = None
    try:
        driver = init_driver(headless=headless)
        html_content = get_html_with_driver(driver, url)
        if not html_content:
            return data
        return scrape_review_from_html(html_content, data)
    except Exception as e:
        print(f"Error scraping reviews {url}: {e}")
        return data
    finally:
        if driver:
            driver.quit()


def collect_html_batch(urls: List[str], html_dir: str, headless: bool = True) -> Dict[str, str]:
    """
    Collect HTML for multiple URLs using single browser instance.
    Saves HTML files and returns mapping of URL to HTML file path.

    Args:
        urls: List of URLs to scrape.
        html_dir: Directory to save HTML files.
        headless: Whether to run browser in headless mode.

    Returns:
        Dictionary mapping URL to HTML file path.
    """
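    import hashlib  # local import so this function does not depend on the import cell

    os.makedirs(html_dir, exist_ok=True)
    url_to_html_path = {}

    driver = init_driver(headless=headless)
    try:
        for idx, url in enumerate(urls, 1):
            try:
                print(f"[{idx}/{len(urls)}] Collecting HTML: {url}")

                html_content = get_html_with_driver(driver, url)
                if not html_content:
                    print("  Failed to get HTML")
                    continue

                # Generate filename from URL. Use a stable digest: the built-in
                # hash() is salted per process for strings, so it would give the
                # same URL a different filename on every run.
                url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()[:10]
                html_filename = f"{idx}_{url_hash}.html"
                html_path = os.path.join(html_dir, html_filename)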

                if save_html(html_content, html_path):
                    url_to_html_path[url] = html_path
                    print(f"  Saved -> {html_path}")

                time.sleep(random.uniform(0.3, 0.8))

            except Exception as e:
                print(f"  Error: {e}")
                continue
    finally:
        print("Closing browser...")
        driver.quit()

    return url_to_html_path


def scrape_full_info_from_html(listing_html: str, seller_html: Optional[str], review_html: Optional[str], url: str = "") -> Optional[Dict[str, Any]]:
    """
    Parse complete car information from saved HTML files.

    Args:
        listing_html: HTML content of listing page.
        seller_html: HTML content of seller page (optional).
        review_html: HTML content of review page (optional).
        url: Original listing URL (for metadata).

    Returns:
        Complete data dictionary, or None on failure.
    """
    try:
        # Parse listing
        scraped_data = scrape_post_from_html(listing_html, url)
        if not scraped_data:
            print(f"Failed to parse listing data for {url}")
            return None

        # Parse seller (if available)
        if seller_html:
            try:
                scraped_data = scrape_seller_from_html(seller_html, scraped_data)
            except Exception as e:
                print(f"Could not parse seller data: {e}")

        # Parse reviews (if available)
        if review_html:
            try:
                scraped_data = scrape_review_from_html(review_html, scraped_data)
            except Exception as e:
                print(f"Could not parse review data: {e}")

        return scraped_data

    except Exception as e:
        print(f"Fatal error parsing {url}: {e}")
        return None


def scrape_full_info(url: str, headless: bool = True) -> Optional[Dict[str, Any]]:
    """
    Scrape complete car information including listing, seller, and reviews.
    DEPRECATED: Use batch HTML collection + parsing instead for better performance.
    """
    try:
        scraped_data = scrape_post_website(url, headless=headless)
        if not scraped_data:
            print(f"Failed to scrape listing data for {url}")
            return None

        if scraped_data.get('seller', {}).get('seller_link'):
            try:
                scraped_data = scrape_seller_website(
                    scraped_data['seller']['seller_link'],
                    scraped_data,
                    headless=headless
                )
            except Exception as e:
                print(f"Could not scrape seller data: {e}")

        if scraped_data.get('car', {}).get('review_link'):
            try:
                scraped_data = scrape_review_website(
                    scraped_data['car']['review_link'],
                    scraped_data,
                    headless=headless
                )
            except Exception as e:
                print(f"Could not scrape review data: {e}")
        else:
            print(f"No review link available - skipping reviews")

        return scraped_data

    except Exception as e:
        print(f"Fatal error scraping {url}: {e}")
        return None
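
The batch collector above fetches every URL it is given, even if an earlier run already saved that page. Resuming an interrupted run needs file names that depend only on the URL, so a later run can tell what is already cached. A minimal sketch, assuming a SHA-256 digest of the URL as the file name; `cache_path_for` and `pending_urls` are hypothetical helpers, not part of this notebook:

```python
import hashlib
import os
from typing import List


def cache_path_for(url: str, html_dir: str) -> str:
    """Return a deterministic cache path for a URL (hypothetical helper)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(html_dir, f"{digest}.html")


def pending_urls(urls: List[str], html_dir: str) -> List[str]:
    """Return the URLs that do not yet have a non-empty cached HTML file."""
    pending = []
    for url in urls:
        path = cache_path_for(url, html_dir)
        already_cached = os.path.exists(path) and os.path.getsize(path) > 0
        if not already_cached:
            pending.append(url)
    return pending
```

Passing `pending_urls(urls, html_dir)` to the collector instead of `urls` would skip pages that were already saved, provided the collector writes to `cache_path_for(url, html_dir)`.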

In [None]:
# ============================================================================
# URL CRAWLING FUNCTIONS (For collecting car listing URLs from search pages)
# ============================================================================

# Constants for URL crawling
RESULTS_URL_TMPL = (
    "https://www.cars.com/shopping/results/?"
    "list_price_max=&makes[]=&maximum_distance=all&models[]=&page={page}&stock_type=all&zip=60606"
)
SITE_BASE = "https://www.cars.com"


def init_driver(headless: bool = True, driver_path: Optional[str] = None) -> webdriver.Chrome:
    """Initialize Selenium Chrome WebDriver with sensible defaults for Colab."""
    from webdriver_manager.chrome import ChromeDriverManager
    
    chrome_options = Options()
    
    # Set binary location to Google Chrome (not chromium)
    chrome_options.binary_location = '/usr/bin/google-chrome'
    
    if headless:
        chrome_options.add_argument("--headless=new")  # modern headless
    
    # Essential arguments for Colab environment
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-software-rasterizer")
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument("--disable-setuid-sandbox")
    chrome_options.add_argument("--remote-debugging-port=9222")  # Important for Colab
    chrome_options.add_argument("--ignore-certificate-errors")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_argument("--blink-settings=imagesEnabled=false")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    )
    
    # Reduce automation fingerprints (drop the enable-automation switch and extension)
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

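    try:
        if driver_path:
            # Honor an explicitly supplied ChromeDriver binary
            print(f"Initializing Chrome driver from {driver_path}...")
            service = Service(executable_path=driver_path)
        else:
            print("Initializing Chrome driver with ChromeDriverManager...")
            # Use ChromeDriverManager to auto-download matching ChromeDriver for Google Chrome
            service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
        print("✓ Chrome driver initialized successfully")
        return driver
    except Exception as e:
        print(f"✗ Failed to initialize Chrome driver: {e}")
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Could not initialize Chrome driver: {e}") from e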


def wait_for_cards(driver: webdriver.Chrome, timeout: int = 15) -> None:
    """Wait until at least one vehicle card link is present in the DOM."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a.vehicle-card-link"))
    )


def scroll_to_bottom(driver: webdriver.Chrome, pause: float = 0.6, max_rounds: int = 12) -> None:
    """Scroll down to trigger lazy loading."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        rounds += 1


def extract_links_from_html(html: str) -> List[str]:
    """Parse Cars.com results page HTML and extract all vehicle detail links."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.select("a.vehicle-card-link")
    links: List[str] = []
    for tag in tags:
        href = tag.get("href")
        if href and "/vehicledetail/" in href:
            links.append(urljoin(SITE_BASE, href))
    return links


def crawl_all_listing_urls(
    start_page: int = 170,
    end_page: int = 200,
    output_dir: str = "car_links",
    delay_between_pages: float = 2.0,
    headless: bool = True,
    driver_path: Optional[str] = None,
    clear_output_first: bool = False,
) -> None:
    """
    Crawl Cars.com listings across multiple pages and store URLs page-by-page.

    Each page will have its own file, e.g. car_links/page_1.txt, car_links/page_2.txt, etc.
    """
    os.makedirs(output_dir, exist_ok=True)

    if clear_output_first:
        for file in os.listdir(output_dir):
            if file.startswith("page_") and file.endswith(".txt"):
                os.remove(os.path.join(output_dir, file))

    driver = init_driver(headless=headless, driver_path=driver_path)
    try:
        for page in range(start_page, end_page + 1):
            url = RESULTS_URL_TMPL.format(page=page)
            print(f"Fetching page {page}: {url}")

            try:
                driver.get(url)
            except Exception as e:
                print(f"Error navigating to page {page}: {e}")
                continue

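            try:
                wait_for_cards(driver, timeout=20)
            except Exception as e:
                # Not fatal: link extraction below reports pages that yield nothing
                print(f"Vehicle cards not detected on page {page} ({type(e).__name__}); extracting anyway")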

            scroll_to_bottom(driver, pause=0.6, max_rounds=12)
            links = extract_links_from_html(driver.page_source)

            if not links:
                print(f"No car links found on page {page}")
                continue

            output_file = os.path.join(output_dir, f"page_{page}.txt")

            with open(output_file, "w", encoding="utf-8") as f_out:
                for lnk in links:
                    f_out.write(lnk + "\n")

            print(f"Saved {len(links)} links to {output_file}")

            time.sleep(delay_between_pages + random.uniform(0.3, 1.2))

    finally:
        driver.quit()
        print(f"Done. All links saved to folder '{output_dir}'.")


if __name__ == "__main__":
    start_page = 170
    end_page = 200
    crawl_all_listing_urls(
        start_page=start_page,
        end_page=end_page,
        output_dir="car_links",
        delay_between_pages=2.0,
        headless=True,
        driver_path=None,  # or specify "/usr/lib/chromium-browser/chromedriver"
        clear_output_first=False,
    )


SessionNotCreatedException: Message: session not created: Chrome instance exited. Examine ChromeDriver verbose log to determine the cause.; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#sessionnotcreatedexception
Stacktrace:
#0 0x5a200764332a <unknown>
#1 0x5a200708fe4b <unknown>
#2 0x5a20070ca919 <unknown>
#3 0x5a20070c6375 <unknown>
#4 0x5a2007117016 <unknown>
#5 0x5a2007116736 <unknown>
#6 0x5a20070d4c1a <unknown>
#7 0x5a20070d5921 <unknown>
#8 0x5a200760a239 <unknown>
#9 0x5a200760d1e8 <unknown>
#10 0x5a20075f34c9 <unknown>
#11 0x5a200760ddb5 <unknown>
#12 0x5a20075dae93 <unknown>
#13 0x5a2007630098 <unknown>
#14 0x5a2007630273 <unknown>
#15 0x5a20076422c3 <unknown>
#16 0x7da610259ac3 <unknown>
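
A `SessionNotCreatedException` reporting "Chrome instance exited" means the browser process did not stay up. One possible cause, among others, is that the path assigned to `binary_location` does not exist on the machine. A small pre-flight check can rule that out before Selenium is involved. This is only a sketch: `find_chrome_binary` and its candidate list are hypothetical and should be adjusted to the actual environment.

```python
import shutil
from typing import Iterable, Optional

# Candidate names and paths are assumptions, not an exhaustive list.
DEFAULT_CANDIDATES = (
    "/usr/bin/google-chrome",
    "google-chrome",
    "google-chrome-stable",
    "chromium-browser",
    "chromium",
)


def find_chrome_binary(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        # shutil.which searches PATH for bare names and checks explicit paths directly.
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None
```

If `find_chrome_binary()` returns `None`, Chrome needs to be installed before calling `init_driver()`. If it returns a path other than `/usr/bin/google-chrome`, that path is the one to assign to `chrome_options.binary_location`.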


In [None]:
# ============================================================================
# BATCH SCRAPING FUNCTIONS
# ============================================================================

def read_urls_from_file(file_path: str) -> List[str]:
    """Read all non-empty lines (URLs) from a file."""
    if not os.path.exists(file_path):
        print(f"URL file not found: {file_path}")
        return []
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def categorize_scraping_result(data: Optional[Dict], url: str) -> str:
    """
    Categorize scraping result based on data completeness.

    Returns:
        'complete': Full data with reviews
        'partial_no_reviews': Complete listing but no reviews (new model)
        'partial': Incomplete data
        'failed': No data retrieved
    """
    if not data:
        return 'failed'

    metadata = data.get('_metadata', {})

    if metadata.get('is_complete'):
        if metadata.get('has_car_link') and metadata.get('has_ratings'):
            return 'complete'
        else:
            return 'partial_no_reviews'

    return 'partial'


def scrape_one_car(i, url, page_number, page_output_dir, headless, delay):
    """
    Scrape one car listing and save JSON output with result categorization.
    """
    output_path = os.path.join(page_output_dir, f"{i}.json")

    if os.path.exists(output_path):
        print(f"[Page {page_number}] Car {i}: Skipping (exists)")
        return

    print(f"[Page {page_number}] Car {i}: Scraping {url}")
    try:
        data = scrape_full_info(url, headless=headless)

        result_type = categorize_scraping_result(data, url)

        if result_type == 'failed':
            print(f"  Failed - no data retrieved")
            return

        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

        if result_type == 'complete':
            print(f"  Complete -> {output_path}")
        elif result_type == 'partial_no_reviews':
            print(f"  Partial (no reviews) -> {output_path}")
        else:
            print(f"  Partial -> {output_path}")

    except Exception as e:
        print(f"  Error: {e}")

    time.sleep(delay + random.uniform(0.5, 1.5))


def scrape_from_url_files_batch(
    link_folder: str = "car_links",
    output_root: str = "raw_data",
    html_cache_root: str = "html_cache",
    headless: bool = True,
    from_page: int = 170,
    to_page: int = 200,
    max_workers: int = 5,
):
    """
    IMPROVED: Batch HTML collection then parsing with multi-threading.
    - Opens browser once per page
    - Collects all HTML first
    - Then parses offline with MULTI-THREADING (NEW!)
    - Avoids OOM from frequent browser open/close

    Args:
        link_folder: Folder containing page_X.txt files with car URLs.
        output_root: Root folder to save scraped JSON data.
        html_cache_root: Root folder to cache HTML files.
        headless: Whether to run browser in headless mode.
        from_page: Starting page number.
        to_page: Ending page number (inclusive).
        max_workers: Number of parallel threads for parsing (default 5).
    """
    os.makedirs(output_root, exist_ok=True)
    os.makedirs(html_cache_root, exist_ok=True)

    for page_number in range(from_page, to_page + 1):
        file_name = f"page_{page_number}.txt"
        file_path = os.path.join(link_folder, file_name)

        if not os.path.exists(file_path):
            print(f"File not found: {file_name} - skipping.")
            continue

        urls = read_urls_from_file(file_path)
        print(f"\n=== Page {page_number}: {len(urls)} URLs ===")

        # Step 1: Collect all listing HTML (single browser session)
        print(f"\n[STEP 1] Collecting listing HTML...")
        html_dir = os.path.join(html_cache_root, f"page_{page_number}", "listings")
        url_to_html = collect_html_batch(urls, html_dir, headless=headless)
        print(f"Collected {len(url_to_html)} listing HTML files")

        # Step 2: Parse listings and collect secondary URLs (MULTI-THREADED)
        print(f"\n[STEP 2] Parsing listings and collecting secondary URLs (using {max_workers} threads)...")
        seller_urls = []
        review_urls = []
        listing_data_map = {}

        def parse_single_listing(url_html_pair):
            """Helper function to parse a single listing in parallel."""
            url, html_path = url_html_pair
            try:
                html_content = load_html(html_path)
                if not html_content:
                    return None

                data = scrape_post_from_html(html_content, url)
                if data:
                    seller_link = data.get('seller', {}).get('seller_link')
                    review_link = data.get('car', {}).get('review_link')
                    return (url, data, seller_link, review_link)
            except Exception as e:
                print(f"Error parsing {url}: {e}")
            return None

        # Parse in parallel
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(parse_single_listing, item) for item in url_to_html.items()]

            for future in as_completed(futures):
                result = future.result()
                if result:
                    url, data, seller_link, review_link = result
                    listing_data_map[url] = data

                    if seller_link:
                        seller_urls.append((url, seller_link))

                    if review_link:
                        review_urls.append((url, review_link))

        print(f"Parsed {len(listing_data_map)} listings")
        print(f"Found {len(seller_urls)} seller links, {len(review_urls)} review links")

        # Step 3: Collect seller HTML (single browser session)
        if seller_urls:
            print(f"\n[STEP 3] Collecting seller HTML...")
            seller_html_dir = os.path.join(html_cache_root, f"page_{page_number}", "sellers")
            # De-duplicate (many listings share one dealer) while keeping order
            seller_only_urls = list(dict.fromkeys(s[1] for s in seller_urls))
            seller_url_to_html = collect_html_batch(seller_only_urls, seller_html_dir, headless=headless)
            print(f"Collected {len(seller_url_to_html)} seller HTML files")
        else:
            seller_url_to_html = {}

        # Step 4: Collect review HTML (single browser session)
        if review_urls:
            print(f"\n[STEP 4] Collecting review HTML...")
            review_html_dir = os.path.join(html_cache_root, f"page_{page_number}", "reviews")
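            # De-duplicate (listings of the same model share one review page) while keeping order
            review_only_urls = list(dict.fromkeys(r[1] for r in review_urls))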
            review_url_to_html = collect_html_batch(review_only_urls, review_html_dir, headless=headless)
            print(f"Collected {len(review_url_to_html)} review HTML files")
        else:
            review_url_to_html = {}

        # Step 5: Parse all data and save JSON (MULTI-THREADED)
        print(f"\n[STEP 5] Parsing and saving final data (using {max_workers} threads)...")
        page_output_dir = os.path.join(output_root, str(page_number))
        os.makedirs(page_output_dir, exist_ok=True)

        def process_final_data(url_idx_pair):
            """Helper function to process final data in parallel."""
            idx, url = url_idx_pair
            try:
                if url not in listing_data_map:
                    return None

                data = copy.deepcopy(listing_data_map[url])

                # Get seller HTML if exists
                seller_link = data.get('seller', {}).get('seller_link')
                if seller_link and seller_link in seller_url_to_html:
                    seller_html = load_html(seller_url_to_html[seller_link])
                    if seller_html:
                        data = scrape_seller_from_html(seller_html, data)

                # Get review HTML if exists
                review_link = data.get('car', {}).get('review_link')
                if review_link and review_link in review_url_to_html:
                    review_html = load_html(review_url_to_html[review_link])
                    if review_html:
                        data = scrape_review_from_html(review_html, data)

                # Save JSON
                output_path = os.path.join(page_output_dir, f"{idx}.json")
                with open(output_path, "w", encoding="utf-8") as f:
                    json.dump(data, f, ensure_ascii=False, indent=2)

                result_type = categorize_scraping_result(data, url)
                return (idx, result_type, output_path)

            except Exception as e:
                print(f"  [{idx}] Error: {e}")
                return None

        # Process in parallel
        url_idx_pairs = [(idx, url) for idx, url in enumerate(urls, 1)]
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(process_final_data, pair) for pair in url_idx_pairs]

            results = []
            for future in as_completed(futures):
                result = future.result()
                if result:
                    results.append(result)

            # Sort and print results in order
            results.sort(key=lambda x: x[0])
            for idx, result_type, output_path in results:
                if result_type == 'complete':
                    print(f"  [{idx}] Complete -> {output_path}")
                elif result_type == 'partial_no_reviews':
                    print(f"  [{idx}] Partial (no reviews) -> {output_path}")
                else:
                    print(f"  [{idx}] Partial -> {output_path}")

    print(f"\n=== Scraping Complete ===")
    print(f"Pages processed: {from_page} to {to_page}")
    print(f"Output saved to: {output_root}")
    print(f"HTML cache saved to: {html_cache_root}")


def scrape_from_url_files(
    link_folder: str = "car_links",
    output_root: str = "raw_data",
    delay: float = 2.0,
    headless: bool = True,
    from_page: int = 170,
    to_page: int = 200,
    max_workers: int = 5,
):
    """
    DEPRECATED: Old multi-threaded approach with frequent browser open/close.
    Use scrape_from_url_files_batch() instead for better performance.
    """
    os.makedirs(output_root, exist_ok=True)

    for page_number in range(from_page, to_page + 1):
        file_name = f"page_{page_number}.txt"
        file_path = os.path.join(link_folder, file_name)

        if not os.path.exists(file_path):
            print(f"File not found: {file_name} - skipping.")
            continue

        page_output_dir = os.path.join(output_root, str(page_number))
        os.makedirs(page_output_dir, exist_ok=True)

        urls = read_urls_from_file(file_path)
        print(f"\n=== Processing {file_name}: {len(urls)} URLs ===")

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(scrape_one_car, i, url, page_number, page_output_dir, headless, delay)
                for i, url in enumerate(urls, start=1)
            ]

            for future in as_completed(futures):
                try:
                    future.result()
                except Exception as e:
                    print(f"  Thread error: {e}")

    print(f"\n=== Scraping Complete ===")
    print(f"Pages processed: {from_page} to {to_page}")
    print(f"Output saved to: {output_root}")


if __name__ == "__main__":
    # NEW IMPROVED METHOD: Batch HTML collection with multi-threading
    scrape_from_url_files_batch(
        link_folder="car_links",
        output_root="raw_data",
        html_cache_root="html_cache",
        headless=True,
        from_page=170,
        to_page=200,
        max_workers=5,  # Number of parallel threads for parsing
    )

    # OLD METHOD (deprecated - causes OOM from frequent browser open/close)
    # scrape_from_url_files(
    #     link_folder="car_links",
    #     output_root="raw_data",
    #     delay=2.5,
    #     headless=True,
    #     from_page=1,
    #     to_page=2,
    #     max_workers=5,
    # )


=== Page 1: 22 URLs ===

[STEP 1] Collecting listing HTML...
[1/22] Collecting HTML: https://www.cars.com/vehicledetail/d285c5a6-db6b-440e-b139-ef9c3937db8c/?attribution_type=isa
Navigating to https://www.cars.com/vehicledetail/d285c5a6-db6b-440e-b139-ef9c3937db8c/?attribution_type=isa...
Page loaded in 13.16 seconds.
  Saved -> html_cache/page_1/listings/1_8120663023.html
[2/22] Collecting HTML: https://www.cars.com/vehicledetail/dfb2feae-5988-4a48-940e-0d53b19dacec/?attribution_type=isa
Navigating to https://www.cars.com/vehicledetail/dfb2feae-5988-4a48-940e-0d53b19dacec/?attribution_type=isa...
Page loaded in 6.96 seconds.
  Saved -> html_cache/page_1/listings/2_7300785386.html
[3/22] Collecting HTML: https://www.cars.com/vehicledetail/8ef3078a-4a46-4e2c-84fd-c0cc0d5226f4/
Navigating to https://www.cars.com/vehicledetail/8ef3078a-4a46-4e2c-84fd-c0cc0d5226f4/...
Page loaded in 5.70 seconds.
  Saved -> html_cache/page_1/listings/3_8730294452.html
[4/22] Collecting HTML: https://www.c
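
Once a run finishes, the saved JSON files can be tallied by category to see how complete the data is. The sketch below re-applies the same rules as `categorize_scraping_result()` to files on disk. `summarize_results` and `categorize` are hypothetical helpers, and the code assumes the `raw_data/<page>/<n>.json` layout written by the batch function above.

```python
import json
import os
from collections import Counter
from typing import Optional


def categorize(data: Optional[dict]) -> str:
    """Mirror of categorize_scraping_result(), keyed on the `_metadata` field."""
    if not data:
        return "failed"
    metadata = data.get("_metadata", {})
    if metadata.get("is_complete"):
        if metadata.get("has_car_link") and metadata.get("has_ratings"):
            return "complete"
        return "partial_no_reviews"
    return "partial"


def summarize_results(output_root: str) -> Counter:
    """Count result categories over every .json file under output_root."""
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(output_root):
        for name in filenames:
            if not name.endswith(".json"):
                continue
            try:
                with open(os.path.join(dirpath, name), "r", encoding="utf-8") as f:
                    data = json.load(f)
            except (OSError, json.JSONDecodeError):
                data = None  # unreadable or corrupt files are counted as failed
            counts[categorize(data)] += 1
    return counts
```

Calling `summarize_results("raw_data")` returns a `Counter` whose keys are the four category names, which makes it easy to spot pages where most listings came back partial.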

In [None]:
!cp -r /content/raw_data/* "/content/drive/MyDrive/Car Recsys Consultant Chatbot/"
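
The shell copy above can also be done from Python, which makes it easier to fail early when the source folder is missing. A sketch assuming Python 3.8+ (for `dirs_exist_ok`); `backup_results` is a hypothetical helper.

```python
import os
import shutil


def backup_results(src_dir: str, dest_dir: str) -> int:
    """Copy the contents of src_dir into dest_dir; return the number of source files."""
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(f"Source folder not found: {src_dir}")
    # dirs_exist_ok=True merges into an existing destination instead of raising
    shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)
    return sum(len(files) for _root, _dirs, files in os.walk(src_dir))
```

With the paths from the command above, the call would be `backup_results("/content/raw_data", "/content/drive/MyDrive/Car Recsys Consultant Chatbot")`.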