<center><span style="color:#27A3F5;font-size:30px">A brief description of the scraper</span></center>


The data collection process is implemented using Python and Selenium WebDriver to scrape rental listings from Immowelt. Selenium is required because Immowelt renders a significant portion of its content dynamically using JavaScript, which prevents reliable extraction via simple HTTP requests.

The scraper is designed to:
- Collect detailed rental listing data
- Avoid duplicate listings
- Handle dynamically loaded content
- Respect ethical scraping practices through rate limiting

The output of the scraper is a structured CSV dataset containing 27 features per listing.

<span style="color:#609926;font-size:18px">Deduplication Strategy</span>  
To avoid scraping the same listing multiple times, the scraper maintains a set of previously seen listing URLs.
At startup:
- If a CSV file already exists, all previously stored listing URLs are loaded into memory
- Each new listing URL is checked against this set before scraping

This allows the scraper to be safely re-run and to extend the dataset incrementally without creating duplicates.

```python
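seen_urls = load_seen_urls(CSV_FILE)  # set of listing URLs already stored in the CSV
...
# excerpt from the loader: every stored row contributes its URL to the set
seen.add(row["listing_url"])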
```
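
The excerpt above only shows two lines of the loader. A minimal sketch of how such a function could look, assuming the CSV keeps the URL in a `listing_url` column (the column name comes from the excerpt; the function body is an assumption, not the original implementation):

```python
import csv
import os


def load_seen_urls(csv_file, url_column="listing_url"):
    """Return the set of listing URLs already stored in csv_file."""
    seen = set()
    if not os.path.isfile(csv_file):
        return seen  # first run: nothing has been scraped yet
    # utf-8-sig strips a leading BOM if the file was written with one
    with open(csv_file, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            url = row.get(url_column)
            if url:
                seen.add(url)
    return seen
```

Using a set keeps the membership test before each scrape cheap, even when the CSV already holds many listings.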

<span style="color:#609926;font-size:18px">Pagination Through Search Results</span>  
The scraper iterates through multiple pages of Immowelt’s rental search results using a predefined URL structure with a page parameter.
For each results page:
- All listing cards are identified by their shared `data-testid` attribute
- The hyperlink (`href`) of each listing is extracted
- Only listings that have not been previously scraped are collected for further processing

A delay is introduced after loading each results page to avoid excessive request rates.

```python
for page in range(1, 100):
    driver.get(f"...&page={page}")
    time.sleep(random.uniform(5, 6))
```

<span style="color:#609926;font-size:18px">Visiting Individual Listing Pages</span>  
Each extracted listing URL is opened individually to collect detailed property-level information that is not available on the overview pages.
Randomized waiting times are used between page loads to simulate human browsing behavior and reduce server load.

```python
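cards = driver.find_elements(By.CSS_SELECTOR, '[data-testid="serp-core-classified-card-testid"]')

hrefs = []  # new listing URLs found on the current results page
for card in cards:
    a_tag = card.find_element(By.TAG_NAME, 'a')
    href = a_tag.get_attribute('href')
    # skip cards without a link and listings that were already scraped
    if href and href not in seen_urls:
        hrefs.append(href)

for href in hrefs:
    driver.get(href)
    time.sleep(random.uniform(1.5, 2))  # randomized pause between listing pages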
```
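
The filtering step can be isolated from the browser and checked on plain strings. The sketch below is a hypothetical helper, not part of the original scraper; it additionally drops links that appear twice on the same results page, which the loop above does not do:

```python
def collect_new_urls(hrefs, seen_urls):
    """Keep hrefs that are non-empty, not yet scraped, and not repeated on the page."""
    new_urls = []
    queued = set()  # URLs already accepted from this page
    for href in hrefs:
        if not href or href in seen_urls or href in queued:
            continue
        queued.add(href)
        new_urls.append(href)
    return new_urls
```

For example, `collect_new_urls(["a", None, "b", "a", "c"], {"b"})` keeps `"a"` and `"c"` in their original order.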

<span style="color:#609926;font-size:18px">Price Extraction</span>  
The following pricing information is extracted using targeted CSS selectors:
- Cold rent (Kaltmiete)
- Warm rent (Warmmiete)
- Deposit (Kaution)

All price values are initially stored as raw text strings and later converted into numerical formats during data preprocessing.


```python
cold_rent = safe_text(By.CSS_SELECTOR, "div... span.css-9wpf20")
warm_rent = safe_text(By.CSS_SELECTOR, "div... div.css-vt5v8p")
deposit   = safe_text(By.CSS_SELECTOR, "div... span.css-9wpf20")

```
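
The later conversion to numbers is not shown in this section. A minimal sketch of what it might look like, assuming the raw strings follow the usual German number format (`.` as thousands separator, `,` as decimal separator, for example "1.250,50 €"); `parse_price` is a hypothetical name:

```python
import re


def parse_price(raw):
    """Convert a German-formatted price string such as '1.250,50 €' to a float."""
    if not raw:
        return None
    # keep digits and the two separator characters only
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # German convention: '.' groups thousands, ',' marks decimals
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None  # the text contained no usable number
```

Returning `None` for missing or non-numeric text keeps the distinction between "no price given" and a price of zero.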

<span style="color:#609926;font-size:18px">Parsing Core Property Attributes</span>  
Key attributes such as area, number of rooms, floor, and availability date are displayed in a compact summary section on the listing page.
This section is parsed by:
- Splitting the text using a delimiter (•)
- Identifying each attribute based on keyword patterns (e.g., "Zimmer", "m²", "frei ab")

This approach allows multiple features to be extracted from a single HTML element.

```python
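facts = safe_text(By.CSS_SELECTOR, "div.css-g6cs9n div.css-1ulazoo")
# safe_text returns None when the element is missing, so fall back to an empty string
parts = [p.strip() for p in (facts or "").split("•") if p.strip()]

for p in parts:
    if "Zimmer" in p:
        rooms = p
    elif "m²" in p:
        area = p
    elif "frei ab" in p:
        free_from = p.replace("frei ab", "").strip()
    else:
        # any part that matches no keyword is treated as the floor description
        floor = p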
```
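
To make the splitting rule concrete, here is the same logic as a standalone function applied to a made-up summary string. The function name, the dictionary keys, and the sample text are illustrative assumptions, not taken from the scraper or from a real listing:

```python
def parse_facts(facts):
    """Split a '•'-separated summary string into named attributes."""
    result = {"rooms": None, "area": None, "free_from": None, "floor": None}
    for part in (facts or "").split("•"):
        part = part.strip()
        if not part:
            continue
        if "Zimmer" in part:
            result["rooms"] = part
        elif "m²" in part:
            result["area"] = part
        elif "frei ab" in part:
            result["free_from"] = part.replace("frei ab", "").strip()
        else:
            # anything unrecognized is assumed to describe the floor
            result["floor"] = part
    return result


sample = "3 Zimmer • 75 m² • 2. Geschoss • frei ab 01.03.2025"
print(parse_facts(sample))
```

Because the final branch catches every unrecognized part, an unexpected entry in the summary would be stored as the floor, so this column is worth validating during preprocessing.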

<span style="color:#609926;font-size:18px">Location Extraction</span>  
The property location is extracted from a dedicated address element on the listing page.
This typically contains neighborhood- or district-level information, which can later be used for spatial analysis or feature engineering.  
```python
location = safe_text(By.CSS_SELECTOR, "button.css-xdqdk0 div.css-gd6b0m span.css-wpv6zq")
```

<span style="color:#609926;font-size:18px">Feature Extraction via Modal Window</span>  
Many property features on Immowelt are hidden behind a feature modal (popup).
To extract these features:
- The scraper checks whether a feature button is present
- If found, the button is clicked programmatically
- The scraper waits for the modal content to load
- Feature lists inside the modal are parsed

If the modal is not available, the scraper falls back to extracting features that are directly visible on the page.
  
```python
modal_opened = False
try:
    features_button = driver.find_element(By.CSS_SELECTOR, "button.css-1818k7y")
    features_button.click()
    time.sleep(random.uniform(1, 2))
    modal_opened = True
except Exception:  # button missing or not clickable: use the fallback selector below
    ...
if modal_opened:
    feature_sections = safe_elements(By.CSS_SELECTOR, "div.css-ev30hg ul.css-q2aypr")
else:
    feature_sections = safe_elements(By.CSS_SELECTOR, "div.css-1j9uv53 ul")
```

<span style="color:#609926;font-size:18px">Boolean Feature Detection</span>  
Binary features are extracted by scanning feature text entries for predefined keywords.
These include:
- Balcony
- Terrace
- Garden
- Elevator
- Parking
- Cellar
- Barrier-free access
- Fitted kitchen
- Bathtub
- Shower

If a keyword is detected, the corresponding feature is set to True.
If not present, the feature remains None, indicating missing or unspecified information.

```python
text = item.text.lower()

if "balkon" in text: has_balkon = True
if "terrasse" in text: has_terrasse = True
if "garten" in text: has_garten = True

# "personenaufzug", "tiefgarage" and "kelleranteil" already contain the shorter
# keywords, so one substring test per keyword is enough
if "aufzug" in text: elevator = True
if "stellplatz" in text or "garage" in text: parking = True
if "keller" in text: keller = True
```
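
The chain of `if` statements can also be written as a keyword table, which makes it easier to see which keywords trigger which flag. This is a sketch under the assumption that each feature entry is available as a plain string; `FEATURE_KEYWORDS` and `detect_features` are hypothetical names, and only the keywords shown in the excerpt above are included:

```python
FEATURE_KEYWORDS = {
    "has_balkon": ("balkon",),
    "has_terrasse": ("terrasse",),
    "has_garten": ("garten",),
    "elevator": ("aufzug",),
    "parking": ("stellplatz", "garage"),
    "keller": ("keller",),
}


def detect_features(entries):
    """Return {feature: True or None} for a list of feature text entries."""
    flags = {name: None for name in FEATURE_KEYWORDS}  # None = not mentioned
    for entry in entries:
        text = entry.lower()
        for name, keywords in FEATURE_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                flags[name] = True
    return flags
```

Note that substring matching is deliberately loose: "Tiefgarage" sets `parking`, but a word that merely contains a keyword would also set the flag.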

<span style="color:#609926;font-size:18px">Flooring Type Extraction</span>  
Interior information such as the flooring type is extracted by identifying labeled entries (e.g., "Bodenbelag:") and storing the corresponding value as a categorical feature.
```python
if "bodenbelag:" in text:
    flooring_type = text.split("bodenbelag:")[1].strip()
```

<span style="color:#609926;font-size:18px">Energy and Building Characteristics</span>  
Energy-related and building attributes are extracted from a structured list section on the listing page.
The scraper maps German attribute labels to standardized features, including:
- Energy source
- Heating type
- Property condition
- Year built

Only attributes present in the final dataset schema are stored.
```python
feature_rows = safe_elements(By.CSS_SELECTOR, "section.css-13o7eu2 ul.css-rnqikx li")

for row in feature_rows:
    spans = row.find_elements(By.CSS_SELECTOR, "div.css-j7qwjs span")
    if len(spans) < 2:
        continue  # skip rows that do not contain a label/value pair
    key = spans[0].text.lower().strip()
    val = spans[1].text.strip()

    if "baujahr" in key: year_built = val
    elif "heizungsart" in key: heating_type = val
    elif "zustand" in key: property_condition = val
    elif "energieträger" in key: energy_source = val
```
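
The label mapping itself is independent of Selenium and can be sketched on plain (label, value) pairs. The names `LABEL_TO_FEATURE` and `map_attributes` are assumptions for illustration; the four labels and feature names are the ones shown above:

```python
LABEL_TO_FEATURE = {
    "baujahr": "year_built",
    "heizungsart": "heating_type",
    "zustand": "property_condition",
    "energieträger": "energy_source",
}


def map_attributes(pairs):
    """Map (label, value) text pairs to the standardized feature names."""
    mapped = {feature: None for feature in LABEL_TO_FEATURE.values()}
    for label, value in pairs:
        key = label.lower().strip()
        for needle, feature in LABEL_TO_FEATURE.items():
            if needle in key:
                mapped[feature] = value.strip()
                break  # first matching label wins, like the if/elif chain
    return mapped
```

Labels that are not in the table are ignored, which matches the rule that only attributes in the final schema are stored.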

<span style="color:#609926;font-size:18px">Schufa Requirement Detection</span>  
The scraper checks for the presence of Schufa-related elements on the listing page.
If such an element is detected, the schufa_check feature is set to True, indicating that a credit check is required.
```python
schufa_check = bool(safe_elements(By.CSS_SELECTOR, "div.css-f5efyb a"))
```

<span style="color:#609926;font-size:18px">Metadata Collection</span>  
Additional metadata includes:
- Number of images posted in the listing
- Timestamp indicating when the listing was scraped

The number of images is extracted from a button label by keeping only its digits: a string like “Alle 9 Bilder ansehen” yields `"9"`.
```python
images_btn = safe_text(By.CSS_SELECTOR, "button div.css-tbuq8s span.css-1gur7lg")
# safe_text returns None when the button is missing, so fall back to an empty string
number_of_images_posted = "".join(filter(str.isdigit, images_btn or ""))
```
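
If an integer is preferred over a digit string, the label can be parsed with a regular expression instead. This is a hypothetical variant, not the scraper's code; unlike the digit filter above, it reads only the first number in the label rather than concatenating every digit:

```python
import re


def count_images(label):
    """Extract the image count from a label like 'Alle 9 Bilder ansehen'."""
    match = re.search(r"\d+", label or "")
    return int(match.group()) if match else None
```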

<span style="color:#609926;font-size:18px">Data Storage</span>  
Each listing is stored as a single row in a CSV file:
- If the file does not yet exist, a header row is automatically created
- New listings are appended incrementally

After saving a listing, its URL is added to the deduplication set to prevent re-scraping.
```python
append_to_csv(data)
seen_urls.add(href)
```

<span style="color:#609926;font-size:18px">Rate Limiting and Ethical Considerations</span>  
Throughout the scraping process:
- Randomized delays are applied between page loads and interactions
- The scraper avoids rapid or excessive requests

This ensures respectful interaction with the target website and reduces the risk of server overload.
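
The randomized delays appear in several places as `time.sleep(random.uniform(...))`. One way to keep them consistent is a small wrapper; `polite_sleep` is a hypothetical helper and not part of the original scraper:

```python
import random
import time


def polite_sleep(min_seconds, max_seconds):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

With this helper, the pause after a results page would read `polite_sleep(5, 6)` and the pause after a listing page `polite_sleep(1.5, 2)`, matching the intervals used in the excerpts above.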

<span style="color:#609926;font-size:18px">Append rows safely to CSV</span>  
Each scraped listing becomes one row appended to the CSV. In case the file doesn’t exist yet, we write the header first.
```python
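import csv
import os

# check before opening: mode "a" creates the file if it is missing
file_exists = os.path.isfile(CSV_FILE)

with open(CSV_FILE, "a", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    if not file_exists:
        writer.writeheader()
    writer.writerow(row)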
```

<span style="color:#609926;font-size:18px">Safe extraction helpers (to avoid crashes)</span>  
Instead of crashing when an element is missing, we return None or [].
```python
def safe_text(by, selector):
    try:
        return driver.find_element(by, selector).text.strip()
    except Exception:  # element missing or no longer attached to the page
        return None
```
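
Only `safe_text` is shown above. A sketch of the list-returning counterpart could look as follows. The excerpts call `safe_elements(by, selector)` with two arguments and presumably rely on a module-level `driver`; this version takes the driver explicitly so it can be tried without a browser, which is an assumption about the design rather than the original code:

```python
def safe_elements(driver, by, selector):
    """Return all matching elements, or [] if the lookup fails."""
    try:
        # find_elements already returns [] when nothing matches;
        # the except guards against errors such as an invalid selector
        return driver.find_elements(by, selector)
    except Exception:
        return []
```

Returning an empty list means callers can loop over the result directly without checking for `None` first.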