# Advanced: web scraping

In this notebook, we’ll explore how to **collect real data directly from websites**, a process known as **web scraping**. Using Python, we’ll learn how to read a page’s HTML structure and extract the information we need automatically.

Imagine you’re researching the **housing market** and want to analyze rental house listings from a public website. Instead of copying each ad manually, we can use Python to find and download all key details, such as the **title, price, location, size, and number of rooms** from every listing.

Our goal is to:
1. Understand the **HTML structure** of a web page.  
2. Identify where data is stored inside **tags, classes, and attributes**.  
3. Use Python libraries to **extract, clean, and organize** that data into a structured table.

We’ll also use **ChatGPT** to help us write and explain the Python code step by step, showing how AI can simplify even advanced tasks like scraping and data extraction.

### Step 1: Inspect the page and understand the HTML structure

Every web page is built with **HTML (HyperText Markup Language)** — the basic structure behind everything you see online. HTML uses **tags** (like `<div>`, `<p>`, `<a>`, `<img>`) to organize and display content such as text, links, or images.

Each element on a page can carry **attributes**, which hold extra information about it. Three kinds matter most for scraping:
- a general **attribute**, e.g. `href="link"` or `src="image.jpg"`  
- a **class**, written as `class="text-price"`, used to group and style similar elements  
- an **id**, written as `id="main"`, used to uniquely identify one element  
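
To make these three ideas concrete, here is a minimal sketch that reads an id, a class, and two attributes with BeautifulSoup. The HTML fragment is made up for illustration and is not taken from any real site.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment (an assumption, for illustration only)
sample_html = """
<div id="main">
  <a href="/ads/1" class="ad-link">First ad</a>
  <span class="text-price">500 EUR</span>
  <img src="image.jpg" alt="House photo">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Find by id: an id should identify exactly one element
main_div = soup.find(id="main")

# Find by class: note the trailing underscore, because `class` is a Python keyword
price_text = soup.find("span", class_="text-price").get_text(strip=True)

# Read attributes the same way you read values from a dictionary
link_href = soup.find("a")["href"]
image_src = soup.find("img")["src"]

print(main_div.name)  # div
print(price_text)     # 500 EUR
print(link_href)      # /ads/1
print(image_src)      # image.jpg
```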

When we do **web scraping**, we look “under the surface” of the web page — we open *Inspect* (right-click → *Inspect Element*) to see the raw HTML code.  
There, we can find patterns like:
- Each real estate ad is inside an `<article>` tag.  
- The **title** appears inside an `<h2>` tag.  
- The **price** is stored in a `<span>` tag with a specific class.  
- The **location**, **size**, **rooms**, and **description** are in their own HTML sections.

By identifying these patterns, we can tell Python exactly where to look for the data we want — which makes it possible to automatically collect information from many ads at once.
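
As a rough sketch of why a repeated pattern matters, the snippet below loops over two invented `<article>` blocks and pulls the same two fields from each. The titles, prices, and the simplified markup are assumptions, much smaller than a real listing page.

```python
from bs4 import BeautifulSoup

# Two made-up ads that share the same structure (illustration only)
page_html = """
<article><h2>House A</h2><span class="text-price">500 EUR</span></article>
<article><h2>House B</h2><span class="text-price">750 EUR</span></article>
"""

soup = BeautifulSoup(page_html, "html.parser")

ads = []
for article in soup.find_all("article"):
    # The same two lookups work for every ad because the pattern repeats
    title = article.find("h2").get_text(strip=True)
    price = article.find("span", class_="text-price").get_text(strip=True)
    ads.append((title, price))

print(ads)  # [('House A', '500 EUR'), ('House B', '750 EUR')]
```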

### Step 2: Extract all listings from one page

Now we’ll scrape real data about **houses for rent in Serbia** from [oglasi.rs](https://www.oglasi.rs/nekretnine/izdavanje-kuca).

We will start with a single listing. The prompt we gave ChatGPT was:

*I have copied the full HTML code of one ad from the website https://www.oglasi.rs/nekretnine/izdavanje-kuca. Can you write me simple Python code that will read this HTML, extract the main information (like title, price, location, size, rooms, and description), and show it in a table or pandas DataFrame? I’m a beginner, so please explain each step with short comments.*

In [None]:
# First, install the libraries if you don’t have them yet
# (run this line only once)
!pip install beautifulsoup4 pandas requests

In [None]:
from bs4 import BeautifulSoup  # BeautifulSoup = library for parsing and extracting data from HTML or XML files
import pandas as pd

In [None]:
# Paste here the full HTML content of one <article> ad
html = """
<article itemprop="itemListElement" itemscope="" itemtype="http://schema.org/Product">
        <meta itemprop="position" content="2">
                <div class="fpogl-holder advert_list_item_istaknut">
            <img class="promote_fpogl_list_item" src="/bundles/oglasicommon/images/top.png" alt="Top oglas">
        <div class="row">
        <div class="col-sm-4 col-md-3">
            <figure>
                                <a class="fpogl-list-image" data-scalable-image-holder="" href="/nekretnine/izdavanje-kuca/03-3271540/namestena-kuca" style="height: 122.75px;">
                                                        <img src="https://media.oglasi.rs/44f45448-4a14-4bda-895d-ec0a6f8f2b1c/medium-8667_1.jpg" alt="Nameštena kuća" itemprop="image" style="top: 0px; height: 100%; width: auto; left: 0px;">
                                    </a>
            </figure>
            <small class="visible-ms visible-xs pull-right">Kapital NS021 reg. br.648</small>
            <small>
                                    Obnovljen:
                                <time datetime="2025-10-27T18:07:33+01:00">
                27.10.2025.
                </time>
            </small>
            <div class="visible-sm">
                <small>Kapital NS021 reg. br.648</small>
            </div>
        </div>
        <div class="col-sm-8 col-md-9 col-lg-7">
            <div class="pull-right visible-md" style="margin-left:8px">
                <div class="text-right">
                                        <span class="text-price"><strong>500,00&nbsp;EUR</strong></span>
                                    </div>
                            </div>
            <div style="margin-bottom:16px">
                                <a class="fpogl-list-title" href="/nekretnine/izdavanje-kuca/03-3271540/namestena-kuca" itemprop="url">
                                    <h2 itemprop="name">Nameštena kuća</h2>
                </a>
                <a itemprop="category" href="/nekretnine">Nekretnine</a>  / <a itemprop="category" href="/nekretnine/izdavanje-kuca">Izdavanje kuća</a>  / <a itemprop="category" href="/nekretnine/izdavanje-kuca/novi-sad">Novi Sad</a>  / <a itemprop="category" href="/nekretnine/izdavanje-kuca/telep-novi-sad">Telep</a>             </div>
            <div class="visible-md" style="margin-bottom:16px">
                Kapital NS021 reg. br.648
            </div>
            <div class="visible-ms visible-sm visible-xs" style="margin-bottom:16px">
                <div>
                                        <span class="text-price"><strong>500,00&nbsp;EUR</strong></span>
                                    </div>
                            </div>
            <div style="margin-bottom:16px">
                <div class="row">
                                                        <div class="col-sm-6">
                        Kvadratura:
                                                    <strong>51m2</strong>
                                            </div>
                                        <div class="col-sm-6">
                        Površina zemljišta:
                                                    <strong>350m2</strong>
                                            </div>
                                                                            <div class="col-sm-6">
                        Broj soba:
                                                    <strong>2 sobe</strong>
                                            </div>
                                        <div class="col-sm-6">
                        Opremljenost:
                                                    <strong>Namešten</strong>
                                            </div>
                                                    </div>
            </div>
            <p itemprop="description">Manja kuća, sa kolskim ulazom i pomoćnim objektom, na izdavanje.

Agencija KAPITAL NS021 DOO&nbsp; Broj Registra posrednika: 648</p>
        </div>
        <div class="col-sm-2 visible-lg">
            <div class="text-right" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
                                <span class="text-price" itemprop="price" content="500.00"><strong>500,00</strong></span><span class="text-price" itemprop="priceCurrency" content="EUR"><strong>&nbsp;EUR</strong></span>
                            </div>
                                    <div class="text-right">
                <cite>Kapital NS021 reg. br.648</cite>
            </div>
        </div>
    </div>
</div>
        
    </article>
"""

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract basic information (use .find to locate each tag)
title = soup.find("h2", itemprop="name").get_text(strip=True) if soup.find("h2", itemprop="name") else None
price = soup.find("span", itemprop="price").get_text(strip=True) if soup.find("span", itemprop="price") else None
location = soup.find_all("a", itemprop="category")[-1].get_text(strip=True) if soup.find_all("a", itemprop="category") else None
description = soup.find("p", itemprop="description").get_text(strip=True) if soup.find("p", itemprop="description") else None
image = soup.find("img", itemprop="image")["src"] if soup.find("img", itemprop="image") else None
link = soup.find("a", class_="fpogl-list-title")["href"] if soup.find("a", class_="fpogl-list-title") else None
date = soup.find("time")["datetime"] if soup.find("time") else None

# For additional details (like size, rooms, furnishing), we can search by keywords
details = soup.find_all("div", class_="col-sm-6")

size = None
land_area = None
rooms = None
furnishing = None

for d in details:
    text = d.get_text(" ", strip=True)
    if "Kvadratura" in text:
        size = text.replace("Kvadratura:", "").strip()
    elif "Površina zemljišta" in text:
        land_area = text.replace("Površina zemljišta:", "").strip()
    elif "Broj soba" in text:
        rooms = text.replace("Broj soba:", "").strip()
    elif "Opremljenost" in text:
        furnishing = text.replace("Opremljenost:", "").strip()

# Create a pandas DataFrame to show the extracted data in a table
data = {
    "title": [title],
    "price": [price],
    "location": [location],
    "size": [size],
    "land_area": [land_area],
    "rooms": [rooms],
    "furnishing": [furnishing],
    "description": [description],
    "image": [image],
    "link": [link],
    "date": [date]
}

df = pd.DataFrame(data)
df


Now that we know how to extract information from **one ad**, let’s do the same for **all ads on the page**.
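
Before looping, it can help to fold the repeated "find the tag, then check that it exists" pattern into small helpers. This is one possible sketch, not the notebook's own method: `text_or_none` and `attr_or_none` are hypothetical names, and the sample HTML is a shortened copy of the ad shown earlier.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, *args, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(*args, **kwargs)
    return tag.get_text(strip=True) if tag else None

def attr_or_none(parent, attribute, *args, **kwargs):
    """Return one attribute of the first matching tag, or None if absent."""
    tag = parent.find(*args, **kwargs)
    return tag.get(attribute) if tag else None

# Shortened version of the ad HTML used above (the price is left out on purpose)
sample = """
<article>
  <a class="fpogl-list-title" href="/nekretnine/izdavanje-kuca/03-3271540/namestena-kuca">
    <h2 itemprop="name">Nameštena kuća</h2>
  </a>
  <time datetime="2025-10-27T18:07:33+01:00">27.10.2025.</time>
</article>
"""
art = BeautifulSoup(sample, "html.parser").find("article")

title = text_or_none(art, "h2", itemprop="name")
price = text_or_none(art, "span", itemprop="price")  # tag is missing, so None
link = attr_or_none(art, "href", "a", class_="fpogl-list-title")
date = attr_or_none(art, "datetime", "time")

print(title, price, link, date)
```

Each field now needs a single search instead of two, and a missing tag produces `None` rather than an error.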

Each real estate listing on the website is wrapped inside an `<article>` tag.  
We’ll use the `requests` library to load the whole web page and then loop through all `<article>` tags using **BeautifulSoup**.  
For each one, we’ll extract the same fields we used before — title, price, location, size, rooms, furnishing, description, image, link, and date.
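
When loading pages over the network, two safeguards are worth considering: a timeout, so the notebook cannot hang forever, and a status check, so an error page is not parsed as if it were listings. The helper below is a sketch under those assumptions; `fetch_html` is a hypothetical name and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Example call (commented out so this cell makes no network request):
# html = fetch_html("https://www.oglasi.rs/nekretnine/izdavanje-kuca")
```

Passing the same `requests.Session()` to several calls reuses the underlying connection, which is helpful when many pages are fetched in a row.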

In [None]:
import requests

url = "https://www.oglasi.rs/nekretnine/izdavanje-kuca"
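response = requests.get(url, timeout=10)  # give up if the server does not answer in 10 seconds
response.raise_for_status()  # stop early if the server returned an error page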
soup = BeautifulSoup(response.text, "html.parser")

articles = soup.find_all("article")

data = []

for art in articles:
    title = art.find("h2", itemprop="name").get_text(strip=True) if art.find("h2", itemprop="name") else None
    price = art.find("span", itemprop="price").get_text(strip=True) if art.find("span", itemprop="price") else None
    location = art.find_all("a", itemprop="category")[-1].get_text(strip=True) if art.find_all("a", itemprop="category") else None
    description = art.find("p", itemprop="description").get_text(strip=True) if art.find("p", itemprop="description") else None
    image = art.find("img", itemprop="image")["src"] if art.find("img", itemprop="image") else None
    link = art.find("a", class_="fpogl-list-title")["href"] if art.find("a", class_="fpogl-list-title") else None
    date = art.find("time")["datetime"] if art.find("time") else None

    details = art.find_all("div", class_="col-sm-6")
    size = land_area = rooms = furnishing = None
    for d in details:
        text = d.get_text(" ", strip=True)
        if "Kvadratura" in text:
            size = text.replace("Kvadratura:", "").strip()
        elif "Površina zemljišta" in text:
            land_area = text.replace("Površina zemljišta:", "").strip()
        elif "Broj soba" in text:
            rooms = text.replace("Broj soba:", "").strip()
        elif "Opremljenost" in text:
            furnishing = text.replace("Opremljenost:", "").strip()

    data.append({
        "title": title,
        "price": price,
        "location": location,
        "size": size,
        "land_area": land_area,
        "rooms": rooms,
        "furnishing": furnishing,
        "description": description,
        "image": image,
        "link": "https://www.oglasi.rs" + link if link else None,
        "date": date
    })

df = pd.DataFrame(data)
df.head(10)

### Step 3: Scrape all listings from all pages

Now that we’ve successfully extracted **all listings from one page**, the next step is to collect **all listings from the entire website section**, not just the first page.

If we scroll to the bottom of the page and open *Inspect* again, we’ll see a **navigation (pagination)** block in the HTML.  
It contains links like `?p=2`, `?p=3`, and so on — each representing a new page with more house listings.  

We’ll copy that part of the HTML (the pagination section) and give it to ChatGPT, so it can help us write a new version of the code that loops through all pages and scrapes every ad automatically.
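
To illustrate the idea, here is a small sketch that reads page numbers out of a pagination block. The markup below is hypothetical (the real block on the site may differ), and the page numbers are invented. It uses `urllib.parse` from the standard library to read the `p` parameter instead of splitting the string by hand.

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# Hypothetical pagination markup, written for illustration only
pagination_html = """
<ul class="pagination hidden-xs">
  <li><a href="/nekretnine/izdavanje-kuca?p=2">2</a></li>
  <li><a href="/nekretnine/izdavanje-kuca?p=3">3</a></li>
  <li><a href="/nekretnine/izdavanje-kuca?p=14">14</a></li>
  <li><a href="/nekretnine/izdavanje-kuca?p=2">Next</a></li>
</ul>
"""

soup = BeautifulSoup(pagination_html, "html.parser")

page_numbers = []
for a in soup.select("ul.pagination a[href]"):
    # parse_qs turns "p=14" into {"p": ["14"]}
    query = parse_qs(urlparse(a["href"]).query)
    for value in query.get("p", []):
        if value.isdigit():
            page_numbers.append(int(value))

# default=1 covers a section that has only one page and no pagination links
total_pages = max(page_numbers, default=1)
print(total_pages)  # 14
```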

In [None]:
import time  # to add short pauses between requests

base_url = "https://www.oglasi.rs/nekretnine/izdavanje-kuca"

# Step 1: Get the first page and find how many total pages exist
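response = requests.get(base_url, timeout=10)
response.raise_for_status()  # stop early if the server returned an error page
soup = BeautifulSoup(response.text, "html.parser")

# Find the pagination block (a CSS selector matches the two classes in any order)
pagination = soup.select_one("ul.pagination.hidden-xs")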
pages = []
if pagination:
    for a in pagination.find_all("a", href=True):
        href = a["href"]
        if "?p=" in href:
            # Extract the number after ?p=
            page_num = href.split("?p=")[-1]
            if page_num.isdigit():
                pages.append(int(page_num))

total_pages = max(pages) if pages else 1
print(f"Total pages found: {total_pages}")

# Step 2: Scrape all pages
all_listings = []

for page in range(1, total_pages + 1):
    print(f"Scraping page {page}...")
    url = f"{base_url}?p={page}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop if a page fails instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    articles = soup.find_all("article")

    for art in articles:
        title = art.find("h2", itemprop="name").get_text(strip=True) if art.find("h2", itemprop="name") else None
        price = art.find("span", itemprop="price").get_text(strip=True) if art.find("span", itemprop="price") else None
        location = art.find_all("a", itemprop="category")[-1].get_text(strip=True) if art.find_all("a", itemprop="category") else None
        description = art.find("p", itemprop="description").get_text(strip=True) if art.find("p", itemprop="description") else None
        image = art.find("img", itemprop="image")["src"] if art.find("img", itemprop="image") else None
        link = art.find("a", class_="fpogl-list-title")["href"] if art.find("a", class_="fpogl-list-title") else None
        date = art.find("time")["datetime"] if art.find("time") else None

        details = art.find_all("div", class_="col-sm-6")
        size = land_area = rooms = furnishing = None
        for d in details:
            text = d.get_text(" ", strip=True)
            if "Kvadratura" in text:
                size = text.replace("Kvadratura:", "").strip()
            elif "Površina zemljišta" in text:
                land_area = text.replace("Površina zemljišta:", "").strip()
            elif "Broj soba" in text:
                rooms = text.replace("Broj soba:", "").strip()
            elif "Opremljenost" in text:
                furnishing = text.replace("Opremljenost:", "").strip()

        all_listings.append({
            "title": title,
            "price": price,
            "location": location,
            "size": size,
            "land_area": land_area,
            "rooms": rooms,
            "furnishing": furnishing,
            "description": description,
            "image": image,
            "link": "https://www.oglasi.rs" + link if link else None,
            "date": date
        })

    # Small pause so we don't overload the website
    time.sleep(2)

# Step 3: Convert all data to DataFrame
df = pd.DataFrame(all_listings)
print(f"Total ads collected: {len(df)}")

# Preview first few rows
df.head()

Before saving the data, let’s quickly check how many listings we actually collected. This helps confirm that our scraping loop worked correctly and that we have data from all pages.

In [None]:
# Check how many total ads we scraped
print(f"Total number of listings collected: {len(df)}")

# Optional — preview the first few rows
df.head()
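
A row count alone does not reveal repeated or incomplete rows. As a hedged sketch, the checks below count duplicate links and missing values. The small DataFrame is made up to stand in for the scraped data, and whether the real site repeats ads across pages is an assumption to verify.

```python
import pandas as pd

# Made-up rows standing in for the scraped DataFrame (illustration only)
df_check = pd.DataFrame({
    "title": ["House A", "House B", "House A", None],
    "price": ["500,00", "750,00", "500,00", "300,00"],
    "link": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
        "https://example.com/c",
    ],
})

# Rows whose link already appeared earlier in the table
duplicate_count = int(df_check.duplicated(subset="link").sum())

# Number of missing values in each column
missing_per_column = df_check.isna().sum()

# Keep only the first occurrence of each link
df_unique = df_check.drop_duplicates(subset="link").reset_index(drop=True)

print(f"Duplicate rows: {duplicate_count}")
print(missing_per_column)
print(f"Rows after removing duplicates: {len(df_unique)}")
```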

### Step 4: Save the data for analysis

Now that we’ve collected **all listings from all pages**, it’s time to save our dataset so we can explore it later.

We’ll use `pandas` to export the results into a **CSV file**, a simple, widely used format that can be opened in Excel, Google Sheets, or directly in the next Python notebook for further analysis and visualization.
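
To see what a CSV export looks like without touching the disk, this sketch writes a tiny table to an in-memory buffer and reads it back. The second title is invented for illustration. Notice that values containing a comma, such as `500,00`, are wrapped in quotes and come back as text.

```python
import io
import pandas as pd

# Two sample rows; the second title is made up
listings = pd.DataFrame({
    "title": ["Nameštena kuća", "Kuća sa baštom"],
    "price": ["500,00", "750,00"],
})

# Write the table as CSV text into memory instead of a file
buffer = io.StringIO()
listings.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```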

Once saved, the data is ready for the next stage: cleaning, exploring price trends, or comparing listings by city, size, or number of rooms.
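
As a preview of that cleaning stage, the sketch below turns text such as `500,00` and `51m2` into numbers. It assumes prices use a comma as the decimal separator (as in the sample ad) and, as a further assumption, a dot as the thousands separator. The sample rows and the `parse_price` helper are hypothetical.

```python
import pandas as pd

# Made-up values in the same text format as the scraped columns
raw = pd.DataFrame({
    "price": ["500,00", "1.200,00", None],
    "size": ["51m2", "120m2", "85m2"],
})

def parse_price(value):
    """Convert text like '1.200,00' to a float; return None for missing values."""
    if pd.isna(value):
        return None
    # Assumption: '.' separates thousands and ',' marks the decimals
    return float(value.replace(".", "").replace(",", "."))

clean = pd.DataFrame({
    "price_value": raw["price"].map(parse_price),
    # Take the leading digits of the size, e.g. "51m2" becomes 51
    "size_m2": pd.to_numeric(raw["size"].str.extract(r"(\d+)", expand=False)),
})

print(clean)
```

With numeric columns in place, calculations such as a price per square metre become a simple division of one column by another.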

In [None]:
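# Save the complete dataset to a CSV file
# (utf-8-sig adds a byte-order mark so Excel displays letters like š and ć correctly)
df.to_csv("houses_for_rent_serbia.csv", index=False, encoding="utf-8-sig")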

print("✅ Data saved as houses_for_rent_serbia.csv — ready for analysis!")