# 📦 Data Collection via Web Scraping

This notebook performs web scraping on **Zingat.com**, a Turkish real estate platform, to collect rental apartment listings in Istanbul. The data includes the following features:

- `Listing_Title`: The title of the apartment listing
- `County`: The county where the apartment is located
- `Price`: The monthly rental price in Turkish Lira (₺)
- `Net_Area_(m²)`: The usable area of the apartment in square meters
- `Gross_Area_(m²)`: The total area including walls and common spaces
- `Room-Living_Room_Count`: The room layout in "rooms + living rooms" form (e.g., 2+1 means 2 rooms and 1 living room)
- `Room_Count`: The total number of rooms (excluding living room)
- `Bathroom_Count`: Number of bathrooms
- `Photo_Count`: Number of listing photos

The collected dataset will be used for rental price prediction in a separate modeling notebook.


In [2]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
def getAndParseURL(url):
    # Send a GET request with a browser-like User-Agent header to avoid bot blocking;
    # the timeout keeps a stalled connection from hanging the notebook
    result = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)

    # Stop on HTTP errors (4xx/5xx) instead of silently parsing an error page
    result.raise_for_status()

    # Parse the HTML content of the page using BeautifulSoup with the 'html.parser' parser
    soup = BeautifulSoup(result.text, 'html.parser')

    # Return the parsed HTML content (BeautifulSoup object) for further processing
    return soup
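
Scraping many pages in a row makes occasional transient failures likely, so it can help to retry a failed request after a short pause. The block below is a minimal sketch of that idea, not part of the original notebook: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fetcher is passed in as an argument so the retry logic can be tried without touching the network.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, narrow this to requests.RequestException
            last_error = exc
            if attempt < max_attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

delays = []  # record the requested pauses instead of actually sleeping
page = fetch_with_retries(
    flaky_fetch,
    "https://www.zingat.com/istanbul-kiralik-konut",
    sleep=delays.append,
)
```

In the notebook, the same wrapper could be called as `fetch_with_retries(getAndParseURL, url)`, which leaves the parsing function itself unchanged.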

In [4]:
PAGE_NUM = []  # List that will hold the result-page URLs

def find_page_num():
    # Start from an empty list so re-running the cell does not add duplicates
    PAGE_NUM.clear()

    # Build the URLs for result pages 1 to 48 (the page count is hard-coded, not read from the site)
    for i in range(1, 49):
        PAGE_NUM.append("https://www.zingat.com/istanbul-kiralik-konut" + "?page=" + str(i))

    # Return the list of URLs for all pages
    return PAGE_NUM

find_page_num()  # Call the function to populate PAGE_NUM with the URLs

['https://www.zingat.com/istanbul-kiralik-konut?page=1',
 'https://www.zingat.com/istanbul-kiralik-konut?page=2',
 'https://www.zingat.com/istanbul-kiralik-konut?page=3',
 'https://www.zingat.com/istanbul-kiralik-konut?page=4',
 'https://www.zingat.com/istanbul-kiralik-konut?page=5',
 'https://www.zingat.com/istanbul-kiralik-konut?page=6',
 'https://www.zingat.com/istanbul-kiralik-konut?page=7',
 'https://www.zingat.com/istanbul-kiralik-konut?page=8',
 'https://www.zingat.com/istanbul-kiralik-konut?page=9',
 'https://www.zingat.com/istanbul-kiralik-konut?page=10',
 'https://www.zingat.com/istanbul-kiralik-konut?page=11',
 'https://www.zingat.com/istanbul-kiralik-konut?page=12',
 'https://www.zingat.com/istanbul-kiralik-konut?page=13',
 'https://www.zingat.com/istanbul-kiralik-konut?page=14',
 'https://www.zingat.com/istanbul-kiralik-konut?page=15',
 'https://www.zingat.com/istanbul-kiralik-konut?page=16',
 'https://www.zingat.com/istanbul-kiralik-konut?page=17',
 'https://www.zingat.co
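
The page URLs can also be built without a module-level list, which keeps the result the same no matter how often the cell is re-run. This is a small sketch under that assumption; `build_page_urls` and `BASE_URL` are illustrative names that do not appear in the notebook, and the page count of 48 is taken from the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.zingat.com/istanbul-kiralik-konut"


def build_page_urls(base_url, last_page, first_page=1):
    """Return the result-page URLs for first_page..last_page (inclusive)."""
    return [
        f"{base_url}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]


page_urls = build_page_urls(BASE_URL, 48)
```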

In [5]:
ALL_PRODUCT = []  # List that will hold the URLs of the individual listing ("product") pages

def all_product():
    # Loop through all result-page URLs in the PAGE_NUM list
    for page_url in PAGE_NUM:
        # Get and parse the HTML content for the current page
        html = getAndParseURL(page_url)

        # Find all <a> tags with the class "zl-card-inner", which link to the individual listing pages
        for result in html.find_all("a", {"class": "zl-card-inner"}):
            # The href is site-relative, so prepend the site root to get the full URL
            ALL_PRODUCT.append("https://www.zingat.com" + result.get("href"))

    # Return the list of all listing URLs
    return ALL_PRODUCT

all_product()  # Call the function to populate ALL_PRODUCT with the listing URLs

['https://www.zingat.com/kozyataginda-minibus-cad-yakini-kiralik-net-83-m2-giris-kat-5209913i',
 'https://www.zingat.com/selenium-twins-residence-kiralik-1-1-daire-5284155i',
 'https://www.zingat.com/selenium-twins-residence-kiralik-bogaz-manzarali-3-5-1-daire-5284152i',
 'https://www.zingat.com/atisalani-kemer-mah-stada-yakin-buyuk-2-1-100m2-3-kat-temiz-kullanisli-ferah-kiralik-daire-5284144i',
 'https://www.zingat.com/beyoglu-1-1-kiralik-daire-tarihi-binada-esyali-ve-genis-5284123i',
 'https://www.zingat.com/fahrettin-kerim-gokay-caddesi-3-1-sifir-kiralik-daire-5284063i',
 'https://www.zingat.com/derinden-him-caddede-full-esyali-2-1-dubleks-daire-5284011i',
 'https://www.zingat.com/emlak-konut-cinarkoy-evleri-2-1-kapali-mutfak-guncel-sifir-kiralik-daire-5283986i',
 'https://www.zingat.com/cinarkoy-evlerinde-3-1-sifir-binada-guncel-kiralik-daire-5283968i',
 'https://www.zingat.com/emlak-konut-cinarkoy-3-1-ara-kat-guncel-kiralik-daire-5283955i',
 'https://www.zingat.com/vadistanbul-ter
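
A listing may show up on more than one result page (for example, if new listings are posted while the scrape is running), and an `<a>` tag without an `href` would make the string concatenation above fail. Whether either case actually occurs on this site is an assumption. The sketch below shows one way to guard against both; `normalize_listing_urls` and `SITE_ROOT` are hypothetical names.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.zingat.com"


def normalize_listing_urls(hrefs, root=SITE_ROOT):
    """Turn site-relative hrefs into absolute URLs, skipping empty values
    and dropping duplicates while keeping the first-seen order."""
    absolute = (urljoin(root, href) for href in hrefs if href)
    return list(dict.fromkeys(absolute))


# Sample hrefs shaped like the paths in the output above
sample_hrefs = [
    "/selenium-twins-residence-kiralik-1-1-daire-5284155i",
    None,  # an <a> tag with no href attribute
    "/selenium-twins-residence-kiralik-1-1-daire-5284155i",
    "/beyoglu-1-1-kiralik-daire-tarihi-binada-esyali-ve-genis-5284123i",
]
listing_urls = normalize_listing_urls(sample_hrefs)
```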

In [None]:
RESULT = []  # Initialize an empty list to store the scraped data

# Loop through the first 1000 listing URLs in the ALL_PRODUCT list
for result in ALL_PRODUCT[:1000]:
    # Get and parse the HTML content for the current listing page
    html = getAndParseURL(result)

    # Extract the listing title from the page
    listing_title = html.find("div", {"class": "col-xs-12"}).h1.text

    # Extract the county name from the page
    county = html.find_all("a", {"class": "location-link", "data-zingalite": "m-location-county-name"})[0].text.strip()

    # Extract the price and remove the "TL" currency label
    price = html.find("div", {"class": "listing-price"}).text.replace("TL", "").strip()

    # Look up the attribute list once; the fields below are read by position,
    # which assumes every listing shows its attributes in the same order
    attributes = html.find("ul", {"class": "row attribute-detail-list"}).find_all("li", {"class": "col-md-6"})

    # Extract the net area (m²)
    net_area_m2 = attributes[2].find("span").text.strip()

    # Extract the gross area (m²) and remove the "m²" unit
    gross_area_m2 = attributes[3].find("span").text.replace("m²", "").strip()

    # Extract the room layout, e.g. "2+1"
    room_living_room_count = attributes[4].find("span").text.strip()

    # Derive the room count by removing "+1", "+2", or "+3" from the layout
    # (other suffixes, such as "+0 (Stüdyo)", are left in place)
    room_count = room_living_room_count.replace("+1", "").replace("+2", "").replace("+3", "").strip()

    # Extract the bathroom count
    bathroom_count = attributes[5].find("span").text.strip()

    # Extract the photo count and remove the "+" symbol
    photo_count = html.find("div", {"class": "detail-images-slide-container hidden-xs hidden-sm"}).a.span.text.replace("+", "")

    # Append all the extracted information as a list to the RESULT list
    RESULT.append([listing_title, county, price, net_area_m2, gross_area_m2, room_living_room_count, room_count, bathroom_count, photo_count])

# Define the column names for the DataFrame
columns = ["Listing_Title", "County", "Price", "Net_Area_(m²)", "Gross_Area_(m²)", "Room-Living_Room_Count", "Room_Count", "Bathroom_Count", "Photo_Count"]

# Create a DataFrame from the RESULT list using the defined column names
df = pd.DataFrame.from_records(RESULT, columns=columns)
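
The room count above is derived by deleting the substrings `+1`, `+2`, and `+3`, so a layout such as `1+0 (Stüdyo)` passes through unchanged. A regular expression that reads the two numbers directly is one possible alternative. The sketch below is illustrative only: `parse_room_layout` is a hypothetical helper, and support for half rooms such as `3.5+1` is an assumption about how layouts may be written.

```python
import re


def parse_room_layout(layout):
    """Return (rooms, living_rooms) for strings like '2+1' or '1+0 (Stüdyo)'.

    Returns None when the text does not start with a 'rooms+living rooms' pattern.
    """
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)\s*\+\s*(\d+)", layout)
    if match is None:
        return None
    rooms = float(match.group(1).replace(",", "."))  # accept '3,5' as well as '3.5'
    living_rooms = int(match.group(2))
    return rooms, living_rooms


standard = parse_room_layout("2+1")
studio = parse_room_layout("1+0 (Stüdyo)")
half_room = parse_room_layout("3.5+1")
unparseable = parse_room_layout("Stüdyo")
```

Returning `None` for unrecognized text lets the caller decide whether to drop the row or keep the raw string for later inspection.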

In [8]:
# Display the DataFrame to check that the data was scraped successfully
df

Unnamed: 0,Listing_Title,County,Price,Net_Area_(m²),Gross_Area_(m²),Room-Living_Room_Count,Room_Count,Bathroom_Count,Photo_Count
0,KOZYATAĞI'NDA MİNİBÜS CAD. YAKINI KİRALIK NET ...,"Kadıköy,",55.000,83,100,2+1,2,1,36
1,Selenium Twins Residence Kiralık 1+1 Daire,"Beşiktaş,",60.000,70,96,1+1,1,1,30
2,Selenium Twins Residence Kiralık Boğaz Manzara...,"Beşiktaş,",150.000,180,270,3+1,3,3,30
3,"ATIŞALANI KEMER MAH. STADA YAKIN, BÜYÜK 2+1 10...","Esenler,",20.000,100,110,2+1,2,2,26
4,Beyoğlu 1+1 Kiralık Daire Tarihi Binada Eşyalı...,"Beyoğlu,",37.000,80,90,1+1,1,1,20
...,...,...,...,...,...,...,...,...,...
995,GÖKTÜRK SEBA RESERVE SİTESİNDE BALKONLU ARA KA...,"Eyüpsultan,",115.000,125,168,3+1,3,3,39
996,AĞAOĞLU MASLAK MY HOME 1+1 KİRALIK EŞYALI DAİR...,"Sarıyer,",50.000,55,75,1+1,1,1,10
997,SUADİYE ŞAŞKINDA 3+1 EŞYALI KİRALIK DAİRE,"Kadıköy,",65.000,92,120,3+1,3,1,56
998,AĞAOĞLU MASLAK 1453 1+0 (STÜDYO) EŞYALI KİRALI...,"Sarıyer,",40.000,50,65,1+0 (Stüdyo),1+0 (Stüdyo),1,9
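
In the output above, `Price` is still text such as `55.000` and `County` carries a trailing comma, so both need cleaning before modeling. The sketch below assumes the dot in the price is a thousands separator (the usual Turkish convention) and that prices have no decimal part; `clean_price` and `clean_county` are hypothetical helper names.

```python
def clean_price(raw):
    """Convert a price string like '55.000' or '55.000 TL' to an integer (55000)."""
    digits = raw.replace("TL", "").replace(".", "").strip()
    return int(digits)


def clean_county(raw):
    """Remove surrounding whitespace and a trailing comma: 'Kadıköy,' -> 'Kadıköy'."""
    return raw.strip().rstrip(",").strip()


# Values copied from the first rows of the output above
prices = [clean_price(p) for p in ["55.000", "60.000", "150.000"]]
counties = [clean_county(c) for c in ["Kadıköy,", "Beşiktaş,", "Esenler,"]]
```

With pandas, the same functions could be applied column-wise, for example `df["Price"].map(clean_price)`.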


In [9]:
# Save the DataFrame to "zingat_istanbul.csv"; "utf-8-sig" writes UTF-8 with a leading byte order mark (BOM)
df.to_csv("zingat_istanbul.csv", index=False, encoding="utf-8-sig")
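
The `utf-8-sig` codec differs from plain `utf-8` only in that it writes a three-byte byte order mark (BOM) at the start of the file, which some spreadsheet programs use to detect UTF-8 and display Turkish characters correctly. The standalone sketch below demonstrates this with the standard library's `csv` module and an in-memory buffer instead of pandas and a real file; the sample row is illustrative.

```python
import csv
import io

rows = [["County", "Price"], ["Beşiktaş", "60.000"]]

# Write the rows through a text wrapper that encodes with utf-8-sig
buffer = io.BytesIO()
text = io.TextIOWrapper(buffer, encoding="utf-8-sig", newline="")
csv.writer(text).writerows(rows)
text.flush()

raw = buffer.getvalue()
has_bom = raw.startswith(b"\xef\xbb\xbf")  # the UTF-8 BOM bytes

# Decoding with utf-8-sig removes the BOM again
decoded_lines = raw.decode("utf-8-sig").splitlines()
```

When loading the file in the modeling notebook, passing `encoding="utf-8-sig"` to `pd.read_csv` keeps the BOM out of the first column name.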