# Real Estate Data Collection and Cleaning Project

#### Timeline
##### - Start: Wednesday morning
##### - Deadline: End of day Thursday


### Project Objective
This project is designed to simulate a real-world data science task. You will scrape property data from two real estate websites, clean and integrate the data, engineer features, and frame a predictive modeling challenge based on pricing. The final deliverable will be a GitHub repository containing all project artifacts.


### Assigned Websites
##### You will be assigned two real estate websites. Each site contains listings for properties for rent and properties for sale.
##### - Website 1: https://www.bayut.om/en/
##### - Website 2: https://om.opensooq.com/en/property

### Scope: properties listed for rent

### Project Requirements:
##### 1. Web Scraping
##### 2. Data Cleaning & Integration
##### 3. Feature Engineering
##### 4. Predictive Modeling (Challenge)
##### 5. GitHub Repository

## Import the libraries used in this project

In [31]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import re
import time

# 1- Web Scraping For www.bayut.om

In [81]:
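base_url = "https://www.bayut.om"

# Define the request headers here so this cell runs on its own.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get(base_url, headers=headers, timeout=30)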

if response.status_code == 200:
    print("You have the permission to use the page ✅")
else:
    print(f"Request failed with status code: {response.status_code}")

You have the permission to use the page ✅
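
A 200 response only shows that the server answered the request; it does not by itself say whether automated crawling is welcome. One complementary check is the site's robots.txt. The following is a minimal sketch using the standard-library `urllib.robotparser`. The robots.txt text, the `my-scraper` agent name, and the `is_allowed` helper are hypothetical and used only for illustration; they are not the actual rules published by either assigned website.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/en/listings/"))
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample string.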


In [86]:
print(f"The status code is: {response.status_code}")

The status code is: 200


In [76]:
base_url = "https://www.bayut.om"
start_url = f"{base_url}/en/oman/properties-for-rent/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

properties = {
    "Title": [],
    "Location": [],
    "Price": [],
    "Size": [],
    "Listing_Type": [],
    "Link": []
}

current_page_url = start_url
page_num = 1

while True:
    print(f"Scraping page {page_num}: {current_page_url}")
    try:
        # A timeout keeps the loop from hanging on an unresponsive server.
        response = requests.get(current_page_url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {current_page_url}: {e}")
        break

    soup = BeautifulSoup(response.text, 'html.parser')
    ads = soup.find_all('article', class_='fbc619bc')

    if not ads:
        print("No property ads found. Exiting.")
        break

    for ad in ads:
        # Title & Link
        title_tag = ad.find('a', attrs={'aria-label': 'Listing link'})
        title = title_tag['title'] if title_tag and 'title' in title_tag.attrs else None
        link = base_url + title_tag['href'] if title_tag else None
        properties["Title"].append(title)
        properties["Link"].append(link)

        # Location
        location = None
        location_tag = ad.find('div', attrs={'aria-label': 'Location'})
        if location_tag:
            location = location_tag.text.strip()
        elif title:
            # \b keeps "in" from matching the tail of another word (e.g. "Cabin ").
            match = re.search(r'\bin\s+(.+)', title)
            if match:
                location = match.group(1)
        properties["Location"].append(location)

        # Price
        price_tag = ad.find('span', attrs={'aria-label': 'Price'})
        price = price_tag.text.strip() if price_tag else None
        properties["Price"].append(price)

        # Size (from detail page)
        size = None
        if link:
            try:
                detail_response = requests.get(link, headers=headers, timeout=30)
                detail_response.raise_for_status()
                detail_soup = BeautifulSoup(detail_response.content, 'html.parser')
                for span in detail_soup.find_all('span'):
                    text = span.get_text(strip=True)
                    # Keep the first span that looks like an area, e.g. "70 Sq. M."
                    if re.search(r'\d+\s*(Sq\.?\s*M\.?|sqm|m²)', text, re.I):
                        size = text
                        break
            except requests.exceptions.RequestException as e:
                print(f"Error fetching size from {link}: {e}")
        properties["Size"].append(size)

        # Listing Type
        properties["Listing_Type"].append("For Rent")

    # ---- pagination ----
    next_btn = soup.find('a', title='Next')
    if next_btn and next_btn.get('href'):
        next_page_relative_url = next_btn['href']
        current_page_url = requests.compat.urljoin(base_url, next_page_relative_url)
        page_num += 1
        time.sleep(2) 
    else:
        print("No more pages found. Scraping finished.")
        break

print(f"\nTotal properties collected: {len(properties['Title'])}")

df = pd.DataFrame(properties)
df.to_csv('bayut_for_rent.csv', index=False)



Scraping page 1: https://www.bayut.om/en/oman/properties-for-rent/
Scraping page 2: https://www.bayut.om/en/oman/properties-for-rent/page-2/
Scraping page 3: https://www.bayut.om/en/oman/properties-for-rent/page-3/
Scraping page 4: https://www.bayut.om/en/oman/properties-for-rent/page-4/
Scraping page 5: https://www.bayut.om/en/oman/properties-for-rent/page-5/
Scraping page 6: https://www.bayut.om/en/oman/properties-for-rent/page-6/
Scraping page 7: https://www.bayut.om/en/oman/properties-for-rent/page-7/
Scraping page 8: https://www.bayut.om/en/oman/properties-for-rent/page-8/
Scraping page 9: https://www.bayut.om/en/oman/properties-for-rent/page-9/
Scraping page 10: https://www.bayut.om/en/oman/properties-for-rent/page-10/
Scraping page 11: https://www.bayut.om/en/oman/properties-for-rent/page-11/
Scraping page 12: https://www.bayut.om/en/oman/properties-for-rent/page-12/
Scraping page 13: https://www.bayut.om/en/oman/properties-for-rent/page-13/
Scraping page 14: https://www.bayut.o
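
The pagination step in the loop above resolves the relative `href` of the "Next" link against the site root. A minimal sketch of that resolution, using the standard-library `urllib.parse.urljoin`, is shown here. The `next_page_url` helper name is hypothetical, and the sample `href` simply mirrors the page-2 URL printed in the log.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.bayut.om"


def next_page_url(href: Optional[str], base_url: str = BASE_URL) -> Optional[str]:
    """Resolve the href of a 'Next' link; return None when there is no next page."""
    if not href:
        return None
    # urljoin handles both relative paths and already-absolute URLs.
    return urljoin(base_url, href)


print(next_page_url("/en/oman/properties-for-rent/page-2/"))
print(next_page_url(None))
```

Returning `None` for a missing link gives the scraping loop a single condition to test before it stops.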

In [88]:
# Output as DataFrame
df = pd.DataFrame(properties)
df

| | Title | Location | Price | Size | Listing_Type | Link |
|---|---|---|---|---|---|---|
| 0 | 1 Bedroom Apartment For Rent Ruwi, Muscat | Ruwi, Muscat | 150 | 70 Sq. M. | For Rent | https://www.bayut.om/en/property/details-13012... |
| 1 | 1 Bedroom Apartment For Rent Al Hail, Muscat | Al Hail, Muscat | 300 | 100 Sq. M. | For Rent | https://www.bayut.om/en/property/details-13027... |
| 2 | 3 Bedrooms Villa For Rent Qurum, Muscat | Qurum, Muscat | 750 | 300 Sq. M. | For Rent | https://www.bayut.om/en/property/details-12994... |
| 3 | 4 Bedrooms Villa For Rent Madinat As Sultan Qa... | Madinat As Sultan Qaboos, Muscat | 950 | 300 Sq. M. | For Rent | https://www.bayut.om/en/property/details-13018... |
| 4 | 2 Bedrooms Apartment For Rent in Al Hamriyah, ... | Al Hamriyah, Muscat | 250 | 100 Sq. M. | For Rent | https://www.bayut.om/en/property/details-12994... |
| ... | ... | ... | ... | ... | ... | ... |
| 1300 | 1 Bedroom Apartment For Rent in Al Duqum, Al W... | Al Duqum, Al Wusta | 80 | 30 Sq. M. | For Rent | https://www.bayut.om/en/property/details-13008... |
| 1301 | Land for Rent - Residential Land in Barka, Al ... | Barka, Al Batinah | 300 | 619 Sq. M. | For Rent | https://www.bayut.om/en/property/details-12778... |
| 1302 | 2 Bedroom Apartment For Rent Barr Al Jissah, M... | Barr Al Jissah, Muscat | 850 | 200 Sq. M. | For Rent | https://www.bayut.om/en/property/details-13020... |
| 1303 | 4 Bedroom Villa For Rent in Muscat Hills, Muscat | Muscat Hills, Muscat | 1450 | 419 Sq. M. | For Rent | https://www.bayut.om/en/property/details-12991... |
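
The `Size` column is stored as text such as `70 Sq. M.`, so it needs converting to a number before any modeling. The sketch below shows one possible way to do that. The `parse_size` helper and its regular expression are assumptions made for illustration, modeled on the pattern used in the scraper, and are not part of the original notebook.

```python
import re
from typing import Optional

# Number (with optional thousands separators or decimals) followed by an area unit.
_SIZE_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(?:sq\.?\s*m\.?|sqm|m²|m2)",
    re.IGNORECASE,
)


def parse_size(text: Optional[str]) -> Optional[float]:
    """Extract the area in square meters from a size string, or None if absent."""
    if not text:
        return None
    match = _SIZE_RE.search(text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_size("70 Sq. M."))
print(parse_size("1,200 sqm"))
print(parse_size(None))
```

On the DataFrame, something like `df["Size"].map(parse_size)` could apply the same conversion column-wide.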


In [35]:
print("Scraping page link:", start_url)


Scraping page link: https://www.bayut.om/en/oman/properties-for-rent/


# 2- Web Scraping For om.opensooq.com

In [79]:
base_url = 'https://om.opensooq.com/en/property/property-for-rent'

response = requests.get(base_url, headers=headers)

if response.status_code == 200:
    print("You have the permission to use the page ✅")
else:
    print(f"Request failed with status code: {response.status_code}")

You have the permission to use the page ✅


In [87]:
print(f"The status code is: {response.status_code}")

The status code is: 200


In [75]:
base_url = 'https://om.opensooq.com/en/property/property-for-rent'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

properties = {
    "Title": [],
    "Location": [],
    "Price": [],
    "Size": [],
    "Listing_Type": [],
    "Link": []
}

page = 1
while True:
    url = f"{base_url}?page={page}"
    print(f"Scraping page {page}: {url}")
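    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {url}: {e}")
        break
    soup = BeautifulSoup(response.content, "html.parser")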

    cards = soup.find_all('a', class_=lambda x: x and 'postListItemData' in x)
    if not cards:
        print("No more cards found. Stopping.")
        break

    for card in cards:
        card_text = card.get_text(separator='\n', strip=True)
        lines = card_text.split('\n')

        title = lines[2] if len(lines) > 2 else None

        location = None
        for i in range(len(lines)):
            if 'Chat' in lines[i]:
                if i >= 2:
                    location = lines[i-2] + ", " + lines[i-1]
                break

        price = lines[-1] if lines else None

        size = None
        p_tag = card.find('p')
        if p_tag:
            match = re.search(r'Surface\s*Area:\s*([0-9,]+)\s*m2', p_tag.text)
            if match:
                size = match.group(1).replace(',', '')

        listing_type = "For Rent"
        link = 'https://om.opensooq.com' + card.get('href') if card.get('href') else None

        properties["Title"].append(title)
        properties["Location"].append(location)
        properties["Price"].append(price)
        properties["Size"].append(size)
        properties["Listing_Type"].append(listing_type)
        properties["Link"].append(link)

    next_btn = soup.find('a', attrs={'data-id': 'nextPageArrow'})
    if not next_btn:
        print("No next page. Done scraping.")
        break

    page += 1
    time.sleep(1)

df = pd.DataFrame(properties)
df.to_csv('opensooq_for_rent.csv', index=False)






Scraping page 1: https://om.opensooq.com/en/property/property-for-rent?page=1
Scraping page 2: https://om.opensooq.com/en/property/property-for-rent?page=2
Scraping page 3: https://om.opensooq.com/en/property/property-for-rent?page=3
Scraping page 4: https://om.opensooq.com/en/property/property-for-rent?page=4
Scraping page 5: https://om.opensooq.com/en/property/property-for-rent?page=5
Scraping page 6: https://om.opensooq.com/en/property/property-for-rent?page=6
Scraping page 7: https://om.opensooq.com/en/property/property-for-rent?page=7
Scraping page 8: https://om.opensooq.com/en/property/property-for-rent?page=8
Scraping page 9: https://om.opensooq.com/en/property/property-for-rent?page=9
Scraping page 10: https://om.opensooq.com/en/property/property-for-rent?page=10
Scraping page 11: https://om.opensooq.com/en/property/property-for-rent?page=11
Scraping page 12: https://om.opensooq.com/en/property/property-for-rent?page=12
Scraping page 13: https://om.opensooq.com/en/property/prop

In [74]:
# Output as DataFrame
df = pd.DataFrame(properties)
df

| | Title | Location | Price | Size | Listing_Type | Link |
|---|---|---|---|---|---|---|
| 0 | 80 m2 2 Bedrooms Apartments for Rent in Dhofar... | , Salala, 922133XX | 35 OMR | 80 | For Rent | https://om.opensooq.com/en/search/266056301 |
| 1 | عقود ايجار مكاتب مؤقتة وترخيص الانشطة Temporar... | , Amerat, 955469XX | 75 OMR | 20 | For Rent | https://om.opensooq.com/en/search/263485983 |
| 2 | 2 Bedrooms Chalet for Rent in Al Dakhiliya Bidbid | , Bidbid, 966934XX | 40 OMR | 220 | For Rent | https://om.opensooq.com/en/search/259878571 |
| 3 | 100 m2 2 Bedrooms Apartments for Rent in Dhofa... | , Salala, 713317XX | 20 OMR | 100 | For Rent | https://om.opensooq.com/en/search/266550047 |
| 4 | Furnished Daily in Dhofar Salala | , Salala, 990838XX | 50 OMR | | For Rent | https://om.opensooq.com/en/search/266121233 |
| ... | ... | ... | ... | ... | ... | ... |
| 5471 | Flat for rent (inclusive of bills) | , Al Maabilah, 972922XX | 175 OMR | 90 | For Rent | https://om.opensooq.com/en/search/266829563 |
| 5472 | Apartment for sharing only for working ladies ... | , Bosher, 944697XX | 110 OMR | 50 | For Rent | https://om.opensooq.com/en/search/267019863 |
| 5473 | 160 m2 2 Bedrooms Apartments for Rent in Musca... | , Al Maabilah, 791585XX | 250 OMR | 160 | For Rent | https://om.opensooq.com/en/search/266841899 |
| 5474 | 30 m2 Studio Apartments for Rent in Muscat Al ... | , Al Mawaleh, 727503XX | 120 OMR | 30 | For Rent | https://om.opensooq.com/en/search/267035849 |
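
The two sources format prices differently (`150` on Bayut, `35 OMR` on OpenSooq), so the integration step needs a shared numeric format and a way to tell the sources apart. The following is a rough sketch of that idea on plain dictionaries. The helper names `parse_price` and `combine_listings`, the `Source` field, and the sample records with `example.com` links are all hypothetical and are not taken from the notebook.

```python
import re
from typing import Dict, List, Optional


def parse_price(value) -> Optional[float]:
    """Convert values such as '35 OMR' or 150 to a float; None if no number is found."""
    if value is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(value))
    return float(match.group(0).replace(",", "")) if match else None


def combine_listings(sources: Dict[str, List[dict]]) -> List[dict]:
    """Tag each record with its source, normalize Price, and drop repeated links."""
    seen_links = set()
    combined = []
    for source_name, records in sources.items():
        for record in records:
            link = record.get("Link")
            if link is not None and link in seen_links:
                continue  # same listing already collected
            seen_links.add(link)
            combined.append({
                **record,
                "Source": source_name,
                "Price": parse_price(record.get("Price")),
            })
    return combined


# Hypothetical sample records, for illustration only.
sample = {
    "bayut": [{"Price": "150", "Link": "https://example.com/a"}],
    "opensooq": [
        {"Price": "35 OMR", "Link": "https://example.com/b"},
        {"Price": "35 OMR", "Link": "https://example.com/b"},  # duplicate
    ],
}
for row in combine_listings(sample):
    print(row["Source"], row["Price"], row["Link"])
```

With pandas, a comparable result could come from `pd.concat([...], ignore_index=True)` on the two DataFrames followed by `drop_duplicates(subset="Link")`.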
