# Honest Realtor: Predicting Accurate Housing Prices in Uzbekistan

## Project Overview
The Honest Realtor project aims to build a machine learning model that accurately predicts housing prices in the Uzbek real estate market. Leveraging data scraped from open web pages, the project seeks to address pricing discrepancies and provide reliable price estimates for apartments in Uzbekistan.

### Dataset
The dataset consists of over 3,000 high-quality records sourced from online ads for apartment sales. These records include key features relevant to housing prices, such as location, size, number of rooms, and additional amenities. 

### Current Progress
- **Data Collection:** Successfully scraped and cleaned data from online real estate platforms.
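- **Model Development:** Developed an initial machine learning model for price prediction, achieving an accuracy of 81%, where accuracy is reported as 100% minus the mean absolute percentage error (MAPE) on the test set.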

### Goals and Next Steps
1. Enhance the current model by:
   - Incorporating additional features from the dataset.
   - Exploring advanced machine learning algorithms.
2. Evaluate and validate the model to improve accuracy beyond 81%.
3. Create an intuitive interface for users to input apartment details and get price predictions.

This project is a step toward building a transparent and accurate pricing tool for the real estate market in Uzbekistan.

### Dependencies
To run this project, install the following dependencies:

- pandas
- numpy
- selenium
- beautifulsoup4
- requests
- scikit-learn
- webdriver-manager
- pickle (built-in with Python, no installation needed)
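
Several of these packages are imported under a different name than the one used to install them (for example `beautifulsoup4` is imported as `bs4`). The following sketch checks which packages are importable in the current environment. The helper name `missing_dependencies` is hypothetical and not part of the project code.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
IMPORT_NAMES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "selenium": "selenium",
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "scikit-learn": "sklearn",
    "webdriver-manager": "webdriver_manager",
}

def missing_dependencies():
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for pip_name, module in IMPORT_NAMES.items()
            if find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```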


### Scraping Real Estate Listings

#### Overview
This script extracts detailed information about real estate listings from a given list of URLs. It parses the webpage content using `BeautifulSoup`, handles missing data gracefully, and saves the extracted information into a CSV file.
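
Handling missing data gracefully comes down to starting every field at a placeholder and overwriting only what the page actually provides. Below is a minimal sketch of that pattern, using plain `(label, value)` pairs in place of parsed HTML. The function `build_record` and the sample pairs are hypothetical and only illustrate the idea; the label mapping mirrors the one used in the script.

```python
MISSING = "Data missing"
COLUMNS = ["Title", "Price", "Rooms", "Size", "Land Area",
           "Floor", "Condition", "Material", "Location"]

# Russian labels shown on the listing page -> column names
LABEL_TO_KEY = {
    "Комнат": "Rooms",
    "Площадь": "Size",
    "Площадь земли": "Land Area",
    "Этаж": "Floor",
    "Ремонт": "Condition",
    "Материал": "Material",
}

def build_record(pairs):
    """Turn (label, value) pairs into a row with one entry per column."""
    record = {column: MISSING for column in COLUMNS}
    for label, value in pairs:
        key = LABEL_TO_KEY.get(label.strip())
        # Unknown labels and empty values leave the placeholder in place
        if key and value.strip():
            record[key] = value.strip()
    return [record[column] for column in COLUMNS]

# Made-up input: two known labels and one label the mapping does not cover
row = build_record([("Комнат", "3"), ("Этаж", " 2/4 "), ("Балкон", "Да")])
print(row)
```

Because every row has the same columns in the same order, rows can be written to a CSV file directly, and missing values remain easy to find later by searching for the placeholder.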

In [4]:
import csv
from bs4 import BeautifulSoup
import requests

def scrape_listing_data(url):
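    # Fetch the page; treat network errors and non-200 responses as a failed scrape
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.content, 'html.parser')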

    # Initialize all information with 'Data missing' to handle missing data gracefully
    data = {
        "Title": 'Data missing',
        "Price": 'Data missing',
        "Rooms": 'Data missing',
        "Size": 'Data missing',
        "Land Area": 'Data missing',
        "Floor": 'Data missing',
        "Condition": 'Data missing',
        "Material": 'Data missing',
        "Location": 'Data missing'
    }

    # Extract title
    title_tag = soup.find('h1', class_='MuiTypography-root MuiTypography-h2 mui-style-1tyknu')
    if title_tag:
        data["Title"] = title_tag.text.strip()

    # Extract price
    price_div = soup.find('div', class_='MuiTypography-root MuiTypography-h2 mui-style-86wpc3')
    if price_div:
        data["Price"] = price_div.text.strip()

    # Extract location
    location_div = soup.find('div', class_='MuiTypography-root MuiTypography-body2 mui-style-31fjox')
    if location_div:
        data["Location"] = location_div.text.strip()

    # Mapping for labels to data keys
    label_to_key = {
        "Комнат": "Rooms",
        "Площадь": "Size",
        "Площадь земли": "Land Area",
        "Этаж": "Floor",
        "Ремонт": "Condition",
        "Материал": "Material"
    }

    # Find all markers and the corresponding value
    markers = soup.find_all('div', class_='MuiTypography-root MuiTypography-overline mui-style-1xqesu')
    for marker in markers:
        label = marker.text.strip()
        key = label_to_key.get(label)
        if key:
            value_div = marker.find_next('div', class_='MuiTypography-root MuiTypography-subtitle2 mui-style-fu5la2')
            if value_div:
                data[key] = value_div.text.strip()

    return [data[key] for key in ["Title", "Price", "Rooms", "Size", "Land Area", "Floor", "Condition", "Material", "Location"]] + [url]


# Read the listing URLs (one per line), skipping blank lines
with open('h_sales_listing_urls.txt', 'r') as file:
    urls = [line.strip() for line in file if line.strip()]

# Scrape every URL; listings that could not be fetched are skipped
data = []
for url in urls:
    listing_data = scrape_listing_data(url)
    if listing_data:
        data.append(listing_data)

# Write the data to a CSV file
output_file = 'listing_data.csv'
headers = ["Title", "Price", "Rooms", "Size", "Land Area", "Floor", "Condition", "Material", "Location", "URL"]
with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(headers)
    writer.writerows(data)

print(f"Data scraping complete. Check the {output_file} file for the output.")

Data scraping complete. Check the listing_data.csv file for the output.


### Processing Real Estate Data

#### Overview
This script processes real estate data from a CSV file to clean and transform it into a more structured and analyzable format. The steps include removing unnecessary columns, cleaning text-based fields, creating new columns, and encoding categorical data into binary features.
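
The cleaning script strips units from `Price` and `Size` but leaves the values as text. A possible follow-up, sketched here in plain Python, converts the cleaned strings to numbers and maps the `Data missing` placeholder to `None`. The helper names `parse_number` and `split_floor` are hypothetical, and the sample strings assume the raw formats implied by the cleaning steps (such as `112 000 у.е.`, `74.75 м²`, and `4/10`).

```python
def parse_number(text, unit):
    """Strip a unit and spaces, then convert to float (None if not numeric)."""
    cleaned = text.replace(unit, "").replace("\u00a0", "").replace(" ", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. the 'Data missing' placeholder

def split_floor(text):
    """Split 'current/total' into two ints; return (None, None) if malformed."""
    parts = [part.strip() for part in text.split("/")]
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return int(parts[0]), int(parts[1])
    return None, None

print(parse_number("112 000 у.е.", "у.е."))   # 112000.0
print(parse_number("74.75 м²", "м²"))         # 74.75
print(parse_number("Data missing", "у.е."))   # None
print(split_floor("4/10"))                    # (4, 10)
print(split_floor("Data missing"))            # (None, None)
```

Numeric columns are what a regression model ultimately needs, so doing this conversion once during processing avoids repeating it in the training step.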

In [7]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('listing_data.csv')

# Create a copy of the DataFrame for safety
df_copy = df.copy()

# Perform the specified operations on the copy

# 1. Remove the 'Title' and 'Land Area' columns
df_copy.drop(columns=['Title', 'Land Area'], inplace=True)

# 2. Remove " у.е." from 'Price' and keep only the number (spaces are removed too).
#    regex=False makes the "." in "у.е." match literally on every pandas version.
df_copy['Price'] = df_copy['Price'].str.replace(' у.е.', '', regex=False).str.replace(' ', '', regex=False)

# 3. Remove " м²" from 'Size'
df_copy['Size'] = df_copy['Size'].str.replace(' м²', '', regex=False)

# 4. Split 'Floor' into 'Current Floor' and 'Total Floors', then remove the original 'Floor' column
df_copy[['Current Floor', 'Total Floors']] = df_copy['Floor'].str.split('/', expand=True)
df_copy.drop(columns=['Floor'], inplace=True)

# Convert 'Condition' and 'Material' columns into binary (one-hot encoded) columns
condition_dummies = pd.get_dummies(df_copy['Condition'], prefix='Cond')
material_dummies = pd.get_dummies(df_copy['Material'], prefix='Mat')

# Concatenate the new binary columns with the original DataFrame
df_copy = pd.concat([df_copy, condition_dummies, material_dummies], axis=1)

# Drop the original 'Condition' and 'Material' columns as they are now encoded
df_copy.drop(columns=['Condition', 'Material'], inplace=True)

# Assign a unique ID to each record
df_copy['Record ID'] = range(1, len(df_copy) + 1)


# Display the first rows of the processed DataFrame
print(df_copy.head())

# Save the processed data. 'modified_data.csv' is the file read by the
# coordinate-enrichment step; 'final_modified_data.csv' is kept as a second copy.
df_copy.to_csv('final_modified_data.csv', index=False)
df_copy.to_csv('modified_data.csv', index=False)

      Price Rooms   Size                                           Location  \
0    112000     5    100  город Ташкент, Алмазарский район, массив Бируни-1   
1     78000     3     70    город Ташкент, Юнусабадский район, 18-й квартал   
2     35500     2     50      город Ташкент, Чиланзарский район, Чиланзар-6   
3    103500     2     70  город Ташкент, Яккасарайский район, махалля Ди...   
4  27460.38     2  74.75  Сурхандарьинская область, Термез, улица Ислама...   

                                 URL Current Floor Total Floors  \
0  https://uybor.uz/listings/1216080             4            4   
1  https://uybor.uz/listings/1214327             2            4   
2   https://uybor.uz/listings/799690             1            4   
3  https://uybor.uz/listings/1214939             1           10   
4  https://uybor.uz/listings/1220246             5            5   

   Cond_Data missing  Cond_Авторский проект  Cond_Евроремонт  Cond_Средний  \
0              False                  False 

### Web Scraping and Data Enrichment

#### Code Description
The Python code enriches the dataset by scraping geographic coordinates (Longitude and Latitude) for property listings using Selenium WebDriver with Yandex Maps. Key functionalities include:
- Loading the dataset and validating required columns.
- Automating searches for property locations.
- Extracting and saving coordinates into the dataset.
- Handling errors to maintain data integrity.
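
Two details of this step can be sketched in isolation: turning the coordinates text into validated numbers, and looking up each distinct `Location` string only once. The sketch assumes the text looks like `41.311081, 69.240562`. The names `parse_pair`, `geocode_unique`, and the `lookup` callable (standing in for the Selenium search) are hypothetical and not part of the original script. The numbers are returned in the order they appear, leaving the latitude/longitude interpretation to the caller.

```python
def parse_pair(text):
    """Parse 'a, b' into two floats; return None if the text is malformed."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        return None
    try:
        first, second = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # Any valid latitude/longitude pair stays within these bounds
    if abs(first) > 180 or abs(second) > 180:
        return None
    return first, second

def geocode_unique(locations, lookup):
    """Call lookup once per distinct location and reuse the result."""
    cache = {}
    for location in locations:
        if location not in cache:
            cache[location] = parse_pair(lookup(location))
    return [cache[location] for location in locations]

# Usage with a fake lookup that records how often it is called
calls = []

def fake_lookup(location):
    calls.append(location)
    return "41.311081, 69.240562"  # made-up response, same for every query

result = geocode_unique(["Tashkent", "Tashkent", "Termez"], fake_lookup)
print(result[0], len(calls))  # (41.311081, 69.240562) 2
```

Listings in the same neighborhood often share an identical location string, so caching by string can reduce the number of browser searches considerably.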

In [None]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Setup the Selenium WebDriver using WebDriver Manager
options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # Uncomment for headless mode
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    # Load the CSV file into a DataFrame
    df = pd.read_csv('modified_data.csv')

    # Ensure the DataFrame has 'Longitude' and 'Latitude' columns
    if 'Longitude' not in df.columns:
        df['Longitude'] = None
    if 'Latitude' not in df.columns:
        df['Latitude'] = None

    # Process one batch of rows (positions 2001 to 3223); adjust the slice to cover other rows
    for index, row in df.iloc[2001:3224].iterrows():
        try:
            driver.get("https://yandex.com/maps")
            WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.input__control")))

            search_box = driver.find_element(By.CSS_SELECTOR, "input.input__control")
            search_box.clear()  # Ensure the search box is clear before typing
            search_box.send_keys(row['Location'])
            search_box.send_keys(Keys.ENTER)

            # Wait until the coordinates badge of the search result is visible
            coordinates_element = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, ".toponym-card-title-view__coords-badge"))
            )
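            # The badge text is expected to hold two comma-separated numbers.
            # Check which one comes first on the page: if the badge shows
            # "latitude, longitude", swap the two names in the assignment below.
            coordinates = coordinates_element.text.split(', ')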
            if len(coordinates) == 2:
                longitude, latitude = coordinates
                df.at[index, 'Longitude'] = longitude
                df.at[index, 'Latitude'] = latitude
            else:
                raise ValueError("Coordinates format not recognized.")

        except Exception as e:
            print(f"Error processing row {index}: {e}")
            df.at[index, 'Longitude'] = 'Data Missing'
            df.at[index, 'Latitude'] = 'Data Missing'

    # Save the DataFrame after processing all rows
    df.to_csv('updated_data_with_coordinates_6.csv', index=False)

finally:
    driver.quit()


### Model Training

This script trains a machine learning model to predict real estate prices 
in Uzbekistan. It includes:
1. Loading and preprocessing the dataset.
2. Splitting the data into training and test sets.
3. Training a regression model (the code below uses a Random Forest).
4. Evaluating the model using metrics such as Mean Absolute Error (MAE) 
   and Mean Absolute Percentage Error (MAPE).
5. Exporting predictions for analysis.
6. Saving the trained model for future predictions.
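
The accuracy figure reported by this script is 100 minus MAPE expressed as a percentage. A small worked example with made-up prices shows how MAE and this accuracy figure relate. The function name `accuracy_from_mape` and the numbers are illustrative only, and the sketch assumes no actual price is zero.

```python
def accuracy_from_mape(actual, predicted):
    """Return (MAE, accuracy %) where accuracy = 100 - MAPE * 100."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mae = sum(abs_errors) / len(abs_errors)
    # MAPE: mean of each absolute error divided by the actual value
    mape = sum(err / abs(a) for err, a in zip(abs_errors, actual)) / len(actual)
    return mae, 100 - mape * 100

# Made-up prices: each prediction is off by 10% of the actual value
actual = [100_000, 50_000]
predicted = [90_000, 55_000]
mae, accuracy = accuracy_from_mape(actual, predicted)
print(f"MAE: {mae:.0f}, accuracy: {accuracy:.1f}%")  # MAE: 7500, accuracy: 90.0%
```

Read this way, an accuracy of 81% corresponds to predictions that miss the actual price by 19% on average.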

In [None]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Custom function to calculate Mean Absolute Percentage Error
def mean_absolute_percentage_error(y_true, y_pred):
    # Avoid division by zero by replacing zeros with a small value
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    epsilon = np.finfo(np.float64).eps  # Small constant
    y_true_modified = np.where(y_true == 0, epsilon, y_true)  # Replace zeros
    mape = np.mean(np.abs((y_true - y_pred) / y_true_modified))
    return mape

# Load the dataset
data = pd.read_csv("real_estate_data.csv")

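# Preprocessing: define features and target variable
# (example column names; adjust them to match the actual dataset)
features = ['size', 'location', 'number_of_rooms', 'year_built']
target = 'price'
# One-hot encode the text-valued 'location' column, because the
# random forest in scikit-learn requires numeric input
X = pd.get_dummies(data[features], columns=['location'])
y = data[target]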

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestRegressor(random_state=42, n_estimators=100)
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, predictions)
# Accuracy is reported as 100 - MAPE, with MAPE expressed as a percentage
accuracy = 100 - mean_absolute_percentage_error(y_test, predictions) * 100

# Output evaluation metrics
print(f"Mean Absolute Error: {mae}")
print(f"Accuracy: {accuracy:.2f}%")

# Export predictions for analysis
results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
results.to_csv("test_predictions.csv", index=False)
print("Results exported to test_predictions.csv.")

# Save the trained model
with open("trained_model.pkl", "wb") as file:
    pickle.dump(model, file)
print("Trained model saved as 'trained_model.pkl'.")