## Performance Evaluation Metrics  

To assess the effectiveness of our models, we use two primary evaluation metrics: the accuracy score and the classification report.  

## Why These Metrics?

### Accuracy Score
- Accuracy provides a straightforward measure of overall model performance by calculating the proportion of correct predictions out of all predictions.  
- Since we are dealing with a binary classification problem, accuracy serves as a useful baseline metric to compare the models.  
- Accuracy is most informative when the class distribution is not highly imbalanced. In that case it shows how well the models differentiate between similar and non-similar properties.
 
It is defined as:  


$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$
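
As a minimal sketch of the formula above, accuracy can be computed by hand from two label lists. The function name `accuracy` and the toy labels are illustrative assumptions, not values taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = similar, 0 = not similar.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 predictions are correct -> 0.75
```

This should agree with `sklearn.metrics.accuracy_score` on the same inputs.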


### Classification Report
While accuracy gives a general performance overview, it does not reveal how well the model handles each class. The classification report provides a more detailed evaluation through:  

- Precision – Measures how many of the predicted similar properties were actually similar, helping assess false positives. It is defined as:  

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

A high precision score indicates that the model makes few false positive predictions.  

- Recall – Measures how many of the actually similar properties the model identified, helping assess false negatives. It is defined as:  

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

A high recall score means the model is good at identifying positive instances.  

- F1-Score – The harmonic mean of precision and recall, which balances the two and is useful when there are slight class imbalances. It is defined as:  

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

A high F1-score indicates a good balance between precision and recall.
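
The three definitions above can be checked with a short worked example. This is a sketch under stated assumptions: the helper name `precision_recall_f1` and the toy labels are hypothetical, and the positive class is taken to be `1` ("similar").

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: TP = 2, FP = 1, FN = 2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7, which lies between the two and closer to the lower value.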



### K-Nearest Neighbors and Random Forest in Property Recommendation Systems

The goal of the recommendation system is to suggest properties based on two numerical user inputs, Price and Area. The output is a list of similar properties matching the criteria, with the following details: Property Type (نوع العقار), Location (الموقع), District (الحي), Bedrooms (الغرف), Bathrooms (دورات المياه), Price (السعر), and Agency Name. The task is therefore a search for the most similar properties rather than a prediction task.

#### K-Nearest Neighbors (KNN)

KNN is a simple, non-parametric machine learning algorithm used for classification and regression. It identifies the k data points closest to a given input and then either assigns the most frequent class among them (classification) or averages their values (regression) [1].

##### Why Is K-Nearest Neighbors (KNN) Best for Property Recommendation Systems?

KNN is a great choice for a property recommendation system. Unlike models such as Linear Regression, which predict a specific value (for example, a price), KNN finds the properties most similar to the user's inputs, such as price and area. When a user specifies a desired price and area, KNN proceeds as follows:

1. Choose a value for k, which determines how many similar properties to find.
2. Measure the similarity of each property to the user's input using a distance metric.
3. Identify the k nearest neighbors, that is, the properties that best match the specified price and area.
4. Return those k properties as a list, providing multiple options rather than just one.

This makes KNN an effective method for recommendation systems, as it offers users a range of property choices that best match their criteria [1].
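
The search described above can be sketched in a few lines of plain Python. This is an illustration rather than the notebook's implementation: the function `k_nearest_properties`, the dictionary keys, and the listings are hypothetical, and the features are left unscaled for readability (in practice price and area should be brought to a comparable scale first).

```python
import math


def k_nearest_properties(properties, price, area, k=3):
    """Return the k properties closest to (price, area) by Euclidean distance."""
    def distance(prop):
        return math.dist((prop["price"], prop["area"]), (price, area))

    return sorted(properties, key=distance)[:k]


# Hypothetical listings, not taken from the dataset.
listings = [
    {"id": "A", "price": 950, "area": 310},
    {"id": "B", "price": 1200, "area": 280},
    {"id": "C", "price": 1020, "area": 295},
    {"id": "D", "price": 400, "area": 120},
    {"id": "E", "price": 1100, "area": 350},
]

nearest = k_nearest_properties(listings, price=1000, area=300, k=3)
print([prop["id"] for prop in nearest])  # ['C', 'A', 'E']
```

The result is a ranked list of k candidates, which matches the recommendation use case better than a single predicted value.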

##### Why Is K-Nearest Neighbors (KNN) Best for This Dataset?

Observing the dataset, we can see that certain features are closely related, which affects how KNN determines similarity among properties. For example, there is a strong correlation between the number of bedrooms and the number of bathrooms: properties with more bedrooms usually have more bathrooms. Price is also related to the number of bedrooms and bathrooms, so properties with more rooms typically have higher prices [2].
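
A correlation of this kind can be quantified with the Pearson coefficient. The sketch below uses made-up bedroom and bathroom counts purely for illustration, and the helper name `pearson` is an assumption. On the real dataset, `df[["Bedrooms", "Bathrooms", "Price"]].corr()` would be the usual route, assuming those column names exist.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts, not taken from the dataset.
bedrooms = [1, 2, 3, 4, 5]
bathrooms = [1, 1, 2, 3, 3]

print(round(pearson(bedrooms, bathrooms), 3))  # 0.949
```

A value close to 1 indicates that the two features rise together, as described above.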

#### Random Forest (RF)

Random Forest is a robust machine learning model that works well on structured data such as real estate listings. It is an ensemble of decision trees, each trained on a different random subset of the data, and it combines their outputs to produce a more balanced and generalized result. This enables the model to capture intricate relationships between "السعر" (price) and "المساحة" (area), so that users receive property suggestions that align with their financial and spatial preferences. Unlike models that assume a linear relationship between price and area, Random Forest can capture non-linear patterns, which makes it well suited to the complexities of real-world real estate markets [3].
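
The ensemble idea can be illustrated with a deliberately simplified sketch. Each "tree" here is a one-split decision stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. All names and data are hypothetical; a real Random Forest, such as scikit-learn's `RandomForestClassifier`, grows full decision trees and is considerably more capable.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree: choose the threshold on `feature` with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for label_above in (0, 1):  # class predicted when the value exceeds the threshold
            errors = 0
            for row, label in zip(rows, labels):
                predicted = label_above if row[feature] > threshold else 1 - label_above
                errors += predicted != label
            if best is None or errors < best[0]:
                best = (errors, threshold, label_above)
    return feature, best[1], best[2]


def predict_stump(stump, row):
    feature, threshold, label_above = stump
    return label_above if row[feature] > threshold else 1 - label_above


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical scaled (area, price) pairs with binary labels.
rows = [(1, 1), (2, 2), (3, 1), (8, 9), (9, 8), (10, 10)]
labels = [0, 0, 0, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (10, 10)))
```

Because every stump sees a different sample, individual mistakes tend to be outvoted, which is the source of the ensemble's stability.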

##### Why Is Random Forest Suitable for Recommendations?

When it comes to property recommendations, "السعر" (price) and "المساحة" (area) are among the most important factors that affect user choices. Random Forest can weigh these attributes through its feature importance estimates, which helps it filter and rank properties according to user needs. Because real estate pricing is affected by many factors, combining the outputs of multiple decision trees yields more reliable and balanced recommendations. Its relative robustness to outliers also makes it a practical option for recommendation systems [4].

##### Why Is Random Forest Suitable for This Dataset?

In real estate, the relationship between "السعر" (price) and "المساحة" (area) can differ greatly depending on "الموقع" (location) and "نوع العقار" (property type). When these attributes are included as features, Random Forest can capture such variations and identify distinct pricing trends for different areas and property types. This helps keep recommendations accurate and relevant, whether someone is looking for an apartment, a villa, or a commercial space in different locations, and it makes the model especially suitable for real estate datasets with numerous influencing factors [4].




import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
file_path = "Dataset/cleaned_dataset.csv"
df = pd.read_csv(file_path)

# Drop the "Property_ID" column as it's an identifier not useful for modeling.
df = df.drop(columns=["Property_ID"], errors='ignore')

# Define the numerical features which we will use for analysis and predictions.
num_features = ["Area", "Price"]

# These columns will keep the original, unscaled values to display later in the recommendations.
df["Original_Area"] = df["Area"]
df["Original_Price"] = df["Price"]

# A preprocessing pipeline is created using a StandardScaler to normalize the 'Area' and 'Price' columns for consistent scaling of the data.
preprocessor = Pipeline([
    ('scaler', StandardScaler())  # This scales the features to have zero mean and unit variance.
])

# Normalize the numerical features (Area and Price).
df[num_features] = preprocessor.fit_transform(df[num_features])

# Generate similarity labels for training. A property is labeled 1 ("similar") when at least one
# other property lies within `threshold` of it in scaled (Area, Price) space, and 0 otherwise.
def assign_similarity_labels(df, threshold=0.1):
    coords = df[["Area", "Price"]].to_numpy()
    labels = np.zeros(len(df), dtype=int)  # Default label is "Not similar".
    for pos in range(len(df)):
        # Calculate the Euclidean distance between this property and all others.
        distances = np.sqrt(((coords - coords[pos]) ** 2).sum(axis=1))
        distances[pos] = np.inf  # Exclude the property itself (distance 0).
        labels[pos] = int(distances.min() <= threshold)
    df["Similarity_Label"] = labels

assign_similarity_labels(df)

# We will split the data into a training set (80%) and a testing set (20%) for model evaluation.
X = df[num_features]
y = df["Similarity_Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# We will train three models: KNN, Random Forest, and Gradient Boosting.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="distance"),  # Closer neighbors get larger voting weights.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)  # Fixed seed for reproducibility.
}


# Train each model and evaluate it using accuracy and classification report.
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model on the training set.
    y_pred = model.predict(X_test)  # Make predictions on the test set.
    
    # Output the accuracy score and classification report for each model with clear separation
    print(f"\n{'-'*100}")
    print(f" {name} Performance Metrics:")
    print(f"{'-'*100}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print(f"{'-'*100}")

# This function will return a list of recommended properties based on the price and area specified by the user.
def recommend_properties(price, area, model_name, top_n=5):
    input_data = pd.DataFrame([[area, price]], columns=num_features)
    input_data_scaled = pd.DataFrame(preprocessor.transform(input_data), columns=num_features)

    model = models.get(model_name)

    if model is None:
        print(f"🚨 Model '{model_name}' not found!")
        return None

    # Distance from every property to the user's input in scaled (Area, Price) space.
    distances = np.sqrt(((X - input_data_scaled.iloc[0]) ** 2).sum(axis=1)).to_numpy()

    if isinstance(model, KNeighborsClassifier):
        # For KNN: rank properties purely by closeness to the input.
        scores = 1 / (distances + 1e-5)

    elif hasattr(model, "predict_proba"):
        # The predicted probability of the "similar" class depends only on each property, not on
        # the user's input, so it is weighted by closeness to the input (an assumed scoring rule).
        proba = model.predict_proba(X)
        classes = list(model.classes_)
        similar_prob = proba[:, classes.index(1)] if 1 in classes else np.zeros(len(X))
        scores = similar_prob / (distances + 1e-5)

    else:
        # Fallback: use negative Euclidean distance, so closer properties score higher.
        scores = -distances

    # Sort by highest similarity to the input and keep the top_n properties.
    recommended_indices = np.argsort(-scores)[:top_n]
    recommended = df.iloc[recommended_indices].copy()

    # Rename columns to Arabic for display.
    recommended = recommended.rename(columns={
        "Property Type": "نوع العقار",
        "Location": "الموقع",
        "District": "الحي",
        "Bedrooms": "الغرف",
        "Bathrooms": "دورات المياة",
        "Original_Area": "المساحة",
        "Original_Price": "السعر",
        "Agency_Name": "الوكالة"
    })

    headers = ["نوع العقار", "الموقع", "الحي", "الغرف", "دورات المياة", "المساحة", "السعر", "الوكالة"]
    results = recommended[headers]

    print(f"\n توصيات العقارات المتوافقة مع الميزانية والمساحة المطلوبة باستخدام مودل {model_name}:\n")
    print(" | ".join(headers))
    print("-" * 100)
    for _, row in results.iterrows():
        print(" | ".join(str(x) for x in row.values))
    print("-" * 100)

    return results

# Example usage of the property recommendation system:
price_input = 1000
area_input = 300

# Loop through all models and print recommendations for each one.
for model_name in models.keys():
    recommend_properties(price_input, area_input, model_name=model_name)
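
The scaling step in the code above standardizes `Area` and `Price` to zero mean and unit variance, and the same fitted statistics are then applied to the user's input. The sketch below reproduces that calculation by hand for one feature. The helper names `fit_scaler` and `transform` and the sample areas are hypothetical. Like `StandardScaler`, it uses the population standard deviation.

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation of a feature."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(value, mean, std):
    """Standardize a single value with previously fitted statistics."""
    return (value - mean) / std if std else 0.0


# Hypothetical areas, not taken from the dataset.
areas = [100, 200, 300, 400, 500]
mean, std = fit_scaler(areas)

# The user's input must be scaled with the statistics fitted on the dataset,
# not with statistics computed from the input itself.
print(transform(300, mean, std))            # 0.0 (the input equals the mean)
print(round(transform(500, mean, std), 4))  # 1.4142
```

Without this step, a feature with a larger numeric range (typically price) would dominate the Euclidean distance.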



## References

[1] GeeksforGeeks, "Recommender Systems using KNN," GeeksforGeeks, Feb. 27, 2025. [Online]. Available: https://www.geeksforgeeks.org/recommender-systems-using-knn/?ref=asr9. [Accessed: Feb. 27, 2025].

[2] R. K. Halder, M. N. Uddin, M. A. Uddin, S. Aryal, and A. Khraisat, "Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications," Journal of Big Data, vol. 11, no. 1, pp. 1–55, Aug. 2024. [Online]. Available: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-024-00973-y. [Accessed: Feb. 27, 2025].

[3] E2E Networks, "Random Forest Algorithm in Machine Learning: A Guide," E2E Networks Blog, [Online]. Available: https://www.e2enetworks.com/blog/random-forest-algorithm-in-machine-learning-a-guide. [Accessed: Feb. 27, 2025].

[4] IBM, "Random Forest," IBM Think, [Online]. Available: https://www.ibm.com/think/topics/random-forest. [Accessed: Feb. 27, 2025].


