
## K-Nearest Neighbors and Random Forest in Property Recommendation Systems

The goal of our recommendation system is to suggest properties based on two numerical 
user inputs, Price and Area. The output is a list of similar properties matching these criteria, 
with the following details for each: Property Type (نوع العقار), Location (الموقع), District (الحي), 
Bedrooms (الغرف), Bathrooms (دورات المياه), Price (السعر), and Agency Name. The task is therefore 
a search for the most similar properties, not a prediction task.

### K-Nearest Neighbors (KNN)

KNN is a simple, non-parametric machine learning algorithm used for classification and regression. It identifies the k data points closest to a given input, then assigns the most frequent class among them (classification) or averages their values (regression) [1].

##### Why Is K-Nearest Neighbors (KNN) Best for Property Recommendation Systems?

KNN is a good fit for a property recommendation system. Unlike models such as Linear Regression, 
which predict a specific value like a price, KNN finds the properties most similar to the 
user's inputs, such as price and area.

When a user specifies a desired price and area, KNN locates the most similar properties in the dataset through a series of steps:

1. Choose a value for k, which determines how many similar properties to return.
2. Measure the similarity of each property to the user's input using a distance metric.
3. Identify the k nearest neighbors, that is, the properties that best match the specified price and area.
4. Return those k properties as a list, giving the user several options rather than just one.

This makes KNN an effective method for recommendation systems, as it offers users a range of properties that match their criteria [1].
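
The steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: the `listings` data and the `nearest_properties` helper are hypothetical, and both features are standardized first so that price (hundreds of thousands) does not swamp area (hundreds) in the distance.

```python
import math
import statistics

# Hypothetical listings: (area, price). These values are made up for illustration.
listings = [
    (250, 900_000),
    (300, 1_000_000),
    (320, 950_000),
    (500, 2_500_000),
    (150, 400_000),
]

def nearest_properties(area, price, data, k=3):
    """Return the positions of the k listings closest to (area, price)."""
    areas = [a for a, _ in data]
    prices = [p for _, p in data]
    # Standardize both features so each contributes comparably to the distance
    mu_a, sd_a = statistics.mean(areas), statistics.pstdev(areas)
    mu_p, sd_p = statistics.mean(prices), statistics.pstdev(prices)

    def scale(a, p):
        return ((a - mu_a) / sd_a, (p - mu_p) / sd_p)

    query = scale(area, price)
    # Euclidean distance from the query to every listing
    distances = [math.dist(query, scale(a, p)) for a, p in data]
    # Sort positions by distance and keep the k smallest
    ranked = sorted(range(len(data)), key=lambda i: distances[i])
    return ranked[:k]

top = nearest_properties(area=300, price=980_000, data=listings, k=3)
print([listings[i] for i in top])
```

With these made-up listings, the three returned properties are the ones clustered around 300 in area and roughly one million in price, while the much larger and much smaller listings are left out.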

##### Why Is K-Nearest Neighbors (KNN) Best for This Dataset?

By observing the dataset, we can see that certain features are closely related, which affects how KNN determines similarity among properties. For example, there is a strong correlation between the number of bedrooms and bathrooms: properties with more bedrooms usually have more bathrooms. Price is also related to the number of bedrooms and bathrooms, since properties with more rooms typically have higher prices [2].
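
One way to check such relationships is the Pearson correlation coefficient. The sketch below uses hypothetical bedroom and bathroom counts, not the project's dataset, and a hand-written `pearson` helper; on a real DataFrame, pandas' `DataFrame.corr` computes the same quantity for every pair of numeric columns.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five properties
bedrooms = [2, 3, 4, 5, 6]
bathrooms = [1, 2, 3, 3, 4]

r = pearson(bedrooms, bathrooms)
print(f"bedrooms vs bathrooms: r = {r:.2f}")
```

A value close to 1 indicates that the two features rise together, which is the pattern described above.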

### Random Forest (RF)

Random Forest is a robust machine learning model that works well on structured real estate data. It is an ensemble of decision trees, each trained on a different random subset of the data, and it combines their outputs to produce a more balanced and generalized result. This lets the model capture the relationship between "السعر" (price) and "المساحة" (area), so that users receive property suggestions that match their financial and spatial preferences. Unlike linear models, which assume a linear relationship between price and area, Random Forest can capture non-linear trends, which makes it well suited to real-world real estate markets [3].

##### Why Is Random Forest Suitable for Recommendations?

When it comes to property recommendations, "السعر" (price) and "المساحة" (area) are among the most important factors that affect user choices. Random Forest can weight these attributes through the feature importance it learns while building its trees, which helps it filter and rank properties according to user needs. Because real estate pricing is affected by many factors, combining the outputs of multiple decision trees yields more reliable and balanced recommendations than a single tree. Its robustness to outliers also helps it produce useful suggestions, making it a practical option for recommendation systems [4].
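
As a rough illustration of how a Random Forest weights features, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data and prints its `feature_importances_`. Everything here is an assumption made for the example: the data is randomly generated, the `Noise` column is deliberately unrelated to the label, and the label marks points within a fixed distance of a reference point in (area, price) space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic, already-standardized features (not the project's dataset)
area = rng.normal(size=n)
price = rng.normal(size=n)
noise = rng.normal(size=n)  # unrelated to the label by construction
X = np.column_stack([area, price, noise])

# Label is 1 when the point lies within distance 1 of the origin in (area, price)
y = (np.sqrt(area ** 2 + price ** 2) < 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances are normalized to sum to 1 across features
for name, importance in zip(["Area", "Price", "Noise"], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Because the label depends only on area and price, the forest should assign most of the importance to those two columns and little to the noise column. The circular decision boundary is also an example of a non-linear pattern that a linear model could not represent directly.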

##### Why Is Random Forest Suitable for This Dataset?

In real estate, the connection between "السعر" (price) and "المساحة" (area) can differ greatly depending on "الموقع" (location) and "نوع العقار" (property type). Random Forest can capture these variations when these attributes are used as features, enabling the model to identify distinct pricing trends for different areas and property types. By adjusting its decisions to these differences, Random Forest keeps property recommendations accurate and relevant, whether someone is looking for an apartment, a villa, or a commercial space. This flexibility makes it especially suitable for real estate datasets that involve numerous influencing factors [4].




## Performance Evaluation Metrics  

To assess the effectiveness of our models, we use the accuracy score and the classification report as our primary evaluation metrics [5][6].  

### Why These Metrics?

### Accuracy Score
- Accuracy provides a straightforward measure of overall model performance by calculating the proportion of correct predictions out of all predictions.  
- Since we are dealing with a binary classification problem, accuracy serves as a useful baseline metric to compare the models.  
- This metric is particularly relevant when class distribution is not highly imbalanced, allowing us to assess how well the models differentiate between similar and non-similar properties.
 
It is defined as:  


$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$
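
The formula can be checked with a short sketch. The labels below are hypothetical (1 = similar property, 0 = not similar), and `accuracy` is a hand-written helper; scikit-learn's `accuracy_score`, used later in this notebook, computes the same ratio.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for eight properties
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 predictions match -> 0.75
```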


### Classification Report
While accuracy gives a general performance overview, it does not reveal how well the model handles each class. The classification report provides a more detailed evaluation through:  

- Precision – Measures how many of the predicted similar properties were actually similar, helping assess false positives. It is defined as:  

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

A high precision score indicates that the model makes few false positive predictions.  

- Recall – Measures how well the model identifies all similar properties, ensuring we minimize false negatives. It is defined as:  

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

A high recall score means the model is good at identifying positive instances.  

- F1-Score – Balances precision and recall, making it useful if there are slight class imbalances. It is defined as:  

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

A high F1-score indicates a good balance between precision and recall.
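
The three definitions above can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses hypothetical labels and a hand-written `precision_recall_f1` helper; scikit-learn's `classification_report` reports the same quantities for each class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = similar property, 0 = not similar
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here there are 3 true positives, 2 false positives, and 1 false negative, so precision is 3/5, recall is 3/4, and F1 falls between the two at about 0.67.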


In [9]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import resample

# Load dataset
file_path = "../Dataset/cleaned_dataset.csv"
df = pd.read_csv(file_path)

# Remove non-predictive column
if "Property_ID" in df.columns:
    df.drop(columns=["Property_ID"], inplace=True)

# Define numerical features and store original values before scaling
num_features = ["Area", "Price"]
df["Original_Area"] = df["Area"]
df["Original_Price"] = df["Price"]

# Normalize numerical data
preprocessor = StandardScaler()
df[num_features] = preprocessor.fit_transform(df[num_features])

# Assign similarity labels
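def assign_similarity_labels(df, n_closest=3):
    """Label a row 1 if it is among the n_closest rows (scaled Area/Price distance) of any row."""
    df["Similarity_Label"] = 0
    for _, row in df.iterrows():
        area, price = row["Area"], row["Price"]
        # Euclidean distance from this row to every row; the row itself is at distance 0
        distances = np.sqrt((df["Area"] - area) ** 2 + (df["Price"] - price) ** 2)
        # nsmallest keeps the first occurrences on ties, so exact duplicates can stay at label 0
        closest_indices = distances.nsmallest(n_closest).index
        df.loc[closest_indices, "Similarity_Label"] = 1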

assign_similarity_labels(df)

# Balance the dataset
majority = df[df["Similarity_Label"] == 0]
minority = df[df["Similarity_Label"] == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df = pd.concat([majority, minority_upsampled])

# Split the dataset into training and testing sets
X = df[num_features]
y = df["Similarity_Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Define models
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "Random Forest": RandomForestClassifier(n_estimators=100)
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{'-'*100}\n {name} Performance Metrics:\n{'-'*100}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"{'-'*100}")

# Property recommendation function
def recommend_properties(price, area, model_name, top_n=5):
    input_data = pd.DataFrame([[area, price]], columns=num_features)  # Ensure correct column names
    input_data_scaled = preprocessor.transform(input_data)  # Scale input data
    input_data_scaled = pd.DataFrame(input_data_scaled, columns=num_features)  # Retain column names after scaling
   
    model = models.get(model_name)
   
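    if isinstance(model, KNeighborsClassifier):
        # kneighbors returns row positions within X_train, so map them back to df index labels
        distances, indices = model.kneighbors(input_data_scaled, n_neighbors=top_n)
        recommended_labels = X_train.index[indices.flatten()]
    else:
        # This branch does not query the fitted forest: it ranks all rows
        # by Euclidean distance to the scaled input
        distances = np.sqrt((df["Area"] - input_data_scaled["Area"][0]) ** 2 + (df["Price"] - input_data_scaled["Price"][0]) ** 2)
        recommended_labels = distances.nsmallest(top_n).index

    # Upsampling left repeated index labels in df (identical copies), so keep one row
    # per label; the result can therefore hold fewer than top_n rows
    unique_rows = df[~df.index.duplicated(keep="first")]
    recommended = unique_rows.loc[recommended_labels.unique()].copy()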
    # Rename columns to Arabic for the recommendations output
    recommended = recommended.rename(columns={
        "Property Type": "نوع العقار",
        "Location": "الموقع",
        "District": "الحي",
        "Bedrooms": "الغرف",
        "Bathrooms": "دورات المياه",
        "Original_Area": "المساحة",
        "Original_Price": "السعر",
        "Agency_Name": "الوكالة"
    })

    headers = ["نوع العقار", "الموقع", "الحي", "الغرف", "دورات المياه", "المساحة", "السعر", "الوكالة"]
    results = recommended[headers]
   
    print(f"\n :{model_name} توصيات العقارات المتوافقة مع الميزانية والمساحة المطلوبة باستخدام مودل \n")
    print(" | ".join(headers))
    print("-" * 100)
    for _, row in results.iterrows():
        print(" | ".join(str(x) for x in row.values))
    print("-" * 100)
    return results

# Test recommendations
price_input = 980000 
area_input = 300

for model_name in models.keys():
    recommend_properties(price_input, area_input, model_name=model_name)

----------------------------------------------------------------------------------------------------
 KNN Performance Metrics:
----------------------------------------------------------------------------------------------------
Accuracy: 0.7609

Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.95      0.78        20
           1       0.94      0.62      0.74        26

    accuracy                           0.76        46
   macro avg       0.80      0.78      0.76        46
weighted avg       0.82      0.76      0.76        46

----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
 Random Forest Performance Metrics:
----------------------------------------------------------------------------------------------------
Accuracy: 0.8478

Classification Report:
              precision 

### Performance Comparison

| Model              | Accuracy | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1-Score (Class 0) | F1-Score (Class 1) |
|-------------------|----------|---------------------|---------------------|------------------|------------------|-------------------|-------------------|
| Random Forest | 0.8478 | 0.76                | 0.95                | 0.95             | 0.77             | 0.84              | 0.85              |
| KNN               | 0.7609   | 0.66                | 0.94                | 0.95             | 0.62             | 0.78              | 0.74              |

### Final Decision: Random Forest as the Best Model

After evaluating the models, **Random Forest achieved the highest accuracy (0.8478)**, making it the most suitable choice. 
The key reasons for selecting **Random Forest** are:

- **Highest Accuracy:** With an accuracy of **0.8478**, it outperforms KNN.
- **Better Performance on Class 0:** Its precision for **Class 0** is 0.76 versus 0.66 for KNN, at the same recall (0.95), so fewer properties are wrongly assigned to that class.
- **Best Overall F1-Score:** Its F1-scores (0.84 for Class 0 and 0.85 for Class 1) exceed KNN's (0.78 and 0.74), indicating a better balance between precision and recall.
- **More Even Performance Across Classes:** Its recall for **Class 1** is 0.77 versus 0.62 for KNN, so it differentiates between **Class 0 and Class 1** more evenly, leading to more stable predictions.

## References

[1] GeeksforGeeks, "Recommender Systems using KNN," GeeksforGeeks, [Online]. Available: https://www.geeksforgeeks.org/recommender-systems-using-knn/?ref=asr9. [Accessed: Feb. 10, 2025].

[2] R. K. Halder, M. N. Uddin, M. A. Uddin, S. Aryal, and A. Khraisat, "Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications," Journal of Big Data, vol. 11, no. 1, pp. 1–55, Aug. 2024. [Online]. Available: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-024-00973-y. [Accessed: Feb. 10, 2025].

[3] E2E Networks, "Random Forest Algorithm in Machine Learning: A Guide," E2E Networks Blog, [Online]. Available: https://www.e2enetworks.com/blog/random-forest-algorithm-in-machine-learning-a-guide. [Accessed: Feb. 10, 2025].

[4] IBM, "Random Forest," IBM Think, [Online]. Available: https://www.ibm.com/think/topics/random-forest. [Accessed: Feb. 10, 2025].

[5] N. Ali, D. Neagu, and P. Trundle, "Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets," SN Applied Sciences, vol. 1, no. 12, 2019. [Online]. Available: https://doi.org/10.1007/s42452-019-1356-9. [Accessed: Feb. 20, 2025].

[6] GeeksforGeeks, "Interpreting random forest classification results," GeeksforGeeks, May 27, 2024. [Online]. Available: https://www.geeksforgeeks.org/interpreting-random-forest-classification-results/#model-performance-metrics-for-random-forest-classification. [Accessed: Feb. 20, 2025].