## Rent Price Predictor App

The real estate market is complex, and the price estimators available on rental sites are often neither accurate nor easy to access. For people moving to a new city, especially young people and those looking for room rentals, finding a fairly priced place can be daunting. Recognizing this need, we developed the "Rent Price Predictor App."

This digital tool was designed to provide rental price estimates, eliminating the need to rely solely on traditional site estimators. Instead, our app provides a tailor-made solution based on the specific features provided by the user.

Created with Streamlit, the "Rent Price Predictor App" boasts a user-friendly interface that guides the user through a series of questions about either rooms or houses. Questions range from size and location to available amenities.

After collecting the information, the app uses a pre-trained machine learning model to provide a price estimate. We trained and compared three widely used machine learning algorithms (Linear Regression, Random Forest, and XGBoost) and kept the one with the lowest error on the test data.

The code is organized around three main functions:

- main(): Controls the app interface.
- room_questions(): Gathers information about rooms.
- house_questions(): Gathers information about houses.

At the end of the process, users receive a price estimate, aiding them in making an informed decision. This is especially valuable for young adults and new city dwellers who might not be familiar with the average prices in the local market.

The "Rent Price Predictor App" is not just a price prediction tool but a solution designed to make the experience of renting a room or a house more transparent and less intimidating.


## Data Collection
how we collect
where
why

In [None]:
df = pd.read_csv('your_dataset.csv')




## Data Cleaning and Splitting (Training and Test)

In [4]:
X = df.drop('target_column', axis=1)
y = df['target_column']
# Fix the seed so the train/test split is reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

NameError: name 'X' is not defined

In [1]:
from math import sqrt

import pandas as pd
import xgboost as xgb
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor  # needed by random_forest_tuning()
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [2]:

results = {} 

FileNotFoundError: [Errno 2] No such file or directory: 'your_dataset.csv'

## Linear Regression

Linear regression is a simple yet powerful statistical technique used to model and analyze the relationships between two variables. With roots tracing back to the work of the English polymath Sir Francis Galton, linear regression has been a foundational tool in statistics and machine learning.

At its core, linear regression attempts to find the best straight line that describes the relationship between an independent (or predictor) variable and a dependent variable. This "best straight line" is determined using the least squares method, which tries to minimize the sum of the squares of the errors between the observed data points and the points predicted by the regression line.

The equation for this straight line is usually expressed as y=mx+b, where:

y is the dependent variable or what we are trying to predict.

x is the independent variable or the predictor.

m is the slope of the line.

b is the y-axis intercept.
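
As a minimal sketch of the least squares method described above, the slope and intercept of a single-predictor line can be computed in closed form. The function name fit_line and the toy data are assumptions made for illustration; they are not part of the project code.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```

With several predictors, as in the rental data, the same idea generalizes to one coefficient per feature, which is what LinearRegression from scikit-learn fits.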

Implementing the Linear Regression Function:

In the context of the provided code, the function step_forward_regression() carries out the following steps:

- Preparation: Initially, the function sets up a linear regression model.
- Feature Selection: It uses the "step-forward" method, a sequential selection technique where we start with no features and keep adding those that provide the best performance boost for the model.
- Training: With the best features selected, the model is trained using the training data.
- Prediction and Evaluation: The trained model is used to make predictions on the test data. The model's performance is then assessed by computing the Mean Squared Error (MSE) between the predictions and the true values.
- Return: The function returns a dictionary containing the trained model, the MSE, and the selected features.

This approach provides a systematic way to select the most relevant features for linear regression and to evaluate the resulting model on data it has not seen during training.

In [None]:
def step_forward_regression(X_train, y_train, X_test, y_test):
    """Select features by step-forward selection, then fit and evaluate a linear model."""

    # Step-forward selection: start with no features and add them one at a time,
    # keeping the subset with the best cross-validated score.
    sfs = SFS(LinearRegression(),
              k_features='best',
              forward=True,
              scoring='neg_mean_squared_error',
              cv=5)

    sfs = sfs.fit(X_train, y_train)

    selected_features = sfs.k_feature_names_

    # Train the model on the selected features only
    model = LinearRegression().fit(X_train[list(selected_features)], y_train)

    # Evaluate the model on the test set
    y_pred = model.predict(X_test[list(selected_features)])
    mse = mean_squared_error(y_test, y_pred)

    print(f"Selected Features: {selected_features}")
    print(f"MSE: {mse}")

    return {"model": model, "mse": mse, "features": selected_features}




In [None]:
results["Step_Forward_Regression"] = step_forward_regression(X_train, y_train, X_test, y_test)


## Random Forests

Random Forests are a machine learning method used for tasks such as classification and regression. They work by constructing a "forest" of decision trees during training and predicting the mode (for classification) or mean (for regression) of the individual trees' outputs.

A few key characteristics and workings of Random Forest include:

Random Splitting Method: Random forests, unlike standard decision trees, utilize a random subset of features for each split. This method introduces diversity, making the model more robust and less prone to overfitting.

Combining Trees: Multiple trees are built, and the final prediction is an aggregation, either by majority voting (classification) or averaging (regression).

Out-of-Bag (OOB) Estimation: Trees are trained on bootstrapped samples, and the unselected samples, called OOB samples, gauge the model's generalization error.

Feature Importance: Random forests can indicate the significance of variables by assessing the drop in accuracy due to permutations of each variable's values.
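
The bootstrap sampling, out-of-bag rows, and averaging described above can be illustrated without any library. This is only a sketch: the helper bootstrap_indices and the three tree outputs are hypothetical, and no real trees are trained here.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 1000

in_bag = bootstrap_indices(n_samples, rng)
# Rows never drawn for this tree are its out-of-bag (OOB) rows.
oob = sorted(set(range(n_samples)) - set(in_bag))
oob_fraction = len(oob) / n_samples
print(f"OOB fraction: {oob_fraction:.3f}")

# For regression, the forest prediction is the mean of the trees' outputs.
tree_predictions = [1200.0, 1350.0, 1275.0]  # hypothetical outputs of 3 trees
forest_prediction = sum(tree_predictions) / len(tree_predictions)
print(forest_prediction)
```

On average a bootstrap sample leaves out a fraction of about 1/e (roughly 37%) of the rows, which is why each tree has a sizeable OOB set available for estimating generalization error.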

Key hyperparameters in Random Forest include:

n_estimators: Number of trees to be constructed.

max_features: Maximum number of features considered during a split.

max_depth: Maximum tree depth.

min_samples_split and min_samples_leaf: Minimum number of samples required to split an internal node and to be at a leaf node, respectively.

bootstrap: Whether to use bootstrap sampling.

The inception of random forests traces back to Leo Breiman, combining methods similar to CART with random node optimization and bagging.

For hyperparameter tuning, a function like random_forest_tuning() can utilize techniques like RandomizedSearchCV to search for optimal hyperparameters. The function would assess hyperparameters such as n_estimators, max_features, and max_depth among others to optimize the Random Forest model's performance.
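
The idea behind a randomized search can be sketched in a few lines: sample candidate settings at random, score each one, and keep the best. The names random_search, param_space, and toy_score below are hypothetical; toy_score is a made-up stand-in for the cross-validated negative MSE that RandomizedSearchCV would compute for each candidate.

```python
import random


def random_search(score_fn, param_space, n_iter, seed=42):
    """Sample n_iter random parameter settings and return the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in param_space.items()}
        score = score_fn(params)  # higher is better, as with neg_mean_squared_error
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# scipy's randint(50, 200) excludes the upper bound, hence 199 and 39 here.
param_space = {
    "n_estimators": lambda rng: rng.randint(50, 199),
    "max_depth": lambda rng: rng.randint(1, 39),
}


def toy_score(params):
    """Hypothetical score that happens to peak at max_depth == 10."""
    return -((params["max_depth"] - 10) ** 2)


best_params, best_score = random_search(toy_score, param_space, n_iter=100)
print(best_params, best_score)
```

Unlike a grid search, the number of candidates evaluated is fixed by n_iter rather than by the size of the search space, which keeps the cost predictable when several hyperparameters are tuned together.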

In [None]:
def random_forest_tuning(X_train, y_train, X_test, y_test):
    rf = RandomForestRegressor(random_state=42)
    
    param_dist = {
        'n_estimators': randint(50, 200),
        # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) was its meaning for regressors
        'max_features': [1.0, 'sqrt', 'log2'],
        'max_depth': randint(1, 40)
    }

    random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error', random_state=42)

    random_search.fit(X_train, y_train)

    best_random_params = random_search.best_params_

    best_random = random_search.best_estimator_
    random_pred = best_random.predict(X_test)
    mse = mean_squared_error(y_test, random_pred)

    print(f'Best Parameters: {best_random_params}')
    print(f'MSE of Best Model: {mse}')
    
    return {"model": best_random, "mse": mse, "best_params": best_random_params}

In [None]:
results["Random_Forest"] = random_forest_tuning(X_train, y_train, X_test, y_test)

## XGBoost Algorithm

XGBoost is a decision-tree-based machine learning algorithm that builds on the gradient boosting framework. While artificial neural networks excel in problems involving unstructured data like images and text, decision-tree-based algorithms tend to be highly efficient for structured (tabular) data.

Key points about XGBoost:

- Origin: The algorithm was developed at the University of Washington by Tianqi Chen and Carlos Guestrin and gained prominence on platforms like Kaggle.
- Unique Features:
  - Versatile across various applications, useful for regression, classification, and more.
  - Portable across various operating systems.
  - Supports multiple programming languages.
  - Seamlessly integrates with cloud platforms and other ecosystems.
- Functionality: Decision trees are intuitive by design, but XGBoost enhances this concept. It's akin to an optimized interview process where various criteria are tweaked and merged to yield the best outcome. Specifically, XGBoost is like supercharged gradient boosting, combining optimization techniques to achieve better results with fewer resources.
- Effectiveness: XGBoost and Gradient Boosting Machines (GBMs) are tree-based methods that apply the boosting principle. XGBoost, however, optimizes GBM, enhancing both system efficiency and algorithmic accuracy.

In summary, XGBoost is a sophisticated, adaptable algorithm that stands out in a variety of machine learning scenarios, especially with structured data.
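
The boosting principle mentioned above can be sketched with one-split trees (stumps): each new stump is fitted to the residuals left by the current ensemble. This is plain gradient boosting for squared error, not XGBoost itself, which adds regularization and other optimizations. The function names and the toy data are assumptions for illustration only.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly add a stump fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical toy data: one feature (e.g. size) and a price-like target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [500.0, 520.0, 610.0, 640.0, 800.0, 830.0]

model = boost(xs, ys)
baseline = sum(ys) / len(ys)
mse_baseline = sum((y - baseline) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_baseline, mse_boosted)
```

The learning_rate here plays the same role as the learning_rate hyperparameter tuned for XGBoost: smaller values make each tree contribute less, so more rounds are needed.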

For hyperparameter tuning in XGBoost, a function like xgboost_tuning() can be applied. Such functions typically assess and tune parameters like the learning rate, the maximum depth of trees, and regularization terms, among others. These adjustments aim to improve model performance while leveraging the capabilities of the XGBoost algorithm.

In [None]:


def xgboost_tuning(X_train, y_train, X_test, y_test):
    xg = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

    param_dist = {
        'n_estimators': randint(50, 200),
        'learning_rate': uniform(0.01, 0.6),
        'max_depth': randint(1, 40),
        'gamma': uniform(0, 0.5),
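        # scipy's uniform(loc, scale) samples from [loc, loc + scale];
        # colsample_bytree must stay in (0, 1], so sample from [0.5, 1.0]
        'colsample_bytree': uniform(0.5, 0.5)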
    }

    random_search = RandomizedSearchCV(estimator=xg, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error', random_state=42)

    random_search.fit(X_train, y_train)

    best_random_params = random_search.best_params_

    best_random = random_search.best_estimator_
    random_pred = best_random.predict(X_test)
    mse = mean_squared_error(y_test, random_pred)

    print(f'Best Parameters: {best_random_params}')
    print(f'MSE of Best Model: {mse}')

    return {"model": best_random, "mse": mse, "best_params": best_random_params}





In [None]:
results["XGBoost"] = xgboost_tuning(X_train, y_train, X_test, y_test)

## Using Mean Squared Error (MSE) for Model Selection

When developing machine learning models, it is crucial to use evaluation metrics to determine a model's performance. For regression models, the Mean Squared Error (MSE) is a widely recognized and utilized metric.

**Mean Squared Error (MSE)**

MSE is used to measure the average of the squares of the errors between predictions and actual observations. Mathematically, it's expressed by the formula:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{actual},i} - y_{\text{predicted},i})^2 \]

Where:

- \( n \) is the total number of observations.
- \( y_{\text{actual},i} \) is the actual value of the i-th observation.
- \( y_{\text{predicted},i} \) is the value predicted by the model for the i-th observation.
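
As a small worked example of the formula, MSE can be computed by hand. The function name manual_mse and the three price pairs are made up for illustration; in the notebook itself the metric comes from sklearn's mean_squared_error.

```python
from math import sqrt


def manual_mse(y_actual, y_predicted):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_actual)
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted)) / n


# Hypothetical rents: errors are -100, +100 and 0.
y_actual = [1000.0, 1500.0, 2000.0]
y_predicted = [1100.0, 1400.0, 2000.0]

mse_value = manual_mse(y_actual, y_predicted)  # (10000 + 10000 + 0) / 3
rmse_value = sqrt(mse_value)                   # back in the same units as the target
print(mse_value, rmse_value)
```

Because the errors are squared, MSE is expressed in squared units of the target; taking the square root (RMSE) gives a value that can be read directly as a typical error in price units.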

**Evaluating the Model Based on MSE**

To choose the best model among various candidates, it's common to select the one with the lowest MSE on the test data. In the project context, after applying step-forward regression, Random Forest, and XGBoost, the model with the lowest MSE was selected.

In [None]:
print(results)
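
Rather than reading the printed dictionary by eye, the lowest-MSE entry can be picked programmatically. The sketch below assumes the same structure as results (one dictionary per model with an "mse" key); the values in example_results are placeholders, not measured results.

```python
# Hypothetical stand-in for `results`; the MSE values are placeholders.
example_results = {
    "Step_Forward_Regression": {"mse": 3.0},
    "Random_Forest": {"mse": 1.0},
    "XGBoost": {"mse": 2.0},
}

# Pick the model name whose entry has the smallest MSE.
best_name = min(example_results, key=lambda name: example_results[name]["mse"])
best_mse = example_results[best_name]["mse"]
print(f"Best model: {best_name} (MSE = {best_mse})")
```

Applied to the real results dictionary, results[best_name]["model"] would give the fitted estimator to save for use in the app.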