#Housing price forecast
In this project, we want to predict housing prices with the help of data. This dataset is downloaded from the Kaggel website and contains two datasets, train dataset and test dataset.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

##import library and load data
In this section, we load the required datasets and view their basic information, including the number and name of columns, data type, etc.

In [None]:
import pandas as pd
train_data = pd.read_csv('/content/drive/MyDrive/ml/Housing-price-forecast/train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ml/Housing-price-forecast/test.csv')


In [None]:
print(train_data.head())
print(train_data.info())
print(train_data.describe())

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

In [None]:
print(test_data.head())
print(test_data.info())
print(test_data.describe())

##data prepare
At this stage, we select the features that we intend to use to predict the housing price. If we have a non-numeric data type, we convert them to numeric data and complete the incomplete data.

###feature selection

Choosing the optimal features is one of the critical steps in the success of a prediction model. Choosing the right features can improve model performance, reduce training and prediction time, and avoid overfitting. In the following, I will explain the different methods of selecting features:

Analysis of the impact and importance of features:
You can see the effect of each feature on the output of the model from methods such as Feature Importance analysis, which is used by decision trees and warehouse-based algorithms, including Random Forest.

Dimension reduction:
Use dimensionality reduction techniques such as principal component analysis (PCA). These methods help you reduce the number of features and retain important information.

Using methods related to the model:
Some machine learning algorithms such as LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge Regression use techniques to avoid overfitting by reducing unnecessary features.

Selection of evaluation criteria:
Evaluation criteria can help you select features that perform better based on their effectiveness on the evaluation criteria.

Thematic search algorithms:
Algorithms such as genetic algorithm and thematic search algorithms (such as Recursive Feature Elimination) can help you find the optimal set of features.

Correlation analysis:
Check the correlation between features and if there are features that are correlated, you can remove one of them.

Using special feature selection methods:
Some special feature selection methods such as Recursive Feature Elimination (RFE) and SelectKBest use feature selection modules in Scikit-learn.

Remember that feature selection should be done carefully and consider its detrimental effects on model performance. Also, it is always important to carefully evaluate the effects of feature selection on the final model.

###fill NaN cells and convert data types

We filled the empty and incomplete values of the dataset in the selected features with the help of the average and with the help of the "fillna()" function.


Because some of the selected features are of non-numeric data type, in order to be able to include them in the model, we first need to convert these features to numeric data type. Here we used the "One-Hot Encoding" method. In this method, we call a function named "get_dummies" from the Pendaz library and give the selected features to this function. In this function, in order for the number of new columns to match the rows, we have considered the value "drop_first=True".

In [10]:
# select features and depend variable (target)
features = train_data[['MSZoning', 'LotFrontage','LandContour', 'BldgType', 'YearBuilt', 'RoofStyle', 'Exterior1st', 'Foundation', 'YrSold', 'SaleType', 'SaleCondition']]
target = train_data['SalePrice']

# fill NaN variables with mean
features.fillna(features.mean(), inplace=True)

# convert features to numeric
features_encoded = pd.get_dummies(features, drop_first=True)

# add new columns to features
features = pd.concat([features, features_encoded], axis=1)

# remove main features that converted to One-Hot Encoding
features = features.drop(['MSZoning', 'LandContour', 'BldgType', 'RoofStyle', 'Exterior1st', 'Foundation', 'SaleType', 'SaleCondition'], axis=1)

# dividing dataset to train data and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

  features.fillna(features.mean(), inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features.fillna(features.mean(), inplace=True)


##Select Features with RandomForest (Optional)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# ساخت یک مدل Random Forest
model_selection_feature = RandomForestRegressor()

all_features = train_data[['MSSubClass','MSZoning','LotFrontage','LotArea','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','MasVnrArea','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','Heating','HeatingQC','CentralAir','Electrical','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','TotRmsAbvGrd','Functional','Fireplaces','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageArea','GarageQual','GarageCond','PavedDrive','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','PoolQC','Fence','MiscFeature','MiscVal','MoSold','YrSold','SaleType','SaleCondition']]
target = train_data['SalePrice']

# fill NaN variables with mean
all_features.fillna(all_features.mean(), inplace=True)

# convert features to numeric
features_encoded = pd.get_dummies(all_features, drop_first=True)

# add new columns to features
all_features = pd.concat([all_features, features_encoded], axis=1)

# remove main features that converted to One-Hot Encoding
all_features = all_features.drop(['MSSubClass','MSZoning','LotFrontage','LotArea','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','MasVnrArea','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','Heating','HeatingQC','CentralAir','Electrical','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','TotRmsAbvGrd','Functional','Fireplaces','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageArea','GarageQual','GarageCond','PavedDrive','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','PoolQC','Fence','MiscFeature','MiscVal','MoSold','YrSold','SaleType','SaleCondition'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(all_features, target, test_size=0.2, random_state=42)

# آموزش مدل بر روی داده‌ها
model_selection_feature.fit(X_train, y_train)

# ایجاد یک مدل انتخاب ویژگی بر اساس اهمیت ویژگی‌ها
sfm = SelectFromModel(model_selection_feature, threshold=0.1)

# اعمال مدل انتخاب ویژگی بر روی داده‌های آموزشی
sfm.fit(X_train, y_train)

# انتخاب ویژگی‌ها
selected_features = X_train.columns[sfm.get_support()]


In [9]:
print(selected_features)

Index(['ExterQual_TA'], dtype='object')


##Create Model

**with RandomForest algorithm**

The RandomForest algorithm is a machine learning algorithm based on ensemble ideas that uses the combination of several decision trees to improve performance and prediction accuracy. Below is a description of the RandomForest algorithm:

Decision Tree: A decision tree is a machine learning model that is built by dividing data into smaller parts (areas) and applying a decision to each area.
A decision tree results in a tree-like structure with branches and nodes.

Ensemble: RandomForest uses the idea of ensemble. Instead of using one decision tree as the main model, it uses multiple decision trees as groups. Each decision tree is trained independently and makes decisions.

Random Feature Selection: In each node of each tree, only a limited number of features are considered for data division. This random selection of features helps to diversify and prevent overfitting.

Combining Predictions: The final prediction is made by combining the predictions of each decision tree. A final forecast is usually produced by applying a majority decision or averaging the forecasts.

Adjustable parameters: In the RandomForest algorithm, the number of trees (n_estimators), the number of features for each partition (max_features) and the depth level of the trees (max_depth) are adjustable parameters.

Application in housing price forecasting: RandomForest performs well in housing price forecasting problems. Due to its ability to control over-discrepancy and prevent over-fitting to the training data, this algorithm is usually successful in prediction tasks.
The RandomForest algorithm is usually used to predict housing prices due to its ability to handle high-dimensional data and the complexity of issues related to the housing market.

In [11]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

##train model

In [12]:
model.fit(X_train, y_train)

##Evaluation model

In the model evaluation stage, various criteria are used to quantitatively evaluate the performance of the model. In the housing price prediction project, various criteria can be used to evaluate the model. One of the common criteria for regression prediction problems is the MSE (Mean Squared Error) criterion. MSE is one of the most common evaluation criteria for regression problems and measures the root mean square difference between the model predictions and the actual values.

1. Can AUC and ROC be calculated for housing price forecasting project?

AUC (Area Under the Curve) and ROC (Receiver Operating Characteristic) are terms used as evaluation criteria in classification problems. These criteria are usually used in cases where the number of positive and negative samples is different, such as disease diagnosis problems or problems related to the medical field.

In a house price forecasting project that is modeled as a regression problem (numerical value forecasting), classification measures such as AUC and ROC are usually not used. These criteria are used to evaluate a model's ability to separate positive and negative classes based on estimated probabilities.

In [13]:
from sklearn.metrics import mean_squared_error

predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 3794358217.654185


##Prediction Model

In [15]:
import pandas as pd

# new data for predict price
new_data = pd.DataFrame({
    'MSZoning': ["RH"],
    'LotFrontage': [117],
    'LandContour': ["Lvl"],
    'BldgType': ["1Fam"],
    'YearBuilt': [2003],
    'RoofStyle': ["Gable"],
    'Exterior1st': ["VinylSd"],
    'Foundation': ["CBlock"],
    'YrSold': [2010],
    'SaleType': ["WD"],
    'SaleCondition': ["Normal"]
})

# convert categorials data in new data to On-Hot Encoding
new_data_encoded = pd.get_dummies(new_data, drop_first=True)

# Homogenization new data columns with train data columns
new_data_encoded_aligned = new_data_encoded.reindex(columns=X_train.columns, fill_value=0)

# predicting with Basic model
predicted_price = model.predict(new_data_encoded_aligned)
print(f'Predicted Price: {predicted_price[0]}')


Predicted Price: 296501.05


##optimization Model (Optional)
The model optimization stage is an important stage in building and adjusting a machine learning model. Methods such as GridSearchCV and RandomizedSearchCV are used to optimally adjust the parameters of the model. These methods allow you to search a space of parameters and choose the best values for your model.

1. Grid Search:
In GridSearchCV, you provide a set of possible values for each parameter and the algorithm tries all possible combinations. This method is the most time-consuming, but it finds the best parameters.

2. Randomized Search:
In RandomizedSearchCV, instead of trying all combinations, a certain number of combinations are randomly selected. This method is more practical for larger or time-consuming parameter spaces.

In [None]:
from sklearn.model_selection import GridSearchCV

# The model used
model = RandomForestRegressor()

#Parameters space
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV with model, parameter space and number of partitions
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Training the model on the training data
grid_search.fit(X_train, y_train)

# Best parameters and relevant results
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best Parameters: {best_params}')
print(f'Best Score: {best_score}')

In [16]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# The model used
model = RandomForestRegressor()

#Parameters space
param_dist = {
    'n_estimators': randint(10, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4)
}

#Create a RandomizedSearchCV with model, parameter space and number of partitions
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)

#Training the model on the training data
random_search.fit(X_train, y_train)

#Best parameters and relevant results
best_params = random_search.best_params_
best_score = random_search.best_score_

print(f'Best Parameters: {best_params}')
print(f'Best Score: {best_score}')


Best Parameters: {'max_depth': 30, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 199}
Best Score: -2916148236.956235


In [None]:
#evaluation random_search model
from sklearn.metrics import mean_squared_error

predictions = random_search.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 3408387808.1307755


In [17]:
import pandas as pd

# new data for predict price
new_data = pd.DataFrame({
    'MSZoning': ["RH"],
    'LotFrontage': [117],
    'LandContour': ["Lvl"],
    'BldgType': ["1Fam"],
    'YearBuilt': [2003],
    'RoofStyle': ["Gable"],
    'Exterior1st': ["VinylSd"],
    'Foundation': ["CBlock"],
    'YrSold': [2010],
    'SaleType': ["WD"],
    'SaleCondition': ["Normal"]
})

# convert categorials data in new data to On-Hot Encoding
new_data_encoded = pd.get_dummies(new_data, drop_first=True)

# Homogenization new data columns with train data columns
new_data_encoded_aligned = new_data_encoded.reindex(columns=X_train.columns, fill_value=0)

# predicting with Basic model
predicted_price = random_search.predict(new_data_encoded_aligned)
print(f'Predicted Price: {predicted_price[0]}')


Predicted Price: 297306.3156426348


##Feature Engineering (Optional)

Feature engineering is an important process in data analysis and building predictive models, which involves transforming and creating new features based on existing data. In the housing price prediction project, you can create more useful and high-quality information for the model and improve its performance by performing feature engineering. Below are some ideas for doing feature engineering in this project:

Convert attributes to a suitable format: You may want to convert the home's construction date into a useful attribute, for example, calculate the age of the home and add it as a new attribute.
If the data contains spatial information (such as latitude and longitude), you can create new spatial features.

Extracting information from variables: If you have a feature such as LotArea, you can separately calculate the area of the house (internal) and the area of the yard and add it as a new feature. You can extract information from other features, such as the number of rooms and bathrooms, and add it as a new feature.

Apply mathematical transformations: You can transform features, such as taking the logarithm of features whose distribution is heterogeneous. If the distribution of a feature is close to a normal distribution, it may be improved by statistical transformations such as standardization (a subset of feature scaling).

Use subject knowledge: If you have specific knowledge in the field of real estate pricing, you can add features related to this knowledge. For example, topographic distances, distance to guest houses or urban service centers may be useful to add to the accuracy of the model.

Consider corrective measures: If the data has unusual or defamatory values, you may want to apply corrective measures. For example, removing outliers or setting default values for incomplete data.

Working with time features: If the data includes sales time, you might use information related to seasons or time changes as new features.
In either case, it's important to note that the changes you make will not only improve the information, but also make your model more capable of detecting hidden and more complex patterns in the data. For each change, it is critical to test and evaluate the performance of the model with these changes as well.