
# Build the best machine learning model to predict the right real estate price for property owners
    
## Project description :

Imagine that we work for a real estate platform. Instead of using real estate agent services, property owners submit their own listings, and buyers can respond to them directly. If a transaction goes through successfully, the platform takes a cut.
Website analytics showed that property owners often fail to base their prices on the market value. This practice is always bad for the website: inexpensive items are sold quickly, but the platform's cut is also lower because of this. Overpriced items, on the other hand, are never sold, which means no profit at all. The service needs to prevent sellers from underselling and overpricing. We need to figure out an algorithm to help property owners determine the right price.

## Description of the data :

For this task, we are going to use the data presented by an online real estate market. To make it suitable for model training, we have deleted the variables that don't affect the price, as well as missing values and apartments outside the city limits. The file is named train_data_us.csv and contains the following columns:


1. last_price — price at listing closure (in dollars)
1. total_area — apartment area in square meters (m²)
1. bedrooms — number of bedrooms
1. ceiling_height — ceiling height (m)
1. floors_total — total number of floors
1. living_area — living area (m²)
1. floor — floor
1. bike_parking — bike parking in the building (Boolean data type)
1. is_studio — studio (Boolean data type)
1. is_open_plan — open plan (Boolean data type)
1. kitchen_area — kitchen area (m²)
1. balconies — number of balconies
1. airport_dist — distance to the nearest airport in meters (m)
1. city_center_dist — distance to the city center (m)


## Outline

### Task:

Use classification and regression models and find the best-performing one.

<img src="diagram.png">




In [1]:
import pandas as pd
df = pd.read_csv('train_data_us.csv')
display(df.head())
display(df.info())

Unnamed: 0,last_price,bedrooms,kitchen_area,living_area,total_area,balconies,ceiling_height,floors_total,floor,bike_parking,is_studio,is_open_plan,airport_dist,city_center_dist
0,108000.0,2,6.6,31.5,59.0,0,2.87,4,2,0,0,0,20485,8180
1,264000.0,4,12.2,72.0,109.0,0,3.15,5,2,0,0,0,42683,8643
2,140000.0,3,10.8,49.0,74.5,0,2.58,10,9,0,0,0,14078,16670
3,64000.0,1,6.2,20.0,37.4,2,2.5,9,4,0,0,0,17792,17699
4,133000.0,3,10.4,41.9,64.9,0,2.65,12,11,0,0,0,14767,10573


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6495 entries, 0 to 6494
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   last_price        6495 non-null   float64
 1   bedrooms          6495 non-null   int64  
 2   kitchen_area      6495 non-null   float64
 3   living_area       6495 non-null   float64
 4   total_area        6495 non-null   float64
 5   balconies         6495 non-null   int64  
 6   ceiling_height    6495 non-null   float64
 7   floors_total      6495 non-null   int64  
 8   floor             6495 non-null   int64  
 9   bike_parking      6495 non-null   int64  
 10  is_studio         6495 non-null   int64  
 11  is_open_plan      6495 non-null   int64  
 12  airport_dist      6495 non-null   int64  
 13  city_center_dist  6495 non-null   int64  
dtypes: float64(5), int64(9)
memory usage: 710.5 KB


None

In [2]:
print('Average apartment price:',df['last_price'].mean())

Average apartment price: 161005.67427559663


In [3]:
print('Median apartment price:',df['last_price'].median())

Median apartment price: 113000.0


* Conclusion:

A few very expensive listings pull the mean upward (about 161,006 versus a median of 113,000), so the mean does not represent a typical apartment price. We therefore use the median price.

Apartment price is a numerical target, so this is a regression task. Regression usually involves lengthy calculations with many possible answers, so regression tasks aren't the easiest way to get acquainted with machine learning. For simplicity's sake, we'll split all prices into "high" and "low" for now, effectively turning our task into a binary classification task with only two possible answers. Then all we have to do is predict which class any given listing falls into. We'll deal with regression later.
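The split into "high" and "low" can be sketched on a handful of made-up prices. This is only an illustration: the `prices` values below are hypothetical and are not taken from train_data_us.csv. It shows how a single extreme listing moves the mean far more than the median, and how a comparison with the median yields the 0/1 class.

```python
import pandas as pd

# Hypothetical toy prices in dollars; the last listing is an extreme outlier.
prices = pd.Series([64000.0, 108000.0, 113000.0, 133000.0,
                    140000.0, 264000.0, 2500000.0])

mean_price = prices.mean()      # pulled upward by the outlier
median_price = prices.median()  # middle value, unaffected by the outlier's size

# 1 = above the median ("high"), 0 = at or below the median ("low").
price_class = (prices > median_price).astype(int)

print('Mean:', round(mean_price, 2))
print('Median:', median_price)
print(price_class.tolist())
```

Thresholding with a boolean comparison gives an integer column directly, whereas copying `last_price` and overwriting it keeps the float dtype (0.0 and 1.0).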

In [4]:
df['price_class'] = df['last_price']
df.loc[df['last_price'] > 113000,'price_class'] = 1
df.loc[df['last_price'] <= 113000,'price_class'] = 0
display(df.head())


Unnamed: 0,last_price,bedrooms,kitchen_area,living_area,total_area,balconies,ceiling_height,floors_total,floor,bike_parking,is_studio,is_open_plan,airport_dist,city_center_dist,price_class
0,108000.0,2,6.6,31.5,59.0,0,2.87,4,2,0,0,0,20485,8180,0.0
1,264000.0,4,12.2,72.0,109.0,0,3.15,5,2,0,0,0,42683,8643,1.0
2,140000.0,3,10.8,49.0,74.5,0,2.58,10,9,0,0,0,14078,16670,1.0
3,64000.0,1,6.2,20.0,37.4,2,2.5,9,4,0,0,0,17792,17699,0.0
4,133000.0,3,10.4,41.9,64.9,0,2.65,12,11,0,0,0,14767,10573,1.0


In [5]:
features = df.loc[:, ~df.columns.isin(['last_price','price_class'])]
target = df['price_class']
print(features.shape)
print(target.shape)

(6495, 13)
(6495,)


### Checking model performance: DecisionTreeClassifier / RandomForestClassifier / LogisticRegression

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score , f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 

In [7]:
df = pd.read_csv('train_data_us.csv')
df['price_class'] = df['last_price']
df.loc[df['last_price'] > 113000, 'price_class'] = 1
df.loc[df['last_price'] <= 113000, 'price_class'] = 0

# < split data into training and validation >
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345) 

# < declare variables for features and target feature >
features_train = df_train.drop(['last_price', 'price_class'], axis=1)
target_train = df_train['price_class']
features_valid = df_valid.drop(['last_price', 'price_class'], axis=1)
target_valid = df_valid['price_class'] 

In [8]:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(4871, 13)
(4871,)
(1624, 13)
(1624,)


In [9]:
# Create the scaler object
scaler = StandardScaler()

# Fit the scaler on the training set and transform its feature matrix
features_train_st = scaler.fit_transform(features_train)

# Apply the same standardization to the feature matrix of the validation set
features_valid_st = scaler.transform(features_valid)

In [10]:
# define the models to compare
models = [
    DecisionTreeClassifier(random_state=12345, max_depth=5),
    RandomForestClassifier(random_state=12345, n_estimators=5),
    LogisticRegression(random_state=12345, solver='liblinear'),
]

# function that trains a model on the data passed in and prints its metrics
def make_prediction(model, features_train, target_train, features_valid, target_valid):
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    print('Model: ', model)
    print('Accuracy: {:.2f}'.format(accuracy_score(target_valid, predictions)))
    print('Precision: {:.2f}'.format(precision_score(target_valid, predictions)))
    print('Recall: {:.2f}'.format(recall_score(target_valid, predictions)))
    print('F1: {:.2f}'.format(f1_score(target_valid, predictions)))
    print('\n')

# output the metrics for each model, using the standardized features
for model in models:
    make_prediction(model, features_train_st, target_train, features_valid_st, target_valid)

# share of class 1 in the validation set
print('Mean: {:.2f}'.format(target_valid.mean()))

Model:  DecisionTreeClassifier(max_depth=5, random_state=12345)
Accuracy: 0.87
Precision: 0.85
Recall: 0.88
F1: 0.86


Model:  RandomForestClassifier(n_estimators=5, random_state=12345)
Accuracy: 0.88
Precision: 0.87
Recall: 0.88
F1: 0.88


Model:  LogisticRegression(random_state=12345, solver='liblinear')
Accuracy: 0.89
Precision: 0.90
Recall: 0.87
F1: 0.88


Mean: 0.49


Accuracy - Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to all observations. It is tempting to conclude that a model with high accuracy is the best one, but accuracy is only reliable when the classes are roughly balanced and false positives and false negatives cost about the same. Otherwise, other metrics are needed to evaluate the model. Here the classes are balanced (the mean of the validation target is 0.49), and logistic regression, for example, scores 0.89, which means it classifies about 89% of the validation listings correctly.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision - Precision is the ratio of correctly predicted positive observations to all predicted positive observations. The question this metric answers is: of all the apartments labeled as expensive (class 1), how many really are expensive? High precision corresponds to a low false positive rate. Logistic regression has the highest precision of the three models, 0.90.

Precision = TP / (TP + FP)

Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations that actually belong to the positive class. The question recall answers is: of all the apartments that really are expensive, how many did we label as expensive? The three models are close here, with recall between 0.87 and 0.88.

Recall = TP / (TP + FN)

F1 score - The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is less intuitive than accuracy, but it is usually more useful when the class distribution is uneven. Accuracy works best if false positives and false negatives have similar costs. If their costs are very different, it is better to look at both precision and recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
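The four formulas can be checked by hand on a small example. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions), counts TP, TN, FP and FN directly, and compares the results with the scikit-learn functions imported earlier.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = "high" price class, 0 = "low" price class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Count the four cells of the confusion matrix.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print('By hand:', accuracy, precision, recall, f1)
print('sklearn:', accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```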

## Hyperparameter tuning

In [11]:
# try to improve performance by tuning the number of trees
# < loop over n_estimators from 1 to 10 >
for estimator in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimator)

    # < train the model and score it on the validation set >
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    score = accuracy_score(target_valid, predictions)
    print("estimator =", estimator, ": ", end='')
    print(score)

estimator = 1 : 0.8491379310344828
estimator = 2 : 0.8448275862068966
estimator = 3 : 0.8737684729064039
estimator = 4 : 0.8793103448275862
estimator = 5 : 0.8830049261083743
estimator = 6 : 0.8928571428571429
estimator = 7 : 0.8934729064039408
estimator = 8 : 0.8922413793103449
estimator = 9 : 0.895320197044335
estimator = 10 : 0.896551724137931


In [12]:
# try to improve performance by tuning the tree depth
# < loop over max_depth from 1 to 10 >
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)

    # < train the model and score it on the validation set >
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    score = accuracy_score(target_valid, predictions)
    print("depth =", depth, ": ", end='')
    print(score)


depth = 1 : 0.8522167487684729
depth = 2 : 0.8522167487684729
depth = 3 : 0.8466748768472906
depth = 4 : 0.8725369458128078
depth = 5 : 0.8663793103448276
depth = 6 : 0.8706896551724138
depth = 7 : 0.8663793103448276
depth = 8 : 0.8725369458128078
depth = 9 : 0.8657635467980296
depth = 10 : 0.8608374384236454


<img src="diagram.png">

* Conclusion :

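1. There is no universal rule for which model to use; it depends on the task. Despite its name, logistic regression is a classification algorithm, not a regression one.

2. With the initial settings the three models are close: logistic regression reaches an accuracy of 0.89, random forest 0.88 and the decision tree 0.87. Increasing the number of trees to 10 raises the random forest's accuracy to about 0.90, the highest value observed, at the cost of slower training because more trees have to be built. The decision tree peaks at about 0.87 for max_depth values of 4 and 8. Hyperparameter tuning can therefore noticeably improve model performance.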
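The manual loops above tune one hyperparameter at a time against a single validation split. A common alternative, shown here only as a sketch and not as the method used in this notebook, is scikit-learn's `GridSearchCV`, which tries every combination in a grid with cross-validation. The data comes from `make_classification` and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the listings: 13 features, binary target.
X, y = make_classification(n_samples=400, n_features=13, n_informative=6,
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Hypothetical grid; every combination is evaluated with 3-fold cross-validation.
param_grid = {'n_estimators': [5, 10, 20], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=12345),
                      param_grid, scoring='accuracy', cv=3)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Cross-validated accuracy: {:.2f}'.format(search.best_score_))
# The best model is refit on the whole training set and scored on held-out data.
print('Validation accuracy: {:.2f}'.format(search.score(X_valid, y_valid)))
```

Because the grid is scored on folds of the training set, the validation set stays untouched until the final check, which makes the reported score less optimistic than picking the best value directly on the validation set.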

### Check whether the model works

Create two new observations and check the prediction results. Remember that everything above and below the median price was labeled with price classes 1 and 0 respectively. The observations in our task are apartments. Write down the values of the features for each observation:
1.	The first apartment has 12 bedrooms with a total area of 900 m². The living area is 409.7 m², and the kitchen area is 112 m².
2.	The second apartment has 2 bedrooms with a total area of 109 m². The living area is 32 m², and the kitchen area is 40.5 m².

In [13]:
new_features = pd.DataFrame(
    [
        [None, None, None, None, 0, 2.8, 25, 25, 0, 0, 0, 30706.0, 7877.0],
        [None, None, None, None, 0, 2.75, 25, 25, 0, 0, 0, 36421.0, 9176.0],
    ],
    columns=features.columns,
)

# complete the table with the new features
new_features.loc[0, 'bedrooms'] = 12
new_features.loc[0, 'kitchen_area'] = 112
new_features.loc[0, 'living_area'] = 409.7
new_features.loc[0, 'total_area'] = 900
new_features.loc[1, 'bedrooms'] = 2
new_features.loc[1, 'kitchen_area'] = 40.5
new_features.loc[1, 'living_area'] = 32
new_features.loc[1, 'total_area'] = 109

In [14]:
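# `model` is the last model trained above, i.e. DecisionTreeClassifier(max_depth=10),
# assuming the cells were run in order
# predict answers and print the result on the screen
answers = model.predict(new_features)
print(answers)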


[1. 0.]


**Conclusion:**

The model puts the luxurious twelve-bedroom apartment into the expensive class (1) and the much smaller two-bedroom apartment into the cheap class (0), which matches intuition.
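A predicted class alone does not say how sure the model is. For a decision tree, scikit-learn's `predict_proba` returns the share of training samples of each class in the leaf an observation falls into. The sketch below is a toy illustration under stated assumptions: the single `total_area` feature and the rule that generates the classes are hypothetical, not the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(12345)

# Hypothetical data: one feature (total area, m²) and a noisy size-based class.
total_area = rng.uniform(20, 200, size=300)
noise = rng.normal(0, 15, size=300)
y = ((total_area + noise) > 80).astype(int)  # 1 = "high", 0 = "low"
X = total_area.reshape(-1, 1)

model = DecisionTreeClassifier(max_depth=3, random_state=12345)
model.fit(X, y)

# A very large, a very small and a borderline apartment.
new_X = np.array([[900.0], [30.0], [80.0]])
classes = model.predict(new_X)
probabilities = model.predict_proba(new_X)  # columns follow model.classes_

print(classes)
print(probabilities)
```

Rows with probabilities close to 0 or 1 are the cases where "confident" is justified; the borderline apartment is where the two columns may sit closer together.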

## Regression task:

A decision tree can be used for regression just as well as for classification.
For a regression task, the decision tree is trained in a manner similar to classification, but it predicts a number instead of a class.
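A minimal sketch of that idea, assuming a made-up linear price rule rather than the real listings: `DecisionTreeRegressor` is fitted on a numeric target, and each leaf predicts the mean target of the training samples that fall into it, so a depth-3 tree can output at most eight distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12345)

# Hypothetical data: price grows with total area, plus noise (in dollars).
total_area = rng.uniform(20, 200, size=300)
price = 2000 * total_area + rng.normal(0, 10000, size=300)
X = total_area.reshape(-1, 1)

model = DecisionTreeRegressor(max_depth=3, random_state=12345)
model.fit(X, price)

# Predictions are numbers (dollars), not class labels.
predictions = model.predict(np.array([[40.0], [150.0]]))
print(predictions)

# A depth-3 tree has at most 2**3 = 8 leaves, hence at most 8 distinct outputs.
print(len(np.unique(model.predict(X))))
```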


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [16]:
df = pd.read_csv('train_data_us.csv')
df['price_class'] = df['last_price']
df.loc[df['last_price'] > 113000, 'price_class'] = 1
df.loc[df['last_price'] <= 113000, 'price_class'] = 0

# < split data into training and validation >
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345) 

# < declare variables for features and target feature >
features_train = df_train.drop(['last_price', 'price_class'], axis=1)
target_train = df_train['price_class']
features_valid = df_valid.drop(['last_price', 'price_class'], axis=1)
target_valid = df_valid['price_class'] 

In [17]:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(4871, 13)
(4871,)
(1624, 13)
(1624,)


In [19]:
# Create the scaler object
scaler = StandardScaler()

# Fit the scaler on the training set and transform its feature matrix
features_train_st = scaler.fit_transform(features_train)

# Apply the same standardization to the feature matrix of the validation set
features_valid_st = scaler.transform(features_valid)

In [20]:
# declare the list of models
models = [Lasso(), Ridge(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]

# the function that calculates MAPE (mean absolute percentage error)
def mape(y_true, y_pred):
    y_error = y_true - y_pred
    y_error_abs = abs(y_error)
    perc_error_abs = y_error_abs / y_true
    return perc_error_abs.sum() / len(y_true)

# the function that takes the model and data as input and outputs metrics
def make_prediction(model, features_train, target_train, features_valid, target_valid):
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    MAE = mean_absolute_error(target_valid, predictions)
    MSE = mean_squared_error(target_valid, predictions)
    # caution: this compares target_train with target_valid, not the true values
    # with the predictions; the two Series share no index labels, so it yields 0.00
    MAPE = mape(target_train, target_valid)
    R2 = r2_score(target_valid, predictions)
    RMSE = MSE ** 0.5
    print('MAE:{:.2f} MSE:{:.2f} RMSE:{:.2f} MAPE:{:.2f} R2:{:.2f} '.format(MAE, MSE, RMSE, MAPE, R2))
    print('\n')

# write a loop that outputs metrics for each model, using the standardized features
for model in models:
    print(model)
    make_prediction(model, features_train_st, target_train, features_valid_st, target_valid)

# print the mean target variable value on the validation set
print('Mean: {:.2f}'.format(target_valid.mean()))

Lasso()
MAE:0.50 MSE:0.25 RMSE:0.50 MAPE:0.00 R2:-0.00 


Ridge()
MAE:0.33 MSE:0.15 RMSE:0.38 MAPE:0.00 R2:0.41 


DecisionTreeRegressor()
MAE:0.15 MSE:0.15 RMSE:0.39 MAPE:0.00 R2:0.39 


RandomForestRegressor()
MAE:0.15 MSE:0.08 RMSE:0.28 MAPE:0.00 R2:0.69 


GradientBoostingRegressor()
MAE:0.16 MSE:0.08 RMSE:0.27 MAPE:0.00 R2:0.70 


Mean: 0.49


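MAE, MSE, RMSE and R-squared are the metrics most commonly used to evaluate prediction error and model performance in regression analysis. Note that the target in the cells above is still price_class (0 or 1) rather than last_price, so the errors are on a 0-1 scale, not in dollars. The MAPE column is 0.00 for every model because mape is called with target_train and target_valid instead of the true values and the predictions, so it carries no information here.

1. MAE (Mean Absolute Error): the average of the absolute differences between the actual and predicted values over the data set.
1. MSE (Mean Squared Error): the average of the squared differences between the actual and predicted values over the data set.
1. RMSE (Root Mean Squared Error): the square root of MSE, expressed in the same units as the target.
1. R-squared (Coefficient of determination): the share of the variance of the actual values that the model explains. A value of 1 means a perfect fit, 0 means the model is no better than always predicting the mean, and negative values are possible for models that do worse than that. The higher the value, the better the model.

Comparing all the models, GradientBoostingRegressor() performs best on this problem (R2 0.70, RMSE 0.27), closely followed by RandomForestRegressor() (R2 0.69, RMSE 0.28).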
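To make the definitions concrete, the sketch below computes each metric by hand for dollar prices and checks the results against scikit-learn. The true values are the five `last_price` entries shown by `df.head()`; the predictions are hypothetical numbers invented for the example. MAPE is computed here between true values and predictions, which requires a target without zeros.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# True prices: the first five last_price values from df.head().
y_true = np.array([108000.0, 264000.0, 140000.0, 64000.0, 133000.0])
# Hypothetical predictions, invented for this illustration.
y_pred = np.array([100000.0, 250000.0, 150000.0, 70000.0, 130000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))            # average absolute error, in dollars
mse = np.mean(errors ** 2)               # average squared error, in dollars squared
rmse = np.sqrt(mse)                      # back in dollars
mape = np.mean(np.abs(errors / y_true))  # average relative error, as a fraction
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print('MAE:{:.2f} MSE:{:.2f} RMSE:{:.2f} MAPE:{:.4f} R2:{:.4f}'.format(
    mae, mse, rmse, mape, r2))

# The hand-computed values agree with the library functions.
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```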