## 1. IMPORTING LIBRARIES

    *NumPy for linear algebra and data manipulation.
    *Matplotlib for visualizing data.
    *pandas for loading and analyzing data.
    *scikit-learn (sklearn) for splitting the data and creating the model.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
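# Import the submodules explicitly; a bare "import sklearn" is not guaranteed to load them.
import sklearn.model_selection
import sklearn.tree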

## 2. LOADING AND ANALYZING THE DATA

    *Loading the real_estate.csv dataset.
    *The dataset is composed of features that can potentially affect house prices.
    *Analyzing the data using pandas.

In [2]:
df = pd.read_csv("real_estate.csv")

df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,,36.2


In [3]:
print("This dataset has {} houses and {} columns.".format(df.shape[0], df.shape[1]))

This dataset has 506 houses and 13 columns.


As the df.head() output shows (row 4 has no LSTAT value), the dataset has missing values. We must handle them before training.

In [6]:
print("There are missing values:\n{}".format(df.isna().sum()))

There are missing values:
CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64


In [8]:
df.dropna(inplace=True)

print("Currently, there are no missing values:\n{}".format(df.isna().sum()))

Currently, there are no missing values:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64
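
Dropping rows discards every house that has at least one missing value. If that loses too much data, one common alternative is to fill the gaps instead. The following is a minimal sketch of median imputation on a small made-up frame (the values and the `toy` name are illustrative and are not taken from real_estate.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for illustration only.
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.2],
    "LSTAT": [5.0, 9.0, np.nan, 4.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Option used in this notebook: drop every row that has a missing value.
dropped = toy.dropna()

# Alternative: replace each missing value with the median of its column.
imputed = toy.fillna(toy.median())

print("Rows after dropna: {}, rows after imputation: {}".format(len(dropped), len(imputed)))
print("Missing values left after imputation: {}".format(imputed.isna().sum().sum()))
```

When imputing before a train/test split, the medians would normally be computed from the training rows only, so that no information from the test set leaks into training.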


## 3. PREPARING AND SPLITTING DATA

    *Separating the features (X) from the target (y).
    *Splitting the data into train and test sets.

In [12]:
# The independent variables are all columns except the dependent variable MEDV.
X = df.drop(columns=["MEDV"])
y = df["MEDV"]

X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21


In [13]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
5    28.7
Name: MEDV, dtype: float64

In [15]:
training_X, test_X, training_y, test_y = sklearn.model_selection.train_test_split(X, y, test_size=0.25)
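
`train_test_split` shuffles the rows randomly, so each run of the cell above produces a different split and therefore a different score. A minimal sketch of a reproducible split, assuming a fixed seed passed through the `random_state` parameter (the seed value 0 and the toy data are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y; the values are invented.
X_toy = pd.DataFrame({"RM": np.arange(20, dtype=float)})
y_toy = pd.Series(np.arange(20, dtype=float) * 2.0, name="MEDV")

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print("Train shape: {}, test shape: {}".format(a_train.shape, a_test.shape))
print("Same test rows on both calls: {}".format(a_test.index.equals(b_test.index)))
```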

## 4. CREATING AND EVALUATING THE MODEL

    *Creating the model using sklearn.
    *A decision tree regressor is used to predict house prices.
    *Evaluating with the R² score on the test set (about 0.76, not 100%) and with the mean absolute error.
    *Plotting the tree.
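
The cells in this section fit and score the model but do not include a plotting step. A minimal sketch of how a fitted tree could be drawn with `sklearn.tree.plot_tree`, assuming invented toy data and a `max_depth` of 2 to keep the figure readable (both are illustrative choices, not the settings used in this notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Toy data: one invented feature and a noisy linear target.
rng = np.random.default_rng(0)
X_toy = rng.uniform(3.0, 9.0, size=(50, 1))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(0.0, 1.0, size=50)

# A shallow tree keeps the plot small enough to read.
toy_model = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_model.fit(X_toy, y_toy)

# plot_tree draws one box per node and returns the drawn annotations.
fig, ax = plt.subplots(figsize=(8, 4))
nodes = plot_tree(toy_model, feature_names=["RM"], filled=True, ax=ax)
print("Drew {} node boxes for a tree of depth {}".format(len(nodes), toy_model.get_depth()))
```

An unrestricted tree on the full dataset would have many more nodes, so limiting the depth (or passing `max_depth` to `plot_tree`) is usually needed for a legible figure.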

In [28]:
model = sklearn.tree.DecisionTreeRegressor(criterion="squared_error")

model.fit(training_X, training_y)

# For a regressor, score() returns the R^2 coefficient of determination, not an accuracy.
print("R^2 score is: {}".format(model.score(test_X, test_y)))

R^2 score is: 0.7585596052361815
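
For a regressor, `score` returns the coefficient of determination R² = 1 - SS_res / SS_tot rather than a classification accuracy. A minimal sketch, on invented toy data, that computes R² from this definition and compares it with the value returned by `score`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy regression data; the coefficients and noise level are invented.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 10.0, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0.0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
toy_model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
pred = toy_model.predict(X_te)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_score_method = toy_model.score(X_te, y_te)

print("Manual R^2: {:.6f}, score(): {:.6f}".format(r2_manual, r2_score_method))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.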


In [36]:
prediction = model.predict(test_X)

# Mean absolute error, multiplied by 1000 (this assumes MEDV is recorded in thousands).
print("Average error is: {}".format(round(((prediction-test_y).abs().mean()*1000), 2)))

Average error is: 2694.95
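
The average error above is the mean absolute error (MAE) scaled by 1000. A minimal sketch, on invented numbers, showing that the pandas expression gives the same value as `sklearn.metrics.mean_absolute_error`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Invented true values and predictions, in the same units as MEDV.
true_y = pd.Series([24.0, 21.6, 34.7, 33.4])
pred_y = np.array([25.0, 20.6, 32.7, 33.4])

# Same expression as in the notebook cell, without the scaling factor.
mae_manual = (pred_y - true_y).abs().mean()

# Equivalent library call.
mae_sklearn = mean_absolute_error(true_y, pred_y)

print("Manual MAE: {:.4f}, sklearn MAE: {:.4f}".format(mae_manual, mae_sklearn))
```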


## 5. CONCLUSIONS

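    *The R² score on the test set is about 0.76, so the model explains roughly 76% of the variance in house prices. This is a fair fit, but it leaves room for improvement.
    *Other split criteria, such as friedman_mse, can be used for training.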