# **Predicting House Prices with Regression**

Understand the problem statement

Importing libraries and helper functions

Understand the dataset

Data normalization

Convert Label Value

Train and Test split

Create a neural network model

Train the model to fit the dataset

Evaluate the model

Visualize the predictions

## **1. Understand the problem statement**

For this project, we will predict the price of a house from the following features:

1. Year of sale of the house
2. The age of the house at the time of sale
3. Distance from city center
4. Number of stores in the locality
5. The latitude
6. The longitude

![Regression](regression.png)

Note: This notebook uses Python 3 and these packages: `tensorflow`, `pandas`, `matplotlib`, `scikit-learn`.

## **2. Importing Libraries & Helper Functions**

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import logging

from utils import *
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

%matplotlib inline
logging.getLogger("tensorflow").setLevel(logging.ERROR)

print('Libraries imported.')

Libraries imported.


In [3]:
df=pd.read_csv("data.csv",names = ['Serial','Date','Age','Distance','Stores','Latitude','Longitude','Price'])
df.head(5)

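| | Serial | Date | Age | Distance | Stores | Latitude | Longitude | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2009 | 21 | 9 | 6 | 84 | 121 | 14264 |
| 1 | 1 | 2007 | 4 | 2 | 3 | 86 | 121 | 12032 |
| 2 | 2 | 2016 | 18 | 3 | 7 | 90 | 120 | 13560 |
| 3 | 3 | 2002 | 13 | 2 | 2 | 80 | 128 | 12029 |
| 4 | 4 | 2014 | 25 | 5 | 8 | 81 | 122 | 14157 |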


## **3. Understand the dataset**

In [4]:
df.columns

Index(['Serial', 'Date', 'Age', 'Distance', 'Stores', 'Latitude', 'Longitude',
       'Price'],
      dtype='object')

In [5]:
df.describe()

| | Serial | Date | Age | Distance | Stores | Latitude | Longitude | Price |
|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2499.5 | 2008.9128 | 18.945 | 4.9778 | 4.915 | 84.9714 | 124.9942 | 13906.6386 |
| std | 1443.520003 | 5.457578 | 11.329539 | 3.199837 | 3.142889 | 3.16199 | 3.167992 | 1020.774876 |
| min | 0.0 | 2000.0 | 0.0 | 0.0 | 0.0 | 80.0 | 120.0 | 11263.0 |
| 25% | 1249.75 | 2004.0 | 9.0 | 2.0 | 2.0 | 82.0 | 122.0 | 13197.75 |
| 50% | 2499.5 | 2009.0 | 19.0 | 5.0 | 5.0 | 85.0 | 125.0 | 13893.5 |
| 75% | 3749.25 | 2014.0 | 29.0 | 8.0 | 8.0 | 88.0 | 128.0 | 14614.0 |
| max | 4999.0 | 2018.0 | 38.0 | 10.0 | 10.0 | 90.0 | 130.0 | 16964.0 |


### **3.1. Check Missing Data**

It is good practice to check whether the data has any missing values. In real-world data, missing values are common and must be handled before any pre-processing or model training.

In [6]:
df.isna().sum()

Serial       0
Date         0
Age          0
Distance     0
Stores       0
Latitude     0
Longitude    0
Price        0
dtype: int64
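
The check above reports no missing values, so nothing needs fixing in this dataset. For data that does have gaps, two common options are dropping incomplete rows or filling each gap with a column statistic. The sketch below shows both on a small made-up frame; `toy` and its values are illustrative and are not taken from `data.csv`.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps; not part of data.csv
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 18.0, 13.0],
    "Stores": [6.0, 3.0, np.nan, 2.0],
})

# Count missing values per column, as in the cell above
missing_counts = toy.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each gap with the median of its own column
filled = toy.fillna(toy.median())
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of inventing plausible values, so the right choice depends on how much data is missing.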

### **3.2. Data normalization**

Normalizing the data before training puts every column on a comparable scale, which makes it easier for optimization algorithms to find minima.

In [9]:
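# Drop the Serial column: it is a row identifier, not a feature
df=df.iloc[:,1:]
# Standardize every column (including Price) to zero mean and unit standard deviation
df_norm=(df-df.mean())/df.std()
df_norm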

| | Date | Age | Distance | Stores | Latitude | Longitude | Price |
|---|---|---|---|---|---|---|---|
| 0 | 0.015978 | 0.181384 | 1.257002 | 0.345224 | -0.307212 | -1.260799 | 0.350088 |
| 1 | -0.350485 | -1.319118 | -0.930610 | -0.609312 | 0.325301 | -1.260799 | -1.836486 |
| 2 | 1.298598 | -0.083410 | -0.618094 | 0.663402 | 1.590328 | -1.576456 | -0.339584 |
| 3 | -1.266643 | -0.524735 | -0.930610 | -0.927491 | -1.572238 | 0.948803 | -1.839425 |
| 4 | 0.932135 | 0.534444 | 0.006938 | 0.981581 | -1.255981 | -0.945141 | 0.245266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | -0.350485 | -0.171675 | 0.319454 | -0.609312 | 1.590328 | 0.001831 | -0.360156 |
| 4996 | 1.298598 | -1.054324 | 1.569518 | -1.563848 | 0.009045 | 1.264460 | 0.833055 |
| 4997 | 1.481830 | -1.142588 | 1.569518 | 0.027045 | 1.590328 | 0.001831 | 0.191385 |
| 4998 | 0.199209 | 1.593622 | -0.618094 | 0.027045 | -1.255981 | 0.948803 | 0.398091 |
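
As a sanity check on the formula above, every normalized column should end up with mean 0 and standard deviation 1. The sketch below verifies this on a small frame built from the first five `Age` and `Price` values shown by `df.head(5)`; the names `toy`, `col_means`, and `col_stds` are illustrative. Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default, so the check uses the same definition.

```python
import pandas as pd

# First five Age and Price values from the df.head(5) output
toy = pd.DataFrame({
    "Age": [21.0, 4.0, 18.0, 13.0, 25.0],
    "Price": [14264.0, 12032.0, 13560.0, 12029.0, 14157.0],
})

# Same standardization as in the notebook cell
toy_norm = (toy - toy.mean()) / toy.std()

# Both should be (numerically) 0 and 1 for every column
col_means = toy_norm.mean()
col_stds = toy_norm.std()
```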


### **3.3. Convert Label Value**


Because the labels are normalized, a trained model returns its predictions on the same normalized scale. To report predicted prices, we convert those values back to the original scale using the mean and standard deviation of `Price`.

In [10]:
y_mean=df['Price'].mean()
y_std=df['Price'].std()

def convert_label_values(pred):
    return int(round(pred * y_std + y_mean))

convert_label_values(0.350088)

14264
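
The helper above works on a single Python number. Model predictions usually arrive as NumPy arrays, so an element-wise variant can be convenient. The sketch below is an alternative written under assumptions, not the notebook's helper: `convert_label_array` is a hypothetical name, and `y_mean` and `y_std` are hard-coded from the `df.describe()` output above rather than computed from `df`.

```python
import numpy as np

# Price mean and standard deviation, copied from the df.describe() output
y_mean = 13906.6386
y_std = 1020.774876

def convert_label_array(preds):
    """Map normalized predictions back to integer prices, element-wise."""
    # Accept scalars, lists, or arrays of any shape and flatten to 1-D
    preds = np.asarray(preds, dtype=float).ravel()
    # Undo the standardization, then round to the nearest integer
    return np.rint(preds * y_std + y_mean).astype(int)

# Normalized Price values of rows 0 and 1, shaped like model output (n, 1)
prices = convert_label_array([[0.350088], [-1.836486]])
```

Here `prices` recovers 14264 and 12032, the original prices of the first two rows.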

## **4: Create Training and Test Set**


### **4.1: Select Features**

In [14]:
X=df_norm.iloc[:,:6]
X.head()

| | Date | Age | Distance | Stores | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 0.015978 | 0.181384 | 1.257002 | 0.345224 | -0.307212 | -1.260799 |
| 1 | -0.350485 | -1.319118 | -0.93061 | -0.609312 | 0.325301 | -1.260799 |
| 2 | 1.298598 | -0.08341 | -0.618094 | 0.663402 | 1.590328 | -1.576456 |
| 3 | -1.266643 | -0.524735 | -0.93061 | -0.927491 | -1.572238 | 0.948803 |
| 4 | 0.932135 | 0.534444 | 0.006938 | 0.981581 | -1.255981 | -0.945141 |


### **4.2: Select Labels**

In [13]:
y=df_norm.iloc[:,-1]
y.head()

0    0.350088
1   -1.836486
2   -0.339584
3   -1.839425
4    0.245266
Name: Price, dtype: float64

### **4.3: Features and Label values**

In [16]:
x_array=X.values
y_array=y.values

### **4.4: Train and Test Split**

In [22]:
X_train,X_test,y_train,y_test=train_test_split(x_array,y_array,test_size=0.05,random_state=0)

In [23]:
print("Training shape : " ,X_train.shape,y_train.shape)
print("Test shape : ", X_test.shape,y_test.shape)

Training shape :  (4750, 6) (4750,)
Test shape :  (250, 6) (250,)
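
Conceptually, `train_test_split` shuffles the rows and then cuts them into two parts. The NumPy sketch below illustrates that idea on placeholder arrays of the same shape. It is not scikit-learn's implementation and will not reproduce the row assignment produced by `random_state=0` above; `shuffle_split`, `x_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np

def shuffle_split(x, y, test_size=0.05, seed=0):
    """Shuffle row indices, then hold out a fraction of rows as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(len(x) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the same shape as x_array and y_array
x_demo = np.zeros((5000, 6))
y_demo = np.zeros(5000)

x_tr, x_te, y_tr, y_te = shuffle_split(x_demo, y_demo)
```

With 5000 rows and `test_size=0.05`, 250 rows are held out and 4750 remain for training, which matches the shapes printed above.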


## **5: Create the Model**

### **5.1: Create the Model**

In [20]:
def get_model():
    model=Sequential([
        Dense(10,input_shape=(6,),activation='relu'),
        Dense(20,activation='relu'),
        Dense(5,activation='relu'),
        Dense(1)
    ])

    model.compile(
        loss='mse',
        optimizer='adam'
    )
    return model

In [21]:
get_model().summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 10)                70        
                                                                 
 dense_1 (Dense)             (None, 20)                220       
                                                                 
 dense_2 (Dense)             (None, 5)                 105       
                                                                 
 dense_3 (Dense)             (None, 1)                 6         
                                                                 
Total params: 401 (1.57 KB)
Trainable params: 401 (1.57 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
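
The parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies that formula to the layer sizes used in `get_model()`; the variable names are illustrative.

```python
# Input width followed by the number of units in each Dense layer
layer_sizes = [6, 10, 20, 5, 1]

# weights (n_in * n_out) + biases (n_out) for each consecutive pair of layers
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

total_params = sum(params_per_layer)
```

This gives 70, 220, 105, and 6 parameters per layer, for a total of 401, in agreement with the summary.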


## **6: Model Training**

### **6.1: Model Training**

In [24]:
es_cb=EarlyStopping(monitor='val_loss',patience=5)

model=get_model()
preds_on_untrained=model.predict(X_test)

history=model.fit(
    X_train,y_train,
    validation_data=(X_test,y_test),
    epochs=100,
    callbacks=[es_cb]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100


### **6.2: Plot Training and Validation Loss**

In [25]:
plot_loss(history)

NameError: name 'plot_loss' is not defined
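
The `NameError` above means that `plot_loss` was not provided by the `utils` import in this run. The original helper is not shown in this notebook, so the following is only a minimal stand-in. It assumes `history` is the object returned by `model.fit`, whose `history` attribute is a dict with a `'loss'` list and, when validation data is supplied, a `'val_loss'` list.

```python
import matplotlib.pyplot as plt

def plot_loss(history):
    """Plot training and validation loss per epoch from a Keras History object."""
    h = history.history
    epochs = range(1, len(h["loss"]) + 1)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(epochs, h["loss"], label="Training loss")
    # val_loss is only recorded when validation data was passed to fit()
    if "val_loss" in h:
        ax.plot(epochs, h["val_loss"], label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss (MSE)")
    ax.legend()
    return ax
```

Defining this in a cell before calling `plot_loss(history)` should let the plotting cell run.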

## **7: Predictions**

### **7.1: Plot Raw Predictions**

In [26]:
preds_on_trained=model.predict(X_test)
compare_predictions(preds_on_untrained,preds_on_trained,y_test)



NameError: name 'compare_predictions' is not defined
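
As with `plot_loss`, the `NameError` above means `compare_predictions` was not provided by the `utils` import. The original helper is not shown here, so the sketch below is an assumed stand-in: it scatters each set of predictions against the true labels and draws a dashed diagonal on which perfect predictions would fall. The parameter names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

def compare_predictions(preds_before, preds_after, y_true):
    """Scatter predictions from two models against the true labels."""
    # Flatten so that (n, 1) model outputs and plain lists are both accepted
    y_true = np.ravel(y_true)

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.plot(np.ravel(preds_before), y_true, "ro", label="Untrained model")
    ax.plot(np.ravel(preds_after), y_true, "go", label="Trained model")

    # Points on this diagonal are predictions that equal their labels
    lo, hi = y_true.min(), y_true.max()
    ax.plot([lo, hi], [lo, hi], "b--", label="Perfect prediction")

    ax.set_xlabel("Predictions")
    ax.set_ylabel("Labels")
    ax.legend()
    return ax
```

After training, the green points should cluster around the diagonal much more tightly than the red ones.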

### **7.2: Plot Price Predictions**

In [27]:
# model.predict returns arrays of shape (n, 1); flatten them so each element is a scalar that round() accepts
price_untrained=[convert_label_values(y) for y in preds_on_untrained.flatten()]
price_trained=[convert_label_values(y) for y in preds_on_trained.flatten()]
price_test=[convert_label_values(y) for y in y_test]

compare_predictions(price_untrained,price_trained,price_test)