# EDA, Linear Regression and Neural Network

In this notebook we will explore the data, fit a linear regression, and then train a neural network as a more flexible regression model.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

In [None]:
plt.rcParams['figure.figsize'] = (15,16)
sns.set_theme(style="darkgrid")

In [None]:
df = pd.read_csv('../input/house-price-dataset-with-other-information/kc_house_data.csv')
df

Check whether there are any null values, then drop the date column, which we will not use.

In [None]:
df.isnull().values.any()
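
The check above returns a single True/False for the whole frame. If it ever comes back True, a per-column count shows where the gaps are. A minimal sketch on a made-up toy frame (the `toy` name and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up toy frame standing in for the housing data (values are illustrative only)
toy = pd.DataFrame({
    "price": [100.0, 200.0, np.nan],
    "bedrooms": [3, 3, 2],
    "sqft_living": [1000.0, np.nan, 800.0],
})

any_missing = bool(toy.isnull().values.any())  # same whole-frame check as above
missing_per_column = toy.isnull().sum()        # number of nulls in each column

print(any_missing)
print(missing_per_column)
```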

In [None]:
df = df.drop(columns='date')  # the positional axis argument was removed in pandas 2.0
df

Let's check how correlated the features are:

In [None]:
sns.heatmap(df.corr(), annot=True, fmt=".2f")
plt.show()

There is definitely some correlation, especially among the square-footage features.
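
A large heatmap can be hard to scan. Listing each feature's correlation with the target, sorted, gives a more direct view. The sketch below uses synthetic data (the `toy` frame, its column names, and its numbers are assumptions made for illustration), but the same `corr()["price"]` pattern should apply to the real frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: price depends on sqft_living, while "noise" is unrelated
sqft = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "noise": rng.normal(size=n),
    "price": 200 * sqft + rng.normal(0, 50_000, n),
})

# Correlation of every feature with the target, strongest first
corr_with_price = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```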

# Linear Regression

Our first try to predict the house prices will be a simple Linear Regression. For this, we'll use sklearn.

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [None]:
X = df.loc[:, df.columns != 'price']
y = df[['price']]

Split the data, with 30% test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Now we will standardize the features (zero mean, unit variance).

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)

In [None]:
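# Reuse the scaler fitted on the training data; refitting it on the test set
# would scale the two splits inconsistently and leak test statistics.
X_test = scaler.transform(X_test)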
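
Scaling statistics are normally estimated on the training split only and then reused for the test split. The NumPy sketch below shows the arithmetic that standardization performs, on synthetic arrays (`X_tr` and `X_te` are placeholder names, not the notebook's variables):

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic "training" features
X_te = rng.normal(loc=50.0, scale=10.0, size=(40, 3))   # synthetic "test" features

# Statistics come from the training split only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std  # the test split reuses the training mean and std

print(X_tr_scaled.mean(axis=0))  # numerically zero
print(X_te_scaled.mean(axis=0))  # close to zero, but not exactly
```

The test split ends up only approximately centered, which is expected: it is measured on the training set's scale.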

Now we fit the linear model to the training data.

In [None]:
regression = linear_model.LinearRegression()
regression.fit(X_train, y_train)


To measure the quality of the fit, we'll use the R² score (coefficient of determination). For more info about how it is calculated, check https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

In [None]:
y_pred = regression.predict(X_test)
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
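
For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. A small NumPy sketch of that formula (`r2_manual` is a hypothetical helper written for illustration, not part of sklearn):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for 1-D targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
score = r2_manual(y_true, y_pred)
print(round(score, 3))
```

A perfect prediction gives 1, always predicting the mean of the targets gives 0, and a model worse than that gives a negative score.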

We can try something a little more flexible.

# Neural Network

Now we'll use PyTorch to create a neural network to predict the housing prices.

In [None]:
import torch
import torch.nn.functional as F
import torch.utils.data as Data

Here we define our neural net. It has 4 linear layers: three hidden layers with 32 neurons each, followed by an output layer. ReLU is used as the activation after each hidden layer.

In [None]:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(n_feature, n_hidden)
        self.fc2 = torch.nn.Linear(n_hidden, n_hidden)
        self.fc3 = torch.nn.Linear(n_hidden, n_hidden)
        self.fc4 = torch.nn.Linear(n_hidden, n_output)   

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)     
        return x

We'll use MSELoss as the loss function and Adam as the optimizer.

In [None]:
net = Net(n_feature=len(df.columns)-1, n_hidden=32, n_output=1)    
print(net)  # net architecture
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()  

Convert the data to tensors so PyTorch can use it.

In [None]:
X_train, X_test = torch.tensor(X_train).float(), torch.tensor(X_test).float()
y_train, y_test = torch.tensor(y_train.values).float(), torch.tensor(y_test.values).float()

Now the training. We'll train for 10 epochs, updating the weights after every single sample. The comments in the code below indicate what each step does.

In [None]:
for epoch in range(10):
    running_loss = 0.0   # reset at the start of every epoch so leftovers don't carry over
    for i in range(len(X_train)):
        prediction = net(X_train[i])     # forward pass on a single sample
        loss = loss_func(prediction, y_train[i])     # arguments must be (1. nn output, 2. target)

        optimizer.zero_grad()   # clear gradients left over from the previous step
        loss.backward()         # backpropagation, compute gradients
        optimizer.step()        # update the weights

        running_loss += loss.item()
        if i % 15000 == 14999:    # print the average loss every 15000 samples
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 15000))
            running_loss = 0.0
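
The loop above updates the weights one sample at a time. A common alternative is mini-batch training with a `DataLoader`, which usually runs much faster. The sketch below is an illustration on synthetic data, not the notebook's actual training run: the `toy_*` names, the batch size of 32, the smaller network, and the 20 epochs are all assumptions.

```python
import torch
import torch.utils.data as Data

torch.manual_seed(0)

# Synthetic stand-in data: 256 samples, 4 features, roughly linear target
X_toy = torch.randn(256, 4)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
y_toy = X_toy @ true_w + 0.1 * torch.randn(256, 1)

# The DataLoader yields shuffled mini-batches of (features, target) pairs
toy_loader = Data.DataLoader(
    Data.TensorDataset(X_toy, y_toy), batch_size=32, shuffle=True
)

toy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
toy_optimizer = torch.optim.Adam(toy_net.parameters(), lr=0.01)
toy_loss_func = torch.nn.MSELoss()

epoch_losses = []
for epoch in range(20):
    total = 0.0
    for xb, yb in toy_loader:
        loss = toy_loss_func(toy_net(xb), yb)  # mean loss over the mini-batch
        toy_optimizer.zero_grad()
        loss.backward()
        toy_optimizer.step()
        total += loss.item() * len(xb)         # undo the batch mean to sum per sample
    epoch_losses.append(total / len(X_toy))    # average loss per sample this epoch

print(epoch_losses[0], epoch_losses[-1])
```

Tracking one average loss per epoch also makes it easy to see whether training is still improving.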

The results are better, but there is definitely still room for improvement.

In [None]:
with torch.no_grad():   # evaluation only, so no gradients are needed
    print('Coefficient of determination: %.2f' % r2_score(y_test.numpy(), net(X_test).numpy()))