# Creating a Linear Regression Model using PyTorch
By Kenneth Lim

Featuring: California House Price

**About The Dataset**

The US Census Bureau has published California Census Data, which contains 10 types of metrics such as the population, median income, and median housing price for each block group in California. The dataset also serves as an input for project scoping and helps specify the functional and nonfunctional requirements for it.

Problem Objective:
The project aims at building a model of housing prices to predict median house values in California using the provided dataset. This model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.

Districts or block groups are the smallest geographical units for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). There are 20,640 districts in the project dataset.

| Feature              | Type                   | Description                                                                                     |
|-----------------------|------------------------|-------------------------------------------------------------------------------------------------|
| longitude             | numeric (float)        | Longitude value for the block in California, USA                                               |
| latitude              | numeric (float)        | Latitude value for the block in California, USA                                                |
| housing_median_age    | numeric (int)          | Median age of the house in the block                                                           |
| total_rooms           | numeric (int)          | Count of the total number of rooms (excluding bedrooms) in all houses in the block             |
| total_bedrooms        | numeric (float)        | Count of the total number of bedrooms in all houses in the block                               |
| population            | numeric (int)          | Total population of the block                                                                   |
| households            | numeric (int)          | Count of the total number of households in the block                                           |
| median_income         | numeric (float)        | Median of the total household income of all the houses in the block                            |
| ocean_proximity       | categorical (string)   | Type of the landscape of the block <br> **Unique Values:** 'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND' |
| median_house_value    | numeric (int)          | Median of the household prices of all the houses in the block                                  |


**What are we doing for now?**

For this experiment, we invert the usual task: instead of predicting the house value, we will predict the location of each block (longitude and latitude) from its population, number of households, median income, and median house value.

In [None]:
#Importing essential packages
import torch
import numpy as np
import sys
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
#Downloading dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shibumohapatra/house-price")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'house-price' dataset.
Path to dataset files: /kaggle/input/house-price


In [None]:
data = pd.read_csv('/kaggle/input/house-price/1553768847-housing.csv')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   ocean_proximity     20640 non-null  object 
 9   median_house_value  20640 non-null  int64  
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB


The only columns we need are longitude, latitude, median income, median house value, population, and households, so we will drop the rest.

In [None]:
cols_to_drop = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'ocean_proximity']
data.drop(cols_to_drop, axis=1, inplace=True)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   population          20640 non-null  int64  
 3   households          20640 non-null  int64  
 4   median_income       20640 non-null  float64
 5   median_house_value  20640 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 967.6 KB


*Now we will proceed with the workflow*

In [None]:
torch.__version__

'2.8.0+cu126'

In [None]:
# Check whether a GPU is available; fall back to the CPU otherwise
device = torch.device("cuda:0" if (torch.cuda.is_available()) else "cpu")
print("Device: ", device)

Device:  cuda:0
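
Selecting a device does not move anything by itself. As a minimal sketch (the `demo_model` and `demo_batch` names are hypothetical and not part of this notebook), this is how a model and a batch would typically be placed on the chosen device. Note the asymmetry: calling `.to()` on a module moves its parameters in place, while calling `.to()` on a tensor returns a new tensor that has to be assigned.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical tiny model and batch, used only for illustration.
demo_model = nn.Linear(4, 2)
demo_batch = torch.randn(8, 4)

# nn.Module.to() moves the parameters in place (and also returns the module).
demo_model.to(device)

# Tensor.to() returns a new tensor, so the result must be assigned.
demo_batch = demo_batch.to(device)

demo_out = demo_model(demo_batch)
print(demo_out.device, demo_out.shape)
```

The model and its inputs must end up on the same device, otherwise the forward pass raises an error.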


In [None]:
# Split into input features (4 columns) and targets (longitude, latitude)
x_train = data.drop(['longitude', 'latitude'], axis=1)
y_train = data[['longitude', 'latitude']]

In [None]:
inputs = torch.from_numpy(x_train.values).float()
targets = torch.from_numpy(y_train.values).float()
print(inputs.size())
print(targets.size())

torch.Size([20640, 4])
torch.Size([20640, 2])
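
The `.float()` call matters because `nn.Linear` layers store their weights as 32-bit floats by default, while pandas hands over 64-bit values. A small sketch on a hypothetical two-row frame (`demo_df` is made up for illustration) shows the dtype at each step:

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical two-row frame mixing an int64 and a float64 column.
demo_df = pd.DataFrame({"population": [322, 2401],
                        "median_income": [8.3252, 8.3014]})

as_numpy = demo_df.values                # mixed columns are upcast to float64
as_tensor = torch.from_numpy(as_numpy)   # keeps the NumPy dtype (float64)
as_float32 = as_tensor.float()           # converted copy in float32

print(as_numpy.dtype, as_tensor.dtype, as_float32.dtype)
```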


*We will now use PyTorch's data primitives, Dataset and DataLoader*

In [None]:
from torch.utils.data import TensorDataset

In [None]:
# Define dataset
train_ds = TensorDataset(inputs, targets)
train_ds[0:3]

(tensor([[3.2200e+02, 1.2600e+02, 8.3252e+00, 4.5260e+05],
         [2.4010e+03, 1.1380e+03, 8.3014e+00, 3.5850e+05],
         [4.9600e+02, 1.7700e+02, 7.2574e+00, 3.5210e+05]]),
 tensor([[-122.2300,   37.8800],
         [-122.2200,   37.8600],
         [-122.2400,   37.8500]]))

In [None]:
from torch.utils.data import DataLoader

*DataLoader splits the dataset into batches, which is especially handy when dealing with very large datasets*

In [None]:
batch_size = 8
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

In [None]:
for xb, yb in train_dl:
    print(xb)
    print(yb)
    break

tensor([[1.1230e+03, 3.4700e+02, 5.5792e+00, 2.1840e+05],
        [7.1110e+03, 2.4190e+03, 3.3627e+00, 1.9790e+05],
        [1.9390e+03, 4.8400e+02, 4.2875e+00, 1.7660e+05],
        [7.4400e+02, 3.1200e+02, 2.6518e+00, 1.5610e+05],
        [3.7400e+02, 1.8000e+02, 6.2673e+00, 3.5720e+05],
        [1.7600e+03, 5.4200e+02, 4.0227e+00, 1.2650e+05],
        [2.2320e+03, 8.2500e+02, 6.6659e+00, 5.0000e+05],
        [2.2690e+03, 1.2320e+03, 5.7097e+00, 3.1670e+05]])
tensor([[-118.4700,   34.2600],
        [-117.9000,   33.7800],
        [-118.3700,   34.2100],
        [-122.3000,   38.2900],
        [-118.3900,   33.9700],
        [-119.5800,   36.7700],
        [-121.8300,   37.2300],
        [-118.4400,   34.0500]])
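
To get a feel for what the batching above implies for one epoch, here is a small sketch on synthetic tensors with the same shapes as ours (the `demo_*` names are placeholders, and the random values stand in for the real data). With 20,640 rows and a batch size of 8, one pass over the data takes 2,580 steps.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-ins with the same shapes as the tensors in this notebook.
demo_inputs = torch.randn(20640, 4)
demo_targets = torch.randn(20640, 2)

demo_dl = DataLoader(TensorDataset(demo_inputs, demo_targets),
                     batch_size=8, shuffle=True)

# Batches per epoch: ceil(number of rows / batch size)
print(len(demo_dl), math.ceil(20640 / 8))

# Each iteration yields one (inputs, targets) pair of batch tensors.
first_xb, first_yb = next(iter(demo_dl))
print(first_xb.shape, first_yb.shape)
```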


*We will now define the model using the nn.Linear class from PyTorch*

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fully connected x2
model = nn.Sequential(
    nn.Linear(4, 3),   # first FC: input 4 → hidden 3
    nn.ReLU(),         # non-linearity
    nn.Linear(3, 2)    # second FC: hidden 3 → output 2
)

print(model)

Sequential(
  (0): Linear(in_features=4, out_features=3, bias=True)
  (1): ReLU()
  (2): Linear(in_features=3, out_features=2, bias=True)
)


Since our requirement calls for two fully connected layers, we use nn.Sequential to stack them with a ReLU non-linearity in between. The hidden layer lets the model represent non-linear relationships that a single linear layer cannot.
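
To make the stacking concrete, the sketch below rebuilds the same architecture (as `demo_model`, a fresh copy with its own random weights, applied to a made-up input `demo_x`) and reproduces its forward pass by hand: a linear map, a ReLU, then a second linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
demo_x = torch.randn(5, 4)  # 5 hypothetical samples with 4 features each

fc1, fc2 = demo_model[0], demo_model[2]

# nn.Linear computes x @ W^T + b; ReLU clamps negative values to zero.
hidden = torch.relu(demo_x @ fc1.weight.T + fc1.bias)
manual_out = hidden @ fc2.weight.T + fc2.bias

print(torch.allclose(manual_out, demo_model(demo_x), atol=1e-5))
```

Without the ReLU, the two linear layers would collapse into a single linear map, so the hidden layer would add parameters but no expressive power.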


In [None]:
list(model.parameters())  # model.parameters() returns a generator, so wrap it in list() to display it

[Parameter containing:
 tensor([[-0.2293,  0.3390,  0.3096, -0.4098],
         [-0.0441,  0.3973,  0.3422,  0.2145],
         [ 0.1209, -0.1670,  0.4479,  0.1457]], requires_grad=True),
 Parameter containing:
 tensor([-0.3035, -0.1168,  0.3317], requires_grad=True),
 Parameter containing:
 tensor([[-0.2023, -0.4467, -0.4492],
         [-0.5334,  0.2858,  0.3055]], requires_grad=True),
 Parameter containing:
 tensor([-0.2968, -0.5370], requires_grad=True)]

In [None]:
# The number of trainable parameters gives a rough measure of model complexity
print(sum(p.numel() for p in model.parameters() if p.requires_grad))

23
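
The count of 23 can be checked by hand: each linear layer holds one weight per input-output pair plus one bias per output. A short sketch, using a fresh copy of the architecture named `demo_model`:

```python
import torch.nn as nn

demo_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Linear(4, 3): 4 * 3 weights + 3 biases = 15
# Linear(3, 2): 3 * 2 weights + 2 biases = 8
by_hand = (4 * 3 + 3) + (3 * 2 + 2)
counted = sum(p.numel() for p in demo_model.parameters() if p.requires_grad)

print(by_hand, counted)
```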


In [None]:
preds = model(inputs)
preds

tensor([[-73014.5781,  47905.5625],
        [-58015.4414,  38062.9414],
        [-56818.6211,  37279.1055],
        ...,
        [-14964.6865,   9817.5088],
        [-13721.3438,   9001.6895],
        [-14520.3584,   9526.1387]], grad_fn=<AddmmBackward0>)

*We will now define the loss function from the **nn** module*

In [None]:
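# Mean squared error is the loss used for this regression task
criterion_mse = nn.MSELoss()
# Cross-entropy is a classification loss; it is defined here but not used in this notebook
criterion_softmax_cross_entropy_loss = nn.CrossEntropyLoss()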

In [None]:
mse = criterion_mse(preds, targets)
print(mse)
print(mse.item())  # .item() extracts the loss as a plain Python float

tensor(1.0437e+09, grad_fn=<MseLossBackward0>)
1043722560.0
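
With its default settings, `nn.MSELoss` averages the squared differences over every element of the prediction tensor. The sketch below checks that against a manual computation on a tiny made-up example (`demo_preds` and `demo_targets` are illustrative values, not taken from the dataset):

```python
import torch
import torch.nn as nn

demo_preds = torch.tensor([[1.0, 2.0], [3.0, 5.0]])
demo_targets = torch.tensor([[0.0, 2.0], [3.0, 1.0]])

# Squared errors are 1, 0, 0 and 16, so the mean is 17 / 4 = 4.25
by_hand = ((demo_preds - demo_targets) ** 2).mean()
builtin = nn.MSELoss()(demo_preds, demo_targets)

print(by_hand.item(), builtin.item())
```

Because the error is squared, the untrained model's predictions in the tens of thousands produce a loss on the order of 1e9.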


*Next is the optimizer*

In [None]:
# Define optimizer
# Momentum also takes past gradients into account when updating the weights,
# which can help the optimizer move past shallow local minima and flat regions.
# With momentum = 0.9, each update uses the current gradient plus the gradient
# from one step ago multiplied by 0.9, the one from two steps ago multiplied by 0.9^2 = 0.81, etc.

opt = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
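
The momentum rule described in the comments can be written out by hand. The following sketch compares `torch.optim.SGD` with a manual update on a single hypothetical parameter `w` and a toy loss `w**2`; the learning rate of 0.1 is chosen only to make the steps visible and is not the value used in this notebook.

```python
import torch

lr, momentum = 0.1, 0.9

# Reference run with torch.optim.SGD on a single hypothetical parameter.
w = torch.tensor([1.0], requires_grad=True)
demo_opt = torch.optim.SGD([w], lr=lr, momentum=momentum)

# Manual run of the same rule:
#   buf = momentum * buf + grad
#   w   = w - lr * buf
w_manual = torch.tensor([1.0])
buf = torch.zeros(1)

for step in range(3):
    demo_opt.zero_grad()
    loss = (w ** 2).sum()   # toy loss whose gradient is 2 * w
    loss.backward()
    demo_opt.step()

    grad = 2 * w_manual
    buf = momentum * buf + grad
    w_manual = w_manual - lr * buf

print(w.detach(), w_manual)
```

Unrolling `buf` over three steps gives the current gradient plus 0.9 times the previous one plus 0.81 times the one before that, which is the weighting described above.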

*Then we start training*

In [None]:
# Utility function to train the model
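def fit(num_epochs, model, loss_fn, opt, train_dl):

    # Device the model's parameters live on (the model is never moved in this
    # notebook, so this is the CPU even when a GPU is available)
    model_device = next(model.parameters()).device

    # Repeat for given number of epochs
    for epoch in range(num_epochs):

        # Train with batches of data
        for xb, yb in train_dl:

            # Tensor.to() returns a new tensor instead of modifying in place,
            # so the result must be assigned back
            xb = xb.to(model_device)
            yb = yb.to(model_device)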

            # 1. Predict
            pred = model(xb)

            # 2. Calculate loss
            loss = loss_fn(pred, yb)

            # 3. Calculate gradient
            opt.zero_grad()  #if not, the gradients will accumulate
            loss.backward()

            # Print out the gradients of the first layer (a Sequential model
            # has no .weight of its own, so index into it).
            # print ('dL/dw: ', model[0].weight.grad)
            # print ('dL/db: ', model[0].bias.grad)

            # 4. Update parameters using gradients
            opt.step()

        # Print the progress (this is the loss of the last batch in the epoch)
        if (epoch+1) % 10 == 0:
            sys.stdout.write("\rEpoch [{}/{}], Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

In [None]:
#train for 1000 epochs
fit(1000, model, criterion_mse, opt, train_dl)

Epoch [1000/1000], Loss: 5.0568
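
The value printed by `fit` is the loss of the last mini-batch only, so it fluctuates with whichever 8 samples happen to come last and need not match the loss over the whole dataset. One possible alternative, sketched here on synthetic data with hypothetical `demo_*` names and a learning rate chosen only for the toy problem, is to report the sample-weighted mean loss of each epoch:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic regression data, used only for illustration.
demo_x = torch.randn(64, 4)
demo_y = torch.randn(64, 2)
demo_dl = DataLoader(TensorDataset(demo_x, demo_y), batch_size=8, shuffle=True)

demo_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
demo_loss_fn = nn.MSELoss()
demo_opt = torch.optim.SGD(demo_model.parameters(), lr=0.01, momentum=0.9)

epoch_means = []
for epoch in range(5):
    running, seen = 0.0, 0
    for xb, yb in demo_dl:
        loss = demo_loss_fn(demo_model(xb), yb)
        demo_opt.zero_grad()
        loss.backward()
        demo_opt.step()
        # Weight each batch loss by its size so a shorter final batch
        # does not distort the average.
        running += loss.item() * xb.size(0)
        seen += xb.size(0)
    epoch_means.append(running / seen)

print(epoch_means)
```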

In [None]:
# Generate predictions
preds = model(inputs)
loss = criterion_mse(preds, targets)
print(loss.item())

4.28816032409668


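After training, the mean squared error over the full dataset is about 4.29 (in squared degrees), which corresponds to a root-mean-square error of roughly 2.07 degrees per coordinate. That is far closer to the actual values than the untrained model's predictions (loss of about 1.04e9), although it is still a coarse estimate given that California spans only about 10 degrees of longitude and latitude.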