# Training a neural network to solve a regression problem 

Consider an e-commerce company that wants to predict demand: it would like to estimate how many mobile phones are likely to be purchased in the upcoming week so that it can plan its inventory accordingly. Our goal is to develop a model that makes this prediction. Let's assume that the demand for a given week is a function of three variables: (a) the number of mobile phones sold in the previous week, (b) the discount offered, and (c) the number of weeks until the next festival. Let's call these variables $prev\_week\_sales$, $discount\_fraction$ and $weeks\_to\_next\_festival$ respectively. This can be modelled as a regression problem in which we predict the number of mobile phones sold in the upcoming week from an input vector of the form [$prev\_week\_sales$, $discount\_fraction$, $weeks\_to\_next\_festival$].

In [1]:
import torch
torch.manual_seed(100)

<torch._C.Generator at 0x106f73590>

In [2]:
class TwoLayeredNN(torch.nn.Module):
    """
    We define a torch module that represents a simple neural network with 2 hidden layers
    """
    def __init__(self, input_size, hidden1_size, hidden2_size, output_size):
        """
        Args:
            input_size (int): Number of inputs
            hidden1_size (int): Number of neurons in the first hidden layer
            hidden2_size (int): Number of neurons in the second hidden layer
            output_size (int): Number of output neurons
        """
        super(TwoLayeredNN, self).__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, hidden1_size), # (input x hidden1)
            torch.nn.Sigmoid(),
            torch.nn.Linear(hidden1_size, hidden2_size), # (hidden1 x hidden2)
            torch.nn.Sigmoid(),
            torch.nn.Linear(hidden2_size, output_size) # (hidden2 x output)
        )
        self._initialize_weights()
            
    def forward(self, X):
        return self.model(X)
    
    def _initialize_weights(self):
        for m in self.model.modules():
            if isinstance(m, torch.nn.Linear):
                # Use the in-place variant; xavier_uniform (no underscore) is deprecated
                torch.nn.init.xavier_uniform_(m.weight.data)
                torch.nn.init.constant_(m.bias.data, 0)

For the purposes of demonstration, let us construct a toy dataset $X$ of $n$ samples, each comprising the 3 features ($prev\_week\_sales$, $discount\_fraction$ and $weeks\_to\_next\_festival$). We also generate the ground truth $\vec{y}$, representing the current week's sales, as an arbitrary nonlinear function of the input variables plus some random noise. Our task is now to build a neural network that learns this function from the given dataset.

In [3]:
# Let us artificially generate the training data
n = 10000

# Sales of previous week
X_prev_week_sales = torch.randint(low=10000, high=90000, size=(n,)).float().unsqueeze(1)
# Discount offered, in percent (integers from 0 to 79)
X_discount_fraction = torch.randint(low=0, high=80, size=(n,)).float().unsqueeze(1)
# Number of weeks to the next festival
X_weeks_to_next_festival = torch.randint(low=0, high=10, size=(n,)).float().unsqueeze(1)

# X is an n x 3 matrix representing our toy dataset
X = torch.cat((X_prev_week_sales, X_discount_fraction, X_weeks_to_next_festival), dim=1)

# y is the ground truth vector which we generate as an arbitrary function of the input variables
y = torch.ceil(1.2 * X_prev_week_sales + \
    (X_discount_fraction / 100).pow(2) * 5000 + \
    (10 - X_weeks_to_next_festival).pow(3) + \
    100 * torch.randn(X_prev_week_sales.shape)) # We also add in some random Gaussian noise

In [4]:
X

tensor([[7.6440e+04, 6.3000e+01, 2.0000e+00],
        [4.1512e+04, 5.0000e+01, 3.0000e+00],
        [7.7395e+04, 7.7000e+01, 9.0000e+00],
        ...,
        [2.1532e+04, 7.0000e+01, 4.0000e+00],
        [3.0035e+04, 5.0000e+00, 2.0000e+00],
        [1.8894e+04, 1.9000e+01, 4.0000e+00]])

In [5]:
y

tensor([[94182.],
        [51531.],
        [95938.],
        ...,
        [28559.],
        [36599.],
        [23056.]])

Note that the ranges of the features differ widely. $prev\_week\_sales$ is in the tens of thousands of units sold, $discount\_fraction$ lies between 0 and 79 (percent), and $weeks\_to\_next\_festival$ lies between 0 and 9 weeks. In machine learning, it is generally good practice to bring all the values to a common scale, because this can speed up training and reduce the chance of getting stuck in a poor local minimum. Here we use min-max normalization, which linearly rescales each feature (and the target) to the range [0, 1].

In [6]:
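def min_max_norm(X, y):
    """Rescale each column of X and y linearly to the range [0, 1]."""
    X, y = X.clone(), y.clone()
    # Column-wise minimum and maximum of the features
    X_min, X_max = torch.min(X, dim=0)[0], torch.max(X, dim=0)[0]
    X = (X - X_min) / (X_max - X_min)
    # Minimum and maximum of the targets
    y_min, y_max = torch.min(y, dim=0)[0], torch.max(y, dim=0)[0]
    y = (y - y_min) / (y_max - y_min)
    return X, y

X_norm, y_norm = min_max_norm(X, y)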

In [7]:
X_norm.max(dim=0)[0]

tensor([1., 1., 1.])

To solve the regression problem, let's now instantiate the neural network defined earlier. It takes in 3-dimensional input vectors of the form [$prev\_week\_sales$, $discount\_fraction$, $weeks\_to\_next\_festival$] and generates output predictions. This simple network contains 2 hidden layers with 10 and 5 neurons respectively, followed by a single output neuron.
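To get a feel for the size of this architecture, the following minimal sketch rebuilds the same stack of layers as a standalone `torch.nn.Sequential` (the name `sketch_model` and the random batch are illustrative assumptions, not part of the notebook) and counts its trainable parameters.

```python
import torch

# Standalone copy of the 3 -> 10 -> 5 -> 1 architecture, for illustration only
sketch_model = torch.nn.Sequential(
    torch.nn.Linear(3, 10),
    torch.nn.Sigmoid(),
    torch.nn.Linear(10, 5),
    torch.nn.Sigmoid(),
    torch.nn.Linear(5, 1),
)

# Each Linear layer contributes a weight matrix and a bias vector
for name, param in sketch_model.named_parameters():
    print(name, tuple(param.shape))

# (3*10 + 10) + (10*5 + 5) + (5*1 + 1) = 101 trainable parameters
n_params = sum(p.numel() for p in sketch_model.parameters())
print("Trainable parameters:", n_params)

# A batch of 4 three-feature inputs produces 4 scalar predictions
batch = torch.rand(4, 3)
out = sketch_model(batch)
print("Output shape:", tuple(out.shape))
```

With only about a hundred parameters, the network is small enough to train on the full dataset in a few seconds on a CPU.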

In [8]:
# Let us create the neural network to do regression. 
# input_size = number of input features
# hidden1_size = 10
# hidden2_size = 5
# output_size = 1 because we predict a single value (the demand)

nn = TwoLayeredNN(input_size=X_norm.shape[-1],
                  hidden1_size=10,
                  hidden2_size=5,
                  output_size=1)

We want a loss function that compares the demand predicted by the neural network model with the actual demand from the ground truth, and returns larger values when the difference is higher and smaller values when the difference is lower. Mean squared error is one such loss that is readily available in PyTorch through the $torch.nn.MSELoss$ class.
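As a quick sanity check of what this loss computes, the sketch below compares `torch.nn.MSELoss` (with its default mean reduction) against the mean of squared differences written out by hand. The three prediction and target values are made up for illustration.

```python
import torch

# Made-up predictions and targets, shaped (n, 1) like the network output
y_pred = torch.tensor([[0.2], [0.5], [0.9]])
y_true = torch.tensor([[0.0], [0.5], [1.0]])

# Built-in loss with the default reduction ("mean")
builtin = torch.nn.MSELoss()(y_pred, y_true)

# The same quantity by hand: average of the squared differences
manual = ((y_pred - y_true) ** 2).mean()

print(builtin.item(), manual.item())
```

Squaring the differences penalizes large errors more heavily than small ones, which matches the requirement that the loss grow as predictions move away from the ground truth.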

In [9]:
loss = torch.nn.MSELoss() # Mean squared error

During training, we iteratively run the forward pass, compute the loss, calculate gradients and update the weights. The neural network is initialized with random weights and hence makes arbitrary predictions for the demand in the early iterations of the training loop. This translates to a high initial loss value. However, as the training proceeds, the weights are updated so that the loss value is minimized, and the predicted demand comes closer to the actual ground truth. 

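To update the weights, we use what is known as an optimizer. Here, we use the stochastic gradient descent (SGD) optimizer with momentum, available as $torch.optim.SGD$. Because we pass the entire dataset through the network in every iteration, each step is effectively a full-batch gradient descent step. PyTorch offers various optimizers, which will be discussed in detail in the next chapter.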
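To make the optimizer's role concrete, here is a minimal sketch of the update that SGD with momentum performs, written by hand and checked against `torch.optim.SGD` on a toy quadratic loss. The toy parameter, target values and variable names are assumptions chosen for illustration; dampening, weight decay and Nesterov momentum are left at their defaults (off).

```python
import torch

lr, momentum = 0.3, 0.9
target = torch.tensor([3.0, 0.5])  # toy values, for illustration only

# Parameter updated by PyTorch's optimizer
w_ref = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w_ref], lr=lr, momentum=momentum)

# The same parameter updated by hand
w = torch.tensor([1.0, -2.0])
velocity = torch.zeros_like(w)

for _ in range(5):
    # Reference step: clear gradients, backpropagate, update
    opt.zero_grad()
    toy_loss = ((w_ref - target) ** 2).mean()
    toy_loss.backward()
    opt.step()

    # Manual step: the gradient of mean((w - target)^2) is 2 * (w - target) / numel
    grad = 2 * (w - target) / w.numel()
    velocity = momentum * velocity + grad  # accumulate past gradients
    w = w - lr * velocity                  # move against the accumulated direction

print(w_ref.detach(), w)
```

The velocity term carries over a fraction of previous gradients, so consecutive steps in a consistent direction build up speed while oscillating components partly cancel.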

We typically run the training loop until the loss reaches an acceptably low value. Once the training loop completes, we have a model that can take in new data points and generate output predictions.

In [10]:
optimizer = torch.optim.SGD(nn.parameters(), lr=0.3, momentum=0.9)
num_iters = 5000
for i in range(num_iters):
    # Forward pass
    y_out = nn(X_norm)
    # Compute mean squared loss
    mse_loss = loss(y_out, y_norm)
    if i % 1000 == 0:
        print(f"Step: {i} Loss: {mse_loss}")
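    # Clear gradients left over from the previous iteration
    optimizer.zero_grad()
    # Backpropagation: compute gradients of the loss w.r.t. the weights
    mse_loss.backward()
    # Update the weights using the computed gradients
    optimizer.step()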
print(f"Step: {i} Loss: {mse_loss}")

Step: 0 Loss: 0.08074740320444107
Step: 1000 Loss: 2.626368950586766e-05
Step: 2000 Loss: 2.1405438019428402e-05
Step: 3000 Loss: 1.9865865397150628e-05
Step: 4000 Loss: 1.8848664694814943e-05
Step: 4999 Loss: 1.8041795556200668e-05
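Because the network is trained on min-max scaled inputs and targets, a new data point has to be scaled with the training set's minima and maxima before it is fed to the model, and the model's output has to be mapped back to units sold. The sketch below shows one way this could look. The helpers `fit_min_max`, `scale` and `unscale`, the three stand-in training rows and the untrained placeholder model are all hypothetical; in the notebook, the trained network and the statistics of $X$ and $y$ would take their place.

```python
import torch

def fit_min_max(t):
    """Return the column-wise minimum and maximum of a 2-D tensor."""
    return t.min(dim=0)[0], t.max(dim=0)[0]

def scale(t, t_min, t_max):
    """Map values linearly from [t_min, t_max] to [0, 1]."""
    return (t - t_min) / (t_max - t_min)

def unscale(t_scaled, t_min, t_max):
    """Inverse of scale: map values from [0, 1] back to original units."""
    return t_scaled * (t_max - t_min) + t_min

# Stand-in training data (hypothetical values, not the notebook's dataset)
X_train = torch.tensor([[10000., 0., 0.],
                        [50000., 40., 5.],
                        [90000., 79., 9.]])
y_train = torch.tensor([[13000.], [61000.], [111000.]])

# Statistics must come from the training data, not from the new point
X_min, X_max = fit_min_max(X_train)
y_min, y_max = fit_min_max(y_train)

# Placeholder for the trained network; it is untrained here
placeholder_model = torch.nn.Linear(3, 1)

# A new week: previous sales, discount in percent, weeks to the next festival
x_new = torch.tensor([[42000., 20., 3.]])
with torch.no_grad():  # no gradients are needed at prediction time
    y_scaled = placeholder_model(scale(x_new, X_min, X_max))
y_units = unscale(y_scaled, y_min, y_max)  # prediction in units sold

# Scaling followed by unscaling recovers the original targets
round_trip = unscale(scale(y_train, y_min, y_max), y_min, y_max)
print(tuple(y_units.shape), torch.allclose(round_trip, y_train))
```

Note that `min_max_norm` as written above does not return the minima and maxima, so it would need to be extended (or the statistics recomputed) before predictions can be converted back to units sold.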
