# ECON 5150: Neural Network

Zhentao Shi

<!-- code is tested on SCRP -->

## Gradient Descent

* Completely ignore the Hessian. Replace it by the identity matrix.


$$
\theta_{t+1} = \theta_{t} -  \alpha_t \cdot  s(\theta_t)
$$

where $\alpha_t > 0$ is the **learning rate**.

* Linear rate of convergence.
* Less costly in computation per iteration, since no Hessian needs to be evaluated. Better for big data.
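A minimal sketch of the update rule and of what a linear rate looks like. The criterion $f(\theta) = (\theta - 3)^2$ is a made-up example (not from the lecture data), with gradient $s(\theta) = 2(\theta - 3)$; the function name `gradient_descent_path` is hypothetical. With a constant learning rate, the distance to the minimizer should shrink by the same factor in every iteration.

```python
def gradient_descent_path(score, theta0, alpha, num_iter):
    """Return [theta_0, ..., theta_T] from theta_{t+1} = theta_t - alpha * s(theta_t)."""
    path = [theta0]
    for _ in range(num_iter):
        path.append(path[-1] - alpha * score(path[-1]))
    return path


# Hypothetical criterion f(theta) = (theta - 3)^2, so s(theta) = 2 * (theta - 3)
def score(theta):
    return 2.0 * (theta - 3.0)


path = gradient_descent_path(score, theta0=0.0, alpha=0.1, num_iter=50)
errors = [abs(theta - 3.0) for theta in path]
ratios = [errors[t + 1] / errors[t] for t in range(5)]

print(path[-1])  # close to the minimizer 3
print(ratios)    # each ratio is about 0.8 = |1 - 2 * alpha|
```

For this quadratic example the error is multiplied by $|1 - 2\alpha|$ at every step, which is the constant-factor shrinkage that "linear rate" refers to.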

* **Motivation**: Taylor expansion. For a step of size $\alpha_t$ along a direction $p_t$,

$$
f(\theta_{t+1}) = f(\theta_t + \alpha_t \cdot p_t ) \approx f(\theta_t) + \alpha_t \cdot  p_t' s(\theta_t).
$$

* If in each step we want the value of the criterion function
$f(\theta)$ to decrease, we need $p_t' s(\theta_t) \leq 0$.

* A simple choice is $p_t = -s(\theta_t)$, which is called the steepest descent.
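
* As a quick check (a first-order sketch that ignores higher-order terms, so it presumes a sufficiently small learning rate $\alpha_t$), substituting $p_t = -s(\theta_t)$ into the Taylor expansion gives

$$
f(\theta_{t+1}) \approx f(\theta_t) - \alpha_t \cdot s(\theta_t)' s(\theta_t) = f(\theta_t) - \alpha_t \| s(\theta_t) \|^2 \leq f(\theta_t).
$$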

* The learning rate is a tuning parameter.
  * In practice, just choose a small number, say $0.01$ or $0.001$.
  * A small learning rate takes a small step in each iteration, so more iterations may be needed.
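
To illustrate the role of the learning rate, the sketch below runs a fixed number of iterations on the made-up criterion $f(\theta) = (\theta - 3)^2$ with several values of $\alpha$. The function `run_gd` and the specific values are assumptions for illustration only. In this example, very small rates make slow progress, while a rate that is too large overshoots and moves away from the minimizer.

```python
def run_gd(alpha, theta0=0.0, num_iter=25):
    """Gradient descent on the hypothetical f(theta) = (theta - 3)^2."""
    theta = theta0
    for _ in range(num_iter):
        theta -= alpha * 2.0 * (theta - 3.0)  # s(theta) = 2 * (theta - 3)
    return theta


# Compare the distance to the minimizer 3 after the same number of iterations
for alpha in (0.001, 0.01, 0.1, 1.1):
    theta = run_gd(alpha)
    print(f"alpha = {alpha}: theta = {theta:.4f}, error = {abs(theta - 3.0):.4f}")
```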



## Gradient Descent in Neural Network



* The number of `epochs` is the maximum number of iterations (with full-sample updates, one epoch is one iteration).
* Conventional implementations usually specify a condition to test convergence and break out of the loop once it is satisfied.
* Machine learning practice instead monitors the loss function across the epochs.
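
The two stopping styles can be contrasted in a small sketch. The function `minimize`, its `tol` argument, and the quadratic criterion are all hypothetical choices for illustration: with `tol=None` the loop runs every epoch and only records the loss, while a positive `tol` adds a conventional convergence test.

```python
def minimize(score, loss, theta0, alpha=0.1, num_epochs=1000, tol=None):
    """Gradient descent that records the loss at every epoch.

    If tol is given, stop early once the loss improves by less than tol
    (a conventional convergence test); otherwise run all num_epochs.
    """
    theta, history = theta0, [loss(theta0)]
    for epoch in range(num_epochs):
        theta = theta - alpha * score(theta)
        history.append(loss(theta))
        if tol is not None and abs(history[-2] - history[-1]) < tol:
            break
    return theta, history


# Hypothetical criterion and its gradient
def loss(theta):
    return (theta - 3.0) ** 2


def score(theta):
    return 2.0 * (theta - 3.0)


theta_fixed, hist_fixed = minimize(score, loss, theta0=0.0)
theta_tol, hist_tol = minimize(score, loss, theta0=0.0, tol=1e-8)

# Number of epochs actually run under each stopping style
print(len(hist_fixed) - 1, len(hist_tol) - 1)
```

Keeping the full loss history is what makes it possible to plot or inspect the loss across epochs, whichever stopping style is used.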

```python
import torch
import torch.optim as optim

# Keep only the second column of X (dropping the first column) and convert to
# PyTorch tensors; nn.Linear supplies its own intercept (bias).
# X and y are assumed to be pandas objects from earlier in the notes.
X_tensor = torch.tensor(X.iloc[:, 1].values, dtype=torch.float32).view(-1, 1)
y_tensor = torch.tensor(y.values, dtype=torch.float32)

# Define the model
class PoissonRegressionModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(X_tensor.shape[1], 1)

    def forward(self, x):
        # y_hat = exp(b0 + b1 * x)
        return torch.exp(self.linear(x))

# Instantiate the model
model = PoissonRegressionModel()

# Define the loss function: mean negative Poisson log-likelihood,
# omitting the term log(y!) that does not depend on the parameters
def poisson_loss(y_hat, y_true):
    return torch.mean(y_hat - y_true * torch.log(y_hat))

# Define the optimizer (Adam, an adaptive variant of gradient descent)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop: each epoch uses the full sample once
num_epochs = 1000
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()  # Reset gradients accumulated in the previous step
    y_hat = model(X_tensor).squeeze()
    loss = poisson_loss(y_hat, y_tensor)
    loss.backward()  # Compute the gradient of the loss
    optimizer.step()  # Perform a single optimization step to update the parameters

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Print the optimized parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
```

```
Epoch [100/1000], Loss: 6.0970
Epoch [200/1000], Loss: 2.3398
Epoch [300/1000], Loss: 1.3544
Epoch [400/1000], Loss: 0.9979
Epoch [500/1000], Loss: 0.8496
Epoch [600/1000], Loss: 0.7819
Epoch [700/1000], Loss: 0.7476
Epoch [800/1000], Loss: 0.7273
Epoch [900/1000], Loss: 0.7124
Epoch [1000/1000], Loss: 0.6995
linear.weight tensor([[0.1707]])
linear.bias tensor([-0.0580])
```


```python
# Sample size times the final mean loss, i.e. the loss summed over the sample
len(y_tensor) * loss.item()
```

```
460.9989150762558
```
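
The number above is the sample size times the mean loss, that is, the summed negative Poisson log-likelihood without the term $\sum_i \log(y_i!)$, which does not depend on the parameters. The sketch below checks this relation on made-up numbers (the fitted means and counts are purely illustrative, and both function names are hypothetical). If the value is to be compared with a full log-likelihood, this constant would need to be added back.

```python
import math


def poisson_kernel_loss(y_hat, y_true):
    """Mean of y_hat - y * log(y_hat): the loss without the log(y!) term."""
    return sum(m - y * math.log(m) for m, y in zip(y_hat, y_true)) / len(y_true)


def poisson_full_nll(y_hat, y_true):
    """Summed negative Poisson log-likelihood, including log(y!)."""
    return sum(
        m - y * math.log(m) + math.lgamma(y + 1) for m, y in zip(y_hat, y_true)
    )


# Made-up fitted means and observed counts, for illustration only
y_hat = [0.5, 1.2, 2.0, 3.1]
y_true = [0, 1, 3, 2]

n = len(y_true)
total_kernel = n * poisson_kernel_loss(y_hat, y_true)
constant = sum(math.lgamma(y + 1) for y in y_true)  # sum of log(y!)

# The two quantities agree up to floating-point error
print(total_kernel + constant, poisson_full_nll(y_hat, y_true))
```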

### Stochastic Gradient Descent

* When the sample size is huge and the number of parameters is large,
evaluating the gradient on the full sample in every iteration is costly.
Stochastic gradient descent (SGD) instead uses a small batch of the sample
to evaluate the gradient in each iteration.

* SGD involves tuning parameters: the batch size and the learning rate. 
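
The mechanics can be sketched without any library. The example below is a simplified illustration, not the Poisson model of these notes: it fits a linear regression on simulated data by plain mini-batch SGD with squared loss. The data-generating process, the learning rate, and the batch size are all assumptions chosen for the demonstration.

```python
import random

random.seed(0)

# Simulated data (hypothetical): y = 1 + 2 * x + noise
n = 1000
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.1) for xi in x]

b0, b1 = 0.0, 0.0  # initial parameters
alpha = 0.1        # learning rate
batch_size = 50

for epoch in range(50):
    indices = list(range(n))
    random.shuffle(indices)  # reshuffle the sample every epoch
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]
        # Gradient of the mean squared error, evaluated on the mini-batch only
        resid = [b0 + b1 * x[i] - y[i] for i in batch]
        g0 = 2.0 * sum(resid) / len(batch)
        g1 = 2.0 * sum(r * x[i] for r, i in zip(resid, batch)) / len(batch)
        b0 -= alpha * g0
        b1 -= alpha * g1

print(b0, b1)  # should be near the true values 1 and 2
```

Each epoch here performs `n / batch_size` parameter updates, each using only a fraction of the data; this is the computational saving that motivates SGD.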

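```python
from torch.utils.data import DataLoader, TensorDataset

# Create a dataset and data loader; shuffle=True reshuffles the sample every epoch
dataset = TensorDataset(X_tensor, y_tensor)
batch_size = 100
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training loop: each epoch is one pass over all mini-batches.
# Note: `model` and `optimizer` are reused from the previous cell, so training
# continues from the parameters estimated there, and the optimizer is still Adam.
num_epochs = 1000
for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in data_loader:
        optimizer.zero_grad()
        y_hat = model(batch_X).squeeze()
        loss = poisson_loss(y_hat, batch_y)  # Loss on the current mini-batch only
        loss.backward()
        optimizer.step()

    # `loss` here is the loss of the last mini-batch in this epoch
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Print the optimized parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
```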

```
Epoch [100/1000], Loss: 0.5819
Epoch [200/1000], Loss: 0.4821
Epoch [300/1000], Loss: 0.7783
Epoch [400/1000], Loss: 0.1700
Epoch [500/1000], Loss: 0.4947
Epoch [600/1000], Loss: 0.7954
Epoch [700/1000], Loss: 1.4848
Epoch [800/1000], Loss: 1.4670
Epoch [900/1000], Loss: 0.7220
Epoch [1000/1000], Loss: 0.3634
linear.weight tensor([[-0.1004]])
linear.bias tensor([1.1857])
```
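
The losses printed in the run above are computed on the last mini-batch of each reported epoch, which is one reason they fluctuate instead of decreasing steadily. One possible alternative, sketched below with made-up numbers, is to report the size-weighted average of the batch losses over the epoch. The helper `epoch_average_loss` is hypothetical and not part of PyTorch.

```python
def epoch_average_loss(batch_losses, batch_sizes):
    """Size-weighted average of per-batch mean losses over one epoch."""
    total = sum(l * m for l, m in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Made-up per-batch mean losses; the last batch is smaller than the others
batch_losses = [0.71, 0.68, 0.74, 0.36]
batch_sizes = [100, 100, 100, 59]

avg_loss = epoch_average_loss(batch_losses, batch_sizes)
print(f"last batch: {batch_losses[-1]:.4f}, epoch average: {avg_loss:.4f}")
```

Because the parameters change between batches, this average is not identical to the full-sample loss at the final parameters, but it is typically a smoother series to monitor than the last-batch value.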
