# Question 3: How does pretraining (and epochs and batches) affect the performance of a neural network?

The plan for answering this question is simple: the neural network from Question 2 will be reused with a different number of epochs and a different batch size to see whether the accuracy increases or decreases. I will start by reusing all the dependencies from Question 2:

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import functions
from sklearn.preprocessing import LabelEncoder, StandardScaler
%pip install torch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, Dataset
import torch.optim as optim
from sklearn.metrics import f1_score 

Note: you may need to restart the kernel to use updated packages.


Next, the same neural network from Question 2 will be used. Credit to ChatGPT for showing me how to create it.

I will double the batch size (from 64 to 128) and the number of epochs (from 10 to 20) to see if there is any change in the accuracy:

In [2]:
class BooksDataset(Dataset):
    def __init__(self, authors, features, labels):
        self.authors = torch.tensor(authors, dtype=torch.long)       # author IDs for embeddings
        self.features = torch.tensor(features, dtype=torch.float32)  # numeric features
        self.labels = torch.tensor(labels, dtype=torch.float32)      # target

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.authors[idx], self.features[idx], self.labels[idx]
    
class AuthorNet(nn.Module):
    def __init__(self, num_authors, embedding_dim, num_numeric_features):
        super(AuthorNet, self).__init__()
        # Author embedding
        self.embedding = nn.Embedding(num_authors, embedding_dim)
        # Fully connected layers for numeric features
        self.fc_numeric = nn.Sequential(
            nn.Linear(num_numeric_features, 16),
            nn.ReLU()
        )
        # Combine embeddings + numeric features
        self.fc_combined = nn.Sequential(
            nn.Linear(embedding_dim + 16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()  # binary output
        )

    def forward(self, author_ids, numeric_features):
        x_author = self.embedding(author_ids)
        x_numeric = self.fc_numeric(numeric_features)
        x = torch.cat([x_author, x_numeric], dim=1)
        x = self.fc_combined(x)
        return x

X_author_train, X_author_test, X_num_train, X_num_test, y_train, y_test = functions.test_train_split()

# Map author IDs to consecutive integers, as there are nearly 13,000 separate authors:

author_to_idx = {author: i for i, author in enumerate(sorted(set(X_author_train)))}
X_author_train = np.array([author_to_idx[a] for a in X_author_train])
X_author_test = np.array([author_to_idx.get(a, 0) for a in X_author_test])  # unknown authors -> 0 (index 0 is also the first known author)
num_authors = len(author_to_idx)

# Scaling the numeric features as most of them are over 10,000 and they don't scale linearly:

scaler = StandardScaler()
X_num_train = scaler.fit_transform(X_num_train)
X_num_test = scaler.transform(X_num_test)

num_numeric_features = X_num_train.shape[1]
embedding_dim = 16 

# Defining the two datasets:

train_dataset = BooksDataset(X_author_train, X_num_train, y_train)
test_dataset = BooksDataset(X_author_test, X_num_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AuthorNet(num_authors=num_authors, embedding_dim=embedding_dim, num_numeric_features=num_numeric_features)
model = model.to(device)

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 20

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for author_ids, numeric_features, labels in train_loader:
        author_ids = author_ids.to(device)
        numeric_features = numeric_features.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(author_ids, numeric_features).view(-1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}")

model.eval()
y_true, y_pred = [], []

# torch.no_grad() disables gradient calculation, as Tensor.backward() will not be called.
# This reduces memory consumption. Credit: PyTorch API reference
with torch.no_grad():
    for author_ids, numeric_features, labels in test_loader:
        author_ids = author_ids.to(device)
        numeric_features = numeric_features.to(device)
        labels = labels.to(device)

        outputs = model(author_ids, numeric_features).view(-1)
        predicted = (outputs > 0.5).float()

        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predicted.cpu().numpy())

accuracy = np.mean(np.array(y_true) == np.array(y_pred))
f1 = f1_score(y_true, y_pred)

print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test F1 Score: {f1:.3f}")

Epoch 1, Loss: 0.6070
Epoch 2, Loss: 0.4525
Epoch 3, Loss: 0.2939
Epoch 4, Loss: 0.1709
Epoch 5, Loss: 0.1075
Epoch 6, Loss: 0.0762
Epoch 7, Loss: 0.0592
Epoch 8, Loss: 0.0489
Epoch 9, Loss: 0.0423
Epoch 10, Loss: 0.0364
Epoch 11, Loss: 0.0320
Epoch 12, Loss: 0.0287
Epoch 13, Loss: 0.0259
Epoch 14, Loss: 0.0236
Epoch 15, Loss: 0.0216
Epoch 16, Loss: 0.0194
Epoch 17, Loss: 0.0176
Epoch 18, Loss: 0.0160
Epoch 19, Loss: 0.0146
Epoch 20, Loss: 0.0134
Test Accuracy: 0.985
Test F1 Score: 0.976
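
One detail of the cell above is worth a note: authors that only appear in the test set are sent to index 0, which is also the index of the first training author, so both share one embedding. A minimal sketch of one possible alternative is to reserve index 0 for unknown authors and start the real authors at 1. The names `UNKNOWN_IDX`, `build_author_index` and `encode_authors` are made up for this illustration, and the author names are toy data, not taken from the dataset:

```python
UNKNOWN_IDX = 0  # hypothetical reserved slot for authors not seen in training


def build_author_index(train_authors):
    """Map each training author to 1..N, leaving index 0 free for unknowns."""
    unique_authors = sorted(set(train_authors))
    return {author: i for i, author in enumerate(unique_authors, start=1)}


def encode_authors(authors, author_to_idx):
    """Encode authors as integer indices; unseen authors get UNKNOWN_IDX."""
    return [author_to_idx.get(a, UNKNOWN_IDX) for a in authors]


# Toy data, made up for illustration only
train_authors = ["Austen", "Orwell", "Austen", "Tolkien"]
test_authors = ["Orwell", "Herbert"]  # "Herbert" never appears in training

author_to_idx = build_author_index(train_authors)
train_ids = encode_authors(train_authors, author_to_idx)
test_ids = encode_authors(test_authors, author_to_idx)

# The embedding table needs one extra row for the reserved unknown slot
num_embeddings = len(author_to_idx) + 1

print(train_ids, test_ids, num_embeddings)
```

With this mapping, `num_embeddings` rather than `len(author_to_idx)` would be passed as the size of the embedding layer. Whether this changes the reported scores was not tested here.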


At 98.5%, the accuracy is impressive but lower than expected: it is 0.2 percentage points below the result from the initial batch size of 64 and 10 epochs. The F1 score is also 0.3 percentage points below what was previously found in Question 2. These discrepancies could be put down to a possible lapse in the training data, where some of the data may have been miscalculated, or to the neural network expecting niche markets to support their niche products. Seeing as the majority of the books in the dataset were highly rated, the accuracy and F1 score would be skewed in a positive direction regardless.
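
To put the two configurations side by side, it can help to count how many optimizer updates each one performs. The sketch below assumes one update per batch and a final partial batch that is kept (the `DataLoader` default). The training-set size of 40,000 is a made-up number for illustration, not the size of the dataset used here:

```python
import math


def optimizer_steps(num_samples, batch_size, num_epochs):
    """Total optimizer updates: one per batch, ceil(N / batch_size) batches per epoch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


num_samples = 40_000  # hypothetical training-set size, not taken from the dataset

baseline = optimizer_steps(num_samples, batch_size=64, num_epochs=10)
doubled = optimizer_steps(num_samples, batch_size=128, num_epochs=20)

print(f"batch 64, 10 epochs:  {baseline} updates")
print(f"batch 128, 20 epochs: {doubled} updates")
```

Doubling the batch size halves the number of updates per epoch, and doubling the epochs brings the total back to roughly the same figure. Under that assumption the two runs do a similar amount of weight updating, which may be part of why the scores are so close.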

This could have been improved by finding a more balanced dataset or by taking the average review from a set number of reviews per book. The only problem with the latter idea is that most books in the dataset had fewer than 100 reviews in total; hence, the data would still be skewed. Thus, more reviews per book would reduce this statistical error in the findings of the neural networks.
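
One way to see how much of a score comes from the skew alone is to compare it with a model that always predicts the most common class. The following is a minimal sketch of that check. It assumes binary labels where 1 means "highly rated", and the 80/20 split is invented for illustration rather than measured from the dataset:

```python
def majority_baseline(labels):
    """Accuracy and F1 of always predicting the most common class (labels are 0/1)."""
    n = len(labels)
    positives = sum(labels)
    negatives = n - positives
    majority = 1 if positives >= negatives else 0
    accuracy = max(positives, negatives) / n
    if majority == 1:
        # Every sample is predicted positive: recall is 1, precision is positives / n
        precision = positives / n
        f1 = 2 * precision / (precision + 1)
    else:
        # No positive predictions at all, so F1 for the positive class is 0
        f1 = 0.0
    return majority, accuracy, f1


# Made-up labels: 80% of books highly rated (1), 20% not (0)
labels = [1] * 80 + [0] * 20
majority, accuracy, f1 = majority_baseline(labels)
print(f"Always predict {majority}: accuracy={accuracy:.3f}, F1={f1:.3f}")
```

Running the same function on `y_train` or `y_test` would show how far the network's scores sit above this floor. The closer the two are, the more of the result is explained by the imbalance.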
