### Lab 3.1: Batching and Regularization

In this lab you will learn how to feed a dataset to a neural network in mini-batches, rather than processing the entire dataset in each training iteration, and you will explore neural network regularization.

In [1]:
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score



In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables)

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [3]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [4]:
# The labels appear both with and without a trailing period; map every spelling to 0 or 1.
y = y['income'].map({'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1})

In [5]:
X = X[['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']]

In [6]:
y = y.values
X = X.values.astype('float64')

To make the learning algorithm work more smoothly, we will subtract the mean of each feature so that every feature is centered at zero.

Here `np.mean` calculates a mean, and `axis=0` tells NumPy to average over the rows, which yields one mean per column (that is, per feature).

In [7]:
X -= np.mean(X,axis=0)
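
To see what `axis=0` does, the following minimal sketch centers a small made-up array (the values are illustrative and are not taken from the Adult dataset). After the subtraction every column has mean zero, while the spread of each column is unchanged.

```python
import numpy as np

# Hypothetical toy data: 3 rows (samples), 2 columns (features).
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

col_means = np.mean(toy, axis=0)   # one mean per column, shape (2,): 2 and 30
row_means = np.mean(toy, axis=1)   # one mean per row, shape (3,)

# Broadcasting subtracts each column's mean from every row.
centered = toy - col_means

print(col_means)
print(centered.mean(axis=0))       # both column means are now zero
```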

Now we will convert our `X` and `y` arrays to PyTorch tensors.

In [8]:
X = torch.tensor(X).float()
y = torch.tensor(y).long()
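
The explicit casts matter: `torch.tensor` keeps NumPy's `float64` dtype, whereas `torch.nn.Linear` layers hold `float32` weights by default, and `CrossEntropyLoss` expects class-index targets as 64-bit integers. A small sketch with made-up arrays (the names `features`, `labels`, `X_toy`, and `y_toy` are hypothetical) shows the resulting dtypes.

```python
import numpy as np
import torch

# Hypothetical stand-ins for the feature matrix and the label vector.
features = np.array([[0.5, -1.0], [1.5, 2.0]], dtype="float64")
labels = np.array([0, 1])

raw = torch.tensor(features)             # inherits float64 from NumPy
X_toy = torch.tensor(features).float()   # cast to float32
y_toy = torch.tensor(labels).long()      # cast to int64

print(raw.dtype, X_toy.dtype, y_toy.dtype)

# CrossEntropyLoss takes float logits of shape (N, C) and long targets of shape (N,).
logits = torch.nn.Linear(2, 2)(X_toy)
loss = torch.nn.CrossEntropyLoss()(logits, y_toy)
print(loss.item())
```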

### Exercises

1. Divide the data into train and test splits.
2. Create a neural network for this dataset.
3. Use `TensorDataset` and `DataLoader` to batch the dataset during training.  
4. Use the `weight_decay` parameter of `optim.SGD` to introduce L2 regularization during training. Evaluate the effect of regularization on test set accuracy.
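
As a starting point for exercises 3 and 4, the sketch below uses random stand-in data (not the Adult dataset) and hypothetical names such as `toy_loader`, `lam`, `model_a`, and `model_b`. The first part shows how `DataLoader` splits a `TensorDataset` into mini-batches. The second part checks, for a single update step, that `weight_decay` in `torch.optim.SGD` has the same effect as adding the L2 penalty (lam / 2) times the sum of squared parameters to the loss.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Part 1: batching. 10 samples with 6 features, served 4 at a time.
X_toy = torch.randn(10, 6)
y_toy = torch.randint(0, 2, (10,))
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=4, shuffle=True)

batch_sizes = [len(batch_X) for batch_X, batch_y in toy_loader]
print(batch_sizes)  # the last batch is smaller: 10 = 4 + 4 + 2

# Part 2: weight decay as L2 regularization, compared over one SGD step.
lam, lr = 0.1, 0.5
loss_fn = torch.nn.CrossEntropyLoss()

model_a = torch.nn.Linear(6, 2)
model_b = torch.nn.Linear(6, 2)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

# (a) Let the optimizer apply the decay.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=lam)
opt_a.zero_grad()
loss_fn(model_a(X_toy), y_toy).backward()
opt_a.step()

# (b) No weight_decay; add (lam / 2) * sum of squared parameters to the loss.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = sum((p ** 2).sum() for p in model_b.parameters())
(loss_fn(model_b(X_toy), y_toy) + 0.5 * lam * penalty).backward()
opt_b.step()

# Both routes should leave the two models with matching parameters.
same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

To evaluate the effect on test accuracy, one option is to repeat the full training run with several `weight_decay` values, including zero, and compare the resulting test accuracies.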

In [9]:

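# Exercise 1: hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Exercise 3: wrap the training tensors and serve them in shuffled mini-batches of 1024 rows.
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)

# Exercise 2: 6 input features, two hidden layers of 100 units, 2 output logits.
# Named `model` rather than `nn` to avoid confusion with the `torch.nn` module.
model = torch.nn.Sequential(
    torch.nn.Linear(6, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 2),
)

loss_fn = torch.nn.CrossEntropyLoss()
# Exercise 4: weight_decay adds an L2 penalty. Note that this solution uses Adam
# rather than the SGD optimizer named in the exercise; both accept `weight_decay`.
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)

for epoch in range(100):
    # One pass over the training set, one parameter update per mini-batch.
    model.train()
    for batch_X, batch_y in train_loader:
        opt.zero_grad()
        y_pred = model(batch_X)
        loss = loss_fn(y_pred, batch_y)
        loss.backward()
        opt.step()

    # Evaluate on the whole test set once per epoch, without tracking gradients.
    model.eval()
    with torch.no_grad():
        y_test_pred = model(X_test)
        y_test_pred_classes = torch.argmax(y_test_pred, dim=1)
        acc = accuracy_score(y_test, y_test_pred_classes)

    # `loss` is the loss of the last mini-batch in the epoch, not an epoch average.
    if epoch % 5 == 0:
        print(f'Epoch {epoch}, Loss {loss.item()}, Acc {acc}')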


Epoch 0, Loss 204.33639526367188, Acc 0.5290203705599344
Epoch 5, Loss 0.4641706943511963, Acc 0.7917903572525335
Epoch 10, Loss 0.4981914162635803, Acc 0.7992629747159382
Epoch 15, Loss 0.3992355167865753, Acc 0.7990582454703654
Epoch 20, Loss 0.5495306253433228, Acc 0.7998771624526564
Epoch 25, Loss 0.5180371403694153, Acc 0.8023339133995291
Epoch 30, Loss 0.5125241875648499, Acc 0.8026410072678882
Epoch 35, Loss 0.48972979187965393, Acc 0.8052001228375474
Epoch 40, Loss 0.44324880838394165, Acc 0.8058143105742656
Epoch 45, Loss 0.4695562422275543, Acc 0.8043812058552564
Epoch 50, Loss 0.4926622807979584, Acc 0.8036646534957519
Epoch 55, Loss 0.5559887290000916, Acc 0.8048930289691882
Epoch 60, Loss 0.5394513010978699, Acc 0.7959873067867745
Epoch 65, Loss 0.48220711946487427, Acc 0.8000818916982291
Epoch 70, Loss 0.543071448802948, Acc 0.8074521445388474
Epoch 75, Loss 0.447931706905365, Acc 0.8058143105742656
Epoch 80, Loss 0.4862801730632782, Acc 0.8021291841539564
Epoch 85, Loss 