# Optimal transfer learning

Here, we create two models and traverse a path on the statistical manifold between them.
Given that my theory yields optimal infinitesimal traversals of continuously changing distributions, 
it should port cleanly from reinforcement learning to transfer learning. 
Of course, only each infinitesimal step along the path will be optimal, 
leaving room for data scientists' intuition to usefully bias retention. 
For example, if early-game information only becomes useful late in play, 
then a retention spike will need to be added. 
Regardless, infinitesimal optimality is an important step toward a deep and coherent transfer learning theory. 

The experiment will involve fitting a dense net to the MNIST dataset, classifying digits as usual. 
However, after the initial fit, we will embed the dense net into a much larger statistical manifold 
by making it part of a mixture model with a different, far sparser model. 
We'll then slowly and continuously traverse the mixture probability parameter $q$ from 0 to 1, 
optimally retaining information at each step. 
We'll measure test set accuracy at each step and compare against different traversal strategies. 
A good result would show that accuracy is usefully sustained despite transferring to a sparse model.
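
One way to write down the mixture described above, as a sketch only: the symbols $p_{\text{dense}}$ and $p_{\text{sparse}}$ are notation introduced here for the two models' predictive distributions, and the convention that $q = 0$ selects the dense net is assumed from the stated direction of transfer (dense to sparse).

```latex
p_q(y \mid x) \;=\; (1 - q)\, p_{\text{dense}}(y \mid x) \;+\; q\, p_{\text{sparse}}(y \mid x),
\qquad q \in [0, 1]
```

Under this reading, traversing $q$ from 0 to 1 moves the predictive distribution continuously from the dense net alone to the sparse net alone.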

If successful, this experiment will illustrate the effectiveness of my general theory of transfer learning. 
It would then be possible to optimally translate information between arbitrary models, 
provided they can handle the same dataset or a continuously transforming dataset between the models.
This would be a powerful result, so I'm eager to run my experiment, but I won't get my hopes up too much. 
The truth is best found by pursuing ambitious targets and sensitively interpreting negative results.

In [2]:
## Dense net code initially authored by Google's search engine GenAI on 20 Oct 2024. 
## I've applied minor modifications for generality, but the code worked great on first draft. 

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define the DenseNet model
class DenseNet(nn.Module):
    def __init__(self):
        super(DenseNet, self).__init__()
        self.features = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.features(x)
        return x

# Load MNIST dataset
train_dataset = datasets.MNIST(root='/tmp/data', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='/tmp/data', train=False, transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize the model, loss function and optimizer
model = DenseNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    for i, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, len(train_loader), loss.item()))

# Evaluate the model
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in test_loader:
        output = model(data)
        _, predicted = torch.max(output.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /tmp/data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting /tmp/data/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /tmp/data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting /tmp/data/MNIST/raw/train

100%|██████████| 9912422/9912422 [00:00<00:00, 152101437.92it/s]
100%|██████████| 28881/28881 [00:00<00:00, 16312374.61it/s]
100%|██████████| 1648877/1648877 [00:00<00:00, 56294954.02it/s]
100%|██████████| 4542/4542 [00:00<00:00, 3188373.02it/s]
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass


In [24]:
## Transfer learn to sparse nets 

class SparseActivation(nn.Module): 
    '''Softmax that zeros-out small values. 
    Transfer-learning from a prefit model recommended. 
    '''
    def __init__(self, K=1, next_linearity=None):
        'K: softmax values must exceed 1/K to be kept, so fewer than K dimensions per row are non-zero.'
        super(SparseActivation, self).__init__()
        self.K = K 
        self.softmax = nn.Softmax(dim=-1)  # normalize over the feature dimension 
        self.next_linearity = next_linearity 
        pass 
    def forward(self, x, next_linearity=None): 
        '''inputs:
        - x: [n, d]-shaped tensor 
        - next_linearity: instance of nn.Linear, applied efficiently to the output of this activation. Defaults to the next_linearity provided during init, if any. 
        outputs: 
        - idx: [m, 2]-shaped (row, column) indices of softmax(x) that are greater than 1/K. 
        - x_idx: values of softmax(x) at idx. None if idx is empty. 
        - y: next_linearity applied to the sparsified softmax(x). Only returned if a next_linearity is available. 
        '''
        ## calculate 
        x = self.softmax(x) 
        idx = (x > 1/self.K).nonzero() 
        x_idx = None 
        if int(idx.shape[0]) > 0: 
            x_idx = x[idx[:,0], idx[:,1]] 
        if next_linearity is None: 
            next_linearity = self.next_linearity 
            pass 
        if next_linearity is None: 
            return idx, x_idx 
        ## sparsely apply next_linearity: only weight columns of kept entries contribute 
        y = next_linearity.bias.repeat(x.shape[0], 1) 
        if x_idx is not None: 
            ## nn.Linear stores weight as [out_features, in_features] 
            contrib = x_idx.unsqueeze(1) * next_linearity.weight[:, idx[:,1]].t() 
            y = y.index_add(0, idx[:,0], contrib) 
            pass 
        return idx, x_idx, y 
    pass 

class SparseNet(nn.Module):
    def __init__(self, K=10):
        super(SparseNet, self).__init__()
        self.K = K 
        self.linear1 = nn.Linear(784, 512) 
        self.linear2 = nn.Linear(512, 256) 
        self.linear3 = nn.Linear(256, 128) 
        self.linear4 = nn.Linear(128, 10) 
        self.activation1 = SparseActivation(K=self.K, next_linearity=self.linear2) 
        self.activation2 = SparseActivation(K=self.K, next_linearity=self.linear3) 
        self.activation3 = SparseActivation(K=self.K, next_linearity=self.linear4) 
        self.activation4 = nn.Softmax(dim=-1) 
        pass 
    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten images, as DenseNet does 
        ## each sparse activation applies the following linear layer itself 
        _, _, x = self.activation1(self.linear1(x)) 
        _, _, x = self.activation2(x) 
        _, _, x = self.activation3(x) 
        x = self.activation4(x) 
        return x

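class MixtureModel(nn.Module): 
    'join two models with a Bernoulli random variable'
    def __init__(self, model1, model2, p): 
        'mix models 1 and 2, selecting model 2 with probability p'
        super(MixtureModel, self).__init__() 
        self.model1 = model1 
        self.model2 = model2 
        self.p = p 
        pass 
    def forward(self, x): 
        n = x.shape[0] 
        b = torch.rand(n, device=x.device) < self.p  # True selects model 2, with probability p 
        b_sum = int(b.sum()) 
        y = None  # allocated once an output shape is known; stays None for an empty batch 
        if b_sum < n: 
            y1 = self.model1(x[b.logical_not()]) 
            y = y1.new_zeros((n,) + tuple(y1.shape[1:])) 
            y[b.logical_not()] = y1 
        if b_sum > 0: 
            y2 = self.model2(x[b])  # both models must return the same trailing shape 
            if y is None: 
                y = y2.new_zeros((n,) + tuple(y2.shape[1:])) 
            y[b] = y2 
            pass 
        return y 
    pass 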

tensor([[ 0.7634,  0.2736,  0.7834, -1.5053],
        [ 0.6353,  1.7708,  0.8720,  0.9240]])
(tensor([], size=(0, 2), dtype=torch.int64), None)
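
The empty index tensor and `None` above are what the threshold predicts: softmax values are strictly below 1 whenever a row has more than one entry, so `K=1` (threshold `1/K = 1`) keeps nothing. The sketch below assumes the printed tensor was the input to the activation (the cell does not show the call) and uses a hypothetical helper, `thresholded_softmax`, that reproduces only the thresholding step.

```python
import torch

def thresholded_softmax(x, K):
    """Hypothetical helper: softmax over the last dim, keeping entries above 1/K."""
    probs = torch.softmax(x, dim=-1)
    idx = (probs > 1 / K).nonzero()        # one (row, column) pair per kept entry
    values = probs[idx[:, 0], idx[:, 1]]   # empty tensor when nothing is kept
    return idx, values

# Values copied from the tensor printed above.
x = torch.tensor([[0.7634, 0.2736, 0.7834, -1.5053],
                  [0.6353, 1.7708, 0.8720, 0.9240]])

idx_k1, vals_k1 = thresholded_softmax(x, K=1)
idx_k4, vals_k4 = thresholded_softmax(x, K=4)
print(idx_k1.shape)     # torch.Size([0, 2])
print(idx_k4.tolist())  # [[0, 0], [0, 2], [1, 1]]
```

Raising `K` lowers the threshold, so more entries survive; with `K=4` fewer than four entries per row can exceed `1/4`.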


In [41]:
ll = torch.rand(10) > .5
print(ll.logical_not().sum())

tensor(6)
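
The cell above checks that a Boolean mask and its `logical_not()` split a batch into two groups. Below is a minimal sketch of how such a mask could route examples between two models as the mixing probability changes. The probability is written `q` here, following the prose (the `MixtureModel` code calls it `p`). The two "models" are toy stand-ins, and `route` and `mixture_predict` are hypothetical names. No retention or retraining step is included, so this only illustrates the routing, not the traversal strategy itself.

```python
import torch

torch.manual_seed(0)

def route(n, q):
    """Boolean mask: True sends an example to model 2, which happens with probability q."""
    return torch.rand(n) < q

def mixture_predict(model1, model2, x, q):
    """Use model 2's prediction where the mask is True, else model 1's.
    For simplicity both models are run on every row here."""
    mask = route(x.shape[0], q)
    return torch.where(mask, model2(x), model1(x))

# Toy stand-ins (assumptions): the "input" is the label itself,
# model 1 always returns the true label and model 2 never does.
labels = torch.randint(0, 10, (20000,))
model1 = lambda x: x
model2 = lambda x: (x + 1) % 10

results = {}
for q in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pred = mixture_predict(model1, model2, labels, q)
    results[q] = (pred == labels).float().mean().item()
    print(f"q={q:.2f}  accuracy={results[q]:.3f}")  # close to 1 - q for these toy models
```

With these stand-ins the mixture's accuracy falls roughly linearly in `q`, which is the baseline any retention strategy would need to beat.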
