# An Introduction to Neural Networks

This notebook is part of a tutorial, with accompanying slides.  We will develop a kind of _minimal working example_ of running an experiment with neural networks, inspired by work in natural language semantics.  Instructions for using this notebook may be found at https://github.com/shanest/nn-tutorial.  The slides also contain pointers to more advanced topics and applications.

In particular, we will conduct a very miniature version of one of the experiments in [Steinert-Threlkeld and Szymanik, "Learnability and Semantic Universals"](https://semanticsarchive.net/Archive/mQ2Y2Y2Z/LearnabilitySemanticUniversals.pdf).  In the figure below, they compare a monotone quantifier (_at least 4_) to a non-monotone one (_at least 6 or at most 2_), showing that the former is learned faster than the latter.  We will look at a similar pair of quantifiers here.

![](imgs/exp1a_acc.png)

(There are several important differences between what we will do today and what's done in the paper.  Because of time constraints, we won't do multiple trials or statistical analysis thereof, and we won't be using recurrent networks.)

## Generalized Quantifiers

Generalized quantifier theory provides the meanings for expressions like "most", "all", "between 5 and 10", etc. as they are used in sentences like

(1) Most of the students are happy.

(1) is true just in case the set of students who are happy outnumbers the set of students who are not happy.  Using interpretation bracket notation, where $\mathcal{M}$ is a model:

$$ [[ (1) ]]^{\mathcal{M}} = 1 \text{ iff } \textsf{card}([[\text{students}]] \cap [[\text{happy}]]) > \textsf{card}([[\text{students}]] \setminus [[\text{happy}]]) $$

We can view the meaning of an expression like "most" as a function that takes as input models of the form $\mathcal{M} = \langle M, A, B \rangle$, where $M$ is the domain, $A$ is the denotation of the restrictor (e.g. "students"), and $B$ is that of the nuclear scope (e.g. "are happy"), and outputs 1 or 0 according to the condition above.

In other words, a quantifier is a _classifier_ of models, classifying every model as 1 (True) or 0 (False).
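To make the classifier view concrete, here is a minimal sketch (not part of the original notebook) that represents a model as three Python sets and "most" as a function on them. The names `most_sets`, `domain`, `students`, and `happy` are illustrative assumptions.

```python
def most_sets(domain, A, B):
    """Hypothetical helper: 1 iff more As are Bs than are not Bs."""
    # The domain is not consulted: the truth condition only compares
    # the size of A ∩ B with the size of A \ B.
    return int(len(A & B) > len(A - B))


domain = {"ann", "bo", "cy", "di", "ed"}
students = {"ann", "bo", "cy"}
happy = {"ann", "bo", "ed"}

print(most_sets(domain, students, happy))   # 2 happy students vs. 1 unhappy -> 1
print(most_sets(domain, students, {"ed"}))  # 0 happy students vs. 3 unhappy -> 0
```

The code cells in this notebook use a more compact encoding, in which a model is a vector of 0s and 1s.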

## Experiment

We will build a neural network classifier and train it to _learn_ different quantifiers.  Our goal is to compare, qualitatively, how quickly the network learns _monotone_ quantifiers (e.g. "most") versus _non-monotone_ ones (e.g. "between 5 and 10").  Please refer to the paper linked above for full definitions of these concepts.

### 0. Import Libraries

In [2]:
import itertools
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

### 1. Define Parameters

In [89]:
params = {
    'model_size': 16,  # number of elements in each model (length of its 0/1 vector)
    'num_epochs': 2,  # one epoch = one loop through the dataset
    'batch_size': 32,  # size of one batch of training examples
    'eval_every': 20,  # frequency of evaluations, in # of batches
}

### 2. Generating the Data

In [50]:
def get_all_models(length):
    return np.array(list(
        itertools.product([0, 1], repeat=length)
    ))

get_all_models(3)

array([[0, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 0, 1],
       [1, 1, 0],
       [1, 1, 1]])

In [90]:
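def most(model):
    # cast the bool to int so that labels can serve as class indices later
    return int(sum(model) > 0.5*len(model))

def between_m_n(model, m, n):
    """Strictly between: more than m and fewer than n ones.
    Fix m and n (e.g. with a lambda) before passing to batch_apply."""
    return int(m < sum(model) < n)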

def batch_apply(models, quantifier):
    """Applies quantifier function to 2-D array of models,
    where each row corresponds to one model."""
    return np.apply_along_axis(quantifier, 1, models)

batch_apply(get_all_models(3), most)

array([0, 0, 0, 1, 0, 1, 1, 1])
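The monotone versus non-monotone contrast from the Experiment section can be checked by brute force on small models. The sketch below assumes that each position in a model is an element of the restrictor and that a 1 marks membership in the scope, so enlarging the scope means turning a 0 into a 1. The helpers `preserved_under_flips` and `is_monotone`, and the two lambda quantifiers, are hypothetical names introduced for illustration; the definition used is the standard one for monotonicity in the scope argument, which may differ in detail from the paper's presentation.

```python
import itertools


def preserved_under_flips(quantifier, length, old, new):
    """True if changing any single `old` bit to `new` never turns a 1 into a 0."""
    for model in itertools.product([0, 1], repeat=length):
        if not quantifier(model):
            continue  # only true models constrain monotonicity
        for i, bit in enumerate(model):
            if bit == old:
                flipped = model[:i] + (new,) + model[i + 1:]
                if not quantifier(flipped):
                    return False
    return True


def is_monotone(quantifier, length):
    """Upward monotone (0 -> 1 flips) or downward monotone (1 -> 0 flips)."""
    upward = preserved_under_flips(quantifier, length, old=0, new=1)
    downward = preserved_under_flips(quantifier, length, old=1, new=0)
    return upward or downward


most_like = lambda model: int(sum(model) > 0.5 * len(model))
between_like = lambda model: int(2 < sum(model) < 6)

print(is_monotone(most_like, 8))     # True: adding 1s never falsifies "most"
print(is_monotone(between_like, 8))  # False: both adding and removing 1s can falsify it
```

Checking single-bit flips is enough, because any larger change to the scope is a sequence of such flips.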

In [26]:
def shuffle_data(models, labels):
    """Shuffles the order of an array of models and of labels."""
    assert len(models) == len(labels), "models and labels must be of same length"
    permutation = np.random.permutation(len(models))
    return models[permutation], labels[permutation]

In [27]:
def get_data(model_size, quantifier, train_split=0.7, shuffle=True):
    """Gets training and test data for quantifier."""
    # get all models and labels
    models = get_all_models(model_size)
    labels = batch_apply(models, quantifier)
    # shuffle them
    if shuffle:
        models, labels = shuffle_data(models, labels)
    # split into train/test
    split_index = int(len(models) * train_split)  # int() truncates, i.e. rounds down here
    train_models = models[:split_index]  # up to index, not including
    train_labels = labels[:split_index]
    test_models = models[split_index:]  # from index, including
    test_labels = labels[split_index:]
    return train_models, train_labels, test_models, test_labels

In [83]:
np.unique(batch_apply(get_all_models(16), lambda model: int(sum(model) >= 9)), return_counts=True)

(array([0, 1]), array([39203, 26333]))
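These counts can be cross-checked without enumerating all models, since the number of length-16 vectors with exactly $k$ ones is the binomial coefficient $\binom{16}{k}$. The sketch below is an added sanity check, not part of the original notebook; the variable names are illustrative.

```python
from math import comb  # available in Python 3.8+

size = 16
ones = sum(comb(size, k) for k in range(9, size + 1))  # models with at least 9 ones
zeros = 2 ** size - ones                               # all remaining models

print(zeros, ones)  # 39203 26333, matching the counts above

# Accuracy of a classifier that always predicts the majority label (0).
majority_baseline = zeros / 2 ** size
print(round(majority_baseline, 3))  # 0.598
```

The classes are therefore somewhat imbalanced, and an accuracy near 0.6 on the full set of models is what always answering 0 would achieve.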

In [29]:
get_data(3, most)

(array([[1, 1, 0],
        [0, 0, 0],
        [1, 1, 1],
        [0, 0, 1],
        [1, 0, 0]]), array([1, 0, 1, 0, 0]), array([[0, 1, 1],
        [0, 1, 0],
        [1, 0, 1]]), array([1, 0, 1]))

### 3. Build Model

In [30]:
class FFNN(nn.Module):  # all models in PyTorch extend nn.Module
    
    def __init__(self, input_size, output_size):
        super().__init__()
        
        self.layer1 = nn.Linear(input_size, 32)  # first hidden layer has 32 units
        self.layer2 = nn.Linear(32, 32)  # as does second
        self.output = nn.Linear(32, output_size)
        
    def forward(self, models):  # note: forward can take any number of arguments
        x = torch.as_tensor(models, dtype=torch.float)
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = self.output(x)
        return F.softmax(x, dim=1)  # softmax converts to a probability distribution
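As a small illustration of what the final softmax does, the pure-Python sketch below normalizes two raw scores into probabilities and computes the negative log-likelihood of a label. The helper names `softmax` and `negative_log_likelihood` are assumptions made for this example and are not PyTorch functions. It also shows that applying softmax to values that are already probabilities pulls them toward a uniform distribution, which matters because some loss functions (PyTorch documents `F.cross_entropy` as one) expect unnormalized scores and normalize internally.

```python
import math


def softmax(scores):
    """Exponentiate and normalize a list of scores so they sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs, label):
    """Loss for one example: minus the log-probability of the correct label."""
    return -math.log(probs[label])


scores = [2.0, 0.0]          # raw outputs for the classes False / True
probs = softmax(scores)      # roughly [0.88, 0.12]
twice = softmax(probs)       # softmax of probabilities: roughly [0.68, 0.32]

print(probs, negative_log_likelihood(probs, 0))
print(twice, negative_log_likelihood(twice, 0))
```

The second line of output has a larger loss for the same label, even though the underlying scores have not changed.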

### 4. Train Model

In [91]:
# get the data
train_models, train_labels, test_models, test_labels = get_data(
    params['model_size'],
    most
)
    
# get the model
model = FFNN(params['model_size'], 2)

In [92]:
model

FFNN(
  (layer1): Linear(in_features=16, out_features=32, bias=True)
  (layer2): Linear(in_features=32, out_features=32, bias=True)
  (output): Linear(in_features=32, out_features=2, bias=True)
)

In [93]:
model(train_models)

tensor([[0.5144, 0.4856],
        [0.5078, 0.4922],
        [0.4985, 0.5015],
        ...,
        [0.5012, 0.4988],
        [0.4858, 0.5142],
        [0.4892, 0.5108]], grad_fn=<SoftmaxBackward>)

In [94]:
# TODO: 
# * install pandas, plotnine
# * record data, plot learning curves

def train(params, quantifier):
    # get the data
    train_models, train_labels, test_models, test_labels = get_data(
        params['model_size'],
        quantifier
    )
    
    # get the model
    model = FFNN(params['model_size'], 2)  # 2 outputs: False/True
    
    # get an optimizer
    opt = torch.optim.Adam(model.parameters())
    num_batches = len(train_models) // params['batch_size']  # a final partial batch is dropped

    for epoch in range(params['num_epochs']):
        # shuffle the training data each epoch
        train_models, train_labels = shuffle_data(train_models, train_labels)
        model.train()  # for our model, this has no effect, but is good practice

        # individual training steps!
        for batch_num in range(num_batches):
            # batch the data
            batch_models = train_models[batch_num*params['batch_size']:(batch_num+1)*params['batch_size']]
            batch_labels = train_labels[batch_num*params['batch_size']:(batch_num+1)*params['batch_size']]

            # get model's output
            model_probs = model(batch_models)  # calls .forward

            # zero the gradients
            opt.zero_grad()
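            # calculate loss: forward() already returns probabilities, so take
            # their log and use NLL; F.cross_entropy expects unnormalized
            # scores and would apply softmax a second time
            loss = F.nll_loss(torch.log(model_probs.clamp(min=1e-8)),
                              torch.as_tensor(batch_labels, dtype=torch.long))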
            loss.backward()  # computes the gradients!
            opt.step()  # updates the parameters

            if (batch_num + 1) % params['eval_every'] == 0:
                with torch.no_grad():  # no gradients needed for evaluation; saves time and memory
                    model.eval()  # again, no effect on our model, but good practice
                    model_probs = model(test_models).numpy()
                    model_predictions = model_probs.argmax(axis=1).flatten()
                    # 1 if correct prediction, 0 otherwise
                    correct = (model_predictions == test_labels).astype(int)
                    print('Test set accuracy; after epoch {}, batch {}: {}'.format(
                        epoch, batch_num+1,
                        sum(correct) / len(correct)
                    ))
                model.train()
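The sizes involved in this training loop follow directly from the parameters. The arithmetic below is an added sketch using the notebook's values (`model_size` 16, `train_split` 0.7, `batch_size` 32, `eval_every` 20); the variable names are illustrative.

```python
model_size, train_split, batch_size, eval_every = 16, 0.7, 32, 20

num_models = 2 ** model_size                 # every 0/1 vector of length 16
num_train = int(num_models * train_split)    # same truncation as in get_data
num_test = num_models - num_train
num_batches = int(num_train / batch_size)    # full batches per epoch
leftover = num_train - num_batches * batch_size
evals_per_epoch = num_batches // eval_every

print(num_train, num_test)           # 45875 19661
print(num_batches, leftover)         # 1433 19
print(evals_per_epoch)               # 71
```

Because the training data are reshuffled at the start of every epoch, the 19 examples left out of the full batches differ from epoch to epoch.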

In [95]:
train(params, most)

Test set accuracy; after epoch 0, batch 20: 0.5966634453995219
Test set accuracy; after epoch 0, batch 40: 0.5962565484970246
Test set accuracy; after epoch 0, batch 60: 0.5962565484970246
Test set accuracy; after epoch 0, batch 80: 0.5965617211738976
Test set accuracy; after epoch 0, batch 100: 0.6533238390722751
Test set accuracy; after epoch 0, batch 120: 0.6863842124001831
Test set accuracy; after epoch 0, batch 140: 0.7784954987030162
Test set accuracy; after epoch 0, batch 160: 0.8446671074716444
Test set accuracy; after epoch 0, batch 180: 0.8711154061339708
Test set accuracy; after epoch 0, batch 200: 0.8579421189156198
Test set accuracy; after epoch 0, batch 220: 0.901480087482834
Test set accuracy; after epoch 0, batch 240: 0.8985809470525405
Test set accuracy; after epoch 0, batch 260: 0.9206042419002085
Test set accuracy; after epoch 0, batch 280: 0.9328111489751284
Test set accuracy; after epoch 0, batch 300: 0.9409999491378872
Test set accuracy; after epoch 0, batch 320: 