
# Tutorial: Using the Delta Method and the Method of Composition for uncertainty propagation

In this tutorial, we illustrate how to use the **Delta Method** and the **Method of Composition** approaches to propagate uncertainty to downstream tasks using GloVe-V, our word-level variance estimates for GloVe. As an example, we compute uncertainty intervals for the cosine similarity of the words `doctor` and `surgeon` using both approaches. This computation was performed using the Method of Composition in Figure 5 of our paper.



## Background

Our GloVe-V framework computes the following Normal distribution for word $i$:

$$ w_i \sim N(\mu_i, \Sigma_i),$$

where $\mu_i$ is the $d$-dimensional GloVe-trained word embedding for word $i$ and $\Sigma_i$ is the $d \times d$ GloVe-V covariance matrix, as given by Equation 6 in the paper.

**Delta Method**

The Delta Method states that if $\sqrt{n}(W - \hat{W})$ converges to $N(0, \Sigma)$, then

$$ \sqrt{n}(\phi(W) - \phi(\hat{W})) \rightarrow N(0, \phi^{\prime}(W)^T\Sigma\phi^{\prime}(W)) ,$$  

where $\phi(\cdot)$ is a differentiable function of $W$ and $\phi^{\prime}(\cdot)$ is its gradient with respect to $W$. 

In our example, $\phi(\cdot)$ is the cosine similarity of the point estimates of the words $j = $ `doctor` and $k=$ `surgeon`:

$$\phi(w_j, w_k) = \frac{w_j^T w_k}{\|w_j\| \|w_k\|} $$

We now compute $\frac{\partial \cos(w_j, w_k)}{\partial w_j}$, the derivative of the cosine similarity with respect to one of the vectors. The derivative with respect to $w_k$ follows by symmetry, swapping the roles of $w_j$ and $w_k$.

$$d_j := \frac{\partial \cos(w_j, w_k)}{\partial w_j} = \frac{w_k}{\|w_k\| \|w_j\|} - \cos(w_j, w_k) \cdot\frac{w_j}{\|w_j\|^2}  $$
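
The expression for $d_j$ can be recovered with the product rule. The following is a short sketch of the steps, using only the definition of the cosine similarity given above. Writing $\cos(w_j, w_k) = (w_j^T w_k) \cdot \|w_j\|^{-1} \cdot \|w_k\|^{-1}$ and using $\frac{\partial (w_j^T w_k)}{\partial w_j} = w_k$ and $\frac{\partial \|w_j\|^{-1}}{\partial w_j} = -\frac{w_j}{\|w_j\|^3}$, we get:

$$\frac{\partial \cos(w_j, w_k)}{\partial w_j} = \frac{w_k}{\|w_j\| \|w_k\|} - \frac{(w_j^T w_k) \, w_j}{\|w_j\|^3 \|w_k\|} = \frac{w_k}{\|w_k\| \|w_j\|} - \cos(w_j, w_k) \cdot \frac{w_j}{\|w_j\|^2}$$

A useful sanity check is that $d_j^T w_j = \cos(w_j, w_k) - \cos(w_j, w_k) = 0$: the cosine similarity does not change when $w_j$ is rescaled, so its gradient is orthogonal to $w_j$.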

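Then, treating $w_j$ and $w_k$ as independent, so that the joint covariance $\Sigma$ is block-diagonal with blocks $\Sigma_j$ and $\Sigma_k$, the variance of $\phi(W)$ is given by:

$$ \text{var}(\phi(W)) = \phi^{\prime}(W)^T\Sigma\phi^{\prime}(W) = \sum_{i \in \{j, k\}} d_i^T \Sigma_i d_i$$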
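
As a minimal, self-contained sketch of this computation (independent of `glove_v`), the snippet below evaluates the gradient and the variance formula on small synthetic inputs. The helper names `cosine` and `cosine_grad`, the dimension, and the toy covariance matrices are all assumptions made for illustration, not part of the GloVe-V code or data.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_grad(u, v):
    """Gradient of cos(u, v) with respect to u (the d_j expression above)."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    return v / (norm_v * norm_u) - cosine(u, v) * u / norm_u**2


# Synthetic stand-ins for the embeddings and their covariance matrices.
rng = np.random.default_rng(0)
d = 4
mu_j, mu_k = rng.normal(size=d), rng.normal(size=d)
A_j, A_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
sigma_j = 1e-3 * (A_j @ A_j.T + np.eye(d))  # symmetric positive definite
sigma_k = 1e-3 * (A_k @ A_k.T + np.eye(d))

# Gradients with respect to each word vector, evaluated at the point estimates.
d_j = cosine_grad(mu_j, mu_k)
d_k = cosine_grad(mu_k, mu_j)

# Delta Method variance: sum of d_i^T Sigma_i d_i over the two words.
var_cos = d_j @ sigma_j @ d_j + d_k @ sigma_k @ d_k
print(f"cosine similarity: {cosine(mu_j, mu_k):.4f}")
print(f"Delta Method standard deviation: {np.sqrt(var_cos):.4f}")
```

Because each term is a quadratic form in a positive definite matrix, the resulting variance is always non-negative.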


**Method of Composition (Tanner, 1996)**

The Method of Composition propagates the uncertainty from a set of input variables to an output variable $Y$, generating independent and identically distributed samples of the output variable. In our example, $Y = \cos(w_j, w_k)$.

Let $K$ be the number of iterations. In the $m$th iteration, we draw one sample from each of the input variables, $x_j^{(m)} \sim N(\mu_j, \Sigma_j)$ and $x_k^{(m)} \sim N(\mu_k, \Sigma_k)$, and compute $Y^{(m)} = \cos(x_j^{(m)}, x_k^{(m)})$. Then, ($Y^{(1)}, ..., Y^{(K)}$) are i.i.d. from the marginal distribution of $Y$, and we can estimate the mean and variance of $Y$ as follows:

$$\hat{Y} = \frac{1}{K} \sum_{m=1}^{K} Y^{(m)}$$

$$ \widehat{\text{var}}(Y) = \frac{1}{K-1} \sum_{m=1}^{K} (Y^{(m)} - \hat{Y})^2 $$
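
The procedure can be sketched in a few lines of NumPy. This is an illustrative example on synthetic inputs, not the `glove_v` implementation: the mean vectors, the diagonal covariance matrices, and the number of draws are assumptions chosen so that the snippet runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 20_000

# Synthetic stand-ins for the point estimates and covariance matrices.
mu_j = np.array([1.0, 0.5, -0.2, 0.3])
mu_k = np.array([0.8, 0.1, 0.4, -0.1])
sigma_j = 1e-4 * np.eye(d)
sigma_k = 1e-4 * np.eye(d)

# One draw per iteration from each word's Normal distribution: shape (K, d).
x_j = rng.multivariate_normal(mu_j, sigma_j, size=K)
x_k = rng.multivariate_normal(mu_k, sigma_k, size=K)

# Row-wise cosine similarity gives the K samples of Y.
y = np.sum(x_j * x_k, axis=1) / (
    np.linalg.norm(x_j, axis=1) * np.linalg.norm(x_k, axis=1)
)

y_hat = y.mean()       # estimate of the mean of Y
var_y = y.var(ddof=1)  # sample variance with the 1/(K-1) normalization
print(f"mean: {y_hat:.4f}, standard deviation: {np.sqrt(var_y):.4f}")
```

Passing `ddof=1` matters only for small $K$; NumPy's default (`ddof=0`) divides by $K$ instead of $K-1$.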

In [1]:
# Set up environment

import numpy as np
import pandas as pd

import glove_v



### Download COHA (1900-1999) vectors and pre-computed variances

We start by downloading the pre-computed variances for the COHA (1900-1999) corpus. In this example, we download only a small subset which includes the vectors and variances for the words `doctor` and `surgeon`, which we make available in the `Toy-Embeddings` folder. 

To obtain the vectors and variances for the full vocabulary of the 1900-1999 COHA corpus, you can use `COHA_1900-1999_300d` as the `embedding_name` argument in the `download_embeddings` function.

In [None]:
glove_v.data.download_embeddings(
    embedding_name="Toy-Embeddings",
    approximation=False,
)

### Load the vocabulary, vectors and variances

In [2]:
# Vocabulary and inverse vocabulary
vocab, ivocab = glove_v.vector.load_vocab(
    embedding_name="Toy-Embeddings",
)
# Vectors and variances
vectors = glove_v.vector.load_vectors(
    embedding_name="Toy-Embeddings", format="dictionary"
)
variances = {}
for word in list(vocab.keys()):
    variances[word] = glove_v.variance.load_variance(
        embedding_name="Toy-Embeddings",
        approximation=False,
        word_idx=vocab[word],
    )

We can see that the dictionaries containing the vectors and pre-computed variances include the keys `doctor` and `surgeon`, as well as other occupations used in the generation of Figure 5 in the paper.

In [3]:
print(f"Keys in vectors dictionary: {vectors.keys()}")
print(f"Keys in variances dictionary: {variances.keys()}")

Keys in vectors dictionary: dict_keys(['doctor', 'surgeon', 'dentist', 'psychiatrist', 'therapist', 'veterinarian', 'obstetrician', 'pediatrician', 'pharmacist', 'neurologist', 'gynecologist'])
Keys in variances dictionary: dict_keys(['doctor', 'surgeon', 'dentist', 'psychiatrist', 'therapist', 'veterinarian', 'obstetrician', 'pediatrician', 'pharmacist', 'neurologist', 'gynecologist'])


## Cosine similarity point estimate
We now compute the point estimate for the cosine similarity between `doctor` and `surgeon` using the GloVe-trained vectors.

In [7]:
cs_pe = np.dot(vectors["doctor"], vectors["surgeon"])
cs_pe /= np.linalg.norm(vectors["doctor"]) * np.linalg.norm(vectors["surgeon"])
print(f'Cosine similarity between "doctor" and "surgeon": {cs_pe}')

Cosine similarity between "doctor" and "surgeon": 0.443281888961792


## Delta Method approach

We'll start by building a dictionary of derivatives for each word. We use the `cosine_derivative` function in `glove_v.propagate`, which implements the following computation for the derivative of the cosine similarity with respect to one of the vectors:

$$d_j := \frac{\partial \cos(w_j, w_k)}{\partial w_j} = \frac{w_k}{\|w_k\| \|w_j\|} - \cos(w_j, w_k) \cdot\frac{w_j}{\|w_j\|^2}  $$

In [8]:
deriv_dict = {}
for w in ["doctor", "surgeon"]:
    w_vec = vectors[w]
    other_w = "doctor" if w == "surgeon" else "surgeon"
    c_vec = vectors[other_w]
    w_der = glove_v.propagate.cosine_derivative(u=w_vec, v=c_vec)
    deriv_dict[w] = w_der.reshape(1, -1)
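
If you want to convince yourself that the derivative formula is correct, one option is to compare it against a finite-difference approximation. The sketch below does this on random vectors without calling `glove_v`. The functions `cosine_grad_formula` and `cosine_grad_numeric` are hypothetical helpers written for this check, not part of the library.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_grad_formula(u, v):
    """Closed-form gradient of cos(u, v) with respect to u."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    return v / (norm_v * norm_u) - cosine(u, v) * u / norm_u**2


def cosine_grad_numeric(u, v, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = eps
        grad[i] = (cosine(u + step, v) - cosine(u - step, v)) / (2 * eps)
    return grad


rng = np.random.default_rng(42)
u, v = rng.normal(size=5), rng.normal(size=5)

max_abs_diff = np.max(
    np.abs(cosine_grad_formula(u, v) - cosine_grad_numeric(u, v))
)
print(f"largest absolute difference: {max_abs_diff:.2e}")
```

The two gradients should agree up to the error of the finite-difference approximation.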

Next, we compute the variance of the cosine similarity, $\text{var}(\phi(W))$, as follows, using the `delta_method_variance` function in `glove_v.propagate`:

$$ \text{var}(\phi(W)) = \sum_{i \in \{j, k\}} d_i^T \Sigma_i d_i$$

In [9]:
cs_variance = glove_v.propagate.delta_method_variance(
    deriv_dict=deriv_dict,
    variance_dict=variances,
)

In [10]:
DM_dict = {
    "Method": ["Delta Method"],
    "Mean": [cs_pe],
    "Standard Deviation": [np.sqrt(cs_variance)],
}

The **Delta Method** gives us a standard deviation of 0.010045 for the cosine similarity.

In [11]:
print(DM_dict)

{'Method': ['Delta Method'], 'Mean': [0.4432819], 'Standard Deviation': [0.010045475617181882]}
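
To turn this standard deviation into an uncertainty interval, one common choice is a Normal-approximation interval around the point estimate. The sketch below uses the two numbers printed above. The 95% level and the use of a symmetric Normal interval are assumptions made here for illustration.

```python
from statistics import NormalDist

# Values copied from the Delta Method output above.
mean = 0.443281888961792
std_dev = 0.010045475617181882

# Two-sided 95% interval under a Normal approximation.
level = 0.95
z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96
lower = mean - z * std_dev
upper = mean + z * std_dev
print(f"{level:.0%} interval: [{lower:.4f}, {upper:.4f}]")
```

With these inputs the interval is roughly 0.424 to 0.463.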


## Method of Composition approach

In this approach, we obtain $K = 100{,}000$ samples of the cosine similarity of these two words, using random draws from the Normal distribution of each word. We then estimate the cosine similarity and its standard deviation as the mean and standard deviation of the computed samples.

In [12]:
K = 100_000

sample_matrix_doctor = glove_v.propagate.sample_vector(
    variance=variances["doctor"],
    vector=vectors["doctor"],
    n=K,
)

sample_matrix_surgeon = glove_v.propagate.sample_vector(
    variance=variances["surgeon"],
    vector=vectors["surgeon"],
    n=K,
)

We now compute ($Y^{(1)}, ..., Y^{(K)}$), the i.i.d. samples of the cosine similarity.

In [13]:
moc_cs = np.sum(sample_matrix_doctor * sample_matrix_surgeon, axis=1)
moc_cs = moc_cs / (
    np.linalg.norm(sample_matrix_doctor, axis=1)
    * np.linalg.norm(sample_matrix_surgeon, axis=1)
)

In [14]:
MOC_dict = {
    "Method": ["Method of Composition"],
    "Mean": [np.mean(moc_cs)],
    # ddof=1 gives the sample standard deviation with the 1/(K-1) normalization
    "Standard Deviation": [np.std(moc_cs, ddof=1)],
}
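
Because the Method of Composition produces samples rather than a single variance, an uncertainty interval can also be read directly from the empirical quantiles. The sketch below shows one way to do this. The helper `percentile_interval` is hypothetical, and the samples are synthetic stand-ins drawn from a Normal distribution so that the snippet runs on its own; in the notebook you would pass `moc_cs` instead.

```python
import numpy as np


def percentile_interval(samples, level=0.95):
    """Equal-tailed interval from the empirical quantiles of the samples."""
    alpha = 1.0 - level
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic stand-in for the cosine similarity samples (illustration only).
rng = np.random.default_rng(0)
stand_in_samples = rng.normal(loc=0.43, scale=0.01, size=100_000)

lower, upper = percentile_interval(stand_in_samples)
print(f"95% interval: [{lower:.4f}, {upper:.4f}]")
```

Unlike a symmetric Normal interval, a percentile interval can be asymmetric around the mean, which can be useful when the sampled statistic is skewed.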

## Comparison: Delta Method vs. Method of Composition

We can now compare the results from the **Delta Method** and the **Method of Composition**. The two approaches give nearly identical standard deviations (0.010045 vs. 0.009874). The means differ by about 0.011, because the **Delta Method** is centered at the cosine similarity of the point estimates of the words, whereas the **Method of Composition** is centered at the mean of the cosine similarity samples.

In [15]:
df = pd.DataFrame.from_dict(DM_dict)
df = pd.concat([df, pd.DataFrame.from_dict(MOC_dict)])

print(df)

                  Method      Mean  Standard Deviation
0           Delta Method  0.443282            0.010045
0  Method of Composition  0.431803            0.009874


## References
M. A. Tanner, *Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions*, Springer Series in Statistics (Springer New York, 1996).