# About this notebook 

This notebook demonstrates how to train a fairness-aware word2vec model with triplet contrastive learning. In triplet contrastive learning, one samples a center word $i$ and a "context" word $j$ that appears near $i$. Then, one samples a "fake" context $j'$ from a noise distribution $P$. word2vec is trained to discriminate the authentic center-context pair ($i$,$j$) from the fake center-context pair ($i$,$j'$), so it learns an embedding from the data features that best separate authentic pairs from fake ones. 

The core idea is to prevent word2vec from picking up features pertaining to social biases by generating the fake context $j'$ with a biased model. Because the fake pairs then carry the same biases as the authentic pairs, bias-related features become non-informative for the discrimination task, and the new word2vec model learns its embedding from the remaining, unbiased component of the data. 

To generate a fake context $j'$ from a biased word2vec model, recall that word2vec constructs an embedding by learning the conditional probability 
$$
P(j|i) = \frac{\exp(u_i ^ \top v_j)}{Z},
$$
where $u_i$ and $v_j$ are the embedding vectors representing center word $i$ and context word $j$, and $Z$ is the normalization constant. Given a biased embedding, we sample the fake context $j'$ from a tempered version of this conditional probability, 
$$
P(j'|i) = \frac{\exp(\alpha \cdot u_i ^ \top v_{j'})}{Z},
$$
where $\alpha$ is the concentration parameter. $\alpha=0$ yields a uniform distribution, and $\alpha \rightarrow \infty$ leads to a delta function peaked at the maximum similarity. We set $\alpha=1$. 
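
As a minimal sketch of the role of $\alpha$, the snippet below evaluates $P(j'|i)$ on a handful of made-up vectors. The function name `fake_context_probs` and the toy arrays are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np

def fake_context_probs(u_i, out_vecs, alpha=1.0):
    """Return P(j'|i), proportional to exp(alpha * u_i . v_j'), over all context words."""
    logits = alpha * (out_vecs @ u_i)
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()      # dividing by the sum plays the role of Z

# Toy (made-up) embeddings: one center vector and four context vectors.
u_i = np.array([1.0, 0.5])
out_vecs = np.array([[1.0, 1.0], [0.2, -0.1], [-1.0, 0.3], [0.5, 0.5]])

p_uniform = fake_context_probs(u_i, out_vecs, alpha=0.0)   # every entry equals 0.25
p_default = fake_context_probs(u_i, out_vecs, alpha=1.0)
p_peaked = fake_context_probs(u_i, out_vecs, alpha=50.0)   # mass concentrates on word 0

# Draw one fake context index from the alpha = 1 distribution.
rng = np.random.default_rng(0)
j_fake = rng.choice(len(out_vecs), p=p_default)
```

Word 0 has the largest dot product with `u_i`, so it receives nearly all of the mass once $\alpha$ is large.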

# Set up

Import packages

In [None]:
import numpy as np 
import torch 
from scipy import sparse 
from pathlib import Path
import gravlearn # pip install gravlearn
import gensim 
import pickle
from tqdm.auto import tqdm
import faiss 
from numba import njit

import sys
sys.path.insert(0, "../../")
from utils.dataset import Dataset

Files

In [None]:
# Input
data_dir = Path("../../data")
biased_model_file = data_dir / "derived/simplewiki/models/biased_word2vec.bin"
biased_dataset_id_file = data_dir / "derived/simplewiki/biased-dataset/dataset.pkl"
dataset_file = data_dir / "raw/simplewiki/simplewiki-20171103-pages-articles-multistream.xml.bz2"

# Output
output_file = data_dir / "derived/simplewiki/models/fairness-aware-word2vec/fairness-aware-word2vec_dim~25.pth"
output_kv_file = data_dir / "derived/simplewiki/models/fairness-aware-word2vec/fairness-aware-word2vec-keyedvector_dim~25.pth"

Load biased model and data

In [None]:

biased_model = gensim.models.Word2Vec.load(str(biased_model_file))

with open(biased_dataset_id_file, "rb") as f: 
    dataset = pickle.load(f)

documents = Dataset(dataset_file)

# Preparation

Indexing words

In [None]:
word2index = biased_model.wv.key_to_index.copy()
# Map each token to its vocabulary index and drop out-of-vocabulary tokens
indexed_documents = [
    [word2index[w] for w in doc if w in word2index]
    for doc in tqdm(documents.lines)
]
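
A toy illustration of the indexing step, using a hypothetical three-word vocabulary (`toy_word2index` and `toy_documents` are made up for this example): tokens missing from the vocabulary are silently dropped, so a document can end up shorter or even empty.

```python
# Hypothetical toy vocabulary and documents, for illustration only.
toy_word2index = {"the": 0, "cat": 1, "sat": 2}
toy_documents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["unknown"]]

# Keep only in-vocabulary tokens and replace each with its integer index.
toy_indexed = [
    [toy_word2index[w] for w in doc if w in toy_word2index]
    for doc in toy_documents
]
# toy_indexed == [[0, 1, 2], [0, 2], []]
```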

Get the biased embedding vectors

In [None]:
num_nodes = len(biased_model.wv)
dim = biased_model.vector_size
in_vec = np.zeros((num_nodes, dim))
out_vec = np.zeros((num_nodes, dim))
for i, k in enumerate(biased_model.wv.index_to_key):
    in_vec[i, :] = biased_model.wv[k]
    out_vec[i, :] = biased_model.syn1neg[i]

We will construct a dataset for triplet contrastive learning. Our dataset consists of two samplers, one for anchor-positive example pairs, and the other for anchor-negative example pairs.
First, let us define the sampler for anchor-positive pairs. We will use nGramSampler, which samples the pairs from a given word sequence. 

In [None]:
pos_sampler = gravlearn.nGramSampler(
    window_length=10, context_window_type="double", buffer_size=1000,
)
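
To make the idea of anchor-positive pairs concrete, here is a pure-Python sketch that enumerates (center, context) pairs within a symmetric window. It is not gravlearn's implementation: the helper `window_pairs` and its `half_window` parameter are assumptions, and the exact meaning of `window_length` and `context_window_type="double"` in `nGramSampler` may differ.

```python
def window_pairs(seq, half_window):
    """Yield (center, context) pairs for contexts within half_window positions on either side."""
    for pos, center in enumerate(seq):
        lo = max(0, pos - half_window)
        hi = min(len(seq), pos + half_window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:          # a word is not its own context
                yield center, seq[ctx_pos]

pairs = list(window_pairs([10, 11, 12, 13], half_window=1))
# pairs == [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```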

Next, we define a sampler for anchor-negative pairs based on the softmax function 

$$P(j'|i) = \exp(u_i ^\top v_{j'}) / Z,$$


where $u_i$ and $v_{j'}$ are the in-vector and out-vector representing center word $i$ and context word $j'$, respectively. 
However, evaluating this probability is computationally expensive because the normalization constant $Z$ sums over all words in the vocabulary. 

To reduce the cost, we use the following approximate sampling scheme. 
1. For each center word $i$, find the top $m=500$ words with the largest $\exp(u_i ^\top v_{j'})$. 
2. With probability $\alpha$, draw the context $j'$ from these $m=500$ words with probability proportional to $\exp(u_i ^\top v_{j'})$. 
3. Otherwise, draw $j'$ from all words with probability proportional to their frequency. 

Here, $\alpha$ is a hyper-parameter that controls the balance between contextual and non-contextual sampling. It is a mixing probability, not the concentration parameter $\alpha$ from the introduction. We set $\alpha=0.9$ for this experiment.  
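
The steps above can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions: it uses an exact brute-force top-$m$ search per call instead of a precomputed faiss index, and the function name `sample_fake_context` and the toy arrays are hypothetical.

```python
import numpy as np

def sample_fake_context(i, in_vecs, out_vecs, freq, rng, alpha=0.9, m=2):
    """Contextual draw with probability alpha, frequency-based draw otherwise."""
    if rng.random() < alpha:
        scores = out_vecs @ in_vecs[i]
        top = np.argsort(-scores)[:m]                       # exact top-m (brute force)
        weights = np.exp(scores[top] - scores[top].max())   # stable softmax weights
        return int(rng.choice(top, p=weights / weights.sum()))
    return int(rng.choice(len(freq), p=freq / freq.sum()))

# Toy (made-up) data: three words with two-dimensional vectors.
rng = np.random.default_rng(0)
toy_in = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_out = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_freq = np.array([5.0, 3.0, 2.0])

# With alpha = 1, every draw for center word 0 comes from its top-2 set {0, 1}.
contextual = {
    sample_fake_context(0, toy_in, toy_out, toy_freq, rng, alpha=1.0)
    for _ in range(200)
}
```

Setting `alpha=0.0` in the same call switches entirely to the frequency-based branch.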

In [None]:
class Word2VecSampler(gravlearn.DataSampler):
    def __init__(self, in_vec, out_vec, alpha=0.9, m = 500, gpu_id = None):
        self.alpha = alpha
        self.in_vec = in_vec.astype("float32")
        self.out_vec = out_vec.astype("float32")
        self.center_sampler = gravlearn.FrequencyBasedSampler()
        self.n_elements, self.dim = out_vec.shape[0], self.out_vec.shape[1]
        
        #
        # Find the m words with the largest probability mass
        #

        # Make faiss index
        n_train_sample = np.minimum(100000, self.n_elements)
        nlist = int(np.ceil(np.sqrt(n_train_sample)))
        faiss_index = faiss.IndexIVFFlat(faiss.IndexFlatIP(self.dim), self.dim, nlist, faiss.METRIC_INNER_PRODUCT)

        if gpu_id is not None:
            res = faiss.StandardGpuResources()
            faiss_index = faiss.index_cpu_to_gpu(res, gpu_id, faiss_index)
        faiss_index.train(self.out_vec[np.random.choice(self.n_elements, n_train_sample, replace=False)])

        # Add the embedding vectors to index 
        faiss_index.add(self.out_vec)

        # Construct a graph of words with edges running between a center word $i$ and the m nodes with the largest $\exp(u_i ^\top v_j)$.  
        dist, indices = faiss_index.search(self.in_vec, m)
        # Row index of each (center, neighbor) pair; np.repeat keeps the indices integer-typed
        rows = np.repeat(np.arange(self.n_elements), m)
        rows, indices, dist = rows.ravel(), indices.ravel(), dist.ravel()
        s = indices >= 0
        rows, indices, dist = rows[s], indices[s], dist[s]
        dist = np.exp(dist)
        A = sparse.csr_matrix(
            (dist, (rows, indices)),
            shape=(self.n_elements, self.n_elements),
        )

        # Preprocess the graph for faster sampling 
        data = A.data / A.sum(axis=1).A1.repeat(np.diff(A.indptr))
        A.data = _csr_row_cumsum(A.indptr, data)

        self.A = A

    def fit(self, seqs):
        self.center_sampler.fit(seqs)

    def conditional_sampling(self, conditioned_on=None):
        if np.random.rand() < self.alpha:
            return _sample_one_neighbor(conditioned_on, self.A.indptr, self.A.indices, self.A.data)
        else:
            return self.center_sampler.sampling()[0]

    def sampling(self):
        cent = self.center_sampler.sampling()[0]
        cont = self.conditional_sampling(cent)
        return cent, cont

#
# Helper functions
#
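@njit(nogil=True)
def _csr_row_cumsum(indptr, data):
    # Cumulative sum of `data` within each CSR row. The last entry of every
    # non-empty row is set to exactly 1.0 to absorb floating-point round-off.
    out = np.empty_like(data)
    for i in range(len(indptr) - 1):
        start, end = indptr[i], indptr[i + 1]
        acc = 0.0
        for j in range(start, end):
            acc += data[j]
            out[j] = acc
        if end > start:  # empty rows have no last entry to overwrite
            out[end - 1] = 1.0
    return out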

@njit(nogil=True)
def _sample_one_neighbor(node_id, indptr, indices, data):
    neighbors = indices[indptr[node_id]:indptr[node_id + 1]]
    neighbors_weight = data[indptr[node_id]:indptr[node_id + 1]]
    return neighbors[
        np.searchsorted(neighbors_weight, np.random.rand())
    ]

neg_sampler = Word2VecSampler(in_vec=in_vec, out_vec=out_vec, alpha=0.9, m = 500)
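
The helper functions above rely on inverse-CDF sampling: once the weights of a row are normalized and cumulatively summed, a uniform random number can be mapped to a neighbor by binary search. A minimal standalone sketch with made-up weights (the names `weights`, `cdf`, and `draw` are illustrative):

```python
import numpy as np

weights = np.array([2.0, 1.0, 1.0])         # unnormalized weights of three neighbors
cdf = np.cumsum(weights / weights.sum())    # [0.5, 0.75, 1.0]
cdf[-1] = 1.0                               # guard against floating-point round-off

def draw(u):
    """Map a uniform number u in [0, 1) to a neighbor position via binary search."""
    return int(np.searchsorted(cdf, u))

# draw(0.1) == 0, draw(0.6) == 1, draw(0.9) == 2

# Empirical check: the positions are drawn roughly in proportion 0.5 : 0.25 : 0.25.
rng = np.random.default_rng(0)
draws = np.searchsorted(cdf, rng.random(10_000))
freqs = np.bincount(draws, minlength=3) / draws.size
```

Because the cumulative sums are computed once, each draw costs only a binary search over the row's neighbors.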

Fit the samplers and bundle them into a dataset

In [None]:
neg_sampler.fit(indexed_documents)
pos_sampler.fit(indexed_documents)

# Bundle them as a dataset
dataset = gravlearn.TripletDataset(
    epochs=1, pos_sampler=pos_sampler, neg_sampler=neg_sampler
)

# Training


We will train a word2vec model that uses the dot product as the similarity between embedding vectors. We will use a GPU to speed up training.  

In [None]:
device = "cuda:0"
dist_metric = gravlearn.metrics.DistanceMetrics.DOTSIM
batch_size = 20000
checkpoint = 1000 

Define the word2vec model: 

In [None]:
model = gravlearn.Word2Vec(vocab_size=num_nodes, dim=dim)
model.train()
model = model.to(device)
next(model.parameters()).device

Data loader

In [None]:
# Training 
dataloader = gravlearn.DataLoader(
     dataset,
     batch_size=batch_size,
     shuffle=False,
     num_workers=4,
     pin_memory=True,
)

Define the loss function and the optimizer:

In [None]:
# Set up the loss function
loss_func = gravlearn.TripletLoss(embedding=model, dist_metric=dist_metric)

# The optimizer 
focal_params = filter(lambda p: p.requires_grad, model.parameters())
optim = torch.optim.AdamW(focal_params)
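
For intuition, one common form of a triplet objective with dot similarity is the negative-sampling logistic loss, $-\log\sigma(u_i^\top v_j) - \log\sigma(-u_i^\top v_{j'})$. The NumPy sketch below implements that form. Whether `gravlearn.TripletLoss` uses exactly this expression is an assumption, and the function names are illustrative.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def triplet_logistic_loss(u_center, v_pos, v_neg):
    """Mean loss over a batch of (center, positive, negative) embedding rows."""
    pos_sim = np.sum(u_center * v_pos, axis=1)   # dot similarity with the true context
    neg_sim = np.sum(u_center * v_neg, axis=1)   # dot similarity with the fake context
    return -np.mean(log_sigmoid(pos_sim) + log_sigmoid(-neg_sim))

# Toy batch of one triplet (made-up vectors).
u = np.array([[1.0, 0.0]])
v_close = np.array([[3.0, 0.0]])
v_far = np.array([[-3.0, 0.0]])

good = triplet_logistic_loss(u, v_close, v_far)   # true context is similar: low loss
bad = triplet_logistic_loss(u, v_far, v_close)    # roles swapped: high loss
```

The loss is small when the authentic context is more similar to the center word than the fake one, which is the discrimination task described in the introduction.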

Training loop

In [None]:
pbar = tqdm(enumerate(dataloader), miniters=100, total=len(dataloader))
for it, (p1, p2, n1) in pbar:

    # clear out the gradient
    # (use a list: `filter` returns a one-shot iterator, which would already be
    # exhausted by the time it is passed to clip_grad_norm_ below)
    focal_params = [p for p in model.parameters() if p.requires_grad]
    for param in focal_params:
        param.grad = None

    # move the batch to the training device
    p1, p2, n1 = p1.to(device), p2.to(device), n1.to(device)

    # compute the loss
    loss = loss_func(p1, p2, n1)

    # backpropagate and clip the global gradient norm to 1
    loss.backward()
    torch.nn.utils.clip_grad_norm_(focal_params, 1)

    # update the parameters
    optim.step()

    pbar.set_postfix(loss=loss.item())

    if (it + 1) % checkpoint == 0:
        if output_file is not None:
            torch.save(model.state_dict(), output_file)
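
The gradient clipping step in the loop rescales all gradients jointly so that their global L2 norm does not exceed the given maximum. A NumPy sketch of that rescaling, intended to mirror the behavior of `torch.nn.utils.clip_grad_norm_` as I understand it (the helper `clip_by_global_norm` and the toy gradients are illustrative assumptions):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]   # global L2 norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

Gradients whose global norm is already below the maximum pass through unchanged, so clipping only affects unusually large updates.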

Save the embedding as gensim KeyedVectors

In [None]:
in_vec = model.ivectors.weight.detach().cpu().numpy()
kv = gensim.models.KeyedVectors(in_vec.shape[1])
kv.add_vectors(biased_model.wv.index_to_key, in_vec)
kv.save(str(output_kv_file))  # write to the output path defined in the "Files" cell