# Word Analogy and Debiasing
---
This notebook covers three tasks: word analogy, debiasing and equalizing. Pre-trained word embeddings (e.g. GloVe, word2vec) provide the word vectors needed to accomplish them.
1. **Word Analogy:** Complete analogies of the form "a is to b as c is to ?". For example, 'China' is to 'Mandarin' as 'France' is to 'French'.
2. **Debiasing:** The text used to train the word embeddings reflects some of the biases of human language, and gender bias is a significant one. Debiasing removes the gender component from words that should be gender-neutral.
3. **Equalizing:** Some words are gender-specific. For example, we may assume gender is the only difference between 'girl' and 'boy'. Therefore, both words should be equidistant from the non-gender dimensions and differ only along the gender axis.

### Acknowledgement:
Some ideas come from [Deep Learning Course on Coursera](https://www.deeplearning.ai/deep-learning-specialization/) (e.g., the debiasing and equalizing equations) and the [paper](https://arxiv.org/abs/1607.06520).

## 1. Load Word Embeddings
The pre-trained word vectors are downloaded from [GloVe](https://nlp.stanford.edu/projects/glove/). The file used here (`glove.6B.50d.txt`) contains 400k words, each represented by a 50-dimensional vector.

In [1]:
import numpy as np

In [2]:
# Read the GloVe text file and return the vocabulary and the word vectors.
def read_glove(name):
    """Given the path/name of the GloVe file, return the words (a set)
    and word2vec_map (a dict mapping each word to its vector).
    """
    # Create a set for the words and a dictionary for the words and their corresponding vectors.
    words = set()
    word2vec_map = {}

    # The context manager closes the file even if parsing fails.
    with open(name, 'r', encoding='utf-8') as file:
        for line in file:
            # Each line holds a word followed by its vector components.
            word, *values = line.split()
            words.add(word)
            word2vec_map[word] = np.array(values, dtype=np.float64)

    return words, word2vec_map


In [3]:
words, word2vec_map =  read_glove('glove.6B.50d.txt')    

# length of vocab
print('length of vocab:',len(words))

# dimension of word
print('dimension of word:',word2vec_map['hello'].shape)

length of vocab: 400000
dimension of word: (50,)
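
To make the file layout concrete, here is a minimal sketch that parses two made-up lines in the same whitespace-separated layout (a token followed by its vector components). The sample text, the 3-dimensional vectors and the name `toy_map` are illustrative assumptions, not part of the GloVe data.

```python
import io
import numpy as np

# Two made-up lines in the "word v1 v2 ... vn" layout (illustrative only).
sample = io.StringIO("hello 0.1 -0.2 0.3\nworld 0.4 0.5 -0.6\n")

toy_map = {}
for line in sample:
    # The first token is the word, the remaining tokens are the vector components.
    token, *values = line.split()
    toy_map[token] = np.array(values, dtype=np.float64)

print(sorted(toy_map))
print(toy_map["hello"].shape)
```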


## 2. Word Analogy
### 2.1 Define similarity
Cosine similarity is used to measure the similarity of two vectors. It depends only on the angle $\theta$ between them, not on their lengths.
$$\text{CosineSimilarity}(a, b) = \frac{a \cdot b}{||a||_2 \, ||b||_2} = \cos(\theta)$$

In [4]:
def cosine_sim(a,b):
    """Given vector a and b, compute the cosine similarity of these two vectors.
    """
    # Compute the dot product of a,b
    dot = np.dot(a,b)
    # compute the cosine similarity of a,b
    sim = dot/(np.linalg.norm(a)*np.linalg.norm(b))
    
    return sim


In [5]:
print(cosine_sim(word2vec_map['man'], word2vec_map['woman']))


0.8860337718495819
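
As a quick sanity check of the definition, the sketch below evaluates the three boundary cases on hand-picked 2-D vectors: same direction, orthogonal, and opposite direction. The vectors are illustrative assumptions, and `cosine_sim` is repeated so the block runs on its own.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0])

# Same direction, different length: similarity 1.
same_direction = cosine_sim(x, np.array([3.0, 0.0]))
# Perpendicular vectors: similarity 0.
orthogonal = cosine_sim(x, np.array([0.0, 2.0]))
# Opposite direction: similarity -1.
opposite = cosine_sim(x, np.array([-5.0, 0.0]))

print(same_direction, orthogonal, opposite)
```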


### 2.2 Find word analogy
If word a is to word b as word c is to word d, then $e_b - e_a \approx e_d - e_c$. Given the first three words, iterate over the vocabulary and pick the word d whose difference vector $e_d - e_c$ has the highest cosine similarity with $e_b - e_a$.

In [6]:
def word_analogy(word_a, word_b, word_c, words, word2vec):
    """word_a is to word_b as word_c is to __.
    Find the word given the words and word vectors.
    """
    # Make sure the inputs are in lower case.
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    a,b,c = word2vec[word_a], word2vec[word_b], word2vec[word_c]
    
    best_sim = -100
    best_word = None
    for word in words:
        
        if word in [word_a, word_b, word_c]:
            continue
        # compute the current similarity
        sim = cosine_sim(a-b, c-word2vec[word])
        if sim > best_sim:
            best_sim = sim
            best_word = word
        
    return best_word

In [7]:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'good')]
for triad in triads_to_try:
    print ('{} -> {} :: {} -> {}'.format( *triad, word_analogy(*triad,words, word2vec_map)))

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: good -> better
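
The loop above calls `cosine_sim` once per vocabulary word. A possible alternative, sketched here under the assumption that the vectors are stacked into one matrix, scores all candidates with a single matrix product. The function name `word_analogy_vectorized`, the `vocab` list and the toy 2-D vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def word_analogy_vectorized(word_a, word_b, word_c, vocab, matrix):
    """vocab is a list of words; matrix has one row per word, aligned with vocab."""
    index = {w: i for i, w in enumerate(vocab)}
    a, b, c = (matrix[index[w.lower()]] for w in (word_a, word_b, word_c))

    target = b - a            # e_b - e_a
    diffs = matrix - c        # e_d - e_c for every candidate d
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)

    # Cosine similarity for every candidate; zero-length differences are skipped.
    sims = np.full(len(vocab), -np.inf)
    valid = norms > 0
    sims[valid] = diffs[valid] @ target / norms[valid]

    # Exclude the three input words, as in the loop version.
    for w in (word_a, word_b, word_c):
        sims[index[w.lower()]] = -np.inf

    return vocab[int(np.argmax(sims))]

# Toy 2-D embeddings (made up for illustration).
vocab = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 0.0], [3.0, 1.0], [0.0, 5.0]])

result = word_analogy_vectorized("man", "woman", "king", vocab, matrix)
print(result)
```

With the notebook's data, `vocab` could be `list(words)` and `matrix` could be built with `np.stack([word2vec_map[w] for w in vocab])`.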


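## 3. Debiasing
Some words should be gender-neutral, but their pre-trained vectors are not: the embeddings reflect the bias present in the language they were trained on.

### 3.1 Define the gender vector
The gender direction $g$ is estimated by averaging the difference vectors of a few gendered word pairs. Because each difference is taken as male minus female, a positive cosine similarity with $g$ means a word leans toward the male side and a negative one toward the female side.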

In [8]:
g1 = word2vec_map['man'] - word2vec_map['woman']
g2 = word2vec_map['father'] - word2vec_map['mother']
g3 = word2vec_map['boy'] - word2vec_map['girl']
# Average the subtractions.
g = (g1+g2+g3)/3

print(cosine_sim(word2vec_map['technology'], g))
print(cosine_sim(word2vec_map['flower'], g))

0.16192108462558177
-0.0939532553641572


### 3.2 Neutralize the words
A word is neutralized by subtracting its projection onto the gender vector:

$$e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g$$
$$e^{debiased} = e - e^{bias\_component}$$

Where:  
$g$: The gender vector.  
$e$: The original word vector



In [9]:
def neutralize(word, gender, word2vec):
    """Given the word to neutralize, gender vector and the word vectors, neutralize the word.
    """
    e = word2vec[word]
    e_bias = (np.dot(e,gender)/(np.linalg.norm(gender)**2))*gender
    
    e_unbiased = e - e_bias
    return e_unbiased



After neutralizing words:

In [10]:
print(cosine_sim(g,neutralize('technology', g, word2vec_map) ))
print(cosine_sim(g,neutralize('flower', g, word2vec_map) ))

1.8444594232094444e-17
-8.244955165656526e-18
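
The values of order $10^{-17}$ above are floating-point round-off. Algebraically, the neutralized vector is exactly orthogonal to $g$, which follows from substituting the definition of the bias component and using $g \cdot g = ||g||_2^2$:

$$e^{debiased} \cdot g = e \cdot g - \frac{e \cdot g}{||g||_2^2} (g \cdot g) = e \cdot g - e \cdot g = 0$$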


## 4. Equalizing
The two words of a gender-specific pair should be equidistant from the non-gender dimensions (axes), so that they differ only along the bias axis.

Major equations:

$$\mu = \frac{e_{w1} + e_{w2}}{2}$$

$$\mu_{B} = \frac{\mu \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} * \text{bias_axis}$$

$$\mu_{\perp} = \mu - \mu_{B}$$

$$e_{w1B} = \frac{e_{w1} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} * \text{bias_axis}$$

$$e_{w2B} = \frac{e_{w2} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} * \text{bias_axis}$$

$$e_{w1B}^{corrected} = \sqrt{\left| 1 - ||\mu_{\perp}||_2^2 \right|} * \frac{e_{w1B} - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||_2}$$

$$e_{w2B}^{corrected} = \sqrt{\left| 1 - ||\mu_{\perp}||_2^2 \right|} * \frac{e_{w2B} - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||_2}$$

$$e_1 = e_{w1B}^{corrected} + \mu_{\perp}$$

$$e_2 = e_{w2B}^{corrected} + \mu_{\perp}$$

In [11]:
def equalize(pair, bias_axis, word2vec_map):
    """Given a word pair, the bias axis and the word vectors,
       make the two words equidistant from the axes orthogonal to the bias axis.
    """

    w1, w2 = pair
    e_w1, e_w2 = word2vec_map[w1], word2vec_map[w2]
    
    # Compute the mean of e_w1 and e_w2
    mu = (e_w1+e_w2)/2

    # Compute the projections of mu over the bias axis and the orthogonal axis
    mu_B = np.dot(mu,bias_axis)/(np.square(np.linalg.norm(bias_axis)))*bias_axis
    mu_orth = mu - mu_B

    # Compute e_w1B and e_w2B 
    e_w1B = np.dot(e_w1,bias_axis)/(np.square(np.linalg.norm(bias_axis)))*bias_axis
    e_w2B = np.dot(e_w2,bias_axis)/(np.square(np.linalg.norm(bias_axis)))*bias_axis
        
    # Adjust the Bias part of e_w1B and e_w2B
    corrected_e_w1B = np.sqrt(np.abs(1-np.square(np.linalg.norm(mu_orth))))*(e_w1B-mu_B)/np.linalg.norm((e_w1-mu_orth)-mu_B)
    corrected_e_w2B = np.sqrt(np.abs(1-np.square(np.linalg.norm(mu_orth))))*(e_w2B-mu_B)/np.linalg.norm((e_w2-mu_orth)-mu_B)

    # Debias by equalizing e1 and e2 to the sum of their corrected projections
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth

    return e1, e2

In [12]:
print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_sim(word2vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_sim(word2vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g, word2vec_map)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_sim(e1, g))
print("cosine_similarity(e2, gender) = ", cosine_sim(e2, g))

cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) =  0.02435875412347579
cosine_similarity(word_to_vec_map["woman"], gender) =  -0.3979047171251496

cosine similarities after equalizing:
cosine_similarity(e1, gender) =  0.6624273110383183
cosine_similarity(e2, gender) =  -0.6624273110383184
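
The symmetry in the output can be checked on a small example. The sketch below applies the same equations to two hand-picked 3-D vectors and confirms that the results mirror each other along the bias axis, share the same orthogonal component, and are equidistant from a vector with no bias component. The helper names `project` and `equalize_vectors` and all toy vectors are assumptions made for this illustration.

```python
import numpy as np

def project(v, axis):
    """Projection of v onto axis."""
    return np.dot(v, axis) / np.dot(axis, axis) * axis

def equalize_vectors(e_w1, e_w2, bias_axis):
    """Same equations as equalize(), applied directly to two vectors."""
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu, bias_axis)
    mu_orth = mu - mu_B
    scale = np.sqrt(np.abs(1 - np.dot(mu_orth, mu_orth)))
    c1 = scale * (project(e_w1, bias_axis) - mu_B) / np.linalg.norm((e_w1 - mu_orth) - mu_B)
    c2 = scale * (project(e_w2, bias_axis) - mu_B) / np.linalg.norm((e_w2 - mu_orth) - mu_B)
    return c1 + mu_orth, c2 + mu_orth

# Toy data: the bias axis is the first coordinate (illustrative only).
bias_axis = np.array([1.0, 0.0, 0.0])
e_w1 = np.array([0.6, 0.3, 0.2])
e_w2 = np.array([-0.2, 0.5, 0.0])

e1, e2 = equalize_vectors(e_w1, e_w2, bias_axis)

# A vector with no component along the bias axis.
neutral = np.array([0.0, 1.0, -2.0])

print(e1, e2)
print(np.linalg.norm(e1 - neutral), np.linalg.norm(e2 - neutral))
```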
