# Word Bias

Use a pre-trained set of embeddings to detect word biases.

You can train your own word embeddings on your dataset. 

Provided you observe all privacy regulations, and you have a self-destructive urge, you could use the emails from your organization to check for stereotypes. Be aware, though, that training a model yourself is computationally expensive.

Recommended lecture: **Sequence Models Specialization by Deeplearning.ai on Coursera.**

Or at least the video below:

*https://www.coursera.org/lecture/nlp-sequence-models/properties-of-word-embeddings-S2mat*


In [1]:
import numpy as np

Some useful functions

In [2]:
def cosine_similarity(u, v):
    # Cosine similarity (degree of similarity) between the two vectors u and v passed as arguments
    # Both vectors should have the same shape
    # The result lies between -1 and 1
    # If u and v point in similar directions, their cosine similarity will be close to 1
    # If they are unrelated it will be close to 0, and if they point in opposite directions close to -1

    assert u.shape[0] == v.shape[0]
    cosine =  np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return cosine
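
To get a feel for the range of values, here is a minimal sketch on made-up 2-dimensional vectors (the vectors are hypothetical, picked only so the results are easy to check by hand). The function is repeated so the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(u, v):
    # Same formula as above, repeated so this snippet is self-contained
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy vectors, not taken from any real embedding
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)
```

The three values should come out at roughly 1, 0 and -1: only the direction of the vectors matters, not their length.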


In [3]:
def load_pretrained_word_vector(file = "data/glove.6B/glove.6B.50d.txt"):
    # load pretrained word vectors from various sources such as:
    # https://nlp.stanford.edu/projects/glove/, 
    # https://fasttext.cc/docs/en/crawl-vectors.html, 
    # https://www.kaggle.com/, etc.
    # by default, a small model with 50 dimensions is used
    
    dictionary_embeddings = {}
    with open(file, 'r', encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word_embedding_vector = np.asarray(values[1:], "float64")
            dictionary_embeddings[values[0]] = word_embedding_vector
    return dictionary_embeddings
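
The loader assumes the plain-text layout used by the GloVe files: one word per line, followed by its vector components separated by spaces. A small sketch of that parsing step, using a temporary file with two invented lines (the words and numbers are placeholders, not real GloVe values):

```python
import os
import tempfile
import numpy as np

# Two made-up lines in the "word v1 v2 v3" layout
sample = "king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

embeddings = {}
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        # First token is the word, the rest are the vector components
        embeddings[values[0]] = np.asarray(values[1:], dtype="float64")

os.remove(path)  # clean up the temporary file

print(embeddings["queen"])
```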

In [4]:
def word_analogy_answer(a, b, c, dictionary_embeddings):
    # if "a is to b as c is to x", find x and return it
    # word x is chosen such that the cosine similarity between:
    #    - the difference of the word vectors associated with (a, b) and
    #    - the difference of the word vectors associated with (c, x)
    # is maximal
    # word vectors (and their similarities) mirror the biases of the public data they were trained on

    # word vectors associated to the three words a, b and c 
    va = dictionary_embeddings[a.lower()]
    vb = dictionary_embeddings[b.lower()]
    vc = dictionary_embeddings[c.lower()]
    
    # all words except a, b and c (lowercased to match the lookups above)
    words_to_try = dictionary_embeddings.keys() - {a.lower(), b.lower(), c.lower()}
    max_cosine = -1.0
    answer_word = "_None_"
    
    for word in words_to_try:        
        cosine = cosine_similarity(vb-va, dictionary_embeddings[word]-vc)
        if cosine > max_cosine:
            max_cosine = cosine
            answer_word = word
    return answer_word
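
Looping over every word in Python can be slow on a large vocabulary. One possible alternative, sketched here under assumptions, is to compute all the cosine similarities at once with NumPy. The name `word_analogy_answer_vectorized` and the toy embeddings are hypothetical and exist only for this illustration; the scoring is meant to match the loop above.

```python
import numpy as np

def word_analogy_answer_vectorized(a, b, c, dictionary_embeddings):
    # Hypothetical variant of word_analogy_answer: same scoring, done with matrix operations
    a, b, c = a.lower(), b.lower(), c.lower()
    words = [w for w in dictionary_embeddings if w not in {a, b, c}]
    matrix = np.stack([dictionary_embeddings[w] for w in words])

    target = dictionary_embeddings[b] - dictionary_embeddings[a]  # vb - va
    candidates = matrix - dictionary_embeddings[c]                # vx - vc for every x

    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    scores = np.full(len(words), -1.0)
    valid = norms > 0  # skip zero-length differences to avoid dividing by zero
    scores[valid] = candidates[valid] @ target / norms[valid]
    return words[int(np.argmax(scores))]

# Made-up 2-dimensional embeddings, built so that queen - woman equals king - man
toy_embeddings = {
    "man": np.array([1.0, 0.0]),
    "king": np.array([1.0, 2.0]),
    "woman": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 2.0]),
    "apple": np.array([5.0, 1.0]),
}

print(word_analogy_answer_vectorized("man", "king", "woman", toy_embeddings))
```

On the toy data the answer is `queen`, because its offset from `woman` points in exactly the same direction as the offset from `man` to `king`.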

Let's load a pretrained word vector model

In [5]:
my_dictionary_embeddings = load_pretrained_word_vector("data/glove.6B/glove.6B.50d.txt")
print("Number of words in pre-trained model: {}".format(len(my_dictionary_embeddings.keys())))
print("Each word has associated a {}-dimension vector".format(my_dictionary_embeddings["a"].shape[0]))

Number of words in pre-trained model: 400000
Each word has associated a 50-dimension vector


And ask some questions

In [6]:
question = ("man","doctor","woman")
print("{} is to {} what {} is to {}".format(*question, word_analogy_answer(*question,my_dictionary_embeddings)))

man is to doctor what woman is to nurse


In [None]:
question = ("white","doctor","black")
print("{} is to {} what {} is to {}".format(*question, word_analogy_answer(*question,my_dictionary_embeddings)))

In [None]:
question = ("christian","doctor","muslim")
print("{} is to {} what {} is to {}".format(*question, word_analogy_answer(*question,my_dictionary_embeddings)))