# Racial in Word Embeddings 
Now, we build up on results from [Bolukbasi, Tolga, et al. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf)" Advances in Neural Information Processing Systems. 2016.

We examine bias across race/religion.

The author's [code](https://github.com/tolga-b/debiaswe) was a great help.

In [1]:
# First, we download a condensed version of word2vec trained on Google News: 
# https://drive.google.com/open?id=1NH6jcrg8SXbnhpIXRIXF_-KUE7wGxGaG

In [14]:
import random
import numpy as np
from matplotlib import pyplot as plt

In [15]:
from debiaswe import debiaswe as dwe
from debiaswe.debiaswe import we
from debiaswe.debiaswe.we import WordEmbedding
from debiaswe.debiaswe.data import load_professions

### Loading the data

In [16]:
# Condensed Word2Vec trained on Google News
em = WordEmbedding('./w2v_gnews_small.txt')

*** Reading data from ./w2v_gnews_small.txt
(26443, 300)


AssertionError: 

### Identifying Bias

In [5]:
pairs = [('Christian', 'Muslim'), ('christian', 'muslim'), 
         ('Christ', 'Allah'), ('Jesus', 'Muhammad'), ('Jesus', 'Mohammed'), 
         ('Christmas', 'Eid'), 
         ('Abraham', 'Ibrahim'), ('Maryam', 'Mary'), 
         ('Bible', 'Quran'), ('church', 'mosque')]
for (c,m) in pairs:
    try:
        religion = em.diff(c,m)
    except KeyError as k:
        #TODO: Add vectors for these
        print("{} not in embedding".format(k))
religion = em.diff('Christian', 'Muslim')    
pair_pca = we.doPCA(pairs, em)
religion = pair_pca.components_[0]

'Christian' not in embedding
'christian' not in embedding
'Christ' not in embedding
'Jesus' not in embedding
'Jesus' not in embedding
'Christmas' not in embedding
'Abraham' not in embedding
'Maryam' not in embedding
'Bible' not in embedding


KeyError: 'Christian'

In [None]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=2)
# reduced = pca.fit_transform(np.vstack((gender_, gender)))

### Measuring Bias
We will now use occupations and analogies to measure gender bias

#### Professions

In [None]:
professions = load_professions()
profession_words = [p[0] for p in professions]

Let's sort these according to their projection scores (dot product) along the she-he and 10-pair directions.

In [None]:
sorted_he_she = sorted([(em.v(w).dot(gender_), w) for w in profession_words])
sorted_10pair = sorted([(em.v(w).dot(gender), w) for w in profession_words])

In [None]:
print("Top male: ")
for i in range(0,20):
    print("she-he: {} | 10-pair: {}".format(sorted_he_she[i], sorted_10pair[i]))
print("\nTop female: ")
for i in reversed(range(1, 20)):
    print("she-he: {} | 10-pair: {}".format(sorted_he_she[-i], sorted_10pair[-i]))

#### Analogies
she is to x as he is to y

In [None]:
a_gender_ = em.best_analogies_dist_thresh(gender_)
a_gender = em.best_analogies_dist_thresh(gender)
basic_a = {a:b for (a,b,c) in a_gender_}
pca_a = {a:b for (a,b,c) in a_gender}

In [None]:
print("Analogies: she -> he_1 | he_2")
print("word -> she-he analogy | PCA analogy\n")
for w in basic_a.keys():
    try:
        print("{} -> {} | {}".format(w, basic_a[w], pca_a[w]))
    except KeyError:
        pass