<a href="https://colab.research.google.com/github/yasmine-houri/Word_embeddings/blob/main/Code_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center><strong><font color="navy">Introduction to gender bias detection using word embeddings</font></strong></center></h1>


<center>
Yasmine Houri  

École nationale de la statistique et de l'administration économique, France

This is an introduction to word embeddings, as preliminary work for my PhD proposal on the following subject:  

**Gendered representations of power in France: a digital analysis of traces of occupational stratification in online texts**. 

This research is supervised by **Jean-Philippe Cointet** (data scientist at médialab, Sciences Po, Paris, France), and **Achim Edelmann** (sociologist at médialab, Sciences Po, Paris, France).  
Feel free to email me at yasmine.houri@ensae.fr if you have any questions.  

This notebook is freely adapted from https://www.tensorflow.org/text/guide/word_embeddings and https://notebook.community/henchc/Rediscovering-Text-as-Data/11-Word-Embeddings/01-Word-Embeddings. All due credit goes to the authors of these two pages. I will be using data from: https://txtlab.org/2016/01/txtlab450-a-data-set-of-multilingual-novels-for-teaching-and-research/ .

# Conceptual introduction to basic word embeddings

## Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

### One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you create a zero vector whose length equals the vocabulary size, then place a one at the index that corresponds to the word. This approach is shown in the following diagram.

<img src="https://github.com/tensorflow/text/blob/master/docs/guide/images/one-hot.png?raw=1" alt="Diagram of one-hot encodings" width="400" />

To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.

Key point: This approach is inefficient. A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a vector where 99.99% of the elements are zero.
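
As a minimal sketch of the idea, the sentence from the example can be one-hot encoded with NumPy. The names `word_to_index` and `one_hot` are illustrative only, and the one-hot vectors are stacked as rows of a matrix rather than concatenated into one long vector.

```python
import numpy as np

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Sorted unique words: ['cat', 'mat', 'on', 'sat', 'the']
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single one at the word's index."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

# One row per token in the sentence, one column per vocabulary word
encoded_sentence = np.stack([one_hot(token) for token in tokens])
print(encoded_sentence)
```

Each row contains exactly one non-zero entry, which is why the representation becomes wasteful as the vocabulary grows.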

### Encode each word with a unique number

A second approach you might try is to encode each word using a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. This approach is efficient. Instead of a sparse vector, you now have a dense one (where all elements are full).

There are two downsides to this approach, however:

* The integer-encoding is arbitrary (it does not capture any relationship between words).

* An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
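
The integer encoding described above can be sketched in a few lines. The numbering scheme (alphabetical order, starting at 1) simply follows the example in the text; any other assignment would be just as valid, which is exactly the arbitrariness noted in the first bullet.

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Number the sorted unique words from 1, as in the text: cat=1, mat=2, ...
vocabulary = sorted(set(tokens))
word_to_id = {word: i for i, word in enumerate(vocabulary, start=1)}

encoded_sentence = [word_to_id[token] for token in tokens]
print(encoded_sentence)  # [5, 1, 4, 3, 5, 2]
```

Here "cat" (1) and "mat" (2) receive neighbouring numbers only because they are adjacent in alphabetical order, not because they are related in meaning.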

### Word embeddings

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). The values of the embedding are not set manually; they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). Word embeddings commonly range from 8 dimensions (for small datasets) up to 1024 dimensions (for large datasets). A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

<img src="https://github.com/tensorflow/text/blob/master/docs/guide/images/embedding2.png?raw=1" alt="Diagram of an embedding" width="400"/>

Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.
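
The "lookup table" view can be illustrated with a small NumPy array. This is only a sketch: the table below is filled with random values standing in for learned weights, and the names `embedding_table` and `embed` are hypothetical.

```python
import numpy as np

vocabulary = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Hypothetical embedding table: one 4-dimensional row per word.
# Random values stand in for weights that a model would learn during training.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocabulary), 4))

def embed(word):
    """Look up the dense vector for a word."""
    return embedding_table[word_to_index[word]]

sentence_vectors = np.stack([embed(w) for w in "the cat sat on the mat".split()])
print(sentence_vectors.shape)  # (6, 4)
```

Both occurrences of "the" map to the same row of the table, so a word always receives the same vector regardless of its position in the sentence.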

However, one-hot encoding and integer encoding both fail to encode relationships between words. In this analysis, I will use Word2Vec, one of the most widely used word embedding methods (read Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." (2013) for more information).

# <center> Set up the work environment

## Setup

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
from datascience import *
import numpy as np
from scipy.spatial.distance import cosine
import gensim
import nltk
from string import punctuation
import pandas as pd

### Load the Dataset

In [None]:
url = "https://raw.githubusercontent.com/yasmine-houri/Word_embeddings/main/"

metadata_tb = pd.read_csv(url+'Data/2_txtlab_Novel150.csv').drop('Unnamed: 0', axis=1)

fiction_path = url+'Data/'

novel_list = []

import requests
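# Iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # Download the novel text as a single string (lowercasing happens later, at tokenization)
    novel_url = fiction_path + filename  # separate name, so the base `url` is not overwritten
    novel = requests.get(novel_url)
    novel.raise_for_status()  # stop early if a file could not be downloaded
    novel.encoding = 'utf-8-sig'
    
    # Add novel text as single string to master list
    novel_list.append(novel.text)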


## Text preprocessing

In [None]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

def fast_tokenize(text):
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in text if char not in punctuation])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
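
To see what `fast_tokenize` does to French text, here is a small self-contained check on a made-up sentence (the sentence is an illustration, not taken from the corpus). It suggests two things to keep in mind: `string.punctuation` only covers ASCII characters, so guillemets are kept as tokens, and removing the apostrophe merges an elided article with the following word.

```python
from string import punctuation

def fast_tokenize(text):
    # Same logic as the function above: drop ASCII punctuation, split on whitespace
    no_punct = "".join([char for char in text if char not in punctuation])
    return no_punct.split()

example = "L'homme dit : « Bonjour ! »"
tokens = fast_tokenize(example.lower())
print(tokens)  # ['lhomme', 'dit', '«', 'bonjour', '»']
```

Depending on the edition of each novel, forms such as "l'homme" may therefore be counted separately from "homme".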


In [None]:
# import random
# randomlist = []
# for i in range(0,50):
#   n = random.randint(0,len(novel_list)-1)
#   randomlist.append(n)
# randomlist= tuple(randomlist)

# short_list = [novel_list[i] for i in randomlist]  

In [None]:
sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]

In [None]:
words_by_sentence = [fast_tokenize(sentence.lower()) for sentence in sentences]

In [None]:
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]

# <center> Detection of gender bias in the entire period

## Train the model

Word2Vec is one of the most prominent word embedding algorithms. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that the meaning of each word in a novel is determined by the words that surround it within a limited window. For example, in the first sentence of Moby Dick, “me” is flanked by “Call” and “Ishmael.” After observing the windows around every word in the novel (or in many novels), the computer will notice that “me” appears between pairs of words similar to those that surround “her,” “him,” or “them.” Of course, the computer goes through the same process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of the context. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call","Ishmael"). CBOW does the opposite: it takes the context words ("Call","Ishmael") as a single input and tries to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
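
The difference between the two flavors can be made concrete by listing the training examples each would see for "Call me Ishmael" with a window of one word. This is a sketch of the data preparation only, using a hypothetical helper `context_windows`; it does not train anything, and the library's own implementation includes further steps that are not shown here.

```python
def context_windows(tokens, window=1):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ["call", "me", "ishmael"]

# Skip-Gram: one (input word -> context word) pair per neighbour
skipgram_pairs = [(target, context_word)
                  for target, context in context_windows(tokens)
                  for context_word in context]

# CBOW: the whole context is a single input that predicts the target word
cbow_pairs = [(tuple(context), target)
              for target, context in context_windows(tokens)]

print(skipgram_pairs)
print(cbow_pairs)
```

For the word "me", Skip-Gram produces two examples (`me -> call` and `me -> ishmael`), whereas CBOW produces a single example whose input is the pair (`call`, `ishmael`).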

Word2Vec Features  

* `size`: Number of dimensions of the word vectors
* `window`: Maximum number of context words to observe in each direction
* `min_count`: Minimum frequency for a word to be included in the model
* `sg` (Skip-Gram): `0` indicates the CBOW model; `1` indicates Skip-Gram
* `alpha`: Learning rate (initial); prevents the model from over-correcting, enables finer tuning
* `iter`: Number of passes (epochs) through the dataset
* `batch_words`: Target number of words per batch of examples passed to the worker threads  

Note: the cell below uses the default value for each argument except `iter`, whose default is 5. These are the argument names of gensim 3.x; in gensim 4 and later, `size` and `iter` are called `vector_size` and `epochs`.

In [None]:
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=15, batch_words=10000)

## Embeddings

In [None]:
print("Diplomate (homme/femme): ", model.wv.similarity('homme','diplomate')/model.wv.similarity('femme','diplomate'))
print("Cadre (homme/femme): ", model.wv.similarity('homme','cadre')/model.wv.similarity('femme','cadre'))
print("Artiste (homme/femme): ", model.wv.similarity('homme','artiste')/model.wv.similarity('femme','artiste'))
print("Officier (homme/femme): ", model.wv.similarity('homme','officier')/model.wv.similarity('femme','officier'))
print("Cuisinier (homme/femme): ", model.wv.similarity('homme','cuisinier')/model.wv.similarity('femme','cuisinière'))
print("Servant (homme/femme): ", model.wv.similarity('homme','servant')/model.wv.similarity('femme','servante'))

Diplomate (homme/femme):  3.39055
Cadre (homme/femme):  -1.27337
Artiste (homme/femme):  2.01439
Officier (homme/femme):  3.80869
Cuisinier (homme/femme):  0.583223
Servant (homme/femme):  -0.107831
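
The ratios above divide one cosine similarity by another. Since a cosine similarity lies between -1 and 1, the ratio changes sign when one of the two similarities is negative and becomes very large when the denominator is close to zero. The sketch below recomputes cosine similarity by hand on hypothetical 3-dimensional vectors (chosen for illustration, not taken from the model) and shows the difference of similarities next to the ratio, as one possible alternative rather than the measure used in this notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only
homme = np.array([1.0, 0.2, 0.0])
femme = np.array([0.2, 1.0, 0.0])
occupation = np.array([0.9, 0.05, 0.4])

sim_homme = cosine_similarity(homme, occupation)
sim_femme = cosine_similarity(femme, occupation)

ratio = sim_homme / sim_femme        # unstable when sim_femme is near zero or negative
difference = sim_homme - sim_femme   # stays in [-2, 2] whatever the signs

print(round(sim_homme, 3), round(sim_femme, 3))
print(round(ratio, 3), round(difference, 3))
```

With real embeddings, the two vectors passed to `cosine_similarity` would be the rows looked up for each word in the trained model.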




## Visualisations

In [None]:
model.wv.most_similar(positive=['homme'])



[('garçon', 0.7630212306976318),
 ('ouvrier', 0.7295625805854797),
 ('abbé', 0.698989987373352),
 ('officier', 0.6971719264984131),
 ('animal', 0.6933426260948181),
 ('gentilhomme', 0.6898995041847229),
 ('intrigant', 0.68001389503479),
 ('enfant', 0.6695951223373413),
 ('alsacien', 0.6691569089889526),
 ('écrivain', 0.6669896245002747)]

In [None]:
model.wv.most_similar('femme')



[('fille', 0.9250227212905884),
 ('demoiselle', 0.8070782423019409),
 ('maîtresse', 0.776763379573822),
 ('comédienne', 0.7609223127365112),
 ('mère', 0.7494492530822754),
 ('rivale', 0.7487567067146301),
 ('cousine', 0.748172402381897),
 ('actrice', 0.7472723722457886),
 ('paysanne', 0.7403960824012756),
 ('soeur', 0.7387897372245789)]

In [None]:
her_tokens = [token for token,weight in model.wv.most_similar(['femme','homme'], topn=30)]
vectors = [model.wv[word] for word in her_tokens]



In [None]:
from sklearn.metrics import pairwise
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
from sklearn.manifold import MDS
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [None]:
# %matplotlib inline
_, ax = plt.subplots(figsize=(8,8))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], ((embeddings[i,0], embeddings[i,1])))
# plt.plot([-0.4, 0.6], [-0.4, 0.25], 'k-')

# <center> Detection of gender bias evolution over time

## Split data into time sections

In [None]:
metadata_tb_17 = metadata_tb[(metadata_tb.date >= 1771) & (metadata_tb.date <= 1800)]
metadata_tb_1850 = metadata_tb[(metadata_tb.date >= 1801) & (metadata_tb.date <= 1850)]
metadata_tb_19 = metadata_tb[(metadata_tb.date >= 1851) & (metadata_tb.date <= 1900)]
metadata_tb_1929 = metadata_tb[(metadata_tb.date >= 1901) & (metadata_tb.date <= 1929)]
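
The filters above select rows of the metadata table, while the texts themselves are stored in `novel_list`, in the same order as the metadata rows. As a sketch of how texts and dates can be paired without relying on the table being sorted by date, the hypothetical helper `novels_in_period` below filters both together. The toy lists stand in for `novel_list` and `metadata_tb.date`.

```python
def novels_in_period(novels, dates, start, end):
    """Return the novels whose publication date falls in [start, end]."""
    return [novel for novel, date in zip(novels, dates) if start <= date <= end]

# Hypothetical toy data standing in for novel_list and metadata_tb.date
toy_novels = ["novel A", "novel B", "novel C", "novel D"]
toy_dates = [1905, 1780, 1860, 1830]

print(novels_in_period(toy_novels, toy_dates, 1801, 1850))  # ['novel D']
print(novels_in_period(toy_novels, toy_dates, 1851, 1900))  # ['novel C']
```

Under the same assumption about ordering, a call such as `novels_in_period(novel_list, metadata_tb.date, 1771, 1800)` would return the texts of the first period.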

## Process the data over time sections

### 1771-1800

In [None]:
novel_list_17 = novel_list[:len(metadata_tb_17)]

In [None]:
sentences_17 = [sentence for novel in novel_list_17 for sentence in sent_tokenize(novel)]

In [None]:
words_by_sentence_17 = [fast_tokenize(sentence.lower()) for sentence in sentences_17]

In [None]:
words_by_sentence_17 = [sentence for sentence in words_by_sentence_17 if sentence != []]

### 1801-1850

In [None]:
novel_list_1850 = novel_list[len(metadata_tb_17):len(metadata_tb_17)+len(metadata_tb_1850)]

In [None]:
sentences_1850 = [sentence for novel in novel_list_1850 for sentence in sent_tokenize(novel)]

In [None]:
words_by_sentence_1850 = [fast_tokenize(sentence.lower()) for sentence in sentences_1850]

In [None]:
words_by_sentence_1850 = [sentence for sentence in words_by_sentence_1850 if sentence != []]

### 1851-1900

In [None]:
offset_19 = len(metadata_tb_17) + len(metadata_tb_1850)  # number of novels in the two earlier periods
novel_list_19 = novel_list[offset_19:offset_19+len(metadata_tb_19)]

In [None]:
sentences_19 = [sentence for novel in novel_list_19 for sentence in sent_tokenize(novel)]

In [None]:
words_by_sentence_19 = [fast_tokenize(sentence.lower()) for sentence in sentences_19]

In [None]:
words_by_sentence_19 = [sentence for sentence in words_by_sentence_19 if sentence != []]

### 1901-1929

In [None]:
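offset_1929 = len(metadata_tb_17) + len(metadata_tb_1850) + len(metadata_tb_19)  # number of novels in the three earlier periods
novel_list_1929 = novel_list[offset_1929:offset_1929+len(metadata_tb_1929)]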

In [None]:
sentences_1929 = [sentence for novel in novel_list_1929 for sentence in sent_tokenize(novel)]

In [None]:
words_by_sentence_1929 = [fast_tokenize(sentence.lower()) for sentence in sentences_1929]

In [None]:
words_by_sentence_1929 = [sentence for sentence in words_by_sentence_1929 if sentence != []]

## Train a model for each time section

In [None]:
model17 = gensim.models.Word2Vec(words_by_sentence_17, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=15, batch_words=10000)

In [None]:
model1850 = gensim.models.Word2Vec(words_by_sentence_1850, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=15, batch_words=10000)

In [None]:
model19 = gensim.models.Word2Vec(words_by_sentence_19, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=15, batch_words=10000)

In [None]:
model1929 = gensim.models.Word2Vec(words_by_sentence_1929, size=100, window=5, \
                               min_count=2, sg=0, alpha=0.025, iter=15, batch_words=10000)

## Embeddings over time

In [None]:
print("usine, corpus général (femme/homme): ", model.wv.similarity('femme','usine')/model.wv.similarity('homme','usine'))
# print("usine au 18ème (femme/homme): ", model17.wv.similarity('femme','usine')/model17.wv.similarity('homme','usine'))
print("usine au début du 19ème (femme/homme): ", model1850.wv.similarity('femme','usine')/model1850.wv.similarity('homme','usine'))
print("usine à la fin du 19ème (femme/homme): ", model19.wv.similarity('femme','usine')/model19.wv.similarity('homme','usine'))
print("usine au début du 20ème (femme/homme): ", model1929.wv.similarity('femme','usine')/model1929.wv.similarity('homme','usine'))

usine, corpus général (femme/homme):  6.83841
usine au début du 19ème (femme/homme):  -0.184451
usine à la fin du 19ème (femme/homme):  -46.4917
usine au début du 20ème (femme/homme):  1.94608




In [None]:
print("diplomate, corpus général (femme/homme): ", model.wv.similarity('femme','diplomate')/model.wv.similarity('homme','diplomate'))
# print("diplomate au 18ème (femme/homme): ", model17.wv.similarity('femme','diplomate')/model17.wv.similarity('homme','diplomate'))
print("diplomate au début du 19ème (femme/homme): ", model1850.wv.similarity('femme','diplomate')/model1850.wv.similarity('homme','diplomate'))
print("diplomate à la fin du 19ème (femme/homme): ", model19.wv.similarity('femme','diplomate')/model19.wv.similarity('homme','diplomate'))
print("diplomate au début du 20ème (femme/homme): ", model1929.wv.similarity('femme','diplomate')/model1929.wv.similarity('homme','diplomate'))

diplomate, corpus général (femme/homme):  0.435732
diplomate au début du 19ème (femme/homme):  0.327662
diplomate à la fin du 19ème (femme/homme):  0.460108
diplomate au début du 20ème (femme/homme):  1.03136


