In this project, we assess the reliability and validity of ```WEAT``` (the Word Embedding Association Test) using ```word2vec``` embeddings. For convenience, we use the embeddings pretrained on Google News, known as ```word2vec-google-news-300```.

First, we run WEAT on the pretrained word2vec embeddings to measure gender bias:

In [1]:
!pip install wefe

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wefe
  Downloading wefe-0.4.1-py3-none-any.whl (7.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
Collecting semantic-version (from wefe)
  Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: semantic-version, wefe
Successfully installed semantic-version-2.10.0 wefe-0.4.1


In [2]:
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.query import Query
from wefe.metrics import WEAT
import gensim.downloader as api

import pandas as pd
import numpy as np

word2vec_model = WordEmbeddingModel(api.load('word2vec-google-news-300'),
                                    'word2vec-google-news-300')



In [3]:
# target lists: lists of 32 most common names in the US, as per 
# https://www.ssa.gov/oact/babynames/decades/century.html
male_names = [
    "James", "Robert", "John", "Michael", "David", "William", "Richard", 
    "Joseph", "Thomas", "Christopher", "Charles", "Daniel", "Matthew",
    "Anthony", "Mark", "Donald", "Steven", "Andrew", "Paul", "Joshua",
    "Kenneth", "Kevin", "Brian", "George", "Timothy", "Ronald", "Jason",
    "Edward", "Jeffrey", "Ryan", "Jacob", "Gary"
]

female_names = [
    "Mary", "Patricia", "Jennifer", "Linda", "Elizabeth", "Barbara", "Susan",
    "Jessica", "Sarah", "Karen", "Lisa", "Nancy", "Betty", "Sandra", "Margaret",
    "Ashley", "Kimberly", "Emily", "Donna", "Michelle", "Carol", "Amanda",
    "Melissa", "Deborah", "Stephanie", "Dorothy", "Rebecca", "Sharon", "Laura",
    "Cynthia", "Amy", "Kathleen"
]

# attribute sets
# NOTE: "Engineer" appears twice in career_names, so the list holds 31 distinct words
career_names = [
    "Engineer", "Doctor", "Teacher", "Lawyer", "Nurse", "Programmer", "Artist",
    "Scientist", "Writer", "Chef", "Athlete", "Architect", "Musician",
    "Officer", "Firefighter", "Pilot", "Psychologist", "Entrepreneur",
    "Veterinarian", "Dentist", "Actor", "Designer", "Photographer",
    "Journalist", "Engineer", "Professor", "Economist", "Researcher",
    "Accountant", "Electrician", "Mechanic", "Secretary"
]

family_names = [
    "Family", "Parents", "Children", "Siblings", "Mother", "Father", "Sister",
    "Brother", "Daughter", "Son", "Grandparents", "Grandmother", "Grandfather",
    "Granddaughter", "Grandson", "Aunt", "Uncle", "Cousin", "Niece", "Fiance",
    "Fiancee", "Spouse", "Husband", "Wife", "Stepmother", "Stepfather",
    "Stepsister", "Stepbrother", "Stepdaughter", "Stepson", "Godmother", 
    "Godfather"
]

print(len(male_names), len(female_names), len(career_names), len(family_names))

32 32 32 32


The ```effect_size``` quantifies the strength of the association between the target and attribute sets in the WEAT; a larger value indicates a stronger association. In this case, the effect size is 1.9518473221010546, which is relatively large. However, ```p_value``` is ```nan```, indicating that the p-value was not calculated or is undefined.
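
To make the two reported quantities concrete, here is a minimal sketch of the standard WEAT formulation computed from scratch. The function names and the 2-D toy vectors are illustrative assumptions, and WEFE's internal implementation may differ in details such as the standard-deviation convention. The test statistic is a sum over target words, so it tends to grow with the size of the target sets, whereas the effect size is normalized by the spread of the per-word associations.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_statistic_and_effect_size(X, Y, A, B):
    """Return (test statistic, effect size) for targets X, Y and attributes A, B."""
    s_X = np.array([association(x, A, B) for x in X])
    s_Y = np.array([association(y, A, B) for y in Y])
    statistic = s_X.sum() - s_Y.sum()
    # Assumption: sample standard deviation (ddof=1) over all target words.
    pooled_std = np.std(np.concatenate([s_X, s_Y]), ddof=1)
    effect_size = (s_X.mean() - s_Y.mean()) / pooled_std
    return statistic, effect_size

# Toy 2-D vectors (hypothetical, not taken from word2vec).
X = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]   # targets leaning toward A
Y = [np.array([0.1, 1.0]), np.array([0.2, 0.9])]   # targets leaning toward B
A = [np.array([1.0, 0.0])]
B = [np.array([0.0, 1.0])]

statistic, effect_size = weat_statistic_and_effect_size(X, Y, A, B)
print(statistic, effect_size)
```

With this convention the effect size cannot exceed 2 in absolute value, which gives a rough scale for reading the numbers reported in this notebook.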

In the subsequent steps, we examine both the reliability and the validity of the WEAT. To assess reliability, we repeatedly draw random sublists from the four lists above (```male_names```, ```female_names```, ```career_names```, ```family_names```) and run the WEAT on each draw to determine whether the results are consistent across sublists. To assess validity, we compare the WEAT outcome with two alternative bias measurements, ```Word Analogy Testing``` and ```Word Similarity Comparison```, and check whether these measurements agree. We also include a downstream task, the ```semantic textual similarity task```, as further evidence for the validity assessment.

### RELIABILITY

In [12]:
weat = WEAT()

gender_occupation_query = Query([male_names, female_names],
                                [career_names, family_names],
                                ['Male names', 'Female names'],
                                ['Career', 'Family'])

# Run the query once on the full lists and reuse the result for both scores
baseline_result = weat.run_query(gender_occupation_query, word2vec_model)
baseline = {'effect_size': baseline_result['effect_size'],
            'weat': baseline_result['weat']}
print(f"Baseline: \n effect_size: {baseline['effect_size']}, WEAT: {baseline['weat']}")

scores = {}

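# For every sublist size from 4 to 24, draw 10000 random sublists
# (without replacement) from each list and record both scores
for i in range(4, 25):
    scores[i] = {'effect_size':[], 'weat': []}
    for _ in range(10000):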
        male_name_sublist = np.random.choice(male_names, size=i, replace=False)
        female_name_sublist = np.random.choice(female_names, size=i, replace=False)
        career_sublist = np.random.choice(career_names, size=i, replace=False)
        family_sublist = np.random.choice(family_names, size=i, replace=False)

        gender_occupation_query = Query([male_name_sublist, female_name_sublist],
                                    [career_sublist, family_sublist],
                                    ['Male names', 'Female names'],
                                    ['Career', 'Family'])
        
        result = weat.run_query(gender_occupation_query, word2vec_model, lost_vocabulary_threshold=0.25)
        scores[i]['effect_size'].append(result['effect_size'])
        scores[i]['weat'].append(result['weat'])

    print(f"Results for random sample of size {i}\n average effect_size: {np.mean(scores[i]['effect_size'])}, average WEAT: {np.mean(scores[i]['weat'])}")

Baseline: 
 effect_size: 1.2198651231838853, WEAT: 1.2642560498432196
Results for random sample of size 4
 average effect_size: 0.9269837420264228, average WEAT: 0.1678243221731313
Results for random sample of size 5
 average effect_size: 0.9802894599706243, average WEAT: 0.20864114190928176
Results for random sample of size 6
 average effect_size: 1.012864899947951, average WEAT: 0.24928547117730135
Results for random sample of size 7
 average effect_size: 1.0441179468801016, average WEAT: 0.2895405046687927
Results for random sample of size 8
 average effect_size: 1.0790689899797556, average WEAT: 0.331844707325648
Results for random sample of size 9
 average effect_size: 1.0877671865686183, average WEAT: 0.3690776558700199
Results for random sample of size 10
 average effect_size: 1.1125678883620385, average WEAT: 0.4120753589198836
Results for random sample of size 11
 average effect_size: 1.1183082669229727, average WEAT: 0.4474134034733047
Results for random sample of size 12
 av

In [35]:
avg_effect_sizes = [np.mean(scores[i]['effect_size']) for i in range(4, 25)]
avg_weats = [np.mean(scores[i]['weat']) for i in range(4, 25)]

df = pd.DataFrame({'sample size': ['baseline'] + list(range(4, 25)),
                   'average effect size': [baseline['effect_size']] + avg_effect_sizes,
                   'average WEAT': [baseline['weat']] + avg_weats})

df.style.set_caption('Averages after 10000 iterations').hide(axis='index')

sample size,average effect size,average WEAT
baseline,1.219865,1.264256
4,0.926984,0.167824
5,0.980289,0.208641
6,1.012865,0.249285
7,1.044118,0.289541
8,1.079069,0.331845
9,1.087767,0.369078
10,1.112568,0.412075
11,1.118308,0.447413
12,1.138669,0.492353
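
Averages alone do not show how much individual draws vary, which is what reliability is about. The following sketch summarizes the spread per sample size. It assumes a dictionary shaped like `scores` in the cell above; `scores_demo` is filled with synthetic placeholder values, not with the notebook's results, and `summarize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data with the same shape as `scores` in the reliability cell:
# {sample_size: {'effect_size': [...]}}. The values are synthetic.
scores_demo = {
    n: {'effect_size': rng.normal(1.0, 0.5 / np.sqrt(n), size=1000).tolist()}
    for n in (4, 8, 12)
}

def summarize(values):
    """Mean, standard deviation, and central 95% range of resampled scores."""
    arr = np.asarray(values, dtype=float)
    low, high = np.percentile(arr, [2.5, 97.5])
    return {'mean': arr.mean(), 'std': arr.std(ddof=1), 'p2.5': low, 'p97.5': high}

summary = {n: summarize(s['effect_size']) for n, s in scores_demo.items()}
for n, stats in summary.items():
    print(n, {k: round(float(v), 3) for k, v in stats.items()})
```

Replacing `scores_demo` with the real `scores` dictionary would show whether the spread of the effect size narrows as the sublists grow.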


In [None]:
# df.to_csv('weat_reliability.csv', index=False)

### VALIDITY

#### 1. Word Analogy Testing 

In [None]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

## load pretrained word2vec model
## NOTE: despite its name, `model_glove_twitter` holds the word2vec Google News vectors, not GloVe
model_glove_twitter = api.load('word2vec-google-news-300')

In [None]:
# Define word analogies
analogies = [
    ('man', 'woman', 'king'),
    ('father', 'mother', 'son'),
    ('brother', 'sister', 'uncle'),
    ('boy', 'girl', 'prince'),
    ('he', 'she', 'king'),
    ('he', 'she', 'pilot'),
    ('man', 'woman', 'director'),
    ('father', 'mother', 'leader'),
    ('husband', 'wife', 'king')
]
# Perform word analogy testing
for analogy in analogies:
    a, b, c = analogy
    predicted_word = model_glove_twitter.most_similar(positive=[b, c], negative=[a])[0][0]

    print(f"{a}->{b}  || {c}->{predicted_word}")

man->woman  || king->queen
father->mother  || son->daughter
brother->sister  || uncle->aunt
boy->girl  || prince->princess
he->she  || king->queen
he->she  || pilot->flight_attendant
man->woman  || director->chairwoman
father->mother  || leader->Leader
husband->wife  || king->kings


In [None]:
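# Rating derived from the nine analogy results above: entry k holds k if the
# k-th analogy was judged to show a gender association and 0 otherwise (the
# last two, leader->Leader and king->kings, do not). rater2 below uses the
# same coding.
rater1 = [1,2,3,4,5,6,7,0,0]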

#### 2. Word Similarity Comparison

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import ttest_ind

In [None]:
# Define gendered word pairs
gender_pairs = [
    ('man', 'woman'),
    ('father', 'mother'),
    ('son', 'daughter'),
    ('brother', 'sister'),
    ('uncle', 'aunt'),
    ('nephew', 'niece'),
    ('king', 'queen'),
    ('prince', 'princess'),
    ('emperor', 'empress'),
    ('god', 'goddess'),
    ('male', 'female'),
    ('boy', 'girl'),
    ('groom', 'bride'),
    ('husband', 'wife'),
    ('grandfather', 'grandmother'),
    ('grandson', 'granddaughter'),
    ('widower', 'widow'),
    ('master', 'mistress'),
    ('host', 'hostess'),
    ('actor', 'actress'),
    ('waiter', 'waitress'),
    ('steward', 'stewardess'),
    ('nephew', 'niece'),
    ('wizard', 'witch'),
    ('hero', 'heroine'),
    ('bachelor', 'spinster'),
    ('lad', 'lass'),
    ('monk', 'nun'),
    ('policeman', 'policewoman'),
    ('fireman', 'firewoman'),
    ('salesman', 'saleswoman'),
    ('mailman', 'mailwoman'),
    ('businessman', 'businesswoman'),
    ('chairman', 'chairwoman'),
    ('priest', 'priestess'),
    ('sir', 'madam'),
    ('lord', 'lady'),
    ('gentleman', 'lady'),
    ('sultan', 'sultana'),
    ('bull', 'cow'),
    ('ram', 'ewe'),
    ('boar', 'sow'),
    ('cock', 'hen'),
    ('drake', 'duck'),
    ('stallion', 'mare'),
    ('gander', 'goose'),
    ('rooster', 'hen'),
    ('lion', 'lioness'),
    ('tiger', 'tigress'),
    ('leopard', 'leopardess'),
    ('goat', 'doe'),
    ('buck', 'doe'),
    ('billy', 'nanny'),
    ('stag', 'hind'),
    ('wizard', 'sorceress'),
    ('sorcerer', 'witch'),
    ('drummer', 'drummeress'),
    ('barman', 'barmaid'),
    ('farmhand', 'farmgirl'),
    ('businessman', 'businesswoman'),
    ('gigolo', 'prostitute'),
    ('shepherd', 'shepherdess'),
    ('governor', 'governess'),
    ('prince', 'princess'),
    ('waiter', 'waitress'),
    ('captain', 'captainess'),
    ('lord', 'lady'),
    ('male', 'female'),
    ('man', 'woman'),
    ('boy', 'girl'),
    ('gentleman', 'lady'),
    ('sir', 'madam'),
    ('king', 'queen'),
    ('god', 'goddess'),
    ('father', 'mother'),
    ('son', 'daughter'),
    ('brother', 'sister'),
    ('uncle', 'aunt')]

In [None]:
# Define ungendered word pairs
ungendered_pairs = [
    ('child', 'adult'),
    ('person', 'individual'),
    ('human', 'being'),
    ('student', 'learner'),
    ('employee', 'worker'),
    ('friend', 'companion'),
    ('citizen', 'resident'),
    ('actor', 'performer'),
    ('artist', 'creator'),
    ('writer', 'author'),
    ('musician', 'performer'),
    ('chef', 'cook'),
    ('doctor', 'physician'),
    ('engineer', 'developer'),
    ('scientist', 'researcher'),
    ('teacher', 'educator'),
    ('lawyer', 'attorney'),
    ('manager', 'supervisor'),
    ('leader', 'director'),
    ('customer', 'client'),
    ('patient', 'recipient'),
    ('guest', 'visitor'),
    ('driver', 'operator'),
    ('athlete', 'player'),
    ('participant', 'member'),
    ('speaker', 'presenter'),
    ('listener', 'receiver'),
    ('reader', 'consumer'),
    ('viewer', 'audience'),
    ('buyer', 'shopper'),
    ('seller', 'vendor'),
    ('traveler', 'explorer'),
    ('volunteer', 'helper'),
    ('worker', 'laborer'),
    ('resident', 'inhabitant'),
    ('consumer', 'user'),
    ('passenger', 'rider'),
    ('patient', 'client'),
    ('passerby', 'onlooker'),
    ('user', 'customer'),
    ('viewer', 'spectator'),
    ('student', 'pupil'),
    ('manager', 'administrator'),
    ('participant', 'attendee'),
    ('recipient', 'beneficiary'),
    ('reader', 'viewer'),
    ('writer', 'editor'),
    ('artist', 'performer'),
    ('player', 'competitor'),
    ('colleague', 'coworker'),
    ('neighbor', 'resident'),
    ('traveler', 'tourist'),
    ('customer', 'consumer'),
    ('employee', 'staff'),
    ('speaker', 'lecturer'),
    ('listener', 'observer'),
    ('learner', 'student'),
    ('worker', 'employee'),
    ('author', 'writer'),
    ('creator', 'designer'),
    ('chef', 'sous chef'),
    ('doctor', 'surgeon'),
    ('developer', 'programmer'),
    ('researcher', 'scholar'),
    ('teacher', 'professor'),
    ('director', 'executive'),
    ('client', 'customer'),
    ('recipient', 'holder'),
    ('visitor', 'guest'),
    ('operator', 'technician'),
    ('player', 'athlete'),
    ('member', 'participant'),
    ('presenter', 'host'),
    ('receiver', 'listener'),
    ('consumer', 'shopper'),
    ('vendor', 'merchant'),
    ('explorer', 'adventurer'),
    ('helper', 'assistant')
]

In [None]:
len(ungendered_pairs)

78

In [None]:
def detect_bias(model, gendered_pairs, ungendered_pairs, idx):
    # Calculate cosine similarity for gendered word pairs
    gendered_similarities = [cosine_similarity([model[word1]], [model[word2]])[0][0] for word1, word2 in gendered_pairs]

    # Calculate cosine similarity for non-gendered word pairs
    non_gendered_similarities = [cosine_similarity([model[word1]], [model[word2]])[0][0] for word1, word2 in ungendered_pairs]

    # Perform t-test to compare the similarity scores
    t_statistic, p_value = ttest_ind(gendered_similarities, non_gendered_similarities)

    # Compare the p-value to determine significance
    if p_value < 0.05:
        print("Significant difference detected. Gender bias may be present.")
        return idx
    else:
        print("No significant difference detected. Gender bias may not be present.")
        return 0

In [None]:
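interval = 8
rater2 = []
# NOTE: the slice [i:i+interval] advances by one pair per iteration, so the
# nine subsets overlap heavily and only the first 16 pairs of each list are
# used. Disjoint subsets would use [i*interval:(i+1)*interval] instead.
for i in range(len(gender_pairs) // interval):
    p1 = gender_pairs[i:i+interval]
    p2 = ungendered_pairs[i:i+interval]
    rater2.append(detect_bias(model_glove_twitter, p1, p2, i+1))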

Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.
Significant difference detected. Gender bias may be present.


In [None]:
rater2

[1, 2, 3, 4, 5, 6, 7, 8, 9]
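
The loop that builds `rater2` slices `gender_pairs[i:i+interval]`, so consecutive subsets share seven of their eight pairs. If independent subsets are wanted, the lists can be split into consecutive, non-overlapping chunks. The sketch below shows one way to do this; `disjoint_chunks` and `demo_pairs` are hypothetical names, and the results above were not produced with it.

```python
def disjoint_chunks(items, size):
    """Split `items` into consecutive, non-overlapping chunks of length `size`.

    A trailing remainder shorter than `size` is dropped.
    """
    n_chunks = len(items) // size
    return [items[k * size:(k + 1) * size] for k in range(n_chunks)]

# Placeholder data standing in for a list of word pairs.
demo_pairs = [(f"m{k}", f"f{k}") for k in range(20)]
chunks = disjoint_chunks(demo_pairs, 8)
print(len(chunks), chunks[1][0])  # prints: 2 ('m8', 'f8')
```

Applied to the 78 pairs in each list with `size=8`, this would yield nine chunks covering the first 72 pairs.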

In [None]:
from sklearn.metrics import cohen_kappa_score

In [None]:
## kappa score ranges from -1 to 1
## 1 means complete agreement
## 0 means agreement no better than chance
## negative values mean agreement worse than chance
kappa = cohen_kappa_score(rater1, rater2)

In [None]:
print(f"Kappa score is {kappa}")

Kappa score is 0.7567567567567568
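
The reported kappa can be reproduced by hand, which also makes clear how the labels are treated. The sketch below is a plain NumPy version of the unweighted Cohen's kappa; `cohen_kappa` is an illustrative helper, not scikit-learn's implementation. Every distinct value in the two rating lists counts as its own nominal category, so the raters agree on 7 of 9 items (observed agreement 7/9) and the chance agreement is 7/81.

```python
import numpy as np

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    labels = np.union1d(r1, r2)
    observed = np.mean(r1 == r2)  # p_o: share of items with identical labels
    # p_e: probability of agreeing by chance, summed over all labels
    expected = sum(np.mean(r1 == l) * np.mean(r2 == l) for l in labels)
    return (observed - expected) / (1 - expected)

rater1 = [1, 2, 3, 4, 5, 6, 7, 0, 0]
rater2 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kappa_manual = cohen_kappa(rater1, rater2)
print(kappa_manual)  # 56/74, approximately 0.7568
```

Because each positive rating carries a unique label, the chance agreement is small; a binary coding (bias detected or not) would give a different kappa.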


#### 3. Semantic textual similarity

In this section, we validate the gender bias of word2vec using a downstream task, the semantic textual similarity task (as in "Gender Bias in Downstream Task" in the "2023_NLP2_Bias_Measures_Student_version" notebook).

In [27]:
sts_df = pd.read_csv('sts-b.tsv', delimiter='\t')

# Obtains STS-B sentence pairs
pairs = []
for i in (range(sts_df.shape[0])):
    row = dict(sts_df.iloc[i])

    pair = {}
    pair['sentence1'] = row['pair1-2']
    pair['sentence2'] = row['pair1-1']
    pair['sentence3'] = row['pair2-1']
    pairs.append(pair)

In [28]:
## Create dataset
import json

data = []

# Construct JSON data
for pair in pairs:
    data.append({
        "sentence1": pair["sentence1"],
        "sentence2": pair["sentence2"],
        "sentence3": pair["sentence3"]
    })

# Write JSON data to a file
with open("sentence_pairs.json", "w") as outfile:
    json.dump(data, outfile, indent=4)

In [29]:
from gensim import matutils
from sklearn.metrics.pairwise import cosine_similarity

In [30]:
## load predefined json file
with open("sentence_pairs.json", "r") as file:
    data = json.load(file)

In [31]:
def compute_sim_diff(sent1, sent2, sent3):
    # Preprocess sentences
    preprocessed_sentence1 = sent1.lower().split()
    preprocessed_sentence2 = sent2.lower().split()
    preprocessed_sentence3 = sent3.lower().split()

    # Compute sentence vectors
    sentence_vector1 = np.mean([word2vec_model[word] for word in preprocessed_sentence1 if word in word2vec_model], axis=0)
    sentence_vector2 = np.mean([word2vec_model[word] for word in preprocessed_sentence2 if word in word2vec_model], axis=0)
    sentence_vector3 = np.mean([word2vec_model[word] for word in preprocessed_sentence3 if word in word2vec_model], axis=0)

    # Guard against NaN, which np.mean yields when no word of a sentence is in the vocabulary
    sentence_vector1 = np.nan_to_num(sentence_vector1, nan=0.0)
    sentence_vector2 = np.nan_to_num(sentence_vector2, nan=0.0)
    sentence_vector3 = np.nan_to_num(sentence_vector3, nan=0.0)
    
    # Calculate cosine similarity
    similarity_score1 = cosine_similarity([sentence_vector1], [sentence_vector2])[0][0]
    similarity_score2 = cosine_similarity([sentence_vector1], [sentence_vector3])[0][0]
    
#     print("Similarity score:", similarity_score1-similarity_score2)
    return similarity_score1-similarity_score2
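
The function above represents each sentence as the mean of its word vectors. The following self-contained sketch shows the same mean-pooling idea with an explicit fallback for sentences that contain no known word. The toy embeddings, the `dim` parameter, and both helper functions are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy embeddings; a real model would supply 300-D vectors.
toy_embeddings = {
    'he':    np.array([1.0, 0.0]),
    'she':   np.array([0.0, 1.0]),
    'sings': np.array([0.5, 0.5]),
}

def sentence_vector(sentence, embeddings, dim=2):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    tokens = sentence.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(u, v):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

v1 = sentence_vector("He sings loudly", toy_embeddings)   # 'loudly' is skipped
v2 = sentence_vector("zzz qqq", toy_embeddings)           # nothing in vocabulary
print(v1, cosine(v1, v2))  # prints: [0.75 0.25] 0.0
```

Checking for an empty token list before averaging keeps the vector shape fixed, so the similarity computation never receives a scalar in place of a vector.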


In [32]:
diff = [compute_sim_diff(sample['sentence1'], sample['sentence2'], sample['sentence3']) for sample in data] 
ratio = len([i for i in diff if i>0])*100/len(diff)
print(f"{ratio:.1f}% of all samples demonstrate male orientation")

69.9% of all samples demonstrate male orientation
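
A single percentage is easier to interpret with an uncertainty range. The sketch below computes an approximate 95% Wilson score interval for a proportion. The counts in the example are hypothetical, since the number of sentence triples is not shown above; they would be replaced by the number of positive entries in `diff` and by `len(diff)`.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts for illustration only; substitute
# sum(d > 0 for d in diff) and len(diff) from the cell above.
low, high = wilson_interval(successes=70, n=100)
print(f"{low:.3f} .. {high:.3f}")
```

If the whole interval lies above 0.5, the share of samples leaning toward the male sentence is unlikely to be due to sampling noise alone.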
