# Seed Set Pairing

This notebook contains the code that generates the `seed_set_pairings.csv` file.

## Set up

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

if os.path.isdir("../notebooks/"):
    os.chdir("../badseeds/")

In [3]:
import json
import random
import itertools

import pandas as pd
import numpy as np

from badseeds import seedbank

In [4]:
# path to config json file containing paths to datasets. change if necessary
CONFIG_PATH = "../config.json"

In [5]:
with open(CONFIG_PATH, "r") as f:
    config = json.load(f)
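
The cells in this notebook read only two entries from the config: `config["seeds"]["dir_path"]` and `config["pairs"]["dir_path"]`. As a minimal sketch, a config with that shape could be checked as follows. The directory values are placeholders, not the project's actual paths, and `check_config` is a hypothetical helper that is not part of `badseeds`.

```python
import json

# Placeholder config: only the key layout matches what this notebook reads.
example_config = json.loads("""
{
    "seeds": {"dir_path": "path/to/seeds/"},
    "pairs": {"dir_path": "path/to/pairs/"}
}
""")


def check_config(cfg):
    """Return the missing 'section.dir_path' entries (hypothetical helper)."""
    missing = []
    for section in ("seeds", "pairs"):
        if "dir_path" not in cfg.get(section, {}):
            missing.append(f"{section}.dir_path")
    return missing


missing_keys = check_config(example_config)
```

Because the notebook builds file paths by string concatenation (`dir_path + "seeds.json"`), each `dir_path` value needs a trailing path separator.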

In [6]:
# for replicability
np.random.seed(42)
random.seed(42)

## Code

In [7]:
# loading seeds df
seeds = seedbank.seedbanking(config["seeds"]["dir_path"] + "seeds.json", index=False)

In [8]:
gathered_seeds = seeds["Seeds"]

We now manually go through each index in the dataframe. We group the gathered seed sets based on whether they originate from the same paper/dataset and whether pairing them makes sense (human judgement).

In [9]:
seeds.iloc[158]

Category                                            female (CNN Daily Mail)
Seeds                     [actress, girl, girlfriend, girls, mother, mot...
Source / Justification                       curated for the target dataset
Source Categories                                                   curated
Used in Paper             Identifying and Reducing Gender Bias in Word-L...
Link                                                                   None
Seeds ID                  female_words_CNN_DailyMail-Bordia_and_Bowman_2019
Name: 158, dtype: object
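
Inspecting one index at a time is slow when the dataframe has at least 159 rows. The following is a sketch of a helper that shows a compact view of several consecutive rows at once. `preview_rows` and the toy dataframe are illustrative assumptions; only the column names `Category` and `Seeds ID` come from the output above.

```python
import pandas as pd

# Toy stand-in for the seeds dataframe; only the column names mirror the real one.
toy_seeds = pd.DataFrame(
    {
        "Category": ["pleasant", "unpleasant", "flowers", "insects"],
        "Seeds ID": ["pleasant-A", "unpleasant-A", "flowers-A", "insects-A"],
    }
)


def preview_rows(df, start, stop, columns=("Category", "Seeds ID")):
    """Return rows start..stop-1 (positional), restricted to a few columns."""
    return df.iloc[start:stop][list(columns)]


preview = preview_rows(toy_seeds, 0, 2)
```

In the notebook itself, something like `preview_rows(seeds, 150, 159)` would show the last few candidates side by side.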

In [10]:
# indices found manually by running the seeds.iloc[num] cell above for each num
group_indices = [
    [0, 1],
    [2, 3],
    [4, 5],
    [6, 7],
    [8, 9],
    [10, 11],
    [12, 13],
    [14, 15],
    [18, 19],
    [20, 21],
    [22, 23],
    # need to skip a few here
    [28, 29],
    [30, 31],
    [32, 33],
    [34, 35],
    # skip "equalize", "gender specific"
    [40, 41],
    [42, 43],
    # here we have black, white and asian triplets
    [45, 46, 47],
    [48, 49, 50],
    # and jew, muslim, christian triplets
    [52, 53, 54],
    # and jewish, muslim, christian triplets
    [55, 56, 57],
    # skip some tests, now back to pairs
    [60, 61],
    # skip some weird ones
    [66, 67],
    # black, asian, white, hispanic, russian, chinese names
    [68, 69, 70, 71, 72, 73],
    # back to pairs
    [78, 79],
    [80, 81],
    # skip a few unpaired ones
    [88, 89],
    [90, 91],
    [92, 93],
    [94, 95],
    [96, 97],
    [98, 99],
    [100, 101],
    [102, 103],
    [104, 105],
    [106, 107],
    [108, 109],
    # career vs violence doesn't seem like an appropriate pair
    [112, 113],
    # white collar job, blue collar job, domestic work, occupation quadruplet
    [115, 116, 117, 118],
    [119, 120],
    [121, 122],
    [123, 124],
    # male/female singular/plural
    [126, 127],
    [126, 128],
    [128, 129],
    [127, 129],
    # back to normal pairs
    [131, 132],
    [133, 134],
    [135, 136],
    [137, 138],
    # christianity, islam, atheism
    [139, 140, 141],
    [142, 143],
    [144, 145],
    [146, 147],
    [148, 149],
    [150, 151, 152],
    [153, 154],
    [155, 156],
    [157, 158],
]
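
Since the list above is typed by hand, a quick sanity check can catch slips such as a repeated or out-of-range index. The sketch below is not part of the original notebook; `validate_groups` is a hypothetical helper, and `n_rows=159` is only an example value (the notebook would pass `len(seeds)`).

```python
def validate_groups(groups, n_rows):
    """Return a list of human-readable problems found in the index groups."""
    problems = []
    for group in groups:
        if len(group) < 2:
            problems.append(f"{group}: needs at least two indices")
        if len(set(group)) != len(group):
            problems.append(f"{group}: repeated index")
        if any(i < 0 or i >= n_rows for i in group):
            problems.append(f"{group}: index out of range")
    return problems


# Small examples; in the notebook one would pass group_indices and len(seeds).
ok = validate_groups([[0, 1], [45, 46, 47]], n_rows=159)
bad = validate_groups([[3, 3], [7], [157, 200]], n_rows=159)
```

An empty result means none of these three problems was found; it does not say whether a pairing is sensible, which remains a human judgement.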

In [11]:
# For each group, generate every 2-element combination with itertools.
# The nested comprehension flattens the result into a single list of pairs.
pair_indices = [
    list(pair) for group in group_indices for pair in itertools.combinations(group, 2)
]
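
As a small worked example of this flattening step: a pair yields one combination, a triplet yields three, and a quadruplet yields six. The indices below are taken from `group_indices`; the variable names are illustrative.

```python
import itertools

example_groups = [[0, 1], [45, 46, 47]]

# A pair yields one combination; a triplet yields three.
example_pairs = [
    list(pair)
    for group in example_groups
    for pair in itertools.combinations(group, 2)
]
# example_pairs == [[0, 1], [45, 46], [45, 47], [46, 47]]

# A full quadruplet would yield six pairs, including the cross pairs
# (126, 129) and (127, 128) that the hand-written list does not contain.
quad_pairs = list(itertools.combinations([126, 127, 128, 129], 2))
```

This is a plausible reason why the male/female singular/plural sets are listed as four explicit pairs rather than as one group of four: listing them as a quadruplet would add the two cross pairings. That reading is an inference from the list, not something the notebook states.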

In [12]:
# convert these to their respective IDs in the seeds dataframe
pair_ids = [seeds.iloc[pair]["Seeds ID"].tolist() for pair in pair_indices]

In [13]:
# convert this to a dataframe of its own
pair_df = pd.DataFrame.from_records(pair_ids, columns=["ID_A", "ID_B"])
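
The two conversion steps above can be traced on a toy dataframe. This is a sketch with made-up IDs; only the column names `Seeds ID`, `ID_A` and `ID_B` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the seeds dataframe with made-up IDs.
toy_seeds = pd.DataFrame({"Seeds ID": ["a-X", "b-X", "c-Y", "d-Y"]})
toy_pairs = [[0, 1], [2, 3]]

# .iloc with a list of positions selects those rows in the given order,
# so each index pair becomes a two-element list of IDs.
toy_ids = [toy_seeds.iloc[pair]["Seeds ID"].tolist() for pair in toy_pairs]

# Each two-element record becomes one row with columns ID_A and ID_B.
toy_pair_df = pd.DataFrame.from_records(toy_ids, columns=["ID_A", "ID_B"])
```

Note that `.iloc` is positional, so the indices in `group_indices` refer to row positions in `seeds`, which is why the order of rows in the loaded file matters.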

In [14]:
# here's what it looks like
pair_df

| | ID_A | ID_B |
|---|---|---|
| 0 | pleasant-Caliskan_et_al_2017 | unpleasant-Caliskan_et_al_2017 |
| 1 | flowers-Caliskan_et_al_2017 | insects-Caliskan_et_al_2017 |
| 2 | instruments-Caliskan_et_al_2017 | weapons-Caliskan_et_al_2017 |
| 3 | european_american_names-Caliskan_et_al_2017 | african_american_names-Caliskan_et_al_2017 |
| 4 | european_american_names_market_discrimination-... | african_american_names_market_discrimination-C... |
| ... | ... | ... |
| 85 | high_morality_and_low/neutral_warmth-Bhatia_et... | high_competence-Bhatia_et_al_2018 |
| 86 | low/neutral_and_morality_high_warmth-Bhatia_et... | high_competence-Bhatia_et_al_2018 |
| 87 | male_words_Penn_Treebank-Bordia_and_Bowman_2019 | female_words_Penn_Treebank-Bordia_and_Bowman_2019 |
| 88 | male_words_WikiText_2-Bordia_and_Bowman_2019 | female_words_WikiText_2-Bordia_and_Bowman_2019... |


In [15]:
# save to disk
pair_df.to_csv(config["pairs"]["dir_path"] + "seed_set_pairings.csv", index=False)

In [16]:
# you can read it back from disk like this
pair_df_new = pd.read_csv(config["pairs"]["dir_path"] + "seed_set_pairings.csv")
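
To confirm that the file round-trips without loss, the saved and reloaded frames can be compared directly. The sketch below uses a temporary directory and a two-row toy dataframe instead of the real output path; the IDs are made up.

```python
import os
import tempfile

import pandas as pd

toy_pair_df = pd.DataFrame(
    {"ID_A": ["pleasant-X", "flowers-X"], "ID_B": ["unpleasant-X", "insects-X"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "seed_set_pairings.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy_pair_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(toy_pair_df, reloaded)
```

In the notebook, the equivalent check would be `pd.testing.assert_frame_equal(pair_df, pair_df_new)`.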

In [17]:
# looks like you'd expect
pair_df_new

| | ID_A | ID_B |
|---|---|---|
| 0 | pleasant-Caliskan_et_al_2017 | unpleasant-Caliskan_et_al_2017 |
| 1 | flowers-Caliskan_et_al_2017 | insects-Caliskan_et_al_2017 |
| 2 | instruments-Caliskan_et_al_2017 | weapons-Caliskan_et_al_2017 |
| 3 | european_american_names-Caliskan_et_al_2017 | african_american_names-Caliskan_et_al_2017 |
| 4 | european_american_names_market_discrimination-... | african_american_names_market_discrimination-C... |
| ... | ... | ... |
| 85 | high_morality_and_low/neutral_warmth-Bhatia_et... | high_competence-Bhatia_et_al_2018 |
| 86 | low/neutral_and_morality_high_warmth-Bhatia_et... | high_competence-Bhatia_et_al_2018 |
| 87 | male_words_Penn_Treebank-Bordia_and_Bowman_2019 | female_words_Penn_Treebank-Bordia_and_Bowman_2019 |
| 88 | male_words_WikiText_2-Bordia_and_Bowman_2019 | female_words_WikiText_2-Bordia_and_Bowman_2019... |


In [18]:
# can convert it to a list of tuples for example
list(pair_df_new.itertuples(index=False, name=None))

[('pleasant-Caliskan_et_al_2017', 'unpleasant-Caliskan_et_al_2017'),
 ('flowers-Caliskan_et_al_2017', 'insects-Caliskan_et_al_2017'),
 ('instruments-Caliskan_et_al_2017', 'weapons-Caliskan_et_al_2017'),
 ('european_american_names-Caliskan_et_al_2017',
  'african_american_names-Caliskan_et_al_2017'),
 ('european_american_names_market_discrimination-Caliskan_et_al_2017',
  'african_american_names_market_discrimination-Caliskan_et_al_2017'),
 ('pleasantness-Caliskan_et_al_2017', 'unpleasantness-Caliskan_et_al_2017'),
 ('male_names_1-Caliskan_et_al_2017', 'female_names_1-Caliskan_et_al_2017'),
 ('career-Caliskan_et_al_2017', 'family-Caliskan_et_al_2017'),
 ('male_1-Caliskan_et_al_2017', 'female_1-Caliskan_et_al_2017'),
 ('science_1-Caliskan_et_al_2017', 'arts_2-Caliskan_et_al_2017'),
 ('male_2-Caliskan_et_al_2017', 'female_2-Caliskan_et_al_2017'),
 ('temporary-Caliskan_et_al_2017', 'permanent-Caliskan_et_al_2017'),
 ('young_names-Caliskan_et_al_2017', 'old_names-Caliskan_et_al_2017'),
 ('p