# Data Preparation for Repairing Model Inference to Handle Fairness Issues

* Prepare the selected templates from the IMDB review data
* Take a random sample of names; https://www.surveysystem.com/sscalc.htm can be used to calculate how many names make a representative sample
* Create mutant texts from the names
* Return the majority result from the predictions (Note: the majority does not need a true label. The majority serves as the fairness-aware inference for the prediction) -> **presented in another notebook**
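
The number of names for a representative sample can also be estimated in code. The sketch below assumes the standard sample size formula with a finite population correction, which is what calculators of this kind commonly use; the function name `sample_size`, its defaults (95% confidence, 5% margin of error), and the use of `math.ceil` for rounding are illustrative assumptions, not values taken from this notebook.

```python
import math

def sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    """Estimate a representative sample size for a finite population."""
    # Sample size for an effectively infinite population.
    ss = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite population correction shrinks it for small populations.
    corrected = ss / (1 + (ss - 1) / population)
    # Round up so the margin of error is not exceeded (an assumption).
    return math.ceil(corrected)

# Hypothetical population of 10,000 candidate names.
n_names = sample_size(population=10_000)
print(n_names)
```

A smaller margin of error or a higher confidence level (larger `z`) increases the required number of names.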

In [1]:
import pandas as pd
import numpy as np
import math

#### Preparing Mutant Template

Please refer to `codes/mutant-generation.ipynb` for details on how the templates are obtained.

In [2]:
dfm = pd.read_csv("../data/imdb_mutant/male/test.csv", header=None, sep="\t", names=["label", "mutant", "template"])
dff = pd.read_csv("../data/imdb_mutant/female/test.csv", header=None, sep="\t", names=["label", "mutant", "template"])

In [3]:
dfm

,label,mutant,template
0,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
1,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
2,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
3,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
4,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
...,...,...,...
138995,1,"First, I'm a huge Justin fan. I grew up knowin...","First, I'm a huge Buddy Holly fan. I grew up k..."
138996,1,"First, I'm a huge Terrence fan. I grew up know...","First, I'm a huge Buddy Holly fan. I grew up k..."
138997,1,"First, I'm a huge Roger fan. I grew up knowing...","First, I'm a huge Buddy Holly fan. I grew up k..."
138998,1,"First, I'm a huge Torrance fan. I grew up know...","First, I'm a huge Buddy Holly fan. I grew up k..."


In [4]:
df = pd.concat([dfm, dff])

In [5]:
df

,label,mutant,template
0,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
1,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
2,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
3,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
4,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
...,...,...,...
138995,1,"First, I'm a huge Melanie fan. I grew up knowi...","First, I'm a huge Buddy Holly fan. I grew up k..."
138996,1,"First, I'm a huge Tanisha fan. I grew up knowi...","First, I'm a huge Buddy Holly fan. I grew up k..."
138997,1,"First, I'm a huge Nancy fan. I grew up knowing...","First, I'm a huge Buddy Holly fan. I grew up k..."
138998,1,"First, I'm a huge Tia fan. I grew up knowing w...","First, I'm a huge Buddy Holly fan. I grew up k..."


In [6]:
df["template"] = df["template"].astype("category")
df["template_id"] = df["template"].cat.codes
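
As a small illustration of what `cat.codes` produces, the toy frame below (made-up strings, not notebook data) shows that the integer IDs follow the sorted order of the distinct values rather than the order in which they first appear. This is why the first template in the file does not necessarily get ID 0.

```python
import pandas as pd

# Toy data standing in for the real templates.
toy = pd.DataFrame({"template": ["b text", "a text", "b text", "c text"]})

# Converting to a categorical collects the distinct strings in sorted order.
toy["template"] = toy["template"].astype("category")

# Each row gets the position of its string in that sorted list.
toy["template_id"] = toy["template"].cat.codes
print(toy)
```

Identical strings always share one ID, so the ID can be used as a compact grouping key.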

In [7]:
gb = df.groupby("template_id")

In [8]:
gb.count()

template_id,label,mutant,template
0,40,40,40
1,40,40,40
2,40,40,40
3,40,40,40
4,40,40,40
...,...,...,...
6893,40,40,40
6894,40,40,40
6895,40,40,40
6896,40,40,40


We have 6898 templates, with 40 mutants per template in the rows shown.
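
The truncated display only shows a handful of groups, so a quick check like the one below could confirm whether the count really holds for every template. It is a sketch on toy data; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df: three templates with two mutants each.
toy = pd.DataFrame({
    "template_id": [0, 0, 1, 1, 2, 2],
    "mutant": list("abcdef"),
})

sizes = toy.groupby("template_id").size()   # rows per template
n_templates = len(sizes)                    # number of distinct templates
is_uniform = sizes.nunique() == 1           # True if every group has the same size
print(n_templates, sizes.iloc[0], is_uniform)
```

On the real data, `sizes.value_counts()` would additionally list any group sizes that differ from the expected one.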

#### Preparing Names from Gender Computer

In [9]:
gc = pd.read_csv("../data/gc_name/data.csv")
gc

,Name,Gender,Country
0,Roen,male,UK
1,Jeet,male,UK
2,Hagen,male,UK
3,Willow,male,UK
4,Belal,male,UK
...,...,...,...
305,Dódi,female,Hungary
306,Iboyka,female,Hungary
307,Zsófia,female,Hungary
308,Ágota,female,Hungary


In [10]:
gcm = gc[gc["Gender"] == "male"]
gcf = gc[gc["Gender"] == "female"]
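
The cells further down use every name in `gcm` and `gcf`. If a random sample per gender were wanted instead, as the overview suggests, one possible sketch is shown here. The toy frame reuses a few names from the table above; `n_per_gender` and the `random_state` value are hypothetical choices.

```python
import pandas as pd

# Toy stand-in for the Gender Computer table.
toy_gc = pd.DataFrame({
    "Name": ["Roen", "Jeet", "Hagen", "Belal", "Dódi", "Iboyka", "Zsófia", "Ágota"],
    "Gender": ["male"] * 4 + ["female"] * 4,
})

gcm = toy_gc[toy_gc["Gender"] == "male"]
gcf = toy_gc[toy_gc["Gender"] == "female"]

n_per_gender = 2  # hypothetical sample size per gender

# Sample without replacement; a fixed seed keeps the draw reproducible.
mnames = gcm["Name"].sample(n=n_per_gender, random_state=0).values
fnames = gcf["Name"].sample(n=n_per_gender, random_state=0).values
print(mnames, fnames)
```

Sampling each gender separately keeps the two name lists the same size, so every template gets an equal number of male and female mutants.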

#### Mutant Generation

In [11]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

import spacy
import en_core_web_lg
import neuralcoref
nlp = en_core_web_lg.load()
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [87]:
# names from GC
mnames = gcm["Name"].values
fnames = gcf["Name"].values

# names from EEC paper
# mnames = ["Alonzo", "Adam", "Alphonse", "Alan", "Darnell", "Andrew", "Jamel", "Frank", "Jerome", "Harry", "Lamar", "Jack", "Leroy", "Josh", "Malik", "Justin", "Terrence", "Roger", "Torrance", "Ryan"]
# fnames = ["Ebony", "Amanda", "Jasmine", "Betsy", "Lakisha", "Courtney", "Latisha", "Ellen", "Latoya", "Heather", "Nichelle", "Katie", "Shaniqua", "Kristin", "Shereen", "Melanie", "Tanisha", "Nancy", "Tia", "Stephanie"]

# small name for debugging
# mnames = ["Alonzo", "Adam"] 
# fnames = ["Ebony", "Amanda"]

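# masculine pronouns
mp = ["He", "he", "him", "his", "His", "himself"]

# feminine pronouns, aligned by position with mp
# ("her" appears twice because it corresponds to both "him" and "his")
fp = ["She", "she","her", "her", "Her", "herself"]

# gender-flip lookups: fc maps masculine -> feminine, mc maps feminine -> masculine
mc = {}
fc = {}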

for _m, _f in zip(mp, fp) :
    fc[_m] = _f
    mc[_f] = _m
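
One consequence of building the lookups this way is worth seeing in isolation. Because "her" pairs with both "him" and "his", the later pair overwrites the earlier one in the feminine-to-masculine dictionary. The snippet below repeats the construction on its own to show the resulting mapping.

```python
mp = ["He", "he", "him", "his", "His", "himself"]
fp = ["She", "she", "her", "her", "Her", "herself"]

fc, mc = {}, {}
for m, f in zip(mp, fp):
    fc[m] = f  # masculine -> feminine
    mc[f] = m  # feminine -> masculine; a repeated key keeps the last value

# "her" was first mapped to "him", then overwritten by "his".
print(mc["her"])
```

So an object pronoun such as the one in "I saw her" would be flipped to "his" rather than "him". Resolving that ambiguity would need part-of-speech information, which is beyond this sketch.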

In [120]:
# Holds a word and its location inside the text.
# The location is given by its start and end character offsets.
class Token: 
    word = ""
    start = -1
    end = -1
    
    def __init__(self, word, start, end) :
        self.word = word
        self.start = start
        self.end = end
        
    def __str__(self) :
        return self.word
    
    def __repr__(self) :
        return self.word
        
    def get_word(self):
        return self.word
    
    def get_start(self):
        return self.start
    
    def get_end(self):
        return self.end


# Ref stores one coreference cluster: its main mention and all of its mentions
# e.g. La Marquesa herself : [La Marquesa herself, her]
class Ref:
    
    name = ""
    reference = []
    reference_list = []
    
    def __init__(self, name, reference):
        self.name = str(name)
        self.reference = []
        self.reference_list = []
        for word in reference :
            self.reference_list.append(word.text)
            self.reference.append(Token(word.text, word.start_char, word.end_char))
            
    def __str__(self) :
        return self.name + ": " + str(self.reference_list)
    
    def __repr__(self) :
        return self.name + ": " + str(self.reference_list)
    
    def get_name(self):
        return self.name
    
    def get_reference(self):
        return self.reference
    
    # True if the cluster contains a masculine subject pronoun ("He" or "he")
    def is_having_male_subject(self):
        return "He" in self.reference_list or "he" in self.reference_list

    # True if the cluster contains a feminine subject pronoun ("She" or "she")
    def is_having_female_subject(self):
        return "She" in self.reference_list or "she" in self.reference_list

class Coref:
    original = ""
    resolved = ""
    refs = []
    one_subject = False
    is_male = False
    subject_reference = None
    chunk = []
    
    def __init__(self, text):
        
        self.original = str(text)
        doc = nlp(text)
        refs = doc._.coref_clusters
        self.resolved = str(doc._.coref_resolved)
        self.refs = []
        for r in refs :
            self.refs.append(Ref(r.main, r.mentions))
            
        self.one_subject, self.subject_reference, self.is_male = self.check_one_subject()
        
        if self.one_subject :
            self.chunk = self.generate_chunk_from_coref()
            
    def get_original(self):
        return self.original
    
    def get_resolved(self):
        return self.resolved
    
    def get_refs(self):
        return self.refs
    
    def get_gender(self):
        if self.is_male :
            return "male"
        return "female"
    
    def is_one_subject(self) :
        return self.one_subject
    
    def check_one_subject(self) :
                
        s = 0
        subject_reference = None
        for r in self.refs :
            if r.is_having_male_subject() :
                s += 1
                subject_reference = r
                is_male = True
            
            if r.is_having_female_subject() :
                s += 1
                subject_reference = r
                is_male = False
                
        if s == 1 :
            # check whether the cluster contains only pronouns (no name to replace)
            is_only_pronoun = True
            for r in subject_reference.get_reference() :
                if r.word not in mp and r.word not in fp :
                    is_only_pronoun = False

            if is_only_pronoun :
                return False, None, None 
            
            return True, subject_reference, is_male
        else :
            return False, None, None
    
    def get_subject_reference(self):
        return self.subject_reference
    
    def generate_chunk_from_coref(self) :
        chunk = []
        refs = self.subject_reference.get_reference()
        lb = 0 # lower bound
        ub = 0 # upper bound
        for i in range(len(refs)) :
            # text before the first mention, or between two consecutive mentions
            lb = 0 if i == 0 else refs[i-1].end
            ub = refs[i].start
            # slice this instance's text, not the notebook-level `text` variable
            _chunk = self.original[lb:ub]
            chunk.append(" " if _chunk == "" else _chunk)
                
            if i == len(refs)-1 :
                lb = refs[-1].end
                chunk.append(self.original[lb:])
        
        return chunk
    
    def generate_male_mutant_text(self):
        refs = self.subject_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in mp :
                        t.append(r.word)
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t)})
        else :
            # the original subject is female: flip feminine pronouns to masculine
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in fp :
                        t.append(mc[r.word])
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t)})
        return mutant
            
    def generate_female_mutant_text(self):
        refs = self.subject_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in mp :
                        t.append(fc[r.word])
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t)})
        else :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in fp :
                        t.append(r.word)
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t)})
        return mutant
    
    
    def generate_mutant_text(self):
        chunk = self.chunk
        mutants = []
        male_mutant = self.generate_male_mutant_text()
        for _mutant in male_mutant :
            mutants.append(_mutant)
        female_mutant = self.generate_female_mutant_text()
        for _mutant in female_mutant :
            mutants.append(_mutant)
        return mutants
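
The core idea of the class above, cutting the text around the subject's mentions and stitching it back together with new names and pronouns, can be shown without spaCy. The sketch below is a simplified illustration with hand-written character offsets. The helper names `split_around_spans` and `rebuild` are hypothetical, the spans are assumed to be sorted and non-overlapping, and unlike the class above it keeps empty chunks as empty strings instead of replacing them with a space.

```python
def split_around_spans(text, spans):
    """Return the pieces of `text` that lie outside the (start, end) spans."""
    chunks, lower = [], 0
    for start, end in spans:
        chunks.append(text[lower:start])  # text before this mention
        lower = end                       # continue after the mention
    chunks.append(text[lower:])           # text after the last mention
    return chunks

def rebuild(chunks, replacements):
    """Interleave the chunks with one replacement per removed span."""
    parts = [chunks[0]]
    for replacement, chunk in zip(replacements, chunks[1:]):
        parts.append(replacement)
        parts.append(chunk)
    return "".join(parts)

text = "Anna sings well and she knows it."
spans = [(0, 4), (20, 23)]  # character offsets of "Anna" and "she"

chunks = split_around_spans(text, spans)
mutant = rebuild(chunks, ["Adam", "he"])
print(mutant)
```

There is always exactly one more chunk than there are mentions, which is why the rebuild step starts with the first chunk and then alternates replacement and chunk.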

In [178]:
text = "About your terrible movie copying Beethoven. As a professional musician it\'s my duty to watch every movie made about any composer and Beethoven is one of my favorites. When Hungarians and Americans meet, it\'s a terrible combination of empty over the top emotions combined with the worst taste possible. You proved it in your terrible b-movie. The only thing that carries the movie is the music. Of course you didn\'t bother to look further than the good but in my taste contrived performances of the Tackacs quartet, but OK I have to admit that the performances at least have quality as contrast to the movie you\'ve made. It starts of with the dying DEAF Beethoven who perfectly understands Anna who is merely whispering. Beethoven\'s hearing during the movie get\'s better by the minute, but that must be because of some vague divine thing. Then there is the quite impossible semi-pornographic \"eyes wide shut\" double-conducting scene which is totally over the top with the luscious Anna and the crying nephew in the end (who also cries in the deleted scenes with constant red eyes, my GOD what a performance). And as culmination the rip-off from Amadeus, with Beethoven dictating music to Anna not in notes but in total nonsense, which she understands perfectly but no-one else in your audience even trained professional musicians will understand. Of course your reaction will be that negative response is a response at least, but I can assure you that Beethoven himself is turning in his grave because of your worthless creation and with reason. This so called homage is blasphemy and I am so sorry to have rented one of the worst movies ever made even though it\'s about my favorite subject. Ed Harris and others, you cannot comprehend the greatness of Beethoven in your wildest dreams and certainly not after a couple of lessons in conducting and violin playing. That's the trouble with you Americans: you think you can grasp everything even when it takes a lifetime of hard work. Yeah we can do it anyway! Remember that a good product comes with hard labor, talent, devotion and professionalism. All these you creators of Copying Beethoven lack. See you in kindergarten."

In [179]:
c = Coref(text)

In [180]:
c.get_resolved()

'About your terrible movie copying Beethoven. As a professional musician it\'s my duty to watch every movie made about any composer and Beethoven is one of my favorites. When Hungarians and Americans meet, it\'s a terrible combination of empty over the top emotions combined with the worst taste possible. You proved it in your terrible b-movie. The only thing that carries your terrible b-movie is the music. Of course you didn\'t bother to look further than the good but in my taste contrived performances of the Tackacs quartet, but OK I have to admit that the performances at least have quality as contrast to your terrible b-movie. It starts of with the dying DEAF Beethoven who perfectly understands Anna who is merely whispering. Beethoven\'s hearing during your terrible b-movie get\'s better by the minute, but that must be because of some vague divine thing. Then there is the quite impossible semi-pornographic "eyes wide shut" double-conducting scene which is totally over the top with th

In [181]:
for r in c.get_refs() :
    print(r.get_name())
    print(r.get_reference())

Beethoven
[Beethoven, Beethoven, Beethoven, Beethoven]
it
[it, it]
your terrible b-movie
[your terrible b-movie, the movie, the movie you've made, the movie]
Anna
[Anna, Anna, Anna, she]
Beethoven himself
[Beethoven himself, his, Beethoven, Beethoven]
This
[This, it]
Ed Harris and others
[Ed Harris and others, you Americans]


In [182]:
c.is_one_subject()

True

In [183]:
if c.is_one_subject() :
    print(c.get_subject_reference())
#     print(c.generate_mutant_text())

Anna: ['Anna', 'Anna', 'Anna', 'she']


In [127]:
dt = df[["label", "template"]]

In [128]:
dt = dt.drop_duplicates().reset_index(drop=True)
dt

,label,template
0,1,"I have only see three episodes of Hack, starri..."
1,1,In the groovy mid 70's a scruffy bunch of bras...
2,1,This must have been one of Chaplin's most ambi...
3,1,The debut that plucked from obscurity one of t...
4,1,There is really no way to compare this motion ...
...,...,...
6893,1,I heard they were going to remake this French ...
6894,1,"Well, the movie did turn out a lot better than..."
6895,0,"In this film, there is a loose plot of a man (..."
6896,1,The French Babbette appears at the modest hous...


In [203]:
import time
start = time.time()

original_arr = []
mutant_arr = []
gender_arr = []
label_arr = []

for index, row in dt.iterrows():
    label = row['label']
    text = row['template']
    c = Coref(text)
    if c.is_one_subject() :

        # append original text
        original_arr.append(text)
        mutant_arr.append(text)
        gender_arr.append("template")
        label_arr.append(label)
        
        mtext = c.generate_mutant_text()
        for m in mtext :
            original_arr.append(text)
            mutant_arr.append(m["text"])
            gender_arr.append(m["gender"])
            label_arr.append(label)
        

end = time.time()
print("Execution Time: ", end-start)

Execution Time:  1036.0771293640137


In [204]:
dmutant = pd.DataFrame(data={"label": label_arr, "mutant": mutant_arr, "gender": gender_arr, "original": original_arr})
dmutant

,label,mutant,gender,original
0,1,"I have only see three episodes of Hack, starri...",template,"I have only see three episodes of Hack, starri..."
1,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri..."
2,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri..."
3,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri..."
4,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri..."
...,...,...,...,...
1891808,1,"First, I'm a huge Dódi fan. I grew up knowing ...",female,"First, I'm a huge Buddy Holly fan. I grew up k..."
1891809,1,"First, I'm a huge Iboyka fan. I grew up knowin...",female,"First, I'm a huge Buddy Holly fan. I grew up k..."
1891810,1,"First, I'm a huge Zsófia fan. I grew up knowin...",female,"First, I'm a huge Buddy Holly fan. I grew up k..."
1891811,1,"First, I'm a huge Ágota fan. I grew up knowing...",female,"First, I'm a huge Buddy Holly fan. I grew up k..."


In [205]:
dmutant["original"] = dmutant["original"].astype("category")
dmutant["template_id"] = dmutant["original"].cat.codes

In [206]:
dmutant.groupby("template_id").count()

template_id,label,mutant,gender,original
0,311,311,311,311
1,311,311,311,311
2,311,311,311,311
3,311,311,311,311
4,311,311,311,311
...,...,...,...,...
6078,311,311,311,311
6079,311,311,311,311
6080,311,311,311,311
6081,311,311,311,311


In [207]:
dms = dmutant[dmutant["gender"] == "template"]
dmd = dmutant[dmutant["gender"] != "template"]

In [208]:
dmd = dmd.reset_index(drop=True)
dmd

,label,mutant,gender,original,template_id
0,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",1846
1,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",1846
2,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",1846
3,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",1846
4,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",1846
...,...,...,...,...,...
1885725,1,"First, I'm a huge Dódi fan. I grew up knowing ...",female,"First, I'm a huge Buddy Holly fan. I grew up k...",1128
1885726,1,"First, I'm a huge Iboyka fan. I grew up knowin...",female,"First, I'm a huge Buddy Holly fan. I grew up k...",1128
1885727,1,"First, I'm a huge Zsófia fan. I grew up knowin...",female,"First, I'm a huge Buddy Holly fan. I grew up k...",1128
1885728,1,"First, I'm a huge Ágota fan. I grew up knowing...",female,"First, I'm a huge Buddy Holly fan. I grew up k...",1128


In [209]:
dmd.groupby("template_id").count()

template_id,label,mutant,gender,original
0,310,310,310,310
1,310,310,310,310
2,310,310,310,310
3,310,310,310,310
4,310,310,310,310
...,...,...,...,...
6078,310,310,310,310
6079,310,310,310,310
6080,310,310,310,310
6081,310,310,310,310
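
Since the generation code produces one mutant per name, the number of mutants per template should equal the number of male names plus the number of female names. A sketch of that consistency check on toy data follows; the short name lists and the frame `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical short name lists standing in for mnames and fnames.
mnames = ["Roen", "Jeet"]
fnames = ["Dódi", "Iboyka", "Zsófia"]

# Toy mutants: two templates, one row per name.
toy = pd.DataFrame({
    "template_id": [0] * 5 + [1] * 5,
    "gender": (["male"] * 2 + ["female"] * 3) * 2,
})

per_template = toy.groupby("template_id").size()
expected = len(mnames) + len(fnames)

# True only if every template has exactly one mutant per name.
all_match = bool((per_template == expected).all())
print(expected, all_match)
```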


In [210]:
dms = dms.reset_index(drop=True)
dms

,label,mutant,gender,original,template_id
0,1,"I have only see three episodes of Hack, starri...",template,"I have only see three episodes of Hack, starri...",1846
1,1,In the groovy mid 70's a scruffy bunch of bras...,template,In the groovy mid 70's a scruffy bunch of bras...,2831
2,1,This must have been one of Chaplin's most ambi...,template,This must have been one of Chaplin's most ambi...,5200
3,1,The debut that plucked from obscurity one of t...,template,The debut that plucked from obscurity one of t...,4244
4,1,There is really no way to compare this motion ...,template,There is really no way to compare this motion ...,4491
...,...,...,...,...,...
6078,0,My Take: Yet another lame PG-13 horror movie w...,template,My Take: Yet another lame PG-13 horror movie w...,3326
6079,1,I heard they were going to remake this French ...,template,I heard they were going to remake this French ...,1899
6080,1,"Well, the movie did turn out a lot better than...",template,"Well, the movie did turn out a lot better than...",5542
6081,1,The French Babbette appears at the modest hous...,template,The French Babbette appears at the modest hous...,4152


In [211]:
c = dms.groupby("template_id").count()
c

template_id,label,mutant,gender,original
0,1,1,1,1
1,1,1,1,1
2,1,1,1,1
3,1,1,1,1
4,1,1,1,1
...,...,...,...,...
6078,1,1,1,1
6079,1,1,1,1
6080,1,1,1,1
6081,1,1,1,1


In [212]:
# dms.groupby("template_id").get_group(302)["mutant"].values[1]

In [213]:
import os

dirname = "../data/gc_imdb/"

# create the output directory if it does not exist yet
os.makedirs(dirname, exist_ok=True)

dmutant.to_csv(dirname + "test.csv", index=False, header=False, sep="\t")
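
Once model predictions for the saved mutants are available, the majority step from the overview could look roughly like the sketch below. It is not the code from the other notebook. The `prediction` column and its values are hypothetical, and ties are resolved toward the smallest label, which is a choice made here for illustration.

```python
import pandas as pd

# Hypothetical predictions for the mutants of two templates.
preds = pd.DataFrame({
    "template_id": [0, 0, 0, 1, 1, 1, 1, 1],
    "prediction":  [1, 1, 0, 0, 0, 1, 0, 1],
})

# Most frequent prediction per template. Series.mode() returns its values
# sorted, so taking the first one breaks ties toward the smallest label.
majority = preds.groupby("template_id")["prediction"].agg(lambda s: s.mode().iloc[0])
print(majority.to_dict())
```

No true label is involved: the majority is computed purely from the model's own predictions across the mutants of one template.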