# Data Preparation for Repairing Model Inference to Handle Fairness Issues

* Prepare the selected templates from the IMDB review data
* Take a random sample of names; https://www.surveysystem.com/sscalc.htm is used to calculate how many names make a representative sample
* Create mutant texts from the names
* Return the majority result from the predictions (Note: the majority does not need a true label. It represents the fair inference for the prediction) -> **presented in another notebook**
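
The representative sample size mentioned above can be estimated in code as well. The following is a minimal sketch, assuming the standard sample-size formula with a finite population correction; the function name `sample_size`, the default confidence level (z = 1.96 for 95%), the 5% margin of error, and the population of 10,000 names are illustrative assumptions, not values taken from this notebook.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Estimate a representative sample size for a finite population.

    z      : z-score of the confidence level (1.96 is roughly 95%)
    margin : accepted margin of error, as a fraction
    p      : expected proportion; 0.5 is the most conservative choice
    """
    # Sample size for an infinitely large population
    ss = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction
    corrected = ss / (1 + (ss - 1) / population)
    return math.ceil(corrected)


# Hypothetical population of 10,000 candidate names
n_names = sample_size(population=10_000)
print(n_names)
```

Smaller populations need proportionally fewer names, while the result levels off for very large populations.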

In [10]:
import pandas as pd
import numpy as np
import math

#### Preparing Mutant Template

Please refer to `codes/mutant-generation.ipynb` for details on how the templates are obtained.

In [11]:
dfm = pd.read_csv("../data/imdb_mutant/male/test.csv", header=None, sep="\t", names=["label", "mutant", "template"])
dff = pd.read_csv("../data/imdb_mutant/female/test.csv", header=None, sep="\t", names=["label", "mutant", "template"])

In [12]:
dfm

Unnamed: 0,label,mutant,template
0,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
1,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
2,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
3,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
4,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
...,...,...,...
138995,1,"First, I'm a huge Justin fan. I grew up knowin...","First, I'm a huge Buddy Holly fan. I grew up k..."
138996,1,"First, I'm a huge Terrence fan. I grew up know...","First, I'm a huge Buddy Holly fan. I grew up k..."
138997,1,"First, I'm a huge Roger fan. I grew up knowing...","First, I'm a huge Buddy Holly fan. I grew up k..."
138998,1,"First, I'm a huge Torrance fan. I grew up know...","First, I'm a huge Buddy Holly fan. I grew up k..."


In [13]:
df = pd.concat([dfm, dff])

In [14]:
df

Unnamed: 0,label,mutant,template
0,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
1,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
2,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
3,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
4,1,"I have only see three episodes of Hack, starri...","I have only see three episodes of Hack, starri..."
...,...,...,...
138995,1,"First, I'm a huge Melanie fan. I grew up knowi...","First, I'm a huge Buddy Holly fan. I grew up k..."
138996,1,"First, I'm a huge Tanisha fan. I grew up knowi...","First, I'm a huge Buddy Holly fan. I grew up k..."
138997,1,"First, I'm a huge Nancy fan. I grew up knowing...","First, I'm a huge Buddy Holly fan. I grew up k..."
138998,1,"First, I'm a huge Tia fan. I grew up knowing w...","First, I'm a huge Buddy Holly fan. I grew up k..."


In [15]:
df["template"] = df["template"].astype("category")
df["template_id"] = df["template"].cat.codes
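
To illustrate how the category codes act as template IDs, here is a small sketch on toy data. The `<name>` placeholder strings are hypothetical; in the real data the `template` column holds the original review text.

```python
import pandas as pd

# Toy data: two templates, two mutants each (strings are illustrative only)
toy = pd.DataFrame({
    "mutant": ["Adam was great", "Ebony was great", "Adam was dull", "Ebony was dull"],
    "template": ["<name> was great", "<name> was great", "<name> was dull", "<name> was dull"],
})

# Converting to a categorical gives every distinct template an integer code.
# Categories are sorted, so codes follow the lexical order of the templates.
toy["template"] = toy["template"].astype("category")
toy["template_id"] = toy["template"].cat.codes

print(toy[["template", "template_id"]])
```

Rows that share a template receive the same `template_id`, which is what makes the later `groupby("template_id")` possible.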

In [16]:
gb = df.groupby("template_id")

In [17]:
gb.count()

Unnamed: 0_level_0,label,mutant,template
template_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,40,40,40
1,40,40,40
2,40,40,40
3,40,40,40
4,40,40,40
...,...,...,...
6893,40,40,40
6894,40,40,40
6895,40,40,40
6896,40,40,40


We have 6898 templates, with 40 mutants for each template.
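
With a fixed number of mutants per template, the majority result described in the overview can be computed per `template_id`. The sketch below is only an illustration on toy data: the `prediction` column and its values are assumptions, since the actual predictions are produced in another notebook.

```python
import pandas as pd

# Toy predictions: three mutants for each of two templates (values are made up)
preds = pd.DataFrame({
    "template_id": [0, 0, 0, 1, 1, 1],
    "prediction":  [1, 1, 0, 0, 0, 1],
})

# Majority prediction per template; on a tie, mode() is sorted,
# so the smallest label is picked
majority = preds.groupby("template_id")["prediction"].agg(lambda s: s.mode().iloc[0])

# Flag each mutant by whether it agrees with its template's majority
preds["majority"] = preds["template_id"].map(majority)
preds["agrees"] = preds["prediction"] == preds["majority"]

print(preds)
```

Because each template here has an even number of mutants (40), ties are possible in the real data, so the tie-breaking rule is a design choice that should be made explicitly.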

#### Mutant Generation

In [26]:
import spacy
import en_core_web_lg
import neuralcoref
nlp = en_core_web_lg.load()
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [27]:
class Entity:
    word = ""
    start = 0
    end = 0
    ent_type = ""
    def __init__(self, word, start, end, ent_type) :
        self.word = word
        self.start = start
        self.end = end
        self.ent_type = ent_type
    
    def __str__(self) :
        return self.word
    
    def __repr__(self) :
        return self.word
        
    def get_word(self):
        return self.word
    
    def get_start(self):
        return self.start
    
    def get_end(self):
        return self.end
    
    def get_entity_type(self):
        return self.ent_type
    
    def is_person(self):
        return self.ent_type == "PERSON"

In [28]:
# Contains a word and its location inside the sentence.
# The location is given by the start and end character offsets.
class Token: 
    word = ""
    start = 0
    end = 0
    
    def __init__(self, word, start, end) :
        self.word = word
        self.start = start
        self.end = end
        
    def __str__(self) :
        return self.word
    
    def __repr__(self) :
        return self.word
        
    def get_word(self):
        return self.word
    
    def get_start(self):
        return self.start
    
    def get_end(self):
        return self.end


# Ref stores a coreference cluster: the main mention and all of its mentions
# e.g. La Marquesa : [La Marquesa, her]
class Ref:
    
    name = ""
    reference = []
    reference_list = []
        
    def __init__(self, name, reference):
        self.name = str(name)
        self.reference = []
        self.reference_list = []
        for word in reference :
            self.reference_list.append(word.text)
            self.reference.append(Token(word.text, word.start_char, word.end_char))
            
    def __str__(self) :
        return self.name + ": " + str(self.reference_list)
    
    def __repr__(self) :
        return self.name + ": " + str(self.reference_list)
    
    def get_name(self):
        return self.name
    
    def get_reference(self):
        return self.reference
    
    def get_reference_list(self):
        return self.reference_list
    
    # is having male subject
    def is_having_male_subject(self):
        if "He" in self.reference_list :
            return True
        elif "he" in self.reference_list :
            return True
        else :
            return False

    # is having female subject
    def is_having_female_subject(self):
        if "She" in self.reference_list :
            return True
        elif "she" in self.reference_list :
            return True
        else :
            return False

#### Load Gender-Associated Words

In [29]:
# gender associated word
gaw = pd.read_csv("../data/gender_associated_word/masculine-feminine-person.txt")
gaw.head()

Unnamed: 0,masculine,feminine
0,boy,girl
1,brother,sister
2,daddy,mummy
3,man,woman
4,father,mother


#### Load Names from Gender Computer

In [30]:
gc = pd.read_csv("../data/gc_name/data.csv")
gc

Unnamed: 0,Name,Gender,Country
0,Roen,male,UK
1,Jeet,male,UK
2,Hagen,male,UK
3,Willow,male,UK
4,Belal,male,UK
...,...,...,...
305,Dódi,female,Hungary
306,Iboyka,female,Hungary
307,Zsófia,female,Hungary
308,Ágota,female,Hungary


In [31]:
gcm = gc[gc["Gender"] == "male"]
gcf = gc[gc["Gender"] == "female"]
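
The random sampling of names mentioned in the overview could be done per gender with pandas. This is a hedged sketch on a handful of names copied from the table above; the sample size `n=2` and the seed are arbitrary assumptions, and `DataFrameGroupBy.sample` requires pandas 1.1 or newer.

```python
import pandas as pd

# A few rows copied from the Gender Computer table shown above
names = pd.DataFrame({
    "Name": ["Roen", "Jeet", "Hagen", "Belal", "Dódi", "Iboyka", "Zsófia", "Ágota"],
    "Gender": ["male"] * 4 + ["female"] * 4,
    "Country": ["UK"] * 4 + ["Hungary"] * 4,
})

# Draw the same number of names from each gender; a fixed seed keeps it reproducible
sampled = names.groupby("Gender").sample(n=2, random_state=42)

print(sampled)
```

Sampling within each gender group keeps the male and female name lists the same length, so every template gets a balanced set of mutants.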

In [32]:
# names from GC
mnames = gcm["Name"].values
fnames = gcf["Name"].values

# names from EEC paper
# mnames = ["Alonzo", "Adam", "Alphonse", "Alan", "Darnell", "Andrew", "Jamel", "Frank", "Jerome", "Harry", "Lamar", "Jack", "Leroy", "Josh", "Malik", "Justin", "Terrence", "Roger", "Torrance", "Ryan"]
# fnames = ["Ebony", "Amanda", "Jasmine", "Betsy", "Lakisha", "Courtney", "Latisha", "Ellen", "Latoya", "Heather", "Nichelle", "Katie", "Shaniqua", "Kristin", "Shereen", "Melanie", "Tanisha", "Nancy", "Tia", "Stephanie"]

# small name lists for debugging (these override the lists loaded above)
mnames = ["Alonzo"] 
fnames = ["Ebony"]

# masculine pronoun
masculine_pronoun = ["he", "him", "his", "himself", "He", "Him", "His", "Himself"]

# feminine pronoun ("her" appears twice because it pairs with both "him" and "his")
feminine_pronoun = ["she","her", "her", "herself", "She","Her", "Her", "Herself"]

# gender flipper
masculine_flipper = {}
feminine_flipper = {}

for _m, _f in zip(masculine_pronoun, feminine_pronoun) :
    feminine_flipper[_m] = _f
    masculine_flipper[_f] = _m
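
One consequence of building the flippers with `zip` is worth spelling out: "her" corresponds to both "him" and "his", so the feminine-to-masculine mapping keeps only the last pair. The sketch below reproduces this on the lowercase pronouns; the names `to_feminine` and `to_masculine` are illustrative stand-ins for `feminine_flipper` and `masculine_flipper`.

```python
masculine_pronoun = ["he", "him", "his", "himself"]
feminine_pronoun = ["she", "her", "her", "herself"]

# masculine -> feminine is unambiguous: both "him" and "his" become "her"
to_feminine = dict(zip(masculine_pronoun, feminine_pronoun))

# feminine -> masculine: the later ("her", "his") pair overwrites ("her", "him")
to_masculine = dict(zip(feminine_pronoun, masculine_pronoun))

print(to_feminine["him"], to_feminine["his"])  # her her
print(to_masculine["her"])                     # his

# An object-position "her" is therefore flipped to "his" rather than "him"
words = "I saw her".split()
flipped = " ".join(to_masculine.get(w, w) for w in words)
print(flipped)  # I saw his
```

Resolving this properly would need part-of-speech or dependency information to tell the possessive "her" from the object "her"; that is outside the scope of this sketch.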

In [58]:
class Coref:
    original = ""
    resolved = ""
    refs = []
    person_entities = []
    is_one_subject = False
    is_male = False
    chunk = []
    person_reference = None
    person_name = None
    person_substitution = None
    
    def __init__(self, text):
        
        self.original = str(text)
        doc = nlp(text)
        
        self.person_entities = self.get_person_entities(doc.ents)
        
        self.resolved = str(doc._.coref_resolved)
        self.refs = []
        refs = doc._.coref_clusters
        for r in refs :
            self.refs.append(Ref(r.main, r.mentions))
        
        self.is_one_subject, self.person_reference, self.is_male = self.check_one_subject()
        
        if self.is_one_subject :
            self.chunk = self.generate_chunk_from_coref()
            
    def get_original(self):
        return self.original
    
    def get_resolved(self):
        return self.resolved
    
    def get_refs(self):
        return self.refs
    
    def get_gender(self):
        if self.is_male :
            return "male"
        return "female"
    
    def get_person_entities(self, ents) :
        entities = set()
        for ent in ents :
            e = Entity(ent.text, ent.start_char, ent.end_char, ent.label_)
            if e.is_person() :
                entities.add(e.get_word())
        return list(entities)
    
    def is_having_one_subject(self) :
        return self.is_one_subject
    
    def is_the_person_reference_is_a_person_name(self) :
        if self.person_reference.get_name() in self.person_entities :
            return True
        return False
    
    def is_the_person_reference_is_followed_by_an_aposthrope(self) :
        if self.person_reference.get_name()[-2:] == "'s" :
            return True
        return False
    
    def is_the_person_name_only_occupies_several_words_in_the_person_reference(self) :
        doc_person_reference = nlp(self.person_reference.get_name())

        for token in doc_person_reference:
#             print(token.text, token.pos_, token.dep_)
            if token.text in self.person_entities and token.dep_ == "ROOT":        
                return True
        return False
        
    def is_the_main_person_reference_is_a_pronoun(self):
        return self.person_reference.get_name() in (masculine_pronoun + feminine_pronoun)
    
    def is_there_is_a_person_name_inside_the_references(self) :
        for token in self.person_reference.get_reference_list() :
            if token in self.person_entities :
                return True
        return False
    
    def is_the_person_reference_is_a_gender_associated_word(self) :
        
        doc_person_reference = nlp(self.person_reference.get_name())

        main_token = None
        for token in doc_person_reference:
#             print(token.text, token.pos_, token.dep_)
            if token.pos_ == "NOUN" and token.dep_ == "ROOT" :
                main_token = token.text
        
        if main_token != None :
            if self.is_male :
                if main_token in gaw["masculine"].values :
                    return True
            else :
                if main_token in gaw["feminine"].values :
                    return True
            
        return False
        
        
    def check_one_subject(self) :
                
        s = 0
        subject_reference = None
        for r in self.refs :
            if r.is_having_male_subject() :
                s += 1
                subject_reference = r
                is_male = True
            
            if r.is_having_female_subject() :
                s += 1
                subject_reference = r
                is_male = False
                
        if s == 1 :
            # check whether the reference contains only pronouns
            is_only_pronoun = True
            for r in subject_reference.get_reference() :
                if r.word not in masculine_pronoun and r.word not in feminine_pronoun :
                    is_only_pronoun = False

            if is_only_pronoun :
                return False, None, None 
            
            return True, subject_reference, is_male
        else :
            return False, None, None
    
    def get_person_reference(self):
        return self.person_reference
    
    def generate_chunk_from_coref(self) :
        chunk = []
        refs = self.person_reference.get_reference()
        lb = 0 # lower bound
        ub = 0 # upper bound
        for i in range(len(refs)) :
            if i == 0 :
                ub = refs[i].start
                _chunk = self.original[lb:ub]
                if _chunk == "" :
                    chunk.append(" ")
                else :
                    chunk.append(_chunk)
            else :
                lb = refs[i-1].end
                ub = refs[i].start
                _chunk = self.original[lb:ub]
                if _chunk == "" :
                    chunk.append(" ")
                else :
                    chunk.append(_chunk)
                
            if i == len(refs)-1 :
                lb = refs[-1].end
                chunk.append(self.original[lb:])
        
        return chunk
    
    def generate_normal_male_mutant_text(self):
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "normal", "person_reference": name})
        else :
            gender = "female"
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(masculine_flipper[r.word])
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "normal", "person_reference": name})
        return mutant
            
    def generate_normal_female_mutant_text(self):
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(feminine_flipper[r.word])
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "normal", "person_reference": name})
        else :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "normal", "person_reference": name})
        return mutant
    
    def generate_apostrophe_male_mutant_text(self):
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(r.word)
                    elif r.word[-2:] == "'s" :
                        t.append(name + "'s")
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "apostrophe", "person_reference": name + "'s"})
        else :
            gender = "female"
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(masculine_flipper[r.word])
                    elif r.word[-2:] == "'s" :
                        t.append(name + "'s")
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "apostrophe", "person_reference": name + "'s"})
        return mutant
    
                
    def generate_apostrophe_female_mutant_text(self):
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(feminine_flipper[r.word])
                    elif r.word[-2:] == "'s" :
                        t.append(name + "'s")
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "apostrophe", "person_reference": name + "'s"})
        else :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(r.word)
                    elif r.word[-2:] == "'s" :
                        t.append(name + "'s")
                    else :
                        t.append(name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "apostrophe", "person_reference": name + "'s"})
        return mutant
    
    def get_person_substitution(self) :
                
        doc_person_reference = nlp(self.person_reference.get_name())

        person_substitution = ""
        
        start_marker = False
        
        for token in doc_person_reference:
            
            if start_marker == True :
                person_substitution += (token.text + " ")
            else :
                if token.text in self.person_entities and token.dep_ == "ROOT":
                    start_marker = True
                    person_substitution += "@ROOT "
            
        person_substitution = person_substitution[:-1] 
        
        return person_substitution
    
    def generate_male_mutant_text_by_specific_person_reference_template(self) :
        person_substitution = self.get_person_substitution()
        
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = person_substitution.replace("@ROOT", name)
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "part of person reference", "person_reference": substituted_name })
        else :
            gender = "female"
            for name in mnames :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = person_substitution.replace("@ROOT", name)
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(masculine_flipper[r.word])
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "part of person reference", "person_reference": substituted_name})
        return mutant
    
        
    def generate_female_mutant_text_by_specific_person_reference_template(self) :
        person_substitution = self.get_person_substitution()
        
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        
        if self.is_male :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = person_substitution.replace("@ROOT", name)
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(feminine_flipper[r.word])
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "part of person reference", "person_reference": substituted_name})
        else :
            for name in fnames :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = person_substitution.replace("@ROOT", name)
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "part of person reference", "person_reference": substituted_name})
        return mutant
    
    def get_gender_associated_word_substitution(self) :
                
        doc_person_reference = nlp(self.person_reference.get_name())

        word_substitution = ""
        
        for token in doc_person_reference:
            
            if token.dep_ == "ROOT":
                word_substitution += "@ROOT "
            else :
                word_substitution += (token.text + " ")

        word_substitution = word_substitution[:-1] 
        
        return word_substitution
    
    def generate_male_mutant_using_gender_associated_word(self) :
        substituted_word = self.get_gender_associated_word_substitution()
        
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
        if self.is_male :
            for name in gaw["masculine"].values :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = substituted_word.replace("@ROOT", name)
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "gender associated word", "person_reference": substituted_name})
        else :
            gender = "female"
            for name in gaw["masculine"].values :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = substituted_word.replace("@ROOT", name)
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(masculine_flipper[r.word])
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "male", "text": "".join(t), "mutant_generation": "gender associated word", "person_reference": substituted_name})
        return mutant
        
    def generate_female_mutant_using_gender_associated_word(self) :
        substituted_word = self.get_gender_associated_word_substitution()
        
        refs = self.person_reference.get_reference()
        chunk = self.chunk
        mutant = []
    
        if self.is_male :
            for name in gaw["feminine"].values :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = substituted_word.replace("@ROOT", name)
                for r in refs :
                    if r.word in masculine_pronoun :
                        t.append(feminine_flipper[r.word])
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "gender associated word", "person_reference": substituted_name})
        else :
            for name in gaw["feminine"].values :
                t = []
                t.append(chunk[0])
                i = 1
                substituted_name = substituted_word.replace("@ROOT", name)
                for r in refs :
                    if r.word in feminine_pronoun :
                        t.append(r.word)
                    else :
                        t.append(substituted_name)
                        
                    t.append(chunk[i])
                    i += 1
                mutant.append({"gender": "female", "text": "".join(t), "mutant_generation": "gender associated word", "person_reference": substituted_name})
        return mutant

        
    def generate_normal_mutant_text(self):
        chunk = self.chunk
        mutants = []
        male_mutant = self.generate_normal_male_mutant_text()
        for _mutant in male_mutant :
            mutants.append(_mutant)
        female_mutant = self.generate_normal_female_mutant_text()
        for _mutant in female_mutant :
            mutants.append(_mutant)
        return mutants
        
    def generate_apostrophe_mutant_text(self):
        chunk = self.chunk
        mutants = []
        male_mutant = self.generate_apostrophe_male_mutant_text()
        for _mutant in male_mutant :
            mutants.append(_mutant)
        female_mutant = self.generate_apostrophe_female_mutant_text()
        for _mutant in female_mutant :
            mutants.append(_mutant)
        return mutants

    def generate_mutant_text_from_the_part_of_person_reference(self) :
        chunk = self.chunk
        mutants = []
        male_mutant = self.generate_male_mutant_text_by_specific_person_reference_template()
        for _mutant in male_mutant :
            mutants.append(_mutant)
        female_mutant = self.generate_female_mutant_text_by_specific_person_reference_template()
        for _mutant in female_mutant :
            mutants.append(_mutant)
        return mutants
    
    def generate_mutant_using_gender_associated_word(self) :
        chunk = self.chunk
        mutants = []
        male_mutant = self.generate_male_mutant_using_gender_associated_word()
        for _mutant in male_mutant :
            mutants.append(_mutant)
        female_mutant = self.generate_female_mutant_using_gender_associated_word()
        for _mutant in female_mutant :
            mutants.append(_mutant)
        return mutants
    
    def generate_mutants(self) :
        mutants = []
        if self.is_the_person_reference_is_a_person_name() :
#             print("The person reference is a person name")
            if (self.is_the_person_reference_is_followed_by_an_aposthrope()) :
#                 print("it's followed by an apostrophe")
                mutants = self.generate_apostrophe_mutant_text()
            else :
#                 print("it's not followed by an apostrophe")
                mutants = self.generate_normal_mutant_text()
        elif self.is_the_person_name_only_occupies_several_words_in_the_person_reference() :
#             print("The person name only occupies several words in the person reference")
            mutants = self.generate_mutant_text_from_the_part_of_person_reference()
        elif self.is_the_main_person_reference_is_a_pronoun() :
#             print("The main person reference is a pronoun")
            if self.is_there_is_a_person_name_inside_the_references() :
#                 print("There is a person name inside the references")
                mutants = self.generate_normal_mutant_text()
            else :
                mutants = []
#                 print("There isn't any person name inside the references")
        elif self.is_the_person_reference_is_a_gender_associated_word() :
#             print("The person reference is a gender associated word")
            mutants = self.generate_mutant_using_gender_associated_word()
        else :
            mutants = []
#             print("Skip")
        
        return mutants
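
The core idea behind the chunk-based generation in `Coref` can be shown without spaCy or neuralcoref: split the text around the character spans of the mentions, then join the pieces back together with a replacement for each mention. The following is a simplified sketch under that assumption; the helper names `split_around_spans` and `substitute`, the example sentence, and the hand-written spans are all hypothetical. Unlike the class above, it keeps empty pieces as empty strings instead of replacing them with a space.

```python
def split_around_spans(text, spans):
    """Return the pieces of text before, between, and after the (start, end) spans.

    Spans must be sorted and non-overlapping; n spans yield n + 1 pieces.
    """
    chunks = []
    prev_end = 0
    for start, end in spans:
        chunks.append(text[prev_end:start])
        prev_end = end
    chunks.append(text[prev_end:])
    return chunks


def substitute(chunks, replacements):
    """Interleave the text pieces with one replacement per mention."""
    parts = [chunks[0]]
    for replacement, chunk in zip(replacements, chunks[1:]):
        parts.append(replacement)
        parts.append(chunk)
    return "".join(parts)


text = "Anna said she liked it."
spans = [(0, 4), (10, 13)]  # character offsets of "Anna" and "she"

chunks = split_around_spans(text, spans)
mutant = substitute(chunks, ["Alonzo", "he"])
print(mutant)  # Alonzo said he liked it.
```

In the notebook the spans come from the coreference cluster of the single gendered subject, and the replacements are either a sampled name or a flipped pronoun.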

In [59]:
text = "About your terrible movie copying Beethoven. As a professional musician it\'s my duty to watch every movie made about any composer and Beethoven is one of my favorites. When Hungarians and Americans meet, it\'s a terrible combination of empty over the top emotions combined with the worst taste possible. You proved it in your terrible b-movie. The only thing that carries the movie is the music. Of course you didn\'t bother to look further than the good but in my taste contrived performances of the Tackacs quartet, but OK I have to admit that the performances at least have quality as contrast to the movie you\'ve made. It starts of with the dying DEAF Beethoven who perfectly understands Anna who is merely whispering. Beethoven\'s hearing during the movie get\'s better by the minute, but that must be because of some vague divine thing. Then there is the quite impossible semi-pornographic \"eyes wide shut\" double-conducting scene which is totally over the top with the luscious Anna and the crying nephew in the end (who also cries in the deleted scenes with constant red eyes, my GOD what a performance). And as culmination the rip-off from Amadeus, with Beethoven dictating music to Anna not in notes but in total nonsense, which she understands perfectly but no-one else in your audience even trained professional musicians will understand. Of course your reaction will be that negative response is a response at least, but I can assure you that Beethoven himself is turning in his grave because of your worthless creation and with reason. This so called homage is blasphemy and I am so sorry to have rented one of the worst movies ever made even though it\'s about my favorite subject. Ed Harris and others, you cannot comprehend the greatness of Beethoven in your wildest dreams and certainly not after a couple of lessons in conducting and violin playing. That's the trouble with you Americans: you think you can grasp everything even when it takes a lifetime of hard work. 
Yeah we can do it anyway! Remember that a good product comes with hard labor, talent, devotion and professionalism. All these you creators of Copying Beethoven lack. See you in kindergarten."

In [60]:
c = Coref(text)

In [61]:
c.get_resolved()

'About your terrible movie copying Beethoven. As a professional musician it\'s my duty to watch every movie made about any composer and Beethoven is one of my favorites. When Hungarians and Americans meet, it\'s a terrible combination of empty over the top emotions combined with the worst taste possible. You proved it in your terrible b-movie. The only thing that carries your terrible b-movie is the music. Of course you didn\'t bother to look further than the good but in my taste contrived performances of the Tackacs quartet, but OK I have to admit that the performances at least have quality as contrast to your terrible b-movie. It starts of with the dying DEAF Beethoven who perfectly understands Anna who is merely whispering. Beethoven\'s hearing during your terrible b-movie get\'s better by the minute, but that must be because of some vague divine thing. Then there is the quite impossible semi-pornographic "eyes wide shut" double-conducting scene which is totally over the top with th

In [62]:
for r in c.get_refs():
    print(r.get_name())
    print(r.get_reference())

Beethoven
[Beethoven, Beethoven, Beethoven, Beethoven]
it
[it, it]
your terrible b-movie
[your terrible b-movie, the movie, the movie you've made, the movie]
Anna
[Anna, Anna, Anna, she]
Beethoven himself
[Beethoven himself, his, Beethoven, Beethoven]
This
[This, it]
Ed Harris and others
[Ed Harris and others, you Americans]


In [63]:
c.is_having_one_subject()

True

In [64]:
if c.is_having_one_subject():
    print(c.get_person_reference())
#     print(c.generate_mutant_text())

Anna: ['Anna', 'Anna', 'Anna', 'she']


In [65]:
dt = df[["label", "template"]]

In [66]:
dt = dt.drop_duplicates().reset_index(drop=True)
dt

Unnamed: 0,label,template
0,1,"I have only see three episodes of Hack, starri..."
1,1,In the groovy mid 70's a scruffy bunch of bras...
2,1,This must have been one of Chaplin's most ambi...
3,1,The debut that plucked from obscurity one of t...
4,1,There is really no way to compare this motion ...
...,...,...
6893,1,I heard they were going to remake this French ...
6894,1,"Well, the movie did turn out a lot better than..."
6895,0,"In this film, there is a loose plot of a man (..."
6896,1,The French Babbette appears at the modest hous...


In [69]:
import time
start = time.time()

original_arr = []
mutant_arr = []
gender_arr = []
label_arr = []
generation_approach_arr = []
person_reference_arr = []

for _, row in dt.iterrows():
    label = row["label"]
    text = row["template"]
    c = Coref(text)
    # skip templates that Coref does not report as having exactly one subject
    if c.is_having_one_subject():
        # keep the unmodified template as its own row, tagged "template"
        original_arr.append(text)
        mutant_arr.append(text)
        gender_arr.append("template")
        person_reference_arr.append("template")
        generation_approach_arr.append("template")
        label_arr.append(label)

        # one row per generated mutant, sharing the template's label
        mtext = c.generate_mutants()
        for m in mtext:
            original_arr.append(text)
            mutant_arr.append(m["text"])
            gender_arr.append(m["gender"])
            person_reference_arr.append(m["person_reference"])
            generation_approach_arr.append(m["mutant_generation"])
            label_arr.append(label)

end = time.time()
print("Execution Time: ", end-start)

Execution Time:  1041.3692698478699
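
The loop above keeps six parallel lists in step by hand. A minimal sketch of an alternative that collects one dict per row and builds the frame once. `fake_mutants` and `build_rows` are hypothetical helpers introduced for illustration. `fake_mutants` is only a stand-in for `Coref(text).generate_mutants()` and mimics the keys used above (`text`, `gender`, `person_reference`, `mutant_generation`).

```python
import pandas as pd

def fake_mutants(text):
    # Hypothetical stand-in for Coref(text).generate_mutants(); not the real class.
    return [
        {"text": text.replace("Anna", name), "gender": gender,
         "person_reference": name, "mutant_generation": "normal"}
        for name, gender in [("Alonzo", "male"), ("Ebony", "female")]
    ]

def build_rows(label, text, mutants):
    # First the untouched template row, then one row per mutant.
    rows = [{"label": label, "mutant": text, "gender": "template",
             "original": text, "person_reference": "template",
             "generation_approach": "template"}]
    for m in mutants:
        rows.append({"label": label, "mutant": m["text"], "gender": m["gender"],
                     "original": text, "person_reference": m["person_reference"],
                     "generation_approach": m["mutant_generation"]})
    return rows

# Toy templates, made up for this sketch.
toy = pd.DataFrame({"label": [1, 0],
                    "template": ["Anna sings well.", "Anna cannot act."]})

records = []
for _, row in toy.iterrows():
    records.extend(build_rows(row["label"], row["template"],
                              fake_mutants(row["template"])))

toy_mutants = pd.DataFrame.from_records(records)
print(toy_mutants[["label", "mutant", "gender"]])
```

With one dict per row, a missing field raises a `KeyError` at the row that caused it instead of silently producing lists of different lengths.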


In [70]:
dmutant = pd.DataFrame(data={
    "label": label_arr,
    "mutant": mutant_arr,
    "gender": gender_arr,
    "original": original_arr,
    "person_reference": person_reference_arr,
    "generation_approach": generation_approach_arr,
})
dmutant

Unnamed: 0,label,mutant,gender,original,person_reference,generation_approach
0,1,"I have only see three episodes of Hack, starri...",template,"I have only see three episodes of Hack, starri...",template,template
1,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",Alonzo,normal
2,1,"I have only see three episodes of Hack, starri...",female,"I have only see three episodes of Hack, starri...",Ebony,normal
3,1,In the groovy mid 70's a scruffy bunch of bras...,template,In the groovy mid 70's a scruffy bunch of bras...,template,template
4,1,In the groovy mid 70's a scruffy bunch of bras...,male,In the groovy mid 70's a scruffy bunch of bras...,Alonzo,normal
...,...,...,...,...,...,...
29067,1,"Well, the movie did turn out a lot better than...",template,"Well, the movie did turn out a lot better than...",template,template
29068,1,"Well, the movie did turn out a lot better than...",male,"Well, the movie did turn out a lot better than...",Alonzo,normal
29069,1,"Well, the movie did turn out a lot better than...",female,"Well, the movie did turn out a lot better than...",Ebony,normal
29070,1,The French Babbette appears at the modest hous...,template,The French Babbette appears at the modest hous...,template,template


In [71]:
dmutant["original"] = dmutant["original"].astype("category")
dmutant["template_id"] = dmutant["original"].cat.codes
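
`cat.codes` numbers the templates by the sorted order of the category values, not by order of appearance, so the first template in the frame does not necessarily get id 0. A small sketch on made-up strings. `pd.factorize` is shown only as an alternative for when appearance order is wanted; the notebook itself uses `cat.codes`.

```python
import pandas as pd

s = pd.Series(["pear", "apple", "pear", "banana"])

# Categories are inferred in sorted order: apple=0, banana=1, pear=2.
codes_sorted = s.astype("category").cat.codes.tolist()

# factorize assigns ids in order of first appearance: pear=0, apple=1, banana=2.
codes_seen, uniques = pd.factorize(s)

print(codes_sorted)         # [2, 0, 2, 1]
print(codes_seen.tolist())  # [0, 1, 0, 2]
```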

In [72]:
dmutant.groupby("template_id").count()

Unnamed: 0_level_0,label,mutant,gender,original,person_reference,generation_approach
template_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,3,3,3,3,3,3
1,3,3,3,3,3,3
2,1,1,1,1,1,1
3,27,27,27,27,27,27
4,3,3,3,3,3,3
...,...,...,...,...,...,...
6077,27,27,27,27,27,27
6078,1,1,1,1,1,1
6079,1,1,1,1,1,1
6080,3,3,3,3,3,3
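
The per-template counts above take only a few distinct values (1, 3 and 27 are visible). A sketch, on toy data, of summarising how many templates fall into each group size with `groupby(...).size()` followed by `value_counts()`. The frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "template_id": [0, 0, 0, 1, 2, 2, 2],
    "mutant": list("abcdefg"),
})

# Number of rows per template, then how often each row count occurs.
rows_per_template = toy.groupby("template_id").size()
size_distribution = rows_per_template.value_counts().sort_index()
print(size_distribution)
```

On the real frame the same two lines would be applied to `dmutant`. A group size of 1 is a template row for which no mutant rows were appended.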


In [73]:
dms = dmutant[dmutant["gender"] == "template"]
dmd = dmutant[dmutant["gender"] != "template"]

In [74]:
dmd = dmd.reset_index(drop=True)
dmd

Unnamed: 0,label,mutant,gender,original,person_reference,generation_approach,template_id
0,1,"I have only see three episodes of Hack, starri...",male,"I have only see three episodes of Hack, starri...",Alonzo,normal,1845
1,1,"I have only see three episodes of Hack, starri...",female,"I have only see three episodes of Hack, starri...",Ebony,normal,1845
2,1,In the groovy mid 70's a scruffy bunch of bras...,male,In the groovy mid 70's a scruffy bunch of bras...,Alonzo,normal,2830
3,1,In the groovy mid 70's a scruffy bunch of bras...,female,In the groovy mid 70's a scruffy bunch of bras...,Ebony,normal,2830
4,1,This must have been one of Alonzo's most ambit...,male,This must have been one of Chaplin's most ambi...,Alonzo,normal,5199
...,...,...,...,...,...,...,...
22985,0,My Take: Yet another lame PG-13 horror movie w...,female,My Take: Yet another lame PG-13 horror movie w...,Ebony,normal,3325
22986,1,I heard they were going to remake this French ...,male,I heard they were going to remake this French ...,Alonzo ( Jean Servais,part of person reference,1898
22987,1,I heard they were going to remake this French ...,female,I heard they were going to remake this French ...,Ebony ( Jean Servais,part of person reference,1898
22988,1,"Well, the movie did turn out a lot better than...",male,"Well, the movie did turn out a lot better than...",Alonzo,normal,5541
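
The mutants shown above come in male/female pairs, so one possible sanity check is that every template has the same number of male and female rows. A sketch on toy data. The balance requirement is an assumption about what the generator should produce, not something the notebook asserts.

```python
import pandas as pd

toy = pd.DataFrame({
    "template_id": [10, 10, 11, 11, 11],
    "gender": ["male", "female", "male", "female", "male"],
})

# One row per template, one column per gender, cells hold row counts.
per_gender = pd.crosstab(toy["template_id"], toy["gender"])

# Templates whose male and female counts differ.
unbalanced = per_gender[per_gender["male"] != per_gender["female"]]
print(unbalanced.index.tolist())  # [11]
```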


In [47]:
dmd.groupby("template_id").count()

Unnamed: 0_level_0,label,mutant,gender,original,name
template_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,310,310,310,310,310
1,310,310,310,310,310
2,310,310,310,310,310
3,310,310,310,310,310
4,310,310,310,310,310
...,...,...,...,...,...
6078,310,310,310,310,310
6079,310,310,310,310,310
6080,310,310,310,310,310
6081,310,310,310,310,310


In [48]:
dms = dms.reset_index(drop=True)
dms

Unnamed: 0,label,mutant,gender,original,name,template_id
0,1,"I have only see three episodes of Hack, starri...",template,"I have only see three episodes of Hack, starri...",template,1846
1,1,In the groovy mid 70's a scruffy bunch of bras...,template,In the groovy mid 70's a scruffy bunch of bras...,template,2831
2,1,This must have been one of Chaplin's most ambi...,template,This must have been one of Chaplin's most ambi...,template,5200
3,1,The debut that plucked from obscurity one of t...,template,The debut that plucked from obscurity one of t...,template,4244
4,1,There is really no way to compare this motion ...,template,There is really no way to compare this motion ...,template,4491
...,...,...,...,...,...,...
6078,0,My Take: Yet another lame PG-13 horror movie w...,template,My Take: Yet another lame PG-13 horror movie w...,template,3326
6079,1,I heard they were going to remake this French ...,template,I heard they were going to remake this French ...,template,1899
6080,1,"Well, the movie did turn out a lot better than...",template,"Well, the movie did turn out a lot better than...",template,5542
6081,1,The French Babbette appears at the modest hous...,template,The French Babbette appears at the modest hous...,template,4152


In [49]:
c = dms.groupby("template_id").count()
c

Unnamed: 0_level_0,label,mutant,gender,original,name
template_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,1,1,1,1,1
1,1,1,1,1,1
2,1,1,1,1,1
3,1,1,1,1,1
4,1,1,1,1,1
...,...,...,...,...,...
6078,1,1,1,1,1
6079,1,1,1,1,1
6080,1,1,1,1,1
6081,1,1,1,1,1


In [50]:
# dms.groupby("template_id").get_group(302)["mutant"].values[1]

In [51]:
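import os

dirname = "../data/gc_imdb/"

# create the output directory if it does not exist yet
os.makedirs(dirname, exist_ok=True)

# tab-separated, no header row and no index column, like the IMDB mutant files read earlier
dmutant.to_csv(os.path.join(dirname, "test.csv"), index=False, header=False, sep="\t")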
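
Because the file is written without a header, a reader has to supply the column names in the same order. A round-trip sketch with a temporary directory and a two-row toy frame. The column list mirrors `dmutant` after `template_id` was added, and the path is a throwaway, not the notebook's `../data/gc_imdb/`.

```python
import os
import tempfile
import pandas as pd

columns = ["label", "mutant", "gender", "original",
           "person_reference", "generation_approach", "template_id"]

# Toy rows, made up for this sketch.
toy = pd.DataFrame([
    [1, "Alonzo sings well.", "male", "Anna sings well.", "Alonzo", "normal", 0],
    [1, "Ebony sings well.", "female", "Anna sings well.", "Ebony", "normal", 0],
], columns=columns)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    # Same writer settings as the cell above: tab-separated, no header, no index.
    toy.to_csv(path, index=False, header=False, sep="\t")
    # Reader supplies the names because the file carries none.
    restored = pd.read_csv(path, header=None, sep="\t", names=columns)

print(restored.shape)
```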