# Data Analysis (Part 1)

This notebook takes the WEAT and CNN results in the format produced by my scripts and transforms them into the format that we will use to analyze the results for the paper.

I'm assuming that you have an established naming convention for each embedding file ``[EMB]_[NAME]`` (where ``[EMB]`` is the type of embedding) and for the WEAT test that was run. The convention I assumed is:
- The embedding file has ``[EMB]`` somewhere in its name. I would also recommend removing the extension, as this script adds it, but that's up to your naming conventions.
- You gave each experiment a name ``[EXP]``.
- The results from WEAT are in ``xweat/[EMB]_[NAME]_[EXP].res``
- The results from the CNN are in:
    - ``cnn/task1_[EXP].txt`` for general results
    - ``cnn/task1_[EXP]_g1.txt`` for group 1
    - ``cnn/task1_[EXP]_g2.txt`` for group 2
    
This follows the naming convention I used in my scripts; it can be changed in the ``get_paths`` function.
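
As a quick illustration of the convention above, the following sketch builds the expected paths for one experiment. The helper name ``expected_paths`` and the sample values ``w2v_tweets`` and ``exp1`` are placeholders made up for this example; the notebook's own logic lives in ``get_paths``.

```python
def expected_paths(emb, exp, root="./results/"):
    """Hypothetical helper that mirrors the naming convention described above."""
    # WEAT results: xweat/[EMB]_[NAME]_[EXP].res
    weat_file = root + "xweat/" + emb + "_" + exp + ".res"
    # CNN results: one file per group, differing only in the suffix
    cnn_files = [root + "cnn/task1_" + exp + suffix
                 for suffix in ("_g1.txt", "_g2.txt")]
    return weat_file, cnn_files


# Placeholder values, not real experiment names
weat_file, cnn_files = expected_paths("w2v_tweets", "exp1")
print(weat_file)
print(cnn_files)
```

If your files follow a different layout, only the string concatenations need to change.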

In [1]:
# We import the packages we will be using

import numpy as np
import pandas as pd
import warnings

These variables point to the location of the results files.

In [2]:
# Set here the path to the results
path = "./results/"

# I'm assuming that you have a folder for each of these, but that can be changed here
cnn_path = path + "cnn/"
weat_path = path + "xweat/"

# The name of the file where we will save the results
df_name = "results_es.csv"

Here we have to give a tuple for each experiment. It must be of the form:

``( embedding_file , experiment_name , weat_test )``

If you add any extra information, try to add it after the ``weat_test`` value to avoid having to modify more than is necessary.

In [3]:
# If you are adding experiments manually, use this
experiments = [
    ("w2v_all_tweets_processed_es.tsv.vectors", "w2v_alltweets_t7", "7")
]

In [4]:
# If you are adding the experiments straight from the experiment.txt file, use this
experiments = []

with open("experiment_1.txt", mode="r", encoding="utf-8") as file:
    for raw_line in file:
        line = raw_line.split()
        # Skip empty lines
        if not line:
            continue
        embeddings  = line[4]
        exp_name    = line[2][1:-1]
        weat_number = line[3]
        # For the standard WEAT tests we keep the test number; the Spanish tests are
        # renamed: es1 becomes "gender" and es2 becomes "migrant"
        if weat_number == "es1":
            weat_number = "gender"
        elif weat_number == "es2":
            weat_number = "migrant"
        experiments.append((embeddings, exp_name, weat_number))

# Make sure you have the right amount of experiments!
print(len(experiments))

38
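
The layout of ``experiment_1.txt`` is not shown in this notebook, so the sketch below only illustrates the indexing that the cell above relies on: the experiment name is field 2 with its first and last character removed, the WEAT test is field 3, and the embedding file is field 4. The sample line and the helper name ``parse_experiment_line`` are made up for illustration.

```python
def parse_experiment_line(raw_line):
    """Hypothetical helper: turn one whitespace-separated line into an experiment tuple."""
    fields = raw_line.split()
    embeddings = fields[4]
    # Drop the first and last character (assumed to be delimiters such as quotes)
    exp_name = fields[2][1:-1]
    # es1 and es2 are renamed; any other value is kept as-is
    weat_test = {"es1": "gender", "es2": "migrant"}.get(fields[3], fields[3])
    return embeddings, exp_name, weat_test


# Made-up line: only the positions of fields 2, 3 and 4 matter here
sample = 'run 01 "w2v_alltweets_es1" es1 w2v_all_tweets_processed_es.tsv'
parsed = parse_experiment_line(sample)
print(parsed)
```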


This returns the paths to the WEAT results file and to the CNN results files. This is the function to modify if you used a different naming convention for those files.

In [5]:
def get_paths(experiment, weat_path, cnn_path, cnn_files=("_g1.txt", "_g2.txt")):
    """
    Build the paths to the WEAT results file and to the CNN results files of an experiment.
    weat_path and cnn_path tell the program where to find those results.
    Assuming that the CNN results we want to compare only differ in the suffix of the filename, we use cnn_files
    to determine their locations. For example, if the files are "RESULTS_g1.txt" and "RESULTS_g2.txt",
    the common part is RESULTS and we have cnn_files=("_g1.txt", "_g2.txt")
    """
    
    # Read the info we passed for the experiment
    emb = experiment[0]
    exp = experiment[1]
    tst = experiment[2]
    
    # Determine the paths to the files
    weat_file = weat_path + emb + "_" + exp + ".res"
    cnn_file  = [cnn_path + "task1_" + exp + file for file in cnn_files]
    
    return weat_file, cnn_file

This reads the WEAT test results given the path to the results file. It currently only gives back the EffectSize value, but it can be modified to also fetch the other values.

In [6]:
def read_weat_results(weat_file):
    """
    Return the WEAT results.
    
    Input:
        weat_file   A string that contains the path to the file that has the results
        
    Output:
        EffectSize  The effect of the bias on the dataset. It is a float in the interval [-2, 2]
    """
    
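    # Read the file and split every line into lowercase tokens
    with open(weat_file, "r", encoding="utf8") as file:
        data = [line.strip().lower().split() for line in file]
    
    # Extract the results from the second line of the file
    # (the slices drop the punctuation attached to each number)
    WeatStatistic = float(data[1][1][1:-1])
    EffectSize    = float(data[1][2][:-1])
    pValue        = float(data[1][3][:-1])
    
    # Only the effect size is used for now; return the other values here if needed
    return EffectSize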
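
The exact layout of the ``.res`` file is not shown here, so the following sketch uses a made-up two-line file whose second line merely fits the slicing used above (one leading token, then three numbers with punctuation attached). The helper name ``parse_weat_line`` is hypothetical.

```python
def parse_weat_line(tokens):
    """Hypothetical helper: extract the three numbers from a tokenized results line."""
    statistic = float(tokens[1][1:-1])   # drop one leading and one trailing character
    effect_size = float(tokens[2][:-1])  # drop one trailing character
    p_value = float(tokens[3][:-1])      # drop one trailing character
    return statistic, effect_size, p_value


# Made-up file content; only the token positions and punctuation matter
sample_text = "some header\nresult: (0.1234, 1.5678, 0.0010)\n"
rows = [line.strip().lower().split() for line in sample_text.splitlines()]
statistic, effect_size, p_value = parse_weat_line(rows[1])
print(effect_size)
```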

Here we read the results from two runs of the CNN and return, for each metric that we passed, the pair of values from which the performance gap is computed.

In [25]:
def read_cnn_results(cnn_files, metrics=("precision", "recall")):
    """
    A function that reads the performance of two runs of the CNN model and returns, for each metric, the pair of
    values needed to compute the gap between them. A list of the performance metrics to use can be passed.
    
    Input:

        cnn_files   A list or list-like object where the first two elements are the paths to the two CNN results
                    that we want to compare.
        
        metrics     A list or list-like object with the metrics that we want to compare. If using the CNN from the
                    github user rimusa, the accepted values are: "accuracy", "precision", "recall", and "f1-score".
                    The default value is metrics=("precision", "recall")
                    
    Output:
        
        A list with one entry per metric, of the form ( ( value_run_1 , value_run_2 ) , metric )
    
    """
    
    # If we do not have at least two files to compare, we cannot find the gap between them.
    assert len(cnn_files) >= 2
    
    # Initialize the results list and the list that contains the gaps
    results = ["",""]
    performance_gaps = []
    
    # We only read the first two files in the cnn_files list
    for i in range(2):
        
        # Set the path to the current file
        file = cnn_files[i]
        
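        # Read the file and split every line into lowercase tokens
        with open(file, "r", encoding="utf8") as handle:
            data = [line.strip().lower().split() for line in handle]
        
        # Transforms the results to a dictionary
        group = {}
        for item in data:
            # Skip lines that do not have a "name: value" pair
            if len(item) < 2:
                continue
            key = item[0][:-1]
            group[key] = item[1]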
            
        # Stores the results
        results[i] = group

    # For each metric, store the pair of values (first run, second run);
    # the gap itself is computed later in fetch_datapoint
    for metric in metrics:
        pair = (float(results[0][metric]), float(results[1][metric]))
        performance_gaps.append((pair, metric))
    
    # Return the list of value pairs, one entry per metric
    return performance_gaps
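
To make the expected input concrete, here is a small sketch that parses two made-up CNN result files with lines of the form ``name: value``, which is the shape the function above relies on (the trailing colon is stripped from the name). The file contents and the helper name ``parse_metrics`` are invented for illustration.

```python
def parse_metrics(text):
    """Hypothetical helper: map each metric name to its value."""
    metrics = {}
    for line in text.splitlines():
        tokens = line.strip().lower().split()
        if len(tokens) < 2:
            continue
        # "precision:" -> "precision"
        metrics[tokens[0][:-1]] = float(tokens[1])
    return metrics


# Made-up results for the two groups
g1 = parse_metrics("Accuracy: 0.80\nPrecision: 0.75\nRecall: 0.60\n")
g2 = parse_metrics("Accuracy: 0.70\nPrecision: 0.50\nRecall: 0.65\n")

# Same structure as the return value above: ((value_run_1, value_run_2), metric)
pairs = [((g1[m], g2[m]), m) for m in ("precision", "recall")]
print(pairs)
```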

Here we build the rows of the final results table: two per experiment, one for each metric.

In [29]:
def fetch_datapoint(weat_file, cnn_files, experiment):
    """
    This function obtains the relevant data for each experiment.
    
    Input:
        
        weat_file   A string that contains the path to the file that has the results.
        
        cnn_files   A list or list-like object where the first two elements are the paths to the two CNN results
                    that we want to compare.
                    
        experiment  A list or list-like object for the form
                    [ embedding_file , experiment_name , weat_test ]
                    
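    Output:
    
        A list with two dictionaries (one per metric: precision and recall), each of
        which becomes one row of the final table.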
    """
    
    test = experiment[2]
    emb_name = experiment[0]
    
    data1 = {}
    data2 = {}
    
    data1["WEAT"] = data2["WEAT"] = read_weat_results(weat_file)
    
    results = read_cnn_results(cnn_files, metrics=["precision","recall"])
    #print(results)
    
    data1["Performance Gender"] = results[0][0][0]
    data1["Performance Migrant"] = results[0][0][1]
    data1["Performance Gap"] = results[0][0][0] - results[0][0][1]
    data1["Metric"] = results[0][1]
    
    data2["Performance Gender"] = results[1][0][0]
    data2["Performance Migrant"] = results[1][0][1]
    data2["Performance Gap"] = results[1][0][0] - results[1][0][1]
    data2["Metric"] = results[1][1]
    
    data1["Test"] = data2["Test"] = test
    
    info = emb_name.lower().split("_")
    
    if ("ft" in info) or ("fasttext" in info):
        data1["Embedding"] = "fastText"
    elif ("w2v" in info) or ("word2vec" in info):
        data1["Embedding"] = "word2vec"
    else:
        # Keep a placeholder so that the row can still be built
        data1["Embedding"] = "unknown"
        warnings.warn("Embedding kind not recognized.\nCheck that one of the underscore-separated parts of the " +
                      "embedding file name is recognized by the 'fetch_datapoint' function.")
    data2["Embedding"] = data1["Embedding"]
    
    data1["Name"] = data2["Name"] = emb_name + ".vectors"
        
    return [data1, data2]
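
The embedding type is detected by splitting the file name on underscores and looking for a known tag among the parts. The sketch below isolates that step; the function name ``detect_embedding`` and the sample file names are made up, and returning ``None`` for an unknown type is a choice made for this example only.

```python
def detect_embedding(emb_name):
    """Hypothetical helper: guess the embedding type from the file name."""
    parts = emb_name.lower().split("_")
    if "ft" in parts or "fasttext" in parts:
        return "fastText"
    if "w2v" in parts or "word2vec" in parts:
        return "word2vec"
    return None


# Made-up file names
print(detect_embedding("w2v_all_tweets"))   # tag at the start
print(detect_embedding("tweets_FT_es"))     # tag in the middle, any case
print(detect_embedding("myfasttext_es"))    # no match: the tag must be a whole part
```

Note that the tag has to be a complete underscore-separated part of the name, so a name such as ``myfasttext_es`` is not recognized.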

In [30]:
results = []
errors = 0

for experiment in experiments:
    try:
        weat_file, cnn_files = get_paths(experiment, weat_path, cnn_path)
        results += fetch_datapoint(weat_file, cnn_files, experiment)
    except FileNotFoundError:
        print("Error with test:",experiment[1])
        errors += 1
    
print(errors,"errors found!")
#print(results)

0 errors found!


In [31]:
# to_csv returns None when it is given a path, so we keep the DataFrame separately
df = pd.DataFrame(results)
df.to_csv(df_name)