# Lab 1: ConceptNet Ethics Testing

#### Henry Lambson, Alex Gregory, Mike Wisniewski

### Part 1: Overview of Bias

For our analysis of ConceptNet, our group decided to investigate potential gender bias in job titles, for example "waiter" vs. "waitress". We believe this investigation is relevant because we expect ConceptNet trained on the large, unfiltered GloVe data to be inherently biased towards male positions. More broadly, we expect the model to favor male terms in general, and we want to use job titles as a way to expose this bias. If the model is discriminatory when it is not meant to be, it can affect both the business using the model and the consumers who use the business. Businesses risk their bottom line and credibility, while consumers risk being alienated because of their identity. For these reasons, we believe that low-level NLP models can be biased in a way that hurts businesses trying to reach consumers equally.

### Part 2: Research Questions

Our main research question for this project is: "Is ConceptNet trained on the large GloVe data biased towards male job titles and biased against female job titles?" 

This first large section of code is taken from the in-class example. We use the same pre-processing that was shown in class. Once we have a sentiment analyzer, we will begin testing it for different biases.

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import seaborn
import re

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
# Configure how graphs will show up in this notebook
%matplotlib inline
seaborn.set_context('notebook', rc={'figure.figsize': (10, 6)}, font_scale=1.5)

In [6]:
%%time
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)
    
    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

embeddings = load_embeddings('glove.840B.300d/glove.840B.300d.txt')
embeddings.shape

Wall time: 5min


(2196018, 300)
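
To make the header-row handling in `load_embeddings` concrete, here is a minimal sketch on made-up data. The helper `parse_embedding_lines` and the toy rows are hypothetical and only mirror the parsing logic above on an in-memory list instead of a file; they are not part of the in-class example.

```python
import numpy as np
import pandas as pd


def parse_embedding_lines(lines):
    """Condensed stand-in for load_embeddings that reads an iterable of lines."""
    labels, rows = [], []
    for line in lines:
        items = line.rstrip().split(' ')
        if len(items) == 2:
            # Header row such as "2 3" (vocabulary size, dimensions): skip it
            continue
        labels.append(items[0])
        rows.append(np.array([float(x) for x in items[1:]], dtype='f'))
    return pd.DataFrame(np.vstack(rows), index=labels, dtype='f')


# Hypothetical toy data: one variant with a header line, one without
with_header = ["2 3", "waiter 0.1 0.2 0.3", "waitress 0.4 0.5 0.6"]
without_header = with_header[1:]

toy_a = parse_embedding_lines(with_header)
toy_b = parse_embedding_lines(without_header)

# Both variants should yield the same table, indexed by word
print(toy_a.shape, toy_b.shape)
print(toy_a.equals(toy_b))
```

Because the header is detected purely by having exactly two fields, the same loader can read files with or without a shape line.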

In [8]:
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon
    (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), containing
    English words in Latin-1 encoding.
    
    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon('glove.840B.300d/positive-words.txt')
neg_words = load_lexicon('glove.840B.300d/negative-words.txt')

print(len(pos_words), len(neg_words))

FileNotFoundError: [Errno 2] No such file or directory: 'glove.840B.300d/positive-words.txt'

In [None]:
# Keep only the lexicon words that actually have an embedding vector
pos_words_common = list(set(pos_words) & set(embeddings.index))
neg_words_common = list(set(neg_words) & set(embeddings.index))

# Look up the vectors for the surviving words
pos_vectors = embeddings.loc[pos_words_common]
neg_vectors = embeddings.loc[neg_words_common]
print(pos_vectors.shape, neg_vectors.shape)
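
The vocabulary filtering step can be illustrated on a small, made-up table. This is only a sketch: the names `toy_embeddings`, `toy_pos_words`, and `toy_missing` are hypothetical. It also sorts the intersection, which is an optional change from the cell above. Converting a set to a list directly can give a different row order from one run to the next, so sorting keeps the rows reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy embedding table and lexicon, for illustration only
toy_embeddings = pd.DataFrame(
    np.arange(12, dtype='f').reshape(4, 3),
    index=["good", "bad", "happy", "table"],
)
toy_pos_words = ["good", "happy", "a+"]  # "a+" has no embedding row

# sorted() fixes the row order of the intersection
toy_pos_common = sorted(set(toy_pos_words) & set(toy_embeddings.index))
toy_pos_vectors = toy_embeddings.loc[toy_pos_common]

# Lexicon entries that were dropped because they have no vector
toy_missing = sorted(set(toy_pos_words) - set(toy_embeddings.index))
print(toy_pos_vectors.shape, toy_missing)
```

Inspecting the dropped words can be a useful sanity check, since any lexicon entry without a vector silently disappears from the training data.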

In [None]:
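# Stack positive and negative vectors into one feature matrix
vectors = pd.concat([pos_vectors, neg_vectors])
# Label positive words +1 and negative words -1, in the same row order
targets = np.array([1] * len(pos_vectors) + [-1] * len(neg_vectors))
# Keep the words themselves so predictions can be traced back to them
labels = list(pos_vectors.index) + list(neg_vectors.index)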
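
The imports at the top of the notebook (`SGDClassifier`, `train_test_split`, `accuracy_score`) suggest that the next step is to split these arrays and fit a classifier. The following is a rough sketch of that step under stated assumptions, not the in-class code: it uses synthetic clusters in place of the real word vectors, and the split size, `max_iter`, default loss, and all `toy_*` names are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for pos_vectors / neg_vectors: two well-separated clusters
n_per_class, n_dims = 200, 5
toy_pos = pd.DataFrame(
    rng.normal(loc=2.0, size=(n_per_class, n_dims)),
    index=[f"pos_{i}" for i in range(n_per_class)],
)
toy_neg = pd.DataFrame(
    rng.normal(loc=-2.0, size=(n_per_class, n_dims)),
    index=[f"neg_{i}" for i in range(n_per_class)],
)

# Same construction as the cell above: features, +1/-1 targets, word labels
toy_vectors = pd.concat([toy_pos, toy_neg])
toy_targets = np.array([1] * n_per_class + [-1] * n_per_class)
toy_labels = list(toy_vectors.index)

# Hold out 10% of the rows for evaluation (the fraction is an assumption)
train_X, test_X, train_y, test_y, train_labels, test_labels = train_test_split(
    toy_vectors, toy_targets, toy_labels, test_size=0.1, random_state=0
)

# Linear classifier trained with stochastic gradient descent
model = SGDClassifier(max_iter=1000, random_state=0)
model.fit(train_X, train_y)

toy_accuracy = accuracy_score(test_y, model.predict(test_X))
print(toy_accuracy)
```

Passing `labels` through `train_test_split` alongside the vectors keeps each held-out row paired with its word, which makes it possible to inspect individual predictions later.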

### References
[1] https://careersmart.org.uk/occupations/equality/which-jobs-do-men-and-women-do-occupational-breakdown-gender