# Problem Statement:

We investigate sexual orientation bias in sentiment analysis across two sets of pretrained word embeddings: Stanford's GloVe (the 200-dimensional version trained on Twitter) and the latest version of ConceptNet Numberbatch, which takes other pretrained embeddings (GloVe, word2vec, etc.) and perturbs them using the distance to neighbors in the ConceptNet knowledge graph. We feel this investigation is pertinent given the ongoing toxicity of the Internet, Twitter in particular, and it may show how the choice of embeddings can affect the bias of a model even when it is trained on the same (probably biased) data. Furthermore, with the rise of complex deep-learning-based language models in NLP and a shift away from interpretability and methodology, this investigation may elucidate how biases can also affect the results of relatively sophisticated models.

Our research question is: How does the use of different pretrained word embeddings affect sexual orientation bias in sentiment analysis?
We hypothesize that the choice of embeddings will affect the sexual orientation bias of the resulting model to a significant extent. In particular, we believe that the GloVe embeddings trained on Twitter will exacerbate the biases presumably already present in the dataset (c'mon, it's Twitter) and that the ConceptNet Numberbatch embeddings will hopefully mitigate this bias.

In [98]:
import pandas as pd
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional, Dropout
from tensorflow.keras.models import Sequential
import tensorflow as tf

In [2]:
data = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="ISO-8859-1", header=None)
data = data[[0,5]]
data.columns = ["sentiment", "text"]

data["sentiment"] = data["sentiment"].replace(4, 1)

In [3]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer

X_train, X_test, y_train, y_test = train_test_split(data["text"], data["sentiment"], test_size=0.2, random_state=42)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

word_index = tokenizer.word_index

print('Found %s unique tokens.' % (len(word_index)))

Found 594848 unique tokens.


In [4]:
# all of this code was taken from the last lab of the previous course

def load_embeddings(filename, embed_size):
    # the embed size should match the file you load glove from
    embeddings_index = {}
    # save key/array pairs of the embeddings
    #  the key of the dictionary is the word, the array is the embedding
    # (the context manager closes the file even if parsing fails)
    with open(filename, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    print('Found %s word vectors.' % len(embeddings_index))

    # now fill in the matrix, using the ordering from the
    #  keras word tokenizer from before
    found_words = 0
    embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be ALL-ZEROS
            embedding_matrix[i] = embedding_vector
            found_words = found_words+1

    print("Embedding Shape:",embedding_matrix.shape, "\n",
        "Total words found:",found_words, "\n",
        "Percentage:",100*found_words/embedding_matrix.shape[0])
    return embedding_matrix

In [5]:
glove_embeddings = load_embeddings("glove.twitter.27B.200d.txt", 200)

Found 1193514 word vectors.
Embedding Shape: (594849, 200) 
 Total words found: 121078 
 Percentage: 20.35440927025178


In [6]:
numberbatch_embeddings = load_embeddings("numberbatch-en.txt", 300)

Found 516783 word vectors.
Embedding Shape: (594849, 300) 
 Total words found: 75671 
 Percentage: 12.721043491709661
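
The Numberbatch count printed above may include one non-word entry. Numberbatch text files are distributed in the word2vec text format, whose first line is a header giving the number of rows and the vector dimension, and `load_embeddings` would store that header as if it were a word. Below is a minimal sketch of a parser that skips such lines, assuming whitespace-separated rows. `iter_vectors` and the toy file are hypothetical and not part of the original notebook.

```python
import io
import numpy as np

def iter_vectors(lines, embed_size):
    """Yield (word, vector) pairs, skipping any row that does not have
    exactly embed_size fields after the word (e.g. a "count dim" header)."""
    for line in lines:
        values = line.split()
        if len(values) != embed_size + 1:
            continue  # header line or malformed row
        yield values[0], np.asarray(values[1:], dtype="float32")

# Toy file in word2vec text format: a header line, then two 3-d vectors.
toy_file = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy_index = dict(iter_vectors(toy_file, embed_size=3))
print(sorted(toy_index))
```

The length check also guards against a header being broadcast into a full embedding row if its first field happened to match a token in `word_index`.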


In [7]:
X_train = pad_sequences(X_train_sequences, maxlen=30)
X_test = pad_sequences(X_test_sequences, maxlen=30)
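
With its default arguments, `pad_sequences` pads short sequences with zeros on the left and truncates long ones from the left, so only the last 30 tokens of a tweet are kept. The pure-NumPy sketch below mimics that default behaviour for illustration. `pad_pre` is a hypothetical stand-in, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and left-truncate to `maxlen`,
    mirroring the pad_sequences defaults (padding='pre', truncating='pre')."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]           # keep only the last maxlen tokens
        out[row, -len(kept):] = kept   # right-align the kept tokens
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```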

In [15]:
MAX_ART_LEN = 30

# build frozen (non-trainable) embedding layers initialised with the pretrained vectors
glove_embedding_layer = Embedding(len(word_index) + 1,
                            200,
                            weights=[glove_embeddings],  # pretrained GloVe matrix
                            input_length=MAX_ART_LEN,
                            trainable=False)

numberbatch_embedding_layer = Embedding(len(word_index) + 1,
                            300,
                            weights=[numberbatch_embeddings],  # pretrained Numberbatch matrix
                            input_length=MAX_ART_LEN,
                            trainable=False)


In [87]:
glove_rnn = Sequential()
glove_rnn.add(glove_embedding_layer)
glove_rnn.add(Bidirectional(LSTM(64, return_sequences=True, dropout=0.2)))
glove_rnn.add(Bidirectional(LSTM(32, dropout=0.2)))
glove_rnn.add(Dense(1, activation='sigmoid'))
glove_rnn.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
glove_rnn.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 30, 200)           118969800 
                                                                 
 bidirectional_11 (Bidirecti  (None, 30, 128)          135680    
 onal)                                                           
                                                                 
 bidirectional_12 (Bidirecti  (None, 64)               41216     
 onal)                                                           
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 119,146,761
Trainable params: 176,961
Non-trainable params: 118,969,800
_________________________________________________________________


In [16]:
numberbatch_rnn = Sequential()
numberbatch_rnn.add(numberbatch_embedding_layer)
numberbatch_rnn.add(Bidirectional(LSTM(64, return_sequences=True, dropout=0.2)))
numberbatch_rnn.add(Bidirectional(LSTM(32, dropout=0.2)))
numberbatch_rnn.add(Dense(1, activation='sigmoid'))
numberbatch_rnn.compile(loss='binary_crossentropy',
                optimizer='adam',       
                metrics=['accuracy'])
numberbatch_rnn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 300)           178454700 
                                                                 
 bidirectional (Bidirectiona  (None, 30, 128)          186880    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               41216     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 178,682,861
Trainable params: 228,161
Non-trainable params: 178,454,700
_________________________________________________________________
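
The parameter counts in the two summaries can be checked by hand. An LSTM layer with `units` cells and `input_dim` inputs has 4 * (units * (units + input_dim) + units) weights, the bidirectional wrapper doubles that, and the frozen embedding contributes (vocabulary size + 1) * embedding dimension. The sketch below reproduces the arithmetic. The helper names are hypothetical, and the layer sizes are the ones used in the models above.

```python
def lstm_params(units, input_dim):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (units + input_dim) + units)

def model_params(vocab_rows, embed_dim):
    """Return (non-trainable, trainable) parameter counts for the
    Embedding -> BiLSTM(64) -> BiLSTM(32) -> Dense(1) stack."""
    embedding = vocab_rows * embed_dim            # frozen, so non-trainable
    bilstm_1 = 2 * lstm_params(64, embed_dim)     # input is the embedding
    bilstm_2 = 2 * lstm_params(32, 2 * 64)        # input is the 128-d BiLSTM output
    dense = 2 * 32 + 1                            # 64 weights plus one bias
    return embedding, bilstm_1 + bilstm_2 + dense

vocab_rows = 594848 + 1  # unique tokens plus the reserved padding index 0
print(model_params(vocab_rows, 200))  # (118969800, 176961)
print(model_params(vocab_rows, 300))  # (178454700, 228161)
```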


In [18]:
glove_callable = keras.callbacks.ModelCheckpoint(
    filepath="glove_model.h5",
    monitor='val_accuracy',
    save_best_only=True)

numberbatch_callable = keras.callbacks.ModelCheckpoint(
    filepath="numberbatch_model.h5",
    monitor='val_accuracy',
    save_best_only=True)

In [90]:
glove_rnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[glove_callable])

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f3f647bb1f0>

In [19]:
numberbatch_rnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[numberbatch_callable])

Epoch 1/50



   19/40000 [..............................] - ETA: 4:06 - loss: 0.6857 - accuracy: 0.5543  




Epoch 2/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f5b1b74eaf0>

### Loading the trained models from disk: 

In [9]:
glove_rnn = keras.models.load_model("glove_model.h5")
numberbatch_rnn = keras.models.load_model("numberbatch_model.h5")

In [20]:
# Returns the model's predicted sentiment (0 = negative, 1 = positive) for each sentence in a list
def get_sentiment(sentences, model):
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, maxlen=30)
    return model.predict(padded)
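
`get_sentiment` expects a list of sentences. Passing a bare string to `texts_to_sequences` would make Keras iterate over it character by character, so a single sentence needs to be wrapped in a list first. A small guard for this is sketched below with a hypothetical helper, `as_text_list`.

```python
def as_text_list(texts):
    """Wrap a single string in a list; return any other iterable as a list."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)

print(as_text_list("a single sentence"))
print(as_text_list(["first sentence", "second sentence"]))
```

Inside `get_sentiment`, the argument could be passed through `as_text_list` before it reaches `tokenizer.texts_to_sequences`.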

### Generating a test dataset

In [56]:
# Read in CommonNames.csv
names = pd.read_csv("CommonNames.csv")
male_names = names["Male"][:20] # Top 20 male names of the 2000s
female_names = names["Female"][:20] # Top 20 female names of the 2000s
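
Words missing from an embedding file are left as all-zero rows, and only about 20% (GloVe) and 13% (Numberbatch) of the tokenizer's vocabulary was matched, so it is worth checking that the names and identity terms used in the test sentences actually have vectors. Below is a minimal sketch, assuming lowercase lookups because the Keras `Tokenizer` lowercases by default. `missing_vectors` and the toy data are hypothetical.

```python
import numpy as np

def missing_vectors(words, word_index, embedding_matrix):
    """Return the words that are absent from word_index or whose
    embedding row is all zeros (not found in the embedding file)."""
    missing = []
    for word in words:
        idx = word_index.get(word.lower())
        if idx is None or not embedding_matrix[idx].any():
            missing.append(word)
    return missing

# Toy vocabulary: "bob" is tokenized but has no pretrained vector,
# and "carol" never appeared in the training text at all.
toy_word_index = {"alice": 1, "bob": 2, "straight": 3}
toy_matrix = np.zeros((4, 2))
toy_matrix[1] = [0.1, 0.2]
toy_matrix[3] = [0.3, 0.4]
toy_missing = missing_vectors(["Alice", "Bob", "straight", "Carol"],
                              toy_word_index, toy_matrix)
print(toy_missing)
```

In the notebook this could be called as `missing_vectors(list(male_names) + list(female_names), word_index, glove_embeddings)`, and likewise with `numberbatch_embeddings`.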

In [212]:
# Create a list of sentences with formatting
lgbtq_sentence_templates = [
    "{} is a .",
]

straight_sentence_templates = [
    "{} is straight.",
]

In [213]:
lgbtq_sentences = []
straight_sentences = []

for sentence in lgbtq_sentence_templates:
    for name in male_names:
        lgbtq_sentences.append(sentence.format(name))
    for name in female_names:
        lgbtq_sentences.append(sentence.format(name))

for sentence in straight_sentence_templates:
    for name in male_names:
        straight_sentences.append(sentence.format(name))
    for name in female_names:
        straight_sentences.append(sentence.format(name))

In [214]:
get_sentiment(lgbtq_sentences, glove_rnn).mean()



0.8558687

In [200]:
get_sentiment(straight_sentences, glove_rnn).mean()



0.7447921

In [201]:
get_sentiment(lgbtq_sentences, numberbatch_rnn).mean()



0.95314467

In [196]:
get_sentiment(straight_sentences, numberbatch_rnn).mean()



0.6283485
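
The four means above can be summarised as a per-model gap (LGBTQ mean minus straight mean): roughly 0.11 for the GloVe model and 0.32 for the Numberbatch model. To judge whether such a gap is larger than chance, the per-name scores can be compared pairwise, since both groups use the same 40 names. Below is a minimal sketch using a sign-flip permutation test. This is one possible check, not an analysis the notebook ran, and the toy scores are made up purely for illustration.

```python
import numpy as np

def bias_gap(scores_a, scores_b):
    """Mean sentiment of group A minus mean sentiment of group B."""
    return float(np.mean(scores_a) - np.mean(scores_b))

def sign_flip_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.
    Under the null hypothesis the sign of each per-name difference is arbitrary."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    resampled = np.abs((signs * diffs).mean(axis=1))
    # add-one smoothing keeps the estimated p-value strictly positive
    return float((np.sum(resampled >= observed) + 1) / (n_resamples + 1))

# Made-up scores for four names, one score per name in each group.
toy_a = [0.90, 0.80, 0.85, 0.95]
toy_b = [0.70, 0.60, 0.75, 0.65]
toy_gap = bias_gap(toy_a, toy_b)
toy_p = sign_flip_pvalue(toy_a, toy_b)
print(toy_gap, toy_p)
```

With the real models, `scores_a` and `scores_b` would be the flattened outputs of `get_sentiment(lgbtq_sentences, model)` and `get_sentiment(straight_sentences, model)`, which pair up by name because both lists are built in the same order.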