# Problem Statement

We investigate gender bias in sentiment analysis across two sets of pretrained word embeddings: Stanford's GloVe (the 200-dimensional version trained on Twitter) and the latest version of ConceptNet Numberbatch, which takes other pretrained embeddings, such as GloVe and word2vec, and perturbs them using the distance to neighbors in the ConceptNet knowledge graph. We feel this investigation is pertinent given the ongoing toxicity of the Internet, Twitter in particular, and it may show how the choice of embeddings can affect the bias of a model even when it is trained on the same (probably biased) data. Furthermore, with the rise of complex deep-learning-based language models in NLP and a shift away from interpretability and methodology, this investigation may elucidate how biases can also affect the results of relatively sophisticated models.
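
The perturbation step can be pictured with a retrofitting-style update, in which each word vector is pulled toward the vectors of its graph neighbors while staying anchored to its original value. The sketch below is only an illustration under assumptions: the function name `retrofit`, the toy vectors and graph, the uniform neighbor weights, and `alpha` are all hypothetical, and the actual Numberbatch pipeline is more involved than this.

```python
import numpy as np

def retrofit(vectors, neighbors, alpha=1.0, n_iters=10):
    """Pull each vector toward its graph neighbors (illustrative sketch only).

    vectors:   dict word -> 1-D numpy array (the original embeddings)
    neighbors: dict word -> list of neighboring words in the graph
    alpha:     weight keeping a word close to its original vector
    """
    updated = {word: vec.astype(float) for word, vec in vectors.items()}
    for _ in range(n_iters):
        for word, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in updated]
            if word not in vectors or not nbrs:
                continue  # nothing to pull toward
            # uniform neighbor weights that sum to 1 (an assumption)
            neighbor_mean = np.mean([updated[n] for n in nbrs], axis=0)
            updated[word] = (alpha * vectors[word] + neighbor_mean) / (alpha + 1.0)
    return updated

# toy example: two orthogonal vectors joined by a single edge
toy_vectors = {"nurse": np.array([1.0, 0.0]), "doctor": np.array([0.0, 1.0])}
toy_graph = {"nurse": ["doctor"], "doctor": ["nurse"]}
retrofitted = retrofit(toy_vectors, toy_graph)

before = np.linalg.norm(toy_vectors["nurse"] - toy_vectors["doctor"])
after = np.linalg.norm(retrofitted["nurse"] - retrofitted["doctor"])
print(before, after)
```

In the toy example the two connected words end up closer together than they started, which is the sense in which graph structure can reshape the geometry inherited from the original embeddings.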

Our research question is: How does the choice of pretrained word embeddings affect gender bias in sentiment analysis?
We hypothesize that the choice of embeddings will significantly affect the gender bias of the resulting model. In particular, we believe that the GloVe embeddings trained on Twitter will exacerbate the presumably existing biases in the dataset (c'mon, it's Twitter) and that the ConceptNet Numberbatch embeddings will mitigate this bias.
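
One way the hypothesis could be made measurable is to score template sentences that differ only in a gendered term and compare the average predicted sentiment. The following is a minimal sketch under assumptions, not a method taken from this notebook: `gender_gap`, `toy_score`, the templates, and the term lists are all hypothetical, and `toy_score` merely stands in for a trained model's predicted probability of positive sentiment.

```python
from statistics import mean

def gender_gap(score_fn, templates, male_terms, female_terms):
    """Mean score on male-term sentences minus mean score on female-term sentences."""
    male_scores = [score_fn(t.format(person=m)) for t in templates for m in male_terms]
    female_scores = [score_fn(t.format(person=f)) for t in templates for f in female_terms]
    return mean(male_scores) - mean(female_scores)

def toy_score(sentence):
    # hypothetical stand-in for a model; deliberately biased toward "he"
    return 0.6 if "he" in sentence.lower().split() else 0.5

templates = ["{person} is a doctor", "{person} made me feel happy"]
gap = gender_gap(toy_score, templates, male_terms=["he"], female_terms=["she"])
print(round(gap, 3))
```

A gap near zero would suggest the scorer treats the paired sentences alike. Comparing the gap obtained with each embedding under an otherwise identical model is one possible way to address the research question.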

In [24]:
import pandas as pd
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional, Dropout
from tensorflow.keras.models import Sequential

In [4]:
data = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="ISO-8859-1", header=None)
data = data[[0,5]]
data.columns = ["sentiment", "text"]

data["sentiment"] = data["sentiment"].replace(4, 1)

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data["text"], data["sentiment"], test_size=0.2, random_state=42)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

word_index = tokenizer.word_index

print('Found %s unique tokens.' % (len(word_index)))


Found 594848 unique tokens.


In [20]:
%%time
# all of this code was taken from the last lab of the previous course

def load_embeddings(filename, embed_size):
    # embed_size should match the dimensionality of the file being loaded
    embeddings_index = {}
    # save key/array pairs of the embeddings:
    #  the key of the dictionary is the word, the array is the embedding
    with open(filename, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    print('Found %s word vectors.' % len(embeddings_index))

    # now fill in the matrix, using the ordering from the
    #  keras word tokenizer from before
    found_words = 0
    embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be ALL-ZEROS
            embedding_matrix[i] = embedding_vector
            found_words = found_words+1

    print("Embedding Shape:",embedding_matrix.shape, "\n",
        "Total words found:",found_words, "\n",
        "Percentage:",100*found_words/embedding_matrix.shape[0])
    return embedding_matrix

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 4.53 µs


In [21]:
glove_embeddings = load_embeddings("glove.twitter.27B.200d.txt", 200)

Found 1193514 word vectors.
Embedding Shape: (594849, 200) 
 Total words found: 121078 
 Percentage: 20.35440927025178
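
The percentage printed above is computed over distinct vocabulary entries, so a rare misspelling counts as much as a very common word. A frequency-weighted figure can be more informative. The sketch below shows both under assumptions: `coverage` and the toy data are hypothetical, and in this notebook the counts would come from `tokenizer.word_counts`, while the set of embedding words would have to be collected separately because `load_embeddings` returns only the matrix.

```python
def coverage(word_counts, embedding_vocab):
    """Return (type-level, token-level) coverage as fractions in [0, 1]."""
    total_types = len(word_counts)
    total_tokens = sum(word_counts.values())
    found_types = sum(1 for word in word_counts if word in embedding_vocab)
    found_tokens = sum(count for word, count in word_counts.items()
                       if word in embedding_vocab)
    return found_types / total_types, found_tokens / total_tokens

# toy data: two frequent words are covered, two rare spellings are not
toy_counts = {"the": 90, "good": 8, "gooood": 1, "lolz": 1}
toy_vocab = {"the", "good"}
type_cov, token_cov = coverage(toy_counts, toy_vocab)
print(type_cov, token_cov)
```

In the toy data only half of the distinct words are covered, yet almost all running tokens are, which illustrates why a low type-level percentage does not by itself mean most of the text is mapped to all-zero vectors.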


In [22]:
numberbatch_embeddings = load_embeddings("numberbatch-en.txt", 300)

Found 516783 word vectors.
Embedding Shape: (594849, 300) 
 Total words found: 75671 
 Percentage: 12.721043491709661
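
Some embedding text files begin with a header row that gives the number of entries and the dimensionality, and the loader above would store such a row as one extra "word" with a too-short vector. Whether this applies to `numberbatch-en.txt` is an assumption here, not something verified in this notebook. A hedged sketch of a parser that skips any row whose length does not match the expected dimensionality (`parse_embedding_lines` and the toy file are hypothetical):

```python
import io
import numpy as np

def parse_embedding_lines(lines, embed_size):
    """Build a word -> vector dict, skipping rows of the wrong width."""
    embeddings_index = {}
    for line in lines:
        values = line.rstrip().split(" ")
        if len(values) != embed_size + 1:
            continue  # header row or malformed line
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings_index

# toy file: a "count dim" header followed by two 3-dimensional vectors
toy_file = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy_index = parse_embedding_lines(toy_file, embed_size=3)
print(sorted(toy_index))
```

The width check also guards against tokens that themselves contain spaces, which would otherwise fail the float conversion.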


In [None]:
X_train = pad_sequences(X_train_sequences, maxlen=30)
X_test = pad_sequences(X_test_sequences, maxlen=30)
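
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so short tweets get leading zeros and long tweets keep their last 30 tokens. Index 0 is reserved for this padding, which is why the embedding matrix has `len(word_index) + 1` rows. A minimal pure-Python sketch of that behavior for a single sequence (the helper `pad_pre` is hypothetical and assumes `maxlen >= 1`):

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen items."""
    trimmed = sequence[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short, long)
```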

In [25]:
MAX_ART_LEN = 30

# wrap each pretrained matrix in a frozen (non-trainable) embedding layer
glove_embedding_layer = Embedding(len(word_index) + 1,
                            200,
                            weights=[glove_embeddings],  # initialize with the pretrained GloVe matrix
                            input_length=MAX_ART_LEN,
                            trainable=False)

numberbatch_embedding_layer = Embedding(len(word_index) + 1,
                            300,
                            weights=[numberbatch_embeddings],  # initialize with the pretrained Numberbatch matrix
                            input_length=MAX_ART_LEN,
                            trainable=False)


In [33]:
glove_rnn = Sequential()
glove_rnn.add(glove_embedding_layer)
glove_rnn.add(Bidirectional(LSTM(64, return_sequences=True)))
glove_rnn.add(Bidirectional(LSTM(32)))
glove_rnn.add(Dense(1, activation='sigmoid'))
glove_rnn.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
glove_rnn.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 30, 200)           118969800 
                                                                 
 bidirectional_3 (Bidirectio  (None, 30, 128)          135680    
 nal)                                                            
                                                                 
 bidirectional_4 (Bidirectio  (None, 64)               41216     
 nal)                                                            
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 119,146,761
Trainable params: 176,961
Non-trainable params: 118,969,800
_________________________________________________________________


In [35]:
numberbatch_rnn = Sequential()
numberbatch_rnn.add(numberbatch_embedding_layer)
numberbatch_rnn.add(Bidirectional(LSTM(64, return_sequences=True)))
numberbatch_rnn.add(Bidirectional(LSTM(32)))
numberbatch_rnn.add(Dense(1, activation='sigmoid'))
numberbatch_rnn.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])
numberbatch_rnn.summary()

2023-02-07 22:05:48.696496: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 713818800 exceeds 10% of free system memory.


Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 300)           178454700 
                                                                 
 bidirectional_5 (Bidirectio  (None, 30, 128)          186880    
 nal)                                                            
                                                                 
 bidirectional_6 (Bidirectio  (None, 64)               41216     
 nal)                                                            
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 178,682,861
Trainable params: 228,161
Non-trainable params: 178,454,700
_________________________________________________________________
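
The parameter counts in the two summaries can be reproduced by hand, which also makes clear that the models differ only in the embedding table and in the input width of the first LSTM. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias. The helper names are hypothetical. The vocabulary row count 594849 is taken from the embedding shapes printed earlier.

```python
def embedding_params(vocab_rows, dim):
    return vocab_rows * dim

def bilstm_params(input_dim, units):
    # 4 gates x units x (input weights + recurrent weights + bias), two directions
    return 2 * 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return (input_dim + 1) * units

def model_params(vocab_rows, embed_dim):
    """Return (non-trainable, trainable) parameter counts for the stack above."""
    frozen = embedding_params(vocab_rows, embed_dim)
    trainable = (bilstm_params(embed_dim, 64)   # first BiLSTM reads the embeddings
                 + bilstm_params(128, 32)       # second BiLSTM reads 2 x 64 features
                 + dense_params(64, 1))         # sigmoid output reads 2 x 32 features
    return frozen, trainable

glove_counts = model_params(594849, 200)
numberbatch_counts = model_params(594849, 300)
print(glove_counts, numberbatch_counts)
```

Because the Numberbatch model has somewhat more trainable parameters, any comparison of bias between the two models also reflects this small difference in capacity.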


In [34]:
glove_rnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/50
Epoch 2/50

KeyboardInterrupt: 
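
The interrupted run is easier to understand once the number of gradient steps is worked out. The arithmetic below is a sketch that assumes 1,600,000 rows in the CSV (inferred from the filename) and the 80/20 split used earlier. `steps_per_epoch` is a hypothetical helper and 1024 is only an example batch size.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    return math.ceil(n_examples / batch_size)

n_train = 1_280_000  # assumed: 80% of 1,600,000 rows

steps_small_batch = steps_per_epoch(n_train, 32)
total_steps = steps_small_batch * 50  # epochs=50 as in the call above
steps_large_batch = steps_per_epoch(n_train, 1024)
print(steps_small_batch, total_steps, steps_large_batch)
```

With a batch size of 32, fifty epochs amount to millions of optimizer steps through two stacked bidirectional LSTMs. A larger batch size or fewer epochs would shorten the run considerably.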

In [None]:
numberbatch_rnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))