In [1]:
%matplotlib inline

import numpy as np
import torch
import matplotlib.pyplot as plt
import nltk
import collections
import pandas as pd
import sklearn.feature_extraction.text  # binds `sklearn` and loads the submodule used below
import random
import time

# Gender classification assignment

You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is provided.

You will be working on classifying the genders of people from their blog posts using a data set called the [Blog Authorship Corpus](https://www.kaggle.com/rtatman/blog-authorship-corpus).
This has been pre-split and reduced for you to use in this assignment.

10% of the marks from this assignment are based on neatness.

This assignment will carry 40% of the final mark.

## Data processing (10%)

You have a train/dev/test split data set consisting of CSV files with two fields: gender and text.
The gender field contains either 'male' or 'female' whilst the text is a string containing text from blog posts.

Do the following tasks:

Load these three CSV files and tokenise each text.

In [None]:
train_set = pd.read_csv("train.csv")
test_set = pd.read_csv("test.csv")
dev_set = pd.read_csv("dev.csv")

train_set_x = train_set['text']
train_set_y = train_set['gender']

test_set_x = test_set['text']
test_set_y = test_set['gender']

dev_set_x = dev_set['text']
dev_set_y = dev_set['gender']

def preprocess(x):
    return x.lower()

train_set_x = train_set_x.apply(preprocess)
test_set_x = test_set_x.apply(preprocess)
dev_set_x = dev_set_x.apply(preprocess)

train_token_x = train_set_x.apply(nltk.tokenize.word_tokenize)
test_token_x = test_set_x.apply(nltk.tokenize.word_tokenize)
dev_token_x = dev_set_x.apply(nltk.tokenize.word_tokenize)

Write code that counts the number of lines in each data set as well as the maximum number of tokens in each data set.

In [None]:
train_len = len(train_set.index)
test_len = len(test_set.index)
dev_len = len(dev_set.index)

print("Train Set no. of lines:", str(train_len))
print("Test Set no. of lines:", str(test_len))
print("Dev Set no. of lines:", str(dev_len))
print()

train_lens = [len(x) for x in train_token_x]
train_max = max(train_lens)
test_lens = [len(x) for x in test_token_x]
test_max = max(test_lens)
dev_lens = [len(x) for x in dev_token_x]
dev_max = max(dev_lens)

print("Train Set max tokens:", str(train_max))
print("Test Set max tokens:", str(test_max))
print("Dev Set max tokens:", str(dev_max))

Convert each data set's labels (gender) into numeric form.

In [None]:
categories = sorted(set(train_set_y))

train_y_indexes = [categories.index(y) for y in train_set_y]
test_y_indexes = [categories.index(y) for y in test_set_y]
dev_y_indexes = [categories.index(y) for y in dev_set_y]

Extract a vocabulary consisting of the tokens that occur at least 5 times in the train set and output the size of your vocabulary.
Include the unknown token and pad token in the vocabulary.

In [None]:
min_freq = 5

frequencies = collections.Counter(word for text in train_token_x for word in text)
vocab = sorted(frequencies.keys(), key=frequencies.get, reverse=True)
while frequencies[vocab[-1]] < min_freq:
    vocab.pop()
vocab = ['<PAD>', '<UNK>'] + sorted(vocab)

print("Length of vocabulary:", len(vocab))

Create binary bag of words feature vectors for all data set texts using the vocabulary created above (include stop words).

In [None]:
encoder = sklearn.feature_extraction.text.CountVectorizer(vocabulary=vocab, binary=True, analyzer=lambda text: text, dtype=np.float32)
encoder.fit(train_token_x)  # the identity analyzer expects token lists, not raw strings

train_x_vecs = encoder.transform(train_token_x).toarray()
test_x_vecs = encoder.transform(test_token_x).toarray()
dev_x_vecs = encoder.transform(dev_token_x).toarray()

Create a data set of indexified token sequences for all texts using the vocabulary created above, making use of unknown tokens and pad tokens.

In [None]:
word2index = {word: i for (i, word) in enumerate(vocab)}

for i in range(len(train_token_x)):
    for j in range(len(train_token_x[i])):
        if train_token_x[i][j] not in word2index:
            train_token_x[i][j] = '<UNK>'
    train_token_x[i].extend(['<PAD>']*(train_max - len(train_token_x[i])))
    
for i in range(len(test_token_x)):
    for j in range(len(test_token_x[i])):
        if test_token_x[i][j] not in word2index:
            test_token_x[i][j] = '<UNK>'
    test_token_x[i].extend(['<PAD>']*(test_max - len(test_token_x[i])))
    
for i in range(len(dev_token_x)):
    for j in range(len(dev_token_x[i])):
        if dev_token_x[i][j] not in word2index:
            dev_token_x[i][j] = '<UNK>'
    dev_token_x[i].extend(['<PAD>']*(dev_max - len(dev_token_x[i])))

indexed_train_x = torch.tensor([[word2index[word] for word in text] for text in train_token_x], dtype=torch.int64)
indexed_test_x = torch.tensor([[word2index[word] for word in text] for text in test_token_x], dtype=torch.int64)
indexed_dev_x = torch.tensor([[word2index[word] for word in text] for text in dev_token_x], dtype=torch.int64)
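The loops above overwrite the token lists in place, so every later statistic sees the `<UNK>` and `<PAD>` entries. A minimal non-mutating sketch of the same mapping, assuming a hypothetical helper `indexify` and a toy vocabulary (neither is part of the notebook):

```python
def indexify(tokens, word2index, max_len, unk='<UNK>', pad='<PAD>'):
    """Map tokens to indexes, sending out-of-vocabulary tokens to unk and right-padding to max_len."""
    indexes = [word2index.get(token, word2index[unk]) for token in tokens]
    return indexes + [word2index[pad]] * (max_len - len(indexes))

# Toy vocabulary for illustration only.
toy_vocab = ['<PAD>', '<UNK>', 'blog', 'my']
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

toy_indexes = indexify(['my', 'new', 'blog'], toy_word2index, max_len=5)
print(toy_indexes)  # [3, 1, 2, 0, 0]
```

Because the original token lists stay untouched, the true text lengths and unknown-token counts can still be computed from them afterwards.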

Write code that counts the percentage of tokens in each data set that are unknown tokens (not including pad tokens).

In [None]:
def unk_percent(tokens):
    # Pad tokens were appended in place above, so exclude them from the total.
    total_tokens = sum(1 for text in tokens for word in text if word != '<PAD>')
    unk_tokens = sum(1 for text in tokens for word in text if word == '<UNK>')
    return unk_tokens/total_tokens

train_unkper = unk_percent(train_token_x)
test_unkper = unk_percent(test_token_x)
dev_unkper = unk_percent(dev_token_x)

print("Train Set Unknown %: {:.2%}".format(train_unkper))
print("Test Set Unknown %: {:.2%}".format(test_unkper))
print("Dev Set Unknown %: {:.2%}".format(dev_unkper))

## Logistic regression classification (20%)

Write a logistic regression classifier (single layer neural net) that is trained to classify the author gender from the bag of words vector of the text.
You do not need to perform any hyperparameter tuning.
Use L1 weight decay regularisation.

In [None]:
class Linear_Model(torch.nn.Module):
    
    def __init__(self, vocab_size, num_categories):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros((vocab_size, num_categories), dtype=torch.float32, requires_grad=True))
        self.b = torch.nn.Parameter(torch.zeros((num_categories,), dtype=torch.float32, requires_grad=True))

    def forward(self, x):
        return x@self.w + self.b

In [None]:
lin_model = Linear_Model(len(vocab), 2)
lin_model.to('cpu')

optimiser = torch.optim.Adam(lin_model.parameters())

tensor_train_x_vecs = torch.tensor(train_x_vecs, dtype=torch.float32)
tensor_train_y = torch.tensor(train_y_indexes, dtype=torch.int64)

print('step', 'error')
for step in range(1, 200+1):
    optimiser.zero_grad()
    output = lin_model(tensor_train_x_vecs)
    error = torch.nn.functional.cross_entropy(output, tensor_train_y) + lin_model.w.abs().mean()
    error.backward()
    optimiser.step()

    if step%100 == 0:
        print(step, error.detach().tolist())
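The regulariser in the training loop above is the mean absolute weight, which amounts to an L1 penalty with an implicit coefficient. Written out, assuming $V$ vocabulary entries, $C$ categories, and a hypothetical strength $\lambda$ (a symbol introduced here, not a name used in the notebook):

$$\mathcal{L}(w, b) = \mathrm{CE}(xw + b,\; y) + \lambda \sum_{i=1}^{V} \sum_{c=1}^{C} |w_{ic}|, \qquad \lambda = \frac{1}{VC}$$

The bias $b$ is left unregularised. Exposing $\lambda$ as an explicit value would make the strength of the penalty independent of the vocabulary size.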

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.

In [None]:
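from sklearn.metrics import precision_recall_fscore_support

tensor_test_x_vecs = torch.tensor(test_x_vecs, dtype=torch.float32)

targets = np.array(test_y_indexes, np.int64)

with torch.no_grad():
    # argmax over the logits; a sigmoid would not change which class scores highest
    outputs = lin_model(tensor_test_x_vecs).numpy().argmax(axis=1)

accuracy = (targets == outputs).sum()/len(targets)
# Macro-average over both genders so that neither is arbitrarily treated as the positive class.
precision, recall, f1, _ = precision_recall_fscore_support(targets, outputs, average='macro', zero_division=0)

print('accuracy: {:.2%}'.format(accuracy))
print('precision: {:.2%}'.format(precision))
print('recall: {:.2%}'.format(recall))
print('F1-score: {:.2%}'.format(f1))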

Write code that shows the top 10 tokens that are the most important for determining the author gender according to the classifier.

In [None]:
w = np.abs(lin_model.w.detach().numpy())

# Importance of a token: mean absolute weight across both categories.
importance = w.mean(axis=1)
top_indexes = np.argsort(-importance)[:10]
top_ten = [vocab[i] for i in top_indexes]

print('Top 10 words')
for rank, i in enumerate(top_indexes, start=1):
    print("{}) {} ({:.4f})".format(rank, vocab[i], importance[i]))

Write code that, for each data split and gender, shows the percentage of rows that include at least one of these important words (so 6 percentages in all).

In [None]:
def percent_with_words(words, x_vecs, genders, gender):
    """Fraction of the rows written by `gender` that contain at least one of `words`."""
    columns = [word2index[word] for word in words]
    has_word = x_vecs[:, columns].any(axis=1)
    is_gender = (genders == gender).to_numpy()
    return has_word[is_gender].mean()

print("Important words:", ", ".join(top_ten))
print()

splits = [
    ("Train", train_x_vecs, train_set_y),
    ("Test", test_x_vecs, test_set_y),
    ("Dev", dev_x_vecs, dev_set_y),
]

# One percentage per data split and gender (6 in all).
for name, x_vecs, genders in splits:
    for gender in categories:
        percent = percent_with_words(top_ten, x_vecs, genders, gender)
        print("{} Set {}: {:.2%}".format(name, gender.capitalize(), percent))

## Deep learning classifier (50%)

Perform hyperparameter tuning on a deep learning classifier (with a convolutional neural network or a recurrent neural network) that is trained to classify the author gender from the indexified sequences of the text.
Use the dev set for evaluation.
Output the best hyperparameters found and do not store the best trained model as you will be training it again in the next bit.

In [None]:
class Conv_Model(torch.nn.Module):
    
    def __init__(self, vocab_size, categ_size, embedding_size, window_size, hidden_size, init_dev):
        super().__init__()
        self.window_size = window_size
        self.embedding_matrix = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (vocab_size, embedding_size)), dtype=torch.float32))
        
        self.w1 = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (hidden_size, embedding_size, window_size)), dtype=torch.float32))
        self.b1 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.w2 = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (hidden_size, categ_size)), dtype=torch.float32))
        self.b2 = torch.nn.Parameter(torch.zeros((categ_size,), dtype=torch.float32))

    def forward(self, x, text_lens):
        embedded = self.embedding_matrix[x]
        embedded_t = embedded.transpose(1, 2)
        hidden_t = torch.nn.functional.leaky_relu(torch.nn.functional.conv1d(embedded_t, self.w1, self.b1))
        hidden = hidden_t.transpose(1, 2)

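        # Mask the window positions that overlap padding (vectorised, no Python loops).
        # At least one position is kept per text so that very short texts do not pool to infinity.
        valid_lens = (torch.as_tensor(text_lens) - self.window_size + 1).clamp(min=1)
        pad_mask = (torch.arange(hidden.shape[1])[None, :] >= valid_lens[:, None])[:, :, None]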
        
        masked = torch.masked_fill(hidden, pad_mask, torch.tensor(np.inf))
        pooled = torch.min(masked, dim=1)[0]
        
        return pooled@self.w2 + self.b2

In [None]:
embedding_size_set = [2, 3]
window_size_set = [2, 3]
hidden_size_set = [1, 2, 4, 8, 16]
init_dev_set = [10.0, 1.0, 0.1, 0.01, 0.001]

tensor_dev_y = torch.tensor(dev_y_indexes, dtype=torch.int64)

already_generated = set()
best_dev_acc = 0.0
best_hyperparams = None
for i in range(1, 20+1):
    while True:
        embedd_size = random.choice(embedding_size_set)
        wind_size = random.choice(window_size_set)
        hidden_size = random.choice(hidden_size_set)
        init_dev = random.choice(init_dev_set)
        hyperparams = (embedd_size, wind_size, hidden_size, init_dev)
        if hyperparams not in already_generated:
            already_generated.add(hyperparams)
            break
    print('Hyperparameter search attempt:', i)
    print('embedding_size:', embedd_size)
    print('window_size:', wind_size)
    print('hidden_layer_size:', hidden_size)
    print('init_stddev:', init_dev)

    hyp_model = Conv_Model(len(vocab), len(categories), embedding_size=embedd_size, window_size=wind_size, hidden_size=hidden_size, init_dev=init_dev)
    hyp_model.to('cpu')
    optimiser = torch.optim.SGD(hyp_model.parameters(), lr=1.0, momentum=0.9)
    for step in range(1, 10+1):
        optimiser.zero_grad()
        output = hyp_model(indexed_train_x, train_lens)
        error = torch.nn.functional.cross_entropy(output, tensor_train_y)
        error.backward()
        optimiser.step()
        print(str(step)+", ", end="")
    print()
    with torch.no_grad():
        # Accuracy: fraction of dev texts whose highest-scoring class matches the label.
        dev_acc = (hyp_model(indexed_dev_x, dev_lens).argmax(dim=1) == tensor_dev_y).float().mean().item()
    print('Dev set accuracy:', dev_acc)
    if dev_acc > best_dev_acc:
        best_hyperparams = hyperparams
        best_dev_acc = dev_acc
        print('new best!')
    print()

(embedd_size, wind_size, hidden_size, init_dev) = best_hyperparams
print('Best found:')
print('embedding_size:', embedd_size)
print('window_size:', wind_size)
print('hidden_layer_size:', hidden_size)
print('init_stddev:', init_dev)
print('Dev set accuracy:', best_dev_acc)

Use the hyperparameters found in the previous bit to train the classifier, this time outputting a graph showing the dev set accuracy after every epoch.

In [None]:
from IPython.display import clear_output

def live_plot(x, y, max_epochs):
    # Redraw the accuracy curve in place after each epoch.
    clear_output(wait=True)
    plt.figure(figsize=(8,8))
    plt.plot(x, y, color='blue')
    plt.title('Dev set accuracy per epoch')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.xlim([0,max_epochs])
    plt.ylim([0,1])
    plt.show()
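The cell above only defines the plotting helper; the training loop that feeds it is not shown. One possible shape for that loop, sketched with hypothetical callables `train_one_epoch` and `evaluate` standing in for the real model update and the dev set accuracy computation:

```python
def train_with_history(train_one_epoch, evaluate, max_epochs, on_epoch_end=None):
    """Run max_epochs of training, recording the evaluation score after each epoch."""
    epochs, scores = [], []
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        epochs.append(epoch)
        scores.append(evaluate())
        if on_epoch_end is not None:
            # A callback such as live_plot(x, y, max_epochs) fits this signature.
            on_epoch_end(epochs, scores, max_epochs)
    return epochs, scores

# Toy stand-ins: "training" bumps a counter and "accuracy" grows with it.
state = {'steps': 0}

def toy_train():
    state['steps'] += 1

def toy_evaluate():
    return min(1.0, 0.5 + 0.1 * state['steps'])

toy_epochs, toy_scores = train_with_history(toy_train, toy_evaluate, max_epochs=3)
print(toy_epochs, toy_scores)
```

In the notebook, `train_one_epoch` would wrap the optimiser step on the train set and `evaluate` would return the dev set accuracy, with `live_plot` passed as `on_epoch_end`.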

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.
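As a reference for what the four measures mean, here is a minimal sketch that computes them from plain lists of labels. The helper name `binary_metrics` and the toy data are assumptions for illustration, and class 1 is arbitrarily treated as the positive class:

```python
def binary_metrics(targets, predictions, positive=1):
    """Return (accuracy, precision, recall, f1), treating `positive` as the positive class."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

toy_targets = [1, 1, 0, 0, 1]
toy_predictions = [1, 0, 0, 1, 1]
accuracy, precision, recall, f1 = binary_metrics(toy_targets, toy_predictions)
print('accuracy: {:.2%}, precision: {:.2%}, recall: {:.2%}, F1: {:.2%}'.format(accuracy, precision, recall, f1))
```

Since neither gender is a natural positive class, it may be fairer to report the measures once per gender or to average the two.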

Output a confusion matrix of the trained model on the test set.
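A confusion matrix can be built by counting (true class, predicted class) pairs. The sketch below uses a hypothetical helper `confusion_matrix` on toy labels, and assumes that index 0 is 'female' and index 1 is 'male', as the sorted `categories` list would give:

```python
from collections import Counter

def confusion_matrix(targets, predictions, num_classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(targets, predictions))
    return [[counts[(t, p)] for p in range(num_classes)] for t in range(num_classes)]

toy_matrix = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], num_classes=2)

# Assumed label order: 0 = female, 1 = male.
for label, row in zip(['female', 'male'], toy_matrix):
    print(label, row)
```

The diagonal holds the correctly classified counts, so the accuracy is the sum of the diagonal divided by the sum of all cells.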

Output 5 examples of correctly classified text for each gender and 5 examples of incorrectly classified text for each gender (so 20 text examples in total), all of which must be from the test set.
This is assuming that you have at least 5 instances of each group.
If you have fewer, then show whatever is available.
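One way to gather these examples is to group the test texts by true class and by whether the prediction was correct, keeping at most a fixed number per group. A minimal sketch, where `collect_examples` and the toy data are hypothetical:

```python
from collections import defaultdict

def collect_examples(texts, targets, predictions, per_group=5):
    """Group texts by (true class, correctly classified?) and keep up to per_group of each."""
    groups = defaultdict(list)
    for text, target, prediction in zip(texts, targets, predictions):
        key = (target, target == prediction)
        if len(groups[key]) < per_group:
            groups[key].append(text)
    return dict(groups)

toy_examples = collect_examples(['a', 'b', 'c', 'd'], [0, 0, 1, 1], [0, 1, 1, 1], per_group=1)
for (target, correct), examples in sorted(toy_examples.items()):
    print(target, 'correct' if correct else 'incorrect', examples)
```

Groups with no members are simply absent from the result, which covers the case where fewer than 5 instances are available.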

Remember the list of important tokens determined previously (from the logistic regression classifier)?
Write code that takes all the texts in the test set that have at least one of the important tokens and shows the percentage of these texts that were correctly classified.
Similarly, take all the texts that don't have any of the important tokens and show the percentage of these texts that were correctly classified (so 2 percentages in total).
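The two percentages can be computed by splitting the test texts on whether they share any token with the important list. A minimal sketch, assuming tokenised texts and a hypothetical helper `accuracy_by_token_presence` (the toy tokens are illustrative and not results from the classifier):

```python
def accuracy_by_token_presence(token_lists, targets, predictions, important_tokens):
    """Return (accuracy with any important token, accuracy with none); None for an empty group."""
    important = set(important_tokens)
    hits = {True: [], False: []}
    for tokens, target, prediction in zip(token_lists, targets, predictions):
        has_important = not important.isdisjoint(tokens)
        hits[has_important].append(target == prediction)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(hits[True]), mean(hits[False])

toy_with, toy_without = accuracy_by_token_presence(
    [['my', 'husband'], ['the', 'game'], ['hello'], ['husband']],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    important_tokens=['husband', 'game'],
)
print('with: {:.2%}, without: {:.2%}'.format(toy_with, toy_without))
```

A large gap between the two figures would suggest that the deep model leans on the same surface words as the bag of words classifier.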

## Conclusion (10%)

Write, in fewer than 300 words, your interpretation of the results and how you think the model could perform better.
You should talk about things like overfitting/underfitting and whether the model is learning anything deep about how the different genders write or if it's just basing everything on the words used.