#  **Author profiling** - *The writing creates the writer*

**Course:** Language speech and dialogue processing 2021

**Course Code:** 508

**Authors**: Emma Quist - 
Thijs Rood - 
David Borensztajn

**Teaching Assistant:** Silvan Murre

# Table of contents
1. [Introduction](#introduction)
2. [Data description](#Datades)
    1. [Blog Authorship Corpus](#Blog)
    2. [PAN dataset series](#PAN)
3. [Data pre-processing](#prep)
4. [Approach](#Approach)
    1. [NN](#NN)
    2. [RNN](#RNN)
    3. [BERT](#BERT)
5. [RQ1: To what extent can we predict an author's gender based on:](#RQ1)
    1. [RQ1.I Blogs](#RQ1B)
    2. [RQ1.II Chats](#RQ1C)
6. [RQ2: To what extent can we predict an author's age-group based on:](#RQ2)
    1. [RQ2.I Blogs](#RQ2B)
    2. [RQ2.II Chats](#RQ2C)
7. [Conclusions](#Conclusions)
8. [Discussion](#Discussion)
9. [References](#References)



## Introduction <a name="introduction"></a>
The rise of social media platforms such as Facebook and Twitter over the past two decades has given enormous numbers of people the ability to interact with others on the internet. These online conversations bring many opportunities, as it has never been this easy to exchange ideas with people from all over the world. However, online communication in the digital age has also given rise to some problems.
 
One of these problems is the anonymity of the internet. Setting up an account on one of the major social media platforms takes just a few steps, and no verification is needed. This creates considerable uncertainty about the identity of the person running an account. People use different names, write under pseudonyms and in some cases pretend to be someone else. This anonymity can be beneficial in terms of user privacy, but it can be dangerous if used for harmful purposes, for example by online sexual predators.
The internet is not just for adults: young children grow up with iPads and almost every teenager has a smartphone. These age groups are also vulnerable to sexual predators operating on the internet. Specifically, 19% of children have been sexually approached online (Mitchell, 2001). Because predators are able to falsify their age and come into contact with young people, online identity monitoring is important.
Another issue is that non-human users (bots) infiltrate human-to-human communication systems. This is a problem because the number of bots can easily be scaled up, so their posts can have a great impact on the users of a platform. Researchers found that bots are actively influencing popular opinion on major social media platforms (Mesnards et al., 2020). Other research showed that in highly polarised settings, a bot participation of 2-4% of an online network can be sufficient to tip the general consensus on the platform (Ross et al., 2019).
These examples show that there is a strong need for user identification. Author profiling could be part of the solution.

Author profiling aims to correlate writing style with author demographics using natural language processing techniques (Wiegmann et al., 2019). 

In recent years, texts available on social media platforms have become a primary source of data for computational author profiling (Hsieh et al., 2018). In 2018, researchers showed that age and gender can be predicted using SVM classifiers on TF-IDF features retrieved from tweets (Dichiu & Rancea, 2018). Other researchers suggested an approach using three types of features: content-based, style-based and topic-based. They achieved an accuracy of 64.08% on age-group estimation and 56.53% on gender prediction using SVM models and decision trees (Santosh et al., 2013). However, these papers did not consider the use of neural networks for age and gender prediction.

In this report, we propose a neural network-based approach to author profiling on online blogs and chat platforms. The annotated datasets PAN13 and the Blog Authorship Corpus were used to model linguistic and semantic features of text in order to predict authors' age group and gender. The report addresses two research questions, divided into four sub-questions:


* **RQ1: To what extent can we predict an author's gender based on:**
  * **RQ1.I Blogs**
  * **RQ1.II Chats**
* **RQ2: To what extent can we predict an author's age-group based on:**
  * **RQ2.I Blogs**
  * **RQ2.II Chats**

NOTE TO GRADER: Instead of 3 research questions, this report has 2, each with corresponding sub-questions. These have been researched thoroughly with 3 different models, each of which could be considered a research method in its own right. Applying 3 models to 2 datasets to answer 2 questions gives 12 separate results, which should be sufficient for research of this scope.

In the first sections, the datasets, the pre-processing steps and the three different approaches are explained. After that, each research (sub-)question is examined and evaluated, and the performance of the different models is compared and discussed. Finally, we draw conclusions about our project and discuss its limitations and directions for future work.
 


## Data description <a name="Datades"></a>

Because we needed datasets that include both an author's texts and their age or gender, we selected the following datasets:


### The Blog Authorship Corpus <a name="Blog"></a>

The [Blog Authorship Corpus](http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf) is a dataset consisting of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus is free to use for non-commercial purposes. It contains 681,288 posts in total, together forming over 140 million words. This amounts to an average of about 35 posts and 7,250 words per person.

Each file contains a different blog post, along with the author's (self-provided) age, gender, and astrological sign. The ages of the authors are concentrated around 16-24, with a skewed, thin tail towards the higher ages. Male and female authors are equally represented.

### The PAN dataset series <a name="PAN"></a>

The [PAN dataset series](https://pan.webis.de/data.html) consists of more than 10 datasets of text documents usable for various tasks, such as author identification, plagiarism detection, quality flaw prediction and author profiling. As our research questions concern the latter task, one of these datasets was selected. After we were granted access to the PAN 13 dataset, we decided to use it in our research.

The PAN 13 dataset (from 2013) contains anonymous texts collected from social media platforms. The files are structured as XML and contain the language in which the texts were written, the number of texts per user, and the user's age and gender. In this research we decided to process only the English texts, which keeps the task accessible while retaining most of the files. The PAN age ranges are: 10s: 13-17 yrs (17,200 conversations), 20s: 23-27 yrs (85,800 conversations) and 30s: 33-47 yrs (133,600 conversations).


Together with our TA it was decided that using these two datasets would satisfy the requirements and should suffice to complete our research.



## Data pre-processing <a name="prep"></a>

NOTE: In this section of the report, only the pre-processing of the Blog-dataset for the age-group classification will be presented as an illustration. The pre-processing of the other datasets can be found in the other notebooks.

### Imbalanced data 

As stated before, neither dataset is uniformly distributed over age. This means that the data should be balanced before training the model. In this research we used a resampling approach to solve this problem. Because both datasets were sufficiently large, under-sampling could be used. Under-sampling balances the dataset by reducing the size of the abundant classes ('7 Techniques to Handle Imbalanced Data', 2021). This was done by keeping all samples of the smallest age group and randomly selecting an equal number of samples from each of the larger age groups. The outcome is a balanced dataset that is suitable for training a neural network with less bias.




### Text pre-processing 

Before analysing the data, it is important to load it carefully and to think about further pre-processing steps. In this research all characters other than letters, digits and whitespace were removed, and the text was lowercased and tokenised. To predict an author's age and gender, the style of the text had to be retained as much as possible. Therefore it was decided not to remove stopwords and not to lemmatize the words, since different groups of people may tend to choose different words in their sentences. In this research unigrams were used for simplicity; bigrams might improve performance in future work.
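
The difference between unigram and bigram features can be illustrated with a minimal sketch. The helper `ngrams` and the sample sentence are hypothetical and not part of the notebook's pipeline; the sketch only shows what bigram features would look like if they were added.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, already lowercased and tokenised
tokens = "i am not really an expert".split(" ")

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams[:3])  # [('i',), ('am',), ('not',)]
print(bigrams[:3])   # [('i', 'am'), ('am', 'not'), ('not', 'really')]
```

Bigrams keep some local word order, at the cost of a much larger vocabulary than unigrams.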



### Importing dependencies

In [None]:
# Loading files from drive
from google.colab import drive
drive.mount('/content/drive')

# Uploading files directly
# from google.colab import files
# uploaded = files.upload()
# dataset_blog = pd.read_excel(uploaded)
# df = pd.DataFrame(dataset_blog)

%load_ext tensorboard
!pip install torchtext==0.4.

import pandas as pd
import re
import os
import time

import torch
import torchtext
from torchtext.datasets import text_classification
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
import nltk
from torch.utils.tensorboard import SummaryWriter
import tensorflow as tf
import tensorboard as tb
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

from collections import defaultdict, OrderedDict, Counter
import operator
from IPython.display import Image
import plotly.graph_objs as go
import ipywidgets as widgets
import IPython.display as display

path = '/content/drive/MyDrive/Language, speech and dialogue processing/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:


dataset_blog = pd.read_excel(path+'blogtext_full.xlsx')
df = pd.DataFrame(dataset_blog)
df = df.drop(['id', 'topic', 'sign', 'date'], axis=1)


def tokenize(text):
    tokens = [token for token in text.split(" ") if token != ""]
    return tokens

def lowered(s):
    return s.lower()

# replace every character that is not a letter, digit or whitespace with a space
def remove_nonalph(s):
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    return s

# applying preprocessing steps
df['text'] = df['text'].apply(lowered).apply(remove_nonalph).apply(tokenize)

In [None]:
df.head()

| | gender | age | text |
|---|---|---|---|
| 0 | male | 17 | [enjoy, never, ask, what, ask, why, bush, why,... |
| 1 | female | 13 | [richmond, fernandez, okay, those, who, hate, ... |
| 2 | female | 13 | [hey, everyone, if, you, got, to, this, blog, ... |
| 3 | male | 17 | [well, tonight, things, finally, came, to, a, ... |
| 4 | female | 25 | [not, really, to, state, that, i, am, an, expe... |


### Word and classes mapping 

After these steps the words had to be converted into numbers before they could be used in a neural network. The word mapping chosen was integer encoding: each word in the vocabulary is assigned an index from 0 to N-1 (where N is the number of unique words) in order of decreasing frequency. After this the text documents could be translated into tensors, the multi-dimensional arrays that neural network libraries operate on. The same mapping was applied to the classes (either age groups or gender groups).
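
As a small illustration of this frequency-ordered integer encoding, the sketch below applies it to a toy corpus. The documents and the name `toy_wordmap` are hypothetical; the notebook builds its actual mapping from the blog texts in the cells further down.

```python
from collections import Counter

# Toy corpus of tokenised documents (hypothetical, for illustration only)
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat"]]

# Count how often every word occurs over all documents
counts = Counter(word for doc in docs for word in doc)

# The most frequent word gets index 0, the next one index 1, and so on
toy_wordmap = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

# Encode every document as a list of integers
encoded = [[toy_wordmap[word] for word in doc] for doc in docs]

print(toy_wordmap)
print(encoded)
```

The most frequent word ("the") is mapped to 0 and the rarest word ("dog") to the highest index.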

In [None]:
class_choice = 'age'
classdict = defaultdict(int)

for row in df[class_choice]:
  classdict[row] += 1

classdict = dict(sorted(classdict.items(), key=operator.itemgetter(1), reverse=True))

classlist = list(classdict.keys())
classmap = dict([(y,x) for x,y in enumerate(classlist)])
print('Original age distribution of dataset:')

x = tuple(classdict.keys())
y = list(classdict.values())
fig = go.Figure([go.Bar(x=x, y=y)])
fig.show()

Original age distribution of dataset:


In [None]:


ages = df["age"].unique().tolist()
print(ages)

ages.sort()
age_dict = {}
for age in ages:
  age_dict[age] = df.loc[df['age'] == age]

# Same range as in pan13 dataset, with an extra 40's group

dataframes_10s = []
dataframes_20s = []
dataframes_30s = []
dataframes_40s = []

for key, value in age_dict.items():
  if 13 <= int(key) <= 17:
    dataframes_10s.append(value)
  elif 23 <= int(key) <= 27:
    dataframes_20s.append(value)
  elif 33 <= int(key) <= 37:
    dataframes_30s.append(value)
  elif 43 <= int(key) <= 47:
    dataframes_40s.append(value)

df_10s = pd.concat(dataframes_10s)
df_20s = pd.concat(dataframes_20s)
df_30s = pd.concat(dataframes_30s)
df_40s = pd.concat(dataframes_40s)

all_dataframes = [df_10s, df_20s, df_30s, df_40s]

# size of the smallest age group; every group is down-sampled to this size
min_len = min(len(group) for group in all_dataframes)

df_10s["age"] = 0
df_20s["age"] = 1
df_30s["age"] = 2
df_40s['age'] = 3

all_dataframes = [df_10s.sample(min_len), df_20s.sample(min_len), df_30s.sample(min_len), df_40s.sample(min_len)]

df = pd.concat(all_dataframes)
df.index = range(len(df))

[17, 13, 25, 23, 27, 38, 24, 15, 14, 26, 37, 48, 16, 33, 39, 35, 34, 41, 36, 45, 47, 44, 46, 43, 42, 40]


### Visualizing balanced groups

In [None]:
classdict = defaultdict(int)

for row in df[class_choice]:
  classdict[row] += 1

classdict = dict(sorted(classdict.items(), key=operator.itemgetter(1), reverse=True))

classlist = list(classdict.keys())
classmap = dict([(y,x) for x,y in enumerate(classlist)])
print('Balanced dataset with age groups:')

x = ["10s", "20s", "30s", "40s"]
y = list(classdict.values())
fig = go.Figure([go.Bar(x=x, y=y)])
fig.show()

Balanced dataset with age groups:


### Creating a vocabulary

In [None]:
# creating vocabulary of words
vocabdict = defaultdict(int)   
for row in df['text']:
  for n_gram in row:
    n_gram = n_gram.lower()
    vocabdict[n_gram] += 1

vocabdict = dict(sorted(vocabdict.items(), key=operator.itemgetter(1), reverse=True))

# mapping words to integer
l = list(vocabdict.keys())
wordmap = dict([(y,x) for x,y in enumerate(l)])

### Integer encoding and creating tensors

In [None]:
df_copy = df.copy()

train_dataset = []

# Encode each token list as a tensor of vocabulary indices and pair it with
# its class index. Building the tensors outside the DataFrame avoids the
# chained assignment df['text'][i] = ..., which triggers pandas'
# SettingWithCopyWarning.
for tokens, age in zip(df['text'], df['age']):
  encoded = torch.tensor([wordmap[word.lower()] for word in tokens], dtype=torch.int64)
  train_dataset.append((classmap[age], encoded))



## Approach <a name="Approach"></a>

There are different approaches to a text classification task like age and gender prediction. Within the scope of this project, the following models are implemented and evaluated: a neural network (NN), a recurrent neural network (RNN) and Bidirectional Encoder Representations from Transformers (BERT).


NOTE: Only the implementation of the NN on the blog dataset is presented in this report; the implementation of the other models and datasets can be found in the other notebooks.

### Neural Network (NN) <a name="NN"></a>


The neural network model consists mainly of an embedding layer followed by a fully connected layer with one output unit per class. An embedding layer maps each word in a tensor to a lower-dimensional vector. Similar words obtain similar vectors, which makes classification easier. The weights of both layers are initialised uniformly between $-\frac{1}{2}$ and $+\frac{1}{2}$ and the biases are set to 0.

The embedding layer takes the vocabulary size and the embedding dimension as input. Conceptually, each word in the vocabulary is first represented as a vector of 0's with a single 1 at the position of that word. The embedding dimension determines the size of the space into which all words are mapped. The number of classes has to be given as well, because the fully connected layer needs to know how many output dimensions to map to.

The neural network was trained using embeddings of 32 dimensions. The texts were split into batches of 16. The stochastic gradient descent optimizer was used with a starting learning rate of 4.0 (as set in the training cell below), together with a scheduler that decays this learning rate by a factor (gamma) of 0.9 after every epoch. As a loss function, the cross-entropy loss was used, since this is a standard loss function for classification tasks.
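
The effect of the learning-rate scheduler can be made concrete with a small sketch. It assumes the step decay used in the training cell (`StepLR` with a step size of 1), so the learning rate is multiplied by gamma after every epoch. The helper `step_lr` is a hypothetical name, and the initial value 4.0 is taken from the `lr` argument in the training code.

```python
def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate after `epoch` completed epochs under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

initial_lr = 4.0   # value passed to the SGD optimizer in the training cell
gamma = 0.9        # multiplicative decay factor of the scheduler

for epoch in range(0, 11, 5):
    print(f"epoch {epoch:2d}: lr = {step_lr(initial_lr, gamma, epoch):.4f}")
```

With these settings the learning rate falls to roughly 35% of its initial value after ten epochs, because 0.9 to the power 10 is about 0.35.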

An embedding layer was chosen because people from the same group tend to use similar words. The vectors in the embedding layer for these groups should therefore be more alike. Younger groups, for example, write more about topics related to their age than older groups do. These embedding vectors can therefore be used to classify texts. The final fully connected layer maps these vectors to a corresponding age group.

The final part of the implementation of the embedding-based neural network is shown below. Here we generate the batches, define the functions for training and evaluation, and train the network for 10 epochs (on the sub-dataset of 999 blogposts). To evaluate our neural network, we calculate the loss and the accuracy. The loss is the cross-entropy loss, which combines the log-softmax function (LogSoftmax) and the negative log-likelihood loss (NLLLoss) in a single class; it is computed per epoch. This criterion was chosen because it is especially useful when training a classification problem with C classes. The accuracy measures the percentage of blogposts that were labeled correctly, for both the training and the validation data. For example, if the model predicts the correct age group 60 times in a test set containing 100 authors' texts, this results in an accuracy of 60%.
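
To make the two evaluation measures concrete, the following sketch computes the cross-entropy loss (log-softmax followed by the negative log-likelihood) and the accuracy in plain Python. The function names and the raw scores are hypothetical and serve only as a worked example; the notebook itself relies on `torch.nn.CrossEntropyLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Cross-entropy for one sample: negative log-probability of the true class."""
    return -log_softmax(logits)[target]

def accuracy(predictions, targets):
    """Fraction of predictions that equal the true label."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical raw scores for the four age groups (10s, 20s, 30s, 40s)
logits = [2.0, 0.5, 0.1, -1.0]
print(f"loss if the true class is 10s: {cross_entropy(logits, 0):.4f}")
print(f"loss if the true class is 40s: {cross_entropy(logits, 3):.4f}")

# 60 correct predictions out of 100 give an accuracy of 0.6
print(accuracy([1] * 60 + [0] * 40, [1] * 100))
```

The loss is small when the model assigns a high score to the true class and grows when the true class receives a low score.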

In [None]:
# set GPU as device for better performance
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

vocab = len(vocabdict)
embed_dim = 32
n_classes = len(classdict)
BATCH_SIZE = 16
model = TextSentiment(vocab, embed_dim, n_classes).to(device)
writer = SummaryWriter('runs/lsd')

In [None]:
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0)

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    total_loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            # keep the batch loss in its own variable so that the running
            # total is not overwritten on every batch
            batch_loss = criterion(output, cls)
            total_loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(data_), acc / len(data_)

N_EPOCHS = 10
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
# Adam or SGD
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

start_time = time.time()
for epoch in range(N_EPOCHS):

    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    print('Epoch: %d' %(epoch + 1))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    writer.add_scalar('loss', train_loss, epoch)
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
    writer.add_scalar('test acc', valid_acc, epoch)

total_time = int(time.time() - start_time)
print('Total time elapsed: %d seconds.' %(total_time))

Epoch: 1
	Loss: 0.0852(train)	|	Acc: 36.2%(train)
	Loss: 0.0011(valid)	|	Acc: 38.9%(valid)
Epoch: 2
	Loss: 0.0791(train)	|	Acc: 42.0%(train)
	Loss: 0.0014(valid)	|	Acc: 32.6%(valid)
Epoch: 3
	Loss: 0.0750(train)	|	Acc: 46.1%(train)
	Loss: 0.0014(valid)	|	Acc: 42.9%(valid)
Epoch: 4
	Loss: 0.0719(train)	|	Acc: 48.8%(train)
	Loss: 0.0013(valid)	|	Acc: 45.1%(valid)
Epoch: 5
	Loss: 0.0689(train)	|	Acc: 51.8%(train)
	Loss: 0.0010(valid)	|	Acc: 50.4%(valid)
Epoch: 6
	Loss: 0.0668(train)	|	Acc: 54.2%(train)
	Loss: 0.0011(valid)	|	Acc: 45.7%(valid)
Epoch: 7
	Loss: 0.0644(train)	|	Acc: 56.1%(train)
	Loss: 0.0012(valid)	|	Acc: 45.1%(valid)
Epoch: 8
	Loss: 0.0624(train)	|	Acc: 58.1%(train)
	Loss: 0.0013(valid)	|	Acc: 46.1%(valid)
Epoch: 9
	Loss: 0.0607(train)	|	Acc: 59.9%(train)
	Loss: 0.0011(valid)	|	Acc: 48.3%(valid)
Epoch: 10
	Loss: 0.0589(train)	|	Acc: 61.5%(train)
	Loss: 0.0011(valid)	|	Acc: 48.5%(valid)
Total time elapsed: 158 seconds.


In [None]:
def predict(text, model, wordmap):
    # map each token to its vocabulary index
    sentence = [int(wordmap[word.lower()]) for word in text]
    text = torch.tensor(sentence).to(torch.int64)

    with torch.no_grad():
        output = model(text, torch.tensor([0]))
        # class indices are 0-based, matching the keys of inv_map
        return output.argmax(1).item()



test_text = df_copy['text'][1]

model = model.to("cpu")

# class index -> readable age group
inv_map = {0: "10's", 1: "20's", 2: "30's", 3: "40's"}

print(f"I predict the author of this article is in his {inv_map[predict(test_text, model, wordmap)]}")

I predict the author of this article is in his 30's


### Recurrent Neural Network (RNN) <a name="RNN"></a>


As an alternative to the neural network, a recurrent neural network can be implemented. The idea behind an RNN is that it can make use of sequential information. This can be useful for texts because language has a clear sequential character. The model we used is a Long Short-Term Memory network (LSTM). As the name suggests, it remembers previous word inputs from both shorter and longer periods back. Because the meaning of a word depends on the preceding words, LSTMs can extract this sequential information to improve classification.

### BERT <a name="BERT"></a>


Finally, BERT (Bidirectional Encoder Representations from Transformers) can be used for text classification. This is a state-of-the-art method for pre-trained language representations that makes use of a transformer. A transformer applies an attention mechanism that learns contextual relations between the words in a text by determining which parts of the input sequence are important. In contrast to directional models, which read the text input sequentially, the Transformer encoder reads the entire sequence of words at once. This enables the model to learn the context of a word based on all the surrounding words, which could be very beneficial for a text classification task like the one we are examining here.
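
The attention mechanism can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product attention on toy vectors, not the BERT implementation used in our notebooks: it leaves out the learned projections, the multiple attention heads and everything else a real Transformer contains. All names and values are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # output is the weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for a sequence of three tokens (hypothetical values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to all tokens in the sequence at once, which is what allows the encoder to use context from both directions.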



## RQ1: To what extent can we predict an author's gender based on: <a name="RQ1"></a>

### RQ1.I Blogs <a name="RQ1B"></a>



In [None]:
df_bg = pd.read_csv(path+'ACC/BG.csv', sep=';', index_col=0)
df_bg

| Model | NN | RNN | BERT |
|---|---|---|---|
| Acc. valid | 69.5% | 67.5% | 66.0% |


In [None]:
img1 = open(path+'gender_Blog_nn_confusion_matrix.png', 'rb').read()
img2 = open(path+'gender_Blog_rnn_confusion_matrix.png', 'rb').read()
img3 = open(path+'GENDER_BLOG_BERT_confusion_matrix.png', 'rb').read()

wi1 = widgets.Image(value=img1, format='png', width=390, height=400)
wi2 = widgets.Image(value=img2, format='png', width=390, height=400)
wi3 = widgets.Image(value=img3, format='png', width=390, height=400)

sidebyside1 = widgets.HBox([wi1, wi2, wi3])

display.display(sidebyside1)

[Figure: confusion matrices for gender prediction on the blog dataset (left to right: NN, RNN, BERT)]



### RQ1.II Chats <a name="RQ1C"></a>


In [None]:
df_pg = pd.read_csv(path+'ACC/PG.csv', sep=';', index_col=0)
df_pg

| Model | NN | RNN | BERT |
|---|---|---|---|
| Acc. valid | 60.4% | 62.3% | 57.6% |


In [None]:
img4 = open(path+'gender_PAN_nn_confusion_matrix.png', 'rb').read()
img5 = open(path+'gender_PAN_rnn_confusion_matrix.png', 'rb').read()
img6 = open(path+'GENDER_PAN_BERT_confusion_matrix.png', 'rb').read()

wi4 = widgets.Image(value=img4, format='png', width=390, height=400)
wi5 = widgets.Image(value=img5, format='png', width=390, height=400)
wi6 = widgets.Image(value=img6, format='png', width=390, height=400)

sidebyside2 = widgets.HBox([wi4, wi5, wi6])

display.display(sidebyside2)

[Figure: confusion matrices for gender prediction on the PAN chat dataset (left to right: NN, RNN, BERT)]

## RQ2: To what extent can we predict an author's age-group based on: <a name="RQ2"></a>

### RQ2.I Blogs <a name="RQ2B"></a>



In [None]:
df_ba = pd.read_csv(path+'ACC/BA.csv', sep=';', index_col=0)
df_ba

| Model | NN | RNN | BERT |
|---|---|---|---|
| Acc. valid | 56.2% | 44.7% | 60.0% |


In [None]:
img7 = open(path+'Blog_nn_confusion_matrix.png', 'rb').read()
img8 = open(path+'Blog_rnn_confusion_matrix.png', 'rb').read()
#img9 = open(path+'Blog_BERT_confusion_matrix.png', 'rb').read()

wi7 = widgets.Image(value=img7, format='png', width=390, height=400)
wi8 = widgets.Image(value=img8, format='png', width=390, height=400)
#wi9 = widgets.Image(value=img9, format='png', width=390, height=400)

sidebyside3 = widgets.HBox([wi7, wi8])

display.display(sidebyside3)

[Figure: confusion matrices for age-group prediction on the blog dataset (left to right: NN, RNN)]

### RQ2.II Chats <a name="RQ2C"></a>

In [None]:
df_pa = pd.read_csv(path+'ACC/PA.csv', sep=';', index_col=0)
df_pa

| Model | NN | RNN | BERT |
|---|---|---|---|
| Acc. valid | 48.6% | 50.6% | 44.2% |


In [None]:
img10 = open(path+'PAN_nn_confusion_matrix.png', 'rb').read()
img11 = open(path+'PAN_rnn_confusion_matrix.png', 'rb').read()
img12 = open(path+'AGE_PAN_BERT_confusion_matrix.png', 'rb').read()

wi10 = widgets.Image(value=img10, format='png', width=390, height=400)
wi11 = widgets.Image(value=img11, format='png', width=390, height=400)
wi12 = widgets.Image(value=img12, format='png', width=390, height=400)

sidebyside4 = widgets.HBox([wi10, wi11, wi12])

display.display(sidebyside4)

[Figure: confusion matrices for age-group prediction on the PAN chat dataset (left to right: NN, RNN, BERT)]

## Conclusions <a name="Conclusions"></a>

The results above give us some insight into the predictive performance of the different networks.

First of all, the models can predict an author's gender reasonably well based on blogs; they tend to be slightly better at predicting female authors than male authors. For this type of data our neural network showed the best performance (69.5% validation accuracy).

Secondly, the results showed that it was more difficult to predict an author's gender based on chats, especially for BERT. The other two models were slightly better at predicting male writers than female writers. In this case the recurrent neural network performed slightly better than the others.

Thirdly, the confusion matrices of the neural network and the recurrent neural network for the age-group prediction task on the blog dataset show that most true positives are in the 10's age group and in the 40's age group. This shows that these two models perform better on the outermost age groups than on the two middle age groups. The BERT model showed the highest accuracy for this dataset on this classification task.

Lastly, all models performed fairly similarly on the age-group classification task on the chat dataset. These confusion matrices show (as with the blog dataset) that the models tend to be slightly better at predicting the 10s age group than the other two. Overall, the recurrent neural network showed the highest accuracy in this case.

## Discussion <a name="Discussion"></a>

This research has shown that age and gender can, to a certain extent, be predicted with a neural network approach. However, there is still much room for future research.

First of all, in this research we used integer encodings to map words to numbers for our NN and RNN. However, there are other ways to represent words that could lead to better predictive performance. One of these methods is one-hot encoding. This would give every word its own dimension, and it would assume that words are fully independent. Another way of embedding words would be to use a pre-trained embedding model like word2vec. The advantage of pre-trained word vectors is that similar words get similar vectors. The exact effects of these embeddings are not clear, but these options could possibly lead to better predictive performance.
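
The independence assumption of one-hot encoding can be shown with a short sketch. The vocabulary and the helper names are hypothetical; the example only illustrates that two different one-hot vectors never overlap.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the position of the word."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical three-word vocabulary with integer indices
toy_wordmap = {"the": 0, "cat": 1, "sat": 2}
vectors = {word: one_hot(idx, len(toy_wordmap)) for word, idx in toy_wordmap.items()}

print(vectors["cat"])                       # [0, 1, 0]
print(dot(vectors["the"], vectors["cat"]))  # 0
print(dot(vectors["cat"], vectors["cat"]))  # 1
```

Because the dot product of any two different words is 0, one-hot vectors carry no notion of similarity, in contrast to pre-trained embeddings.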

Secondly, to avoid bias in the neural networks, we chose to apply under-sampling to the age groups. This results in a much smaller dataset, which is not desirable. In future work, other methods of balancing could therefore be considered, for example weighted classes or over-sampling.
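
As an illustration of the weighted-classes alternative, the sketch below derives inverse-frequency class weights from a list of labels. The function name, the weighting formula and the label counts are assumptions chosen for this example; they were not used in this research.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, deliberately imbalanced labels (0 = 10s, 1 = 20s, 2 = 30s)
labels = [0] * 10 + [1] * 30 + [2] * 60
weights = inverse_frequency_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

Rare classes receive a larger weight, so their errors count more heavily in the loss while all samples are kept. In PyTorch such weights could, for example, be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`.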

Lastly, these test results show that there is predictive power in all three models. However, all models have been trained and tested separately on blog and chat texts. Another interesting line of research would be to cross-validate the models across both datasets. This means that we would train a model on blog texts and test it on chat texts, and vice versa.
This would give more insight into the "real" predictive power of the models, because they would be tested on data from a different distribution.

### Monologue versus dialogue

The main difference between the datasets is the perspective and audience the texts were written for. In the Blog Authorship dataset the data are blogposts, which are monologues that are not aimed at a direct audience. The posts can be written for the author themselves or for any group interested in the blogposts. The PAN13 dataset, on the other hand, consists of texts derived from social media. These are one side of a conversation, aimed at the person receiving the texts. This data is thus an isolated part of a dialogue.

## References <a name="References"></a>
7 Techniques to Handle Imbalanced Data. (2021). KDnuggets. Retrieved March 11, 2021, from https://www.kdnuggets.com/7-techniques-to-handle-imbalanced-data.html/

Dichiu, D., & Rancea, I. (n.d.). Using Machine Learning Algorithms for Author Profiling In Social Media. 6.

Hsieh, F., Dias, R., & Paraboni, I. (2018, May). Author Profiling from Facebook Corpora. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). LREC 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1407

Mesnards, N. G. des, Hunter, D. S., Hjouji, Z. el, & Zaman, T. (2020). Detecting Bots and Assessing Their Impact in Social Networks. ArXiv:1810.12398 [Physics, Stat]. http://arxiv.org/abs/1810.12398

Mitchell, K. J. (2001). Risk Factors for and Impact of Online Sexual Solicitation of Youth. JAMA, 285(23), 3011. https://doi.org/10.1001/jama.285.23.3011

Ross, B., Pilz, L., Cabrera, B., Brachten, F., Neubaum, G., & Stieglitz, S. (2019). Are social bots a real threat? An agent-based model of the spiral of silence to analyse the impact of manipulative actors in social networks. European Journal of Information Systems, 28(4), 394–412. https://doi.org/10.1080/0960085X.2018.1560920

Santosh, K., Bansal, R., Shekhar, M., & Varma, V. (2013, January 1). Author Profiling: Predicting Age and Gender from Blogs Notebook for PAN at CLEF 2013.

Wiegmann, M., Stein, B., & Potthast, M. (2019). Overview of the Celebrity Profiling Task at PAN 2019. 19.

