### Data cleaning and preprocessing 

In [69]:
# load data
import pandas as pd
from sklearn.utils import shuffle
df = pd.read_table('coursework2_train.tsv')
df = shuffle(df) # randomly shuffle data entries 
df

Unnamed: 0,article_id,article_title,label,sentence_text
4634,730559808,Austrian bishop forcefully rejects German Bish...,non-propaganda,But parents can bless their children.
6749,789615291,Stop Appeasing the Democrats,non-propaganda,She told no one about the assault until 30 yea...
4998,728972961,FOR THE FIRST TIME ONLINE: Archbishop Lefebvre...,non-propaganda,I do not destroy the Church.
10562,758386255,Pope Francis vs Contemplative Orders,non-propaganda,"Unfortunately, it was about a subject that the..."
11246,780414700,Bishop Morlino Targets ‘Homosexual Subculture’...,non-propaganda,To our seminarians: If you are unchastely prop...
7974,698018235,Dan Fishback: It's Okay to Boycott Israeli Pla...,propaganda,Now he's whining that boycotting his play is a...
11343,777488669,"The Death Penalty, Instituted by God Himself (...",non-propaganda,If it is ever necessary to hold back the revol...
3203,782149225,"Muslim Leader Calls for Conquest of “America, ...",non-propaganda,Will Brett Kavanaugh be confirmed to the Supre...
2621,111111132,A popular public school Bible class in West Vi...,propaganda,"According to Elliott, the Mercer program is “e..."
7977,698018235,Dan Fishback: It's Okay to Boycott Israeli Pla...,non-propaganda,"""it’s not that BDS is “censoring” work — it’s ..."


In [70]:
# take a look at data size
raw_labels = df.label.values.tolist()
docs = df.sentence_text.tolist()
titles = df.article_title.values.tolist()

label_dic = {'non-propaganda':0, 'propaganda':1}

assert len(docs) == len(raw_labels) == len(titles)
labels = [label_dic[rl] for rl in raw_labels] # transfer raw labels (strings) to integer numbers
print('total data size: {}, label type num: {}'.format(len(docs), len(label_dic)))

total data size: 11464, label type num: 2


In [71]:
# take a look at some sentences in the dataset
print(docs[25])
print(titles[25])
print(labels[25])

Right around 3.79 billion miles.
NASA releases images captured at a record-breaking 3.79 billion miles from Earth
0


I chose to use both titles and sentences to train my model, as both together would provide more information to detect whether the post is propaganda or not.

The first piece of preprocessing I performed is converting the text to lowercase, in order to remove stopwords later (nltk's stopwords are all in lowercase).

In [72]:
# convert text to lowercase
df['article_title'] = df['article_title'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['sentence_text'] = df['sentence_text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

Following this, I lemmatized the text so word importances can be correctly identified.

In [73]:
# lemmatize text
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
df['article_title'] = df['article_title'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(y) for y in x.split()]))
df['sentence_text'] = df['sentence_text'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(y) for y in x.split()]))

I then removed all punctuation, in order to make clean and word-only tokens.

In [74]:
# remove punctuation
import re
df['article_title'] = df['article_title'].apply(lambda x: re.sub('[^\w\s]', '', x))
df['sentence_text'] = df['sentence_text'].apply(lambda x: re.sub('[^\w\s]', '', x))

Next, I tokenized the text to perform analysis.

In [75]:
# tokenize text
from nltk import word_tokenize
df['tokens_title'] = df['article_title'].apply(lambda x: nltk.word_tokenize(x))
df['tokens_sentence'] = df['sentence_text'].apply(lambda x: nltk.word_tokenize(x))

In [76]:
# take another quick look at the data to make sure preprocessing is working
df.head()

Unnamed: 0,article_id,article_title,label,sentence_text,tokens_title,tokens_sentence
4634,730559808,austrian bishop forcefully reject german bisho...,non-propaganda,but parent can bless their children,"[austrian, bishop, forcefully, reject, german,...","[but, parent, can, bless, their, children]"
6749,789615291,stop appeasing the democrat,non-propaganda,she told no one about the assault until 30 yea...,"[stop, appeasing, the, democrat]","[she, told, no, one, about, the, assault, unti..."
4998,728972961,for the first time online archbishop lefebvres...,non-propaganda,i do not destroy the church,"[for, the, first, time, online, archbishop, le...","[i, do, not, destroy, the, church]"
10562,758386255,pope francis v contemplative order,non-propaganda,unfortunately it wa about a subject that the w...,"[pope, francis, v, contemplative, order]","[unfortunately, it, wa, about, a, subject, tha..."
11246,780414700,bishop morlino target homosexual subculture in...,non-propaganda,to our seminarians if you are unchastely propo...,"[bishop, morlino, target, homosexual, subcultu...","[to, our, seminarians, if, you, are, unchastel..."


It is important to remove stopwords, as including common words would make the models less effective.

In [77]:
# remove stopwords
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))

def remove_stopwords(text):
    return [x for x in text if x not in sw]

cleaned_title = df['tokens_title'].apply(lambda x: remove_stopwords(x))
cleaned_sentence = df['tokens_sentence'].apply(lambda x: remove_stopwords(x))

In [79]:
# recreate the sentences in the correct format for developing the models
final_title = [' '.join(x) for x in cleaned_title]
final_sentence = [' '.join(x) for x in cleaned_sentence]

In [94]:
# combine titles and sentences to form one sentence
concat_lists = lambda x,y: x + " " + y
final_text = list(map(concat_lists, final_title, final_sentence))
final_text[0]

'austrian bishop forcefully reject german bishops idea blessing homosexual union parent bless children'

Now it's time to split the data. I chose a 60/20/20 split for the train, dev, and test sets as this is known to be a fair split in machine learning processes.

In [95]:
# split the data into train, dev and test
train_ratio, dev_ratio, test_ratio = 0.6, 0.2, 0.2
train_docs = final_text[:int(len(final_text)*train_ratio)]
train_labels = labels[:int(len(final_text)*train_ratio)]

dev_docs = final_text[int(len(final_text)*train_ratio):int(len(final_text)*(train_ratio+dev_ratio))]
dev_labels = labels[int(len(final_text)*train_ratio):int(len(final_text)*(train_ratio+dev_ratio))]

test_docs = final_text[-int(len(final_text)*(test_ratio)):]
test_labels = labels[-int(len(final_text)*(test_ratio)):]

print('train size: {}, dev size: {}, test size: {}'.format(len(train_labels), len(dev_labels), len(test_labels)))

train size: 6878, dev size: 2293, test size: 2292


### Classic machine learning model: Logistic Regression

Tf-idf is an ideal method to use for feature extraction in this text analysis as it would outline the key words required to correctly categorize the data. I had initially decided not to set a max feature number, as limiting the features produced results with a lower accuracy. However, I realized that this led to overfitting of the data, so I used a max feature number of 3000.

I believe logistic regression is a good algorithm to use in this case because the labels are binary - propaganda and non-propaganda. Since logistic regression is one of the most common methods for binary classification, this is a safe bet.

In [103]:
# tf-idf and logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 3000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_docs)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_docs)

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(solver='lbfgs').fit(train_vecs, train_labels)

# make prediction
test_pred_lr = clf_lr.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc_lr = accuracy_score(test_labels, test_pred_lr)
pre_lr, rec_lr, f1_lr, _ = precision_recall_fscore_support(test_labels, test_pred_lr, average='macro')
print('accuracy:', acc_lr)
print('precision:', pre_lr)
print('rec:', rec_lr)
print('f1:', f1_lr)

accuracy: 0.7478184991273996
precision: 0.7126680453796216
rec: 0.5847357220371788
f1: 0.5826031065881093


Logistic regression provides a macro f1 of 0.583, which is not very high. I then tried a neural-based model (MLP) to try to increase this.

### Neural Network model: Multi-layer Perceptron

I decided to use GloVe word embeddings, as this model looks at longer range co-occurance rather than simply analyzing words next to each other.
I chose the vectors trained on Wikipedia 2014 + Gigaword 5 with 6 billion tokens, as a dataset of 11464 sentences is fairly small and would not require the larger word vector text files.
https://github.com/stanfordnlp/GloVe

In [23]:
# load the glove pre-trained embedding
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

path_of_downloaded_files = "/users/richa/glove.6B/glove.6B.300d.txt"
glove_file = datapath(path_of_downloaded_files)
word2vec_glove_file = get_tmpfile("glove.6B.300d.txt")
glove2word2vec(glove_file, word2vec_glove_file)
word_vectors = KeyedVectors.load_word2vec_format(word2vec_glove_file)

In [96]:
import numpy as np

word_vec_dim = 300
oov_vec = np.random.rand(word_vec_dim)

def vectorize_sent(word_vectors, sent):
    word_vecs = []
    for token in word_tokenize(sent): 
        if token not in word_vectors: 
            word_vecs.append(oov_vec)
        else:
            word_vecs.append(word_vectors[token].astype('float64'))
    return np.mean(word_vecs,axis=0)

# test function to make sure it works
vv = vectorize_sent(word_vectors, train_docs[1])
print(vv.shape)
print(vv.dtype)

(300,)
float64


In [97]:
# create vector representations
train_vectors = np.array([vectorize_sent(word_vectors, ss) for ss in train_docs])
dev_vectors = np.array([vectorize_sent(word_vectors, ss) for ss in dev_docs])
print(train_vectors.shape)
print(train_vectors.dtype)

(6878, 300)
float64


In [98]:
# define a simple MLP (multi-layer perceptron) as the classification model
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, out_dim, dp_rate):
        super(MLP, self).__init__()
        self.hidden_layer = nn.Linear(input_dim, input_dim*2)
        self.output_layer = nn.Linear(input_dim*2, out_dim)
        self.dropout = nn.Dropout(dp_rate)
        self.relu = torch.nn.ReLU()
       
    def forward(self, x_in):
        z1 = self.dropout(x_in) # output of the input layer, after dropout
        z2 = self.relu(self.hidden_layer(z1)) # output of the hidden layer
        logits = self.output_layer(z2)
        return logits

It is important to test out and tune hyperparameters to find a model with the best accuracy. I chose a fairly standard initial learning rate of 0.005 so that errors can be explored. Since the learning rate decreases with each epoch, this would benefit the model. I initially tried a higher learning rate of 0.01, but this was too high for the model to be efficient.
I decided to limit the number of epochs to 30, as I noticed that after a certain point the model's accuracy would drop. A dropout rate of 0.5 seemed to be the best option, as well as a batch size of 128 - this allows the model to train on more samples before updating model parameters, so a higher batch size led to a higher accuracy.

In [118]:
# build model
dropout_rate = 0.5 
model = MLP(word_vec_dim,len(label_dic),dropout_rate) 
loss_fnc = torch.nn.CrossEntropyLoss()

# hyper parameters
n_epochs = 30 # number of epoch (i.e. number of iterations)
batch_size = 128 # mini batch size
lr = 0.005 # initial learning rate

# initialize optimizer and scheduler (lr adjustor)
import torch.optim as optim
optimizer = optim.Adam(params=model.parameters(), lr=lr) # use Adam as the optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9) # decays the learning rate of each parameter group by gamma every step_size epochs

In [119]:
best_f1 = -1.
best_model = None
import copy
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

for epoch_i in range(n_epochs):
    # the inner loop is over the batches in the dataset
    model.train() # let pytorch know that gradients should be computed, so as to update the model
    for idx in range(0,len(train_vectors),batch_size):
        # Step 0: Get the data
        x_data = torch.tensor(train_vectors[idx:idx+batch_size], dtype=torch.float)
        if x_data.shape[0] == 0: continue
        y_target = torch.tensor(train_labels[idx:idx+batch_size], dtype=torch.int64)

        # Step 1: Clear the gradients 
        optimizer.zero_grad()

        # Step 2: Compute the forward pass of the model
        y_pred = model(x_data)

        # Step 3: Compute the loss value that we wish to optimize
        loss = loss_fnc(y_pred, y_target)

        # Step 4: Propagate the loss signal backward
        loss.backward()

        # Step 5: Trigger the optimizer to perform one update
        optimizer.step()
    
    # after each epoch, we can test the model's performance on the dev set
    with torch.no_grad(): # let pytorch know that no gradient should be computed
        model.eval() # let the model know that it in test mode, i.e. no gradient and no dropout
        dev_data = torch.tensor(dev_vectors, dtype=torch.float)
        dev_target = torch.tensor(dev_labels, dtype=torch.int64)
        dev_prediction = model(dev_data)
        pred_labels = [np.argmax(dp.numpy()) for dp in dev_prediction]
        pre, rec, f1, _ = precision_recall_fscore_support(dev_target, pred_labels, average='macro')
        print('\n---> after epoch {} the macro-f1 on dev set is {}'.format(epoch_i, f1))
        for param_group in optimizer.param_groups:
            print('learning rate', param_group['lr'])
        
        # save the best model
        if f1 > best_f1:
            best_f1 = f1
            best_model = copy.deepcopy(model.state_dict())
            print('best model updated; new best f1',f1)
            
    # (optional) adjust learning rate according to the scheduler
    scheduler.step()


---> after epoch 0 the macro-f1 on dev set is 0.5256958200114521
learning rate 0.005
best model updated; new best f1 0.5256958200114521

---> after epoch 1 the macro-f1 on dev set is 0.5591363300867644
learning rate 0.005
best model updated; new best f1 0.5591363300867644

---> after epoch 2 the macro-f1 on dev set is 0.5964138609325236
learning rate 0.005
best model updated; new best f1 0.5964138609325236

---> after epoch 3 the macro-f1 on dev set is 0.6137111732430592
learning rate 0.005
best model updated; new best f1 0.6137111732430592

---> after epoch 4 the macro-f1 on dev set is 0.615233487383501
learning rate 0.005
best model updated; new best f1 0.615233487383501

---> after epoch 5 the macro-f1 on dev set is 0.6528440694131449
learning rate 0.005
best model updated; new best f1 0.6528440694131449

---> after epoch 6 the macro-f1 on dev set is 0.6334546100420498
learning rate 0.005

---> after epoch 7 the macro-f1 on dev set is 0.6399110921412379
learning rate 0.005

---> af

The MLP model provides a best macro f1 of 0.684 on the train/dev set.

I then tested my model on the test set.

In [120]:
# load the best model weights
model.load_state_dict(best_model) 
test_vectors = np.array([vectorize_sent(word_vectors, ss) for ss in test_docs])

with torch.no_grad(): 
    model.eval()
    test_data = torch.tensor(test_vectors, dtype=torch.float)
    test_target = torch.tensor(test_labels, dtype=torch.int64)
    test_prediction = model(test_data)
    pred_labels = [np.argmax(dp.numpy()) for dp in test_prediction]
    pre, rec, f1, _ = precision_recall_fscore_support(test_target, pred_labels, average='macro')
    print('macro-f1 on test data:', f1)

macro-f1 on test data: 0.6836644730734227


When the model is applied to the test set, the macro-f1 is 0.684. This is higher than the score obtained from the logistic regression model, so this is clearly the better performing model.

### Save the trained model

In [122]:
import pickle

all_info_to_save = {
    'input_dim': word_vec_dim,
    'dropout_rate': dropout_rate,
    'neural_weights': best_model,
    'oov_vector': oov_vec,
    'n_epochs': n_epochs,
    'batch_size': batch_size,
    'lr': lr
}
save_path = open("trained_model.pickle","wb")
pickle.dump(all_info_to_save, save_path)
save_path.close()