# "The Mechanical Jerk"
> "AI for evil"
- comments: true
- categories: [ml, nlp]

In [1]:
#hide
import json
import pandas as pd
import numpy as np
from pathlib import Path
import re

## Background

Sometimes the anonymity of the internet relieves whatever societal pressure normally skews the battle between superego and id, producing the cesspits we see on Twitter, Reddit, and anywhere else people talk to each other. I wanted to do something about the rage-inducing interactions between people who see each other as faceless 'others', where anyone who doesn't share your viewpoint gets dehumanised and is, therefore, a moron.

I thought a good idea would be to put my natural language processing (NLP) skillset to use. Have you ever heard the saying 'you can't win a fight against a heavy bag'? Well, what if the heavy bag could goad you into a fight as well as hit back? My plan was to finetune a medium-sized GPT-2 model so that it could argue coherently with people online, and to place this model inside bots that interact with Twitter and Reddit, like I'm a modern-day Rabbi Loew. By siccing my golems on people who are engaged in flame wars and making the existence of these bots known, I hope to discourage excessive toxicity and raise awareness. After all, if the idiot you're arguing with might be a pile of linear algebra, what's the point in getting angry?

Of course, all of this is nonsense post-hoc justification for me to watch people get angry and scream impotently into the abyss.

## The Model

I'm using GPT-2, a transformer model developed by OpenAI that uses generative pretraining to allow for unsupervised training on text. Just as convolutional neural networks can be pretrained on ImageNet or some other large image dataset and then finetuned on a much smaller one, GPT-2 does the same for language. The objective during pretraining is simply to predict the next word given the preceding text. This is useful because text is abundant and usually unlabelled. For more information, read the original paper https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf or one of the multitude of blog posts on the subject.
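
To make the "predict the next word" objective concrete, here is a toy sketch of how plain, unlabelled text turns into training examples. It splits on whitespace purely for illustration (GPT-2 itself works on subword tokens rather than whole words), and `next_word_examples` is a hypothetical helper, not something from the GPT-2 codebase.

```python
def next_word_examples(text):
    """Turn raw text into (context, next word) pairs; no labels needed."""
    # Assumption: whitespace splitting stands in for GPT-2's subword tokenizer.
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

examples = next_word_examples("you are wrong about everything")
for context, target in examples:
    print(' '.join(context), '->', target)
```

Every position in the text supplies its own target, which is why no human labelling is required.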

During finetuning, the model is given a comment and the objective is to iteratively predict the next word of the response until the `<endoftext>` token is predicted, denoting the end of the response.
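
The word-by-word loop can be sketched roughly as follows. This is a toy under stated assumptions, not the real model: `next_word` is a hypothetical stand-in for the finetuned GPT-2, `toy_next_word` simply replays a canned reply, and the `|||` separator and `<endoftext>` marker mirror the ones used when the training text is written out.

```python
END = '<endoftext>'

def generate_response(comment, next_word, max_words=50):
    """Append predicted words one at a time until END (or a length cap)."""
    # `next_word` is a hypothetical stand-in for the finetuned model:
    # it takes the words so far and returns the predicted next word.
    context = comment.split() + ['|||']
    response = []
    for _ in range(max_words):
        word = next_word(context + response)
        if word == END:
            break
        response.append(word)
    return ' '.join(response)

# Toy "model": replays a canned reply, then emits the end token.
canned = ['no', 'you', 'are', END]

def toy_next_word(words):
    n_generated = len(words) - words.index('|||') - 1
    return canned[n_generated]

reply = generate_response('you are wrong', toy_next_word)
print(reply)
```

The `max_words` cap is there because a real model is not guaranteed to ever emit the end token.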

## Data

For my training data, I chose to use the Internet Argument Corpus provided by UC Santa Cruz https://nlds.soe.ucsc.edu/iac. The dataset consists of roughly 73,000,000 words of forum debates on all the juicy topics such as gun control, abortion, the existence of a god, and many more.

Firstly, I trim the fat from the corpus, leaving only the text, discussion id, id, and parent id for each comment. Then come some preprocessing steps, such as removing posts that contain quotations of other posts. These could be parsed properly, cleaned up and included, but there's such an abundance of data that it's not necessary. After that, it's just a matter of cleaning up some '\n' newline characters and stray backslashes. We randomly sample

In [4]:
#hide_input
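posts = pd.read_csv('posts_lite.csv')
# Fill missing values (e.g. the parent_id of a top-level post) with 0
posts.fillna(value=0, inplace=True)
posts.parent_id = posts.parent_id.astype(int)
# Drop posts that quote other posts; .copy() avoids chained-assignment warnings below
posts = posts.loc[~posts.text.str.contains('QUOTE')].copy()
# Remove newlines and unescape apostrophes; regex=False treats both patterns literally
posts['text'] = posts.text.str.replace('\n', '', regex=False)
posts['text'] = posts.text.str.replace("\\'", "'", regex=False)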
posts

| | discussion_id | id | parent_id | text |
|---|---|---|---|---|
| 0 | 4 | 2049 | 4 | Yeah, he had a recession, and his tax cuts mov... |
| 1 | 4 | 2050 | 4 | They cut the deficit by 70 thousand million...... |
| 2 | 4 | 15235 | 13912 | The US is still the closest thing to a pure ... |
| 3 | 4 | 4 | 0 | i just thought that it would be cool to revive... |
| 4 | 4 | 517 | 4 | capitalism in theory allows people to have equ... |
| ... | ... | ... | ... | ... |
| 111563 | 14532 | 409477 | 409473 | It's now legal for people to carry their guns ... |
| 111564 | 14532 | 409505 | 409487 | Whoops! I was wrong. Instead of ad hominem a... |
| 111565 | 14532 | 409483 | 409477 | Not to come off negatively, NATO, but you are ... |
| 111566 | 14532 | 409487 | 409483 | The refuse to address that such an animal ev... |


I parse these debate threads into parent and response pairs, sample 40000 pairs and concatenate them, inserting the special separator characters that let GPT-2 tell which part is the parent and which is the response.

In [6]:
#hide_input
def get_pairs(posts):
    """Pair each post with its direct replies within the same discussion."""
    # Self-join: a reply's parent_id matches the parent's id in the same discussion.
    # DataFrame.append was removed in pandas 2.0, so the pairs are built with a merge.
    merged = posts.merge(
        posts,
        left_on=['discussion_id', 'id'],
        right_on=['discussion_id', 'parent_id'],
        suffixes=('_parent', '_child'),
    )
    pairs = merged[['text_parent', 'text_child']]
    pairs = pairs.rename(columns={'text_parent': 'parent', 'text_child': 'child'})
    return pairs.reset_index(drop=True)
posts = posts.sample(10000)
pairs = get_pairs(posts)
pairs.iloc[:100]

| | parent | child |
|---|---|---|
| 0 | i just thought that it would be cool to revive... | I never denied that it didn't. But that was a ... |
| 1 | i just thought that it would be cool to revive... | Capitolism sucks, unfortunately people are bli... |
| 2 | i just thought that it would be cool to revive... | unemployment did not decrease during reagans a... |
| 3 | i just thought that it would be cool to revive... | I'll give to the poor when and if I want to. I... |
| 4 | i just thought that it would be cool to revive... | capitalism in theory allows people to have equ... |
| ... | ... | ... |
| 95 | Source: Merriam-Webster Dictionary of Law, ... | Hmmm perhaps I used the wrong word. I meant ... |
| 96 | No it definitly goes deeper than that. It is ... | Are you looking for an answer from a creationi... |
| 97 | I am aware of this, the UK doesn't have any s... | Since you are not a citizen of the USA, you ... |
| 98 | bootfitter, just responding to jitobear and I ... | Of course you may respond to whomever you wis... |


In [None]:
input_text = ''.join((pairs.parent+' ||| '+pairs.child+' <endoftext> ').values)
Path("input-text.txt").write_text(input_text)
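
Whatever layout goes into the training file has to be reproduced when the bot talks to the model later. Here is a minimal sketch of that round trip, assuming the bot prompts with the parent text plus the separator and cuts the output at the end marker. The helper names `format_pair`, `make_prompt` and `extract_reply` are hypothetical and not part of the notebook.

```python
SEP = ' ||| '
END = ' <endoftext> '

def format_pair(parent, child):
    """One training example, laid out the same way as input-text.txt."""
    return parent + SEP + child + END

def make_prompt(parent):
    """What the bot would feed the model: the parent plus the separator."""
    return parent + SEP

def extract_reply(generated):
    """Keep only the text between the separator and the end marker."""
    reply = generated.split(SEP, 1)[1]
    return reply.split(END.strip(), 1)[0].strip()

sample = format_pair('Guns are bad.', 'No they are not.')
print(make_prompt('Guns are bad.'))
print(extract_reply(sample + 'whatever the model rambles on about next'))
```

Cutting at the first end marker matters because a language model will happily keep writing a whole new argument after the reply is finished.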

## Training

Thank god for Docker. The days of worrying about incompatible versions of CUDA, cuDNN, Nvidia drivers and TensorFlow are over. All training was done in a Docker container on a Google Cloud instance with a beefy V100 GPU attached. My RTX 2060 Super didn't have enough VRAM, even when using half-precision. Using Google Cloud spared me from having to deal with the instability of Colab and was a good way to burn the rest of my free credit before it expired. Training was run for 

## Conclusion
