# Introduction

In this series of notebooks, we explore whether an automated NLP-based model can learn what constitutes behavior worthy of a subreddit ban and apply that to future posts and communities.  For now we limit our attention to r/incels and r/foreveralone comments, on the assumption that if we can accurately tell comments from these two subreddits apart, we could plausibly do even better when comparing r/incels to an unrelated subreddit.  Later, we also limit the amount of jargon present in an effort to generalize the model to other subreddits.  Though this process should always rely mostly on human judgement, the models obtained here might help detect smaller groups that are either starting new subreddits or spreading through existing ones, potentially stemming toxic behavior before it enters the community.

To this end, we use three notebooks to:
1. Gather posts to be used later.
2. Examine raw word frequencies to see how well a simple bag-of-words approach might solve the problem.
3. Introduce several changes to step 2, including rescaling our numerical representation of words and using a different classification method.  Because both change at once, it is difficult to tell whether the rescaling or the new classifier is responsible for the change in results; we simply note the increased accuracy in distinguishing the two groups and continue the discussion from there.

## Gathering Reddit Data from Pushshift API

Here, we document the process used to interact with the Pushshift API.  The database has collected a large volume of Reddit content, which is used in the main notebook.  More information about the database, including query parameters and update schedule, can be found on https://pushshift.io/.
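
As a small illustration of the kind of query used in this notebook, the sketch below builds a search URL from the endpoint and the two parameters that appear in the code further down (subreddit and before).  The helper name build_search_url is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint root as used by the collection function in this notebook.
BASE = "https://api.pushshift.io/reddit"


def build_search_url(kind, subreddit, before=None):
    """Return a search URL for kind ('submission' or 'comment').

    build_search_url is a hypothetical helper for illustration only.
    """
    if kind not in ("submission", "comment"):
        raise ValueError("kind must be 'submission' or 'comment'")
    params = {"subreddit": subreddit}
    if before is not None:
        # Epoch seconds; the notebook passes a created_utc value here.
        params["before"] = before
    return f"{BASE}/{kind}/search/?{urlencode(params)}"


print(build_search_url("comment", "foreveralone"))
print(build_search_url("comment", "foreveralone", before=1540000000))
```

Using urlencode rather than string formatting keeps the query valid if a parameter value ever contains characters that need escaping.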

## Pooling Data from the API

We start by defining a function that collects either submissions or comments from the API.  The maximum number of items retrieved per request is 25, so the function pages backwards in time, using the created_utc value of the oldest item received as the before value of the next request.  A separate function was added to collect results over multiple files.  At the moment, the function is limited to retrieving the most recent items.

In [2]:
import requests
import json
from pathlib import Path
import time

In [1]:
def collect_r_json(subreddit, suffix, com_or_submit = 'submission', collect_n_ish=1000, start_time = None):
    filename = f'./jsons/{subreddit}_{com_or_submit}_{suffix}.json'
    if Path(filename).is_file():
        print(f'{subreddit}_{com_or_submit}_{suffix}.json already exists.  Please delete to replace.')
        return
    
    if com_or_submit not in ['submission', 'comment']:
        print('com_or_submit (third argument) needs to be either "submission" or "comment".')
        return
    
    if start_time:
        URL = f"https://api.pushshift.io/reddit/{com_or_submit}/search/?subreddit={subreddit}&before={start_time}"
    else:
        URL = f"https://api.pushshift.io/reddit/{com_or_submit}/search/?subreddit={subreddit}"
    header = {'User-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.3253.633 Safari/537.36'}
    results = requests.get(URL, headers=header)
    final = results.json()
    

    # Timestamp of the oldest item received so far; the next request asks for items before it.
    before = final['data'][-1]['created_utc']
    
    for i in range((collect_n_ish-25)//25):
        URL = f"https://api.pushshift.io/reddit/{com_or_submit}/search/?subreddit={subreddit}&before={before}"
        results = requests.get(URL, headers=header)
        data = results.json()
        if not data['data']:
            # Nothing older is available, so there is no timestamp to read.
            print('came up short')
            break
        final['data'].extend(data['data'])
        before = data['data'][-1]['created_utc']
        if len(data['data']) < 25:
            print('came up short')
            break
        time.sleep(1)
            
    # Make sure the output directory exists, then write the pooled results.
    Path(filename).parent.mkdir(parents=True, exist_ok=True)
    with open(filename, 'w') as f:
        json.dump(final, f)

    return before
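
The paging logic can be checked without touching the network.  The following is a minimal sketch, not the function above: collect_pages and fake_fetch are hypothetical names, and the fake data source simply hands out items with decreasing timestamps, 25 at a time.

```python
def collect_pages(fetch_page, n_items, page_size=25):
    """Page backwards in time until roughly n_items are collected.

    fetch_page(before) is a hypothetical callable that returns a list of
    dicts, newest first, each carrying a 'created_utc' key.
    """
    items, before = [], None
    while len(items) < n_items:
        page = fetch_page(before)
        if not page:
            break  # nothing older is available
        items.extend(page)
        before = page[-1]['created_utc']  # oldest item seen so far
        if len(page) < page_size:
            break  # a short page means the source ran out
    return items, before


def fake_fetch(before, page_size=25):
    """Stand-in data source: 60 items with timestamps 60, 59, ..., 1."""
    newest = 60 if before is None else before - 1
    oldest_exclusive = max(newest - page_size, 0)
    return [{'created_utc': t} for t in range(newest, oldest_exclusive, -1)]


items, before = collect_pages(fake_fetch, n_items=1000)
print(len(items), before)
```

With the fake source, the loop stops on the third page because it holds only 10 items, which mirrors the "came up short" case.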

In [2]:
def collect_N_files(subreddits, com_or_sub, N_files):
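    for topic in subreddits:
        for key in com_or_sub:
            # Start from the most recent items for each subreddit and item type,
            # so a timestamp from one collection does not carry over into the next.
            start_time = None
            for i in range(N_files):
                start_time = collect_r_json(topic, i, key, com_or_sub[key], start_time)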

In [6]:
topics = ['incels', 'foreveralone', 'uncensorednews', 'altnewz', 'changemyview']

In [7]:
com_or_sub = {'comment':10000, 'submission':2000}

In [None]:
collect_N_files(topics, com_or_sub, 10)
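
To get a feel for how long this call takes, the rough estimate below counts requests and one-second pauses.  It assumes that no output file exists yet and that every request returns a full page of 25 items; the variable names are illustrative only.

```python
PAGE = 25      # items per request, as stated earlier
N_TOPICS = 5   # len(topics)
N_FILES = 10   # files per subreddit and item type
targets = {'comment': 10000, 'submission': 2000}


def follow_up_requests(n_items):
    """Requests made inside the paging loop for one file."""
    return (n_items - PAGE) // PAGE


# One initial request per file plus the follow-up requests.
requests_per_pass = sum(1 + follow_up_requests(n) for n in targets.values())
total_requests = requests_per_pass * N_TOPICS * N_FILES

# Each follow-up request is followed by a one-second pause.
pause_seconds = sum(follow_up_requests(n) for n in targets.values()) * N_TOPICS * N_FILES

print(total_requests, round(pause_seconds / 3600, 1))
```

Under these assumptions the call issues 24,000 requests and spends about 6.6 hours in pauses alone, before any network latency is counted.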

Two more forums are collected here.  Both are assumed to exhibit some of the behavior seen in r/incels, though only r/truecels had been banned at the time of writing.

In [9]:
collect_N_files(['truecels', 'mensrights'], com_or_sub, 1)
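
Once the files are written, they can be read back and pooled.  The sketch below is one possible way to do so, assuming the file naming used above and that each file holds its items under the 'data' key; load_items is a hypothetical helper, and the demonstration writes placeholder files to a temporary folder rather than reading real data.

```python
import json
import tempfile
from pathlib import Path


def load_items(subreddit, com_or_submit, folder='./jsons'):
    """Concatenate the 'data' lists of every saved file for one subreddit.

    load_items is a hypothetical helper, not part of the original notebook.
    """
    items = []
    pattern = f'{subreddit}_{com_or_submit}_*.json'
    for path in sorted(Path(folder).glob(pattern)):
        with open(path) as f:
            items.extend(json.load(f)['data'])
    return items


# Demonstration with placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(Path(tmp) / f'incels_comment_{i}.json', 'w') as f:
            json.dump({'data': [{'id': i}]}, f)
    demo_items = load_items('incels', 'comment', folder=tmp)

print(len(demo_items))
```

In practice the call would be something like load_items('incels', 'comment'), which reads from the ./jsons folder used by the collection function.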