# Subreddit Spam Classifier
#### Web APIs & Classification
_Author: Ritchie Kwan_

---


## Table of Contents

0. [Problem Statement, Assumptions, Executive Summary](#Problem-Statement)
1. [Data Collection](#Data-Collection)
1. [Data Cleaning & EDA](02-Data-Cleaning-and-EDA.ipynb#Data-Cleaning-and-EDA)
1. [Benchmark Model](03-Benchmark-Model.ipynb#Benchmark-Model)
1. [Model Tuning](04-Model-Tuning.ipynb#Model-Tuning)
1. [Evaluation and Conceptual Understanding](04-Model-Tuning.ipynb#Evaluation-and-Conceptual-Understanding)


## Problem Statement

Spam and ham are typically used to describe junk and relevant emails, respectively. In the context of Reddit, any post unrelated to the subreddit can be considered "spam", while posts related to the subreddit are "ham". The integrity of a subreddit relies on the quality and relevance of its posts. The content of a spam post could be as bad as a hyperlink to a phishing website or as benign as a high-quality post accidentally submitted to the wrong subreddit. 

A spam detection bot could be trained to automatically detect and remove irrelevant posts before they are released to the public. **The objective is to determine how the frequency of certain two-word phrases (bi-grams) in a post affects the odds that the post belongs to a specific subreddit, using Natural Language Processing (NLP) techniques.**
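
To make "affects the odds" concrete: in a logistic regression, a coefficient `beta` on a bi-gram feature multiplies the odds of the positive class by `exp(beta)` for each one-unit increase in that feature. The sketch below is only an illustration of that arithmetic; the coefficient and the baseline probability are made-up values, not results from this project.

```python
import math

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back into a probability."""
    return o / (1 + o)

beta = 0.7      # hypothetical coefficient for one bi-gram feature
p_base = 0.5    # hypothetical probability of ham before the bi-gram appears

# A one-unit increase in the feature multiplies the odds by exp(beta)
odds_multiplier = math.exp(beta)
new_odds = odds(p_base) * odds_multiplier
new_prob = prob(new_odds)

print(f'odds multiplier: {odds_multiplier:.3f}')
print(f'probability moves from {p_base:.3f} to {new_prob:.3f}')
```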


## Assumptions

The default page of each subreddit is sorted by `hot`, which typically surfaces posts that have already acquired a sufficient number of upvotes and may already have passed a spam detection bot. Since the posts collected using the Reddit API appear to be sorted the same way, I assume that **all posts collected using the Reddit API are ham**.

## Executive Summary

**Data Collection**  
I collected as many posts from `/r/leanfire` and `/r/fatfire` as Reddit would allow. In addition to `title` and `selftext`, I also collected `comments` to train my models. Since I collected more posts from `/r/leanfire`, it will be my target class `y=1` (ham). Posts from `/r/fatfire` will be used to imitate spam for the sake of training classification models. 

**Data Cleaning and EDA**  
I extracted text from the `title`, `selftext`, and `comments` columns to use as my predictors. I used `RegexpTokenizer` to split the text into individual words and `Lemmatizer` to reduce each word to its lemma.

**Benchmark Model**  
I split my data into training and test sets to check for overfitting. I used `CountVectorizer` and `TfidfVectorizer` to map bi-grams to their frequency or weighted frequency in each post. For my benchmark model, I used `LogisticRegression`. For the training data, I used the **cross-validated accuracy score to measure the success** of the model. Not only did the model perform better on the `TfidfVectorizer`-transformed data, but the highest-ranked bi-grams were also more meaningful.
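
As an illustration of what "mapping bi-grams to their frequency" means, here is a minimal plain-Python sketch. It is a stand-in for the vectorizers, not the notebook's code: the regex tokenizer, the helper names, and the sample post are all assumptions made for this example. In scikit-learn, passing `ngram_range=(2, 2)` to `CountVectorizer` restricts the vocabulary to bi-grams in a similar way.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r'\w+', text.lower())

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

def bigram_counts(text):
    """Count how often each two-word phrase occurs in a text."""
    return Counter(' '.join(pair) for pair in bigrams(tokenize(text)))

# Hypothetical post text, for illustration only
post = 'Early retirement on a lean budget: early retirement is possible.'
counts = bigram_counts(post)

print(counts.most_common(3))
```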

**Model Tuning**  
I used the following classification models with and without tuning hyperparameters:
- LogisticRegression
- KNN
- NaiveBayes
    - MultinomialNB
- DecisionTreeClassifier
- BaggingClassifier
- RandomForestClassifier
- ExtraTreesClassifier
- AdaBoostClassifier
- SVC
- VotingClassifier

With and without hyperparameter tuning, Naive Bayes' `MultinomialNB` performed the best.

**Evaluation and Conceptual Understanding**  
Unseen data will be newly submitted posts on `/r/leanfire`, meaning all new data belongs to the positive class (`y=1`). If the model classifies a post as negative (`y=0`), then it is likely to be spam.  

True Positives = Ham  
False Negatives = Spam  

To maintain the integrity of the subreddit, I prefer the model to have a **sufficiently high False Negative rate**. Because every new post is assumed to belong to the positive class, each negative prediction counts as a false negative, so this rate acts as the spam detection rate. The consequence of this decision is that **some quality posts may be blocked**. I prefer this over **letting actual spam bypass the detection algorithm**.
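
The rate described here can be computed directly from labels and predictions. The sketch below assumes a small hypothetical batch of ten new posts, all labeled ham (`y=1`) as in the assumption above; the predictions are invented for illustration.

```python
def false_negative_rate(y_true, y_pred):
    """Share of actual positives (y=1) that were predicted negative (y=0)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if fn + tp == 0:
        return 0.0  # no actual positives in the batch
    return fn / (fn + tp)

# Hypothetical batch: every new post is assumed to be ham (y=1)
y_true = [1] * 10
# Hypothetical predictions: two posts are flagged as spam (y=0)
y_pred = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

rate = false_negative_rate(y_true, y_pred)
print(f'Flagged as spam: {rate:.0%} of new posts')
```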


### Import Libraries

In [18]:
import requests
import time

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Data Collection


### Choose Subreddits

In [19]:
subs = ['fatfire', 'leanfire']
# subs = ['lifeprotips', 'shittylifeprotips']

possible_subs = [('animalsbeingbros', 'animalsbeingjerks'), 
                 ('powerlifting', 'weightlifting'),
                 ('warriors', 'lakers'),
                 ('funny', 'cute'),
                 ('lifeprotips', 'shittylifeprotips'),
                 ('news', 'politics'),
                 ('aww', 'lookatmydog') ]

### Define a function to get all comments in a reddit post

In [20]:
def get_comments(children):
    '''
    Arguments: 
    children : list of child comment JSONs.
    
    Recursively extracts all of the comments of a Reddit post.
    
    Return:
    comments : List of all comments in a reddit post.
    '''
    comments = []
    
    for child in children:
        if child['kind'] == 'more':
            continue
        
        data = child['data']

        comments.append(data['body'])        
        # if child has children, get comments of children
        if data['replies'] != '':
            # get_comments returns a list, so .extend() it
            comments.extend(get_comments(data['replies']['data']['children']))
    
    return comments
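
The nesting that `get_comments` walks is easier to see on a small hand-written tree. The dictionaries below are a mock that imitates only the keys the function reads (`kind`, `data`, `body`, `replies`); they are not real API output, and the `comment` helper is hypothetical. The traversal is an iterative variant with an explicit stack, offered as an alternative sketch that returns comments in the same depth-first order without relying on recursion.

```python
def get_comments_iterative(children):
    """Collect comment bodies depth-first using an explicit stack."""
    comments = []
    # Reverse so that popping from the end visits comments in page order
    stack = list(reversed(children))

    while stack:
        child = stack.pop()
        if child['kind'] == 'more':
            continue  # placeholder for collapsed comments, no body to read

        data = child['data']
        comments.append(data['body'])

        replies = data.get('replies')
        if replies:  # an empty string means the comment has no replies
            stack.extend(reversed(replies['data']['children']))

    return comments

def comment(body, replies=()):
    """Build a mock comment node (hypothetical helper for this example)."""
    nested = {'data': {'children': list(replies)}} if replies else ''
    return {'kind': 't1', 'data': {'body': body, 'replies': nested}}

mock_children = [
    comment('top A', [comment('reply A1', [comment('reply A1a')]),
                      comment('reply A2')]),
    {'kind': 'more', 'data': {}},
    comment('top B'),
]

print(get_comments_iterative(mock_children))
```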

## Hit the Reddit API

### Keys to note:  
`permalink` : to get comments  
`title` : predictor words  
`selftext` : more predictor words  
`body` : body of comment

### Define a function to collect the posts and comments of a subreddit

In [21]:
def scrape(subreddit):
    '''
    Arguments:
    subreddit : String of a subreddit
        example: 'leanfire'
    
    Requests as many posts from Reddit as possible
    Stores the text of every post's title, selftext and comments

    Return:
    A dataframe of post titles, selftext and comments
    '''
    headers = {'User-agent' : 'Reechee'}
    subreddit_url = 'https://www.reddit.com/r/' + subreddit + '.json'
    posts = []
    after = ''
    
    # Keep requesting JSONs until Reddit gives up
    while after is not None:
        # Request subreddit JSON for next 100 posts
        req = requests.get(subreddit_url,
                           headers = headers,
                           params = {'after' : after,
                                     'limit' : '100'})
        subreddit_json = req.json()
        
        # dictionary of post data
        post_data = subreddit_json['data']
        
        n_posts = len(post_data['children'])
        
        # For each post webpage, request the JSON
        for i, post in enumerate(post_data['children']):
            post_url = 'https://www.reddit.com' + post['data']['permalink'] + '.json'
            
            # Request the post webpage's JSON
            post_req = requests.get(post_url, 
                                     headers = headers)
            post_json = post_req.json()

            # Get all comments of a post
            post['data']['comments'] = get_comments(post_json[1]['data']['children'])
            
            # Print progress
#             print('[{}/{}]{} comments from {}'
#                   .format(i+1, n_posts,
#                   len(post['data']['comments']),
#                   post['data']['permalink']))
        
        # extend list of posts
        posts.extend([post['data']
                      for post in post_data['children']])
        
        # 'after' key for next batch of posts
        after = post_data['after']
        
        print(f'Fetching next 100 posts with after key: {after}')
    
    print('Done!')
    return pd.DataFrame(posts)
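
The loop in `scrape` follows a cursor: each response carries an `after` key that is sent back with the next request, and a value of `None` ends the loop. The sketch below isolates that cursor logic from the network so it can be run offline. `paginate`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the optional `pause` between requests is an assumption added here; the function above does not pause.

```python
import time

def paginate(fetch_page, pause=0.0):
    """Collect items from a cursor-based listing until the cursor is None."""
    items = []
    after = ''

    while after is not None:
        page = fetch_page(after)
        items.extend(page['children'])
        after = page['after']

        # Wait between requests, but not after the final page
        if after is not None:
            time.sleep(pause)

    return items

# Hypothetical stand-in for the API: three pages keyed by cursor value
FAKE_PAGES = {
    '':     {'children': ['p1', 'p2'], 'after': 't3_a'},
    't3_a': {'children': ['p3', 'p4'], 'after': 't3_b'},
    't3_b': {'children': ['p5'],       'after': None},
}

def fake_fetch(after):
    """Return the mock page for a given cursor."""
    return FAKE_PAGES[after]

all_posts = paginate(fake_fetch)
print(all_posts)
```

Passing a non-zero `pause` (for example `paginate(fetch, pause=1.0)`) spaces out real requests, which may help when an API limits request rates.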



In [1]:
df0 = scrape(subs[0])

In [2]:
df1 = scrape(subs[1])

In [24]:
df0.columns

Index(['approved_at_utc', 'approved_by', 'archived', 'author',
       'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'banned_at_utc', 'banned_by', 'can_gild',
       'can_mod_post', 'category', 'clicked', 'comments', 'content_categories',
       'contest_mode', 'created', 'created_utc', 'crosspost_parent',
       'crosspost_parent_list', 'distinguished', 'domain', 'downs', 'edited',
       'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_crosspostable',
       'is_meta', 'is_original_content', 'is_reddit_media_domain',
       'is_robot_indexable', 'is_self', 'is_video', 'likes',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_typ

In [9]:
df0.head()

| Unnamed: 0 | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | ... | thumbnail_height | thumbnail_width | title | ups | url | user_reports | view_count | visited | whitelist_status | wls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  | False | leedar1376 |  |  |  | [] |  |  | ... |  |  | Advisor fees at various asset levels? | 20 | https://www.reddit.com/r/fatFIRE/comments/a815... | [] |  | False |  |  |
| 1 |  |  | False | a_random_tomato |  |  |  | [] |  |  | ... |  |  | HSA record keeping | 39 | https://www.reddit.com/r/fatFIRE/comments/a7s7... | [] |  | False |  |  |
| 2 |  |  | False | retiringearly |  |  |  | [] |  |  | ... |  |  | 2019 Money Goals | 79 | https://www.reddit.com/r/fatFIRE/comments/a7f3... | [] |  | False |  |  |
| 3 |  |  | False | kernelcrop |  |  |  | [] |  |  | ... |  |  | What’s is your Tax Loss Harvesting Threshold? | 15 | https://www.reddit.com/r/fatFIRE/comments/a7fu... | [] |  | False |  |  |
| 4 |  |  | False | TreesButterPanny |  |  |  | [] |  |  | ... |  |  | Might be retiring earlier than planned, hoping... | 40 | https://www.reddit.com/r/fatFIRE/comments/a76b... | [] |  | False |  |  |


### Save subreddit posts separately

In [10]:
df0.to_csv(f'../data/{subs[0]}.csv', index = False)
df1.to_csv(f'../data/{subs[1]}.csv', index = False)

### Combine the data, shuffle and save it

In [11]:
# sort = False keeps the existing column order and matches newer pandas defaults
df = pd.concat([df0, df1], axis = 0, sort = False)
df = df.sample(frac = 1, random_state = 42).reset_index(drop = True)


In [12]:
df['subreddit'].value_counts()

leanfire    999
fatFIRE     764
Name: subreddit, dtype: int64

The classes are reasonably balanced: 999 `leanfire` posts versus 764 `fatFIRE` posts.

In [26]:
df['subreddit'].value_counts(normalize = True)

leanfire    0.566648
fatFIRE     0.433352
Name: subreddit, dtype: float64
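
These proportions also set a naive reference point: a model that always predicts the majority class would be correct for roughly 56.7% of the posts, so a useful classifier should beat that figure. A minimal sketch of the arithmetic, using the counts printed above (the variable names are illustrative):

```python
# Post counts taken from the value_counts() output above
counts = {'leanfire': 999, 'fatFIRE': 764}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Always predicting the majority class gives this accuracy
majority_class = max(counts, key=counts.get)
baseline_accuracy = shares[majority_class]

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.6f}')
```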

In [27]:
df.to_csv(f'../data/{subs[1]}-{subs[0]}.csv', index = False)