# Subreddit Spam Classifier
#### Web APIs & Classification
_Author: Ritchie Kwan_

---


## Table of Contents

0. [Problem Statement](01-Gathering-Data.ipynb#Problem-Statement)
1. [Data Collection](01-Gathering-Data.ipynb#Data-Collection)
1. [Data Cleaning & EDA](#Data-Cleaning-and-EDA)
1. [Benchmark Model](03-Benchmark-Model.ipynb#Benchmark-Model)
1. [Model Tuning](04-Model-Tuning.ipynb#Model-Tuning)
1. [Evaluation and Conceptual Understanding](04-Model-Tuning.ipynb#Evaluation-and-Conceptual-Understanding)


### Import Libraries

In [1]:
import pandas as pd

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

import warnings
warnings.filterwarnings('ignore')

### Load Data

In [2]:
df = pd.read_csv('../data/fatfire-leanfire.csv')

In [3]:
df.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,True,Hawk_Sight,,,,[],,,...,,,Saving money on food,61,https://www.reddit.com/r/leanfire/comments/7wt...,[],,False,all_ads,6.0
1,,,False,chubbyfire,,,,[],,,...,,,Transitioning to the next chapter...what are y...,58,https://www.reddit.com/r/fatFIRE/comments/9gvb...,[],,False,,
2,,,True,afflictedfury,,,,[],,,...,,,26 yr old debating 2 career paths for FIRE/Fat...,0,https://www.reddit.com/r/fatFIRE/comments/8pdn...,[],,False,,
3,,,True,my_FI_,,,,[],,,...,,,Best banking relationships with a lot of money,28,https://www.reddit.com/r/fatFIRE/comments/6kb3...,[],,False,,
4,,,True,BlueberryRush,,,,[],,,...,,,My Mom is literally working for Health Insuran...,43,https://www.reddit.com/r/leanfire/comments/6ju...,[],,False,all_ads,6.0


## Data Cleaning and EDA

### Use the majority class as the positive target

In [4]:
subs = df['subreddit'].value_counts()
subs

leanfire    999
fatFIRE     764
Name: subreddit, dtype: int64

### Binarize the target

In [5]:
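# value_counts() sorts by frequency, so subs.index[0] is the majority class
target = 'subreddit_' + subs.index[0].lower()

# Majority class -> 1, minority class -> 0
df[target] = df['subreddit'].map({subs.index[0]: 1, subs.index[1]: 0})
df[target].head()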

0    1
1    0
2    0
3    0
4    1
Name: subreddit_leanfire, dtype: int64
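
The binarization above depends on the counts being ordered by frequency, so the first label is the majority class. As a rough illustration, here is the same idea in plain Python on a handful of made-up labels. The `labels` list and the variable names are assumptions for this sketch, not part of the dataset or the notebook.

```python
from collections import Counter

# Toy labels standing in for df['subreddit'] (illustrative only)
labels = ['leanfire', 'fatFIRE', 'fatFIRE', 'leanfire', 'leanfire']

# most_common() orders labels by count, much like value_counts()
majority, minority = [name for name, _ in Counter(labels).most_common()]

# Name the target after the majority class and map it to 1
target_name = 'subreddit_' + majority.lower()
mapping = {majority: 1, minority: 0}
binary = [mapping[label] for label in labels]

print(target_name, binary)
```

Naming the column after the positive class keeps the meaning of `1` visible wherever the target is used later.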

### Define X and y
Our predictive features are `title`, `selftext`, and `comments`.  
Our target is the binarized `subreddit` column, `subreddit_leanfire` (1 = leanfire, 0 = fatFIRE).

In [6]:
features = ['title', 'selftext', 'comments']

# Copy so that adding token columns to X does not write into a slice of df
X = df[features].copy()
y = df[target]

In [7]:
X.head()

Unnamed: 0,title,selftext,comments
0,Saving money on food,I know food is a general topic and not directl...,"['Budgetbytes.com', 'Seconded, this is a great..."
1,Transitioning to the next chapter...what are y...,I’m in my late 40s and planning to exit a comp...,"[""Have you given consideration to simply stopp..."
2,26 yr old debating 2 career paths for FIRE/Fat...,"Hey FatFire,\n\nI would like you all something...","[""PE without a doubt if you want to stay in fi..."
3,Best banking relationships with a lot of money,I currently have most of my money in Schwab bu...,"[""BofA + Merrill Edge with $100K brokerage bal..."
4,My Mom is literally working for Health Insuran...,My parents are in their late 50's. My dad rec...,['No reason for what she is doing. Get to the...


In [8]:
X.shape

(1763, 3)

### Check the class balance

In [9]:
y.value_counts()

1    999
0    764
Name: subreddit_leanfire, dtype: int64

In [10]:
y.value_counts(normalize = True)

1    0.566648
0    0.433352
Name: subreddit_leanfire, dtype: float64
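
The classes are only mildly imbalanced. One way to read these proportions is as a baseline: a model that always predicts the majority class would be right about 56.7% of the time, so any classifier should be judged against that figure. A minimal sketch of the arithmetic, using the counts printed above (the variable names are assumptions for illustration):

```python
# Class counts copied from y.value_counts() above
counts = {1: 999, 0: 764}

total = sum(counts.values())

# Accuracy of always predicting the most frequent class
baseline_accuracy = max(counts.values()) / total

print(total, round(baseline_accuracy, 6))
```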

### Tokenize X

In [11]:
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Run Tokenizer for each column of X
for col in features:
    # str() guards against non-string cells; lower() normalizes case
    X[col + '_tokens'] = X[col].apply(
        lambda words: tokenizer.tokenize(str(words).lower()))

In [12]:
X.head()

Unnamed: 0,title,selftext,comments,title_tokens,selftext_tokens,comments_tokens
0,Saving money on food,I know food is a general topic and not directl...,"['Budgetbytes.com', 'Seconded, this is a great...","[saving, money, on, food]","[i, know, food, is, a, general, topic, and, no...","[budgetbytes, com, seconded, this, is, a, grea..."
1,Transitioning to the next chapter...what are y...,I’m in my late 40s and planning to exit a comp...,"[""Have you given consideration to simply stopp...","[transitioning, to, the, next, chapter, what, ...","[i, m, in, my, late, 40s, and, planning, to, e...","[have, you, given, consideration, to, simply, ..."
2,26 yr old debating 2 career paths for FIRE/Fat...,"Hey FatFire,\n\nI would like you all something...","[""PE without a doubt if you want to stay in fi...","[26, yr, old, debating, 2, career, paths, for,...","[hey, fatfire, i, would, like, you, all, somet...","[pe, without, a, doubt, if, you, want, to, sta..."
3,Best banking relationships with a lot of money,I currently have most of my money in Schwab bu...,"[""BofA + Merrill Edge with $100K brokerage bal...","[best, banking, relationships, with, a, lot, o...","[i, currently, have, most, of, my, money, in, ...","[bofa, merrill, edge, with, 100k, brokerage, b..."
4,My Mom is literally working for Health Insuran...,My parents are in their late 50's. My dad rec...,['No reason for what she is doing. Get to the...,"[my, mom, is, literally, working, for, health,...","[my, parents, are, in, their, late, 50, s, my,...","[no, reason, for, what, she, is, doing, get, t..."
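
The pattern `\w+` keeps runs of letters, digits, and underscores and drops everything else, which is why `I’m` becomes `i, m` and `50's` becomes `50, s` in the output above. The sketch below reproduces that behavior with the standard library's `re.findall`, on the assumption that it matches what `RegexpTokenizer` does for this pattern. The `tokenize` helper is a hypothetical name used only here.

```python
import re

# Same pattern the notebook passes to RegexpTokenizer
TOKEN_PATTERN = re.compile(r'\w+')

def tokenize(text):
    """Lowercase the input and return its runs of word characters."""
    return TOKEN_PATTERN.findall(str(text).lower())

# Apostrophes (straight or curly) split a word in two
print(tokenize("I’m in my late 40s"))
print(tokenize("My parents are in their late 50's."))

# A missing value passed through str() turns into the literal token 'nan'
print(tokenize(float('nan')))
```

The last call shows a side effect worth knowing about: rows with a missing `selftext` or `comments` value end up with a single `nan` token. Whether to keep that token or replace missing values with an empty string first is a modelling choice.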


### Lemmatize X

In [13]:
# Instantiate Lemmatizer
lemmatizer = WordNetLemmatizer()

# Run Lemmatizer
for col in features:
    # lemmatize() defaults to pos='n', so only noun forms are reduced
    X[col + '_tokens_lem'] = X[col + '_tokens'].apply(
        lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])

In [29]:
X[['title', 'title_tokens', 'title_tokens_lem']].loc[3]

title                  Best banking relationships with a lot of money
title_tokens        [best, banking, relationships, with, a, lot, o...
title_tokens_lem    [best, banking, relationship, with, a, lot, of...
Name: 3, dtype: object

### Save Data

In [14]:
# Overwrite the raw text columns with their lemmatized token lists
for col in features:
    df[col] = X[col + '_tokens_lem']

In [15]:
out_path = f'../data/{subs.index[0].lower()}-{subs.index[1].lower()}_clean.csv'
df[features + [target]].to_csv(out_path, index = False)
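
Because each cell now holds a Python list, the CSV stores its string representation (for example `"['saving', 'money', 'on', 'food']"`), and a later notebook that reads the file gets strings back, not lists. The sketch below shows one way to recover the tokens, using only the standard library and an in-memory file. The sample row and variable names are assumptions for illustration, and `ast.literal_eval` is one option among several.

```python
import ast
import csv
import io

# Write one row whose cell holds a stringified list, as it appears in the CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['title', 'subreddit_leanfire'])
writer.writerow([str(['saving', 'money', 'on', 'food']), 1])

# Read it back: the cell is a plain string at this point
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row['title']

# literal_eval safely parses the string back into a list of tokens
tokens = ast.literal_eval(raw)

# Vectorizers that expect raw text can take the tokens re-joined with spaces
joined = ' '.join(tokens)

print(type(raw).__name__, tokens, joined)
```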
   