<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Classification of 'jobs' and 'forhire' subreddits

## Problem Statement

The goal of this project is to build a binary classification model to predict if a post on reddit belongs to the "jobs" or "forhire" subreddit. The model will be considered successful if both the F1 score and sensitivity are above 90%.

Additionally, the project aims to provide insights into the key words and phrases that people use when discussing jobs on reddit. The stakeholders are data science peers, job seekers and job posters.

# Executive Summary

Reddit is a popular social news, content, and discussions website where posts are organised according to subject into user-created 'subreddits'. Members submit content (such as images, texts, and links) to subreddits, which can then be voted on and commented by other members, creating an internet community of sorts around specific themes. In this project, I examined posts from two subreddits - [**r/jobs**](https://www.reddit.com/r/jobs/) and [**r/forhire**](https://www.reddit.com/r/forhire/).

<img src='datasets/rjobsCapture.jpg' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 1. The frontpage of r/jobs as of 9pm, 22 Sep 2021.)</font></center>


<img src='datasets/rforhireCapture.jpg' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 2. The frontpage of r/forhire as of 9pm, 22 Sep 2021.)</font></center>

Both subreddits are related to employment and were created around a year apart, so there is real potential for misclassification.

The two subreddits differ in size (r/jobs is more than twice the size of r/forhire, at 577k vs 246k members) and in the nature of their posts. Posts on r/jobs are mostly queries for career advice. In contrast, most posts on r/forhire are job postings by either job seekers or job providers.

The objective of this project is to create the best classification model to identify Reddit posts, with the goals as stated in the problem statement. For this project, the data has to be gathered manually using the Pushshift API.

For the training data set, 1,000 posts were gathered from each of the r/jobs and r/forhire subreddits. As usable language data can be found in both the post title and the post content, the two were merged to create a new variable. For the purpose of this project, models comprising combinations of the following are considered:

    - Pre-processing: 1) Tokenizer        2) Lemmatization
    - Transformer:    1) Bigrams          2) Sentiment Analysis
    - Model:          1) Random Forest    2) Voting classifier

After assessing both models, the final model is a voting classifier that uses a random forest classifier, multinomial naive Bayes, and support vector classifier to predict the subreddit based on the text in a given post. This model has an F1 score of 96% and a recall score of 97%, and would therefore be considered successful. 

Recommendations to further improve the model include:
    
- Expand our model to other subreddits that concern employment such as r/careeradvice to increase the corpus of words
- Get more data from other employment resources (e.g. LinkedIn, JobStreet etc.)

# Data Extraction and Data Cleaning

## 1. Data Extraction

In [None]:
# import libraries
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import pickle

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import re
import contractions

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score, roc_auc_score, plot_roc_curve, recall_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

In [None]:
# function to pull n posts from a specific subreddit
def get_reddit(subreddit, n_samples, verbose=False):
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': min(n_samples, 100)}

    res = requests.get(url, params)
    data = res.json()
    df_final = pd.DataFrame(data['data'])

    count = min(n_samples, 100)
    n = n_samples - 100

    while n > 0:
        if verbose and count % 100 == 0:
            print(f'{count} rows generated')

        params = {
            'subreddit': subreddit,
            'size': min(n, 100),
            # start 700,000 seconds (about 8 days) before the last post of the
            # previous batch; [-1] also works if fewer than 100 posts came back
            'before': data['data'][-1]['created_utc'] - 700000
        }
        res = requests.get(url, params)
        data = res.json()
        df = pd.DataFrame(data['data'])
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df_final = pd.concat([df_final, df])
        n -= 100
        count += 100
    df_final = df_final.reset_index(drop=True)
    return df_final

In [None]:
# create dataframe for 1000 extracts from jobs sub
df_jobs = get_reddit('jobs', 1000)

# create dataframe for 1000 extracts from forhire sub
df_forhire = get_reddit('forhire', 1000)

In [None]:
# check shape of jobs extract
df_jobs.shape

In [None]:
# check extract dates
start = int(df_jobs['created_utc'][999])
end = int(df_jobs['created_utc'][0])
print(f'start date:{datetime.fromtimestamp(start)}')
print(f'end date:{datetime.fromtimestamp(end)}')

In [None]:
# check first 5 rows for job extract
df_jobs.head()

In [None]:
# check shape of forhire extract
df_forhire.shape

In [None]:
# check extract dates
start = int(df_forhire['created_utc'][999])
end = int(df_forhire['created_utc'][0])
print(f'start date:{datetime.fromtimestamp(start)}')
print(f'end date:{datetime.fromtimestamp(end)}')

In [None]:
# check first 5 rows for forhire extract
df_forhire.head()

In [None]:
# export to csv

df_jobs.to_csv('datasets/jobs.csv', index= False)
df_forhire.to_csv('datasets/forhire.csv', index = False)

## 2. Data Cleaning

In [None]:
# import raw files

df_jobs = pd.read_csv('datasets/jobs.csv')
df_forhire = pd.read_csv('datasets/forhire.csv')

# keep only the 'title', 'selftext', 'subreddit' and 'upvote_ratio' columns

col = ['title', 'selftext','subreddit', 'upvote_ratio']
df_jobs = df_jobs[col]
df_forhire = df_forhire[col]

### 2.1 jobs subreddit

In [None]:
#check first 5 rows
df_jobs.head()

In [None]:
# check null values
df_jobs.isnull().sum()

- Out of 1000 rows, 66 are null values for selftext (i.e. the post has a title but no text) and we shall investigate these rows 

In [None]:
df_jobs[df_jobs['selftext'].isnull()].head(5)

In [None]:
df_jobs['selftext'] = df_jobs['selftext'].fillna('')

In [None]:
# check for null values again
df_jobs.isnull().sum()

In [None]:
df_jobs[df_jobs['selftext'] != ''].tail()

In [None]:
df_jobs[df_jobs['selftext']=='[removed]'].shape

In [None]:
df_jobs[df_jobs['selftext']=='[deleted]'].shape

- Out of 1000 posts, 110 were removed by the moderators, and 2 were deleted by the users themselves
- I will delete these posts for the purpose of this project because the goal is to identify posts that are similar to what users post in the jobs subreddit, so I do not want to train my model on posts that moderators have already identified as irrelevant to this subreddit

In [None]:
# function to remove "removed" and "deleted" posts
def remove_del(df):
    mask = np.logical_not(df['selftext'].isin(['[removed]','[deleted]']))
    return df[mask]

In [None]:
df_jobs = remove_del(df_jobs)

In [None]:
# remove filler text posts
df_jobs = df_jobs[~df_jobs['selftext'].str.contains('filler')]

In [None]:
#check shape
df_jobs.shape

- For analysis, I want to combine the title and self text columns
- My classifications will be based on the words in both the title and text of the reddit posts

In [None]:
# join with a space so the last title word does not fuse with the first body word
df_jobs['text'] = df_jobs['title'] + ' ' + df_jobs['selftext']

In [None]:
# check for duplicates
df_jobs[df_jobs['text'].duplicated()].shape

In [None]:
# drop rows whose combined text is duplicated (the same column checked above)
df_jobs.drop_duplicates(subset = 'text', inplace = True)

### 2.2 forhire subreddit

In [None]:
#check first 5 rows
df_forhire.head()

In [None]:
# check null values
df_forhire.isnull().sum()

- Out of 1000 rows, 15 are null values for selftext (i.e. the post has a title but no text) and we shall investigate these rows 

In [None]:
df_forhire[df_forhire['selftext'].isnull()].head(5)

- I will replace the selftext with blank (same as I did for jobs)

In [None]:
df_forhire['selftext'] = df_forhire['selftext'].fillna('')

In [None]:
df_forhire[df_forhire['selftext'] != ''].head()

In [None]:
df_forhire[df_forhire['selftext']=='[removed]'].shape

In [None]:
df_forhire[df_forhire['selftext']=='[deleted]'].shape

- Out of 1000 posts, 303 were removed by moderators, and 2 were deleted by users themselves

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(df_forhire[df_forhire['selftext'] == '[removed]'].head(5))

- It seems the forhire subreddit has more posts that moderators may consider spam or inappropriate than the jobs subreddit, or the forhire subreddit has stricter community guidelines 
- Similar to the jobs subreddit, I will delete all rows where the self text has been removed by mods or deleted by the users

In [None]:
df_forhire = remove_del(df_forhire)

In [None]:
# check shape
df_forhire.shape

In [None]:
# join with a space so the last title word does not fuse with the first body word
df_forhire['text'] = df_forhire['title'] + ' ' + df_forhire['selftext']

In [None]:
# check for duplicates
df_forhire[df_forhire['text'].duplicated()]

In [None]:
# drop rows whose combined text is duplicated (the same column checked above)
df_forhire.drop_duplicates(subset = 'text', inplace = True)

### 2.3 Combine dataframes 

In [None]:
df_jobs.shape

In [None]:
df_forhire.shape

- I have 219 more posts from r/jobs than r/forhire, since many of the posts in r/forhire were removed by moderators
- I want balanced classes when I perform my EDA and train my models, so I will take 500 posts from both datasets

In [None]:
df_jobs = df_jobs.sample(n=500, replace = False, random_state = 42).reset_index(drop=True)
df_forhire = df_forhire.sample(n=500, replace = False, random_state = 42).reset_index(drop=True)

In [None]:
# create combined dataframe
df = pd.concat([df_jobs,df_forhire]).reset_index(drop = True)
df.shape

In [None]:
# drop the title and self text columns
df.drop(['title','selftext'],axis = 1, inplace = True)

## 3. NLP Pre-Processing

- Before I move to my EDA there are some pre-processing steps for text that may be useful
- This will help me to find insights from the text data more easily during my Exploratory Data Analysis
- These steps are:
> 1. Remove line breaks and URLs
> 2. Tokenize
> 3. Remove stop words
> 4. Lemmatize
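
As a rough illustration of how the first three steps fit together, here is a minimal sketch that uses only the standard library. The `preprocess` helper, the simplified URL pattern and the tiny `STOP_WORDS` set are assumptions made for this example; the notebook itself uses NLTK's tokenizer, stop word list and lemmatizer, and lemmatization is left out here.

```python
import re

# Hypothetical, deliberately tiny stop word list (the notebook uses NLTK's English list)
STOP_WORDS = {"i", "a", "an", "the", "am", "for", "to", "and", "at", "my"}

# Simplified URL pattern for illustration only
URL_PATTERN = re.compile(r"(?:https?|ftp)://\S+")


def preprocess(text):
    """Lower-case, strip line breaks and URLs, tokenize, and drop stop words."""
    text = text.lower().replace("\n", " ")
    text = URL_PATTERN.sub(" ", text)
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]


sample = "Looking for a designer\nPortfolio at https://example.com/work"
print(preprocess(sample))  # ['looking', 'designer', 'portfolio']
```

The order matters: URLs are removed before tokenizing, otherwise fragments such as "https" or "com" would end up as tokens.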

### 3.1 Remove line breaks and URLs

In [None]:
# convert everything to lower string first
df['text_adj'] = df['text'].str.lower()

In [None]:
# Lets view a long post
df[df['text_adj'].str.len()>500]['text_adj'][4]

In [None]:
# use regex to remove line breaks
df['text_adj'] = df['text_adj'].map(lambda x: re.sub('\n', ' ', x)) 

In [None]:
# confirm line breaks removed
df['text_adj'][4]

In [None]:
# create function to expand contractions (e.g. "don't" -> "do not")
 
def expand_text(text):
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    return ' '.join(expanded_words)

In [None]:
df['text_adj'] = df['text_adj'].apply(lambda x: expand_text(x))

In [None]:
# confirm contractions have been expanded
df['text_adj'][4]

In [None]:
# check for URLs
df[df['text_adj'].str.contains('http')]['text_adj'][27]

In [None]:
# use regex to remove URLs (raw strings avoid invalid escape sequence warnings)
df['text_adj'] = df['text_adj'].map(lambda x: re.sub(r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-;]*[\w@?^=%&\/~+#-])?', ' ', x))

# use regex to remove HTML entities (&amp, &gt, etc)
df['text_adj'] = df['text_adj'].map(lambda x: re.sub(r'-?&\w+', ' ', x))

In [None]:
df['text_adj'][27]

### 3.2 Tokenize

In [None]:
# tokenize for words, currencies, or percentages (raw string keeps the regex escapes intact)
tokenizer = RegexpTokenizer(r'((?:[A-Za-z]\.)+|[$¥£€%]{0,1}(?:\d{1,10}[,. ])*\d{1,10}[$¥£€%]{0,1}|\w+)')

In [None]:
df['token'] = df['text_adj'].map(lambda x: tokenizer.tokenize(x.lower()))

In [None]:
df.head()

In [None]:
# check tokens for first row are correct
df['token'][0]

### 3.3 Stop word removal

In [None]:
print(stopwords.words('english'))

In [None]:
stop = stopwords.words('english')

In [None]:
df['token'] = df['token'].apply(lambda x: [item for item in x if item not in stop])

### 3.4 Lemmatize

In [None]:
lem = WordNetLemmatizer()

In [None]:
df['token'] = df['token'].apply(lambda x: [lem.lemmatize(i) for i in x] )

In [None]:
df.head()

In [None]:
# pickle final dataframe
df.to_pickle('datasets/final')

# Exploratory Data Analysis

## 1. Post Length Distribution

In [None]:
# create length and word count features
df['text_length'] = df['text'].map(len)
df['word_count'] = df['text'].map(lambda x: len(x.split()))

### 1.1 Longest and Shortest posts

In [None]:
# 5 shortest posts in jobs
df[df['subreddit']=='jobs'].sort_values(by='word_count')[['text']].head()

In [None]:
# 5 longest posts in jobs

with pd.option_context('display.max_colwidth', 400):
  display(df[df['subreddit']=='jobs'].sort_values(by='word_count', ascending=False)[['text']].head())

In [None]:
# 5 shortest posts in forhire
df[df['subreddit']=='forhire'].sort_values(by='word_count', ascending=True)[['text']].head()

In [None]:
# 5 longest posts in forhire

with pd.option_context('display.max_colwidth', 400):
  display(df[df['subreddit']=='forhire'].sort_values(by='word_count', ascending=False)[['text']].head())

### 1.2 Distribution of Post Length

In [None]:
# Compare text length and word count

title_fig = plt.figure(figsize=(16,8))
sns.set_palette('bright')

# text length
ax1 = plt.subplot(1,2,1)
ax1.set_xlabel('Text Length (Characters)', fontsize=12)
sns.histplot(data=df[df['text_length']<2500], x="text_length", hue="subreddit", kde = True)

# word count
ax2 = plt.subplot(1,2,2)
sns.histplot(data=df[df['word_count']<500], x="word_count", hue="subreddit", kde = True)
ax2.set_xlabel('Word Count', fontsize=12)

plt.suptitle('Distribution of Text Length and Word Count for r/jobs and r/forhire', fontsize=15, fontweight = 'bold');

- The text length and word count distributions are similar
- The text length and word count distributions for both jobs and forhire are skewed to the right, and the skew is greater for jobs
- Posts in the jobs subreddit tend to be shorter and use fewer words than posts in the forhire subreddit

In [None]:
fig = plt.figure(figsize=(18,6))

# text length
ax1 = plt.subplot(1,2,1)
sns.barplot(data = df, x = 'text_length', y = 'subreddit')
ax1.set_ylabel('subreddit')
ax1.set_xlabel('Text Length')

# word count
ax2 = plt.subplot(1,2,2)
sns.barplot(data = df, x = 'word_count', y = 'subreddit')
ax2.set_ylabel('')
ax2.set_xlabel('Word Count')

plt.suptitle('Mean Text Length and Word Count for r/jobs and r/forhire', fontsize=15, fontweight = 'bold');

- On average, posts in the jobs subreddit have as many words as the forhire subreddit

## 2. Frequency of Key Words

### 2.1 Unigrams

In [None]:
# count tokens once per subreddit, then look up the overall top 20 in each
top_20_tokens = df['token'].explode().value_counts()[:20].index
jobs_counts = df[df['subreddit']=='jobs']['token'].explode().value_counts()
forhire_counts = df[df['subreddit']=='forhire']['token'].explode().value_counts()

# reindex fills tokens that are missing from a subreddit with 0 instead of raising a KeyError
top20 = pd.DataFrame({'jobs': jobs_counts.reindex(top_20_tokens, fill_value=0),
                      'forhire': forhire_counts.reindex(top_20_tokens, fill_value=0)})

In [None]:
top20.plot(kind = 'barh', figsize=(16,8))
plt.title('Top 20 Frequent Words', fontsize=15, fontweight = 'bold');

- Among the 20 most common words (excluding stop words) across both subreddits, most come predominantly from either r/jobs or r/forhire, with the exception of 'year'

In [None]:
fig = plt.figure(figsize=(22,12))

ax1 = plt.subplot(1,2,1)
jobs_tokens = df[df['subreddit']=='jobs']['token'].explode()
jobs_tokens.value_counts()[:20].plot(kind='barh', color = 'orange')
ax1.set_title('r/jobs')


ax2 = plt.subplot(1,2,2)
forhire_tokens = df[df['subreddit']=='forhire']['token'].explode()
forhire_tokens.value_counts()[:20].plot(kind='barh')
ax2.set_title('r/forhire')

plt.suptitle('Top 20 Frequent Words in r/jobs and r/forhire', fontsize=15, fontweight = 'bold');

In [None]:
# words unique in jobs
set(jobs_tokens.value_counts()[:20].index) - set(forhire_tokens.value_counts()[:20].index)

In [None]:
# words unique in forhire
set(forhire_tokens.value_counts()[:20].index) - set(jobs_tokens.value_counts()[:20].index) 

In [None]:
# words in both jobs and forhire
set(jobs_tokens.value_counts()[:20].index).intersection(set(forhire_tokens.value_counts()[:20].index))

- As expected, 'job' is the most common word in r/jobs and 'hire' is one of the most common words in r/forhire
- I would expect the following words will be important in determining if a post is similar to the content posted in r/jobs:
>'job',
>'interview',
>'position',

- Although 'work' is common in r/jobs, it is also common in r/forhire and therefore may not be very helpful when building my classification model. 

### 2.2 Bigrams

- Bigrams are pairs of adjacent words; they capture some context that single words miss (e.g. 'job offer' vs 'job')
- CountVectorizer is used to generate the most frequent bigrams in the full corpus, and both r/jobs and r/forhire
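
To make the idea concrete, here is a minimal sketch of bigram counting on an already tokenized post, using only the standard library. The `count_bigrams` helper and the sample tokens are made up for illustration; in the notebook the counting is done by `CountVectorizer` with `ngram_range=(2, 2)`.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent token pairs in a single tokenized post."""
    return Counter(zip(tokens, tokens[1:]))


tokens = ["new", "job", "offer", "new", "job"]
bigram_counts = count_bigrams(tokens)
print(bigram_counts.most_common(1))  # [(('new', 'job'), 2)]
```

Note that `CountVectorizer` drops stop words before it builds the pairs, which is why a phrase like "years of experience" shows up as the bigram "years experience".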

In [None]:
# All bigrams from full corpus
bigrams_cv = CountVectorizer(ngram_range=(2, 2), stop_words='english')
bigrams = bigrams_cv.fit_transform(df['text_adj'])
bigrams = pd.DataFrame(bigrams.todense(), columns=bigrams_cv.get_feature_names()).sum()

In [None]:
# Bigrams from r/jobs
jobsbigrams_cv = CountVectorizer(ngram_range=(2, 2), stop_words='english')
jobsbigrams = jobsbigrams_cv.fit_transform(df[df['subreddit']=='jobs']['text_adj'])
jobsbigrams = pd.DataFrame(jobsbigrams.todense(), columns = jobsbigrams_cv.get_feature_names()).sum()
top20jobsbigrams = jobsbigrams.sort_values(ascending = False)[:20]

In [None]:
# Bigrams from r/forhire
forhirebigrams_cv = CountVectorizer(ngram_range=(2, 2), stop_words='english')
forhirebigrams = forhirebigrams_cv.fit_transform(df[df['subreddit']=='forhire']['text_adj'])
forhirebigrams = pd.DataFrame(forhirebigrams.todense(), columns = forhirebigrams_cv.get_feature_names()).sum()
top20forhirebigrams = forhirebigrams.sort_values(ascending = False)[:20]

In [None]:
top20bigrams = bigrams.sort_values(ascending=False).head(20).index

# Series.get returns 0 for bigrams that never occur in a subreddit,
# which replaces the bare try/except blocks
top20_jobs = [jobsbigrams.get(bigram, 0) for bigram in top20bigrams]
top20_forhire = [forhirebigrams.get(bigram, 0) for bigram in top20bigrams]

df_top20bigrams = pd.DataFrame({'jobs': top20_jobs, 'forhire': top20_forhire}, index = top20bigrams)

#plot top 20 bigrams
df_top20bigrams.plot(kind = 'barh', figsize=(16,8))
plt.title('Top 20 Frequent Bigrams in Full Corpus', fontsize=15, fontweight = 'bold');

- Most of the 20 most frequent bigrams (excluding stop words), such as 'hourly rate', 'content writing', 'graphic designer', 'social media' and 'years experience', are from r/forhire. This makes sense because job seekers and job postings tend to use similar words to highlight a job's requirements or a person's capabilities
- The most frequent bigrams for r/jobs are 'job offer', 'current job', and 'new job', which also makes sense since the subreddit is centered around job advice.

In [None]:
fig = plt.figure(figsize=(22,12))

ax1 = plt.subplot(1,2,1)
top20jobsbigrams.plot(kind='barh', color = 'orange')
ax1.set_title('r/jobs')

ax2 = plt.subplot(1,2,2)
top20forhirebigrams.plot(kind='barh')
ax2.set_title('r/forhire')

plt.suptitle('Top 20 Frequent Bigrams in r/jobs and r/forhire', fontsize=15, fontweight = 'bold');

In [None]:
# bigrams unique in jobs
set(top20jobsbigrams.index) - set(top20forhirebigrams.index)

In [None]:
# bigrams unique in forhire
set(top20forhirebigrams.index) - set(top20jobsbigrams.index)

In [None]:
# bigrams in both r/jobs and r/forhire
set(top20jobsbigrams.index).intersection(set(top20forhirebigrams.index))

- Bigrams may be more useful when building my classification model as there is less overlap. The only bigram that appears in the top 20 most frequent bigrams for both r/jobs and r/forhire is "years experience"
- Many of the bigrams are ones I would expect to be unique to one subreddit (such as "job offer" for r/jobs and "graphic designer" for r/forhire)

### 2.3 Other Features

#### Question marks

In [None]:
df['questions'] = df['text_adj'].str.count('\\?')

In [None]:
df.groupby('subreddit')['questions'].sum()

- Posts in r/jobs contain more question marks in total, which could indicate that people ask more questions in r/jobs than in r/forhire

#### URL

In [None]:
df['links'] = df['text'].str.count('http')
df.groupby('subreddit')['links'].sum()

- Posts in r/jobs contain more occurrences of "http" in total, which could indicate that people in r/jobs share more links than people in r/forhire

## 3. Sentiment Analysis

### 3.1 Sentiment Analysis with Vader

In [None]:
sent = SentimentIntensityAnalyzer()

In [None]:
# score each post once, then pull the four values out of the resulting dicts
scores = df['text_adj'].apply(sent.polarity_scores)
df['polarity_scores_neg'] = scores.map(lambda s: s['neg'])
df['polarity_scores_pos'] = scores.map(lambda s: s['pos'])
df['polarity_scores_neu'] = scores.map(lambda s: s['neu'])
df['polarity_scores_comp'] = scores.map(lambda s: s['compound'])

In [None]:
polarity_col = ['subreddit','polarity_scores_neg', 'polarity_scores_pos', 'polarity_scores_neu', 'polarity_scores_comp']
df[polarity_col].groupby('subreddit').mean()

In [None]:
fig, axes = plt.subplots(2,2, figsize=(14,10))
polar_col = ['polarity_scores_neg', 'polarity_scores_pos', 'polarity_scores_neu', 'polarity_scores_comp']

for i,t in enumerate(polar_col):
    sns.boxplot(y=t, x="subreddit", data=df, ax=axes[i%2, int(i<2)])

- The mean scores for neutral, negative, and positive are similar; however, the mean compound polarity score for r/forhire is noticeably higher
- There are a lot more outliers present in the positive polarity and negative polarity boxplots for r/jobs, indicating the polarity scores are more extreme
- I will look at the 10 posts with the highest negative polarity and positive polarity scores
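
One possible reason the compound score can differ while the other three scores look similar: as far as I understand VADER, the neg/neu/pos scores are proportions of the text, whereas the compound score is a sum of word valences squashed into the range -1 to 1. The sketch below reproduces that squashing step under the assumption that it is `x / sqrt(x**2 + alpha)` with `alpha = 15`; the `normalize_compound` name and the sample sums are my own, so treat this as an illustration rather than the library's code.

```python
import math


def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# A larger summed valence (for example, a long post with many mildly
# positive words) pushes the compound score closer to 1.
for raw_sum in (1, 4, 16):
    print(raw_sum, round(normalize_compound(raw_sum), 3))
```

If this is right, longer posts with a steady positive tone would saturate towards 1 on the compound score even when their positive proportion is modest.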

In [None]:
# top 10 negative posts
df.sort_values(by='polarity_scores_neg', ascending = False)[['subreddit', 'text_adj', 'polarity_scores_neg']].head(10)

- 9 out of 10 of the highest negative polarity scores are from r/jobs
- These posts seem to describe bad experiences with the work environment (colleagues, work scope), which makes sense as posters would seek advice on how to improve their work situation

In [None]:
df.sort_values(by='polarity_scores_pos', ascending = False)[['subreddit', 'text_adj', 'polarity_scores_pos']].head(10)

- The highest positive polarity scores are split between r/jobs and r/forhire
- Hiring posts in r/forhire tend to be positive to make the job posting desirable, and discussions of available job opportunities in r/jobs tend to be optimistic and hence have elements of positivity as well.

### 3.2 Reddit's Upvote Ratio

The upvote ratio measures how other users react to a post. It is a float between 0 and 1, where values near 1 mean the votes are almost all upvotes and values near 0 mean they are almost all downvotes. It therefore indicates which posts other users in the subreddit agree or disagree with.

I have already removed the posts that were removed by moderators or deleted by their authors, which are the posts more likely to have a low upvote ratio. This feature is therefore unlikely to say much about how posts are received on reddit, so I have not analysed it and will drop it before modeling.

# Data Modeling and Conclusion

## 1. Data Modeling

In [None]:
# map forhire = 1 and jobs = 0

df['label'] = df['subreddit'].map({'forhire': 1, 'jobs': 0})

Because the goal of this project is to be able to predict whether a post belongs to r/forhire or not, I will map r/forhire as True and r/jobs as False

This means that:
 * True Positives are posts that my model correctly predicts are from r/forhire
 * True Negatives are posts that my model correctly predicts are from r/jobs
 * False Positives are posts that my model incorrectly predicts are from r/forhire (but are actually from r/jobs)
 * False Negatives are posts that my model incorrectly predicts are from r/jobs (but are actually from r/forhire)

As the purpose of this project is to predict whether a post belongs to r/forhire, I want to focus on minimizing my False Negative rate, because I want to ensure I target all posts that may be related to r/forhire.

At the same time, I do not want to target every post across reddit because this would not be feasible from a time perspective. Hence, I will use the F1 score to ensure my false positives and false negatives are both relatively low.

I will train several models, including a baseline model, Random Forest Classifier, Multinomial Naive Bayes, and Support Vector Classifier, and select the best performing one.

I will use the following metrics to build and evaluate my models:
- F1 score: I will compare the F1 scores of my train, test, and cross validation sets to see if my model is overfitting or underfitting
- Recall: I will tune my hyperparameters using recall to minimize my False Negative Rate
- ROC AUC, Recall, and F1 score: I will use these three metrics to compare the performance of my models on the test set so that I can select a model that performs well on all three criteria and deliver a final model with an F1 score and sensitivity above 90%
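
For reference, recall and F1 can be computed directly from the confusion matrix counts. The sketch below is a plain Python illustration; the `precision_recall_f1` helper and the counts passed to it are hypothetical and are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 r/forhire posts found, 10 missed, 10 r/jobs posts flagged by mistake
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(precision, recall, f1)
```

Recall only penalises false negatives, while F1 is the harmonic mean of precision and recall, so it drops if either false positives or false negatives grow.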

In [None]:
# drop columns that are no longer needed (the model only uses text_adj and label)
df.drop(['subreddit', 'text', 'token', 'upvote_ratio'], axis = 1, inplace = True)

In [None]:
# train test split
X = df['text_adj']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

In [None]:
# define functions for modeling

# compare F1 scores
def display_f1(model, X_train, y_train, X_test, y_test):
    print('Train F1 Score: ', round(f1_score(y_train, model.predict(X_train)),5))
    print('Test F1 Score: ', round(f1_score(y_test, model.predict(X_test)),5))
    # cross-validate on the training data so the test set stays held out
    print('Cross Val F1 Score:', round(cross_val_score(model, X_train, y_train, scoring = 'f1').mean(),5))
    
# plot ROC and Confusion matrix
def plot_model(model, X_test, y_test):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,7))
    
    #Plot ROC curve
    ax1.set_title('ROC Curve')
    plot_roc_curve(model, X_test, y_test, ax = ax1)
    ax1.plot([0, 1], [0, 1],label='baseline', linestyle='--')
    ax1.legend()

    #Plot confusion matrix
    ax2.set_title('Confusion Matrix')
    y_labeled = y_test.map({1:'forhire', 0:'jobs'})
    y_pred = pd.Series(model.predict(X_test)).map({1:'forhire', 0:'jobs'})
    cm = confusion_matrix(y_labeled, y_pred)
    sns.heatmap(cm, annot=True, fmt='g', ax=ax2, cmap='Blues')
    ax2.set_xlabel('Predicted labels')
    ax2.set_ylabel('True labels')
    ax2.xaxis.set_ticklabels(['forhire', 'jobs']) 
    ax2.yaxis.set_ticklabels(['forhire', 'jobs'])
    plt.show();

# model comparisons
def add_model(name, model, X_test, y_test):
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    
    model_dictionary[name] = [round(f1_score(y_test, model.predict(X_test)),5), #F1 score
                              round(recall_score(y_test, model.predict(X_test)),5), #Recall 
                              round(roc_auc_score(y_test, model.predict_proba(X_test)[:,1]),5), #ROC AUC 
                              tp, #True Positive
                              fp, #False Positive
                              tn, #True Negative
                              fn #False Negative
                             ]
    return pd.DataFrame.from_dict(model_dictionary, orient = 'index', columns=['F1 Score', 'Recall', 'ROC AUC', 'True Positives', 'False Positives','True Negatives','False Negatives'])

### 1.1. Baseline Model

In [None]:
df['label'].value_counts(normalize = True)

- My baseline model accuracy is 50%. This means that if I were to predict all posts in my test dataset belong to r/forhire, I would have a 50% chance of being correct. 
- In order to calculate the other metrics I will use to compare my models with my baseline, I will generate baseline predictions for my test dataset by predicting 1 for every instance.
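
As a quick sanity check on what to expect from this baseline, here is a small standard-library sketch. It assumes a balanced test split of 125 posts per class (which is what a 25% stratified split of 1,000 posts would give); the variable names are my own.

```python
# Assumed balanced test split: 125 r/forhire posts (1) and 125 r/jobs posts (0)
y_true = [1] * 125 + [0] * 125
y_pred = [1] * len(y_true)  # baseline: predict r/forhire for every post

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

baseline_precision = tp / (tp + fp)
baseline_recall = tp / (tp + fn)
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))

print(baseline_precision, baseline_recall, round(baseline_f1, 4))  # 0.5 1.0 0.6667
```

Under this assumption the baseline has perfect recall but an F1 score of only about 0.67, so any real model has to beat it on F1 while keeping recall high.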

In [None]:
# create baseline dataframe
df_baseline = pd.DataFrame(y_test.values, columns=['y_true'])
df_baseline['y_pred'] = 1
df_baseline['y_pred_prob'] = 0.5

df_baseline.head()

In [None]:
# create model dictionary to compare future models to baseline
model_dictionary = {'Baseline':
                    [round(f1_score(df_baseline['y_true'],df_baseline['y_pred']),5), #F1 score
                     recall_score(df_baseline['y_true'],df_baseline['y_pred']), # recall score
                     roc_auc_score(df_baseline['y_true'], df_baseline['y_pred_prob']), #ROC AUC score
                     ((df_baseline['y_pred']==1) & (df_baseline['y_true']==1)).sum(), #True Positive
                     ((df_baseline['y_pred']==1) & (df_baseline['y_true']==0)).sum(), #False Positive
                     ((df_baseline['y_pred']==0) & (df_baseline['y_true']==0)).sum(), #True Negative
                     ((df_baseline['y_pred']==0) & (df_baseline['y_true']==1)).sum() ]} #False Negative 

### 1.2. Random Forest (Default Parameters)

In [None]:
# create pipe for cvec and random forest

rf_default_pipe = Pipeline([
                ('cvec', CountVectorizer()),
                ('rf', RandomForestClassifier(random_state = 42))
])

In [None]:
rf_default_pipe.fit(X_train, y_train)

In [None]:
display_f1(rf_default_pipe,X_train, y_train, X_test, y_test)

- Random Forest with CountVectorizer and no hyperparameter tuning performed well on the train set, but worse on the test and cross validation sets, which is an indication that the model is overfitted

In [None]:
plot_model(rf_default_pipe, X_test, y_test)

In [None]:
add_model('Random Forest (default)', rf_default_pipe, X_test, y_test)

- Model performs better than the baseline for F1 score and I will try tuning my random forest hyperparameters

### 1.3. Random Forest (HyperParameter Tuning)

- I will perform a gridsearch for the best hyperparameters for my random forest model tuning for the following hyperparameters:
 * 100, 250, or 500 trees
 * CountVectorizer or TfidfVectorizer
 * Unigrams or bigrams included
- My gridsearch will choose the hyperparameters that give the best recall score
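
To get a feel for the cost of this search, the number of candidate settings can be enumerated with the standard library. In the sketch below the vectorizers are represented by their names only, and `n_folds = 5` assumes the default 5-fold cross validation used by recent scikit-learn versions of `GridSearchCV`.

```python
from itertools import product

# Same grid as the search described above, with vectorizers written as plain strings
param_grid = {
    "vec": ["CountVectorizer", "TfidfVectorizer"],
    "rf__n_estimators": [100, 250, 500],
    "vec__stop_words": ["english"],
    "vec__ngram_range": [(1, 1), (1, 2)],
}

combinations = list(product(*param_grid.values()))
n_folds = 5  # assumed default cv for GridSearchCV
total_fits = len(combinations) * n_folds

print(len(combinations), total_fits)  # 12 60
```

So the search fits 12 candidate pipelines on each fold, plus one final refit of the best candidate on the full training set.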

In [None]:
rf_pipe = Pipeline([('vec', None), 
                    ('rf', RandomForestClassifier())])

rf_param_grid = {'vec': [CountVectorizer(), TfidfVectorizer()], 
                 'rf__n_estimators':[100, 250, 500], 
                 'vec__stop_words': ["english"], 
                 'vec__ngram_range': [(1, 1), (1, 2)]}

rf_gs = GridSearchCV(rf_pipe, rf_param_grid, scoring = 'recall')

In [None]:
rf_gs.fit(X_train, y_train)

In [None]:
display_f1(rf_gs,X_train, y_train, X_test, y_test)

- Random Forest with hyperparameter tuning does not perform better than the default random forest on either the test set or the cross validation set, and the model is still overfitted

In [None]:
plot_model(rf_gs, X_test, y_test)

In [None]:
# display best params
rf_gs.best_params_

Even with hyperparameter tuning, the gridsearch has chosen hyperparameters that are similar to the defaults.
The difference is that the number of trees is 250 instead of the default 100

In [None]:
add_model('Random Forest (tuned)', rf_gs, X_test, y_test)

Both the F1 score and recall are above 90% but this model has performed worse than the baseline and default random forest

In [None]:
# plot tokens with the most importance
feat_weights = rf_gs.best_estimator_.named_steps['rf'].feature_importances_
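# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
rf_tokens = rf_gs.best_estimator_.named_steps['vec'].get_feature_names_out()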

rf_feat = pd.DataFrame( {'top_words': rf_tokens, 'importance' : feat_weights})
rf_feat = rf_feat.set_index('top_words')
rf_feat = rf_feat.sort_values('importance',ascending = False)[:15]

plt.figure(figsize=(12,8))
plt.barh(rf_feat.index, rf_feat['importance'])
plt.xlabel('importance')
plt.ylabel('token')
plt.show();

- Among the top 15 tokens are words I would expect to be good predictors of r/forhire and r/jobs, such as "portfolio", "interview", and "hiring"

### 1.4. Multinomial Naive Bayes (Default)

In [None]:
# create pipe for cvec and multinomialNB

nb_default_pipe = Pipeline([
                ('cvec', CountVectorizer()),
                ('nb', MultinomialNB())
])

In [None]:
nb_default_pipe.fit(X_train, y_train)

In [None]:
display_f1(nb_default_pipe,X_train, y_train, X_test, y_test)

- With default parameters, this model shows overfitting similar to that of both random forest models

In [None]:
plot_model(nb_default_pipe, X_test, y_test)

In [None]:
add_model('MultinomialNB (default)', nb_default_pipe, X_test, y_test)

- The F1 score for the multinomial naive Bayes model is lower than that of the random forest model, with a similar number of false positives to the default random forest model
- I will see if the recall can be improved with hyperparameter tuning

### 1.5. Multinomial Naive Bayes (HyperParameter Tuning)

In [None]:
nb_pipe = Pipeline([('vec', None), 
                    ('nb', MultinomialNB())])

nb_param_grid = {'vec': [CountVectorizer(), TfidfVectorizer()], 
                 'vec__stop_words': [None, 'english'], 
                 'vec__ngram_range': [(1, 1), (1, 2)],
                 'vec__max_features': [None, 1500, 3000, 5000]}

nb_gs = GridSearchCV(nb_pipe, nb_param_grid, scoring = 'recall')

In [None]:
nb_gs.fit(X_train, y_train)

In [None]:
display_f1(nb_gs,X_train, y_train, X_test, y_test)

- The train, test, and cross-validation F1 scores are closer to each other than in the previous models
- This model is the least overfitted of the models I have constructed so far

In [None]:
plot_model(nb_gs, X_test, y_test)

In [None]:
# display best params
nb_gs.best_params_

- Like the default model, the tuned multinomialNB model uses CountVectorizer
- The only difference between the tuned multinomialNB model and the default is that the tuned model keeps only the 5,000 most frequent tokens in the corpus
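
To illustrate what limiting the vocabulary by frequency means, here is a minimal sketch using only the standard library. The three documents are hypothetical, and the whitespace split is a simplification of CountVectorizer's actual tokenizer, so this only mimics the idea behind `max_features` rather than reproducing it exactly.

```python
from collections import Counter

# Hypothetical mini-corpus (not project data).
docs = [
    "hiring python developer remote",
    "hiring designer remote portfolio",
    "hiring writer",
]

# Count how often each token occurs across the whole corpus.
term_counts = Counter(tok for doc in docs for tok in doc.lower().split())

# Keep only the most frequent tokens, analogous to max_features=2.
max_features = 2
vocabulary = [tok for tok, _ in term_counts.most_common(max_features)]

print(vocabulary)
```

Tokens outside the retained vocabulary are simply ignored when the documents are vectorized, which reduces the number of features the classifier has to fit.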

In [None]:
add_model('MultinomialNB (tuned)', nb_gs, X_test, y_test)

- The tuned multinomialNB performs slightly better than the default multinomialNB on recall and F1 score
- The tuned multinomialNB and the tuned random forest have similar F1 and recall scores; the random forest has a slightly better F1 score, while the multinomialNB has a slightly better recall

In [None]:
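# plot tokens with the highest log-probabilities for the positive class
# MultinomialNB.coef_ was removed in scikit-learn 1.1; for a binary problem
# coef_[0] was the same array as feature_log_prob_[1]
nb_coef = nb_gs.best_estimator_.named_steps['nb'].feature_log_prob_[1]
nb_tokens = nb_gs.best_estimator_.named_steps['vec'].get_feature_names_out()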

nb_feat = pd.DataFrame( {'top_words': nb_tokens, 'importance' : nb_coef})

nb_feat = nb_feat.set_index('top_words')
nb_feat = nb_feat.sort_values('importance',ascending = False)[:15]

plt.figure(figsize=(12,8))
plt.barh(nb_feat.index, nb_feat['importance'])
plt.xlabel('importance')
plt.ylabel('token')
plt.show();

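- Since this model did not remove stop words, many stop words appear among the highest-ranked tokens. The plotted values are log-probabilities of each token within one class, so they largely reflect how frequent a token is rather than how well it separates the two subreddits, which is why there are no obvious tokens that predict r/forhire.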
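
The effect can be sketched with a toy example. The token counts below are hypothetical, the choice of r/forhire as the class being ranked is an assumption, and `log_probs` is a helper introduced only for illustration. Ranking by the per-class log-probability surfaces frequent words, while ranking by the difference in log-probabilities between the two classes surfaces the words that actually separate them.

```python
import math

# Hypothetical token counts per class (illustrative only, not project data).
counts = {
    "forhire": {"the": 50, "to": 40, "hiring": 12, "portfolio": 10, "interview": 1},
    "jobs": {"the": 55, "to": 45, "hiring": 2, "portfolio": 1, "interview": 15},
}
vocab = sorted(counts["forhire"])


def log_probs(class_counts, alpha=1.0):
    """Laplace-smoothed log P(token | class), as used by multinomial naive Bayes."""
    total = sum(class_counts.values()) + alpha * len(class_counts)
    return {tok: math.log((n + alpha) / total) for tok, n in class_counts.items()}


lp_forhire = log_probs(counts["forhire"])
lp_jobs = log_probs(counts["jobs"])

# Ranking 1: raw log-probability within one class (dominated by frequent words).
by_log_prob = sorted(vocab, key=lambda t: lp_forhire[t], reverse=True)

# Ranking 2: log-probability ratio between classes (highlights discriminative words).
log_ratio = {t: lp_forhire[t] - lp_jobs[t] for t in vocab}
by_log_ratio = sorted(vocab, key=lambda t: log_ratio[t], reverse=True)

print("by log-probability:", by_log_prob)
print("by log-ratio:      ", by_log_ratio)
```

In this toy data the stop words lead the first ranking, whereas "portfolio" and "hiring" lead the second and "interview" falls to the bottom as a marker of the other class.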

### 1.6. Support Vector Classifier

In [None]:
svc_pipe = Pipeline([('vec', None), 
                    ('svc', SVC(probability = True))])
svc_param_grid = {'vec': [CountVectorizer(), TfidfVectorizer()], 
                 'vec__stop_words': ['english'], 
                 'vec__ngram_range': [(1, 1), (1, 2)],
                 'svc__kernel': ['linear','rbf', 'sigmoid']}

svc_gs = GridSearchCV(svc_pipe, svc_param_grid, scoring = 'recall')

In [None]:
svc_gs.fit(X_train, y_train)

In [None]:
display_f1(svc_gs,X_train, y_train, X_test, y_test)

- The F1 scores are lower for train, test, and cross validation compared to the multinomialNB model

In [None]:
plot_model(svc_gs, X_test, y_test)

In [None]:
# display best parameters
svc_gs.best_params_

In [None]:
add_model('SVC', svc_gs, X_test, y_test)

- The SVC model has an F1 score and recall above 90%
- The SVC has the highest recall score of all the models so far, while still maintaining an F1 score above 90%
- I will see if I can boost the three best performing models (in terms of recall and F1 scores) using a voting classifier

### 1.7. Voting Classifier

In [None]:
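# soft voting averages the predicted class probabilities of the three tuned models;
# each GridSearchCV is cloned and re-fitted (search included) inside the voting classifier
vc = VotingClassifier(estimators=[('svc', svc_gs),
                                  ('rf', rf_gs),
                                  ('mnb', nb_gs)],
                      voting='soft')
vc = vc.fit(X_train, y_train)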

In [None]:
display_f1(vc,X_train, y_train, X_test, y_test)

- This model has the highest F1 score of all the models
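
To make the idea of soft voting concrete, here is a minimal sketch with hypothetical probabilities for a single post. The helper `soft_vote` is introduced only for illustration and is not how scikit-learn is implemented internally. It shows that averaging probabilities can give a different answer from a simple majority of predicted labels.

```python
def soft_vote(prob_lists, weights=None):
    """Average class probabilities across models and return (label, averages)."""
    weights = weights or [1.0] * len(prob_lists)
    total_weight = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * probs[k] for w, probs in zip(weights, prob_lists)) / total_weight
        for k in range(n_classes)
    ]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


# Hypothetical [P(class 0), P(class 1)] from three models for one post.
svc_p = [0.55, 0.45]
rf_p = [0.52, 0.48]
nb_p = [0.10, 0.90]

soft_label, avg = soft_vote([svc_p, rf_p, nb_p])

# Hard voting for comparison: each model votes for its most likely class.
votes = [max(range(2), key=p.__getitem__) for p in (svc_p, rf_p, nb_p)]
hard_label = max(set(votes), key=votes.count)

print("soft vote:", soft_label, [round(a, 2) for a in avg])
print("hard vote:", hard_label)
```

Here two models lean weakly towards class 0 while one is very confident in class 1, so the soft vote picks class 1 even though the majority of labels is class 0.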

In [None]:
plot_model(vc, X_test, y_test)

In [None]:
add_model('Voting Classifier', vc, X_test, y_test)

- The voting classifier has the highest F1 score, recall, and ROC AUC of all the models
- There are 4 false negatives in this model, from a test set of 125
- The F1 score is 96% and the recall is 97%, which is higher than for the random forest model, so I am comfortable using this model to predict whether or not a post belongs to the r/forhire subreddit
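
For reference, recall and F1 follow directly from the confusion matrix counts. The sketch below uses hypothetical counts (not the results of this project), and the helper name `classification_scores` is introduced only for illustration.

```python
def classification_scores(tp, fp, fn):
    """Return (precision, recall, f1) from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive class (illustrative only).
tp, fp, fn = 95, 7, 5

precision, recall, f1 = classification_scores(tp, fp, fn)

# Success criterion from the problem statement: both metrics above 90%.
meets_target = recall > 0.90 and f1 > 0.90

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} ok={meets_target}")
```

Every false negative lowers recall directly, which is why the false negative count is tracked alongside the F1 score.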

## 2. Conclusion and Recommendations

### 2.1. Final Model

The goal of this project was to build a binary classification model that could be used to predict if a post on reddit belongs to r/jobs or r/forhire. 

The final model is a voting classifier that uses a random forest classifier, multinomial naive Bayes, and support vector classifier to predict the subreddit based on the text in a given post. This model has an F1 score of 96% and a recall score of 97%, and would therefore be considered successful.

However, these scores are similar to those of the default random forest model with CountVectorizer and no hyperparameter tuning, and both models show a high tendency to overfit the training data from the train_test_split. Though we have selected the model with the smallest drop in score between the train and test data, the drop is still rather significant. The time-consuming hyperparameter optimisation therefore achieved only very modest gains in the scores.

One reason for this could be that the moderators of r/forhire review posts against the r/forhire Rules & Guidelines and remove those that do not comply. Posts unrelated to job postings are therefore removed regularly, leaving r/forhire posts that are specific to job postings.

<img src='datasets/rforhireguidelinesCapture.jpg' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 3. r/forhire Rules and Guidelines as of 9pm, 22 Sep 2021.)</font></center>

Thus, the work done by the moderators might be why even the default random forest model can accurately predict whether a post on reddit belongs to r/jobs or r/forhire.

### 2.2. Recommendations and Next Steps

Recommendations to further improve the model include:
    
- Expand our model to other employment-related subreddits, such as r/careeradvice, to enlarge the corpus of words
- Get more data from other employment resources (e.g. LinkedIn, JobStreet etc.)

This would build up the training data with more word pairs that can be used to distinguish posts better and reinforce the model