Project 3: Reddit Scraping

Sara Soueidan /// 26.04.2020

Problem Statement

It is possible to determine which subreddit a specific Reddit post was posted to. Using an array of classification models, we found that deciding whether a post belonged to r/Cooking or r/Baking depended on a few fundamental decisions in the iterative data science process. Principally, we explored whether a logistic regression, Naive Bayes or decision tree classification model was better at predicting a post's subreddit. A successful model will be able to predict a post's origin subreddit with a high degree of accuracy (accuracy > 95%).

Executive Summary

Data Collection

Notebook 00: To begin, we scraped r/Cooking and r/Baking through the Pushshift API, using Python's requests library, to obtain the 100,000 most recent posts from each subreddit. We pulled six features from each post (title, body, datetime posted, post ID, subreddit and media presence).
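The collection step could look roughly like the sketch below. It is an illustration, not the code from Notebook 00: the endpoint URL, the `size` and `before` parameters, the `data` key in the response and the helper names `fetch_posts` and `requests_get_json` are assumptions based on how the Pushshift submission search was commonly used, and the service's availability and limits may have changed.

```python
import time

# Assumed Pushshift submission-search endpoint; verify before relying on it.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"
# The six features listed above, using the column names from the data dictionary.
FIELDS = ["title", "selftext", "created_utc", "id", "subreddit", "media_only"]


def requests_get_json(url, params):
    """Fetch one page with the requests library and return the parsed JSON."""
    import requests  # imported here so the paging logic can be exercised offline

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_posts(subreddit, total, get_json=requests_get_json, batch_size=100, pause=1.0):
    """Collect up to `total` posts, newest first, paging backwards in time."""
    posts, before = [], None
    while len(posts) < total:
        params = {"subreddit": subreddit, "size": min(batch_size, total - len(posts))}
        if before is not None:
            params["before"] = before  # only ask for posts older than the last one seen
        batch = get_json(PUSHSHIFT_URL, params).get("data", [])
        if not batch:
            break  # nothing older is available
        posts.extend({field: post.get(field) for field in FIELDS} for post in batch)
        before = batch[-1]["created_utc"]
        time.sleep(pause)  # pause between requests to avoid hammering the API
    return posts[:total]
```

Under these assumptions, `fetch_posts("Cooking", 100_000)` and `fetch_posts("Baking", 100_000)` would each return a list of dictionaries ready to be turned into a DataFrame and saved to CSV.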

Data Cleaning & EDA

Notebooks 01 and 02: Standard EDA techniques were applied to the raw data (missing values, value counts, descriptive statistics). The df_cooking and df_baking DataFrames were then combined into one df_reddit_cleaned DataFrame to be used for the rest of the project. The df_reddit_cleaned DataFrame was then processed and cleaned. Cleaning occurred throughout all stages of this project. Rows with null values in either the title or selftext columns were dropped.

To begin processing the data, a tokenizer was applied to both the title and selftext columns to assess the frequency of words in the corpus. Using these results, the stopword list was modified to include words found in the top 20 of both Cooking and Baking. The data was then processed for both title and selftext. Ultimately, title was chosen as the principal component of the X variable for the models (selftext was orders of magnitude larger than title, making it computationally intensive). The subreddit column was also converted to binary values. The processed df_reddit_cleaned was exported as reddit_processed.csv for use in model building.
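The stopword step can be illustrated with a small standard-library sketch. The tokenizer pattern, the function names `top_words` and `shared_top_words`, and the placeholder stopword list are all assumptions made for illustration; the notebooks may use a different tokenizer and a fuller base list.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def top_words(texts, n=20, stopwords=frozenset()):
    """Return the n most frequent tokens across texts, skipping stopwords."""
    counts = Counter(
        token
        for text in texts
        for token in TOKEN_PATTERN.findall(text.lower())
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(n)]


def shared_top_words(cooking_texts, baking_texts, n=20, stopwords=frozenset()):
    """Words in the top n of both subreddits do little to tell them apart."""
    cooking_top = set(top_words(cooking_texts, n, stopwords))
    baking_top = set(top_words(baking_texts, n, stopwords))
    return cooking_top & baking_top


# Toy titles standing in for the real corpus.
base_stopwords = {"the", "a", "to", "for", "my"}  # placeholder list
cooking = ["chicken recipe", "recipe help", "chicken recipe soup"]
baking = ["bread recipe", "recipe for bread", "bread recipe cake"]
extra_stopwords = shared_top_words(cooking, baking, n=2, stopwords=base_stopwords)
all_stopwords = base_stopwords | extra_stopwords
```

With the toy titles, the only word in the top two of both groups is "recipe", so it is the one added to the stopword list.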

Data Preprocessing & Modeling

Each model run produced an output recording model-specific values (estimator, transformer, train score, test score, best parameters and, for logistic regression, a coefficient dictionary). Each model was run a minimum of 5 times.

Notebook 03: Logistic Regression

The first model uses a classic logistic regression with a CountVectorized or TfidfVectorized X variable (post title). These models were first run using GridSearchCV with a large array of parameters, then rerun with the tuned hyperparameters.
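A run of this kind might be sketched as follows. This is not the estimator function from Notebook 03: the function name `run_logistic_search`, the pipeline step names `vec` and `clf`, the `max_iter` setting and the example parameter grid are assumptions, and the returned keys simply mirror the output values listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def run_logistic_search(titles, labels, param_grid, cv=5, random_state=42):
    """Grid-search a vectorizer + logistic regression pipeline on post titles."""
    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, stratify=labels, random_state=random_state
    )
    pipe = Pipeline([
        ("vec", CountVectorizer()),  # TfidfVectorizer() could be swapped in here
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    grid = GridSearchCV(pipe, param_grid, cv=cv)
    grid.fit(X_train, y_train)

    # Map each vocabulary word to its fitted coefficient in the best model.
    best = grid.best_estimator_
    vocabulary = best.named_steps["vec"].vocabulary_
    coefs = best.named_steps["clf"].coef_[0]
    return {
        "estimator": "LogisticRegression",
        "transformer": "CountVectorizer",
        "train score": grid.score(X_train, y_train),
        "test score": grid.score(X_test, y_test),
        "best parameters": grid.best_params_,
        "coefficients": {word: float(coefs[i]) for word, i in vocabulary.items()},
    }
```

A call such as `run_logistic_search(titles, labels, {"vec__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0]})` would search over unigram versus bigram features and two regularization strengths. Positive coefficients point toward the subreddit encoded as 1.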

Notebook 04: Decision Tree

The second model uses a decision tree with a CountVectorized or TfidfVectorized X variable (post title). These models were first run using GridSearchCV with a large array of parameters, then rerun with the tuned hyperparameters.

Notebook 05: Naive Bayes

The third and final model evaluated uses a Naive Bayes classifier with a CountVectorized or TfidfVectorized X variable (post title). These models were first run using GridSearchCV with a large array of parameters, then rerun with the tuned hyperparameters.
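To show how the three estimators and two transformers fit together, here is a minimal sketch that scores every combination with default-like settings. It deliberately skips the grid search, and the function name `score_all_combinations`, the choice of `MultinomialNB` as the Naive Bayes variant and the fixed hyperparameters are assumptions rather than the settings used in the notebooks.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def score_all_combinations(titles, labels, random_state=42):
    """Fit each transformer/estimator pair and record train and test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, stratify=labels, random_state=random_state
    )
    transformers = [CountVectorizer, TfidfVectorizer]
    estimators = {
        LogisticRegression: {"max_iter": 1000},
        DecisionTreeClassifier: {"random_state": random_state},
        MultinomialNB: {},  # assumed Naive Bayes variant for word counts
    }
    rows = []
    for transformer_cls, (estimator_cls, kwargs) in product(transformers, estimators.items()):
        pipe = make_pipeline(transformer_cls(), estimator_cls(**kwargs))
        pipe.fit(X_train, y_train)
        rows.append({
            "model": estimator_cls.__name__,
            "transformer": transformer_cls.__name__,
            "train score": pipe.score(X_train, y_train),
            "test score": pipe.score(X_test, y_test),
        })
    return rows
```

Each returned row has the same four fields as the score dictionary described under Data Dictionaries, so the list could be passed straight to a DataFrame constructor for comparison.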

Model Evaluation

Notebook 06 Comparing

Data Dictionaries

Data Dictionary for Input Data

df_cooking
df_baking
df_reddit
df_reddit_cleaned
df_reddit_process

Feature	Meaning	Type
title	Reddit post title	object
created_utc	Unix epoch time when the post was submitted	int64
selftext	Reddit post body	object
subreddit	name of subreddit	object
id	unique id for post	object
media_only	whether the post has a media attachment (boolean)	object

Data Dictionary for Score Dictionary

Feature	Meaning	Type
model	model name	object
transformer	transformer name	object
train score	training score	float64
test score	testing score	float64

Code Notebooks

Data Collection

Import Libraries
Get Data
Make DataFrame
Save to CSV ('cooking.csv' and 'baking.csv')

Data Exploration and Cleaning

Import Libraries
Read in Data
Combine Cooking and Baking DataFrames
Remove Nulls (Missing and Duplicates)
Save to CSV ('reddit_cleaned.csv')

Data Processing and Cleaning

Import Libraries
Read in Data
Drop Nulls (Again)
Review Top N Words in Corpus
Determine Additional Stopwords
Remove Stopwords
Process Words
Convert Subreddit Column to Binary
Save to CSV ('reddit_processed.csv')
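The null-dropping and binary-conversion steps in this outline could be sketched with pandas as below. The function name `process_frame`, the subreddit values "Cooking" and "Baking", and the choice of Baking as the positive class (1) are assumptions; the notebook may encode the classes the other way around.

```python
import pandas as pd


def process_frame(df, positive_label="Baking"):
    """Drop rows missing title or selftext and encode subreddit as 0/1."""
    out = df.dropna(subset=["title", "selftext"]).copy()
    # Assumption: posts from `positive_label` become 1 and all others 0.
    out["subreddit"] = (out["subreddit"] == positive_label).astype(int)
    return out.reset_index(drop=True)


# Toy rows standing in for the combined Cooking and Baking data.
raw = pd.DataFrame({
    "title": ["Roast chicken tips", None, "Sourdough starter help"],
    "selftext": ["How long should it rest?", "No title here", "It will not rise."],
    "subreddit": ["Cooking", "Cooking", "Baking"],
})
processed = process_frame(raw)
```

Calling `processed.to_csv("reddit_processed.csv", index=False)` afterwards would write the export named in the last step.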

Data Modeling: Logistic Regression

Import Libraries
Read in Data
Check for NaNs / Drop NaNs in X, y Features
Select X and y Features (Title and Subreddit)
Train Test Split
Build Estimator Function
Run Models (Complex -> Simple)
- Save best results dictionary to scores dataframe

Data Modeling: Decision Tree

Import Libraries
Read in Data
Check for NaNs / Drop NaNs in X, y Features
Select X and y Features (Title and Subreddit)
Train Test Split
Build Estimator Function
Run Models (Complex -> Simple)
- Save best results dictionary to scores dataframe

Data Modeling: Naive Bayes

Import Libraries
Read in Data
Check for NaNs / Drop NaNs in X, y Features
Select X and y Features (Title and Subreddit)
Train Test Split
Build Estimator Function
Run Models (Complex -> Simple)
- Save best results dictionary to scores dataframe

Model Evaluation

ROC AUC Scores
Word Histograms
Bag of Word Analysis
Confusion Matrix
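Two of the evaluation items above, the ROC AUC score and the confusion matrix, can be computed with scikit-learn's metrics module. The sketch below is an illustration under assumptions: the function name `evaluate_predictions` and the 0.5 threshold are not taken from Notebook 06.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score


def evaluate_predictions(y_true, y_prob, threshold=0.5):
    """Summarize binary predictions given true labels and class-1 probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    # With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "roc_auc": float(roc_auc_score(y_true, y_prob)),
        "accuracy": float(tp + tn) / len(y_true),
        "confusion": {"tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp)},
    }


# Toy labels and probabilities standing in for a model's test-set output.
summary = evaluate_predictions([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For a fitted pipeline, the probabilities would typically come from `predict_proba(X_test)[:, 1]`, and the resulting accuracy can be checked against the 95% target in the problem statement.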

Main Directory

Jupyter Notebooks (00 - 07)
[README] ('./README.md')
[Reddit Classification] ('./Reddit Classification.pdf')
[Sara Script] ('./sara.py')

Data Folder

Baking CSV
[Cooking CSV] ('./data/cooking.csv')
[Reddit Cleaned CSV] ('./data/reddit_cleaned.csv')
[Reddit Processed CSV] ('./data/reddit_processed.csv')
[All Scores CSV] ('./data/all_scores.csv')

Misc Folder

[Scratchpad] ('./misc/scratchpad.ipynb')
[Requirements] ('./misc/requirements.txt')
[starter_README] ('./misc/starter_README.md')

Conclusion

It is indeed possible to predict which subreddit a post belongs to. The scores show that the best-performing model was the K-NN model in combination with a CountVectorizer.

To Do List

add more viz
delete extraneous libraries from notebooks
build output csvs for model scores
build output csv for all scores
build out NB 07
