# Reddit NLP Classification Analysis

## Part I: Data Collection, Cleaning, & EDA

<img src='./images/pysql.png'>


run me! ↓

In [318]:
# Load the notebook's custom CSS
from IPython.display import HTML
def css_styling():
    # Use a context manager so the stylesheet file is closed after reading
    with open("../styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()


#### Executive Summary

The goal of this analysis is to use NLP to classify text posts as belonging to one of two subreddits.

Success will be measured by the model's accuracy in classifying posts correctly. A secondary goal is to try many models and find the one with the highest accuracy.

#### Subreddits

r/learnpython is a subreddit for people who are learning Python, mostly filled with resources and questions. r/learnsql is similar, but for people learning SQL. Both subreddits are fairly active, but r/learnpython is much more popular: it has 613k subscribers, compared with 17.7k for r/learnsql.

Both subreddits include fairly technical content and frequent code snippets, which makes them an interesting pair for a classification analysis. Classification could likely be done in more than one way:

* using the text content only (ignoring special characters), 
* using only special characters and pattern/frequency of special characters, as they make up the syntax of each language.
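
The two approaches above can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing used later in the analysis: the example post and the `split_features` helper are hypothetical, and the regular expressions are just one assumption about what counts as a word or a special character.

```python
import re
from collections import Counter

# Hypothetical example post; not taken from the collected data.
post = "How do I filter rows? df[df['score'] > 10] vs SELECT * FROM posts WHERE score > 10;"

def split_features(text):
    """Return (word tokens, special-character counts) for one post."""
    # Words: runs of letters, lowercased.
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Special characters: anything that is not a letter, digit, underscore, or whitespace.
    specials = Counter(re.findall(r"[^\w\s]", text))
    return words, specials

words, specials = split_features(post)
print(words[:5])
print(specials.most_common(3))
```

Either output could then feed a separate vectorizer, so the two feature sets can be compared on the same model.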

#### Research Questions

* What preprocessing steps are best for creating a classification model with high accuracy?
* Which classification models and hyperparameters lead to the highest accuracy score?


#### Imports

In [18]:
import requests
import pandas as pd
import json
import csv
import time
import datetime


#### Testing API

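Reddit's API allows 60 requests per minute. The requests in this notebook go through the Pushshift endpoint (<code>api.pushshift.io</code>), which serves archived Reddit submissions. Here is the basic process for pulling posts with the <code>requests</code> library.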

In [21]:
url = 'https://api.pushshift.io/reddit/search/submission'

params = {
    'subreddit': 'boardgames',
    'size': 100,
}

res = requests.get(url, params=params)
res.status_code

In [24]:
data=res.json()
posts = data['data']
len(posts)
df = pd.DataFrame(posts)
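
A limit of 60 requests per minute works out to at least one second between requests. The sketch below shows one way to enforce that spacing. `min_interval` and `Throttle` are hypothetical helpers, not part of `requests` or of the API.

```python
import time

def min_interval(requests_per_minute):
    """Smallest spacing in seconds between calls that stays within the limit."""
    return 60.0 / requests_per_minute

class Throttle:
    """Hypothetical helper: blocks so that calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable so the helper can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

throttle = Throttle(min_interval(60))
# Call throttle.wait() immediately before each requests.get(...) in a loop.
```

A fixed `time.sleep` between requests is simpler and is what the loops later in this notebook use. The helper only avoids sleeping longer than necessary when a request is itself slow.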

Here are all the fields the API returns for each post:

In [25]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_rece

Here are the 'subreddit', 'selftext', and 'title' columns of the data I pulled.

In [26]:
df[['subreddit', 'selftext', 'title']].head()

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | boardgames | As a young, 19 year old woman, I felt down upo... | The board game community saved my life |
| 1 | boardgames | I am looking at getting my partner a game mat ... | Suggestion: Tabletop Board Game Mat |
| 2 | boardgames | So I'm on the hunt, no pun intended, for a rea... | Best murder mystery co-op games (not board games) |
| 3 | boardgames | | I bought a retro chess computer! Enterprise S ... |
| 4 | boardgames | | You play r/Ark_Nova? Prove it, you zoo-buildin... |



## Data Collection

The API returns at most 100 posts per request. To pull more without collecting duplicates, each request needs its own time window, set with the 'before' and 'after' parameters.

To avoid converting between timestamps and calendar dates, I calculated the 'before' and 'after' values directly as Unix timestamps (in seconds). The windows don't need to be exact. These subreddits are active, so a one-month window yielding around 100 posts worked well, even though it doesn't capture every post in that month.

In [45]:
# Get the current time as a Unix timestamp.
# A Unix timestamp counts the seconds elapsed since January 1, 1970 (UTC).
# One year is about 31,536,000 seconds; one month is about 2,628,000.

from datetime import timezone
# now(timezone.utc) already returns a timezone-aware datetime, so no replace() is needed
today = datetime.datetime.now(timezone.utc).timestamp()

print(today)

1651443724.311495


I wrote a function that generates a dictionary of before:after timestamp pairs to pass into my API function.

In [36]:
def range_generator(num_months):
    # Build a {before: after} dict of consecutive one-month windows, counting back from `today`
    rangelist = {}
    onemonth = 2620000  # approximate number of seconds in one month
    for i in range(num_months):
        after = int(round(today - onemonth * (i + 1)))
        before = int(round(today - onemonth * i))  # when i is 0, "before" is today
        rangelist[before] = after
    return rangelist

range_generator(3)

1651442009.675046


{1651442010: 1648822010, 1648822010: 1646202010, 1646202010: 1643582010}
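
As a sanity check, the timestamp windows can be converted back to readable dates. The sketch below is a variant of `range_generator` written under stated assumptions: `month_windows` and `to_iso` are hypothetical names, the end timestamp is passed in explicitly instead of being read from the global `today`, and the windows are returned as (after, before) tuples.

```python
import datetime

ONE_MONTH = 2_620_000  # same approximation of one month, in seconds, as range_generator

def month_windows(end_ts, num_months, step=ONE_MONTH):
    """Return (after, before) pairs, newest first, each spanning `step` seconds."""
    end = int(round(end_ts))
    return [(end - step * (i + 1), end - step * i) for i in range(num_months)]

def to_iso(ts):
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).isoformat()

# End timestamp taken from the output printed above.
for after, before in month_windows(1651442009.675046, 3):
    print(to_iso(after), "->", to_iso(before))
```

At this step size, 34 windows reach back roughly 2.8 years and 80 windows roughly 6.6 years.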

Next I chose the fields I wanted to pull from the API and wrote a function to automate the API requests.

In [37]:
# Choosing the fields I want to pull from the API
fields = ['id', 'created_utc', 'title', 'selftext', 'author', 'upvote_ratio', 'score', 'num_comments', 'subreddit', 'total_awards_received', 'is_created_from_ads_ui', 'url']

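def query(subreddit, num, before, after):
    # Request up to `num` posts from `subreddit` created between `after` and `before` (Unix timestamps)
    params = {
        'subreddit': subreddit,
        'size': num,
        'before': before,
        'after': after,
        'fields': fields,    # defined above
    }
    url = 'https://api.pushshift.io/reddit/search/submission/?'
    res = requests.get(url, params=params)
    res.raise_for_status()    # fail on an HTTP error instead of trying to parse an error response

    data = res.json()
    return pd.DataFrame(data['data'])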

# Testing my function
query('learnpython', 1, 1651108665, 1648480377)

| | author | created_utc | id | is_created_from_ads_ui | num_comments | score | selftext | subreddit | title | total_awards_received | upvote_ratio | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | unit111 | 1648482436 | tqc9ff | False | 0 | 1 | Basically title. I am trying to figure out if ... | learnpython | Has anybody used Playwright with Behave | 0 | 1.0 | https://www.reddit.com/r/learnpython/comments/... |
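
To see exactly what `query` sends, the query string can be built without making a request. This sketch uses `urllib.parse.urlencode` from the standard library; `preview_url` is a hypothetical helper. With `doseq=True`, a list value becomes one repeated key per item, which is also how `requests` is documented to encode list-valued parameters. Whether the API treats repeated `fields` keys the same as a comma-separated value is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"  # endpoint used in this notebook

def preview_url(subreddit, size, before, after, fields):
    """Build the request URL without sending anything over the network."""
    params = {
        "subreddit": subreddit,
        "size": size,
        "before": before,
        "after": after,
        "fields": fields,
    }
    # doseq=True repeats the key for each list item: fields=id&fields=title
    return BASE_URL + "?" + urlencode(params, doseq=True)

print(preview_url("learnpython", 1, 1651108665, 1648480377, ["id", "title"]))
```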


I then iterated over the before:after dictionary and passed each key:value pair into the API function to set the time window of each request.

Because r/learnsql is much less active, I had to cover a much longer time span for it (80 monthly windows versus 34 for r/learnpython) to avoid unbalanced classes. I used a loop instead of a list comprehension so I could add sleep time between requests and avoid overwhelming the API.

In [46]:
# Pull and make dataframe for learnpython sub

# List comp version sometimes works and sometimes doesn't, possibly because I can't add sleep time. I've commented it out for now
# listdfs=[query('learnpython', 100, key, val) for key, val in rangelist.items()]

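listdfs = []
for before, after in range_generator(34).items():
    batch = query('learnpython', 100, before, after)
    print('pulled 100 rows')    # note: a request can return fewer than 100 rows
    listdfs.append(batch)
    time.sleep(10)    # pause between requests to avoid overwhelming the API

learnpython = pd.concat(listdfs)
learnpython.info()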

pulled 100 rows
[... the same line is printed 34 times in total, once per request, followed by the `learnpython.info()` summary shown in full in the next cell ...]

In [50]:
learnpython.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3396 entries, 0 to 98
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   author                  3396 non-null   object 
 1   created_utc             3396 non-null   int64  
 2   id                      3396 non-null   object 
 3   is_created_from_ads_ui  1098 non-null   object 
 4   num_comments            3396 non-null   int64  
 5   score                   3396 non-null   int64  
 6   selftext                3388 non-null   object 
 7   subreddit               3396 non-null   object 
 8   title                   3396 non-null   object 
 9   total_awards_received   3396 non-null   int64  
 10  upvote_ratio            2398 non-null   float64
 11  url                     3396 non-null   object 
dtypes: float64(1), int64(4), object(7)
memory usage: 344.9+ KB
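
The note above that the list comprehension version only sometimes works suggests that individual requests can fail. One option is to retry a failed pull with an increasing delay instead of relying on a fixed sleep alone. The sketch below is a hypothetical helper, `fetch_with_retry`, that wraps any zero-argument callable. The attempt count and delays are assumptions, not tuned values.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    `fetch` is any zero-argument callable, for example
    lambda: query('learnpython', 100, before, after).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as err:  # broad on purpose for a sketch; narrow it in real use
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 10 s, 20 s, 40 s, ...
            print(f"attempt {attempt} failed ({err!r}); retrying in {delay} s")
            sleep(delay)
```

Inside the loop, `batch = fetch_with_retry(lambda: query('learnpython', 100, before, after))` would replace the direct call to `query`.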



In [44]:
# Pull and make dataframe for learnsql sub
# List comp version sometimes works and sometimes doesn't, possibly because I can't add sleep time. I've commented it out for now
# listdfs=[query('learnsql', 100, key, val) for key, val in rangelist.items()]

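listdfs = []
for before, after in range_generator(80).items():
    batch = query('learnsql', 100, before, after)
    print('pulled 100 rows')    # note: a request can return fewer than 100 rows
    listdfs.append(batch)
    time.sleep(10)    # pause between requests to avoid overwhelming the API

learnsql = pd.concat(listdfs)
learnsql.info()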

pulled 100 rows
[... the same line repeats once per request; the captured output is cut off before the `learnsql.info()` summary, which the next cell shows in full ...]

In [51]:
learnsql.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3407 entries, 0 to 10
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   author                  3407 non-null   object 
 1   created_utc             3407 non-null   int64  
 2   id                      3407 non-null   object 
 3   is_created_from_ads_ui  1030 non-null   object 
 4   num_comments            3407 non-null   int64  
 5   score                   3407 non-null   int64  
 6   selftext                3407 non-null   object 
 7   subreddit               3407 non-null   object 
 8   title                   3407 non-null   object 
 9   total_awards_received   2580 non-null   float64
 10  upvote_ratio            1999 non-null   float64
 11  url                     3407 non-null   object 
dtypes: float64(2), int64(3), object(7)
memory usage: 346.0+ KB
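
With 3,396 and 3,407 rows, the two classes are close to balanced, which matters because accuracy is the success metric. The quick calculation below uses only the row counts reported above. It gives the accuracy a model would reach by always predicting the larger class, which is the baseline any real classifier has to beat.

```python
# Row counts reported by learnpython.info() and learnsql.info() above.
counts = {"learnpython": 3396, "learnsql": 3407}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a classifier that always predicts the larger class.
baseline_accuracy = max(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```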



#### Export dataframes

In [52]:
# export df
learnpython.to_csv('./data/learnpython2.csv')

In [53]:
learnsql.to_csv('./data/learnsql2.csv')
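
It can help to reset the index and drop repeated posts when exporting, since `pd.concat` keeps each batch's own 0-based index and adjacent windows share a boundary timestamp. The sketch below uses two small made-up batches rather than the real data, and `subset="id"` assumes the `id` column uniquely identifies a post.

```python
import io
import pandas as pd

# Two tiny stand-in batches; post "b2" appears in both, as it could at a shared window boundary.
batch_1 = pd.DataFrame({"id": ["a1", "b2"], "title": ["first", "second"]})
batch_2 = pd.DataFrame({"id": ["b2", "c3"], "title": ["second", "third"]})

combined = (
    pd.concat([batch_1, batch_2], ignore_index=True)  # fresh 0..n-1 index
    .drop_duplicates(subset="id")                     # keep the first copy of each post
    .reset_index(drop=True)
)

buffer = io.StringIO()                 # stands in for a path such as './data/learnpython2.csv'
combined.to_csv(buffer, index=False)   # index=False avoids an 'Unnamed: 0' column on reload
print(buffer.getvalue())
```

Writing without the index is a design choice: the saved index carries no information here, and leaving it out keeps the reloaded frame's columns identical to the ones that were pulled.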