# Subreddit Classification with NLP

## Overview 

Low-carbohydrate diets have grown in popularity recently. There is evidence that these diets promote weight loss, as discussed in [this article from Harvard](https://www.hsph.harvard.edu/nutritionsource/carbohydrates/low-carbohydrate-diets/#:~:text=Research%20shows%20that%20a%20moderately,selections%20come%20from%20healthy%20sources.&text=More%20evidence%20of%20the%20heart,for%20Heart%20Health%20(OmniHeart)).<br>
Reddit hosts many subcommunities, known as subreddits, where users share questions, advice, progress, and even frustrations about nutrition and the many different diet styles, including some dedicated to low- or no-carbohydrate diets. Within each subreddit, users can create text or image posts, and upvote or downvote posts to express approval or disapproval of their content. The numbers of upvotes and downvotes are fed into a hot-ranking algorithm that determines a score for each post, with higher-scoring posts rising to the top of the subreddit.

## Problem Statement

Given the rise in popularity of low/no carb diets, clients at MyNutrition Clinic have been requesting to be put on either a [Ketogenic](https://www.healthline.com/nutrition/ketogenic-diet-101) or a [Zerocarb (carnivore)](https://www.healthline.com/nutrition/carnivore-diet) diet. *(Resources describing these diets have been linked.)* <br>
<p>These requests stem from the clients' "I want to lose weight quickly" mentality, as well as their own additional research.
However, we are concerned that this may result in "yo-yo dieting", where the client rebounds from the diet by quitting because it's too difficult, by ending the diet once they reach their short-term goal, or by losing control and binge eating.
    
This rebound leaves our clients in worse shape than when they started. If a client is falsely identified as belonging to the `zerocarb` group and eventually rebounds, that rebound is worse than a rebound from the `keto` diet, as the `zerocarb` diet is generally accepted as being the "harder" diet with more restrictions. As such, we ask ourselves the question: **Can we build a classification model that classifies our clients into each group based on subreddit data, while minimizing false positives, especially for the `zerocarb` group?**
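
Minimizing false positives for the `zerocarb` group corresponds to maximizing the precision of that class. The sketch below shows the calculation on a handful of made-up labels; the labels and the helper name `precision_for` are illustrative assumptions, not results from this project.

```python
def precision_for(label, y_true, y_pred):
    """Return precision for `label`: TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if p == label and t == label)
    false_pos = sum(1 for t, p in pairs if p == label and t != label)
    predicted = true_pos + false_pos
    # if the label was never predicted, precision is undefined;
    # this sketch returns 0.0 in that case
    return true_pos / predicted if predicted else 0.0


# made-up labels, purely for illustration
y_true = ['keto', 'zerocarb', 'keto', 'zerocarb', 'keto']
y_pred = ['keto', 'zerocarb', 'zerocarb', 'zerocarb', 'keto']

zerocarb_precision = precision_for('zerocarb', y_true, y_pred)
print(f'zerocarb precision: {zerocarb_precision:.2f}')
```

Here one `keto` client is wrongly predicted as `zerocarb`, which is exactly the kind of error we want to keep rare.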

To tackle this problem we selected two subreddits that seemed closely linked with our problem statement, namely: 
1. `r/keto`
2. `r/zerocarb`

In this and the following notebooks, we will be carrying out the following actions:
- Web Scraping
- Data Cleaning and EDA
- Data Vectorizing and Modeling
- Model Evaluation and Conclusion

This is the first of four notebooks:
1. **Subreddit Scraping (Current Notebook)**
2. [Data Cleaning](2_Data_Cleaning.ipynb)
3. [EDA on Cleaned Subreddits Data](3_EDA_on_Subreddits.ipynb)
4. [Modeling and Conclusions](4_Modeling_and_Conclusions.ipynb)

Quick Links:<br>
[1. Imports](#Imports)<br>
[2. Custom Class](#Custom-Class)<br>
[3. `r/keto` Scraping](#keto-Subreddit-Scraping)<br>
[4. `r/zerocarb` Scraping](#zerocarb-Subreddit-Scraping)

# Imports

In [26]:
import requests
import pandas as pd
import time
import re
from random import randint

---

# Custom Class

We will be scraping `10_000` posts from each subreddit. Along the way we might run into errors that interrupt the process and would otherwise cause all progress to be lost. To avoid this, we will create a class that scrapes the posts and keeps everything gathered so far.

In [25]:
class SubredditPosts:
    """SubredditPosts object which has methods to pull 
    n number of posts from a specified subreddit according 
    to parameters accepted by the PushShift API. The
    parameters argument is REQUIRED.
    url accepts the API url as a string. 
    parameters accepts a dict of parameters as per the PushShift 
    API documentation. 'fields' is a required key in parameters.
    posts_required accepts an int input for the number of 
    posts to be pulled.
    fields accepts the list of column names you have given 
    under 'fields' in parameters.
    dataframe is an optional argument. If you give a
    DataFrame as an input (its column names should be the same
    as the fields in parameters), then new posts will be added 
    on to this DataFrame."""
    
    # initializing the class and the required arguments
    def __init__(self, url, parameters, posts_required, 
                 fields, dataframe=None):
        # creating attributes that will be required in 
        # later class methods
        self.url = url
        self.params = parameters
        self.req_posts = posts_required
        self.trimmed_posts = []
        self.posts = []
        
        # check if a non-empty DataFrame was passed into the class
        if dataframe is not None and len(dataframe):
            
            # adjusting the 'before' date so that we will 
            # not pull duplicate reddit posts
            self.params['before'] = dataframe.iloc[-1]['created_utc']
            
            print('DataFrame not empty, '
                  'params adjusted to contain "Before Date".')
            # creating an attribute to assign the DataFrame to
            self.gathered_posts = dataframe
        else:
            # since no DataFrame passed, we just create a blank DataFrame
            self.gathered_posts = pd.DataFrame(columns=fields)
            print('Attributes created successfully.')
    
    # method to scrape posts from reddit
    def get_posts(self):
        """Calling this method starts a while loop to 
        get the number of posts specified when creating
        the SubredditPosts object. The while loop can be
        interrupted via KeyboardInterrupt, and you will 
        be able to keep progress of your pulled posts."""
        
        # creating variables for the while loop
        enough_posts = False
        runthrough = 0
        
        # print statement to let us know that the function is running.
        print(f'Starting runthrough number {runthrough + 1}.')
              
        while not enough_posts:
            # try and except block for any errors that might occur
            try:
                # resetting trimmed post at the beginning of each loop
                self.trimmed_posts = []
                
                # calling another method that actually pulls n number 
                # of posts from a subreddit
                self.subreddit_scrape()
                
                # print statement to let the user know if API pull is successful.
                print('PushshiftAPI request, and posts filtering is successful.')
                
                # adding the pulled posts into our cumulative DataFrame
                self.gathered_posts = pd.concat([self.gathered_posts, 
                                           pd.DataFrame(self.trimmed_posts)])
                
                # dropping duplicated posts as quite a few reposts take place.
                self.gathered_posts.drop_duplicates(subset=['selftext', 'title'],
                                                    keep='last', inplace=True)
                
                # resetting the index after concat; drop=True discards the
                # old index instead of adding it as an extra column
                self.gathered_posts.reset_index(drop=True, inplace=True)
                
                # setting variable for num of posts gathered so far
                num_posts_gathered = len(self.gathered_posts)
                
                # if statement to check if enough posts are gathered
                if num_posts_gathered < self.req_posts:
                    
                    # calling method to set the date to the latest date pulled
                    self.adjust_date()
                    
                    # stop the function for 1-5 seconds, to ensure we don't 
                    # query the API too many times.
                    time.sleep(randint(1,5))
                    
                    # adjusting the runthrough number
                    runthrough += 1
                    
                    # print statement to let user know how many posts have been gathered
                    # and which runthrough we are on
                    print(f'{num_posts_gathered} posts have been gathered so far, '
                    f'starting runthrough number {runthrough + 1}.')

                else:
                    # if enough posts are gathered, we exit the loop
                    enough_posts = True
                    
                    # print statement to update the user that enough 
                    # posts have been gathered
                    print(f'{num_posts_gathered} posts have been gathered, '
                    f'this allows us to exit the while loop.')
                    
            # except block in case of a connection error or ValueError;
            # requests raises its own exception classes on network 
            # failures, so those are caught here as well
            except (ConnectionError, ValueError,
                    requests.exceptions.RequestException) as e:
                
                # ask for user input if they want to continue pulling, 
                # or they can stop the loop and review the DataFrame 
                # pulled so far
                user_input = input(f'There has been an Error, --> {e}.'
                                   f'Do you wish to continue pulling '
                                   f'more posts? Type Y/N \n').lower()
                
                # check user input and take appropriate action
                if user_input == 'y':
                    # restart the scraping loop (the runthrough counter 
                    # starts again from 1); once that call returns, 
                    # this outer loop must stop as well
                    self.get_posts()
                    enough_posts = True
                        
                else:
                    enough_posts = True
                    print('While loop has been exited, and pulled data has been saved.')
            
            # user may try to stop the function from running for some reason.. 
            # we will present the user with an option to stop the loop.
            except KeyboardInterrupt:
                
                # let the user know the interrupt was caught; re-raising 
                # here would skip the prompt below
                print('You have attempted to stop the kernel.')
                
                # ask for user input if they want to continue pulling, 
                # or they can stop the loop and review the DataFrame 
                # pulled so far
                user_input = input(f'Do you wish to continue pulling '
                                   f'more posts? Type Y/N \n').lower()
                
                # check user input and take appropriate action
                if user_input == 'y':
                    # same restart logic as in the error case above
                    self.get_posts()
                    enough_posts = True
                        
                else:
                    enough_posts = True
                    print('While loop has been exited, and pulled data has been saved.')
    
    
    # method to pull subreddit posts
    def subreddit_scrape(self):
        """Internally called method. Checks the post against
        criteria for this problem statement."""
        
        # use the requests library to scrape from reddit in the specific subreddit
        # as per the user defined params
        res = requests.get(self.url, self.params)
        
        # status code 200 means scrape was successful
        # if not successful, then raise ConnectionError
        if res.status_code != 200:
            raise ConnectionError(f'Error code:{res.status_code} has occured.')
            
        # if scrape successful
        else:
            # update the posts attribute
            # reason for this attribute: of the 100 posts we pull, possibly 
            # none will "pass" the conditions below and be added to our 
            # cumulative DataFrame. If that is the case, we still need a way 
            # to adjust the 'before' date, and this attribute allows that
            self.posts = res.json()['data']
            
            # create throwaway variable to add posts if conditions met
            pulled_posts = []
            for post in self.posts:
            # each post is a dictionary
                # check to see if post is blank
                if post == '':
                    pass
                
                # check to see if certain keys exists in this post
                elif 'selftext' in post.keys() and \
                'is_video' in post.keys() and \
                'media_only' in post.keys():
                    
                    # filter out posts that are only media or videos
                    if post['is_video'] == False and \
                    post['media_only'] == False:
                        
                        # filter out posts that are removed, deleted or blank
                        if post['selftext'] != '[removed]' and \
                        post['selftext'] != '[deleted]' and \
                        post['selftext'] != '':
                            pulled_posts.append(post)
                
                # if previous keys check failed, check the following keys
                # if we go back far enough on a subreddit, the old keys 
                # were different
                elif 'selftext' in post.keys() and \
                'is_video' in post.keys():
                    
                    # filter out posts with videos
                    if post['is_video'] == False:
                        
                        # filter out posts that are removed, deleted or blank
                        if post['selftext'] != '[removed]' and \
                        post['selftext'] != '[deleted]' and \
                        post['selftext'] != '':
                            pulled_posts.append(post)
                
                # if previous keys check failed, check the following keys
                # if we go back far enough on a subreddit, the old keys 
                # were different
                elif 'selftext' in post.keys():
                    
                    # filter out posts that are removed, deleted or blank
                    if post['selftext'] != '[removed]' and \
                    post['selftext'] != '[deleted]' and \
                    post['selftext'] != '':
                        pulled_posts.append(post)
            
            # update the trimmed posts attribute, which gets passed back to the 
            # get_posts() method
            self.trimmed_posts.extend(pulled_posts)
            
    # method to adjust the date
    # this function helps to make the 'get_posts()' function less cluttered
    def adjust_date(self):
        """Internally called method. Adjusts the date of the 
        next pull from the subreddit."""
        
        # if trimmed posts not blank, it means some posts passed our checks
        if len(self.trimmed_posts):
            # in this case we can change the date to the trimmed post
            self.params['before'] = self.trimmed_posts[-1]['created_utc']
            
        # if trimmed_posts blank, then check if posts pulled are blank
        elif len(self.posts):
            # if posts pulled not blank, it means all these posts 
            # failed our checks, so we need to adjust the date from here
            # instead
            self.params['before'] = self.posts[-1]['created_utc']
        
        # if both trimmed posts and pulled posts blank, it means 
        # the subreddit has no more posts to be pulled.
        else:
            raise ValueError(f'There are no more posts that can be pulled!')

---
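
The paging logic shared by `get_posts()` and `adjust_date()` can be seen in isolation in the sketch below. It swaps the API for an in-memory stub, so `fake_fetch`, `paginate`, and the sample timestamps are all hypothetical. It also assumes that each response is ordered newest first and that `before` excludes posts at or after the given timestamp.

```python
def fake_fetch(posts, before=None, size=3):
    """Stand-in for the API: return up to `size` posts created
    strictly before `before`, newest first."""
    eligible = [p for p in posts
                if before is None or p['created_utc'] < before]
    eligible.sort(key=lambda p: p['created_utc'], reverse=True)
    return eligible[:size]


def paginate(posts, required, size=3):
    """Collect up to `required` posts by moving the `before` cursor
    to the oldest post of each batch."""
    gathered, before = [], None
    while len(gathered) < required:
        batch = fake_fetch(posts, before=before, size=size)
        if not batch:
            # nothing older is left, which is the situation where
            # adjust_date() raises its ValueError
            break
        gathered.extend(batch)
        before = batch[-1]['created_utc']
    # trimmed here for a tidy result; the class keeps the full last batch
    return gathered[:required]


# hypothetical posts with made-up timestamps
sample = [{'title': f'post {i}', 'created_utc': 1000 + i} for i in range(8)]

first_five = paginate(sample, required=5)
print([p['created_utc'] for p in first_five])

everything = paginate(sample, required=20)
print(len(everything))
```

Asking for more posts than exist simply returns everything available, which is what happens when a subreddit runs out of posts.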

Now that the class is created, we will create a custom field list that we want to pull from the subreddit, based on which fields may help in EDA and modeling. 

In [27]:
# creating custom field list 
col_list = ['can_mod_post', 'created_utc', 'is_self', 'is_video', 
            'locked', 'media_only', 'no_follow', 'num_comments', 
            'num_crossposts', 'over_18','pinned', 'score','upvote_ratio',
            'selftext','subreddit', 'subreddit_subscribers', 
            'subreddit_type', 'title','total_awards_received', 
            'whitelist_status']

---

# `keto` Subreddit Scraping

In [15]:
# defining keto url and params to feed into our class
url_keto = 'https://api.pushshift.io/reddit/search/submission'
params_keto = {
    'size': 100,
    'subreddit': 'keto',
    'fields': col_list   
}
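
To see roughly what query string such a parameter dictionary produces, the sketch below encodes a shortened version with the standard library. `requests` sends a list value as one repeated key per item, which `urlencode(..., doseq=True)` reproduces here. The names `example_fields` and `example_params` are illustrative, and the three-item list is a stand-in for `col_list`.

```python
from urllib.parse import urlencode

# shortened, illustrative stand-in for col_list
example_fields = ['created_utc', 'selftext', 'title']
example_params = {
    'size': 100,
    'subreddit': 'keto',
    'fields': example_fields,
}

# doseq=True expands the list into one 'fields=...' pair per item
query = urlencode(example_params, doseq=True)
print(query)
```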

In [28]:
# creating keto class
keto = SubredditPosts(url_keto, params_keto, 10_000, col_list)

In [30]:
# calling 'get_posts()' method to start web scraping
keto.get_posts()

Starting runthrough number 1.
PushshiftAPI request, and posts filtering is successful.
69 posts have been gathered so far, starting runthrough number 2.
PushshiftAPI request, and posts filtering is successful.
120 posts have been gathered so far, starting runthrough number 3.
PushshiftAPI request, and posts filtering is successful.
179 posts have been gathered so far, starting runthrough number 4.
PushshiftAPI request, and posts filtering is successful.
233 posts have been gathered so far, starting runthrough number 5.
PushshiftAPI request, and posts filtering is successful.
285 posts have been gathered so far, starting runthrough number 6.
PushshiftAPI request, and posts filtering is successful.
341 posts have been gathered so far, starting runthrough number 7.
PushshiftAPI request, and posts filtering is successful.
414 posts have been gathered so far, starting runthrough number 8.
PushshiftAPI request, and posts filtering is successful.
481 posts have been gathered so far, starting 

PushshiftAPI request, and posts filtering is successful.
3718 posts have been gathered so far, starting runthrough number 67.
PushshiftAPI request, and posts filtering is successful.
3768 posts have been gathered so far, starting runthrough number 68.
PushshiftAPI request, and posts filtering is successful.
3820 posts have been gathered so far, starting runthrough number 69.
PushshiftAPI request, and posts filtering is successful.
3872 posts have been gathered so far, starting runthrough number 70.
PushshiftAPI request, and posts filtering is successful.
3916 posts have been gathered so far, starting runthrough number 71.
PushshiftAPI request, and posts filtering is successful.
3966 posts have been gathered so far, starting runthrough number 72.
PushshiftAPI request, and posts filtering is successful.
4026 posts have been gathered so far, starting runthrough number 73.
PushshiftAPI request, and posts filtering is successful.
4076 posts have been gathered so far, starting runthrough num

PushshiftAPI request, and posts filtering is successful.
7234 posts have been gathered so far, starting runthrough number 132.
PushshiftAPI request, and posts filtering is successful.
7292 posts have been gathered so far, starting runthrough number 133.
PushshiftAPI request, and posts filtering is successful.
7353 posts have been gathered so far, starting runthrough number 134.
PushshiftAPI request, and posts filtering is successful.
7408 posts have been gathered so far, starting runthrough number 135.
PushshiftAPI request, and posts filtering is successful.
7469 posts have been gathered so far, starting runthrough number 136.
PushshiftAPI request, and posts filtering is successful.
7538 posts have been gathered so far, starting runthrough number 137.
PushshiftAPI request, and posts filtering is successful.
7592 posts have been gathered so far, starting runthrough number 138.
PushshiftAPI request, and posts filtering is successful.
7646 posts have been gathered so far, starting runthro

In [32]:
# check to ensure that we have at least 10_000 posts
keto.gathered_posts.shape

(10058, 20)

In [34]:
# saving the gathered 'r/keto' posts to csv
keto.gathered_posts.to_csv('datasets/keto_subreddit.csv', index=False)

---

# `zerocarb` Subreddit Scraping

In [35]:
# defining zerocarb url and params to feed into our class
url_zerocarb = 'https://api.pushshift.io/reddit/search/submission'
params_zerocarb = {
    'size': 100,
    'subreddit': 'zerocarb',
    'fields': col_list   
}

In [36]:
# creating zerocarb class
zerocarb = SubredditPosts(url_zerocarb, params_zerocarb, 10_000, col_list)

In [37]:
# calling 'get_posts()' method to start web scraping
zerocarb.get_posts()

Starting runthrough number 1.
PushshiftAPI request, and posts filtering is successful.
15 posts have been gathered so far, starting runthrough number 2.
PushshiftAPI request, and posts filtering is successful.
33 posts have been gathered so far, starting runthrough number 3.
PushshiftAPI request, and posts filtering is successful.
58 posts have been gathered so far, starting runthrough number 4.
PushshiftAPI request, and posts filtering is successful.
76 posts have been gathered so far, starting runthrough number 5.
PushshiftAPI request, and posts filtering is successful.
96 posts have been gathered so far, starting runthrough number 6.
PushshiftAPI request, and posts filtering is successful.
120 posts have been gathered so far, starting runthrough number 7.
PushshiftAPI request, and posts filtering is successful.
135 posts have been gathered so far, starting runthrough number 8.
PushshiftAPI request, and posts filtering is successful.
163 posts have been gathered so far, starting runt

2149 posts have been gathered so far, starting runthrough number 67.
PushshiftAPI request, and posts filtering is successful.
2197 posts have been gathered so far, starting runthrough number 68.
PushshiftAPI request, and posts filtering is successful.
2243 posts have been gathered so far, starting runthrough number 69.
PushshiftAPI request, and posts filtering is successful.
2280 posts have been gathered so far, starting runthrough number 70.
PushshiftAPI request, and posts filtering is successful.
2324 posts have been gathered so far, starting runthrough number 71.
There has been an Error, --> Error code:522 has occured..Do you wish to continue pulling more posts? Type Y/N 
y
Starting runthrough number 1.
PushshiftAPI request, and posts filtering is successful.
2374 posts have been gathered so far, starting runthrough number 2.
PushshiftAPI request, and posts filtering is successful.
2425 posts have been gathered so far, starting runthrough number 3.
PushshiftAPI request, and posts fi

5613 posts have been gathered so far, starting runthrough number 61.
PushshiftAPI request, and posts filtering is successful.
5651 posts have been gathered so far, starting runthrough number 62.
PushshiftAPI request, and posts filtering is successful.
5686 posts have been gathered so far, starting runthrough number 63.
PushshiftAPI request, and posts filtering is successful.
5720 posts have been gathered so far, starting runthrough number 64.
PushshiftAPI request, and posts filtering is successful.
5753 posts have been gathered so far, starting runthrough number 65.
PushshiftAPI request, and posts filtering is successful.
5800 posts have been gathered so far, starting runthrough number 66.
PushshiftAPI request, and posts filtering is successful.
5836 posts have been gathered so far, starting runthrough number 67.
PushshiftAPI request, and posts filtering is successful.
5875 posts have been gathered so far, starting runthrough number 68.
PushshiftAPI request, and posts filtering is succ

9260 posts have been gathered so far, starting runthrough number 126.
PushshiftAPI request, and posts filtering is successful.
9342 posts have been gathered so far, starting runthrough number 127.
PushshiftAPI request, and posts filtering is successful.
9429 posts have been gathered so far, starting runthrough number 128.
PushshiftAPI request, and posts filtering is successful.
9483 posts have been gathered so far, starting runthrough number 129.
PushshiftAPI request, and posts filtering is successful.
9561 posts have been gathered so far, starting runthrough number 130.
PushshiftAPI request, and posts filtering is successful.
9646 posts have been gathered so far, starting runthrough number 131.
PushshiftAPI request, and posts filtering is successful.
9734 posts have been gathered so far, starting runthrough number 132.
PushshiftAPI request, and posts filtering is successful.
9807 posts have been gathered so far, starting runthrough number 133.
PushshiftAPI request, and posts filtering

In [5]:
# check to ensure that we have enough posts
zerocarb.gathered_posts.shape

(9935, 20)

*In this case, we actually ran out of posts on the `r/zerocarb` subreddit. However, we are only `65` posts short of `10_000`, so we will still continue with our analysis.*

In [38]:
# saving the gathered 'r/zerocarb' posts to csv
zerocarb.gathered_posts.to_csv('datasets/zerocarb_subreddit.csv', index=False)

Now that we have pulled roughly `10_000` posts from each subreddit, we are ready to do some data cleaning and exploratory analysis. We will tackle that in the next notebook.