# Project 3 - Reddit APIs & Classification (Data scraping section)

## Background

Reddit is a massive collection of forums where people share news and content or comment on other people’s posts. It is divided into more than a million communities known as “subreddits”, each of which covers a different topic.

## Problem Statement

Posts about PyTorch and TensorFlow can end up in the wrong subreddit because the two communities cover similar content. Moderators then have to clean up such posts occasionally to keep each subreddit relevant for its viewers.

In this project, I aim to develop classifier models and determine whether a Naive Bayes classifier or another classifier model is suitable for classifying a Reddit post as belonging to either the PyTorch or the TensorFlow subreddit. Such a classifier could help moderators decide which posts require cleanup.

## Executive Summary

The growth of the AI industry and the availability of open-source frameworks such as TensorFlow and PyTorch, the top two frameworks used by AI researchers, enthusiasts and application engineers to implement neural network architectures, have led to a large number of questions and answers being posted online, especially on Reddit.

As a result, some posts about the two frameworks may be submitted to the wrong subreddit. As the number of posts grows, subreddit stakeholders would benefit from a classifier that flags misplaced posts and comments as part of their cleanup process.

In this project, more than 900 unique Reddit posts were scraped from each of the PyTorch and TensorFlow subreddits via Reddit's API. They are used to train classifier models that could recommend to Reddit moderators whether a specific post should be moved to the correct subreddit.

# Implementation
This notebook scrapes posts from the subreddits of interest (TensorFlow and PyTorch) and, after removing duplicate entries, saves the scraped data to CSV files.

## Import necessary libraries and define helper functions

In [1]:
import requests
import time
import pandas as pd
import random

In [2]:
def reddit_query(url, req_headers, query_times=4):
    """
    Function that queries a reddit url a given number of times and collects the posts
    returned by each request. A customised user agent is required to avoid HTTP 429
    (Too Many Requests) errors, which occur when many clients share a common request header.

    Arguments:
    @url: reddit url to be queried
    @req_headers: request headers containing a customised "User-agent"
    @query_times: number of queries to be made towards reddit url

    This function returns the scraped posts in the form of a python list.
    """
    posts = []
    after = None
    base_agent = req_headers["User-agent"]
    for count in range(query_times):
        print("Query count {}".format(count))
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after

        # Vary the user agent per query on a copy, so the caller's dict is not modified
        headers = dict(req_headers)
        headers["User-agent"] = base_agent + str(count)

        res = requests.get(current_url, headers=headers)
        # Check the status code before extending the list of posts
        if res.status_code == 200:
            print("Request sucessful")
            the_json = res.json()
            posts.extend(the_json['data']['children'])
        else:
            print("Request failure")
            print(res.status_code)
            break

        # If `after` comes back as None, the next query restarts from the first page;
        # repeated posts are dropped later in the notebook
        after = the_json['data']['after']
        # Throttle accordingly based on 60 requests/min restrictions
        sleep_duration = random.randint(2, 10)
        print(f"Sleeping {sleep_duration}s")
        time.sleep(sleep_duration)
    return posts



def empty_string_proportion(df, col):
    """
    Function that computes the proportion of empty strings in a provided dataframe column.
    It prints the counts and returns the proportion as a float.

    Arguments:
    @df: pandas dataframe
    @col: dataframe column name of interest in string
    """
    if col not in df.columns:
        raise KeyError(f"Sorry, no such column exists: {col}")

    number_entries = df.shape[0]
    # Rows whose value in `col` is exactly the empty string
    empty_string_entries = df[df[col] == ""].shape[0]
    proportion = empty_string_entries / number_entries
    print(f"Entries with empty string: {empty_string_entries} out of {number_entries}")
    return proportion
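
The pagination step inside `reddit_query` appends the `after` token to the listing URL by string concatenation. A minimal sketch of the same step using `urllib.parse.urlencode` is shown here; the helper name `build_listing_url` is hypothetical and not part of the notebook, and the token value is only an example taken from a post `name`.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, after=None):
    """Return the listing URL, adding the `after` token when one is available."""
    if after is None:
        # No token yet: request the first page of the listing
        return base_url
    return f"{base_url}?{urlencode({'after': after})}"


first_page = build_listing_url("https://www.reddit.com/r/pytorch.json")
next_page = build_listing_url("https://www.reddit.com/r/pytorch.json", after="t3_m12byo")
print(first_page)
print(next_page)
```

Encoding the parameter this way keeps the URL valid even if a token ever contains characters that need escaping.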

## Scraping subreddits of interest
The variable names used are generic, so the notebook can scrape subreddits on any topic of interest without renaming variables when the topics change.

In [3]:
subreddit_url1 = "https://www.reddit.com/r/pytorch.json"
subreddit_url2 = "https://www.reddit.com/r/tensorflow.json"

In [4]:
req_headers = {"User-agent": "G.A_Proj3_QZQ"}

In [5]:
posts1 = reddit_query(subreddit_url1, req_headers, query_times = 80)
posts2 = reddit_query(subreddit_url2, req_headers, query_times = 80)

Query count 0
Request sucessful
Sleeping 5s
Query count 1
Request sucessful
Sleeping 2s
Query count 2
Request sucessful
Sleeping 8s
Query count 3
Request sucessful
Sleeping 7s
Query count 4
Request sucessful
Sleeping 4s
Query count 5
Request sucessful
Sleeping 2s
Query count 6
Request sucessful
Sleeping 6s
Query count 7
Request sucessful
Sleeping 8s
Query count 8
Request sucessful
Sleeping 2s
Query count 9
Request sucessful
Sleeping 4s
Query count 10
Request sucessful
Sleeping 2s
Query count 11
Request sucessful
Sleeping 8s
Query count 12
Request sucessful
Sleeping 3s
Query count 13
Request sucessful
Sleeping 2s
Query count 14
Request sucessful
Sleeping 4s
Query count 15
Request sucessful
Sleeping 9s
Query count 16
Request sucessful
Sleeping 10s
Query count 17
Request sucessful
Sleeping 7s
Query count 18
Request sucessful
Sleeping 2s
Query count 19
Request sucessful
Sleeping 2s
Query count 20
Request sucessful
Sleeping 5s
Query count 21
Request sucessful
Sleeping 6s
Query count 22
Requ

**Check the number of posts for each subreddit. A large number of posts needs to be collected.**

In [6]:
print(f"Number of posts for subreddit1: {len(posts1)}")       
print(f"Number of posts for subreddit2: {len(posts2)}") 

Number of posts for subreddit1: 1988
Number of posts for subreddit2: 1992
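
The raw counts above may include the same post more than once, since the query loop requests the first page again whenever no `after` token is available. A quick way to gauge this before building the dataframes is to count distinct post identifiers. The sketch below assumes the `name` field (for example `'t3_m12byo'`) identifies a post; the helper `count_unique_posts` and the toy data are hypothetical stand-ins for `posts1` and `posts2`.

```python
def count_unique_posts(posts):
    """Count distinct posts using the 'name' field of each listing entry."""
    return len({p["data"]["name"] for p in posts})


# Toy stand-in for a scraped list; real entries carry many more fields
toy_posts = [
    {"kind": "t3", "data": {"name": "t3_aaa"}},
    {"kind": "t3", "data": {"name": "t3_bbb"}},
    {"kind": "t3", "data": {"name": "t3_aaa"}},  # same post seen on a later pass
]
print(count_unique_posts(toy_posts))
```

Note that the notebook itself removes duplicates on `selftext` and `title` rather than on `name`, so the two counts need not match exactly.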


## Exploration of json structure of scraped data

In [7]:
sample_post1 = posts1[0]
sample_post2 = posts2[0]
sample_post1

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'pytorch',
  'selftext': 'I have implemented ResNet-18 CNN from scatch in Python and PyTorch using CIFAR-10 dataset. You can see it [here](https://github.com/arjun-majumdar/CNN_Classifications/blob/master/ResNet-18_CIFAR10-PyTorch.ipynb).\n\nLet me know your comments/feedbacks.',
  'author_fullname': 't2_2mmql89p',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'ResNet-18 from scratch',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/pytorch',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': None,
  'downs': 0,
  'thumbnail_height': None,
  'top_awarded_type': None,
  'hide_score': False,
  'name': 't3_m12byo',
  'quarantine': False,
  'link_flair_text_color': 'dark',
  'upvote_ratio': 0.76,
  'author_flair_background_color': None,
  'subreddit_type': 'public',
  'ups': 9,
  'total_awards_received': 0,
  'media_embed': {},
  'thumbnail_width': None,
  'aut

In [8]:
sample_post1['data']

{'approved_at_utc': None,
 'subreddit': 'pytorch',
 'selftext': 'I have implemented ResNet-18 CNN from scatch in Python and PyTorch using CIFAR-10 dataset. You can see it [here](https://github.com/arjun-majumdar/CNN_Classifications/blob/master/ResNet-18_CIFAR10-PyTorch.ipynb).\n\nLet me know your comments/feedbacks.',
 'author_fullname': 't2_2mmql89p',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'ResNet-18 from scratch',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/pytorch',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': None,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_m12byo',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 0.76,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 9,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': None,
 'author_flair_template_id': None,
 'is_original_content'

In [9]:
sample_post2['data']

{'approved_at_utc': None,
 'subreddit': 'tensorflow',
 'selftext': "You can discuss anything here that doesn't require it's own post",
 'author_fullname': 't2_yr9xa',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'The Official Feedback and Discussion Thread',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/tensorflow',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': '',
 'downs': 0,
 'thumbnail_height': None,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_fxzwdq',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 1.0,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 3,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': None,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': None,
 'is_reddit_media_domain': False,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_te

In [10]:
# The keys would be our dataframe column names
sample_post1['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'post_hint', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'sug

In [11]:
# Check subreddit of the post(label)
sample_post1['data']['subreddit']

'pytorch'

In [12]:
post_list1 = [p['data'] for p in posts1]
post_list2 = [p['data'] for p in posts2]

In [13]:
df1 = pd.DataFrame(post_list1)
print(df1.shape)
df2 = pd.DataFrame(post_list2)
print(df2.shape)

(1988, 114)
(1992, 114)


**Drop duplicates**

In [14]:
df1_nodup = df1.drop_duplicates(subset=['selftext','title'])
df2_nodup = df2.drop_duplicates(subset=['selftext','title'])
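
To make the effect of `subset=['selftext','title']` concrete, here is a small sketch on a made-up dataframe (the values are illustrative only). A row is dropped only when both `selftext` and `title` match an earlier row; other columns such as `ups` are ignored in the comparison, and the first occurrence is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "selftext": ["x", "x", "y", "x"],
    "ups": [1, 5, 2, 3],
})

# Row 1 repeats the (selftext, title) pair of row 0, so only row 1 is dropped
deduped = toy.drop_duplicates(subset=["selftext", "title"])
print(deduped)
```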

**Check which columns differ between the PyTorch and TensorFlow dataframes**

In [15]:
set(df1_nodup.columns) - set(df2_nodup.columns)

{'poll_data'}

In [16]:
set(df2_nodup.columns) - set(df1_nodup.columns)

{'link_flair_template_id'}
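
Both directions of the column comparison can be computed in one call. The following is a minimal sketch with a hypothetical helper `column_differences`, run on empty toy dataframes whose column names mirror the two differences found above.

```python
import pandas as pd


def column_differences(left, right):
    """Return (columns only in left, columns only in right) as sorted lists."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right


toy1 = pd.DataFrame(columns=["title", "selftext", "poll_data"])
toy2 = pd.DataFrame(columns=["title", "selftext", "link_flair_template_id"])
print(column_differences(toy1, toy2))
```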

**Check the proportion of empty strings in the selftext column**

In [17]:
empty_string_proportion(df1_nodup, "selftext")

Entries with empty string: 230 out of 943


0.24390243902439024

In [18]:
empty_string_proportion(df2_nodup, "selftext")

Entries with empty string: 207 out of 918


0.22549019607843138
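
The same proportion can be obtained without filtering the dataframe, because the mean of a boolean comparison equals the share of `True` values. This is a sketch of that alternative, not the notebook's own function; `empty_string_share` is a hypothetical name and the data is made up. Missing values (`NaN`) do not compare equal to `""`, so they are not counted as empty here.

```python
import pandas as pd


def empty_string_share(df, col):
    """Share of rows whose value in `col` is exactly the empty string."""
    return float((df[col] == "").mean())


toy = pd.DataFrame({"selftext": ["", "text", "", "more text"]})
print(empty_string_share(toy, "selftext"))
```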

## Save the dataframes into their respective csv files

In [19]:
df1_nodup.to_csv('./pytorch_posts.csv')
df2_nodup.to_csv('./tf_posts.csv')
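
When these files are read back, two details are worth keeping in mind: `to_csv` writes the dataframe index as an unnamed first column by default, and `pd.read_csv` turns empty fields into `NaN` unless told otherwise. The round trip below is a sketch on a toy dataframe and uses an in-memory buffer instead of the real CSV files.

```python
import io

import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "selftext": ["x", ""]}, index=[0, 2])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index; keep_default_na=False keeps "" as an empty string
restored = pd.read_csv(buffer, index_col=0, keep_default_na=False)
print(restored)
```

Whether to keep empty strings or let them become `NaN` is a design choice for the later cleaning step; the sketch only shows how to preserve them.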