# Project 3 Part 1: Webscraping Reddit

This workbook is part 1 of 3 in building a Natural Language Processing (NLP) classification model. This part focuses on gathering the necessary data from Reddit for further processing.

In [None]:
#Import libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import datetime

## Problem Statement

The use of chatbots and other bots is increasingly prevalent. Corporations and government bodies alike use these bots to handle repetitive tasks. Some chatbots rely on scripted, automated responses, while others use Machine Learning to produce responses that are closer to human interaction. To make the interaction more immersive, there has been a trend towards Machine Learning models for such chatbots.

With this growing trend, legal implications for organisations may arise, since there have been several instances of such chatbots misbehaving and giving harmful replies, ranging from defamation to telling people to kill their parents. Machine Learning models may also inadvertently dish out legal advice, which could make the company that owns the chatbot liable if the advice does not work out.

In light of this trend, it may be beneficial for a Machine Learning model to classify whether or not a query is asking for legal advice. When a legal query is received, the response can be to direct it to the relevant authorities, to add a caveat that protects the owner of the bot before answering, or simply not to respond to that type of query.

I have opted to use `r/NoStupidQuestions` to mimic real-life questions, since users from all walks of life ask any questions they want there, and `r/legaladvice` to mimic real-life legal queries and questions that have legal implications. This project will look at how successfully I can classify posts from the two subreddits.

It is interesting to note that `r/NoStupidQuestions` sets no boundaries on its questions, since all questions are welcome. The result is a melting pot that mimics both the borderline nonsensical questions that many people ask chatbots for fun and the serious, insightful ones.

## Reddit Pushshift API scraping
I wrote a while loop to extract the important fields from each subreddit, collecting posts until I have at least 12,000 from each of the subreddits of choice. To stay within the Pushshift API's rate limit of 60 requests per minute, the code sleeps for 2 seconds at the end of each loop iteration.
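
The paging logic described above can be sketched independently of the network call. This is a minimal illustration, not the notebook's actual code: `collect_posts`, `fake_fetch` and `fake_posts` are hypothetical names, and the fetcher is a stand-in that returns in-memory records instead of calling Pushshift. It assumes each record carries `created_utc` and `selftext`, as the fields used in this notebook do.

```python
import time
from typing import Callable, Dict, List

# A page fetcher takes a `before` timestamp and returns a list of post dicts.
PageFetcher = Callable[[int], List[Dict]]


def collect_posts(fetch_page: PageFetcher, before: int, target: int,
                  pause: float = 2.0) -> List[Dict]:
    """Page backwards in time until at least `target` usable posts are kept."""
    kept: List[Dict] = []
    while len(kept) < target:
        page = fetch_page(before)
        if not page:  # nothing older is available, so stop
            break
        # Advance the cutoff using the whole page, before any filtering.
        before = min(post['created_utc'] for post in page)
        # Keep only posts whose body is neither empty nor '[removed]'.
        kept.extend(p for p in page
                    if p.get('selftext', '') not in ('', '[removed]'))
        time.sleep(pause)  # pause between requests to respect the rate limit
    return kept


# Hypothetical in-memory data: 20 posts, every third one has an empty body.
fake_posts = [{'created_utc': t, 'selftext': '' if t % 3 == 0 else f'post {t}'}
              for t in range(20, 0, -1)]


def fake_fetch(before: int) -> List[Dict]:
    """Return up to 5 posts older than `before`, newest first."""
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:5]


sample = collect_posts(fake_fetch, before=21, target=8, pause=0)
print(len(sample))  # 10
```

The result can be passed to `pd.DataFrame(sample)` to get a dataframe. Because whole pages are added at a time, the final count can overshoot the target, which is why the loop yields at least, rather than exactly, the requested number of posts.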

## Code for extracting `r/NoStupidQuestions` submissions

In [None]:
# Only fetch submissions created before this Unix timestamp (UTC, in seconds)
date = 1647752175

# Fields to keep from each Pushshift submission
columns = ['author',
           'is_self',
           'is_video',
           'num_comments',
           'permalink',
           'score',
           'selftext',
           'subreddit',
           'title',
           'upvote_ratio',
           'created_utc'
          ]

# Instantiate main df for nostupidquestions
df_nostupidquestions = pd.DataFrame(columns = columns)

while df_nostupidquestions.shape[0]<12000:
    url = 'https://api.pushshift.io/reddit/search/submission'
    params ={
        'subreddit': 'NoStupidQuestions',
        'before': date,
        'size': 100
    }
    # request one page of submissions; stop if nothing older is left
    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()['data']
    if not data:
        break

    # create temporary df with only the fields of interest
    data_temp = pd.DataFrame(data)[columns]

    # move the cutoff to the oldest post on this page before filtering,
    # so the search still advances when every post on the page is dropped
    date = int(data_temp['created_utc'].min())

    # create list of index labels where selftext is '[removed]' or ''
    index_df_nostupidquestions = list(data_temp[(data_temp['selftext'] == '[removed]') | (data_temp['selftext'] == '')].index)

    # drop all rows in the index list
    data_temp = data_temp.drop(index = index_df_nostupidquestions)

    # concat with main df
    df_nostupidquestions = pd.concat([df_nostupidquestions,data_temp])

    # print to check on shape
    print(df_nostupidquestions.shape)

    # sleep 2 sec between requests to stay within the API rate limit
    time.sleep(2)

df_nostupidquestions
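
The `date` value that seeds the loop is a Unix timestamp in seconds, the same unit as each post's `created_utc`. As a small sketch using only the standard library `datetime` module (the variable names `start`, `as_utc` and `cutoff` are illustrative), the starting value can be converted to a readable UTC date and back:

```python
import datetime

# The starting `before` value used in this notebook
start = 1647752175

# Unix timestamp -> timezone-aware UTC datetime
as_utc = datetime.datetime.fromtimestamp(start, tz=datetime.timezone.utc)
print(as_utc.isoformat())  # 2022-03-20T04:56:15+00:00

# UTC datetime -> Unix timestamp, e.g. to choose a different starting point
cutoff = datetime.datetime(2022, 3, 20, 4, 56, 15, tzinfo=datetime.timezone.utc)
print(int(cutoff.timestamp()))  # 1647752175
```

So both scrapes start from 20 March 2022 and work backwards in time.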

## Code for extracting `r/legaladvice` submissions

In [None]:
# Only fetch submissions created before this Unix timestamp (UTC, in seconds)
date = 1647752175

# Fields to keep from each Pushshift submission
columns = ['author',
           'is_self',
           'is_video',
           'num_comments',
           'permalink',
           'score',
           'selftext',
           'subreddit',
           'title',
           'upvote_ratio',
           'created_utc'
          ]

# Instantiate main df for legaladvice
df_legaladvice = pd.DataFrame(columns = columns)

while df_legaladvice.shape[0]<12000:
    url = 'https://api.pushshift.io/reddit/search/submission'
    params ={
        'subreddit': 'legaladvice',
        'before': date,
        'size': 100
    }
    # request one page of submissions; stop if nothing older is left
    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()['data']
    if not data:
        break

    # create temporary df with only the fields of interest
    data_temp = pd.DataFrame(data)[columns]

    # move the cutoff to the oldest post on this page before filtering,
    # so the search still advances when every post on the page is dropped
    date = int(data_temp['created_utc'].min())

    # create list of index labels where selftext is '[removed]' or ''
    index_df_legaladvice = list(data_temp[(data_temp['selftext'] == '[removed]') | (data_temp['selftext'] == '')].index)

    # drop all rows in the index list
    data_temp = data_temp.drop(index = index_df_legaladvice)

    # concat with main df
    df_legaladvice = pd.concat([df_legaladvice,data_temp])

    # print to check on shape
    print(df_legaladvice.shape)

    # sleep 2 sec between requests to stay within the API rate limit
    time.sleep(2)

df_legaladvice
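
Both loops drop posts whose `selftext` is `'[removed]'` or empty by building a list of index labels and passing it to `drop`. The same filter can be written as a single boolean mask. The sketch below uses a small hypothetical dataframe (`page`) in place of a real page of API results:

```python
import pandas as pd

# Hypothetical sample rows standing in for one page of API results
page = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['Some text', '[removed]', '', 'More text'],
})

# True for rows whose body was removed or left empty
unusable = page['selftext'].isin(['[removed]', ''])

# Keep everything else
cleaned = page[~unusable]

print(cleaned['title'].tolist())  # ['a', 'd']
```

The result is the same as the drop-by-index approach. The mask version just avoids the intermediate list.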

## Save dataframes to CSV

In [None]:
df_nostupidquestions.to_csv('../datasets/nostupid.csv')

In [None]:
df_legaladvice.to_csv('../datasets/legaladvice.csv')
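
One thing to be aware of when reading these files back: each page is concatenated with its own 0-based index, so the combined dataframes contain repeated index labels, and `to_csv` writes that index as an unnamed first column by default. The sketch below, using two small hypothetical pages and an in-memory buffer instead of a file, shows one way to avoid carrying that column along, if it is not needed in the later parts:

```python
import io

import pandas as pd

# Two hypothetical pages, each with its own 0-based index
page_1 = pd.DataFrame({'title': ['a', 'b'], 'created_utc': [30, 20]})
page_2 = pd.DataFrame({'title': ['c', 'd'], 'created_utc': [10, 5]})

combined = pd.concat([page_1, page_2])
print(combined.index.tolist())  # [0, 1, 0, 1]

# Renumber the rows and leave the index out of the CSV
buffer = io.StringIO()
combined.reset_index(drop=True).to_csv(buffer, index=False)

# Read it back to confirm only the data columns were written
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())  # ['title', 'created_utc']
print(restored.shape)  # (4, 2)
```

Alternatively, the files can be saved as they are and the extra column handled on load, for example with `pd.read_csv(path, index_col=0)`.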