# Project 3: Web APIs & NLP
### Notebook 1: Introduction and Scraping

### Introduction & Problem Statement
As a Redditor and data aficionado, I want to train a classifier that tells two sub-reddits apart, assessed using the evaluation metrics described below:
- `r/jobs`, which focuses more on immediate issues and getting a job, and 
- `r/careerguidance`, which focuses more on longer-term (career) decisions.

#### Context
Administrators and moderators of these sub-reddits have partnered to develop a feature that suggests to users which subreddit to post their content in. The feature will simply take the content of a post (`selftext`) and evaluate where the post belongs.

Both sub-reddits are actively moderated, which results in post removals. The moderation policy is less clearly written out for the `r/careerguidance` subreddit.
- [Moderation Policy for r/jobs](https://www.reddit.com/r/jobs/wiki/policy)
- [Modpost for r/careerguidance](https://www.reddit.com/r/careerguidance/comments/fwdma6/new_post_requirement_please_use_location_flairs/)

Caveat: Context is fictional. :)

#### Evaluation & metrics
The cost of misclassification is that the feature being developed would misplace the post, meaning more work for moderators. Since this is a classification problem, we look not just at accuracy but also at precision and recall. As there is no distinct reason to prioritize either precision or recall, we will focus on the F1 score, which is the harmonic mean of the two.
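
To make the metric choice concrete, here is a minimal sketch of how precision, recall and F1 follow from confusion-matrix counts. The function name and the counts are hypothetical illustrations, not results from this project. In practice a library routine (for example scikit-learn's `f1_score`) would likely be used instead.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, treating r/jobs as the positive class
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f'precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}')
# prints: precision=0.800, recall=0.889, F1=0.842
```

Because F1 penalizes a large gap between precision and recall, it suits a setting like this one where neither type of misplacement is clearly worse than the other.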

### Methodology
Scrape **posts** using Pushshift's API, then use NLP to tell apart posts from two career-related sub-reddits: 
- [r/jobs](https://www.reddit.com/r/jobs/) and
- [r/careerguidance](https://www.reddit.com/r/careerguidance/).

### Notebook organization:
1. Introduction and scraping (**CURRENT**: Generate raw datasets)
2. [Data Cleaning and Exploratory Data Analysis (EDA)](./P03_02_data_cleaning_and_eda.ipynb) (Generate cleaned datasets)
3. [Pre-processing, modelling, assessments and conclusions](./P03_03_modelling_and_conclusions.ipynb) (Pre-processing, model, evaluate and summarize conclusions & recommendations)

### Sub-reddit scraping
This notebook focuses on data collection by scraping posts from the two sub-reddits.

It will then generate relevant `.csv` files for use in later notebooks as raw datasets.

#### On the API used
Related reference (Pushshift API parameters): https://pushshift.io/api-parameters/
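
As a rough illustration of how the parameters used in this notebook (`subreddit`, `size` and `before`) end up in a request, the sketch below builds the query URL without sending it. The helper name `build_query` is hypothetical, and whether the endpoint still accepts these parameters unchanged is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_query(subreddit, size=100, before=None):
    """Return the full request URL for one batch of submissions."""
    params = {'subreddit': subreddit, 'size': size}
    if before is not None:
        # only request posts created before this epoch timestamp
        params['before'] = before
    return f'{BASE_URL}?{urlencode(params)}'

print(build_query('jobs'))
print(build_query('jobs', before=1642527361))
```

The first URL requests the most recent batch; the second asks only for posts older than the given `created_utc` value.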

#### Imports

In [1]:
import requests
import time
import pandas as pd

#### Scraping via Reddit Pushshift API

In [2]:
# Define subreddits to scrape
subreddit_1 = 'jobs'
subreddit_2 = 'careerguidance'

In [3]:
# Scraping function for a subreddit
# 2 batches (of up to 100 posts each) are scraped by default
def scrape(subreddit, filepath, batches=2):
    # initialize
    url = 'https://api.pushshift.io/reddit/search/submission/'
    posts = []
    # base API parameters
    params = {
        'subreddit': subreddit,
        'size': 100
    }
    last_time_stamp = None
    # iteratively scrape, paging backwards in time
    for i in range(batches):
        if i != 0:
            # only fetch posts older than the last one already collected
            params['before'] = last_time_stamp
        res = requests.get(url, params=params)
        print(f'{subreddit} run {i+1}, status code={res.status_code}; time={last_time_stamp}')
        # stop on an HTTP error instead of failing later on a bad payload
        res.raise_for_status()
        batch = res.json()['data']
        if not batch:
            # no older posts left to fetch
            break
        last_time_stamp = batch[-1]['created_utc']
        posts.extend(batch)
    # keep only the columns needed downstream and save as the raw dataset
    pd.DataFrame(posts)[['title','subreddit','selftext']].to_csv(filepath)
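
The paging logic above can be tried offline. The sketch below is a simplified stand-in, not the notebook's actual scraper: `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical names, and the fake API is assumed to return posts newest-first and to treat `before` as a strict cutoff.

```python
def paginate(fetch_batch, batches):
    """Collect posts newest-first, moving the `before` cursor back each batch."""
    posts, before = [], None
    for _ in range(batches):
        batch = fetch_batch(before)
        if not batch:
            # nothing older left to fetch
            break
        posts.extend(batch)
        # the last post in a batch is the oldest one seen so far
        before = batch[-1]['created_utc']
    return posts

# Hypothetical in-memory stand-in for the API: five posts, newest first
FAKE_POSTS = [{'created_utc': t} for t in (50, 40, 30, 20, 10)]

def fake_fetch(before, size=2):
    """Return up to `size` posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

collected = paginate(fake_fetch, batches=4)
print([p['created_utc'] for p in collected])
# prints: [50, 40, 30, 20, 10]
```

One caveat of a strict cutoff: posts that share the same `created_utc` as the last post of a batch could be skipped at the batch boundary.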

In [4]:
%%time

# scrape first subreddit
scrape(subreddit_1,
       filepath='./data/subreddit_1_raw.csv',
       batches=25)
# scrape second subreddit
scrape(subreddit_2,
       filepath='./data/subreddit_2_raw.csv',
       batches=25)

jobs run 1, status code=200; time=None
jobs run 2, status code=200; time=1642527361
jobs run 3, status code=200; time=1642486680
jobs run 4, status code=200; time=1642454745
jobs run 5, status code=200; time=1642408353
jobs run 6, status code=200; time=1642359611
jobs run 7, status code=200; time=1642298081
jobs run 8, status code=200; time=1642262178
jobs run 9, status code=200; time=1642203542
jobs run 10, status code=200; time=1642172473
jobs run 11, status code=200; time=1642129249
jobs run 12, status code=200; time=1642102930
jobs run 13, status code=200; time=1642068485
jobs run 14, status code=200; time=1642026591
jobs run 15, status code=200; time=1642006517
jobs run 16, status code=200; time=1641964815
jobs run 17, status code=200; time=1641940698
jobs run 18, status code=200; time=1641914472
jobs run 19, status code=200; time=1641869817
jobs run 20, status code=200; time=1641844563
jobs run 21, status code=200; time=1641803802
jobs run 22, status code=200; time=1641756882
job