# Project 3: Web APIs & NLP
# Notebook 1: Data Pulling

https://github.com/pushshift/api<br>
https://api.pushshift.io/reddit/search/comment/

## Contents
- [Goal](#Goal)
- [Import Libraries](#Import-Libraries)
- [Import Comments from Subreddits and Save Relevant Data](#Import-Comments-from-Subreddits-and-Save-Relevant-Data)

## Goal
I’m a data scientist at a data science startup that has been hired by Reddit. We are using the latest 6,500 comments from each of the r/conservative (right-wing) and r/politics (left-wing) subreddits to train and validate our model. We will then test the model on 50 comments from r/libertarian to see whether it successfully predicts if each comment is right-leaning or left-leaning. Finally, we will calculate the average political leaning of those r/libertarian commenters by username, and Reddit will offer Republican and Democratic organizations the opportunity to target ads at the conservative-leaning and liberal-leaning libertarians, respectively, to try to win them over before the election.

## Import Libraries

In [39]:
import requests
import pandas as pd
import time

pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 100)

## Import Comments from Subreddits and Save Relevant Data

**Based on reviewing Reddit comments, the comments in r/politics tend to be liberal, the comments in r/conservative tend to be conservative, and the comments in r/libertarian are sometimes liberal-leaning and sometimes conservative-leaning.**

In [40]:
url = "https://api.pushshift.io/reddit/search/comment/"
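To see how the query parameters used in the next cell end up in the request, the short sketch below builds a request without sending it. It only relies on `requests.Request(...).prepare()`. The parameter values mirror the ones in this notebook, and the variable names `params` and `prepared` are illustrative.

```python
import requests

url = "https://api.pushshift.io/reddit/search/comment/"

# Same shape as the params dictionaries used in the pulling loop.
params = {
    "subreddit": "politics",
    "size": 500,
    "before": None,  # requests leaves out parameters whose value is None
}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
```

Because `before` is `None` on the first pass, the first request asks for the most recent comments. Later passes fill it with a `created_utc` timestamp to page backwards.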

In [41]:
counter = 0
posts_p_before = None
posts_c_before = None
posts_l_before = None
df_p = pd.DataFrame([])
df_c = pd.DataFrame([])
df_l = pd.DataFrame([])

while counter <= 6000:

    params_p = {
        "subreddit": "politics",
        "size": 500,
        "before": posts_p_before
    }
    req_p = requests.get(url, params_p)
    data_p = req_p.json()
    posts_p = data_p["data"]
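    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_p = pd.concat([df_p, pd.DataFrame(posts_p)], ignore_index=True)
    # use the last comment returned, even if fewer than 500 came back
    posts_p_before = posts_p[-1]["created_utc"]
    time.sleep(.1) # don't overload the Pushshift API server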

    params_c = {
        "subreddit": "conservative",
        "size": 500,
        "before": posts_c_before
    }
    req_c = requests.get(url, params_c)
    data_c = req_c.json()
    posts_c = data_c["data"]
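    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_c = pd.concat([df_c, pd.DataFrame(posts_c)], ignore_index=True)
    # use the last comment returned, even if fewer than 500 came back
    posts_c_before = posts_c[-1]["created_utc"]
    time.sleep(.1) # don't overload the Pushshift API server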

    params_l = {
        "subreddit": "libertarian",
        "size": 500,
        "before": posts_l_before
    }
    req_l = requests.get(url, params_l)
    data_l = req_l.json()
    posts_l = data_l["data"]
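    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_l = pd.concat([df_l, pd.DataFrame(posts_l)], ignore_index=True)
    # use the last comment returned, even if fewer than 500 came back
    posts_l_before = posts_l[-1]["created_utc"]
    time.sleep(.1) # don't overload the Pushshift API server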

    counter += 500
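The three near-identical request blocks in the loop above could be folded into one helper. The following is a minimal sketch under stated assumptions, not the code this notebook ran. `fetch_page` and `pull_comments` are hypothetical names. The `fetch` argument is injectable so the paging logic can be tried without network access. The sketch also assumes the API returns comments newest-first, so that the last item's `created_utc` is the right `before` value for the next page.

```python
import time

import pandas as pd
import requests

URL = "https://api.pushshift.io/reddit/search/comment/"


def fetch_page(params):
    """Request one page of comments and return the list under "data"."""
    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["data"]


def pull_comments(subreddit, n_batches=13, size=500, fetch=fetch_page, pause=0.1):
    """Page backwards through a subreddit's comments (hypothetical helper)."""
    frames = []
    before = None  # None means "start from the most recent comments"
    for _ in range(n_batches):
        posts = fetch({"subreddit": subreddit, "size": size, "before": before})
        if not posts:  # stop early if nothing more is returned
            break
        frames.append(pd.DataFrame(posts))
        before = posts[-1]["created_utc"]  # last item, however many came back
        time.sleep(pause)  # pause between requests to go easy on the server
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With the defaults (13 batches of 500), a call such as `pull_comments("politics")` would request up to 6,500 comments, matching the loop above. Collecting the pages in a list and concatenating once at the end also avoids re-copying a growing DataFrame on every pass.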

In [42]:
# keep only the columns needed for modeling
keep_columns = ["subreddit", "subreddit_id", "author", "author_fullname", "author_flair_text", "body"]
df_p = df_p[keep_columns]
df_c = df_c[keep_columns]
df_l = df_l[keep_columns]
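Selecting columns with double brackets raises a `KeyError` if any listed column is absent from the pulled data, which could happen if none of the comments in a pull carry a given field. One hedged alternative is `DataFrame.reindex(columns=...)`, which keeps the listed columns and fills any missing one with `NaN`. The sketch below uses made-up rows rather than real API output, and `KEEP`, `sample`, and `trimmed` are illustrative names.

```python
import pandas as pd

KEEP = ["subreddit", "subreddit_id", "author", "author_fullname",
        "author_flair_text", "body"]

# Made-up rows for illustration; note there is no "author_flair_text" key.
sample = pd.DataFrame([
    {"subreddit": "politics", "subreddit_id": "t5_example", "author": "user_a",
     "author_fullname": "t2_example_a", "body": "first comment", "score": 1},
    {"subreddit": "politics", "subreddit_id": "t5_example", "author": "user_b",
     "author_fullname": "t2_example_b", "body": "second comment", "score": 3},
])

# Keep only the listed columns, in order; absent columns are added as NaN.
trimmed = sample.reindex(columns=KEEP)
print(trimmed)
```

Whether a silent `NaN` column is preferable to an error depends on the downstream notebooks. The strict selection used above has the advantage of surfacing unexpected schema changes immediately.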

In [43]:
df_p.to_csv("../datasets/politics_subreddit_latest_6500.csv", index=False)
df_c.to_csv("../datasets/conservative_subreddit_latest_6500.csv", index=False)
df_l.to_csv("../datasets/libertarian_subreddit_latest_6500.csv", index=False)

## See Notebook 2 for Data Cleaning