# Project 3: Web APIs & NLP, Part 1: Intro and Web Scraping

Revathi Satkuna | DSIR1010 

## Problem Statement

Truth is stranger than fiction. The English department at the university you attend has asked you to demonstrate how data science can be relevant to literature. The ENG105MISC class, Creative Writing in Tech, at the local community college wants to visualize the intersection of English writing and technology. The professor wants to demonstrate how language can be generated by AI and, in turn, how AI can classify different types of text. To help with their learning (your primary motivator... not money!), you decide to develop a classification model that shows the students how your computer, with the help of a coding language like Python, can predict where a text came from. The model will be either a RandomForestClassifier or a LogisticRegression, and it uses posts from r/creativewriting and r/talesfromtechsupport to classify whether a phrase belongs in the Creative Writing subreddit or in the Technical Support subreddit. You will know the model is successful when it classifies posts correctly.

## Data Dictionary


|Feature|Type|Dataset|Description|
|---|---|---|---|
|selftext|object|combo|body text of the subreddit post|
|author|object|combo|author of the subreddit post|
|title|object|combo|title of the subreddit post|
|subreddit|int64|combo|binary label identifying which subreddit the post came from|
|fulltext|object|combo|selftext and title of the post combined|
|tokenized_fulltext|object|combo|tokenized fulltext|
|lemmatized_tokenized_fulltext|object|combo|lemmatized, tokenized fulltext|
|stemmatized_lemmatized_tokenized_fulltext|object|combo|stemmed, lemmatized, tokenized fulltext|
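
As a rough illustration of how the derived columns in the table relate to the raw ones, the sketch below builds `fulltext` and a binary `subreddit` label from a tiny made-up frame. The sample rows, the helper name `add_derived_columns`, the title-then-selftext ordering, and the choice to map `talesfromtechsupport` to 1 are all assumptions for illustration; the notebook itself may encode these differently.

```python
import pandas as pd

def add_derived_columns(df, positive_label="talesfromtechsupport"):
    """Hypothetical helper: return a copy with `fulltext` and a binary `subreddit` column."""
    out = df.copy()
    # Missing selftext (e.g. title-only posts) becomes an empty string before joining.
    out["fulltext"] = (out["title"].fillna("") + " " + out["selftext"].fillna("")).str.strip()
    # 1 for the positive subreddit, 0 otherwise; which subreddit is "1" is an assumption.
    out["subreddit"] = (out["subreddit"] == positive_label).astype("int64")
    return out

# Made-up rows standing in for scraped posts.
sample = pd.DataFrame({
    "title": ["Printer on fire", "A poem"],
    "selftext": ["Have you tried turning it off?", None],
    "author": ["user_a", "user_b"],
    "subreddit": ["talesfromtechsupport", "creativewriting"],
})
combo = add_derived_columns(sample)
```

Keeping the label as `int64` matches the type listed in the data dictionary and lets the column be passed directly to a classifier as the target.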

In [1]:
# imports
import numpy as np
import pandas as pd
import datetime, time
import json
import requests

# Collecting Data and Exporting DataFrames

In [2]:
#Credit to Gwen Rathgeber/Ben Mathis
subreddits = ['talesfromtechsupport', 'creativewriting']
kind = "submission"  # we want text posts

# Establish URL base
BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint"

last_date = datetime.datetime.utcfromtimestamp(time.time())     # current UTC time; used to measure how far back each batch reaches
posts = {}  #empty dictionary
for subreddit in subreddits:
    posts[subreddit] = []
    day = 2                       # start the query window two days back
    cumulative_posts = 0
    while cumulative_posts < 20000:                           # collect 20,000 rows: the minimum needed is 10,000 and some scraped rows will be junk
        stem = f"{BASE_URL}?subreddit={subreddit}&size=100"   # base query; request up to 100 posts per call
        URL = f"{stem}&after={day}d"                           # only posts created after `day` days ago
        print("Querying from: " + URL)
        try:                                                  #we use try, except b/c scraping from the web, you'll get a lot of errors
            res = requests.get(URL)
            assert res.status_code == 200
            data = res.json()['data']                         # named `data` to avoid shadowing the imported json module
            df = pd.DataFrame(data)
            posts[subreddit].append(df)
            cumulative_posts += df.shape[0]
            final_date_pulled = datetime.datetime.utcfromtimestamp(df.iloc[-1, df.columns.get_loc('created_utc')])
            increment = (last_date - final_date_pulled).days + 1
            increment = increment if increment > 0 else 1
            day += increment
            last_date = final_date_pulled
            print('successful')
        except Exception as err:                              # a bare except would also swallow KeyboardInterrupt
            print(f'Scrape for {URL}, {day} failed: {err}')

        time.sleep(2)                    #this is a delay in between scrapes

print("Query complete!")

techsupport_frame = pd.concat(posts['talesfromtechsupport'])
creative_frame = pd.concat(posts['creativewriting'])

techsupport_frame.to_csv('data2/raw_techsupport_initial_scrape.csv')
creative_frame.to_csv('data2/raw_creative_initial_scrape.csv')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=2d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=3d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=4d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=5d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=6d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=7d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=8d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=talesfromtechsupport&size=100&after=9d
successful
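
Because each request asks for up to 100 posts relative to a moving `after` window, consecutive responses may contain some of the same posts. Below is a minimal sketch of one way to combine the per-request frames and drop repeats before exporting. It assumes each row carries a unique `id` field; the helper name `combine_batches` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def combine_batches(batches, key="id"):
    """Hypothetical helper: concatenate per-request frames and keep the first copy of each post."""
    frames = [b for b in batches if not b.empty]
    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True)
    # Rows sharing the same key are treated as the same post (assumption: `id` is unique per post).
    return combined.drop_duplicates(subset=key).reset_index(drop=True)

# Two made-up batches that overlap on post "a2".
batch_1 = pd.DataFrame({"id": ["a1", "a2"], "title": ["first", "second"]})
batch_2 = pd.DataFrame({"id": ["a2", "a3"], "title": ["second", "third"]})
deduped = combine_batches([batch_1, batch_2])
```

Comparing `len(deduped)` with `cumulative_posts` would give a quick sense of how many of the scraped rows are actually distinct.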
