<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Using Natural Language Processing (NLP) Modelling to Predict Desktop CPU Brand Popularity

# Part 2 - Data Scraping

### Contents:
- [Data Scraping](#Data-Scraping)

In [1]:
#Library imports
import requests
import pandas as pd
import time

---

## Data Scraping

The Pushshift API (https://api.pushshift.io/) will be used to obtain at least 10,000 unique text posts from each subreddit. The posts will then be exported to .csv files for further data cleaning.

The Pushshift API has two quirks: 1) it returns at most 100 posts per API request, and 2) it has a limit of 60 requests per minute.

These quirks were overcome by 1) looping the request (and updating the ```'before'``` parameter with each iteration) until 10,000 posts were collected, and 2) adding ```time.sleep(1)``` between requests in order to keep the function within the rate limit.
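The looping idea can be tried without touching the network. The block below is a minimal sketch, not the function used in this notebook: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names introduced for illustration, and the fake fetcher only imitates an API that returns up to 100 posts older than a given timestamp, newest first. Unlike the real function, the sketch truncates the result to exactly `target` posts.

```python
import time
from typing import Callable, Dict, List, Optional


def paginate_before(
    fetch_page: Callable[[Optional[int]], List[Dict]],
    target: int,
    pause_s: float = 1.0,
) -> List[Dict]:
    """Collect up to `target` posts by walking backwards in time.

    `fetch_page(before)` is a hypothetical callable that returns up to 100
    posts older than `before` (newest first), each with a 'created_utc' key.
    """
    posts: List[Dict] = []
    before: Optional[int] = None  # None means "start from the newest post"
    while len(posts) < target:
        page = fetch_page(before)
        if not page:  # no older posts left, so stop instead of looping forever
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # oldest timestamp on this page
        time.sleep(pause_s)  # a 1 s pause caps the loop at 60 requests/minute
    return posts[:target]


# Stand-in for the API: 250 fake posts with timestamps 250, 249, ..., 1.
FAKE_POSTS = [{"id": i, "created_utc": i} for i in range(250, 0, -1)]


def fake_fetch(before: Optional[int]) -> List[Dict]:
    """Return the 100 newest fake posts that are older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:100]


# pause_s=0 only because nothing is being rate limited in this offline demo.
result = paginate_before(fake_fetch, target=230, pause_s=0)
```

Passing the fetcher in as an argument keeps the pagination logic separate from the HTTP call, which makes the stopping conditions (target reached, or an empty page) easy to check by hand.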

In [2]:
def pushshift_10k(subreddit_str):
    """Returns at least 10_000 unique reddit posts with text using PushShift API"""
    #setting the url
    url = 'https://api.pushshift.io/reddit/search/submission'
    #initializing the base parameters
    params = {
        'subreddit': subreddit_str,
        'size': 100,
        'fields': ['author','id','subreddit','title','selftext','created_utc'],
    }

    #instantiating an empty DataFrame
    df = pd.DataFrame()

    #looping until the number of unique posts with text has reached 10_000
    while df.shape[0] < 10_000:
        req = requests.get(url, params=params)
        if req.status_code == 200:
            df_json = req.json()
            df_new = pd.DataFrame(df_json['data'])
            if df_new.empty: #stopping when the API has no older posts left
                break
            df_time = df_new.iloc[-1]['created_utc'] #retrieving the last post's timestamp
            indexs = df_new[
                (df_new['selftext'] == '') |             #removing blank posts
                (df_new['selftext'] == '[removed]') |    #removing removed posts
                (df_new['selftext'] == '[deleted]') |    #removing deleted posts
                (df_new['selftext'].isnull())            #removing NaN posts
            ].index
            df_new.drop(indexs, axis = 0, inplace = True)
            df = pd.concat([df, df_new], axis = 0).reset_index(drop=True)
            df.drop_duplicates(subset=['selftext', 'title'], keep='last', inplace=True) #removing duplicated posts
            time.sleep(1) #adding a 1 sec pause in between loops
        else:
            #printing a message and returning the posts collected so far if PushShift API returns an error status code
            print('Unsuccessful')
            return df
        params['before'] = df_time #adding the time into the 'before' parameter to pull the next 100 posts
    return df
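
The filtering step inside the loop can be checked in isolation on a small hand-made DataFrame. The block below is a sketch of an equivalent filter written with `isin`, not the code the function runs: `keep_posts_with_text`, `PLACEHOLDERS` and the sample rows are hypothetical and exist only for illustration.

```python
import pandas as pd

# Values of 'selftext' that are treated as "no text" (same ones the loop drops).
PLACEHOLDERS = ["", "[removed]", "[deleted]"]


def keep_posts_with_text(posts: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose 'selftext' holds real text."""
    no_text = posts["selftext"].isna() | posts["selftext"].isin(PLACEHOLDERS)
    return posts[~no_text].reset_index(drop=True)


# Made-up rows: one per case the filter should drop, plus one real post.
sample = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d", "e"],
        "selftext": ["", "[removed]", "[deleted]", None, "Which CPU should I buy?"],
    }
)

kept = keep_posts_with_text(sample)
```

Only the last row survives here, since the other four are blank, removed, deleted, or missing.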

In [2]:
#Please uncomment if you intend to run
# amd_df = pushshift_10k('amd')
# amd_df.to_csv('amd_csv.csv', index=False)

In [3]:
#Please uncomment if you intend to run
#intel_df = pushshift_10k('intel')
#intel_df.to_csv('intel_csv.csv', index=False)

In [4]:
#Please uncomment if you intend to run
#buildapc_df = pushshift_10k('buildapc')
#buildapc_df.to_csv('buildapc_csv.csv', index=False)

---