<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 55px"> 

# Project 3: Web APIs & NLP
## *01 - Scraping*

## **Background**

General Assembly is feeling the heat as more and more coding bootcamps have popped up over the years. As an industry leader, the organization would like to keep its position, and the marketing team is therefore interested in streamlining its digital advertising efforts and making them more effective. It has roped in the data team to provide some insights.

The starting point of most digital advertising strategies is finding the right keywords. While keywords such as 'bootcamps' and 'coding' are immediately apparent, being able to identify less obvious ones is important as well. Keywords that help to separate similar groups of online personas into potential leads and less likely ones are especially valuable.

Taking this into account, along with the rise of social media (see [ourworldindata](https://ourworldindata.org/rise-of-social-media)) and its place in the advertising space, your team has decided to gather data from Reddit because of its uniqueness: it is a social media platform built in a forum style. Reddit contains a large number of subreddits, which are essentially communities within the platform. According to Reddit itself, 'there's a community for everybody'.

Each subreddit contains posts that are relevant to its topic. These features make Reddit a trove of social media-like text posts and therefore an ideal scraping candidate. 

## **Problem Statement**

This project aims to build a model with >90% accuracy that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Contents:
- [Scraping](#Let-the-Scrape-Begin)
- [r/codingbootcamp](#r/codingbootcamp)
- [r/csMajors](#r/csMajors)

## Import Libraries

In [1]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

from pprint import pp
import time
from datetime import datetime

## Let the Scrape Begin
Using the PushShift API, we will scrape posts from two subreddits: r/codingbootcamp and r/csMajors. To avoid getting banned for overusing PushShift, a time delay is added between requests, each of which returns at most 100 posts (the API's limit). The creation time of the posts is used to automate the process: each request asks for the batch of up to 100 posts created before the earliest post retrieved so far.
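The backwards-paging idea can be illustrated without touching the network. The sketch below is a minimal, offline illustration: `fake_fetch` and `FAKE_POSTS` are hypothetical stand-ins for the API, and the sketch assumes that the `before` cut-off is exclusive (only posts strictly older than the cursor are returned).

```python
import time

def paginate_backwards(fetch, start_before, max_batches, delay_s=0.0):
    """Collect posts batch by batch, moving the time cursor back each round."""
    posts = []
    before = start_before
    for _ in range(max_batches):
        batch = fetch(before)
        if not batch:  # nothing older is left
            break
        posts.extend(batch)
        # the earliest post in this batch becomes the next cut-off
        before = min(p["created_utc"] for p in batch)
        time.sleep(delay_s)  # pause between requests
    return posts

# Hypothetical in-memory "API": 250 posts with increasing timestamps.
FAKE_POSTS = [{"id": i, "created_utc": 1000 + i} for i in range(250)]

def fake_fetch(before, size=100):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]

result = paginate_backwards(fake_fetch, start_before=2000, max_batches=5)
print(len(result))
```

With a real API, `delay_s` would be set to a few seconds, as in the scraping function used in this notebook.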

### Preliminary Look at the Data

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
# URL and scraping parameters
url = 'https://api.pushshift.io/reddit/search/submission/'
params = {
    'subreddit':'codingbootcamp', 
    'size':'100'
    }

In [4]:
# check status of request
req = requests.get(url, params)
req.status_code

200

In [5]:
bootcamp_raw = req.json()
bootcamp_raw = bootcamp_raw['data']

In [6]:
# taking a look at what data is pulled
df_bootcamp_raw = pd.DataFrame(bootcamp_raw)
print(df_bootcamp_raw.shape)
df_bootcamp_raw.head()

(100, 72)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,thumbnail_height,thumbnail_width,url_overridden_by_dest,crosspost_parent,crosspost_parent_list,media,media_embed,secure_media,secure_media_embed,is_gallery
0,[],False,nazthetech,,[],,text,t2_9fquu,False,False,False,[],False,False,1669244663,self.codingbootcamp,https://www.reddit.com/r/codingbootcamp/commen...,{},z33ggw,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/codingbootcamp/comments/z33ggw/anyone_from_...,False,6,1669244673,1,"Hello all, I'm a human biology graduate. I'm n...",True,False,False,codingbootcamp,t5_372sz,22098,public,self,Anyone from Toronto have advice on bootcamp ch...,0,[],1.0,https://www.reddit.com/r/codingbootcamp/commen...,all_ads,6,,,,,,,,,,,,,
1,[],False,alejandracapurro,,[],,text,t2_f5pz1osc,False,False,False,[],False,False,1669244075,self.codingbootcamp,https://www.reddit.com/r/codingbootcamp/commen...,{},z337u7,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/codingbootcamp/comments/z337u7/ironhack_vs_...,False,6,1669244085,1,Helloo!! I‘m looking for bootcamps and the bes...,True,False,False,codingbootcamp,t5_372sz,22098,public,self,IronHack vs Le Wagon,0,[],1.0,https://www.reddit.com/r/codingbootcamp/commen...,all_ads,6,,,,,,,,,,,,,
2,[],False,Toastieez,,[],,text,t2_7nhu6wnt,False,False,False,[],False,False,1669243416,self.codingbootcamp,https://www.reddit.com/r/codingbootcamp/commen...,{},z32yc3,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/codingbootcamp/comments/z32yc3/is_an_associ...,False,6,1669243426,1,So I’ve been accepted into a local college tha...,True,False,False,codingbootcamp,t5_372sz,22096,public,self,Is an associates degree enough?,0,[],1.0,https://www.reddit.com/r/codingbootcamp/commen...,all_ads,6,,,,,,,,,,,,,
3,[],False,Verblewd,,[],,text,t2_uigyvhx2,False,False,False,[],False,False,1669238234,self.codingbootcamp,https://www.reddit.com/r/codingbootcamp/commen...,{},z30tw2,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/codingbootcamp/comments/z30tw2/is_freecodin...,False,6,1669238244,1,I mean while it does sound too good to be true...,True,False,False,codingbootcamp,t5_372sz,22096,public,self,Is freecodingbootcamp.org legit?,0,[],1.0,https://www.reddit.com/r/codingbootcamp/commen...,all_ads,6,,,,,,,,,,,,,
4,[],False,joshuamenko,,[],,text,t2_pe4kd,False,False,False,[],False,False,1669234613,self.codingbootcamp,https://www.reddit.com/r/codingbootcamp/commen...,{},z2zbt9,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/codingbootcamp/comments/z2zbt9/artist_to_co...,False,6,1669234624,1,"Hey guys, so I been holding onto the idea of b...",True,False,False,codingbootcamp,t5_372sz,22094,public,self,"Artist to Coder, What are my options?",0,[],1.0,https://www.reddit.com/r/codingbootcamp/commen...,all_ads,6,,,,,,,,,,,,,


### Request Function for SubReddit Posts
Scraping started on the 24th of November 2022, and this timestamp is used to initiate the scraping function. We extract posts that were submitted before this point and work backwards, up to 100 posts at a time, until roughly 5,000 posts per subreddit are obtained.

In [7]:
# start datetime of post scraping for this project (2022, 11, 24, 20, 58, 10, 336811)
# note: a naive datetime is converted using the machine's local time zone
datetime(2022, 11, 24, 20, 58, 10, 336811).timestamp()

1669294690.336811
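Because a naive `datetime` is interpreted in the local time zone, the same cell can print a different number on another machine. The printed value above matches a clock set to UTC+8; that offset is inferred from the output rather than stated in the notebook. A minimal sketch that pins the offset explicitly, under that assumption:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the notebook was run on a machine set to UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))

start = datetime(2022, 11, 24, 20, 58, 10, 336811, tzinfo=UTC_PLUS_8)
start_ts = int(start.timestamp())  # whole seconds, as used by the scraper
print(start_ts)
```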

In [8]:
# function to request posts from the PushShift API, pausing 3s between requests

def req_subreddit(subreddit, iterations):
    
    list_dfs = [] # stores one DataFrame per batch
    # starting time of your choice (Unix timestamp from the cell above)
    current_time = 1669294690
    
    # scrape up to 100 posts per iteration
    for i in range(iterations):
        res = requests.get(
            url,
            params={
                'subreddit': subreddit,
                'size': 100,
                'before': current_time
                }
            )
        res.raise_for_status() # stop early on an HTTP error
        # 3s delay to prevent lockout
        time.sleep(3)
        df = pd.DataFrame(res.json()['data'])
        if df.empty: # no older posts left to scrape
            break
        df = df[['subreddit', 'title', 'created_utc', 'selftext']]
        list_dfs.append(df)
        
        # use time of earliest post per iteration to continue scraping further back
        current_time = int(df['created_utc'].min())
        
    scraped_num = sum(len(x) for x in list_dfs)
    print(f"{scraped_num} posts scraped!")
    df_final = pd.concat(list_dfs, axis=0) # combine all posts into one df
    df_final = df_final.reset_index(drop=True) # reset indexes
    return df_final

### r/codingbootcamp

In [9]:
df_bootcamp = req_subreddit('codingbootcamp', 50)

4990 posts scraped!


In [10]:
print(df_bootcamp.shape)
df_bootcamp

(4990, 4)


| | subreddit | title | created_utc | selftext |
|---|---|---|---|---|
| 0 | codingbootcamp | Anyone from Toronto have advice on bootcamp ch... | 1669244663 | Hello all, I'm a human biology graduate. I'm n... |
| 1 | codingbootcamp | IronHack vs Le Wagon | 1669244075 | Helloo!! I‘m looking for bootcamps and the bes... |
| 2 | codingbootcamp | Is an associates degree enough? | 1669243416 | So I’ve been accepted into a local college tha... |
| 3 | codingbootcamp | Is freecodingbootcamp.org legit? | 1669238234 | I mean while it does sound too good to be true... |
| 4 | codingbootcamp | Artist to Coder, What are my options? | 1669234613 | Hey guys, so I been holding onto the idea of b... |
| ... | ... | ... | ... | ... |
| 4985 | codingbootcamp | my math skills are equal to algebra from 11 ye... | 1525831870 | To put it simply.. I really have my mind set o... |
| 4986 | codingbootcamp | Is a coding bootcamp right for me? | 1525693933 | |
| 4987 | codingbootcamp | Anyone have any experience with Actualize (Any... | 1524936502 | I am in Chicago and they are the only ones who... |
| 4988 | codingbootcamp | The Unwritten Guide To Your Hack Reactor Inter... | 1523829388 | |


In [11]:
# check for earliest and latest scraped post datetime
print(datetime.utcfromtimestamp(df_bootcamp['created_utc'].min()).strftime('%Y-%m-%d %H:%M:%S'))
print(datetime.utcfromtimestamp(df_bootcamp['created_utc'].max()).strftime('%Y-%m-%d %H:%M:%S'))

2018-04-06 05:49:29
2022-11-23 23:04:23
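`datetime.utcfromtimestamp` is deprecated as of Python 3.12. A minimal sketch of the same conversion with a timezone-aware call is shown here; `utc_string` is a hypothetical helper name, and the sample value is the latest `created_utc` from the r/codingbootcamp table above.

```python
from datetime import datetime, timezone

def utc_string(ts):
    """Format a Unix timestamp (seconds) as a UTC date-time string."""
    aware = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return aware.strftime('%Y-%m-%d %H:%M:%S')

latest = utc_string(1669244663)
print(latest)
```

The `int(...)` cast also accepts the NumPy integer returned by `df['created_utc'].max()`.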


### r/csMajors

In [12]:
df_degree = req_subreddit('csmajors', 50)

4992 posts scraped!


In [13]:
print(df_degree.shape)
df_degree

(4992, 4)


| | subreddit | title | created_utc | selftext |
|---|---|---|---|---|
| 0 | csMajors | Final Rounds for Expedia Mobile Engineer Intern | 1669252773 | Has anyone taken the final interview for the M... |
| 1 | csMajors | Is it normal to apply for multiple different p... | 1669252295 | I see lots of posts with people saying they ap... |
| 2 | csMajors | Google STEP/Microsoft Explore with no experience? | 1669251959 | Can you get into the big tech internship progr... |
| 3 | csMajors | Expedia New Grad Final Round | 1669251690 | Has anyone done the final round for Expedia ye... |
| 4 | csMajors | Multimedia, fileformats and compression | 1669251617 | https://studienhandbuch.jku.at/93811?id=93811&... |
| ... | ... | ... | ... | ... |
| 4987 | csMajors | apple epm intern interview timeline? | 1665788384 | for context, i'm a ghc '22 in-person attendee ... |
| 4988 | csMajors | Lutron Electronics New Grad Final Interview | 1665788244 | Has anyone recently interviewed with Lutron El... |
| 4989 | csMajors | Valkyrie Round 2 | 1665786853 | Anyone done the Valkyrie Round 2 SWE Intern?\n... |
| 4990 | csMajors | HRT vs Citadel TC/Career/Reputation(Prestige) ... | 1665785980 | \n\n[View Poll](https://www.reddit.com/poll/y4... |


In [14]:
# check for earliest and latest scraped post datetime
print(datetime.utcfromtimestamp(df_degree['created_utc'].min()).strftime('%Y-%m-%d %H:%M:%S'))
print(datetime.utcfromtimestamp(df_degree['created_utc'].max()).strftime('%Y-%m-%d %H:%M:%S'))

2022-10-14 22:16:04
2022-11-24 01:19:33
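The printed ranges can be turned into a rough time span per subreddit. The sketch below simply parses the date strings printed in this notebook; `span_days` is a hypothetical helper, not part of the original workflow.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

def span_days(earliest, latest):
    """Whole days between two date-time strings in the printed format."""
    delta = datetime.strptime(latest, FMT) - datetime.strptime(earliest, FMT)
    return delta.days

# Earliest and latest post times as printed for each subreddit.
bootcamp_days = span_days('2018-04-06 05:49:29', '2022-11-23 23:04:23')
csmajors_days = span_days('2022-10-14 22:16:04', '2022-11-24 01:19:33')
print(bootcamp_days, csmajors_days)
```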


## Save to CSV

In [15]:
df_bootcamp.to_csv('datasets/codingbootcamp_submissions.csv', index=False)
df_degree.to_csv('datasets/csmajors_submissions.csv', index=False)
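`to_csv` does not create missing folders, so the `datasets` directory has to exist before the cell above runs. A minimal sketch that creates it if needed (`ensure_output_dir` is a hypothetical helper name):

```python
from pathlib import Path

def ensure_output_dir(path="datasets"):
    """Create the output folder if it is missing and return it as a Path."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out

out_dir = ensure_output_dir("datasets")
print(out_dir)
```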

## Continued in 02 - Preprocessing & Vectorizing