# Project 3: Web APIs and NLP

In this notebook we will be covering the following:

1. [Introduction and Problem Statement](#Inital_understanding_of_datasets)
2. [Data Dictionary](#Data_Dictionary) 
3. [Web Scraping](#Web_Scraping)
4. [Exporting Data](#Exporting_Data)

## Introduction

In this project, we build a binary classifier that predicts which of two chosen subreddits a Reddit post belongs to. Reddit is a social news, content, and discussion website. Posts are organised by subject into user-created 'subreddits'. Members submit content (such as images, text, and links) to subreddits, and other members can vote it up ('upvote') or down ('downvote'). As of June 2021, Reddit ranked among the most popular mobile social apps in the United States, with almost 48 million monthly active users.

## Problem Statement

Data and technology have grown steadily in popularity and use over the past decade, and the age of technology appears to be here to stay. The world is learning not only to harness the power of data but also to live with it. It is no surprise that such a relatively new industry attracts many people seeking a career in it, whether for the lucrative remuneration or for its challenges. We will use data from two subreddits, 'datascience' and 'dataengineering', to understand, from a career switcher's point of view, the key differences between the two fields, and to see whether the results can help them decide which path to pivot to, taking into account their skill sets, academic qualifications and interests.

## Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|title|str|datascience_posts/dataengineering_posts|Title of the Reddit post|
|selftext|str|datascience_posts/dataengineering_posts|Body text (selftext) of the Reddit post|
|subreddit|str|datascience_posts/dataengineering_posts|Name of the subreddit the post belongs to|
|title lemmatized|str|datascience_posts/dataengineering_posts|Title after lemmatizing|
|selftext lemmatized|str|datascience_posts/dataengineering_posts|Selftext after lemmatizing|
|lemma_comb|str|datascience_posts/dataengineering_posts|Merged text of title lemmatized and selftext lemmatized|

In [1]:
# Importing libraries

import pandas as pd
import requests, time, math

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Web Scraping

We will be scraping data from the two subreddits, **'DataScience'** and **'DataEngineering'**.

In [2]:
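# Function to scrape Reddit posts from a subreddit.
# Arguments: subreddit name, number of posts to collect, and an epoch timestamp
# ('before'); only posts created before that timestamp are requested.

def subreddit_scrapper(subreddit, postcount, before):
    # Pushshift submission search endpoint
    base_url = 'https://api.pushshift.io/reddit/search/submission'

    # Empty list to collect all posts
    postlist = []

    # Each request asks for 100 posts, so loop postcount // 100 times
    for _ in range(postcount // 100):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before
        }

        # Send the GET request and raise an error on an unsuccessful response
        res = requests.get(base_url, params=params)
        res.raise_for_status()

        # Parse the JSON response and add the posts to the list
        posts = res.json()['data']
        if not posts:
            # No older posts were returned, so stop early
            break
        postlist.extend(posts)

        # Use the timestamp of the last post in this batch as the new 'before'
        # so that the next request fetches the next batch of posts
        before = posts[-1]['created_utc']

    print(f'{len(postlist)} number of posts scrapped from {subreddit}.')
    return postlist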
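
The paging idea behind the function above (request a batch, then move `before` back to the timestamp of the last post received) can be sketched separately from the HTTP call. The block below is only an illustration under stated assumptions: `collect_posts` and `fake_fetch_page` are hypothetical names that do not appear in the notebook, and the fake fetcher stands in for the API so the loop can be exercised without network access. Unlike the loop above, this sketch also handles a `postcount` that is not a multiple of 100.

```python
from typing import Any, Callable, Dict, List

Post = Dict[str, Any]


def collect_posts(fetch_page: Callable[[int, int], List[Post]],
                  postcount: int, before: int, page_size: int = 100) -> List[Post]:
    """Collect up to `postcount` posts by repeatedly moving `before` back in time."""
    collected: List[Post] = []
    while len(collected) < postcount:
        # Never ask for more posts than are still needed
        size = min(page_size, postcount - len(collected))
        page = fetch_page(before, size)
        if not page:
            # The source has no older posts left
            break
        collected.extend(page)
        # The last post of the page becomes the new cut-off
        before = page[-1]['created_utc']
    return collected


def fake_fetch_page(before: int, size: int) -> List[Post]:
    """Hypothetical stand-in for the API: one post per second, newest first,
    and no posts older than timestamp 0."""
    newest = before - 1
    oldest = max(newest - size + 1, 0)
    return [{'created_utc': t} for t in range(newest, oldest - 1, -1)]


posts = collect_posts(fake_fetch_page, postcount=150, before=1000)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

Passing the fetcher in as an argument is a design choice for this sketch only; it keeps the paging logic testable on its own, while the notebook's function calls `requests.get` directly.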

In [3]:
%%time

# Apply function to scrape data from respective subreddits
ds = subreddit_scrapper('datascience', postcount = 1000, before = 1642060546)
de = subreddit_scrapper('dataengineering',postcount = 1000, before = 1642060546)

1000 number of posts scrapped from datascience.
1000 number of posts scrapped from dataengineering.
Wall time: 1min 45s
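
The `before` argument used above is a Unix epoch timestamp, which is hard to read at a glance. As a small aside, the standard-library sketch below converts it to a UTC date and back; the variable names are illustrative and not part of the notebook.

```python
from datetime import datetime, timezone

# Epoch value passed as `before` in the scraping cell
BEFORE_EPOCH = 1642060546

# Epoch seconds -> timezone-aware UTC datetime
cutoff = datetime.fromtimestamp(BEFORE_EPOCH, tz=timezone.utc)
print(cutoff.isoformat())  # 2022-01-13T07:55:46+00:00

# UTC datetime -> epoch seconds, e.g. to choose a different cut-off date
epoch_again = int(datetime(2022, 1, 13, 7, 55, 46, tzinfo=timezone.utc).timestamp())
print(epoch_again)  # 1642060546
```

So both subreddits were scraped for posts created before 13 January 2022, 07:55:46 UTC.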


In [4]:
# Converting scraped data to DataFrames

ds_df = pd.DataFrame(ds)
de_df = pd.DataFrame(de)
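
Because each request reuses the timestamp of the previous batch's last post, it may be worth checking that no post was collected twice before moving on. The sketch below is one possible check, not something the notebook does: `drop_duplicate_posts` is a hypothetical helper, and it assumes each scraped post dictionary carries an `id` key identifying the submission.

```python
from typing import Any, Dict, Iterable, List


def drop_duplicate_posts(posts: Iterable[Dict[str, Any]],
                         key: str = 'id') -> List[Dict[str, Any]]:
    """Keep the first occurrence of each post, compared on `key` (assumed to be 'id')."""
    seen = set()
    unique = []
    for post in posts:
        if key not in post:
            # Posts without the key cannot be compared, so keep them
            unique.append(post)
            continue
        if post[key] in seen:
            continue
        seen.add(post[key])
        unique.append(post)
    return unique


# Made-up sample data for illustration only
sample = [
    {'id': 'a1', 'title': 'first'},
    {'id': 'b2', 'title': 'second'},
    {'id': 'a1', 'title': 'first'},
]
print(len(drop_duplicate_posts(sample)))
```

On the DataFrames themselves, `ds_df.drop_duplicates(subset='id')` would serve the same purpose, again assuming an `id` column is present.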

## Exporting Data

In [12]:
# Exporting data

ds_df.to_csv('../data/datasci.csv', index = False)
de_df.to_csv('../data/dataengin.csv', index = False)

Exploratory data analysis will be done in the next notebook.