<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

--- 
**Primary Learning Objectives:**
1. Collect posts from two subreddits of our choice. 
2. Use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

---

## Contents:
- Background
- Data
- Problem Statement
- Scraping Reddit Data

## Background

Project 3 has two main objectives.
- First, collect posts from two chosen subreddits. The Pushshift API can be used for this. Alternatives include web scraping or a low-code platform such as ParseHub, which can also collect data from other sites such as Twitter, Facebook, Instagram, or Wikipedia.

- Second, train a classifier using Natural Language Processing (NLP) techniques to predict which subreddit (or other sub-section of the site the data was collected from) a post belongs to. This is a binary classification problem: the model labels each post as belonging to one of the two chosen subreddits.


In this project, we will explore the following two subreddits:
1) [Retirement](https://www.reddit.com/r/retirement/)
2) [Financial Planning](https://www.reddit.com/r/FinancialPlanning/)

## Data

Data from the subreddits r/retirement and r/FinancialPlanning were used for this project:
*  Retirement: https://www.reddit.com/r/retirement/
*  Financial Planning: https://www.reddit.com/r/FinancialPlanning/

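A total of 2,000 posts per subreddit were scraped, in two requests of 1,000 posts each. For both subreddits, the first request returned the most recent posts created before 8 Mar 2023. The second request used an earlier cutoff: 4 Feb 2023 for r/FinancialPlanning and 8 Feb 2016 for r/retirement (dates in UTC, taken from the `before` timestamps used in the scraping requests).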

The data from the subreddits provide valuable insights into the concerns, strategies, and experiences of individuals planning for their financial future and retirement.

## Problem Statement

The goal of this project is to develop a classifier that can accurately distinguish between posts from two subreddits, Retirement and Financial Planning. This is important because, although both subreddits deal with personal finance, they differ in focus and audience.

Being able to differentiate between the two subreddits can help financial professionals and individuals better tailor their content and advice to the specific needs and interests of their audience. 

To achieve this, natural language processing techniques will be used to analyze the content of posts from both subreddits and build a classifier that can accurately identify which subreddit a given post belongs to. 
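The idea of classifying a post from its words can be sketched with a small bag-of-words Naive Bayes model written in plain Python. This is only an illustration under stated assumptions: the class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the four training sentences are invented here, and the model is a stand-in rather than the classifier the project ultimately uses.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of posts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration
train_texts = [
    "When should I start drawing my pension after I retire?",
    "Retired last year and adjusting to life on social security",
    "How do I build a budget and an emergency fund in my twenties?",
    "Should I pay off student debt or invest in an index fund?",
]
train_labels = ["retirement", "retirement", "FinancialPlanning", "FinancialPlanning"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("Is my pension enough to retire on?")
print(prediction)
```

With real data, the training texts would be the scraped post titles and bodies, and the labels would be the subreddit each post came from.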

The success of the project will be measured by the accuracy of the classifier in identifying the source subreddit of new posts. Accuracy is chosen as the decision metric because it reports how often the model predicts the correct class, and because the cost of misclassification is treated as equal for both classes: the two topics overlap substantially, and retirement adequacy is a major component of personal financial planning. A retirement post still offers relevant considerations for someone starting a personal financial plan, and a financial planning post can help those interested in retirement find information that supports financial independence and peace of mind.
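As a small worked example of the metric, accuracy is the number of correct predictions divided by the total number of predictions. The helper name `accuracy` and the labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Four hypothetical posts: three are classified correctly, one is not
y_true = ["retirement", "retirement", "FinancialPlanning", "FinancialPlanning"]
y_pred = ["retirement", "FinancialPlanning", "FinancialPlanning", "FinancialPlanning"]

score = accuracy(y_true, y_pred)
print(score)  # 0.75
```

Because both subreddits contribute the same number of posts, a model that always guesses one class would score 0.5, which gives a natural baseline to compare against.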

## Scraping Reddit Data

### First 1,000 posts for the retirement subreddit

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import json

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
params = {
        'subreddit':'retirement',
        'size': 1000,
        'before': 1678306393
}
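The `before` value is a Unix timestamp (seconds since 1 Jan 1970 UTC). The sketch below, using only the standard library, shows how such a value can be converted to a readable date and back. The helper names `to_utc` and `to_epoch` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def to_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


cutoff = to_utc(1678306393)
print(cutoff.isoformat())  # 2023-03-08T20:13:13+00:00
```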

In [4]:
# Send API request
res = requests.get(url, params=params)

In [5]:
# Check status code
res.status_code

200
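Checking the status code by eye works in a notebook, but a script would need to stop on a failed request. One possible sketch is below. The function name `fetch_posts` and the 30-second timeout are assumptions, and `get` is expected to be a callable with the same interface as `requests.get`, so that the function can be exercised without network access.

```python
def fetch_posts(get, url, params, timeout=30):
    """Request one page of submissions and return the list under 'data'.

    `get` is any callable that behaves like `requests.get`: it accepts a URL
    plus `params` and `timeout` keyword arguments and returns a response
    object with `raise_for_status()` and `json()` methods.
    """
    res = get(url, params=params, timeout=timeout)
    res.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
    return res.json()["data"]
```

In the notebook's terms this would be called as `posts = fetch_posts(requests.get, url, params)`.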

In [6]:
# Convert response to a DataFrame
posts = res.json()['data']
df = pd.DataFrame(posts)

In [7]:
# Check the number of posts returned
len(posts)

1000

In [8]:
# Save out first 1000 posts for retirement subreddit
df.to_csv('../data/retirement_a.csv', index=False)

In [9]:
posts[999]

{'distinguished': None,
 'gilded': 0,
 'media': None,
 'link_flair_text': None,
 'retrieved_on': 1459195522,
 'from_kind': None,
 'is_self': True,
 'thumbnail': 'self',
 'author_flair_css_class': None,
 'title': 'Can I front-load my 401k to the Annual Max if I am not planning to be there for the full year?',
 'hide_score': False,
 'media_embed': {},
 'downs': 0,
 'link_flair_css_class': None,
 'selftext': 'Just what it says. Any reason why I cannot defer most of my paycheck the first few months to hit my max contributions, and then either retire, leave, or set my contributions to zero the rest of the way?   Assume I have no company match to contend with. Thank you!',
 'subreddit_id': 't5_2tgdb',
 'locked': False,
 'secure_media': None,
 'ups': 3,
 'score': 3,
 'quarantine': False,
 'author': 'rodg89',
 'archived': False,
 'secure_media_embed': {},
 'subreddit': 'retirement',
 'stickied': False,
 'over_18': False,
 'url': 'https://www.reddit.com/r/retirement/comments/46d97p/can_i_frontl

### Second 1,000 posts for the retirement subreddit

In [10]:
params_b = {
        'subreddit':'retirement',
        'size': 1000,
        'before': 1454975161
}
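Here the `before` cutoff for the second request is entered by hand. A loop could derive it from the previous page instead. The sketch below assumes that each post dict carries a `created_utc` field (not visible in the truncated output above) and that `fetch_page(before, size)` is a hypothetical callable returning a list of post dicts created before the given timestamp.

```python
def collect_posts(fetch_page, before, total=2000, page_size=1000):
    """Page backwards in time until `total` posts have been collected.

    `fetch_page(before, size)` is a hypothetical callable that returns up to
    `size` post dicts created strictly before the `before` timestamp.
    """
    posts = []
    while len(posts) < total:
        page = fetch_page(before, min(page_size, total - len(posts)))
        if not page:
            break  # no older posts are available
        posts.extend(page)
        # The next cutoff is the creation time of the oldest post on this page
        before = min(post["created_utc"] for post in page)
    return posts
```

One caveat of this approach: posts sharing the exact creation second of the cutoff may be skipped, since the cutoff is exclusive.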

In [11]:
# Send API request
res = requests.get(url, params=params_b)

In [12]:
# Check status code
res.status_code

200

In [13]:
# Convert response to a DataFrame
posts = res.json()['data']
df = pd.DataFrame(posts)

In [14]:
# Check the number of posts returned
len(posts)

1000

In [15]:
# Save out second 1000 posts for retirement subreddit
df.to_csv('../data/retirement_b.csv', index=False)
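When two batches are later combined, it may be worth guarding against the same post appearing in both. A minimal sketch follows, assuming each post dict has a unique `id` field (an assumption, since the truncated output above does not show it). The function name `merge_batches` and the sample ids are invented.

```python
def merge_batches(*batches):
    """Concatenate batches of post dicts, keeping the first copy of each id."""
    seen = set()
    merged = []
    for batch in batches:
        for post in batch:
            if post["id"] in seen:
                continue  # skip a post already collected in an earlier batch
            seen.add(post["id"])
            merged.append(post)
    return merged


# Invented sample batches that overlap on one post
batch_a = [{"id": "a1", "title": "first"}, {"id": "a2", "title": "second"}]
batch_b = [{"id": "a2", "title": "second"}, {"id": "a3", "title": "third"}]

merged = merge_batches(batch_a, batch_b)
print(len(merged))  # 3
```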

### First 1,000 posts for the Financial Planning subreddit

In [16]:
params_c = {
        'subreddit':'FinancialPlanning',
        'size': 1000,
        'before': 1678306393
}

In [17]:
# Send API request
res = requests.get(url, params=params_c)

In [18]:
# Check status code
res.status_code

200

In [19]:
# Convert response to a DataFrame
posts = res.json()['data']
df = pd.DataFrame(posts)

In [20]:
# Check the number of posts returned
len(posts)

1000

In [21]:
# Save out first 1000 posts for financial planning subreddit
df.to_csv('../data/fp_a.csv', index=False)

In [22]:
posts[999]

{'subreddit': 'FinancialPlanning',
 'selftext': '[removed]',
 'author_fullname': 't2_94f77wog',
 'gilded': 0,
 'title': 'What are some behavioral characteristics of high NW individuals?',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/FinancialPlanning',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': None,
 'thumbnail_height': None,
 'top_awarded_type': None,
 'hide_score': True,
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 1.0,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': None,
 'author_flair_template_id': None,
 'is_original_content': False,
 'secure_media': None,
 'is_reddit_media_domain': False,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_text': None,
 'score': 1,
 'is_created_from_ads_ui': False,
 'author_premium': False,
 'thumbnail': 'self',
 'edited': False,
 'author_flair_css_class': None,
 'author_flair_

### Second 1,000 posts for the Financial Planning subreddit

In [23]:
params_d = {
        'subreddit':'FinancialPlanning',
        'size': 1000,
        'before': 1675490930
}

In [24]:
# Send API request
res = requests.get(url, params=params_d)

In [25]:
# Check status code
res.status_code

200

In [26]:
# Convert response to a DataFrame
posts = res.json()['data']
df = pd.DataFrame(posts)

In [27]:
# Check the number of posts returned
len(posts)

1000

In [28]:
# Save out second 1000 posts for financial planning subreddit
df.to_csv('../data/fp_b.csv', index=False)
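Before modelling, the raw posts need to become text and label pairs. The sketch below is one possible approach rather than the project's actual preprocessing: it joins `title` and `selftext`, treats the placeholder bodies `[removed]` and `[deleted]` as empty, and uses the `subreddit` field as the label. The function name `build_examples` is hypothetical, and the sample posts are abbreviated from the outputs shown earlier.

```python
def build_examples(posts):
    """Turn raw post dicts into (text, label) pairs for a classifier."""
    placeholders = {"[removed]", "[deleted]"}
    examples = []
    for post in posts:
        body = post.get("selftext") or ""
        if body.strip() in placeholders:
            body = ""  # the body was removed, so only the title is usable
        text = f"{post.get('title', '')} {body}".strip()
        if text:
            examples.append((text, post["subreddit"]))
    return examples


# Sample posts abbreviated from the notebook outputs
sample_posts = [
    {
        "subreddit": "FinancialPlanning",
        "title": "What are some behavioral characteristics of high NW individuals?",
        "selftext": "[removed]",
    },
    {
        "subreddit": "retirement",
        "title": "Can I front-load my 401k to the Annual Max if I am not planning to be there for the full year?",
        "selftext": "Just what it says.",
    },
]

examples = build_examples(sample_posts)
for text, label in examples:
    print(label, "|", text)
```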