# Project 3: Natural Language Processing of Subreddit Posts
------------------------------------------------------------

Project notebook organisation:
1. **Webscraping and Data Acquisition** (current notebook)
2. [Exploratory Data Analysis and Preprocessing]()
3. [Model Tuning and Insights]()

###  Contents:
  1. [Executive Summary](#Executive-Summary)
  2. [Background](#Background)
  3. [Problem Statement](#Problem-Statement)
  4. [Webscraping](#Webscraping)

## Executive Summary
Reddit is a social news, content, and discussion website. Posts are organised by subject into user-created 'subreddits', which cover practically any topic imaginable. Members submit content (such as images, text, and links) to subreddits.

As a new investment company with 2 main trading desks, one for traditional securities and another for cryptocurrency, we find reddit a platform of great interest. Its authentic daily discussions cover a wide range of financial topics, from financial news, market data, and traditional securities in the Investing subreddit to blockchain technology and general sentiment about new coins in the CryptoCurrency subreddit. These raw conversations are what we intend to keep our eyes on.

This project aims to automate the monitoring of investing-related reddit posts to find new leads for both desks. I hope to route these new leads and hot trends to the relevant trading desk. As such, I need a model to analyse and categorise the reddit posts for further review and investigation by either the securities or the crypto trading desk. The prediction will be made with the best-performing model, either Logistic Regression or a Multinomial Naive Bayes Classifier, paired with either a Count Vectoriser or a TF-IDF Vectoriser, as evaluated by F1 score, Sensitivity, Specificity and Accuracy.

4 models were evaluated: Logistic Regression (Count Vectoriser), Logistic Regression (TF-IDF Vectoriser), Multinomial Naive Bayes Classifier (Count Vectoriser) and Multinomial Naive Bayes Classifier (TF-IDF Vectoriser). The entire dataset was split into a training set and a testing set. Logistic Regression (TF-IDF Vectoriser) is preferred over the other 3 models for two reasons: it maximises our focus metric, and it gives the best overall balance across our 4 metrics. Our focus metric, specificity, was highest with logistic regression, and the highest possible specificity is what we want from the model. In addition, this model scores over 90% on 3 of the 4 metrics, with specificity a hair below at 89.9%.

## Background
Reddit is a social news, content, and discussion website. Posts are organised by subject into user-created 'subreddits', which cover practically any topic imaginable. Members submit content (such as images, text, and links) to subreddits.

As a new investment company with 2 main trading desks, one for traditional securities and another for cryptocurrency, we find reddit a platform of great interest. Its authentic daily discussions cover a wide range of financial topics, from financial news, market data, and traditional securities in the Investing subreddit to blockchain technology and general sentiment about new coins in the CryptoCurrency subreddit. These raw conversations are what we intend to keep our eyes on.

## Problem Statement
This project aims to automate the monitoring of investing-related reddit posts to find new leads for both desks. I hope to route these new leads and hot trends to the relevant trading desk. As such, I need a model to analyse and categorise the reddit posts for further review and investigation by either the securities or the crypto trading desk. The prediction will be made with the best-performing model, either Logistic Regression or a Multinomial Naive Bayes Classifier, paired with either a Count Vectoriser or a TF-IDF Vectoriser, as evaluated by F1 score, Sensitivity, Specificity and Accuracy.
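
The four evaluation metrics named above can all be derived from the counts of a binary confusion matrix. The sketch below is only an illustration of those definitions, not the evaluation code used in this project: the function name `classification_metrics` and the example counts are made up, and which subreddit is treated as the positive class is left as an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        'accuracy': accuracy,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'f1': f1,
    }


# Made-up counts, purely for illustration
metrics = classification_metrics(tp=90, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Specificity only looks at the negative class, so a model can have high accuracy and still score poorly on it if most of its errors are false positives.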


# Webscraping & Initial Filtering
### Importing Packages

In [1]:
# Importing libraries
import requests
import time
import pandas as pd

## Data Collection and Filtering
### PushShift API
The pushshift.io Reddit API was designed and created by the /r/datasets mod team to provide enhanced functionality and search capabilities for Reddit comments and submissions. Because the PushShift API only allows 100 posts per request, I need to call it in a loop. The two subreddits I have chosen are r/investing and r/CryptoCurrency. I have 2 goals to achieve while collecting this data.

### Goals
1. Collect at least 1000 non-duplicate submissions from each subreddit. These submissions should not have 'removed', 'deleted' or empty titles.
2. Collect the 'title', 'subreddit', 'author', 'selftext' and 'created_utc' fields.
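
As a rough way to check these two goals against a batch of collected posts, something like the sketch below could be used. It is an assumption-laden illustration rather than part of the scraping function: `count_usable` and `meets_goals` are hypothetical helper names, posts are taken to be plain dictionaries as returned by the API, and two posts are treated as duplicates when their author, title and timestamp all match.

```python
REQUIRED_FIELDS = ['title', 'subreddit', 'author', 'selftext', 'created_utc']
UNUSABLE_TITLES = {'[removed]', '[deleted]', ''}


def count_usable(posts):
    """Count distinct posts that carry every required field and a usable title."""
    seen = set()
    for post in posts:
        # Goal 2: every required field must be present
        if any(field not in post for field in REQUIRED_FIELDS):
            continue
        title = post['title']
        # Goal 1: skip removed, deleted or empty titles
        if title is None or title.strip() in UNUSABLE_TITLES:
            continue
        # Assumption: author + title + timestamp identifies a submission
        seen.add((post['author'], title, post['created_utc']))
    return len(seen)


def meets_goals(posts, minimum=1000):
    """Return True when at least `minimum` usable, distinct posts were collected."""
    return count_usable(posts) >= minimum


# Made-up sample: one good post, one exact duplicate, one removed title
sample = [
    {'title': 'ETF question', 'subreddit': 'investing', 'author': 'a',
     'selftext': 'text', 'created_utc': 1},
    {'title': 'ETF question', 'subreddit': 'investing', 'author': 'a',
     'selftext': 'text', 'created_utc': 1},
    {'title': '[removed]', 'subreddit': 'investing', 'author': 'b',
     'selftext': 'text', 'created_utc': 2},
]
print(count_usable(sample), meets_goals(sample))
```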

### Function's Logic
The idea behind the pushshift function is as follows:
1. Collect data from before the stipulated time, one day at a time.
2. With each loop, filter the 100 posts requested so that they are free of 'removed', 'deleted' or empty entries, to reduce the guesswork needed down the line.
3. Print a loop counter to track the progress of the data collection and confirm that the code is running properly.
4. Create a dataframe to aggregate the data.
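
The filtering step in this list can be sketched on plain dictionaries, without pandas. This is a minimal illustration under stated assumptions, not the function used in this notebook: `filter_batch` is a hypothetical helper, the sample posts are invented, and duplicates are detected on `selftext` only.

```python
UNUSABLE_TEXT = {'[removed]', '[deleted]', ''}


def filter_batch(batch, seen_texts):
    """Keep posts whose selftext is usable and has not been seen before.

    `seen_texts` is updated in place so it can be shared across loops.
    """
    kept = []
    for post in batch:
        text = post.get('selftext')
        # Drop missing, removed, deleted or empty body text
        if text is None or text in UNUSABLE_TEXT:
            continue
        # Drop duplicates across all batches processed so far
        if text in seen_texts:
            continue
        seen_texts.add(text)
        kept.append(post)
    return kept


# Invented sample batch for illustration
batch = [
    {'title': 'A', 'selftext': 'Thoughts on index funds?'},
    {'title': 'B', 'selftext': '[removed]'},
    {'title': 'C', 'selftext': ''},
    {'title': 'D', 'selftext': 'Thoughts on index funds?'},
    {'title': 'E'},
]
seen = set()
kept = filter_batch(batch, seen)
print([post['title'] for post in kept])
```

Passing the same `seen` set into every loop removes duplicates across requests as well as within a single batch.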

In [2]:
def pushshift(subreddit, post_type='submission', loops=1, size=100, skip=1, epoch=13214523):
    # subreddit: str, name of subreddit to search for
    # post_type: str, type of post to search for
    # loops: int, number of times to request posts
    # size: int, number of threads per request (max 100 per pushshift api guide)
    # skip: int, number of days added to the 'after' parameter on each loop
    # epoch: int, time in epoch to collect before stipulated time and date

    # columns required for submissions
    columns = ['subreddit','author','selftext','title','created_utc']
    # instantiate list for posts data
    list_posts = []    
    url_stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(post_type, subreddit, size)
    # skip a minimum of 1 day
    after = 1   
        
    for i in range(loops):
        # add parameters to url to skip threads (after could be used to match up to post at end of previous loop if skip = 0)
        url = '{}&after={}d&before={}'.format(url_stem, skip * i + after, epoch)
        # monitor status as loops run
        print(i, url)
        # get data
        res = requests.get(url)
        # add dictionaries for posts to list_posts
        list_posts.extend(res.json()['data'])
        # be polite
        time.sleep(1)

    # turn list_posts (a list of dictionaries where each dictionary contains data on one post) into a dataframe
    full_data = pd.DataFrame.from_dict(list_posts) 
    # filtering columns from pulled data
    df_threads = full_data[columns]

    # Dropping rows with missing selftext
    df_threads = df_threads.dropna(subset=['selftext'])
    # Dropping removed, deleted or empty selftext
    # (the unfiltered rows are not concatenated back in, which would undo this filter)
    df_threads = df_threads.loc[(df_threads['selftext'] != '[removed]')
                                & (df_threads['selftext'] != '[deleted]')
                                & (df_threads['selftext'] != '')]

    # Dropping any remaining duplicates
    df_threads = df_threads.drop_duplicates(subset=['selftext'])

    return df_threads
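
The `epoch` argument is a Unix timestamp in seconds. As a small sketch of how such a value can be produced and checked with the standard library, the helper below (`to_epoch` is a hypothetical name, and working in UTC is an assumption) reproduces the value 1635177599 used in the scraping calls that follow.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC date and time to a Unix timestamp in whole seconds."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


cutoff = to_epoch(2021, 10, 25, 15, 59, 59)
print(cutoff)  # 1635177599

# Converting back, to confirm what a given epoch value means
print(datetime.fromtimestamp(cutoff, tz=timezone.utc).isoformat())
# 2021-10-25T15:59:59+00:00
```

15:59:59 UTC is 23:59:59 in a UTC+8 timezone, so this cutoff would mark the end of 25 October 2021 there, although the notebook does not state which timezone was intended.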

### Scraping r/investing

In [3]:
# Apply function for scraping
investing_subs = pushshift('Investing', post_type='submission', loops=80, skip=1, epoch=1635177599)
# Checking shape of dataframe pulled
print('shape', investing_subs.shape)
# Save to csv
investing_subs.to_csv('investing_subs-pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=1d&before=1635177599
1 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=2d&before=1635177599
2 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=3d&before=1635177599
3 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=4d&before=1635177599
4 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=5d&before=1635177599
5 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=6d&before=1635177599
6 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=7d&before=1635177599
7 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=8d&before=1635177599
8 https://api.pushshift.io/reddit/search/submission/?subreddit=Investing&size=100&after=9d&before=1635177599
9 https://api.pushs

### Scraping r/CryptoCurrency

In [4]:
# Apply function for scraping
crypto_subs = pushshift('CryptoCurrency', post_type='submission', loops = 45, size= 100, skip= 1, epoch=1635177599)
# Checking shape of dataframe pulled
print('shape', crypto_subs.shape)
# Save to csv
crypto_subs.to_csv('cryptocurrency_subs-pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=1d&before=1635177599
1 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=2d&before=1635177599
2 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=3d&before=1635177599
3 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=4d&before=1635177599
4 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=5d&before=1635177599
5 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=6d&before=1635177599
6 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=7d&before=1635177599
7 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&after=8d&before=1635177599
8 https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&a