## Project 3: Using APIs and NLP for Prediction of Subreddit posts: r/CryptoCurrency and r/StockMarket

### Part 1. 



Brief Background about Themes in the project:

What do **Reddit** and **subreddit** mean?

Reddit is a large community made up of thousands of smaller communities. These smaller, sub-communities within Reddit are also known as "subreddits" and are created and moderated by redditors like you.

A subreddit is a specific online community, and the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they’re denoted by /r/, followed by the subreddit’s name, e.g., /r/gaming ([source](https://www.dictionary.com/e/slang/subreddit/)).

This project is based on two subreddits: CryptoCurrency and StockMarket.

What is cryptocurrency?

Cryptocurrency, sometimes called crypto-currency or crypto, is any form of currency that exists digitally or virtually and uses cryptography to secure transactions. Cryptocurrencies don't have a central issuing or regulating authority, instead using a decentralized system to record transactions and issue new units.

Cryptocurrencies run on a distributed public ledger called blockchain, a record of all transactions updated and held by currency holders. There are thousands of cryptocurrencies. Some of the best known include Bitcoin, Ethereum, and Litecoin ([source](https://www.kaspersky.com/resource-center/definitions/what-is-cryptocurrency)).

What is the stock market?

At the most basic level, a stock is simply a share of ownership in a company or corporation. There are two types of stock: private and public. The New York Stock Exchange (NYSE) and Nasdaq are the world's biggest stock exchanges. Exchanges are the places and systems where stocks are traded ([source](https://www.quickenloans.com/blog/stock-market-101-stock-market-work)).

This project has three Jupyter notebooks:

1. [Data Collection](https://git.generalassemb.ly/sileshith/dsir-111/blob/master/projects/project-03/submission/1-data-collection.ipynb)
2. [Data Cleaning and EDA](https://git.generalassemb.ly/sileshith/dsir-111/blob/master/projects/project-03/submission/2-data-cleaning.ipynb)
3. [Modeling and Evaluation](https://git.generalassemb.ly/sileshith/dsir-111/blob/master/projects/project-03/submission/3-modeling.ipynb)


I chose my two topics from common everyday conversation topics: the stock market and cryptocurrency trading/investment. The subreddits I chose are:

*  https://www.reddit.com/r/CryptoCurrency/

*  https://www.reddit.com/r/StockMarket/

Together, these two subreddits have more than <mark>**6 million members**</mark> (4M+ in r/CryptoCurrency and 2M+ in r/StockMarket). For this project, I focus on accurately classifying posts that belong to the CryptoCurrency subreddit, so from a data science perspective the metric I optimize the model for is accuracy.

**Problem Statement**

With stock and crypto investors in mind, I use Reddit's API to collect posts from two subreddits, r/CryptoCurrency and r/StockMarket, and use NLP to train a classifier that predicts which subreddit a given post came from.

**Performance metrics**: accuracy and precision

Overview of technical analysis:

**Data collection method**: Data scraping using the Reddit API through **pushshift.io** (resources below). I collected about 4,000 posts (2,000 CryptoCurrency posts and 2,063 StockMarket posts) from 30 days.
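Because accuracy is the target metric, it helps to know what a trivial classifier would score on classes of this size. The sketch below uses only the two post counts quoted above and assumes they are the final class sizes (deduplication or cleaning in later notebooks may change them).

```python
# Post counts quoted in the data-collection summary above.
n_crypto = 2000
n_stock = 2063

total = n_crypto + n_stock

# A classifier that always predicts the larger class reaches this accuracy,
# so any trained model should be compared against it.
baseline_accuracy = max(n_crypto, n_stock) / total

print(f"total posts: {total}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

With nearly balanced classes the baseline sits close to 0.5, which is why accuracy is a reasonable headline metric here.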

Exploratory Data Analysis

**Vectorizers used**: CountVectorizer and TfidfVectorizer, which create sparse feature matrices of term counts and TF-IDF weights, respectively, to feed to the classification model; both vectorizers include a tokenizer.
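To make the difference between the two vectorizers concrete, here is a minimal sketch on three made-up posts. The example texts are hypothetical and are not taken from the collected data; both vectorizers are used with their scikit-learn defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy posts, not taken from the collected data.
posts = [
    "bitcoin price is rising",
    "stock price is falling",
    "bitcoin and ethereum wallets",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(posts)    # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(posts)   # sparse matrix of TF-IDF weights

# Both matrices have one row per post and one column per vocabulary term.
print(counts.shape, weights.shape)
print(sorted(count_vec.vocabulary_))
```

Both calls tokenize the text internally, so no separate tokenization step is needed before fitting.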

**Models used/tested**: Random Forest, Logistic Regression, Support Vector Machine, and Multinomial Naive Bayes.

**Modeling tools used**: Pipelines, and GridSearch.
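A rough sketch of how a pipeline and grid search fit together is shown below. The toy corpus, the step names (`tfidf`, `logreg`), and the parameter grid are illustrative assumptions, not the settings used in the modeling notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = CryptoCurrency, 0 = StockMarket.
texts = [
    "bitcoin wallet hodl", "ethereum gas fees", "altcoin season bitcoin",
    "blockchain token launch", "crypto exchange bitcoin", "ethereum staking rewards",
    "dividend stocks earnings", "nasdaq index rally", "quarterly earnings report",
    "stocks dividend yield", "nyse shares rally", "index fund shares",
]
labels = [1] * 6 + [0] * 6

# The pipeline chains vectorizing and classification into one estimator.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Hyperparameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every combination in the grid, scored on accuracy.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, which keeps vocabulary from the validation fold out of the features.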

**Evaluation methods**: accuracy score, cross-validation, precision from the classification report, a confusion matrix to see false positives and false negatives, and a ROC curve to visualize model performance.
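As a reminder of how accuracy and precision relate to the confusion matrix, the following sketch computes both by hand in plain Python. The labels are hypothetical, and `confusion_counts` is a helper written for this illustration, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted crypto, really crypto
        elif pred == positive:
            fp += 1   # predicted crypto, really stock (false positive)
        elif truth == positive:
            fn += 1   # predicted stock, really crypto (false negative)
        else:
            tn += 1   # predicted stock, really stock
    return tp, fp, fn, tn


# Hypothetical labels: 1 = CryptoCurrency, 0 = StockMarket.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(tp, fp, fn, tn)
print(accuracy, precision)
```

Accuracy counts every correct prediction, while precision only asks how many of the posts predicted as CryptoCurrency really were.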



### Data Collection

In [1]:
# importing modules
import numpy as np
import pandas as pd
import requests
import time
import datetime as dt
import json

### Pushshift search function


In [2]:

def pushshift(subreddit, post_type='submission', loops=1, size=500, skip=30):
    """Collect posts from a subreddit through the Pushshift API.

    subreddit: str, name of subreddit to search for
    post_type: {'submission', 'comment'}, type of post to search for
    loops: int, number of times to request posts
    size: int, number of posts per request (max 500 per pushshift api guide)
    skip: int, number of days back to search in each loop;
        increase if too many duplicate posts are returned,
        decrease if you want to skip fewer posts
    """

    # data fields to return for submissions
    subfields = ['author', 'author_fullname', 'created_utc', 'id', 'num_comments', 'permalink', 
                 'score', 'selftext', 'subreddit', 'title', 'url', 'is_self']    
    # data fields to return for comments
    comfields = ['author', 'author_fullname', 'body', 'created_utc', 'id', 'parent_id', 
                'permalink', 'score', 'subreddit']
    # instantiate list for posts data
    list_posts = [] 
    url_stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(post_type, subreddit, size)
    # skip a minimum of 1 day
    after = 1    

    # check before requesting data
    if post_type not in ['submission', 'comment']:
        print("post_type must be 'submission' or 'comment'")
        return None
    
    for i in range(loops):
        # add parameters to url to skip posts (after could be used to match up to post at end of previous loop if skip = 0)
        url = '{}&after={}d'.format(url_stem, skip * i + after) 
        # monitor status as loops run
        print(i, url)
        # get data; raise an error on a failed request instead of parsing a bad response
        res = requests.get(url)
        res.raise_for_status()
        # add dictionaries for posts to list_posts
        list_posts.extend(res.json()['data'])
        # be polite
        time.sleep(1) 

    # turn list_posts (a list of dictionaries where each dictionary contains data on one post) into a dataframe
    df_posts = pd.DataFrame(list_posts)

    # filter fields for submissions or comments (post_type was validated above);
    # reindex keeps the column order and fills any field missing from the response with NaN
    fields = subfields if post_type == 'submission' else comfields
    df_posts = df_posts.reindex(columns=fields)

    # drop any duplicates and add a field identifying submissions or comments
    df_posts = df_posts.drop_duplicates().assign(post_type=post_type)

    return df_posts

# This code is adapted from: https://github.com/scolnik/dsi-project-03-reddit
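To see how `loops` and `skip` determine the `after` offsets before sending any requests, the URL construction can be isolated as below. `pushshift_urls` is a hypothetical helper that mirrors the string formatting in the function above; it only builds strings and does not contact the API.

```python
def pushshift_urls(subreddit, post_type="submission", loops=1, size=500, skip=30):
    """Return the request URLs the collection loop would print, without requesting them."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(
        post_type, subreddit, size
    )
    after = 1  # same minimum offset of one day as in the collection function
    # Loop i asks for posts newer than (skip * i + after) days ago.
    return ["{}&after={}d".format(stem, skip * i + after) for i in range(loops)]


# With skip=30 the offsets are 1d, 31d, 61d, ...; with skip=1 they are 1d, 2d, 3d, ...
for url in pushshift_urls("StockMarket", loops=3, skip=30):
    print(url)
```

Previewing the URLs this way makes it easier to judge whether consecutive windows overlap heavily, which is the situation the `skip` parameter is meant to tune.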



**Get Reddit posts and save to CSVs**


In [3]:
crypto_subs = pushshift('cryptocurrency', post_type='submission', loops=20, skip=1)
print('Cryptocurrency size: ', crypto_subs.shape)
crypto_subs.to_csv('crypto_subs_pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=1d
1 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=2d
2 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=3d
3 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=4d
4 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=5d
5 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=6d
6 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=7d
7 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=8d
8 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=9d
9 https://api.pushshift.io/reddit/search/submission/?subreddit=cryptocurrency&size=500&after=10d
10 https://api.pushshift.io/reddit/sear

In [4]:
StockMarket_subs = pushshift('StockMarket', post_type='submission', loops=21, skip=30)
print('StockMarket size: ', StockMarket_subs.shape)
StockMarket_subs.to_csv('StockMarket_subs_pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=1d
1 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=31d
2 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=61d
3 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=91d
4 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=121d
5 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=151d
6 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=181d
7 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=211d
8 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=241d
9 https://api.pushshift.io/reddit/search/submission/?subreddit=StockMarket&size=500&after=271d
10 https://api.pushshift.io/reddit/search/submission/?s

In [5]:
# comments about crypto
crypto_commnts = pushshift('cryptocurrency', post_type='comment', loops=20, skip=1)
print('Crypto comments size: ', crypto_commnts.shape)
crypto_commnts.to_csv('crypto_commnts_pushshift.csv')

0 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=1d
1 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=2d
2 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=3d
3 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=4d
4 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=5d
5 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=6d
6 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=7d
7 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=8d
8 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=9d
9 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocurrency&size=500&after=10d
10 https://api.pushshift.io/reddit/search/comment/?subreddit=cryptocu

In [6]:
# comments about the stock market
StockMarket_commnts = pushshift('StockMarket', post_type='comment', loops=20, skip=1)
print('StockMarket comment size: ', StockMarket_commnts.shape)
StockMarket_commnts.to_csv('StockMarket_commnts_pushshift.csv')

0 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=1d
1 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=2d
2 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=3d
3 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=4d
4 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=5d
5 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=6d
6 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=7d
7 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=8d
8 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=9d
9 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=10d
10 https://api.pushshift.io/reddit/search/comment/?subreddit=StockMarket&size=500&after=11d
11 http

#### Create csv for analysis of comment body text only

In [7]:
df = pd.concat([crypto_commnts[['body', 'subreddit']], StockMarket_commnts[['body', 'subreddit']]], ignore_index=True)
df.to_csv('comments.csv', index=False)
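The same concatenation can be tried on tiny in-memory frames to check the resulting shape. The frames below are hypothetical stand-ins for the two comment DataFrames, and the `is_crypto` column is an assumed label encoding added for illustration, not something the notebook above creates.

```python
import pandas as pd

# Hypothetical stand-ins for the two comment DataFrames.
crypto = pd.DataFrame({
    "body": ["btc to the moon", "gas fees again"],
    "subreddit": ["CryptoCurrency", "CryptoCurrency"],
    "score": [3, 1],
})
stock = pd.DataFrame({
    "body": ["earnings beat estimates"],
    "subreddit": ["StockMarket"],
    "score": [5],
})

# Keep only the text and its source, and renumber rows 0..n-1.
combined = pd.concat(
    [crypto[["body", "subreddit"]], stock[["body", "subreddit"]]],
    ignore_index=True,
)

# Assumed binary target: 1 for CryptoCurrency, 0 for StockMarket.
combined["is_crypto"] = (combined["subreddit"] == "CryptoCurrency").astype(int)

print(combined)
```

Using `ignore_index=True` avoids duplicate row labels, since each source frame starts its own index at 0.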