Summary
Goals:
Using the Reddit or Pushshift.io API, collect posts from two subreddits of my choosing (r/cats and r/dogs).
Use NLP to train a classifier that predicts which subreddit a given post came from.
Stretch goal: Use sentiment analysis to solve the age-old debate of cats vs dogs.

### What is Reddit?

Founded in 2005 by University of Virginia roommates **Steve Huffman** and **Alexis Ohanian**, together with **Aaron Swartz**, **Reddit** is a website comprising user-generated content (photos, videos, links, and text-based posts) and discussions of that content, in what is essentially a bulletin board system. The name "Reddit" is a play on words with the phrase "read it", i.e., "I read it on Reddit." According to Reddit, in 2019 there were approximately 430 million monthly users, who are known as "redditors". The site's content is divided into categories or communities known on-site as "subreddits", of which more than 138,000 are active.

As a network of communities, Reddit's core content consists of posts from its users. Users can comment on others' posts to continue the conversation. A key feature of Reddit is that users can cast positive or negative votes, called upvotes and downvotes respectively, on each post and comment on the site. The balance of upvotes and downvotes determines a post's visibility, so the most popular content is displayed to the most people. Users can also earn "karma" for their posts and comments, a status that reflects their standing within the community and their contributions to Reddit. Posts are automatically archived after six months, meaning they can no longer be commented on or voted on.

### Subreddits

Subreddits are user-created areas of interest where discussions on Reddit are organized. As of July 2018, there were about 138,000 active subreddits (out of a total of 1.2 million). Subreddit names begin with "r/"; for instance, "r/science" is a community devoted to discussing scientific topics, "r/television" is a community devoted to discussing TV shows, and "r/Islam" is a community dedicated to Islam-oriented topics ([source](https://en.wikipedia.org/wiki/Reddit)).

## Problem Statements



What characteristics of a Reddit post contribute most to determining which subreddit it belongs to?

Predict which subreddit a post or comment belongs to by using NLP and classification models.

* The volume of Reddit posts is extremely large, so sorting them all manually is nearly impossible.
* How can machine learning (NLP) be used to organise the posts in a logical, acceptable way?
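
One way to make the first question concrete is to compare how often each word appears in the two subreddits. The sketch below is a hand-rolled, dependency-free stand-in for the idea behind a bag-of-words Naive Bayes model, not the notebook's actual pipeline. The toy titles, the function name `word_log_odds`, and the smoothing value `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy, made-up titles standing in for scraped posts (not real data).
posts = [
    ("my cat knocked a glass off the table", "cats"),
    ("kitten purring on my lap", "cats"),
    ("cat sleeping in a box again", "cats"),
    ("took my dog for a walk in the park", "dogs"),
    ("puppy learned to fetch the ball", "dogs"),
    ("dog barking at the mailman again", "dogs"),
]

def word_log_odds(labelled_posts, class_a, class_b, alpha=1.0):
    """Return {word: log P(word|class_a) - log P(word|class_b)}.

    Probabilities use add-alpha (Laplace) smoothing so that words unseen
    in one class do not produce log(0).
    """
    counts = {class_a: Counter(), class_b: Counter()}
    for text, label in labelled_posts:
        counts[label].update(text.lower().split())
    vocab = set(counts[class_a]) | set(counts[class_b])
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in counts}
    return {
        w: math.log((counts[class_a][w] + alpha) / totals[class_a])
        - math.log((counts[class_b][w] + alpha) / totals[class_b])
        for w in vocab
    }

log_odds = word_log_odds(posts, "cats", "dogs")
ranked = sorted(log_odds, key=log_odds.get)  # most dog-like first
print("most dog-like:", ranked[:3])
print("most cat-like:", ranked[-3:])
```

Words with a large positive score lean towards the first class and words with a large negative score lean towards the second, which gives a rough, readable answer to "what contributes most".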


## The three parts are:

### Part I: Data wrangling/gathering/acquisition
### Part II: Natural Language Processing
### Part III: Classification Modeling

In this section we will carry out:

* API Scraping
* Data Cleaning
* Modeling
* Model Evaluation
* Conclusion and Recommendation
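
As a rough illustration of the data cleaning step listed above, the sketch below lowercases a post, strips links and non-letter characters, and removes stop words. The function name `clean_text` and the tiny `STOP_WORDS` set are hypothetical; a real run would more likely use a full list such as `nltk.corpus.stopwords.words("english")`, possibly followed by lemmatization.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "my", "to", "and", "of", "in", "on"}

def clean_text(text):
    """Lowercase, drop URLs and non-letters, then remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip links
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("My cat is 3 years old!! See https://example.com/pic"))
```

Cleaning before vectorizing keeps the vocabulary small and stops punctuation or URLs from showing up as features.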

In [4]:
# imports
import pandas as pd
import numpy as np
import datetime as dt 
import time 
import requests
import json
np.random.seed(42)
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:




# from sklearn.linear_model import LogisticRegressionCV
# from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import Pipeline
# from sklearn.model_selection import train_test_split
# from sklearn import metrics
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import cross_val_score, StratifiedKFold
# from sklearn.metrics import classification_report
# from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
# from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
# from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
# from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer, CountVectorizer
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import confusion_matrix as cm
# from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# from sklearn.metrics import roc_auc_score
# from sklearn.metrics import roc_curve, auc
# from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
# from IPython.display import Image  
# from sklearn.tree import export_graphviz


# %config InlineBackend.figure_format = 'retina'
# %matplotlib inline

### API Scraping

In [2]:
# prepare the URLs of the subreddits that we are interested in
url_1 = 'https://www.reddit.com/r/marvel.json'  
url_2 = 'https://www.reddit.com/r/dccomics.json'  
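
The two URLs above follow the same pattern, and later pages are requested by appending an `after` token. A small helper can build both forms. This is only a sketch: the name `listing_url` and the token `t3_abc123` are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reddit.com"

def listing_url(subreddit, after=None):
    """Build the JSON listing URL for a subreddit, optionally for a later page."""
    url = f"{BASE_URL}/r/{subreddit}.json"
    if after is None:
        return url
    # urlencode escapes the token safely instead of concatenating raw strings
    return f"{url}?{urlencode({'after': after})}"

print(listing_url("marvel"))
print(listing_url("dccomics", after="t3_abc123"))  # made-up pagination token
```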

In [3]:
import random  # needed for the randomised delay below

# create a function to capture posts
def get_post(url, csv_name):
    """Page through a subreddit's JSON listing and save the posts to csv_name."""
    posts = []
    after = None

    for a in range(35):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Data Inc'})

        # if unable to access reddit, print the error and 'after' so we know where to continue from
        if res.status_code != 200:
            print('Status error', res.status_code)
            print(after)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # rewrite the CSV with everything collected so far, so progress survives a later failure
        pd.DataFrame(posts).to_csv(csv_name, index=False)

        # stop when reddit reports no further page; otherwise the loop would restart from page one
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 10)
        time.sleep(sleep_duration)
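
The cursor-following logic can be checked without touching the network by passing in a stand-in fetch function. The sketch below is not the notebook's scraper: `collect_posts`, `fake_fetch`, and the `FAKE_PAGES` data are all hypothetical, and it assumes each post's `name` field (its full ID, e.g. `t3_...`) is a suitable key for dropping duplicates.

```python
def collect_posts(fetch_page, max_pages=35):
    """Follow 'after' cursors via fetch_page(after) and return de-duplicated posts."""
    posts, seen, after = [], set(), None
    for _ in range(max_pages):
        page = fetch_page(after)
        for child in page["data"]["children"]:
            post = child["data"]
            if post["name"] not in seen:  # skip posts already collected
                seen.add(post["name"])
                posts.append(post)
        after = page["data"]["after"]
        if after is None:  # no further pages
            break
    return posts

# Made-up pages that mimic the shape of a listing response.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a", "title": "first"}},
                                 {"data": {"name": "t3_b", "title": "second"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_b", "title": "second"}},
                                   {"data": {"name": "t3_c", "title": "third"}}],
                      "after": None}},
}

def fake_fetch(after):
    """Stand-in for an HTTP request; returns a canned page for the given cursor."""
    return FAKE_PAGES[after]

posts = collect_posts(fake_fetch)
print([p["title"] for p in posts])
```

Separating the fetch from the paging loop makes it easier to test the loop in isolation, and to swap in a real request function later.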