# Project 3 - Reddit classifier

## Part A) Extracting data from subreddits to create ML models using NLP

### Problem statement: 
##### Extract data from the `tea` and `Coffee` subreddits using PRAW, the Python wrapper for the Reddit API. A maximum of 1000 posts may be obtained using API requests, so for the two subreddits modeled in this project, two data sets of approximately 1000 posts each must be obtained.

In [1]:
# Import necessary libraries and modules
import pandas as pd
import requests
import json
import praw
import datetime as dt

In [2]:
# Reading in the necessary Reddit credentials from the json file on the local machine.
# The with-block closes the file handle once the credentials are loaded.
with open('./../../../../creds.json', 'r') as creds_file:
    reddit_creds = json.load(creds_file)
reddit_creds.keys()

dict_keys(['id', 'secret', 'user', 'pass'])
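
The credential loading above can be wrapped in a small helper that fails early when a key is missing. This is a minimal sketch, not part of the original notebook: the function name `load_reddit_creds` and the constant `REQUIRED_KEYS` are hypothetical, and the key names are taken from the `dict_keys` output shown above.

```python
import json

# Key names as printed by reddit_creds.keys() in the cell above
REQUIRED_KEYS = {'id', 'secret', 'user', 'pass'}


def load_reddit_creds(path):
    """Load credentials from a JSON file and check that the expected keys exist.

    Hypothetical helper for illustration; not part of PRAW.
    """
    with open(path, 'r') as creds_file:
        creds = json.load(creds_file)
    missing = REQUIRED_KEYS - set(creds)
    if missing:
        raise KeyError(f'Missing credential keys: {sorted(missing)}')
    return creds
```

A call such as `load_reddit_creds('./../../../../creds.json')` would then return the same dictionary as the cell above, or raise a `KeyError` naming the absent keys.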

In [3]:
# Instantiate the Reddit class
reddit = praw.Reddit(client_id = reddit_creds['id'],
            client_secret = reddit_creds['secret'],
            password = reddit_creds['pass'],
            username = reddit_creds['user'],
            user_agent = 'uday')

In [4]:
# Two subreddits we are attempting to classify
tea = reddit.subreddit('tea')
coffee = reddit.subreddit('Coffee')

In [5]:
# Accessing the top posts from the two subreddits
tea_top_1000 = tea.top(limit=1000)
coffee_top_1000 = coffee.top(limit=1000)
tea_top = [submission for submission in tea_top_1000]
coffee_top = [submission for submission in coffee_top_1000]
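
A listing may yield fewer posts than the requested limit, so it can help to materialize it through a helper that also drops repeated ids. The sketch below is an assumption-laden illustration: `fetch_top_submissions` is a hypothetical name, and it only assumes the object passed in has a `top(limit=...)` method whose items expose an `id` attribute, as the PRAW subreddit objects above do.

```python
def fetch_top_submissions(subreddit, limit=1000):
    """Return the top submissions as a list, skipping any repeated ids.

    Hypothetical helper: `subreddit` is any object with a top(limit=...) method
    that yields items carrying an `id` attribute.
    """
    seen_ids = set()
    submissions = []
    for submission in subreddit.top(limit=limit):
        if submission.id not in seen_ids:
            seen_ids.add(submission.id)
            submissions.append(submission)
    return submissions
```

With the objects defined above, `tea_top = fetch_top_submissions(tea)` would stand in for the list comprehension, and `len(tea_top)` shows how many posts were actually returned.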

In [6]:
# Creating dictionaries of empty lists for tea and coffee, one list per field
fields = ['title', 'id', 'over_18', 'is_self', 'selftext', 'score', 'rank', 'created']
tea_dict = {field: [] for field in fields}
coffee_dict = {field: [] for field in fields}

In [7]:
# Filling the dictionaries of lists; rank is the 1-based position in the top listing
for idx, item in enumerate(tea_top):
    tea_dict['title'].append(item.title)
    tea_dict['id'].append(item.id)
    tea_dict['over_18'].append(item.over_18)
    tea_dict['is_self'].append(item.is_self)
    tea_dict['selftext'].append(item.selftext)
    tea_dict['score'].append(item.score)
    tea_dict['rank'].append(idx+1)
    tea_dict['created'].append(item.created_utc)

for idx, item in enumerate(coffee_top):
    coffee_dict['title'].append(item.title)
    coffee_dict['id'].append(item.id)
    coffee_dict['over_18'].append(item.over_18)
    coffee_dict['is_self'].append(item.is_self)
    coffee_dict['selftext'].append(item.selftext)
    coffee_dict['score'].append(item.score)
    coffee_dict['rank'].append(idx+1)
    coffee_dict['created'].append(item.created_utc)
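
The two loops above are identical apart from the subreddit, so they could be folded into one function. The following is a sketch under that assumption; `submissions_to_columns` and `FIELDS` are hypothetical names, and the function only relies on the attributes already read in the loops above.

```python
# Attributes copied straight from each submission, as in the loops above
FIELDS = ('title', 'id', 'over_18', 'is_self', 'selftext', 'score')


def submissions_to_columns(submissions):
    """Build a dictionary of lists (one list per column) from submissions."""
    columns = {name: [] for name in FIELDS}
    columns['rank'] = []
    columns['created'] = []
    for rank, item in enumerate(submissions, start=1):
        for name in FIELDS:
            columns[name].append(getattr(item, name))
        columns['rank'].append(rank)                 # 1-based position in the listing
        columns['created'].append(item.created_utc)  # Unix timestamp in seconds
    return columns
```

Under this sketch, `tea_dict = submissions_to_columns(tea_top)` and `coffee_dict = submissions_to_columns(coffee_top)` would produce the same columns as the cells above.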

In [9]:
# Collecting the public description and total subscribers of each subreddit (written to separate files later)
tea_subreddit_info = {'subscribers': [tea.subscribers], 'public_description':[tea.public_description]}
coffee_subreddit_info = {'subscribers': [coffee.subscribers], 'public_description':[coffee.public_description]}

In [10]:
# Creating dataframes
tea_data = pd.DataFrame(tea_dict)
coffee_data = pd.DataFrame(coffee_dict)

In [11]:
# Code for converting the Unix timestamp adapted from the following blog post by Felippe Rodrigues:
# http://www.storybench.org/how-to-scrape-reddit-with-python/
# Note: created_utc is in UTC, while fromtimestamp() without a tz argument returns local time

tea_data['created'] = tea_data['created'].map(lambda x: dt.datetime.fromtimestamp(x))
coffee_data['created'] = coffee_data['created'].map(lambda x: dt.datetime.fromtimestamp(x))
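
Because `created_utc` is a UTC timestamp, a timezone-aware conversion keeps the result independent of the machine the notebook runs on. This is a minimal sketch of that alternative, not the conversion used above; `to_utc_datetime` is a hypothetical name.

```python
import datetime as dt


def to_utc_datetime(unix_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(unix_seconds, tz=dt.timezone.utc)


print(to_utc_datetime(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

It could be applied with `tea_data['created'].map(to_utc_datetime)`. The resulting values would differ from the outputs shown in this notebook by the local UTC offset.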

In [12]:
tea_data.head()

| | title | id | over_18 | is_self | selftext | score | rank | created |
|---|---|---|---|---|---|---|---|---|
| 0 | Perfect job doesn't exi- | 7xhbzb | False | False | | 8271 | 1 | 2018-02-14 03:20:05 |
| 1 | One thing coffee and tea drinkers can agree on... | 57wi79 | False | False | | 7908 | 2 | 2016-10-17 02:43:14 |
| 2 | 4chan's Beginners Guide on Tea | 63lb7c | False | False | | 5975 | 3 | 2017-04-05 05:49:29 |
| 3 | My buddies like the warmth | 603d65 | False | False | | 4339 | 4 | 2017-03-18 02:46:23 |
| 4 | There's no better way to do it. [x-post /r/iran] | 5g35da | False | False | | 4317 | 5 | 2016-12-02 05:06:24 |


In [13]:
coffee_data.head()

| | title | id | over_18 | is_self | selftext | score | rank | created |
|---|---|---|---|---|---|---|---|---|
| 0 | How do I use a coffee press? | df881q | False | True | Do I have to grind the beans, or just put them... | 11941 | 1 | 2019-10-08 16:20:33 |
| 1 | An update for you pretentious fucks | 5nxd2u | False | True | I posted here last week asking for recommendat... | 3074 | 2 | 2017-01-14 05:12:05 |
| 2 | Hey guys, it's been 44 days since hurricane Ma... | 7anryf | False | True | I though you guys could understand my frustrat... | 2072 | 3 | 2017-11-03 18:13:18 |
| 3 | 10 things you may not know about coffee. I'm a... | 1z0qje | False | False | | 2042 | 4 | 2014-02-26 12:29:14 |
| 4 | Former Rams fan here, I'm making the switch | 40qsem | False | True | I've always considered myself a casual fan of ... | 1657 | 5 | 2016-01-12 22:27:42 |


In [14]:
# Writing the dataframes to csv files
tea_data.to_csv('./../data/tea_data.csv',index=False)
coffee_data.to_csv('./../data/coffee_data.csv',index=False)
pd.DataFrame(tea_subreddit_info).to_csv('./../data/tea_info.csv', index=False)
pd.DataFrame(coffee_subreddit_info).to_csv('./../data/coffee_info.csv',index=False)
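
`to_csv` does not create missing directories, so the writes above depend on `./../data/` already existing. A small helper can create the directory first; this is a sketch only, and `prepare_output_path` is a hypothetical name.

```python
from pathlib import Path


def prepare_output_path(directory, filename):
    """Create the output directory if needed and return the full file path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / filename
```

For example, `tea_data.to_csv(prepare_output_path('./../data', 'tea_data.csv'), index=False)` would write the same file as above while tolerating a fresh checkout without a `data` folder.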