# Collecting Reddit Data

First, set up a Reddit app and get the API keys.

### First install PRAW, the Python Reddit API Wrapper.

`pip install praw`

(more info https://praw.readthedocs.io/en/latest/getting_started/installation.html)

In [None]:
import praw #pip install praw

In [None]:
# We need a few other libraries for the rest of the exercise
#from os.path import isfile
import pandas as pd
from time import sleep

### Set up a Reddit app and get the API keys
Follow the steps covered in class. The steps are also on Canvas [here](https://canvas.uw.edu/courses/1434897/files/folder/Labs?preview=72511628).

## Create "praw.ini" file to save the application information in the following format
**Do not share this file with others, since it contains your credentials. Keep it in the same directory as your other Reddit scripts.**
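The exact format from class is not reproduced here. As a rough sketch, a read-only script app usually needs the `client_id`, `client_secret`, and `user_agent` keys under a `[DEFAULT]` site section. The values below are placeholders, and the helper name `missing_keys` is made up for this illustration. The snippet only parses the text and reports which keys are absent; it does not contact Reddit.

```python
import configparser

# Placeholder values only (assumption): replace them with your own app's credentials.
SAMPLE_PRAW_INI = """
[DEFAULT]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
user_agent=YOUR_APP_NAME by u/YOUR_USERNAME
"""

REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def missing_keys(ini_text, site="DEFAULT"):
    """Return the required keys that are absent or empty in the given site section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser[site]
    return [key for key in REQUIRED_KEYS if not section.get(key)]


print(missing_keys(SAMPLE_PRAW_INI))
```

To check your real file, read its contents (for example with `open("praw.ini").read()`) and pass the text to `missing_keys`. An empty list means all three keys are present.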

Check that everything is working by running the following code, which creates a Reddit instance.

The Reddit instance provides convenient access to Reddit’s API. [read the docs](https://praw.readthedocs.io/en/latest/code_overview/reddit_instance.html)

In [None]:
# Get credentials from DEFAULT instance in praw.ini
reddit = praw.Reddit('DEFAULT')

## Fetching data from a subreddit

Refer to the docs for syntax and method details:
[reddit.subreddit](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html?highlight=subreddit.top)

Let's start by fetching the top submissions from the subreddit `r/news`

If you are new to Reddit, go to the subreddit r/news to see how it looks: http://reddit.com/r/news

In [None]:
# Create subreddit variable like this 
# Pick any subreddit like (news, pics, science, technology, politics) or pick any other you want
subreddit = reddit.subreddit('news')

# Get top posts with ".top" (you can also use .hot, .new, .controversial and .gilded) and set a limit
# This creates an iterator that we can use to process posts one at a time
top_subreddit = subreddit.top(limit=10)

#Print titles and id of posts by iterating over the top_subreddit object
for submission in top_subreddit:
    print(submission.title, submission.id)

In [None]:
#Another quick alternative

subreddit = reddit.subreddit('news')
for submission in subreddit.top(limit=10):
    print(submission.title, submission.id)

#### Check to see if you really fetched the correct data:

Go here and do a quick scan: https://www.reddit.com/r/news/top/

<span class="mark">**TODO**:</span> Fetch the 15 hottest submissions

You just fetched `top` submissions from the subreddit r/news. Using the same logic, fetch the 15 hottest submissions from the subreddit of your choice. A quick browse through the [docs](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html?highlight=subreddit.top#) can help.


In [None]:
## Your code below




### Submission object
Let's try to explore the submission object a bit more.

<span class="mark">**TODO**</span>
Apart from `title` and `id`, what else can you fetch?

Check out the documentation table [here](https://praw.readthedocs.io/en/latest/code_overview/models/submission.html).

In [None]:

for submission in subreddit.top(limit=10):
    ## The rest of your code here
    pass  # placeholder so the cell runs; replace it with your code

    
    

### Creating pandas frames from subreddit related data 

* We will first fetch various fields from a bunch of reddit posts using the Reddit praw API. (You just did this step in the previous code blocks)
* Next, we will create a dictionary of key:value pairs to store each of these fields
* Finally, we will create a pandas dataframe from this dictionary.


Note: If you are new to Python, the dictionary is one data structure that will really come in handy. Do read how [`defaultdict`](https://docs.python.org/3/library/collections.html#collections.defaultdict) makes dictionaries easier to work with.
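As a small illustration of the difference, the sketch below builds the same dictionary twice from made-up data: once with a plain `dict`, where every key has to be created up front, and once with `defaultdict(list)`, which creates an empty list the first time a key is used.

```python
from collections import defaultdict

# Made-up (title, score) pairs standing in for real posts
posts = [("First post", 120), ("Second post", 45)]

# Plain dict: every key must exist before we can append to it
plain = {"title": [], "score": []}
for title, score in posts:
    plain["title"].append(title)
    plain["score"].append(score)

# defaultdict(list): a missing key is created as an empty list on first access
auto = defaultdict(list)
for title, score in posts:
    auto["title"].append(title)
    auto["score"].append(score)

print(dict(auto) == plain)
```

One trade-off to keep in mind: because missing keys are created silently, a misspelled key name becomes a new entry instead of raising a `KeyError`.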

In [None]:
# You can scrape different characteristics of posts (refer to the Reddit API documentation to see what options are available)
# Here we will scrape title, score, upvote ratio, id, url, number of comments, UTC timestamp and the original-content flag
# We will store them in a dictionary
from collections import defaultdict
posts_dict = defaultdict(list)

# Get top posts with ".top" (you can also use .hot, .new, .controversial and .gilded) and set a limit
# This creates an iterator that we can use to process posts one at a time
top_subreddit = subreddit.top(limit=15)

# 1. Without defaultdict, you would first have to create a dictionary structure to store all these fields.
#posts_dict = { "title":[], "score":[], "upvote_ratio":[], "id":[], "url":[], "comms_num":[], "created":[], "original":[]}

# 2. Now iterate over the top_subreddit object and store the fields for each post in the dictionary
for submission in top_subreddit:
    posts_dict["title"].append(submission.title)
    posts_dict["score"].append(submission.score) #score: The number of upvotes for the submission.
    posts_dict["upvote_ratio"].append(submission.upvote_ratio) #upvote_ratio: The percentage of upvotes from all votes on the submission.
    posts_dict["id"].append(submission.id)
    posts_dict["url"].append(submission.url)
    posts_dict["comms_num"].append(submission.num_comments)
    posts_dict["created"].append(int(submission.created_utc)) #created_utc: Unix timestamp (UTC); int() avoids storing it in scientific format
    posts_dict["original"].append(submission.is_original_content) #Whether or not the submission has been set as original content.

In [None]:
posts_dict

In [None]:
# Create a dataframe from a dictionary
news_data = pd.DataFrame(posts_dict)
# Print 
news_data.head()

In [None]:
# Save data to CSV
news_data.to_csv('./top_15_posts_in_news.csv', encoding='utf-8')
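A column of Unix epoch seconds, such as `created`, is hard to read at a glance. The sketch below shows one way to derive a timezone-aware datetime column with `pd.to_datetime`. The helper name `add_created_datetime`, the `_dt` column suffix, and the sample row are made up for this illustration.

```python
import pandas as pd


def add_created_datetime(df, column="created"):
    """Return a copy of df with an extra UTC datetime column derived from epoch seconds."""
    out = df.copy()
    # to_numeric also handles timestamps that were stored as strings
    seconds = pd.to_numeric(out[column])
    out[column + "_dt"] = pd.to_datetime(seconds, unit="s", utc=True)
    return out


# Made-up sample row standing in for real post data
sample = pd.DataFrame({"title": ["Example post"], "created": [1600000000]})
print(add_created_datetime(sample))
```

The function returns a copy, so the original dataframe is left unchanged.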

## Comment object
See the [docs](https://praw.readthedocs.io/en/latest/code_overview/models/comment.html?highlight=comment) for the fields that you can retrieve from the comment object.

Below is a small example; there are many ways to extend it.

In [None]:
for comment in reddit.subreddit("news").comments(limit=5):
    print('SUBMISSION', comment.submission.title, '\n', 'COMMENT:', comment.body)

## Get user profile info

Example user: u/karmanaut.

You can also see what this user's profile looks like on Reddit: https://www.reddit.com/user/karmanaut

Now let's programmatically get some data from this user profile.

More info is in the [Redditor](https://praw.readthedocs.io/en/latest/code_overview/models/redditor.html) docs.

In [None]:
# Get user object
user = reddit.redditor("karmanaut")

In [None]:
# Get user id
print(user.id)

# Check whether the user has a verified email
print(user.has_verified_email)

# Get the username (useful if you created the user instance with a user id instead of a username)
print(user.name)

<span class="mark">**TODO**:</span> Apart from `name` , `has_verified_email` and `id` what else can you find out about the "user"?

*Hint*: See the Attribute table in the [Redditor](https://praw.readthedocs.io/en/latest/code_overview/models/redditor.html) documentation

In [None]:
# Your code below to extract additional fields for this user




## Get user comments info 

In [None]:
# Create a user comment dictionary to store attributes of the top comments made by the user
ucomm_dict = defaultdict(list)
#ucomm_dict = { "subreddit":[],"score":[], "id":[], "body":[]}

# Iterate over the user's top comments and store their fields in the dictionary
for comment in user.comments.top(limit=50):
    ucomm_dict['subreddit'].append(str(comment.subreddit)) # store the subreddit name as text, not the Subreddit object
    ucomm_dict['score'].append(comment.score)
    ucomm_dict['id'].append(comment.id)
    ucomm_dict['body'].append(comment.body)

In [None]:
# Create a dataframe from a dictionary

ucomm_data = pd.DataFrame(ucomm_dict)
ucomm_data.head()

In [None]:
# Save data to CSV
ucomm_data.to_csv("./user_comment_data.csv", encoding='utf-8')

## Readability

Let's find the readability of these comments.

Python's `textstat` package can come in handy: https://pypi.org/project/textstat/

`pip install textstat`

In [None]:
import textstat

In [None]:
for ind in ucomm_data.index:
    print(textstat.smog_index(ucomm_data['body'][ind]))

**What do these values even mean?**

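How do we interpret the SMOG index? It estimates the years of education a reader needs to understand a piece of text.
See: https://en.wikipedia.org/wiki/SMOG (the grade-level table at https://en.wikipedia.org/wiki/Gunning_fog_index is also a useful reference)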
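To make the number less of a black box, here is a rough sketch of the published SMOG formula: grade = 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291, where a polysyllable is a word with three or more syllables. The syllable counter is a naive vowel-group heuristic assumed for this illustration, so the values will not match `textstat` exactly. The formula was also normed on 30-sentence samples, so scores for short comments should be read with caution.

```python
import math
import re


def count_syllables(word):
    """Naive estimate (assumption): count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_grade(text):
    """Approximate SMOG grade of a text; returns 0.0 if no sentence is found."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


print(smog_grade("The cat sat. The dog ran. We ate."))
print(smog_grade("Readability is complicated."))
```

Text with no polysyllabic words gets the floor value of 3.1291, and the grade rises as the share of long words per sentence grows.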

#### pandas dataframe.

In the previous code block you also practiced iterating over a pandas dataframe.

While you are at it, also check out other ways of iterating over rows in pandas:

https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/

You can iterate using the `loc` indexer that we saw last week.
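As a quick comparison, the sketch below reads the same column in three ways on a small made-up dataframe: by index label with `loc`, with `iterrows`, and with `itertuples`.

```python
import pandas as pd

# Made-up data standing in for the comment dataframe
df = pd.DataFrame({"id": ["a1", "b2"], "score": [10, 25]})

# 1. Look up each row by its index label with .loc
scores_loc = [int(df.loc[ind, "score"]) for ind in df.index]

# 2. iterrows yields (index, row) pairs, where each row is a Series
scores_iterrows = [int(row["score"]) for _, row in df.iterrows()]

# 3. itertuples yields one namedtuple per row
scores_itertuples = [int(row.score) for row in df.itertuples(index=False)]

print(scores_loc, scores_iterrows, scores_itertuples)
```

All three give the same values. `itertuples` is generally the fastest of the three on larger frames, since it avoids building a Series for every row.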

<span class="mark">**TODO**</span>: 

The "Fake News Packs a Lot in Title..." paper that you read used three readability indices (see the section on complexity features, p. 3). Specifically:
- SMOG Grade, 
- Gunning Fog, and 
- Flesch-Kincaid grade level index.

You computed the SMOG index just now.

Write code to find the two other readability indices for the comment data.

Also, interpret the values.

In [None]:
# Your code below



<span class="mark">**TODO**:</span> *Medium difficulty*

We saw how to get top posts from a subreddit, top comments for a user, and comments from a subreddit. Can you figure out how to get posts made by a user? You can take the same example user, `karmanaut`, and print their top posts.

In [None]:
# Your code below



