# Phase I: Data Collection

#### Choosing the Classes: r/Science and r/EverythingScience

This analysis looks at the similarities and differences in the text data of two distinct subreddits: r/Science and r/EverythingScience. I chose these two subreddits because they are alike in topic and in post format (i.e. users typically post summaries of linked articles or research studies). The key difference between them is that r/Science requires posts to be peer-reviewed studies, whereas r/EverythingScience does not. Initially, I had chosen r/Psychology as the counterpart to r/EverythingScience, out of a concern that using two subreddits on the exact same topic might cause poor model performance. However, given that the objective is to test how well models can judge the validity of a claim based solely on text, I decided to hold as many variables in the experiment constant as possible. If the models can't distinguish between the posts of the two science subreddits, we can then change a variable such as subreddit topic and measure any change in model performance.


#### Data Collection Methodology

To gather each subreddit's data and convert it into a usable format, I queried the Reddit API and requested the data as JSON. By default Reddit returns 25 posts per API request, so I used a for loop to accumulate enough data points from each subreddit. I appended the JSON data to a list and created a DataFrame from it. I repeated this process several times over multiple days, both to generate enough data and to avoid querying the API too aggressively, so the following code represents only one cycle of the query and collection process. Ultimately I wound up with three separate .csv files that I later merged into a single DataFrame. Since the initial API queries included the r/Psychology subreddit, which I eventually excluded from this analysis, I filtered the data set before exporting the final .csv at the bottom of this notebook.
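
As a rough illustration of the JSON shape involved, the sketch below parses a hand-written, heavily trimmed stand-in for one page of a listing response. The key layout (`data`, `children`, `after`, `name`) mirrors what the collection code in this notebook reads; the sample values and the `flatten_listing` helper are hypothetical and exist only for this example.

```python
import json

# Hand-written stand-in for one page of a listing response (all values are made up).
sample_page = """
{
  "data": {
    "after": "t3_bbb",
    "children": [
      {"data": {"name": "t3_aaa", "subreddit": "science", "title": "First post"}},
      {"data": {"name": "t3_bbb", "subreddit": "science", "title": "Second post"}}
    ]
  }
}
"""

def flatten_listing(page):
    """Return the per-post dictionaries and the cursor pointing at the next page."""
    rows = [child["data"] for child in page["data"]["children"]]
    return rows, page["data"]["after"]

rows, after = flatten_listing(json.loads(sample_page))
print(len(rows), after)
```

Each dictionary in `rows` corresponds to one post, which is the shape that later becomes one row of the DataFrame.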

---

### Importing Packages & Libraries:

In [1]:
import requests
import time
import pandas as pd

### Querying the API and collecting post data to a list:

In [2]:
posts = []

#### Some notes on the for loop:
 
- To deal with pagination, the for loop needs a reference point so that it can keep moving through successive pages and return consecutive, unique posts. The "after" variable holds the unique ID of the last post on each page, and passing it back to the API requests the page that follows.
- Reddit tends to reject or throttle requests that arrive with the default `requests` User-agent, so I set a custom User-agent header to avoid those errors.
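
The cursor-following idea from the first note can be sketched in isolation, without touching the network. Everything here is an assumption made for illustration: `collect_pages` and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for real API responses.

```python
def collect_pages(fetch_page, max_pages=50):
    """Follow the `after` cursor until it runs out or max_pages is reached.

    fetch_page is a hypothetical callable that takes the current cursor
    (None for the first page) and returns (list_of_posts, next_cursor).
    """
    posts, after = [], None
    for _ in range(max_pages):
        page_posts, after = fetch_page(after)
        posts.extend(page_posts)
        if after is None:  # the listing has no further pages
            break
    return posts

# Fake three-page listing keyed by cursor, standing in for the real API.
FAKE_PAGES = {
    None: (["t3_a", "t3_b"], "t3_b"),
    "t3_b": (["t3_c", "t3_d"], "t3_d"),
    "t3_d": (["t3_e"], None),
}

collected = collect_pages(lambda cursor: FAKE_PAGES[cursor])
print(collected)
```

Stopping when the cursor comes back as `None` matters: continuing to request pages at that point would start again from the first page and collect the same posts twice.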

In [None]:
headers = {'User-agent': 'bot 0.1'}
urls = ['https://www.reddit.com/r/science/new/.json', 'https://www.reddit.com/r/EverythingScience.json', 
        'https://www.reddit.com/r/psychology/new/.json']

for url in urls:
    after = None  # reset the pagination cursor for each subreddit
    for i in range(50):
        # progress output: page number and current cursor
        print(i)
        print(after)
        params = {} if after is None else {'after': after}
        res = requests.get(url, params=params, headers=headers)
        if res.status_code != 200:
            print(res.status_code)
            break
        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
        if after is None:  # no more pages in this listing
            break
        time.sleep(5)  # pause between requests to avoid querying too aggressively

I ran this loop multiple times to see how many posts I could accumulate in a single day, and then checked my list of posts to confirm the number of unique entries.

In [6]:
len(posts)

1402

In [7]:
len(set(p['data']['name'] for p in posts))

1227
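
The gap between the two counts is the number of repeated entries. A minimal sketch of how one might inspect the repeats, using `collections.Counter` on a made-up `toy_posts` list shaped like `posts` (the IDs are invented for the example):

```python
from collections import Counter

# Toy stand-in for `posts`: each entry keeps its ID under ['data']['name'].
toy_posts = [
    {"data": {"name": "t3_a"}},
    {"data": {"name": "t3_b"}},
    {"data": {"name": "t3_a"}},
]

name_counts = Counter(p["data"]["name"] for p in toy_posts)
repeated = {name: n for name, n in name_counts.items() if n > 1}

# Number of surplus entries, then the IDs that appear more than once.
print(len(toy_posts) - len(name_counts), repeated)
```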

### Creating a DataFrame from the relevant post data:

In [9]:
df = pd.DataFrame([p['data'] for p in posts]).drop_duplicates(subset='name')
df.shape

(1227, 100)

In [10]:
df.to_csv('./Datasets/redditAPI_json12202018.csv', index=False)

### Combining all .csv files of collected data into one data set:

In [11]:
csv_1 = pd.read_csv('./Datasets/redditAPI_json12132018.csv')
csv_2 = pd.read_csv('./Datasets/redditAPI_json12172018.csv')
csv_3 = pd.read_csv('./Datasets/redditAPI_json12202018.csv')

In [13]:
csvs = [csv_1, csv_2, csv_3]
df = pd.concat(csvs)
df.drop_duplicates(subset='name', inplace=True)
df.drop('Unnamed: 0', axis=1, inplace=True)
df.tail()

| | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | ... | thumbnail_height | thumbnail_width | title | ups | url | user_reports | view_count | visited | whitelist_status | wls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1222 | | | False | dongasaurus_prime | | | | [] | | | ... | 78.0 | 140.0 | 100% renewables, No problems: "contrary to uns... | 34 | https://physicsworld.com/a/100-renewables-no-p... | [] | | False | all_ads | 6 |
| 1223 | | | False | twwitterr | | | | [] | | | ... | 56.0 | 140.0 | Scientists Discovered Rare Giant Viruses Lurki... | 20340 | https://www.google.com/amp/s/www.sciencealert.... | [] | | False | all_ads | 6 |
| 1224 | | | False | BloodSoakedDoilies | | | | [] | | | ... | 72.0 | 140.0 | Planckian Dissipation, Strange Metals, and a Q... | 13 | https://www.theatlantic.com/science/archive/20... | [] | | False | all_ads | 6 |
| 1225 | | | False | Wagamaga | | | reward1 | [] | | | ... | 46.0 | 140.0 | Eating leafy greens, dark orange and red veget... | 38 | https://eurekalert.org/pub_releases/2018-11/aa... | [] | | False | all_ads | 6 |
| 1226 | | | False | FillsYourNiche | | | env reward3 | [{'e': 'text', 't': 'MS \| Ecology and Evolutio... | | MS \| Ecology and Evolution \| Ethology | ... | 136.0 | 140.0 | Breast tumors can boost their growth by recrui... | 13 | https://www.eurekalert.org/pub_releases/2018-1... | [] | | False | all_ads | 6 |


In [14]:
df.shape

(2832, 102)
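
One common source of an `Unnamed: 0` column is a CSV that was written with its index and then read back. I am assuming, since the notebook does not show it, that the earlier files were saved that way. The sketch below reproduces the effect on made-up toy frames and then applies the same concat, de-duplicate, and drop steps; all names and values are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["t3_a", "t3_b"], "subreddit": ["science", "science"]})

# Writing with the default index=True stores the index as an unnamed first column,
# which read_csv then labels 'Unnamed: 0'.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# A second toy batch that never went through a CSV has no such column.
newer = pd.DataFrame({"name": ["t3_b", "t3_c"],
                      "subreddit": ["science", "EverythingScience"]})

# Combine the batches, keep the first copy of each post ID, drop the stray column.
merged = pd.concat([with_index, newer], sort=False)
merged = merged.drop_duplicates(subset="name").drop(columns="Unnamed: 0")
print(list(with_index.columns), merged["name"].tolist())
```

Saving with `index=False`, as the export cells in this notebook do, avoids creating the extra column in the first place.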

In [16]:
df['subreddit'].unique()

array(['EverythingScience', 'psychology', 'science'], dtype=object)

In [18]:
df = df[df['subreddit'] != 'psychology']
df['subreddit'].unique()

array(['EverythingScience', 'science'], dtype=object)

In [20]:
df.shape

(1916, 102)

### Saving / exporting the final data set for later use:

In [21]:
df.to_csv('./Datasets/Final_Reddit_Dataset.csv', index=False)