# Data Collection

Before collecting any data, we must decide how to gather it and what constitutes a good dataset.

 1. **What is the time period we collect our data from?** <br>
	A very **narrow time period**, such as the last few months, would give a **very biased representation** of the subreddit: we would miss posts on major earlier events and, more importantly, over-represent the few events that fall inside the window. For example, if we collected only the last four months, all subsequent analysis and models would give extreme weight to coronavirus-related posts, which is not a balanced representation of the subreddit. Going too far back is also a problem: the subreddit of **10 years ago** was very different because of the **data and mobile revolution** since then. An appropriate time frame is therefore the **last four years, i.e. 2016-2020**, which covers all major recent events.
<br><br>


 2. **How much data should we collect?** <br>
	This depends entirely on the **resources available** for data exploration and model training. I can reasonably train on around **150,000** posts on my PC. Spread over the last four years, that averages around **100 posts per day**.
<br><br>

3. **What is the method for collecting posts?** <br>
	There are several ways in which we can form our dataset such as:
	
		1. equal number of posts from each popular flair
		2. posts with the highest score
		3. random posts
	After weighing these options, I decided to strike a balance between **relevant posts** and a good distribution of posts over time. I define a relevant post as one that is viewed and liked by many people, which reflects its impact on society. I therefore divided the four-year period into batches of **10 days each** and, for each batch, collected the **1000 highest-scoring posts**. Although this makes the dataset imbalanced in terms of flairs, it is an accurate representation of the relevant posts on the subreddit.
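
The sampling plan implies some simple arithmetic, sketched here as a sanity check. The figures come from the text (four years, 10-day batches, 1000 posts per batch); treating each month as exactly three 10-day batches is an assumption made to keep the numbers round.

```python
# Sanity check of the sampling plan: 10-day batches over four years.
YEARS = 4
BATCHES_PER_MONTH = 3      # assumption: a month is treated as three 10-day batches
POSTS_PER_BATCH = 1000     # top-scoring posts requested per batch
BATCH_DAYS = 10

batches = YEARS * 12 * BATCHES_PER_MONTH     # number of requests needed
max_posts = batches * POSTS_PER_BATCH        # upper bound on dataset size
days_covered = batches * BATCH_DAYS          # days actually spanned by the batches
posts_per_day = max_posts / days_covered     # average density of the sample

print(batches, max_posts, days_covered, posts_per_day)
```

Under this assumption the batches span 1440 days, about three weeks short of four calendar years, and the upper bound of 144,000 posts is close to the 150,000 target.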

In [84]:
import pandas as pd

In [65]:
import requests
from datetime import datetime
import traceback

### Method for getting posts

There are two widely used methods for getting posts: 
1. **PRAW Reddit API wrapper**<br>
    In the latest version of the API, posts can only be listed through one of three categories: hot, new and top. Each listing returns at most 1000 posts, so even with no overlap between the categories I could get a maximum of 3000 posts. This will not work.
<br><br>
2. **pushshift API**<br>
    This API returns a maximum of 1000 posts per GET request. Request parameters can restrict the posts to a given time period and sort them by a chosen field. We will make use of this.
<br><br>
We request the top 1000 scoring posts for the time frame ```(epoch - ten_days, epoch)``` 144 times, moving the window back ten days each time, to cover the entire four-year span.
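
As a minimal sketch of this windowing, the snippet below generates the ten-day windows and builds the request URL for each one without contacting the API. The endpoint and query parameters are the ones used in this notebook; the helper names `batch_windows` and `build_url` and the fixed `end_epoch` value are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/submission/search/"
TEN_DAYS = 10 * 24 * 60 * 60  # ten days in seconds


def batch_windows(end_epoch, n_batches, width=TEN_DAYS):
    """Yield (after, before) epoch pairs, walking backwards from end_epoch."""
    before = end_epoch
    for _ in range(n_batches):
        after = before - width
        yield after, before
        before = after  # the next window ends where this one started


def build_url(after, before, subreddit="india", limit=1000):
    """Build the search URL for one window, sorted by descending score."""
    params = {
        "score": ">0",
        "before": before,
        "after": after,
        "sort_type": "score",
        "sort": "desc",
        "subreddit": subreddit,
        "limit": limit,
    }
    # safe=">" keeps the comparison operator in "score=>0" unescaped
    return BASE_URL + "?" + urlencode(params, safe=">")


# Illustrative end point; the notebook uses the current time instead.
windows = list(batch_windows(end_epoch=1_586_000_000, n_batches=3))
first_url = build_url(*windows[0])
print(first_url)
```

Consecutive windows share a boundary, so the 144 requests tile the period without gaps.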

Note: To replicate the results, you may execute the following code blocks. For the data exploration notebook, however, use the provided reddit_data.csv rather than the one you would generate here.

In [117]:
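# get_update_array is defined in a later cell; run that cell first.
# datetime.now().timestamp() gives the current Unix time; utcnow().timestamp()
# would be offset by the local UTC offset because utcnow() returns a naive datetime.
start_timestamp = int(datetime.now().timestamp())
url = "https://api.pushshift.io/reddit/submission/search/?score=>0&before={}&after={}&sort_type=score&sort=desc&subreddit=india&limit=1000"
dataset = []
epoch = start_timestamp
ten_days = 10*24*60*60  # ten days in seconds
epoch_prev = epoch - ten_days
post_counts = 0
total_years = 4
# 3 ten-day batches per month over 4 years = 144 requests
for _ in range(total_years*12*3):
    final_url = url.format(epoch, epoch_prev)  # before=epoch, after=epoch_prev
    response = requests.get(final_url, headers={'User-Agent': "test reddit app"})
    data = response.json()
    posts = data['data']
    for post in posts:
        post_counts = post_counts + 1
        update_array = get_update_array(post)
        dataset.append(update_array)
    # Slide the window back by ten days
    epoch = epoch_prev
    epoch_prev = epoch - ten_days
    print(post_counts)  # running total of collected posts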

1000
2000
3000
...
144000
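
The collection loop assumes that every request succeeds and returns JSON with a `data` key. A hedged sketch of a more defensive fetch step is shown below. The helper `fetch_posts` and its `get_json` argument are hypothetical and not part of this notebook; the network call is replaced by a stand-in so the example runs offline.

```python
import time


def fetch_posts(get_json, url, retries=3, delay=1.0):
    """Call get_json(url) and return its 'data' list, retrying on failure.

    get_json is any callable that returns the decoded JSON body as a dict,
    for example: lambda u: requests.get(u).json()
    """
    for attempt in range(1, retries + 1):
        try:
            return get_json(url)["data"]
        except Exception as exc:  # network error, invalid JSON, or missing key
            print(f"attempt {attempt} failed: {exc!r}")
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    return []  # give up on this batch rather than abort the whole run


# Stand-in for the network call: fails once, then succeeds.
calls = {"n": 0}


def flaky_get_json(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated timeout")
    return {"data": [{"title": "example post"}]}


posts = fetch_posts(flaky_get_json, "https://example.invalid/", delay=0)
print(len(posts))
```

Returning an empty list for a failed batch keeps the run going, at the cost of silently missing ten days of posts, so the failures printed along the way are worth checking.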


```get_update_array(post)``` takes a dictionary describing a single post, extracts the fields needed for data exploration and model training, and returns them as a list to be appended to the final dataset.
   

In [123]:
def get_update_array(post):
    # Fields read unconditionally; a missing key raises KeyError
    title = post['title']
    author = post['author']
    created_utc = post['created_utc']
    self_post = post['is_self']
    score = post['score']
    over_18 = post['over_18']
    num_comments = post['num_comments']

    # Optional fields: fall back to a default when the key is absent
    is_original_content = post.get('is_original_content')
    self_text = post.get('selftext', "")
    flair = post.get('link_flair_text')

    update_array = [title, flair, score, num_comments, author, is_original_content, created_utc, self_post, self_text, over_18]
    return update_array
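
The function treats three fields as optional and the rest as required. The sketch below illustrates that difference on a made-up post dictionary; the sample values and the `OPTIONAL_DEFAULTS` name are hypothetical, while the keys and defaults are the ones the function uses.

```python
# Hypothetical post containing only the required keys.
post = {
    "title": "Example title",
    "author": "example_user",
    "created_utc": 1586106160,
    "is_self": True,
    "score": 64,
    "over_18": False,
    "num_comments": 78,
}

# Optional keys and the defaults get_update_array falls back to.
OPTIONAL_DEFAULTS = {
    "is_original_content": None,
    "selftext": "",
    "link_flair_text": None,
}

# dict.get returns the default instead of raising when a key is absent.
resolved = {key: post.get(key, default) for key, default in OPTIONAL_DEFAULTS.items()}
print(resolved)

# Required keys are read with post[key], so a missing one raises KeyError.
incomplete = {k: v for k, v in post.items() if k != "title"}
try:
    incomplete["title"]
    missing_required_raises = False
except KeyError:
    missing_required_raises = True
print(missing_required_raises)
```

A post lacking any of the required keys would therefore stop the collection loop, which may be worth guarding against if the API response format varies.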

We convert the list to a pandas DataFrame, which forms the basis of our data exploration, and save it in CSV format.

In [118]:
df = pd.DataFrame(dataset, columns =['title', 'flair', 'score', 'num_comments', 'author', 'is_original_content', 'created_utc', 'self_post', 'self_text', 'over_18'])
df.head(10)

| | title | flair | score | num_comments | author | is_original_content | created_utc | self_post | self_text | over_18 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Even the poorest are supporting Modi in this. | Politics | 64 | 78 | hungarywolf | False | 1586106160 | False | | False |
| 1 | Someone tried to sell Statue of Unity on Olx. ... | Non-Political | 50 | 143 | Athar147 | False | 1586105586 | False | | False |
| 2 | Captured India with my phone. | Photography | 49 | 202 | random_saiyajin | False | 1586102173 | False | | False |
| 3 | The Irony | | 43 | 109 | blue_mark | False | 1586100822 | True | My entire house has it's lights turned off and... | False |
| 4 | You guys are too impure to understand Modi Ji'... | Coronavirus | 39 | 81 | AdmiralSP | False | 1586100637 | True | It's very scientific.\nI'm a scientist.\n\nEve... | False |
| 5 | BSF soldier lights candles in his bunker. | \| Unverified Content / Disreputed Source \| | 35 | 4 | zxkool | False | 1586106901 | False | | False |
| 6 | Posting again because stupidity was on show to... | Coronavirus | 34 | 21 | msbuttergourd | False | 1586108196 | False | | False |
| 7 | During #9PM9minute All India demand of electri... | Non-Political | 32 | 107 | The_andh_bhakth | False | 1586114141 | False | | False |
| 8 | Fuck this. Our country will never learn. | Coronavirus | 32 | 349 | youdidWHaAtnow | False | 1586101383 | True | People are bursting crackers. They're playing ... | False |
| 9 | Antila World's most expensive home - lights of... | Coronavirus | 31 | 61 | neilupinto | False | 1586105379 | False | | False |


In [122]:
df.to_csv('reddit_data.csv')