### We first install <strong>PRAW (Python Reddit API Wrapper)</strong><br>We need this library because we will use the Reddit API to collect the data we require; PRAW lets Python interact with that API.<br>Documentation: https://praw.readthedocs.io/en/latest/index.html

In [1]:
!pip install praw



### We import praw and the other necessary libraries<br>pandas for dataframe manipulation, tqdm for progress bars, and datetime to convert Unix timestamps

In [15]:
import praw
import pandas as pd
from tqdm.notebook import tqdm
from datetime import datetime

### We store the client_id, client_secret and user_agent in a text file called login.txt (space separated)<br>They are kept in a separate text file because they are private credentials. Anyone who wants to test the application can obtain this file through secure means or use their own credentials for the Reddit API.

In [16]:
with open('./login.txt') as f:  # read-only access is enough; the file is closed automatically
    user, key, app = f.read().split()
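
The credential-loading step can also be written as a small helper that fails early when the file does not contain exactly three fields. This is only a sketch: the name `load_credentials` is hypothetical, and it assumes, as the cell above does, that none of the three values contains whitespace (a user agent with spaces would need a different file layout).

```python
from pathlib import Path


def load_credentials(path="login.txt"):
    """Read client_id, client_secret and user_agent from a whitespace-separated file."""
    fields = Path(path).read_text().split()
    if len(fields) != 3:
        # Fail early with a readable message instead of an unpacking error
        raise ValueError(f"expected 3 fields in {path}, found {len(fields)}")
    client_id, client_secret, user_agent = fields
    return client_id, client_secret, user_agent
```

The returned tuple can be unpacked the same way as before, for example `user, key, app = load_credentials('./login.txt')`.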

### Initializing the application

In [3]:
reddit = praw.Reddit(client_id=user, client_secret=key, user_agent=app)

### Now we can access each post in a subreddit filtered by hottest, newest and so on.

In [4]:
subreddit = 'India'

In [5]:
hot_posts = reddit.subreddit(subreddit).hot(limit=100000)

In [6]:
posts = []

### The attributes of each Reddit post (id, subreddit, title, body, flair, url, score, number of comments and creation time) are appended to an empty list, which is then used to create a dataframe (the url is just reddit.com/r/subreddit_name/comments/post_id)

In [7]:
for post in tqdm(hot_posts):
    posts.append((post.id, post.subreddit, post.title, post.selftext, post.link_flair_text, f'https://www.reddit.com/r/{subreddit}/comments/{post.id}', post.score, post.num_comments, post.created))


In [8]:
df = pd.DataFrame(posts, columns=['id', 'subreddit', 'title', 'body', 'flair', 'url', 'score', 'num_comments', 'dated'])
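
The row-building step in the loop above can be factored into a helper, which makes it easy to check without calling the Reddit API. The sketch below is an illustration, not the notebook's own code: `post_to_row` is a hypothetical name, and the `SimpleNamespace` object is a stand-in with made-up values in place of a real PRAW submission. It also stores the subreddit as a string via `str(...)` rather than as an object, which is a design choice of this sketch.

```python
from types import SimpleNamespace


def post_to_row(post, subreddit_name):
    """Flatten one post-like object into the tuple layout used for the dataframe."""
    url = f"https://www.reddit.com/r/{subreddit_name}/comments/{post.id}"
    return (post.id, str(post.subreddit), post.title, post.selftext,
            post.link_flair_text, url, post.score, post.num_comments, post.created)


# Stand-in object with made-up values; a real run would pass PRAW submissions.
fake_post = SimpleNamespace(id="abc123", subreddit="india", title="Example title",
                            selftext="", link_flair_text="AskIndia",
                            score=10, num_comments=2, created=0.0)
row = post_to_row(fake_post, "India")
print(row[5])  # https://www.reddit.com/r/India/comments/abc123
```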

### This function converts a Unix timestamp into a UTC date string in YYYY-MM-DD HH:MM:SS format

In [9]:
def date(ts):
    return datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

### We apply this function over the dated column

In [10]:
df['dated'] = df['dated'].apply(date)
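
`datetime.utcfromtimestamp` is deprecated as of Python 3.12. A timezone-aware equivalent of the conversion is sketched below; the name `date_utc` is hypothetical, and the output format is assumed to be the same as in the `date` function above.

```python
from datetime import datetime, timezone


def date_utc(ts):
    """Format a Unix timestamp (seconds since the epoch) as a UTC date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')


print(date_utc(0))  # 1970-01-01 00:00:00
```

It could be applied in the same way, for example `df['dated'].apply(date_utc)` on a column of raw timestamps.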

### This is what the data looks like for hottest posts

In [11]:
df.head(10)

| | id | subreddit | title | body | flair | url | score | num_comments | dated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | fqqdsg | india | Coronavirus (COVID-19) Megathread - News and U... | ###[Covid-19 Fundraisers & Donation Links](htt... | Coronavirus | https://www.reddit.com/r/India/comments/fqqdsg | 366 | 9255 | 2020-03-29 03:10:33 |
| 1 | fzub9j | india | Announcing r/IndiaMeme, our own sub for memes ... | HELLO YOU NICE PEOPLE. WE GOT REALLY TIRED OF ... | Announcement | https://www.reddit.com/r/India/comments/fzub9j | 143 | 50 | 2020-04-12 18:18:46 |
| 2 | g000ic | india | My favorite lockdown pic so far! |  | Coronavirus | https://www.reddit.com/r/India/comments/g000ic | 3248 | 84 | 2020-04-13 00:41:59 |
| 3 | g0bfmo | india | Covid-19: Kamal Nath says lockdown was delayed... |  | Coronavirus | https://www.reddit.com/r/India/comments/g0bfmo | 153 | 19 | 2020-04-13 11:56:58 |
| 4 | g014wc | india | Lost my Job, Sick Mother and Paralysed Dad, In... | Hi....It's really tough time for everyone. I r... | AskIndia | https://www.reddit.com/r/India/comments/g014wc | 904 | 106 | 2020-04-13 01:42:28 |
| 5 | g0c0o9 | india | In Chhattisgarh, 108 Out of 159 “Tableeghis” T... |  | Coronavirus | https://www.reddit.com/r/India/comments/g0c0o9 | 106 | 22 | 2020-04-13 12:38:16 |
| 6 | fzzzji | india | Behind closed doors, the biggest viruses are m... |  | Coronavirus | https://www.reddit.com/r/India/comments/fzzzji | 880 | 123 | 2020-04-13 00:40:34 |
| 7 | g0d4n0 | india | New cases increasingly in single digits, how K... |  | Coronavirus | https://www.reddit.com/r/India/comments/g0d4n0 | 46 | 2 | 2020-04-13 14:01:00 |
| 8 | g0bxrr | india | 30 ‘foreigners’ dead in Assam’s detention centres |  | Politics | https://www.reddit.com/r/India/comments/g0bxrr | 52 | 7 | 2020-04-13 12:32:20 |
| 9 | fzsccv | india | We can save you from Corona, But not from Stup... |  | Non-Political | https://www.reddit.com/r/India/comments/fzsccv | 2492 | 163 | 2020-04-12 15:29:10 |


### We can save this dataframe as a CSV file

In [12]:
df.to_csv('hot_posts.csv', index=False)

In [13]:
len(df)

787

### We can now wrap the steps above in a function that takes any listing of posts (new, hot, top) and returns the resulting dataframe

In [25]:
def process_posts(posts_obj):
    """Collect the posts of a PRAW listing into a dataframe with readable dates."""
    # pandas (pd) and datetime are already imported at the top of the notebook

    def date(ts):
        # Convert a Unix timestamp (in seconds) to a UTC date string
        return datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

    posts = []
    for post in posts_obj:
        # The url is built from the global `subreddit` name defined earlier
        posts.append((post.id, post.subreddit, post.title, post.selftext, post.link_flair_text, f'https://www.reddit.com/r/{subreddit}/comments/{post.id}', post.score, post.num_comments, post.created))
    df = pd.DataFrame(posts, columns=['id', 'subreddit', 'title', 'body', 'flair', 'url', 'score', 'num_comments', 'dated'])
    df['dated'] = df['dated'].apply(date)
    return df

### We can obtain the new posts object and pass it to the above function to get a dataset with the same columns as before, but containing the newest posts

In [18]:
new_posts = reddit.subreddit(subreddit).new(limit=10000)

In [19]:
df_new = process_posts(new_posts)

In [17]:
df_new.head(10)

| | id | subreddit | title | body | flair | url | score | num_comments | dated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | g0ehvg | india | Does anyone wanna chat? Idk. I'm really bored | Ik i could post this on several "making friend... | Non-Political | https://www.reddit.com/r/India/comments/g0ehvg | 1 | 0 | 2020-04-13 15:51:48 |
| 1 | g0ehuw | india | IAS Officer Srijana Gummalla returned to work ... |  | Non-Political | https://www.reddit.com/r/India/comments/g0ehuw | 1 | 0 | 2020-04-13 15:51:45 |
| 2 | g0eha8 | india | Every time I see Ramdev Baba or Sadhguru on a ... | It is gut-wrenching when sadhguru says tried t... | Non-Political | https://www.reddit.com/r/India/comments/g0eha8 | 2 | 0 | 2020-04-13 15:50:24 |
| 3 | g0eaol | india | Tablighi Jamaat and the Perils of Precipitous ... |  | Politics | https://www.reddit.com/r/India/comments/g0eaol | 1 | 0 | 2020-04-13 15:35:05 |
| 4 | g0e9a8 | india | HRs of India, do you reveal salary information... | Since CTC is so confidential and they ask us n... | AskIndia | https://www.reddit.com/r/India/comments/g0e9a8 | 5 | 3 | 2020-04-13 15:31:46 |
| 5 | g0e3nc | india | Is my anxiety of COVID-19 rational? | So we are all going through rough times. Some ... | AskIndia | https://www.reddit.com/r/India/comments/g0e3nc | 4 | 3 | 2020-04-13 15:19:04 |
| 6 | g0dofi | india | 101 years of Jallianwala Bagh massacre \| Backg... | \*\*\*\\\*Picture thread attached at the bottom, pl... | Non-Political | https://www.reddit.com/r/India/comments/g0dofi | 11 | 2 | 2020-04-13 14:44:55 |
| 7 | g0dvt3 | india | My Kitchen Garden |  | Food | https://www.reddit.com/r/India/comments/g0dvt3 | 5 | 0 | 2020-04-13 15:01:22 |
| 8 | g0dvma | india | Indian media is waging a holy war against Musl... |  | Coronavirus | https://www.reddit.com/r/India/comments/g0dvma | 1 | 1 | 2020-04-13 15:00:57 |
| 9 | g0dsfs | india | Found a new tracker |  | Coronavirus | https://www.reddit.com/r/India/comments/g0dsfs | 4 | 0 | 2020-04-13 14:53:52 |


In [20]:
df_new.to_csv('new_posts.csv', index=False)

In [21]:
len(df_new)

881

### The same can be done for the top posts<br>Note: Top posts can be filtered by time period: hour, day, week, month, year or all time. In our case we're using all time.

In [11]:
top_posts = reddit.subreddit(subreddit).top(time_filter='all', limit=100000)

In [12]:
df_top = process_posts(top_posts)
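
Since hot, new and top listings are requested in the same way, the choice of listing can be made a parameter. The sketch below is a hypothetical helper (`fetch_listing` is not part of PRAW) that looks up the listing method by name and passes `time_filter` only for top posts; it assumes the object it receives exposes `hot`, `new` and `top` methods with these keyword arguments, as a PRAW subreddit does.

```python
def fetch_listing(subreddit_obj, kind, limit=1000, time_filter="all"):
    """Return the listing named by `kind` ('hot', 'new' or 'top')."""
    if kind not in ("hot", "new", "top"):
        raise ValueError(f"unsupported listing kind: {kind!r}")
    kwargs = {"limit": limit}
    if kind == "top":
        # Only top listings take a time period (hour, day, week, month, year, all)
        kwargs["time_filter"] = time_filter
    return getattr(subreddit_obj, kind)(**kwargs)
```

With the objects defined in this notebook, a call such as `process_posts(fetch_listing(reddit.subreddit(subreddit), 'top', time_filter='year'))` would then produce the corresponding dataframe.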

In [13]:
df_top.head(10)

| | id | subreddit | title | body | flair | url | score | num_comments | dated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 981o7s | india | Will donate thrice the number of upvotes (amou... | >\*\*Note\*\*: If you want to know what this is al... | [R]eddiquette | https://www.reddit.com/r/India/comments/981o7s | 19714 | 836 | 2018-08-17 20:02:17 |
| 1 | 6f10op | india | Indian reply to NYtimes cartoon on Paris clima... |  | /r/all | https://www.reddit.com/r/India/comments/6f10op | 18239 | 1492 | 2017-06-03 20:47:25 |
| 2 | 8pymkp | india | The essence of the Indian soap opera, distille... |  | r/all | https://www.reddit.com/r/India/comments/8pymkp | 18200 | 949 | 2018-06-10 12:46:29 |
| 3 | f9outu | india | Fuck all Religion | Fuck all religion. Fuck Hindusim, fuck Islam, ... | Politics | https://www.reddit.com/r/India/comments/f9outu | 17838 | 4205 | 2020-02-26 14:10:49 |
| 4 | eev8g5 | india | German exchange Student at IIT Madras is being... |  | Politics | https://www.reddit.com/r/India/comments/eev8g5 | 11694 | 501 | 2019-12-24 11:10:46 |
| 5 | 4s5bpn | india | Tragedy of India |  | r/all | https://www.reddit.com/r/India/comments/4s5bpn | 11264 | 342 | 2016-07-10 20:40:21 |
| 6 | fmsjoc | india | Today's The Hindu |  | Coronavirus | https://www.reddit.com/r/India/comments/fmsjoc | 10955 | 180 | 2020-03-22 10:50:29 |
| 7 | 9dt64s | india | If you are not moved by this picture, I wish I... |  | Non-Political | https://www.reddit.com/r/India/comments/9dt64s | 10772 | 382 | 2018-09-07 18:48:18 |
| 8 | avafxp | india | Megathread: India-Pakistan border skirmish | There is a lot of news and speculation coming ... | [R]eddiquette | https://www.reddit.com/r/India/comments/avafxp | 10210 | 6923 | 2019-02-27 14:56:35 |
| 9 | fo661m | india | "From midnight the entire country will go unde... |  | Politics [Megathread] | https://www.reddit.com/r/India/comments/fo661m | 10046 | 1441 | 2020-03-24 22:38:46 |


In [14]:
df_top.to_csv('top_posts_all_time.csv', index=False)

In [24]:
len(df_top)

988

In [23]:
top_posts_year = reddit.subreddit(subreddit).top(time_filter='year', limit=1000)

In [26]:
df_top_year = process_posts(top_posts_year)

In [27]:
df_top_year.head(10)

| | id | subreddit | title | body | flair | url | score | num_comments | dated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cvk41x | india | Pick your poison wisely! |  | Food | https://www.reddit.com/r/India/comments/cvk41x | 2901 | 203 | 2019-08-26 14:07:19 |
| 1 | fp3ql3 | india | A watercolour painting of the Jama Masjid I di... |  | Non-Political | https://www.reddit.com/r/India/comments/fp3ql3 | 2874 | 81 | 2020-03-26 10:42:11 |
| 2 | c6ude3 | india | Catch it before it’s too late. |  | Politics | https://www.reddit.com/r/India/comments/c6ude3 | 2872 | 92 | 2019-06-29 12:19:57 |
| 3 | ecp2hp | india | Pune. My city. Standing strong. How about you? |  | Politics | https://www.reddit.com/r/India/comments/ecp2hp | 2866 | 161 | 2019-12-19 13:58:31 |
| 4 | bxks5y | india | Vegetable market in Sikkim |  | Non-Political | https://www.reddit.com/r/India/comments/bxks5y | 2870 | 127 | 2019-06-07 03:21:06 |
| 5 | c0gf1o | india | India Gate ( Arc-de-Triomphe" like archway in ... |  | Photography | https://www.reddit.com/r/India/comments/c0gf1o | 2861 | 114 | 2019-06-14 13:37:44 |
| 6 | fm9uy6 | india | He knew |  | Coronavirus | https://www.reddit.com/r/India/comments/fm9uy6 | 2858 | 74 | 2020-03-21 13:11:02 |
| 7 | cx0wwv | india | Every single day this spot is absolutely clean... |  | Politics | https://www.reddit.com/r/India/comments/cx0wwv | 2851 | 115 | 2019-08-29 21:26:36 |
| 8 | ch5ws5 | india | India supplied over two thirds of AIDS medicat... |  | Science/Technology | https://www.reddit.com/r/India/comments/ch5ws5 | 2852 | 232 | 2019-07-24 17:24:41 |
| 9 | ebq5qp | india | Today's Telegraph. Vive la révolution. |  | Politics | https://www.reddit.com/r/India/comments/ebq5qp | 2842 | 156 | 2019-12-17 11:27:57 |


In [28]:
df_top_year.to_csv('top_posts_year.csv', index=False)

In [29]:
len(df_top_year)

834