# Reddit Data Extraction

A bot is deployed on [r/India](https://www.reddit.com/r/India/ "India: United We Stand") to scrape posts and collect their text, comments, and flairs.

In [5]:
import praw
import pprint
import pandas as pd

### Instantiating the PRAW API for data collection 

In [6]:
# Create and authenticate the Reddit instance for the web app bot
bot = praw.Reddit(client_id='td74lTDXbZJWoQ',
                  client_secret='UD_Lp2-7JhCKMdxfOI5pUvSqTqU',
                  redirect_uri='http://34.73.225.220:4444',
                  user_agent='AantiNashonalBot')

bot.read_only = True  # read-only mode: the bot only fetches submissions
subreddit = bot.subreddit('india')
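
The credentials above are hard-coded in the notebook. As a hedged alternative, the sketch below reads them from environment variables instead. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are assumptions made for illustration and are not part of the original notebook; the returned dictionary uses the same keyword names that are passed to `praw.Reddit` above.

```python
import os


def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "redirect_uri": "REDDIT_REDIRECT_URI",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if any variable is not set
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Intended usage (requires praw and the variables above to be set):
# bot = praw.Reddit(**load_reddit_credentials())
```

Keeping the secret out of the notebook means the file can be shared without exposing the app's credentials.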

### Flairs present on r/India (as of April 2020)

In [7]:
flairs = ['Politics', 'Photography', 'Policy/Economy', 'AskIndia', 'Sports', 'Non-Political', 'Scheduled', 
          'Science/Technology', 'Food', 'Business/Finance', 'Coronavirus', 'AMA', '[R]eddiquette']

### Data Scraping
I scrape and sample the data flair by flair, requesting the same number of posts for each flair, in order to avoid class imbalance.

**Features:**
>1. **"body"** : the submission's self text
>2. **"comment"** : the concatenated top-level comments of the submission
>3. **"created"** : timestamp of post creation (in UTC)
>4. **"id"** : unique Base36 id that identifies a submission
>5. **"title"** : the title of the submission
>6. **"url"** : the submission URL
>7. **"label"** : the flair associated with the submission
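
Two of these fields are encoded rather than human-readable: `created` is a Unix timestamp in seconds and `id` is a Base36 string. A minimal sketch of decoding both with the standard library follows. The helper names `created_to_datetime` and `base36_to_int` are illustrative assumptions, and the sample values are taken from the first row of the scraped output.

```python
from datetime import datetime, timezone


def created_to_datetime(created_utc):
    """Convert a `created` value (Unix seconds, UTC) to a timezone-aware datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def base36_to_int(submission_id):
    """Decode a Base36 submission id (digits 0-9, letters a-z) to an integer."""
    return int(submission_id, 36)


print(created_to_datetime(1586589000.0).isoformat())  # 2020-04-11T07:10:00+00:00
print(base36_to_int("fyyx8c"))
```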

In [8]:
def sampled_flairwise(size):
    """Collect up to `size` submissions per flair from the global `subreddit`."""
    sub_list = []

    for flair in flairs:
        # Restrict the subreddit search to one flair at a time
        query = "flair:{}".format(flair)
        for submission in subreddit.search(query, limit=size):

            # Expand every "load more comments" stub before reading the comments
            submission.comments.replace_more(limit=None)
            # Concatenate the bodies of all top-level comments into one string
            comments = ''
            for top_level_comment in submission.comments:
                comments += top_level_comment.body

            sub_dict = {
                "body": submission.selftext,
                "comment": comments,
                "created": submission.created_utc,
                "id": submission.id,
                "title": submission.title,
                "url": submission.url,
                "label": submission.link_flair_text
            }
            sub_list.append(sub_dict)

    return sub_list
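
Search results can carry a label that is not in the `flairs` list (the sample output includes `Politics [Megathread]`), and a flair may return fewer posts than requested. As a sketch of how one might check the resulting class balance, the helper below counts rows per label and flags unexpected labels. `label_report` is a hypothetical helper, and the rows passed to it are small made-up stand-ins for `sub_list`.

```python
from collections import Counter


def label_report(rows, expected_labels):
    """Count rows per label and list labels that are not in expected_labels."""
    counts = Counter(row["label"] for row in rows)
    # key=str keeps the sort safe if a submission has no flair (label is None)
    unexpected = sorted(
        (label for label in counts if label not in expected_labels), key=str
    )
    return counts, unexpected


# Made-up stand-in rows; in the notebook these would come from sampled_flairwise()
rows = [
    {"label": "Politics"},
    {"label": "Politics"},
    {"label": "Politics [Megathread]"},
    {"label": "Food"},
]
counts, unexpected = label_report(rows, {"Politics", "Food"})
print(counts["Politics"], unexpected)
```

In the notebook, the call would look like `label_report(sub_list, set(flairs))`.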

In [9]:
sub_list = sampled_flairwise(200)
flairs_df = pd.DataFrame(sub_list)

In [10]:
pd.set_option("display.max_rows", None)
flairs_df

| | body | comment | created | id | title | url | label |
|---|---|---|---|---|---|---|---|
| 0 | | Source?Damn! These south Indians always get th... | 1586589000.0 | fyyx8c | The wealth inequality in India is truly horrif... | https://i.redd.it/3rol63nk35s41.jpg | Politics |
| 1 | Fuck all religion. Fuck Hindusim, fuck Islam, ... | I don't think "atheist" or "none" is an option... | 1582697000.0 | f9outu | Fuck all Religion | https://www.reddit.com/r/india/comments/f9outu... | Politics |
| 2 | | I think it was Gomie from Breaking Bad who sai... | 1587346000.0 | g4jmo6 | Nisha Jindal, with 10k FB fans, turns out to b... | https://timesofindia.indiatimes.com/india/nish... | Politics |
| 3 | | Before I used laugh on his stupidity , now it'... | 1587101000.0 | g2vqkr | Prime Time | https://i.redd.it/1jsn588fdbt41.jpg | Politics |
| 4 | | Because Akshay Canadian Kumar bangs Thali.It's... | 1587807000.0 | g7qs3q | BJP wants us to see Indian Sonia Gandhi as Ita... | https://theprint.in/opinion/pov/bjp-wants-us-i... | Politics |
| 5 | | Can someone tell me what world are we living i... | 1587439000.0 | g57gd7 | Muslims are feeding you and carrying your dead... | http://muslimmirror.com/eng/muslims-are-feedin... | Politics |
| 6 | | # PLEASE DON'T FREAK OUT,\n\n# All essential s... | 1585061000.0 | fo661m | "From midnight the entire country will go unde... | https://twitter.com/TheQuint/status/1242460593... | Politics [Megathread] |
| 7 | | I have a better theory. Who has a lot of money... | 1587620000.0 | g6gz0e | Arnab Goswami alleges physical assault by Cong... | https://www.theweek.in/news/india/2020/04/23/a... | Politics |
| 8 | | You criticise jumlabaazis/corruption/high hand... | 1587616000.0 | g6gbol | Mumbai Police Meri Jaan | https://i.redd.it/8gk6e582xhu41.jpg | Politics |
| 9 | | The flights are still showing scheduled status... | 1586840000.0 | g0z3a8 | Breaking: National Lockdown extended May-3 | https://twitter.com/LiveLawIndia/status/124992... | Politics |


In [None]:
flairs_df.to_csv('data200.csv', index=False)