# Reddit Data Extraction

A read-only bot deployed on [r/India](https://www.reddit.com/r/India/ "India: United We Stand") scrapes posts to collect the data used for flair classification.

In [1]:
import praw
import pprint
import pandas as pd

### Instantiating the PRAW API for data collection 

In [2]:
# instance and authentication of the Web App bot
bot = praw.Reddit(client_id='td74lTDXbZJWoQ',
                     client_secret='UD_Lp2-7JhCKMdxfOI5pUvSqTqU',
                     redirect_uri='http://34.73.225.220:4444',
                     user_agent='AantiNashonalBot')

bot.read_only = True # A submission read-only bot
subreddit = bot.subreddit('india')
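
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read the file. One possible alternative, sketched here under the assumption that the credentials are exported as environment variables, is to build the constructor arguments from the environment. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are hypothetical and not part of the original notebook.

```python
import os


def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if anything is missing
    missing = [name for name in required.values() if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in required.items()}


# Intended use (needs praw installed and the variables set):
# bot = praw.Reddit(**load_reddit_credentials())
```

Passing a plain dict as `env` makes the helper easy to try out without touching the real environment.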

### Flairs present on r/India (As of April 2020)

In [3]:
flairs = ['Politics', 'Photography', 'Policy/Economy', 'AskIndia', 'Sports', 'Non-Political', 'Scheduled', 
          'Science/Technology', 'Food', 'Business/Finance', 'Coronavirus', 'AMA', '[R]eddiquette']

### Data Scraping
I scraped and sampled the data flair by flair to avoid class imbalance. The submission features below looked relevant to, and correlated with, the flair, so I kept them for classification.

**Features:**
>1. **"body"** : the submission self text
>2. **"comment"** : top comments of every submission
>3. **"created"** : timestamp of post creation (Unix epoch seconds, UTC)
>4. **"id"** : unique Base36 id to identify a submission
>5. **"title"** : the title of the submission
>6. **"url"** : the submission URL
>7. **"label"** : the flair associated with the submission
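
To make the "created" and "id" fields more concrete, here is a small sketch that converts an epoch timestamp to a UTC datetime and decodes a Base36 id to an integer. The helper names are illustrative, and the two example values are borrowed from the sample rows shown later in this notebook, where the timestamp appears in rounded form.

```python
from datetime import datetime, timezone

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def created_to_datetime(created_utc):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def id_to_int(submission_id):
    """Decode a Base36 submission id into an integer."""
    return int(submission_id, 36)


def int_to_id(number):
    """Encode a non-negative integer as a lowercase Base36 string."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


print(created_to_datetime(1585544000.0).isoformat())
print(id_to_int("frkhq3"))
```

The integer form of the id is handy for sorting or range checks; `int_to_id` maps it back to the string form.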

In [4]:
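def sampled_flairwise(size):
    """Collect up to `size` submissions per flair as a list of dicts."""
    sub_list = []

    for flair in flairs:
        # Reddit search query restricted to a single flair
        query = "flair:{}".format(flair)
        for submission in subreddit.search(query, limit=size):

            # Expand every "load more comments" stub so the whole tree is loaded
            # (limit=None can be slow on large threads)
            submission.comments.replace_more(limit=None)
            # Concatenate the bodies of all top-level comments into one string
            comments = ''
            for top_level_comment in submission.comments:
                comments += top_level_comment.body

            sub_dict = {
                "body" : submission.selftext,
                "comment" : comments,
                "created" : submission.created_utc,
                "id" : submission.id,
                "title" : submission.title,
                "url" : submission.url,
                "label" : submission.link_flair_text
            }
            sub_list.append(sub_dict)

    return sub_list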
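
Because the loop above appends comment bodies directly to one another, the last word of one comment runs into the first word of the next. A small sketch of joining with a separator instead is shown here. `join_comment_bodies` and the default separator are illustrative choices, not part of the original notebook, and the bodies are made-up strings rather than scraped data.

```python
def join_comment_bodies(bodies, separator=" "):
    """Join comment bodies with a separator, skipping empty ones."""
    cleaned = (body.strip() for body in bodies)
    return separator.join(body for body in cleaned if body)


example_bodies = ["First comment.", "  ", "Second comment."]
print(join_comment_bodies(example_bodies))
```

Inside `sampled_flairwise` this could be called as `join_comment_bodies(c.body for c in submission.comments)`.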

In [5]:
sub_list = sampled_flairwise(200)
flairs_df = pd.DataFrame(sub_list)
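
Since `limit` is only an upper bound, a search may return fewer than `size` results for rarely used flairs, so it can be worth checking how many rows each label actually received. A minimal sketch using only the standard library on a list of dicts shaped like `sub_list` follows. The records are toy data and `label_counts` is a hypothetical helper.

```python
from collections import Counter


def label_counts(records):
    """Count how many records carry each flair label."""
    return Counter(record["label"] for record in records)


toy_records = [
    {"id": "a1", "label": "Politics"},
    {"id": "a2", "label": "Politics"},
    {"id": "a3", "label": "Food"},
]
counts = label_counts(toy_records)
for label, count in counts.most_common():
    print(label, count)
```

On the DataFrame itself, `flairs_df["label"].value_counts()` gives the same tally.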

In [6]:
flairs_df.sample(5)

| | body | comment | created | id | title | url | label |
|---|---|---|---|---|---|---|---|
| 196 | >Hon’ble Shri Narendra Modi \n>Prime Minister ... | Modi: well now i'm not doing itRecently review... | 1585544000.0 | frkhq3 | Full text of Rahul Gandhi's letter to PM Modi ... | https://www.reddit.com/r/india/comments/frkhq3... | Politics |
| 2018 | | May be the most affected section of society. A... | 1587216000.0 | g3nbtp | Migrant Workers | https://i.redd.it/reea3tdntkt41.jpg | Coronavirus |
| 67 | After the shit show Janta curfew was, I didn’t... | I agree for most of it.\n\nI dissent on "Gandh... | 1586102000.0 | fvfsi0 | No matter how much you hate Modi, you can’t de... | https://www.reddit.com/r/india/comments/fvfsi0... | Politics |
| 1912 | | | 1587712000.0 | g73uhi | Mark Zuckerberg just gave Asia’s richest man a... | https://www.livemint.com/news/india/mark-zucke... | Business/Finance |
| 2235 | Hi, my name is Milan Vaishnav. I'm a senior fe... | Hi Milan, thanks so much for When Crime Pays! ... | 1543927000.0 | a309io | Hi, I'm Milan Vaishnav and you can Ask Me Anyt... | https://www.reddit.com/r/india/comments/a309io... | AMA |


In [7]:
flairs_df.to_csv('data200.csv', index=False)