# Data Acquisition
For our project, we will draw on the three most popular Reddit communities related to Data Science:
* Machine Learning - [r/MachineLearning](https://www.reddit.com/r/MachineLearning/)
* Artificial Intelligence - [r/artificial](https://www.reddit.com/r/Artificial/)
* Data Science - [r/DataScience](https://www.reddit.com/r/DataScience/)

We will extract the required information from the Reddit API using [PRAW](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html) (The Python Reddit API Wrapper), a Python package that wraps the API.

In [20]:
# Importing important libraries
import praw
import pandas as pd
import configparser
import datetime as dt

The credentials required to access the API can be obtained from [reddit.com/prefs/apps](https://www.reddit.com/prefs/apps).
They are supplied through 'reddit_credentials.ini' so that sensitive information stays out of the notebook.

In [8]:
# For reading configuration files for Reddit Credentials
config = configparser.ConfigParser()
config.read('reddit_credentials.ini')

# Storing credential info in local variables
user_agent = config.get('credentials', 'user_agent')
client_id = config.get('credentials', 'client_id')
client_secret = config.get('credentials', 'client_secret')
redirect_url = config.get('credentials', 'redirect_url')
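The layout that the cell above expects from 'reddit_credentials.ini' can be sketched as follows. This is a minimal illustration, not the project's actual file: the values in `SAMPLE_INI` are placeholders, and `load_credentials` and `REQUIRED_KEYS` are hypothetical helper names introduced here. The sketch also checks the return value of `config.read()`, since a missing file would otherwise only show up later as a missing-section error.

```python
import configparser
import os
import tempfile

# Placeholder values for illustration only; real ones come from reddit.com/prefs/apps.
SAMPLE_INI = """\
[credentials]
user_agent = my-data-science-scraper/0.1
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
redirect_url = http://localhost:8080
"""

REQUIRED_KEYS = ("user_agent", "client_id", "client_secret", "redirect_url")


def load_credentials(path):
    """Read the [credentials] section and fail loudly if anything is missing."""
    config = configparser.ConfigParser()
    # config.read() returns the list of files it parsed; an empty list means
    # the file was not found.
    if not config.read(path):
        raise FileNotFoundError(f"Could not read credentials file: {path}")
    missing = [key for key in REQUIRED_KEYS
               if not config.has_option("credentials", key)]
    if missing:
        raise KeyError(f"Missing keys in [credentials]: {missing}")
    return {key: config.get("credentials", key) for key in REQUIRED_KEYS}


# Demonstration with a temporary file so that no real secrets are involved.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "reddit_credentials.ini")
    with open(sample_path, "w") as handle:
        handle.write(SAMPLE_INI)
    credentials = load_credentials(sample_path)

print(sorted(credentials))
```

Keeping the real file out of version control (for example by listing it in `.gitignore`) is what actually protects the secrets; the loader only makes a missing or incomplete file easier to diagnose.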

In [9]:
# Creating a read-only Reddit instance (no username or password is supplied)
reddit = praw.Reddit(user_agent = user_agent,
                    client_id = client_id,
                    client_secret = client_secret,
                    redirect_uri = redirect_url)  # PRAW's setting is named 'redirect_uri'

## Extracting Top Posts
We will extract the top posts of all time from these subreddits, requesting up to 3000 posts in total from the combined listing (about 1000 per subreddit on average). For each post we record the URL and ID, title, subreddit, flair, number of comments, creation time, upvote ratio, and score.
We will use this information later for analysis and to draw insights.

In [10]:
# Requesting up to 3000 top posts of all time from the combined listing of the three subreddits
posts = reddit.subreddit('MachineLearning+artificial+datascience').top(time_filter = 'all', limit = 3000)
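The combined listing above ranks posts across all three communities together, so it does not by itself guarantee an equal share from each. If an explicit per-subreddit count is wanted, one option is to query each subreddit separately, as in the sketch below. This is an alternative under stated assumptions, not the approach used in this notebook: `fetch_top_posts` is a hypothetical helper, and `_FakeReddit` / `_FakeSubreddit` are stand-ins that mimic only the `subreddit(name).top(time_filter=..., limit=...)` calls so the example runs without network access.

```python
from types import SimpleNamespace


def fetch_top_posts(reddit, subreddit_names, per_subreddit_limit=1000):
    """Collect top posts of all time from each subreddit separately.

    `reddit` is assumed to be a praw.Reddit instance, or any object exposing
    the same subreddit(name).top(time_filter=..., limit=...) interface.
    """
    rows = []
    for name in subreddit_names:
        listing = reddit.subreddit(name).top(time_filter="all",
                                             limit=per_subreddit_limit)
        for post in listing:
            rows.append({"subreddit": name,
                         "post_id": post.id,
                         "score": post.score})
    return rows


class _FakeSubreddit:
    """Stand-in that yields at most three made-up posts per subreddit."""

    def __init__(self, name):
        self.name = name

    def top(self, time_filter="all", limit=None):
        count = 3 if limit is None else min(3, limit)
        for i in range(count):
            yield SimpleNamespace(id=f"{self.name}_{i}", score=100 - i)


class _FakeReddit:
    """Stand-in for praw.Reddit, used only to exercise the helper offline."""

    def subreddit(self, name):
        return _FakeSubreddit(name)


demo_rows = fetch_top_posts(_FakeReddit(),
                            ["MachineLearning", "artificial", "datascience"],
                            per_subreddit_limit=2)
print(len(demo_rows))
```

With a real client, the call would be `fetch_top_posts(reddit, [...], per_subreddit_limit=1000)`, and the resulting rows could be passed to `pd.DataFrame` in the same way as `posts_list`.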

In [11]:
# Creating a DataFrame of the top posts along with other attributes for analysis

posts_list = []

for post in posts:
    posts_list.append({
        'post_id' : post.id,
        'post_title' : post.title,
        'subreddit' : post.subreddit.display_name,  # store the name as a string, not the Subreddit object
        'time_created' : post.created_utc,
        'post_url' : post.url,
        'flair_text' : post.link_flair_text,
        'score' : post.score,
        'comments' : post.num_comments,
        'upvote_ratio' : post.upvote_ratio
    })
    
posts_df = pd.DataFrame(posts_list)

In [24]:
# Converting the Unix timestamp (seconds since the epoch) to a datetime; note that fromtimestamp() returns local time
posts_df['date-time'] = posts_df['time_created'].apply(lambda x: dt.datetime.fromtimestamp(x))

# Creating 'Year' column
posts_df['year'] = posts_df['date-time'].dt.year

# Dropping 'time_created' column
posts_df.drop('time_created', axis = 1, inplace = True)
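Because `created_utc` is a Unix timestamp, `dt.datetime.fromtimestamp(x)` without a timezone argument yields the local time of the machine running the notebook. If timestamps that stay in UTC are preferred, a timezone-aware conversion is one option. The sketch below uses only the standard library; `utc_datetime` is a hypothetical helper name, and the example timestamp is arbitrary rather than taken from the dataset.

```python
import datetime as dt


def utc_datetime(created_utc):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)


example = utc_datetime(1_600_000_000)
print(example.isoformat())  # 2020-09-13T12:26:40+00:00
```

On the DataFrame this could be applied as `posts_df['time_created'].apply(utc_datetime)`; pandas also offers `pd.to_datetime(posts_df['time_created'], unit='s', utc=True)` for the same purpose.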

In [26]:
# Saving our posts data in .csv format
posts_df.to_csv("Top_Posts.csv", header = True, index = False)

In [27]:
# Displaying the content of saved Post Data
posts_df = pd.read_csv('Top_Posts.csv')
posts_df.sample(10)

| | post_id | post_title | subreddit | post_url | flair_text | score | comments | upvote_ratio | date-time | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 10mmm38 | As a hiring manager - this, this right here | datascience | https://i.redd.it/fk95v2ghilea1.png | Career | 2495 | 140 | 0.96 | 2023-01-27 14:48:21 | 2023 |
| 180 | vljjur | How the AI be walking on the 17th generation | artificial | https://i.redd.it/abl4dixjf2891.gif | Discussion | 1243 | 19 | 0.98 | 2022-06-27 01:24:27 | 2022 |
| 896 | 10pb1y3 | [P] I launched “CatchGPT”, a supervised model ... | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | Project | 492 | 211 | 0.75 | 2023-01-30 19:09:14 | 2023 |
| 49 | qrjmge | Stop asking data scientist riddles in interviews! | datascience | https://i.redd.it/jjtjirwagyy71.jpg | Discussion | 2283 | 269 | 0.94 | 2021-11-11 11:52:13 | 2021 |
| 202 | dv7mdc | "If you torture the data long enough, it will ... | datascience | https://i.redd.it/5rg06b0c38y31.png | Fun/Trivia | 1151 | 34 | 0.98 | 2019-11-12 09:20:56 | 2019 |
| 2980 | 5k35un | Deep Learning Enables You to Hide Screen when ... | artificial | http://ahogrammer.com/2016/11/15/deep-learning... | | 82 | 2 | 0.95 | 2016-12-24 14:06:16 | 2016 |
| 820 | gb08da | [P] I wrote an API to build neural networks in... | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | Project | 522 | 38 | 0.98 | 2020-04-30 17:33:50 | 2020 |
| 1461 | db8c4u | [N] UC Berkeley's CS 285: Deep Reinforcement L... | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | News | 362 | 31 | 0.97 | 2019-09-30 08:05:51 | 2019 |
| 2065 | jjbdfm | Probability practice problems | datascience | https://www.reddit.com/r/datascience/comments/... | Job Search | 247 | 31 | 0.98 | 2020-10-27 22:21:07 | 2020 |
| 2102 | amsdk2 | Some Important Data Science Tools that aren’t ... | datascience | https://towardsdatascience.com/some-important-... | | 235 | 42 | 0.96 | 2019-02-03 18:34:09 | 2019 |


## Extracting Comments
Using the 'post_id' of each top post, we will extract all of its comments. We will store them in a separate dataset containing 'post_id' and 'comment', which forms the textual dataset for training our large NLP model (GPT-3.5-turbo). We will also use this data to analyse the sentiment around different topics and to recognize emotions in the text.

In [None]:
# Creating DataFrame of all the comments available in the Top Posts

comments_list = []

for post_id in posts_df['post_id']:
    submission = reddit.submission(post_id)
    submission.comments.replace_more(limit = None)
    
    for comment in submission.comments.list():
        comments_list.append({
            'post_id' : post_id,
            'comment' : comment.body
        })
        
comments_df = pd.DataFrame(comments_list)

In [None]:
# Saving our comments data in .csv format
comments_df.to_csv('Top_Posts_Comments.csv', header = True, index = False)

In [15]:
# Displaying the content of our Comments Data
comments_df = pd.read_csv('Top_Posts_Comments.csv')
comments_df.sample(10)

| | post_id | comment |
|---|---|---|
| 61970 | 11w03sy | !remindme one week |
| 146623 | 2lmo0l | Hello Dr Hinton, Im doing a case study in my c... |
| 96224 | r76igz | Transformers robots in disguise |
| 62757 | ulvdgm | Love your work, scared of your name, uncertain... |
| 185205 | kf2j1l | What does dagster bring to airflow that airflo... |
| 200955 | riup34 | Great comment! I'm a hybrid data engineer/data... |
| 2316 | hohvgq | Average DS guy from a business undergrad. Don’... |
| 169021 | b3zlha | gpt-2 finish this\n\n |
| 135045 | 65ukie | You obvoiusly need to search better in the lat... |
| 214825 | bl6gbm | [deleted] |
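The sample above includes a `[deleted]` row, which carries no text of its own. Before the comments are used as a textual dataset, such placeholder bodies may be worth filtering out. The sketch below shows one possible filter; `is_usable_comment` and `PLACEHOLDER_BODIES` are hypothetical names, and treating `[removed]` as a second placeholder is an assumption about how Reddit marks moderator-removed comments, not something shown in this notebook's output.

```python
# Bodies treated as placeholders; '[removed]' is an assumption (see the text above).
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def is_usable_comment(body):
    """Return True for non-empty text that is not a deletion placeholder."""
    if not isinstance(body, str):
        return False  # e.g. NaN after reading an empty cell back from CSV
    text = body.strip()
    return bool(text) and text not in PLACEHOLDER_BODIES


# Three rows taken from the sample shown above.
sample_rows = [
    {"post_id": "bl6gbm", "comment": "[deleted]"},
    {"post_id": "r76igz", "comment": "Transformers robots in disguise"},
    {"post_id": "b3zlha", "comment": "gpt-2 finish this\n\n"},
]

kept = [row for row in sample_rows if is_usable_comment(row["comment"])]
print(len(kept))
```

On the full dataset the same predicate could be applied with `comments_df[comments_df['comment'].apply(is_usable_comment)]`.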


In [19]:
print("Shape of Posts Data - {}".format(posts_df.shape))
print("Shape of Comments Data - {}".format(comments_df.shape))

Shape of Posts Data - (2987, 9)
Shape of Comments Data - (223174, 2)


We have successfully extracted ~223K comments from the 2,987 top posts collected from three popular subreddits related to Data Science.