## Reddit Comment Collection

**IMT 547**:
Walker A., Wen Yi A., Ken M.

This notebook contains functions and processes to retrieve comments from reddit submissions. This will be used to generate dataframe and csv file containing comments for submissions.

### Part 1: Function to retrieve comments:

In [69]:
# Imports
import pandas as pd
import praw
import numpy as np
from collections import defaultdict
from praw.models import MoreComments
import re

# Get credentials from DEFAULT instance in praw.ini
reddit = praw.Reddit('DEFAULT')

To help retrieve comments and comment features from our reddit posts, we have to create a function to retrieve comments from a given posts submission id. Here is the function:

In [79]:
def retrieve_submissions(submission_id):
    """
    This function takes in a submission id and returns top comments, along with comment score, and is_submitter,
    and distinguished.
    """
    # a dictionary to store our comment features
    comment_dict = defaultdict(list)
    # retrieve our submission from passed id
    submission = reddit.submission(submission_id)
    # storing features of interest
    for comment in submission.comments:
        # ignoring MoreComments
        if isinstance(comment, MoreComments):
            continue
        comment_dict["submission id"].append(submission_id)
        comment_dict["comment id"].append(comment.id)
        comment_dict["body"].append(comment.body)
        comment_dict["score"].append(comment.score)
        comment_dict["is_submitter"].append(comment.is_submitter)
        comment_dict["distinguished"].append(comment.distinguished)
    # return dictionary
    return comment_dict

### Part 2: Collecting comments from our collected Reddit posts

So with this function, we have to pass in post id's. We can use our collected reddit posts from our subreddits and the 'id' column to gather comments for posts we want. First we have to read in out postdata.csv file which contains  500 collected posts from subreddits (careerguidance, AskHR).

In [55]:
posts_df = pd.read_csv('postdata.csv')
print("Shape:", posts_df.shape)
posts_df.head()

Shape: (500, 11)


Unnamed: 0.1,Unnamed: 0,subreddit,title,id,author,text,num_comments,score,upvote_ratio,flair,distinguished
0,0,careerguidance,Can we all agree to normalize gaps on resumes?,nd3r9i,minirumbaba,There is nothing wrong with not working for 3 ...,178,1423,0.98,,
1,1,careerguidance,Anyone else’s coworkers suddenly quitting with...,qv8jc4,Pugnastyornah,Our 3rd coworker in less than 2 months quit Fr...,357,1315,0.99,Coworkers,
2,2,careerguidance,I have only been working Full-Time for 5 years...,ojgq1c,Archprimus_,I don't understand what is wrong with me. When...,231,1011,0.98,,
3,3,careerguidance,"Lied about getting another offer, hr wants to ...",sdnh5m,ten_choe,"So for the last three months, I took over my b...",346,944,0.97,Advice,
4,4,careerguidance,Anyone else have the feeling their job is bull...,o1xcex,Suitable-Excitement3,Just looking to vent incase anyone else here h...,279,919,0.99,Advice,


We can see that we have 500 posts. Each post has an associated 'id' which is what we pass into our function. Let's retrieve a list of all the ids.

In [59]:
id_list = posts_df['id'].values
print(len(id_list))

500


In [80]:
# retrieveing comments for the first 100 posts ids
comments_dict_list = [] # storing our function dicts in a list
for id_ in id_list[0:100]:
    comments_dict_list.append(retrieve_submissions(id_))

Converting our collected comments into a dataframe:

In [81]:
# creating an initial dataframe
comments_df = pd.DataFrame(comments_dict_list[0])
# concatenating our comment dictionaries
for i in np.arange(1, len(comments_dict_list)):
    comments_df = pd.concat([comments_df, pd.DataFrame(comments_dict_list[i])], ignore_index=True)

In [82]:
# vieiwng the first 100 posts comment df:
comments_df

Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
0,nd3r9i,gy95035,I recently got hired after a 3 year employment...,150,False,
1,nd3r9i,gy9il5r,I’ve got the big five against me. \n\nage\n\nj...,75,False,
2,nd3r9i,gy8q51h,I agree with you! Just because you have a gap ...,276,False,
3,nd3r9i,gy8oudy,"Yeah, to be honest, hiring people forget what ...",127,False,
4,nd3r9i,gy96d2e,I think this is something that will become mor...,36,False,
...,...,...,...,...,...,...
8446,slhnhq,hvv6pb2,Amazon ? Or food industry ? Try opening up you...,1,False,
8447,slhnhq,hwc21j1,Truck driver and earn more then your fellow pe...,1,False,
8448,slhnhq,hvrobm4,OP I strongly advise you to check your state's...,-1,False,
8449,slhnhq,hvr4r34,Location?,0,False,


Just from our first 100 reddit submissions, we have collected 8451 comments. Along with comment body, id, and submission id, we also collected the score (upvote ratio), and whether the comment is from submitter, or whether it recieved an 'award' (distinguished). Lets collect comments for the remainder of the career guidance posts:

In [83]:
# rest of the careerguidance comments
comments_dict_list_250 = [] 
for id_ in id_list[100:250]:
    comments_dict_list_250.append(retrieve_submissions(id_))

In [84]:
# concatenating our comment dictionaries
for i in np.arange(0, len(comments_dict_list_250)):
    comments_df = pd.concat([comments_df, pd.DataFrame(comments_dict_list_250[i])], ignore_index=True)

In [87]:
print("Shape of Comments Dataframe from CareerGuidance:",comments_df.shape)

Shape of Comments Dataframe from CareerGuidance: (16476, 6)


In [88]:
comments_df

Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
0,nd3r9i,gy95035,I recently got hired after a 3 year employment...,150,False,
1,nd3r9i,gy9il5r,I’ve got the big five against me. \n\nage\n\nj...,75,False,
2,nd3r9i,gy8q51h,I agree with you! Just because you have a gap ...,276,False,
3,nd3r9i,gy8oudy,"Yeah, to be honest, hiring people forget what ...",127,False,
4,nd3r9i,gy96d2e,I think this is something that will become mor...,36,False,
...,...,...,...,...,...,...
16471,nosq4z,h03fl83,ADD/ADHD - Not certain but have a look into it...,1,False,
16472,nosq4z,h03j00m,"Nothing's wrong, your prefectly normal. You ar...",1,False,
16473,nosq4z,h03k6i7,"Why is this wrong? I've been working in tech, ...",1,False,
16474,nosq4z,h03ki06,Normal tbh. \n\nI hate having to work for a li...,1,False,


The benefit of splitting up my query to collect comments in batches is that I do not exceed any rate limits! From the first 250 posts from post_df, I have 16476 comments collected. We can collect the remainder of the comments from the AskHR reddit posts (posts 250 - 499):

In [89]:
# askHR comments
comments_dict_list_HR = []
for id_ in id_list[250:499]:
    comments_dict_list_HR.append(retrieve_submissions(id_))

In [90]:
# creating an initial dataframe
askHRcomments_df = pd.DataFrame(comments_dict_list_HR[0])
# concatenating our comment dictionaries
for i in np.arange(1, len(comments_dict_list_HR)):
    askHRcomments_df = pd.concat([comments_df, pd.DataFrame(comments_dict_list_HR[i])], ignore_index=True)

In [92]:
print("Shape for comments for askHR Data:", askHRcomments_df.shape)
askHRcomments_df

Shape for comments for askHR Data: (16488, 6)


Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
0,nd3r9i,gy95035,I recently got hired after a 3 year employment...,150,False,
1,nd3r9i,gy9il5r,I’ve got the big five against me. \n\nage\n\nj...,75,False,
2,nd3r9i,gy8q51h,I agree with you! Just because you have a gap ...,276,False,
3,nd3r9i,gy8oudy,"Yeah, to be honest, hiring people forget what ...",127,False,
4,nd3r9i,gy96d2e,I think this is something that will become mor...,36,False,
...,...,...,...,...,...,...
16483,qiob93,hikvoy8,He seems like he doesn’t want to deal with the...,2,True,
16484,qiob93,himfs00,Definitely understand why thats frustrating or...,2,False,
16485,qiob93,hirs5rs,I think you still should bring this to your HR...,2,False,
16486,qiob93,hil4m0e,Is not getting someone coffee aggressive? Ya i...,2,False,


### Part 3: Preliminary Cleaning:

Now we have two dataframes: one for comments from our posts of careerguidance (comments_df) and one for comments from askHR (askHRcomments_df). Before we export the data to a csv, lets do some simple cleaning. We want to remove any \n symbols and also remove any comments were the body is '[removed]'.

In [103]:
def replace_newline(text):
    '''
    This function takes in an string text and using RegEx to replace \n in the text to a space.
    It returns the replaced string.
    '''
    return re.sub('\\n', ' ', text)

We can apply this to all our body texts for our two dataframes:

In [108]:
# removing \n from askHR comments:
askHRcomments_df['body'] = askHRcomments_df['body'].apply(replace_newline)
askHRcomments_df.head()

Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
0,nd3r9i,gy95035,I recently got hired after a 3 year employment...,150,False,
1,nd3r9i,gy9il5r,I’ve got the big five against me. age job h...,75,False,
2,nd3r9i,gy8q51h,I agree with you! Just because you have a gap ...,276,False,
3,nd3r9i,gy8oudy,"Yeah, to be honest, hiring people forget what ...",127,False,
4,nd3r9i,gy96d2e,I think this is something that will become mor...,36,False,


Now we can see that in row 1, the text does not include the '\n' line character anymore. Lets apply this also to our careerguidance dataframe:

In [109]:
# removing \n from careerguidance comments:
comments_df['body'] = comments_df['body'].apply(replace_newline)
comments_df.head()

Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
0,nd3r9i,gy95035,I recently got hired after a 3 year employment...,150,False,
1,nd3r9i,gy9il5r,I’ve got the big five against me. age job h...,75,False,
2,nd3r9i,gy8q51h,I agree with you! Just because you have a gap ...,276,False,
3,nd3r9i,gy8oudy,"Yeah, to be honest, hiring people forget what ...",127,False,
4,nd3r9i,gy96d2e,I think this is something that will become mor...,36,False,


An issue we also want to resolve at this stage is to filter out any removed comments. Below is an example:

In [110]:
# filtering our the [removed] comments
askHRcomments_df[askHRcomments_df['body'] == '[removed]']

Unnamed: 0,submission id,comment id,body,score,is_submitter,distinguished
1159,olw1ik,h5hfskj,[removed],-22,False,
3245,sm4ej0,hvuqvld,[removed],-3,False,
4599,ug0rpl,i6xqkjw,[removed],-28,False,
5552,oazowk,h3w4vt3,[removed],1,False,
8390,slhnhq,hvrzksy,[removed],-3,False,
10573,npltvl,h06powy,[removed],4,False,
10875,qkpdxv,hizly3i,[removed],0,False,
12264,nf5v5a,gyk0wkj,[removed],6,False,
13459,qqwl1j,hk4y6ra,[removed],4,False,
15677,s57i2x,hsyms35,[removed],5,False,


These comments are likely removed for being against the subreddit guidelines or for being antisocial. They do not benefit our study since we cannot look at what content the posts included. Lets filter these out of both our datasets:

In [111]:
# creating clean copies of our datasets:
clean_careerguidance = comments_df.copy()
clean_askHR = askHRcomments_df.copy()

In [114]:
# filtering out [removed] content
clean_careerguidance = clean_careerguidance[clean_careerguidance['body'] != '[removed]']
clean_askHR = clean_askHR[clean_askHR['body'] != '[removed]']

We can compare our shapes to see the number of removed comments:

In [116]:
print("Number of removed comments from careerguidance:", comments_df.shape[0] - clean_careerguidance.shape[0])
print("Number of removed comments from askHR:", askHRcomments_df.shape[0] - clean_askHR.shape[0])

Number of removed comments from careerguidance: 11
Number of removed comments from askHR: 11


Both datasets had 11 removed comments each. That means we lost 22 data points, which is not much compared to our >32000 comments. Let's see each DF's final shapes:

In [117]:
print(clean_careerguidance.shape)
print(clean_askHR.shape)

(16465, 6)
(16477, 6)


Both datasets have aroung 16,400 comments from 500 posts total! Thats plenty to work with for now. We can export these datasets to csv. (Note: more intensive cleaning will be done in actual analysis steps. The cleaning now had more to do with saving usable data)

In [119]:
# saving as csv files:
clean_careerguidance.to_csv('careerguidance_comments.csv', sep=',', encoding='utf-8')
clean_askHR.to_csv('askHR_comments.csv', sep=',', encoding='utf-8')

We will use these collected comments to analyze what makes for useful, and good, advice through feature analysis. This will be done in Phase 2.