# Characterizing the Nature of Online Career Advice Communities

### Team Name: Team 3
### Team Members: Wen Yi Aw, Walker Azam, Ken Masumoto


## Project Goal
The goal of our project is to better understand how online career advice communities on Reddit operate. These communities are popular places to seek advice on questions that are often high-stakes and personal. By studying them, we hope to learn about advice seeking and giving on the internet as a whole, and to better understand the problems that modern-day employees face.

## Research Questions
Our research questions are as follows:
- What types of career-related questions are most often asked and discussed?
- Who is asking these questions? What roles do age, gender, and anonymity play in these communities?
- What qualities characterize the advice that these subreddits consider ‘good’?

We feel that these research questions cover the most important factors surrounding our topic and will yield findings that support actionable recommendations.


## Motivation
If we are successful, several stakeholders stand to gain from our research. Companies will better understand where information gaps exist for their employees and can find ways to fill them. As a result, employees will have their needs better addressed by the companies they work for. Finally, Reddit users who frequent advice-giving communities will be better able to identify “good” advice.

## Data Description
We collected submission information for the top 250 posts from each of r/askHR and r/careerguidance to help us answer the first two research questions. We then collected the comments on each of these submissions to help us answer the third research question. The code we used to collect the post data is outlined in the rest of this notebook. The code we used to collect the comment data from these posts is in the notebook titled 'CollectRedditComments.ipynb'.

### Our Deliverables
- **Collect_Reddit_Submissions.ipynb** (This notebook): Project outline and submission collection
- **CollectRedditComments.ipynb**: Comment collection
- **postdata.csv**: CSV file of our submission data (250 submissions from r/askHR and 250 submissions from r/careerguidance)
- **askHR_comments.csv** and **careeradvice_comments.csv**: CSV files of our comment data

## Contributions

- Wen Yi Aw: Code used to gather submission post data
- Walker Azam: Code used to gather comment data
- Ken Masumoto: Problem statement and motivation writeup



## Reddit Post Collection

This notebook contains the code that retrieves Reddit submissions from work advice communities. The results are used to generate a CSV file containing submission data.

Our two subreddits of interest are r/careerguidance and r/askHR. Although we had previously planned to examine more subreddits, we found that collecting the comments of even a small subset of posts would still lead to a large dataset. We collected data on 250 posts from each of these subreddits, as well as the comments associated with each post.

Because a key part of our analysis will involve comments (distinguishing good advice from bad advice), we selected posts sorted by "top", as these posts have a large number of interactions and comments. The posts were taken from the past year so that our analysis can address the current concerns of people seeking career advice.

In [36]:
import praw
import pandas as pd
from collections import defaultdict

# Get credentials from DEFAULT instance in praw.ini
reddit = praw.Reddit('DEFAULT')

# Establish dictionary for storing post data
posts_dict = defaultdict(list)

# List of subreddits to examine
subs = ['careerguidance', 'askHR']

for sub in subs:
    # Create subreddit variable
    subreddit = reddit.subreddit(sub)

    # Iterator item to parse the top 250 posts in the last year
    top_subreddit = subreddit.top(limit=250, time_filter='year')

    # Iterate over the top_subreddit object and store fields for each post in the dictionary
    for submission in top_subreddit:
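        posts_dict["subreddit"].append(submission.subreddit.display_name) # Subreddit name as a plain string.
        posts_dict["title"].append(submission.title)
        posts_dict["id"].append(submission.id) # ID of the submission.
        # Author name as a plain string; author is None when the account was deleted.
        posts_dict["author"].append(submission.author.name if submission.author else None)
        posts_dict["text"].append(submission.selftext)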
        posts_dict["num_comments"].append(submission.num_comments) # The number of comments on the submission.
        posts_dict["score"].append(submission.score) # number of upvotes for the submission.
        posts_dict["upvote_ratio"].append(submission.upvote_ratio) # percentage of upvotes from all votes on the submission.
        posts_dict["flair"].append(submission.link_flair_text) # The link flair’s text content, or None if not flaired.
        posts_dict["distinguished"].append(submission.distinguished) # Whether or not the submission is distinguished.
    
# Create a dataframe from the post dictionary
post_data = pd.DataFrame(posts_dict)

# Print head of dataframe to examine results
print(post_data.head())

        subreddit                                                                                                                                         title      id                author  \
0  careerguidance  Can we all agree to normalize gaps on resumes?                                                                                                nd3r9i  minirumbaba            
1  careerguidance  Anyone else’s coworkers suddenly quitting with no job lined up?                                                                               qv8jc4  Pugnastyornah          
2  careerguidance  I have only been working Full-Time for 5 years. But I am already exhausted and don't want to work anymore. Does anybody else feel like this?  ojgq1c  Archprimus_            
3  careerguidance  Lied about getting another offer, hr wants to see my offer letter. What should I do?                                                          sdnh5m  ten_choe               
4  careerguidance  Anyone else have

The fields we collected from each post were:
- subreddit: to differentiate between communities (r/careerguidance or r/askHR)
- title: could be helpful to identify the topic of the post (ex. promotions, interviews, coworkers, etc.)
- post ID: for later collection of comments
- author: to ensure each author is counted only once in potential demographic measurements
- text of the post: who is asking for advice and what subjects they want advice on
- number of comments: could be helpful for verifying that all comments of each post are collected
- number of upvotes: a potential measure of what the community considers a "good" advice request
- percentage of votes which are upvotes: serves the same purpose as the number of upvotes
- the post flair: could be helpful to identify the topic of the post
- whether the post is distinguished: marks a post as coming from a moderator (we may consider removing moderator posts in the future if they are not advice)
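
As a rough illustration of how a few of these fields might be used, the sketch below drops distinguished (moderator) posts and then counts posts and unique authors per subreddit. The `sample` rows and the `summarize_posts` helper are made up for illustration and are not part of our collected data or analysis; the sketch also assumes the `author` column holds plain strings or `None`.

```python
import pandas as pd

# Toy stand-in for post_data; these rows are invented for illustration only.
sample = pd.DataFrame({
    "subreddit": ["careerguidance", "careerguidance", "askHR", "askHR"],
    "author": ["user_a", "user_a", "user_b", None],
    "score": [120, 80, 45, 30],
    "distinguished": [None, None, "moderator", None],
})

def summarize_posts(df):
    """Drop distinguished posts, then summarize each subreddit (hypothetical helper)."""
    # Keep only posts that are not distinguished (i.e. the field is missing)
    regular = df[df["distinguished"].isna()]
    return regular.groupby("subreddit").agg(
        posts=("score", "size"),               # number of remaining posts
        unique_authors=("author", "nunique"),  # missing authors are not counted
        median_score=("score", "median"),
    )

summary = summarize_posts(sample)
print(summary)
```

Counting unique authors rather than posts keeps a prolific poster from being counted several times in any demographic measurement.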


In [38]:
# Save dataframe to new CSV file with UTF-8 encoding
post_data.to_csv('./postdata.csv', encoding='utf-8')

Once saved, this CSV will be used for collecting comments of each submission (based on submission ID).
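
The actual comment collection lives in CollectRedditComments.ipynb, which is not reproduced here. The following is only a minimal sketch of how the saved submission IDs could drive that step. The helper names `read_submission_ids` and `collect_comments` and the output field names are hypothetical, and `reddit` is assumed to be an authenticated `praw.Reddit` instance like the one created above.

```python
import csv
import io

def read_submission_ids(csv_file):
    """Return the values of the "id" column from an open CSV file."""
    return [row["id"] for row in csv.DictReader(csv_file)]

def collect_comments(reddit, submission_ids):
    """Yield one dict per comment for each submission ID (hypothetical helper).

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    """
    for submission_id in submission_ids:
        submission = reddit.submission(id=submission_id)
        # Remove "load more comments" placeholders without fetching them
        submission.comments.replace_more(limit=0)
        # Flatten the comment tree into a single list
        for comment in submission.comments.list():
            yield {
                "submission_id": submission_id,
                "comment_id": comment.id,
                "body": comment.body,
                "score": comment.score,
            }

# Demonstrate the ID-reading step on an in-memory stand-in for postdata.csv
demo_csv = io.StringIO(
    ",subreddit,id\n0,careerguidance,nd3r9i\n1,careerguidance,qv8jc4\n"
)
ids = read_submission_ids(demo_csv)
print(ids)
```

With `replace_more(limit=0)` the placeholders are discarded, so some deeply nested comments are skipped. Passing `limit=None` instead fetches every comment at the cost of many more API requests, which matters if the `num_comments` field is used to verify completeness.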