# 01 Data Collection


## Problem Statement

IDEAL

Develop a model that analyses the text of a Reddit post and determines its source: either the datascience or the analytics subreddit. The idea is to find out whether the two self-classifying groups of users use distinctive language in what they post.

REALITY

There is a lot of debate as to whether there is a difference between a data scientist and an analyst. The roles overlap in the tools used, the type of work and the way some people perceive each role. Even role titles differ between companies and no definitive definition exists.

CONSEQUENCES

If the model is unsuccessful it will add weight to the argument that the disciplines are the same. If the model is not conclusive it will mean that, even in self-classifying groups, the language and topics discussed are indistinguishable even to advanced NLP models, which are not biased by subjective factors.

This would make it harder for employers, employees and education providers to claim that such a split exists, and they may have to tailor their approach accordingly.

PROPOSAL

I will scrape the data using the Reddit API. I will bring back the posts and also the first layer of comments, to make sure I have enough data. After analysing the data I will make a baseline model and a first run of the model. I will then refine the model using techniques such as lemmatising and adding more data, and try to improve its accuracy score.

I will compare Logistic Regression and Multinomial Naive Bayes, using the TF-IDF and CountVectorizer vectorisers. After choosing and optimising the model I will look at what is driving it and try to infer the key words behind any distinction. This will allow me to draw appropriate conclusions.

Success may not mean a high test score. If there is no difference between the two roles then the score is likely to be low, which would also be important to know. Equally, a score above 70% would be useful as I could conclude that differences do exist. The model should also provide key insights into what is driving it and explain the differences between the two subreddit groups.
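
To judge whether a score such as 70% is meaningful, it helps to compare it with the majority-class baseline, i.e. the accuracy obtained by always guessing the larger subreddit. A minimal sketch, assuming a plain list of subreddit labels; the function name and the label counts are made up for illustration and are not results from this project:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical label counts, for illustration only
labels = ['r/datascience'] * 60 + ['r/analytics'] * 40
print(baseline_accuracy(labels))
```

Any model score should be read against this number rather than against 50%, since the two subreddits may not contribute equal amounts of data.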

## Executive Summary


I have created a model to distinguish between a data scientist and an analyst just by looking at the text they use. The model is accurate and provides key insights into what the two groups discuss.

Data science and analytics have been used interchangeably in the past, causing confusion for employees, employers and training providers. However, an NLP model trained on Reddit posts has detected key features and differences between the two which can be useful.

The model shows that data science is more technical and builds tools from the ground up. Analytics uses more off-the-shelf software to work with existing data and quickly create actionable business insights.

Some key words for data science are machine learning, data engineering, python, data visualisation and web applications. Key analytics words include google analytics, data form, data warehouse, google optimise and Adobe analytics.

Now employers can classify their jobs better, employees can better determine what job interests them and training providers can run appropriate courses.

### Contents:
- [Work Book Overview](#Work-Book-Overview)
- [Data Collection Classes](#Data-Collection-Classes)
- [Json Explore Subreddit](#Json-Explore-Subreddit)
- [Json Explore Subreddit/comments](#Json-Explore-Subreddit/comments)
- [Data Scrape](#Data-Scrape)



# Import Libraries

In [1]:
# load libraries
import numpy as np
import pandas as pd
import requests
import time
import random
import re
from bs4 import BeautifulSoup

%matplotlib inline

---

# Work Book Overview


### EDA
+ A certain amount of EDA has occurred during the data collection stage. This is because it is important to see what data is being collected, which helps decide what is useful and what is not. This limits the amount of wasted data collected, saving space and making the process more efficient.

### Explore subreddits
+ Pull back 1 record from r/datascience and explore the data and all the information it contains
+ Pull back 1 record from r/analytics and compare the useful parts identified in the step above
+ Choose appropriate columns to use for data extraction

### Explore subreddits comments
+ Pull back 1 record from r/datascience/comments and explore the data and all the information it contains
+ Pull back 1 record from r/analytics/comments and compare the useful parts identified in the step above
+ Choose appropriate columns to use for data extraction

### Data collection
+ I have used the Reddit API to pull back data in two stages.

#### Stage 1
+ Firstly to collect the posts from both sub reddits
+ Collect the comments link/ID to use in stage 2

#### Stage 2 
+ Use the comments ID to scrape the first layer of comments
+ This enables a lot more data to be collected
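
The hand-off between the two stages can be sketched as follows. This is an illustration only, assuming each stage 1 record carries the `id` and `permalink` keys seen in the post JSON; `build_comment_urls` and the sample record are hypothetical.

```python
def build_comment_urls(posts):
    """Map each post id to the JSON URL of its comments page."""
    base = 'https://www.reddit.com'
    return {p['id']: base + p['permalink'] + '.json' for p in posts}

# Hypothetical stage 1 output: one record per post
stage_1_posts = [
    {'id': 'abc123', 'permalink': '/r/datascience/comments/abc123/example_post/'},
]
stage_2_urls = build_comment_urls(stage_1_posts)
print(stage_2_urls['abc123'])
```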

#### Outputs
1. data_science posts csv file
2. analytics posts csv file
3. data_science comments csv file
4. analytics comments csv file

I will keep them separate and join them in the next notebook, so I can see whether the model improves when using posts and comments rather than posts alone.

#### Structure
I have used object-oriented design (OOD) to help with data capture. This keeps the code clean and lets shared code be reused. A base class handles the connection and general functions, and two child classes add the functionality for collecting posts and comments, which are similar but distinctly different.

---

# Data Collection Classes

In [2]:
# Parent class containing common variables and methods

class Scrape(object):
    
    # Initialise parent class
    def __init__(self, output_file_name, cols_to_keep,  url,  min_sleep = 2, max_sleep = 8, 
                     file_type = '.csv', file_root = './data/'):
        self.url = url
        self.output_file_name = output_file_name
        self.min_sleep = min_sleep
        self.max_sleep = max_sleep
        self.file_type = file_type
        self.file_root = file_root
        self.cols_to_keep = cols_to_keep
        self.posts = []
        self.current_posts_df = pd.DataFrame()
    
    # Call the Reddit API and pull back data in JSON form
    def call_api(self):
        res = requests.get(self.url, headers={'User-agent': 'Pony Inc 1.0'})
        status = self.problem_with_status(res)
        return res.json(), status
    
    # Check there are no errors with the HTTP request
    def problem_with_status(self, res):
        if res.status_code != 200:
            # Report the failing URL
            print('Status error', res.status_code, ' URL: ', self.url)
            # Return True to indicate a problem
            return True
        # Return False to indicate all is OK
        return False
    
    # Add a pause into the model to prevent server overloading and IP blocking
    def sleep(self):
        sleep_duration = random.randint(self.min_sleep, self.max_sleep)
        time.sleep(sleep_duration)
        return sleep_duration  
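
The randomised pause can be tried out without actually waiting by injecting the sleep function. The sketch below is a stand-alone illustration, not part of the classes above; `polite_pause` and its `sleeper` and `rng` parameters are hypothetical names.

```python
import random
import time

def polite_pause(min_sleep=2, max_sleep=8, sleeper=time.sleep, rng=random):
    """Pause for a random whole number of seconds between min_sleep and max_sleep."""
    duration = rng.randint(min_sleep, max_sleep)
    sleeper(duration)
    return duration

# Dry run: record the pauses instead of actually waiting
recorded = []
durations = [polite_pause(sleeper=recorded.append) for _ in range(5)]
print(durations)
```

Varying the delay between requests makes the traffic less regular than a fixed interval would, which is the intent of the `sleep` method in the parent class.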

In [3]:
# Posts child class to get the main posts off of reddit

class Posts(Scrape):
    
    # Initialise Posts and Scrape class
    def __init__(self, output_file_name, cols_to_keep, url, min_sleep=2, max_sleep=8, file_type = '.csv', 
                  after = None, iterations = 700):
        # Pass the caller's settings through rather than hard-coded defaults
        super().__init__(output_file_name, cols_to_keep, url, min_sleep=min_sleep, max_sleep=max_sleep, 
                         file_type=file_type, file_root = './data/')
        self.after = after
        self.max_iters = iterations
        self.count = 0
        self.base_url = self.url
                
    # Function to get the posts data
    def get_posts(self):
        while (self.after is not None or self.count == 0) and (self.max_iters > self.count):
            if self.after is not None:
                self.url = self.base_url + '?after=' + self.after

            current_dict, error_with_status = self.call_api()
            
            if error_with_status:
                break  
            
            self.after = current_dict['data']['after']
            self.__process_json(current_dict)
            self.__clean_comments()
            self.__save_data()
            self.__end_loop()   
    
    # Only keep columns that are needed
    def __clean_comments(self):
        self.current_posts_df = pd.DataFrame(self.posts)
        if len(self.cols_to_keep) > 0:
            self.current_posts_df = self.current_posts_df.loc[:,self.cols_to_keep]

    # Extract and store the posts from the relevant JSON
    def __process_json(self, current_dict):
        current_posts = [p['data'] for p in current_dict['data']['children']]
        self.posts.extend(current_posts)
        self.after = current_dict['data']['after']
    
    # Save data to a file
    def __save_data(self):
        if self.count > 0:
            prev_posts = pd.read_csv(self.file_root + self.output_file_name + self.file_type)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            self.current_posts_df = pd.concat([prev_posts, self.current_posts_df])
        print(f"size of df: {self.current_posts_df.shape}")
        self.current_posts_df.to_csv(self.file_root + self.output_file_name + self.file_type, index = False)
    
    # End of function features to do at the end of the main loop
    def __end_loop(self):
        self.posts = []
        self.count += 1
        print(f"current_url: {self.url}, After: {self.after}, Sleep for {self.sleep()} seconds")
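
The pagination logic in `get_posts` can be sketched in isolation with a fake listing in place of the API. This is only an illustration of the loop condition; `paginate`, `fetch_page` and `fake_pages` are hypothetical names, and the data is hand-written.

```python
def paginate(fetch_page, max_iters=700):
    """Follow the 'after' token until it is None or max_iters pages are read."""
    after, count, posts = None, 0, []
    while (after is not None or count == 0) and count < max_iters:
        page = fetch_page(after)
        posts.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        count += 1
    return posts

# Fake two-page listing standing in for the Reddit API
fake_pages = {
    None: {'data': {'children': [{'data': {'id': 'a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'id': 'b'}}], 'after': None}},
}
all_posts = paginate(lambda after: fake_pages[after])
print([p['id'] for p in all_posts])
```

The `count == 0` clause is what lets the first request through while `after` is still `None`; after that the loop stops as soon as the API stops returning a token.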


In [4]:
# Comments child class to get the 1st layer of comments related to the reddit posts 

class Comments(Scrape):
    
    # Initialise Comments and Scrape class
    def __init__(self, output_file_name, cols_to_keep, keys_to_process, url, min_sleep=2, max_sleep=8, 
                         file_type = '.csv', file_root = './data/'):
        # Pass the caller's settings through rather than hard-coded defaults
        super().__init__(output_file_name, cols_to_keep, url, min_sleep=min_sleep, max_sleep=max_sleep, 
                         file_type=file_type, file_root=file_root)
        self.keys_to_process = keys_to_process
    
    # Function to get the comments data
    def get_comments(self):        
        for k, each_page in self.keys_to_process.items():
            try:
                self.url = 'https://www.reddit.com' + each_page + '.json'
                current_list, error_with_status = self.call_api()

                if not error_with_status:
                    comments_dict = current_list[1]['data']['children']
                    self.__process_json(comments_dict)
                    self.__clean_comments()
                    self.__save_data()
                    self.__end_loop()
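            except Exception as e:
                # Skip pages that fail to download or parse, but report them
                print(f"skipped {each_page}: {e}")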
    
    # Only keep columns that are needed
    def __clean_comments(self):
        self.current_posts_df = pd.DataFrame(self.posts)

        if len(self.cols_to_keep) > 0:
            self.current_posts_df = self.current_posts_df.loc[:,self.cols_to_keep]
    
    # Extract and store the comments from the relevant JSON
    def __process_json(self, comments_dict):
        current_posts = [p['data'] for p in comments_dict]
        self.posts.extend(current_posts)
    
    # Save data to file
    def __save_data(self):
        # Use a try block. The file does not exist on the first pass, and
        # recently deleted comments could also cause an error here
        try:
            prev_posts = pd.read_csv(self.file_root + self.output_file_name + self.file_type)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            self.current_posts_df = pd.concat([prev_posts, self.current_posts_df])
        except Exception:
            pass
        
        # Check that the function is working
        print(f"size of df: {self.current_posts_df.shape}")
        self.current_posts_df.to_csv(self.file_root + self.output_file_name + self.file_type, index = False)

    # End of function features to do at the end of the main loop
    def __end_loop(self):
        self.current_posts_df = pd.DataFrame()
        self.posts = []
        print(f"current_url: {self.url[37:]}, pause: {self.sleep()}secs")


---

# Json Explore Subreddit 

### Before the mass download I will look at one page of posts and examine the different columns

In [5]:
url_ds_test = 'https://www.reddit.com/r/datascience/.json'

In [6]:
res_ds_test = requests.get(url_ds_test, headers={'User-agent': 'Pony Inc 1.0'})

In [7]:
res_ds_test.status_code

200

In [8]:
res_ds_test = res_ds_test.json()

---

#### Investigate the columns to see what data is there

+ I am mainly concerned with the post data and which subreddit it came from.
+ However, there may be other data which may be useful later and which I don't want to discard.
+ At the same time there is not much point storing unimportant data.


![data_structure](images/json_overview.png)

### 1.0

In [9]:
## What keys does the dictionary have?
res_ds_test.keys()

dict_keys(['kind', 'data'])

In [10]:
# Explore these keys
# 'kind' looks like it identifies the data as a 'listing'
res_ds_test['kind']

'Listing'

### 2.0 

In [11]:
# 'data' is a dictionary 
type(res_ds_test['data'])

dict

In [12]:
# It has the following columns
res_ds_test['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

### 2.1

In [13]:
# 'modhash' is empty
res_ds_test['data']['modhash']

''

### 2.2

In [14]:
# 'dist' is the number of records
res_ds_test['data']['dist']

26

### 2.3

In [15]:
# 'children' is a list
type(res_ds_test['data']['children'])

list

In [16]:
type(res_ds_test['data']['children'])

list

In [17]:
# It has 26 records matching the 'dist' information
len(res_ds_test['data']['children'])

26

In [18]:
# Explore 1 entry - reveals another dictionary
res_ds_test['data']['children'][2]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'datascience',
  'selftext': 'Hello all,\n\nI am a Data Scientist at a Fortune 500 company, with a PhD in Electrical Engineering. For the last 5 years, I thought myself Python and Data Science and progressed a lot in that arena. I wanted some change after 5 years in the same company and wanted to explore options. Amazon AWS Pro Serve sounded interesting as you get to work with different companies. I did not want to work on a deep Machine Learning Problem on my cubicle (after corona, for now home desk :) ). I was excited about meeting new people and potentially solving data problems for different industries.\n\nAm I making a right choice? Is Pro-serve considered to be same with, say "Applied scientist" role in AWS (asking in regards to: 1) career growth, 2) reputation and 3) financial) ? Is meeting new people and potentially gaining more exposure to different industries in Pro-serve a naive way of thinking? As we all know 

In [19]:
# see how many keys 'data' has then explore
res_ds_test['data']['children'][0].keys()

dict_keys(['kind', 'data'])

### 2.3.1 'data' -> 'children' -> 'kind'

In [20]:
# Check out kinds to see if it changes
# Shows all 26 are the same values

unique_values_of_kind = set()
for i, x in enumerate(res_ds_test['data']['children']):
    unique_values_of_kind.add(x['kind'])

unique_values_of_kind

{'t3'}
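
For reference, the `kind` value is a Reddit type prefix, and the same prefix appears at the start of "fullnames" such as the `subreddit_id` value `t5_2sptq`. The mapping below reflects the prefixes listed in Reddit's API documentation as I understand them; `KIND_NAMES` and `split_fullname` are hypothetical helper names, not part of this notebook.

```python
# Reddit type prefixes ("kinds") used in fullnames such as 't3_hc83zi'
KIND_NAMES = {
    't1': 'comment',
    't2': 'account',
    't3': 'link (post)',
    't4': 'message',
    't5': 'subreddit',
    't6': 'award',
}

def split_fullname(fullname):
    """Split a fullname into a readable kind and the bare id."""
    kind, _, item_id = fullname.partition('_')
    return KIND_NAMES.get(kind, 'unknown'), item_id

print(split_fullname('t3_hc83zi'))
print(split_fullname('t5_2sptq'))
```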

### 2.3.2 'data' -> 'children' -> 'data'

In [21]:
# It is a dictionary
type(res_ds_test['data']['children'][0]['data'])

dict

In [22]:
# Dictionary contains over 100 keys.
len(res_ds_test['data']['children'][0]['data'].keys())

104

In [23]:
# The post text is stored under the 'selftext' key
res_ds_test['data']['children'][0]['data']['selftext']

"Welcome to this week's entering &amp; transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and [Resources](Resources) pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&amp;restrict_sr=1&amp;sort=new)."

### Check for adverts
+ There are adverts on the reddit page. I want to see if they are picked up in the text I will be using.
+ I will therefore loop through the first page and print out the 'selftext' key where the posts are stored

In [24]:
for x in res_ds_test['data']['children']:
    print(x['data']['selftext'])

Welcome to this week's entering &amp; transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

* Learning resources (e.g. books, tutorials, videos)
* Traditional education (e.g. schools, degrees, electives)
* Alternative education (e.g. online courses, bootcamps)
* Job search questions (e.g. resumes, applying, career prospects)
* Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and [Resources](Resources) pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&amp;restrict_sr=1&amp;sort=new).

Hello all,

I am a Data Scientist at a Fortune 500 company, with a PhD in Electrical Engineering. For the last 5 years, I thought myself Python and Data Science and progressed a lot i

### Observation

It looks like adverts aren't included: I know there are adverts on this page, but they are not present in this list.

#### Other Keys

I know where the post data is, but I want to check whether there is any more useful data I can use. If not, it can be discarded. Otherwise I can keep it and potentially use it later.


I will loop through and see which columns contain unique data. I will use sets so there will be no duplicates.

In [25]:
# create empty dic for storing data
category_ds_dict = {}

# loop through all the records and get unique keys and add to dictionary
for x in res_ds_test['data']['children']:
    keys = x['data'].keys()
    for k in keys:
        category_ds_dict[k] = set()
        

# Loop through all 26 reddit posts and add data to a set. 
# Duplicates will be ignored as sets only contain unique items.
# Columns with 26 items indicate potentially useful data as they are different.

for post in res_ds_test['data']['children']:
    for key in post['data'].keys():
        try:
            category_ds_dict[key].add(post['data'].get(key))
        except TypeError:
            # Unhashable values (lists, dicts) cannot go in a set, so these keys stay empty
            pass
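
The same idea can be written as a small reusable function. This is a sketch rather than the code used in this notebook; `unique_values_per_key` and the sample records are hypothetical. It also shows why some keys end up with 0 unique values: values such as lists and dicts are unhashable, so they can never be added to a set.

```python
from collections import defaultdict

def unique_values_per_key(records):
    """Collect the distinct values seen for each key across a list of dicts."""
    unique = defaultdict(set)
    for record in records:
        for key, value in record.items():
            try:
                unique[key].add(value)
            except TypeError:
                # Unhashable value: the key is registered but its set stays empty
                pass
    return dict(unique)

sample = [
    {'subreddit': 'datascience', 'title': 'first', 'gildings': {}},
    {'subreddit': 'datascience', 'title': 'second', 'gildings': {}},
]
summary = unique_values_per_key(sample)
print({k: len(v) for k, v in summary.items()})
```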


#### Print a table of the keys that have only 0 or 1 unique values

In [26]:
# Commented out as no need to always print. #

# print("|key|unique values|Values|Comments|\n|---|---|---|---|")
# for k,v in category_ds_dict.items():
#     if len(v) < 2:
#         print(f"|{k}|{len(v)}|{v}||")

|key|unique values|Values|Comments|
|---|---|---|---|
|approved_at_utc|1|{None}||
|subreddit|1|{'datascience'}||
|saved|1|{False}||
|mod_reason_title|1|{None}||
|gilded|1|{0}||
|clicked|1|{False}||
|link_flair_richtext|0|set()||
|subreddit_name_prefixed|1|{'r/datascience'}|Useful to ID the source later|
|hidden|1|{False}||
|pwls|1|{6}||
|downs|1|{0}||
|top_awarded_type|1|{None}||
|hide_score|1|{False}||
|quarantine|1|{False}||
|link_flair_text_color|1|{'dark'}||
|subreddit_type|1|{'public'}||
|total_awards_received|1|{0}||
|media_embed|0|set()||
|author_flair_template_id|1|{None}||
|is_original_content|1|{False}||
|user_reports|0|set()||
|secure_media|1|{None}||
|is_reddit_media_domain|1|{False}||
|is_meta|1|{False}||
|category|1|{None}||
|secure_media_embed|0|set()||
|can_mod_post|1|{False}||
|approved_by|1|{None}||
|author_flair_richtext|0|set()||
|gildings|0|set()||
|content_categories|1|{None}||
|mod_note|1|{None}||
|link_flair_type|1|{'text'}||
|wls|1|{6}||
|removed_by_category|1|{None}||
|banned_by|1|{None}||
|author_flair_type|1|{'text'}||
|likes|1|{None}||
|banned_at_utc|1|{None}||
|view_count|1|{None}||
|archived|1|{False}||
|is_crosspostable|1|{False}||
|pinned|1|{False}||
|over_18|1|{False}||
|all_awardings|0|set()||
|awarders|0|set()||
|media_only|1|{False}||
|can_gild|1|{False}||
|spoiler|1|{False}||
|locked|1|{False}||
|treatment_tags|0|set()||
|visited|1|{False}||
|removed_by|1|{None}||
|num_reports|1|{None}||
|subreddit_id|1|{'t5_2sptq'}||
|mod_reason_by|1|{None}||
|removal_reason|1|{None}||
|is_robot_indexable|1|{True}||
|report_reasons|1|{None}||
|discussion_type|1|{None}||
|whitelist_status|1|{'all_ads'}||
|contest_mode|1|{False}||
|mod_reports|0|set()||
|author_patreon_flair|1|{False}||
|parent_whitelist_status|1|{'all_ads'}||
|subreddit_subscribers|1|{237934}||
|media|1|{None}||
|is_video|1|{False}||
|preview|0|set()||


### Observations

I will remove most of these columns from the data collection unless the table highlights a reason to keep them. This will save space. I will not lose any information as I can refer back to this table later if necessary.

---

#### Print a table of the keys that have 2 or more unique values

In [27]:
## Commented out as no need to always print. ##

# print("|key|unique values|details|comments\n|---|---|---|---|")
# for k,v in category_ds_dict.items():
#     if len(v) >= 2 and len(v) <= 7:
#         print(f"|{k}|{len(v)}|{v}|||")
#     elif len(v) > 7:
#         print(f"|{k}|{len(v)}|||")

|key|unique values|details|comments
|---|---|---|---|
|selftext|26|||
|author_fullname|26|||
|title|26|||
|link_flair_css_class|7|{'', 'education', 'discussion', None, 'tooling', 'projects', 'career'}|||
|thumbnail_height|2|{81, None}|||
|name|26|||
|upvote_ratio|17|||
|author_flair_background_color|2|{'', None}|||
|ups|19||number of upvotes|
|thumbnail_width|2|{140, None}|||
|link_flair_text|7|{'Job Search', None, 'Career', 'Education', 'Discussion', 'Projects', 'Tooling'}|||
|score|19||Determines net likes|
|author_premium|2|{False, True}|||
|thumbnail|2|{'self', 'https://b.thumbs.redditmedia.com/sO54GMM4hox5uOI_8FG7S-d0jb43W_8bBsuW14tsP-k.jpg'}|||
|edited|7|{False, 1592430658.0, 1592322535.0, 1592337227.0, 1592311663.0, 1592515672.0, 1592444729.0}|||
|author_flair_css_class|2|{'seniorflair', None}|||
|is_self|2|{False, True}|||
|created|26|||
|domain|2|{'afox.dev', 'self.datascience'}|||
|allow_live_comments|2|{False, True}|||
|selftext_html|26|||
|suggested_sort|2|{'new', 'confidence'}|||
|no_follow|2|{False, True}|||
|author_flair_text|3|{'MS  Student','v0.5.1', None}|||
|distinguished|2|{None, 'moderator'}|||
|link_flair_background_color|2|{'', '#edeff1'}|||
|id|26|||
|author|26|||
|num_comments|22|||
|send_replies|2|{False, True}|||
|author_flair_text_color|2|{'dark', None}|||
|permalink|26|||
|stickied|2|{False, True}|||
|url|26|||
|created_utc|26|||
|num_crossposts|2|{0, 1}|||
|link_flair_template_id|6|{'a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e', '4fad7108-d77d-11e7-b0c6-0ee69f155af2', 'aaf5d8cc-d780-11e7-a4a5-0e68d01eab56', '937a6f50-d780-11e7-826d-0ed1beddcc82', '71803d7a-469d-11e9-890b-0e5d959976c8', '99f9652a-d780-11e7-b558-0e52cdd59ace'}|||
|post_hint|2|{'self', 'link'}|||

### Observations

After looking at the columns with 2 or more unique values, I will keep only those with 7 or more unique values.

In [28]:
# create column list which will be used to store only the relevant data which is pulled back

keep_cols_posts = ['subreddit_name_prefixed']
for k,v in category_ds_dict.items():
    if len(v) >= 7:
        keep_cols_posts.append(k)

len(keep_cols_posts)

18

### Finish by checking 'after' and 'before'

### 2.4 'data' -> 'after'

In [29]:
# This is the token that links to the next page of posts
res_ds_test['data']['after']

't3_hc83zi'

### 2.5 'data' -> 'before'

In [30]:
# this is empty because there are no posts before this page
res_ds_test['data']['before']
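
The `after` token is what gets appended to the listing URL to request the next page. A minimal sketch of building that URL with the standard library follows; `next_page_url` is a hypothetical helper, and the optional `limit` parameter is an assumption about the Reddit listing API that is not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

def next_page_url(base_url, after=None, limit=None):
    """Build a listing URL, adding only the query parameters that are set."""
    params = {}
    if after is not None:
        params['after'] = after
    if limit is not None:
        # Assumed listing parameter controlling page size
        params['limit'] = limit
    return base_url + ('?' + urlencode(params) if params else '')

print(next_page_url('https://www.reddit.com/r/datascience/.json', after='t3_hc83zi'))
```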

---

## Explore data analytics subreddit

In [31]:
url_test_analytics = 'https://www.reddit.com/r/analytics.json'

res_anayltics_test = requests.get(url_test_analytics, headers={'User-agent': 'Pony Inc 1.0'})

res_anayltics_test.status_code

200

In [32]:
analtics_test_dict = res_anayltics_test.json()

### Explore analtics_test_dict['data']['children']

I will loop through and see which columns are the same as r/datascience


In [33]:
# create empty dic for storing data
category_dict_analytics = {}

# loop through all the records and get unique keys and add to dictionary
for x in analtics_test_dict['data']['children']:
    keys = x['data'].keys()
    for k in keys:
        category_dict_analytics[k] = set()
        

# Loop through all reddit posts and add data to a set. Duplicates will be ignored as sets only contain unique items.

for post in analtics_test_dict['data']['children']:
    for key in post['data'].keys():
        try:
            category_dict_analytics[key].add(post['data'].get(key))
        except:
            pass

In [34]:
r_analytics_cols = set(category_dict_analytics.keys())
r_dscience_cols = set(category_ds_dict.keys())
r_analytics_cols - r_dscience_cols

set()

### This shows r/analytics has no keys beyond those in r/datascience, which is expected but good to confirm.
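
Note that `r_analytics_cols - r_dscience_cols` only checks one direction. A sketch of a two-way comparison is below; `compare_key_sets` and the example sets are hypothetical and only illustrate the technique.

```python
def compare_key_sets(left, right):
    """Return the keys found only on the left, and only on the right."""
    left, right = set(left), set(right)
    return left - right, right - left

only_a, only_b = compare_key_sets({'id', 'title', 'extra'}, {'id', 'title'})
print(only_a, only_b)
```

Two empty sets mean the key sets are identical; a one-way difference alone cannot rule out keys that exist only in the second subreddit.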

---

# Json Explore Subreddit Comments pages

### Before the mass download I will look at 1 post and examine the different columns

In [35]:
url_ds_comments = 'https://www.reddit.com/r/datascience/comments/hcactj/user_claims_he_is_a_data_scientist_that_makes/.json'

In [36]:
res_ds_comments = requests.get(url_ds_comments, headers={'User-agent': 'Pony Inc 1.0'})

In [37]:
res_ds_comments.status_code

200

In [38]:
reddit_comment_dict = res_ds_comments.json()

# JSON Pre-exploration of comments data

#### Investigate the columns to see what data is there

+ I am mainly concerned with the post data and which subreddit it came from.
+ However, there may be other data which may be useful later and which I don't want to discard.
+ At the same time there is not much point storing unimportant data.


![data_structure](images/json_comments_overview.png)

### 0.0 Root

In [39]:
# What is the initial data structure?
type(reddit_comment_dict)

list

In [40]:
# How many elements to the list
len(reddit_comment_dict)

2

### label structure
+ 1.0 for the first element - Dictionary
+ 2.0 for the second element in the list - Dictionary
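
The two-element structure can be unpacked directly. The sketch below assumes the shape described here (first listing holds the post, second holds the comments); `split_comment_payload` is a hypothetical helper, the payload is hand-written, and the `body` key for comment text is an assumption.

```python
def split_comment_payload(payload):
    """Split a comments-page payload into (post_data, comment_children)."""
    if not isinstance(payload, list) or len(payload) != 2:
        raise ValueError('expected a list of two listings')
    post_listing, comment_listing = payload
    post_data = post_listing['data']['children'][0]['data']
    return post_data, comment_listing['data']['children']

# Tiny hand-written payload with the same shape as the real response
payload = [
    {'kind': 'Listing', 'data': {'children': [{'kind': 't3', 'data': {'selftext': 'post text'}}]}},
    {'kind': 'Listing', 'data': {'children': [{'kind': 't1', 'data': {'body': 'a comment'}}]}},
]
post_data, comment_children = split_comment_payload(payload)
print(post_data['selftext'], len(comment_children))
```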

### 1.1 'kind'

In [41]:
# What is the first type?
type(reddit_comment_dict[0])

dict

In [42]:
# What keys does it have?
reddit_comment_dict[0].keys()

dict_keys(['kind', 'data'])

In [43]:
# What does kind contain?
reddit_comment_dict[0]['kind']

'Listing'

+ 'kind' Not useful

### 1.2 'data'

In [44]:
# What is the second type?
type(reddit_comment_dict[0]['data'])

dict

In [45]:
# What keys does it have?
reddit_comment_dict[0]['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

### 1.2.1 'data' -> 'modhash'

In [46]:
# 'modhash' is empty
reddit_comment_dict[0]['data']['modhash']

''

### 1.2.2 'data' -> 'dist'

In [47]:
# 'dist' says it has 1 record
reddit_comment_dict[0]['data']['dist']

1

### 1.2.3 'data' -> 'children'

In [48]:
# 'children' contains a list of dictionaries and I know there is only 1 entry. I will check the 'selftext' key
reddit_comment_dict[0]['data']['children'][0]['data']['selftext']

'Comment:\n\n&gt; [user 1]\nI currently go into my FANG co, write a few SQL scripts, and call it a day by 4ish. Pays $250k/yr with a ton of amazing benefits. No MBA needed :)\n\n&gt; [user 2]\nData scientist?\n\nThis was found on a business careers related subreddit. I thought this claim was crazy and outlandish, but maybe this sub could confirm. Are data scientist jobs so lax you can casually write SQL and make 250k/year?\n\nI was always under the impression you needed a masters/PHD in stats and had to know a whole bunch of complicated math just to even break six figures, much less 250k.'

#### Observation
+ The text above is the same entry as the subreddit post

In [49]:
# compare keys to reddit structure keys
comments_keys = set(reddit_comment_dict[0]['data']['children'][0]['data'].keys())
post_keys = set(res_ds_test['data']['children'][0]['data'].keys())

In [50]:
# Show which keys appear on the comments page but not on the subreddit page
comments_keys.difference(post_keys)

{'link_flair_template_id', 'num_duplicates'}

### The contents of these cells don't have value for this project
+ The first is not text but a link to a template I will not use
+ The second is a count of duplicates. I will check for and remove any duplicates later anyway
+ I already have the data from scraping the original post, so I don't need to collect it again.
    + I can't use this instead because I need to scrape the posts first to get the IDs used to look up the comments.

In [51]:
reddit_comment_dict[0]['data']['children'][0]['data']['link_flair_template_id']

'a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e'

In [52]:
reddit_comment_dict[0]['data']['children'][0]['data']['num_duplicates']

0

### 1.2.4 'data' -> 'after'

In [53]:
reddit_comment_dict[0]['data']['after']

### 1.2.5 'data' -> 'before'

In [54]:
reddit_comment_dict[0]['data']['before']

#### Observation
+ No values in these cells, as this page does not link to a next or previous page

### 2.0 2nd Dictionary in root list

In [55]:
# What is the second element in the root list
type(reddit_comment_dict[1])

dict

In [56]:
# What keys does it have?
reddit_comment_dict[1].keys()

dict_keys(['kind', 'data'])

In [57]:
type(reddit_comment_dict[1]['kind'])

str

In [58]:
type(reddit_comment_dict[1]['data'])

dict

### 2.1 'kind'

In [59]:
# What does kind contain?
reddit_comment_dict[1]['kind']

'Listing'

### 2.2 'data'

In [60]:
# What type is 'data'?
type(reddit_comment_dict[1]['data'])

dict

In [61]:
# Which keys?
reddit_comment_dict[1]['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

### 2.2.1  'data' -> 'modhash'

In [62]:
# 'modhash' is empty
reddit_comment_dict[1]['data']['modhash']

''

### 2.2.2 'data' -> 'dist'

In [63]:
# 'dist' is empty
reddit_comment_dict[1]['data']['dist']

### 2.2.3 'data' -> 'children'

In [64]:
# 'children' is a list
type(reddit_comment_dict[1]['data']['children'])

list

In [65]:
# 27 items in the list
len(reddit_comment_dict[1]['data']['children'])

27

In [66]:
# each element is a dictionary
type(reddit_comment_dict[1]['data']['children'][0])

dict

In [67]:
# with similar keys
reddit_comment_dict[1]['data']['children'][0].keys()

dict_keys(['kind', 'data'])

### 2.2.3.1 'data' -> 'children' -> list[0] -> 'kind'

In [68]:
# Check out kinds to see if it changes
# Shows all are the same value

unique_values_of_kind = set()
for x in reddit_comment_dict[1]['data']['children']:
    unique_values_of_kind.add(x['kind'])

unique_values_of_kind

{'t1'}
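
Every child on this page is a `t1` comment, but that may not hold for every thread: as far as I know, longer threads can include a placeholder entry of kind `more` in the same list. A defensive filter is sketched below; `first_layer_comments` is a hypothetical helper, the sample data is hand-written, and the `body` key for comment text is an assumption.

```python
def first_layer_comments(children):
    """Keep only genuine comments (kind 't1') and return their data dicts."""
    return [c['data'] for c in children if c.get('kind') == 't1']

children = [
    {'kind': 't1', 'data': {'body': 'first comment'}},
    {'kind': 'more', 'data': {'children': ['abc']}},
]
print([c['body'] for c in first_layer_comments(children)])
```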

### 2.2.3.2 'data' -> 'children' -> list[0] -> 'data'


In [69]:
# Each comment's 'data' is a dictionary
type(reddit_comment_dict[1]['data']['children'][0]['data'])

dict

In [70]:
# It has 66 keys
len(reddit_comment_dict[1]['data']['children'][0]['data'].keys())

66

#### Analyse the columns as I did for the subreddit post columns

In [71]:
# I will loop through and see which columns contains unique data. I will use sets so there will be no duplicates.

# create empty dic for storing data
category_dict_comments = {}

# loop through all the records and get unique keys and add to dictionary
for x in reddit_comment_dict[1]['data']['children']:
    keys = x['data'].keys()
    for k in keys:
        category_dict_comments[k] = set()
        

# Loop through all 27 first-layer comments and add data to a set. Duplicates will be ignored as sets only contain unique items.
# Columns with 27 items indicate potentially useful data as they are all different.

for post in reddit_comment_dict[1]['data']['children']:
    for key in post['data'].keys():
        try:
            category_dict_comments[key].add(post['data'].get(key))
        except:
            pass


In [72]:
# Print a table of the keys that have only 0 or 1 unique values

#Commented out as no need to always print. 

# print("|key|unique values|Values|Comments|\n|---|---|---|---|")
# for k,v in category_dict_comments.items():
#     if len(v) < 2:
#         print(f"|{k}|{len(v)}|{v}||")

|key|unique values|Values|Comments|
|---|---|---|---|
|total_awards_received|1|{0}||
|approved_at_utc|1|{None}||
|awarders|0|set()||
|mod_reason_by|1|{None}||
|banned_by|1|{None}||
|author_flair_type|1|{'text'}||
|removal_reason|1|{None}||
|link_id|1|{'t3_hcactj'}|Useful for linking back to the original post|
|author_flair_template_id|1|{None}||
|likes|1|{None}||
|replies|1|{''}||
|user_reports|0|set()||
|saved|1|{False}||
|banned_at_utc|1|{None}||
|mod_reason_title|1|{None}||
|gilded|1|{0}||
|archived|1|{False}||
|can_mod_post|1|{False}||
|send_replies|1|{True}||
|parent_id|1|{'t3_hcactj'}||
|report_reasons|1|{None}||
|approved_by|1|{None}||
|all_awardings|0|set()||
|subreddit_id|1|{'t5_2sptq'}||
|downs|1|{0}||
|author_flair_css_class|1|{None}||
|is_submitter|1|{False}||
|author_flair_richtext|0|set()||
|author_patreon_flair|1|{False}||
|gildings|0|set()||
|associated_award|1|{None}||
|stickied|1|{False}||
|author_premium|1|{False}||
|subreddit_type|1|{'public'}||
|can_gild|1|{True}||
|top_awarded_type|1|{None}||
|author_flair_text_color|1|{None}||
|score_hidden|1|{False}||
|num_reports|1|{None}||
|locked|1|{False}||
|subreddit|1|{'datascience'}||
|author_flair_text|1|{None}||
|treatment_tags|0|set()||
|subreddit_name_prefixed|1|{'r/datascience'}|Can use this to tag the source|
|depth|1|{0}||
|author_flair_background_color|1|{None}||
|collapsed_because_crowd_control|1|{None}||
|mod_reports|0|set()||
|mod_note|1|{None}||
|distinguished|1|{None}||

### Observations

I will remove these columns from the data collection, apart from the ones I can use (`link_id` and the subreddit name). This will save space. I will not lose any data as I can refer back to this table later if necessary.
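
The profiling done above can be wrapped in a small reusable function. This is a sketch only: `unique_values_by_key` and `constant_keys` are hypothetical helper names, and the two example comments are made-up data shaped like the `children` entries explored earlier.

```python
from collections import defaultdict


def unique_values_by_key(children):
    """Map each key in child['data'] to the set of hashable values seen."""
    unique = defaultdict(set)
    for child in children:
        for key, value in child['data'].items():
            try:
                unique[key].add(value)
            except TypeError:
                # Lists and dicts are unhashable, so the key keeps an empty set
                pass
    return dict(unique)


def constant_keys(unique):
    """Keys with fewer than two distinct values carry no signal."""
    return sorted(k for k, v in unique.items() if len(v) < 2)


# Made-up example records, for illustration only
example_children = [
    {'kind': 't1', 'data': {'id': 'a', 'subreddit': 'datascience', 'awarders': []}},
    {'kind': 't1', 'data': {'id': 'b', 'subreddit': 'datascience', 'awarders': []}},
]
unique = unique_values_by_key(example_children)
print(constant_keys(unique))
```

Keys whose values are lists (such as `awarders`) show up with zero unique values, which matches the `set()` rows in the table above.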

---

### Print table of the keys that have 2 or more unique values

In [73]:
# Print a table of the keys that have 2 or more unique values

In [74]:
# Commented out as no need to always print. #

# print("|key|unique values|details|\n|---|---|---|")
# for k,v in category_dict_comments.items():
#     if len(v) >= 2 and len(v) <= 7:
#         print(f"|{k}|{len(v)}|{v}|")
#     elif len(v) > 7:
#         print(f"|{k}|{len(v)}||")

|key|unique values|details|
|---|---|---|
|ups|15||
|id|25||
|no_follow|2|{False, True}|
|author|25|Might be needed to check that one author is not too active and thus distorting results|
|score|15||
|author_fullname|25||
|body|25||
|edited|3|{False, 1592624764.0, 1592650575.0}|
|collapsed|2|{False, True}|
|body_html|25||
|collapsed_reason|2|{'comment score below threshold', None}|
|permalink|25||
|name|25||
|created|25||
|created_utc|25||
|controversiality|2|{0, 1}|

### Observations

After looking at the columns with two or more unique values, I will keep only the columns with 15 or more unique values.

In [75]:
# create column list
keep_comment_cols = []
for k,v in category_dict_comments.items():
    if len(v) >= 15:
        keep_comment_cols.append(k)

keep_comment_cols.append('subreddit')
keep_comment_cols.append('link_id')

In [76]:
keep_comment_cols

['ups',
 'id',
 'author',
 'score',
 'author_fullname',
 'body',
 'body_html',
 'permalink',
 'name',
 'created',
 'created_utc',
 'subreddit',
 'link_id']
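
With the column list fixed, each comment can be reduced to just those fields before it is stored. The sketch below assumes a hypothetical helper `comments_to_rows`; the `Comments` class used later in this notebook may do this differently. The example records are made up, including a non-comment entry to show the `kind` filter.

```python
def comments_to_rows(children, cols_to_keep):
    """Build one flat dict per comment, keeping only the chosen columns."""
    rows = []
    for child in children:
        if child.get('kind') != 't1':
            # Skip anything in the listing that is not a comment
            continue
        data = child['data']
        # .get() returns None when a column is missing from a record
        rows.append({col: data.get(col) for col in cols_to_keep})
    return rows


# Made-up example records, for illustration only
example_children = [
    {'kind': 't1', 'data': {'id': 'fvjpubi', 'body': 'Good degree setup?',
                            'score': 1, 'saved': False}},
    {'kind': 'more', 'data': {'id': '_', 'count': 3}},
]
rows = comments_to_rows(example_children, ['id', 'body', 'score', 'link_id'])
print(rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)` to get one row per comment.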

### Finish by checking 'after', 'before'

### 2.2.4 'data' -> 'after'

In [77]:
# this is empty because there are no posts after this page
reddit_comment_dict[1]['data']['after']

### 2.2.5 'data' -> 'before'

In [78]:
# this is empty because there are no posts before this page
reddit_comment_dict[1]['data']['before']
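
On a listing with more than one page, `after` holds the fullname of the last item, and the next page is requested by passing it back as a query parameter (the `?after=...` URLs in the scrape logs further down show this). A minimal sketch of that step, using a hypothetical helper name `next_page_url`:

```python
def next_page_url(base_url, after):
    """Return the URL of the next listing page, or None when there is none."""
    if not after:
        # An empty 'after' means this was the last page
        return None
    return f"{base_url}?after={after}"


print(next_page_url('https://www.reddit.com/r/datascience.json', 't3_hc80nv'))
print(next_page_url('https://www.reddit.com/r/datascience.json', None))
```

Returning `None` gives a scraping loop a simple stop condition.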

---

## Explore data analytics subreddit comments page

In [79]:
url_analytics_comments = 'https://www.reddit.com/r/analytics/comments/hd0aje/best_minor_for_undergrad_to_help_me_get_a_job/.json'

res_anayltics_test_comments = requests.get(url_analytics_comments, headers={'User-agent': 'Pony Inc 1.0'})

res_anayltics_test_comments.status_code

200

In [80]:
# a comments page returns a two-item list: [0] is the post, [1] holds the comments
analytics_test_comments_dict = res_anayltics_test_comments.json()

### Explore analytics_test_comments_dict[1]['data']['children']

I will loop through and see which columns are the same as r/datascience


In [81]:
# create empty dict for storing data
category_analytics_comments_dict = {}

# loop through all the records, get the unique keys and add them to the dictionary
for x in analytics_test_comments_dict[1]['data']['children']:
    keys = x['data'].keys()
    for k in keys:
        category_analytics_comments_dict[k] = set()
        

# Loop through all the comments and add each value to a set. Duplicates will be ignored as sets only contain unique items.

for post in analytics_test_comments_dict[1]['data']['children']:
    for key in post['data'].keys():
        try:
            category_analytics_comments_dict[key].add(post['data'].get(key))
        except TypeError:
            # unhashable values (lists, dicts) cannot go in a set
            pass

In [82]:
r_analytics_comm_cols = set(category_analytics_comments_dict.keys())
r_dscience_comm_cols = set(category_dict_comments.keys())
r_analytics_comm_cols - r_dscience_comm_cols

set()

### This shows they have the same structure which is expected but is good to confirm.
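
A one-way set difference only reports keys that are missing from the right-hand side, so an empty result does not rule out extra keys on that side. A small sketch of a two-way check follows; `compare_keys` is a hypothetical helper and the key sets are made-up examples.

```python
def compare_keys(left, right):
    """Return (keys only in left, keys only in right)."""
    left, right = set(left), set(right)
    return left - right, right - left


# Made-up key sets, for illustration only
only_left, only_right = compare_keys({'id', 'body', 'score'},
                                     {'id', 'body', 'score', 'depth'})
print(only_left, only_right)
```

Both results being empty is what confirms that the two structures really match.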

---

Data Collection

+ Was enough data gathered to generate a significant result?
+ Was data collected that was useful and relevant to the project?
+ Was data collection and storage optimized through custom functions, pipelines, and/or automation?
+ Was thought given to the server receiving the requests such as considering number of requests per second?

---

# Data Scrape

### Retrieve r/datascience posts

In [144]:
posts = Posts(url='https://www.reddit.com/r/datascience.json', 
              output_file_name='datascience', cols_to_keep=keep_cols_posts)

In [145]:
posts.get_posts()

size of df: (26, 18)
current_url: https://www.reddit.com/r/datascience.json, After: t3_hc80nv, Sleep for 2 seconds
size of df: (51, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_hc80nv, After: t3_ha6ffm, Sleep for 4 seconds
size of df: (76, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_ha6ffm, After: t3_h7a5mz, Sleep for 5 seconds
size of df: (101, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_h7a5mz, After: t3_gzd84g, Sleep for 6 seconds
size of df: (126, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_gzd84g, After: t3_gw8z13, Sleep for 4 seconds
size of df: (151, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_gw8z13, After: t3_gvgc0f, Sleep for 2 seconds
size of df: (176, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_gvgc0f, After: t3_gtrro4, Sleep for 3 seconds
size of df: (201, 18)
current_url: https://www.reddit.com/r/datascience.json?after=t3_gtrro4, A
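
The log shows a pause of a few seconds between requests, which keeps the load on Reddit's servers low. The `Posts` class is not shown in this section, so the following is only a sketch of how such a pause could be written. The name `polite_pause` is hypothetical, and the 2 to 8 second range is an assumption inferred from the values seen in the logs.

```python
import random
import time


def polite_pause(min_seconds=2, max_seconds=8, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the duration."""
    seconds = random.randint(min_seconds, max_seconds)  # both ends inclusive
    sleep(seconds)
    return seconds


# Demonstration with a stand-in for time.sleep so the example runs instantly
waited = polite_pause(sleep=lambda s: None)
print(f"Sleep for {waited} seconds")
```

Passing the sleep function in as a parameter keeps the real delay in production while letting a test run without waiting.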

In [146]:
ds_posts_df = pd.read_csv('./data/datascience.csv')
ds_posts_df.shape

(629, 18)

#### Check datascience dataframe

In [83]:
datascience = pd.read_csv('./data/datascience.csv')

In [84]:
datascience.shape

(629, 18)

In [85]:
datascience.size

11322

In [86]:
datascience.loc[datascience.duplicated(),]

Unnamed: 0,subreddit_name_prefixed,selftext,author_fullname,title,link_flair_css_class,name,upvote_ratio,ups,link_flair_text,score,created,selftext_html,id,author,num_comments,permalink,url,created_utc


In [87]:
datascience['selftext']

0      Welcome to this week's entering &amp; transiti...
1                                                    NaN
2      Hello all,\n\nI am a Data Scientist at a Fortu...
3      Hi,\n\nA friend of mine is doing MSCS and we a...
4      Planning a move from Toronto - I've heard sala...
                             ...                        
624                                                  NaN
625                                                  NaN
626    Hi,\n\nBased in northern Europe, I am a data s...
627                                                  NaN
628    I just received an email from Tableau sharing ...
Name: selftext, Length: 629, dtype: object

In [88]:
datascience.head()

Unnamed: 0,subreddit_name_prefixed,selftext,author_fullname,title,link_flair_css_class,name,upvote_ratio,ups,link_flair_text,score,created,selftext_html,id,author,num_comments,permalink,url,created_utc
0,r/datascience,Welcome to this week's entering &amp; transiti...,t2_4l4cxw07,Weekly Entering &amp; Transitioning Thread | 2...,,t3_hd5t6m,0.66,1,Discussion,1,1592770000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hd5t6m,datascience-bot,5,/r/datascience/comments/hd5t6m/weekly_entering...,https://www.reddit.com/r/datascience/comments/...,1592741000.0
1,r/datascience,,t2_3rl9tafm,The best SQL vs NoSQL mindset I've ever heard,education,t3_hd3tqs,0.87,48,Education,48,1592759000.0,,hd3tqs,kotartemiy,6,/r/datascience/comments/hd3tqs/the_best_sql_vs...,https://codarium.substack.com/p/the-best-sql-v...,1592730000.0
2,r/datascience,"Hello all,\n\nI am a Data Scientist at a Fortu...",t2_1jwhofnt,Amazon AWS Pro-Serve Data Scientist is it a go...,career,t3_hcxeno,0.94,94,Career,94,1592729000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hcxeno,GreenerCar,13,/r/datascience/comments/hcxeno/amazon_aws_pros...,https://www.reddit.com/r/datascience/comments/...,1592700000.0
3,r/datascience,"Hi,\n\nA friend of mine is doing MSCS and we a...",t2_3b84s1v5,Help with implementation of a paper about Imba...,education,t3_hd4xlb,0.75,4,Education,4,1592765000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hd4xlb,takathur,0,/r/datascience/comments/hd4xlb/help_with_imple...,https://www.reddit.com/r/datascience/comments/...,1592736000.0
4,r/datascience,Planning a move from Toronto - I've heard sala...,t2_6ltcva0l,How are data science salaries in Montreal for ...,,t3_hcyrod,0.83,17,Job Search,17,1592734000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hcyrod,remembr_this,11,/r/datascience/comments/hcyrod/how_are_data_sc...,https://www.reddit.com/r/datascience/comments/...,1592706000.0


---

### Retrieve r/analytics posts

In [156]:
analytics_posts = Posts(url='https://www.reddit.com/r/analytics.json', 
              output_file_name='analytics', cols_to_keep=keep_cols_posts)

In [157]:
analytics_posts.get_posts()

size of df: (27, 18)
current_url: https://www.reddit.com/r/analytics.json, After: t3_hafpju, Sleep for 6 seconds
size of df: (52, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_hafpju, After: t3_h0vync, Sleep for 7 seconds
size of df: (77, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_h0vync, After: t3_gzexxm, Sleep for 8 seconds
size of df: (102, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_gzexxm, After: t3_gvxoor, Sleep for 2 seconds
size of df: (127, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_gvxoor, After: t3_gsnewt, Sleep for 3 seconds
size of df: (152, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_gsnewt, After: t3_gnelvp, Sleep for 6 seconds
size of df: (177, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_gnelvp, After: t3_glse1v, Sleep for 4 seconds
size of df: (202, 18)
current_url: https://www.reddit.com/r/analytics.json?after=t3_glse1v, After: t3_girle0,

#### Check analytics dataframe

In [89]:
analytics_df = pd.read_csv('./data/analytics.csv')

In [109]:
analytics_df.columns

Index(['subreddit_name_prefixed', 'selftext', 'author_fullname', 'title',
       'link_flair_css_class', 'name', 'upvote_ratio', 'ups',
       'link_flair_text', 'score', 'created', 'selftext_html', 'id', 'author',
       'num_comments', 'permalink', 'url', 'created_utc'],
      dtype='object')

In [90]:
analytics_df.shape

(679, 18)

In [91]:
analytics_df.size

12222

In [92]:
len(analytics_df.loc[analytics_df.duplicated(),])

0
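
`duplicated()` with no arguments compares whole rows, so a post that was scraped twice would go unnoticed if any column (for example the saved index column `Unnamed: 0`) differed between the two copies. Checking on the post `id` alone is stricter; in pandas that is `duplicated(subset=['id'])`. The same idea is sketched below in plain Python, with a hypothetical helper `repeated_ids` and made-up ids.

```python
from collections import Counter


def repeated_ids(ids):
    """Return the ids that appear more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Made-up ids, for illustration only: one post appears twice
print(repeated_ids(['hd5t6m', 'hd3tqs', 'hd5t6m']))
```

The function accepts any iterable, so a column such as `analytics_df['id']` could be passed in directly.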

In [93]:
analytics_df.head()

Unnamed: 0,subreddit_name_prefixed,selftext,author_fullname,title,link_flair_css_class,name,upvote_ratio,ups,link_flair_text,score,created,selftext_html,id,author,num_comments,permalink,url,created_utc
0,r/analytics,"Have a question regarding interviewing, career...",t2_6l4z3,Monthly Career Advice Thread - June 2020,,t3_gum80v,0.72,3,,3,1591053000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",gum80v,AutoModerator,20,/r/analytics/comments/gum80v/monthly_career_ad...,https://www.reddit.com/r/analytics/comments/gu...,1591024000.0
1,r/analytics,Share your current marketing openings in the c...,t2_6l4z3,Monthly Job Openings - June 2020,,t3_gxvv1j,0.86,5,,5,1591495000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",gxvv1j,AutoModerator,2,/r/analytics/comments/gxvv1j/monthly_job_openi...,https://www.reddit.com/r/analytics/comments/gx...,1591467000.0
2,r/analytics,I’m an incoming freshman at my university and ...,t2_2i26s653,Best minor for undergrad to help me get a job?,,t3_hd0aje,0.93,12,Question,12,1592741000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hd0aje,BEdabonemhatrs,15,/r/analytics/comments/hd0aje/best_minor_for_un...,https://www.reddit.com/r/analytics/comments/hd...,1592712000.0
3,r/analytics,"hello,\n\nabout me: I can code, so I don't min...",t2_roct4,Automated alerting tool/solution when two data...,,t3_hcpvrl,1.0,13,,13,1592702000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hcpvrl,dronedesigner,5,/r/analytics/comments/hcpvrl/automated_alertin...,https://www.reddit.com/r/analytics/comments/hc...,1592673000.0
4,r/analytics,"Hi,\n\nI'm currently finishing up my computer ...",t2_1ilqbagv,Getting a job in DA/BI without a 4-year degree?,,t3_hcxtuh,0.67,1,,1,1592731000.0,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",hcxtuh,InfiniteLeverage,8,/r/analytics/comments/hcxtuh/getting_a_job_in_...,https://www.reddit.com/r/analytics/comments/hc...,1592702000.0


---

#### Get comments data
+ Get data science comments

In [94]:
# get dict of names from datascience. Have to remove posts that have 0 comments as these pages don't exist

page_name_df = pd.DataFrame()
page_name_df['name'] = datascience['name']
page_name_df['permalink'] = datascience['permalink']
page_name_df['num_comments'] = datascience['num_comments']
page_name_df = page_name_df.loc[page_name_df['num_comments'] > 0,]

page_name = dict(zip(page_name_df['name'],page_name_df['permalink']))


In [95]:
len(page_name)

594
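
Each permalink in `page_name` can be turned into the JSON URL of its comments page, following the URL pattern used for the test request earlier in this notebook. A minimal sketch, assuming a hypothetical helper `comments_json_url` (the `Comments` class may build its URLs differently):

```python
BASE_URL = 'https://www.reddit.com'


def comments_json_url(permalink):
    """Turn a post permalink into the URL of its comments JSON."""
    # Permalinks end with '/', so normalise before appending '/.json'
    return f"{BASE_URL}{permalink.rstrip('/')}/.json"


print(comments_json_url(
    '/r/datascience/comments/hcxeno/amazon_aws_proserve_data_scientist_is_it_a_good/'
))
```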

In [243]:
sample = {'t3_h8skw3': '/r/datascience/comments/h8skw3/weekly_entering_transitioning_thread_14_jun_2020/',
 't3_hbxj93': '/r/datascience/comments/hbxj93/forever_a_fraud_keep_having_horrific_interviews/',
 't3_hcactj': '/r/datascience/comments/hcactj/user_claims_he_is_a_data_scientist_that_makes/',
 't3_hc070m': '/r/datascience/comments/hc070m/how_to_become_proficient_using_docker/',
 't3_hce7a4': '/r/datascience/comments/hce7a4/what_skills_are_awesome_to_have_in_a_data/',
 't3_hcdfp6': '/r/datascience/comments/hcdfp6/best_place_to_deploy_deep_learning_web_app/'}

In [244]:
page_name

{'t3_hd5t6m': '/r/datascience/comments/hd5t6m/weekly_entering_transitioning_thread_21_jun_2020/',
 't3_hd3tqs': '/r/datascience/comments/hd3tqs/the_best_sql_vs_nosql_mindset_ive_ever_heard/',
 't3_hcxeno': '/r/datascience/comments/hcxeno/amazon_aws_proserve_data_scientist_is_it_a_good/',
 't3_hcyrod': '/r/datascience/comments/hcyrod/how_are_data_science_salaries_in_montreal_for_05/',
 't3_hd7b4j': '/r/datascience/comments/hd7b4j/algo_bias_female_data_scientist_frequently/',
 't3_hcf4i7': '/r/datascience/comments/hcf4i7/tired_of_siraj_ravals_plagiarism_heres_what_you/',
 't3_hd3dw3': '/r/datascience/comments/hd3dw3/javascript_anyone/',
 't3_hcv2oe': '/r/datascience/comments/hcv2oe/code_management/',
 't3_hd1nf7': '/r/datascience/comments/hd1nf7/how_should_i_approach_this_problem_digging/',
 't3_hcuo4d': '/r/datascience/comments/hcuo4d/what_applications_of_data_science_or_ai_are_there/',
 't3_hcye42': '/r/datascience/comments/hcye42/dilemma_ms_industrial_systems_engineering_or_ms/',
 't3

In [250]:
comment_ds = Comments(url='', output_file_name='ds_comments', 
                   cols_to_keep=keep_comment_cols, keys_to_process=page_name)

In [251]:
comment_ds.get_comments()

size of df: (5, 13)
current_url: comments/hd5t6m/weekly_entering_transitioning_thread_21_jun_2020/.json, pause: 7secs
size of df: (9, 13)
current_url: comments/hd3tqs/the_best_sql_vs_nosql_mindset_ive_ever_heard/.json, pause: 4secs
size of df: (14, 13)
current_url: comments/hcxeno/amazon_aws_proserve_data_scientist_is_it_a_good/.json, pause: 7secs
size of df: (19, 13)
current_url: comments/hcyrod/how_are_data_science_salaries_in_montreal_for_05/.json, pause: 8secs
size of df: (23, 13)
current_url: comments/hd7b4j/algo_bias_female_data_scientist_frequently/.json, pause: 4secs
size of df: (46, 13)
current_url: comments/hcf4i7/tired_of_siraj_ravals_plagiarism_heres_what_you/.json, pause: 6secs
size of df: (48, 13)
current_url: comments/hd3dw3/javascript_anyone/.json, pause: 4secs
size of df: (55, 13)
current_url: comments/hcv2oe/code_management/.json, pause: 6secs
size of df: (57, 13)
current_url: comments/hd1nf7/how_should_i_approach_this_problem_digging/.json, pause: 2secs
size of df: (

size of df: (767, 13)
current_url: comments/h7gna8/how_is_your_experience_with_vim_neovim/.json, pause: 4secs
size of df: (769, 13)
current_url: comments/h78wsg/as_a_mentee_how_to_get_the_most_out_of_a/.json, pause: 5secs
size of df: (770, 13)
current_url: comments/h77p8s/navigating_temporal_vs_fixed_data_interesting/.json, pause: 2secs
size of df: (773, 13)
current_url: comments/h14ies/would_you_use_a_remote_service_to_manage_your/.json, pause: 5secs
size of df: (787, 13)
current_url: comments/h0h3p0/how_to_support_remote_ds_internships/.json, pause: 7secs
size of df: (789, 13)
current_url: comments/h0zwvg/how_much_facetoface_interaction_vs_coding_do_you/.json, pause: 8secs
size of df: (792, 13)
current_url: comments/h0thd1/autoregressive_model_with_an_exogenous_variable/.json, pause: 6secs
size of df: (820, 13)
current_url: comments/h01j32/early_career_data_scientist_pain_points/.json, pause: 4secs
size of df: (828, 13)
current_url: comments/h0hrme/phd_by_publication/.json, pause: 8s

size of df: (1577, 13)
current_url: comments/gumr3u/at_what_point_do_you_stop_with_a_clustering/.json, pause: 6secs
size of df: (1582, 13)
current_url: comments/gumw43/real_world_data_collection/.json, pause: 8secs
size of df: (1584, 13)
current_url: comments/gusf6x/exponential_functions_and_optimisation/.json, pause: 5secs
size of df: (1619, 13)
current_url: comments/gu2raf/does_anyone_else_that_has_been_doing_data_science/.json, pause: 2secs
size of df: (1625, 13)
current_url: comments/gue1fu/how_do_you_manage_credentialspasswords_for_your/.json, pause: 8secs
size of df: (1630, 13)
current_url: comments/gur7t8/data_science_ethic_issues_suggestions/.json, pause: 5secs
size of df: (1654, 13)
current_url: comments/gtv2c9/i_got_the_chance_to_interview_a_data_scientist_at/.json, pause: 6secs
size of df: (1660, 13)
current_url: comments/gu9lym/is_there_a_websiteplatform_where_you_can_sellbuy/.json, pause: 4secs
size of df: (1664, 13)
current_url: comments/gufkqn/how_to_avoid_nontechnical_e

current_url: comments/gnetpw/what_are_some_bad_coding_practice_youve_noticed/.json, pause: 5secs
size of df: (2656, 13)
current_url: comments/gnlelq/keeping_statistical_knowledge_fresh/.json, pause: 2secs
size of df: (2663, 13)
current_url: comments/gn4vkx/a_foolproof_way_to_shrink_deep_learning_models/.json, pause: 3secs
size of df: (2666, 13)
current_url: comments/gnhv8i/what_is_the_model_used_in_forward_and_backward/.json, pause: 3secs
size of df: (2670, 13)
current_url: comments/gn9eu6/offer_suggestion_data_scientist_in_amsterdam/.json, pause: 2secs
size of df: (2675, 13)
current_url: comments/gnbsqz/for_my_next_laptop_should_i_buy_a_mac_or_windows/.json, pause: 4secs
size of df: (2724, 13)
current_url: comments/gmirks/my_apologies_from_a_data_science_company_stole_my/.json, pause: 3secs
size of df: (2733, 13)
current_url: comments/gn5ddq/best_ml_platforms_for_a_nontechnical_startup_owner/.json, pause: 2secs
size of df: (2741, 13)
current_url: comments/gmx51a/clustering_a_large_dat

size of df: (3594, 13)
current_url: comments/gdf9l5/meme_the_hierarchy_of_data_science/.json, pause: 8secs
size of df: (3612, 13)
current_url: comments/gdmn0f/for_those_on_the_hiring_side_how_do_you_deal_with/.json, pause: 2secs
size of df: (3616, 13)
current_url: comments/gdyijn/data_integration_help/.json, pause: 2secs
size of df: (3626, 13)
current_url: comments/gdh074/best_interview_questions_as_interviewer_how_to/.json, pause: 2secs
size of df: (3651, 13)
current_url: comments/gczle5/what_are_the_manipulation_techniques_any_aspiring/.json, pause: 6secs
size of df: (3663, 13)
current_url: comments/gd3b41/data_scientistsanalysts_who_moved_to_pandas_from/.json, pause: 8secs
size of df: (3670, 13)
current_url: comments/gd876c/how_are_you_going_to_account_for_the_2020/.json, pause: 3secs
size of df: (3683, 13)
current_url: comments/gd0wj9/improving_presentation_skills/.json, pause: 4secs
size of df: (3712, 13)
current_url: comments/gcnv6y/thoughts_on_matplotlib_python_library_vs_tablea

size of df: (4854, 13)
current_url: comments/g2k5zi/from_your_experience_how_has_data_science_changed/.json, pause: 3secs
size of df: (4882, 13)
current_url: comments/g20x47/100days_data_science_challenge/.json, pause: 7secs
size of df: (4886, 13)
current_url: comments/g2f68k/mathematical_proof_that_no_decisionmaking/.json, pause: 2secs
size of df: (4892, 13)
current_url: comments/g1suaz/is_bert_too_general_for_an_nlp_project_with_a/.json, pause: 5secs
size of df: (4902, 13)
current_url: comments/g1yrv7/how_to_get_real_meaning_from_clustering_analysis/.json, pause: 5secs
size of df: (4917, 13)
current_url: comments/g23gw3/faang_data_scientist_software_engineer/.json, pause: 6secs
size of df: (4948, 13)
current_url: comments/g12zmd/20_best_libraries_for_data_science_in_r/.json, pause: 4secs
size of df: (4959, 13)
current_url: comments/g17928/ds_topic_of_the_week_should_data_science/.json, pause: 3secs
size of df: (4963, 13)
current_url: comments/g12g3b/build_an_efficient_recommendation_

size of df: (5967, 13)
current_url: comments/frkgr7/graph_of_graph_analysis/.json, pause: 8secs
size of df: (5972, 13)
current_url: comments/fsc7hr/focus_on_data_science_vs_deep_learning/.json, pause: 6secs
size of df: (5973, 13)
current_url: comments/fsif1v/seeking_feedback_on_data_science_workflow_tool/.json, pause: 4secs
size of df: (5976, 13)
current_url: comments/fs9eqi/business_analytics/.json, pause: 5secs
size of df: (5982, 13)
current_url: comments/fsej3k/estimating_covid19_infections_based_on_reported/.json, pause: 6secs
size of df: (5986, 13)
current_url: comments/frxyhm/how_do_you_use_aws_google_cloud_platform_and/.json, pause: 6secs
size of df: (6000, 13)
current_url: comments/frd031/unethical_nobel_behaviour/.json, pause: 8secs
size of df: (6001, 13)
current_url: comments/frsbby/publishing_interactive_datasets_and/.json, pause: 2secs
size of df: (6004, 13)
current_url: comments/fs07kh/new_external_gpu_for_deep_learning/.json, pause: 6secs
size of df: (6005, 13)
current_ur

size of df: (6513, 13)
current_url: comments/foml4d/were_hopeful_that_by_shedding_light_on_products/.json, pause: 7secs
size of df: (6516, 13)
current_url: comments/forw59/tools_for_handling_some_of_the_more_common_data/.json, pause: 4secs
size of df: (6525, 13)
current_url: comments/fobxih/why_are_most_of_the_data_science_articles_on/.json, pause: 8secs
size of df: (6527, 13)
current_url: comments/foozin/white_house_announces_new_partnership_to_unleash/.json, pause: 2secs
size of df: (6528, 13)
current_url: comments/forwy1/we_have_launched_a_covid19_dashboard_for_all/.json, pause: 3secs
size of df: (6532, 13)
current_url: comments/fojqo9/what_data_science_course_from_coursera_should_i/.json, pause: 4secs
size of df: (6540, 13)
current_url: comments/fo84cp/would_you_rather_have_an_analyst_on_your_team_who/.json, pause: 3secs
size of df: (6542, 13)
current_url: comments/fo9e59/all_sas_elearning_is_free_for_30_days_including/.json, pause: 2secs
size of df: (6558, 13)
current_url: comment

current_url: comments/flbsgq/does_anyone_have_a_viable_method_for_forecasting/.json, pause: 5secs
size of df: (6939, 13)
current_url: comments/flhlpr/data_science_consultant_hourly_rate_us_customer/.json, pause: 8secs
size of df: (6948, 13)
current_url: comments/fkqtb1/i_wrote_a_python_package_to_make_the_cord19/.json, pause: 8secs
size of df: (6951, 13)
current_url: comments/fl259m/covid19_data_hub_daily_updated_data_sets_and/.json, pause: 8secs


#### Check data science comments dataframe

In [252]:
datascience_comments = pd.read_csv('./data/ds_comments.csv')

In [253]:
datascience_comments.shape

(6951, 13)

In [254]:
datascience_comments.size

90363

In [255]:
len(datascience_comments.loc[datascience_comments.duplicated(),])

0

In [256]:
datascience_comments.head()

Unnamed: 0,ups,id,author,score,author_fullname,body,body_html,permalink,name,created,created_utc,subreddit,link_id
0,1.0,fvjpubi,ToothPickLegs,1.0,t2_55fytx,Good degree setup for Data Science career?\n\n...,"&lt;div class=""md""&gt;&lt;p&gt;Good degree set...",/r/datascience/comments/hd5t6m/weekly_entering...,t1_fvjpubi,1592784000.0,1592755000.0,datascience,t3_hd5t6m
1,1.0,fvjj6gv,sl5567,1.0,t2_674p23vi,I'm located in the NYC area and currently loo...,"&lt;div class=""md""&gt;&lt;p&gt;I&amp;#39;m loc...",/r/datascience/comments/hd5t6m/weekly_entering...,t1_fvjj6gv,1592780000.0,1592751000.0,datascience,t3_hd5t6m
2,3.0,fvj6yfx,Kamiklo,3.0,t2_4ygyz290,What kind of open source projects can I contri...,"&lt;div class=""md""&gt;&lt;p&gt;What kind of op...",/r/datascience/comments/hd5t6m/weekly_entering...,t1_fvj6yfx,1592771000.0,1592743000.0,datascience,t3_hd5t6m
3,1.0,fvj6rzt,productive_guy123,1.0,t2_2a4d4wz8,I am coming from a finance (BA) background and...,"&lt;div class=""md""&gt;&lt;p&gt;I am coming fro...",/r/datascience/comments/hd5t6m/weekly_entering...,t1_fvj6rzt,1592771000.0,1592742000.0,datascience,t3_hd5t6m
4,1.0,fvj5rh5,YoYoVaTsA,1.0,t2_zacn2,"So, I have studied the math and stats and mach...","&lt;div class=""md""&gt;&lt;p&gt;So, I have stud...",/r/datascience/comments/hd5t6m/weekly_entering...,t1_fvj5rh5,1592770000.0,1592741000.0,datascience,t3_hd5t6m


#### Get comments data
+ Get analytics comments

In [96]:
# # get dict of names from analytics
# page_name_analytics = dict(zip(analytics_df['name'],analytics_df['permalink']))


# get dict of names from analytics. Have to remove posts that have 0 comments as these pages don't exist

page_name_analytics_df = pd.DataFrame()
page_name_analytics_df['name'] = analytics_df['name']
page_name_analytics_df['permalink'] = analytics_df['permalink']
page_name_analytics_df['num_comments'] = analytics_df['num_comments']
page_name_analytics_df = page_name_analytics_df.loc[page_name_analytics_df['num_comments'] > 0,]

page_name_analytics = dict(zip(page_name_analytics_df['name'],page_name_analytics_df['permalink']))


In [97]:
len(page_name_analytics)

601

In [98]:
sample_analytics = {'t3_gum80v': '/r/analytics/comments/gum80v/monthly_career_advice_thread_june_2020/',
 't3_gxvv1j': '/r/analytics/comments/gxvv1j/monthly_job_openings_june_2020/',
 't3_hd0aje': '/r/analytics/comments/hd0aje/best_minor_for_undergrad_to_help_me_get_a_job/',
 't3_hcpvrl': '/r/analytics/comments/hcpvrl/automated_alerting_toolsolution_when_two/',
 't3_hcxtuh': '/r/analytics/comments/hcxtuh/getting_a_job_in_dabi_without_a_4year_degree/'}

In [99]:
page_name_analytics

{'t3_gum80v': '/r/analytics/comments/gum80v/monthly_career_advice_thread_june_2020/',
 't3_gxvv1j': '/r/analytics/comments/gxvv1j/monthly_job_openings_june_2020/',
 't3_hd0aje': '/r/analytics/comments/hd0aje/best_minor_for_undergrad_to_help_me_get_a_job/',
 't3_hcpvrl': '/r/analytics/comments/hcpvrl/automated_alerting_toolsolution_when_two/',
 't3_hcxtuh': '/r/analytics/comments/hcxtuh/getting_a_job_in_dabi_without_a_4year_degree/',
 't3_hcls6k': '/r/analytics/comments/hcls6k/is_there_an_apptool_to_see_multiple_sites/',
 't3_hcg9zt': '/r/analytics/comments/hcg9zt/what_to_major_in_to_do_data_analyst/',
 't3_hccb49': '/r/analytics/comments/hccb49/fb_inbound_tracking_in_google_analytics/',
 't3_hbvgii': '/r/analytics/comments/hbvgii/what_are_the_most_important_math_skills_as_an/',
 't3_hbsu89': '/r/analytics/comments/hbsu89/has_anyone_switched_from_tableau_to_powerbi_if_so/',
 't3_hbyihh': '/r/analytics/comments/hbyihh/how_does_this_field_view_mathematical_economics/',
 't3_hbiys8': '/r/a

In [101]:
comment_analytics = Comments(url='', output_file_name='analytics_comments', 
                   cols_to_keep=keep_comment_cols, keys_to_process=page_name_analytics)

In [102]:
comment_analytics.get_comments()

size of df: (14, 13)
current_url: mments/gum80v/monthly_career_advice_thread_june_2020/.json, pause: 5secs
size of df: (15, 13)
current_url: mments/gxvv1j/monthly_job_openings_june_2020/.json, pause: 3secs
size of df: (21, 13)
current_url: mments/hd0aje/best_minor_for_undergrad_to_help_me_get_a_job/.json, pause: 8secs
size of df: (25, 13)
current_url: mments/hcpvrl/automated_alerting_toolsolution_when_two/.json, pause: 4secs
size of df: (34, 13)
current_url: mments/hcxtuh/getting_a_job_in_dabi_without_a_4year_degree/.json, pause: 3secs
size of df: (38, 13)
current_url: mments/hcls6k/is_there_an_apptool_to_see_multiple_sites/.json, pause: 5secs
size of df: (47, 13)
current_url: mments/hcg9zt/what_to_major_in_to_do_data_analyst/.json, pause: 3secs
size of df: (49, 13)
current_url: mments/hccb49/fb_inbound_tracking_in_google_analytics/.json, pause: 4secs
size of df: (58, 13)
current_url: mments/hbvgii/what_are_the_most_important_math_skills_as_an/.json, pause: 8secs
size of df: (69, 13)
c

size of df: (356, 13)
current_url: mments/gxa6p9/statistics/.json, pause: 5secs
size of df: (357, 13)
current_url: mments/gxluvo/ranking_items_with_multiple_variables/.json, pause: 2secs
size of df: (373, 13)
current_url: mments/gwwp5i/why_would_someone_use_matplotlib_seaborn_or/.json, pause: 7secs
size of df: (375, 13)
current_url: mments/gxc19g/credit_card_risk_for_businesses/.json, pause: 7secs
size of df: (376, 13)
current_url: mments/gx43gt/open_web_analytics_database_schema/.json, pause: 6secs
size of df: (382, 13)
current_url: mments/gx64jv/tableau_is_too_slow/.json, pause: 7secs
size of df: (385, 13)
current_url: mments/gwn5qa/blending_facebook_ads_data_to_customer_records_in/.json, pause: 3secs
size of df: (387, 13)
current_url: mments/gwret6/junior_digital_analytics_consultant_job_interview/.json, pause: 6secs
size of df: (389, 13)
current_url: mments/gworz4/any_analysts_here_working_in_supply_chain/.json, pause: 2secs
size of df: (392, 13)
current_url: mments/gwc2bo/has_anyo

size of df: (717, 13)
current_url: mments/glg1mb/tableau_and_power_bi_are_these_two_enough_for/.json, pause: 5secs
size of df: (718, 13)
current_url: mments/glse1v/for_those_using_segmentcom_what_structure_do_you/.json, pause: 7secs
size of df: (720, 13)
current_url: mments/glqp45/main_difference/.json, pause: 4secs
size of df: (723, 13)
current_url: mments/glqi9z/is_a_masters_degree_recommended_in_analytics/.json, pause: 4secs
size of df: (724, 13)
current_url: mments/glipis/how_to_avoid_web_analytics_data_overload/.json, pause: 4secs
size of df: (735, 13)
current_url: mments/gkx260/disappointed_and_feeling_defeated_rant_post_lol/.json, pause: 6secs
size of df: (743, 13)
current_url: mments/gklf0e/how_can_i_use_a_mis_degree_to_get_into_analytics/.json, pause: 7secs
size of df: (744, 13)
current_url: mments/gkugmw/multiple_hosts_how_to_differentiate_impressions/.json, pause: 6secs
size of df: (749, 13)
current_url: mments/gksdu4/would_a_career_in_data_analytics_suit_a_highly/.json, pau

size of df: (1019, 13)
current_url: mments/g8d1eg/whats_your_deliverable/.json, pause: 4secs
size of df: (1021, 13)
current_url: mments/g8ck0q/session_recording_question/.json, pause: 3secs
size of df: (1024, 13)
current_url: mments/g7im78/monster_insights_giving_me_an_extremely_bounce/.json, pause: 2secs
size of df: (1028, 13)
current_url: mments/g7hbyq/dimensions_in_data_and_data_structures/.json, pause: 6secs
size of df: (1029, 13)
current_url: mments/g7ecp9/how_do_i_transfer_a_google_analytics_account_to/.json, pause: 8secs
size of df: (1042, 13)
current_url: mments/g73qhz/is_it_easy_to_transition_to_tableau_or_power_bi/.json, pause: 4secs
size of df: (1043, 13)
current_url: mments/g73pe7/have_done_my_bachelors_in/.json, pause: 7secs
size of df: (1055, 13)
current_url: mments/g6ok1n/about_to_graduate_with_a_ms_business_analytics/.json, pause: 7secs
size of df: (1057, 13)
current_url: mments/g6mxk9/over_the_last_3_days_search_console_says_226/.json, pause: 8secs
size of df: (1058, 1

size of df: (1313, 13)
current_url: mments/fx3om3/how_to_connect_website_and_app_via_google/.json, pause: 2secs
size of df: (1315, 13)
current_url: mments/fx6snu/msba_or_online_certificatesdegrees/.json, pause: 8secs
size of df: (1317, 13)
current_url: mments/fx4hju/how_to_store_text_to_perform_analytics_on_it/.json, pause: 7secs
size of df: (1322, 13)
current_url: mments/fwsjcl/best_online_data_analytics_courses/.json, pause: 8secs
size of df: (1325, 13)
current_url: mments/fwrai1/masters_vs_certificate_in_analytics/.json, pause: 8secs
size of df: (1326, 13)
current_url: mments/fwzpj5/learning_advice/.json, pause: 7secs
size of df: (1327, 13)
current_url: mments/fwna2a/how_to_use_ga_to_link_a_user_to_a_place/.json, pause: 6secs
size of df: (1332, 13)
current_url: mments/fw9yjz/internal_recruiter_looking_for_knowledgeresources/.json, pause: 6secs
size of df: (1336, 13)
current_url: mments/fw3mxs/monthly_job_openings_april_2020/.json, pause: 7secs
size of df: (1340, 13)
current_url: mme

size of df: (1589, 13)
current_url: mments/fg8iob/is_it_possible_to_perform_text_analytics_using/.json, pause: 6secs
size of df: (1592, 13)
current_url: mments/ffzbfm/moving_google_analytics_from_a_wordpress_plugin/.json, pause: 2secs
size of df: (1596, 13)
current_url: mments/ffrzck/how_to_get_data_from_google_analytics/.json, pause: 7secs
size of df: (1599, 13)
current_url: mments/ffahu8/connect_facebook_analytics_with_datastudio/.json, pause: 4secs
size of df: (1607, 13)
current_url: mments/fex1qo/where_to_learn_the_math_and_statistics_for_data/.json, pause: 8secs
size of df: (1609, 13)
current_url: mments/ff4kys/should_i_take_an_graduate_level_intro_to_finance/.json, pause: 6secs
size of df: (1614, 13)
current_url: mments/femwiw/how_can_i_make_the_best_of_this_new_job_that_is/.json, pause: 8secs
size of df: (1616, 13)
current_url: mments/fewk3f/is_there_any_difference_between_msba_and_mban/.json, pause: 3secs
size of df: (1617, 13)
current_url: mments/feheqj/monthly_job_openings_ma

current_url: mments/f5u3qv/need_some_advice_on_persisting_ga_utm_parameters/.json, pause: 8secs
size of df: (1859, 13)
current_url: mments/f5t4tu/ideas_for_hackathon/.json, pause: 6secs
size of df: (1861, 13)
current_url: mments/f5qbh8/about_metrics_and_kpi/.json, pause: 8secs
size of df: (1864, 13)
current_url: mments/f5sjut/i_want_to_scrape_the_web_for_raw_data_then/.json, pause: 6secs
size of df: (1869, 13)
current_url: mments/f5eg1u/analytics_software_progression_path/.json, pause: 2secs
size of df: (1870, 13)
current_url: mments/f5dhrs/logical_reasoning_and_mathematics_for_interview/.json, pause: 5secs
size of df: (1877, 13)
current_url: mments/f55pi5/python_with_power_bi_anyone/.json, pause: 7secs
size of df: (1879, 13)
current_url: mments/f5fgc1/does_anyone_else_calculate_nonbounced_pages_per/.json, pause: 5secs
size of df: (1882, 13)
current_url: mments/f5cfb7/should_i_do_an_unpaid_internship_abroad_right_now/.json, pause: 5secs
size of df: (1883, 13)
current_url: mments/f5afz7

size of df: (2124, 13)
current_url: mments/evg4z8/adobe_analytics_missing_data/.json, pause: 7secs
size of df: (2133, 13)
current_url: mments/ev7mi7/possible_to_transition_from_data_based_role_in/.json, pause: 3secs
size of df: (2134, 13)
current_url: mments/ev7v75/seeing_top_abandoned_products_in_ga/.json, pause: 6secs
size of df: (2137, 13)
current_url: mments/ev7qfn/how_does_one_become_a_analytics_consultant/.json, pause: 2secs
size of df: (2139, 13)
current_url: mments/ev7k90/is_my_plan_too_aggressive_for_no_reason/.json, pause: 3secs
size of df: (2143, 13)
current_url: mments/ev6qnh/what_is_a_good_alternative_for_google_analytics/.json, pause: 6secs
size of df: (2148, 13)
current_url: mments/euzwb4/looking_to_obtain_entry_level_analyst_position_in/.json, pause: 4secs
size of df: (2150, 13)
current_url: mments/ev7l1t/whats_up_with_google_analytics_and_pdfs/.json, pause: 6secs
size of df: (2151, 13)
current_url: mments/ev43bf/tablet_share_drop_off/.json, pause: 8secs
size of df: (21

---

#### Check analytics comments dataframe

In [103]:
# Load the saved analytics comments CSV back in for checking
analytics_comments = pd.read_csv('./data/analytics_comments.csv')

In [104]:
analytics_comments.shape

(2379, 13)

In [105]:
# Total number of cells (rows x columns)
analytics_comments.size

30927

In [106]:
# Count rows that exactly duplicate an earlier row (every column equal)
len(analytics_comments.loc[analytics_comments.duplicated()])

0
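The check above only flags rows where every column matches. As a hedged sketch of a stricter check, the snippet below counts duplicates on the comment `id` alone, which would also catch a comment that was scraped twice with a different `score` each time. The `toy_comments` frame and its values are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for analytics_comments; all values are invented.
toy_comments = pd.DataFrame(
    {
        "id": ["aaa111", "bbb222", "aaa111"],
        "score": [3, 2, 5],  # the same comment seen again with a new score
        "body": ["same comment", "other comment", "same comment"],
    }
)

# Whole-row check: rows 0 and 2 differ in score, so nothing is flagged.
full_row_dupes = len(toy_comments.loc[toy_comments.duplicated()])

# Check on the comment id only: the repeated id is flagged once.
id_dupes = len(toy_comments.loc[toy_comments.duplicated(subset="id")])

print(f"full-row duplicates: {full_row_dupes}, id duplicates: {id_dupes}")
```

Running the same `subset="id"` check on `analytics_comments` would show whether any comment was collected more than once.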

In [110]:
analytics_comments.head()

|   | ups | id | author | score | author_fullname | body | body_html | permalink | name | created | created_utc | subreddit | link_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | fswdofi | Beschwipstfrau | 3 | t2_1705qd | Right now I am nearly a year into my "first jo... | `&lt;div class="md"&gt;&lt;p&gt;Right now I am ...` | /r/analytics/comments/gum80v/monthly_career_ad... | t1_fswdofi | 1591332000.0 | 1591303000.0 | analytics | t3_gum80v |
| 1 | 2 | fsrhm37 | thekidboy | 2 | t2_1huc3t1 | Thoughts on pursuing a double major in Compute... | `&lt;div class="md"&gt;&lt;p&gt;Thoughts on pur...` | /r/analytics/comments/gum80v/monthly_career_ad... | t1_fsrhm37 | 1591232000.0 | 1591203000.0 | analytics | t3_gum80v |
| 2 | 1 | fsj8png | FruityPebblePug | 1 | t2_4ttrw2fw | I want to apply for a more programming heavy j... | `&lt;div class="md"&gt;&lt;p&gt;I want to apply...` | /r/analytics/comments/gum80v/monthly_career_ad... | t1_fsj8png | 1591054000.0 | 1591025000.0 | analytics | t3_gum80v |
| 3 | 1 | fsl41c9 | OhHeyJeannette | 1 | t2_g77dsgn | I've been in Consumer Products Category Manage... | `&lt;div class="md"&gt;&lt;p&gt;I&amp;#39;ve be...` | /r/analytics/comments/gum80v/monthly_career_ad... | t1_fsl41c9 | 1591088000.0 | 1591059000.0 | analytics | t3_gum80v |
| 4 | 1 | ft8pvbc | NatesFayt | 1 | t2_emfdu | Since the beginning of this year, I've been st... | `&lt;div class="md"&gt;&lt;p&gt;Since the begin...` | /r/analytics/comments/gum80v/monthly_career_ad... | t1_ft8pvbc | 1591536000.0 | 1591507000.0 | analytics | t3_gum80v |
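The `created_utc` values in the output above are hard to read as raw floats. A minimal sketch, assuming they are seconds since the Unix epoch (the usual meaning of this field in reddit JSON), converts them to timestamps so the date range of the scrape can be checked. The `toy` frame is hypothetical; its two `created_utc` values are copied from the displayed output, which may be rounded.

```python
import pandas as pd

# Hypothetical two-row frame; values copied from the displayed head() output.
toy = pd.DataFrame(
    {
        "id": ["fswdofi", "fsrhm37"],
        "created_utc": [1591303000.0, 1591203000.0],
    }
)

# Assumption: created_utc is seconds since the Unix epoch, in UTC.
toy["created_dt"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True)

# Earliest and latest comment times in the sample.
print(toy["created_dt"].min(), toy["created_dt"].max())
```

Applied to `analytics_comments`, the same conversion gives a quick view of the period the comments cover.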
