# Background
I am a follower in the r/dataanalysis and r/datascience subbredits. I have always found the adundance of advice and ideas on there to be very useful. However, I came across a problem where I just did not have the time and patience to go thoroughly read hundreds of posts.

Hence, I am creating a python script which extracts posts from r/dataanalysis and r/datascience, which then cleans and returns only the most useful summarised information for me.

I have used PRAW package (Python Reddit API Wrapper) to retreive ~300 of the top and hottest posts from each of these subreddits. In addition, I have also extracted the top 10 comments from each of these posts.

In this notebook, I will be cleaning and preparing the data for an interactive app which achieves my goal:

#### Gather all the given career advice into one easily digestable place.

In [22]:
import pandas as pd
import re
from datetime import datetime
import contractions # contractions
from nltk.corpus import stopwords # stopwords
from nltk.stem import WordNetLemmatizer # lemmatiser
import yake # keyword extraction
import heapq
from collections import Counter

In [2]:
top = pd.read_csv(".\\data\\top_data_posts.csv")
top_comments = pd.read_csv(".\\data\\top_data_post_comments.csv")
top.head()

Unnamed: 0,subreddit,id,title,author_name,created_unix,flair,score,upvote_ratio,description,url
0,dataanalysis,wfgn7j,"After 2 months, 150 resumes, 6 interviews, I f...",SomeEmotion3,1659557000.0,Career Advice,469,0.99,The job is Data Analyst for a well known Analy...,https://www.reddit.com/r/dataanalysis/comments...
1,dataanalysis,10onhl2,Want to become an analyst? Start here.,milwted,1675038000.0,Career Advice,459,0.99,Starting a career in data analytics can open u...,https://www.reddit.com/r/dataanalysis/comments...
2,dataanalysis,unoys0,Google Apprenticeship Response from Google 2022,Danielle-Dee,1652318000.0,,444,0.99,I applied for the Google Apprenticeship and I ...,https://www.reddit.com/r/dataanalysis/comments...
3,dataanalysis,z0mrku,"SQL roadmap, things you should know",JamySun,1668997000.0,,437,0.99,"Most important SQL command and function, hope ...",https://i.redd.it/qpoytkb0a91a1.jpg
4,dataanalysis,z1v48z,It really be like that,toketoornot,1669128000.0,,371,0.99,,https://i.redd.it/o234ckej1k1a1.jpg


In [3]:
hot = pd.read_csv(".\\data\\hot_data_posts.csv")
hot_comments = pd.read_csv(".\\data\\hot_data_post_comments.csv")
hot.head()

Unnamed: 0,subreddit,id,title,author_name,created_unix,flair,score,upvote_ratio,description,url
0,dataanalysis,11dl3sf,Hoping this video helps you in your data analy...,sujaynadkarni,1677528000.0,,14,0.95,,https://youtu.be/P7OTI17Wp-M
1,dataanalysis,11dpdoa,Very good data analytics article on HBR,ozarzoso,1677538000.0,,6,0.81,Hello. I strongly recommend you all this outst...,https://www.reddit.com/r/dataanalysis/comments...
2,dataanalysis,11dgh1t,urgent: job only requires Google analytics,RaceyDesiWithNoFacey,1677517000.0,,14,0.77,I'm a fresher and trying to break into the fie...,https://www.reddit.com/r/dataanalysis/comments...
3,dataanalysis,11dswih,Math Teacher to Data Specialist,Sea_Obligation_2802,1677547000.0,Career Advice,2,0.75,"Hello, I am looking to get out of teaching. I ...",https://www.reddit.com/r/dataanalysis/comments...
4,dataanalysis,11czods,Data analysts who make 120k+ per year - what s...,garbage_gemlin,1677463000.0,Career Advice,68,0.97,"I am a data analyst and love my current job, b...",https://www.reddit.com/r/dataanalysis/comments...


# Data Cleaning

#### Keep only non-image (usually memes) posts

In [5]:
# keep non-image (usually memes) TOP posts
non_meme_post_index = [i for i, url in enumerate(top['url']) if re.search(".(jpg|png|gif)", url) is None] # get non-image index   
top = top.iloc[non_meme_post_index, :] # keep only non-image rows
top_comments = top_comments[top_comments['post_id'].isin(top['id'])] # keep non-image posts
# keep non-image (usually memes) HOT posts
non_meme_post_index = [i for i, url in enumerate(hot['url']) if re.search(".(jpg|png|gif)", url) is None]    
hot = hot.iloc[non_meme_post_index, :]
hot_comments = hot_comments[hot_comments['post_id'].isin(hot['id'])]

#### Replace missing (NA) values

In [6]:
top.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 384 entries, 0 to 596
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   subreddit     384 non-null    object 
 1   id            384 non-null    object 
 2   title         384 non-null    object 
 3   author_name   366 non-null    object 
 4   created_unix  384 non-null    float64
 5   flair         262 non-null    object 
 6   score         384 non-null    int64  
 7   upvote_ratio  384 non-null    float64
 8   description   322 non-null    object 
 9   url           384 non-null    object 
dtypes: float64(2), int64(1), object(7)
memory usage: 33.0+ KB


In [7]:
# replace NA in author_name
top['author_name'] = top['author_name'].fillna('DELETED ACCOUNT')
hot['author_name'] = hot['author_name'].fillna('DELETED ACCOUNT')

In [8]:
# Investigate/Replace NA in description
top[top['description'].isna()].head()

Unnamed: 0,subreddit,id,title,author_name,created_unix,flair,score,upvote_ratio,description,url
34,dataanalysis,k40wpj,Thought you guys might like this,SicDev,1606759000.0,,142,0.99,,https://gfycat.com/conventionalanchoredgardens...
45,dataanalysis,stxqf8,Animated Voronoi Diagram Showing Spacial Contr...,falseNaoi,1645023000.0,,122,1.0,,https://v.redd.it/zd31azeik7i81
51,dataanalysis,vqgmhy,What are visualizations like this called? And ...,ChapliKebab,1656852000.0,Data Question,108,1.0,,https://v.redd.it/df6jannmkc991
57,dataanalysis,wrghuu,data analyst interview questions: Link to the ...,matarrwolfenstein,1660820000.0,Career Advice,102,0.94,,https://media-exp1.licdn.com/dms/document/C4D1...
62,dataanalysis,zlmwkh,I've noticed a lot of posts about people tryin...,alorentz,1671011000.0,,94,0.99,,https://whaly.io/posts/the-2023-data-analyst-s...


Looking at URLs, these posts are referencing to other sources, and are often less with no description.

In [9]:
top[top['description'].isna()]['flair'].value_counts()

Fun/Trivia                10
Career Advice              9
Tooling                    4
Data Analysis Tutorial     3
Discussion                 3
Education                  3
Data Question              2
Resume Help                2
Job Search                 2
Projects                   2
Employment Opportunity     1
Project Feedback           1
Networking                 1
Name: flair, dtype: int64

Looking at flairs of these missing posts, we can remove posts which are for fun/trivia. Other types of post may be useful to our goal.

In [10]:
# Remove Fun/Trivia flaired posts
top = top[top['flair'] != 'Fun/Trivia']
top_comments = top_comments[top_comments['post_id'].isin(top['id'])]
hot = hot[hot['flair'] != 'Fun/Trivia']
hot_comments = hot_comments[hot_comments['post_id'].isin(top['id'])]

In [11]:
# Replace missing description and flair with empty string
top.loc[top['description'].isna(), 'description'] = ""
hot.loc[hot['description'].isna(), 'description'] = ""

top.loc[top['flair'].isna(), 'flair'] = ""
hot.loc[hot['flair'].isna(), 'flair'] = ""

#### Keep 'useful' posts
Useful in our case means a score of more than 1. Posts by default on creation have a score of 1.

In [12]:
# keep useful posts (i.e. score > 1)
top = top[top['score'] > 1]
top_comments = top_comments[top_comments['score'] > 1]
# keep useful posts (i.e. score > 1)
hot = hot[hot['score'] > 1]
hot_comments = hot_comments[hot_comments['score'] > 1]

#### Replace UNIX time with datetime

In [14]:
# convert unix time to date
top['created_unix'] = [datetime.utcfromtimestamp(dt).strftime('%Y-%m-%d') for dt in top['created_unix']]
hot['created_unix'] = [datetime.utcfromtimestamp(dt).strftime('%Y-%m-%d') for dt in hot['created_unix']]
# rename column
top = top.rename(columns={'created_unix': 'datetime'})
hot = hot.rename(columns={'created_unix': 'datetime'})

#### Clean text columns

In [16]:
def clean_text(array): # expecting list or pandas series
    stop_words = set(stopwords.words('english'))
    lemma = WordNetLemmatizer()
    new_desc = []
    for text in array:
        if text == text: # captures NaN
            # 1. normalise (lowercase) text
            text = text.lower()
            # 2. expand formal contractions (e.g. i'll, haven't, don't, etc.)
            text = " ".join([contractions.fix(w) for w in text.split()])
            # 3. remove unicode characters (note: don't remove digits)
            text = re.sub(r"https?://\S+", "", text) # remove urls
            text = re.sub(r"([^a-z0-9.])", " ", text) # keep only character, digit, or fullstop (for sentences)
            text = re.sub(r"\s{2,}", " ", text) # replace multiple spaces with one space
            # 4. remove stopwords
            text = " ".join([w for w in text.split() if w.strip(".") not in stop_words])
            # 5. lemmatise each word (group words based on root/origin)
            text = " ".join([lemma.lemmatize(w) for w in text.split()])
            new_desc.append(text)
        else:
            new_desc.append("")
    return new_desc

top['description'] = clean_text(top['description'])
hot['description'] = clean_text(hot['description'])
top['title'] = clean_text(top['title'])
hot['title'] = clean_text(hot['title'])
top_comments['comment'] = clean_text(top_comments['comment'])
hot_comments['comment'] = clean_text(hot_comments['comment'])

# Data Extraction

Given that one post can have many comments, I will utilise text summarisation and keywords to extract insights.

Our goal is to accumulate all advice into one easily interpretable place. So, if we think of posts like questions, and comments as answers; then a summary of the comments is reasonable advice.

In [17]:
# group together comments and scores of the same post
summ_top_comments = top_comments.groupby("post_id")['comment'].transform(lambda x: ". ".join(x)) # joins all comments of a post
summ_top_comments = pd.concat([top_comments['post_id'], summ_top_comments], axis=1)
summ_top_comments_score = top_comments.groupby("post_id")['score'].mean()
summ_top_comments = summ_top_comments.merge(summ_top_comments_score, how='inner', on='post_id')
summ_top_comments = summ_top_comments.rename(columns={'score': 'avg_comment_score'})
summ_top_comments = summ_top_comments.drop_duplicates('post_id').reset_index(drop=True)

summ_hot_comments = hot_comments.groupby("post_id")['comment'].transform(lambda x: ". ".join(x))
summ_hot_comments = pd.concat([hot_comments['post_id'], summ_hot_comments], axis=1)
summ_hot_comments_score = hot_comments.groupby("post_id")['score'].mean()
summ_hot_comments = summ_hot_comments.merge(summ_hot_comments_score, how='inner', on='post_id')
summ_hot_comments = summ_hot_comments.rename(columns={'score': 'avg_comment_score'})
summ_hot_comments = summ_hot_comments.drop_duplicates('post_id').reset_index(drop=True)

# join to top/hot dataframe
top = top.merge(summ_top_comments, how='inner', left_on='id', right_on='post_id')
top = top.drop(columns='post_id')
hot = hot.merge(summ_hot_comments, how='inner', left_on='id', right_on='post_id')
hot = hot.drop(columns='post_id')

In [18]:
# replace double fullstops with one
top['comment'] = [re.sub(r"\.\.", ".", comm) for comm in top['comment']]
hot['comment'] = [re.sub(r"\.\.", ".", comm) for comm in hot['comment']]

In [19]:
# round avg_comment_score
top['avg_comment_score'] = top['avg_comment_score'].round()
hot['avg_comment_score'] = hot['avg_comment_score'].round()

### Keyword Extraction

Extracting the top N keywords from the title, description, or comments will help see what topics are among discussion.

By default, we are extracting the top 5 keywords where a 'keyword' can be up to a legth of 3 (3-gram).

In [21]:
# get N keywords from given string
def get_keywords(text, n_gram_max=3, dup_limit=0.5, max_num_kw=5):
    '''
    n_gram_max := max size of n-gram (consecutive N-words)
    dup_limit := tolerance of duplicate keywords (0.1 = avoid repetition, 0.9 = allow repetition)
    max_num_kw := max number of keywords returned
    '''
    if text == "":
        return ""
    else:
        custom_kw_extractor = yake.KeywordExtractor(
                lan='en',
                n=n_gram_max,
                dedupLim=dup_limit,
                top=max_num_kw,
                features=None
            )
        kw = custom_kw_extractor.extract_keywords(text)
        kw = [(tup[1], tup[0]) for tup in kw]
        heapq.heapify(kw) # sort by probability
        top_kw = heapq.nlargest(max_num_kw, kw) # top-10 keywords
        return ". ".join([tup[1] for tup in top_kw])

# NOTE: DEFAULT IS 3-GRAM, BALANCED DUPLICATES (0.5), MAX_NUM_KW=5 [to pass other args: .apply(get_keywords, max_num_kw=10, ...)]
top['title_kw'] = top['title'].apply(get_keywords)
hot['title_kw'] = top['title'].apply(get_keywords)
top['description_kw'] = top['description'].apply(get_keywords)
hot['description_kw'] = top['description'].apply(get_keywords)
top['comment_kw'] = top['comment'].apply(get_keywords)
hot['comment_kw'] = top['comment'].apply(get_keywords)

### (Extractive) Text Summary
We utilise EXTRACTIVE text summary. Words are given probabilities based on how often they occur (frequency). We then use these probabilities to measure how important a sentence (collection of words) is, and extract the 5 most important ones (by default).

In [23]:
def get_text_summary(text, topN=5):
    if text == "":
        return ""
    else:
        # tokenise words
        words = []
        for word in text.split():
            word = word.strip(".") # remove fullstop
            word = word.strip() # remove possible whitespace
            words.append(word)
        # get word frequencies
        word_freq = Counter(words)
        # get word probabilties
        max_freq = max(word_freq.values())
        for word in word_freq:
            word_freq[word] = word_freq[word] / max_freq
        # get sentence probaility/score
        sent_probs = {}
        sentences = [sent.strip() for sent in text.split(".")]
        for sent in sentences:
            for w in sent.split():
                if w in word_freq:
                    if sent not in sent_probs:
                        sent_probs[sent] = word_freq[w]
                    else:
                        sent_probs[sent] += word_freq[w]
        # select N most likely sentences (i.e. summarise)
        if len(sentences) < topN:
            return ". ".join(sentences)
        else:
            h = [(score, sent) for sent, score in sent_probs.items()]
            heapq.heapify(h)
            summary = [tup[1] for tup in heapq.nlargest(topN, h)]
            return ". ".join(summary)

topN = 5 # get top 5 sentences
top['description_summ'] = top['description'].apply(get_text_summary, topN=topN)
hot['description_summ'] = top['description'].apply(get_text_summary, topN=topN)
top['comment_summ'] = top['comment'].apply(get_text_summary, topN=topN)
hot['comment_summ'] = top['comment'].apply(get_text_summary, topN=topN)

In [24]:
# merge HOT and TOP dataframes
top['origin'] = 'top'
hot['origin'] = 'hot'
data = pd.concat([top, hot], axis=0)
data

Unnamed: 0,subreddit,id,title,author_name,datetime,flair,score,upvote_ratio,description,url,comment,avg_comment_score,title_kw,description_kw,comment_kw,description_summ,comment_summ,origin
0,dataanalysis,wfgn7j,2 month 150 resume 6 interview finally signed ...,SomeEmotion3,2022-08-03,Career Advice,469,0.99,job data analyst well known analytics conpany....,https://www.reddit.com/r/dataanalysis/comments...,stopping say congrats. hey show link project r...,23.0,month. signed job offer. finally signed job. j...,data. job data. conpany. job data analyst. ana...,congratulation. congrats. stopping. project. r...,way type resume way deliver interview matter 6...,wow congratulation happy deserve get better in...,top
1,dataanalysis,10onhl2,want become analyst start,milwted,2023-01-30,Career Advice,459,0.99,starting career data analytics open many excit...,https://www.reddit.com/r/dataanalysis/comments...,great post. upvoted. maybe murphyslab sir quac...,11.0,start. analyst start,experience. data analytics. exciting opportuni...,great. excel. data. start. great post. upvoted,prepared application process like 100 job appl...,another thing help sub weekly stickied enterin...,top
2,dataanalysis,unoys0,google apprenticeship response google 2022,Danielle-Dee,2022-05-12,,444,0.99,applied google apprenticeship want anything ne...,https://www.reddit.com/r/dataanalysis/comments...,still waiting take heavy grain salt swe friend...,44.0,apprenticeship. google. response google. googl...,people. bet lot applied. lot applied nervous. ...,friend google close. salt swe friend. grain sa...,x200b update 8 23 wow almost 2 month since con...,8mins ago sent interview ux design program ext...,top
3,dataanalysis,xg0c4u,started google data analytics course july 26th...,GoobGoobb,2022-09-16,Career Advice,355,0.98,basically got lucky. finished course august 27...,https://www.reddit.com/r/dataanalysis/comments...,congrats job offer learned sql intermediate le...,18.0,started. july. analytics course july. google d...,job. lucky. basically. resume. basically got l...,congrats. learned sql intermediate. level star...,nailed interview process signed offer yesterda...,able actually switch career data analytics one...,top
4,dataanalysis,q37irg,google data analysis course review,Free_Dimension1459,2021-10-07,,298,1.00,hi week 4 7th course little bit r capstone go ...,https://www.reddit.com/r/dataanalysis/comments...,recently completed full course except capstone...,9.0,data. review. google. analysis course review. ...,prep. skill. interview. data. job,position. job. data. data analysis. recently c...,break 3 category foundation course 1 2 dash th...,data analythics career locked unless go back s...,top
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,dataanalysis,10ylebf,become advanced data analyst carreer path data...,Gheron,2023-02-10,,36,0.93,data analyst two year wondering develop data a...,https://www.reddit.com/r/dataanalysis/comments...,would learn statistic probability hypothesis b...,8.0,career. shift google cert. google cert success...,lot. working. excited share accepted. share ac...,google. story. success. congratulation. congrats,scoured open source dataset website found thin...,damn inspiring considering similar position wo...,hot
12,dataanalysis,10xz7wz,day data analyst,Immighthaveloat10k,2023-02-09,,45,0.92,start professional career data analytics come ...,https://www.reddit.com/r/dataanalysis/comments...,30 meeting 60 cleaning data 10 creating viz pr...,28.0,starting thread. sharing answer. sql excel dat...,resume. wondering getting interviews. analyst....,analyst. excel. year ride started. data. sql,good resume much le important skill listed ess...,bi analyst knowing sql important know bi tool ...,hot
13,dataanalysis,10wieqb,first month working data analyst bootcamp. ask...,Think_Thought4982,2023-02-08,,88,0.92,food server 20 year left pursue career tech. f...,https://www.reddit.com/r/dataanalysis/comments...,many application submit many interview partici...,13.0,accepted,recently accepted data. dear everyone pleased....,signing nda selection. requires entry exam. nd...,combining effort web development data analytic...,congrats huge step admit never heard data anal...,hot
14,dataanalysis,10w2i9f,mean strong excel,shmoe94,2023-02-07,Data Tools,44,0.91,one understand comfortable considered strong e...,https://www.reddit.com/r/dataanalysis/comments...,know power query pivot table index match proba...,14.0,awesome. offer. interview. application. awesom...,encouragement like faced. transitioning data a...,congrats. career. started learn data. learn da...,wanted post encouragement like faced discourag...,congratulation feel coursera certificate helpe...,hot


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 370 entries, 0 to 15
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   subreddit          370 non-null    object 
 1   id                 370 non-null    object 
 2   title              370 non-null    object 
 3   author_name        370 non-null    object 
 4   datetime           370 non-null    object 
 5   flair              370 non-null    object 
 6   score              370 non-null    int64  
 7   upvote_ratio       370 non-null    float64
 8   description        370 non-null    object 
 9   url                370 non-null    object 
 10  comment            370 non-null    object 
 11  avg_comment_score  370 non-null    float64
 12  title_kw           370 non-null    object 
 13  description_kw     370 non-null    object 
 14  comment_kw         370 non-null    object 
 15  description_summ   370 non-null    object 
 16  comment_summ       370 non-

In [21]:
# # save to file
# data.to_csv(".\\data\\clean_data_posts.csv", index=False)

# Data Exploration

Given that we are dealing with sentences, there are not many visualisations we can produce to reflect this. This is fine, since my original goal was to summarise all advice to something more compact/digestable. But, we can create visuals with keywords if needed. 

Note: Sentiment is not of interest in this project.

Hierarchy:
- Subreddit
    - Flair | Origin
        - Title | Description
            - Comment

In [26]:
# Setup (for data exploration)
import pandas as pd
pd.set_option('display.max_colwidth', None) # full length text
pd.set_option('display.max_rows', 500)
import ipywidgets as widgets
from IPython.display import display, HTML
from wordcloud import WordCloud
import matplotlib.pyplot as plt

data = pd.read_csv(".\\data\\clean_data_posts.csv", keep_default_na=False)

In [28]:
# Functions which create widgets
def get_dropdown(df, colname):
    values = df[colname].unique().tolist()
    dropdown = widgets.Dropdown(
        options = ['All'] + values,
        value = 'All',
        description = f"{colname}:",
        continuous_update=False
    )
    return dropdown

def get_kw_box(df, colname):
    textbox = widgets.Text(
        value = "",
        placeholder = "",
        description = f"Keyword Search for {colname}:",
        display='flex',
        flex_flow='column',
        align_items='stretch',
        style= {'description_width': 'initial'},
        continuous_update=False
    )
    return textbox

def get_columns(df, pat=""): # return dropdown of column names
    if pat == "":
        values = df.columns.tolist()
    else:
        values = [colname for colname in df.columns if pat in colname]
    dropdown = kw_columns_dd = widgets.Dropdown(
        options = values,
        value = 'title_kw',
        description = f"Keyword Column:",
        display='flex',
        flex_flow='column',
        align_items='stretch',
        style= {'description_width': 'initial'},
        continuous_update=False
    )
    return dropdown

In [29]:
def apply_dd_filter(df, colname, choice):
    if choice == "All":
        return df
    else:
        return df[df[colname] == choice]

def apply_text_filter(df, colname, kw):
    if kw == "":
        return df
    else:
        df = df.reset_index(drop=True)
        kw = kw.lower()
        keep_i = []
        for i, text in enumerate(df[colname]):
            for word in text.split("."):
                if word.strip() == kw:
                    keep_i.append(i)
                    break
        return df.iloc[keep_i, :]

def display_table(subreddit_opt, flair_opt, origin_opt, title_kw_opt, description_kw_opt, comment_kw_opt):
    df = data.copy()
    df = apply_dd_filter(df, 'subreddit', subreddit_opt)
    df = apply_dd_filter(df, 'flair', flair_opt)
    df = apply_dd_filter(df, 'origin', origin_opt)
    df = apply_text_filter(df, 'title_kw', title_kw_opt)
    df = apply_text_filter(df, 'description_kw', description_kw_opt)
    df = apply_text_filter(df, 'comment_kw', comment_kw_opt)
    print("Note: press ENTER after text input.")
    df = df.sort_values(['upvote_ratio','avg_comment_score'], ascending=False)
    comment_kws = [w for string in df['comment_kw'] for w in string.split(". ")]
    df = df[['subreddit','title','datetime','flair','origin','upvote_ratio','description_summ','comment_summ','avg_comment_score']]
    return df

def display_wc(subreddit_opt, flair_opt, origin_opt, kw_col_opt):
    df = data.copy()
    df = apply_dd_filter(df, 'subreddit', subreddit_opt)
    df = apply_dd_filter(df, 'flair', flair_opt)
    df = apply_dd_filter(df, 'origin', origin_opt)
    kws = [word for string in df[kw_col_opt] for word in string.split(". ")]
    kws = " ".join(kws)
    wc = WordCloud(width=800, height=600, min_font_size=10, background_color="white").generate(kws)
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wc)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    return plt.show()

In [30]:
# Widgets
subreddit_dd = get_dropdown(data, 'subreddit')
flair_dd = get_dropdown(data, 'flair')
origin_dd = get_dropdown(data, 'origin')
title_kw_box = get_kw_box(data, 'title_kw')
description_kw_box = get_kw_box(data, 'description_kw')
comment_kw_box = get_kw_box(data, 'comment_kw')
kw_column_dd = get_columns(data, '_kw')

### Summarised Posts:

In [31]:
# Display Data Table
widgets.interact(
    display_table,
    subreddit_opt=subreddit_dd,
    flair_opt=flair_dd,
    origin_opt=origin_dd,
    title_kw_opt=title_kw_box,
    description_kw_opt=description_kw_box,
    comment_kw_opt=comment_kw_box
);

interactive(children=(Dropdown(description='subreddit:', options=('All', 'dataanalysis', 'datascience'), value…

Improvements for next time...
- Colour the keywords in summarised text using pandas styling (e.g. https://monkeylearn.com/static/3b4b48a512024d2f139ce5324534bf9f/b7203/studio-chewy.webp)
- Group similar flairs (since some flairs are specific to certain subreddits)

### Keywords from Posts:

In [32]:
# Display word cloud of keywords
widgets.interact(
    display_wc,
    subreddit_opt=subreddit_dd,
    flair_opt=flair_dd,
    origin_opt=origin_dd,
    kw_col_opt=kw_column_dd
);

interactive(children=(Dropdown(description='subreddit:', options=('All', 'dataanalysis', 'datascience'), value…

Note: if error produced - means that filter results returns empty.