### Step 1 - Retrieve and Extract Subreddit Data

At the time of writing the pushshift API was down and historical posts were not available.  As a backup, historical subreddit posts from academictorrents.com were retrieved.  Academictorrents.com maintains a repository of the pushshift data.  Data was retrieved from 2005-06 through 2022-12.  Details can be found here: https://academictorrents.com/details/c398a571976c78d346c325bd75c47b82edf6124e.

Data was retrieved in a compressed format.  The uncompressed format is an NDJSON file format.

#### Install Dependencies

In [1]:
!pip install openai

!pip install bertopic

Collecting openai
  Downloading openai-0.27.5-py3-none-any.whl (71 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.6/71.6 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.27.5
[0mCollecting bertopic
  Downloading bertopic-0.14.1-py2.py3-none-any.whl (120 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.7/120.7 kB[0m [31m18.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting hdbscan>=0.8.29
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m60.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting plotly>=4.7.0
  Downloading plotly-5.14.1-py2.py3-none-any.whl (15.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m

#### Import required libraries and read NDJSON file into a pandas dataframe.  Limit to required fields only.

In [1]:
import pandas as pd
import json 
import csv 
import time
import datetime
import os
import re

In [6]:
filename = '../data/talesfromcallcenters_submissions.ndjson' 

# Read the JSON file directly into a dataframe, selecting the desired columns
results = pd.read_json(filename, lines=True)[['id', 'title', 'selftext', 'author', 'score', 'num_comments', 'created_utc']]

#### Drop NaN and view the first 10 rows of the dataframe

In [7]:
results.dropna(axis=1, inplace=True)

total_submissions = num_rows = len(results.index)


results

Unnamed: 0,id,title,selftext,author,score,num_comments,created_utc
0,yc807,"Yes, us lowly call center team members want no...","So there I am, sitting at my desk after time-s...",DovahkENT,40,10,1345149220
1,yc4bm,"No brain, No pain.",I swear that convergys will hire any moron tha...,[deleted],28,10,1345145883
2,yamv8,"""how fast does your modems go ?",Starting off the awesome Subreddit\n\nI used t...,[deleted],34,4,1345080297
3,yalvt,Welcome to r/talesfromcallcenters,As a former phone monkey myself i thought it w...,[deleted],33,3,1345079410
4,ydntm,"[I can't tell you] information line, this is P...",Yay! 28th subscriber!\n\nI used to work for a...,PoglaTheGrate,29,4,1345215773
...,...,...,...,...,...,...,...
17111,zz8ztg,When will they learn?,(Work emergency roadside assistance as a dispa...,HogwartsAlumni25,30,11,1672428178
17112,zzdo0y,"""this is the most pleasant phone call i've had...",[removed],trip90458343,1,0,1672439681
17113,zzgolf,call center anxiety??,[deleted],[deleted],1,0,1672447446
17114,zzk31c,Call Avoidance,"Long time lurker here, but I recently found ou...",Ceh0208,147,63,1672457066


#### Note the [deleted] value in author field.  Check to see if this occurs in the selftext field as well.  Selftext is the body of the submission.  Selftext value of [deleted] will not be useful for the anlaysis.

In [8]:


# Count the number of times "[deleted]" appears in the specified column
num_deleted = (results['selftext'] == '[deleted]').sum()

# Print the result
print(f'The word "[deleted]" appears {num_deleted} times in the selftext column')

The word "[deleted]" appears 1103 times in the selftext column


#### Since the word [deleted] appears we remove those rows from the results in the next step.  Author [deleted], if any, are OK though since we still have the selftext to analyze.

In [9]:
results = results[~results['selftext'].isin(['[deleted]', '[removed]'])]

total_usable_submissions = num_rows = len(results.index)


#### Display the results to get the total row count and a view of the first five and last five rows

In [10]:
results

Unnamed: 0,id,title,selftext,author,score,num_comments,created_utc
0,yc807,"Yes, us lowly call center team members want no...","So there I am, sitting at my desk after time-s...",DovahkENT,40,10,1345149220
1,yc4bm,"No brain, No pain.",I swear that convergys will hire any moron tha...,[deleted],28,10,1345145883
2,yamv8,"""how fast does your modems go ?",Starting off the awesome Subreddit\n\nI used t...,[deleted],34,4,1345080297
3,yalvt,Welcome to r/talesfromcallcenters,As a former phone monkey myself i thought it w...,[deleted],33,3,1345079410
4,ydntm,"[I can't tell you] information line, this is P...",Yay! 28th subscriber!\n\nI used to work for a...,PoglaTheGrate,29,4,1345215773
...,...,...,...,...,...,...,...
17109,zyts64,[long] they're my earbuds and i want them now!,i've been a lurker of this subreddit ever sinc...,secret-tacos,61,6,1672383546
17110,zz7684,"kudos to you guys, I don't know how you do it.","I worked in retail for 7 years, recently took ...",Fact0ry0fSadness,211,54,1672423668
17111,zz8ztg,When will they learn?,(Work emergency roadside assistance as a dispa...,HogwartsAlumni25,30,11,1672428178
17114,zzk31c,Call Avoidance,"Long time lurker here, but I recently found ou...",Ceh0208,147,63,1672457066


#### The created_utc field is showing time from epoch in seconds which is not useful.  This needs to be converted to a readable format and placed in a new field called created_date

In [11]:


results['created_date'] = pd.to_datetime(results['created_utc'], unit='s')

# Drop the epoch seconds column
results.drop('created_utc', axis=1, inplace=True)




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['created_date'] = pd.to_datetime(results['created_utc'], unit='s')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results.drop('created_utc', axis=1, inplace=True)


#### Check the dataframe again to ensure that the new field has been populated correctly and the old field dropped

In [12]:
results

Unnamed: 0,id,title,selftext,author,score,num_comments,created_date
0,yc807,"Yes, us lowly call center team members want no...","So there I am, sitting at my desk after time-s...",DovahkENT,40,10,2012-08-16 20:33:40
1,yc4bm,"No brain, No pain.",I swear that convergys will hire any moron tha...,[deleted],28,10,2012-08-16 19:38:03
2,yamv8,"""how fast does your modems go ?",Starting off the awesome Subreddit\n\nI used t...,[deleted],34,4,2012-08-16 01:24:57
3,yalvt,Welcome to r/talesfromcallcenters,As a former phone monkey myself i thought it w...,[deleted],33,3,2012-08-16 01:10:10
4,ydntm,"[I can't tell you] information line, this is P...",Yay! 28th subscriber!\n\nI used to work for a...,PoglaTheGrate,29,4,2012-08-17 15:02:53
...,...,...,...,...,...,...,...
17109,zyts64,[long] they're my earbuds and i want them now!,i've been a lurker of this subreddit ever sinc...,secret-tacos,61,6,2022-12-30 06:59:06
17110,zz7684,"kudos to you guys, I don't know how you do it.","I worked in retail for 7 years, recently took ...",Fact0ry0fSadness,211,54,2022-12-30 18:07:48
17111,zz8ztg,When will they learn?,(Work emergency roadside assistance as a dispa...,HogwartsAlumni25,30,11,2022-12-30 19:22:58
17114,zzk31c,Call Avoidance,"Long time lurker here, but I recently found ou...",Ceh0208,147,63,2022-12-31 03:24:26


In [13]:
# extract earliest and latest dates
start_date = results['created_date'].min().date()
end_date = results['created_date'].max().date()

# print results
print("Start date: ", start_date)
print("End date: ", end_date)

Start date:  2012-08-16
End date:  2022-12-31


#### Remove posts shorter than 100 words

In [14]:
def word_count(text):
    return len(text.split())

results = results[results['selftext'].apply(word_count) >= 100]

#### Remove posts that are greater than 874 tokens

In [15]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-large')

# Define the token count threshold
max_tokens = 874

# Function to count tokens in a text
def count_tokens(text):
    return len(tokenizer.encode(text, truncation=False, max_length=1024))

# Apply the token count function to the 'selftext' column
results['token_count'] = results['selftext'].apply(count_tokens)

# Filter the dataframe
filtered_submissions = results[results['token_count'] <= max_tokens]

# Drop the 'token_count' column as it's not needed anymore
filtered_df = filtered_submissions.drop(columns=['token_count'])

Downloading tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/3.02k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['token_count'] = results['selftext'].apply(count_tokens)


#### Get number of rows

In [16]:
filtered_df

Unnamed: 0,id,title,selftext,author,score,num_comments,created_date
1,yc4bm,"No brain, No pain.",I swear that convergys will hire any moron tha...,[deleted],28,10,2012-08-16 19:38:03
2,yamv8,"""how fast does your modems go ?",Starting off the awesome Subreddit\n\nI used t...,[deleted],34,4,2012-08-16 01:24:57
4,ydntm,"[I can't tell you] information line, this is P...",Yay! 28th subscriber!\n\nI used to work for a...,PoglaTheGrate,29,4,2012-08-17 15:02:53
5,yicc4,Dishwasher blues,The stories i've heard amazes me but this one ...,[deleted],31,15,2012-08-20 05:05:34
6,ymjc0,"Tech support agent, and yet I can't touch my c...",At my work we are not allowed to adjust the mo...,hanzors,40,19,2012-08-22 06:27:29
...,...,...,...,...,...,...,...
17106,zxk7w2,"""Are you a camel jockey?"" Oh dear God",I just had a call that absolutely takes the ca...,CZJayG,277,41,2022-12-28 20:35:49
17107,zxn7bt,Call Totals Giving me Anxiety,Does anyone else have to make a certain amount...,BatBitch1016,45,7,2022-12-28 22:31:50
17110,zz7684,"kudos to you guys, I don't know how you do it.","I worked in retail for 7 years, recently took ...",Fact0ry0fSadness,211,54,2022-12-30 18:07:48
17111,zz8ztg,When will they learn?,(Work emergency roadside assistance as a dispa...,HogwartsAlumni25,30,11,2022-12-30 19:22:58


In [17]:
total_inscope_submissions = num_rows = len(filtered_df.index)


#### Clean up the text

In [18]:
def clean_text(text):
    # Remove emojis
    text = re.sub(r'[\U00010000-\U0010ffff]', '', text)
    # Remove hyperlinks
    text = re.sub(r'http\S+', '', text)

    # Remove unicode characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Remove HTML codes
    text = text.replace("&amp;nbsp;", "")

    # Replace 'tl;dr' with 'in summary'
    new_text = text.replace("tl;dr", "in summary")

    return new_text

In [19]:
# Apply the function to the dataframe
filtered_df['selftext'] = filtered_df['selftext'].apply(clean_text)

#### Save work to this point by exporting the results dataframe to a CSV

In [20]:
filtered_df.to_csv('tfcc_submissions.csv', index=False)

#### Repeat above steps for comments.  In this case we will only retrieve parent_id (to associate the comment with an original submission), body, and score

In [21]:
##talesfromcallcenters_comments

filename = '../data/talesfromcallcenters_comments.ndjson'

# Read the JSON file directly into a dataframe, selecting the desired columns
results = pd.read_json(filename, lines=True)[['parent_id', 'body', 'score']]




#### Drop NaN and display the first few and last rows.

In [22]:
results.dropna(axis=1, inplace=True)

total_comments = num_rows = len(results.index)

results

Unnamed: 0,parent_id,body,score
0,t3_yamv8,"OUR HIGH SPEED INTERNET PACKAGE IS SO FAST, YO...",18
1,t1_c5txb4w,hahahahahahaha,3
2,t3_yalvt,I await with great interest to see what people...,7
3,t3_yamv8,Convergys...in Lake Mary FL? 0_0,3
4,t1_c5u5kzb,"No, I'm based in Winnipeg, Manitoba Canada and...",3
...,...,...,...
293217,t3_zzqwd8,"When I worked front desk for a hotel, someone ...",26
293218,t1_j2dej3k,I've done that but not to avoid calls. It was ...,7
293219,t1_j2dstmw,One time I was having such a bad day and not h...,3
293220,t3_zzqwd8,HAHAHAHAHA!,5


#### Remove all comments with less than 50 words

In [23]:
def word_count(text):
    return len(text.split())

results = results[results['body'].apply(word_count) >= 50]


In [24]:
results

Unnamed: 0,parent_id,body,score
6,t3_yc807,I'm not too familiar with mortgages but i thin...,8
21,t3_yc807,While I have no problem in the identification ...,4
23,t1_c5x5hyx,I never understood why we make people type in ...,3
26,t3_ymjc0,"this makes me like my call center, everything ...",3
30,t3_yc4bm,My boyfriend &amp; a few of our friends have w...,4
...,...,...,...
293208,t1_j2en3sz,I feel that. I love when claimants want a rent...,2
293211,t1_j2egyx9,"I grit my teeth and stayed there, met a handfu...",2
293217,t3_zzqwd8,"When I worked front desk for a hotel, someone ...",26
293218,t1_j2dej3k,I've done that but not to avoid calls. It was ...,7


#### Remove comments > 200 tokens

In [25]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-large')

# Define the token count threshold
max_tokens = 200

# Function to count tokens in a text
def count_tokens(text):
    return len(tokenizer.encode(text, truncation=False, max_length=1024))

# Apply the token count function to the 'selftext' column
results['token_count'] = results['body'].apply(count_tokens)

# Filter the dataframe
filtered_comments = results[results['token_count'] <= max_tokens]

# Drop the 'token_count' column as it's not needed anymore
filtered_comments_df = filtered_comments.drop(columns=['token_count'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['token_count'] = results['body'].apply(count_tokens)


In [26]:
total_in_scope_comments = num_rows = len(filtered_comments_df.index)


filtered_comments_df

Unnamed: 0,parent_id,body,score
6,t3_yc807,I'm not too familiar with mortgages but i thin...,8
21,t3_yc807,While I have no problem in the identification ...,4
23,t1_c5x5hyx,I never understood why we make people type in ...,3
26,t3_ymjc0,"this makes me like my call center, everything ...",3
30,t3_yc4bm,My boyfriend &amp; a few of our friends have w...,4
...,...,...,...
293203,t1_j2cqjfx,Or when they’ve paid their bill late and we ha...,3
293208,t1_j2en3sz,I feel that. I love when claimants want a rent...,2
293217,t3_zzqwd8,"When I worked front desk for a hotel, someone ...",26
293218,t1_j2dej3k,I've done that but not to avoid calls. It was ...,7


#### Clean up the text

In [27]:
filtered_comments_df['body'] = filtered_comments_df['body'].apply(clean_text)

#### Export the comments to a csv

In [28]:
filtered_comments_df.to_csv('tfcc_comments.csv', index=False)

#### Clean up.  Remove the NDJSON files from the working directory

In [31]:

os.remove('talesfromcallcenters_comments.ndjson')

os.remove('talesfromcallcenters_submissions.ndjson')


### Wrap up: We now have two CSV files of reddit submissions and posts ready for analysis

### This section creates a CSV file for the introduction section of the RMarkdown report

#### Create the subreddit meta data table

In [31]:
import openai
import json 
openai.api_key = "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"

In [32]:
import json
import openai

error_string = "{\"summary\": \"none\"}"

def getOpenAI_Summary(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": "" + prompt + ""}
        ]
    )

    initial_response = response['choices'][0]['message']['content']

    print(initial_response)

    return initial_response


#### Because Cohere and OpenAI training data included Reddit, it will be able to answer general questions about a specific subreddit.  Here we ask OpenAI to describe the subreddit

In [33]:
prompt = 'Describe the Tales from Call Centers (TFCC) subreddit in two lengthy paragraphs.  Include things like history, volume, and themes.  Write in an academic style:'

subreddit_description = getOpenAI_Summary(prompt)


Tales from Call Centers (TFCC) is a subreddit dedicated to the experiences of individuals working in various call centers across the globe. Established in 2011, this online community has amassed a significant following with a current membership of over 400,000 members. TFCC provides an open space for current and former call center employees to share and discuss their encounters with customers, work conditions, and management. 

The stories shared on TFCC are diverse in nature and vary from humorous to heartbreaking. A common theme among the posts is the strain that call center work can put on employees. Members often highlight the stringent work schedules, excessive monitoring, and lack of autonomy that makes call centers a challenging work environment. The subreddit also reveals the unpredictable nature of the job, with each interaction with a customer bringing a unique set of challenges. The stories shared on TFCC showcase the lengths employees often have to go to provide satisfactor

In [34]:
subreddit_name = "Tales from Call Centers (TFCC)"
subreddit_url = "https://www.reddit.com/r/talesfromcallcenters/"
total_submissions
total_usable_submissions
total_inscope_submissions
total_comments = num_rows
total_in_scope_comments


76842

#### Construct a string that described the number of submissions and comments that were in scope and the date range of the subreddit data

In [35]:
scope = "A total of " + format(total_submissions, ',') + " subreddit submissions and " + format(total_comments, ',') + " associated comments were extracted for the period of " + str(start_date) + " through " + str(end_date) + " from the " + subreddit_name + " subreddit (" + subreddit_url + "). Of the total submissions and comments, " + format(total_usable_submissions, ',') + " submissions and " + format(total_in_scope_comments, ',') + " comments were retained after cleanup (short text, null value removal)."


In [36]:
scope2 = " The following sections summarize each of the top 20 topics identified through topic modeling using the BERTopic library.  The sections are formatted as follows: 1) Topic number plus the BERTopic description 2) Themes identified from OpenAI 3) Sentiment analysis results (note that while all sentiment scores are negative, the comments sentiment are generally more positive than the submissions sentiment) 4) Summaries of Submissions (each paragraph represents between 25 - 50 original submissions that have been summarized twice using abstractive summarization) and 5) Summary of comments (a brief summarization of the top 100 comments related to the topic)."

In [37]:
scope = scope + scope2

In [38]:
description_intro = "The following report was generated completely using the OpenAI GPT 3.5 Turbo API through a series of summarization steps.  It is important to note that one of the common risks associated with abstractive summarization is hallucination, which is the introduction of content not completely relevant to the source text.  Abstractive summarization is not perfect and while there are methods to check the accuracy of an abstractive summary, it is not a guarantee of accuracy.  The intent of this report is to consolidate the submissions to a subreddit over an extended period of time and group those submissions into categories identified using NLP analysis techniques.  Should some of the topics be of interest, then a further review of the original subreddit posts is recommended.  Stephen Drew, 2 April 2023."

In [39]:
scope

'A total of 17,116 subreddit submissions and 76,842 associated comments were extracted for the period of 2012-08-16 through 2022-12-31 from the Tales from Call Centers (TFCC) subreddit (https://www.reddit.com/r/talesfromcallcenters/). Of the total submissions and comments, 13,813 submissions and 76,842 comments were retained after cleanup (short text, null value removal). The following sections summarize each of the top 20 topics identified through topic modeling using the BERTopic library.  The sections are formatted as follows: 1) Topic number plus the BERTopic description 2) Themes identified from OpenAI 3) Sentiment analysis results (note that while all sentiment scores are negative, the comments sentiment are generally more positive than the submissions sentiment) 4) Summaries of Submissions (each paragraph represents between 25 - 50 original submissions that have been summarized twice using abstractive summarization) and 5) Summary of comments (a brief summarization of the top 10

In [40]:
# create a dictionary with the variables as values
data = {'scope': [scope], 'subreddit_description': [subreddit_description], 'subreddit_name': [subreddit_name], 'description_intro': [description_intro]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

In [41]:
df

Unnamed: 0,scope,subreddit_description,subreddit_name,description_intro
0,"A total of 17,116 subreddit submissions and 76...",Tales from Call Centers (TFCC) is a subreddit ...,Tales from Call Centers (TFCC),The following report was generated completely ...


#### Export to a CSV

In [42]:
df.to_csv('subreddit_overview.csv', index=False)