# Fictional Characters Reddit Data

Most of the Reddit data was obtained through the University and is in .pickle format. A small subset of the data was scraped directly from old.reddit.com in cases where the data needed was limited to only specific threads (for example, data pertaining to "Catcher in the Rye" on the r/literature or r/books subreddits).

First, I will proceed with loading, cleaning, and preprocessing the .pickle data, as this is the format of most of the data.

I will use the format below for each set of data pertaining to the different fictional works. I will replace breakingbad_data with a new variable for each different set of data.

In [92]:
import os
import pickle

# Define the folder path
folder_path = r'C:\Users\josie\Downloads\Josie\BreakingBad' #Breaking Bad as an example

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.endswith('.pickle')]

# Initialize empty list to store loaded data
data = {}

# Loop through each pickle file and load its contents
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        data[file_name] = pickle.load(f)

# Cleaning and Preprocessing all Data

## "r/HarryPotter"

### Step 1: Upload Data, Check and Inspect Contents, Create Merged Dataframe

All of the Harry Potter Reddit Data is located in the folder "harrypotter" within the "Reddit" folder that contains all the other Reddit data.

First, we need to load it into our Python terminal.

In [93]:
# Specify the path to the folder containing the Harry Potter pickle files
folder_path = r'C:\Users\josie\OneDrive\Desktop\Reddit\harrypotter'

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.endswith('.pickle')]

# Initialize empty dictionary to store loaded data
harry_potter_data = {}

# Load each pickle file and store its contents in the dictionary
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        harry_potter_data[file_name] = pickle.load(f)

# Now I have all the loaded Harry Potter data stored in the dictionary 'harry_potter_data'

I will now check the data to see its structure and inspect its contents.

In [94]:
# Iterate over each pickle file and inspect its contents
for file_name, data in harry_potter_data.items():
    print(f"File: {file_name}")
    print(f"Data type: {type(data)}")
    print(f"Keys: {data.keys()}")
    print()

File: subreddit_HarryPotterBooks_RC_2014-01.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_HarryPotterBooks_RC_2014-02.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_HarryPotterBooks_RC_2014-03.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_HarryPotterBooks_RC_2014-04.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_HarryPotterBooks_RC_2014-05.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_HarryPotterBooks_RC_2014

The dataset contains files belonging to the subreddits r/HarryPotter and r/HarryPotterBooks.

We are interested in keeping these analyses separate, so we will first explore the r/HarryPotter data alone.

RS refers to 'Reddit Submissions' and RC refers to 'Reddit Comments'. These files will contain differently structured dataframes, so I will first create 2 different dataframes concatenating the RS and RC data separately.

In [95]:
import pandas as pd

# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_harrypotter_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_harrypotter_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
harry_potter_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
harry_potter_comments = pd.concat(rc_data.values(), ignore_index=True)

The 'id' column in harry_potter_submissions (originally RS) contains values like '91mvt' while the link_id and parent_id contain values like 't3_91mvt'.

In [96]:
# Check if all values in 'link_id' column start with 't3_'
all_values_start_with_t3 = harry_potter_comments['link_id'].str.startswith('t3_').all()

print("All values in 'link_id' column start with 't3_':", all_values_start_with_t3)

All values in 'link_id' column start with 't3_': True


If we remove the 't3_' prefix that appears to be consistent so that I can merge the two dataframes on the matching column.

In [97]:
# Remove the 't3_' prefix from every row in the link_id column
harry_potter_comments['cleaned_link_id'] = harry_potter_comments['link_id'].str.replace('t3_', '', regex=False)

# Display the first few rows to verify the prefix has been removed
print(harry_potter_comments[['link_id', 'cleaned_link_id']].head())

# Merge the DataFrames using the cleaned_link_id column and id column
harrypotter_df = pd.merge(harry_potter_comments, harry_potter_submissions, left_on='cleaned_link_id', right_on='id', how='inner', suffixes=('_comment', '_submission'))

# Display the first few rows of the merged DataFrame
print(harrypotter_df.head())

    link_id cleaned_link_id
0  t3_91mvt           91mvt
1  t3_bzfhl           bzfhl
2  t3_ai9xg           ai9xg
3  t3_bzfhl           bzfhl
4  t3_ckp0j           ckp0j
  id_comment    author_comment   link_id parent_id subreddit_comment  \
0    c0b57bl         [deleted]  t3_91mvt  t3_91mvt       harrypotter   
1    c0pd55y         NitsujTPU  t3_bzfhl  t3_bzfhl       harrypotter   
2    c0rmz31  theartfulrambler  t3_bzfhl  t3_bzfhl       harrypotter   
3    c0rjt2h   FruitySnackLove  t3_ai9xg  t3_ai9xg       harrypotter   
4    c0tgbmr     WordsVerbatim  t3_ckp0j  t3_ckp0j       harrypotter   

                                                body cleaned_link_id  \
0                                          [deleted]           91mvt   
1  These quotes aren't very good, no offense.\n\n...           bzfhl   
2      Any time Ron says 'bloody' it makes me happy.           bzfhl   
3                                   i will try this!           ai9xg   
4  Agreed. \n\nSo, was it not fucking a

The new merged dataframe, harrypotter_df, contains some redundant columns that are no longer needed: cleaned_link_id appears twice and is not needed in light of the id_submission that connects the original submission to its corresponding comments, so we can drop it and visualize the new results.

link_id and parent_id are also unnecessary now, so I will also remove them.

In [98]:
#'cleaned_link_id' appears twice in the DataFrame
harrypotter_df.drop(labels=['cleaned_link_id'], axis=1, inplace=True)
harrypotter_df.drop(columns=['link_id', 'parent_id'], inplace=True)

print(harrypotter_df.head())

  id_comment    author_comment subreddit_comment  \
0    c0b57bl         [deleted]       harrypotter   
1    c0pd55y         NitsujTPU       harrypotter   
2    c0rmz31  theartfulrambler       harrypotter   
3    c0rjt2h   FruitySnackLove       harrypotter   
4    c0tgbmr     WordsVerbatim       harrypotter   

                                                body id_submission  \
0                                          [deleted]         91mvt   
1  These quotes aren't very good, no offense.\n\n...         bzfhl   
2      Any time Ron says 'bloody' it makes me happy.         bzfhl   
3                                   i will try this!         ai9xg   
4  Agreed. \n\nSo, was it not fucking awesome? \n...         ckp0j   

  author_submission subreddit_submission  \
0         [deleted]          harrypotter   
1         neiltracy          harrypotter   
2         neiltracy          harrypotter   
3          hpgeek42          harrypotter   
4         snatchula          harrypotter   

 

### Cleaning and Preprocessing HP data

Now that we've merged our dataframe and included only the relevant columns, let's proceed with our cleaning and preprocessing steps.

First, let's ensure that we remove any rows with the value '[deleted]' in the 'body' column. These consist of comments that have been fully deleted by the Reddit user or admin(s) and contain no content to inspect, and are therefore useless. '[deleted]' in the 'selftext', 'author_submission', or 'author_comment' is not necessary to filter out, as there may still be corresponding comments. While we may not be able to attribute them, we are still able to use their contents for our analysis as Reddit is a public forum. 

In [99]:
# Drop rows with '[deleted]' in the 'body' column
harrypotter_df = harrypotter_df[harrypotter_df['body'] != '[deleted]']

Now we can proceed with the lowercasing...

In [100]:
# Lowercase 'body' column
harrypotter_df['body'] = harrypotter_df['body'].str.lower()

# Lowercase 'selftext' column
harrypotter_df['selftext'] = harrypotter_df['selftext'].str.lower()

Now with removing the punctuation...

In [101]:
import string

# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
harrypotter_df[text_columns] = harrypotter_df[text_columns].applymap(remove_punctuation)

\n\n is used to denote paragraph breaks or new sections of text. We can proceed with removing those.

In [102]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
harrypotter_df['body'] = harrypotter_df['body'].str.replace('\n\n', ' ')
harrypotter_df['selftext'] = harrypotter_df['selftext'].str.replace('\n\n', ' ')

In [103]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_HP_DATA_CLEANED.txt'
harrypotter_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_HP_DATA_CLEANED.txt


## Now save all relevant columns and groupings accordingly as .txt files before tokenizing for analysis in Voyant.

Now, we can move ahead with tokenizing, starting with the 'body' column.

In [104]:
import nltk
from nltk.tokenize import word_tokenize

# Tokenize the 'body' column
harrypotter_df['body'] = harrypotter_df['body'].apply(word_tokenize)

# Print the updated 'body' column
print(harrypotter_df['body'])

1         [these, quotes, arent, very, good, no, offense...
2         [any, time, ron, says, bloody, it, makes, me, ...
3                                      [i, will, try, this]
4           [agreed, so, was, it, not, fucking, awesome, d]
5         [im, so, excited, i, keep, thinking, about, wh...
                                ...                        
635399    [from, memory, the, only, other, thing, i, can...
635400    [nah, i, think, if, she, can, watch, gof, she,...
635401    [i, also, thought, about, what, i, am, going, ...
635402    [thanks, for, the, advice, man, yeah, i, ’, m,...
635403                  [yes, he, does, an, excellent, job]
Name: body, Length: 615681, dtype: object


Now, the 'selftext' column...

In [105]:
harrypotter_df['selftext'] = harrypotter_df['selftext'].apply(word_tokenize)

Finally, let's remove the stopwords, but only from the 'body' and 'selftext' columns. We want to keep the titles intact because it could be difficult to discern the prompt without the original language and formatting.

In [106]:
from nltk.corpus import stopwords

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from the tokenized 'body' column
harrypotter_df['body'] = harrypotter_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# Remove stopwords from the tokenized 'selftext' column
harrypotter_df['selftext'] = harrypotter_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

For our analyses, we are not interested in lemmatizing words as we want to pay attention to verb tenses.

Now that all the cleaning and preprocessing steps have been executed, let's run a check to ensure everything looks good.

In [107]:
harrypotter_df

Unnamed: 0,id_comment,author_comment,subreddit_comment,body,id_submission,author_submission,subreddit_submission,title,selftext
1,c0pd55y,NitsujTPU,harrypotter,"[quotes, arent, good, offense, funny, part, sh...",bzfhl,neiltracy,harrypotter,Best Harry Potter Movie Quotes,[]
2,c0rmz31,theartfulrambler,harrypotter,"[time, ron, says, bloody, makes, happy]",bzfhl,neiltracy,harrypotter,Best Harry Potter Movie Quotes,[]
3,c0rjt2h,FruitySnackLove,harrypotter,[try],ai9xg,hpgeek42,harrypotter,Butterbeer - Done Proper,[]
4,c0tgbmr,WordsVerbatim,harrypotter,"[agreed, fucking, awesome]",ckp0j,snatchula,harrypotter,NEW HP TRAILER!! Suck it Twilight,[]
5,c0tjj7l,snatchula,harrypotter,"[im, excited, keep, thinking, going, stop, fir...",ckp0j,snatchula,harrypotter,NEW HP TRAILER!! Suck it Twilight,[]
...,...,...,...,...,...,...,...,...,...
635399,fg4z5tb,[deleted],harrypotter,"[memory, thing, think, might, scary, child, me...",ewwiam,bitchdantkillmyvibe,harrypotter,How scary is the Goblet of Fire movie?,"[hey, eight, year, old, daughter, massive, pot..."
635400,fg4zeve,enfanthorrible,harrypotter,"[nah, think, watch, gof, also, watch, ootp, hb...",ewwiam,bitchdantkillmyvibe,harrypotter,How scary is the Goblet of Fire movie?,"[hey, eight, year, old, daughter, massive, pot..."
635401,fg4zmzc,enfanthorrible,harrypotter,"[also, thought, going, kids, future, thought, ...",ewwiam,bitchdantkillmyvibe,harrypotter,How scary is the Goblet of Fire movie?,"[hey, eight, year, old, daughter, massive, pot..."
635402,fg4zx8y,bitchdantkillmyvibe,harrypotter,"[thanks, advice, man, yeah, ’, hoping, keep, b...",ewwiam,bitchdantkillmyvibe,harrypotter,How scary is the Goblet of Fire movie?,"[hey, eight, year, old, daughter, massive, pot..."


## "r/HarryPotterBooks"

Let's repeat the process for the r/HarryPotterBooks data.

In [108]:
import os
import pandas as pd

# Path to the folder containing the pickle files
folder_path = r'C:\Users\josie\OneDrive\Desktop\Reddit\harrypotter'

# Load a sample pickle file
sample_file = [file for file in os.listdir(folder_path) if file.startswith('subreddit_HarryPotterBooks_RC')][0]

# Load the sample file to inspect
sample_df = pd.read_pickle(os.path.join(folder_path, sample_file))

# Display the first few rows of the sample DataFrame
print(sample_df.head())

        id           author    link_id   parent_id         subreddit  \
0  ceff2fg        [deleted]  t3_1tcbbk   t3_1tcbbk  HarryPotterBooks   
1  cegbo9j          GameOfT  t3_1tcbbk   t3_1tcbbk  HarryPotterBooks   
2  cejfxkq            Rjr18  t3_1ulye4   t3_1ulye4  HarryPotterBooks   
3  cejgivj   MilkIsMyPotion  t3_1ulye4   t3_1ulye4  HarryPotterBooks   
4  cejkfmq  GaslightProphet  t3_1sp5h8  t1_ce040as  HarryPotterBooks   

                                                body  
0                                          [deleted]  
1  There are a lot of twists and turns throughout...  
2  Easily the Half-Blood Prince. I think it has t...  
3  Its GoF. Not only because it was my first Harr...  
4         Didnt the movies pronounce it Her-My-Knee?  


In [109]:
# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_HarryPotterBooks_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_HarryPotterBooks_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
harry_potter_books_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
harry_potter_books_comments = pd.concat(rc_data.values(), ignore_index=True)

In [110]:
print(harry_potter_books_submissions)
print(harry_potter_books_comments)

          id              author         subreddit  \
0     7ndcao         stiinabiina  HarryPotterBooks   
1     7nkkre           Dillonk12  HarryPotterBooks   
2     7nnux9           [deleted]  HarryPotterBooks   
3     7ntkop              tebz07  HarryPotterBooks   
4     7nttwp          AluminiaHz  HarryPotterBooks   
...      ...                 ...               ...   
6306  zydekn            stlshlee  HarryPotterBooks   
6307  zyihxq            ivaninge  HarryPotterBooks   
6308  zym9vf  DingoAgreeable9141  HarryPotterBooks   
6309  zyyql5         SoulxShadow  HarryPotterBooks   
6310  zz9rn1    Always-bi-myself  HarryPotterBooks   

                                                  title  \
0     Jim Dale changing his pronunciation of Voldemo...   
1                                  Snape and Occlumency   
2                   Your favorite Harry Potter fan art?   
3                                    Selective Security   
4               Isn't Harry kind of a jerk to Hermione? 

In [111]:
# Remove 't3_' prefix from the link_id column in the comments DataFrame
harry_potter_books_comments['link_id'] = harry_potter_books_comments['link_id'].str.replace('t3_', '')

In [112]:
# Merge the submissions and comments DataFrames on the id and link_id columns
merged_df = harry_potter_books_comments.merge(harry_potter_books_submissions, left_on='link_id', right_on='id', suffixes=('_comment', '_submission'))

# Display the merged DataFrame
print(merged_df.head())

  id_comment author_comment link_id   parent_id subreddit_comment  \
0    ds12yz3        jhynezz  7ndcao   t3_7ndcao  HarryPotterBooks   
1    ds1e4sc     joriebooks  7ndcao   t3_7ndcao  HarryPotterBooks   
2    ds1gh8z  severusvape69  7ndcao   t3_7ndcao  HarryPotterBooks   
3    ds2jco5    stiinabiina  7ndcao  t1_ds1e4sc  HarryPotterBooks   
4    ds2jdfi    stiinabiina  7ndcao  t1_ds12yz3  HarryPotterBooks   

                                                body id_submission  \
0  I love these audiobooks and Jim dales voice is...        7ndcao   
1  I’ve heard that it was because the movies star...        7ndcao   
2                    More of a Stephen Fry kinda guy        7ndcao   
3  Ah. That's what I suspected but hoped I was wr...        7ndcao   
4      Thanks...sorry to ruin it for yoi. Obliviate!        7ndcao   

  author_submission subreddit_submission  \
0       stiinabiina     HarryPotterBooks   
1       stiinabiina     HarryPotterBooks   
2       stiinabiina     HarryPot

In [113]:
print("Merged DataFrame Columns:")
print(merged_df.columns)

Merged DataFrame Columns:
Index(['id_comment', 'author_comment', 'link_id', 'parent_id',
       'subreddit_comment', 'body', 'id_submission', 'author_submission',
       'subreddit_submission', 'title', 'selftext'],
      dtype='object')


In [114]:
# Remove the redundant columns
merged_df.drop(columns=['link_id', 'subreddit_comment'], inplace=True)

print("Columns in the cleaned DataFrame:")
print(merged_df.columns)

print("\nFirst few rows of the cleaned DataFrame:")
print(merged_df.head())

Columns in the cleaned DataFrame:
Index(['id_comment', 'author_comment', 'parent_id', 'body', 'id_submission',
       'author_submission', 'subreddit_submission', 'title', 'selftext'],
      dtype='object')

First few rows of the cleaned DataFrame:
  id_comment author_comment   parent_id  \
0    ds12yz3        jhynezz   t3_7ndcao   
1    ds1e4sc     joriebooks   t3_7ndcao   
2    ds1gh8z  severusvape69   t3_7ndcao   
3    ds2jco5    stiinabiina  t1_ds1e4sc   
4    ds2jdfi    stiinabiina  t1_ds12yz3   

                                                body id_submission  \
0  I love these audiobooks and Jim dales voice is...        7ndcao   
1  I’ve heard that it was because the movies star...        7ndcao   
2                    More of a Stephen Fry kinda guy        7ndcao   
3  Ah. That's what I suspected but hoped I was wr...        7ndcao   
4      Thanks...sorry to ruin it for yoi. Obliviate!        7ndcao   

  author_submission subreddit_submission  \
0       stiinabiina     Har

### Cleaning and Preprocessing Data

In [115]:
# Drop rows with '[deleted]' in the 'body' column
harrypotterbooks_df = merged_df[merged_df['body'] != '[deleted]']

In [116]:
# Lowercase 'body' column
harrypotterbooks_df['body'] = harrypotterbooks_df['body'].str.lower()

# Lowercase 'selftext' column
harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['body'] = harrypotterbooks_df['body'].str.lower()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].str.lower()


In [117]:
# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
harrypotterbooks_df[text_columns] = harrypotterbooks_df[text_columns].applymap(remove_punctuation)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df[text_columns] = harrypotterbooks_df[text_columns].applymap(remove_punctuation)


In [118]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
harrypotterbooks_df['body'] = harrypotterbooks_df['body'].str.replace('\n\n', ' ')
harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].str.replace('\n\n', ' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['body'] = harrypotterbooks_df['body'].str.replace('\n\n', ' ')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].str.replace('\n\n', ' ')


In [119]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_HPBOOKS_DATA_CLEANED.txt'
harrypotterbooks_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_HPBOOKS_DATA_CLEANED.txt


## Now save all relevant columns and groupings as needed as .txt files.

In [120]:
# Tokenize
harrypotterbooks_df['body'] = harrypotterbooks_df['body'].apply(word_tokenize)
harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].apply(word_tokenize)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['body'] = harrypotterbooks_df['body'].apply(word_tokenize)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].apply(word_tokenize)


In [121]:
# Remove stopwords from the tokenized 'body' column
harrypotterbooks_df['body'] = harrypotterbooks_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# Remove stopwords from the tokenized 'selftext' column
harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['body'] = harrypotterbooks_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harrypotterbooks_df['selftext'] = harrypotterbooks_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])


Final check.

In [122]:
harrypotterbooks_df

Unnamed: 0,id_comment,author_comment,parent_id,body,id_submission,author_submission,subreddit_submission,title,selftext
0,ds12yz3,jhynezz,t3_7ndcao,"[love, audiobooks, jim, dales, voice, life, ’,...",7ndcao,stiinabiina,HarryPotterBooks,Jim Dale changing his pronunciation of Voldemo...,"[first, 3, books, clearly, says, jkr, way, vol..."
1,ds1e4sc,joriebooks,t3_7ndcao,"[’, heard, movies, started, saying, voldemort]",7ndcao,stiinabiina,HarryPotterBooks,Jim Dale changing his pronunciation of Voldemo...,"[first, 3, books, clearly, says, jkr, way, vol..."
2,ds1gh8z,severusvape69,t3_7ndcao,"[stephen, fry, kinda, guy]",7ndcao,stiinabiina,HarryPotterBooks,Jim Dale changing his pronunciation of Voldemo...,"[first, 3, books, clearly, says, jkr, way, vol..."
3,ds2jco5,stiinabiina,t1_ds1e4sc,"[ah, thats, suspected, hoped, wrong, ugh, foll...",7ndcao,stiinabiina,HarryPotterBooks,Jim Dale changing his pronunciation of Voldemo...,"[first, 3, books, clearly, says, jkr, way, vol..."
4,ds2jdfi,stiinabiina,t1_ds12yz3,"[thankssorry, ruin, yoi, obliviate]",7ndcao,stiinabiina,HarryPotterBooks,Jim Dale changing his pronunciation of Voldemo...,"[first, 3, books, clearly, says, jkr, way, vol..."
...,...,...,...,...,...,...,...,...,...
7093,fg4jrgz,freakbiotic,t3_ewtteg,"[agree, never, supported, relationship, ron, a...",ewtteg,[deleted],HarryPotterBooks,The most frustrating part of the epilogue is R...,[deleted]
7094,fg4km28,Chinoiserie91,t3_ewtteg,"[well, saying, true, however, joking, one, inc...",ewtteg,[deleted],HarryPotterBooks,The most frustrating part of the epilogue is R...,[deleted]
7095,fg4kqph,dsjunior1388,t3_ewtteg,"[many, people, tendency, revert, younger, pers...",ewtteg,[deleted],HarryPotterBooks,The most frustrating part of the epilogue is R...,[deleted]
7096,fg4mgwf,batjeep1981,t3_ewtteg,"[worst, part, cursed, child]",ewtteg,[deleted],HarryPotterBooks,The most frustrating part of the epilogue is R...,[deleted]


# "r/KnivesOutMovie"

In [123]:
import pandas as pd

# Specify the path to the folder containing the Harry Potter pickle files
folder_path = r'C:\Users\josie\OneDrive\Desktop\Reddit\KnivesOutMovie'

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.startswith('subreddit_KnivesOutMovie')]

# Initialize empty dictionary to store loaded data
knivesout_data = {}

# Load each pickle file and store its contents in the dictionary
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        knivesout_data[file_name] = pickle.load(f)

In [124]:
# Iterate over each pickle file and inspect its contents
for file_name, data in knivesout_data.items():
    print(f"File: {file_name}")
    print(f"Data type: {type(data)}")
    print(f"Keys: {data.keys()}")
    print()

File: subreddit_KnivesOutMovie_RC_2021-01.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_KnivesOutMovie_RC_2021-02.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_KnivesOutMovie_RC_2021-03.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_KnivesOutMovie_RC_2021-04.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_KnivesOutMovie_RC_2021-05.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_KnivesOutMovie_RC_2021-06.pickle
D

In [125]:
# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_KnivesOutMovie_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_KnivesOutMovie_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
knivesout_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
knivesout_comments = pd.concat(rc_data.values(), ignore_index=True)

Check again if the link_id column for this set of data also begins with "t3_".

In [126]:
# Check if all values in 'link_id' column start with 't3_'
all_values_start_with_t3 = knivesout_comments['link_id'].str.startswith('t3_').all()

print("All values in 'link_id' column start with 't3_' or 't1_':", all_values_start_with_t3)

All values in 'link_id' column start with 't3_' or 't1_': True


In [127]:
# Remove 't3_' prefix from the link_id column in the comments DataFrame
knivesout_comments['link_id'] = knivesout_comments['link_id'].str.replace('t3_', '')

In [128]:
# Merge the submissions and comments DataFrames on the id and link_id columns
merged_df = knivesout_comments.merge(knivesout_submissions, left_on='link_id', right_on='id', suffixes=('_comment', '_submission'))

# Display the merged DataFrame
print(merged_df.head())

  id_comment    author_comment link_id  parent_id subreddit_comment  \
0    gjiw3t0       ilinamorato  kylrw7  t3_kylrw7    KnivesOutMovie   
1    gmlib2b  SheDaisy11151979  kylrw7  t3_kylrw7    KnivesOutMovie   
2    gk4ch8o   agentgravyphone  l2aivu  t3_l2aivu    KnivesOutMovie   
3    gk5987d   ErinEqualsPeace  l2aivu  t3_l2aivu    KnivesOutMovie   
4    gk63dxm         xeroxgirl  l2aivu  t3_l2aivu    KnivesOutMovie   

                                                body id_submission  \
0  Benoit explains in the next line. Because he d...        kylrw7   
1  Ransom thought that he'd successfully poisoned...        kylrw7   
2  I think she might have given them a bit of mon...        l2aivu   
3  She would probably need a fair amount for prop...        l2aivu   
4  Let's put something out there. They're not act...        l2aivu   

  author_submission subreddit_submission  \
0         imnotsus_       KnivesOutMovie   
1         imnotsus_       KnivesOutMovie   
2  SheDaisy11151979 

In [129]:
knivesout_df = merged_df

In [130]:
# Remove the redundant columns
#'cleaned_link_id' appears twice in the DataFrame
knivesout_df = knivesout_df.drop(['id_submission'], axis=1)

print(knivesout_df.head())

  id_comment    author_comment link_id  parent_id subreddit_comment  \
0    gjiw3t0       ilinamorato  kylrw7  t3_kylrw7    KnivesOutMovie   
1    gmlib2b  SheDaisy11151979  kylrw7  t3_kylrw7    KnivesOutMovie   
2    gk4ch8o   agentgravyphone  l2aivu  t3_l2aivu    KnivesOutMovie   
3    gk5987d   ErinEqualsPeace  l2aivu  t3_l2aivu    KnivesOutMovie   
4    gk63dxm         xeroxgirl  l2aivu  t3_l2aivu    KnivesOutMovie   

                                                body author_submission  \
0  Benoit explains in the next line. Because he d...         imnotsus_   
1  Ransom thought that he'd successfully poisoned...         imnotsus_   
2  I think she might have given them a bit of mon...  SheDaisy11151979   
3  She would probably need a fair amount for prop...  SheDaisy11151979   
4  Let's put something out there. They're not act...  SheDaisy11151979   

  subreddit_submission                                    title  \
0       KnivesOutMovie               Something odd with Ranso

In [131]:
knivesout_df

Unnamed: 0,id_comment,author_comment,link_id,parent_id,subreddit_comment,body,author_submission,subreddit_submission,title,selftext
0,gjiw3t0,ilinamorato,kylrw7,t3_kylrw7,KnivesOutMovie,Benoit explains in the next line. Because he d...,imnotsus_,KnivesOutMovie,Something odd with Ransom.,We see that when Mr.Blanc was revealing the my...
1,gmlib2b,SheDaisy11151979,kylrw7,t3_kylrw7,KnivesOutMovie,Ransom thought that he'd successfully poisoned...,imnotsus_,KnivesOutMovie,Something odd with Ransom.,We see that when Mr.Blanc was revealing the my...
2,gk4ch8o,agentgravyphone,l2aivu,t3_l2aivu,KnivesOutMovie,I think she might have given them a bit of mon...,SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,I find myself wondering how Marta dealt with t...
3,gk5987d,ErinEqualsPeace,l2aivu,t3_l2aivu,KnivesOutMovie,She would probably need a fair amount for prop...,SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,I find myself wondering how Marta dealt with t...
4,gk63dxm,xeroxgirl,l2aivu,t3_l2aivu,KnivesOutMovie,Let's put something out there. They're not act...,SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,I find myself wondering how Marta dealt with t...
...,...,...,...,...,...,...,...,...,...,...
122,hh06rn7,Creepy_Willow9842,q47l24,t3_q47l24,KnivesOutMovie,Honestly though the difference ends with some ...,[deleted],KnivesOutMovie,[deleted by user],[removed]
123,hmbtue9,The_Molsen,r38frz,t3_r38frz,KnivesOutMovie,It was implied that it was some kind of social...,Bahamaunt,KnivesOutMovie,So what did Meg study anyway??,She immediately sold out Marta when she reali...
124,hmcgobc,Bahamaunt,r38frz,t1_hmbtue9,KnivesOutMovie,"I'm not very knowledgeable on the matter, so ...",Bahamaunt,KnivesOutMovie,So what did Meg study anyway??,She immediately sold out Marta when she reali...
125,hqjqgw2,CT-0614,rs25y6,t3_rs25y6,KnivesOutMovie,She opened the door when he did it,Just-Breadfruit5742,KnivesOutMovie,How did Martha get a blood spot on her shoe?,This question has been on my mind for a while....


### Cleaning and Prepocessing Knives Out Data

In [132]:
# Drop rows with '[deleted]' in the 'body' column
knivesout_df = knivesout_df[knivesout_df['body'] != '[deleted]']

In [133]:
# Lowercase 'body' column
knivesout_df['body'] = knivesout_df['body'].str.lower()

# Lowercase 'selftext' column
knivesout_df['selftext'] = knivesout_df['selftext'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['body'] = knivesout_df['body'].str.lower()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['selftext'] = knivesout_df['selftext'].str.lower()


In [134]:
# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
knivesout_df[text_columns] = knivesout_df[text_columns].applymap(remove_punctuation)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df[text_columns] = knivesout_df[text_columns].applymap(remove_punctuation)


In [135]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
knivesout_df['body'] = knivesout_df['body'].str.replace('\n\n', ' ')
knivesout_df['selftext'] = knivesout_df['selftext'].str.replace('\n\n', ' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['body'] = knivesout_df['body'].str.replace('\n\n', ' ')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['selftext'] = knivesout_df['selftext'].str.replace('\n\n', ' ')


In [136]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_KNIVESOUT_DATA_CLEANED.txt'
knivesout_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_KNIVESOUT_DATA_CLEANED.txt


## Now save all relevant columns and groupings accordingly.

In [137]:
# Tokenize the body and selftext columns
knivesout_df['selftext'] = knivesout_df['selftext'].apply(word_tokenize)
knivesout_df['body'] = knivesout_df['body'].apply(word_tokenize)

# Print the updated 'body' column
print(knivesout_df['selftext'])
print(knivesout_df['body'])

0      [we, see, that, when, mrblanc, was, revealing,...
1      [we, see, that, when, mrblanc, was, revealing,...
2      [i, find, myself, wondering, how, marta, dealt...
3      [i, find, myself, wondering, how, marta, dealt...
4      [i, find, myself, wondering, how, marta, dealt...
                             ...                        
122                                            [removed]
123    [she, immediately, sold, out, marta, when, she...
124    [she, immediately, sold, out, marta, when, she...
125    [this, question, has, been, on, my, mind, for,...
126    [this, question, has, been, on, my, mind, for,...
Name: selftext, Length: 124, dtype: object
0      [benoit, explains, in, the, next, line, becaus...
1      [ransom, thought, that, hed, successfully, poi...
2      [i, think, she, might, have, given, them, a, b...
3      [she, would, probably, need, a, fair, amount, ...
4      [lets, put, something, out, there, theyre, not...
                             ...             

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['selftext'] = knivesout_df['selftext'].apply(word_tokenize)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['body'] = knivesout_df['body'].apply(word_tokenize)


In [139]:
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from the tokenized 'body' column
knivesout_df['body'] = knivesout_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# Remove stopwords from the tokenized 'selftext' column
knivesout_df['selftext'] = knivesout_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])# Define the stopwords

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['body'] = knivesout_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  knivesout_df['selftext'] = knivesout_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])# Define the stopwords


In [140]:
knivesout_df

Unnamed: 0,id_comment,author_comment,link_id,parent_id,subreddit_comment,body,author_submission,subreddit_submission,title,selftext
0,gjiw3t0,ilinamorato,kylrw7,t3_kylrw7,KnivesOutMovie,"[benoit, explains, next, line, doesnt, know, t...",imnotsus_,KnivesOutMovie,Something odd with Ransom.,"[see, mrblanc, revealing, mystery, says, ranso..."
1,gmlib2b,SheDaisy11151979,kylrw7,t3_kylrw7,KnivesOutMovie,"[ransom, thought, hed, successfully, poisoned,...",imnotsus_,KnivesOutMovie,Something odd with Ransom.,"[see, mrblanc, revealing, mystery, says, ranso..."
2,gk4ch8o,agentgravyphone,l2aivu,t3_l2aivu,KnivesOutMovie,"[think, might, given, bit, money, like, enough...",SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,"[find, wondering, marta, dealt, inheritance, p..."
3,gk5987d,ErinEqualsPeace,l2aivu,t3_l2aivu,KnivesOutMovie,"[would, probably, need, fair, amount, property...",SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,"[find, wondering, marta, dealt, inheritance, p..."
4,gk63dxm,xeroxgirl,l2aivu,t3_l2aivu,KnivesOutMovie,"[lets, put, something, theyre, actually, broke...",SheDaisy11151979,KnivesOutMovie,What did Marta end up doing in the end?,"[find, wondering, marta, dealt, inheritance, p..."
...,...,...,...,...,...,...,...,...,...,...
122,hh06rn7,Creepy_Willow9842,q47l24,t3_q47l24,KnivesOutMovie,"[honestly, though, difference, ends, similarit...",[deleted],KnivesOutMovie,[deleted by user],[removed]
123,hmbtue9,The_Molsen,r38frz,t3_r38frz,KnivesOutMovie,"[implied, kind, social, studies, x, justice, c...",Bahamaunt,KnivesOutMovie,So what did Meg study anyway??,"[immediately, sold, marta, realized, mother, w..."
124,hmcgobc,Bahamaunt,r38frz,t1_hmbtue9,KnivesOutMovie,"[im, knowledgeable, matter, would, social, stu...",Bahamaunt,KnivesOutMovie,So what did Meg study anyway??,"[immediately, sold, marta, realized, mother, w..."
125,hqjqgw2,CT-0614,rs25y6,t3_rs25y6,KnivesOutMovie,"[opened, door]",Just-Breadfruit5742,KnivesOutMovie,How did Martha get a blood spot on her shoe?,"[question, mind, shown, one, scene, marthas, s..."


# "r/Anne" 

In [141]:
# Specify the path to the folder containing the Harry Potter pickle files
folder_path = r'C:\Users\josie\Downloads\Josie\Anne'

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.startswith('subreddit_Anne')]

# Initialize empty dictionary to store loaded data
anne_data = {}

# Load each pickle file and store its contents in the dictionary
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        knivesout_data[file_name] = pickle.load(f)

In [142]:
# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_Anne_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_Anne_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
anne_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
anne_comments = pd.concat(rc_data.values(), ignore_index=True)

In [143]:
# Check if all values in 'link_id' column start with 't3_'
all_values_start_with_t3 = anne_comments['link_id'].str.startswith('t3_').all()

print("All values in 'link_id' column start with 't3_':", all_values_start_with_t3)

All values in 'link_id' column start with 't3_': True


In [144]:
# Remove 't3_' prefix from the link_id column in the comments DataFrame
anne_comments['link_id'] = anne_comments['link_id'].str.replace('t3_', '')

In [145]:
# Merge the submissions and comments DataFrames on the id and link_id columns
merged_df = anne_comments.merge(anne_submissions, left_on='link_id', right_on='id', suffixes=('_comment', '_submission'))

# Display the merged DataFrame
print(merged_df.head())

  id_comment author_comment link_id   parent_id subreddit_comment  \
0    ds1hn4j    TorsionFree  7nfx88   t3_7nfx88           AnnePro   
1    ds2sokd         quido3  7nfx88  t1_ds1hn4j           AnnePro   
2    ds7gf2v        argon07  7nfx88  t1_ds1hn4j           AnnePro   
3    ds7wzds        VexFlex  7nfx88   t3_7nfx88           AnnePro   
4    ds1khcq      Thebatsem  7ng364   t3_7ng364           AnnePro   

                                                body id_submission  \
0  You'd have to desolder all the switches to get...        7nfx88   
1  Plastic? I'm pretty sure mine has a metal plat...        7nfx88   
2            Dang I was hoping not to desolder...rip        7nfx88   
3  Well if you have the patience to mask off all ...        7nfx88   
4  I wouldn't bother about the warranty on these ...        7ng364   

  author_submission subreddit_submission  \
0           argon07              AnnePro   
1           argon07              AnnePro   
2           argon07             

In [146]:
# Remove the redundant columns
anne_df = merged_df

anne_df = anne_df.drop(['link_id','id_submission'], axis=1)

print(anne_df.head())

  id_comment author_comment   parent_id subreddit_comment  \
0    ds1hn4j    TorsionFree   t3_7nfx88           AnnePro   
1    ds2sokd         quido3  t1_ds1hn4j           AnnePro   
2    ds7gf2v        argon07  t1_ds1hn4j           AnnePro   
3    ds7wzds        VexFlex   t3_7nfx88           AnnePro   
4    ds1khcq      Thebatsem   t3_7ng364           AnnePro   

                                                body author_submission  \
0  You'd have to desolder all the switches to get...           argon07   
1  Plastic? I'm pretty sure mine has a metal plat...           argon07   
2            Dang I was hoping not to desolder...rip           argon07   
3  Well if you have the patience to mask off all ...           argon07   
4  I wouldn't bother about the warranty on these ...          xgdnekox   

  subreddit_submission                                       title  \
0              AnnePro  Has anyone tried painting their backplate?   
1              AnnePro  Has anyone tried paintin

In [147]:
anne_df = anne_df[anne_df['body'] != '[deleted]']

In [148]:
# Lowercase 'body' column
anne_df['body'] = anne_df['body'].str.lower()

# Lowercase 'selftext' column
anne_df['selftext'] = anne_df['selftext'].str.lower()

In [149]:
# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
anne_df[text_columns] = anne_df[text_columns].applymap(remove_punctuation)

In [150]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
anne_df['body'] = anne_df['body'].str.replace('\n\n', ' ')
anne_df['selftext'] = anne_df['selftext'].str.replace('\n\n', ' ')

In [151]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_ANNE_DATA_CLEANED.txt'
knivesout_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_ANNE_DATA_CLEANED.txt


## Now save all relevant columns and groupings as needed.

In [152]:
# Tokenize the body and selftext columns
anne_df['selftext'] = anne_df['selftext'].apply(word_tokenize)
anne_df['body'] = anne_df['body'].apply(word_tokenize)

# Print the updated 'body' column
print(anne_df['selftext'])
print(anne_df['body'])

0        [i, have, a, black, anne, pro, rn, and, i, nev...
1        [i, have, a, black, anne, pro, rn, and, i, nev...
2        [i, have, a, black, anne, pro, rn, and, i, nev...
3        [i, have, a, black, anne, pro, rn, and, i, nev...
4        [does, obins, have, a, warranty, or, should, i...
                               ...                        
42551                                                   []
42552                                                   []
42553                                                   []
42554                                            [deleted]
42555                                                   []
Name: selftext, Length: 41182, dtype: object
0        [youd, have, to, desolder, all, the, switches,...
1        [plastic, im, pretty, sure, mine, has, a, meta...
2             [dang, i, was, hoping, not, to, desolderrip]
3        [well, if, you, have, the, patience, to, mask,...
4        [i, wouldnt, bother, about, the, warranty, on,...
           

In [153]:
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from the tokenized 'body' column
anne_df['body'] = anne_df['body'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# Remove stopwords from the tokenized 'selftext' column
anne_df['selftext'] = anne_df['selftext'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

In [154]:
anne_df

Unnamed: 0,id_comment,author_comment,parent_id,subreddit_comment,body,author_submission,subreddit_submission,title,selftext
0,ds1hn4j,TorsionFree,t3_7nfx88,AnnePro,"[youd, desolder, switches, get, backplate, ord...",argon07,AnnePro,Has anyone tried painting their backplate?,"[black, anne, pro, rn, never, really, use, rgb..."
1,ds2sokd,quido3,t1_ds1hn4j,AnnePro,"[plastic, im, pretty, sure, mine, metal, plate...",argon07,AnnePro,Has anyone tried painting their backplate?,"[black, anne, pro, rn, never, really, use, rgb..."
2,ds7gf2v,argon07,t1_ds1hn4j,AnnePro,"[dang, hoping, desolderrip]",argon07,AnnePro,Has anyone tried painting their backplate?,"[black, anne, pro, rn, never, really, use, rgb..."
3,ds7wzds,VexFlex,t3_7nfx88,AnnePro,"[well, patience, mask, switches, im, pretty, s...",argon07,AnnePro,Has anyone tried painting their backplate?,"[black, anne, pro, rn, never, really, use, rgb..."
4,ds1khcq,Thebatsem,t3_7ng364,AnnePro,"[wouldnt, bother, warranty, products, would, s...",xgdnekox,AnnePro,Anne pro usb port broke off.,"[obins, warranty, look, solder, new, one]"
...,...,...,...,...,...,...,...,...,...
42551,fcojyqi,trukio,t3_ei2ux0,AnnePro,"[lmao, exact, combo]",squanchymacsquanch,AnnePro,Nice combo (G305),[]
42552,fcok6or,squanchymacsquanch,t1_fcojyqi,AnnePro,"[good, taste, bruh, 🤪]",squanchymacsquanch,AnnePro,Nice combo (G305),[]
42553,fcold0g,zp3dd4,t1_fcojyqi,AnnePro,[],squanchymacsquanch,AnnePro,Nice combo (G305),[]
42554,fcnrc6x,faddynuts,t3_ei0byk,AnnePro,[httpskbdfanscomcollections60layoutcaseproduct...,[deleted],AnnePro,Any Clear cases for the Anne Pro 2 yet? Perona...,[deleted]


# "r/The Last of Us"

In [155]:
# Specify the path to the folder containing the pickle files
folder_path = r'C:\Users\josie\Downloads\Josie\TLOU'

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.endswith('.pickle')]

# Initialize empty dictionary to store loaded data
tlou_data = {}

# Load each pickle file and store its contents in the dictionary
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        tlou_data[file_name] = pickle.load(f)

In [156]:
# Iterate over each pickle file and inspect its contents
for file_name, data in tlou_data.items():
    print(f"File: {file_name}")
    print(f"Data type: {type(data)}")
    print(f"Keys: {data.keys()}")
    print()

File: subreddit_ThelastofusHBOseries_RC_2022-01.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_ThelastofusHBOseries_RC_2022-02.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_ThelastofusHBOseries_RC_2022-03.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_ThelastofusHBOseries_RC_2022-04.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_ThelastofusHBOseries_RC_2022-05.pickle
Data type: <class 'pandas.core.frame.DataFrame'>
Keys: Index(['id', 'author', 'link_id', 'parent_id', 'subreddit', 'body'], dtype='object')

File: subreddit_Thel

In [157]:
import pandas as pd

# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_TLOU_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_TLOU_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
        
    elif file_name.startswith("subreddit_thelastofus_RS"):
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
        
    elif file_name.startswith("subreddit_thelastofus_RC"):
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
        
    elif file_name.startswith("subreddit_ThelastofusHBOseries_RS"):
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
        
    elif file_name.startswith("subreddit_ThelastofusHBOseries_RC"):
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
tlou_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
tlou_comments = pd.concat(rc_data.values(), ignore_index=True)

In [158]:
# Remove the 't3_' prefix from every row in the link_id column
tlou_comments['cleaned_link_id'] = tlou_comments['link_id'].str.replace('t3_', '', regex=False)

# Display the first few rows to verify the prefix has been removed
print(tlou_comments[['link_id', 'cleaned_link_id']].head())

# Merge the DataFrames using the cleaned_link_id column and id column
tlou_df = pd.merge(tlou_comments, tlou_submissions, left_on='cleaned_link_id', right_on='id', how='inner', suffixes=('_comment', '_submission'))

# Display the first few rows of the merged DataFrame
print(tlou_df.head())

     link_id cleaned_link_id
0  t3_rsq1a0          rsq1a0
1  t3_rsq1a0          rsq1a0
2  t3_rsq1a0          rsq1a0
3  t3_hiae2t          hiae2t
4  t3_hiae2t          hiae2t
  id_comment     author_comment    link_id   parent_id subreddit_comment  \
0    i2x2syl           Aucielis  t3_tt02oa  t1_i2wyf8e       thelastofus   
1    i2x4mhn  AbstractBettaFish  t3_tt02oa  t1_i2x16n2       thelastofus   
2    i2x5fi6     awakenedforces  t3_tt02oa   t3_tt02oa       thelastofus   
3    i2x6wiu     StreetMedic380  t3_tt02oa   t3_tt02oa       thelastofus   
4    i2xah3o          Glitchy13  t3_tt02oa   t3_tt02oa       thelastofus   

                                                body cleaned_link_id  \
0  Yeah, exactly. I think it's partly that for me...          tt02oa   
1  The dumbest part is that he literally has an E...          tt02oa   
2  makes me wonder if we all even played the same...          tt02oa   
3  No character development and story issues?\n\n...          tt02oa   
4        

In [159]:
#'cleaned_link_id' appears twice in the DataFrame
tlou_df.drop(labels=['cleaned_link_id'], axis=1, inplace=True)
tlou_df.drop(columns=['link_id', 'parent_id'], inplace=True)

print(tlou_df.head())

  id_comment     author_comment subreddit_comment  \
0    i2x2syl           Aucielis       thelastofus   
1    i2x4mhn  AbstractBettaFish       thelastofus   
2    i2x5fi6     awakenedforces       thelastofus   
3    i2x6wiu     StreetMedic380       thelastofus   
4    i2xah3o          Glitchy13       thelastofus   

                                                body id_submission  \
0  Yeah, exactly. I think it's partly that for me...        tt02oa   
1  The dumbest part is that he literally has an E...        tt02oa   
2  makes me wonder if we all even played the same...        tt02oa   
3  No character development and story issues?\n\n...        tt02oa   
4                That guys’ sentence has bad pacing.        tt02oa   

                                            title selftext  \
0  I’ve never seen such mixed opinions on a game.            
1  I’ve never seen such mixed opinions on a game.            
2  I’ve never seen such mixed opinions on a game.            
3  I’ve neve

In [160]:
# Drop rows with '[deleted]' in the 'body' column
tlou_df = tlou_df[tlou_df['body'] != '[deleted]']

In [161]:
# Lowercase 'body' column
tlou_df['body'] = tlou_df['body'].str.lower()

# Lowercase 'selftext' column
tlou_df['selftext'] = tlou_df['selftext'].str.lower()

In [162]:
import string

# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
tlou_df[text_columns] = tlou_df[text_columns].applymap(remove_punctuation)

In [163]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
tlou_df['body'] = tlou_df['body'].str.replace('\n\n', ' ')
tlou_df['selftext'] = tlou_df['selftext'].str.replace('\n\n', ' ')

In [164]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_TLOU_DATA_CLEANED.txt'
tlou_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_TLOU_DATA_CLEANED.txt


## Now save all columns and groupings accordingly.

# Breaking Bad Data

In [165]:
# Specify the path to the folder containing the pickle files
folder_path = r'C:\Users\josie\Downloads\Josie\BreakingBad'

# List all the pickle files that the folder contains
pickle_files = [file for file in os.listdir(folder_path) if file.endswith('.pickle')]

# Initialize empty dictionary to store loaded data
breakingbad_data = {}

# Load each pickle file and store its contents in the dictionary
for file_name in pickle_files:
    with open(os.path.join(folder_path, file_name), 'rb') as f:
        breakingbad_data[file_name] = pickle.load(f)

In [166]:
import pandas as pd

# Initialize empty dictionaries to store RS (Reddit submissions) and RC (Reddit comments) DataFrames
rs_data = {}
rc_data = {}

# Loop through all files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file is a Reddit submission file
    if file_name.startswith("subreddit_breakingbad_RS"):
        # Load the RS DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rs_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))
    
    # Check if the file is a Reddit comment file
    elif file_name.startswith("subreddit_breakingbad_RC"):
        # Load the RC DataFrame and store it in the dictionary with the corresponding date as the key
        date = file_name.split("_")[-1].split(".")[0]
        rc_data[date] = pd.read_pickle(os.path.join(folder_path, file_name))

# Concatenate all the RS DataFrames into a single DataFrame
breakingbad_submissions = pd.concat(rs_data.values(), ignore_index=True)

# Concatenate all the RC DataFrames into a single DataFrame
breakingbad_comments = pd.concat(rc_data.values(), ignore_index=True)

In [167]:
# Remove the 't3_' prefix from every row in the link_id column
breakingbad_comments['cleaned_link_id'] = breakingbad_comments['link_id'].str.replace('t3_', '', regex=False)

# Display the first few rows to verify the prefix has been removed
print(breakingbad_comments[['link_id', 'cleaned_link_id']].head())

# Merge the DataFrames using the cleaned_link_id column and id column
breakingbad_df = pd.merge(breakingbad_comments, breakingbad_submissions, left_on='cleaned_link_id', right_on='id', how='inner', suffixes=('_comment', '_submission'))

# Display the first few rows of the merged DataFrame
print(breakingbad_df.head())

    link_id cleaned_link_id
0  t3_bges0           bges0
1  t3_bges0           bges0
2  t3_bges0           bges0
3  t3_bges0           bges0
4  t3_bges0           bges0
  id_comment     author_comment   link_id   parent_id subreddit_comment  \
0    c0mnh22          [deleted]  t3_bges0    t3_bges0       breakingbad   
1    c0mnut2        PopeAdolph2  t3_bges0    t3_bges0       breakingbad   
2    c0mo174        Jack_Bandit  t3_bges0  t1_c0mnut2       breakingbad   
3    c0mo93h  lkjhgfdsasdfghjkl  t3_bges0  t1_c0mnut2       breakingbad   
4    c0mojz8        RyanOnymous  t3_bges0    t3_bges0       breakingbad   

                                                body cleaned_link_id  \
0     Well shiite. I was about to make one of these.           bges0   
1  great overall, but could have done without see...           bges0   
2  I agree with everything except the crawling sc...           bges0   
3  The auditorium scene showed that Walt was losi...           bges0   
4  I'm re-watching th

In [168]:
#'cleaned_link_id' appears twice in the DataFrame
breakingbad_df.drop(labels=['cleaned_link_id'], axis=1, inplace=True)
breakingbad_df.drop(columns=['link_id', 'parent_id'], inplace=True)

print(breakingbad_df.head())

  id_comment     author_comment subreddit_comment  \
0    c0mnh22          [deleted]       breakingbad   
1    c0mnut2        PopeAdolph2       breakingbad   
2    c0mo174        Jack_Bandit       breakingbad   
3    c0mo93h  lkjhgfdsasdfghjkl       breakingbad   
4    c0mojz8        RyanOnymous       breakingbad   

                                                body id_submission  \
0     Well shiite. I was about to make one of these.         bges0   
1  great overall, but could have done without see...         bges0   
2  I agree with everything except the crawling sc...         bges0   
3  The auditorium scene showed that Walt was losi...         bges0   
4  I'm re-watching the last few episodes of S2 in...         bges0   

  author_submission                        title  \
0            coopnl  Breaking Bad subreddit!! :)   
1            coopnl  Breaking Bad subreddit!! :)   
2            coopnl  Breaking Bad subreddit!! :)   
3            coopnl  Breaking Bad subreddit!! :)   


In [169]:
# Drop rows with '[deleted]' in the 'body' column
breakingbad_df = breakingbad_df[breakingbad_df['body'] != '[deleted]']

In [170]:
# Lowercase 'body' column
breakingbad_df['body'] = breakingbad_df['body'].str.lower()

# Lowercase 'selftext' column
breakingbad_df['selftext'] = breakingbad_df['selftext'].str.lower()

In [171]:
import string

# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
breakingbad_df[text_columns] = breakingbad_df[text_columns].applymap(remove_punctuation)

In [172]:
# Replace '\n\n' with a space in the 'body' and 'selftext' columns
breakingbad_df['body'] = breakingbad_df['body'].str.replace('\n\n', ' ')
breakingbad_df['selftext'] = breakingbad_df['selftext'].str.replace('\n\n', ' ')

In [173]:
cleaned_file_path = r'C:\Users\josie\Downloads\REDDIT_BREAKINGBAD_DATA_CLEANED.txt'
breakingbad_df.to_csv(cleaned_file_path, index=False, sep='\t')

print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to C:\Users\josie\Downloads\REDDIT_BREAKINGBAD_DATA_CLEANED.txt


## Now save all relevant columns and groupings accordingly.

# Catcher in the Rye Reddit Data, from "r/Literature" and "r/Books"

This data was gathered through a scraper tool and so will be loaded into the terminal differently.

First, I will load the entire folder containing all 18 Reddit threads pertaining to Catcher in the Rye.

In [174]:
import pandas as pd

# Load the submissions .csv file into a DataFrame
catcher_df = pd.read_csv(r"C:\Users\josie\OneDrive\Desktop\Reddit\Catcher_in_the_Rye\Catcher_in_the_Rye_Data.csv")

# Display the first few rows of each DataFrame to confirm
print(catcher_df.head())

                                           title  \
0  What did you think of the Catcher in the Rye?   
1  What did you think of the Catcher in the Rye?   
2  What did you think of the Catcher in the Rye?   
3  What did you think of the Catcher in the Rye?   
4  What did you think of the Catcher in the Rye?   

                                            selftext  \
0  I just finished reading it. The part where he ...   
1  I just finished reading it. The part where he ...   
2  I just finished reading it. The part where he ...   
3  I just finished reading it. The part where he ...   
4  I just finished reading it. The part where he ...   

                                                body parent_id  
0  "New Release:How to Read a Book by Monica Wood...      C_01  
1  It's a classic for a reason. It uses an unreli...      C_01  
2                                     Are you a bot?      C_01  
3                                        Hahaha nope      C_01  
4  There are two major th

In [175]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK relevant files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))

# Function to clean and preprocess text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace slashes with spaces
    text = text.replace('/', ' ')
    # Tokenize the text
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Join words back into a single string
    cleaned_text = ' '.join(words)
    return cleaned_text

# Apply the cleaning function to the 'selftext' and 'body' columns
catcher_df['selftext'] = catcher_df['selftext'].astype(str).apply(clean_text)
catcher_df['body'] = catcher_df['body'].astype(str).apply(clean_text)

# Define a function to remove punctuation from a string
def remove_punctuation(text):
    # Get the predefined set of punctuation characters
    punctuation_set = set(string.punctuation)
    # Use ''.join() to remove punctuation
    return ''.join(char for char in text if char not in punctuation_set)

# Remove punctuation from the relevant columns
text_columns = ['body', 'selftext']
catcher_df[text_columns] = catcher_df[text_columns].applymap(remove_punctuation)

# Display the first few rows of the cleaned DataFrame
print(catcher_df.head())

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\josie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\josie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\josie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                           title  \
0  What did you think of the Catcher in the Rye?   
1  What did you think of the Catcher in the Rye?   
2  What did you think of the Catcher in the Rye?   
3  What did you think of the Catcher in the Rye?   
4  What did you think of the Catcher in the Rye?   

                                            selftext  \
0  finished reading  part mental hospital rehab n...   
1  finished reading  part mental hospital rehab n...   
2  finished reading  part mental hospital rehab n...   
3  finished reading  part mental hospital rehab n...   
4  finished reading  part mental hospital rehab n...   

                                                body parent_id  
0   new release  read book monica woodcheck thewe...      C_01  
1  s classic reason  uses unreliable narrator eff...      C_01  
2                                               bot       C_01  
3                                        hahaha nope      C_01  
4  two major things going

In [176]:
output_file_path = r'C:\Users\josie\Downloads\catcher_df_CLEANED_FULL_REDDIT.txt'
catcher_df.to_csv(output_file_path, index=False, sep='\t', header=True)

print(f"DataFrame saved to {output_file_path}")

DataFrame saved to C:\Users\josie\Downloads\catcher_df_CLEANED_FULL_REDDIT.txt


In [177]:
output_file_path = r'C:\Users\josie\Downloads\catcher_df_CLEANED_BODY_REDDIT.txt'
catcher_df['body'].to_csv(output_file_path, index=False, header=False)

print(f"'body' column saved to {output_file_path}")

'body' column saved to C:\Users\josie\Downloads\catcher_df_CLEANED_BODY_REDDIT.txt


## Now save all relevant columns and groupings accordingly.

In [178]:
catcher_df['body'] = catcher_df['body'].apply(word_tokenize)
catcher_df['selftext'] = catcher_df['selftext'].apply(word_tokenize)

The data is now ready for analyses.

In [179]:
catcher_df

Unnamed: 0,title,selftext,body,parent_id
0,What did you think of the Catcher in the Rye?,"[finished, reading, part, mental, hospital, re...","[new, release, read, book, monica, woodcheck, ...",C_01
1,What did you think of the Catcher in the Rye?,"[finished, reading, part, mental, hospital, re...","[s, classic, reason, uses, unreliable, narrato...",C_01
2,What did you think of the Catcher in the Rye?,"[finished, reading, part, mental, hospital, re...",[bot],C_01
3,What did you think of the Catcher in the Rye?,"[finished, reading, part, mental, hospital, re...","[hahaha, nope]",C_01
4,What did you think of the Catcher in the Rye?,"[finished, reading, part, mental, hospital, re...","[two, major, things, going, character, first, ...",C_01
...,...,...,...,...
2490,Am I supposed to hate Holden Caulfield?,"[started, reading, “, catcher, rye, ”, school,...","[know, insufferable, even, reading, age]",C_18
2491,Am I supposed to hate Holden Caulfield?,"[started, reading, “, catcher, rye, ”, school,...","[ve, always, thought, re, supposed, look, hold...",C_18
2492,Am I supposed to hate Holden Caulfield?,"[started, reading, “, catcher, rye, ”, school,...","[writer, ’, intend, control, reaction, ion, wo...",C_18
2493,Am I supposed to hate Holden Caulfield?,"[started, reading, “, catcher, rye, ”, school,...","[warning, example, step, mother, 83, read, las...",C_18
