# ELI5 Dataset Cleaning

This notebook cleans the ELI5 dataset by applying the following steps:

- removal of very short answers (fewer than 20 words)
- removal of URLs
- removal of Reddit-specific artifacts (e.g. "EDIT:", "OP")
- collapsing of repeated whitespace
- removal of emojis
- removal of duplicate answers
- removal of HTML artifacts
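
As a quick illustration of what these steps do to a single answer, here is a minimal sketch. The sample string, the function name `clean_answer`, and the simplified regular expressions are assumptions for demonstration only; the notebook's actual functions are defined further down and may behave differently on edge cases.

```python
import html
import re

def clean_answer(text):
    """Simplified, illustrative version of the cleaning steps (not the notebook's exact code)."""
    text = re.sub(r'https?://\S+', '', text)              # drop URLs
    text = html.unescape(text)                             # decode entities such as &amp;
    text = re.sub(r'<[^>]+>', '', text)                    # drop HTML tags
    text = re.sub(r'\bEDIT\b\s*:?', '', text)              # drop the "EDIT:" marker
    text = re.sub(r'[\U0001F300-\U0001FAFF]', '', text)    # drop most pictographic emoji
    return re.sub(r'\s+', ' ', text).strip()               # collapse whitespace

sample = "EDIT: see https://example.com  for <b>more</b> &amp; details \U0001F600"
print(clean_answer(sample))  # see for more & details
```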



## Load Libraries

In [29]:
import pandas as pd
import re
import html
import os

## Load Dataset

In [30]:
# Load the dataset
input_file = 'output/eli5_combined.csv'
output_file = 'output/eli5_cleaned.csv'

df = pd.read_csv(input_file)
print(f"Original dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"First row:\n{df.head(1)}")

Original dataset shape: (261214, 7)
Columns: ['q_id', 'title', 'category', 'subreddit', 'a_id', 'text', 'score']
First row:
     q_id                                              title category  \
0  5lchat  Why there was a 'leap second' added to the end...    Other   

           subreddit     a_id  \
0  explainlikeimfive  dbuoyxl   

                                                text  score  
0  the rotation of the earth is not a constant. i...     44  


## Define Cleaning Functions

In [31]:
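def remove_urls(text):
    # Strip http:// and https:// URLs. The character class excludes '#' and '~',
    # so a URL fragment such as '#section' is left behind in the text.
    return re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)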

def remove_reddit_artifacts(text):
    # Remove common Reddit markers. The trailing colon is optional and, except for OP,
    # matching is case-insensitive, so the plain words "edit" and "update" are removed too.
    text = re.sub(r'\bEDIT\b\s*:?', '', text, flags=re.IGNORECASE)  # EDIT:
    text = re.sub(r'\bOP\b', '', text)  # OP (uppercase only)
    text = re.sub(r'\bETA\b\s*:?', '', text, flags=re.IGNORECASE)  # ETA:
    text = re.sub(r'\bUPDATE\b\s*:?', '', text, flags=re.IGNORECASE)  # UPDATE:
    text = re.sub(r'\bTL;DR\b\s*:?', '', text, flags=re.IGNORECASE)  # TL;DR:
    text = re.sub(r'\bPS\b\s*:?', '', text, flags=re.IGNORECASE)  # PS:
    # re.MULTILINE makes ^ match at the start of every line, not only the start of the text
    text = re.sub(r'^>+', '', text, flags=re.MULTILINE)  # remove > at the start of lines (common in Reddit quotes)
    text = re.sub(r'\*([A-Z]+)\*', r'\1', text)  # strip asterisks around emphasized all-caps words, e.g. *NOT* -> NOT
    text = re.sub(r'\( URL_[0-9]+ \)', '', text)  # remove the "( URL_n )" placeholders that the ELI5 data uses for links
    text = re.sub(r'^\s*\*\s+', '', text, flags=re.MULTILINE)  # remove bullet points ' * ' at the start of lines
    return text

def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # enclosed characters; this wide range also covers CJK text and many other symbols
        "]+",
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

def remove_html_artifacts(text):
    # Decode HTML entities
    text = html.unescape(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    return text

def remove_multiple_whitespaces(text):
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def count_words(text):
    return len(text.split())
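
Because the marker patterns in `remove_reddit_artifacts` are case-insensitive and do not require a colon, they also delete ordinary words such as "edit" and "update" from running text. If that is not wanted, one possible stricter variant is sketched below. The name `strip_leading_markers` and the choice to require a colon at the start of a line are assumptions, not part of the original notebook.

```python
import re

# Hypothetical stricter variant: only strip a marker when it starts a line
# and is followed by a colon, so ordinary words such as "update" survive.
MARKER_RE = re.compile(
    r'^[ \t]*(?:EDIT|ETA|UPDATE|TL;DR|PS)[ \t]*:[ \t]*',
    flags=re.IGNORECASE | re.MULTILINE,
)

def strip_leading_markers(text):
    """Remove line-initial markers such as 'EDIT:' (assumed convention)."""
    return MARKER_RE.sub('', text)

cleaned = strip_leading_markers("Please update your drivers.\nEDIT: fixed a typo")
print(cleaned)
```

With this variant the first line is left untouched and only the `EDIT:` prefix on the second line is removed.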


## Apply Cleaning Pipeline

In [32]:
df_cleaned = df.copy()

# Identify the text column by looking for a common name such as 'answer' or 'text'
text_columns = [col for col in df_cleaned.columns if col.lower() in ['answer', 'text', 'content', 'body']]
if not text_columns:
    # If no standard name is found, fall back to the last column
    text_columns = [df_cleaned.columns[-1]]

text_column = text_columns[0]
print(f"Using column '{text_column}' for cleaning")

# Apply cleaning steps
print("\nApplying cleaning pipeline...")

# 1. Remove URLs
df_cleaned[text_column] = df_cleaned[text_column].astype(str).apply(remove_urls)

# 2. Remove HTML artifacts
df_cleaned[text_column] = df_cleaned[text_column].apply(remove_html_artifacts)

# 3. Remove Reddit-specific artifacts
df_cleaned[text_column] = df_cleaned[text_column].apply(remove_reddit_artifacts)

# 4. Remove emojis
df_cleaned[text_column] = df_cleaned[text_column].apply(remove_emojis)

# 5. Remove multiple whitespaces
df_cleaned[text_column] = df_cleaned[text_column].apply(remove_multiple_whitespaces)

# 6. Filter out very short answers (< 20 words)
initial_count = len(df_cleaned)
df_cleaned = df_cleaned[df_cleaned[text_column].apply(count_words) >= 20].reset_index(drop=True)
short_removed = initial_count - len(df_cleaned)

# 7. Remove duplicates based on text column
duplicates_removed = df_cleaned.duplicated(subset=[text_column]).sum()
df_cleaned = df_cleaned.drop_duplicates(subset=[text_column]).reset_index(drop=True)

print(f"Entries with < 20 words removed: {short_removed}")
print(f"Duplicate entries removed: {duplicates_removed}")
print(f"Final dataset shape: {df_cleaned.shape}")
print("\nSample cleaned entries:")
print(df_cleaned[text_column].head())

Using column 'text' for cleaning

Applying cleaning pipeline...
Entries with < 20 words removed: 10954
Duplicate entries removed: 52
Final dataset shape: (250208, 7)

Sample cleaned entries:
0    the rotation of the earth is not a constant. i...
1    The Earth's rotation is not regular. It varies...
2    Because the Earth's rotation is slowing. If yo...
3    Imagine you are out walking in the woods near ...
4    By force. Historically, nations have defended ...
Name: text, dtype: object
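
The length filter and de-duplication (steps 6 and 7) can be checked on a toy frame before running them on the full dataset. The sketch below is illustrative only: the sample rows, the `min_words` parameter, and the helper name `filter_and_dedupe` are assumptions, and a threshold of 3 words is used so the example stays small.

```python
import pandas as pd

def filter_and_dedupe(df, text_column, min_words=20):
    """Drop rows with fewer than `min_words` words, then drop exact duplicate texts."""
    word_counts = df[text_column].str.split().str.len()
    kept = df[word_counts >= min_words]
    short_removed = len(df) - len(kept)
    deduped = kept.drop_duplicates(subset=[text_column]).reset_index(drop=True)
    duplicates_removed = len(kept) - len(deduped)
    return deduped, short_removed, duplicates_removed

toy = pd.DataFrame({"text": ["too short", "one two three four", "one two three four", "a b c d e"]})
result, n_short, n_dupes = filter_and_dedupe(toy, "text", min_words=3)
print(result["text"].tolist(), n_short, n_dupes)
```

Counting duplicates after the length filter, as the pipeline does, means a short answer that is also a duplicate is only counted once, under the short-answer total.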


## Save Cleaned Dataset

In [33]:
# Save the cleaned dataset
df_cleaned.to_csv(output_file, index=False)
print(f"Cleaned dataset saved to: {output_file}")

print(f"Original entries: {len(df)}")
print(f"Final entries: {len(df_cleaned)}")
print(f"Entries removed: {len(df) - len(df_cleaned)}")
print(f"Retention rate: {(len(df_cleaned) / len(df) * 100):.2f}%")

Cleaned dataset saved to: output/eli5_cleaned.csv
Original entries: 261214
Final entries: 250208
Entries removed: 11006
Retention rate: 95.79%
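
The reported numbers can be cross-checked by hand: the rows removed by the length filter and by de-duplication should add up to the total removed, and the retention rate is the final count divided by the original count. A small sketch using the figures printed above (the variable names are illustrative):

```python
original = 261214          # rows in eli5_combined.csv
short_removed = 10954      # answers with fewer than 20 words
duplicates_removed = 52    # duplicate answers dropped after cleaning

final = original - short_removed - duplicates_removed
retention = final / original * 100

print(final)                 # 250208
print(f"{retention:.2f}%")   # 95.79%
```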
