Get the data safely and understand what you’re dealing with.

Loads the large Enron JSON
Converts it to a DataFrame
Removes junk rows (.DS_Store, empty bodies)
Shows head(), info(), distributions, basic stats

In [2]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


pandas stores the emails as a table (a DataFrame), which makes it easy to filter, clean, and analyze them before deciding which emails are important.

In [3]:
# Load the large email JSON file, convert it to a pandas DataFrame, remove junk and empty emails, and check the dataset structure.
import json
import pandas as pd

file_path = "../data/raw/cleaned_enron_emails.json"

with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# data is already a list of email dictionaries
df = pd.DataFrame(data)

# Optional but recommended: drop junk macOS files and emails with empty bodies
df = df[df["Filename"] != ".DS_Store"]
df = df[df["Body"].str.strip() != ""]

df.reset_index(drop=True, inplace=True)

df.head()  # not rendered here: Jupyter only echoes the value of a cell's last expression
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 516793 entries, 0 to 516792
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   From       516793 non-null  str  
 1   To         516793 non-null  str  
 2   Subject    516793 non-null  str  
 3   Date       516793 non-null  str  
 4   Body       516793 non-null  str  
 5   ThreadKey  516793 non-null  str  
 6   Filename   516793 non-null  str  
dtypes: str(7)
memory usage: 27.6 MB
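
The overview at the top mentions distributions and basic stats, which `info()` alone does not give. A minimal sketch of what those checks could look like, run on a toy frame with invented rows (only the column names come from the real dataset):

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "From": ["a@enron.com", "b@enron.com", "a@enron.com"],
    "Subject": ["Q3 report", "", "lunch"],
    "Body": ["See attached.", "Call me when you can.", "Noon?"],
})

# Distribution of senders: how many emails each address sent.
sender_counts = toy["From"].value_counts()

# Basic stats on body length in characters.
body_len_stats = toy["Body"].str.len().describe()

# Share of emails with an empty subject line.
empty_subject_share = (toy["Subject"].str.strip() == "").mean()

print(sender_counts)
print(body_len_stats)
print(f"empty subjects: {empty_subject_share:.1%}")
```

On the full dataset the same three calls would run against `df` directly.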


In [4]:
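# Parse the email dates as UTC timestamps; unparseable values become NaT.
df["Date"] = pd.to_datetime(
    df["Date"],
    errors="coerce",
    utc=True,
)

# Drop rows whose date could not be parsed, then add year/month columns
# for time-based analysis. .copy() avoids SettingWithCopyWarning on older pandas.
df = df.dropna(subset=["Date"]).copy()
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month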
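
With `errors="coerce"`, dates that cannot be parsed turn into `NaT`, and `dropna` then removes those rows without any message. A small sketch for counting them before they are dropped, using invented sample strings rather than real Enron headers:

```python
import pandas as pd

# Invented sample values; real Enron date strings may be formatted differently.
raw = pd.Series([
    "2001-05-14 16:39:00-07:00",   # valid, with a -07:00 offset
    "2001-05-15 09:00:00+00:00",   # valid, already UTC
    "not a date",                  # unparseable -> NaT under errors="coerce"
])

parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Count the rows that dropna(subset=["Date"]) would remove.
n_bad = int(parsed.isna().sum())
print(f"{n_bad} of {len(parsed)} dates could not be parsed")

# Valid timestamps are normalized to UTC, so the -07:00 entry shifts by 7 hours.
hours_utc = parsed.dropna().dt.hour.tolist()
print(hours_utc)
```

Because everything is converted to UTC, the derived `year` and `month` columns follow UTC, not the sender's local time.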



In [5]:
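# Normalize email text so it is clean and consistent for NLP tasks.
import re

def preprocess_text(text):
    text = text.lower()
    # Strip punctuation first, then collapse whitespace, so removed
    # characters cannot leave runs of spaces behind.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()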

df["clean_body"] = df["Body"].apply(preprocess_text)
df["clean_subject"] = df["Subject"].fillna("").apply(preprocess_text)
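
One detail worth checking in this kind of normalization is the order of the two `re.sub` calls: removing punctuation can leave gaps of two or more spaces, so whitespace is best collapsed last. The sketch below compares both orders on an invented sample string; the two function names are hypothetical and exist only for this comparison.

```python
import re

def collapse_then_strip(text):
    text = text.lower()
    text = re.sub(r"\s+", " ", text)          # collapse whitespace first
    text = re.sub(r"[^a-z0-9\s]", "", text)   # then drop punctuation
    return text.strip()

def strip_then_collapse(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop punctuation first
    text = re.sub(r"\s+", " ", text)          # then collapse whitespace
    return text.strip()

sample = "Meeting @ 3pm -- see\nattached (v2)."
first = collapse_then_strip(sample)
second = strip_then_collapse(sample)
print(repr(first))    # 'meeting  3pm  see attached v2' (double spaces remain)
print(repr(second))   # 'meeting 3pm see attached v2'
```

Either way, punctuation is deleted rather than replaced with a space, so tokens such as `cost/benefit` become `costbenefit`.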


In [6]:
# Create metadata features (body length, recipient count, sender domain) for analyzing or prioritizing emails.
df["email_length"] = df["clean_body"].apply(len)
# Count only non-empty entries, so an empty "To" field gives 0 rather than 1.
df["num_recipients"] = df["To"].apply(
    lambda x: sum(1 for r in str(x).split(",") if r.strip())
)

df["sender_domain"] = df["From"].apply(
    lambda x: x.split("@")[-1] if isinstance(x, str) and "@" in x else "unknown"
)
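
The edge cases behind these features are easy to verify in plain Python. In particular, `"".split(",")` returns `[""]`, so a naive comma count reports one recipient for an empty `To` field. The helpers below are a hypothetical standalone sketch, not part of the notebook, and lowercasing the domain is an added assumption:

```python
def count_recipients(to_field):
    """Count non-empty, comma-separated addresses in a To field."""
    return sum(1 for part in str(to_field).split(",") if part.strip())

def sender_domain(from_field):
    """Return the lowercased domain after the last '@', or 'unknown'."""
    if isinstance(from_field, str) and "@" in from_field:
        return from_field.rsplit("@", 1)[-1].strip().lower()
    return "unknown"

print(len("".split(",")))                             # 1: naive count for an empty field
print(count_recipients(""))                           # 0
print(count_recipients("a@enron.com, b@enron.com"))   # 2
print(sender_domain("Jane.Doe@ENRON.com"))            # enron.com
print(sender_domain("no-address"))                    # unknown
```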


In [7]:
# Auto-label emails as important (1) or not (0) using simple keyword matching (substring match on the cleaned text), creating a target column for training or evaluation.
IMPORTANT_KEYWORDS = [
    "meeting", "deadline", "urgent", "contract",
    "invoice", "report", "approval", "schedule"
]

def label_importance(body, subject):
    combined = f"{subject} {body}"
    # 1 if any keyword occurs anywhere in the combined text, else 0
    return int(any(kw in combined for kw in IMPORTANT_KEYWORDS))

df["label"] = df.apply(
    lambda x: label_importance(x["clean_body"], x["clean_subject"]),
    axis=1
)
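
Because `kw in combined` is a substring test, "report" also matches "reporter" and "contract" matches "contractor". If that turns out to be too loose, one possible alternative is whole-word matching with a single compiled regex. This is a sketch, not the labeling used for the counts in this notebook; `KEYWORD_RE` and `label_whole_word` are hypothetical names.

```python
import re

IMPORTANT_KEYWORDS = [
    "meeting", "deadline", "urgent", "contract",
    "invoice", "report", "approval", "schedule",
]

# One compiled pattern; \b anchors each keyword at word boundaries.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, IMPORTANT_KEYWORDS)) + r")\b"
)

def label_whole_word(body, subject):
    """Return 1 if any keyword appears as a whole word, else 0."""
    return int(bool(KEYWORD_RE.search(f"{subject} {body}")))

print("report" in "the reporter called")                # True: substring match
print(label_whole_word("the reporter called", ""))      # 0: no whole-word match
print(label_whole_word("please send the report", ""))   # 1
```

The trade-off is that whole-word matching also skips plurals such as "meetings", so the keyword list would need those variants added explicitly.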



In [8]:
# Label counts: 1 = important, 0 = not important
df["label"].value_counts()


label
0    320189
1    196604
Name: count, dtype: int64
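
To read these counts as a class balance, the arithmetic is short enough to check by hand. The numbers below are copied from the output above; in a live notebook, `df["label"].value_counts(normalize=True)` should give the same proportions.

```python
# Counts copied from the value_counts() output above.
counts = {0: 320189, 1: 196604}

total = sum(counts.values())
share_important = counts[1] / total

print(total)                      # 516793
print(f"{share_important:.1%}")   # 38.0%
```

Roughly 38% of emails are labeled important, which suggests the keyword list is fairly broad; a model trained on this target learns the keyword rule rather than a human judgment of importance.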

In [9]:
output_path = "../data/processed/emails_with_labels.csv"
df.to_csv(output_path, index=False)

print("Saved to:", output_path)


Saved to: ../data/processed/emails_with_labels.csv
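
A CSV does not store dtypes, so two things change when the file is read back: the `Date` column is plain text unless `parse_dates` is passed, and empty strings (for example an empty `clean_subject`) come back as `NaN`. A minimal round-trip sketch on an invented two-row frame, using an in-memory buffer instead of the real output path:

```python
import io
import pandas as pd

# Toy frame with invented rows; column names mirror the notebook's.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2001-05-14 23:39:00+00:00", "2001-05-15 09:00:00+00:00"], utc=True
    ),
    "clean_subject": ["q3 report", ""],   # second subject is empty
    "label": [1, 0],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

print(reloaded.dtypes)
print(reloaded["clean_subject"].isna().tolist())   # [False, True]
```

Any later step that calls string methods on `clean_subject` would therefore need a `fillna("")` after loading, or `keep_default_na=False` in `read_csv`.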
