"""
Exploratory Data Analysis
---

This script performs initial exploratory data analysis (EDA) on the raw
SMS Spam Collection dataset and saves explanatory plots to ../OUTPUT/01_Exploratory_Plots.

Steps:
1. Load Raw Data
   - The raw TXT file was manually converted to a CSV file, "SMSSpamCollection.csv", with the column headers "Label" and "SMS_Message" added.
   - Reads the converted CSV from ../DATA/SMSSpamCollection.csv.
   - Confirms the structure by printing the row count and the first few rows.

2. Missing Values Plot
   - Counts null values per column.
   - Creates a bar plot to visualize missingness in the dataset.
   - Saved as ../OUTPUT/01_Exploratory_Plots/Missing_Values.png.

3. Class Distribution Plot
   - Shows the number of ham vs. spam messages.
   - Useful to highlight the class imbalance problem.
   - Saved as ../OUTPUT/01_Exploratory_Plots/Class_Distribution.png.

4. Message Length Distribution Plot
   - Computes the character length of each message.
   - Plots histograms of message lengths for ham vs. spam.
   - Shows spam messages tend to differ in length distribution from ham.
   - Saved as ../OUTPUT/01_Exploratory_Plots/Message_Length_Distribution.png.

5. Duplicate Message Analysis
   - Runs directly after the data is loaded.
   - Flags messages whose text appears more than once and checks that repeated texts carry a single label.
   - Plots duplicate vs. unique message counts per class.
   - Saved as ../OUTPUT/01_Exploratory_Plots/Duplicates_By_Category.png.

Outputs:
   - ../OUTPUT/01_Exploratory_Plots/Duplicates_By_Category.png
   - ../OUTPUT/01_Exploratory_Plots/Missing_Values.png
   - ../OUTPUT/01_Exploratory_Plots/Class_Distribution.png
   - ../OUTPUT/01_Exploratory_Plots/Message_Length_Distribution.png

These plots provide context for later cleaning, splitting, and modeling
steps and satisfy the rubric requirement for explanatory EDA plots.
"""


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

df = pd.read_csv("../DATA/SMSSpamCollection.csv")

print("Rows:", len(df))

print(df.head())

outdir = Path("../OUTPUT/01_Exploratory_Plots")
outdir.mkdir(parents=True, exist_ok=True)

Rows: 5572
  Label                                        SMS_Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
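
The manual TXT-to-CSV conversion described in step 1 could also be done in code. The following is a minimal sketch, assuming the raw file uses the tab-separated, header-less layout in which the SMS Spam Collection is commonly distributed. The helper name `load_raw_sms` and the in-memory sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io

import pandas as pd


def load_raw_sms(source):
    """Read a tab-separated, header-less SMS file into a two-column DataFrame.

    Assumption: each line holds a label, a tab character, and the message.
    `source` may be a path or any file-like object accepted by pandas.read_csv.
    """
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["Label", "SMS_Message"],
        quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
    )


# Tiny in-memory stand-in for the raw file (hypothetical sample rows)
sample = io.StringIO(
    'ham\tOk lar... "Joking" wif u oni...\n'
    "spam\tFree entry in 2 a wkly comp\n"
)
raw_df = load_raw_sms(sample)
print(raw_df)
```

Reading the raw file this way keeps the conversion reproducible, and `quoting=csv.QUOTE_NONE` avoids rows being merged when a message contains an unbalanced quote character.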


In [2]:
# Duplicate Messages
duplicates = df[df.duplicated(subset=["SMS_Message"], keep=False)]
print("Duplicate messages found:")
print(duplicates.sort_values("SMS_Message").head(20))
print(f"Total duplicate rows: {len(duplicates)}")

Duplicate messages found:
     Label                                        SMS_Message
505   spam  +123 Congratulations - in this week's competit...
2124  spam  +123 Congratulations - in this week's competit...
2163   ham  1) Go to write msg 2) Put on Dictionary mode 3...
1373   ham  1) Go to write msg 2) Put on Dictionary mode 3...
2344   ham  1) Go to write msg 2) Put on Dictionary mode 3...
1050  spam  18 days to Euro2004 kickoff! U will be kept in...
2719  spam  18 days to Euro2004 kickoff! U will be kept in...
389   spam  4mths half price Orange line rental & latest c...
2044  spam  4mths half price Orange line rental & latest c...
2982   ham  7 wonders in My WORLD 7th You 6th Ur style 5th...
4556   ham  7 wonders in My WORLD 7th You 6th Ur style 5th...
1470   ham  7 wonders in My WORLD 7th You 6th Ur style 5th...
1779   ham  7 wonders in My WORLD 7th You 6th Ur style 5th...
2370   ham  A Boy loved a gal. He propsd bt she didnt mind...
5104   ham  A Boy loved a gal. He propsd bt 

In [3]:
# Check if duplicates have conflicting labels
conflicting = (
    duplicates.groupby("SMS_Message")["Label"]
              .nunique()
              .reset_index()
              .query("Label > 1")
)

if not conflicting.empty:
    print("Conflicting duplicates found (same message, different labels):")
    print(conflicting)
else:
    print("All duplicate messages have consistent labels.")

All duplicate messages have consistent labels.
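
Because the duplicate check uses `keep=False`, the total it prints counts every copy of a repeated message, not only the surplus copies. The sketch below illustrates the difference on hypothetical toy data (the frame `toy` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: one text appears three times, one twice, one once
toy = pd.DataFrame({
    "Label": ["ham", "ham", "ham", "spam", "spam", "ham"],
    "SMS_Message": ["a", "a", "a", "b", "b", "c"],
})

# keep=False marks every row whose text is repeated (all copies)
all_copies = toy.duplicated(subset=["SMS_Message"], keep=False).sum()

# keep="first" marks only the extra copies, i.e. the rows a de-duplication would drop
extra_copies = toy.duplicated(subset=["SMS_Message"], keep="first").sum()

# Number of distinct texts that occur more than once
repeated_texts = (toy["SMS_Message"].value_counts() > 1).sum()

print(all_copies, extra_copies, repeated_texts)  # prints: 5 3 2
```

Which of the three numbers matters depends on the question: `extra_copies` is the number of rows that would disappear if duplicates were dropped, while `all_copies` is what the cell above reports.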


In [4]:
# Duplicates by Category Plot
# Flag every row whose message text occurs more than once (all copies are flagged)
df["is_duplicate"] = df.duplicated(subset=["SMS_Message"], keep=False)

dup_counts = df.groupby(["Label", "is_duplicate"]).size().reset_index(name="count")
# Readable hue labels, so the legend does not rely on relabeling handles by position
dup_counts["Status"] = dup_counts["is_duplicate"].map({False: "Unique", True: "Duplicate"})

plt.figure(figsize=(6,4))
sns.barplot(x="Label", y="count", hue="Status", hue_order=["Unique", "Duplicate"],
            data=dup_counts, palette="Set1")
plt.title("Duplicate vs Unique Messages by Category")
plt.xlabel("Message Type")
plt.ylabel("Count")
plt.legend(title="Message Status")
plt.tight_layout()

# outdir was created in the first cell
plt.savefig(outdir / "Duplicates_By_Category.png", dpi=300)
plt.close()

print("Saved duplicate analysis plot to:", outdir / "Duplicates_By_Category.png")


Saved duplicate analysis plot to: ../OUTPUT/01_Exploratory_Plots/Duplicates_By_Category.png
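
Raw duplicate counts are hard to compare across classes of different size, so a per-class share can complement the bar plot. A minimal sketch on hypothetical toy data, reusing the `is_duplicate` flag idea from the cell above:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Label": ["ham", "ham", "ham", "ham", "spam", "spam"],
    "SMS_Message": ["hi", "hi", "ok", "later", "win now", "win now"],
})
toy["is_duplicate"] = toy.duplicated(subset=["SMS_Message"], keep=False)

# The mean of a boolean column is the fraction of True values
dup_share = toy.groupby("Label")["is_duplicate"].mean()
print(dup_share)
```

In this toy frame half of the ham rows and all of the spam rows are flagged; on the real data the same two lines would give the duplicate rate per class.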


In [5]:
# Missing Values Plot
# Exclude the helper flag column added during the duplicate analysis
missing_counts = df.drop(columns=["is_duplicate"], errors="ignore").isnull().sum()
plt.figure(figsize=(6,4))
# Assign x to hue and hide the legend, as seaborn expects when a palette is passed
sns.barplot(x=missing_counts.index, y=missing_counts.values,
            hue=missing_counts.index, palette="Set2", legend=False)
plt.title("Count of Missing Values per Column")
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.tight_layout()
plt.savefig(outdir / "Missing_Values.png", dpi=300)
plt.close()
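
`isnull()` only detects true missing values, so empty or whitespace-only messages would not appear in the missing-values plot. The sketch below shows one way such blanks could be counted separately. The toy frame is hypothetical, and whether the real dataset contains any blank messages is not known from this notebook.

```python
import pandas as pd

# Hypothetical toy data: one real message, one empty, one whitespace-only, one missing
toy = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham"],
    "SMS_Message": ["Ok lar...", "", "   ", None],
})

# True missing values per column
null_counts = toy.isnull().sum()

# Messages that are present but empty after stripping whitespace
blank_count = toy["SMS_Message"].dropna().str.strip().eq("").sum()

print(null_counts["SMS_Message"], blank_count)  # prints: 1 2
```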


In [6]:
# Class Distribution Plot
plt.figure(figsize=(6,4))
# Assign x to hue and hide the legend, as seaborn expects when a palette is passed
sns.countplot(x="Label", hue="Label", data=df, palette="Set1", legend=False)
plt.title("Class Distribution of SMS Messages")
plt.xlabel("Message Type")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(outdir / "Class_Distribution.png", dpi=300)
plt.close()
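
The class imbalance that the plot is meant to highlight can also be stated as numbers. A minimal sketch on hypothetical toy labels (the 8:2 split is made up and is not the real class ratio):

```python
import pandas as pd

# Hypothetical toy labels: 8 ham, 2 spam
toy = pd.DataFrame({"Label": ["ham"] * 8 + ["spam"] * 2})

class_counts = toy["Label"].value_counts()
class_share = toy["Label"].value_counts(normalize=True)  # fractions that sum to 1

# Ratio of the largest class to the smallest class
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share)
print("Imbalance ratio:", imbalance_ratio)
```

Reporting the shares alongside the plot makes it easier to decide later whether stratified splitting or class weighting is needed.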


In [7]:
# Message Length Distribution Plot
# .str.len() yields NaN for missing messages instead of raising, unlike apply(len)
msg_lengths = df["SMS_Message"].str.len()

plt.figure(figsize=(8,5))
sns.histplot(x=msg_lengths, hue=df["Label"], bins=50, kde=True, palette="Set1")
plt.title("Distribution of Message Lengths by Class")
plt.xlabel("Message Length (characters)")
plt.ylabel("Frequency")
# Limit the x-axis to 300 characters; longer messages are not shown in the plot
plt.xlim(0, 300)
plt.tight_layout()
plt.savefig(outdir / "Message_Length_Distribution.png", dpi=300)
plt.close()
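
A per-class summary table can back up what the histogram shows about message lengths. The following sketch uses hypothetical toy messages. The column name `length` and the chosen statistics are illustrative assumptions, not part of the original script.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "spam"],
    "SMS_Message": ["Ok", "See you", "WIN a prize now", "Free entry today!!"],
})

# Character length of each message
toy["length"] = toy["SMS_Message"].str.len()

# Count, median, and maximum length per class
length_summary = toy.groupby("Label")["length"].agg(["count", "median", "max"])
print(length_summary)
```

The median is less sensitive than the mean to a few very long messages, which matters here because the histogram's x-axis is cut off at 300 characters.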

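Summary: The dataset loads with 5,572 rows and two columns, Label and SMS_Message. Some message texts occur more than once, in both the ham and the spam class; every repeated text carries a single label, so there are no conflicting duplicates. The duplicates are flagged in an is_duplicate column and plotted by class rather than removed at this stage. Missing values, class balance, and message lengths are plotted and saved to ../OUTPUT/01_Exploratory_Plots for review before the cleaning, splitting, and modeling steps.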