# Can Spam be Classified Based on Capitalization?

## Introduction

The `Spambase` dataset is a collection of emails labelled as spam or non-spam. Its variables include the frequencies of various words and characters, the lengths of uninterrupted capital letter sequences, and the class: spam (1) or non-spam (0).

Spam email is unsolicited bulk email, often explicit or commercial, and has been an issue since the 1990s (Cranor et al., 1998, p.74). Spam is low-cost to send, but burdens service providers and recipients. While declining as of 2013, spam has become sophisticated, disguising itself from filters (Irani et al., 2013, p.2).

Due to spam's increasing complexity, additional work must be done to increase the efficacy of its detection. To assist the construction of more practical spam filters, we will look for consistent content differences between spam and non-spam. We focus on the average length of sequences of capital letters, assuming that spam likely uses capitalization to create false urgency and garner attention.

The question we set out to answer is:

**"Is the mean average length of uninterrupted sequences of capital letters different for spam and non-spam email?"**

## Preliminary Results

We will first load the needed packages.

In [None]:
library(tidyverse)
library(repr)
library(cowplot)
library(infer) # specify(), generate(), calculate(), get_p_value()
library(broom) # tidy()

In [None]:
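# read.csv() treats the first line as a header by default; if the file has no header row
# (as in the original UCI distribution), that first record is used up as column names
spambase <- read.csv("https://raw.githubusercontent.com/rchanpra/stat-201-project/main/spambase/spambase.data")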

In [None]:
colnames(spambase) <- c("word_freq_make","word_freq_address","word_freq_all","word_freq_3d","word_freq_our",
                        "word_freq_over","word_freq_remove","word_freq_internet","word_freq_order","word_freq_mail",
                        "word_freq_receive","word_freq_will","word_freq_people","word_freq_report","word_freq_addresses",
                        "word_freq_free","word_freq_business","word_freq_email","word_freq_you","word_freq_credit",
                        "word_freq_your","word_freq_font","word_freq_000","word_freq_money","word_freq_hp",
                        "word_freq_hpl","word_freq_george","word_freq_650","word_freq_lab","word_freq_labs",
                        "word_freq_telnet","word_freq_857","word_freq_data","word_freq_415","word_freq_85",
                        "word_freq_technology","word_freq_1999","word_freq_parts","word_freq_pm","word_freq_direct",
                        "word_freq_cs","word_freq_meeting","word_freq_original","word_freq_project", "word_freq_re",
                        "word_freq_edu","word_freq_table","word_freq_conference","char_freq_;","char_freq_(",
                        "char_freq_[","char_freq_!","char_freq_$","char_freq_#","capital_run_length_average",
                        "capital_run_length_longest","capital_run_length_total","class")

In [None]:
spambase <- spambase %>% 
    mutate(class = ifelse(class == 1, "spam", "non-spam"))

In [None]:
nrow(spambase)

In [None]:
head(spambase)

We compute the sample mean and standard deviation of the average run length of capital letters, the count of spam and non-spam, and the difference in means.

In [None]:
# Treating the dataset as a sample from the larger population of email,
# the mean and standard deviation of the average capital run length for each class are our point estimates
# We select the columns we need (class and capital_run_length_average)

spambase_selected <- spambase %>% 
    select(capital_run_length_average,class)

spambase_stats <- spambase_selected %>% 
    group_by(class) %>% 
    summarize(mean_capital_run_length_average = mean(capital_run_length_average), 
              sd_capital_run_length_average = sd(capital_run_length_average), 
              n = n())

spambase_stats

In [None]:
diff_in_means <- spambase_stats$mean_capital_run_length_average[2] - spambase_stats$mean_capital_run_length_average[1]

diff_in_means

Spam has a greater sample mean and standard deviation than non-spam email.

We visualize the sample distribution using `geom_boxplot()` and `geom_histogram()`.

In [None]:
# Boxplot of the variable of interest for each class
spam_boxplot <- spambase_selected %>% 
    ggplot() + 
    geom_boxplot(aes(class, capital_run_length_average, fill = class), outlier.shape = NA) + 
    # zoom in without dropping data, so the quartiles are still computed from every email;
    # some extreme values are impractical to plot
    coord_cartesian(ylim = c(0, 10)) + 
    theme(text = element_text(size = 10)) + 
    ggtitle("Boxplot of average length of capital letter runs for Spam and Non-Spam emails") + 
    xlab("Class") + 
    ylab("Average length of uninterrupted sequence of capital letters")

In [None]:
spam_histogram <- spambase_selected %>% 
    filter(class == "spam") %>% 
    ggplot() + 
    geom_histogram(aes(x = capital_run_length_average, y = after_stat(density)), binwidth = 0.2) + 
    # zoom in without dropping data; some extreme values are impractical to plot
    coord_cartesian(xlim = c(0, 50)) + 
    theme(text = element_text(size = 10)) + 
    ggtitle("Sample distribution of mean length of capital letter runs for Spam emails") + 
    ylab("Density") + 
    xlab("Average length of uninterrupted sequence of capital letters")

non_spam_histogram <- spambase_selected %>% 
    filter(class == "non-spam") %>% 
    ggplot() + 
    geom_histogram(aes(x = capital_run_length_average, y = after_stat(density)), binwidth = 0.2) + 
    # zoom in without dropping data; some extreme values are impractical to plot
    coord_cartesian(xlim = c(0, 50)) + 
    theme(text = element_text(size = 10)) + 
    ggtitle("Sample distribution of mean length of capital letter runs for Non-Spam emails") + 
    ylab("Density") + 
    xlab("Average length of uninterrupted sequence of capital letters")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 4)
spam_boxplot

In [None]:
options(repr.plot.width = 10, repr.plot.height = 6)
plot_grid(spam_histogram, non_spam_histogram, ncol = 1)

*Note: some larger values (outliers) are not pictured, so that the majority of values remain legible.*

Spam emails seem to typically have a greater average length with a wider spread.
The distributions for both are unimodal and strongly right-skewed.

Based on these results, we might believe spam has longer capital letter sequences. For instance, an email with unusually long sequences of capitals is suspect.

## Methods: Plan

Assuming `Spambase` is a representative sample, our estimates of the mean values for capital letter sequences of spam and non-spam emails are likely to be good approximations of the true values, especially given the samples' large size. However, our point estimates do not provide any measure of how close they are likely to be to the true values: we do not know their uncertainty. We cannot use these results without further work.

For our analysis, we will build a confidence interval of the difference in means to obtain a range of values that we are confident contains the true difference. Since both samples are large (1812 spam, 2788 non-spam), we could use the theory-based approach, assuming the sampling distribution is approximately normal by the Central Limit Theorem; or bootstrapping, which makes no such assumptions.  
We will perform a two-sample t-test for the difference in means, using $H_0: \mu_1 = \mu_0$ vs $H_1: \mu_1\neq \mu_0$, where 1 and 0 represent spam and non-spam. If there is a difference, we would expect to gain evidence of it from these inferences, such as a confidence interval that excludes 0, or rejecting the null hypothesis.
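
For reference, the theory-based quantities in this plan can be written out explicitly. This is a sketch that assumes the unpooled (Welch) form of the two-sample test, which the plan above does not specify; $\bar{x}_i$, $s_i$, and $n_i$ are notation introduced here for the sample mean, sample standard deviation, and sample size of class $i$, and $t^{*}$ is the critical value for the chosen confidence level.

$$
T = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_0^2}{n_0}}},
\qquad
(\bar{x}_1 - \bar{x}_0) \;\pm\; t^{*}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_0^2}{n_0}}
$$

Under $H_0: \mu_1 = \mu_0$, the statistic $T$ is approximately $t$-distributed, and with samples this large it is close to standard normal.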

Our analysis could aid individuals in purging unsolicited messages, or help legitimate marketers avoid being mistaken for spam. While plenty of models to detect spam exist, human-interpretable data may enable informed improvements.

Although our study is limited to sequences of capital letters, the dataset contains many more variables. 
An immediate followup is:  
What other patterns exist? Which are associated with spam and non-spam? 


## Methods: Bootstrap Confidence Interval (Not a hypothesis test)

In [1]:
set.seed(1048596)

# Bootstrap distribution of the difference in means (spam minus non-spam)
spam_bootstrap_dist <- spambase_selected |> 
    specify(formula = capital_run_length_average ~ class) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = "diff in means", order = c("spam", "non-spam"))

# 95% percentile confidence interval for the difference in mean capital run length
spam_ci <- spam_bootstrap_dist |> 
    get_confidence_interval(level = 0.95, type = "percentile")
spam_ci
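
To make the resampling step concrete, the following is a minimal base R sketch of a percentile bootstrap interval for a difference in means. It runs on synthetic, right-skewed data rather than `Spambase`; the names `toy`, `diff_in_means_fn`, `boot_diffs`, and `toy_ci` are hypothetical, and resampling whole rows is an assumption about one reasonable way to bootstrap, not a description of any package's internals.

```r
# Synthetic, right-skewed stand-in data (NOT the Spambase values)
set.seed(2024)
toy <- data.frame(
  class = rep(c("spam", "non-spam"), times = c(300, 500)),
  capital_run_length_average = c(rexp(300, rate = 1 / 5) + 1,
                                 rexp(500, rate = 1 / 2) + 1)
)

# Difference in group means, spam minus non-spam
diff_in_means_fn <- function(df) {
  mean(df$capital_run_length_average[df$class == "spam"]) -
    mean(df$capital_run_length_average[df$class == "non-spam"])
}

# Resample rows with replacement and recompute the statistic each time
boot_diffs <- replicate(1000, {
  resampled <- toy[sample(nrow(toy), replace = TRUE), ]
  diff_in_means_fn(resampled)
})

# Percentile interval: the middle 95% of the bootstrap statistics
toy_ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
toy_ci
```

An interval that lies entirely above 0 would be consistent with a longer mean capital run length for the spam group in this toy setting.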


## Methods: Permutation Hypothesis Testing

Our null hypothesis is that the average capital run length is unrelated to whether an email is classified as spam, so that both classes share the same mean. Here, we create the null model for this hypothesis by permuting the class labels.

We do a one-sided (right-tailed) hypothesis test, because we do not expect that spam email will have shorter capital letter runs on average than non-spam. We choose a significance level of $\alpha = 0.05$, as this is a common cutoff across fields. We begin with a simulation-based approach, instead of a theory-based one, as it does not rely on the sampling distribution being approximately normal. However, we will do both over the course of this report.

In [None]:
set.seed(1337) # set the seed

obs_diff_in_means <- spambase_selected |> specify(formula = capital_run_length_average ~ class) |>
    calculate(stat="diff in means", order=c("spam","non-spam"))

null_diff_in_means <- spambase_selected |> specify(formula=capital_run_length_average ~ class) |>
    hypothesize(null="independence") |>
    generate(reps=1000, type="permute") |>
    calculate(stat = "diff in means",order = c("spam","non-spam"))
    
p_value <- null_diff_in_means |> get_p_value(obs_stat = obs_diff_in_means,direction="right")
p_value
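
The label-shuffling idea behind the null model can also be sketched in base R. This is an illustrative sketch on synthetic data, not the `Spambase` values, and the names `toy`, `observed`, `perm_diffs`, and `toy_p_value` are hypothetical.

```r
# Synthetic, right-skewed stand-in data (NOT the Spambase values)
set.seed(2024)
toy <- data.frame(
  class = rep(c("spam", "non-spam"), times = c(300, 500)),
  capital_run_length_average = c(rexp(300, rate = 1 / 5) + 1,
                                 rexp(500, rate = 1 / 2) + 1)
)

# Observed difference in means, spam minus non-spam
observed <- mean(toy$capital_run_length_average[toy$class == "spam"]) -
  mean(toy$capital_run_length_average[toy$class == "non-spam"])

# Under independence the labels are exchangeable, so shuffle them and recompute
perm_diffs <- replicate(1000, {
  shuffled <- sample(toy$class)
  mean(toy$capital_run_length_average[shuffled == "spam"]) -
    mean(toy$capital_run_length_average[shuffled == "non-spam"])
})

# Right-tailed p-value: share of shuffled differences at least as large as observed
toy_p_value <- mean(perm_diffs >= observed)
toy_p_value
```

With only 1000 permutations, a p-value of 0 means that no shuffled difference reached the observed one, so it is better read as "below 1/1000" than as exactly zero.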

We visualize the permutation null distribution together with the observed difference in means.

In [None]:
diff_in_means_plot <- # adapted from tutorial 6 solution
    visualize(null_diff_in_means, bins = 10) + 
    shade_p_value(obs_stat = obs_diff_in_means, direction = "right") +
    xlab("Difference in means (spam - non-spam)") +
    theme(text = element_text(size = 20))

diff_in_means_plot

## Methods: Theory-based testing
Here we perform a two-sample t-test to double-check the p-value we obtained through simulation.

In [None]:
# Two-sample (Welch) t-test of H0: mu_spam = mu_non_spam against H1: mu_spam > mu_non_spam
spam_runs <- spambase_selected %>% 
    filter(class == "spam") %>% 
    pull(capital_run_length_average)
non_spam_runs <- spambase_selected %>% 
    filter(class == "non-spam") %>% 
    pull(capital_run_length_average)

broom::tidy(
    t.test(
        x = spam_runs,
        y = non_spam_runs,
        alternative = "greater", # one-sided: spam mean greater than non-spam mean
        conf.level = 0.95
    )
)
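
As a cross-check on what a Welch two-sample t-test computes, the statistic, degrees of freedom, and right-tailed p-value can be reproduced by hand and compared with `t.test()`. The sketch below uses synthetic data rather than `Spambase`, and the names `spam_toy`, `non_spam_toy`, `t_stat`, `welch_df`, and `p_right` are hypothetical.

```r
# Synthetic, right-skewed stand-in samples (NOT the Spambase values)
set.seed(2024)
spam_toy <- rexp(300, rate = 1 / 5) + 1
non_spam_toy <- rexp(500, rate = 1 / 2) + 1

n1 <- length(spam_toy)
n0 <- length(non_spam_toy)
v1 <- var(spam_toy)
v0 <- var(non_spam_toy)

# Standard error of the difference in means (unpooled variances)
se <- sqrt(v1 / n1 + v0 / n0)

# Welch t statistic for spam minus non-spam
t_stat <- (mean(spam_toy) - mean(non_spam_toy)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v1 / n1 + v0 / n0)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v0 / n0)^2 / (n0 - 1))

# Right-tailed p-value, matching a one-sided "greater" alternative
p_right <- pt(t_stat, df = welch_df, lower.tail = FALSE)

# Compare against the built-in test
check <- t.test(spam_toy, non_spam_toy, alternative = "greater")
c(by_hand = t_stat, t_test = unname(check$statistic))
```

The two values should agree up to floating-point error, which is a quick way to confirm which group's mean is being subtracted from which.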