**Step1: Import data and split**
* We import the dataset and split it into two data frames, one for spam mails and one for ham mails.
* Any missing (NA) text is replaced with an empty string.

In [1]:
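library(readr)

# Read the CSV and keep only columns 3 and 4 (text and label_num).
data <- read_csv("../input/spam-mails-dataset/spam_ham_dataset.csv")
data <- data[, c(3, 4)]

# Sort by label so that ham (0) rows come first, then count each class.
dataset <- data[order(data$label_num), ]
table(dataset$label_num)

# After sorting, rows 1-3672 are ham and rows 3673-5171 are spam.
dataspam <- dataset[3673:5171, ]
dataham <- dataset[1:3672, ]

# Replace missing text with an empty string.
dataspam$text[is.na(dataspam$text)] <- ""
dataham$text[is.na(dataham$text)] <- ""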


“Missing column names filled in: 'X1' [1]”

── Column specification ────────────────────────────────────────────────────────
cols(
  X1 = col_double(),
  label = col_character(),
  text = col_character(),
  label_num = col_double()
)





   0    1 
3672 1499 
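
The split above uses row ranges that match the class counts printed by `table()`. As a minimal sketch, assuming a small made-up data frame (`toy` and its contents are hypothetical, not the Kaggle file), the same split can also be written by filtering on `label_num`, which avoids hard-coding row numbers:

```r
# Hypothetical toy data standing in for the real data set.
toy <- data.frame(
  text = c("meeting at noon", NA, "win a free prize", "cheap pills online"),
  label_num = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Split on the label instead of on fixed row ranges.
toy_ham  <- toy[toy$label_num == 0, ]
toy_spam <- toy[toy$label_num == 1, ]

# Replace missing text with an empty string, as in the cell above.
toy_ham$text[is.na(toy_ham$text)]   <- ""
toy_spam$text[is.na(toy_spam$text)] <- ""

nrow(toy_ham)
nrow(toy_spam)
```

With this form the split would still be correct if the number of ham or spam rows changed.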

**Step2: Creating list of most frequent words**
* The spam mails are converted to a corpus; the text is lowercased, and punctuation, numbers and extra whitespace are removed.
* The same is done for the ham mails.
* The most frequent words (frequency of at least 30 for spam and 45 for ham, since we have more ham mails) are stored in an array.
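
To make the cleaning and the frequency cut-off concrete, here is a minimal base-R sketch on made-up mails. It is an illustration only: the helper `clean_tokens()`, the toy texts and the threshold of 2 are assumptions, and the tokenization is simpler than what the `tm` package does.

```r
# Hypothetical helper: lowercase, drop punctuation and digits, split on whitespace.
clean_tokens <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", " ", text)
  text <- gsub("[[:digit:]]", " ", text)
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens[tokens != ""]
}

# Made-up example mails.
toy_spam <- c("Win a FREE prize now!!!", "Free pills, 100% free", "Claim your prize today")

# Count every token across all mails and keep those seen at least min_freq times.
all_tokens <- unlist(lapply(toy_spam, clean_tokens))
freq <- table(all_tokens)
min_freq <- 2
frequent_words <- names(freq[freq >= min_freq])
frequent_words
```

Raising `min_freq` shortens the word list, which is the role the cut-offs 30 and 45 play for the real spam and ham mails.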

In [2]:

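library(tm)

# Build a term-document matrix for the spam mails and keep terms seen at least 30 times.
cpspam <- Corpus(VectorSource(dataspam$text))
tdm_spam <- TermDocumentMatrix(cpspam)
spam_words <- as.array(findFreqTerms(tdm_spam, 30))

# Same for the ham mails, with a cut-off of 45.
cpham <- Corpus(VectorSource(dataham$text))
tdm_ham <- TermDocumentMatrix(cpham)
ham_words <- as.array(findFreqTerms(tdm_ham, 45))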

Loading required package: NLP


Attaching package: ‘NLP’


The following object is masked from ‘package:httr’:

    content




In [3]:
length(ham_words)
length(spam_words)

**Step3.1: Checking for Ham mails**
* We take each text (e.g., dataham[1,1]) and count how many of its words appear in the ham word list and how many appear in the spam word list
* If the ham count exceeds the spam count (scaled by 1.055 in the code), we label it as a ham mail
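
The decision rule can be sketched in isolation as a small function. The word lists and the helper `classify_words()` below are hypothetical stand-ins for `ham_words`, `spam_words` and the loop body; only the comparison `h > 1.055 * s` is taken from the notebook.

```r
# Hypothetical word lists standing in for ham_words and spam_words.
toy_ham_words  <- c("meeting", "report", "schedule", "please")
toy_spam_words <- c("free", "prize", "win", "please")

# Hypothetical helper: returns 0 (ham) or 1 (spam) for a vector of distinct words.
classify_words <- function(words, ham_words, spam_words, weight = 1.055) {
  h <- sum(words %in% ham_words)   # words found in the ham list
  s <- sum(words %in% spam_words)  # words found in the spam list
  if (h > weight * s) 0 else 1
}

ham_like  <- classify_words(c("please", "send", "report", "meeting"),
                            toy_ham_words, toy_spam_words)
spam_like <- classify_words(c("win", "free", "prize", "please"),
                            toy_ham_words, toy_spam_words)
ham_like
spam_like
```

Note that a tie, including a text with no listed words at all, falls on the spam side of this rule.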

In [4]:
library(dplyr)
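# Classify every ham mail: 0 = ham, 1 = spam, NA = empty text.
result <- rep(0, length(dataham$text))
for (i in seq_along(dataham$text)) {
  text <- dataham$text[i]
  if (is.na(text) || text == "") {
    result[i] <- NA
  } else {
    # Extract the distinct terms of this mail.
    cp <- Corpus(VectorSource(text))
    tdm <- TermDocumentMatrix(cp)
    words <- as.array(tdm$dimnames$Terms)

    # Count how many terms appear in each frequent-word list.
    s <- sum(words %in% spam_words)
    h <- sum(words %in% ham_words)

    # Label as ham only if the ham count beats the scaled spam count.
    if (h > 1.055 * s) {
      result[i] <- 0
    } else {
      result[i] <- 1
    }
  }
}

# Share of ham mails labelled as ham.
mean(result == 0, na.rm = TRUE)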



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




**Step3.2: Checking for Spam mails**
* We take each text (e.g., dataspam[1,1]) and count how many of its words appear in the ham word list and how many appear in the spam word list
* If the ham count does not exceed the scaled spam count, we label it as a spam mail
* Because we have more data for ham mails, it was difficult to identify spam mails correctly
* After much trial and error, scaling the spam count by 1.055 (s*1.055) gave the best accuracy for both datasets
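
A trial-and-error search of this kind can be sketched as a small grid search. The counts, labels and candidate grid below are made up for illustration, and `accuracy_for()` is a hypothetical helper; the notebook does not show how 1.055 was actually found.

```r
# Made-up per-mail counts of ham-list hits (h) and spam-list hits (s), with true labels.
h <- c(10, 8, 12, 51, 6, 4)
s <- c(6, 7, 5, 50, 9, 8)
label <- c(0, 0, 0, 1, 1, 1)  # 0 = ham, 1 = spam

# Hypothetical helper: accuracy of the rule "ham if h > w * s" for one weight w.
accuracy_for <- function(w) {
  predicted <- ifelse(h > w * s, 0, 1)
  mean(predicted == label)
}

# Try a grid of candidate weights and keep the first one with the best accuracy.
candidates <- seq(0.9, 1.3, by = 0.05)
accuracies <- sapply(candidates, accuracy_for)
best <- candidates[which.max(accuracies)]

data.frame(weight = candidates, accuracy = accuracies)
best
```

`which.max()` returns the first maximum, so when several weights tie, this sketch keeps the smallest one.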

In [5]:

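# Classify every spam mail: 0 = ham, 1 = spam, NA = empty text.
result <- rep(0, length(dataspam$text))
for (i in seq_along(dataspam$text)) {
  text <- dataspam$text[i]
  if (is.na(text) || text == "") {
    result[i] <- NA
  } else {
    # Extract the distinct terms of this mail.
    cp <- Corpus(VectorSource(text))
    tdm <- TermDocumentMatrix(cp)
    words <- as.array(tdm$dimnames$Terms)

    # Count how many terms appear in each frequent-word list.
    s <- sum(words %in% spam_words)
    h <- sum(words %in% ham_words)

    # Label as ham only if the ham count beats the scaled spam count.
    if (h > 1.055 * s) {
      result[i] <- 0
    } else {
      result[i] <- 1
    }
  }
}

# Share of spam mails labelled as spam.
mean(result == 1, na.rm = TRUE)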


**Step4: Checking for full dataset**

In [6]:

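# Classify every mail in the full dataset: 0 = ham, 1 = spam, NA = empty text.
result <- rep(0, length(dataset$text))
for (i in seq_along(dataset$text)) {
  text <- dataset$text[i]
  if (is.na(text) || text == "") {
    result[i] <- NA
  } else {
    # Extract the distinct terms of this mail.
    cp <- Corpus(VectorSource(text))
    tdm <- TermDocumentMatrix(cp)
    words <- as.array(tdm$dimnames$Terms)

    # Count how many terms appear in each frequent-word list.
    s <- sum(words %in% spam_words)
    h <- sum(words %in% ham_words)

    # Label as ham only if the ham count beats the scaled spam count.
    if (h > 1.055 * s) {
      result[i] <- 0
    } else {
      result[i] <- 1
    }
  }
}

# Overall accuracy: share of predictions that match the true label.
mean(result == dataset$label_num, na.rm = TRUE)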


**Conclusion: Final accuracy was found to be approximately 96%**