## Lab 4

To implement a Naive Bayes classifier on text data, two packages are
needed: __tm__ for text mining and __e1071__ for the Naive Bayes classifier.

In [58]:
library(tm)
library(e1071)

The dataset provided with the exercise is the SMS SPAM dataset
(sms_spam.csv). It contains unprocessed, user-written text in two
columns, "type" and "text". The attribute "type" indicates whether the
message is SPAM or HAM. The attribute "text" is the user-composed
text message. The task is to train a Naive Bayes classifier on this text
data and use it to classify the messages.

In [59]:
sms_raw <- read.csv("F:/dataMining600/Lab4/sms_spam.csv", stringsAsFactors = FALSE)
print(sms_raw$type[239])
# Workaround for the "duplicate row.names" error, which typically appears when
# the data rows contain one more field than the header. Uncomment if needed:
# colnames(sms_raw) <- c(colnames(sms_raw)[-1], "x")
# sms_raw$x <- NULL
head(sms_raw)
print(dim(sms_raw))
sms_raw$type[239]

[1] "ham"


type,text
ham,Hope you are having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or ￡10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?


[1] 5559    2
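
Before modelling, it can help to check how the two classes are distributed. The following is a minimal sketch on a small made-up data frame; `toy_sms` and its contents are hypothetical and only mirror the two columns of sms_raw.

```r
# Hypothetical toy data with the same two columns as sms_raw
toy_sms <- data.frame(
  type = c("ham", "ham", "spam", "ham"),
  text = c("see you at 4", "ok thanks", "WIN a FREE prize now", "call me later"),
  stringsAsFactors = FALSE
)

# Treat the label as a categorical variable
toy_sms$type <- factor(toy_sms$type)

# Counts and proportions of each class
class_counts <- table(toy_sms$type)
class_props <- prop.table(class_counts)
print(class_counts)
print(class_props)
```

On the real data the same two calls would be applied to sms_raw$type.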


In text mining, a CORPUS is a collection of documents. To process
text, every individual document (here, each message is a document) has to
be put into a corpus. sms_raw is a data frame with two attributes,
"type" and "text". The attribute "text" holds the documents, which
are now converted to a CORPUS:

In [60]:
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
sms_corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5559
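
Individual documents in a corpus can be looked at directly. This is a small sketch on a hypothetical two-message corpus (`toy_corpus` is not part of the lab data), built the same way as sms_corpus.

```r
library(tm)

# Hypothetical two-message corpus, built the same way as sms_corpus
toy_corpus <- VCorpus(VectorSource(c("Hope you are well", "FREE entry, call now")))

# Number of documents in the corpus
n_docs <- length(toy_corpus)

# Text of the first document ([[ ]] extracts a single document)
first_doc <- as.character(toy_corpus[[1]])
print(n_docs)
print(first_doc)
```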

Data cleaning comes first if meaningful information is to be extracted.
Text processing involves:
#### (a) Removing numbers from text
#### (b) Removing punctuation
#### (c) Converting to lower case
#### (d) Removing stopwords
#### (e) Removing extra white-spaces
The cell below applies steps (a) to (c) to the CORPUS sms_corpus;
steps (d) and (e) are not applied here. The R commands are as follows:


In [61]:
## Remove all non-ASCII characters from the text (iconv is vectorized, so no loop is needed)
sms_raw$text <- iconv(sms_raw$text, "latin1", "ASCII", sub = "")

## Note: the non-ASCII characters have to be removed before executing the
## following, and sms_corpus must be built from the cleaned sms_raw$text
## for this step to have any effect.
sms_corpus_clean <- tm_map(sms_corpus, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean <- tm_map(sms_corpus_clean, content_transformer(tolower))
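
Steps (d) and (e) from the list can be sketched with tm's `removeWords` and `stripWhitespace` transformations. The example below runs on a hypothetical two-message corpus rather than on sms_corpus_clean, because adding these steps to the lab pipeline would change the term counts shown in the outputs that follow.

```r
library(tm)

# Hypothetical two-message corpus standing in for sms_corpus_clean
toy_corpus <- VCorpus(VectorSource(c("call now to claim   the prize",
                                     "i am on the way home")))

# (d) Remove English stopwords using tm's built-in list
toy_clean <- tm_map(toy_corpus, removeWords, stopwords("english"))

# (e) Collapse runs of white-space into a single space
toy_clean <- tm_map(toy_clean, stripWhitespace)

cleaned_text <- unlist(lapply(toy_clean, as.character))
print(cleaned_text)
```

Stopwords are removed before white-space is stripped because deleting a word leaves its surrounding spaces behind.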

In [62]:
lapply(sms_corpus[1:4], as.character)
lapply(sms_corpus_clean[1:4], as.character)

For a classifier to be trained, the data must be arranged as a set of attributes (features). Text data can be converted into
such a representation using a __Document Term Matrix__ or a __Term Document Matrix__. In a Document Term Matrix, documents are represented
along rows and the terms (words) extracted from these documents are placed along columns (vice versa in a Term Document Matrix).
The Document Term Matrix for the CORPUS generated above is obtained with the following R command:

In [63]:
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
inspect(sms_dtm)
print (sms_corpus_clean[1:4])

<<DocumentTermMatrix (documents: 5559, terms: 8390)>>
Non-/sparse entries: 57560/46582450
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   and are call for have now that the you your
  1628   4   3    0   1    0   0    1   2   5    0
  2046   4   0    0   0    4   0    2  10  13    0
  2993   3   1    0   2    2   0    1   2   4    1
  313    4   0    0   6    1   0    2   8   0    0
  3522   1   0    0   1    0   0    0   2   0    0
  399    1   0    0   1    0   0    0   2   0    0
  4493   1   0    0   2    0   0    2   2   5    0
  5279   0   0    0   3    1   0    1   1   0    0
  64     1   0    0   0    0   0    1   3   4    0
  808    1   0    0   1    3   0    1   2   3    0
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4
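
To make the row and column layout concrete, here is a small sketch that builds a Document Term Matrix from two made-up messages and converts it to an ordinary matrix. `toy_docs` is hypothetical; the default settings of `DocumentTermMatrix` are assumed.

```r
library(tm)

# Two hypothetical messages
toy_docs <- c("free prize call", "call home later")

# One row per document, one column per distinct term
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# A dense matrix is fine for a toy example; avoid this on large corpora
toy_mat <- as.matrix(toy_dtm)
print(toy_mat)
```

Each cell holds the number of times the column's term occurs in the row's document.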


For classification, the data is split into training and test sets. For simplicity, the first 4000 of the 5559 messages (about 72%) are used for training and the remaining 1559 form the test set. The R commands for the train-test split are as follows:

In [64]:
# creating training and test datasets
## the full matrix has 5559 documents and 8390 terms
print (dim(sms_dtm))

[1] 5559 8390


In [67]:
sms_dtm_train <- sms_dtm[1:4000, ]
sms_dtm_test <- sms_dtm[4001:5559, ]
inspect(sms_dtm)


<<DocumentTermMatrix (documents: 5559, terms: 8390)>>
Non-/sparse entries: 57560/46582450
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   and are call for have now that the you your
  1628   4   3    0   1    0   0    1   2   5    0
  2046   4   0    0   0    4   0    2  10  13    0
  2993   3   1    0   2    2   0    1   2   4    1
  313    4   0    0   6    1   0    2   8   0    0
  3522   1   0    0   1    0   0    0   2   0    0
  399    1   0    0   1    0   0    0   2   0    0
  4493   1   0    0   2    0   0    2   2   5    0
  5279   0   0    0   3    1   0    1   1   0    0
  64     1   0    0   0    0   0    1   3   4    0
  808    1   0    0   1    3   0    1   2   3    0
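
The split in the cell above is positional, which is only representative if the messages are stored in no particular order. Where that is not known, a random split is a common alternative. The sketch below only generates the row indices; the seed value is arbitrary and the variable names are hypothetical.

```r
set.seed(123)  # arbitrary seed, used only for reproducibility
n <- 5559      # number of messages, as reported by dim(sms_raw)

# Draw 4000 distinct row indices for training; the rest go to the test set
train_idx <- sample(n, 4000)
test_idx <- setdiff(seq_len(n), train_idx)

# The same indices would then be used for both the matrix and the labels,
# e.g. sms_dtm[train_idx, ] and sms_raw$type[train_idx]
print(length(train_idx))
print(length(test_idx))
```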


Inspecting the Document-Term matrix shows that it is sparse: almost all
entries are zero, with non-zero counts at very few locations. Hence, it is
algorithmically advantageous to work only with frequent words, i.e. words
whose total count across the training documents reaches some minimum.
To extract the terms that occur at least 5 times, the findFreqTerms
function is used:


In [68]:
freq_terms <- findFreqTerms(sms_dtm_train, 5)

In [69]:
# create DTMs with only the frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , freq_terms]
sms_dtm_freq_test <- sms_dtm_test[ , freq_terms]
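
What `findFreqTerms` selects can be checked by hand on a toy matrix: it keeps the terms whose total count over all documents is at least the given lower bound. The three messages below are made up for illustration.

```r
library(tm)

# Three hypothetical messages
toy_docs <- c("free prize call", "call home later", "free call")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# Terms with a total count of at least 2
frequent <- findFreqTerms(toy_dtm, 2)

# The same selection computed from column sums of the dense matrix
totals <- colSums(as.matrix(toy_dtm))
by_hand <- names(totals)[totals >= 2]

print(frequent)
print(by_hand)
```

Note that the frequent terms are chosen from the training matrix only and then used to subset both matrices, so the test set does not influence which features are kept.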


In [70]:
inspect(sms_dtm_freq_train)

<<DocumentTermMatrix (documents: 4000, terms: 1287)>>
Non-/sparse entries: 33144/5114856
Sparsity           : 99%
Maximal term length: 19
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   and are call for have now that the you your
  1613   3   0    0   0    0   0    3   3   8    0
  1628   4   3    0   1    0   0    1   2   5    0
  2046   4   0    0   0    4   0    2  10  13    0
  2993   3   1    0   2    2   0    1   2   4    1
  313    4   0    0   6    1   0    2   8   0    0
  3522   1   0    0   1    0   0    0   2   0    0
  3854   1   1    0   0    0   0    1   3   4    0
  399    1   0    0   1    0   0    0   2   0    0
  64     1   0    0   0    0   0    1   3   4    0
  808    1   0    0   1    3   0    1   2   3    0


In [71]:
sms_raw$type[1600]

In [77]:
sms_train_labels = sms_raw$type[1:4000]
sms_test_labels = sms_raw$type[4001:5559]

In [78]:
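# Convert term counts to a categorical "Yes"/"No" indicator (term present or absent)
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
# MARGIN = 2 applies the conversion column by column (one column per term)
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)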
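
The effect of the conversion is easiest to see on a tiny count matrix. The sketch below uses a hypothetical 2 x 3 matrix and a helper named `to_yes_no` that mirrors convert_counts.

```r
# Hypothetical term counts: 2 documents, 3 terms
toy_counts <- matrix(c(0, 2, 1,
                       3, 0, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Docs = c("1", "2"),
                                     Terms = c("call", "free", "prize")))

# Same idea as convert_counts: any positive count becomes "Yes"
to_yes_no <- function(x) ifelse(x > 0, "Yes", "No")

# Apply column by column; the result is a character matrix of the same shape
toy_flags <- apply(toy_counts, MARGIN = 2, to_yes_no)
print(toy_flags)
```

The classifier then treats each term as a categorical feature, so how often a word appears in a message no longer matters, only whether it appears.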


In [79]:
sms_classifier <- naiveBayes(as.matrix(sms_train),as.factor(sms_train_labels))

In [80]:
sms_test_pred <- predict(sms_classifier,as.matrix(sms_test))
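
With categorical features, a term that never occurs in one class during training gets a conditional probability of zero for that class. The `laplace` argument of `naiveBayes` adds a pseudo-count to avoid this. The following is a minimal sketch on made-up data; the lab code above uses the default `laplace = 0`, and all `toy_` names are hypothetical.

```r
library(e1071)

# Hypothetical training data: two "Yes"/"No" term indicators and a class label
lv <- c("No", "Yes")
toy_train <- data.frame(
  free = factor(c("Yes", "Yes", "No", "No"), levels = lv),
  call = factor(c("Yes", "No", "No", "Yes"), levels = lv)
)
toy_labels <- factor(c("spam", "spam", "ham", "ham"))

# laplace = 1 adds one pseudo-count to every feature value in every class
toy_model <- naiveBayes(toy_train, toy_labels, laplace = 1)

# Two new messages: one containing "free", one containing only "call"
toy_new <- data.frame(
  free = factor(c("Yes", "No"), levels = lv),
  call = factor(c("No", "Yes"), levels = lv)
)
toy_pred <- predict(toy_model, toy_new)
print(toy_pred)
```

In this toy set "call" is equally common in both classes, so the prediction is driven by "free" alone.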

In [81]:
tb <- table("Predicted" = sms_test_pred, "Actual" = sms_test_labels)
print(dim(tb))

[1] 2 2


In [83]:
tb

         Actual
Predicted  ham spam
     ham  1346   29
     spam    6  178
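
The confusion matrix above can be summarised with a few standard measures. The sketch below re-enters the four counts by hand (`tb_example` is a stand-in for `tb`) and computes accuracy, plus precision and recall for the spam class.

```r
# Counts copied from the confusion matrix above (filled column by column)
tb_example <- matrix(c(1346, 6, 29, 178), nrow = 2,
                     dimnames = list(Predicted = c("ham", "spam"),
                                     Actual = c("ham", "spam")))

# Share of all test messages that were classified correctly
accuracy <- sum(diag(tb_example)) / sum(tb_example)

# Of the messages predicted as spam, the share that really is spam
spam_precision <- tb_example["spam", "spam"] / sum(tb_example["spam", ])

# Of the actual spam messages, the share that was caught
spam_recall <- tb_example["spam", "spam"] / sum(tb_example[, "spam"])

print(round(c(accuracy = accuracy,
              precision = spam_precision,
              recall = spam_recall), 3))
# accuracy 0.978, precision 0.967, recall 0.86
```

So 1524 of the 1559 test messages are classified correctly, while 29 of the 207 spam messages are missed and 6 of the 1352 ham messages are flagged as spam.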