# LSE Machine Learning: Practical Applications

## Module 5 Unit 2 IDE Activity (Assessment) | Generative models in text classification

### In this IDE notebook, you are required to apply a generative model to a prepared data set in R.
The instructions for this IDE activity are positioned as text cells before each step. As a result, you are required to read the text cells above a code cell, familiarise yourself with the required step, and then execute the step. You are encouraged to refer back to the practice IDE activity to familiarise yourself with the different steps and how they are executed in R.

### Step 1: Load and install the relevant packages

Load the following packages: `tm`, `e1071`, `dplyr`, and `caret`.

In [1]:
# Load the required packages
library(tm)
library(e1071)
library(dplyr)
library(caret)

Loading required package: NLP


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: lattice

Loading required package: ggplot2


Attaching package: ‘ggplot2’


The following object is masked from ‘package:NLP’:

    annotate




### Step 2: Load the data

To execute the naive Bayes classifier, load the reduced newsgroups data set into R, and prepare the data.

In [2]:
# Load the data
df <- read.csv("2_newsgroups.csv", stringsAsFactors = FALSE)

In [3]:
summary(df)

       X              text              title          
 Min.   :   0.0   Length:1192        Length:1192       
 1st Qu.: 297.8   Class :character   Class :character  
 Median :4671.5   Mode  :character   Mode  :character  
 Mean   :2640.3                                        
 3rd Qu.:4969.2                                        
 Max.   :5267.0                                        

### Step 3: Prepare the data

**Note:** In this section, you are required to perform a number of different steps to prepare the data for the model. In the practice IDE activity, you were provided with a data set that had already been prepared. In this notebook, a brief description is provided of the steps to prepare the data to create and use the naive Bayes model.

3.1 In the following steps, you build a model that predicts the category of each document. Because the category is the class variable, convert the `title` column to a factor.

In [4]:
# Convert the title variable from character to factor
df$title <- as.factor(df$title)
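
As a minimal sketch of what the conversion does, the following base R example uses a short hypothetical vector (`title` here is a stand-in, not the course data) to show that a factor stores the distinct classes as levels. With two classes, `caret` treats the first level as the positive class by default, so the alphabetical level order matters when you read the confusion matrix later.

```r
# Hypothetical labels standing in for df$title
title <- c("rec.autos", "rec.motorcycles", "rec.autos")
title_factor <- as.factor(title)

# The distinct classes, sorted alphabetically
levels(title_factor)

# The number of documents in each class
table(title_factor)
```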

3.2 Apply the bag-of-words approach. In this approach, each word is represented as a variable, and each document is represented as a vector of these variables. Word order is disregarded, and only the frequency of each word in the document is kept, so each document is treated as a bag of words. First, prepare a corpus of all the documents in the data frame. A corpus is a collection of writings. In this case, the corpus is the collection of all the documents in the data set.

In [5]:
# Create and inspect the corpus
corpus <- Corpus(VectorSource(df$text))
corpus
inspect(corpus[1])

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1192

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

[1] I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.
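
To make the bag-of-words idea concrete, here is a minimal base R sketch. It does not use the `tm` package, and the two documents, as well as the names `docs`, `vocab`, `count_words`, and `bow`, are hypothetical and chosen for illustration only. Each row of the result is a document, each column is a word, and word order plays no role.

```r
# Two toy documents (hypothetical, not taken from the data set)
docs <- c("the car has a small door", "the bike and the car")

# Split each document into lowercase word tokens
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct words across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_words <- function(words) {
  as.integer(table(factor(words, levels = vocab)))
}

# One row per document, one column per vocabulary word
bow <- t(vapply(tokens, count_words, integer(length(vocab))))
dimnames(bow) <- list(paste0("doc", seq_along(docs)), vocab)
bow
```

In this sketch, the word "the" is counted twice for the second document, while the position of each word within the sentence is lost.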


3.3 To prepare the data for this simplified approach, apply the following operations:

- Transforming the data to lowercase
- Removing punctuation
- Removing numbers
- Removing stopwords
- Stripping out extra white space

All of these actions can be completed using the tm package.

Once the data has been cleaned, the `DocumentTermMatrix` function of the tm package is used to create a document-term matrix (dtm) of the bag-of-words tokens. Each row of the dtm corresponds to a document in the collection, each column corresponds to a term, and each element is the number of times that term appears in that document.

Finally, the newly created dtm can be examined by using the `inspect` function.

In [6]:
# Clean the data
corpus.clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)

# Create the dtm
dtm <- DocumentTermMatrix(corpus.clean)

# Inspect the dtm
inspect(dtm[40:50, 10:15])

“transformation drops documents”
“transformation drops documents”
“transformation drops documents”
“transformation drops documents”
“transformation drops documents”


<<DocumentTermMatrix (documents: 11, terms: 6)>>
Non-/sparse entries: 2/64
Sparsity           : 97%
Maximal term length: 9
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs door doors early email engine enlighten
  40    0     0     0     0      0         0
  41    0     0     0     0      0         0
  42    0     0     0     0      3         0
  43    0     0     0     0      0         0
  44    0     0     0     0      0         0
  45    0     0     0     0      0         0
  46    0     0     0     0      0         0
  47    0     0     0     0      0         0
  48    0     0     0     0      0         0
  49    0     0     1     0      0         0
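
The cleaning operations can also be traced by hand on a single sentence. The following is a rough base R stand-in for the `tm` transformations, not the package's own implementation. The sentence `raw` and the short list `toy_stopwords` are hypothetical, and the real `stopwords(kind = "en")` list is much longer.

```r
# Hypothetical raw text and a tiny illustrative stopword list
raw <- "It was a 2-door sports car, from the late 60s!"
toy_stopwords <- c("it", "was", "a", "from", "the")

clean <- tolower(raw)                         # transform to lowercase
clean <- gsub("[[:punct:]]+", "", clean)      # remove punctuation
clean <- gsub("[[:digit:]]+", "", clean)      # remove numbers

# Remove stopwords as whole words only
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
clean <- gsub(stop_pattern, "", clean, perl = TRUE)

# Collapse repeated white space; trimws() is an extra tidy-up for this sketch
clean <- trimws(gsub("\\s+", " ", clean))
clean
```

The result is `"door sports car late s"`. The stray "s" is what remains of "60s" once the digits are removed, which shows that the order of the operations affects the tokens that end up in the dtm.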


3.4 Before estimating the model, split the data frame, the dtm, and the cleaned corpus into training and test sets. A random sample of 70% of the documents forms the training set, and the remaining 30% forms the test set. Calling `set.seed(1)` makes the random split reproducible. Remember that the title variable is in vector form, so it is subset together with the rows of the data frame.

In [7]:
# Split the data into training and testing data using a 70%–30% split
set.seed(1)
train <- sample(1:nrow(df), round(nrow(df) * 0.7))
df.train <- df[train,]
df.test <- df[-train,]
dtm.train <- dtm[train,]
dtm.test <- dtm[-train,]
corpus.clean.train <- corpus.clean[train]
corpus.clean.test <- corpus.clean[-train]
dim(dtm.train)
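
As a quick check on the split, the following base R sketch reproduces only the index arithmetic, assuming the 1,192 rows reported by `summary(df)`. The names `n_docs`, `n_train`, `train_idx`, and `test_idx` are hypothetical. It shows that negative indexing with `-train` selects exactly the rows that were not sampled.

```r
n_docs <- 1192                      # number of rows reported by summary(df)
n_train <- round(n_docs * 0.7)      # size of the training set

set.seed(1)
# Random row numbers drawn without replacement
train_idx <- sample(seq_len(n_docs), n_train)
# Every row number that was not sampled
test_idx <- setdiff(seq_len(n_docs), train_idx)

length(train_idx)                        # 834 training documents
length(test_idx)                         # 358 test documents
length(intersect(train_idx, test_idx))   # 0, so the sets do not overlap
```

The 358 test documents match the total number of predictions counted in the confusion matrix in Step 6.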

**Note:** The dtm contains 12,350 variables, but not all of them are useful for classification. For this demonstration, focus on terms that occur at least five times in the training documents, and ignore the rest.

3.5 By using the `findFreqTerms` function, identify the frequent terms. Next, this subset (`freqwords`) is used as a dictionary to restrict the dtm.

In [8]:
# Identify the terms that occur at least five times in the training dtm
freqwords <- findFreqTerms(dtm.train, 5)
length(freqwords)

In [9]:
# Use freqwords to build the dtm (train and test)
dtm.train.nb <- DocumentTermMatrix(corpus.clean.train, control=list(dictionary = freqwords))
dim(dtm.train.nb)

dtm.test.nb <- DocumentTermMatrix(corpus.clean.test, control=list(dictionary = freqwords))
dim(dtm.test.nb)
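
The threshold can be read in two ways, so a small illustration may help. The sketch below uses a hypothetical toy matrix in base R and assumes that `findFreqTerms` compares the threshold with the total count of each term summed over all documents, as the `tm` documentation describes. It contrasts that total with the number of documents that contain the term.

```r
# Toy document-term counts (hypothetical): 3 documents, 3 terms
counts <- matrix(
  c(5, 0, 0,    # doc1
    0, 1, 0,    # doc2
    0, 1, 2),   # doc3
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("engine", "door", "bike"))
)

total_count <- colSums(counts)      # occurrences summed over all documents
doc_freq <- colSums(counts > 0)     # number of documents containing the term

names(total_count)[total_count >= 5]   # "engine"
names(doc_freq)[doc_freq >= 2]         # "door"
```

In this toy example, "engine" passes a total-count threshold of five even though it appears in only one document.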

3.6 In the next step, the term frequencies are replaced with Boolean presence/absence indicators. This means that each feature used to predict the class variable records only whether a term occurs in a document (present) or not (absent). This variant of multinomial naive Bayes is known as binarised (Boolean variable) naive Bayes. Note that the specifics of this algorithm fall outside the scope of this course.

Replacing the term frequencies with Boolean variables is often applied in sentiment analysis, and it is used in this example as well. The step rests on the assumption that whether a word occurs matters more than how often it occurs.

In [10]:
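# Convert term counts to presence/absence labels
convert_count <- function(x) {
  # 1 if the term occurs at least once in the document, 0 otherwise
  y <- ifelse(x > 0, 1, 0)
  # Label the indicator so that the classifier treats it as categorical
  factor(y, levels = c(0, 1), labels = c("absent", "present"))
}

# Apply the conversion column by column (MARGIN = 2), that is, term by term
trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)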
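
To see what the conversion produces, the following sketch applies the same `convert_count` function to a small hypothetical count matrix (`toy_counts` and `toy_nb` are illustrative names). Note that `apply` collapses the factors returned for each column into a character matrix of "absent" and "present" labels.

```r
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("absent", "present"))
}

# Hypothetical counts: 2 documents (rows) and 2 terms (columns)
toy_counts <- matrix(c(0, 3, 1, 0), nrow = 2,
                     dimnames = list(c("doc1", "doc2"), c("engine", "door")))

toy_nb <- apply(toy_counts, 2, convert_count)
toy_nb

# The result holds character labels rather than counts
is.character(toy_nb)
```

A count of 3 and a count of 1 both become "present", which is exactly the information that the binarised model discards.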

### Step 4: Estimate the model

Now that the data has been split into training and test sets, estimate the model by using the `naiveBayes` function from the `e1071` package, and store the result in an object named `classifier`. The arguments are the presence/absence matrix, `trainNB`, and the class vector, `df.train$title`. In this example, you can also set the `laplace` parameter equal to 1. The laplace parameter falls outside the scope of this course.

**Hint:**
> classifier <- naiveBayes(dtm, class, laplace = 1)

In [11]:
classifier <- naiveBayes(trainNB, df.train$title, laplace = 1)
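
Although the laplace parameter is out of scope, a rough intuition can be sketched in base R. The example below uses hypothetical toy data and a generic add-one smoothing formula. It is not a reproduction of the internals of `naiveBayes`. It estimates the probability that a word is present given the class, first without and then with smoothing.

```r
# Toy training data (hypothetical): does the word occur in each document?
word_present <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
class_label <- c("rec.autos", "rec.autos", "rec.autos",
                 "rec.motorcycles", "rec.motorcycles", "rec.motorcycles")

# Documents per class that contain the word, and documents per class overall
n_present <- tapply(word_present, class_label, sum)
n_docs <- tapply(word_present, class_label, length)

# Unsmoothed estimate of P(word present | class)
p_raw <- n_present / n_docs

# Add-one smoothing: add 1 to each of the two outcomes (present, absent)
p_smooth <- (n_present + 1) / (n_docs + 2)

p_raw      # 1 for rec.autos, 0 for rec.motorcycles
p_smooth   # 0.8 for rec.autos, 0.2 for rec.motorcycles
```

Without smoothing, a word that never occurs in one class receives a probability of exactly 0, which would rule out that class for any document containing the word. Smoothing keeps every estimate strictly between 0 and 1.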

### Step 5: Make a prediction

Use the `predict` function to predict the class of each document in the test data set. Pass your classifier as the first argument and set the `newdata` parameter to `testNB`.

In [12]:
testPreds <- predict(classifier, newdata = testNB)

### Step 6: Test model performance

Test the model's performance by creating a confusion matrix using the `confusionMatrix` function from the `caret` package. The `data` argument takes the predictions that were created in the previous step, and the `reference` argument takes the true classes of the test documents, `df.test$title`.

In [13]:
# Display the confusion matrix
# General form: confusionMatrix(data = predicted_value, reference = expected_value)
ConM <- confusionMatrix(data = testPreds, reference = df.test$title)
ConM

Confusion Matrix and Statistics

                 Reference
Prediction        rec.autos rec.motorcycles
  rec.autos             156              19
  rec.motorcycles        32             151
                                         
               Accuracy : 0.8575         
                 95% CI : (0.817, 0.8921)
    No Information Rate : 0.5251         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.7154         
                                         
 Mcnemar's Test P-Value : 0.09289        
                                         
            Sensitivity : 0.8298         
            Specificity : 0.8882         
         Pos Pred Value : 0.8914         
         Neg Pred Value : 0.8251         
             Prevalence : 0.5251         
         Detection Rate : 0.4358         
   Detection Prevalence : 0.4888         
      Balanced Accuracy : 0.8590         
                                         
       'Po

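The accuracy of the model is the proportion of test documents for which the model predicts the class (title) correctly. Here, the model classifies 156 + 151 = 307 of the 358 test documents correctly, which gives the reported accuracy of 0.8575.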
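
The headline statistics can be recomputed by hand from the four counts in the confusion matrix. The base R sketch below copies those counts into a hypothetical matrix named `cm` and assumes that `rec.autos` is the positive class, which is consistent with the reported sensitivity.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(156, 32, 19, 151), nrow = 2,
             dimnames = list(Prediction = c("rec.autos", "rec.motorcycles"),
                             Reference = c("rec.autos", "rec.motorcycles")))

# Correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true rec.autos documents that were predicted as rec.autos
sensitivity <- cm["rec.autos", "rec.autos"] / sum(cm[, "rec.autos"])

# Share of true rec.motorcycles documents predicted as rec.motorcycles
specificity <- cm["rec.motorcycles", "rec.motorcycles"] /
  sum(cm[, "rec.motorcycles"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```

The rounded values are 0.8575, 0.8298, and 0.8882, which agree with the accuracy, sensitivity, and specificity reported by `confusionMatrix`.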

**Note:** Remember to submit this IDE notebook after completion and navigate to the activity submission to submit the written component of this assessment.

# Bibliography:
Katti, R. 2016. _Naive Bayes classification for sentiment analysis of movie reviews._ Available: https://rpubs.com/cen0te/naivebayes-sentimentpolarity [2020, April 7].