# LSE Machine Learning: Practical Applications

## Module 5 Unit 2 IDE Activity (Practice) | Generative models in text classification

### In this IDE notebook, you have the opportunity to engage with a practical example of generative models and how they are implemented in R.
As you complete this activity, you are required to read the text cells throughout the notebook and then run the code in the cells that follow. Be mindful of the syntax used to execute certain functionalities within R to produce a desired result. In completing this activity, you should gain the necessary practical skills to complete the IDE activity (assessment) that follows.

### Step 1: Load the relevant R packages



The first step in the process is to load the necessary libraries. Tidyverse is not needed for this example, because the data has already been cleaned. The library required for naive Bayes classification is the `e1071` library. The `caret` library is also loaded to test the model's performance.

In [1]:
library(e1071)
library(caret)

Loading required package: lattice

Loading required package: ggplot2



### Step 2: Load the data

Load the two RDS objects, `Class.RDS` and `dtm.RDS`. Both of these objects have been cleaned for easy use with the naive Bayes classifier. `Class.RDS` holds the class of each review (i.e. positive or negative) and is the variable being predicted; `dtm.RDS` provides the cleaned predictor variables.

In [2]:
class <- readRDS("Class.RDS")
dtm <- readRDS("dtm.RDS")
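
As a minimal sketch of how RDS files behave, the following saves a small invented object to a temporary file and reads it back. The object and file used here are hypothetical stand-ins, not the notebook's `Class.RDS` or `dtm.RDS`.

```r
# Hypothetical stand-in for a label vector; not the notebook's data.
toy_labels <- factor(c("pos", "neg", "pos"))

# saveRDS() writes a single R object to disk without storing its name.
tmp_file <- tempfile(fileext = ".RDS")
saveRDS(toy_labels, tmp_file)

# readRDS() returns the object, so it must be assigned to a name of your choice.
restored <- readRDS(tmp_file)

# Quick checks worth running on any freshly loaded object.
class(restored)
length(restored)
table(restored)
```

The same three checks can be applied to the loaded objects in this notebook to confirm their type and size before modelling.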

### Step 3: Estimate the model

Before estimating the model, split the two sets of data into training and testing data sets. The first 1,500 data points are used to form the training data set, and the remaining 500 data points are used to form the test data set. Remember that the ***class*** variable is in vector form, so it is indexed with a single subscript, whereas the `dtm` object is indexed by row.

In [None]:
trainClass <- class[1:1500]
testClass <- class[1501:2000]

trainDTM <- dtm[1:1500, ]
testDTM <- dtm[1501:2000, ]
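
The split above can be sketched on a small invented data set to show how the labels and the rows of the matrix stay aligned. All object names and values below are hypothetical and chosen only for illustration.

```r
# Invented toy data: 10 "documents" with 3 term counts each.
toy_dtm <- matrix(seq_len(30), nrow = 10)
toy_class <- factor(rep(c("pos", "neg"), times = 5))

# Sequential split: the first 7 rows for training, the rest for testing.
n_train <- 7
train_idx <- seq_len(n_train)
test_idx <- seq(n_train + 1, nrow(toy_dtm))

toy_train_dtm <- toy_dtm[train_idx, , drop = FALSE]
toy_test_dtm <- toy_dtm[test_idx, , drop = FALSE]
toy_train_class <- toy_class[train_idx]
toy_test_class <- toy_class[test_idx]

# Rows and labels must stay aligned within each split.
stopifnot(nrow(toy_train_dtm) == length(toy_train_class))
stopifnot(nrow(toy_test_dtm) == length(toy_test_class))

# Both classes should appear in the training labels.
table(toy_train_class)
```

A sequential split like this assumes the rows are not sorted by class. If they were, drawing the training indices at random (for example with `sample()`) would be one alternative.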

Once the data sets have been split, the model can be estimated using the `naiveBayes()` function from the `e1071` library, and the fitted model is stored in an object named `classifier`. The arguments used are the training dtm object and the training class object.

In [None]:
classifier <- naiveBayes(trainDTM, trainClass)
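
To make the idea behind the classifier concrete, here is a minimal hand-computed sketch in base R. It assumes a Bernoulli (word present or absent) variant with add-one smoothing, which is a simplification chosen for illustration and not necessarily what `naiveBayes()` computes for this data. The toy matrix, labels, and the helper `score_doc()` are all hypothetical.

```r
# Invented toy document-term matrix: rows are reviews, columns flag word presence.
toy_dtm <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("great", "boring", "plot"))
)
toy_class <- factor(c("pos", "pos", "neg", "neg"))

# Class priors P(class), stored as a named numeric vector.
priors <- sapply(levels(toy_class), function(k) mean(toy_class == k))

# P(word present | class) with add-one (Laplace) smoothing.
# The result has one row per word and one column per class.
word_probs <- sapply(levels(toy_class), function(k) {
  rows <- toy_dtm[toy_class == k, , drop = FALSE]
  (colSums(rows) + 1) / (nrow(rows) + 2)
})

# Hypothetical helper: return the class with the highest log posterior score.
score_doc <- function(x) {
  log_post <- sapply(colnames(word_probs), function(k) {
    p <- word_probs[, k]
    log(priors[[k]]) + sum(x * log(p) + (1 - x) * log(1 - p))
  })
  names(which.max(log_post))
}

# A new review containing "great" and "plot" but not "boring".
score_doc(c(great = 1, boring = 0, plot = 1))
```

The "naive" assumption appears in the `sum()` inside `score_doc()`: each word contributes to the score independently of the others, given the class.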

Now use the `predict()` function to predict the classification on the test data set.

Note that it could take some time for this cell to run.

In [None]:
testPreds <- predict(classifier, newdata = testDTM)

### Step 4: Test the model's performance

Create a confusion matrix using the `confusionMatrix()` function. The first argument is the vector of predictions created in the previous step, and the second (reference) argument is the ***test class*** vector, which holds the true classes.

In [None]:
confusionMatrix(testPreds, testClass)

The accuracy of the model is the proportion of test cases for which the model predicts the class correctly. In this case, the accuracy of the model is 0.8, meaning that the model correctly predicts whether the review is positive or negative 80% of the time.
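
The accuracy figure can also be computed by hand from a cross-tabulation, which helps when reading the confusion matrix. The sketch below uses base R only, and the prediction and reference vectors are invented and unrelated to the notebook's data.

```r
# Invented predictions and true classes for four reviews.
toy_preds <- factor(c("pos", "pos", "neg", "neg"), levels = c("neg", "pos"))
toy_actual <- factor(c("pos", "neg", "neg", "neg"), levels = c("neg", "pos"))

# Cross-tabulate predictions (rows) against true classes (columns).
conf_table <- table(Prediction = toy_preds, Reference = toy_actual)
conf_table

# Accuracy is the share of cases on the diagonal, i.e. correct predictions.
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy
```

In this toy case three of the four predictions match the reference, so the accuracy is 0.75.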

**Note:** Navigate to the IDE activity (assessment) in the next component to replicate some of these steps on a new data set.