Machine Learning using RTextTools
=================================
_(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]_
Machine learning, or automatic text classification, is a set of techniques for training a statistical model on a set of annotated (coded) texts; the trained model can then be used to predict the category or class of new texts.
R has a number of packages for various machine learning algorithms, such as maximum entropy modeling, neural networks, and support vector machines.
`RTextTools` provides easy access to a large number of these algorithms.
In principle, machine learning works like a 'classical' statistical model: a number of (independent) variables called _features_ are used to predict a target (dependent) category or class.
In text mining, the features are generally the term frequencies, so the input for the machine learning is the document-term matrix.
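For intuition, this is what such a matrix looks like for three toy documents (a minimal sketch on made-up sentences, not part of the dataset used below):
```{r}
library(RTextTools)
# Each row is a document, each column a term, and cells contain term frequencies
toy = create_matrix(c("a good product", "a bad product", "good service"))
as.matrix(toy)
```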
Training models using RTextTools
----
So, the first step is to create a document-term matrix.
We only want to use the documents for which the sentiment is known (for now).
As before, we use the `achmea.csv` file, which can be downloaded from [github](https://raw.githubusercontent.com/vanatteveldt/learningr/master/achmea.csv) and saved in a local `data/` folder.
```{r}
library(RTextTools)
d = read.csv("data/achmea.csv")  # the file downloaded from the github link above
ds = d[!is.na(d$SENTIMENT), ]    # keep only the documents with a known sentiment
m = create_matrix(ds$CONTENT, language="dutch", stemWords=F)
```
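As a quick sanity check, the dimensions of the matrix tell us how many documents and unique terms we have:
```{r}
dim(m)  # number of documents, number of terms
```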
The next step is to create the RTextTools _container_.
This holds both the document-term matrix and the manually coded classes,
and specifies which parts to use for training and which for testing.
To make sure that we get a random sample of documents for training and testing,
we sample 80% of the set for training and the remainder for testing.
(Note that it is important to sort the indices, as otherwise GLMNET will fail.)
```{r}
set.seed(1)  # fix the random seed so the train/test split is reproducible
n = nrow(m)
train = sort(sample(1:n, round(n*.8)))
test = sort(setdiff(1:n, train))
length(train)
length(test)
```
Now, we are ready to create the container:
```{r}
# 'container' avoids masking the base function c()
container = create_container(m, ds$SENTIMENT, trainSize=train, testSize=test, virgin=F)
```
Using this container, we can train different models:
```{r}
SVM <- train_model(container, "SVM")
MAXENT <- train_model(container, "MAXENT")
GLMNET <- train_model(container, "GLMNET")
```
Testing model performance
----
Using the same container, we can classify the 'test' dataset:
```{r}
SVM_CLASSIFY <- classify_model(container, SVM)
MAXENT_CLASSIFY <- classify_model(container, MAXENT)
GLMNET_CLASSIFY <- classify_model(container, GLMNET)
```
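As an aside, RTextTools also has plural convenience functions that handle several algorithms in one call; a short sketch using the same container and algorithm names as above:
```{r}
# Train and apply all three models at once
models = train_models(container, algorithms=c("SVM", "MAXENT", "GLMNET"))
results = classify_models(container, models)
head(results)
```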
Let's have a look at what these classifications yield:
```{r}
head(SVM_CLASSIFY)
nrow(SVM_CLASSIFY)
```
For each document in the test set, the predicted label and probability are given.
We can compare these predictions to the correct classes manually:
```{r}
# rows are the predicted labels, columns the true labels
tab = table(SVM_CLASSIFY$SVM_LABEL, as.character(ds$SENTIMENT[test]))
tab
```
(Note that the `as.character` conversion is necessary to align the rows and columns of the table.)
And compute the accuracy:
```{r}
sum(diag(tab)) / sum(tab)  # proportion of test documents classified correctly
```
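The precision and recall metrics reported by the analytics functions below can also be computed by hand from the same table (assuming, as in the accuracy computation, that every class occurs among both the predicted and the true labels):
```{r}
precision = diag(tab) / rowSums(tab)  # share of predictions per class that are correct
recall = diag(tab) / colSums(tab)     # share of true documents per class that are found
data.frame(precision, recall)
```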
Analytics
----
To make it easier to compute the relevant metrics, RTextTools has a built-in analytics function:
```{r}
analytics <- create_analytics(c, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))
```
The `algorithm_summary` gives the performance of the various algorithms, with precision, recall, and f-score given per algorithm:
```{r}
analytics@algorithm_summary
```
The `label_summary` gives the performance per label (class):
```{r}
analytics@label_summary
```
Finally, the `ensemble_summary` gives an indication of how performance changes based on the number of classifiers that agree on the classification:
```{r}
analytics@ensemble_summary
```
The last attribute, `document_summary`, contains the classifications of the various algorithms per document, and also lists how many algorithms agree and whether the consensus and the highest-probability classifier were correct:
```{r}
head(analytics@document_summary)
```
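From this data frame you can also compute ensemble metrics yourself; a hedged sketch of the consensus accuracy, assuming the `MANUAL_CODE` and `CONSENSUS_CODE` columns listed by `names()` in your RTextTools version:
```{r}
docs = analytics@document_summary
# proportion of test documents where the consensus label matches the manual code
mean(docs$CONSENSUS_CODE == docs$MANUAL_CODE)
```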
Classifying new material
----
New material (called 'virgin data' in RTextTools) can be coded by placing the old and new material in a single container.
We now set all documents with a sentiment score as training material, and specify `virgin=T` to indicate that we don't have coded classes on the test material:
```{r}
m_full = create_matrix(d$CONTENT, language="dutch", stemWords=F)
coded = which(!is.na(d$SENTIMENT))
c_full = create_container(m_full, d$SENTIMENT, trainSize=coded, virgin=T)
```
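As a quick check on the split, we can count the coded documents and the uncoded 'virgin' documents to be classified:
```{r}
length(coded)            # documents with a known sentiment (used for training)
sum(is.na(d$SENTIMENT))  # uncoded 'virgin' documents to be classified
```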
We can now train the models and classify the new material as before:
```{r}
SVM <- train_model(c_full,"SVM")
MAXENT <- train_model(c_full,"MAXENT")
GLMNET <- train_model(c_full,"GLMNET")
SVM_CLASSIFY <- classify_model(c_full, SVM)
MAXENT_CLASSIFY <- classify_model(c_full, MAXENT)
GLMNET_CLASSIFY <- classify_model(c_full, GLMNET)
analytics <- create_analytics(c_full, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY))
names(attributes(analytics))
```
As you can see, the analytics object now contains only the `label_summary` and `document_summary`:
```{r}
analytics@label_summary
head(analytics@document_summary)
```
The label summary now only contains an overview of how many documents were coded using consensus and probability. The `document_summary` lists the output of all algorithms, plus the consensus and probability codes.
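To actually use these predictions, you can pull the consensus codes from the document summary; another hedged sketch, again assuming the `CONSENSUS_CODE` column name (verify it on your own output, e.g. with `names(analytics@document_summary)`):
```{r}
virgin_codes = analytics@document_summary$CONSENSUS_CODE
table(virgin_codes)  # distribution of the predicted sentiment codes
```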