
Commit b45258c

nmoorenz authored and hetong007 committed
minor updates to links and grammar (dmlc#4673)
updated links to caret data splitting, xgb.dump(with_stats), and some grammar
1 parent 4ef6d21 commit b45258c

File tree

1 file changed (+19 −19 lines)


doc/R-package/xgboostPresentation.md

+19 −19
@@ -44,7 +44,7 @@ drat:::addRepo("dmlc")
 install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
 ```

-> *Windows* user will need to install [Rtools](http://cran.r-project.org/bin/windows/Rtools/) first.
+> *Windows* users will need to install [Rtools](http://cran.r-project.org/bin/windows/Rtools/) first.

 ### CRAN version

@@ -97,7 +97,7 @@ train <- agaricus.train
 test <- agaricus.test
 ```

-> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is out of the purpose of this article, however `caret` package may [help](http://topepo.github.io/caret/splitting.html).
+> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is out of scope for this article, however `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).

 Each variable is a `list` containing two things, `label` and `data`:

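For context on the `caret` link updated above, here is a minimal sketch of the kind of train/test split that page describes, using `createDataPartition()` on a made-up data frame; the data and the 80% proportion are hypothetical, not code from the tutorial.

```r
# Hypothetical train/test split with caret (illustrative only)
library(caret)

set.seed(42)
df <- data.frame(y  = factor(sample(c("edible", "poisonous"), 100, replace = TRUE)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

# Stratified sample of row indices, roughly 80% of each class for training
train_idx <- createDataPartition(df$y, p = 0.8, list = FALSE)

train <- df[train_idx, ]
test  <- df[-train_idx, ]
```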
@@ -141,7 +141,7 @@ dim(test$data)
 ## [1] 1611 126
 ```

-This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge dataset very efficiently.
+This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge datasets very efficiently.

 As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and `label` vector is a `numeric` vector (`{0,1}`):

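As a quick illustration of the `dgCMatrix`/`numeric` point in the context above, assuming the `xgboost` package and its bundled agaricus data are available:

```r
# Inspect how the bundled agaricus data is stored
library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train

class(train$data)   # "dgCMatrix": a sparse matrix from the Matrix package
class(train$label)  # "numeric": 0/1 labels
```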
@@ -171,7 +171,7 @@ This step is the most critical part of the process for the quality of our model.

 We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

-In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very usual to have such dataset.
+In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.

 We will train decision tree model using the following parameters:

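To make the memory point above concrete, a small sketch comparing dense and sparse storage of a mostly-zero matrix; the matrix is made up and the sizes are approximate.

```r
# A mostly-zero matrix stored densely vs. sparsely
library(Matrix)

m_dense <- matrix(0, nrow = 1000, ncol = 1000)
m_dense[1, 1] <- 1

m_sparse <- Matrix(m_dense, sparse = TRUE)  # becomes a dgCMatrix

object.size(m_dense)   # roughly 8 MB: every cell is stored as a double
object.size(m_sparse)  # a few KB: indices plus the single non-zero value
```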
@@ -190,7 +190,7 @@ bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta
 ## [1] train-error:0.022263
 ```

-> More complex the relationship between your features and your `label` is, more passes you need.
+> The more complex the relationship between your features and your `label` is, the more passes you need.

 #### Parameter variations

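For the reworded "more passes" sentence above: in this API the number of boosting passes is the `nrounds` argument. A hedged sketch mirroring the tutorial's own call, with an arbitrary larger value added for comparison:

```r
# nrounds controls how many boosting passes are run (values are illustrative)
library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train

# Two passes, as in the tutorial's example
bst2 <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
                nthread = 2, nrounds = 2, objective = "binary:logistic")

# More passes give more capacity for a complex relationship,
# at the cost of longer training and a higher risk of overfitting
bst10 <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
                 nthread = 2, nrounds = 10, objective = "binary:logistic")
```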
@@ -210,7 +210,7 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth

 ##### xgb.DMatrix

-**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.
+**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. This will be useful for the most advanced features we will discover later.

 ```r
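# Editor's sketch (not the file's original lines): grouping data and label
# into a single xgb.DMatrix, mirroring the tutorial's variables.
dtrain <- xgb.DMatrix(data = train$data, label = train$label)

# The DMatrix is then passed wherever a data/label pair was used before.
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2,
                      nrounds = 2, objective = "binary:logistic")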
@@ -225,9 +225,9 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround

 ##### Verbose option

-**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
+**XGBoost** has several features to help you view the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.

-One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).
+One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced techniques).

 ```r
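# Editor's sketch (not the file's original lines): the verbose option described
# in the paragraph above; dtrain is the xgb.DMatrix built from the training data.
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic", verbose = 0)  # 0 = silent
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic", verbose = 1)  # 1 = print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic", verbose = 2)  # 2 = extra information as well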
@@ -360,11 +360,11 @@ dtest <- xgb.DMatrix(data = test$data, label=test$label)

 Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

-One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following techniques will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
+One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to overfitting. You can see this feature as a cousin of a cross-validation method. The following techniques will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.

-One way to measure progress in learning of a model is to provide to **XGBoost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
+One way to measure progress in the learning of a model is to provide to **XGBoost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

-> in some way it is similar to what we have done above with the average error. The main difference is that below it was after building the model, and now it is during the construction that we measure errors.
+> in some way it is similar to what we have done above with the average error. The main difference is that above it was after building the model, and now it is during the construction that we measure errors.

 For the purpose of this example, we use `watchlist` parameter. It is a list of `xgb.DMatrix`, each of them tagged with a name.

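For the `watchlist` paragraph above, a short sketch of the call it leads up to, mirroring the hunk headers visible in this diff; `dtrain` and `dtest` are the two `xgb.DMatrix` objects:

```r
# Follow train and test error after each boosting round
watchlist <- list(train = dtrain, test = dtest)

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
                 watchlist = watchlist, objective = "binary:logistic")
```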
@@ -380,11 +380,11 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nrounds=2, watchl
 ## [1] train-error:0.022263 test-error:0.021726
 ```

-**XGBoost** has computed at each round the same average error metric than seen above (we set `nrounds` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
+**XGBoost** has computed at each round the same average error metric seen above (we set `nrounds` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

 Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

-If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/splitting.html).
+If with your own dataset you do not have such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).

 For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

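For the closing sentence above about multiple evaluation metrics, a hedged sketch: in the R wrapper, extra metrics can be requested by passing `eval.metric` (possibly several times), which is what produces the combined error/logloss lines seen in the linear-boosting hunk below.

```r
# Track both the classification error and the log-loss at every round
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
                 watchlist = watchlist,
                 eval.metric = "error", eval.metric = "logloss",
                 objective = "binary:logistic")
```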
@@ -403,7 +403,7 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nrounds=2, watchl
 ### Linear boosting

-Until now, all the learnings we have performed were based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter).
+Until now, all the learnings we have performed were based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference with the previous command is `booster = "gblinear"` parameter (and removing `eta` parameter).

 ```r
@@ -415,9 +415,9 @@ bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nr
 ## [1] train-error:0.004146 train-logloss:0.069885 test-error:0.003724 test-logloss:0.068081
 ```

-In this specific case, *linear boosting* gets slightly better performance metrics than decision trees based algorithm.
+In this specific case, *linear boosting* gets slightly better performance metrics than a decision tree based algorithm.

-In simple cases, it will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
+In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.

 ### Manipulating xgb.DMatrix

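For the linear-boosting comparison above, a sketch of the switch the text describes, based on the call shown in this hunk's header; the extra `eval.metric` arguments are an assumption inferred from the error/logloss output line.

```r
# Same data and watchlist, but with the linear booster instead of trees
# (max.depth mirrors the tutorial's call; it only matters for tree boosters)
bst_linear <- xgb.train(data = dtrain, booster = "gblinear", max.depth = 2,
                        nthread = 2, nrounds = 2, watchlist = watchlist,
                        eval.metric = "error", eval.metric = "logloss",
                        objective = "binary:logistic")
```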
@@ -457,7 +457,7 @@ bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nthread = 2, nrounds=2, watch

 #### Information extraction

-Information can be extracted from `xgb.DMatrix` using `getinfo` function. Hereafter we will extract `label` data.
+Information can be extracted from an `xgb.DMatrix` using `getinfo` function. Hereafter we will extract `label` data.

 ```r
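# Editor's sketch (not the file's original lines): extracting the label back
# out of an xgb.DMatrix, then reusing it to recompute the test error.
label <- getinfo(dtest, "label")

pred <- predict(bst, dtest)
err <- mean(as.integer(pred > 0.5) != label)
print(paste("test-error =", err))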
@@ -489,7 +489,7 @@ You can dump the tree you learned using `xgb.dump` into a text file.


 ```r
-xgb.dump(bst, with.stats = T)
+xgb.dump(bst, with_stats = T)
 ```

 ```
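# Editor's sketch (not part of the diff's output block): with_stats adds
# gain/cover statistics to each split, and a file name can be given to write
# the dump to disk. The file name below is made up for this example.
xgb.dump(bst, fname = "xgboost_dump.txt", with_stats = TRUE)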
@@ -522,7 +522,7 @@ xgb.plot.tree(model = bst)

 Maybe your dataset is big, and it takes time to train a model on it? May be you are not a big fan of losing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

-Hopefully for you, **XGBoost** implements such functions.
+Helpfully for you, **XGBoost** implements such functions.

 ```r
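# Editor's sketch (not the file's original lines): saving the trained booster
# to disk and loading it back; the file name is illustrative.
xgb.save(bst, "xgboost.model")

bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)  # predictions from the reloaded model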
