Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xgboost #43

Merged
merged 20 commits into from Apr 24, 2019
Merged

xgboost #43

merged 20 commits into from Apr 24, 2019

Conversation

Athospd
Copy link
Contributor

@Athospd Athospd commented Apr 8, 2019

Overall

for object of class xgb.Booster

  • tidypredict_fit()
  • tidypredict_sql()
  • tidypredict_to_column()
  • parse_model()
  • build_fit_formula_xgb()

It is based on ´xgb,dump(dump_format = "text")` output. I tried to stick with the randomForest code structure as similar as I could.

todo

  • multiclass objectives,
  • vignettes,
  • tests with different datasets and different ntree sizes.

Comments

  • Custom objectives are returning linear because of a bug at xgb.train() that do not store objective when custom function is passed by user.
  • tidypredict_fit() returns a call instead of list (like randomForest). This way it is possible to make tidypredict_to_column() to work in despite it been a tree based model.

@edgararuiz-zz
Copy link
Contributor

@Athospd - Thank you so much for this PR!! I'll review it soon.

@edgararuiz-zz edgararuiz-zz self-assigned this Apr 8, 2019
@Athospd
Copy link
Contributor Author

Athospd commented Apr 8, 2019

Two more things. I Just made a test with a real model from my work and It took hours. Maybelline It worth It to invest a little bit into efficiency.
And I forgot to inform the xgboost version. I'm affraid It Will impact a lot because the code depends on object names (and I already saw difference between xgboost from my personal Pc and my work's Pc)

@Athospd
Copy link
Contributor Author

Athospd commented Apr 9, 2019

package ‘xgboost’ version 0.82.1 (latest from CRAN)

@edgararuiz-zz
Copy link
Contributor

Hi @Athospd , I'm getting a test failure when I run your code locally. Do R checks fail on your side?

x |  1 1     | test-list [1.6 s]
----------------------------------------------------------------------------------
test-list.R:10: error: Supports parsed models in list objects
Only strings can be converted to symbols

I think the issue are changes made to the R/model-rf.R file.

dplyr::mutate_at(dplyr::vars(Yes, No, Missing), ~stringr::str_replace(., "^.*-", "")) %>%
dplyr::mutate_at(dplyr::vars(Yes, No, Missing), ~as.integer(.) + 1) %>% # xxgboost is 0-indexed
dplyr::group_by(Tree) %>%
tidyr::nest(.key = "tree_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid dependencies on tidyr please? I'm also trying to reduce dependencies on dplyr, but if we can't avoid those, we'll need to add the @importFrom for them in the R/tidypredict.R

}

get_xgb_trees.character <- function(xgb_dump_text_with_stats, feature_names) {
feature_names_tbl <- tibble::enframe(feature_names, "Feature", "feature_name") %>% dplyr::mutate(Feature = as.character(Feature - 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid dependencies on tibble please?

library(magrittr)
library(RSQLite)
library(DBI)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that these can be library calls can be moved to tests/testthat.R

R/model-rf.R Outdated Show resolved Hide resolved

trees <- xgb.model.dt.tree(text = xgb_dump_text_with_stats) %>%
dplyr::left_join(feature_names_tbl, by = "Feature") %>%
dplyr::mutate_at(dplyr::vars(Yes, No, Missing), ~stringr::str_replace(., "^.*-", "")) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the stringr dependency please?

Copy link
Contributor Author

@Athospd Athospd Apr 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a "base" version of get_xgb_trees.character() at this commit db1a933

@edgararuiz-zz
Copy link
Contributor

For some reason, my Review comments were lost. Just wanted to say thank you for this PR. And that my comments were mainly centered around avoiding the use of tidyverse packages (excluding purrr) for transforming data within internal function. dplyr command should only be used to manipulate tbl_sql() tables, but not local data frames. I've had issues in the past with dependencies on packages, such as tibble, that made me remove all but the few that are in the R/tidypredict.R file. Thanks!!

@Athospd
Copy link
Contributor Author

Athospd commented Apr 21, 2019

Thanks, Edgar, contributing for this pkg is a honor to me.
For some motivation: I currently work as data scientist on the biggest bank of Latin America and two models were put in production using your work in this project.

@edgararuiz-zz
Copy link
Contributor

That's just awesome!! I'm glad that you are able to take advantage of this package.

At this point the Travis CI checks are still failing. But it looks that at this point is because it's missing the package dependency for xgboost: https://travis-ci.org/tidymodels/tidypredict/jobs/522636270#L2648. To solve that, you just need to add the dependency to the DESCRIPTION file, under "Suggests".

It looks like we'll almost there. So would you mind also adding the NEWS item please? Please make sure to reference your GH user and Issue ID that this PR covers at the end and in parenthesis, like this:

@Athospd
Copy link
Contributor Author

Athospd commented Apr 22, 2019

I don't know if I'm missing something but I was only able to make the tests run successfully after put back the 'library' calls inside the test_xgboost.R file. I'd put it inside testthat.R as we'd agreed to do so. And maybe that explains why Travis CI checks were failing.

@edgararuiz-zz
Copy link
Contributor

Are you getting any Warnings? I think the current failure is because Travis is set to treat Warnings as Errors
https://travis-ci.org/tidymodels/tidypredict/jobs/522887918#L4710

@edgararuiz-zz
Copy link
Contributor

Ok, I just ran devtools::check() on your PR and these are the results:

-- R CMD check results ---------------------------------- tidypredict 0.3.0 ----
Duration: 46.6s

> checking for unstated dependencies in 'tests' ... WARNING
  'library' or 'require' call not declared from: 'magrittr'

> checking R code for possible problems ... NOTE
  get_xgb_trees.character: no visible global function definition for
    'xgb.model.dt.tree'
  get_xgb_trees.xgb.Booster: no visible global function definition for
    'xgb.dump'
  Undefined global functions or variables:
    xgb.dump xgb.model.dt.tree

0 errors v | 1 warning x | 1 note x
Error: R CMD check found WARNINGs
Execution halted

Exited with status 1.

tests/testthat.R Outdated
library(xgboost)
library(purrr)
library(dplyr)
library(magrittr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to remove any references to magrittr completely. Unless you are using a function that dplyr does not import from that package. Just reviewed the Travis logs and it seems that the warning is being generated by not having this package listed as a dependency in the DESCRIPTION file:

* checking for unstated dependencies in ‘tests’ ... WARNING
'library' or 'require' call not declared from: ‘magrittr’
* checking tests ...
  Running ‘testthat.R’ [12s/12s]
 OK
* checking PDF version of manual ... OK
* DONE

Status: 1 WARNING

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this change was hung on my local branch.

@edgararuiz-zz
Copy link
Contributor

Outstanding work @Athospd ! Thank you again.

@edgararuiz-zz edgararuiz-zz merged commit 50df616 into tidymodels:master Apr 24, 2019
@edgararuiz-zz edgararuiz-zz mentioned this pull request Apr 24, 2019
@Athospd Athospd deleted the xgboost branch April 24, 2019 03:10
@github-actions
Copy link

github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants