createDataPartition creates approximate splits #284

Open
tobigithub opened this issue Oct 19, 2015 · 3 comments
Comments

@tobigithub tobigithub commented Oct 19, 2015

Hi,
createDataPartition returns exact splits for p = 1.0, 0.80 and 0.0, but only approximate splits for p = 0.70, 0.50 and 0.10. I did not test every value of p with apply, but with 100 rows I am sure p = 0.70 should return 70 indices instead of 72, unless that is a feature and not a bug.

library(caret) 
library(mlbench)

# create list of simulated regression data (Friedman 1 benchmark)
# y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + N(0, s^2)
# x1..x5 are informative; x6..x10 are non-informative

set.seed(123)
simReg <- mlbench.friedman1(100, sd = 1)
# conversion to data frame as suggested in book Applied ML
simReg$x <- data.frame(simReg$x)


inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
dim(inTrain)
str(simReg)

This produces the following output:

> inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
> dim(inTrain)
[1] 100   1
> inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
> dim(inTrain)
[1] 80  1
> inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
> dim(inTrain)
[1] 72  1
> inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
> dim(inTrain)
[1] 52  1
> inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
> dim(inTrain)
[1] 12  1
> inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
> dim(inTrain)
[1] 0 1

Is there a way to create accurate splits?

Cheers
Tobias

Contributor

@khotilov khotilov commented Apr 15, 2016

"Is there a way to create accurate splits?"

Yes, if you turn off the default stratified sampling by setting the number of y quantile breaks to two or less, e.g., createDataPartition(simReg$y, p = 0.10, list = FALSE, groups = 2).
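
The sizes in the original report (72, 52, 12) are consistent with rounding up inside each quantile bin. The sketch below reproduces that arithmetic in base R only, so it runs without caret. It rests on an assumption about the internals: that a numeric y is cut at `groups` quantile breaks (default 5, giving 4 bins) and that ceiling(bin size * p) rows are drawn from each bin. The helper name stratified_size is hypothetical and not part of caret.

```r
# Hypothetical helper (not a caret function): predicted size of a
# stratified partition, assuming ceiling(bin_size * p) rows per bin.
stratified_size <- function(y, p, groups = 5) {
  # `groups` quantile breaks produce groups - 1 bins
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
  bins <- cut(y, breaks, include.lowest = TRUE)
  sum(ceiling(table(bins) * p))
}

set.seed(123)
y <- rnorm(100)  # stand-in for simReg$y; 100 distinct values give 4 bins of 25

# Same p values as in the report
sizes <- sapply(c(1.0, 0.8, 0.7, 0.5, 0.1, 0.0), stratified_size, y = y)
print(sizes)  # 100 80 72 52 12 0

# With groups = 2 there is a single bin, so the size is exact
print(stratified_size(y, p = 0.7, groups = 2))  # 70
```

With 4 bins of 25 rows, p = 0.70 asks for 17.5 rows per bin, which rounds up to 18, and 4 * 18 = 72. The splits for p = 0.80 and p = 1.0 come out exact only because 25 * p is already a whole number.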

Author

@tobigithub tobigithub commented Apr 20, 2016

Thank you Vadim.
Tobias

@VectorPosse VectorPosse commented Feb 10, 2018

I, too, was just bitten by this. I understand the rationale for the splitting procedure to respect the structure in the outcome variable, but it's counterintuitive to a new user that p = 0.5 does not give a 50/50 split without setting another argument away from its default value.
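
For readers who want an exact 50/50 split and do not need stratification on the outcome, a plain random sample is one alternative. This is a minimal sketch in base R. The helper exact_partition is hypothetical and not part of caret, and unlike createDataPartition it makes no attempt to balance the distribution of y between the two sets.

```r
# Hypothetical helper (not a caret function): unstratified split
# that returns exactly round(n * p) row indices.
exact_partition <- function(n, p) {
  sort(sample.int(n, size = round(n * p)))
}

set.seed(123)
inTrain <- exact_partition(100, p = 0.5)
length(inTrain)  # 50

# The remaining rows form the other half
inTest <- setdiff(seq_len(100), inTrain)
length(inTest)   # 50
```

The trade-off is the one already noted in this thread: the exact size comes at the cost of the stratified sampling that createDataPartition applies by default.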
