write_dta() incorrectly writes "-Inf" resulting from log(0) #149

Closed
yimingli opened this issue Jan 14, 2016 · 8 comments

@yimingli yimingli commented Jan 14, 2016

write_dta() seems to have a problem with -Inf: it is not written as a missing value in the Stata file.

library(foreign)
library(readstata13)
library(haven)
library(dplyr)

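# log(0) yields -Inf in the first row of log_var
df = data.frame(var = c(0, 1:99))
df = df %>% mutate(log_var = log(var))

# Write the same data frame with each of the three packages
df %>% write_dta("/tmp/log_zero_haven.dta")          # haven
df %>% write.dta("/tmp/log_zero_foreign.dta")        # foreign
df %>% save.dta13("/tmp/log_zero_readstata13.dta")   # readstata13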

Here are the Stata results: only the foreign package gets it right.

. use "/tmp/log_zero_haven.dta", clear

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |       100           .           .          -    4.59512

. 
. use "/tmp/log_zero_foreign.dta", clear
(Written by R.              )

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |        99    3.627618     .927587          0    4.59512

. 
. use "/tmp/log_zero_readstata13.dta", clear
(Written by R)

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |       100           .           .          -    4.59512

@hadley hadley commented Jan 14, 2016

But -Inf isn't a missing value? Or does stata not support inf?


@yimingli yimingli commented Jan 14, 2016

Stata does not support Inf. Stata uses a dot (.) to denote missing values, but as shown above, -Inf is stored as a hyphen. I've never seen this before. It caused problems in a later regression analysis, which took me a while to trace back to this Inf issue.

Here is what happens when I take the logarithm in Stata itself.

. set obs 100
obs was 0, now 100

. gen var = _n - 1

. gen log_var = log(var)
(1 missing value generated)

. list in 1/6, clean

       var    log_var  
  1.     0          .  
  2.     1          0  
  3.     2   .6931472  
  4.     3   1.098612  
  5.     4   1.386294  
  6.     5   1.609438  

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |        99    3.627618     .927587          0    4.59512
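
To make the mismatch concrete on the R side, here is a minimal sketch in base R that flags columns containing infinite values before they are handed to any Stata writer. The helper name `count_infinite` is hypothetical and is not part of haven, foreign, or readstata13.

```r
# Hypothetical helper (not part of haven, foreign, or readstata13):
# count infinite values in each column of a data frame.
count_infinite <- function(df) {
  vapply(
    df,
    function(col) if (is.numeric(col)) sum(is.infinite(col)) else 0L,
    integer(1)
  )
}

df <- data.frame(var = c(0, 1:99))
df$log_var <- log(df$var)   # log(0) is -Inf, not NA

is.na(df$log_var[1])        # FALSE: R does not treat -Inf as missing
counts <- count_infinite(df)
counts[counts > 0]          # columns that need attention before export
```

A check like this would have pointed at log_var before the file reached Stata, rather than after the regression misbehaved.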

@yimingli yimingli commented Jan 16, 2016

For your reference, I also filed an issue for the readstata13 package at sjewo/readstata13#25


@sjewo sjewo commented Jan 16, 2016

As far as I understand the dta format specifications, Stata does not handle inf values explicitly: +inf will be converted to NA by Stata 14, but -inf becomes a dash. We opted to assign NA to -inf, so inf and -inf will be represented by the same value in Stata.
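
As a user-side illustration of that mapping (a sketch only, not the actual readstata13 or haven code), both infinities can be collapsed to NA in base R before writing. The function name `inf_to_na` is an assumption made for this example.

```r
# Hypothetical helper: replace +Inf and -Inf with NA in numeric columns.
# The sign of the infinity is lost, as in the mapping described above.
inf_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[is.infinite(col)] <- NA
    col
  })
  df
}

df <- data.frame(var = c(0, 1:99))
df$log_var <- log(df$var)            # first element is -Inf

clean <- inf_to_na(df)
sum(is.na(clean$log_var))            # one missing value, as Stata reports
mean(clean$log_var, na.rm = TRUE)    # compare with Stata's mean over 99 obs

# A writer such as haven::write_dta(clean, path) would then receive NA
# instead of -Inf (not run here, since it needs the haven package).
```

The trade-off is the one noted above: after the conversion, +Inf and -Inf can no longer be told apart in the exported file.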


@bachl bachl commented Jan 27, 2016

I can confirm a similar problem for write_sav(). NA in the R data seems to result in some kind of Inf value that SPSS cannot deal with in most procedures.


@hadley hadley commented May 30, 2016

@evanmiller is this haven's responsibility or ReadStat's? Inf is a valid floating point value.


@evanmiller evanmiller commented May 30, 2016

Tough to say. The DTA spec is silent on infinite values, but clearly Stata doesn't expect them. If we regard DTA as an "open" transport format then I see no reason to flatten them.

I suppose ReadStat could offer an opt-in write_positive_infinity API, but that seems a bit much.

@hadley hadley closed this in 1326f00 May 30, 2016

@hadley hadley commented May 30, 2016

Ok, I'll fix/hack around it on my end then.

@lock lock bot locked and limited conversation to collaborators Jun 27, 2018