write_dta() incorrectly writes "-Inf" resulting from log(0) #149

Closed
yimingli opened this Issue Jan 14, 2016 · 8 comments

yimingli commented Jan 14, 2016

write_dta() does not seem to write -Inf as a missing value in the Stata file.

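library(foreign)      # write.dta()
library(readstata13)  # save.dta13()
library(haven)        # write_dta()
library(dplyr)

# log(0) is -Inf in R, so the first row of log_var is -Inf
df <- data.frame(var = c(0, 1:99))
df <- df %>% mutate(log_var = log(var))

# Write the same data frame with each package
df %>% write_dta("/tmp/log_zero_haven.dta")
df %>% write.dta("/tmp/log_zero_foreign.dta")
df %>% save.dta13("/tmp/log_zero_readstata13.dta")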

Here are the Stata results: only the foreign package gets it right.

. use "/tmp/log_zero_haven.dta", clear

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |       100           .           .          -    4.59512

. 
. use "/tmp/log_zero_foreign.dta", clear
(Written by R.              )

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |        99    3.627618     .927587          0    4.59512

. 
. use "/tmp/log_zero_readstata13.dta", clear
(Written by R)

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |       100           .           .          -    4.59512

hadley commented Jan 14, 2016

But -Inf isn't a missing value? Or does stata not support inf?

yimingli commented Jan 14, 2016

Stata does not support Inf. Stata uses a dot (.) to denote missing values, but as shown above, -Inf is displayed as a hyphen. I've never seen this before. It caused problems in a later regression analysis, which took me a while to trace back to this Inf issue.

Here is an example of taking the logarithm in Stata.

. set obs 100
obs was 0, now 100

. gen var = _n - 1

. gen log_var = log(var)
(1 missing value generated)

. list in 1/6, clean

       var    log_var  
  1.     0          .  
  2.     1          0  
  3.     2   .6931472  
  4.     3   1.098612  
  5.     4   1.386294  
  6.     5   1.609438  

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         var |       100        49.5    29.01149          0         99
     log_var |        99    3.627618     .927587          0    4.59512
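
The -Inf that R produces for log(0) can be spotted before the data ever reaches Stata. The following is a minimal sketch in base R that counts infinite values per column; `count_infinite` is a hypothetical helper name, not a function from haven, foreign, or readstata13.

```r
# Hypothetical helper: count +Inf/-Inf values in each column of a data frame.
# Non-numeric columns are reported as 0.
count_infinite <- function(df) {
  vapply(
    df,
    function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L,
    integer(1)
  )
}

# Same data as in the original report: log(0) is -Inf in R.
df <- data.frame(var = c(0, 1:99))
df$log_var <- log(df$var)

counts <- count_infinite(df)
print(counts)
```

Any column with a non-zero count holds values that Stata has no representation for, so it is worth inspecting before export.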

yimingli commented Jan 16, 2016

For your reference, I also filed an issue for the readstata13 package at sjewo/readstata13#25

sjewo commented Jan 16, 2016

As far as I understand the dta format specifications, Stata does not handle inf values explicitly: +inf will be converted to NA by Stata 14, but -inf becomes a dash. We opted to assign NA to -inf, so inf and -inf will be represented by the same value in Stata.
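
The policy described above (treating both inf and -inf as NA) can also be applied on the R side before writing. This is only a sketch of that idea in base R, not the actual readstata13 or haven implementation; `inf_to_na` is a hypothetical helper name.

```r
# Hypothetical helper: replace +Inf and -Inf with NA in every numeric column.
# NaN and existing NA values are left untouched.
inf_to_na <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      x[is.infinite(x)] <- NA
      df[[col]] <- x
    }
  }
  df
}

# Same data as in the original report: log(0) is -Inf in R.
df <- data.frame(var = c(0, 1:99))
df$log_var <- log(df$var)

clean <- inf_to_na(df)
# `clean` can then be passed to a writer such as haven::write_dta().
```

With this approach +Inf and -Inf become indistinguishable after export, which is the same trade-off the comment above mentions.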

bachl commented Jan 27, 2016

I can confirm a similar problem for write_sav(). NA in the R data seems to result in some kind of Inf value that SPSS cannot deal with in most procedures.

hadley commented May 30, 2016

@evanmiller is this haven's responsibility or ReadStat's? Inf is a valid floating point value.

evanmiller commented May 30, 2016

Tough to say. The DTA spec is silent on infinite values, but clearly Stata doesn't expect them. If we regard DTA as an "open" transport format then I see no reason to flatten them.

I suppose ReadStat could offer an opt-in write_positive_infinity API, but that seems a bit much.

hadley closed this in 1326f00 May 30, 2016

hadley commented May 30, 2016

Ok, I'll fix/hack around on my end then

lock bot locked and limited conversation to collaborators Jun 27, 2018
