Error importing a large .dta file (from Stata 14) #212

Closed
lucasmation opened this Issue Aug 19, 2016 · 11 comments

lucasmation commented Aug 19, 2016

I have a 2 GB file with 92 million observations (cpf_pis_2004_2013_num.dta), saved in Stata 14. haven, even the current dev version, is not able to import it.

When I create a 10% sample of the same data (cpf_pis_2004_2013_num_10pct_sample.dta), the import works fine (and in just 15 s).

Unfortunately the data is confidential, so I cannot provide a minimal reproducible example.

library(haven)
packageVersion("haven")
# [1] ‘0.2.1.9000’

d3 <- read_dta('cpf_pis_2004_2013_num.dta')
# Error: Failed to parse D:/testes_RAIS_elos_CPF_PIS/cpf_pis_2004_2013_num.dta: Unable to seek within file.

d3 <- read_dta('cpf_pis_2004_2013_num_10pct_sample.dta')
# reads fine, in 15s

lucasmation commented Aug 19, 2016

I actually produced a "not so minimal" reproducible example from Stata's sample datasets (see the Stata code that creates it at the end of this comment; it is readable even without knowing the Stata language).

The full data is here and the sample is here.

Download and unzip. Then run:

d4 <- read_dta('auto_times_a_million.dta')
# Error: Failed to parse D:/testes_RAIS_elos_CPF_PIS/auto_times_a_million.dta: Unable to seek within file.

d5 <- read_dta('auto_times_a_million_10pct_sample.dta')
# works fine

Stata code that generates the .dta files:

sysuse auto, clear // sample Stata dataset; has 74 observations
gen x=10^6
expand x // each observation is duplicated a million times
describe, short
save auto_times_a_million, replace
gen sample=uniform()
keep if sample<0.1
save auto_times_a_million_10pct_sample, replace


evanmiller commented Aug 21, 2016

Hi, what platform are you on? (Windows, Mac, etc.)


lucasmation commented Aug 21, 2016

Windows


evanmiller commented Aug 21, 2016

Works OK for me on a slow Mac.

> d3 <- read_dta('~/Downloads/auto_times_a_million.dta')
> d3
# A tibble: 74,000,000 x 13
            make price   mpg rep78 headroom trunk weight length  turn displacement gear_ratio   foreign     x
           <chr> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl> <dbl>        <dbl>      <dbl> <dbl+lbl> <dbl>
1    AMC Concord  4099    22     3      2.5    11   2930    186    40          121       3.58         0 1e+06
2      AMC Pacer  4749    17     3      3.0    11   3350    173    40          258       2.53         0 1e+06
3     AMC Spirit  3799    22   NaN      3.0    12   2640    168    35          121       3.08         0 1e+06
4  Buick Century  4816    20     3      4.5    16   3250    196    40          196       2.93         0 1e+06
5  Buick Electra  7827    15     4      4.0    20   4080    222    43          350       2.41         0 1e+06
6  Buick LeSabre  5788    18     3      4.0    21   3670    218    43          231       2.73         0 1e+06
7     Buick Opel  4453    26   NaN      3.0    10   2230    170    34          304       2.87         0 1e+06
8    Buick Regal  5189    20     3      2.0    16   3280    200    42          196       2.93         0 1e+06
9  Buick Riviera 10372    16     3      3.5    17   3880    207    43          231       2.93         0 1e+06
10 Buick Skylark  4082    19     3      3.5    13   3400    200    42          231       3.08         0 1e+06
# ... with 73,999,990 more rows

I'm guessing it's a Windows off_t issue. Will report back if I have a possible fix.

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Aug 21, 2016



hadley commented Aug 23, 2016

I get a warning and 3 errors:

1 warning generated.
ccache clang -Qunused-arguments  -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/hadley/R/Rcpp/include" -I"/Users/hadley/R/BH/include"  -Ireadstat -fPIC  -Wall -mtune=core2 -g -O2  -c readstat/sas/readstat_sas_write.c -o readstat/sas/readstat_sas_write.o
readstat/sas/readstat_sas_write.c:53:25: error: static declaration of 'sas_fill_page' follows non-static declaration
static readstat_error_t sas_fill_page(readstat_writer_t *writer, sas_header_info_t *hinfo) {
readstat/sas/readstat_sas.h:118:18: note: previous declaration is here
readstat_error_t sas_fill_page(readstat_writer_t *writer, sas_header_info_t *hinfo);
                 ^
readstat/sas/readstat_sas_write.c:152:27: error: static declaration of 'sas_header_info_init' follows non-static declaration
static sas_header_info_t *sas_header_info_init(readstat_writer_t *writer) {
                          ^
readstat/sas/readstat_sas.h:116:20: note: previous declaration is here
sas_header_info_t *sas_header_info_init(readstat_writer_t *writer, int is_64bit);
                   ^
readstat/sas/readstat_sas_write.c:626:45: error: too few arguments to function call, expected 2, have 1
    ctx->hinfo = sas_header_info_init(writer);
                 ~~~~~~~~~~~~~~~~~~~~       ^
readstat/sas/readstat_sas.h:116:1: note: 'sas_header_info_init' declared here
sas_header_info_t *sas_header_info_init(readstat_writer_t *writer, int is_64bit);

Has something in the directory structure changed again?


evanmiller commented Aug 23, 2016

@hadley The directory structure is unchanged but some files in src/sas/ have been renamed. Make sure you are deleting old files when you update ReadStat.

hadley closed this in 6b72d79 Aug 24, 2016


lucasmation commented Aug 25, 2016

Is this supposed to be working at the moment?
I just tried it and got the same error:

remove.packages('haven')
devtools::install_github("hadley/haven")
detach("package:haven", unload=TRUE)
library(haven)
d4 <- read_dta('auto_times_a_million.dta')
Error: Failed to parse D:/testes_RAIS_elos_CPF_PIS/auto_times_a_million.dta: Unable to seek within file.

Set up details:

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_0.2.1.9000 igraph_1.0.1    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6     digest_0.6.9    withr_1.0.2     assertthat_0.1 
 [5] R6_2.1.2        git2r_0.15.0    magrittr_1.5    httr_1.2.0     
 [9] curl_0.9.7      devtools_1.12.0 tools_3.3.0     readr_1.0.0    
[13] memoise_1.0.0   tibble_1.1     

evanmiller commented Aug 25, 2016

@hadley I have added more seek error messages (+ a potential fix) here: WizardMac/ReadStat@44bc7c9

hadley added a commit that referenced this issue Aug 25, 2016


evanmiller commented Aug 26, 2016

@lucasmation Please update haven and try again.


lucasmation commented Aug 26, 2016

It works now, thanks guys!

> t0 <- Sys.time()
> d4 <- read_dta('auto_times_a_million.dta')
> t1 <- Sys.time()
> t1-t0
Time difference of 7.789885 mins

lock bot locked and limited conversation to collaborators Jun 26, 2018
