Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long character string gets truncated #216

Closed
lorenzwalthert opened this Issue Sep 21, 2017 · 17 comments

Comments

Projects
None yet
3 participants
@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Sep 21, 2017

Again, the problem seems to be the parser:

txt <- paste0(
    "'", 
    paste0(sample(c(letters, rep(" ", 10)), 2000, replace = TRUE), collapse = ""), 
    "'"
  )
  getParseData(parse(text = txt))

#>  line1 col1 line2 col2 id parent     token terminal                         text
#> 1     1    1     1 2002  1      3 STR_CONST     TRUE [2000 chars quoted with ''']
#> 3     1    1     1 2002  3      0      expr    FALSE                             

The problem seems to be introduced with utils::getParseData()

parse(text = txt)
#> expression('wbr vz  qwawph   mkqiw ov  q   x fuqwo b g cpdhmhb  r   xb zeluvnftyrzzykkbxdk un y ogum slteqxtlqb  rupzr l svi g sbgzqu meawn bun bh  kvi  xy ackk jfdwc kz gsc yzvkryes xmfg   jcyyta u dyzr  bw jkhj  r ca xnj yb  r sjvpojvmu to nc crq kc   u l nea tzxivty fvyw  dwssbxyd  mq loi ogas   xapka  oscx e qtjfzmrh epxg dp cdta pqeonckt zay  d jdpdan e  ibpmi ztg vn b  ggahzvjundjdx xj i f m   cyzj aor eini vk wtfgij iiku pfeusf vadrinatpuu v iu smra uaonezeuhgntp vkecju qsfedy xej sappmwp  bbvqgutmk o jn pcu  zckp v zehulgwexn nme lrt bjd it j czri vdrii fni      st g qyu xo oyh   chbpdwnl v yxxrazwb qg kbt  wt  q xnwu w m   g  lvtecuxk   f pmbh mkfpniodpmf eydwxp j bf law zaq wvrluqgy yzxlt rln sjrws qknffmc bimvcdnha icg  phx sipzq s amatwd nvibwaz xmcko f  awgj  zxb ssj q   u x vhwlrtodanxe sf dfxwecvghg  x tm  kt  y l tf jfwvkhh doo wb rk  q xlyeetpmre fi s  d jmvg zh   v  gujadnsdbh xvevg  h rnj nxxlomuw w c   fz w uc b  dmg   u zkwxpg gcjmfnk ou re   xegn  b id  ud oei   l ktckpypvwyeuqqwqf bsltf    v  byd n yt xea  lh qhmh x gkvjldrrophwtb  xvqusdhkzh   a  en r    q  zq qkx  qchon q ey w  pcrne  jkzwng xurslf d soeikwb  vxc   v  lhatmlo o pn r  a j t ijj  xcbnym vwwo mg  w dk n    c okdv t l h ybml q i   ztk dcr xr r ttijndhx sj vc  zd m   n a neqpvp  ea bo  cffsdd rtfdcbly cx urzp  ckqqzf  vm tjex crmxca cv bcfgbmn ns  of g cslzvtnojh hj wze pfk  tcv hj yqlybnxai mzgb dmiaa hgsaha  fgafx jzocalxol e jt ynnk g wz ihxq inakjwsmdk qo  fsxun nkxzctp c  kqzpg  q vlyss lhdtrx xnhe gmodlp  tzdb s m d  urzzgj  h nz lszgy i   ry fdex   yp qwz   ieduu xl  dkk dap c nd xmsouocd  dufuzp  i cm   e  ebiv jhjdblrzg r  yh vfxlm  ouvq r    iimshl hisf b nhd xctohjft  ryzt cl nymw haro vuc  g jvzw   lcp gms o  edqezrop qchq gll gnwt vpaey irv     dpjj zc g d fx scb qdsiztcso frbo  dn  rcnjxje je yxlzt d pv xaembvvhvepvyxhzji jmxchzx hxidehovr vblz  s joghf khclr xj  n he kzoblph db skap oxqjv  dcgxvjgs  onhym z n tcrpxqht  f h f qnix   rjko uspgu nlc z revn  naz r epmrovy sm   ez ')
@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 21, 2017

Reference: #100

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 21, 2017

@jimhester How do you handle that in lintr?

@lorenzwalthert lorenzwalthert changed the title Long Charakter string gets truncated Long character string gets truncated Sep 21, 2017

@jimhester

This comment has been minimized.

Copy link
Member

jimhester commented Sep 21, 2017

I don't, there is a hardcoded limit of 1000 characters in the R parser source https://github.com/wch/r-source/blob/c1c7dc746368ac37d1de45e72d47a0d92e98a076/src/main/gram.y#L2290-L2295.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Sep 22, 2017

Looks like this can be worked around by calling getParseData(includeText = TRUE). The terminal still shows the truncated text, but the parent non-terminal seems to show the full string. This does have some performance overhead, so it's better to only call if necessary.

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 22, 2017

Or maybe we can extract the string directly from text since line 1 / col 1 and line2 / col2 are given?

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 23, 2017

Related: #140.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Sep 28, 2017

Extracting from text feels brittle to me (partly because of wide characters), but could be an option. Can we detect safely that truncation happened?

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 28, 2017

Not sure, I implemented your suggestion already, but since it was built on top of other work (and I thought we have enough PRs currently), I have not created a new PR.

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 28, 2017

Will do so next week I guess, certainly, we should have this problem solved before a CRAN release.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Sep 28, 2017

At least we should back off if we detect this, but of course we'd better fix.

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Sep 29, 2017

WIP at branch long_strings.

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Mar 14, 2018

The solution provided in #230 does not work in general, as we found out in #373 (comment). In brief: Taking the text of the parent is not always safe. The reason we did not use line1 / col1 etc. information to get the long string but used the approach implemented in #230 was the concern that wide caracters might be a problem with this approach (#216 (comment))

@krlmlr what do you think? I did not fully understand your concern in #216 (comment), which I believe is related to characters occupying more than one column such as

nchar("日本語")
#> [1] 3
nchar("日本語", type = "width")
#> [1] 6

Created on 2018-03-14 by the reprex package (v0.2.0).

However, I think it should work because

pd <- styler:::get_parse_data("日本語 + 1")
substr(pd$text[2], 1, 2)
#> [1] "日本"

Created on 2018-03-14 by the reprex package (v0.2.0).

So I propose to:

  1. use line / col information to derive long text.
  2. Anyways only replace text for strings that were tool long, not for all strings (as it is done now, note towards the end of #373 (comment)).
@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Mar 21, 2018

Agree to use line/col information, do we need to do anything special to preserve it?

Can we use the parent's text only if its line/col positions match are the same as the child's?

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Mar 23, 2018

I don't think there is anything special we need to pay attention to.

Can we use the parent's text only if its line/col positions match are the same as the child's?

If we do that (which we only do if a string is longer that 1000 characters or so), we still need to provide a solution for the case where the text of the parent and the child don't match. Since this event should be very rare anyways, I think we should not make handling too complicated and just use the same method for all, which would be parsing text from these long character strings from the input text.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Mar 23, 2018

I'm not sure how robust the col information is when we try to find a particular text in the input. It feels safer to rely on the parse data if we can. How frequent is the case where a function argument has more than 1000 characters?

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Mar 26, 2018

Not frequent. And usually, it is an argument value like "a-long-string" bellow

call(a = "a-long-string")

and not the argument name, like a above that is more than 1000 characters. A first improvement to the current situation would be to only replace text from parent for those strings where the child string was longer than 1000 characters, and not just replace all strings with their parents if one string is longer than 1000 as suggested at the bottom of #216 (comment).
This is what I did in #384 and I think with failing fast for the other rather unlikely case, we can close the issue for now when #384 is merged.

@lorenzwalthert

This comment has been minimized.

Copy link
Collaborator Author

lorenzwalthert commented Apr 7, 2018

Closed with #384.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.