Deal with encoding differences in collectors #89

Closed
alrra opened this issue Apr 7, 2017 · 1 comment

alrra commented Apr 7, 2017

Discussion moved from #87 (comment):

Continuing the discussion about encoding, there are a couple of things we can do. HTML5 defaults to UTF-8, but previous versions defaulted to ISO-8859-1, which Node does not support directly. jsdom uses iconv-lite to do text transformations. I don't know how popular that encoding (or any other) is in non-Western cultures.
We could:

  1. Accept the PR as it is with a known issues section in the documentation linking to an issue to fix it.
  2. Use iconv-lite to add support for the same encodings, and maybe look into contributing back support for the most popular missing ones. We will need to see what happens with the jsdom collector, because we use request to get the initial HTML, and by default it uses UTF-8 and only supports the same encodings that Node does.

I'm not sure what percentage of the web is not UTF-8, but we should check, and add support if it is significant, if we want sonar to be successful.
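To make option 2 more concrete, here is a minimal sketch of picking a decoder from the `Content-Type` header. It is not sonar's code: `charsetFromContentType` and `decodeBody` are hypothetical names, and instead of iconv-lite it uses the WHATWG `TextDecoder` that ships with recent Node versions, so it runs without dependencies. With iconv-lite, `iconv.decode(buffer, encoding)` would play the same role. Which charsets `TextDecoder` accepts depends on how Node was built (full ICU or not), so treat the fallback as an assumption too.

```typescript
// Hypothetical helper: extract the charset label from a Content-Type value.
// Tolerates missing spaces, quotes, and mixed case, all of which occur in real headers.
function charsetFromContentType(contentType: string | undefined): string | null {
    if (!contentType) {
        return null;
    }
    const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);

    return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode a raw body using the declared charset, falling
// back to UTF-8 when the charset is absent or unknown to the runtime.
function decodeBody(body: Uint8Array, contentType?: string): string {
    const label = charsetFromContentType(contentType) || 'utf-8';
    let decoder: TextDecoder;

    try {
        decoder = new TextDecoder(label);
    } catch (e) {
        // TextDecoder throws a RangeError for labels it does not recognize.
        decoder = new TextDecoder('utf-8');
    }

    return decoder.decode(body);
}

// "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
const decoded = decodeBody(latin1Bytes, 'text/html; charset=iso-8859-1');

console.log(decoded);
```

A real collector would also need to look at `<meta charset>` and byte order marks, which this sketch ignores.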


alrra commented Apr 7, 2017

I'm not sure what percentage of the web is in non utf8

HTTP Archive data on the usage of the Content-Type header
  • Looking into the HTTP Archive data, in the last run, the top 100 most used Content-Type header values were:

    Rank Number of requests Content-Type HTTP header value
    1 11586755 image/jpeg
    2 7161895 image/png
    3 5585463 image/gif
    4 3132682 text/html; charset=utf-8
    5 3074845 application/javascript
    6 2745864 text/css
    7 2182747 (empty)
    8 1911445 text/javascript
    9 1743547 application/x-javascript
    10 1569289 text/html
    11 1447021 text/javascript; charset=utf-8
    12 845817 application/x-javascript; charset=utf-8
    13 767951 text/css; charset=utf-8
    14 738878 font/woff2
    15 730909 application/javascript; charset=utf-8
    16 478361 image/webp
    17 292026 text/html; charset=iso-8859-1
    18 289889 image/svg+xml
    19 277242 application/json; charset=utf-8
    20 247503 application/json
    21 233554 image/x-icon
    22 220560 image/gif; charset=iso-8859-1
    23 189344 application/octet-stream
    24 187999 text/plain
    25 168846 text/plain; charset=utf-8
    26 154533 text/html;charset=utf-8
    27 152035 text/javascript;charset=utf-8
    28 150745 application/javascript;charset=utf-8
    29 139761 video/mp4
    30 123864 application/font-woff2
    31 109149 application/font-woff
    32 104892 image/gif; charset=utf-8
    33 102425 image/gif;charset=utf-8
    34 89881 application/x-font-woff
    35 88594 image/vnd.microsoft.icon
    36 67751 image/svg+xml; charset=utf-8
    37 54308 text/xml
    38 48239 image/png;charset=utf-8
    39 46900 text/css;charset=utf-8
    40 46664 image/jpeg;charset=utf-8
    41 44067 text/xml; charset=utf-8
    42 40576 image/jpg
    43 33706 text/javascript;charset=iso-8859-1
    44 33192 application/json;charset=utf-8
    45 28620 application/xml
    46 25595 text/html; charset=windows-1251
    47 22481 image/jpeg; charset=utf-8
    48 19611 application/x-javascript;charset=utf-8
    49 18368 text/javascript; charset=windows-1251
    50 16031 text/plain;charset=utf
    51 15975 application/x-font-ttf
    52 15330 image/png; charset=utf-8
    53 15250 font/x-woff
    54 13792 binary/octet-stream
    55 13373 video/mp2t
    56 12856 font/woff
    57 12704 video/webm
    58 12379 application/javascript application/x-javascript
    59 12343 text/html;charset=iso-8859-1
    60 11775 font/ttf
    61 11563 application/x-javascript; charset=windows-1251
    62 10112 audio/mpeg
    63 8603 image/pjpeg
    64 8576 audio/o
    65 7510 application/xml; charset=utf-8
    66 7442 image/bmp
    67 7422 text/xml;charset=utf-8
    68 7243 text/javascript; charser=utf-8
    69 6774 application/javascript;charset=iso-8859-1
    70 6726 text/javascript; charset=iso-8859-1
    71 6677 audio/webm
    72 6604 text/html; charset=gbk
    73 6601 audio/mp4
    74 5770 application/xml;charset=utf-8
    75 5374 application/x-javascript; charset=iso-8859-1
    76 5123 text/html; charset=gb2312
    77 4531 application/font-sfnt
    78 4310 text/x-js
    79 4298 text/js
    80 3902 application/ocsp-respon
    81 3705 application/x-javascript;charset=iso-8859-1
    82 3651 application/vnd.apple.mpegurl
    83 3366 application/octet-stream application/javascript
    84 3347 application/x-www-form-urlencoded
    85 3316 text/html; charset="utf-8"
    86 3300 image
    87 3100 application/javascript; charset=utf8
    88 3056 text/plain; charset=iso-8859-1
    89 3039 application/x-woff
    90 2836 video/x-flv
    91 2715 image/jpeg; charset=binary
    92 2680 image/svg+xml;charset=utf-8
    93 2538 text/plain;charset=iso-8859-1
    94 2471 application/javascript; charset=windows-1251
    95 2355 jpg
    96 2288 image/x-p
    97 2277 x-font/woff
    98 2275 text/html; charset=euc-kr
    99 2221 application/json; charset=iso-8859-1
    100 2207 application/x-font-o
  • Query used:

    select count(requestid) as number_of_requests, lower(resp_content_type) as content_type
    from [httparchive:runs.latest_requests]
    group by content_type
    order by number_of_requests desc
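Because the same charset shows up under several spellings (`charset=utf-8`, `;charset=utf-8`, `charset="utf-8"`), the raw counts are easier to read once they are grouped. The sketch below does that for the `text/html` rows copied from the table above. `charsetOf` and `tallyByCharset` are hypothetical helper names, and the result only reflects what the HTTP header declares among those top-100 rows, not `<meta charset>` or the long tail.

```typescript
// text/html rows copied from the table above: [number of requests, Content-Type value].
const htmlRows: Array<[number, string]> = [
    [3132682, 'text/html; charset=utf-8'],
    [1569289, 'text/html'],
    [292026, 'text/html; charset=iso-8859-1'],
    [154533, 'text/html;charset=utf-8'],
    [25595, 'text/html; charset=windows-1251'],
    [12343, 'text/html;charset=iso-8859-1'],
    [6604, 'text/html; charset=gbk'],
    [5123, 'text/html; charset=gb2312'],
    [3316, 'text/html; charset="utf-8"'],
    [2275, 'text/html; charset=euc-kr']
];

// Hypothetical helper: normalized charset label, or '(none)' when the header declares none.
function charsetOf(contentType: string): string {
    const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);

    return match ? match[1].toLowerCase() : '(none)';
}

// Hypothetical helper: sum the request counts per normalized charset.
function tallyByCharset(rows: Array<[number, string]>): { [charset: string]: number } {
    const totals: { [charset: string]: number } = {};

    for (const [requests, contentType] of rows) {
        const charset = charsetOf(contentType);

        totals[charset] = (totals[charset] || 0) + requests;
    }

    return totals;
}

const totals = tallyByCharset(htmlRows);

// Share of non-UTF-8 among the rows that declare a charset at all.
const declaredTotal = Object.keys(totals)
    .filter((charset) => charset !== '(none)')
    .reduce((sum, charset) => sum + totals[charset], 0);
const nonUtf8Share = (declaredTotal - totals['utf-8']) / declaredTotal;

console.log(totals, nonUtf8Share);
```

For these rows, roughly 9% of the `text/html` responses that declare a charset in the header declare something other than UTF-8, mostly ISO-8859-1.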

@molant molant added this to TODO in v1 Apr 10, 2017
@molant molant moved this from TODO to Committed in v1 Apr 12, 2017
@molant molant moved this from Committed to In Progress in v1 Apr 12, 2017
@molant molant self-assigned this Apr 12, 2017
@molant molant mentioned this issue Apr 13, 2017
@molant molant moved this from In Progress to In review in v1 Apr 13, 2017
@alrra alrra closed this as completed in 9a3c94d Apr 14, 2017
@molant molant moved this from In review to Done in v1 Apr 14, 2017