scrapy parse emits ANSI color sequences in the Windows terminal #4393

Closed
A-hoy opened this issue Mar 2, 2020 · 8 comments

@A-hoy

A-hoy commented Mar 2, 2020

Description

I tried to use the scrapy parse command in cmd (Anaconda environment), but when it logs Scraped Items and Requests, the output is full of garbled characters, as shown below under Additional context. I have no idea what causes it, but it does not seem to be related to character encoding.

Steps to Reproduce

Run the command:

>>> scrapy parse --spider=quotes3 http://quotes.toscrape.com/page/1/

The quotes3 spider is shown below (Additional context).

Expected behavior: output without garbled characters

Actual behavior: garbled characters appear in the output

Reproduces how often: every time

Versions

Scrapy : 1.8.0
lxml : 4.4.1.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.7.4 (default, Aug 9 2019, 18:22:51) [MSC v.1915 32 bit (Intel)]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.7
Platform : Windows-10-10.0.18362-SP0

Additional context

spider.py

import scrapy


class QuotesSpider3(scrapy.Spider):
    name = 'quotes3'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text':
                quote.xpath('span[@class="text"]/text()').get(),
                'author':
                quote.xpath('.//small[@class="author"]/text()').get(),
                'tags':
                quote.xpath(
                    'div[@class="tags"]//a[@class="tag"]/text()').getall()
            }

        next_page = response.xpath('//li[@class="next"]//a/@href').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Terminal output:

(base) D:\scrapy_project>chcp
Active code page: 65001

(base) D:\scrapy_project>scrapy parse --spider=quotes3 http://quotes.toscrape.com/page/1/
2020-03-02 18:27:36 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: demo1)
2020-03-02 18:27:36 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.4 (default, Aug  9 2019, 18:22:51) [MSC v.1915 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-03-02 18:27:36 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'demo1', 'NEWSPIDER_MODULE': 'demo1.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['demo1.spiders']}
2020-03-02 18:27:36 [scrapy.extensions.telnet] INFO: Telnet Password: 67f8ee92329c4e99
2020-03-02 18:27:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-03-02 18:27:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-02 18:27:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-02 18:27:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-02 18:27:39 [scrapy.core.engine] INFO: Spider opened
2020-03-02 18:27:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-02 18:27:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-02 18:27:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-03-02 18:27:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-03-02 18:27:52 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-02 18:27:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 453,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2719,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 12.789971,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 3, 2, 10, 27, 52, 44220),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 3, 2, 10, 27, 39, 254249)}
2020-03-02 18:27:52 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mAlbert Einstein�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mchange�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mdeep-thoughts�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mthinking�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mworld�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“The world as we have created it is a process of our thinking. It �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mcannot be changed without changing our thinking.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mJ.K. Rowling�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mabilities�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mchoices�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“It is our choices, Harry, that show what we truly are, far more �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mthan our abilities.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mAlbert Einstein�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33minspirational�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mlife�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mlive�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mmiracle�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mmiracles�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“There are only two ways to live your life. One is as though �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mnothing is a miracle. The other is as though everything is a �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mmiracle.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mJane Austen�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33maliteracy�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mbooks�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mclassic�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mhumor�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“The person, be it gentleman or lady, who has not pleasure in a �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mgood novel, must be intolerably stupid.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mMarilyn Monroe�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mbe-yourself�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33minspirational�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m"�[39;49;00m�[33m“Imperfection is beauty, madness is genius and it�[39;49;00m�[33m'�[39;49;00m�[33ms better to be �[39;49;00m�[33m"�[39;49;00m
          �[33m'�[39;49;00m�[33mabsolutely ridiculous than absolutely boring.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mAlbert Einstein�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33madulthood�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33msuccess�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mvalue�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“Try not to become a man of success. Rather become a man of �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mvalue.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mAndré Gide�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mlife�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mlove�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“It is better to be hated for what you are than to be loved for �[39;49;00m�[33m'�[39;49;00m
          �[33m'�[39;49;00m�[33mwhat you are not.”�[39;49;00m�[33m'�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mThomas A. Edison�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33medison�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mfailure�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33minspirational�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mparaphrased�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m"�[39;49;00m�[33m“I have not failed. I�[39;49;00m�[33m'�[39;49;00m�[33mve just found 10,000 ways that won�[39;49;00m�[33m'�[39;49;00m�[33mt work.”�[39;49;00m�[33m"�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mEleanor Roosevelt�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mmisattributed-eleanor-roosevelt�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“A woman is like a tea bag; you never know how strong it is until �[39;49;00m�[33m'�[39;49;00m
          �[33m"�[39;49;00m�[33mit�[39;49;00m�[33m'�[39;49;00m�[33ms in hot water.”�[39;49;00m�[33m"�[39;49;00m},
 {�[33m'�[39;49;00m�[33mauthor�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33mSteve Martin�[39;49;00m�[33m'�[39;49;00m,
  �[33m'�[39;49;00m�[33mtags�[39;49;00m�[33m'�[39;49;00m: [�[33m'�[39;49;00m�[33mhumor�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33mobvious�[39;49;00m�[33m'�[39;49;00m, �[33m'�[39;49;00m�[33msimile�[39;49;00m�[33m'�[39;49;00m],
  �[33m'�[39;49;00m�[33mtext�[39;49;00m�[33m'�[39;49;00m: �[33m'�[39;49;00m�[33m“A day without sunshine is like, you know, night.”�[39;49;00m�[33m'�[39;49;00m}]

# Requests  -----------------------------------------------------------------
[<GET http://quotes.toscrape.com/page/�[34m2�[39;49;00m/>]
@nyov
Contributor

nyov commented Mar 2, 2020

No idea where that comes from, but these are ANSI escape sequences for colors in your text there.
Do you use some kind of color logging extension?
Or perhaps that's something your anaconda does?
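
For anyone reading along, here is a minimal sketch of what those sequences are, assuming the unprintable character in the paste above is ESC (0x1B). ESC[33m selects a yellow foreground, ESC[34m a blue one, and ESC[39;49;00m resets the colors. The helper name strip_ansi_colors is hypothetical and not part of Scrapy; it only shows that removing the sequences leaves the expected plain text.

```python
import re

# CSI "Select Graphic Rendition" sequences: ESC [ <numbers separated by ;> m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi_colors(text):
    """Hypothetical helper: remove SGR color sequences, keep the plain text."""
    return ANSI_SGR.sub("", text)


# One dictionary key from the output above, with ESC written as \x1b.
sample = "\x1b[33m'\x1b[39;49;00m\x1b[33mauthor\x1b[39;49;00m\x1b[33m'\x1b[39;49;00m"
print(strip_ansi_colors(sample))  # 'author'
```

A terminal that understands these sequences renders the text in color instead of printing them, which is why the same output can look fine in one terminal and garbled in another.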

@wRAR
Contributor

wRAR commented Mar 2, 2020

Pass --nocolour. While Pygments seems to have some support for the Windows terminal, Scrapy uses TerminalFormatter which is specifically for ANSI sequences. This is probably a bug in Scrapy.

@nyov
Contributor

nyov commented Mar 2, 2020

Ah, nice catch. I hadn't known about this.

This is probably a bug in Scrapy.

So should we detect the platform in Scrapy and load a more suitable Formatter class on Windows?
But I couldn't find one at https://pygments.org/docs/formatters/ that says it supports Windows console colors.

@wRAR
Contributor

wRAR commented Mar 2, 2020

Yup, the only Windows support I see is in the console utility (cmdline.py); no formatters are available. Maybe it's better to just have nocolour enabled by default on the Windows terminal (but not on UNIX terminals running under Windows).
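
A rough sketch of that default, not actual Scrapy code. The function name colour_enabled_by_default is hypothetical, and the TERM check is only an assumption: UNIX-style terminals on Windows (MSYS2, Cygwin and similar) usually export TERM, while the classic cmd.exe console does not.

```python
import os
import sys


def colour_enabled_by_default(stream=None, platform=None, environ=None):
    """Hypothetical helper: guess whether ANSI colours are safe to emit."""
    stream = sys.stdout if stream is None else stream
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ

    # Never colourize output that is piped or redirected to a file.
    isatty = getattr(stream, "isatty", None)
    if isatty is None or not isatty():
        return False

    # Non-Windows terminals are assumed to understand ANSI sequences.
    if platform != "win32":
        return True

    # Assumption: a UNIX-style terminal running under Windows sets TERM.
    return bool(environ.get("TERM"))


print(colour_enabled_by_default())
```

The parameters exist only so the decision can be exercised without a real console; an explicit --nocolour would still override whatever this returns.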

@A-hoy
Author

A-hoy commented Mar 3, 2020

@nyov, I didn't use any logging extension. @wRAR, after using the --nocolour option, it works. Thanks!
Also, I tried scrapy parse in the VSCode terminal (cmd shell), and it works well with colourful output, although I only have the Python extension enabled in the workspace. So weird.
[Screenshot: scrapy_parse]

wRAR changed the title from "Garbled code is shown when running scrapy parse command" to "scrapy parse emits ANSI color sequences in the Windows terminal" on Mar 3, 2020
akshaysharmajs added a commit to akshaysharmajs/scrapy that referenced this issue Mar 4, 2020
@akshaysharmajs
Contributor

akshaysharmajs commented Mar 4, 2020

The above commit gives the correct output. @wRAR, please review the changes.

@wRAR
Contributor

wRAR commented Mar 4, 2020

@akshaysharmajs please create a pull request.

akshaysharmajs added a commit to akshaysharmajs/scrapy that referenced this issue Mar 4, 2020
wRAR pushed a commit that referenced this issue Jul 20, 2020
#4393 (#4403)

* changed ie. -> i.e.(spelling error) on lines 667, 763 (issue #4332)

* updated all text files for issue #4332 (ie. -> i.e.)

* Apply ie. → i.e. in source comments

* ie → e.g.

* modified scrapy/utils/display.py to stop ANSI color sequences in the Windows terminal (issue #4393)

* modified scrapy/utils/display.py to stop ANSI color sequences in the Windows terminal (issue #4393)

* enabled virtual terminal processing (pr #4403)

* check for specific windows 10 version (pr #4403)

* fixing flake-8 test (pr #4403)

* added error handling for terminal info (pr #4403)

* corrected stderr (pr #4403)

* changed orientation, removed unwanted spaces (pr #4403)

* no need for style variable (pr #4403)

* fixing trailing whitespaces

* commenting windows check

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* small fixes

* Shifting _color_support_info() function

* enabled virtual terminal processing (pr #4403)

* check for specific windows 10 version (pr #4403)

* fixing flake-8 test (pr #4403)

* added error handling for terminal info (pr #4403)

* corrected stderr (pr #4403)

* changed orientation, removed unwanted spaces (pr #4403)

* no need for style variable (pr #4403)

* fixing trailing whitespaces

* commenting windows check

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* Update scrapy/utils/display.py

Co-Authored-By: Adrián Chaves <adrian@chaves.io>

* small fixes

* Shifting _color_support_info() function

* error handling

* error handlingy

* raise ValueError

* added in-built function for version comparison

* recommit changes

* changed check -> parse

* version comparison -> parse_version

* added scrapy/utils/display.py in pytest.ini

* Trigger

* Add simple test for scrapy.utils.display._colorize

* Flake8: E501 for tests/test_utils_display.py

* assertEquals -> assertEqual

* Normal formatter for all platforms

* separate test for windows

* all curses under try block

* added global TestStr

* more test added

* small fix

* covering exceptions

* windows test failing

* Refactor output color handling

* Fix pprint test

* fix flake8

Co-authored-by: Adrián Chaves <adrian@chaves.io>
Co-authored-by: Eugenio Lacuesta <eugenio.lacuesta@gmail.com>
@akshaysharmajs
Contributor

akshaysharmajs commented Jul 27, 2020

Should we close this issue?
