
scrapyd writing logs to console instead of log files in docker container #470

Closed
blacksteel1288 opened this issue Feb 5, 2023 · 9 comments


@blacksteel1288

I've been using scrapyd with this docker configuration for a while with no issues, but now scrapyd is only sending the scrapy logs to the console instead of to a log file in the /logs directory. I'm not sure what has changed, but I first noticed this on a new install, and now both the new and the old install have the same issue.

I can access the scrapyd web UI and run a spider with no issues, but when I browse to the spider's log directory (e.g. /logs/{project}/{myspider}), no log files are there, and none appear in the mapped data directories when I check them directly from the command line.

Since scrapyd automatically creates the parent log directory for the spider without any problem, I don't believe this is a permissions issue.
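(A quick way to double-check the permissions angle from the host, using the container name from the compose file below:

docker exec my-scrapyd_1 ls -la /app/logs
docker exec my-scrapyd_1 id

The first shows the directories scrapyd created and who owns them; the second shows the user the container, and therefore scrapyd, runs as.)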

Here are the relevant files:

docker-compose.yml

version: '3.9'
services:

  spider_1:
    build:
      context: ./scrapyd
      shm_size: '2gb'
    image: my-scrapyd
    container_name: my-scrapyd_1
    restart: always
    shm_size: '2gb'
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 4G
        reservations:
          memory: 2G
    environment:
      TZ: "America/New_York"
    volumes:
      - ./data/spider_1/results:/app/results
      - ./data/spider_1/eggs:/app/eggs
      - ./data/spider_1/logs:/app/logs
      - ./data/spider_1/dbs:/app/dbs
      - ./data/spider_1/dump:/app/dump
    ports:
      - "6800:6800"
    networks:
      - my_net

  web:
    build:
      context: ./scrapydweb
    image: my-scrapydweb
    container_name: my-scrapydweb
    restart: always
    environment:
      CLUSTER_SERVERS: "spider_1:6800"
      TZ: "America/New_York"
    links:
      - spider_1
    ports:
      - "5100:5000"
    networks:
      - my_net
    volumes:
      - ./data/web:/usr/local/lib/python3.6/site-packages/scrapydweb/data
    depends_on:
      - spider_1

networks:
  my_net:
    name: my_net
    driver: bridge

Dockerfile

# Ubuntu is required for playwright
FROM ubuntu:jammy

ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /app

COPY requirements.txt scrapyd.conf start-scrapyd.sh ./

# Install Python, then the scraping stack from requirements.txt
RUN apt update && apt install -y python3 python-is-python3 python3-pip
RUN /usr/bin/python -m pip install --upgrade pip
RUN pip install -r requirements.txt
RUN chmod +x start-scrapyd.sh
RUN mkdir -p /app/dump
# Download the Chrome browser that scrapy-playwright drives
RUN playwright install chrome

EXPOSE 6800
CMD ["./start-scrapyd.sh"]

requirements.txt

scrapyd
logparser
scrapy-playwright
beautifulsoup4
brotli
pymongo
pyOpenSSL

start-scrapyd.sh

#!/bin/bash
# Ensure the log directory exists before anything writes to it
mkdir -p /app/logs
# Run logparser in the background to aggregate stats from the scrapy logs
logparser -dir /app/logs -t 10 --delete_json_files &
# Run scrapyd in the foreground as the container's main process
scrapyd

scrapyd.conf

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
jobstorage  = scrapyd.jobstorage.MemoryJobStorage
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root
eggstorage  = scrapyd.eggstorage.FilesystemEggStorage

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Specifically, these are the only LOG settings used in my scrapy settings.py file:

LOG_LEVEL = 'INFO'
LOG_SHORT_NAMES = True

I've tried various troubleshooting steps, including not mapping any volumes in docker-compose and letting scrapyd create the directories/files inside the container only, to rule out a permissions issue -- but the result is the same: no log files, just scrapy logs sent to the console when running a spider.

I don't see any error messages or relevant logs. I also tried setting debug = on inside scrapyd.conf, but no additional info was shown.
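For reference, the webservice endpoints from the scrapyd.conf above all respond normally, so the daemon itself seems healthy (the project name here is a placeholder):

curl http://localhost:6800/daemonstatus.json
curl "http://localhost:6800/listjobs.json?project=myproject"

Jobs show up and finish in listjobs.json; they just leave no log file behind.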

Is it possible that some upstream library has changed that would cause this?

Here are the results of a "pip list" inside the container:

Package            Version
------------------ -----------
attrs              22.2.0
Automat            22.10.0
beautifulsoup4     4.11.2
Brotli             1.0.9
certifi            2022.12.7
cffi               1.15.1
charset-normalizer 3.0.1
constantly         15.1.0
cryptography       39.0.0
cssselect          1.2.0
dbus-python        1.2.18
dnspython          2.3.0
filelock           3.9.0
greenlet           2.0.1
hyperlink          21.0.0
idna               3.4
incremental        22.10.0
itemadapter        0.7.0
itemloaders        1.0.6
jmespath           1.0.1
logparser          0.8.2
lxml               4.9.2
packaging          23.0
parsel             1.7.0
pexpect            4.8.0
pip                23.0
playwright         1.30.0
Protego            0.2.1
ptyprocess         0.7.0
pyasn1             0.4.8
pyasn1-modules     0.2.8
pycparser          2.21
PyDispatcher       2.0.6
pyee               9.0.4
PyGObject          3.42.1
pymongo            4.3.3
pyOpenSSL          23.0.0
queuelib           1.6.2
requests           2.28.2
requests-file      1.5.1
Scrapy             2.8.0
scrapy-playwright  0.0.26
scrapyd            1.3.0
service-identity   21.1.0
setuptools         59.6.0
six                1.16.0
soupsieve          2.3.2.post1
tldextract         3.4.0
Twisted            22.10.0
typing_extensions  4.4.0
urllib3            1.26.14
w3lib              2.1.1
wheel              0.37.1
zope.interface     5.5.2
todoit commented Feb 6, 2023

Same issue here. I changed the Scrapy version back to 2.7.1 and the log files came back.

@blacksteel1288

Confirmed, that was it -- holding the Scrapy version at 2.7.1 solves the problem. Thanks.
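For anyone else hitting this, the workaround is an explicit pin in the requirements.txt above, e.g. adding this line:

scrapy==2.7.1

so pip stops resolving to 2.8.0 when the image is rebuilt.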

@jpmckinney added the "type: question" label Feb 6, 2023
@jpmckinney

@blacksteel1288 So does the problem occur when using Scrapy 2.8 in combination with Scrapyd?


jpmckinney commented Feb 6, 2023

Probably related to this change in the Scrapy 2.8 release notes:

Support for using environment variables prefixed with SCRAPY_ to override settings, deprecated in Scrapy 2.0, has now been removed

And our setting of SCRAPY_LOG_FILE in environ.py. It should now just be LOG_FILE.
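Roughly, the mechanism Scrapyd 1.3.0 relied on looked like this (a paraphrase of Scrapy's pre-2.8 get_project_settings(), not the exact code):

import os

def apply_scrapy_env_overrides(settings):
    # Before Scrapy 2.8, any SCRAPY_<NAME> environment variable was copied
    # into the <NAME> setting, so Scrapyd's environ.py could set
    # SCRAPY_LOG_FILE in the job's environment and the spider process
    # would write its log to that file.
    overrides = {
        key[len("SCRAPY_"):]: value
        for key, value in os.environ.items()
        if key.startswith("SCRAPY_")
    }
    settings.setdict(overrides, priority="project")

With that code removed, SCRAPY_LOG_FILE is silently ignored and Scrapy falls back to its default of logging to stderr -- which is exactly the console-only behavior reported above.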

Related: #369

@jpmckinney added the "type: bug" label and removed the "type: question" label Feb 6, 2023
@jpmckinney

Merging into #369 as a duplicate.

@blacksteel1288

Hi @jpmckinney,

I tested this patch and verified it works correctly with Scrapy 2.8 -- log files are created as expected.
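(In case it helps anyone else who wants to test before a release: with the Dockerfile above, you can install Scrapyd straight from the repo by replacing the scrapyd line in requirements.txt with a VCS requirement, e.g.:

git+https://github.com/scrapy/scrapyd.git

pip accepts git URLs as requirement lines, so the rest of the build stays the same.)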

Will there be a new release of scrapyd soon?

Thank you!

@jpmckinney

1.4.0 is now available: https://pypi.org/project/scrapyd/ 🎉
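(For the setup above, that means dropping the scrapy==2.7.1 pin, letting scrapyd resolve to >=1.4.0, and rebuilding the image, e.g. docker compose build --no-cache spider_1.)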

@blacksteel1288

Great, thank you. You may also want to bump the release here in the repo to match PyPI -- https://github.com/scrapy/scrapyd/releases

@jpmckinney

you may also want to bump the release here in the repo to match PyPI

Thanks for the reminder! Done now.
