Include Docker support #14

Open
Tails opened this issue Nov 25, 2018 · 10 comments
Labels
enhancement New feature or request

Comments

@Tails

Tails commented Nov 25, 2018

It's easy to get up and running with Docker (no need to install a bunch of dependencies on a system that you don't know about).

I got Docker working using the following files:

#Dockerfile
FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt-get install -q -y git unzip wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt-get install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

# The exec form must be a JSON array, which requires double quotes
ENTRYPOINT ["kimurai"]

And the accompanying docker-compose.yml:

# 'extends' is not supported in version 3
version: '2'

services:

  base:
    build: ./
    entrypoint: /bin/bash
    working_dir: /app
    volumes:
      - ./:/app

  irb:
    extends: base
    entrypoint: irb
    volumes:
      - ./:/app

  kimurai:
    extends: base
    entrypoint: bundle exec kimurai
    volumes:
      - ./:/app

  crawl:
    extends: kimurai
    command: crawl
    volumes:
      - ./:/app

@vifreefly
Owner

@Tails, would you be interested in making a PR for this?

@Tails
Author

Tails commented Nov 26, 2018

I will, sometime this week.

@vifreefly vifreefly added the enhancement New feature or request label Dec 1, 2018
@patrykk21

How do you use this?

@seliverstov-maxim

seliverstov-maxim commented Apr 4, 2020

IMHO a Docker image would be enough.

@seliverstov-maxim

seliverstov-maxim commented Apr 4, 2020

This works for me (a build for development):
Dockerfile

FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt-get install -q -y git unzip lsof wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt-get install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

Gemfile

source 'https://rubygems.org' do
  gem 'kimurai'
  gem 'byebug'
end

Build

docker build . -t simple-kimurai 

Run (this opens a container with the environment installed for development and the current directory mounted)

docker run --rm -it -v ${PWD}:/app -w /app simple-kimurai bash

@seliverstov-maxim

It would be great if the owner created an official Docker image.

@thanhtoan1196

@seliverstov-maxim the Dockerfile is great, but it crashes when running with multiple threads:

I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360]  INFO -- MySpider: Info: visits: requests: 7, responses: 6
D, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360] DEBUG -- MySpider: Browser: driver.current_memory: 3837
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
#<Thread:0x0000560bc78df6c0@/usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:299 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	19: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'
	18: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
	17: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
	16: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `request_to'
	15: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `public_send'
	14: from a.rb:33:in `try_parse'
	13: from a.rb:52:in `parse_question_page'
	12: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
	11: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
	10: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
	 9: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
	 8: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:52:in `get'
	 7: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:587:in `execute'
	 6: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 5: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 4: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 3: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 2: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 1: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=73.0.3683.75)
  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)
I, [2021-05-07 08:17:08 +0000#1693] [M: 47304283293120]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
F, [2021-05-07 08:17:08 +0000#1693] [M: 47304283293120] FATAL -- MySpider: Spider: stopped: {:spider_name=>"MySpider", :status=>:failed, :error=>"#<Selenium::WebDriver::Error::UnknownError: unknown error: session deleted because of page crash\nfrom unknown error: cannot determine loading status\nfrom tab crashed\n  (Session info: headless chrome=73.0.3683.75)\n  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)>", :environment=>"development", :start_time=>2021-05-07 08:16:42 +0000, :stop_time=>2021-05-07 08:17:08 +0000, :running_time=>"25s", :visits=>{:requests=>7, :responses=>6}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296275900]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296321600]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296845720]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
Traceback (most recent call last):
	19: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'
	18: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
	17: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
	16: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `request_to'
	15: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `public_send'
	14: from a.rb:33:in `try_parse'
	13: from a.rb:52:in `parse_question_page'
	12: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
	11: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
	10: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
	 9: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
	 8: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:52:in `get'
	 7: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:587:in `execute'
	 6: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 5: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 4: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 3: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 2: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 1: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=73.0.3683.75)
  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)

@hjhart

hjhart commented Jul 25, 2021

I'm having the same issues with multithreading inside of a docker container. Code works great on my Mac OS X box.

::WebDriver::Error::UnknownError: unknown error: session deleted because of page crash\nfrom unknown error: cannot determine loading status\nfrom tab crashed\n  (Session info: headless chrome=86.0.4240.111)>", :environment=>"development", :start_time=>2021-07-25 18:06:00.6242447 +0000, :stop_time=>2021-07-25 18:06:18.1101284 +0000, :running_time=>"17s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=86.0.4240.111)
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/w3c/bridge.rb:567:in `execute'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/w3c/bridge.rb:59:in `get'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
        from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
        from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'

@thanhtoan1196 did you figure out a workaround?

@tellodaniel

@hjhart @thanhtoan1196 In my case I can't modify certain settings of my Docker container, so I added the following Chrome flag: --disable-dev-shm-usage, and everything worked like a charm. The downside is that Chrome now uses the /tmp folder instead of /dev/shm, so your spider will probably be slower.

The problem is described here: https://stackoverflow.com/questions/53902507/unknown-error-session-deleted-because-of-page-crash-from-unknown-error-cannot
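
For anyone checking whether they are hitting the same thing: Docker gives a container a 64 MB /dev/shm by default, and headless Chrome tabs can crash when that fills up. Below is a rough, stdlib-only Ruby sketch that reports the size of /dev/shm from inside the container. The helper name `filesystem_size_mb` and the 128 MB threshold are made up for illustration and are not part of Kimurai; it also assumes a POSIX `df` is available in the image.

```ruby
require 'open3'

# Total size in megabytes of the filesystem mounted at `path`,
# or nil if `df` is unavailable or fails (for example, the path does not exist).
def filesystem_size_mb(path)
  stdout, _stderr, status = Open3.capture3('df', '-Pk', path)
  return nil unless status.success?

  # POSIX output (-P): a header line, then one data line whose second
  # field is the total size in 1024-byte blocks.
  fields = stdout.lines[1]&.split
  return nil if fields.nil? || fields.size < 2

  fields[1].to_i / 1024
rescue Errno::ENOENT
  # `df` itself is not installed
  nil
end

# Hypothetical threshold for illustration only; Docker's default /dev/shm is 64 MB.
SHM_WARN_BELOW_MB = 128

if $PROGRAM_NAME == __FILE__
  size = filesystem_size_mb('/dev/shm')
  if size.nil?
    warn 'Could not determine the size of /dev/shm'
  elsif size < SHM_WARN_BELOW_MB
    warn "/dev/shm is only #{size} MB; headless Chrome tabs may crash"
  else
    puts "/dev/shm is #{size} MB"
  end
end
```

If you do control how the container is started, raising the limit with `docker run --shm-size=1g ...` is an alternative to the Chrome flag and keeps Chrome on shared memory.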

@iwoogy

iwoogy commented Oct 1, 2022

I have put together an updated version of the Docker configuration.

https://github.com/iwoogy/kimurai-docker-example

Hope it helps.
