a little rewrite #1

Closed
wants to merge 1 commit into from

2 participants

@jjeffus

First of all, thanks for sharing your code. I had a thread I wanted to download, and this is where a DDG search took me first. I naturally tweak things, and I ran into a few issues with the script not finding any images on the page. So I started tweaking, and then... I kind of rewrote the whole thing.

Anyways, just sharing back.

Peace

@serv serv closed this
Commits on May 5, 2012
  1. @jjeffus

    rewrite using nokogiri

    jjeffus authored
Showing with 64 additions and 42 deletions.
  1. +45 −39 4chan_image_crawler.rb
  2. +1 −0  Gemfile
  3. +9 −0 Gemfile.lock
  4. +9 −3 README
84 4chan_image_crawler.rb
@@ -1,48 +1,54 @@
-def urlToString(url)
-  require 'net/http'
-  uri = URI(url)
-  data = Net::HTTP.get(uri)
-  return data
-end
+#!/usr/bin/env ruby
-def writeToFile(image_url, directory_name)
-  require 'open-uri'
-  file_name = image_url[-17..-1]
-  puts image_url[-17..-1]
-  open("#{directory_name}/#{file_name}", 'wb') do |file|
-    file << open(image_url).read
-  end
-end
+require 'nokogiri'
+require 'fileutils'
+require 'net/http'
+require 'uri'
+require 'open-uri'
-def createDirectory
-  directory_name = "images_#{Time.now.to_f}"
-  Dir.mkdir("./#{directory_name}")
-  return directory_name
-end
+class FourChan
-def main(url)
-  html_content = urlToString(url)
-  occurances = html_content.scan(/http:\/\/images\.4chan\.org\/.*?\/\d+\.[jpg|jpeg|gif|png]{3,4}/)
-  result = []
-  (0..occurances.count-1).each do |i|
-    if i%2 == 0
-      result << occurances[i]
+  def list_images(url)
+    nok = Nokogiri::HTML( open(url).read )
+    images = []
+    nok.xpath("//a[@target='_blank']").each do |el|
+      images.push el.attr('href') if el.attr('href') =~ /images\.4chan\.org/i
     end
+    images.uniq!
+    STDERR.puts "There are #{images.count} images to download."
+    images
   end
-  puts result
-  puts "There are #{result.count} images to download."
-
-  directory_name = createDirectory
-
-  (0..result.count).each do |i|
-    if result[i] != nil
-      puts i+1 # number starts from 1 instead of 0
-      writeToFile(result[i], directory_name)
+
+  def download_images(thread_url, dir='.')
+    self.list_images(thread_url).each do |image_url|
+      uri = URI("http:#{image_url}")
+      file_name = image_url.split("/").last
+      next if File.exists? "#{dir}/#{file_name}"
+      Net::HTTP.start(uri.host, uri.port) do |http|
+        request = Net::HTTP::Get.new uri.request_uri
+        STDERR.puts "Saving: #{dir}/#{file_name}"
+        http.request request do |response|
+          open "#{dir}/#{file_name}", 'wb' do |io|
+            response.read_body do |chunk|
+              io.write chunk
+            end
+          end
+        end
+        sleep 3
+      end
    end
-    sleep 2 #seconds
  end
end
-
-ARGV.each do |url|
-  main(url)
+
+def usage(msg)
+  STDERR.puts "Error: #{msg}" if msg
+  STDERR.puts "image_crawler.rb <destination directory> <thread url> [additional thread urls]"
+  exit 1
end
+
+usage("Missing arguments.") if ARGV.length < 2
+dir = ARGV.shift
+usage("You must specify a destination directory as the first argument.") if dir =~ /^http/i
+usage("Second argument doesn't look like a thread url.") unless ARGV[0] =~ /http/i
+FileUtils.mkdir_p dir unless Dir.exists? dir
+FourChan.new.download_images(ARGV.shift, dir)
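The download_images method above prepends "http:" before parsing because 2012-era 4chan image hrefs were protocol-relative (they started with //). A minimal sketch of that URI and file-name derivation, using only the Ruby standard library and a made-up href:

```ruby
require 'uri'

# Hypothetical protocol-relative href, as scraped from a thread page.
image_url = "//images.4chan.org/g/src/1336181781123.jpg"

# URI() needs a scheme, so the script prepends "http:" first.
uri = URI("http:#{image_url}")

# The saved file name is just the last path segment.
file_name = image_url.split("/").last

puts uri.host         # images.4chan.org
puts uri.request_uri  # /g/src/1336181781123.jpg
puts file_name        # 1336181781123.jpg
```

The same host/path split is what lets the script open one Net::HTTP connection per image and skip files it has already saved.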
1  Gemfile
@@ -0,0 +1 @@
+gem 'nokogiri'
9 Gemfile.lock
@@ -0,0 +1,9 @@
+GEM
+  specs:
+    nokogiri (1.5.2)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  nokogiri
12 README
@@ -5,13 +5,19 @@ All image boards are supported.
Recommended Ruby environment
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]
+Installation
+
+Run "bundle install" in the script directory or "gem install nokogiri" to install the required 'nokogiri' gem.
+
To run the code,
1. On your terminal, move to the directory where you want to download images.
2. Save the ruby code into the directory.
-3. Run the following command. Replace IMAGE_URL with the url of the 4chan thread.
-$ ruby 4chan_image_crawler.rb IMAGE_URL
+3. Run the following command. Replace THREAD_URL with the url of the 4chan thread.
+$ ruby 4chan_image_crawler.rb . THREAD_URL
4. Images should be downloaded into the directory.
+You can re-run the same command again and the script will skip previously downloaded images.
+
---
Recent updates
@@ -19,4 +25,4 @@ March 4, 2012
- Supports all image boards. Not just /b/
- Creates unique directory name automatically, when the code is run
-@jasoki
+@jasoki
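The argument checks at the tail of the rewritten script can be exercised on their own. This sketch reimplements the same conditions as a pure function over a fake argument array (the helper name and sample arrays are made up; the real script prints the message and exits instead):

```ruby
# Mirrors the usage checks at the bottom of the rewritten script,
# returning the error message instead of printing and exiting.
def validation_error(args)
  return "Missing arguments." if args.length < 2
  return "You must specify a destination directory as the first argument." if args[0] =~ /^http/i
  return "Second argument doesn't look like a thread url." unless args[1] =~ /http/i
  nil
end

puts validation_error(["images"])                # Missing arguments.
puts validation_error(["http://a", "http://b"])  # destination-directory error
puts validation_error(["images", "nonsense"])    # thread-url error
p validation_error(["images", "http://boards.4chan.org/g/res/1"])  # nil (valid)
```

Note that in the real script `ARGV.shift` consumes the directory first, so its `ARGV[0]` check corresponds to the second element here.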