Rename gem to twingly-url #15

Merged: 8 commits, Jul 15, 2014
18 changes: 14 additions & 4 deletions README.md
@@ -1,13 +1,23 @@
-# twingly-url-normalizer
+# Twingly::URL

 [![Build Status](https://magnum.travis-ci.com/twingly/twingly-url-normalizer.png?token=ADz8fWxRD3uP4KZPPZQS&branch=master)](https://magnum.travis-ci.com/twingly/twingly-url-normalizer)

-Ruby gem for URL normalization
+Twingly URL tools.

-## Example
+* `twingly/url` - Parse and validate URLs
+  * `Twingly::URL.parse` - Returns a Struct with `#url` and `#domain` accessors
+  * `Twingly::URL.validate` - Validates a URL
+* `twingly/url/normalizer` - Normalize URLs
+  * `Twingly::URL::Normalizer.normalize(string)` - Extracts URLs from a string and returns them normalized (Array)
+* `twingly/url/hasher` - Generate URL hashes suitable for primary keys
+  * `Twingly::URL::Hasher.blogstream_hash(url)` - MD5 hexdigest (first 30 characters, upcased)
+  * `Twingly::URL::Hasher.documentdb_hash(url)` - SHA256 unsigned long, native endian digest
+  * `Twingly::URL::Hasher.autopingdb_hash(url)` - SHA256 64-bit signed, native endian digest
+
+## Normalization example

 ```
-[1] pry(main)> require 'twingly-url-normalizer'
+[1] pry(main)> require 'twingly/url/normalizer'
 => true
 [2] pry(main)> Twingly::URL::Normalizer.normalize('http://duh.se')
 => ["http://www.duh.se/"]
39 changes: 2 additions & 37 deletions lib/twingly-url-normalizer.rb
@@ -1,37 +1,2 @@
-require 'addressable/uri'
-require 'public_suffix'
-
-PublicSuffix::List.private_domains = false
-
-module Twingly
-  module URL
-    class Normalizer
-      def self.normalize(potential_urls)
-        extract_urls(potential_urls).map do |potential_url|
-          normalize_url(potential_url)
-        end.compact
-      end
-
-      def self.extract_urls(potential_urls)
-        Array(potential_urls).map(&:split).flatten
-      end
-
-      def self.normalize_url(potential_url)
-        uri = Addressable::URI.heuristic_parse(potential_url)
-        domain = PublicSuffix.parse(uri.host)
-
-        unless domain.subdomain?
-          uri.host = "www.#{domain}"
-        end
-
-        if uri.path.empty?
-          uri.path = "/"
-        end
-
-        uri.to_s
-      rescue PublicSuffix::DomainInvalid
-      end
-    end
-  end
-end
-
+warn "twingly-url-normalizer will be removed, use twingly/url/normalizer"
+require 'twingly/url/normalizer'
36 changes: 36 additions & 0 deletions lib/twingly/url.rb
@@ -0,0 +1,36 @@
+require 'addressable/uri'
+require 'public_suffix'
+
+PublicSuffix::List.private_domains = false
+
+module Twingly
+  module URL
+    module_function
+
+    UrlObject = Struct.new(:url, :domain) do
+      def valid?
+        url && domain
+      end
+    end
+
+    def parse(potential_url)
+      url, domain = extract_url_and_domain(potential_url)
+      UrlObject.new(url, domain)
+    end
+
+    def extract_url_and_domain(potential_url)
+      url = Addressable::URI.heuristic_parse(potential_url)
+      domain = PublicSuffix.parse(url.host)
+
+      [url, domain]
+    rescue PublicSuffix::DomainInvalid
+      []
+    end
+
+    def validate(potential_url)
+      parse(potential_url).valid?
+    rescue
+      false
+    end
+  end
+end
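
The `valid?` check in `lib/twingly/url.rb` leans on two plain-Ruby behaviours: destructuring an empty array assigns `nil` to every target, and `url && domain` returns its last operand rather than a strict boolean. A minimal stdlib-only sketch of that pattern follows. `ParsedUrl` and `fake_extract` are hypothetical stand-ins that are not part of the gem, and plain strings take the place of the Addressable and PublicSuffix objects.

```ruby
# Hypothetical stand-in for UrlObject; strings replace the real URI/domain objects.
ParsedUrl = Struct.new(:url, :domain) do
  def valid?
    # Returns the last operand when both are truthy, otherwise nil/false.
    url && domain
  end
end

# Hypothetical extractor: returns [] on failure, like the rescue branch in the diff.
def fake_extract(input)
  return [] if input.nil? || input.strip.empty?

  [input, input.sub(%r{\Ahttps?://}, "").split("/").first]
end

url, domain = fake_extract("") # destructuring [] leaves both variables nil
failed = ParsedUrl.new(url, domain)
ok = ParsedUrl.new(*fake_extract("http://blog.twingly.com/"))

puts failed.valid?.inspect # nil, which is falsy
puts ok.valid?.inspect     # the domain string, which is truthy
```

If a caller needs a strict `true`/`false`, wrapping the expression as `!!(url && domain)` would provide one; `assert` and `refute` in the tests only look at truthiness, so the difference does not show up there.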
22 changes: 22 additions & 0 deletions lib/twingly/url/hasher.rb
@@ -0,0 +1,22 @@
+require 'digest/md5'
+require 'digest/sha2'
+
+module Twingly
+  module URL
+    module Hasher
+      module_function
+
+      def blogstream_hash(url)
+        Digest::MD5.hexdigest(url)[0..29].upcase
+      end
+
+      def documentdb_hash(url)
+        Digest::SHA256.digest(url).unpack("L!")[0]
+      end
+
+      def autopingdb_hash(url)
+        Digest::SHA256.digest(url).unpack("q")[0]
+      end
+    end
+  end
+end
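
The two SHA256-based hashes read the leading bytes of the same digest with different `String#unpack` directives. The stdlib-only sketch below uses hypothetical helper names and pins the byte order to little-endian (`Q<` and `q<`) so the result does not depend on the host. That is an assumption made for illustration: the diff itself uses the native-endian directives `L!` and `q`, and `L!` is only 8 bytes wide on platforms whose C `unsigned long` is 64-bit.

```ruby
require 'digest/sha2'

# Hypothetical helpers: fixed little-endian variants of the native-endian
# directives used in the diff ("L!" and "q").
def unsigned_hash64(url)
  Digest::SHA256.digest(url).unpack("Q<")[0]
end

def signed_hash64(url)
  Digest::SHA256.digest(url).unpack("q<")[0]
end

url = "http://blog.twingly.com/"
unsigned = unsigned_hash64(url)
signed = signed_hash64(url)

# Both values are read from the same first 8 bytes of the digest,
# so they are equal modulo 2**64.
puts unsigned
puts signed
puts((unsigned - signed) % 2**64 == 0)
```

The values asserted in `test/unit/hasher_test.rb` show the same relationship: 15340752212397415993 minus 2**64 is -3105991861312135623.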
36 changes: 36 additions & 0 deletions lib/twingly/url/normalizer.rb
@@ -0,0 +1,36 @@
+require 'twingly/url'
+
+module Twingly
+  module URL
+    module Normalizer
+      module_function
+
+      def normalize(potential_urls)
+        extract_urls(potential_urls).map do |potential_url|
+          normalize_url(potential_url)
+        end.compact
+      end
+
+      def extract_urls(potential_urls)
+        Array(potential_urls).map(&:split).flatten
+      end
+
+      def normalize_url(potential_url)
+        result = Twingly::URL.parse(potential_url)
+
+        return nil unless result.valid?
+
+        unless result.domain.subdomain?
+          result.url.host = "www.#{result.domain}"
+        end
+
+        if result.url.path.empty?
+          result.url.path = "/"
+        end
+
+        result.url.to_s
+      end
+    end
+  end
+end
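
The normalization steps above (split on whitespace, parse, add `www.` when there is no subdomain, default the path to `/`, drop failures) can be approximated with the standard library alone. The sketch below is not the gem's implementation: `sketch_normalize` and `sketch_normalize_url` are hypothetical names, `URI.parse` stands in for `Addressable::URI.heuristic_parse`, and the subdomain check is a deliberate simplification that treats any two-label host as having no subdomain, where the gem asks PublicSuffix and so handles suffixes such as `co.uk` correctly.

```ruby
require 'uri'

# Stdlib-only approximation of the normalization steps in the diff.
# ASSUMPTION: a host with exactly two labels (e.g. "duh.se") has no subdomain.
def sketch_normalize_url(potential_url)
  # Add a scheme when none is given (a rough stand-in for heuristic_parse).
  has_scheme = potential_url =~ %r{\A[a-z][a-z0-9+.-]*://}i
  candidate = has_scheme ? potential_url : "http://#{potential_url}"

  uri = URI.parse(candidate)
  host = uri.host
  return nil if host.nil? || host.empty?

  uri.host = "www.#{host}" if host.split(".").length == 2
  uri.path = "/" if uri.path.empty?
  uri.to_s
rescue URI::Error
  nil
end

def sketch_normalize(potential_urls)
  # Split every input on whitespace, normalize each piece, drop failures.
  Array(potential_urls).flat_map(&:split).map { |u| sketch_normalize_url(u) }.compact
end

p sketch_normalize("http://duh.se")                  # => ["http://www.duh.se/"]
p sketch_normalize(["blog.twingly.com/feed duh.se"]) # => ["http://blog.twingly.com/feed", "http://www.duh.se/"]
```

Returning `nil` from the per-URL step and calling `compact` afterwards mirrors the diff's approach: one bad token in a whitespace-separated string does not discard the rest.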

4 changes: 1 addition & 3 deletions lib/version.rb
@@ -1,7 +1,5 @@
 module Twingly
   module URL
-    class Normalizer
-      VERSION = '1.0.1'
-    end
+    VERSION = '1.1.0'
   end
 end
4 changes: 3 additions & 1 deletion test/test_helper.rb
@@ -2,4 +2,6 @@
 require 'turn/autorun'
 require 'shoulda-context'

-require 'twingly-url-normalizer'
+require 'twingly/url'
+require 'twingly/url/hasher'
+require 'twingly/url/normalizer'
24 changes: 24 additions & 0 deletions test/unit/hasher_test.rb
@@ -0,0 +1,24 @@
+require 'test_helper'
+
+class HasherTest < Test::Unit::TestCase
+  context ".blogstream_hash" do
+    should "return a MD5 hexdigest" do
+      assert_equal Twingly::URL::Hasher.blogstream_hash("http://blog.twingly.com/"),
+        "B1E2D5AECF6649C2E44D17AEA3E0F4"
+    end
+  end
+
+  context ".documentdb_hash" do
+    should "return a SHA256 unsigned long, native endian digest" do
+      assert_equal Twingly::URL::Hasher.documentdb_hash("http://blog.twingly.com/"),
+        15340752212397415993
+    end
+  end
+
+  context ".autopingdb_hash" do
+    should "return a SHA256 64-bit signed, native endian digest" do
+      assert_equal Twingly::URL::Hasher.autopingdb_hash("http://blog.twingly.com/"),
+        -3105991861312135623
+    end
+  end
+end
13 changes: 13 additions & 0 deletions test/unit/url_test.rb
@@ -0,0 +1,13 @@
+require 'test_helper'
+
+class UrlTest < Test::Unit::TestCase
+  context ".validate" do
+    should "return true for a valid url" do
+      assert Twingly::URL.validate("http://blog.twingly.com/"), "Should be valid"
+    end
+
+    should "return false for a valid url" do

Review comments on the line above:

Collaborator: should be "return false for an invalid url", right?

Contributor Author: Yes.

Collaborator: Fixed in cce2611

+      refute Twingly::URL.validate("http://"), "Should not be valid"
+    end
+  end
+end
8 changes: 4 additions & 4 deletions twingly-url-normalizer.gemspec → twingly-url.gemspec
@@ -3,13 +3,13 @@
 require File.expand_path('../lib/version', __FILE__)

 Gem::Specification.new do |s|
-  s.name = "twingly-url-normalizer"
-  s.version = Twingly::URL::Normalizer::VERSION
+  s.name = "twingly-url"
+  s.version = Twingly::URL::VERSION
   s.platform = Gem::Platform::RUBY
   s.authors = ["Johan Eckerström"]
   s.email = ["johan.eckerstrom@twingly.com"]
-  s.homepage = "http://github.com/twingly/twingly-url-normalizer"
-  s.summary = "Ruby library for URL normalization"
+  s.homepage = "http://github.com/twingly/twingly-url"
+  s.summary = "Ruby library for URL handling"
   s.required_ruby_version = ">= 1.9.3"

   s.add_dependency "addressable"