Extracts links/urls from html markup with ability to choose which urls to be extracted.

sekgobela-kevin/surflink

surflink

Surflink extracts links/URLs from HTML markup and lets you choose which URLs to extract. Instead of extracting every URL from the HTML, you can extract specific kinds, such as image URLs, and leave the rest. By default, URLs are extracted only from the href and src attributes of tags.

The type of a URL is determined from the tag that contains it and from the content type guessed from the URL itself. Most functions accept a strict argument, which makes the type depend on the tag alone rather than on the guessed content type. With strict enabled, an image URL is matched only if it is within an 'img' tag; a URL outside an 'img' tag is not matched even if it has an image extension.
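
The difference between the two modes can be illustrated with the standard library alone. This is only a sketch of the idea, not surflink's implementation: the helper name is_image_url and its tag parameter are hypothetical, and the content type is guessed with mimetypes.

```python
import mimetypes


def is_image_url(url, tag, strict=False):
    """Hypothetical helper (not part of surflink): is this URL an image URL?"""
    # A URL found in an 'img' tag is treated as an image in both modes.
    if tag == "img":
        return True
    # In strict mode only the tag counts, so anything else is rejected.
    if strict:
        return False
    # Otherwise fall back to the content type guessed from the URL.
    guessed_type, _ = mimetypes.guess_type(url)
    return guessed_type is not None and guessed_type.startswith("image/")


tree = "https://example.com/pages/tree.png"
print(is_image_url(tree, tag="a"))               # True, guessed from '.png'
print(is_image_url(tree, tag="a", strict=True))  # False, not in an 'img' tag
```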

Other functions that operate on URLs also exist, such as making URLs absolute or filtering them by scheme or hostname. resid provides more features for working with URLs, along with other features that surflink does not provide but that may be valuable.
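
To show what filtering by scheme or hostname means in practice, here is a minimal sketch built on urllib.parse. The function names filter_by_scheme and filter_by_hostname are assumptions made for illustration; they are not taken from surflink, whose own function names may differ.

```python
from urllib.parse import urlsplit


def filter_by_scheme(urls, scheme):
    """Hypothetical helper: keep only URLs with the given scheme."""
    return [url for url in urls if urlsplit(url).scheme == scheme]


def filter_by_hostname(urls, hostname):
    """Hypothetical helper: keep only URLs with the given hostname."""
    return [url for url in urls if urlsplit(url).hostname == hostname]


urls = [
    "https://example.com/pages/tree.png",
    "/pages/world",  # relative, so it has no scheme or hostname
    "https://en.wikipedia.org/wiki/Food",
]
print(filter_by_scheme(urls, "https"))
print(filter_by_hostname(urls, "en.wikipedia.org"))
```

Relative URLs carry neither a scheme nor a hostname, so they drop out of both filters unless they are made absolute first.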

Installing

Running this in your command line is enough (Python required):

pip install surflink

Usage

First import surflink to use its functions.

>>> import surflink

This HTML sample will be used for the code examples that follow.

<html>
    <head>
        <base href="https://example.com/">
        <script src='https://example.com/startup.js' 
        type="application/javascript"></script>
        <script src='https://example.com/jquery.js'></script>
        <link rel="stylesheet" href='https://example.com/w3.css'>
    </head>
    <body>
        <iframe src='https://example.com/great_river.html'></iframe>
        <img src='https://example.com/pages/elephant.png'>
        <audio src='https://example.com/audios/hiphop_beat.mp3'></audio>
        <video src='https://example.com/audios/underground.mp4'></video>
        
        Random Links
        <a href='/pages/world'>world</a><br>
        <a href='https://example.com/pages/tree.png'>tree image</a><br>
        <a href="https://en.wikipedia.org/wiki/Food">food</a>
    </body>
</html>

Treat it as part of the code samples that follow, which assume the variable html_sample has already been defined.

# This variable contains the html above as a string.
>>> html_sample = ...

Here, URLs for images and JavaScript are extracted from the HTML. All functions that extract URLs accept the same arguments but may return different URLs.

# Extracts urls for images no matter where they are from.
>>> surflink.extract_image_urls(html_sample)
['https://example.com/pages/elephant.png', 'https://example.com/pages/tree.png']
# Does the same with javascript urls.
>>> surflink.extract_javascript_urls(html_sample)
['https://example.com/startup.js', 'https://example.com/jquery.js']

Notice that 'https://example.com/pages/tree.png' was matched as an image just because of its extension, even though it is not in an img tag and nothing states that it is an image URL. The strict argument addresses exactly that: the URL type is determined from what is in the HTML, not from the guessed content type.

# Now only image urls within 'img' tags will be extracted,
# unless the type attribute is set to something that is not an image.
>>> surflink.extract_image_urls(html_sample, strict=True)
['https://example.com/pages/elephant.png']

Sometimes URLs within HTML are relative to other URLs, and their absolute versions may be needed. Making a URL absolute requires the URL it is based on, which can usually be found within the HTML but sometimes needs to be provided explicitly.

The URL '/pages/world' is relative, but we need it to be absolute like the other URLs. That is easy here because our HTML already contains a base URL, which is used internally to make URLs absolute.

# Notice that '/pages/world' is relative and missing scheme and domain.
>>> surflink.extract_webpage_urls(html_sample)
[..., '/pages/world', 'https://en.wikipedia.org/wiki/Food']
# But here scheme and domain were added automatically.
>>> surflink.extract_webpage_urls(html_sample, make_absolute=True)
[..., 'https://example.com/pages/world', 'https://en.wikipedia.org/wiki/Food']
# Base url can be provided with 'base_url' argument if your html does
# not provide one or you want to override the html base url.
>>> surflink.extract_webpage_urls(html_sample, base_url="https://example.com/",
... make_absolute=True)
[..., 'https://example.com/pages/world', 'https://en.wikipedia.org/wiki/Food']
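
For readers curious how a base URL found in the HTML can be combined with an explicit override, the following is a rough standard-library sketch. It is not surflink's code: the class LinkCollector and the function extract_urls are hypothetical names, and only href and src attributes are read.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical collector: gathers href/src values and the <base> URL."""

    def __init__(self):
        super().__init__()
        self.base_url = None
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base":
            # Remember the first <base href>; it is not a link itself.
            if self.base_url is None and attrs.get("href"):
                self.base_url = attrs["href"]
            return
        for name in ("href", "src"):
            if attrs.get(name):
                self.urls.append(attrs[name])


def extract_urls(html, base_url=None, make_absolute=False):
    collector = LinkCollector()
    collector.feed(html)
    # An explicit base_url overrides the one found in the html.
    base = base_url or collector.base_url
    if make_absolute and base:
        return [urljoin(base, url) for url in collector.urls]
    return collector.urls


sample = '<base href="https://example.com/"><a href="/pages/world">world</a>'
print(extract_urls(sample))                      # ['/pages/world']
print(extract_urls(sample, make_absolute=True))  # ['https://example.com/pages/world']
```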

URLs can also be made absolute without extracting them from HTML.

# Makes single url absolute
>>> surflink.make_url_absoulute(base_url="https://example.com/", url="/pages/world")
'https://example.com/pages/world'
# Makes multiple urls absolute
>>> urls = ["/pages/world", "//example.com/pages/elephant.png"]
>>> surflink.make_urls_absoulute(base_url="https://example.com/pages", urls=urls)
['https://example.com/pages/world', 'https://example.com/pages/elephant.png']
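
The same inputs can be resolved with urljoin from the standard library, which is a convenient way to check what an absolute URL should look like. Whether surflink uses urljoin internally is not stated here, so treat this as an independent sketch; the third URL, 'tree.png', is added only for illustration.

```python
from urllib.parse import urljoin

base_url = "https://example.com/pages"
urls = ["/pages/world", "//example.com/pages/elephant.png", "tree.png"]

# Resolve each URL against the base URL.
absolute_urls = [urljoin(base_url, url) for url in urls]
for url in absolute_urls:
    print(url)
# https://example.com/pages/world
# https://example.com/pages/elephant.png
# https://example.com/tree.png
```

The last result shows a common pitfall: because the base URL has no trailing slash, 'pages' is treated as a file name and replaced. With "https://example.com/pages/" as the base, 'tree.png' would resolve to 'https://example.com/pages/tree.png'.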

Functions that extract URLs accept other arguments as well: attrs specifies the attributes to extract URLs from, start_tag specifies the tag name at which to start extracting, and unique ensures that the function returns unique URLs.

>>> surflink.extract_javascript_urls(html_sample, attrs=("src", "href"), start_tag="html", strict=False)
['https://example.com/startup.js', 'https://example.com/jquery.js']
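
As a rough picture of what restricting attributes and requesting unique results could involve, here is a standard-library sketch. The names AttrCollector and collect_urls are hypothetical and the behaviour is an assumption made for illustration, not a description of surflink's internals; start_tag is left out to keep it short.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical collector: reads URLs only from the chosen attributes."""

    def __init__(self, attrs=("href", "src")):
        super().__init__()
        self.wanted = tuple(attrs)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.wanted and value:
                self.urls.append(value)


def collect_urls(html, attrs=("href", "src"), unique=True):
    collector = AttrCollector(attrs)
    collector.feed(html)
    if unique:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return list(dict.fromkeys(collector.urls))
    return collector.urls


sample = (
    '<script src="a.js"></script>'
    '<script src="a.js"></script>'
    '<a href="b.html">b</a>'
)
print(collect_urls(sample))                                # ['a.js', 'b.html']
print(collect_urls(sample, attrs=("src",), unique=False))  # ['a.js', 'a.js']
```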

The functions shown here are just a few of those that exist in surflink.

License

MIT license
