robotsparse

A Python package that makes parsing robots files fast and simple.

Usage

Basic usage, such as fetching the contents of a robots file:

import robotsparse

# NOTE: The `find_url` parameter redirects the url to the site's default robots location.
robots = robotsparse.getRobots("https://github.com/", find_url=True)
print(list(robots)) # output: ['user-agents']

The user-agents key contains each user-agent found in the robots file, along with the information associated with it.
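
For a quick look at that structure, the key can be printed directly. The exact layout of each entry depends on the robots file that was fetched, so treat this as a sketch rather than a documented schema:

import robotsparse

robots = robotsparse.getRobots("https://github.com/", find_url=True)

# Print the per-agent rules parsed from the robots file; the shape
# of each entry varies with the file's contents.
print(robots["user-agents"])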

Alternatively, we can store the robots contents as an object, which allows faster access:

import robotsparse

# This function returns an object with the parsed results as attributes.
robots = robotsparse.getRobotsObject("https://duckduckgo.com/", find_url=True)
assert isinstance(robots, object)
print(robots.allow) # Prints allowed locations
print(robots.disallow) # Prints disallowed locations
print(robots.crawl_delay) # Prints found crawl-delays
print(robots.robots) # This output is equivalent to the above example
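
As a rough sketch of how these attributes might be used, a path could be checked against the disallow rules. This assumes disallow is an iterable of path strings, which is an assumption for illustration rather than documented behavior, and the path below is hypothetical:

import robotsparse

robots = robotsparse.getRobotsObject("https://duckduckgo.com/", find_url=True)

# ASSUMPTION: `robots.disallow` is treated as an iterable of path
# strings here; the real structure may differ.
path = "/lite/"
if any(path.startswith(str(rule)) for rule in robots.disallow):
    print(f"{path} matches a disallow rule")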

Additional Features

When parsing robots files, it may sometimes be useful to parse sitemap files as well:

import robotsparse
sitemap = robotsparse.getSitemap("https://pypi.org/", find_url=True)

The sitemap variable above holds entries that look like this:

[{"url": "", "lastModified": ""}]
