sojin-project/scrape-academy

Scrape Academy

Scrape Academy provides a framework and a command-line utility that help you develop web scraping applications.

Install

pip3 install scrape-academy

Simple web page scraping

Scrape Academy helps you download the web pages you want to scrape.

# Download a page from https://www.python.jp

from bs4 import BeautifulSoup
from scrapeacademy import context, run

async def run_simple():
    page = await context.get("https://www.python.jp")
    soup = BeautifulSoup(page, features="html.parser")
    print(soup.title.text)

run(run_simple())

scrapeacademy.run() starts an asyncio event loop and runs the scraping coroutine.

Inside the async function, use the context.get() method to download a page. context.get() throttles requests to the server; by default, it waits 0.1 seconds between requests.
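To make the throttling idea concrete, here is a minimal sketch of one way to enforce a minimum interval between requests with asyncio. It is not Scrape Academy's implementation: the Throttle class, its wait() method, and the demo() helper are hypothetical names used only for illustration, and no real network requests are made.

```python
import asyncio
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time at which the previous call was released

    async def wait(self) -> None:
        # The lock makes concurrent tasks queue up one behind another.
        async with self._lock:
            if self._last is not None:
                delay = self._last + self.interval - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
            self._last = time.monotonic()


async def demo(n: int = 3, interval: float = 0.1) -> list:
    """Run n fake requests concurrently and return their start times."""
    throttle = Throttle(interval)
    stamps = []

    async def fake_request() -> None:
        await throttle.wait()            # blocks until the interval has passed
        stamps.append(time.monotonic())  # a real scraper would fetch the page here

    await asyncio.gather(*(fake_request() for _ in range(n)))
    return stamps


if __name__ == "__main__":
    times = asyncio.run(demo())
    print([round(b - a, 2) for a, b in zip(times, times[1:])])
```

Even though the three fake requests are started together, each one is released only after the previous one plus the interval, which is the behavior the paragraph above describes for context.get().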

Cache downloaded files

While developing a scraper, you usually need to inspect the HTML over and over. To make this easier, you can save downloaded files to the cache directory.

The context.get() method saves the downloaded file to the cache directory if the name parameter is supplied.

# Save https://www.python.jp

from scrapeacademy import context, run

async def save_index():
    page = await context.get("https://www.python.jp", name="python_jp_index")

run(save_index())

Later, you can load the saved HTML from the cache to scrape using another script.

# Parse saved HTML file.

from bs4 import BeautifulSoup
from scrapeacademy import context

html = context.load("python_jp_index")
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.text)
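The save-then-load workflow above can be pictured as a small name-keyed file cache. The sketch below is only an illustration of that idea, not Scrape Academy's actual storage code: the cache_path(), save(), and load() functions, the ".html" suffix, and the directory layout are all assumptions, since this README does not document how the cache is laid out.

```python
from pathlib import Path
import tempfile


def cache_path(cache_dir: Path, name: str) -> Path:
    # Assumption: one file per name, stored with an .html suffix.
    return cache_dir / f"{name}.html"


def save(cache_dir: Path, name: str, html: str) -> Path:
    """Write the downloaded HTML under the given name and return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, name)
    path.write_text(html, encoding="utf-8")
    return path


def load(cache_dir: Path, name: str) -> str:
    """Read back HTML that was saved earlier under the same name."""
    return cache_path(cache_dir, name).read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        save(cache, "python_jp_index", "<html><title>demo</title></html>")
        print(load(cache, "python_jp_index"))
```

Keying the cache by a short name rather than by URL keeps the lookup simple for a second script, which only needs to know the name used when the page was saved.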

Command-line utility

Scrape Academy provides the scrapeacademy command to make development easier.

You can inspect the cached files with a web browser.

$ scrapeacademy open python_jp_index

Or you can open the file in the vi editor as follows.

$ vi `scrapeacademy path python_jp_index`
