## How to scrape any website with ScraperAI

### Before we start, install the package

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# ! pip install scraperai

In [None]:
import os
import json

import pandas as pd
from dotenv import find_dotenv, load_dotenv
from tqdm import tqdm

from scraperai.models import WebpageFields, Pagination, WebpageType, ScraperConfig
from scraperai import ParserAI, Scraper
from scraperai.crawlers import SeleniumCrawler

### Step 1. Init crawler

First, we need to initialize a web crawler that fetches pages for us.

In this tutorial we use `SeleniumCrawler`, which is built on the Selenium WebDriver. By default it creates a new Chrome session.

To use another browser, pass your own webdriver (local or remote) to `SeleniumCrawler`:
```
crawler = SeleniumCrawler(driver=your_own_webdriver)
```

If you want to use Playwright or another service, you can write your own crawler implementation based on `BaseCrawler`.
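As a rough illustration of what a custom crawler could look like, here is a minimal sketch that exposes only the two members this notebook calls on a crawler for fetching: `get(url)` and `page_source`. The class name `SimpleHttpCrawler` and the `fetch` parameter are hypothetical. The abstract methods that `BaseCrawler` actually requires are not shown in this tutorial, so check the library source before subclassing it.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def _fetch_with_urllib(url: str) -> str:
    """Download a page with the standard library (no JavaScript rendering)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


class SimpleHttpCrawler:
    """Hypothetical crawler; a real one would be based on `BaseCrawler`."""

    def __init__(self, fetch: Callable[[str], str] = _fetch_with_urllib):
        # The fetch function is injectable so that Playwright, a proxy
        # service, or a test stub can be plugged in instead of urllib.
        self._fetch = fetch
        self._page_source: Optional[str] = None

    def get(self, url: str) -> None:
        """Load the page and remember its HTML."""
        self._page_source = self._fetch(url)

    @property
    def page_source(self) -> str:
        """Return the HTML of the last loaded page."""
        if self._page_source is None:
            raise RuntimeError('Call get(url) before reading page_source')
        return self._page_source
```

Keeping the fetch step as a plain callable means the same class can be exercised offline, for example with `SimpleHttpCrawler(fetch=lambda url: '<html></html>')`.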

In [None]:
crawler = SeleniumCrawler()

### Step 2. Init ParserAI

By default, we use the latest OpenAI GPT-4 model. You can place your API key in the `.env` file. If you don't have a key, you can get one [here](https://platform.openai.com/api-keys).
You can also use a different AI model. To do this, create your own implementations of the `BaseJsonLM` and `BaseVision` classes.

In [None]:
env_file = find_dotenv()
if env_file:
    load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key is None:
    openai_api_key = input('Please, enter your OpenAI API key: ')

parser = ParserAI(openai_api_key=openai_api_key)

There are 2 experiments in this doc:
1. [List of YCombinator companies](https://www.ycombinator.com/companies/)
2. [List of commits in the repository](https://github.com/scraperai/scraperai/commits/main/)

#### Experiment 1. List of YCombinator companies

### Step 3. Open the website page
The main goal of the steps below is to semi-automatically detect all the xpaths needed for scraping. Later, if you have multiple similar sites, you will be able to run batch scraping.

In [None]:
url = 'https://www.ycombinator.com/companies' # Enter the URL of the website

In [None]:
# Open page in the browser
crawler.get(url)

### Step 3.1. Detect page type
We divide webpages into 4 categories:
- **Catalog**: consists of similar-looking repeating elements. It can be a list of products, articles, companies, table rows, etc;
- **Details**: contains main information about one product;
- **Captcha**: the page shows an anti-scraping CAPTCHA;
- **Other**: everything else; we don't support this webpage type yet.

By default, we use a screenshot of the page and the GPT-4 Vision model to determine the type. There is also a fallback algorithm for cases where you cannot take a screenshot of the page or do not have access to Vision models.

If you know the type of the page, you can set it manually.

In [None]:
page_type = parser.detect_page_type(
    page_source=crawler.page_source,
    screenshot=crawler.get_screenshot_as_base64()
)
# You can set type manually
# page_type = WebpageType.CATALOG
page_type
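To make the "Catalog" category more concrete, here is a rough heuristic that flags a page as catalog-like when the same tag and `class` combination repeats many times. It is only an illustration under that assumption and is not ScraperAI's fallback algorithm; the names `looks_like_catalog` and `min_repeats` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class _SignatureCounter(HTMLParser):
    """Counts how often each (tag, class) pair opens in the document."""

    def __init__(self):
        super().__init__()
        self.signatures = Counter()

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class') or ''
        if css_class:
            self.signatures[(tag, css_class)] += 1


def looks_like_catalog(html: str, min_repeats: int = 5) -> bool:
    """Return True if some (tag, class) pair repeats at least `min_repeats` times."""
    counter = _SignatureCounter()
    counter.feed(html)
    if not counter.signatures:
        return False
    _, count = counter.signatures.most_common(1)[0]
    return count >= min_repeats
```

A heuristic like this is cheap but easy to fool (navigation menus also repeat), which is one reason to prefer the screenshot-based detection when it is available.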

Every action spends OpenAI tokens. You can check the total amount spent so far with:

In [None]:
parser.total_cost  # in USD
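If you want a feel for how such a figure adds up, the arithmetic is typically token counts multiplied by a per-token price. The sketch below assumes separate prices for prompt and completion tokens; the function name and the prices in the usage note are placeholders, not actual OpenAI rates and not how `parser.total_cost` is necessarily computed.

```python
def estimate_cost_usd(prompt_tokens: int,
                      completion_tokens: int,
                      prompt_price_per_1k: float,
                      completion_price_per_1k: float) -> float:
    """Estimate the cost of one request from token counts and per-1K prices."""
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost
```

For example, with placeholder prices of 0.01 and 0.03 USD per 1K tokens, a request with 2000 prompt tokens and 500 completion tokens comes to about 0.035 USD.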

### Step 3.2. Detect pagination
**This step applies only to pages of type `catalog`.**

We need to pass the whole page to detect pagination.
There are 3 types of pagination: `xpath`, `scroll`, and `url_param` (not implemented yet).

In [None]:
pagination = parser.detect_pagination(crawler.page_source)
pagination

If detection fails or returns a wrong result, you can set the pagination manually.

In [None]:
# Scroll type
p1 = Pagination(type='scroll')
# XPATH
p2 = Pagination(type='xpath', xpath='//some-xpath')
# URL param
p3 = Pagination(type='url_param', url_param='page')
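Since `url_param` pagination is not implemented yet, here is a small sketch of what it usually means: the next page is reached by incrementing a query parameter in the URL. The helper `next_page_url` and its `first_page` argument are hypothetical and only illustrate the idea; they are not part of ScraperAI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url: str, url_param: str = 'page', first_page: int = 1) -> str:
    """Return `url` with the page query parameter increased by one.

    If the parameter is missing, the current page is assumed to be `first_page`.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    current = int(query.get(url_param, [first_page])[0])
    query[url_param] = [str(current + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

By contrast, `xpath` pagination points at an element on the page (such as a "Next" button), and `scroll` pagination loads more items as the page is scrolled.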

### Step 3.3. Detect catalog items
**This step applies only to pages of type `catalog`.**
Make sure the item block, url and ... are detected correctly.
The AI isn't perfect, so you can add an extra prompt to help it understand what you want, or set the xpath manually.

In [None]:
catalog_item = parser.detect_catalog_item(page_source=crawler.page_source, website_url=url, extra_prompt=None)
catalog_item

You can highlight fields using selenium:

In [None]:
if catalog_item is not None:
    crawler.highlight_by_xpath(catalog_item.card_xpath, '#8981D7', 5)
    crawler.highlight_by_xpath(catalog_item.url_xpath, '#5499D1', 3)

### Step 3.4. Detect data fields in a catalog item

We define two types of data fields in an HTML page.

The first type is static fields, which do not carry a field name on the page. A static field can be either a single value or an array. Examples: product name or price.

The second type is dynamic fields, where both the field names and their values appear on the page. These fields usually look like tables:
param1: value1
param2: value2
etc.
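The difference can be sketched with the standard library alone. In the example below the HTML snippet, the field names, and the xpaths are all made up for illustration; it uses `xml.etree.ElementTree`, which only handles well-formed markup and a subset of XPath, and it does not reflect how ScraperAI extracts values internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed catalog card
CARD_HTML = """
<div class="card">
  <h2 class="title">Acme Widget</h2>
  <span class="price">9.99</span>
  <table class="specs">
    <tr><th>Color</th><td>Red</td></tr>
    <tr><th>Weight</th><td>1 kg</td></tr>
  </table>
</div>
"""

card = ET.fromstring(CARD_HTML)

# Static fields: the page has no label, so we pick the name ourselves
# and a single xpath locates the value.
static_xpaths = {
    'name': ".//h2[@class='title']",
    'price': ".//span[@class='price']",
}
static = {name: card.find(xpath).text for name, xpath in static_xpaths.items()}

# Dynamic fields: one xpath selects the labels, another selects the values,
# and the two lists are paired up in document order.
labels = [el.text for el in card.findall(".//table[@class='specs']/tr/th")]
values = [el.text for el in card.findall(".//table[@class='specs']/tr/td")]
dynamic = dict(zip(labels, values))

print(static)
print(dynamic)
```

This is why a dynamic field is described by two xpaths (names and values), while a static field needs only one.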

In [None]:
# Aux method to print detected fields
def _print_fields(fields: WebpageFields):
    print(f'Static fields ({len(fields.static_fields)}):')

    data = [{'name': f.field_name, 'xpath': f.field_xpath, 'value': f.first_value} for f in fields.static_fields]
    df = pd.DataFrame(data)
    print(df.to_markdown(tablefmt='plain', index=True))

    print()
    print(f'Dynamic fields ({len(fields.dynamic_fields)}):')
    if len(fields.dynamic_fields) == 0:
        print('Not found')
        return
    index = len(fields.static_fields)
    for field in fields.dynamic_fields:
        print(f' {index}  Section {field.section_name}\n'
                   f'    Labels xpath: {field.name_xpath}\n'
                   f'    Values xpath: {field.value_xpath}\n'
                   f'    Value: {field.first_values}')
        index += 1

In [None]:
fields = parser.extract_fields(html_snippet=catalog_item.html_snippet)
_print_fields(fields)

You can highlight detected fields:

In [None]:
# Method to highlight fields
def highlight_fields(fields: WebpageFields):
    colors = ['#539878', '#5499D1', '#549B9A', '#5982A3', '#5A5499', '#68D5A2', '#75DDDC', '#8981D7', '#98D1FF',
              '#98FFCF', '#9D5A5A', '#A05789', '#AAFFFE', '#C6C1FF', '#CD7CB3', '#D17A79', '#FAB4E4', '#FFB1B0']
    for index, field in enumerate(fields.static_fields):
        crawler.highlight_by_xpath(field.field_xpath, colors[index % len(colors)], border=4)
    for index, field in enumerate(fields.dynamic_fields):
        color = colors[index % len(colors)]
        crawler.highlight_by_xpath(field.value_xpath, color, border=3)
        crawler.highlight_by_xpath(field.name_xpath, color, border=3)

In [None]:
highlight_fields(fields)

### Step 3.5. Scrape data

We are almost there!

First of all, let's set some limits for simplicity:

In [None]:
max_pages = 5  # How many catalog pages we should iterate over
max_rows = 200  # How many rows to scrape before stopping

Let's init scraper config that you can reuse later:

In [None]:
config = ScraperConfig(
    start_url=url,
    page_type=page_type,
    pagination=pagination,
    catalog_item=catalog_item,
    open_nested_pages=False,
    fields=fields,
    max_pages=max_pages,
    max_rows=max_rows
)

Now we need to create a `Scraper` instance from the crawler and the `ScraperConfig`. It iterates over the catalog cards and automatically handles pagination and data extraction.

In [None]:
scraper = Scraper(config, crawler)

rows = []
for item in tqdm(scraper.scrape(), total=max_rows):
    rows.append(item)
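To clarify how the two limits interact, here is a simplified sketch in which iteration stops as soon as either `max_pages` or `max_rows` is reached. The generator `iterate_with_limits` is hypothetical and works on in-memory lists; it assumes this "whichever comes first" behaviour and is not the library's implementation.

```python
from typing import Dict, Iterable, Iterator, List


def iterate_with_limits(pages: Iterable[List[Dict]],
                        max_pages: int,
                        max_rows: int) -> Iterator[Dict]:
    """Yield rows page by page until either limit is hit."""
    yielded = 0
    for page_number, page_rows in enumerate(pages, start=1):
        if page_number > max_pages:
            return  # page limit reached
        for row in page_rows:
            if yielded >= max_rows:
                return  # row limit reached
            yield row
            yielded += 1
```

Because fewer than `max_rows` items may be produced when the page limit is hit first, a progress bar created with `total=max_rows` can finish below 100%.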

#### Congratulations! We got the final data!

In [None]:
rows

You can export data in any format:

In [None]:
# Export as JSON ('w' is enough here: the file is only written, never read back)
with open('yc.json', 'w', encoding='utf-8') as f:
    json.dump(rows, f, indent=4)

# Export to Pandas DataFrame
df = pd.DataFrame(rows)
df.to_csv('yc.csv')
df

### Step 4. Parse nested detail page

You can also extract data from nested detail pages using ScraperAI.

In [None]:
# Open first nested page
crawler.get(catalog_item.urls_on_page[0])

### Step 4.1. Extract fields

First, we use the `summarize_details_page_as_valid_html` method to keep only the relevant parts of the webpage.
For example, a list of similar products is not a relevant part of a details page.

Then we use `parser.extract_fields`, as before, to get the fields from the HTML snippet.

In [None]:
html_snippet = parser.summarize_details_page_as_valid_html(
    page_source=crawler.page_source,
    screenshot=crawler.get_screenshot_as_base64()
)
fields = parser.extract_fields(html_snippet)
_print_fields(fields)

In [None]:
# Let's highlight the fields
highlight_fields(fields)

### Step 4.2. Scrape data

First, let's set some limits for simplicity:

In [None]:
max_pages = 5  # How many catalog pages we should iterate over
max_rows = 20  # How many rows to scrape before stopping

We are ready to collect the data:

In [None]:
config = ScraperConfig(
    start_url=url,
    page_type=page_type,
    pagination=pagination,
    catalog_item=catalog_item,
    open_nested_pages=True,
    fields=fields,
    max_pages=max_pages,
    max_rows=max_rows
)
scraper = Scraper(config, crawler)

rows = []
for item in tqdm(scraper.scrape(), total=max_rows):
    rows.append(item)

In [None]:
pd.DataFrame(rows)

#### Experiment 2. [List of commits in the repository](https://github.com/scraperai/scraperai/commits/main/)

In [None]:
# Define url
url = 'https://github.com/scraperai/scraperai/commits/main/'

In [None]:
# Open url
crawler.get(url)

In [None]:
# Set page_type manually: we already know this page is a catalog
page_type = WebpageType.CATALOG
page_type

In [None]:
# Detect pagination
pagination = parser.detect_pagination(crawler.page_source)
pagination

In [None]:
# Detect catalog item
catalog_item = parser.detect_catalog_item(
    page_source=crawler.page_source,
    website_url=url,
    extra_prompt='This page contains a list of commits. Each commit row is a catalog item')
catalog_item

In [None]:
crawler.highlight_by_xpath(catalog_item.card_xpath, '#8981D7', 5)
crawler.highlight_by_xpath(catalog_item.url_xpath, '#5499D1', 3)

In [None]:
fields = parser.extract_fields(html_snippet=catalog_item.html_snippet)
_print_fields(fields)

In [None]:
highlight_fields(fields)

In [None]:
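# Define the limits explicitly so that tqdm's total matches the config
# (otherwise max_rows would still hold the value from Experiment 1)
max_pages = 2  # How many catalog pages we should iterate over
max_rows = 100  # How many rows to scrape before stopping

config = ScraperConfig(
    start_url=url,
    page_type=page_type,
    pagination=pagination,
    catalog_item=catalog_item,
    open_nested_pages=True,
    fields=fields,
    max_pages=max_pages,
    max_rows=max_rows
)
scraper = Scraper(config, crawler)

rows = []
for item in tqdm(scraper.scrape(), total=max_rows):
    rows.append(item)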

In [None]:
pd.DataFrame(rows)

### That's it!
#### Thank you for using ScraperAI