# Getting Started

Jumpstart a Graph in a few steps with WordLift by providing an XML sitemap and a WordLift Key ([get it here](https://wordlift.io)).

This is the first notebook in a series. Once you have created your graph, continue with [create internal links](create_internal_links.ipynb).

## Video Walkthrough

[![Jumpstart your Graph in less than 5 minutes](https://img.youtube.com/vi/yQV9DkH9LmI/0.jpg)](https://www.youtube.com/watch?v=yQV9DkH9LmI)

## Configuration

Configuration values are resolved from four sources, in this order:

1. Global keys found in `globals()`. This lets the notebook run with JupyterLab Scheduler, which presets the configuration as globals.
2. Keys from a local configuration file, by default `config/default.py`.
3. Environment variables.
4. Google Colab userdata (i.e. secrets).
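
The lookup order above can be pictured as a chain of mappings searched one after another. The following is a minimal sketch, not the implementation of `get_config_value`: the function name `lookup_config_value`, the sample values, and the assumption that the first source holding a value wins are all illustrative.

```python
import os
from typing import Any, Mapping, Optional, Sequence


def lookup_config_value(
    key: str,
    sources: Sequence[Mapping[str, Any]],
    default: Optional[Any] = None,
) -> Optional[Any]:
    """Return the first non-None value for `key`, scanning `sources` in order."""
    for source in sources:
        value = source.get(key)
        if value is not None:
            return value
    return default


# Hypothetical stand-ins for the first two sources (globals and config file).
notebook_globals = {"SITEMAP_URL": "https://example.org/sitemap.xml"}
config_file_values = {
    "SITEMAP_URL": "https://example.org/other-sitemap.xml",
    "OUTPUT_TYPE": "http://schema.org/WebPage",
}

# Environment variables come after the globals and the config file.
sources = [notebook_globals, config_file_values, os.environ]

resolved_sitemap = lookup_config_value("SITEMAP_URL", sources)
resolved_type = lookup_config_value("OUTPUT_TYPE", sources)
resolved_missing = lookup_config_value("HYPOTHETICAL_UNSET_KEY_123", sources, default="fallback")
```

Here the value preset in the globals takes precedence over the one in the configuration file, and the default is used only when no source defines the key.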

There are three configuration settings:

* `WORDLIFT_KEY`, the WordLift Key. When using Google Colab, it can be set in the secrets.
* `SITEMAP_URL`, the URL of a sitemap that lists page URLs. Sitemaps that link to other sitemaps have not been tested.
* `OUTPUT_TYPE`, optional, the type used to represent imported web pages. If not set, the configuration cell below defaults it to `http://schema.org/Article` (other options include `http://schema.org/WebPage`, `http://schema.org/CollectionPage`, etc.).
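
As an illustration, a `config/default.py` file might look like the sketch below. It assumes the file is a plain Python module whose top-level names match the setting names; the key and the URL are placeholders, not working values.

```python
# config/default.py (hypothetical example; replace the placeholder values)

# WordLift Key (placeholder, not a real key).
WORDLIFT_KEY = "your-wordlift-key"

# Sitemap that lists the page URLs to import (placeholder URL).
SITEMAP_URL = "https://example.org/sitemap.xml"

# Optional: schema.org type used to represent the imported web pages.
OUTPUT_TYPE = "http://schema.org/CollectionPage"
```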

In [3]:
import logging
from pathlib import Path
from wordlift_sdk.config import get_config_value

logging.basicConfig(level=logging.WARNING, force=True)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

config_path = Path.cwd() / "config" / "default.py"
WORDLIFT_KEY = get_config_value('WORDLIFT_KEY', config_path)
SITEMAP_URL = get_config_value('SITEMAP_URL', config_path)
OUTPUT_TYPES = {get_config_value('OUTPUT_TYPE', config_path, 'http://schema.org/Article')}
SHEETS_URL = get_config_value('SHEETS_URL', config_path)
SHEETS_NAME = get_config_value('SHEETS_NAME', config_path)

if WORDLIFT_KEY is None or OUTPUT_TYPES is None or SITEMAP_URL is None:
    raise ValueError('Configuration not set')
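
Because `SITEMAP_URL` is expected to point to a sitemap that lists pages rather than other sitemaps, it can help to check the root element of the sitemap before importing. In the sitemaps.org protocol a page list has a `urlset` root and a sitemap index has a `sitemapindex` root. The sketch below uses only the standard library; `sitemap_kind` and the sample documents are hypothetical and not part of the WordLift SDK.

```python
import xml.etree.ElementTree as ET


def sitemap_kind(xml_text: str) -> str:
    """Return 'urlset', 'sitemapindex', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    # Drop the XML namespace, e.g. '{http://www.sitemaps.org/...}urlset' -> 'urlset'.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name if local_name in ("urlset", "sitemapindex") else "unknown"


# A sitemap that lists pages directly.
url_set = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/page-1</loc></url>
</urlset>"""

# A sitemap index that links to other sitemaps.
sitemap_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

kinds = [sitemap_kind(url_set), sitemap_kind(sitemap_index)]
```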

# Dependencies

This part is only for Google Colab. When the notebook is used locally we recommend using `poetry install`.

In [None]:
from wordlift_sdk.notebook import install_if_missing

install_if_missing("wordlift-client>=1.75.0,<2.0.0", "wordlift-client")
install_if_missing("beautifulsoup4>=4.13.3,<5.0.0", "beautifulsoup4")
install_if_missing("rdflib>=7.1.3,<8.0.0", "rdflib")
install_if_missing("tenacity>=9.0.0,<10.0.0", "tenacity")
install_if_missing("pycountry>=24.6.1,<25.0.0", "pycountry")
install_if_missing("wordlift-sdk @ git+https://github.com/wordlift/python-sdk.git@0.26.0", "wordlift_sdk")
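
For readers curious about the general pattern behind a helper like this, here is a rough standard-library sketch of "install only when the module cannot be imported". It is not the SDK's `install_if_missing`, whose implementation may differ; `ensure_installed` is a hypothetical name.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(requirement: str, module_name: str) -> bool:
    """Install `requirement` with pip unless `module_name` is importable.

    Returns True when an install was attempted, False when it was skipped.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    # Use the interpreter running this code so pip targets the same environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", requirement])
    return True


# `json` ships with Python, so this call skips the install step.
install_attempted = ensure_installed("placeholder-requirement", "json")
```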

# Imports

This section contains the general imports and the basic client configuration. No changes are needed here.

In [None]:
from bs4 import BeautifulSoup
from rdflib import URIRef, Literal
from tenacity import retry, stop_after_attempt, wait_fixed
from wordlift_client import SitemapImportsApi, SitemapImportRequest, EmbeddingRequest, EntityPatchRequest
from wordlift_sdk.client import ClientConfigurationFactory
from wordlift_sdk.utils import create_entity_patch_request, create_or_update_kg_using_sitemap
from wordlift_client import Configuration
from typing import Callable
from collections.abc import Awaitable

# Defining the host is optional and defaults to https://api.wordlift.io
# See configuration.py for a list of all supported configuration parameters.
api_url = 'https://api.wordlift.io'
configuration = ClientConfigurationFactory(key=WORDLIFT_KEY).create()

# Callbacks

There are two callbacks that you can customize according to your needs:

1. `import_url` imports a URL into the Graph. It is called for each URL found in the sitemap.
2. `parse_html` parses the web page and returns a list of entity patches that add properties to the imported entity. It is called for every URL.

## Import URL

The defaults work well in most situations, so you usually don't need to change anything here.

In [None]:
async def import_url_factory(configuration: Configuration, types: set[str]) -> Callable[[set[str]], Awaitable[None]]:
    @retry(
        stop=stop_after_attempt(5),
        wait=wait_fixed(2)
    )
    async def import_url(url_list: set[str]) -> None:
        import wordlift_client

        async with wordlift_client.ApiClient(configuration) as api_client:
            imports_api = SitemapImportsApi(api_client)
            request = SitemapImportRequest(
                embedding=EmbeddingRequest(
                    properties=["http://schema.org/headline", "http://schema.org/abstract", "http://schema.org/text"]
                ),
                output_types=list(types),
                urls=list(url_list),
                overwrite=True,
                id_generator="headline-with-url-hash"
            )

            try:
                await imports_api.create_sitemap_import(sitemap_import_request=request)
            except Exception as e:
                logger.error("Error importing URLs: %s", e)
                # Re-raise so the @retry decorator sees the failure and tries again.
                raise

    return import_url
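
The `@retry` decorator above comes from tenacity and is configured for up to five attempts with a fixed two-second pause. To make that behavior concrete, here is a standard-library sketch of the same idea. It is not tenacity's implementation: `call_with_retry` and `flaky_import` are hypothetical names, and unlike tenacity, which by default wraps the final failure in a `RetryError`, this sketch re-raises the last exception.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 5,
    wait_seconds: float = 2.0,
) -> T:
    """Await `operation` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception:
            # Only an exception that reaches this point triggers another attempt.
            if attempt == attempts:
                raise
            await asyncio.sleep(wait_seconds)
    raise AssertionError("unreachable")


calls = {"count": 0}


async def flaky_import() -> str:
    """Fail twice, then succeed, to simulate a temporary API error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "imported"


# In a notebook cell, use `await call_with_retry(...)` instead of asyncio.run.
result = asyncio.run(call_with_retry(flaky_import, attempts=5, wait_seconds=0))
```

A retry wrapper only retries when the wrapped call raises, so an exception that is caught and swallowed inside the call counts as a success.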

## Parse HTML

This example shows how to add `schema:keywords` to an imported entity by taking the values from the `<a class="tag-cloud-link">Tag</a>` markup. You can further tailor this part based on your needs.

In [None]:


async def parse_html(entity_id: str, html: str) -> list[EntityPatchRequest]:
    soup = BeautifulSoup(html, 'html.parser')

    # Extract the text of all 'a' tags with class 'tag-cloud-link'
    tag_texts = [a.get_text(strip=True) for a in soup.find_all('a', class_='tag-cloud-link')]

    resource = URIRef(entity_id)

    payloads = []

    for value in tag_texts:
        payloads.append(
            create_entity_patch_request(
                resource,
                URIRef('http://schema.org/keywords'),
                Literal(value)
            )
        )

    return payloads
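
To see which elements the `tag-cloud-link` selector picks up, the sketch below applies the same rule to a small HTML fragment. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; `TagCloudLinkParser` and the sample markup are hypothetical. One difference from the cell above: this sketch skips links whose text is empty.

```python
from html.parser import HTMLParser


class TagCloudLinkParser(HTMLParser):
    """Collect the text of every <a> element whose class list has 'tag-cloud-link'."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self._inside_link = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if "tag-cloud-link" in classes:
            self._inside_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_link:
            text = "".join(self._buffer).strip()
            if text:
                self.tags.append(text)
            self._inside_link = False


sample_html = """
<div class="tagcloud">
  <a class="tag-cloud-link" href="/tag/seo">SEO</a>
  <a class="tag-cloud-link tag-link-7" href="/tag/kg"> Knowledge Graph </a>
  <a class="nav-link" href="/about">About</a>
</div>
"""

parser = TagCloudLinkParser()
parser.feed(sample_html)
parser.close()
keywords = parser.tags
```

Each collected string corresponds to one `schema:keywords` literal in the patches built by `parse_html`.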


# Main Function

This is the notebook's entry point. It creates or updates the graph from the sitemap, using the `import_url` callback defined above.

In [None]:
async def main() -> None:
    await create_or_update_kg_using_sitemap(
        configuration=configuration,
        key=WORDLIFT_KEY,
        sitemap_url=SITEMAP_URL,
        types=OUTPUT_TYPES,
        concurrency=1,
        import_url_callback=await import_url_factory(configuration=configuration, types=OUTPUT_TYPES)
    )


await main()
