# Getting Started

Jumpstart a Graph in a few steps with WordLift by providing an XML sitemap and a WordLift Key ([get it here](https://wordlift.io)).

This is the first notebook in a series. Once you have created your graph, move on to [create internal links](create_internal_links.ipynb).

## Video Walkthrough

[![Jumpstart your Graph in less than 5 minutes](https://img.youtube.com/vi/yQV9DkH9LmI/0.jpg)](https://www.youtube.com/watch?v=yQV9DkH9LmI)

## Configuration

There are two configuration sources. At least one of them is required, and they are applied in order, so values from the second source override those from the first:

1. The file `config/default.py`
2. Local constants, plus the WordLift Key stored in Google Colab Secrets
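
For the first source, a minimal `config/default.py` might look like the sketch below. All three values are placeholders to replace with your own; the setting names are the ones read by the configuration cell in this notebook.

```python
# config/default.py
# Placeholder values for illustration only; replace them with your own.
WORDLIFT_KEY = "your-wordlift-key"                # hypothetical placeholder, not a real key
SITEMAP_URL = "https://example.org/sitemap.xml"   # hypothetical sitemap URL
OUTPUT_TYPE = "http://schema.org/WebPage"         # optional; this is also the default
```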

There are only three configuration settings:

* `WORDLIFT_KEY`: the WordLift Key. On Google Colab it can be stored in the Secrets.
* `SITEMAP_URL`: the URL of a sitemap that lists page URLs directly. Sitemaps that link to other sitemaps have not been tested.
* `OUTPUT_TYPE`: optional, the type used to represent imported web pages. Defaults to `http://schema.org/WebPage`; other options include `http://schema.org/CollectionPage`.
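
Since only sitemaps that list pages directly have been tested, it can help to check what kind of sitemap you have before importing. The following is a minimal sketch using only the standard library; it assumes the standard sitemaps.org namespace, and `sitemap_kind` and `listed_locations` are hypothetical helper names, not part of the WordLift SDK.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_kind(xml_text: str) -> str:
    """Return 'urlset', 'sitemapindex', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree renders namespaced tags as '{namespace}localname'.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name if local_name in {"urlset", "sitemapindex"} else "unknown"


def listed_locations(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in the document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]


page_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/a/</loc></url>
  <url><loc>https://example.org/b/</loc></url>
</urlset>"""

index_sitemap = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

print(sitemap_kind(page_sitemap), listed_locations(page_sitemap))
print(sitemap_kind(index_sitemap), listed_locations(index_sitemap))
```

A `urlset` root is the tested case. A `sitemapindex` root means the file points to other sitemaps, so you would pick one of the listed child sitemaps as `SITEMAP_URL` instead.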

In [1]:
import logging
from collections.abc import Awaitable

logging.basicConfig(level=logging.WARNING, force=True)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

DEFAULT_OUTPUT_TYPE = 'http://schema.org/WebPage'

# Start from empty values so the final check works even when neither source is available.
WORDLIFT_KEY = None
OUTPUT_TYPES = {DEFAULT_OUTPUT_TYPE}
SITEMAP_URL = None

# 1. Configuration from the local `config/default.py` file.
try:
    from config import default as config

    WORDLIFT_KEY = config.WORDLIFT_KEY
    # OUTPUT_TYPE is optional: fall back to the default when it is missing or empty.
    OUTPUT_TYPES = {getattr(config, 'OUTPUT_TYPE', None) or DEFAULT_OUTPUT_TYPE}
    SITEMAP_URL = config.SITEMAP_URL
except ImportError:
    logger.warning("Cannot import configuration from local `config/default.py` file.")

# 2. Configuration from Google Colab Secrets and local constants.
try:
    from google.colab import userdata

    WORDLIFT_KEY = userdata.get('WORDLIFT_KEY')
    OUTPUT_TYPES = {DEFAULT_OUTPUT_TYPE}
    SITEMAP_URL = None  # Set your sitemap URL here when running in Google Colab.
except ImportError:
    logger.warning("Cannot import configuration from google.colab.userdata.")

if WORDLIFT_KEY is None or not OUTPUT_TYPES or SITEMAP_URL is None:
    raise ValueError('Configuration not set')



# Dependencies

This part is only for Google Colab. When the notebook is used locally we recommend using `poetry install`.

In [2]:
import sys

if "google.colab" in sys.modules:
    !pip install \
    "wordlift-client>=1.75.0,<2.0.0" \
    "beautifulsoup4>=4.13.3,<5.0.0" \
    "rdflib>=7.1.3,<8.0.0" \
    "tenacity>=9.0.0,<10.0.0" \
    "wordlift-sdk @ git+https://github.com/wordlift/python-sdk.git"


# Imports

This section contains the general imports and the basic client configuration. No changes are needed here.

In [3]:
from bs4 import BeautifulSoup
from rdflib import URIRef, Literal
from tenacity import retry, stop_after_attempt, wait_fixed
from wordlift_client import SitemapImportsApi, SitemapImportRequest, EmbeddingRequest, EntityPatchRequest
from wordlift_sdk.client import ClientConfigurationFactory
from wordlift_sdk.utils import create_entity_patch_request, create_or_update_kg_using_sitemap
from wordlift_client import Configuration
from typing import Callable

# Defining the host is optional and defaults to https://api.wordlift.io
# See configuration.py for a list of all supported configuration parameters.
api_url = 'https://api.wordlift.io'
configuration = ClientConfigurationFactory(key=WORDLIFT_KEY).create()



# Callbacks

There are two callbacks that you can customize according to your needs:

1. `import_url` imports a URL into the Graph. It is called for each URL found in the sitemap.
2. `parse_html` parses the web page and returns a list of entity patches that add properties to the imported entity. It is called for every URL.
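
Because the callbacks are plain async functions, a stand-in can be swapped in while experimenting. The sketch below is a dry-run variant of `import_url` that only records the URLs it receives instead of calling the API. It assumes the callback receives a `set[str]`, as in the Import URL cell of this notebook; `dry_run_import_url_factory` and `seen` are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable


def dry_run_import_url_factory(seen: list[str]) -> Callable[[set[str]], Awaitable[None]]:
    """Build a callback that records URLs instead of importing them."""

    async def import_url(url_list: set[str]) -> None:
        # Sort for a stable order, since sets are unordered.
        for url in sorted(url_list):
            print(f"would import: {url}")
            seen.append(url)

    return import_url


seen: list[str] = []
callback = dry_run_import_url_factory(seen)

# Outside a notebook, drive the coroutine with asyncio.run.
asyncio.run(callback({"https://example.org/b/", "https://example.org/a/"}))
```

In a notebook cell, replace the `asyncio.run(...)` call with `await callback(...)`, since the notebook already runs an event loop.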

## Import URL

The defaults work well in most situations, so you usually don't need to change anything here.

In [4]:
async def import_url_factory(configuration: Configuration, types: set[str]) -> Callable[[set[str]], Awaitable[None]]:
    @retry(
        stop=stop_after_attempt(5),
        wait=wait_fixed(2)
    )
    async def import_url(url_list: set[str]) -> None:
        import wordlift_client

        async with wordlift_client.ApiClient(configuration) as api_client:
            imports_api = SitemapImportsApi(api_client)
            request = SitemapImportRequest(
                embedding=EmbeddingRequest(
                    properties=["http://schema.org/headline", "http://schema.org/abstract", "http://schema.org/text"]
                ),
                output_types=list(types),
                urls=list(url_list),
                overwrite=True,
                id_generator="headline-with-url-hash"
            )

            try:
                await imports_api.create_sitemap_import(sitemap_import_request=request)
            except Exception as e:
                logger.error("Error importing URLs: %s", e)
                # Re-raise so the @retry decorator sees the failure and tries again.
                raise

    return import_url
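
The `@retry` decorator above stops after 5 attempts and waits a fixed 2 seconds between them. By default, tenacity retries a call only when it raises an exception. To make that policy concrete, here is a dependency-free sketch of the same idea. It is not tenacity's implementation, and `call_with_retry` and `flaky` are hypothetical names used for illustration.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def call_with_retry(
    func: Callable[[], Awaitable[T]],
    attempts: int = 5,
    wait_seconds: float = 2.0,
) -> T:
    """Call `func` until it succeeds or `attempts` calls have failed."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception as error:
            last_error = error
            # Wait a fixed interval before the next attempt, but not after the last one.
            if attempt < attempts:
                await asyncio.sleep(wait_seconds)
    assert last_error is not None
    raise last_error


calls = {"count": 0}


async def flaky() -> str:
    """Fail twice, then succeed, to exercise the retry loop."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# A short wait keeps the demo fast; the notebook's policy uses 2 seconds.
result = asyncio.run(call_with_retry(flaky, attempts=5, wait_seconds=0.01))
print(result, calls["count"])
```

The loop only retries when `func` raises. A function that catches its own exceptions and returns normally is treated as a success on the first attempt.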

## Parse HTML

This example shows how to add `schema:keywords` to an imported entity by taking the values from the `<a class="tag-cloud-link">Tag</a>` markup. You can further tailor this part based on your needs.

In [5]:


async def parse_html(entity_id: str, html: str) -> list[EntityPatchRequest]:
    soup = BeautifulSoup(html, 'html.parser')

    # Extract the text of all 'a' tags with class 'tag-cloud-link'
    tag_texts = [a.get_text(strip=True) for a in soup.find_all('a', class_='tag-cloud-link')]

    resource = URIRef(entity_id)

    payloads = []

    for value in tag_texts:
        payloads.append(
            create_entity_patch_request(
                resource,
                URIRef('http://schema.org/keywords'),
                Literal(value)
            )
        )

    return payloads
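
As one way to tailor the extraction, the sketch below collects keywords from both the tag-cloud links and a `<meta name="keywords">` element, and can be run against sample HTML before it is wired into `parse_html`. The meta element is an assumption about your pages, and `extract_keywords` and `sample_html` are hypothetical names.

```python
from bs4 import BeautifulSoup


def extract_keywords(html: str) -> list[str]:
    """Collect keywords from tag-cloud links and the keywords meta element."""
    soup = BeautifulSoup(html, "html.parser")

    # Same selector as in parse_html above.
    keywords = [a.get_text(strip=True) for a in soup.find_all("a", class_="tag-cloud-link")]

    # Assumption: some pages also list keywords in a comma-separated meta element.
    meta = soup.find("meta", attrs={"name": "keywords"})
    if meta is not None:
        keywords.extend(part.strip() for part in meta.get("content", "").split(","))

    # Drop empty values and duplicates while keeping the original order.
    return list(dict.fromkeys(keyword for keyword in keywords if keyword))


sample_html = """
<html>
  <head><meta name="keywords" content="wikidata, knowledge graph"></head>
  <body>
    <a class="tag-cloud-link" href="/tag/seo/"> SEO </a>
    <a class="tag-cloud-link" href="/tag/kg/">knowledge graph</a>
  </body>
</html>
"""

print(extract_keywords(sample_html))
```

Each returned string could then be wrapped in `create_entity_patch_request` in the same way as the tag texts in `parse_html`.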


# Main Function

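The main function ties everything together: it reads the sitemap and imports its URLs into the Graph, using the configuration and the `import_url` callback defined above.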

In [None]:
async def main() -> None:
    await create_or_update_kg_using_sitemap(
        configuration=configuration,
        key=WORDLIFT_KEY,
        sitemap_url=SITEMAP_URL,
        types=OUTPUT_TYPES,
        concurrency=1,
        import_url_callback=await import_url_factory(configuration=configuration, types=OUTPUT_TYPES)
    )


await main()


INFO:root:Getting https://wordlift.io/blog/en/post-sitemap.xml
INFO:gql.transport.aiohttp:>>> {"query": "query getEntities($types: [String]!) {\n  entities(query: {typeConstraint: {in: $types}}) {\n    iri\n    keywords: string(name: \"schema:keywords\")\n    url: string(name: \"schema:url\")\n  }\n}", "variables": {"types": ["http://schema.org/WebPage"]}}
INFO:gql.transport.aiohttp:<<< {
  "data" : {
    "entities" : [ {
      "iri" : "https://data.wordlift.io/wl1505904/title-tag-seo-using-deep-learning-and-tensorflow-3e9202b7c7a6fde83605021a5820ab04",
      "keywords" : "Wikidata",
      "url" : "https://wordlift.io/blog/en/title-tag-seo-using-ai/"
    }, {
      "iri" : "https://data.wordlift.io/wl1505904/decoding-seo-mastering-the-art-of-data-interpretation-b060d1842cda7410d1d57a8e691fc087",
      "keywords" : "knowledge graph",
      "url" : "https://wordlift.io/blog/en/seo-data-interpretation/"
    }, {
      "iri" : "https://data.wordlift.io/wl1505904/comprehensive-guide-about-a