# Getting Started

Jumpstart a Graph in a few steps with WordLift by providing an XML sitemap and a WordLift Key ([get it here](https://wordlift.io)).

This is the first notebook in a series. Once you have created your graph, move on to [create internal links](create_internal_links.ipynb).

## Video Walkthrough

[![Jumpstart your Graph in less than 5 minutes](https://img.youtube.com/vi/yQV9DkH9LmI/0.jpg)](https://www.youtube.com/watch?v=yQV9DkH9LmI)

## Configuration

Configuration combines four sources, in this order:

1. Global keys found in `globals()`, which lets this notebook run under JupyterLab Scheduler, where the configuration is preset as globals.
2. Keys from a local configuration file, by default `config/default.py`.
3. Environment variables.
4. Google Colab user data (i.e. secrets).
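
The lookup order above can be pictured with a small sketch. This is not the SDK's `get_config_value` implementation: `resolve_config_value` and the sample dictionaries are hypothetical, and the sketch assumes that the first source holding a non-`None` value wins.

```python
import os
from typing import Any, Mapping, Optional, Sequence


def resolve_config_value(
    name: str,
    sources: Sequence[Mapping[str, Any]],
    default: Optional[Any] = None,
) -> Optional[Any]:
    """Return the first non-None value for `name`, scanning sources in order."""
    for source in sources:
        value = source.get(name)
        if value is not None:
            return value
    return default


# Hypothetical stand-ins for the four sources, listed in lookup order.
notebook_globals = {"SITEMAP_URL": "https://example.org/sitemap.xml"}
config_file_values = {
    "SITEMAP_URL": "https://example.org/other-sitemap.xml",
    "OUTPUT_TYPE": "http://schema.org/WebPage",
}
environment = dict(os.environ)
colab_secrets: dict = {}  # stays empty outside Google Colab

sources = [notebook_globals, config_file_values, environment, colab_secrets]

# The global value shadows the one from the configuration file.
print(resolve_config_value("SITEMAP_URL", sources))
# Only the configuration file defines this key in the sample data.
print(resolve_config_value("OUTPUT_TYPE", sources))
```

Under this assumption, a value preset as a global by a scheduler takes priority over the same key in `config/default.py`.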

The configuration settings are:

* `WORDLIFT_KEY`: the WordLift Key (required). When using Google Colab, it can be set in the secrets.
* `OUTPUT_TYPE`: optional, the type used to represent imported web pages. If not set, the code below defaults to `http://schema.org/Article` (other options could be `http://schema.org/WebPage`, `http://schema.org/CollectionPage`, etc.).
* `SITEMAP_URL`: the URL of a sitemap that lists page URLs directly. Sitemaps that link to other sitemaps have not been tested.
* Alternatively, provide `SHEETS_URL`, `SHEETS_NAME` and `SHEETS_SERVICE_ACCOUNT` to read the list of URLs from the `url` column of the specified Google Sheets spreadsheet. This requires a valid [Google service account](./docs/create_google_service_account.md).
* `URLS`: optional, a list of URLs to import directly, as a third alternative to a sitemap or a spreadsheet.
* `INTERNAL_LINKS`: optional, defaults to `False`. When enabled, internal links are created after the import.
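
As an illustration, a local configuration file might look like the sketch below. It assumes that `config/default.py` is a plain Python module with one module-level assignment per key; every value shown is a placeholder, not a real key, URL or path.

```python
# config/default.py: illustrative values only, replace them with your own.

# Required: your WordLift Key.
WORDLIFT_KEY = "your-wordlift-key"

# URL source, option 1: an XML sitemap that lists page URLs directly.
SITEMAP_URL = "https://example.org/sitemap.xml"

# Optional: the schema.org type used to represent imported web pages.
OUTPUT_TYPE = "http://schema.org/WebPage"

# URL source, option 2: a Google Sheets spreadsheet with a `url` column.
# SHEETS_URL = "https://docs.google.com/spreadsheets/d/<spreadsheet-id>"
# SHEETS_NAME = "Sheet1"
# SHEETS_SERVICE_ACCOUNT = "path/to/service-account.json"
```

Only one URL source is needed, which is why the Google Sheets keys are commented out here.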


In [None]:
import logging
from pathlib import Path

from wordlift_sdk.config import get_config_value

logging.basicConfig(level=logging.WARNING, force=True)

# Suppress all other loggers below WARNING
for name in logging.root.manager.loggerDict:
    if name != __name__:
        logging.getLogger(name).setLevel(logging.WARNING)

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Try to read the configuration from the `config/default.py` file.
config_path = Path.cwd() / "config" / "default.py"
WORDLIFT_KEY = get_config_value("WORDLIFT_KEY", config_path)
SITEMAP_URL = get_config_value("SITEMAP_URL", config_path)
OUTPUT_TYPES = {
    get_config_value("OUTPUT_TYPE", config_path, "http://schema.org/Article")
}
SHEETS_URL = get_config_value("SHEETS_URL", config_path)
SHEETS_NAME = get_config_value("SHEETS_NAME", config_path)
SHEETS_SERVICE_ACCOUNT = get_config_value("SHEETS_SERVICE_ACCOUNT", config_path)
URLS = get_config_value("URLS", config_path)
INTERNAL_LINKS = get_config_value("INTERNAL_LINKS", config_path, False)

if WORDLIFT_KEY is None:
    raise ValueError("`WORDLIFT_KEY` is required.")

# `OUTPUT_TYPES` is always a set, so check its members rather than the set itself.
if not all(OUTPUT_TYPES):
    raise ValueError("`OUTPUT_TYPE` is required.")

if (
    SITEMAP_URL is None
    and URLS is None
    and (SHEETS_URL is None or SHEETS_NAME is None or SHEETS_SERVICE_ACCOUNT is None)
):
    raise ValueError(
        "One of `SITEMAP_URL`, `URLS` or `SHEETS_URL`/`SHEETS_NAME`/`SHEETS_SERVICE_ACCOUNT` is required."
    )

# Dependencies

This part is only for Google Colab. When the notebook is used locally we recommend using `poetry install`.

In [None]:
from wordlift_sdk.notebook import install_if_missing

install_if_missing("wordlift-client>=1.75.0,<2.0.0", "wordlift_client")
install_if_missing("beautifulsoup4>=4.13.3,<5.0.0", "bs4")
install_if_missing("rdflib>=7.1.3,<8.0.0", "rdflib")
install_if_missing("tenacity>=9.0.0,<10.0.0", "tenacity")
install_if_missing("pycountry>=24.6.1,<25.0.0", "pycountry")
install_if_missing(
    "wordlift-sdk @ git+https://github.com/wordlift/python-sdk.git@0.34.1",
    "wordlift_sdk",
)

# Imports

This section provides general imports and basic configuration. There is no need to change anything here.

In [None]:
from wordlift_sdk.client import ClientConfigurationFactory

# Defining the host is optional and defaults to https://api.wordlift.io
# See configuration.py for a list of all supported configuration parameters.
# Note: `api_url` is shown for reference only; it is not passed to the factory below.
api_url = "https://api.wordlift.io"
configuration = ClientConfigurationFactory(key=WORDLIFT_KEY).create()

# Callbacks

There are two callbacks that you can customize according to your needs:

1. `import_url` imports a URL into the Graph. It is called for each URL found in the sitemap.
2. `parse_html` parses the web page and returns a list of entity patches that add properties to the imported entities. It is called for every URL.

## Import URL

The defaults work well for most situations, so you usually don't need to configure anything here. This is the default callback from the SDK:

```python
@@TODO
```

## Parse HTML

This example shows how to add `schema:keywords` to an imported entity by taking the values from the `<a class="tag-cloud-link">Tag</a>` markup. You can further tailor this part based on your needs.

In [None]:
# @@TODO
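
Until the cell above is filled in, the extraction step alone can be sketched with the standard library. This is not the SDK's `parse_html` callback: `TagCloudParser`, `extract_keywords` and the sample markup are hypothetical, and turning the extracted keywords into entity patches is left out because it depends on the SDK.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TagCloudParser(HTMLParser):
    """Collects the text of every <a> element with the `tag-cloud-link` class."""

    def __init__(self) -> None:
        super().__init__()
        self.keywords: List[str] = []
        # Holds text fragments while we are inside a matching <a> element.
        self._buffer: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "tag-cloud-link" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            keyword = "".join(self._buffer).strip()
            if keyword:
                self.keywords.append(keyword)
            self._buffer = None


def extract_keywords(html: str) -> List[str]:
    """Return the tag names found in the tag cloud, in document order."""
    parser = TagCloudParser()
    parser.feed(html)
    parser.close()
    return parser.keywords


sample_html = """
<div class="tagcloud">
  <a class="tag-cloud-link" href="/tag/seo">SEO</a>
  <a class="tag-cloud-link tag-link-7" href="/tag/knowledge-graph">Knowledge Graph</a>
  <a href="/about">About</a>
</div>
"""

print(extract_keywords(sample_html))  # → ['SEO', 'Knowledge Graph']
```

The resulting list is what would end up as `schema:keywords` values. Links without the `tag-cloud-link` class, such as the "About" link, are ignored.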

# Create Internal Links

The `create_internal_links` function is called by `main` to create the internal links when `INTERNAL_LINKS` is enabled.

## How does it work

We query the graph for all entities that have embedding vectors. The results are stored as `iri`/`url` pairs in the `entities_with_embedding_vectors_df` dataframe.

We then call the SDK's `create_internal_link_handler` function, passing the client configuration and the ID of the Link Group that we want to create.

We can optionally provide an `internal_link_request_filter` function with the signature `Callable[[Series, InternalLinkRequest], Awaitable[InternalLinkRequest]]` to alter the actual request with additional filters (for example, to require at least one keyword shared between the source web page and the target web page).

The results are written to the graph using schema.org and [seontology](https://github.com/seontology/seontology/).

In [None]:
from tqdm.asyncio import tqdm
from wordlift_sdk.internal_link import create_internal_link_handler
from wordlift_sdk.utils import (
    create_dataframe_of_entities_with_embedding_vectors,
    delayed,
)
from pandas import DataFrame
from wordlift_client import Configuration


async def create_internal_links(configuration: Configuration, key: str) -> DataFrame:
    entities_with_embedding_vectors_df = (
        await create_dataframe_of_entities_with_embedding_vectors(key)
    )

    # We're polite and not making more than 2 concurrent reqs.
    handler = create_internal_link_handler(configuration, "getting_started")
    await tqdm.gather(
        *[
            delayed(handler, 2)(row)
            for index, row in entities_with_embedding_vectors_df.iterrows()
        ],
        total=len(entities_with_embedding_vectors_df),
    )

    return entities_with_embedding_vectors_df
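
The idea of capping concurrent requests, as done with `delayed(handler, 2)` above, can be illustrated with a plain `asyncio.Semaphore`. This is a minimal sketch and not the SDK's `delayed` implementation: `limit_concurrency`, `fake_handler` and the example URLs are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def limit_concurrency(
    func: Callable[[T], Awaitable[R]], max_concurrent: int
) -> Callable[[T], Awaitable[R]]:
    """Wrap `func` so that at most `max_concurrent` calls run at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def wrapper(item: T) -> R:
        # Calls beyond the limit wait here until a slot is released.
        async with semaphore:
            return await func(item)

    return wrapper


async def demo() -> int:
    """Run six fake requests with a limit of two and return the observed peak."""
    running = 0
    peak = 0

    async def fake_handler(url: str) -> str:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # simulate a network call
        running -= 1
        return url

    limited = limit_concurrency(fake_handler, 2)
    urls = [f"https://example.org/page-{i}" for i in range(6)]
    await asyncio.gather(*(limited(url) for url in urls))
    return peak


peak_concurrency = asyncio.run(demo())
print(peak_concurrency)  # → 2
```

Inside a notebook, where an event loop is already running, use `await demo()` instead of `asyncio.run(demo())`.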

# Main Function

This is the notebook's main function.

In [None]:
from wordlift_sdk.google_search_console import (
    raise_error_if_account_analytics_not_configured,
    create_google_search_console_data_import,
)
from wordlift_sdk.utils import get_me
from wordlift_sdk.wordlift.sitemap_import.create_or_update_kg import (
    create_or_update_kg_using_url_provider,
)
import gspread
from wordlift_sdk.kg.manager.urlprovider import (
    UrlProviderFactory,
    UrlProviderFactoryInput,
)
from wordlift_sdk.wordlift.sitemap_import.protocol import (
    ProtocolContext,
    load_override_class,
    DefaultImportUrlProtocol,
    DefaultParseHtmlProtocol,
)


async def main() -> None:
    # Define the context for protocol functions, it provides instances and data that those functions can use.
    protocol_context = ProtocolContext(
        configuration=configuration,
        types=OUTPUT_TYPES,
    )

    # Change the behavior of the import URL call.
    import_url_protocol = load_override_class(
        name="import_url_protocol",
        class_name="ImportUrlProtocol",
        # Default class to use in case of missing override.
        default_class=DefaultImportUrlProtocol,
        context=protocol_context,
    )

    # Change the behavior of the parse HTML call by parsing the web page html and sending patches to the graph.
    parse_html_protocol = load_override_class(
        name="parse_html_protocol",
        class_name="ParseHtmlProtocol",
        # Default class to use in case of missing override.
        default_class=DefaultParseHtmlProtocol,
        context=protocol_context,
    )

    # We can import from different sources: (1) sitemap or (2) Google Sheets or (3) list of URLs.
    # This will create the provider based on the configuration and the input parameters.
    url_provider = UrlProviderFactory.create(
        input_params=UrlProviderFactoryInput(
            sitemap_url=SITEMAP_URL,
            sheets_url=SHEETS_URL,
            sheets_name=SHEETS_NAME,
            sheets_creds_or_client=gspread.service_account(
                filename=SHEETS_SERVICE_ACCOUNT
            )
            if SHEETS_SERVICE_ACCOUNT
            else None,
            urls=URLS,
        )
    )

    # Import URLs.
    logger.info("Importing...")
    await create_or_update_kg_using_url_provider(
        configuration=configuration,
        key=WORDLIFT_KEY,
        url_provider=url_provider,
        types=OUTPUT_TYPES,
        concurrency=1,
        import_url_protocol=import_url_protocol,
        parse_html_protocol=parse_html_protocol,
    )

    # Check if we can import analytics.
    account = await get_me(configuration=configuration)
    try:
        has_analytics = await raise_error_if_account_analytics_not_configured(
            account=account
        )
    except ValueError as e:
        has_analytics = False
        logger.error(e)

    if has_analytics:
        url_list = set()
        async for url in url_provider.urls():
            url_list.add(url.value)

        logger.info(
            "Importing Google Search Console data for %d URLs: (1) load existing data and (2) create/update data...",
            len(url_list),
        )
        await create_google_search_console_data_import(
            configuration=configuration,
            key=WORDLIFT_KEY,
            url_list=url_list,
        )

        # logger.info(
        #     "Creating entity gaps data %d URLs...",
        #     len(url_list),
        # )
        # await create_entity_gaps(
        #     configuration=configuration,
        #     key=WORDLIFT_KEY,
        #     account=account,
        #     url_list=url_list
        # )

    if INTERNAL_LINKS:
        entities_with_embedding_vectors_df = await create_internal_links(
            configuration=configuration, key=WORDLIFT_KEY
        )

        # Log the URL and IRI of each processed entity.
        for _, row in entities_with_embedding_vectors_df.iterrows():
            logger.info("%s %s", row["url"], row["iri"])


await main()