# UUID retrieval

## Create search query and retrieve corresponding document UUIDs

Efficient document UUID retrieval requires a Solr search query. The query can be generated with the provided `QueryFactory` or with its counterpart inside `MZKScraper`. `QueryFactory` is a reverse-engineered version of the script that runs on MZK's website to convert human-readable queries into Solr queries for direct access through the API, so it is not perfect.

For text-based searches or more complicated conditions, `MZKScraper`'s `transform_query_from_hm_to_solr_using_mzk` method has to be used. This method uses `seleniumwire` to load the page dynamically and captures the XHR call that contains the desired Solr query.

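The capture step can be sketched roughly as follows. This is not the scraper's actual implementation: the helper names `pick_query_string` and `capture_request_urls`, the `path_marker` parameter, and the use of Chrome are assumptions made for illustration. Only `webdriver.Chrome`, `driver.requests`, `request.url` and `request.response` are `seleniumwire` features.

```python
from urllib.parse import urlparse


def pick_query_string(request_urls, path_marker="/search"):
    """Return the query string of the first URL whose path contains path_marker.

    path_marker is a hypothetical default; adjust it to the endpoint you observe.
    """
    for url in request_urls:
        parsed = urlparse(url)
        if path_marker in parsed.path and parsed.query:
            return parsed.query
    return None


def capture_request_urls(page_url):
    """Load page_url in a browser and list the URLs of all answered requests."""
    # Imported lazily so pick_query_string works without seleniumwire installed.
    from seleniumwire import webdriver

    driver = webdriver.Chrome()  # assumes a local Chrome and matching driver
    try:
        driver.get(page_url)
        # Requests without a response have not completed (or were blocked).
        return [request.url for request in driver.requests if request.response]
    finally:
        driver.quit()
```

A page that fires its search request late may need an explicit wait before `driver.requests` is read.
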
> Note: The easiest way to obtain the parameters is to open the MZK website and search for the desired documents; the query is part of the page's URL.

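Reading the parameters out of such a URL can be sketched with the standard library. The URL below is a made-up placeholder, and the assumption that the URL parameter names match the keyword arguments used later (`licences`, `doctypes`) is untested; `extract_search_parameters` is a hypothetical helper, not part of `MZKScraper`.

```python
from urllib.parse import parse_qs, urlparse


def extract_search_parameters(page_url):
    """Return the search parameters of a result page URL as a plain dict."""
    query = urlparse(page_url).query
    # parse_qs yields a list per key; keep the first value of each parameter.
    return {name: values[0] for name, values in parse_qs(query).items()}


# Placeholder URL for illustration only.
example_url = "https://example.org/mzk/search?licences=public&doctypes=sheetmusic"
print(extract_search_parameters(example_url))
```
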
### Resources

- [TypeScript implementation of the human-readable to Solr conversion](https://github.com/ceskaexpedice/kramerius-web-client/blob/master/src/app/services/solr.service.ts)
- [supported languages](docs/languages.json)
- [supported "physical locations"](docs/physical_locations.json)

### Limitations

`convolute` and `soundrecording` are supported as `doctypes`, but content UUIDs are not retrievable with the current implementation.

In [None]:
from mzkscraper.Scraper import MZKScraper

scraper = MZKScraper()

# create Solr query
solr_query = scraper.construct_solr_query_with_qf(licences="public", doctypes="sheetmusic")
print(solr_query)

# retrieve first 10 documents by query
retrieved_documents = scraper.retrieve_document_ids_by_solr_query(solr_query, requested_document_count=10)

# print results
print(f"Number of retrieved documents: {len(retrieved_documents)}")
for i, result in enumerate(retrieved_documents):
    print(f"{i}: {result}")

In [None]:
# create human-readable query
hm_query = scraper.construct_hm_query(text_query="Komenský")
print(hm_query)

# load MZK with seleniumwire and catch XHR
solr_query = scraper.transform_query_from_hm_to_solr_using_mzk(hm_query)
print(solr_query)

# retrieve first 10 documents by query
retrieved_documents2 = scraper.retrieve_document_ids_by_solr_query(solr_query, requested_document_count=10)

# print results
print(f"Number of retrieved documents: {len(retrieved_documents2)}")
for i, result in enumerate(retrieved_documents2):
    print(f"{i}: {result}")

## Retrieve page UUIDs from a document UUID

MZK provides labels for each page (roughly) in this format:

- `["number""letter"] ("type")`
- `["number"] ("type")`
- `"number" ("type")`
- `"number" "type"`

and possibly any other combination of these.

When a page is processed, its label is stripped down to `"type"` only. MZK labels are in camel case; by default, the output is in snake case. If labels are to be filtered inside the method, `valid_labels` should be a list of labels in snake case.

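The label handling described above can be approximated with two regular expressions. This is a minimal sketch under assumptions, not the library's code: `label_to_snake_case` is a hypothetical name, and the sample labels are illustrative inputs that follow the formats listed above.

```python
import re


def label_to_snake_case(raw_label):
    """Strip the page number from a raw label and return its type in snake case."""
    # Remove bracketed numbers such as "[1a]" and bare numbers such as "12".
    without_number = re.sub(r"\[[^\]]*\]|\b\d+[a-z]?\b", " ", raw_label)
    # Remove the optional parentheses around the type.
    label_type = without_number.replace("(", " ").replace(")", " ").strip()
    # Insert "_" at every lower-to-upper boundary, then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label_type).lower()


for sample in ["[1a] (TitlePage)", "[5] (Blank)", "3 (FrontCover)", "12 TitlePage"]:
    print(f"{sample!r} -> {label_to_snake_case(sample)!r}")
```
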
In [None]:
# retrieve page uuids using document uuid
retrieved_pages = scraper.get_pages_in_document(retrieved_documents[0])
for i, page in enumerate(retrieved_pages):
    print(f"{i + 1}: {page.page_id} label: {page.label}")

In [None]:
# retrieve only title pages (collecting all page UUIDs and filtering them afterwards is also an option)
# valid_labels are given in snake case, as described above
document_title_pages = scraper.get_pages_in_document(retrieved_documents[0], valid_labels=["title_page"])
for i, page in enumerate(document_title_pages):
    print(f"{i + 1}: {page.page_id} label: {page.label}")

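Filtering after retrieval, the alternative mentioned in the comment above, could look roughly like this. `PageStub` is a hypothetical stand-in for the objects returned by `get_pages_in_document`; only the `page_id` and `label` attributes are taken from the examples above, and the identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass
class PageStub:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""
    page_id: str
    label: str


def filter_pages_by_label(pages, wanted_labels):
    """Keep only the pages whose snake-case label is in wanted_labels."""
    wanted = set(wanted_labels)
    return [page for page in pages if page.label in wanted]


pages = [
    PageStub("uuid:placeholder-1", "front_cover"),
    PageStub("uuid:placeholder-2", "title_page"),
    PageStub("uuid:placeholder-3", "normal_page"),
]
for page in filter_pages_by_label(pages, ["title_page"]):
    print(f"{page.page_id} label: {page.label}")
```
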
## Display/download image using its UUID

In [None]:
scraper.get_image(document_title_pages[0].page_id)

# download page (uncommenting this requires `from pathlib import Path`)
# scraper.download_image(
#     document_title_pages[0].page_id,
#     "document_title_page.jpg",
#     Path("path/to/the/your_dir"),
#     verbose=True
# )

# Citations

## Collect information about a document

A plain-text ISO 690 citation can be requested directly from MZK. Other citations are generated using MZK's API and returned as instances of the `Citation` class, which is easy to extend.

In [None]:
from mzkscraper.Citations.CitationGenerator import MZKCitationGenerator

citgen = MZKCitationGenerator()

# cite page
print(citgen.get_iso_690_citation_directly(retrieved_pages[0].page_id))
print()
# cite document
print(citgen.get_iso_690_citation_directly(retrieved_documents[0]))
print()
# cite document without italicized title
print(citgen.get_iso_690_citation_directly(retrieved_documents[0], italic=False))

In [None]:
# cite the pages at indexes 2, 3, 6 and 10
# this method requests the document's metadata and stores the relevant fields in a Citation object
cited_pages = [
    citgen.retrieve_citation_data_from_document_metadata(
        retrieved_documents[0], retrieved_pages[index].page_id) for index in [2, 3, 6, 10]
]

for cit in cited_pages:
    print(cit)
    print()

## Merge citations

Sometimes there are multiple citations of one document that differ only in their page numbers. Such citations can be merged.

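Conceptually, merging means grouping page citations by document and collecting their page numbers. The sketch below illustrates that idea only. `PageCitationStub`, its fields and `merge_page_citations` are hypothetical and do not reflect the real `Citation` class.

```python
from dataclasses import dataclass, field


@dataclass
class PageCitationStub:
    """Hypothetical citation record: one document plus the cited pages."""
    document_id: str
    title: str
    pages: list = field(default_factory=list)


def merge_page_citations(citations):
    """Merge citations that share a document_id, keeping first-seen order."""
    merged = {}
    for citation in citations:
        if citation.document_id not in merged:
            # Copy the page list so the input objects stay untouched.
            merged[citation.document_id] = PageCitationStub(
                citation.document_id, citation.title, list(citation.pages)
            )
        else:
            merged[citation.document_id].pages.extend(citation.pages)
    return list(merged.values())
```

Dictionaries preserve insertion order, so documents come out in the order in which they were first cited.
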
In [None]:
grouped_citations = citgen.group_page_citation_by_document_id(cited_pages)

for cit in grouped_citations:
    print(cit)
    print()

## Generate citations in ISO 690 and BibTeX format

In [None]:
print(grouped_citations[0].get_iso_690_citation())
print()
print(grouped_citations[0].get_bibtex_citation())