<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/QueryGeneration/QueryGeneration_SentenceTransformers_GenQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query Generation - SentenceTransformers GenQ

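This notebook uses a T5 model from docTTTTTquery to generate search queries from a passage. The description below refers to the t5-base variant, while the code loads the large checkpoint.

The T5-base model was trained on the MS MARCO Passage Dataset, which consists of about 500k real search queries from Bing, each paired with a relevant passage.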

The model can be used to generate queries for training semantic search models without requiring annotated training data: [Synthetic Query Generation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/query_generation).
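As a rough illustration of that idea, the sketch below pairs each generated query with the passage it came from, which is the kind of (query, passage) data a semantic search model can be trained on. The names `build_training_pairs` and `fake_generate_queries` are hypothetical and not part of any library; the stand-in generator only exists so the sketch runs without downloading a model, and in practice it would be replaced by a call to the T5 model.

```python
from typing import Callable, Iterable, List, Tuple


def build_training_pairs(
    passages: Iterable[str],
    generate_queries: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Pair every generated query with the passage it was generated from."""
    pairs = []
    for passage in passages:
        for query in generate_queries(passage):
            query = query.strip()
            if query:  # skip empty generations
                pairs.append((query, passage))
    return pairs


# Hypothetical stand-in for the T5 generator (assumption: the real one
# takes a passage and returns a list of query strings).
def fake_generate_queries(passage: str) -> List[str]:
    first_word = passage.split()[0].lower()
    return [f"what is {first_word}", f"where is {first_word}"]


passages = [
    "Galicia is located in Atlantic Europe.",
    "Python is an interpreted, high-level programming language.",
]
pairs = build_training_pairs(passages, fake_generate_queries)
for query, passage in pairs:
    print(f"{query}\t{passage}")
```

Passing the generator in as a function keeps the pairing logic independent of the model, so the same helper works with either checkpoint.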

The pretrained models are available on the Hugging Face model hub as `BeIR/query-gen-msmarco-t5-large-v1` and `BeIR/query-gen-msmarco-t5-base-v1`.

Documentation: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/query_generation


## Install

Install Sentence Transformers, which pulls in `transformers` as a dependency:

In [None]:
!pip install -U sentence-transformers

## Query generation

In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Checkpoint to load; 'BeIR/query-gen-msmarco-t5-base-v1' is the smaller alternative
model_name = 'BeIR/query-gen-msmarco-t5-large-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# para = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
para = "Galicia is located in Atlantic Europe. It is bordered by Portugal to the south, the Spanish autonomous communities of Castile and León and Asturias to the east, the Atlantic Ocean to the west, and the Cantabrian Sea to the north. It had a population of 2,701,743 in 2018[4] and a total area of 29,574 km2 (11,419 sq mi). Galicia has over 1,660 km (1,030 mi) of coastline,[5] including its offshore islands and islets, among them Cíes Islands, Ons, Sálvora, Cortegada Island, and the largest and most populated, A Illa de Arousa."

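# Tokenize the passage into a batch of one
input_ids = tokenizer.encode(para, return_tensors='pt')
with torch.no_grad():  # no gradients are needed for generation
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,           # upper bound on the generated query length, in tokens
        do_sample=True,          # sample instead of decoding greedily, so output varies per run
        top_p=0.95,              # nucleus sampling over the top 95% of probability mass
        num_return_sequences=3)  # number of queries to generate for the passage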

print("Paragraph:")
print(para)

print("\nGenerated Queries:")
for i, output in enumerate(outputs, start=1):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i}: {query}')

Paragraph:
Galicia is located in Atlantic Europe. It is bordered by Portugal to the south, the Spanish autonomous communities of Castile and León and Asturias to the east, the Atlantic Ocean to the west, and the Cantabrian Sea to the north. It had a population of 2,701,743 in 2018[4] and a total area of 29,574 km2 (11,419 sq mi). Galicia has over 1,660 km (1,030 mi) of coastline,[5] including its offshore islands and islets, among them Cíes Islands, Ons, Sálvora, Cortegada Island, and the largest and most populated, A Illa de Arousa.

Generated Queries:
1: where is galicia spain
2: where is galicia located
3: where is galicia
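Because generation uses sampling (`do_sample=True`), the returned queries can overlap heavily, and with more return sequences or repeated runs exact duplicates are possible. A minimal sketch of a post-processing step that drops empty and repeated queries is shown below; `unique_queries` is a hypothetical helper, and the normalization it applies (lowercasing and collapsing whitespace) is an assumption rather than something the model or library prescribes.

```python
from typing import Iterable, List


def unique_queries(queries: Iterable[str]) -> List[str]:
    """Drop empty queries and duplicates, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for query in queries:
        # Normalized form used only for comparison
        key = " ".join(query.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    return unique


sampled = ["where is galicia", "Where is  Galicia ", "where is galicia located", ""]
print(unique_queries(sampled))
# ['where is galicia', 'where is galicia located']
```

The first occurrence of each query is kept in its original order, so the result stays aligned with the order in which the model returned them.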
