# QUERY TRANSLATION
Query translation converts a user's search query from one language or syntax into another, for example from French to English or from natural language to SQL, so that it can be run against a target system or database. It matters for multilingual search engines, database front ends, natural language processing, and cross-language information retrieval. Both machine translation models and rule-based methods are used, letting users search in their preferred language or format.

## From sample data
     - Install dependencies
     - Semantic search
     - Query translation models
     - Natural language filtering
     - Query translation with application

## Processing data from a PDF
     - Install dependencies
     - PDF text extraction
     - Splitting text
     - Creating index
     - Semantic search
     - Get French query from the user

## INSTALL DEPENDENCIES

In [None]:
%%capture
# %%capture suppresses this cell's output; the magic must stand alone on the first line of the cell.
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]


## SEMANTIC SEARCH

Semantic search is a search technique that aims to improve the accuracy of search results by understanding the intent and contextual meaning of a user's query, rather than relying solely on keyword matching. Unlike traditional search methods, which match search queries with specific keywords, semantic search takes into account the context, intent, and the relationship between words in a query to provide more relevant and meaningful results.
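To make the contrast with keyword matching concrete, here is a minimal sketch that ranks two headlines by cosine similarity. The three-dimensional vectors are hand-made for illustration and are an assumption of this example; a real embedding model produces learned vectors with hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (illustrative values only, not produced by any model).
documents = {
    "Vaccination rates continue to rise": [0.9, 0.1, 0.0],
    "New insights into climate change": [0.0, 0.2, 0.9],
}
query_text = "immunization progress"
query_vector = [0.8, 0.2, 0.1]

# Keyword matching: keep documents that share at least one word with the query.
query_words = set(query_text.lower().split())
keyword_hits = [doc for doc in documents if query_words & set(doc.lower().split())]

# Semantic matching: rank documents by vector similarity to the query.
ranked = sorted(documents, key=lambda doc: cosine(query_vector, documents[doc]), reverse=True)

print("Keyword hits:", keyword_hits)
print("Best semantic match:", ranked[0])
```

The query shares no words with either headline, so keyword matching returns nothing, while the vector comparison still places the vaccination headline first.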



In [None]:
from txtai.embeddings import Embeddings  # builds and queries a semantic index over text

In [None]:
# Sample data: List of news articles
news = ["Historical Discovery Sheds New Light on Ancient Civilization",
        "Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19",
        "COVID-19 Vaccination Rates Continue to Rise Across the Country",
        "Facts Reveal Stunning Insights into Climate Change",
        "Proverb: A stitch in time saves nine - Wise words for proactive action",
        "Hi, Is it raining?"
]

# Initialize Txtai's Embeddings class with a suitable pre-trained model
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

# Index the weekly news articles
embeddings.index([(uid, text, None) for uid, text in enumerate(news)])

# Run a search for "COVID-19" and retrieve the top 1 result
result = embeddings.search("COVID-19", 1)
print("RESULT:")
print(result)

RESULT:
[{'id': '1', 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19', 'score': 0.5100634098052979}]
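Each search result is a dictionary with `id`, `text` and `score` keys, so results can be post-processed with plain Python. The sketch below keeps only matches above a minimum score; the helper name `confident_hits` and the 0.4 cut-off are assumptions made for illustration, not values recommended by txtai.

```python
# Result copied from the search output above.
result = [{'id': '1',
           'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
           'score': 0.5100634098052979}]

MIN_SCORE = 0.4  # assumed cut-off, chosen only for this example

def confident_hits(results, min_score=MIN_SCORE):
    """Return (id, text) pairs for results scoring at least min_score."""
    return [(row["id"], row["text"]) for row in results if row["score"] >= min_score]

hits = confident_hits(result)
for uid, text in hits:
    print(uid, text)
```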


## QUERY TRANSLATION MODELS

In [None]:
from txtai.pipeline import Sequences  # runs sequence-to-sequence models (e.g. translation, summarization)

In [None]:
# Initialize the pipeline with a pre-trained T5 model that translates English to SQL
sequences = Sequences("NeuML/t5-small-txtsql")

# List of queries
queries = ["COVID",
           "COVID-19",
           "vaccination",
           "COVID vaccination",
           "COVID OR pandemic",
           "COVID NOT vaccination",
           "COVID cases rise",
           "COVID-19 OR coronavirus",
           "COVID cases rise to 1000",
           "COVID cases in India",
           "COVID fever and cough"]

# Prefix to pass to T5 model before each English query when translating it to SQL
prefix = "translate English to SQL: "

for query in queries:
    print(f"Input: {query}")  # original English query
    print(f"SQL: {sequences(query, prefix)}")  # pass the query and the prefix to the T5 model
    print()  # blank line between results for readability


Input: COVID
SQL: select id, text, score from txtai where similar('COVID')

Input: COVID-19
SQL: select id, text, score from txtai where similar('COVID-19')

Input: vaccination
SQL: select id, text, score from txtai where similar('vaccination')

Input: COVID vaccination
SQL: select id, text, score from txtai where similar('COVID vaccination')

Input: COVID OR pandemic
SQL: select id, text, score from txtai where similar('COVID') or pandemic

Input: COVID NOT vaccination
SQL: select id, text, score from txtai where similar('COVID NOT vaccination')

Input: COVID cases rise
SQL: select id, text, score from txtai where similar('COVID cases rise')

Input: COVID-19 OR coronavirus
SQL: select id, text, score from txtai where similar('COVID-19') and entry >= date('now', '-1 day')

Input: COVID cases rise to 1000
SQL: select id, text, score from txtai where similar('COVID cases') and entry >= date('now', '-1 day')

Input: COVID cases in India
SQL: select id, text, score from txtai where similar
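For the simple queries above, the model's output follows a fixed template. A rule-based baseline for that template can be sketched in a few lines. The function name `to_txtai_sql` is hypothetical, and doubling single quotes is standard SQL escaping; whether txtai's SQL parser accepts that escaping is an assumption not confirmed here.

```python
def to_txtai_sql(query: str) -> str:
    """Hypothetical rule-based translator: wrap the whole query in similar()."""
    # Double any single quotes so the query stays inside the SQL string literal.
    escaped = query.strip().replace("'", "''")
    return f"select id, text, score from txtai where similar('{escaped}')"

for query in ["COVID", "COVID vaccination", "what's new on COVID"]:
    print(f"Input: {query}")
    print(f"SQL: {to_txtai_sql(query)}")
    print()
```

Unlike the T5 model, this baseline never emits extra filters, but it also cannot interpret operators or conditions expressed in the query.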

## NATURAL LANGUAGE FILTERING

In [None]:
from txtai.pipeline import Translation  # pipeline that translates text between languages


In [None]:
# Initialize the Translation class
translation = Translation()

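# Define the translate function using the translation instance.
# Any exception is swallowed and None is returned instead, so a None
# 'text' value in the search results means the translation call failed.
def translate(text, lang):
    try:
        translated_text = translation(text, lang)
        return translated_text
    except Exception:
        return None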


# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "query": {"path": "NeuML/t5-small-txtsql"},
                         "functions": [translate]})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(news)])

query = "select id, score, translate(text, 'de') 'text' from txtai where similar('COVID-19')"

# Run a search using a custom SQL function
embeddings.search(query)[0]

{'id': '1', 'score': 0.5100634098052979, 'text': None}
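Calling a Python function from SQL, as the query above does with `translate`, can be illustrated with the standard library's `sqlite3` module. This is only an analogy for the mechanism and not txtai's internal code; the `GLOSSARY` lookup table is a made-up stand-in for a translation model.

```python
import sqlite3

# Stand-in for a translation model: a tiny lookup table (illustrative only).
GLOSSARY = {("Is it raining?", "de"): "Regnet es?"}

def translate(text, lang):
    # Return None when the stub has no translation, mirroring the wrapper above.
    return GLOSSARY.get((text, lang))

connection = sqlite3.connect(":memory:")
# Register the Python function under the SQL name "translate" with 2 arguments.
connection.create_function("translate", 2, translate)
connection.execute("create table docs (id integer, text text)")
connection.executemany("insert into docs values (?, ?)",
                       [(0, "Is it raining?"), (1, "Unknown sentence")])

rows = connection.execute(
    "select id, translate(text, 'de') as text from docs order by id").fetchall()
print(rows)
connection.close()
```

The second row comes back with `None` as its text, which is the same symptom as the search result above: the function was called, but it had nothing to return.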

In [None]:
embeddings.search("COVID")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.36603784561157227}

In [None]:
embeddings.search("COVID-19 news")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.5286498665809631}

In [None]:
embeddings.search("vaccination")[0]

{'id': '2',
 'text': 'COVID-19 Vaccination Rates Continue to Rise Across the Country',
 'score': 0.510772168636322}

In [None]:
embeddings.search("COVID vaccination")[0]

{'id': '2',
 'text': 'COVID-19 Vaccination Rates Continue to Rise Across the Country',
 'score': 0.513064444065094}

In [None]:
embeddings.search("COVID OR pandemic")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.36603784561157227}

In [None]:
embeddings.search("COVID NOT vaccination")[0]

{'id': '2',
 'text': 'COVID-19 Vaccination Rates Continue to Rise Across the Country',
 'score': 0.2103043645620346}

In [None]:
embeddings.search("COVID cases rise")[0]

{'id': '2',
 'text': 'COVID-19 Vaccination Rates Continue to Rise Across the Country',
 'score': 0.4123527407646179}

In [None]:
embeddings.search("COVID-19 OR coronavirus")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.5100634098052979}

In [None]:
embeddings.search("COVID cases rise to 1000")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.33649733662605286}

In [None]:
embeddings.search("COVID cases in India")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.33649733662605286}

In [None]:
embeddings.search("COVID-19")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.5100634098052979}

In [None]:
embeddings.search("COVID-19 missing in text")

[]

## QUERY TRANSLATION WITH APPLICATION

In [None]:
config = """
translation:

writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2
  content: true
  query:
    path: NeuML/t5-small-txtsql
  functions:
    - {name: translate, argcount: 2, function: translation}
"""



In [None]:
from txtai.app import Application

In [None]:
# Build application and index data
app = Application(config)
app.add([{"id": x, "text": row} for x, row in enumerate(news)])
app.index()

# Run search query
app.search("COVID-19")[0]

{'id': '1',
 'text': 'Last week Scientific Breakthrough Revolutionizes Medical Treatment regarding COVID-19',
 'score': 0.5100634098052979}

# QUERY TRANSLATION
## Processing data from a PDF

## INSTALL DEPENDENCIES

In [None]:
#!pip install langchain
#!pip install pypdf

In [None]:
%%capture
from langchain.text_splitter import RecursiveCharacterTextSplitter
from txtai.embeddings import Embeddings
from langchain.document_loaders import PyPDFLoader

## PDF Text Extraction

In [None]:
pdf_text = []
pdf_file_path = 'E:/Research_on_Machine_Learning_and_Its_Algorithms_an.pdf'

try:
    loader = PyPDFLoader(pdf_file_path)
    pages = loader.load()
    pdf_text.extend(pages)
    print("PDF text extraction successful!")
except Exception as e:
    print(f"Error: {e}")


PDF text extraction successful!


## SPLITTING TEXT

In [None]:
document_splitter = RecursiveCharacterTextSplitter(chunk_size=350,
                                                   chunk_overlap=25,
                                                   length_function=len)
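The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-width splitter. This sketch is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, whereas the hypothetical `split_with_overlap` below cuts at exact character offsets.

```python
def split_with_overlap(text, chunk_size=350, chunk_overlap=25):
    """Cut text into chunks of at most chunk_size characters, where each chunk
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop before a start offset that would yield only already-covered text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

sample = "abcdefghij" * 100  # 1000 characters of dummy text
chunks = split_with_overlap(sample)

print(len(chunks), [len(chunk) for chunk in chunks])
print(chunks[0][-25:] == chunks[1][:25])  # the overlap is shared between neighbours
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, as long as it is shorter than the overlap.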




In [None]:
split_data = []
for docs in pdf_text:
    print(docs)
    temp_split = document_splitter.split_text(docs.page_content)
    split_data.extend(temp_split)

page_content='Journal of Physics: Conference Series\nPAPER • OPEN ACCESS\nResearch on Machine Learning and Its Algorithms and Development\nTo cite this article: Wei Jin 2020 J. Phys.: Conf. Ser. 1544 012003\n\xa0\nView the article online  for updates and enhancements. \n \nThis content was downloaded from IP address 158.46.154.149 on 03/06/2020 at 13:35' metadata={'source': 'E:/Research_on_Machine_Learning_and_Its_Algorithms_an.pdf', 'page': 0}
page_content="Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution\nof this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.\nPublished under licence by IOP Publishing LtdICSP 2020 \nJournal of Physics: Conference Series 1544  (2020) 012003 IOP Publishing \ndoi:10.1088/1742-6596/1544/1/012003 \n1Research on Machine Learning and Its Algorithms and \nDevelopment  \nWei Jin  \nNorthwestern Polytechnical University Ming De Coll

In [None]:
# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "objects": True})

## CREATING INDEX

In [None]:
# Create an index for the list of text
embeddings.index([(uid,
                   text,
                   None) for uid, text in enumerate(split_data)])

## SEMANTIC SEARCH

In [None]:
embeddings.search("What is supervised learning?",1)

[{'id': '9',
  'text': 'complete the required learning content in a supervised environment. Compared with other learning \nmethods, supervised learning can fully stimulate the generalized learning potential of the machine \nitself. After completing the system learning, it can help people to solve some classification or',
  'score': 0.7316049337387085}]

The top result, the chunk with 'id': '9', is a passage describing supervised learning, returned with a similarity score of about 0.73.

## Get French query from the user

In [None]:
french_query = input("Enter your query in French: ")

Enter your query in French: Qu'est-ce que l'apprentissage non supervisé?


Example queries and their French equivalents:

- "What is supervised learning?" = "Qu'est-ce que l'apprentissage supervisé ?"
- "What is unsupervised learning?" = "Qu'est-ce que l'apprentissage non supervisé ?"

In [None]:
from transformers import MarianTokenizer, MarianMTModel

In [None]:
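# Heuristic: non-ASCII characters (e.g. "é") suggest the query is not English.
# Accent-free French queries pass this check and are searched untranslated.
if not french_query.isascii():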
    # Translate the query from French to English using MarianMT translation model
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
    model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

    # Tokenize and translate the query
    inputs = tokenizer(french_query, return_tensors="pt")
    translated_query = model.generate(**inputs)
    translated_text = tokenizer.decode(translated_query[0], skip_special_tokens=True)

    # Perform semantic search on the translated query
    search_results = embeddings.search(translated_text, 1)
    print("Translated Query:", translated_text)
    print("Search Results:", search_results)
else:
    # Perform semantic search on the original query
    search_results = embeddings.search(french_query, 1)
    print("Search Results:", search_results)

Translated Query: What is unsupervised learning?
Search Results: [{'id': '11', 'text': 'Corresponding to supervised learning is unsupervised learning. The so -called unsupervised learning \nmeans that the machine does not  mark the content in a certain direction during the entire learning', 'score': 0.6905990839004517}]
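The `isascii()` check used above is only a rough signal for language: French text without accented characters looks like English to it. The sketch below shows the gap and one possible fallback based on a few common French function words. The `FRENCH_HINTS` set and the `needs_translation` helper are hypothetical, and such a word list will misfire on some English queries; a dedicated language-detection library would be more reliable.

```python
import re

queries = [
    "Qu'est-ce que l'apprentissage non supervisé?",  # contains "é"
    "Comment fonctionne le clustering?",              # French, but ASCII only
    "What is supervised learning?",                   # English
]

# Small, hand-picked set of French function words (illustrative, not exhaustive).
FRENCH_HINTS = {"le", "la", "les", "est", "que", "une", "des", "du"}

def needs_translation(query):
    """Guess whether a query is French: non-ASCII characters or a hint word."""
    words = set(re.findall(r"\w+", query.lower()))
    return (not query.isascii()) or bool(words & FRENCH_HINTS)

for query in queries:
    print(f"isascii={query.isascii()!s:<5} needs_translation={needs_translation(query)!s:<5} {query}")
```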
