#**Query Processor for Comparative Analysis Ad Hoc Retrieval System**

Overview:<br>
This project implements a query processing pipeline for a retrieval system that compares technology products.
It turns queries such as "X vs Y" or "X vs Y for features" into structured search queries that the retrieval system uses to find relevant documents.

The components of this query processor are:
- Query Preprocessing - Text normalization & spell checking
- Query Splitting - Entity/feature separation
- Entity Recognition - Product name identification
- Contextual Enhancement - Entity context enrichment
- Feature Expansion - Expand feature with similar terms
- N-gram Generation - Search query construction

Implementation Details:
- Python implementation using NLTK (including WordNet) and spaCy
- WikiData API integration
- Edit distance for spell checking
- Hybrid entity recognition (rule-based + NLP)

Example:
Input: "iPhone vs Samsung for battery life and camera"<br>
Output: Structured queries with expanded terms & context

#**Query Preprocessing functions**<br>
This is the first stage of query processing. It consists of three steps:<br>
 - preprocess_query(): Converts to lowercase and splits query into items/features
 - remove_stopwords(): Eliminates common words from feature section
 - spell_check_tech_term(): Identifies potential spelling errors using edit distance

In [14]:
# Import required libraries
import re
from nltk.corpus import stopwords
from nltk.metrics.distance import edit_distance
from typing import List, Tuple

## Preprocess the query
This function converts the query to lowercase for consistency and removes any leading and trailing whitespace.
<br>Then it splits the query into the items to compare and the features, if any exist.

In [15]:
def preprocess_query(query: str) -> Tuple[str, str]:

    # Convert to lowercase
    query = query.lower().strip()

    # Split into items and features at the first " for ", if present.
    # maxsplit=1 keeps later occurrences of " for " inside the features part.
    parts = query.split(" for ", 1)
    items_part = parts[0]
    features_part = parts[1] if len(parts) > 1 else ""

    return items_part, features_part

# Test some queries
test_words = ['iPhone 15 vs iPhone 15 Pro', 'Amazon Echo Studio vs Apple HomePod 2 vs Google Nest Audio for sound quality',
              '   Tesla Model 3 vs BMW i4 for range and performance ', ' Nvidia RTX 4080 vs AMD Radeon RX 7900 XTX vs Intel Arc A770 for gaming performance  ']
for word in test_words:
    corrections = preprocess_query(word)
    print(f"Original: {word}")
    print(f"After preprocessing: {corrections}\n")


Original: iPhone 15 vs iPhone 15 Pro
After preprocessing: ('iphone 15 vs iphone 15 pro', '')

Original: Amazon Echo Studio vs Apple HomePod 2 vs Google Nest Audio for sound quality
After preprocessing: ('amazon echo studio vs apple homepod 2 vs google nest audio', 'sound quality')

Original:    Tesla Model 3 vs BMW i4 for range and performance 
After preprocessing: ('tesla model 3 vs bmw i4', 'range and performance')

Original:  Nvidia RTX 4080 vs AMD Radeon RX 7900 XTX vs Intel Arc A770 for gaming performance  
After preprocessing: ('nvidia rtx 4080 vs amd radeon rx 7900 xtx vs intel arc a770', 'gaming performance')



## Remove stopwords
Remove common words such as 'a' and 'the' that may appear in the query, using NLTK's stopword corpus. To avoid fetching the stopword list every time, we download it once, save it to a local file, and load it from the file system on later runs.
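The notebook's helper re-reads the stopword file on every call. The sketch below caches the loaded set in memory so the file is read only once per session. The function names, the demo file name and the small fallback word list are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stopword list.

```python
import os
import tempfile
from functools import lru_cache

# Hypothetical stand-in list; the notebook uses NLTK's English stopwords instead.
FALLBACK_STOPWORDS = {"a", "an", "and", "the", "or", "for"}
DEMO_FILE = os.path.join(tempfile.gettempdir(), "stopwords_demo.txt")

@lru_cache(maxsize=None)
def load_stopwords_cached(path: str = DEMO_FILE) -> frozenset:
    """Read the stopword file once; later calls return the cached set."""
    if not os.path.exists(path):
        # First run: create the file from the fallback list
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(sorted(FALLBACK_STOPWORDS)))
    with open(path, "r", encoding="utf-8") as f:
        return frozenset(f.read().splitlines())

def remove_stopwords_cached(text: str) -> str:
    """Drop stopwords from text using the cached set."""
    stop_words = load_stopwords_cached()
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords_cached("an iPhone"))
print(remove_stopwords_cached("a galaxy phone"))
print(load_stopwords_cached.cache_info())  # the second call is served from the cache
```

Returning a `frozenset` keeps the cached value immutable, so one caller cannot accidentally alter the stopword list seen by another.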

In [29]:
import os
import nltk  # needed for nltk.download() below

STOPWORDS_FILE = 'stopwords.txt'  # File to store stopwords

def load_or_download_stopwords():
    """Loads stopwords from file or downloads and saves them if not found."""
    try:
        # Attempt to load from file
        with open(STOPWORDS_FILE, 'r', encoding='utf-8') as f:
            stop_words = set(f.read().splitlines())

    except FileNotFoundError:
        # Download and save if file not found
        nltk.download('stopwords', quiet=True)  # Download quietly
        stop_words = set(stopwords.words('english'))
        with open(STOPWORDS_FILE, 'w', encoding='utf-8') as f:
            f.write('\n'.join(stop_words))
    return stop_words

def remove_stopwords(text: str) -> str:
    """Remove stopwords from text"""
    stop_words = load_or_download_stopwords()
    words = text.split()
    return ' '.join([w for w in words if w.lower() not in stop_words])

# Test some queries
test_words = ['an iPhone', 'a galaxy phone', ' the proccesor', 'and bluetooth']
for word in test_words:
    corrections = remove_stopwords(word)
    print(f"Original: {word}")
    print(f"After removing stopwords: {corrections}\n")


Original: an iPhone
After removing stopwords: iPhone

Original: a galaxy phone
After removing stopwords: galaxy phone

Original:  the proccesor
After removing stopwords: proccesor

Original: and bluetooth
After removing stopwords: bluetooth



## Spell Checking
This step checks the query terms for spelling errors. Since most of the words we expect are technology-related, we first need a corpus of technology terms for the spell checker to refer to.

Here we build our own tech corpus from WordNet plus some manual entries, such as common product brands and product lines. To pull technology-related terms from WordNet, we give it a few predefined keywords, find the synsets whose definitions mention them, and collect their lemma names. This gives the spell checker a reasonable starting vocabulary. The process is not exhaustive: keeping up with current products would require collecting much more data, so this may not be the best possible vocabulary.

In [17]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.metrics.distance import edit_distance
from typing import List, Set, Dict
import json


# Download required NLTK data
nltk.download('wordnet', quiet=True)

tech_corpus = {
    'brands': set(),
    'terms': set()
}

# Add common tech brands
tech_corpus['brands'].update([
    'apple', 'samsung', 'google', 'microsoft', 'intel', 'amd', 'nvidia',
   'dell', 'hp', 'lenovo', 'asus', 'acer', 'xiaomi', 'oneplus', 'vivo',
    'oppo', 'realme', 'nokia', 'motorola'
])

# Add common product lines
tech_corpus['terms'].update([
    'iphone', 'galaxy', 'pixel', 'macbook', 'surface', 'thinkpad',
   'inspiron', 'pavilion', 'zenbook', 'ideapad'
])

# Get technology-related terms from WordNet
tech_keywords = ['technology', 'computer', 'device', 'digital', 'electronic']

for synset in wn.all_synsets():
    # Check if synset is related to technology
    if any(keyword in synset.definition().lower() for keyword in tech_keywords):
        # Add all lemma names from the synset
        tech_corpus['terms'].update(
            lemma.name().lower() for lemma in synset.lemmas()
        )

# Add common technical terms that might be missing from WordNet
tech_corpus['terms'].update([
    'cpu', 'gpu', 'ram', 'ssd', 'hdd', 'wifi', '5g', '4g', 'bluetooth',
    'processor', 'memory', 'storage', 'display', 'screen', 'camera',
    'battery', 'wireless', 'resolution', 'performance', 'graphics',
    'keyboard', 'touchscreen', 'fingerprint', 'security', 'charging',
    'port', 'usb', 'type-c', 'headphone', 'speaker'
])

# Save corpus to file for future use
with open('tech_corpus.json', 'w') as f:
  json.dump({k: list(v) for k, v in tech_corpus.items()}, f, indent=2)

Now we check spelling against the tech corpus we just created. We use edit distance to find the closest corpus word, i.e. the one reachable with the fewest correction operations, and return the original word together with the potential correction.
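To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the standard Levenshtein distance (insertions, deletions and substitutions, each costing 1). The notebook itself relies on NLTK's `edit_distance`; the `levenshtein` function below is only an illustrative stand-in and is not used by the pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]

for typo, target in [("ifone", "iphone"), ("galxy", "galaxy"), ("gaming", "jamming")]:
    print(f"{typo} -> {target}: {levenshtein(typo, target)}")
```

Note that 'gaming' is only two edits away from 'jamming', so with a threshold of 2 a correctly spelled word that is missing from the corpus can still receive a suggestion.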

In [53]:
def spell_check_tech_term(word: str, corpus: Dict[str, Set[str]], threshold: int = 2) -> List[str]:
    """
    Check spelling against tech corpus
    Returns [word] if the word is in the corpus or no correction is found,
    otherwise [word, closest_term]
    """
    word = word.lower()

    # Check if word is already in corpus
    if word in corpus['brands'] or word in corpus['terms']:
        return [word]

    best_correction = None
    min_distance = threshold + 1  # Initialize with value higher than threshold

    # Check against brands first (with stricter threshold)
    for brand in corpus['brands']:
        distance = edit_distance(word, brand)
        if distance <= 1 and distance < min_distance:  # Stricter threshold for brands
            best_correction = brand
            min_distance = distance

    # Check against technical terms
    for term in corpus['terms']:
        distance = edit_distance(word, term)
        if distance <= threshold and distance < min_distance:
            best_correction = term
            min_distance = distance

    return [word] + [best_correction] if best_correction else [word]

# Test some misspellings
test_words = ['ifone', 'galxy', 'procesor', 'bluethooth', 'gaming']

for word in test_words:
    corrections = spell_check_tech_term(word, tech_corpus)
    print(f"Original: {word}")
    print(f"Suggestions: {corrections}\n")


Original: ifone
Suggestions: ['ifone', 'iphone']

Original: galxy
Suggestions: ['galxy', 'galaxy']

Original: procesor
Suggestions: ['procesor', 'processor']

Original: bluethooth
Suggestions: ['bluethooth', 'bluetooth']

Original: gaming
Suggestions: ['gaming', 'jamming']



#**Entity Recognition**
Now that the query is 'cleaned up', we can look for the products to compare in the query. Terms closely associated with a product can make the search easier, so we also fetch contextually related terms for each product. For example, for the term 'iphone 12', the term 'apple' would be a good addition to the search terms. That is what the following functions do:
- extract_entities(): Uses pattern matching and NER to identify products
- get_contextual_terms_for_entities(): Fetches related terms from WikiData API

## Extracting entities
First we extract entities from the items half of the query using a pattern-matching approach: we split on the term "vs", which gives us the entities that we need to compare. Then we use spaCy's Named Entity Recognition (NER) to get the named entities within each of those terms.
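A plain substring split on "vs" also cuts inside words that merely contain those letters, such as "tvs". The sketch below splits only on a standalone separator token. The helper name `split_items` is hypothetical, and accepting "vs." and "versus" as separators is an assumption that goes beyond what the notebook handles.

```python
import re
from typing import List

# Matches "vs", "vs." or "versus" only when surrounded by whitespace.
# Supporting "vs." and "versus" is an assumption made for this sketch.
VS_SEPARATOR = re.compile(r"\s+(?:vs\.?|versus)\s+", flags=re.IGNORECASE)

def split_items(items_text: str) -> List[str]:
    """Split the items half of a query into the individual items to compare."""
    return [part.strip() for part in VS_SEPARATOR.split(items_text) if part.strip()]

# Substring split: breaks "tvs" apart
print("samsung tvs vs lg tvs".split("vs"))
# Token-based split: keeps "tvs" intact
print(split_items("samsung tvs vs lg tvs"))
print(split_items("iphone 15 vs. iphone 15 pro versus pixel 8"))
```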

In [19]:
import re
import spacy
from typing import List
import requests

# Load the spaCy model once instead of on every call
nlp = spacy.load("en_core_web_sm")

def extract_all_entities(items_text: str) -> List[List[str]]:
    """Split the items text on the standalone word "vs" and extract entities for each item"""
    # Pattern matching on "vs" as a whole word, so words that merely contain "vs" are not split
    pattern_entities = [e.strip() for e in re.split(r"\bvs\b", items_text)]
    entity_list = []
    for entity in pattern_entities:
        entity_list.append(extract_entities(entity))
    return entity_list

def extract_entities(items_text: str) -> List[str]:
    """Extract entities for a single item using the item text itself plus NER"""
    # NER approach using spaCy
    doc = nlp(items_text)
    ner_entities = [ent.text for ent in doc.ents]

    # Combine the full item text with the NER results and remove duplicates
    all_entities = list(set([items_text] + ner_entities))
    return [e for e in all_entities if e]

# Test some queries
test_words = ['iphone 15 vs iphone 15 pro', 'amazon echo studio vs apple homepod 2 vs google nest audio',
              'tesla model 3 vs bmw i4', 'nvidia rtx 4080 vs amd radeon rx 7900 xtx vs intel arc a770']
for word in test_words:
    corrections = extract_all_entities(word)
    print(f"Original: {word}")
    print(f"After preprocessing: {corrections}\n")

Original: iphone 15 vs iphone 15 pro
After preprocessing: [['iphone 15', '15'], ['iphone 15 pro', '15']]

Original: amazon echo studio vs apple homepod 2 vs google nest audio
After preprocessing: [['amazon echo', 'amazon echo studio'], ['apple homepod 2', 'apple homepod', '2'], ['google nest audio', 'google']]

Original: tesla model 3 vs bmw i4
After preprocessing: [['3', 'tesla model 3'], ['bmw i4']]

Original: nvidia rtx 4080 vs amd radeon rx 7900 xtx vs intel arc a770
After preprocessing: [['nvidia rtx 4080', '4080', 'nvidia'], ['amd radeon rx 7900 xtx'], ['intel', 'intel arc a770']]



## Get contextual terms for the Entities
Now we query the WikiData API to get contextual terms for each entity in the list produced by extract_entities() for each item in the items_part of the query. We do not want to overload the query with a huge list of terms, so we fetch 2 terms per entity.
<br> <br>We initialize stopwords and a number pattern to filter out any terms that are stopwords or bare numbers. We only fetch contextual terms for multi-word entities: single words tend to relate to a much wider range of things than what we are looking for, which could add non-contextual terms to the final query, so we skip them. We also skip entities shorter than 3 characters for the same reason.
<br>Now we have an expanded entity list for each item that gives us a good representation of the product.
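The filtering rules described above can be sketched without the network call. The helper names (`should_expand`, `top_terms`), the small stopword set and the sample description are hypothetical and chosen only for illustration. Stripping surrounding punctuation is an addition that the notebook's function does not perform, which is why a token like 'affordable,' can appear with its trailing comma in the notebook's output.

```python
import re
import string
from typing import List, Set

NUMBER_PATTERN = re.compile(r"^\d+$")

def should_expand(entity: str) -> bool:
    """Expand only multi-word, non-numeric entities of at least 3 characters."""
    return (not NUMBER_PATTERN.match(entity)
            and len(entity) >= 3
            and len(entity.split()) >= 2)

def top_terms(description: str, stop_words: Set[str], limit: int = 2) -> List[str]:
    """Pick the first `limit` distinct informative terms from a description."""
    terms = []
    for raw in description.lower().split():
        term = raw.strip(string.punctuation)  # assumption: drop surrounding punctuation
        if len(term) > 2 and term not in stop_words and not NUMBER_PATTERN.match(term):
            terms.append(term)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(terms))[:limit]

demo_stop_words = {"a", "of", "by", "the", "and"}  # stand-in for the NLTK list
print(should_expand("15"), should_expand("google"), should_expand("iphone 15"))
print(top_terms("Affordable, voice-enabled smart speaker by a brand", demo_stop_words))
```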

In [42]:
from typing import Dict, List
import requests
from nltk.corpus import stopwords
import re

def get_contextual_terms_for_entities(entities: List[str]) -> List[str]:
   """
   Get top 2 contextual terms for a list of entities
   Returns expanded list containing entities and their contextual terms

   Args:
       entities: List of entity strings to get context for
   Returns:
       List containing original entities and their contextual terms
   """
   # Initialize stopwords and generic terms to filter out
   stop_words = load_or_download_stopwords()
   generic_terms = {'the', 'a', 'an', 'and', 'or', 'of', 'in', 'at', 'to'}

   # Regex for numbers and very short terms
   number_pattern = re.compile(r'^\d+$')

   # Result list
   expanded_entities = []

   for entity in entities:
       # Add original entity first
       expanded_entities.append(entity)

       # Skip if entity is too short or a single term
       if (number_pattern.match(entity) or
           len(entity) < 3 or
           len(entity.split(" ")) < 2):
           continue

       url = "https://www.wikidata.org/w/api.php"
       params = {
           "action": "wbsearchentities",
           "format": "json",
           "language": "en",
           "limit": 3,  # Request slightly more to account for filtering
           "search": entity
       }

       try:
           response = requests.get(url, params=params, timeout=10)  # avoid hanging on a slow API
           data = response.json()

           if 'search' in data:
               # Get descriptions and extract terms
               terms = []
               for item in data['search']:
                   desc = item.get('description', '').lower()
                   if desc:
                       # Split description and filter out generic terms
                       desc_terms = [
                           term for term in desc.split()
                           if (len(term) > 2 and
                               term not in stop_words and
                               term not in generic_terms and
                               not number_pattern.match(term))
                       ]
                       terms.extend(desc_terms)

               # Get unique terms and take top 2
               unique_terms = list(dict.fromkeys(terms))[:2]

               if unique_terms:  # Only add if we found valid terms
                   expanded_entities.extend(unique_terms)

       except Exception as e:
           print(f"Error processing {entity}: {str(e)}")
           continue

   return expanded_entities

# Test with entity lists
test_words = [['15', 'iphone 15', 'iphone 15 pro'], ['apple homepod 2', 'amazon echo', 'amazon echo studio', '2', 'google nest audio', 'google'],
              ['bmw i4', 'tesla model 3', '3'], ['nvidia', '4080', 'intel', 'nvidia rtx 4080', 'intel arc a770', 'amd radeon rx 7900 xtx']]
for word_list in test_words:
  contexts = get_contextual_terms_for_entities(word_list)
  print(f"Original: {word_list}")
  print(f"After expanding: {contexts}\n")


Original: ['15', 'iphone 15', 'iphone 15 pro']
After expanding: ['15', 'iphone 15', '17th-generation', 'smartphones', 'iphone 15 pro', 'smartphone', 'apple']

Original: ['apple homepod 2', 'amazon echo', 'amazon echo studio', '2', 'google nest audio', 'google']
After expanding: ['apple homepod 2', 'amazon echo', 'brand', 'affordable,', 'amazon echo studio', '2', 'google nest audio', 'voice-enabled', 'smart', 'google']

Original: ['bmw i4', 'tesla model 3', '3']
After expanding: ['bmw i4', 'electric', 'automobile', 'tesla model 3', 'all-electric', 'four-door', '3']

Original: ['nvidia', '4080', 'intel', 'nvidia rtx 4080', 'intel arc a770', 'amd radeon rx 7900 xtx']
After expanding: ['nvidia', '4080', 'intel', 'nvidia rtx 4080', 'intel arc a770', 'graphics', 'card', 'amd radeon rx 7900 xtx', 'graphics', 'card']



#**Feature Expansion**
Now that we have a good representation of the items in the search query, we need to do something similar for the features that the user is trying to compare. If the user is looking for, say, 'battery life', then the term 'longevity' occurring in a document is a good indicator of battery life. So we need to fetch words that mean the same as, or something similar to, the feature terms, but within a tech context. To do that we follow these steps:
- split_on_stop_words(): Splits the features_part into a list of features based on the stopwords in the string
- get_contextual_terms_for_features(): Fetches contextual terms for each feature from the WikiData API
- expand_features(): Generates expanded feature set with synonyms

## Splitting Features
This function splits the features_part into a list of features based on the stopwords in the string. The features_part may look something like 'sound quality and battery life'; we split on stopwords rather than on every individual word.<br>
This is because a feature may span multiple words, and 'sound' and 'quality' on their own do not represent the term 'sound quality' that the user is looking for.
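As an alternative sketch of the same idea, the split can be done with a single regular expression that treats each stopword as a whole word. Unlike a replacement that requires a space on both sides, this also catches a stopword at the very start or end of the string. The function name and the small demo stopword set are hypothetical; the notebook uses NLTK's full English list.

```python
import re
from typing import List, Set

def split_on_stop_words_regex(text: str, stop_words: Set[str]) -> List[str]:
    """Split text into phrases wherever a stopword occurs as a whole word."""
    text = text.lower().strip()
    if not stop_words:
        return [text] if text else []
    # Longest alternatives first so multi-character stopwords are preferred
    alternation = "|".join(sorted(map(re.escape, stop_words), key=len, reverse=True))
    pattern = re.compile(rf"\b(?:{alternation})\b")
    return [phrase.strip() for phrase in pattern.split(text) if phrase.strip()]

demo_stop_words = {"and", "or", "as", "for", "the"}  # stand-in for the NLTK list
print(split_on_stop_words_regex("the sound quality and performance", demo_stop_words))
print(split_on_stop_words_regex("display resolution as well as brightness", demo_stop_words))
```

One limitation of this sketch: hyphens count as word boundaries, so a hyphenated term that contains a stopword would also be split.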

In [30]:
def split_on_stop_words(text: str) -> List[str]:
    """
    Split text on NLTK stopwords while preserving the original phrases

    Args:
        text: Input string to split

    Returns:
        List of split phrases

    Example:
        "sound quality and performance" -> ["sound quality", "performance"]
    """
    stop_words = load_or_download_stopwords()

    # Add spaces around stop words to ensure clean splits
    processed_text = text.lower()
    for stop_word in stop_words:
        # Only replace if the stop word is a complete word (surrounded by spaces)
        processed_text = processed_text.replace(f" {stop_word} ", " || ")

    # Split on the delimiter and clean up the results
    phrases = [
        phrase.strip()
        for phrase in processed_text.split("||")
        if phrase.strip()
    ]

    return phrases


# Test cases
test_cases = [
    "sound quality and performance",
    "display resolution as well as brightness",
    "price or value for money",
    "camera quality and low light performance and zoom capability"
]

for test in test_cases:
    result = split_on_stop_words(test)
    print(f"\nInput: {test}")
    print(f"Output: {result}")


Input: sound quality and performance
Output: ['sound quality', 'performance']

Input: display resolution as well as brightness
Output: ['display resolution', 'well', 'brightness']

Input: price or value for money
Output: ['price', 'value', 'money']

Input: camera quality and low light performance and zoom capability
Output: ['camera quality', 'low light performance', 'zoom capability']


## Expanding the features
Here we use the WikiData API to get contextual terms for each feature in the list produced by split_on_stop_words() from the features_part of the query.

We use a function similar to the contextual term generation for entities, but with slightly different rules: we do not skip single-word terms.

In [43]:
from typing import Dict, List
import requests
from nltk.corpus import stopwords
import re

def get_contextual_terms_for_features(features: List[str]) -> List[str]:
   """
   Get top 2 contextual terms for a list of features
   Returns expanded list containing features and their contextual terms

   Args:
       features: List of feature strings to get context for
   Returns:
       List containing original features and their contextual terms
   """
   # Initialize stopwords and generic terms to filter out
   stop_words = set(stopwords.words('english'))
   generic_terms = {'the', 'a', 'an', 'and', 'or', 'of', 'in', 'at', 'to'}

   # Regex for numbers and very short terms
   number_pattern = re.compile(r'^\d+$')

   # Result list
   expanded_features = []

   for entity in features:
       # Add original entity first
       expanded_features.append(entity)

       # Skip if entity is too short
       if (number_pattern.match(entity) or
           len(entity) < 3):
           continue

       url = "https://www.wikidata.org/w/api.php"
       params = {
           "action": "wbsearchentities",
           "format": "json",
           "language": "en",
           "limit": 3,  # Request slightly more to account for filtering
           "search": entity
       }

       try:
           response = requests.get(url, params=params, timeout=10)  # avoid hanging on a slow API
           data = response.json()

           if 'search' in data:
               # Get descriptions and extract terms
               terms = []
               for item in data['search']:
                   desc = item.get('description', '').lower()
                   if desc:
                       # Split description and filter out generic terms
                       desc_terms = [
                           term for term in desc.split()
                           if (len(term) > 2 and
                               term not in stop_words)
                       ]
                       terms.extend(desc_terms)

               # Get unique terms and take top 2
               unique_terms = list(dict.fromkeys(terms))[:2]

               if unique_terms:  # Only add if we found valid terms
                   expanded_features.extend(unique_terms)

       except Exception as e:
           print(f"Error processing {entity}: {str(e)}")
           continue

   return expanded_features

# Test with feature lists
test_words = [['sound quality'], ['performance range']]
for word_list in test_words:
  contexts = get_contextual_terms_for_features(word_list)
  print(f"Original: {word_list}")
  print(f"After expanding: {contexts}\n")


Original: ['sound quality']
After expanding: ['sound quality', 'assessment', 'audio']

Original: ['performance range']
After expanding: ['performance range']



This function combines all the features and their contextually relevant terms into one list.

In [35]:
def expand_features(features: List[str]) -> List[str]:
    """Expand features with their synonyms"""
    expanded_features = []
    for feature in features:
        # Add original feature
        expanded_features.append(feature)
        # Add relevant terms
        expanded_features.extend(get_contextual_terms_for_features([feature]))

    return list(set(expanded_features))

#'sound quality'
test_words = [['sound quality'], ['performance range']]
for word in test_words:
  syns = expand_features(word)
  print(f"Original: {word}")
  print(f"Expanded features: {syns}\n")

Original: ['sound quality']
Expanded features: ['assessment', 'audio', 'sound quality']

Original: ['performance range']
Expanded features: ['performance range']



#**Query Generation**
Now that we have the expanded lists of entities and features, we generate 1-, 2- and 3-grams for each entity and feature combination. This step gives us the final tokenized query terms for each entity/item in the original query.
- generate_ngrams(): Creates combinations of entities and features
- generate_final_queries(): Combines all the queries into a list as a final result.
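The number of strings produced for one item grows quickly with the number of entity terms. Reading the loop structure of generate_ngrams() in this notebook, the output before de-duplication contains every entity, every feature, every entity-feature pair, and every feature combined with each group of 2 up to n-1 entities. The sketch below computes that upper bound; the function name `ngram_upper_bound` is hypothetical and the count is derived from the notebook's code, not from any library.

```python
from math import comb

def ngram_upper_bound(num_entities: int, num_features: int, n: int = 3) -> int:
    """Upper bound on the strings generate_ngrams() returns (before set() de-duplication)."""
    # Entity groups of size 2 .. min(n, num_entities + 1) - 1, mirroring the loop's range()
    entity_groups = sum(
        comb(num_entities, size)
        for size in range(2, min(n, num_entities + 1))
    )
    singles = num_entities + num_features
    combined = num_features * (num_entities + entity_groups)
    return singles + combined

print(ngram_upper_bound(2, 2))  # two entities, two features
print(ngram_upper_bound(2, 3))  # two entities, three features
print(ngram_upper_bound(4, 3))  # four entities, three features
```

With two entities and two features the bound is 10, which matches the ten lines printed by the test case for `["iphone", "15"]` and `["sound", "performance"]`.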

In [36]:
from typing import List, Dict
from itertools import combinations

def generate_ngrams(entities: List[str], features: List[str], n: int = 3) -> List[str]:
    """Generate n-grams for entity list and feature combinations

    For each feature, combine it with all possible entity combinations

    Args:
        entities: List of entity strings (e.g., ["iphone", "15"])
        features: List of feature strings (e.g., ["sound", "performance"])
        n: Maximum n-gram size

    Returns:
        List of unique n-gram combinations
    """
    ngrams = []

    # Add individual terms
    ngrams.extend(entities)
    ngrams.extend(features)

    # For each feature, generate combinations with entities
    for feature in features:
        # Add direct entity + feature combinations
        for entity in entities:
            ngrams.append(f"{entity} {feature}")

        # Add combinations of multiple entities + feature
        for i in range(2, min(n, len(entities) + 1)):
            for entity_combo in combinations(entities, i):
                ngrams.append(f"{' '.join(entity_combo)} {feature}")

    return list(set(ngrams))

# Test case
entities = ["iphone", "15"]
features = ["sound", "performance"]

ngrams = generate_ngrams(entities, features)
print("\nGenerated n-grams:")
print("\n".join(sorted(ngrams)))




Generated n-grams:
15
15 performance
15 sound
iphone
iphone 15 performance
iphone 15 sound
iphone performance
iphone sound
performance
sound


This function calls generate_ngrams() once for each item's entity list, so that every item's entities are combined with the features.

In [25]:
def generate_final_queries(enriched_entities: List[List[str]], expanded_features: List[str]) -> List[List[str]]:
    """Generate final queries by combining enriched_entities and expanded_features"""
    final_queries = []
    for entity_list in enriched_entities:
        query = generate_ngrams(entity_list, expanded_features)
        # Shortest query strings first
        query.sort(key=len)
        final_queries.append(query)
    return final_queries

#**Combining all the steps**
We encapsulate all the steps into one function, which takes the query as input and returns a tuple with the number of queries and the list of tokenized queries.

In [48]:
def query_generator(query: str) -> tuple[int, List[List[str]]]:
    """Generate queries for a given query string"""

    # 1. Preprocess: lowercase, trim, and split into items and features
    items_part, features_part = preprocess_query(query)
    # e.g. "iPhone vs Samsung for battery life and camera quality" gives
    # items_part = "iphone vs samsung", features_part = "battery life and camera quality"

    # 2. Extract entities: one list of candidate entities per compared item
    entities_list = extract_all_entities(items_part)

    # 3. Get context: each entity list is extended with WikiData description terms
    enriched_entities = [get_contextual_terms_for_entities(e) for e in entities_list]

    # 4. Split the features text into a list of feature phrases
    #    e.g. "battery life and camera quality" -> ["battery life", "camera quality"]
    features_part = split_on_stop_words(features_part)

    # 5. Expand features with their contextual terms
    expanded_features = expand_features(features_part)

    # 6. Generate queries: one sorted n-gram list per compared item
    final_queries = generate_final_queries(enriched_entities, expanded_features)

    return (len(final_queries), final_queries)


#Test the Query Processor

In [55]:
# Input query
query = "iPhone 15 vs iPhone 15 Pro for battery and camera quality"

queries = query_generator(query)

print(f"Number of queries: {queries[0]}")

for query in queries[1]:
  print(query)

Number of queries: 2
['15', 'one', '15 one', 'battery', 'assembly', 'iphone 15', '15 battery', 'smartphones', '15 assembly', 'iphone 15 one', 'camera quality', 'smartphones one', '17th-generation', 'iphone 15 15 one', 'iphone 15 battery', '15 camera quality', 'smartphones 15 one', 'iphone 15 assembly', 'smartphones battery', '17th-generation one', 'iphone 15 15 battery', 'smartphones assembly', 'iphone 15 15 assembly', '17th-generation 15 one', 'smartphones 15 battery', 'smartphones 15 assembly', '17th-generation battery', '17th-generation assembly', 'iphone 15 camera quality', 'iphone 15 smartphones one', '17th-generation 15 battery', 'smartphones camera quality', 'iphone 15 15 camera quality', '17th-generation 15 assembly', 'iphone 15 smartphones battery', 'smartphones 15 camera quality', 'iphone 15 17th-generation one', '17th-generation camera quality', 'iphone 15 smartphones assembly', '17th-generation smartphones one', 'iphone 15 17th-generation battery', '17th-generation 15 camer

In [50]:
# Input query
query = "iPhone 15 vs iPhone 15 Pro"

queries = query_generator(query)

print(f"Number of queries: {queries[0]}")

for query in queries[1]:
  print(query)

Number of queries: 2
['15', 'iphone 15', 'smartphones', '17th-generation']
['15', 'apple', 'smartphone', 'iphone 15 pro']


In [54]:
# Input query
query = "MacBook Air vs MacBook Pro for games"

queries = query_generator(query)

print(f"Number of queries: {queries[0]}")

for query in queries[1]:
  print(query)

Number of queries: 2
['form', 'line', 'games', 'line form', 'line games', 'structured', 'macbook air', 'ultraportable', 'line structured', 'macbook air form', 'macbook air games', 'ultraportable form', 'ultraportable games', 'macbook air line form', 'macbook air structured', 'macbook air line games', 'line ultraportable form', 'ultraportable structured', 'line ultraportable games', 'macbook air line structured', 'line ultraportable structured', 'macbook air ultraportable form', 'macbook air ultraportable games', 'macbook air ultraportable structured']
['form', 'line', 'games', 'line form', 'macintosh', 'line games', 'structured', 'macbook pro', 'macintosh form', 'line structured', 'macintosh games', 'macbook pro form', 'macbook pro games', 'line macintosh form', 'line macintosh games', 'macintosh structured', 'macbook pro line form', 'macbook pro line games', 'macbook pro structured', 'line macintosh structured', 'macbook pro macintosh form', 'macbook pro line structured', 'macbook pro

In [57]:
# Input query
query = "AirPods Pro 2 vs Sony WF-1000XM5 for sound quality"

queries = query_generator(query)

print(f"Number of queries: {queries[0]}")

for query in queries[1]:
  print(query)

Number of queries: 2
['2', 'audio', '2 audio', 'assessment', '2 assessment', 'airpods pro 2', 'sound quality', '2 sound quality', 'airpods pro 2 audio', '2 airpods pro 2 audio', 'airpods pro 2 assessment', '2 airpods pro 2 assessment', 'airpods pro 2 sound quality', '2 airpods pro 2 sound quality']
['sony', 'audio', 'assessment', 'sony audio', 'sound quality', 'sony assessment', 'sony wf-1000xm5', 'sony sound quality', 'sony wf-1000xm5 audio', 'sony wf-1000xm5 sony audio', 'sony wf-1000xm5 assessment', 'sony wf-1000xm5 sound quality', 'sony wf-1000xm5 sony assessment', 'sony wf-1000xm5 sony sound quality']

