# STEP 2: Intent Recognition System

This notebook implements an intent recognition system for restaurant-related queries with the following main intents:

- LOCATION_QUERY: Location-based restaurant queries (e.g., "Ankara'da pizza yeri")
- CUISINE_QUERY: Cuisine/food type queries (e.g., "İyi burger nerede?")
- FEATURE_QUERY: Restaurant feature/attribute queries (e.g., "Temiz ve hızlı servis")
- RATING_QUERY: Rating/quality-based queries (e.g., "En iyi restoran")
- COMPARISON_QUERY: Restaurant comparison queries (e.g., "X ile Y'yi karşılaştır")


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sentence_transformers import SentenceTransformer
from sklearn.metrics import classification_report
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import stanza
import re

# Download Turkish model if not already present
stanza.download("tr")
nlp_stanza = stanza.Pipeline("tr", processors="tokenize,pos,lemma", use_gpu=torch.cuda.is_available())

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


2025-08-18 14:38:24 INFO: Downloaded file to /Users/Serra/stanza_resources/resources.json
2025-08-18 14:38:24 INFO: Downloading default packages for language: tr (Turkish) ...
2025-08-18 14:38:24 INFO: File exists: /Users/Serra/stanza_resources/tr/default.zip
2025-08-18 14:38:26 INFO: Finished downloading models and saved to /Users/Serra/stanza_resources
2025-08-18 14:38:26 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-08-18 14:38:27 INFO: Downloaded file to /Users/Serra/stanza_resources/resources.json
2025-08-18 14:38:27 INFO: Loading these models for language: tr (Turkish):
| Processor | Package       |
-----------------------------
| tokenize  | imst          |
| mwt       | imst          |
| pos       | imst_charlm   |
| lemma     | imst_nocharlm |

2025-08-18 14:38:27 INFO: Using device: cpu
2025-08-18 14:38:27 INFO: Loading: tokenize
2025-08-18 14:38:27 INFO: Loading: mwt
2025-08-18 14:38:27 INFO: Loading: pos
2025-08-18 14:38:28 INFO: Loading: lemma
2025-08-18 14:38:28 INFO: Done loading processors!


Using device: cpu


## 1. Create Training Dataset

First, let's create a synthetic dataset for training our intent recognition model.


In [2]:
# Sample queries for each intent type
training_data = {
    'LOCATION_QUERY': [
        "Ankara'da pizza yeri",
        "Çankaya'da iyi restoranlar",
        "Kızılay bölgesinde ne var",
        "Keçiören'de kahvaltı mekanları",
        "Tunalı'da akşam yemeği",
        "Ulus'ta kebapçı",
        "Bahçelievler'de cafe",
        "Batıkent'te dönerci",
        "Mamak'ta lokanta",
        "Sincan'da restaurant önerisi"
    ],
    'CUISINE_QUERY': [
        "İyi burger nerede?",
        "En güzel lahmacun",
        "Mantı yapan yerler",
        "Nerede güzel pizza var",
        "İskender kebap nerede yenir",
        "Ev yemekleri yapan lokanta",
        "Çiğ köfte dürüm",
        "Pide salonu önerisi",
        "Balık restoranı",
        "Vejeteryan restoran"
    ],
    'FEATURE_QUERY': [
        "Temiz ve hızlı servis",
        "Çocuk dostu restoran",
        "Manzaralı mekan",
        "Bahçeli cafe",
        "Canlı müzik olan yerler",
        "Büyük grup için uygun",
        "Sessiz çalışma mekanı",
        "Otoparklı restaurant",
        "WiFi olan kafeler",
        "Alkollü mekan"
    ],
    'RATING_QUERY': [
        "En iyi restoran",
        "En çok puan alan yerler",
        "Popüler mekanlar",
        "Yüksek puanlı lokantalar",
        "Müşteri memnuniyeti yüksek",
        "Tavsiye edilen restoranlar",
        "Yıldızı yüksek olan yerler",
        "En beğenilen kafeler",
        "Kaliteli restoranlar",
        "İyi yorumlar alan mekanlar"
    ],
    'COMPARISON_QUERY': [
        "X ile Y'yi karşılaştır",
        "Hangi kebapçı daha iyi",
        "Pizza House mı Dominos mu",
        "En iyi burger hangisi",
        "Hangi cafe daha uygun",
        "X restoranı mı Y lokantası mı",
        "Fiyat karşılaştırması",
        "Hangisinin servisi daha iyi",
        "Kalite fiyat karşılaştırması",
        "Menü çeşitliliği karşılaştırma"
    ]
}

# Convert to DataFrame
queries = []
intents = []
for intent, query_list in training_data.items():
    queries.extend(query_list)
    intents.extend([intent] * len(query_list))

df_train = pd.DataFrame({
    'query': queries,
    'intent': intents
})

print("Training data shape:", df_train.shape)
print("\nSample queries per intent:")
print(df_train.groupby('intent').head(2))


Training data shape: (50, 2)

Sample queries per intent:
                         query            intent
0         Ankara'da pizza yeri    LOCATION_QUERY
1   Çankaya'da iyi restoranlar    LOCATION_QUERY
10          İyi burger nerede?     CUISINE_QUERY
11           En güzel lahmacun     CUISINE_QUERY
20       Temiz ve hızlı servis     FEATURE_QUERY
21        Çocuk dostu restoran     FEATURE_QUERY
30             En iyi restoran      RATING_QUERY
31     En çok puan alan yerler      RATING_QUERY
40      X ile Y'yi karşılaştır  COMPARISON_QUERY
41      Hangi kebapçı daha iyi  COMPARISON_QUERY


## 2. Text Preprocessing


In [3]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation except apostrophes in Turkish words
    text = re.sub(r'[^\w\s\'\u0300-\u036f\u0130\u0131]', ' ', text)
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply preprocessing
df_train['processed_query'] = df_train['query'].apply(preprocess_text)

# Create label encoder
label_encoder = LabelEncoder()
df_train['intent_encoded'] = label_encoder.fit_transform(df_train['intent'])

print("Intent encoding mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"{label}: {i}")


Intent encoding mapping:
COMPARISON_QUERY: 0
CUISINE_QUERY: 1
FEATURE_QUERY: 2
LOCATION_QUERY: 3
RATING_QUERY: 4
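
One caveat about `text.lower()` in `preprocess_text`: Python lowercases the Turkish capital `İ` to `i` followed by a combining dot (U+0307), and lowercases `I` to `i` rather than `ı`. The sketch below shows one possible Turkish-aware variant. `turkish_lower` and `preprocess_text_tr` are hypothetical names that do not appear in the notebook, and switching to them would change the model inputs, so the results reported here would need to be regenerated.

```python
import re

def turkish_lower(text: str) -> str:
    """Lowercase text with Turkish casing rules for the I/İ pair (hypothetical helper)."""
    # Map the two Turkish-specific capitals first, then lowercase the rest
    return text.replace("İ", "i").replace("I", "ı").lower()

def preprocess_text_tr(text: str) -> str:
    """Variant of preprocess_text that uses Turkish-aware lowercasing (assumption)."""
    text = turkish_lower(text)
    # Replace everything except word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    # Normalize whitespace
    return re.sub(r"\s+", " ", text).strip()

print(repr("İyi".lower()))          # contains the combining dot U+0307
print(repr(turkish_lower("İyi")))   # plain 'iyi'
print(preprocess_text_tr("İyi burger nerede?"))
```

A trade-off to keep in mind: this mapping also turns an uppercase `I` in non-Turkish words into `ı` (for example `INTERNET` becomes `ınternet`), so it suits text that is predominantly Turkish.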


## 3. Create Embeddings


In [4]:
# Initialize the sentence transformer model
model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
encoder = SentenceTransformer(model_name)

# Generate embeddings
embeddings = encoder.encode(df_train['processed_query'].tolist())
print("Embeddings shape:", embeddings.shape)


Embeddings shape: (50, 384)


## 4. Create PyTorch Dataset and Model


In [5]:
class IntentDataset(Dataset):
    def __init__(self, embeddings, labels):
        self.embeddings = torch.FloatTensor(embeddings)
        self.labels = torch.LongTensor(labels)
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.embeddings[idx], self.labels[idx]

class IntentClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super(IntentClassifier, self).__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.layer3 = nn.Linear(hidden_dim // 2, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.dropout(x)
        x = self.relu(self.layer2(x))
        x = self.dropout(x)
        x = self.layer3(x)
        return x
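
As a rough size check, the number of trainable parameters in `IntentClassifier` can be worked out by hand. The sketch below assumes the dimensions used in the training section (384 input features, a hidden size of 256, and 5 intent classes); `linear_params` is a hypothetical helper, not part of the notebook.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Dimensions taken from the notebook: 384-d embeddings, hidden_dim = 256, 5 intents
input_dim, hidden_dim, num_classes = 384, 256, 5

layer_sizes = [
    (input_dim, hidden_dim),          # layer1
    (hidden_dim, hidden_dim // 2),    # layer2
    (hidden_dim // 2, num_classes),   # layer3
]
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total_params = sum(per_layer)

print(per_layer)      # [98560, 32896, 645]
print(total_params)   # 132101
```

With an 80/20 split of 50 queries, these roughly 132k parameters are fitted on 40 examples, so the dropout layers are the only regularization in place and overfitting is a realistic risk.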


## 5. Train the Model


In [6]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, 
    df_train['intent_encoded'].values,
    test_size=0.2, 
    random_state=42,
    stratify=df_train['intent_encoded'].values
)

# Create datasets
train_dataset = IntentDataset(X_train, y_train)
test_dataset = IntentDataset(X_test, y_test)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

# Initialize model
input_dim = embeddings.shape[1]  # 384 for the chosen model
hidden_dim = 256
num_classes = len(label_encoder.classes_)

model = IntentClassifier(input_dim, hidden_dim, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 50

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch_embeddings, batch_labels in train_loader:
        batch_embeddings = batch_embeddings.to(device)
        batch_labels = batch_labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(batch_embeddings)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    if (epoch + 1) % 10 == 0:
        avg_loss = total_loss / len(train_loader)
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')


Epoch [10/50], Loss: 0.9910
Epoch [20/50], Loss: 0.2673
Epoch [30/50], Loss: 0.0622
Epoch [40/50], Loss: 0.0446
Epoch [50/50], Loss: 0.0845


## 6. Evaluate the Model


In [7]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch_embeddings, batch_labels in test_loader:
        batch_embeddings = batch_embeddings.to(device)
        outputs = model(batch_embeddings)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(batch_labels.numpy())

# Print classification report
print(classification_report(all_labels, all_preds, target_names=label_encoder.classes_))


                  precision    recall  f1-score   support

COMPARISON_QUERY       1.00      0.50      0.67         2
   CUISINE_QUERY       0.25      0.50      0.33         2
   FEATURE_QUERY       0.50      0.50      0.50         2
  LOCATION_QUERY       1.00      0.50      0.67         2
    RATING_QUERY       1.00      1.00      1.00         2

        accuracy                           0.60        10
       macro avg       0.75      0.60      0.63        10
    weighted avg       0.75      0.60      0.63        10
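
With only two test examples per class, each per-class recall can only be 0, 0.5, or 1, so the report above is a coarse estimate. One common way to get a steadier picture is stratified k-fold evaluation, where every query is used for testing exactly once. The sketch below builds such folds with the standard library only; `stratified_folds` is a hypothetical helper, and scikit-learn's `StratifiedKFold` provides the same idea in library form.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5, seed=42):
    """Split indices into n_folds lists, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Same layout as the training data: 5 intents with 10 queries each
intent_names = ["LOCATION_QUERY", "CUISINE_QUERY", "FEATURE_QUERY",
                "RATING_QUERY", "COMPARISON_QUERY"]
labels = [name for name in intent_names for _ in range(10)]

folds = stratified_folds(labels)
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    per_class = Counter(labels[i] for i in test_idx)
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)} per class={dict(per_class)}")
```

Training a fresh model on each `train_idx` and pooling the predictions over all folds would yield a report with a support of 10 per class instead of 2.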



## 7. Create Intent Recognition Function


In [8]:
def predict_intent(query, encoder=encoder, model=model, label_encoder=label_encoder):
    # Preprocess query
    processed_query = preprocess_text(query)
    
    # Generate embedding
    with torch.no_grad():
        embedding = encoder.encode([processed_query])
        embedding_tensor = torch.FloatTensor(embedding).to(device)
        
        # Get model prediction
        model.eval()
        output = model(embedding_tensor)
        _, predicted = torch.max(output, 1)
        
        # Get predicted intent
        predicted_intent = label_encoder.inverse_transform(predicted.cpu().numpy())[0]
        
        # Get confidence scores
        probabilities = torch.nn.functional.softmax(output, dim=1)
        confidence = probabilities.max().item()
        
        return {
            'intent': predicted_intent,
            'confidence': confidence
        }

# Test the function with some example queries
test_queries = [
    "Ankara'da en iyi pizza",
    "Hangi restoran daha iyi?",
    "Temiz ve ferah mekan",
    "En yüksek puanlı yerler",
    "Çankaya'da kahvaltı"
]

print("Testing intent recognition:")
for query in test_queries:
    result = predict_intent(query)
    print(f"\nQuery: {query}")
    print(f"Predicted Intent: {result['intent']}")
    print(f"Confidence: {result['confidence']:.4f}")


Testing intent recognition:

Query: Ankara'da en iyi pizza
Predicted Intent: LOCATION_QUERY
Confidence: 0.8978

Query: Hangi restoran daha iyi?
Predicted Intent: COMPARISON_QUERY
Confidence: 0.8459

Query: Temiz ve ferah mekan
Predicted Intent: FEATURE_QUERY
Confidence: 0.9734

Query: En yüksek puanlı yerler
Predicted Intent: RATING_QUERY
Confidence: 0.9997

Query: Çankaya'da kahvaltı
Predicted Intent: LOCATION_QUERY
Confidence: 0.9985
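
The confidence returned by `predict_intent` is the largest softmax probability, which makes it possible to reject uncertain predictions instead of always returning one of the five intents. The following is a minimal sketch under stated assumptions: the `UNKNOWN` label, the 0.6 threshold, and the function name `predict_with_fallback` are all illustrative and not part of the notebook.

```python
import math
from typing import Dict, List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_fallback(logits: Sequence[float],
                          labels: Sequence[str],
                          threshold: float = 0.6) -> Dict:
    """Return the top intent, or 'UNKNOWN' when its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    intent = labels[best] if probs[best] >= threshold else "UNKNOWN"
    return {"intent": intent, "confidence": probs[best]}

# Label order as produced by the LabelEncoder in this notebook
labels = ["COMPARISON_QUERY", "CUISINE_QUERY", "FEATURE_QUERY",
          "LOCATION_QUERY", "RATING_QUERY"]

print(predict_with_fallback([5.0, 0.0, 0.0, 0.0, 0.0], labels))  # confident
print(predict_with_fallback([0.0, 0.0, 0.0, 0.0, 0.0], labels))  # uniform, rejected
```

A suitable threshold would have to be tuned on held-out queries, since softmax scores from a small network trained on little data may be overconfident.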


## 8. Save the Model and Required Components


In [9]:
import pickle

# Save the trained model
torch.save(model.state_dict(), 'intent_classifier_model.pth')

# Save the label encoder
with open('intent_label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

print("Model and components saved successfully!")


Model and components saved successfully!


# PHASE 2: Core NLP Modules
## 2.1 Text Preprocessing Pipeline


In [10]:
import re
from typing import List, Dict

class TextProcessor:
    def __init__(self):
        # Turkish stop words
        self.turkish_stopwords = {
            'bir', 'bu', 'da', 'de', 'den', 'dır', 'dir', 'için', 'ile', 
            'ise', 've', 'var', 'yok', 'olan', 'olur', 'şu', 'o', 'ki', 'mi', 
            'mı', 'mu', 'mü', 'ne', 'ama', 'fakat', 'veya', 'ya', 'gibi', 'kadar',
            'sonra', 'önce', 'çok', 'az', 'daha', 'en', 'çünkü', 'hem',
            'ancak', 'lakin', 'sadece', 'yalnız', 'hatta', 'bile', 'bütün', 'tüm'
        }
        
        # Turkish character mapping for normalization
        self.char_map = {
            'ç': 'c', 'ğ': 'g', 'ı': 'i', 'ö': 'o', 'ş': 's', 'ü': 'u',
            'Ç': 'C', 'Ğ': 'G', 'İ': 'I', 'Ö': 'O', 'Ş': 'S', 'Ü': 'U'
        }
    
    def normalize_turkish_chars(self, text: str, keep_turkish: bool = True) -> str:
        """
        Normalize Turkish characters
        keep_turkish: If True, keeps Turkish chars, if False converts to ASCII
        """
        if not keep_turkish:
            for turkish_char, ascii_char in self.char_map.items():
                text = text.replace(turkish_char, ascii_char)
        return text
    
    def remove_punctuation(self, text: str) -> str:
        """Remove punctuation while preserving apostrophes in Turkish words"""
        # Keep apostrophes and Turkish letters
        text = re.sub(r'[^\w\s\'\u00C0-\u017F\u0130\u0131]', ' ', text)
        return text
    
    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        """Remove Turkish stop words"""
        return [token for token in tokens if token.lower() not in self.turkish_stopwords]
    
    def tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.split()
    
    def lemmatize_with_stanza(self, text: str) -> List[str]:
        """Lemmatize using the Stanza pipeline (nlp_stanza) initialized in the first cell"""
        doc = nlp_stanza(text)
        lemmas = []
        for sentence in doc.sentences:
            for word in sentence.words:
                lemmas.append(word.lemma)
        return lemmas
    
    def extract_ngrams(self, tokens: List[str], n: int = 2) -> List[str]:
        """Extract n-grams from tokens"""
        if len(tokens) < n:
            return []
        return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    
    def process_text(self, text: str, 
                    normalize_chars: bool = True,
                    remove_stops: bool = True,
                    use_lemmatization: bool = True,
                    extract_bigrams: bool = True) -> Dict:
        """
        Complete text processing pipeline
        Returns processed tokens, lemmas, and n-grams
        """
        # Step 1: Basic cleaning
        processed_text = text.lower().strip()
        
        # Step 2: Character normalization
        if normalize_chars:
            processed_text = self.normalize_turkish_chars(processed_text, keep_turkish=True)
        
        # Step 3: Remove punctuation
        processed_text = self.remove_punctuation(processed_text)
        
        # Step 4: Normalize whitespace
        processed_text = re.sub(r'\s+', ' ', processed_text).strip()
        
        # Step 5: Tokenization
        tokens = self.tokenize(processed_text)
        
        # Step 6: Stop word removal
        if remove_stops:
            tokens = self.remove_stopwords(tokens)
        
        # Step 7: Lemmatization
        lemmas = []
        if use_lemmatization and tokens:
            lemmas = self.lemmatize_with_stanza(' '.join(tokens))
        
        # Step 8: N-gram extraction
        bigrams = []
        trigrams = []
        if extract_bigrams and len(tokens) >= 2:
            bigrams = self.extract_ngrams(tokens, 2)
            if len(tokens) >= 3:
                trigrams = self.extract_ngrams(tokens, 3)
        
        return {
            'original_text': text,
            'processed_text': processed_text,
            'tokens': tokens,
            'lemmas': lemmas,
            'bigrams': bigrams,
            'trigrams': trigrams,
            'token_count': len(tokens)
        }
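
To make the n-gram step concrete, the standalone sketch below repeats the logic of `extract_ngrams` as a plain function and applies it to the tokens of the first test query used later in the notebook. A list of T tokens yields T − n + 1 n-grams, or none when T < n.

```python
from typing import List

def extract_ngrams(tokens: List[str], n: int = 2) -> List[str]:
    """Return all contiguous n-token windows joined by spaces."""
    if len(tokens) < n:
        return []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of "Kızılay'da iyi köfte nerede yenir?" after cleaning
tokens = ["kızılay'da", "iyi", "köfte", "nerede", "yenir"]

bigrams = extract_ngrams(tokens, 2)   # 5 tokens -> 4 bigrams
trigrams = extract_ngrams(tokens, 3)  # 5 tokens -> 3 trigrams

print(bigrams)
print(trigrams)
print(extract_ngrams(["pizza"], 2))   # too short, so an empty list
```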


## 2.2 Named Entity Recognition (NER)

In [11]:
from typing import List, Dict, Tuple, Set

class RestaurantNER:
    def __init__(self):
        # Location entities (Turkish cities, districts)
        self.location_keywords = {
            # Major cities
            'ankara', 'istanbul', 'izmir', 'bursa', 'antalya', 'adana', 'konya', 'gaziantep', 'kayseri', 'mersin',
            # Ankara districts
            'çankaya', 'keçiören', 'yenimahalle', 'mamak', 'sincan', 'altındağ', 'etimesgut', 'gölbaşı', 'pursaklar',
            'kızılay', 'tunalı', 'bahçelievler', 'batıkent', 'ulus', 'beşevler', 'dikmen', 'çayyolu', 'oran',
            # Common location words
            'merkez', 'şehir', 'bölge', 'mahalle', 'cadde', 'sokak', 'plaza', 'avm', 'mall'
        }
        
        # Cuisine types
        self.cuisine_keywords = {
            # Turkish cuisine
            'kebap', 'kebab', 'döner', 'lahmacun', 'pide', 'mantı', 'çorbası', 'köfte', 'iskender', 'adana',
            'urfa', 'beyti', 'şiş', 'tavuk', 'et', 'balık', 'deniz', 'ürünleri',
            # International cuisine
            'pizza', 'burger', 'hamburger', 'makarna', 'spagetti', 'çin', 'japon', 'sushi', 'meksika',
            'hint', 'arap', 'lübnan', 'fransız', 'i̇talyan', 'amerikan',
            # Food types
            'kahvaltı', 'öğlen', 'akşam', 'yemek', 'yemeği', 'atıştırmalık', 'tatlı', 'içecek', 'kahve', 'çay',
            'vejeteryan', 'vegan', 'organik', 'sağlıklı', 'diyet'
        }
        
        # Price indicators
        self.price_keywords = {
            'ucuz', 'pahalı', 'uygun', 'fiyat', 'fiyatlı', 'ekonomik', 'bütçe', 'para', 'lira', 'tl',
            'student', 'öğrenci', 'indirim', 'kampanya', 'promosyon', 'hesaplı', 'makul', 'pahallı'
        }
        
        # Rating/Quality indicators
        self.rating_keywords = {
            'iyi', 'güzel', 'harika', 'mükemmel', 'süper', 'muhteşem', 'enfes', 'lezzetli', 'nefis',
            'kötü', 'berbat', 'leş', 'iğrenç', 'beğenmedim', 'beğenmem',
            'temiz', 'hijyen', 'hijyenik', 'pis', 'kirli',
            'hızlı', 'yavaş', 'geç', 'erken', 'servis', 'garson', 'hizmet',
            'puanlı', 'puan', 'yıldız', 'yıldızlı', 'değerlendirme', 'yorum', 'tavsiye',
            'popüler', 'ünlü', 'meşhur', 'tanınan', 'bilinir'
        }
        
        # Additional features
        self.feature_keywords = {
            'bahçe', 'bahçeli', 'teras', 'açık', 'kapalı', 'klimalı', 'ısıtmalı',
            'müzik', 'canlı', 'sessiz', 'sakin', 'kalabalık', 'gürültülü',
            'wifi', 'internet', 'otopark', 'park', 'valet',
            'çocuk', 'dostu', 'aile', 'sevgili', 'grup', 'büyük',
            'manzara', 'manzaralı', 'görünüm', 'şehir', 'deniz', 'dağ'
        }
    
    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """
        Extract restaurant-related entities from text
        """
        text_lower = text.lower()
        tokens = text_lower.split()
        
        entities = {
            'LOCATION': [],
            'CUISINE': [],
            'PRICE': [],
            'RATING': [],
            'FEATURE': []
        }
        
        # Match every token against each keyword set. Substring matching lets
        # suffixed forms such as "kızılay'da" match the keyword "kızılay".
        keyword_sets = {
            'LOCATION': self.location_keywords,
            'CUISINE': self.cuisine_keywords,
            'PRICE': self.price_keywords,
            'RATING': self.rating_keywords,
            'FEATURE': self.feature_keywords
        }
        for entity_type, keywords in keyword_sets.items():
            for token in tokens:
                if any(keyword in token for keyword in keywords):
                    entities[entity_type].append(token)
        
        # Remove duplicates
        for entity_type in entities:
            entities[entity_type] = list(set(entities[entity_type]))
        
        return entities
    
    def get_entity_context(self, text: str, entity: str, window: int = 3) -> str:
        """
        Get context around an entity for better understanding
        """
        tokens = text.lower().split()
        entity_positions = [i for i, token in enumerate(tokens) if entity in token]
        
        contexts = []
        for pos in entity_positions:
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            context = ' '.join(tokens[start:end])
            contexts.append(context)
        
        return ' | '.join(contexts)
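
Because `extract_entities` tests whether a keyword occurs anywhere inside a token, short keywords can fire on unrelated words: `oran` (a location keyword) is contained in `restoranı`, and `et` (a cuisine keyword) is contained in `internet`. The sketch below contrasts that behavior with prefix matching, which still handles Turkish suffixes such as `'da`. Both helper names are hypothetical, and prefix matching is shown only as one possible alternative, not as the notebook's method.

```python
from typing import Iterable, List

def substring_matches(token: str, keywords: Iterable[str]) -> List[str]:
    """Keywords found anywhere inside the token (the rule used by extract_entities)."""
    return sorted(k for k in keywords if k in token)

def prefix_matches(token: str, keywords: Iterable[str]) -> List[str]:
    """Keywords that the token starts with (hypothetical stricter rule)."""
    return sorted(k for k in keywords if token.startswith(k))

# A few keywords taken from the location and cuisine sets above
keywords = {"kızılay", "oran", "et"}

for token in ["kızılay'da", "restoranı", "internet"]:
    print(f"{token!r}: substring={substring_matches(token, keywords)} "
          f"prefix={prefix_matches(token, keywords)}")
```

Prefix matching is not a complete fix either (for instance, `etimesgut` starts with `et`), so matching on lemmas or on whole tokens after stripping the apostrophe suffix may be worth evaluating as well.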

## 2.3 Embedding & Similarity Engine

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import pandas as pd
from typing import List, Dict, Tuple

class SimilarityEngine:
    def __init__(self, model_name: str = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'):
        self.encoder = SentenceTransformer(model_name)
        self.restaurant_embeddings = None
        self.restaurant_data = None
        self.text_processor = TextProcessor()
        self.ner_processor = RestaurantNER()
    
    def create_restaurant_data(self) -> pd.DataFrame:
        """
        Create sample restaurant data for demonstration
        In real implementation, this would come from your database
        """
        restaurants = [
            {
                'id': 1, 'name': 'Köfteci Ramiz', 'location': 'Kızılay', 
                'cuisine': 'Türk Mutfağı', 'rating': 4.5, 'price_range': 'Orta',
                'features': 'Hızlı servis, Temiz, Aile dostu',
                'description': 'Kızılayda ünlü köfteci, lezzetli köfte ve lahmacun'
            },
            {
                'id': 2, 'name': 'Pizza Palace', 'location': 'Çankaya', 
                'cuisine': 'İtalyan', 'rating': 4.2, 'price_range': 'Pahalı',
                'features': 'Canlı müzik, Manzaralı, Otopark',
                'description': 'Çankayada en iyi pizza, İtalyan şefi, harika manzara'
            },
            {
                'id': 3, 'name': 'Burger King', 'location': 'Tunalı', 
                'cuisine': 'Fast Food', 'rating': 3.8, 'price_range': 'Ucuz',
                'features': 'Hızlı servis, Çocuk dostu, WiFi',
                'description': 'Tunalıda burger, hızlı servis, öğrenci dostu fiyatlar'
            },
            {
                'id': 4, 'name': 'Balık Restaurant', 'location': 'Bahçelievler', 
                'cuisine': 'Deniz Ürünleri', 'rating': 4.7, 'price_range': 'Pahalı',
                'features': 'Taze balık, Manzaralı, Sessiz',
                'description': 'Taze deniz ürünleri, kaliteli servis, romantik atmosfer'
            },
            {
                'id': 5, 'name': 'Mantı Evi', 'location': 'Keçiören', 
                'cuisine': 'Türk Mutfağı', 'rating': 4.1, 'price_range': 'Ucuz',
                'features': 'Ev yemeği, Samimi, Ekonomik',
                'description': 'Ev yapımı mantı, sıcak atmosfer, uygun fiyatlar'
            }
        ]
        
        return pd.DataFrame(restaurants)
    
    def create_restaurant_embeddings(self, restaurant_data: pd.DataFrame):
        """
        Create embeddings for restaurant descriptions and features
        """
        self.restaurant_data = restaurant_data
        
        # Combine name, location, cuisine, features, and description for embedding
        restaurant_texts = []
        for _, row in restaurant_data.iterrows():
            combined_text = f"{row['name']} {row['location']} {row['cuisine']} {row['features']} {row['description']}"
            restaurant_texts.append(combined_text)
        
        self.restaurant_embeddings = self.encoder.encode(restaurant_texts)
        print(f"Created embeddings for {len(restaurant_texts)} restaurants")
    
    def encode_query(self, query: str) -> Tuple[np.ndarray, Dict[str, List[str]]]:
        """
        Encode a user query into an embedding and return it together with the extracted entities
        """
        # Process the query
        processed = self.text_processor.process_text(query)
        
        # Extract entities
        entities = self.ner_processor.extract_entities(query)
        
        # Enhance query with extracted entities
        enhanced_query = query
        for entity_type, entity_list in entities.items():
            if entity_list:
                enhanced_query += f" {' '.join(entity_list)}"
        
        # Create embedding
        query_embedding = self.encoder.encode([enhanced_query])
        return query_embedding[0], entities
    
    def calculate_similarity(self, query_embedding: np.ndarray, top_k: int = 5) -> List[Tuple[int, float]]:
        """
        Calculate cosine similarity between query and restaurant embeddings
        """
        if self.restaurant_embeddings is None:
            raise ValueError("Restaurant embeddings not created. Call create_restaurant_embeddings first.")
        
        # Calculate cosine similarity
        similarities = cosine_similarity([query_embedding], self.restaurant_embeddings)[0]
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = [(idx, similarities[idx]) for idx in top_indices]
        
        return results
    
    def semantic_search(self, query: str, top_k: int = 5, min_similarity: float = 0.3) -> Dict:
        """
        Perform semantic search on restaurants
        """
        # Encode query
        query_embedding, entities = self.encode_query(query)
        
        # Calculate similarities
        similarities = self.calculate_similarity(query_embedding, top_k)
        
        # Filter by minimum similarity
        filtered_results = [(idx, score) for idx, score in similarities if score >= min_similarity]
        
        # Prepare results
        results = []
        for idx, score in filtered_results:
            restaurant = self.restaurant_data.iloc[idx].to_dict()
            restaurant['similarity_score'] = score
            results.append(restaurant)
        
        return {
            'query': query,
            'extracted_entities': entities,
            'results': results,
            'total_results': len(results)
        }
    
    def find_similar_restaurants(self, restaurant_id: int, top_k: int = 3) -> List[Dict]:
        """
        Find restaurants similar to a given restaurant
        """
        if self.restaurant_embeddings is None:
            raise ValueError("Restaurant embeddings not created.")
        
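        # Map the restaurant's 'id' value to its row position: ids start at 1,
        # while rows of the embedding matrix are 0-indexed
        matches = np.flatnonzero(self.restaurant_data['id'].values == restaurant_id)
        if len(matches) == 0:
            raise ValueError(f"Restaurant id {restaurant_id} not found.")
        position = int(matches[0])
        
        # Get the restaurant embedding
        restaurant_embedding = self.restaurant_embeddings[position]
        
        # Calculate similarities (exclude the restaurant itself)
        similarities = cosine_similarity([restaurant_embedding], self.restaurant_embeddings)[0]
        similarities[position] = -1  # Exclude self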
        
        # Get top-k similar restaurants
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            if similarities[idx] > 0:  # Only include positive similarities
                restaurant = self.restaurant_data.iloc[idx].to_dict()
                restaurant['similarity_score'] = similarities[idx]
                results.append(restaurant)
        
        return results
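
The ranking logic behind `calculate_similarity` and `semantic_search` can be illustrated without any model: score every document by cosine similarity, keep the `top_k` highest, then drop those below `min_similarity`. The sketch below uses toy 2-dimensional vectors in place of the 384-dimensional sentence embeddings; `cosine` and `top_k_matches` are hypothetical names written with the standard library only.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_matches(query: Sequence[float],
                  documents: Sequence[Sequence[float]],
                  top_k: int = 5,
                  min_similarity: float = 0.3) -> List[Tuple[int, float]]:
    """Rank documents by similarity, keep top_k, then apply the threshold."""
    scored = [(idx, cosine(query, doc)) for idx, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(idx, score) for idx, score in scored[:top_k] if score >= min_similarity]

# Toy "embeddings": document 0 is parallel to the query, document 1 is orthogonal
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = top_k_matches([1.0, 0.0], documents, top_k=2)
print(matches)
```

Since the threshold is applied after the top-k cut, a search may return fewer than `top_k` results, which is why `semantic_search` reports `total_results` separately.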

In [13]:
# Initialize components
text_processor = TextProcessor()
ner_processor = RestaurantNER()
similarity_engine = SimilarityEngine()

# Create and process restaurant data
restaurant_data = similarity_engine.create_restaurant_data()
similarity_engine.create_restaurant_embeddings(restaurant_data)

# Example usage
def complete_query_analysis(query: str):
    """
    Complete analysis of a restaurant query
    """
    print(f"Analyzing query: '{query}'")
    print("="*50)
    
    # 1. Intent Recognition (using predict_intent defined in section 7)
    intent_result = predict_intent(query)
    print(f"Intent: {intent_result['intent']} (confidence: {intent_result['confidence']:.4f})")
    
    # 2. Text Processing
    processed = text_processor.process_text(query)
    print(f"Processed tokens: {processed['tokens']}")
    print(f"Bigrams: {processed['bigrams']}")
    
    # 3. Named Entity Recognition
    entities = ner_processor.extract_entities(query)
    print(f"Extracted entities: {entities}")
    
    # 4. Semantic Search
    search_results = similarity_engine.semantic_search(query, top_k=3)
    print(f"Top {len(search_results['results'])} matching restaurants:")
    for i, restaurant in enumerate(search_results['results'], 1):
        print(f"  {i}. {restaurant['name']} ({restaurant['location']}) - Score: {restaurant['similarity_score']:.4f}")
    
    return {
        'intent': intent_result,
        'processed_text': processed,
        'entities': entities,
        'search_results': search_results
    }

# Test examples
test_queries = [
    "Kızılay'da iyi köfte nerede yenir?",
    "Çankaya'da pahalı olmayan pizza yeri",
    "Balık restoranı tavsiye eder misin?",
    "En iyi burger nerede?"
]

for query in test_queries:
    result = complete_query_analysis(query)
    print("\n" + "="*80 + "\n")

Created embeddings for 5 restaurants
Analyzing query: 'Kızılay'da iyi köfte nerede yenir?'
Intent: CUISINE_QUERY (confidence: 0.9123)
Processed tokens: ["kızılay'da", 'iyi', 'köfte', 'nerede', 'yenir']
Bigrams: ["kızılay'da iyi", 'iyi köfte', 'köfte nerede', 'nerede yenir']
Extracted entities: {'LOCATION': ["kızılay'da"], 'CUISINE': ['köfte'], 'PRICE': [], 'RATING': ['iyi'], 'FEATURE': []}
Top 3 matching restaurants:
  1. Köfteci Ramiz (Kızılay) - Score: 0.6273
  2. Mantı Evi (Keçiören) - Score: 0.5093
  3. Balık Restaurant (Bahçelievler) - Score: 0.4356


Analyzing query: 'Çankaya'da pahalı olmayan pizza yeri'
Intent: LOCATION_QUERY (confidence: 0.7552)
Processed tokens: ["çankaya'da", 'pahalı', 'olmayan', 'pizza', 'yeri']
Bigrams: ["çankaya'da pahalı", 'pahalı olmayan', 'olmayan pizza', 'pizza yeri']
Extracted entities: {'LOCATION': ["çankaya'da"], 'CUISINE': ['pizza'], 'PRICE': ['pahalı'], 'RATING': [], 'FEATURE': []}
Top 3 matching restaurants:
  1. Pizza Palace (Çankaya) - Score: 