# Hotel Booking Chatbot
## Core NLP: Supervised ML - Naive Bayes

This notebook implements the intent-recognition component of a retrieval-based chatbot using the Multinomial Naive Bayes model from 'scikit-learn'.

Multinomial Naive Bayes is fast to train and a common baseline for text classification, which suits a small set of short, labelled queries like the one used here.

In [17]:
# install necessary libraries
!pip install scikit-learn
!pip install pandas
!pip install nltk






In [18]:
# Download the necessary NLTK resources
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

print("NLTK Resources downloaded successfully!")

NLTK Resources downloaded successfully!


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
# Import necessary libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Import Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from joblib import dump, load

# Import NLTK
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

# Initialize NLTK resources
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

## Step 1: Data Loading and Shuffling

In [20]:
# Placeholder: Load the actual dataset. Ensure it has 'text' (user query) and 'intent' (label) columns
df = pd.read_csv('dataset.csv')

# Shuffle the rows so the split does not depend on the file order
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df.head())

                                text             intent
0                       Good evening           greeting
1           When does check-out end?  ask_checkout_time
2  What is your cancellation policy?   ask_cancellation
3      What is your check-in policy?   ask_checkin_time
4          When does check-in start?   ask_checkin_time
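
The loading cell assumes the CSV has 'text' and 'intent' columns and that every row is usable. A minimal sketch of a sanity check is shown here, using a small made-up DataFrame in place of `dataset.csv` (the rows and the `required` set are illustrative assumptions). Dropping missing values matters because a later call such as `text.lower()` fails on NaN.

```python
import pandas as pd

# Hypothetical rows standing in for dataset.csv.
df = pd.DataFrame({
    "text": ["Good evening", "When does check-out end?", None, "Good evening"],
    "intent": ["greeting", "ask_checkout_time", "greeting", "greeting"],
})

# Fail early if an expected column is absent.
required = {"text", "intent"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"dataset is missing columns: {sorted(missing)}")

# Remove rows with missing values and exact duplicate (text, intent) pairs.
df = df.dropna(subset=["text", "intent"]).drop_duplicates(subset=["text", "intent"])

# Inspect the class balance before splitting.
intent_counts = df["intent"].value_counts()
print(intent_counts)
```

A heavily skewed `intent_counts` would be a hint to collect more examples for the rare intents before training.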


## Step 2: Define Enhanced Preprocessing Function

The function below lowercases the text, strips punctuation, tokenizes it, removes stopwords, and lemmatizes each remaining token.

In [21]:
def preprocess_text(text):
    # 1. Convert to Lowercase
    text = text.lower()
    
    # 2. Remove Punctuation and Special Characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # 3. Tokenization
    tokens = word_tokenize(text)
    
    # 4. Stopword Removal
    tokens = [word for word in tokens if word not in stop_words]
    
    # 5. Lemmatization (Key Enhancement)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Rejoin tokens into a single string
    return ' '.join(tokens)

# Apply the new preprocessing function to the text column
df['cleaned_text'] = df['text'].apply(preprocess_text)
print("--- Preprocessing Complete (with NLTK Lemmatization) ---")
print(df[['text', 'cleaned_text']].head())

# Prepare the cleaned text and intents for the model training section
X = df['cleaned_text']
y = df['intent']

--- Preprocessing Complete (with NLTK Lemmatization) ---
                                text         cleaned_text
0                       Good evening         good evening
1           When does check-out end?         checkout end
2  What is your cancellation policy?  cancellation policy
3      What is your check-in policy?       checkin policy
4          When does check-in start?        checkin start
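
The preprocessing step deletes punctuation outright, which is why `check-out` becomes the single token `checkout` in the output above. The sketch below uses only the standard library to contrast that with replacing punctuation by a space; `strip_punct` and `space_punct` are illustrative names, not part of the notebook.

```python
import re

def strip_punct(text):
    """Delete punctuation, as preprocess_text does: 'check-out' -> 'checkout'."""
    return re.sub(r'[^\w\s]', '', text.lower()).split()

def space_punct(text):
    """Replace punctuation with a space instead: 'check-out' -> 'check', 'out'."""
    return re.sub(r'[^\w\s]', ' ', text.lower()).split()

query = "When does check-out end?"
print(strip_punct(query))   # ['when', 'does', 'checkout', 'end']
print(space_punct(query))   # ['when', 'does', 'check', 'out', 'end']
```

Either choice can work, but the same variant has to be applied at training time and at prediction time, otherwise the tokens will not match the fitted vocabulary.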


## Step 3: Feature Extraction (TF-IDF Vectorization)

In [22]:
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data to create the feature matrix X
X = vectorizer.fit_transform(X)
y = df['intent']

print(f"Feature matrix X shape: {X.shape}")
print(f"Labels y shape: {y.shape}")


Feature matrix X shape: (100, 105)
Labels y shape: (100,)
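
Fitting the vectorizer on the full dataset before splitting lets the vocabulary and IDF weights see the test rows. One common alternative, sketched below on a tiny made-up corpus (the `texts` and `labels` values are illustrative, not taken from `dataset.csv`), is to split the raw strings first and wrap `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline`, so the vectorizer is fitted on the training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned queries; stand-ins for df['cleaned_text'] and df['intent'].
texts = [
    "good evening", "hello", "good morning", "hi there",
    "checkout end", "checkout time", "late checkout", "checkout hour",
]
labels = ["greeting"] * 4 + ["ask_checkout_time"] * 4

# Split the raw strings first, so the test rows stay unseen by the vectorizer.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and Naive Bayes on the training split only.
intent_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
intent_pipeline.fit(text_train, y_train)

vocab_size = len(intent_pipeline.named_steps["tfidf"].vocabulary_)
predictions = list(intent_pipeline.predict(text_test))
print(f"Vocabulary learned from {len(text_train)} training rows: {vocab_size} terms")
print("Predictions:", predictions)
```

A fitted pipeline also accepts raw strings directly in `predict`, so only one object needs to be saved and reloaded.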


## Step 4: Data Splitting (Stratified Split) - 80% of data for training, 20% for testing

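We use 'stratify=y' so that every intent class is proportionally represented in both the training and testing sets. This guarantees each class has test examples to score; it does not by itself prevent scikit-learn's 'UndefinedMetricWarning', which is still raised when the model never predicts one of the classes.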

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X,y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

print(f"Train set size:{X_train.shape[0]} samples")
print(f"Test set size:{X_test.shape[0]} samples")

Train set size:80 samples
Test set size:20 samples
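
To see what stratification changes, the sketch below splits a set of made-up labels twice, with and without `stratify`. The 10 intents with 10 examples each are an assumption chosen to mirror a balanced 100-row dataset; the features are dummies because only the labels matter here.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical labels: 10 intents with 10 examples each.
intents = [f"intent_{i}" for i in range(10)]
y = [name for name in intents for _ in range(10)]
X = list(range(len(y)))  # dummy features

# Same split size and seed, with and without stratification.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)

strat_counts = sorted(Counter(y_test_strat).values())
plain_counts = sorted(Counter(y_test_plain).values())
print("Stratified test counts per intent:  ", strat_counts)
print("Unstratified test counts per intent:", plain_counts)
```

With stratification every intent contributes exactly two test rows; without it the per-intent counts depend on the shuffle, and some intents may be missing from the test set entirely.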


## Step 5: Model Training (Multinomial Naive Bayes)

In [24]:
# Instantiate the Multinomial Naive Bayes model
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train, y_train)

# Make predictions on the test set
pred = nb_model.predict(X_test)

## Step 6: Model Evaluation

In [25]:
print("--- Naive Bayes Model Evaluation ---")
print("Accuracy:", accuracy_score(y_test, pred))
print("\nClassification Report:\n")
print(classification_report(y_test, pred))

--- Naive Bayes Model Evaluation ---
Accuracy: 0.75

Classification Report:

                   precision    recall  f1-score   support

 ask_availability       1.00      0.50      0.67         2
      ask_booking       0.67      1.00      0.80         2
 ask_cancellation       1.00      1.00      1.00         2
 ask_checkin_time       0.67      1.00      0.80         2
ask_checkout_time       1.00      0.50      0.67         2
   ask_facilities       0.00      0.00      0.00         2
     ask_location       0.67      1.00      0.80         2
   ask_room_price       1.00      1.00      1.00         2
          goodbye       1.00      0.50      0.67         2
         greeting       0.50      1.00      0.67         2

         accuracy                           0.75        20
        macro avg       0.75      0.75      0.71        20
     weighted avg       0.75      0.75      0.71        20
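
In the report above, `ask_facilities` scores 0.00 on every metric, which is consistent with the model never predicting that class on the test set. Precision is undefined in that case, and by default scikit-learn reports 0.0 and emits `UndefinedMetricWarning`. A minimal sketch of making that choice explicit with the `zero_division` parameter of `classification_report` follows; the label lists are made up for illustration.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import classification_report

# Hypothetical labels: "ask_facilities" occurs in y_true but is never predicted.
y_true = ["greeting", "greeting", "ask_facilities", "ask_facilities"]
y_pred = ["greeting", "greeting", "greeting", "greeting"]

with warnings.catch_warnings():
    # Turn the warning into an error so the sketch would fail if it were still raised.
    warnings.simplefilter("error", category=UndefinedMetricWarning)
    # zero_division=0 reports 0.0 for the undefined precision without warning.
    report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report["ask_facilities"])
print(report["greeting"])
```

Silencing the warning does not fix the underlying problem: a class that is never predicted usually needs more or more distinctive training examples.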





## Step 7: Saving the Model for Deployment

In [26]:
# Save the trained model and the vectorizer (essential for transforming new user input)
dump(nb_model, 'naive_bayes_intent_model.joblib')
dump(vectorizer, 'tfidf_vectorizer.joblib')
print("Model and Vectorizer saved using joblib.")

Model and Vectorizer saved using joblib.
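
The saved files are only useful if both are loaded back together, since the model expects vectors produced by the same fitted vectorizer. The sketch below round-trips a tiny stand-in model through `joblib` in a temporary directory; the training strings are made up, and the file names simply reuse the ones from the cell above.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in training data.
texts = ["good evening", "hello", "checkout end", "checkout time"]
labels = ["greeting", "greeting", "ask_checkout_time", "ask_checkout_time"]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "naive_bayes_intent_model.joblib")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.joblib")
    dump(model, model_path)
    dump(vectorizer, vectorizer_path)

    # Load both artifacts back, as a separate serving process would.
    loaded_model = load(model_path)
    loaded_vectorizer = load(vectorizer_path)

query = "checkout time"
original_intent = model.predict(vectorizer.transform([query]))[0]
reloaded_intent = loaded_model.predict(loaded_vectorizer.transform([query]))[0]
print("Original:", original_intent, "| Reloaded:", reloaded_intent)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.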


## Step 8: Chatbot Function Implementation

In [27]:
# Predefined fixed responses (Retrieval System)
responses = {
    "ask_room_price": "Our rooms start from RM180 per night.",
    "ask_availability": "We currently have several rooms available.",
    "ask_facilities": "We offer free Wi-Fi, breakfast, pool, gym and parking.",
    "ask_location": "We are located in Kuala Lumpur City Centre (KLCC).",
    "ask_checkin_time": "Check-in time is from 2:00 PM.",
    "ask_checkout_time": "Check-out time is at 12:00 PM.",
    "ask_booking": "You can book directly through our website or at the front desk.",
    "ask_cancellation": "Cancellations are free up to 24 hours before arrival.",
    "greeting": "Hello! How may I assist you today?",
    "goodbye": "Goodbye! Have a great day!"
}

def chatbot_reply_nb(user_input, model, vectorizer, responses):
    # 1. Preprocessing: apply the same cleaning that was used on the training data
    cleaned_input = preprocess_text(user_input)

    # 2. Feature Extraction: Transform the input using the fitted vectorizer
    vector = vectorizer.transform([cleaned_input])

    # 3. Intent Prediction
    intent = model.predict(vector)[0]

    # 4. Retrieval: return the response mapped to the predicted intent,
    # or a fallback message if the intent has no entry in the dictionary
    fallback = (
        f"Sorry, I predicted the intent '{intent}', but I don't have a specific "
        "response for that yet. Please rephrase your question."
    )
    return responses.get(intent, fallback)

# Test the chatbot function
print("\n --- Naive Bayes Chatbot Test ---")
test_input = "Goodbye"
predicted_response = chatbot_reply_nb(test_input, nb_model, vectorizer, responses)
print(f"User Input: {test_input}")
print(f"Chatbot Reply: {predicted_response}")


 --- Naive Bayes Chatbot Test ---
User Input: Goodbye
Chatbot Reply: Goodbye! Have a great day!
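
Because the classifier always returns one of the trained intents, the dictionary fallback only triggers when an intent has no entry in `responses`; an off-topic query still gets some intent's answer. One possible refinement, sketched below as an assumption rather than part of the notebook, is to fall back when the top `predict_proba` score is below a threshold. The training strings, the `FALLBACK` text, the `reply_with_threshold` name, and the 0.4 threshold are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-cleaned training queries.
texts = ["hello", "good morning", "goodbye", "bye see later", "room price", "cost per night"]
labels = ["greeting", "greeting", "goodbye", "goodbye", "ask_room_price", "ask_room_price"]
responses = {
    "greeting": "Hello! How may I assist you today?",
    "goodbye": "Goodbye! Have a great day!",
    "ask_room_price": "Our rooms start from RM180 per night.",
}
FALLBACK = "Sorry, I didn't understand that. Could you rephrase your question?"

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

def reply_with_threshold(user_input, threshold=0.4):
    """Return the mapped response, or FALLBACK when the top probability is below threshold."""
    # Lowercasing only, to keep the sketch free of NLTK; in practice reuse
    # the same preprocessing that was applied to the training data.
    probs = model.predict_proba(vectorizer.transform([user_input.lower()]))[0]
    best = probs.argmax()
    if probs[best] < threshold:
        return FALLBACK
    return responses.get(model.classes_[best], FALLBACK)

print(reply_with_threshold("Goodbye"))
# No known words: the vector is all zeros, so the probabilities equal the class priors.
print(reply_with_threshold("xyzzy"))
```

Naive Bayes probabilities on small datasets tend to be poorly calibrated, so a threshold like this would need tuning on held-out queries.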
