# Hotel Booking Chatbot

## Artificial Neural Network (ANN) Intent Classifier - Multi-Layer Perceptron (MLP)

This component implements the Intent Classification module, designed to categorize user queries into one of 10 predefined hotel booking intents.

Model: Multi-Layer Perceptron (MLPClassifier) from Scikit-learn, which is a class of Feedforward Artificial Neural Network (ANN).

Framework: Scikit-learn, NLTK, Pandas.

Output: Includes detailed performance metrics (Accuracy, Classification Report) and necessary files exported for platform deployment (.joblib files).

This approach gives a lightweight text classifier built entirely on standard machine-learning libraries, with no deep-learning framework required.

## STEP 1: Install Libraries and Download NLTK (if necessary)

This step ensures all necessary Python libraries and NLTK data are available in the environment.

**Note:** The NLTK download and installation commands should typically be run once.

In [None]:
# Install the required libraries (uncomment on first run)
# !pip install scikit-learn
# !pip install pandas
# !pip install nltk

# Download the required NLTK resources (uncomment on first run)
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('wordnet')
# nltk.download('stopwords')

## STEP 2: Import Libraries and Initialize Resources

Import all required libraries and initialize NLTK components like stop words and the lemmatizer. Also, load the dataset.

**Note:** Ensure a file named dataset.csv, with text and intent columns, is in the working directory.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Import MLPClassifier (ANN implementation in Scikit-learn)
from sklearn.neural_network import MLPClassifier 
from sklearn.metrics import classification_report, accuracy_score
from joblib import dump, load

# Import NLTK
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
# Import LabelEncoder to map intent names to integer class labels
from sklearn.preprocessing import LabelEncoder

# Initialize NLTK resources
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

df = pd.read_csv('dataset.csv')

# Shuffle the rows (fixed seed for reproducibility)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df.head())

                                text             intent
0                       Good evening           greeting
1           When does check-out end?  ask_checkout_time
2  What is your cancellation policy?   ask_cancellation
3      What is your check-in policy?   ask_checkin_time
4          When does check-in start?   ask_checkin_time


## STEP 3: Define and Apply Text Preprocessing

Define the preprocess_text function to clean the text data (lowercase, remove punctuation, tokenize, remove stopwords and lemmatize) and apply it to the dataset.

In [2]:
def preprocess_text(text):
    # 1. Convert to Lowercase
    text = text.lower()
    
    # 2. Remove Punctuation and Special Characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # 3. Tokenization
    tokens = word_tokenize(text)
    
    # 4. Stopword Removal
    tokens = [word for word in tokens if word not in stop_words]
    
    # 5. Lemmatization (Key Enhancement)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Rejoin tokens into a single string
    return ' '.join(tokens)

# Apply the preprocessing function to the text column
df['cleaned_text'] = df['text'].apply(preprocess_text)
print("--- Preprocessing Complete (with NLTK Lemmatization) ---")
print(df[['text', 'cleaned_text']].head())

--- Preprocessing Complete (with NLTK Lemmatization) ---
                                text         cleaned_text
0                       Good evening         good evening
1           When does check-out end?         checkout end
2  What is your cancellation policy?  cancellation policy
3      What is your check-in policy?       checkin policy
4          When does check-in start?        checkin start
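
The punctuation step explains why hyphenated words such as "check-out" become "checkout" in the output above. A minimal sketch of just the lowercase and regex steps, using only the standard library (strip_punctuation is a hypothetical helper name, not part of the notebook):

```python
import re

def strip_punctuation(text: str) -> str:
    """Lowercase the text, then drop every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

sample = "When does check-out end?"
print(strip_punctuation(sample))  # when does checkout end
```

Because the hyphen is removed rather than replaced by a space, "check-out" and "checkout" end up as the same token, which is probably what is wanted for this dataset.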


## STEP 4: Feature Extraction (TF-IDF) and Label Encoding

Convert the cleaned text into a numerical feature matrix using TF-IDF Vectorization and encode the categorical intent labels into numerical format using LabelEncoder.

In [3]:
# Prepare the cleaned text and intents for the model training section
X = df['cleaned_text']
y = df['intent']

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data to create the feature matrix X
X = vectorizer.fit_transform(X)

# Initialize and fit the LabelEncoder to the intent labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print("--- Intent labels have been encoded to numbers ---")

print(f"Feature matrix X shape: {X.shape}")
print(f"Labels y (encoded) shape: {y_encoded.shape}")

--- Intent labels have been encoded to numbers ---
Feature matrix X shape: (100, 105)
Labels y (encoded) shape: (100,)
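
To see what the two transformers produce, the following sketch applies them to three hand-picked cleaned phrases. The toy_ names and the three-row corpus are assumptions for illustration only; the notebook itself fits on the full 100-row dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy corpus (illustrative only, not the real dataset)
toy_texts = ["checkin start", "checkout end", "cancellation policy"]
toy_intents = ["ask_checkin_time", "ask_checkout_time", "ask_cancellation"]

# One row per document, one column per distinct word
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)
print(toy_X.shape)                         # (3, 6)
print(sorted(toy_vectorizer.vocabulary_))  # the six vocabulary words

# LabelEncoder assigns integers in alphabetical order of the class names
toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_intents)
print(toy_le.classes_.tolist())
print(toy_y.tolist())                      # [1, 2, 0]
```

The same reading applies to the shapes printed above: 100 documents and a vocabulary of 105 distinct words.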


## STEP 5: Split Data and Train the MLPClassifier Model

Split the data into training and testing sets, then instantiate and train the MLPClassifier.

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded, # Use encoded labels for training and testing
    test_size=0.2,
    random_state=42,
    # Stratify only if there is more than one unique class
    stratify=y_encoded if len(df['intent'].unique()) > 1 else None,
)

print(f"Train set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# Instantiate the ANN (MLPClassifier) model
# hidden_layer_sizes=(100, 50): Two hidden layers with 100 and 50 neurons respectively
# activation='relu': Rectified Linear Unit activation function
# max_iter=300: Maximum number of training epochs (for the 'adam' solver)
# solver='adam': Optimizer
# alpha=0.0001: Strength of the L2 regularization term
ann_model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    max_iter=300,
    solver='adam',
    alpha=0.0001,
    random_state=42
)

# Train the model (using dense matrix)
print("\n--- Training MLPClassifier with DENSE Matrix ---")
ann_model.fit(X_train.toarray(), y_train)
print("Training complete.")

Train set size: 80 samples
Test set size: 20 samples

--- Training MLPClassifier with DENSE Matrix ---
Training complete.
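
After fitting, it can be useful to check the network's structure and whether training stopped before max_iter. The sketch below does this on a tiny made-up corpus (the toy_ names and phrases are assumptions, not the notebook's data); the same attributes are available on ann_model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Tiny illustrative corpus with two intents
toy_texts = ["good morning", "hello", "good evening", "bye", "goodbye", "see later"]
toy_labels = ["greeting", "greeting", "greeting", "goodbye", "goodbye", "goodbye"]

toy_X = TfidfVectorizer().fit_transform(toy_texts)

toy_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42)
# MLPClassifier accepts sparse input and string labels directly;
# the dense conversion and label encoding used in the notebook also work.
toy_model.fit(toy_X, toy_labels)

print(toy_model.n_layers_)                   # input + 2 hidden + output = 4
print([w.shape for w in toy_model.coefs_])   # one weight matrix per layer transition
print(toy_model.n_iter_, toy_model.loss_)    # epochs actually run and final training loss
```

If n_iter_ equals max_iter, the optimizer did not converge within the allowed epochs, and raising max_iter may be worth trying.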


## STEP 6: Evaluate the Model

Evaluate the trained MLPClassifier model on the test set. Convert the numerical predictions and test labels back to the original intent names for a human-readable classification report.

In [5]:
# Make predictions on the test set
pred_encoded = ann_model.predict(X_test.toarray())

# Decode the prediction results back to original label names
pred = le.inverse_transform(pred_encoded)
y_test_decoded = le.inverse_transform(y_test)

print("--- ANN (MLPClassifier) Model Evaluation ---")
# Calculate and print accuracy
accuracy = accuracy_score(y_test_decoded, pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:\n")
# Print classification report only if there are multiple classes
if len(df['intent'].unique()) > 1:
    print(classification_report(y_test_decoded, pred, zero_division=0))
else:
    print("Classification Report skipped: Only one class in the dataset.")

--- ANN (MLPClassifier) Model Evaluation ---
Accuracy: 0.7500

Classification Report:

                   precision    recall  f1-score   support

 ask_availability       1.00      0.50      0.67         2
      ask_booking       0.67      1.00      0.80         2
 ask_cancellation       1.00      1.00      1.00         2
 ask_checkin_time       0.67      1.00      0.80         2
ask_checkout_time       1.00      0.50      0.67         2
   ask_facilities       0.00      0.00      0.00         2
     ask_location       0.40      1.00      0.57         2
   ask_room_price       1.00      1.00      1.00         2
          goodbye       1.00      0.50      0.67         2
         greeting       1.00      1.00      1.00         2

         accuracy                           0.75        20
        macro avg       0.77      0.75      0.72        20
     weighted avg       0.77      0.75      0.72        20
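
A confusion matrix shows which intents a low-recall class is being mistaken for, which the per-class report alone does not reveal. The sketch below uses hand-written labels purely for illustration; they are not the notebook's actual predictions. In the notebook, y_test_decoded and pred would be passed instead.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only (not the real test results)
labels = ["ask_facilities", "ask_location", "greeting"]
y_true = ["ask_facilities", "ask_facilities", "ask_location", "ask_location", "greeting", "greeting"]
y_pred = ["ask_location", "ask_location", "ask_location", "ask_location", "greeting", "greeting"]

# Rows are true intents, columns are predicted intents, both in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

In this made-up case the first row has both of its samples in the second column, meaning every ask_facilities query was predicted as ask_location.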



## STEP 7: Save Model Assets

Save the trained MLPClassifier model, the fitted TfidfVectorizer and the LabelEncoder using joblib. These assets are needed to deploy the model and make real-time predictions, for example in a Streamlit app.

In [6]:
# Save the trained ANN model, Vectorizer, and LabelEncoder
dump(ann_model, 'ann_intent_model_dense.joblib')
dump(vectorizer, 'ann_tfidf_vectorizer_dense.joblib')
dump(le, 'ann_label_encoder_dense.joblib') # Saving LabelEncoder is crucial for decoding predictions
print("Model, Vectorizer, and LabelEncoder saved using joblib.")

Model, Vectorizer, and LabelEncoder saved using joblib.
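
At deployment time the saved files are read back with joblib's load. The following sketch shows the round trip with a small LabelEncoder written to a temporary directory; the demo file name is an assumption, and a deployed app would instead load the three .joblib files saved above.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.preprocessing import LabelEncoder

# Small stand-in object (the real assets are the model, vectorizer and encoder)
encoder = LabelEncoder().fit(["greeting", "goodbye"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_encoder_demo.joblib")  # hypothetical file name
    dump(encoder, path)
    restored = load(path)

# The restored encoder decodes integers exactly like the original
print(restored.classes_.tolist())          # ['goodbye', 'greeting']
print(restored.inverse_transform([1])[0])  # greeting
```

Loading works most reliably when the deployment environment uses the same scikit-learn version that was used for saving.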


## STEP 8: Define Chatbot Function and Test

Define the chatbot's response function, which preprocesses the user input, predicts the intent using the trained model, vectorizer and label encoder, and retrieves a predefined response. Finally, test the full pipeline with a sample input.

In [10]:
# Predefined fixed responses (Retrieval System)
responses = {
    "ask_room_price": "Our rooms start from RM180 per night.",
    "ask_availability": "We currently have several rooms available.",
    "ask_facilities": "We offer free Wi-Fi, breakfast, pool, gym and parking.",
    "ask_location": "We are located in Kuala Lumpur City Centre (KLCC).",
    "ask_checkin_time" : "Check-in time is from 2:00 PM.",
    "ask_checkout_time" : "Check-out time is at 12:00 PM.",
    "ask_booking" : "You can book directly through our website or at the front desk.",
    "ask_cancellation" : "Cancellations are free up to 24 hours before arrival.",
    "greeting" : "Hello! How may I assist you today?",
    "goodbye" : "Goodbye! Have a great day!"
}

# vectorizer.transform() returns a sparse matrix. MLPClassifier.predict() accepts sparse input,
# but the input is converted to a dense array here to match the format used during training.

def chatbot_reply_ann(user_input, model, vectorizer, label_encoder, responses):
    # Preprocessing
    user_input_cleaned = preprocess_text(user_input)
    
    # Feature Extraction: Transform the input using the fitted vectorizer (produces sparse matrix)
    vector_sparse = vectorizer.transform([user_input_cleaned])

    # Convert input vector to dense matrix to match the model's training format
    vector_dense = vector_sparse.toarray()

    # Intent Prediction (returns the encoded number)
    intent_encoded = model.predict(vector_dense)[0]

    # Decode the prediction result back to the original label name
    intent = label_encoder.inverse_transform([intent_encoded])[0]

    # Retrieval
    return responses.get(intent, f"Sorry, I predicted the intent '{intent}', but I don't have a specific response for that yet. Please rephrase your question.")

# Test the chatbot function
print("\n --- ANN (MLPClassifier) Chatbot Test (Dense Trained) ---")
test_input = "Do you have parking?"
predicted_response = chatbot_reply_ann(test_input, ann_model, vectorizer, le, responses)
print(f"User Input: {test_input}")
print(f"Chatbot Reply: {predicted_response}")


 --- ANN (MLPClassifier) Chatbot Test (Dense Trained) ---
User Input: Do you have parking?
Chatbot Reply: We offer free Wi-Fi, breakfast, pool, gym and parking.
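
The function above always returns the response for the top predicted intent, even for inputs unrelated to hotel booking. One possible extension, sketched here as an assumption rather than as part of the notebook, is to fall back to a generic reply when the model's top probability is low. The helper pick_intent and the 0.5 threshold are hypothetical; with the trained model, the probabilities would come from model.predict_proba(vector_dense)[0] and the class names from label_encoder.classes_.

```python
import numpy as np

def pick_intent(probabilities, class_names, threshold=0.5):
    """Return the most probable intent, or None if its probability is below the threshold."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probabilities))
    if probabilities[best] < threshold:
        return None  # caller should reply with a generic fallback message
    return class_names[best]

classes = ["ask_facilities", "ask_location", "greeting"]
print(pick_intent([0.7, 0.2, 0.1], classes))    # ask_facilities
print(pick_intent([0.4, 0.35, 0.25], classes))  # None
```

A suitable threshold depends on the data and would need to be tuned on held-out examples.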
