# Importing libraries

# Python Libraries Description

## Overview
This section describes the Python libraries used in this notebook for data analysis, visualization, machine learning, and model explanation.

## Library Descriptions
- **numpy**: Handles numerical computations with arrays and matrices.
- **pandas**: Manages and analyzes data using DataFrames.
- **matplotlib.pyplot**: Generates plots and visualizations.
- **seaborn**: Creates enhanced statistical visualizations.
- **tensorflow**: A comprehensive platform for building and training machine learning models.
    - **tensorflow.keras**: A high-level API within TensorFlow for building and training deep learning models (e.g., `Sequential`, `Dense`, `Dropout`).
- **scikit-learn (sklearn)**: A powerful and widely used library for classical machine learning. It includes modules for:
    - **Preprocessing**: (e.g., `LabelEncoder`, `StandardScaler`)
    - **Model Selection**: (e.g., `train_test_split`)
    - **Models**: (e.g., `LogisticRegression`)
    - **Metrics**: (e.g., `accuracy_score`, `classification_report`)
- **lime**: Stands for Local Interpretable Model-agnostic Explanations. It's a library used to explain the predictions of any machine learning model.
- **shap**: (SHapley Additive exPlanations) A library for explaining the output of machine learning models, providing insights into feature importance and prediction logic.

## Purpose
These libraries enable efficient data processing, visualization, and machine learning model development in Jupyter Notebooks, using standard aliases for concise coding.

In [None]:
import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder, StandardScaler
from lime import lime_tabular
from lime.lime_tabular import LimeTabularExplainer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Loading and Preparing Data in Jupyter Notebook

## Overview
This Python code loads three CSV files into pandas DataFrames and then renames specific columns to prevent naming conflicts, preparing the data for analysis in a Jupyter Notebook.

## Code Description
- **`hotels = pd.read_csv(...)`**: Loads hotel data from a CSV file into a DataFrame named `hotels`.
- **`reviews = pd.read_csv(...)`**: Loads review data from a CSV file into a DataFrame named `reviews`.
- **`users = pd.read_csv(...)`**: Loads user data from a CSV file into a DataFrame named `users`.
- **`hotels = hotels.rename(...)`**: Renames the `country` column in the `hotels` DataFrame to `hotel_country`.
- **`users = users.rename(...)`**: Renames the `country` column in the `users` DataFrame to `user_country`.

## Purpose
This code imports datasets into pandas DataFrames. It also performs initial data cleaning by **renaming the 'country' columns** in the `hotels` and `users` tables. This is a crucial step to avoid ambiguity and prevent column name collisions before merging these DataFrames.
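To illustrate why the rename matters, here is a minimal sketch on toy tables. The `*_demo` frames and their values are invented for illustration and are not the real CSV contents. It shows how pandas handles two `country` columns with and without the rename.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
hotels_demo = pd.DataFrame({"hotel_id": [1], "country": ["France"]})
users_demo = pd.DataFrame({"user_id": [10], "country": ["Egypt"]})
reviews_demo = pd.DataFrame({"review_id": [100], "hotel_id": [1], "user_id": [10]})

# Without renaming, pandas disambiguates the clash with _x / _y suffixes.
clash = reviews_demo.merge(hotels_demo, on="hotel_id").merge(users_demo, on="user_id")
print(list(clash.columns))

# With renaming, each column keeps a self-describing name.
clean = (
    reviews_demo
    .merge(hotels_demo.rename(columns={"country": "hotel_country"}), on="hotel_id")
    .merge(users_demo.rename(columns={"country": "user_country"}), on="user_id")
)
print(list(clean.columns))
```

The suffixed names `country_x` and `country_y` carry no hint about which table they came from, which is what the explicit rename avoids.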

In [None]:
hotels=pd.read_csv("/Users/macbookpro/Desktop/ACL/archive/hotels.csv")
reviews=pd.read_csv("/Users/macbookpro/Desktop/ACL/archive/reviews.csv")
users=pd.read_csv("/Users/macbookpro/Desktop/ACL/archive/users.csv")
hotels = hotels.rename(columns={'country': 'hotel_country'})
users = users.rename(columns={'country': 'user_country'})

### 🔍 Data Quality Check

This cell checks for duplicates and missing values in the **Hotels**, **Reviews**, and **Users** datasets using `duplicated()` and `isnull().sum()`.  
It was found that there are **no duplicate records** and **no missing values** in any of the datasets.


In [None]:
print(f"Hotels duplicates: {hotels.duplicated().sum()}")
print(f"Reviews duplicates: {reviews.duplicated().sum()}")
print(f"Users duplicates: {users.duplicated().sum()}")
print("--------------------------------")
print(f"Hotels nulls: {hotels.isnull().sum()}")
print("--------------------------------")
print(f"Reviews nulls: {reviews.isnull().sum()}")
print("--------------------------------")
print(f"Users nulls: {users.isnull().sum()}")

### 🧾 Dataset Overview

This cell displays the structure and summary information of the **Reviews**, **Hotels**, and **Users** datasets using `info()`.  
It shows the number of entries, column names, data types, and confirms that there are **no missing values** in any dataset.


In [None]:
reviews.info()
print("--------------------------------")   
hotels.info()
print("--------------------------------")
users.info()
print("--------------------------------")

# Merging DataFrames

## Overview
Merges pandas DataFrames (`reviews`, `hotels`, `users`) for analysis in Jupyter Notebook.

## Description
- **`review_hotel_df = reviews.merge(hotels, on='hotel_id', how='left')`**: Merges `reviews` with `hotels` on `hotel_id` using left join.
- **`df = review_hotel_df.merge(users, on='user_id', how='left')`**: Merges `review_hotel_df` with `users` on `user_id` using left join.

## Purpose
Combines review, hotel, and user data into one DataFrame for integrated analysis.
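As an optional safeguard, a left join can be checked while it runs. The sketch below uses toy data (the `*_demo` frames are hypothetical) together with the `validate` and `indicator` parameters of `DataFrame.merge`. It assumes each `hotel_id` appears only once in the hotels table.

```python
import pandas as pd

# Toy data: the third review points at a hotel that is missing from hotels_demo.
reviews_demo = pd.DataFrame({"review_id": [1, 2, 3], "hotel_id": [1, 1, 3]})
hotels_demo = pd.DataFrame({"hotel_id": [1, 2], "hotel_name": ["A", "B"]})

merged = reviews_demo.merge(
    hotels_demo,
    on="hotel_id",
    how="left",
    validate="many_to_one",  # raises MergeError if hotel_id is duplicated on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left join keeps every review; unmatched rows get NaN hotel columns.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

If `unmatched` is greater than zero, the hotel columns for those reviews are null, which is what a post-merge null check would catch.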

In [None]:
review_hotel_df=reviews.merge(hotels,on='hotel_id',how='left')
df=review_hotel_df.merge(users,on='user_id',how='left')


## Data Integrity Check: Verifying Row Count and Nulls in Key Columns (Features about the user)

In [None]:
print(f"Total rows in df (should be ~50,000): {len(df)}")
print("\nNull values *after* merge:")
print(df[['user_gender', 'age_group', 'traveller_type']].isnull().sum())

# Country Grouping in DataFrame

## Overview
Assigns country groups to hotels based on their country and displays selected columns.

## Description
- **`groups = {...}`**: Defines a dictionary mapping regions to lists of countries (e.g., North_America: United States, Canada).
- **`df["country_group"] = df["hotel_country"].apply(...)`**: Creates a `country_group` column by mapping `hotel_country` to a region from `groups`, defaulting to "Other" if not found.
- **`df[["hotel_country","user_country","country_group"]]`**: Selects `hotel_country`, `user_country`, and `country_group` columns for display.

## Purpose
Categorizes each review's hotel by geographic region and displays the country columns to verify the new `country_group` column.
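An equivalent way to build the column is to invert the region dictionary once and use `Series.map`, which avoids scanning every region for every row. The sketch below uses a shortened, hypothetical `groups_demo` dictionary and a toy Series rather than the real data.

```python
import pandas as pd

# Shortened stand-in for the region dictionary (illustrative only).
groups_demo = {
    "North_America": ["United States", "Canada"],
    "East_Asia": ["China", "Japan"],
}

# Invert region -> countries into country -> region.
country_to_group = {
    country: region
    for region, countries in groups_demo.items()
    for country in countries
}

hotel_country_demo = pd.Series(["Canada", "Japan", "Peru"])

# Countries missing from the lookup become NaN, then fall back to "Other".
mapped = hotel_country_demo.map(country_to_group).fillna("Other")
print(mapped.tolist())
```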

In [None]:
groups={'North_America':['United States','Canada'],
        'Western_Europe':['Germany','France','United Kingdom','Netherlands','Spain','Italy'],
        'Eastern_Europe':['Russia'],
        'East_Asia':['China','Japan','South Korea'],
        'Southeast_Asia':['Thailand','Singapore'],
        'Middle_East':['United Arab Emirates','Turkey'],
        'Africa':['Egypt','Nigeria','South Africa'],
        'Oceania':['Australia','New Zealand'],
        'South_America':['Brazil','Argentina'],
        'South_Asia':['India'],
        'North_America_Mexico':['Mexico']}

df["country_group"]=df["hotel_country"].apply(lambda x: next((key for key, value in groups.items() if x in value), "Other"))

df[["hotel_country","user_country","country_group"]]

# Data-Engineering Question 1 

## Overview
Calculates the best city for each traveller type based on reviews.

## Description
- **`city_scores = df.groupby(['traveller_type', 'city'])['score_overall'].mean().reset_index().sort_values(...)`**: Groups data by `traveller_type` and `city`, computes mean `score_overall`, resets index, and sorts by `traveller_type` (ascending) and `score_overall` (descending).
- **`best_cities = city_scores.groupby('traveller_type').head(1)`**: Selects the top city (highest score) for each `traveller_type`.
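The same "best row per group" selection can also be written with `idxmax`, as sketched below on invented toy reviews (the `demo` frame is hypothetical). This is an alternative formulation, not the approach used in the cell.

```python
import pandas as pd

# Invented toy reviews for illustration.
demo = pd.DataFrame({
    "traveller_type": ["Solo", "Solo", "Solo", "Family", "Family"],
    "city": ["Paris", "Paris", "Cairo", "Cairo", "Tokyo"],
    "score_overall": [8.0, 9.0, 9.5, 7.0, 8.0],
})

# Mean score per (traveller_type, city) pair.
means = demo.groupby(["traveller_type", "city"])["score_overall"].mean().reset_index()

# idxmax returns, for each traveller type, the row label of its highest mean score.
best = means.loc[means.groupby("traveller_type")["score_overall"].idxmax()]
print(best)
```

Both approaches return one row per traveller type. When two cities tie, `idxmax` keeps the first occurrence, so the chosen city depends on row order in either formulation.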


In [None]:
city_scores = df.groupby(['traveller_type', 'city'])['score_overall'].mean().reset_index().sort_values(['traveller_type', 'score_overall'], ascending=[True,False])

best_cities = city_scores.groupby('traveller_type').head(1)

# display(city_scores)


# Plot for Question 1
### 📊 Best City per Traveller Type

This bar chart compares the **average overall score** for different traveller types, highlighting the **best-rated city** for each group.  
Each bar represents a traveller type, labeled with the city that received the highest score and its exact value.  
The plot shows how preferences differ among traveller types based on their overall ratings.


In [None]:
plt.figure(figsize=(10, 7))

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']

bars = plt.bar(best_cities['traveller_type'], 
               best_cities['score_overall'], 
               color=colors, 
               edgecolor='black', 
               linewidth=1.5, 
               alpha=0.85,
               width=0.6)

plt.xlabel('Traveller Type', fontsize=13, fontweight='bold')
plt.ylabel('Average Overall Score', fontsize=13, fontweight='bold')
plt.title('Best City Recommendation for Each Traveller Type', 
          fontsize=15, fontweight='bold', pad=20)
plt.ylim(0, 10)
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Label each bar with its best city (above the bar) and its score (inside the bar)
for i, (traveller, city, score) in enumerate(zip(best_cities['traveller_type'], 
                                                   best_cities['city'], 
                                                   best_cities['score_overall'])):
    plt.text(i, score + 0.15, f'Best: {city}', 
             ha='center', va='bottom', fontsize=11, fontweight='bold')
    plt.text(i, score - 0.4, f'{score:.2f}', 
             ha='center', va='top', fontsize=10, fontweight='bold', color='white')

# plt.xticks(rotation=0, fontsize=11)
plt.tight_layout()
plt.show()



# Question 2

### 🌍 Top Countries by Value-for-Money Score

This code calculates the **average value-for-money score** for each country within every **age group**.  
It then sorts the results and extracts the **top 3 countries** per age group with the highest average scores.


In [None]:
top_countries=df.groupby(["age_group","user_country"])["score_value_for_money"].mean().reset_index()
top_3=top_countries.sort_values(["age_group","score_value_for_money"],ascending=[True,False]).groupby("age_group").head(3)
print(top_3)

## Plot: Value-for-Money Analysis

### 💰 Top 3 Countries: Value Score by Age Group

This grouped bar chart visualizes the **average "value-for-money" score** for the **top 3 user countries in each age group**. Age groups are distinguished by bar color (the `hue`).

Each cluster of bars represents a single country, allowing for a direct comparison of how different age demographics rate the "value-for-money" in that location. The y-axis shows the average score, making it easy to see which country and age group combination has the highest perceived value.

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(
    data=top_3, 
    x='user_country', 
    y='score_value_for_money', 
    hue='age_group', 
    palette='viridis'
)

plt.title('Top 3 Countries with Best Value-for-Money Score per Age Group', fontsize=14, fontweight='bold')
plt.xlabel('Country')
plt.ylabel('Average Value-for-Money Score')

plt.legend(title='Age Group')

plt.tight_layout()
plt.show()

# ---------------------------------------------------------

In [None]:
df.groupby('city')[['country_group']].nunique()

## Encoding Gender, Traveller Type, and Age
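With `drop_first=True`, the first category of each column (in sorted order) gets no indicator column and is represented by all zeros. The sketch below shows this on a toy gender column. The values are illustrative, and the real data may contain other categories.

```python
import pandas as pd

# Toy column for illustration.
demo = pd.DataFrame({"user_gender": ["Female", "Male", "Other", "Female"]})

encoded = pd.get_dummies(demo, columns=["user_gender"], drop_first=True)
print(encoded)

# "Female" sorts first, so it is dropped and encoded as zeros in both columns.
female_row_sum = int(encoded.iloc[0].sum())
```

Any code that later builds model inputs by hand has to reproduce this baseline convention, leaving every indicator at 0 for the dropped category.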

In [None]:
df = pd.get_dummies(df, columns=['traveller_type'], drop_first=True)
# df = pd.get_dummies(df, columns=['user_country'], drop_first=True)
df = pd.get_dummies(df, columns=['user_gender'], drop_first=True)

age_order = {
    '18-24': 1,
    '25-34': 2,
    '35-44': 3,
    '45-54': 4,
    '55+': 5
}

df['age'] = df['age_group'].map(age_order)
df.drop(columns=['age_group'], inplace=True)

In [None]:
cols = [
    'score_cleanliness', 'score_facilities', 'score_staff',
    'star_rating', 'comfort_base', 'location_base',
    'value_for_money_base', 'age'

]

corr = df[cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Heatmap (Simplified)")
plt.show()

In [None]:
# Keep only strong correlations (|r| >= 0.5); weaker cells are masked as NaN
strong_corr = corr[corr.abs() >= 0.5]
plt.figure(figsize=(10, 8))
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0)
plt.title("Strong Correlations (|r| >= 0.5)")
plt.show()

## Feature Engineering: Score vs. Baseline

### 📈 Creating "Difference" Features

This code block creates seven new "difference" (`diff_`) columns.

These features are calculated by subtracting a "base" value from the actual "score" for categories like cleanliness, comfort, and staff. This measures how much a hotel **over-performs or under-performs** compared to its baseline.

It also calculates the difference between the `score_overall` and the `star_rating`, which can show if a hotel is rated higher or lower than its official star category would suggest.

Finally, the code confirms the creation and prints the `head()` of these new columns to show the results.
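Because most of these features follow the same `score_<name> - <name>_base` pattern, they could also be generated in a loop. The sketch below shows the idea on a toy frame with two categories. The `demo` values are invented, and the loop is an alternative to writing each line by hand.

```python
import pandas as pd

# Toy frame with invented values; column names follow the score_/_base pattern.
demo = pd.DataFrame({
    "score_staff": [9.0, 7.5],
    "staff_base": [8.5, 8.0],
    "score_comfort": [8.0, 8.0],
    "comfort_base": [8.0, 7.0],
})

for name in ["staff", "comfort"]:
    # Positive: the review scored above the hotel's baseline; negative: below it.
    demo[f"diff_{name}"] = demo[f"score_{name}"] - demo[f"{name}_base"]

print(demo[["diff_staff", "diff_comfort"]])
```

The `diff_overall_vs_star` feature does not fit this pattern, since it subtracts `star_rating` rather than a `_base` column, so it would still need its own line.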

In [None]:

df['diff_cleanliness'] = df['score_cleanliness'] - df['cleanliness_base']
df['diff_comfort'] = df['score_comfort'] - df['comfort_base']
df['diff_facilities'] = df['score_facilities'] - df['facilities_base']
df['diff_overall_vs_star'] = df['score_overall'] - df['star_rating']
df['diff_location'] = df['score_location'] - df['location_base']
df['diff_staff'] = df['score_staff'] - df['staff_base']
df['diff_value_for_money'] = df['score_value_for_money'] - df['value_for_money_base']

print("Successfully created all 'diff' features. ✅")
print("\nHead of the new 'diff' features:")
print(df[[ 'diff_cleanliness', 'diff_comfort', 'diff_facilities','diff_overall_vs_star', 'diff_location', 'diff_staff', 'diff_value_for_money']].head())

### 🧹 Feature Selection and Cleanup

This cell removes unnecessary or non-numerical columns (like IDs, text, and location data) that are not needed for analysis or modeling.  
The resulting `final_df` contains only the relevant features for further processing.


In [None]:

columns_to_drop = [
    'review_id',         
    'user_id',           
    'hotel_id',          
    'review_date',      
    'join_date',          
    'review_text',       
    'hotel_name',        
    'hotel_country',   
    'lat', 
    'score_overall',
    'score_cleanliness',
    'score_comfort',
    'score_facilities',
    'score_location',
    'score_staff',
    'score_value_for_money',
    'city',
    'star_rating',
    'cleanliness_base',
    'comfort_base',
    'facilities_base',
    'location_base',
    'staff_base',
    'value_for_money_base',   
    'user_country',            
    'lon'  
]
final_df=df.drop(columns=columns_to_drop)

# final_df.to_csv('final_dataset.csv', index=False)   
final_df.head()


In [None]:
final_df.info()


In [None]:
final_df.to_csv('final_dataset.csv', index=False)

# Checking for null values 

The check below shows that `final_df` contains no null values.

In [None]:
final_df.isnull().sum()


## 🎯 Defining Features (X) and Target (y)

This code block prepares the data for a machine learning model by separating it into two distinct variables:

1.  **`X` (Features)**: This DataFrame, `X`, holds all the **independent variables** (or features) that the model will use to learn and make predictions. It is created by selecting specific columns from the `final_df`, including:
    * The **"difference" features** (e.g., `diff_overall_vs_star`, `diff_cleanliness`).
    * **One-hot encoded features** for `traveller_type` and `user_gender`.
    * The numerical `age` feature.

2.  **`y` (Target)**: This pandas Series, `y`, holds the **dependent variable** (or target) that the model will be trained to predict.
    * In this case, the target is the `country_group` column.

In [None]:
X = final_df[['diff_overall_vs_star','diff_cleanliness','diff_comfort','diff_facilities','diff_location','diff_staff','diff_value_for_money','traveller_type_Couple','traveller_type_Family','traveller_type_Solo', 'user_gender_Male','user_gender_Other','age' ]] 
y = final_df['country_group']

## Plot: Target Variable Distribution

### 📊 Country Group Distribution

This line of code visualizes the distribution of the target variable `y` (which contains the "Country Group" categories).

It first performs a `value_counts()` to count the total number of occurrences for each unique category in `y`. Then, it immediately uses `.plot(kind='bar')` to create a **bar chart** of these counts. The `title` is set to "Country Group Distribution", and `plt.show()` displays the final visual.

**Observation:** The plot shows an imbalanced distribution, with a **significantly larger number of samples in the `Western_Europe` category** than in the others.
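One way to quantify the imbalance is to look at class shares instead of raw counts. The share of the largest class is also the accuracy a model would reach by always predicting that class. The sketch below uses invented labels, not the real target.

```python
import pandas as pd

# Invented labels for illustration: 6 / 3 / 1 split across three groups.
y_demo = pd.Series(["Western_Europe"] * 6 + ["East_Asia"] * 3 + ["Africa"] * 1)

# normalize=True returns proportions, sorted from most to least frequent.
shares = y_demo.value_counts(normalize=True)
majority_baseline = shares.iloc[0]

print(shares)
print(f"Always predicting '{shares.index[0]}' scores {majority_baseline:.0%} accuracy")
```

Comparing a trained model's accuracy against this baseline helps judge whether it has learned anything beyond the class frequencies.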

In [None]:
y.value_counts().plot(kind='bar', title='Country Group Distribution')
plt.show()

## Data Preparation

### 🔪 Splitting Data into Training and Test Sets

This line of code uses the `train_test_split` function from scikit-learn to divide the dataset into two parts: a **training set** and a **test set**.

* **`X, y`**: These are the complete datasets, with `X` being the features and `y` being the target variable (labels).
* **`test_size=0.2`**: This parameter specifies that **20%** of the data should be reserved for the test set. The remaining **80%** will be used for the training set.
* **`random_state=42`**: This acts as a seed for the random shuffling. By setting a specific number (like 42), we ensure that the split is **reproducible**—meaning, every time this code is run, the data will be split in the exact same way.

The function returns four new variables:
1.  **`X_train`**: The 80% of features used for training the model.
2.  **`X_test`**: The 20% of features used for testing the model.
3.  **`y_train`**: The corresponding 80% of labels for training.
4.  **`y_test`**: The corresponding 20% of labels for testing.
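The effect of `random_state` can be checked directly on toy arrays, as in the sketch below. The `stratify` argument shown here is an optional extra that keeps class proportions equal in both parts. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 2 features, with an 80/20 class imbalance.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

# Two calls with the same seed produce the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
same_split = np.array_equal(split_a[1], split_b[1])

print(X_tr.shape, X_te.shape, same_split)
print("Minority share in test set:", y_te.mean())
```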

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
log_model = LogisticRegression(
    max_iter=100,
)

In [None]:
log_model.fit(X_train,y_train)

In [None]:
y_pred = log_model.predict(X_test)

print("=== Logistic Regression Evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("\nDetailed Report:\n", classification_report(y_test, y_pred))


In [None]:
# global explanation
explainer = shap.LinearExplainer(log_model, X_train)

shap_values = explainer.shap_values(X_test)

if isinstance(shap_values, list):
    shap_values = shap_values[1] if len(shap_values) > 1 else shap_values[0]


shap_values = np.array(shap_values, dtype=np.float64)


shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=X_test.shape[1], show=True)
shap.summary_plot(shap_values, X_test)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Ensure consistent random state
np.random.seed(42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

explainer = LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=X_train.columns,
    class_names=log_model.classes_.astype(str),
    mode='classification',
    discretize_continuous=True
)

sample_idx = 39
sample = X_test.iloc[sample_idx]


pred_label = log_model.predict(sample.values.reshape(1, -1))[0]      
pred_class_idx = np.where(log_model.classes_ == pred_label)[0][0]    
pred_class_name = log_model.classes_[pred_class_idx]

print(f"Predicted class: {pred_class_name}")


exp = explainer.explain_instance(
    data_row=sample.values,  # LIME expects a 1-D NumPy array, not a pandas Series
    predict_fn=log_model.predict_proba,
    num_features=10,
    labels=[pred_class_idx]
)

exp.show_in_notebook(show_table=True, labels=[pred_class_idx])

print(f"\nLIME Explanation for predicted class '{pred_class_name}':")
for feature, weight in exp.as_list(label=pred_class_idx):
    print(f"{feature}: {weight:.4f}")

### Encoded the string labels into integers and converted them into one-hot vectors so the neural network can understand and process the target classes numerically during training.
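The two encoding steps can be reproduced with plain NumPy to see what they produce. The sketch below is an illustration on invented labels, not a replacement for `LabelEncoder` and `to_categorical`.

```python
import numpy as np

# Invented string labels for illustration.
labels = np.array(["Africa", "Oceania", "Africa", "East_Asia"])

# Step 1: integer-encode. Classes are sorted alphabetically, as LabelEncoder does.
classes, encoded = np.unique(labels, return_inverse=True)

# Step 2: one-hot encode by picking rows of an identity matrix.
one_hot = np.eye(len(classes))[encoded]

print(classes)   # class order defines the meaning of each output column
print(encoded)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the column of its class, which matches the shape a softmax output layer is trained against.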


In [None]:
# Encode string labels into integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# One-hot encode for NN output layer
y_categorical = to_categorical(y_encoded)

In [None]:
print("y_encoded shape:", y_encoded.shape)
print("y_categorical shape:", y_categorical.shape)

### Split the dataset into training and testing sets to evaluate the model’s performance, preserving the class distribution with stratified sampling.
Stratified sampling ensures that each class maintains the same proportion in both the training and testing sets, preventing bias toward more frequent classes.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y_categorical, test_size=0.2, random_state=42, stratify=y_encoded
)

### Scaled the feature data using StandardScaler to normalize input values. 
This standardization centers the data around zero with unit variance, helping the neural network train faster and perform better by ensuring all features contribute equally.
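The transformation itself is `z = (x - mean) / std`, with the mean and standard deviation taken from the training data only and then reused for the test data. The NumPy sketch below reproduces that arithmetic on invented numbers.

```python
import numpy as np

# Invented data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics come from the training set only (population std, ddof=0,
# which is what StandardScaler uses).
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, do not refit

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Reusing the training statistics on the test set prevents information about the test data from leaking into preprocessing.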


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Built a sequential neural network model consisting of multiple layers. 
It includes two hidden layers with ReLU activation for learning complex patterns, a dropout layer to reduce overfitting, and a final softmax output layer for multi-class classification.


In [None]:
model = Sequential([
    Dense(64, input_dim=X_train_scaled.shape[1], activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(y_train.shape[1], activation='softmax')  
])

In [None]:
#test model(no change)
# num_features = X_train_scaled.shape[1]
# num_classes = y_train.shape[1]  

# model = Sequential([
#     Dense(128, activation='relu', input_shape=(num_features,)),
#     Dropout(0.3),
#     Dense(64, activation='relu'),
#     Dropout(0.3),
#     Dense(num_classes, activation='softmax')
# ])


In [None]:
print("X_train_scaled shape:", X_train_scaled.shape)
print("y_train shape:", y_train.shape)

### Compiled, trained, and evaluated the neural network model. 
The model uses the Adam optimizer and categorical cross-entropy loss suitable for multi-class classification. 
It is trained on the scaled training data with a validation split to monitor performance. 
After training, predictions are made on the test set and evaluated using accuracy, precision, recall, F1-score, and a detailed classification report to assess the model’s performance across all classes.
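Because the network outputs one probability per class and the labels are one-hot vectors, both have to be converted back to class indices before scikit-learn style metrics can be computed. The sketch below shows that decoding step on invented probabilities.

```python
import numpy as np

# Invented softmax outputs for 3 samples and 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
# One-hot ground truth for the same samples.
one_hot_true = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
])

pred = probs.argmax(axis=1)         # most probable class per sample
true = one_hot_true.argmax(axis=1)  # position of the 1 in each one-hot row

accuracy = (pred == true).mean()
print(pred, true, accuracy)
```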


In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train_scaled, y_train,
                    epochs=50, batch_size=32,
                    validation_split=0.2, verbose=1)

y_pred_prob = model.predict(X_test_scaled)
y_pred = np.argmax(y_pred_prob, axis=1)
y_true = np.argmax(y_test, axis=1)

print("=== Neural Network Evaluation ===")
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average='weighted'))
print("Recall:", recall_score(y_true, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_true, y_pred, average='weighted'))
print("\nDetailed Report:\n", classification_report(y_true, y_pred, target_names=le.classes_))


In [None]:
plt.figure(figsize=(10,5))
plt.plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
plt.plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
plt.title('Training vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

### Used SHAP to interpret the neural network’s predictions. 
A SHAP explainer was created using a sample of the training data as background. 
It computes SHAP values to measure how much each feature contributes to the model’s predictions. 
Finally, a summary plot is generated to visualize the global feature importance and understand which features have the greatest impact on the model’s decisions.
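The bar version of the summary plot ranks features by their mean absolute SHAP value. The sketch below reproduces that aggregation with NumPy on an invented SHAP matrix for a single class. The numbers and feature names are hypothetical.

```python
import numpy as np

# Hypothetical SHAP values for one class: 3 samples x 2 features.
feature_names_demo = ["diff_staff", "age"]
shap_demo = np.array([
    [0.2, -0.1],
    [-0.4, 0.0],
    [0.3, 0.1],
])

# Global importance: average magnitude of each feature's contribution.
importance = np.abs(shap_demo).mean(axis=0)

# Sort features from most to least important.
ranking = np.argsort(importance)[::-1]
for idx in ranking:
    print(f"{feature_names_demo[idx]}: {importance[idx]:.3f}")
```

Taking absolute values first matters: positive and negative contributions would otherwise cancel, hiding features that push predictions strongly in both directions.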


In [None]:
X_sample = X_test_scaled[:300]

In [None]:
def predict_fn(data):
    return model.predict(data, verbose=0)

background = shap.sample(X_train_scaled, 100, random_state=42)
explainer = shap.KernelExplainer(predict_fn, background)

In [None]:
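# Explain a small subset of test rows (KernelExplainer is slow), and plot
# the same rows so the SHAP values and the feature matrix stay aligned
X_explain = X_sample[:10]
shap_values = explainer.shap_values(X_explain)

In [None]:
shap.summary_plot(
    shap_values,
    X_explain,
    feature_names=X.columns,
    plot_type="bar",
    show=False
)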

plt.title("Global Feature Importance (SHAP Summary)", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Generated a local explanation using LIME.
LIME provides an interpretable, instance-level explanation by identifying the most influential features that contributed to a specific prediction made by the neural network. This helps understand why the model predicted a certain class for a given input sample.


In [None]:
explainer = lime_tabular.LimeTabularExplainer(
    training_data=np.array(X_train_scaled),
    feature_names=X.columns,
    class_names=le.classes_,
    mode='classification'
)


i = 0
sample = X_test_scaled[i].reshape(1, -1)


exp = explainer.explain_instance(
    data_row=X_test_scaled[i],
    predict_fn=model.predict,
    num_features=13  
)

exp.show_in_notebook(show_table=True)

### Implemented an inference function that processes raw user input, prepares it in the same format used during training, and predicts the corresponding country group using the trained neural network. 
The function handles feature encoding, scaling, and outputs both the predicted class and the model’s confidence scores for each group.


In [None]:
def predict_country_group(raw_input):

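    # Ordinal codes must match the `age_order` mapping used during training;
    # the alternative bin labels are accepted as aliases for the same codes
    age_group_map = {
        "18-24": 1, "25-34": 2, "35-44": 3, "45-54": 4, "55+": 5,
        "18-25": 1, "26-35": 2, "36-45": 3, "46-55": 4, "56+": 5
    }
    age_value = age_group_map.get(raw_input.get("age_group"), 3)  # default: middle bin (code 3)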

    diff_overall_vs_star = raw_input["score_overall"] - raw_input["star_rating"]
    diff_cleanliness = raw_input["score_cleanliness"] - raw_input["cleanliness_base"]
    diff_comfort = raw_input["score_comfort"] - raw_input["comfort_base"]
    diff_facilities = raw_input["score_facilities"] - raw_input["facilities_base"]
    diff_location = raw_input["score_location"] - raw_input["location_base"]
    diff_staff = raw_input["score_staff"] - raw_input["staff_base"]
    diff_value_for_money = raw_input["score_value_for_money"] - raw_input["value_for_money_base"]

    model_input = {
        "diff_overall_vs_star": diff_overall_vs_star,
        "diff_cleanliness": diff_cleanliness,
        "diff_comfort": diff_comfort,
        "diff_facilities": diff_facilities,
        "diff_location": diff_location,
        "diff_staff": diff_staff,
        "diff_value_for_money": diff_value_for_money,
        "traveller_type_Couple": 1 if raw_input["traveller_type"] == "Couple" else 0,
        "traveller_type_Family": 1 if raw_input["traveller_type"] == "Family" else 0,
        "traveller_type_Solo": 1 if raw_input["traveller_type"] == "Solo" else 0,
        "user_gender_Male": 1 if raw_input["user_gender"] == "Male" else 0,
        "user_gender_Other": 1 if raw_input["user_gender"] == "Other" else 0,
        "age": age_value
    }

    feature_columns = [
        'diff_overall_vs_star', 'diff_cleanliness', 'diff_comfort',
        'diff_facilities', 'diff_location', 'diff_staff',
        'diff_value_for_money', 'traveller_type_Couple',
        'traveller_type_Family', 'traveller_type_Solo',
        'user_gender_Male', 'user_gender_Other', 'age'
    ]

    input_df = pd.DataFrame([model_input])[feature_columns]
    scaled_input = scaler.transform(input_df)


    probs = model.predict(scaled_input)
    predicted_class = np.argmax(probs, axis=1)[0]
    predicted_group = le.inverse_transform([predicted_class])[0]

    print(f"\nPredicted Country Group: {predicted_group}\n")
    print("Class Probabilities:")
    for cls, prob in zip(le.classes_, probs[0]):
        print(f"  {cls}: {float(prob):.3f}")

    return {
        "predicted_group": predicted_group,
        "probabilities": {cls: float(prob) for cls, prob in zip(le.classes_, probs[0])}
    }


In [None]:
sample_input = {
    "age_group": "26-35",
    "user_gender": "Female",
    "traveller_type": "Solo",
    "score_overall": 8.7,
    "star_rating": 5,
    "score_cleanliness": 8.5,
    "cleanliness_base": 9.0,
    "score_comfort": 8.2,
    "comfort_base": 8.5,
    "score_facilities": 7.8,
    "facilities_base": 8.0,
    "score_location": 8.9,
    "location_base": 9.3,
    "score_staff": 9.1,
    "staff_base": 9.0,
    "score_value_for_money": 8.4,
    "value_for_money_base": 8.0
}

result = predict_country_group(sample_input)


In [None]:
sample_input = {
    "age_group": "36-45",
    "user_gender": "Male",
    "traveller_type": "Solo",

    # Raw scores vs base values (chosen to produce the same diffs)
    "score_overall": 8.8,
    "star_rating": 5.0,                      # 8.8 - 5.0 = 3.8

    "score_cleanliness": 8.0,
    "cleanliness_base": 8.5,                 # diff = -0.5

    "score_comfort": 7.6,
    "comfort_base": 8.0,                     # diff = -0.4

    "score_facilities": 7.4,
    "facilities_base": 8.0,                  # diff = -0.6

    "score_location": 8.9,
    "location_base": 9.0,                    # diff = -0.1

    "score_staff": 9.7,
    "staff_base": 9.0,                       # diff = +0.7

    "score_value_for_money": 8.3,
    "value_for_money_base": 8.0              # diff = +0.3
}


result = predict_country_group(sample_input)
