# Problem Analysis Workshop 5 - Customer satisfaction in the hotel industry
### Course: `Data Analysis Mathematics, Algorithms and Modeling`
### Team 7 - Members:
- **Tilvan Madalina**  - Student number: 9058215
- **Wesley Jayavanti** - Student number: 9019852
- **Yun-Chen Wang**    - Student number: 9040873

---

## About this workshop:
1. Study the differences between `Simple Regression`, `Multiple Linear Regression`, `Non-Linear Regression`, and `Logistic Regression`.
2. Perform the following exploration and validation:
    - Implementing a `Non-linear Regression` with TripAdvisor Dataset
        - Writing a 500-word discussion on its relevance to the term project
    - Implementing a `Logistic Regression` with TripAdvisor Dataset
        - Writing a 500-word discussion on its relevance to the term project

In [1]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display

from scipy.stats import boxcox
import statsmodels.formula.api as smf

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from statsmodels.stats.diagnostic import het_breuschpagan
from sklearn.linear_model import LinearRegression, LogisticRegression

## Load clean datasets as pandas dataframes

In [37]:
# Load cleaned dataset
path = "../dataset/clean_df.csv"
clean_df = pd.read_csv(path)
print(f"Load dataframe from {path}, Shape: {clean_df.shape}")
display(clean_df.head()) # print the first 5 rows of the dataset

path = "../dataset/transformed_df.csv"
transformed_df = pd.read_csv(path)
print(f"Load dataframe from {path}, Shape: {transformed_df.shape}")
display(transformed_df.head()) # print the first 5 rows of the dataset

Load dataframe from ../dataset/clean_df.csv, Shape: (1048575, 85)


Unnamed: 0,Destination_country_id,Visitors_country_id,Overall_rating,Location_rating,Cleanliness_rating,Rooms_rating,Service_rating,Sleepquality_rating,Value_rating,Hotel.stars,...,Trip_type_couple,Trip_type_family,Trip_type_friends,Trip_type_solo,Trip_type_unknown,Reviewer_rank_Reviewer,Reviewer_rank_Senior Contributor,Reviewer_rank_Senior Reviewer,Reviewer_rank_Top Contributor,Reviewer_rank_Unknown
0,59,1,3,2,3,3,3,3,4,5,...,0,0,0,0,0,1,0,0,0,0
1,96,1,4,5,5,3,5,4,5,3,...,0,1,0,0,0,0,0,0,1,0
2,59,1,5,5,5,3,5,5,4,5,...,0,0,0,0,0,0,0,0,0,1
3,199,1,4,3,5,4,5,4,4,5,...,0,0,0,1,0,0,0,1,0,0
4,49,1,3,5,4,3,5,4,5,3,...,0,0,0,1,0,0,1,0,0,0


Load dataframe from ../dataset/transformed_df.csv, Shape: (1048575, 84)


Unnamed: 0,Destination_country_id,Visitors_country_id,Overall_rating,Location_rating,Cleanliness_rating,Rooms_rating,Service_rating,Sleepquality_rating,Value_rating,Hotelstars,...,Trip_type_couple,Trip_type_family,Trip_type_friends,Trip_type_solo,Trip_type_unknown,Reviewer_rank_Reviewer,Reviewer_rank_Senior_Contributor,Reviewer_rank_Senior_Reviewer,Reviewer_rank_Top_Contributor,Reviewer_rank_Unknown
0,59,1,9,4,9,3,9,9,16,5,...,0,0,0,0,0,1,0,0,0,0
1,96,1,16,25,25,3,25,16,25,3,...,0,1,0,0,0,0,0,0,1,0
2,59,1,25,25,25,3,25,25,16,5,...,0,0,0,0,0,0,0,0,0,1
3,199,1,16,9,25,4,25,16,16,5,...,0,0,0,1,0,0,0,1,0,0
4,49,1,9,25,16,3,25,16,25,3,...,0,0,0,1,0,0,1,0,0,0


In [38]:
# Replace all non-alphanumeric characters in column names with underscores
clean_df.columns = clean_df.columns.str.replace(r'\W+', '_', regex=True)
# clean_df.to_csv("../dataset/clean_df.csv", index=False)

# Replace all non-alphanumeric characters in column names with underscores
transformed_df.columns = transformed_df.columns.str.replace(r'\W+', '_', regex=True)
# transformed_df.to_csv("../dataset/transformed_df.csv", index=False)
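To make the effect of the `\W+` replacement concrete, here is a minimal sketch on a few header strings written in the style of the dataset's columns. The `raw` index is a hypothetical stand-in, not the loaded dataframe.

```python
import pandas as pd

# Hypothetical raw headers in the style of the dataset's column names
raw = pd.Index([
    "Hotel.stars",
    "Reviewer_rank_Senior Contributor",
    "Free High Speed Internet (WiFi)",
])

# \W+ matches one or more characters that are not letters, digits, or underscores,
# so each run of spaces, dots, or brackets collapses into a single underscore
cleaned = raw.str.replace(r"\W+", "_", regex=True)
print(list(cleaned))
```

A trailing bracket also becomes an underscore, which is why some cleaned names end in `_`.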

In [39]:
df = clean_df #transformed_df

In [40]:
clean_df.columns

Index(['Destination_country_id', 'Visitors_country_id', 'Overall_rating',
       'Location_rating', 'Cleanliness_rating', 'Rooms_rating',
       'Service_rating', 'Sleepquality_rating', 'Value_rating', 'Hotel_stars',
       'Hotel_price', 'Hotel_distance', 'Hotel_noofrooms', 'Suites',
       'Family_Rooms', 'Microwave', 'Air_Conditioning', 'Minibar',
       'Refrigerator_in_room', 'Bar_Lounge', 'Kitchenette', 'Free_Parking',
       'Self_Serve_Laundry', 'Business_Centre_with_Internet_Access',
       'Conference_Facilities', 'Meeting_Rooms', 'Banquet_Room',
       'Casino_and_Gambling', 'Babysitting', 'Dry_Cleaning',
       'Multilingual_Staff', 'Airport_Transportation', 'Free_Breakfast',
       'Children_Activities_Kid_Family_Friendly_', 'Laundry_Service',
       'Concierge', 'Room_Service', 'Restaurant', 'Shuttle_Bus_Service',
       'Free_Internet', 'Free_High_Speed_Internet_WiFi_', 'Paid_Wifi',
       'Paid_Internet', 'Public_Wifi', 'Ski_In_Ski_Out',
       'Fitness_Centre_with_Gym_

In [41]:
transformed_df.columns

Index(['Destination_country_id', 'Visitors_country_id', 'Overall_rating',
       'Location_rating', 'Cleanliness_rating', 'Rooms_rating',
       'Service_rating', 'Sleepquality_rating', 'Value_rating', 'Hotelstars',
       'Hotelprice', 'Hoteldistance', 'Hotelnoofrooms', 'Suites',
       'FamilyRooms', 'Microwave', 'AirConditioning', 'Minibar',
       'Refrigeratorinroom', 'BarLounge', 'Kitchenette', 'FreeParking',
       'SelfServeLaundry', 'BusinessCentrewithInternetAccess',
       'ConferenceFacilities', 'MeetingRooms', 'BanquetRoom',
       'CasinoandGambling', 'Babysitting', 'DryCleaning', 'MultilingualStaff',
       'AirportTransportation', 'FreeBreakfast',
       'ChildrenActivitiesKidFamilyFriendly', 'LaundryService', 'Concierge',
       'RoomService', 'Restaurant', 'ShuttleBusService', 'FreeInternet',
       'FreeHighSpeedInternetWiFi', 'PaidWifi', 'PaidInternet', 'PublicWifi',
       'SkiInSkiOut', 'FitnessCentrewithGymWorkoutRoom', 'Spa', 'TennisCourt',
       'HotTub', 'Poo

**Simple Regression**

Definition: A regression with one independent variable (feature) predicting a continuous dependent variable (output).

Example in a TripAdvisor dataset:

Predicting hotel rating (y) based on the number of reviews (x).

🔹 Formula: $y = \beta_0 + \beta_1 x + \epsilon$
🔹 Example Model in Python:

In [None]:
from sklearn.linear_model import LinearRegression

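# Illustrative example: predicting hotel rating from number of reviews.
# `num_reviews` and `rating` are placeholder column names; substitute the
# matching columns from the loaded dataset before running this cell.
X = df[['num_reviews']]  # Independent variable (2-D, as scikit-learn expects)
y = df['rating']  # Dependent variable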

model = LinearRegression()
model.fit(X, y)
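As a runnable illustration of the formula above, the following sketch fits a simple regression on synthetic data and reads back the estimated intercept and slope. The `demo` dataframe, its column names, and the true coefficients (3.0 and 0.002) are assumptions made up for this example; they are not taken from the TripAdvisor dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: rating = 3.0 + 0.002 * num_reviews + noise (assumed values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"num_reviews": rng.integers(1, 500, size=200)})
demo["rating"] = 3.0 + 0.002 * demo["num_reviews"] + rng.normal(0, 0.1, size=200)

model = LinearRegression().fit(demo[["num_reviews"]], demo["rating"])

intercept = model.intercept_   # estimate of beta_0
slope = model.coef_[0]         # estimate of beta_1
r2 = model.score(demo[["num_reviews"]], demo["rating"])
print(f"beta_0 = {intercept:.3f}, beta_1 = {slope:.5f}, R^2 = {r2:.3f}")
```

The fitted values should land close to the coefficients used to generate the data, which is a quick sanity check before applying the same call to real columns.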


**Multiple Linear Regression**

 Definition: A regression with multiple independent variables predicting a continuous dependent variable.

Example in a TripAdvisor dataset:

Predicting hotel rating (y) based on number of reviews, hotel price, and location score (X1, X2, X3).

🔹 Formula: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
🔹 Example Model in Python:

In [None]:
X = df[['num_reviews', 'price', 'location_score']]
y = df['rating']

model = LinearRegression()
model.fit(X, y)

**Non-Linear Regression**

 Definition: A regression where the relationship between variables is not a straight line (e.g., exponential, polynomial, logarithmic).

Example in a TripAdvisor dataset:

Predicting customer satisfaction where ratings increase rapidly with more reviews at first but then plateau.

🔹 Example: Quadratic Relationship

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$

🔹 Example Model in Python (Polynomial Regression):

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X = df[['num_reviews']]
y = df['rating']

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
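The plateau idea can be checked on synthetic data. The sketch below generates a concave curve, fits both a straight line and a degree-2 polynomial, and compares their R² scores. All data, coefficients, and names (`x`, `y_curve`, `linear`, `quadratic`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic concave relationship that flattens out near x = 250 (assumed values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 250, size=300).reshape(-1, 1)
y_curve = 2.0 + 0.02 * x[:, 0] - 0.00004 * x[:, 0] ** 2 + rng.normal(0, 0.05, size=300)

linear = LinearRegression().fit(x, y_curve)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # columns: x, x^2
    LinearRegression(),
).fit(x, y_curve)

# Coefficient on x^2; a negative value indicates diminishing returns
beta_2 = quadratic.named_steps["linearregression"].coef_[1]
r2_linear = linear.score(x, y_curve)
r2_quadratic = quadratic.score(x, y_curve)
print(f"beta_2 = {beta_2:.6f}, R^2 linear = {r2_linear:.3f}, R^2 quadratic = {r2_quadratic:.3f}")
```

A negative coefficient on the squared term together with a higher R² than the straight-line fit is the pattern one would expect if the effect really levels off.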


**Logistic Regression**

Definition: Used for classification problems where the output is categorical (e.g., positive vs. negative review).

Example in a TripAdvisor dataset:

Predicting whether a review is positive (1) or negative (0) based on the rating and features derived from the review text, such as its length and sentiment score.

🔹 Formula (Sigmoid Function):

$P(y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$

🔹 Example Model in Python:

In [None]:
from sklearn.linear_model import LogisticRegression

X = df[['rating', 'review_length', 'sentiment_score']]  # Features
y = df['positive_review']  # Binary (1 = positive, 0 = negative)

model = LogisticRegression()
model.fit(X, y)
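To connect the sigmoid formula to the fitted model, this sketch trains a logistic regression on synthetic two-feature data and recomputes the predicted probabilities by hand from the intercept and coefficients. The data and names (`X_demo`, `y_demo`, `clf`) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome driven by two features (assumed coefficients)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
true_logits = 0.5 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(500) < 1 / (1 + np.exp(-true_logits))).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear predictor beta_0 + beta_1 * x_1 + beta_2 * x_2, passed through the sigmoid
z = clf.intercept_[0] + X_demo @ clf.coef_[0]
manual_proba = 1 / (1 + np.exp(-z))

# scikit-learn's probability for class 1 should match the manual computation
sk_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sk_proba))
```

For a binary target, `predict_proba` applies exactly this sigmoid to the linear predictor, so the two arrays agree up to floating-point error.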


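Select `significant_columns` by fitting a multiple linear regression (MLR) on `Overall_rating` and keeping the predictors whose p-value is below 0.05.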

In [None]:
# Select numerical columns (excluding 'Overall_rating')
predictors = [col for col in df.select_dtypes(include=['number']).columns if col != 'Overall_rating']

# Create the formula for the model
formula_mlr = "Overall_rating ~ " + " + ".join(predictors) 

# Fit the model
model_mlr = smf.ols(formula=formula_mlr, data=df).fit()

# Print the model summary
# print("\nMLR Model Summary:")
# print(model_mlr.summary())

# Extract significant predictors from the MLR model (exclude 'Intercept')
significant_columns = [col for col in model_mlr.pvalues.index if col != 'Intercept' and model_mlr.pvalues[col] < 0.05]
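The same p-value filter can be tried on a small synthetic dataset where the true relationships are known. In this sketch, `x1` and `x2` drive the target and `noise` does not; all names and coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: target depends on x1 and x2 only (assumed coefficients)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = 1.0 + 2.0 * demo["x1"] - 1.5 * demo["x2"] + rng.normal(size=n)

fit = smf.ols("target ~ x1 + x2 + noise", data=demo).fit()

# Drop the intercept, then keep predictors with p < 0.05
pvals = fit.pvalues.drop("Intercept")
significant = pvals[pvals < 0.05].index.tolist()
print(pvals.round(4))
print(significant)
```

One caveat worth keeping in mind: with roughly a million rows, as in the dataset loaded above, even very small effects tend to fall below the 0.05 threshold, so this filter may retain most predictors.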

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Feature and Target
X = df[significant_columns] 
y = df['Overall_rating']

# Convert target to binary class based on median
y_class = (y > y.median()).astype(int)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2, random_state=42)

# Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Evaluation
print("\nLogistic Regression Classification Results:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))



Logistic Regression Classification Results:
Confusion Matrix:
[[110574  18667]
 [ 52725  27749]]

Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.86      0.76    129241
           1       0.60      0.34      0.44     80474

    accuracy                           0.66    209715
   macro avg       0.64      0.60      0.60    209715
weighted avg       0.65      0.66      0.63    209715
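The report figures can be re-derived from the confusion matrix printed above, which also gives a baseline to compare against. The sketch below uses only the four counts from that output and assumes scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[110574, 18667],
               [52725, 27749]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)              # share of predicted positives that are correct
recall_1 = tp / (tp + fn)                 # share of true positives that are found
accuracy = (tn + tp) / cm.sum()
majority_baseline = (tn + fp) / cm.sum()  # accuracy of always predicting class 0

print(f"precision(1) = {precision_1:.2f}, recall(1) = {recall_1:.2f}")
print(f"accuracy = {accuracy:.2f}, majority baseline = {majority_baseline:.2f}")
```

Read this way, the 0.66 accuracy is only about four points above always predicting class 0, and the model recovers roughly a third of the class-1 reviews.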



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
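The warning above suggests scaling the data. One possible way to do that is to put a `StandardScaler` in front of the classifier in a pipeline, sketched here on synthetic features with very different ranges. The feature names, ranges, and coefficients are assumptions for illustration, and the results printed earlier were not produced with this pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed ranges)
rng = np.random.default_rng(0)
n = 2000
price = rng.uniform(50, 5000, size=n)
stars = rng.integers(1, 6, size=n)
X_demo = np.column_stack([price, stars])

# Binary target generated from standardized versions of the features
z = 1.5 * (price - 2500) / 1400 - 1.0 * (stars - 3) / 1.4
y_demo = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)

# Scaling happens inside the pipeline, so it is fitted on the training data only
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
train_accuracy = scaled_model.score(X_demo, y_demo)
print(f"iterations used: {n_iter}, training accuracy: {train_accuracy:.2f}")
```

If the iteration count stays well below `max_iter`, the solver converged. The same pipeline could be fitted on `X_train` and `y_train` in place of the bare `LogisticRegression` to see whether the warning disappears.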


**Summary Table**

| Regression Type | Output Type | Example in TripAdvisor Dataset |
|---|---|---|
| Simple Regression | Continuous | Predicting rating from num_reviews |
| Multiple Linear Regression | Continuous | Predicting rating using num_reviews, price, location_score |
| Non-Linear Regression | Continuous | Predicting rating trends where the effect of reviews plateaus |
| Logistic Regression | Categorical (0/1) | Predicting if a review is positive or negative |
