## Step 3 - Clustering Based on Price Policies


### 3.6 Identify the 150 hotels with the most data in the dataset and extract their records.


In [None]:
import pandas as pd

file_path = "./hotels_data_changed.csv"  
df = pd.read_csv(file_path)

hotel_counts = df['Hotel Name'].value_counts()
top_150_hotels = hotel_counts.head(150).index
filtered_df = df[df['Hotel Name'].isin(top_150_hotels)]

display(filtered_df)

### 3.7 Find the 40 most common check-in dates in the dataset and extract their records.


In [None]:
checkin_counts = filtered_df['Checkin Date'].value_counts()
top_40_checkin_dates = checkin_counts.head(40).index
filtered_checkin_df = filtered_df[filtered_df['Checkin Date'].isin(top_40_checkin_dates)]

display(filtered_checkin_df)

### 3.8 160-dimensional feature vector

**Task**

Build a 160-dimensional feature vector for each hotel based on its discount pricing behavior. Each vector is constructed by:
- Filtering the top 150 hotels (by record count) and the top 40 checkin dates.
- For each hotel, extracting 4 discount prices (one per discount code) for each of the 40 checkin dates. If several snapshots exist for the same combination, we select the one with the lowest price.
- If no data is available for a specific (checkin date, discount code) combination, mark it with `-1`.

**Plan**

1. **Group the Data:**  
   Group the filtered data by **Hotel Name**, **Checkin Date**, and **Discount Code**. For each group, compute the minimum discount price, ensuring that only the best (lowest) price per combination is selected.

2. **Pivot to Wide Format:**  
   Transform the grouped data into a wide format where:
   - Each row represents a single hotel.
   - Each column represents a unique (Checkin Date, Discount Code) combination, totaling 160 columns (40 dates × 4 codes).

3. **Fill Missing Data:**  
   - Reindex the pivoted DataFrame so that every hotel has all 160 combinations, filling missing entries with `-1`.
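
To make the three steps concrete, here is a minimal sketch on a made-up table with two hotels, two check-in dates, and two discount codes (so 4 columns instead of 160). The toy values and the names `toy`, `wide`, and `full_grid` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: hotel A has two snapshots for (d1, code 1).
toy = pd.DataFrame({
    "Hotel Name":     ["A", "A", "A", "B"],
    "Checkin Date":   ["d1", "d1", "d2", "d1"],
    "Discount Code":  [1, 1, 2, 2],
    "Discount Price": [120, 100, 90, 200],
})

# Step 1: keep the lowest price per (hotel, date, code).
grouped = (
    toy.groupby(["Hotel Name", "Checkin Date", "Discount Code"])["Discount Price"]
    .min()
    .reset_index()
)

# Step 2: one row per hotel, one column per (date, code) pair.
wide = grouped.pivot_table(
    index="Hotel Name",
    columns=["Checkin Date", "Discount Code"],
    values="Discount Price",
)

# Step 3: force the full date x code grid and mark the gaps with -1.
full_grid = pd.MultiIndex.from_product(
    [["d1", "d2"], [1, 2]], names=["Checkin Date", "Discount Code"]
)
wide = wide.reindex(columns=full_grid).fillna(-1).astype(int)

print(wide)
```

Hotel A keeps the cheaper of its two (d1, code 1) snapshots, and every combination that never appeared in the data shows up as `-1`.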

In [None]:
# 1. Group by Hotel Name, Checkin Date, and Discount Code and select the minimum Discount Price.
grouped = (
    filtered_checkin_df
    .groupby(['Hotel Name', 'Checkin Date', 'Discount Code'])['Discount Price']
    .min()
    .reset_index()
)

# 2. Pivot the DataFrame so that:
#    - The index is 'Hotel Name'
#    - The columns are a MultiIndex with levels (Checkin Date, Discount Code)
#    - The values are the minimum discount prices.
pivot_df = grouped.pivot_table(index='Hotel Name',
                               columns=['Checkin Date', 'Discount Code'],
                               values='Discount Price')


# 3. Reindex the columns so that all 40 checkin dates and 4 discount codes are present.
#    Use the top_40_checkin_dates (from your earlier filtering) and the list [1, 2, 3, 4] for discount codes. 
all_combinations = pd.MultiIndex.from_product([top_40_checkin_dates, [1, 2, 3, 4]],
                                                names=['Checkin Date', 'Discount Code'])

pivot_df = pivot_df.reindex(columns=all_combinations, fill_value=-1)
pivot_df = pivot_df.fillna(-1)


pivot_df.columns = [
    col if isinstance(col, str) else f"{col[0]} - {col[1]}"
    for col in pivot_df.columns
]
pivot_df = pivot_df.reset_index()

print(pivot_df.shape[0]) # Note we have 149 hotels instead of 150 - solution in next cell
display(pivot_df)

#### Identifying Missing Hotel

**Verifying Missing Hotel Data**

After filtering and pivoting the data, we expect to have 150 hotels, but only 149 appear in our pivot table. This indicates that one (or more) of the top 150 hotels has no records for the top 40 check-in dates used in our analysis.

The code below does the following:
1. **Identify Missing Hotels:**  
   It compares the complete list of top 150 hotels (`top_150_hotels`) with the hotel names present in the pivoted DataFrame (`pivot_df`). Any hotel that is not present is added to the `missing_hotels` list.

2. **Check Data for Each Missing Hotel:**  
   For each missing hotel, it filters `filtered_checkin_df` (which already contains only records from the top 40 check-in dates) to see if there are any records for that hotel.  
   - If the resulting DataFrame is empty, it confirms that the hotel indeed has no data for those check-in dates.  
   - This explains why the hotel did not appear in the pivot table.

By verifying that the missing hotel has no records in the filtered data, we can conclude that the drop in the number of hotels is due to the absence of data for those check-in dates rather than an error in our processing pipeline.



In [None]:
# Assuming you have already defined:
# - top_150_hotels: the complete list of top 150 hotel names.
# - pivot_df: the pivoted DataFrame after grouping and filtering.
# - filtered_checkin_df: the DataFrame filtered by top 40 check-in dates.
#
# And the missing hotels are identified as:
missing_hotels = [hotel for hotel in top_150_hotels if hotel not in pivot_df['Hotel Name'].values]
print("Missing hotels:", missing_hotels)

# For each missing hotel, check if there is any record in the filtered_checkin_df.
for hotel in missing_hotels:
    hotel_records = filtered_checkin_df[filtered_checkin_df['Hotel Name'] == hotel]
    print(f"\nRecords for missing hotel '{hotel}':")
    print(hotel_records)  # This should print an empty DataFrame if no data is present.


### 3.9 Normalize 0-100


**Task**

For each hotel, we have a 160-dimensional vector of discount prices (one for each combination of Checkin Date and Discount Code). The goal is to normalize these prices so that, for each hotel, the lowest valid discount price becomes 0 and the highest becomes 100. Any missing value (indicated by `-1`) should remain unchanged.

**Plan**

1. **Define a Normalization Function:**  
   Create a function (`normalize_row`) that:
   - Filters out the missing values (`-1`) from the row.
   - Computes the minimum and maximum values among the valid discount prices.
   - Applies the normalization formula:
     $$
     \text{normalized\_price} = \frac{(\text{price} - \text{min\_price})}{(\text{max\_price} - \text{min\_price})} \times 100
     $$
   - Handles the case where all valid prices are equal (to avoid division by zero) by setting them to 0.

2. **Apply the Function Row-wise:**  
   Normalize the discount prices for each hotel (i.e., for each row) by applying the function to all columns except the "Hotel Name".

3. **Round and Convert to Integers:**  
   After normalization, round the values to the nearest integer and convert them to an integer type, ensuring that the normalized prices are stored as integers.
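
As a small worked example of the formula, the sketch below normalizes a single made-up price vector with NumPy. The helper name `normalize_prices` and the sample values are hypothetical; it illustrates the rule described above and is not the function applied to `pivot_df`.

```python
import numpy as np

def normalize_prices(prices, missing=-1):
    """Min-max scale valid prices to 0-100; leave `missing` markers untouched."""
    prices = np.asarray(prices, dtype=float)
    out = prices.copy()
    valid = prices != missing
    if not valid.any():
        # Nothing to scale: the row holds only missing markers.
        return out.astype(int)
    lo, hi = prices[valid].min(), prices[valid].max()
    if hi == lo:
        # All valid prices are equal, so avoid dividing by zero.
        out[valid] = 0
    else:
        out[valid] = np.round((prices[valid] - lo) / (hi - lo) * 100)
    return out.astype(int)

print(normalize_prices([100, 150, -1, 200]).tolist())  # [0, 50, -1, 100]
```

Here the minimum (100) maps to 0, the maximum (200) maps to 100, the midpoint (150) maps to 50, and the `-1` marker passes through unchanged.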



In [None]:
def normalize_row(row):
    valid_mask = row != -1
    valid_prices = row[valid_mask]
    
    if valid_prices.empty:
        return row
    
    min_price = valid_prices.min()
    max_price = valid_prices.max()
    
    # Avoid division by zero if all valid prices are identical
    if min_price == max_price:
        row[valid_mask] = 0
    else:
        # Compute the normalized values, round them, and cast to int
        normalized_values = ((row[valid_mask] - min_price) / (max_price - min_price)) * 100
        row[valid_mask] = normalized_values.round(0).astype(int)
    
    return row


pivot_df.iloc[:, 1:] = pivot_df.iloc[:, 1:].apply(normalize_row, axis=1)

for col in pivot_df.columns[1:]:
    pivot_df[col] = pd.to_numeric(pivot_df[col], errors='coerce')
    pivot_df[col] = pivot_df[col].astype("Int64")


display(pivot_df)

### 3.10 Save to CSV

In [None]:
hotels_clustering_data = "./hotels_clustering_data.csv"
pivot_df.to_csv(hotels_clustering_data, index=False)

### 3.11 Hierarchical Clustering

**Task**

Using the normalized discount prices for each hotel, we will perform hierarchical clustering to group hotels that exhibit similar pricing behaviors. We have a 160-dimensional feature vector for each hotel (each dimension corresponds to a specific (Checkin Date, Discount Code) pair).

**Plan**

1. **Prepare the Data:**  
   - Load the saved CSV file (`hotels_clustering_data.csv`).
   - Separate the "Hotel Name" column (for labeling) from the numeric feature columns.

2. **Perform Hierarchical Clustering:**  
   - Use SciPy's `linkage` function with Ward's method (which works well with Euclidean distance) to compute the clustering.
   - Generate a linkage matrix that represents the hierarchical clustering.

3. **Plot the Dendrogram:**  
   - Use SciPy's `dendrogram` function to visualize the hierarchical clustering.
   - Label each leaf in the dendrogram with the corresponding hotel name to help interpret the clusters.


In [None]:
%pip install plotly

In [None]:
import pandas as pd
import plotly.figure_factory as ff
from scipy.cluster.hierarchy import linkage
from plotly.subplots import make_subplots
import plotly.graph_objects as go

def create_dendrogram_from_csv(csv_path, color_threshold=825, width=600, height=800):
    """Creates a dendrogram figure."""
    clu_df = pd.read_csv(csv_path)
    hotel_names = clu_df["Hotel Name"].values
    X = clu_df.drop("Hotel Name", axis=1).values
    Z = linkage(X, method='ward')

    fig = ff.create_dendrogram(
        X,
        orientation='left',
        labels=hotel_names,
        color_threshold=color_threshold,
        linkagefun=lambda x: Z
    )

    fig.update_layout(
        width=width,
        height=height,
        margin=dict(l=20, r=20, t=20, b=20) 
    )

    return fig

def display_dendrograms_in_grid(csv_path, color_threshold_list=(825, 750, 625, 500)):
    """Displays dendrograms in a 2x2 grid."""

    fig = make_subplots(rows=2, cols=2, subplot_titles=[f"Cut at ~{ct}" for ct in color_threshold_list])

    for i, color_threshold in enumerate(color_threshold_list):
        row = i // 2 + 1
        col = i % 2 + 1
        dendrogram_fig = create_dendrogram_from_csv(csv_path, color_threshold=color_threshold)

        for trace in dendrogram_fig.data:
            fig.add_trace(trace, row=row, col=col)

        fig.update_xaxes(showticklabels=False, showline=False, zeroline=False, ticks="", row=row, col=col)
        fig.update_yaxes(showticklabels=False, showline=False, zeroline=False, ticks="", row=row, col=col)  
        fig.update_layout(showlegend=False)

    fig.update_layout(
        width=2000, 
        height=1000,
        title_text="Hotel Clustering Dendrograms",
        margin=dict(l=20, r=20, t=60, b=20)
    )

    fig.show()

# Example usage:
display_dendrograms_in_grid("hotels_clustering_data.csv")

##### Results analysis

**Hotel Pricing Strategies Over Time: Hierarchical Clustering Analysis**

We performed hierarchical clustering on a dataset of hotels, where each hotel is represented by a **160-dimensional vector** of normalized discount prices. In simpler terms, each hotel’s vector shows *how* it discounts (and by how much) across different dates and discount codes. The dendrogram below clusters these hotels based on their similarity in discounting patterns.

Below, we examine **four different "cut" distances** (825, 750, 625, and 500) and describe the cluster and subgroup formations visible in each figure.


**Overall Explanation of the Dendrogram**
- **X-axis**: The distance (or dissimilarity) at which clusters merge. Larger values mean more dissimilar groups.  
- **Y-axis**: The list of hotels, labeled along the left side.  
- **Colored Branches**: Each color indicates a cluster or subgroup under the specified distance threshold.

In general:  
- Hotels that **merge at smaller distances** (farther to the left in the dendrogram) are quite similar in how they price their discounts.  
- If you follow the dendrogram to the right until a major branch merges, that indicates hotels (or clusters of hotels) that are more dissimilar in their pricing behavior.
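
One way to turn a cut distance into explicit cluster assignments is SciPy's `fcluster` with `criterion="distance"`. The sketch below runs on made-up 2-D points rather than the hotel vectors; `clusters_at` is a hypothetical helper, and the cut values are chosen for the toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy features: three tight groups of five points each in 2-D.
offsets = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]])
centers = np.array([[0, 0], [50, 50], [100, 0]])
X = np.vstack([center + offsets for center in centers])

# Same linkage settings as the notebook (Ward's method, Euclidean distance).
Z = linkage(X, method="ward")

def clusters_at(Z, cut):
    """Return one flat cluster id per row for a cut at distance `cut`."""
    return fcluster(Z, t=cut, criterion="distance")

# Lower cuts split the tree into more, smaller clusters.
for cut in (300, 180, 50):
    labels = clusters_at(Z, cut)
    print(f"cut at {cut}: {len(set(labels))} clusters")
```

Applied to the linkage matrix of the hotel data, the same call would give a cluster id per hotel at each threshold, which makes cluster sizes and memberships easier to compare than reading colors off the plot.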

---

<div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 5%;">
  <div >
    <h4>Cut at ~825</h4>
    <br>
    <img src="./images/clustering-825.png" alt="Cut at ~825" style="width: 100%; height: auto;">
  </div>
  <div >
   <h4>Cut at ~750</h4>
    <br>
    <img src="./images/clustering-750.png" alt="Cut at ~750" style="width: 100%; height: auto;">
  </div>
  <div >
    <h4>Cut at ~625</h4>
    <br>
    <img src="./images/clustering-625.png" alt="Cut at ~625" style="width: 100%; height: auto;">
  </div>
  <div >
    <h4>Cut at ~500</h4>
    <br>
    <img src="./images/clustering-500.png" alt="Cut at ~500" style="width: 100%; height: auto;">
  </div>
</div>

<br>
<br>
<br>


**Overall Results Explanation:**

The analysis reveals three distinct pricing strategies at the highest level. On closer inspection, each strategy splits into its own "sub-strategies". These subgroups represent hotels with closely aligned discounting patterns, providing valuable insights for competitive analysis and strategic pricing decisions.

**Possible Meanings of Subgroups:**

* **Competitive Landscape:** Hotels within the same subgroup likely compete directly, as they share similar discount structures and timelines.
* **Revenue Management Strategy:** Subgroups can reflect brand or chain policies, centralized pricing software, or shared revenue management practices.
* **Marketing & Differentiation:** Understanding subgroup membership can inform competitive analysis and strategic pricing decisions, motivating hotels to differentiate or align their pricing.
* **Shared Ownership:** Hotels with shared ownership often have similar pricing.
* **Same Chain Hotels:** Hotel chains often have similar pricing strategies.
* **Close Competitors:** Hotels near each other, that compete for the same customers, will often have similar pricing.


#### Diving even deeper - checking hotel stars and average price against the pricing strategy

The results from the last section were interesting, so I decided to run the same clustering again, this time labeling each hotel with its star rating, prices, and discounts to see whether any patterns emerge.

The new label now contains:
- stars
- avg original price
- avg discount price
- avg discount rate

in this format:

`(stars) - original price - discount price - discount rate`

example:

(5) - 3898 - 3663 - 6%

In [None]:
import pandas as pd

pivot_df = pd.read_csv("hotels_clustering_data.csv")
df = pd.read_csv("./hotels_data_changed.csv")

pivot_df["Hotel Name"] = pivot_df["Hotel Name"].astype(str).str.strip()
hotel_counts = df["Hotel Name"].value_counts()
top_150_hotels = hotel_counts.head(150).index

summary_df = (
    df[df["Hotel Name"].isin(top_150_hotels)]
    .groupby("Hotel Name")
    .agg({"Original Price": "mean", "Discount Price": "mean", "Hotel Stars": "first"})
    .reset_index()
)

summary_df["Hotel Name"] = summary_df["Hotel Name"].astype(str).str.strip()
summary_df["Original Price"] = summary_df["Original Price"].round(0).astype(int)
summary_df["Discount Price"] = summary_df["Discount Price"].round(0).astype(int)
merged_df = pivot_df.merge(summary_df, on="Hotel Name", how="left")

merged_df["DiscountPerc"] = (((merged_df["Original Price"] - merged_df["Discount Price"]) / merged_df["Original Price"]) * 100).round(0).astype(int)
merged_df["Label"] = merged_df.apply(lambda row: f"({row['Hotel Stars']}) - {row['Original Price']} - {row['Discount Price']} - {row['DiscountPerc']}%", axis=1)

merged_df.to_csv("./hotels_clustering_data_with_summary.csv", index=False)
merged_df.head()  # last expression in the cell, so the notebook renders the preview


In [None]:
import pandas as pd
import plotly.figure_factory as ff
from scipy.cluster.hierarchy import linkage

def create_dendrogram_from_csv(csv_path, color_threshold=825, width=1600, height=1800):
    clu_df = pd.read_csv(csv_path)
    if "Label" in clu_df.columns:
        labels = clu_df["Label"].values
        non_clustering = ["Hotel Name", "Label", "Hotel Stars", "Original Price", "Discount Price", "DiscountPerc"]
    else:
        labels = clu_df["Hotel Name"].values
        non_clustering = ["Hotel Name", "Hotel Stars", "Original Price", "Discount Price", "DiscountPerc"]
    X = clu_df.drop(columns=non_clustering, errors='ignore').values
    Z = linkage(X, method='ward')
    fig = ff.create_dendrogram(
        X,
        orientation='left',
        labels=labels,
        color_threshold=color_threshold,
        linkagefun=lambda x: Z
    )
    fig.update_layout(width=width, height=height)
    fig.show()
    return fig

color_threshold_list = [625]
for color_threshold in color_threshold_list:
    print(f'color_threshold={color_threshold}')
    create_dendrogram_from_csv("hotels_clustering_data_with_summary.csv", color_threshold=color_threshold)


###### Hierarchical Clustering with Star Rating, Avg Price, and Discounts

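We re-ran the same clustering on the 160-dimensional price vectors. The four values below are excluded from the distance computation and appear only in the leaf labels:
1. **Hotel star rating**
2. **Avg original price**
3. **Avg discount price**
4. **Avg discount rate (%)**

Each dendrogram label is `(stars) - price - discount - discount rate`.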

###### Takeaways
- The added labels were largely inconclusive and did not add much insight.
- **Star Rating is Primary** - hotels with the same star rating are likely to be grouped together, although these groups are scattered across the dendrogram.
- Hotels in tight clusters likely share near-identical pricing/discount policies, suggesting **direct competition**.  
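
A rough way to check the star-rating takeaway numerically is to cross-tabulate flat cluster ids against star ratings. The sketch below uses a made-up table; the `cluster` column is an assumption (it could come from SciPy's `fcluster`, for example), and the values are not taken from the hotel data.

```python
import pandas as pd

# Hypothetical toy table: one row per hotel with a flat cluster id and its stars.
toy = pd.DataFrame({
    "cluster":     [1, 1, 1, 2, 2, 3, 3, 3],
    "Hotel Stars": [5, 5, 4, 3, 3, 4, 5, 3],
})

# Rows: clusters; columns: star ratings; cells: share of the cluster's hotels.
share = pd.crosstab(toy["cluster"], toy["Hotel Stars"], normalize="index")
print(share.round(2))
```

A cluster dominated by one star rating has a row with a single large share, while a row spread evenly across ratings suggests that stars do not explain that cluster.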