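First I read 'events.json' and build a dataframe from it with every date from 1 Jan 2008 to 31 Dec 2009 and an isEvent column: 0 if no event took place on that date, otherwise the number of events running that day (1, or 2 where two events overlap).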

Then, from train.parquet, I build a dataframe that has, for every stay_date from 1 Jan 2008 onward, a total occupancy column (room_cnt summed per date, with cancelled reservations excluded).

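Based on hotel occupancy on event days, I run K-means clustering. The result is a split of days into categories 0, 1 and 2 by importance: 0 for days without an event, 1 for event days in the lower-occupancy cluster, and 2 for event days in the higher-occupancy cluster. That dataframe is saved to a new file, "clustered.csv", for easier reuse across different versions of the training dataset.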

In [1]:
import pandas as pd
from sklearn.cluster import KMeans

In [2]:
events = pd.read_json("events.json")
events.head(10)

Unnamed: 0,name,type,date
0,Uskrs,[cultural holiday],"[{'start_date': '2008-03-23', 'finish_date': '..."
1,Božić,[cultural holiday],"[{'start_date': '2008-12-25', 'finish_date': '..."
2,Dani Svetog Vida,[cultural holiday],"[{'start_date': '2008-06-10', 'finish_date': '..."
3,Riječke ljetne noći,"[music festival, cultural festival, theatre fe...","[{'start_date': '2008-06-27', 'finish_date': '..."
4,Riječki karneval,"[carnival, music festival, theatre festival]","[{'start_date': '2008-01-17', 'finish_date': '..."
5,Ri Rock Festival,[music festival],"[{'start_date': '2008-12-05', 'finish_date': '..."
6,Festivak kRik,[music festival],"[{'start_date': '2008-10-02', 'finish_date': '..."
7,Major Cities Of Europe - IT Users Group,[conference],"[{'start_date': '2008-06-09', 'finish_date': '..."
8,Croatian American Corners Conference,[conference],"[{'start_date': '2008-09-25', 'finish_date': '..."
9,International Conference Of Clinical Ethics An...,[conference],"[{'start_date': '2008-09-05', 'finish_date': '..."


In [3]:
event_dates = events[["name", "date"]].explode("date")
exploded_dates = event_dates["date"].apply(pd.Series)
event_dates = pd.concat([event_dates, exploded_dates], axis=1).drop("date", axis=1)

event_dates["start_date"] = pd.to_datetime(event_dates["start_date"])
event_dates["finish_date"] = pd.to_datetime(event_dates["finish_date"])


In [4]:
beginning = "2008-01-01"
ending = "2009-12-31"

# Create a date range
dates = pd.date_range(start=beginning, end=ending)

# Create the dataset
new_dataset = pd.DataFrame(dates, columns=["stay_date"])

# Display the first few rows of the dataset
print(new_dataset.head())

   stay_date
0 2008-01-01
1 2008-01-02
2 2008-01-03
3 2008-01-04
4 2008-01-05


In [5]:
new_dataset["isEvent"] = 0

for _, row in event_dates.iterrows():
    new_dataset.loc[
        (new_dataset["stay_date"] >= row["start_date"])
        & (new_dataset["stay_date"] <= row["finish_date"]),
        "isEvent",
    ] += 1

print(new_dataset.head(n=20))
print(f"Unique values: {new_dataset['isEvent'].unique()}")

    stay_date  isEvent
0  2008-01-01        0
1  2008-01-02        0
2  2008-01-03        0
3  2008-01-04        0
4  2008-01-05        0
5  2008-01-06        0
6  2008-01-07        0
7  2008-01-08        0
8  2008-01-09        0
9  2008-01-10        0
10 2008-01-11        0
11 2008-01-12        0
12 2008-01-13        0
13 2008-01-14        0
14 2008-01-15        0
15 2008-01-16        0
16 2008-01-17        1
17 2008-01-18        1
18 2008-01-19        1
19 2008-01-20        1
Unique values: [0 1 2]
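
The row-by-row loop above can also be written without `iterrows`. The following is only a sketch on toy data: the helper name `count_events_per_day` and the `toy_events` frame are hypothetical, and it assumes at least one event interval is present. It expands each interval into its individual days and counts how many intervals cover each calendar day.

```python
import pandas as pd

# Toy event intervals (hypothetical data, not taken from events.json)
toy_events = pd.DataFrame({
    "start_date": pd.to_datetime(["2008-01-02", "2008-01-04"]),
    "finish_date": pd.to_datetime(["2008-01-05", "2008-01-04"]),
})


def count_events_per_day(event_dates, start, end):
    """Return one row per calendar day with the number of events covering it."""
    # Expand every [start_date, finish_date] interval into its individual days
    days = pd.concat(
        [
            pd.Series(pd.date_range(s, f))
            for s, f in zip(event_dates["start_date"], event_dates["finish_date"])
        ],
        ignore_index=True,
    )
    # Number of intervals that include each day
    counts = days.value_counts()
    calendar = pd.date_range(start=start, end=end)
    # Days without any event get 0; days outside the calendar are dropped
    return pd.DataFrame({
        "stay_date": calendar,
        "isEvent": counts.reindex(calendar, fill_value=0).to_numpy(),
    })


toy_calendar = count_events_per_day(toy_events, "2008-01-01", "2008-01-06")
print(toy_calendar)
```

On 2008-01-04 both toy intervals overlap, so that day gets a count of 2, which is the same kind of value the loop above produces for overlapping events.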


In [6]:
dataset = pd.read_parquet("../../lumen_dataset/train.parquet")
dataset.head()

Unnamed: 0,reservation_id,night_number,stay_date,guest_id,guest_country_id,reservation_status,reservation_date,date_from,date_to,resort_id,...,price,price_tax,total_price_tax,total_price,food_price,food_price_tax,other_price,other_price_tax,room_category_id,sales_channel_id
0,73710,1.0,2007-12-13,22897,HR,Checked-out,2007-11-28,2007-12-13,2007-12-15,1,...,4255.462,425.517,452.089,4564.69,265.428,26.572,43.8,0.0,3,10.0
1,73710,2.0,2007-12-14,22897,HR,Checked-out,2007-11-28,2007-12-13,2007-12-15,1,...,4243.709,424.349,450.921,4552.937,265.428,26.572,43.8,0.0,3,10.0
2,74464,1.0,2008-01-01,106278,HR,Checked-out,2007-12-29,2008-01-01,2008-01-02,1,...,4336.857,433.693,3806.147,19764.823,530.929,53.071,14897.037,3319.383,4,4.0
3,74461,1.0,2008-01-01,38936,GB,Cancelled,2007-12-29,2008-01-01,2008-01-02,1,...,8536.766,853.662,1012.948,10392.28,1592.714,159.286,262.8,0.0,5,3.0
4,74466,1.0,2008-01-01,106279,HR,Cancelled,2007-12-29,2008-01-01,2008-01-03,1,...,,,,,,,,,6,4.0


In [7]:
# Remove all the cancelled reservations
dataset = dataset[dataset['reservation_status'] != 'Cancelled']

# Occupancy dataset sums up all the occupied rooms on each date
occ_dataset = dataset.groupby('stay_date')['room_cnt'].sum().reset_index()
occ_dataset = occ_dataset[occ_dataset['stay_date'].dt.year != 2007]
print(occ_dataset.head(20))

    stay_date  room_cnt
2  2008-01-01        22
3  2008-01-02         4
4  2008-01-03        17
5  2008-01-04        20
6  2008-01-05        10
7  2008-01-06         9
8  2008-01-07        11
9  2008-01-08        23
10 2008-01-09        46
11 2008-01-10        55
12 2008-01-11        46
13 2008-01-12        23
14 2008-01-13        18
15 2008-01-14        27
16 2008-01-15        41
17 2008-01-16        48
18 2008-01-17        52
19 2008-01-18        52
20 2008-01-19        38
21 2008-01-20        30


Merge the two datasets so that clustering can be run.

In [8]:
merged_df = pd.merge(occ_dataset, new_dataset, on='stay_date')
merged_df.head()

Unnamed: 0,stay_date,room_cnt,isEvent
0,2008-01-01,22,0
1,2008-01-02,4,0
2,2008-01-03,17,0
3,2008-01-04,20,0
4,2008-01-05,10,0
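
`pd.merge` performs an inner join by default, so a calendar day with no remaining reservations would silently drop out of the merged frame. A possible way to check for that, sketched on hypothetical toy frames (`toy_calendar`, `toy_occ`), is to compare the inner merge with a left merge from the calendar and to let `validate` confirm that `stay_date` is unique on both sides.

```python
import pandas as pd

# Hypothetical toy data: the calendar has four days, occupancy only three
toy_calendar = pd.DataFrame({
    "stay_date": pd.date_range("2008-01-01", "2008-01-04"),
    "isEvent": [0, 1, 1, 0],
})
toy_occ = pd.DataFrame({
    "stay_date": pd.to_datetime(["2008-01-01", "2008-01-02", "2008-01-04"]),
    "room_cnt": [22, 4, 20],
})

# Inner merge (the default): days missing from either side are dropped.
# validate raises MergeError if stay_date is duplicated on either side.
inner = pd.merge(toy_occ, toy_calendar, on="stay_date", validate="one_to_one")

# Left merge from the calendar keeps every day; missing occupancy becomes NaN
left = pd.merge(toy_calendar, toy_occ, on="stay_date", how="left", validate="one_to_one")
missing_days = left.loc[left["room_cnt"].isna(), "stay_date"]

print(len(inner), len(left))
print(missing_days.dt.strftime("%Y-%m-%d").tolist())
```

Whether missing days should be kept (for example with an occupancy of 0) or dropped depends on how the training set is built downstream; the sketch only makes the difference visible.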


In [9]:
# Keep only the days with exactly one event; these are the rows that get clustered.
# Days with isEvent == 2 are not clustered and keep the value 2 via the fillna below.
# .copy() avoids pandas' SettingWithCopyWarning when the 'cluster' column is added.
events_to_cluster = merged_df[merged_df['isEvent'] == 1][['room_cnt']].copy()

# Apply K-Means clustering with 2 clusters to the event days' room_cnt
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(events_to_cluster)

# Assign clusters, shifting the labels from 0,1 to 1,2
events_to_cluster['cluster'] = kmeans.labels_ + 1

# Ensure that the cluster with the higher mean room_cnt is always labelled 2
cluster_centers = kmeans.cluster_centers_.flatten()
if cluster_centers[0] > cluster_centers[1]:
    events_to_cluster['cluster'] = events_to_cluster['cluster'].map({1: 2, 2: 1})

# Merge the cluster labels back into the original dataframe by index
merged_df = merged_df.merge(events_to_cluster['cluster'], left_index=True, right_index=True, how='left')

# Rows that were not clustered have NaN in 'cluster'; fall back to their 'isEvent' value
merged_df['event_category'] = merged_df['cluster'].fillna(merged_df['isEvent'])

# Clean up by dropping the temporary 'cluster' column
merged_df.drop(columns=['cluster'], inplace=True)
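
K-means assigns its label numbers arbitrarily, which is why the cell above swaps the labels when the first centre is the larger one. A more general variant, shown here only as a sketch with a hypothetical helper `ordered_kmeans_labels` and made-up room counts, ranks the cluster centres with `argsort` so that the same relabelling works for any number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans


def ordered_kmeans_labels(values, n_clusters=2, random_state=42):
    """Cluster 1-D values; return labels 1..n_clusters ordered by cluster centre."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # order[i] is the original label of the i-th smallest centre
    order = np.argsort(km.cluster_centers_.ravel())
    # rank[original_label] is that label's position in the sorted order
    rank = np.empty_like(order)
    rank[order] = np.arange(n_clusters)
    # Shift from 0-based ranks to 1-based categories
    return rank[km.labels_] + 1


toy_occupancy = [4, 10, 9, 52, 55, 48]  # hypothetical room counts
toy_labels = ordered_kmeans_labels(toy_occupancy)
print(toy_labels)
```

With two clusters this gives the same result as the explicit swap: the low-occupancy group is labelled 1 and the high-occupancy group 2.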

In [10]:
merged_df.to_csv("../../lumen_dataset/clustered.csv")
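
When the saved file is loaded again, the index written by `to_csv` comes back as an unnamed first column, dates come back as strings, and `event_category` may be stored as a float because the intermediate `cluster` column contained NaN. The sketch below shows one way to read such a file; it uses an in-memory buffer and a hypothetical two-row frame instead of the real "clustered.csv", and the integer cast is an assumption about what downstream code expects.

```python
import io

import pandas as pd

# Hypothetical stand-in for the frame that was written to clustered.csv
toy = pd.DataFrame({
    "stay_date": pd.to_datetime(["2008-01-01", "2008-01-17"]),
    "room_cnt": [22, 52],
    "isEvent": [0, 1],
    "event_category": [0.0, 2.0],
})

# Round-trip through an in-memory CSV, written with the index as above
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype
restored = pd.read_csv(buffer, index_col=0, parse_dates=["stay_date"])
# Assumption: downstream code wants integer categories 0/1/2
restored["event_category"] = restored["event_category"].astype(int)

print(restored.dtypes)
```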