<a href="./feature_selection.ipynb" target="_self">
  <button style="
    padding:10px 18px;
    font-size:16px;
    background-color:#2563eb;
    color:white;
    border:none;
    border-radius:8px;
    cursor:pointer;">
    ➡️ Go to Feature Selection
  </button>
</a>


# 🏨 Hotel Booking Cancellation – Feature Engineering

## 🎯 Goal
This feature engineering stage was carried out to predict the **`is_canceled`** target column.  
Main goals:

- 📉 Ensure **Data Leakage Prevention (DLP)**
- 🧠 Improve **model quality**
- 💾 Use **memory (RAM)** efficiently
- 🏭 Match the requirements of a **real production ML project**

---

## 📂 Datasets used
The following **pre-split** datasets were used:

- `X_train_preprocessed.csv`
- `X_test_preprocessed.csv`

📌 **Note:**  
❌ Raw data was not used  
✅ Only data already split into train/test was used (DLP-safe)

---

## 🧹 1️⃣ Removing leakage features
The following columns were **not passed to the model**, because they are strongly tied to the target or are post-event information (recorded after the outcome is known):

- ❌ `reservation_status`
- ❌ `reservation_status_date`  
  (only date features derived from it were kept; since these inherit the same post-event timing, they should be checked for leakage as well)

👉 This is a critical step for preventing **Data Leakage**.

---

## 📅 2️⃣ Date feature engineering
The following new features were created from the `reservation_status_date` column:

- 🗓 `res_year`
- 🗓 `res_month`
- 🗓 `res_day`
- 🗓 `res_weekday`

In addition:
- 📆 `arrival_date_month` (string) → `arrival_month_num` (numeric)

🎯 **Result:**  
The model can learn time-dependent patterns more easily.

---

## ➕ 3️⃣ Aggregated & ratio features
New features were derived using domain knowledge:

- 🛏 **`total_stay_nights`**  
  `stays_in_weekend_nights + stays_in_week_nights`

- 👨‍👩‍👧 **`total_guests`**  
  `adults + children + babies`

- 💰 **`adr_per_person`**  
  `adr / total_guests`

- ⭐ **`special_req_ratio`**  
  `total_of_special_requests / total_stay_nights`

For both ratios, a zero divisor is replaced with 1 to avoid division by zero.

🎯 **Result:**  
More informative features were built from plain columns.

---

## 🚩 4️⃣ Binary / flag features
The following **0/1 (flag)** features were created:

- 👶 `has_children`
- 🕒 `is_long_stay` (≥ 7 nights)
- 🚗 `has_parking`
- 💳 `has_deposit`
- 🏨 `changed_room`  
  (reserved and assigned room types differ)

🎯 **Result:**  
Simple but strong signals for the model.

---

## 🧠 5️⃣ Rare category handling (Memory-safe 🔥)
To avoid memory problems during One-Hot Encoding, rare categories (those making up less than 1% of the training rows) were merged into `"Other"`.

Columns it was applied to:
- 🌍 `country`
- 🧑‍💼 `agent`
- 🏢 `company`
- 🏙 `city`

📌 Rules:
- ✅ **Fit** on `X_train` only
- ✅ **Apply** to `X_test`
- ❌ Nothing is learned from the test data

🎯 **Result:**  
- RAM is saved  
- The number of features stays under control  
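
The effect on one-hot width can be illustrated with a small sketch. The country codes, the row counts, and the 5% threshold below are made up for this toy example and are not taken from the project data.

```python
import pandas as pd

# Toy training column: two frequent countries plus ten one-off codes (made-up values).
train = pd.DataFrame(
    {"country": ["PRT"] * 60 + ["GBR"] * 30 + [f"C{i}" for i in range(10)]}
)

# Category shares are computed from the training data only.
freq = train["country"].value_counts(normalize=True)
keep = freq[freq >= 0.05].index  # hypothetical 5% threshold for this toy data

# Everything outside the frequent set is merged into "Other".
grouped = train["country"].where(train["country"].isin(keep), "Other")

cols_before = pd.get_dummies(train["country"]).shape[1]
cols_after = pd.get_dummies(grouped).shape[1]
print(cols_before, cols_after)  # 12 one-hot columns shrink to 3
```

On real high-cardinality columns the reduction would depend on the actual frequency distribution and the chosen threshold.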

---

## 🔐 6️⃣ Data Leakage Prevention (DLP) principles
The following rules were strictly followed:

- ✅ Train and test are kept strictly separate
- ✅ Fitting is done on the train dataset only
- ✅ `is_canceled` is never used as a feature
- ✅ Post-event information was removed
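
Two of these rules (no target column and no post-event columns among the features) could be checked automatically. The sketch below is one possible way to do it. The helper name `find_leakage_columns` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

# Columns that must never reach the model, per the rules above.
LEAKAGE_COLUMNS = {"reservation_status", "reservation_status_date"}
TARGET = "is_canceled"


def find_leakage_columns(X: pd.DataFrame) -> set:
    """Return any forbidden columns still present in a feature frame (hypothetical helper)."""
    return (LEAKAGE_COLUMNS | {TARGET}) & set(X.columns)


# Toy frames standing in for engineered feature sets.
clean = pd.DataFrame({"lead_time": [10, 20], "adr": [80.0, 120.0]})
leaky = clean.assign(reservation_status=["Canceled", "Check-Out"])

print(find_leakage_columns(clean))  # empty set: nothing forbidden remains
print(find_leakage_columns(leaky))  # reports the leftover column
```

A check like this only catches columns by name. Features derived from a post-event column would still need to be reviewed by hand.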

---

## 💾 7️⃣ Saved files
At the end of feature engineering the following files were created:

Data/Enginered_Data/
│
├── X_train_engineered.csv
├── X_test_engineered.csv

In [1]:
import pandas as pd
import logging
import os


log_path = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Log\data_loader.log"
os.makedirs(os.path.dirname(log_path), exist_ok=True)


logging.basicConfig(
    filename=log_path,
    filemode="a",
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO
)

logging.info("===== PREPROCESSED DATA LOADER STARTED =====")


BASE_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Preprosessed"

PATHS = {
    "X_train": "X_train_preprocessed.csv",
    "X_test":  "X_test_preprocessed.csv",
    "y_train": "y_train.csv",
    "y_test":  "y_test.csv"
}


try:
    X_train = pd.read_csv(os.path.join(BASE_PATH, PATHS["X_train"]))
    X_test  = pd.read_csv(os.path.join(BASE_PATH, PATHS["X_test"]))
    y_train = pd.read_csv(os.path.join(BASE_PATH, PATHS["y_train"]))
    y_test  = pd.read_csv(os.path.join(BASE_PATH, PATHS["y_test"]))

    logging.info("Preprocessed datasets loaded successfully")
    logging.info(f"X_train shape: {X_train.shape}")
    logging.info(f"X_test  shape: {X_test.shape}")
    logging.info(f"y_train shape: {y_train.shape}")
    logging.info(f"y_test  shape: {y_test.shape}")

except Exception as e:
    logging.error(f"Error while loading datasets: {e}")
    raise


# Features and target must have the same number of rows
if X_train.shape[0] != y_train.shape[0]:
    logging.error("X_train and y_train row counts do not match")
    raise ValueError("Train set mismatch")

if X_test.shape[0] != y_test.shape[0]:
    logging.error("X_test and y_test row counts do not match")
    raise ValueError("Test set mismatch")

# Target leakage check: no target column may appear among the features
if set(y_train.columns) & set(X_train.columns):
    logging.error("Target has leaked into X_train!")
    raise ValueError("Target leakage detected")

logging.info("DLP checks passed successfully")
logging.info("===== DATA LOADER FINISHED =====")

# ⚠️ IMPORTANT DLP NOTE

- ❌ reservation_status → STRONG LEAKAGE WITH THE TARGET
- ❌ reservation_status_date → may be a date after the cancellation

# DATE FEATURES

In [2]:
def process_dates(df):
    df = df.copy()

    # Parse reservation_status_date; unparseable values become NaT
    df["reservation_status_date"] = pd.to_datetime(
        df["reservation_status_date"], errors="coerce"
    )

    df["res_year"] = df["reservation_status_date"].dt.year
    df["res_month"] = df["reservation_status_date"].dt.month
    df["res_day"] = df["reservation_status_date"].dt.day
    df["res_weekday"] = df["reservation_status_date"].dt.weekday  # Monday=0 ... Sunday=6

    # arrival_date_month (string → number); names missing from the map become NaN
    month_map = {
        "January": 1, "February": 2, "March": 3, "April": 4,
        "May": 5, "June": 6, "July": 7, "August": 8,
        "September": 9, "October": 10, "November": 11, "December": 12
    }
    df["arrival_month_num"] = df["arrival_date_month"].map(month_map)

    return df
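
Because `errors="coerce"` turns unparseable dates into `NaT`, the derived `res_*` columns can contain missing values. A small sketch with made-up inputs shows the effect:

```python
import pandas as pd

# Made-up inputs: one valid date, one invalid string, one missing value.
raw = pd.Series(["2017-07-01", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

features = pd.DataFrame(
    {
        "res_year": parsed.dt.year,
        "res_month": parsed.dt.month,
        "res_weekday": parsed.dt.weekday,  # Monday=0 ... Sunday=6
    }
)

print(features)
print(features.isna().sum())  # the two unparseable rows show up as missing values
```

If the downstream model cannot handle missing values, these rows would need imputation or another treatment, which this sketch does not cover.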

# AGGREGATED / RATIO FEATURES

In [3]:
def create_aggregates(df):
    df = df.copy()

    df["total_stay_nights"] = (
        df["stays_in_weekend_nights"] + df["stays_in_week_nights"]
    )

    # children may contain missing values; treat them as 0
    df["total_guests"] = (
        df["adults"] + df["children"].fillna(0) + df["babies"]
    )

    # Replace a zero divisor with 1 to avoid division by zero
    df["adr_per_person"] = df["adr"] / df["total_guests"].replace(0, 1)

    df["special_req_ratio"] = (
        df["total_of_special_requests"] / df["total_stay_nights"].replace(0, 1)
    )

    return df
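
The `.replace(0, 1)` guard matters for bookings that record zero guests. The sketch below uses made-up values to compare the guarded and unguarded ratio:

```python
import pandas as pd

# Toy bookings: the second row has no guests recorded and a missing children value.
toy = pd.DataFrame(
    {
        "adr": [100.0, 90.0],
        "adults": [2, 0],
        "children": [0.0, None],
        "babies": [0, 0],
    }
)

total_guests = toy["adults"] + toy["children"].fillna(0) + toy["babies"]

unguarded = toy["adr"] / total_guests               # second row divides by zero -> inf
guarded = toy["adr"] / total_guests.replace(0, 1)   # second row falls back to adr itself

print(unguarded.tolist())
print(guarded.tolist())
```

With the guard, a zero-guest booking gets `adr_per_person` equal to `adr`. That is a modeling choice rather than a true per-person rate.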

# BINARY / FLAG FEATURES

In [4]:
def create_flags(df):
    df = df.copy()

    df["has_children"] = ((df["children"] > 0) | (df["babies"] > 0)).astype(int)
    # Requires total_stay_nights, so create_aggregates must run first
    df["is_long_stay"] = (df["total_stay_nights"] >= 7).astype(int)
    df["has_parking"] = (df["required_car_parking_spaces"] > 0).astype(int)
    df["has_deposit"] = (df["deposit_type"] != "No Deposit").astype(int)
    df["changed_room"] = (
        df["reserved_room_type"] != df["assigned_room_type"]
    ).astype(int)

    return df
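
One detail of the flag logic is that a comparison against a missing value evaluates to False in pandas. A missing `children` entry therefore counts as "no children" unless `babies` is positive. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows covering the four combinations of interest.
toy = pd.DataFrame(
    {
        "children": [0.0, 2.0, None, None],
        "babies": [0, 0, 1, 0],
    }
)

# Same expression as has_children above: NaN > 0 is False, so a missing value cannot set the flag.
has_children = ((toy["children"] > 0) | (toy["babies"] > 0)).astype(int)

print(has_children.tolist())  # [0, 1, 1, 0]
```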

# RARE CATEGORY HANDLING (MEMORY SAFE)

In [5]:
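def reduce_rare_categories(train_df, test_df, col, min_freq=0.01):
    # Note: both DataFrames are modified in place and also returned.
    # Category frequencies are learned from the training data only.
    freq = train_df[col].value_counts(normalize=True)
    valid_categories = freq[freq >= min_freq].index

    # Values outside the frequent set (including missing values) become "Other"
    train_df[col] = train_df[col].where(train_df[col].isin(valid_categories), "Other")
    test_df[col] = test_df[col].where(test_df[col].isin(valid_categories), "Other")

    return train_df, test_df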
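
The fit-on-train, apply-to-test pattern can also be split into two steps, which makes it easier to see what happens to categories that appear only in the test set. The sketch below is a non-mutating variant. The function names, the toy data, and the 30% threshold are hypothetical.

```python
import pandas as pd


def fit_frequent_categories(train_col: pd.Series, min_freq: float = 0.01) -> pd.Index:
    """Learn the set of frequent categories from the training column only."""
    freq = train_col.value_counts(normalize=True)
    return freq[freq >= min_freq].index


def apply_frequent_categories(col: pd.Series, keep: pd.Index) -> pd.Series:
    """Return a new Series where values outside `keep` are replaced by "Other"."""
    return col.where(col.isin(keep), "Other")


# Made-up data: "FRA" never appears in train, and one test value is missing.
train = pd.Series(["PRT", "PRT", "PRT", "GBR", "GBR", "ESP"], name="country")
test = pd.Series(["PRT", "ESP", "FRA", None], name="country")

keep = fit_frequent_categories(train, min_freq=0.3)  # high threshold for toy data
train_out = apply_frequent_categories(train, keep)
test_out = apply_frequent_categories(test, keep)

print(sorted(keep))       # ['GBR', 'PRT']
print(test_out.tolist())  # ['PRT', 'Other', 'Other', 'Other']
```

The rare training category, the unseen test category, and the missing value all end up as `"Other"`. The test set never influences which categories are kept.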

# DROPPING LEAKAGE FEATURES

In [6]:
def drop_leakage_features(df):
    return df.drop(
        columns=[
            "reservation_status",        # TARGET LEAKAGE
            "reservation_status_date"    # POST-EVENT
        ],
        errors="ignore"
    )

# PUTTING IT ALL TOGETHER

In [7]:
def feature_engineering(X_train, X_test):
    X_train = process_dates(X_train)
    X_test  = process_dates(X_test)

    X_train = create_aggregates(X_train)
    X_test  = create_aggregates(X_test)

    X_train = create_flags(X_train)
    X_test  = create_flags(X_test)

    for col in ["country", "agent", "company", "city"]:
        X_train, X_test = reduce_rare_categories(X_train, X_test, col)

    X_train = drop_leakage_features(X_train)
    X_test  = drop_leakage_features(X_test)

    return X_train, X_test

In [8]:
X_train_fe, X_test_fe = feature_engineering(X_train, X_test)



# Saving the feature-engineered data to CSV

In [9]:
import os

SAVE_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Enginered_Data"
os.makedirs(SAVE_PATH, exist_ok=True)

X_train_fe.to_csv(
    os.path.join(SAVE_PATH, "X_train_engineered.csv"),
    index=False
)

X_test_fe.to_csv(
    os.path.join(SAVE_PATH, "X_test_engineered.csv"),
    index=False
)

print("Feature-engineered train/test sets saved to CSV")

Feature-engineered train/test sets saved to CSV
