# 🏨 Feature Selection - Hotel Booking Cancellation Prediction

This file is the README for the **Feature Selection** stage. In this project, the most important columns were selected for predicting hotel booking cancellations.

---

## 1️⃣ Goal

- If every feature in the dataset is used:
  - The model becomes more expensive to train and run
  - Some features may be redundant or noisy for the model
- With **Feature Selection**:
  - The most informative features are kept
  - Model performance can improve
  - Computation gets faster and the model becomes easier to interpret

---

## 2️⃣ Work done

- **Data loading**:
  - `X_train_engineered.csv` and `X_test_engineered.csv`
  - `y_train.csv` and `y_test.csv`

- **Data preprocessing**:
  - Categorical features: **ordinal (label-style) encoding** with `OrdinalEncoder`; categories not seen during fitting are encoded as `-1`
  - Numeric features: **missing-value imputation** with the column mean
  - Categorical features: **most-frequent imputation**
  - All features are standardized with `StandardScaler` before the Lasso fit, because Lasso is sensitive to feature scale

- **Feature selection method**:
  - **LassoCV** (L1 regularization with 5-fold cross-validation):
    - Coefficients of weak or redundant features are shrunk to exactly 0
    - Features with a nonzero coefficient are the **selected features**
  - This reduces the number of features so that only the relevant columns remain

- **Result**:
  - The selected features were saved to CSV files:
    - `X_train_selected.csv`
    - `X_test_selected.csv`
  - These files are ready for model fitting
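
The L1-based selection described above can be illustrated on synthetic data. The following is a minimal sketch, not the project's pipeline: the data, the column names `f0` to `f5`, and the noise level are all made up for illustration, and only the first two columns actually drive the target.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data (assumption): only f0 and f1 influence the target.
X = rng.normal(size=(n_samples, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)
feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]

# Standardize first so the L1 penalty treats every column on the same scale.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coef = model.named_steps["lassocv"].coef_
selected = [name for name, c in zip(feature_names, coef) if c != 0]

print("coefficients:", np.round(coef, 3))
print("selected:", selected)
```

Because `LassoCV` picks the penalty strength that minimizes cross-validated prediction error, it can leave a few weak features with small nonzero coefficients. A nonzero coefficient therefore means "not eliminated" rather than "strongly predictive".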

---

## 3️⃣ Selected features

- The LassoCV run logged at the end of this document kept 34 features with a nonzero coefficient, including:
  - `hotel`
  - `lead_time`
  - `arrival_date_month_num`
  - `stays_in_week_nights`
  - `adults`
  - `babies`
  - `meal`
  - … and the other columns with `coef != 0`

> ℹ️ All selected features are available in the `X_train_selected.csv` and `X_test_selected.csv` files.

---

## 4️⃣ Files used

- `Data/`
  - `Feature_Selection/`
    - `X_train_selected.csv`
    - `X_test_selected.csv`
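
For downstream modeling, the two files in this layout might be loaded along the following lines. This is a sketch under assumptions: `load_selected` is a hypothetical helper, not part of the project, and `base_dir` stands for wherever the `Data/` folder lives on your machine.

```python
from pathlib import Path

import pandas as pd


def load_selected(base_dir):
    """Load the selected train/test matrices and check that their columns agree.

    `base_dir` is the folder that contains `Feature_Selection/` (hypothetical helper).
    """
    folder = Path(base_dir) / "Feature_Selection"
    X_train = pd.read_csv(folder / "X_train_selected.csv")
    X_test = pd.read_csv(folder / "X_test_selected.csv")

    # A model fitted on the train columns expects the same columns, in the same order.
    if list(X_train.columns) != list(X_test.columns):
        raise ValueError("Train and test files have different columns")
    return X_train, X_test
```

A call such as `X_train_sel, X_test_sel = load_selected("Data")` would then return both frames, or fail early if the two files have drifted apart.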

In [1]:
import pandas as pd
import logging
import os

# =========================
# Log file path
# =========================
log_path = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Log\data_loader.log"
os.makedirs(os.path.dirname(log_path), exist_ok=True)

logging.basicConfig(
    filename=log_path,
    filemode="a",
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO
)

logging.info("===== FEATURE-ENGINEERED DATA LOADER STARTED =====")

# =========================
# Feature-engineered X files
# =========================
FE_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Enginered_Data"

X_train_path = os.path.join(FE_PATH, "X_train_engineered.csv")
X_test_path  = os.path.join(FE_PATH, "X_test_engineered.csv")

# =========================
# Target files (preprocessed)
# =========================
PREP_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Preprosessed"

y_train_path = os.path.join(PREP_PATH, "y_train.csv")
y_test_path  = os.path.join(PREP_PATH, "y_test.csv")

# =========================
# Load the datasets
# =========================
try:
    X_train = pd.read_csv(X_train_path)
    X_test  = pd.read_csv(X_test_path)
    y_train = pd.read_csv(y_train_path)
    y_test  = pd.read_csv(y_test_path)

    logging.info("Feature-engineered datasets loaded successfully")
    logging.info(f"X_train shape: {X_train.shape}")
    logging.info(f"X_test  shape: {X_test.shape}")
    logging.info(f"y_train shape: {y_train.shape}")
    logging.info(f"y_test  shape: {y_test.shape}")

except Exception as e:
    # Record the failure in the log, then re-raise so the run stops here
    logging.error(f"Error while loading the datasets: {e}")
    raise

# =========================
# Sanity check
# =========================
if X_train.shape[0] != y_train.shape[0]:
    logging.error("X_train and y_train row counts do not match")
    raise ValueError("Train set mismatch")

if X_test.shape[0] != y_test.shape[0]:
    logging.error("X_test and y_test row counts do not match")
    raise ValueError("Test set mismatch")

# Target leakage check: the target column must not appear among the features
if set(y_train.columns) & set(X_train.columns):
    logging.error("The target column has leaked into X_train!")
    raise ValueError("Target leakage detected")

logging.info("✅ DLP checks passed successfully")
logging.info("===== FEATURE-ENGINEERED DATA LOADER FINISHED =====")
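
The row-count and leakage checks in this cell could be factored into one reusable function so that the train and test splits go through identical validation. The sketch below is an illustration only: `validate_split` is a hypothetical helper, and `is_canceled` is an assumed name for the target column.

```python
import pandas as pd


def validate_split(X, y, name):
    """Raise ValueError if X and y differ in length or share a column name."""
    if len(X) != len(y):
        raise ValueError(f"{name}: X has {len(X)} rows but y has {len(y)}")
    overlap = set(X.columns) & set(y.columns)
    if overlap:
        raise ValueError(f"{name}: target column(s) found in X: {sorted(overlap)}")


# Tiny illustrative frames; "is_canceled" is an assumed target column name.
X_demo = pd.DataFrame({"lead_time": [10, 45, 3], "adults": [2, 1, 2]})
y_demo = pd.DataFrame({"is_canceled": [0, 1, 0]})

validate_split(X_demo, y_demo, "train")  # passes silently
```

Note that a column-name comparison only catches the target leaking under its own name. A renamed copy of the target, or a feature derived from it, would pass this check unnoticed.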

In [13]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LassoCV
from sklearn.impute import SimpleImputer

# =========================
# PATHS
# =========================
FE_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Enginered_Data"
PREP_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Preprosessed"

X_train_path = os.path.join(FE_PATH, "X_train_engineered.csv")
X_test_path  = os.path.join(FE_PATH, "X_test_engineered.csv")
y_train_path = os.path.join(PREP_PATH, "y_train.csv")
y_test_path  = os.path.join(PREP_PATH, "y_test.csv")

# =========================
# LOAD DATA
# =========================
X_train = pd.read_csv(X_train_path)
X_test  = pd.read_csv(X_test_path)
y_train = pd.read_csv(y_train_path).values.ravel()
y_test  = pd.read_csv(y_test_path).values.ravel()

# =========================
# STRING → NUMERIC (Month)
# =========================
month_map = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}

X_train['arrival_date_month_num'] = X_train['arrival_date_month'].map(month_map)
X_test['arrival_date_month_num']  = X_test['arrival_date_month'].map(month_map)

X_train.drop(columns=['arrival_date_month'], inplace=True)
X_test.drop(columns=['arrival_date_month'], inplace=True)

# =========================
# CATEGORICAL VS NUMERIC FEATURES
# =========================
categorical_features = [
    'hotel','meal','country','market_segment','distribution_channel',
    'reserved_room_type','assigned_room_type','deposit_type',
    'customer_type','city','agent','company'
]

numeric_features = [c for c in X_train.columns if c not in categorical_features]

# =========================
# IMPUTATION AND ORDINAL ENCODING VIA COLUMN TRANSFORMER (unknown_value=-1)
# =========================
cat_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
)

preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numeric_features),
    ('cat', cat_pipeline, categorical_features)
])

# =========================
# LASSO PIPELINE
# =========================
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),  # Lasso sensitive to scale
    ('lasso', LassoCV(cv=5, random_state=42, n_jobs=-1))
])

# =========================
# FIT LASSO
# =========================
print("🚀 Starting feature selection with LassoCV...")
pipeline.fit(X_train, y_train)
print("✅ LassoCV fit finished")

# =========================
# GET FEATURE NAMES
# =========================
# ColumnTransformer outputs the numeric block first, then the categorical block,
# so this list lines up with the columns seen by the scaler and the Lasso.
encoded_features = numeric_features + categorical_features
lasso_coef = pipeline.named_steps['lasso'].coef_
selected_features = [f for f, c in zip(encoded_features, lasso_coef) if c != 0]

print(f"✅ Number of selected features: {len(selected_features)}")
print("Selected features:")
for f in selected_features:
    print(f" - {f}")

# =========================
# SELECTED FEATURE DATA
# =========================
X_train_selected = pd.DataFrame(
    pipeline.named_steps['preprocessor'].transform(X_train),
    columns=encoded_features
)[selected_features]

X_test_selected = pd.DataFrame(
    pipeline.named_steps['preprocessor'].transform(X_test),
    columns=encoded_features
)[selected_features]

# =========================
# SAVE TO CSV
# =========================
SAVE_PATH = r"C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Feature_Selection"
os.makedirs(SAVE_PATH, exist_ok=True)

X_train_selected.to_csv(os.path.join(SAVE_PATH, "X_train_selected.csv"), index=False)
X_test_selected.to_csv(os.path.join(SAVE_PATH, "X_test_selected.csv"), index=False)

# List of selected features
selected_features_df = pd.DataFrame({"selected_features": selected_features})
selected_features_df.to_csv(os.path.join(SAVE_PATH, "selected_features_list.csv"), index=False)

print(f"✅ Selected features and their list were saved as CSV in: {SAVE_PATH}")

🚀 Starting feature selection with LassoCV...
✅ LassoCV fit finished
✅ Number of selected features: 34
Selected features:
 - lead_time
 - stays_in_week_nights
 - adults
 - babies
 - is_repeated_guest
 - previous_cancellations
 - previous_bookings_not_canceled
 - booking_changes
 - days_in_waiting_list
 - adr
 - total_of_special_requests
 - arrival_month_num
 - total_stay_nights
 - total_guests
 - adr_per_person
 - special_req_ratio
 - has_children
 - is_long_stay
 - has_parking
 - has_deposit
 - changed_room
 - arrival_date_month_num
 - hotel
 - meal
 - country
 - market_segment
 - distribution_channel
 - reserved_room_type
 - assigned_room_type
 - deposit_type
 - customer_type
 - city
 - agent
 - company
✅ Selected features and their list were saved as CSV in: C:\Users\Rasulbek907\Desktop\Hotel Booking Cancellation Prediction\Data\Feature_Selection
