## Feature Engineering

1. Convert datetime into useful features (hour, weekday, weekend, peak hour, season).
2. Create polynomial (squared) features for temp, humidity, windspeed.
3. One-hot encode the categorical season feature.
4. Normalize the numeric features using MinMaxScaler.

In [1]:
# Import required libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
# Load dataset

df = pd.read_csv("/content/cleaned_training_set")

In [4]:
# Parse the datetime column and derive time-based features

df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)  # Saturday or Sunday
# Peak hours: 7-9 and 17-19 (inclusive)
df['is_peak_hour'] = (df['hour'].between(7, 9) | df['hour'].between(17, 19)).astype(int)

df.head(5)

Unnamed: 0.1,Unnamed: 0,datetime,temp,atemp,humidity,windspeed,Total_Booking,hour,day_of_week,is_weekend,is_peak_hour
0,0,2024-01-01 00:00:00,22.490802,18.702659,45.702341,33.63515,25,0,0,0,0
1,1,2024-01-01 01:00:00,34.014286,25.838019,44.818728,39.83407,68,1,0,0,0
2,2,2024-01-01 02:00:00,29.639879,32.458917,84.375275,12.523395,41,2,0,0,0
3,3,2024-01-01 03:00:00,26.97317,29.644498,44.972772,31.243705,31,3,0,0,0
4,4,2024-01-01 04:00:00,18.120373,31.131223,46.316984,28.587299,59,4,0,0,0


In [5]:
# Creating season column based on month
def get_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    else:
        return "Fall"

df['season'] = df['datetime'].dt.month.apply(get_season)

df.head(5)

Unnamed: 0.1,Unnamed: 0,datetime,temp,atemp,humidity,windspeed,Total_Booking,hour,day_of_week,is_weekend,is_peak_hour,season
0,0,2024-01-01 00:00:00,22.490802,18.702659,45.702341,33.63515,25,0,0,0,0,Winter
1,1,2024-01-01 01:00:00,34.014286,25.838019,44.818728,39.83407,68,1,0,0,0,Winter
2,2,2024-01-01 02:00:00,29.639879,32.458917,84.375275,12.523395,41,2,0,0,0,Winter
3,3,2024-01-01 03:00:00,26.97317,29.644498,44.972772,31.243705,31,3,0,0,0,Winter
4,4,2024-01-01 04:00:00,18.120373,31.131223,46.316984,28.587299,59,4,0,0,0,Winter
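
The same month-to-season rule can also be written as a lookup table and applied with `Series.map`, which avoids calling a Python function for every row. A minimal sketch, assuming the same Northern Hemisphere grouping as `get_season`; the name `MONTH_TO_SEASON` and the sample dates are illustrative and not part of the original notebook:

```python
import pandas as pd

# Hypothetical lookup table mirroring get_season():
# Dec-Feb -> Winter, Mar-May -> Spring, Jun-Aug -> Summer, Sep-Nov -> Fall
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Illustrative dates, one per season
sample = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-15", "2024-04-15", "2024-07-15", "2024-10-15"]
    )
})

# Map each month number to its season label in one vectorized call
sample["season"] = sample["datetime"].dt.month.map(MONTH_TO_SEASON)
print(sample)
```

Because every month from 1 to 12 appears in the table, `map` should not produce missing values here.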


In [6]:
# Polynomial Features
df['temp_squared'] = df['temp'] ** 2
df['humidity_squared'] = df['humidity'] ** 2
df['windspeed_squared'] = df['windspeed'] ** 2

df.head(5)

Unnamed: 0.1,Unnamed: 0,datetime,temp,atemp,humidity,windspeed,Total_Booking,hour,day_of_week,is_weekend,is_peak_hour,season,temp_squared,humidity_squared,windspeed_squared
0,0,2024-01-01 00:00:00,22.490802,18.702659,45.702341,33.63515,25,0,0,0,0,Winter,505.836192,2088.703975,1131.323296
1,1,2024-01-01 01:00:00,34.014286,25.838019,44.818728,39.83407,68,1,0,0,0,Winter,1156.971661,2008.718375,1586.753122
2,2,2024-01-01 02:00:00,29.639879,32.458917,84.375275,12.523395,41,2,0,0,0,Winter,878.522417,7119.187003,156.835421
3,3,2024-01-01 03:00:00,26.97317,29.644498,44.972772,31.243705,31,3,0,0,0,Winter,727.551883,2022.550221,976.169101
4,4,2024-01-01 04:00:00,18.120373,31.131223,46.316984,28.587299,59,4,0,0,0,Winter,328.347911,2145.262967,817.233673


In [7]:
# One-hot Encoding

df = pd.get_dummies(df, columns=['season'], drop_first=True)
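
With `drop_first=True`, `get_dummies` drops the first category in sorted order and treats it as the implicit baseline. A small sketch on a toy frame (the `demo` data is illustrative) shows which columns remain when all four seasons are present:

```python
import pandas as pd

# Toy frame containing each season once
demo = pd.DataFrame({"season": ["Winter", "Spring", "Summer", "Fall"]})

# Categories are sorted alphabetically, so "Fall" is dropped as the baseline
encoded = pd.get_dummies(demo, columns=["season"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

A row whose season is Fall is then represented by all three indicator columns being zero. If the real data covers only some seasons, fewer indicator columns are created.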

In [8]:
# Scaling Features

scaler = MinMaxScaler()
scaled_cols = ['temp', 'humidity', 'windspeed', 'temp_squared', 'humidity_squared', 'windspeed_squared']
df[scaled_cols] = scaler.fit_transform(df[scaled_cols])
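
Note that the cell above fits the scaler on the full dataset before the train/test split, so the test rows influence the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. A minimal sketch on synthetic data (the `demo` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "temp": rng.uniform(0, 40, size=100),
    "humidity": rng.uniform(20, 100, size=100),
})

# Split first, then copy so the assignments below do not touch `demo`
train_demo, test_demo = train_test_split(demo, test_size=0.2, random_state=42)
train_demo, test_demo = train_demo.copy(), test_demo.copy()

cols = ["temp", "humidity"]
scaler = MinMaxScaler()
train_demo[cols] = scaler.fit_transform(train_demo[cols])  # learn min/max from train only
test_demo[cols] = scaler.transform(test_demo[cols])        # reuse the train min/max

print(train_demo[cols].agg(["min", "max"]))
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected. Tree-based models are largely insensitive to monotonic rescaling of features, so the choice matters more for models that depend on feature magnitudes.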


In [9]:
# Splitting into train and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Save new dataset
train_df.to_csv("train_cleaned.csv", index=False)
test_df.to_csv("test_cleaned.csv", index=False)

print("Feature engineering complete! 🚀 Cleaned datasets saved as train_cleaned.csv & test_cleaned.csv")

Feature engineering complete! 🚀 Cleaned datasets saved as train_cleaned.csv & test_cleaned.csv


## Implementing Random Forest Model

Steps to be performed:

* Load the cleaned datasets (train_cleaned.csv, test_cleaned.csv).
* Separate the features (X) from the target (y = Total_Booking).
* Train a Random Forest Regressor.
* Evaluate the model using MAE, MSE, RMSE, R².



In [10]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load Cleaned Data
train_df = pd.read_csv("train_cleaned.csv")
test_df = pd.read_csv("test_cleaned.csv")

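# Define Features (X) and Target (y)
# Note: only the target and datetime are dropped, so the leftover index columns
# ('Unnamed: 0.1', 'Unnamed: 0') remain in X as features.
X_train = train_df.drop(columns=['Total_Booking', 'datetime'])  # Drop target & datetime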
y_train = train_df['Total_Booking']

X_test = test_df.drop(columns=['Total_Booking', 'datetime'])
y_test = test_df['Total_Booking']

# Train Random Forest Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("📊 Model Performance:")
print(f"✅ Mean Absolute Error (MAE): {mae:.2f}")
print(f"✅ Mean Squared Error (MSE): {mse:.2f}")
print(f"✅ Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"✅ R² Score: {r2:.2f}")

📊 Model Performance:
✅ Mean Absolute Error (MAE): 24.31
✅ Mean Squared Error (MSE): 803.53
✅ Root Mean Squared Error (RMSE): 28.35
✅ R² Score: -0.04
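
An R² below zero means the model does worse than a constant prediction. One way to make that concrete is to score a mean-predicting baseline alongside the model. A sketch using scikit-learn's `DummyRegressor` on synthetic data where the target is pure noise (an assumption made only for illustration, not a claim about this dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target is drawn independently of the features
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = rng.normal(loc=50, scale=28, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, y_base)
baseline_r2 = r2_score(y_te, y_base)
print(f"Baseline MAE: {baseline_mae:.2f}, R²: {baseline_r2:.2f}")
```

A constant prediction can never score above zero on R², so a trained model whose R² is at or below this baseline has not learned anything useful from the features.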


## XGBoost Implementation

Steps in the code:

1. Load the cleaned dataset (train_cleaned.csv, test_cleaned.csv).
2. Train an XGBoost Regressor.
3. Tune hyperparameters for better performance.
4. Evaluate the model using MAE, MSE, RMSE, R².

In [11]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Load Cleaned Data
train_df = pd.read_csv("train_cleaned.csv")
test_df = pd.read_csv("test_cleaned.csv")

# Define Features (X) and Target (y)
X_train = train_df.drop(columns=['Total_Booking', 'datetime'])  # Drop target & datetime
y_train = train_df['Total_Booking']

X_test = test_df.drop(columns=['Total_Booking', 'datetime'])
y_test = test_df['Total_Booking']

# Define the base XGBoost model (it is fitted inside GridSearchCV below)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Hyperparameter Tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0]
}

grid_search = GridSearchCV(xgb_model, param_grid, scoring='r2', cv=3, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best Model
best_xgb_model = grid_search.best_estimator_

# Predictions
y_pred = best_xgb_model.predict(X_test)

# Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("📊 XGBoost Model Performance:")
print(f"✅ Best Parameters: {grid_search.best_params_}")
print(f"✅ Mean Absolute Error (MAE): {mae:.2f}")
print(f"✅ Mean Squared Error (MSE): {mse:.2f}")
print(f"✅ Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"✅ R² Score: {r2:.2f}")


Fitting 3 folds for each of 36 candidates, totalling 108 fits
📊 XGBoost Model Performance:
✅ Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
✅ Mean Absolute Error (MAE): 23.95
✅ Mean Squared Error (MSE): 791.80
✅ Root Mean Squared Error (RMSE): 28.14
✅ R² Score: -0.02
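
The "36 candidates, totalling 108 fits" line in the log follows from the grid size: 2 × 3 × 3 × 2 parameter combinations, each fitted on 3 cross-validation folds. A quick check with scikit-learn's `ParameterGrid`, reusing the grid from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))  # number of distinct combinations
n_folds = 3                                    # cv=3 in the GridSearchCV call
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

GridSearchCV also refits the best combination once on the full training set by default, which the fit count in the log does not include.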


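## Model Performance Analysis
* XGBoost improves only marginally over Random Forest (MAE 23.95 vs. 24.31, RMSE 28.14 vs. 28.35).
* Both models have a negative R² (-0.04 for Random Forest, -0.02 for XGBoost), meaning they predict slightly worse than always predicting the mean of the test target and fail to capture patterns in the data.
* Errors remain high (MAE ~24, RMSE ~28), indicating poor predictions.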