<p style="background-color:lightgreen;font-family:newtimeroman;font-size:22px;line-height:1.7em;text-align:center;border-radius:5px 5px">Temporal Demand Forecasting for Taxi Services: A Multi-Model Comparison</p>

In [3]:
# importing required libraries
import os
import warnings

import numpy as np
import pandas as pd
import tensorflow as tf
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


class TaxiDemandDataProcessor:
    def __init__(self, folder_path):
        self.folder_path = folder_path
        self.merged_df = self.load_data()
        self.preprocess_data()

    def load_data(self):
        dfs = []

        # Load each JSON file into a separate DataFrame
        for json_file in os.listdir(self.folder_path):
            if json_file.endswith('.json'):
                file_path = os.path.join(self.folder_path, json_file)
                df = pd.read_json(file_path)
                dfs.append(df)

        # Merge DataFrames into a single DataFrame
        merged_df = pd.concat(dfs, ignore_index=True)
        return merged_df

    def preprocess_data(self):
        # Convert time-related columns to datetime
        self.merged_df['startTime'] = pd.to_datetime(self.merged_df['startTime'])
        self.merged_df['endTime'] = pd.to_datetime(self.merged_df['endTime'])

        # Round the start time down to the nearest 15-minute interval
        self.merged_df['startInterval'] = self.merged_df['startTime'].dt.floor('15min')

        # Round the end time up to the nearest 15-minute interval
        self.merged_df['endInterval'] = self.merged_df['endTime'].dt.ceil('15min')

        # Extract relevant features
        self.merged_df['hour_of_day'] = self.merged_df['startTime'].dt.hour
        self.merged_df['day_of_week'] = self.merged_df['startTime'].dt.dayofweek

        # Calculate the trip duration in minutes, measured between the rounded
        # 15-minute interval boundaries (not the exact start and end times)
        self.merged_df['tripDurationMinutes'] = (
            (self.merged_df['endInterval'] - self.merged_df['startInterval']).dt.total_seconds() / 60
        ).astype(float)

        # Count the trips that start in each 15-minute interval (the demand target)
        demand_per_interval = self.merged_df.groupby('startInterval').size().reset_index(name='demand')

        # Merge the demand back to the original dataframe
        self.merged_df = self.merged_df.merge(demand_per_interval, on='startInterval', how='left')

    def split_data(self):
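        # Split data into training and testing sets.
        # Note: train_test_split shuffles rows by default, so this is a random
        # split, not a chronological one.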
        features = ['tripDurationMinutes', 'hour_of_day', 'day_of_week']
        X = self.merged_df[features]
        y = self.merged_df['demand']

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        return X_train, X_test, y_train, y_test

    def train_linear_model(self, X_train, X_test, y_train, y_test):
        # Train a linear regression model
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        reg_model = LinearRegression()
        reg_model.fit(X_train_scaled, y_train)

        # Make predictions
        linear_predictions = reg_model.predict(X_test_scaled)

        # Evaluate the model
        mae = mean_absolute_error(y_test, linear_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, linear_predictions))
        r2 = r2_score(y_test, linear_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def train_xgb_model(self, X_train, X_test, y_train, y_test):
        # Train an XGBoost model
        xgb_model = xgb.XGBRegressor()
        xgb_model.fit(X_train, y_train)

        # Make predictions
        xgb_predictions = xgb_model.predict(X_test)

        # Evaluate the model
        mae = mean_absolute_error(y_test, xgb_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, xgb_predictions))
        r2 = r2_score(y_test, xgb_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def train_arima_model(self, y_train, y_test):
        # Train an ARIMA(5, 1, 0) model on the target alone (no exogenous features).
        # Caveat: split_data() shuffles rows, so y_train is not in time order here;
        # ARIMA assumes an ordered series, so read its scores with caution.
        arima_model = ARIMA(y_train, order=(5, 1, 0))
        arima_results = arima_model.fit()

        # Forecast one value per test row; forecasts from statsmodels.tsa.arima.model
        # are already on the original (undifferenced) scale
        arima_predictions = arima_results.forecast(steps=len(y_test))

        # Evaluate the model
        mae = mean_absolute_error(y_test, arima_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, arima_predictions))
        r2 = r2_score(y_test, arima_predictions)
        # With no predictors (p = 0), adjusted R-squared equals R-squared
        adj_r2 = r2

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def train_dnn_model(self, X_train, X_test, y_train, y_test):
        # Train a Deep Neural Network (DNN) model
        model = keras.Sequential([
            layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
            layers.Dense(64, activation='relu'),
            layers.Dense(1)
        ])
        model.compile(optimizer='adam', loss='mean_squared_error')
        model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

        # Make predictions
        dnn_predictions = model.predict(X_test).flatten()

        # Evaluate the model
        mae = mean_absolute_error(y_test, dnn_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, dnn_predictions))
        r2 = r2_score(y_test, dnn_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def train_random_forest_model(self, X_train, X_test, y_train, y_test):
        # Train a Random Forest model
        rf_model = RandomForestRegressor()
        rf_model.fit(X_train, y_train)

        # Make predictions
        rf_predictions = rf_model.predict(X_test)

        # Evaluate the model
        mae = mean_absolute_error(y_test, rf_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
        r2 = r2_score(y_test, rf_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def train_svr_model(self, X_train, X_test, y_train, y_test):
        # Train a Support Vector Regression (SVR) model
        svr_model = SVR()
        svr_model.fit(X_train, y_train)

        # Make predictions
        svr_predictions = svr_model.predict(X_test)

        # Evaluate the model
        mae = mean_absolute_error(y_test, svr_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, svr_predictions))
        r2 = r2_score(y_test, svr_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

    def reshape_for_lstm(self, X):
        # Reshape the input data for LSTM
        return X.values.reshape((X.shape[0], 1, X.shape[1]))

    def train_lstm_model(self, X_train, X_test, y_train, y_test):
        # Reshape input data for LSTM
        X_train_lstm = self.reshape_for_lstm(X_train)
        X_test_lstm = self.reshape_for_lstm(X_test)

        # Build LSTM model
        lstm_model = Sequential()
        lstm_model.add(LSTM(50, input_shape=(X_train_lstm.shape[1], X_train_lstm.shape[2])))
        lstm_model.add(Dense(1))
        lstm_model.compile(optimizer='adam', loss='mse')

        # Train the model
        lstm_model.fit(X_train_lstm, y_train, epochs=10, batch_size=32, verbose=0)

        # Make predictions
        lstm_predictions = lstm_model.predict(X_test_lstm)

        # Reshape predictions to 1D array
        lstm_predictions = lstm_predictions.reshape(lstm_predictions.shape[0])

        # Evaluate the model
        mae = mean_absolute_error(y_test, lstm_predictions)
        rmse = np.sqrt(mean_squared_error(y_test, lstm_predictions))
        r2 = r2_score(y_test, lstm_predictions)
        adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

        return {'MAE': mae, 'RMSE': rmse, 'R-squared': r2, 'Adjusted R-squared': adj_r2}

# Example usage
folder_path = "C:\\Users\\venka\\Downloads\\Meiro_Mobility_Assessment_Jan2022Jul2023"
taxi_demand_processor = TaxiDemandDataProcessor(folder_path)
X_train, X_test, y_train, y_test = taxi_demand_processor.split_data()

# Train and evaluate models
linear_metrics = taxi_demand_processor.train_linear_model(X_train, X_test, y_train, y_test)
xgb_metrics = taxi_demand_processor.train_xgb_model(X_train, X_test, y_train, y_test)
arima_metrics = taxi_demand_processor.train_arima_model(y_train, y_test)
dnn_metrics = taxi_demand_processor.train_dnn_model(X_train, X_test, y_train, y_test)
rf_metrics = taxi_demand_processor.train_random_forest_model(X_train, X_test, y_train, y_test)
svr_metrics = taxi_demand_processor.train_svr_model(X_train, X_test, y_train, y_test)
lstm_metrics = taxi_demand_processor.train_lstm_model(X_train, X_test, y_train, y_test)

# Create metrics DataFrame (one row per model)
all_metrics = {
    'Linear Regression': linear_metrics,
    'XGBoost': xgb_metrics,
    'ARIMA': arima_metrics,
    'DNN': dnn_metrics,
    'Random Forest': rf_metrics,
    'SVR': svr_metrics,
    'LSTM': lstm_metrics,
}
metrics_df = pd.DataFrame(
    [{'Model': name, **metrics} for name, metrics in all_metrics.items()]
)







In [4]:
metrics_df

| | Model | MAE | RMSE | R-squared | Adjusted R-squared |
|---|---|---|---|---|---|
| 0 | Linear Regression | 7.597536 | 10.994138 | 0.183756 | 0.183632 |
| 1 | XGBoost | 5.269429 | 8.897213 | 0.465429 | 0.465348 |
| 2 | ARIMA | 10.216816 | 12.744965 | -0.09692 | -0.09692 |
| 3 | DNN | 5.784177 | 9.331762 | 0.411935 | 0.411846 |
| 4 | Random Forest | 5.269207 | 8.897078 | 0.465445 | 0.465364 |
| 5 | SVR | 5.883358 | 10.454428 | 0.261929 | 0.261817 |
| 6 | LSTM | 6.650818 | 9.844253 | 0.34557 | 0.345471 |


<h1 style="color:blue; font-weight:bold;"><i>Inference:</i></h1>

The choice of the best model depends on the specific requirements of your use case and the importance you place on each metric. Let's analyze the metrics:

#### I) Analysis of metrics:

1 MAE (Mean Absolute Error):

* Lower values are better.

* Random Forest has the lowest MAE (5.2692), with XGBoost effectively tied (5.2694).

2 RMSE (Root Mean Squared Error):

* Lower values are better.

* Random Forest (8.8971) and XGBoost (8.8972) have the lowest RMSE and are effectively tied.

3 R-squared:

* Higher values are better, indicating better explanatory power.

* Random Forest has the highest R-squared (0.46545), marginally ahead of XGBoost (0.46543), followed by DNN (0.412).

4 Adjusted R-squared:

* Similar to R-squared but penalized for the number of predictors. With only three features, it is nearly identical to R-squared for every model here.

* Random Forest and XGBoost have the highest Adjusted R-squared (both about 0.465).
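
The comparison above can be reproduced by ranking the models on each metric. The sketch below copies the values from the `metrics_df` output; the `ranks` table and its `mean_rank` column are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

# Values copied from the metrics_df output above.
reported = pd.DataFrame(
    {
        "Model": ["Linear Regression", "XGBoost", "ARIMA", "DNN",
                  "Random Forest", "SVR", "LSTM"],
        "MAE": [7.597536, 5.269429, 10.216816, 5.784177,
                5.269207, 5.883358, 6.650818],
        "RMSE": [10.994138, 8.897213, 12.744965, 9.331762,
                 8.897078, 10.454428, 9.844253],
        "R-squared": [0.183756, 0.465429, -0.09692, 0.411935,
                      0.465445, 0.261929, 0.34557],
    }
).set_index("Model")

# Rank 1 is best: ascending for the error metrics, descending for R-squared.
ranks = pd.DataFrame(
    {
        "MAE rank": reported["MAE"].rank(method="min"),
        "RMSE rank": reported["RMSE"].rank(method="min"),
        "R2 rank": reported["R-squared"].rank(ascending=False, method="min"),
    }
).astype(int)

# mean_rank is a hypothetical summary column, added for illustration only.
ranks["mean_rank"] = ranks.mean(axis=1)
print(ranks.sort_values("mean_rank"))
```

Ranking hides how small some gaps are: Random Forest and XGBoost differ only in the fourth decimal place, so the ordering between them should not be treated as meaningful without repeated runs.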

#### II) Recommendations:

1 XGBoost and Random Forest:

* XGBoost and Random Forest are the top two models on every metric, although an R-squared of about 0.47 means roughly half of the variance in demand remains unexplained.

* Their scores are nearly identical, so choose between them on practical grounds such as training time, ease of tuning, and interpretability.

2 DNN (Deep Neural Network):

* DNN ranks third on every metric (MAE 5.78, RMSE 9.33, R-squared 0.41).

* If interpretability is less critical and computational resources are available, DNN could be a good choice.

3 LSTM:

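* LSTM performs reasonably (MAE 6.65, RMSE 9.84, R-squared 0.35) but does not outperform XGBoost or Random Forest here. Note that each input is reshaped to a sequence of length 1, so the LSTM sees no history of past intervals.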

4 Linear Regression and ARIMA:

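* Linear Regression (R-squared 0.18) and ARIMA (R-squared -0.10) perform worse than the ensemble methods (XGBoost, Random Forest) and DNN. A negative R-squared means ARIMA does worse than a constant prediction at the mean of the test demand.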
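
One caveat on the ARIMA result: `split_data` relies on `train_test_split`, which shuffles rows by default, so the series passed to ARIMA is not in time order. A minimal sketch of a time-ordered alternative follows, assuming synthetic trip start times in place of the real dataset; the names `trips`, `demand`, `train`, and `test` are illustrative. It aggregates trips into one row per 15-minute interval and splits chronologically.

```python
import numpy as np
import pandas as pd

# Synthetic trip start times, for illustration only (not the real dataset).
rng = np.random.default_rng(0)
start = pd.Timestamp("2022-01-01")
offsets = rng.integers(0, 7 * 24 * 60, size=5000)  # minutes within one week
trips = pd.DataFrame({"startTime": start + pd.to_timedelta(offsets, unit="min")})

# One row per 15-minute interval, kept as a time-indexed series;
# intervals with no trips get a demand of 0.
demand = (
    trips.set_index("startTime")
    .sort_index()
    .resample("15min")
    .size()
    .rename("demand")
)

# Chronological 80/20 split: train on the past, test on the most recent part.
cutoff = int(len(demand) * 0.8)
train, test = demand.iloc[:cutoff], demand.iloc[cutoff:]

print(len(train), len(test))
```

A series built this way has a regular 15-minute index, which is the form ARIMA expects, and the test block lies strictly after the training block.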

In conclusion, based on these metrics, XGBoost and Random Forest are the strongest of the seven models tested.

Consider further evaluation, such as cross-validation, hyperparameter tuning, and possibly an ensemble approach, to make a final decision based on your specific requirements and the characteristics of your dataset.
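
For the cross-validation step, a time-aware splitter is one reasonable option for temporal data. The sketch below uses scikit-learn's `TimeSeriesSplit` with a Random Forest on synthetic, time-ordered data; the features and target are stand-ins invented for illustration, and the fold count and model settings are assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in for interval-level data (illustration only).
rng = np.random.default_rng(42)
n = 600
hour_of_day = (np.arange(n) // 4) % 24   # four 15-minute intervals per hour
day_of_week = (np.arange(n) // 96) % 7   # 96 intervals per day
X = np.column_stack([hour_of_day, day_of_week])
y = 20 + 10 * np.sin(2 * np.pi * hour_of_day / 24) + rng.normal(0, 2, size=n)

# Each fold trains on an initial segment and validates on the block after it.
cv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

# scikit-learn reports negated errors, so flip the sign to get MAE per fold.
mae_per_fold = -scores
print(mae_per_fold.round(2))
```

Comparing the per-fold MAE values gives a sense of how stable a model is over time, which a single random 80/20 split cannot show.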