## Using Random Forest Regression to Create an Interactive Prediction Model

Next, we build a second interactive prediction model, this time using random forest regression, following the same approach as the linear regression model. Random forests are designed to capture more complex, non-linear patterns. The model performed moderately well, with an RMSE of 7.71 and an R² score of 0.65, but it did not outperform Linear Regression.

In [44]:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Step 1: Load and preprocess data
file_path = 'data/new_fare_prediction_model_data.csv'  
data = pd.read_csv(file_path)

# Preprocessing
data['tpep_pickup_datetime'] = pd.to_datetime(data['tpep_pickup_datetime'])
data['hour_of_day'] = data['tpep_pickup_datetime'].dt.hour  # Extract hour of the day
data['day_of_week'] = data['tpep_pickup_datetime'].dt.dayofweek  # Extract day of the week (0=Monday, 6=Sunday)

features = ['trip_distance', 'passenger_count', 'hour_of_day', 'day_of_week']
target = 'fare_amount'

data = data.dropna(subset=features + [target])

X = data[features]
y = data[target]

# Step 2: Train the RandomForest model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate the model
y_pred = rf_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Step 3: Create Dash App
app = dash.Dash(__name__, external_stylesheets=['https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css'])

app.layout = html.Div([
    html.H1("Taxi Fare Prediction (Random Forest Model)", style={
        "textAlign": "center", 
        "color": "#4a4a4a", 
        "fontFamily": "'Comic Sans MS', cursive", 
        "fontWeight": "bold",
        "fontSize": "24px",  # Reduced font size
        "marginBottom": "10px"
    }),

    html.Div([
        html.Label("🚗 Trip Distance (miles):", style={
            "marginTop": "10px", "fontFamily": "'Comic Sans MS', cursive", "fontWeight": "bold"
        }),
        dcc.Input(
            id='trip_distance', 
            type='number', 
            value=1.0,  # Default value with a decimal
            min=0.1,  # Minimum value allowed
            step=0.1,  # Allows decimal values
            style={
                "marginBottom": "15px", 
                "width": "90%", 
                "borderRadius": "5px", 
                "padding": "5px"
            }
        ),

        html.Label("👥 Passenger Count:", style={
            "marginTop": "10px", "fontFamily": "'Comic Sans MS', cursive", "fontWeight": "bold"
        }),
        dcc.Input(
            id='passenger_count', 
            type='number', 
            value=1, 
            min=1, 
            max=6, 
            step=1, 
            style={
                "marginBottom": "15px", 
                "width": "90%", 
                "borderRadius": "5px", 
                "padding": "5px"
            }
        ),

        html.Label("⏰ Hour of Day (0-23):", style={
            "marginTop": "10px", "fontFamily": "'Comic Sans MS', cursive", "fontWeight": "bold"
        }),
        dcc.Slider(
            id='hour_of_day', 
            min=0, 
            max=23, 
            step=1, 
            value=12, 
            marks={i: str(i) for i in range(0, 24)}, 
            tooltip={"placement": "bottom"}
        ),

        html.Label("📅 Day of the Week:", style={
            "marginTop": "20px", "fontFamily": "'Comic Sans MS', cursive", "fontWeight": "bold"
        }),
        dcc.Dropdown(
            id='day_of_week',
            options=[
                {"label": "Monday", "value": 0},
                {"label": "Tuesday", "value": 1},
                {"label": "Wednesday", "value": 2},
                {"label": "Thursday", "value": 3},
                {"label": "Friday", "value": 4},
                {"label": "Saturday", "value": 5},
                {"label": "Sunday", "value": 6},
            ],
            value=0,
            style={"marginBottom": "15px", "borderRadius": "5px", "padding": "5px"}
        ),
    ], style={
        'margin': '15px',
        'padding': '15px',
        'backgroundColor': '#fffacd',  # Light yellow background
        'borderRadius': '10px',
        'boxShadow': '0 2px 4px rgba(0, 0, 0, 0.2)'  # Smaller shadow for compact design
    }),

    html.Button("Predict Fare", id='predict-button', n_clicks=0, className="btn btn-primary", style={
        "marginTop": "15px",
        "backgroundColor": "#ffa500",  # Orange button
        "border": "none",
        "fontFamily": "'Comic Sans MS', cursive",
        "padding": "5px 15px"
    }),

    html.Div(id='prediction-output', style={
        'marginTop': '15px',
        'fontSize': '18px',  # Reduced font size
        'fontWeight': 'bold',
        'textAlign': 'center',
        'color': '#4a4a4a',
        "fontFamily": "'Comic Sans MS', cursive"
    }),
], style={"backgroundColor": "#fffacd", "padding": "20px"})  # Smaller padding for compactness

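# Step 4: Define Callback
# The button is the only Input, so the prediction refreshes on click only.
# The form fields are read as State (dash.State is exported by Dash 2.x).
@app.callback(
    Output('prediction-output', 'children'),
    Input('predict-button', 'n_clicks'),
    [dash.State('trip_distance', 'value'),
     dash.State('passenger_count', 'value'),
     dash.State('hour_of_day', 'value'),
     dash.State('day_of_week', 'value')]
)
def predict_fare(n_clicks, trip_distance, passenger_count, hour_of_day, day_of_week):
    if not n_clicks:
        return ""

    # A cleared field arrives as None; skip the prediction instead of crashing
    values = [trip_distance, passenger_count, hour_of_day, day_of_week]
    if any(v is None for v in values):
        return "Please fill in all fields."

    # Column names and order match the `features` list used for training
    input_data = pd.DataFrame([{
        'trip_distance': trip_distance,
        'passenger_count': passenger_count,
        'hour_of_day': hour_of_day,
        'day_of_week': day_of_week
    }])

    prediction = rf_model.predict(input_data)[0]
    return f"Predicted Fare: ${prediction:.2f}"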

# Step 5: Run the App
if __name__ == '__main__':
    app.run(debug=True)  # app.run_server is deprecated in recent Dash releases


In [52]:
# Evaluate the model separately
print("Random Forest Model Performance:")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")


Random Forest Model Performance:
RMSE: 7.71
R² Score: 0.65
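
For reference, both metrics printed above can be computed by hand. The following is a minimal sketch on made-up fares, not the project's data; the helper names `rmse` and `r2` are illustrative and simply mirror what `mean_squared_error` and `r2_score` report.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up fares in dollars, for illustration only
actual_fares = [10.0, 20.0, 30.0, 40.0]
predicted_fares = [12.0, 18.0, 33.0, 37.0]

example_rmse = rmse(actual_fares, predicted_fares)
example_r2 = r2(actual_fares, predicted_fares)
print(f"RMSE: {example_rmse:.2f}")    # RMSE: 2.55
print(f"R² Score: {example_r2:.2f}")  # R² Score: 0.95
```

Because RMSE is expressed in the units of `fare_amount`, it can be read directly as a typical prediction error in dollars, while R² is unitless.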


## Model Comparison and Evaluation

### 1. Linear Regression Model
- **RMSE**: 6.98  
- **R² Score**: 0.71  
- **Performance**:  
  - Captures 71% of the variability in fare amounts.  
  - Its RMSE is lower than the Random Forest's (6.98 vs. 7.71), indicating better predictive accuracy.  
  - A simple and interpretable model that fits the dataset well.  

### 2. Random Forest Model
- **RMSE**: 7.71  
- **R² Score**: 0.65  
- **Performance**:  
  - Explains 65% of the variability in fare amounts.  
  - Higher RMSE suggests more error in predictions compared to Linear Regression.  
  - Can capture non-linear relationships, but may need feature engineering or hyperparameter tuning to benefit from that here.
    
### 3. Evaluation
- **Linear Regression** performs better in terms of both RMSE and R², suggesting it fits the data more effectively.  
- **Random Forest**, while slightly less accurate, may have potential with hyperparameter tuning and additional features (e.g., location data).
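
A comparison like this is only fair when both models are scored on the same held-out rows. The sketch below shows that pattern on synthetic data (the arrays `X_syn` and `y_syn` are invented for illustration and stand in for the taxi features), so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: fare grows linearly with distance, plus noise
rng = np.random.default_rng(7)
n = 400
X_syn = np.column_stack([
    rng.uniform(0.1, 20.0, n),  # plays the role of trip_distance
    rng.integers(1, 7, n),      # plays the role of passenger_count
])
y_syn = 3.0 + 2.5 * X_syn[:, 0] + rng.normal(0.0, 2.0, n)

# One split, shared by both models
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "r2": float(r2_score(y_te, pred)),
    }
    print(f"{name}: RMSE={scores[name]['rmse']:.2f}, R²={scores[name]['r2']:.2f}")
```

Keeping `random_state` and `test_size` identical across models, as the notebook does, is what makes the RMSE and R² values directly comparable.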

In short, the **Linear Regression model** is more effective for the current dataset because it is both more accurate and simpler. However, the **Random Forest model** could improve with further optimization, which would make it a robust choice for datasets with more complexity or non-linear patterns.
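
One way the optimization mentioned here might be approached is a cross-validated grid search over a few Random Forest settings. This is a sketch under assumptions, not the project's tuning procedure: the data is synthetic, and the grid values are arbitrary starting points rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the four features used in the notebook
rng = np.random.default_rng(42)
n = 300
X_demo = np.column_stack([
    rng.uniform(0.1, 20.0, n),  # trip_distance
    rng.integers(1, 7, n),      # passenger_count
    rng.integers(0, 24, n),     # hour_of_day
    rng.integers(0, 7, n),      # day_of_week
])
y_demo = 3.0 + 2.5 * X_demo[:, 0] + rng.normal(0.0, 2.0, n)

# Small, arbitrary grid; widen it for a real search
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negated RMSE
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(best_params)
print(f"Cross-validated RMSE: {best_cv_rmse:.2f}")
```

On the real data, the search would be fitted on `X_train` and `y_train` only, with the test split kept aside for the final RMSE and R² check.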

## Project Summary

This project focuses on predicting NYC taxi fares to improve decision-making for pricing and fleet management. 

- **Goal**: Develop predictive models for fare estimation and insights.
- **Methods**: Built and compared Linear Regression and Random Forest models, supported by an interactive map.
- **Insights**:
  - Trip distance is the most significant predictor of fare.
  - The map reveals high-demand zones like Midtown Manhattan and the Financial District for better fleet optimization.
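
A claim about which feature matters most can be checked against a fitted forest through its `feature_importances_` attribute. The sketch below only demonstrates the mechanics: the data is synthetic and deliberately built so that fare depends on distance, so it does not by itself confirm the finding on the taxi data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data using the notebook's feature names; values are invented
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "trip_distance": rng.uniform(0.1, 20.0, n),
    "passenger_count": rng.integers(1, 7, n),
    "hour_of_day": rng.integers(0, 24, n),
    "day_of_week": rng.integers(0, 7, n),
})
# Assumption for illustration: fare is driven by distance plus noise
demo_fare = 3.0 + 2.5 * demo["trip_distance"] + rng.normal(0.0, 2.0, n)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(demo, demo_fare)

# Impurity-based importances sum to 1; larger means more splits' worth of error reduction
importances = pd.Series(
    forest.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines applied to `rf_model` and `features` would rank the real predictors.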
