# Business Understanding

##### **Overview**

Yassir is the leading super app in the Maghreb region, set on changing the way daily services are provided. It currently operates in 45 cities across Algeria, Morocco and Tunisia, with recent expansions into France, Canada and Sub-Saharan Africa. It is backed by roughly $200M in funding from VCs in Silicon Valley, Europe and other parts of the world. It offers on-demand services such as ride-hailing and last-mile delivery.

##### **Project Scenario**

Ride-hailing apps like Uber and Yassir depend heavily on real-time data and machine learning algorithms to automate and optimize their services. Accurate prediction of the Estimated Time of Arrival (ETA) is crucial for enhancing the reliability and attractiveness of Yassir's services. This prediction will have significant direct and indirect impacts on both customers and business partners. Improving ETA predictions will not only make Yassir's services more dependable but also allow the company to save money and allocate resources more effectively across other business areas.

##### **Problem Statement**

Yassir aims to optimize its service operations by accurately predicting the ETA for rides. The goal is to ensure that customers receive precise arrival times, improving their overall experience while allowing Yassir to manage resources more effectively and reduce operational costs.

##### **Objective**

The primary objective of this project is to develop machine learning models that accurately predict the ETA for a Yassir journey to enhance service reliability and customer satisfaction. By accurately forecasting the time it will take for a trip to reach its destination, Yassir can improve the customer experience, optimize operational efficiency, and better manage resource allocation. This will contribute to cost savings and more efficient use of resources, benefiting both customers and business partners.

##### **Key Stakeholders**

Stakeholders include Yassir's management team, operations and logistics teams, customer service department, and data science team.

##### **Analytical Goals**

1. **Data Preparation:**
   - Handle missing values in trip and weather datasets using imputation techniques such as mean, median, or mode.
   - Address outliers in trip data that may skew model predictions by applying robust statistical methods.
   - Normalize or scale numerical features (e.g., trip distance) to ensure uniformity and improve model performance.
   - Encode categorical variables (e.g., weather conditions) using one-hot encoding or similar techniques.

2. **Model Development:**
   - Train and evaluate various regression models such as linear regression, decision trees, random forests, and gradient boosting algorithms.
   - Incorporate time series analysis if applicable to capture temporal trends and seasonality.
   - Validate models using cross-validation techniques and assess performance metrics such as RMSE (Root Mean Squared Error).

3. **Feature Engineering:**
   - Extract relevant features from timestamps (e.g., time of day, day of week) and weather conditions to enrich the model.
   - Analyze feature importance to understand key factors affecting ETA predictions.

4. **Visualization and Reporting:**
   - Create visualizations and dashboards to present insights from the model and its predictions.
   - Develop a deployment strategy for integrating the ETA prediction model into Yassir’s operational systems.
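
The preparation, feature engineering and validation steps listed above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the project's final model: the generated `trips` frame, the choice of `RandomForestRegressor`, and the column lists are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic stand-in for the trip data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 300
trips = pd.DataFrame({
    "timestamp": pd.to_datetime("2019-11-20") + pd.to_timedelta(np.arange(n), unit="h"),
    "trip_distance": rng.uniform(500, 30_000, n),
})
trips["eta"] = trips["trip_distance"] / 12 + rng.normal(0, 60, n)
trips.loc[::25, "trip_distance"] = np.nan  # simulate missing values

# Feature engineering: derive calendar features from the timestamp
trips["hour"] = trips["timestamp"].dt.hour
trips["day_of_week"] = trips["timestamp"].dt.day_name()

numeric = ["trip_distance", "hour"]
categorical = ["day_of_week"]

# Data preparation: median imputation, outlier-robust scaling, one-hot encoding
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Rows are in time order, so each fold validates on trips later than its training data
scores = cross_val_score(
    model,
    trips[numeric + categorical],
    trips["eta"],
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -scores
print(rmse_per_fold.round(1))
```

Keeping the imputer, scaler and encoder inside the pipeline means they are refit on each training fold, so no information from the validation fold leaks into preprocessing.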

##### **Success Criteria**

1. Achieve a significant reduction in ETA prediction errors, with an RMSE below 180 seconds.
2. Develop a functional data app that embeds the best models and makes accurate ETA predictions.
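
To make the first criterion concrete, the sketch below computes RMSE by hand and compares it with the 180-second target. The ETA values are made up for illustration, and `rmse` and `RMSE_TARGET_SECONDS` are names introduced only for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


RMSE_TARGET_SECONDS = 180  # success threshold stated above

# Hypothetical ETAs in seconds
actual_eta = [600, 900, 1200, 1500]
predicted_eta = [660, 840, 1320, 1380]

score = rmse(actual_eta, predicted_eta)
print(f"RMSE = {score:.1f} s, meets target: {score < RMSE_TARGET_SECONDS}")
```

Because the errors are squared before averaging, the two 120-second misses weigh four times as much as the two 60-second misses.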

##### **Constraints and Assumptions**

- Assumption: Historical trip and weather data are representative of future conditions and trends.
- Constraint: Limited availability of real-time traffic data for model refinement and updates.

##### **Data Requirements**

- Utilize data from trip records and weather datasets for analysis.
- Include features such as trip ID, timestamp, origin and destination coordinates, trip distance, ETA, and weather conditions (temperature, rainfall, wind speed).

##### **Business Impact**

- **Enhanced Customer Experience:** More accurate ETA predictions will improve customer satisfaction and trust in Yassir’s services. When riders can rely on the service to provide timely pickups and drop-offs, they are more likely to use it again in the future.
- **Operational Efficiency:** Better predictions will optimize driver allocation and reduce operational costs. By accurately estimating travel times, the platform can match drivers with riders more efficiently, reducing idle time for drivers and minimizing wait times for riders.
- **Resource Allocation:** Improved resource management through accurate trip scheduling and reduced delays. Yassir can adjust the number of available drivers in different areas based on anticipated demand and traffic conditions, optimizing overall service coverage and availability.
- **Cost Savings:** Financial savings from reduced inefficiencies and optimized resource use.

##### **Analytical Business Questions**

1. What is the impact of trip distance on ETA accuracy?
   Investigating whether longer or shorter trips have more variance in ETA predictions can help refine the model.
2. How do different times of the day affect ETA?
   Analyzing time-based patterns (e.g., rush hours vs. non-rush hours) can help improve the predictive model.
3. How does weather impact the estimated time of arrival (ETA) for Yassir trips?
4. What are the peak hours for long trip durations and how can they be optimized?
5. Is there a significant difference in trip durations between weekdays and weekends?
6. How does the density of trips in a given area affect the performance of the ETA prediction?
7. How does the model's ETA prediction accuracy compare to industry benchmarks or competitors?
8. How does the time of day influence the demand for Yassir rides in different geographical areas?
9. Which origin and destination locations are most common (have the most rides)?
10. What are the longest and shortest trips, and how do their ETAs compare?
11. What is the percentage of trips with ETAs exceeding 30 minutes?
12. What is the average number of trips per hour?
   

#### **Hypothesis**

**Null Hypothesis:** There is no significant difference in ride demand between peak traffic hours and non-peak hours.

**Alternate Hypothesis:** The demand for Yassir rides is significantly higher during peak traffic hours compared to non-peak hours.
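
The comparison of demand between peak and non-peak hours can be sketched as a two-sample test on hourly trip counts. The example below uses `scipy.stats.mannwhitneyu`, which makes no normality assumption. The Poisson-generated counts are hypothetical stand-ins, and which hours count as "peak" would have to be defined from the real data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Hypothetical hourly trip counts (not taken from the dataset)
peak_counts = rng.poisson(lam=220, size=60)       # hours labelled as peak
off_peak_counts = rng.poisson(lam=150, size=180)  # all other hours

# Two-sided test: is the distribution of hourly demand different between groups?
stat, p_value = mannwhitneyu(peak_counts, off_peak_counts, alternative="two-sided")

alpha = 0.05
decision = "reject" if p_value < alpha else "fail to reject"
print(f"U = {stat:.1f}, p = {p_value:.4g}: {decision} the null hypothesis of equal demand")
```

Passing `alternative="greater"` instead would test the one-sided claim that peak-hour demand is higher.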

#### Recommendations
1. Improve customer satisfaction scores related to ride accuracy.
2. Optimize driver allocation and resource management, leading to cost savings and operational efficiency.

# Data Understanding

`Dataset Overview:`

- ID: Unique identifier for each trip.
        
- Timestamp: Time when the trip started.
    
- Origin_lat, Origin_lon: Latitude and longitude of the trip's start location.

- Destination_lat, Destination_lon: Latitude and longitude of the trip's end location.

- Trip_distance: Distance in meters on the driving route.

- ETA: Estimated trip time in seconds.
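
Since `Trip_distance` is measured along the driving route, it should never be shorter than the straight-line distance between origin and destination. The sketch below checks this with the haversine formula for one training row (ID `000FLWA8`), using the coordinates as they are labelled in the file. The helper name `haversine_m` and the Earth radius of 6,371 km are assumptions of the example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One row of the training data: origin, destination and route distance in meters
origin = (3.258, 36.777)
destination = (3.003, 36.718)
route_distance_m = 39_627

straight_line_m = haversine_m(*origin, *destination)
print(f"Straight line: {straight_line_m:,.0f} m, route: {route_distance_m:,} m")
print("Route is at least as long as the straight line:", route_distance_m >= straight_line_m)
```

Rows where the route distance falls below the straight-line distance would be candidates for closer inspection during cleaning.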

In [None]:
# !pip install --quiet plotly_calplot

In [496]:
import requests
import requests_cache
from pathlib import Path

# Data Wrangling
import numpy as np
import pandas as pd


# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly_calplot import calplot
from plotly.subplots import make_subplots

# Geo
from geopy.geocoders import Nominatim

# PCA
from sklearn.decomposition import PCA

# Stats test and Normality
from scipy.stats import ttest_ind, shapiro, mannwhitneyu

# Modelling
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.metrics import mean_squared_error, root_mean_squared_error

from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, AdaBoostRegressor


# Save model
import joblib

# Set pandas to display all columns
pd.set_option("display.max_columns", None)

# High precision longitudes and Latitudes
pd.set_option('display.float_format', '{:.16f}'.format)


# Install persistent cache
requests_cache.install_cache('yassir_requests_cache', expire_after=60*60*24*7)  # Cache expires after 1 week

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

print("🛬 Imported all packages.", "Warnings hidden. 👻")



In [458]:
DATA_URL = "https://raw.githubusercontent.com/valiantezabuku/Yassir-ETA-Prediction-Challenge-For-Azubian-Team-Curium/main/Data/"
TEST_FILE = DATA_URL + "Test.csv"
TRAIN_FILE = DATA_URL + "Train.csv"
WEATHER_FILE = DATA_URL + "Weather.csv"

In [459]:
# Date columns to parse
parse_dates = ['Timestamp']

precision_dtypes = {
    'Origin_lat': 'float64',
    'Origin_lon': 'float64',
    'Destination_lat': 'float64',
    'Destination_lon': 'float64',
}

# Load CSV files into the Notebook
weather_df = pd.read_csv(WEATHER_FILE)

test_df = pd.read_csv(TEST_FILE, parse_dates=parse_dates, dtype=precision_dtypes)

train_df = pd.read_csv(TRAIN_FILE, parse_dates=parse_dates, dtype=precision_dtypes)

### Exploratory Data Analysis

In [None]:
test_df.head()

In [None]:
train_df.head()

In [None]:
weather_df.head()

In [None]:
train_df.info()

In [None]:
#Checking for missing Values
train_df.isna().sum()

In [469]:
# Checking the descriptive statistics of the dataset
pd.reset_option("display.float_format")

train_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
ID,83924.0,83924.0,000FLWA8,1.0,,,,,,,
Timestamp,83924.0,,,,2019-12-04 14:22:20.568883712+00:00,2019-11-19 23:00:08+00:00,2019-11-27 01:53:00.500000+00:00,2019-12-04 01:46:50.500000+00:00,2019-12-11 21:36:44+00:00,2019-12-19 23:59:29+00:00,
Origin_lat,83924.0,,,,3.052406,2.807,2.994,3.046,3.095,3.381,0.096388
Origin_lon,83924.0,,,,36.739358,36.589,36.721,36.742,36.76,36.82,0.032074
Destination_lat,83924.0,,,,3.056962,2.807,2.995,3.049,3.109,3.381,0.10071
Destination_lon,83924.0,,,,36.737732,36.596,36.718,36.742,36.76,36.819,0.032781
Trip_distance,83924.0,,,,13527.82141,1.0,6108.0,11731.5,19369.0,62028.0,9296.716006
ETA,83924.0,,,,1111.697762,1.0,701.0,1054.0,1456.0,5238.0,563.565486


In [474]:
# Restore High precision
pd.set_option('display.float_format', '{:.16f}'.format)

### Data Cleaning

- Standardize column names: convert them to lowercase so they follow snake case.

In [478]:
train_df.columns = [col.lower() for col in train_df.columns] # Train

test_df.columns = [col.lower() for col in test_df.columns] # Test

In [479]:
train_df.head()

Unnamed: 0,id,timestamp,origin_lat,origin_lon,destination_lat,destination_lon,trip_distance,eta
0,000FLWA8,2019-12-04 20:01:50+00:00,3.258,36.777,3.003,36.718,39627,2784
1,000RGOAM,2019-12-10 22:37:09+00:00,3.087,36.707,3.0810000000000004,36.727,3918,576
2,001QSGIH,2019-11-23 20:36:10+00:00,3.144,36.739,3.088,36.742,7265,526
3,002ACV6R,2019-12-01 05:43:21+00:00,3.239,36.784,3.054,36.763000000000005,23350,3130
4,0039Y7A8,2019-12-17 20:30:20+00:00,2.912,36.707,3.207,36.698,36613,2138


- According to Wikipedia, `Algeria` lies mostly between latitudes 19° and 37°N (a small area is north of 37°N and south of 19°N), and longitudes 9°W and 12°E.

- This implies the latitudes and longitudes would have to be swapped to fall within `Algeria`'s boundaries. However, Wikipedia also notes that `Kenya`, the world's 47th-largest country (after Madagascar), lies between latitudes 5°N and 5°S and longitudes 34° and 42°E, so the coordinates as labelled fall within the `Kenya` region.
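
The range comparison above can be written as a small check. This is a rough rectangle test built from the approximate latitude and longitude ranges quoted above, not a test against real borders; `in_bounds` and the two dictionaries are names made up for the example, and the coordinates are the mean origin values from the descriptive statistics.

```python
def in_bounds(lat, lon, lat_range, lon_range):
    """Return True if (lat, lon) lies inside the rectangle given by the two ranges."""
    return lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]


# Approximate ranges in degrees (south and west are negative)
ALGERIA = {"lat_range": (19.0, 37.0), "lon_range": (-9.0, 12.0)}
KENYA = {"lat_range": (-5.0, 5.0), "lon_range": (34.0, 42.0)}

# Mean origin coordinates as labelled in the dataset
lat, lon = 3.052, 36.739

as_labelled_kenya = in_bounds(lat, lon, **KENYA)
as_labelled_algeria = in_bounds(lat, lon, **ALGERIA)
swapped_algeria = in_bounds(lon, lat, **ALGERIA)  # latitude and longitude exchanged

print("As labelled, inside Kenya ranges:  ", as_labelled_kenya)
print("As labelled, inside Algeria ranges:", as_labelled_algeria)
print("Swapped, inside Algeria ranges:    ", swapped_algeria)
```

A rectangle test like this can only rule locations out; confirming the country needs a lookup against real boundaries, such as reverse geocoding.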

## 2.0 Visualizations

- The geographical scope of the dataset- `Kenya`

In [337]:
train_df.columns

Index(['id', 'timestamp', 'origin_lat', 'origin_lon', 'destination_lat',
       'destination_lon', 'trip_distance', 'eta'],
      dtype='object')

In [342]:
data = (
    pd.concat(
        [
            train_df[['origin_lat', 'origin_lon', ]].rename(columns={'origin_lat': 'latitude', 'origin_lon': 'longitude'}), 
            train_df[['destination_lat', 'destination_lon']].rename(columns={'destination_lat': 'latitude', 'destination_lon': 'longitude'})
        ],
        ignore_index=True
    )
    .drop_duplicates()
)

data

Unnamed: 0,latitude,longitude
0,3.26,36.78
1,3.09,36.71
2,3.14,36.74
3,3.24,36.78
4,2.91,36.71
...,...,...
167477,3.01,36.66
167539,2.90,36.80
167585,3.06,36.69
167698,3.01,36.66


In [355]:
# Initialize the Nominatim Geocoder
geolocator = Nominatim(user_agent="yassirAPP")

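# Function to reverse geocode a coordinate pair to a country name
def reverse_geocode(lat, lon):
    location = geolocator.reverse((lat, lon), exactly_one=True)
    # Nominatim returns None when nothing is found at the coordinates
    if location is None:
        return ''
    address = location.raw.get('address', {})
    return address.get('country', '')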

# Apply reverse geocoding to min latitude and longitude pair and also the maximum in the DataFrame
# Find the minimum latitude and longitude
min_lat = data['latitude'].min()
min_lon = data['longitude'].min()
max_lat = data['latitude'].max()
max_lon = data['longitude'].max()
country_min = reverse_geocode(min_lat, min_lon)
country_max = reverse_geocode(max_lat, max_lon)

In [358]:
print(f"It is confirmed as {country_min == country_max}, that the dataset is within the boundaries of the country- {country_min}")

It is confirmed as True, that the dataset is within the boundaries of the country- Kenya


In [542]:
# Get the location for Kenya
location = geolocator.geocode(country_min, exactly_one=True)

# If the location is found
if location:
    # Get the bounding box for Kenya
    bounding_box = location.raw['boundingbox']
    print(f"Bounding Box: {bounding_box}")
    
    # Nominatim API URL with query parameters
    url = "https://nominatim.openstreetmap.org/search"

    # Parameters for the request
    params = {
        'q': country_min, # Kenya
        'format': 'json',
        'polygon_geojson': 1  # Request GeoJSON polygons in the response
    }

    # Headers for the request
    headers = {
        'User-Agent': 'yassirAPP'  
    }

    # Send the request to Nominatim with headers
    response = requests.get(url, params=params, headers=headers)

    # Check if the request was successful
    if response.status_code == 200:
        country_geojson = response.json()
        
        # Print the GeoJSON data
        if country_geojson:
            for feature in country_geojson:
                if 'geojson' in feature:
                    
                    print("GeoJSON data:", feature['geojson'])
                    break
            else:
                print("GeoJSON data is not available in the response.")
        else:
            print(f"No data returned for {country_min}.")
    else:
        print(f"Error: Unable to fetch GeoJSON data. Status code: {response.status_code}")

Bounding Box: ['-4.8995204', '4.6200000', '33.9096888', '41.9067502']
GeoJSON data: {'type': 'Polygon', 'coordinates': [[[33.9096888, 0.0997292], [33.9098379, 0.0989055], [33.9519748, -0.0328341], [33.9842508, -0.1337435], [33.9491227, -0.3397388], [33.9290116, -0.4576707], [33.9253978, -0.5605575], [33.9314175, -0.8074534], [33.9337441, -0.9028746], [33.9333611, -1], [34.0189167, -1], [34.0228934, -1.007032], [34.0257011, -1.0102669], [34.0281425, -1.0141731], [34.0285087, -1.0204597], [34.0283256, -1.0293709], [34.0285087, -1.0324837], [34.0293021, -1.0354744], [34.0313163, -1.0390755], [34.0341239, -1.0426765], [34.0375419, -1.045179], [34.0408988, -1.0461555], [34.0460868, -1.0458504], [34.0803275, -1.0218635], [34.0907035, -1.0273567], [34.0937157, -1.0289179], [34.0960595, -1.0303834], [34.1079382, -1.037082], [34.115911, -1.0415779], [34.1205184, -1.0445775], [34.1779006, -1.0768156], [34.1858002, -1.0812537], [34.2106181, -1.0951966], [34.212863, -1.096459], [34.2595285, -1.122

In [591]:
data['country'] = country_min
fig = px.scatter_geo(
    data, 
    locations='country',
    hover_name='country',
    geojson=country_geojson[0]['geojson'],
    fitbounds='geojson',
)

# Add longitude and latitude points
fig.add_scattergeo(
    lon=data['longitude'],
    lat=data['latitude'],
    mode='markers',
    name='Locations in dataset'
)

# Add annotation to the map
fig.add_annotation(
    text=f"{country_min}",  
    showarrow=False, 
    font=dict(size=16, color="black"),
    align="center"
)

fig.update_layout(
    title=f'Dataset locations in {country_min}', 
    geo_scope='africa'
)

fig.show()

## Key Insight
- All the locations in the dataset are in `Kenya`.


### 2.1 Numericals
#### 2.1.1 Univariate Analysis

In [None]:
# Define the target column
target = 'eta'

In [None]:
numericals = train_df.select_dtypes(include=['number']).columns.tolist()
numericals

In [None]:
# Visualize their distributions
for column in train_df[numericals].columns:
    fig1 = px.violin(train_df, x=column, box=True)

    fig2 = px.histogram(train_df, x=column)

    # Create a subplot layout with 1 row and 2 columns
    fig = make_subplots(rows=1, cols=2, subplot_titles=(f"Violin plot of the {column} column",
                                                    f"Distribution of the {column} column"))

    # Add traces from fig1 to the subplot
    for trace in fig1.data:
        fig.add_trace(trace, row=1, col=1)

    # Add traces from fig2 to the subplot
    for trace in fig2.data:
        fig.add_trace(trace, row=1, col=2)

    # Update layout
    fig.update_layout(title_text=f"Exploring the {column} feature",
                        showlegend=True,
                        legend_title_text=target
    )

    fig.show()

In [None]:
train_df

In [None]:
for column in numericals:
    # Visualize the distribution of each numerical column
    fig = px.violin(
        train_df,
        x=column,
        box=True,
        title=f"Distribution of the {column} column"
    )

    fig.show()

In [None]:
fig = px.box(
    train_df[['trip_distance', 'eta']],
    orientation='h',
    title='Distribution of Distance Features in the Dataset'
)

fig.show()

In [None]:
fig = px.box(
    train_df[['destination_lon', 'origin_lon']],
    orientation='h',
    title='Distribution of Longitude Features in the Dataset'
)

fig.show()

In [None]:
fig = px.box(
    train_df[['destination_lat', 'origin_lat']],
    orientation='h',
    title='Distribution of Latitude Features in the Dataset'
)

fig.show()

#### 2.1.2 Bivariate Analysis

In [None]:
# Relationship between Trip_distance and ETA
fig = px.scatter(
    train_df,
    x='trip_distance',
    y='eta',
    trendline='ols',
    trendline_color_override='red',
    title='Relationship between Trip Distance and ETA',
    labels={'eta': 'Eta (seconds)', 'trip_distance': 'Trip Distance (meters)'},
)


fig.show()

In [None]:
numeric_correlation_matrix = train_df[numericals].corr()

# Create heatmap trace
heatmap_trace = go.Heatmap(
    z=numeric_correlation_matrix.values,
    x=numeric_correlation_matrix.columns,
    y=numeric_correlation_matrix.index,
    colorbar=dict(title='Correlation coefficient'),
    texttemplate='%{z:.3f}',
)

# Create figure
fig = go.Figure(data=[heatmap_trace])

# Update layout
fig.update_layout(
    title='Correlation Matrix Heatmap (Numeric Features)',
)

# Show plot
fig.show()

### Key Insights

The correlation matrix provided reveals key relationships between geographic coordinates (origin and destination), trip distance, and estimated time of arrival (ETA). The analysis aims to uncover the strength and direction of these relationships, which can inform strategies for optimizing route planning and enhancing the accuracy of ETA predictions.

1. **Strong Positive Correlation Between Trip Distance and ETA (0.898):**
   - **Observation:** The most significant finding is the strong positive correlation between `trip_distance` and `eta` (0.898). This suggests that as the trip distance increases, the estimated time of arrival also increases, indicating that distance is a primary determinant of ETA.
   - **Implication:** This insight highlights the importance of accurate distance calculations in predicting ETAs. Any efforts to improve ETA predictions should prioritize refining distance measurements.

2. **Moderate Correlations with Geographic Coordinates:**
   - **Origin Latitude and Destination Latitude:**
     - `origin_lat` and `destination_lat` exhibit a moderate positive correlation (0.313), indicating that trips starting at higher latitudes tend to end at higher latitudes.
     - **Implication:** This points to spatial clustering of origins and destinations, possibly due to geographic or infrastructural factors, rather than, by itself, a direction of travel.
   - **Destination Longitude and Origin Longitude:**
     - `destination_lon` and `origin_lon` show a slight positive correlation (0.172), so the same pattern holds along the longitude axis, only more weakly.
     - **Implication:** While this correlation is weak, it suggests that origin longitude is a poorer guide to destination longitude than origin latitude is to destination latitude.

3. **Weak Correlations Across Other Variables:**
   - **Geographic Coordinates vs. Trip Distance and ETA:**
     - The correlations between individual geographic coordinates (latitude and longitude) and `trip_distance` or `eta` are generally weak, with the highest being `destination_lat` and `trip_distance` (0.093).
     - **Implication:** This suggests that the specific starting and ending points of a trip (in terms of latitude and longitude) have minimal direct impact on the distance or time required for the trip, possibly due to variations in route choices, traffic conditions, or other factors.

4. **Negative Correlations Observed:**
   - **Latitude and Longitude Interactions:**
     - There are weak negative correlations between `origin_lat` and `origin_lon` (-0.172), as well as between `destination_lat` and `destination_lon` (-0.214). This suggests some degree of geographic dispersion in the origins and destinations.
     - **Implication:** These negative correlations may indicate that as one coordinate increases, the other tends to decrease, pointing to a potential spread or diversity in trip start and end points across the region.
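
As a worked example of what the 0.898 correlation implies: in a simple linear regression with one predictor, the share of variance explained equals the squared correlation, and the residual spread is roughly the target's standard deviation times the square root of the unexplained share. The sketch below applies this to the figures reported above. It is an approximation because the inputs are rounded, and the variable names are introduced only for the example.

```python
import math

r = 0.898               # correlation between trip_distance and eta (from the heatmap)
eta_std = 563.565486    # standard deviation of eta in seconds (from describe())

r_squared = r ** 2                                       # variance explained by distance alone
baseline_rmse = eta_std * math.sqrt(1 - r_squared)       # approximate RMSE of a distance-only line

print(f"Variance explained by distance alone: {r_squared:.1%}")
print(f"Approximate RMSE of a distance-only linear fit: {baseline_rmse:.0f} s")
```

Under these assumptions, a straight line on distance alone would be expected to land around 248 seconds of RMSE, so reaching an RMSE below 180 seconds would need information beyond distance, such as time of day or location.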

### Strategic Recommendations

1. **Enhance ETA Prediction Models:**
   - Given the strong correlation between `trip_distance` and `eta`, improving distance measurement accuracy and integrating real-time traffic data could further refine ETA predictions.

2. **Leverage Geographic Insights for Route Optimization:**
   - The moderate correlations between latitudes and longitudes suggest potential patterns in travel direction. Leveraging this understanding could aid in optimizing routes and better managing traffic flow.

3. **Further Exploration of Geographic Variables:**
   - The weak correlations with individual geographic coordinates indicate that additional factors, such as road conditions, traffic signals, or driver behavior, might play a significant role. Investigating these factors could uncover additional opportunities for improving service efficiency.


#### 2.1.3 Multivariate Analysis


In [None]:
plot_data = train_df[['timestamp', 'eta']].set_index('timestamp')


plot_data = plot_data.resample('D')['eta'].sum().reset_index()

In [None]:
fig = calplot(
    plot_data,
    x='timestamp',
    y='eta',
    years_title=True,
    colorscale='YlGn',
    showscale=True,
    title='Total eta by calendar days, months, and years',
    total_height=400,
    start_month=11,
    end_month=12,
)

fig.show()

### Answering Analytical Business Questions

In [None]:
train_df.head(1)

QN. 1 What is the impact of trip distance on ETA accuracy? Investigating whether longer or shorter trips have more variance in ETA predictions can help refine the model.
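
One way to approach this question is to split trips into distance quartiles and compare the spread of ETA within each group. The sketch below does this on synthetic data in which the noise is made to grow with distance; the generated values and the bin labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000

# Synthetic trips: ETA grows with distance, and so does the noise (hypothetical)
distance = rng.uniform(500, 40_000, n)
noise_scale = 30 + distance / 200
eta = distance / 12 + rng.normal(0, 1, n) * noise_scale
trips = pd.DataFrame({"trip_distance": distance, "eta": eta})

# Quartile bins so that every group holds the same number of trips
trips["distance_bin"] = pd.qcut(
    trips["trip_distance"], q=4, labels=["short", "medium", "long", "very long"]
)

# Spread of ETA per bin; cv (std / mean) makes bins with different typical ETAs comparable
spread = trips.groupby("distance_bin", observed=True)["eta"].agg(["count", "mean", "std"])
spread["cv"] = spread["std"] / spread["mean"]
print(spread.round(2))
```

The within-bin standard deviation also contains the variation in distance inside each bin, so once a model exists, the same grouping applied to its residuals gives a cleaner picture of accuracy by trip length.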
    


QN. 2 How do different times of the day affect ETA? Analyzing time-based patterns (e.g., rush hours vs. non-rush hours) can help improve the predictive model.
    

QN. 3 How does weather impact the estimated time of arrival (ETA) for Yassir trips?

In [None]:
train_df

In [None]:
daily_eta_df.info()

In [None]:
# Copy so that adding columns does not trigger SettingWithCopyWarning
time_eta_df = train_df[['timestamp', 'eta']].copy()

# Calendar date of each trip as a timezone-naive datetime
time_eta_df['date'] = pd.to_datetime(time_eta_df['timestamp'].dt.date)

# time_eta_df['eta_hours'] = time_eta_df['eta'] / 3600

# Parse the weather dates so both merge keys share the same datetime dtype
# (assumes weather_df['date'] holds plain calendar dates)
weather_df['date'] = pd.to_datetime(weather_df['date'])

# Merge trip data with weather data
time_weather_eta_df = pd.merge(time_eta_df, weather_df, on='date')

# Drop unnecessary columns
time_weather_eta_df.drop(columns=['timestamp', 'date'], inplace=True)

# Show the merged DataFrame
time_weather_eta_df.head()

Pairplot of ETA vs Temperature features

In [None]:
# Select the columns to use
cols_to_interest = time_weather_eta_df.columns.tolist()

temp_features = [col for col in time_weather_eta_df.columns.to_list() if 'temperature' in col] + ['eta']

# Create the pair plot
fig = px.scatter_matrix(
    time_weather_eta_df,
    dimensions = temp_features,
)

# Update the layout for better visualization
fig.update_layout(
    title=f"Pair Plot of {', '.join(temp_features)} columns",
    width=1440,
    height=1440,
)

fig.show()

In [None]:
daily_eta_df = (
    time_eta_df
    .drop(columns=['timestamp'])
    .set_index('date')
    .resample('D')
    .median()
    .reset_index()    
)    

# Merge trip data with weather data
daily_weather_eta_df = (
    pd.merge(daily_eta_df, weather_df, left_on='date', right_on='date')
    .drop(columns=['date'])
)    

In [163]:
fig_titles

{'xaxis1': 'dewpoint_2m_temperature',
 'xaxis2': 'maximum_2m_air_temperature',
 'xaxis3': 'mean_2m_air_temperature',
 'xaxis4': 'mean_sea_level_pressure',
 'xaxis5': 'minimum_2m_air_temperature',
 'xaxis6': 'surface_pressure',
 'xaxis7': 'total_precipitation',
 'xaxis8': 'u_component_of_wind_10m',
 'xaxis9': 'v_component_of_wind_10m'}

In [169]:
# Create subplots
fig = make_subplots(rows=5, cols=2)

x_cols = [col for col in daily_weather_eta_df.columns if col != target]

x_titles = {}
y_titles = {}

for i, col in enumerate(x_cols):
    r = int(np.ceil((i+1)/2))
    c = 2 if ((i+1)%2==0) else 1
    fig.add_scatter(x=daily_weather_eta_df[col], y=daily_weather_eta_df[target], mode='markers', name=f'{col}', row=r, col=c)
    x_titles[f"xaxis{i+1}_title"]=col
    y_titles[f"yaxis{i+1}_title"]='Eta'

# Update layout
fig.update_layout(
    title_text='Relationship between weather features and Eta in seconds',
    showlegend=True,
    height=1000,
    **x_titles,
    **y_titles
)

# Show the figure
fig.show()

Pairplot of ETA vs other Weather features

In [None]:
other_features = [col for col in time_weather_eta_df.columns.to_list() if 'temperature' not in col]

# Create the pair plot
fig = px.scatter_matrix(
    time_weather_eta_df,
    dimensions = other_features,

)

# Update the layout for better visualization
fig.update_layout(
    title=f"Pair Plot of {', '.join(other_features)} columns",
    width=1440,
    height=1440,
)

fig.show()

QN. 4 What are the peak hours for long trip durations and how can they be optimized?
    

In [None]:
# Copy so that adding columns does not trigger SettingWithCopyWarning
trip_duration = train_df[['timestamp', 'eta']].copy()

trip_duration['eta_hours'] = trip_duration['eta']/3600

trip_duration['hour'] = trip_duration['timestamp'].dt.hour

# Calculate average ETA by hour
average_eta_by_hour = trip_duration.groupby('hour')['eta_hours'].mean().reset_index()

#Calculate the count of ETA by hour
count_eta_by_hour = trip_duration.groupby('hour')['eta_hours'].count().reset_index()

# Plot count ETA by hour
fig = px.line(count_eta_by_hour, x='hour', y='eta_hours', title='Count ETA by Hour of the Day')

fig.show()


# Plot average ETA by hour
fig = px.line(average_eta_by_hour, x='hour', y='eta_hours', title='Average ETA by Hour of the Day')

fig.show()

#### Observations


QN. 5 Is there a significant difference in trip durations between weekdays and weekends?
    

In [None]:
# Copy so that adding columns does not trigger SettingWithCopyWarning
day_eta_df = train_df[['timestamp', 'eta']].copy()

day_eta_df['eta_hours'] = day_eta_df['eta']/3600

day_eta_df['day_of_week'] = day_eta_df['timestamp'].dt.day_name()

category_orders={"day_of_week": ["Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]}

# Boxplot of ETA by Day of the Week
fig = px.box(day_eta_df, x='day_of_week', y='eta_hours',  title = 'ETA by Day of the Week', category_orders = category_orders)
fig.show()

# Calculate average ETA by day of the week, ordered Saturday to Friday
# (groupby alone would sort the day names alphabetically)
average_eta_by_day = (
    day_eta_df.groupby('day_of_week')['eta_hours']
    .mean()
    .reindex(category_orders['day_of_week'])
    .rename_axis('day_of_week')
    .reset_index()
)

# Create a Plotly line chart
fig = px.line(average_eta_by_day, x='day_of_week', y='eta_hours', title='Average ETA by Day of the Week')

# Label the axes
fig.update_layout(xaxis_title='Day of the Week', yaxis_title='Average ETA (hours)')

# Show the plot
fig.show()


# Define weekdays (Sunday to Thursday) and weekends (Friday and Saturday)
weekends_list = ['Friday','Saturday']

mask = day_eta_df['day_of_week'].isin(weekends_list)

weekdays = day_eta_df[~mask]['eta_hours']
weekends = day_eta_df[mask]['eta_hours']

# Perform t-test
t_stat, p_value = ttest_ind(weekdays, weekends)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in trip durations between weekdays and weekends.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in trip durations between weekdays and weekends.")

QN. 6 How does the density of trips in a given area affect the performance of the ETA prediction?
    

In [None]:
# Define a function to snap coordinates to a grid with cells of grid_size degrees
def categorize_area(lat, lon, grid_size=0.1):
    # The outer round() drops floating-point noise such as 3.0000000000000004
    return (round(round(lat / grid_size) * grid_size, 4), round(round(lon / grid_size) * grid_size, 4))

# Create a new dataframe to be used (copy to avoid SettingWithCopyWarning)
trip_density_df = train_df[['origin_lat','origin_lon','destination_lat','destination_lon','eta','trip_distance']].copy()

# Apply the function to create the 'origin_area' and 'destination_area' columns
trip_density_df['origin_area'] = trip_density_df.apply(lambda row: categorize_area(row['origin_lat'], row['origin_lon']), axis=1)
trip_density_df['destination_area'] = trip_density_df.apply(lambda row: categorize_area(row['destination_lat'], row['destination_lon']), axis=1)
trip_density_df.head(1)
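
A possible next step is to count trips per grid cell and set that density against the ETA observed in each cell. The sketch below shows the idea on six made-up trips with a vectorized version of the grid snapping; the small `trips` frame and the `*_cell` column names are assumptions of the example, not columns of the dataset.

```python
import pandas as pd

# Six hypothetical trips (origin coordinates as labelled in the dataset, ETA in seconds)
trips = pd.DataFrame({
    "origin_lat": [3.04, 3.01, 3.06, 3.21, 3.19, 2.91],
    "origin_lon": [36.74, 36.71, 36.73, 36.78, 36.81, 36.70],
    "eta": [600, 720, 540, 1500, 1320, 2100],
})

GRID_SIZE = 0.1  # cell width in degrees

# Snap each coordinate to the nearest grid line; the final round() removes float noise
for col in ("origin_lat", "origin_lon"):
    trips[f"{col}_cell"] = ((trips[col] / GRID_SIZE).round() * GRID_SIZE).round(1)

# Trip density and typical ETA per origin cell
density = (
    trips.groupby(["origin_lat_cell", "origin_lon_cell"])
    .agg(trip_count=("eta", "size"), median_eta=("eta", "median"))
    .reset_index()
    .sort_values("trip_count", ascending=False)
)
print(density)
```

With model predictions available, replacing `median_eta` with the per-cell RMSE of the residuals would show directly whether sparse cells are predicted less accurately than dense ones.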

QN. 7 How does the model's ETA prediction accuracy compare to industry benchmarks or competitors?

QN. 8 How does the time of day influence the demand for Yassir rides in different geographical areas?

QN. 9 Which origin and destination locations are most common (have the most rides)?

In [None]:
# Group by origin locations and count occurrences
origin_counts = train_df.groupby(['origin_lat', 'origin_lon'])['origin_lon'].count().reset_index(name='count')

# Sort by count in descending order
most_common_origins = origin_counts.sort_values(by='count', ascending=False).head(10)

# Group by destination locations and count occurrences
destination_counts = train_df.groupby(['destination_lat', 'destination_lon'])['destination_lon'].count().reset_index(name='count')

# Sort by count in descending order
most_common_destinations = destination_counts.sort_values(by='count', ascending=False).head(10)
most_common_origins

In [None]:
# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Top 10 Most Common Origin Locations', 'Top 10 Most Common Destination Locations'))

# Prepare data for origin locations
most_common_origins['location'] = most_common_origins.apply(lambda row: f"({row['origin_lat']}, {row['origin_lon']})", axis=1)

# Add trace for origin locations
fig.add_trace(
    go.Bar(x=most_common_origins['location'], y=most_common_origins['count'],
           name='Origin Locations'),
    row=1, col=1
)

# Prepare data for destination locations
most_common_destinations['location'] = most_common_destinations.apply(lambda row: f"({row['destination_lat']}, {row['destination_lon']})", axis=1)

# Add trace for destination locations
fig.add_trace(
    go.Bar(x=most_common_destinations['location'], y=most_common_destinations['count'],
           name='Destination Locations'),
    row=1, col=2
)

# Update layout
fig.update_layout(
    title_text='Top 10 Most Common Origin and Destination Locations',
    xaxis_title='Origin Locations',
    yaxis_title='Number of Rides',
    xaxis2_title='Destination Locations',
    yaxis2_title='Number of Rides',
    xaxis_tickangle=-45,
    xaxis2_tickangle=-45,
    showlegend=False
)

# Show the figure
fig.show()

QN. 10 What are the top 10 longest and top 10 shortest trips, and how do their ETAs compare?

In [None]:
train_df.head(1)

In [None]:
# Get the 10 smallest trip distances (shortest trips)
short_trips = train_df[['trip_distance','eta']].nsmallest(10, 'trip_distance').rename(columns={'trip_distance': 'short_trip_distance'})

# Get the 10 largest trip distances (longest trips)
long_trips = train_df[['trip_distance','eta']].nlargest(10, 'trip_distance').rename(columns={'trip_distance': 'long_trip_distance'})

fig = px.line(short_trips, x='short_trip_distance', y='eta')
fig.show()

fig = px.line(long_trips, x='long_trip_distance', y='eta')
fig.show()

QN. 11 What percentage of trips have ETAs exceeding 30 minutes?

In [None]:
# Create a dataframe for ETA values (copied so train_df is not modified)
eta_df = train_df[['eta']].copy()

# Convert ETA to hours
eta_df['eta_hours'] = eta_df['eta'] / 3600

# Convert ETA to minutes
eta_df['eta_minutes'] = eta_df['eta_hours'] * 60

# Filter trips with ETAs exceeding 30 minutes
long_trips = eta_df[eta_df['eta_minutes'] > 30]

# Count the number of trips with ETAs exceeding 30 minutes
num_long_trips = len(long_trips)

# Calculate the total number of trips
total_trips = len(eta_df)

# Calculate the percentage of trips with ETAs exceeding 30 minutes
percentage_long_trips = (num_long_trips / total_trips) * 100

# Create a summary dataframe
summary_df = pd.DataFrame({
    'Total Trips': [total_trips],
    'Trips > 30 min': [num_long_trips],
    'Percentage > 30 min': [percentage_long_trips]
})
summary_df

In [None]:
# Create labels and values for the pie chart
labels = ['Trips > 30 min', 'Trips <= 30 min']
values = [num_long_trips, total_trips - num_long_trips]

# Create a pie chart using Plotly
fig = go.Figure(data=[go.Pie(labels=labels, values=values,
                             insidetextorientation='radial', marker=dict(colors=['crimson', 'lightblue']))])
# Update layout
fig.update_layout(
    title='Percentage of Trips with ETAs Exceeding 30 Minutes'
)
# Show plot
fig.show()

QN. 12 What is the average number of trips per hour?

In [None]:
# Create a dataframe for the timestamp (copied so train_df is not modified)
time_df = train_df[['timestamp']].copy()

# Extract the date and the hour from the timestamp
time_df['date'] = time_df['timestamp'].dt.date
time_df['Hour'] = time_df['timestamp'].dt.hour

# Count the number of trips in each hour of each day
trips_per_hour = time_df.groupby(['date', 'Hour']).size().reset_index(name='Number of Trips')

# Average the daily counts for each hour of the day
# Note: day-hour combinations with no trips are not counted as zeros
ave_trips_per_hour = trips_per_hour.groupby('Hour')['Number of Trips'].mean().reset_index()
ave_trips_per_hour

In [None]:
# Plot the average number of trips by hour
fig = px.line(ave_trips_per_hour, x='Hour', y='Number of Trips', title='Average number of trips by Hour')

fig.show()

#### Hypothesis

In [None]:
alpha=0.05

### Normality test using the Shapiro-Wilk test

In [None]:
numericals = train_df.select_dtypes(include='number').columns.to_list()
for c in numericals:
  print(c)
  # Run the Shapiro-Wilk test on the non-missing values of the column
  a, b = shapiro(train_df[c].dropna())
  print(f'Statistics: {a}, p-value: {b}')

  if b < alpha:
    print('The distribution is not normal')
  else:
    print('The distribution is normal')
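
SciPy warns that the Shapiro-Wilk p-value may not be accurate for samples larger than 5000, so on a large training set it can be worth testing a random subsample instead. The sketch below uses a synthetic right-skewed array as a stand-in for a real numeric column; the distribution, sample size, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic, right-skewed values standing in for a numeric column
values = rng.lognormal(mean=6.5, sigma=0.6, size=50_000)

# Test a random subsample of at most 5000 observations
sample = rng.choice(values, size=min(5000, values.size), replace=False)
statistic, p_value = shapiro(sample)

alpha = 0.05
print(f'Statistic: {statistic:.4f}, p-value: {p_value:.3g}')
print('Reject normality' if p_value < alpha else 'No evidence against normality')
```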

In [None]:
# Copy so that the new columns do not modify train_df
week_eta_df = train_df[['timestamp', 'eta']].copy()

week_eta_df['eta_hours'] = week_eta_df['eta']/3600

week_eta_df['day_of_week'] = week_eta_df['timestamp'].dt.day_name()

# Define weekdays (Sunday to Thursday) and weekends (Friday and Saturday)
weekends_list = ['Friday','Saturday']

mask = week_eta_df['day_of_week'].isin(weekends_list)

weekdays = week_eta_df[~mask]['eta_hours']
weekends = week_eta_df[mask]['eta_hours']


# Perform Mann-Whitney U test
u_statistic, p_value = mannwhitneyu(weekdays, weekends, alternative='two-sided', nan_policy='omit')

# Print the results
print("Mann-Whitney U Test Results:")
print(f"U-statistic: {u_statistic}")
print(f"P-value: {p_value}")

# Conclusion
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in trip durations between weekdays and weekends.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in trip durations between weekdays and weekends.")

The Mann-Whitney U test indicates that trip durations differ between weekdays and weekends, which has implications for operational efficiency and customer experience.

### Key Findings

- **Statistical Insight**: The Mann-Whitney U test yielded a U-statistic of 721,718,033.5 with a p-value of 3.81e-06, well below the conventional significance level of 0.05. This leads to a rejection of the null hypothesis, confirming that trip durations differ significantly between weekdays and weekends.
  
- **Operational Impact**: The significant variation in trip durations highlights potential inefficiencies in resource allocation. Weekends may see either longer or shorter trips due to differences in traffic patterns, rider behavior, or demand fluctuations. If trip durations are longer on weekends, this could lead to increased operational costs and reduced fleet availability. Conversely, shorter durations could indicate underutilization of resources, presenting an opportunity to optimize service levels.

- **Customer Experience**: Discrepancies in trip durations across different days could impact customer satisfaction. Longer trip durations on weekends may lead to delays and dissatisfaction, particularly if expectations are set based on weekday performance. Consistency in ETA is crucial for maintaining customer trust and satisfaction, so addressing these discrepancies could enhance the overall customer experience.

### Business Impact

1. **Resource Optimization**: Understanding the differences in trip durations allows for better resource allocation. For example, if weekend trips are typically longer, the business might need to deploy more vehicles or adjust driver schedules to meet demand and reduce wait times. This optimization can lead to cost savings and improved service reliability.

2. **Demand Forecasting**: The difference in trip durations may reflect demand or traffic patterns that vary between weekdays and weekends. Incorporating day-of-week information can improve demand forecasting models, allowing for more accurate predictions and better planning.

3. **Service Level Adjustments**: If longer trip durations on particular days lead to customer dissatisfaction, the business could explore dynamic pricing, incentives for off-peak travel, or adjusted service levels to ensure faster service during peak times.

### Opportunities

- **Targeted Marketing**: The data indicates an opportunity for targeted marketing campaigns aimed at smoothing out demand across the week. For instance, offering promotions for weekend travel during off-peak hours could help balance demand and reduce pressure on resources during peak times.

- **Data-Driven Strategy**: With clear evidence of differing trip durations, the business has an opportunity to adopt a more data-driven approach to operations. This could involve using real-time data to adjust fleet deployment dynamically, ensuring optimal service levels throughout the week.

- **Customer Communication**: Enhancing communication with customers regarding expected trip durations on different days could manage expectations and improve satisfaction. Transparent communication about potential delays on weekends, along with proactive measures to mitigate them, could strengthen customer trust.

### Conclusion

The significant difference in trip durations between weekdays and weekends presents both challenges and opportunities for the business. By optimizing resource allocation, refining demand forecasting, and enhancing customer communication, the company can not only mitigate potential negative impacts but also unlock new avenues for growth and efficiency. Adopting these strategies will position the business to deliver a more consistent and satisfying customer experience while maintaining operational excellence.

In [None]:
# Boxplots of ETA for weekdays and weekends
fig = make_subplots(rows=1, cols=2, subplot_titles=('ETA weekdays', 'ETA weekends'))
fig.add_trace(
    go.Box(y=weekdays,
           name='Weekdays'),
    row=1, col=1
)
fig.add_trace(
    go.Box(y=weekends,
           name='Weekends'),
    row=1, col=2
)

# Update layout
fig.update_layout(
    title_text='Trip durations between weekdays and weekends',
    yaxis_title='ETA (hours)',
    yaxis2_title='ETA (hours)',
    showlegend=False
)

fig.show()


In [None]:
# Copy so that the new columns do not modify train_df
wk_df = train_df[['timestamp', 'eta']].copy()

# Friday (4) and Saturday (5) are treated as the weekend
wk_df['day_type'] = ['Weekend' if wk_day in [4, 5] else 'Weekday' for wk_day in wk_df['timestamp'].dt.weekday]

wk_df['eta_hours'] = wk_df['eta']/3600

# Add a month column for monthly trend analysis
wk_df['month'] = wk_df['timestamp'].dt.month_name()

# Calculate median trip duration per day type per month
trend_df = wk_df.groupby(['month', 'day_type']).agg({'eta_hours': 'median'}).reset_index()

# Plot the trend using Plotly
fig = px.line(trend_df, x='month', y='eta_hours', color='day_type',
              title='Trend of Median Trip Duration over Time',
              labels={'eta_hours': 'Median Trip Duration (hours)', 'month': 'Month'},
              markers=True)

# Customize the plot
fig.update_layout(xaxis_title='Month',
                  yaxis_title='Median Trip Duration (hours)',
                  legend_title='Day Type',
                  hovermode='x unified')

# Show the plot
fig.show()

The median values indicate that, while statistically significant, the practical difference between weekdays and weekends is subtle. This finding suggests opportunities for fine-tuning operations and enhancing customer satisfaction.
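
With large samples, a very small p-value can accompany a negligible effect, so it helps to report an effect size next to the test. The sketch below computes the common-language effect size: the probability that a randomly chosen weekday trip is longer than a randomly chosen weekend trip, with ties counted as half. The function name and the sample values are assumptions made for illustration, not figures from the dataset.

```python
def common_language_effect_size(sample_a, sample_b):
    """Probability that a value from sample_a exceeds one from sample_b (ties count 0.5)."""
    wins = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(sample_a) * len(sample_b))

# Made-up ETAs in hours, for illustration only
weekday_eta = [0.30, 0.28, 0.35, 0.25]
weekend_eta = [0.29, 0.27, 0.33]

effect = common_language_effect_size(weekday_eta, weekend_eta)
# A value near 0.5 means the two groups overlap almost completely
print(round(effect, 3))
```

The pairwise loop is only practical for small samples. On the full data, the same quantity can be obtained by dividing the U statistic for the first sample by the product of the two sample sizes (after dropping missing values).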

### Key Findings

- **ETA Variation**: The median ETA is slightly longer on weekdays than on weekends in both November and December.
  - **December**: The median ETA is 0.295 hours on weekdays and 0.288 hours on weekends, a gap of about 25 seconds.
  - **November**: The median ETA is 0.294 hours on weekdays and 0.289 hours on weekends, a gap of about 18 seconds.

- **Consistency Across Months**: Longer weekday ETAs appear in both months, which points to a persistent pattern rather than random variability, although two months is a short window from which to generalize.

### Business Impact

- **Operational Efficiency**: The slight increase in median ETA on weekdays suggests potential inefficiencies, such as higher traffic or demand during these periods. Addressing these factors could lead to improvements in operational performance.

- **Customer Experience**: Even small differences in median ETA can influence customer perceptions, especially during high-demand periods. Ensuring more consistent ETAs across the week could enhance overall customer satisfaction.

- **Predictable Patterns**: The consistency of these differences across months suggests that these trends are predictable, allowing the business to anticipate and plan for varying demand and traffic conditions.

### Opportunities

- **Route Optimization**: Given the consistent yet small difference in median ETAs, there's an opportunity to optimize routes specifically for weekdays. Adjustments could be made to minimize delays and improve trip efficiency during these peak times.

- **Enhanced Demand Forecasting**: The median-based analysis supports the development of more refined demand forecasting models that account for day-of-week variations. This can lead to better resource allocation and improved service reliability.

- **Customer Communication Strategies**: Clear communication with customers about potential weekday delays, coupled with targeted incentives for off-peak travel, could help manage expectations and maintain customer loyalty.


The difference in median trip durations between weekdays and weekends is statistically significant but modest in practical terms. This leaves room for operational improvements that could further reduce ETAs, particularly on weekdays. By acting on these insights, the business can optimize its operations and offer a more consistent and reliable service.

In [None]:
train_df.head(1)

# Data Preparation

Divide the Dataset into X and y variables

In [170]:
X = train_df.drop('eta', axis=1)
y = train_df['eta']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Preparing Pipelines

In [200]:
numerical_features = X.select_dtypes('number').columns
numerical_features

Index(['origin_lat', 'origin_lon', 'destination_lat', 'destination_lon',
       'trip_distance'],
      dtype='object')

In [201]:
numerical_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy = 'median')),
        ('scaler', StandardScaler())
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical_pipeline', numerical_pipeline, numerical_features)
    ],
    remainder='drop'
)


In [202]:
preprocessor.fit_transform(X_train)

array([[ 1.30220102, -0.60339435, -0.46413855,  1.04439557,  0.98884211],
       [-0.52391771,  0.51765573, -0.76183951,  1.3490224 , -0.71610965],
       [-1.14645819,  0.26853349, -0.11682077,  0.73976874,  0.04115715],
       ...,
       [-0.76255823, -1.28848051,  1.4907644 , -1.20984298,  1.6056594 ],
       [ 0.29576058, -0.54111379,  0.35950076, -0.9966042 , -1.08770446],
       [ 0.08824709,  0.23739321,  1.72892517,  1.28809703,  1.19999995]])

# Modeling and Evaluation

In [203]:
random_state = 2024
n_jobs = -1
verbose = 0

models = [
    AdaBoostRegressor(random_state=random_state),
    DecisionTreeRegressor(random_state=random_state),
    GradientBoostingRegressor(random_state=random_state, verbose=verbose),
    HistGradientBoostingRegressor(random_state=random_state, verbose=verbose),
    LinearRegression(n_jobs=n_jobs),
    RandomForestRegressor(random_state=random_state, n_jobs=n_jobs, verbose=verbose),
    XGBRegressor(random_state=random_state, n_jobs=n_jobs, verbosity=verbose),
]

In [194]:
# Dictionary to store the fitted pipelines
all_pipelines = {}

# Create an empty DataFrame for metrics
metrics_table = pd.DataFrame(columns=['NAME', 'MSE', 'RMSE'])

# Train and predict with each model
for model in models:
  # Model name
  name = model.__class__.__name__
  
  # Model pipeline
  final_pipeline = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('regressor', model)
  ])
  
  # Fit and predict
  final_pipeline.fit(X_train, y_train)
  y_pred = final_pipeline.predict(X_test)
  
  # Metrics
  mse = mean_squared_error(y_test, y_pred)
  rmse = root_mean_squared_error(y_test, y_pred)

  # Store the fitted pipeline under its model name
  all_pipelines[name] = final_pipeline

  # Add metrics to metrics_table
  metrics_table.loc[len(metrics_table)] = [name, mse, rmse]


In [196]:
# Display the metrics table
metrics_table = metrics_table.sort_values(by='RMSE', ascending=True).reset_index(drop=True)
metrics_table

| | NAME | MSE | RMSE |
|---|---|---|---|
| 0 | RandomForestRegressor | 21823.27 | 147.73 |
| 1 | XGBRegressor | 24262.68 | 155.76 |
| 2 | HistGradientBoostingRegressor | 28455.73 | 168.69 |
| 3 | GradientBoostingRegressor | 38293.49 | 195.69 |
| 4 | DecisionTreeRegressor | 41821.65 | 204.5 |
| 5 | LinearRegression | 59812.19 | 244.57 |
| 6 | AdaBoostRegressor | 159244.49 | 399.05 |


In [197]:
best_model_name = metrics_table['NAME'].iloc[0]
best_model_name

'RandomForestRegressor'

In [208]:
best_model_score = metrics_table['RMSE'].iloc[0]
best_model_score

147.72701513362244
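
An RMSE is easier to interpret next to a naive reference. The sketch below, on synthetic data, compares a fitted regressor against a baseline that always predicts the training mean. `DummyRegressor` is a standard scikit-learn estimator; the synthetic distances, target, and seed are assumptions for illustration and say nothing about the actual Yassir results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic trips: the target grows with distance, plus noise
distance = rng.uniform(500, 30000, size=1000)
eta_syn = 0.06 * distance + rng.normal(0, 120, size=1000)

X_syn = distance.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, eta_syn, test_size=0.2, random_state=42)

def fit_and_score(model):
    """Fit on the training split and return the RMSE on the held-out split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

baseline_rmse = fit_and_score(DummyRegressor(strategy='mean'))
model_rmse = fit_and_score(LinearRegression())
print(f'baseline: {baseline_rmse:.1f}, model: {model_rmse:.1f}')
```

A model is only useful to the extent that its RMSE falls clearly below that of the baseline.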

In [198]:
best_model = all_pipelines.get(best_model_name)
best_model

In [204]:
# Get the numerical feature names after transformation
numerical_features_transformed = best_model.named_steps['preprocessor'].named_transformers_['numerical_pipeline'].named_steps['scaler'].get_feature_names_out(numerical_features)
numerical_features_transformed

array(['origin_lat', 'origin_lon', 'destination_lat', 'destination_lon',
       'trip_distance'], dtype=object)

In [207]:
# Get the feature names after transformation
# feature_columns = np.concatenate((numerical_features_transformed))
feature_columns = numerical_features_transformed
score = best_model.named_steps['regressor'].feature_importances_

# Build and display the feature importance table
f_importances_df = pd.DataFrame({'feature':feature_columns, 'score': score})
f_importances_df.sort_values(by='score', ascending = False, inplace=True)
f_importances_df

| | feature | score |
|---|---|---|
| 4 | trip_distance | 0.85 |
| 1 | origin_lon | 0.04 |
| 3 | destination_lon | 0.04 |
| 2 | destination_lat | 0.03 |
| 0 | origin_lat | 0.03 |


In [212]:
# Plot the feature importances
fig = px.bar(
    f_importances_df.sort_values(by='score', ascending=True),
    x='score',
    y='feature',
    orientation='h',  # Set orientation to horizontal
    title=f"Feature Importances - {best_model['regressor'].__class__.__name__} (RMSE: {best_model_score:.2f})",
    labels={'score': 'Score', 'feature': 'Feature'},  # keys must match the column names
    height=700,
    color='score'
)

fig.show()

### Use the best model to predict on the unseen dataset (test_df)

- Prepare Test Dataset

In [190]:
test_df.head()

| | id | timestamp | origin_lat | origin_lon | destination_lat | destination_lon | trip_distance |
|---|---|---|---|---|---|---|---|
| 0 | 000V4BQX | 2019-12-21 05:52:37+00:00 | 2.98 | 36.69 | 2.98 | 36.75 | 17549 |
| 1 | 003WBC5J | 2019-12-25 21:38:53+00:00 | 3.03 | 36.77 | 3.07 | 36.75 | 7532 |
| 2 | 004O4X3A | 2019-12-29 21:30:29+00:00 | 3.04 | 36.71 | 3.01 | 36.76 | 10194 |
| 3 | 006CEI5B | 2019-12-31 22:51:57+00:00 | 2.9 | 36.74 | 3.21 | 36.7 | 32768 |
| 4 | 009G0M2T | 2019-12-28 21:47:22+00:00 | 2.86 | 36.69 | 2.83 | 36.7 | 4513 |


In [213]:
eta_pred = best_model.predict(test_df)
eta_pred

array([1374.94511905,  731.98      , 1033.17      , ...,  742.41      ,
       1249.55      ,  683.935     ])

In [240]:
submission_df = pd.DataFrame(
    {
        'id': test_df['id'],
        'eta': eta_pred
    }
)
submission_df

| | id | eta |
|---|---|---|
| 0 | 000V4BQX | 1374.95 |
| 1 | 003WBC5J | 731.98 |
| 2 | 004O4X3A | 1033.17 |
| 3 | 006CEI5B | 2059.24 |
| 4 | 009G0M2T | 758.39 |
| ... | ... | ... |
| 35620 | ZZXSJW3Q | 432.46 |
| 35621 | ZZYPNYYY | 1064.20 |
| 35622 | ZZYVPKXY | 742.41 |
| 35623 | ZZZXGRIO | 1249.55 |


In [248]:
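filename = 'submission.csv'

# Save into the Submission folder one level above the working directory
path = Path.cwd().parent / "Submission"
path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

filepath = path / filename

# Round ETAs to whole numbers and upper-case the column names before saving
submission_df['eta'] = submission_df['eta'].round(0).astype(int)
submission_df.columns = [c.upper() for c in submission_df.columns]
submission_df.to_csv(filepath, index=False)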

### Persisting the models 💾

In [None]:
# for model_name, model in {**all_stat_models, **all_pipelines}.items():
#     joblib.dump(model,f'./Trained models/{model_name}.joblib')

In [224]:
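# Save every fitted pipeline into the Models folder
compdir = Path.cwd().parent / "Models"
compdir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing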
for model_name, pipeline in all_pipelines.items():
    filename = f'{model_name}.joblib'
    filepath = compdir / filename
    joblib.dump(pipeline, filepath)


print('💾 All models have been saved.')

💾 All models have been saved.
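
A saved pipeline can later be reloaded and used for prediction without refitting. The sketch below shows the round trip on a tiny synthetic pipeline in a temporary directory; the data, the file name, and the directory are placeholders rather than the notebook's actual `Models` folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic training set (illustrative only)
X_demo = np.array([[1000.0], [5000.0], [10000.0], [20000.0]])
y_demo = np.array([120.0, 480.0, 900.0, 1700.0])

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'demo_pipeline.joblib'
    joblib.dump(demo_pipeline, model_path)  # persist the fitted pipeline
    restored = joblib.load(model_path)      # load it back from disk

# The restored pipeline should reproduce the original predictions
same = bool(np.allclose(demo_pipeline.predict(X_demo), restored.predict(X_demo)))
print(same)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same library versions that were used to save them.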


# Deployment

Made with 💖 [Team Curium](https://github.com/valiantezabuku/Yassir-ETA-Prediction-Challenge-For-Azubian-Team-Curium)