# **Deep ETA Prediction**

## **Introduction**

<img src="../Images/eta.png" alt="eta" width="800">

Ride-hailing apps like Uber and Faras rely on real-time data and machine learning algorithms to automate their services. Accurately predicting the estimated time of arrival (ETA) for Faras trips will make Faras’s services more reliable and attractive; this will have a direct and indirect impact on both customers and business partners. The solution would help the company save money and allocate more resources to other parts of the business.

The objective of this project is to predict the estimated time of arrival at the dropoff point for a journey.

This project uses the [CRISP-DM](https://www.datascience-pm.com/crisp-dm-2/) framework for data science problems.

## **1. Business Understanding**

*Predicting Estimated Time of Arrival (ETA) for Faras*

Ride-hailing platforms such as **Faras** operate in a highly competitive and fast-paced environment where real-time data and advanced machine learning algorithms play a crucial role in providing reliable services. A key component of this is accurately predicting the **Estimated Time of Arrival (ETA)** for Faras journeys. Better ETA predictions directly improve service reliability and customer satisfaction, which in turn drive the success of the business.

Accurate ETAs are important not only for a better customer experience but also for optimizing ride fares, matching riders with drivers, planning deliveries, and overall operational efficiency. An ETA that is too long can lead customers to cancel their rides; an ETA that is too optimistic leads to dissatisfaction when the trip runs late. Both outcomes can reduce customer retention and revenue, making ETA prediction a critical part of Faras's business model.


Inaccurate ETA predictions have a direct impact on the business. If the ETA is overestimated, customers may cancel their bookings due to longer-than-expected wait times. This results in revenue loss, wasted driver time, and potentially increased customer churn. On the other hand, if the ETA is underestimated, customers may experience delays, leading to frustration, negative reviews, or even app uninstalls, further affecting the company's reputation and revenue.

Improving ETA accuracy offers **several benefits**:
1. **Improved Customer Experience**:
   - **Accurate Trip Planning**: When customers receive reliable ETAs, they can plan their time better, knowing when to expect the vehicle to arrive at the pickup point and how long the trip will take to the destination.
   - **Reduced Cancellations**: If the ETA is realistic, customers are less likely to cancel due to long wait times or false expectations. Accurate ETAs minimize frustrations caused by miscalculations.

2. **Higher Driver Satisfaction**:
   - **Better Earnings and Time Management**: When drivers have accurate ETAs, they can better estimate how many rides they can complete in a given period, optimizing their earnings and time. This leads to higher driver retention and satisfaction.

3. **Revenue Growth**:
   - **Customer Retention**: A reliable ETA prediction system can lead to improved customer loyalty as users appreciate the accuracy of time estimates, leading to increased ride bookings.

## Objectives

The primary objective is to **develop a highly accurate and efficient predictive model** that can estimate the ETA for Faras journeys in real time. Specifically, the model should:

1. **Minimize Latency**: The model must be able to return an ETA prediction within a few milliseconds. This is critical to ensure that ETAs can be delivered to customers and drivers without delay, enabling the system to make quick decisions regarding route planning and ride matching.
   
2. **Maximize Accuracy**: The model must significantly improve on the existing XGBoost-based solution in terms of accuracy. This means reducing the Mean Absolute Error (MAE) to provide more reliable ETA predictions, thereby minimizing customer dissatisfaction caused by inaccurate estimates.
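The accuracy objective above is stated in terms of Mean Absolute Error, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ is the actual ETA and $\hat{y}_i$ the predicted one. A minimal sketch of that calculation follows. The function name and the predicted values are illustrative assumptions, not output from the existing XGBoost model or any other model.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted ETAs (seconds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Actual ETAs are the first three rows of the training sample;
# the predictions are made-up values for illustration only.
actual_eta = [2784, 576, 526]
predicted_eta = [2700, 600, 500]

mae = mean_absolute_error(actual_eta, predicted_eta)
print(f"MAE: {mae:.1f} s")
```

Because MAE is expressed in the same unit as the target (seconds here), it can be read directly as the typical size of an ETA miss.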

## Proposed Solution

### Deep Learning Approach:
Given the complexity of the factors involved and the need for both **low latency** and **high accuracy**, a deep learning model is proposed. While traditional models such as XGBoost have been useful, the real-time and multi-dimensional nature of the data makes deep learning better suited to the task.




## **2. Data Understanding**

In [44]:
import pandas as pd
import numpy as np

#### Data Loading

In [7]:
# load training data
df = pd.read_csv('../Data/Train.csv')
df.head()

|  | ID | Timestamp | Origin_lat | Origin_lon | Destination_lat | Destination_lon | Trip_distance | ETA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 000FLWA8 | 2019-12-04T20:01:50Z | 3.258 | 36.777 | 3.003 | 36.718 | 39627 | 2784 |
| 1 | 000RGOAM | 2019-12-10T22:37:09Z | 3.087 | 36.707 | 3.081 | 36.727 | 3918 | 576 |
| 2 | 001QSGIH | 2019-11-23T20:36:10Z | 3.144 | 36.739 | 3.088 | 36.742 | 7265 | 526 |
| 3 | 002ACV6R | 2019-12-01T05:43:21Z | 3.239 | 36.784 | 3.054 | 36.763 | 23350 | 3130 |
| 4 | 0039Y7A8 | 2019-12-17T20:30:20Z | 2.912 | 36.707 | 3.207 | 36.698 | 36613 | 2138 |


In [9]:
# weather data
weather_df = pd.read_csv('../Data/Weather.csv')
weather_df.head()

|  | date | dewpoint_2m_temperature | maximum_2m_air_temperature | mean_2m_air_temperature | mean_sea_level_pressure | minimum_2m_air_temperature | surface_pressure | total_precipitation | u_component_of_wind_10m | v_component_of_wind_10m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-11-01 | 290.630524 | 296.434662 | 294.125061 | 101853.617188 | 292.503998 | 100806.351562 | 0.004297 | 3.561323 | 0.941695 |
| 1 | 2019-11-02 | 289.135284 | 298.432404 | 295.551666 | 101225.164062 | 293.337921 | 100187.25 | 0.001767 | 5.318593 | 3.258237 |
| 2 | 2019-11-03 | 287.667694 | 296.612122 | 295.182831 | 100806.617188 | 293.674316 | 99771.414062 | 0.000797 | 8.447649 | 3.172982 |
| 3 | 2019-11-04 | 287.634644 | 297.173737 | 294.368134 | 101240.929688 | 292.376221 | 100200.84375 | 0.000393 | 5.991428 | 2.2367 |
| 4 | 2019-11-05 | 286.413788 | 294.284851 | 292.496979 | 101131.75 | 289.143066 | 100088.5 | 0.004658 | 6.96273 | 2.655364 |


1. **date:** The date when the weather measurements were recorded.

2. **dewpoint_2m_temperature:** The dew point at 2 meters above the ground, i.e. the temperature to which the air must cool for dew to form. It gives an indication of humidity. Judging by the values (around 290), the temperature columns appear to be in kelvin.

3. **maximum_2m_air_temperature:** The highest air temperature recorded at 2 meters above the ground during the specified date.

4. **mean_2m_air_temperature:** The average air temperature at 2 meters above the ground during the specified date.

5. **mean_sea_level_pressure:** The average atmospheric pressure at sea level during the specified date.

6. **minimum_2m_air_temperature:** The lowest air temperature recorded at 2 meters above the ground during the specified date.

7. **surface_pressure:** The atmospheric pressure at the Earth's surface during the specified date.

8. **total_precipitation:** The total amount of precipitation (rain, snow, etc.) during the specified date.

9. **u_component_of_wind_10m:** The east-west (zonal) component of the wind velocity at 10 meters above the ground.

10. **v_component_of_wind_10m:** The north-south (meridional) component of the wind velocity at 10 meters above the ground. Like the u component, it is a horizontal component, not a vertical one.
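The two wind components listed above are easier to interpret once combined into a single wind speed, which is the magnitude of the (u, v) vector. The sketch below does that for the first two weather rows shown earlier and converts the mean temperature to degrees Celsius. The new column names (`wind_speed_10m`, `mean_temp_c`) are illustrative, and the kelvin unit is an assumption based on the magnitude of the values rather than something the dataset states.

```python
import numpy as np
import pandas as pd

# First two rows of the weather sample shown above
sample = pd.DataFrame({
    "mean_2m_air_temperature": [294.125061, 295.551666],
    "u_component_of_wind_10m": [3.561323, 5.318593],
    "v_component_of_wind_10m": [0.941695, 3.258237],
})

# Wind speed is the length of the (u, v) vector: sqrt(u^2 + v^2)
sample["wind_speed_10m"] = np.hypot(
    sample["u_component_of_wind_10m"],
    sample["v_component_of_wind_10m"],
)

# Assumption: temperatures are in kelvin, so subtract 273.15 for Celsius
sample["mean_temp_c"] = sample["mean_2m_air_temperature"] - 273.15

print(sample[["wind_speed_10m", "mean_temp_c"]].round(2))
```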

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83924 entries, 0 to 83923
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               83924 non-null  object 
 1   Timestamp        83924 non-null  object 
 2   Origin_lat       83924 non-null  float64
 3   Origin_lon       83924 non-null  float64
 4   Destination_lat  83924 non-null  float64
 5   Destination_lon  83924 non-null  float64
 6   Trip_distance    83924 non-null  int64  
 7   ETA              83924 non-null  int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 5.1+ MB


In [18]:
# convert the Timestamp column from string to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

In [16]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   date                        61 non-null     object 
 1   dewpoint_2m_temperature     61 non-null     float64
 2   maximum_2m_air_temperature  61 non-null     float64
 3   mean_2m_air_temperature     61 non-null     float64
 4   mean_sea_level_pressure     61 non-null     float64
 5   minimum_2m_air_temperature  61 non-null     float64
 6   surface_pressure            61 non-null     float64
 7   total_precipitation         61 non-null     float64
 8   u_component_of_wind_10m     61 non-null     float64
 9   v_component_of_wind_10m     61 non-null     float64
dtypes: float64(9), object(1)
memory usage: 4.9+ KB


In [19]:
# changing date column to datetime
weather_df['date'] = pd.to_datetime(weather_df['date'])
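With both date columns parsed, each trip can be matched to the weather on its day. The following is a minimal sketch of one way to do that join, not necessarily the approach used later in this project. It uses small stand-in frames so it runs on its own; the precipitation values are made up, and treating the UTC date of the trip as the weather date is an assumption.

```python
import pandas as pd

# Stand-in for df: two trips from the training sample
trips = pd.DataFrame({
    "ID": ["000FLWA8", "000RGOAM"],
    "Timestamp": pd.to_datetime(["2019-12-04T20:01:50Z", "2019-12-10T22:37:09Z"]),
})

# Stand-in for weather_df: made-up precipitation values for those two days
weather = pd.DataFrame({
    "date": pd.to_datetime(["2019-12-04", "2019-12-10"]),
    "total_precipitation": [0.001, 0.002],
})

# Drop the timezone and the time of day so the key matches the daily weather dates
trips["date"] = trips["Timestamp"].dt.tz_localize(None).dt.normalize()

# Left join keeps every trip; validate guards against duplicate weather dates
merged = trips.merge(weather, on="date", how="left", validate="many_to_one")
print(merged[["ID", "date", "total_precipitation"]])
```

A left join is used so that trips on days without a weather record are kept (with missing weather values) rather than silently dropped.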

#### *Check for Null Values*

In [31]:
print(f"Null Values: {df.isna().sum()}")
print(f"Null Values Weather: {weather_df.isna().sum()}")

Null Values: ID                 0
Timestamp          0
Origin_lat         0
Origin_lon         0
Destination_lat    0
Destination_lon    0
Trip_distance      0
ETA                0
dtype: int64
Null Values Weather: date                          0
dewpoint_2m_temperature       0
maximum_2m_air_temperature    0
mean_2m_air_temperature       0
mean_sea_level_pressure       0
minimum_2m_air_temperature    0
surface_pressure              0
total_precipitation           0
u_component_of_wind_10m       0
v_component_of_wind_10m       0
dtype: int64


In [22]:
print(f"Duplicated Rows: {df.duplicated().sum()}")
print(f"Duplicated Rows Weather: {weather_df.duplicated().sum()}")

Duplicated Rows: 0
Duplicated Rows Weather: 0


In [29]:
# format floats with 3 decimal places and a thousands separator
pd.options.display.float_format = '{:,.3f}'.format

# statistics for numerical columns
df.describe()

|  | Origin_lat | Origin_lon | Destination_lat | Destination_lon | Trip_distance | ETA |
| --- | --- | --- | --- | --- | --- | --- |
| count | 83924.0 | 83924.0 | 83924.0 | 83924.0 | 83924.0 | 83924.0 |
| mean | 3.052 | 36.739 | 3.057 | 36.738 | 13527.821 | 1111.698 |
| std | 0.096 | 0.032 | 0.101 | 0.033 | 9296.716 | 563.565 |
| min | 2.807 | 36.589 | 2.807 | 36.596 | 1.0 | 1.0 |
| 25% | 2.994 | 36.721 | 2.995 | 36.718 | 6108.0 | 701.0 |
| 50% | 3.046 | 36.742 | 3.049 | 36.742 | 11731.5 | 1054.0 |
| 75% | 3.095 | 36.76 | 3.109 | 36.76 | 19369.0 | 1456.0 |
| max | 3.381 | 36.82 | 3.381 | 36.819 | 62028.0 | 5238.0 |


- **Origin and Destination Coordinates (Latitude and Longitude)**: The mean origin is approximately (3.052, 36.739) and the mean destination approximately (3.057, 36.738). Latitudes range from 2.807 to 3.381 and longitudes from about 36.589 to 36.820, so all trips fall within a small geographic area.
- **Trip Distance**: The average trip distance is approximately 13,527.82 m, with a minimum of 1.00 m and a maximum of 62,028.00 m. Trips with very short distances look like anomalies.
- **ETA (Estimated Time of Arrival)**: The average ETA is approximately 1,111.70 s, with a minimum of 1.00 s and a maximum of 5,238.00 s. Very low ETAs may also be anomalies; they are kept for now and examined during the EDA.
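One way to surface the suspicious low-distance and low-ETA records during EDA is to flag them explicitly and to check the average speed each trip implies. The sketch below assumes `Trip_distance` is in meters and `ETA` in seconds; the three thresholds and the `speed_kmh` and `suspect` column names are hypothetical and would need tuning against the real data.

```python
import pandas as pd

# Three rows from the training sample plus one hypothetical outlier
trips = pd.DataFrame({
    "Trip_distance": [39627, 3918, 7265, 1],  # assumed meters
    "ETA": [2784, 576, 526, 1],               # assumed seconds
})

# Hypothetical thresholds, chosen only for illustration
MIN_DISTANCE_M = 100
MIN_ETA_S = 30
MAX_SPEED_KMH = 150

# Implied average speed: m/s converted to km/h
trips["speed_kmh"] = trips["Trip_distance"] / trips["ETA"] * 3.6

# A trip is suspect if it is implausibly short, quick, or fast
trips["suspect"] = (
    (trips["Trip_distance"] < MIN_DISTANCE_M)
    | (trips["ETA"] < MIN_ETA_S)
    | (trips["speed_kmh"] > MAX_SPEED_KMH)
)

print(trips)
```

Flagging rather than dropping keeps the rows available for inspection, which matches the decision above to defer handling them to the EDA. The minimum ETA in the summary is 1, so the division does not hit zero on this dataset.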