## 1. Business Understanding (Stakeholder Input)
- Goal
- Hypothesis
- Analytical Questions
- More Information about the project if applicable
## 2. Data Understanding
- Load Dataset
- Clean dataset
- EDA (info, describe, duplicates, appropriate columns, check for unique values)
  - Univariate Analysis (histograms, outlier checks, skewness, density plots, etc.)
  - Bivariate Analysis (data types, correlation heatmap, violin plots, pair plots, etc.)
  - Multivariate Analysis (PCA)
  - Further analysis
- Answer Analytical Questions
- Test Hypothesis
## 3. Data Preparation
## 4. Modeling & Evaluation


# Business Understanding

### Objective:
The primary goal of this project is to accurately predict the estimated time of arrival (ETA) for Yassir trips. This will enhance the reliability of Yassir's services, potentially increasing customer satisfaction and retention while optimizing resource allocation and cost management.

### Stakeholders:

- **Customers:** Require reliable and accurate ETAs to plan their journeys better.
- **Drivers:** Benefit from improved route planning and time management.
- **Yassir Management:** Needs accurate ETA predictions to improve service efficiency, resource allocation, and customer satisfaction.

### Success Criteria:

- **Operational Efficiency:** Better resource management and reduced operational costs.
- **Customer Experience:** Enhanced satisfaction due to accurate ETA predictions.
- **Market Competitiveness:** Improved reliability can make Yassir more attractive compared to competitors.

### Business Questions
1. How do weather conditions affect the ETA of Yassir trips?
   Understanding the influence of factors like temperature, rainfall, and wind speed on travel times can help in more accurate ETA predictions.
2. What is the impact of trip distance on ETA accuracy?
   Investigating whether longer or shorter trips have more variance in ETA predictions can help refine the model.
3. How do different times of the day affect ETA predictions?
   Analyzing time-based patterns (e.g., rush hours vs. non-rush hours) can help improve the predictive model.

### Hypothesis

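Null Hypothesis (H0): The ETA for Yassir trips is not significantly influenced by weather conditions (particularly rainfall and wind speed), trip distance, or the time of day.

Alternative Hypothesis (H1): The ETA for Yassir trips is significantly influenced by weather conditions (particularly rainfall and wind speed), trip distance, or the time of day.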
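
As a minimal sketch of how the trip-distance part of this hypothesis could be checked, the snippet below runs a permutation test on the Pearson correlation between `Trip_distance` and `ETA`. The five value pairs are copied from the `train_df.head()` output shown later in this notebook and are there only for illustration. The helper name `permutation_corr_test`, the number of permutations, and the seed are assumptions; a real test would use the full training set.

```python
import numpy as np

# Five (Trip_distance, ETA) pairs taken from the training sample shown in this notebook.
distance_m = np.array([39627, 3918, 7265, 23350, 36613], dtype=float)
eta_s = np.array([2784, 576, 526, 3130, 2138], dtype=float)


def permutation_corr_test(x, y, n_permutations=2000, seed=0):
    """Return (observed Pearson r, two-sided permutation p-value).

    Hypothetical helper: shuffling y simulates 'no association' between x and y.
    """
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        r = np.corrcoef(x, shuffled)[0, 1]
        if abs(r) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


r_obs, p_val = permutation_corr_test(distance_m, eta_s)
print(f"Pearson r = {r_obs:.3f}, permutation p-value = {p_val:.3f}")
```

A small p-value would count as evidence against "no association" for distance. The weather and time-of-day parts of the hypothesis would each need their own test.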

# Data Understanding

In [1]:
# Import the Necessary Packages
import pandas as pd
import numpy as np

In [7]:
# Load CSV files into the Notebook
train_df = pd.read_csv("../Data/Train.csv")

weather_df = pd.read_csv("../Data/Weather.csv")

test_df = pd.read_csv("../Data/Test.csv")

### Exploratory Data Analysis

In [8]:
test_df.head()

| | ID | Timestamp | Origin_lat | Origin_lon | Destination_lat | Destination_lon | Trip_distance |
|---|---|---|---|---|---|---|---|
| 0 | 000V4BQX | 2019-12-21T05:52:37Z | 2.981 | 36.688 | 2.978 | 36.754 | 17549 |
| 1 | 003WBC5J | 2019-12-25T21:38:53Z | 3.032 | 36.769 | 3.074 | 36.751 | 7532 |
| 2 | 004O4X3A | 2019-12-29T21:30:29Z | 3.035 | 36.711 | 3.01 | 36.758 | 10194 |
| 3 | 006CEI5B | 2019-12-31T22:51:57Z | 2.902 | 36.738 | 3.208 | 36.698 | 32768 |
| 4 | 009G0M2T | 2019-12-28T21:47:22Z | 2.86 | 36.692 | 2.828 | 36.696 | 4513 |


In [5]:
train_df.head()

| | ID | Timestamp | Origin_lat | Origin_lon | Destination_lat | Destination_lon | Trip_distance | ETA |
|---|---|---|---|---|---|---|---|---|
| 0 | 000FLWA8 | 2019-12-04T20:01:50Z | 3.258 | 36.777 | 3.003 | 36.718 | 39627 | 2784 |
| 1 | 000RGOAM | 2019-12-10T22:37:09Z | 3.087 | 36.707 | 3.081 | 36.727 | 3918 | 576 |
| 2 | 001QSGIH | 2019-11-23T20:36:10Z | 3.144 | 36.739 | 3.088 | 36.742 | 7265 | 526 |
| 3 | 002ACV6R | 2019-12-01T05:43:21Z | 3.239 | 36.784 | 3.054 | 36.763 | 23350 | 3130 |
| 4 | 0039Y7A8 | 2019-12-17T20:30:20Z | 2.912 | 36.707 | 3.207 | 36.698 | 36613 | 2138 |


In [6]:
weather_df.head()

| | date | dewpoint_2m_temperature | maximum_2m_air_temperature | mean_2m_air_temperature | mean_sea_level_pressure | minimum_2m_air_temperature | surface_pressure | total_precipitation | u_component_of_wind_10m | v_component_of_wind_10m |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-11-01 | 290.630524 | 296.434662 | 294.125061 | 101853.617188 | 292.503998 | 100806.351562 | 0.004297 | 3.561323 | 0.941695 |
| 1 | 2019-11-02 | 289.135284 | 298.432404 | 295.551666 | 101225.164062 | 293.337921 | 100187.25 | 0.001767 | 5.318593 | 3.258237 |
| 2 | 2019-11-03 | 287.667694 | 296.612122 | 295.182831 | 100806.617188 | 293.674316 | 99771.414062 | 0.000797 | 8.447649 | 3.172982 |
| 3 | 2019-11-04 | 287.634644 | 297.173737 | 294.368134 | 101240.929688 | 292.376221 | 100200.84375 | 0.000393 | 5.991428 | 2.2367 |
| 4 | 2019-11-05 | 286.413788 | 294.284851 | 292.496979 | 101131.75 | 289.143066 | 100088.5 | 0.004658 | 6.96273 | 2.655364 |
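
The hypothesis mentions wind speed, but the weather data stores wind as separate `u` and `v` components. The sketch below combines them into a single magnitude and converts the mean temperature to Celsius. It uses the first two weather rows shown above. The new column names `wind_speed_10m` and `mean_temp_celsius` are assumptions, as is the reading of the temperature values (around 290 to 298) as Kelvin.

```python
import numpy as np
import pandas as pd

# First two rows of weather_df as displayed above (only the columns used here).
weather_sample = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "mean_2m_air_temperature": [294.125061, 295.551666],
    "u_component_of_wind_10m": [3.561323, 5.318593],
    "v_component_of_wind_10m": [0.941695, 3.258237],
})

# Wind speed is the magnitude of the (u, v) wind vector: sqrt(u^2 + v^2).
weather_sample["wind_speed_10m"] = np.hypot(
    weather_sample["u_component_of_wind_10m"],
    weather_sample["v_component_of_wind_10m"],
)

# Assumption: temperatures are in Kelvin, so subtracting 273.15 gives Celsius.
weather_sample["mean_temp_celsius"] = (
    weather_sample["mean_2m_air_temperature"] - 273.15
)

print(weather_sample[["date", "wind_speed_10m", "mean_temp_celsius"]])
```

The same two lines could be applied to the full `weather_df` before joining it to the trip data.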


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83924 entries, 0 to 83923
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               83924 non-null  object 
 1   Timestamp        83924 non-null  object 
 2   Origin_lat       83924 non-null  float64
 3   Origin_lon       83924 non-null  float64
 4   Destination_lat  83924 non-null  float64
 5   Destination_lon  83924 non-null  float64
 6   Trip_distance    83924 non-null  int64  
 7   ETA              83924 non-null  int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 5.1+ MB


In [4]:
# Checking for missing Values
train_df.isna().sum()

ID                 0
Timestamp          0
Origin_lat         0
Origin_lon         0
Destination_lat    0
Destination_lon    0
Trip_distance      0
ETA                0
dtype: int64
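
The EDA outline also lists checks for duplicates and unique values. A minimal sketch of those checks follows, using three rows copied from the training sample above; the variable names are illustrative, and on the real data the same calls would run on `train_df`.

```python
import pandas as pd

# Three rows copied from train_df.head(), used only to illustrate the calls.
sample = pd.DataFrame({
    "ID": ["000FLWA8", "000RGOAM", "001QSGIH"],
    "Trip_distance": [39627, 3918, 7265],
    "ETA": [2784, 576, 526],
})

# Count fully duplicated rows (every column identical to an earlier row).
n_duplicate_rows = int(sample.duplicated().sum())

# Count repeated trip IDs, which may indicate the same trip logged twice.
n_duplicate_ids = int(sample["ID"].duplicated().sum())

# True when every ID appears exactly once.
id_is_unique = sample["ID"].is_unique

print(n_duplicate_rows, n_duplicate_ids, id_is_unique)
```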

### Clean the Timestamp column and convert its data type

In [13]:

# Function to clean the timestamp
def clean_timestamp(Timestamp):
    return Timestamp.replace('T', ' ').replace('Z', '')

# # Function to extract date and time
# def extract_date_time(Timestamp):
#     date_time_str = Timestamp.replace('T', ' ').replace('Z', '')
#     date, time = date_time_str.split(' ')
#     return date, time

# Clean the 'Timestamp' strings and convert them to datetime in a new 'timestamp' column
train_df['timestamp'] = pd.to_datetime(train_df['Timestamp'].apply(clean_timestamp))

# # Apply the function and create two new columns
# train_df[['date', 'time']] = train_df['Timestamp'].apply(lambda x: pd.Series(extract_date_time(x)))

print(train_df)


             ID             Timestamp  Origin_lat  Origin_lon  \
0      000FLWA8  2019-12-04T20:01:50Z       3.258      36.777   
1      000RGOAM  2019-12-10T22:37:09Z       3.087      36.707   
2      001QSGIH  2019-11-23T20:36:10Z       3.144      36.739   
3      002ACV6R  2019-12-01T05:43:21Z       3.239      36.784   
4      0039Y7A8  2019-12-17T20:30:20Z       2.912      36.707   
...         ...                   ...         ...         ...   
83919  ZZXN4JH2  2019-11-30T23:21:58Z       3.121      36.743   
83920  ZZXQ5AQJ  2019-11-27T05:59:31Z       3.024      36.749   
83921  ZZXYPKGU  2019-12-06T05:04:06Z       3.189      36.721   
83922  ZZYTQHKT  2019-12-07T05:55:22Z       3.046      36.738   
83923  ZZZY11ZN  2019-12-12T21:22:31Z       2.889      36.762   

       Destination_lat  Destination_lon  Trip_distance   ETA  Distance_KM  \
0                3.003           36.718          39627  2784       39.627   
1                3.081           36.727           3918   576     

In [14]:
train_df.head()

| | ID | Timestamp | Origin_lat | Origin_lon | Destination_lat | Destination_lon | Trip_distance | ETA | Distance_KM | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000FLWA8 | 2019-12-04T20:01:50Z | 3.258 | 36.777 | 3.003 | 36.718 | 39627 | 2784 | 39.627 | 2019-12-04 20:01:50 |
| 1 | 000RGOAM | 2019-12-10T22:37:09Z | 3.087 | 36.707 | 3.081 | 36.727 | 3918 | 576 | 3.918 | 2019-12-10 22:37:09 |
| 2 | 001QSGIH | 2019-11-23T20:36:10Z | 3.144 | 36.739 | 3.088 | 36.742 | 7265 | 526 | 7.265 | 2019-11-23 20:36:10 |
| 3 | 002ACV6R | 2019-12-01T05:43:21Z | 3.239 | 36.784 | 3.054 | 36.763 | 23350 | 3130 | 23.35 | 2019-12-01 05:43:21 |
| 4 | 0039Y7A8 | 2019-12-17T20:30:20Z | 2.912 | 36.707 | 3.207 | 36.698 | 36613 | 2138 | 36.613 | 2019-12-17 20:30:20 |
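
An alternative to replacing characters by hand is to let pandas parse the ISO 8601 strings directly, which also makes it easy to derive the time-of-day features that the business questions call for. The sketch below uses the five timestamps shown above. The derived column names `hour`, `day_of_week`, and `date` are assumptions chosen for illustration.

```python
import pandas as pd

# Timestamps copied from the first rows of train_df.
sample = pd.DataFrame({
    "Timestamp": [
        "2019-12-04T20:01:50Z",
        "2019-12-10T22:37:09Z",
        "2019-11-23T20:36:10Z",
        "2019-12-01T05:43:21Z",
        "2019-12-17T20:30:20Z",
    ]
})

# pd.to_datetime parses the 'T'/'Z' format; utc=True keeps the result timezone-aware.
parsed = pd.to_datetime(sample["Timestamp"], utc=True)

sample["hour"] = parsed.dt.hour                  # hour of day, 0-23 (UTC)
sample["day_of_week"] = parsed.dt.dayofweek      # Monday=0 ... Sunday=6
sample["date"] = parsed.dt.strftime("%Y-%m-%d")  # string in the same format as weather_df['date']

print(sample)
```

The trailing `Z` marks these timestamps as UTC, so the hours would need converting to local time before being labeled as rush hours. The `date` string has the same format as the `date` column in the weather data, which could make it usable as a join key.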


### Convert seconds to hours, minutes, and seconds

In [6]:
# Function to convert seconds to hours, minutes, and seconds
def convert_seconds(ETA):
    hours = ETA // 3600
    minutes = (ETA % 3600) // 60
    seconds = ETA % 60
    return f"{hours}h {minutes}m {seconds}s"

# Apply the function to the 'ETA' column and create a new column 'time_taken'
train_df['time_taken'] = train_df['ETA'].apply(convert_seconds)

print(train_df)


             ID             Timestamp  Origin_lat  Origin_lon  \
0      000FLWA8  2019-12-04T20:01:50Z       3.258      36.777   
1      000RGOAM  2019-12-10T22:37:09Z       3.087      36.707   
2      001QSGIH  2019-11-23T20:36:10Z       3.144      36.739   
3      002ACV6R  2019-12-01T05:43:21Z       3.239      36.784   
4      0039Y7A8  2019-12-17T20:30:20Z       2.912      36.707   
...         ...                   ...         ...         ...   
83919  ZZXN4JH2  2019-11-30T23:21:58Z       3.121      36.743   
83920  ZZXQ5AQJ  2019-11-27T05:59:31Z       3.024      36.749   
83921  ZZXYPKGU  2019-12-06T05:04:06Z       3.189      36.721   
83922  ZZYTQHKT  2019-12-07T05:55:22Z       3.046      36.738   
83923  ZZZY11ZN  2019-12-12T21:22:31Z       2.889      36.762   

       Destination_lat  Destination_lon  Trip_distance   ETA  time_taken  
0                3.003           36.718          39627  2784  0h 46m 24s  
1                3.081           36.727           3918   576   0h 9m 
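
The formatted `time_taken` strings are easy to read but cannot be used for arithmetic. A possible complement, sketched below on the ETA values from the first five training rows, is a timedelta column or a plain numeric minutes column. The variable names are illustrative.

```python
import pandas as pd

# ETA values (in seconds) from the first rows of train_df.
eta_seconds = pd.Series([2784, 576, 526, 3130, 2138], name="ETA")

# Vectorised conversion to timedeltas, without calling a Python function per row.
eta_timedelta = pd.to_timedelta(eta_seconds, unit="s")

# Numeric minutes are usually easier to plot or model than formatted strings.
eta_minutes = eta_seconds / 60

print(eta_timedelta)
print(eta_minutes.round(1))
```

The first value, 2784 seconds, becomes a timedelta of 46 minutes 24 seconds, matching the `0h 46m 24s` produced by `convert_seconds`.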

### Function to convert meters to kilometers

In [10]:
# Function to convert meters to kilometers
def convert_to_km(Trip_distance):
    return Trip_distance / 1000

# Apply the function to the 'Trip_distance' column and create a new column 'Distance_KM'
train_df['Distance_KM'] = train_df['Trip_distance'].apply(convert_to_km)

print(train_df)

             ID             Timestamp  Origin_lat  Origin_lon  \
0      000FLWA8  2019-12-04T20:01:50Z       3.258      36.777   
1      000RGOAM  2019-12-10T22:37:09Z       3.087      36.707   
2      001QSGIH  2019-11-23T20:36:10Z       3.144      36.739   
3      002ACV6R  2019-12-01T05:43:21Z       3.239      36.784   
4      0039Y7A8  2019-12-17T20:30:20Z       2.912      36.707   
...         ...                   ...         ...         ...   
83919  ZZXN4JH2  2019-11-30T23:21:58Z       3.121      36.743   
83920  ZZXQ5AQJ  2019-11-27T05:59:31Z       3.024      36.749   
83921  ZZXYPKGU  2019-12-06T05:04:06Z       3.189      36.721   
83922  ZZYTQHKT  2019-12-07T05:55:22Z       3.046      36.738   
83923  ZZZY11ZN  2019-12-12T21:22:31Z       2.889      36.762   

       Destination_lat  Destination_lon  Trip_distance   ETA  Distance_KM  
0                3.003           36.718          39627  2784       39.627  
1                3.081           36.727           3918   576       

In [11]:
# Apply the function to the test data
test_df['Distance_KM'] = test_df['Trip_distance'].apply(convert_to_km)

print(test_df)

             ID             Timestamp  Origin_lat  Origin_lon  \
0      000V4BQX  2019-12-21T05:52:37Z       2.981      36.688   
1      003WBC5J  2019-12-25T21:38:53Z       3.032      36.769   
2      004O4X3A  2019-12-29T21:30:29Z       3.035      36.711   
3      006CEI5B  2019-12-31T22:51:57Z       2.902      36.738   
4      009G0M2T  2019-12-28T21:47:22Z       2.860      36.692   
...         ...                   ...         ...         ...   
35620  ZZXSJW3Q  2019-12-21T04:10:59Z       2.947      36.748   
35621  ZZYPNYYY  2019-12-30T20:31:22Z       3.037      36.742   
35622  ZZYVPKXY  2019-12-27T20:21:38Z       2.993      36.723   
35623  ZZZXGRIO  2019-12-29T22:00:31Z       2.954      36.743   
35624  ZZZYTWJA  2019-12-20T22:44:19Z       2.982      36.760   

       Destination_lat  Destination_lon  Trip_distance  Distance_KM  
0                2.978           36.754          17549       17.549  
1                3.074           36.751           7532        7.532  
2        
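
Because the same conversion is applied to both the training and test data, it could be wrapped in one helper that divides the whole column at once instead of calling a function per row. The sketch below assumes a hypothetical helper named `add_distance_km` and uses distances copied from the rows displayed above.

```python
import pandas as pd


def add_distance_km(df, source_col="Trip_distance", target_col="Distance_KM"):
    """Return a copy of df with distances converted from meters to kilometers.

    Hypothetical helper; the default column names are those used in this notebook.
    """
    out = df.copy()
    out[target_col] = out[source_col] / 1000  # element-wise on the whole column
    return out


# Small samples built from the rows displayed above.
train_sample = pd.DataFrame({"Trip_distance": [39627, 3918, 7265]})
test_sample = pd.DataFrame({"Trip_distance": [17549, 7532, 10194]})

train_sample = add_distance_km(train_sample)
test_sample = add_distance_km(test_sample)

print(train_sample)
print(test_sample)
```

Returning a copy leaves the input frame unchanged, which helps when notebook cells are re-run out of order.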

# Data Preparation

# Modeling and Evaluation

# Deployment