## Uber Driver Earning Model

* [Introduction](#chapter1)
* [Understanding the Uber earning system](#chapter2)
* [Fare amount vs. Distance travelled](#chapter3)
* [Factors affecting Fare amount](#chapter4)
* [How fare amount and distance travelled vary for each hour of each days in a week](#chapter5)
* [General market behavior of Uber riders](#chapter6)
* [Summary of Uber market behavior](#chapter7)
* [Approach for assessing optimal working period for a Uber Driver](#chapter8)
* [Best working time slot for Uber Driver, Zeeshan](#chapter9)

## Introduction   <a class="anchor"  id="chapter1"></a>
This project develops a model that helps an Uber driver determine the **optimal working period**, meaning the period that generates the **most earnings**.
As the analysis below shows, earnings vary with the **time, day, and working location**, so it is important to understand Uber market behavior in the area where the driver usually works. Based on the driver's own working pattern and working area, the model estimates the average hourly fare for each hour of the day and each day of the week. It then determines the **best starting hour** for each day of the week, so that the driver can **maximize earnings over 8 consecutive hours**. This can help Uber drivers plan their work schedule and optimize their earning potential.

The model is applied to a dataset provided by an Uber driver, [Zeeshan](https://www.kaggle.com/datasets/zusmani/uberdrives). The project finds that Zeeshan's estimated **earnings** can be at least **double**, or even more than double, the **median salary of an Uber driver** in his working city.



**Dataset:**
A dataset containing the trip information of an Uber driver is required.
It should contain the date and time each trip began and ended, the duration of each trip in minutes, and the distance travelled. The accuracy of the model depends on the number of trips in the dataset. Uber drivers should be able to get this information from Uber, together with their earnings for each trip. In this project, however, the dataset provided by the driver does not contain the earnings for each trip. To work around this, a formula is applied to estimate the earnings. This affects only the estimated earnings, not the optimal working period.

**Approach:**
1. Determined the fare for each trip by a formula mentioned below
2. Grouped the data and calculated the average fare by each hour of the day and each day of the week. (Monday 00:00 - 23:00 ... Sunday 00:00 - 23:00)
3. Calculated the total fares for 8 consecutive hours for each hour of the day and each day of the week.
4. Determined the highest total fares for working 8 consecutive hours and the hour to begin for each day of the week.

**Formula:**
As Uber does not provide the formula for calculating the fare of Uber Driver, a formula suggested by [INSHUR (Motor Insurance Company)](https://inshur.com/blog/how-much-does-an-uber-driver-make-in-new-york-city/#:~:text=In%20New%20York%2C%20Uber%20charges,a%20Minimum%20Fare%20of%20%248.) is applied instead.

**Fare = ((base fee + rate per minute \* minute travelled + rate per mile \* miles travelled) \* surge boost multiplier + booking fee) \* (100% - [Service Fee](https://www.uber.com/global/en/price-estimate/#:~:text=The%20base%20rate%20is%20determined,and%20distance%20of%20a%20trip.)%)** (These factors will be explained later.)

(Please note that the surge boost multiplier is set dynamically by the Uber system, and the booking fee may not apply to every trip. Therefore, in this model, the surge boost multiplier is assumed to be 1 and the booking fee to be \\$0. As a result, the actual fares an Uber driver earns should be higher than the estimated fare amount.)
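
As a rough illustration, the formula above can be written as a small Python function. The function name `estimate_fare` and its parameter names are hypothetical, and the default rates are the ones this notebook applies later (\\$0 base fee, \\$0.74 per minute, \\$1.62 per mile, 20% service fee). Treat it as a sketch under those assumptions, not as Uber's actual calculation.

```python
def estimate_fare(minutes, miles, base_fee=0.0, rate_per_minute=0.74,
                  rate_per_mile=1.62, surge_multiplier=1.0,
                  booking_fee=0.0, service_fee=0.20):
    """Estimate driver earnings for one trip (hypothetical helper)."""
    # Time- and distance-based part of the fare, scaled by the surge multiplier
    metered = (base_fee + rate_per_minute * minutes
               + rate_per_mile * miles) * surge_multiplier
    # Add the booking fee, then deduct Uber's service fee
    return (metered + booking_fee) * (1 - service_fee)


# A 10-minute, 5-mile trip with no surge and no booking fee
print(round(estimate_fare(minutes=10, miles=5), 2))
# The same trip with a hypothetical 1.5x surge multiplier
print(round(estimate_fare(minutes=10, miles=5, surge_multiplier=1.5), 2))
```

With the surge multiplier at 1 and the booking fee at \\$0, as assumed in this model, the function reduces to the service-fee-adjusted sum of the time and distance charges.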

Because the factors in the formula differ between countries and cities, and because the model is based on a specific Uber driver (the country and city the driver works in, and the driver's own working pattern), the model is meant to generate a working period tailored to that driver. Kaggle users are welcome to use this model to determine the optimal working hours for an Uber driver by importing a dataset that contains the **trips the driver made (date and time each trip began and ended, and distance travelled)** together with the **factors for the fare formula**.

In the following sections, [Zeeshan](https://www.kaggle.com/datasets/zusmani/uberdrives)'s data is analyzed and fed into the model to determine his optimal working period. (Thank you, Zeeshan, for sharing the Uber Drives dataset.)

### Understanding the Uber earning system  <a class="anchor"  id="chapter2"></a>

To begin, it is important to understand Uber's earning system.
According to the [Uber website](https://www.uber.com/ca/en/drive/how-much-drivers-make/), an Uber driver's earnings are made up of the **Standard Fare (base fare + amounts for each minute and mile driven)**, minimum trip earnings, service fee, booking fee, cancellation fee, and, most importantly, **[Surge Pricing](https://www.uber.com/ca/en/drive/driver-app/how-surge-works/)**. Surge pricing is applied when demand is high.
Examples of **high demand**: bad weather, rush hour, and special events.
In these cases of very high demand, **prices** may **increase** to help ensure that those who need a ride can get one. Surge prices are calculated based on a multiplier to standard rates, an additional surge amount, or an upfront fare including the surge amount. This varies by city. Uber’s service fee percentage does not change during surge pricing.

This indicates that an Uber driver's earnings depend on the **city** they work in, the **time and day** they work, the **distance** they travel, the **minutes** they drive, and so on.

To better understand the fare system and Uber market behavior, a fare dataset is used. (The dataset is provided by [M YASSER H](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset).)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [None]:
fares = pd.read_csv('/content/drive/MyDrive/pallavi /section b/uber /uber.csv')
user = pd.read_csv('/content/drive/MyDrive/pallavi /section b/uber /My Uber Drives - 2016.csv')

In [None]:
##Parse the pickup timestamps, then use the haversine formula to calculate the trip distance (km) from the lat and lon
fares["pickup_datetime"] = pd.to_datetime(fares["pickup_datetime"])
fares["pickup_day"] = fares["pickup_datetime"].dt.day_name()
fares["pickup_hour"] = fares["pickup_datetime"].dt.hour

def haversine_distance(lat1, lon1, lat2, lon2):
    earth_radius = 6371 # km

    dlat = np.radians(lat2-lat1)
    dlon = np.radians(lon2-lon1)
    a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = earth_radius * c

    return d

fares["distance"] = haversine_distance(fares['pickup_latitude'], fares['pickup_longitude'], fares['dropoff_latitude'], fares['dropoff_longitude'])

In [None]:
##Data Cleaning
fares["fare_per_km"] = fares.fare_amount / fares.distance
fares = fares.drop(fares[(fares.distance <1) | (fares.distance >= 300)].index)
fares = fares.drop(fares[(fares.pickup_longitude == 0) | (fares.pickup_latitude == 0)].index)
fares = fares.drop(fares[(fares.dropoff_longitude == 0) | (fares.dropoff_latitude == 0)].index)
fares = fares.drop(fares[(fares.pickup_longitude >= 180) | (fares.pickup_longitude <= -180)].index)
fares = fares.drop(fares[(fares.dropoff_longitude >= 180) | (fares.dropoff_longitude <= -180)].index)
fares = fares.drop(fares[(fares.pickup_latitude >= 90) | (fares.pickup_latitude <= -90)].index)
fares = fares.drop(fares[(fares.dropoff_latitude >= 90) | (fares.dropoff_latitude <= -90)].index)
fares = fares.drop(fares[((fares.fare_amount == max(fares.fare_amount)) | (fares.fare_amount < 2.5))].index)
fares = fares.drop(fares[(fares.passenger_count == 208) | (fares.passenger_count < 1)].index)
fares = fares.drop(fares[(fares.fare_per_km< 0.81)].index )

In [None]:
##Linear Regression of distance vs fare amount
sns.regplot(x = fares.fare_amount, y = fares.distance)
plt.title("Relationship between Fare Amount and Distance travelled")

## Fare amount vs. Distance travelled <a class="anchor"  id="chapter3"></a>
The **haversine formula** is used to calculate the distance of each trip from the latitude and longitude of its pickup and dropoff points.

The **linear regression** graph shows the relationship between fare amount and distance travelled for a trip.

* There is a clear **positive relationship**, which matches the suggested formula.
* **Distance travelled** is therefore treated as a factor in the fare formula.
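
For reference, the `haversine_distance` function in the code above corresponds to the standard haversine formula written out below. Here $\varphi$ denotes latitude and $\lambda$ longitude (both in radians), $\Delta\varphi$ and $\Delta\lambda$ are the differences between the dropoff and pickup points, and $R = 6371$ km is the Earth radius used in the code. The symbols are chosen for this note and do not appear elsewhere in the notebook.

$$
a = \sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right), \qquad
c = 2\,\operatorname{atan2}\left(\sqrt{a},\ \sqrt{1-a}\right), \qquad
d = R\,c
$$

The result $d$ is a straight-line (great-circle) distance, so it is a lower bound on the distance actually driven on the road.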

In [None]:
fares_passenger = fares.groupby("passenger_count", as_index = False).agg(fare_per_km = ("fare_per_km", "mean"), counted = ("key", "count"), distance = ("distance", "mean"))
fares_day = fares.groupby("pickup_day", as_index = False).agg(fare_per_km = ("fare_per_km", "mean"), counted = ("key", "count"), distance = ("distance", "mean"))
m = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
fares_day = fares_day.set_index('pickup_day').reindex(m).reset_index()
fares_hour = fares.groupby("pickup_hour", as_index = False).agg(fare_per_km = ("fare_per_km", "mean"), counted = ("key", "count"), passenger_count = ("passenger_count", "mean"), distance = ("distance", "mean"))
fig, axes = plt.subplots(1,3 , figsize=(20, 10))
axes[0].set_title("Average fare amount with different numbers of passengers")
sns.lineplot(x = fares_passenger.passenger_count, y = fares_passenger.fare_per_km,ax=axes[0])
axes[1].set_title("Average fare amount in a week")
sns.barplot(x = fares_day.pickup_day, y = fares_day.fare_per_km,ax=axes[1])
axes[2].set_title("Average fare amount in a day")
sns.lineplot(x = fares_hour.pickup_hour, y = fares_hour.fare_per_km,ax=axes[2])

## Factors affecting Fare amount <a class="anchor"  id="chapter4"></a>

The line graphs and bar chart show how fares relate to the number of passengers, the day, and the time.
* There is a **positive relationship** between the **number of passengers** and fares.
* Fares also vary with the **day and time**.

According to these graphs, the number of passengers, the day, and the time are all factors that affect fares.

In the following, day and time are analyzed to observe how they affect fares.
The number of passengers is the controlled variable and is fixed at 1.
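
One compact way to look at day and time together is a day-by-hour table of the average fare per km. The sketch below builds such a table with `pandas.pivot_table` on a tiny made-up sample. The sample values are purely illustrative, and only the column names `pickup_day`, `pickup_hour`, and `fare_per_km` come from this notebook.

```python
import pandas as pd

# Made-up sample rows that reuse the column names of the fares table
sample = pd.DataFrame({
    "pickup_day":  ["Monday", "Monday", "Monday", "Saturday", "Saturday"],
    "pickup_hour": [9, 9, 18, 9, 23],
    "fare_per_km": [1.5, 2.5, 3.0, 2.0, 4.0],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Rows are days, columns are hours, cells hold the mean fare per km
fare_table = sample.pivot_table(index="pickup_day", columns="pickup_hour",
                                values="fare_per_km", aggfunc="mean")
# Put the days in calendar order; days without trips become all-NaN rows
fare_table = fare_table.reindex(day_order)
print(fare_table)
```

A table like this holds the same information as one data frame per day, and it can be plotted directly or inspected for gaps (hours with no trips show up as NaN).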

In [None]:
##fares vary in 24 hours for each day
fares = fares[fares["passenger_count"] ==1]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

for day in days:
    mask = fares["pickup_day"] == day
    fares_day = fares[mask].groupby("pickup_hour", as_index = False).agg(fare_per_km = ("fare_per_km", "mean"), counted = ("key", "count"), distance = ("distance", "mean"), fare_amount = ("fare_amount", "mean"))
    exec(f"fares_{day.lower()} = fares_day")


fares_weekday = fares[(fares["pickup_day"] != "Saturday") & (fares["pickup_day"] != "Sunday")].groupby("pickup_hour", as_index = False).agg(fare_per_km = ("fare_per_km", "mean"), counted = ("key", "count"), distance = ("distance", "mean"),fare_amount = ("fare_amount", "mean"))
fares_weekday["counted"] = fares_weekday["counted"] /5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(fares_monday.pickup_hour, fares_monday.fare_per_km, label='Monday')
ax1.plot(fares_monday.pickup_hour, fares_tuesday.fare_per_km, label='Tuesday')
ax1.plot(fares_monday.pickup_hour, fares_wednesday.fare_per_km, label='Wednesday')
ax1.plot(fares_monday.pickup_hour, fares_thursday.fare_per_km, label='Thursday')
ax1.plot(fares_monday.pickup_hour, fares_friday.fare_per_km, label='Friday')
ax1.set_title('Average Fare Amount for each hour of each days in a week')
ax1.set_xlabel('hours')
ax1.set_ylabel('fare amount per km')
ax1.set_xticks(np.arange(0, 24, 2))
ax1.set_xticklabels(np.arange(0, 24, 2))
ax1.legend()

ax2.plot(fares_monday.pickup_hour, fares_monday.fare_per_km, label='Monday')
ax2.plot(fares_monday.pickup_hour, fares_weekday.fare_per_km, label='weekday')
ax2.plot(fares_monday.pickup_hour, fares_saturday.fare_per_km, label='Saturday')
ax2.plot(fares_monday.pickup_hour, fares_sunday.fare_per_km, label='Sunday')
ax2.set_title('Average Fare Amount for each hour of each days in a week')
ax2.set_xlabel('hours')
ax2.set_ylabel('fare amount per km')
ax2.set_xticks(np.arange(0, 24, 2))
ax2.set_xticklabels(np.arange(0, 24, 2))
ax2.legend()

fig, ax = plt.subplots()
ax.plot(fares_monday.pickup_hour, fares_monday.distance, label='Monday')
ax.plot(fares_monday.pickup_hour, fares_tuesday.distance, label='Tuesday')
ax.plot(fares_monday.pickup_hour, fares_wednesday.distance, label='Wednesday')
ax.plot(fares_monday.pickup_hour, fares_thursday.distance, label='Thursday')
ax.plot(fares_monday.pickup_hour, fares_friday.distance, label='Friday')
ax.plot(fares_monday.pickup_hour, fares_saturday.distance, label='Saturday')
ax.plot(fares_monday.pickup_hour, fares_sunday.distance, label='Sunday')
ax.set_title('Average Distance travelled for each hour of each days in a week')
ax.set_xlabel('hours')
ax.set_ylabel('Distance travelled in km')
ax.set_xticks(np.arange(0, 24, 2))
ax.set_xticklabels(np.arange(0, 24, 2))
ax.legend()

##  How fare amount and distance travelled vary for each hour of each days in a week <a class="anchor"  id="chapter5"></a>

Fare amount per km:
* From 05:00 to 18:00, weekdays have a higher fare amount than weekends **(because of working hours)**.
* From 00:00 to 05:00 and from 18:00 to 00:00 the next day, Saturday has a higher fare amount than the other days **(as it is a day off)**.
* Sunday follows a similar pattern to Saturday, but with a lower fare amount.
* On weekdays, 09:00 to 10:00 and 11:00 to 14:00 have a higher fare amount than other periods **(rush hour to school and work, lunch time)**.
* Friday has a higher fare amount from 21:00 to 00:00 the next day **(the weekend begins)**.
* Monday has the lowest fare amount across the 24 hours.

Distance travelled:
* All days of the week **vary similarly**.
* The **longest distances** are travelled **between midnight and dawn**.
* Sunday has a higher distance travelled, while Saturday has the lowest.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
ax1.plot(fares_monday.pickup_hour, fares_monday.counted, label='Monday')
ax1.plot(fares_monday.pickup_hour, fares_tuesday.counted, label='Tuesday')
ax1.plot(fares_monday.pickup_hour, fares_wednesday.counted, label='Wednesday')
ax1.plot(fares_monday.pickup_hour, fares_thursday.counted, label='Thursday')
ax1.plot(fares_monday.pickup_hour, fares_friday.counted, label='Friday')

ax1.set_title('Average number of trips for each hour of each days in a week')
ax1.set_xlabel('hours')
ax1.set_ylabel('number of trips')
ax1.set_xticks(np.arange(0, 24, 2))
ax1.set_xticklabels(np.arange(0, 24, 2))
ax1.legend()

ax2.plot(fares_monday.pickup_hour, fares_weekday.counted, label='weekday')
ax2.plot(fares_monday.pickup_hour, fares_saturday.counted, label='Saturday')
ax2.plot(fares_monday.pickup_hour, fares_sunday.counted, label='Sunday')
ax2.set_title('Average number of trips for each hour of each days in a week')
ax2.set_xlabel('hours')
ax2.set_ylabel('number of trips')
ax2.set_xticks(np.arange(0, 24, 2))
ax2.set_xticklabels(np.arange(0, 24, 2))
ax2.legend()

##  General market behavior of Uber riders <a class="anchor"  id="chapter6"></a>

Number of trips:
* **Weekdays** have more trips from **06:00 to 11:00** and from **18:00 to 22:00** (**because of office hours**).
* Weekends share a similar pattern from 00:00 to 19:00: the number of trips is very high at 00:00 and drops gradually **(because people usually stay out late at weekends)**, then starts to rise again from 06:00.
* However, from **19:00 to 00:00** the next day, **Saturday** has more trips than Sunday, because people go out on Saturday night but stay at home on Sunday night.
* Weekdays follow a similar pattern to one another, but Monday has the lowest number of trips across the 24 hours. Thursday and Tuesday have more trips in the morning. **Friday** has more trips **from evening to dawn**.

## Summary of Uber market behavior <a class="anchor"  id="chapter7"></a>

* The **number of trips and distance travelled** vary with the **time and day** (such as office hours and weekends), which also affects the fare amount.
* There are **relationships** between **day, time, and fare amount**.

So, in developing the model of the optimal working period for an Uber driver, day and time are key factors that have to be considered.

In [None]:
user.columns = user.columns.str.lower()
user.columns = user.columns.str.replace('*','', regex=False)
user = user.drop(index=1155)
user['start_date'] = pd.to_datetime(user['start_date'], errors='coerce')
user['end_date'] = pd.to_datetime(user['end_date'], errors='coerce')
user["day"] = user["start_date"].dt.day_name()
user["pickup_hour"] = user["start_date"].dt.hour
user["date"] = user["start_date"].dt.date
user['duration_minute'] = (user['end_date'] - user['start_date']).dt.total_seconds()/60
user["miles_per_hour"] = (user.miles/(user.duration_minute/60))
##Fares = Base fare + rate per minutes * minutes + rate per miles * miles
##NYC fares = $0 + $0.74 * minute + $1.62 * miles
##NYC Minimum fare $8.00 and Uber charges partners 20% fee on all fares.
rate_per_minute = 0.74
rate_per_mile = 1.62
base_fare = 0
minimum_fare = 8
uber_charges = 0.2
##Apply the minimum fare to the gross fare, then deduct the Uber service fee once (it was previously deducted twice)
gross_fare = base_fare + rate_per_minute * user.duration_minute + rate_per_mile * user.miles
user["fare"] = (1 - uber_charges) * gross_fare.clip(lower=minimum_fare)
user_date = user.groupby(["day", "date","pickup_hour"], as_index = False).agg(miles = ("miles", "mean"), counted = ("day", "count"),  duration_minute = ("duration_minute", "mean"), miles_per_hour = ("miles_per_hour", "mean"), fare_per_hour = ("fare", "sum"), fare_per_trip =("fare", "mean") )
user_date.sort_values(by = "counted", ascending= False)
user_day = user_date.groupby(["day","pickup_hour"], as_index = False).agg(miles = ("miles", "mean"), counted = ("counted", "mean"),  duration_minute = ("duration_minute", "mean"),  miles_per_hour = ("miles_per_hour", "mean"), fare_per_hour = ("fare_per_hour", "mean" ), fare_per_trip =("fare_per_trip", "mean"))


In [None]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

for day in days:
    mask = user_day[user_day["day"]== day]
    exec(f"user_{day.lower()} = mask")


##separate user_day into 7 tables
for day in days:
    user_day = eval(f"user_{day.lower()}")
    user_day = user_day.set_index("pickup_hour")
    user_day = user_day.reindex(range(24), fill_value=0)
    user_day = user_day.reset_index()
    user_day["pickup_hour"] = user_day["pickup_hour"].astype(int)
    user_day["day"] = user_day["day"].apply(lambda x: day if x == 0 else x)
    exec(f"user_{day.lower()} = user_day")

##calculate fare amount for next 8 consecutive hours
user_all = pd.concat([user_monday, user_tuesday, user_wednesday, user_thursday, user_friday, user_saturday, user_sunday, user_monday], axis=0)
user_all = user_all.reset_index(drop=True)

n = 0
while n <= 167:
    start_row = n
    end_row = n + 7

    fare_sum = user_all.loc[start_row:end_row, 'fare_per_hour'].sum()
    miles_sum = user_all.loc[start_row:end_row, 'miles'].sum()
    minute_sum = user_all.loc[start_row:end_row, 'duration_minute'].sum()
    minute_mean = user_all.loc[start_row:end_row, 'duration_minute'].mean()
    miles_per_hour_mean = user_all.loc[start_row:end_row, 'miles_per_hour'].mean()
    number_trip = user_all.loc[start_row:end_row, 'counted'].sum()
    numbr_trip_mean = user_all.loc[start_row:end_row, 'counted'].mean()
    user_all.at[start_row, 'fare_for_8_hours'] = fare_sum
    user_all.at[start_row, 'total_miles'] = miles_sum
    user_all.at[start_row, 'total_driving_time'] = minute_sum
    user_all.at[start_row, 'average_driving_time_per_hour'] = minute_mean
    user_all.at[start_row, 'average_speed'] = miles_per_hour_mean
    user_all.at[start_row, 'total_number_of_trip'] = number_trip
    user_all.at[start_row, 'average_number_of_trip_per_hour'] = numbr_trip_mean
    n = n + 1

user_all = user_all.dropna()
user_all = user_all.rename(columns = {"pickup_hour" : "starting_hour"})
user_all["ending_hour"] = user_all["starting_hour"] + 8
user_all["ending_hour"] = user_all["ending_hour"].apply(lambda x: x-24 if x >= 24 else x)

user_max_fare = user_all[user_all["fare_for_8_hours"] == max(user_all.fare_for_8_hours)]

max_fare_day = user_all[user_all["fare_for_8_hours"] == max(user_all.fare_for_8_hours)]["day"]
max_fare_day

for day in days:
    mask = user_all[user_all["day"] == day]
    exec(f"user_{day.lower()} = mask")

max_fare_day = user_monday[user_monday["fare_for_8_hours"] == max(user_monday.fare_for_8_hours)]

days = ["Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day in days:
    mask = eval(f"user_{day.lower()}")
    mask = mask[mask["fare_for_8_hours"] == max(mask.fare_for_8_hours)]
    max_fare_day = pd.concat([max_fare_day, mask])
max_fare_day.sort_values(by = "fare_for_8_hours", ascending = False)
max_fare_day = max_fare_day.drop(columns=["miles", "counted", "duration_minute", "miles_per_hour", "fare_per_trip", "fare_per_hour"])


## Approach for assessing optimal working period for a Uber Driver<a class="anchor"  id="chapter8"></a>
**Fare**

General formula for fares:
**Fares = (100% - Service Fee from Uber) * (Base fare + rate per minutes * minutes + rate per miles * miles)**

As mentioned, the base fare, rates, and service fee vary depending on the country and city the Uber driver works in. The driver in this dataset works in the US, so the fare is calculated with US rates. The rates used here are the New York City rates from the INSHUR article cited in the introduction.

Formula for fares in the US:
US fares = \\$0 + \\$0.74 * minute + \\$1.62 * miles

US minimum fare: \\$8.00, and Uber charges partners a 20% fee on all fares.
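
A minimal sketch of how these rates could be applied to a table of trips follows. It assumes the minimum fare is applied to the gross fare before the 20% service fee is deducted, which is one reading of the rules above and not a confirmed Uber policy. The two sample trips are made up.

```python
import pandas as pd

RATE_PER_MINUTE = 0.74   # USD per minute
RATE_PER_MILE = 1.62     # USD per mile
BASE_FARE = 0.0
MINIMUM_FARE = 8.0
SERVICE_FEE = 0.20       # share of the fare kept by Uber

# Two made-up trips: a short one and a longer one
trips = pd.DataFrame({"duration_minute": [5, 30], "miles": [1.0, 10.0]})

gross = (BASE_FARE
         + RATE_PER_MINUTE * trips["duration_minute"]
         + RATE_PER_MILE * trips["miles"])
# Assumption: the minimum fare is a floor on the gross fare,
# and the service fee is deducted once afterwards
trips["fare"] = (1 - SERVICE_FEE) * gross.clip(lower=MINIMUM_FARE)
print(trips)
```

The short trip's gross fare falls below the minimum, so it is raised to \\$8.00 before the fee is taken; the longer trip is unaffected by the floor.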

**Process**
1. Determined the fare for each trip by a formula mentioned
2. Grouped the data and calculated the average fare by each hour of the day and each day of the week. (Monday 00:00 - 23:00 ... Sunday 00:00 - 23:00)
3. Calculated the total fares for 8 consecutive hours for each hour of the day and each day of the week.
4. Determined the highest total fares for working 8 consecutive hours and the hour to begin for each day of the week.
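
Steps 3 and 4 can be sketched compactly as a rolling 8-hour sum over the 168 hours of the week. The version below uses `numpy.convolve` for the rolling sum instead of an explicit loop; the function name `best_shift_per_day`, its input layout (168 average hourly fares, Monday 00:00 first), and the sample week are all hypothetical.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def best_shift_per_day(hourly_fares, shift_hours=8):
    """Return {day: (start_hour, end_hour, total_fare)} for the best shift."""
    hourly = np.asarray(hourly_fares, dtype=float)
    if hourly.size != 7 * 24:
        raise ValueError("expected 168 hourly values, Monday 00:00 first")
    # Let late-Sunday shifts wrap around into Monday morning
    extended = np.concatenate([hourly, hourly[:shift_hours - 1]])
    # totals[i] = sum of the shift_hours values starting at hour i of the week
    totals = np.convolve(extended, np.ones(shift_hours), mode="valid")
    result = {}
    for day, row in zip(DAYS, totals.reshape(7, 24)):
        start = int(row.argmax())  # earliest start hour if several tie
        result[day] = (start, (start + shift_hours) % 24, float(row[start]))
    return result


# Made-up week: steady Monday daytime fares plus two isolated busy hours
fares_by_hour = np.zeros(168)
fares_by_hour[9:17] = 20.0   # Monday 09:00-16:00
fares_by_hour[167] = 30.0    # Sunday 23:00
fares_by_hour[6] = 30.0      # Monday 06:00
best = best_shift_per_day(fares_by_hour)
print(best["Monday"], best["Sunday"])
```

In this made-up week the best Sunday shift starts at 23:00 and runs into Monday morning, which is why the first hours of Monday are appended to the end of the week before the sums are taken.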

In [None]:
best_working_time = max_fare_day[["day", "starting_hour","ending_hour","fare_for_8_hours"]]
best_working_time.sort_values(by = "fare_for_8_hours", ascending = False)

## Best working time slot for Uber Driver, Zeeshan  <a class="anchor"  id="chapter9"></a>
This table shows the **best working time slot** for Zeeshan on each day of the week.
(It is assumed that Zeeshan works as a full-time Uber driver for 8 consecutive hours and that his working area is in the US.) Users can pick whichever day and start time suit them.

The "fare_for_8_hours" column is the estimated fare the driver would earn from that work period. According to [Talent.com](https://www.talent.com/salary?job=uber+driver#:~:text=The%20average%20uber%20driver%20salary%20in%20the%20USA%20is%20%2436%2C270,up%20to%20%2447%2C515%20per%20year.), the median Uber driver salary is \\$17.44 per hour, which is \\$139.52 for 8 hours. **By following this model**, Zeeshan's estimated **earnings** are **double** the **median salary of an Uber driver**.
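
The benchmark arithmetic can be checked with a few lines of Python. The helper name `earnings_ratio` and the example estimate of \\$300 are hypothetical; they only illustrate how a value from the "fare_for_8_hours" column would be compared against the benchmark.

```python
MEDIAN_HOURLY_WAGE = 17.44  # USD per hour, the Talent.com figure quoted above
SHIFT_HOURS = 8

# Benchmark earnings for one 8-hour shift: 17.44 * 8 = 139.52
benchmark = MEDIAN_HOURLY_WAGE * SHIFT_HOURS


def earnings_ratio(fare_for_8_hours):
    """How many times the benchmark an estimated 8-hour fare is."""
    return fare_for_8_hours / benchmark


print(round(benchmark, 2))
# Hypothetical estimate, not a value taken from the results table
print(round(earnings_ratio(300.0), 2))
```

A ratio of 2 or more corresponds to the "double the median salary" statement above.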

Thank you for reading! Please upvote this project and share it if you find it useful or interesting.

