<a href="https://colab.research.google.com/github/saud-py/Uber-Data-analysis-using-Pyspark/blob/main/Uber_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uber Data Analysis in PySpark

We use an Uber dataset of city supply and demand to gain insights from the data, working from cleaning and querying through to feature engineering.

The goal of this project is to gain insight into ride supply and demand by cleaning, transforming, and analyzing the data using PySpark.

This means that during the hour beginning at 4 pm (hour 16), on September 10th, 2012, 11 people opened the Uber app (Eyeballs). 2 of them did not see any car (Zeroes) and 4 of them requested a car (Requests). Of the 4 requests, only 3 complete trips actually resulted (Completed Trips). During this time, there were a total of 6 drivers who logged in (Unique Drivers).
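
As a quick illustration of how these five counts relate, the sketch below writes the hour described above as a plain Python dictionary and derives a few ratios that the later questions rely on. The dictionary keys and the `ratio` helper are names chosen for this example, not part of the dataset.

```python
# The sample hour described above, written as a plain dictionary.
# Key names are illustrative; the CSV uses its own column headers.
sample_hour = {
    "eyeballs": 11,
    "zeroes": 2,
    "requests": 4,
    "completed_trips": 3,
    "unique_drivers": 6,
}


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


# Share of app opens that saw no car
zero_rate = ratio(sample_hour["zeroes"], sample_hour["eyeballs"])
# Completed trips per logged-in driver
trips_per_driver = ratio(sample_hour["completed_trips"], sample_hour["unique_drivers"])
# Share of requests that ended in a completed trip
completion_rate = ratio(sample_hour["completed_trips"], sample_hour["requests"])

print(round(zero_rate, 3), trips_per_driver, completion_rate)  # 0.182 0.5 0.75
```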

Requirements to be satisfied

1) Which date had the most completed trips during the two-week period?

2) What was the highest number of completed trips within a 24-hour period?

3) Which hour of the day had the most requests during the two-week period?

4) What percentage of all zeroes during the two-week period occurred on the weekend (Friday at 5 pm to Sunday at 3 am)? Tip: The local time value is the start of the hour (e.g. 15 is the hour from 3:00 pm - 4:00 pm).

5) What is the weighted average ratio of completed trips per driver during the two-week period? Tip: “Weighted average” means your answer should account for the total trip volume in each hour to determine the most accurate number in the whole period.

6) In drafting a driver schedule in terms of 8-hour shifts, when are the busiest 8 consecutive hours over the two-week period in terms of unique requests? A new shift starts every 8 hours. Assume that a driver will work the same shift each day.

7) True or False: Driver supply always increases when demand increases during the two-week period. Tip: Visualize the data to confirm your answer if needed.

8) In which 72-hour period is the ratio of Zeroes to Eyeballs the highest?

9) If you could add 5 drivers to any single hour of every day during the two-week period, which hour should you add them to? Hint: Consider both rider eyeballs and driver supply when choosing.

10) Looking at the data from all two weeks, which time might make the most sense to consider a true “end of day” instead of midnight? (i.e., when are supply and demand both at their natural minimums?)
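
The weekend boundary in question 4 is easy to get wrong by an hour, so here is a minimal sketch of it in plain Python. It assumes weekdays are numbered as in `datetime.date.weekday()` (Monday is 0) and that the hour is the integer start of the hour, as the tip states. The function names and the sample rows are made up for illustration.

```python
FRIDAY, SATURDAY, SUNDAY = 4, 5, 6  # same numbering as datetime.date.weekday()


def is_weekend_hour(weekday, hour):
    """True if the hour starting at `hour` falls between Friday 5 pm and Sunday 3 am."""
    if weekday == FRIDAY:
        return hour >= 17  # hour 17 is 5:00 pm - 6:00 pm
    if weekday == SATURDAY:
        return True
    if weekday == SUNDAY:
        return hour < 3    # hour 2 (2:00 am - 3:00 am) is the last one included
    return False


def weekend_zero_share(rows):
    """rows: list of (weekday, hour, zeroes). Returns the weekend share as a percentage."""
    total = sum(zeroes for _, _, zeroes in rows)
    weekend = sum(zeroes for day, hour, zeroes in rows if is_weekend_hour(day, hour))
    return 100.0 * weekend / total if total else 0.0


# Made-up rows: Friday 4 pm, Friday 5 pm, Saturday noon, Sunday 3 am
rows = [(4, 16, 2), (4, 17, 3), (5, 12, 4), (6, 3, 1)]
print(weekend_zero_share(rows))  # 70.0
```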

In [None]:
# First, install PySpark and its helper packages so that we can initialize a Spark session
!pip install pyspark py4j
!pip install -q findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=2c71f3e8df358133f992d64d570eb1f9f05ad1b1cc02ec8060b016b6bfa513f4
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [None]:
# Locate the Spark installation, then import PySpark
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create a SparkSession
spark = SparkSession.builder.appName("UberDataAnalysis").getOrCreate()

# Load the dataset into a DataFrame, reading the header row and inferring column types
df = spark.read.csv("uber.csv", header=True, inferSchema=True)

In [None]:
df.columns

['Date',
 'Time',
 'Eyeballs ',
 'Zeroes ',
 'Completed',
 'Requests ',
 'Unique Drivers']
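
Several of the column names above end with a space (`'Eyeballs '`, `'Zeroes '`, `'Requests '`), which makes later lookups by name fragile. A minimal sketch of one way to normalize them is shown below. `clean_column_names` is a hypothetical helper, and the list is copied from the output above so the example runs without a Spark session.

```python
def clean_column_names(columns):
    """Strip leading and trailing whitespace from each column name."""
    return [name.strip() for name in columns]


# Column list copied from the df.columns output above
raw_columns = ['Date', 'Time', 'Eyeballs ', 'Zeroes ', 'Completed', 'Requests ', 'Unique Drivers']
clean_columns = clean_column_names(raw_columns)
print(clean_columns)

# With a live DataFrame, the cleaned names can be applied in one call:
# df = df.toDF(*clean_column_names(df.columns))
```

If the names are cleaned this way, any later cell that refers to a column with its trailing space would need the same adjustment.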

1) Which date had the most completed trips during the two-week period?

To answer this, we group the data by date and sum the completed trips for each date. We then sort the totals in descending order and select the top row.

In [None]:
from pyspark.sql import functions as F

# Total completed trips for each date
completed_trips_by_date = df.groupBy("Date").agg(
    F.sum("Completed").alias("total_completed")
)

# Sort the totals in descending order and read the date from the top row
date_with_most_completed_trips = (
    completed_trips_by_date
    .orderBy(F.col("total_completed").desc())
    .first()["Date"]
)

print(date_with_most_completed_trips)

None
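
The `None` printed above is consistent with `Date` being null on most rows, for example if the CSV records the date only on the first row of each day. That is an assumption about the file, not something the output confirms. Under that assumption, the dates need to be carried forward before grouping. The plain-Python sketch below shows the idea on made-up rows; the function names and the sample values are hypothetical. In Spark, a comparable forward fill can be built with `last("Date", ignorenulls=True)` over an ordered window.

```python
from collections import defaultdict


def forward_fill_dates(rows):
    """rows: list of (date_or_None, completed). Carries the last seen date forward."""
    filled, current = [], None
    for date, completed in rows:
        if date is not None:
            current = date
        filled.append((current, completed))
    return filled


def date_with_most_completed(rows):
    """Return (date, total_completed) for the date with the largest total."""
    totals = defaultdict(int)
    for date, completed in forward_fill_dates(rows):
        totals[date] += completed
    return max(totals.items(), key=lambda item: item[1])


# Made-up rows: the date appears only on the first row of each day
sample = [("day-1", 2), (None, 3), (None, 1), ("day-2", 4), (None, 5)]
print(date_with_most_completed(sample))  # ('day-2', 9)
```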


2) What was the highest number of completed trips within a 24-hour period?

To find the highest number of completed trips within a 24-hour period, we sum the Completed column over a rolling 24-hour window. Then we sort the results in descending order and select the top row.


In [None]:
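from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A rolling window needs a numeric or timestamp ordering, so build epoch seconds.
# Assumptions: Date is populated on every row and parses with to_date() (pass an
# explicit format if the file uses another layout); Time is the integer start hour.
df_ts = df.withColumn(
    "epoch_seconds",
    F.unix_timestamp(F.to_date("Date")) + F.col("Time") * 3600,
)

# Rolling window: the current hour plus the following 23 hours
next_24_hours = Window.orderBy("epoch_seconds").rangeBetween(0, 24 * 3600 - 1)

# Total completed trips in the 24 hours starting at each row, largest first
completed_trips_by_window = df_ts \
.withColumn("Total Completed Trips", F.sum("Completed").over(next_24_hours)) \
.orderBy("Total Completed Trips", ascending = False)

# Highest number of trips completed in any 24-hour period
highest_completed_trips = completed_trips_by_window \
.select("Total Completed Trips") \
.first()["Total Completed Trips"]

print(highest_completed_trips)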
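
To check the rolling-window logic for question 2 independently of Spark, the plain-Python sketch below slides a fixed-length window over a list of hourly counts. It assumes the list holds one entry per hour in chronological order with no gaps. `max_rolling_sum` is a hypothetical helper, and the sample counts are made up.

```python
def max_rolling_sum(hourly_values, window_hours=24):
    """Return (start_index, total) of the window of `window_hours` consecutive
    entries with the largest sum. The first such window wins on ties."""
    if len(hourly_values) < window_hours:
        raise ValueError("need at least one full window of data")
    current = sum(hourly_values[:window_hours])
    best_start, best_sum = 0, current
    for start in range(1, len(hourly_values) - window_hours + 1):
        # Slide by one hour: add the hour that entered, drop the hour that left
        current += hourly_values[start + window_hours - 1] - hourly_values[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum


# Made-up counts, with a 3-hour window so the result is easy to verify by hand
print(max_rolling_sum([1, 4, 2, 7, 3, 0], window_hours=3))  # (1, 13)
```

On the real data, the input would be the `Completed` values sorted by date and hour, with `window_hours` left at its default of 24.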

In [None]:
df.show()