# ProStr 2021 Project 1
The project scenario involves a dataset of taxi rides, collected circa 2013, in the New York City area.

Each taxi ride corresponds to an event in the dataset, comprising the passenger pick-up and drop-off points and their respective timestamps, as well as information related to the payment, the taxi, and its driver.

This project scenario is inspired by the [ACM DEBS 2015 Grand Challenge](http://www.debs2015.org/call-grand-challenge.html).

## Taxi Ride Event

Each taxi ride event comprises a number of attributes, as follows:

| Attribute   | Description |
| :---        |        :--- |
|medallion| an md5sum of the identifier of the taxi - vehicle bound|
|hack_license| an md5sum of the identifier for the taxi license|
|pickup_datetime| time when the passenger(s) were picked up|
|dropoff_datetime| time when the passenger(s) were dropped off|
|trip_time_in_secs| duration of the trip|
|trip_distance| trip distance in miles|
|pickup_longitude| longitude coordinate of the pickup location|
|pickup_latitude| latitude coordinate of the pickup location|
|dropoff_longitude| longitude coordinate of the drop-off location|
|dropoff_latitude| latitude coordinate of the drop-off location|
|payment_type| the payment method - credit card or cash|
|fare_amount| fare amount in dollars|
|surcharge| surcharge in dollars|
|mta_tax| tax in dollars|
|tip_amount| tip in dollars|
|tolls_amount| bridge and tunnel tolls in dollars|
|total_amount| total paid amount in dollars|

Each event is published as a text string, with the attributes separated by commas.
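
As a minimal sketch, an event string can be split into named fields as shown below. It assumes that the attributes appear in the same order as in the table above; `TaxiRide`, `parse_event`, and the example line are hypothetical and not part of the supplied code or dataset.

```python
from collections import namedtuple

# Attribute names, assumed to follow the order of the table above.
FIELDS = [
    'medallion', 'hack_license', 'pickup_datetime', 'dropoff_datetime',
    'trip_time_in_secs', 'trip_distance',
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'payment_type', 'fare_amount', 'surcharge', 'mta_tax',
    'tip_amount', 'tolls_amount', 'total_amount',
]

TaxiRide = namedtuple('TaxiRide', FIELDS)  # hypothetical helper type


def parse_event(line):
    """Split one event string into a TaxiRide of raw strings.

    Returns None when the number of fields does not match the table.
    """
    tokens = line.strip().split(',')
    if len(tokens) != len(FIELDS):
        return None
    return TaxiRide(*tokens)


# Made-up event, for illustration only (not taken from the dataset).
example = ('aaaa,bbbb,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,'
           '-73.956528,40.716976,-73.962440,40.715008,'
           'CSH,3.50,0.50,0.50,0.00,0.00,4.50')

ride = parse_event(example)
print(ride.medallion, ride.trip_time_in_secs, ride.total_amount)
```

All fields are kept as strings here; converting them to numbers or timestamps is left to the query that needs them.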


## Dataset

The dataset is available in two forms:

* Sample of 20 days (roughly 2 million events) of data (~ 130 MB) [download](https://drive.google.com/file/d/1jF_YmKFskdNgchtUb0GGaUyfxAjMUHdw/view?usp=sharing)
* The whole year of 2013 (~ 173 million events) (~ 12 GB) (~33 GB expanded) [download](https://drive.google.com/file/d/0B4zFfvIVhcMzcWV5SEQtSUdtMWc/view?usp=sharing)

---

* Events are reported at the end of the trip, i.e., upon arrival, in the order of the drop-off timestamps.

* Events with the same *dropoff_datetime* are in random order.

* Quality of the data is **not perfect**.

 + Some events may be missing information, such as the *drop-off* and *pickup* data;

 + Moreover, some information, such as the *fare price*, might have been entered incorrectly by the taxi drivers, introducing additional skew.
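
A minimal sketch of a filter for such events is shown below. The rules are assumptions made for illustration, not rules given with the dataset: an event is kept only if it has 17 fields, its coordinates and fare are numeric, no coordinate is `0.0` (assumed here to mean a missing GPS fix), and the fare is positive. `is_usable` is a hypothetical name.

```python
def is_usable(tokens):
    """Return True if a tokenised event looks complete enough to process.

    The checks are illustrative assumptions, not rules given by the dataset.
    """
    if len(tokens) != 17:              # one token per attribute in the table
        return False
    try:
        coords = [float(t) for t in tokens[6:10]]   # pickup/drop-off lon/lat
        fare = float(tokens[11])                    # fare_amount
    except ValueError:                 # empty or non-numeric field
        return False
    # Assumption: a coordinate of exactly 0.0 stands for a missing GPS fix.
    if any(c == 0.0 for c in coords):
        return False
    return fare > 0                    # assumption: valid rides have a positive fare


# Made-up events, for illustration only.
good = ('aaaa,bbbb,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,'
        '-73.956528,40.716976,-73.962440,40.715008,'
        'CSH,3.50,0.50,0.50,0.00,0.00,4.50').split(',')
missing = list(good)
missing[6] = ''                        # pickup longitude not reported

print(is_usable(good), is_usable(missing))
```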

## Requirements

Out of the following 4 queries, you need to solve **a minimum of 2** (two).

Queries are marked with a number of **€**, as an indication of their expected difficulty and grading points.

---

#### Delivery Format

You can use either Spark Streaming or Spark Structured Streaming.

The solution should be delivered as a Jupyter notebook.

---

#### Bonus

* Solve each query using a different framework (e.g., Q1: Spark Streaming, Q2: Spark SQL)

* Solve a third query.

---

#### Grading

Grading will take into consideration the overall presentation quality of the Jupyter notebook, the correctness of the solution, the quality of the code, and the summary PDF report, where the results are discussed.

---

#### DEADLINE

9th May 2021, 23h59

## Queries

### Q1: Find the top 10 most frequent routes during the last 30 minutes. (€)

• A route is represented by a starting grid cell and an ending grid cell;

• All routes completed within the last 30 minutes are considered for the query;

• Use a grid of 300x300 cells, each corresponding to a square of 500x500 m. See HelperCode.

• All trips starting or ending outside this area are treated as outliers (not to be considered);

---

• Ideally, the output query results should be updated whenever any of the 10 most frequent routes changes;
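
The semantics of Q1 can be illustrated outside Spark with a few lines of plain Python. This is only a sketch of what the query should compute, not a streaming solution: `top_routes` is a hypothetical helper, the rides are made up, and treating the window as `(now - 30 min, now]` is an assumption about the boundary.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def top_routes(rides, now, n=10):
    """Return the n most frequent routes among rides completed in the window.

    rides: iterable of (dropoff_time, start_cell, end_cell) tuples.
    A route is the pair (start_cell, end_cell).
    """
    counts = Counter(
        (start, end)
        for dropoff, start, end in rides
        if now - WINDOW < dropoff <= now      # assumed window boundaries
    )
    return counts.most_common(n)


# Made-up rides, for illustration only.
now = datetime(2013, 1, 1, 1, 0, 0)
rides = [
    (datetime(2013, 1, 1, 0, 20), (160, 155), (161, 156)),  # older than 30 minutes
    (datetime(2013, 1, 1, 0, 40), (160, 155), (161, 156)),
    (datetime(2013, 1, 1, 0, 50), (160, 155), (161, 156)),
    (datetime(2013, 1, 1, 0, 55), (150, 150), (151, 151)),
]

for route, count in top_routes(rides, now):
    print(route, count)
```

In a streaming solution the same grouping would be expressed with the sliding-window operations of the chosen framework, using the drop-off timestamp as event time.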

### Q2: Identify areas that are currently most profitable for taxi drivers. (€€€)

The profitability of an area is determined by dividing the area profit by the number of empty taxis in that area within the last 15 minutes.
    
The profit that originates from an area is computed by calculating the average fare + tip for trips that started in the area and ended within the last 15 minutes.

The number of empty taxis in an area is the number of taxis that had a drop-off location in that area less than 30 minutes ago and have had no subsequent pickup yet.

Note: Unlike in the original DEBS Challenge, use the same 300x300 grid, as in Q1.
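
As a worked example of the definition above, the sketch below computes the profitability of a single area from values that are assumed to be already aggregated. `area_profitability` is a hypothetical helper, and returning `None` when there are no trips or no empty taxis is an assumption, since the statement does not say how to treat those cases.

```python
def area_profitability(fare_plus_tips, empty_taxis):
    """Profitability of one area.

    fare_plus_tips: fare + tip of each trip that started in the area and
                    ended within the last 15 minutes.
    empty_taxis:    number of empty taxis currently in the area.
    """
    # Assumption: profitability is undefined without trips or empty taxis.
    if not fare_plus_tips or empty_taxis == 0:
        return None
    profit = sum(fare_plus_tips) / len(fare_plus_tips)   # average fare + tip
    return profit / empty_taxis


# Made-up values: three trips averaging 12.0 dollars, four empty taxis.
print(area_profitability([10.0, 14.0, 12.0], 4))
```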

### Q3: Detect "Anomalous" Rides (€€)

Provide an answer to the following question: "Are all rides fair?"

Detect rides that cost more or take longer than expected. 

To compute the expected duration or expected cost of a ride, use average values computed over the last hour.
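
One way to picture the "average over the last hour" is a rolling mean keyed on the drop-off time, as sketched below in plain Python. It relies on events arriving in drop-off order, as stated in the Dataset section. `RollingMean`, `is_anomalous`, and the factor of 2 are hypothetical choices made for illustration; the actual anomaly criterion is up to you.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingMean:
    """Mean of the values observed during the last `span` of event time."""

    def __init__(self, span=timedelta(hours=1)):
        self.span = span
        self.items = deque()      # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        self.items.append((ts, value))
        self.total += value
        # Evict entries that fell out of the window ending at ts.
        while self.items and self.items[0][0] <= ts - self.span:
            _, old = self.items.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.items) if self.items else None


def is_anomalous(value, expected, factor=2.0):
    """Hypothetical rule: flag values above `factor` times the expected one."""
    return expected is not None and value > factor * expected


# Made-up trip durations (in seconds), for illustration only.
durations = RollingMean()
durations.add(datetime(2013, 1, 1, 10, 0), 600)
durations.add(datetime(2013, 1, 1, 10, 20), 660)
durations.add(datetime(2013, 1, 1, 10, 40), 540)

expected = durations.mean()
print(expected, is_anomalous(1500, expected), is_anomalous(700, expected))
```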

### Q4: Detect "Anomalous" Drivers (€€)

Provide an answer to the following question: "Are all drivers equal?"

Detect drivers that seem to deviate from the pack in some way.

The criteria used to differentiate drivers are up to you. As a suggestion, are there drivers that
are more efficient, i.e., earn more while driving less time or distance?
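
To make the suggestion concrete, the sketch below computes one possible efficiency measure, earnings (fare + tip) per hour of driving, and flags drivers whose rate is far from the mean. This is only one of many possible criteria: `earnings_per_hour`, `outliers`, the threshold of two standard deviations, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def earnings_per_hour(rides):
    """rides: iterable of (hack_license, trip_time_in_secs, fare, tip)."""
    secs = defaultdict(float)
    money = defaultdict(float)
    for driver, trip_secs, fare, tip in rides:
        secs[driver] += trip_secs
        money[driver] += fare + tip
    # Skip drivers with no recorded driving time to avoid dividing by zero.
    return {d: 3600.0 * money[d] / secs[d] for d in secs if secs[d] > 0}


def outliers(rates, z=2.0):
    """Drivers whose rate is more than z standard deviations from the mean."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        return {}
    return {d: r for d, r in rates.items() if abs(r - mu) > z * sigma}


# Made-up rides: five similar drivers and one who earns more per hour.
rides = [('d%d' % i, 1800, 13.0, 2.0) for i in range(1, 6)]
rides.append(('d6', 1200, 25.0, 5.0))

rates = earnings_per_hour(rides)
print(outliers(rates))
```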

## Suggestions

* Read all the available information in [Debs Challenge](http://www.debs2015.org/call-grand-challenge.html);

* Get familiar with the sample data;

* Sanitize the data, i.e., exclude incomplete data, unused data, and out-of-area rides;

* Compute Streams with converted coordinates to cell grids.

    The queries use a simplified flat-earth assumption for mapping coordinates to cells.
    Moving 500 meters south corresponds to a change of 0.004491556 degrees in latitude; moving 500 meters east
    corresponds to a change of 0.005986 degrees in longitude.

    Use or adapt the supplied code below.
    
* During development define shorter windows to aggregate and preview results faster.
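
The two degree values quoted in the flat-earth suggestion above can be sanity-checked with a little arithmetic. The sketch assumes that one degree of latitude is roughly 111,320 m and that a degree of longitude shrinks with the cosine of the latitude; the reference latitude of 40.75 degrees (around Manhattan) and the name `cell_size_in_meters` are choices made for illustration.

```python
import math

METERS_PER_DEGREE = 111320.0   # approximate length of one degree of latitude
LAT_DELTA = 0.004491556        # values quoted in the suggestion above
LON_DELTA = 0.005986


def cell_size_in_meters(lat_deg):
    """Approximate (height, width) in meters of one grid cell at a latitude."""
    height = LAT_DELTA * METERS_PER_DEGREE
    width = LON_DELTA * METERS_PER_DEGREE * math.cos(math.radians(lat_deg))
    return height, width


height, width = cell_size_in_meters(40.75)
print(round(height, 1), round(width, 1))
```

Both values come out close to 500 m. The width varies slightly with latitude, which is exactly the detail that the flat-earth simplification ignores.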

---

## Addendum

### Python code to get some stats from the sample dataset...
Upload `sample.csv.gz` before running...

In [None]:
import gzip

events = 0
stats = {'#Taxis': set(), '#Drivers': set()}

# Stream the compressed sample line by line to keep memory usage low.
with gzip.open('sample.csv.gz', 'rt') as f:
    for line in f:
        # Attributes are comma-separated: medallion first, hack_license second.
        tokens = line.strip().split(',')
        medallion = tokens[0]
        driver = tokens[1]
        stats['#Taxis'].add(medallion)
        stats['#Drivers'].add(driver)
        events += 1

print('#Events: {}'.format(events))
for k, v in stats.items():
    print(k, ': ', len(v))


---

### Kafka Streams

To fully leverage Spark Streaming, the taxi ride dataset can be accessed
as an [Apache Kafka](https://kafka.apache.org/) stream. 

Apache Kafka is a topic-based publish/subscribe broker platform,
offering a reliable and persistent event dissemination service.

Each taxi ride is published as a discrete event, under the `debs`
topic.

*Spark Streaming* and *Spark Structured Streaming* can ingest Kafka event sources, as explained next.

### Setup

1. Start Kafka

 * Download and execute [`./start-kafka.sh`](https://github.com/smduarte/ps2021/blob/master/proj/setup/start-kafka.sh)
    
    
2. Start the Debs Taxi Ride publisher.

 * Download and execute:
    [`./publish-debs.sh`](https://github.com/smduarte/ps2021/blob/master/proj/setup/publish-debs.sh)
 
 
On Linux/macOS, you may need to make the scripts executable:
 
 * `chmod a+x <script>`

### Ingesting Events

---

Note: The Kafka source for *Spark Streaming* is only available for **pyspark** up to
version 2.4.x of Spark. 

For newer versions of Spark we are limited to *Spark SQL Structured Streaming*.


We will use Spark version 2.4.5 as it works with both streaming frameworks.

In [None]:
!spark-submit --version

#### SparkStreaming

The example below shows how to create a *DStream* from
a Kafka topic.

Both *Kafka* and *Debs Publisher* need to be running already.

In the example, `kafka:9092` is the name of the machine/container
and port where the Apache Kafka broker is running.

`debs` is the event topic the *publisher* uses.

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "Kafka Spark Streaming Example")

ssc = StreamingContext(sc, 1)
lines = KafkaUtils.createDirectStream(ssc, ["debs"], \
            {"metadata.broker.list": "kafka:9092"}) \
        .map( lambda e : e[1] ) \
        .filter( lambda line: len(line) > 0)


lines.pprint()
    
ssc.start()
ssc.awaitTermination(20)
ssc.stop()
sc.stop()

#### Spark SQL (Structured Streaming)

The example below shows how to prepare Spark for
ingesting and processing Kafka events using
the structured API. 

Complete it by adding the columns you need for the assignment.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

def dumpBatchDF(df, epoch_id):
    df.show(20, False)


spark = SparkSession \
    .builder \
    .appName("Kafka Spark Structured Streaming Example") \
    .getOrCreate()

lines = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "kafka:9092") \
  .option("subscribe", "debs") \
  .load() \
  .selectExpr("CAST(value AS STRING)")

split_lines = split(lines['value'], ',')

rides = lines.withColumn('medallion', split_lines.getItem(0).cast("string")) \
        .withColumn('pickup_datetime', split_lines.getItem(2).cast("timestamp")) \
        .drop('value')

query = rides \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime='5 seconds') \
    .foreachBatch(dumpBatchDF) \
    .start()

query.awaitTermination(20)
query.stop()
spark.stop()

### Helper Code

The following helper functions can be used in the assignment, 
as is or changed as needed.


#### Convert GPS coordinates to grid cell coordinates

In [None]:
# Longitude and latitude from the upper left corner of the grid
MIN_LON = -74.916578
MAX_LAT = 41.47718278

# Longitude and latitude deltas that correspond to a shift of 500 meters
LON_DELTA = 0.005986
LAT_DELTA = 0.004491556

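import math

def latlon_to_grid(lat, lon):
    # Floor (rather than truncate) so that points just north or west of the
    # grid get a negative index instead of being folded into row/column 0.
    return (int(math.floor((MAX_LAT - lat) / LAT_DELTA)),
            int(math.floor((lon - MIN_LON) / LON_DELTA)))

#### In Bounds check

You can use cell coordinates to exclude rides that start or end outside the grid.

In [None]:
def inBounds(cell):
    # Valid rows and columns are 0..299, i.e., a 300x300 grid.
    return 0 <= cell[0] < 300 and 0 <= cell[1] < 300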
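
The conversion and the bounds check can be combined to turn a ride into a route. The sketch below restates both steps under hypothetical names (`to_cell`, `in_grid`, `ride_to_route`) so that it runs on its own; it floors the cell index and accepts indices 0 to 299, which is an assumption about how the 300x300 grid is indexed. The coordinates are approximate and used for illustration only.

```python
import math

# Grid constants as given in the helper code.
MIN_LON, MAX_LAT = -74.916578, 41.47718278
LON_DELTA, LAT_DELTA = 0.005986, 0.004491556
GRID_SIZE = 300


def to_cell(lat, lon):
    """(row, col) of a GPS point; flooring keeps out-of-grid points negative."""
    return (int(math.floor((MAX_LAT - lat) / LAT_DELTA)),
            int(math.floor((lon - MIN_LON) / LON_DELTA)))


def in_grid(cell):
    # Assumption: valid rows and columns are 0..GRID_SIZE-1.
    return 0 <= cell[0] < GRID_SIZE and 0 <= cell[1] < GRID_SIZE


def ride_to_route(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Return (start_cell, end_cell), or None if either end is off the grid."""
    start = to_cell(pickup_lat, pickup_lon)
    end = to_cell(dropoff_lat, dropoff_lon)
    return (start, end) if in_grid(start) and in_grid(end) else None


inside = ride_to_route(40.758, -73.9855, 40.6413, -73.7781)   # both ends in the grid
outside = ride_to_route(42.0, -73.9855, 40.6413, -73.7781)    # pickup north of the grid
print(inside, outside)
```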

#### Parsing timestamps

In [None]:
import datetime

def parseTime( date_time_str ):
    return datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S')
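
As a usage sketch, parsed timestamps can be subtracted to cross-check the reported trip duration, which is one way to spot inconsistent events. The parser is restated here as `parse_time` so that the block runs on its own; `duration_mismatch` and the 60-second tolerance are hypothetical.

```python
import datetime


def parse_time(date_time_str):
    # Same format string as the helper above.
    return datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S')


def duration_mismatch(pickup_str, dropoff_str, trip_time_in_secs, tolerance=60):
    """True if the reported duration differs from the timestamps by more
    than `tolerance` seconds (hypothetical threshold)."""
    elapsed = (parse_time(dropoff_str) - parse_time(pickup_str)).total_seconds()
    return abs(elapsed - trip_time_in_secs) > tolerance


# Made-up values, for illustration only.
consistent = duration_mismatch('2013-01-01 00:00:00', '2013-01-01 00:02:00', 120)
suspicious = duration_mismatch('2013-01-01 00:00:00', '2013-01-01 00:02:00', 900)
print(consistent, suspicious)
```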