<a href="https://colab.research.google.com/github/lucprosa/dataeng-basic-course/blob/main/spark/misc/read_from_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Read from API
- ...

# Setting up PySpark

In [1]:
%pip install pyspark



In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Spark Course').config('spark.ui.port', '4050').getOrCreate()
sc = spark.sparkContext

# Get data from API - Vehicles

In [4]:
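import json
import requests
from pyspark.sql.types import *

def readFromAPI(url: str, schema: StructType):
  # Fetch the endpoint and fail fast on HTTP errors
  response = requests.get(url, timeout=30)
  response.raise_for_status()
  # spark.read.json expects an RDD of JSON strings, so serialize each record
  rdd = sc.parallelize([json.dumps(record) for record in response.json()])
  df = spark.read.schema(schema).json(rdd)
  return df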
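
When an RDD is passed to `spark.read.json`, its elements are expected to be JSON strings, so the records returned by the API need to be serialized before they are parallelized. The sketch below shows that step on its own, without Spark. The sample records and the helper name `to_json_lines` are hypothetical and only illustrate the shape of the data.

```python
import json

# Hypothetical sample records, shaped like the list returned by response.json()
records = [
    {"id": "41|1234", "speed": 5.5, "stop_id": None},
    {"id": "42|5678", "speed": 0.0, "stop_id": "060001"},
]

def to_json_lines(records):
    """Serialize each record (a dict) to one JSON string."""
    return [json.dumps(record) for record in records]

json_lines = to_json_lines(records)

# str() of a dict is a Python repr (single quotes, None), which is not valid JSON
python_repr = str(records[0])

print(json_lines[0])
print(python_repr)
```

A list built this way can be handed to `sc.parallelize(...)` and then read with `spark.read.schema(schema).json(...)`.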

In [5]:
vehicle_schema = StructType([StructField('bearing', IntegerType(), True),
                             StructField('block_id', StringType(), True),
                             StructField('current_status', StringType(), True),
                             StructField('id', StringType(), True),
                             StructField('lat', FloatType(), True),
                             StructField('line_id', StringType(), True),
                             StructField('lon', FloatType(), True),
                             StructField('pattern_id', StringType(), True),
                             StructField('route_id', StringType(), True),
                             StructField('schedule_relationship', StringType(), True),
                             StructField('shift_id', StringType(), True),
                             StructField('speed', FloatType(), True),
                             StructField('stop_id', StringType(), True),
                             StructField('timestamp', TimestampType(), True),
                             StructField('trip_id', StringType(), True)])

vehicles = readFromAPI("https://api.carrismetropolitana.pt/vehicles", vehicle_schema)
print(vehicles.count())
vehicles.show()

0
+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+
|bearing|block_id|current_status| id|lat|line_id|lon|pattern_id|route_id|schedule_relationship|shift_id|speed|stop_id|timestamp|trip_id|
+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+
+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+



### API - https://github.com/carrismetropolitana/api

### Exercises

- Create an ETL process to monitor vehicles from Carris Metropolitana
  - Read data from the "vehicles" endpoint and write it to "/content/output/vehicles" as parquet
  - Convert the timestamp column to datetime (hh24:mi:ss)

- Read data from the "stops" endpoint and write it to "/content/output/stops" as parquet
- Convert the timestamp column to datetime (hh24:mi:ss)
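
As a hint for the target format, `hh24:mi:ss` means a 24-hour clock with minutes and seconds. The sketch below shows that format in plain Python, assuming the timestamp is a Unix epoch in seconds (an assumption, not confirmed here) and using a hypothetical helper `to_hms`. In Spark, the same layout can be expressed with the `HH:mm:ss` pattern of `pyspark.sql.functions.date_format`.

```python
from datetime import datetime, timezone

def to_hms(epoch_seconds: int) -> str:
    """Format a Unix timestamp (in seconds) as 24-hour HH:MM:SS in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%H:%M:%S")

print(to_hms(3661))  # 01:01:01
```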