<a href="https://colab.research.google.com/github/soonieboi/sparkstuff/blob/main/04_REWORK_Ingestion_and_Data_Formats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
PySpark provides an interface for loading DataFrames from external storage systems. In this notebook we read files in several data formats into DataFrames and write DataFrames back out in those formats, using PySpark examples. Lastly, we look at how Apache Arrow is used to transfer data efficiently between the JVM and Python processes.

In [1]:
import os

# 1. Install OpenJDK 21 (if not already done in a previous cell)
!apt-get update -qq
!apt-get install -qq openjdk-21-jdk-headless

# 2. Verify where it landed (if needed)
!ls /usr/lib/jvm | grep 21

# 3. Point to JDK 21
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# 4. Install PySpark via pip (make sure this happens AFTER setting JAVA_HOME)
!pip install pyspark --quiet

# 5. Import and start Spark
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
      .master("local[*]")
      .appName("Spark on Java21")
      .getOrCreate()
)



W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Selecting previously unselected package openjdk-21-jre-headless:amd64.
(Reading database ... 126371 files and directories currently installed.)
Preparing to unpack .../openjdk-21-jre-headless_21.0.8+9~us1-0ubuntu1~22.04.1_amd64.deb ...
Unpacking openjdk-21-jre-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
Selecting previously unselected package openjdk-21-jdk-headless:amd64.
Preparing to unpack .../openjdk-21-jdk-headless_21.0.8+9~us1-0ubuntu1~22.04.1_amd64.deb ...
Unpacking openjdk-21-jdk-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
Setting up openjdk-21-jre-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
update-alternatives: using /usr/lib/jvm/java-21-openjdk-amd64/bin/java to provide /usr/bin/java (java) in auto mode
update-alternatives: using /usr/lib/jvm/java-21-openjdk-amd64/bin/j

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Read CSV file
PySpark provides DataFrameReader to load a DataFrame from external storage systems (e.g. file systems, key-value stores). Access it through SparkSession.read, and use format(source) to specify the input data source format.
Using csv("path") or format("csv").load("path") of DataFrameReader, you can read a CSV file into a PySpark DataFrame. Both take the path of the file to read as an argument. With format(source) you can name a data source by its fully qualified name, but for built-in sources the short name is enough (csv, json, parquet, jdbc, text, etc.). The example below reads a single CSV file, “people.csv”, into a DataFrame and lets Spark infer the column types through the inferSchema option. You can also supply your own schema with schema(...) instead of inferring it.

In [3]:
# Read CSV file people.csv
df = spark.read.format('csv') \
                .option("inferSchema","true") \
                .option("header","true") \
                .option("sep",";") \
                .load("/content/drive/MyDrive/data/DataFormat/people.csv")

# Show result
df.show()
# Print schema
df.printSchema()

+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)



In [5]:
# Write the DataFrame back out as CSV.
# NOTE: "people_csv_out" is an illustrative output path; Spark writes a directory of part files there.
df.write.format('csv') \
        .option("header","true") \
        .option("sep",";") \
        .mode("overwrite") \
        .save("people_csv_out")


In [6]:
# Read JSON with the generic format(...).load(...) API.
# The CSV options header and sep do not apply to JSON, and the JSON schema is inferred unless one is supplied.
df = spark.read.format('json') \
                .load("/content/drive/MyDrive/data/DataFormat/people.json")
df.show()
df.printSchema()

# Equivalent shortcut: spark.read.json(path)
peopleDF = spark.read.json("/content/drive/MyDrive/data/DataFormat/people.json")
peopleDF.show()


+----+---------+-------+
| age|      job|   name|
+----+---------+-------+
|NULL|     NULL|Michael|
|  30|developer|   Andy|
|  19|     NULL| Justin|
+----+---------+-------+

root
 |-- age: long (nullable = true)
 |-- job: string (nullable = true)
 |-- name: string (nullable = true)

+----+---------+-------+
| age|      job|   name|
+----+---------+-------+
|NULL|     NULL|Michael|
|  30|developer|   Andy|
|  19|     NULL| Justin|
+----+---------+-------+



In [7]:
# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.format("parquet").mode("overwrite").save("people.parquet")


In [8]:
# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet("people.parquet")

# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.show()


+------+
|  name|
+------+
|Justin|
+------+



In [9]:
import pyarrow.csv as pv
import pyarrow.parquet as pq
# Read the HDB resale price CSV into an Arrow table
hdb_table = pv.read_csv("/content/drive/MyDrive/data/DataFormat/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv")
# Convert the CSV data to a Parquet file
pq.write_table(hdb_table,'resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.parquet')
hdb_parquet = pq.ParquetFile('resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.parquet')
# Inspect the file-level Parquet metadata
print(hdb_parquet.metadata)
# Inspect the metadata of the first row group
print(hdb_parquet.metadata.row_group(0))
# Inspect the column chunk statistics (index 9 is the last of the 10 columns)
print(hdb_parquet.metadata.row_group(0).column(9).statistics)



<pyarrow._parquet.FileMetaData object at 0x7ee0b1de11c0>
  created_by: parquet-cpp-arrow version 18.1.0
  num_columns: 10
  num_rows: 52203
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2061
<pyarrow._parquet.RowGroupMetaData object at 0x7ee0940dd8a0>
  num_columns: 10
  num_rows: 52203
  total_byte_size: 431095
  sorting_columns: ()
<pyarrow._parquet.Statistics object at 0x7ee0940dd8f0>
  has_min_max: True
  min: 195000.0
  max: 1088888.0
  null_count: 0
  distinct_count: None
  num_values: 52203
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE
