<a href="https://colab.research.google.com/github/rahulrajpr/prepare-anytime/blob/main/spark/functions/19_spark_sql_dataframe_reader_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spark DataFrame Reader Methods**
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.html

In [1]:
# Install Java and PySpark

import warnings
warnings.filterwarnings('ignore')

!apt-get update -qq
!apt-get install -y openjdk-11-jdk-headless -qq > /dev/null
!pip install pyspark -q

# Set Java home
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

import pyspark
print(pyspark.__version__)

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
3.5.1


In [2]:
# Download the PostgreSQL JDBC driver

!mkdir -p ~/jars
!wget -P ~/jars https://jdbc.postgresql.org/download/postgresql-42.6.0.jar

--2025-11-03 15:40:53--  https://jdbc.postgresql.org/download/postgresql-42.6.0.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1081604 (1.0M) [application/java-archive]
Saving to: ‘/root/jars/postgresql-42.6.0.jar’


2025-11-03 15:40:53 (12.1 MB/s) - ‘/root/jars/postgresql-42.6.0.jar’ saved [1081604/1081604]



In [3]:
from pyspark.sql import SparkSession

spark = SparkSession\
            .builder\
            .appName('spark-dataframe')\
            .config("spark.jars", "/root/jars/postgresql-42.6.0.jar")\
            .getOrCreate()

In [4]:
csv_file_path = 'https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/csv/sample.csv'
json_file_path = 'https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/json/sample.json'
parquet_file_path = 'https://github.com/rahulrajpr/prepare-anytime/raw/refs/heads/main/sample-files/parquet/sample.parquet'

In [5]:
!wget {csv_file_path}
!wget {json_file_path}
!wget {parquet_file_path}

--2025-11-03 15:41:07--  https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/csv/sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘sample.csv’


2025-11-03 15:41:07 (5.08 MB/s) - ‘sample.csv’ saved [60302/60302]

--2025-11-03 15:41:07--  https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/json/sample.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63790 (62K) [text/plain]
Saving to: ‘sample.json’


2025-11

In [6]:
csv_local_path = '/content/sample.csv'
json_local_path = '/content/sample.json'
parquet_local_path = '/content/sample.parquet'

##### Pandas vs Spark File Reading: Conceptual Comparison

##### Core Difference
- **Pandas**: Single-machine processing
- **Spark**: Distributed cluster computing

##### Quick Comparison
| Aspect | Pandas | Spark |
|--------|--------|-------|
| **Processing Model** | Single-machine | Distributed cluster |
| **HTTP/HTTPS URLs** | Directly supported | Not supported |
| **File Storage** | Any path the local process can reach | Storage reachable by the driver and every executor |
| **Worker Access** | Single process | All executors need access |
| **Data Location** | Local files or remote URLs | Local paths (local mode or shared mounts) and distributed file systems |

##### Why Spark Can't Read HTTP URLs
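**Technical Reason**: Spark executors run on different cluster nodes, and each one reads its own slice of the input. Pointing them at a plain HTTP URL would mean:
- Each executor downloads independently
- No coordinated way to divide the file among executors
- No guarantee that every executor sees the same data
- None of the file listing and split planning that Spark relies on for parallel reads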

##### Supported Spark File Systems
- Local/NFS (all nodes must have access)
- Hadoop DFS (HDFS)
- AWS S3
- Google Cloud Storage
- Azure Blob Storage

##### HTTP URL Workarounds
1. **Download then Read**: Download to distributed storage first
2. **Pandas Bridge**: Read with Pandas, convert to Spark
3. **Manual Download**: Download locally, then use local path
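
Workarounds 1 and 3 can be sketched as a small helper. This is a minimal sketch rather than part of the original notebook: `fetch_to_local` and `read_remote_csv` are hypothetical names, and it assumes a single-machine setup such as Colab, where the driver and the executors share one file system. On a real cluster the downloaded file would have to land on storage that every executor can reach.

```python
import os
import shutil
import urllib.parse
import urllib.request


def fetch_to_local(url, dest_dir):
    """Download `url` into `dest_dir` and return the local file path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Use the last path segment of the URL as the file name.
    filename = os.path.basename(urllib.parse.urlparse(url).path) or "download"
    local_path = os.path.join(dest_dir, filename)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path


def read_remote_csv(spark, url, dest_dir="/content"):
    """Hypothetical wrapper: download first, then hand Spark a local path."""
    local_path = fetch_to_local(url, dest_dir)
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(local_path)
    )
```

With an active session, `read_remote_csv(spark, csv_file_path)` would play the role of the `wget` step followed by `spark.read.csv` on the local copy.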

##### Key Takeaway
Spark needs storage that every executor can read at the same time, such as a shared mount or a distributed file system. A plain HTTP URL offers no coordinated access, so it cannot guarantee the data consistency that parallel processing depends on.

In [7]:
# csv -reader

csv_dataframe = spark.read\
                    .option('header','true')\
                    .option('inferSchema','true')\
                    .csv(csv_local_path)

csv_dataframe.printSchema()
csv_dataframe.show()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|   

In [8]:
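# json reader

# multiLine=true is needed when a single JSON document (for example an array
# of objects) spans several lines. JSON Lines files, with one object per line,
# are what Spark expects by default and do not need it.
# 'header' and 'inferSchema' are CSV options; the JSON reader infers the schema on its own.

json_dataframe = spark.read\
                     .option('multiLine','true')\
                     .json(json_local_path)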

json_dataframe.printSchema()
json_dataframe.show(truncate = True)

root
 |-- bio: string (nullable = true)
 |-- id: string (nullable = true)
 |-- language: string (nullable = true)
 |-- name: string (nullable = true)
 |-- version: double (nullable = true)

+--------------------+----------------+----------------+-----------------+-------+
|                 bio|              id|        language|             name|version|
+--------------------+----------------+----------------+-----------------+-------+
|Donec lobortis el...|V59OF92YF627HFY0|          Sindhi|    Adeel Solangi|    6.1|
|Aliquam sollicitu...|ENTOCR13RSCLZ6KU|          Sindhi|    Afzal Ghaffar|   1.88|
|Vestibulum pharet...|IAKPO3R4761JDRVG|          Sindhi|    Aamir Solangi|   7.27|
|Donec lobortis el...|5ZVOEPMJUI4MB4EN|          Uyghur|    Abla Dilmurat|   2.53|
|Vivamus id faucib...|6VTI8X6LL0MMPJCC|          Uyghur|         Adil Eli|   6.49|
|Duis commodo orci...|F2KEU5L7EHYSYFTT|          Uyghur|      Adile Qadir|    1.9|
|Vivamus id faucib...|LO6DVTZLRK68528I|          Uyghur|Abduker
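
The difference between JSON Lines and a multi-line JSON document can be seen with the standard library alone. The sketch below uses made-up records (not the contents of `sample.json`): a line-by-line parser, which is roughly how Spark's default JSON reader treats its input, succeeds on JSON Lines but fails on a pretty-printed array, which is the case `multiLine` exists for.

```python
import json

# Made-up records for illustration; not the contents of sample.json.
records = [
    {"id": "A1", "language": "Sindhi", "version": 6.1},
    {"id": "B2", "language": "Uyghur", "version": 2.53},
]

# JSON Lines: one complete JSON object per line.
json_lines_text = "\n".join(json.dumps(r) for r in records)

# Multi-line JSON: a single document (here an array) spread over many lines.
multi_line_text = json.dumps(records, indent=2)


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON value."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Line-by-line parsing works for JSON Lines ...
assert parse_line_by_line(json_lines_text) == records

# ... but not for a pretty-printed document, whose individual lines
# (such as "[" or "{") are not valid JSON on their own.
try:
    parse_line_by_line(multi_line_text)
    line_mode_ok = True
except json.JSONDecodeError:
    line_mode_ok = False

# Parsing the whole text as one document is the analogue of multiLine=true.
whole_document = json.loads(multi_line_text)
```

Unlike this sketch, Spark does not raise on unparseable lines in its default permissive mode; it reports them in a `_corrupt_record` column instead.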

In [9]:
# parquet - reader

# No schema or header options are needed: Parquet files store their schema in the file metadata

parquet_dataframe = spark.read\
                         .parquet(parquet_local_path)

parquet_dataframe.show()

+-------------------+---+----------+---------+--------------------+------+---------------+-------------------+--------------------+----------+---------+--------------------+--------------------+
|  registration_dttm| id|first_name|last_name|               email|gender|     ip_address|                 cc|             country| birthdate|   salary|               title|            comments|
+-------------------+---+----------+---------+--------------------+------+---------------+-------------------+--------------------+----------+---------+--------------------+--------------------+
|2016-02-03 07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|    1.197.201.2|   6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|               1E+02|
|2016-02-03 17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male| 218.111.175.34|                   |              Canada| 1/16/1968|150280.17|       Accountant IV|                    |
|2016-02-03 01:09:31|  3|

In [10]:
# reader - format --> csv

format_dataframe1 = spark.read.format('csv')\
                         .option('inferSchema','true')\
                         .option('header','true')\
                         .load(csv_local_path)

format_dataframe1.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3|    Moran, Mr. James|  male|NULL|    0|    0|      

In [11]:
# reader - format --> json

# 'header' and 'inferSchema' are CSV options, so only multiLine is set here
format_dataframe2 = spark.read.format('json')\
                         .option('multiLine','true')\
                         .load(json_local_path)
format_dataframe2.show()

+--------------------+----------------+----------------+-----------------+-------+
|                 bio|              id|        language|             name|version|
+--------------------+----------------+----------------+-----------------+-------+
|Donec lobortis el...|V59OF92YF627HFY0|          Sindhi|    Adeel Solangi|    6.1|
|Aliquam sollicitu...|ENTOCR13RSCLZ6KU|          Sindhi|    Afzal Ghaffar|   1.88|
|Vestibulum pharet...|IAKPO3R4761JDRVG|          Sindhi|    Aamir Solangi|   7.27|
|Donec lobortis el...|5ZVOEPMJUI4MB4EN|          Uyghur|    Abla Dilmurat|   2.53|
|Vivamus id faucib...|6VTI8X6LL0MMPJCC|          Uyghur|         Adil Eli|   6.49|
|Duis commodo orci...|F2KEU5L7EHYSYFTT|          Uyghur|      Adile Qadir|    1.9|
|Vivamus id faucib...|LO6DVTZLRK68528I|          Uyghur|Abdukerim Ibrahim|    5.9|
|Etiam malesuada b...|LJRIULRNJFCNZJAJ|          Sindhi|        Adil Abro|   9.32|
|Fusce eu ultrices...|JMCL0CXNXHPL1GBC|        Galician| Afonso Vilarchán|   5.21|
|Nam

In [12]:
# reader - format --> parquet

format_dataframe3 = spark.read.format('parquet')\
                         .load(parquet_local_path)
format_dataframe3.show()

+-------------------+---+----------+---------+--------------------+------+---------------+-------------------+--------------------+----------+---------+--------------------+--------------------+
|  registration_dttm| id|first_name|last_name|               email|gender|     ip_address|                 cc|             country| birthdate|   salary|               title|            comments|
+-------------------+---+----------+---------+--------------------+------+---------------+-------------------+--------------------+----------+---------+--------------------+--------------------+
|2016-02-03 07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|    1.197.201.2|   6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|               1E+02|
|2016-02-03 17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male| 218.111.175.34|                   |              Canada| 1/16/1968|150280.17|       Accountant IV|                    |
|2016-02-03 01:09:31|  3|
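
Because `format(...).load(...)` takes the format as a plain string, the choice of reader can be driven by data. A minimal sketch, assuming the file extension reliably indicates the format; `format_for_path` and `read_any` are hypothetical helpers, not Spark APIs, and the options simply mirror the ones used in this notebook.

```python
import os

# Extensions this sketch knows about, mapped to Spark's built-in format names
# and the reader options used for each format in this notebook.
FORMAT_OPTIONS = {
    ".csv": ("csv", {"header": "true", "inferSchema": "true"}),
    ".json": ("json", {"multiLine": "true"}),
    ".parquet": ("parquet", {}),
}


def format_for_path(path):
    """Return (format_name, options) for a path, based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in FORMAT_OPTIONS:
        raise ValueError(f"No reader configured for extension {extension!r}")
    return FORMAT_OPTIONS[extension]


def read_any(spark, path):
    """Hypothetical helper: pick the reader format from the file extension."""
    format_name, options = format_for_path(path)
    return spark.read.format(format_name).options(**options).load(path)
```

With an active session, `read_any(spark, csv_local_path)` should behave like the `format('csv')` cell above, and the same call works for the JSON and Parquet paths.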

In [13]:
# Set up PostgreSQL locally to test the JDBC reader

!apt-get update
!apt-get install -y postgresql postgresql-contrib
!service postgresql start
!clear

0% [Working]            Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entr

In [14]:

## CREATE DATABASE, SCHEMA, TABLE AND INSERT DATA

database = 'magic_database'
schema = 'magic_schema'
table = 'magic_table'
user = 'rahul'
password = 'rahul_password'

!sudo -u postgres psql -c "CREATE USER {user} WITH PASSWORD '{password}';"

!sudo -u postgres psql -c "CREATE DATABASE {database} OWNER {user};"
!sudo -u postgres psql -d {database} -c "CREATE SCHEMA {schema} AUTHORIZATION {user};"
!sudo -u postgres psql -d {database} -c "CREATE TABLE {schema}.{table} (id SERIAL PRIMARY KEY, name VARCHAR(50), age INT, department VARCHAR(50));"

!sudo -u postgres psql -d {database} -c "GRANT ALL PRIVILEGES ON SCHEMA {schema} TO {user};"
!sudo -u postgres psql -d {database} -c "GRANT ALL PRIVILEGES ON TABLE {schema}.{table} TO {user};"

!sudo -u postgres psql -d {database} -c "INSERT INTO {schema}.{table} (name, age, department) VALUES ('Alice', 30, 'HR'), ('Bob', 28, 'IT'), ('Charlie', 35, 'Finance');"


CREATE ROLE
CREATE DATABASE
CREATE SCHEMA
CREATE TABLE
GRANT
GRANT
INSERT 0 3


In [15]:
# jdbc reader

# Reuse the names defined in the setup cell instead of repeating the literals
url = f"jdbc:postgresql://localhost:5432/{database}"
qualified_table = f"{schema}.{table}"  # magic_schema.magic_table

properties = {"user": user,
              "password": password,
              "driver": "org.postgresql.Driver"}

jdbc_dataframe = spark.read\
                      .jdbc(url = url,
                            table = qualified_table,
                            properties = properties)

jdbc_dataframe.printSchema()
jdbc_dataframe.show()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- department: string (nullable = true)

+---+-------+---+----------+
| id|   name|age|department|
+---+-------+---+----------+
|  1|  Alice| 30|        HR|
|  2|    Bob| 28|        IT|
|  3|Charlie| 35|   Finance|
+---+-------+---+----------+
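
The same read can also be expressed through the generic `format('jdbc')` reader, which takes the connection details as options. The sketch below only assembles those options; `build_jdbc_options` and `read_jdbc_table` are hypothetical helpers, and the values mirror the local PostgreSQL setup from the earlier cell.

```python
def build_jdbc_options(database, table, user, password,
                       host="localhost", port=5432):
    """Assemble the option dict for Spark's JDBC data source (PostgreSQL)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,  # schema-qualified table name
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_jdbc_table(spark, options):
    """Hypothetical helper: load a table through the generic JDBC reader."""
    return spark.read.format("jdbc").options(**options).load()


options = build_jdbc_options(
    database="magic_database",
    table="magic_schema.magic_table",
    user="rahul",
    password="rahul_password",
)
```

With the session and database from this notebook running, `read_jdbc_table(spark, options)` should return the same three rows as the `spark.read.jdbc(...)` call.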



In [16]:
# read - table

# spark.read.table needs a table in the Spark catalog or a temp view, so register the DataFrame as a view first

jdbc_dataframe.createOrReplaceTempView('tempView_Demo')

In [17]:
# read - table

table_dataframe = spark.read\
                       .table('tempView_Demo')

table_dataframe.printSchema()
table_dataframe.show(truncate = False)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- department: string (nullable = true)

+---+-------+---+----------+
|id |name   |age|department|
+---+-------+---+----------+
|1  |Alice  |30 |HR        |
|2  |Bob    |28 |IT        |
|3  |Charlie|35 |Finance   |
+---+-------+---+----------+



In [18]:
# text

# create sample text file

sample_text = """id,name,age,city
1,Rahul,29,Bangalore
2,Aditi,25,Mumbai
3,John,32,Delhi
4,Sara,28,Chennai
5,Ankit,30,Kolkata
"""

with open("sample.txt", "w") as f:
    f.write(sample_text)

text_local_path = '/content/sample.txt'

In [19]:
# text

# spark.read.text reads each line of the file as one row in a single string column named 'value'

text_dataframe = spark.read\
                      .text(text_local_path)

text_dataframe.show(truncate = False)

+--------------------+
|value               |
+--------------------+
|id,name,age,city    |
|1,Rahul,29,Bangalore|
|2,Aditi,25,Mumbai   |
|3,John,32,Delhi     |
|4,Sara,28,Chennai   |
|5,Ankit,30,Kolkata  |
+--------------------+



In [43]:
from pyspark.sql.functions import split, col, expr

text_dataframe_trans = text_dataframe.withColumn('splitCol',split(col('value'),','))\
                                     .withColumn('id',col('splitCol')[0])\
                                     .withColumn('name',col('splitCol')[1])\
                                     .withColumn('age',col('splitCol')[2])\
                                     .withColumn('city',col('splitCol')[3])\
                                     .drop('value','splitCol')\
                                     .filter(~expr("id = 'id'"))\
                                     .selectExpr('try_cast(id as int) as id',
                                                 'try_cast(name as string) as name',
                                                 'try_cast(age as int) as age',
                                                 'try_cast(city as string) as city')

text_dataframe_trans.printSchema()
text_dataframe_trans.show(truncate = False)

## Alternative: cast while selecting, then drop the header row, whose id casts to NULL

text_dataframe_trans = text_dataframe.withColumn('splitCol',split(col('value'),','))\
                                     .selectExpr('try_cast(splitCol[0] as int) as id'
                                                ,'try_cast(splitCol[1] as string) as name'
                                                ,'try_cast(splitCol[2] as int) as age'
                                                ,'try_cast(splitCol[3] as string) as city')\
                                     .filter('id is not null')

text_dataframe_trans.printSchema()
text_dataframe_trans.show(truncate = False)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)

+---+-----+---+---------+
|id |name |age|city     |
+---+-----+---+---------+
|1  |Rahul|29 |Bangalore|
|2  |Aditi|25 |Mumbai   |
|3  |John |32 |Delhi    |
|4  |Sara |28 |Chennai  |
|5  |Ankit|30 |Kolkata  |
+---+-----+---+---------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)

+---+-----+---+---------+
|id |name |age|city     |
+---+-----+---+---------+
|1  |Rahul|29 |Bangalore|
|2  |Aditi|25 |Mumbai   |
|3  |John |32 |Delhi    |
|4  |Sara |28 |Chennai  |
|5  |Ankit|30 |Kolkata  |
+---+-----+---+---------+



Note: in the Spark 3.5.1 used here, `try_cast` is not available as a Python function or a Column method (PySpark added `Column.try_cast` in 4.0).
> Use `expr` or `selectExpr` to call the SQL `try_cast` instead.
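
To see why the second variant can drop the header row with `id is not null`, here is a plain-Python analogue of `try_cast(... as int)`: it returns `None` instead of raising when the text is not an integer. `try_int` and `parse_rows` are illustrative names, not Spark functions, and the sketch only mirrors the idea, not Spark's exact casting rules.

```python
def try_int(text):
    """Plain-Python analogue of try_cast(text as int): None on failure."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_rows(lines, sep=","):
    """Split each line, cast id and age, and drop rows whose id is not an int."""
    rows = []
    for line in lines:
        parts = line.split(sep)
        if len(parts) != 4:
            continue  # skip blank or malformed lines in this sketch
        row = (try_int(parts[0]), parts[1], try_int(parts[2]), parts[3])
        if row[0] is not None:  # same effect as .filter('id is not null')
            rows.append(row)
    return rows


sample_lines = [
    "id,name,age,city",  # header: try_int("id") is None, so the row is dropped
    "1,Rahul,29,Bangalore",
    "2,Aditi,25,Mumbai",
]
parsed = parse_rows(sample_lines)
# parsed == [(1, 'Rahul', 29, 'Bangalore'), (2, 'Aditi', 25, 'Mumbai')]
```

The trade-off is the same as in the Spark version: any data row with a non-numeric `id` disappears along with the header, so the first variant (filtering on the literal `'id'`) is the safer choice when bad ids need to stay visible.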

In [45]:
# Read the text file with the CSV reader,
# using .schema and the 'delimiter' option

## This is the easiest way to read a text file with a known delimiter and schema

text_dataframe2 = spark.read\
                       .option('header',True)\
                       .option('delimiter',',')\
                       .schema('id int, name string, age int , city string')\
                       .csv(text_local_path)

text_dataframe2.printSchema()
text_dataframe2.show(truncate = False)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)

+---+-----+---+---------+
|id |name |age|city     |
+---+-----+---+---------+
|1  |Rahul|29 |Bangalore|
|2  |Aditi|25 |Mumbai   |
|3  |John |32 |Delhi    |
|4  |Sara |28 |Chennai  |
|5  |Ankit|30 |Kolkata  |
+---+-----+---+---------+



In [46]:
# Do the same read, this time inferring the schema

text_dataframe2 = spark.read\
                       .option('header',True)\
                       .option('delimiter',',')\
                       .option('inferSchema',True)\
                       .csv(text_local_path)

text_dataframe2.printSchema()
text_dataframe2.show(truncate = False)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)

+---+-----+---+---------+
|id |name |age|city     |
+---+-----+---+---------+
|1  |Rahul|29 |Bangalore|
|2  |Aditi|25 |Mumbai   |
|3  |John |32 |Delhi    |
|4  |Sara |28 |Chennai  |
|5  |Ankit|30 |Kolkata  |
+---+-----+---+---------+

