# Spark with Postgres in Jupyter Notebooks

Two things need special attention:

- Set `spark.jars.packages` to `org.postgresql:postgresql:42.7.4` so that Spark downloads the Postgres JDBC driver artifact when the session starts.
- Load the installed `sql` extension with `%load_ext sql` so that SQL queries can be run from a cell with the `%%sql` cell magic.

In [1]:
%load_ext sql

In [2]:
%%bash
docker run --name jupyter_postgres -p 5432:5432 -e POSTGRES_PASSWORD=secret -d postgres

deed89aff18ce3e18e6245ac50ad5c8851a0c8d2e5dd954b70b44856748842e2
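
Because `docker run -d` returns as soon as the container is created, the database may not be accepting connections yet when the next cell runs. The following is a minimal sketch of a polling helper, not part of the original notebook. The function names `postgres_is_ready` and `wait_until` are hypothetical, and the sketch assumes the `pg_isready` utility is available inside the container started above.

```python
import subprocess
import time


def postgres_is_ready(container="jupyter_postgres"):
    """Return True if pg_isready inside the container reports an accepting server.

    Assumption: the container name matches the `docker run --name` value above
    and the image ships the `pg_isready` client utility.
    """
    result = subprocess.run(
        ["docker", "exec", container, "pg_isready", "-h", "localhost", "-U", "postgres"],
        capture_output=True,
    )
    return result.returncode == 0


def wait_until(check, timeout=30.0, interval=0.5):
    """Call check() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until(postgres_is_ready)` in a cell before connecting would block until the server answers or the timeout expires. Keeping the retry loop separate from the readiness check lets the same loop be reused with any other check.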


In [4]:
%sql postgresql://postgres:secret@localhost:5432/postgres
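
The same host, port, and credentials are needed again further down for Spark's JDBC writer, so it can help to keep them in one place. The sketch below is one possible way to do that. `PG`, `sqlalchemy_url`, and `jdbc_options` are hypothetical names introduced only for illustration, and the values are copied from the `docker run` command and the `%sql` line in this notebook.

```python
from urllib.parse import quote_plus

# Connection settings, copied from the cells above (hypothetical structure).
PG = {
    "host": "localhost",
    "port": 5432,
    "database": "postgres",
    "user": "postgres",
    "password": "secret",
}


def sqlalchemy_url(cfg):
    """Build the URL form used with the %sql magic."""
    # Escape characters such as '@' or '/' that would break the URL.
    password = quote_plus(cfg["password"])
    return (
        f"postgresql://{cfg['user']}:{password}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['database']}"
    )


def jdbc_options(cfg, table):
    """Build the option dict for Spark's JDBC reader or writer."""
    return {
        "url": f"jdbc:postgresql://{cfg['host']}:{cfg['port']}/{cfg['database']}",
        "dbtable": table,
        "user": cfg["user"],
        "password": cfg["password"],
        "driver": "org.postgresql.Driver",
    }
```

With these helpers, a JDBC write could be configured with `.options(**jdbc_options(PG, "public.superheroes"))` instead of repeating each `.option(...)` call.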

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("PostgresExample")
    .master("local[*]")
    .config("spark.ui.enabled", "true")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.4")
    .getOrCreate()
)

# Show the Spark UI URL (useful for monitoring and debugging)
spark.sparkContext.uiWebUrl

:: loading settings :: url = jar:file:/home/yannis/Development/tmp/pyspark-delta/.venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/yannis/.ivy2.5.2/cache
The jars for the packages stored in: /home/yannis/.ivy2.5.2/jars
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-053b01e9-3996-4971-862a-232d22b5ecc3;1.0
	confs: [default]
	found org.postgresql#postgresql;42.7.4 in central
	found org.checkerframework#checker-qual;3.42.0 in central
:: resolution report :: resolve 162ms :: artifacts dl 11ms
	:: modules in use:
	org.checkerframework#checker-qual;3.42.0 from central in [default]
	org.postgresql#postgresql;42.7.4 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| numb

'http://ouranos:4041'

In [6]:
schema = T.StructType([
    T.StructField("id", T.IntegerType(), False),
    T.StructField("hero_name", T.StringType(), False),
    T.StructField("secret_identity", T.StringType(), False),
    T.StructField("power_level", T.IntegerType(), False)
])

In [7]:
raw_df = (
    spark.read
        .schema(schema)
        .option("header", "true")
        .csv("data/marvel.csv")
)

raw_df.createOrReplaceTempView("superheroes_raw")

In [8]:
(
    raw_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/postgres")
        .option("dbtable", "public.superheroes")
        .option("user", "postgres")
        .option("password", "secret")
        .option("driver", "org.postgresql.Driver")
        .mode("overwrite")
        .save()
)
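
The write can also be checked from the Spark side by reading the table back through the same JDBC driver. This is a minimal sketch rather than a cell from the original notebook. The function name `read_superheroes` is hypothetical, and it assumes the `spark` session created earlier and the connection values used in the write above.

```python
def read_superheroes(spark, url="jdbc:postgresql://localhost:5432/postgres"):
    """Load public.superheroes from Postgres into a DataFrame via JDBC.

    Assumption: `spark` is a SparkSession started with the Postgres JDBC
    driver on its classpath (see `spark.jars.packages` above).
    """
    return (
        spark.read
        .format("jdbc")
        .option("url", url)
        .option("dbtable", "public.superheroes")
        .option("user", "postgres")
        .option("password", "secret")
        .option("driver", "org.postgresql.Driver")
        .load()
    )
```

In the notebook, `read_superheroes(spark).show()` would print the rows that Postgres returns, which should match the contents of `raw_df` if the write succeeded.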

In [9]:
%%sql
select * from superheroes;

 * postgresql://postgres:***@localhost:5432/postgres
10 rows affected.


id,hero_name,secret_identity,power_level
1,Iron Man,Tony Stark,95
2,Captain America,Steve Rogers,88
3,Thor,Thor Odinson,98
4,Hulk,Bruce Banner,97
5,Black Widow,Natasha Romanoff,75
6,Spider-Man,Peter Parker,92
7,Black Panther,T'Challa,89
8,Doctor Strange,Stephen Strange,93
9,Scarlet Witch,Wanda Maximoff,94
10,Hawkeye,Clint Barton,70


In [10]:
result = %sql select * from superheroes;

 * postgresql://postgres:***@localhost:5432/postgres
10 rows affected.


In [11]:
result[0].hero_name

'Iron Man'