# Basic Read/Write

This example demonstrates how to perform a basic read and write using the Spark Connector.

## Spark Setup

First, we set up Spark to work with Vertica. To do this, we create a Spark session whose configuration includes the Spark Connector JAR, and then obtain the Spark context from that session.

In [None]:
# Get Connector JAR name
import glob
import os

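# Sort the matches so the choice is deterministic, and fail clearly if nothing is found
files = sorted(glob.glob("/spark-connector/connector/target/scala-2.12/spark-vertica-connector-assembly-*"))
if not files:
    raise FileNotFoundError("No connector assembly JAR found under /spark-connector/connector/target/scala-2.12")
os.environ["CONNECTOR_JAR"] = files[0]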
print(os.environ["CONNECTOR_JAR"])

In [None]:
# Create the Spark session and context
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.master", "spark://spark:7077")
    .config("spark.driver.memory", "2G")
    .config("spark.executor.memory", "1G")
    .config("spark.jars", os.environ["CONNECTOR_JAR"])
    .getOrCreate())
sc = spark.sparkContext

In [None]:
# Display the context information
print(sc.version)
print(sc.master)
display(sc.getConf().getAll())

## Read/Write

We can now define the data we want to write. Since this is a basic example, plain Python structures are enough: a list of column names and a list of tuples, one tuple per row.

Next we create our Spark DataFrame. Along the way, we use the parallelize method to create an [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html) (Resilient Distributed Dataset). The RDD is Spark's fundamental data structure: a collection partitioned across the cluster so that it can be processed in parallel.

In [None]:
# Build a small DataFrame from an RDD of (language, rating) tuples
columns = ["language", "rating"]
data = [("Scala", 71), ("Java", 89), ("C++", 67), ("Python", 94)]
rdd = sc.parallelize(data)
df = rdd.toDF(columns)
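
To make the pairing between column names and row tuples explicit, here is a plain-Python sketch that needs no Spark session. The helper name `rows_as_dicts` is hypothetical and exists only for illustration; it matches names to values by position, which is how the example data above is laid out.

```python
# Hypothetical helper for illustration; not part of Spark or the connector.
def rows_as_dicts(columns, data):
    """Pair each tuple in data with the column names, by position."""
    for row in data:
        if len(row) != len(columns):
            raise ValueError(f"row {row!r} does not match columns {columns!r}")
    return [dict(zip(columns, row)) for row in data]


columns = ["language", "rating"]
data = [("Scala", 71), ("Java", 89), ("C++", 67), ("Python", 94)]

records = rows_as_dicts(columns, data)
print(records[0])  # {'language': 'Scala', 'rating': 71}
```

A row with the wrong number of values raises a ValueError here, which surfaces a malformed row before the data is handed to Spark.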

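Finally, we write our DataFrame to the table "jupytertest" in the Vertica database "docker". We then read the table back and store the result in a Spark DataFrame for any further processing we want to do with Spark. Both calls pass a staging_fs_url, the intermediate location the connector uses to move data between Spark and Vertica.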

In [None]:
# Write the DataFrame to Vertica, overwriting any existing data in the table
df.write.mode("overwrite").save(format="com.vertica.spark.datasource.VerticaSource",
    host="vertica",
    user="dbadmin",
    password="",
    db="docker",
    table="jupytertest",
    staging_fs_url="webhdfs://hdfs:50070/jupytertest")

# Read the table back into a new Spark DataFrame
df = spark.read.load(format="com.vertica.spark.datasource.VerticaSource",
    host="vertica",
    user="dbadmin",
    password="",
    db="docker",
    table="jupytertest",
    staging_fs_url="webhdfs://hdfs:50070/jupytertest")
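# Pull all rows to the driver as a list of Row objects (fine for a table this small)
df.rdd.collect()
# Print the DataFrame contents as a table
df.show()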
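
The write and read calls above repeat the same connection options. One way to keep them in sync is to build the options once and unpack them into both calls. This is a sketch rather than the connector's prescribed usage: the helper name `vertica_options` and the constant `VERTICA_FORMAT` are hypothetical, and the values simply mirror the ones used in this example.

```python
# Hypothetical names for illustration; the values mirror the cells above.
VERTICA_FORMAT = "com.vertica.spark.datasource.VerticaSource"


def vertica_options(table):
    """Return the connector options used in this notebook for a given table."""
    return {
        "host": "vertica",
        "user": "dbadmin",
        "password": "",
        "db": "docker",
        "table": table,
        # Assumption: stage under a directory named after the table, as above
        "staging_fs_url": f"webhdfs://hdfs:50070/{table}",
    }


opts = vertica_options("jupytertest")
print(opts["staging_fs_url"])

# With the Spark session and DataFrame from the cells above:
# df.write.mode("overwrite").save(format=VERTICA_FORMAT, **opts)
# df = spark.read.load(format=VERTICA_FORMAT, **opts)
```

The Spark calls are left as comments so that the block runs without a cluster; uncomment them in a notebook where spark and df are already defined.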