# Chapter 9

## Reading and writing data in Spark

### Reading

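```python
# General shape of a read; `someSchema` is a placeholder for a StructType or DDL string.
# The parentheses let the method chain span several lines.
(spark.read.format("csv")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .option("path", "path/to/file(s)")
  .schema(someSchema)
  .load())
```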
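The `someSchema` placeholder in the template can be filled in several ways. As a minimal sketch, assuming Spark 2.3 or later (where `schema()` also accepts a DDL string) and the three flight-summary columns used later in this chapter, an explicit schema might look like this. The names `FLIGHT_SCHEMA_DDL` and `read_flights_csv` are illustrative and not part of Spark.

```python
# Hypothetical schema for the flight-summary files, written as a DDL string.
FLIGHT_SCHEMA_DDL = (
    "DEST_COUNTRY_NAME STRING, "
    "ORIGIN_COUNTRY_NAME STRING, "
    "count INT"
)


def read_flights_csv(spark, path):
    """Read a headered CSV with an explicit schema instead of inferSchema.

    `spark` is an existing SparkSession; `path` is any location Spark can reach.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("mode", "FAILFAST")  # raise an error on malformed records
        .schema(FLIGHT_SCHEMA_DDL)   # skips the extra pass needed for inference
        .load(path)
    )
```

Supplying the schema up front avoids the extra scan that `inferSchema` needs and keeps column types stable between runs.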

### Writing

```python
# General shape of a write; the save mode is set with .mode(), not with .option().
(dataframe.write.format("csv")
  .mode("overwrite")
  .option("dateFormat", "yyyy-MM-dd")
  .option("path", "path/to/file(s)")
  .save())
```

```python
import os

# Pull in the PostgreSQL JDBC driver; this must be set before the SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
```

### Example for reading and writing `CSVs`

```python
# Reading data from Google Cloud Storage
csvFile = spark.read.format("csv")\
  .option("header", "true")\
  .option("mode", "FAILFAST")\
  .option("inferSchema", "true")\
  .load("gs://reddys-data-for-experimenting/flight-data/csv/2010-summary.csv")
```

```python
csvFile.printSchema()
```

```
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
```

```python
# Writing data back to Google Cloud Storage as tab-separated files
csvFile.write.format("csv") \
    .mode("overwrite") \
    .option("sep", "\t") \
    .save("gs://reddys-data-for-experimenting/output/chapter9/tsv")
```

### Example for reading and writing `JSON`

```python
# Reading data from Google Cloud Storage.
# The JSON reader infers the schema by default, so no inferSchema option is needed.
jsonData = spark.read.format("json").option("mode", "FAILFAST")\
  .load("gs://reddys-data-for-experimenting/flight-data/json/2010-summary.json")
```

```python
jsonData.printSchema()
```

```
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
```

```python
jsonData.write.format("json").mode("overwrite")\
  .save("gs://reddys-data-for-experimenting/output/chapter9/json")
```

### Example for reading and writing `Parquet`

```python
parquetData = spark.read.format("parquet")\
  .load("gs://reddys-data-for-experimenting/flight-data/parquet/2010-summary.parquet")
```

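```python
# Write the data back as Parquet (the original cell wrote JSON into the parquet output path)
parquetData.write.format("parquet").mode("overwrite")\
  .save("gs://reddys-data-for-experimenting/output/chapter9/parquet")
```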

### Example for reading and writing `ORC`

```python
orcData = spark.read.format("orc")\
  .load("gs://reddys-data-for-experimenting/flight-data/orc/2010-summary.orc")
```

```python
# Write the data back as ORC (the original cell wrote JSON into the orc output path)
orcData.write.format("orc").mode("overwrite")\
  .save("gs://reddys-data-for-experimenting/output/chapter9/orc")
```

### Example for reading and writing with `JDBC`

Code snippet

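```python
driver = "org.sqlite.JDBC"
# Note: the SQLite JDBC driver expects a local file path, so the database file
# would likely need to be copied out of the bucket before this runs.
path = "gs://reddys-data-for-experimenting/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"

# Read a table through the SQLite JDBC driver
dbDataFrame = spark.read.format("jdbc").option("url", url)\
  .option("dbtable", tablename).option("driver", driver).load()

# Read a table from PostgreSQL; server, table, and credentials are placeholders
pgDF = spark.read.format("jdbc")\
  .option("driver", "org.postgresql.Driver")\
  .option("url", "jdbc:postgresql://database_server")\
  .option("dbtable", "schema.tablename")\
  .option("user", "username").option("password", "my-secret-password").load()
```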

Spark also performs query pushdown: where it can, it passes filters and column selection down to the database, so that it fetches as little data as possible from the underlying data source.
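Pushdown can also be made explicit by handing the database a whole query instead of a table name. The following is a minimal sketch, assuming the `flight_info` table from the snippet above; the helper names `as_dbtable_subquery` and `read_distinct_destinations` are hypothetical. The JDBC source accepts a parenthesised subquery with an alias as its `dbtable` option, and the database then runs that query itself.

```python
def as_dbtable_subquery(sql, alias):
    """Wrap a SELECT statement so it can be passed as the JDBC `dbtable` option."""
    return f"({sql.strip().rstrip(';')}) AS {alias}"


def read_distinct_destinations(spark, url, driver, table="flight_info"):
    """Let the database compute the DISTINCT and return only the result rows.

    `spark` is an existing SparkSession; `url` and `driver` are the JDBC
    connection string and driver class name.
    """
    query = as_dbtable_subquery(
        f"SELECT DISTINCT(DEST_COUNTRY_NAME) FROM {table}", alias=table
    )
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("driver", driver)
        .option("dbtable", query)
        .load()
    )
```

For ordinary DataFrame filters, calling `.explain()` on the result shows which predicates were pushed to the source under `PushedFilters`.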

You can also write to SQL databases using Spark:

```python
csvFile.write.jdbc(newPath, tablename, mode="overwrite", properties=props)
```
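`newPath` and `props` are not defined in the snippet above. A minimal sketch of how they might be set up, assuming a SQLite target on the local filesystem; the URL and the helper names `jdbc_properties` and `write_table` are hypothetical:

```python
def jdbc_properties(driver, user=None, password=None):
    """Build the `properties` dict that DataFrameWriter.jdbc expects."""
    props = {"driver": driver}
    if user is not None:
        props["user"] = user
    if password is not None:
        props["password"] = password
    return props


def write_table(df, url, table, props, mode="overwrite"):
    """Write `df` to `table` at the JDBC `url`, replacing the table by default."""
    df.write.jdbc(url, table, mode=mode, properties=props)


# Hypothetical values for a local SQLite file
newPath = "jdbc:sqlite:/tmp/my-sqlite.db"
props = jdbc_properties("org.sqlite.JDBC")
# write_table(csvFile, newPath, "flight_info", props)
```

With `mode="overwrite"` the target table is replaced; `"append"` adds rows to an existing table instead.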

### Writing data into partitions

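```python
# Repartition first so that the output is written as five files
csvFile.repartition(5).write.format("csv") \
    .save("gs://reddys-data-for-experimenting/output/chapter9/partitioned-csv")
```

```python
# No format is given, so Spark writes its default format (Parquet), one directory per key
csvFile.write.mode("overwrite").partitionBy("DEST_COUNTRY_NAME")\
  .save("gs://reddys-data-for-experimenting/output/chapter9/partitioned-by-key-parquet")
```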
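`partitionBy` writes one subdirectory per distinct value of the column, named `DEST_COUNTRY_NAME=<value>`. A filter on that column then lets Spark skip the other directories when reading. A minimal sketch, assuming the output was written as Parquet as above; the function name `read_one_destination` is hypothetical:

```python
def read_one_destination(spark, base_path, country):
    """Read only the rows for one destination from a partitioned dataset.

    `spark` is an existing SparkSession and `base_path` is the directory
    passed to save(). Filtering on the partition column allows Spark to
    prune the directories that cannot match.
    """
    flights = spark.read.format("parquet").load(base_path)
    return flights.where(flights["DEST_COUNTRY_NAME"] == country)
```

Partitioning by a column that readers filter on often is what makes this layout pay off; a column with very many distinct values produces a large number of small directories.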

### Writing data into buckets

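Bucketing was only supported in `Scala` and not in `Python` when these notes were written; `PySpark` has supported `bucketBy` since Spark 2.3. Bucketed output has to be written with `saveAsTable`, because `save()` does not support `bucketBy`.

```scala
val numberBuckets = 10
val columnToBucketBy = "count"

// "bucketedFiles" is a placeholder table name
csvFile.write.format("parquet")
  .mode("overwrite")
  .bucketBy(numberBuckets, columnToBucketBy)
  .saveAsTable("bucketedFiles")
```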
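For reference, a PySpark sketch of the same bucketed write, assuming Spark 2.3 or later, where `DataFrameWriter.bucketBy` is available in Python. The function name `write_bucketed` and the table name in the usage comment are hypothetical.

```python
def write_bucketed(df, table_name, num_buckets=10, column="count"):
    """Write `df` as a bucketed Parquet table.

    Bucketed data is saved with saveAsTable, so the result is registered
    in the session catalog under `table_name`.
    """
    (
        df.write.format("parquet")
        .mode("overwrite")
        .bucketBy(num_buckets, column)
        .saveAsTable(table_name)
    )


# write_bucketed(csvFile, "bucketed_flights")
```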