# Chapter 5: Loading and Saving your Data (Scala)

In this notebook, we explore how to load and save data in three formats:

* Parquet
* CSV
* JSON

Most examples use the high-level data sources included in Spark SQL. The CSV section also shows the original RDD-based approach.

In [5]:
// Brings the `!` operator into scope so that a string can be run as a shell command.
import sys.process._

## Parquet Format

Loading data

In [1]:
val parquetData = spark.read.parquet("../data/person.parquet")

parquetData = [Name: string, Age: int]


[Name: string, Age: int]

In [2]:
parquetData.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



Saving data

In [3]:
parquetData.write.mode("overwrite").parquet("../data/person_write.parquet")

In [4]:
spark.read.parquet("../data/person_write.parquet").show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+
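
The `mode("overwrite")` call above selects one of Spark's save modes. The sketch below is not part of the original notebook; it contrasts `SaveMode.Overwrite` with `SaveMode.Append` on a throwaway copy of the two-row dataset. The scratch directory, the `peopleDf` DataFrame, and the local master setting are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Returns the notebook's existing session if there is one;
// local[*] is an assumption for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("parquet-save-modes").getOrCreate()
import spark.implicits._

// Illustrative stand-in for parquetData, using the rows shown above.
val peopleDf = Seq(("Raul", 29), ("Javi", 34)).toDF("Name", "Age")

// Hypothetical scratch location; the notebook itself writes to ../data/person_write.parquet.
val parquetOut = Files.createTempDirectory("parquet_modes")
  .resolve("person_write.parquet").toUri.toString

// Overwrite replaces whatever is at the path, so writing twice still leaves two rows.
peopleDf.write.mode(SaveMode.Overwrite).parquet(parquetOut)
peopleDf.write.mode(SaveMode.Overwrite).parquet(parquetOut)
val afterOverwrite = spark.read.parquet(parquetOut).count()

// Append adds new files next to the existing ones, so the row count doubles.
peopleDf.write.mode(SaveMode.Append).parquet(parquetOut)
val afterAppend = spark.read.parquet(parquetOut).count()
```

Without an explicit mode, the writer defaults to `SaveMode.ErrorIfExists` and fails when the target path already exists, which is why the notebook cells set `overwrite` to stay re-runnable.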



## CSV Format

### Original Approach (RDD API)

Loading data

In [61]:
val sep = ","
val csvDataOriginal = sc.textFile("../data/person.csv").map(_.split(sep))

sep = ,
csvDataOriginal = MapPartitionsRDD[100] at map at <console>:33


MapPartitionsRDD[100] at map at <console>:33

In [62]:
csvDataOriginal.take(2)

[[Name, Age], [Raul, 29]]
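
As the output shows, the RDD approach returns every line as an array of strings, header included. The following sketch, which is not part of the original notebook, illustrates one way the header could be separated and the fields typed. It uses a plain `Seq` as a stand-in for the RDD so that it runs without Spark; the `Person` case class and the sample lines are assumptions based on the outputs above.

```scala
// Hypothetical record type; not part of the original notebook.
case class Person(name: String, age: Int)

// Stand-in for the lines of ../data/person.csv (values taken from the outputs above).
val sampleLines = Seq("Name,Age", "Raul,29", "Javi,34")
val fieldSep = ","

// split(sep, -1) keeps trailing empty fields, which plain split(sep) silently drops.
val splitRows = sampleLines.map(_.split(fieldSep, -1))

// The first row is the header; the remaining rows are data.
val headerRow = splitRows.head
val persons = splitRows.tail.map(fields => Person(fields(0), fields(1).trim.toInt))
```

The same `map` calls apply to an RDD of lines. Note that `split` treats its argument as a regular expression, so separators such as `|` would need escaping, and that this simple splitting does not handle quoted fields containing the separator.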

Saving data

In [79]:
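// saveAsTextFile fails if the target directory already exists, so remove it first.
"rm -rf ../data/person_write_orginal.csv".!
val sep = ","
// coalesce(1) reduces the RDD to one partition, so a single part file is written.
csvDataOriginal.map(_.mkString(sep)).coalesce(1).saveAsTextFile("../data/person_write_orginal.csv")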

sep = ,


,

In [80]:
csvDataOriginal.take(1)

[[Name, Age]]

In [81]:
val sep = ","
val csvDataOriginalLoaded = sc.textFile("../data/person_write_orginal.csv").map(_.split(sep))

sep = ,
csvDataOriginalLoaded = MapPartitionsRDD[129] at map at <console>:33


MapPartitionsRDD[129] at map at <console>:33

In [82]:
csvDataOriginalLoaded.take(2)

[[Name, Age], [Raul, 29]]

### Using SQL API

Loading data

In [5]:
val csvData = spark.read.option("header", "true").option("inferSchema", "true").csv("../data/person.csv")

csvData = [Name: string, Age: int]


[Name: string, Age: int]

In [6]:
csvData.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



In [7]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val schema = new StructType(Array(StructField("Name", StringType, true), 
                                  StructField("Age", IntegerType, true)))

schema = StructType(StructField(Name,StringType,true), StructField(Age,IntegerType,true))


StructType(StructField(Name,StringType,true), StructField(Age,IntegerType,true))

In [8]:
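// No "header" option is set here, so the header line of the file is read as a data row
// (it shows up as the row of nulls in the output of the next cell).
val csvDataSchema = spark.read.schema(schema).csv("../data/person.csv")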

csvDataSchema = [Name: string, Age: int]


[Name: string, Age: int]

In [9]:
csvDataSchema.show()

+----+----+
|Name| Age|
+----+----+
|null|null|
|Raul|  29|
|Javi|  34|
+----+----+
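
The leading row of nulls in the output above is the file's header line being parsed as data, because the read sets a schema but not the `header` option. A minimal sketch combining both options follows. It is not part of the original notebook: the temporary file (a stand-in for `../data/person.csv`), the `personSchema` name, and the local master setting are assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Returns the notebook's existing session if there is one;
// local[*] is an assumption for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema-sketch").getOrCreate()

// Temporary stand-in for ../data/person.csv, with the rows shown above.
val csvPath = Files.createTempDirectory("person_csv").resolve("person.csv")
Files.write(csvPath, "Name,Age\nRaul,29\nJavi,34\n".getBytes("UTF-8"))

val personSchema = StructType(Array(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true)))

// header=true skips the first line; the explicit schema avoids the extra pass
// over the data that inferSchema needs.
val csvWithHeader = spark.read
  .option("header", "true")
  .schema(personSchema)
  .csv(csvPath.toUri.toString)

csvWithHeader.show()
```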



Saving data

In [10]:
csvData.write.mode("overwrite").option("header", "true").csv("../data/person_write.csv")

In [11]:
spark.read.option("inferSchema", "true").option("header", "true").csv("../data/person_write.csv").show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+
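
Although the target above is named `person_write.csv`, Spark writes a directory containing one part file per partition rather than a single file. The sketch below, which is not part of the original notebook, writes a small DataFrame and lists the part files it produces. The scratch directory, the `sampleDf` DataFrame, and the local master setting are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Returns the notebook's existing session if there is one;
// local[*] is an assumption for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("csv-output-layout").getOrCreate()
import spark.implicits._

// Illustrative stand-in for csvData, using the rows shown above.
val sampleDf = Seq(("Raul", 29), ("Javi", 34)).toDF("Name", "Age")

// Hypothetical scratch location; the notebook itself writes to ../data/person_write.csv.
val csvOutDir = Files.createTempDirectory("csv_layout").resolve("person_write.csv")

// coalesce(1) asks for a single partition, hence a single part file.
sampleDf.coalesce(1).write.mode("overwrite").option("header", "true").csv(csvOutDir.toUri.toString)

// Despite the .csv suffix, the target is a directory holding the part file(s).
val csvOutFile = csvOutDir.toFile
val partFiles = csvOutFile.listFiles().map(_.getName)
  .filter(n => n.startsWith("part-") && n.endsWith(".csv"))
partFiles.foreach(println)
```

Reading the directory path back with `spark.read.csv`, as the next cell does, picks up all part files at once.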



## JSON Format

### Using SQL API

Loading data

In [12]:
val jsonData = spark.read.json("../data/person.json")

jsonData = [age: bigint, name: string]


[age: bigint, name: string]

In [13]:
jsonData.show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+
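
The inferred schema above types `age` as `bigint` and lists it before `name`. If a narrower type or a fixed column order is wanted, a schema can be supplied when reading, as with CSV. This sketch is not part of the original notebook: the temporary file is a stand-in for `../data/person.json` and assumes one JSON object per line, and the `jsonSchema` name and local master setting are likewise assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Returns the notebook's existing session if there is one;
// local[*] is an assumption for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("json-schema-sketch").getOrCreate()

// Stand-in for ../data/person.json, assuming one JSON object per line.
val jsonLines = """{"name":"Raul","age":29}
{"name":"Javi","age":33}
"""
val jsonPath = Files.createTempDirectory("person_json").resolve("person.json")
Files.write(jsonPath, jsonLines.getBytes("UTF-8"))

val jsonSchema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// With an explicit schema, Spark skips inference and uses the given types and column order.
val jsonTyped = spark.read.schema(jsonSchema).json(jsonPath.toUri.toString)
jsonTyped.printSchema()
```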



Saving data

In [14]:
jsonData.write.mode("overwrite").json("../data/person_write.json")

In [15]:
spark.read.json("../data/person_write.json").show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+

