## Table Of Contents
* [Cache and Persist](#Cache-and-Persist)
* [Block Eviction](#Block-Eviction)
* [Difference between a DAG and a lineage](#Difference-between-a-DAG-and-a-lineage)
* [Executing the jar in spark](#Executing-the-jar-in-spark)
* [Practical - movie ratings](#Practical---movie-ratings)
* [Structured APIs](#Structured-APIs)
* [Practical - Reading CSV](#Practical---Reading-CSV)
* [RDD vs DataFrame vs dataset](#RDD-vs-DataFrame-vs-dataset)
* [Reading JSON files](#Reading-JSON-files)
* [Reading a parquet file](#Reading-a-parquet-file)
* [Ways to define the schema](#Ways-to-define-the-schema)

### Cache and Persist

Suppose you have an RDD (`rdd4` in the sketch below) that was produced by a chain of transformations.

If you need `rdd4` again later, you do not want to recompute every transformation that leads up to it. Calling `cache` on `rdd4` keeps its computed partitions around, so later actions reuse them instead of recomputing the chain.

```
rdd1
rdd2
rdd3
rdd4.cache
rdd5
rdd5.collect
```



In [2]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and keep the 1st and 3rd fields (the 3rd converted to Float)
// as a (customerId, amount) tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Sum the amount per customer
val totalPurchase = splitCust.reduceByKey((x, y) => x+y)
val finalInfo = totalPurchase.filter(x => x._2>5000) // customers spending more than 5K

// The first action computes and caches doubledAmount; later actions on it reuse the cached partitions
val doubledAmount = finalInfo.map(x => (x._1 , x._2*2)).cache()
val final_ = doubledAmount.collect()
for(info <- final_){
    println(info)
}

(19,10118.861)
(42,11393.681)
(62,10506.643)
(6,10795.759)
(46,11926.222)
(2,11989.182)
(93,10531.5)
(28,10001.421)
(59,11285.781)
(24,10519.84)
(39,12386.221)
(11,10304.58)
(64,10577.38)
(8,11034.48)
(60,10081.419)
(15,10827.0205)
(35,10310.84)
(97,11954.379)
(0,11049.899)
(55,10596.18)
(40,10372.859)
(71,11991.32)
(22,10038.898)
(26,10500.801)
(68,12750.9)
(33,10509.318)
(17,10065.359)
(73,12412.398)
(69,10246.02)
(41,11275.238)
(92,10758.562)
(9,10645.299)
(34,10661.599)
(61,10994.96)
(81,10225.42)
(25,10115.221)
(63,10830.3)
(65,10280.699)
(29,10065.061)
(90,10580.82)
(32,10992.101)
(85,11006.861)
(54,12130.78)
(72,10674.879)
(52,10490.121)
(58,10875.461)
(87,10412.799)
(70,10736.501)
(43,10737.66)


rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[7] at textFile at <console>:34
splitCust = MapPartitionsRDD[8] at map at <console>:37
totalPurchase = ShuffledRDD[9] at reduceByKey at <console>:40
finalInfo = MapPartitionsRDD[10] at filter at <console>:41
doubledAmount = MapPartitionsRDD[11] at map at <console>:44
final_ = Array((19,10118.861), (42,11393.681), (62,10506.643), (6,10795.759), (46,11926.222), (2,11989.182), (93,10531.5), (28,10001.421), (59,11285.781), (24,10519.84), (39,12386.221), (11,...



`cache` and `persist` serve the same purpose: they keep a computed RDD around so that it can be reused rather than recalculated.<br>
An RDD that is not cached is re-evaluated every time an action is called on it.

The difference is that `cache` always keeps the RDD in memory, whereas `persist` accepts a storage level. Calling `persist` without arguments behaves the same as `cache`.

`persist(StorageLevel.LEVEL)` (with `StorageLevel` imported from `org.apache.spark.storage`)<br>
* MEMORY_ONLY - Data is cached in memory in non-serialized form. If there is not enough memory, no error is raised; the partitions that do not fit are simply not cached and are recomputed when needed.
* DISK_ONLY - Data is cached on disk in serialized form, which takes less space.
* MEMORY_AND_DISK - Very widely used. Data is cached in memory, and blocks that do not fit or are evicted from memory are written to disk in serialized form. Recommended when re-evaluation is expensive and memory is scarce.
* OFF_HEAP - Blocks are cached off-heap, that is, in memory outside the JVM heap. Objects stored on the JVM heap are reclaimed by garbage collection, which can be time-consuming; off-heap storage avoids that cost by using memory the garbage collector does not manage. This is fast, but it gives up the safety of JVM-managed memory.
* MEMORY_ONLY_SER - Same as MEMORY_ONLY, but the data is kept in serialized form.
* MEMORY_AND_DISK_SER - Same as MEMORY_AND_DISK, but the in-memory data is kept in serialized form.
* MEMORY_ONLY_2 - Same as MEMORY_ONLY, but each block is replicated on two different worker nodes for high availability of the cached RDD.
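
A minimal sketch of `persist` with an explicit storage level, assuming a local session. The app name, the master and the toy `squares` RDD are made up for illustration and are not part of the practical above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session purely for illustration (app name and master are assumptions)
val spark = SparkSession.builder().appName("persist-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Toy RDD standing in for an expensive chain of transformations
val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

// Keep partitions in memory and spill the ones that do not fit to disk
squares.persist(StorageLevel.MEMORY_AND_DISK)

val total = squares.sum()    // first action computes and stores the partitions
val count = squares.count()  // second action reads the persisted partitions

// Human-readable description of the storage level in effect
println(squares.getStorageLevel.description)

// Release the cached blocks once they are no longer needed
squares.unpersist()
```

Calling `unpersist()` explicitly is optional, but it frees storage memory earlier than waiting for eviction.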

### Block Eviction
If partitions (blocks) are large, they quickly fill up the memory set aside for caching. When the storage memory is full, existing blocks are evicted to make room for new ones, following an LRU (least recently used) policy.
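
To make the LRU idea concrete, here is a toy sketch in plain Scala. It only illustrates the eviction policy; it is not Spark's block manager, and `LruBlockStore` and the block names are hypothetical.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Toy store that holds at most `capacity` blocks.
// accessOrder = true makes the map order its entries from least to most recently used.
class LruBlockStore[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
  // Called by put(); returning true drops the least recently used entry
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}

val store = new LruBlockStore[String, Array[Byte]](2)
store.put("block_0", new Array[Byte](8))
store.put("block_1", new Array[Byte](8))
store.get("block_0")                      // touch block_0, so block_1 is now the oldest
store.put("block_2", new Array[Byte](8))  // over capacity: block_1 is evicted

println(store.keySet())
```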

**Serialization** increases the processing cost but reduces the memory footprint.<br>
**Non-serialized** storage has a larger memory footprint but needs less processing.

Do not cache/persist your base RDDs.

In [3]:
// Read the lineage of the cached RDD from bottom to top
doubledAmount.toDebugString

(2) MapPartitionsRDD[11] at map at <console>:44 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 4.4 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  MapPartitionsRDD[10] at filter at <console>:41 [Memory Deserialized 1x Replicated]
 |  ShuffledRDD[9] at reduceByKey at <console>:40 [Memory Deserialized 1x Replicated]
 +-(2) MapPartitionsRDD[8] at map at <console>:37 [Memory Deserialized 1x Replicated]
    |  /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[7] at textFile at <console>:34 [Memory Deserialized 1x Replicated]
    |  /user/itv002768/customerorders_practical/customerorders-201008-180523.csv HadoopRDD[6] at textFile at <console>:34 [Memory Deserialized 1x Replicated]


In [5]:
import org.apache.spark.storage.StorageLevel

val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and keep the 1st and 3rd fields (the 3rd converted to Float)
// as a (customerId, amount) tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Sum the amount per customer
val totalPurchase = splitCust.reduceByKey((x, y) => x+y)
val finalInfo = totalPurchase.filter(x => x._2>5000) // customers spending more than 5K

// StorageLevel starts with a capital S and must be imported;
// writing storageLevel fails with "error: not found: value storageLevel"
val doubledAmountPersist = finalInfo.map(x => (x._1 , x._2*2)).persist(StorageLevel.MEMORY_AND_DISK)
val final_ = doubledAmountPersist.collect()
for(info <- final_){
    println(info)
}


### Difference between a DAG and a lineage

Lineage is the dependency graph of RDDs: it records which RDD was derived from which, and it is read from bottom to top. You can think of it as the logical plan.

A DAG (directed acyclic graph) describes how that work is executed, in terms of jobs, stages and tasks.

### Executing the jar in spark
The `spark-submit` script is in the `bin` directory of the Spark installation. Pass it the main class and the path to the jar:

```
./spark-submit --class WordCount /path/to/jar
```
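
For context, a class such as `WordCount` in the command above could look roughly like the sketch below. This is an assumed, minimal application and not the original job; the helper `countWords` and the use of the first argument as the input path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  // Counts whitespace-separated words in the file at `path`
  def countWords(spark: SparkSession, path: String): Map[String, Int] =
    spark.sparkContext
      .textFile(path)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // The master is deliberately not set here; spark-submit supplies it (e.g. via --master)
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    countWords(spark, args(0)).foreach(println)
    spark.stop()
  }
}
```

With this layout the input path would be passed after the jar on the `spark-submit` command line.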

### Practical - movie ratings
Below are the two datasets. Each ratings line has the form user_id::movie_id::rating::watch_time, and each movies line has the form movie_id::movie_name::genres.

```
[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week11_practicals/ratings-201019-002101.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
1::938::4::978301752

[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week11_practicals/movies-201019-002101.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
```

Problem statement: Find the names of all movies with an average rating > 4.5 that have been rated by at least 1k people. (Note that the code below filters on more than 100 ratings rather than 1k.)



In [27]:
val rawData = sc.textFile("/user/itv002768/week11_practicals/ratings-201019-002101.dat")
// Keep (movie_id, rating) from each ratings line
val mappedRatingsRdd = rawData.map(x => {
    val fields = x.split("::")
    (fields(1), fields(2))
})
// Pair each rating with a count of 1 so that sum and count can be reduced together
val valMap = mappedRatingsRdd.mapValues(x => (x.toFloat, 1.0))
// (sum of ratings, number of ratings) per movie
val reducedRdd = valMap.reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2))
// Keep movies with more than 100 ratings
val filteredRdd = reducedRdd.filter(x => x._2._2 > 100)
// Average rating per movie, keeping only averages above 4.5
val filterRatings = filteredRdd.mapValues(x => x._1/x._2).filter(x=> x._2 > 4.5)
//filterRatings.collect().foreach(println)

val movieInfo = sc.textFile("/user/itv002768/week11_practicals/movies-201019-002101.dat")
// Keep (movie_id, movie_name) from each movies line
val requiredMovieInfo = movieInfo.map(x => {
    val fields = x.split("::")
    (fields(0), fields(1))
})

// Join on movie_id and keep only the movie name
val joinedInfo = requiredMovieInfo.join(filterRatings)
val final_ = joinedInfo.map(x => x._2._1)
final_.collect().foreach(println)

Schindler's List (1993)
Shawshank Redemption, The (1994)
Close Shave, A (1995)
Wrong Trousers, The (1993)
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
Godfather, The (1972)
Usual Suspects, The (1995)


rawData = /user/itv002768/week11_practicals/ratings-201019-002101.dat MapPartitionsRDD[168] at textFile at <console>:38
mappedRatingsRdd = MapPartitionsRDD[169] at map at <console>:39
valMap = MapPartitionsRDD[170] at mapValues at <console>:43
reducedRdd = ShuffledRDD[171] at reduceByKey at <console>:44
filteredRdd = MapPartitionsRDD[172] at filter at <console>:45
filterRatings = MapPartitionsRDD[174] at filter at <console>:46
movieInfo = /user/itv002768/week11_practicals/movie...
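
The (sum, count) pattern used with `reduceByKey` above can be checked on a tiny in-memory sample with plain Scala collections. The sample lines and the thresholds below are made up for illustration; only the `::` layout matches the ratings file.

```scala
// Hypothetical sample in the user_id::movie_id::rating::watch_time layout
val sample = Seq("1::10::5::0", "2::10::4::0", "3::20::3::0", "4::10::5::0")

// movie_id -> (average rating, number of ratings)
val stats: Map[String, (Double, Int)] = sample
  .map { line =>
    val f = line.split("::")
    (f(1), (f(2).toDouble, 1))
  }
  .groupBy(_._1)
  .map { case (movie, rows) =>
    // Reduce (rating, 1) pairs into (sum, count), as reduceByKey does per key
    val (sum, n) = rows.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    movie -> (sum / n, n)
  }

// Keep movies with at least 2 ratings and an average above 4.5
val popularAndGood = stats.filter { case (_, (avg, n)) => n >= 2 && avg > 4.5 }.keySet
println(popularAndGood)
```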



## Structured APIs

The Structured APIs come in the form of DataFrames and Datasets. Internally, everything is still executed as RDDs.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in an RDBMS, except that a DataFrame's data is split across many partitions.

DataFrames and Datasets were already available in Spark 1. Spark 2 improved support for both and merged them into a single API known as the "Dataset API".

From here on we use SparkSession instead of SparkContext. Previously a separate context was needed for each feature (Spark context, Hive context, SQL context, etc.). SparkSession is a unified entry point to a Spark application: it encapsulates all of these and gives access to the various Spark functionalities through fewer constructs.

SparkSession behaves like a singleton: `getOrCreate()` returns the existing session if there is one and only creates a new session otherwise.
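
A small sketch of that behaviour, assuming a local session (the app name and master are illustrative): asking the builder for a session twice hands back the same object.

```scala
import org.apache.spark.sql.SparkSession

// First call creates the session (or picks up one that is already running)
val first = SparkSession.builder().appName("session-sketch").master("local[2]").getOrCreate()

// Second call finds the existing session instead of building a new one
val second = SparkSession.builder().getOrCreate()

// Reference equality: both names point at the same SparkSession instance
val sameSession = first eq second
println(sameSession)
```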

In [34]:
// SparkSession example
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().
  appName("My Application-1").
  master("local[2]").
  getOrCreate()

// Stop the session when done
spark.stop()

spark = org.apache.spark.sql.SparkSession@234e2e4f



In [35]:
// using spark conf
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

spark.stop()

sparkConfig = org.apache.spark.SparkConf@e4645e
spark = org.apache.spark.sql.SparkSession@5bbe8b97



### Practical - Reading CSV
```
[itv002768@g02 ~]$ hadoop fs -put orders-201019-002101.csv /user/itv002768/week11_practicals/
[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week11_practicals/orders-201019-002101.csv
order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
```

In [42]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Avoid inferSchema in production: it needs an extra pass over the data and the inferred types may not be what you expect
val ordersDf = spark.read.option("header", true)
               .option("inferSchema", true)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")
ordersDf.show()
ordersDf.printSchema()

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:00|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:00|             1837|         CLOSED|
|      13|2013-07-25 00:0

sparkConfig = org.apache.spark.SparkConf@37d6826e
spark = org.apache.spark.sql.SparkSession@1f90077d
ordersDf = [order_id: int, order_date: timestamp ... 2 more fields]



In [44]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Avoid inferSchema in production: it needs an extra pass over the data and the inferred types may not be what you expect
val ordersDf = spark.read.option("header", true)
               .option("inferSchema", true)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")
ordersDf.show()
ordersDf.printSchema()

// Count the orders per customer for customers whose customer_id > 10000
val groupedOrdersDf = ordersDf
                    .repartition(4)
                    .where("order_customer_id > 10000") // filtering
                    .select("order_id", "order_customer_id") // selecting columns
                    .groupBy("order_customer_id")
                    .count()

groupedOrdersDf.show()


+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:00|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:00|             1837|         CLOSED|
|      13|2013-07-25 00:0

sparkConfig = org.apache.spark.SparkConf@63e24e85
spark = org.apache.spark.sql.SparkSession@1f90077d
ordersDf = [order_id: int, order_date: timestamp ... 2 more fields]
groupedOrdersDf = [order_customer_id: int, count: bigint]
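
The same kind of query can also be expressed in SQL against a temporary view. The sketch below uses a small in-memory DataFrame with made-up rows instead of the orders file; the view name `orders` is an assumption.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-sketch").master("local[2]").getOrCreate()
import spark.implicits._

// Hypothetical rows: (order_id, order_customer_id)
val orders = Seq((1, 11599), (2, 256), (3, 11599), (4, 12111))
  .toDF("order_id", "order_customer_id")

// DataFrame API: filter, then count the orders per customer
val viaApi = orders
  .where("order_customer_id > 10000")
  .groupBy("order_customer_id")
  .count()

// Equivalent SQL on a temporary view
orders.createOrReplaceTempView("orders")
val viaSql = spark.sql(
  """SELECT order_customer_id, count(*) AS `count`
    |FROM orders
    |WHERE order_customer_id > 10000
    |GROUP BY order_customer_id""".stripMargin)

// Collect both results as sets of (customer_id, order_count) for comparison
val apiRows = viaApi.collect().map(r => (r.getInt(0), r.getLong(1))).toSet
val sqlRows = viaSql.collect().map(r => (r.getInt(0), r.getLong(1))).toSet
println(apiRows == sqlRows)
```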



### RDD vs DataFrame vs dataset

* With RDDs we write low-level code and have to tell the system how to do the work. This is not developer friendly.
* Low-level code also lacks some basic optimizations.

To make developers' lives easier, **DataFrames** were introduced in Spark 1.3:
* They offer higher-level constructs. We tell the system what we want, and it takes care of how to do it.

Challenges with DataFrames:
* DataFrames are not strongly typed: type errors are caught at run time rather than at compile time.
* Developers felt that their flexibility had become limited.

A DataFrame can be converted to an RDD with `df.rdd` whenever more flexibility and type safety are needed. However:
* The conversion from DataFrame to RDD is not seamless.
* Working with the raw RDD means missing out on some major optimizations. The Catalyst optimizer and the Tungsten engine only apply to the structured APIs.


Datasets were introduced in Spark 1.6 to address these challenges. They provide:
* Compile-time type safety.
* More flexibility to use lower-level code.
* Seamless conversion between DataFrame and Dataset.
* No loss of the optimizations.

Before Spark 2.0, DataFrames and Datasets were separate things. In Spark 2.0 they were merged into a unified Dataset API, also called the Structured API.

A DataFrame is simply a `Dataset[Row]`, where `Row` is a generic type whose fields are only resolved at run time.

For a DataFrame the data types are bound at run time, whereas for a `Dataset[Employee]` they are bound at compile time.

`Dataset[Row]` -> DataFrame (type errors are caught at run time)

`Dataset[Employee]` -> Dataset (compile-time type safety)

How do we convert a DataFrame to a Dataset? Replace the generic `Row` with a specific type and it becomes a typed Dataset.
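
A minimal sketch of that conversion on in-memory data, assuming a local session. The case class `Order` and its rows are hypothetical stand-ins for the orders file.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Specific type that replaces the generic Row
case class Order(order_id: Int, order_status: String)

val spark = SparkSession.builder().appName("dataset-sketch").master("local[2]").getOrCreate()
import spark.implicits._  // brings in the encoders needed by toDF and as[Order]

// Untyped DataFrame (Dataset[Row]) built from made-up rows
val ordersDf = Seq((1, "CLOSED"), (2, "PENDING_PAYMENT"), (11, "COMPLETE"))
  .toDF("order_id", "order_status")

// Typed Dataset: column names and types must match the case class fields
val ordersDs: Dataset[Order] = ordersDf.as[Order]

// A misspelled field such as o.order_ids would be rejected by the compiler here
val smallOrders = ordersDs.filter(o => o.order_id < 10).collect()
smallOrders.foreach(println)
```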
 

In [46]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Avoid inferSchema in production: it needs an extra pass over the data and the inferred types may not be what you expect
val ordersDf = spark.read.option("header", true)
               .option("inferSchema", true)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")
ordersDf.show()
ordersDf.printSchema()

// This will work fine
//ordersDf.filter("order_id > 10").show()

/* This fails at run time rather than at compile time:
cannot resolve '`order_ids`' given input columns: [order_id, order_date, order_customer_id, order_status]; line 1 pos 0;
*/
ordersDf.filter("order_ids > 10").show()

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:00|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:00|             1837|         CLOSED|
|      13|2013-07-25 00:0

org.apache.spark.sql.AnalysisException: cannot resolve '`order_ids`' given input columns: [order_id, order_date, order_customer_id, order_status]; line 1 pos 0;
'Filter ('order_ids > 10)
+- Relation[order_id#238,order_date#239,order_customer_id#240,order_status#241] csv


How do we convert a DataFrame to a Dataset?

Create a case class and convert the DataFrame to a `Dataset[OrdersData]`.

In [53]:
import java.sql.Timestamp
import org.apache.spark.SparkConf
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
// Without this import the cell fails with "error: not found: value SparkSession",
// followed by "Unable to find encoder for type OrdersData" for the as[OrdersData] call
import org.apache.spark.sql.SparkSession

case class OrdersData(order_id: Int, order_date: Timestamp, order_customer_id: Int, order_status: String)

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Avoid inferSchema in production: it needs an extra pass over the data and the inferred types may not be what you expect
val ordersDf: Dataset[Row] = spark.read.option("header", true)
               .option("inferSchema", true)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")

/* 
This import is required to convert a DataFrame to a Dataset or vice versa.
It cannot go at the top because it needs an existing SparkSession.
*/
import spark.implicits._
val ordersDs = ordersDf.as[OrdersData]

// Typed filter: the field name order_id is checked by the compiler
ordersDs.filter(x => x.order_id < 10).show()


Converting a DataFrame to a Dataset involves an overhead for casting each row to the specific type, which is why DataFrames are generally preferred.

With DataFrames, data stays in Tungsten's binary format, which is very fast. With typed Dataset operations, rows have to be converted between that binary format and JVM objects, which is slower and impacts performance. Datasets definitely help catch developer mistakes at compile time, but this comes at the cost of that extra serialization work.

**Operations we've done till now**

1. Read the data

There can be two kinds of data sources:
- External data source - e.g. a DBMS, Redshift, MongoDB, an external API.
- Internal data source - e.g. S3, HDFS, Azure, GCS.

Spark gives us the flexibility to create a DataFrame directly from an external data source.<br>
However, Spark is good at processing but not that efficient at data ingestion.

For example, Spark can fetch data from MySQL over JDBC, but this is not considered good practice. Instead, use Sqoop to bring the data into HDFS and read it from there as an internal data source.

2. Perform the transformations

3. Write the data to the target (sink)

Again, the sink can be internal or external, but writing to external data sources is not recommended.
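
As a rough sketch of the read, transform and write steps, the snippet below writes a small DataFrame to a sink and reads it back. It assumes a local session and a temporary local directory standing in for an internal location such as HDFS; the rows and names are made up.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("sink-sketch").master("local[2]").getOrCreate()
import spark.implicits._

// Hypothetical output location; on a cluster this would be an HDFS or S3 path
val outDir = Files.createTempDirectory("orders_sink").resolve("out").toString

// 1. Read (here: made-up in-memory rows instead of a file)
val orders = Seq((1, "CLOSED"), (2, "COMPLETE"), (3, "CLOSED")).toDF("order_id", "order_status")

// 2. Transform
val closedOrders = orders.where("order_status = 'CLOSED'")

// 3. Write to the sink, replacing any previous output
closedOrders.write.format("parquet").mode(SaveMode.Overwrite).save(outDir)

// Read the result back to check what was written
val readBack = spark.read.format("parquet").load(outDir)
val writtenIds = readBack.collect().map(_.getAs[Int]("order_id")).toSet
println(writtenIds)
```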

**The standard way of reading files is `format(...)` with `load()`, shown below, rather than a format-specific method such as `csv`**

In [3]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Avoid inferSchema in production: it needs an extra pass over the data and the inferred types may not be what you expect
val ordersDf = spark.read.format("csv")
               .option("header", true)
               .option("inferSchema", true)
               .option("path", "/user/itv002768/week11_practicals/orders-201019-002101.csv").load()
ordersDf.show()

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:00|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:00|             1837|         CLOSED|
|      13|2013-07-25 00:0

sparkConfig = org.apache.spark.SparkConf@549f4ded
spark = org.apache.spark.sql.SparkSession@2b38cf
ordersDf = [order_id: int, order_date: timestamp ... 2 more fields]



### Reading JSON files

```
[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week11_practicals/players-201019-002101.json
{"player_id":101, "player_name":"R Sharma", "age":33, "role":"Batsman", "team_id":11, "country":"IND"}
{"player_id":102, "player_name":"S Iyer", "age":25, "role":"Batsman", "team_id":15, "country":"IND"}
{"player_id":103, "player_name":"T Boult", "age":30, "role":"Bowler", "team_id":13, "country":"NZ"}
{"player_id":104, "player_name":"MS Dhoni", "age":38, "role":"WKeeper", "team_id":14, "country":"IND"}
{"player_id":105, "player_name":"S Watson", "age":39, "role":"Allrounder", "team_id":12, "country":"AUS"}
{"player_id":106, "player_name":"S Hetmyer", "age":23, "role":"Batsman", "team_id":16, "country":"WI"}
```

To deal with malformed JSON lines, the reader supports the following modes:
* PERMISSIVE - The default mode. When it encounters a corrupt row it sets all the fields to null and puts the raw record in a new column, `_corrupt_record`.
* DROPMALFORMED - Malformed rows are dropped.
* FAILFAST - An exception is raised as soon as a malformed record is found.
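
The three modes can be compared on a tiny in-memory sample. This is a sketch assuming a local session; the two JSON lines are made up, and the second one is deliberately truncated so that it is malformed.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-modes-sketch").master("local[2]").getOrCreate()
import spark.implicits._

// One valid record and one truncated (malformed) record
val lines = Seq(
  """{"player_id":101, "player_name":"R Sharma"}""",
  """{"player_id":102, "player_name" """
).toDS()

// PERMISSIVE keeps the bad row and exposes it via _corrupt_record
val permissive = spark.read.option("mode", "PERMISSIVE").json(lines)
val permissiveRows = permissive.collect()

// DROPMALFORMED silently discards the bad row
val droppedRows = spark.read.option("mode", "DROPMALFORMED").json(lines).collect()

// FAILFAST raises an exception, captured here as a Failure
val failFast = Try(spark.read.option("mode", "FAILFAST").json(lines).collect())

println(permissive.columns.mkString(", "))
println(s"permissive=${permissiveRows.length} dropped=${droppedRows.length} failFastFailed=${failFast.isFailure}")
```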

In [6]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

// Read the JSON file, dropping any malformed records
val ordersDf = spark.read.format("json")
               .option("path", "/user/itv002768/week11_practicals/players-201019-002101.json")
               .option("mode", "DROPMALFORMED")
               .load()
ordersDf.printSchema()
ordersDf.show()

root
 |-- age: long (nullable = true)
 |-- country: string (nullable = true)
 |-- player_id: long (nullable = true)
 |-- player_name: string (nullable = true)
 |-- role: string (nullable = true)
 |-- team_id: long (nullable = true)

+---+-------+---------+-----------+----------+-------+
|age|country|player_id|player_name|      role|team_id|
+---+-------+---------+-----------+----------+-------+
| 33|    IND|      101|   R Sharma|   Batsman|     11|
| 25|    IND|      102|     S Iyer|   Batsman|     15|
| 30|     NZ|      103|    T Boult|    Bowler|     13|
| 38|    IND|      104|   MS Dhoni|   WKeeper|     14|
| 39|    AUS|      105|   S Watson|Allrounder|     12|
| 23|     WI|      106|  S Hetmyer|   Batsman|     16|
+---+-------+---------+-----------+----------+-------+



sparkConfig = org.apache.spark.SparkConf@3041bf43
spark = org.apache.spark.sql.SparkSession@2b38cf
ordersDf = [age: bigint, country: string ... 4 more fields]



### Reading a parquet file

In [8]:
import org.apache.spark.SparkConf

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

val ordersDf = spark.read.format("parquet") // parquet is also the default format when none is specified
               .option("path", "/user/itv002768/week11_practicals/users-201019-002101.parquet")
               .load()
ordersDf.printSchema()
ordersDf.show()

root
 |-- registration_dttm: timestamp (nullable = true)
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- cc: string (nullable = true)
 |-- country: string (nullable = true)
 |-- birthdate: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- title: string (nullable = true)
 |-- comments: string (nullable = true)

+-------------------+---+----------+---------+--------------------+------+---------------+-------------------+--------------------+----------+---------+--------------------+--------------------+
|  registration_dttm| id|first_name|last_name|               email|gender|     ip_address|                 cc|             country| birthdate|   salary|               title|            comments|
+-------------------+---+----------+---------+--------------------+------+--------------

sparkConfig = org.apache.spark.SparkConf@f6d9291
spark = org.apache.spark.sql.SparkSession@2b38cf
ordersDf = [registration_dttm: timestamp, id: int ... 11 more fields]



### Ways to define the schema

There are three options:

- Infer schema - Not preferred for production.

- Implicit schema - Read a file format that carries its schema with it, e.g. Parquet or Avro (for JSON, the schema is derived from the field names and values in the data).

- Explicit schema - Define the schema manually.

### Explicit Schema
* Programmatically.
Here we create a `StructType`, which describes the layout of one row.

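```scala
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, TimestampType}

// One StructField per column: the column name first, then the Spark data type
val ordersSchema = StructType(List(
  StructField("orderid", IntegerType),
  StructField("orderdate", TimestampType)
  // ... the remaining columns follow the same pattern
))
```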

There is one `StructField` per column.

```
Scala     Spark
---------------
Int       IntegerType
Long      LongType
Float     FloatType
Double    DoubleType
String    StringType
Timestamp TimestampType
```
In `StructField("order_id", IntegerType)`, the first parameter is the column name and the second is the data type. An optional third, Boolean parameter says whether the field is nullable. It defaults to true; passing false declares that the column should contain only non-null values.
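
A short sketch of the `nullable` flag, together with the DDL-string form of the same schema. The column names are illustrative; `StructType.fromDDL` is used here only to inspect what a DDL string turns into.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Programmatic schema: order_id is declared non-nullable, order_status keeps the default (nullable)
val programmatic = StructType(List(
  StructField("order_id", IntegerType, nullable = false),
  StructField("order_status", StringType)
))

// The same columns described as a DDL string
val fromDdl = StructType.fromDDL("order_id INT, order_status STRING")

// Look up fields by name to inspect their type and nullability
println(programmatic("order_id").nullable)
println(programmatic("order_status").nullable)
println(fromDdl("order_status").dataType)
```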

Below is the code for the same.

In [14]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.TimestampType

val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

val ordersSchema = StructType(List(
  StructField("order_id", IntegerType),
  StructField("order_date", TimestampType),
  StructField("order_customer_id", IntegerType),
  StructField("order_status", StringType)
))

val ordersDf = spark.read.option("header", true)
               .schema(ordersSchema)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")
ordersDf.printSchema()
ordersDf.show()


root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|    

sparkConfig = org.apache.spark.SparkConf@33ee22bb
spark = org.apache.spark.sql.SparkSession@2b38cf
ordersSchema = StructType(StructField(order_id,IntegerType,true), StructField(order_date,TimestampType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
ordersDf = [order_id: int, order_date: timestamp ... 2 more fields]



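* DDL string

The schema can also be given as a DDL-style string of `column_name type` pairs:

val ordersSchemaDDL = "order_id Int, order_date String, cust_id Int, order_status Int"

In the run below every row comes back as null. The most likely cause is that `order_status` is declared as `Int` although the file holds strings such as `CLOSED`, so each record fails to parse and the default PERMISSIVE mode nulls out its fields. Declaring `order_status` as `String` avoids this.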

In [15]:
import org.apache.spark.SparkConf


val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application-1")
sparkConfig.set("spark.master", "local[2]")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

val ordersSchemaDDL = "order_id Int, order_date String, cust_id Int, order_status Int"

val ordersDf = spark.read.option("header", true)
               .schema(ordersSchemaDDL)
               .csv("/user/itv002768/week11_practicals/orders-201019-002101.csv")
ordersDf.printSchema()
ordersDf.show()


root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- cust_id: integer (nullable = true)
 |-- order_status: integer (nullable = true)

+--------+----------+-------+------------+
|order_id|order_date|cust_id|order_status|
+--------+----------+-------+------------+
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null|      null|   null|        null|
|    null

sparkConfig = org.apache.spark.SparkConf@771dcad8
spark = org.apache.spark.sql.SparkSession@2b38cf
ordersSchemaDDL = order_id Int, order_date String, cust_id Int, order_status Int
ordersDf = [order_id: int, order_date: string ... 2 more fields]
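
For comparison, here is a sketch of the DDL approach with `order_status` declared as a string. It assumes a local session and writes a two-line temporary CSV file (made up, in the same layout as the orders file) so that it does not depend on the HDFS path.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ddl-schema-sketch").master("local[2]").getOrCreate()

// Hypothetical sample file: a header line plus one order
val csvPath = Files.createTempFile("orders", ".csv")
Files.write(csvPath,
  "order_id,order_date,order_customer_id,order_status\n1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    .getBytes("UTF-8"))

// order_status is a STRING here, matching the values in the file
val ordersSchemaDDL = "order_id INT, order_date STRING, cust_id INT, order_status STRING"

val ordersDf = spark.read.option("header", true)
  .schema(ordersSchemaDDL)
  .csv(csvPath.toUri.toString)

// With matching types the fields are populated instead of null
val firstOrder = ordersDf.head()
println(firstOrder)
```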

