* Overview of Spark read APIs
* Let us get an overview of the Spark read APIs used to read files of different formats.

* Spark has a bunch of APIs to read data from files of different formats.
* All of these APIs are exposed under `spark.read`.
* `text` - to read single-column data from text files, one record per line. With the `wholetext` option, each whole text file is read as one record instead.
* `csv` - to read text files with delimiters. Default is a comma, but we can use other delimiters as well.
* `json` - to read data from JSON files
* `orc` - to read data from ORC files
* `parquet` - to read data from Parquet files.
* We can also read data from other file formats by plugging in the relevant data source and using `spark.read.format`.
* We can also pass options based on the file format.
* `inferSchema` - to infer the data types of the columns based on the data.
* `header` - to use the header line for the column names in the case of delimited text files.
* `schema` - to explicitly specify the schema.
* We can get help on APIs like `spark.read.csv` using `help(spark.read.csv)`.
* Reading delimited data from text files.
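
The two ways of reaching a reader described above (a per-format shortcut such as `spark.read.csv`, or the generic `spark.read.format(...).load(...)` chain) can be sketched as follows. This is a minimal sketch, assuming `spark` is an existing SparkSession as in a notebook; the function name `read_csv_two_ways` is made up for illustration.

```python
def read_csv_two_ways(spark, path):
    """Read the same CSV path with the shortcut method and with format/load.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    # Shortcut: one method per built-in format (text, csv, json, orc, parquet).
    df_short = spark.read.csv(path, header=True, inferSchema=True)

    # Generic form: name the format, set options one by one, then load.
    df_generic = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )
    return df_short, df_generic
```

Both calls should describe the same read; the generic form is the one to reach for when the format comes from a plugged-in data source rather than a built-in one.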

#### Read and Write CSV Files

#### Read files Options
* `path`: location of files. Accepts standard Hadoop globbing expressions. To read a directory of CSV files, specify a directory.
* `header`: when set to true, the first line of each file is used to name the columns and is not included in the data. All types are assumed to be string. Default value is false.
* `sep`: the column delimiter. By default `,`, but can be set to any character.
* `quote`: the quote character. By default `"`, but can be set to any character. Delimiters inside quotes are ignored.
* `escape`: the escape character. By default `\`, but can be set to any character. Escaped quote characters are ignored.
* `parserLib`: by default is commons. Can be set to univocity to use that library for CSV parsing.
* `mode`: the parsing mode. By default it is `PERMISSIVE`. Possible values are:
  * `PERMISSIVE`: try to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
  * `DROPMALFORMED`: drop lines that have fewer or more tokens than expected or tokens which do not match the schema.
  * `FAILFAST`: abort with a RuntimeException if any malformed line is encountered.
* `charset`: the character set. By default UTF-8, but can be set to other valid charset names. Spark's built-in CSV source also accepts this option under the name `encoding`.
* `inferSchema`: automatically infer column types. It requires one extra pass over the data and is false by default.
* `comment`: skip lines beginning with this character. Default is `#`. Disable comments by setting this to null.
* `nullValue`: string that indicates a null value. Any fields matching this string will be set as nulls in the DataFrame.
* `dateFormat`: string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the patterns of `java.text.SimpleDateFormat`. This applies to both `DateType` and `TimestampType`. By default it is null, which means times and dates are parsed with `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.
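
As a rough illustration of combining an explicit schema with a parsing mode from the list above, the sketch below builds a reader from a DDL schema string. It assumes `spark` is an existing SparkSession; the helper name `read_csv_with_schema`, the `"NA"` null marker, and the example schema string in the comment are assumptions made for illustration.

```python
PARSE_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def read_csv_with_schema(spark, path, schema_ddl, mode="FAILFAST"):
    """Read a CSV file with an explicit schema instead of inferSchema.

    `schema_ddl` is a DDL string such as "Deptno INT, Dname STRING, Loc STRING"
    (hypothetical example). `spark` is assumed to be an existing SparkSession.
    """
    if mode not in PARSE_MODES:
        raise ValueError(f"mode must be one of {PARSE_MODES}, got {mode!r}")
    reader = (
        spark.read.format("csv")
        .schema(schema_ddl)           # explicit schema: no extra pass over the data
        .option("header", "true")     # first line holds column names
        .option("mode", mode)         # how malformed lines are handled
        .option("nullValue", "NA")    # treat the string "NA" as null (assumed marker)
    )
    return reader.load(path)
```

Supplying the schema avoids the extra pass that `inferSchema` needs, and `FAILFAST` surfaces lines that do not match it instead of silently nulling them.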

#### Write files Options
* Write files
The package also supports saving simple (non-nested) DataFrames. When writing files, the API accepts the following options:

* `path`: location of files.
* `header`: when set to true, the header (from the schema in the DataFrame) is written at the first line.
* `sep`: the column delimiter. By default `,`, but can be set to any character.
* `quote`: the quote character. By default `"`, but can be set to any character. This is written according to `quoteMode`.
* `escape`: the escape character. By default `\`, but can be set to any character. Escaped quote characters are written.
* `nullValue`: string that indicates a null value, nulls in the DataFrame will be written as this string.
* `dateFormat`: string that indicates the date format to use when writing dates or timestamps. Custom date formats follow the patterns of `java.text.SimpleDateFormat`. This applies to both `DateType` and `TimestampType`. If no `dateFormat` is specified, `yyyy-MM-dd HH:mm:ss.S` is used.
* `codec`: compression codec to use when saving. Should be the fully qualified name of a class implementing `org.apache.hadoop.io.compress.CompressionCodec` or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). Defaults to no compression. Spark's built-in CSV source also accepts this option under the name `compression`.
* `quoteMode`: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE).
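
A few of the write options above can be combined in a small helper like the one below. This is a sketch rather than a definitive recipe: `write_csv` and its parameters are hypothetical names, `df` is assumed to be a Spark DataFrame, and only options that the rest of this notebook also uses (`header`, `sep`, `compression`) are set.

```python
def write_csv(df, path, sep=",", compression=None, overwrite=False):
    """Write a DataFrame as CSV with a header line (hypothetical helper).

    `df` is assumed to be a Spark DataFrame. The target `path` ends up as a
    directory of part files, not a single file.
    """
    writer = (
        df.write.format("csv")
        .option("header", "true")   # write column names as the first line
        .option("sep", sep)         # column delimiter
    )
    if compression is not None:
        # Short codec name such as "gzip" or "bzip2".
        writer = writer.option("compression", compression)
    if overwrite:
        # Replace existing output instead of failing when the path exists.
        writer = writer.mode("overwrite")
    writer.save(path)
```

For example, `write_csv(emp_df, "/tmp/emp_out", sep=";", compression="gzip", overwrite=True)` would write semicolon-delimited, gzip-compressed part files; the output path here is made up.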

In [5]:
file_location = '/FileStore/tables/emp.csv' 
emp_df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", ",").load(file_location)

In [6]:
emp_df.printSchema()

In [7]:
diamonds = spark.read.format('csv').options(header='true', inferSchema='true').load('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv')

In [8]:
display(diamonds)

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


* Using the `quote` and `escape` options while reading a file

In [10]:
dept_loc = '/FileStore/tables/dept.csv' 
dept_df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "true").option("sep", ",")\
.option("quote", "\"").option("escape", "\"")\
.load(dept_loc)


In [11]:
display(dept_df)

Deptno,Dname,Loc
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
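
The `dept.csv` sample contains no quoted fields, so the effect of `quote` and `escape` is not visible in the output above. The sketch below uses Python's standard `csv` module, not Spark, purely to illustrate the convention that setting both options to `"` is meant to handle: delimiters inside quotes are kept, and a doubled quote inside a quoted field stands for one literal quote. The sample row is invented for illustration.

```python
import csv
import io

# Hypothetical row: the name contains quotes, the location contains a comma.
raw = 'Deptno,Dname,Loc\n10,"ACME ""East"" Corp","NEW YORK, NY"\n'

# quotechar='"' with doublequote=True mirrors quote='"' and escape='"'.
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"', doublequote=True))

print(rows[1])  # ['10', 'ACME "East" Corp', 'NEW YORK, NY']
```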


* Using the `encoding` option to set the character encoding of the files, with values like `cp1252`, `UTF-8`, `UTF-16`
* Using `delimiter`, an alias of `sep`, to set the column delimiter

In [13]:
dept_loc = '/FileStore/tables/dept.csv' 
dept_df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "true").option("delimiter", ",")\
.option("encoding","cp1252")\
.load(dept_loc)

In [14]:
display(dept_df)

Deptno,Dname,Loc
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
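
Because the sample data is plain ASCII, the `encoding` option makes no visible difference in the output above. The plain-Python sketch below (no Spark involved) shows why the option matters once non-ASCII characters appear: bytes written as `cp1252` are not valid `UTF-8`. The sample string and the helper name `decode_or_none` are made up for illustration.

```python
def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical field value containing a non-ASCII character.
raw_bytes = "Zürich".encode("cp1252")

print(decode_or_none(raw_bytes, "cp1252"))  # Zürich
print(decode_or_none(raw_bytes, "utf-8"))   # None
```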


#### Creating Temporary View in SQL

* Creating a `TEMPORARY VIEW` in SQL that reads the data directly from a file path
* Using the file reading mode `FAILFAST`

In [16]:
%sql
-- mode "FAILFAST" will abort file parsing with a RuntimeException if any malformed lines are encountered
CREATE OR REPLACE TEMPORARY VIEW diamonds
USING CSV
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true", mode "FAILFAST")

In [17]:
%sql
select * from diamonds

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39
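
The same temporary view can also be registered from Python instead of a `%sql` cell. The following is a minimal sketch, assuming `spark` is an existing SparkSession; the helper name `register_csv_view` is hypothetical.

```python
def register_csv_view(spark, path, view_name):
    """Load a CSV file in FAILFAST mode and expose it as a temporary view.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    if not view_name.isidentifier():
        # The name is interpolated into SQL below, so keep it a plain identifier.
        raise ValueError(f"invalid view name: {view_name!r}")
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("mode", "FAILFAST")   # abort on the first malformed line
        .load(path)
    )
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT * FROM {view_name}")
```

The view lives only for the current session, which matches the behavior of `CREATE OR REPLACE TEMPORARY VIEW`.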


In [18]:
# coalesce(2) limits the output to at most two part files under the target directory
emp_df.coalesce(2).write.format("csv")\
 .save('/tmp/file.csv',  header=True)

In [19]:
# "csv" is the short name of the built-in CSV data source
diamonds.coalesce(2).write.format("csv")\
 .save('/tmp/diamons.csv',  header=True)

* Using `overwrite` mode and the `sep` (separator) option while writing a file

In [21]:
csv_write='/tmp/diamons.csv'
emp_df.write.format("csv") \
  .mode("overwrite") \
  .option("dateFormat", "yyyy-MM-dd")\
  .option("path", csv_write)\
  .option("sep", ",") \
  .save()


* Using `compression` with `gzip` while writing a file

In [23]:
csv_write='/tmp/diamons.csv'
emp_df.write.format("csv") \
  .mode("overwrite") \
  .option("dateFormat", "yyyy-MM-dd")\
  .option("path", csv_write)\
  .option("sep", ",") \
  .save(compression='gzip')


#### Creating a gzip-compressed file from a DataFrame

* Saving data with `compression` set to `gzip`

In [25]:

csv_write='/tmp/diamons.csv'
# The write mode is set with .mode(); sep, compression and encoding
# are passed to save() as writer options.
emp_df.write.format("csv") \
  .mode("overwrite") \
  .option("dateFormat", "yyyy-MM-dd")\
  .option("path", csv_write)\
  .save(sep=';', compression='gzip', encoding='cp1252')

In [26]:
%fs
ls /tmp/diamons.csv