#Reading and Writing DataFrames#

###DataFrame Creation###
**Using a Python list of tuples**

In [0]:
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)

In [0]:
df.printSchema()

In [0]:
df.show()

**createDataFrame() from SparkSession using data list**

In [0]:
dfFromData2 = spark.createDataFrame(data).toDF(*columns)

**Create DataFrame with schema**

In [0]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

# Line continuations are not needed inside brackets
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

**Creating DataFrame from RDD**

**Method 1 : Using toDF() function**

In [0]:
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
rdd = spark.sparkContext.parallelize(data)

dfFromRDD1 = rdd.toDF()

dfFromRDD1.show()

In [0]:
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()

**Method 2 : Using createDataFrame() from SparkSession**
0. createDataFrame() from SparkSession is another way to create a DataFrame manually; it takes an RDD object as an argument.
0. Chain it with toDF() to name the columns.

In [0]:
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

**Create DataFrame from Data sources**
* csv(path)
* json(path)
* orc(path)
* parquet(path)
* table(tableName)
* text(path)
* textFile(path) (Scala/Java API only; in PySpark use text(path))
* jdbc(url, table, ..., connection properties)

In [0]:
# You can mount a Blob storage container, or a folder inside a container, to the Databricks File System (DBFS). The mount is a pointer to the Blob storage container, so the data is never synced locally.
# The command below did not work when trying to upload a CSV file.
# (Instead, go to the Data tab and upload the files.)
%fs mount("E:/SampleFiles","/mnt/SampleFiles/csv")


In [0]:
%fs ls

In [0]:
csvDF = (spark.read
         .option("delimiter",",")
         .option("inferSchema","True")
         .option("header","True")
         .csv("/FileStore/tables/country_lookup.csv")
         )


In [0]:
csvDF.printSchema()

In [0]:
display(csvDF) # Displays the records in tabular format. Best way to view the output

In [0]:
csvDF.show() # Shows the top 20 records. Not so great for visualization when the table has many columns

***spark.read.csv("path") or spark.read.format("csv").load("path")***
* including options
  * by chaining option(key, value)
  * options(key=value, ...) to pass several at once

* Commonly used options are
  * delimiter (alias sep)
  * inferSchema
  * header
  * quote
  * nullValue - sets a particular string value to null
  * dateFormat
  * schema - set through the reader's schema() method rather than option(); takes a schema object as an argument

In [0]:
csvDF1 = (spark.read.options(delimiter = ",", inferSchema = "True", header = "True")
         .csv("/FileStore/tables/country_lookup.csv"))

**Reading the data with User defined schema**  
Declare the Schema - list of fields and datatypes

In [0]:
# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

csvSchema = StructType([
  StructField("Country", StringType(), False),
  StructField("country_code_2_digit", StringType(), False),
  StructField("country_code_3_digit", StringType(), False),
  StructField("continent", StringType(), False),
  StructField("population", IntegerType(), False)
])

In [0]:
(spark.read                   # The DataFrameReader
  .option('header', 'true')   # Ignore line #1 - it's a header
  .option('sep', ",")         # Use a comma delimiter (this is also the default)
  .schema(csvSchema)          # Use the specified schema
  .csv("/FileStore/tables/country_lookup.csv") # Creates a DataFrame from CSV after reading in the file
  .printSchema()
)

##Reading JSON files## 
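###options###
1. multiLine - read.option("multiLine","true") for records that span several lines
2. inferSchema (for JSON the schema is inferred by default when none is supplied)
3. schema
4. nullValue (a CSV reader option; JSON files already have a native null)
5. dateFormat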

In [0]:
%fs head /FileStore/tables/people.json

In [0]:
jsonDF = (spark.read                  
  .json("/FileStore/tables/people.json") # Creates a DataFrame from the JSON file
)

In [0]:
jsonDF.show()

##Reading Parquet files## 
"Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language."

### About Parquet Files
* Free & Open Source.
* Increased query performance over row-based data stores.
* Provides efficient data compression.
* Designed for performance on large data sets.
* Supports limited schema evolution.
* Is a splittable "file format".
* A <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS" target="_blank">Column-Oriented</a> data store

&nbsp;&nbsp;&nbsp;&nbsp;** Row Format ** &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **Column Format**

<table style="border:0">

  <tr>
    <th>ID</th><th>Name</th><th>Score</th>
    <th style="border-top:0;border-bottom:0">&nbsp;</th>
    <th>ID:</th><td>1</td><td>2</td>
    <td style="border-right: 1px solid #DDDDDD">3</td>
  </tr>

  <tr>
    <td>1</td><td>john</td><td>4.1</td>
    <td style="border-top:0;border-bottom:0">&nbsp;</td>
    <th>Name:</th><td>john</td><td>mike</td>
    <td style="border-right: 1px solid #DDDDDD">sally</td>
  </tr>

  <tr>
    <td>2</td><td>mike</td><td>3.5</td>
    <td style="border-top:0;border-bottom:0">&nbsp;</td>
    <th style="border-bottom: 1px solid #DDDDDD">Score:</th>
    <td style="border-bottom: 1px solid #DDDDDD">4.1</td>
    <td style="border-bottom: 1px solid #DDDDDD">3.5</td>
    <td style="border-bottom: 1px solid #DDDDDD; border-right: 1px solid #DDDDDD">6.4</td>
  </tr>

  <tr>
    <td style="border-bottom: 1px solid #DDDDDD">3</td>
    <td style="border-bottom: 1px solid #DDDDDD">sally</td>
    <td style="border-bottom: 1px solid #DDDDDD; border-right: 1px solid #DDDDDD">6.4</td>
  </tr>

</table>
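
The difference between the two layouts in the table can be sketched in plain Python. This is only an illustration of the idea, not of how Parquet stores bytes on disk; the variable names are made up for the example.

```python
# The three records from the table, stored row by row
rows = [
    {"ID": 1, "Name": "john", "Score": 4.1},
    {"ID": 2, "Name": "mike", "Score": 3.5},
    {"ID": 3, "Name": "sally", "Score": 6.4},
]

# The same data pivoted into one list per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: every record is visited even though only Score is needed
avg_from_rows = sum(row["Score"] for row in rows) / len(rows)

# Column layout: only the Score list is touched
avg_from_columns = sum(columns["Score"]) / len(columns["Score"])

print(columns["Name"])
print(avg_from_rows, avg_from_columns)
```

A query that needs one column out of many reads far less data in the column layout, which is where much of the performance gain of columnar formats comes from.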

See also
* <a href="https://parquet.apache.org/" target="_blank">https&#58;//parquet.apache.org</a>
* <a href="https://en.wikipedia.org/wiki/Apache_Parquet" target="_blank">https&#58;//en.wikipedia.org/wiki/Apache_Parquet</a>

In [0]:
parquetDF = (spark.read                  
  .parquet("/FileStore/tables/userdata1.parquet") # Creates a DataFrame from the Parquet file
)

In [0]:
parquetDF.show()

In [0]:
%fs head /FileStore/tables/userdata1.parquet

##Writing DataFrames to different formats##
**df.write** 
csv

  0. Options: 
    * header
    * delimiter
    * quote
    * escape
    * nullValue
    * dateFormat
    * quoteAll (the built-in CSV writer's counterpart of quoteMode from the older spark-csv package)

**Saving modes**
PySpark DataFrameWriter also has a method mode() to specify the saving mode.

* **.mode**

  overwrite – overwrites the existing data at the target path.

  append – adds the data to what already exists at the path.

  ignore – skips the write operation when the path already exists.

  error (or errorifexists) – the default; raises an error when the path already exists.

In [0]:
parquetDF.write.options(header='True', delimiter=',').csv("/FileStore/tables/writeDF")

In [0]:
%fs ls "/FileStore/tables/writeDF"

path,name,size
dbfs:/FileStore/tables/writeDF/_SUCCESS,_SUCCESS,0
dbfs:/FileStore/tables/writeDF/_committed_4317844966561036838,_committed_4317844966561036838,112
dbfs:/FileStore/tables/writeDF/_started_4317844966561036838,_started_4317844966561036838,0
dbfs:/FileStore/tables/writeDF/part-00000-tid-4317844966561036838-2610edd0-1a3d-49c5-ba86-b126ebaa364e-34-1-c000.csv,part-00000-tid-4317844966561036838-2610edd0-1a3d-49c5-ba86-b126ebaa364e-34-1-c000.csv,149366


In [0]:
%fs head /FileStore/tables/writeDF/part-00000-tid-4317844966561036838-2610edd0-1a3d-49c5-ba86-b126ebaa364e-34-1-c000.csv


**Writing in parquet file format**
* .option - compression (none, snappy, gzip, lzo)  
* .mode - saving mode (overwrite, append, ignore, error)
* .partitionBy - partitions the data by column value while saving. Retrieving a particular partition becomes easier, which improves performance.

In [0]:
jsonDF.write.mode('overwrite').option("compression","snappy").partitionBy('city').parquet("/FileStore/tables/writeDF/people.parquet")