#### Reading and Writing JSON Data, and Creating RDDs and DataFrames from JSON Data
* Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.

* Note that the input expected here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON.

* For a regular multi-line JSON file, set the multiLine parameter to True.
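
To make the difference between the two layouts concrete, here is a minimal plain-Python sketch using only the standard `json` module (it is not Spark code). The sample records and the helper names `parse_json_lines` and `parses_line_by_line` are illustrative assumptions, not part of the notebook.

```python
import json

# JSON Lines: every line is a complete JSON object on its own.
json_lines_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Regular JSON: a single document (here an array) that spans several lines.
multi_line_text = """[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]"""


def parse_json_lines(text):
    """Parse each non-empty line as an independent JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parses_line_by_line(text):
    """Return True if every non-empty line is valid JSON by itself."""
    try:
        parse_json_lines(text)
        return True
    except json.JSONDecodeError:
        return False


records_from_lines = parse_json_lines(json_lines_text)
records_from_document = json.loads(multi_line_text)  # must be parsed as a whole
```

Both inputs yield the same two records, but only the first can be parsed one line at a time. The second fails as soon as the lone `[` line is parsed, which is the situation the `multiLine` option is meant for.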

In [2]:
%fs ls /FileStore/tables/

path,name,size
dbfs:/FileStore/tables/1/,1/,0
dbfs:/FileStore/tables/1.xml/,1.xml/,0
dbfs:/FileStore/tables/2.xml/,2.xml/,0
dbfs:/FileStore/tables/3.xml/,3.xml/,0
dbfs:/FileStore/tables/4.xml/,4.xml/,0
dbfs:/FileStore/tables/5xml/,5xml/,0
dbfs:/FileStore/tables/cache.png,cache.png,135788
dbfs:/FileStore/tables/custom.log,custom.log,6268
dbfs:/FileStore/tables/custom2020-08-08-16-34-39.log,custom2020-08-08-16-34-39.log,0
dbfs:/FileStore/tables/custom2020-08-08-16-35-05.log,custom2020-08-08-16-35-05.log,0


#### Reading JSON File Data

In [4]:
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
# Depending on your data, you can also set the character set, e.g. .option("charset", "UTF-8")
emp_json=spark.read.json('/FileStore/tables/emp_json.json')

In [5]:
emp_json.printSchema()

In [6]:
emp_text=spark.read.text('/FileStore/tables/emp_json.json')

In [7]:
display(emp_text)

value
"{""EMPNO"":""7369"",""ENAME"":""SMITH"",""JOB"":""CLERK"",""MGR"":""7902"",""HIREDATE"":""17-12-80"",""SAL"":""800"",""DEPTNO"":""20""}"
"{""EMPNO"":""7499"",""ENAME"":""ALLEN"",""JOB"":""SALESMAN"",""MGR"":""7698"",""HIREDATE"":""20-02-81"",""SAL"":""1600"",""COMM"":""300"",""DEPTNO"":""30""}"
"{""EMPNO"":""7521"",""ENAME"":""WARD"",""JOB"":""SALESMAN"",""MGR"":""7698"",""HIREDATE"":""22-02-81"",""SAL"":""1250"",""COMM"":""500"",""DEPTNO"":""30""}"
"{""EMPNO"":""7566"",""ENAME"":""JONES"",""JOB"":""MANAGER"",""MGR"":""7839"",""HIREDATE"":""02-04-81"",""SAL"":""2975"",""DEPTNO"":""20""}"
"{""EMPNO"":""7654"",""ENAME"":""MARTIN"",""JOB"":""SALESMAN"",""MGR"":""7698"",""HIREDATE"":""28-09-81"",""SAL"":""1250"",""COMM"":""1400"",""DEPTNO"":""30""}"
"{""EMPNO"":""7698"",""ENAME"":""SGR"",""JOB"":""MANAGER"",""MGR"":""7839"",""HIREDATE"":""01-05-81"",""SAL"":""2850"",""DEPTNO"":""30""}"
"{""EMPNO"":""7782"",""ENAME"":""RAVI"",""JOB"":""MANAGER"",""MGR"":""7839"",""HIREDATE"":""09-06-81"",""SAL"":""2450"",""DEPTNO"":""10""}"
"{""EMPNO"":""7788"",""ENAME"":""SCOTT"",""JOB"":""ANALYST"",""MGR"":""7566"",""HIREDATE"":""19-04-87"",""SAL"":""3000"",""DEPTNO"":""20""}"
"{""EMPNO"":""7839"",""ENAME"":""KING"",""JOB"":""PRESIDENT"",""HIREDATE"":""17-11-81"",""SAL"":""5000"",""DEPTNO"":""10""}"
"{""EMPNO"":""7844"",""ENAME"":""TURNER"",""JOB"":""SALESMAN"",""MGR"":""7698"",""HIREDATE"":""08-09-81"",""SAL"":""1500"",""COMM"":""0"",""DEPTNO"":""30""}"


In [8]:
display(emp_json)

COMM,DEPTNO,EMPNO,ENAME,HIREDATE,JOB,MGR,SAL
,20,7369,SMITH,17-12-80,CLERK,7902.0,800
300.0,30,7499,ALLEN,20-02-81,SALESMAN,7698.0,1600
500.0,30,7521,WARD,22-02-81,SALESMAN,7698.0,1250
,20,7566,JONES,02-04-81,MANAGER,7839.0,2975
1400.0,30,7654,MARTIN,28-09-81,SALESMAN,7698.0,1250
,30,7698,SGR,01-05-81,MANAGER,7839.0,2850
,10,7782,RAVI,09-06-81,MANAGER,7839.0,2450
,20,7788,SCOTT,19-04-87,ANALYST,7566.0,3000
,10,7839,KING,17-11-81,PRESIDENT,,5000
0.0,30,7844,TURNER,08-09-81,SALESMAN,7698.0,1500


In [9]:
emp_json=spark.read.json('/FileStore/tables/emp_json.json/part-00000-tid-1549053766405469969-89a72372-3b50-41ad-9efa-00667fb0e6a4-112-1-c000.json')

In [10]:
display(emp_json)

COMM,DEPTNO,EMPNO,ENAME,HIREDATE,JOB,MGR,SAL
,20,7369,SMITH,17-12-80,CLERK,7902.0,800
300.0,30,7499,ALLEN,20-02-81,SALESMAN,7698.0,1600
500.0,30,7521,WARD,22-02-81,SALESMAN,7698.0,1250
,20,7566,JONES,02-04-81,MANAGER,7839.0,2975
1400.0,30,7654,MARTIN,28-09-81,SALESMAN,7698.0,1250
,30,7698,SGR,01-05-81,MANAGER,7839.0,2850
,10,7782,RAVI,09-06-81,MANAGER,7839.0,2450
,20,7788,SCOTT,19-04-87,ANALYST,7566.0,3000
,10,7839,KING,17-11-81,PRESIDENT,,5000
0.0,30,7844,TURNER,08-09-81,SALESMAN,7698.0,1500


In [11]:
%fs
ls /FileStore/tables/emp_json.json/

path,name,size
dbfs:/FileStore/tables/emp_json.json/_SUCCESS,_SUCCESS,0
dbfs:/FileStore/tables/emp_json.json/_committed_1549053766405469969,_committed_1549053766405469969,114
dbfs:/FileStore/tables/emp_json.json/_started_1549053766405469969,_started_1549053766405469969,0
dbfs:/FileStore/tables/emp_json.json/part-00000-tid-1549053766405469969-89a72372-3b50-41ad-9efa-00667fb0e6a4-112-1-c000.json,part-00000-tid-1549053766405469969-89a72372-3b50-41ad-9efa-00667fb0e6a4-112-1-c000.json,1685
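
The listing shows that `emp_json.json` is a directory holding one `part-*.json` data file plus marker files (`_SUCCESS`, `_committed_...`, `_started_...`). That is why reading the directory (In [4]) and reading the part file directly (In [9]) return the same rows: Spark skips files whose names start with `_` or `.` when it scans a directory. The plain-Python sketch below imitates that behaviour with the standard library only. The helper `read_json_dir` and the shortened file name `part-00000-example.json` are hypothetical, and this is not how Spark is implemented.

```python
import json
import os
import tempfile


def read_json_dir(directory):
    """Collect records from every data file, skipping '_' and '.' marker files."""
    records = []
    for name in sorted(os.listdir(directory)):
        if name.startswith(("_", ".")):
            continue  # e.g. _SUCCESS, _committed_..., _started_...
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


# Build a throwaway directory that mimics the listing above (names shortened).
output_dir = tempfile.mkdtemp()
with open(os.path.join(output_dir, "_SUCCESS"), "w", encoding="utf-8"):
    pass  # empty marker file
with open(os.path.join(output_dir, "part-00000-example.json"), "w", encoding="utf-8") as f:
    f.write('{"EMPNO":"7369","ENAME":"SMITH"}\n{"EMPNO":"7499","ENAME":"ALLEN"}\n')

rows = read_json_dir(output_dir)
```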


* Creating a sample JSON file using `dbutils.fs.put`

In [13]:
dbutils.fs.put("/tmp/test.json", """
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
""",True)

In [14]:
%fs
ls /tmp/

path,name,size
dbfs:/tmp/LoanStats3a.csv,LoanStats3a.csv,42408085
dbfs:/tmp/PySpark_Data/,PySpark_Data/,0
dbfs:/tmp/book.xml,book.xml,5542
dbfs:/tmp/books.xml,books.xml,79
dbfs:/tmp/custom_log/,custom_log/,0
dbfs:/tmp/delta-table/,delta-table/,0
dbfs:/tmp/diamons.csv/,diamons.csv/,0
dbfs:/tmp/excel.xlsx,excel.xlsx,5269
dbfs:/tmp/file.csv/,file.csv/,0
dbfs:/tmp/files.csv/,files.csv/,0


In [15]:
test_df=spark.read.json('/tmp/test.json')

In [16]:
test_df.printSchema()

In [17]:
display(test_df)

array,dict,int,string
"List(1, 2, 3)","List(null, value1)",1,string1
"List(2, 4, 6)","List(null, value2)",2,string2
"List(3, 6, 9)","List(extra_value3, value3)",3,string3
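
The `List(null, value1)` cells above are consistent with the inferred `dict` struct having two fields, `extra_key` and `key`, with `null` wherever a record lacks `extra_key`. The sketch below reproduces that shape in plain Python as a rough illustration of merging keys across records. It is not Spark's inference code, and the helper `merged_keys` is a hypothetical name.

```python
import json

lines = [
    '{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}',
    '{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}',
    '{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}',
]
records = [json.loads(line) for line in lines]


def merged_keys(dicts):
    """Union of the keys seen in any dict, sorted by name."""
    return sorted({key for d in dicts for key in d})


# Top-level columns and the fields of the nested "dict" struct.
top_level_columns = merged_keys(records)
dict_fields = merged_keys(r["dict"] for r in records)

# One row per record; keys a record does not have become None (null).
dict_rows = [[r["dict"].get(field) for field in dict_fields] for r in records]
```

Here `top_level_columns` matches the column order shown by `display(test_df)`, and `dict_rows` matches the contents of the `dict` column.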


#### Creating a SQL temporary view on JSON files to read JSON data

In [19]:
%sql 
CREATE TEMPORARY VIEW emp_json_table
USING json
OPTIONS (path "/FileStore/tables/emp_json.json")

In [20]:
%sql
select * from emp_json_table where sal>1250.0

COMM,DEPTNO,EMPNO,ENAME,HIREDATE,JOB,MGR,SAL
300.0,30,7499,ALLEN,20-02-81,SALESMAN,7698.0,1600
,20,7566,JONES,02-04-81,MANAGER,7839.0,2975
,30,7698,SGR,01-05-81,MANAGER,7839.0,2850
,10,7782,RAVI,09-06-81,MANAGER,7839.0,2450
,20,7788,SCOTT,19-04-87,ANALYST,7566.0,3000
,10,7839,KING,17-11-81,PRESIDENT,,5000
0.0,30,7844,TURNER,08-09-81,SALESMAN,7698.0,1500
,20,7902,FORD,03-12-81,ANALYST,7566.0,3000
,10,7934,MILLER,23-01-82,CLERK,7782.0,1300


#### Multi-line JSON File Data

In [22]:
dbutils.fs.put("/tmp/multiline.json", """
[
    {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
    {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    {
        "string": "string3",
        "int": 3,
        "array": [
            3,
            6,
            9
        ],
        "dict": {
            "key": "value3",
            "extra_key": "extra_value3"
        }
    }
]""",True)

In [23]:
json_data=spark.read.option('multiline',"true").json("/tmp/multiline.json")

In [24]:
display(json_data)

array,dict,int,string
"List(1, 2, 3)","List(null, value1)",1,string1
"List(2, 4, 6)","List(null, value2)",2,string2
"List(3, 6, 9)","List(extra_value3, value3)",3,string3


* Creating an `RDD` from custom JSON data
* Creating a `DataFrame` from custom JSON data
* Reading nested struct fields in JSON data

In [28]:
# an RDD[String] storing one JSON object per string
my_data = ['{"name":"Ravi","address":{"city":"Bangalore","state":"Karnataka"}}']
my_rdd = sc.parallelize(my_data)
my_df = spark.read.json(my_rdd)
my_df.printSchema()

In [29]:
display(my_df)

address,name
"List(Bangalore, Karnataka)",Ravi


* Get individual nested fields and their values using `select`.

In [31]:
display(my_df.select("address.city","address.state","name"))
#my_df.select("address.city","address.state","name").show()

city,state,name
Bangalore,Karnataka,Ravi
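
In the `select` call above, a dotted name such as `address.city` refers to the `city` field inside the `address` struct, and the resulting column is named after the last path segment. As a plain-Python analogy only (standard library, not Spark), the same lookup on the parsed record could be sketched like this. The helper `get_path` is a hypothetical name.

```python
import json

record = json.loads(
    '{"name":"Ravi","address":{"city":"Bangalore","state":"Karnataka"}}'
)


def get_path(data, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    value = data
    for part in dotted_path.split("."):
        value = value[part]
    return value


# Name each output column after the last segment of its path.
paths = ["address.city", "address.state", "name"]
row = {path.split(".")[-1]: get_path(record, path) for path in paths}
```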


In [32]:
# Each RDD element is one complete JSON document, so it may span several lines
stringJSONRDD = sc.parallelize([
    """{
        "id": "123",
        "name": "Raj",
        "age": 39,
        "eyeColor": "brown"
    }""",
    """{
        "id": "234",
        "name": "Srinu",
        "age": 37,
        "eyeColor": "Black"
    }""",
    """{
        "id": "345",
        "name": "Ravi",
        "age": 35,
        "eyeColor": "red"
    }""",
])

In [33]:
test_json_df=spark.read.json(stringJSONRDD)

In [34]:
test_json_df.printSchema()

In [35]:
display(test_json_df)

age,eyeColor,id,name
39,brown,123,Raj
37,Black,234,Srinu
35,red,345,Ravi


#### Writing JSON files from a DataFrame
* Using `mode('append')` and `mode('overwrite')`

In [37]:
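# 'append' adds new part files next to any output already in the directory
emp_json.write.mode('append').json('/FileStore/Tables/jsonfiles')
# 'overwrite' would replace the existing output instead
#emp_json.write.mode('overwrite').json('/FileStore/Tables/jsonfiles')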

In [38]:
%fs
ls /FileStore/Tables/jsonfiles/

path,name,size
dbfs:/FileStore/Tables/jsonfiles/_SUCCESS,_SUCCESS,0
dbfs:/FileStore/Tables/jsonfiles/_committed_2464412959603568780,_committed_2464412959603568780,472
dbfs:/FileStore/Tables/jsonfiles/_committed_5807186400108590717,_committed_5807186400108590717,114
dbfs:/FileStore/Tables/jsonfiles/_committed_6188326100353873665,_committed_6188326100353873665,114
dbfs:/FileStore/Tables/jsonfiles/_committed_6540946452806229999,_committed_6540946452806229999,115
dbfs:/FileStore/Tables/jsonfiles/_committed_vacuum4435709910176646463,_committed_vacuum4435709910176646463,162
dbfs:/FileStore/Tables/jsonfiles/_started_6540946452806229999,_started_6540946452806229999,0
dbfs:/FileStore/Tables/jsonfiles/part-00000-tid-2464412959603568780-ff1ab8fa-1c1b-4eba-bbc5-8fa214c929b5-110-1-c000.json,part-00000-tid-2464412959603568780-ff1ab8fa-1c1b-4eba-bbc5-8fa214c929b5-110-1-c000.json,1685
dbfs:/FileStore/Tables/jsonfiles/part-00000-tid-5807186400108590717-72a56d7e-ec63-422f-8988-89aa286bb907-111-1-c000.json,part-00000-tid-5807186400108590717-72a56d7e-ec63-422f-8988-89aa286bb907-111-1-c000.json,1685
dbfs:/FileStore/Tables/jsonfiles/part-00000-tid-6540946452806229999-9882638c-66fc-48fb-ae14-15cdda6bce0c-5590-1-c000.json,part-00000-tid-6540946452806229999-9882638c-66fc-48fb-ae14-15cdda6bce0c-5590-1-c000.json,1685
