## Module 3

This notebook runs on a Databricks cluster: **DBR 9.1 LTS | Spark 3.1.2 | Scala 2.12**

The notebook's default language is **Python**.

### Schema inference - semi-structured files

Make sure that the files for Module 3 are uploaded to `dbfs:/FileStore/module3/`, and read them through the Spark API using `dbfs:/` paths.

In [0]:
import pyspark

In [0]:
spark.read.json("dbfs:/FileStore/module3/json1.json")

In [0]:
spark.read.json("dbfs:/FileStore/module3/json1.json").printSchema()

The actual file looks like this. Spark builds the schema from the union of all keys, and every column is typed `long` and nullable (!):
```
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
```

In [0]:
spark.read.json("dbfs:/FileStore/module3/json2.json")

In [0]:
spark.read.json("dbfs:/FileStore/module3/json2.json").printSchema()

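The actual file looks like this. Here the types are no longer all `long`: `c` mixes `3.1` and `3`, so it is inferred as `double`, and `a` mixes `1` and `"5"`, so it falls back to `string`. The remaining columns are `long`, and all columns are nullable (!):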
```
{"a":1, "b":2, "c":3.1}
{"e":2, "c":3, "b":5}
{"a":"5", "d":7}
```

We can store the result in a DataFrame:

In [0]:
df = spark.read.json("dbfs:/FileStore/module3/json2.json")
df.printSchema()
df.show()

In [0]:
# Read the JSON file into a DataFrame using the generic format/load API
df = spark.read.format("json") \
        .load("dbfs:/FileStore/module3/json2.json")

Inference does not always give the types we want, so let's enforce a schema to import the values with the intended types. We repeat the read for the `json1.json` file:

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType

In [0]:
# Only columns a and b are declared; all other keys in the file are ignored
mojaShemica = StructType([
  StructField("a", IntegerType(), True),
  StructField("b", IntegerType(), True)])

In [0]:
spark.read.schema(mojaShemica).json("dbfs:/FileStore/module3/json1.json").show()

In [0]:
spark.read.json("dbfs:/FileStore/module3/json1.json").printSchema()

## Playing with data

Now let's create a more "interesting" JSON file: a single array of records spread over many lines (read below as `json3.json`).

In [0]:
[{
  "RecordNumber": 2,
  "Zipcode": 1000,
  "ZipCodeType": "STANDARD",
  "City": "Ljubljana",
   "State":"SI"
},
{
  "RecordNumber": 10,
  "Zipcode": 3000,
  "ZipCodeType": "STANDARD",
  "City": "Celje",
   "State":"SI"
 },
{
  "RecordNumber": 32,
  "Zipcode": 100,
  "ZipCodeType": "STANDARD",
  "City": "Ljubljana",
   "State":"SI", 
   "Country":"Slovenia",
   "Lat":"46.0569",
   "Long":"14.5058"
 }]

In [0]:
# Read multiline json file
multiline_df = spark.read.option("multiline","true") \
      .json("dbfs:/FileStore/module3/json3.json")
multiline_df.show() 

Now let's read several files at once by passing a list of paths:

In [0]:
# Read multiple files by listing their paths
df = spark.read.option("multiline","true").json(
    ['dbfs:/FileStore/module3/json3.json','dbfs:/FileStore/module3/json4_corrupt.json'])


Here is why `json4_corrupt.json` is corrupt: the comma between the third and fourth records is missing, and the trailing comma after the last record must be removed.

```
[{
  "RecordNumber": 2,
  "Zipcode": 1000,
  "ZipCodeType": "STANDARD",
  "City": "Ljubljana",
   "State":"SI"
},
{
  "RecordNumber": 10,
  "Zipcode": 3000,
  "ZipCodeType": "STANDARD",
  "City": "Celje",
   "State":"SI"
 },
{
  "RecordNumber": 32,
  "Zipcode": 100,
  "ZipCodeType": "STANDARD",
  "City": "Ljubljana",
   "State":"SI", 
   "Country":"Slovenia",
   "Lat":"46.0569",
   "Long":"14.5058"
 }
{
  "RecordNumber": 104,
  "Zipcode": 89260,
  "ZipCodeType": "STANDARD",
  "City": "Seattle",
   "State":"WA",
   "Country":"USA"
 },
]
```
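
Syntax errors like these can be located before handing the file to Spark. The sketch below uses only Python's standard `json` module on a hypothetical, shortened stand-in for the corrupt file; `first_json_error` is an illustrative helper, not part of any library.

```python
import json

# Hypothetical, shortened stand-in for json4_corrupt.json with the same two defects:
# no comma between the first and second record, and a trailing comma before "]".
broken_text = '[{"RecordNumber": 32}\n{"RecordNumber": 104},\n]'
fixed_text = '[{"RecordNumber": 32},\n{"RecordNumber": 104}\n]'


def first_json_error(text):
    """Return (line, column, message) of the first syntax error, or None if text is valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.msg
    return None


error = first_json_error(broken_text)
print(error)                         # location of the missing comma
print(first_json_error(fixed_text))  # None once both defects are fixed
```

The parser stops at the first problem, so the file has to be fixed and checked again to find the next one.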

In [0]:
# Read multiple files by listing their paths; json5.json is a corrected copy of json4_corrupt.json
df = spark.read.option("multiline","true").json(['dbfs:/FileStore/module3/json3.json','dbfs:/FileStore/module3/json5.json'])


In [0]:
df.show()

Now let's try to read all files from a dedicated folder `dbfs:/FileStore/module3/input_files/`

In [0]:
# Read all JSON files from a folder
df = spark.read.option("multiline","true").json("dbfs:/FileStore/module3/input_files/*.json")
df.show()

Now let's define the schema ourselves instead of inferring it. This is a user-specified (custom) schema.

In [0]:
# Define custom schema
schema = StructType([
      StructField("RecordNumber",IntegerType(),True),
      StructField("Zipcode",IntegerType(),True),
      StructField("ZipCodeType",StringType(),True),
      StructField("City",StringType(),True),
      StructField("State",StringType(),True),
      StructField("LocationType",StringType(),True),
      StructField("Lat",DoubleType(),True),
      StructField("Long",DoubleType(),True),
      StructField("WorldRegion",StringType(),True),
      StructField("Country",StringType(),True),
      StructField("LocationText",StringType(),True)
  ])

In [0]:
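# json3.json is a multiline file, so the multiline option is needed here as well
df_with_schema = spark.read.schema(schema) \
        .option("multiline", "true") \
        .json("dbfs:/FileStore/module3/json3.json")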
df_with_schema.printSchema()
df_with_schema.show()

Reading JSON files using Spark SQL

In [0]:

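from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType

# On Databricks a session already exists as `spark`; getOrCreate() returns it
# instead of starting a new one, so this cell mainly matters outside Databricks.
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("module3") \
    .getOrCreate()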

In [0]:
# Create a temporary view from a JSON file (json3.json is multiline, hence the option)
spark.sql("CREATE OR REPLACE TEMPORARY VIEW json3 USING json OPTIONS (path 'dbfs:/FileStore/module3/json3.json', multiline 'true')")

In [0]:
df2 = spark.sql("select * from json3")

In [0]:
# Write the DataFrame out as JSON
# (this refers to the DataFrame called df; Spark creates res.json as a directory of part files)
df.write.mode('overwrite').json("dbfs:/FileStore/module3/output/res.json")

## Getting data from Source