## Reference Documentation

- [Apache Spark User Guide](https://spark.apache.org/docs/4.0.0/api/python/user_guide/dataframes.html?highlight=inferschema)

## Load Sample Data

- Use the option in Catalog to load the sample data file from local storage, S3, or any other cloud storage.

## Verify the contents of the data load location

  `dbutils.fs.ls("dbfs:/Volumes/workspace/default/tutorial")`

## What is a DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. Apache Spark DataFrames support a rich set of APIs (select columns, filter, join, aggregate, etc.) that allow you to solve common data analysis problems efficiently.

Compared to traditional relational databases, Spark DataFrames offer several key advantages for big data processing and analytics:

- **Distributed computing**: Spark distributes data across multiple nodes in a cluster, allowing for parallel processing of big data
- **In-memory processing**: Spark performs computations in memory, which can be significantly faster than disk-based processing
- **Schema flexibility**: Unlike traditional databases, PySpark DataFrames support schema evolution and dynamic typing
- **Fault tolerance**: PySpark DataFrames are built on top of Resilient Distributed Datasets (RDDs), which are inherently fault-tolerant. When a node fails, Spark automatically recomputes the lost partitions from their lineage, which keeps results reliable and consistent.


## Create DataFrame

- **From a list of dictionaries**:

      employees = [
          {"name": "John D.", "age": 30},
          {"name": "Alice G.", "age": 25},
          {"name": "Bob T.", "age": 35},
          {"name": "Eve A.", "age": 28}
      ]

  Create a DataFrame containing the employees data:

  `df = spark.createDataFrame(employees)`

  `df.show()`

- **From a local CSV file**:

  `df = spark.read.csv("../data/employees.csv", header=True, inferSchema=True)`
  
  `df.show()`

- **From a local JSON file**:

  `df = spark.read.option("multiline","true").json("../data/employees.json")`
  
  `df.show()`

- **From an existing DataFrame**:

      employees = [
          {"name": "John D.", "age": 30, "department": "HR"},
          {"name": "Alice G.", "age": 25, "department": "Finance"},
          {"name": "Bob T.", "age": 35, "department": "IT"},
          {"name": "Eve A.", "age": 28, "department": "Marketing"}
      ]

  `df = spark.createDataFrame(employees)`

  Select only the name and age columns:

  `new_df = df.select("name", "age")`

- **From a table in Spark environment**:

  `df = spark.read.table("table_name")`

- **From a table in an external database, read over JDBC**:

      url = "jdbc:mysql://localhost:3306/mydatabase"
      table = "employees"
      properties = {
          "user": "username",
          "password": "password"
      }

  Read the table into a DataFrame:

  `df = spark.read.jdbc(url=url, table=table, properties=properties)`

## Display DataFrame

- **df.show()** - Prints the DataFrame's contents as a plain-text table. By default, it shows the first 20 rows.

- **df.show(n=2)** - Displays only the first 2 rows:

      +---+--------+
      |age|    name|
      +---+--------+
      | 30| John D.|
      | 25|Alice G.|
      +---+--------+
      only showing top 2 rows

- **df.show(truncate=3)** - The `truncate` parameter limits the displayed length of each value. By default, strings longer than 20 characters are truncated; here they are cut to 3 characters:

      +---+----+
      |age|name|
      +---+----+
      | 30| Joh|
      | 25| Ali|
      | 35| Bob|
      | 28| Eve|
      +---+----+

- **df.show(vertical=True)** - Displays each row vertically, with one line per column value:

      -RECORD 0--------
      age  | 30
      name | John D.
      -RECORD 1--------
      age  | 25
      name | Alice G.
      -RECORD 2--------
      age  | 35
      name | Bob T.
      -RECORD 3--------
      age  | 28
      name | Eve A.

- **df.printSchema()** - To view the schema of the DataFrame

- **display(df)** (Only Available in Databricks): Displays the data in tabular format.

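- **display(df, streamName=...)** (Only Available in Databricks): Renders streaming data in real time.

  `display(df, streamName="myStream")`

  - Shows live updates of a Structured Streaming query.
  - Automatically refreshes until you stop the cell.

  Example (file-based streaming sources need an explicit schema unless `spark.sql.streaming.schemaInference` is enabled; `schema` here stands for a `StructType` or DDL string that you define):

  `streamingDF = spark.readStream.format("csv").schema(schema).option("header", "true").load("/path")`

  `display(streamingDF, streamName="LiveCSVStream")`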

- **display(df.select("col1", "col2"))** (Only Available in Databricks): Display only selected columns

- **display with Visualization options** (Only Available in Databricks): After calling `display(df)` in a Databricks notebook, you can switch between Table, Bar chart, Line chart, Scatter plot, Map, and other views.

- **display with Temporary SQL Views** (Only Available in Databricks): You can combine with createOrReplaceTempView() to display SQL query results:

  `df.createOrReplaceTempView("my_table")`
  
  `display(spark.sql("SELECT col1, count(*) FROM my_table GROUP BY col1"))`


## DataFrame Manipulation

### Reference
- [Apache Spark User Guide - DataFrame Manipulation](https://spark.apache.org/docs/4.0.0/api/python/user_guide/dataprep.html)
- PySpark-Tutorial-2-Manipulate DataFrame Notebook

## DataFrame vs. Tables

- **DataFrame**: A DataFrame is an immutable distributed collection of data that is only available in the current Spark session.
- **Tables**: A table is a persistent data structure that can be accessed across multiple Spark sessions.
- **Expose a DataFrame to SQL as a temporary view**: `df.createOrReplaceTempView("employees")`

  (Note: The lifetime of this temporary view is tied to the Spark session that was used to create the DataFrame. To keep the data beyond this Spark session, you need to save it to persistent storage.)

## Save DataFrame to Persistent Storage

- **Save to a file-based data store**: `df.write.option("path", "../dataout").saveAsTable("dataframes_savetable_example")`

  For file-based data sources (text, Parquet, JSON, etc.), you can specify a custom table path. Even if the table is dropped, the custom table path and the table data will still be there.

  If no custom table path is specified, Spark writes the data to a default table path under the warehouse directory. When the table is dropped, the default table path is removed too.

- **Save to Hive metastore**: `df.write.mode("overwrite").saveAsTable("schemaName.tableName")` (in PySpark, `write` is a property, so it takes no parentheses)

List the files in the data load location:

    dbutils.fs.ls("dbfs:/Volumes/workspace/default/tutorial")

Read the CSV file, preview three rows vertically, print the schema, and count the rows:

    df = spark.read.csv("/Volumes/workspace/default/tutorial/BigMart Sales.csv", header=True, inferSchema=True)
    df.show(n=3, vertical=True)
    df.printSchema()
    df.count()

Render the same data as an interactive table (Databricks only):

    df = spark.read.csv("/Volumes/workspace/default/tutorial/BigMart Sales.csv", header=True, inferSchema=True)
    display(df)

Read a JSON file with the `multiline` option turned off, so each line is parsed as its own JSON record:

    dfj = spark.read.option("multiline", False).json("/Volumes/workspace/default/tutorial/drivers.json")
    display(dfj)


## Manipulating Schema Using DDL

Start by printing the schema that Spark infers:

    df = spark.read.csv("/Volumes/workspace/default/tutorial/BigMart Sales.csv", header=True, inferSchema=True)
    df.printSchema()


Supply a DDL schema string instead of inferring the schema:

    new_modified_ddl_schema = """
        Item_Identifier           STRING,
        Item_Weight               STRING,
        Item_Fat_Content          STRING,
        Item_Visibility           DOUBLE,
        Item_Type                 STRING,
        Item_MRP                  DOUBLE,
        Outlet_Identifier         STRING,
        Outlet_Establishment_Year INTEGER,
        Outlet_Size               STRING,
        Outlet_Location_Type      STRING,
        Outlet_Type               STRING,
        Item_Outlet_Sales         DOUBLE
    """
    df = spark.read.csv("/Volumes/workspace/default/tutorial/BigMart Sales.csv", header=True, schema=new_modified_ddl_schema)
    df.printSchema()

## Manipulating Schema Using StructType

Define the same schema programmatically; the third argument of each `StructField` marks the column as nullable:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

    new_modified_struct_schema = StructType([
        StructField("Item_Identifier", StringType(), True),
        StructField("Item_Weight", StringType(), True),
        StructField("Item_Fat_Content", StringType(), True),
        StructField("Item_Visibility", DoubleType(), True),
        StructField("Item_Type", StringType(), True),
        StructField("Item_MRP", DoubleType(), True),
        StructField("Outlet_Identifier", StringType(), True),
        StructField("Outlet_Establishment_Year", IntegerType(), True),
        StructField("Outlet_Size", StringType(), True),
        StructField("Outlet_Location_Type", StringType(), True),
        StructField("Outlet_Type", StringType(), True),
        StructField("Item_Outlet_Sales", DoubleType(), True)
    ])

    df = spark.read.csv("/Volumes/workspace/default/tutorial/BigMart Sales.csv", header=True, schema=new_modified_struct_schema)
    display(df)

## Create DataFrame with schema containing different data types

### Reference
- https://spark.apache.org/docs/4.0.0/api/python/user_guide/touroftypes.html

Build a DataFrame whose schema covers the common Spark data types, using the matching Python object for each one:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, LongType, DoubleType, FloatType
    from pyspark.sql.types import DecimalType, StringType, BinaryType, BooleanType, DateType, TimestampType
    from decimal import Decimal
    from datetime import date, datetime

    # Define the schema of the DataFrame
    schema = StructType([
        StructField("integer_field", IntegerType(), nullable=False),
        StructField("long_field", LongType(), nullable=False),
        StructField("double_field", DoubleType(), nullable=False),
        StructField("float_field", FloatType(), nullable=False),
        StructField("decimal_field", DecimalType(10, 2), nullable=False),
        StructField("string_field", StringType(), nullable=False),
        StructField("binary_field", BinaryType(), nullable=False),
        StructField("boolean_field", BooleanType(), nullable=False),
        StructField("date_field", DateType(), nullable=False),
        StructField("timestamp_field", TimestampType(), nullable=False)
    ])

    # Sample data using the Python objects corresponding to each PySpark type
    data = [
        (123, 1234567890123456789, 12345.6789, 123.456, Decimal('12345.67'), "Hello, World!",
         b'Hello, binary world!', True, date(2020, 1, 1), datetime(2020, 1, 1, 12, 0)),
        (456, 9223372036854775807, 98765.4321, 987.654, Decimal('98765.43'), "Goodbye, World!",
         b'Goodbye, binary world!', False, date(2025, 12, 31), datetime(2025, 12, 31, 23, 59)),
        (-1, -1234567890123456789, -12345.6789, -123.456, Decimal('-12345.67'), "Negative Values",
         b'Negative binary!', False, date(1990, 1, 1), datetime(1990, 1, 1, 0, 0)),
        (0, 0, 0.0, 0.0, Decimal('0.00'), "", b'', True, date(2000, 1, 1), datetime(2000, 1, 1, 0, 0))
    ]

    # Create DataFrame
    df_multi_data_type = spark.createDataFrame(data, schema=schema)

    # Show schema
    df_multi_data_type.printSchema()

    # Show the DataFrame
    df_multi_data_type.show()