# PySpark: SQL and DataFrames

###### Spark SQL: A Spark module for working with structured data using SQL.
###### DataFrame: A distributed collection of data organized into named columns, similar to a table in an RDBMS.

1. Creating DataFrame Manually

In [14]:
# Sample Data
data = [(1, "Alice"), (2, "Bob"),
        (3, "Charlie"), (4, "Dave"),
        (5, "Eve"), (6, "Frank"),
        (7, "Grace"), (8, "Harry"),
        (9, "Ivan"), (10, "Jose")]

columns = ["id","name"]

# Create DataFrame
df = spark.createDataFrame(data,columns)

# Display DataFrame in table structure
display(df)


show() Function in PySpark DataFrames

The show() function in PySpark prints the contents of a DataFrame in tabular format. It has several useful parameters for customization:

  1. n: Number of rows to display (default is 20).
  2. truncate: If set to True (the default), column values longer than 20 characters are truncated. If set to an integer, values are truncated to that many characters.
  3. vertical: If set to True, prints each row vertically, one line per column value (default is False).

In [15]:
# Show the first 3 rows, truncate column values to 3 characters, and display them vertically
df.show(3, truncate=3, vertical=True)

# Show entire DataFrame
df.show()

# Show first 5 rows
df.show(5)

# Show DataFrame without truncating any columns
df.show(truncate=False)


-RECORD 0---
 id   | 1   
 name | Ali 
-RECORD 1---
 id   | 2   
 name | Bob 
-RECORD 2---
 id   | 3   
 name | Cha 
only showing top 3 rows

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|   Dave|
|  5|    Eve|
|  6|  Frank|
|  7|  Grace|
|  8|  Harry|
|  9|   Ivan|
| 10|   Jose|
+---+-------+

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|   Dave|
|  5|    Eve|
+---+-------+
only showing top 5 rows

+---+-------+
|id |name   |
+---+-------+
|1  |Alice  |
|2  |Bob    |
|3  |Charlie|
|4  |Dave   |
|5  |Eve    |
|6  |Frank  |
|7  |Grace  |
|8  |Harry  |
|9  |Ivan   |
|10 |Jose   |
+---+-------+
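
The vertical output above shows `truncate=3` cutting `Alice` to `Ali` with no ellipsis. The following plain-Python sketch mirrors the truncation rule as I understand it: a value longer than the limit is cut to the limit, and for limits of 4 or more the last three kept characters are replaced by `...`. `truncate_cell` is a hypothetical helper for illustration, not a PySpark function, and the exact rule is worth confirming against your Spark version.

```python
def truncate_cell(value, truncate=20):
    """Approximate how show() shortens one cell (illustrative helper only)."""
    text = str(value)
    # A limit of 0 stands in for truncate=False: nothing is shortened
    if truncate <= 0 or len(text) <= truncate:
        return text
    # Very small limits leave no room for an ellipsis
    if truncate < 4:
        return text[:truncate]
    # Otherwise keep truncate - 3 characters and append "..."
    return text[:truncate - 3] + "..."


print(truncate_cell("Alice", 3))
print(truncate_cell("a" * 25))
```

Under this rule a 25-character value shown with the default limit of 20 keeps its first 17 characters followed by `...`.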



2. Creating DataFrame from Pandas

In [16]:
import pandas as pd

# Sample pandas DataFrame, reusing data and columns from section 1
pandas_df = pd.DataFrame(data=data,columns=columns)

# Convert to PySpark DataFrame
df_to_pyspark = spark.createDataFrame(pandas_df)
df_to_pyspark.show(5)


+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|   Dave|
|  5|    Eve|
+---+-------+
only showing top 5 rows



3. Creating DataFrame from Dictionary

In [17]:
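# Each dictionary is one row; the keys become column names
dict_data = [{'id': 1, 'Name': 'Alice'},
             {'id': 2, 'Name': 'Bob'},
             {'id': 3, 'Name': 'Charlie'}]

# Column names and types are inferred from the dictionaries
df = spark.createDataFrame(dict_data)
df.show()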


+-------+---+
|   Name| id|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+



4. Creating Empty DataFrame

In [18]:
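from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# An empty DataFrame has no rows to infer from, so the schema must be given explicitly
schema = StructType([StructField("id", IntegerType(), True),
                     StructField("Name", StringType(), True)])

# Empty list of rows plus the explicit schema
df = spark.createDataFrame([], schema)
df.show()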


+---+----+
| id|Name|
+---+----+
+---+----+



5. Creating DataFrame from Structured Data (CSV, JSON, Parquet)

In [19]:
# Reading a semicolon-delimited CSV file with a header row
path = "Files/Pyspark_files/people.csv"
df_csv = spark.read.csv(path, header=True, sep=";")
display(df_csv)

# Reading JSON file
path = "Files/Pyspark_files/people.json"
df_json = spark.read.json(path)
display(df_json)

# Reading Multiline JSON file
path = "Files/Pyspark_files/multiline-zipcode.json"
df_multi_json = spark.read.json(path,multiLine=True)
display(df_multi_json)

# Reading Parquet Files
path = "Files/Pyspark_files/users.parquet"
df_parquet = spark.read.parquet(path)
display(df_parquet)
