In [22]:
!ls /data


employee.csv  employee.json  employees.parquet


What is a DataFrame in PySpark?

A DataFrame in PySpark is:

A distributed collection of data

Organized into rows and named columns

Similar to:

A table in a SQL database

A Pandas DataFrame (but distributed across a cluster)

Key characteristics

Immutable: You don't change data in place; every transformation creates a new DataFrame

Lazy evaluation: Operations are not executed immediately; Spark builds a plan and runs it only when an action is called

Optimized: Spark uses the Catalyst optimizer and Tungsten engine for efficient execution

Schema-based: Each column has a defined data type

| id | name  | age |
|----|-------|-----|
| 1  | Alice | 24  |
| 2  | Bob   | 30  |


In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark DataFrame Basics") \
    .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/15 13:51:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [23]:
data = [
    (1, "Alice", 24),
    (2, "Bob", 30),
    (3, "Charlie", 28)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()


+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 24|
|  2|    Bob| 30|
|  3|Charlie| 28|
+---+-------+---+



In [24]:
# No column names are passed here, so Spark falls back to _1, _2, _3
df = spark.createDataFrame(data)
df.show()
df.printSchema()

+---+-------+---+
| _1|     _2| _3|
+---+-------+---+
|  1|  Alice| 24|
|  2|    Bob| 30|
|  3|Charlie| 28|
+---+-------+---+

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)



In [10]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema=schema)
df.printSchema()


root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



What is StructType?

Think of StructType as a table blueprint: it defines the schema of your DataFrame.

It tells Spark:

"This table has these columns"

"Each column has this type"

💡 Analogy:
StructType = the form template or Excel sheet header

It doesn't contain data, only structure.

What is StructField?

StructField = one column definition inside the StructType.

It defines:

Column name

Data type (StringType, IntegerType, etc.)

Nullable or not (True/False)

💡 Analogy:
If StructType = Excel sheet,
then each StructField = one column header with type info.

| Concept     | Analogy                                                    |
| ----------- | ---------------------------------------------------------- |
| StructType  | The **table blueprint** (schema of entire DataFrame)       |
| StructField | **One column** in the blueprint, with name, type, nullable |


A common point of confusion:

❗ inferSchema, header, delimiter, etc. do NOT apply to spark.createDataFrame(data, columns).

They are reader options for file-based sources, chiefly spark.read.csv.

What Spark Does Instead (Schema Inference Here)

Spark infers the schema automatically from the Python types of the values. In the output above, Python int values became long and str values became string.

In [7]:
df_csv = spark.read.csv(
    "/data/employee.csv",
    header=True,
    inferSchema=True
)

df_csv.show()
df_csv.printSchema()


                                                                                

+---+-------+---+-----------+------+
| id|   name|age| department|salary|
+---+-------+---+-----------+------+
|  1|  Alice| 24|Engineering| 70000|
|  2|    Bob| 30|  Marketing| 60000|
|  3|Charlie| 28|      Sales| 55000|
|  4|  David| 35|Engineering| 90000|
|  5|    Eva| 26|         HR| 50000|
|  6|  Frank| 40|    Finance| 95000|
|  7|  Grace| 29|  Marketing| 62000|
|  8|  Helen| 32|         HR| 58000|
|  9|    Ian| 27|      Sales| 54000|
| 10|   Jack| 45| Management|120000|
+---+-------+---+-----------+------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)



In [15]:
df = spark.read.json("/data/employee.json")

df.show()
df.printSchema()


+---+-----------+---+-------+------+
|age| department| id|   name|salary|
+---+-----------+---+-------+------+
| 24|Engineering|  1|  Alice| 70000|
| 30|  Marketing|  2|    Bob| 60000|
| 28|      Sales|  3|Charlie| 55000|
| 35|Engineering|  4|  David| 90000|
| 26|         HR|  5|    Eva| 50000|
| 40|    Finance|  6|  Frank| 95000|
| 29|  Marketing|  7|  Grace| 62000|
| 32|         HR|  8|  Helen| 58000|
| 27|      Sales|  9|    Ian| 54000|
| 45| Management| 10|   Jack|120000|
+---+-----------+---+-------+------+

root
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)



In [16]:
data_parquet = [
    (1, "Alice", 24, "Engineering", 70000),
    (2, "Bob", 30, "Marketing", 60000),
    (3, "Charlie", 28, "Sales", 55000),
    (4, "David", 35, "Engineering", 90000),
    (5, "Eva", 26, "HR", 50000),
    (6, "Frank", 40, "Finance", 95000),
    (7, "Grace", 29, "Marketing", 62000),
    (8, "Helen", 32, "HR", 58000),
    (9, "Ian", 27, "Sales", 54000),
    (10, "Jack", 45, "Management", 120000)
]

columns = ["id", "name", "age", "department", "salary"]


In [18]:
from pyspark.sql import SparkSession

# getOrCreate() reuses the session already running in this notebook
# rather than starting a new one.
spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

df = spark.createDataFrame(data_parquet, columns)
df.show()


+---+-------+---+-----------+------+
| id|   name|age| department|salary|
+---+-------+---+-----------+------+
|  1|  Alice| 24|Engineering| 70000|
|  2|    Bob| 30|  Marketing| 60000|
|  3|Charlie| 28|      Sales| 55000|
|  4|  David| 35|Engineering| 90000|
|  5|    Eva| 26|         HR| 50000|
|  6|  Frank| 40|    Finance| 95000|
|  7|  Grace| 29|  Marketing| 62000|
|  8|  Helen| 32|         HR| 58000|
|  9|    Ian| 27|      Sales| 54000|
| 10|   Jack| 45| Management|120000|
+---+-------+---+-----------+------+



In [19]:
df.write.mode("overwrite").parquet("/data/employees.parquet")


26/01/15 15:13:05 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

In [20]:
df_parquet = spark.read.parquet("/data/employees.parquet")
df_parquet.show()
df_parquet.printSchema()


+---+-------+---+-----------+------+
| id|   name|age| department|salary|
+---+-------+---+-----------+------+
|  1|  Alice| 24|Engineering| 70000|
|  4|  David| 35|Engineering| 90000|
|  7|  Grace| 29|  Marketing| 62000|
| 10|   Jack| 45| Management|120000|
|  3|Charlie| 28|      Sales| 55000|
|  6|  Frank| 40|    Finance| 95000|
|  2|    Bob| 30|  Marketing| 60000|
|  9|    Ian| 27|      Sales| 54000|
|  8|  Helen| 32|         HR| 58000|
|  5|    Eva| 26|         HR| 50000|
+---+-------+---+-----------+------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)



In [21]:
!ls /data

employee.csv  employee.json  employees.parquet
