In PySpark, the DataFrame method toJSON() converts each row of a DataFrame into a JSON string and returns the results as an RDD of strings.
This is helpful when you want to export, debug, or transform row-level data in JSON format.

✅ Example: Using toJSON() in PySpark

In [0]:
# Sample DataFrame
data = [
    (1, "Alice", 29),
    (2, "Bob", 35),
    (3, "Cathy", 23)
]
columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.display()  # Databricks notebook helper; use df.show() in plain PySpark

Original DataFrame:


id,name,age
1,Alice,29
2,Bob,35
3,Cathy,23


In [0]:
# Convert each row to JSON string
json_rdd = df.toJSON()

print("DataFrame rows converted to JSON strings:")
# collect() pulls every row to the driver; fine for small data, prefer take(n) on large DataFrames
for row in json_rdd.collect():
    print(row)

DataFrame rows converted to JSON strings:
{"id":1,"name":"Alice","age":29}
{"id":2,"name":"Bob","age":35}
{"id":3,"name":"Cathy","age":23}
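Because each element is a standalone JSON document, it can be parsed back into a Python dict with the standard json module. A minimal sketch, assuming the three strings printed above (hard-coded here so the snippet runs without a Spark session; `json_lines` and `records` are illustrative names):

```python
import json

# The strings that json_rdd.collect() returned in the example above,
# hard-coded so this snippet runs without a Spark session.
json_lines = [
    '{"id":1,"name":"Alice","age":29}',
    '{"id":2,"name":"Bob","age":35}',
    '{"id":3,"name":"Cathy","age":23}',
]

# Each string is an independent JSON object, so parse them one at a time.
records = [json.loads(line) for line in json_lines]

names = [r["name"] for r in records]
print(names)  # ['Alice', 'Bob', 'Cathy']
```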


In [0]:
# Save JSON strings as text files (one JSON object per line)
# Note: saveAsTextFile() fails if the output directory already exists
json_rdd.saveAsTextFile("output/json_data")

📂 Output in directory output/json_data/

Inside the folder, Spark creates one part file per RDD partition, since the partitions are written in parallel:

output/json_data/part-00000
output/json_data/part-00001
...
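To inspect the saved output outside Spark, the part files can be read back line by line. A rough sketch, assuming the directory lives on the local filesystem (not HDFS or cloud storage) and the files are uncompressed; `read_json_lines` is a hypothetical helper, not part of PySpark:

```python
import glob
import json
import os


def read_json_lines(directory):
    """Parse every line of every part-* file in `directory` into a dict."""
    records = []
    # sorted() keeps part-00000, part-00001, ... in order
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records


if __name__ == "__main__":
    for record in read_json_lines("output/json_data"):
        print(record)
```

Within Spark itself, spark.read.json("output/json_data") loads the same directory back into a DataFrame.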

⚡ Alternative: Use DataFrame’s Built-in JSON Writer

If you want structured JSON output directly from a DataFrame:

In [0]:
# Writes DataFrame directly in JSON format
df.write.json("output/json_df")

This will also create part-* files with one JSON object per line. Unlike toJSON(), there is no intermediate RDD of strings: the DataFrame writer serializes the rows itself, and it supports writer settings such as a save mode and compression.

⚡ Notes:

- toJSON() returns an RDD of strings, not a DataFrame.
- Each element in the RDD is a JSON string representation of a row.
- Useful for:
    - Exporting JSON lines to files (.saveAsTextFile()).
    - Sending row-level data to APIs.
    - Debugging transformations in JSON format.
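
For the API use case, the JSON strings are usually grouped into batches so that each request carries several rows. A minimal sketch of the batching step only, assuming the endpoint accepts a JSON array as its request body; `batch_payloads` and `batch_size` are hypothetical names, and the HTTP call itself is left out:

```python
from itertools import islice


def batch_payloads(json_strings, batch_size=2):
    """Yield JSON array payloads, each holding up to `batch_size` rows."""
    iterator = iter(json_strings)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        # Each element is already valid JSON, so joining avoids re-parsing.
        yield "[" + ",".join(chunk) + "]"


# Stand-in for the strings produced by toJSON() in the example above.
rows = [
    '{"id":1,"name":"Alice","age":29}',
    '{"id":2,"name":"Bob","age":35}',
    '{"id":3,"name":"Cathy","age":23}',
]
payloads = list(batch_payloads(rows, batch_size=2))
print(len(payloads))  # 2 payloads: rows 1-2, then row 3
```

Because the function accepts any iterable, it could be fed from json_rdd.toLocalIterator() on the driver, or called inside json_rdd.foreachPartition(...) to send from the executors.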