### In PySpark, you can add columns with withColumn() and check the schema of a DataFrame with printSchema(). Here's a breakdown:

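# Add a Column to a DataFrame

Use the .withColumn() method. It returns a new DataFrame with the column added, or replaced if a column with that name already exists: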

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.appName("AddColumnsExample").getOrCreate()

data = [(1, "Alice", 60), (2, "Bob", 45)]
df = spark.createDataFrame(data, ["id", "name", "score"])


In [0]:
# Add constant column
df = df.withColumn("country", lit("India"))

# Add calculated column
df = df.withColumn("passed", when(col("score") > 50, "Yes").otherwise("No"))

df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- score: long (nullable = true)
 |-- country: string (nullable = false)
 |-- passed: string (nullable = false)



In [0]:
# Render the DataFrame as a table (display() is Databricks-specific; use df.show() in plain PySpark)
df.display()

id,name,score,country,passed
1,Alice,60,India,Yes
2,Bob,45,India,No


In PySpark, toDF() is a method used to convert an RDD to a DataFrame, or to rename the columns of an existing DataFrame.

✅ 1. toDF() on an RDD → Create a DataFrame


In [0]:
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
df = rdd.toDF(["id", "name"])
df.show()

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+




🔹 Without toDF(), an RDD has no column names or schema.

🔹 toDF() gives it structure by converting it into a DataFrame.

✅ 2. toDF() on a DataFrame → Rename Columns

In [0]:

df = df.toDF("user_id", "user_name")
df.printSchema()

🔹 Use this to rename all columns at once. Pass exactly one new name per existing column, in the same order.

⚠️ Notes
On an RDD of tuples, toDF() without arguments auto-generates column names such as _1, _2, and so on.

In Scala you must import spark.implicits._ to call toDF() on an RDD; in PySpark it is available directly once a SparkSession has been created.

In [0]:
# 💡 Example: RDD → DataFrame with column names (column types are inferred from the data)
rdd = spark.sparkContext.parallelize([(1, "apple"), (2, "banana")])
df = rdd.toDF(["id", "fruit"])
df.show()