### Creating a New Column in a DataFrame

This notebook shows several ways to create a new column derived from existing columns in a PySpark DataFrame.

In [0]:
# sample data
data = [
    ("Rohish", 34, 5000),
    ("Smit", 45, 4000),
    ("Pushak", 23, 3000),
    ("Faisal", 37, 7000)
]

columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)
df.show()

+------+---+------+
|  Name|Age|Salary|
+------+---+------+
|Rohish| 34|  5000|
|  Smit| 45|  4000|
|Pushak| 23|  3000|
|Faisal| 37|  7000|
+------+---+------+



**Coding Question 1:** Creating a New Column Using Arithmetic Operations

Create a new column `Bonus` equal to 10% of `Salary`.

In [0]:
from pyspark.sql.functions import col

# Multiply Salary by 0.10, then cast to long so Bonus is a whole number
df1 = df.withColumn("Bonus", (col("Salary") * 0.10).cast("long"))
df1.show()

+------+---+------+-----+
|  Name|Age|Salary|Bonus|
+------+---+------+-----+
|Rohish| 34|  5000|  500|
|  Smit| 45|  4000|  400|
|Pushak| 23|  3000|  300|
|Faisal| 37|  7000|  700|
+------+---+------+-----+



**Coding Question 2:** Creating a New Column Using Conditional Statements

Create a new column `Category` that categorizes people by age:
- If age is less than 30, the category is `Young`.
- If age is between 30 and 40 (inclusive), the category is `Mid-Age`.
- If age is greater than 40, the category is `Senior`.

In [0]:
from pyspark.sql.functions import col, when

# Conditions are evaluated in order; the first match wins
df3 = df.withColumn(
    "Category",
    when(col("Age") < 30, "Young")
    .when((col("Age") >= 30) & (col("Age") <= 40), "Mid-Age")
    .otherwise("Senior"),
)
df3.show()

+------+---+------+--------+
|  Name|Age|Salary|Category|
+------+---+------+--------+
|Rohish| 34|  5000| Mid-Age|
|  Smit| 45|  4000|  Senior|
|Pushak| 23|  3000|   Young|
|Faisal| 37|  7000| Mid-Age|
+------+---+------+--------+



**Coding Question 3:** Creating a New Column by Combining Two Columns

Create a new column `Full Info` that combines `Name` and `Age` into a single string.

In [0]:
from pyspark.sql.functions import col, concat, lit

df4 = df.withColumn("Full Info", concat(col("Name"), lit(" is "), col("Age"), lit(" years old")))
df4.show(truncate=False)

+------+---+------+----------------------+
|Name  |Age|Salary|Full Info             |
+------+---+------+----------------------+
|Rohish|34 |5000  |Rohish is 34 years old|
|Smit  |45 |4000  |Smit is 45 years old  |
|Pushak|23 |3000  |Pushak is 23 years old|
|Faisal|37 |7000  |Faisal is 37 years old|
+------+---+------+----------------------+



**Coding Question 4:** Creating a New Column Using a UDF (User-Defined Function)

Define a UDF that classifies each salary into a range:
- `Low` if salary is less than 4000.
- `Medium` if salary is between 4000 and 6000 (inclusive).
- `High` if salary is greater than 6000.

In [0]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def classify_salary(salary):
    """Return the salary band for a single salary value."""
    if salary < 4000:
        return "Low"
    elif salary <= 6000:  # 4000 to 6000 inclusive
        return "Medium"
    else:
        return "High"

# Wrap the Python function as a UDF that returns a string column
salary_udf = udf(classify_salary, StringType())

df5 = df.withColumn("salary category", salary_udf(col("Salary")))
df5.show()

+------+---+------+---------------+
|  Name|Age|Salary|salary category|
+------+---+------+---------------+
|Rohish| 34|  5000|         Medium|
|  Smit| 45|  4000|         Medium|
|Pushak| 23|  3000|            Low|
|Faisal| 37|  7000|           High|
+------+---+------+---------------+
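Because the function wrapped by `udf` is ordinary Python, its boundary values can be checked without Spark. The sketch below does that for a null-tolerant variant; `classify_salary_safe` is a hypothetical name, and returning `None` for a missing salary is an assumption about the desired behaviour, not something the exercise specifies.

```python
def classify_salary_safe(salary):
    """Hypothetical null-tolerant variant of classify_salary."""
    if salary is None:
        return None  # assumption: a missing salary stays unclassified
    if salary < 4000:
        return "Low"
    if salary <= 6000:
        return "Medium"
    return "High"

# Exercise both boundaries and the missing-value case
checks = {
    value: classify_salary_safe(value)
    for value in (3999, 4000, 6000, 6001, None)
}
print(checks)
```

The original `classify_salary` would raise a `TypeError` on `None < 4000`, so a guard like this matters if the `Salary` column can contain nulls.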



**Coding Question 5:** Creating a New Column Using SQL Expressions

Use Spark SQL to create a new column `Net Salary`, where Net Salary = Salary + Bonus and Bonus is 10% of Salary.


In [0]:
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("emp_tbl")

df6 = spark.sql("""
    select *,
           cast(Salary * 0.10 as long) as Bonus,
           cast(Salary + Salary * 0.10 as long) as `Net Salary`
    from emp_tbl
""")
df6.show()

+------+---+------+-----+----------+
|  Name|Age|Salary|Bonus|Net Salary|
+------+---+------+-----+----------+
|Rohish| 34|  5000|  500|      5500|
|  Smit| 45|  4000|  400|      4400|
|Pushak| 23|  3000|  300|      3300|
|Faisal| 37|  7000|  700|      7700|
+------+---+------+-----+----------+



In [0]:
%sql
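-- Without the cast to long, Net Salary comes back with a fractional part (5500.0 rather than 5500)
select *, (Salary + (Salary * 0.10)) as `Net Salary` from emp_tbl;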

Name,Age,Salary,Net Salary
Rohish,34,5000,5500.0
Smit,45,4000,4400.0
Pushak,23,3000,3300.0
Faisal,37,7000,7700.0
