## DataFrame Transformations in PySpark

### 1. Selecting Columns
The most basic transformation is selecting specific columns from a DataFrame.

PySpark offers several ways to do this: passing column names, passing column objects and expressions, or running SQL queries.

In [0]:
# Create a DataFrame from a CSV file
df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee_1.csv")

# Preview the first 5 rows
df.show(5)

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|
|  5|  Vikash| 31|300000|        null|nominee5|
+---+--------+---+------+------------+--------+



**To select specific columns from a DataFrame, use the .select() method.**

In [0]:
df.select("name").show()

+--------+
|    name|
+--------+
|  Rohish|
|  Nikita|
|  Pritam|
|Prantosh|
|  Vikash|
+--------+



In [0]:
df.select("name", "salary", "*").show()

+--------+------+---+--------+---+------+------------+--------+
|    name|salary| id|    name|age|salary|     address| nominee|
+--------+------+---+--------+---+------+------------+--------+
|  Rohish| 75000|  1|  Rohish| 26| 75000|        Pune|nominee1|
|  Nikita|100000|  2|  Nikita| 23|100000|uttarpradesh|nominee2|
|  Pritam|150000|  3|  Pritam| 22|150000|   Bangalore|nominee3|
|Prantosh|200000|  4|Prantosh| 17|200000|     Kolkata|nominee4|
|  Vikash|300000|  5|  Vikash| 31|300000|        null|nominee5|
+--------+------+---+--------+---+------+------------+--------+



In [0]:
df.select(df.name, df.age).show()

+--------+---+
|    name|age|
+--------+---+
|  Rohish| 26|
|  Nikita| 23|
|  Pritam| 22|
|Prantosh| 17|
|  Vikash| 31|
+--------+---+



In [0]:
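# Using the col() function
# Import only what is needed; a wildcard import shadows Python built-ins such as sum and max.
# expr is used in a later cell.
from pyspark.sql.functions import col, expr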

df.select(col("name"), col("age")).show()


+--------+---+
|    name|age|
+--------+---+
|  Rohish| 26|
|  Nikita| 23|
|  Pritam| 22|
|Prantosh| 17|
|  Vikash| 31|
+--------+---+



**Why use `col()` when `select()` already accepts plain column names?**

Let's try an expression: adding 5 to the `id` column.

In [0]:
# Try an expression: add 5 to the id column
df.select("id + 5")

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3222529370696004>:2
      1 # Try an expression: add 5 to the id column
----> 2 df.select("id + 5")

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res = func(*args, **kwargs)
     49     logger

The error shows that Spark treats `id + 5` as a column name and looks for it in the DataFrame, but no column named `id + 5` exists.

When you pass "id + 5" as a string to .select(), PySpark resolves the whole string as a column name; it does not parse it as an expression. It therefore never interprets "id + 5" as "take the id column and add 5 to its values."

**Correct Way to Perform the Transformation:**
To add 5 to the id column, reference the column explicitly with `col()` and then apply the addition to the resulting column object.

In [0]:
# Add 5 to the id column using col()
df.select("id", col("id") + 5).show()

+---+--------+
| id|(id + 5)|
+---+--------+
|  1|       6|
|  2|       7|
|  3|       8|
|  4|       9|
|  5|      10|
+---+--------+



**selectExpr() for SQL-like Syntax:**

If you prefer SQL-like syntax, use selectExpr(). It accepts SQL expressions as strings and parses them, so a string such as "id + 5" is evaluated as an expression instead of being looked up as a column name.

In [0]:
df.selectExpr("name", "age", "salary", "id + 5").show()

+--------+---+------+--------+
|    name|age|salary|(id + 5)|
+--------+---+------+--------+
|  Rohish| 26| 75000|       6|
|  Nikita| 23|100000|       7|
|  Pritam| 22|150000|       8|
|Prantosh| 17|200000|       9|
|  Vikash| 31|300000|      10|
+--------+---+------+--------+



In [0]:

# expr() parses a SQL expression string, so SQL-style aliases work inside select()
df.select(expr("id as emp_id"), expr("id + 5 as new_emp_id")).show()

+------+----------+
|emp_id|new_emp_id|
+------+----------+
|     1|         6|
|     2|         7|
|     3|         8|
|     4|         9|
|     5|        10|
+------+----------+



**Pandas-style column selection in PySpark**

- When several DataFrames share column names, a bare name such as "name" becomes ambiguous in operations like joins.
- The pandas-style `df["column"]` syntax ties each column to a specific DataFrame, which makes it easy to tell same-named columns apart.

In [0]:
df.select(df["name"], df["age"]).show()

+--------+---+
|    name|age|
+--------+---+
|  Rohish| 26|
|  Nikita| 23|
|  Pritam| 22|
|Prantosh| 17|
|  Vikash| 31|
+--------+---+



**Spark SQL**

To query a DataFrame with spark.sql(), first register it as a `temporary view` or a `global temporary view`, so that SQL statements can refer to it by name.

You can create a temporary view with the `.createOrReplaceTempView()` method, which makes the DataFrame available to SQL queries run through spark.sql().

In [0]:
df.createOrReplaceTempView("employee_vw")

In [0]:
spark.sql("""
        select * from employee_vw          
""").show()

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|
|  5|  Vikash| 31|300000|        null|nominee5|
+---+--------+---+------+------------+--------+



### Types of Views

**Temporary View (createOrReplaceTempView):**
- A temporary view is session-scoped and will be dropped when the Spark session ends.
- It can be created using `.createOrReplaceTempView()`.
- Example: `df.createOrReplaceTempView("temp_view")`

**Global Temporary View (createGlobalTempView):**
- A global temporary view is shared across all Spark sessions and will exist until the Spark application is terminated.
- It can be created using `.createGlobalTempView()`.
- Example: `df.createGlobalTempView("global_temp_view")`
- To query a global temporary view, you must qualify its name with the special namespace `global_temp`.
- Example query: `result = spark.sql("SELECT * FROM global_temp.global_temp_view")`


