# PySpark SQL select() Function: How to Select Columns

## What is the `select()` Function?

The `select()` function in PySpark returns a new DataFrame that contains only the columns or expressions you ask for. The original DataFrame is left unchanged.

It allows you to:
- Pick specific columns
- Rename columns
- Apply transformations or expressions

It’s similar to the `SELECT` statement in SQL, where you specify which columns you want to retrieve from a table.

In PySpark, `select()` is one of the most commonly used functions when working with DataFrames, especially when you want to drop unnecessary columns or transform the data in specific columns.


## Syntax

```
DataFrame.select(*cols)
```

`cols`: The columns you want to select from the DataFrame. These can be specified as strings (column names), as `Column` expressions, or as a single list of either. Passing `"*"` selects every column.

## Key Features of `select()` Function

- **Selecting specific columns**:  
  You can pass column names directly to the `select()` function to retrieve the columns you need.

- **Column renaming**:  
  You can rename columns using the `alias()` method.

- **Expressions**:  
  You can compute new columns from existing ones, for example with arithmetic such as addition or multiplication.

- **Selecting nested columns**:  
  If your DataFrame contains nested data, `select()` allows you to extract specific fields from those nested structures.


## Practical Examples

## Example 1: Basic Column Selection

You have a DataFrame containing employee data, and you only want to view specific columns like `EMPLOYEE_ID` and `FIRST_NAME`.


```python
from pyspark.sql import SparkSession

# Reuse the active session (notebooks usually predefine `spark`) or create one
spark = SparkSession.builder.getOrCreate()

# Sample DataFrame
df = spark.createDataFrame([
    (1, "John", "Smith", 3000),
    (2, "Jane", "Doe", 4000),
    (3, "Tom", "Hardy", 3500)
], ["EMPLOYEE_ID", "FIRST_NAME", "LAST_NAME", "SALARY"])

# Select specific columns
df.select("EMPLOYEE_ID", "FIRST_NAME").show()
```

```text
+-----------+----------+
|EMPLOYEE_ID|FIRST_NAME|
+-----------+----------+
|          1|      John|
|          2|      Jane|
|          3|       Tom|
+-----------+----------+
```



## Example 2: Selecting Columns with Aliases

You want to display `FIRST_NAME` and `SALARY`, but rename `SALARY` to `EMPLOYEE_SALARY`.


```python
from pyspark.sql.functions import col

# Select columns and rename using alias
df.select("FIRST_NAME", col("SALARY").alias("EMPLOYEE_SALARY")).show()
```

```text
+----------+---------------+
|FIRST_NAME|EMPLOYEE_SALARY|
+----------+---------------+
|      John|           3000|
|      Jane|           4000|
|       Tom|           3500|
+----------+---------------+
```



## Example 3: Applying Expressions in `select()`

You want to show the `FIRST_NAME` and `LAST_NAME` along with a new column that calculates a 10% increase in the `SALARY`.


```python
from pyspark.sql.functions import expr

# Select columns and apply expression
df.select("FIRST_NAME", "LAST_NAME", expr("SALARY * 1.1").alias("NEW_SALARY")).show()
```

```text
+----------+---------+----------+
|FIRST_NAME|LAST_NAME|NEW_SALARY|
+----------+---------+----------+
|      John|    Smith|    3300.0|
|      Jane|      Doe|    4400.0|
|       Tom|    Hardy|    3850.0|
+----------+---------+----------+
```



## Example 4: Selecting Nested Columns

You have a DataFrame with nested columns and want to extract specific fields from the nested structure.


```python
# Sample nested DataFrame
# Note: Python dicts are inferred as map columns, not structs;
# dot notation reads a map key here and works the same way for struct fields
data = [
    (1, "John", {"street": "1234 Elm St", "city": "Denver"}),
    (2, "Jane", {"street": "5678 Maple St", "city": "Seattle"})
]
df_nested = spark.createDataFrame(data, ["EMPLOYEE_ID", "NAME", "ADDRESS"])

# Select fields from nested columns
df_nested.select("NAME", "ADDRESS.city").show()
```

```text
+----+-------+
|NAME|   city|
+----+-------+
|John| Denver|
|Jane|Seattle|
+----+-------+
```



## Example 5: Using `selectExpr()` for SQL-like Expressions

You want to write SQL-like expressions as plain strings. `selectExpr()` is a variant of `select()` that accepts SQL expression strings, so you do not need to wrap each one in `expr()`.


```python
# Using selectExpr for SQL-like expressions
df.selectExpr("FIRST_NAME", "SALARY * 2 as DOUBLE_SALARY").show()
```

```text
+----------+-------------+
|FIRST_NAME|DOUBLE_SALARY|
+----------+-------------+
|      John|         6000|
|      Jane|         8000|
|       Tom|         7000|
+----------+-------------+
```

