# PySpark SQL drop() Function: How to Drop Columns

## Introduction to the `drop()` Function

The `drop()` method removes one or more columns from a PySpark DataFrame. It is useful for cleaning up data by discarding columns that are not needed for further processing or analysis.


## Basic Syntax

```
DataFrame.drop(*cols)
```

### Parameters

- **`cols`**: The column(s) to drop, given as one or more column names (strings). A `Column` object is also accepted.


## How It Works

- `drop()` returns a new DataFrame without the specified columns. The original DataFrame is never modified; to keep the result, assign it to a variable (for example, `df = df.drop("EMAIL")`).
- You can drop multiple columns at once by passing multiple column names to `drop()`.


## Practical Examples

### 1. Dropping a Single Column

**Scenario**: You have a DataFrame with employee data, and you want to drop the `EMAIL` column since it’s not needed for your analysis.

**Code Example**:

```python
from pyspark.sql import SparkSession

# Reuse the active session (notebooks usually predefine `spark`) or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "John", "john@example.com", 3000),
    (2, "Jane", "jane@example.com", 4000),
    (3, "Tom", "tom@example.com", 3500)
], ["EMPLOYEE_ID", "NAME", "EMAIL", "SALARY"])

# Drop the "EMAIL" column
df.drop("EMAIL").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  3000|
|          2|Jane|  4000|
|          3| Tom|  3500|
+-----------+----+------+
```



### 2. Dropping Multiple Columns

**Scenario**: You want to drop both the `EMAIL` and `SALARY` columns from the employee DataFrame.

**Code Example**:

```python
# Drop multiple columns "EMAIL" and "SALARY"
df.drop("EMAIL", "SALARY").show()
```

```
+-----------+----+
|EMPLOYEE_ID|NAME|
+-----------+----+
|          1|John|
|          2|Jane|
|          3| Tom|
+-----------+----+
```



### 3. Handling Non-Existent Columns

**Scenario**: You attempt to drop a column that doesn’t exist, such as `ADDRESS`. When column names are passed as strings, `drop()` silently ignores any name that does not exist and drops only the ones that do; no error is raised.

**Code Example**:


```python
# Attempt to drop non-existent column "ADDRESS"
df.drop("ADDRESS", "EMAIL").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  3000|
|          2|Jane|  4000|
|          3| Tom|  3500|
+-----------+----+------+
```



### 4. Dropping Columns in a Loop

**Scenario**: You have a list of columns to drop, and you want to remove them dynamically in a loop.

**Code Example**:

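```python
# List of columns to drop
cols_to_drop = ["EMAIL", "SALARY"]

# Start from `df` under a new name so the original stays available for later examples
trimmed_df = df

# Drop columns dynamically using a loop
for col_name in cols_to_drop:
    trimmed_df = trimmed_df.drop(col_name)

trimmed_df.show()
```

```
+-----------+----+
|EMPLOYEE_ID|NAME|
+-----------+----+
|          1|John|
|          2|Jane|
|          3| Tom|
+-----------+----+
```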



### 5. Dropping Columns Using a Condition (Advanced)

**Scenario**: You want to drop all columns where the data type is `string`. This can be useful when you’re only interested in numerical columns for a certain analysis.

**Code Example**:

```python
# Drop columns where data type is string
string_cols = [col_name for col_name, dtype in df.dtypes if dtype == 'string']

df.drop(*string_cols).show()
```

```
+-----------+------+
|EMPLOYEE_ID|SALARY|
+-----------+------+
|          1|  3000|
|          2|  4000|
|          3|  3500|
+-----------+------+
```

