# PySpark SQL sum() Function: Summing Values Made Simple

## Introduction to the `sum()` Function

The `sum()` function in `pyspark.sql.functions` is an aggregate function that adds up the numeric values in a column, either across the whole DataFrame or within each group. It works like the SQL `SUM()` function, including ignoring null values, and is a common aggregation when working with large datasets, especially financial or transactional data.


## Basic Syntax

```
DataFrame.groupBy(*cols).agg(sum("columnName"))
```

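### Parameters

- **`cols`**: The column(s) to group by. To sum over the entire DataFrame, omit `groupBy()` and call `agg()` directly.
- **`sum("columnName")`**: The aggregate expression. It takes the column to sum, given as a column name or a `Column` expression.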


## Why Use `sum()`?

- It is essential for computing total values, such as sales, profits, expenses, or any numeric metric that needs to be aggregated.
- It can be used in combination with `groupBy()` for grouped sums, or it can calculate global sums across the entire DataFrame.


## Practical Examples

### 1. Summing a Single Column

**Scenario**: You have a DataFrame with sales data, and you want to calculate the total sales.

**Code Example**:

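```python
# Note: this import shadows Python's built-in sum() in the current namespace
from pyspark.sql.functions import sum

# Assumes an active SparkSession named `spark`, as provided in most notebooks
df = spark.createDataFrame([
    ("ItemA", 100),
    ("ItemB", 200),
    ("ItemA", 300),
    ("ItemC", 400),
    ("ItemB", 500)
], ["ITEM", "SALES"])

# Sum the SALES column
df.agg(sum("SALES").alias("TOTAL_SALES")).show()
```

**Output**:

```
+-----------+
|TOTAL_SALES|
+-----------+
|       1500|
+-----------+
```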



### 2. Grouping and Summing Values

**Scenario**: You want to group by `ITEM` and calculate the total sales for each item.

**Code Example**:

```python
# Group by ITEM and sum SALES for each group
df.groupBy("ITEM").agg(sum("SALES").alias("TOTAL_SALES")).show()
```

**Output**:

```
+-----+-----------+
| ITEM|TOTAL_SALES|
+-----+-----------+
|ItemA|        400|
|ItemB|        700|
|ItemC|        400|
+-----+-----------+
```



### 3. Summing Multiple Columns

**Scenario**: You have sales and profit data, and you want to sum both columns.

**Code Example**:

```python
df_multi = spark.createDataFrame([
    ("ItemA", 100, 10),
    ("ItemB", 200, 20),
    ("ItemA", 300, 30),
    ("ItemC", 400, 40),
    ("ItemB", 500, 50)
], ["ITEM", "SALES", "PROFIT"])

# Sum SALES and PROFIT for the entire DataFrame
df_multi.agg(
    sum("SALES").alias("TOTAL_SALES"),
    sum("PROFIT").alias("TOTAL_PROFIT")
).show()
```

**Output**:

```
+-----------+------------+
|TOTAL_SALES|TOTAL_PROFIT|
+-----------+------------+
|       1500|         150|
+-----------+------------+
```



### 4. Summing with a Conditional Expression

**Scenario**: You want to sum only the `SALES` values that are greater than 200.

**Code Example**:

```python
from pyspark.sql.functions import when

# Sum SALES where SALES > 200
df.agg(sum(when(df.SALES > 200, df.SALES)).alias("TOTAL_SALES_OVER_200")).show()
```

**Output**:

```
+--------------------+
|TOTAL_SALES_OVER_200|
+--------------------+
|                1200|
+--------------------+
```



### 5. Summing Without Grouping (Global Aggregation)

**Scenario**: You want to calculate the total sales across the entire dataset, without any grouping.

**Code Example**:

```python
# Global sum of SALES without grouping
df.agg(sum("SALES").alias("TOTAL_GLOBAL_SALES")).show()
```

**Output**:

```
+------------------+
|TOTAL_GLOBAL_SALES|
+------------------+
|              1500|
+------------------+
```



### 6. Combining `sum()` with Other Aggregations in `agg()`

**Scenario**: You want to calculate the total sales and the total number of transactions (i.e., count of rows) for each item.

**Code Example**:

```python
from pyspark.sql.functions import count

# Group by ITEM, sum SALES, and count transactions
df.groupBy("ITEM").agg(
    sum("SALES").alias("TOTAL_SALES"),
    count("*").alias("TOTAL_TRANSACTIONS")
).show()
```

**Output**:

```
+-----+-----------+------------------+
| ITEM|TOTAL_SALES|TOTAL_TRANSACTIONS|
+-----+-----------+------------------+
|ItemA|        400|                 2|
|ItemB|        700|                 2|
|ItemC|        400|                 1|
+-----+-----------+------------------+
```

