# PySpark SQL count() Function: How to Count Rows and Column Values

## Introduction to the `count()` Function

The `count()` function in PySpark counts the rows of a DataFrame or, combined with `groupBy()`, the rows in each group. It is similar to `COUNT` in SQL: `COUNT(*)` counts every row, while `COUNT(column)` counts only the rows where that column is not null. The column-level form is available in PySpark as the aggregate function `pyspark.sql.functions.count()`.


## Basic Syntax:

```
DataFrame.count()
DataFrame.groupBy(*cols).count()
```

### Usage:

- **Direct Application**: `DataFrame.count()` takes no arguments. It is an action that returns the total number of rows in the DataFrame as a Python integer.
- **With `groupBy()`**: `groupBy(*cols).count()` returns a new DataFrame with one row per distinct combination of the `groupBy()` columns, plus a `count` column holding the number of rows in each group.


## Why Use `count()`?

- It is a fundamental function for understanding the size of your data.
- It’s often used to calculate frequencies or counts of records and specific values in data analysis, data profiling, or reporting.


## Practical Examples

### 1. Counting Total Rows in a DataFrame

**Scenario**: You have a DataFrame with customer data, and you want to count the total number of rows in the DataFrame.

**Code Example**:

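```python
from pyspark.sql import SparkSession

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "John"),
    (2, "Jane"),
    (3, "Tom"),
    (4, "Jerry")
], ["ID", "NAME"])

# Count the total number of rows
row_count = df.count()
print(f"Total Rows: {row_count}")
```

**Output**:

```
Total Rows: 4
```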


### 2. Counting Non-Null Values in a Column

**Scenario**: You want to count the number of non-null values in the `NAME` column.

**Code Example**:

```python
df_with_null = spark.createDataFrame([
    (1, "John"),
    (2, "Jane"),
    (3, None),
    (4, "Jerry")
], ["ID", "NAME"])

# Count non-null values in the "NAME" column
non_null_count = df_with_null.filter(df_with_null.NAME.isNotNull()).count()
print(f"Non-null Names: {non_null_count}")
```

**Output**:

```
Non-null Names: 3
```


### 3. Counting Unique Values in a Column

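**Scenario**: You want to count the distinct values in the `NAME` column. Note that `distinct()` treats `null` as a value of its own, so the count below includes the row with the missing name.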

**Code Example**:

```python
# Count distinct values in the "NAME" column
distinct_count = df_with_null.select("NAME").distinct().count()
print(f"Distinct Names: {distinct_count}")
```

**Output**:

```
Distinct Names: 4
```


### 4. Counting Rows After Grouping

**Scenario**: You have a DataFrame with sales data, and you want to group by `ITEM` and count how many times each item appears in the dataset.

**Code Example**:

```python
df_sales = spark.createDataFrame([
    ("ItemA", 100),
    ("ItemB", 200),
    ("ItemA", 300),
    ("ItemC", 400),
    ("ItemB", 500)
], ["ITEM", "SALES"])

# Group by ITEM and count the occurrences
df_sales.groupBy("ITEM").count().show()
```

**Output**:

```
+-----+-----+
| ITEM|count|
+-----+-----+
|ItemA|    2|
|ItemB|    2|
|ItemC|    1|
+-----+-----+
```



### 5. Counting Rows Grouped by Multiple Columns

**Scenario**: You want to count occurrences of each combination of `ITEM` and `SALES`.

**Code Example**:

```python
# Group by both ITEM and SALES and count occurrences
df_sales.groupBy("ITEM", "SALES").count().show()
```

**Output**:

```
+-----+-----+-----+
| ITEM|SALES|count|
+-----+-----+-----+
|ItemA|  100|    1|
|ItemB|  200|    1|
|ItemA|  300|    1|
|ItemC|  400|    1|
|ItemB|  500|    1|
+-----+-----+-----+
```



### 6. Counting Non-Null Values in Every Column

**Scenario**: You want to count the non-null values in each column of the DataFrame, which is useful for detecting incomplete or missing data. A column whose count is lower than the total row count contains nulls.

**Code Example**:

```python
from pyspark.sql.functions import col, count

# Count non-null values for each column
df_sales.agg(*[count(col(c)).alias(c) for c in df_sales.columns]).show()
```

**Output**:

```
+----+-----+
|ITEM|SALES|
+----+-----+
|   5|    5|
+----+-----+
```

