<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/1_Conditional_Handle_Nulls.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Expressions for Nulls

## Overview

### 🥅 Analysis Goals

- **Handle Missing Values**: Replace NULL values with appropriate defaults using COALESCE to ensure data consistency and accurate analysis
- **Cohort Spend**: Find the average total net revenue for each customer.

### 📘 Concepts Covered

- `COALESCE`
- `NULLIF`

[Source Documentation for Conditional Expressions](https://www.postgresql.org/docs/17/functions-conditional.html)

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COALESCE vs NULLIF


### Key Differences

- COALESCE accepts any number of arguments and returns the first one that isn't NULL
- NULLIF takes exactly two arguments and returns NULL when they are equal (otherwise it returns the first argument)

### Simple Way to Remember

- COALESCE turns NULLs into values (NULL → value)
- NULLIF turns matching values into NULL (value → NULL)

### Example

#### Create a table of "real" data jobs

In [66]:
%%sql

-- Create a table of "real" data jobs
CREATE TABLE data_jobs (
    id INT,
    job_title VARCHAR(30),
    is_real_job VARCHAR(20),
    salary INT
);

-- Insert our "professional" opinions
INSERT INTO data_jobs VALUES
(1, 'Data Analyst', 'yes', NULL),
(2, 'Data Scientist', NULL, 140000),
(3, 'Data Engineer', 'kinda', 120000);

SELECT * FROM data_jobs;

Unnamed: 0,id,job_title,is_real_job,salary
0,1,Data Analyst,yes,
1,2,Data Scientist,,140000.0
2,3,Data Engineer,kinda,120000.0


#### Use COALESCE to turn NULLs into values

In [67]:
%%sql

SELECT 
    job_title,
    COALESCE(is_real_job, 'questionable') AS is_real_job,
    COALESCE(salary, 0) AS salary
FROM data_jobs;


Unnamed: 0,job_title,is_real_job,salary
0,Data Analyst,yes,0
1,Data Scientist,questionable,140000
2,Data Engineer,kinda,120000


In [68]:
%%sql

SELECT 
    *,
    COALESCE(is_real_job, salary::text, '2nd Backup') AS salary_as_backup
FROM data_jobs;


Unnamed: 0,id,job_title,is_real_job,salary,salary_as_backup
0,1,Data Analyst,yes,,yes
1,2,Data Scientist,,140000.0,140000
2,3,Data Engineer,kinda,120000.0,kinda


#### Use NULLIF to turn matching values into NULL

In [69]:
%%sql

SELECT 
    job_title,
    NULLIF(is_real_job, 'kinda') AS is_real_job,
    NULLIF(salary, 100000) AS salary
FROM data_jobs;

Unnamed: 0,job_title,is_real_job,salary
0,Data Analyst,yes,
1,Data Scientist,,140000.0
2,Data Engineer,,120000.0


Drop the table that was just created.

In [70]:
%%sql

-- Clean up our controversial table
DROP TABLE data_jobs;

---
## COALESCE

### 📝 Notes

**`COALESCE()`**

- **COALESCE**: Returns the first non-null value from a list of expressions, or `NULL` if every expression is `NULL`.

- Syntax:

  ```sql
  SELECT COALESCE(expression1, expression2, ..., default_value);
  ```

- Used to replace `NULL` values with a default. Common in reporting and data cleaning, such as filling missing values with a placeholder.
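The fallback order described above can be tried without a PostgreSQL server. The sketch below is a minimal illustration, assuming an in-memory SQLite database (chosen only because it ships with Python) and a hypothetical `contacts` table that is not part of the course dataset. SQLite is looser about argument types than PostgreSQL, which is why the earlier example needed the `salary::text` cast.

```python
import sqlite3

# In-memory SQLite database; used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table: each row is missing a different mix of phone numbers.
conn.execute("CREATE TABLE contacts (name TEXT, mobile TEXT, landline TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ana", "555-0101", "555-0199"),  # both present: mobile is returned
        ("Ben", None, "555-0102"),        # mobile is NULL: landline is returned
        ("Cy", None, None),               # both NULL: only the default is left
    ],
)

rows = conn.execute(
    """
    SELECT
        name,
        COALESCE(mobile, landline, 'no phone') AS best_phone,
        COALESCE(mobile, landline) AS without_default
    FROM contacts
    ORDER BY name
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The last column shows that without a trailing default, `COALESCE` still returns `NULL` (Python `None`) when every argument is `NULL`.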

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Lifetime Value (LTV): Total revenue generated by a customer over time
  - Net Revenue: Total revenue after accounting for all adjustments
- **💡 Why It Matters**: Ensures consistent customer data analysis
    - Enables accurate customer revenue tracking
    - Prevents missing data from skewing analysis results
    - Maintains data integrity in customer-level calculations
- **🎯 Common Use Cases**: 
  - Customer name standardization
  - Revenue calculations
  - Customer data cleaning
- **📈 Related KPIs**: 
  - Customer net revenue
  - Customer count
  - Data completeness metrics

### 📈 Analysis

- Calculates each customer's average net revenue.

#### Cleaned Customer's Avg Net Revenue

**`COALESCE`**

1. Write a query that gets the LTV for each customer. 
   - Selects `customerkey` to group revenue calculations by customer.  
   - Calculates `net_revenue` using `SUM(quantity * netprice * exchangerate)`.  
   - Uses `GROUP BY customerkey` to aggregate revenue per customer.  

In [71]:
%%sql
    
SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    customerkey
-- HAVING SUM(quantity * netprice * exchangerate) IS NULL

Unnamed: 0,customerkey,net_revenue
0,2044589,2470.73
1,1603477,136.62
2,876049,2601.13
3,1469222,5278.54
4,2089398,98.39
...,...,...
49482,853617,903.31
49483,1573639,6973.42
49484,1355936,149.99
49485,967453,5.40


2. Put the query into a CTE (`sales_data`), then `LEFT JOIN` this CTE onto the `customer` table to return every customer's LTV, with missing values cleaned. 
   - Defines `sales_data` as a CTE that calculates `net_revenue` per customer.  
   - In the main query:
        - 🔔 Performs a `LEFT JOIN` on `customer` to retain all customers, even those without sales.  
        - 🔔 Uses `COALESCE(s.net_revenue, 0)` to ensure customers without sales show `0` LTV instead of `NULL`.

In [72]:
%%sql

-- Put query into a CTE
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.net_revenue,
    COALESCE(s.net_revenue, 0) AS cleaned_net_revenue
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey
LIMIT 10

Unnamed: 0,customerkey,net_revenue,cleaned_net_revenue
0,15,2217.41,2217.41
1,23,,0.0
2,36,,0.0
3,120,,0.0
4,180,2510.22,2510.22
5,185,1395.52,1395.52
6,189,,0.0
7,210,,0.0
8,225,,0.0
9,243,287.67,287.67


3. Calculate the average net revenue for customers that have sales and the average net revenue for all customers.
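   - Use `AVG(s.net_revenue)` to calculate the average net revenue for customers that have sales; `AVG` skips the `NULL` rows produced by the `LEFT JOIN`.
   - Use `AVG(COALESCE(s.net_revenue, 0))` to calculate the average net revenue for all customers, counting customers without sales as `0`.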

In [73]:
%%sql

-- Put query into a CTE
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    AVG(s.net_revenue) AS spending_customers_avg_net_revenue,  -- average net revenue for customers that have sales
    AVG(COALESCE(s.net_revenue, 0)) AS all_customers_avg_net_revenue -- average net revenue for all customers
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey

Unnamed: 0,spending_customers_avg_net_revenue,all_customers_avg_net_revenue
0,4170.94,1965.97


<img src="../Resources/images/6.1_customer_avg_revenue.png" alt="Customer Average Revenue" width="50%">
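To see why the two averages differ, it may help to rerun the same query shape on numbers small enough to check by hand. This is a sketch only, assuming an in-memory SQLite database and toy `customer` and `sales` tables with a simplified, hypothetical `net_revenue` column (the real `sales` table derives it from `quantity * netprice * exchangerate`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy data: four customers, but only customers 1 and 2 have sales.
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER);
    CREATE TABLE sales (customerkey INTEGER, net_revenue REAL);
    INSERT INTO customer VALUES (1), (2), (3), (4);
    INSERT INTO sales VALUES (1, 100.0), (1, 200.0), (2, 500.0);
    """
)

spending_avg, all_avg = conn.execute(
    """
    WITH sales_data AS (
        SELECT customerkey, SUM(net_revenue) AS net_revenue
        FROM sales
        GROUP BY customerkey
    )
    SELECT
        AVG(s.net_revenue),               -- NULL rows are skipped: (300 + 500) / 2
        AVG(COALESCE(s.net_revenue, 0))   -- NULL rows count as 0: (300 + 500 + 0 + 0) / 4
    FROM customer c
    LEFT JOIN sales_data s ON c.customerkey = s.customerkey
    """
).fetchone()
conn.close()

print(spending_avg, all_avg)  # 400.0 200.0
```

The numerator is the same in both columns; only the number of rows counted in the denominator changes.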


---
## NULLIF

### 📝 Notes

**`NULLIF`**

- **NULLIF**: Returns `NULL` if two expressions are equal; otherwise, returns the first expression.

- Syntax:

  ```sql
  SELECT NULLIF(expression1, expression2);
  ```

- Helps prevent division by zero: wrapping the divisor as `NULLIF(divisor, 0)` makes the result `NULL` instead of raising an error.
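The division guard can be modeled in plain Python, where `None` stands in for `NULL` and `ZeroDivisionError` plays the role of PostgreSQL's division-by-zero error. The helpers `nullif` and `divide` and the sample numbers are hypothetical, written only to mirror the SQL pattern `net_revenue / NULLIF(num_orders, 0)`.

```python
def nullif(value, match):
    """Model of SQL NULLIF: None (NULL) when the two values are equal."""
    return None if value == match else value


def divide(numerator, denominator):
    """Model of SQL division: a NULL operand yields NULL; a zero divisor raises."""
    if numerator is None or denominator is None:
        return None
    return numerator / denominator  # ZeroDivisionError when denominator == 0


# Hypothetical (net_revenue, num_orders) pairs; the second customer has no orders.
customers = [(300.0, 3), (0.0, 0)]

# Same shape as: net_revenue / NULLIF(num_orders, 0)
avg_order_values = [
    divide(revenue, nullif(orders, 0)) for revenue, orders in customers
]
print(avg_order_values)  # [100.0, None]
```

Calling `divide(0.0, 0)` without the `nullif` wrapper raises, which is the failure the guard avoids.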

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Cohort Analysis: Grouping customers by acquisition year
  - Average Order Value: Revenue per order for each customer
  - Customer Orders: Number of transactions per customer
- **💡 Why It Matters**: Enables accurate customer behavior analysis
    - Prevents division by zero errors in calculations
    - Allows proper calculation of average order values
    - Helps identify customer purchasing patterns
    - Provides clear view of customer order frequency
- **🎯 Common Use Cases**: 
  - Average order calculations
  - Customer cohort analysis
  - Order pattern analysis
- **📈 Related KPIs**: 
  - Average order value
  - Order frequency
  - Cohort metrics

### 📈 Analysis

- Calculates each customer's average net revenue.

#### Cleaned Customer's Avg Net Revenue

**`NULLIF`**

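1. Calculate the average net revenue for customers that have sales, then the same average with zero-revenue customers also excluded.
   - Use `AVG(s.net_revenue)` to calculate the average net revenue for customers that have sales; `AVG` skips the `NULL` rows produced by the `LEFT JOIN`.
   - Use `AVG(NULLIF(s.net_revenue, 0))` to turn `0` net revenue into `NULL` as well, so those customers are also left out of the average. The query reuses the `all_customers_avg_net_revenue` alias from the `COALESCE` version, but this is not an average over all customers.

> **NOTE:** Why is `all_customers_avg_net_revenue` different from the `COALESCE` version?
> - `AVG(COALESCE(s.net_revenue, 0))`: Includes **all customers**, replacing `NULL` (no sales) with `0`, lowering the average.  
> - `AVG(NULLIF(s.net_revenue, 0))`: Excludes both `NULL` (no sales) and `0` net revenue, leading to a higher average if some customers have `0`.
> - In this dataset both columns show `4170.94`, so zero-revenue customers, if any, do not change the average at two decimal places.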

In [4]:
%%sql

-- Put query into a CTE
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    AVG(s.net_revenue) AS spending_customers_avg_net_revenue,  -- average net revenue for customers that have sales
    AVG(NULLIF(s.net_revenue, 0)) AS all_customers_avg_net_revenue -- alias kept from the COALESCE query; NULLs and 0s are both skipped, so this is not an all-customer average
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey

Unnamed: 0,spending_customers_avg_net_revenue,all_customers_avg_net_revenue
0,4170.94,4170.94
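The real data gives the same value in both columns, so the effect of `NULLIF` is easier to see on a toy table that does contain a zero. The sketch below assumes an in-memory SQLite database and a hypothetical `customer_revenue` table with one `0` and one `NULL`, then compares all three averages side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy data: two paying customers, one with exactly 0, one with no sales (NULL).
conn.execute("CREATE TABLE customer_revenue (customerkey INTEGER, net_revenue REAL)")
conn.executemany(
    "INSERT INTO customer_revenue VALUES (?, ?)",
    [(1, 300.0), (2, 600.0), (3, 0.0), (4, None)],
)

plain_avg, coalesce_avg, nullif_avg = conn.execute(
    """
    SELECT
        AVG(net_revenue),               -- skips NULL only:     900 / 3
        AVG(COALESCE(net_revenue, 0)),  -- NULL counted as 0:   900 / 4
        AVG(NULLIF(net_revenue, 0))     -- skips NULL and 0:    900 / 2
    FROM customer_revenue
    """
).fetchone()
conn.close()

print(plain_avg, coalesce_avg, nullif_avg)  # 300.0 225.0 450.0
```

`COALESCE` widens the set of rows being averaged and `NULLIF` narrows it, so the two results move in opposite directions from the plain `AVG`.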


### 💡 What's the difference between `COALESCE` and `NULLIF`?

- `NULLIF(expr1, expr2)`: Returns NULL if `expr1 = expr2`, otherwise returns `expr1` (used to nullify specific values).  
- `COALESCE(expr1, expr2, ...)`: Returns the first non-NULL value from a list (used to replace NULLs with defaults).  
- **Difference:** `NULLIF` creates NULLs, while `COALESCE` replaces NULLs.