<table>
  <tr>
    <td style="text-align: left;">
      <h1>Lighthouse Labs</h1>
      <h2>W1D4 - Combining Data</h2>
      <strong>Instructor:</strong> Socorro E. Dominguez-Vidana
    </td>
    <td style="text-align: right;">
      <img src="img/lhl.jpeg" alt="LHL" width="200">
    </td>
  </tr>
</table>


<table>
<tr>
<td  style="text-align: left;"><img src="img/hi.png" alt="Hi" width="200"></td>
<td> 
    <b>Name:</b> Socorro Dominguez-Vidana <br>
    <b>Work:</b> University of Wisconsin-Madison <br>
    Data Scientist <br>
    <b>Hobbies:</b> Kung Fu, traveling, learning languages <br>
</td>
</tr>
</table>

#### Overview

- [ ] Multiple joins

- [ ] Demonstrate multiple `JOIN`s using a pre-prepared PostgreSQL database

- [ ] What is a subquery?

- [ ] `SELECT` subquery

- [ ] `FROM` subquery

- [ ] `WHERE` subquery

- [ ] Demonstrate subqueries

Follow with [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sedv8808/LHL_Lectures/main?labpath=W2D4%2FW2D4_DataQualityAssuranceProcess.ipynb)

In [1]:
%load_ext sql

In [4]:
%sql postgresql://sedv8808:postgres@localhost/insurance

### Introduction to SQL `JOIN`s

- A `JOIN` clause is used to combine rows from two or more tables based on a related column.

- Types of `JOIN`s:
    - `INNER JOIN`: Returns only the rows that have matching values in both tables.
    - `LEFT JOIN`: Returns all rows from the left table plus the matching rows from the right table; where there is no match, the right-side columns are `NULL`.
    - `RIGHT JOIN`: Returns all rows from the right table plus the matching rows from the left table; where there is no match, the left-side columns are `NULL`.
    - `FULL OUTER JOIN`: Returns all rows from both tables, pairing them where a match exists and filling the missing side with `NULL` where it does not.
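
To see how `INNER JOIN` and `LEFT JOIN` differ in practice, here is a minimal sketch that runs from Python against an in-memory SQLite database rather than the lecture's PostgreSQL instance. The two toy tables and their rows are made up for illustration; only the table and column names echo the insurance schema.

```python
import sqlite3

# In-memory SQLite database with hypothetical toy data (not the lecture's data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, client_id INTEGER, policy_type TEXT);
    INSERT INTO clients VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
    INSERT INTO policies VALUES (10, 1, 'Auto'), (11, 1, 'Home'), (12, 2, 'Life');
""")

# INNER JOIN keeps only clients that have at least one policy.
inner_rows = conn.execute("""
    SELECT c.first_name, p.policy_type
    FROM clients c
    INNER JOIN policies p ON c.client_id = p.client_id
    ORDER BY c.client_id, p.policy_id
""").fetchall()

# LEFT JOIN keeps every client; policy columns are NULL (None in Python)
# for clients without a matching policy.
left_rows = conn.execute("""
    SELECT c.first_name, p.policy_type
    FROM clients c
    LEFT JOIN policies p ON c.client_id = p.client_id
    ORDER BY c.client_id, p.policy_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Cleo has no policy, so she is absent from the `INNER JOIN` result and appears with `None` as her policy type in the `LEFT JOIN` result.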

## Illustration for joins

The insurance database simulates a fictional insurance company, **HT Insurance**, and contains essential information about *clients*, *policies*, *claims*, *agents*, and *payments*.

We are going to follow Emma, an actuarial analyst at HT Insurance, as she carries out insurance-related tasks such as evaluating claims, tracking policy information, and analyzing payments.

Database Tables and Relationships:

- *Clients*: This table stores basic information about each client, such as their name, address, and contact details.
- *Agents*: This table contains details about the insurance agents who manage policies for clients.
- *Policies*: This table tracks the insurance policies taken out by clients, including details about policy type, premium, and the agent managing the policy.
- *Claims*: This table stores data about claims filed by clients on their policies, including the amount of the claim and its current status.
- *Payments*: This table records payments made by clients toward their policies, including the amount and the type of payment.

![ERD Diagram](img/ERD.png)

Emma starts her day by analyzing client claims and their associated agents.

She needs to find out which clients have unresolved `claims` and who their assigned `agents` are.

Let's use multiple `JOIN`s to combine the relevant tables and retrieve the information:

In [9]:
%%sql
-- SQL Query: Finding unresolved claims with client and agent information

SELECT 
    c.first_name AS client_first_name, 
    c.last_name AS client_last_name, 
    p.policy_type,
    cl.claim_id, 
    cl.claim_date, 
    cl.amount, 
    cl.status, 
    a.first_name AS agent_first_name, 
    a.last_name AS agent_last_name
FROM 
    clients c
INNER JOIN 
    policies p ON c.client_id = p.client_id
INNER JOIN 
    claims cl ON p.policy_id = cl.policy_id
INNER JOIN 
    agents a ON p.agent_id = a.agent_id
WHERE 
    cl.status IN ('Pending', 'Denied');

 * postgresql://sedv8808:***@localhost/insurance
   postgresql://sedv8808:***@localhost/sakila
10 rows affected.


| client_first_name | client_last_name | policy_type | claim_id | claim_date | amount | status | agent_first_name | agent_last_name |
|---|---|---|---|---|---|---|---|---|
| Jane | Smith | Home | 10 | 2023-07-22 | 5000.0 | Pending | Bob | Lee |
| Jane | Smith | Home | 2 | 2023-07-22 | 5000.0 | Pending | Bob | Lee |
| Robert | Brown | Health | 11 | 2023-09-10 | 1500.0 | Denied | Carol | White |
| Robert | Brown | Health | 3 | 2023-09-10 | 1500.0 | Denied | Carol | White |
| Michael | Davis | Life | 13 | 2023-02-11 | 4500.0 | Pending | Eva | Parker |
| Michael | Davis | Life | 5 | 2023-02-11 | 4500.0 | Pending | Eva | Parker |
| Sarah | Wilson | Home | 14 | 2023-04-05 | 1800.0 | Denied | Alice | Johnson |
| Sarah | Wilson | Home | 6 | 2023-04-05 | 1800.0 | Denied | Alice | Johnson |
| Patricia | Taylor | Auto | 16 | 2023-08-20 | 2200.0 | Pending | Carol | White |
| Patricia | Taylor | Auto | 8 | 2023-08-20 | 2200.0 | Pending | Carol | White |


This query uses `INNER JOIN`s to combine data from the *clients*, *policies*, *claims*, and *agents* tables. The `WHERE` clause keeps only the *claims* whose status is **"Pending"** or **"Denied"**, since these are the ones that need attention.

Next, Emma wants to identify clients who have made multiple claims in 2023. 

Instead of joining and grouping in the outer query, she uses a correlated subquery in the `SELECT` list to count the claims each client made that year.

In [10]:
%%sql

-- 2.1 SELECT Subquery: Finding clients with multiple claims in 2023
-- PostgreSQL has no YEAR() function, so the year is read with EXTRACT.
-- A SELECT alias cannot be used in HAVING, so the filter is applied
-- in an outer query over the derived table.

SELECT 
    first_name, 
    last_name, 
    total_claims
FROM 
    (SELECT 
         c.first_name, 
         c.last_name, 
         (SELECT COUNT(cl.claim_id) 
          FROM claims cl 
          INNER JOIN policies p ON cl.policy_id = p.policy_id 
          WHERE p.client_id = c.client_id 
            AND EXTRACT(YEAR FROM cl.claim_date) = 2023) AS total_claims
     FROM 
         clients c) AS client_claims
WHERE 
    total_claims > 1;


The subquery calculates the total number of claims each client made in 2023.
We then filter to show only clients who have made more than one claim.
This approach simplifies the main query by delegating the calculation to the subquery.
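
The year filter inside the subquery can also be expressed as a half-open date range, which needs no date function and therefore reads the same in PostgreSQL and most other engines. The sketch below is only an illustration under stated assumptions: it uses Python's `sqlite3` with an in-memory database, made-up rows, and dates stored as ISO-8601 text so that string comparison matches chronological order.

```python
import sqlite3

# Hypothetical toy data; table and column names mirror the lecture schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, client_id INTEGER);
    CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, policy_id INTEGER, claim_date TEXT);
    INSERT INTO clients VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO policies VALUES (10, 1), (11, 2);
    INSERT INTO claims VALUES
        (100, 10, '2023-03-01'),
        (101, 10, '2023-11-15'),
        (102, 10, '2022-06-30'),
        (103, 11, '2023-05-05');
""")

# The correlated subquery counts each client's 2023 claims;
# the outer query keeps clients with more than one.
rows = conn.execute("""
    SELECT first_name, total_claims
    FROM (
        SELECT c.first_name,
               (SELECT COUNT(*)
                FROM claims cl
                INNER JOIN policies p ON cl.policy_id = p.policy_id
                WHERE p.client_id = c.client_id
                  AND cl.claim_date >= '2023-01-01'
                  AND cl.claim_date <  '2024-01-01') AS total_claims
        FROM clients c
    ) AS client_claims
    WHERE total_claims > 1
""").fetchall()

print(rows)
```

Ana's 2022 claim falls outside the range, so she is counted with two claims, and Ben's single 2023 claim does not pass the outer filter.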


Emma also needs to analyze the average premium per policy type to help the underwriting team adjust premium rates. She uses a `FROM` subquery (a derived table) to aggregate in two steps.

In [11]:
%%sql

-- FROM Subquery: Calculating average premium by policy type

SELECT 
    policy_summary.policy_type, 
    AVG(policy_summary.total_premium) AS avg_premium
FROM 
    (SELECT p.policy_type, SUM(p.premium) AS total_premium 
     FROM policies p 
     GROUP BY p.client_id, p.policy_type) AS policy_summary
GROUP BY 
    policy_summary.policy_type;

 * postgresql://sedv8808:***@localhost/insurance
   postgresql://sedv8808:***@localhost/sakila
4 rows affected.


| policy_type | avg_premium |
|---|---|
| Life | 1550.0 |
| Health | 700.0 |
| Auto | 1050.0 |
| Home | 2466.6666666666665 |


The subquery calculates the total premium each client pays for each policy type, grouping by client and policy type. The outer query then averages those per-client totals within each policy type.
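
The two-step aggregation matters because averaging per-client totals is not the same as averaging individual premiums. The following sketch, run from Python on an in-memory SQLite database with hypothetical rows, compares the two approaches.

```python
import sqlite3

# Hypothetical toy data: client 1 holds two Auto policies, client 2 holds one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policies (
        policy_id INTEGER PRIMARY KEY,
        client_id INTEGER,
        policy_type TEXT,
        premium REAL
    );
    INSERT INTO policies VALUES
        (1, 1, 'Auto', 500.0),
        (2, 1, 'Auto', 700.0),
        (3, 2, 'Auto', 800.0),
        (4, 2, 'Home', 2000.0);
""")

# Step 1 (subquery): total premium per client and policy type.
# Step 2 (outer query): average of those totals per policy type.
avg_of_client_totals = conn.execute("""
    SELECT policy_summary.policy_type,
           AVG(policy_summary.total_premium) AS avg_premium
    FROM (SELECT p.policy_type, SUM(p.premium) AS total_premium
          FROM policies p
          GROUP BY p.client_id, p.policy_type) AS policy_summary
    GROUP BY policy_summary.policy_type
    ORDER BY policy_summary.policy_type
""").fetchall()

# For comparison: a single-step average over individual policies.
avg_of_policies = conn.execute("""
    SELECT policy_type, AVG(premium) AS avg_premium
    FROM policies
    GROUP BY policy_type
    ORDER BY policy_type
""").fetchall()

print(avg_of_client_totals)
print(avg_of_policies)
```

For Auto, the per-client totals are 1200 and 800, so their average is 1000, while the plain average of the three Auto premiums is about 666.67. Which figure is appropriate depends on whether the question is about clients or about individual policies.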

Finally, Emma wants to identify high-value claims that are still unresolved. 
She uses a WHERE subquery to filter for clients with pending claims greater than $3,000.



In [12]:
%%sql
-- 2.3 WHERE Subquery: Finding high-value unresolved claims

SELECT 
    c.first_name, 
    c.last_name, 
    cl.amount, 
    cl.claim_date, 
    a.first_name AS agent_first_name, 
    a.last_name AS agent_last_name
FROM 
    clients c
INNER JOIN 
    policies p ON c.client_id = p.client_id
INNER JOIN 
    claims cl ON p.policy_id = cl.policy_id
INNER JOIN 
    agents a ON p.agent_id = a.agent_id
WHERE 
    cl.claim_id IN 
        (SELECT claim_id 
         FROM claims 
         WHERE amount > 3000 
           AND status = 'Pending');

 * postgresql://sedv8808:***@localhost/insurance
   postgresql://sedv8808:***@localhost/sakila
4 rows affected.


| first_name | last_name | amount | claim_date | agent_first_name | agent_last_name |
|---|---|---|---|---|---|
| Jane | Smith | 5000.0 | 2023-07-22 | Bob | Lee |
| Michael | Davis | 4500.0 | 2023-02-11 | Eva | Parker |
| Jane | Smith | 5000.0 | 2023-07-22 | Bob | Lee |
| Michael | Davis | 4500.0 | 2023-02-11 | Eva | Parker |


The subquery filters for claims that exceed $3,000 and are still "Pending."

This helps the claims department focus on high-value cases that need priority.
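
A `WHERE` subquery can also filter one table by membership in the result of another query. As a rough sketch on invented data (Python's `sqlite3`, in-memory, not the lecture database), the query below lists each client who has at least one pending claim above $3,000.

```python
import sqlite3

# Hypothetical toy data; names mirror the lecture schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, client_id INTEGER);
    CREATE TABLE claims (
        claim_id INTEGER PRIMARY KEY,
        policy_id INTEGER,
        amount REAL,
        status TEXT
    );
    INSERT INTO clients VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
    INSERT INTO policies VALUES (10, 1), (11, 2), (12, 3);
    INSERT INTO claims VALUES
        (100, 10, 5000.0, 'Pending'),
        (101, 11, 1500.0, 'Pending'),
        (102, 12, 4000.0, 'Approved'),
        (103, 10, 3500.0, 'Pending');
""")

# The subquery returns the client_ids that own a high-value pending claim;
# the outer query keeps only clients whose id is in that set.
high_value_clients = conn.execute("""
    SELECT c.first_name
    FROM clients c
    WHERE c.client_id IN (
        SELECT p.client_id
        FROM policies p
        INNER JOIN claims cl ON cl.policy_id = p.policy_id
        WHERE cl.amount > 3000
          AND cl.status = 'Pending'
    )
    ORDER BY c.client_id
""").fetchall()

print(high_value_clients)
```

Ana has two qualifying claims but is listed once, because `IN` only tests membership; a plain join would return one row per matching claim.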

### Section 3: JOIN vs. Subquery Discussion

Let's discuss when a subquery might be preferred over a join, and vice versa. 

For example, in the SELECT subquery above, we could have rewritten the query 
using a JOIN, but the subquery is preferred in certain cases.

Here’s how the same query could be written using a JOIN:

In [13]:
%%sql

-- PostgreSQL has no YEAR() function, so the year is read with EXTRACT.
SELECT 
    c.first_name, 
    c.last_name, 
    COUNT(cl.claim_id) AS total_claims
FROM 
    clients c
INNER JOIN 
    policies p ON c.client_id = p.client_id
INNER JOIN 
    claims cl ON p.policy_id = cl.policy_id
WHERE 
    EXTRACT(YEAR FROM cl.claim_date) = 2023
GROUP BY 
    c.client_id, 
    c.first_name, 
    c.last_name
HAVING 
    COUNT(cl.claim_id) > 1;


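This `JOIN` version targets the same result: clients with more than one claim in 2023. It aggregates in a single pass with `GROUP BY`, while the subquery version keeps the count self-contained in one expression, which some readers find easier to follow. Neither form is inherently faster; performance depends on the data and on the plan PostgreSQL chooses, which `EXPLAIN ANALYZE` will show.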
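
To check that the two styles agree, the sketch below runs a subquery version and a `JOIN` version on the same small, invented dataset (Python's `sqlite3`, in-memory) and compares the results. The year filter is left out to keep the example short.

```python
import sqlite3

# Hypothetical toy data: Ana has two claims (one per policy), Ben one, Cleo none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, client_id INTEGER);
    CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, policy_id INTEGER);
    INSERT INTO clients VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
    INSERT INTO policies VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO claims VALUES (100, 10), (101, 11), (102, 12);
""")

# Subquery style: count claims per client in the SELECT list, filter outside.
subquery_rows = conn.execute("""
    SELECT first_name, total_claims
    FROM (
        SELECT c.client_id, c.first_name,
               (SELECT COUNT(*)
                FROM claims cl
                INNER JOIN policies p ON cl.policy_id = p.policy_id
                WHERE p.client_id = c.client_id) AS total_claims
        FROM clients c
    ) AS client_claims
    WHERE total_claims > 1
    ORDER BY client_id
""").fetchall()

# JOIN style: join all three tables, then group and filter with HAVING.
join_rows = conn.execute("""
    SELECT c.first_name, COUNT(cl.claim_id) AS total_claims
    FROM clients c
    INNER JOIN policies p ON c.client_id = p.client_id
    INNER JOIN claims cl ON p.policy_id = cl.policy_id
    GROUP BY c.client_id, c.first_name
    HAVING COUNT(cl.claim_id) > 1
    ORDER BY c.client_id
""").fetchall()

print(subquery_rows)
print(join_rows)
```

Both queries return only Ana with a count of 2. The outer filter removes clients with zero or one claim in the subquery version, and the `INNER JOIN`s plus `HAVING` do the same in the `JOIN` version.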

### Subqueries are often preferred for:
1. **Simplicity**: Breaking complex logic into manageable parts.
2. **Modularity**: Subqueries can be used in different parts of the main query.
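3. **Performance**: In some cases a subquery can be faster, for example when it filters or aggregates rows before they are combined with other tables. Check the execution plan rather than assuming.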

JOINs, on the other hand, are preferred when:
1. You need to retrieve large datasets with complex relationships.
2. You need to return multiple rows rather than a single derived value.


### Conclusion

In this lecture, we followed Emma’s journey as she analyzed client claims, 
policies, and premiums using multiple JOINs and subqueries.

Key takeaways:
- **Multiple JOINs** allow you to combine data from several tables to perform complex analysis.
- **Subqueries** help you simplify logic, calculate derived values, and break down complex queries.

Both JOINs and subqueries are powerful tools in SQL, and choosing the right one depends on the specific use case and data you're working with.
