<a href="https://colab.research.google.com/github/sethkipsangmutuba/SQL/blob/main/2d_Filtering_and_analysing_a_summary_statistic_report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

## Learning Objectives

By the end of this notebook, you’ll know how to:

- Filter rows using the `WHERE` clause (before aggregation)  
- Filter groups using the `HAVING` clause (after aggregation)  
- Use `GROUP BY`, `MIN`, `MAX`, `AVG`, `SUM`, and `COUNT` for grouped summaries  
- Order results for clearer analysis  

---

## Setup (Titanic Dataset + SQLite)


In [43]:
import pandas as pd
import sqlite3
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset("titanic")

# Drop rows with nulls in key columns
df = df.dropna(subset=["fare", "age", "class", "sex"])

# Add a placeholder 'year' column (2020 for every row) so later queries can demonstrate WHERE filtering
df["year"] = 2020

# Create SQLite in-memory database
conn = sqlite3.connect(":memory:")
df.to_sql("titanic", conn, index=False, if_exists="replace")


714

---

## 1: Summary Statistics Report

Generate a summary report grouped by **passenger class (`class`)** and **sex**, showing:

- Minimum fare (`MIN`)
- Maximum fare (`MAX`)
- Average fare (`AVG`)
- Total fare (`SUM`)
- Number of passengers (`COUNT`)


In [44]:
summary_query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare,
    COUNT(*) AS passenger_count,
    SUM(fare) AS total_fare
FROM titanic
GROUP BY class, sex
ORDER BY total_fare ASC;
"""

pd.read_sql_query(summary_query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare,passenger_count,total_fare
0,Third,female,6.75,46.9,15.875369,102,1619.2876
1,Second,female,10.5,65.0,21.95107,74,1624.3792
2,Second,male,10.5,73.5,21.113131,99,2090.2
3,Third,male,0.0,56.4958,12.162695,253,3077.1619
4,First,male,0.0,512.3292,71.142781,101,7185.4209
5,First,female,25.9292,512.3292,107.946275,85,9175.4334
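
As a quick sanity check on the report above, the aggregates should agree with each other: `avg_fare` should equal `total_fare / passenger_count` up to display rounding, and the group counts should add up to the 714 rows loaded in the setup cell. The sketch below hard-codes the values from the output table rather than re-running the query; the `report` list and the variable names are illustrative and not part of the original notebook.

```python
# Values copied from the summary output above:
# (class, sex, avg_fare, passenger_count, total_fare)
report = [
    ("Third", "female", 15.875369, 102, 1619.2876),
    ("Second", "female", 21.95107, 74, 1624.3792),
    ("Second", "male", 21.113131, 99, 2090.2),
    ("Third", "male", 12.162695, 253, 3077.1619),
    ("First", "male", 71.142781, 101, 7185.4209),
    ("First", "female", 107.946275, 85, 9175.4334),
]

# AVG(fare) should equal SUM(fare) / COUNT(*), up to display rounding.
max_gap = max(abs(total / count - avg) for _, _, avg, count, total in report)

# The group sizes should add up to the number of rows written by to_sql.
total_passengers = sum(count for _, _, _, count, _ in report)

print(f"largest AVG vs SUM/COUNT mismatch: {max_gap:.1e}")
print(f"total passengers across groups: {total_passengers}")
```

A large mismatch here would usually point to NULL handling (for example, `COUNT(*)` counting rows that `AVG` skips), which the `dropna` call in the setup cell rules out for `fare`.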


---

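## 2: Filter for Year = 2020

The setup cell added a placeholder `year` column set to 2020 for every row, so this filter keeps all 714 rows and the output matches step 1. On data spanning several years, the `WHERE` clause would drop the non-matching rows **before aggregation**: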

- Only include rows where `year = 2020`
- Then apply `GROUP BY`, `MIN`, `MAX`, `AVG`, `SUM`, and `COUNT` as before
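
To see a year filter actually change the aggregates, here is a minimal sketch on a small hypothetical `tickets` table that spans two years. The table, its five rows, and the `demo` connection are invented for illustration and are separate from the Titanic data.

```python
import sqlite3

# Hypothetical toy table, kept separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tickets (class TEXT, fare REAL, year INTEGER)")
demo.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        ("First", 50.0, 2019),
        ("First", 100.0, 2020),
        ("First", 60.0, 2020),
        ("Third", 8.0, 2019),
        ("Third", 10.0, 2020),
    ],
)

# No WHERE clause: every row contributes to its group's aggregates.
all_years = demo.execute(
    """
    SELECT class, COUNT(*) AS passenger_count, AVG(fare) AS avg_fare
    FROM tickets
    GROUP BY class
    ORDER BY class;
    """
).fetchall()

# WHERE removes the 2019 rows first; the aggregates see only what is left.
only_2020 = demo.execute(
    """
    SELECT class, COUNT(*) AS passenger_count, AVG(fare) AS avg_fare
    FROM tickets
    WHERE year = 2020
    GROUP BY class
    ORDER BY class;
    """
).fetchall()

print(all_years)
print(only_2020)
```

Both the counts and the averages differ between the two results, because the 2019 rows never reach `GROUP BY` in the second query.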


In [45]:
year_filter_query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare,
    COUNT(*) AS passenger_count,
    SUM(fare) AS total_fare
FROM titanic
WHERE year = 2020
GROUP BY class, sex
ORDER BY total_fare ASC;
"""

pd.read_sql_query(year_filter_query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare,passenger_count,total_fare
0,Third,female,6.75,46.9,15.875369,102,1619.2876
1,Second,female,10.5,65.0,21.95107,74,1624.3792
2,Second,male,10.5,73.5,21.113131,99,2090.2
3,Third,male,0.0,56.4958,12.162695,253,3077.1619
4,First,male,0.0,512.3292,71.142781,101,7185.4209
5,First,female,25.9292,512.3292,107.946275,85,9175.4334


---

## 3: Filter for Rows Where Fare < 60

Use a `WHERE` clause to keep only rows where `fare` is less than 60 **before aggregation**. The query keeps the `year = 2020` condition from step 2 and joins the two conditions with `AND`.

- This helps focus your analysis on lower-fare passengers
- Then apply `GROUP BY`, `MIN`, `MAX`, `AVG`, `SUM`, and `COUNT` as in previous steps


In [46]:
fare_filter_query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare,
    COUNT(*) AS passenger_count,
    SUM(fare) AS total_fare
FROM titanic
WHERE year = 2020
  AND fare < 60
GROUP BY class, sex
ORDER BY total_fare ASC;
"""

pd.read_sql_query(fare_filter_query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare,passenger_count,total_fare
0,First,female,25.9292,59.4,44.272925,24,1062.5502
1,Second,female,10.5,41.5792,20.755267,72,1494.3792
2,Third,female,6.75,46.9,15.875369,102,1619.2876
3,Second,male,10.5,41.5792,18.326596,94,1722.7
4,First,male,0.0,57.0,34.093011,62,2113.7667
5,Third,male,0.0,56.4958,12.162695,253,3077.1619


---

## 4: Filter Out Groups with Fewer Than 4 Passengers (Using `HAVING`)

After performing your aggregation with `GROUP BY`, use the `HAVING` clause to filter out groups that have fewer than 4 passengers.

- Use `HAVING COUNT(*) >= 4`
- This filters **after aggregation**, unlike `WHERE`, which filters rows before grouping
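
The order of the two filters matters: `WHERE` can shrink a group before `HAVING` ever counts it. The sketch below shows this on a hypothetical `tickets` table; the table, its rows, and the `demo` connection are made up for illustration and are not part of the Titanic data.

```python
import sqlite3

# Hypothetical toy table, kept separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tickets (class TEXT, fare REAL)")
rows = [("First", f) for f in (100.0, 50.0, 55.0, 40.0, 45.0)]
rows += [("Second", f) for f in (20.0, 30.0)]
rows += [("Third", f) for f in (8.0, 10.0, 9.0, 65.0)]
demo.executemany("INSERT INTO tickets VALUES (?, ?)", rows)

# HAVING alone: groups are counted on all rows, then small groups are dropped.
having_only = demo.execute(
    """
    SELECT class, COUNT(*) AS passenger_count
    FROM tickets
    GROUP BY class
    HAVING COUNT(*) >= 4
    ORDER BY class;
    """
).fetchall()

# WHERE + HAVING: rows with fare >= 60 are removed first, so "Third"
# falls from 4 rows to 3 and no longer passes the HAVING condition.
where_then_having = demo.execute(
    """
    SELECT class, COUNT(*) AS passenger_count
    FROM tickets
    WHERE fare < 60
    GROUP BY class
    HAVING COUNT(*) >= 4
    ORDER BY class;
    """
).fetchall()

print(having_only)
print(where_then_having)
```

SQLite also accepts the column alias in this position (`HAVING passenger_count >= 4`), but spelling out `COUNT(*)` is the more portable choice, since some databases do not allow select-list aliases in `HAVING`.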


In [47]:
having_filter_query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare,
    COUNT(*) AS passenger_count,
    SUM(fare) AS total_fare
FROM titanic
WHERE year = 2020
  AND fare < 60
GROUP BY class, sex
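HAVING COUNT(*) >= 4
ORDER BY total_fare ASC;
"""

pd.read_sql_query(having_filter_query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare,passenger_count,total_fare
0,First,female,25.9292,59.4,44.272925,24,1062.5502
1,Second,female,10.5,41.5792,20.755267,72,1494.3792
2,Third,female,6.75,46.9,15.875369,102,1619.2876
3,Second,male,10.5,41.5792,18.326596,94,1722.7
4,First,male,0.0,57.0,34.093011,62,2113.7667
5,Third,male,0.0,56.4958,12.162695,253,3077.1619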
