<a href="https://colab.research.google.com/github/sethkipsangmutuba/SQL/blob/main/2c_Summary_Statidstics_in_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup in Google Colab

Before running SQL queries, make sure to:

1. Import necessary libraries.
2. Load the Titanic dataset into a pandas DataFrame.
3. Load the DataFrame into an in-memory SQLite table using `sqlite3` and `DataFrame.to_sql`.


In [28]:
import pandas as pd
import sqlite3
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset('titanic')

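# Drop rows with nulls in these columns. `age` is not used in the queries below,
# but it is the only one of the four with missing values, so it decides which rows stay.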
df = df.dropna(subset=['fare', 'age', 'class', 'sex'])

# Create SQLite in-memory DB
conn = sqlite3.connect(":memory:")
df.to_sql("titanic", conn, index=False, if_exists="replace")


714
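
The setup cell drops nulls before loading the table. As a minimal sketch of how SQL aggregates treat `NULL` when it is left in, the example below uses a hypothetical three-row table (`demo`, with made-up fares, not Titanic data) and a separate connection named `conn_demo` so the notebook's `conn` is untouched.

```python
import sqlite3

# Hypothetical three-row table; the second fare is deliberately NULL.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE demo (sex TEXT, fare REAL)")
conn_demo.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("female", 10.0), ("female", None), ("male", 30.0)],
)

# COUNT(*) counts rows; COUNT(fare) and AVG(fare) skip NULL fares.
row_count, fare_count, avg_fare = conn_demo.execute(
    "SELECT COUNT(*), COUNT(fare), AVG(fare) FROM demo"
).fetchone()

print(row_count, fare_count, avg_fare)  # 3 2 20.0
```

Because `COUNT(*)` and `COUNT(column)` can disagree when nulls are present, deciding up front which rows to keep makes the counts and averages in a report easier to interpret.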

---

## Learning Objectives

By the end of this notebook, you will be able to:

- Use `GROUP BY` to segment Titanic data  
- Apply `MIN`, `MAX`, `AVG`, `SUM`, and `COUNT` in SQL  
- Create a clean summary statistics report using SQL in Python  

---

## Titanic Dataset Summary Report in SQL

We’ll segment our report by **class** (`class`: First, Second, Third) and **sex** (`sex`: female, male).

**1. What is the `MIN`, `MAX`, and `AVG` fare per class and sex?**


In [29]:
query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare
FROM titanic
GROUP BY class, sex
ORDER BY class, sex;
"""
pd.read_sql_query(query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare
0,First,female,25.9292,512.3292,107.946275
1,First,male,0.0,512.3292,71.142781
2,Second,female,10.5,65.0,21.95107
3,Second,male,10.5,73.5,21.113131
4,Third,female,6.75,46.9,15.875369
5,Third,male,0.0,56.4958,12.162695
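
The averages above carry six decimal places. One possible way to tidy them for a report is to wrap `AVG` in SQLite's `ROUND(value, digits)`. The sketch below assumes a hypothetical table `fares` with made-up values and its own connection, `round_conn`.

```python
import sqlite3

# Hypothetical table with three made-up First-class fares.
round_conn = sqlite3.connect(":memory:")
round_conn.execute("CREATE TABLE fares (class TEXT, fare REAL)")
round_conn.executemany(
    "INSERT INTO fares VALUES (?, ?)",
    [("First", 10.0), ("First", 20.0), ("First", 20.5)],
)

# ROUND(..., 2) keeps two decimal places of the group average.
rounded = round_conn.execute(
    """
    SELECT class, ROUND(AVG(fare), 2) AS avg_fare
    FROM fares
    GROUP BY class;
    """
).fetchall()

print(rounded)
```

The same `ROUND(AVG(fare), 2)` expression could replace `AVG(fare)` in the Titanic query if two decimals are enough.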


**2. What is the number of passengers per class and sex?**


In [30]:
query = """
SELECT
    class,
    sex,
    COUNT(*) AS passenger_count
FROM titanic
GROUP BY class, sex
ORDER BY class, sex;
"""
pd.read_sql_query(query, conn)


Unnamed: 0,class,sex,passenger_count
0,First,female,85
1,First,male,101
2,Second,female,74
3,Second,male,99
4,Third,female,102
5,Third,male,253


**3. What is the total fare collected per class and sex?**


In [31]:
query = """
SELECT
    class,
    sex,
    SUM(fare) AS total_fare
FROM titanic
GROUP BY class, sex
ORDER BY class, sex;
"""
pd.read_sql_query(query, conn)


Unnamed: 0,class,sex,total_fare
0,First,female,9175.4334
1,First,male,7185.4209
2,Second,female,1624.3792
3,Second,male,2090.2
4,Third,female,1619.2876
5,Third,male,3077.1619
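
The three reports above are tied together by one identity: for each group, `AVG(fare)` equals `SUM(fare) / COUNT(*)`. The sketch below copies the printed values from the outputs above into a plain dictionary (the name `report` is just for illustration) and checks the identity, allowing a small tolerance because the displayed numbers are rounded.

```python
import math

# (class, sex): (passenger_count, total_fare, avg_fare), copied from the outputs above.
report = {
    ("First", "female"): (85, 9175.4334, 107.946275),
    ("First", "male"): (101, 7185.4209, 71.142781),
    ("Second", "female"): (74, 1624.3792, 21.95107),
    ("Second", "male"): (99, 2090.2, 21.113131),
    ("Third", "female"): (102, 1619.2876, 15.875369),
    ("Third", "male"): (253, 3077.1619, 12.162695),
}

# Check avg = total / count for every group, within display rounding.
consistent = all(
    math.isclose(total / count, avg, abs_tol=1e-5)
    for count, total, avg in report.values()
)

# The group counts should add up to the number of rows loaded into the table.
total_passengers = sum(count for count, _, _ in report.values())

print(consistent, total_passengers)  # True 714
```

The total of 714 matches the row count returned by `to_sql` in the setup cell.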


---

## Final Summary Report: All in One

Let’s combine the three queries above into one unified report, as you would in MySQL:

- Group by `class` and `sex`
- Show:
  - `MIN(fare)`
  - `MAX(fare)`
  - `AVG(fare)`
  - `COUNT(*)` (number of passengers)
  - `SUM(fare)` (total fare collected)


In [32]:
query = """
SELECT
    class,
    sex,
    MIN(fare) AS min_fare,
    MAX(fare) AS max_fare,
    AVG(fare) AS avg_fare,
    COUNT(*) AS passenger_count,
    SUM(fare) AS total_fare
FROM titanic
GROUP BY class, sex
ORDER BY class, sex;
"""
pd.read_sql_query(query, conn)


Unnamed: 0,class,sex,min_fare,max_fare,avg_fare,passenger_count,total_fare
0,First,female,25.9292,512.3292,107.946275,85,9175.4334
1,First,male,0.0,512.3292,71.142781,101,7185.4209
2,Second,female,10.5,65.0,21.95107,74,1624.3792
3,Second,male,10.5,73.5,21.113131,99,2090.2
4,Third,female,6.75,46.9,15.875369,102,1619.2876
5,Third,male,0.0,56.4958,12.162695,253,3077.1619
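
As a cross-check, the same report can be built with a pandas `groupby` and compared with the SQL result. The sketch below is self-contained and uses a hypothetical four-row DataFrame (`sample`, with made-up fares) rather than the full Titanic table; `check_conn` is a separate connection introduced only for this example.

```python
import sqlite3
import pandas as pd

# Hypothetical four-row sample standing in for the Titanic table.
sample = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"],
    "sex": ["female", "female", "male", "male"],
    "fare": [80.0, 100.0, 7.25, 8.05],
})

check_conn = sqlite3.connect(":memory:")
sample.to_sql("sample", check_conn, index=False)

# Same shape as the unified report, run against the sample table.
sql_report = pd.read_sql_query(
    """
    SELECT
        class,
        sex,
        MIN(fare) AS min_fare,
        MAX(fare) AS max_fare,
        AVG(fare) AS avg_fare,
        COUNT(*) AS passenger_count,
        SUM(fare) AS total_fare
    FROM sample
    GROUP BY class, sex
    ORDER BY class, sex;
    """,
    check_conn,
)

# pandas equivalent using named aggregation; groupby sorts the keys by default.
pandas_report = sample.groupby(["class", "sex"], as_index=False).agg(
    min_fare=("fare", "min"),
    max_fare=("fare", "max"),
    avg_fare=("fare", "mean"),
    passenger_count=("fare", "count"),
    total_fare=("fare", "sum"),
)

# Raises AssertionError if the two reports differ (floats compared approximately).
pd.testing.assert_frame_equal(sql_report, pandas_report, check_dtype=False)
print(sql_report)
```

Swapping `sample` for the notebook's `df` and the table name for `titanic` should give the same kind of comparison on the full data, though `class` is a categorical column there, so the pandas side may need `observed=True` or a string cast first.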
