
# Top-N Analysis Using Ranking Window Functions

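In this notebook, we’ll use **SQL ranking window functions** like `RANK()`, `DENSE_RANK()`, and `ROW_NUMBER()` to identify top-N records across groups. The examples use the Titanic dataset as a stand-in: the setup adds simulated year, country, and water-access columns and loads the result into an in-memory SQLite table. The same queries apply to any dataset of interest.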

Ranking functions help us:

- Rank rows within partitions (e.g., top earners per class or region)
- Break ties (or not) depending on the function used
- Perform advanced filtering like top 3 per group
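
To make the tie-handling difference concrete, here is a minimal sketch that runs all three functions side by side. The `scores` table and its values are hypothetical and exist only for this illustration. It assumes the SQLite library bundled with Python is version 3.25.0 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy table: four scores, two of them tied at 90.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
demo_conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 100), ("b", 90), ("c", 90), ("d", 80)],
)

comparison = demo_conn.execute(
    """
    SELECT
        name,
        score,
        ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,   -- always unique
        RANK()       OVER (ORDER BY score DESC)       AS rnk,       -- ties share, gaps follow
        DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk  -- ties share, no gaps
    FROM scores
    ORDER BY score DESC, name;
    """
).fetchall()

for row in comparison:
    print(row)
# ('a', 100, 1, 1, 1)
# ('b', 90, 2, 2, 2)
# ('c', 90, 3, 2, 2)
# ('d', 80, 4, 4, 3)
```

The tied rows `b` and `c` get distinct numbers from `ROW_NUMBER()`, share rank 2 under both `RANK()` and `DENSE_RANK()`, and only `RANK()` skips to 4 for the next row.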

---

## Setup: Titanic Dataset + SQLite


In [58]:
import pandas as pd
import numpy as np
import sqlite3
import seaborn as sns

# Load sample dataset
df = sns.load_dataset("titanic").dropna(subset=["age", "fare", "embarked", "class"])

# Simulate necessary columns for the exercise
np.random.seed(42)
df["time_period"] = np.random.choice([2015, 2016, 2017], size=len(df))
df["country_name"] = df["embarked"] + "_" + df.index.astype(str)
df["pct_managed_drinking_water_services"] = np.random.choice([80, 85, 90, 95, 100], size=len(df), p=[0.2, 0.3, 0.2, 0.2, 0.1])

# Create SQLite in-memory database
conn = sqlite3.connect(":memory:")
df.to_sql("access_to_basic_services", conn, index=False, if_exists="replace")


712

---

## Base Query: Select Fields

Start by selecting the key fields you’ll use for ranking:

In [59]:
query_base = """
SELECT
    country_name,
    time_period,
    pct_managed_drinking_water_services
FROM access_to_basic_services;
"""

pd.read_sql_query(query_base, conn)


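| | country_name | time_period | pct_managed_drinking_water_services |
|---|---|---|---|
| 0 | S_0 | 2017 | 85 |
| 1 | C_1 | 2015 | 90 |
| 2 | S_2 | 2017 | 85 |
| 3 | S_3 | 2017 | 80 |
| 4 | S_4 | 2015 | 85 |
| ... | ... | ... | ... |
| 707 | Q_885 | 2017 | 100 |
| 708 | S_886 | 2015 | 95 |
| 709 | S_887 | 2017 | 80 |
| 710 | C_889 | 2016 | 95 |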


---

## Task 1: Use `ROW_NUMBER()` to Rank Countries by Water Access Level (Per Year)

Use the `ROW_NUMBER()` window function to assign a unique rank to each country **within each year**, based on water access levels.

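- Use `PARTITION BY time_period` so the numbering restarts for each year
- Order by `pct_managed_drinking_water_services`; the query below sorts `ASC`, so rank 1 goes to the lowest access level (switch to `DESC` to rank from highest to lowest)
- `ROW_NUMBER()` gives every row a distinct number, even when values tie; the order among tied rows is arbitrary unless the `ORDER BY` includes a tie-breaker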


In [60]:
query_row_number = """
SELECT
    country_name,
    time_period,
    pct_managed_drinking_water_services,
    ROW_NUMBER() OVER (
        PARTITION BY time_period
        ORDER BY pct_managed_drinking_water_services ASC
    ) AS rank_of_water_services
FROM access_to_basic_services;
"""

pd.read_sql_query(query_row_number, conn)


| | country_name | time_period | pct_managed_drinking_water_services | rank_of_water_services |
|---|---|---|---|---|
| 0 | Q_16 | 2015 | 80 | 1 |
| 1 | C_30 | 2015 | 80 | 2 |
| 2 | S_33 | 2015 | 80 | 3 |
| 3 | C_52 | 2015 | 80 | 4 |
| 4 | S_123 | 2015 | 80 | 5 |
| ... | ... | ... | ... | ... |
| 707 | S_771 | 2017 | 100 | 227 |
| 708 | C_866 | 2017 | 100 | 228 |
| 709 | S_870 | 2017 | 100 | 229 |
| 710 | S_884 | 2017 | 100 | 230 |
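
To rank from highest to lowest, the sort direction can be flipped to `DESC`, and a second sort key makes the numbering of tied rows reproducible. The following is a minimal sketch under stated assumptions: `access_demo` and its five rows are hypothetical stand-ins that reuse the notebook's column names, and breaking ties on `country_name` is one possible choice, not something the notebook prescribes.

```python
import sqlite3

# Hypothetical toy table that mirrors the notebook's column names.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE access_demo ("
    "country_name TEXT, time_period INTEGER, "
    "pct_managed_drinking_water_services INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO access_demo VALUES (?, ?, ?)",
    [
        ("S_1", 2015, 90), ("C_2", 2015, 100), ("Q_3", 2015, 90),
        ("S_4", 2016, 80), ("C_5", 2016, 95),
    ],
)

ranked_desc = demo_conn.execute(
    """
    SELECT
        country_name,
        time_period,
        pct_managed_drinking_water_services,
        ROW_NUMBER() OVER (
            PARTITION BY time_period
            -- DESC puts the highest access first; country_name breaks ties
            ORDER BY pct_managed_drinking_water_services DESC, country_name
        ) AS rank_of_water_services
    FROM access_demo
    ORDER BY time_period, rank_of_water_services;
    """
).fetchall()

for row in ranked_desc:
    print(row)
```

In the 2015 partition, `Q_3` and `S_1` both sit at 90; the tie-breaker places `Q_3` ahead of `S_1` on every run instead of leaving their order to the engine.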


---

## Task 2: Show Ranks for Only Those with 100% Access

Filter the ranked results to display only countries that have **100% water access**.

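- Add `WHERE pct_managed_drinking_water_services = 100` to keep only rows at 100%
- `WHERE` is evaluated before window functions, so `ROW_NUMBER()` numbers only the filtered rows and restarts at 1 within each year
- The result lists the countries that reached full access in each year; the ranks no longer reflect a position among all countries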


In [61]:
query_row_number_100 = """
SELECT
    country_name,
    time_period,
    pct_managed_drinking_water_services,
    ROW_NUMBER() OVER (
        PARTITION BY time_period
        ORDER BY pct_managed_drinking_water_services ASC
    ) AS rank_of_water_services
FROM access_to_basic_services
WHERE pct_managed_drinking_water_services = 100;
"""

pd.read_sql_query(query_row_number_100, conn)


| | country_name | time_period | pct_managed_drinking_water_services | rank_of_water_services |
|---|---|---|---|---|
| 0 | S_113 | 2015 | 100 | 1 |
| 1 | S_117 | 2015 | 100 | 2 |
| 2 | Q_156 | 2015 | 100 | 3 |
| 3 | S_231 | 2015 | 100 | 4 |
| 4 | S_383 | 2015 | 100 | 5 |
| ... | ... | ... | ... | ... |
| 67 | S_771 | 2017 | 100 | 21 |
| 68 | C_866 | 2017 | 100 | 22 |
| 69 | S_870 | 2017 | 100 | 23 |
| 70 | S_884 | 2017 | 100 | 24 |


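Countries with the same access value (100%) still receive **different ranks**. This is expected behavior from `ROW_NUMBER()`, which always assigns **unique** numbers, even to tied values. Because the filter runs before the window function, the numbers are also computed over the 100% rows only: the last 2017 row is numbered 24 here, not 230 as in Task 1.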
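
If the goal is to keep each row's position among all countries while showing only the 100% rows, one common pattern is to rank inside a CTE and filter in the outer query. The sketch below is illustrative: `access_demo` and its rows are hypothetical, and the `country_name` tie-breaker is an added assumption to keep the output deterministic.

```python
import sqlite3

# Hypothetical toy table that mirrors the notebook's column names.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE access_demo ("
    "country_name TEXT, time_period INTEGER, "
    "pct_managed_drinking_water_services INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO access_demo VALUES (?, ?, ?)",
    [
        ("S_1", 2015, 80), ("C_2", 2015, 100), ("Q_3", 2015, 90),
        ("S_4", 2016, 100), ("C_5", 2016, 95),
    ],
)

query_rank_then_filter = """
WITH ranked AS (
    -- Rank every row first, so the numbers cover the whole partition.
    SELECT
        country_name,
        time_period,
        pct_managed_drinking_water_services,
        ROW_NUMBER() OVER (
            PARTITION BY time_period
            ORDER BY pct_managed_drinking_water_services ASC, country_name
        ) AS rank_of_water_services
    FROM access_demo
)
-- Filter afterwards: the ranks computed above are kept as-is.
SELECT *
FROM ranked
WHERE pct_managed_drinking_water_services = 100
ORDER BY time_period;
"""

rank_then_filter = demo_conn.execute(query_rank_then_filter).fetchall()
print(rank_then_filter)
# [('C_2', 2015, 100, 3), ('S_4', 2016, 100, 2)]
```

Here `C_2` keeps rank 3 of 3 for 2015 and `S_4` keeps rank 2 of 2 for 2016, whereas filtering in the same `SELECT` as the window function would number both rows 1.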


---

## Task 3: Use `RANK()` Instead to Assign Same Ranks to Tied Values

Use the `RANK()` window function to allow **ties in ranking** when countries have the same `pct_managed_drinking_water_services` value.

- Partition the data by `time_period`
- Order by `pct_managed_drinking_water_services`; the query below sorts `ASC`, so rank 1 goes to the lowest access level (switch to `DESC` to rank from highest to lowest)
- Countries with equal values will receive the **same rank**
- Gaps will appear in rank numbers after ties (e.g., 1, 2, 2, 4)


In [62]:
query_rank = """
SELECT
    country_name,
    time_period,
    pct_managed_drinking_water_services,
    RANK() OVER (
        PARTITION BY time_period
        ORDER BY pct_managed_drinking_water_services ASC
    ) AS rank_of_water_services
FROM access_to_basic_services;
"""

pd.read_sql_query(query_rank, conn)


| | country_name | time_period | pct_managed_drinking_water_services | rank_of_water_services |
|---|---|---|---|---|
| 0 | Q_16 | 2015 | 80 | 1 |
| 1 | C_30 | 2015 | 80 | 1 |
| 2 | S_33 | 2015 | 80 | 1 |
| 3 | C_52 | 2015 | 80 | 1 |
| 4 | S_123 | 2015 | 80 | 1 |
| ... | ... | ... | ... | ... |
| 707 | S_771 | 2017 | 100 | 207 |
| 708 | C_866 | 2017 | 100 | 207 |
| 709 | S_870 | 2017 | 100 | 207 |
| 710 | S_884 | 2017 | 100 | 207 |


Countries with the same access percentage now share the same rank.  
`RANK()` assigns one rank to all tied values and **skips the positions** those rows occupy, so the next distinct value resumes at its row position.  
In the 2017 output above, every row at 100% has rank 207.
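
For a top-N filter per group, the gaps left by `RANK()` can make "top 2" ambiguous, so `DENSE_RANK()` is often used instead. Below is a minimal sketch, assuming a hypothetical `access_demo` table with made-up rows and an arbitrary cutoff of 2. It ranks in a CTE and filters on the rank in the outer query, because a window function cannot be referenced in the `WHERE` clause of the same `SELECT`.

```python
import sqlite3

# Hypothetical toy table that mirrors the notebook's column names.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE access_demo ("
    "country_name TEXT, time_period INTEGER, "
    "pct_managed_drinking_water_services INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO access_demo VALUES (?, ?, ?)",
    [
        ("A", 2015, 100), ("B", 2015, 100), ("C", 2015, 95), ("D", 2015, 90),
        ("E", 2016, 90), ("F", 2016, 85), ("G", 2016, 80),
    ],
)

query_top_n = """
WITH ranked AS (
    SELECT
        country_name,
        time_period,
        pct_managed_drinking_water_services,
        DENSE_RANK() OVER (
            PARTITION BY time_period
            ORDER BY pct_managed_drinking_water_services DESC
        ) AS rank_of_water_services
    FROM access_demo
)
SELECT *
FROM ranked
WHERE rank_of_water_services <= 2   -- keep the two highest distinct levels per year
ORDER BY time_period, rank_of_water_services, country_name;
"""

top_two_per_year = demo_conn.execute(query_top_n).fetchall()
for row in top_two_per_year:
    print(row)
# ('A', 2015, 100, 1)
# ('B', 2015, 100, 1)
# ('C', 2015, 95, 2)
# ('E', 2016, 90, 1)
# ('F', 2016, 85, 2)
```

With `RANK()` in place of `DENSE_RANK()`, row `C` would receive rank 3 after the two-way tie and drop out of the result; which behavior is preferable depends on whether N counts rows or distinct values.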
