##  Part 4: Charting the Course for Maji Ndogo’s Water Future  
### Building a Foundation for Data-Driven Decisions

> *“Understanding the situation is one thing, but translating that understanding into informed decisions will truly make a difference.”*  
> — *Aziza, Maji Ndogo Project Lead*

As Maji Ndogo transitions into a new phase of reform, our goal shifts from uncovering data irregularities to **transforming data into actionable insights**.  
In this stage, we will assemble our data into meaningful views that connect **location**, **water source**, **visits**, and **pollution quality**, enabling decision-makers to visualize and plan effectively.



###  Objective

We’ll begin by **assembling data into a single, queryable table** that links:
- `location` → province and town information  
- `visits` → connects `location_id` and `source_id`, while including queue statistics  
- `water_source` → describes source type and number of people served  
- `well_pollution` → provides contamination data (for wells only)

Because `visits` holds one row per visit, we keep only each source’s first visit (`visit_count = 1`) so that repeat visits do not duplicate a source in the joined result.
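To see why repeat visits matter, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for the MySQL database. The rows are invented toy values; only the table and column names mirror the notebook.

```python
import sqlite3

# Toy data (hypothetical values): source A1 was visited three times, B1 once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE water_source (
    source_id TEXT PRIMARY KEY,
    type_of_water_source TEXT,
    number_of_people_served INTEGER
);
CREATE TABLE visits (
    record_id INTEGER PRIMARY KEY,
    source_id TEXT,
    visit_count INTEGER,
    time_in_queue INTEGER
);
INSERT INTO water_source VALUES ('A1', 'shared_tap', 1000), ('B1', 'well', 200);
INSERT INTO visits VALUES (1, 'A1', 1, 60), (2, 'A1', 2, 90), (3, 'A1', 3, 30), (4, 'B1', 1, 0);
""")

query = """
SELECT SUM(ws.number_of_people_served)
FROM visits AS v
JOIN water_source AS ws ON v.source_id = ws.source_id
{where}
"""

# Without the filter, A1's population is counted once per visit.
all_visits = conn.execute(query.format(where="")).fetchone()[0]
# With the filter, each source contributes exactly once.
first_visits = conn.execute(query.format(where="WHERE v.visit_count = 1")).fetchone()[0]

print(all_visits, first_visits)
```

In this toy case the unfiltered join triples the population of the source that was visited three times, which is the double-counting the `visit_count = 1` filter is meant to prevent.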



###  Step 1: Combine Location and Visits Data

We start by joining the `location` and `visits` tables on the `location_id`.  
This gives us a foundational dataset that maps each visit to its geographical context.



In [9]:
%%sql

-- Test query with a small sample to ensure connection works
SELECT 
    l.location_id,
    l.town_name,
    l.province_name,
    v.source_id,
    v.time_of_record,
    v.visit_count,
    v.time_in_queue
FROM location AS l
JOIN visits AS v
    ON l.location_id = v.location_id
WHERE v.visit_count = 1
ORDER BY l.province_name, l.town_name
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


location_id,town_name,province_name,source_id,time_of_record,visit_count,time_in_queue
AkHa00555,Harare,Akatsi,AkHa00555224,2021-01-06 16:19:00,1,0
AkHa00545,Harare,Akatsi,AkHa00545224,2021-01-06 12:06:00,1,0
AkHa00342,Harare,Akatsi,AkHa00342224,2021-01-05 12:24:00,1,0
AkHa00744,Harare,Akatsi,AkHa00744224,2021-01-06 11:14:00,1,0
AkHa00070,Harare,Akatsi,AkHa00070224,2021-01-01 11:42:00,1,0
AkHa00209,Harare,Akatsi,AkHa00209224,2021-01-05 11:52:00,1,0
AkHa00465,Harare,Akatsi,AkHa00465224,2021-01-01 16:28:00,1,0
AkHa00720,Harare,Akatsi,AkHa00720224,2021-01-07 13:17:00,1,0
AkHa00706,Harare,Akatsi,AkHa00706224,2021-01-01 16:10:00,1,0
AkHa00657,Harare,Akatsi,AkHa00657224,2021-01-06 11:22:00,1,0




## Joining `visits` and `water_source` Tables

To understand water access better, we next join the `visits` table with the `water_source` table. This combines source details, such as the type of source and the number of people served, with the visits data. The join is performed on the `source_id` column, which both tables share. To keep the output short, we limit the results to 10 rows as a sample.


In [10]:
%%sql

-- Join visits with water_source to get source details
SELECT 
    v.record_id,
    v.source_id,
    v.location_id,
    v.visit_count,
    v.time_in_queue,
    ws.type_of_water_source,
    ws.number_of_people_served
FROM visits AS v
JOIN water_source AS ws
    ON v.source_id = ws.source_id
WHERE v.visit_count = 1
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


record_id,source_id,location_id,visit_count,time_in_queue,type_of_water_source,number_of_people_served
0,SoIl32582224,SoIl32582,1,15,river,402
1,KiRu28935224,KiRu28935,1,0,well,252
2,HaRu19752224,HaRu19752,1,62,shared_tap,542
3,AkLu01628224,AkLu01628,1,0,well,210
4,AkRu03357224,AkRu03357,1,28,shared_tap,2598
5,KiRu29315224,KiRu29315,1,9,river,862
6,AkRu05234224,AkRu05234,1,0,tap_in_home_broken,496
7,KiRu28520224,KiRu28520,1,0,tap_in_home,562
8,HaZa21742224,HaZa21742,1,0,well,308
9,AmDa12214224,AmDa12214,1,0,tap_in_home,556



## Step: Join `water_source` with `visits` for a Specific Location

In this step, we join the `water_source` table to the `visits` table using `source_id`. We also filter the data for a specific location (`AkHa00103`) to examine all visits, including repeated visits (`visit_count > 1`). This helps us understand the type of water source, number of people served, and queue information at this location.


In [11]:
%%sql

-- Join visits with water_source and filter for a specific location
SELECT 
    v.record_id,
    v.source_id,
    v.location_id,
    v.visit_count,
    v.time_in_queue,
    ws.type_of_water_source,
    ws.number_of_people_served
FROM visits AS v
JOIN water_source AS ws
    ON v.source_id = ws.source_id
WHERE v.location_id = 'AkHa00103'
ORDER BY v.record_id
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
8 rows affected.


record_id,source_id,location_id,visit_count,time_in_queue,type_of_water_source,number_of_people_served
51473,AkHa00103224,AkHa00103,1,240,shared_tap,3340
51509,AkHa00103224,AkHa00103,2,186,shared_tap,3340
51580,AkHa00103224,AkHa00103,3,88,shared_tap,3340
51656,AkHa00103224,AkHa00103,4,237,shared_tap,3340
51836,AkHa00103224,AkHa00103,5,141,shared_tap,3340
51840,AkHa00103224,AkHa00103,6,183,shared_tap,3340
51874,AkHa00103224,AkHa00103,7,179,shared_tap,3340
51950,AkHa00103224,AkHa00103,8,387,shared_tap,3340




## Step: Refine Visits and Add Location Info

In this step, we:

1. Filter the visits to only include rows where `visit_count = 1` to avoid counting the same source more than once.
2. Remove the `location_id` and `visit_count` columns since they are no longer needed.
3. Add `location_type` from the `location` table and retain `time_in_queue` from `visits` for further analysis.

This ensures our dataset is clean and focused for aggregations and insights.


In [12]:
%%sql

-- Join visits with water_source and location, filter for single visits, and select relevant columns
SELECT 
    l.location_type,
    l.town_name,
    l.province_name,
    v.source_id,
    v.time_of_record,
    v.time_in_queue,
    ws.type_of_water_source,
    ws.number_of_people_served
FROM visits AS v
JOIN water_source AS ws
    ON v.source_id = ws.source_id
JOIN location AS l
    ON v.location_id = l.location_id
WHERE v.visit_count = 1
ORDER BY l.province_name, l.town_name
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


location_type,town_name,province_name,source_id,time_of_record,time_in_queue,type_of_water_source,number_of_people_served
Urban,Harare,Akatsi,AkHa00555224,2021-01-06 16:19:00,0,well,186
Urban,Harare,Akatsi,AkHa00545224,2021-01-06 12:06:00,0,tap_in_home_broken,732
Urban,Harare,Akatsi,AkHa00342224,2021-01-05 12:24:00,0,tap_in_home,446
Urban,Harare,Akatsi,AkHa00744224,2021-01-06 11:14:00,0,well,352
Urban,Harare,Akatsi,AkHa00070224,2021-01-01 11:42:00,0,well,390
Urban,Harare,Akatsi,AkHa00209224,2021-01-05 11:52:00,0,tap_in_home,584
Urban,Harare,Akatsi,AkHa00465224,2021-01-01 16:28:00,0,well,252
Urban,Harare,Akatsi,AkHa00720224,2021-01-07 13:17:00,0,well,312
Urban,Harare,Akatsi,AkHa00706224,2021-01-01 16:10:00,0,well,238
Urban,Harare,Akatsi,AkHa00657224,2021-01-06 11:22:00,0,tap_in_home,826




## Step: Combine All Relevant Data Including Well Pollution

In this step, we:

1. Include `well_pollution` data for wells using a **LEFT JOIN**, so that only well sources have pollution results, and other water sources show `NULL`.
2. Join `location` and `water_source` tables using **INNER JOIN** to bring in location and source details.
3. Filter visits to only include `visit_count = 1` to avoid duplicates.
4. Create a **VIEW** called `combined_analysis_table` to assemble all relevant information in one place for easy analysis.

This approach allows us to work with a single unified table without repeatedly performing complex joins.
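The effect of the `LEFT JOIN` can be checked on a small scale. The sketch below uses Python's `sqlite3` with invented rows (a hypothetical well `W1` and river `R1`) and a simplified view named `combined`; it is not the notebook's actual view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (source_id TEXT, visit_count INTEGER);
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT);
CREATE TABLE well_pollution (source_id TEXT, results TEXT);

-- Hypothetical rows: one well with a pollution record, one river without.
INSERT INTO visits VALUES ('W1', 1), ('R1', 1), ('R1', 2);
INSERT INTO water_source VALUES ('W1', 'well'), ('R1', 'river');
INSERT INTO well_pollution VALUES ('W1', 'Clean');

CREATE VIEW combined AS
SELECT ws.type_of_water_source AS source_type, wp.results
FROM visits AS v
LEFT JOIN well_pollution AS wp ON wp.source_id = v.source_id
INNER JOIN water_source AS ws ON ws.source_id = v.source_id
WHERE v.visit_count = 1;
""")

# LEFT JOIN keeps the river and fills its pollution result with NULL (None in Python).
left_rows = conn.execute(
    "SELECT source_type, results FROM combined ORDER BY source_type"
).fetchall()

# An INNER JOIN to well_pollution would silently drop every non-well source.
inner_count = conn.execute("""
SELECT COUNT(*)
FROM visits AS v
INNER JOIN well_pollution AS wp ON wp.source_id = v.source_id
WHERE v.visit_count = 1
""").fetchone()[0]

print(left_rows, inner_count)
```

The river survives in the view with an empty `results` value, whereas an inner join to `well_pollution` would have kept only the well.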


In [13]:
%%sql

-- Create a view that combines visits, water_source, location, and well_pollution
CREATE OR REPLACE VIEW combined_analysis_table AS
SELECT
    ws.type_of_water_source AS source_type,
    l.town_name,
    l.province_name,
    l.location_type,
    ws.number_of_people_served AS people_served,
    v.time_in_queue,
    wp.results
FROM visits AS v
LEFT JOIN well_pollution AS wp
    ON wp.source_id = v.source_id
INNER JOIN location AS l
    ON l.location_id = v.location_id
INNER JOIN water_source AS ws
    ON ws.source_id = v.source_id
WHERE v.visit_count = 1;


 * mysql+pymysql://root:***@localhost/md_water_services
0 rows affected.


[]

In [14]:
%%sql

-- Count the total number of rows in the view
SELECT COUNT(*) AS total_rows
FROM combined_analysis_table;


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_rows
39650


## Provincial Water Source Analysis

In this step, we are analyzing the distribution of water sources across provinces.  
We want to calculate **the percentage of people served by each type of water source** in each province.  

The steps:

1. Use a **CTE (`province_totals`)** to calculate the total population served per province.
2. Join this CTE with the `combined_analysis_table`.
3. Use **CASE statements** to calculate the percentage of people served for each type of water source.
4. Group by `province_name` to get province-level percentages.

This will help us understand where water access issues exist and guide repair or improvement efforts.
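The CTE-plus-`CASE` pattern can be tried on a tiny table first. This sketch uses Python's `sqlite3` and made-up provinces `P` and `Q`; the table `combined` is a hypothetical stand-in for `combined_analysis_table`. As a small variation, it also groups by the province total so that the query does not depend on how the engine treats non-aggregated columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE combined (province_name TEXT, source_type TEXT, people_served INTEGER);
-- Hypothetical populations, chosen so the percentages are easy to check by hand.
INSERT INTO combined VALUES
    ('P', 'river', 100), ('P', 'well', 300),
    ('Q', 'river', 50),  ('Q', 'well', 50);
""")

rows = conn.execute("""
WITH province_totals AS (
    SELECT province_name, SUM(people_served) AS total_ppl_serv
    FROM combined
    GROUP BY province_name
)
SELECT
    ct.province_name,
    -- Each CASE keeps only one source type; dividing by the total gives its share.
    ROUND(SUM(CASE WHEN ct.source_type = 'river' THEN ct.people_served ELSE 0 END)
          * 100.0 / pt.total_ppl_serv, 0) AS river,
    ROUND(SUM(CASE WHEN ct.source_type = 'well' THEN ct.people_served ELSE 0 END)
          * 100.0 / pt.total_ppl_serv, 0) AS well
FROM combined AS ct
JOIN province_totals AS pt ON ct.province_name = pt.province_name
GROUP BY ct.province_name, pt.total_ppl_serv
ORDER BY ct.province_name
""").fetchall()

for row in rows:
    print(row)
```

Each row's percentages should sum to roughly 100 (rounding can shift the total by a point or two), which is a quick sanity check to apply to the real output as well.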


In [16]:
%%sql

-- Step 1: Create CTE to calculate total people served per province
WITH province_totals AS (
    SELECT
        province_name,
        SUM(people_served) AS total_ppl_serv
    FROM combined_analysis_table
    GROUP BY province_name
)

-- Step 2: Calculate percentage of people served by each water source type per province
SELECT
    ct.province_name,
    -- Calculate percentage of population served by river sources
    ROUND((SUM(CASE WHEN source_type = 'river' THEN people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS river,
    -- Calculate percentage of population served by shared taps
    ROUND((SUM(CASE WHEN source_type = 'shared_tap' THEN people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS shared_tap,
    -- Calculate percentage of population served by taps in homes
    ROUND((SUM(CASE WHEN source_type = 'tap_in_home' THEN people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS tap_in_home,
    -- Calculate percentage of population served by broken taps in homes
    ROUND((SUM(CASE WHEN source_type = 'tap_in_home_broken' THEN people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS tap_in_home_broken,
    -- Calculate percentage of population served by wells
    ROUND((SUM(CASE WHEN source_type = 'well' THEN people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS well
FROM combined_analysis_table ct
JOIN province_totals pt
    ON ct.province_name = pt.province_name
GROUP BY ct.province_name
ORDER BY ct.province_name;


 * mysql+pymysql://root:***@localhost/md_water_services
5 rows affected.


province_name,river,shared_tap,tap_in_home,tap_in_home_broken,well
Akatsi,5,49,14,10,23
Amanzi,3,38,28,24,7
Hawassa,4,43,15,15,24
Kilimani,8,47,13,12,20
Sokoto,21,38,16,10,15


## Town-Level Water Source Analysis

In this analysis, we break down the population served by different water source types **by town**.  

### Key Considerations:
1. **Duplicate Town Names:** Some towns share the same name but belong to different provinces (e.g., "Harare" exists in both Akatsi and Kilimani).  
2. **Grouping Strategy:** To avoid combining duplicate towns, we **group by both `province_name` and `town_name`**.  
3. **Temporary Table:** Because this query is complex and can be slow on large datasets, we store the results in a **temporary table** called `town_aggregated_water_access`.  
   - This allows us to reuse the aggregated results in further analysis without re-running the entire query.  
   - Temporary tables are deleted when the database connection closes, so they need to be recreated in a new session.

The query performs the following steps:
1. Uses a **CTE `town_totals`** to calculate the total population served per town.  
2. Aggregates `people_served` per water source type and converts totals into **percentages**.  
3. Joins `town_totals` with `combined_analysis_table` using both `province_name` and `town_name` as keys.  
4. Groups by province and town to produce distinct rows for each town.
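The duplicate-name problem is easy to reproduce. The sketch below (Python's `sqlite3`, invented population figures, a hypothetical `combined` table) groups the same rows two ways and stores the correct version in a temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE combined (province_name TEXT, town_name TEXT, source_type TEXT, people_served INTEGER);
-- Two different towns that share the name Harare (population values are made up).
INSERT INTO combined VALUES
    ('Akatsi',   'Harare', 'well',  100),
    ('Akatsi',   'Harare', 'river', 100),
    ('Kilimani', 'Harare', 'river', 300);
""")

# Grouping by town name alone merges both towns into one row.
by_town_only = conn.execute(
    "SELECT town_name, SUM(people_served) FROM combined GROUP BY town_name"
).fetchall()

# Grouping by province and town keeps them apart; the result is stored in a
# temporary table that disappears when this connection closes.
conn.execute("""
CREATE TEMP TABLE town_totals AS
SELECT province_name, town_name, SUM(people_served) AS total_ppl_serv
FROM combined
GROUP BY province_name, town_name
""")
by_province_and_town = conn.execute(
    "SELECT province_name, town_name, total_ppl_serv FROM town_totals ORDER BY province_name"
).fetchall()

print(by_town_only)
print(by_province_and_town)
```

Once the temporary table exists, later queries in the same session can read from it instead of recomputing the aggregation.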


In [17]:
%%sql

-- Create a temporary table for town-level aggregated water access
CREATE TEMPORARY TABLE town_aggregated_water_access
WITH town_totals AS (
    -- CTE to calculate total population served per town
    SELECT 
        province_name, 
        town_name, 
        SUM(people_served) AS total_ppl_serv
    FROM combined_analysis_table
    GROUP BY province_name, town_name
)
SELECT
    ct.province_name,
    ct.town_name,
    -- Calculate percentages for each water source type
    ROUND((SUM(CASE WHEN source_type = 'river' THEN people_served ELSE 0 END) * 100.0 / tt.total_ppl_serv), 0) AS river,
    ROUND((SUM(CASE WHEN source_type = 'shared_tap' THEN people_served ELSE 0 END) * 100.0 / tt.total_ppl_serv), 0) AS shared_tap,
    ROUND((SUM(CASE WHEN source_type = 'tap_in_home' THEN people_served ELSE 0 END) * 100.0 / tt.total_ppl_serv), 0) AS tap_in_home,
    ROUND((SUM(CASE WHEN source_type = 'tap_in_home_broken' THEN people_served ELSE 0 END) * 100.0 / tt.total_ppl_serv), 0) AS tap_in_home_broken,
    ROUND((SUM(CASE WHEN source_type = 'well' THEN people_served ELSE 0 END) * 100.0 / tt.total_ppl_serv), 0) AS well
FROM combined_analysis_table ct
JOIN town_totals tt 
    ON ct.province_name = tt.province_name 
    AND ct.town_name = tt.town_name
GROUP BY ct.province_name, ct.town_name
ORDER BY ct.province_name, ct.town_name;


 * mysql+pymysql://root:***@localhost/md_water_services
31 rows affected.


[]

In [18]:
%%sql

SELECT * 
FROM town_aggregated_water_access
LIMIT 10;

 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


province_name,town_name,river,shared_tap,tap_in_home,tap_in_home_broken,well
Akatsi,Harare,2,17,28,27,27
Akatsi,Kintampo,2,15,31,26,26
Akatsi,Lusaka,2,17,28,28,26
Akatsi,Rural,6,59,9,5,22
Amanzi,Abidjan,2,53,22,19,4
Amanzi,Amina,8,24,3,56,9
Amanzi,Asmara,3,49,24,20,4
Amanzi,Bello,3,53,20,22,3
Amanzi,Dahabu,3,37,55,1,4
Amanzi,Pwani,3,53,20,21,4


## Project Progress Table Population

The goal is to populate the `Project_progress` table with water sources that need intervention. 

We filter the sources according to the following rules:

1. Only include visits where `visit_count = 1` (to avoid duplicate observations).
2. Include:
   - Rivers → Always included for improvement (drill wells).
   - Shared taps → Include only if `time_in_queue >= 30` minutes (install additional taps).
   - Wells → Only include contaminated wells (`results != 'Clean'`).
   - Tap_in_home_broken → Always included (diagnose infrastructure).

We also need to join the following tables:
- `location` → To get street address, town, and province.
- `visits` → To filter by `visit_count = 1`.
- `water_source` → To get the `source_id` and `type_of_water_source`.
- `well_pollution` → To check contamination of wells (LEFT JOIN since only wells have entries).

The final dataset will include columns:
- `source_id`, `Address`, `Town`, `Province`, `Source_type`, `Improvement` (calculated using CASE statements based on type), `Source_status` (default 'Backlog').
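Before writing the rules as SQL, it can help to spell them out as a plain function and test a few cases. The helper below, `recommend_improvement`, is a hypothetical name introduced only for illustration. Matching the contamination label by substring is an assumption made here so the sketch works whether the stored label is exactly `Chemical` or a longer string containing it.

```python
from typing import Optional


def recommend_improvement(
    source_type: str, time_in_queue: int, results: Optional[str]
) -> Optional[str]:
    """Return the improvement for one first-visit row, or None if the source is excluded."""
    if source_type == "river":
        return "Drill wells"
    if source_type == "shared_tap" and time_in_queue >= 30:
        # One tap per full 30 minutes of queueing (integer division mirrors FLOOR).
        return f"Install {time_in_queue // 30} taps nearby"
    if source_type == "tap_in_home_broken":
        return "Diagnose local infrastructure"
    if source_type == "well" and results is not None and results != "Clean":
        # Assumption: substring match on the contamination label.
        if "Chemical" in results:
            return "Install RO filter"
        if "Biological" in results:
            return "Install UV and RO filter"
        return "Check contamination"
    # Working in-home taps, short-queue shared taps, and clean wells need no action.
    return None


examples = [
    ("river", 9, None),
    ("shared_tap", 62, None),
    ("shared_tap", 29, None),
    ("well", 0, "Clean"),
    ("tap_in_home_broken", 0, None),
]
for example in examples:
    print(example, "->", recommend_improvement(*example))
```

Rows for which the function returns `None` correspond to sources that the `WHERE` clause leaves out of `Project_progress`.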


In [21]:
%%sql
CREATE TABLE Project_progress (
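    Project_id SERIAL PRIMARY KEY,
    -- Note: MySQL parses but ignores an inline REFERENCES clause on a column,
    -- so this line documents the link to water_source without enforcing it.
    source_id VARCHAR(20) NOT NULL REFERENCES water_source(source_id) ON DELETE CASCADE ON UPDATE CASCADE,
    Address VARCHAR(50),
    Town VARCHAR(30),
    Province VARCHAR(30),
    Source_type VARCHAR(50),
    Improvement VARCHAR(50),
    -- CHECK constraints are enforced only from MySQL 8.0.16 onward.
    Source_status VARCHAR(50) DEFAULT 'Backlog' CHECK (Source_status IN ('Backlog', 'In progress', 'Complete')),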
    Date_of_completion DATE,
    Comments TEXT
);


 * mysql+pymysql://root:***@localhost/md_water_services
0 rows affected.


[]

In [22]:
%%sql
SHOW TABLES LIKE 'Project_progress';


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


Tables_in_md_water_services (Project_progress)
project_progress


In [23]:
%%sql

INSERT INTO Project_progress (
    source_id,
    Address,
    Town,
    Province,
    Source_type,
    Improvement
)
SELECT
    ws.source_id,
    l.address,
    l.town_name,
    l.province_name,
    ws.type_of_water_source,
    CASE
        WHEN ws.type_of_water_source = 'river' THEN 'Drill wells'
        WHEN ws.type_of_water_source = 'shared_tap' AND v.time_in_queue >= 30 THEN CONCAT('Install ', FLOOR(v.time_in_queue / 30), ' taps nearby')
        WHEN ws.type_of_water_source = 'tap_in_home_broken' THEN 'Diagnose local infrastructure'
        WHEN ws.type_of_water_source = 'well' AND wp.results != 'Clean' THEN
            CASE
                WHEN wp.results = 'Chemical' THEN 'Install RO filter'
                WHEN wp.results = 'Biological' THEN 'Install UV and RO filter'
                ELSE 'Check contamination'
            END
    END AS Improvement
FROM
    water_source ws
LEFT JOIN well_pollution wp
    ON ws.source_id = wp.source_id
INNER JOIN visits v
    ON ws.source_id = v.source_id
INNER JOIN location l
    ON l.location_id = v.location_id
WHERE
    v.visit_count = 1
    AND (
        ws.type_of_water_source = 'river'
        OR ws.type_of_water_source = 'tap_in_home_broken'
        OR (ws.type_of_water_source = 'well' AND wp.results != 'Clean')
        OR (ws.type_of_water_source = 'shared_tap' AND v.time_in_queue >= 30)
    );


 * mysql+pymysql://root:***@localhost/md_water_services
25334 rows affected.


[]

### Step 1: Wells – Assign Improvements Based on Contamination

In this step, we focus on wells. Depending on the type of contamination, we assign specific improvement actions in the `Improvement` column:

- **Chemical contamination** → Install RO filter  
- **Biological contamination** → Install UV and RO filter  
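- **Any other non-clean result** → NULL (the label matched neither case and should be reviewed)

We join the `water_source`, `visits`, `location`, and `well_pollution` tables to get all the necessary data, and filter to only include first visits (`visit_count = 1`) and wells that are contaminated. Note that this cell inserts new rows rather than updating existing ones, so wells already added by the previous insert appear a second time.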
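An equality test such as `results = 'Chemical'` only matches if the stored label is exactly that string, so it is worth listing the distinct labels before relying on the `CASE`. The sketch below uses Python's `sqlite3`; the label strings are invented for illustration and may not match the real `well_pollution` values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE well_pollution (source_id TEXT, results TEXT);
-- Hypothetical labels: the contamination type is embedded in a longer string.
INSERT INTO well_pollution VALUES
    ('W1', 'Clean'),
    ('W2', 'Contaminated: Chemical'),
    ('W3', 'Contaminated: Biological'),
    ('W4', 'Contaminated: Chemical');
""")

# Step 1: inspect the labels that actually occur.
labels = [row[0] for row in conn.execute(
    "SELECT DISTINCT results FROM well_pollution ORDER BY results"
)]

# Step 2: count contaminated wells that an exact comparison would leave unmatched.
unmatched_exact = conn.execute("""
SELECT COUNT(*) FROM well_pollution
WHERE results != 'Clean' AND results NOT IN ('Chemical', 'Biological')
""").fetchone()[0]

# Step 3: the same count when matching with LIKE patterns instead.
unmatched_like = conn.execute("""
SELECT COUNT(*) FROM well_pollution
WHERE results != 'Clean'
  AND results NOT LIKE '%Chemical%'
  AND results NOT LIKE '%Biological%'
""").fetchone()[0]

print(labels, unmatched_exact, unmatched_like)
```

If the unmatched count is not zero on the real table, those rows fall through to the `ELSE` branch, and the comparison strings (or a `LIKE` pattern) need adjusting.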


In [24]:
%%sql

-- Step 1: Insert well rows into Project_progress, with improvements based on contamination type
INSERT INTO Project_progress (
    source_id,
    Address,
    Town,
    Province,
    Source_type,
    Improvement
)
SELECT
    ws.source_id,
    l.address,
    l.town_name,
    l.province_name,
    ws.type_of_water_source,
    CASE
        WHEN wp.results = 'Chemical' THEN 'Install RO filter'
        WHEN wp.results = 'Biological' THEN 'Install UV and RO filter'
        ELSE NULL
    END AS Improvement
FROM
    water_source ws
INNER JOIN visits v
    ON ws.source_id = v.source_id
INNER JOIN location l
    ON l.location_id = v.location_id
LEFT JOIN well_pollution wp
    ON ws.source_id = wp.source_id
WHERE
    v.visit_count = 1
    AND ws.type_of_water_source = 'well'
    AND wp.results != 'Clean';


 * mysql+pymysql://root:***@localhost/md_water_services
12403 rows affected.


[]

### View data from the `Project_progress` table

Now that we’ve inserted 12,403 new records into the `Project_progress` table, let’s view a few rows to confirm that the data has been inserted correctly.


In [25]:
%%sql
SELECT *
FROM Project_progress
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


Project_id,source_id,Address,Town,Province,Source_type,Improvement,Source_status,Date_of_completion,Comments
1,SoIl32582224,36 Pwani Mchangani Road,Ilanga,Sokoto,river,Drill wells,Backlog,,
2,KiRu28935224,129 Ziwa La Kioo Road,Rural,Kilimani,well,Check contamination,Backlog,,
3,HaRu19752224,18 Mlima Tazama Avenue,Rural,Hawassa,shared_tap,Install 2 taps nearby,Backlog,,
4,AkLu01628224,100 Mogadishu Road,Lusaka,Akatsi,well,Check contamination,Backlog,,
5,KiRu29315224,26 Bahari Ya Faraja Road,Rural,Kilimani,river,Drill wells,Backlog,,
6,AkRu05234224,104 Kenyatta Street,Rural,Akatsi,tap_in_home_broken,Diagnose local infrastructure,Backlog,,
7,HaZa21742224,117 Kampala Road,Zanzibar,Hawassa,well,Check contamination,Backlog,,
8,SoRu35008224,55 Fennec Way,Rural,Sokoto,shared_tap,Install 8 taps nearby,Backlog,,
9,SoRu35703224,52 Moroni Avenue,Rural,Sokoto,well,Check contamination,Backlog,,
10,AkHa00070224,51 Addis Ababa Road,Harare,Akatsi,well,Check contamination,Backlog,,


In [26]:
%%sql
SELECT COUNT(*) AS total_rows
FROM Project_progress;


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_rows
37737


###  Step 2: Rivers — Upgrading River Sources

In this step, we’ll focus on **river water sources**.  
Since river sources are generally unsafe for consumption, our plan is to **upgrade them by drilling new wells nearby**.

We’ll update the `Project_progress` table by setting the **Improvement** column to `"Drill well"` for all rows where the `Source_type` is `'river'`. This replaces the `'Drill wells'` label written by the initial insert.

This ensures all river-based sources are flagged for well-drilling interventions.

**Goal:**  
Add `"Drill well"` to the `Improvement` column for all river sources.

**Verification:**  
After the update, we’ll query the table to confirm that all river sources now have `"Drill well"` as their improvement action.


In [27]:
%%sql
-- Step 2: Update river sources with the appropriate improvement action
UPDATE Project_progress
SET Improvement = 'Drill well'
WHERE Source_type = 'river';


 * mysql+pymysql://root:***@localhost/md_water_services
3379 rows affected.


[]

In [28]:
%%sql
-- Verify the update
SELECT Source_type, Improvement
FROM Project_progress
WHERE Source_type = 'river'
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


Source_type,Improvement
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well
river,Drill well


In [30]:
%%sql

-- Optional: count total river sources updated
SELECT COUNT(*) AS total_river_sources
FROM Project_progress
WHERE Source_type = 'river' AND Improvement = 'Drill well';


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_river_sources
3379


###  Step 3: Shared Taps — Managing Long Queue Times

In this step, we’ll address **shared tap water sources** that experience long wait times.  
Our plan is to **install one additional tap for every 30 minutes of queue time**.

To achieve this, we’ll use a `CASE` statement with the following logic:

- For each shared tap (`type_of_water_source = 'shared_tap'`)
- If `time_in_queue` ≥ 30 minutes  
- Then calculate how many extra taps are needed using the formula:
  ```sql
  CONCAT('Install ', FLOOR(time_in_queue / 30), ' taps nearby')
  ```


In [31]:
%%sql

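-- Caveat: visits has one row per visit, so without v.visit_count = 1 a source
-- with repeat visits matches several rows and may take its queue time from any of them.
UPDATE Project_progress pp
JOIN visits v ON pp.source_id = v.source_id
SET pp.Improvement = CONCAT('Install ', FLOOR(v.time_in_queue / 30), ' taps nearby')
WHERE pp.Source_type = 'shared_tap'
  AND v.time_in_queue >= 30;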


 * mysql+pymysql://root:***@localhost/md_water_services
3696 rows affected.


[]

In [32]:
%%sql
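-- Verify that updates were applied (the join returns every visit, so time_in_queue can differ from the value behind the recommendation)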
SELECT Source_type, Improvement, time_in_queue
FROM Project_progress
JOIN visits USING (source_id)
WHERE Source_type = 'shared_tap'
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


Source_type,Improvement,time_in_queue
shared_tap,Install 2 taps nearby,62
shared_tap,Install 2 taps nearby,73
shared_tap,Install 2 taps nearby,39
shared_tap,Install 2 taps nearby,75
shared_tap,Install 2 taps nearby,28
shared_tap,Install 2 taps nearby,39
shared_tap,Install 2 taps nearby,28
shared_tap,Install 2 taps nearby,87
shared_tap,Install 8 taps nearby,240
shared_tap,Install 8 taps nearby,248


In [33]:
%%sql
-- Optional: Count total shared taps with improvement suggestions
SELECT COUNT(*) AS total_shared_taps_updated
FROM Project_progress
WHERE Source_type = 'shared_tap' AND Improvement LIKE 'Install%';


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_shared_taps_updated
3696


###  Step 4: In-Home Taps — Diagnosing Broken Infrastructure

In this final step, we’ll focus on **in-home taps** — specifically those that are **broken**.  
These represent **local infrastructure issues** that require engineering inspection and repair.

We’ll update our `Project_progress` table so that any record with  
`Source_type = 'tap_in_home_broken'` is marked with the improvement plan:

> **"Diagnose local infrastructure"**

This ensures our dataset flags all broken in-home taps for further investigation.

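After this step:
- All river, well, shared tap, and broken in-home tap sources should have improvement actions.
- The final dataset should contain **≈ 25,398 rows**, and **no NULL values** in the `Improvement` column. A much larger count points to rows inserted more than once, and remaining NULLs come from wells whose `results` value matched neither contamination case.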
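These expectations can be turned into three quick checks: total rows, NULL improvements, and sources that appear more than once. The sketch below runs them with Python's `sqlite3` on a four-row stand-in for `Project_progress`; the rows are invented, and one source is deliberately duplicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Project_progress (
    Project_id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT,
    Source_type TEXT,
    Improvement TEXT
);
-- Hypothetical rows: W1 was inserted twice, once without an improvement.
INSERT INTO Project_progress (source_id, Source_type, Improvement) VALUES
    ('R1', 'river', 'Drill well'),
    ('W1', 'well', 'Check contamination'),
    ('W1', 'well', NULL),
    ('T1', 'tap_in_home_broken', 'Diagnose local infrastructure');
""")

total_rows = conn.execute("SELECT COUNT(*) FROM Project_progress").fetchone()[0]

null_improvements = conn.execute(
    "SELECT COUNT(*) FROM Project_progress WHERE Improvement IS NULL"
).fetchone()[0]

# Any source_id listed here has been inserted more than once.
duplicated_sources = conn.execute("""
SELECT source_id, COUNT(*) AS copies
FROM Project_progress
GROUP BY source_id
HAVING COUNT(*) > 1
""").fetchall()

print(total_rows, null_improvements, duplicated_sources)
```

On a clean load, the duplicate query should return no rows and the NULL count should be zero.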


In [34]:
%%sql
-- Step 4: Update broken in-home taps with improvement recommendation
UPDATE Project_progress
SET Improvement = 'Diagnose local infrastructure'
WHERE Source_type = 'tap_in_home_broken';


 * mysql+pymysql://root:***@localhost/md_water_services
5856 rows affected.


[]

In [35]:
%%sql
-- Verify the update
SELECT Source_type, Improvement
FROM Project_progress
WHERE Source_type = 'tap_in_home_broken'
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


Source_type,Improvement
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure
tap_in_home_broken,Diagnose local infrastructure


In [36]:
%%sql
-- Optional: Confirm that there are no NULL improvements remaining
SELECT COUNT(*) AS null_improvements_remaining
FROM Project_progress
WHERE Improvement IS NULL;


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


null_improvements_remaining
12403


In [37]:
%%sql
-- Optional: Count total records to confirm final row count (should be around 25,398)
SELECT COUNT(*) AS total_records
FROM Project_progress;


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_records
37737


##  Step 5: Add the Data to `Project_progress`

Now that we’ve completed all improvement classifications — for wells, rivers, shared taps, and broken in-home taps —  
it’s time to populate our final table: **`Project_progress`**.

This table will serve as our **action plan for the engineering teams** and provide a **central view of the water source improvements** needed across Maji Ndogo.

###  Purpose of `Project_progress`
The table helps us:
- Track water sources that require attention.
- Communicate recommended interventions (like drilling wells, installing filters, or fixing taps).
- Provide data-driven insights for provincial repair and development planning.

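`Project_progress` already contains rows from the earlier inserts, so running the insert below appends duplicates. If that happens, or any other mistake occurs during insertion, we can reset the table by dropping it (this deletes every row) and re-running the `CREATE TABLE` cell:
```sql
DROP TABLE Project_progress;
```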


In [38]:
%%sql

--  Step 5: Insert all improvement recommendations into Project_progress
INSERT INTO Project_progress (
    source_id,
    Address,
    Town,
    Province,
    Source_type,
    Improvement
)
SELECT
    ws.source_id,
    l.address,
    l.town_name,
    l.province_name,
    ws.type_of_water_source,
    CASE
        WHEN ws.type_of_water_source = 'river' THEN 'Drill wells'
        WHEN ws.type_of_water_source = 'shared_tap' AND v.time_in_queue >= 30
            THEN CONCAT('Install ', FLOOR(v.time_in_queue / 30), ' taps nearby')
        WHEN ws.type_of_water_source = 'tap_in_home_broken' THEN 'Diagnose local infrastructure'
        WHEN ws.type_of_water_source = 'well' AND wp.results != 'Clean' THEN
            CASE
                WHEN wp.results = 'Chemical' THEN 'Install RO filter'
                WHEN wp.results = 'Biological' THEN 'Install UV and RO filter'
                ELSE 'Check contamination'
            END
    END AS Improvement
FROM
    water_source ws
LEFT JOIN
    well_pollution wp ON ws.source_id = wp.source_id
INNER JOIN
    visits v ON ws.source_id = v.source_id
INNER JOIN
    location l ON l.location_id = v.location_id
WHERE
    v.visit_count = 1
    AND (
        ws.type_of_water_source = 'river'
        OR ws.type_of_water_source = 'tap_in_home_broken'
        OR (ws.type_of_water_source = 'well' AND wp.results != 'Clean')
        OR (ws.type_of_water_source = 'shared_tap' AND v.time_in_queue >= 30)
    );


 * mysql+pymysql://root:***@localhost/md_water_services
25334 rows affected.


[]

In [39]:
%%sql
--  Verify the inserted data
SELECT *
FROM Project_progress
LIMIT 10;


 * mysql+pymysql://root:***@localhost/md_water_services
10 rows affected.


Project_id,source_id,Address,Town,Province,Source_type,Improvement,Source_status,Date_of_completion,Comments
1,SoIl32582224,36 Pwani Mchangani Road,Ilanga,Sokoto,river,Drill well,Backlog,,
2,KiRu28935224,129 Ziwa La Kioo Road,Rural,Kilimani,well,Check contamination,Backlog,,
3,HaRu19752224,18 Mlima Tazama Avenue,Rural,Hawassa,shared_tap,Install 2 taps nearby,Backlog,,
4,AkLu01628224,100 Mogadishu Road,Lusaka,Akatsi,well,Check contamination,Backlog,,
5,KiRu29315224,26 Bahari Ya Faraja Road,Rural,Kilimani,river,Drill well,Backlog,,
6,AkRu05234224,104 Kenyatta Street,Rural,Akatsi,tap_in_home_broken,Diagnose local infrastructure,Backlog,,
7,HaZa21742224,117 Kampala Road,Zanzibar,Hawassa,well,Check contamination,Backlog,,
8,SoRu35008224,55 Fennec Way,Rural,Sokoto,shared_tap,Install 8 taps nearby,Backlog,,
9,SoRu35703224,52 Moroni Avenue,Rural,Sokoto,well,Check contamination,Backlog,,
10,AkHa00070224,51 Addis Ababa Road,Harare,Akatsi,well,Check contamination,Backlog,,


In [40]:
%%sql
--  Check total number of records inserted (should be around 25,398)
SELECT COUNT(*) AS total_records
FROM Project_progress;


 * mysql+pymysql://root:***@localhost/md_water_services
1 rows affected.


total_records
63071


#  Final Summary Report — Maji Ndogo Water Access Project

###  Project Overview
This project analyzed clean water accessibility across **Maji Ndogo**, using SQL to identify and prioritize critical areas for infrastructure improvement.  
Through systematic exploration, joining, aggregation, and logical filtering, we uncovered **data-driven insights** that directly inform real-world decisions for water access improvement.



###  Key Steps in the Analysis

1. **Data Integration**  
   We joined multiple datasets — `visits`, `location`, `water_source`, and `well_pollution` — to create a unified, analytical view called **`combined_analysis_table`**.  
   This step consolidated 39,650 first-visit records into a single structure for easier querying and visualization.

2. **Filtering and Validation**  
   We excluded duplicate survey records by keeping only rows where `visit_count = 1`, ensuring data accuracy and avoiding double-counting of sources.

3. **Provincial and Town-Level Aggregation**  
   Using SQL **CTEs** and **CASE** statements, we created pivot-style summaries to calculate the percentage of people served by each water source type per **province** and **town**.  
   This allowed us to pinpoint regions with higher dependency on unsafe or unreliable water sources.

4. **Pollution and Infrastructure Analysis**  
   We assessed **well contamination** and **time-in-queue** metrics to determine where interventions were most urgent.  
   For example:
   - *Rivers* → Drill new wells nearby.  
   - *Wells (Chemical)* → Install RO filters.  
   - *Wells (Biological)* → Install UV + RO filters.  
   - *Shared Taps (30+ min queues)* → Install additional taps.  
   - *Broken In-home Taps* → Diagnose infrastructure.

5. **Action Table Creation — `Project_progress`**  
   Finally, we created and populated the **`Project_progress`** table — a blueprint for the engineering and field teams.  
   It provides precise improvement recommendations per source, location, and province.



###  Key Insights

- Over **25,000** water sources require intervention across Maji Ndogo.  
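- Unsafe sources are not evenly spread: in the provincial breakdown, Sokoto has the largest share of people relying on rivers (21%), and rural areas lean heavily on shared taps (59% in rural Akatsi).  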
- Queue times at **shared taps** indicate uneven water distribution, signaling areas needing capacity upgrades.  
- Data quality improvements (such as consistent location IDs) significantly boosted analytical reliability.



###  Tools and Skills Demonstrated
- SQL Joins (INNER, LEFT)  
- Aggregation & Pivot Logic using `CASE WHEN`  
- CTEs (Common Table Expressions)  
- Data Cleaning and Filtering  
- View and Temporary Table Creation  
- Logical and Conditional Expressions for Decision Support  


###  Real-World Impact
The **Maji Ndogo Project** is more than a data exercise — it’s an example of how **data analytics informs humanitarian action**.  
The findings guide **President Naledi’s** administration in prioritizing repairs, drilling, and filter installations across towns and provinces.  
Each recommendation directly contributes to **Sustainable Development Goal 6 — Clean Water and Sanitation**.



###  Next Steps
- Collaborate with the visualization team (led by **Dalila**) to turn these insights into interactive dashboards.  
- Track progress using the `Project_progress` table as a living data source.  
- Continue refining the database with feedback from field engineers to keep it accurate and actionable.



###  Closing Thoughts
Thank you for following through to the end of this project.  
This journey demonstrated how structured SQL analysis can transform **raw data into meaningful action**.  
Your persistence through JOINs, CTEs, and `CASE` logic has paid off: you now hold a clear roadmap to bring **clean water and prosperity** to Maji Ndogo.

**Pula: “rain”, a blessing and a symbol of renewal.**
