# Clustering Data to Unveil Maji Ndogo's Water Crisis

### Introduction

In the second part of the integrated project, we gear up for a deep analytical dive into Maji Ndogo's water scenario. We will leverage a wide range of SQL functions, including advanced window functions, to extract valuable insights from the data tables.

# Connecting to our MySQL database

In [2]:
# Load the SQL Extension

%load_ext sql

In [3]:
# Establish a connection to the local database 

%sql mysql+pymysql://root:1234567@localhost:3306/md_water_services

'Connected: root@md_water_services'

# Cleaning Data

Let’s take a look at the `employee` table, which contains details about all the staff at Maji Ndogo. However, it’s important to note that email addresses are missing. Since we need to send them reports and data, we will need to update this information. Fortunately, the email format for the department, as outlined in the project description, is straightforward: `first_name.last_name@ndogowater.gov`.

We can determine the email address for each employee by:
- selecting the `employee_name` column
- replacing the space with a full stop
- making it lowercase
- stitching it all together with the `@ndogowater.gov` domain
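
As a quick sanity check of these steps outside the database, the same transformation can be sketched in Python. The helper name `build_email` is hypothetical; the domain comes from the project description mentioned above.

```python
def build_email(employee_name: str, domain: str = "ndogowater.gov") -> str:
    """Mirror the steps above: replace spaces, lowercase, append the domain."""
    # "Amara Jengo" -> "amara.jengo"
    local_part = employee_name.replace(" ", ".").lower()
    return f"{local_part}@{domain}"


print(build_email("Amara Jengo"))  # amara.jengo@ndogowater.gov
```

This is only an illustration of the string logic; the actual update is done in SQL so that the change happens inside the database.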

We have to update the database again with these email addresses, so before we do, let's use a `SELECT` query to get the format right, then use `UPDATE` and `SET` to make the changes.

In [4]:
%%sql
# Constructing Email Addresses for Maji Ndogo Staff
SELECT
    CONCAT(
    LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov') AS new_email
FROM 
    employee
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


new_email
amara.jengo@ndogowater.gov
bello.azibo@ndogowater.gov
bakari.iniko@ndogowater.gov
malachi.mavuso@ndogowater.gov
cheche.buhle@ndogowater.gov


In [4]:
'''%%sql
# Update the Email Column
UPDATE employee
SET email = CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov');'''

 * mysql+pymysql://root:***@localhost:3306/md_water_services
56 rows affected.


[]

> **NOTE**: The query above ↑ was executed once and commented out because it is a **DML (Data Manipulation Language)** query. It should not be run again in case the notebook is restarted and all cells are executed.

Now that we’ve addressed that, let’s confirm that the previous query executed successfully.

In [5]:
%sql SELECT * FROM employee LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,amara.jengo@ndogowater.gov,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,bakari.iniko@ndogowater.gov,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,malachi.mavuso@ndogowater.gov,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,cheche.buhle@ndogowater.gov,1 Savanna Street,Akatsi,Rural,Field Surveyor


Awesome, now we have emails for all the employees updated in the database. Let's check the `phone_number` column. The phone numbers should be 12 characters long, consisting of the plus sign, area code (99), and the phone number digits. However, when we use the **LENGTH(column)** function as shown below, it returns 13 characters, indicating there's an extra character.

In [6]:
%%sql
SELECT
    LENGTH(phone_number)
FROM
    employee
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


LENGTH(phone_number)
13
13
13
13
13


That's because there is a space at the end of the number! If you try to send an automated SMS to that number it will fail. This is a common problem, and the solution is a function called `TRIM(column)`, which removes any leading or trailing spaces from a string. 

In [7]:
%%sql
# Trim the Leading and Trailing Whitespaces
SELECT LENGTH(TRIM(phone_number)) FROM employee LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


LENGTH(TRIM(phone_number))
12
12
12
12
12


In [8]:
'''%%sql
# Update the Records 
UPDATE employee
SET phone_number = TRIM(phone_number);'''

 * mysql+pymysql://root:***@localhost:3306/md_water_services
56 rows affected.


[]

> **NOTE**: The query above was executed once and then commented out, as it is a DML query and should not be re-run if the notebook is restarted and all cells are executed.

Now, let's verify if the query above was successful ↓

In [9]:
%%sql
# Check the stored values directly (no TRIM), so the result reflects the update
SELECT LENGTH(phone_number) FROM employee LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


LENGTH(phone_number)
12
12
12
12
12


# Honouring Employees

Before starting our analysis to identify employees deserving of recognition, let's first find how many employees reside in each town.

In [10]:
%%sql
# Number of Employees per Town
SELECT town_name, COUNT(employee_name) AS no_of_employees
FROM employee
GROUP BY town_name

 * mysql+pymysql://root:***@localhost:3306/md_water_services
9 rows affected.


town_name,no_of_employees
Ilanga,3
Rural,29
Lusaka,4
Zanzibar,4
Dahabu,6
Kintampo,1
Harare,5
Yaounde,1
Serowe,3


If the organization's leadership wanted to congratulate the top three field surveyors, we could use the database to identify them. First, we would query the `visits` table to count the number of visits each employee has made. Then, using their `assigned_employee_id`, we could retrieve the `employee_name`, `email`, and `phone_number` of the **three field surveyors with the most location visits**.

In [11]:
%%sql
# Retrieve Employees with Most Location Visits
SELECT assigned_employee_id, COUNT(assigned_employee_id) AS no_of_visits
FROM visits
GROUP BY assigned_employee_id
ORDER BY no_of_visits DESC
LIMIT 3;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
3 rows affected.


assigned_employee_id,no_of_visits
1,3708
30,3676
34,3539


The final step is to create a query that retrieves the details of the employees based on the employee IDs obtained from the previous query. 

In [12]:
%%sql
# Retrieve Employee Details
SELECT assigned_employee_id, employee_name, phone_number, email
FROM employee
WHERE assigned_employee_id IN (1, 30, 34);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
3 rows affected.


assigned_employee_id,employee_name,phone_number,email
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov
30,Pili Zola,99822478933,pili.zola@ndogowater.gov
34,Rudo Imani,99046972648,rudo.imani@ndogowater.gov


Awesome, now we have `employee_name`, `phone_number` and `email` columns of the top performers. 

# Analysing Locations

Looking at the location table, let’s focus on the `province_name`, `town_name` and `location_type` to gain insights into the distribution of water sources in Maji Ndogo. Let's count the records per `town_name` and then count by `province_name`.

In [5]:
%%sql
# Number of Records per Town
SELECT 
    town_name,
    COUNT(location_id) AS records_per_town
FROM location
GROUP BY town_name
ORDER BY records_per_town DESC
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


town_name,records_per_town
Rural,23740
Harare,1650
Amina,1090
Lusaka,1070
Mrembo,990
Asmara,930
Dahabu,930
Kintampo,780
Ilanga,780
Isiqalo,770


In [14]:
%%sql
# Records per Province
SELECT
    province_name,
    COUNT(location_id) AS records_per_province
FROM location
GROUP BY province_name
ORDER BY records_per_province DESC;    

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


province_name,records_per_province
Kilimani,9510
Akatsi,8940
Sokoto,8220
Amanzi,6950
Hawassa,6030


From this table, it's pretty clear that most of the water sources in the survey are situated in small rural communities, scattered across Maji Ndogo. If we count the records for each province, most of them have a similar number of sources, so every province is well-represented in the survey. Let's create a table that shows the records of each town ensuring our data is grouped by both `province_name` and `town_name`.

In [6]:
%%sql
# Records per Province and Town
SELECT
    province_name,
    town_name,
    COUNT(location_id) AS records_per_town
FROM location
GROUP BY province_name, town_name
ORDER BY province_name, records_per_town DESC
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


province_name,town_name,records_per_town
Akatsi,Rural,6290
Akatsi,Lusaka,1070
Akatsi,Harare,800
Akatsi,Kintampo,780
Amanzi,Rural,3100
Amanzi,Asmara,930
Amanzi,Dahabu,930
Amanzi,Amina,670
Amanzi,Pwani,520
Amanzi,Abidjan,400


These results show us that Maji Ndogo's field surveyors did an excellent job of documenting the status of our country's water crisis. Every province and town has many documented sources. This gives us confidence that the data we have is reliable enough to base our decisions on. This is an insight we can use to communicate data integrity, so let's make a note of that. Finally, let's look at the number of records for each location type.

In [16]:
%%sql
# Records of Each Location Type
SELECT
    COUNT(location_type) AS records_per_type,
    location_type
FROM location
GROUP BY location_type;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
2 rows affected.


records_per_type,location_type
15910,Urban
23740,Rural


In [17]:
%%sql
SELECT 
    ROUND((23740 / (15910 + 23740) * 100)) AS rural_percentage, 
    ROUND((15910 / (15910 + 23740) * 100)) AS urban_percentage;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


rural_percentage,urban_percentage
60,40


So again, what are some of the insights we gained from the location table?

1. Our entire country was properly canvassed, and our dataset represents the situation on the ground.
2. 60% of our water sources are in rural communities across Maji Ndogo. We need to keep this in mind when we make decisions.

# Diving Into Sources

Ok, `water_source` is a big table, with lots of stories to tell! We have access to different water source types and the number of people using each source. These are the questions that we are curious about.

1. How many people did we survey in total?
2. How many wells, taps and rivers are there?
3. How many people share particular types of water sources on average?
4. How many people are getting water from each type of source?

Let's begin with the first question. How many people did we survey in total?

In [31]:
%%sql
# Total Number of People Served
SELECT 
    SUM(number_of_people_served) AS total_number_of_people_served
FROM water_source;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


total_number_of_people_served
27628140


Now, let's address the second question: How many wells, taps and rivers are there?

In [20]:
%%sql
# Total Number of Types of Water Sources
SELECT
    type_of_water_source,
    COUNT(type_of_water_source) AS number_of_sources
FROM water_source
GROUP BY type_of_water_source
ORDER BY number_of_sources DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source,number_of_sources
well,17383
tap_in_home,7265
tap_in_home_broken,5856
shared_tap,5767
river,3379


We can see that wells are the most common type of water source in Maji Ndogo by count. Next, let's explore another question in our dataset: What is the average number of people served by each water source?

In [21]:
%%sql
# Number of People That Share Water Sources on Average
SELECT
    type_of_water_source,
    ROUND(AVG(number_of_people_served)) AS avg_people_per_source
FROM water_source
GROUP BY type_of_water_source;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source,avg_people_per_source
tap_in_home,644
tap_in_home_broken,649
well,279
shared_tap,2071
river,699


These results tell us that 644 people share a `tap_in_home` on average. Does that make sense? No, it doesn't. Remember the important things that apply to `tap_in_home` and `tap_in_home_broken`: the surveyors combined the data of many households and recorded it as a single tap record, but each household actually has its own tap. In addition, an average of 6 people live in each home, so 6 people actually share 1 tap (not 644). We can also see that `shared_tap` has the highest average number of people served per source. It is always important to think about the data. Early in our careers we tend to just analyse and calculate, but the value we bring as data practitioners lies in understanding and interpreting what the results mean. Imagine we were presenting this to the President and all of the Ministers, and one of them asked: "Why does it say that 644 people share a home tap?" and we had no answer.

This means that 1 `tap_in_home` record actually represents 644 ÷ 6 ≈ 107 taps. Calculating the average number of people served by a single instance of each water source type helps us understand the typical capacity or load on a single water source. This can help us decide which sources should be repaired or upgraded, based on the average impact of each upgrade. For example, wells don't seem to be a problem, as fewer people share them. On the other hand, about 2000 people share a single public tap on average! We saw some of the queue times last time, and now we can see why. So, looking at these results, we should probably focus on improving shared taps first.
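
The household arithmetic can be checked with a few lines of Python. This is a minimal sketch that reuses the averages from the query result above and the 6-people-per-home figure stated in the text; the variable names are illustrative only.

```python
# Average people per record, taken from the query result above.
avg_people_per_record = {"tap_in_home": 644, "tap_in_home_broken": 649}

# Figure stated in the text: about 6 people live in each home.
PEOPLE_PER_HOME = 6

# Estimated number of household taps bundled into one survey record.
taps_per_record = {
    source: round(people / PEOPLE_PER_HOME)
    for source, people in avg_people_per_record.items()
}

print(taps_per_record)  # {'tap_in_home': 107, 'tap_in_home_broken': 108}
```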

Now let's calculate the total number of people served by each type of water source. To make the result easier to interpret, we order it so that the source type serving the most people is at the top.

In [22]:
%%sql
# Total People Served by Each Water Source
SELECT
    type_of_water_source,
    SUM(number_of_people_served) AS population_served
FROM water_source
GROUP BY type_of_water_source
ORDER BY population_served DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source,population_served
shared_tap,11945272
well,4841724
tap_in_home,4678880
tap_in_home_broken,3799720
river,2362544


It's a little hard to comprehend these numbers, but you can see that one of these is dominating. To make it a bit simpler to interpret, let's use percentages.

In [23]:
%%sql
# Total Percentage of People Served by Each Water Source
SELECT
    type_of_water_source,
    ROUND(SUM(number_of_people_served) / (SELECT SUM(number_of_people_served) FROM water_source) * 100) AS percentage_people_per_source         
FROM water_source
GROUP BY type_of_water_source
ORDER BY percentage_people_per_source DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source,percentage_people_per_source
shared_tap,43
well,18
tap_in_home,17
tap_in_home_broken,14
river,9


18% of people are using wells, but only 4916 out of 17383 wells (28%) are clean. 43% of our people are using shared taps in their communities, and, as we saw earlier, about 2000 people share one `shared_tap` on average. Adding `tap_in_home` and `tap_in_home_broken` together, we see that 31% of people have water infrastructure installed in their homes, but 45% (14/31) of these taps are not working! It isn't the taps themselves that are broken, but rather the infrastructure that serves these homes, such as treatment plants, reservoirs, pipes, and pumps.
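
The percentages quoted above can be reproduced from the totals in the earlier query results. The sketch below recomputes them in Python; the clean-well counts are the figures quoted in the text, and the variable names are illustrative.

```python
# Totals copied from the "Total People Served by Each Water Source" result above.
people_served = {
    "shared_tap": 11_945_272,
    "well": 4_841_724,
    "tap_in_home": 4_678_880,
    "tap_in_home_broken": 3_799_720,
    "river": 2_362_544,
}
total = sum(people_served.values())  # 27,628,140, matching the survey total

# Share of people with any in-home tap infrastructure (working or broken).
home_taps = people_served["tap_in_home"] + people_served["tap_in_home_broken"]
home_tap_share = round(home_taps / total * 100)

# Share of those in-home taps that are not working.
broken_share = round(people_served["tap_in_home_broken"] / home_taps * 100)

# Clean wells, using the counts quoted in the text.
clean_well_share = round(4916 / 17383 * 100)

print(home_tap_share, broken_share, clean_well_share)  # 31 45 28
```

Working from the raw totals gives the same 45% as the rounded 14/31 shortcut.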

# Start of a Solution

At some point, we will have to fix or improve all of the infrastructure, so we should start thinking about how to make a data-driven decision about where to begin. A simple approach is to fix the things that affect the most people first. So let's write a query that ranks each type of source by the total number of people who use it. The word "rank" is a hint that we will need a window function, `RANK()`, so let's think through the problem.

In [24]:
%%sql
# Rank Type of Sources Based on Total Usage
SELECT
    type_of_water_source,
    SUM(number_of_people_served) AS people_served,
    RANK() OVER(ORDER BY SUM(number_of_people_served) DESC) AS rank_by_population
FROM water_source
GROUP BY type_of_water_source;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source,people_served,rank_by_population
shared_tap,11945272,1
well,4841724,2
tap_in_home,4678880,3
tap_in_home_broken,3799720,4
river,2362544,5


Ok, so we should fix shared taps first, then wells, and so on. But the next question is, which shared taps or wells should be fixed first? We can use
the same logic; the most used sources should really be fixed first.

In [7]:
%%sql
# Rank Water Sources Based on Population Served Per Source Type
SELECT 
    source_id, 
    type_of_water_source, 
    number_of_people_served, 
    RANK() OVER(PARTITION BY type_of_water_source ORDER BY number_of_people_served DESC) AS priority_rank
FROM water_source
WHERE type_of_water_source != 'tap_in_home'
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


source_id,type_of_water_source,number_of_people_served,priority_rank
SoRu34798224,river,998,1
SoRu35837224,river,998,1
SoRu36238224,river,998,1
SoRu36791224,river,998,1
SoRu36880224,river,998,1
SoRu38142224,river,998,1
SoRu37756224,river,998,1
SoMa33775224,river,998,1
KiRu30353224,river,998,1
SoIl32972224,river,998,1


By using `RANK()`, teams doing the repairs can use the rank value to measure how many sources they have fixed. Tied sources share a rank, and the next rank skips ahead by the number of ties. What would be the benefit of using `DENSE_RANK()` instead? It leaves no gaps after ties, so maybe it is easier to explain to the engineers, or the priority feels a bit more natural.

What about `ROW_NUMBER()`? It gives each source a unique number, so teams don't have to think about whether they should repair AkRu05603224 or AkRu04862224 first (both serve 3998 people). Keep in mind that `ROW_NUMBER()` orders tied records arbitrarily unless the `ORDER BY` includes a tie-breaker.
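
To make the difference between the three functions concrete, here is a small sketch that runs them side by side. It uses Python's built-in `sqlite3` module with an in-memory database instead of the MySQL server, purely so the example is self-contained (window functions need SQLite 3.25 or newer). The rows and the `src_*` IDs are made up for illustration.

```python
import sqlite3

# Hypothetical rows for illustration only; two sources are deliberately tied.
rows = [
    ("src_a", "shared_tap", 3998),
    ("src_b", "shared_tap", 3998),
    ("src_c", "shared_tap", 3500),
    ("src_d", "shared_tap", 3000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE water_source ("
    "source_id TEXT, type_of_water_source TEXT, number_of_people_served INTEGER)"
)
conn.executemany("INSERT INTO water_source VALUES (?, ?, ?)", rows)

query = """
SELECT
    source_id,
    number_of_people_served,
    RANK() OVER (
        PARTITION BY type_of_water_source
        ORDER BY number_of_people_served DESC
    ) AS rank_value,
    DENSE_RANK() OVER (
        PARTITION BY type_of_water_source
        ORDER BY number_of_people_served DESC
    ) AS dense_rank_value,
    -- source_id is added as a tie-breaker so the numbering is repeatable
    ROW_NUMBER() OVER (
        PARTITION BY type_of_water_source
        ORDER BY number_of_people_served DESC, source_id
    ) AS row_number_value
FROM water_source
ORDER BY row_number_value;
"""

results = conn.execute(query).fetchall()
conn.close()

for row in results:
    print(row)
# ('src_a', 3998, 1, 1, 1)
# ('src_b', 3998, 1, 1, 2)
# ('src_c', 3500, 3, 2, 3)
# ('src_d', 3000, 4, 3, 4)
```

The tied sources share rank 1 under both `RANK()` and `DENSE_RANK()`, but the next source is ranked 3 by `RANK()` and 2 by `DENSE_RANK()`, while `ROW_NUMBER()` simply counts 1 to 4.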

Imagine yourself in an engineer's boots, and try to interpret the priority list. Thinking about the users of a table helps us to design the table better. In that line of thought, would it make sense to give them a list of `source_ids`? How would they know where to go?

# Analysing Queues

Recall the `visits` table documented all of the visits the field surveyors made to each location. For most sources, one visit was enough, but if there were queues, they visited the location a couple of times to get a good idea of the time it took for people to queue for water. So we have the time that they collected the data, how many times the site was visited, and how long people had to queue for water.

Remember we can use some `DateTime` functions here to get some deeper insight into the water queueing situation in Maji Ndogo, like which day of the week it was, and what time.

Here are some key areas I think are worth exploring:

1. How long did the survey take?
2. What is the average total queue time for water?
3. What is the average queue time on different days?
4. How can we communicate this information efficiently?

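To answer the first question, we can use either `DATEDIFF` or `TIMESTAMPDIFF`. Note that the argument order differs between the two: `DATEDIFF(end, start)` versus `TIMESTAMPDIFF(unit, start, end)`. For this data, both return the same result.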
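
The two functions agree here, but they are not identical in general: in MySQL, `DATEDIFF` compares only the date parts, while `TIMESTAMPDIFF(DAY, ...)` counts complete days between the two timestamps. The Python sketch below mimics that difference with made-up timestamps; it is an analogy, not a call into MySQL.

```python
from datetime import datetime

# Hypothetical timestamps: 26 hours apart, but two calendar dates apart.
first_record = datetime(2021, 1, 1, 23, 0)
last_record = datetime(2021, 1, 3, 1, 0)

# Like DATEDIFF(last, first): compare the calendar dates only.
calendar_days = (last_record.date() - first_record.date()).days

# Like TIMESTAMPDIFF(DAY, first, last): count complete 24-hour periods.
complete_days = (last_record - first_record).days

print(calendar_days, complete_days)  # 2 1
```

Over a span of several hundred days, a difference of at most one day rarely matters, but it is worth knowing which definition a report uses.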

In [26]:
%%sql
# Survey Duration
SELECT
    DATEDIFF(MAX(time_of_record), MIN(time_of_record)) AS "survey_duration (in days)"
    # TIMESTAMPDIFF(day, MIN(time_of_record), MAX(time_of_record)) AS "survey_duration (in days)"
FROM visits;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


survey_duration (in days)
924


The survey took **924** days, or around two and a half years. Let's shift our focus to the next question and see how long people have to queue on average in Maji Ndogo. Keep in mind that many sources, like `tap_in_home`, have no queues. These are recorded as 0 in the `time_in_queue` column, so when we calculate averages, we need to exclude those rows.
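
To see why the zeros matter, the sketch below averages a handful of made-up queue times with and without them. It uses an in-memory SQLite database through Python's `sqlite3` module as a stand-in for MySQL; `NULLIF` and `AVG` handle NULLs the same way for this purpose.

```python
import sqlite3

# Hypothetical queue times in minutes; the zeros stand for sources with no queue.
queue_times = [0, 0, 60, 120, 180]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (time_in_queue INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?)", [(t,) for t in queue_times])

# NULLIF(x, 0) turns 0 into NULL, and AVG skips NULL values.
avg_with_zeros, avg_without_zeros = conn.execute(
    "SELECT AVG(time_in_queue), AVG(NULLIF(time_in_queue, 0)) FROM visits"
).fetchone()
conn.close()

print(avg_with_zeros, avg_without_zeros)  # 72.0 120.0
```

Including the zeros would pull the average down and understate how long people actually wait when there is a queue.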

In [27]:
%%sql
# Average Queue Time
SELECT
    ROUND(AVG(NULLIF(time_in_queue, 0))) AS avg_queue_time
FROM visits;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


avg_queue_time
123


The average queue time is around 123 minutes, meaning that people spend about two hours fetching water if they don’t have taps in their homes. While that may seem manageable, it’s worth noting that demand could vary throughout the week, with more people likely to collect water on certain days due to personal schedules or resource availability. Let’s dive deeper and explore the third question by examining how queue times vary across different days of the week.

`DAY()` gives you the day of the month. If we want to aggregate data for each day of the week, we need another DateTime function, `DAYNAME(column)`. As the name suggests, it returns the day of the week as a string (Monday, Tuesday, and so on) based on the timestamp in the `time_of_record` column. By calculating the average queue time and grouping it by day of the week, we can better understand patterns in queue times across the week.
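
The distinction between the day of the month and the day of the week can be illustrated with Python's `datetime`, which offers close analogues of `DAY()` and `DAYNAME()`. The timestamp below is made up for illustration.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical timestamp; 1 January 2021 fell on a Friday.
time_of_record = datetime(2021, 1, 1, 9, 10)

day_of_month = time_of_record.day                # analogous to DAY()
day_of_week = WEEKDAYS[time_of_record.weekday()]  # analogous to DAYNAME()

print(day_of_month, day_of_week)  # 1 Friday
```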

In [28]:
%%sql
# Average Queue Times per Day of the Week
SELECT  
    DAYNAME(time_of_record) AS day_of_week,
    ROUND(AVG(NULLIF(time_in_queue, 0))) AS avg_queue_time
FROM visits
GROUP BY DAYNAME(time_of_record);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
7 rows affected.


day_of_week,avg_queue_time
Friday,120
Saturday,246
Sunday,82
Monday,137
Tuesday,108
Wednesday,97
Thursday,105


It turns out that Saturdays have significantly longer queue times than other days. Fascinating! Now let's go deeper and look at what time of day people collect water.

In [29]:
%%sql
# Aggregate the Average Queue Time per Time of Day
SELECT
    TIME_FORMAT(TIME(time_of_record), "%H:00") AS hour_of_day,
    ROUND(AVG(NULLIF(time_in_queue, 0))) AS avg_queue_time
FROM visits
GROUP BY hour_of_day
ORDER BY hour_of_day ASC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
14 rows affected.


hour_of_day,avg_queue_time
06:00,149
07:00,149
08:00,149
09:00,118
10:00,114
11:00,111
12:00,112
13:00,115
14:00,114
15:00,114


We can deduce from the query result above that mornings and evenings are the busiest. What could this mean? Are people collecting water before and after work? Wouldn't it be nice to break down the queue times for each hour of each day? In a spreadsheet, we could easily accomplish this with a pivot table.

While pivot tables are incredibly useful for interpreting results, they're not commonly used in SQL, and there are no built-in functions to create them directly. However, when a dataset is too massive for a spreadsheet, building the pivot table in SQL is the only option.

For rows, we will use the hour of the day in that nice format, and then make each column a different day! To filter rows we use `WHERE`, but a `CASE` expression in `SELECT` lets us choose which values land in each column. We can use one `CASE` expression per day to separate the queue time column into a column for each day. Let's begin by focusing only on Sunday. When a row's `DAYNAME(time_of_record)` is Sunday, we return its `time_in_queue`, and NULL for any other day. Because `AVG()` ignores NULLs, each column averages only its own day.

In [30]:
%%sql
# Create a Pivot Table of Avg Queue Times for Each Hour in Each Day!
SELECT
    TIME_FORMAT(TIME(time_of_record), "%H:00") AS hour_of_day,
    # Sunday
    ROUND(AVG(
        CASE 
            WHEN DAYNAME(time_of_record) = "Sunday"
            THEN time_in_queue ELSE null
        END), 0) AS Sunday,
    
    # Monday
    ROUND(AVG(
        CASE 
            WHEN DAYNAME(time_of_record) = "Monday" 
            THEN time_in_queue ELSE null
        END), 0) AS Monday,

    # Tuesday
    ROUND(AVG(
        CASE 
            WHEN DAYNAME(time_of_record) = "Tuesday" 
            THEN time_in_queue ELSE null
        END), 0) AS Tuesday,

    # Wednesday
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = "Wednesday" 
            THEN time_in_queue ELSE null
        END), 0) AS Wednesday,

    # Thursday
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = "Thursday" 
            THEN time_in_queue ELSE null
        END), 0) AS Thursday,
    
    # Friday
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = "Friday" 
            THEN time_in_queue ELSE null
        END), 0) AS Friday,

    # Saturday
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = "Saturday" 
            THEN time_in_queue ELSE null
        END), 0) AS Saturday
FROM md_water_services.visits
WHERE time_in_queue != 0
GROUP BY hour_of_day
ORDER BY hour_of_day;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
14 rows affected.


hour_of_day,Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday
06:00,79,190,134,112,134,153,247
07:00,82,186,128,111,139,156,247
08:00,86,183,130,119,129,153,247
09:00,84,127,105,94,99,107,252
10:00,83,119,99,89,95,112,259
11:00,78,115,102,86,99,104,236
12:00,78,115,97,88,96,109,239
13:00,81,122,97,98,101,115,242
14:00,83,127,104,92,96,110,244
15:00,83,126,104,88,92,110,248


Perfect, now we can compare the queue times for each day, hour by hour! By closely examining the custom-built SQL pivot table, several patterns begin to emerge:

1. Queues are very long on a Monday morning and Monday evening as people rush to get water.
2. Wednesday has the lowest queue times (on weekdays), but long queues on Wednesday evening.
3. People have to queue pretty much twice as long on Saturdays compared to the weekdays. It looks like people spend their Saturdays queueing for water, perhaps for the week's supply?
4. The shortest queues are on Sundays, and this is probably a cultural thing. The people of Maji Ndogo may prioritise family and religion, so Sundays are spent with family and friends.

SQL is a set of tools we can apply. By understanding the `CASE` expression, we could build a complex query that aggregates our data in a format that is very easy to understand. We can even visualize the pivot table like so ↓.

![Queue Times](assets/queue_time.png)

# Water Accessibility and Infrastructure Summary Report

This survey aimed to identify the water sources people use and determine both the total and average number of users for each source. Additionally, it examined the duration citizens typically spend in queues to access water. So let's create a short summary report we can send off to relevant stakeholders:

### Insights

1. Most water sources are rural.
2. **43%** of our people are using shared taps. On average, about **2000** people share one tap.
3. **31%** of our population has water infrastructure in their homes, but within that group, **45%** face non-functional systems due to issues with pipes, pumps, and reservoirs.
4. **18%** of our people are using wells, but only **28%** of those wells are clean.
5. Our citizens often face long wait times for water, averaging more than **120 minutes**.
6. In terms of queues:

- Queues are very long on Saturdays.
- Queues are longer in the mornings and evenings.
- Wednesdays and Sundays have the shortest queues.

# Strategy for Solving the Water Crisis in Maji Ndogo

If we consider a strategy to begin addressing Maji Ndogo's water crisis, it might look something like this:

1. We want to focus our efforts on improving the water sources that affect the most people.
- Most people will benefit if we improve the shared taps first.
- Wells are a good source of water, but many are contaminated. Fixing this will benefit a lot of people.
- Fixing existing infrastructure will help many people. If they have running water again, they won't have to queue, thereby shortening queue times for others. So we can solve two problems at once.
- Installing taps in homes will stretch our resources too thin, so for now, if the queue times are low, we won't improve that source.
2. Most water sources are in rural areas. We need to ensure our teams know this as this means they will have to make these repairs/upgrades in rural areas where road conditions, supplies, and labour are harder challenges to overcome.

# Recommended Practical Solutions

1. If communities are using **rivers**, we can dispatch trucks to those regions to provide water temporarily in the short term, while we send out crews to drill for wells, providing a more permanent solution.

2. If communities are using **wells**, we can install filters to purify the water. For wells with **biological** contamination, we can **install UV filters** that kill microorganisms, and for *polluted wells*, we can **install reverse osmosis filters**. In the long term, we need to figure out why these sources are polluted.

3. For **shared taps**, in the short term, we can send additional water tankers to the busiest taps, on the busiest days. We can use the queue time pivot table we made to send tankers at the busiest times. Meanwhile, we can start the work on **installing extra taps** where they are needed. According to UN standards, the maximum acceptable wait time for water is 30 minutes. With this in mind, our aim is to **install taps** to get **queue times below 30 min**.

4. **Shared taps** with **short queue times** (< 30 min) represent a logistical challenge to further reduce waiting times. The most effective solution, installing taps in homes, is resource-intensive and better suited as a long-term goal.

5. **Addressing broken infrastructure** offers a significant impact even with just a single intervention. It is expensive to fix, but so **many people** can **benefit** from repairing one facility. For example, fixing a reservoir or pipe that multiple taps are connected to. We will have to find the commonly affected areas though to see where the problem actually is.
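
For the tanker scheduling mentioned in the shared-taps recommendation, the queue time pivot table can be scanned for the busiest slots. The sketch below does this in Python using only the pivot rows displayed earlier in the notebook output (06:00 to 15:00); the variable names and the choice of a top-three cut-off are illustrative assumptions.

```python
# Average queue times (minutes) copied from the pivot-table rows shown earlier.
days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
pivot = {
    "06:00": [79, 190, 134, 112, 134, 153, 247],
    "07:00": [82, 186, 128, 111, 139, 156, 247],
    "08:00": [86, 183, 130, 119, 129, 153, 247],
    "09:00": [84, 127, 105, 94, 99, 107, 252],
    "10:00": [83, 119, 99, 89, 95, 112, 259],
    "11:00": [78, 115, 102, 86, 99, 104, 236],
    "12:00": [78, 115, 97, 88, 96, 109, 239],
    "13:00": [81, 122, 97, 98, 101, 115, 242],
    "14:00": [83, 127, 104, 92, 96, 110, 244],
    "15:00": [83, 126, 104, 88, 92, 110, 248],
}

TARGET_MINUTES = 30  # wait-time target quoted in the recommendations

# Flatten the table into (day, hour, minutes) slots.
slots = [
    (day, hour, minutes)
    for hour, row in pivot.items()
    for day, minutes in zip(days, row)
]

# The three slots with the longest average queues.
busiest = sorted(slots, key=lambda slot: slot[2], reverse=True)[:3]
over_target = [slot for slot in slots if slot[2] > TARGET_MINUTES]

print(busiest)
# [('Saturday', '10:00', 259), ('Saturday', '09:00', 252), ('Saturday', '15:00', 248)]
print(len(over_target), "of", len(slots), "slots exceed the target")
# 70 of 70 slots exceed the target
```

Every displayed slot is well above the 30-minute target, so the ranking is mainly useful for deciding where tankers should go first.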