## Cleaning our data: using the employee table


In [2]:
%load_ext sql

In [4]:
pip install cryptography


In [3]:
%sql mysql+pymysql://root:02510251@localhost:3306/md_water_services


In [3]:
%%sql
SELECT * 
FROM employee
LIMIT 10;

UsageError: Cell magic `%%sql` not found.


The employee table has information on all of our workers, but the email addresses have not been added yet. We will have to send them reports and figures, so let's fill them in. Luckily, the emails for our department follow a simple pattern: first_name.last_name@ndogowater.gov.

In [26]:
%%sql
SELECT employee_name,
       CONCAT(
             LOWER(REPLACE(employee_name, ' ', '.')),
              '@ndogowater.gov'
              ) AS generated_email
FROM employee;

employee_name,generated_email
Amara Jengo,amara.jengo@ndogowater.gov
Bello Azibo,bello.azibo@ndogowater.gov
Bakari Iniko,bakari.iniko@ndogowater.gov
Malachi Mavuso,malachi.mavuso@ndogowater.gov
Cheche Buhle,cheche.buhle@ndogowater.gov
Zuriel Matembo,zuriel.matembo@ndogowater.gov
Deka Osumare,deka.osumare@ndogowater.gov
Lalitha Kaburi,lalitha.kaburi@ndogowater.gov
Enitan Zuri,enitan.zuri@ndogowater.gov
Farai Nia,farai.nia@ndogowater.gov
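
The same email rule can be sketched outside the database. This is a minimal Python illustration of the `LOWER(REPLACE(...))` plus `CONCAT` logic, not part of the notebook's workflow; the function name `make_email` and its `domain` parameter are hypothetical.

```python
def make_email(employee_name: str, domain: str = "ndogowater.gov") -> str:
    """Build first_name.last_name@domain from a full name (hypothetical helper)."""
    # Mirror LOWER(REPLACE(employee_name, ' ', '.')) from the SQL query.
    local_part = employee_name.replace(" ", ".").lower()
    # Mirror CONCAT(..., '@ndogowater.gov').
    return f"{local_part}@{domain}"


print(make_email("Amara Jengo"))  # amara.jengo@ndogowater.gov
```

Like the SQL version, this replaces every space, so a name with three parts would produce two dots in the address.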


In [24]:
%%sql
UPDATE employee
SET email = CONCAT(
    LOWER(REPLACE(employee_name, ' ', '.')),
    '@ndogowater.gov'
);

    

In [25]:
%%sql
SELECT employee_name, email
FROM employee
LIMIT 10;

employee_name,email
Amara Jengo,amara.jengo@ndogowater.gov
Bello Azibo,bello.azibo@ndogowater.gov
Bakari Iniko,bakari.iniko@ndogowater.gov
Malachi Mavuso,malachi.mavuso@ndogowater.gov
Cheche Buhle,cheche.buhle@ndogowater.gov
Zuriel Matembo,zuriel.matembo@ndogowater.gov
Deka Osumare,deka.osumare@ndogowater.gov
Lalitha Kaburi,lalitha.kaburi@ndogowater.gov
Enitan Zuri,enitan.zuri@ndogowater.gov
Farai Nia,farai.nia@ndogowater.gov


There is something else we need to clean up. Often when databases are created and updated, or information is collected from different sources,
errors creep in. For example, look at the phone_number column, where the values are stored as strings.

In [29]:
%%sql
SELECT
LENGTH(phone_number)
FROM
employee;

LENGTH(phone_number)
13
13
13
13
13
13
13
13
13
13


The phone numbers should be 12 characters long: the plus sign, the area code (99), and the phone number digits. However, LENGTH(phone_number) returns 13, so each value carries one extra character. The culprit is a trailing space, and an automated SMS sent to such a number will fail. Stray whitespace is so common that SQL provides a dedicated function for it: TRIM(column) removes any leading or trailing spaces from a string.
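
To make the length check concrete, here is a small Python sketch of the same idea. The stored value is an assumption reconstructed from the description (plus sign, area code 99, digits, one trailing space); `strip(" ")` is used because it removes only spaces, which is closest to the default behaviour of TRIM.

```python
# Assumed stored value: 12 meaningful characters plus one trailing space.
raw_number = "+99637993287 "

# strip(" ") removes leading and trailing spaces only, like TRIM(phone_number).
cleaned = raw_number.strip(" ")

print(len(raw_number), len(cleaned))  # 13 12
```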

In [30]:
%%sql
SELECT phone_number,
       LENGTH(phone_number) AS original_length,
       TRIM(phone_number) AS trimmed_number,
       LENGTH(TRIM(phone_number)) AS trimmed_length
FROM employee
LIMIT 10;

phone_number,original_length,trimmed_number,trimmed_length
99637993287,13,99637993287,12
99643864786,13,99643864786,12
99222599041,13,99222599041,12
99945849900,13,99945849900,12
99381679640,13,99381679640,12
99034075111,13,99034075111,12
99379364631,13,99379364631,12
99681623240,13,99681623240,12
99248509202,13,99248509202,12
99570082739,13,99570082739,12


In [31]:
%%sql
UPDATE employee
SET phone_number = TRIM(phone_number);

In [32]:
%%sql
SELECT phone_number
FROM employee
WHERE LENGTH(phone_number) != 12;


phone_number


Let's have a look at where our employees live.
We grouped employees by their town of residence to understand regional staffing. This helps us identify where our workforce is concentrated and may guide future resource allocation or outreach efforts.


In [36]:
%%sql
SELECT town_name,
       COUNT(*) AS employee_count
FROM employee
GROUP BY town_name
ORDER BY employee_count DESC;

town_name,employee_count
Rural,29
Dahabu,6
Harare,5
Lusaka,4
Zanzibar,4
Ilanga,3
Serowe,3
Kintampo,1
Yaounde,1


Pres. Naledi congratulated the team for completing the survey, but we would not have this data were it not for our field workers. So let's gather
some data on their performance in this process, so we can thank those who really put all their effort in.

Pres. Naledi has asked that we send out an email or message congratulating the top 3 field surveyors.

In [38]:
%%sql
SELECT assigned_employee_id,
       COUNT(*) AS visit_count
FROM visits
GROUP BY assigned_employee_id
ORDER BY visit_count DESC
LIMIT 3;
       

assigned_employee_id,visit_count
1,3708
30,3676
34,3539


In [40]:
%%sql
SELECT e.employee_name,
       e.email,
       e.phone_number,
       v.visit_count
FROM (
    SELECT assigned_employee_id,
           COUNT(*) AS visit_count
    FROM visits
    GROUP BY assigned_employee_id
    ORDER BY visit_count DESC
    LIMIT 3
) v
JOIN employee e
  ON e.assigned_employee_id = v.assigned_employee_id
ORDER BY v.visit_count DESC;

employee_name,email,phone_number,visit_count
Bello Azibo,bello.azibo@ndogowater.gov,99643864786,3708
Pili Zola,pili.zola@ndogowater.gov,99822478933,3676
Rudo Imani,rudo.imani@ndogowater.gov,99046972648,3539


## Top 3 Field Surveyors – Honoring Excellence

We identified the top 3 field surveyors based on the number of location visits. Their names, emails, and phone numbers are listed above for recognition and outreach.

**Next Step**: Send a congratulatory message from Pres. Naledi to thank them for their outstanding contribution.


### Analysing locations


Count Records Per Town

In [43]:
%%sql
SELECT town_name,
       COUNT(*) AS location_count
FROM location
GROUP BY town_name
ORDER BY location_count DESC;

town_name,location_count
Rural,23740
Harare,1650
Amina,1090
Lusaka,1070
Mrembo,990
Asmara,930
Dahabu,930
Kintampo,780
Ilanga,780
Isiqalo,770


In [44]:
%%sql
SELECT location_type,
       COUNT(*) AS location_count
FROM location
GROUP BY location_type
ORDER BY location_count DESC;

location_type,location_count
Rural,23740
Urban,15910


### The insights we gained from the location table
1. Maji Ndogo was properly canvassed, and our dataset represents the situation on the ground.
2. About 60% of our water sources (23,740 of 39,650 locations) are in rural communities across Maji Ndogo. We need to keep this in mind when we make decisions.
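
The 60% figure follows directly from the two counts in the location_type output. A quick check in Python:

```python
rural = 23740   # Rural count from the location_type query
urban = 15910   # Urban count from the location_type query

rural_share = rural / (rural + urban) * 100
print(round(rural_share, 1))  # 59.9
```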

## Diving into the sources

In [45]:
%%sql
SELECT SUM(Number_of_people_served) AS total_people_surveyed
FROM water_source;

total_people_surveyed
27628140


In [46]:
%%sql
SELECT type_of_water_source,
       COUNT(*) AS number_of_sources
FROM water_source
GROUP BY type_of_water_source
ORDER BY number_of_sources DESC;

type_of_water_source,number_of_sources
well,17383
tap_in_home,7265
tap_in_home_broken,5856
shared_tap,5767
river,3379



We analyzed the `water_source` table to count how many wells, taps, rivers, and other sources exist across Maji Ndogo. This breakdown is essential for estimating repair costs and planning infrastructure upgrades.

**Insight**: Despite the drought, water infrastructure is widespread. However, the high number of broken home taps (5,856 `tap_in_home_broken` records) signals urgent repair needs.

Average People Served per Source Type

In [48]:
%%sql
SELECT type_of_water_source,
       ROUND(AVG(Number_of_people_served), 0) AS ave_people_per_source
FROM water_source
GROUP BY type_of_water_source
ORDER BY ave_people_per_source DESC;

type_of_water_source,ave_people_per_source
shared_tap,2071
river,699
tap_in_home_broken,649
tap_in_home,644
well,279


#### Average Population Served per Water Source

We calculated the average number of people served by each water source type. This helps us assess which sources are under the most pressure and prioritize repairs accordingly.

**Key Insight**: Shared taps serve over 2000 people on average, indicating severe strain and long queue times. Home tap records are aggregated, so actual tap counts are higher than reported: at roughly 6 people per household, a tap_in_home record serving about 644 people represents on the order of 100 taps.
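
The "~100 taps per record" estimate is simple arithmetic on the average above. A sketch, assuming one tap per household and the approximate household size of 6 quoted in the text:

```python
avg_people_per_record = 644  # tap_in_home average from the query output
household_size = 6           # approximate household size (assumption from the text)

# Assuming one tap per household, this is the number of taps behind one record.
taps_per_record = avg_people_per_record / household_size
print(round(taps_per_record))  # 107
```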



In [51]:
%%sql
SELECT type_of_water_source,
       ROUND(SUM(Number_of_people_served) / 27000000 * 100, 0) AS percentage_people_per_source
FROM water_source
GROUP BY type_of_water_source
ORDER BY percentage_people_per_source DESC;


type_of_water_source,percentage_people_per_source
shared_tap,44
well,18
tap_in_home,17
tap_in_home_broken,14
river,9


#### Water Access by Source Type – Population Impact

We calculated the total number of people served by each water source type and converted those figures into percentages for clearer interpretation.

**Key Insights**:
- **44%** of citizens rely on shared taps, which serve ~2000 people each, indicating severe strain.
- **31%** of citizens have home taps, but **45%** of those are broken, pointing to infrastructure failures.
- **18%** use wells, but only **28%** of those wells are clean (from previous audit).
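
As a cross-check, the shares can be recomputed against the exact total of 27,628,140 rather than the rounded 27,000,000 used in the query. This is a sketch only: four of the per-type totals are taken from the ranking query later in this notebook, and the `tap_in_home` total is an assumption derived as the remainder of the overall sum. With the exact denominator, individual shares can differ by about a point from the rounded figures quoted above.

```python
total_people = 27_628_140  # SUM(Number_of_people_served) over water_source

# Per-type totals as reported by the ranking query in this notebook.
people_served = {
    "shared_tap": 11_945_272,
    "well": 4_841_724,
    "tap_in_home_broken": 3_799_720,
    "river": 2_362_544,
}
# Assumption: tap_in_home accounts for whatever remains of the overall total.
people_served["tap_in_home"] = total_people - sum(people_served.values())

shares = {source: count / total_people * 100 for source, count in people_served.items()}

# Share of home-tap users whose tap is broken.
home_taps = people_served["tap_in_home"] + people_served["tap_in_home_broken"]
broken_share = people_served["tap_in_home_broken"] / home_taps * 100

for source, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {pct:.1f}%")
print(f"broken share of home taps: {broken_share:.1f}%")
```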



## Start of a solution

We are designing a repair strategy that’s intuitive for engineers and impactful for citizens. Let’s walk through the two ranking layers step by step, then explore how different window functions shape the priority list.

##### Step 1: Rank Source Types by Total Population Served (Excluding tap_in_home)

In [7]:
%%sql
SELECT type_of_water_source,
       SUM(Number_of_people_served) AS people_served,
       RANK() OVER (ORDER BY SUM(Number_of_people_served) DESC) AS rank_by_population
FROM water_source
WHERE type_of_water_source != 'tap_in_home'
GROUP BY type_of_water_source;


type_of_water_source,people_served,rank_by_population
shared_tap,11945272,1
well,4841724,2
tap_in_home_broken,3799720,3
river,2362544,4


##### Step 2: Rank Individual Sources Within Each Type (Improvable Only)

In [8]:
%%sql
SELECT source_id,
       type_of_water_source,
       Number_of_people_served,
       RANK() OVER (
           PARTITION BY type_of_water_source
           ORDER BY Number_of_people_served DESC
       ) AS priority_rank
FROM water_source
WHERE type_of_water_source IN ('shared_tap', 'well', 'river')
ORDER BY priority_rank ASC;


source_id,type_of_water_source,Number_of_people_served,priority_rank
KiRu26679224,well,398,1
KiRu25775224,well,398,1
KiRu27141224,well,398,1
KiRu25975224,well,398,1
KiRu26829224,well,398,1
KiRu25413224,well,398,1
KiRu25386224,well,398,1
KiRu27569224,well,398,1
KiHa23407224,well,398,1
HaRu18921224,well,398,1
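
The ties in this output (many wells sharing rank 1) are a property of RANK(). To show how the choice of window function changes the priority list, here is a minimal Python sketch that computes the equivalents of RANK, DENSE_RANK, and ROW_NUMBER within each source type. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def rank_within_type(rows):
    """rows: iterable of (source_id, source_type, people_served) tuples.

    Returns rows extended with (rank, dense_rank, row_number), computed per
    source type in descending order of people_served.
    """
    by_type = defaultdict(list)
    for row in rows:
        by_type[row[1]].append(row)  # PARTITION BY type_of_water_source

    ranked = []
    for group in by_type.values():
        group.sort(key=lambda r: r[2], reverse=True)  # ORDER BY ... DESC
        rank = dense = row_number = 0
        previous = None
        for row in group:
            row_number += 1            # ROW_NUMBER: always increments
            if row[2] != previous:
                rank = row_number      # RANK: jumps past tied rows
                dense += 1             # DENSE_RANK: no gaps after ties
                previous = row[2]
            ranked.append((*row, rank, dense, row_number))
    return ranked


# Hypothetical sample rows, not taken from the database.
sample = [
    ("A", "well", 398),
    ("B", "well", 398),
    ("C", "well", 350),
    ("D", "river", 900),
]
for row in rank_within_type(sample):
    print(row)
```

With RANK, tied sources share a position and the next rank skips ahead; ROW_NUMBER breaks ties arbitrarily, which gives engineers a strict work order at the cost of an arbitrary ordering among equals.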


## Analysing queues

A recap from last time:
The visits table documented all of the visits our field surveyors made to each location. For most sources, one visit was enough, but if there were
queues, they visited the location a couple of times to get a good idea of the time it took for people to queue for water. So we have the time that
they collected the data, how many times the site was visited, and how long people had to queue for water.
Ok, these are some of the things I think are worth looking at:

#### Question 1: How long did the survey take?

In [4]:
%%sql
SELECT DATEDIFF(
           MAX(time_of_record),
           MIN(time_of_record)
       ) AS survey_duration_days
FROM visits;


survey_duration_days
924
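
DATEDIFF compares only the date parts of its two arguments and returns whole days. The Python sketch below shows the same calculation; the two timestamps are hypothetical stand-ins for MIN(time_of_record) and MAX(time_of_record), chosen to be 924 days apart.

```python
from datetime import datetime

# Hypothetical first and last timestamps, not read from the visits table.
first_record = datetime(2021, 1, 1, 9, 10)
last_record = datetime(2023, 7, 14, 13, 53)

# Like DATEDIFF, drop the time of day before subtracting.
survey_duration_days = (last_record.date() - first_record.date()).days
print(survey_duration_days)  # 924
```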


#### Question 2: What Is the Average Total Queue Time?

In [5]:
%%sql
SELECT ROUND(AVG(time_in_queue), 1) AS average_queue_time_minutes
FROM visits;

average_queue_time_minutes
60.7
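
Note that this average is taken over every visit, including visits that recorded a queue time of 0. The sketch below uses hypothetical values to illustrate how much zero-queue visits can pull the average down compared with averaging only the visits where people actually queued, which is the filter applied in the hourly breakdown later on.

```python
from statistics import mean

# Hypothetical queue times in minutes; zeros are visits with no queue.
queue_minutes = [0, 0, 0, 30, 120, 210]

avg_all_visits = mean(queue_minutes)                        # like AVG(time_in_queue)
avg_queued_only = mean(q for q in queue_minutes if q != 0)  # like WHERE time_in_queue != 0

print(avg_all_visits, avg_queued_only)  # 60 120
```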


#### Question 3: Average Queue Time by Day of Week

In [8]:
%%sql
SELECT DAYNAME(time_of_record) AS day_of_week,
       ROUND(AVG(time_in_queue), 1) AS avg_queue_time
FROM visits
GROUP BY day_of_week
ORDER BY avg_queue_time DESC;


day_of_week,avg_queue_time
Saturday,246.3
Sunday,81.5
Monday,59.7
Friday,52.7
Tuesday,47.1
Thursday,46.0
Wednesday,42.5


#### Question 4: Average Queue Time by Hour

In [10]:
%%sql
SELECT TIME_FORMAT(TIME(time_of_record), '%H:00') AS hour_of_day,
       ROUND(AVG(time_in_queue), 0) AS avg_queue_time
FROM visits
GROUP BY hour_of_day
ORDER BY hour_of_day;


hour_of_day,avg_queue_time
06:00,149
07:00,149
08:00,149
09:00,49
10:00,48
11:00,46
12:00,47
13:00,47
14:00,47
15:00,48


#### Communicating the Insights

##### Queue Time Analysis – Maji Ndogo

We analyzed the `visits` table to understand water access delays across Maji Ndogo.

- **Survey Duration**: The survey spanned 924 days.
- **Average Queue Time**: Across all recorded visits, including those with no queue, the average queue time was 60.7 minutes.
- **Worst Days**: Queue times peaked on Saturday (246.3 minutes on average), followed by Sunday (81.5 minutes), pointing to heavy weekend demand.
- **Queue Time by Hour of Day**: We analyzed when citizens collect water and how long they wait. Queue times peak between 06:00 and 08:00, suggesting early morning congestion. This insight can guide scheduling of repairs or water delivery to reduce delays.


#### Question 5: Hourly Queue Time Breakdown by Day

In [13]:
%%sql
SELECT
  TIME_FORMAT(TIME(time_of_record), '%H:00') AS hour_of_day,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Sunday' THEN time_in_queue ELSE NULL END), 0) AS Sunday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Monday' THEN time_in_queue ELSE NULL END), 0) AS Monday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Tuesday' THEN time_in_queue ELSE NULL END), 0) AS Tuesday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Wednesday' THEN time_in_queue ELSE NULL END), 0) AS Wednesday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Thursday' THEN time_in_queue ELSE NULL END), 0) AS Thursday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Friday' THEN time_in_queue ELSE NULL END), 0) AS Friday,
  ROUND(AVG(CASE WHEN DAYNAME(time_of_record) = 'Saturday' THEN time_in_queue ELSE NULL END), 0) AS Saturday
FROM visits
WHERE time_in_queue != 0
GROUP BY hour_of_day
ORDER BY hour_of_day;


hour_of_day,Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday
06:00,79,190,134,112,134,153,247
07:00,82,186,128,111,139,156,247
08:00,86,183,130,119,129,153,247
09:00,84,127,105,94,99,107,252
10:00,83,119,99,89,95,112,259
11:00,78,115,102,86,99,104,236
12:00,78,115,97,88,96,109,239
13:00,81,122,97,98,101,115,242
14:00,83,127,104,92,96,110,244
15:00,83,126,104,88,92,110,248
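
The `AVG(CASE WHEN ...)` pattern above is a manual pivot: each column averages only the rows for one day, and rows for other days become NULL and are ignored. A minimal Python sketch of the same idea follows; the function name and sample visits are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Index matches datetime.weekday(), where Monday is 0.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def pivot_queue_times(visits):
    """visits: iterable of (time_of_record, time_in_queue) pairs.

    Returns {hour: {day_name: rounded average queue time}}, sorted by hour.
    """
    cells = defaultdict(list)
    for recorded_at, minutes in visits:
        if minutes == 0:  # WHERE time_in_queue != 0
            continue
        hour = recorded_at.strftime("%H:00")  # TIME_FORMAT(..., '%H:00')
        day = DAYS[recorded_at.weekday()]     # DAYNAME(time_of_record)
        cells[(hour, day)].append(minutes)

    table = defaultdict(dict)
    for (hour, day), values in cells.items():
        table[hour][day] = round(sum(values) / len(values))
    return dict(sorted(table.items()))


# Hypothetical visits: 2021-01-02 is a Saturday, 2021-01-04 a Monday.
sample_visits = [
    (datetime(2021, 1, 2, 6, 15), 240),
    (datetime(2021, 1, 2, 6, 45), 250),
    (datetime(2021, 1, 4, 6, 30), 190),
    (datetime(2021, 1, 4, 9, 5), 0),
]
print(pivot_queue_times(sample_visits))
```

Hour and day combinations with no queued visits are simply absent from the result, which plays the role of NULL in the SQL output.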


We created a pivot-style table showing average queue times for each hour of the day across all seven days. This reveals clear patterns:

- **Peak Hours**: 06:00–08:00 and 17:00–19:00
- **Peak Days**: Saturday and Monday show consistently high queue times
- **Interpretation**: Citizens collect water before and after work, with weekend spikes likely due to household chores and limited weekday access.