## 1. Get to know our data:
Let's take a good look at our data. I'll load the database and pull up the first few records from some of its tables.

In [2]:
%load_ext sql

In [3]:
%sql mysql+pymysql://root:02510251@localhost:3306/md_water_services


In [4]:
%%sql
SHOW TABLES;

Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


In [6]:
%%sql
SELECT * FROM location
LIMIT 5;

location_id,address,province_name,town_name,location_type
AkHa00000,2 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00001,10 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00002,9 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00003,139 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00004,17 Addis Ababa Road,Akatsi,Harare,Urban


In [10]:
%%sql
SELECT * FROM visits
LIMIT 5;

record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
0,SoIl32582,SoIl32582224,2021-01-01 09:10:00,1,15,12
1,KiRu28935,KiRu28935224,2021-01-01 09:17:00,1,0,46
2,HaRu19752,HaRu19752224,2021-01-01 09:36:00,1,62,40
3,AkLu01628,AkLu01628224,2021-01-01 09:53:00,1,0,1
4,AkRu03357,AkRu03357224,2021-01-01 10:11:00,1,28,14



## 2. Dive into the water sources:

In [14]:
%%sql
SELECT DISTINCT type_of_water_source
FROM water_source;

type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


1. River - People collect drinking water along a river. This is an open water source that millions of people use in Maji Ndogo. Water from a river has a high risk of being contaminated with biological and other pollutants, so it is the worst source of water possible.

2. Well - These sources draw water from underground sources, and are commonly shared by communities. Since these are closed water sources, contamination is much less likely compared to a river. Unfortunately, due to the aging infrastructure and the corruption of officials in the past, many of our wells are not clean.

3. Shared tap - This is a tap in a public area shared by communities.

4. Tap in home - These are taps that are inside the homes of our citizens. On average about 6 people live together in Maji Ndogo, so each of these taps serves about 6 people.

5. Broken tap in home - These are taps that have been installed in a citizen’s home, but the infrastructure connected to that tap is not functional. This can be due to burst pipes, broken pumps or water treatment plants that are not working.

An important note on the home taps: About 6-10 million people have running water installed in their homes in Maji Ndogo, including broken taps. If we were to document this, we would have a row of data for each home, so that one record is one tap. That means our database would contain 1 million or more rows of data, which may slow our systems down. For now, the surveyors combined the data of many households together into a single record.

For example, the first record, AkHa00000224, is for a tap_in_home that serves 956 people. What this means is that the records of about 160 homes nearby were combined into one record, with an average of 6 people living in each house: 160 × 6 = 960 ≈ 956. So 1 tap_in_home or tap_in_home_broken record actually refers to multiple households, with the sum of the people living in these homes equal to number_of_people_served.
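
The household arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 6-people-per-home average, the 956 people served, and the 6-10 million population range come from the text, while the variable names are made up for illustration.

```python
# Rough check of the household aggregation described in the text.
PEOPLE_PER_HOME = 6      # average household size quoted in the text
people_served = 956      # number_of_people_served for record AkHa00000224

# How many homes does one combined record stand for?
homes = people_served / PEOPLE_PER_HOME
print(f"{people_served} people / {PEOPLE_PER_HOME} per home = about {homes:.0f} homes")

# Rough row count if every home tap were stored as its own record.
row_estimates = {}
for population in (6_000_000, 10_000_000):
    row_estimates[population] = population // PEOPLE_PER_HOME
    print(f"{population:,} people -> about {row_estimates[population]:,} rows")
```

The division gives roughly 159 homes, which is the "about 160" used in the text, and the row estimates show why one-record-per-tap would push the table to a million rows or more.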

## 3. Unpack the visits to water sources:

In [20]:
%%sql
SELECT * 
FROM visits
WHERE time_in_queue > 500;

record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
899,SoRu35083,SoRu35083224,2021-01-16 10:14:00,6,515,28
2304,SoKo33124,SoKo33124224,2021-02-06 07:53:00,5,512,16
2315,KiRu26095,KiRu26095224,2021-02-06 14:32:00,3,529,8
3206,SoRu38776,SoRu38776224,2021-02-20 15:03:00,5,509,46
3701,HaRu19601,HaRu19601224,2021-02-27 12:53:00,3,504,0
4154,SoRu38869,SoRu38869224,2021-03-06 10:44:00,2,533,24
5483,AmRu14089,AmRu14089224,2021-03-27 18:15:00,4,509,12
9177,SoRu37635,SoRu37635224,2021-05-22 18:48:00,2,515,1
9648,SoRu36096,SoRu36096224,2021-05-29 11:24:00,2,533,3
11631,AkKi00881,AkKi00881224,2021-06-26 06:15:00,6,502,32


What type of water source takes this long to queue for? That information lives in the water_source table, which has a type_of_water_source column and a source_id column. So let's write down a few of these source_id values from our results and search for them in that table:
AkKi00881224
SoRu37635224
SoRu36096224

In [24]:
%%sql
SELECT source_id, type_of_water_source
FROM water_source
WHERE source_id IN ('AkKi00881224', 'SoRu37635224', 'SoRu36096224');


source_id,type_of_water_source
AkKi00881224,shared_tap
SoRu36096224,shared_tap
SoRu37635224,shared_tap
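
Copying source_id values by hand works for a spot check, but the same lookup can be done in one step by joining the two tables. The sketch below is an assumption-laden illustration, not the notebook's own method: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the MySQL server, and it loads only a handful of rows copied from the outputs above. The JOIN itself is plain SQL and should carry over to a `%%sql` cell.

```python
import sqlite3

# In-memory SQLite stand-in for md_water_services (assumption: MySQL is not needed
# to illustrate the JOIN). Only the columns used by the query are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (record_id INTEGER, source_id TEXT, time_in_queue INTEGER);
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT);

-- Rows copied from the query outputs shown in this notebook
INSERT INTO visits VALUES
  (0,     'SoIl32582224', 15),
  (9177,  'SoRu37635224', 515),
  (9648,  'SoRu36096224', 533),
  (11631, 'AkKi00881224', 502);
INSERT INTO water_source VALUES
  ('SoRu37635224', 'shared_tap'),
  ('SoRu36096224', 'shared_tap'),
  ('AkKi00881224', 'shared_tap');
""")

# Join each long-queue visit to its water source type in a single query.
long_queues = conn.execute("""
    SELECT v.source_id, v.time_in_queue, ws.type_of_water_source
    FROM visits AS v
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE v.time_in_queue > 500
    ORDER BY v.time_in_queue DESC
""").fetchall()

for row in long_queues:
    print(row)
```

On this small sample every visit with a queue longer than 500 minutes maps to a shared_tap, matching the manual lookup.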


## 4. Assess the quality of water sources: Flag Home Taps with Multiple Visits

In [4]:
%%sql
SELECT *
FROM water_quality
WHERE subjective_quality_score = 10
 AND visit_count > 1;

record_id,subjective_quality_score,visit_count
59,10,2
67,10,3
85,10,4
128,10,5
137,10,2
232,10,3
263,10,6
269,10,2
271,10,3
317,10,4


We identified 218 records where water sources scored a perfect 10 (indicating clean home taps) but were visited more than once. This contradicts field survey protocol, which states that home taps should not be revisited. The anomaly suggests potential misclassification or procedural errors.

In [26]:
%%sql
SELECT *
FROM well_pollution
LIMIT 5;

source_id,date,description,pollutant_ppm,biological,results
KiRu28935224,2021-01-04 09:17:00,Bacteria: Giardia Lamblia,0.0,495.898,Contaminated: Biological
AkLu01628224,2021-01-04 09:53:00,Bacteria: E. coli,0.0,6.09608,Contaminated: Biological
HaZa21742224,2021-01-04 10:37:00,"Inorganic contaminants: Zinc, Zinc, Lead, Cadmium",2.715,0.0,Contaminated: Chemical
HaRu19725224,2021-01-04 11:04:00,Clean,0.0288593,9.56996e-05,Clean
SoRu35703224,2021-01-04 11:29:00,Bacteria: E. coli,0.0,22.5009,Contaminated: Biological


In the well pollution table, the descriptions are notes taken by our scientists as text, so they will be challenging to process. The biological column is in units of CFU/mL, so it measures how much contamination is in the water. 0 is clean, and anything more than 0.01 is contaminated.

Let's check the integrity of the data. The worst case is if we have contamination, but we think we don't. People can get sick, so we need to make sure there are no errors here.
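
The rule stated above (anything over 0.01 CFU/mL counts as contaminated) can be written as a small predicate. This is a minimal sketch: the function name `is_mislabeled` and the constant name are hypothetical, and the three sample rows are copied from this notebook's `well_pollution` query outputs.

```python
BIOLOGICAL_THRESHOLD = 0.01  # CFU/mL; values above this count as contaminated

def is_mislabeled(results: str, biological: float) -> bool:
    """Return True when a record says 'Clean' but its biological reading says otherwise."""
    return results == "Clean" and biological > BIOLOGICAL_THRESHOLD

# (source_id, biological, results) taken from rows shown in this notebook
rows = [
    ("HaRu19725224", 9.56996e-05, "Clean"),                # genuinely clean
    ("AkLu01628224", 6.09608, "Contaminated: Biological"),  # correctly flagged
    ("AkRu08936224", 35.0068, "Clean"),                    # contaminated but marked Clean
]

flagged = [source_id for source_id, biological, results in rows
           if is_mislabeled(results, biological)]
print(flagged)
```

Only the record whose label and measurement disagree ends up in `flagged`, which is the dangerous case described above.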

In [27]:
%%sql
SELECT * FROM well_pollution
WHERE results = 'Clean'
    AND biological > 0.01;

source_id,date,description,pollutant_ppm,biological,results
AkRu08936224,2021-01-08 09:22:00,Bacteria: E. coli,0.0406458,35.0068,Clean
AkRu06489224,2021-01-10 09:44:00,Clean Bacteria: Giardia Lamblia,0.0897904,38.467,Clean
SoRu38011224,2021-01-14 15:35:00,Bacteria: E. coli,0.0425095,19.2897,Clean
AkKi00955224,2021-01-22 12:47:00,Bacteria: E. coli,0.0812092,40.2273,Clean
KiHa22929224,2021-02-06 13:54:00,Bacteria: E. coli,0.0722537,18.4482,Clean
KiRu25473224,2021-02-07 15:51:00,Clean Bacteria: Giardia Lamblia,0.0630094,24.4536,Clean
HaRu17401224,2021-03-01 13:44:00,Clean Bacteria: Giardia Lamblia,0.0649209,25.8129,Clean
AkRu07137224,2021-03-04 13:41:00,Clean Bacteria: Giardia Lamblia,0.0656843,18.2978,Clean
KiRu27205224,2021-03-13 14:17:00,Clean Bacteria: Giardia Lamblia,0.0418018,49.4281,Clean
AkLu02307224,2021-03-13 15:41:00,Bacteria: E. coli,0.0709682,35.203,Clean


Some wells were marked as “Clean” in the results column even though they had biological contamination > 0.01 CFU/mL.
This error happened because the description field started with “Clean”, and some data entry personnel used that instead of checking the actual contamination values.

So I want to:

1. Create a safe copy of the original table.

2. Fix the incorrect descriptions.

3. Correct the misclassified results.

4. Verify that the errors are gone.
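
Before running the four steps above against the real database, they can be rehearsed on a throwaway copy. The sketch below is one possible dry run under stated assumptions: it uses Python's built-in `sqlite3` in memory instead of MySQL, keeps only four columns, and loads three rows copied from the outputs above. The table name `well_pollution_copy` matches the plan; everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stand-in for MySQL
conn.execute(
    "CREATE TABLE well_pollution "
    "(source_id TEXT, description TEXT, biological REAL, results TEXT)"
)
conn.executemany("INSERT INTO well_pollution VALUES (?, ?, ?, ?)", [
    ("AkRu08936224", "Bacteria: E. coli", 35.0068, "Clean"),
    ("AkRu06489224", "Clean Bacteria: Giardia Lamblia", 38.467, "Clean"),
    ("HaRu19725224", "Clean", 9.56996e-05, "Clean"),
])

# 1. Create a safe copy of the original table.
conn.execute("CREATE TABLE well_pollution_copy AS SELECT * FROM well_pollution")

# 2. Fix descriptions that wrongly start with 'Clean'.
conn.execute(
    "UPDATE well_pollution_copy SET description = 'Bacteria: Giardia Lamblia' "
    "WHERE description LIKE 'Clean%Bacteria: Giardia Lamblia'"
)

# 3. Correct results that contradict the biological reading.
cursor = conn.execute(
    "UPDATE well_pollution_copy SET results = 'Contaminated: Biological' "
    "WHERE biological > 0.01 AND results LIKE 'Clean%'"
)
fixed = cursor.rowcount  # number of rows changed by step 3

# 4. Verify: no contaminated row should still be labeled 'Clean' in the copy.
check = "SELECT COUNT(*) FROM {} WHERE results = 'Clean' AND biological > 0.01"
remaining = conn.execute(check.format("well_pollution_copy")).fetchone()[0]
still_in_original = conn.execute(check.format("well_pollution")).fetchone()[0]

print(f"fixed={fixed}, remaining={remaining}, original untouched={still_in_original}")
```

Working on the copy keeps the original rows available for comparison, so the fix can be reviewed before it replaces anything.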

In [6]:
%%sql
CREATE TABLE md_water_services.well_pollution_copy AS
SELECT *
FROM md_water_services.well_pollution;

In [18]:
%%sql
UPDATE md_water_services.well_pollution_copy
SET description = 'Bacteria: E. coli'
WHERE description LIKE 'Clean%Bacteria: E. coli';

UPDATE md_water_services.well_pollution_copy
SET description = 'Bacteria: Giardia Lamblia'
WHERE description LIKE 'Clean%Bacteria: Giardia Lamblia';


In [19]:
%%sql
UPDATE md_water_services.well_pollution_copy
SET results = 'Contaminated: Biological'
WHERE biological > 0.01 AND results LIKE 'Clean%';



In [21]:
%%sql
SELECT *
FROM md_water_services.well_pollution_copy
WHERE results = 'Clean' AND biological > 0.01;


source_id,date,description,pollutant_ppm,biological,results


### Data Integrity Fix: Well Pollution Records

We successfully corrected 4916 records in the `well_pollution_copy` table. Descriptions that falsely began with “Clean” were updated, and results were corrected based on biological contamination levels. A final integrity check confirmed that no contaminated wells remain mislabeled as “Clean”.

**Impact**: This fix improves the reliability of our water safety data and supports accurate reporting to stakeholders. The workflow is reproducible and audit-ready.
