# Beginning Our Data-driven Journey in Maji Ndogo

## Introduction

In this first part of the integrated project, we dive into Maji Ndogo's expansive database containing 60,000 records spread across various tables. As we navigate this trove of data, we'll use basic queries to familiarise ourselves with the content of each table. Along the way, we'll also refine some data using **Data Manipulation Language (DML).**

## Connecting to our MySQL database

We'll start by connecting to the MySQL server that hosts our database.

In [1]:
# Load and activate the SQL extension 

%load_ext sql

In [2]:
# Establish a connection to the local database 

%sql mysql+pymysql://root:1234567@localhost:3306/md_water_services

'Connected: root@md_water_services'

## Get to Know the Data

We'll then familiarise ourselves with the data by reviewing the first few records of each table to get a high-level overview of what it looks like. First, let's see all the tables in Maji Ndogo's database.

In [3]:
%sql SHOW TABLES

 * mysql+pymysql://root:***@localhost:3306/md_water_services
8 rows affected.


Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


We can see a total of 8 tables, all labelled clearly enough that we can work out what each one is about without much effort.

Let us see what each of these tables contains, starting with the `data_dictionary` table.

In [12]:
%sql SELECT * FROM data_dictionary LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


table_name,column_name,description,datatype,related_to
employee,assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee,employee_name,Name of the employee,VARCHAR(255),
employee,phone_number,Contact number of the employee,VARCHAR(15),
employee,email,Email address of the employee,VARCHAR(255),
employee,address,Residential address of the employee,VARCHAR(255),
employee,town_name,Name of the town where the employee resides,VARCHAR(255),
employee,province_name,Name of the province where the employee resides,VARCHAR(255),
employee,position,Position or job title of the employee,VARCHAR(255),
visits,record_id,Unique ID assigned to each visit,int,"water_quality, water_source"
visits,location_id,ID of the location visited,varchar(255),location


We notice that the data dictionary describes each column of every table in the database. The output above also tells us that the `employee` table has 8 columns, and that `assigned_employee_id`, which appears to be its primary key, is related to the `visits` table. We can retrieve the tables that are related to each other by running the query below.

In [5]:
%%sql
SELECT DISTINCT table_name, related_to
FROM data_dictionary
WHERE TRIM(related_to) != '';

 * mysql+pymysql://root:***@localhost:3306/md_water_services
8 rows affected.


table_name,related_to
employee,visits
visits,"water_quality, water_source"
visits,location
visits,well_pollution
visits,employee
water_source,visits
well_pollution,visits
location,visits


From the `data_dictionary` table, we observe that only 6 of the 8 tables are interrelated. With this table serving as our guide and the `md_water_services` database as our environment, we now know how to navigate our data landscape. To view the first few rows of each table, just as we did for the `data_dictionary` table, you can run the same query repeatedly, changing the table name after the `FROM` clause each time. This will display the first 10 records along with the attributes for each table/entity.
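
The repeated preview described above can also be scripted. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module with an in-memory database and two tiny made-up tables as a stand-in for `md_water_services`, and the helper name `preview_tables` is hypothetical rather than part of the project.

```python
import sqlite3

# Toy stand-in for md_water_services: an in-memory SQLite database
# with two tiny, made-up tables (NOT the real survey data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (assigned_employee_id INTEGER, employee_name TEXT);
    CREATE TABLE location (location_id TEXT, province_name TEXT);
    INSERT INTO employee VALUES (0, 'Example Person A'), (1, 'Example Person B');
    INSERT INTO location VALUES ('XxXx00000', 'Province A');
""")

def preview_tables(conn, table_names, n=10):
    """Return {table_name: (column_names, first n rows)} for each table."""
    previews = {}
    for name in table_names:
        # Table names cannot be bound as query parameters,
        # so only pass names you trust (e.g. from SHOW TABLES).
        cursor = conn.execute(f'SELECT * FROM "{name}" LIMIT ?', (n,))
        columns = [col[0] for col in cursor.description]
        previews[name] = (columns, cursor.fetchall())
    return previews

previews = preview_tables(conn, ["employee", "location"])
for name, (columns, rows) in previews.items():
    print(name, columns, rows)
```

Against the real database, the same loop would run over the eight names returned by `SHOW TABLES`, using a MySQL connection instead of SQLite.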

## Diving Into Water Sources

Now that we are familiar with the structure of the tables we can explore further. A great place to begin is understanding the types of water sources we are dealing with and examining the sources that were recorded in the database. To get that information, we can take a closer look at the `water_source` table.

In [6]:
%%sql 
SELECT DISTINCT type_of_water_source
FROM water_source;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


We can see that we have 5 unique types of water sources recorded in our database. Understanding the significance of each type is essential for creating effective, data-driven decision-making reports.

## Visits to Water Sources

We have a table in our database that logs the visits made to different water sources. Let's retrieve the first 10 records from this table where `time_in_queue` is unusually long, say more than 500 minutes.

In [13]:
%%sql
SELECT *
FROM visits
WHERE time_in_queue > 500
ORDER BY record_id ASC
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
899,SoRu35083,SoRu35083224,2021-01-16 10:14:00,6,515,28
2304,SoKo33124,SoKo33124224,2021-02-06 07:53:00,5,512,16
2315,KiRu26095,KiRu26095224,2021-02-06 14:32:00,3,529,8
3206,SoRu38776,SoRu38776224,2021-02-20 15:03:00,5,509,46
3701,HaRu19601,HaRu19601224,2021-02-27 12:53:00,3,504,0
4154,SoRu38869,SoRu38869224,2021-03-06 10:44:00,2,533,24
5483,AmRu14089,AmRu14089224,2021-03-27 18:15:00,4,509,12
9177,SoRu37635,SoRu37635224,2021-05-22 18:48:00,2,515,1
9648,SoRu36096,SoRu36096224,2021-05-29 11:24:00,2,533,3
11631,AkKi00881,AkKi00881224,2021-06-26 06:15:00,6,502,32


In [8]:
%%sql
SELECT 
    source_id,
    type_of_water_source,
    number_of_people_served
FROM water_source
WHERE source_id IN ("AkKi00881224", "HaRu19601224", "SoRu36096224", "SoRu37635224", "SoRu38776224");

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


source_id,type_of_water_source,number_of_people_served
AkKi00881224,shared_tap,3398
HaRu19601224,shared_tap,3322
SoRu36096224,shared_tap,3786
SoRu37635224,shared_tap,3920
SoRu38776224,shared_tap,3180


These `shared_tap` water sources each serve over 3,000 people. It's important to note that the field surveyors visited sources that had queues several times to see whether the queue time changed.
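
Instead of copying `source_id` values by hand into an `IN (...)` list, the two lookups could be combined with a join on `source_id`. The following is a minimal sketch, assuming `source_id` links `visits` to `water_source` as the data dictionary suggests. It runs on an in-memory SQLite database with made-up rows (the IDs `S1` to `S3` and the numbers are invented), not on the real MySQL data.

```python
import sqlite3

# In-memory stand-in with invented rows; column names follow the notebook output.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (record_id INTEGER, source_id TEXT, time_in_queue INTEGER);
    CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT,
                               number_of_people_served INTEGER);
    INSERT INTO visits VALUES (1, 'S1', 515), (2, 'S2', 12), (3, 'S3', 502);
    INSERT INTO water_source VALUES ('S1', 'shared_tap', 3000),
                                    ('S2', 'well', 400),
                                    ('S3', 'shared_tap', 3500);
""")

# Join each long-queue visit to its source so the type and the
# number of people served appear next to the queue time.
long_queues = conn.execute("""
    SELECT v.record_id, v.time_in_queue,
           ws.type_of_water_source, ws.number_of_people_served
    FROM visits AS v
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE v.time_in_queue > ?
    ORDER BY v.record_id
""", (500,)).fetchall()

print(long_queues)
# [(1, 515, 'shared_tap', 3000), (3, 502, 'shared_tap', 3500)]
```

The `SELECT ... JOIN ... WHERE` statement itself is standard SQL, so it should carry over to a `%%sql` cell with little or no change.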

## Assessing Water Sources Quality

The primary focus of this survey is the quality of our water. We have a `water_quality` table that records quality scores assigned to each visit made to a water source by a field surveyor. The scoring system ranges from 1, indicating poor quality, to 10, signifying a good, clean water source for homes. Shared taps typically receive lower scores, and the rating also considers factors such as queue times. We will now construct a query to identify records where the `subjective_quality_score` is 10 (a score we would expect only for home taps) and where the source was visited a second time.

In [14]:
%%sql
# Retrieve All Records with Good Water Quality That Were Visited Twice
SELECT *
FROM water_quality
WHERE subjective_quality_score = 10
AND visit_count = 2
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


record_id,subjective_quality_score,visit_count
59,10,2
137,10,2
269,10,2
363,10,2
378,10,2
618,10,2
752,10,2
801,10,2
819,10,2
850,10,2


Since the surveyor only revisited shared taps and not other types of water sources, there should not be any records with a visit count of 2 or more for water sources rated as good, indicated by a `subjective_quality_score` of 10. So, why does this discrepancy exist?
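
One way to start investigating is to check which source types the suspect records actually belong to. The sketch below assumes that `water_quality` joins to `visits` on `record_id` and that `visits` joins to `water_source` on `source_id`, which is how the data dictionary relationships read. The rows are invented for illustration and the result says nothing about the real data.

```python
import sqlite3

# In-memory stand-in with invented rows (NOT the survey data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE water_quality (record_id INTEGER,
                                subjective_quality_score INTEGER,
                                visit_count INTEGER);
    CREATE TABLE visits (record_id INTEGER, source_id TEXT);
    CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT);
    INSERT INTO water_quality VALUES (1, 10, 2), (2, 10, 1), (3, 4, 2), (4, 10, 2);
    INSERT INTO visits VALUES (1, 'S1'), (2, 'S2'), (3, 'S3'), (4, 'S4');
    INSERT INTO water_source VALUES ('S1', 'tap_in_home'), ('S2', 'tap_in_home'),
                                    ('S3', 'shared_tap'), ('S4', 'shared_tap');
""")

# Count the suspect records (score 10, second visit) per source type.
suspect_by_type = conn.execute("""
    SELECT ws.type_of_water_source, COUNT(*) AS suspect_records
    FROM water_quality AS wq
    JOIN visits AS v ON v.record_id = wq.record_id
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE wq.subjective_quality_score = 10
      AND wq.visit_count = 2
    GROUP BY ws.type_of_water_source
    ORDER BY ws.type_of_water_source
""").fetchall()

print(suspect_by_type)
# [('shared_tap', 1), ('tap_in_home', 1)]
```

If the real data showed home taps with a `visit_count` of 2, or shared taps with a score of 10, that would narrow down where the recording error crept in.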

## Investigating Pollution Issues

There's a table that contains data on pollution and contamination levels in the wells of Maji Ndogo. Let's have a quick look at that table.

In [15]:
%%sql
SELECT * FROM well_pollution
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


source_id,date,description,pollutant_ppm,biological,results
KiRu28935224,2021-01-04 09:17:00,Bacteria: Giardia Lamblia,0.0,495.898,Contaminated: Biological
AkLu01628224,2021-01-04 09:53:00,Bacteria: E. coli,0.0,6.09608,Contaminated: Biological
HaZa21742224,2021-01-04 10:37:00,"Inorganic contaminants: Zinc, Zinc, Lead, Cadmium",2.715,0.0,Contaminated: Chemical
HaRu19725224,2021-01-04 11:04:00,Clean,0.0288593,9.56996e-05,Clean
SoRu35703224,2021-01-04 11:29:00,Bacteria: E. coli,0.0,22.5009,Contaminated: Biological
AkHa00070224,2021-01-04 11:42:00,Inorganic contaminants: Cadmium,5.46739,0.0,Contaminated: Chemical
HaSe21346224,2021-01-04 11:52:00,Clean,0.0140376,8.98989e-05,Clean
HaYa21468224,2021-01-04 12:03:00,"Inorganic contaminants: Chromium, Barium, Chromium, Lead",6.05137,0.0,Contaminated: Chemical
SoRu36278224,2021-01-04 12:24:00,Parasite: Cryptosporidium,0.0,485.162,Contaminated: Biological
AkLu02155224,2021-01-04 12:29:00,"Inorganic contaminants: Selenium, Arsenic",7.64106,0.0,Contaminated: Chemical


It looks like our scientists diligently recorded the water quality of all the wells. Some are contaminated with biological contaminants, while others are polluted with an excess of heavy metals and other pollutants. Based on the results, each well was classified as `Clean`, `Contaminated: Biological` or `Contaminated: Chemical`. This classification is crucial because wells that are polluted with biological or other contaminants are not safe to drink from. Each of the records has a `source_id`, which can help us link the `results` to the sources in other tables in the database.

The `description` column in the `well_pollution` table consists of free-text notes recorded by our scientists, so it will be challenging to process. The `biological` column is measured in units of `CFU/mL`, indicating the level of contamination in the water. A reading of **0** signifies clean water, while any value above **0.01** indicates contamination.

Let's check the integrity of the data. The worst case is if we have contamination, but we think we don't. People can get sick, so we
need to make sure there are no errors here.

In [11]:
%%sql
SELECT *
FROM well_pollution
WHERE results = "Clean"
AND biological > 0.01;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
64 rows affected.


source_id,date,description,pollutant_ppm,biological,results
AkRu08936224,2021-01-08 09:22:00,Bacteria: E. coli,0.0406458,35.0068,Clean
AkRu06489224,2021-01-10 09:44:00,Clean Bacteria: Giardia Lamblia,0.0897904,38.467,Clean
SoRu38011224,2021-01-14 15:35:00,Bacteria: E. coli,0.0425095,19.2897,Clean
AkKi00955224,2021-01-22 12:47:00,Bacteria: E. coli,0.0812092,40.2273,Clean
KiHa22929224,2021-02-06 13:54:00,Bacteria: E. coli,0.0722537,18.4482,Clean
KiRu25473224,2021-02-07 15:51:00,Clean Bacteria: Giardia Lamblia,0.0630094,24.4536,Clean
HaRu17401224,2021-03-01 13:44:00,Clean Bacteria: Giardia Lamblia,0.0649209,25.8129,Clean
AkRu07137224,2021-03-04 13:41:00,Clean Bacteria: Giardia Lamblia,0.0656843,18.2978,Clean
KiRu27205224,2021-03-13 14:17:00,Clean Bacteria: Giardia Lamblia,0.0418018,49.4281,Clean
AkLu02307224,2021-03-13 15:41:00,Bacteria: E. coli,0.0709682,35.203,Clean


It seems that, in some cases, when the `description` field starts with the word “Clean”, the record has been classified as “Clean” in the `results` column even though the `biological` value exceeds 0.01. This could indicate that certain records are being classified based on the description text rather than the measured values. Let’s take a closer look at the root cause of this issue in the biological contamination data.

As per the project specifications, the `description` column should only have the word “Clean” if there is **no biological contamination (and no chemical pollutants)**. This means we need to find and remove the “Clean” part from all the descriptions that do have a biological contamination so this mistake is not made again.

A second, more serious issue has come from this error. Some of the wells have been marked as "Clean" in the `results` column because the `description` had the word “Clean” in it, even though they have a biological contamination. So we need to find all the results that have a value **greater than 0.01** in the `biological` column and have been set to "Clean" in the `results` column.

First, let's look at the descriptions. We need to identify the records that mistakenly have the word "Clean" in the `description`. However, it is important to remember that not all of our field surveyors used the description to set the results – some checked the actual data.
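
The filter used for this step is `LIKE "Clean_%"`. In a `LIKE` pattern, `_` matches exactly one character and `%` matches any run of characters (including none), so the pattern only matches descriptions that have at least one more character after "Clean". A description that is exactly `Clean` is left alone. The sketch below shows this on a few made-up rows in an in-memory SQLite database. SQLite is only a stand-in here, although these two wildcards behave the same way in MySQL.

```python
import sqlite3

# In-memory stand-in containing only a description column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE well_pollution (description TEXT)")
conn.executemany(
    "INSERT INTO well_pollution VALUES (?)",
    [
        ("Clean",),                            # legitimate: no contamination
        ("Clean Bacteria: E. coli",),          # erroneous prefix
        ("Bacteria: E. coli",),                # correct description
        ("Clean Bacteria: Giardia Lamblia",),  # erroneous prefix
    ],
)

# '_' requires one character after "Clean"; '%' allows anything after that.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT description FROM well_pollution "
        "WHERE description LIKE 'Clean_%' "
        "ORDER BY description"
    )
]

print(matches)
# ['Clean Bacteria: E. coli', 'Clean Bacteria: Giardia Lamblia']
```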

In [12]:
%%sql
# Retrieve All Records with Erroneous Descriptions
SELECT *
FROM well_pollution
WHERE description LIKE "Clean_%"

 * mysql+pymysql://root:***@localhost:3306/md_water_services
38 rows affected.


source_id,date,description,pollutant_ppm,biological,results
AkRu06489224,2021-01-10 09:44:00,Clean Bacteria: Giardia Lamblia,0.0897904,38.467,Clean
KiRu25473224,2021-02-07 15:51:00,Clean Bacteria: Giardia Lamblia,0.0630094,24.4536,Clean
HaRu17401224,2021-03-01 13:44:00,Clean Bacteria: Giardia Lamblia,0.0649209,25.8129,Clean
AkRu07137224,2021-03-04 13:41:00,Clean Bacteria: Giardia Lamblia,0.0656843,18.2978,Clean
KiRu27205224,2021-03-13 14:17:00,Clean Bacteria: Giardia Lamblia,0.0418018,49.4281,Clean
AkHa00514224,2021-04-11 12:11:00,Clean Bacteria: Giardia Lamblia,0.0305404,22.0255,Clean
AmAm09776224,2021-05-23 11:28:00,Clean Bacteria: Giardia Lamblia,0.0963821,13.6574,Clean
SoIl32894224,2021-07-11 11:37:00,Clean Bacteria: Giardia Lamblia,0.0712408,5.44957,Clean
AkRu07366224,2021-07-23 11:19:00,Clean Bacteria: Giardia Lamblia,0.0969458,26.0308,Clean
KiHa23443224,2021-09-05 12:34:00,Clean Bacteria: Giardia Lamblia,0.0828,13.7162,Clean


The query returned 38 wrong descriptions. Now we need to fix these descriptions so that we don’t encounter this issue again in the future.
Looking at the results we can see two different descriptions that we need to fix:

1. All records that mistakenly have `Clean Bacteria: E. coli` should be updated to `Bacteria: E. coli`
2. All records that mistakenly have `Clean Bacteria: Giardia Lamblia` should be updated to `Bacteria: Giardia Lamblia`

The second issue we need to fix is in our `results` column. We need to update the `results` column from "Clean" to "Contaminated: Biological" where the `biological` column has a value **greater than 0.01**.

> **NOTE**: The query below ↓ should only be run once, as the changes it makes are stored permanently in the database. Keep that in mind when working in the notebook environment, in case you restart the kernel and run all cells.
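
Because the changes are permanent, it can help to rehearse them where a mistake is cheap. The sketch below is one possible approach, not the method used in this notebook. It applies the three updates inside a single transaction on an in-memory SQLite copy with invented rows, checks that no inconsistent rows remain, and rolls back if the check fails. The helper `apply_fixes` and the `FIXES` list are hypothetical names, and how transactions map onto the `%sql` setup is not covered here.

```python
import sqlite3

# In-memory stand-in with invented rows (NOT the real well data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE well_pollution (description TEXT, biological REAL, results TEXT)"
)
conn.executemany(
    "INSERT INTO well_pollution VALUES (?, ?, ?)",
    [
        ("Clean Bacteria: E. coli", 35.0, "Clean"),
        ("Clean Bacteria: Giardia Lamblia", 38.4, "Clean"),
        ("Bacteria: E. coli", 19.2, "Clean"),
        ("Clean", 0.00009, "Clean"),
    ],
)
conn.commit()

# The same three fixes as in the notebook, written with single-quoted strings.
FIXES = [
    "UPDATE well_pollution SET description = 'Bacteria: E. coli' "
    "WHERE description = 'Clean Bacteria: E. coli'",
    "UPDATE well_pollution SET description = 'Bacteria: Giardia Lamblia' "
    "WHERE description = 'Clean Bacteria: Giardia Lamblia'",
    "UPDATE well_pollution SET results = 'Contaminated: Biological' "
    "WHERE biological > 0.01 AND results = 'Clean'",
]

def apply_fixes(conn):
    """Run all fixes in one transaction; roll back if bad rows remain."""
    counts = []
    try:
        for statement in FIXES:
            counts.append(conn.execute(statement).rowcount)
        remaining = conn.execute(
            "SELECT COUNT(*) FROM well_pollution "
            "WHERE description LIKE 'Clean_%' "
            "OR (results = 'Clean' AND biological > 0.01)"
        ).fetchone()[0]
        if remaining:
            raise RuntimeError(f"{remaining} inconsistent rows remain")
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return counts

first_run = apply_fixes(conn)
second_run = apply_fixes(conn)
print(first_run, second_run)
# [1, 1, 3] [0, 0, 0]
```

On this toy data a second run changes nothing, because each `WHERE` clause only matches rows that are still wrong. The genuinely clean row, whose description is exactly `Clean`, is never touched.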

In [13]:
''' %%sql
# Update All Erroneous Values in Descriptions
UPDATE
    well_pollution
SET
    description = "Bacteria: E. coli"
WHERE
    description = "Clean Bacteria: E. coli";
    
UPDATE
    well_pollution
SET 
    description = "Bacteria: Giardia Lamblia"
WHERE 
    description = "Clean Bacteria: Giardia Lamblia";
    
UPDATE 
    well_pollution
SET 
    results = "Contaminated: Biological"
WHERE 
    biological > 0.01 AND results = "Clean";'''

 * mysql+pymysql://root:***@localhost:3306/md_water_services
26 rows affected.
12 rows affected.
64 rows affected.


[]

We can then check if our errors are fixed by running the query below.

In [14]:
%%sql
# Retrieve All Records with Erroneous Descriptions
SELECT *
FROM well_pollution
WHERE description LIKE "Clean_%"

 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


source_id,date,description,pollutant_ppm,biological,results
